*Detection and correction of mislabeled samples based on graph structure*

**DOI:**10.54647/isss12083 77 Downloads 2897 Views

**Author(s)**

**Abstract**

Machine learning trains and obtains learning models based on a large amount of training samples. Mislabeled training samples will affect the generalization/performance of the final predictive model. Some methods of detecting/correcting mislabeled samples such as graph-based methods, are proposed and used in machine learning to improve predictive models' generalization. However, these methods do not perform well for high-dimensional samples. In this paper, we present three algorithms for detecting/correcting mislabeled samples in high-dimensional feature space. First, we propose an improved high-dimensional detection algorithm: PCA-k-RNG. Next, we introduce a notion of ∈-scalar relative neighbourhood graph (∈-SRNG) and study its relationship with relative neighbourhood graph (RNG) and k-relative neighbourhood graph (k-RNG). Then, we give an alternative high-dimensional detection algorithm: PCA-∈-SRNG. After detecting mislabeled training samples, it is necessary to correct these mislabeled samples. Then we further propose a scalar-adapted correction algorithm: Fat location correction/deletion. Finally, we explore and validate our algorithms based on real datasets with high-dimensional features.

**Keywords**

high dimension, inaccurate supervision learning, mislabeled samples, relative neighbourhood graph (RNG), detection, correction.

**Cite this paper**

Junyan Li, Xinxing Wu,
Detection and correction of mislabeled samples based on graph structure
, *SCIREA Journal of Information Science and Systems Science*.
Volume 5, Issue 2, April 2021 | PP. 12-33.
10.54647/isss12083

**References**

[ 1 ] | Z.-H. Zhou, “A brief introduction to weakly supervised learning,” National Science Review, vol. 5, no. 1, pp. 44– 53, 2018. |

[ 2 ] | B.Fre ́nayandM.Verleysen,“Classificationinthepres- ence of label noise: a survey,” IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 5, pp. 845–869, 2014. |

[ 3 ] | M. Poel, “Detecting mislabeled data using supervised machine learning techniques,” in Proceedings of the 11th International Conference on Augmented Cognition (AC), Vancouver, Canada, Jul. 2017, pp. 571–581. |

[ 4 ] | C. E. Brodley and M. A. Friedl, “Identifying mislabeled training data,” Journal of Artificial Intelligence Research, vol. 11, no. 1, pp. 131–167, 1999. |

[ 5 ] | J. Sun, F. Zhao, C. Wang, and S. Chen, “Identifying and correcting mislabeled training instances,” in Proceedings of Future Generation Communication and Networking (FGCN), Jeju, South Korea, Dec. 2007, pp. 244–250. |

[ 6 ] | S. Lallich, F. Muhlenbach, and D. A. Zighed, “Improv- ing classification by removing or relabeling mislabeled instances,” in Proceedings of the 13th International Sym- posium on Foundations of Intelligent Systems (ISMIS), Lyon, France, Jun. 2002, pp. 5–15. |

[ 7 ] | D. A. Zighed, S. Lallich, and F. Muhlenbach, “Separabil- ity index in supervised learning,” in Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), Helsinki, Finland, Aug. 2002, pp. 475–487. |

[ 8 ] | F. Muhlenbach, S. Lallich, and D. A. Zighed, “Iden- tifying and handling mislabelled instances,” Journal of Intelligent Information Systems, vol. 22, no. 1, pp. 89– 109, 2004. |

[ 9 ] | D. W. Aha, D. Kibler, and M. K. Albert, “Instance-based learning algorithms,” Machine Learning, vol. 6, no. 1, pp. 37–66, 1991. |

[ 10 ] | R. Ekambaram, S. Fefilatyev, M. Shreve, K. Kramer, L. O. Hall, D. B. Goldgof, and R. Kasturi, “Active cleaning of label noise,” Pattern Recognition, vol. 51, no. 3, pp. 463–480, 2016. |

[ 11 ] | J. Cao, S. Kwong, and R. Wang, “A noise-detection based adaboost algorithm for mislabeled data,” Pattern Recognition, vol. 45, no. 12, pp. 4451–4465, 2012. |

[ 12 ] | M. S. Chang, C. Y. Tang, and R. C. T. Lee, “20-relative neighborhood graphs are hamiltonian,” Journal of Graph Theory, vol. 15, no. 5, pp. 543–557, 1991. |

[ 13 ] | ——, “Solving the euclidean bottleneck matching prob- lem by k-relative neighborhood graphs,” Algorithmica, vol. 8, no. 1-6, pp. 177–194, 1992. |

[ 14 ] | J. Ranstam, “Why the p-value culture is bad and confidence intervals a better alternative,” Osteoarthritis and Cartilage, vol. 20, no. 8, pp. 805–808, 2012. |

[ 15 ] | R. L. Wasserstein and N. A. Lazar, “The ASA’s state- ment on p-values: Context, process, and purpose,” The American Statistician, vol. 70, no. 2, pp. 129–133, 2016. |

[ 16 ] | M. Kleindessner and U. V. Luxburg, “Lens depth func- tion and k-relative neighborhood graph: Versatile tools for ordinal data analysis,” Journal of Machine Learning Research, vol. 18, no. 1, pp. 1–52, 2017. |

[ 17 ] | G. T. Toussaint, “Applications of the relative neigh- bourhood graph,” International Journal of Advances in Computer Science and Its Applications, vol. 4, no. 3, pp. 77–85, 2014. |

[ 18 ] | ——, “The relative neighborhood graph of a finite planar set,” Pattern Recognition, vol. 12, no. 4, pp. 261–268, 1980. |

[ 19 ] | Z. Liu and R. Modarres, “Lens data depth and median,” Journal of Nonparametric Statistics, vol. 23, no. 4, pp. 1063–1074, 2011. |

[ 20 ] | R. V. Hogg, E. Tanis, and D. Zimmerman, Probability and Statistical Inference, 9th ed. Pearson Education, Inc., 2014. |

[ 21 ] | D. Dua and E. K. Taniskidou. “UCI Machine Learning Repository”. Retrieved 5 Jan. 2019. [Online]. Available: http://archive.ics.uci.edu/ml |

[ 22 ] | Y. LeCun and C. Cortes and C. J. C. Burges. “The MNIST database of handwritten digits”. Retrieved 5 Jan. 2019. [Online]. Available: http://yann.lecun.com/ exdb/mnist/ |