Detection and correction of mislabeled samples based on graph structure

Abstract
Machine learning trains and obtains learning models based on a large amount of training samples. Mislabeled training samples will affect the generalization/performance of the final predictive model. Some methods of detecting/correcting mislabeled samples such as graph-based methods, are proposed and used in machine learning to improve predictive models' generalization. However, these methods do not perform well for high-dimensional samples. In this paper, we present three algorithms for detecting/correcting mislabeled samples in high-dimensional feature space. First, we propose an improved high-dimensional detection algorithm: PCA-k-RNG. Next, we introduce a notion of ∈-scalar relative neighbourhood graph (∈-SRNG) and study its relationship with relative neighbourhood graph (RNG) and k-relative neighbourhood graph (k-RNG). Then, we give an alternative high-dimensional detection algorithm: PCA-∈-SRNG. After detecting mislabeled training samples, it is necessary to correct these mislabeled samples. Then we further propose a scalar-adapted correction algorithm: Fat location correction/deletion. Finally, we explore and validate our algorithms based on real datasets with high-dimensional features.

Keywords
high dimension, inaccurate supervision learning, mislabeled samples, relative neighbourhood graph (RNG), detection, correction.

Cite this paper
Junyan Li, Xinxing Wu, Detection and correction of mislabeled samples based on graph structure , SCIREA Journal of Information Science and Systems Science. Volume 5, Issue 2, April 2021 | PP. 12-33. 10.54647/isss12083

References

[ 1 ]	Z.-H. Zhou, “A brief introduction to weakly supervised learning,” National Science Review, vol. 5, no. 1, pp. 44– 53, 2018.
[ 2 ]	B.Fre ́nayandM.Verleysen,“Classificationinthepres- ence of label noise: a survey,” IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 5, pp. 845–869, 2014.
[ 3 ]	M. Poel, “Detecting mislabeled data using supervised machine learning techniques,” in Proceedings of the 11th International Conference on Augmented Cognition (AC), Vancouver, Canada, Jul. 2017, pp. 571–581.
[ 4 ]	C. E. Brodley and M. A. Friedl, “Identifying mislabeled training data,” Journal of Artificial Intelligence Research, vol. 11, no. 1, pp. 131–167, 1999.
[ 5 ]	J. Sun, F. Zhao, C. Wang, and S. Chen, “Identifying and correcting mislabeled training instances,” in Proceedings of Future Generation Communication and Networking (FGCN), Jeju, South Korea, Dec. 2007, pp. 244–250.
[ 6 ]	S. Lallich, F. Muhlenbach, and D. A. Zighed, “Improv- ing classification by removing or relabeling mislabeled instances,” in Proceedings of the 13th International Sym- posium on Foundations of Intelligent Systems (ISMIS), Lyon, France, Jun. 2002, pp. 5–15.
[ 7 ]	D. A. Zighed, S. Lallich, and F. Muhlenbach, “Separabil- ity index in supervised learning,” in Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), Helsinki, Finland, Aug. 2002, pp. 475–487.
[ 8 ]	F. Muhlenbach, S. Lallich, and D. A. Zighed, “Iden- tifying and handling mislabelled instances,” Journal of Intelligent Information Systems, vol. 22, no. 1, pp. 89– 109, 2004.
[ 9 ]	D. W. Aha, D. Kibler, and M. K. Albert, “Instance-based learning algorithms,” Machine Learning, vol. 6, no. 1, pp. 37–66, 1991.
[ 10 ]	R. Ekambaram, S. Fefilatyev, M. Shreve, K. Kramer, L. O. Hall, D. B. Goldgof, and R. Kasturi, “Active cleaning of label noise,” Pattern Recognition, vol. 51, no. 3, pp. 463–480, 2016.
[ 11 ]	J. Cao, S. Kwong, and R. Wang, “A noise-detection based adaboost algorithm for mislabeled data,” Pattern Recognition, vol. 45, no. 12, pp. 4451–4465, 2012.
[ 12 ]	M. S. Chang, C. Y. Tang, and R. C. T. Lee, “20-relative neighborhood graphs are hamiltonian,” Journal of Graph Theory, vol. 15, no. 5, pp. 543–557, 1991.
[ 13 ]	——, “Solving the euclidean bottleneck matching prob- lem by k-relative neighborhood graphs,” Algorithmica, vol. 8, no. 1-6, pp. 177–194, 1992.
[ 14 ]	J. Ranstam, “Why the p-value culture is bad and confidence intervals a better alternative,” Osteoarthritis and Cartilage, vol. 20, no. 8, pp. 805–808, 2012.
[ 15 ]	R. L. Wasserstein and N. A. Lazar, “The ASA’s state- ment on p-values: Context, process, and purpose,” The American Statistician, vol. 70, no. 2, pp. 129–133, 2016.
[ 16 ]	M. Kleindessner and U. V. Luxburg, “Lens depth func- tion and k-relative neighborhood graph: Versatile tools for ordinal data analysis,” Journal of Machine Learning Research, vol. 18, no. 1, pp. 1–52, 2017.
[ 17 ]	G. T. Toussaint, “Applications of the relative neigh- bourhood graph,” International Journal of Advances in Computer Science and Its Applications, vol. 4, no. 3, pp. 77–85, 2014.
[ 18 ]	——, “The relative neighborhood graph of a finite planar set,” Pattern Recognition, vol. 12, no. 4, pp. 261–268, 1980.
[ 19 ]	Z. Liu and R. Modarres, “Lens data depth and median,” Journal of Nonparametric Statistics, vol. 23, no. 4, pp. 1063–1074, 2011.
[ 20 ]	R. V. Hogg, E. Tanis, and D. Zimmerman, Probability and Statistical Inference, 9th ed. Pearson Education, Inc., 2014.
[ 21 ]	D. Dua and E. K. Taniskidou. “UCI Machine Learning Repository”. Retrieved 5 Jan. 2019. [Online]. Available: http://archive.ics.uci.edu/ml
[ 22 ]	Y. LeCun and C. Cortes and C. J. C. Burges. “The MNIST database of handwritten digits”. Retrieved 5 Jan. 2019. [Online]. Available: http://yann.lecun.com/ exdb/mnist/