Adaptive Approximate Record Matching

Author

Computer Engineering Department, Azad University-Tehran Central Branch, Tehran, Iran.

Abstract

Typographical data entry errors and incomplete documents, produce imperfect records in real world databases. These errors generate distinct records which belong to the same entity. The aim of Approximate Record Matching is to find multiple records which belong to an entity. In this paper, an algorithm for Approximate Record Matching is proposed that can be adapted automatically with input error patterns. In field matching phase, edit distance method is used. Naturally, it had been customized for Persian language problems such as similarity of Persian characters, usual typographical errors in Persian, etc. In record matching phase, the importance of each field can be determined by specifying a coefficient related to each field. Coefficient of each field must be dynamically changed, because of changes of typographical error patterns. For this reason, Genetic Algorithm (GA) is used for supervised learning of coefficient values. The simulation results show the high abilities of this algorithm compared with other methods (such as Decision Trees).

Keywords


[1]
D. E. Goldberg, “Genetic Algorithms in Search Optimization and Machine Learning”, Addison_Wesley, 1989.
[2]
] J. Han, M. Kamber, “Data Mining: Concepts and Techniques”, Morgan Kaufmann, 2001.
[3]
M. A. Hernadez, S.J.Stolfo, "Real-world Data is Dirty: Data Cleansing and the Merge/Purge Problem", Journal of Data Mining and Knowledge Discovery, Vol.1, No.2, 1998
[4]
J.A. Hylton, “Identifying and Merging Related Bibliographical Records”, Master's Thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 1996.
[5]
M. Kantardzic,” Data Mining: Concepts, Methods, and Algorithms”, IEEE Press, 2003.
[6]
K. Kukich, "Techniques for Automatically Correcting Words in Text", ACM Computing Survey, Vol.24, No.4, 1992.
[7]
A. E. Monge, “Adaptive Detection of Approximately Duplicate Database Records and Database Integration Approach to Information Discovery”, PHD Thesis, University of California, San Diego, 1997.
[8]
] A. E. Monge, C. P. Elkan, "The Field Matching Problem: Algorithms and Applications" Second International Conference of Knowledge Discovery and Data Mining, AAAI Press, 1996.
[9]
V. S. Verykios, A.K.Elmagarmid, E.H.Houstis, "Automating the Approximate Record Matching Process", Information Science, Vol.126, No 1-4, 2000.
[10]
V. S. Verykios, G.V.Moustakides, "A Cost Optimal Decision Model for Record Matching", Workshop on Data Quality, 2001.