ANALYSIS OF KEY METADATA FOR IDENTIFYING DUPLICATES IN BIBLIOGRAPHIC RECORDS
DOI:
https://doi.org/10.28925/2663-4023.2025.27.700
Keywords:
bibliographic record; bibliographic metadata; duplicate detection; automated library information systems; Prediction by Partial Matching; multi-level bibliographic description
Abstract
This study addresses the issue of duplicate bibliographic records in library information systems, a problem that is becoming increasingly relevant with the growth of digital catalogs. It specifically examines the key metadata fields used for comparing records and identifying duplicate entries. The analysis includes critical metadata fields such as title, ISBN, publisher, place of publication, publication date, pagination, series, and additional attributes used for identifying editions. Special attention is given to the variability of data within these fields, including issues arising from misplaced subfields (e.g., place of publication instead of year, or vice versa) and the use of various date formats, such as copyright dates, ranges, or approximate dates. The study explores the specifics of multi-level records, particularly for journals and multi-volume publications, as well as errors caused by data migration between different automated library information systems (ALIS). The research demonstrates that, despite the existence of ISBD, UNIMARC, and other standards, a significant proportion of inconsistencies persist in bibliographic records, complicating automated processing. Fields containing publishers and places of publication exhibit a high degree of variability, with unique values accounting for only 9% to 38% of total records. During data clustering using the nearest neighbor method with the Prediction by Partial Matching (PPM) algorithm, the number of unique values was reduced by 6–24%, highlighting the potential of automation to improve the efficiency of manual record editing. The findings make a substantial contribution to the development of effective approaches for enhancing automated bibliographic data management systems, optimizing duplicate detection, and improving the overall quality of library databases.
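The clustering step mentioned in the abstract can be illustrated with a minimal sketch. It is not the study's implementation: Python's standard library has no PPM compressor, so zlib (DEFLATE) is swapped in as a stand-in to compute a normalized compression distance between field values, and values closer than a threshold are linked into one cluster. The function names, the 0.4 threshold, and the sample imprint strings are assumptions made for illustration only; they are not taken from the study's data.

```python
import zlib
from itertools import combinations


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 text (stand-in for PPM)."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance: small when b adds little new information to a."""
    ca, cb = compressed_size(a), compressed_size(b)
    cab = compressed_size(a + b)
    return (cab - min(ca, cb)) / max(ca, cb)


def cluster(values, threshold=0.4):
    """Link every pair of values whose distance is below the threshold
    (hypothetical cut-off) and return the resulting groups."""
    parent = list(range(len(values)))

    def find(i):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(values)), 2):
        if ncd(values[i], values[j]) < threshold:
            parent[find(j)] = find(i)

    groups = {}
    for i, value in enumerate(values):
        groups.setdefault(find(i), []).append(value)
    return list(groups.values())


# Illustrative imprint strings (place : publisher), not records from the study.
sample = [
    "Kyiv : Naukova dumka",
    "Kharkiv : Folio",
    "Kyiv: Naukova dumka",
    "Odesa : Astroprynt",
    "Kharkiv: Folio",
]

if __name__ == "__main__":
    for group in cluster(sample):
        print(group)
```

Each resulting group is a candidate set of spelling variants that a cataloguer could review and merge manually. A compression-based distance is noisy on very short strings, so the threshold would need tuning against real catalogue data.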
License
Copyright (c) 2025 Олег Василенко

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.