ANALYSIS OF KEY METADATA FOR IDENTIFYING DUPLICATES IN BIBLIOGRAPHIC RECORDS

Authors

O. Vasylenko

DOI:

https://doi.org/10.28925/2663-4023.2025.27.700

Keywords:

bibliographic record; bibliographic metadata; duplicate detection; automated library information systems; Prediction by Partial Matching; multi-level bibliographic description.

Abstract

This study addresses the issue of duplicate bibliographic records in library information systems, a problem that is becoming increasingly relevant with the growth of digital catalogs. It specifically examines the key metadata fields used for comparing records and identifying duplicate entries. The analysis includes critical metadata fields such as title, ISBN, publisher, place of publication, publication date, pagination, series, and additional attributes used for identifying editions. Special attention is given to the variability of data within these fields, including issues arising from misplaced subfields (e.g., place of publication instead of year, or vice versa) and the use of various date formats, such as copyright dates, ranges, or approximate dates. The study explores the specifics of multi-level records, particularly for journals and multi-volume publications, as well as errors caused by data migration between different automated library information systems (ALIS). The research demonstrates that, despite the existence of ISBD, UNIMARC, and other standards, a significant proportion of inconsistencies persist in bibliographic records, complicating automated processing. Fields containing publishers and places of publication exhibit a high degree of variability, with unique values accounting for only 9% to 38% of total records. During data clustering using the nearest neighbor method with the Prediction by Partial Matching (PPM) algorithm, the number of unique values was reduced by 6–24%, highlighting the potential of automation to improve the efficiency of manual record editing. The findings make a substantial contribution to the development of effective approaches for enhancing automated bibliographic data management systems, optimizing duplicate detection, and improving the overall quality of library databases.
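The date-format variability noted in the abstract (copyright dates, ranges, approximate dates, and a place of publication entered where the year belongs) can be illustrated with a minimal Python sketch. It is not the method used in the study: the function name `parse_pub_date`, the regular expressions, and the sample strings are all hypothetical, chosen only to show how differently formatted dates might be reduced to a comparable year interval before records are matched.

```python
import re
from typing import Optional, Tuple

# All patterns below are illustrative assumptions, not rules from the study.
# A complete four-digit year that is not part of a longer digit run.
YEAR = re.compile(r"(?<!\d)\d{4}(?!\d)")
# An incomplete year such as "199-" (decade) or "19--" (century).
PARTIAL = re.compile(r"(?<!\d)(\d{3}-|\d{2}--)(?!\d)")
# Copyright markers directly before a year: "c2005", "cop. 1998", "©2010".
COPYRIGHT = re.compile(r"(?:©|\bcop\.?|\bc)\s*\d{4}")
# Markers of an uncertain date: "[2001?]", "ca. 1990", "circa 1990".
APPROX = re.compile(r"\?|\b(?:ca|circa)\b")


def parse_pub_date(raw: str) -> Optional[Tuple[int, int, str]]:
    """Return (first_year, last_year, kind), or None if no year is found.

    kind is one of "exact", "copyright", "approximate", "range".
    """
    text = raw.strip().lower()
    years = [int(y) for y in YEAR.findall(text)]

    # Two or more full years are treated as a range, e.g. "2001-2003".
    if len(years) >= 2:
        return (min(years), max(years), "range")

    if len(years) == 1:
        if COPYRIGHT.search(text):
            kind = "copyright"
        elif APPROX.search(text):
            kind = "approximate"
        else:
            kind = "exact"
        return (years[0], years[0], kind)

    # No full year: expand a partial year to the interval it covers.
    partial = PARTIAL.search(text)
    if partial:
        prefix = partial.group(1).rstrip("-")
        missing = 4 - len(prefix)
        first = int(prefix) * 10 ** missing
        return (first, first + 10 ** missing - 1, "approximate")

    # Nothing that looks like a year, e.g. a place name in the date subfield.
    return None


for sample in ["2005", "c2005", "cop. 1998", "2001-2003", "[199-?]", "Kyiv"]:
    print(f"{sample!r:14} -> {parse_pub_date(sample)}")
```

Returning an interval rather than a single year lets two records be compared by overlap, so that a copyright date and an approximate date for the same edition need not be treated as a mismatch. A `None` result flags values such as a place name in the date subfield for manual review.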

Published

2025-03-27

How to Cite

Vasylenko, O. (2025). ANALYSIS OF KEY METADATA FOR IDENTIFYING DUPLICATES IN BIBLIOGRAPHIC RECORDS. Electronic Professional Scientific Journal «Cybersecurity: Education, Science, Technique», 3(27), 87–99. https://doi.org/10.28925/2663-4023.2025.27.700