{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,14]],"date-time":"2025-12-14T12:06:13Z","timestamp":1765713973036,"version":"3.41.2"},"reference-count":33,"publisher":"Emerald","issue":"6","license":[{"start":{"date-parts":[[2021,10,18]],"date-time":"2021-10-18T00:00:00Z","timestamp":1634515200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/www.emerald.com\/insight\/site-policies"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["IJWIS"],"published-print":{"date-parts":[[2021,12,1]]},"abstract":"<jats:sec>\n<jats:title content-type=\"abstract-subheading\">Purpose<\/jats:title>\n<jats:p>In the world of big data, data integration technology is crucial for maximising the capability of data-driven decision-making. Integrating data from multiple sources drastically expands the power of information and allows us to address questions that are impossible to answer using a single data source. Record Linkage (RL) is a task of identifying and linking records from multiple sources that describe the same real world object (e.g. person), and it plays a crucial role in the data integration process. RL is challenging, as it is uncommon for different data sources to share a unique identifier. Hence, the records must be matched based on the comparison of their corresponding values. Most of the existing RL techniques assume that records across different data sources are structured and represented by the same scheme (i.e. set of attributes). Given the increasing amount of heterogeneous data sources, those assumptions are rather unrealistic. 
The purpose of this paper is to propose a novel RL model for unstructured data.<\/jats:p>\n<\/jats:sec>\n<jats:sec>\n<jats:title content-type=\"abstract-subheading\">Design\/methodology\/approach<\/jats:title>\n<jats:p>In the previous work (Jurek-Loughrey, 2020), the authors proposed a novel approach to linking unstructured data based on the application of the Siamese Multilayer Perceptron model. It was demonstrated that the method performed on par with other approaches that make constraining assumptions regarding the data. This paper expands the previous work originally presented at iiWAS2020 [16] by exploring new architectures of the Siamese Neural Network, which improves the generalisation of the RL model and makes it less sensitive to parameter selection.<\/jats:p>\n<\/jats:sec>\n<jats:sec>\n<jats:title content-type=\"abstract-subheading\">Findings<\/jats:title>\n<jats:p>The experimental results confirm that the new Autoencoder-based architecture of the Siamese Neural Network obtains better results in comparison to the Siamese Multilayer Perceptron model proposed in (Jurek <jats:italic>et al.<\/jats:italic>, 2020). Better results have been achieved in three out of four data sets. 
Furthermore, it has been demonstrated that the second proposed (hybrid) architecture, based on integrating the Siamese Autoencoder with a Multilayer Perceptron model, makes the model more stable in terms of the parameter selection.<\/jats:p>\n<\/jats:sec>\n<jats:sec>\n<jats:title content-type=\"abstract-subheading\">Originality\/value<\/jats:title>\n<jats:p>To address the problem of unstructured RL, this paper presents a new deep learning based approach to improve the generalisation of the Siamese Multilayer Perceptron model and make it less sensitive to parameter selection.<\/jats:p>\n<\/jats:sec>","DOI":"10.1108\/ijwis-05-2021-0058","type":"journal-article","created":{"date-parts":[[2021,10,16]],"date-time":"2021-10-16T04:37:47Z","timestamp":1634359067000},"page":"607-621","source":"Crossref","is-referenced-by-count":5,"title":["Deep learning based approach to unstructured record linkage"],"prefix":"10.1108","volume":"17","author":[{"given":"Anna","family":"Jurek-Loughrey","sequence":"first","affiliation":[]}],"member":"140","published-online":{"date-parts":[[2021,10,18]]},"reference":[{"issue":"5","key":"key2021112917082172800_ref001","doi-asserted-by":"crossref","first-page":"16","DOI":"10.1109\/MIS.2003.1234765","article-title":"Adaptive name matching in information integration","volume":"18","year":"2003","journal-title":"IEEE Intelligent Systems"},{"key":"key2021112917082172800_ref002","first-page":"737","article-title":"Signature verification using a \u2018siamese\u2019 time delay neural network","year":"1994","journal-title":"In: Advances in Neural Information Processing Systems"},{"volume-title":"Data Matching: concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection","year":"2012","key":"key2021112917082172800_ref003"},{"issue":"9","key":"key2021112917082172800_ref004","doi-asserted-by":"crossref","first-page":"1537","DOI":"10.1109\/TKDE.2011.127","article-title":"A survey of indexing techniques for scalable record linkage 
and deduplication","volume":"24","year":"2012","journal-title":"IEEE Transactions on Knowledge and Data Engineering"},{"key":"key2021112917082172800_ref005","first-page":"73","article-title":"A comparison of string metrics for matching names and records","volume":"3","year":"2003","journal-title":"In Kdd Workshop on Data Cleaning and Object Consolidation"},{"year":"2018","key":"key2021112917082172800_ref006","article-title":"Bert: pre-training of deep bidirectional transformers for language understanding"},{"key":"key2021112917082172800_ref007","first-page":"1245","article-title":"Big data integration","volume-title":"In 2013 IEEE 29th International Conference on Data Engineering (ICDE)","year":"2013"},{"issue":"11","key":"key2021112917082172800_ref008","doi-asserted-by":"crossref","first-page":"1454","DOI":"10.14778\/3236187.3236198","article-title":"Distributed representations of tuples for entity resolution","volume":"11","year":"2018","journal-title":"Proceedings of the VLDB Endowment"},{"key":"key2021112917082172800_ref009","first-page":"17","article-title":"Tailor: a record linkage toolbox","volume-title":"In: Proceedings 18th International Conference on Data Engineering","year":"2002"},{"issue":"1","key":"key2021112917082172800_ref010","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1109\/TKDE.2007.250581","article-title":"Duplicate record detection: a survey","volume":"19","year":"2007","journal-title":"IEEE Transactions on Knowledge and Data Engineering"},{"issue":"12","key":"key2021112917082172800_ref011","doi-asserted-by":"crossref","first-page":"2018","DOI":"10.14778\/2367502.2367564","article-title":"Entity resolution: theory, practice and open challenges","volume":"5","year":"2012","journal-title":"Proceedings of the VLDB Endowment"},{"issue":"11","key":"key2021112917082172800_ref012","doi-asserted-by":"crossref","first-page":"1638","DOI":"10.14778\/2350229.2350276","article-title":"Learning expressive linkage rules using genetic 
programming","volume":"5","year":"2012","journal-title":"Proceedings of the VLDB Endowment"},{"key":"key2021112917082172800_ref013","first-page":"241","article-title":"Distribution de la flore alpine dans le bassin des dranses et dans quelques r\u00e9gions voisines","volume":"37","year":"1901","journal-title":"Bull. Soc. Vaud. Sci. Nat"},{"issue":"406","key":"key2021112917082172800_ref014","doi-asserted-by":"crossref","first-page":"414","DOI":"10.1080\/01621459.1989.10478785","article-title":"Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida","volume":"84","year":"1989","journal-title":"Journal of the American Statistical Association"},{"key":"key2021112917082172800_ref015","doi-asserted-by":"crossref","first-page":"40","DOI":"10.1016\/j.is.2017.06.006","article-title":"A novel ensemble learning approach to unsupervised record linkage","volume":"71","year":"2017","journal-title":"Information Systems"},{"first-page":"417","article-title":"Deep learning based approach to unstructured record linkage","year":"2020","key":"key2021112917082172800_ref016"},{"key":"key2021112917082172800_ref017","first-page":"340","article-title":"An unsupervised algorithm for learning blocking schemes","volume-title":"In 2013 IEEE 13th International Conference on Data Mining","year":"2013"},{"key":"key2021112917082172800_ref018","first-page":"388","article-title":"Semi-supervised instance matching using boosted classifiers","volume-title":"In European Semantic Web Conference","year":"2015"},{"key":"key2021112917082172800_ref019","article-title":"Siamese neural networks for one-shot image recognition","volume":"2","year":"2015","journal-title":"In ICML Deep Learning Workshop"},{"issue":"1\/2","key":"key2021112917082172800_ref020","first-page":"484","article-title":"Evaluation of entity resolution approaches on real-world match problems","volume":"3","year":"2010","journal-title":"Proceedings of the VLDB 
Endowment"},{"key":"key2021112917082172800_ref021","first-page":"107","article-title":"Supervised autoencoders: improving generalization performance with unsupervised regularizers","volume":"31","year":"2018","journal-title":"Advances in Neural Information Processing Systems"},{"key":"key2021112917082172800_ref022","first-page":"707","article-title":"Binary codes capable of correcting deletions, insertions, and reversals","volume":"10","year":"1966","journal-title":"In: Soviet Physics Doklady"},{"issue":"8","key":"key2021112917082172800_ref023","doi-asserted-by":"crossref","first-page":"1388","DOI":"10.1111\/j.1551-6709.2010.01106.x","article-title":"Composition in distributional models of semantics","volume":"34","year":"2010","journal-title":"Cognitive Science"},{"article-title":"Siamese recurrent architectures for learning sentence similarity","volume-title":"In: Thirtieth AAAI Conference on Artificial Intelligence","year":"2016","key":"key2021112917082172800_ref024"},{"key":"key2021112917082172800_ref025","first-page":"25","article-title":"Unsupervised learning of link specifications: deterministic vs non-deterministic","volume-title":"In: Proceedings of the 8th International Conference on Ontology Matching","year":"2013"},{"issue":"2","key":"key2021112917082172800_ref026","article-title":"High-value token-blocking: efficient blocking method for record linkage","volume":"16","year":"2021","journal-title":"ACM Transactions on Knowledge Discovery from Data"},{"key":"key2021112917082172800_ref027","first-page":"1532","article-title":"Glove: global vectors for word representation","volume-title":"In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)","year":"2014"},{"key":"key2021112917082172800_ref028","first-page":"1195","article-title":"Leveraging social media signals for record linkage","volume-title":"Proceedings of the 2018 World Wide Web Conference on World Wide Web, International World Wide Web Conferences 
Steering Committee","year":"2018"},{"issue":"1","key":"key2021112917082172800_ref029","doi-asserted-by":"crossref","first-page":"3","DOI":"10.1145\/584091.584093","article-title":"A mathematical theory of communication","volume":"5","year":"2001","journal-title":"ACM SIGMOBILE Mobile Computing and Communications Review"},{"key":"key2021112917082172800_ref030","first-page":"103","article-title":"Wombat \u2013 a generalization approach for automatic link discovery","volume-title":"In: European Semantic Web Conference","year":"2017"},{"key":"key2021112917082172800_ref031","first-page":"562","article-title":"Efficient interactive training selection for large-scale entity resolution","volume-title":"In: Pacific-Asia Conference on Knowledge Discovery and Data Mining","year":"2015"},{"issue":"10","key":"key2021112917082172800_ref032","doi-asserted-by":"crossref","first-page":"622","DOI":"10.14778\/2021017.2021020","article-title":"Entity matching: how similar is similar","volume":"4","year":"2011","journal-title":"Proceedings of the VLDB Endowment"},{"first-page":"354","article-title":"String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage","year":"1990","key":"key2021112917082172800_ref033"}],"container-title":["International Journal of Web Information 
Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.emerald.com\/insight\/content\/doi\/10.1108\/IJWIS-05-2021-0058\/full\/xml","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.emerald.com\/insight\/content\/doi\/10.1108\/IJWIS-05-2021-0058\/full\/html","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,24]],"date-time":"2025-07-24T22:24:06Z","timestamp":1753395846000},"score":1,"resource":{"primary":{"URL":"http:\/\/www.emerald.com\/ijwis\/article\/17\/6\/607-621\/445119"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,10,18]]},"references-count":33,"journal-issue":{"issue":"6","published-online":{"date-parts":[[2021,10,18]]},"published-print":{"date-parts":[[2021,12,1]]}},"alternative-id":["10.1108\/IJWIS-05-2021-0058"],"URL":"https:\/\/doi.org\/10.1108\/ijwis-05-2021-0058","relation":{},"ISSN":["1744-0084","1744-0084"],"issn-type":[{"type":"print","value":"1744-0084"},{"type":"print","value":"1744-0084"}],"subject":[],"published":{"date-parts":[[2021,10,18]]}}}