{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,22]],"date-time":"2025-12-22T10:29:37Z","timestamp":1766399377842,"version":"3.48.0"},"reference-count":25,"publisher":"Springer Science and Business Media LLC","issue":"6","license":[{"start":{"date-parts":[[2024,8,22]],"date-time":"2024-08-22T00:00:00Z","timestamp":1724284800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,8,22]],"date-time":"2024-08-22T00:00:00Z","timestamp":1724284800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100007397","name":"Univerzita Karlova v Praze","doi-asserted-by":"publisher","award":["SVV 260 698"],"award-info":[{"award-number":["SVV 260 698"]}],"id":[{"id":"10.13039\/100007397","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100007397","name":"Univerzita Karlova v Praze","doi-asserted-by":"publisher","award":["UNCE 24\/SCI\/008"],"award-info":[{"award-number":["UNCE 24\/SCI\/008"]}],"id":[{"id":"10.13039\/100007397","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100007397","name":"Charles University","doi-asserted-by":"crossref","id":[{"id":"10.13039\/100007397","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Electron Commer Res"],"published-print":{"date-parts":[[2025,12]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Product mapping or product matching is the field of research dedicated to solving the problem of identifying which product listings (including names, descriptions, specifications, images, and other information) from different e-shops refer to the same product. 
The problem is one of the important data integration tasks that process data originating from different sources and with different structures. In our previous work, we created the basic ProMapEn and ProMapCz datasets for product mapping in English and Czech. The main advantage of the ProMap datasets compared to existing product mapping datasets is that they contain different types of non-matches based on the similarity of the two products. In this paper, we extend the previous two datasets into a completely new collection of datasets for generalized product mapping in the Czech and English languages. We publish these datasets freely for other researchers in the area of product mapping in e-commerce. The main contributions are the extension of the ProMap datasets by adding a new class of non-matching products, the introduction of new ProMapMulti datasets of product pairs from multiple English e-shops, and the introduction of ProMapTransl datasets, obtained by translating the Czech datasets to English and vice versa. Moreover, we provide a very detailed analysis of these datasets with several experiments based on neural network techniques comparing different text preprocessing methods and similarity computation methods. We also compare the differences among several product categories and evaluate state-of-the-art product mapping methods on these datasets. We also include generalized entity matching techniques and compare their behaviour on product mapping datasets, which belong to this area. 
Finally, we include an appendix with a number of other basic experiments, such as an analysis of feature importances.<\/jats:p>","DOI":"10.1007\/s10660-024-09892-9","type":"journal-article","created":{"date-parts":[[2024,8,23]],"date-time":"2024-08-23T12:52:00Z","timestamp":1724417520000},"page":"5045-5074","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Extended ProMap datasets for product mapping"],"prefix":"10.1007","volume":"25","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-9815-2763","authenticated-orcid":false,"given":"Kate\u0159ina","family":"Mackov\u00e1","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1239-1566","authenticated-orcid":false,"given":"Martin","family":"Pil\u00e1t","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2024,8,22]]},"reference":[{"key":"9892_CR1","doi-asserted-by":"publisher","unstructured":"Akritidis, L., & Bozanis, P. (2018). Effective unsupervised matching of product titles with k-combinations and permutations. In 2018 innovations in intelligent systems and applications (INISTA), pp. 1\u201310. https:\/\/doi.org\/10.1109\/INISTA.2018.8466294","DOI":"10.1109\/INISTA.2018.8466294"},{"key":"9892_CR2","doi-asserted-by":"publisher","unstructured":"Akritidis, L., Fevgas, A., & Bozanis, P. (2018). Effective products categorization with importance scores and morphological analysis of the titles. In 2018 IEEE 30th international conference on tools with artificial intelligence (ICTAI), pp. 213\u2013220. 
https:\/\/doi.org\/10.1109\/ICTAI.2018.00041","DOI":"10.1109\/ICTAI.2018.00041"},{"issue":"7","key":"9892_CR3","doi-asserted-by":"publisher","first-page":"4777","DOI":"10.1007\/s10462-020-09807-8","volume":"53","author":"L Akritidis","year":"2020","unstructured":"Akritidis, L., Fevgas, A., Bozanis, P., & Makris, C. (2020). A self-verifying clustering approach to unsupervised matching of product titles. Artificial Intelligence Review, 53(7), 4777\u20134820. https:\/\/doi.org\/10.1007\/s10462-020-09807-8","journal-title":"Artificial Intelligence Review"},{"key":"9892_CR4","doi-asserted-by":"crossref","unstructured":"Ba\u00f1\u00f3n, M., Chen, P., Haddow, B., Heafield, K., Hoang, H., Espla-Gomis, M., Forcada, M., Kamran, A., Kirefu, F., Koehn, P., & Ortiz-Rojas, S. (2020). Paracrawl: Web-scale acquisition of parallel corpora. In Proceedings of the 58th annual meeting of the association for computational linguistics, pp. 4555\u20134567.","DOI":"10.18653\/v1\/2020.acl-main.417"},{"key":"9892_CR5","doi-asserted-by":"crossref","unstructured":"Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzman, F., Grave, E., Ott, M., Zettlemoyer, L., & Stoyanov, V. (2019). Unsupervised cross-lingual representation learning at scale. CoRR arXiv:1911.02116","DOI":"10.18653\/v1\/2020.acl-main.747"},{"key":"9892_CR6","unstructured":"Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR. arXiv:1810.04805"},{"key":"9892_CR7","unstructured":"Howard, A., Liew, C., Wong, M., & Dane, S. (2021). Shopee: Price match guarantee. https:\/\/kaggle.com\/competitions\/shopee-product-matching."},{"issue":"13","key":"9892_CR8","doi-asserted-by":"publisher","first-page":"1581","DOI":"10.14778\/3007263.3007314","volume":"9","author":"P Konda","year":"2016","unstructured":"Konda, P., Das, S., Sagunthan, P. 
Doan, A., Aradalan, A., Ballard, J., Li, H., Panahi, F., Zhang, H., Naughton, J., Prasad, S., Krishnan, G., Deep, R., & Raghavendra, V. (2016). Magellan: Toward building entity matching management systems over data science stacks. Proceedings of the VLDB Endowment, 9(13), 1581\u20131584.","journal-title":"Proceedings of the VLDB Endowment"},{"issue":"1\u20132","key":"9892_CR9","doi-asserted-by":"publisher","first-page":"484","DOI":"10.14778\/1920841.1920904","volume":"3","author":"H K\u00f6pcke","year":"2010","unstructured":"K\u00f6pcke, H., Thor, A., & Rahm, E. (2010). Evaluation of entity resolution approaches on real-world match problems. Proc VLDB Endow, 3(1\u20132), 484\u2013493. https:\/\/doi.org\/10.14778\/1920841.1920904","journal-title":"Proc VLDB Endow"},{"key":"9892_CR10","doi-asserted-by":"crossref","unstructured":"Li, Y., Li, J., Suhara, Y., Doan, A., & Tan, W. C. (2020). Deep entity matching with pre-trained language models. arXiv preprint arXiv:2004.00584.","DOI":"10.14778\/3421424.3421431"},{"key":"9892_CR11","unstructured":"Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). Roberta: A robustly optimized BERT pretraining approach. CoRR. arXiv:1907.11692"},{"key":"9892_CR12","doi-asserted-by":"publisher","unstructured":"Mackov\u00e1, K., & Pil\u00e1t, M. (2023). Promap: Datasets for product mapping in e-commerce. CoRR https:\/\/doi.org\/10.48550\/ARXIV.2309.06882. arXiv:2309.06882","DOI":"10.48550\/ARXIV.2309.06882"},{"key":"9892_CR13","doi-asserted-by":"publisher","unstructured":"Mudgal, S., Li, H., Rekatsinas, T., Doan, A., Park, Y., Krishnan, G., Deep, R., Arcaute, E., & Raghavendra, V. (2018). Deep learning for entity matching: A design space exploration. In Proceedings of the 2018 international conference on management of data. association for computing machinery, (pp. 19\u201334). SIGMOD \u201918. 
https:\/\/doi.org\/10.1145\/3183713.3196926","DOI":"10.1145\/3183713.3196926"},{"issue":"4","key":"9892_CR14","doi-asserted-by":"publisher","first-page":"738","DOI":"10.14778\/3574245.3574258","volume":"16","author":"A Narayan","year":"2022","unstructured":"Narayan, A., Chami, I., Orr, L., Arora, S., & Re, C. (2022). Can foundation models wrangle your data? Proc VLDB Endow, 16(4), 738\u2013746. https:\/\/doi.org\/10.14778\/3574245.3574258","journal-title":"Proc VLDB Endow"},{"key":"9892_CR15","unstructured":"Naumann, F. (2011). Amazon-walmart dataset. pp. 353\u2013362. https:\/\/hpi.de\/naumann\/projects\/repeatability\/datasets\/amazon-walmart-dataset.htm"},{"key":"9892_CR16","unstructured":"Peeters, R., & Bizer, C. (2024). Entity matching using large language models. arXiv:2310.11244"},{"key":"9892_CR17","doi-asserted-by":"publisher","unstructured":"Peeters, R., Der, R.C., & Bizer, C. (2024). WDC products: A multi-dimensional entity matching benchmark. In: Tanca, L., Luo, Q., Polese, G., et\u00a0al. (eds) Proceedings 27th international conference on extending database technology, EDBT 2024, Paestum, Italy, March 25\u2013March 28. OpenProceedings.org, pp. 22\u201333. https:\/\/doi.org\/10.48786\/EDBT.2024.03","DOI":"10.48786\/EDBT.2024.03"},{"issue":"4381","key":"9892_CR18","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/s41467-020-18073-9","volume":"11","author":"M Popel","year":"2020","unstructured":"Popel, M., Tomkova, M., Tomek, J., Kaiser, L., Uszkoreit, J., Bojar, O., & Zabokrtsky, Z. (2020). Transforming machine translation: A deep learning system reaches news translation quality comparable to human professionals. Nature Communications, 11(4381), 1\u201315. https:\/\/doi.org\/10.1038\/s41467-020-18073-9","journal-title":"Nature Communications"},{"key":"9892_CR19","doi-asserted-by":"publisher","unstructured":"Primpeli, A., & Bizer, C. (2020). Profiling entity matching benchmark tasks. 
In Proceedings of the 29th ACM international conference on information & knowledge management. Association for Computing Machinery (pp. 3101\u20133108). CIKM \u201920. https:\/\/doi.org\/10.1145\/3340531.3412781","DOI":"10.1145\/3340531.3412781"},{"key":"9892_CR20","doi-asserted-by":"publisher","unstructured":"Primpeli, A., Peeters, R., & Bizer, C. (2019). The wdc training dataset and gold standard for large-scale product matching. In Companion proceedings of the 2019 World Wide Web conference. Association for computing machinery (pp. 381\u2013386). WWW \u201919. https:\/\/doi.org\/10.1145\/3308560.3316609","DOI":"10.1145\/3308560.3316609"},{"key":"9892_CR21","unstructured":"Rahm, E., Peukert, E., Saeedi, A., & Nentwig, M. (2010). Benchmark datasets for entity resolution. https:\/\/dbs.uni-leipzig.de\/research\/projects\/object_matching\/benchmark_datasets_for_entity_resolution."},{"key":"9892_CR22","unstructured":"Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). Distilbert, a distilled version of bert: Smaller, faster, cheaper and lighter. ArXiv arXiv:1910.01108"},{"key":"9892_CR23","doi-asserted-by":"crossref","unstructured":"Sedl\u00e1\u010dek, R., & Smr\u017e, P. (2001). A new Czech morphological analyser Ajka. In Text, speech and dialogue: 4th international conference, TSD 2001 \u017eelezn\u00e1 Ruda, Czech Republic, September 11\u201313, 2001, Proceedings 4 (pp. 100-107). Springer.","DOI":"10.1007\/3-540-44805-5_13"},{"key":"9892_CR24","doi-asserted-by":"crossref","unstructured":"Strakov\u00e1, J., Straka, M., & Hajic, J. (2014). Open-source tools for morphology, lemmatization, pos tagging and named entity recognition. In Proceedings of 52nd annual meeting of the association for computational linguistics: System demonstrations, pp. 13\u201318.","DOI":"10.3115\/v1\/P14-5003"},{"key":"9892_CR25","doi-asserted-by":"crossref","unstructured":"Yang, B., Gu, F., & Niu, X. (2006). Block mean value based image perceptual hashing. 
In 2006 International conference on intelligent information hiding and multimedia (pp. 167\u2013172), IEEE.","DOI":"10.1109\/IIH-MSP.2006.265125"}],"container-title":["Electronic Commerce Research"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10660-024-09892-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10660-024-09892-9","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10660-024-09892-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,22]],"date-time":"2025-12-22T10:25:22Z","timestamp":1766399122000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10660-024-09892-9"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,8,22]]},"references-count":25,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2025,12]]}},"alternative-id":["9892"],"URL":"https:\/\/doi.org\/10.1007\/s10660-024-09892-9","relation":{},"ISSN":["1389-5753","1572-9362"],"issn-type":[{"type":"print","value":"1389-5753"},{"type":"electronic","value":"1572-9362"}],"subject":[],"published":{"date-parts":[[2024,8,22]]},"assertion":[{"value":"9 August 2024","order":1,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"22 August 2024","order":2,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"On behalf of all authors, the corresponding author states that there is no Conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}]}}