{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,16]],"date-time":"2026-01-16T13:57:29Z","timestamp":1768571849199,"version":"3.49.0"},"reference-count":43,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2024,5,14]],"date-time":"2024-05-14T00:00:00Z","timestamp":1715644800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["SIGMOD Rec."],"published-print":{"date-parts":[[2024,5,14]]},"abstract":"<jats:p>Data matching, which decides whether two data elements (e.g., string, tuple, column, or knowledge graph entity) are the \"same\" (a.k.a. a match), is a key concept in data integration. The widely used practice is to build task-specific or even dataset-specific solutions, which are hard to generalize and disable the opportunities of knowledge sharing that can be learned from different datasets and multiple tasks. In this paper, we propose Unicorn, a unified model for generally supporting common data matching tasks. Building such a unified model is challenging due to heterogeneous formats of input data elements and various matching semantics of multiple tasks. To address the challenges, Unicorn employs one generic Encoder that converts any pair of data elements (a, b) into a learned representation, and uses a Matcher, which is a binary classifier, to decide whether a matches b. To align matching semantics of multiple tasks, Unicorn adopts a mixture-of-experts model that enhances the learned representation into a better representation. We conduct extensive experiments using 20 datasets on 7 well-studied data matching tasks, and find that our unified model can achieve better performance on most tasks and on average, compared with the state-of-the-art specific models trained for ad-hoc tasks and datasets separately. 
Moreover, Unicorn can also serve new matching tasks well via zero-shot learning.<\/jats:p>","DOI":"10.1145\/3665252.3665263","type":"journal-article","created":{"date-parts":[[2024,5,14]],"date-time":"2024-05-14T22:04:33Z","timestamp":1715724273000},"page":"44-53","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":6,"title":["Unicorn: A Unified Multi-Tasking Matching Model"],"prefix":"10.1145","volume":"53","author":[{"given":"Ju","family":"Fan","sequence":"first","affiliation":[{"name":"Renmin University of China, Beijing, China"}]},{"given":"Jianhong","family":"Tu","sequence":"additional","affiliation":[{"name":"Renmin University of China, Beijing, China"}]},{"given":"Guoliang","family":"Li","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, China"}]},{"given":"Peng","family":"Wang","sequence":"additional","affiliation":[{"name":"Renmin University of China, Beijing, China"}]},{"given":"Xiaoyong","family":"Du","sequence":"additional","affiliation":[{"name":"Renmin University of China, Beijing, China"}]},{"given":"Xiaofeng","family":"Jia","sequence":"additional","affiliation":[{"name":"Beijing Big Data Centre, Beijing, China"}]},{"given":"Song","family":"Gao","sequence":"additional","affiliation":[{"name":"Beijing Big Data Centre, Beijing, China"}]},{"given":"Nan","family":"Tang","sequence":"additional","affiliation":[{"name":"HKUST (GZ) \/ HKUST, Guangzhou, China"}]}],"member":"320","published-online":{"date-parts":[[2024,5,14]]},
"container-title":["ACM SIGMOD Record"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3665252.3665263","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3665252.3665263","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T00:58:33Z","timestamp":1750294713000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3665252.3665263"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,5,14]]},"references-count":43,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2024,5,14]]}},"alternative-id":["10.1145\/3665252.3665263"],"URL":"https:\/\/doi.org\/10.1145\/3665252.3665263","relation":{},"ISSN":["0163-5808"],"issn-type":[{"value":"0163-5808","type":"print"}],"subject":[],"published":{"date-parts":[[2024,5,14]]},"assertion":[{"value":"2024-05-14","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}
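
The abstract embedded in the record above describes Unicorn's pipeline: a generic Encoder maps a pair of data elements (a, b) to a learned representation, a mixture-of-experts layer refines that representation, and a binary Matcher decides match versus non-match. What follows is a minimal, hypothetical PyTorch-style sketch of that kind of pipeline, for illustration only; the class names, dimensions, tokenization, and the simple embedding-based encoder are assumptions made here and are not the authors' released implementation.

# Hypothetical sketch of an Encoder -> mixture-of-experts -> Matcher pipeline
# as described in the abstract. All names and sizes are illustrative assumptions.
import torch
import torch.nn as nn


class PairEncoder(nn.Module):
    """Encodes a serialized pair of data elements (a, b) into one vector.

    Stand-in for the generic Encoder; a real system would likely use a
    pre-trained language model over the serialized pair instead.
    """

    def __init__(self, vocab_size: int = 30522, hidden: int = 256):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, hidden)  # mean-pools token embeddings

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.embed(token_ids)  # (batch, hidden)


class MixtureOfExperts(nn.Module):
    """Refines the pair representation with a soft mixture of expert MLPs."""

    def __init__(self, hidden: int = 256, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(hidden, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.gate(x), dim=-1)               # (batch, num_experts)
        expert_out = torch.stack([e(x) for e in self.experts], 1)   # (batch, num_experts, hidden)
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)      # (batch, hidden)


class Matcher(nn.Module):
    """Binary classifier over the refined representation: match vs. non-match."""

    def __init__(self, hidden: int = 256):
        super().__init__()
        self.classifier = nn.Linear(hidden, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(x)  # logits for (non-match, match)


if __name__ == "__main__":
    encoder, moe, matcher = PairEncoder(), MixtureOfExperts(), Matcher()
    # Toy batch of two serialized (a, b) pairs, already tokenized to ids.
    token_ids = torch.randint(0, 30522, (2, 16))
    logits = matcher(moe(encoder(token_ids)))
    print(logits.softmax(dim=-1))  # per-pair probability of "match"

In this sketch the gate mixes all experts densely; a sparse top-k gate is a common alternative design for mixture-of-experts layers.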