{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,24]],"date-time":"2026-02-24T17:41:11Z","timestamp":1771954871272,"version":"3.50.1"},"reference-count":36,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2021,4,27]],"date-time":"2021-04-27T00:00:00Z","timestamp":1619481600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Building Healthcare Informatics Systems Utilising Web Data"},{"name":"Department of Science & Technology, Government of India"},{"name":"NVIDIA Corporation"},{"name":"Titan Xp GPU"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["J. Data and Information Quality"],"published-print":{"date-parts":[[2021,9,30]]},"abstract":"<jats:p>A large fraction of textual data available today contains various types of \u201cnoise,\u201d such as OCR noise in digitized documents, noise due to informal writing style of users on microblogging sites, and so on. To enable tasks such as search\/retrieval and classification over all the available data, we need robust algorithms for text normalization, i.e., for cleaning different kinds of noise in the text. There have been several efforts towards cleaning or normalizing noisy text; however, many of the existing text normalization methods are supervised and require language-dependent resources or large amounts of training data that is difficult to obtain. We propose an unsupervised algorithm for text normalization that does not need any training data\/human intervention. The proposed algorithm is applicable to text over different languages and can handle both machine-generated and human-generated noise. 
Experiments over several standard datasets show that text normalization through the proposed algorithm enables better retrieval and stance detection, as compared to that using several baseline text normalization methods.<\/jats:p>","DOI":"10.1145\/3418036","type":"journal-article","created":{"date-parts":[[2021,4,27]],"date-time":"2021-04-27T14:14:25Z","timestamp":1619532865000},"page":"1-25","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":5,"title":["An Unsupervised Normalization Algorithm for Noisy Text: A Case Study for Information Retrieval and Stance Detection"],"prefix":"10.1145","volume":"13","author":[{"given":"Anurag","family":"Roy","sequence":"first","affiliation":[{"name":"Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur, India"}]},{"given":"Shalmoli","family":"Ghosh","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur, India"}]},{"given":"Kripabandhu","family":"Ghosh","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Application, Indian Institute of Science Education and Research Kolkata, Mohanpur, India"}]},{"given":"Saptarshi","family":"Ghosh","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur, India"}]}],"member":"320","published-online":{"date-parts":[[2021,4,27]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-56608-5_53"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCSS.2019.2914179"},{"key":"e_1_2_1_3_1","volume-title":"Proc. Workshop on Noisy User-generated Text (WNUT\u201916)","author":"Costa Bertaglia Thales Felipe","year":"2016","unstructured":"Thales Felipe Costa Bertaglia, and Maria das Gra\u00e7as Volpe Nunes. 2016. 
Exploring word embeddings for unsupervised textual user-generated content normalization. In Proc. Workshop on Noisy User-generated Text (WNUT\u201916). 112\u2013120."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1088\/1742-5468\/2008\/10\/P10008"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.3115\/1075218.1075255"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.5555\/3171837.3171843"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.physrep.2009.11.002"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/2034617.2034622"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.ipm.2016.03.006"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.5555\/2140458.2140468"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.5555\/2002472.2002520"},{"key":"e_1_2_1_12_1","volume-title":"Proc. International Conference on Language Resources and Evaluation (LREC\u201914)","author":"Hartmann Nathan","year":"2014","unstructured":"Nathan Hartmann, Lucas Avan\u00e7o, Pedro Balage, Magali Duran, Maria das Gra\u00e7as Volpe Nunes, Thiago Pardo, and Sandra Alu\u00edsio. 2014. A large corpus of product reviews in portuguese: Tackling Out-of-vocabulary words. In Proc. International Conference on Language Resources and Evaluation (LREC\u201914). 
3865\u20133871."},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1009902609570"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.5555\/1289189.1289208"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3369026"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/P14-3012"},{"key":"e_1_2_1_18_1","volume-title":"Proc. EMNLP. 73\u201384","author":"Ling Wang","year":"2013","unstructured":"Wang Ling, Chris Dyer, Alan W. Black, and Isabel Trancoso. 2013. Paraphrasing 4 microblog normalization. In Proc. EMNLP. 73\u201384."},{"key":"e_1_2_1_19_1","volume-title":"Proc. Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial\u201918)","author":"Lusetti Massimo","year":"2018","unstructured":"Massimo Lusetti, Tatyana Ruzsics, Anne G\u00f6hring, Tanja Samard\u017ei\u0107, and Elisabeth Stark. 2018. Encoder-decoder methods for text normalization. In Proc. Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial\u201918). 18\u201328."},{"key":"e_1_2_1_20_1","volume-title":"3rd Workshop on Very Large Corpora.","author":"Melamed I. Dan","year":"1995","unstructured":"I. Dan Melamed. 1995. Automatic evaluation and uniform filter cascades for inducing n-best translation lexicons. 
In 3rd Workshop on Very Large Corpora."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.5555\/2999792.2999959"},{"key":"e_1_2_1_22_1","volume-title":"Proc. Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT\u201913)","author":"Mikolov T.","unstructured":"T. Mikolov, W. T. Yih, and G. Zweig. 2013. Linguistic regularities in continuous space word representations. In Proc. Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT\u201913). 746\u2013751."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/S16-1003"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.fcij.2017.12.002"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1073\/pnas.0601602103"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.5555\/349124.349132"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/3132847.3133103"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/2806416.2806485"},{"key":"e_1_2_1_29_1","volume-title":"Proc. IEEE International Conference on Data Mining Workshops (ICDMW\u201917)","author":"Satapathy R.","unstructured":"R. Satapathy, C. Guerreiro, I. Chaturvedi, and E. Cambria. 2017. Phonetic-based microtext normalization for twitter sentiment analysis. In Proc. 
IEEE International Conference on Data Mining Workshops (ICDMW\u201917). 407\u2013413."},{"key":"e_1_2_1_30_1","volume-title":"Proc. 1st Workshop on Vector Space Modeling for Natural Language Processing. 8\u201316","author":"Sridhar Rangarajan","year":"2015","unstructured":"Rangarajan Sridhar and Vivek Kumar. 2015. Unsupervised text normalization using distributed representations of words and phrases. In Proc. 1st Workshop on Vector Space Modeling for Natural Language Processing. 8\u201316."},{"key":"e_1_2_1_31_1","volume-title":"Indri: A language-model based search engine for complex queries. Information Retrieval - IR 2 (Jan.","author":"Strohman Trevor","year":"2005","unstructured":"Trevor Strohman, Donald Metzler, Howard Turtle, and W. Croft. 2005. Indri: A language-model based search engine for complex queries. Information Retrieval - IR 2 (Jan. 2005), 2--6."},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/1568296.1568315"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/1935826.1935842"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.3115\/1073083.1073109"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2005.248"},{"key":"e_1_2_1_36_1","volume-title":"DeepNorm-a deep learning approach to text normalization. CoRR abs\/1712.06994","author":"Zare Maryam","year":"2017","unstructured":"Maryam Zare and Shaurya Rohatgi. 
2017. DeepNorm-a deep learning approach to text normalization. CoRR abs\/1712.06994 (2017). http:\/\/arxiv.org\/abs\/1712.06994."}],"container-title":["Journal of Data and Information Quality"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3418036","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3418036","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T21:31:35Z","timestamp":1750195895000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3418036"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,4,27]]},"references-count":36,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2021,9,30]]}},"alternative-id":["10.1145\/3418036"],"URL":"https:\/\/doi.org\/10.1145\/3418036","relation":{},"ISSN":["1936-1955","1936-1963"],"issn-type":[{"value":"1936-1955","type":"print"},{"value":"1936-1963","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,4,27]]},"assertion":[{"value":"2020-02-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-07-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-04-27","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}