{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,25]],"date-time":"2025-10-25T14:12:08Z","timestamp":1761401528211,"version":"3.41.0"},"reference-count":40,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2010,9,1]],"date-time":"2010-09-01T00:00:00Z","timestamp":1283299200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001711","name":"Swiss National Science Foundation","doi-asserted-by":"publisher","award":["200021-113273"],"award-info":[{"award-number":["200021-113273"]}],"id":[{"id":"10.13039\/501100001711","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Transactions on Asian Language Information Processing"],"published-print":{"date-parts":[[2010,9]]},"abstract":"<jats:p>The main goal of this article is to describe and evaluate various indexing and search strategies for the Hindi, Bengali, and Marathi languages. These three languages are ranked among the world\u2019s 20 most spoken languages and they share similar syntax, morphology, and writing systems. 
In this article we examine these languages from an Information Retrieval (IR) perspective by describing the key elements of their inflectional and derivational morphologies, and suggest a light and a more aggressive stemming approach based on them.<\/jats:p>\n          <jats:p>\n            In our evaluation of these stemming strategies we make use of the FIRE 2008 test collections, and then to broaden our comparisons we implement and evaluate two language-independent indexing methods: the\n            <jats:italic>n<\/jats:italic>\n            -gram and trunc-\n            <jats:italic>n<\/jats:italic>\n            (truncation to the first\n            <jats:italic>n<\/jats:italic>\n            letters). We evaluate these solutions by applying our various IR models, including the Okapi, Divergence from Randomness (DFR) and statistical language models (LM) together with two classical vector-space approaches:\n            <jats:italic>tf idf<\/jats:italic>\n            and\n            <jats:italic>Lnu-ltc<\/jats:italic>\n            .\n          <\/jats:p>\n          <jats:p>\n            Experiments performed with all three languages demonstrate that the I(n\n            <jats:sub>e<\/jats:sub>\n            )C2 model derived from the Divergence from Randomness paradigm tends to provide the best mean average precision (MAP). Our own tests suggest that improved retrieval effectiveness would be obtained by applying more aggressive stemmers, especially those accounting for certain derivational suffixes, compared to those involving a light stemmer or ignoring this type of word normalization procedure. Comparisons between no-stemming and stemming indexing schemes show that performance differences are almost always statistically significant. When, for example, an aggressive stemmer is applied, the relative improvements obtained are ~28% for the Hindi language, ~42% for Marathi, and ~18% for Bengali, as compared to a no-stemming approach. 
Based on a comparison of word-based and language-independent approaches we find that the trunc-4 indexing scheme tends to result in performance levels statistically similar to those of an aggressive stemmer, yet better than the 4-gram indexing scheme. A query-by-query analysis reveals the reasons for this, and also demonstrates the advantage of applying a stemming or a trunc-4 indexing scheme.\n          <\/jats:p>","DOI":"10.1145\/1838745.1838748","type":"journal-article","created":{"date-parts":[[2010,9,22]],"date-time":"2010-09-22T11:55:58Z","timestamp":1285156558000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":23,"title":["Comparative Study of Indexing and Search Strategies for the Hindi, Marathi, and Bengali Languages"],"prefix":"10.1145","volume":"9","author":[{"given":"Ljiljana","family":"Dolamic","sequence":"first","affiliation":[{"name":"University of Neuchatel"}]},{"given":"Jacques","family":"Savoy","sequence":"additional","affiliation":[{"name":"University of Neuchatel"}]}],"member":"320","published-online":{"date-parts":[[2010,9]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1007\/11880592_28"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1011942104443"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/582415.582416"},{"volume-title":"Grammar of the Bengali Language, Literary, and Colloquial","author":"Beames J.","key":"e_1_2_1_4_1","unstructured":"}} Beames , J. 1891. Grammar of the Bengali Language, Literary, and Colloquial . Clarendon Press , Oxford, UK . }}Beames, J. 1891. Grammar of the Bengali Language, Literary, and Colloquial. 
Clarendon Press, Oxford, UK."},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1023\/B:INRT.0000009438.69013.fa"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1023\/B:INRT.0000011208.60754.a1"},{"volume-title":"Overview of the 3rd Text Retrieval Conference (TREC\u201996)","author":"Buckley C.","key":"e_1_2_1_7_1","unstructured":"}} Buckley , C. , Singhal , A. , Mitra , M. , and Salton , G . 1996. New retrieval approaches using SMART . In Overview of the 3rd Text Retrieval Conference (TREC\u201996) . D. K. Harman Eds., 25--48. }}Buckley, C., Singhal, A., Mitra, M., and Salton, G. 1996. New retrieval approaches using SMART. In Overview of the 3rd Text Retrieval Conference (TREC\u201996). D. K. Harman Eds., 25--48."},{"key":"e_1_2_1_8_1","unstructured":"}}Buckley C. and Voorhees E. M. 2005. Retrieval system evaluation. In E. M. Voorhees D. K. Harman Eds. TREC Experiment and evaluation in information retrieval. The MIT Press Cambridge MA 53--75.  }} Buckley C. and Voorhees E. M. 2005. Retrieval system evaluation. In E. M. Voorhees D. K. Harman Eds. TREC Experiment and evaluation in information retrieval . The MIT Press Cambridge MA 53--75."},{"key":"e_1_2_1_9_1","doi-asserted-by":"crossref","unstructured":"}}\n      Di Nunzio G. M. Ferro N. Melucci M. and \n      Orio N\n  . \n  2004\n  . Experiments to evaluate probabilistic models for automatic stemmer generation and query word translation. In Comparative Evaluation of Multilingual Information Access Systems Lecture Notes in Computer Science Springer Berlin 220--235.  }} Di Nunzio G. M. Ferro N. Melucci M. and Orio N. 2004. Experiments to evaluate probabilistic models for automatic stemmer generation and query word translation. 
In Comparative Evaluation of Multilingual Information Access Systems Lecture Notes in Computer Science Springer Berlin 220--235.","DOI":"10.1007\/978-3-540-30222-3_21"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.ipm.2009.06.001"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1002\/asi.v60:8"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/378881.378888"},{"key":"e_1_2_1_13_1","unstructured":"}}Gungaly D. and Mitra M. 2008. Using language modeling at FIRE 2008 Bengali monolingual track. In Working Notes of the Forum for Information Retrieval Evaluation (FIRE\u201908). http:\/\/www.isical.ac.in\/~fire\/paper\/lm_at_fire.pdf.  }} Gungaly D. and Mitra M. 2008. Using language modeling at FIRE 2008 Bengali monolingual track. In Working Notes of the Forum for Information Retrieval Evaluation (FIRE\u201908) . http:\/\/www.isical.ac.in\/~fire\/paper\/lm_at_fire.pdf."},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1002\/asi.4630260402"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1023\/B:INRT.0000009439.19151.4c"},{"volume-title":"Trench, Trubner &amp","author":"Kellogg S. H.","key":"e_1_2_1_17_1","unstructured":"}} Kellogg , S. H. 1938. A Grammar of the Hindi Language. Kegan Paul , Trench, Trubner &amp ; Co. Ltd ., London, UK. }}Kellogg, S. H. 1938. A Grammar of the Hindi Language. Kegan Paul, Trench, Trubner &amp; Co. Ltd., London, UK."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1007\/11816508_42"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/1031171.1031285"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/160688.160718"},{"key":"e_1_2_1_21_1","first-page":"22","article-title":"Development of a stemming algorithm","volume":"11","author":"Lovins J. B.","year":"1968","unstructured":"}} Lovins , J. B. 1968 . Development of a stemming algorithm . Mechan. Trans. Comput. Linguist. 11 , 1, 22 -- 31 . }}Lovins, J. B. 
1968. Development of a stemming algorithm. Mechan. Trans. Comput. Linguist. 11, 1, 22--31.","journal-title":"Mechan. Trans. Comput. Linguist."},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/1281485.1281489"},{"key":"e_1_2_1_23_1","doi-asserted-by":"crossref","unstructured":"}}Manning C. Raghavan P. and Sch\u00fctze H. 2008. Introduction to Information Retrieval. Cambridge University Press Cambridge UK.   }} Manning C. Raghavan P. and Sch\u00fctze H. 2008. Introduction to Information Retrieval . Cambridge University Press Cambridge UK.","DOI":"10.1017\/CBO9780511809071"},{"volume-title":"The Indo-Aryan Languages","author":"Masica C. P.","key":"e_1_2_1_24_1","unstructured":"}} Masica , C. P. 1991. The Indo-Aryan Languages . Cambridge University Press , Cambridge, UK . }}Masica, C. P. 1991. The Indo-Aryan Languages. Cambridge University Press, Cambridge, UK."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1023\/B:INRT.0000009441.78971.be"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/1571941.1571957"},{"key":"e_1_2_1_27_1","unstructured":"}}Navalkar G. R. 2001. The Student\u2019s Marathi Grammar. Asian Education Services New Dehli.  }} Navalkar G. R. 2001. The Student\u2019s Marathi Grammar . Asian Education Services New Dehli."},{"key":"e_1_2_1_28_1","volume-title":"Eds","author":"Peters C.","year":"2008","unstructured":"}} Peters , C. , Jijkoun , V. , Mandl , T. , M\u00fcller , H. , Oard , D.W. , Pe\u00f1as , A. and Santos , D . Eds . 2008 . Advances in multilingual and multimodal information retrieval. Lecture Notes in Comuter Science. Springer-Verlag , Berlin. }}Peters, C., Jijkoun, V., Mandl, T., M\u00fcller, H., Oard, D.W., Pe\u00f1as, A. and Santos, D. Eds. 2008. Advances in multilingual and multimodal information retrieval. Lecture Notes in Comuter Science. 
Springer-Verlag, Berlin."},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1108\/eb046814"},{"volume-title":"Proceedings Workshop of Computational Linguistics for the South Asian Languages (EACL\u201903)","author":"Ramanathan A.","key":"e_1_2_1_30_1","unstructured":"}} Ramanathan , A. and Rao , D . 2003. A lightweight stemmer for Hindi . In Proceedings Workshop of Computational Linguistics for the South Asian Languages (EACL\u201903) . 42--48. }}Ramanathan, A. and Rao, D. 2003. A lightweight stemmer for Hindi. In Proceedings Workshop of Computational Linguistics for the South Asian Languages (EACL\u201903). 42--48."},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0306-4573(99)00046-1"},{"volume-title":"Proceedings of the International Joint Conference on Natural Language Processing for Less Privileged Languages (IJCNLP\u201908)","author":"Sakar S.","key":"e_1_2_1_32_1","unstructured":"}} Sakar , S. and Bandyopadhyay , S . 2008. Design of a rule-based stemmer for natural language text in Bengal . In Proceedings of the International Joint Conference on Natural Language Processing for Less Privileged Languages (IJCNLP\u201908) . 65--72. }}Sakar, S. and Bandyopadhyay, S. 2008. Design of a rule-based stemmer for natural language text in Bengal. In Proceedings of the International Joint Conference on Natural Language Processing for Less Privileged Languages (IJCNLP\u201908). 65--72."},{"volume-title":"The SMART Retrieval System: Experiments in Automatic Document Processing","author":"Salton G.","key":"e_1_2_1_33_1","unstructured":"}} Salton , G. Ed. 1971. The SMART Retrieval System: Experiments in Automatic Document Processing . Prentice-Hall , Englewood Cliffs, N.J. }}Salton, G. Ed. 1971. The SMART Retrieval System: Experiments in Automatic Document Processing. 
Prentice-Hall, Englewood Cliffs, N.J."},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1002\/(SICI)1097-4571(199301)44:1<1::AID-ASI1>3.0.CO;2-1"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0306-4573(97)00027-7"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/1141277.1141523"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/505282.505283"},{"volume-title":"Morphology and Computation","author":"Sproat R.","key":"e_1_2_1_38_1","unstructured":"}} Sproat , R. 1992. Morphology and Computation . The MIT Press , Cambridge, MA . }}Sproat, R. 1992. Morphology and Computation. The MIT Press, Cambridge, MA."},{"key":"e_1_2_1_39_1","volume-title":"Lexical and algorithmic stemming compared for 9 European languages with Hummingbird SearchServerTM at CLEF 2003","author":"Tomlinson S.","year":"2004","unstructured":"}} Tomlinson , S. 2004. Lexical and algorithmic stemming compared for 9 European languages with Hummingbird SearchServerTM at CLEF 2003 ( 2004 ). In Comparative Evaluation of Multilingual Information Access Systems, Lecture Notes in Computer Science. Springer-Verlag , Berlin, 286--300. }}Tomlinson, S. 2004. Lexical and algorithmic stemming compared for 9 European languages with Hummingbird SearchServerTM at CLEF 2003 (2004). In Comparative Evaluation of Multilingual Information Access Systems, Lecture Notes in Computer Science. 
Springer-Verlag, Berlin, 286--300."},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/267954.267957"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/984321.984322"}],"container-title":["ACM Transactions on Asian Language Information Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/1838745.1838748","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/1838745.1838748","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T11:39:49Z","timestamp":1750246789000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/1838745.1838748"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2010,9]]},"references-count":40,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2010,9]]}},"alternative-id":["10.1145\/1838745.1838748"],"URL":"https:\/\/doi.org\/10.1145\/1838745.1838748","relation":{},"ISSN":["1530-0226","1558-3430"],"issn-type":[{"type":"print","value":"1530-0226"},{"type":"electronic","value":"1558-3430"}],"subject":[],"published":{"date-parts":[[2010,9]]},"assertion":[{"value":"2009-09-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2010-04-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2010-09-01","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}