{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,21]],"date-time":"2025-12-21T06:23:48Z","timestamp":1766298228453,"version":"build-2065373602"},"reference-count":46,"publisher":"MDPI AG","issue":"1","license":[{"start":{"date-parts":[[2021,3,3]],"date-time":"2021-03-03T00:00:00Z","timestamp":1614729600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["MAKE"],"abstract":"<jats:p>Idioms are multi-word expressions whose meaning cannot always be deduced from the literal meaning of constituent words. A key feature of idioms that is central to this paper is their peculiar mixture of fixedness and variability, which poses challenges for their retrieval from large corpora using traditional search approaches. These challenges hinder insights into idiom usage, affecting users who are conducting linguistic research as well as those involved in language education. To facilitate access to idiom examples taken from real-world contexts, we introduce an information retrieval system designed specifically for idioms. Given a search query that represents an idiom, typically in its canonical form, the system expands it automatically to account for the most common types of idiom variation including inflection, open slots, adjectival or adverbial modification and passivisation. As a by-product of query expansion, other types of idiom variation captured include derivation, compounding, negation, distribution across multiple clauses as well as other unforeseen types of variation. The system was implemented on top of Elasticsearch, an open-source, distributed, scalable, real-time search engine. Flexible retrieval of idioms is supported by a combination of linguistic pre-processing of the search queries, their translation into a set of query clauses written in a query language called Query DSL, and analysis, an indexing process that involves tokenisation and normalisation. Our system outperformed the phrase search in terms of recall and outperformed the keyword search in terms of precision. Out of the three, our approach was found to provide the best balance between precision and recall. By providing a fast and easy way of finding idioms in large corpora, our approach can facilitate further developments in fields such as linguistics, language education and natural language processing.<\/jats:p>","DOI":"10.3390\/make3010013","type":"journal-article","created":{"date-parts":[[2021,3,3]],"date-time":"2021-03-03T20:33:57Z","timestamp":1614803637000},"page":"263-283","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["Leaving No Stone Unturned: Flexible Retrieval of Idiomatic Expressions from a Large Text Corpus"],"prefix":"10.3390","volume":"3","author":[{"given":"Callum","family":"Hughes","sequence":"first","affiliation":[{"name":"School of Computer Science and Informatics, Cardiff University, Cardiff CF24 3AA, UK"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3144-9252","authenticated-orcid":false,"given":"Maxim","family":"Filimonov","sequence":"additional","affiliation":[{"name":"School of Computer Science and Informatics, Cardiff University, Cardiff CF24 3AA, UK"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Alison","family":"Wray","sequence":"additional","affiliation":[{"name":"School of English, Communication and Philosophy, Cardiff University, Cardiff CF10 3EU, UK"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8132-3885","authenticated-orcid":false,"given":"Irena","family":"Spasi\u0107","sequence":"additional","affiliation":[{"name":"School of Computer Science and Informatics, Cardiff University, Cardiff CF24 3AA, UK"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2021,3,3]]},"reference":[{"key":"ref_1","first-page":"103","article-title":"(How) is formulaic language universal? Insights from Korean, German and English","volume":"Volume 2","author":"Piirainen","year":"2020","journal-title":"Formulaic Language and New Data: Theoretical and Methodological Implications, Formulaic Language"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"543","DOI":"10.1016\/j.lingua.2005.05.005","article-title":"The consequences of talking to strangers: Evolutionary corollaries of socio-cultural influences on linguistic form","volume":"117","author":"Wray","year":"2007","journal-title":"Lingua"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Wray, A. (2002). Formulaic Language and the Lexicon, Cambridge University Press.","DOI":"10.1017\/CBO9780511519772"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Hanks, P. (2013). Lexical Analysis: Norms and Exploitations, MIT Press.","DOI":"10.7551\/mitpress\/9780262018579.001.0001"},{"key":"ref_5","unstructured":"BBC Radio 4 (2021, January 01). Spoilers for May 25th\u201328th 2020. Available online: https:\/\/www.facebook.com\/notes\/archers-appreciation\/spoilers-for-may-25th28th-2020\/851327765348107\/."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Moon, R. (1998). Fixed Expressions and Idioms in English: A Corpus-Based Approach, OUP Oxford.","DOI":"10.1093\/oso\/9780198236146.001.0001"},{"key":"ref_7","unstructured":"Haagsma, H., Nissim, M., and Bos, J. (2018, January 25\u201326). The other side of the coin: Unsupervised disambiguation of potentially idiomatic expressions by contrasting senses. Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions, Santa Fe, NM, USA."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Cook, P., Fazly, A., and Stevenson, S. (2007, January 28). Pulling their weight: Exploiting syntactic forms for the automatic identification of idiomatic expressions in context. Proceedings of the Workshop on A Broader Perspective on Multiword Expressions, Prague, Czech Republic.","DOI":"10.3115\/1613704.1613710"},{"key":"ref_9","first-page":"17","article-title":"Computing linear discriminants for idiomatic sentence detection","volume":"46","author":"Peng","year":"2010","journal-title":"Res. Comput. Sci."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Feldman, A., and Peng, J. (2013, January 24\u201330). Automatic detection of idiomatic clauses. Proceedings of the 14th International Conference on Computational Linguistics and Intelligent Text Processing, Samos, Greece.","DOI":"10.1007\/978-3-642-37247-6_35"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Liu, C., and Hwa, R. (2018\u20134, January 31). Heuristically informed unsupervised idiom usage recognition. Proceedings of the Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium.","DOI":"10.18653\/v1\/D18-1199"},{"key":"ref_12","unstructured":"Sporleder, C., and Li, L. (April, January 30). Unsupervised recognition of literal and non-literal use of idiomatic expressions. Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, Athens, Greece."},{"key":"ref_13","unstructured":"Halliday, M.A.K., and Hasan, R. (1976). Cohesion in English, Longman."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"39","DOI":"10.1145\/219717.219748","article-title":"WordNet: A lexical database for English","volume":"38","author":"Miller","year":"1995","journal-title":"Commun. ACM"},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"370","DOI":"10.1109\/TKDE.2007.48","article-title":"The Google similarity distance","volume":"19","author":"Cilibrasi","year":"2007","journal-title":"IEEE Trans. Knowl. Data Eng."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Liu, P., Qian, K., Qiu, X., and Huang, X. (2017, January 7\u201311). Idiom-aware compositional distributed semantics. Proceedings of the Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark.","DOI":"10.18653\/v1\/D17-1124"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Tai, K.S., Socher, R., and Manning, C.D. (2015, January 26\u201331). Improved semantic representations from tree-structured long short-term memory networks. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Beijing, China.","DOI":"10.3115\/v1\/P15-1150"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Salton, G.D., Ross, R.J., and Kelleher, J.D. (2016, January 7\u201312). Idiom token classification using sentential distributed semantics. Proceedings of the 54th Annual Meeting on Association for Computational Linguistics, Berlin, Germany.","DOI":"10.18653\/v1\/P16-1019"},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"7375","DOI":"10.1016\/j.eswa.2015.05.039","article-title":"The role of idioms in sentiment analysis","volume":"42","author":"Williams","year":"2015","journal-title":"Expert Syst. Appl."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"189","DOI":"10.1109\/TAFFC.2017.2777842","article-title":"Idiom-based features in sentiment analysis: Cutting the Gordian knot","volume":"11","author":"Williams","year":"2020","journal-title":"IEEE Trans. Affect. Comput."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Flor, M., and Klebanov, B.B. (2018, January 6). Catching idiomatic expressions in EFL essays. Proceedings of the Workshop on Figurative Language Processing, New Orleans, LA, USA.","DOI":"10.18653\/v1\/W18-0905"},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"813","DOI":"10.1002\/spe.4380171105","article-title":"The text editor sam","volume":"17","author":"Pike","year":"1987","journal-title":"Softw. Pract. Exp."},{"key":"ref_23","unstructured":"Laurikari, V. (2000, January 27\u201329). NFAs with tagged transitions, their conversion to deterministic automata and application to regular expressions. Proceedings of the Seventh International Symposium on String Processing and Information Retrieval, La Curuna, Spain."},{"key":"ref_24","unstructured":"Cox, R. (2021, January 01). Regular Expression Matching can be Simple and Fast (but is Slow in Java, Perl, PHP, Python, Ruby,...). Available online: https:\/\/swtch.com\/~rsc\/regexp\/regexp1.html."},{"key":"ref_25","unstructured":"Gormley, C., and Tony, Z. (2015). Elasticsearch: The Definitive Guide, O\u2019Reilly Media, Inc."},{"key":"ref_26","unstructured":"Bia\u0142ecki, A., Muir, R., and Ingersoll, G. (2012, January 16). Apache Lucene 4. Proceedings of the SIGIR Workshop on Open Source Information Retrieval, Portland, OR, USA."},{"key":"ref_27","unstructured":"Burnard, L. (2021, January 01). Reference Guide for the British National Corpus (XML Edition). Available online: http:\/\/www.natcorp.ox.ac.uk\/docs\/URG\/."},{"key":"ref_28","unstructured":"BNC Consortium The British National Corpus, Version 3 (BNC XML Edition). Available online: http:\/\/www.natcorp.ox.ac.uk\/."},{"key":"ref_29","unstructured":"Tumblr I Don\u2019t Think Uma will Ever Fully Forgive Mal. Available online: https:\/\/tmblr.co\/ZVqBcbYOic9sWi00."},{"key":"ref_30","unstructured":"Bird, S., Klein, E., and Loper, E. (2009). Natural Language Processing with Python\u2014Analyzing Text with the Natural Language Toolkit, O\u2019Reilly Media."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Vega-Moreno, R.E. (2007). Creativity and Convention: The Pragmatics of Everyday Figurative Speech, John Benjamins Publishing Company.","DOI":"10.1075\/pbns.156"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Langlotz, A. (2006). Idiomatic Creativity: A Cognitive-Linguistic Model of Idiom-Representation and Idiom-Variation in English, John Benjamins.","DOI":"10.1075\/hcp.17"},{"key":"ref_33","unstructured":"Dutton, K. (2009). Exploring the Boundaries of Formulaic Sequences: A Corpus-Based Study of Lexical Substitution and Insertion in Contemporary British English, VDM Verlag."},{"key":"ref_34","unstructured":"Riehemann, S.Z. (2001). A Constructional Approach to Idioms and Word Formation, Stanford University."},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"130","DOI":"10.1108\/eb046814","article-title":"An algorithm for suffix stripping","volume":"14","author":"Porter","year":"1980","journal-title":"Program"},{"key":"ref_36","unstructured":"Beke, K. (2021, January 01). Learn English Today. Available online: https:\/\/www.learn-english-today.com\/."},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"333","DOI":"10.1561\/1500000019","article-title":"The probabilistic relevance framework: BM25 and beyond","volume":"3","author":"Robertson","year":"2009","journal-title":"Found. Trends Inf. Retr."},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"184","DOI":"10.1108\/eb051463","article-title":"Estimating the recall performance of Web search engines","volume":"49","author":"Clarke","year":"1997","journal-title":"Aslib Proc."},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"37","DOI":"10.1177\/001316446002000104","article-title":"A coefficient of agreement for nominal scales","volume":"20","author":"Cohen","year":"1960","journal-title":"Educ. Psychol. Meas."},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"323","DOI":"10.1037\/h0028106","article-title":"Large sample standard errors of kappa and weighted kappa","volume":"72","author":"Fleiss","year":"1969","journal-title":"Psychol. Bull."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Altman, D.G. (1990). Practical Statistics for Medical Research, Chapman and Hall\/CRC.","DOI":"10.1201\/9780429258589"},{"key":"ref_42","unstructured":"Richards, J.C., and Rogers, T.S. (2006). Approaches and Methods in Language Teaching, Cambridge University Press."},{"key":"ref_43","unstructured":"Li, Y., Yosinski, J., Clune, J., Lipson, H., and Hopcroft, J. (2015, January 11\u201312). Convergent learning: Do different neural networks learn the same representations?. Proceedings of the 1st NIPS International Workshop on Feature Extraction: Modern Questions and Challenges, Montr\u00e9al, QC, Canada."},{"key":"ref_44","doi-asserted-by":"crossref","first-page":"883","DOI":"10.1002\/asi.20177","article-title":"The clustering power of low frequency words in academic Webs","volume":"56","author":"Price","year":"2005","journal-title":"J. Assoc. Inf. Sci. Technol."},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Sch\u00f6nhofen, P., and Bencz\u00far, A.A. (2006, January 18\u201322). Exploiting extremely rare features in text categorization. Proceedings of the 17th European Conference on Machine Learning, Berlin, Germany.","DOI":"10.1007\/11871842_77"},{"key":"ref_46","unstructured":"Fadaee, M., Bisazza, A., and Monz, C. (2018, January 7\u201312). Examining the tip of the iceberg: A data set for idiom translation. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan."}],"container-title":["Machine Learning and Knowledge Extraction"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2504-4990\/3\/1\/13\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T05:31:56Z","timestamp":1760160716000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2504-4990\/3\/1\/13"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,3,3]]},"references-count":46,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2021,3]]}},"alternative-id":["make3010013"],"URL":"https:\/\/doi.org\/10.3390\/make3010013","relation":{},"ISSN":["2504-4990"],"issn-type":[{"type":"electronic","value":"2504-4990"}],"subject":[],"published":{"date-parts":[[2021,3,3]]}}}