{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2023,6,13]],"date-time":"2023-06-13T12:40:14Z","timestamp":1686660014930},"reference-count":45,"publisher":"Cambridge University Press (CUP)","issue":"1","license":[{"start":{"date-parts":[[2012,1,10]],"date-time":"2012-01-10T00:00:00Z","timestamp":1326153600000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/www.cambridge.org\/core\/terms"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Nat. Lang. Eng."],"published-print":{"date-parts":[[2013,1]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>This paper presents <jats:italic>TrendStream<\/jats:italic>, a versatile architecture for very large word n-gram datasets. Designed for speed, flexibility, and portability, TrendStream uses a novel trie-based architecture, features lossless compression, and provides optimization for both speed and memory use. In addition to literal queries, it also supports fast pattern matching searches (with wildcards or regular expressions), on the same data structure, without any additional indexing. Language models are updateable directly in the compiled binary format, allowing rapid encoding of existing tabulated collections, incremental generation of n-gram models from streaming text, and merging of encoded compiled files. This architecture offers flexible choices for loading and memory utilization: fast memory-mapping of a multi-gigabyte model, or on-demand partial data loading with very modest memory requirements. The implemented system runs successfully on several different platforms, under different operating systems, even when the n-gram model file is much larger than available memory. 
Experimental evaluation results are presented with the Google Web1T collection and the Gigaword corpus.<\/jats:p>","DOI":"10.1017\/s1351324911000349","type":"journal-article","created":{"date-parts":[[2012,1,10]],"date-time":"2012-01-10T11:30:11Z","timestamp":1326195011000},"page":"61-93","source":"Crossref","is-referenced-by-count":1,"title":["A fast and flexible architecture for very large word n-gram datasets"],"prefix":"10.1017","volume":"19","author":[{"given":"MICHAEL","family":"FLOR","sequence":"first","affiliation":[]}],"member":"56","published-online":{"date-parts":[[2012,1,10]]},"reference":[{"key":"S1351324911000349_ref8","first-page":"103","volume-title":"Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), System Demonstrations","author":"Ceylan","year":"2011"},{"key":"S1351324911000349_ref3","first-page":"1507","volume-title":"Proceedings of 2009 International Joint Conference on Artificial Intelligence (IJCAI 2009)","author":"Bergsma","year":"2009"},{"key":"S1351324911000349_ref12","first-page":"32","volume-title":"Proceedings of the NAACL HLT 2010 Sixth Web as Corpus Workshop (WAC-6)","author":"Evert","year":"2010"},{"key":"S1351324911000349_ref7","first-page":"166","volume-title":"Proceedings of the Sixth International Conference on Machine Learning and Applications","author":"Carlson","year":"2007"},{"key":"S1351324911000349_ref10","first-page":"199","volume-title":"Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2007)","author":"Church","year":"2007"},{"key":"S1351324911000349_ref28","first-page":"756","volume-title":"Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP-2009)","author":"Levenberg","year":"2009"},{"key":"S1351324911000349_ref34","first-page":"2682","volume-title":"Proceedings of Language Resource and Evaluation 
Conference (LREC-2010)","author":"Sekine","year":"2010"},{"key":"S1351324911000349_ref45","first-page":"492","volume-title":"Proceedings of The Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT-2007)","author":"Zens","year":"2007"},{"key":"S1351324911000349_ref37","first-page":"505","volume-title":"Proceedings of 46th Annual Meeting of the Association for Computational Linguistics and Human Language Technology Conference (ACL-08: HLT)","author":"Talbot","year":"2008"},{"key":"S1351324911000349_ref27","doi-asserted-by":"publisher","DOI":"10.1145\/1075389.1075392"},{"key":"S1351324911000349_ref13","first-page":"88","volume-title":"Proceedings of the ACL 2007 Workshop on Statistical Machine Translation","author":"Federico","year":"2007"},{"key":"S1351324911000349_ref15","doi-asserted-by":"publisher","DOI":"10.1145\/1871840.1871846"},{"key":"S1351324911000349_ref22","first-page":"40","volume-title":"Proceedings of the Australasian Language Technology Workshop 2007","author":"Hawker","year":"2007"},{"key":"S1351324911000349_ref25","first-page":"1689","volume-title":"Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM 2009)","author":"Islam","year":"2009"},{"key":"S1351324911000349_ref43","doi-asserted-by":"crossref","first-page":"33","DOI":"10.21437\/Eurospeech.2001-8","volume-title":"Proceedings of 7th European Conference on Speech Communication and Technology (EUROSPEECH'01)","author":"Whittaker","year":"2001"},{"key":"S1351324911000349_ref42","first-page":"341","volume-title":"Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL-IJCNLP 2009)","author":"Watanabe","year":"2009"},{"key":"S1351324911000349_ref18","volume-title":"English 
Gigaword","author":"Graff","year":"2003"},{"key":"S1351324911000349_ref14","doi-asserted-by":"publisher","DOI":"10.1145\/367390.367400"},{"key":"S1351324911000349_ref6","first-page":"858","volume-title":"Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing (EMNLP-CoNLL 2007)","author":"Brants","year":"2007"},{"key":"S1351324911000349_ref1","first-page":"360","volume-title":"Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms","author":"Bentley","year":"1997"},{"key":"S1351324911000349_ref5","volume-title":"Web 1T 5-gram Version 1","author":"Brants","year":"2006"},{"key":"S1351324911000349_ref9","first-page":"30","volume-title":"Proceedings of the 15th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2004)","author":"Chazelle","year":"2004"},{"key":"S1351324911000349_ref4","doi-asserted-by":"publisher","DOI":"10.1145\/362686.362692"},{"key":"S1351324911000349_ref33","first-page":"181","volume-title":"Proceedings of the 22nd International Conference on Computational Linguistics (COLING-08)","author":"Sekine","year":"2008"},{"key":"S1351324911000349_ref11","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2007.367157"},{"key":"S1351324911000349_ref16","first-page":"31","volume-title":"Proceedings of the NAACL HLT Workshop on Software Engineering, Testing, and Quality Assurance for Natural Language Processing (SETQA-NLP 2009)","author":"Germann","year":"2009"},{"key":"S1351324911000349_ref17","unstructured":"Giuliano C. 2007. jWeb1T: a library for searching the Web 1T 5-gram corpus. Software available at http:\/\/hlt.fbk.eu\/en\/technology\/jWeb1t. 
Accessed 8 April 2011."},{"key":"S1351324911000349_ref31","first-page":"388","volume-title":"Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'03)","author":"Raj","year":"2003"},{"key":"S1351324911000349_ref19","first-page":"21","volume-title":"Proceedings of SIGIR 2010 Web N-gram Workshop","author":"Guthrie","year":"2010"},{"key":"S1351324911000349_ref20","first-page":"262","volume-title":"Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP-2010)","author":"Guthrie","year":"2010"},{"key":"S1351324911000349_ref21","first-page":"325","volume-title":"Proceedings of 10th Annual Conference of the International Speech Communication Association (Interspeech 2009)","author":"Harb","year":"2009"},{"key":"S1351324911000349_ref23","first-page":"187","volume-title":"Proceedings of the 6th Workshop on Statistical Machine Translation","author":"Heafield","year":"2011"},{"key":"S1351324911000349_ref26","first-page":"1","volume-title":"Proceedings of the IEEE International Conference on Natural Language Processing and Knowledge Engineering (IEEE NLP-KE'09)","author":"Islam","year":"2009"},{"key":"S1351324911000349_ref24","first-page":"1","volume-title":"Proceedings of International Conference on Natural Language Processing and Knowledge Engineering (NLP-KE 2009)","author":"Islam","year":"2009"},{"key":"S1351324911000349_ref29","volume-title":"Foundations of Statistical Natural Language Processing","author":"Manning","year":"1999"},{"key":"S1351324911000349_ref30","first-page":"258","volume-title":"Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011)","author":"Pauls","year":"2011"},{"key":"S1351324911000349_ref32","unstructured":"Ravishankar M. 1996. Efficient Algorithms for Speech Recognition. PhD thesis, Technical Report. CMU-CS-96-143, Carnegie Mellon University, Pittsburgh, PA, USA."},{"key":"S1351324911000349_ref2","unstructured":"Bentley J. 
L. and Sedgewick R. 1998. Ternary search trees. Dr Dobb's Journal, April 01, http:\/\/drdobbs.com\/windows\/184410528. Accessed 8 April 2011."},{"key":"S1351324911000349_ref40","first-page":"1574","volume-title":"Proceedings of International Joint Conference on Artificial Intelligence (IJCAI 2009)","author":"Van Durme","year":"2009"},{"key":"S1351324911000349_ref35","first-page":"901","volume-title":"Proceedings of 7th International Conference on Spoken Language Processing","author":"Stolcke","year":"2002"},{"key":"S1351324911000349_ref36","first-page":"1243","volume-title":"Proceedings of International Joint Conference on Artificial Intelligence (IJCAI 2009)","author":"Talbot","year":"2009"},{"key":"S1351324911000349_ref38","first-page":"512","volume-title":"Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (ACL 2007)","author":"Talbot","year":"2007"},{"key":"S1351324911000349_ref39","first-page":"468","volume-title":"Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2007)","author":"Talbot","year":"2007"},{"key":"S1351324911000349_ref44","first-page":"141","volume-title":"Proceedings of 46th Annual Meeting of the Association for Computational Linguistics and Human Language Technology Conference (ACL-08: HLT)","author":"Yuret","year":"2008"},{"key":"S1351324911000349_ref41","first-page":"4733","volume-title":"Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing","author":"Wang","year":"2009"}],"container-title":["Natural Language 
Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.cambridge.org\/core\/services\/aop-cambridge-core\/content\/view\/S1351324911000349","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,6,13]],"date-time":"2023-06-13T12:01:14Z","timestamp":1686657674000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.cambridge.org\/core\/product\/identifier\/S1351324911000349\/type\/journal_article"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2012,1,10]]},"references-count":45,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2013,1]]}},"alternative-id":["S1351324911000349"],"URL":"https:\/\/doi.org\/10.1017\/s1351324911000349","relation":{},"ISSN":["1351-3249","1469-8110"],"issn-type":[{"value":"1351-3249","type":"print"},{"value":"1469-8110","type":"electronic"}],"subject":[],"published":{"date-parts":[[2012,1,10]]}}}