{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,29]],"date-time":"2026-04-29T07:12:19Z","timestamp":1777446739757,"version":"3.51.4"},"reference-count":25,"publisher":"Springer Science and Business Media LLC","issue":"2","license":[{"start":{"date-parts":[[2013,7,24]],"date-time":"2013-07-24T00:00:00Z","timestamp":1374624000000},"content-version":"tdm","delay-in-days":0,"URL":"http:\/\/www.springer.com\/tdm"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Lang Resources &amp; Evaluation"],"published-print":{"date-parts":[[2014,6]]},"DOI":"10.1007\/s10579-013-9246-z","type":"journal-article","created":{"date-parts":[[2013,7,23]],"date-time":"2013-07-23T02:08:56Z","timestamp":1374545336000},"page":"227-248","source":"Crossref","is-referenced-by-count":13,"title":["General framework for mining, processing and storing large amounts of electronic texts for language modeling purposes"],"prefix":"10.1007","volume":"48","author":[{"given":"Jan","family":"\u0160vec","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jan","family":"Lehe\u010dka","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Pavel","family":"Ircing","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Lucie","family":"Skorkovsk\u00e1","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ale\u0161","family":"Pra\u017e\u00e1k","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jan","family":"Vavru\u0161ka","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Petr","family":"Stanislav","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jan","family":"Hoidekr","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2013,7,24]]},"reference":[{"key":"9246_CR1","unstructured":"Baroni, M. & Bernardini, S. (2004). Bootcat: Bootstrapping corpora and terms from the web. In In Proceedings of LREC 2004, pp. 1313\u20131316."},{"issue":"8-13","key":"9246_CR2","doi-asserted-by":"crossref","first-page":"1157","DOI":"10.1016\/S0169-7552(97)00031-7","volume":"29","author":"A. Z. Broder","year":"1997","unstructured":"Broder, A. Z., Glassman, S. C., Manasse, M. S., & Zweig, G. (1997). Syntactic clustering of the web. Computer Networks and ISDN Systems, 29(8\u201313), 1157\u20131166.","journal-title":"Computer Networks and ISDN Systems"},{"key":"9246_CR3","unstructured":"Bulyko, I., Ostendorf, M., Siu, M., Ng, T., Stolcke, A., & \u00c7etin, O. (2007). Web resources for language modeling in conversational speech recognition. ACM Transactions on Speech and Language Processing (TSLP), 5(1), 1:1\u20131:25."},{"key":"9246_CR4","unstructured":"Fairon, C. (2006). Corporator: a tool for creating rss-based specialized corpora. In Proceedings of the 2nd international workshop on web as corpus, WAC \u201906 (pp. 43\u201349). Stroudsburg, PA, USA: Association for Computational Linguistics."},{"key":"9246_CR5","first-page":"93","volume-title":"TSD 2010. LNCS","author":"J. Kanis","year":"2010","unstructured":"Kanis, J., & Skorkovsk\u00e1, L. (2010). Comparison of different lemmatization approaches through the means of information retrieval performance. In: P. Sojka, A. Hor\u00e1k, I. Kope\u010dek, & K. Pala (Eds.), TSD 2010. LNCS (Vol. 6231, pp. 93\u2013100). Heidelberg: Springer."},{"issue":"1","key":"9246_CR6","doi-asserted-by":"crossref","first-page":"97","DOI":"10.1075\/ijcl.6.1.05kil","volume":"6","author":"A. Kilgarriff","year":"2001","unstructured":"Kilgarriff, A. (2001). Comparing corpora. International Journal of Corpus Linguistics, 6(1), 97\u2013133.","journal-title":"International Journal of Corpus Linguistics"},{"key":"9246_CR7","unstructured":"Kilgarriff, A., Reddy, S., Pomik\u00e1lek, J., & PVS, A. (2010). A corpus factory for many languages. In N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, M. Rosner, & D. Tapias (Eds.), Proceedings of the seventh international conference on language resources and evaluation (LREC\u201910) (pp. 904\u2013910). Valletta, Malta: European Language Resources Association (ELRA)."},{"issue":"2","key":"9246_CR8","doi-asserted-by":"crossref","first-page":"245","DOI":"10.1093\/llc\/17.2.245","volume":"17","author":"K. Ku\u010dera","year":"2002","unstructured":"Ku\u010dera, K. (2002). The Czech National Corpus: Principles, design, and results. Literary and Linguistic Computing, 17(2), 245\u2013257.","journal-title":"Literary and Linguistic Computing"},{"key":"9246_CR9","unstructured":"Li, P., Zhu, Q., Qian, P., & Fox, G. (2007). Constructing a large scale text corpus based on the grid and trustworthiness. In: V. Matousek & P. Mautner (Eds.), TSD. Lecture Notes in Computer Science (Vol. 4629, pp. 56\u201365). New York: Springer."},{"key":"9246_CR10","unstructured":"Malkin, M. & Venkatesan, R. (2005). Comparison of texts streams in the presence of mild adversaries. In Proceedings of the 2005 Australasian workshop on grid computing and e-research (Vol. 44, pp. 179\u2013186). ACSW Frontiers \u201905. Australian Computer Society, Inc.,."},{"key":"9246_CR11","doi-asserted-by":"crossref","DOI":"10.1017\/CBO9780511809071","volume-title":"Introduction to information retrieval","author":"C. D. Manning","year":"2008","unstructured":"Manning, C. D., Raghavan, P., & Sch\u00fctze, H. (2008). Introduction to information retrieval. New York: Cambridge University Press."},{"key":"9246_CR12","unstructured":"Pomik\u00e1lek, J. (2011). Removing boilerplate and duplicate content from web corpora. Ph.D. thesis, Faculty of Informatics, Masaryk University, Brno, Czech Republic."},{"key":"9246_CR13","unstructured":"Pra\u017e\u00e1k, A., Loose, Z., Psutka, J., Radov\u00e1, V., & M\u00fcller, L. (2011). Four-phase re-speaker training system. In Proceedings of SIGMAP 2011. Seville."},{"key":"9246_CR14","doi-asserted-by":"crossref","unstructured":"Psutka, J., Ircing, P., Psutka, J.V., Radov\u00e1, V., Byrne, W., Haji\u010d, J., M\u00edrovsk\u00fd, J., & Gustman, S. (2003). Large vocabulary ASR for spontaneous Czech in the MALACH project. In Proceedings of Eurospeech 2003 (pp. 1821\u20131824). Geneva.","DOI":"10.21437\/Eurospeech.2003-551"},{"key":"9246_CR15","unstructured":"Psutka, J., Radov\u00e1, V., M\u00fcller, L., Matou\u0161ek, J., Ircing, P., & Graff, D. (2001). Large broadcast news and read speech corpora of spoken Czech. In Proceedings of Eurospeech 2001 (pp. 2067\u20132070). Denmark: Aalborg."},{"key":"9246_CR16","doi-asserted-by":"crossref","unstructured":"Psutka, J., \u0160vec, J., Psutka, J.V., Van\u011bk, J., Pra\u017e\u00e1k, A., \u0160m\u00eddl, L., & Ircing, P. (2011). System for fast lexical and phonetic spoken term detection in a Czech cultural heritage archive. EURASIP Journal on Audio, Speech, and Music Processing, 10.","DOI":"10.1186\/1687-4722-2011-10"},{"key":"9246_CR17","unstructured":"Sharoff, S. (2006). Creating general-purpose corpora using automated search engine queries. In WaCky! Working papers on the Web as Corpus (pp. 63\u201398). Gedit."},{"key":"9246_CR18","unstructured":"Spoustov\u00e1, D., Spousta, M., & Pecina, P. (2010). Building a Web Corpus of Czech. In Proceedings of the seventh conference on international language resources and evaluation (LREC\u201910). Valletta, Malta."},{"key":"9246_CR19","doi-asserted-by":"crossref","unstructured":"Stolcke, A. (2002). SRILM\u2014an extensible language modeling toolkit. In Proceedings of ICSLP 2002 (pp. 901\u2013904). Denver.","DOI":"10.21437\/ICSLP.2002-303"},{"key":"9246_CR22","unstructured":"\u0160vec, J. (2010). The Voiar (Voice Archive) library. University of West Bohemia, Plze\u0148."},{"key":"9246_CR23","doi-asserted-by":"crossref","first-page":"356","DOI":"10.1007\/978-3-642-23538-2_45","volume-title":"Text, speech and dialogue. Lecture Notes in Computer Science","author":"J. \u0160vec","year":"2011","unstructured":"\u0160vec, J., Hoidekr, J., Soutner, D., & Vavru\u0161ka, J. (2011). Web text data mining for building large scale language modelling corpus. In: I. Habernal & V. Matou\u0161ek (Eds.), Text, speech and dialogue. Lecture Notes in Computer Science (Vol. 6836, pp. 356\u2013363). Berlin \/ Heidelberg: Springer."},{"key":"9246_CR20","doi-asserted-by":"crossref","first-page":"416","DOI":"10.1007\/978-3-642-15760-8_53","volume-title":"Text, speech and dialogue. Lecture Notes in Artificial Intelligence","author":"J. Trmal","year":"2010","unstructured":"Trmal, J., Pra\u017e\u00e1k, A., Loose, Z., & Psutka, J. (2010). Online TV Captioning of Czech Parliamentary Sessions. In: Sojka, P., Hor\u00e1k, A., Kope\u010dek, I., & Pala, K. (Eds.), Text, speech and dialogue. Lecture Notes in Artificial Intelligence (Vol. 6231, pp. 416\u2013422). Berlin: Springer."},{"key":"9246_CR21","first-page":"431","volume-title":"TSD 2010. LNCS","author":"J. Van\u011bk","year":"2010","unstructured":"Van\u011bk, J. & Psutka, J. (2010). Gender-dependent acoustic models fusion developed for automatic subtitling of parliament meetings broadcasted by the Czech TV. In: P. Sojka, A. Hor\u00e1k, I. Kope\u010dek, & K. Pala (Eds.), TSD 2010. LNCS (Vol. 6231, pp. 431\u2013438). Heidelberg: Springer."},{"key":"9246_CR24","first-page":"464","volume-title":"TSD 2010. LNCS","author":"Z. Zaj\u00edc","year":"2010","unstructured":"Zaj\u00edc, Z., Machlica, L., & M\u00fcller, L. (2010). Robust statistic estimates for adaptation in the task of speech recognition. In: P. Sojka, A. Hor\u00e1k, I. Kope\u010dek, & K. Pala (Eds.), TSD 2010. LNCS (Vol. 6231, pp. 464\u2013471). Heidelberg: Springer."},{"key":"9246_CR25","doi-asserted-by":"crossref","first-page":"326","DOI":"10.1007\/11551874_42","volume-title":"Text, speech and dialogue. Lecture Notes in Computer Science","author":"J. Zelinka","year":"2005","unstructured":"Zelinka, J., Kanis, J., & M\u00fcller, L. (2005). Automatic transcription of numerals in inflectional languages. In: V. Matou\u0161ek, P. Mautner, & T. Pavelka (Eds.), Text, speech and dialogue. Lecture Notes in Computer Science (Vol. 3658, pp. 326\u2013333). Berlin\/Heidelberg: Springer."}],"container-title":["Language Resources and Evaluation"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1007\/s10579-013-9246-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/article\/10.1007\/s10579-013-9246-z\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1007\/s10579-013-9246-z","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,7,3]],"date-time":"2023-07-03T11:09:12Z","timestamp":1688382552000},"score":1,"resource":{"primary":{"URL":"http:\/\/link.springer.com\/10.1007\/s10579-013-9246-z"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2013,7,24]]},"references-count":25,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2014,6]]}},"alternative-id":["9246"],"URL":"https:\/\/doi.org\/10.1007\/s10579-013-9246-z","relation":{},"ISSN":["1574-020X","1574-0218"],"issn-type":[{"value":"1574-020X","type":"print"},{"value":"1574-0218","type":"electronic"}],"subject":[],"published":{"date-parts":[[2013,7,24]]}}}