{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,2,21]],"date-time":"2025-02-21T13:13:02Z","timestamp":1740143582735,"version":"3.37.3"},"reference-count":43,"publisher":"Springer Science and Business Media LLC","issue":"3","license":[{"start":{"date-parts":[[2021,1,25]],"date-time":"2021-01-25T00:00:00Z","timestamp":1611532800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2021,1,25]],"date-time":"2021-01-25T00:00:00Z","timestamp":1611532800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"Emil Aaltonen"},{"name":"Fulbright Finland"},{"name":"National Science Foundation","award":["1147581"],"award-info":[{"award-number":["1147581"]}]},{"DOI":"10.13039\/501100000155","name":"Social Sciences and Humanities Research Council of Canada","doi-asserted-by":"crossref","award":["430-2019-00851"],"award-info":[{"award-number":["430-2019-00851"]}],"id":[{"id":"10.13039\/501100000155","id-type":"DOI","asserted-by":"crossref"}]},{"name":"AGE-WELL Graduate Student and Postdoctoral Award in Technology and Aging"},{"DOI":"10.13039\/501100008982","name":"National Science Foundation","doi-asserted-by":"publisher","award":["1147581"],"award-info":[{"award-number":["1147581"]}],"id":[{"id":"10.13039\/501100008982","id-type":"DOI","asserted-by":"publisher"}]},{"name":"University of Turku (UTU) including Turku University Central Hospital"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Lang Resources &amp; Evaluation"],"published-print":{"date-parts":[[2021,9]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>The Internet offers great possibilities for many scientific disciplines that utilize text data. However, the potential of online data can be limited by the lack of information on the genre or <jats:italic>register<\/jats:italic> of the documents, as register\u2014whether a text is, e.g., a news article or a recipe\u2014is arguably the most important predictor of linguistic variation (see Biber in Corpus Linguist Linguist Theory 8:9\u201337, 2012). Despite having received significant attention in recent years, the modeling of online registers has faced a number of challenges, and previous studies have presented contradictory results. In particular, these have concerned (1) the extent to which registers can be automatically identified in a large, unrestricted corpus of web documents and (2) the stability of the models, specifically the kinds of linguistic features that achieve the best performance while reflecting the registers instead of corpus idiosyncrasies. Furthermore, although the linguistic properties of registers vary importantly in a number of ways that may affect their modeling, this variation is often bypassed. In this article, we tackle these issues. We model online registers in the largest available corpus of online registers, the Corpus of Online Registers of English (CORE). Additionally, we evaluate the stability of the models towards corpus idiosyncrasies, analyze the role of different linguistic features in them, and examine how individual registers differ in these two aspects. We show that (1) competitive classification performance on a large-scale, unrestricted corpus can be achieved through a combination of lexico-grammatical features, (2) the inclusion of grammatical information improves the stability of the model, whereas many of the previously best-performing feature sets are less stable, and that (3) registers can be placed in a continuum based on the discriminative importance of lexis and grammar. These register-specific characteristics can explain the variation observed in previous studies concerning the automatic identification of online registers and the importance of different linguistic features for them. Thus, our results offer explanations for the jungle-likeness of online data and provide essential information on online registers for all studies using online data.<\/jats:p>","DOI":"10.1007\/s10579-020-09519-z","type":"journal-article","created":{"date-parts":[[2021,1,25]],"date-time":"2021-01-25T14:13:44Z","timestamp":1611584024000},"page":"757-788","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":6,"title":["Exploring the role of lexis and grammar for the stable identification of register in an unrestricted corpus of web documents"],"prefix":"10.1007","volume":"55","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-7635-429X","authenticated-orcid":false,"given":"Veronika","family":"Laippala","sequence":"first","affiliation":[]},{"given":"Jesse","family":"Egbert","sequence":"additional","affiliation":[]},{"given":"Douglas","family":"Biber","sequence":"additional","affiliation":[]},{"given":"Aki-Juhani","family":"Kyr\u00f6l\u00e4inen","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2021,1,25]]},"reference":[{"issue":"1","key":"9519_CR1","doi-asserted-by":"publisher","first-page":"100","DOI":"10.1075\/rs.18015.arg","volume":"1","author":"S Argamon","year":"2019","unstructured":"Argamon, S. (2019). Register in computational language research. Register Studies, 1(1), 100\u2013135.","journal-title":"Register Studies"},{"key":"9519_CR2","unstructured":"Asheghi, N., Markert, K., & Sharoff, S. (2014). Semi-supervised graph-based genre classification for web pages. Proceedings of TextGraphs-9: The workshop on graph-based methods for natural language processing (pp. 39\u201347)."},{"issue":"3","key":"9519_CR3","doi-asserted-by":"publisher","first-page":"603","DOI":"10.1007\/s10579-015-9331-6","volume":"50","author":"N Asheghi","year":"2016","unstructured":"Asheghi, N., Sharoff, S., & Markert, K. (2016). Crowdsourcing for web genre annotation. Language Resources and Evaluation, 50(3), 603\u2013641.","journal-title":"Language Resources and Evaluation"},{"key":"9519_CR4","doi-asserted-by":"publisher","first-page":"125","DOI":"10.1075\/ijcl.15026.ber","volume":"23","author":"T Berber Sardinha","year":"2018","unstructured":"Berber Sardinha, T. (2018). Dimensions of variation across Internet registers. International Journal of Corpus Linguistics, 23, 125\u2013157.","journal-title":"International Journal of Corpus Linguistics"},{"key":"9519_CR5","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9780511621024","volume-title":"Variation across speech and writing","author":"D Biber","year":"1988","unstructured":"Biber, D. (1988). Variation across speech and writing. Cambridge: Cambridge University Press."},{"key":"9519_CR6","doi-asserted-by":"publisher","first-page":"9","DOI":"10.1515\/cllt-2012-0002","volume":"8","author":"D Biber","year":"2012","unstructured":"Biber, D. (2012). Register as a predictor of linguistic variation. Corpus Linguistics and Linguistic Theory, 8, 9\u201337.","journal-title":"Corpus Linguistics and Linguistic Theory"},{"key":"9519_CR7","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9780511814358","volume-title":"Register, genre, and style","author":"D Biber","year":"2009","unstructured":"Biber, D., & Conrad, S. (2009). Register, genre, and style. Cambridge: Cambridge University Press."},{"key":"9519_CR8","doi-asserted-by":"publisher","first-page":"3","DOI":"10.1558\/jrds.v2i1.27637","volume":"2","author":"D Biber","year":"2015","unstructured":"Biber, D., & Egbert, J. (2015). Using grammatical features for automatic register identification in an unrestricted corpus of documents from the open web. Journal of Research Design and Statistics in Linguistics and Communication Science, 2, 3\u201336.","journal-title":"Journal of Research Design and Statistics in Linguistics and Communication Science"},{"key":"9519_CR9","doi-asserted-by":"publisher","DOI":"10.1017\/9781316388228","volume-title":"Register variation online","author":"D Biber","year":"2018","unstructured":"Biber, D., & Egbert, J. (2018). Register variation online. Cambridge: Cambridge University Press."},{"issue":"1","key":"9519_CR10","doi-asserted-by":"publisher","first-page":"11","DOI":"10.3366\/cor.2015.0065","volume":"10","author":"D Biber","year":"2015","unstructured":"Biber, D., Egbert, J., & Davies, M. (2015). Exploring the composition of the searchable web: A corpus-based taxonomy of web registers. Corpora, 10(1), 11\u201345.","journal-title":"Corpora"},{"key":"9519_CR11","doi-asserted-by":"publisher","DOI":"10.1515\/cllt-2018-0086","author":"D Biber","year":"2020","unstructured":"Biber, D., Egbert, J., & Keller, D. (2020). Reconceptualizing register in a continuous situational space. Corpus Linguistics and Linguistic Theory. https:\/\/doi.org\/10.1515\/cllt-2018-0086.","journal-title":"Corpus Linguistics and Linguistic Theory"},{"key":"9519_CR12","volume-title":"The Longman grammar of spoken and written English","author":"D Biber","year":"1999","unstructured":"Biber, D., Johansson, S., Leech, G., Conrad, S., & Finegan, E. (1999). The Longman grammar of spoken and written English. London: Longman."},{"key":"9519_CR13","doi-asserted-by":"crossref","unstructured":"Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. Proceedings of the fifth annual workshop on computational learning theory (pp. 144\u2013152).","DOI":"10.1145\/130385.130401"},{"issue":"1","key":"9519_CR14","doi-asserted-by":"publisher","first-page":"5","DOI":"10.1023\/A:1010933404324","volume":"45","author":"L Breiman","year":"2001","unstructured":"Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5\u201332.","journal-title":"Machine Learning"},{"issue":"1","key":"9519_CR15","doi-asserted-by":"publisher","first-page":"175","DOI":"10.1016\/j.ipm.2013.08.005","volume":"50","author":"M Clark","year":"2014","unstructured":"Clark, M., Ruthven, I., O\u2019Brian Holt, P., Song, D., & Watt, S. (2014). You have e-mail, what happens next? Tracking the eyes for genre. Information Processing and Management, 50(1), 175\u2013198.","journal-title":"Information Processing and Management"},{"key":"9519_CR16","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4899-4541-9","volume-title":"An introduction to the bootstrap","author":"B Efron","year":"1993","unstructured":"Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman & Hall."},{"issue":"9","key":"9519_CR17","doi-asserted-by":"publisher","first-page":"1817","DOI":"10.1002\/asi.23308","volume":"66","author":"J Egbert","year":"2015","unstructured":"Egbert, J., Biber, D., & Davies, M. (2015). Developing a bottom-up, user-based method of web register classification. Journal of the Association for Information Science and Technology, 66(9), 1817\u20131831.","journal-title":"Journal of the Association for Information Science and Technology"},{"key":"9519_CR18","unstructured":"Giesbrecht, E., & Evert, S. (2009). Is part-of-speech tagging a solved task? An evaluation of POS taggers for the German web as corpus. In Web as corpus workshop (WAC5) (pp. 27\u201336)."},{"key":"9519_CR19","first-page":"1157","volume":"3","author":"I Guyon","year":"2003","unstructured":"Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. The Journal of Machine Learning Research, 3, 1157\u20131182.","journal-title":"The Journal of Machine Learning Research"},{"issue":"1\u20133","key":"9519_CR20","doi-asserted-by":"publisher","first-page":"389","DOI":"10.1023\/A:1012487302797","volume":"46","author":"I Guyon","year":"2002","unstructured":"Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46(1\u20133), 389\u2013422.","journal-title":"Machine Learning"},{"key":"9519_CR21","doi-asserted-by":"crossref","unstructured":"Haider, T., & Palmer, A. (2017). Modeling communicative purpose with functional style: Corpus and features for german genre and register. Proceedings of the workshop on stylistic variation. Copenhagen, Denmark (pp. 74\u201384). Association for Computational Linguistics.","DOI":"10.18653\/v1\/W17-4910"},{"key":"9519_CR22","first-page":"25","volume-title":"Empirical approaches to cognitive linguistics: Analysing real-life data","author":"T Huumo","year":"2017","unstructured":"Huumo, T., Kyr\u00f6l\u00e4inen, A.-J., Kanerva, J., Luotolahti, J., Salakoski, T., Ginter, F., et al. (2017). Distributional semantics of the partitive A argument construction in Finnish. In M. Luodonp\u00e4\u00e4-Manni, E. Penttil\u00e4, & J. Viimaranta (Eds.), Empirical approaches to cognitive linguistics: Analysing real-life data (pp. 25\u201348). Newcastle Upon Tyne: Cambridge Scholars Publishing."},{"key":"9519_CR23","doi-asserted-by":"crossref","unstructured":"Joachims, T. 1998. Text categorization with support vector machines: Learning with many relevant features. Proceedings of the 10th European Conference on Machine Learning (pp. 137\u2013142).","DOI":"10.1007\/BFb0026683"},{"issue":"1","key":"9519_CR24","doi-asserted-by":"publisher","first-page":"11","DOI":"10.1108\/eb026526","volume":"28","author":"KS Jones","year":"1972","unstructured":"Jones, K. S. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1), 11\u201321.","journal-title":"Journal of Documentation"},{"issue":"5","key":"9519_CR25","doi-asserted-by":"publisher","first-page":"499","DOI":"10.1016\/j.ipm.2009.05.003","volume":"45","author":"I Kanaris","year":"2007","unstructured":"Kanaris, I., & Stamatatos, E. (2007). Learning to recognize webpage genres. Information Processing and Management, 45(5), 499\u2013512.","journal-title":"Information Processing and Management"},{"key":"9519_CR26","unstructured":"Kilgarrif, A. (2001). The web as corpus. Proceedings of corpus linguistics. Lancaster University."},{"key":"9519_CR200","volume-title":"Information retrieval","author":"C Manning","year":"2008","unstructured":"Manning, C., Raghavan, P., & Sch\u00fctze, H. (2008). Information retrieval. Cambridge: Cambridge University Press."},{"key":"9519_CR27","doi-asserted-by":"crossref","unstructured":"Ng, A. (2004). Feature selection, L1 vs. L2 regularization, and rotational invariance. Proceedings of the twenty-first international conference on Machine learning.","DOI":"10.1145\/1015330.1015435"},{"issue":"2","key":"9519_CR28","doi-asserted-by":"publisher","first-page":"385","DOI":"10.1162\/COLI_a_00052","volume":"37","author":"P Petrenz","year":"2011","unstructured":"Petrenz, P., & Webber, B. (2011). Stable classification of text genres. Computational Linguistics, 37(2), 385\u2013393.","journal-title":"Computational Linguistics"},{"issue":"4","key":"9519_CR29","doi-asserted-by":"publisher","first-page":"949","DOI":"10.1007\/s10579-018-9418-y","volume":"52","author":"D Pritsos","year":"2018","unstructured":"Pritsos, D., & Stamatatos, E. (2018). Open set evaluation in web genre identification. Language Resources and Evaluation, 52(4), 949\u2013968.","journal-title":"Language Resources and Evaluation"},{"key":"9519_CR30","doi-asserted-by":"crossref","unstructured":"Rodrigues, M. J., & Couto, M. J. (2017). MoRS at SemEval-2017 Task 3: Easy to use SVM in ranking tasks. Proceedings of the 11th international workshop on semantic evaluation (SemEval-2017) (pp. 287\u2013291). Association for Computational Linguistics.","DOI":"10.18653\/v1\/S17-2046"},{"key":"9519_CR31","unstructured":"Santini, M. (2007). Automatic identification of genre in web pages. Ph.D. thesis, University of Brighton."},{"key":"9519_CR32","unstructured":"Sch\u00e4fer, R. (2016). Prototype-driven alternations: The case of German weak nouns. Corpus linguistics and linguistic theory."},{"key":"9519_CR33","doi-asserted-by":"crossref","unstructured":"Sharoff, S. (2008). In the garden and in the jungle: Comparing genres in the BNC and Internet. Genres on the Web, 149\u2013166.","DOI":"10.1007\/978-90-481-9178-9_7"},{"key":"9519_CR34","unstructured":"Sharoff, S., Wu, Z., & Markert, K. (2010). The web library of babel: evaluating genre collections. Proceedings of the seventh conference on international language resources and evaluation (pp. 3063\u20133070)."},{"issue":"1","key":"9519_CR35","doi-asserted-by":"publisher","first-page":"355","DOI":"10.1515\/pralin-2017-0033","volume":"108","author":"A Srivastava","year":"2017","unstructured":"Srivastava, A., Rehm, G., & Sasaki, F. (2017). Improving machine translation through linked data. The Prague Bulletin of Mathematical Linguistics, 108(1), 355\u2013366.","journal-title":"The Prague Bulletin of Mathematical Linguistics"},{"key":"9519_CR36","doi-asserted-by":"crossref","unstructured":"Stamatatos, E., Fakotakis, N., & Kokkinakis, G. (2000). Text genre detection using common word frequencies. Proceedings of the 18th conference on computational linguistics 2. Association for Computational Linguistics.","DOI":"10.3115\/992730.992763"},{"key":"9519_CR37","doi-asserted-by":"crossref","unstructured":"Tiedemann, J., Cap, F., Kanerva, J., Ginter, F., Stymne, S., \u00d6stling, R., & Weller-Di Marco, M. (2016). Phrase-based SMT for finnish with more data, better models and alternative alignment and translation tools. Proceedings of the first conference on machine translation (Vol. 2, pp. 391\u2013398): Shared Task Papers. Berlin, Germany: Association for Computational Linguistics.","DOI":"10.18653\/v1\/W16-2326"},{"issue":"2","key":"9519_CR38","doi-asserted-by":"publisher","first-page":"235","DOI":"10.3366\/cor.2013.0042","volume":"8","author":"A Titak","year":"2013","unstructured":"Titak, A., & Roberson, A. (2013). Dimensions of web registers: An exploratory multi-dimensional comparison. Corpora, 8(2), 235\u2013260.","journal-title":"Corpora"},{"issue":"1\u20132","key":"9519_CR39","first-page":"23","volume":"20","author":"P Turney","year":"1995","unstructured":"Turney, P. (1995). Technical note: Bias and the quantification of stability. Machine Learning, 20(1\u20132), 23\u201333.","journal-title":"Machine Learning"},{"key":"9519_CR40","volume-title":"Statistical learning theory","author":"V Vapnik","year":"1998","unstructured":"Vapnik, V. (1998). Statistical learning theory. New York: Wiley Interscience."},{"key":"9519_CR41","doi-asserted-by":"crossref","unstructured":"Webber, B. (2009). Genre distinctions for discourse in the Penn treebank. Proceedings of the joint conference of the 47th annual meeting of the ACL and the 4th international joint conference on natural language processing of the AFNLP (pp. 674\u2013682). Association for Computational Linguistics.","DOI":"10.3115\/1690219.1690240"},{"key":"9519_CR42","first-page":"1","volume":"2","author":"D Zeman","year":"2017","unstructured":"Zeman, D., Popel, M., Straka, M., Hajic, J., Nivre, J., Ginter, F., et al. (2017). CoNLL 2017 shared task: Multilingual parsing from raw text to universal dependencies. CoNLL Shared Task, 2, 1\u201319.","journal-title":"CoNLL Shared Task"}],"container-title":["Language Resources and Evaluation"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10579-020-09519-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10579-020-09519-z\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10579-020-09519-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,7,31]],"date-time":"2021-07-31T17:13:14Z","timestamp":1627751594000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10579-020-09519-z"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,1,25]]},"references-count":43,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2021,9]]}},"alternative-id":["9519"],"URL":"https:\/\/doi.org\/10.1007\/s10579-020-09519-z","relation":{},"ISSN":["1574-020X","1574-0218"],"issn-type":[{"type":"print","value":"1574-020X"},{"type":"electronic","value":"1574-0218"}],"subject":[],"published":{"date-parts":[[2021,1,25]]},"assertion":[{"value":"30 October 2020","order":1,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"25 January 2021","order":2,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}