{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,7,20]],"date-time":"2024-07-20T06:31:18Z","timestamp":1721457078410},"reference-count":41,"publisher":"Springer Science and Business Media LLC","issue":"2","license":[{"start":{"date-parts":[[2011,3,2]],"date-time":"2011-03-02T00:00:00Z","timestamp":1299024000000},"content-version":"tdm","delay-in-days":0,"URL":"http:\/\/www.springer.com\/tdm"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Lang Resources &amp; Evaluation"],"published-print":{"date-parts":[[2011,5]]},"DOI":"10.1007\/s10579-011-9141-4","type":"journal-article","created":{"date-parts":[[2011,3,1]],"date-time":"2011-03-01T23:28:06Z","timestamp":1299022086000},"page":"209-241","source":"Crossref","is-referenced-by-count":7,"title":["Constructing specialised corpora through analysing domain representativeness of websites"],"prefix":"10.1007","volume":"45","author":[{"given":"Wilson","family":"Wong","sequence":"first","affiliation":[]},{"given":"Wei","family":"Liu","sequence":"additional","affiliation":[]},{"given":"Mohammed","family":"Bennamoun","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2011,3,2]]},"reference":[{"issue":"1","key":"9141_CR1","first-page":"143","volume":"3","author":"L. Adamic","year":"2002","unstructured":"Adamic, L., & Huberman, B. (2002). Zipf\u2019s law and the internet.Glottometrics, 3(1), 143\u2013150.","journal-title":"Glottometrics"},{"key":"9141_CR2","unstructured":"Agbago, A., & Barriere, C. (2005). Corpus construction for terminology. In Proceedings of the corpus linguistics conference, Birmingham, UK."},{"key":"9141_CR3","unstructured":"Baroni, M., & Bernardini, S. (2004). Bootcat: Bootstrapping corpora and terms from the web. In Proceedings of the 4th language resources and evaluation conference (LREC), Lisbon, Portugal."},{"key":"9141_CR4","unstructured":"Baroni, M., & Bernardini, S. (2006). Wacky! working papers on the web as corpus. Bologna, Italy: GEDIT."},{"key":"9141_CR5","unstructured":"Baroni, M., & Ueyama, M. (2006). Building general- and special-purpose corpora by web crawling. In Proceedings of the 13th NIJL international symposium on language corpora: Their compilation and application."},{"key":"9141_CR6","unstructured":"Baroni, M., Kilgarriff, A., Pomikalek, J., & Rychly, P. (2006). Webbootcat: Instant domain-specific corpora to support human translators. In Proceedings of the 11th annual conference of the European association for Machine Translation (EAMT), Norway."},{"key":"9141_CR7","unstructured":"Basili, R., Moschitti, A., Pazienza, M., & Zanzotto, F. (2001). A contrastive approach to term extraction. In Proceedings of the 4th terminology and artificial intelligence conference (TIA), France."},{"issue":"2","key":"9141_CR8","doi-asserted-by":"crossref","first-page":"286","DOI":"10.3758\/BF03195456","volume":"34","author":"I. Blair","year":"2002","unstructured":"Blair, I., Urland, G., & Ma, J. (2002). Using internet search engines to estimate word frequency. Behavior Research Methods Instruments & Computers, 34(2), 286\u2013290.","journal-title":"Behavior Research Methods Instruments & Computers"},{"key":"9141_CR9","unstructured":"Cavaglia, G., & Kilgarriff, A. (2001). Corpora from the web. In Proceedings of the 4th annual CLUCK colloquium, Sheffield, UK."},{"issue":"3","key":"9141_CR10","doi-asserted-by":"crossref","first-page":"370","DOI":"10.1109\/TKDE.2007.48","volume":"19","author":"R. Cilibrasi","year":"2007","unstructured":"Cilibrasi, R., & Vitanyi, P. (2007). The google similarity distance. IEEE Transactions on Knowledge and Data Engineering, 19(3), 370\u2013383.","journal-title":"IEEE Transactions on Knowledge and Data Engineering"},{"key":"9141_CR11","unstructured":"Evert, S. (2007). Stupidos: A high-precision approach to boilerplate removal. In Proceedings of the 3rd web as corpus workshop, Belgium."},{"key":"9141_CR12","unstructured":"Evert, S. (2008). A lightweight and efficient tool for cleaning web pages. In Proceedings of the 4th web as corpus workshop (WAC), Morocco."},{"key":"9141_CR13","doi-asserted-by":"crossref","unstructured":"Fetterly, D., Manasse, M., Najork, M., & Wiener, J. (2003). A large-scale study of the evolution of web pages. In Proceedings of the 12th international conference on world wide web, Budapest, Hungary.","DOI":"10.1145\/775152.775246"},{"key":"9141_CR14","unstructured":"Fletcher, W. (2007). Implementing a bnc-comparable web corpus. In Proceedings of the 3rd web as corpus workshop, Belgium."},{"key":"9141_CR15","unstructured":"Francis, W., & Kucera, H. (1979). Brown corpus manual. http:\/\/icame.uib.no\/brown\/bcm.html ."},{"key":"9141_CR16","unstructured":"Girardi, C. (2007). Htmlcleaner: Extracting the relevant text from the web pages. In Proceedings of the 3rd web as corpus workshop, Belgium."},{"key":"9141_CR17","volume-title":"Lexicology and corpus linguistics: An introduction","author":"M. Halliday","year":"2004","unstructured":"Halliday, M., Teubert, W., Yallop, C., & Cermakova, A. (2004). Lexicology and corpus linguistics: An introduction. Continuum, London."},{"issue":"1","key":"9141_CR18","doi-asserted-by":"crossref","first-page":"5186","DOI":"10.1073\/pnas.0307528100","volume":"101","author":"M. Henzinger","year":"2004","unstructured":"Henzinger, M., & Lawrence, S. (2004). Extracting knowledge from the world wide web. PNAS, 101(1), 5186\u20135191.","journal-title":"PNAS"},{"key":"9141_CR19","unstructured":"Jock, F. (2009). An overview of the importance of page rank. http:\/\/www.associatedcontent.com\/article\/1502284\/an_overview_of_the_importance_of_page.html?cat=15; 9 March 2009."},{"key":"9141_CR20","doi-asserted-by":"crossref","unstructured":"Keller, F., Lapata, M., & Ourioupina, O. (2002). Using the web to overcome data sparseness. In Proceedings of the conference on empirical methods in natural language processing (EMNLP), Philadelphia.","DOI":"10.3115\/1118693.1118723"},{"issue":"14","key":"9141_CR21","doi-asserted-by":"crossref","first-page":"11","DOI":"10.1002\/scj.20852","volume":"38","author":"M. Kida","year":"2007","unstructured":"Kida, M., Tonoike, M., Utsuro, T., & Sato, S (2007). Domain classification of technical terms using the web. Systems and Computers in Japan, 38(14), 11\u201319.","journal-title":"Systems and Computers in Japan"},{"key":"9141_CR22","unstructured":"Kilgarriff, A. (2001). Web as corpus. In Proceedings of the corpus linguistics (CL), Lancaster University, UK."},{"issue":"1","key":"9141_CR23","doi-asserted-by":"crossref","first-page":"147","DOI":"10.1162\/coli.2007.33.1.147","volume":"33","author":"A. Kilgarriff","year":"2007","unstructured":"Kilgarriff, A. (2007). Googleology is bad science. Computational Linguistics, 33(1), 147\u2013151","journal-title":"Computational Linguistics"},{"issue":"3","key":"9141_CR24","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1162\/089120103322711569","volume":"29","author":"A. Kilgarriff","year":"2003","unstructured":"Kilgarriff, A., & Grefenstette, G. (2003). Web as corpus. Computational Linguistics, 29(3), 1\u201315.","journal-title":"Computational Linguistics"},{"issue":"1","key":"9141_CR25","doi-asserted-by":"crossref","first-page":"180","DOI":"10.1093\/bioinformatics\/btg1023","volume":"19","author":"J. Kim","year":"2003","unstructured":"Kim, J., Ohta, T., Teteisi, Y., & Tsujii, J. (2003). Genia corpus-a semantically annotated corpus for bio-textmining. Bioinformatics, 19(1), 180\u2013182.","journal-title":"Bioinformatics"},{"issue":"1","key":"9141_CR26","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/1075389.1075392","volume":"2","author":"M. Lapata","year":"2005","unstructured":"Lapata, M., & Keller, F. (2005). Web-based models for natural language processing. ACM Transactions on Speech and Language Processing, 2(1),1\u201330.","journal-title":"ACM Transactions on Speech and Language Processing"},{"key":"9141_CR27","unstructured":"Liberman, M. (2005). Questioning reality. http:\/\/www.itre.cis.upenn.edu.\/myl\/languagelog\/archives\/001837.html; 26 March 2009."},{"key":"9141_CR28","unstructured":"Liu, V., & Curran, J. (2006). Web text corpus for natural language processing. In Proceedings of the 11th conference of the European chapter of the association for computational linguistics (EACL), Italy."},{"key":"9141_CR29","volume-title":"Corpus-based language studies: An advanced resource book","author":"T. McEnery","year":"2005","unstructured":"McEnery, T., Xiao, R., & Tono, Y. (2005). Corpus-based language studies: An advanced resource book. London, UK: Taylor & Francis Group Plc."},{"key":"9141_CR30","unstructured":"Nakov, P., & Hearst, M. (2005). A study of using search engine page hits as a proxy for n-gram frequencies. In Proceedings of the international conference on recent advances in natural language processing (RANLP), Bulgaria."},{"issue":"3","key":"9141_CR31","doi-asserted-by":"crossref","first-page":"279","DOI":"10.1300\/J111v34n03_07","volume":"34","author":"E. O\u2019Neill","year":"2001","unstructured":"O\u2019Neill, E., McClain, P., & Lavoie, B. (2001). A methodology for sampling the world wide web. Journal of Library Administration, 34(3), 279\u2013291.","journal-title":"Journal of Library Administration"},{"key":"9141_CR32","doi-asserted-by":"crossref","unstructured":"Ravichandran, D., Pantel, P., & Hovy, E. (2005). Randomized algorithms and nlp: Using locality sensitive hash function for high speed noun clustering. In Proceedings of the 43rd annual meeting on association for computational linguistics, Michigan, USA.","DOI":"10.3115\/1219840.1219917"},{"key":"9141_CR33","volume-title":"Corpus Linguistics and the Web","author":"A. Renouf","year":"2007","unstructured":"Renouf, A., Kehoe, A., & Banerjee, J. (2007). Webcorp: An integrated system for web text search. In Nadja Nesselhauf MHCB (Ed.), Corpus linguistics and the web. Amsterdam: Rodopi"},{"issue":"3","key":"9141_CR34","doi-asserted-by":"crossref","first-page":"349","DOI":"10.1162\/089120103322711578","volume":"29","author":"P. Resnik","year":"2003","unstructured":"Resnik, P., & Smith, N. (2003). The web as a parallel corpus. Computational Linguistics, 29(3), 349\u2013380.","journal-title":"Computational Linguistics"},{"key":"9141_CR35","unstructured":"Sharoff, S. (2006). Creating general-purpose corpora using automated search engine queries. In M. Baroni & S. Bernardini (Eds.), Wacky! Working papers on the web as corpus. Bologna: GEDIT"},{"issue":"13","key":"9141_CR36","doi-asserted-by":"crossref","first-page":"1771","DOI":"10.1002\/asi.20388","volume":"57","author":"M. Thelwall","year":"2006","unstructured":"Thelwall, M., & Stuart, D. (2006). Web crawling ethics revisited: Cost, privacy and denial of service. Journal of the American Society for Information Science and Technology, 57(13), 1771\u20131779.","journal-title":"Journal of the American Society for Information Science and Technology"},{"key":"9141_CR37","doi-asserted-by":"crossref","unstructured":"Turney, P. (2001). Mining the web for synonyms: Pmi-ir versus lsa on toefl. In Proceedings of the 12th European conference on machine learning (ECML). Freiburg, Germany.","DOI":"10.1007\/3-540-44795-4_42"},{"issue":"3","key":"9141_CR38","doi-asserted-by":"crossref","first-page":"349","DOI":"10.1007\/s10618-007-0073-y","volume":"15","author":"W. Wong","year":"2007","unstructured":"Wong, W., Liu, W., & Bennamoun, M. (2007). Tree-traversing ant algorithm for term clustering based on featureless similarities. Data Mining and Knowledge Discovery, 15(3), 349\u2013381.","journal-title":"Data Mining and Knowledge Discovery"},{"key":"9141_CR39","doi-asserted-by":"crossref","unstructured":"Wong, W., Liu, W., & Bennamoun, M. (2008a). Constructing web corpora through topical web partitioning for term recognition. In Proceedings of the 21st Australasian joint conference on artificial intelligence (AI). Auckland, New Zealand.","DOI":"10.1007\/978-3-540-89378-3_7"},{"key":"9141_CR40","doi-asserted-by":"crossref","unstructured":"Wong W., Liu W., & Bennamoun M. (2008b). Determination of unithood and termhood for term recognition. In M. Song & Y. Wu (Eds.), Handbook of research on text and web mining technologies. IGI Global","DOI":"10.4018\/978-1-59904-990-8.ch030"},{"issue":"4","key":"9141_CR41","doi-asserted-by":"crossref","first-page":"499","DOI":"10.3233\/IDA-2009-0379","volume":"13","author":"W. Wong","year":"2009","unstructured":"Wong, W., Liu, W., & Bennamoun, M. (2009). A probabilistic framework for automatic term recognition. Intelligent Data Analysis 13(4), 499\u2013539.","journal-title":"Intelligent Data Analysis"}],"container-title":["Language Resources and Evaluation"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1007\/s10579-011-9141-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/article\/10.1007\/s10579-011-9141-4\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1007\/s10579-011-9141-4","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,11,19]],"date-time":"2021-11-19T17:08:48Z","timestamp":1637341728000},"score":1,"resource":{"primary":{"URL":"http:\/\/link.springer.com\/10.1007\/s10579-011-9141-4"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2011,3,2]]},"references-count":41,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2011,5]]}},"alternative-id":["9141"],"URL":"https:\/\/doi.org\/10.1007\/s10579-011-9141-4","relation":{},"ISSN":["1574-020X","1574-0218"],"issn-type":[{"value":"1574-020X","type":"print"},{"value":"1574-0218","type":"electronic"}],"subject":[],"published":{"date-parts":[[2011,3,2]]}}}