{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,24]],"date-time":"2025-12-24T12:29:06Z","timestamp":1766579346636,"version":"3.41.0"},"reference-count":48,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2018,10,29]],"date-time":"2018-10-29T00:00:00Z","timestamp":1540771200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["J. Data and Information Quality"],"published-print":{"date-parts":[[2018,12,31]]},"abstract":"<jats:p>\n            The evolution of web pages from static HTML pages toward dynamic pieces of software has rendered archiving them increasingly difficult. Nevertheless, an accurate,\n            <jats:italic>reproducible<\/jats:italic>\n            web archive is a necessity to ensure the reproducibility of web-based research. Archiving web pages reproducibly, however, is currently not part of best practices for web corpus construction. As a result, and despite the ongoing efforts of other stakeholders to archive the web, tools for the construction of reproducible web corpora are insufficient or ill-fitted. This article presents a new tool tailored to this purpose. It relies on emulating user interactions with a web page while recording all network traffic. The customizable user interactions can be replayed on demand, while requests sent by the archived page are served with the recorded responses. The tool facilitates reproducible user studies, user simulations, and evaluations of algorithms that rely on extracting data from web pages. To evaluate our tool, we conduct the first systematic assessment of reproduction quality for rendered web pages. 
Using our tool, we create a corpus of 10,000\u00a0web pages carefully sampled from the Common Crawl and manually annotated with regard to reproduction quality via crowdsourcing. Based on this data, we test three approaches to automatic reproduction-quality assessment. An off-the-shelf neural network, trained on visual differences between the web page during archiving and reproduction, matches the manual assessments best. This automatic assessment of reproduction quality allows for immediate bugfixing during archiving and continuous development of our tool as the web continues to evolve.\n          <\/jats:p>","DOI":"10.1145\/3239574","type":"journal-article","created":{"date-parts":[[2018,10,29]],"date-time":"2018-10-29T12:02:18Z","timestamp":1540814538000},"page":"1-25","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":16,"title":["Reproducible Web Corpora"],"prefix":"10.1145","volume":"10","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-1617-6508","authenticated-orcid":false,"given":"Johannes","family":"Kiesel","sequence":"first","affiliation":[{"name":"Bauhaus-Universit\u00e4t Weimar, Germany"}]},{"given":"Florian","family":"Kneist","sequence":"additional","affiliation":[{"name":"Ulm University, Germany"}]},{"given":"Milad","family":"Alshomary","sequence":"additional","affiliation":[{"name":"Paderborn University, Germany"}]},{"given":"Benno","family":"Stein","sequence":"additional","affiliation":[{"name":"Bauhaus-Universit\u00e4t Weimar, Germany"}]},{"given":"Matthias","family":"Hagen","sequence":"additional","affiliation":[{"name":"Martin-Luther-Universit\u00e4t Halle-Wittenberg, Germany"}]},{"given":"Martin","family":"Potthast","sequence":"additional","affiliation":[{"name":"Leipzig University, Germany"}]}],"member":"320","published-online":{"date-parts":[[2018,10,29]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"Alexa Internet Inc. 2017. The top 500 sites on the web. 
Retrieved from https:\/\/www.alexa.com\/topsites."},{"volume-title":"Proceedings of the Research and Advanced Technology for Digital Libraries: International Conference on Theory and Practice of Digital Libraries (TPDL\u201913)","author":"Alnoamany Yasmin","key":"e_1_2_1_2_1","unstructured":"Yasmin Alnoamany, Ahmed Alsum, Michele C. Weigle, and Michael L. Nelson. 2013. Who and what links to the internet archive. In Proceedings of the Research and Advanced Technology for Digital Libraries: International Conference on Theory and Practice of Digital Libraries (TPDL\u201913), Trond Aalberg, Christos Papatheodorou, Milena Dobreva, Giannis Tsakonas, and Charles J. Farrugia (Eds.). Springer, Berlin, 346--357."},{"key":"e_1_2_1_3_1","volume-title":"Mark Edward Phillips, and Lauren Ko","author":"Ayala Brenda Reyes","year":"2014","unstructured":"Brenda Reyes Ayala, Mark Edward Phillips, and Lauren Ko. 2014. Current Quality Assurance Practices in Web Archiving. Technical Report. University of North Texas Libraries."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/2911451.2914799"},{"key":"e_1_2_1_5_1","volume-title":"Web archiving in the United States: A 2016 survey. Natl. Dig. Steward. 
Alliance","author":"Bailey Jefferson","year":"2017","unstructured":"Jefferson Bailey, Abigail Grotke, Edward McCain, Christie Moffatt, and Nicholas Taylor. 2017. Web archiving in the United States: A 2016 survey. Natl. Dig. Steward. Alliance (2017)."},{"key":"e_1_2_1_6_1","volume-title":"Proceedings of the 10th International Conference on Preservation of Digital Objects (iPRES\u201913)","author":"Banos Vangelis","year":"2013","unstructured":"Vangelis Banos, Yunhyong Kim, Seamus Ross, and Yannis Manolopoulos. 2013. CLEAR: A credible method to evaluate website archivability. Proceedings of the 10th International Conference on Preservation of Digital Objects (iPRES\u201913)."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00799-015-0144-4"},{"key":"e_1_2_1_8_1","volume-title":"Proceedings of the International Conference on Language Resources and Evaluation (LREC\u201908)","author":"Baroni Marco","year":"2008","unstructured":"Marco Baroni, Francis Chantree, Adam Kilgarriff, and Serge Sharoff. 2008. Cleaneval: A competition for cleaning web pages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC\u201908)."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00799-015-0150-6"},{"key":"e_1_2_1_10_1","unstructured":"Fran\u00e7ois Chollet et al. 
2015. Keras. Retrieved from https:\/\/github.com\/fchollet\/keras."},{"key":"e_1_2_1_11_1","unstructured":"International Internet Preservation Consortium. 2017. OpenWayback. Retrieved from http:\/\/netpreserve.org\/web-archiving\/openwayback."},{"key":"e_1_2_1_12_1","unstructured":"cyberpower678. 2017. InternetArchiveBot. Retrieved from https:\/\/meta.wikimedia.org\/wiki\/InternetArchiveBot."},{"key":"e_1_2_1_13_1","volume-title":"Rush","author":"Deng Yuntian","year":"2016","unstructured":"Yuntian Deng, Anssi Kanervisto, and Alexander M. Rush. 2016. What you get is what you see: A visual markup decompiler. CoRR abs\/1609.04938."},{"key":"e_1_2_1_14_1","unstructured":"Docker Inc. 2017. Docker\u2014Build Ship and Run Any App Anywhere. Retrieved from https:\/\/www.docker.com."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/2187980.2187996"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.5555\/2042536.2042590"},{"key":"e_1_2_1_17_1","unstructured":"Google Webmaster Central Blog. 2015. Deprecating our AJAX crawling scheme. 
Retrieved from https:\/\/webmasters.googleblog.com\/2015\/10\/deprecating-our-ajax-crawling-scheme.html."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/2396761.2398558"},{"key":"e_1_2_1_19_1","unstructured":"Ariya Hidayat. 2017. PhantomJS\u2014Scriptable Headless WebKit. Retrieved from https:\/\/github.com\/ariya\/phantomjs."},{"key":"e_1_2_1_20_1","unstructured":"Helen Hockx-Yu Lewis Crawford Roger Coram and Stephen Johnson. 2010. Capturing and replaying streaming media in a web archive\u2014A British Library case study. Retrieved from http:\/\/www.ipres-conference.org\/ipres10\/papers\/hockxyu-44.pdf."},{"key":"e_1_2_1_21_1","unstructured":"ImageMagick Studio. 2017. Convert Edit Or Compose Bitmap Images @ ImageMagick. Retrieved from https:\/\/www.imagemagick.org."},{"key":"e_1_2_1_22_1","unstructured":"Andrej Karpathy. 2017. CS231n Convolutional Neural Networks for Visual Recognition. Retrieved from http:\/\/cs231n.github.io\/convolutional-networks."},{"key":"e_1_2_1_23_1","volume-title":"Kingma and Jimmy Ba","author":"Diederik","year":"2014","unstructured":"Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. 
CoRR abs\/1412.6980."},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/1718487.1718542"},{"key":"e_1_2_1_25_1","unstructured":"Ilya Kreymer. 2017. Python WayBack for web archive replay and url-rewriting HTTP\/S web proxy. Retrieved from https:\/\/github.com\/ikreymer\/pywb."},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1989.1.4.541"},{"key":"e_1_2_1_27_1","unstructured":"Patrick Lightbody. 2017. BrowserMob Proxy: A free utility to help web developers watch and manipulate network traffic from their AJAX applications. Retrieved from https:\/\/github.com\/lightbody\/browsermob-proxy."},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/3097570"},{"key":"e_1_2_1_29_1","unstructured":"National Library of New Zealand and British Library. 2014. Web Curator Tool Project. Retrieved from http:\/\/webcurator.sourceforge.net\/."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/1240624.1240719"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-65813-1_16"},{"key":"e_1_2_1_32_1","unstructured":"Selenium Contributors. 2017. SeleniumHQ Browser Automation. Retrieved from http:\/\/www.seleniumhq.org."},{"key":"e_1_2_1_33_1","volume-title":"Very deep convolutional networks for large-scale image recognition. 
CoRR abs\/1409.1556","author":"Simonyan Karen","year":"2014","unstructured":"Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. CoRR abs\/1409.1556 (2014)."},{"key":"e_1_2_1_34_1","unstructured":"StatCounter. June 2018. Screen Resolution Stats Worldwide. Retrieved from http:\/\/gs.statcounter.com\/screen-resolution-stats."},{"key":"e_1_2_1_35_1","unstructured":"Simon Stewart David Burns and Mozilla. 2018. WebDriver W3C Candidate Recommendation. Retrieved from https:\/\/www.w3.org\/TR\/webdriver."},{"key":"e_1_2_1_36_1","unstructured":"The Common Crawl team. 2017. January 2017 Common Crawl Archive. Retrieved from http:\/\/commoncrawl.org\/2017\/02\/january-2017-crawl-archive-now-available."},{"key":"e_1_2_1_37_1","unstructured":"The Internet Archive. 2017. Brozzler. Retrieved from https:\/\/github.com\/internetarchive\/brozzler."},{"key":"e_1_2_1_38_1","unstructured":"The Internet Archive. 2017. Umbra. Retrieved from https:\/\/github.com\/internetarchive\/umbra."},{"key":"e_1_2_1_39_1","unstructured":"
The Internet Archive. 2017. Warcprox\u2014WARC writing MITM HTTP\/S proxy. Retrieved from https:\/\/github.com\/internetarchive\/warcprox."},{"key":"e_1_2_1_40_1","unstructured":"The Lemur Project. 2009. The ClueWeb09 Dataset. Retrieved from http:\/\/lemurproject.org\/clueweb09\/."},{"key":"e_1_2_1_41_1","unstructured":"The Lemur Project. 2012. The ClueWeb12 Dataset. Retrieved from http:\/\/lemurproject.org\/clueweb12\/."},{"key":"e_1_2_1_42_1","unstructured":"The Memento Development Group. 2017. About the Memento Project. Retrieved from http:\/\/mementoweb.org\/about RFC 7089."},{"key":"e_1_2_1_43_1","unstructured":"The Open Preserve Foundation. 2014. Pagelyzer. Retrieved from https:\/\/github.com\/openpreserve\/pagelyzer."},{"key":"e_1_2_1_44_1","volume-title":"Proceedings of the 7th International Conference on Web Information Systems and Technologies (WEBIST\u201911)","author":"Tzekou Paraskevi","year":"2011","unstructured":"Paraskevi Tzekou, Sofia Stamou, Nikos Kirtsis, and Nikos Zotos. 2011. Quality assessment of Wikipedia external links. In Proceedings of the 7th International Conference on Web Information Systems and Technologies (WEBIST\u201911), Jos\u00e9 Cordeiro and Joaquim Filipe (Eds.). 
SciTePress, 248--254."},{"key":"e_1_2_1_45_1","unstructured":"W3schools.com. 2018. Browser Display Statistics. Retrieved from https:\/\/www.w3schools.com\/browsers\/browsers_display.asp."},{"key":"e_1_2_1_46_1","unstructured":"W3Techs. October 2017. Usage of JavaScript libraries for websites. Retrieved from https:\/\/w3techs.com\/technologies\/overview\/javascript_library\/all."},{"key":"e_1_2_1_47_1","volume-title":"Wiggins and The Open Group","author":"David","year":"2017","unstructured":"David P. Wiggins and The Open Group, Inc. 2017. Xvfb\u2014virtual framebuffer X server for X Version 11. Retrieved from https:\/\/www.x.org\/releases\/X11R7.7\/doc\/man\/man1\/Xvfb.1.xhtml."},{"volume-title":"libfaketime","key":"e_1_2_1_48_1","unstructured":"Wolfcw. 2014. libfaketime (FakeTime Preload Library)\u2014Report faked system time to programs without having to change the system-wide time. 
Retrieved from http:\/\/www.code-wizards.com\/projects\/libfaketime\/."}],"container-title":["Journal of Data and Information Quality"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3239574","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3239574","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T01:08:20Z","timestamp":1750208900000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3239574"}},"subtitle":["Interactive Archiving with Automatic Quality Assessment"],"short-title":[],"issued":{"date-parts":[[2018,10,29]]},"references-count":48,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2018,12,31]]}},"alternative-id":["10.1145\/3239574"],"URL":"https:\/\/doi.org\/10.1145\/3239574","relation":{},"ISSN":["1936-1955","1936-1963"],"issn-type":[{"type":"print","value":"1936-1955"},{"type":"electronic","value":"1936-1963"}],"subject":[],"published":{"date-parts":[[2018,10,29]]},"assertion":[{"value":"2017-10-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2018-07-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2018-10-29","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}