{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,22]],"date-time":"2025-08-22T05:04:04Z","timestamp":1755839044945},"reference-count":32,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2020,10,22]],"date-time":"2020-10-22T00:00:00Z","timestamp":1603324800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2020,10,22]],"date-time":"2020-10-22T00:00:00Z","timestamp":1603324800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"National Key R&D Program of China","award":["2018YFB1004404","2018YFB1004404"],"award-info":[{"award-number":["2018YFB1004404","2018YFB1004404"]}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Data Sci. Eng."],"published-print":{"date-parts":[[2021,3]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>The size of textual data continues to grow along with the need for timely and cost-effective analysis, while the growth of computation power cannot keep up with the growth of data. The delays when processing huge textual data can negatively impact user activity and insight. This calls for a paradigm shift from blocking fashion to progressive processing. In this paper, we propose a sample-based progressive processing model that focuses on term frequency calculation on text. The model is based on an incremental execution engine and will calculate a series of approximate results for a single query in a progressive way to provide a smooth trade-off between accuracy and latency. As a part, we proposed a new variant of the bootstrap technique to quantify result error progressively. We implemented this method in our system called Parrot on top of Apache Spark and used real-world data to test its performance. Experiments demonstrate that our method is 2.4\u00d7\u201319.7\u00d7 faster to get a result within 1% error while the confidence interval always covers the accurate results very well.<\/jats:p>","DOI":"10.1007\/s41019-020-00144-y","type":"journal-article","created":{"date-parts":[[2020,10,22]],"date-time":"2020-10-22T16:02:42Z","timestamp":1603382562000},"page":"1-19","update-policy":"http:\/\/dx.doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":6,"title":["Parrot: A Progressive Analysis System on Large Text Collections"],"prefix":"10.1007","volume":"6","author":[{"given":"Yazhong","family":"Zhang","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Hanbing","family":"Zhang","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Zhenying","family":"He","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yinan","family":"Jing","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Kai","family":"Zhang","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"X. Sean","family":"Wang","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2020,10,22]]},"reference":[{"key":"144_CR1","unstructured":"7.4.2, E.S. (2019). https:\/\/www.elastic.co"},{"key":"144_CR2","doi-asserted-by":"publisher","unstructured":"Acharya S, Gibbons PB, Poosala V, Ramaswamy S (1999) The aqua approximate query answering system. In: Delis A, Faloutsos C, Ghandeharizadeh S (eds) SIGMOD 1999, Proceedings ACM SIGMOD international conference on management of data, June 1\u20133, Philadelphia, Pennsylvania, USA, ACM Press, pp 574\u2013576 (1999). https:\/\/doi.org\/10.1145\/304182.304581","DOI":"10.1145\/304182.304581"},{"key":"144_CR3","doi-asserted-by":"publisher","unstructured":"Agarwal S, Milner H, Kleiner A, Talwalkar A, Jordan MI, Madden S, Mozafari B, Stoica I (2014) Knowing when you\u2019re wrong: building fast and reliable approximate query processing systems. In: Dyreson CE, Li F, \u00d6zsu MT (eds) International conference on management of data, SIGMOD 2014, Snowbird, UT, USA, June 22\u201327, ACM, pp 481\u2013492 (2014). https:\/\/doi.org\/10.1145\/2588555.2593667","DOI":"10.1145\/2588555.2593667"},{"key":"144_CR4","doi-asserted-by":"publisher","unstructured":"Agarwal S, Mozafari B, Panda A, Milner H, Madden S, Stoica I (2013) Blinkdb: queries with bounded errors and bounded response times on very large data. In: Hanz\u00e1lek Z, H\u00e4rtig H, Castro M, Kaashoek MF (eds) Eighth Eurosys conference 2013, EuroSys \u201913, Prague, Czech Republic, April 14\u201317, ACM, pp. 29\u201342 (2013). https:\/\/doi.org\/10.1145\/2465351.2465355","DOI":"10.1145\/2465351.2465355"},{"issue":"6","key":"144_CR5","doi-asserted-by":"publisher","first-page":"684","DOI":"10.1016\/j.ijinfomgt.2017.06.005","volume":"37","author":"M Bouakkaz","year":"2017","unstructured":"Bouakkaz M, Ouinten Y, Loudcher S, Strekalova Y (2017) Textual aggregation approaches in OLAP context: a survey. Int J Inf Manag 37(6):684\u2013692. https:\/\/doi.org\/10.1016\/j.ijinfomgt.2017.06.005","journal-title":"Int J Inf Manag"},{"key":"144_CR6","unstructured":"Corral A, Boleda G, Ferrer-i-Cancho R (2014) Zipf\u2019s law for word frequencies: word forms versus lemmas in long texts. CoRR abs\/1407.8322 (2014). arXiv: org\/abs\/1407.8322"},{"key":"144_CR7","doi-asserted-by":"publisher","unstructured":"Dimitriadou K, Papaemmanouil O, Diao Y (2014) Interactive data exploration based on user relevance feedback. In: Workshops proceedings of the 30th international conference on data engineering workshops, ICDE 2014, Chicago, IL, USA, March 31\u2013April 4, 2014, IEEE Computer Society, pp 292\u2013295 (2014). https:\/\/doi.org\/10.1109\/ICDEW.2014.6818343","DOI":"10.1109\/ICDEW.2014.6818343"},{"key":"144_CR8","doi-asserted-by":"crossref","unstructured":"Efron B (1992) Bootstrap methods: another look at the jackknife. In: Breakthroughs in statistics, Springer, pp 569\u2013593","DOI":"10.1007\/978-1-4612-4380-9_41"},{"key":"144_CR9","doi-asserted-by":"publisher","unstructured":"Galakatos A, Crotty A, Zgraggen E, Binnig C, Kraska T (2017) Revisiting reuse for approximate query processing. PVLDB 10(10):1142\u20131153. https:\/\/doi.org\/10.14778\/3115404.3115418. http:\/\/www.vldb.org\/pvldb\/vol10\/p1142-galakatos.pdf","DOI":"10.14778\/3115404.3115418"},{"key":"144_CR10","unstructured":"Gray J, Chaudhuri S, Bosworth A, Layman A, Reichart D, Venkatrao M, Pellow F, Pirahesh H (2007) Data cube: a relational aggregation operator generalizing group-by, cross-tab, and sub-totals. CoRR abs\/cs\/0701155. arXiv:org\/abs\/cs\/0701155"},{"key":"144_CR11","doi-asserted-by":"publisher","unstructured":"Griffin T, Libkin L (1995) Incremental maintenance of views with duplicates. In: Carey MJ, Schneider DA (eds) Proceedings of the 1995 ACM SIGMOD international conference on management of data, San Jose, California, USA, May 22\u201325, 1995, ACM Press, pp 328\u2013339. https:\/\/doi.org\/10.1145\/223784.223849","DOI":"10.1145\/223784.223849"},{"key":"144_CR12","unstructured":"Haas PJ, Haas PJ (1996) Hoeffding inequalities for join-selectivity estimation and online aggregation. IBM"},{"key":"144_CR13","doi-asserted-by":"publisher","unstructured":"Hellerstein JM, Haas PJ, Wang HJ (1997) Online aggregation. In: Peckham J (ed) SIGMOD 1997, Proceedings ACM SIGMOD international conference on management of data, May 13\u201315, 1997, Tucson, Arizona, USA, ACM Press, pp. 171\u2013182. https:\/\/doi.org\/10.1145\/253260.253291","DOI":"10.1145\/253260.253291"},{"key":"144_CR14","unstructured":"Idreos S, Kersten ML, Manegold S (2007) Database cracking. In: CIDR 2007, Third biennial conference on innovative data systems research, Asilomar, CA, USA, January 7\u201310, 2007, Online Proceedings, pp 68\u201378. www.cidrdb.org. http:\/\/cidrdb.org\/cidr2007\/papers\/cidr07p07.pdf"},{"issue":"5","key":"144_CR15","doi-asserted-by":"publisher","first-page":"628","DOI":"10.1109\/TPAMI.1987.4767957","volume":"9","author":"AK Jain","year":"1987","unstructured":"Jain AK, Dubes RC, Chen C (1987) Bootstrap techniques for error estimation. IEEE Trans Pattern Anal Mach Intell 9(5):628\u2013633. https:\/\/doi.org\/10.1109\/TPAMI.1987.4767957","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"144_CR16","unstructured":"Kleiner A, Talwalkar A, Sarkar P, Jordan MI (2012) The big data bootstrap. In: Proceedings of the 29th international conference on machine learning, ICML 2012, Edinburgh, Scotland, UK, June 26\u2013July 1, 2012. icml.cc\/Omnipress. http:\/\/icml.cc\/2012\/papers\/861.pdf"},{"issue":"2","key":"144_CR17","doi-asserted-by":"publisher","first-page":"253","DOI":"10.1007\/s00778-013-0348-4","volume":"23","author":"C Koch","year":"2014","unstructured":"Koch C, Ahmad Y, Kennedy O, Nikolic M, N\u00f6tzli A, Lupei D, Shaikhha A (2014) Dbtoaster: higher-order delta processing for dynamic, frequently fresh views. VLDB J 23(2):253\u2013278. https:\/\/doi.org\/10.1007\/s00778-013-0348-4","journal-title":"VLDB J"},{"issue":"4","key":"144_CR18","doi-asserted-by":"publisher","first-page":"379","DOI":"10.1007\/s41019-018-0074-4","volume":"3","author":"K Li","year":"2018","unstructured":"Li K, Li G (2018) Approximate query processing: What is new and where to go? A survey on approximate query processing. Data Sci Eng 3(4):379\u2013397. https:\/\/doi.org\/10.1007\/s41019-018-0074-4","journal-title":"Data Sci Eng"},{"key":"144_CR19","doi-asserted-by":"publisher","unstructured":"Lin CX, Ding B, Han J, Zhu F, Zhao B (2008) Text cube: computing IR measures for multidimensional text database analysis. In: Proceedings of the 8th IEEE international conference on data mining (ICDM 2008), December 15\u201319, 2008, Pisa, Italy, IEEE Computer Society, pp 905\u2013910 (2008). https:\/\/doi.org\/10.1109\/ICDM.2008.135","DOI":"10.1109\/ICDM.2008.135"},{"issue":"12","key":"144_CR20","doi-asserted-by":"publisher","first-page":"2456","DOI":"10.1109\/TVCG.2013.179","volume":"19","author":"LD Lins","year":"2013","unstructured":"Lins LD, Klosowski JT, Scheidegger CE (2013) Nanocubes for real-time exploration of spatiotemporal datasets. IEEE Trans Vis Comput Graph 19(12):2456\u20132465. https:\/\/doi.org\/10.1109\/TVCG.2013.179","journal-title":"IEEE Trans Vis Comput Graph"},{"issue":"3","key":"144_CR21","doi-asserted-by":"publisher","first-page":"421","DOI":"10.1111\/cgf.12129","volume":"32","author":"Z Liu","year":"2013","unstructured":"Liu Z, Jiang B, Heer J (2013) imMens: real-time visual querying of big data. Comput Graph Forum 32(3):421\u2013430. https:\/\/doi.org\/10.1111\/cgf.12129","journal-title":"Comput Graph Forum"},{"key":"144_CR22","doi-asserted-by":"publisher","unstructured":"Palpanas T, Sidle R, Cochrane R, Pirahesh H (2002) Incremental maintenance for non-distributive aggregate functions. In: Proceedings of 28th international conference on very large data bases, VLDB 2002, Hong Kong, August 20\u201323, 2002, Morgan Kaufmann, pp 802\u2013813. https:\/\/doi.org\/10.1016\/B978-155860869-6\/50076-7. http:\/\/www.vldb.org\/conf\/2002\/S22P04.pdf","DOI":"10.1016\/B978-155860869-6\/50076-7"},{"key":"144_CR23","doi-asserted-by":"publisher","unstructured":"Park Y, Mozafari B, Sorenson J, Wang J (2018) Verdictdb: universalizing approximate query processing. In: Das G, Jermaine CM, Bernstein PA (eds) Proceedings of the 2018 international conference on management of data, SIGMOD conference 2018, Houston, TX, USA, June 10\u201315, ACM, pp 1461\u20131476 (2018). https:\/\/doi.org\/10.1145\/3183713.3196905","DOI":"10.1145\/3183713.3196905"},{"key":"144_CR24","doi-asserted-by":"publisher","unstructured":"Parr T, Fisher K (2011) Ll(*): the foundation of the ANTLR parser generator. In: Hall MW, Padua DA (eds) Proceedings of the 32nd ACM SIGPLAN conference on programming language design and implementation, PLDI 2011, San Jose, CA, USA, June 4\u20138, 2011, ACM, pp 425\u2013436. https:\/\/doi.org\/10.1145\/1993498.1993548","DOI":"10.1145\/1993498.1993548"},{"key":"144_CR25","doi-asserted-by":"publisher","unstructured":"Pol A, Jermaine C (2005) Relational confidence bounds are easy with the bootstrap. In: \u00d6zcan F (ed) Proceedings of the ACM SIGMOD international conference on management of data, Baltimore, Maryland, USA, June 14\u201316, 2005, ACM, pp 587\u2013598. https:\/\/doi.org\/10.1145\/1066157.1066224","DOI":"10.1145\/1066157.1066224"},{"key":"144_CR26","unstructured":"Rice JA (2006) Mathematical statistics and data analysis. Cengage Learning"},{"key":"144_CR27","doi-asserted-by":"publisher","DOI":"10.1002\/9781118771075","volume-title":"Mathematical statistics: an introduction to likelihood based inference","author":"RJ Rossi","year":"2018","unstructured":"Rossi RJ (2018) Mathematical statistics: an introduction to likelihood based inference. Wiley, New York"},{"key":"144_CR28","doi-asserted-by":"crossref","unstructured":"Wu Z, Jing Y, He Z, Guo C, Wang XS (2019) Polytope: a flexible sampling system for answering exploratory queries. World Wide Web, pp 1\u201322","DOI":"10.1007\/s11280-019-00685-x"},{"key":"144_CR29","doi-asserted-by":"publisher","unstructured":"Zeng K, Agarwal S, Stoica I (2016) iolap: managing uncertainty for efficient incremental OLAP. In: \u00d6zcan F, Koutrika G, Madden S (eds) Proceedings of the 2016 international conference on management of data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26\u2013July 01, ACM, pp 1347\u20131361 (2016). https:\/\/doi.org\/10.1145\/2882903.2915240","DOI":"10.1145\/2882903.2915240"},{"key":"144_CR30","doi-asserted-by":"publisher","unstructured":"Zeng K, Gao S, Mozafari B, Zaniolo C (2014) The analytical bootstrap: a new method for fast error estimation in approximate query processing. In: Dyreson CE, Li F, \u00d6zsu MT (eds) International conference on management of data, SIGMOD 2014, Snowbird, UT, USA, June 22\u201327, 2014, ACM, pp 277\u2013288. https:\/\/doi.org\/10.1145\/2588555.2588579","DOI":"10.1145\/2588555.2588579"},{"issue":"8","key":"144_CR31","doi-asserted-by":"publisher","first-page":"1977","DOI":"10.1109\/TVCG.2016.2607714","volume":"23","author":"E Zgraggen","year":"2017","unstructured":"Zgraggen E, Galakatos A, Crotty A, Fekete J, Kraska T (2017) How progressive visualizations affect exploratory analysis. IEEE Trans Vis Comput Graph 23(8):1977\u20131987. https:\/\/doi.org\/10.1109\/TVCG.2016.2607714","journal-title":"IEEE Trans Vis Comput Graph"},{"key":"144_CR32","doi-asserted-by":"publisher","unstructured":"Zhang S, Sun C, He Z (2016) Listmerge: accelerating top-k aggregation queries over large number of lists. In: Navathe SB, Wu W, Shekhar S, Du X, Wang XS, Xiong S (eds) Database systems for advanced applications\u201421st international conference, DASFAA 2016, Dallas, TX, USA, April 16\u201319, 2016, Proceedings, Part II, lecture notes in computer science, vol 9643, Springer, pp 67\u201381.https:\/\/doi.org\/10.1007\/978-3-319-32049-6_5","DOI":"10.1007\/978-3-319-32049-6_5"}],"container-title":["Data Science and Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s41019-020-00144-y.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s41019-020-00144-y\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s41019-020-00144-y.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,10,22]],"date-time":"2021-10-22T00:02:12Z","timestamp":1634860932000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s41019-020-00144-y"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,10,22]]},"references-count":32,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2021,3]]}},"alternative-id":["144"],"URL":"https:\/\/doi.org\/10.1007\/s41019-020-00144-y","relation":{},"ISSN":["2364-1185","2364-1541"],"issn-type":[{"value":"2364-1185","type":"print"},{"value":"2364-1541","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,10,22]]},"assertion":[{"value":"1 June 2020","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"20 August 2020","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"5 October 2020","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"22 October 2020","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}