{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,24]],"date-time":"2026-01-24T17:15:24Z","timestamp":1769274924892,"version":"3.49.0"},"reference-count":36,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2015,6,2]],"date-time":"2015-06-02T00:00:00Z","timestamp":1433203200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Singapore National Research Foundation under its International Research Centre@Singapore Funding Initiative"},{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","award":["NSF IIS-1160862"],"award-info":[{"award-number":["NSF IIS-1160862"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Media Development Authority"},{"DOI":"10.13039\/501100014790","name":"Singapore Management University","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100014790","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Web"],"published-print":{"date-parts":[[2015,6,20]]},"abstract":"<jats:p>Researchers have begun studying content obtained from microblogging services such as Twitter to address a variety of technological, social, and commercial research questions. The large number of Twitter users and even larger volume of tweets often make it impractical to collect and maintain a complete record of activity; therefore, most research and some commercial software applications rely on samples, often relatively small samples, of Twitter data. For the most part, sample sizes have been based on availability and practical considerations. Relatively little attention has been paid to how well these samples represent the underlying stream of Twitter data. 
To fill this gap, this article performs a comparative analysis on samples obtained from two of Twitter\u2019s streaming APIs with a more complete Twitter dataset to gain an in-depth understanding of the nature of Twitter data samples and their potential for use in various data mining tasks.<\/jats:p>","DOI":"10.1145\/2746366","type":"journal-article","created":{"date-parts":[[2015,6,2]],"date-time":"2015-06-02T18:19:47Z","timestamp":1433269187000},"page":"1-23","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":48,"title":["Should We Use the Sample? Analyzing Datasets Sampled from Twitter\u2019s Stream API"],"prefix":"10.1145","volume":"9","author":[{"given":"Yazhe","family":"Wang","sequence":"first","affiliation":[{"name":"Singapore Management University, Singapore"}]},{"given":"Jamie","family":"Callan","sequence":"additional","affiliation":[{"name":"Carnegie Mellon University, Pittsburgh, PA"}]},{"given":"Baihua","family":"Zheng","sequence":"additional","affiliation":[{"name":"Singapore Management University, Singapore"}]}],"member":"320","published-online":{"date-parts":[[2015,6,2]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/1242572.1242685"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/1935826.1935845"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/1644893.1644900"},{"key":"e_1_2_1_4_1","unstructured":"Shea Bennett. 2012. Twitter Now Seeing 400 Million Tweets per Day Increased Mobile Ad Revenue Says CEO@ONLINE. 
Retrieved from http:\/\/www.mediabistro.com\/alltwitter\/twitter-400-million-tweets_b23744."},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jocs.2010.12.007"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.socnet.2005.05.001"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/382979.383040"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/1298306.1298309"},{"key":"e_1_2_1_9_1","volume-title":"Proceedings of the 5th International Conference on Weblogs and Social Media (ICWSM\u201911)","author":"Choudhury Munmun De","year":"2011","unstructured":"Munmun De Choudhury, Scott Counts, and Mary Czerwinski. 2011. Find me the right content! Diversity-based sampling of social media spaces for topic-centric search. In Proceedings of the 5th International Conference on Weblogs and Social Media (ICWSM\u201911). The AAAI Press."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0378-8733(03)00012-1"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/2505515.2505615"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/1298306.1298310"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1086\/226224"},{"key":"e_1_2_1_14_1","volume-title":"Information Retrieval: Computational and Theoretical Aspects","author":"Heaps H. S.","year":"1978","unstructured":"H. S. Heaps. 1978. Information Retrieval: Computational and Theoretical Aspects. 
Academic Press, Inc., Orlando, FL."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/2187836.2187940"},{"key":"e_1_2_1_16_1","first-page":"1","article-title":"Social Networks that matter: Twitter under the microscope","volume":"14","author":"Huberman Bernardo A.","year":"2009","unstructured":"Bernardo A. Huberman, Daniel M. Romero, and Fang Wu. 2009. Social Networks that matter: Twitter under the microscope. First Monday 14, 1.","journal-title":"First Monday"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/1348549.1348556"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.socnet.2005.07.002"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/1397735.1397741"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/1150402.1150476"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/1772690.1772751"},{"key":"e_1_2_1_22_1","first-page":"1","article-title":"Statistical properties of sampled networks","volume":"73","author":"Lee SangHoon","year":"2006","unstructured":"SangHoon Lee, Pan-Jun Kim, Hawoong Jeong, and Fang Wu. 2006. Statistical properties of sampled networks. 
Physical Review E 73, 1.","journal-title":"Physical Review E"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/1150402.1150479"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/2020408.2020431"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/1807167.1807306"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/1298306.1298311"},{"key":"e_1_2_1_27_1","volume-title":"Proceedings of the 7th International Conference on Weblogs and Social Media (ICWSM\u201913)","author":"Morstatter Fred","unstructured":"Fred Morstatter, J\u00fcrgen Pfeffer, Huan Liu, and Kathleen M. Carley. 2013. Is the sample good enough? Comparing data from Twitter\u2019s streaming API with Twitter Firehose. In Proceedings of the 7th International Conference on Weblogs and Social Media (ICWSM\u201913). The AAAI Press."},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/1718918.1718953"},{"key":"e_1_2_1_29_1","volume-title":"Proceedings of the Ninth IT&T Conference. 13","author":"Ohana B.","unstructured":"B. Ohana and B. Tierney. 2009. Sentiment classification of reviews using SentiWordNet. In Proceedings of the Ninth IT&T Conference. 13."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/2063212.2063223"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/1772690.1772777"},{"key":"e_1_2_1_32_1","unstructured":"Semiocast. 2012. Twitter Reaches Half a Billion Accounts More Than 140 Millions in the U.S. @ONLINE. 
Retrieved from http:\/\/semiocast.com\/publications\/2012_07_30_Twitter_reaches_half_a_billion_accounts_140m_in_the_US."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/2309996.2310048"},{"key":"e_1_2_1_34_1","volume-title":"Proceedings of the 4th International Conference on Weblogs and Social Media (ICWSM\u201910)","author":"Cohen William W.","year":"2010","unstructured":"W. Cohen William and Gosling Samuel. 2010. How does the data sampling strategy impact the discovery of information diffusion in social media. In Proceedings of the 4th International Conference on Weblogs and Social Media (ICWSM\u201910). The AAAI Press."},{"key":"e_1_2_1_35_1","first-page":"1","article-title":"Statistical properties of sampled networks by random walk","volume":"73","author":"Yoon Sooyeon","year":"2006","unstructured":"Sooyeon Yoon, Sungmin Lee, Soon-Hyung Yook, and Yup Kin. 2006. Statistical properties of sampled networks by random walk. 
Physical Review E 73, 1.","journal-title":"Physical Review E"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.5555\/1996889.1996934"}],"container-title":["ACM Transactions on the Web"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2746366","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2746366","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T06:16:58Z","timestamp":1750227418000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2746366"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2015,6,2]]},"references-count":36,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2015,6,20]]}},"alternative-id":["10.1145\/2746366"],"URL":"https:\/\/doi.org\/10.1145\/2746366","relation":{},"ISSN":["1559-1131","1559-114X"],"issn-type":[{"value":"1559-1131","type":"print"},{"value":"1559-114X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2015,6,2]]},"assertion":[{"value":"2013-07-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2015-03-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2015-06-02","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}