{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,12]],"date-time":"2026-03-12T16:05:14Z","timestamp":1773331514372,"version":"3.50.1"},"reference-count":57,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2015,6,4]],"date-time":"2015-06-04T00:00:00Z","timestamp":1433376000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Government of India (Ref. No.: ITRA\/15"},{"DOI":"10.13039\/501100002183","name":"DeITY","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100002183","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Indo-German Max Planck Centre for Computer Science"},{"name":"Information Technology Research Academy"},{"name":"Postdoctoral fellowship from the Alexander von Humboldt Foundation"},{"name":"Fellowship from Tata Consultancy Services"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Web"],"published-print":{"date-parts":[[2015,6,20]]},"abstract":"<jats:p>\n            Analysis of content streams gathered from social networking sites such as Twitter has several applications ranging from content search and recommendation, news detection to business analytics. However, processing large amounts of data generated on these sites in real-time poses a difficult challenge. To cope with the data deluge, analytics companies and researchers are increasingly resorting to sampling. In this article, we investigate the crucial question of\n            <jats:italic>how to sample content streams generated by users in online social networks<\/jats:italic>\n            . The traditional method is to randomly sample all the data. For example, most studies using Twitter data today rely on the 1% and 10% randomly sampled streams of tweets that are provided by Twitter. 
In this paper, we analyze a different sampling methodology, one where content is gathered only from a relatively small sample (&lt;1%) of the user population, namely, the\n            <jats:italic>expert users<\/jats:italic>\n            . Over the duration of a month, we gathered tweets from over 500,000 Twitter users who are identified as experts on a diverse set of topics, and compared the resulting expert sampled tweets with the 1% randomly sampled tweets provided publicly by Twitter. We compared the sampled datasets along several dimensions, including the popularity, topical diversity, trustworthiness, and timeliness of the information contained within them, and on the sentiment\/opinion expressed on specific topics. Our analysis reveals several important differences in data obtained through the different sampling methodologies, which have serious implications for applications such as topical search, trustworthy content recommendations, breaking news detection, and opinion mining.\n          <\/jats:p>","DOI":"10.1145\/2743023","type":"journal-article","created":{"date-parts":[[2015,6,8]],"date-time":"2015-06-08T15:11:11Z","timestamp":1433776271000},"page":"1-33","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":22,"title":["Sampling Content from Online Social Networks"],"prefix":"10.1145","volume":"9","author":[{"given":"Muhammad Bilal","family":"Zafar","sequence":"first","affiliation":[{"name":"Max Planck Institute for Software Systems, Germany"}]},{"given":"Parantapa","family":"Bhattacharya","sequence":"additional","affiliation":[{"name":"Indian Institute of Technology Kharagpur, India; Max Planck Institute for Software Systems, Germany"}]},{"given":"Niloy","family":"Ganguly","sequence":"additional","affiliation":[{"name":"Indian Institute of Technology Kharagpur, India"}]},{"given":"Krishna P.","family":"Gummadi","sequence":"additional","affiliation":[{"name":"Max Planck Institute for Software 
Systems, Germany"}]},{"given":"Saptarshi","family":"Ghosh","sequence":"additional","affiliation":[{"name":"Max Planck Institute for Software Systems, Germany; Indian Institute of Engineering Science and Technology Shibpur, India"}]}],"member":"320","published-online":{"date-parts":[[2015,6,4]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/1571941.1572033"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/2505515.2505525"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/WI-IAT.2010.63"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/2531602.2531636"},{"key":"e_1_2_1_5_1","article-title":"Latent Dirichlet allocation","author":"Blei David M.","year":"2003","unstructured":"David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. The Journal of Machine Learning Research 3 (March 2003), 993--1022.","journal-title":"The Journal of Machine Learning Research 3"},{"key":"e_1_2_1_6_1","volume-title":"Technical Report C-1, Center for Research in Psychophysiology","author":"Bradley M. M.","year":"1999","unstructured":"M. M. Bradley and P. J. Lang. 1999. Affective norms for English words (ANEW): Instruction manual and affective ratings. 
Technical Report C-1, Center for Research in Psychophysiology, University of Florida (1999)."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.5555\/297805.297827"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/MSP.2007.914731"},{"key":"e_1_2_1_9_1","volume-title":"Gummadi","author":"Cha Meeyoung","year":"2010","unstructured":"Meeyoung Cha, Hamed Haddadi, Fabricio Benevenuto, and Krishna P. Gummadi. 2010. Measuring user influence in Twitter: The million follower fallacy. In Proceedings of International AAAI Conference on Weblogs and Social Media (ICWSM\u201910). AAAI Press."},{"key":"e_1_2_1_10_1","volume-title":"Proceedings of International AAAI Conference on Weblogs and Social Media (ICWSM\u201911)","author":"Choudhury Munmun De","year":"2011","unstructured":"Munmun De Choudhury, Scott Counts, and Mary Czerwinski. 2011a. Find me the right content! Diversity-based sampling of social media spaces for topic-centric search. In Proceedings of International AAAI Conference on Weblogs and Social Media (ICWSM\u201911). AAAI Press."},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/1995966.1995990"},{"key":"e_1_2_1_12_1","volume-title":"How does the data sampling strategy impact the discovery of information diffusion in social media?","author":"Choudhury Munmun De","unstructured":"Munmun De Choudhury, Yu-Ru Lin, Hari Sundaram, K. 
Selcuk Candan, Lexing Xie, and Aisling Kelliher. 2010. How does the data sampling strategy impact the discovery of information diffusion in social media? In Proceedings of International AAAI Conference on Weblogs and Social Media (ICWSM\u201910). The AAAI Press."},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.chb.2004.11.013"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.2307\/2325486"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1016\/0378-8733(78)90015-1"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/2348283.2348361"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/2187836.2187846"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/2505515.2505615"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.5555\/1833515.1833840"},{"key":"e_1_2_1_20_1","volume-title":"Assessing the bias in samples of large online networks. Social Networks 38 (July","author":"Gonzalez-Bailon Sandra","year":"2014","unstructured":"Sandra Gonzalez-Bailon, Ning Wang, Alejandro Rivero, Javier Borge-Holthoefer, and Yamir Moreno. 2014. Assessing the bias in samples of large online networks. 
Social Networks 38 (July 2014), 16--27."},{"key":"e_1_2_1_21_1","volume-title":"Proceedings of NAACL HLT Workshop on Creating Speech and Language Data with Amazon\u2019s Mechanical Turk (CSLDAMT","author":"Grady Catherine","year":"2010","unstructured":"Catherine Grady and Matthew Lease. 2010. Crowdsourcing document relevance assessment with Mechanical Turk. In Proceedings of NAACL HLT Workshop on Creating Speech and Language Data with Amazon\u2019s Mechanical Turk (CSLDAMT 2010). Association for Computational Linguistics, Stroudsburg, PA, USA, 172--179."},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1086\/226224"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/1866307.1866311"},{"key":"e_1_2_1_24_1","volume-title":"Proceedings of International Conference on Very Large Data Bases (VLDB) -","volume":"30","author":"Gy\u00f6ngyi Zolt\u00e1n","year":"2004","unstructured":"Zolt\u00e1n Gy\u00f6ngyi, Hector Garcia-Molina, and Jan Pedersen. 2004. Combating web spam with trustrank. In Proceedings of International Conference on Very Large Data Bases (VLDB) - Volume 30. VLDB Endowment, 576--587."},{"key":"e_1_2_1_25_1","volume-title":"Proceedings of International AAAI Conference on Weblogs and Social Media (ICWSM\u201912)","author":"Hannak Aniko","year":"2012","unstructured":"Aniko Hannak, Eric Anderson, Lisa Feldman Barrett, Sune Lehmann, Alan Mislove, and Mirek Riedewald. 2012. 
Tweetin\u2019 in the rain: Exploring societal-scale effects of weather on mood. In Proceedings of International AAAI Conference on Weblogs and Social Media (ICWSM\u201912). AAAI Press, Dublin, Ireland."},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/1963405.1963489"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIT.1967.1054019"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/1397735.1397741"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/1772690.1772751"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/1150402.1150479"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/2348283.2348380"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/2020408.2020476"},{"key":"e_1_2_1_33_1","unstructured":"lists-howtouse. 2013. Twitter Help Center\u2014Using Twitter Lists. Retrieved from https:\/\/support.twitter.com\/articles\/76460-using-twitter-lists."},{"key":"e_1_2_1_34_1","volume-title":"Web Data Mining: Exploring Hyperlinks, Contents and Usage Data","author":"Liu Bing","unstructured":"Bing Liu. 2006. Web Data Mining: Exploring Hyperlinks, Contents and Usage Data. 
Springer-Verlag."},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/1807167.1807306"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/2567948.2576952"},{"key":"e_1_2_1_37_1","volume-title":"Proceedings of International AAAI Conference on Weblogs and Social Media (ICWSM\u201913)","author":"Morstatter Fred","unstructured":"Fred Morstatter, J\u00fcrgen Pfeffer, Huan Liu, and Kathleen M. Carley. 2013. Is the sample good enough? Comparing data from Twitter\u2019s streaming API with Twitter\u2019s firehose. In Proceedings of International AAAI Conference on Weblogs and Social Media (ICWSM\u201913). AAAI Press."},{"key":"e_1_2_1_38_1","volume-title":"A C\/C++","author":"Phan Xuan-Hieu","unstructured":"Xuan-Hieu Phan and Cam-Tu Nguyen. 2007. GibbsLDA++: A C\/C++ Implementation of Latent Dirichlet Allocation (LDA). Retrieved from http:\/\/gibbslda.sourceforge.net\/."},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1001\/archinte.1990.00390200068013"},{"key":"e_1_2_1_40_1","volume-title":"Proceedings of International AAAI Conference on Weblogs and Social Media (ICWSM\u201910)","author":"Ramage Daniel","year":"2010","unstructured":"Daniel Ramage, Susan Dumais, and Dan Liebling. 2010. 
Characterizing microblogs with topic models. In Proceedings of International AAAI Conference on Weblogs and Social Media (ICWSM\u201910). AAAI Press."},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/MSP.2007.914729"},{"key":"e_1_2_1_42_1","volume-title":"Proceedings of the AAAI Symposium on Using Uncertainty within Computation. AAAI Press, 121--128","author":"Rusmevichientong Paat","unstructured":"Paat Rusmevichientong, David M. Pennock, Steve Lawrence, and C. Lee Giles. 2001. Methods for sampling pages uniformly from the world wide web. In Proceedings of the AAAI Symposium on Using Uncertainty within Computation. AAAI Press, 121--128."},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/1772690.1772777"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/1653771.1653781"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/2377677.2377782"},{"key":"e_1_2_1_46_1","unstructured":"spritzer-gnip-blog. 2011. Guide to the Twitter API\u2014Part 3 of 3: An Overview of Twitter\u2019s Streaming API. Retrieved from http:\/\/blog.gnip.com\/tag\/spritzer\/."},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/1935826.1935842"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/2068816.2068840"},{"key":"e_1_2_1_49_1","volume-title":"Proceedings of International AAAI Conference on Weblogs and Social Media (ICWSM\u201910)","author":"Tumasjan A.","unstructured":"A. Tumasjan, T. Sprenger, P. Sandner, and I. Welpe. 2010. 
Predicting elections with Twitter: What 140 characters reveal about political sentiment. In Proceedings of International AAAI Conference on Weblogs and Social Media (ICWSM\u201910). AAAI Press, 178--185."},{"key":"e_1_2_1_50_1","unstructured":"twitter-rate-limit. 2013. Rate Limiting\u2014Twitter Developers. Retrieved from https:\/\/dev.twitter.com\/docs\/rate-limiting."},{"key":"e_1_2_1_51_1","unstructured":"Twitter-stats. 2014. Twitter Statistics\u2014Statistics Brain. Retrieved from http:\/\/www.statisticbrain.com\/twitter-statistics\/."},{"key":"e_1_2_1_52_1","unstructured":"Twitter-stream-api. 2012. GET Statuses\/Sample\u2014Twitter Developers. 
Retrieved from https:\/\/dev.twitter.com\/docs\/api\/1\/get\/statuses\/sample."},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1109\/SocialCom-PASSAT.2012.30"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/1963405.1963504"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1145\/2187836.2187872"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1145\/2339530.2339591"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1145\/1963405.1963443"}],"container-title":["ACM Transactions on the Web"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2743023","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2743023","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T05:07:15Z","timestamp":1750223235000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2743023"}},"subtitle":["Comparing Random vs. 
Expert Sampling of the Twitter Stream"],"short-title":[],"issued":{"date-parts":[[2015,6,4]]},"references-count":57,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2015,6,20]]}},"alternative-id":["10.1145\/2743023"],"URL":"https:\/\/doi.org\/10.1145\/2743023","relation":{},"ISSN":["1559-1131","1559-114X"],"issn-type":[{"value":"1559-1131","type":"print"},{"value":"1559-114X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2015,6,4]]},"assertion":[{"value":"2014-02-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2015-03-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2015-06-04","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}