{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:29:05Z","timestamp":1750220945389,"version":"3.41.0"},"reference-count":42,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2018,12,11]],"date-time":"2018-12-11T00:00:00Z","timestamp":1544486400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["SIGKDD Explor. Newsl."],"published-print":{"date-parts":[[2018,12,11]]},"abstract":"<jats:p>As data reported by humans about our world, text data play a very important role in all data mining applications, yet how to develop a general text analysis system to support all text mining applications is a difficult challenge. In this position paper, we introduce SOFSAT, a new framework that can support set-like operators for semantic analysis of natural text data with variable text representations. It includes three basic set-like operators (TextIntersect, TextUnion, and TextDifference) that are analogous to the corresponding set operators intersection, union, and difference, respectively, which can be applied to any representation of text data, and different representations can be combined via transformation functions that map text to and from any representation. Just as the set operators can be flexibly combined iteratively to construct arbitrary subsets or supersets based on some given sets, we show that the corresponding text analysis operators can also be combined flexibly to support a wide range of analysis tasks that may require different workflows, thus enabling an application developer to \"program\" a text mining application by using SOFSAT as an application programming language for text analysis. 
We discuss instantiations and implementation strategies of the framework with some specific examples, present ideas about how the framework can be implemented by exploiting\/extending existing techniques, and provide a roadmap for future research in this new direction.<\/jats:p>","DOI":"10.1145\/3299986.3299990","type":"journal-article","created":{"date-parts":[[2018,12,12]],"date-time":"2018-12-12T12:49:32Z","timestamp":1544618972000},"page":"21-30","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["SOFSAT"],"prefix":"10.1145","volume":"20","author":[{"given":"Shubhra Kanti","family":"Karmaker Santu","sequence":"first","affiliation":[{"name":"University of Illinois Urbana-Champaign, Champaign, IL, USA"}]},{"given":"Chase","family":"Geigle","sequence":"additional","affiliation":[{"name":"University of Illinois Urbana-Champaign, Champaign, IL, USA"}]},{"given":"Duncan","family":"Ferguson","sequence":"additional","affiliation":[{"name":"University of Illinois Urbana-Champaign, Champaign, IL, USA"}]},{"given":"William","family":"Cope","sequence":"additional","affiliation":[{"name":"University of Illinois Urbana-Champaign, Champaign, IL, USA"}]},{"given":"Mary","family":"Kalantzis","sequence":"additional","affiliation":[{"name":"University of Illinois Urbana-Champaign, Champaign, IL, USA"}]},{"given":"Duane","family":"Searsmith","sequence":"additional","affiliation":[{"name":"University of Illinois Urbana-Champaign, Champaign, IL, USA"}]},{"given":"Chengxiang","family":"Zhai","sequence":"additional","affiliation":[{"name":"University of Illinois Urbana-Champaign, Champaign, IL, USA"}]}],"member":"320","published-online":{"date-parts":[[2018,12,11]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"Apache lucene. https:\/\/lucene.apache.org\/. Accessed: 2018-05--14.  Apache lucene. https:\/\/lucene.apache.org\/. 
Accessed: 2018-05--14."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-85836-2_29"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.5555\/2331533"},{"key":"e_1_2_1_4_1","volume-title":"Short-term memory for word sequences as a function of acoustic, semantic and formal sim- ilarity. Quarterly journal of experimental psychology, 18(4):362{365","author":"Baddeley A. D.","year":"1966","unstructured":"A. D. Baddeley . Short-term memory for word sequences as a function of acoustic, semantic and formal sim- ilarity. Quarterly journal of experimental psychology, 18(4):362{365 , 1966 . A. D. Baddeley. Short-term memory for word sequences as a function of acoustic, semantic and formal sim- ilarity. Quarterly journal of experimental psychology, 18(4):362{365, 1966."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1162\/089120105774321091"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.3115\/1219044.1219075"},{"key":"e_1_2_1_8_1","volume-title":"Latent dirich- let allocation. Journal of machine Learning research, 3(Jan):993{1022","author":"Blei D. M.","year":"2003","unstructured":"D. M. Blei , A. Y. Ng , and M. I. Jordan . Latent dirich- let allocation. Journal of machine Learning research, 3(Jan):993{1022 , 2003 . D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirich- let allocation. Journal of machine Learning research, 3(Jan):993{1022, 2003."},{"key":"e_1_2_1_9_1","first-page":"269","volume-title":"NIST SPECIAL PUBLICATION SP","author":"Cavnar W.","year":"1995","unstructured":"W. Cavnar . Using an n-gram-based document represen- tation with a vector processing retrieval model . NIST SPECIAL PUBLICATION SP , pages 269{ 269 , 1995 . W. Cavnar. Using an n-gram-based document represen- tation with a vector processing retrieval model. 
NIST SPECIAL PUBLICATION SP, pages 269{269, 1995."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/800296.811515"},{"key":"e_1_2_1_11_1","volume-title":"Proceedings of Eleventh International World Wide Web Conference","author":"Choudhary B.","year":"2002","unstructured":"B. Choudhary and P. Bhattacharyya . Text clustering using universal networking language representation . In Proceedings of Eleventh International World Wide Web Conference , 2002 . B. Choudhary and P. Bhattacharyya. Text clustering using universal networking language representation. In Proceedings of Eleventh International World Wide Web Conference, 2002."},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/362384.362685"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.compcom.2011.04.007"},{"key":"e_1_2_1_14_1","volume-title":"Document embedding with paragraph vectors. arXiv preprint arXiv:1507.07998","author":"Dai A. M.","year":"2015","unstructured":"A. M. Dai , C. Olah , and Q. V. Le . Document embedding with paragraph vectors. arXiv preprint arXiv:1507.07998 , 2015 . A. M. Dai, C. Olah, and Q. V. Le. Document embedding with paragraph vectors. arXiv preprint arXiv:1507.07998, 2015."},{"key":"e_1_2_1_15_1","first-page":"185","volume-title":"Proceedings of the Confer- ence on Empirical Methods in Natural Language Pro- cessing","author":"Filippova K.","unstructured":"K. Filippova and M. Strube . Sentence fusion via depen- dency graph compression . In Proceedings of the Confer- ence on Empirical Methods in Natural Language Pro- cessing , pages 177{ 185 . Association for Computational Linguistics, 2008. K. Filippova and M. Strube. Sentence fusion via depen- dency graph compression. In Proceedings of the Confer- ence on Empirical Methods in Natural Language Pro- cessing, pages 177{185. 
Association for Computational Linguistics, 2008."},{"key":"e_1_2_1_16_1","first-page":"45","volume-title":"Feature Engineering for Machine Learning and Data Analytics","author":"Geigle C.","unstructured":"C. Geigle , Q. Mei , and C. Zhai . Feature engineering for text data . In G. Dong and H. Liu, editors, Feature Engineering for Machine Learning and Data Analytics , Chapman & Hall\/CRC Data Mining and Knowledge Discovery Series , pages 15{ 45 . CRC Press, 2018. C. Geigle, Q. Mei, and C. Zhai. Feature engineering for text data. In G. Dong and H. Liu, editors, Feature Engineering for Machine Learning and Data Analytics, Chapman & Hall\/CRC Data Mining and Knowledge Discovery Series, pages 15{45. CRC Press, 2018."},{"key":"e_1_2_1_17_1","first-page":"119","volume-title":"Special Issue on RTIPPR (2)","author":"Harish B. S.","year":"2010","unstructured":"B. S. Harish , D. S. Guru , and S. Manjunath . Repre- sentation and classi cation of text documents: A brief review. IJCA , Special Issue on RTIPPR (2) , pages 110{ 119 , 2010 . B. S. Harish, D. S. Guru, and S. Manjunath. Repre- sentation and classi cation of text documents: A brief review. IJCA, Special Issue on RTIPPR (2), pages 110{ 119, 2010."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D15-1181"},{"key":"e_1_2_1_19_1","first-page":"948","volume-title":"Proceedings of the 2016 Conference of the North American Chapter of the Association for Com- putational Linguistics: Human Language Technologies","author":"He H.","year":"2016","unstructured":"H. He and J. Lin . Pairwise word interaction modeling with deep neural networks for semantic similarity mea- surement . In Proceedings of the 2016 Conference of the North American Chapter of the Association for Com- putational Linguistics: Human Language Technologies , pages 937{ 948 , 2016 . H. He and J. Lin. Pairwise word interaction modeling with deep neural networks for semantic similarity mea- surement. 
In Proceedings of the 2016 Conference of the North American Chapter of the Association for Com- putational Linguistics: Human Language Technologies, pages 937{948, 2016."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/3130348.3130370"},{"issue":"4","key":"e_1_2_1_22_1","first-page":"54","article-title":"Ontology-based text document clustering","volume":"16","author":"Hotho A.","year":"2002","unstructured":"A. Hotho , A. Maedche , and S. Staab . Ontology-based text document clustering . KI , 16 ( 4 ):48{ 54 , 2002 . A. Hotho, A. Maedche, and S. Staab. Ontology-based text document clustering. KI, 16(4):48{54, 2002.","journal-title":"KI"},{"key":"e_1_2_1_23_1","volume-title":"E ects of high-order co- occurrences on word semantic similarity. Current psy- chology letters. Behaviour, brain & cognition, (18","author":"Lemaire B.","year":"2006","unstructured":"B. Lemaire and G. Denhiere . E ects of high-order co- occurrences on word semantic similarity. Current psy- chology letters. Behaviour, brain & cognition, (18 , Vol. 1 , 2006 ), 2006. B. Lemaire and G. Denhiere. E ects of high-order co- occurrences on word semantic similarity. Current psy- chology letters. Behaviour, brain & cognition, (18, Vol. 1, 2006), 2006."},{"key":"e_1_2_1_24_1","first-page":"2901","volume-title":"Proceedings of COLING 2016, the 26th International Conference on Computa- tional Linguistics: Technical Papers","author":"Levy O.","year":"2016","unstructured":"O. Levy , I. Dagan , G. Stanovsky , J. Eckle-Kohler , and I. Gurevych . Modeling extractive sentence intersec- tion via subtree entailment . In Proceedings of COLING 2016, the 26th International Conference on Computa- tional Linguistics: Technical Papers , pages 2891{ 2901 , 2016 . O. Levy, I. Dagan, G. Stanovsky, J. Eckle-Kohler, and I. Gurevych. Modeling extractive sentence intersec- tion via subtree entailment. 
In Proceedings of COLING 2016, the 26th International Conference on Computa- tional Linguistics: Technical Papers, pages 2891{2901, 2016."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2003.1209005"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1093\/comjnl\/41.8.537"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P16-4016"},{"key":"e_1_2_1_28_1","first-page":"320","volume-title":"Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics","author":"McKeown K.","unstructured":"K. McKeown , S. Rosenthal , K. Thadani , and C. Moore . Time-efficient creation of an accurate sentence fusion corpus . In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics , pages 317{ 320 . Association for Computational Linguistics, 2010. K. McKeown, S. Rosenthal, K. Thadani, and C. Moore. Time-efficient creation of an accurate sentence fusion corpus. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 317{320. Association for Computational Linguistics, 2010."},{"key":"e_1_2_1_29_1","first-page":"780","volume-title":"AAAI","volume":"6","author":"Mihalcea R.","year":"2006","unstructured":"R. Mihalcea , C. Corley , C. Strapparava , Corpus- based and knowledge-based measures of text semantic similarity . In AAAI , volume 6 , pages 775{ 780 , 2006 . R. Mihalcea, C. Corley, C. Strapparava, et al. Corpus- based and knowledge-based measures of text semantic similarity. In AAAI, volume 6, pages 775{780, 2006."},{"key":"e_1_2_1_30_1","first-page":"3119","volume-title":"Advances in Neu- ral Information Processing Systems 26","author":"Mikolov T.","unstructured":"T. Mikolov , I. Sutskever , K. Chen , G. S. Corrado , and J. Dean . 
Distributed representations of words and phrases and their compositionality . In Advances in Neu- ral Information Processing Systems 26 , pages 3111{ 3119 . 2013. T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neu- ral Information Processing Systems 26, pages 3111{ 3119. 2013."},{"key":"e_1_2_1_31_1","first-page":"2792","volume-title":"AAAI","author":"Mueller J.","year":"2016","unstructured":"J. Mueller and A. Thyagarajan . Siamese recurrent ar- chitectures for learning sentence similarity . In AAAI , pages 2786{ 2792 , 2016 . J. Mueller and A. Thyagarajan. Siamese recurrent ar- chitectures for learning sentence similarity. In AAAI, pages 2786{2792, 2016."},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1162"},{"key":"e_1_2_1_33_1","volume-title":"Proceedings of the 6th International Conference on Educational Data Mining (EDM 2013)","author":"Piech C.","year":"2013","unstructured":"C. Piech , J. Huang , Z. Chen , C. Do , A. Ng , and D. Koller . Tuned models of peer assessment in moocs . In Proceedings of the 6th International Conference on Educational Data Mining (EDM 2013) , 2013 . C. Piech, J. Huang, Z. Chen, C. Do, A. Ng, and D. Koller. Tuned models of peer assessment in moocs. In Proceedings of the 6th International Conference on Educational Data Mining (EDM 2013), 2013."},{"key":"e_1_2_1_34_1","doi-asserted-by":"crossref","unstructured":"L. R. Rabiner. Readings in speech recognition. chapter A Tutorial on Hidden Markov Models and Selected Ap- plications in Speech Recognition pages 267{296. 1990.   L. R. Rabiner. Readings in speech recognition. chapter A Tutorial on Hidden Markov Models and Selected Ap- plications in Speech Recognition pages 267{296. 
1990.","DOI":"10.1016\/B978-0-08-051584-7.50027-9"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/361219.361220"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICSC.2007.107"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.19173\/irrodl.v15i3.1680"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/S15-2027"},{"key":"e_1_2_1_39_1","volume-title":"Fish oil, raynaud's syndrome, and undiscovered public knowledge. Perspectives in biology and medicine, 30(1):7{18","author":"Swanson D. R.","year":"1986","unstructured":"D. R. Swanson . Fish oil, raynaud's syndrome, and undiscovered public knowledge. Perspectives in biology and medicine, 30(1):7{18 , 1986 . D. R. Swanson. Fish oil, raynaud's syndrome, and undiscovered public knowledge. Perspectives in biology and medicine, 30(1):7{18, 1986."},{"key":"e_1_2_1_40_1","first-page":"53","volume-title":"Proceedings of the Workshop on Monolingual Text-To- Text Generation","author":"Thadani K.","unstructured":"K. Thadani and K. McKeown . Towards strict sentence intersection: decoding and evaluation strategies . In Proceedings of the Workshop on Monolingual Text-To- Text Generation , pages 43{ 53 . Association for Compu- tational Linguistics, 2011. K. Thadani and K. McKeown. Towards strict sentence intersection: decoding and evaluation strategies. In Proceedings of the Workshop on Monolingual Text-To- Text Generation, pages 43{53. 
Association for Compu- tational Linguistics, 2011."},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/1390334.1390387"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.dss.2007.07.008"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/2915031"}],"container-title":["ACM SIGKDD Explorations Newsletter"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3299986.3299990","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3299986.3299990","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T23:53:39Z","timestamp":1750204419000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3299986.3299990"}},"subtitle":["Towards a Setlike Operator based Framework for Semantic Analysis of Text"],"short-title":[],"issued":{"date-parts":[[2018,12,11]]},"references-count":42,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2018,12,11]]}},"alternative-id":["10.1145\/3299986.3299990"],"URL":"https:\/\/doi.org\/10.1145\/3299986.3299990","relation":{},"ISSN":["1931-0145","1931-0153"],"issn-type":[{"type":"print","value":"1931-0145"},{"type":"electronic","value":"1931-0153"}],"subject":[],"published":{"date-parts":[[2018,12,11]]},"assertion":[{"value":"2018-12-11","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}