{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,21]],"date-time":"2026-06-21T12:57:20Z","timestamp":1782046640929,"version":"3.54.5"},"reference-count":46,"publisher":"Springer Science and Business Media LLC","issue":"1","content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Bioinformatics"],"published-print":{"date-parts":[[2006,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec><jats:title>Background<\/jats:title><jats:p>Experimental techniques such as DNA microarray, serial analysis of gene expression (SAGE) and mass spectrometry proteomics, among others, are generating large amounts of data related to genes and proteins at different levels. As in any other experimental approach, it is necessary to analyze these data in the context of previously known information about the biological entities under study. The literature is a particularly valuable source of information for experiment validation and interpretation. Therefore, the development of automated text mining tools to assist in such interpretation is one of the main challenges in current bioinformatics research.<\/jats:p><\/jats:sec><jats:sec><jats:title>Results<\/jats:title><jats:p>We present a method to create literature profiles for large sets of genes or proteins based on common semantic features extracted from a corpus of relevant documents. These profiles can be used to establish pair-wise similarities among genes, utilized in gene\/protein classification or can be even combined with experimental measurements. Semantic features can be used by researchers to facilitate the understanding of the commonalities indicated by experimental results. Our approach is based on <jats:italic>non-negative matrix factorization<\/jats:italic> (NMF), a machine-learning algorithm for data analysis, capable of identifying local patterns that characterize a subset of the data. 
The literature is thus used to establish putative relationships among subsets of genes or proteins and to provide coherent justification for this clustering into subsets. We demonstrate the utility of the method by applying it to two independent and vastly different sets of genes.<\/jats:p><\/jats:sec><jats:sec><jats:title>Conclusion<\/jats:title><jats:p>The presented method can create literature profiles from documents relevant to sets of genes. The representation of genes as additive linear combinations of semantic features allows for the exploration of functional associations as well as for clustering, suggesting a valuable methodology for the validation and interpretation of high-throughput experimental data.<\/jats:p><\/jats:sec>","DOI":"10.1186\/1471-2105-7-41","type":"journal-article","created":{"date-parts":[[2006,2,4]],"date-time":"2006-02-04T19:14:13Z","timestamp":1139080453000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":61,"title":["Discovering semantic features in the literature: a foundation for building functional associations"],"prefix":"10.1186","volume":"7","author":[{"given":"Monica","family":"Chagoyen","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Pedro","family":"Carmona-Saez","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Hagit","family":"Shatkay","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jose 
M","family":"Carazo","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Alberto","family":"Pascual-Montano","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2006,1,26]]},"reference":[{"key":"780_CR1","doi-asserted-by":"publisher","first-page":"821","DOI":"10.1089\/106652703322756104","volume":"10","author":"H Shatkay","year":"2003","unstructured":"Shatkay H, Feldman R: Mining the biomedical literature in the genomic era: An overview. J Comput Biol 2003, 10: 821\u2013855.","journal-title":"J Comput Biol"},{"key":"780_CR2","doi-asserted-by":"publisher","first-page":"i91","DOI":"10.1093\/bioinformatics\/btg1011","volume":"19 Suppl 1","author":"PB Dobrokhotov","year":"2003","unstructured":"Dobrokhotov PB, Goutte C, Veuthey AL, Gaussier E: Combining NLP and probabilistic categorisation for document and term selection for Swiss-Prot medical annotation. Bioinformatics 2003, 19 Suppl 1: i91-i94.","journal-title":"Bioinformatics"},{"key":"780_CR3","first-page":"3","volume-title":"Proc 37th annual meeting of the Association for Computational Linguistics","author":"MA Hearst","year":"1999","unstructured":"Hearst MA: Untangling text data mining. Proc 37th annual meeting of the Association for Computational Linguistics 1999, 3\u201310."},{"key":"780_CR4","first-page":"21","volume":"28","author":"TK Jenssen","year":"2001","unstructured":"Jenssen TK, Laegreid A, Komorowski J, Hovig E: A literature network of human genes for high-throughput analysis of gene expression. 
Nat Genet 2001, 28: 21\u201328.","journal-title":"Nat Genet"},{"key":"780_CR5","doi-asserted-by":"publisher","first-page":"2049","DOI":"10.1093\/bioinformatics\/bti268","volume":"21","author":"R Jelier","year":"2005","unstructured":"Jelier R, Jenster G, Dorssers LC, van der Eijk CC, van Mulligen EM, Mons B, Kors JA: Co-occurrence based meta-analysis of scientific texts: retrieving biological relationships between genes. Bioinformatics 2005, 21: 2049\u20132058.","journal-title":"Bioinformatics"},{"key":"780_CR6","doi-asserted-by":"publisher","first-page":"191","DOI":"10.1093\/bioinformatics\/btg390","volume":"20","author":"JD Wren","year":"2004","unstructured":"Wren JD, Garner HR: Shared relationship analysis: ranking set cohesion and commonalities within a literature-derived relationship network. Bioinformatics 2004, 20: 191\u2013198.","journal-title":"Bioinformatics"},{"key":"780_CR7","doi-asserted-by":"publisher","first-page":"256","DOI":"10.1007\/s101420000036","volume":"1","author":"C Blaschke","year":"2001","unstructured":"Blaschke C, Oliveros JC, Valencia A: Mining functional information associated with expression arrays. Funct Integr Genomics 2001, 1: 256\u2013268.","journal-title":"Funct Integr Genomics"},{"key":"780_CR8","doi-asserted-by":"crossref","first-page":"ii259","DOI":"10.1093\/bioinformatics\/bti1143","volume":"21 Suppl 2","author":"R Kuffner","year":"2005","unstructured":"Kuffner R, Fundel K, Zimmer R: Expert knowledge without the expert: integrated analysis of gene expression and literature to derive active functional contexts. Bioinformatics 2005, 21 Suppl 2: ii259-ii267.","journal-title":"Bioinformatics"},{"key":"780_CR9","doi-asserted-by":"publisher","first-page":"1582","DOI":"10.1101\/gr.116402","volume":"12","author":"S Raychaudhuri","year":"2002","unstructured":"Raychaudhuri S, Schutze H, Altman RB: Using text analysis to identify functionally coherent gene groups. 
Genome Res 2002, 12: 1582\u20131590.","journal-title":"Genome Res"},{"key":"780_CR10","first-page":"317","volume":"8","author":"H Shatkay","year":"2000","unstructured":"Shatkay H, Edwards S, Wilbur WJ, Boguski M: Genes, themes and microarrays: using information retrieval for large-scale gene analysis. Proc Int Conf Intell Syst Mol Biol 2000, 8: 317\u2013328.","journal-title":"Proc Int Conf Intell Syst Mol Biol"},{"key":"780_CR11","first-page":"183","volume-title":"Proc IEEE Advances in Digital Libraries","author":"H Shatkay","year":"2000","unstructured":"Shatkay H, Wilbur WJ: Finding themes in Medline documents: Probabilistic similarity search. Proc IEEE Advances in Digital Libraries 2000, 183\u2013192."},{"key":"780_CR12","doi-asserted-by":"publisher","first-page":"RESEARCH0055","DOI":"10.1186\/gb-2002-3-10-research0055","volume":"3","author":"D Chaussabel","year":"2002","unstructured":"Chaussabel D, Sher A: Mining microarray expression data by literature profiling. Genome Biol 2002, 3: RESEARCH0055.","journal-title":"Genome Biol"},{"key":"780_CR13","volume-title":"Automatic information organization and retrieval","author":"G Salton","year":"1968","unstructured":"Salton G: Automatic information organization and retrieval. New York, McGraw-Hill; 1968."},{"key":"780_CR14","doi-asserted-by":"publisher","first-page":"617","DOI":"10.1145\/361219.361220","volume":"18","author":"G Salton","year":"1975","unstructured":"Salton G, Wong A, Yang CS: A vector space model for automatic indexing. Communications of the ACM 1975, 18: 617\u2013620.","journal-title":"Communications of the ACM"},{"key":"780_CR15","first-page":"391","volume-title":"Pac Symp Biocomput","author":"P Glenisson","year":"2003","unstructured":"Glenisson P, Antal P, Mathys J, Moreau Y, De Moor B: Evaluation of the vector space representation in text-based gene clustering. 
Pac Symp Biocomput 2003, 391\u2013402."},{"key":"780_CR16","first-page":"384","volume-title":"Pac Symp Biocomput","author":"I Iliopoulos","year":"2001","unstructured":"Iliopoulos I, Enright AJ, Ouzounis CA: Textquest: document clustering of Medline abstracts for concept discovery in molecular biology. Pac Symp Biocomput 2001, 384\u2013395."},{"key":"780_CR17","first-page":"489","volume-title":"Proc AMIA Symp","author":"W Mao","year":"2002","unstructured":"Mao W, Chu WW: Free-text medical document retrieval via phrase-based vector space model. Proc AMIA Symp 2002, 489\u2013493."},{"key":"780_CR18","first-page":"54","volume-title":"Pac Symp Biocomput","author":"A Renner","year":"2000","unstructured":"Renner A, Aszodi A: High-throughput functional annotation of novel gene products using document clustering. Pac Symp Biocomput 2000, 54\u201368."},{"key":"780_CR19","doi-asserted-by":"publisher","first-page":"104","DOI":"10.1093\/bioinformatics\/bth464","volume":"21","author":"R Homayouni","year":"2005","unstructured":"Homayouni R, Heinrich K, Wei L, Berry MW: Gene clustering by latent semantic indexing of MEDLINE abstracts. Bioinformatics 2005, 21: 104\u2013115.","journal-title":"Bioinformatics"},{"key":"780_CR20","doi-asserted-by":"publisher","first-page":"391","DOI":"10.1002\/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9","volume":"41","author":"S Deerwester","year":"1990","unstructured":"Deerwester S, Dumais ST, Furnas GW, Landauer TK, Harshman R: Indexing by Latent Semantic Analysis. J Am Soc Inform Sci 1990, 41: 391\u2013407.","journal-title":"J Am Soc Inform Sci"},{"key":"780_CR21","doi-asserted-by":"publisher","first-page":"R43","DOI":"10.1186\/gb-2004-5-6-r43","volume":"5","author":"P Glenisson","year":"2004","unstructured":"Glenisson P, Coessens B, Van Vooren S, Mathys J, Moreau Y, De Moor B: TXTGate: profiling gene groups with text-based information. 
Genome Biol 2004, 5: R43.","journal-title":"Genome Biol"},{"key":"780_CR22","doi-asserted-by":"publisher","first-page":"788","DOI":"10.1038\/44565","volume":"401","author":"DD Lee","year":"1999","unstructured":"Lee DD, Seung HS: Learning the parts of objects by non-negative matrix factorization. Nature 1999, 401: 788\u2013791.","journal-title":"Nature"},{"key":"780_CR23","doi-asserted-by":"publisher","first-page":"1706","DOI":"10.1101\/gr.903503","volume":"13","author":"PM Kim","year":"2003","unstructured":"Kim PM, Tidor B: Subsystem identification through dimensionality reduction of large-scale gene expression data. Genome Res 2003, 13: 1706\u20131718.","journal-title":"Genome Res"},{"key":"780_CR24","doi-asserted-by":"publisher","first-page":"4164","DOI":"10.1073\/pnas.0308531101","volume":"101","author":"JP Brunet","year":"2004","unstructured":"Brunet JP, Tamayo P, Golub TR, Mesirov JP: Metagenes and molecular pattern discovery using matrix factorization. Proc Natl Acad Sci U S A 2004, 101: 4164\u20134169.","journal-title":"Proc Natl Acad Sci U S A"},{"key":"780_CR25","doi-asserted-by":"publisher","first-page":"i130","DOI":"10.1093\/bioinformatics\/btg1017","volume":"19 Suppl 1","author":"A Heger","year":"2003","unstructured":"Heger A, Holm L: Sensitive pattern discovery with 'fuzzy' alignments of distantly related proteins. Bioinformatics 2003, 19 Suppl 1: i130-i137.","journal-title":"Bioinformatics"},{"key":"780_CR26","doi-asserted-by":"publisher","first-page":"162","DOI":"10.1186\/1471-2105-6-162","volume":"6","author":"P Pehkonen","year":"2005","unstructured":"Pehkonen P, Wong G, Toronen P: Theme discovery from gene lists for identification and viewing of multiple functional groups. 
BMC Bioinformatics 2005, 6: 162.","journal-title":"BMC Bioinformatics"},{"key":"780_CR27","first-page":"267","volume-title":"Proc Int ACM SIGIR Conf on Research and Development in Information Retrieval","author":"W Xu","year":"2003","unstructured":"Xu W, Liu X, Gong Y: Document clustering based on non-negative matrix factorization. Proc Int ACM SIGIR Conf on Research and Development in Information Retrieval 2003, 267\u2013273."},{"key":"780_CR28","doi-asserted-by":"publisher","first-page":"373","DOI":"10.1016\/j.ipm.2004.11.005","volume":"42","author":"F Shahnaz","year":"2006","unstructured":"Shahnaz F, Berry MW, Pauca VP, Plemmons RJ: Document clustering using nonnegative matrix factorization. Information Processing & Management 2006, 42: 373\u2013386.","journal-title":"Information Processing & Management"},{"key":"780_CR29","doi-asserted-by":"publisher","first-page":"960","DOI":"10.1109\/ICSMC.2001.973042","volume":"2","author":"S Tsuge","year":"2001","unstructured":"Tsuge S, Shishibori M, Kuroiwa S, Kita K: Dimensionality reduction using non-negative matrix factorization for information retrieval. Proc IEEE Int Conf on Systems, Man and Cybernetics 2001, 2: 960\u2013965.","journal-title":"Proc IEEE Int Conf on Systems, Man and Cybernetics"},{"key":"780_CR30","unstructured":"Saccharomyces Genome Database (SGD)[http:\/\/www.yeastgenome.org]"},{"key":"780_CR31","first-page":"D54","volume":"33 Database Iss","author":"D Maglott","year":"2005","unstructured":"Maglott D, Ostell J, Pruitt KD, Tatusova T: Entrez Gene: gene-centered information at NCBI. 
Nucleic Acids Res 2005, 33 Database Issue: D54-D58.","journal-title":"Nucleic Acids Res"},{"key":"780_CR32","unstructured":"Entrez Gene[http:\/\/www.ncbi.nlm.nih.gov\/entrez\/query.fcgi?db=gene]"},{"key":"780_CR33","unstructured":"Associated web site[http:\/\/www.cnb.uam.es\/~monica\/Discovering\/]"},{"key":"780_CR34","unstructured":"SGD Gene Ontology Slim Mapper[http:\/\/db.yeastgenome.org\/cgi-bin\/GO\/goTermMapper]"},{"key":"780_CR35","doi-asserted-by":"publisher","first-page":"27","DOI":"10.1093\/nar\/28.1.27","volume":"28","author":"M Kanehisa","year":"2000","unstructured":"Kanehisa M, Goto S: KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 2000, 28: 27\u201330.","journal-title":"Nucleic Acids Res"},{"key":"780_CR36","doi-asserted-by":"publisher","first-page":"375","DOI":"10.1016\/S0168-9525(97)01223-7","volume":"13","author":"M Kanehisa","year":"1997","unstructured":"Kanehisa M: A database for post-genome analysis. Trends Genet 1997, 13: 375\u2013376.","journal-title":"Trends Genet"},{"key":"780_CR37","unstructured":"KEGG PATHWAY database[http:\/\/www.genome.jp\/kegg]"},{"key":"780_CR38","first-page":"50","volume-title":"Proc Int ACM SIGIR Conf on Research and Development in Information Retrieval","author":"T Hoffmann","year":"1999","unstructured":"Hoffmann T: Probabilistic latent semantic indexing. Proc Int ACM SIGIR Conf on Research and Development in Information Retrieval 1999, 50\u201357."},{"key":"780_CR39","first-page":"36","volume":"25","author":"S Deerwester","year":"1988","unstructured":"Deerwester S, Dumais S, Landauer T, Furnas G, Beck L: Improving Information-Retrieval with Latent Semantic Indexing. 
P Asis Annu Meet P Asis Annu Meet 1988, 25: 36\u201340.","journal-title":"P Asis Annu Meet P Asis Annu Meet"},{"key":"780_CR40","doi-asserted-by":"publisher","first-page":"5214","DOI":"10.1073\/pnas.0400341101","volume":"101 Suppl 1","author":"TK Landauer","year":"2004","unstructured":"Landauer TK, Laham D, Derr M: From paragraph to graph: latent semantic analysis for information visualization. Proc Natl Acad Sci U S A 2004, 101 Suppl 1: 5214\u20135219.","journal-title":"Proc Natl Acad Sci U S A"},{"key":"780_CR41","first-page":"556","volume-title":"Proc Advances in Neural Information Processing","author":"DD Lee","year":"2000","unstructured":"Lee DD, Seung HS: Algorithms for non-negative matrix factorization. Proc Advances in Neural Information Processing 2000, 556\u2013562."},{"key":"780_CR42","doi-asserted-by":"publisher","first-page":"403","DOI":"10.1109\/TPAMI.2006.60","volume":"28","author":"A Pascual-Montano","year":"2006","unstructured":"Pascual-Montano A, Carazo JM, Kochi K, Lehmann D, Pascual-Marqui RD: Non-smooth Non-Negative Matrix Factorization (nsNMF). IEEE Trans on Pattern Analysis and Machine Intelligence 2006, 28: 403\u2013415.","journal-title":"IEEE Trans on Pattern Analysis and Machine Intelligence"},{"key":"780_CR43","first-page":"35","volume":"24","author":"A Singhal","year":"2001","unstructured":"Singhal A: Modern information retrieval: a brief overview. IEEE Data Eng Bull 2001, 24: 35\u201343.","journal-title":"IEEE Data Eng Bull"},{"key":"780_CR44","doi-asserted-by":"publisher","first-page":"11","DOI":"10.1108\/eb026526","volume":"28","author":"K Spark-Jones","year":"1972","unstructured":"Spark-Jones K: A statistical interpretation of term specificity and its application in retrieval. 
Journal of Documentation 1972, 28: 11\u201321.","journal-title":"Journal of Documentation"},{"key":"780_CR45","doi-asserted-by":"publisher","first-page":"130","DOI":"10.1108\/eb046814","volume":"14","author":"MF Porter","year":"1980","unstructured":"Porter MF: An algorithm for suffix stripping. Program 1980, 14: 130\u2013137.","journal-title":"Program"},{"key":"780_CR46","doi-asserted-by":"publisher","first-page":"236","DOI":"10.1080\/01621459.1963.10500845","volume":"58","author":"JH Ward","year":"1963","unstructured":"Ward JH: Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association 1963, 58: 236\u2013244.","journal-title":"Journal of the American Statistical Association"}],"container-title":["BMC Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/1471-2105-7-41.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,2,2]],"date-time":"2024-02-02T18:54:44Z","timestamp":1706900084000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcbioinformatics.biomedcentral.com\/articles\/10.1186\/1471-2105-7-41"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2006,1,26]]},"references-count":46,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2006,12]]}},"alternative-id":["780"],"URL":"https:\/\/doi.org\/10.1186\/1471-2105-7-41","relation":{},"ISSN":["1471-2105"],"issn-type":[{"value":"1471-2105","type":"electronic"}],"subject":[],"published":{"date-parts":[[2006,1,26]]},"assertion":[{"value":"1 September 2005","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"26 January 2006","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"26 January 2006","order":3,"name":"first_online","label":"First 
Online","group":{"name":"ArticleHistory","label":"Article History"}}],"article-number":"41"}}