{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,27]],"date-time":"2025-10-27T04:52:41Z","timestamp":1761540761578},"reference-count":15,"publisher":"Oxford University Press (OUP)","issue":"5","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2007,3,1]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Motivation: Due to the recent advances in technology of mass spectrometry, there has been an exponential increase in the amount of data being generated in the past few years. Database searches have not been able to keep with this data explosion. Thus, speeding up the data searches becomes increasingly important in mass-spectrometry-based applications. Traditional database search methods use one-against-all comparisons of a query spectrum against a very large number of peptides generated from in silico digestion of protein sequences in a database, to filter potential candidates from this database followed by a detailed scoring and ranking of those filtered candidates.<\/jats:p><jats:p>Results: In this article, we show that we can avoid the one-against-all comparisons. The basic idea is to design a set of hash functions to pre-process peptides in the database such that for each query spectrum we can use the hash functions to find only a small subset of peptide sequences that are most likely to match the spectrum. The construction of each hash function is based on a random spectrum and the hash value of a peptide is the normalized shared peak counts score (cosine) between the random spectrum and the hypothetical spectrum of the peptide. To implement this idea, we first embed each peptide into a unit vector in a high-dimensional metric space. The random spectrum is represented by a random vector, and we use random vectors to construct a set of hash functions called locality sensitive hashing (LSH) for preprocessing. 
We demonstrate that our mapping is accurate. We show that our method can filter out >95.65% of the spectra without missing any correct sequences, or gain 111 times speedup by filtering out 99.64% of spectra while missing at most 0.19% (2 out of 1014) of the correct sequences. In addition, we show that our method can be effectively used for other mass spectra mining applications such as finding clusters of spectra efficiently and accurately.<\/jats:p><jats:p>Contact: \u00a0tingchen@usc.edu<\/jats:p><jats:p>Supplementary information: Supplementary data are available at Bioinformatics online.<\/jats:p>","DOI":"10.1093\/bioinformatics\/btl645","type":"journal-article","created":{"date-parts":[[2007,1,20]],"date-time":"2007-01-20T01:12:50Z","timestamp":1169255570000},"page":"612-618","source":"Crossref","is-referenced-by-count":40,"title":["Speeding up tandem mass spectrometry database search: metric embeddings and fast near neighbor search"],"prefix":"10.1093","volume":"23","author":[{"given":"Debojyoti","family":"Dutta","sequence":"first","affiliation":[]},{"given":"Ting","family":"Chen","sequence":"additional","affiliation":[]}],"member":"286","published-online":{"date-parts":[[2007,1,19]]},"reference":[{"key":"2023041109374929700_","doi-asserted-by":"crossref","first-page":"947","DOI":"10.1074\/mcp.M200066-MCP200","article-title":"Toward a human blood serum proteome i: analysis by multidimensional separation coupled with mass spectrometry","volume":"1","author":"Adkins","year":"2002","journal-title":"Mol. Cell. 
Proteomics"},{"key":"2023041109374929700_","doi-asserted-by":"crossref","first-page":"198","DOI":"10.1038\/nature01511","article-title":"Mass spectrometry-based proteomics","volume":"422","author":"Aebersold","year":"2003","journal-title":"Nature"},{"key":"2023041109374929700_","doi-asserted-by":"crossref","DOI":"10.1007\/11415770_27","article-title":"Eigenms: de novo analysis of peptide tandem mass spectra by spectral graph partitioning","volume-title":"RECOMB '05: Proceedings of the Ninth Annual International Conference on Computational Molecular Biology","author":"Bern","year":"2005"},{"key":"2023041109374929700_","doi-asserted-by":"crossref","first-page":"253","DOI":"10.1145\/997817.997857","article-title":"Locality-sensitive hashing scheme based on p-stable distributions","volume-title":"SCG '04: Proceedings of the Twentieth Annual Symposium on Computational Geometry","author":"Datar","year":"2004"},{"key":"2023041109374929700_","doi-asserted-by":"crossref","first-page":"976","DOI":"10.1016\/1044-0305(94)80016-2","article-title":"An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database","volume":"5","author":"Eng","year":"1994","journal-title":"J. Am. Soc. Mass. Spec."},{"key":"2023041109374929700_","doi-asserted-by":"crossref","first-page":"207","DOI":"10.1089\/153623102760092805","article-title":"Experimental protein mixture for validating tandem mass spectral analysis","volume":"6","author":"Keller","year":"2002","journal-title":"OMICS"},{"key":"2023041109374929700_","doi-asserted-by":"crossref","first-page":"255","DOI":"10.1038\/nbt0303-255","article-title":"Proteomic analysis of post-translational modifications","volume":"21","author":"Mann","year":"2003","journal-title":"Nat. 
Biotechnol"},{"key":"2023041109374929700_","unstructured":"Marcotte EM Opd (open proteomics database) http:\/\/apropos.icmb.utexas.edu\/opd\/"},{"key":"2023041109374929700_","doi-asserted-by":"crossref","first-page":"837","DOI":"10.1038\/35015709","article-title":"Proteomics to study genes and genomes","volume":"405","author":"Pandey","year":"2003","journal-title":"Nature"},{"key":"2023041109374929700_","doi-asserted-by":"crossref","first-page":"3551","DOI":"10.1002\/(SICI)1522-2683(19991201)20:18<3551::AID-ELPS3551>3.0.CO;2-2","article-title":"Probability-based protein identification by searching sequence databases using mass spectrometry data","volume":"20","author":"Perkins","year":"1999","journal-title":"Electrophoresis"},{"key":"2023041109374929700_","doi-asserted-by":"crossref","unstructured":"Ramakrishnan S \u00a0et al. A fast coarse filtering method for peptide identification by mass spectrometry Bioinformatics 2006 (in press)","DOI":"10.1093\/bioinformatics\/btl118"},{"key":"2023041109374929700_","doi-asserted-by":"crossref","first-page":"2470","DOI":"10.1021\/ac026424o","article-title":"Similarity among tandem mass spectra from proteomic experiments: detection, significance, and utility","volume":"75","author":"Tabb","year":"2003","journal-title":"Anal. Chem"},{"key":"2023041109374929700_","doi-asserted-by":"crossref","first-page":"3557","DOI":"10.1021\/ac980122y","article-title":"Method to compare collision-induced dissociation spectra of peptides: potential for library searching and subtractive analysis","volume":"70","author":"Tabb","year":"1998","journal-title":"Anal. 
Chem"},{"key":"2023041109374929700_","article-title":"A hidden markov model based scoring function for tandem mass spectrometry","volume-title":"RECOMB 2005","author":"Wan"},{"key":"2023041109374929700_","doi-asserted-by":"crossref","first-page":"1426","DOI":"10.1021\/ac00104a020","article-title":"Method to correlate tandem mass spectra of modified peptides to amino acid sequences in the protein database","volume":"67","author":"Yates","year":"1995","journal-title":"Anal. Chem"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/23\/5\/612\/49830035\/bioinformatics_23_5_612.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/23\/5\/612\/49830035\/bioinformatics_23_5_612.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,2,10]],"date-time":"2024-02-10T09:42:36Z","timestamp":1707558156000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/23\/5\/612\/238280"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2007,1,19]]},"references-count":15,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2007,3,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btl645","relation":{},"ISSN":["1367-4811","1367-4803"],"issn-type":[{"value":"1367-4811","type":"electronic"},{"value":"1367-4803","type":"print"}],"subject":[],"published-other":{"date-parts":[[2007,3]]},"published":{"date-parts":[[2007,1,19]]}}}