{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,27]],"date-time":"2026-03-27T17:37:41Z","timestamp":1774633061724,"version":"3.50.1"},"reference-count":29,"publisher":"Springer Science and Business Media LLC","issue":"1","content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Bioinformatics"],"published-print":{"date-parts":[[2011,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec><jats:title>Background<\/jats:title><jats:p>Machine learning methods are nowadays used for many biological prediction problems involving drugs, ligands or polypeptide segments of a protein. In order to build a prediction model a so called training data set of molecules with measured target properties is needed. For many such problems the size of the training data set is limited as measurements have to be performed in a wet lab. Furthermore, the considered problems are often complex, such that it is not clear which molecular descriptors (features) may be suitable to establish a strong correlation with the target property. In many applications all available descriptors are used. This can lead to difficult machine learning problems, when thousands of descriptors are considered and only few (e.g. below hundred) molecules are available for training.<\/jats:p><\/jats:sec><jats:sec><jats:title>Results<\/jats:title><jats:p>The CoEPrA contest provides four data sets, which are typical for biological regression problems (few molecules in the training data set and thousands of descriptors). We applied the same two-step training procedure for all four regression tasks. In the first stage, we used optimized L1 regularization to select the most relevant features. Thus, the initial set of more than 6,000 features was reduced to about 50. 
In the second stage, we used only the selected features from the preceding stage applying a milder L2 regularization, which generally yielded further improvement of prediction performance. Our linear model employed a soft loss function which minimizes the influence of outliers.<\/jats:p><\/jats:sec><jats:sec><jats:title>Conclusions<\/jats:title><jats:p>The proposed two-step method showed good results on all four CoEPrA regression tasks. Thus, it may be useful for many other biological prediction problems where for training only a small number of molecules are available, which are described by thousands of descriptors.<\/jats:p><\/jats:sec>","DOI":"10.1186\/1471-2105-12-412","type":"journal-article","created":{"date-parts":[[2011,11,16]],"date-time":"2011-11-16T03:56:33Z","timestamp":1321415793000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":105,"title":["Prediction using step-wise L1, L2 regularization and feature selection for small data sets with large number of features"],"prefix":"10.1186","volume":"12","author":[{"given":"Ozgur","family":"Demir-Kavuk","sequence":"first","affiliation":[]},{"given":"Mayumi","family":"Kamada","sequence":"additional","affiliation":[]},{"given":"Tatsuya","family":"Akutsu","sequence":"additional","affiliation":[]},{"given":"Ernst-Walter","family":"Knapp","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2011,10,25]]},"reference":[{"key":"4878_CR1","doi-asserted-by":"crossref","unstructured":"Demir-Kavuk O, Riedesel H, Knapp EW: Exploring classification strategies with the CoEPrA 2006 contest. Bioinformatics 26(5):603\u20139.","DOI":"10.1093\/bioinformatics\/btq021"},{"issue":"1","key":"4878_CR2","doi-asserted-by":"crossref","first-page":"267","DOI":"10.1111\/j.2517-6161.1996.tb02080.x","volume":"58","author":"R Tibshirani","year":"1996","unstructured":"Tibshirani R: Regression shrinkage and selection via the Lasso. 
Journal of the Royal Statistical Society Series B-Methodological 1996, 58(1):267\u2013288.","journal-title":"Journal of the Royal Statistical Society Series B-Methodological"},{"issue":"1","key":"4878_CR3","doi-asserted-by":"publisher","first-page":"55","DOI":"10.1080\/00401706.1970.10488634","volume":"12","author":"AE Hoerl","year":"1970","unstructured":"Hoerl AE, Kennard RW: Ridge Regression - Biased Estimation For Nonorthogonal Problems. Technometrics 1970, 12(1):55. 10.2307\/1267351","journal-title":"Technometrics"},{"key":"4878_CR4","doi-asserted-by":"publisher","first-page":"301","DOI":"10.1111\/j.1467-9868.2005.00503.x","volume":"67","author":"H Zou","year":"2005","unstructured":"Zou H, Hastie T: Regularization and variable selection via the Elastic Net. Journal of the Royal Statistical Society B 2005, 67: 301\u2013320. 10.1111\/j.1467-9868.2005.00503.x","journal-title":"Journal of the Royal Statistical Society B"},{"issue":"6","key":"4878_CR5","first-page":"1159","volume":"53","author":"ZongBen Xu","year":"2010","unstructured":"Xu ZongBen, Z H, Wang Yao, Chang XiangYu, Yong Liang: L1\/2 regularizer. SCIENCE CHINA 2010, 53(6):1159\u20131169.","journal-title":"SCIENCE CHINA"},{"key":"4878_CR6","volume-title":"ICML '07","author":"G Andrew","year":"2007","unstructured":"Andrew G, Gao J: Scalable training of L1-regularized log-linear models. ICML '07 2007."},{"key":"4878_CR7","volume-title":"Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI-06)","author":"S Lee","year":"2006","unstructured":"Lee S, Lee H, Abbeel P, Ng A: Efficient L1 Regularized Logistic Regression. Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI-06) 2006."},{"key":"4878_CR8","volume-title":"Proceedings of HLTNAACL 2004","author":"J Goodman","year":"2003","unstructured":"Goodman J: Exponential Priors for Maximum Entropy Models. 
Proceedings of HLTNAACL 2004 2003."},{"issue":"1","key":"4878_CR9","doi-asserted-by":"publisher","first-page":"16","DOI":"10.1109\/TNN.2003.809398","volume":"15","author":"V Roth","year":"2004","unstructured":"Roth V: The generalized LASSO. IEEE Trans Neural Netw 2004, 15(1):16\u201328. 10.1109\/TNN.2003.809398","journal-title":"IEEE Trans Neural Netw"},{"key":"4878_CR10","first-page":"21","volume-title":"In Machine Learning, Proceedings of the Twentieth International Conference","author":"S Perkins","year":"2003","unstructured":"Perkins S, Theiler J: Online feature selection using grafting. In Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003. AAAI Press; 21\u201324."},{"key":"4878_CR11","doi-asserted-by":"publisher","first-page":"586","DOI":"10.1109\/ICNN.1993.298623","volume-title":"Proceedings of the IEEE International Conference on Neural Networks","author":"M Riedmiller","year":"1993","unstructured":"Riedmiller M, Braun H: A direct adaptive method for faster backpropagation learning: The Rprop algorithm. Proceedings of the IEEE International Conference on Neural Networks 1993, 586\u2013591."},{"key":"4878_CR12","unstructured":"CoEPrA[http:\/\/www.coepra.org\/]"},{"issue":"1","key":"4878_CR13","doi-asserted-by":"publisher","first-page":"5","DOI":"10.1023\/A:1010933404324","volume":"45","author":"L Breiman","year":"2001","unstructured":"Breiman L: Random Forests. Machine Learning 2001, 45(1):5\u201332. 10.1023\/A:1010933404324","journal-title":"Machine Learning"},{"key":"4878_CR14","first-page":"1418","volume-title":"Journal of the American Statistical Association","author":"H Zou","year":"2006","unstructured":"Zou H: The adaptive lasso and its oracle properties. In Journal of the American Statistical Association. 
ASA; 2006:1418\u20131429."},{"key":"4878_CR15","first-page":"1348","volume-title":"Journal of the American Statistical Association","author":"JaLR Fan","year":"2001","unstructured":"Fan JaLR: Variable selection via nonconcave penalized likelihood and its oracle properties. In Journal of the American Statistical Association. ASA; 2001:1348\u20131360."},{"issue":"18","key":"4878_CR16","doi-asserted-by":"publisher","first-page":"6395","DOI":"10.1073\/pnas.0408677102","volume":"102","author":"WR Atchley","year":"2005","unstructured":"Atchley WR, Zhao J, Fernandes AD, Druke T: Solving the protein sequence metric problem. Proc Natl Acad Sci USA 2005, 102(18):6395\u2013400. 10.1073\/pnas.0408677102","journal-title":"Proc Natl Acad Sci USA"},{"issue":"5","key":"4878_CR17","doi-asserted-by":"publisher","first-page":"703","DOI":"10.1089\/cmb.2008.0173","volume":"16","author":"AG Georgiev","year":"2009","unstructured":"Georgiev AG: Interpretable numerical descriptors of amino acid space. J Comput Biol 2009, 16(5):703\u201323. 10.1089\/cmb.2008.0173","journal-title":"J Comput Biol"},{"issue":"12","key":"4878_CR18","doi-asserted-by":"publisher","first-page":"445","DOI":"10.1007\/s00894-001-0058-5","volume":"7","author":"MS Venkatarajan","year":"2001","unstructured":"Venkatarajan MS, Braun W: New quantitative descriptors of amino acids based on multidimensional scaling of a large number of physical\u2013chemical properties. Journal of Molecular Modeling 2001, 7(12):445\u2013453. 10.1007\/s00894-001-0058-5","journal-title":"Journal of Molecular Modeling"},{"issue":"6","key":"4878_CR19","doi-asserted-by":"publisher","first-page":"559","DOI":"10.1080\/14786440109462720","volume":"2","author":"K Pearson","year":"1901","unstructured":"Pearson K: On lines and planes of closest fit to systems of points in space. 
Philosophical Magazine 1901, 2(6):559\u2013572.","journal-title":"Philosophical Magazine"},{"issue":"22","key":"4878_CR20","doi-asserted-by":"publisher","first-page":"10915","DOI":"10.1073\/pnas.89.22.10915","volume":"89","author":"S Henikoff","year":"1992","unstructured":"Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 1992, 89(22):10915\u20139. 10.1073\/pnas.89.22.10915","journal-title":"Proc Natl Acad Sci USA"},{"issue":"17","key":"4878_CR21","doi-asserted-by":"publisher","first-page":"3389","DOI":"10.1093\/nar\/25.17.3389","volume":"25","author":"SF Altschul","year":"1997","unstructured":"Altschul SF, et al.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389\u2013402. 10.1093\/nar\/25.17.3389","journal-title":"Nucleic Acids Res"},{"issue":"1","key":"4878_CR22","doi-asserted-by":"publisher","first-page":"368","DOI":"10.1093\/nar\/27.1.368","volume":"27","author":"S Kawashima","year":"1999","unstructured":"Kawashima S, Ogata H, Kanehisa M: AAindex: Amino Acid Index Database. Nucleic Acids Res 1999, 27(1):368\u2013369. 10.1093\/nar\/27.1.368","journal-title":"Nucleic Acids Res"},{"key":"4878_CR23","doi-asserted-by":"crossref","first-page":"255","DOI":"10.1515\/9781400874668","volume-title":"Adaptive control processes - A guided tour","author":"R Bellman","year":"1961","unstructured":"Bellman R: Adaptive control processes - A guided tour. In Adaptive control processes - A guided tour. Princeton University Press; 1961:255."},{"key":"4878_CR24","volume-title":"A Tutorial on Principal Component Analysis","author":"J Shlens","year":"2005","unstructured":"Shlens J: A Tutorial on Principal Component Analysis. 
2005."},{"issue":"5","key":"4878_CR25","doi-asserted-by":"publisher","first-page":"514","DOI":"10.2174\/138620709788488984","volume":"12","author":"L Hansen","year":"2009","unstructured":"Hansen L, et al.: Controlling feature selection in random forests of decision trees using a genetic algorithm: classification of class I MHC peptides. Comb Chem High Throughput Screen 2009, 12(5):514\u20139. 10.2174\/138620709788488984","journal-title":"Comb Chem High Throughput Screen"},{"issue":"5","key":"4878_CR26","doi-asserted-by":"publisher","first-page":"507","DOI":"10.2174\/138620709788488993","volume":"12","author":"D Patil","year":"2009","unstructured":"Patil D, et al.: Feature selection and classification employing hybrid ant colony optimization\/random forest methodology. Comb Chem High Throughput Screen 2009, 12(5):507\u201313. 10.2174\/138620709788488993","journal-title":"Comb Chem High Throughput Screen"},{"issue":"1","key":"4878_CR27","first-page":"198","volume":"15","author":"H Riedesel","year":"2004","unstructured":"Riedesel H, Kolbeck B, Schmetzer O, Knapp EW: Peptide binding at class I major histocompatibility complex scored with linear functions and support vector machines. Genome Inform 2004, 15(1):198\u2013212.","journal-title":"Genome Inform"},{"key":"4878_CR28","volume-title":"Numerical linear algebra","author":"D Bau III","year":"1997","unstructured":"Bau D III, Trefethen LN: Numerical linear algebra. Philadelphia: Society for Industrial and Applied Mathematics; 1997."},{"issue":"3-4","key":"4878_CR29","doi-asserted-by":"publisher","first-page":"441","DOI":"10.2307\/1422689","volume":"100","author":"C Spearman","year":"1987","unstructured":"Spearman C: The proof and measurement of association between two things. By C. Spearman, 1904. 
Am J Psychol 1987, 100(3\u20134):441\u201371.","journal-title":"Am J Psychol"}],"container-title":["BMC Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/1471-2105-12-412.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,3,14]],"date-time":"2025-03-14T03:04:30Z","timestamp":1741921470000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcbioinformatics.biomedcentral.com\/articles\/10.1186\/1471-2105-12-412"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2011,10,25]]},"references-count":29,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2011,12]]}},"alternative-id":["4878"],"URL":"https:\/\/doi.org\/10.1186\/1471-2105-12-412","relation":{},"ISSN":["1471-2105"],"issn-type":[{"value":"1471-2105","type":"electronic"}],"subject":[],"published":{"date-parts":[[2011,10,25]]},"assertion":[{"value":"6 July 2011","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"25 October 2011","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"25 October 2011","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}],"article-number":"412"}}