{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,12]],"date-time":"2025-10-12T03:58:52Z","timestamp":1760241532522,"version":"build-2065373602"},"reference-count":31,"publisher":"MDPI AG","issue":"5","license":[{"start":{"date-parts":[[2018,5,15]],"date-time":"2018-05-15T00:00:00Z","timestamp":1526342400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["7142010725","71371019","11501586"],"award-info":[{"award-number":["7142010725","71371019","11501586"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Symmetry"],"abstract":"<jats:p>In existing principle component analysis (PCA) methods for histogram-valued symbolic data, projection results are approximated based on Moore\u2019s algebra and fail to reflect the data\u2019s true structure, mainly because there is no precise, unified calculation method for the linear combination of histogram data. In this paper, we propose a new PCA method for histogram data that distinguishes itself from various well-established methods in that it can project observations onto the space spanned by principal components more accurately and rapidly by sampling through a MapReduce framework. The new histogram PCA method is implemented under the same assumption of \u201corthogonal dimensions for every observation\u201d with the existing literatures. To project observations, the method first samples from the original histogram variables to acquire single-valued data, on which linear combination operations can be performed. Then, the projection of observations can be given by linear combination of loading vectors and single-valued samples, which is close to accurate projection results. Finally, the projection is summarized to histogram data. These procedures involve complex algorithms and large-scale data, which makes the new method time-consuming. To speed it up, we undertake a parallel implementation of the new method in a multicore MapReduce framework. A simulation study and an empirical study confirm that the new method is effective and time-saving.<\/jats:p>","DOI":"10.3390\/sym10050162","type":"journal-article","created":{"date-parts":[[2018,5,15]],"date-time":"2018-05-15T11:36:13Z","timestamp":1526384173000},"page":"162","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Sampling Based Histogram PCA and Its Mapreduce Parallel Implementation on Multicore"],"prefix":"10.3390","volume":"10","author":[{"given":"Cheng","family":"Wang","sequence":"first","affiliation":[{"name":"School of Economics and Management, Beihang University, Beijing 100191, China"}]},{"given":"Huiwen","family":"Wang","sequence":"additional","affiliation":[{"name":"School of Economics and Management, Beihang University, Beijing 100191, China"},{"name":"Beijing Key Laboratory of Emergency Support Simulation Technologies for City Operations, Beijing 100191, China"}]},{"given":"Siyang","family":"Wang","sequence":"additional","affiliation":[{"name":"School of Statistics and Mathematics, Central University of Finance and Economics, Beijing 100081, China"}]},{"given":"Edwin","family":"Diday","sequence":"additional","affiliation":[{"name":"CEREMADE, Paris-Dauphine University, 75775 Paris, France"}]},{"given":"Richard","family":"Emilion","sequence":"additional","affiliation":[{"name":"MAPMO, University of Orleans, 45067 Orleans, France"}]}],"member":"1968","published-online":{"date-parts":[[2018,5,15]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"184","DOI":"10.1002\/sam.10111","article-title":"The quantile method for symbolic principal component analysis","volume":"4","author":"Ichino","year":"2011","journal-title":"Stat. Anal. Data Min."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"559","DOI":"10.1080\/14786440109462720","article-title":"On lines and planes of closest fit to systems of points in space","volume":"2","author":"Pearson","year":"1901","journal-title":"Lond. Edinb. Dublin Philos. Mag. J. Sci."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"417","DOI":"10.1037\/h0071325","article-title":"Analysis of a complex of statistical variables into principal components","volume":"24","author":"Hotelling","year":"1933","journal-title":"J. Educat. Psychol."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Jolliffe, I. (1986). Principal Component Analysis, Spring.","DOI":"10.1007\/978-1-4757-1904-8"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"1550","DOI":"10.1016\/j.neucom.2009.08.022","article-title":"Multilinear principal component analysis for face recognition with fewer features","volume":"73","author":"Wang","year":"2010","journal-title":"Neurocomputing"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"2730","DOI":"10.1002\/sim.2747","article-title":"Assessing local influence in principal component analysis with application to haematology study data","volume":"26","author":"Fung","year":"2007","journal-title":"Stat. Med."},{"key":"ref_7","unstructured":"Diday, E. (July, January 29). The symbolic approach in clustering and relating methods of data analysis: The basic choices. Proceedings of the Conference of the International Federation of Classification Societies, Aachen, Germany."},{"key":"ref_8","unstructured":"Diday, E., and Bock, H.H. (2000). Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data, Springer."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Diday, E., and Billard, L. (2006). Symbolic Data Analysis: Conceptual Statistics and Data Mining, Wiley.","DOI":"10.1002\/9780470090183"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Diday, E., and Noirhomme-Fraiture, M. (2008). Symbolic Data Analysis and the SODAS Software, Wiley Online Library.","DOI":"10.1002\/9780470723562"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Nagabhushan, P., and Kumar, R.P. (2007). Histogram PCA. Advances in Neural Networks\u2013ISNN 2007, Springer.","DOI":"10.1007\/978-3-540-72393-6_120"},{"key":"ref_12","unstructured":"Rodr\u0131guez, O., Diday, E., and Winsberg, S. (2000, January 13\u201316). Generalization of the principal components analysis to histogram data. Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery, Lyon, France."},{"key":"ref_13","unstructured":"Makosso Kallyth, S. (2010). Analyse en Composantes Principales de Variables Symboliques de Type Histogramme. [Ph.D. Thesis, Universit\u00e9 Paris-Dauphine]."},{"key":"ref_14","unstructured":"Ichino, M. (2008, January 5\u20138). Symbolic PCA for histogram-valued data. Proceedings of the IASC, Yokohama, Japan."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"403","DOI":"10.1002\/sam.11188","article-title":"Principal component analysis for bar charts and metabins tables","volume":"6","author":"Diday","year":"2013","journal-title":"Stat. Anal. Data Min."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"59","DOI":"10.1007\/s11634-014-0178-2","article-title":"Principal component analysis for probabilistic symbolic data: A more generic and accurate algorithm","volume":"9","author":"Chen","year":"2015","journal-title":"Adv. Data Anal. Classif."},{"key":"ref_17","unstructured":"Verde, R., and Irpino, A. (arXiv, 2018). Multiple factor analysis of distributional data, arXiv."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"\u017d\u00e1k, J., and Vach, M. (2017, January 28\u201330). A histogram based radar signal identification process. Proceedings of the 2017 18th International Radar Symposium (IRS), Prague, Czech Republic.","DOI":"10.23919\/IRS.2017.8008204"},{"key":"ref_19","unstructured":"Moore, R.E. (1966). Interval Analysis, Prentice-Hall Englewood Cliffs."},{"key":"ref_20","first-page":"5","article-title":"Entension de l\u2019analyse en composantes principales \u00e0 des donn\u00e9es de type intervalle","volume":"45","author":"Cazes","year":"1997","journal-title":"Revue de Statistique Appliqu\u00e9e"},{"key":"ref_21","unstructured":"Dean, J., and Ghemawat, S. (2004). MapReduce: Simplified data processing on large clusters. Oper. Syst. Des. Implement., 137\u2013149."},{"key":"ref_22","first-page":"281","article-title":"Map-reduce for machine learning on multicore","volume":"6","author":"Chu","year":"2006","journal-title":"NIPS"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Ranger, C., Raghuraman, R., Penmetsa, A., Bradski, G., and Kozyrakis, C. (2007, January 10\u201314). Evaluating mapreduce for multi-core and multiprocessor systems. Proceedings of the IEEE 13th International Symposium on High Performance Computer Architecture, Phoenix, AZ, USA.","DOI":"10.1109\/HPCA.2007.346181"},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"66","DOI":"10.1016\/j.future.2013.09.010","article-title":"Large-scale incremental processing with MapReduce","volume":"36","author":"Lee","year":"2013","journal-title":"Future Gener. Comput. Syst."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Kiran, M., Kumar, A., Mukherjee, S., and Ravi Prakash, G. (2013). Verification and Validation of MapReduce Program Model for Parallel Support Vector Machine Algorithm on Hadoop Cluster. Int. J. Comput. Sci. Issues (IJCSI), 10.","DOI":"10.1109\/ICACCS.2013.6938728"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Bertrand, P., and Goupil, F. (2000). Descriptive statistics for symbolic data. Analysis of Symbolic Data, Springer.","DOI":"10.1007\/978-3-642-57155-8_6"},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"470","DOI":"10.1198\/016214503000242","article-title":"From the statistics of data to the statistics of knowledge: Symbolic data analysis","volume":"98","author":"Billard","year":"2003","journal-title":"J. Am. Stat. Assoc."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"649","DOI":"10.2140\/pjm.1961.11.649","article-title":"On large deviations of the empiric df of vector chance variables and a law of the iterated logarithm","volume":"11","author":"Kiefer","year":"1961","journal-title":"Pac. J. Math."},{"key":"ref_29","unstructured":"Dias, S., and Brito, P. (arXiv, 2013). Distribution and Symmetric Distribution Regression Model for Histogram-Valued Variables, arXiv."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"158","DOI":"10.1016\/j.neucom.2012.01.018","article-title":"CIPCA: Complete-Information-based Principal Component Analysis for interval-valued data","volume":"86","author":"Wang","year":"2012","journal-title":"Neurocomputing"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Irpino, A., and Verde, R. (2006). A new Wasserstein based distance for the hierarchical clustering of histogram symbolic data. Data Science and Classification, Springer.","DOI":"10.1007\/3-540-34416-0_20"}],"container-title":["Symmetry"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2073-8994\/10\/5\/162\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T15:04:19Z","timestamp":1760195059000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2073-8994\/10\/5\/162"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2018,5,15]]},"references-count":31,"journal-issue":{"issue":"5","published-online":{"date-parts":[[2018,5]]}},"alternative-id":["sym10050162"],"URL":"https:\/\/doi.org\/10.3390\/sym10050162","relation":{},"ISSN":["2073-8994"],"issn-type":[{"type":"electronic","value":"2073-8994"}],"subject":[],"published":{"date-parts":[[2018,5,15]]}}}