{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T01:33:23Z","timestamp":1760146403581,"version":"build-2065373602"},"reference-count":60,"publisher":"MDPI AG","issue":"11","license":[{"start":{"date-parts":[[2024,10,31]],"date-time":"2024-10-31T00:00:00Z","timestamp":1730332800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001691","name":"MEXT KAKENHI","doi-asserted-by":"publisher","award":["24K03004","23KJ1336"],"award-info":[{"award-number":["24K03004","23KJ1336"]}],"id":[{"id":"10.13039\/501100001691","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Gatsby Charitable Foundation","award":["24K03004","23KJ1336"],"award-info":[{"award-number":["24K03004","23KJ1336"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Entropy"],"abstract":"<jats:p>In this study, we consider the problem of self-supervised learning (SSL) utilizing the 1-Wasserstein distance on a tree structure (a.k.a., Tree-Wasserstein distance (TWD)), where TWD is defined as the L1 distance between two tree-embedded vectors. In SSL methods, the cosine similarity is often utilized as an objective function; however, it has not been well studied when utilizing the Wasserstein distance. Training the Wasserstein distance is numerically challenging. Thus, this study empirically investigates a strategy for optimizing the SSL with the Wasserstein distance and finds a stable training procedure. More specifically, we evaluate the combination of two types of TWD (total variation and ClusterTree) and several probability models, including the softmax function, the ArcFace probability model, and simplicial embedding. We propose a simple yet effective Jeffrey divergence-based regularization method to stabilize optimization. 
Through empirical experiments on STL10, CIFAR10, CIFAR100, and SVHN, we find that a simple combination of the softmax function and TWD can obtain significantly lower results than the standard SimCLR. Moreover, a simple combination of TWD and SimSiam fails to train the model. We find that the model performance depends on the combination of TWD and probability model, and that the Jeffrey divergence regularization helps in model training. Finally, we show that the appropriate combination of the TWD and probability model outperforms cosine similarity-based representation learning.<\/jats:p>","DOI":"10.3390\/e26110939","type":"journal-article","created":{"date-parts":[[2024,11,1]],"date-time":"2024-11-01T13:09:27Z","timestamp":1730466567000},"page":"939","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["An Empirical Study of Self-Supervised Learning with Wasserstein Distance"],"prefix":"10.3390","volume":"26","author":[{"given":"Makoto","family":"Yamada","sequence":"first","affiliation":[{"name":"Machine Learning and Data Science Unit, Okinawa Institute of Science and Technology, Okinawa 904-0412, Japan"},{"name":"Center for Advanced Intelligence Project RIKEN, Tokyo 103-0027, Japan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8532-2775","authenticated-orcid":false,"given":"Yuki","family":"Takezawa","sequence":"additional","affiliation":[{"name":"Machine Learning and Data Science Unit, Okinawa Institute of Science and Technology, Okinawa 904-0412, Japan"},{"name":"Department of Intelligence Science and Technology, Kyoto University, Kyoto 606-8501, Japan"}]},{"given":"Guillaume","family":"Houry","sequence":"additional","affiliation":[{"name":"Machine Learning and Data Science Unit, Okinawa Institute of Science and Technology, Okinawa 904-0412, Japan"},{"name":"Paris-Saclay Ecole Normale Superieure, 75005 Paris, 
France"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3217-5326","authenticated-orcid":false,"given":"Kira Michaela","family":"D\u00fcsterwald","sequence":"additional","affiliation":[{"name":"Machine Learning and Data Science Unit, Okinawa Institute of Science and Technology, Okinawa 904-0412, Japan"},{"name":"Gatsby Computational Neuroscience Unit, University College London, London WC1E 6BT, UK"}]},{"given":"Deborah","family":"Sulem","sequence":"additional","affiliation":[{"name":"Barcelona School of Economics, Universitat Pompeu Fabra, 08002 Barcelona, Spain"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8579-1600","authenticated-orcid":false,"given":"Han","family":"Zhao","sequence":"additional","affiliation":[{"name":"Department of Computer Science, University of Illinois at Urbana-Champaign, Champaign, IL 61801, USA"}]},{"given":"Yao-Hung","family":"Tsai","sequence":"additional","affiliation":[{"name":"Machine Learning and Data Science Unit, Okinawa Institute of Science and Technology, Okinawa 904-0412, Japan"},{"name":"Machine Learning Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA"}]}],"member":"1968","published-online":{"date-parts":[[2024,10,31]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"233","DOI":"10.1002\/aic.690370209","article-title":"Nonlinear principal component analysis using autoassociative neural networks","volume":"37","author":"Kramer","year":"1991","journal-title":"AIChE J."},{"key":"ref_2","unstructured":"Kingma, D.P., and Welling, M. (2013). Auto-encoding variational bayes. arXiv."},{"key":"ref_3","unstructured":"Chen, X., Fan, H., Girshick, R., and He, K. (2020). Improved baselines with momentum contrastive learning. arXiv."},{"key":"ref_4","unstructured":"Grill, J.B., Strub, F., Altch\u00e9, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., and Gheshlaghi Azar, M. (2020, January 6\u201312). 
Bootstrap your own latent\u2014A new approach to self-supervised learning. Proceedings of the NeurIPS, Virtual."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. (2020, January 14\u201319). Momentum contrast for unsupervised visual representation learning. Proceedings of the CVPR, Virtual.","DOI":"10.1109\/CVPR42600.2020.00975"},{"key":"ref_6","unstructured":"Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., and Joulin, A. (2020, January 6\u201312). Unsupervised learning of visual features by contrasting cluster assignments. Proceedings of the NeurIPS, Virtual."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Chen, X., and He, K. (2021, January 19\u201325). Exploring simple siamese representation learning. Proceedings of the CVPR, Virtual.","DOI":"10.1109\/CVPR46437.2021.01549"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Caron, M., Touvron, H., Misra, I., J\u00e9gou, H., Mairal, J., Bojanowski, P., and Joulin, A. (2021, January 11\u201317). Emerging properties in self-supervised vision transformers. Proceedings of the ICCV, Virtual.","DOI":"10.1109\/ICCV48922.2021.00951"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Jiang, Q., Chen, C., Zhao, H., Chen, L., Ping, Q., Tran, S.D., Xu, Y., Zeng, B., and Chilimbi, T. (2023, January 18\u201322). Understanding and constructing latent modality structures in multi-modal representation learning. Proceedings of the CVPR, Vancouver, BC, Canada.","DOI":"10.1109\/CVPR52729.2023.00740"},{"key":"ref_10","unstructured":"Lavoie, S., Tsirigotis, C., Schwarzer, M., Vani, A., Noukhovitch, M., Kawaguchi, K., and Courville, A. (2023, January 1\u20135). Simplicial embeddings in self-supervised learning and downstream classification. Proceedings of the ICLR, Kigali, Rwanda."},{"key":"ref_11","unstructured":"Arjovsky, M., Chintala, S., and Bottou, L. (2017, January 6\u201311). Wasserstein generative adversarial networks. 
Proceedings of the ICML, Sydney, NSW, Australia."},{"key":"ref_12","unstructured":"Kusner, M., Sun, Y., Kolkin, N., and Weinberger, K. (2015, January 6\u201311). From word embeddings to document distances. Proceedings of the ICML, Lille, France."},{"key":"ref_13","unstructured":"Sato, R., Yamada, M., and Kashima, H. (2022, January 17\u201323). Re-evaluating Word Mover\u2019s Distance. Proceedings of the ICML, Baltimore, MD, USA."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Sarlin, P.E., DeTone, D., Malisiewicz, T., and Rabinovich, A. (2020, January 14\u201319). Superglue: Learning feature matching with graph neural networks. Proceedings of the CVPR, Virtual.","DOI":"10.1109\/CVPR42600.2020.00499"},{"key":"ref_15","unstructured":"Xian, R., Yin, L., and Zhao, H. (2023, January 23\u201329). Fair and Optimal Classification via Post-Processing. Proceedings of the ICML, Honolulu, HI, USA."},{"key":"ref_16","unstructured":"Zhao, H. (2022). Costs and Benefits of Fair Regression. TMLR, 1\u201322."},{"key":"ref_17","unstructured":"Indyk, P., and Thaper, N. (2003, January 12). Fast image retrieval via embeddings. Proceedings of the 3rd International Workshop on Statistical and Computational Theories of Vision, Nice, France."},{"key":"ref_18","unstructured":"Le, T., Yamada, M., Fukumizu, K., and Cuturi, M. (2019, January 8\u201314). Tree-sliced variants of wasserstein distances. Proceedings of the NeurIPS, Vancouver, BC, Canada."},{"key":"ref_19","unstructured":"Rabin, J., Peyr\u00e9, G., Delon, J., and Bernot, M. (June, January 29). Wasserstein Barycenter and Its Application to Texture Mixing. Proceedings of the International Conference on Scale Space and Variational Methods in Computer Vision, Ein-Gedi, Israel."},{"key":"ref_20","unstructured":"Kolouri, S., Zou, Y., and Rohde, G.K. (\u20131, January 26). Sliced Wasserstein kernels for probability distributions. 
Proceedings of the CVPR, Las Vegas, NV, USA."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Deng, J., Guo, J., Xue, N., and Zafeiriou, S. (2019, January 16\u201320). Arcface: Additive angular margin loss for deep face recognition. Proceedings of the CVPR, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00482"},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"161","DOI":"10.1038\/355161a0","article-title":"Self-organizing neural network that discovers surfaces in random-dot stereograms","volume":"355","author":"Becker","year":"1992","journal-title":"Nature"},{"key":"ref_23","unstructured":"Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020, January 12\u201318). A simple framework for contrastive learning of visual representations. Proceedings of the ICML, Vienna, Austria."},{"key":"ref_24","unstructured":"Oord, A.v.d., Li, Y., and Vinyals, O. (2018). Representation learning with contrastive predictive coding. arXiv."},{"key":"ref_25","unstructured":"Zbontar, J., Jing, L., Misra, I., LeCun, Y., and Deny, S. (2021, January 18\u201324). Barlow twins: Self-supervised learning via redundancy reduction. Proceedings of the ICML, Virtual."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Gretton, A., Bousquet, O., Smola, A., and Sch\u00f6lkopf, B. (2005, January 8\u201311). Measuring statistical dependence with Hilbert-Schmidt norms. Proceedings of the ALT, Singapore.","DOI":"10.1007\/11564089_7"},{"key":"ref_27","unstructured":"Tsai, Y.H.H., Bai, S., Morency, L.P., and Salakhutdinov, R. (2021). A note on connecting barlow twins with negative-sample-free contrastive learning. arXiv."},{"key":"ref_28","unstructured":"Cover, T.M., and Thomas, J.A. (2012). Elements of Information Theory, John Wiley & Sons."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Zhao, W., Peyrard, M., Liu, F., Gao, Y., Meyer, C.M., and Eger, S. (2019, January 3\u20137). 
MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance. Proceedings of the EMNLP-IJCNLP, Hong Kong, China.","DOI":"10.18653\/v1\/D19-1053"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Yokoi, S., Takahashi, R., Akama, R., Suzuki, J., and Inui, K. (2020, January 16\u201320). Word Rotator\u2019s Distance. Proceedings of the EMNLP, Virtual.","DOI":"10.18653\/v1\/2020.emnlp-main.236"},{"key":"ref_31","unstructured":"Cuturi, M. (2013, January 5\u201310). Sinkhorn distances: Lightspeed computation of optimal transport. Proceedings of the NIPS, Lake Tahoe, NV, USA."},{"key":"ref_32","unstructured":"Kolouri, S., Nadjahi, K., Simsekli, U., Badeau, R., and Rohde, G. (2019, January 8\u201314). Generalized sliced wasserstein distances. Proceedings of the NeurIPS, Vancouver, BC, Canada."},{"key":"ref_33","unstructured":"Mueller, J.W., and Jaakkola, T. (2015, January 7\u201312). Principal differences analysis: Interpretable characterization of differences between distributions. Proceedings of the NIPS, Montreal, QC, Canada."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Deshpande, I., Hu, Y.T., Sun, R., Pyrros, A., Siddiqui, N., Koyejo, S., Zhao, Z., Forsyth, D., and Schwing, A.G. (2019, January 16\u201320). Max-Sliced Wasserstein distance and its use for GANs. Proceedings of the CVPR, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.01090"},{"key":"ref_35","unstructured":"Paty, F.P., and Cuturi, M. (2019, January 9\u201315). Subspace Robust Wasserstein Distances. Proceedings of the ICML, Long Beach, CA, USA."},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"569","DOI":"10.1111\/j.1467-9868.2011.01018.x","article-title":"The phylogenetic Kantorovich\u2013Rubinstein metric for environmental sequence samples","volume":"74","author":"Evans","year":"2012","journal-title":"J. R. Stat. Soc. Ser. B (Stat. 
Methodol.)"},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"8228","DOI":"10.1128\/AEM.71.12.8228-8235.2005","article-title":"UniFrac: A new phylogenetic method for comparing microbial communities","volume":"71","author":"Lozupone","year":"2005","journal-title":"Appl. Environ. Microbiol."},{"key":"ref_38","unstructured":"Sato, R., Yamada, M., and Kashima, H. (2020, January 6\u201312). Fast Unbalanced Optimal Transport on Tree. Proceedings of the NeurIPS, Virtual."},{"key":"ref_39","unstructured":"Le, T., and Nguyen, T. (2021, January 13\u201315). Entropy partial transport with tree metrics: Theory and practice. Proceedings of the AISTATS, Virtual."},{"key":"ref_40","unstructured":"Takezawa, Y., Sato, R., and Yamada, M. (2021, January 18\u201324). Supervised tree-wasserstein distance. Proceedings of the ICML, Virtual."},{"key":"ref_41","unstructured":"Takezawa, Y., Sato, R., Kozareva, Z., Ravi, S., and Yamada, M. (2022, January 28\u201330). Fixed Support Tree-Sliced Wasserstein Barycenter. Proceedings of the AISTATS, Valencia, Spain."},{"key":"ref_42","unstructured":"Le, T., Nguyen, T., and Fukumizu, K. (2024, January 2\u20134). Optimal transport for measures with noisy tree metric. Proceedings of the AISTATS, Valencia, Spain."},{"key":"ref_43","unstructured":"Chen, S., Tabaghi, P., and Wang, Y. (2024, January 3\u20136). Learning ultrametric trees for optimal transport regression. Proceedings of the AAAI, Buffalo, NY, USA."},{"key":"ref_44","unstructured":"Houry, G., Bao, H., Zhao, H., and Yamada, M. (2024, January 2\u20134). Fast 1-Wasserstein distance approximations using greedy strategies. Proceedings of the AISTATS, Valencia, Spain."},{"key":"ref_45","unstructured":"Tong, A.Y., Huguet, G., Natik, A., MacDonald, K., Kuchroo, M., Coifman, R., Wolf, G., and Krishnaswamy, S. (2021, January 18\u201324). Diffusion earth mover\u2019s distance and distribution embeddings. 
Proceedings of the ICML, Virtual."},{"key":"ref_46","unstructured":"Le, T., Nguyen, T., Phung, D., and Nguyen, V.A. (2022, January 28\u201330). Sobolev transport: A scalable metric for probability measures with graph metrics. Proceedings of the AISTATS, Virtual."},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Otao, S., and Yamada, M. (2023, January 6\u201310). A linear time approximation of Wasserstein distance with word embedding selection. Proceedings of the EMNLP, Singapore.","DOI":"10.18653\/v1\/2023.emnlp-main.935"},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Laouar, C., Takezawa, Y., and Yamada, M. (2023, January 6\u201310). Large-scale similarity search with Optimal Transport. Proceedings of the EMNLP, Singapore.","DOI":"10.18653\/v1\/2023.emnlp-main.730"},{"key":"ref_49","doi-asserted-by":"crossref","first-page":"5606","DOI":"10.1016\/j.cell.2023.11.005","article-title":"Trellis tree-based analysis reveals stromal regulation of patient-derived organoid drug responses","volume":"186","author":"Zapatero","year":"2023","journal-title":"Cell"},{"key":"ref_50","unstructured":"Backurs, A., Dong, Y., Indyk, P., Razenshteyn, I., and Wagner, T. (2020, January 12\u201318). Scalable nearest neighbor search for optimal transport. Proceedings of the ICML, Vienna, Austria."},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Dey, T.K., and Zhang, S. (2022, January 9\u201310). Approximating 1-Wasserstein Distance between Persistence Diagrams by Graph Sparsification. Proceedings of the ALENEX, Alexandria, VA, USA.","DOI":"10.1137\/1.9781611977042.14"},{"key":"ref_52","unstructured":"Yamada, M., Takezawa, Y., Sato, R., Bao, H., Kozareva, Z., and Ravi, S. (2022). Approximating 1-Wasserstein Distance with Trees. TMLR, 1\u20139."},{"key":"ref_53","unstructured":"Frogner, C., Zhang, C., Mobahi, H., Araya, M., and Poggio, T.A. (2015, January 7\u201312). Learning with a Wasserstein loss. 
Proceedings of the NIPS, Montreal, QC, Canada."},{"key":"ref_54","doi-asserted-by":"crossref","unstructured":"Toyokuni, A., Yokoi, S., Kashima, H., and Yamada, M. (2021, January 19\u201323). Computationally Efficient Wasserstein Loss for Structured Labels. Proceedings of the EACL: Student Research Workshop, Virtual.","DOI":"10.18653\/v1\/2021.eacl-srw.1"},{"key":"ref_55","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1561\/0100000064","article-title":"Concentration of measure inequalities in information theory, communications, and coding","volume":"10","author":"Raginsky","year":"2013","journal-title":"Found. Trends\u00ae Commun. Inf. Theory"},{"key":"ref_56","doi-asserted-by":"crossref","first-page":"295","DOI":"10.1007\/BF01448847","article-title":"Zur theorie der gesellschaftsspiele","volume":"100","author":"Neumann","year":"1928","journal-title":"Math. Ann."},{"key":"ref_57","doi-asserted-by":"crossref","first-page":"42","DOI":"10.1073\/pnas.39.1.42","article-title":"Minimax theorems","volume":"39","author":"Fan","year":"1953","journal-title":"Proc. Natl. Acad. Sci. USA"},{"key":"ref_58","unstructured":"Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv."},{"key":"ref_59","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. (2017, January 4\u20139). Attention is all you need. Proceedings of the NIPS, Long Beach, CA, USA."},{"key":"ref_60","doi-asserted-by":"crossref","first-page":"90","DOI":"10.1109\/T-C.1974.223784","article-title":"Discrete cosine transform","volume":"100","author":"Ahmed","year":"1974","journal-title":"IEEE Trans. 
Comput."}],"container-title":["Entropy"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1099-4300\/26\/11\/939\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T16:25:59Z","timestamp":1760113559000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1099-4300\/26\/11\/939"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,10,31]]},"references-count":60,"journal-issue":{"issue":"11","published-online":{"date-parts":[[2024,11]]}},"alternative-id":["e26110939"],"URL":"https:\/\/doi.org\/10.3390\/e26110939","relation":{},"ISSN":["1099-4300"],"issn-type":[{"type":"electronic","value":"1099-4300"}],"subject":[],"published":{"date-parts":[[2024,10,31]]}}}