{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,15]],"date-time":"2025-11-15T10:34:03Z","timestamp":1763202843221,"version":"3.37.3"},"reference-count":35,"publisher":"Springer Science and Business Media LLC","issue":"2","license":[{"start":{"date-parts":[[2023,11,20]],"date-time":"2023-11-20T00:00:00Z","timestamp":1700438400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2023,11,20]],"date-time":"2023-11-20T00:00:00Z","timestamp":1700438400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100002666","name":"Aalto-Yliopisto","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100002666","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100002666","name":"Aalto University","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100002666","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Mach Learn"],"published-print":{"date-parts":[[2024,2]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Negative sampling is a common approach for making the training of deep models in classification problems with very large output spaces, known as extreme multilabel classification (XMC) problems, tractable. Negative\u00a0sampling methods aim\u00a0to find per instance negative labels with higher scores, known as hard negatives, and limit the computations of the negative part of the loss to these labels. Two well-known methods for negative sampling in XMC models are meta-classifier-based and Maximum Inner product Search (MIPS)-based adaptive methods. Owing to their good prediction performance, methods which employ a meta classifier are more common in contemporary XMC research. On the flip side, they need to train and store the meta classifier (apart from the extreme classifier), which can involve millions of additional parameters. In this paper, we focus on the MIPS-based methods for negative sampling. We highlight two issues which may prevent deep models trained by these methods to undergo stable training. First, we argue that using hard negatives excessively from the beginning of training leads to unstable gradient. Second, we show that when all the negative labels in a MIPS-based method are restricted to only those determined by MIPS, training is sensitive to the length of intervals for pre-processing the weights in the MIPS method. To mitigate the aforementioned issues, we propose to limit the labels selected by MIPS to only a few and sample the rest of the needed labels from a uniform distribution. We show that our proposed MIPS-based negative sampling can reach the performance of LightXML, a transformer-based model trained by a meta classifier, while there is no need to train and store any additional classifier. 
The code for our experiments is available at <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"uri\" xlink:href=\"https:\/\/github.com\/xmc-aalto\/mips-negative-sampling\">https:\/\/github.com\/xmc-aalto\/mips-negative-sampling<\/jats:ext-link>.<\/jats:p>","DOI":"10.1007\/s10994-023-06468-w","type":"journal-article","created":{"date-parts":[[2023,11,20]],"date-time":"2023-11-20T23:02:31Z","timestamp":1700521351000},"page":"675-697","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["Meta-classifier free negative sampling for extreme multilabel classification"],"prefix":"10.1007","volume":"113","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-3840-4724","authenticated-orcid":false,"given":"Mohammadreza","family":"Qaraei","sequence":"first","affiliation":[]},{"given":"Rohit","family":"Babbar","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2023,11,20]]},"reference":[{"key":"6468_CR1","unstructured":"Auvolat, A., Chandar, S., Vincent, P.,\u00a0Larochelle, H., & Bengio, Y. (2015). Clustering is efficient for approximate maximum inner product search. arXiv preprint arXiv:1507.05910 ."},{"key":"6468_CR2","doi-asserted-by":"crossref","unstructured":"Babbar, R., & Sch\u00f6lkopf, B. (2017). Dismec: Distributed sparse machines for extreme multi-label classification. In Proceedings of the Tenth ACM international conference on web search and data mining, (pp. 721\u2013729).","DOI":"10.1145\/3018661.3018741"},{"issue":"8","key":"6468_CR3","doi-asserted-by":"publisher","first-page":"1329","DOI":"10.1007\/s10994-019-05791-5","volume":"108","author":"R Babbar","year":"2019","unstructured":"Babbar, R., & Sch\u00f6lkopf, B. (2019). Data scarcity, robustness and extreme multi-label classification. Machine Learning, 108(8), 1329\u20131351.","journal-title":"Machine Learning"},{"issue":"4","key":"6468_CR4","doi-asserted-by":"publisher","first-page":"713","DOI":"10.1109\/TNN.2007.912312","volume":"19","author":"Y Bengio","year":"2008","unstructured":"Bengio, Y., & Sen\u00e9cal, J. S. (2008). Adaptive importance sampling to accelerate training of a neural probabilistic language model. IEEE Transactions on Neural Networks, 19(4), 713\u2013722.","journal-title":"IEEE Transactions on Neural Networks"},{"key":"6468_CR5","unstructured":"Bhatia, K., Dahiya, K.,\u00a0Jain, H.,\u00a0Kar, P.,\u00a0Mittal, A.,\u00a0Prabhu Y., & Varma, M. (2016). The extreme classification repository: Multi-label datasets and code."},{"key":"6468_CR6","unstructured":"Blanc, G., & Rendle, S. (2018). Adaptive sampled softmax with kernel based sampling. In International conference on machine learning, (pp. 590\u2013599). PMLR."},{"key":"6468_CR7","first-page":"291","volume":"2","author":"B Chen","year":"2020","unstructured":"Chen, B., Medini, T., Farwell, J., Tai, C., Shrivastava, A., et al. (2020). Slide: In defense of smart algorithms over hardware acceleration for large-scale deep learning systems. Proceedings of Machine Learning and Systems, 2, 291\u2013306.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"6468_CR8","unstructured":"Daghaghi, S., Medini, T.,\u00a0Meisburger, N.,\u00a0Chen, B.,\u00a0Zhao, M., & Shrivastava, A. (2021). A tale of two efficient and informative negative sampling distributions. In International conference on machine learning, (pp. 2319\u20132329). 
PMLR."},{"key":"6468_CR9","unstructured":"Dahiya, K., Agarwal, A.,\u00a0Saini, D.,\u00a0Gururaj, K.,\u00a0Jiao, J.,\u00a0Singh, A.,\u00a0Agarwal, S.,\u00a0Kar, P.,\u00a0Varma, M. (2021a). Siamesexml: Siamese networks meet extreme classifiers with 100m labels. In International conference on machine learning, (pp. 2330\u20132340). PMLR."},{"key":"6468_CR10","doi-asserted-by":"crossref","unstructured":"Dahiya, K., Gupta, N., Saini, D., Soni, A., Wang, Y., Dave, K., & Varma, M. (2022). Ngame: Negative mining-aware mini-batching for extreme classification. arXiv preprint arXiv:2207.04452 .","DOI":"10.1145\/3539597.3570392"},{"key":"6468_CR11","doi-asserted-by":"crossref","unstructured":"Dahiya, K., Saini, D., Mittal, A., Shaw, A., Dave, K., Soni, A., & Varma, M. (2021b). Deepxml: A deep extreme multi-label learning framework applied to short text documents. In Proceedings of the 14th ACM international conference on web search and data mining, (pp. 31\u201339).","DOI":"10.1145\/3437963.3441810"},{"key":"6468_CR12","doi-asserted-by":"crossref","unstructured":"Jain, H., Balasubramanian, V.,\u00a0Chunduri, B., &\u00a0Varma, M. (2019). Slice: Scalable linear extreme classifiers trained on 100 million labels for related searches. In Proceedings of the twelfth ACM international conference on web search and data mining, (pp. 528\u2013536).","DOI":"10.1145\/3289600.3290979"},{"key":"6468_CR13","doi-asserted-by":"crossref","unstructured":"Jain, H., Prabhu, Y., Varma, M. (2016). August. Extreme multi-label loss functions for recommendation, tagging, ranking and other missing label applications. In KDD.","DOI":"10.1145\/2939672.2939756"},{"key":"6468_CR14","doi-asserted-by":"publisher","first-page":"7987","DOI":"10.1609\/aaai.v35i9.16974","volume":"35","author":"T Jiang","year":"2021","unstructured":"Jiang, T., Wang, D., Sun, L., Yang, H., Zhao, Z., & Zhuang, F. (2021). Lightxml: Transformer with dynamic negative sampling for high-performance extreme multi-label text classification. Proceedings of the AAAI Conference on Artificial Intelligence, 35, 7987\u20137994.","journal-title":"Proceedings of the AAAI Conference on Artificial Intelligence"},{"issue":"3","key":"6468_CR15","doi-asserted-by":"publisher","first-page":"535","DOI":"10.1109\/TBDATA.2019.2921572","volume":"7","author":"J Johnson","year":"2019","unstructured":"Johnson, J., Douze, M., & J\u00e9gou, H. (2019). Billion-scale similarity search with GPUS. IEEE Transactions on Big Data, 7(3), 535\u2013547.","journal-title":"IEEE Transactions on Big Data"},{"key":"6468_CR16","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1007\/s10994-020-05888-2","volume":"109","author":"S Khandagale","year":"2020","unstructured":"Khandagale, S., Xiao, H., & Babbar, R. (2020). Bonsai: Diverse and shallow trees for extreme multi-label classification. Machine Learning, 109, 1\u201321.","journal-title":"Machine Learning"},{"key":"6468_CR17","doi-asserted-by":"crossref","unstructured":"Kharbanda, S., Banerjee, A., Gupta, D., Palrecha, A., & Babbar, R. (2023). Inceptionxml: A lightweight framework with synchronized negative sampling for short text extreme classification. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 760\u2013769).","DOI":"10.1145\/3539618.3591699"},{"key":"6468_CR18","unstructured":"Kharbanda, S., Banerjee, A.,\u00a0Schultheis, E., Babbar, R. (2022). Cascadexml: Rethinking transformers for end-to-end multi-resolution training in extreme multi-label classification. 
In Advances in neural information processing systems."},{"key":"6468_CR19","doi-asserted-by":"crossref","unstructured":"Lee, K., Chang, M. W., & Toutanova, K. (2019). Latent retrieval for weakly supervised open domain question answering. arXiv preprint arXiv:1906.00300 .","DOI":"10.18653\/v1\/P19-1612"},{"key":"6468_CR20","unstructured":"Mikolov, T., Sutskever, I.,\u00a0Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems, 26."},{"key":"6468_CR21","doi-asserted-by":"crossref","unstructured":"Mittal, A., Dahiya, K., Agrawal, S., Saini, D., Agarwal, S., Kar, P., & Varma, M. (2021). Decaf: Deep extreme classification with label features. In Proceedings of the 14th ACM international conference on web search and data mining, (pp. 49\u201357).","DOI":"10.1145\/3437963.3441807"},{"key":"6468_CR22","doi-asserted-by":"crossref","unstructured":"Partalas, I., Kosmopoulos, A., Baskiotis, N., Artieres, T., Paliouras, G., Gaussier, E., Androutsopoulos, I., Amini, M. R., & Galinari, P. (2015). Lshtc: A benchmark for large-scale text classification. arXiv preprint arXiv:1503.08581.","DOI":"10.1145\/2556195.2556208"},{"key":"6468_CR23","doi-asserted-by":"crossref","unstructured":"Prabhu, Y., Kag, A., Harsola, S., Agrawal, R., Varma, M. (2018). Parabel: Partitioned label trees for extreme classification with application to dynamic search advertising. In Proceedings of the 2018 World Wide Web Conference, (pp. 993\u20131002).","DOI":"10.1145\/3178876.3185998"},{"key":"6468_CR24","unstructured":"Rawat, A. S., Menon, A.K., Jitkrittum, W., Jayasumana, S., Yu, F., Reddi, S., & Kumar, S. (2021). Disentangling sampling and labeling bias for learning in large-output spaces. In International conference on machine learning, (pp. 8890\u20138901). PMLR."},{"key":"6468_CR25","first-page":"13857","volume":"32","author":"AS Rawat","year":"2019","unstructured":"Rawat, A. S., Chen, J., Yu, F. X. X., Suresh, A. T., & Kumar, S. (2019). Sampled softmax with random fourier features. Advances in Neural Information Processing Systems, 32, 13857\u201313867.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"6468_CR26","doi-asserted-by":"crossref","unstructured":"Schroff, F., Kalenichenko, D., & Philbin, J. (2015). Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, (pp. 815\u2013823).","DOI":"10.1109\/CVPR.2015.7298682"},{"key":"6468_CR27","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1007\/s10994-022-06228-2","volume":"111","author":"E Schultheis","year":"2022","unstructured":"Schultheis, E., & Babbar, R. (2022). Speeding-up one-versus-all training for extreme classification via mean-separating initialization. Machine Learning, 111, 1\u201324.","journal-title":"Machine Learning"},{"key":"6468_CR28","unstructured":"Shrivastava, A., & Li, P. (2014). Asymmetric lsh (alsh) for sublinear time maximum inner product search (mips). Advances in neural information processing systems,\u00a027."},{"key":"6468_CR29","unstructured":"Vijayanarasimhan, S., Shlens, J., Monga, R., Yagnik, J. (2014). Deep networks with large output spaces. arXiv preprint arXiv:1412.7479."},{"key":"6468_CR30","doi-asserted-by":"crossref","unstructured":"Wu, C. Y., Manmatha, R., Smola, A. J., & Krahenbuhl, P. (2017). Sampling matters in deep embedding learning. 
In Proceedings of the IEEE international conference on computer vision, (pp. 2840\u20132848).","DOI":"10.1109\/ICCV.2017.309"},{"key":"6468_CR31","unstructured":"Xiong, L., Xiong, C., Li, Y., Tang, K.F., Liu, J., Bennett, P., Ahmed, J., & Overwijk, A., (2020). Approximate nearest neighbor negative contrastive learning for dense text retrieval. ICLR ."},{"key":"6468_CR32","unstructured":"Yen, I. E. H., Kale, S.,\u00a0Yu, F.,\u00a0Holtmann-Rice, D.,\u00a0Kumar, S., & Ravikumar, P., (2018). Loss decomposition for fast learning in large output spaces. In International Conference on Machine Learning, (pp. 5640\u20135649). PMLR."},{"key":"6468_CR33","unstructured":"You, R., Zhang, Z.,\u00a0Wang, Z.,\u00a0Dai, S.,\u00a0Mamitsuka, H., &\u00a0Zhu, S. (2019). Attentionxml: Label tree-based attention-aware deep model for high-performance extreme multi-label text classification. In NeurIPS, (pp. 5812\u20135822)."},{"key":"6468_CR34","first-page":"7267","volume":"34","author":"J Zhang","year":"2021","unstructured":"Zhang, J., Chang, W. C., Yu, H. F., & Dhillon, I. (2021). Fast multi-resolution transformer fine-tuning for extreme multi-label text classification. Advances in Neural Information Processing Systems, 34, 7267\u20137280.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"6468_CR35","doi-asserted-by":"publisher","first-page":"3180","DOI":"10.1109\/IJCNN.2005.1556436","volume":"5","author":"S Zhong","year":"2005","unstructured":"Zhong, S. (2005). Efficient online spherical k-means clustering. Proceedings 2005 IEEE International Joint Conference on Neural Networks, 5, 3180\u20133185.","journal-title":"Proceedings 2005 IEEE International Joint Conference on Neural Networks"}],"container-title":["Machine Learning"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10994-023-06468-w.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10994-023-06468-w\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10994-023-06468-w.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,1,18]],"date-time":"2024-01-18T19:09:12Z","timestamp":1705604952000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10994-023-06468-w"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,11,20]]},"references-count":35,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2024,2]]}},"alternative-id":["6468"],"URL":"https:\/\/doi.org\/10.1007\/s10994-023-06468-w","relation":{},"ISSN":["0885-6125","1573-0565"],"issn-type":[{"type":"print","value":"0885-6125"},{"type":"electronic","value":"1573-0565"}],"subject":[],"published":{"date-parts":[[2023,11,20]]},"assertion":[{"value":"14 February 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"15 June 2023","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"17 October 2023","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"20 November 2023","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article 
History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare no conflict of interest to disclose associated with this article.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}},{"value":"Not applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethical approval"}},{"value":"Not applicable.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent to participate"}},{"value":"The authors give consent for the publication of identifiable details in this paper, including figures, tables, and the results, in other projects by mentioning the reference.","order":5,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}}]}}