{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,24]],"date-time":"2026-01-24T18:36:22Z","timestamp":1769279782218,"version":"3.49.0"},"reference-count":43,"publisher":"MDPI AG","issue":"2","license":[{"start":{"date-parts":[[2020,4,9]],"date-time":"2020-04-09T00:00:00Z","timestamp":1586390400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100008383","name":"Bundesministerium f\u00fcr Verkehr und Digitale Infrastruktur","doi-asserted-by":"publisher","award":["16AVF2019A"],"award-info":[{"award-number":["16AVF2019A"]}],"id":[{"id":"10.13039\/100008383","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["MAKE"],"abstract":"<jats:p>This paper analyzes in detail how different loss functions influence the generalization abilities of a deep learning-based next frame prediction model for traffic scenes. Our prediction model is a convolutional long-short term memory (ConvLSTM) network that generates the pixel values of the next frame after having observed the raw pixel values of a sequence of four past frames. We trained the model with 21 combinations of seven loss terms using the Cityscapes Sequences dataset and an identical hyper-parameter setting. The loss terms range from pixel-error based terms to adversarial terms. To assess the generalization abilities of the resulting models, we generated predictions up to 20 time-steps into the future for four datasets of increasing visual distance to the training dataset\u2014KITTI Tracking, BDD100K, UA-DETRAC, and KIT AIS Vehicles. 
All predicted frames were evaluated quantitatively with both traditional pixel-based evaluation metrics, that is, mean squared error (MSE), peak signal-to-noise ratio (PSNR), and structural similarity index (SSIM), and recent, more advanced, feature-based evaluation metrics, that is, Fr\u00e9chet inception distance (FID), and learned perceptual image patch similarity (LPIPS). The results show that solely by choosing a different combination of losses, we can boost the prediction performance on new datasets by up to 55%, and by up to 50% for long-term predictions.<\/jats:p>","DOI":"10.3390\/make2020006","type":"journal-article","created":{"date-parts":[[2020,4,9]],"date-time":"2020-04-09T14:42:03Z","timestamp":1586443323000},"page":"78-98","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":11,"title":["The Importance of Loss Functions for Increasing the Generalization Abilities of a Deep Learning-Based Next Frame Prediction Model for Traffic Scenes"],"prefix":"10.3390","volume":"2","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-1951-6213","authenticated-orcid":false,"given":"Sandra","family":"Aigner","sequence":"first","affiliation":[{"name":"TUM Department of Aerospace and Geodesy, Technical University of Munich, 80333 Munich, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9186-4175","authenticated-orcid":false,"given":"Marco","family":"K\u00f6rner","sequence":"additional","affiliation":[{"name":"TUM Department of Aerospace and Geodesy, Technical University of Munich, 80333 Munich, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2020,4,9]]},"reference":[{"key":"ref_1","unstructured":"Yu, F., Xian, W., Chen, Y., Liu, F., Liao, M., Madhavan, V., and Darrell, T. (2018). BDD100K: A Diverse Driving Video Database with Scalable Annotation Tooling. 
arXiv preprint."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., and Schiele, B. (July, January 26). The Cityscapes Dataset for Semantic Urban Scene Understanding. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.350"},{"key":"ref_3","unstructured":"Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., and Garnett, R. (2015). Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting. Advances in Neural Information Processing Systems 28 (NeurIPS 2015), Curran Associates, Inc."},{"key":"ref_4","unstructured":"Wen, L., Du, D., Cai, Z., Lei, Z., Chang, M.C., Qui, H., Lim, J., Yang, M.H., and Lyu, S. (2015). UA-DETRAC: A New Benchmark and Protocol for Multi-Object Detection and Tracking. arXiv preprint."},{"key":"ref_5","unstructured":"Ranzato, M., Szlam, A., Bruna, J., Mathieu, M., Collobert, R., and Chopra, S. (2014). Video (Language) Modeling: A Baseline for generative Models of natural Videos. arXiv preprint."},{"key":"ref_6","unstructured":"Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., and Garnett, R. (2016). Dynamic Filter Networks. Advances in Neural Information Processing Systems 29 (NeurIPS 2016), Curran Associates, Inc."},{"key":"ref_7","unstructured":"Lotter, W., Kreiman, G., and Cox, D. (2017, January 24\u201326). Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning. Proceedings of the 5th International Conference on Learning Representations ICLR, Toulon, France."},{"key":"ref_8","unstructured":"Elsayed, N., Maida, A.S., and Bayoumi, M. (2018). Reduced-Gate Convolutional LSTM Using Predictive Coding for Spatiotemporal Prediction. arXiv preprint."},{"key":"ref_9","unstructured":"Wei, H., Yin, X., and Lin, P. (2018). Novel Video Prediction for Large-scale Scene using Optical Flow. 
arXiv preprint."},{"key":"ref_10","unstructured":"Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (2018). Sequential Attend, Infer, Repeat: Generative Modelling of Moving Objects. Advances in Neural Information Processing Systems 31 (NeurIPS 2018), Curran Associates, Inc."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Byeon, W., Wang, Q., Srivastava, R.K., and Koumoutsakos, P. (2018, January 8\u201314). ContextVP: Fully Context-Aware Video Prediction. Proceedings of the 15th European Conference on Computer Vision (ECCV 2018), Munich, Germany.","DOI":"10.1007\/978-3-030-01270-0_46"},{"key":"ref_12","unstructured":"Nabavi, S.S., Rochan, M., and Wang, Y. (2018, January 3\u20136). Future Semantic Segmentation with Convolutional LSTM. Proceedings of the 29th British Machine Vision Conference (BMVC 2018), Newcastle upon Tyne, UK."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Xu, J., Ni, B., Li, Z., Cheng, S., and Yang, X. (2018, January 18\u201322). Structure Preserving Video Prediction. Proceedings of the 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00158"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","article-title":"Long Short-Term Memory","volume":"9","author":"Hochreiter","year":"1997","journal-title":"Neural Comput."},{"key":"ref_15","unstructured":"Bhattacharyya, A., Fritz, M., and Schiele, B. (2019, January 6\u20139). Bayesian Prediction of Future Street Scenes using Synthetic Likelihoods. Proceedings of the 7th International Conference on Learning Representations ICLR 2019, New Orleans, LA, USA."},{"key":"ref_16","unstructured":"Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., and Weinberger, K.Q. (2014). Generative Adversarial Networks. 
Advances in Neural Information Processing Systems 27 (NeurIPS 2014), Curran Associates, Inc."},{"key":"ref_17","unstructured":"Bhattacharjee, P., and Das, S. (2018, January 8\u201314). Context Graph based Video Frame Prediction using Locally Guided Objective. Proceedings of the 15th European Conference on Computer Vision\u2014Workshop on Anticipating Human Behavior (ECCV 2018 Workshops), Munich, Germany."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Aigner, S., and K\u00f6rner, M. (2019, January 18\u201320). FutureGAN: Anticipating the Future Frames of Video Sequences using Spatio-Temporal 3d Convolutions in Progressively Growing GANs. Proceedings of the ISPRS\u2014International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Munich, Germany. XLII-2\/W16.","DOI":"10.5194\/isprs-archives-XLII-2-W16-3-2019"},{"key":"ref_19","unstructured":"Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (2017). Temporal Coherency based Criteria for Predicting Video Frames using Deep Multi-stage Generative Adversarial Networks. Advances in Neural Information Processing Systems 30 (NeurIPS 2017), Curran Associates, Inc."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Liang, X., Lee, L., Dai, W., and Xing, E.P. (2017, January 22\u201329). Dual Motion GAN for Future-Flow Embedded Video Prediction. Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV 2017), Venice, Italy.","DOI":"10.1109\/ICCV.2017.194"},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"K\u00f6nig, P., Aigner, S., and K\u00f6rner, M. (2019, January 27\u201330). Enhancing Traffic Scene Predictions with Generative Adversarial Networks. 
Proceedings of the 2019 IEEE Intelligent Transportation Systems Conference ITSC 2019, Auckland, New Zealand.","DOI":"10.1109\/ITSC.2019.8917046"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Luc, P., Neverova, N., Couprie, C., Verbeek, J., and LeCun, Y. (2017, January 22\u201329). Predicting Deeper Into the Future of Semantic Segmentation. Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV 2017), Venice, Italy.","DOI":"10.1109\/ICCV.2017.77"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Jin, X., Li, X., Xiao, H., Shen, X., Lin, Z., Yang, J., Chen, Y., Dong, J., Liu, L., and Jie, Z. (2017, January 22\u201329). Video Scene Parsing With Predictive Feature Learning. Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV 2017), Venice, Italy.","DOI":"10.1109\/ICCV.2017.595"},{"key":"ref_24","unstructured":"Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (2017). Predicting Scene Parsing and Motion Dynamics in the Future. Advances in Neural Information Processing Systems 30 (NeurIPS 2017), Curran Associates, Inc."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Zhu, Y., Sapra, K., Reda, F.A., Shih, K.J., Newsam, S., Tao, A., and Catanzaro, B. (2019, January 16\u201320). Improving Semantic Segmentation via Video Propagation and Label Relaxation. Proceedings of the 2019 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00906"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Gao, H., Xu, H., Cai, Q.Z., Wang, R., Yu, F., and Darrell, T. (November, January 27). Disentangling Propagation and Generation for Video Prediction. 
Proceedings of the 2019 IEEE International Conference on Computer Vision (ICCV 2019), Seoul, Korea.","DOI":"10.1109\/ICCV.2019.00910"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Reda, F.A., Liu, G., Shih, K.J., Kirby, R., Barker, J., Tarjan, D., Tao, A., and Catanzaro, B. (2018, January 8\u201314). SDC-Net: Video Prediction Using Spatially-Displaced Convolution. Proceedings of the 15th European Conference on Computer Vision (ECCV 2018), Munich, Germany.","DOI":"10.1007\/978-3-030-01234-2_44"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Hao, Z., Huang, X., and Belongie, S. (2018, January 18\u201322). Controllable Video Generation with Sparse Trajectories. Proceedings of the 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00819"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Liu, W., Luo, W., Lian, D., and Gao, S. (2018, January 18\u201322). Future Frame Prediction for Anomaly Detection\u2014A New Baseline. Proceedings of the 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00684"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Geiger, A., Lenz, P., and Urtasun, R. (2012, January 16\u201321). Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2012), Providence, RI, USA.","DOI":"10.1109\/CVPR.2012.6248074"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Dollar, P., Wojek, C., Schiele, B., and Perona, P. (2009, January 20\u201325). Pedestrian Detection: A Benchmark. Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009), Miami, FL, USA.","DOI":"10.1109\/CVPR.2009.5206631"},{"key":"ref_32","unstructured":"Karras, T., Aila, T., Laine, S., and Lehtinen, J. (May, January 30). 
Progressive Growing of GANs for Improved Quality, Stability, and Variation. Proceedings of the 6th International Conference on Learning Representations ICLR 2018, Vancouver, BC, Canada."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Johnson, J., Alahi, A., and Fei-Fei, L. (2016, January 8\u201316). Perceptual Losses for Real-Time Style Transfer and Super-Resolution. Proceedings of the 14th European Conference on Computer Vision (ECCV 2016), Amsterdam, The Netherlands.","DOI":"10.1007\/978-3-319-46475-6_43"},{"key":"ref_34","unstructured":"Simonyan, K., and Zisserman, A. (2015, January 7\u20139). Very Deep Convolutional Networks for Large-Scale Image Recognition. Proceedings of the 3rd International Conference on Learning Representations ICLR 2015, San Diego, CA, USA."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., and Fei-Fei, L. (2009, January 20\u201325). ImageNet: A Large-Scale Hierarchical Image Database. Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009), Miami, FL, USA.","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"ref_36","unstructured":"Mathieu, M., Couprie, C., and LeCun, Y. (2016, January 2\u20134). Deep multi-scale video prediction beyond mean square error. Proceedings of the 4th International Conference on Learning Representations ICLR 2016, San Juan, PR, USA."},{"key":"ref_37","unstructured":"Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (2017). Improved Training of Wasserstein GANs. Advances in Neural Information Processing Systems 30 (NeurIPS 2017), Curran Associates, Inc."},{"key":"ref_38","unstructured":"Kingma, D.P., and Ba, J. (2015, January 7\u20139). Adam: A Method for Stochastic Optimization. Proceedings of the 3rd International Conference on Learning Representations ICLR 2015, San Diego, CA, USA."},{"key":"ref_39","unstructured":"Schmidt, F. (2019, July 29). 
Data Set for Tracking Vehicles in Aerial Image Sequences. KIT AIS Vehicles Data Set. Available online: http:\/\/www.ipf.kit.edu\/downloads_data_set_AIS_vehicle_tracking.php."},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"600","DOI":"10.1109\/TIP.2003.819861","article-title":"Image Quality Assessment: From Error Visibility to Structural Similarity","volume":"13","author":"Wang","year":"2004","journal-title":"IEEE Trans. Image Process."},{"key":"ref_41","unstructured":"Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (2017). GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. Advances in Neural Information Processing Systems 30 (NeurIPS 2017), Curran Associates, Inc."},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Zhang, R., Isola, P., Efros, A.A., Shechtman, E., and Wang, O. (2018, January 18\u201322). The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. Proceedings of the 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00068"},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. (July, January 26). Rethinking the Inception Architecture for Computer Vision. 
Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.308"}],"container-title":["Machine Learning and Knowledge Extraction"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2504-4990\/2\/2\/6\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T09:17:05Z","timestamp":1760174225000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2504-4990\/2\/2\/6"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,4,9]]},"references-count":43,"journal-issue":{"issue":"2","published-online":{"date-parts":[[2020,6]]}},"alternative-id":["make2020006"],"URL":"https:\/\/doi.org\/10.3390\/make2020006","relation":{},"ISSN":["2504-4990"],"issn-type":[{"value":"2504-4990","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,4,9]]}}}