{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T05:02:25Z","timestamp":1750309345291,"version":"3.41.0"},"reference-count":76,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2024,10,31]],"date-time":"2024-10-31T00:00:00Z","timestamp":1730332800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"National Science Foundation through CAREER","award":["SHF-2048094, CNS-1932620, CNS-2039087, FMitF-1837131, and CCF-SHF-1932620"],"award-info":[{"award-number":["SHF-2048094, CNS-1932620, CNS-2039087, FMitF-1837131, and CCF-SHF-1932620"]}]},{"name":"Toyota R&D and Siemens Corporate Research through the USC Center for Autonomy and AI, and the Airbus Institute for Engineering Research"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Cyber-Phys. Syst."],"published-print":{"date-parts":[[2024,10,31]]},"abstract":"<jats:p>\n            This article introduces a model-based approach for training feedback controllers for an autonomous agent operating in a highly non-linear (albeit deterministic) environment. We desire the trained policy to ensure that the agent satisfies specific task objectives and safety constraints, both expressed in Discrete-Time Signal Temporal Logic (DT-STL). One advantage for reformulation of a task via formal frameworks, like DT-STL, is that it permits quantitative satisfaction semantics. In other words, given a trajectory and a DT-STL formula, we can compute the\n            <jats:italic>robustness<\/jats:italic>\n            , which can be interpreted as an approximate signed distance between the trajectory and the set of trajectories satisfying the formula. We utilize feedback control, and we assume a feed forward neural network for learning the feedback controller. We show how this learning problem is similar to training recurrent neural networks (RNNs), where the number of recurrent units is proportional to the temporal horizon of the agent\u2019s task objectives. This poses a challenge: RNNs are susceptible to vanishing and exploding gradients, and na\u00efve gradient descent-based strategies to solve long-horizon task objectives thus suffer from the same problems. To address this challenge, we introduce a novel gradient approximation algorithm based on the idea of dropout or gradient sampling. One of the main contributions is the notion of\n            <jats:italic>controller network dropout<\/jats:italic>\n            , where we approximate the NN controller in several timesteps in the task horizon by the control input obtained using the controller in a previous training step. We show that our control synthesis methodology can be quite helpful for stochastic gradient descent to converge with less numerical issues, enabling scalable back-propagation over longer time horizons and trajectories over higher-dimensional state spaces. We demonstrate the efficacy of our approach on various motion planning applications requiring complex spatio-temporal and sequential tasks ranging over thousands of timesteps.\n          <\/jats:p>","DOI":"10.1145\/3696112","type":"journal-article","created":{"date-parts":[[2024,9,16]],"date-time":"2024-09-16T12:30:40Z","timestamp":1726489840000},"page":"1-28","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["Scaling Learning-based Policy Optimization for Temporal Logic Tasks by Controller Network Dropout"],"prefix":"10.1145","volume":"8","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-6147-3675","authenticated-orcid":false,"given":"Navid","family":"Hashemi","sequence":"first","affiliation":[{"name":"University of Southern California, Los Angeles, California, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6255-7566","authenticated-orcid":false,"given":"Bardh","family":"Hoxha","sequence":"additional","affiliation":[{"name":"Toyota NA R&amp;D, Ann Arbor, Michigan, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6208-4233","authenticated-orcid":false,"given":"Danil","family":"Prokhorov","sequence":"additional","affiliation":[{"name":"Toyota NA R&amp;D, Ann Arbor, Michigan, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0456-2129","authenticated-orcid":false,"given":"Georgios","family":"Fainekos","sequence":"additional","affiliation":[{"name":"Toyota NA R&amp;D, Ann Arbor, Michigan, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4683-5540","authenticated-orcid":false,"given":"Jyotirmoy V.","family":"Deshmukh","sequence":"additional","affiliation":[{"name":"University of Southern California, Los Angeles, California, USA"}]}],"member":"320","published-online":{"date-parts":[[2024,11,18]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACC.2013.6580518"},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-21668-3_21"},{"key":"e_1_3_2_4_2","volume-title":"Techniques for Automatic Verification of Real-Time Systems","author":"Alur Rajeev","year":"1991","unstructured":"Rajeev Alur. 1991. Techniques for Automatic Verification of Real-Time Systems. Stanford University."},{"key":"e_1_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.1109\/CDC.2014.7040372"},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/TAC.2016.2638961"},{"key":"e_1_3_2_7_2","unstructured":"Dario Amodei Chris Olah Jacob Steinhardt Paul Christiano John Schulman and Dan Man\u00e9. 2016. Concrete problems in AI safety. arXiv:1606.06565. Retrieved from https:\/\/arxiv.org\/abs\/1606.06565"},{"key":"e_1_3_2_8_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-19835-9_21"},{"key":"e_1_3_2_9_2","unstructured":"Jimmy Lei Ba Jamie Ryan Kiros and Geoffrey E. Hinton. 2016. Layer normalization. arXiv:1607.06450."},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/IROS40897.2019.8968254"},{"key":"e_1_3_2_11_2","unstructured":"Randal Beard. 2008. Quadrotor Dynamics and Control Rev 0.1."},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-50763-7"},{"key":"e_1_3_2_13_2","unstructured":"Luigi Berducci Edgar A. Aguilar Dejan Ni\u010dkovi\u0107 and Radu Grosu. 2021. Hierarchical potential-based reward shaping from task specifications. arXiv:2110.02792. Retrieved from https:\/\/arxiv.org\/abs\/2110.02792"},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10575-8_27"},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICRA40945.2020.9196796"},{"key":"e_1_3_2_16_2","volume-title":"Advances in Neural Information Processing Systems","author":"Chua Kurtland","year":"2018","unstructured":"Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. 2018. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, 31."},{"key":"e_1_3_2_17_2","volume-title":"Order Statistics","author":"David Herbert A.","year":"2004","unstructured":"Herbert A. David and Haikady N. Nagaraja. 2004. Order Statistics. John Wiley & Sons."},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCAD45719.2019.8942130"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-15297-9_9"},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.1007\/11940197_12"},{"key":"e_1_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.automatica.2008.08.008"},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1007\/s41315-019-00103-5"},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ifacol.2015.11.195"},{"key":"e_1_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/TNN.2007.902966"},{"key":"e_1_3_2_25_2","doi-asserted-by":"crossref","unstructured":"Jie Fu and Ufuk Topcu. 2014. Probably approximately correct MDP learning and control with temporal logic constraints. arXiv:1404.7073. Retrieved from https:\/\/arxiv.org\/abs\/1404.7073","DOI":"10.15607\/RSS.2014.X.039"},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/LCSYS.2020.3001875"},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10710-017-9314-z"},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/TAC.2018.2799561"},{"key":"e_1_3_2_29_2","volume-title":"Advances in Neural Information Processing Systems","author":"Hadfield-Menell Dylan","year":"2017","unstructured":"Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart J. Russell, and Anca Dragan. 2017. Inverse reward design. In Advances in Neural Information Processing Systems, 30."},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ifacol.2018.08.013"},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-17108-6_12"},{"key":"e_1_3_2_32_2","unstructured":"Mohammadhosein Hasanbeig Alessandro Abate and Daniel Kroening. 2018. Logically-constrained reinforcement learning. arXiv:1801.08099. Retrieved from https:\/\/arxiv.org\/abs\/1801.08099"},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ifacol.2024.07.445"},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1145\/3576841.3585928"},{"key":"e_1_3_2_35_2","first-page":"4096","volume-title":"2023 American Control Conference (ACC)","author":"Hashemi Navid","unstructured":"Navid Hashemi, Xin Qin, Jyotirmoy V. Deshmukh, Georgios Fainekos, Bardh Hoxha, Danil Prokhorov, and Tomoya Yamaguchi. [n. d.]. Risk-awareness in learning neural controllers for temporal logic objectives. In 2023 American Control Conference (ACC), 4096\u20134103."},{"key":"e_1_3_2_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/LRA.2022.3155197"},{"key":"e_1_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46493-0_39"},{"key":"e_1_3_2_38_2","doi-asserted-by":"crossref","unstructured":"Krishna C. Kalagarla Rahul Jain and Pierluigi Nuzzo. 2020. Synthesis of discounted-reward optimal policies for Markov decision processes under linear temporal logic specifications. arXiv:2011.00632. Retrieved from https:\/\/arxiv.org\/abs\/2011.00632","DOI":"10.23919\/ACC50511.2021.9482749"},{"key":"e_1_3_2_39_2","unstructured":"Parv Kapoor Anand Balakrishnan and Jyotirmoy V. Deshmukh. 2020. Model-based reinforcement learning from signal temporal logic specifications. arXiv:2011.04950. Retrieved from https:\/\/arxiv.org\/abs\/2011.04950"},{"key":"e_1_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1007\/BF01995674"},{"key":"e_1_3_2_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/TRO.2009.2030225"},{"key":"e_1_3_2_42_2","doi-asserted-by":"publisher","DOI":"10.1109\/LCSYS.2022.3172857"},{"key":"e_1_3_2_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/IVS.2019.8814167"},{"key":"e_1_3_2_44_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-66723-8_26"},{"key":"e_1_3_2_45_2","doi-asserted-by":"publisher","DOI":"10.1177\/02783649221082115"},{"key":"e_1_3_2_46_2","doi-asserted-by":"publisher","DOI":"10.1109\/IROS.2017.8206234"},{"key":"e_1_3_2_47_2","doi-asserted-by":"publisher","DOI":"10.1109\/LCSYS.2018.2853182"},{"key":"e_1_3_2_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/LCSYS.2021.3049917"},{"key":"e_1_3_2_49_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-30206-3_12"},{"key":"e_1_3_2_50_2","volume-title":"International Conference on Learning Representations","author":"Pan Alexander","year":"2022","unstructured":"Alexander Pan, Kush Bhatia, and Jacob Steinhardt. 2022. The effects of reward misspecification: Mapping and mitigating misaligned models. In International Conference on Learning Representations."},{"key":"e_1_3_2_51_2","doi-asserted-by":"publisher","DOI":"10.1109\/CCTA.2017.8062628"},{"key":"e_1_3_2_52_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCPS.2018.00026"},{"key":"e_1_3_2_53_2","doi-asserted-by":"publisher","DOI":"10.23919\/ACC50511.2021.9483206"},{"key":"e_1_3_2_54_2","doi-asserted-by":"publisher","DOI":"10.1109\/SFCS.1977.32"},{"key":"e_1_3_2_55_2","doi-asserted-by":"publisher","DOI":"10.1109\/LRA.2022.3226072"},{"key":"e_1_3_2_56_2","doi-asserted-by":"publisher","DOI":"10.1109\/CDC.2014.7039363"},{"key":"e_1_3_2_57_2","doi-asserted-by":"publisher","DOI":"10.1145\/2728606.2728628"},{"key":"e_1_3_2_58_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ifacol.2021.08.465"},{"key":"e_1_3_2_59_2","article-title":"Combined left and right temporal robustness for control under STL specifications","author":"Rodionova Al\u00ebna","year":"2022","unstructured":"Al\u00ebna Rodionova, Lars Lindemann, Manfred Morari, and George J. Pappas. 2022. Combined left and right temporal robustness for control under STL specifications. IEEE Control Systems Letters (2022).","journal-title":"IEEE Control Systems Letters"},{"key":"e_1_3_2_60_2","doi-asserted-by":"publisher","DOI":"10.15607\/RSS.2016.XII.017"},{"key":"e_1_3_2_61_2","doi-asserted-by":"publisher","DOI":"10.1109\/CDC.2014.7039527"},{"key":"e_1_3_2_62_2","doi-asserted-by":"publisher","DOI":"10.1109\/ALLERTON.2015.7447084"},{"key":"e_1_3_2_63_2","first-page":"9460","article-title":"Defining and characterizing reward gaming","volume":"35","author":"Skalse Joar","year":"2022","unstructured":"Joar Skalse, Nikolaus Howe, Dmitrii Krasheninnikov, and David Krueger. 2022. Defining and characterizing reward gaming. Advances in Neural Information Processing Systems 35 (2022), 9460\u20139471.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_64_2","article-title":"Reward design via online gradient ascent","volume":"23","author":"Sorg Jonathan","year":"2010","unstructured":"Jonathan Sorg, Richard L. Lewis, and Satinder Singh. 2010. Reward design via online gradient ascent. In Advances in Neural Information Processing Systems, 23.","journal-title":"Advances in Neural Information Processing Systems,"},{"key":"e_1_3_2_65_2","doi-asserted-by":"publisher","DOI":"10.5555\/2627435.2670313"},{"key":"e_1_3_2_66_2","doi-asserted-by":"publisher","DOI":"10.1109\/CDC49753.2023.10383605"},{"key":"e_1_3_2_67_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-53288-8_1"},{"key":"e_1_3_2_68_2","doi-asserted-by":"publisher","DOI":"10.1109\/CDC.2010.5717316"},{"key":"e_1_3_2_69_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.arcontrol.2004.11.002"},{"key":"e_1_3_2_70_2","first-page":"308","volume-title":"Learning for Dynamics and Control","author":"Venkataraman Harish","year":"2020","unstructured":"Harish Venkataraman, Derya Aksaray, and Peter Seiler. 2020. Tractable reinforcement learning of signal temporal logic objectives. In Learning for Dynamics and Control. PMLR, 308\u2013317."},{"key":"e_1_3_2_71_2","doi-asserted-by":"publisher","DOI":"10.5555\/1062391"},{"key":"e_1_3_2_72_2","doi-asserted-by":"publisher","DOI":"10.2514\/6.2021-2355"},{"key":"e_1_3_2_73_2","first-page":"5319","volume-title":"the International Conference on Robotics and Automation","author":"Wolff Eric M.","year":"2014","unstructured":"Eric M. Wolff, Ufuk Topcu, and Richard M. Murray. 2014. Optimization-based control of nonlinear systems with linear temporal logic specifications. In the International Conference on Robotics and Automation, 5319\u20135325."},{"key":"e_1_3_2_74_2","doi-asserted-by":"publisher","DOI":"10.1109\/TAC.2012.2195811"},{"key":"e_1_3_2_75_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ifacol.2015.11.152"},{"key":"e_1_3_2_76_2","doi-asserted-by":"publisher","DOI":"10.1145\/3302504.3311814"},{"key":"e_1_3_2_77_2","doi-asserted-by":"publisher","DOI":"10.1145\/3358239"}],"container-title":["ACM Transactions on Cyber-Physical Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3696112","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3696112","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T00:05:32Z","timestamp":1750291532000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3696112"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,10,31]]},"references-count":76,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2024,10,31]]}},"alternative-id":["10.1145\/3696112"],"URL":"https:\/\/doi.org\/10.1145\/3696112","relation":{},"ISSN":["2378-962X","2378-9638"],"issn-type":[{"type":"print","value":"2378-962X"},{"type":"electronic","value":"2378-9638"}],"subject":[],"published":{"date-parts":[[2024,10,31]]},"assertion":[{"value":"2024-04-15","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-08-13","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-11-18","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}