{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,17]],"date-time":"2026-06-17T20:00:24Z","timestamp":1781726424726,"version":"3.54.5"},"reference-count":28,"publisher":"Institute of Electronics, Information and Communications Engineers (IEICE)","issue":"9","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["IEICE Trans. Inf. &amp; Syst."],"published-print":{"date-parts":[[2018,9,1]]},"DOI":"10.1587\/transinf.2017edp7278","type":"journal-article","created":{"date-parts":[[2018,8,31]],"date-time":"2018-08-31T18:41:53Z","timestamp":1535740913000},"page":"2315-2322","source":"Crossref","is-referenced-by-count":22,"title":["Deep Reinforcement Learning with Sarsa and Q-Learning: A Hybrid Approach"],"prefix":"10.1587","volume":"E101.D","author":[{"given":"Zhi-xiong","family":"XU","sequence":"first","affiliation":[{"name":"Institute of Command Information System, PLA University of Science and Technology"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Lei","family":"CAO","sequence":"additional","affiliation":[{"name":"Institute of Command Information System, PLA University of Science and Technology"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Xi-liang","family":"CHEN","sequence":"additional","affiliation":[{"name":"Institute of Command Information System, PLA University of Science and Technology"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Chen-xi","family":"LI","sequence":"additional","affiliation":[{"name":"Institute of Command Information System, PLA University of Science and Technology"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yong-liang","family":"ZHANG","sequence":"additional","affiliation":[{"name":"Institute of Command Information System, PLA University of Science and Technology"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jun","family":"LAI","sequence":"additional","affiliation":[{"name":"Institute of Command Information System, PLA University of Science and Technology"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"532","reference":[{"key":"1","unstructured":"[1] R.S. Sutton and A.G. Barto, Introduction to Reinforcement Learning, Decision Theory Models for Applications in Artificial Intelligence: Concepts and Solutions, pp.90-127, 2011."},{"key":"2","doi-asserted-by":"crossref","unstructured":"[2] C.H.C.J. Watkins, \u201cLearning from delayed rewards,\u201d Robotics &amp; Autonomous Systems, vol.15, no.4, pp.233-235, 1989.","DOI":"10.1016\/0921-8890(95)00026-C"},{"key":"3","unstructured":"[3] S. Thrun and A. Schwartz, \u201cIssues in using function approximation for reinforcement learning,\u201d Proc. Fourth Connectionist Models Summer School, vol.14, no.3, pp.65-90, 1993."},{"key":"4","unstructured":"[4] H.V. Hasselt, \u201cDouble Q-learning,\u201d Advances in Neural Information Processing Systems 23, Proceedings of A Meeting Held 6-9 Dec. 2010, Conference on Neural Information Processing Systems 2010, Vancouver, British Columbia, Canada, OAI, pp.2613-2621, 2010."},{"key":"5","unstructured":"[5] V. Mnih, K. Kavukcuoglu, D. Silver, et al., \u201cPlaying Atari with deep reinforcement learning,\u201d arXiv preprint arXiv:1312.5602v1 [cs.LG], 2013."},{"key":"6","doi-asserted-by":"publisher","unstructured":"[6] V. Mnih, K. Kavukcuoglu, D. Silver, A.A. Rusu, J. Veness, M.G. Bellemare, A. Graves, M. Riedmiller, A.K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, \u201cHuman-level control through deep reinforcement learning,\u201d vol.518, no.7540, pp.529-533, 2015. 10.1038\/nature14236","DOI":"10.1038\/nature14236"},{"key":"7","doi-asserted-by":"crossref","unstructured":"[7] M.G. Bellemare, Y. Naddaf, J. Veness, et al., \u201cThe arcade learning environment: an evaluation platform for general agents,\u201d Journal of Artificial Intelligence Research, vol.47, no.1, pp.253-279, 2013.","DOI":"10.1613\/jair.3912"},{"key":"8","unstructured":"[8] H. Van Hasselt, A. Guez, and D. Silver, \u201cDeep Reinforcement learning with double Q-learning,\u201d arXiv preprint arXiv:1509.06461v1 [cs.LG], 2015."},{"key":"9","unstructured":"[9] O. Anschel, N. Baram, and N. Shimkin, \u201cDeep reinforcement learning with averaged target DQN,\u201d 30th Conference on Neural Information Processing Systems, pp.78-99, 2016."},{"key":"10","unstructured":"[10] G.A. Rummery and M. Niranjan, Online Q-learning using connectionist systems, Cambridge University, pp.23-86, 1994."},{"key":"11","doi-asserted-by":"crossref","unstructured":"[11] R.S. Sutton, \u201cDyna, an integrated architecture for learning, planning, and reacting,\u201d AAAI Spring Symposium, pp.151-155, 1991.","DOI":"10.1145\/122344.122377"},{"key":"12","doi-asserted-by":"crossref","unstructured":"[12] J. Peng and R.J. Williams, \u201cEfficient learning and planning within the Dyna framework,\u201d Adaptive Behaviore, vol.78, no.4, pp.437-549, 1993.","DOI":"10.1177\/105971239300100403"},{"key":"13","doi-asserted-by":"publisher","unstructured":"[13] R.S. Sutton, \u201cLearning to predict by the methods of temporal differences[J],\u201d Machine learning, vol.3, no.1, pp.9-44, 1988. 10.1007\/bf00115009","DOI":"10.1007\/BF00115009"},{"key":"14","doi-asserted-by":"publisher","unstructured":"[14] C.J. Watkins and P. Dayan, \u201cQ-learning,\u201d Machine Learning, vol.8, no.3-4, pp.279-292, 1992. 10.1023\/a:1022676722315","DOI":"10.1023\/A:1022676722315"},{"key":"15","doi-asserted-by":"crossref","unstructured":"[15] M. Riedmiller, \u201cNeural fitted q iteration-first experiences with a data efficient neural reinforcement learning method,\u201d Machine Learning: Ecml 2005, European Conference on Machine Learning, Porto, Portugal, pp.317-328, 2005. 10.1007\/11564096_32","DOI":"10.1007\/11564096_32"},{"key":"16","doi-asserted-by":"publisher","unstructured":"[16] L.-J. Lin, \u201cSelf-improving reactive agents based on reinforcement learning, planning and teaching,\u201d Machine Learning, vol.8, no.3, pp.293-321, 1992. 10.1007\/bf00992699","DOI":"10.1007\/BF00992699"},{"key":"17","unstructured":"[17] H. van Hasselt, \u201cDouble Q-learning,\u201d Advances in Neural Information Processing Systems, vol.23, no.7, pp.2613-2621, 2010."},{"key":"18","doi-asserted-by":"publisher","unstructured":"[18] R.S. Sutton and A.G. Barto, \u201cReinforcement learning: an introduction[J],\u201d IEEE Trans. Neural Netw., vol.9, no.5, p.1054, 1998. 10.1109\/tnn.1998.712192","DOI":"10.1109\/TNN.1998.712192"},{"key":"19","doi-asserted-by":"crossref","unstructured":"[19] D. Zhao, H. Wang, K. Shao, et al., \u201cDeep reinforcement learning with experience replay based on SARSA,\u201d IEEE Computational Intelligence, 2017.","DOI":"10.1109\/SSCI.2016.7849837"},{"key":"20","doi-asserted-by":"crossref","unstructured":"[20] W. Hu, \u201cDouble Sarsa and double expected Sarsa with shallow and deep learning,\u201d vol.04, no.4, pp.159-176, 2016.","DOI":"10.4236\/jdaip.2016.44014"},{"key":"21","unstructured":"[21] L.-J. Lin, \u201cReinforcement learning for robots using neural networks,\u201d Technical report, DTIC Document, vol.8, no.4, pp.12-45, 1993."},{"key":"22","doi-asserted-by":"publisher","unstructured":"[22] J.L. McClelland, B.L. Mcnaughton, and R.C. O&apos;Reilly, \u201cWhy there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory,\u201d Psychological Review, vol.102, no.3, p.419, 1995. 10.1037\/\/0033-295x.102.3.419","DOI":"10.1037\/\/0033-295X.102.3.419"},{"key":"23","doi-asserted-by":"publisher","unstructured":"[23] J. O&apos;Neill, B. Pleydell-Bouverie, D. Dupret, and J. Csicsvari, \u201cPlay it again: reactivation of waking experience and memory,\u201d Trends in Neurosciences, vol.33, no.5, pp.220-229, 2010. 10.1016\/j.tins.2010.01.006","DOI":"10.1016\/j.tins.2010.01.006"},{"key":"24","unstructured":"[24] I. Zamora, N.G. Lopez, V.M. Vilches, et al., \u201cExtending the OpenAI Gym for robotics: a toolkit for reinforcement learning using ROS and Gazebo,\u201d vol.9, no.11, pp.89-115, 2016."},{"key":"25","unstructured":"[25] D.P. Kingma and J. Ba, \u201cAdam: A Method for StochasticOptimization,\u201d Computer Science, vol.89, no.5, pp.45-145, 2014."},{"key":"26","doi-asserted-by":"publisher","unstructured":"[26] A. Stephenson, \u201cLXXI. On induced stability,\u201d Philosophical Magazine, vol.17, no.101, pp.765-766, 1968. 10.1080\/14786440508636652","DOI":"10.1080\/14786440508636652"},{"key":"27","unstructured":"[27] A. Moore, \u201cEfficient memory-based learning for robot control,\u201d Technical report, University of Cambridge, Computer Laboratory, vol.63, no.9, pp.62-167, 1990."},{"key":"28","doi-asserted-by":"crossref","unstructured":"[28] G. Dejong and M.W. Spong, \u201cSwinging up the Acrobot: an example of intelligent control,\u201d IEEE American Control Conference, pp.2158-2162, 1994. 10.1109\/acc.1994.752458","DOI":"10.1109\/ACC.1994.752458"}],"container-title":["IEICE Transactions on Information and Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.jstage.jst.go.jp\/article\/transinf\/E101.D\/9\/E101.D_2017EDP7278\/_pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2019,10,23]],"date-time":"2019-10-23T06:08:34Z","timestamp":1571810914000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.jstage.jst.go.jp\/article\/transinf\/E101.D\/9\/E101.D_2017EDP7278\/_article"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2018,9,1]]},"references-count":28,"journal-issue":{"issue":"9","published-print":{"date-parts":[[2018]]}},"URL":"https:\/\/doi.org\/10.1587\/transinf.2017edp7278","relation":{},"ISSN":["0916-8532","1745-1361"],"issn-type":[{"value":"0916-8532","type":"print"},{"value":"1745-1361","type":"electronic"}],"subject":[],"published":{"date-parts":[[2018,9,1]]}}}