Table of Contents

Cover
Title Page
Copyright Page
Preface
Acknowledgments
About the Authors
1 Introduction
  1.1 Introduction
  1.2 Single Agent Planning
  1.3 Multi agent Planning and Coordination
  1.4 Coordination by Optimization Algorithm
  1.5 Summary
  References
2 Improve Convergence Speed of Multi Agent Q Learning for Cooperative Task Planning
  2.1 Introduction
  2.2 Literature Review
  2.3 Preliminaries
  2.4 Proposed MAQL
  2.5 Proposed FCMQL Algorithms and Their Convergence Analysis
  2.6 FCMQL Based Cooperative Multi agent Planning
  2.7 Experiments and Results
  2.8 Conclusions
  2.9 Summary
  2.A More Details on Experimental Results
  References
3 Consensus Q Learning for Multi agent Cooperative Planning
  3.1 Introduction
  3.2 Preliminaries
  3.3 Consensus
  3.4 Proposed CoQL and Planning
  3.5 Experiments and Results
  3.6 Conclusions
  3.7 Summary
  References
4 An Efficient Computing of Correlated Equilibrium for Cooperative Q Learning Based Multi Robot Planning
  4.1 Introduction
  4.2 Single Agent Q Learning and Equilibrium Based MAQL
  4.3 Proposed Cooperative MAQL and Planning
  4.4 Complexity Analysis
  4.5 Simulation and Experimental Results
  4.6 Conclusion
  4.7 Summary
  Appendix 4.A Supporting Algorithm and Mathematical Analysis
  References
5 A Modified Imperialist Competitive Algorithm for Multi Robot Stick Carrying Application
  5.1 Introduction
  5.2 Problem Formulation for Multi Robot Stick Carrying
  5.3 Proposed Hybrid Algorithm
  5.4 An Overview of FA
  5.5 Proposed ICFA
  5.6 Simulation Results
  5.7 Computer Simulation and Experiment
  5.8 Conclusion
  5.9 Summary
  Appendix 5.A Additional Comparison of ICFA
  References
6 Conclusions and Future Directions
  6.1 Conclusions
  6.2 Future Directions
Index
End User License Agreement

List of Tables

Chapter 1
  Table 1.1 Trace of Dijkstra's algorithm for Figure 1.11.
  Table 1.2 Trace of A* algorithm from Figure 1.10.
  Table 1.3 Trace of D* algorithm from Figure 1.12.
  Table 1.4 Expected reward of R1 and R2 at MSNE.
Chapter 2
  Table 2.1 List of acronyms.
  Table 2.2 Details of 10 × 10 grid maps.
  Table 2.3 Run time complexity of Algorithm 2.3 over reference algorithms in d...
  Table 2.4 Run time complexity of Algorithm 2.3 over reference algorithms in s...
  Table 2.5 Time taken by Khepera II mobile robots to reach a team goal with diffe...
  Table 2.A.1 Number of joint state–action pair converged in deterministic situ...
  Table 2.A.2 Number of joint state–action pair converged in stochastic situati...
  Table 2.A.3 Count of team goal explored in the deterministic situation for tw...
  Table 2.A.4 Count of team goal explored in the stochastic situation for two a...
Chapter 3
  Table 3.1 List of acronyms.
  Table 3.2 Planning performance.
Chapter 4
  Table 4.1 Average of the percentage (%) of joint state–action pair converged ...
  Table 4.2 Average run time complexity of different learning algorithms (secon...
  Table 4.3 Average run time complexity of different planning algorithms (secon...
  Table 4.4 Average run time complexity of different planning algorithms (secon...
  Table 4.A.1 Time complexity analysis.
Chapter 5
  Table 5.1 Comparative analysis of performance of the proposed ICFA with other...
  Table 5.2 Comparative analysis of performance of the proposed ICFA with other...
  Table 5.3 Average rankings obtained through Friedman's test
  Table 5.4 Comparison of number of steps, average path traversed, and average ...
  Table 5.A.1 No of successful runs out of 25 runs and success performance in p...
List of Illustrations

Chapter 1
  Figure 1.1 Single agent system.
  Figure 1.2 Three discrete states in an environment.
  Figure 1.3 Robot executing action Right (R) at state s1 and moves to the nex...
  Figure 1.4 Deterministic state transition.
  Figure 1.5 Stochastic state transition.
  Figure 1.6 Two dimensional 5 × 5 grid environment.
  Figure 1.7 Refinement approach in robotics.
  Figure 1.8 Hierarchical tree.
  Figure 1.9 Hierarchical model.
  Figure 1.10 Two dimensional 3 × 3 grid environment.
  Figure 1.11 Corresponding graph of Figure 1.10.
  Figure 1.12 Two dimensional 3 × 3 grid environment with an obstacle.
  Figure 1.13 Structure of reinforcement learning.
  Figure 1.14 Variation of average reward with the number of trial for differe...
  Figure 1.15 Correlation between the RL and DP.
  Figure 1.16 Single agent Q learning.
  Figure 1.17 Possible next state in stochastic situation.
  Figure 1.18 Single agent planning.
  Figure 1.19 Multi agent system with m agents.
  Figure 1.20 Robots executing joint action <R, L> at joint state <1, 8> and m...
  Figure 1.21 Classification of multi robot systems.
  Figure 1.22 Hands gestures in rock paper scissor game: (a) rock, (b) paper, ...
  Figure 1.23 Rock paper scissor game.
  Figure 1.24 Reward mapping from joint Q table to reward matrix.
  Figure 1.25 Pure strategy Nash equilibrium evaluation. (a) Fix A1 = L and A2...
  Figure 1.26 Evaluation of mixed strategy Nash equilibrium.
  Figure 1.27 Reward matrix for tennis game.
  Figure 1.28 Reward matrix of in a common reward two agent static game.
  Figure 1.29 Pure strategy Egalitarian equilibrium, which is one variant of C...
  Figure 1.30 Game of chicken.
  Figure 1.31 Reward matrix in the game of chicken.
  Figure 1.32 Constant sum game.
  Figure 1.33 Matching pennies.
  Figure 1.34 Reward matrix in Prisoner's Dilemma game.
  Figure 1.35 Correlation among the MARL, DP, and GT.
  Figure 1.36 Classification of multi agent reinforcement learning.
  Figure 1.37 The climbing game reward matrix.
  Figure 1.38 The penalty game reward matrix.
  Figure 1.39 The penalty game reward matrix.
  Figure 1.40 Individual Q values obtained in the climbing game reward matrix ...
  Figure 1.41 The penalty game reward matrix.
  Figure 1.42 Individual Q values obtained in the penalty game reward matrix b...
  Figure 1.43 Reward matrix of a three player coordination game.
  Figure 1.44 Reward matrix in a two player two agent game.
  Figure 1.45 Nonstrict EDNP in normal form game.
  Figure 1.46 Multistep negotiation process between agent A and B.
  Figure 1.47 Multi robot coordination for the well known stick carrying probl...
  Figure 1.48 Multi robot local planning by swarm/evolutionary algorithm.
  Figure 1.49 Surface plot of (1.97).
  Figure 1.50 Surface plot of (1.98).
  Figure 1.51 Steps of Differential evolution (DE) algorithm [132].
Chapter 2
  Figure 2.1 Block diagram of reinforcement learning (RL).
  Figure 2.2 Experimental workspace for two agents during the learning phase....
  Figure 2.3 Convergence plot of NQLP12 and reference algorithms for two agent...
  Figure 2.4 Average of average reward (AAR) plot of NQLP12 and reference algo...
  Figure 2.5 Joint action selection strategy in EQLP12 and reference algorithm...
  Figure 2.6 Cooperative path planning to carry a triangle by three robots in ...
  Figure 2.7 Cooperative path planning to carry a stick by two Khepera II mobi...
  Figure 2.8 Cooperative path planning to carry a stick by two Khepera II mobi...
  Figure 2.A.1 Convergence plot of FCMQL and reference algorithms for two agen...
  Figure 2.A.2 Convergence plot of EQLP12 and reference algorithms for three a...
  Figure 2.A.3 Convergence plot of EQLP12 and reference algorithms for four ag...
  Figure 2.A.4 CR versus learning epoch plot for FCMQL and reference algorithm...
  Figure 2.A.5 Average of average reward (AAR) plot of FCMQL and reference alg...
  Figure 2.A.6 Average of average reward (AAR) plot of EQLP12 and reference al...
  Figure 2.A.7 Average of average reward (AAR) plot of EQLP12 and reference al...
  Figure 2.A.8 Joint action selection strategy in EQLP12 and reference algorit...
  Figure 2.A.9 Joint action selection strategy in EQLP12 and reference algorit...
  Figure 2.A.10 Path planning with stick in deterministic situation by: (a) NQ...
  Figure 2.A.11 Path planning with stick in stochastic situation by: (a) NQIMP...
  Figure 2.A.12 Path planning with triangle in stochastic situation by: (a) NQ...
  Figure 2.A.13 Path planning with square in stochastic situation by: (a) NQIM...
  Figure 2.A.14 Path planning with square in deterministic situation by: (a) N...
Chapter 3
  Figure 3.1 Equilibrium selection in multi agent system. (a) Two UE (ax and b...
  Figure 3.2 AAR versus learning epoch for two agent system.
  Figure 3.3 AAR versus learning epoch for three agent system.
  Figure 3.4 Planning path offered by the consensus based multi agent planning...
  Figure 3.5 Planning path offered by the Nash Q learning based planning algor...
Chapter 4
  Figure 4.1 Corner cell, boundary cell, and other cell.
  Figure 4.2 Feasible joint states for two agent systems in stick carrying pro...
  Figure 4.3 Convergence comparison of ΩQL, CΩQL, NQL, FQL, and CQL algorithms...
  Figure 4.4 Convergence comparison of ΩQL, CΩQL, NQL, FQL, and CQL algorithms...
  Figure 4.5 Convergence comparison of ΩQL, CΩQL, NQL, FQL, and CQL algorithms...
  Figure 4.6 (Map 4.1) Planning with box by CQIP, CΩMP, and ΩMP algorithms.
  Figure 4.7 (Map 4.1) Planning using Khepera II mobile robot by CQIP, CΩMP, a...
  Figure 4.8 (Map 4.2) Planning with stick by CQIP, CΩMP, and ΩMP algorithms....
  Figure 4.9 (Map 4.2) Path planning using Khepera II mobile robot by CQIP, CΩ...
  Figure 4.10 (Map 4.3) Path planning with triangle employing CQIP, CΩMP, and ...
Chapter 5
  Figure 5.1 Diagram illustrating the calculation of d.
  Figure 5.2 Evolution of the expected population variance.
  Figure 5.3 Relative performance in mean best objective function versus funct...
  Figure 5.4 Relative performance in mean best objective function versus funct...
  Figure 5.5 Relative performance in mean best objective function versus funct...
  Figure 5.6 Relative performance in mean best objective function versus funct...
  Figure 5.7 Relative performance in accuracy versus function evaluation for I...
  Figure 5.8 Variation of FEs required for convergence to predefined threshold...
  Figure 5.9 Graphical representation of Bonferroni–Dunn's procedure consideri...
  Figure 5.10 Initial (a) and final configuration of the world map after execu...
  Figure 5.11 Average total path traversed versus number of obstacles.
  Figure 5.12 Average total path deviation versus number of obstacles.
  Figure 5.13 Average uncovered target distance versus number of steps with nu...
  Figure 5.14 Final configuration of the world map after experiment using Khep...
Multi‐Agent Coordination
A Reinforcement Learning Approach

Arup Kumar Sadhu
Amit Konar
This edition first published 2021
© 2021 John Wiley & Sons, Inc.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Arup Kumar Sadhu and Amit Konar to be identified as the authors of this work has been asserted in accordance with law.

Registered Office
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

Editorial Office
111 River Street, Hoboken, NJ 07030, USA

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print on demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty
In view of ongoing research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of experimental reagents, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each chemical, piece of equipment, reagent, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging-in-Publication Data
Names: Sadhu, Arup Kumar, author. | Konar, Amit, author.
Title: Multi-agent coordination : a reinforcement learning approach / Arup Kumar Sadhu, Amit Konar.
Description: Hoboken, New Jersey : Wiley-IEEE, [2021] | Includes bibliographical references and index.
Identifiers: LCCN 2020024706 (print) | LCCN 2020024707 (ebook) | ISBN 9781119699033 (cloth) | ISBN 9781119698999 (adobe pdf) | ISBN 9781119699026 (epub)
Subjects: LCSH: Reinforcement learning. | Multiagent systems.
Classification: LCC Q325.6 .S23 2021 (print) | LCC Q325.6 (ebook) | DDC 006.3/1--dc23
LC record available at https://lccn.loc.gov/2020024706
LC ebook record available at https://lccn.loc.gov/2020024707

Cover design: Wiley
Cover image: © Color4260/Shutterstock
Preface

Coordination is a fundamental trait in lower-level organisms, which use their collective effort to serve their goals. Nature offers hundreds of interesting examples of coordination. For example, an individual ant cannot carry even a small food item, yet ants collectively carry quite voluminous food to their nest. The way an ant traces the trajectory of its predecessor by following the deposited pheromone is equally striking. The queen bee in her nest directs the worker bees in specific directions by her dance patterns and gestures to collect food resources. These natural phenomena often remind us of the scope of coordination among agents to utilize their collective intelligence and activities to serve complex goals.

Coordination and planning are closely related terminologies in the domain of multi-robot systems. Planning refers to the collection of feasible steps required to reach a predefined goal from a given position, whereas coordination indicates the skillful interaction among the agents needed to generate a feasible planning step. Coordination is therefore an important issue in multi-robot systems addressing complex real-world problems. Coordination is usually of three types: cooperation, competition, and mixed. As evident from its name, cooperation refers to improving the performance of the agents to serve complex goals, which otherwise seem very hard for an individual agent because of the restricted availability of hardware/software resources or the deadline/energy limits of the tasks. Unlike cooperation, competition refers to serving conflicting goals by two (teams of) agents. For example, in robot soccer the two teams compete to win the game: each team plans both offensively and defensively to score goals and thus acts competitively. Mixed coordination indicates a mixture of cooperation and competition; in the soccer example, inter-team competition combined with intra-team cooperation constitutes mixed coordination. The most common usage of coordination in robotics lies in the cooperation of agents to serve a common goal. This book deals with the cooperation of robots/robotic agents to efficiently complete a complex task.

In recent times, researchers have taken a keen interest in employing machine learning for multi-agent cooperation. The primary advantage of machine learning is to generate action plans in sequence from the available sensory readings of the robots. In the case of a single robot, learning the action plans from the sensory readings is straightforward. In the multi-robot context, however, the positional changes of the other robots act as additional inputs for the learner robot, and thus learning is relatively difficult. Several machine learning and evolutionary algorithms have been adopted over the last two decades to handle such situations. The simplest of all is the supervised learning technique, which requires an exhaustive list of sensory instances and the corresponding action plans of the robots. Usually, a human experimenter provides these data from his/her long acquaintance with such problems or by direct measurement of the sensory instances and decisions. Because the set of training instances is very large, the engineer is burdened with the uncomfortable task of not missing a single instance that carries a valuable mapping from sensory instance to robot action plan.
Because of the difficulty of generating training instances and the excessive computational overhead of learning those instances, coupled with the need to handle dynamic situations, researchers felt the importance of reinforcement learning (RL). In RL, we need not provide any training instance; instead, a critic provides feedback to the learning algorithm about the possible reward/penalty of the actions taken by the agent. On receiving this approximate measure of penalty/reward, the agent(s) understand which particular sensory-motor instances they need to learn for future planning applications. The dynamic nature of the environment can thus easily be learned by RL. In the multi-agent scenario, RL needs to take care of learning in the joint state/action space of the agents. Here, each agent learns the sensory-motor instances in the joint state/action space with the ultimate motive of learning the best actions for itself to optimize its rewards.

The superiority of evolutionary algorithms (EAs) in optimizing diverse objective functions is subject to the No Free Lunch Theorem (NFLT). According to the NFLT, the expected effectiveness of any two traditional EAs across all possible optimization problems is identical. A self-evident implication of the NFLT is that the elevated performance of one EA, say A, over another, say B, for one class of optimization problems is counterbalanced by their respective performances over another class. It is therefore practically difficult to devise a universal EA that would solve all problems. This paves the way for the hybridization of EAs with other optimization strategies, machine learning techniques, and heuristics. In the evolutionary computation paradigm, hybridization refers to the process of synergistically integrating the attractive features of two or more EAs to develop a new hybrid EA. The hybrid EA is expected to outperform its ancestors with respect to both accuracy and complexity on application-specific or general benchmark problems. The fusion of EAs through hybridization can hence be regarded as the key to overcoming their individual limitations. Apart from RL, therefore, hybridization of EAs is also an effective approach to multi-robot coordination in a complex environment. The primary objective of an EA in the context of multi-robot coordination is the minimization of the time consumed by the robots (i.e. the length of the path to be traversed by the robots) for complete traversal of the planned trajectory. In other words, the robots plan their local trajectories so that they shift from their given positions to the next positions (subgoals) in a time-optimal sense while avoiding collision with the obstacles or the boundary of the world map. The optimization algorithm is executed in each local planning step to move a small distance; hence, the robots cumulatively move to the desired goal position through a sequence of local planning steps. There are traces of literature on the hybridization of EAs.

Several algorithms for multi-agent learning are available in the literature, each with one specific flavor to optimize certain learning intents of the agents. Of these algorithms, quite a few interesting works on multi-agent Q-learning (MAQL) have been reported in the literature. Among the state-of-the-art MAQL algorithms, the following need special mention. Claus and Boutilier aimed at solving the coordination problem using two types of reinforcement learners. The first one, called the independent learner (IL), takes care of the learning behavior of an individual agent by ignoring the presence of other agents. The second one, called the joint action learner (JAL), considers all agents, including itself, to learn in the joint action space.
Unlike JAL, in the Team Q-learning proposed by Littman, an agent updates its Q-value at a joint state–action pair without utilizing the associated agents' rewards; rather, the value function of the agent at the next joint state is evaluated by taking the maximum Q-value among the joint actions at the next joint state. Ville proposed the Asymmetric Q-learning (AQL) algorithm, where the leader agents are capable of maintaining all the agents' Q-tables; the follower agents, however, are not allowed to maintain all the agents' Q-tables and hence simply maximize their own rewards. In AQL, agents always achieve a pure strategy Nash equilibrium (NE), although a mixed strategy NE may also exist. Hu and Wellman extended Littman's Minimax Q-learning to the general-sum stochastic game (where the summation of all agents' payoffs is neither zero nor constant) by taking into account other agents' dynamics using NE. They also offered a proof of convergence of their algorithm. In case multiple NE occur, one is selected optimally. Littman proposed the Friend-or-Foe Q-learning (FQL) algorithm for general-sum games, in which the learner is instructed to treat each other agent either as a friend (Friend Q-learning) or as a foe (Foe Q-learning). FQL provides a stronger convergence guarantee than the existing NE-based learning rules. Greenwald and Hall proposed correlated Q-learning (CQL), employing the correlated equilibrium (CE) to generalize both Nash Q-learning (NQL) and FQL.

The bottlenecks of the above MAQL algorithms are the selection of the update policy for adaptation of the Q-tables in joint state–action space and the curse of dimensionality with an increase in the number of learning agents. Several attempts have been made to handle the curse of dimensionality in MAQL. Jelle and Nikos proposed Sparse Cooperative Q-learning, where a sparse representation of the joint state–action space of the agents is obtained by identifying the need for coordination among the agents at a joint state. Here, agents coordinate their actions only in a few joint states. Hence, each agent maintains two Q-tables: an individual-action Q-table for the uncoordinated joint states and a joint-action Q-table for the coordinated joint states. For the uncoordinated states, a global Q-value is evaluated by adding the individual Q-values. Zinkevich offers a neural network-based approach for a generalized representation of the state space for multi-agent coordination. By such generalization, agents (here robots) can avoid collision with an obstacle or other robots while collecting minimum information from the sensors. Reinaldo et al. proposed a novel algorithm to heuristically accelerate traditional MAQL (TMAQL) algorithms.

In the MAQL literature, agents converge either to NE or to CE. Equilibrium-based MAQL algorithms are the most popular for their inherent ability to determine an optimal strategy (equilibrium) at a given joint state. Hu et al. identified the phenomenon of similar equilibria in different joint states and introduced the concept of equilibrium transfer to accelerate the state-of-the-art equilibrium-based MAQL (NQL and CQL). In equilibrium transfer, agents recycle previously computed equilibria having very small transfer loss. Recently, Zhang et al. attempted to reduce the dimension of the Q-tables in NQL. The reduction is achieved by allowing the agents to store the Q-values in joint state–individual action space, instead of joint state–action space.
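To make the joint state–action bookkeeping shared by these algorithms concrete, the fragment below sketches a Team Q-learning style update of the kind described above: a single Q-table indexed by the joint state and joint action, with the bootstrap target taken as the maximum Q-value over all joint actions at the next joint state. It is only a minimal sketch under assumed names (`Q`, `alpha`, `gamma`, and the discrete joint encodings); it is not the exact formulation developed later in the book.

```python
from collections import defaultdict
from itertools import product

# Hypothetical joint-action Q-table: Q[(joint_state, joint_action)] -> value.
# joint_state and joint_action are tuples, one entry per agent.
Q = defaultdict(float)

ACTIONS = ["L", "R", "U", "D"]                     # per-agent action set (illustrative)
JOINT_ACTIONS = list(product(ACTIONS, repeat=2))   # all joint actions of two agents

def team_q_update(joint_state, joint_action, reward, next_joint_state,
                  alpha=0.1, gamma=0.9):
    """One Team Q-learning style update: bootstrap with the maximum
    Q-value over all joint actions at the next joint state."""
    best_next = max(Q[(next_joint_state, a)] for a in JOINT_ACTIONS)
    target = reward + gamma * best_next
    Q[(joint_state, joint_action)] += alpha * (target - Q[(joint_state, joint_action)])

# Example: two robots in cells (1, 8) execute joint action <R, L>,
# receive a shared reward of +1, and land in cells (2, 7).
team_q_update(joint_state=(1, 8), joint_action=("R", "L"),
              reward=1.0, next_joint_state=(2, 7))
```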
In the state-of-the-art MAQL (NQL and CQL), balancing exploration and exploitation during the learning phase is an important issue. Traditional approaches used to balance exploration/exploitation in MAQL are summarized here. The ε-greedy exploration, although widely used, needs the value of ε to be tuned, which is time-costly. In the Boltzmann strategy, the action selection probability is controlled by tuning a control parameter (the temperature) and by utilizing the Q-values of all actions at a given state; setting the temperature to infinity (zero) implies pure exploration (exploitation). Unfortunately, the Boltzmann strategy adversely affects the speed of learning. Evolution of the Boltzmann strategy toward better performance is observed in a series of works. However, the above selection mechanisms are not suitable for selecting a joint action preferred by the team (all the agents), because the agents offer dissimilar joint Q-values at a common joint state–action pair. There are traces of literature concerning joint action selection at a joint state during learning. However, to the best of our knowledge, no existing work considers the problems addressed in this book.
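For concreteness, the sketch below contrasts the two single-agent selection rules just mentioned: ε-greedy selection, which explores uniformly with probability ε, and Boltzmann (softmax) selection, whose temperature parameter trades exploration against exploitation. The function names and parameter defaults are illustrative assumptions, not taken from the algorithms developed in later chapters.

```python
import math
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def boltzmann(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q/temperature).
    A very high temperature approaches pure exploration; a temperature
    near zero approaches pure (greedy) exploitation."""
    prefs = [math.exp(q / temperature) for q in q_values]
    total = sum(prefs)
    probs = [p / total for p in prefs]
    return random.choices(range(len(q_values)), weights=probs, k=1)[0]

# Example: Q-values of four actions at one state.
q = [0.2, 0.5, 0.1, 0.4]
print(epsilon_greedy(q, epsilon=0.2), boltzmann(q, temperature=0.5))
```

As the text notes, neither rule extends directly to choosing a single joint action for the whole team, since each agent may rank the joint actions differently at the same joint state.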
The book includes six chapters. Chapter 1 provides an introduction to multi-robot coordination algorithms for complex real-world problems, including transportation of a box/stick, formation control for defense applications, and soccer playing by multiple robots, utilizing the principles of RL, the theory of games, dynamic programming, and/or EAs. Naturally, this chapter provides a thorough survey of the existing RL literature, with a brief overview of evolutionary optimization, to examine the role of the algorithms in the context of multi-agent coordination. Chapter 1 also covers multi-robot coordination employing evolutionary optimization and, especially, RL for cooperative, competitive, and mixed settings, with application to static and dynamic games. The latter part of the chapter gives an overview of the metrics used to compare the performance of coordination algorithms, and fundamental metrics for performance analysis are defined to study the learning and planning algorithms.

Chapter 2 offers learning-based planning algorithms, obtained by extending the traditional multi-agent Q-learning algorithms (NQL and CQL) for multi-robot coordination and planning. This extension is achieved by employing two interesting properties. The first property deals with the exploration of the team goal (the simultaneous success of all the robots), and the other is related to the selection of the joint action at a given joint state. The exploration of the team goal is realized by allowing the agents that are capable of reaching their goals to wait at their individual goal states until the remaining agents explore their individual goals, synchronously or asynchronously. Selection of the joint action, which is a crucial problem in traditional multi-agent Q-learning, is performed here by taking the intersection of the individually preferred joint actions of all the agents. In case the resulting intersection is a null set, the individual actions are selected randomly or by following classical techniques. The superiority of the proposed learning and learning-based planning algorithms over contestant algorithms is validated in terms of the speed of convergence and run-time complexity, respectively.

In Chapter 3, it is shown that robots may select a suboptimal equilibrium in the presence of multiple types of equilibria (here NE or CE). From this perspective, robots need to adopt a strategy that can select the optimal equilibrium in each step of the learning and the planning.
To address this bottleneck of optimal equilibrium selection among multiple equilibrium types, Chapter 3 presents a novel consensus Q-learning (CoQL) algorithm for multi-robot coordination, obtained by extending the equilibrium-based multi-agent Q-learning algorithms. It is also shown that a consensus (joint action) jointly satisfies the conditions of the coordination-type pure strategy NE and the pure strategy
CE. The superiority of the proposed CoQL algorithm over traditional reference algorithms in terms of average reward collection is shown in the experimental section. In addition, the proposed consensus-based planning algorithm is verified considering the multi-robot stick-carrying problem as the testbed.

Unlike CQL, Chapter 4 proposes an attractive approach that adapts the composite rewards of all the agents in one Q-table in joint state–action space during learning; subsequently, these rewards are employed to compute the CE in the planning phase. Two separate models of multi-agent Q-learning are proposed. If the success of only one agent is enough to make the team successful, then model I is employed. However, if an agent's success is contingent upon other agents and simultaneous success of the agents is required, then model II is employed. It is also shown that the CE obtained by the proposed algorithms and by traditional CQL are identical. In order to restrict the exploration to the feasible joint states, constrained versions of the said algorithms are also proposed. Complexity analysis and experiments have been undertaken to validate the performance of the proposed algorithms in multi-robot planning on both simulated and real platforms.

Chapter 5 hybridizes the Firefly Algorithm (FA) and the Imperialist Competitive Algorithm (ICA). This hybridization results in the Imperialist Competitive Firefly Algorithm (ICFA), which is employed to determine the time-optimal trajectory of a stick, carried by two robots, from a given starting position to a predefined goal position amidst static obstacles in a robot world map. The motion dynamics of the fireflies of the FA are embedded into the sociopolitical evolution-based metaheuristic ICA. Also, the trade-off between exploration and exploitation is balanced by modifying the random walk strategy based on the position of the candidate solutions in the search space. The superiority of the proposed ICFA is studied considering run time and accuracy as the performance metrics. Finally, the proposed algorithm is verified on a real-time multi-robot stick-carrying problem.

Chapter 6 concludes the book based on the analysis made and the experimental and simulation results obtained in the earlier chapters. The chapter also examines the prospects of the book in view of future research trends.

In summary, the book aims at developing multi-robot coordination algorithms with a minimum computational burden and smaller storage requirement compared to the traditional algorithms. The novelty, originality, and applicability of the book are illustrated below.

Chapter 1 introduces the fundamentals of multi-robot coordination.

Chapter 2 offers two useful properties, which have been developed to speed up the convergence of TMAQL algorithms in view of team goal exploration, where team goal exploration refers to the simultaneous exploration of the individual goals. The first property accelerates exploration of the team goal: each agent accumulates a high (immediate) reward for a team-goal state transition, thereby improving the entries in the Q-table for state transitions leading to the team goal. The Q-table thus obtained offers the team the additional benefit of identifying the joint action leading to a transition to the team goal during planning, where TMAQL-based planning otherwise stops inadvertently. The second property provides an alternative approach to speed up the convergence of TMAQL by
identifying the preferred joint action for the team. Finding the preferred joint action for the team is crucial when robots act synchronously in a tightly cooperative system. The superiority of the proposed algorithms in Chapter 2 is verified both theoretically and experimentally in terms of convergence speed and run-time complexity.

Chapter 3 proposes the novel CoQL, which addresses the equilibrium selection problem, in case multiple equilibria exist at a joint state, by adapting the Q-functions at a consensus. Analytically, it is shown that a consensus at a joint state is a coordination-type pure strategy NE as well as a pure strategy CE. Experimentally, it is shown that the average rewards earned by the robots are higher when adapting at a consensus than at either NE or CE.

Chapter 4 introduces a new dimension in the literature of traditional CQL. In traditional CQL, CE is evaluated in both the learning and planning phases. In Chapter 4, CE is computed partly in the learning phase and the rest in the planning phase, thereby requiring the CE computation only once. It is shown analytically that the CE obtained by the proposed techniques is the same as that obtained by the traditional CQL algorithms. In addition, the computational cost of evaluating CE by the proposed techniques is much smaller than that of traditional CQL algorithms for the following reason: computation of CE in traditional CQL requires consulting m Q-tables in joint state–action space for m robots, whereas in the present context a single Q-table in the joint state–action space is used for the evaluation of CE. The complexity analysis (both time and space complexity) undertaken here confirms this point. Two schemes are proposed: one for a loosely coupled and the other for a tightly coupled multi-robot system. Also, problem-specific constraints are taken care of in Chapter 4 to avoid unwanted exploration of the infeasible state space during the learning phase, thereby saving additional run-time complexity during the planning phase. Experiments are undertaken to validate the proposed concepts on simulated and practical multi-agent robotic platforms (here the Khepera environment).

Chapter 5 offers an evolutionary optimization approach to the multi-robot stick-carrying problem using the proposed ICFA. ICFA is the synergistic fusion of the motion dynamics of a firefly in the FA and the local exploration capabilities of the ICA. In ICA, an evolving colony is not guided by the experience of more powerful colonies within the same empire. In ICFA, however, each colony attempts to contribute to the improvement of its governing empire by improving its sociopolitical attributes following the motion dynamics of a firefly in the FA. To further improve the performance of the hybrid algorithm, the step size for the random movement of each firefly is modulated according to its relative position in the search space: an inferior solution is driven by the explorative force, while a qualitative solution is confined to its local neighborhood in the search space. The chapter also recommends a novel approach to evaluating the threshold value for uniting empires without imposing any serious computational overhead on the traditional ICA. Simulation and experimental results confirm the superiority of the proposed ICFA over state-of-the-art techniques.

Chapter 6 concludes the book with interesting future research directions.

Arup Kumar Sadhu
Amit Konar
Acknowledgments

The authors would sincerely like to thank Prof. Suranjan Das, the vice chancellor of Jadavpur University (JU), and Prof. Chiranjib Bhattacharjee and Dr. Pradip Kumar Ghosh, the pro vice chancellors of JU, Kolkata, for creating a beautiful and lively academic environment in which to carry out the scientific work and experiments for the present book. They would also like to acknowledge the technical and moral support they received from Prof. Sheli Sinha Chaudhuri, the HoD of the Department of Electronics and Tele Communication Engineering (ETCE), Jadavpur University, where the background research work for the present book was carried out.

Special thanks go to the reviewers of the authors' previous publications on the selected subject. Their suggestions helped a lot in developing the present book into its current shape.

The authors would like to thank their family members for supporting, in many ways, the successful completion of the book. The first author wishes to mention the everlasting support and words of optimism he received from his parents, Mrs. Purnima Sadhu and Mr. Prabhat Kumar Sadhu, without whose active support, love, and affection it would not have been possible to complete the book in its current form. He would like to acknowledge his deep gratitude to his elder sisters, Dr. Sucheta Sadhu and Mrs. Mithu Sadhu, who have nurtured him since his childhood and have always remained a source of inspiration in his life. The second author acknowledges the support he received from his family members for sparing him from many family responsibilities while writing this book.

The authors would like to thank their students, colleagues, and co-researchers of the AI Lab, Jadavpur University, for their support in many ways during the writing of the book. Finally, the authors thank all their well-wishers, who have contributed directly and indirectly toward the completion of the book.

Arup Kumar Sadhu
Amit Konar
Artificial Intelligence Laboratory
Department of Electronics and Telecommunication Engineering
Jadavpur University, Kolkata, India
12 April 2020
1 Introduction: Multi‐agent Coordination by Reinforcement Learning and Evolutionary Algorithms

This chapter provides an introduction to multi-agent coordination by reinforcement learning (RL) and evolutionary algorithms (EAs). A robot (agent) is an intelligent programmable device capable of performing complex tasks and making decisions like a human being. Mobility is part and parcel of modern robots. Mobile robots employ sensing–action cycles to sense the world around them with the aim of planning their journey to the desired destination. Coordination is an important issue in modern robotics. In recent times, researchers have taken a keen interest in synthesizing multi-agent coordination for complex real-world problems, including transportation of a box/stick, formation control for defense applications, and soccer playing by multiple robots, by utilizing the principles of RL, the theory of games (GT), dynamic programming (DP), and/or evolutionary optimization (EO) algorithms. This chapter provides a thorough survey of the existing RL literature, with a brief overview of EO, to examine the role of the algorithms in the context of multi-agent coordination. The study includes the classification of multi-agent coordination based on different criteria, such as the level of cooperation, knowledge sharing, communication, and the like. The chapter also covers multi-robot coordination employing EO, and especially RL, for cooperative, competitive, and mixed settings, with application to static and dynamic games. The latter part of the chapter gives an overview of the metrics used to compare the performance of coordination algorithms. Two fundamental metrics of performance analysis are defined, where the first is required to study the learning performance, while the other measures the performance of the planning algorithm. Conclusions are listed at the end of the chapter, along with possible explorations for future real-time applications.

1.1 Introduction

A robot is an intelligent and programmable manipulator, targeted at developing functionality similar to that of a living creature [1]. It can serve complex and/or repetitive tasks efficiently. Based on the ability of locomotion, robots are categorized into two basic types: fixed-base robots and mobile robots. Depending upon the type of locomotion, mobile robots are categorized into three types: wheeled/legged robots, winged/flying robots, and underwater robots, where for the last one locomotion is controlled by water thrust. In this chapter, we deal with wheeled robots only.

Agency is a commonly used term in modern robotics [1]. An agent is a piece of program/hardware that helps a robot serve a directed goal. As with humans, when the complexity of the problem grows, the collective intelligence of the agents is required to achieve the target. The book is about the collective/group behavior of agents that can sense and act rationally. On occasion, agents can share their sensory information or decisions with their teammates directly through a communication network or by displaying
their gestural/postural patterns, carrying a specific signature, to communicate a message to their team members. Communication is a vital issue in generating plans by the agents. However, communication is time-costly and thus is often disregarded in real-world robotic applications. In the present book, we attempt to learn the agents' behavioral patterns through a process of learning and thus avoid the communication overhead in real-time planning [1].

There exists quite a vast literature on planning algorithms [2–29]. One of the early robot planning algorithms is due to Nilsson, in connection with his research on reasoning-based planning undertaken at the Stanford AI research laboratory, which was later adopted in STRIPS [30–32]. From the late 1980s to the early 1990s, several planning algorithms evolved, including A* [31, 32], Voronoi diagrams [33], quad tree, and potential field [34]. These algorithms presume a static world. At the beginning of the 1990s, Michalewicz, in one of his renowned papers, introduced genetic operators to undertake dynamic planning with local adaptation of trial solutions by specialized mutation operators. The period 1990–2000 saw significant changes in planning algorithms with the introduction of supervised/unsupervised neural learning [32]. The neural algorithms worked in both static and dynamic environments. Typically, in dynamic environments, they predict the direction and speed of motion to determine possible avoidance of collisions. However, they had limited learned experience and thus were unable to handle planning in the presence of random motions of dynamic obstacles/persons in the environment. Around the beginning of the 1990s, Sutton proposed the RL algorithm [35], which can help the robot learn its environment through semi-supervised learning. We deal with multi-agent RL (MARL) in this chapter.

Planning and coordination are two closely related terms in multi-agent robotics [30]. While planning is concerned with determining the sequence of steps to achieve a goal, coordination refers to the skillful interaction among the agents to serve their individual short-run/long-run purposes. Apparently, coordination among the agents is required to implement the steps of planning. In centralized planning, the agents do not require coordination, as the central manager treats all the agents' states as if they were its own and generates a planning cycle by taking care of all the agents' states and goals jointly. Unfortunately, centralized planning is very slow, and a single point of failure may occur. Thus, centralized planning is not amenable to real-time applications when the number of agents is large. In distributed planning, each agent generates one step of planning by coordinating with the other active agents.

Coordination is broadly divided into two types: cooperation [36] and competition [37]. As the names indicate, cooperation requires agents to work hand in hand to purposefully serve the common objective of the team. Competition, on the other hand, leads to the success of one team against the failure of its opponent. For instance, in robot soccer, teammates work harmoniously in a cooperative manner, while each team of agents competes to win at the cost of the defeat of the other team. Researchers have taken a keen interest in modeling agent cooperation/competition with various models/tools. A few of these that need special mention include RL, GT [38–45], DP [46,
47], EO [48–56], and many others [6, 15–28, 57–59]. In RL, agents learn the most profitable joint action at each joint state through feedback from the environment and use these actions for subsequent planning applications [35]. GT is required for strategic analysis in the multi-agent domain: agents evaluate the equilibrium, representative of the most profitable joint action for the team in a joint state, and execute that joint action for the joint state transition in a loop until the joint goal is explored [38, 41–43, 60]. In DP [46], a complex problem is divided into finitely many overlapping subproblems. Each subproblem is solved by a DP algorithm and the solution is stored in a database. In subsequent iterations, if a subproblem that has already been addressed reappears, it is not readdressed; instead, its solution is retrieved from the database. In an EO algorithm [48, 61–70], the constraint required to satisfy cooperation is checked on the members of the trial solutions before the solutions are entertained for the next generation. Recently, researchers have aimed at developing MARL by fusing RL, DP, and GT [71, 72]. In this book, we explore new algorithms of MARL and novel EO.

1.2 Single Agent Planning

In single agent planning [5], an agent searches for the sequence of actions by which it reaches its predefined goal state from a given state optimally in terms of a predefined performance metric. This section describes single agent planning terminologies and algorithms. Here, single agent planning algorithms include search-based and learning-based planning algorithms.

1.2.1 Terminologies Used in Single Agent Planning

Definition 1.1 An agent [1] is a mathematical entity that acts on its environment and senses the changes in the environment due to its action. The agent is realized by hardware/software means. A hardwired agent has an actuator (motors/levers) and a sensor to serve the purposes of actuation and sensing, respectively. A learning agent learns its right action at a given location/grid, called a state, from its sensory–action doublets. A planning agent identifies its best action at its current state to obtain the maximum reward for its action in the given environment. In a single agent system, the environment includes a single agent. Naturally, the learning/planning steps/moves of the agent are undisturbed by the environment. Figure 1.1 offers the architecture of a single agent system.
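As an informal illustration of these notions, the sketch below sets up a tiny deterministic grid world and a search-based planning agent that, starting from a sensed current state, returns a sequence of actions reaching the goal (here via breadth-first search). The map size, obstacle cell, and all names are illustrative assumptions, not the formalism developed later in the chapter.

```python
from collections import deque

# A 3 x 3 deterministic grid world; states are (row, col) cells.
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}
ROWS, COLS = 3, 3
OBSTACLES = {(1, 1)}                       # illustrative obstacle cell

def next_state(state, action):
    """Deterministic state transition; invalid moves leave the state unchanged."""
    r, c = state
    dr, dc = MOVES[action]
    nr, nc = r + dr, c + dc
    if 0 <= nr < ROWS and 0 <= nc < COLS and (nr, nc) not in OBSTACLES:
        return (nr, nc)
    return state

def plan(start, goal):
    """Search-based single agent planning: breadth-first search for the
    shortest action sequence from start to goal."""
    frontier = deque([(start, [])])
    visited = {start}
    while frontier:
        state, actions = frontier.popleft()
        if state == goal:
            return actions
        for action in MOVES:
            nxt = next_state(state, action)
            if nxt not in visited:
                visited.add(nxt)
                frontier.append((nxt, actions + [action]))
    return None                            # goal unreachable

print(plan(start=(0, 0), goal=(2, 2)))     # e.g. ['D', 'D', 'R', 'R']
```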