
Generalized Pursuit Learning Schemes: New Families of Continuous and Discretized Learning Automata^{1}

Mariana Agache and B. John Oommen^{2}

Abstract

The fastest Learning Automata (LA) algorithms currently available fall in the family of Estimator Algorithms introduced by Thathachar and Sastry [0]. The pioneering work of these authors was the Pursuit Algorithm, which pursues only the current estimated optimal action. If this action is not the one with the minimum penalty probability, the algorithm pursues a wrong action. In this paper, we argue that a Pursuit scheme that generalizes the traditional Pursuit Algorithm, by pursuing all the actions with higher reward estimates than the chosen action, minimizes the probability of pursuing a wrong action and converges faster. To attest to this, we present two new generalized Pursuit algorithms, together with a quantitative comparison of their performance against the existing Pursuit algorithms. Empirically, the algorithms proposed here are among the fastest LA reported to date.

Keywords: Learning Automata, Estimator Algorithms, Pursuit Algorithms

I. Introduction

The goal of a Learning Automaton (LA) is to determine the optimal action out of a set of allowable actions, where the optimal action is defined as the action that maximizes the probability of being rewarded. The fastest LA algorithms to date fall within the family of the so-called Estimator Algorithms. Within this family, the pioneering work of Thathachar and Sastry [0] was the Pursuit Algorithm, which pursues only the current estimated optimal action. If this action is not the one with the minimum penalty probability, the algorithm pursues a wrong action. In this paper, we present various generalizations of the above philosophy.
Quite simply stated: we demonstrate that a Pursuit scheme which generalizes the traditional Pursuit algorithm, by pursuing all the actions with higher reward estimates than the chosen action, minimizes the probability of pursuing a wrong action, thus leading to a faster converging scheme. Based on this fundamental premise, we devise new pursuit LA which, empirically, are probably the fastest and most accurate LA reported to date.

I.1. Fundamentals of Learning Automata

The functionality of the learning automaton can be described in terms of a sequence of repetitive feedback cycles in which the automaton interacts with the environment. During a cycle, the automaton chooses an action, which triggers a response from the environment; the response can be either a reward or a penalty. The automaton uses this response, together with the knowledge acquired in past cycles, to determine its next action. By learning to choose the optimal action, the automaton adapts itself to the environment. Excellent books that survey the field are those by Lakshmivarahan [2], Narendra and Thathachar [9], and Najim and Poznyak [31]. The learning paradigm as modeled by LA has found applications in systems that possess incomplete knowledge about the environment in which they operate. A variety of applications^{3} that use LA have been reported in the literature. They have been used in game playing [0], [0], [0], pattern recognition [0], [0], object partitioning [0], [0], parameter optimization [9], [28], [35], [54] and multi-objective analysis [43], telephony routing [0], [0], and priority assignments in a queuing system [0].
They have also been used in statistical decision making [9], [31], distribution approximation [40], natural language processing, modeling biological learning systems [26], string taxonomy [56], graph partitioning [57], distributed scheduling [39], network protocols (including conflict avoidance [44]) for LANs [36], photonic LANs [45], star networks [37] and broadcast communication systems [38], dynamic channel allocation [38], tuning PID controllers [30], assigning capacities in prioritized networks [32], map learning [41], digital filter design [42], controlling client/server systems [46], adaptive signal processing [49], vehicle path control [50], and even the control of power systems [51] and vehicle suspension systems [52]. The beauty of incorporating LA in any particular application domain is, indeed, the elegance of the technology. Essentially, LA are utilized exactly as one would expect: by interacting with the “Environment”, which is the system from which the LA learns. For example, in parameter optimization the LA chooses a parameter, observes the effect of the parameter in the control loop, and then updates the parameter to optimize the objective function, where this updating is essentially achieved by the LA's stochastic updating rule. The details of how this is achieved in the various application domains involve modeling the "actions" and transforming the system's outputs so that they are perceived to be of a reward or penalty flavor. This is where the ingenuity of the researcher comes into the picture; it is often a thought-provoking task. Although we have interests in the application of LA, this paper is intended to be of a theoretical sort. Thus, our discussions on the applications of LA are necessarily brief! In the first LA designs, the transition and the output functions were time invariant, and for this reason these LA were considered “fixed structure” automata.
Tsetlin, Krylov, and Krinsky [0], [0] presented notable examples of this type of automata. Later, in [27], Vorontsova and Varshavskii introduced a class of stochastic automata known in the literature as Variable Structure Stochastic Automata (VSSA). In the definition of a Variable Structure Stochastic Automaton (VSSA), the LA is completely defined by a set of actions (one of which is the output of the automaton), a set of inputs (which is usually the response of the environment), and a learning algorithm, T. The learning algorithm [0], [0], [0] operates on a vector (called the Action Probability vector) P(t) = [p_{1}(t),…,p_{r}(t)]^{T}, where p_{i}(t) (i = 1,…,r) is the probability that the automaton will select the action α_{i} at time t: p_{i}(t) = Pr[α(t) = α_{i}], i = 1,…,r, and it satisfies Σ_{i=1}^{r} p_{i}(t) = 1, for all ‘t’. Note that the algorithm T : [0,1]^{r} × A × B → [0,1]^{r} is an updating scheme, where A = {α_{1}, α_{2},…,α_{r}}, 2 ≤ r < ∞, is the set of output actions of the automaton, and B is the set of responses from the environment. Thus, the updating is such that

P(t+1) = T(P(t), α(t), β(t)), (1)

where α(t) is the action chosen at time t, and β(t) is the response that the LA receives from the Environment. These terms will be formalized presently, but the notation is well-acclaimed in the LA literature. If the mapping T is chosen in such a manner that the Markov process has absorbing states, the algorithm is referred to as an absorbing algorithm. Families of VSSA that possess absorbing barriers have been studied in [0], [0], [9], [0]. Ergodic VSSA have also been investigated [0], [9], [0], [0]. These VSSA converge in distribution, and thus the asymptotic distribution of the action probability vector has a value that is independent of the corresponding initial vector. Thus, while ergodic VSSA are suitable for non-stationary environments, automata with absorbing barriers are preferred in stationary environments. In practice, the relatively slow rate of convergence of these algorithms constituted a limiting factor in their applicability.
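As a concrete illustration of the VSSA updating rule P(t+1) = T(P(t), α(t), β(t)), the following minimal sketch implements the classical linear Reward-Inaction (L_{R-I}) scheme. The function names and the learning rate λ (`lam`) are ours, and we adopt the common P-model convention that β = 0 denotes a reward:

```python
import random

def l_ri_update(p, chosen, beta, lam=0.1):
    """Linear Reward-Inaction (L_RI) update: on a reward (beta == 0),
    move a fraction lam of the probability mass toward the chosen action;
    on a penalty (beta == 1), leave P(t) unchanged.
    p is the action probability vector P(t); returns P(t+1)."""
    if beta == 0:  # reward
        p = [(1 - lam) * pi for pi in p]
        p[chosen] += lam
        return p
    return list(p)  # penalty: P(t+1) = P(t)

def choose_action(p):
    """Sample an action index alpha(t) according to the vector P(t)."""
    r, acc = random.random(), 0.0
    for i, pi in enumerate(p):
        acc += pi
        if r < acc:
            return i
    return len(p) - 1
```

Note that the update preserves Σ p_i = 1, and that repeated rewards for one action drive its probability toward the absorbing value 1, which is why such schemes are absorbing rather than ergodic.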
In order to increase their speed of convergence, the concept of discretizing the probability space was introduced in [0]. This concept is implemented by restricting the probability of choosing an action to a finite number of values in the interval [0,1]. If the values allowed are equally spaced in this interval, the discretization is said to be linear; otherwise, the discretization is called non-linear. Following the discretization concept, many of the continuous VSSA have been discretized; indeed, various discrete automata have been presented in the literature [0], [0]. In the quest to design faster converging learning algorithms, Thathachar and Sastry [0] opened another avenue by introducing a new class of algorithms, called “Estimator” Algorithms. The main feature of these algorithms is that they maintain running estimates for the reward probability of each possible action, and use them in the probability updating equations. Typically, in the first step of the functional cycle, the automaton chooses an action and the environment generates a response to this action. Based on this response, the Estimator Algorithm updates the estimate of the reward probability for that action. The change in the action probability vector is based on both the running estimates of the reward probabilities and the feedback received from the environment. Thus, in essence, within the family of estimator algorithms, the LA's updating rule is of the form

Q(t+1) = T(Q(t), α(t), β(t)),

where Q(t) is the pair <P(t), d̂(t)>, and d̂(t) is the vector of estimates of the reward probabilities d. A detailed catalogue of the estimator algorithms can be found in [0], [0], [0], [0], [0], [33], [34] and [47]. Historically, as mentioned earlier, these algorithms were pioneered by Thathachar and Sastry, who introduced the continuous versions. Oommen and his co-authors [5], [16] discretized them, thus enhancing them both with regard to speed and accuracy.
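The estimator functional cycle just described, with the updating rule Q(t+1) = T(Q(t), α(t), β(t)), can be sketched as follows. This is a minimal illustration of a continuous Pursuit automaton in the spirit of Thathachar and Sastry's scheme, not the exact algorithm of any cited paper; the class name, the learning rate `lam`, and the boolean reward convention are our own:

```python
class PursuitLA:
    """Sketch of a continuous Pursuit estimator automaton.  The state Q(t)
    is the pair <P(t), d_hat(t)>, where d_hat is maintained here via
    reward/selection counters for each action."""

    def __init__(self, r, lam=0.05):
        self.lam = lam
        self.p = [1.0 / r] * r   # action probability vector P(t)
        self.rewards = [0] * r   # rewards received for each action
        self.counts = [0] * r    # times each action was chosen

    def step(self, chosen, rewarded):
        """One feedback cycle: fold the environment's response into the
        running estimate for `chosen`, then pursue the estimated-best
        action: P(t+1) = (1 - lam) * P(t) + lam * e_m, m = argmax d_hat."""
        self.counts[chosen] += 1
        self.rewards[chosen] += 1 if rewarded else 0
        d_hat = [w / z if z else 0.0
                 for w, z in zip(self.rewards, self.counts)]
        m = max(range(len(d_hat)), key=d_hat.__getitem__)
        self.p = [(1 - self.lam) * pi + (self.lam if i == m else 0.0)
                  for i, pi in enumerate(self.p)]
```

The key point, visible in `step`, is that the probability vector pursues only the single unit vector e_m of the currently estimated-best action; this is precisely the behavior the present paper generalizes.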
Papadimitriou [33], [34] introduced the concept of stochastic estimator algorithms and recommended the use of windows for the estimates. He also used the non-linear discretized version in a brilliant hierarchical manner for dealing with scenarios with a large number of actions. The most recent work in this field is the work that we have done in [20], which will be explained in greater detail presently. Most of the analysis has centered around the optimality of the corresponding LA, although the LA described in [34] have been proven to possess only the property of absolute expediency. However, we commend the excellent work of Rajaraman and Sastry [47], which is the only known finite-time analysis for any of the estimator algorithms.

I.2. Contribution of this paper

As mentioned above, Pursuit algorithms are a subset of the Estimator algorithms. The existing Pursuit algorithms are characterized by the fact that the action probability vector ‘pursues’ the action that is currently estimated to be the optimal action. This is achieved by increasing the probability of the action whose current estimate of being rewarded is maximal [0], [0]. This implies that if, at any time ‘t’, the action that has the maximum reward estimate is not the action that has the minimum penalty probability, then the automaton pursues a wrong action. Our primary goal here is to minimize this probability of pursuing an erroneous action. It is pertinent to mention that minimizing this probability of erroneous pursuit leads to both faster and more accurate schemes, as we shall see. Furthermore, such a pursuit also leads to a new class of pseudo-discretized Pursuit automata that are distinct from all the previously reported discretized LA. These contributions will be clear in the subsequent sections. We achieve this goal by generalizing the design of the Pursuit algorithm, permitting it to pursue a set of actions^{4} instead of a single "current-best" action.
Specifically, the set of actions pursued are those actions that have higher reward estimates than the currently chosen action. In this paper, we introduce two new Pursuit algorithms, a continuous and a discretized version, that use such a generalized "pursuit" learning approach. Apart from proving their learning properties, we also demonstrate their superiority over traditional pursuit and estimator schemes by means of extensive simulation results. The new algorithms presented here are among the best reported, and we believe they represent the state of the art!
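As a rough sketch of the generalized pursuit idea advocated above (moving probability mass toward all actions whose reward estimates exceed that of the chosen action), one plausible updating rule distributes the pursuit increment equally among those actions. We stress that this is only an illustrative reading of the idea; the exact updating equations of the two new algorithms are developed later in the paper, and the function name and the parameter `lam` are ours:

```python
def generalized_pursuit_update(p, d_hat, chosen, lam=0.05):
    """Illustrative generalized pursuit step: increase the probabilities of
    ALL actions whose reward estimates d_hat[j] exceed that of the chosen
    action, sharing the pursuit increment lam equally among them, and let
    the chosen action absorb the remainder so that the vector sums to 1."""
    better = [j for j in range(len(p)) if d_hat[j] > d_hat[chosen]]
    k = len(better)
    if k == 0:  # the chosen action is already estimated best
        return list(p)
    new_p = list(p)
    for j in range(len(p)):
        if j in better:
            new_p[j] = (1 - lam) * p[j] + lam / k  # share the increment
        elif j != chosen:
            new_p[j] = (1 - lam) * p[j]
    # Renormalize via the chosen action (its new value stays non-negative).
    new_p[chosen] = 1.0 - sum(v for j, v in enumerate(new_p) if j != chosen)
    return new_p
```

Compared with the classical Pursuit update, which moves mass toward a single (possibly wrongly) estimated-best action, this rule rewards every action currently estimated to be better than the chosen one, which is exactly the mechanism argued above to reduce the probability of pursuing a wrong action.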