How computers can learn better



Reinforcement learning is a technique, common in computer science, in which a computer system learns how best to solve some problem through trial and error. Classic applications of reinforcement learning involve problems as diverse as robot navigation, surveillance, and automated network administration.

This summer, at the annual conference of the Association for Uncertainty in Artificial Intelligence, researchers from MIT's Laboratory for Information and Decision Systems (LIDS) and Computer Science and Artificial Intelligence Laboratory will present a new reinforcement-learning algorithm that, for a broad range of problems, allows computer systems to find solutions much more efficiently than previous algorithms did.


The paper also represents the first application of a new programming framework the researchers developed, which makes it much easier to set up and run reinforcement-learning experiments. Alborz Geramifard, a postdoc at LIDS and first author on the new paper, hopes that the software, dubbed RLPy (for reinforcement learning and Python, the programming language it uses), will enable researchers to more efficiently test new algorithms and compare the performance of different algorithms on different tasks. It could also be a useful tool for teaching computer science students the principles of reinforcement learning.


Geramifard developed RLPy with Robert Klein, a master's student in MIT's Department of Aeronautics and Astronautics. RLPy and its source code were released online in April.


Any reinforcement-learning experiment involves what is called an agent, which in artificial-intelligence research is often a computer system being trained to perform some task. The agent might be a robot learning to navigate its environment, or a software agent learning to automatically manage a computer network. The agent has reliable information about the current state of some system: the robot might know where it is in a room, while the network administrator might know which computers on the network are running and which have shut down. But there is some information the agent lacks: what obstacles the room contains, for instance, or how computational tasks are divided among the computers.
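To make that distinction concrete, here is a minimal Python sketch, not drawn from RLPy or the paper, of what an agent can and cannot observe in the network example; the 20-machine layout and the dictionary encoding are illustrative assumptions.

```python
# Illustrative sketch of an agent's observable state in the network example.
# The machine count and "up"/"down" encoding are assumptions for this sketch.

observed_state = {machine: "up" for machine in range(1, 21)}  # 20 machines
observed_state[12] = "down"
observed_state[17] = "down"

# Hidden from the agent: how computational tasks are divided among machines.
# The agent must choose its actions from `observed_state` alone.
print(sum(1 for status in observed_state.values() if status == "down"))  # 2
```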


Finally, the experiment involves a "reward function," a quantitative measure of the progress the agent is making on its task. That measure can be positive or negative: the network administrator, for instance, might be rewarded for every computer that is up and running but penalized for every computer that crashes.
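A reward function for that example might look like the following sketch; the plus-one and minus-one values are illustrative assumptions, not the paper's choices.

```python
def reward(state):
    """Score progress: +1 per machine that is up, -1 per machine that is down.
    The magnitudes are arbitrary; only the relative values matter here."""
    return sum(1 if status == "up" else -1 for status in state.values())

state = {machine: "up" for machine in range(1, 21)}  # 20-machine network
state[12] = state[17] = "down"                       # two failures
print(reward(state))                                 # 18 up - 2 down = 16
```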


The goal of the experiment is for the agent to learn a set of policies that will maximize its reward, given any state of the system. Part of that process is evaluating each new policy in as many system states as possible. But exhaustively canvassing all of a system's states can be hugely time consuming.
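The cost of that exhaustive canvass is easy to see in a sketch like the one below; `simulate` and `reboot_all` are hypothetical stand-ins invented for this illustration, not anything from the paper or RLPy.

```python
from itertools import product

def simulate(policy, state):
    # Dummy measure: count the machines the policy leaves running.
    return sum(1 for status in policy(state).values() if status == "up")

def reboot_all(state):
    # Toy policy: reboot every machine.
    return {machine: "up" for machine in state}

def evaluate_exhaustively(policy, n_machines):
    # Visit every possible up/down assignment: 2**n_machines states.
    total = 0
    for combo in product(["up", "down"], repeat=n_machines):
        state = dict(enumerate(combo, start=1))
        total += simulate(policy, state)
    return total

print(evaluate_exhaustively(reboot_all, n_machines=10))  # 1,024 states already
```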


Consider, for instance, the problem of network administration. Suppose that the administrator has observed that, in many cases, rebooting a few computers restored the whole network. Is that a generally applicable solution?


One way to answer that question would be to evaluate every possible state of network failure. But even for a network of just 20 machines, each of which has only two possible states (up or down), that means more than a million possibilities to canvass.
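That figure comes directly from the arithmetic:

```python
print(2 ** 20)  # 1048576: each of 20 machines is up or down independently
```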


Faced with such a combinatorial explosion, a standard approach in reinforcement learning is to try to identify a set of "features" that stand in for a much larger number of states. For instance, it may be that when computers 12 and 17 are down, it rarely matters how many other computers have failed: a particular policy will almost always restore the network. The failure of 12 and 17 thus stands in for the failure of 12, 17 and 1; of 12, 17, 1 and 2; of 12, 17 and 2; and so on.
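Here is that idea in a Python sketch, using the article's 12-and-17 example; encoding the feature as a Boolean function is an assumption made for illustration.

```python
def machines_12_and_17_down(state):
    # One feature standing in for many states: true whenever both machines
    # have failed, no matter what the other machines are doing.
    return state.get(12) == "down" and state.get(17) == "down"

# Distinct failure states that all activate the same feature:
for failed in [{12, 17}, {12, 17, 1}, {12, 17, 1, 2}, {12, 17, 2}]:
    state = {m: ("down" if m in failed else "up") for m in range(1, 21)}
    assert machines_12_and_17_down(state)
print("one feature covers all four states")
```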


Geramifard, together with Jonathan How, the Richard Cockburn Maclaurin Professor of Aeronautics and Astronautics; Thomas Walsh, a postdoc in How's lab; and Nicholas Roy, a professor of aeronautics and astronautics, developed a new technique for identifying relevant features in reinforcement-learning tasks. The algorithm first builds a data structure known as a tree (somewhat like a family-tree diagram) that represents different combinations of features. In the case of the network problem, the top layer of the tree would be individual machines; the next layer would be combinations of two machines; the third layer would be combinations of three machines; and so on.
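The layered structure can be sketched in a few lines of Python. This shows only the data structure, not the authors' algorithm, and it builds every layer up front for clarity; a practical implementation would grow the tree lazily.

```python
from itertools import combinations

def combination_tree(machines, depth):
    # Layer k holds every combination of k machines. Building all layers
    # eagerly is for illustration only; growing the tree on demand is what
    # keeps the approach tractable in practice.
    return {k: list(combinations(machines, k)) for k in range(1, depth + 1)}

tree = combination_tree(range(1, 6), depth=3)  # tiny 5-machine network
print(tree[1])  # top layer: individual machines
print(tree[2])  # second layer: pairs
print(tree[3])  # third layer: triples
```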


The algorithm then works through the tree, determining which combinations of features predict a policy's success or failure. The relatively simple key to its efficiency is that, when it notices that certain combinations consistently produce the same result, it stops exploring them. For example, if it notices that the same policy seems to work whenever machines 12 and 17 have failed, it stops considering combinations that include 12 and 17 and starts looking elsewhere.
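A simplified sketch of that stopping rule follows; the outcome log and the expansion test are illustrative simplifications of the behavior described above, not the paper's actual procedure.

```python
def should_expand(combo, outcome_log):
    # Keep exploring a combination only while the policy's observed results
    # for states matching it still disagree; once they are consistent, stop.
    results = outcome_log.get(combo, [])
    return len(set(results)) > 1

outcome_log = {
    (12, 17): ["success", "success", "success"],  # consistent: stop here
    (3,): ["success", "failure"],                 # mixed: expand to (3, x)
}
print(should_expand((12, 17), outcome_log))  # False: skip supersets of 12, 17
print(should_expand((3,), outcome_log))      # True: keep looking
```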


Geramifard believes that this approach captures something about how humans learn to perform new tasks. "If you teach a small child what a horse is, at first he may think that anything with four legs is a horse," he says. "But when you show him a cow, he learns to look for a different feature, such as horns." Similarly, Geramifard explains, the new algorithm identifies an initial feature on which to base decisions, then searches for additional features that can refine that initial judgment.


RLPy allowed the researchers to quickly test their new algorithm against a host of others. "Think of it as a set of Legos," Geramifard says. "You can snap one module out and snap another in its place."


In particular, RLPy comes with a number of standard modules that represent different machine-learning algorithms; different problems (such as the network-administration problem and some standard control-theory problems that involve balancing pendulums); different techniques for modeling the computer system's environment; and different types of agents.
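The plug-and-play design might look something like the hypothetical sketch below; the class names are invented for this illustration and are not RLPy's actual API.

```python
# Hypothetical illustration of modular experiment assembly; these names are
# made up for this sketch and do not reflect RLPy's real module structure.

class NetworkAdminProblem:      # a "problem" module
    pass

class PendulumProblem:          # another interchangeable problem module
    pass

class Experiment:               # wires a problem to an agent
    def __init__(self, problem, agent_name):
        self.problem, self.agent_name = problem, agent_name

# Snapping one module out and another in, without touching the rest:
exp = Experiment(NetworkAdminProblem(), agent_name="q-learning")
exp = Experiment(PendulumProblem(), agent_name="q-learning")
```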


It also allows anyone familiar with the Python programming language to build new modules; they just have to be able to connect with existing modules in prescribed ways. In computer simulations, Geramifard and his colleagues found that their new algorithm evaluated policies more efficiently than its predecessors did, producing more reliable predictions in a fifth of the time.


RLPy can be used to run experiments that involve computer simulations, such as those the MIT researchers used to evaluate their algorithm, but it can also be used to run experiments that collect data from real-world interactions. In one ongoing project, for example, Geramifard and his colleagues plan to use RLPy in an experiment involving a robotic vehicle learning to navigate its environment. In the project's early stages, however, they are using simulations to give the vehicle a reasonably good starting policy. "While it's learning, you don't want to run it into a wall and destroy the equipment," he says.

