Microsoft researchers develop a game-theoretic approach to provably correct and scalable offline reinforcement learning

Although machine learning now touches most fields, much of the decision-making automation in deployed systems is still designed by human experts rather than learned by artificial intelligence. For example, decision strategies in robotics and in other applications whose actions have long-term consequences are typically hand-crafted by experienced human engineers.

In reinforcement learning (RL), online agents learn by trial and error: they try different actions, observe the consequences, and improve their behavior accordingly. Along the way they inevitably make suboptimal decisions and learn from them. That is acceptable in many settings, but not in every task; in self-driving cars, for example, exploratory mistakes are not an option.
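To make that trial-and-error loop concrete, here is a minimal sketch assuming the gymnasium package; the random action is just a placeholder for whatever the learning policy would choose.

```python
# Minimal sketch of the online RL loop: act, observe the consequence, improve.
# Assumes the `gymnasium` package; the random action stands in for a learned policy.
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset()
for step in range(1000):
    action = env.action_space.sample()  # placeholder: a learning policy would choose here
    obs, reward, terminated, truncated, info = env.step(action)
    # an online agent would update its policy from (obs, action, reward) at this point
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```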

Offline RL, by contrast, learns policies from large static datasets that were collected beforehand. It neither gathers new data online nor interacts with a simulator, which gives it excellent potential for large-scale real-world deployment.
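The contrast with the online loop above can be sketched as follows: an offline learner only ever samples from a fixed log of transitions and never calls the environment. The dataset fields and the update stub below are illustrative, not any particular algorithm.

```python
# Schematic offline RL loop: no env.step() anywhere, only a static dataset
# of previously collected (s, a, r, s') transitions.
import numpy as np

rng = np.random.default_rng(0)
dataset = {
    "obs":      rng.normal(size=(10_000, 4)),
    "actions":  rng.integers(0, 2, size=10_000),
    "rewards":  rng.normal(size=10_000),
    "next_obs": rng.normal(size=(10_000, 4)),
}

def update_policy(batch):
    # stand-in for any offline RL update (e.g., a pessimistic actor-critic step)
    pass

for _ in range(100):
    idx = rng.integers(0, len(dataset["rewards"]), size=256)
    batch = {k: v[idx] for k, v in dataset.items()}
    update_policy(batch)  # learning uses only the logged data
```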

The challenge facing offline RL

The most fundamental challenge in offline RL is that the collected data lack diversity: they cover only a fraction of the situations a policy may encounter, which makes it difficult to assess how good a candidate policy would actually be in the real world. Simply making the dataset more diverse is not realistic, because it would require humans to run experiments that no one would sanction; for a self-driving car, that could mean deliberately staging an accident. As a result, even data collected in large quantities lack coverage, which limits how they can be used.

Microsoft researchers address this problem by introducing a generic game-theoretic framework for offline RL. They cast offline RL as a two-player game between the learning agent and an adversary that simulates the uncertain decision outcomes arising from the lack of data coverage. They also show that this framework provides a natural link between offline RL and imitation learning, in the spirit of generative adversarial networks, and that policies learned within it are provably no worse than the data collection policies. Existing data can therefore be used robustly to learn policies that improve on the human strategies already running in the system. Because not all possible outcomes appear in the data, the agent must deliberately account for the uncertainty caused by missing data: before committing to a choice, it should weigh all outcomes consistent with the data rather than fixating on one particular data-consistent outcome. This kind of purposeful conservatism is especially crucial when the agent's choices can have adverse effects.
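Below is a highly simplified PyTorch sketch of what such a two-player training loop can look like. It is our own illustrative code, not the authors' Adversarially Trained Actor Critic implementation; the network sizes, the pessimism weight `beta`, and the discrete-action setup are assumptions made for brevity.

```python
# Sketch of the two-player structure: the adversary (critic f) is pushed toward
# the most pessimistic value function still consistent with the data (small
# Bellman error), while the learner (actor pi) maximizes against it.
import torch
import torch.nn as nn

obs_dim, act_dim, beta, gamma = 4, 2, 1.0, 0.99
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Softmax(dim=-1))
critic_opt = torch.optim.Adam(critic.parameters(), lr=3e-4)
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)

def q_value(obs, act_onehot):
    # adversary's value estimate f(s, a)
    return critic(torch.cat([obs, act_onehot], dim=-1)).squeeze(-1)

def train_step(obs, act, rew, next_obs):
    act_onehot = nn.functional.one_hot(act, act_dim).float()

    # Adversary / critic step: be pessimistic about the learner's own actions
    # while staying consistent with the observed transitions (Bellman error).
    with torch.no_grad():
        pi_act = nn.functional.one_hot(actor(obs).argmax(-1), act_dim).float()
        next_pi = nn.functional.one_hot(actor(next_obs).argmax(-1), act_dim).float()
        target = rew + gamma * q_value(next_obs, next_pi)
    pessimism = q_value(obs, pi_act).mean() - q_value(obs, act_onehot).mean()
    bellman = ((q_value(obs, act_onehot) - target) ** 2).mean()
    critic_loss = pessimism + beta * bellman
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Learner / actor step: maximize value under the adversarially trained critic.
    probs = actor(obs)
    q_all = torch.stack(
        [q_value(obs, nn.functional.one_hot(torch.full_like(act, a), act_dim).float())
         for a in range(act_dim)], dim=-1)
    actor_loss = -(probs * q_all.detach()).sum(dim=-1).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

# usage with dummy offline data
obs = torch.randn(256, obs_dim)
act = torch.randint(0, act_dim, (256,))
rew = torch.randn(256)
next_obs = torch.randn(256, obs_dim)
train_step(obs, act, rew, next_obs)
```

In this sketch the `beta` coefficient trades off how strongly the critic must fit the data against how pessimistic it is allowed to be, which is the knob controlling the conservatism discussed above.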

This is formalized through the mathematical concept of a version space: the set of all hypotheses (for instance, value functions) that remain consistent with the observed data. You can read further in the sources below.
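As a rough sketch in our own notation (not the paper's exact formulation), the version space collects every value function that fits the dataset up to a tolerance, and the learner optimizes against the worst member of that set:

```latex
% Sketch of the version-space idea (our notation, not the paper's).
% D is the offline dataset; \widehat{\mathcal{E}}_D(f) is an empirical
% Bellman-consistency error of a candidate value function f on D.
\[
  \mathcal{V}_{\varepsilon}
    = \bigl\{\, f : \widehat{\mathcal{E}}_{D}(f) \le \varepsilon \,\bigr\},
  \qquad
  \widehat{\pi} \in \arg\max_{\pi} \; \min_{f \in \mathcal{V}_{\varepsilon}}
    \; \mathbb{E}_{s \sim D}\!\left[ f\bigl(s, \pi(s)\bigr) \right].
\]
```

The inner minimum is the adversary: since the data cannot rule out any function in the version space, the learner only gets credit for value that every data-consistent hypothesis agrees on, which is exactly the purposeful conservatism described above.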

This article is written as a research summary by Marktechpost staff based on the research paper 'Adversarially Trained Actor Critic for Offline Reinforcement Learning'. All credit for this research goes to the researchers on this project. Check out paper1, paper2 and the reference article.

Prathvik is an ML/AI research content intern at Marktechpost and a third-year undergraduate at IIT Kharagpur. He has a keen interest in machine learning and data science and is excited to learn about their applications in various fields of study.
