Safety is still one of the major research challenges in reinforcement learning (RL). In this paper, we address the problem of avoiding safety violations of RL agents during exploration in probabilistic and partially unknown environments. Our approach combines automata learning for Markov Decision Processes (MDPs) and shield synthesis in an iterative procedure.

Initially, the MDP representing the environment is unknown. The agent starts exploring the environment and collects traces. From the collected traces, we passively learn MDPs that abstractly represent the safety-relevant aspects of the environment. Given a learned MDP and a safety specification, we construct a shield. For each state-action pair in the learned MDP, the shield computes the exact probability that executing the action from the current state leads to a violation of the specification within the next k steps. At runtime, the shield blocks any action of the agent that induces too large a risk.

The shielded agent continues to explore the environment and collects new data. Iteratively, we use the collected data to learn MDPs with higher accuracy, which in turn yield shields able to prevent more safety violations.
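The passive learning step can be illustrated with a toy frequency estimator: transition probabilities are estimated from the relative frequencies of observed successor states in the collected traces. This is only a minimal sketch in the spirit of passive MDP learning; the function name, trace format, and the estimator itself are illustrative assumptions, not the paper's actual learning algorithm.

```python
from collections import Counter, defaultdict

def estimate_mdp(traces):
    """Estimate an MDP from traces by relative frequencies.

    traces: lists of (state, action, next_state) steps.
    Returns transitions[s][a] = {s': estimated probability}.
    """
    counts = defaultdict(Counter)
    for trace in traces:
        for s, a, s_next in trace:
            counts[(s, a)][s_next] += 1
    mdp = defaultdict(dict)
    for (s, a), successors in counts.items():
        total = sum(successors.values())
        mdp[s][a] = {t: n / total for t, n in successors.items()}
    return mdp

# Hypothetical traces: from "s0", action "left" stayed in "s0" twice
# and reached the unsafe state "bad" once.
traces = [
    [("s0", "left", "s0"), ("s0", "left", "bad")],
    [("s0", "left", "s0"), ("s0", "right", "s1")],
]
model = estimate_mdp(traces)
# model["s0"]["left"] is {"s0": 2/3, "bad": 1/3}
```

As more traces are collected over the iterations, these frequency estimates converge toward the true transition probabilities, which is what makes the learned shields progressively more accurate.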
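The shield's risk computation can be sketched as a bounded backward recursion over the learned MDP: the probability of reaching an unsafe state within k steps when taking a given action now, assuming the safest available behavior afterwards. The toy MDP, the risk threshold, and all names here are assumptions for illustration; the paper's shields are computed by exact probabilistic model checking, not this hand-rolled recursion.

```python
# Toy learned MDP: transitions[s][a] = list of (next_state, probability).
# "bad" is the only unsafe state; these numbers are made up for illustration.
transitions = {
    "s0": {"left": [("s0", 0.9), ("bad", 0.1)], "right": [("s1", 1.0)]},
    "s1": {"left": [("s0", 1.0)], "right": [("bad", 0.5), ("s1", 0.5)]},
    "bad": {},
}
BAD = {"bad"}

def violation_prob(state, action, k):
    """Probability of reaching a BAD state within k steps when taking
    `action` now and behaving as safely as possible afterwards."""
    def reach(s, steps):
        if s in BAD:
            return 1.0
        if steps == 0 or not transitions[s]:
            return 0.0
        # Optimistic continuation: the agent picks the safest action later.
        return min(
            sum(p * reach(t, steps - 1) for t, p in succ)
            for succ in transitions[s].values()
        )
    return sum(p * reach(t, k - 1) for t, p in transitions[state][action])

def shield(state, actions, k=3, threshold=0.05):
    """Allow only actions whose k-step violation probability is at most
    the threshold; everything else is blocked at runtime."""
    return [a for a in actions if violation_prob(state, a, k) <= threshold]
```

For example, from "s0" the action "left" carries an immediate 0.1 chance of entering "bad", so with a threshold of 0.05 the shield blocks it and only "right" remains available to the agent.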