Safe Reinforcement Learning
Robson Adem
Outline
Motivation and Problem Statement
Notions of Safety
Acting safely in known vs. unknown environments
Different Approaches
Data-driven Optimal Control
via LQR
via MPC
Constrained MDP via Lyapunov Function
Imitation Learning and Curriculum Learning
Where to go from here?
Motivation
The problem with using RL for such systems:
Unknown environment/dynamics
Unmodelled/unobserved states
Sensing errors
Dynamic feedback
Systems that change over time
Difficulty in guaranteeing the safety of a system during development and deployment
Lack of fairness, well-being, and user agency in social networks
Failure to ensure discovery and novel experiences in recommender systems
Specifying Safe Behavior
Safe and Efficient Exploration in Reinforcement Learning
Andreas Krause, YouTube 2020
What does it mean to be safe in RL sense?
How do we quantify uncertainty and risk?
Notions of Safety: Worst-case
Notions of Safety: Stochastic Uncertain Environment
Notions of Safety: Value at Risk
Notions of Safety: Conditional Value at Risk
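To make the two risk measures concrete, here is a minimal sketch (not from the slides; the sample distribution is invented for illustration): the Value at Risk at level α is the α-quantile of the cost distribution, and the Conditional Value at Risk is the expected cost in the worst (1 − α) tail, so CVaR is always at least VaR.

```python
import numpy as np

def var_cvar(costs, alpha=0.95):
    """Empirical VaR and CVaR of a sample of costs.

    VaR_alpha  : the alpha-quantile of the cost distribution.
    CVaR_alpha : the mean cost over the worst (1 - alpha) tail.
    """
    costs = np.asarray(costs, dtype=float)
    var = np.quantile(costs, alpha)          # tail threshold
    cvar = costs[costs >= var].mean()        # average of the bad tail
    return var, cvar

# Toy cost sample: mostly small costs plus a few catastrophic outcomes.
rng = np.random.default_rng(0)
costs = np.concatenate([rng.normal(1.0, 0.1, 950),
                        rng.normal(10.0, 1.0, 50)])
var, cvar = var_cvar(costs, alpha=0.95)
```

Note how CVaR, unlike VaR, is sensitive to *how bad* the tail outcomes are, which is why it is the more common safety objective in risk-sensitive RL.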
Notions of safety using Lyapunov Functions
Notions of safety in the Lyapunov sense
The General Problem of the Stability of Motion (In Russian)
Aleksandr Lyapunov, Doctoral dissertation, Univ. Kharkov 1892 Translated 1992
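For reference, the standard conditions (not quoted from the slides): for dynamics $\dot{x} = f(x)$ with an equilibrium at the origin, a candidate $V$ is a Lyapunov function when

```latex
V(0) = 0, \qquad V(x) > 0 \quad \forall x \neq 0, \qquad
\dot{V}(x) = \nabla V(x)^{\top} f(x) \le 0 \quad \forall x.
```

With strict inequality in the last condition one obtains asymptotic stability; in safe RL, sublevel sets $\{x : V(x) \le c\}$ are used as forward-invariant safe regions the policy must not leave.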
Notions of Safety: Summary
We also looked at safety in the Lyapunov sense!
Act safely in known environment
Act safely in unknown environment
Key challenge: Don’t know the consequences of actions taken!
Act safely in unknown environment with prior knowledge
Using prior knowledge to establish a good first policy!
Data-driven Optimal Control
via LQR
via MPC
Constrained MDP via Lyapunov Function
Imitation Learning and Curriculum Learning
Data-driven Optimal Control
Data-driven Optimal Control — Linear Dynamics
Data-driven Optimal Control — Linear Quadratic Regulator
Safely Learning to Control the Constrained Linear Quadratic Regulator
Sarah Dean, Stephen Tu, Nikolai Matni and Benjamin Recht ACC 2019
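As a concrete sketch of the LQR building block (illustrative only; the double-integrator matrices and weights below are invented, not taken from the paper above): the finite-horizon discrete-time LQR gains come from a backward Riccati recursion, and the control is the linear state feedback u_t = −K_t x_t.

```python
import numpy as np

def lqr_finite_horizon(A, B, Q, R, N):
    """Finite-horizon discrete-time LQR via backward Riccati recursion.

    Returns feedback gains K_0 .. K_{N-1}; the control law is u_t = -K_t x_t.
    """
    P = Q.copy()                      # terminal cost-to-go
    gains = []
    for _ in range(N):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)  # Riccati backward step
        gains.append(K)
    return gains[::-1]                # reorder so gains[0] is the time-0 gain

# Hypothetical double-integrator example (values chosen for illustration).
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q, R = np.eye(2), np.array([[1.0]])
Ks = lqr_finite_horizon(A, B, Q, R, N=50)

# Closed-loop rollout from an initial position offset.
x = np.array([1.0, 0.0])
for K in Ks:
    x = A @ x - B @ (K @ x)
```

The entire gain schedule is computed once, offline, for the whole horizon; this is the key property MPC relaxes, as discussed next.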
Data-driven Optimal Control — Model Predictive Control (MPC)
Data-driven Optimal Control — MPC vs LQR
The main difference is that LQR optimizes over the entire time window (horizon) at once, whereas MPC re-optimizes over a receding time window at every step!
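The receding-horizon idea can be sketched as follows (a minimal unconstrained version with invented double-integrator values; real MPC solves a constrained optimization each step, which here is replaced by the unconstrained Riccati solution for brevity):

```python
import numpy as np

def riccati_gain(A, B, Q, R, N):
    """First-step LQR gain for an N-step horizon (backward Riccati sweep)."""
    P = Q.copy()
    K = np.zeros((B.shape[1], A.shape[0]))
    for _ in range(N):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    return K          # after the full sweep, K is the gain for the first step

# Hypothetical double-integrator setup (illustrative values).
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q, R = np.eye(2), np.array([[1.0]])

x = np.array([1.0, 0.0])
for _ in range(100):                     # closed-loop simulation
    K = riccati_gain(A, B, Q, R, N=20)   # re-plan over a 20-step window
    u = -(K @ x)                         # apply only the first planned input
    x = A @ x + B @ u                    # the window then recedes forward
```

Re-planning at every step is what lets MPC absorb model mismatch and state constraints, at the cost of an online optimization per time step.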
Data-driven Optimal Control — Robust MPC
Safe Reinforcement Learning Using Robust MPC
Mario Zanon and Sébastien Gros IEEE Transactions on Automatic Control 2020
Stability-Constrained Markov Decision Processes Using MPC
Mario Zanon, Sébastien Gros, and Michele Palladino Preprint to Automatica 2021
Learning-Based Model Predictive Control: Toward Safe Learning in Control
Lukas Hewing, Kim P. Wabersich, Marcel Menner, and Melanie N. Zeilinger Annual Review of Control, 2020
Notions of safety using Lyapunov Functions
Constrained MDP via Lyapunov Function
Lyapunov Design for Safe Reinforcement Learning
Theodore J. Perkins and Andrew G. Barto JMLR 2003
A Lyapunov-based Approach to Safe Reinforcement Learning
Yinlam Chow, Ofir Nachum, Edgar Duenez-Guzman, Mohammad Ghavamzadeh Preprint 2018
Act safely in unknown environment with prior knowledge
Imitation Learning and Curriculum Learning
Where to go from here?
This presentation includes content from the sources cited!