Duy P. Nguyen$^{1}$, Kai-Chieh Hsu$^{*1}$, Wenhao Yu$^{2}$, Jie Tan$^{2}$, Jaime Fernández Fisac$^{1}$
$^{1}$Princeton University, United States $^{2}$Google DeepMind, United States

{duyn,kaichieh,jfisac}@princeton.edu,{magicmelon,jietan}@google.com

###### Abstract

Despite the impressive recent advances in learning-based robot control, ensuring robustness to out-of-distribution conditions remains an open challenge. Safety filters can, in principle, keep arbitrary control policies from incurring catastrophic failures by overriding unsafe actions, but existing solutions for complex (e.g., legged) robot dynamics do not span the full motion envelope and instead rely on local, reduced-order models. These filters tend to overly restrict agility and can still fail when perturbed away from nominal conditions. This paper presents the *gameplay filter*, a new class of predictive safety filter that continually plays out hypothetical matches between its simulation-trained safety strategy and a virtual adversary co-trained to invoke worst-case events and sim-to-real error, and precludes actions that would cause it to fail down the line. We demonstrate the scalability and robustness of the approach with a first-of-its-kind full-order safety filter for (36-D) quadrupedal dynamics. Physical experiments on two different quadruped platforms demonstrate the superior zero-shot effectiveness of the gameplay filter under large perturbations such as tugging and unmodeled terrain.

Keywords: Robust Safety, Adversarial Reinforcement Learning, Game Theory

## 1 Introduction

Autonomous robots are increasingly required to operate reliably in uncertain conditions and quickly adapt to carry out a broad range of jobs on the fly [1, 2, 3, 4, 5]. Rather than synthesize an intrinsically safe control policy for every new assigned task, it is efficient to endow each robot with a *safety filter* that automatically precludes unsafe actions, relieving task policies of the burden of safety altogether.

Unfortunately, today’s safety filter methods fall short of this promise for most modern-day robots. To cover a diverse range of tasks and environments, a safety filter needs to give the robot significant freedom to execute varied motions across its state space while robustly protecting it from catastrophic failures throughout this large envelope. To date, such *minimally restrictive* safety filters are only systematically computable for systems with 5–6 state variables [6, 7, 8], woefully short of the 12 needed to accurately model drone flight and the 30–50 needed for legged locomotion. Existing safety filters for high-order robot dynamics rely on reduced-order models [9, 10, 11, 12]. These filters restrict the robot’s motion to a local envelope, such as the vicinity of a stable walking gait, and become ineffective whenever the robot is perturbed away from it by external forces or unmodeled environment features (LABEL:fig:front). How can we tractably and systematically compute safety filters that cover broad regions of robots’ high-dimensional state spaces and a wide variety of deployment conditions?

Contribution. This paper introduces the *gameplay filter*, a novel type of predictive safety filter that can scale to full-order robot dynamics and enforce safety across a broad motion envelope and a designer-specified range of possible conditions (operational design domain). The filter is first synthesized by simulated self-play between a safety-seeking robot control policy and a virtual adversary that invokes worst-case realizations of uncertainty and modeling error (or *sim-to-real gap*). At runtime, the deployed filter continually rolls out hypothetical games between the two learned agents, overriding candidate actions that would result in the robot losing a future safety game. This methodology—based on the core game-theoretic principle that a strategy that wins against the worst-case opponent must also win against all others—unlocks real-time filtering in the robot’s full state space by only requiring a single, highly informative trajectory rollout. We demonstrate the effectiveness of our approach experimentally on two quadruped robots that differ in physical parameters and sensing capabilities (LABEL:fig:front). Each gameplay filter is synthesized and deployed using an off-the-shelf physics engine to simulate a manufacturer-provided robot model with a 36-D state space and a 12-D control space. We observe highly robust zero-shot safety-preserving behavior without incurring the conservativeness typical of robust predictive filters. To the best of our knowledge, this constitutes the first successful demonstration of a full-order safety filter on legged robot platforms.

Related Work. The last decade has seen important advances in robot safety filters. We briefly discuss the techniques most relevant to our work and direct interested readers to recent survey efforts [13, 14, 15] that shed light on safety filters’ common structure and relative strengths.

Value-based filters. Hamilton–Jacobi (HJ) reachability methods use finite-difference dynamic programming to compute the best available safety fallback policy and the worst possible uncertainty realization from each state on a finite grid [16, 17, 6], which enables minimally restrictive safety filters. Although highly general, HJ computational tools suffer exponential blowup and do not scale beyond 5–6 state dimensions [18, 19]. Control barrier function (CBF) filters keep the system inside a smaller safe set while discouraging excessive control overrides [20]. CBFs lack a general constructive procedure and instead rely on manual design [21], sum-of-squares synthesis [22], or learning from demonstrations [23]. Robust formulations are comparatively less mature [24, 25, 26, 27]. Self-supervised and reinforcement learning techniques can synthesize safety-oriented control policies and value functions (“safety critics”) for systems beyond the reach of classical methods, but they are inherently approximate and offer no formal assurances [28, 29, 30, 31, 32]. Statistical generalization theory may be used to bound the probability of failure under the assumption that the robot can be tested on a statistically representative sample of environments and conditions before deployment [3].

Rollout-based filters. Predictive safety filters perform model-based runtime assurance by continually simulating—and in some cases optimizing—the robot’s future safety efforts for a short lookahead time horizon [33, 34, 35, 36, 37, 38]. Recent advances in fast forward-reachable set over-approximation [39, 40, 41] make it possible to check safety against all possible uncertainty realizations, although this runtime robustness comes at the cost of significant added conservativeness: for example, Hsu et al. [38] observe safety overrides $5$ times as frequent as those of a least-restrictive HJ filter. Bastani and Li [35] instead propose sampling multiple possible trajectories, assuming a well-characterized disturbance distribution, to maintain a statistical guarantee. Our approach mimics Hsu et al. [38] in co-training a safety controller and a worst-case disturbance through simulated self-play, but it eschews over-conservative reachable sets by instead simulating a single closed-loop match between the two.

Legged robot safety filters. Legged robots have attracted increasing interest from researchers due to their versatility and increasing availability, as well as their challenging high-order and contact-rich dynamics [42]. Recent simulation-trained controllers leveraging domain randomization are showing promising agility and adaptability [1, 43, 2, 44]; however, robustness to out-of-distribution conditions cannot be easily quantified and remains an open issue. Unfortunately, all safety filters demonstrated on legged robots to date are based on simplified reduced-order dynamical models [10, 11, 3, 12], sometimes combined with local analysis around nominal walking gaits [45, 9, 46]. The dynamic envelope protected by these safety filters is limited to local state-space regions where the simplified models apply, and their robustness to disturbances and modeling errors is contingent on the effectiveness of low-level tracking controllers. Our demonstration of the gameplay filter uses a full-order dynamical model of the robot, both at synthesis and at deployment, which enables it to enforce safety across a broad range of motions and operating conditions.

## 2 Preliminaries: Robust Robot Safety in an Operational Design Domain

We wish to ensure the safe operation of a robot with potentially high-order nonlinear dynamics under a wide range of environments and task specifications, which may be unknown at design time. Formally, we consider a robotic system with uncertain discrete-time dynamics

$\displaystyle{{x}_{{k}+1}}={f}({{x}_{k}},{{u}_{k}},{{d}_{k}}),$ | (1) |

where, at each time step ${k}\in\mathbb{N}$, ${{x}_{k}}\in{\mathcal{X}}\subseteq\mathbb{R}^{n_{x}}$ is the state of the system, ${{u}_{k}}\in{\mathcal{U}}\subset\mathbb{R}^{n_{u}}$ is the bounded control input (typically from a control policy ${\pi}^{u}\in{\Pi^{u}}\colon{\mathcal{X}}\to{\mathcal{U}}$), and ${{d}_{k}}\in{\mathcal{D}}\subset\mathbb{R}^{n_{d}}$ is a disturbance input, unknown *a priori* but bounded by a compact set ${\mathcal{D}}$. While the control bound ${\mathcal{U}}$ encodes actuator limits, the disturbance bound ${\mathcal{D}}$ is a key part of the operational design domain (ODD).
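As a concrete illustration, the uncertain dynamics interface above can be sketched in a few lines of Python; the clipping scheme, the bounds, and the toy double-integrator stand-in for ${f}$ below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def step(x, u, d, f, u_bounds, d_bounds):
    """One transition x_{k+1} = f(x_k, u_k, d_k) of the uncertain dynamics
    (Eq. 1), with the control clipped to the actuator limits U and the
    disturbance clipped to the compact ODD bound D."""
    u = np.clip(u, *u_bounds)   # actuator limits U
    d = np.clip(d, *d_bounds)   # ODD disturbance bound D
    return f(x, u, d)

# Toy stand-in for f: a scalar double integrator whose acceleration is
# pushed by a bounded disturbance (purely illustrative).
f = lambda x, u, d: np.array([x[0] + 0.1 * x[1], x[1] + 0.1 * (u + d)])
x1 = step(np.array([0.0, 1.0]), u=2.0, d=-0.5, f=f,
          u_bounds=(-1.0, 1.0), d_bounds=(-0.2, 0.2))  # u, d get clipped
```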

Operational Design Domain. The ODD can be viewed as a social contract between the system operator and the public, delineating the set of conditions under which the robotic system is required to function correctly and safely [47]. In this paper, we are interested in *robust safety*, where the disturbance (or “domain”) bound ${\mathcal{D}}$ may encode a range of potential perturbations like wind or contact forces, environmental parameters like terrain friction, manufacturing tolerances, variations in actuator performance and state estimation accuracy, and other factors contributing to designer uncertainty about future deployment conditions and modeling error. The ODD further specifies a deployment set ${\mathcal{X}}_{0}\subset{\mathcal{X}}$ of allowable initial states (for example, the robot is always turned on while static on flat ground) and, crucially, a *failure set* ${\mathcal{F}}\subset{\mathcal{X}}$, which characterizes all configurations that the system state must never reach, such as falls or collisions. The required safety property can then be succinctly expressed as:

$\forall{x}_{0}\in{\mathcal{X}}_{0},\ \forall{k}\geq 0,\ \forall{d}_{0},\dots,{{d}_{k}}\in{\mathcal{D}},\quad{{x}_{k}}\not\in{\mathcal{F}}\,,$ | (2) |

that is, once deployed in an admissible initial state, the robot must stay clear of the failure set for any realization of the domain uncertainty.

Safety Filter. Explicitly ensuring the safety property in the synthesis of every robot task policy ${{\pi}^{\text{task}}}$ can be impractically cumbersome, especially for increasingly general-purpose robotic systems with broad ODDs. Instead, we aim to relieve task policies of the burden of safety by augmenting them with a safety filter ${\phi}$ that depends on the robot’s ODD but not on the task specification. Rather than directly applying the proposed task action ${u}_{k}={{\pi}^{\text{task}}}({x}_{k})$ from each state ${x}_{k}$, the robot executes^{1}

^{1} For the scope of this paper, we assume that the robot maintains an appropriately accurate estimate of its dynamical state through onboard perception. We make two observations: First, moderate state estimation errors typical in many robotic systems can be absorbed by inflating the failure set ${\mathcal{F}}$ and dynamical uncertainty ${\mathcal{D}}$. Second, more substantial state uncertainty, e.g., induced by sensor faults, occluding objects, or multiagent interaction, may be handled with information-space safety filters, a subject of ongoing research [48, 49, 50].

${u}_{k}={\phi}({x}_{k},{{\pi}^{\text{task}}})\,.$ | (3) |

The safety filter’s role is to prevent the execution of any candidate actions that would jeopardize future safety, while also avoiding spurious interventions that unnecessarily disrupt task progress. In fact, for any well-defined ODD there exists a *perfect safety filter* that allows every safe candidate action and overrides every unsafe one, robustly enforcing (2) with no overstepping [13, Prop. 1]. Formally, a perfect safety filter only disallows actions that may cause the state to exit the *maximal safe set* ${{\Omega}^{*}}\subset{\mathcal{X}}$, the set of all states from which there exists a control policy that can enforce (2). While computing such a perfect filter is known to be intractable for most practical systems [7], we aim to synthesize effective safety filters that allow robots significant freedom to perform a wide range of tasks (including online learning and exploration) while maintaining safety across their ODD. Intuitively, we would like to obtain a safety filter that robustly keeps the robot inside a conservative safe set ${\Omega}\subseteq{{\Omega}^{*}}$ as close as possible to the theoretical ${{\Omega}^{*}}$. Our proposed method uses game-theoretic reinforcement learning and faster-than-real-time gameplay simulation to approximate a perfect safety filter for any given robot ODD, targeting the robot’s full dynamic envelope, in contrast with existing reduced-order filters, which aim to enforce safety within a significantly smaller set ${\Omega}$.

Reach–Avoid Safety Game. Whether it is possible for the robot to robustly maintain safety, as in (2), can be seen as the categorical (true/false) outcome of a *game of kind* between the robot’s controller and an adversarial disturbance that aims to drive it into the failure set. In turn, this result can be encoded implicitly through a *game of degree* with a continuous outcome (for example, the closest distance that will separate the robot and any obstacle). In particular, for the purposes of predictive safety filtering, we consider a sufficient finite-time condition for all-time safety: it is enough for the robot to reach a known controlled-invariant set ${\mathcal{T}}\subset{\mathcal{F}}^{c}$ (for example, coming to a stable stance) in ${H}$ steps without previously entering the failure set ${\mathcal{F}}$. Once there, the robot can switch to a policy ${{\pi}^{{\mathcal{T}}}}$ that keeps it in ${\mathcal{T}}$ indefinitely. This induces a reach–avoid game [17, 32] with outcome

$\displaystyle{J}^{{\pi}^{u}\!,{\pi}^{d}}_{k}({x}):=\max_{{\tau}\in[{k},{H}]}\min\left\{{\ell}\left({x}_{{\tau}}\right),\,\min_{{s}\in[{k},{\tau}]}{g}\left({x}_{{s}}\right)\right\}$ | (4) |

where ${g}$ and ${\ell}$ are the (Lipschitz) failure and target margins, satisfying ${{g}({x})<0\Leftrightarrow{x}\in{\mathcal{F}}}$ and ${{\ell}({x})\geq 0\Leftrightarrow{x}\in{\mathcal{T}}}$. The outcome summarizes the aforementioned condition for all-time safety: for any given ${\tau}\in[0,{H}]$, if the trajectory has previously entered the failure set ${\mathcal{F}}$, i.e., ${g}({x}_{\tau})<0$, then ${J}^{{\pi}^{u}\!,{\pi}^{d}}_{k}({x})<0$ for all ${k}\in[0,{\tau}]$, reflecting that past failure overrides future successes. The value function of this game satisfies the reach–avoid Isaacs equation
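The nested max–min in Eq. 4 can be evaluated in a single pass over an imagined trajectory by tracking the running worst failure margin; a minimal Python sketch (function and variable names are ours):

```python
def reach_avoid_outcome(ell_traj, g_traj):
    """Reach-avoid outcome of Eq. 4 for margin sequences ell_traj[t] and
    g_traj[t] along a rollout:
        J = max_tau min( ell(x_tau), min_{s<=tau} g(x_s) ).
    J >= 0 iff the trajectory reaches the target set T before ever
    entering the failure set F."""
    outcome = -float("inf")
    worst_g = float("inf")
    for ell_t, g_t in zip(ell_traj, g_traj):
        worst_g = min(worst_g, g_t)                  # min_{s<=tau} g(x_s)
        outcome = max(outcome, min(ell_t, worst_g))  # running max over tau
    return outcome
```

A rollout that enters the failure set before reaching the target yields a negative outcome regardless of later margins, matching the "past failure overrides future successes" property.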

$\displaystyle{V}_{k}({x})$ | $\displaystyle=\max_{\vphantom{{d}}{u}}\min_{{d}}\min\left\{{g}({x}),\,\max\left\{{\ell}({x}),\,{V}_{{k}+1}\big{(}{f}({x},{u},{d})\big{)}\right\}\right\},$ | (5a) |

$\displaystyle{V}_{H}({x})$ | $\displaystyle=\min\left\{{\ell}({x}),\,{g}({x})\right\}\,,$ | (5b) |

and the robot’s controller is guaranteed a winning strategy from any state where ${V}_{0}({x})\geq 0$.
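On a small discrete toy problem, the backward recursion in Eq. 5 can be carried out exactly by exhaustive max–min; the sketch below is our own illustration (the 1-D grid dynamics and margins are invented for the example, not taken from the paper).

```python
def isaacs_backup(f, ell, g, states, controls, disturbs, H):
    """Finite-horizon reach-avoid Isaacs recursion (Eq. 5) on a finite
    state set: terminal condition (5b), then H backups of (5a) with an
    exhaustive max over controls and min over disturbances."""
    V = {x: min(ell(x), g(x)) for x in states}          # terminal (5b)
    for _ in range(H):                                  # backup (5a)
        V = {x: max(min(min(g(x), max(ell(x), V[f(x, u, d)]))
                        for d in disturbs)
                    for u in controls)
             for x in V}
    return V

# Toy 1-D grid: walk left to the target cell 0 while never landing on the
# failure cell 4; the adversary shifts the state by at most 1 per step,
# the controller by up to 2, so the controller can always make progress.
clamp = lambda z: max(0, min(4, z))
f = lambda x, u, d: clamp(x + u + d)
ell = lambda x: 0.5 - x            # >= 0 exactly on the target set {0}
g = lambda x: 3.5 - x              # < 0 exactly on the failure set {4}
V = isaacs_backup(f, ell, g, states=range(5), controls=(-2, -1, 0),
                  disturbs=(-1, 0, 1), H=4)
```

Cells 0–3 obtain nonnegative value (the controller's authority dominates the adversary's), while the failure cell is correctly flagged unsafe.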

## 3 Predictive Gameplay Safety Filters

### 3.1 Offline Gameplay Learning

We extend the Iterative Soft Adversarial Actor–Critic for Safety (ISAACS) scheme [38] to reach–avoid games (4), approximately solving the infinite-horizon counterpart of the Isaacs equation (5).

Simulated Adversarial Safety Games. At every time step of gameplay, we record the transition $({x},{u},{d},{{x}^{\prime}},{{\ell}^{\prime}},{{g}^{\prime}})$ in the replay buffer ${\mathcal{B}}$, with ${{{x}^{\prime}}:={f}({x},{u},{d})}$, ${{\ell}^{\prime}}:={\ell}({{x}^{\prime}})$ and ${{g}^{\prime}}:={g}({{x}^{\prime}})$.

Policy and Critic Network Updates. The core of the proposed offline gameplay learning is to find an approximate solution to the time-discounted infinite-horizon version of Eq. 5. We employ the Soft Actor-Critic (SAC) [51] framework to update the critic and actor networks with the following loss functions.

We update the critic to reduce the deviation from the Isaacs target^{2}

^{2} Deep reinforcement learning typically involves training an auxiliary target critic ${Q}_{\omega}^{\prime}$, with parameters $\omega^{\prime}$ that undergo slow adjustments to align with the critic parameters $\omega$. This process aims to stabilize the regression by maintaining a fixed target within a relatively short timeframe.

$\displaystyle L(\omega)$ | $\displaystyle:=\operatorname*{{\mathbb{E}}}_{({x},{u},{d},{{x}^{\prime}},{{\ell}^{\prime}},{{g}^{\prime}})\sim{\mathcal{B}}}\left[\left({Q}_{\omega}({x},{u},{d})-y\right)^{2}\right]\,,$ |

$\displaystyle y$ | $\displaystyle=\gamma\min\left\{{{g}^{\prime}},\max\left\{{{\ell}^{\prime}},{Q}_{\omega}^{\prime}({{x}^{\prime}},{{u}^{\prime}},{{d}^{\prime}})\right\}\right\}+(1-\gamma)\min\left\{{{\ell}^{\prime}},{{g}^{\prime}}\right\}$ | (6a) |

with ${{u}^{\prime}}\sim{\pi}_{\theta}(\cdot\mid{{x}^{\prime}})$, ${{d}^{\prime}}\sim{\pi}_{\psi}(\cdot\mid{{x}^{\prime}})$. We update the control and disturbance policies following the policy gradient induced by the critic with entropy regularization:

$\displaystyle L(\theta)$ | $\displaystyle:=\operatorname*{{\mathbb{E}}}_{({x},{d})\sim{\mathcal{B}}}\Big{[}-{Q}_{\omega}({x},{\tilde{{u}}},{d})+{{\alpha}^{u}}\log{\pi}_{\theta}({\tilde{{u}}}\mid{x})\Big{]},$ | (6b) |

$\displaystyle L(\psi)$ | $\displaystyle:=\operatorname*{{\mathbb{E}}}_{({x},{u})\sim{\mathcal{B}}}\Big{[}{Q}_{\omega}({x},{u},{\tilde{{d}}})+{{\alpha}^{d}}\log{\pi}_{\psi}({\tilde{{d}}}\mid{x})\Big{]},$ | (6c) |

where ${\tilde{{u}}}\sim{\pi}_{\theta}(\cdot\mid{x})$, ${\tilde{{d}}}\sim{\pi}_{\psi}(\cdot\mid{x})$, and ${{\alpha}^{u}},{{\alpha}^{d}}$ are hyperparameters incentivizing exploration (entropy in the stochastic policies), which decay gradually in magnitude through the training.

Following the ISAACS scheme, we jointly train the safety critic, controller actor, and disturbance actor through Eq. 6. For better learning stability, the controller actor can be updated at a slower rate (only after every $\tau\geq 1$ disturbance updates), consistent with the asymmetric information structure of the game, and a leaderboard of best-performing controllers and disturbances can be maintained to mitigate mutual overfitting to the latest adversary iteration [38].
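For reference, the scalar form of the critic target in Eq. 6a is straightforward to compute; the sketch below is our own (in practice this would be evaluated on batched tensors sampled from the replay buffer):

```python
def reach_avoid_target(g_next, ell_next, q_next, gamma):
    """Discounted reach-avoid Bellman target of Eq. 6a:
        y = gamma * min(g', max(ell', Q'(x', u', d')))
            + (1 - gamma) * min(ell', g').
    As gamma -> 1 this recovers the undiscounted Isaacs backup (5a)."""
    return (gamma * min(g_next, max(ell_next, q_next))
            + (1.0 - gamma) * min(ell_next, g_next))
```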

### 3.2 Online Gameplay Filter

This section demonstrates how the reach–avoid control actor ${\pi}_{\theta}$ and disturbance actor ${\pi}_{\psi}$ synthesized offline through game-theoretic RL can be systematically used at runtime to construct highly effective safety filters for general nonlinear, high-dimensional dynamic systems.

The gameplay rollout considers applying the candidate task policy ${{\pi}^{\text{task}}}$ followed by the learned fallback policy ${{\pi}^{\text{safe}}}$, with the whole rollout under attack by the learned disturbance policy ${\pi}_{\psi}$. It is effectively a game between the learned fallback policy ${{\pi}^{\text{safe}}}$ and the learned disturbance policy ${\pi}_{\psi}$, checking whether accepting the candidate action from the task policy ${{\pi}^{\text{task}}}$ would result in an inevitable failure even if we then apply our best-effort attempt to maintain safety. The reach–avoid outcome defined in Eq. 4 is used to determine the game outcome. A runtime gameplay filter can then be defined with the simple switching rule:

$\displaystyle{\phi}({x},{{\pi}^{\text{task}}})=\left\{\begin{array}[]{ll}{{\pi}^{\text{task}}},&{\Delta}^{\text{safe}}({x},{{\pi}^{\text{task}}})=1,\\{{\pi}^{\text{safe}}},&{\Delta}^{\text{safe}}({x},{{\pi}^{\text{task}}})=0,\end{array}\right.\qquad{\Delta}^{\text{safe}}({x},{{\pi}^{\text{task}}}):=\mathbbm{1}\Big{\{}\exists{\tau}\in\{1,\dots,{H}\},\ {\hat{x}}_{{\tau}}\in{\mathcal{T}}\,\land\,\forall{s}\in\{1,\dots,{\tau}\},\ {\hat{x}}_{{s}}\not\in{\mathcal{F}}\Big{\}}$ | (7a) |

with ${\hat{x}}_{0}={x}$, ${\hat{x}}_{{\tau}+1}={f}({\hat{x}}_{{\tau}},{\hat{u}}_{{\tau}},{\pi}_{\psi}({\hat{x}}_{{\tau}}))$ for ${\tau}\geq 0$, and

$\displaystyle{\hat{u}}_{{\tau}}=\begin{cases}{{\pi}^{\text{task}}}({\hat{x}}_{{\tau}}),&{\tau}=0,\\{{\pi}^{\text{safe}}}({\hat{x}}_{{\tau}}),&{\tau}\in\{1,\dots,{H}-1\},\end{cases}\qquad{{\pi}^{\text{safe}}}({x})=\left\{\begin{array}[]{ll}{\pi}_{\theta}({x}),&{x}\not\in{\mathcal{T}},\\{{\pi}^{{\mathcal{T}}}}({x}),&{x}\in{\mathcal{T}}.\end{array}\right.$ | (7b) |

That is, if the gameplay monitor returns “success” (the simulated trajectory safely reaches the target set), the filter selects the task policy; otherwise, it selects the fallback safety policy.
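A minimal sketch of this switching rule with deterministic policy stand-ins; the function names and the toy 1-D dynamics below are our own illustrative assumptions:

```python
def gameplay_filter(x, task_policy, fallback_policy, adversary, f,
                    in_target, in_failure, H):
    """One filter decision: simulate a single H-step game in which the
    candidate task action is applied once and the learned fallback then
    plays against the learned adversary. The task policy is accepted only
    if the imagined trajectory reaches T without ever touching F."""
    x_hat = f(x, task_policy(x), adversary(x))   # candidate first step
    for tau in range(1, H + 1):
        if in_failure(x_hat):
            return fallback_policy               # override: game lost
        if in_target(x_hat):
            return task_policy                   # accept: game won
        if tau < H:
            x_hat = f(x_hat, fallback_policy(x_hat), adversary(x_hat))
    return fallback_policy                       # T not reached in H steps

# Toy 1-D example: target at x <= 0, failure at x >= 5; the fallback
# retreats faster than the adversary can push.
f = lambda x, u, d: x + u + d
task = lambda x: 1.0           # task policy marches forward
fallback = lambda x: -1.0      # fallback retreats
adv = lambda x: 0.5            # adversary pushes forward
```

Far from the failure set the imagined game is won and the task action passes through; close to it, the candidate first step already loses the game and the fallback is engaged instead.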

In practice, the computation of a full gameplay rollout may require multiple time steps (i.e., multiple control policy executions). In that case, the filter in (7) can be extended to a multi-step variant in which decisions are made by the filter every ${L}$ steps, appropriately accounting for the latency. Figure 1 illustrates the gameplay safety filter logic with the ${L}$-step rollout latency.

## 4 Experimental Evaluation

We run hardware experiments and an extensive simulation study, focusing on quadruped robots as an informative platform but stressing that our proposed methodology is general and can be applied to other types of robots. We aim to evaluate the extent to which the synthesized gameplay filters can maintain safety within the ODD specified at training, generalize beyond the ODD, and avoid unnecessarily impeding task execution. We also conduct ablation studies to investigate the importance of reach–avoid reinforcement learning and adversarial self-play in the filter synthesis, and of the gameplay rollout in the filter’s runtime monitoring. Implementation details are in Appendix B.

### 4.1 Experiment Setup

Robots and simulator. We use a Ghost Robotics Spirit S40 (LABEL:fig:front) and a Unitree Go-2. Both have built-in IMUs to obtain body angular velocities and linear acceleration, and internal motor encoders to measure joint positions and velocities. The S40 has no foot contact sensing; the Go-2 receives a Boolean contact signal for each foot. Neither robot’s safety filter is given access to visual perception. We use the PyBullet physics engine [52] for both training and runtime gameplay simulation.

Gameplay filter. We set up an offboard gameplay rollout server, a ROS service that receives the current robot state estimate and candidate task policy, runs an ${H}$-step gameplay rollout, and returns a single policy selection (either task or fallback) for the subsequent $L$ control cycles. Our physical robot experiments use horizon ${H}=300$ and latency $L=10$, with filter decisions running at around $3.5~\text{Hz}$.

Task.The robot’s task is to move from its initial location to a goal on the other side of the terrain.

Operational design domain. The safety filter is computed for a fairly simple ODD, defined by the nominal robot simulator perturbed by forces of up to $50~\text{N}$ applied anywhere on the robot’s torso; the disturbance adversary acts through a vector ${d}\in{\mathcal{D}}\subset\mathbb{R}^{6}$ encoding what force to apply and where. We intentionally limit the ODD to only consider flat ground. The failure set ${\mathcal{F}}$ is defined as all *fall* states, in which any non-foot robot part makes contact with the ground. The deployment set and controlled-invariant set ${\mathcal{X}}_{0}={\mathcal{T}}$ are chosen empirically to contain all four-legged stances with a lowered torso, around which the robot is robustly stable with a simple leg position controller $\pi^{\mathcal{T}}$.
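One plausible way to realize such a 6-D disturbance in simulation is to decode the adversary action into a clamped force vector and an application point on the torso; the split, the clamping scheme, and the torso dimensions below are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def decode_disturbance(d, f_max=50.0, half_extents=(0.3, 0.1, 0.05)):
    """Decode a 6-D adversary action d = (force, application point) into a
    torso force clamped to the ODD magnitude limit and a point clipped to
    a torso-sized box (all dimensions hypothetical)."""
    d = np.asarray(d, dtype=float)
    force, point = d[:3], d[3:]
    norm = np.linalg.norm(force)
    if norm > f_max:
        force = force * (f_max / norm)   # project onto the 50 N ball
    ext = np.asarray(half_extents)
    point = np.clip(point, -ext, ext)    # keep the point on the torso
    # In PyBullet, the result could then be applied each step with
    # pybullet.applyExternalForce(robot_id, torso_link, force, point,
    #                             pybullet.LINK_FRAME).
    return force, point
```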

Test conditions. We test in two types of conditions: flat terrain with tugging forces (similar to ODD) and irregular terrain (out-of-ODD). The irregular terrain is a $2~\text{m}\times 4~\text{m}$ area with a 15-degree incline along one edge and two memory foam mounds, $5~\text{cm}$ and $15~\text{cm}$ high, positioned $1.8~\text{m}$ from each other. Tugging forces are applied manually through a rope, attached to the robot’s torso and to a motion-tracked dynamometer set to provide audiovisual alerts at 80% and 100% of the ODD limit.

Baselines. To evaluate the effectiveness of the reach–avoid learning signal and robust in-simulation learning, we consider four prior reinforcement learning algorithms: (1) standard SAC [51] with reward defined as $+1$ inside ${\mathcal{T}}$, $-1$ inside ${\mathcal{F}}$, and $0$ everywhere else; (2) single-agent reach–avoid reinforcement learning (RARL) [32]; (3) RARL with domain randomization (DR); and (4) adversarial SAC with the above indicator reward. We also compare to a critic (value-based) filter, which queries the learned ${Q}_{\omega}$ for the current state and proposed task action and intervenes if it is below a threshold; we run a parameter sweep in simulation to tune the threshold value and use it in all experiments.

Policies. All learned policies are neural networks with 3 fully-connected layers of 256 neurons; critics have 3 layers of 128 neurons. We handcraft a task policy using an inverse kinematics gait planner for forward/sideways walking. We use a low-level PD position controller that outputs torques $\tau^{i}=K_{p}\,\delta{\theta}^{i}_{\text{J}}-K_{d}\,{\omega}^{i}_{\text{J}}$ to the robot motor controller, with $K_{p},K_{d}$ the proportional and derivative gains.
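The per-joint PD law above amounts to one line per joint; a minimal sketch, with gains that are illustrative placeholders rather than the values used on the robots:

```python
def pd_torques(theta_des, theta_meas, omega_meas, kp=40.0, kd=0.5):
    """PD position controller, tau_i = Kp * (theta_des_i - theta_i) - Kd * omega_i,
    mapping desired joint angles and measured joint states to motor torques."""
    return [kp * (td - t) - kd * w
            for td, t, w in zip(theta_des, theta_meas, omega_meas)]
```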

Table 1: Physical results for the S40 under tugging forces (flat terrain, similar to ODD) and irregular terrain (out-of-ODD).

| Policy | Safe/All Runs | Withstood Attacks, All (within 110% ODD) | Filter Freq. | $T_{\text{goal}}$ | $F^{\text{peak}}_{\text{avg}}$ (succ.) | $F^{\text{peak}}_{\text{max}}$ (succ.) | $F^{\text{peak}}_{\text{avg}}$ (fail) | $F^{\text{peak}}_{\text{min}}$ (fail) | Safe/All Runs (terrain) | Filter Freq. (terrain) | $T_{\text{goal}}$ (terrain) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ${{\phi}^{\text{game}}}$ | 7/10 | 53/56 (33/35) | 0.17 | 26.3 | 67.5 N | 70.5 N | 59.8 N | 52.7 N | 10/10 | 0.19 | 41.2 |
| ${{\phi}^{\text{critic}}}$ | 4/10 | 22/28 (10/15) | 0.10 | 26.8 | 73.7 N | 80.9 N | 53.6 N | 40.0 N | 5/10 | 0.22 | 33.5 |
| ${{\pi}^{\text{task}}}$ | 0/10 | 6/16 (1/5) | – | – | – | – | 56.5 N | 41.4 N | 5/10 | – | 16.4 |

^{†} Safety policies from reward-based RL and ISAACS with the avoid-only objective fail immediately before applying force. ^{*} The policy was able to withstand this magnitude of force; because the policy made the quadruped move in the tugging direction, we were not able to apply a larger force in 10 pull attempts.

### 4.2 Physical Results

Safe walking within and beyond the ODD. We evaluate the effectiveness of our proposed gameplay filter in terms of both safety and disruption of task performance. We run similar experiments with baseline methods for rough comparison purposes but caution that, due to the impossibility of reproducing identical conditions, these results should not be taken as a fine-grained quantitative comparison between methods. Such a comparison is conducted at scale, albeit in simulation, in Section 4.3. Table 1 shows the results for the S40 robot, subject to tugging forces and irregular terrain (not considered in the ODD), and Table 2 shows the results for the Go-2 robot under a larger range of tugging forces (up to $4\times$ the ODD bound). Our proposed gameplay safety filter is remarkably robust across robot platforms and test conditions; while not unbeatable outside of the specified ODD, it still withstands large tugging forces before it violates the safety constraints. Importantly, the gameplay filter does not disproportionately interfere with the task-oriented actions: it maintains comparable filter frequency and task performance to the critic filter while drastically reducing safety failures. LABEL:fig:front shows the gameplay filter in action on the S40, dynamically counterbalancing tugs or springing into a wide stance. Time plots of tugging forces in all S40 runs are given in Appendix D.

External forces. We measure the maximum tugging force withstood by various safety policies and filters, reported in Table 3. We pull the quadruped from different directions, with “low” indicating angles in the range $[-0.1,\,0.4]~\text{rad}$ and “high” in $[0.5,\,1.0]~\text{rad}$. The employed ${\pi}_{\theta}$ can withstand $150~\text{N}$ from all directions, but the non-game-theoretic counterpart (RARL+DR) is vulnerable to tugging from the left and can only withstand $43~\text{N}$. This suggests that DR struggles to capture the worst-case realization of disturbances in a bounded class. This arises from its inherent nature: as the dimension of the disturbance input increases, the likelihood of the random policy simulating the worst-case disturbance decreases exponentially. Further, we notice that the reward-based RL baselines and ISAACS with the avoid-only objective fail almost immediately by overreacting and flipping over. Reach–avoid policies behave more robustly by bringing the robot to a stable stance. We also include tests for the task policy ${{\pi}^{\text{task}}}$ and the fixed-pose policy ${{\pi}^{{\mathcal{T}}}}$ (used when the state is in the target set). We observe that the ISAACS control actor is strictly better than ${{\pi}^{\text{task}}}$ and is comparable to ${{\pi}^{{\mathcal{T}}}}$.

### 4.3 Simulated Results

Bespoke ultimate stress test (BUST). To test each policy’s robustness when taken to the limit, we RL-train a *specialized* adversarial disturbance ${{\pi}_{\psi}^{*}}$ to exploit its safety vulnerabilities (Table 4). For each robot–disturbance policy pair, we play 1,000 finite-horizon games and record the safe rate, i.e., the overall fraction of failure-free runs. All pairs use the same set of 1,000 initial states. We observe that ${{\pi}^{\text{task}}}$ is vulnerable to all ${{\pi}_{\psi}^{*}}$, while the proposed gameplay filter is only exploited by its associated BUST disturbance ${{\pi}_{\psi}^{*}}({{\phi}^{\text{game}}})$. Further, the robustness of ${{\phi}^{\text{game}}}$ pushes ${{\pi}_{\psi}^{*}}({{\phi}^{\text{game}}})$ to learn effective attacks that also exploit other policies (the third column has the lowest safe rates across the board). The last two columns show the safe rate under random disturbances. All safety filters and safety policies remain at remarkably high safe rates, suggesting that our adversarial BUST evaluation method establishes a more demanding safety benchmark for policies than DR.

Table 4: Safe rates of each policy or filter (rows) under each specialized BUST disturbance and under random disturbances (columns).

| | ${{\pi}_{\psi}^{*}}\left({\pi}_{\theta}\right)$ | ${{\pi}_{\psi}^{*}}\left({{\pi}^{\text{task}}}\right)$ | ${{\pi}_{\psi}^{*}}\left({{\phi}^{\text{game}}}\right)$ | ${{\pi}_{\psi}^{*}}\left({{\phi}^{\text{critic}}}\right)$ | ${\pi}^{\text{rnd}}$ | ${\pi}^{\text{rnd,+}}$ |
|---|---|---|---|---|---|---|
| ${\pi}_{\theta}$ | 0.37 | 0.38 | 0.17 | 0.44 | 0.88 | 0.85 |
| ${{\pi}^{\text{task}}}$ | 0.0 | 0.0 | 0.0 | 0.0 | 0.03 | 0.03 |
| ${{\phi}^{\text{game}}}$ | 0.42 | 0.35 | 0.03 | 0.45 | 0.84 | 0.89 |
| ${{\phi}^{\text{critic}}}$ | 0.37 | 0.34 | 0.10 | 0.44 | 0.86 | 0.86 |

Sensitivity analysis: reach–avoid criteria vs. avoid-only. We evaluate the significance of using reach–avoid criteria in the gameplay filter by performing a sensitivity analysis over the horizon of the imagined gameplay. Figure 2 shows that the gameplay filter with reach–avoid criteria maintains a 100$\%$ safe rate even when the gameplay horizon is short (${H}=10$). In contrast, an “avoid-only” gameplay filter, which only requires not reaching ${\mathcal{F}}$ for ${H}$ steps, incurs more safety violations as the horizon decreases. The difference arises because a shorter imagined gameplay results in more frequent filter intervention under reach–avoid criteria but overly optimistic monitoring under avoid-only criteria (oblivious to imminent failures beyond ${H}$). Further, as the gameplay horizon increases, the reach–avoid gameplay filter’s intervention frequency decreases.
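The distinction between the two monitoring criteria can be made precise with a short sketch of the imagined-gameplay check. The `fallback`, `adversary`, `step`, and set-membership functions are placeholders for the learned policies, the simulator, and the margin functions; only the control flow of the two criteria is asserted here.

```python
def gameplay_check(x0, fallback, adversary, step, in_failure, in_target,
                   horizon, criterion="reach-avoid"):
    """Imagined-gameplay safety monitor (sketch).

    Returns True iff the candidate state x0 passes the check:
    - "avoid":       safe if the failure set F is not entered within `horizon`.
    - "reach-avoid": safe only if the rollout also reaches the all-time-safe
                     target set T within `horizon` (early exit on success).
    """
    x = x0
    for _ in range(horizon):
        if in_failure(x):
            return False
        if criterion == "reach-avoid" and in_target(x):
            return True                    # reached T: safe from here on
        x = step(x, fallback(x), adversary(x))
    if in_failure(x):
        return False
    return criterion == "avoid"            # avoid-only: surviving H steps suffices
```

With a short horizon, the reach-avoid variant conservatively rejects rollouts that have not yet reached T (triggering intervention), while the avoid-only variant optimistically accepts them, matching the trend in Figure 2.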

## 5 Conclusion

This work presents a game-theoretic learning approach to synthesize safety filters for high-order, nonlinear dynamics. The proposed gameplay safety filter monitors system safety through imagined games between its best-effort safety fallback policy and a learned virtual adversary that aims to realize the worst-case uncertainty in the system. We validate our approach on two different quadruped robots, which maintain zero-shot safety under strong tugging forces and unmodeled irregular terrain. We also perform an exhaustive simulation study to rigorously stress-test the approach and quantify its reliability and conservativeness.

Limitations. Despite the strong empirical robustness in both simulated and physical experiments, we do not have strong theoretical guarantees on the convergence of offline gameplay learning, and the learned disturbance policy can therefore be expected to behave suboptimally in at least some regions of the state space. The potential implications are serious: a suboptimal (not-truly-worst-case) disturbance model may lead the gameplay rollout to erroneously conclude that a proposed course of action is safe, only for the robot to then encounter an ODD realization that unexpectedly drives it into a catastrophic failure state. Without strong theoretical assurances, which for now remain elusive, this method should not be placed in sole charge of a truly safety-critical system where an eventual catastrophic failure carries inadmissible cost.

The remarkably high effectiveness demonstrated by the gameplay filter across various within-ODD experiments, and even under out-of-ODD conditions, could indicate that this new type of filter does in fact enjoy desirable properties yet to be established. This calls for future theoretical work at the intersection of game-theoretic reinforcement learning and nonlinear systems theory. In parallel, we see an opportunity for application-driven research to leverage the computational scalability and *de facto* robustness of gameplay filters to tackle ongoing challenges in robot learning, for example, safe acquisition of novel skills and rapid detection of shifts in operating conditions, enabling safe runtime adaptation of ODD assumptions.

#### Acknowledgments

This work has been supported in part by the Google Research Scholar Award and the DARPA LINC program. The authors thank Zixu Zhang for his help preparing the Go-2 robot for experiments.

## References

1. A. Kumar, Z. Fu, D. Pathak, and J. Malik. RMA: Rapid motor adaptation for legged robots. In *Proc. Robotics: Science and Systems*, 2021. doi:10.15607/RSS.2021.XVII.011.
2. Z. Zhuang, Z. Fu, J. Wang, C. Atkeson, S. Schwertfeger, C. Finn, and H. Zhao. Robot parkour learning. In *Conf. Robot Learning*, volume 229 of *Proceedings of Machine Learning Research*, pages 73–92, 2023. URL https://proceedings.mlr.press/v229/zhuang23a.html.
3. K.-C. Hsu, A. Z. Ren, D. P. Nguyen, A. Majumdar, and J. F. Fisac. Sim-to-lab-to-real: Safe reinforcement learning with shielding and generalization guarantees. *Artificial Intelligence*, 314:103811, 2023. doi:10.1016/j.artint.2022.103811.
4. G. Margolis, G. Yang, K. Paigwar, T. Chen, and P. Agrawal. Rapid locomotion via reinforcement learning. In *Proc. Robotics: Science and Systems*, 2022. doi:10.15607/RSS.2022.XVIII.022.
5. A. Z. Ren, A. Dixit, A. Bodrova, S. Singh, S. Tu, N. Brown, P. Xu, L. Takayama, F. Xia, J. Varley, Z. Xu, D. Sadigh, A. Zeng, and A. Majumdar. Robots that ask for help: Uncertainty alignment for large language model planners. URL https://openreview.net/forum?id=4ZK8ODNyFXx.
6. S. Bansal, M. Chen, S. Herbert, and C. J. Tomlin. Hamilton-Jacobi reachability: A brief overview and recent advances. In *Proc. IEEE Conf. Decision and Control*, pages 2242–2253, 2017. doi:10.1109/CDC.2017.8263977.
7. M. Bui, M. Lu, R. Hojabr, M. Chen, and A. Shriraman. Real-time Hamilton-Jacobi reachability analysis of autonomous system with an FPGA. In *IEEE/RSJ Int. Conf. Intelligent Robots & Systems*, pages 1666–1673, 2021. doi:10.1109/IROS51168.2021.9636410.
8. R. Mattila, Y. Mo, and R. M. Murray. An iterative abstraction algorithm for reactive correct-by-construction controller synthesis. In *Proc. IEEE Conf. Decision and Control*, pages 6147–6152, 2015. doi:10.1109/CDC.2015.7403186.
9. Q. Nguyen and K. Sreenath. Robust safety-critical control for dynamic robotics. *IEEE Transactions on Automatic Control*, 67(3):1073–1088. doi:10.1109/TAC.2021.3059156.
10. T. G. Molnar, R. K. Cosner, A. W. Singletary, W. Ubellacker, and A. D. Ames. Model-free safety-critical control for robotic systems. *IEEE Robotics and Automation Letters*, 7(2):944–951. doi:10.1109/LRA.2021.3135569.
11. T.-Y. Yang, T. Zhang, L. Luu, S. Ha, J. Tan, and W. Yu. Safe reinforcement learning for legged locomotion. In *IEEE/RSJ Int. Conf. Intelligent Robots & Systems*, pages 2454–2461, 2022. doi:10.1109/IROS47612.2022.9982038.
12. T. He, C. Zhang, W. Xiao, G. He, C. Liu, and G. Shi. Agile but safe: Learning collision-free high-speed legged locomotion. URL http://arxiv.org/abs/2401.17583.
13. K.-C. Hsu, H. Hu, and J. F. Fisac. The safety filter: A unified view of safety-critical control in autonomous systems, 2023. URL https://arxiv.org/abs/2309.05837.
14. K. P. Wabersich, A. J. Taylor, J. J. Choi, K. Sreenath, C. J. Tomlin, A. D. Ames, and M. N. Zeilinger. Data-driven safety filters: Hamilton-Jacobi reachability, control barrier functions, and predictive methods for uncertain systems. *IEEE Control Systems Magazine*, 43(5):137–177, 2023. doi:10.1109/MCS.2023.3291885.
15. K. L. Hobbs, M. L. Mote, M. C. Abate, S. D. Coogan, and E. M. Feron. Runtime assurance for safety-critical systems: An introduction to safety filtering approaches for complex control systems. *IEEE Control Systems Magazine*, 43(2):28–65. doi:10.1109/MCS.2023.3234380.
16. I. Mitchell, A. Bayen, and C. Tomlin. A time-dependent Hamilton-Jacobi formulation of reachable sets for continuous dynamic games. *IEEE Transactions on Automatic Control*, 50(7):947–957, 2005. doi:10.1109/TAC.2005.851439.
17. J. F. Fisac, M. Chen, C. J. Tomlin, and S. S. Sastry. Reach-avoid problems with time-varying dynamics, targets and constraints. In *Hybrid Systems: Computation and Control*, pages 11–20, 2015. doi:10.1145/2728606.2728612.
18. I. M. Mitchell. The flexible, extensible and efficient toolbox of level set methods. *Journal of Scientific Computing*, 35(2):300–329, 2008. doi:10.1007/s10915-007-9174-4.
19. M. Bui, G. Giovanis, M. Chen, and A. Shriraman. OptimizedDP: An efficient, user-friendly library for optimal control and dynamic programming, 2022. URL https://arxiv.org/abs/2204.05520.
20. A. D. Ames, S. Coogan, M. Egerstedt, G. Notomista, K. Sreenath, and P. Tabuada. Control barrier functions: Theory and applications. In *European Control Conference*, pages 3420–3431, 2019. doi:10.23919/ECC.2019.8796030.
21. X. Xu, J. W. Grizzle, P. Tabuada, and A. D. Ames. Correctness guarantees for the composition of lane keeping and adaptive cruise control. *IEEE Transactions on Automation Science and Engineering*, 15(3):1216–1229, 2017.
22. L. Wang, D. Han, and M. Egerstedt. Permissive barrier certificates for safe stabilization using sum-of-squares. In *Proc. American Control Conference*, pages 585–590, 2018. doi:10.23919/ACC.2018.8431617.
23. L. Lindemann, H. Hu, A. Robey, H. Zhang, D. Dimarogonas, S. Tu, and N. Matni. Learning hybrid control barrier functions from data. In *Conf. Robot Learning*, pages 1351–1370, 2021.
24. X. Xu, P. Tabuada, J. W. Grizzle, and A. D. Ames. Robustness of control barrier functions for safety critical control. *IFAC-PapersOnLine*, 48(27):54–61, 2015. doi:10.1016/j.ifacol.2015.11.152.
25. A. Robey, H. Hu, L. Lindemann, H. Zhang, D. V. Dimarogonas, S. Tu, and N. Matni. Learning control barrier functions from expert demonstrations. In *Proc. IEEE Conf. Decision and Control*, pages 3717–3724, 2020. doi:10.1109/CDC42340.2020.9303785.
26. J. J. Choi, D. Lee, K. Sreenath, C. J. Tomlin, and S. L. Herbert. Robust control barrier-value functions for safety-critical control. In *Proc. IEEE Conf. Decision and Control*, pages 6814–6821, 2021. doi:10.1109/CDC45484.2021.9683085.
27. A. Robey, L. Lindemann, S. Tu, and N. Matni. Learning robust hybrid control barrier functions for uncertain systems. *IFAC-PapersOnLine*, 54(5):1–6, 2021.
28. S. Bansal and C. J. Tomlin. DeepReach: A deep learning approach to high-dimensional reachability. In *Proc. IEEE Conf. Robotics and Automation*, pages 1817–1824, 2021. doi:10.1109/ICRA48506.2021.9561949.
29. J. F. Fisac, N. F. Lugovoy, V. Rubies-Royo, S. Ghosh, and C. J. Tomlin. Bridging Hamilton-Jacobi safety analysis and reinforcement learning. In *Proc. IEEE Conf. Robotics and Automation*, pages 8550–8556, 2019. doi:10.1109/ICRA.2019.8794107.
30. H. Bharadhwaj, A. Kumar, N. Rhinehart, S. Levine, F. Shkurti, and A. Garg. Conservative safety critics for exploration. In *Int. Conf. Learning Representations*, 2021. URL https://openreview.net/forum?id=iaO86DUuKi.
31. B. Thananjeyan, A. Balakrishna, S. Nair, M. Luo, K. Srinivasan, M. Hwang, J. E. Gonzalez, J. Ibarz, C. Finn, and K. Goldberg. Recovery RL: Safe reinforcement learning with learned recovery zones. *IEEE Robotics and Automation Letters*, 6(3):4915–4922, 2021. doi:10.1109/LRA.2021.3070252.
32. K.-C. Hsu, V. Rubies-Royo, C. J. Tomlin, and J. F. Fisac. Safety and liveness guarantees through reach-avoid reinforcement learning. In *Proc. Robotics: Science and Systems*, 2021. doi:10.15607/RSS.2021.XVII.077.
33. K. P. Wabersich and M. N. Zeilinger. Linear model predictive safety certification for learning-based control. In *Proc. IEEE Conf. Decision and Control*, pages 7130–7135, 2018. doi:10.1109/CDC.2018.8619829.
34. K. P. Wabersich and M. N. Zeilinger. A predictive safety filter for learning-based control of constrained nonlinear dynamical systems. *Automatica*, 129:109597, 2021. doi:10.1016/j.automatica.2021.109597.
35. O. Bastani and S. Li. Safe reinforcement learning via statistical model predictive shielding. In *Proc. Robotics: Science and Systems*, 2021. doi:10.15607/RSS.2021.XVII.026.
36. A. Leeman, J. Köhler, S. Bennani, and M. Zeilinger. Predictive safety filter using system level synthesis. In *Learning for Dynamics & Control*, volume 211 of *Proceedings of Machine Learning Research*, pages 1180–1192, 2023. URL https://proceedings.mlr.press/v211/leeman23a.html.
37. A. Ramesh Kumar, K.-C. Hsu, P. J. Ramadge, and J. F. Fisac. Fast, smooth, and safe: Implicit control barrier functions through reach-avoid differential dynamic programming. *IEEE Control Systems Letters*, 7:2994–2999, 2023. doi:10.1109/LCSYS.2023.3292132.
38. K.-C. Hsu, D. P. Nguyen, and J. F. Fisac. ISAACS: Iterative soft adversarial actor-critic for safety. In *Learning for Dynamics & Control*, volume 211 of *Proceedings of Machine Learning Research*, 2023. URL https://proceedings.mlr.press/v211/hsu23a.html.
39. H. Hu, M. Fazlyab, M. Morari, and G. J. Pappas. Reach-SDP: Reachability analysis of closed-loop systems with neural network controllers via semidefinite programming. In *Proc. IEEE Conf. Decision and Control*, pages 5929–5934, 2020. doi:10.1109/CDC42340.2020.9304296.
40. E. Luo, N. Kochdumper, and S. Bak. Reachability analysis for linear systems with uncertain parameters using polynomial zonotopes. In *Hybrid Systems: Computation and Control*, pages 1–12, 2023. doi:10.1145/3575870.3587130.
41. T. J. Bird, H. C. Pangborn, N. Jain, and J. P. Koeln. Hybrid zonotopes: A new set representation for reachability analysis of mixed logical dynamical systems. *Automatica*, 154:111107, 2023. doi:10.1016/j.automatica.2023.111107.
42. J. Reher and A. D. Ames. Dynamic walking: Toward agile and efficient bipedal robots. *Annual Review of Control, Robotics, and Autonomous Systems*, 4(1):535–572, 2021. doi:10.1146/annurev-control-071020-045021.
43. H. Lai, W. Zhang, X. He, C. Yu, Z. Tian, Y. Yu, and J. Wang. Sim-to-real transfer for quadrupedal locomotion via terrain transformer. In *Proc. IEEE Conf. Robotics and Automation*, pages 5141–5147, 2023. doi:10.1109/ICRA48891.2023.10160497.
44. L. Campanaro, S. Gangapurwala, W. Merkt, and I. Havoutis. Learning and deploying robust locomotion policies with minimal dynamics randomization. URL http://arxiv.org/abs/2209.12878.
45. Q. Nguyen and K. Sreenath. Optimal robust time-varying safety-critical control with application to dynamic walking on moving stepping stones, 2016.
46. Z. Gu, Y. Zhao, Y. Chen, R. Guo, J. K. Leestma, G. S. Sawicki, and Y. Zhao. Robust-locomotion-by-logic: Perturbation-resilient bipedal locomotion via signal temporal logic guided model predictive control. URL http://arxiv.org/abs/2403.15993.
47. SAE On-Road Automated Driving Committee. Taxonomy and definitions for terms related to driving automation systems for on-road motor vehicles. URL https://www.sae.org/content/j3016_202104.
48. F. Laine, C.-Y. Chiu, and C. Tomlin. Eyes-closed safety kernels: Safety of autonomous systems under loss of observability. In *Proc. Robotics: Science and Systems*, 2020. doi:10.15607/RSS.2020.XVI.096.
49. Z. Zhang and J. F. Fisac. Safe occlusion-aware autonomous driving via game-theoretic active perception. In *Proc. Robotics: Science and Systems*, 2021. doi:10.15607/RSS.2021.XVII.066.
50. H. Hu, Z. Zhang, K. Nakamura, A. Bajcsy, and J. F. Fisac. Deception game: Closing the safety-learning loop in interactive robot autonomy. URL https://openreview.net/forum?id=0o2JgvlzMUc.
51. T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In *Int. Conf. Machine Learning*, volume 80 of *Proceedings of Machine Learning Research*, pages 1861–1870, 2018. URL https://proceedings.mlr.press/v80/haarnoja18b.html.
52. E. Coumans and Y. Bai. PyBullet, a Python module for physics simulation for games, robotics and machine learning. http://pybullet.org, 2016–2021.


## Appendix A Frequently Asked Questions

We discuss some design choices and implications of our method using an informal FAQ format.

Why choose worst-case safety and not probabilistic analysis? Although not as established in the robot learning community, robust/worst-case formulations are widely used across engineering. Their key advantage is that they can enforce systematic handling of all scenarios in a well-defined class, even if some of them are highly unlikely—e.g., the robot must withstand all (rather than most) external forces of up to 50 N, even the unlucky push that happens to maximally disturb its stance. This is consistent with much of the safety analysis found in bridges, elevators, automobiles, aircraft, and other safety-critical engineering systems, in great part because it facilitates a clear-cut social contract between their designers and the broader public. For example, we do not certify elevators for 95% of loads up to 300 kg or bridges for 99% of earthquakes up to magnitude 8, but rather all such loads and earthquakes, and we treat any loss of safety within the specified bounds as a serious failure to comply with the promise made to society. As robots and autonomous systems become more widely deployed, we argue that their safety should be certified and held to similar standards, at least in truly safety-critical settings where people could otherwise get hurt.

Isn’t worst-case safety too conservative to be useful? Actually, this is a common misconception. Robust/worst-case assessments are not intrinsically more or less conservative than probabilistic ones: this depends entirely on what set and distribution we choose to run these assessments against. The term “worst-case” doesn’t mean a system must preserve safety in the worst conceivable scenario (whatever that means), but rather under all conditions—including the worst one—in a specified set. Worst-case safety lets designers and regulators draw this line (the ODD boundary), and it ensures that the system then maintains safety across all certified (in-ODD) conditions. If your robot’s behavior is “too conservative,” this means it’s guarding against eventualities you don’t really care about: just exclude them from your ODD. But if you *do* want safety under these conditions, then your robot is not actually too conservative: it’s doing what it should. With the gameplay filter, you are never left wondering: each time it overrides the task policy, it logs the specific future it’s preempting. Then only one question remains: did you or did you not want your robot to avoid that hypothetical crash? Worst-case safety is extremely powerful, and it lets you control exactly what situations your robot is required to handle. You just need to be ready to answer some hard what-if questions.

What does it mean for the proposed gameplay filter to approximate a perfect filter? If we had the exact solution to the Isaacs reach–avoid equation (5), our gameplay rollouts would be necessary and sufficient for safely reaching ${\mathcal{T}}$ in ${H}$ (or fewer) steps. Since ${\mathcal{T}}$ is typically chosen to be a broad, naturally reachable class of robot states (e.g., coming to a stable stance for a walking robot or pulling over for an autonomous vehicle), safely reaching ${\mathcal{T}}$ within a long enough horizon ${H}$ is possible as long as remaining safe is possible in the first place. In other words, the sufficient reach–avoid condition becomes a tight approximation of the all-time safety condition. We can observe this phenomenon in Fig. 2, where the reach–avoid filter’s overstepping vanishes with long ${H}$.

Why is computing a gameplay rollout better than just querying the learned reach–avoid critic? In theory, the critic should make fairly accurate predictions of game outcomes after training. In practice, we have found that it’s often unreliable and/or overly conservative. A key advantage of the gameplay rollout is that the uncertainty linked to the learning-based safety analysis is much more structured: the robot’s future safety fallback is perfectly predicted (since it will be implemented as-is), and the dynamics can be reliably simulated given the players’ actions, so all uncertainty falls on the learned disturbance. One very useful implication of this structure is that, even if the disturbance is suboptimally adversarial, a predicted gameplay rollout ending in a safety failure constitutes a valid certificate (i.e., a proof) that there exists an ODD realization in which the robot will violate safety if the filter does not intervene immediately. That is, we know the gameplay safety monitor isn’t falsely crying wolf—we can’t prove anything like that about the black-box neural safety critic’s predictions.

Why is reach–avoid preferable if it’s more conservative than avoid-only? This is an important aspect of predictive safety filtering and relates to a deeper tenet in safety engineering philosophy: whatever the safety boundary is (i.e., a strategy that is “just safe enough”), it is preferable to approach it from the safe side than from the unsafe side. In practice, we don’t know a priori how many prediction steps ${H}$ we need to avoid being blindsided by future failures just beyond the lookahead horizon. When in doubt, it’s preferable to risk being overly conservative than to risk losing safety.

Having a terminal state constraint is common in MPC; how is reach–avoid different? The use of a terminal controlled-invariant set in MPC is well established and ensures recursive feasibility. Our choice of reach–avoid over an avoid-only safety condition is an instance of the same principle. An important difference is that the (also well-established) reach–avoid condition gives our filter extra flexibility by allowing the gameplay trajectory to reach the forever-safe set ${\mathcal{T}}$ at any time within the horizon. This reduces conservativeness and often lets us terminate the gameplay rollout early.

How do you determine ${\mathcal{T}}$? In practice, a suitable ${\mathcal{T}}$ is obtained from domain knowledge, offline computation, pre-deployment learning, or some combination, often in the form of a stability basin (region of attraction) around a desirable class of equilibrium points sufficiently away from failure. For example, most robots can be robustly stabilized around static or steady cruising configurations by comparatively simple linear feedback controllers (e.g., most modern walking robots ship with built-in controllers that can stabilize them around a default stance). Larger all-time safe regions may be found by (robust) Lyapunov analysis or even optimized through control Lyapunov functions.

What are the implications of the choice of ${\mathcal{T}}$? Broadly speaking, the larger the ${\mathcal{T}}$ we can characterize offline, the easier the job of the gameplay filter at runtime, and, potentially, the fewer steps we’ll need to reach it from more dynamic configurations. In the extreme case, we could be remarkably lucky and find ${\mathcal{T}}={\Omega}^{*}$, in which case the gameplay filter’s job is made much easier, since all candidate actions that are safe will keep the state in ${\mathcal{T}}$, immediately terminating the rollout check. Conversely, all actions that leave ${\mathcal{T}}$ are unsafe, and the gameplay rollout will not be able to return to ${\mathcal{T}}$. To avoid initializing the gameplay filter from a no-win scenario, designers should ensure that ${\mathcal{T}}$ contains the range of expected robot deployment conditions (${\mathcal{X}}_{0}$) in the ODD.

Why aren’t you using onboard cameras or lidar? Our empirical focus in this paper is on demonstrating automatically synthesized safety filters that account for the full-order (36-D) walking dynamics of quadruped robots. We think the simplest and clearest demonstration of this concept is to have the filter consider only the robot’s own state (proprioception) without accounting for the environment, obstacles, etc. (exoception). That said, incorporating information about the robot’s surroundings can be extremely valuable—and often critical—to safety. We are very excited by the scalability and generality that new safety approaches like the one we present in this paper seem to enjoy, and we expect they will soon unlock full-order safety filters that incorporate rich exoceptive information in real time, whether straight from raw sensor data or through intermediate representations provided by the perception and localization stack.

## Appendix B Implementation Details

State and action spaces. For the scope of this paper, we aim to construct a *proprioceptive* safety filter that relies on onboard estimation of the robot’s kinematic state but uses *no exoceptive* information (from camera, lidar, etc.) about the surrounding environment. (Ranged perception can improve the robustness of walking controllers by sensing terrain geometry and texture, and it is strictly needed for ODDs including unmapped or moving obstacles. Full-order legged robot safety filters combining proprioception and exoception have significant potential and are ripe for investigation.) We encode the quadrupedal robots’ state and action vectors as follows:

$$
\begin{aligned}
{x} &:= \left[{p}_{x},{p}_{y},{p}_{z},{v}_{x},{v}_{y},{v}_{z},{\theta}_{x},{\theta}_{y},{\theta}_{z},{\omega}_{x},{\omega}_{y},{\omega}_{z},\{{\theta}^{i}_{\text{J}}\},\{{\omega}^{i}_{\text{J}}\}\right],\\
{u} &:= \left[\{\delta{\theta}^{i}_{\text{J}}\}\right],
\end{aligned}
$$

with ${p}_{x},{p}_{y},{p}_{z}$ the position of the body frame with respect to a fixed reference (“world”) frame; ${v}_{x},{v}_{y},{v}_{z}$ the velocity of the robot’s torso expressed in (forward–left–up) body coordinates; ${\theta}_{x},{\theta}_{y},{\theta}_{z}$ the roll, pitch, and yaw angles of the robot’s body frame with respect to the world frame (for the purposes of this demonstration, we find that an Euler angle representation of body attitude performs adequately and makes the failure set straightforward to encode; in general, a quaternion-based representation may be preferable, avoiding the risk of computational issues in the neighborhood of singularities at ${\theta}_{y}=\pm\frac{\pi}{2}$); ${\omega}_{x},{\omega}_{y},{\omega}_{z}$ the body frame’s axial rotational rates; and ${\theta}^{i}_{\text{J}},{\omega}^{i}_{\text{J}},\delta{\theta}^{i}_{\text{J}}$ the angle, angular velocity, and commanded angular increment of the robot’s $i^{\text{th}}$ joint.

The above constitutes a full-order state representation of the robot’s idealized Lagrangian mechanics. A total of 18 generalized coordinates encode the 6 degrees of freedom of the torso’s rigid-body pose in addition to the configuration of 3 rotational joints (hip abduction, hip flexion, and knee flexion) for each of the 4 legs; the robot’s rate of motion is expressed through 18 corresponding generalized velocities, for a total of 36 continuous state variables. We discuss discrete contact variables below.

The robot’s control authority is achieved by independently modulating the torque applied on each of its 12 rotational joints by an electric motor; in modern legged platforms, these motors typically have dedicated low-level controllers, so our control policy sends a tracking reference to each motor controller rather than directly commanding a torque.

Finally, the disturbance is modeled as an external force that can act on any point of the robot’s torso and in any direction of Euclidean space, with a bounded modulus. The specified range of admissible disturbance forces is discussed below.

Black-box simulator(s). The dynamical model is implemented by the off-the-shelf PyBullet physics engine [52] using the standardized robot description files made available by the manufacturers of each platform. Our method treats the simulator as a black-box environment for both training and runtime safety filtering, allowing the engine and/or robot model to be easily swapped out. The generality and modularity of our approach is perhaps best illustrated by the fact that we synthesized and deployed the safety filter for the Go-2 robot using identical hyperparameter values as for the S40 robot. Our only modification, other than replacing the robot model in the physics engine, was to append 4 state components to each neural network’s input space to account for foot contact information; we note that even this straightforward addition is entirely optional, since we could have alternatively constructed a safety filter that simply disregarded the extra sensor data.

Actor and critic networks.The learned control and disturbance actors, as well as the safety critics, are independent of the robot’s absolute position ${{p}_{x}},{{p}_{y}},{{p}_{z}}$ and heading angle ${{\theta}_{z}}$; of these, only distance to the ground has an effect on the dynamics, but since it is hard to observe without vision, we do not make it available.In the case of the Go-2 quadruped (but not the S40), the policies additionally depend on the discrete contact state, encoded as a Boolean (true/false) indicator for each foot.In simulation, each neural network policy receives as input the ground-truth state of the robot in the simulator; in hardware experiments, they instead receive a state estimate computed by the robot’s on-board perception stack.Each policy is implemented by a fully-connected feedforward neural network with 3 hidden layers of 256 neurons, and critics have 3 hidden layers with 128 neurons.
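For concreteness, a minimal NumPy sketch of networks with these dimensions follows. The hidden-layer sizes (3 layers of 256 for actors, 3 layers of 128 for critics) and the 4 appended contact flags come from the text; the ReLU activations, tanh output squashing, and the exact input composition of the critic (state, control, and disturbance) are our assumptions for illustration.

```python
import numpy as np

def mlp(sizes, rng):
    """He-initialized fully connected network as a list of (W, b) pairs."""
    return [(rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in),
             np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(params, x, squash=False):
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)         # ReLU hidden activations (assumed)
    return np.tanh(x) if squash else x     # bound actor outputs to [-1, 1]

rng = np.random.default_rng(0)
STATE_DIM = 36 + 4      # full-order state plus 4 foot-contact flags (Go-2 only)
U_DIM, D_DIM = 12, 6    # 12 joint-angle increments; 6-D disturbance vector

actor  = mlp([STATE_DIM, 256, 256, 256, U_DIM], rng)
critic = mlp([STATE_DIM + U_DIM + D_DIM, 128, 128, 128, 1], rng)  # assumed inputs

u = forward(actor, np.zeros(STATE_DIM), squash=True)
q = forward(critic, np.zeros(STATE_DIM + U_DIM + D_DIM))
```

In deployment, the position and heading components would be dropped or zeroed before the forward pass, per the invariances described above.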

Safety specification. We are interested in preventing *falls*, understood as any part of the robot other than its feet making contact with the ground. To encode the failure set of all such falls with a simple margin function, we define a small number of critical points ${p}_{c}$, including the 8 corners of a (tight) 3-D bounding box around the robot’s torso as well as its four knee joints. The failure margin is

$$
{g}({x})=\min\left\{\min_{i}\{z_{\text{corner}}^{i}\}-\bar{z}_{\text{corner},{g}},\;\min_{i}\{z_{\text{knee}}^{i}\}-\bar{z}_{\text{knee}}\right\},
$$

with $z_{\text{corner}}^{i}$ the vertical distance to the ground of the $i^{\text{th}}$ robot body corner point and $z_{\text{knee}}^{i}$ the vertical distance to the ground of the $i^{\text{th}}$ robot knee point. The target (all-time safe) set is defined as a narrow neighborhood of a static stance with all four feet on the ground and a sufficiently lowered torso, chosen so that the robot is robustly stable with a simple stance controller. The target margin is

$$
{\ell}({x})=\min\Big\{\,\bar{{\omega}}-|{\omega}_{x}|,\,\bar{{\omega}}-|{\omega}_{y}|,\,\bar{{\omega}}-|{\omega}_{z}|,\,\bar{{v}}-|{v}_{x}|,\,\bar{{v}}-|{v}_{y}|,\,\bar{{v}}-|{v}_{z}|,\,\bar{z}_{\text{corner},{\ell}}-\max_{i}\{z_{\text{corner}}^{i}\},\,\bar{z}_{\text{foot}}-\max_{i}\{z_{\text{foot}}^{i}\}\Big\},
$$

with $z_{\text{foot}}^{i}$ the vertical elevation of the $i^{\text{th}}$ robot foot relative to the ground. The threshold values we used for our failure and target set specification are as follows:

$$
\bar{z}_{\text{corner},{g}}=0.1~\text{m},\quad
\bar{z}_{\text{knee}}=0.05~\text{m},\quad
\bar{z}_{\text{corner},{\ell}}=0.4~\text{m},\quad
\bar{z}_{\text{foot}}=0.05~\text{m},\quad
\bar{{\omega}}=10\degree/\text{s},\quad
\bar{{v}}=0.2~\text{m}/\text{s}.
$$
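Assuming the per-point ground clearances are available from the robot's kinematics, the two margin functions and the thresholds above translate directly into code; the point lists passed in are hypothetical inputs computed elsewhere by the model.

```python
import math

# Thresholds from the safety specification (angular rate converted to rad/s).
Z_CORNER_G, Z_KNEE = 0.10, 0.05      # failure: torso-corner / knee clearance [m]
Z_CORNER_L, Z_FOOT = 0.40, 0.05      # target: lowered torso, feet on ground [m]
W_MAX = math.radians(10.0)           # target: body rates below 10 deg/s
V_MAX = 0.2                          # target: body speed below 0.2 m/s

def failure_margin(z_corner, z_knee):
    """g(x) > 0 away from falls; g(x) <= 0 once a corner or knee gets too low."""
    return min(min(z_corner) - Z_CORNER_G, min(z_knee) - Z_KNEE)

def target_margin(omega, v, z_corner, z_foot):
    """l(x) > 0 inside the static-stance target set T."""
    return min(
        *(W_MAX - abs(w) for w in omega),
        *(V_MAX - abs(vi) for vi in v),
        Z_CORNER_L - max(z_corner),
        Z_FOOT - max(z_foot),
    )
```

The sign conventions match the reach–avoid formulation: a rollout fails when `failure_margin` becomes non-positive and terminates safely once `target_margin` becomes positive.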

Uncertainty specification.To account for uncertainty in the deployment conditions as well as general modeling error (or sim-to-real gap), our operational design domain (ODD) includes an external force that may push or pull any point on the robot’s torso in any direction with a maximum magnitude of $50~{}\text{N}$:

$\displaystyle{d}$ | $\displaystyle=\left[{{F}_{x}},{{F}_{y}},{{F}_{z}},{{p}^{F}_{x}},{{p}^{F}_{y}},%{{p}^{F}_{z}}\right]\,,$ | (8) |

where ${F}=[{F}_{x},{F}_{y},{F}_{z}]$ is the force vector, applied at the point $({p}^{F}_{x},{p}^{F}_{y},{p}^{F}_{z})$ on the torso expressed in body coordinates, with ${p}^{F}_{x},{p}^{F}_{y}\in[-0.1,0.1]~\text{m}$ and ${p}^{F}_{z}\in[0,0.05]~\text{m}$. The red arrows in the imagined gameplay of Figure 1 show examples of the learned adversarial disturbance.
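Within-ODD disturbances of this form are easy to generate. The sketch below draws a random force (direction uniform on the sphere, magnitude up to $50~\text{N}$) and an application point within the stated body-coordinate bounds; the `sample_disturbance` helper is illustrative and stands in for a uniform sampler over the ODD, not for the learned adversary, which instead selects $d$ to maximize harm:

```python
import numpy as np

F_MAX = 50.0  # N, maximum external force magnitude in the ODD

def sample_disturbance(rng):
    """Sample a uniform within-ODD disturbance d = [F, p^F] as in Eq. (8)."""
    # Force: direction uniform on the unit sphere, magnitude in [0, F_MAX].
    direction = rng.normal(size=3)
    direction /= np.linalg.norm(direction)
    force = rng.uniform(0.0, F_MAX) * direction
    # Application point on the torso, in body coordinates (meters).
    point = np.array([rng.uniform(-0.1, 0.1),
                      rng.uniform(-0.1, 0.1),
                      rng.uniform(0.0, 0.05)])
    return np.concatenate([force, point])
```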

Gameplay filter runtime implementation. To easily deploy our gameplay safety filter across two different robots for the physical experiments, we encapsulate its computation in a ROS service running on an offboard computer. Each robot’s onboard process calls this service wirelessly (approximately 3.5 times per second), passing its current state estimate and proposed course of action. The offboard server then simulates the gameplay for a fixed horizon (in our case, $H=300$ steps, or $3~\text{s}$) and returns a Boolean indicating which policy to use for the next $L$ time steps. Our choice of $L=10$ accounts for the wireless round trip, which makes up a significant fraction (approximately $70\%$) of the total latency.
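The service-side decision logic can be sketched as follows. This is an illustrative reconstruction, assuming the filter approves the proposed course of action only if the simulated gameplay (with the adversary injecting worst-case disturbances) reaches the target set without ever entering the failure set; the `sim`, policy, and adversary interfaces are hypothetical:

```python
H = 300  # gameplay horizon (3 s of simulated time)
L = 10   # commitment interval covering the wireless round trip

def gameplay_filter(sim, x, task_policy, safety_policy, adversary):
    """Return True to approve the task policy for the next L steps,
    False to fall back to the safety policy."""
    sim.reset_to(x)
    for t in range(H):
        # Commit to the proposed task action for the first L steps,
        # then hand control to the safety strategy.
        policy = task_policy if t < L else safety_policy
        u = policy(sim.state())
        d = adversary(sim.state())   # learned worst-case disturbance
        sim.step(u, d)
        if sim.in_failure_set():
            return False             # hypothetical match lost: reject
        if t >= L and sim.in_target_set():
            return True              # reached the all-time safe set
    return False                     # inconclusive rollout: be conservative
```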

We note that the computational resources used for the offboard computation are comparable to those available on current mobile robot platforms. In particular, the entire simulation and filter logic ran on a single core of an Intel i7-1185G7 processor at 3 GHz. For comparison, the Go-2 is equipped with a second onboard computer (not used in our experiments) with a 6-core NVIDIA Jetson Orin Nano processor (8 GB) at 1.5 GHz. We estimate that the total latency of the gameplay filter run fully onboard the Go-2 with the same simulator would be roughly comparable, and possibly lower given the absence of a wireless round trip.

## Appendix C Extended Evaluation

To further demonstrate the strengths of our approach and shed light on its superior scalability to complex robot dynamics, we compare the gameplay performance of the self-play–trained controller and disturbance policies *as training proceeds*. The results in Figs. 3 and 4 suggest that the dense temporal difference signal in reach–avoid games plays a determining role in enabling data-efficient learning, while previously proposed safety methods that use reward-based RL with a (sparse) failure indicator consistently require more training episodes before starting to learn meaningfully robust safe control strategies.

## Appendix D Detailed Tugging Force Plots

We provide time plots for all runs of the tug test experiment on the S40 robot (summarized in Table 3), displaying the magnitude of the tugging force over the course of each trial. We present all 10 runs for each of the three evaluated control schemes: the gameplay filter ${{\phi}^{\text{game}}}$, the critic (value-based) filter ${{\phi}^{\text{critic}}}$, and the unfiltered task policy ${\pi}$. Each run is annotated to show individual attacks, defined as sequences of significant tug forces ($\geq 10~\text{N}$) applied continually or in close succession (less than $1~\text{s}$ of interruption within an attack). Conversely, distinct attacks are separated by at least $1~\text{s}$, ensuring that the effects of the previous attack have died off before the next one begins.

Examining individual attacks within each run provides more fine-grained insight into the performance of each control scheme under various disturbances (both within-ODD and out-of-ODD). Importantly, it allows us to attribute a safety failure to the attack that immediately preceded it in a given run, while marking all earlier attacks in the same run as safely handled.
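The attack-segmentation rule above can be made concrete with a short sketch (names illustrative): force samples of at least $10~\text{N}$ are grouped into one attack when separated by less than $1~\text{s}$ of quiet, and a gap of $1~\text{s}$ or more starts a new attack.

```python
def segment_attacks(t, force, thresh=10.0, gap=1.0):
    """Group significant-force samples into distinct attacks.

    t, force: parallel sequences of timestamps (s) and force magnitudes (N).
    Returns a list of (start_time, end_time) tuples, one per attack.
    """
    attacks, last = [], None
    for ti, fi in zip(t, force):
        if fi >= thresh:
            if last is None or ti - last >= gap:
                attacks.append([ti, ti])   # quiet gap >= 1 s: new attack
            else:
                attacks[-1][1] = ti        # still within the same attack
            last = ti
    return [tuple(a) for a in attacks]
```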