Gameplay Filters: Robust Zero-Shot Safety through Adversarial Imagination (2024)

Duy P. Nguyen¹, Kai-Chieh Hsu*¹, Wenhao Yu², Jie Tan², Jaime Fernández Fisac¹
¹Princeton University, United States   ²Google DeepMind, United States
{duyn,kaichieh,jfisac}@princeton.edu, {magicmelon,jietan}@google.com

Abstract

Despite the impressive recent advances in learning-based robot control, ensuring robustness to out-of-distribution conditions remains an open challenge. Safety filters can, in principle, keep arbitrary control policies from incurring catastrophic failures by overriding unsafe actions, but existing solutions for complex (e.g., legged) robot dynamics do not span the full motion envelope and instead rely on local, reduced-order models. These filters tend to overly restrict agility and can still fail when perturbed away from nominal conditions. This paper presents the gameplay filter, a new class of predictive safety filter that continually plays out hypothetical matches between its simulation-trained safety strategy and a virtual adversary co-trained to invoke worst-case events and sim-to-real error, and precludes actions that would cause it to fail down the line. We demonstrate the scalability and robustness of the approach with a first-of-its-kind full-order safety filter for (36-D) quadrupedal dynamics. Physical experiments on two different quadruped platforms demonstrate the superior zero-shot effectiveness of the gameplay filter under large perturbations such as tugging and unmodeled terrain.

Keywords: Robust Safety, Adversarial Reinforcement Learning, Game Theory

1 Introduction

Autonomous robots are increasingly required to operate reliably in uncertain conditions and quickly adapt to carry out a broad range of jobs on the fly [1, 2, 3, 4, 5]. Rather than synthesize an intrinsically safe control policy for every new assigned task, it is efficient to endow each robot with a safety filter that automatically precludes unsafe actions, relieving task policies of the burden of safety altogether.

Unfortunately, today’s safety filter methods fall short of this promise for most modern-day robots. To cover a diverse range of tasks and environments, a safety filter needs to give the robot significant freedom to execute varied motions across its state space while robustly protecting it from catastrophic failures throughout this large envelope. To date, such minimally restrictive safety filters are only systematically computable for systems with 5–6 state variables [6, 7, 8], woefully short of the 12 needed to accurately model drone flight and the 30–50 needed for legged locomotion. Existing safety filters for high-order robot dynamics rely on reduced-order models [9, 10, 11, 12]. These filters restrict the robot’s motion to a local envelope, such as the vicinity of a stable walking gait, and become ineffective whenever the robot is perturbed away from it by external forces or unmodeled environment features (see front figure). How can we tractably and systematically compute safety filters that cover broad regions of robots’ high-dimensional state spaces and a wide variety of deployment conditions?

Contribution. This paper introduces the gameplay filter, a novel type of predictive safety filter that can scale to full-order robot dynamics and enforce safety across a broad motion envelope and a designer-specified range of possible conditions (operational design domain). The filter is first synthesized by simulated self-play between a safety-seeking robot control policy and a virtual adversary that invokes worst-case realizations of uncertainty and modeling error (or sim-to-real gap). At runtime, the deployed filter continually rolls out hypothetical games between the two learned agents, overriding candidate actions that would result in the robot losing a future safety game. This methodology—based on the core game-theoretic principle that a strategy that wins against the worst-case opponent must also win against all others—unlocks real-time filtering in the robot’s full state space by only requiring a single, highly informative trajectory rollout. We demonstrate the effectiveness of our approach experimentally on two quadruped robots that differ in physical parameters and sensing capabilities (see front figure). Each gameplay filter is synthesized and deployed using an off-the-shelf physics engine to simulate a manufacturer-provided robot model with a 36-D state space and a 12-D control space. We observe highly robust zero-shot safety-preserving behavior without incurring the conservativeness typical of robust predictive filters. To the best of our knowledge, this constitutes the first successful demonstration of a full-order safety filter on legged robot platforms.

Related Work. The last decade has seen important advances in robot safety filters. We briefly discuss the techniques most relevant to our work and direct interested readers to recent survey efforts [13, 14, 15] that shed light on safety filters’ common structure and relative strengths.

Value-based filters. Hamilton–Jacobi (HJ) reachability methods use finite-difference dynamic programming to compute the best available safety fallback policy and the worst possible uncertainty realization from each state on a finite grid [16, 17, 6], which enables minimally restrictive safety filters. Although highly general, HJ computational tools suffer exponential blowup and do not scale beyond 5–6 state dimensions [18, 19]. Control barrier function (CBF) filters keep the system inside a smaller safe set while discouraging excessive control overrides [20]. CBFs lack a general constructive procedure and instead rely on manual design [21], sum-of-squares synthesis [22], or learning from demonstrations [23]. Robust formulations are comparatively less mature [24, 25, 26, 27]. Self-supervised and reinforcement learning techniques can synthesize safety-oriented control policies and value functions (“safety critics”) for systems beyond the reach of classical methods, but they are inherently approximate and offer no formal assurances [28, 29, 30, 31, 32]. Statistical generalization theory may be used to bound the probability of failure under the assumption that the robot can be tested on a statistically representative sample of environments and conditions before deployment [3].

Rollout-based filters. Predictive safety filters perform model-based runtime assurance by continually simulating—and in some cases optimizing—the robot’s future safety efforts for a short lookahead time horizon [33, 34, 35, 36, 37, 38]. Recent advances in fast forward-reachable set over-approximation [39, 40, 41] make it possible to check safety against all possible uncertainty realizations, although this runtime robustness comes at the cost of significant added conservativeness: for example, Hsu et al. [38] observe safety overrides five times as frequent as those of a least-restrictive HJ filter. Bastani and Li [35] instead propose sampling multiple possible trajectories, assuming a well-characterized disturbance distribution, to maintain a statistical guarantee. Our approach mimics Hsu et al. [38] in co-training a safety controller and a worst-case disturbance through simulated self-play, but it eschews over-conservative reachable sets by instead simulating a single closed-loop match between the two.

Legged robot safety filters. Legged robots have attracted increasing interest from researchers due to their versatility and increasing availability, as well as their challenging high-order and contact-rich dynamics [42]. Recent simulation-trained controllers leveraging domain randomization are showing promising agility and adaptability [1, 43, 2, 44]; however, robustness to out-of-distribution conditions cannot be easily quantified and remains an open issue. Unfortunately, all safety filters demonstrated on legged robots to date are based on simplified reduced-order dynamical models [10, 11, 3, 12], sometimes combined with local analysis around nominal walking gaits [45, 9, 46]. The dynamic envelope protected by these safety filters is limited to local state-space regions where the simplified models apply, and their robustness to disturbances and modeling errors is contingent on the effectiveness of low-level tracking controllers. Our demonstration of the gameplay filter uses a full-order dynamical model of the robot, both at synthesis and at deployment, which enables it to enforce safety across a broad range of motions and operating conditions.

2 Preliminaries: Robust Robot Safety in an Operational Design Domain

We wish to ensure the safe operation of a robot with potentially high-order nonlinear dynamics under a wide range of environments and task specifications, which may be unknown at design time. Formally, we consider a robotic system with uncertain discrete-time dynamics

    x_{k+1} = f(x_k, u_k, d_k),    (1)

where, at each time step $k \in \mathbb{N}$, $x_k \in \mathcal{X} \subseteq \mathbb{R}^{n_x}$ is the state of the system, $u_k \in \mathcal{U} \subset \mathbb{R}^{n_u}$ is the bounded control input (typically from a control policy $\pi^u \in \Pi^u \colon \mathcal{X} \to \mathcal{U}$), and $d_k \in \mathcal{D} \subset \mathbb{R}^{n_d}$ is a disturbance input, unknown a priori but bounded by the compact set $\mathcal{D}$. While the control bound $\mathcal{U}$ encodes actuator limits, the disturbance bound $\mathcal{D}$ is a key part of the operational design domain (ODD).

Operational Design Domain. The ODD can be viewed as a social contract between the system operator and the public, delineating the set of conditions under which the robotic system is required to function correctly and safely [47]. In this paper, we are interested in robust safety, where the disturbance (or “domain”) bound $\mathcal{D}$ may encode a range of potential perturbations like wind or contact forces, environmental parameters like terrain friction, manufacturing tolerances, variations in actuator performance and state estimation accuracy, and other factors contributing to designer uncertainty about future deployment conditions and modeling error. The ODD further specifies a deployment set $\mathcal{X}_0 \subset \mathcal{X}$ of allowable initial states (for example, the robot is always turned on while static on flat ground) and, crucially, a failure set $\mathcal{F} \subset \mathcal{X}$, which characterizes all configurations that the system state must never reach, such as falls or collisions. The required safety property can then be succinctly expressed as:

    \forall x_0 \in \mathcal{X}_0,\ \forall k \geq 0,\ \forall d_0, \dots, d_k \in \mathcal{D}: \quad x_k \notin \mathcal{F},    (2)

that is, once deployed in an admissible initial state, the robot must stay clear of the failure set for any realization of the domain uncertainty.
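To make the ODD ingredients concrete, the following is a minimal Python sketch (our own illustration, not part of the paper's implementation) that encodes an ODD as a disturbance bound $\mathcal{D}$, a deployment-set test, and a failure-set test; all names and thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class OperationalDesignDomain:
    """Illustrative ODD container: disturbance bound D, deployment set X0, failure set F."""
    d_low: np.ndarray                                   # element-wise lower bound of D
    d_high: np.ndarray                                  # element-wise upper bound of D
    in_deployment_set: Callable[[np.ndarray], bool]     # is x an allowable initial state?
    in_failure_set: Callable[[np.ndarray], bool]        # is x a configuration to avoid?

    def clip_disturbance(self, d: np.ndarray) -> np.ndarray:
        """Project a disturbance sample onto the compact bound D."""
        return np.clip(d, self.d_low, self.d_high)

# Hypothetical example inspired by Section 4: torso pushes of up to 50 N,
# failure whenever the torso drops below a (made-up) height threshold.
odd = OperationalDesignDomain(
    d_low=np.full(6, -50.0),
    d_high=np.full(6, 50.0),
    in_deployment_set=lambda x: bool(x[2] > 0.15),
    in_failure_set=lambda x: bool(x[2] < 0.05),
)
```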

Safety Filter. Explicitly ensuring the safety property in the synthesis of every robot task policy $\pi^{\text{task}}$ can be impractically cumbersome, especially for increasingly general-purpose robotic systems with broad ODDs. Instead, we aim to relieve task policies of the burden of safety by augmenting them with a safety filter $\phi$ that depends on the robot’s ODD but not on the task specification. Rather than directly applying the proposed task action $u_k = \pi^{\text{task}}(x_k)$ from each state $x_k$, the robot executes¹

    u_k = \phi(x_k, \pi^{\text{task}}).    (3)

¹ For the scope of this paper, we assume that the robot maintains an appropriately accurate estimate of its dynamical state through onboard perception. We make two observations: First, moderate state estimation errors typical in many robotic systems can be absorbed by inflating the failure set $\mathcal{F}$ and dynamical uncertainty $\mathcal{D}$. Second, more substantial state uncertainty, e.g., induced by sensor faults, occluding objects, or multiagent interaction, may be handled with information-space safety filters, a subject of ongoing research [48, 49, 50].

The safety filter’s role is to prevent the execution of any candidate actions that would jeopardize future safety, while also avoiding spurious interventions that unnecessarily disrupt task progress. In fact, for any well-defined ODD there exists a perfect safety filter that allows every safe candidate action and overrides every unsafe one, robustly enforcing (2) with no overstepping [13, Prop. 1]. Formally, a perfect safety filter only disallows actions that may cause the state to exit the maximal safe set $\Omega^* \subset \mathcal{X}$, the set of all states from which there exists a control policy that can enforce (2). While computing such a perfect filter is known to be intractable for most practical systems [7], we aim to synthesize effective safety filters that allow robots significant freedom to perform a wide range of tasks (including online learning and exploration) while maintaining safety across their ODD. Intuitively, we would like to obtain a safety filter that robustly keeps the robot inside a conservative safe set $\Omega \subseteq \Omega^*$ as close as possible to the theoretical $\Omega^*$. Our proposed method uses game-theoretic reinforcement learning and faster-than-real-time gameplay simulation to approximate a perfect safety filter for any given robot ODD, targeting the robot’s full dynamic envelope, in contrast with existing reduced-order filters, which aim to enforce safety within a significantly smaller set $\Omega$.

Reach–Avoid Safety Game. Whether it is possible for the robot to robustly maintain safety, as in (2), can be seen as the categorical (true/false) outcome of a game of kind between the robot’s controller and an adversarial disturbance that aims to drive it into the failure set. In turn, this result can be encoded implicitly through a game of degree with a continuous outcome (for example, the closest distance that will separate the robot and any obstacle). In particular, for the purposes of predictive safety filtering, we consider a sufficient finite-time condition for all-time safety: it is enough for the robot to reach a known controlled-invariant set $\mathcal{T} \subset \mathcal{F}^c$ (for example, coming to a stable stance) in $H$ steps without previously entering the failure set $\mathcal{F}$. Once there, the robot can switch to a policy $\pi^{\mathcal{T}}$ that keeps it in $\mathcal{T}$ indefinitely. This induces a reach–avoid game [17, 32] with outcome

    J_k^{\pi^u,\pi^d}(x) := \max_{\tau \in [k, H]} \min\Big\{ \ell(x_\tau),\ \min_{s \in [k, \tau]} g(x_s) \Big\}    (4)

where $g$ and $\ell$ are the (Lipschitz) failure and target margins, satisfying $g(x) < 0 \Leftrightarrow x \in \mathcal{F}$ and $\ell(x) \geq 0 \Leftrightarrow x \in \mathcal{T}$. The outcome summarizes the aforementioned condition for all-time safety: for any given $\tau \in [0, H]$, if the state enters the failure set $\mathcal{F}$ at time $\tau$, i.e., $g(x_\tau) < 0$, then $J_k^{\pi^u,\pi^d}(x) < 0$ for all $k \in [0, \tau]$, denoting that past failure overrides future successes. The value function of this game satisfies the reach–avoid Isaacs equation
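As a concrete illustration, the snippet below is a minimal Python sketch (ours, not the paper's code) that evaluates the reach–avoid outcome (4) along a simulated trajectory, given user-supplied margin functions; the toy margins at the end are purely illustrative.

```python
import numpy as np

def reach_avoid_outcome(traj, g, l):
    """Evaluate the reach-avoid outcome (Eq. 4) of a state trajectory x_0, ..., x_H.

    g : failure margin, g(x) < 0 iff x is in the failure set F
    l : target margin,  l(x) >= 0 iff x is in the target set T
    Returns a scalar that is >= 0 iff the trajectory reaches T at some step
    without having previously entered F.
    """
    best = -np.inf
    worst_g_so_far = np.inf
    for x in traj:
        worst_g_so_far = min(worst_g_so_far, g(x))      # min of g over s <= tau
        best = max(best, min(l(x), worst_g_so_far))     # max over tau of the pair
    return best

# Toy 1-D example: target is x >= 1, failure is x < -1.
g = lambda x: x + 1.0
l = lambda x: x - 1.0
print(reach_avoid_outcome([0.0, 0.5, 1.2], g, l) >= 0)    # True: reaches target safely
print(reach_avoid_outcome([0.0, -1.5, 1.2], g, l) >= 0)   # False: fails before reaching
```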

    V_k(x) = \max_u \min_d \, \min\big\{ g(x),\ \max\{ \ell(x),\ V_{k+1}(f(x, u, d)) \} \big\},    (5a)
    V_H(x) = \min\{ \ell(x),\ g(x) \},    (5b)

and the robot’s controller is guaranteed a winning strategy from any state where $V_0(x) \geq 0$.
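For intuition, here is a small self-contained sketch (an assumption-laden toy of ours, not the paper's solver, which instead uses game-theoretic RL for high-dimensional systems) that runs the backward reach–avoid recursion (5) on a coarse 1-D grid:

```python
import numpy as np

# Toy dynamics: x' = x + dt*(u + d), |u| <= 1 (controller), |d| <= 0.3 (adversary).
# Target: |x| <= 0.2 (l >= 0); failure: |x| > 1 (g < 0). Nearest-neighbor interpolation.
xs = np.linspace(-1.2, 1.2, 121)
dt, H = 0.1, 30
g = 1.0 - np.abs(xs)              # failure margin on the grid
l = 0.2 - np.abs(xs)              # target margin on the grid
U, D = (-1.0, 0.0, 1.0), (-0.3, 0.0, 0.3)

V = np.minimum(l, g)              # terminal condition (5b)
for _ in range(H):
    V_next = np.empty_like(V)
    for i, x in enumerate(xs):
        best_u = -np.inf
        for u in U:               # max over controls ...
            worst_d = np.inf
            for d in D:           # ... of min over disturbances (5a)
                j = int(np.argmin(np.abs(xs - (x + dt * (u + d)))))
                worst_d = min(worst_d, V[j])
            best_u = max(best_u, worst_d)
        V_next[i] = min(g[i], max(l[i], best_u))
    V = V_next

print("robustly safe states:", xs[V >= 0])   # states with a winning control strategy
```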

3 Predictive Gameplay Safety Filters

3.1 Offline Gameplay Learning

We extend the Iterative Soft Adversarial Actor–Critic for Safety (ISAACS) scheme [38] to reach–avoid games (4), approximately solving the infinite-horizon counterpart of the Isaacs equation (5).

Simulated Adversarial Safety Games. At every time step of gameplay, we record the transition $(x, u, d, x', \ell', g')$ in the replay buffer $\mathcal{B}$, with $x' := f(x, u, d)$, $\ell' := \ell(x')$, and $g' := g(x')$.

Policy and Critic Network Updates. The core of the proposed offline gameplay learning is to find an approximate solution to the time-discounted infinite-horizon version of Eq. 5. We employ the Soft Actor-Critic (SAC) [51] framework to update the critic and actor networks with the following loss functions.

We update the critic to reduce the deviation from the Isaacs target²:

    L(\omega) := \mathbb{E}_{(x, u, d, x', \ell', g') \sim \mathcal{B}} \big[ ( Q_\omega(x, u, d) - y )^2 \big],
    y = \gamma \min\{ g', \max\{ \ell', Q_{\omega'}(x', u', d') \} \} + (1 - \gamma) \min\{ \ell', g' \},    (6a)

with $u' \sim \pi_\theta(\cdot \mid x')$ and $d' \sim \pi_\psi(\cdot \mid x')$. We update the control and disturbance policies following the policy gradient induced by the critic, with entropy regularization:

    L(\theta) := \mathbb{E}_{(x, d) \sim \mathcal{B}} \big[ -Q_\omega(x, \tilde{u}, d) + \alpha^u \log \pi_\theta(\tilde{u} \mid x) \big],    (6b)
    L(\psi) := \mathbb{E}_{(x, u) \sim \mathcal{B}} \big[ \ Q_\omega(x, u, \tilde{d}) + \alpha^d \log \pi_\psi(\tilde{d} \mid x) \big],    (6c)

where $\tilde{u} \sim \pi_\theta(\cdot \mid x)$, $\tilde{d} \sim \pi_\psi(\cdot \mid x)$, and $\alpha^u, \alpha^d$ are hyperparameters incentivizing exploration (entropy in the stochastic policies), which decay gradually in magnitude through training.

² Deep reinforcement learning typically involves training an auxiliary target critic $Q_{\omega'}$, with parameters $\omega'$ that undergo slow adjustments to align with the critic parameters $\omega$. This stabilizes the regression by keeping the target fixed over a relatively short timeframe.
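For concreteness, a minimal PyTorch-style sketch (our own; the critic, target critic, and actor objects are hypothetical stand-ins) of the discounted reach–avoid critic regression in (6a):

```python
import torch

def critic_loss(batch, critic, target_critic, ctrl_actor, dstb_actor, gamma):
    """Discounted reach-avoid critic regression, as in Eq. (6a).

    batch: tensors (x, u, d, x_next, l_next, g_next) sampled from the replay buffer.
    critic/target_critic/ctrl_actor/dstb_actor are hypothetical nn.Module wrappers.
    """
    x, u, d, x_next, l_next, g_next = batch
    with torch.no_grad():
        u_next = ctrl_actor.sample(x_next)          # u' ~ pi_theta(. | x')
        d_next = dstb_actor.sample(x_next)          # d' ~ pi_psi(. | x')
        q_next = target_critic(x_next, u_next, d_next)
        # Isaacs target: discounted reach-avoid backup plus (1 - gamma) terminal term.
        y = gamma * torch.minimum(g_next, torch.maximum(l_next, q_next)) \
            + (1.0 - gamma) * torch.minimum(l_next, g_next)
    q = critic(x, u, d)
    return torch.mean((q - y) ** 2)
```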

Following the ISAACS scheme, we jointly train the safety critic, controller actor, and disturbance actor through Eq. 6. For better learning stability, the controller actor can be updated at a slower rate (only once every $\tau \geq 1$ disturbance updates), consistent with the asymmetric information structure of the game, and a leaderboard of best-performing controllers and disturbances can be maintained to mitigate mutual overfitting to the latest adversary iteration [38]. A compact sketch of this alternating schedule is given below.
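The sketch is our own illustration with hypothetical agent and buffer interfaces; it is only meant to convey the structure of the training loop, not its exact implementation.

```python
def train_isaacs(env, buffer, critic, ctrl, dstb, steps, tau=2, snapshot_every=10_000):
    """ISAACS-style alternating updates: disturbance every step, controller every tau steps."""
    leaderboard = []                                   # best-performing policy snapshots
    x = env.reset()
    for step in range(steps):
        u, d = ctrl.act(x), dstb.act(x)
        x_next, l_next, g_next = env.step(u, d)        # one step of the simulated safety game
        buffer.add((x, u, d, x_next, l_next, g_next))
        x = env.reset() if g_next < 0 else x_next      # restart the game after a failure

        batch = buffer.sample(256)
        critic.update(batch)                           # Eq. (6a)
        dstb.update(batch, critic)                     # Eq. (6c), every gradient step
        if step % tau == 0:
            ctrl.update(batch, critic)                 # Eq. (6b), slower rate
        if step % snapshot_every == 0:
            leaderboard.append((ctrl.snapshot(), dstb.snapshot()))
    return ctrl, dstb, leaderboard
```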

3.2 Online Gameplay Filter

[Figure 1: The gameplay safety filter, which runs an H-step imagined game every L control cycles and switches between the task policy and the safety fallback policy.]

This section demonstrates how the reach–avoid control actor $\pi_\theta$ and disturbance actor $\pi_\psi$ synthesized offline through game-theoretic RL can be systematically used at runtime to construct highly effective safety filters for general nonlinear, high-dimensional dynamic systems.

The gameplay rollout considers applying the candidate task policy $\pi^{\text{task}}$ followed by the learned fallback policy $\pi^{\text{safe}}$, with the whole rollout under attack by the learned disturbance policy $\pi_\psi$. It is effectively a gameplay between the learned fallback policy $\pi^{\text{safe}}$ and the learned disturbance policy $\pi_\psi$ that checks whether accepting the candidate action from the task policy $\pi^{\text{task}}$ would result in an inevitable failure, even if the robot subsequently makes a best-effort attempt to maintain safety. The reach–avoid outcome defined in Eq. 4 determines the game outcome. A runtime gameplay filter can then be defined with the simple switching rule:

    \phi(x, \pi^{\text{task}}) =
      \begin{cases}
        \pi^{\text{task}}, & \Delta^{\text{safe}}(x, \pi^{\text{task}}) = 1, \\
        \pi^{\text{safe}}, & \Delta^{\text{safe}}(x, \pi^{\text{task}}) = 0,
      \end{cases}
    \qquad
    \Delta^{\text{safe}}(x, \pi^{\text{task}}) := \mathbb{1}\big\{ \exists \tau \in \{1, \dots, H\}: \hat{x}_\tau \in \mathcal{T} \ \land\ \forall s \in \{1, \dots, \tau\}: \hat{x}_s \notin \mathcal{F} \big\},    (7)

with $\hat{x}_0 = x$, $\hat{x}_{\tau+1} = f(\hat{x}_\tau, \hat{u}_\tau, \pi_\psi(\hat{x}_\tau))$ for $\tau \geq 0$, and

    \hat{u}_\tau =
      \begin{cases}
        \pi^{\text{task}}(\hat{x}_\tau), & \tau = 0, \\
        \pi^{\text{safe}}(\hat{x}_\tau), & \tau \in \{1, \dots, H-1\},
      \end{cases}
    \qquad
    \pi^{\text{safe}}(x) =
      \begin{cases}
        \pi_\theta(x), & x \notin \mathcal{T}, \\
        \pi^{\mathcal{T}}(x), & x \in \mathcal{T}.
      \end{cases}

That is, if the gameplay monitor returns “success” (the simulated trajectory safely reaches the target set), the filter selects the task policy; otherwise, it selects the fallback safety policy.
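The switching rule (7) can be summarized in a short Python sketch (ours; the simulator handle, margin tests, and policy objects are hypothetical):

```python
def gameplay_filter(x, task_policy, fallback_policy, dstb_policy, sim, H,
                    in_target, in_failure):
    """One filter decision (Eq. 7): accept the candidate task action only if the imagined
    H-step game, attacked by the learned disturbance policy, reaches the target set T
    without ever entering the failure set F.

    sim(x, u, d) -> x_next is a (hypothetical) handle to the physics simulator;
    fallback_policy plays pi_theta outside T and the stance policy pi_T inside T.
    """
    u_candidate = task_policy(x)
    x_hat, u_hat = x, u_candidate                 # tau = 0 applies the candidate task action
    for _ in range(H):
        x_hat = sim(x_hat, u_hat, dstb_policy(x_hat))
        if in_failure(x_hat):
            return fallback_policy(x)             # imagined game lost: override now
        if in_target(x_hat):
            return u_candidate                    # game won: let the task action through
        u_hat = fallback_policy(x_hat)            # best-effort safety for tau >= 1
    return fallback_policy(x)                     # T not reached within H: override
```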

In practice, the computation of a full gameplay rollout may require multiple time steps (i.e., multiple control policy executions). In that case, the filter in (7) can be extended to a multi-step variant in which filter decisions are made every $L$ steps, appropriately accounting for the latency. Figure 1 illustrates the gameplay safety filter logic with the $L$-step rollout latency.
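One plausible way to organize the $L$-step variant is sketched below; this is our own illustration of the decision schedule, and the paper's exact latency handling may differ. It assumes a hypothetical `gameplay_monitor` that evaluates the indicator $\Delta$ from (7).

```python
def run_with_latency(x0, task_policy, fallback_policy, gameplay_monitor,
                     sim_real, L, num_cycles):
    """Apply one gameplay-filter verdict per L control cycles (rollout latency)."""
    x, use_task = x0, False                    # conservative default before the first verdict
    for k in range(num_cycles):
        if k % L == 0:
            use_task = gameplay_monitor(x)     # Delta in Eq. (7), evaluated every L cycles
        u = task_policy(x) if use_task else fallback_policy(x)
        x = sim_real(x, u)                     # execute on the physical system
    return x
```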

4 Experimental Evaluation

We run hardware experiments and an extensive simulation study, focusing on quadruped robots as an informative platform but stressing that our proposed methodology is general and can be applied to other types of robots. We aim to evaluate the extent to which the synthesized gameplay filters can maintain safety within the ODD specified at training, generalize beyond the ODD, and avoid unnecessarily impeding task execution. We also conduct ablation studies to investigate the importance of reach–avoid reinforcement learning and adversarial self-play in the filter synthesis, and of the gameplay rollout in the filter’s runtime monitoring. Implementation details are in Appendix B.

4.1 Experiment Setup

Robots and simulator. We use a Ghost Robotics Spirit S40 (see front figure) and a Unitree Go-2. Both have built-in IMUs to obtain body angular velocities and linear acceleration, and internal motor encoders to measure joint positions and velocities. The S40 has no foot contact sensing; the Go-2 provides a Boolean contact signal for each foot. Neither robot’s safety filter is given access to visual perception. We use the PyBullet physics engine [52] for both training and runtime gameplay simulation.

Gameplay filter. We set up an offboard gameplay rollout server, a ROS service that receives the current robot state estimate and candidate task policy, runs an $H$-step gameplay rollout, and returns a single policy selection (either task or fallback) for the subsequent $L$ control cycles. Our physical robot experiments use horizon $H = 300$ and latency $L = 10$, with filter decisions running at around 3.5 Hz.

Task. The robot’s task is to move from its initial location to a goal on the other side of the terrain.

Operational design domain. The safety filter is computed for a fairly simple ODD, defined by the nominal robot simulator perturbed by forces of up to 50 N applied anywhere on the robot’s torso; the disturbance adversary acts through a vector $d \in \mathcal{D} \subset \mathbb{R}^6$ encoding what force to apply and where. We intentionally limit the ODD to only consider flat ground. The failure set $\mathcal{F}$ is defined as all fall states, in which any non-foot robot part makes contact with the ground. The deployment set and controlled-invariant set $\mathcal{X}_0 = \mathcal{T}$ are chosen empirically to contain all four-legged stances with a lowered torso, around which the robot is robustly stable with a simple leg position controller $\pi^{\mathcal{T}}$.
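As an illustration of the failure-set test, a minimal PyBullet-style sketch is given below (our own assumption of how the fall check could be implemented; the link indices are hypothetical and the paper's actual implementation may differ):

```python
import pybullet as p

# Hypothetical link indices of the quadruped's four feet in the loaded URDF.
FOOT_LINKS = {3, 7, 11, 15}

def in_failure_set(robot_id, ground_id):
    """Fall-state check for the failure set F: any non-foot link touching the ground."""
    for contact in p.getContactPoints(bodyA=robot_id, bodyB=ground_id):
        link_a = contact[3]                   # linkIndexA of this contact point
        if link_a not in FOOT_LINKS:          # torso, hips, or legs in ground contact
            return True
    return False
```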

Test conditions. We test in two types of conditions: flat terrain with tugging forces (similar to the ODD) and irregular terrain (out-of-ODD). The irregular terrain is a 2 m × 4 m area with a 15-degree incline along one edge and two memory-foam mounds, 5 cm and 15 cm high, positioned 1.8 m from each other. Tugging forces are applied manually through a rope attached to the robot’s torso and to a motion-tracked dynamometer set to provide audiovisual alerts at 80% and 100% of the ODD limit.

Baselines. To evaluate the effectiveness of the reach–avoid learning signal and robust in-simulation learning, we consider the following prior reinforcement learning algorithms: (1) standard SAC [51] with reward defined as +1 inside $\mathcal{T}$, −1 inside $\mathcal{F}$, and 0 everywhere else; (2) single-agent reach–avoid reinforcement learning (RARL) [32]; (3) RARL with domain randomization (DR); and (4) adversarial SAC with the above indicator reward. We also compare to a critic (value-based) filter, which queries the learned $Q_\omega$ for the current state and proposed task action and intervenes if the value is below a threshold; we run a parameter sweep in simulation to tune the threshold value and use it in all experiments.

Policies. All learned policies are neural networks with 3 fully connected layers of 256 neurons; critics have 3 layers of 128 neurons. We handcraft a task policy using an inverse-kinematics gait planner for forward/sideways walking. We use a low-level PD position controller that outputs torques $\tau^i = K_p\,\delta\theta^i_{\text{J}} - K_d\,\omega^i_{\text{J}}$ to the robot motor controller, with $K_p, K_d$ the proportional and derivative gains.
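For reference, a minimal sketch of this joint-space PD law (with purely illustrative gains, not the tuned values used on the robots):

```python
import numpy as np

def pd_torques(q_target, q_meas, qdot_meas, kp=40.0, kd=0.5):
    """Joint-space PD position control: tau_i = Kp * (q_target_i - q_i) - Kd * qdot_i.

    q_target, q_meas, qdot_meas: 12-D vectors of desired joint angles, measured joint
    angles, and measured joint velocities (gains here are illustrative placeholders).
    """
    return kp * (np.asarray(q_target) - np.asarray(q_meas)) - kd * np.asarray(qdot_meas)
```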

Table 1: S40 hardware results under tugging forces (within ODD) and on irregular terrain (out-of-ODD).

Tugging forces:

| Policy   | Safe/all runs | Withstood attacks: all (within 110% ODD) | Filter freq. | T_goal | F^peak_avg (success) | F^peak_max (success) | F^peak_avg (failed) | F^peak_min (failed) |
|----------|---------------|------------------------------------------|--------------|--------|----------------------|----------------------|---------------------|---------------------|
| φ^game   | 7/10          | 53/56 (33/35)                            | 0.17         | 26.3   | 67.5 N               | 70.5 N               | 59.8 N              | 52.7 N              |
| φ^critic | 4/10          | 22/28 (10/15)                            | 0.10         | 26.8   | 73.7 N               | 80.9 N               | 53.6 N              | 40.0 N              |
| π^task   | 0/10          | 6/16 (1/5)                               | —            | —      | —                    | —                    | 56.5 N              | 41.4 N              |

Irregular terrain:

| Policy   | Safe/all runs | Filter freq. | T_goal |
|----------|---------------|--------------|--------|
| φ^game   | 10/10         | 0.19         | 41.2   |
| φ^critic | 5/10          | 0.22         | 33.5   |
| π^task   | 5/10          | —            | 16.4   |

Table 2: Go-2 hardware results under tugging forces up to 4× the ODD bound.

| Policy          | Safe/total runs: all (within ODD) | F^peak_avg (success) | F^peak_max (success) | F^peak_avg (failed) | F^peak_min (failed) |
|-----------------|-----------------------------------|----------------------|----------------------|---------------------|---------------------|
| φ^game          | 8/10 (5/5)                        | 42.4 N               | 215 N                | 105.7 N             | 104.4 N             |
| π^task          | 0/10 (0/10)                       | —                    | —                    | 32.7 N              | 15.3 N              |
| π^task_built-in | 7/10 (5/5)                        | 23.9 N               | 134.7 N              | 106.1 N             | 94.3 N              |

Table 3: Maximum tugging force withstood by each safety policy/filter, by pull direction (left/right) and angle (low/high).

| Algorithm | Left, low | Left, high | Right, low | Right, high |
|-----------|-----------|------------|------------|-------------|
| π^safe    | 87.1 N    | 61.1 N     | 99.3 N     | 59.1 N      |
| π_θ       | 100.5 N   | 150.3 N    | 121.6 N    | 121.9 N     |
| RARL + DR | 46.4 N    | 43 N       | 57.2 N     | 72.1 N      |
| π^task    | 83.2 N    | 96.9 N     | 82.8 N     | 59 N        |
| π^T       | 151.9 N   | 173.7 N    | 140.3 N    | 142.6 N     |

† Safety policies from reward-based RL and ISAACS with the avoid-only objective fail immediately, before force is applied. * The policy was able to withstand this magnitude of force; because the policy made the quadruped move in the tugging direction, we were not able to apply a larger force in 10 pull attempts.

4.2 Physical Results

Safe walking within and beyond the ODD. We evaluate the effectiveness of our proposed gameplay filter in terms of both safety and disruption of task performance. We run similar experiments with baseline methods for rough comparison purposes, but caution that, due to the impossibility of reproducing identical conditions, these results should not be taken as a fine-grained quantitative comparison between methods. Such a comparison is conducted at scale, albeit in simulation, in Section 4.3. Table 1 shows the results for the S40 robot, subject to tugging forces and irregular terrain (not considered in the ODD), and Table 2 shows the results for the Go-2 robot under a larger range of tugging forces (up to 4× the ODD bound). Our proposed gameplay safety filter is remarkably robust across robot platforms and test conditions; while not unbeatable outside of the specified ODD, it still withstands large tugging forces before violating the safety constraints. Importantly, the gameplay filter does not disproportionately interfere with task-oriented actions: it maintains filter frequency and task performance comparable to those of the critic filter while drastically reducing safety failures. The front figure shows the gameplay filter in action on the S40, dynamically counterbalancing tugs or springing into a wide stance. Time plots of tugging forces in all S40 runs are given in Appendix D.

External forces. We measure the maximum tugging force withstood by various safety policies and filters, reported in Table 3. We pull the quadruped from different directions, with “low” indicating angles in the range [−0.1, 0.4] rad and “high” in [0.5, 1.0] rad. The employed $\pi_\theta$ can withstand 150 N from all directions, but the non-game-theoretic counterpart (RARL + DR) is vulnerable to tugging from the left and can only withstand 43 N. This suggests that DR struggles to capture the worst-case realization of disturbances in a bounded class. This arises from its inherent nature: as the dimension of the disturbance input increases, the likelihood of the random policy simulating the worst-case disturbance decreases exponentially. Further, we notice that the reward-based RL baselines and ISAACS with the avoid-only objective fail almost immediately by overreacting and flipping over. Reach–avoid policies behave more robustly by bringing the robot to a stable stance. We also include tests for the task policy $\pi^{\text{task}}$ and the fixed-pose policy $\pi^{\mathcal{T}}$ (used when the state is in the target set). We observe that the ISAACS control actor is strictly better than $\pi^{\text{task}}$ and comparable to $\pi^{\mathcal{T}}$.

[Figure 2: Sensitivity of the gameplay filter to the imagined-gameplay horizon H: safe rate and intervention frequency for reach–avoid vs. avoid-only criteria.]

4.3 Simulated Results

Bespoke ultimate stress test (BUST). To test each policy’s robustness when taken to the limit, we RL-train a specialized adversarial disturbance $\pi_\psi^*$ to exploit its safety vulnerabilities (Table 4). For each robot–disturbance policy pair, we play 1,000 finite-horizon games and record the safe rate, i.e., the overall fraction of failure-free runs. All pairs use the same set of 1,000 initial states. We observe that $\pi^{\text{task}}$ is vulnerable to all $\pi_\psi^*$, while the proposed gameplay filter is only exploited by its associated BUST disturbance $\pi_\psi^*(\phi^{\text{game}})$. Further, the robustness of $\phi^{\text{game}}$ pushes $\pi_\psi^*(\phi^{\text{game}})$ to learn effective attacks that also exploit other policies (the third column has the lowest safe rates compared to other columns across the board). The last two columns show the safe rate under random disturbances. All safety filters and safety policies maintain remarkably high safe rates there, suggesting that our adversarial BUST evaluation method establishes a more demanding safety benchmark for policies than DR.

Table 4: Safe rate of each policy/filter (rows) under each BUST disturbance and under random disturbances (columns).

| Policy/filter | π_ψ*(π_θ) | π_ψ*(π^task) | π_ψ*(φ^game) | π_ψ*(φ^critic) | π^rnd | π^rnd,+ |
|---------------|-----------|--------------|--------------|----------------|-------|---------|
| π_θ           | 0.37      | 0.38         | 0.17         | 0.44           | 0.88  | 0.85    |
| π^task        | 0.0       | 0.0          | 0.0          | 0.0            | 0.03  | 0.03    |
| φ^game        | 0.42      | 0.35         | 0.03         | 0.45           | 0.84  | 0.89    |
| φ^critic      | 0.37      | 0.34         | 0.10         | 0.44           | 0.86  | 0.86    |
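For completeness, a minimal sketch (ours, with hypothetical simulator and policy handles) of the BUST evaluation protocol described above, which plays finite-horizon games from a shared set of initial states and records the fraction of failure-free runs:

```python
def bust_safe_rate(policy, bust_disturbance, sim, in_failure, initial_states, horizon):
    """Fraction of failure-free finite-horizon games for one policy-disturbance pair."""
    safe = 0
    for x0 in initial_states:              # the same 1,000 initial states for every pair
        x, failed = x0, False
        for _ in range(horizon):
            x = sim(x, policy(x), bust_disturbance(x))
            if in_failure(x):
                failed = True
                break
        safe += 0 if failed else 1
    return safe / len(initial_states)
```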

Sensitivity analysis: reach–avoid criteria vs. avoid-only. We evaluate the significance of using reach–avoid criteria in the gameplay filter by performing a sensitivity analysis over the horizon of the imagined gameplay. Figure 2 shows that the gameplay filter with reach–avoid criteria maintains a 100% safe rate even when the gameplay horizon is short ($H = 10$). In contrast, an “avoid-only” gameplay filter, which only requires not reaching $\mathcal{F}$ for $H$ steps, incurs more safety violations as the horizon decreases. The difference arises because a shorter imagined gameplay results in more frequent filter interventions under the reach–avoid criterion but overly optimistic monitoring under the avoid-only criterion (oblivious to imminent failures beyond $H$). Further, as the gameplay horizon increases, the reach–avoid gameplay filter’s intervention frequency decreases.

5 Conclusion

This work presents a game-theoretic learning approach to synthesizing safety filters for high-order, nonlinear dynamics. The proposed gameplay safety filter monitors system safety through imagined games between its best-effort safety fallback policy and a learned virtual adversary that aims to realize the worst-case uncertainty in the system. We validate our approach on two different quadruped robots, which maintain zero-shot safety under strong tugging forces and unmodeled irregular terrain. An extensive simulation study is also performed to rigorously stress-test the approach and quantify its reliability and conservativeness.

Limitations. Despite the strong empirical robustness in both simulated and physical experiments, we do not have strong theoretical guarantees on the convergence of offline gameplay learning, and therefore the learned disturbance policy can in general be expected to behave suboptimally in at least some regions of the state space. The potential implications are serious: a suboptimal (not truly worst-case) disturbance model may lead the gameplay rollout to erroneously conclude that a proposed course of action is safe, only to then be met by an ODD realization that unexpectedly drives the robot into a catastrophic failure state. Without strong theoretical assurances, which for now remain elusive, this method should not be placed in sole charge of a truly safety-critical system in which an eventual catastrophic failure carries inadmissible cost.

The remarkably high effectiveness demonstrated by the gameplay filter across various within-ODD experiments, and even under out-of-ODD conditions, could indicate that this new type of filter does in fact enjoy desirable properties yet to be established. This calls for future theoretical work at the intersection of game-theoretic reinforcement learning and nonlinear systems theory. In parallel, we see an opportunity for application-driven research to leverage the computational scalability and de facto robustness of gameplay filters to tackle ongoing challenges in robot learning, for example the safe acquisition of novel skills and the rapid detection of shifts in operating conditions, enabling safe runtime adaptation of ODD assumptions.

Acknowledgments

This work has been supported in part by the Google Research Scholar Award and the DARPA LINC program. The authors thank Zixu Zhang for his help preparing the Go-2 robot for experiments.

References

  • Kumar et al. [2021] A. Kumar, Z. Fu, D. Pathak, and J. Malik. RMA: Rapid Motor Adaptation for Legged Robots. In Proc. Robotics: Science and Systems, July 2021. doi:10.15607/RSS.2021.XVII.011.
  • Zhuang et al. [2023] Z. Zhuang, Z. Fu, J. Wang, C. Atkeson, S. Schwertfeger, C. Finn, and H. Zhao. Robot parkour learning. In Conf. Robot Learning, volume 229 of Proceedings of Machine Learning Research, pages 73–92, November 2023. URL https://proceedings.mlr.press/v229/zhuang23a.html.
  • Hsu et al. [2023] K.-C. Hsu, A. Z. Ren, D. P. Nguyen, A. Majumdar, and J. F. Fisac. Sim-to-lab-to-real: Safe reinforcement learning with shielding and generalization guarantees. Artificial Intelligence, 314:103811, 2023. ISSN 0004-3702. doi:10.1016/j.artint.2022.103811. URL https://www.sciencedirect.com/science/article/pii/S0004370222001515.
  • Margolis et al. [2022] G. Margolis, G. Yang, K. Paigwar, T. Chen, and P. Agrawal. Rapid locomotion via reinforcement learning. In Proc. Robotics: Science and Systems, New York City, NY, USA, June 2022. doi:10.15607/RSS.2022.XVIII.022.
  • [5] A. Z. Ren, A. Dixit, A. Bodrova, S. Singh, S. Tu, N. Brown, P. Xu, L. Takayama, F. Xia, J. Varley, Z. Xu, D. Sadigh, A. Zeng, and A. Majumdar. Robots That Ask For Help: Uncertainty Alignment for Large Language Model Planners. URL https://openreview.net/forum?id=4ZK8ODNyFXx.
  • Bansal et al. [2017] S. Bansal, M. Chen, S. Herbert, and C. J. Tomlin. Hamilton-Jacobi reachability: A brief overview and recent advances. In Proc. IEEE Conf. Decision and Control, pages 2242–2253, 2017. doi:10.1109/CDC.2017.8263977.
  • Bui et al. [2021] M. Bui, M. Lu, R. Hojabr, M. Chen, and A. Shriraman. Real-time Hamilton-Jacobi reachability analysis of autonomous system with an FPGA. In IEEE/RSJ Int. Conf. Intelligent Robots & Systems, pages 1666–1673, 2021. doi:10.1109/IROS51168.2021.9636410.
  • Mattila et al. [2015] R. Mattila, Y. Mo, and R. M. Murray. An iterative abstraction algorithm for reactive correct-by-construction controller synthesis. In Proc. IEEE Conf. Decision and Control, pages 6147–6152, 2015. doi:10.1109/CDC.2015.7403186.
  • [9] Q. Nguyen and K. Sreenath. Robust safety-critical control for dynamic robotics. 67(3):1073–1088. ISSN 1558-2523. doi:10.1109/TAC.2021.3059156.
  • [10] T. G. Molnar, R. K. Cosner, A. W. Singletary, W. Ubellacker, and A. D. Ames. Model-free safety-critical control for robotic systems. 7(2):944–951. ISSN 2377-3766. doi:10.1109/LRA.2021.3135569.
  • Yang et al. [2022] T.-Y. Yang, T. Zhang, L. Luu, S. Ha, J. Tan, and W. Yu. Safe reinforcement learning for legged locomotion. In IEEE/RSJ Int. Conf. Intelligent Robots & Systems, pages 2454–2461, 2022. doi:10.1109/IROS47612.2022.9982038.
  • [12] T. He, C. Zhang, W. Xiao, G. He, C. Liu, and G. Shi. Agile but safe: Learning collision-free high-speed legged locomotion. URL http://arxiv.org/abs/2401.17583.
  • Hsu et al. [2023] K.-C. Hsu, H. Hu, and J. F. Fisac. The safety filter: A unified view of safety-critical control in autonomous systems, 2023. URL https://arxiv.org/abs/2309.05837.
  • Wabersich et al. [2023] K. P. Wabersich, A. J. Taylor, J. J. Choi, K. Sreenath, C. J. Tomlin, A. D. Ames, and M. N. Zeilinger. Data-driven safety filters: Hamilton-Jacobi reachability, control barrier functions, and predictive methods for uncertain systems. IEEE Control Systems Magazine, 43(5):137–177, 2023. doi:10.1109/MCS.2023.3291885.
  • [15] K. L. Hobbs, M. L. Mote, M. C. Abate, S. D. Coogan, and E. M. Feron. Runtime Assurance for Safety-Critical Systems: An Introduction to Safety Filtering Approaches for Complex Control Systems. 43(2):28–65. ISSN 1941-000X. doi:10.1109/MCS.2023.3234380. URL https://ieeexplore.ieee.org/document/10081233.
  • Mitchell et al. [2005] I. Mitchell, A. Bayen, and C. Tomlin. A time-dependent Hamilton-Jacobi formulation of reachable sets for continuous dynamic games. IEEE Transactions on Automatic Control, 50(7):947–957, 2005. ISSN 1558-2523. doi:10.1109/TAC.2005.851439.
  • Fisac et al. [2015] J. F. Fisac, M. Chen, C. J. Tomlin, and S. S. Sastry. Reach-avoid problems with time-varying dynamics, targets and constraints. In Hybrid Systems: Computation and Control, pages 11–20, Seattle, WA, USA, 2015. doi:10.1145/2728606.2728612.
  • Mitchell [2008] I. M. Mitchell. The flexible, extensible and efficient toolbox of level set methods. Journal of Scientific Computing, 35(2):300–329, 2008. doi:10.1007/s10915-007-9174-4.
  • Bui et al. [2022] M. Bui, G. Giovanis, M. Chen, and A. Shriraman. OptimizedDP: An efficient, user-friendly library for optimal control and dynamic programming, 2022. URL https://arxiv.org/abs/2204.05520.
  • Ames et al. [2019] A. D. Ames, S. Coogan, M. Egerstedt, G. Notomista, K. Sreenath, and P. Tabuada. Control barrier functions: Theory and applications. In European Control Conference, pages 3420–3431, 2019. doi:10.23919/ECC.2019.8796030.
  • Xu et al. [2017] X. Xu, J. W. Grizzle, P. Tabuada, and A. D. Ames. Correctness guarantees for the composition of lane keeping and adaptive cruise control. IEEE Transactions on Automation Science and Engineering, 15(3):1216–1229, 2017.
  • Wang et al. [2018] L. Wang, D. Han, and M. Egerstedt. Permissive barrier certificates for safe stabilization using sum-of-squares. In Proc. American Control Conference, pages 585–590, 2018. doi:10.23919/ACC.2018.8431617.
  • Lindemann et al. [2021] L. Lindemann, H. Hu, A. Robey, H. Zhang, D. Dimarogonas, S. Tu, and N. Matni. Learning hybrid control barrier functions from data. In Conf. Robot Learning, pages 1351–1370, 2021.
  • Xu et al. [2015] X. Xu, P. Tabuada, J. W. Grizzle, and A. D. Ames. Robustness of control barrier functions for safety critical control. 48(27):54–61, 2015. ISSN 2405-8963. doi:10.1016/j.ifacol.2015.11.152. URL https://linkinghub.elsevier.com/retrieve/pii/S2405896315024106.
  • Robey et al. [2020] A. Robey, H. Hu, L. Lindemann, H. Zhang, D. V. Dimarogonas, S. Tu, and N. Matni. Learning control barrier functions from expert demonstrations. In Proc. IEEE Conf. Decision and Control, pages 3717–3724, 2020. doi:10.1109/CDC42340.2020.9303785.
  • Choi et al. [2021] J. J. Choi, D. Lee, K. Sreenath, C. J. Tomlin, and S. L. Herbert. Robust control barrier-value functions for safety-critical control. In Proc. IEEE Conf. Decision and Control, pages 6814–6821, 2021. doi:10.1109/CDC45484.2021.9683085.
  • Robey et al. [2021] A. Robey, L. Lindemann, S. Tu, and N. Matni. Learning robust hybrid control barrier functions for uncertain systems. IFAC-PapersOnLine, 54(5):1–6, 2021.
  • Bansal and Tomlin [2021] S. Bansal and C. J. Tomlin. DeepReach: A deep learning approach to high-dimensional reachability. In Proc. IEEE Conf. Robotics and Automation, pages 1817–1824, 2021. doi:10.1109/ICRA48506.2021.9561949.
  • Fisac et al. [2019] J. F. Fisac, N. F. Lugovoy, V. Rubies-Royo, S. Ghosh, and C. J. Tomlin. Bridging Hamilton-Jacobi safety analysis and reinforcement learning. In Proc. IEEE Conf. Robotics and Automation, pages 8550–8556, 2019. doi:10.1109/ICRA.2019.8794107.
  • Bharadhwaj et al. [2021] H. Bharadhwaj, A. Kumar, N. Rhinehart, S. Levine, F. Shkurti, and A. Garg. Conservative safety critics for exploration. In Int. Conf. Learning Representations, 2021. URL https://openreview.net/forum?id=iaO86DUuKi.
  • Thananjeyan et al. [2021] B. Thananjeyan, A. Balakrishna, S. Nair, M. Luo, K. Srinivasan, M. Hwang, J. E. Gonzalez, J. Ibarz, C. Finn, and K. Goldberg. Recovery RL: Safe Reinforcement Learning With Learned Recovery Zones. IEEE Robotics and Automation Letters, 6(3):4915–4922, 2021. doi:10.1109/LRA.2021.3070252.
  • Hsu et al. [2021] K.-C. Hsu, V. Rubies-Royo, C. J. Tomlin, and J. F. Fisac. Safety and liveness guarantees through reach-avoid reinforcement learning. In Proc. Robotics: Science and Systems, July 2021. doi:10.15607/RSS.2021.XVII.077.
  • Wabersich and Zeilinger [2018] K. P. Wabersich and M. N. Zeilinger. Linear model predictive safety certification for learning-based control. In Proc. IEEE Conf. Decision and Control, pages 7130–7135, 2018. doi:10.1109/CDC.2018.8619829.
  • Wabersich and Zeilinger [2021] K. P. Wabersich and M. N. Zeilinger. A predictive safety filter for learning-based control of constrained nonlinear dynamical systems. Automatica, 129:109597, 2021. ISSN 0005-1098. doi:10.1016/j.automatica.2021.109597. URL https://www.sciencedirect.com/science/article/pii/S0005109821001175.
  • Bastani and Li [2021] O. Bastani and S. Li. Safe reinforcement learning via statistical model predictive shielding. In Proc. Robotics: Science and Systems, July 2021. doi:10.15607/RSS.2021.XVII.026.
  • Leeman et al. [2023] A. Leeman, J. Köhler, S. Bennani, and M. Zeilinger. Predictive safety filter using system level synthesis. In Learning for Dynamics & Control, volume 211 of Proceedings of Machine Learning Research, pages 1180–1192, June 2023. URL https://proceedings.mlr.press/v211/leeman23a.html.
  • Ramesh Kumar et al. [2023] A. Ramesh Kumar, K.-C. Hsu, P. J. Ramadge, and J. F. Fisac. Fast, smooth, and safe: Implicit control barrier functions through reach-avoid differential dynamic programming. IEEE Control Systems Letters, 7:2994–2999, 2023. doi:10.1109/LCSYS.2023.3292132.
  • Hsu et al. [2023] K.-C. Hsu, D. P. Nguyen, and J. F. Fisac. ISAACS: Iterative soft adversarial actor-critic for safety. In Learning for Dynamics & Control, volume 211 of Proceedings of Machine Learning Research, June 2023. URL https://proceedings.mlr.press/v211/hsu23a.html.
  • Hu et al. [2020] H. Hu, M. Fazlyab, M. Morari, and G. J. Pappas. Reach-SDP: Reachability analysis of closed-loop systems with neural network controllers via semidefinite programming. In Proc. IEEE Conf. Decision and Control, pages 5929–5934, 2020. doi:10.1109/CDC42340.2020.9304296.
  • [40] E. Luo, N. Kochdumper, and S. Bak. Reachability Analysis for Linear Systems with Uncertain Parameters using Polynomial Zonotopes. In Proceedings of the 26th ACM International Conference on Hybrid Systems: Computation and Control, HSCC ’23, pages 1–12. Association for Computing Machinery. ISBN 9798400700330. doi:10.1145/3575870.3587130. URL https://dl.acm.org/doi/10.1145/3575870.3587130.
  • Bird et al. [2023] T. J. Bird, H. C. Pangborn, N. Jain, and J. P. Koeln. Hybrid zonotopes: A new set representation for reachability analysis of mixed logical dynamical systems. Automatica, 154:111107, 2023. ISSN 0005-1098. doi:10.1016/j.automatica.2023.111107. URL https://www.sciencedirect.com/science/article/pii/S0005109823002674.
  • Reher and Ames [2021] J. Reher and A. D. Ames. Dynamic walking: Toward agile and efficient bipedal robots. Annual Review of Control, Robotics, and Autonomous Systems, 4(1):535–572, 2021. doi:10.1146/annurev-control-071020-045021.
  • [43] H. Lai, W. Zhang, X. He, C. Yu, Z. Tian, Y. Yu, and J. Wang. Sim-to-real transfer for quadrupedal locomotion via terrain transformer. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 5141–5147. IEEE. ISBN 9798350323658. doi:10.1109/ICRA48891.2023.10160497. URL https://ieeexplore.ieee.org/document/10160497/.
  • [44] L. Campanaro, S. Gangapurwala, W. Merkt, and I. Havoutis. Learning and deploying robust locomotion policies with minimal dynamics randomization. URL http://arxiv.org/abs/2209.12878.
  • Nguyen and Sreenath [2016] Q. Nguyen and K. Sreenath. Optimal Robust Time-Varying Safety-Critical Control With Application to Dynamic Walking on Moving Stepping Stones, October 2016.
  • [46] Z. Gu, Y. Zhao, Y. Chen, R. Guo, J. K. Leestma, G. S. Sawicki, and Y. Zhao. Robust-locomotion-by-logic: Perturbation-resilient bipedal locomotion via signal temporal logic guided model predictive control. URL http://arxiv.org/abs/2403.15993.
  • [47] SAE On-Road Automated Driving Committee. Taxonomy and definitions for terms related to driving automation systems for on-road motor vehicles. URL https://www.sae.org/content/j3016_202104.
  • Laine et al. [2020] F. Laine, C.-Y. Chiu, and C. Tomlin. Eyes-Closed Safety Kernels: Safety of Autonomous Systems Under Loss of Observability. In Proc. Robotics: Science and Systems, Corvalis, Oregon, USA, July 2020. doi:10.15607/RSS.2020.XVI.096.
  • Zhang and Fisac [2021] Z. Zhang and J. F. Fisac. Safe Occlusion-Aware Autonomous Driving via Game-Theoretic Active Perception. In Proc. Robotics: Science and Systems, July 2021. doi:10.15607/RSS.2021.XVII.066.
  • [50] H. Hu, Z. Zhang, K. Nakamura, A. Bajcsy, and J. F. Fisac. Deception game: Closing the safety-learning loop in interactive robot autonomy. URL https://openreview.net/forum?id=0o2JgvlzMUc.
  • Haarnoja et al. [2018] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Int. Conf. Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1861–1870, July 2018. URL https://proceedings.mlr.press/v80/haarnoja18b.html.
  • Coumans and Bai [2016–2021] E. Coumans and Y. Bai. PyBullet, a Python module for physics simulation for games, robotics and machine learning. http://pybullet.org, 2016–2021.


Appendix A Frequently Asked Questions

We discuss some design choices and implications of our method using an informal FAQ format.

Why choose worst-case safety and not probabilistic analysis? Although not as established in the robot learning community, robust/worst-case formulations are widely used across engineering. Their key advantage is that they can enforce systematic handling of all scenarios in a well-defined class, even if some of them are highly unlikely—e.g., the robot must withstand all (rather than most) external forces of up to 50 N, even the unlucky push that happens to maximally disturb its stance. This is consistent with much of the safety analysis found in bridges, elevators, automobiles, aircraft, and other safety-critical engineering systems, in great part because it facilitates a clear-cut social contract between their designers and the broader public. For example, we do not certify elevators for 95% of loads up to 300 kg or bridges for 99% of earthquakes up to magnitude 8, but rather all such loads and earthquakes, and we treat any loss of safety within the specified bounds as a serious failure to comply with the promise made to society. As robots and autonomous systems become more widely deployed, we argue that their safety should be certified and held to similar standards, at least in truly safety-critical settings where people could otherwise get hurt.

Isn’t worst-case safety too conservative to be useful? Actually, this is a common misconception. Robust/worst-case assessments are not intrinsically more or less conservative than probabilistic ones: this depends entirely on what set and distribution we choose to run these assessments against. The term “worst-case” doesn’t mean a system must preserve safety in the worst conceivable scenario (whatever that means), but rather under all conditions—including the worst one—in a specified set. Worst-case safety lets designers and regulators draw this line (the ODD boundary), and it ensures that the system then maintains safety across all certified (in-ODD) conditions. If your robot’s behavior is “too conservative,” this means it’s guarding against eventualities you don’t really care about: just exclude them from your ODD. But if you do want safety under these conditions, then your robot is not actually too conservative: it’s doing what it should. With the gameplay filter, you are never left wondering: each time it overrides the task policy, it logs the specific future it’s preempting. Then, only one question remains: did you or did you not want your robot to avoid that hypothetical crash? Worst-case safety is extremely powerful, and it lets you control exactly what situations your robot is required to handle. You just need to be ready to answer some hard what-if questions.

What does it mean for the proposed gameplay filter to approximate a perfect filter? If we had the exact solution to the Isaacs reach-avoid equation (5), our gameplay rollouts would be necessary and sufficient for safely reaching $\mathcal{T}$ in $H$ (or fewer) steps. Since $\mathcal{T}$ is typically chosen to be a broad, naturally reachable class of robot states (e.g., coming to a stable stance for a walking robot or pulling over for an autonomous vehicle), safely reaching $\mathcal{T}$ within a long enough horizon $H$ is possible as long as remaining safe is possible in the first place. In other words, the sufficient reach-avoid condition becomes a tight approximation of the all-time safety condition. We can observe this phenomenon in Fig. 2, where the reach-avoid filter’s overstepping vanishes with long $H$.

Why is computing a gameplay rollout better than just querying the learned reach–avoid critic? In theory, the critic should make fairly accurate predictions of game outcomes after training. In practice, we have found that it is often unreliable and/or overly conservative. A key advantage of the gameplay rollout is that the uncertainty linked to the learning-based safety analysis is much more structured: the robot’s future safety fallback is perfectly predicted (since it will be implemented as-is), and the dynamics can be reliably simulated given the players’ actions, so all uncertainty falls on the learned disturbance. One very useful implication of this structure is that, even if the disturbance is suboptimally adversarial, a predicted gameplay rollout ending in a safety failure constitutes a valid certificate (i.e., a proof) that there exists an ODD realization in which the robot will violate safety if the filter does not intervene immediately. That is, we know the gameplay safety monitor isn’t falsely crying wolf—we can’t prove anything like that about the black-box neural safety critic’s predictions.
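For concreteness, the sketch below illustrates this rollout-as-certificate logic. It is a minimal Python sketch, not our deployed code: `sim`, `ctrl_policy`, `dstb_policy`, `failure_margin`, and `target_margin` are hypothetical stand-ins for the simulator, the learned safety fallback, the learned adversary, and the margin functions g(x) and ℓ(x).

```python
def gameplay_rollout_is_safe(sim, x0, ctrl_policy, dstb_policy,
                             failure_margin, target_margin, horizon=300):
    """Imagine a safety game from x0: fallback controller vs. learned adversary.

    Returns True if the rollout reaches the all-time safe set T
    (target_margin >= 0) before ever violating safety (failure_margin < 0).
    A rollout that hits the failure set is a constructive certificate of an
    unsafe in-ODD realization; exhausting the horizon without reaching T is
    treated conservatively as unsafe.
    """
    x = sim.reset_to(x0)
    for _ in range(horizon):
        if failure_margin(x) < 0:      # entered the failure set
            return False
        if target_margin(x) >= 0:      # reached T: rollout can stop early
            return True
        u = ctrl_policy(x)             # fallback (safety) action
        d = dstb_policy(x, u)          # adversarial ODD realization
        x = sim.step(u, d)
    return False                       # horizon exhausted: be conservative
```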

Why is reach–avoid preferable if it’s more conservative than avoid-only? This is an important aspect of predictive safety filtering and relates to a deeper tenet in safety engineering philosophy: whatever the safety boundary is (i.e., a strategy that is “just safe enough”), it is preferable to approach it from the safe side than from the unsafe side. In practice, we don’t know a priori how many prediction steps $H$ we need to avoid being blindsided by future failures just beyond the lookahead horizon. When in doubt, it’s preferable to risk being overly conservative than to risk losing safety.

Having a terminal state constraint is common in MPC; how is reach–avoid different? The use of a terminal controlled-invariant set in MPC is well established and ensures recursive feasibility. Our choice of reach–avoid over an avoid-only safety condition is an instance of the same principle. An important difference is that the (also well-established) reach–avoid condition gives our filter extra flexibility by allowing the gameplay trajectory to reach the forever-safe set $\mathcal{T}$ at any time within the horizon. This reduces conservativeness and often lets us terminate the gameplay rollout early.

How do you determine $\mathcal{T}$? In practice, a suitable $\mathcal{T}$ is obtained from domain knowledge, offline computation, pre-deployment learning, or some combination, often in the form of a stability basin (region of attraction) around a desirable class of equilibrium points sufficiently far from failure. For example, most robots can be robustly stabilized around static or steady cruising configurations by comparatively simple linear feedback controllers (e.g., most modern walking robots ship with built-in controllers that can stabilize them around a default stance). Larger all-time safe regions may be found by (robust) Lyapunov analysis or even optimized through control Lyapunov functions.

What are the implications of the choice of $\mathcal{T}$? Broadly speaking, the larger the $\mathcal{T}$ we can characterize offline, the easier the job of the gameplay filter at runtime, and, potentially, the fewer steps we’ll need to reach it from more dynamic configurations. In fact, in the extreme case, we could be remarkably lucky and find $\mathcal{T} = \Omega^*$, in which case the gameplay filter’s job is made much easier, since all candidate actions that are safe will keep the state in $\mathcal{T}$, immediately terminating the rollout check. Conversely, all actions that leave $\mathcal{T}$ are unsafe, and the gameplay rollout will not be able to return to $\mathcal{T}$. To avoid initializing the gameplay filter from a no-win scenario, designers should ensure that $\mathcal{T}$ contains the range of expected robot deployment conditions ($\mathcal{X}_0$) in the ODD.

Why aren’t you using onboard cameras or lidar? Our empirical focus in this paper is on demonstrating automatically synthesized safety filters that account for the full-order (36-D) walking dynamics of quadruped robots. We think that the simplest and clearest demonstration of this concept is to have the filter consider only the robot’s own state (proprioception) without accounting for the environment, obstacles, etc. (exoception). That said, incorporating information about the robot’s surroundings can be extremely valuable—and often critical—to safety. We are very excited by the scalability and generality that new safety approaches like the one we present in this paper seem to enjoy, and we expect they will soon unlock full-order safety filters that incorporate rich exoceptive information in real time, whether straight from raw sensor data or through intermediate representations provided by the perception and localization stack.

Appendix B Implementation Details

State and action spaces. For the scope of this paper, we aim to construct a proprioceptive safety filter that relies on onboard estimation of the robot’s kinematic state but no exoceptive information (from camera, lidar, etc.) about the surrounding environment. (Ranged perception can improve the robustness of walking controllers by sensing terrain geometry and texture, and it is strictly needed for ODDs that include unmapped or moving obstacles; full-order legged robot safety filters combining proprioception and exoception have significant potential and are ripe for investigation.) We encode the quadrupedal robots’ state and action vectors as follows:

\[
\begin{aligned}
x &:= \left[\, p_x,\ p_y,\ p_z,\ v_x,\ v_y,\ v_z,\ \theta_x,\ \theta_y,\ \theta_z,\ \omega_x,\ \omega_y,\ \omega_z,\ \{\theta^i_{\text{J}}\},\ \{\omega^i_{\text{J}}\} \,\right],\\
u &:= \left[\, \{\delta\theta^i_{\text{J}}\} \,\right],
\end{aligned}
\]

with $p_x, p_y, p_z$ the position of the body frame with respect to a fixed reference (“world”) frame; $v_x, v_y, v_z$ the velocity of the robot’s torso expressed in (forward–left–up) body coordinates; $\theta_x, \theta_y, \theta_z$ the roll, pitch, and yaw angles of the robot’s body frame with respect to the world frame (for the purposes of this demonstration, we find that an Euler angle representation of body attitude performs adequately and makes the failure set straightforward to encode; in general, a quaternion-based representation may be preferable, avoiding the risk of computational issues near the singularities at $\theta_y = \pm\frac{\pi}{2}$); $\omega_x, \omega_y, \omega_z$ the body frame’s axial rotation rates; and $\theta^i_{\text{J}}, \omega^i_{\text{J}}, \delta\theta^i_{\text{J}}$ the angle, angular velocity, and commanded angular increment of the robot’s $i$-th joint.

The above constitutes a full-order state representation of the robot’s idealized Lagrangian mechanics. A total of 18 generalized coordinates encode the 6 degrees of freedom of the torso’s rigid-body pose in addition to the configuration of 3 rotational joints (hip abduction, hip flexion, and knee flexion) for each of the 4 legs; the robot’s rate of motion is expressed through 18 corresponding generalized velocities, for a total of 36 continuous state variables. We discuss discrete contact variables below.
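As an illustration, the sketch below shows how such a 36-D state vector could be read out of the physics engine. It assumes a PyBullet robot handle `robot_id` and a list `joint_ids` of the 12 actuated joint indices, and is not necessarily identical to our data pipeline (in particular, body-frame velocities are obtained here by rotating PyBullet’s world-frame readings).

```python
import numpy as np
import pybullet as p

def get_full_state(robot_id, joint_ids):
    pos, quat = p.getBasePositionAndOrientation(robot_id)
    v_world, w_world = p.getBaseVelocity(robot_id)                # world frame
    R = np.array(p.getMatrixFromQuaternion(quat)).reshape(3, 3)   # body -> world
    v_body = R.T @ np.array(v_world)                              # forward-left-up
    w_body = R.T @ np.array(w_world)
    roll, pitch, yaw = p.getEulerFromQuaternion(quat)
    joint_states = p.getJointStates(robot_id, joint_ids)
    q_joint = [s[0] for s in joint_states]                        # joint angles
    dq_joint = [s[1] for s in joint_states]                       # joint velocities
    return np.concatenate([pos, v_body, [roll, pitch, yaw], w_body,
                           q_joint, dq_joint])                    # 36-D vector
```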

The robot’s control authority is achieved by independently modulating the torque applied on each of its 12 rotational joints by an electric motor; in modern legged platforms, these motors typically have dedicated low-level controllers, so our control policy sends a tracking reference to each motor controller rather than directly commanding a torque.

Finally, the disturbance is modeled as an external force that can act on any point of the robot’s torso and in any direction of Euclidean space, with a bounded modulus. The specified range of admissible disturbance forces is discussed below.

Black-box simulator(s). The dynamical model is implemented by the off-the-shelf PyBullet physics engine [52] using the standardized robot description files made available by the manufacturers of each platform. Our method treats the simulator as a black-box environment for both training and runtime safety filtering, allowing the engine and/or robot model to be easily swapped out. The generality and modularity of our approach is perhaps best illustrated by the fact that we synthesized and deployed the safety filter for the Go-2 robot using the same hyperparameter values as for the S40 robot. Our only modification, other than replacing the robot model in the physics engine, was to append 4 state components to each neural network’s input space to account for foot contact information; we note that even this straightforward addition is entirely optional, since we could alternatively have constructed a safety filter that simply disregarded the extra sensor data.
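The sketch below illustrates this black-box usage pattern. The URDF file names are placeholders, and the setup parameters (time step, spawn height) are illustrative assumptions rather than our exact configuration.

```python
import pybullet as p
import pybullet_data

def make_env(urdf_path, time_step=0.002):
    client = p.connect(p.DIRECT)                      # headless simulation
    p.setAdditionalSearchPath(pybullet_data.getDataPath())
    p.setGravity(0, 0, -9.81)
    p.setTimeStep(time_step)
    p.loadURDF("plane.urdf")                          # flat ground
    robot_id = p.loadURDF(urdf_path, basePosition=[0, 0, 0.35])
    return client, robot_id

# Swapping platforms only changes the robot description file:
# client, robot = make_env("s40.urdf")   # placeholder path
# client, robot = make_env("go2.urdf")   # placeholder path
```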

Actor and critic networks. The learned control and disturbance actors, as well as the safety critics, are independent of the robot’s absolute position $p_x, p_y, p_z$ and heading angle $\theta_z$; of these, only the distance to the ground affects the dynamics, but since it is hard to observe without vision, we do not make it available. In the case of the Go-2 quadruped (but not the S40), the policies additionally depend on the discrete contact state, encoded as a Boolean (true/false) indicator for each foot. In simulation, each neural network policy receives as input the ground-truth state of the robot in the simulator; in hardware experiments, it instead receives a state estimate computed by the robot’s onboard perception stack. Each policy is implemented by a fully connected feedforward neural network with 3 hidden layers of 256 neurons, and each critic has 3 hidden layers of 128 neurons.
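A minimal PyTorch sketch of these network sizes is given below. The activation function, output heads, and exact input conventions (e.g., whether the adversary and critics also observe the control action) are assumptions, not a specification of our training code; the Go-2 variant would append 4 foot-contact indicators to the observation.

```python
import torch.nn as nn

def mlp(in_dim, hidden_dim, out_dim, n_hidden=3):
    layers, d = [], in_dim
    for _ in range(n_hidden):
        layers += [nn.Linear(d, hidden_dim), nn.ReLU()]
        d = hidden_dim
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

obs_dim, ctrl_dim, dstb_dim = 36, 12, 6
ctrl_actor = mlp(obs_dim, 256, ctrl_dim)                 # safety fallback policy
dstb_actor = mlp(obs_dim + ctrl_dim, 256, dstb_dim)      # virtual adversary
critic     = mlp(obs_dim + ctrl_dim + dstb_dim, 128, 1)  # reach-avoid value
```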

Safety specification. We are interested in preventing falls, understood as any part of the robot other than its feet making contact with the ground. To encode the failure set of all such falls with a simple margin function, we define a small number of critical points, including the 8 corners of a (tight) 3-D bounding box around the robot’s torso as well as its four knee joints. The failure margin is

\[
g(x) = \min\left\{\, \min_i \{z_{\text{corner}}^i\} - \bar{z}_{\text{corner},g},\;\; \min_i \{z_{\text{knee}}^i\} - \bar{z}_{\text{knee}} \,\right\},
\]

with $z_{\text{corner}}^i$ the vertical distance to the ground of the $i$-th robot body corner point and $z_{\text{knee}}^i$ the vertical distance to the ground of the $i$-th robot knee point. The target (all-time safe set) is defined as a narrow neighborhood of a static stance with all four feet on the ground and a sufficiently lowered torso, chosen so that the robot is robustly stable with a simple stance controller. The target margin is

\[
\ell(x) = \min\Big\{\,
\bar{\omega} - |\omega_x|,\ \bar{\omega} - |\omega_y|,\ \bar{\omega} - |\omega_z|,\
\bar{v} - |v_x|,\ \bar{v} - |v_y|,\ \bar{v} - |v_z|,\
\bar{z}_{\text{corner},\ell} - \max_i \{z_{\text{corner}}^i\},\
\bar{z}_{\text{foot}} - \max_i \{z_{\text{foot}}^i\}
\,\Big\},
\]

with $z_{\text{foot}}^i$ the vertical elevation of the $i$-th robot foot relative to the ground. The threshold values we used for our failure and target set specification are as follows:

\[
\begin{aligned}
\bar{z}_{\text{corner},g} &= 0.1~\text{m}, & \bar{z}_{\text{knee}} &= 0.05~\text{m},\\
\bar{z}_{\text{corner},\ell} &= 0.4~\text{m}, & \bar{z}_{\text{foot}} &= 0.05~\text{m},\\
\bar{\omega} &= 10^{\circ}/\text{s}, & \bar{v} &= 0.2~\text{m}/\text{s}.
\end{aligned}
\]
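Putting the margins and thresholds together, a minimal sketch could look as follows; the arrays of critical-point heights and the body-frame velocity inputs are hypothetical, and the angular-rate bound is converted to radians under the assumption that rates are reported in rad/s.

```python
import numpy as np

Z_CORNER_G, Z_KNEE = 0.10, 0.05      # failure thresholds [m]
Z_CORNER_L, Z_FOOT = 0.40, 0.05      # target thresholds  [m]
W_MAX = np.deg2rad(10.0)             # 10 deg/s, expressed in rad/s
V_MAX = 0.2                          # m/s

def failure_margin(corner_z, knee_z):
    # Negative iff any torso corner or knee point is too close to the ground.
    return min(np.min(corner_z) - Z_CORNER_G, np.min(knee_z) - Z_KNEE)

def target_margin(w, v, corner_z, foot_z):
    # Nonnegative iff the robot is in a slow, lowered, four-feet-down stance.
    return min(
        np.min(W_MAX - np.abs(w)),
        np.min(V_MAX - np.abs(v)),
        Z_CORNER_L - np.max(corner_z),
        Z_FOOT - np.max(foot_z),
    )
```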

Uncertainty specification. To account for uncertainty in the deployment conditions as well as general modeling error (or sim-to-real gap), our operational design domain (ODD) includes an external force that may push or pull any point on the robot’s torso in any direction with a maximum magnitude of $50~\text{N}$:

\[
d = \left[\, F_x,\ F_y,\ F_z,\ p^F_x,\ p^F_y,\ p^F_z \,\right], \tag{8}
\]

where $F = [F_x, F_y, F_z]$ represents the force vector applied at the position defined by $p^F_x, p^F_y, p^F_z$ in body coordinates, with $p^F_x, p^F_y \in [-0.1, 0.1]$ and $p^F_z \in [0, 0.05]~\text{m}$. The red arrows in the imagined gameplay of Figure 1 show examples of the learned adversarial disturbance.
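As an illustration, the adversarial force could be applied to the simulated torso roughly as sketched below each simulation step. The PyBullet call is standard, but the clipping logic and surrounding structure are illustrative assumptions rather than our exact implementation.

```python
import numpy as np
import pybullet as p

F_MAX = 50.0  # N, the ODD force bound

def apply_disturbance(robot_id, d):
    force = np.asarray(d[:3], dtype=float)
    norm = np.linalg.norm(force)
    if norm > F_MAX:                               # keep the adversary in-ODD
        force *= F_MAX / norm
    pos = np.clip(np.asarray(d[3:6], dtype=float),
                  [-0.1, -0.1, 0.0], [0.1, 0.1, 0.05])
    p.applyExternalForce(robot_id, -1,             # -1: the torso (base) link
                         forceObj=force.tolist(), posObj=pos.tolist(),
                         flags=p.LINK_FRAME)       # force and point in body frame
```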

Gameplay filter runtime implementation. To easily deploy our gameplay safety filter across two different robots for the physical experiments, we encapsulate its computation inside a ROS service, which we run on an offboard computer. Each robot’s onboard process calls this service wirelessly (approximately 3.5 times per second), passing its current state estimate and proposed course of action; the offboard server then simulates the gameplay for a fixed horizon (for us, $H = 300$ steps, or $3~\text{s}$) and returns a Boolean indicating which policy to use for the next $L$ time steps. Our choice of $L = 10$ accounts for the wireless round trip, which makes up a significant fraction (approximately 70%) of the total latency.
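The switching logic can be summarized by the sketch below, where `gameplay_check` stands in for the offboard ROS service and the robot interface methods are hypothetical names; the deployed code differs in its ROS plumbing but follows the same pattern.

```python
H_ROLLOUT = 300   # imagined-game horizon (3 s)
L_HOLD = 10       # control steps between filter queries

def filtered_control_loop(robot, task_policy, fallback_policy, gameplay_check):
    use_fallback = False
    t = 0
    while True:
        x = robot.get_state_estimate()
        if t % L_HOLD == 0:
            u_task = task_policy(x)
            # True if the imagined game starting from (x, u_task) stays safe.
            use_fallback = not gameplay_check(x, u_task, horizon=H_ROLLOUT)
        u = fallback_policy(x) if use_fallback else task_policy(x)
        robot.apply_action(u)
        t += 1
```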

We note that the computational resources used for the offboard computation are comparable to those available on current mobile robot platforms. In particular, the entire simulation and filter logic ran on a single core of an Intel i7-1185G7 processor at 3 GHz. For comparison, the Go-2 is equipped with a second computer (not used in our experiments) with a 6-core, 8 GB NVIDIA Jetson Orin Nano processor at 1.5 GHz. We estimate that the total latency of the gameplay filter run fully onboard the Go-2 with the same simulator would be roughly comparable, possibly lower given the absence of a wireless round trip.

Appendix C Extended Evaluation

To further demonstrate the strengths of our approach and shed light on its superior scalability to complex robot dynamics, we compare the gameplay performance of the self-play–trained controller and disturbance policies as training proceeds. The results in Figs. 3 and 4 suggest that the dense temporal difference signal in reach–avoid games plays a determining role in enabling data-efficient learning, while previously proposed safety methods that use reward-based RL with a (sparse) failure indicator consistently require more training episodes before starting to learn meaningfully robust safe control strategies.

[Figures 3 and 4: gameplay performance of the self-play–trained control and disturbance policies over the course of training.]

Appendix D Detailed Tugging Force Plots

We provide time plots for all runs of the tug test experiment on the S40 robot (summarized in Table 3), displaying the magnitude of the tugging force over the course of each trial. We present all 10 runs for each of the three evaluated control schemes: gameplay filter $\phi^{\text{game}}$, critic (value-based) filter $\phi^{\text{critic}}$, and the unfiltered task policy. Each run is annotated to show individual attacks, defined as sequences of significant tug forces ($\geq 10~\text{N}$) that are applied continually or close together in time (less than $1~\text{s}$ of interruption within an attack). Conversely, distinct attacks are at least $1~\text{s}$ apart, to ensure that the effects of the previous attack have died off before the next one begins.

Looking at individual attacks within each run provides more fine-grained insight into the performance of each control scheme under various disturbances (both within-ODD and out-of-ODD). Importantly, it allows us to attribute a safety failure to the attack that immediately preceded it in a given run, while marking all earlier attacks in the same run as safely handled.
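A minimal sketch of this segmentation rule, operating on hypothetical arrays of timestamps (in seconds) and force magnitudes (in newtons), is given below.

```python
def segment_attacks(t, force, f_min=10.0, max_gap=1.0):
    """Group significant tug-force samples (>= f_min) into attacks.

    Samples separated by less than max_gap seconds belong to the same attack;
    gaps of max_gap or more start a new attack.
    """
    attacks, start, last_hit = [], None, None
    for ti, fi in zip(t, force):
        if fi >= f_min:
            if start is None or (ti - last_hit) >= max_gap:
                if start is not None:
                    attacks.append((start, last_hit))   # close previous attack
                start = ti                              # begin a new attack
            last_hit = ti
    if start is not None:
        attacks.append((start, last_hit))
    return attacks  # list of (start_time, end_time) per attack
```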

[Figures: tugging force magnitude over time for each of the 10 runs under the gameplay filter, the critic filter, and the unfiltered task policy, with individual attacks annotated.]