## Abstract

Autonomous vehicle control approaches are rapidly being developed for everyday street-driving scenarios. This article considers autonomous vehicle control in a less common, albeit important, situation: a vehicle stuck in a ditch. In this scenario, a solution is typically obtained either by using a tow-truck or by humans rocking the vehicle to build momentum and push it out. However, it would be much safer and more convenient if a vehicle were able to exit the ditch autonomously, without human intervention. In exploration of this idea, this article derives the governing equations for a vehicle moving along an arbitrary ditch profile with torques applied to the front and rear wheels and with four regions of wheel-slip considered. A reward function was designed to minimize wheel-slip, and the model was used to train control agents using the Probabilistic Inference for Learning COntrol (PILCO) and deep deterministic policy gradient (DDPG) reinforcement learning (RL) algorithms. Rear-wheel-drive (RWD) and all-wheel-drive (AWD) results were compared, showing the capability of the agents to achieve escape from a ditch while minimizing wheel-slip for several ditch profiles. The policies obtained by applying RL to this problem intuitively increased the momentum of the vehicle and applied “braking” to the wheels when slip was detected so as to achieve a safe exit from the ditch. The conclusions show a pathway to apply aspects of this article to specific vehicles.

## 1 Introduction

Autonomous vehicles are a technology that is poised to change transportation. Many prominent companies have allocated significant resources to develop autonomous vehicle technology to ensure safety and reduce traffic issues. However, the investigation of autonomous vehicle control has primarily been concerned with the control of vehicles for on-road, everyday-driving applications [1]. This article seeks to explore the possibility of controlling a vehicle in a less common, albeit important, driving situation—a vehicle stuck in a ditch.

This article presents a unique dynamic model of an idealized vehicle moving on an arbitrary ditch profile, the switching conditions for four regions of possible wheel-slip behavior, and a comparison of multiple reinforcement learning (RL) techniques to train the vehicle to get unstuck from the ditch while minimizing wheel-slip for both rear-wheel-drive (RWD) and all-wheel-drive (AWD) vehicle models. This is a different problem from the RL “mountain-car” scenario [2], as the dynamics model includes significantly more complexity, such as rigid-body vehicle dynamics and wheel-slip, so as to better emulate solving this problem for a real-world scenario (see Sec. 2). In contrast, the “mountain-car” problem relies on a point-mass assumption and a continuous dynamics model. The reward function design and the challenges in training an agent to avoid wheel-slip using a discontinuous dynamics model are significantly different from previous approaches [3–5]. It is likely that the suspension, tires, drive-train, and perhaps other mechanisms influence the performance of a vehicle stuck in a ditch. There are many different suspension designs, drive-trains, and tire models, and these can vary significantly from vehicle to vehicle. Thus, this article focuses on the dominant effects of rigid body dynamics, wheel-slip, and ditch shape, but does not include the dynamic effects of a specific vehicle, such as the compliance of a specific vehicle suspension or tires, in an effort to provide a basis of comparison for future studies. In our previous work in Ref. [6], a vehicle model was developed that did not consider any wheel-slip, and the control problem was considered using human behavioral forcing instead of RL.

Many drivers have found themselves stuck in a ditch at one time or another. The severity of this situation can be compounded by issues, such as lack of cell reception (inability to call a towing service), visibility issues (such as at night), inclement weather conditions, and low-traffic roads (less likely that someone would stop and help). In Ref. [7], the correlation between ditches and car accidents was considered, with the finding that 90% of ditch accidents occur in rural areas. Thus, having a vehicle stuck in a ditch is both a safety concern and a great inconvenience.

When the assistance of a tow vehicle is unavailable, getting a vehicle unstuck is often accomplished by the assistance of human force, with companions pushing behind the vehicle as the driver applies the gas pedal. However, the combination of static human force and torque applied to the wheels is generally insufficient to achieve the desired goal. Instead of applying a static force, a dynamic force is applied rhythmically to the vehicle (similar to pushing a child on a swing), so that the vehicle builds up momentum and achieves escape from the ditch without requiring the substantial applied force of a tow-truck. For the increased safety of occupants and a greater possibility of achieving escape from the ditch, it is desirable that a vehicle would be able to autonomously escape the ditch without human intervention.

Many different types of vehicle dynamics models exist in the literature. For example, some models seek to understand complex problems such as steering, tire deformation, suspension, and braking, and use many degrees of freedom (DOFs). A comprehensive survey of different vehicle dynamics applications was presented in Ref. [8]. This survey focused primarily on automotive suspension systems, worst-case maneuvering, minimum-time maneuvering, and driver modeling, while citing 185 references. However, in the minimum-time maneuvering problem, the applications were focused on minimum track time for racing, whereas the present article considers escaping from a ditch while minimizing wheel-slip. Reference [9] summarized advancements in the study of vehicle dynamics across a range of vehicle, tire, and driver models while also noting the need to further develop nonlinear dynamic models for vehicles. A vehicle dynamics prediction module for public-road maneuvering was presented in Ref. [10], with a primary emphasis on highway-speed maneuvering. In Ref. [11], several benchmarks for vehicle dynamics problems were considered for both rail and road vehicles, with a particular focus on studying wheel-slip and lateral dynamics.

As mentioned previously, the demand for vehicle automation and innovative optimal control solutions has been a strong motivation for further understanding of vehicle dynamics. Some researchers sought to develop vehicle control strategies that perform well in hazardous scenarios, which is a goal similar to the problem investigated in this article. In Ref. [12], a modified fixed-point control allocation scheme was implemented in a Simulink CarSim simulation to test braking during high-speed double lane changing on slippery roads and hard braking with an actuator failure. In Ref. [13], a coordinate control system involving electronic stability control, active roll control, and engine torque control was used to maximize driver comfort. A linearized 2 DOF dynamics model was used in Ref. [14] to develop an adaptive optimization based second-order sliding mode controller. For modeling the controller, the authors assumed the vehicle velocity while turning was pseudo-constant and the steering and side-slip angles were small. In Ref. [15], the authors proposed a three-dimensional state, including steering and tire force as well as longitudinal vehicle dynamics (position and velocity) as key inputs to a control design that involves synthesizing control approaches using a proportional controller.

Some research has been performed in the area of avoiding hazardous terrain autonomously, such as Ref. [16], which focused on path planning to avoid discrete obstacles. Reference [17] proposed the use of LiDAR to detect hazardous terrain, such as ditches, with the intention of avoidance for autonomous land vehicles. In Ref. [18], navigation of an autonomous vehicle through hazardous terrain is considered using imitation learning from expert demonstration of a task. While these articles are useful for off-road applications, the current article is concerned with the safe exit from a ditch, rather than the avoidance of it altogether.

Since most of the vehicle dynamics models in the literature have focused primarily on everyday driving situations, they have also assumed a flat surface profile, which is typical for most roads. Since the purpose of this research is to address the situation of a vehicle stuck in a ditch, an arbitrary surface profile was assumed. A single-track vehicle model moving on a smooth surface was presented in Refs. [19,20], but without a mathematical derivation or validation of the model through simulation or experiment. Single-track vehicle dynamics were considered in Ref. [21] as well, where the authors derived the dynamics of a cart that is being excited by a moving base using Lagrange’s method. They included results from an earthquake response simulation. A similar dynamics problem of a ball rolling on a two-dimensional potential surface was shown in Ref. [22], with a resulting dynamic model that appears similar in form to the dynamics model presented in this article.

This article derives a dynamic model for a vehicle moving on an unknown surface profile (which allows the possibility of simulating vehicle behavior on any continuous ditch shape) and will consider four different cases of wheel-slip for the vehicle: (1) no wheels are slipping, (2) both rear and front wheels are slipping, (3) the rear wheels are slipping and the front wheels are not slipping, and (4) the rear wheels are not slipping and the front wheels are slipping. In addition, this article derives the terminal conditions for these four slip cases and describes a simulation method to accurately switch between these cases. To develop a control policy for achieving escape from a ditch while minimizing wheel-slip, two different RL methods are used and their results compared.

## 2 Relevant Reinforcement Learning Background

A more recent control approach that will be applied in this article is RL, and a brief description is included here. The core RL algorithm is composed of two primary functions: the environment and the agent (see Fig. 1). The environment provides the state and corresponding reward achieved based on a given action. The agent uses a control policy *π* to determine the action based on the state and reward observed from the environment. RL seeks to answer the question: “What action should be taken to maximize the expected long-term reward?” Typically, the reward function is designed in such a way that the algorithm will make decisions that direct the environment toward a desired goal.
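The agent-environment loop described above can be sketched in a few lines of Python. This is a minimal illustration only: `ToyEnvironment`, `always_right_policy`, and `run_episode` are hypothetical names, the one-dimensional task is a stand-in rather than the vehicle-ditch model, and the fixed policy is not a learned one.

```python
class ToyEnvironment:
    """Hypothetical 1D environment: the goal is to reach position >= 5."""

    def __init__(self):
        self.position = 0

    def step(self, action):
        """Apply an action (-1 or +1) and return (state, reward, done)."""
        self.position += action
        done = self.position >= 5
        reward = 0.0 if done else -1.0  # every step short of the goal costs 1
        return self.position, reward, done


def always_right_policy(state):
    """A fixed policy pi(state) -> action; an RL agent would learn this map."""
    return 1


def run_episode(env, policy, max_steps=100, gamma=1.0):
    """Roll the agent-environment loop forward and sum the discounted reward."""
    state, total, discount = env.position, 0.0, 1.0
    for _ in range(max_steps):
        action = policy(state)                  # agent: state -> action
        state, reward, done = env.step(action)  # environment: action -> state, reward
        total += discount * reward
        discount *= gamma
        if done:
            break
    return state, total


final_state, episode_return = run_episode(ToyEnvironment(), always_right_policy)
```

An RL algorithm would adjust the policy between episodes so that the expected return increases; that update step is intentionally left out of this sketch.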

The vehicle-ditch problem was considered well suited for RL control for two reasons. First, RL can achieve good results while not needing to know the exact model of a complex dynamic system. This is particularly useful when the system dynamics are difficult to model analytically or when a control approach is data driven, instead of based on a model. The discontinuous vehicle dynamics model with four different regions of wheel-slip behavior fits this category well. Some complex control examples, such as the control of a nonlinear turbo-generator system in Ref. [23] and an optimal tracking control problem in Ref. [24], are solved using RL without prior knowledge of the system dynamics.

The second reason RL is well suited for the vehicle-ditch scenario is that it has the ability to explore many combinations of states and actions and can achieve good controllability even for systems with control constraints. A practical example of a control-constrained system that is similar to the vehicle-ditch scenario is that of a parent pushing a child on a swing. The parent may not be able to exert enough effort in one push to send the child as high as the child may want to swing. However, by timing repeated pushes in such a way as to build the child’s momentum, the desired height for swinging can be obtained. Classic control methods, such as proportional-integral-derivative (PID) and linear-quadratic-regulator (LQR), encounter many difficulties when applied to control-constrained dynamic systems, and this will be discussed further in Sec. 5. The ability of RL to effectively control control-constrained systems is particularly useful for the vehicle-ditch problem, since in a real-world scenario, humans have to effectively time their pushing of a vehicle in combination with the driver applying the gas pedal to achieve escape from the ditch. Hence, the vehicle is often control constrained for these real-life scenarios as well. In this article, we will apply two different RL techniques, Probabilistic Inference for Learning COntrol (PILCO) and deep deterministic policy gradient (DDPG), to control the discontinuous vehicle dynamics model to achieve escape from a ditch. Each of these algorithms will be explained briefly.
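The swing analogy can be made concrete with a small simulation. The sketch below is an illustration under stated assumptions: it uses a simple torque-limited pendulum rather than the vehicle model, the function name `peak_swing_angle` and all parameter values are hypothetical, and the hand-written timing rule (push in the direction of motion) is not an RL policy.

```python
import math


def peak_swing_angle(u_max, timed, duration=20.0, dt=1e-3,
                     g=9.81, length=1.0, mass=1.0):
    """Largest |angle| (rad) reached by a torque-limited pendulum started at rest."""
    theta, omega, peak = 0.0, 0.0, 0.0
    for _ in range(int(round(duration / dt))):
        if timed:
            # Push with the motion, like a parent timing pushes on a swing.
            u = u_max if omega >= 0.0 else -u_max
        else:
            u = u_max  # constant push in one direction
        alpha = -(g / length) * math.sin(theta) + u / (mass * length ** 2)
        omega += alpha * dt  # semi-implicit Euler step
        theta += omega * dt
        peak = max(peak, abs(theta))
    return peak


# Torque limit set to 10% of the torque needed to hold the pendulum horizontal.
u_max = 0.1 * 1.0 * 9.81 * 1.0
static_angle = math.asin(0.1)  # angle this torque could hold statically

constant_peak = peak_swing_angle(u_max, timed=False)  # stays near twice static_angle
timed_peak = peak_swing_angle(u_max, timed=True)      # grows far beyond it
```

With the same torque limit, the constant push only oscillates about the small static angle, whereas the timed push adds energy on every half-swing. A learned policy for a control-constrained system has to discover this kind of timing on its own.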

Probabilistic inference for learning control is an RL algorithm presented in Ref. [25] that uses a Gaussian process (GP) to create a surrogate model of the dynamics of a system. This algorithm attempts to learn an effective policy while reducing the number of trial episodes necessary to do so. An example of the application of PILCO to a control-constrained system can be found in Ref. [26], where PILCO was applied to a double-pendulum-cart system to achieve successful swing-up. In this article, we chose to use a MATLAB implementation of PILCO as one method for controlling the vehicle-ditch scenario. The application of this algorithm and its limitations will be presented in more detail in Sec. 5.

Deep deterministic policy gradient is part of a specific category of RL called deep RL [27], where deep neural networks are trained to approximate any one of the following: a value function (which ties long-term reward to actions), the control policy (which ties states to actions), or the system model (which updates the states and rewards for the system). This deep learning is particularly useful when the system is complex, and thus multiple layers of neural networks are needed to achieve an accurate approximation of one or more components of the RL structure. Deep learning techniques have been used to solve difficult control problems. For example, a DDPG RL technique was used in Ref. [28] to control a bicycle effectively. Double Q-learning was used in Ref. [29] to achieve autonomous driving that feels similar to a human.

Deep learning has also been applied extensively to control autonomous vehicles. In Ref. [30], a survey was presented of current deep RL methods for solving typical autonomous vehicle control problems, such as motion and path planning for roadway driving. A classic RL benchmark problem called “mountain-car” [3–5] is somewhat similar to the problem considered in this article: getting a vehicle unstuck from a ditch. However, “mountain-car” uses a simple control-constrained point-mass (the car) that is unable to reach the top of a mountain without applying RL. While “mountain-car” has been used as a good benchmark problem with which to test RL methods, it has not been considered as a control problem for real-world use. To solve the vehicle-ditch problem, we include rigid body dynamics, an arbitrary ditch profile, and the potential for slip to occur with either front or rear wheels using both RWD and AWD models. Our purpose is to provide insight into autonomously controlling a vehicle in such a hazardous scenario.

A detailed explanation of DDPG is beyond the scope of this background [27], but we chose to apply this RL algorithm since it has the ability to implement a continuous action space, which is most applicable to typical analog-signal control scenarios and because it is capable of controlling complicated systems due to its deep neural network structure. We implemented a neural network structure as defined in Ref. [31], since it showed good results across a variety of complex systems.

The remainder of this article will present the derivation of the discontinuous analytical model, simulation methods, and results from applying various RL techniques.

## 3 Derivation of Analytical Model

### 3.1 Dynamic System Description.

To understand the behavior of a vehicle moving on an arbitrary surface, the equations of motion (EOM) for the system must first be derived. This is done using Newtonian mechanics. A diagram of the system is shown in Fig. 2. To derive the EOM for the system represented in Fig. 2, we begin by defining the position vector for rigid body *K* as **r**_{K} (where *K* represents either wheel *A*, rigid body *M*, or wheel *B*). Vector components are further defined as $r_K^{(\hat{i})} = \mathbf{r}_K \cdot \hat{I}$ and $r_K^{(\hat{j})} = \mathbf{r}_K \cdot \hat{J}$, where $\hat{I}$ and $\hat{J}$ are coordinate vectors shown in Figs. 2–4. In addition, the rotational angle corresponding to rigid body *K* is defined as *θ*_{K}. The derivation of the analytical expressions for these position vectors, as well as their corresponding velocity and acceleration vectors ($\dot{\mathbf{r}}_K$, $\dot{\theta}_K$, $\ddot{\mathbf{r}}_K$, and $\ddot{\theta}_K$), is included in Appendix A for completeness. The position coordinate for this system is *x*, *y*(*x*) is the function describing the shape of the ditch surface, and *y*_{K,x} is the derivative with respect to *x* of *y*(*x*) evaluated at the contact point of wheel *K* with the surface. For the key dimensions of the vehicle, *R* is the radius of the wheels, *l* is the length of *M*, and *x*_{c} and *y*_{c} describe the position of the center of mass of *M* with respect to the left-hand lower corner of body mass *M* shown in Fig. 2.

This derivation will develop the EOM for this vehicle and provide the state-space dynamics for four possible cases of traction the vehicle experiences with the surface: (1) neither wheel *A* nor wheel *B* is slipping, (2) both wheels *A* and *B* are slipping, (3) wheel *A* is slipping and wheel *B* is not slipping, and (4) wheel *A* is not slipping and wheel *B* is slipping. The subscript _{1} will be used to denote the first case, the subscript _{2} will be used to denote the second case, and so on. The subscript _{n} will be used to denote any of the four cases.

When the vehicle is in case 1, wheels *A* and *B* are assumed to be in perfect traction with surface *y*(*x*). Thus, *θ*_{A,n} and *θ*_{B,n} are functions of *x*_{n}, and a single DOF, *x*_{n}, describes the behavior of the vehicle in this condition. If wheel *A* loses traction, the vehicle transitions to case 3, where there is no direct relationship between *θ*_{A,n} and *x*_{n}, and thus, an additional DOF, *θ*_{A,n}, is introduced to the system due to a spinning or sliding wheel *A*. Similarly, if both wheels *A* and *B* lose traction, the vehicle is in case 2, where there is no direct relationship between *θ*_{A,n} and *x*_{n} or between *θ*_{B,n} and *x*_{n}, and thus, two additional DOFs, *θ*_{A,n} and *θ*_{B,n}, are introduced to the system.

Since the DOFs change depending on the case *n*, a state-space model of this discontinuous dynamic system would also change size. It is necessary to have a state-space model that does not change size depending on *n* to make switching between cases possible during numerical integration. Uniformity in the size of the state-space model between all four cases is accomplished by setting the state-space size to the maximum it would be for any of the four cases and augmenting the smaller state-spaces to include the maximum DOFs. For instance, with case 1, the state-space model would only depend on *x*_{n} and $\dot{x}_n$. In case 2, the state-space model would include *x*_{n}, $\dot{x}_n$, *θ*_{A,n}, $\dot{\theta}_{A,n}$, *θ*_{B,n}, and $\dot{\theta}_{B,n}$. Since case 2 includes all possible DOFs for this vehicle model, the state-spaces for the other cases are augmented to include these states as well. This is explained further in the following sections.

### 3.2 Governing Equations.

The governing equations are obtained by summing the forces **F**_{K} and torques *T*_{K} on each rigid body *K*: $\sum \mathbf{F}_K = m_K \ddot{\mathbf{r}}_K$ and $\sum T_K = I_K \ddot{\theta}_K$, where *I*_{K} and *m*_{K} are the moment of inertia and mass of rigid body *K*, respectively. These forces and torques are portrayed in Figs. 3(a), 3(b), and 4, where *F*_{F,K} is the friction force, *F*_{N,K} is the normal force, *F*_{g,K} is the gravitational force, and *τ*_{K} is a rotational torque acting on rigid body *K*. Also, *α*_{K} is the angle from the horizontal of wheel *K* and Φ_{K} describes the curvature of the surface *y*(*x*) at the contact point with wheel *K* (see Eq. (11) in Appendix A). Finally, internal forces on wheel *K* are represented by *K*_{x} and *K*_{y}. The equations describing the motion of these rigid bodies are as follows:

The gravitational forces in Eqs. (1)–(9) are defined as *F*_{g,K} = *m*_{K}*g*, where *g* is the acceleration due to gravity. Angle *α*_{K} is related to the slope of *y*(*x*) at the contact point of wheel *K* with the surface by tan *α*_{K} = *y*_{K,x}, and thus, $\cos\alpha_K = 1/\sqrt{1 + y_{K,x}^2}$ and $\sin\alpha_K = y_{K,x}/\sqrt{1 + y_{K,x}^2}$.
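For completeness, the step from the slope relation to the cosine and sine expressions can be written out as follows. The positive square root is taken on the assumption that $|\alpha_K| < \pi/2$, which holds whenever the surface is described by a single-valued function *y*(*x*).

$$
1 + \tan^2\alpha_K = \frac{1}{\cos^2\alpha_K}
\;\Rightarrow\;
\cos\alpha_K = \frac{1}{\sqrt{1 + y_{K,x}^2}},
\qquad
\sin\alpha_K = \tan\alpha_K \cos\alpha_K = \frac{y_{K,x}}{\sqrt{1 + y_{K,x}^2}}
$$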

The complex analytical expressions for $\ddot{\mathbf{r}}_K$ and $\ddot{\theta}_{K,n}$ (included in Appendix A) must be substituted into Eqs. (1)–(9) to solve them. Given the complexity of this dynamic system for each of the four cases presented, Maple software was used to obtain the exact state-space form for each case. The Maple files used to derive this system can be found in the data repository for this article. In Secs. 3.3–3.6, the method for deriving each case will be presented and the general form for each state-space will be provided. For the simplest case, case 1 (see Sec. 3.3), a more complete derivation will be presented to demonstrate the steps needed to derive the other state-spaces.

### 3.3 Case 1: Neither Wheel *A* nor Wheel *B* Is Slipping.

In case 1, the vehicle has a single DOF, *x*_{1}. Hence, Eqs. (1)–(9) can be reduced to a single equation dependent on *x*_{1}. Since there is no wheel-slip occurring, *θ*_{A,1} and *θ*_{B,1} are dependent on *x*_{1}. Thus, instead of using Eqs. (3) and (6) to solve for $\ddot{\theta}_{A,1}$ and $\ddot{\theta}_{B,1}$, the wheel friction force *F*_{F,K} is treated as the dependent variable. This relationship is used to reduce Eqs. (1)–(9) to a single equation dependent on *x*_{1}. The result is expressed as follows:

In this expression, *H*_{1}, *J*_{1}, *T*_{A,1}, and *T*_{B,1} are all nonlinear functions of *x*_{1} specific to case 1, and all parameters of this form used later in this article are also nonlinear functions of *x*_{n} pertaining to a given case *n*. A superscript on any of the parameters *H*_{n}, *J*_{n}, *T*_{A,n}, or *T*_{B,n} does not denote exponentiation; it marks either a normal force parameter (e.g., $H_n^N$) or an angular acceleration parameter (e.g., $H_n^\theta$).

Although the state-space for case 1 depends only on *x*_{1} and $\dot{x}_1$, the extra DOFs that will arise from different cases must be included as well, as mentioned previously. This means that the state-space for case 1 must be artificially augmented to include *θ*_{A,n}, $\dot{\theta}_{A,n}$, *θ*_{B,n}, and $\dot{\theta}_{B,n}$. Solutions for $\ddot{\theta}_{A,1}$ and $\ddot{\theta}_{B,1}$ for the case where a given wheel is not slipping (see Eqs. (13) and (20) in Appendix A, respectively) are included here for clarity in the state-space derivation:

The states *z*_{1}–*z*_{6} are *x*_{n}, $\dot{x}_n$, *θ*_{A,n}, $\dot{\theta}_{A,n}$, *θ*_{B,n}, and $\dot{\theta}_{B,n}$, respectively. This collection of states will be referred to as **z** in future sections.

The state-space for case 1 requires traction between wheels *A* and *B* and the surface. Thus, the conditions that would make either wheel slip are the terminal conditions for this state-space. Either wheel slips when the friction force needed to maintain traction with the surface exceeds the product of the static friction coefficient, *μ*_{s}, and the normal force acting on the wheel. These conditions are as follows:

Note that if either event, Eq. (17) or Eq. (18), occurs, the system transitions to a different case. If only Eq. (17) occurs, the system transitions to case 3; if only Eq. (18) occurs, it transitions to case 4; and if both Eqs. (17) and (18) occur, it transitions to case 2.
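The transition rule out of case 1 can be summarized in a short Python sketch. It assumes that the two terminal conditions compare the magnitude of each wheel's required friction force against the static friction limit, as described in the text; the function name, argument names, and the use of an absolute value are illustrative choices, not the article's Eqs. (17) and (18).

```python
def case_after_traction_check(f_fric_a, f_norm_a, f_fric_b, f_norm_b, mu_s):
    """Return the slip case (1-4) implied by the static friction limit.

    f_fric_* : friction force required to keep wheel A or B rolling without slip
    f_norm_* : normal force on that wheel
    mu_s     : static friction coefficient
    """
    wheel_a_slips = abs(f_fric_a) > mu_s * f_norm_a
    wheel_b_slips = abs(f_fric_b) > mu_s * f_norm_b
    if wheel_a_slips and wheel_b_slips:
        return 2  # both wheels slipping
    if wheel_a_slips:
        return 3  # wheel A slipping, wheel B in traction
    if wheel_b_slips:
        return 4  # wheel A in traction, wheel B slipping
    return 1      # neither wheel slipping


# Example: wheel A demands more friction than the surface can supply.
new_case = case_after_traction_check(
    f_fric_a=4000.0, f_norm_a=5000.0,
    f_fric_b=1000.0, f_norm_b=5000.0,
    mu_s=0.6,
)
```

In the example call the limit is 0.6 × 5000 = 3000 for each wheel, so only wheel *A* exceeds it and the returned case is 3.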

The friction forces at wheels *A* and *B* can be obtained by substituting the solutions for $\ddot{\theta}_{A,1}$ and $\ddot{\theta}_{B,1}$ in Eqs. (12) and (13), respectively, into Eqs. (3) and (6) to obtain

Similarly, the normal forces at wheels *A* and *B* are needed to evaluate the terminal conditions listed in Eqs. (17) and (18). These forces are obtained by solving Eqs. (1)–(9) for *F*_{N,A,1} and *F*_{N,B,1}, resulting in

The state-space and terminal conditions for case 1 have now been developed. Since the complete EOMs and equations for normal forces for the other three cases are lengthy, we chose to present the derivation in detail only for the first case. For the remaining cases, a briefer derivation is provided.

### 3.4 Case 2: Both Wheels *A* and *B* Are Slipping.

Since both wheels are slipping, the relationships between *θ*_{A,n} and *x*_{n} or between *θ*_{B,n} and *x*_{n} that existed for case 1 are no longer valid. The friction force at wheel *K* is modeled as *F*_{F,K,2} = *μ*_{K}*F*_{N,K,2}, where *μ*_{K} is the dynamic friction coefficient between wheel *K* and the surface. In this case, there are three DOFs (*x*_{n}, *θ*_{A,n}, and *θ*_{B,n}), so there is no need to augment the state-space for this case to include any additional states. The following EOM dependent on *x*_{2} is derived from Eqs. (1)–(9):

Similar EOMs are obtained for *θ*_{A,2} and *θ*_{B,2}:

The state-space for case 2 requires that both wheels *A* and *B* be slipping against the surface. Thus, the conditions that would make either wheel stop slipping are the terminal conditions for this state-space. Either wheel stops slipping when the relative velocity between the wheel and the surface, *v*_{r,K}, becomes zero. These conditions are obtained from Eqs. (12) and (18) and are as follows:

### 3.5 Case 3: Wheel *A* Is Slipping and Wheel *B* Is Not Slipping.

Since wheel *A* is slipping, the relationship between *θ*_{A,n} and *x*_{n} that existed for case 1 is no longer valid. As with case 2, the friction force at wheel *A* is modeled as *F*_{F,A,3} = *μ*_{A}*F*_{N,A,3}. However, since wheel *B* is not slipping, the relationship between *θ*_{B,n} and *x*_{n} described in Eq. (13) is valid. This information and Eqs. (1)–(9) are used to get EOMs for *x*_{3} and *θ*_{A,3}:

For wheel *A* to stop slipping, the relative velocity between wheel *A* and the surface, *v*_{r,A}, must be zero. For wheel *B* to start slipping, the friction force acting on wheel *B* must exceed the product of the static friction coefficient and the normal force at wheel *B*. These two conditions are shown as follows:

### 3.6 Case 4: Wheel *A* Is Not Slipping and Wheel *B* Is Slipping.

Since wheel *B* is slipping, the relationship between *θ*_{B,n} and *x*_{n} that existed for cases 1 and 3 is no longer valid. The friction force at wheel *B* is modeled as *F*_{F,B,4} = *μ*_{B}*F*_{N,B,4}. However, since wheel *A* is not slipping, the relationship between *θ*_{A,n} and *x*_{n} described in Eq. (12) is valid. This information and Eqs. (1)–(9) are used to get EOMs for *x*_{4} and *θ*_{B,4}:

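For wheel *A* to start slipping, the friction force acting on wheel *A* must exceed the product of the static friction coefficient and the normal force at wheel *A*. For wheel *B* to stop slipping, the relative velocity between wheel *B* and the surface, *v*_{r,B}, must be zero. These two conditions are shown as follows: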

This concludes the derivation of the state-spaces for the four possible cases for surface contact between the planar vehicle model and the ditch surface profile. The terminal conditions for each state-space have been described as well as the transitions to different cases. In Sec. 4, a method for numerically simulating this discontinuous dynamic system and intelligently switching between each of the four cases will be presented.

## 4 Simulation

The ditch profile *y*(*x*) used for simulation is defined by the parameters *a* = 3, *b* = 16/225, and *c* = 1/1225. In practice, this function *y*(*x*) can be any smooth function that has a radius of curvature greater than the wheel radius *R* and does not approach ±∞ in the simulation region of interest. For the physical dimensions and properties of the vehicle, real data for a 1998 F-150 pickup truck were obtained from the National Highway Traffic Safety Administration Vehicle Research and Test Center and used in the analytical model [32]. This vehicle was chosen because it was one of the few vehicles with relevant moment-of-inertia and center-of-mass data readily available. Typical tire and wheel sizes for this truck were used to derive mass and moment-of-inertia parameters for the wheels. The scale of this scenario is shown in Fig. 5, and a list of the physical vehicle parameters is found in Table 1.
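The curvature requirement on *y*(*x*) can be checked numerically before a profile is used. The sketch below is one possible check under stated assumptions: the function names are hypothetical, the two parabolic profiles are illustrative stand-ins (not the ditch profile used in this article), and the profile derivatives are supplied analytically by the caller.

```python
import math


def radius_of_curvature(dy, d2y):
    """rho = (1 + y'^2)^(3/2) / |y''|; infinite where the profile is locally straight."""
    if d2y == 0.0:
        return math.inf
    return (1.0 + dy * dy) ** 1.5 / abs(d2y)


def profile_is_admissible(dy_fn, d2y_fn, x_min, x_max, wheel_radius, samples=1001):
    """True if the sampled radius of curvature exceeds the wheel radius everywhere."""
    step = (x_max - x_min) / (samples - 1)
    for i in range(samples):
        x = x_min + i * step
        if radius_of_curvature(dy_fn(x), d2y_fn(x)) <= wheel_radius:
            return False
    return True


R = 0.3675  # wheel radius in meters, from Table 1

# Gentle hypothetical profile y = 0.05 x^2: minimum radius of curvature is 10 m.
gentle_ok = profile_is_admissible(lambda x: 0.1 * x, lambda x: 0.1, -5.0, 5.0, R)

# Sharp hypothetical profile y = 2 x^2: radius of curvature is 0.25 m at x = 0.
sharp_ok = profile_is_admissible(lambda x: 4.0 * x, lambda x: 4.0, -5.0, 5.0, R)
```

A sampled check like this can miss a very narrow feature between sample points, so the sample count should be chosen with the smallest expected feature of the profile in mind.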

Table 1 Physical vehicle parameters

Parameter | Value |
---|---|
m_{A}, m_{B} | 51 kg |
m_{M} | 2039.25 kg |
I_{A}, I_{B} | 3.3702 kg m^{2} |
I_{M} | 5091 kg m^{2} |
R | 0.3675 m |
l | 3.517 m |
x_{c} | 2.0520 m |
y_{c} | 0.3335 m |

The primary challenge in simulating this discontinuous dynamic model is the transition between the four state-spaces shown in Eqs. (16), (28), (34), and (42). Typically, simulating a system of continuous ordinary differential equations is straightforward using either a Runge–Kutta method or another numerical integration tool, such as MATLAB’s ode45 function. However, for this model, in addition to integrating the state-space for each case, the exact moment at which the terminal condition for the state-space occurs must be solved for so that the switch to a new state-space happens at the correct moment. For instance, if the vehicle starts out operating in case 1, it will continue in case 1 until either of the terminal conditions for case 1, Eq. (17) or (18), occurs. If Eq. (17) occurs, the vehicle will transition to case 3. Thus, the state-space must be switched from case 1 to case 3 and integration continued until either of the terminal conditions for case 3 occurs and the case changes again, and so on.

Initially, MATLAB’s ode45 function was used in conjunction with a custom event function to solve the state-space up until the exact moment the terminal event occurred. However, there was an issue with this quasi-black-box approach. Solving for the time at which a terminal condition is reached is accomplished by checking an event function for a zero-crossing and then iterating with a numerical root-finding method to solve for the exact moment the terminal condition occurs. The ode45 event location feature does not have the ability to stop integration after a certain number of calls to the event function have been made in an attempt to locate the terminal condition. This was discovered to be an issue after noticing that the event location feature would occasionally find a zero’s approximate location, but instead of homing in on its exact location, it continued to cross the zero-point back and forth indefinitely. To better simulate the dynamic model, a Newton–Raphson routine was created to solve for the terminal event for a state-space, with the provision that the simulation is terminated if convergence is not achieved within a designated number of iterations. It was feasible to use a Newton–Raphson method instead of a secant method since analytical expressions for the time derivatives of the terminal conditions were available. Pseudo-code outlining the process for evaluating this dynamic system is shown in Algorithm 1.
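The idea of a Newton–Raphson event search with an iteration cap can be illustrated with a short Python sketch. It is a generic scalar example, not the routine used in this article: the function name `locate_event`, the tolerance, the iteration limit, and the two test functions are all assumptions chosen for illustration.

```python
import math


def locate_event(event, d_event, t_guess, tol=1e-10, max_iter=25):
    """Newton-Raphson search for event(t) = 0 using the analytical derivative.

    Returns (t, converged). The caller decides what to do on failure,
    for example terminating the simulation.
    """
    t = t_guess
    for _ in range(max_iter):
        e = event(t)
        if abs(e) < tol:
            return t, True
        slope = d_event(t)
        if slope == 0.0:  # Newton update is undefined at a stationary point
            return t, False
        t -= e / slope
    return t, abs(event(t)) < tol


# Well-behaved event function: cos(t) crosses zero at t = pi/2.
t_event, found = locate_event(math.cos, lambda t: -math.sin(t), t_guess=1.5)

# Event function with no real root: the iteration cap stops the search
# instead of letting it wander back and forth indefinitely.
_, found_none = locate_event(lambda t: t * t + 1.0, lambda t: 2.0 * t, t_guess=0.5)
```

Returning a convergence flag rather than looping until success is what allows a failed search to end the simulation cleanly.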

### Single time-step integration

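**Input** $n, \mathbf{z}_m, \tau_A, \tau_B, t_m, \Delta t, \mu_s$

**Output** $n, \mathbf{z}_{m+1}, t_{m+1}$

1 $t = t_m$, $\mathbf{z} = \mathbf{z}_m$;

2 $F_{F,A}, F_{F,B}, F_{N,A}, F_{N,B} \leftarrow \text{getforces}(n, \mathbf{z}, \tau_A, \tau_B)$;

3 $n \leftarrow \text{checkstartingcase}(F_{F,A}, F_{F,B}, F_{N,A}, F_{N,B}, \mu_s)$;

4 **while** $t < t_m + \Delta t$ **do**

5 $t_r, \mathbf{z}_r \leftarrow \text{integrateode}(n, \mathbf{z}, \tau_A, \tau_B, t, t_m + \Delta t)$;

6 $\mathbf{e}_1, \mathbf{e}_2 \leftarrow \text{getevents}(n, \mathbf{z}_r, \tau_A, \tau_B)$;

7 **if** $\text{zerocrossing}(\mathbf{e}_1, \mathbf{e}_2)$ **then**

8 $t_{nr} \leftarrow \text{guess}(\mathbf{e}_1, \mathbf{e}_2)$;

9 $t, \mathbf{z}, n \leftarrow \text{newtonraphson}(n, \mathbf{z}_r, t_{nr})$;

10 **else**

11 $t_{m+1} \leftarrow t_r(\text{end})$;

12 $\mathbf{z}_{m+1} \leftarrow \mathbf{z}_r(\text{end})$;

13 break

14 **end**

15 **end**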

Algorithm 1 shows the process for integrating the discontinuous dynamic model from Sec. 3 over a single time-step, accounting for switching between four slipping cases. The inputs are as follows: case *n*, initial conditions **z**_{m}, torque controls applied during the time-step *τ*_{A} and *τ*_{B}, starting time *t*_{m}, step-size Δ*t*, and static friction coefficient *μ*_{s}. The outputs are as follows: the slipping case at the end of the time-step *n*, the states of the system at the end of the time-step **z**_{m+1}, and the ending time *t*_{m+1}. These outputs then become the initial conditions, case condition, and starting time for the beginning of the next time-step of integration.

Lines 1–3 perform some initialization steps for the integration process. In particular, line 2 calculates the friction and normal forces acting on both wheels given the current torque actions. Since either of the torques could cause wheel-slip at the start of the time-step, line 3 checks whether this occurs and, if so, changes the slipping case to the correct one (see Algorithm 2 in Appendix B). Lines 4–15 are a while-loop that continues while the simulation time *t* is less than the ending time *t*_{m} + Δ*t*. Inside the while-loop, line 5 integrates the state-space for case *n* and outputs a refined mesh of times *t*_{r} and states **z**_{r} over the entire time-step. In line 6, the terminal conditions **e**_{1} and **e**_{2} for the current case *n* are calculated. Line 7 checks whether there was a zero-crossing in **e**_{1} or **e**_{2}. If there was a zero-crossing, a terminal event occurred, and it is necessary to solve for the exact moment the event occurred. In line 8, **e**_{1} and **e**_{2} are used to provide an initial guess for the event moment *t*_{nr}. The Newton–Raphson calculation on line 9 seeks to find the event, and if it does, it outputs the time *t* and states **z** at the terminal event. The simulation then returns to line 5 with an updated time *t*, initial conditions **z**, and the new slipping case *n*, and the loop continues. If the Newton–Raphson method does not converge, the simulation is considered to have failed and the simulation ends. If there was no zero-crossing, lines 11–12 output the updated time *t*_{m+1} and states **z**_{m+1} at the end of the time-step, and line 13 breaks the while-loop. This algorithm allows repeatable and accurate simulation of the discontinuous dynamic model.
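The control flow of Algorithm 1 can be sketched in Python on a toy problem. This is a simplified illustration under stated assumptions, not the simulation code of this work: a fixed-step Runge–Kutta integrator stands in for the ODE solver, each mode has one scalar event function rather than the pair **e**_{1}, **e**_{2}, and the two-mode system `TOY_MODES` (constant accelerations and speed thresholds) is hypothetical and unrelated to the vehicle model.

```python
def rk4_step(f, t, z, h):
    """One classical Runge-Kutta step of size h for dz/dt = f(t, z)."""
    def shifted(k, c):
        return [zi + c * ki for zi, ki in zip(z, k)]
    k1 = f(t, z)
    k2 = f(t + h / 2, shifted(k1, h / 2))
    k3 = f(t + h / 2, shifted(k2, h / 2))
    k4 = f(t + h, shifted(k3, h))
    return [zi + h / 6 * (a + 2 * b + 2 * c + d)
            for zi, a, b, c, d in zip(z, k1, k2, k3, k4)]


def integrate_time_step(n, z, t_m, dt, modes, substeps=20, max_iter=20, tol=1e-12):
    """Integrate one control time-step, switching modes at terminal events.

    modes[n] = (rhs, event, event_rate, next_mode). Returns (n, z, t).
    """
    t, t_end = t_m, t_m + dt
    while t < t_end:
        f, event, rate, next_n = modes[n]
        h = (t_end - t) / substeps
        switched = False
        for _sub in range(substeps):
            z_next = rk4_step(f, t, z, h)
            e0, e1 = event(z), event(z_next)
            if (e0 < 0.0 <= e1) or (e0 > 0.0 >= e1):  # zero-crossing
                s = h * e0 / (e0 - e1)                # linear initial guess
                for _it in range(max_iter):           # capped Newton-Raphson
                    z_s = rk4_step(f, t, z, s)
                    g = event(z_s)
                    if abs(g) < tol:
                        break
                    s -= g / rate(t + s, z_s)
                else:
                    raise RuntimeError("event location failed to converge")
                t, z, n = t + s, z_s, next_n          # restart from the event
                switched = True
                break
            t, z = t + h, z_next
        if not switched:
            return n, z, t_end                        # no event: step complete
    return n, z, t


# Hypothetical two-mode system with state z = [position, speed].
TOY_MODES = {
    1: (lambda t, z: [z[1], 1.0],   # mode 1: accelerate at 1
        lambda z: z[1] - 0.5,       # event: speed rises to 0.5
        lambda t, z: 1.0,           # time derivative of the event
        2),
    2: (lambda t, z: [z[1], -2.0],  # mode 2: decelerate at 2
        lambda z: z[1] - 0.1,       # event: speed falls to 0.1
        lambda t, z: -2.0,
        1),
}

n_end, z_end, t_end = integrate_time_step(1, [0.0, 0.0], 0.0, 1.0, TOY_MODES)
```

For this toy system the step starts in mode 1, switches to mode 2 at *t* = 0.5, switches back to mode 1 at *t* = 0.7, and ends in mode 1. As in Algorithm 1, the returned mode, state, and time would seed the next time-step.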

In addition, the continuous friction model from Ref. [33] was used for this simulation. This model can be seen in Fig. 6, where *μ*_{K} is a function of the relative velocity *v*_{r,K} and *K* denotes either wheel *A* or wheel *B*. This allows different friction coefficients to be applied to the front and rear wheels as a function of relative velocity. To assist the convergence of the Newton–Raphson method, this function incorporates a hyperbolic tangent function to smooth the discontinuity at *v*_{r,K} = 0.
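The smoothing idea can be sketched as follows. The functional form below is a generic tanh-regularized Coulomb model chosen for illustration and is not necessarily the model of Ref. [33]; the function name and the values of `mu_k` and `v_scale` are assumptions.

```python
import math


def smooth_friction_coefficient(v_rel, mu_k=0.6, v_scale=0.01):
    """Signed friction coefficient that varies smoothly through v_rel = 0.

    mu_k    : assumed kinetic friction coefficient (hypothetical value)
    v_scale : assumed velocity scale of the smoothing (hypothetical value)
    """
    return mu_k * math.tanh(v_rel / v_scale)
```

Because the function and its derivative are continuous at zero relative velocity, a gradient-based root finder such as Newton–Raphson does not encounter the jump of a sign-function friction law.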

## 5 Reinforcement Learning Control

As has been mentioned in Sec. 1, RL can be an effective tool for controlling complex dynamic systems even when they are control constrained. The vehicle model from Sec. 3 was intentionally control constrained by limiting the maximum applied torque to 700 N m. Thus, the simulated vehicle is not capable of simply applying unlimited positive torque and exiting the ditch. For all RL training, the parameters describing the ditch shape defined in Eq. (47) were set to the values described in Sec. 4. Three control scenarios are considered in this section: RWD with no wheel-slip, RWD with wheel-slip, and AWD with wheel-slip. In addition, the robustness of each of the resulting control policies will be examined at the end of this section.

Full-state feedback was allowed. This was deemed feasible since in the real world *θ*_{A,n} and *θ*_{B,n} could be measured using potentiometers. In addition, since *θ*_{M} is a function of *x*_{n} and could be measured using a gyroscope, *x*_{n} is considered at least partially observable. By using available sensing technologies for autonomous vehicles (such as LiDAR), *y*(*x*) could be observed and used to inform a controller of the best control approach for getting unstuck from the ditch.

It is useful to explain why classic control methods, such as PID and LQR, are incapable of controlling control-constrained systems, and in particular, the vehicle-ditch problem. These classic methods rely on measuring the error between a desired state and a measured state and computing a control effort that seeks to minimize this error. The fundamental issue with these methods is that they rely on assumptions of linearity in the system. When the available control effort, applied in a linear relationship to the state error, is not enough to reach the desired state (as in a control-constrained system), the best possible control solution that either PID or LQR can achieve is to saturate the control in the direction of the desired state. In Fig. 7, a saturated control of the maximum torque is applied to wheel *A* in the direction of the goal. However, the vehicle just continues to rock back and forth in the ditch with this constant control effort applied, without making any real progress toward the target state. While this is the best solution classic control methods can achieve, it is not useful for this problem due to its extremely poor performance.

### 5.1 Applying Reinforcement Learning Assuming a Rear-Wheel-Drive Model With No Wheel-Slip.

First, PILCO was applied to control the vehicle to achieve escape from the ditch using only RWD (torque is only applied to wheel *A*). One of the fundamental weaknesses of this algorithm is that it relies on a continuous dynamics model for simulation training. In addition, since PILCO uses a GP to build a surrogate model of the system dynamics, it cannot account for multiple different regions of behavior (i.e., cases 1–4) with a single GP model without nontrivial alterations to the core algorithm. Thus, to successfully implement PILCO, it was necessary to assume that neither the front nor the rear wheels slipped, so that the vehicle never left case 1. This control algorithm was used to emphasize the importance of considering wheel-slip in controlling the vehicle. The reward function used by PILCO was a positive Gaussian-shaped reward in the vicinity of the target state. The results after 14 training episodes (308 s of training experience) are shown in Fig. 8. In Fig. 8(a), the blue line illustrates the simulated response when assuming that the vehicle cannot slip. The vehicle in this case successfully reached the target state (dashed line) in approximately 20 s. The control torque profile generated using PILCO is shown in Fig. 8(b). However, when the PILCO control policy was applied to the complete dynamics model, the vehicle failed to achieve escape and fell back into the ditch, as shown by the red line. Since torque was not applied to wheel *B*, wheel *B* never slipped, so the red line changes only between case 1 and case 3. Wheel *A* slip (case 3) is denoted by the gray shaded regions.
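A Gaussian-shaped reward of the kind mentioned above can be sketched as follows. The function name, the choice of state components, and the width values are assumptions for illustration; the widths actually used in training are not stated here.

```python
import math


def gaussian_reward(state, target, widths):
    """Positive reward that peaks at 1 on the target and decays smoothly away.

    state, target : sequences of equal length, e.g. [position, velocity]
    widths        : assumed per-component length scales (hypothetical values)
    """
    sq_dist = sum(((s - g) / w) ** 2 for s, g, w in zip(state, target, widths))
    return math.exp(-0.5 * sq_dist)


# Example with an assumed target of position 5 m at rest.
r_on_target = gaussian_reward([5.0, 0.0], [5.0, 0.0], [1.0, 1.0])
r_far_away = gaussian_reward([-3.0, 0.0], [5.0, 0.0], [1.0, 1.0])
```

The reward is essentially zero far from the target, so the agent is not penalized for moving away from the goal while building momentum.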

A DDPG algorithm was applied to the same scenario as the PILCO implementation for comparison. The neural network structure was the same as the one implemented in Ref. [31], and training was implemented in MATLAB using the RL Toolbox. A positive reward function was structured in such a way as to "incentivize" successful achievement of the target state. Often, reward functions that are designed to achieve a target state penalize the system when it is far away from the target state, and the penalty falls to zero when the target state is reached. Here, a reward function was instead chosen that was zero when the system was far away from the target state and was shaped so that, as the system approached the target state, it achieved greater and greater rewards. It was not desirable to numerically penalize the vehicle for being far away from the goal, since the vehicle must build momentum by moving in the opposite direction of the goal at times. The reward function was shaped so that it increased as distance to the target (position error *e*_{x}) decreased. The reward was also dependent on velocity error $\dot{e}_x$ so that it increased as the vehicle slowed down near the target and provided a slight increase for building momentum at the bottom of the ditch. This reward shape *r*_{s} is shown in Fig. 9 and was designed to incentivize the vehicle to build enough momentum to exit the ditch and, additionally, to achieve a controlled stop at the target state. While ensuring a controlled stop was a more complex control objective, it is a reasonable safety concern, since in the real world it is not desirable that a vehicle exit the ditch in an uncontrolled manner and possibly incur an accident by heading into traffic.

Within 1600 training episodes, the agent was effectively trained to achieve escape with results quite similar to those achieved with PILCO (see Fig. 10). It should be noted that the significant disparity in number of training episodes needed between PILCO and DDPG is due to the fact that PILCO’s use of a surrogate dynamics model allows training to be achieved with significantly fewer training episodes than needed for deep neural network approaches. In Fig. 10(a), similar to Fig. 8(a), a comparison of the vehicle trajectories when assuming no wheel-slip (blue line) versus allowing wheel-slip (red line) can be seen. Figure 10(b) shows the applied control torque *τ*_{A} and the gray regions denote regions when wheel *A* slipped when wheel-slip was allowed.

On comparing Figs. 8 and 10, it is apparent that there is some similar behavior between the PILCO and DDPG control policies. The torque profiles have a similar shape as a result of intelligently building the vehicle’s momentum to achieve escape from the ditch. Both policies performed well, with PILCO achieving escape in 20 s and DDPG performing slightly better by achieving escape in 17 s. In addition, when these policies were applied to the complete dynamics model allowing for wheel-slip, the vehicle did not achieve escape due to significant wheel-slip, as shown by the gray regions of Figs. 8 and 10.

It is useful to consider what effect the starting position of the vehicle has on completing the objective for this control problem. The results shown in Figs. 8 and 10 correspond to a starting position *x*_{0} in the ditch of 0 m. To examine the potential effect of different starting positions, the DDPG agent was trained with random starting positions between −3 and 3 m. The time to achieve the target state *t*_{g} can be seen as a function of *x*_{0} in Fig. 11. Figure 11 shows a discontinuity in *t*_{g} at *x*_{0} ≈ 0.4 m. This is the result of the trained agent requiring one fewer oscillation in the ditch to achieve the target state for *x*_{0} > 0.4 m, and thus achieving the goal much faster (*t*_{g} < 13 s).

### 5.2 Applying Reinforcement Learning Assuming a Rear-Wheel-Drive Model with Wheel-Slip.

It is desirable for an RL policy to perform well even when wheel-slip is present, and so a DDPG policy was trained using a RWD dynamics model that allowed for wheel-slip. Since torque was not applied to wheel *B*, wheel *B* did not slip in this control scenario, but wheel *A* could slip since torque was applied to it.

*r*_{s} in Fig. 9. It was desired to penalize high relative velocities for wheel *A*, since that is effectively high-rate wheel-spin, and to penalize the condition of slipping. A total reward function was designed such that *r*_{t} = *r*_{s} − 0.001|*v*_{r,A}| − *r*_{c} (see Eq. (35)), where

*B* could not slip. The observation states that were used in training the DDPG agent were *x*_{n}, $\dot{x}_n$, *v*_{r,A,n}, and *n*. Figure 12 shows the performance of the trained DDPG agent after 2250 training episodes.

It is apparent that the control policy shown in Fig. 12 is significantly different in nature from those shown in Figs. 8 and 10. There are few gray regions on the plot, which indicate when wheel *A* was slipping. By considering the three narrow slip regions around *t* ≈ 2 s in Fig. 12, it can be seen that the agent learned to reverse torque direction rapidly to stop slipping. This happened again at *t* ≈ 4 s, *t* ≈ 7 s, and *t* ≈ 11 s. While switching torque directions this quickly is not physically possible on a vehicle, this was effectively "braking" the vehicle to regain traction with the surface. In practice, this control to avoid slipping could be applied via the vehicle brakes. Even though slip was incorporated in training this RWD control policy, the vehicle still achieved escape in about 17 s, which is comparable with the performance achieved when DDPG was applied while ignoring slip. The utility of RL is apparent here, as the DDPG agent was successfully trained to accomplish all desired control objectives: exit the ditch in a controlled manner while avoiding wheel-slip.

### 5.3 Applying Reinforcement Learning Assuming an All-Wheel-Drive Model With Wheel-Slip.

*A* and *B* and the system could be in any of the four wheel-slip cases discussed in Sec. 3. This was the most challenging and computationally intensive result to achieve. For this AWD scenario, it was desired that the vehicle exit the ditch while minimizing wheel-slip. Again, a modified reward function was needed to achieve this objective. For this scenario, a total reward function was designed such that *r*_{t} = *r*_{s} − 0.001|*v*_{r,A}| − 0.001|*v*_{r,B}| − *r*_{c}, where

Equation (49) was designed to heavily penalize the system for both wheels slipping (as in case 2), and to penalize less for either wheel slipping (as in cases 3 and 4), and to not penalize the system at all when no wheels slip (as in case 1). Training a DDPG agent to achieve an effective policy for this complex system was computationally intensive and took nearly 12,000 training episodes (several weeks of computing) to achieve an effective policy that accomplished the control objectives. The performance of this policy is shown in Fig. 13.
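The structure of the AWD total reward can be sketched in Python. The weight 0.001 on each relative velocity is taken from the text; the case-penalty magnitudes in `CASE_PENALTY` are hypothetical placeholders chosen only to respect the ordering described for Eq. (49) (largest for case 2, smaller for cases 3 and 4, zero for case 1), and the function name is an assumption.

```python
# Hypothetical penalty values; only their ordering follows the description
# of Eq. (49): case 2 (both wheels slip) > cases 3 and 4 > case 1 (no slip).
CASE_PENALTY = {1: 0.0, 2: 0.10, 3: 0.05, 4: 0.05}


def total_reward(r_s, v_rA, v_rB, n, slip_weight=0.001):
    """r_t = r_s - 0.001|v_rA| - 0.001|v_rB| - r_c for slipping case n."""
    r_c = CASE_PENALTY[n]
    return r_s - slip_weight * abs(v_rA) - slip_weight * abs(v_rB) - r_c


r_grip = total_reward(1.0, 0.0, 0.0, 1)     # no slip: reward is just r_s
r_spin = total_reward(1.0, 10.0, -20.0, 2)  # both wheels spinning
```

With these placeholder values, a state with both wheels spinning scores lower than the same state with traction, which is the gradient the agent is meant to follow.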

In Fig. 13, the yellow shaded region shows where wheel *A* was slipping, the pink shows where wheel *B* was slipping, and the green shows where both wheels *A* and *B* were slipping. Figure 13(a) shows the vehicle trajectory and Fig. 13(b) shows *τ*_{A} with a black line and *τ*_{B} with a dot-dashed red line. Similar to the results shown in Fig. 12, the control policy intelligently sought to avoid slipping and achieve escape from the ditch. At *t* ≈ 0.25 s, the vehicle momentarily entered case 4 when wheel *B* slipped, but the policy immediately corrected by reversing torque directions. At *t* ≈ 2 s, wheel *A* was slipping, and *τ*_{A} adjusted successfully to make the vehicle stop slipping. There is only one green region in Fig. 13, which means that the vehicle only once lost traction with both wheels *A* and *B*, and it is clear that the policy sought to correct that by reversing torque directions until control of both wheels was regained at *t* ≈ 6 s. Once escape from the ditch was achieved, *τ*_{A} and *τ*_{B} continued to be applied to maintain the vehicle position. Since the vehicle was in case 1 upon exiting the ditch, it was effectively a single DOF system at that point, which is why *τ*_{B} was maintained constant and *τ*_{A} was adjusting to maintain the vehicle position. The AWD vehicle achieved escape nearly 6 s faster than the RWD vehicle, highlighting the benefit of AWD for hazardous vehicle scenarios. From these results, it is clear that the DDPG policy effectively achieves escape from the ditch for the AWD scenario while minimizing wheel-slip.

### 5.4 Control Policy Robustness.

The control policies that were trained using RL were trained using one ditch shape, with *a*, *b*, and *c* held constant (see Eq. (47)). Due to the computational cost of training these policies, it was infeasible to train many different policies for numerous ditch shapes. It is therefore useful to examine how robust the trained policies are for ditch shapes other than the one they were trained with. To test the policies on various ditch shapes, parameter *a* was varied from 2 to 4 and parameter *b* from 0.05 to 0.1. The control policies were then applied for each combination of *a* and *b*. Parameters *a* and *b* influence the shape of the ditch such that as they increase, the ditch becomes narrower and steeper. The performance of the DDPG control policies for the RWD and AWD models with wheel-slip is shown in Figs. 14 and 15.
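The parameter sweep described above amounts to evaluating a fixed policy on a grid of ditch shapes. A minimal Python sketch follows; `evaluate_policy` is a hypothetical callable standing in for a full closed-loop simulation, assumed to return the escape time in seconds or `None` when the policy fails to escape, and the grid resolution is also an assumption.

```python
def linspace(lo, hi, count):
    """Evenly spaced values from lo to hi inclusive (count >= 2)."""
    return [lo + (hi - lo) * i / (count - 1) for i in range(count)]


def sweep_ditch_shapes(evaluate_policy, a_values, b_values):
    """Apply one trained policy to every (a, b) ditch shape on the grid."""
    return {(a, b): evaluate_policy(a, b) for a in a_values for b in b_values}


def failed_shapes(results):
    """Ditch shapes for which the policy did not achieve escape."""
    return [shape for shape, t_escape in results.items() if t_escape is None]


a_grid = linspace(2.0, 4.0, 5)    # range of a taken from the text
b_grid = linspace(0.05, 0.1, 6)   # range of b taken from the text
```

Each entry of the returned dictionary corresponds to one cell of a robustness map, and the shapes returned by `failed_shapes` correspond to cells where escape was not achieved.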

In Fig. 14(a), it can be seen that the control policy for the RWD model with wheel-slip succeeded in achieving escape from the ditch for a range of values for *a* and *b*. It is to be expected that large values of either *a* or *b* tend to result in poor policy performance, since those correspond to more challenging ditch shapes. This is confirmed by the white boxes, which indicate ditch shapes where the control policy failed to achieve escape from the ditch. For the least challenging ditch shapes, indicated by the lower left-hand corner of Figs. 14(a) and 14(b), the policy achieved escape most rapidly and without any slip. For the more challenging ditch shapes, escape time was longer and the chance of wheel-slip increased.

In Fig. 15, the robustness of the control policy for the AWD model with wheel-slip is demonstrated. Figure 15(a) shows the time to escape from the ditch for different values of *a* and *b*. The control policy performed well except for large values of *a*, which correspond to steeper ditch profiles. There was a significant escape time reduction for values of $a \lessapprox 3$, which is due to the vehicle not needing to move backwards to build momentum to achieve escape for those ditch shapes. In Figs. 15(b)–15(d), the percentages of the trajectory spent in wheel-slip cases 2–4 are shown, respectively. In these figures, it is useful to note the performance at the location marked with the red X (which corresponds to the values for *a* and *b* that the control policy was trained with). At the red X locations, the control policy performed well at avoiding slip for all of the wheels. However, the control policy struggled to avoid slip for some of the different ditch shapes, as shown by the color variation in the lower half of Fig. 15(c) and the upper right-hand side of Fig. 15(d).

## 6 Conclusions

This article presented a discontinuous dynamic model for an idealized vehicle moving on an arbitrarily-shaped ditch profile. This model allowed simulation of a vehicle on any continuous ditch shape and also accounted for four regions of wheel-slip. The complexity of simulating this dynamic system (switching between each of four state-spaces) was addressed through the use of a Newton–Raphson solver.

To achieve escape from the ditch, RL was explored as a means of generating an effective control policy for this discontinuous, control-constrained system. First, PILCO and DDPG were implemented on a RWD dynamic model while ignoring the possibility of wheel-slip. The resulting policies were not capable of achieving the control objective when applied while allowing wheel-slip, illustrating the need to incorporate this dynamic feature in training an RL agent. Second, DDPG was implemented on a RWD dynamic model with wheel-slip. The result was a policy that intelligently applied "braking" to stop the rear wheels from slipping. This policy successfully achieved escape from the ditch while minimizing wheel-slip. Finally, DDPG was implemented on the full AWD dynamic model with wheel-slip. This scenario was by far the most complex, as it required two control torques and had four possible regions of dynamic behavior. After 12,000 training episodes, the trained agent provided a policy that performed well, both by achieving escape from the ditch and by minimizing wheel-slip for both front and rear wheels. In addition, reward functions were designed for each of these three control scenarios in such a way as to achieve the desired outcome.

This article has sought to address a challenging hazardous vehicle scenario—a vehicle stuck in a ditch. While there has been great progress in vehicle automation for everyday driving, this article has sought to address a unique problem in vehicle automation by including rigid body dynamics, an arbitrary ditch profile, and the potential for slip to occur with either front or rear wheels using both RWD and AWD models. RL policies were successfully trained to control the discontinuous dynamics model in several configurations and the results compared. For this RL application, DDPG shows more promise due to its ability to implement a continuous action space as well as control different regions of dynamic behavior in a discontinuous model. In addition, the control policies generated using DDPG were demonstrated to be robust at achieving escape from the ditch for a wide range of ditch shapes.

Future work in applying RL to this problem should seek to develop an experimental implementation and additional simulation training on different vehicles. Additional modeling for particular vehicle components, such as suspension and tires, may be necessary for increased model accuracy and transferring a control policy from simulation to experiment. The data repository for this project is available and easily adaptable in order to foster additional study in the area of vehicle automation.

## Acknowledgment

Partial support from ARO W911NF-21-2-0117 is gratefully acknowledged.

## Conflict of Interest

There are no conflicts of interest.

## Data Availability Statement

The data and information that support the findings of this article are freely available.^{2}

### Appendix A: Derivation of Position, Velocity, and Acceleration Vectors

*A*, wheel *B*, and body mass *M*. These position vectors depend on *θ*_{M}, for which an expression will be derived later. The position vectors are derived from the geometry seen in Fig. 2, where

*A* and *B* are functions of *x*_{n}, and thus analytical expressions for *θ*_{A,n} and *θ*_{B,n} are

*x* has been used to describe the contact point of wheel *A* with the surface *y*(*x*). However, to define $\dot{\theta}_{B,n}$ and $\ddot{\theta}_{B,n}$, it is useful to define a temporary spatial coordinate *x*_{B}, which describes the contact point of wheel *B* with the surface *y*(*x*). By examining Eqs. (A4) and (A5), we can restate Eq. (A5) by a direct comparison to Eq. (A4) as follows:

*θ*_{M} can be obtained by modifying Eq. (A2) to include a parameter Δ_{l}, which describes the horizontal distance between the contact points of the two wheels with the surface; the result is

Δ_{l} can be solved for by solving the transcendental equation |**r**_{B} − **r**_{A}| − *l* = 0, using the expression for **r**_{B} shown in Eq. (A21). This enforces that the rigid body *M* be kept at a fixed length *l*. We can solve for *θ*_{M} by evaluating Eqs. (A21) and (A1) at values of *x* in

*θ*_{M} varies spatially with *x*. Thus, the angular velocity and acceleration of rigid body *M* can be computed as follows:

### Appendix B: Algorithm for Checking Starting Case

**Input** $F_{F,A}, F_{F,B}, F_{N,A}, F_{N,B}, \mu_s$

**Output** $n$

1 **switch** $n$ **do**

2 **case 1**

3 **if** $|F_{F,A}| > \mu_s F_{N,A}$ **then**

4 $n \leftarrow 3$

5 **end**

6 **if** $|F_{F,B}| > \mu_s F_{N,B}$ **then**

7 $n \leftarrow 4$

8 **end**

9 **if** $|F_{F,A}| > \mu_s F_{N,A}$ **and** $|F_{F,B}| > \mu_s F_{N,B}$ **then**

10 $n \leftarrow 2$

11 **end**

12 **end**

13 **case 3**

14 **if** $|F_{F,B}| > \mu_s F_{N,B}$ **then**

15 $n \leftarrow 2$

16 **end**

17 **end**

18 **case 4**

19 **if** $|F_{F,A}| > \mu_s F_{N,A}$ **then**

20 $n \leftarrow 2$

21 **end**

22 **end**

23 **end**
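The starting-case check can also be written as a short Python function. This is a sketch of the logic in the pseudo-code above, with one assumption: the current case `n` is passed in explicitly, since the switch statement depends on it. The function and argument names are illustrative.

```python
def check_starting_case(n, F_fA, F_fB, F_nA, F_nB, mu_s):
    """Return the slipping case at the start of a time-step.

    Cases: 1 = no slip, 2 = both wheels slip, 3 = wheel A slips,
    4 = wheel B slips. A wheel begins to slip when the required friction
    force exceeds the static friction limit mu_s * F_normal.
    """
    slip_A = abs(F_fA) > mu_s * F_nA
    slip_B = abs(F_fB) > mu_s * F_nB
    if n == 1:
        if slip_A and slip_B:
            return 2
        if slip_A:
            return 3
        if slip_B:
            return 4
    elif n == 3 and slip_B:   # A already slipping, B now slips too
        return 2
    elif n == 4 and slip_A:   # B already slipping, A now slips too
        return 2
    return n                  # otherwise the case is unchanged
```

Testing the combined condition first gives the same result as the sequential checks in the pseudo-code, where the final test overrides the two single-wheel assignments.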