## Abstract

Including resilience in an overall systems optimization process is challenging because the space of hazard-mitigating features is complex, involving both inherent and active prevention and recovery measures. Many resilience optimization approaches have thus been put forward to optimize a system’s resilience while systematically managing these complexities. However, there has been little study of when to apply, or how to adapt, these architectures (or their underlying decomposition strategies) to new problems, which may be formulated differently. To resolve this problem, this article first reviews the literature to understand how choice of optimization architecture flows out of problem type and, based on this review, creates a conceptual framework for understanding these architectures in terms of their underlying decomposition strategies. To then better understand the applicability of alternating and bilevel decomposition strategies for resilience optimization, their performance is compared over two demonstration problems. These comparisons show that while both strategies can solve resilience optimization problems effectively, the alternating strategy is prone to adverse coupling relationships between design and resilience models, while the bilevel strategy is prone to increased computational costs from the use of gradient-based methods in the upper level. Thus, when considering how to solve a novel resilience optimization problem, the choice of decomposition strategy should flow out of problem coupling and efficiency characteristics.

## 1 Introduction

Complex, large-scale, or safety-critical engineered systems will inevitably encounter hazardous scenarios. In these scenarios, it is important to minimize potential safety and/or performance losses and maintain or restore critical operations [1]. This is accomplished by incorporating resilience in the system’s dynamic hazard response, which can include resistance, absorption, restoration, and recovery [2], as well as active prevention [3,4] attributes or features. Starting this process in the early design stage provides the best opportunity to shape the overall system design to be resilient to hazards [5–7] (e.g., by incorporating flexibility, redundancy, sensing and reconfiguration technology in the design).

Many frameworks for incorporating resilience in early system design have thus been put forward [1,8–13]. A key challenge in incorporating resilience in the early design process is trading off the benefits of hazard-mitigating features to system resilience against their corresponding design and operational costs and inefficiencies. Value modeling [14,15], multiobjective decision analysis [12,16], and expected cost modeling [17–21] frameworks have thus been put forward to resolve these trade-offs and enable design decision-making. However, even when the trade-offs among design, operational, and resilience objectives have been resolved, it can be difficult to incorporate resilience because the space of potential features is large and complex [12], comprising many different variables that can interact in unintuitive ways.

To enable the systematic exploration of these design spaces, resilience optimization frameworks have been developed, which leverage mathematical optimization techniques to find the optimally resilient design [12]. The most commonly used approach is the two-stage approach [22–24], which uses a bilevel optimization strategy to (in the upper level) design the system before the event and then (in the lower level) optimize the system response after each hazardous scenario occurs [25–29]. Other general resilience optimization formulations have presented it as a sequential problem: first allocating resilience to subsystems and then optimizing the reliability and health management of those systems to achieve the required resilience [30,31]. In addition, a number of other dedicated resilience optimization architectures and applications organize the problem differently: as a sequential problem, a multidisciplinary design optimization problem [32,33], a multiagent problem [34,35], a bilevel problem [36,37], or a monolithic problem [21,38–40].

Given the variety of resilience optimization formulations and frameworks, it can be difficult to understand how best to use these frameworks to approach a new problem. Thus, there is a need to understand and differentiate the types of resilience optimization problems one might encounter and to understand how best to select and tailor an optimization framework to these problem types, similar to what has been done in the related fields of multidisciplinary design optimization [41,42] and codesign [43]. To approach this problem, the authors previously developed a general framework for combined design, operational, and resilience optimization and compared the use of all-at-once, sequential, and bilevel architectures within this framework [44]. While this comparison constituted a first step toward understanding the comparative advantages of resilience optimization approaches, it was limited to a very simple algorithm (exhaustive search) on a single problem. Other authors have compared nested (bilevel) with simultaneous solution architectures in similarly formulated reliability-based codesign problems [45], finding that a nested approach could converge to a solution similar to that of a simultaneous (all-at-once) optimization architecture with fewer function evaluations. However, this work was limited to reliability-based codesign problems and thus may not apply to the broader set of resilience optimization problem formulations. Furthermore, neither of these studies included alternating architectures in the comparison.

### 1.1 Contributions.

Given these gaps in the research, it may be difficult to understand when to select one of the myriad of existing approaches or how to formulate a new approach to a new resilience optimization problem of interest. The aim of this article is thus to develop an overall theory for understanding resilience optimization problem formulations and solution architectures. It advances this aim by pursuing two major contributions: First, it provides a comprehensive review of existing resilience optimization approaches and categorizes them in an overall framework showing how optimization architectures are used in different problem formulations (Sec. 2). Based on this review, it then identifies overall multilevel decomposition strategies used in resilience optimization approaches, as well as an alternating multilevel decomposition approach, which has not been studied in the field. To understand the comparative performance and applicability of these strategies, it then compares them over a simple notional resilience optimization problem and a cooling tank problem in Sec. 3 to evaluate decomposition strategy performance given coupling and alignment problem characteristics. These comparisons are then used to understand how best to apply the identified multilevel decomposition strategies to new resilience optimization problems (Secs. 4 and 5).

## 2 Resilience Optimization

Resilience optimization is the use of mathematical optimization techniques to increase the resilience of the system to hazardous scenarios. In this framework, resilience is the valued quality of the system’s dynamic response to hazards, which minimizes their undesirable consequences (e.g., downtime, safety consequences, repairs) [17]. This definition is shared by a number of commonly used resilience frameworks, including the resilience triangle [46] and similar variants presented in the literature (e.g., Refs. [31,47–49]), although there remains ongoing debate about resilience definitions [50–52]. The variety of different approaches to model, consider, and change the system’s hazard response has resulted in a number of different formulations of the resilience optimization problem. The remainder of this section presents a general framework for understanding resilience optimization problem formulations, which it then uses to classify previous examples of resilience optimization in the literature. It then describes decomposition strategies used in resilience optimization architectures, as well as the alternating architecture, which will be used in the comparison in Sec. 3.

### 2.1 Resilience Optimization Problem Formulations.

In the general form of the resilience optimization problem, $\mathbf{x}$ is the design vector, $C_{D/O}$ is the cost of design and/or operations, and $C_{R}$ is the cost of hazards, which is taken as an expectation over the set of scenarios $S$, where $r_{s}$ is the rate of the scenario, $C_{s}$ is the cost of the scenario, and $n$ is the life of the system. While they are stated as costs, it should be noted that the cost function $C_{D/O}$ only needs to be a function that captures the merit of the design (apart from resilience), while $C_{R}$ is a function that captures the quality of the system’s response in a hazardous scenario. This problem is different from the long-established reliability-based multidisciplinary design optimization (RBMDO) problem [53,54] in that it optimizes resilience of the system (the response of the system to each scenario captured by $C_{s}$) as a value-based objective instead of considering the reliability (probability of failure of the system, most analogous to $r_{s}$) as a constraint. The use of decomposition strategies in RBMDO problems is furthermore motivated by the need to use a most probable point uncertainty analysis like FORM/SORM, which requires an optimization process to calculate the probability of failure, rather than (in resilience optimization) by the need to optimize a subset of hazard-affecting decision variables. As a result, the resilience optimization problem has much less coupling between respective analyses, and different optimization architectures may be necessary to solve it.
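The symbol definitions above suggest a general problem of roughly the following form. This is a sketch assembled from those definitions only; the exact layout of the original equation, and the assumption that the expectation is scaled linearly by the system life $n$, are not confirmed by the text.

```latex
\begin{aligned}
\min_{\mathbf{x}} \quad & C(\mathbf{x}) = C_{D/O}(\mathbf{x}) + C_{R}(\mathbf{x}) \\
\text{where} \quad & C_{R}(\mathbf{x}) = n \sum_{s \in S} r_{s} \, C_{s}(\mathbf{x})
\end{aligned}
```

Under this reading, a scenario contributes to $C_{R}$ in proportion to both how often it occurs ($r_{s}$) and how costly the system's response to it is ($C_{s}$).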

#### 2.1.1 Integrated Resilience Optimization.

The design vector $\mathbf{x}$ in the resilience optimization problem can take different forms. To distinguish these formulations, consider that the design vector $\mathbf{x}$ may be made up of two components: $\mathbf{x}_{D/O}$, the design and operational variables that define the inherent resilience of the system (e.g., design buffer), and $\mathbf{x}_{R}$, the resilience variables that define the actions the system takes over the set of hazardous scenarios (the active resilience of the system). When constraints are additionally included, the problem is stated in terms of two models: $C_{D/O}$, $\mathbf{h}_{D/O}$, and $\mathbf{g}_{D/O}$ are the design and operational cost objective and constraints (the result of the design cost model at design variables $\mathbf{x}_{D/O}$), and $C_{R}$, $\mathbf{h}_{R}$, and $\mathbf{g}_{R}$ are the resilience cost objective and constraints (the result of the resilience cost model at the design/operational variables and resilience variables $\mathbf{x}_{R}$). This specific formulation of the problem, where the design/operational variables $\mathbf{x}_{D/O}$ and resilience variables $\mathbf{x}_{R}$ are optimized together as decision variables, is referred to here as integrated resilience optimization (IRO) and will be used to discuss optimization architectures in Sec. 2.2. This problem formulation is very similar to (but more general than) the reliability-based codesign problem presented in Ref. [45]. The main difference (in addition to the differences between resilience and reliability optimization pointed out in Sec. 2.1) is that the lower level of this problem is the optimization of resilience variables (which includes individual or multiscenario contingency management) instead of one single control policy. Integrated resilience optimization represents a general formulation of the resilience optimization problem from which we can derive two formulations that operate over a reduced set of variables, as shown in Fig. 1: resilience-based design optimization (RDO) and resilience policy optimization (RPO).
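One way to write the IRO statement described above is sketched below. The equality/inequality convention for $\mathbf{h}$ and $\mathbf{g}$ and the argument lists are assumptions based on common optimization notation, not a reproduction of the original equation.

```latex
\begin{aligned}
\min_{\mathbf{x}_{D/O},\, \mathbf{x}_{R}} \quad & C_{D/O}(\mathbf{x}_{D/O}) + C_{R}(\mathbf{x}_{D/O}, \mathbf{x}_{R}) \\
\text{s.t.} \quad & \mathbf{h}_{D/O}(\mathbf{x}_{D/O}) = \mathbf{0}, \quad \mathbf{g}_{D/O}(\mathbf{x}_{D/O}) \le \mathbf{0} \\
& \mathbf{h}_{R}(\mathbf{x}_{D/O}, \mathbf{x}_{R}) = \mathbf{0}, \quad \mathbf{g}_{R}(\mathbf{x}_{D/O}, \mathbf{x}_{R}) \le \mathbf{0}
\end{aligned}
```

In this sketch, removing $\mathbf{x}_{R}$ from the decision variables gives a problem of the RDO type, while holding $\mathbf{x}_{D/O}$ fixed and optimizing only $\mathbf{x}_{R}$ gives a problem of the RPO type.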

A large number of formulations and solution approaches have been presented for integrated resilience optimization problems, as summarized in Table 1. While some of these approaches use a monolithic formulation, the use of decomposition architectures is much more common because of the inherent structure of the problem, where some variables (design/operational) will apply to all resilience scenarios and others will only apply to some or individual scenarios. As a result, two-stage (and n-stage, a variant used where there are multiple sequential decisions) approaches are used most commonly. These approaches leverage both bilevel and scenario-based decomposition to find the optimal response of a system in each scenario given a set of design/operational variables that apply to all scenarios. However, they are not the only approaches used, because the resilience variables can be coupled between scenarios (leading to the use of bilevel architectures) and because (in robust optimization formulations) the resilience objective is based on the worst-case scenario (rather than the entire set). The latter leads to the use of trilevel optimization architectures, where the optimization of the resilience model is the optimization of the resilience policy nested within the optimization of the worst-case scenario. While there has been some exploration of sequential multilevel decomposition approaches, these are used much less commonly.

Table 1: Integrated resilience optimization (IRO) approaches

| Architecture | Ref. | Problem description/variables |
|---|---|---|
| Monolithic | [55] | Mitigation (D/O) and restoration (R) strategies for a natural gas distribution network and power grid. |
| | [56] | Preparedness (D/O) and recovery (R) actions made on links of a transportation network. |
| | [33] | Power plant condenser design parameters (D/O) and PHM maintenance/repair policy (R). |
| | [44] | Multirotor architecture and flight plan (D/O) and in-flight contingency management (R). |
| | [57] | Aircraft control actuator reliability, detectability, and reconfigurability (D/O) and detection/recovery rapidity (R). |
| | [58] | Transmission line protection level (D/O) and power generation and supply response (R) for an electricity distribution system. |
| | [59,60] | Freight transportation shipment allocation (D/O) and recovery activity (R). |
| Bilevel | [36] | Size and capacity of logistics service centers (D/O) and customer demand allocation (R). |
| | [44] | Multirotor architecture and flight plan (D/O) and in-flight contingency management (R). Compared monolithic and scenario-set resilience model decomposition approaches. |
| Trilevel | [61] | Network configuration and capacity (D/O) and processes operating level (R) in the worst-case scenario. |
| | [29,62] | Investment planning (D/O) and hazard response (R) in worst-case scenario. |
| | [63] | Electricity and natural gas system line reinforcement (D/O) and flow response in worst-case attack scenario (R). |
| | [64] | Protected nodes (D/O) and restored nodes (R) in worst-case interdiction scenario in water, gas, and power network. |
| | [65] | Natural gas unit commitments (D/O) and contingency actions (R) in worst-case scenario. |
| | [66] | Road network defenses (D/O) and flow of traffic (R) in worst-case attack scenario. |
| Two-stage | [26] | Airport preparedness (D/O) and recovery (R) actions. |
| | [27] | Freight network preparedness (D/O) and recovery (R) actions. |
| | [28] | Power commitment and reserves of distributed generators and microgrids (D/O) and reserve deployment (R). |
| | [67] | Distribution center capacity, and nominal customer allocation and order size (D/O) and customer allocation and order size in disruptive scenarios (R). |
| N-stage | [68] | Infrastructure mitigation and preparedness actions (D/O) and repair, recovery, and transfer plans (R) for healthcare systems. |
| Sequential | [69] | Train braking profile (D/O) and maintenance policy (R). |
| | [30] | System-level resource allocation and redundancy (D/O). Component reliability and PHM efficiency are co-optimized in the lower level (R). Used a custom lower-level strategy. |

#### 2.1.2 Resilience-Based Design Optimization.

In RDO, the design and operations of the system $\mathbf{x}_{D/O}$ are optimized as decision variables, resulting in a design and/or mission profile that is inherently resilient to faults. Following the notation in Eq. (1), this problem is stated over $\mathbf{x}_{D/O}$ alone, without the resilience decision variables $\mathbf{x}_{R}$. Note that this notation only means that there are no resilience decision variables, not that there are no resilience variables. The existence of resilience variables that are returned as responses $\mathbf{y}_{R}$ of the design variables in the resilience model $\mathbf{h}_{R}(\mathbf{x}_{D/O})$ results in the “RDO with coupled control” variant listed in Fig. 1. This is similar to the IRO formulation, but we consider it a variant of the RDO formulation because the control variables are not optimized as decision variables and are instead responses.

Existing formulations and solution approaches for RDO problems are presented in Table 2. As shown, the majority of the existing RDO formulations use a monolithic architecture, and a very common type of problem is sensor allocation, where sensors are distributed in a system to enable detection and reconfiguration of component faults. However, there have been a few additional problem formulation and solution architecture variants. Two approaches [35,83] incorporate a coupled control problem that finds a corresponding control policy given the design policy by solving a resilience constraint satisfaction problem. In addition, the reliability-based codesign formulation in Ref. [45] uses a bilevel structure (in a comparison with all-at-once and other approaches) to explore system designs and corresponding control policies simultaneously. Finally, while these problems are nearly all solved using monolithic solution strategies, there is one example [84] using the scenario-based decomposition described in Sec. 2.2.1, where each design variable is mapped to a set of scenarios originating from a function in the system.

Table 2: Resilience-based design optimization (RDO) approaches

| Architecture | Ref. | Problem description/variables |
|---|---|---|
| Monolithic | [38,39,70–75] | PHM sensor allocation and network design. |
| | [76–79] | Optimization of (generic, power, rail, etc.) network topology. |
| | [40] | Sensor locations, inspection interval, detection probability of motor controller PHM system. |
| | [16] | Optimization of supply chain connectivity and production. |
| | [18,31,80] | Component resilience attributes (e.g., redundancy, robustness, rapidity, reliability, restoration). |
| | [81] | Retrofit improvements (materials, thicknesses) in a bridge. |
| | [82] | Transportation path through an earthquake-prone area. |
| | [83] | Power production planning (D/O) and load not served during extreme weather events (coupled R). |
| | [35] | Agent (operator) placement and number (D/O) in a power grid and power equilibrium (coupled R). |
| Scenario-set | [84] | EPS system redundancy architecture. |
| Bilevel | [45] | Wind turbine (and notional problem) design (upper level) and operations (lower level). |

#### 2.1.3 Resilience Policy Optimization.

In RPO, the resilience variables $\mathbf{x}_{R}$ are optimized as decision variables, resulting in an optimal contingency management policy over the set of resilience scenarios for a given system design (see Refs. [85–88]). Following the notation in Eq. (1), this problem is stated over $\mathbf{x}_{R}$ alone. Any design/operational variables $\mathbf{y}_{D/O}$ that enable a particular policy are included as response variables of the resilience policy in the model $\mathbf{h}_{R}(\mathbf{x}_{R})$, which may result in a corresponding cost $C_{D/O}(\mathbf{y}_{D/O})$, if desired. Previous RPO formulations and approaches are presented in Table 3. While a number of resilience policy optimization architectures are monolithic (and use no decomposition architecture), the use of a bilevel decomposition approach is also common. These approaches are used when there is a corresponding problem that must be optimized given the set of resilience variables (e.g., traffic equilibrium [93]) and when the response to be optimized is the worst-case scenario (rather than the expectation over the set of scenarios). In addition, the resilience policy optimization problem can be solved using a two-stage approach (e.g., Ref. [25]) when the resilience policy can be split into variables that apply to all scenarios (the first stage) and variables that only apply in singular scenarios (the second stage). Finally, among monolithic formulations, there has been one example [19] of the coupled design problem formulation in Fig. 1, where the enabling flexibility was considered an output of a chosen recovery policy.

Table 3: Resilience policy optimization (RPO) approaches

| Architecture | Ref. | Problem description/variables |
|---|---|---|
| Monolithic | [89] | Urban rail repair sequence and duration. |
| | [85–88] | Aircraft off-nominal control policies. |
| | [90] | Electric distribution system recovery policy. |
| | [91] | Optimal bridge recovery delay and rate. |
| | [92] | Optimal power and water system postearthquake recovery actions. |
| | [19] | Monopropellant system recovery policy (R) and enabling flexibility (coupled D/O). |
| Two-stage | [25] | Modes to repair over set of scenarios and dispatch of energy resources and repair crews in each scenario. |
| Bilevel | [93] | Optimal policy for reconfiguring traffic intersections given optimal user behavior to intersection reconfiguration in each scenario. |
| | [37] | Reservoir flowrate (upper level) and flexibility allocation (lower level) over scenarios. |
| | [94] | Road network recovery plans (upper level) given equilibrium traffic assignment (lower level). |
| | [95] | Wildfire response operations (lower level) in the worst-case scenario (upper level). |

### 2.2 Decomposition Strategies.

The use of a specialized optimization architecture is motivated by the potential to reduce the solution time and complexity of a problem by decomposing it into reduced-dimensionality subproblems that map better to known algorithms. What differentiates the resilience optimization problem from a traditional multidisciplinary design optimization (MDO) problem is that it often has inherent structural characteristics, chiefly reduced coupling between analyses, that may be leveraged for a more efficient solution. MDO problems, on the other hand, are often tightly coupled, having a large number of shared and coupling variables that must be kept consistent in the final solution. As a result, existing MDO architectures may not be suitable for the task of leveraging the weak coupling relationships between resilience optimization subproblems to achieve a more efficient and effective solution process.

To describe these strategies, the design vector is broken into two sets: the optimized decision variables $\mathbf{x}$ and the corresponding response variables $\mathbf{y}$, which result from a simulation $\mathbf{h}(\mathbf{x})$. In the resulting problem, $\mathbf{x}_{D/O}$, $\mathbf{x}_{R}$, $\mathbf{y}_{D/O}$, and $\mathbf{y}_{R}$ are the decision variables and corresponding response variables of the design/operational and resilience models. Two main decomposition strategies can be leveraged to create an optimization architecture, as shown in Fig. 3: multilevel decomposition and scenario decomposition. As shown, while multilevel decomposition strategies decompose the resilience and design/operational problems into separate interacting problems, scenario-based decomposition strategies decompose the resilience problem between scenarios and sets of scenarios. While not the focus of the comparison in this article, we first describe scenario-based decomposition strategies in Sec. 2.2.1 to provide important context about how and why multilevel decomposition strategies are used in practice (since they are often used together). We then present two multilevel decomposition strategies: a bilevel optimization strategy that is commonly used in the architectures found in the literature, and an alternating architecture that is used much less commonly (if at all).

#### 2.2.1 Scenario Decomposition.

Scenario decomposition approaches decompose the resilience model into independent subproblems for each scenario or group of scenarios. This is advantageous because the size of the resilience problem increases with the number of scenarios considered, and each scenario may itself be computationally costly to simulate. However, its usage depends on the coupling of the scenarios in the resilience problem, as shown in Fig. 4. If the scenarios are fully uncoupled, meaning that the resilience variables for each scenario do not interact, each problem can essentially be solved independently. This is the case in the two-stage [25], n-stage, and scenario-set approaches in Tables 1–3, where the variables optimized are how the system responds to each situation individually (rather than in aggregate).

However, there may be resilience problems of interest where the variables being optimized are coupled between the scenarios, for example, when a control component must take actions in hazardous scenarios given sensor readings. In cases like this, it may still be possible to map these variables to subsets of scenarios, which is the lower-level decomposition approach in Ref. [44]. However, in cases where the scenarios are entirely coupled, the problem must remain monolithic. Note that these decomposition approaches can be performed not only in the context of an IRO formulation (resulting in the approaches shown in Fig. 4) but also in an RPO formulation and in an RDO formulation where design variables can be decomposed to fault scenarios [84]. While these approaches are not compared here, we include them in this discussion because they can be a key reason to choose a multilevel decomposition strategy instead of an all-at-once approach: multilevel decomposition strategies enable one to further decompose the resilience model into separate optimization problems, which can greatly reduce the computational cost of the overall problem.
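The benefit of scenario decomposition for fully uncoupled scenarios can be illustrated with a minimal sketch. Everything in it is hypothetical: the `Scenario` fields, the cost functions, the numeric values, and the exhaustive searches are placeholders chosen for illustration, not the models or algorithms used in the referenced works.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Scenario:
    rate: float      # r_s: expected occurrences per year (hypothetical value)
    severity: float  # size of the disruption (hypothetical value)


def scenario_cost(x_d, x_r, s):
    """Hypothetical C_s: penalty on unmet demand plus the effort of the response."""
    unmet = max(s.severity - x_d - x_r, 0.0)
    return 10.0 * unmet + x_r ** 2


def decomposed_resilience_cost(x_d, scenarios, responses, life=10.0):
    """C_R*(x_d): each scenario's response is optimized in its own subproblem."""
    total = 0.0
    for s in scenarios:
        best = min(scenario_cost(x_d, x_r, s) for x_r in responses)
        total += s.rate * best
    return life * total


def monolithic_resilience_cost(x_d, scenarios, responses, life=10.0):
    """Same quantity, found by searching all response combinations at once."""
    return min(
        life * sum(s.rate * scenario_cost(x_d, x_r, s)
                   for s, x_r in zip(scenarios, combo))
        for combo in product(responses, repeat=len(scenarios))
    )


def two_stage(scenarios, designs, responses):
    """Upper level: choose the design minimizing C_D/O + C_R* (here C_D/O = 2 x_d)."""
    return min(
        designs,
        key=lambda x_d: 2.0 * x_d
        + decomposed_resilience_cost(x_d, scenarios, responses),
    )


scenarios = [Scenario(0.1, 2.0), Scenario(0.05, 4.0), Scenario(0.02, 6.0)]
designs = [0.0, 1.0, 2.0, 3.0, 4.0]
responses = [0.0, 1.0, 2.0, 3.0]
best_design = two_stage(scenarios, designs, responses)
```

For each candidate design, the decomposed search evaluates 3 × 4 = 12 scenario responses, whereas the monolithic search enumerates 4³ = 64 response combinations. Both return the same resilience cost here only because the scenarios in this sketch share no resilience variables.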

#### 2.2.2 Multilevel Decomposition: Bilevel Architecture.

In the bilevel architecture, the upper-level design/operational optimization evaluates the lower-level resilience optimization at each candidate design. In the upper-level problem, $C_{R}^{*}(\mathbf{x}_{D/O})$ and $\mathbf{g}_{R}^{*}(\mathbf{x}_{D/O})$ are the optimal (or best) responses from the lower-level resilience optimization. In the lower-level problem, $\mathbf{x}_{D/O}$ and $\mathbf{y}_{D/O}$ are the design/operational variables from the upper level used as inputs for the lower-level optimization. While this architecture is often used on its own, it is also combined with other architectures to most effectively solve the problem of interest. In the trilevel architecture, two bilevel optimization problems are nested within each other to optimize design/operations with hazard response (in the upper-level nested optimization) and hazard response with the worst-case hazard event (in the lower-level nested optimization). In the two- and n-stage optimization architectures, this bilevel architecture is further augmented with the scenario-based decomposition strategy in the lower level to take advantage of the separability of the resilience policy optimization problem (where the response to each contingency is independent).
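The two nested problems described above can be sketched as follows. The symbols $C_{R}^{*}$, $\mathbf{g}_{R}^{*}$, $\mathbf{x}_{D/O}$, and $\mathbf{y}_{D/O}$ come from the text; the constraint conventions and the placement of the upper-level values as parameters after the semicolon are assumptions.

```latex
\begin{aligned}
\text{Upper level:} \quad \min_{\mathbf{x}_{D/O}} \quad & C_{D/O}(\mathbf{x}_{D/O}) + C_{R}^{*}(\mathbf{x}_{D/O}) \\
\text{s.t.} \quad & \mathbf{h}_{D/O}(\mathbf{x}_{D/O}) = \mathbf{0}, \quad \mathbf{g}_{D/O}(\mathbf{x}_{D/O}) \le \mathbf{0}, \quad \mathbf{g}_{R}^{*}(\mathbf{x}_{D/O}) \le \mathbf{0} \\[4pt]
\text{Lower level:} \quad C_{R}^{*}(\mathbf{x}_{D/O}) = \min_{\mathbf{x}_{R}} \quad & C_{R}(\mathbf{x}_{R};\, \mathbf{x}_{D/O}, \mathbf{y}_{D/O}) \\
\text{s.t.} \quad & \mathbf{h}_{R}(\mathbf{x}_{R};\, \mathbf{x}_{D/O}, \mathbf{y}_{D/O}) = \mathbf{0}, \quad \mathbf{g}_{R}(\mathbf{x}_{R};\, \mathbf{x}_{D/O}, \mathbf{y}_{D/O}) \le \mathbf{0}
\end{aligned}
```

Because every upper-level evaluation requires a full lower-level solve, the cost of this architecture scales with the product of the two levels' evaluation counts.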

#### 2.2.3 Multilevel Decomposition: Alternating Architecture.

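In the alternating architecture, the upper-level design/operational optimization at iteration $k$ is solved with the resilience variables held at their values from iteration $k - 1$ of the lower-level resilience optimization. The lower-level resilience optimization is in turn solved with the design/operational variables held at the values found by the upper level, and the two levels are repeated until an exit condition is met.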

Alternating architectures can be adapted in a number of ways depending on the type of problem. As shown in Fig. 6, the sequential architecture explored in previous work [44] is a variant of the alternating architecture, which only solves the optimization at each level once. One could additionally use several different exit conditions in the implementation of an alternating architecture (e.g., number of iterations, tolerances on variables), which could be adapted to a particular problem. Finally, alternating architectures may include the resilience model (or a surrogate) in the design/operations optimization loop, resulting in the “with $C_{R}$” variant shown (see overlay/highlighting) in Fig. 6. This enables the design/operational optimization to take into account some of the costs of resilience without running a full optimization at every iteration.
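The mechanics of the bilevel and alternating strategies can be compared on a small example. The sketch below is illustrative only: the quadratic cost, the coupling coefficient, the golden-section line search, the bounds, and the exit tolerance are all hypothetical choices, and the alternating loop implements the “with $C_{R}$” variant (the design step sees the full cost with the resilience variable held fixed).

```python
import math

INV_PHI = (math.sqrt(5.0) - 1.0) / 2.0


def golden_min(f, lo, hi, tol=1e-7):
    """Golden-section search for the minimizer of a unimodal 1-D function."""
    a, b = lo, hi
    c, d = b - INV_PHI * (b - a), a + INV_PHI * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:   # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - INV_PHI * (b - a)
            fc = f(c)
        else:         # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + INV_PHI * (b - a)
            fd = f(d)
    return 0.5 * (a + b)


class CountedCost:
    """Hypothetical total cost C = C_D(x_d) + C_R(x_d, x_r); counts evaluations."""

    def __init__(self, coupling=0.5):
        self.coupling = coupling
        self.calls = 0

    def __call__(self, x_d, x_r):
        self.calls += 1
        design = (x_d - 1.0) ** 2
        resilience = (x_r - 2.0) ** 2 + self.coupling * x_d * x_r
        return design + resilience


def solve_bilevel(cost, lo=-5.0, hi=5.0):
    """Upper level searches x_d; every evaluation solves the lower level for x_r."""
    def lower(x_d):
        return golden_min(lambda x_r: cost(x_d, x_r), lo, hi)
    x_d = golden_min(lambda v: cost(v, lower(v)), lo, hi)
    return x_d, lower(x_d)


def solve_alternating(cost, x_r=0.0, lo=-5.0, hi=5.0, max_iter=50, tol=1e-5):
    """Alternate single-level solves until successive iterates stop changing."""
    x_d = None
    for _ in range(max_iter):
        new_d = golden_min(lambda v: cost(v, x_r), lo, hi)    # design step
        new_r = golden_min(lambda v: cost(new_d, v), lo, hi)  # resilience step
        done = (x_d is not None and abs(new_d - x_d) < tol
                and abs(new_r - x_r) < tol)
        x_d, x_r = new_d, new_r
        if done:
            break
    return x_d, x_r


bilevel_cost, alternating_cost = CountedCost(), CountedCost()
bilevel_solution = solve_bilevel(bilevel_cost)
alternating_solution = solve_alternating(alternating_cost)
```

For this convex, weakly coupled cost, both strategies reach the same point (analytically $x_d = 8/15$, $x_r = 28/15$), and the `calls` counters show how many cost evaluations each strategy used. The example says nothing about harder cases: with stronger or nonconvex coupling, an alternating loop of this kind may converge slowly or stall.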

## 3 Demonstration: Comparing Multilevel Decomposition Strategies

As stated in Sec. 2.2, while the bilevel multilevel decomposition strategy is used widely in resilience optimization, the use of the alternating strategy for these problems has not yet been studied. As a result, less is known about this strategy, and it may not be clear when it might be most applicable to a problem of interest. We posit that the effectiveness of this strategy compared with other strategies (such as a bilevel or all-at-once strategy) hinges on two inherent problem properties: *alignment* and *coupling*. Alignment refers to whether the upper-level and lower-level objectives oppose, support, or are invariant to each other (i.e., if $\frac{\mathbf{d}C_D}{\mathrm{d}x_D} \cdot \frac{\mathbf{d}C}{\mathrm{d}x_D} \approx \left\|\frac{\mathbf{d}C_D}{\mathrm{d}x_D}\right\| \left\|\frac{\mathbf{d}C}{\mathrm{d}x_D}\right\|$, where $\frac{\mathbf{d}C}{\mathrm{d}x_D} = \frac{\mathbf{d}C_D}{\mathrm{d}x_D} + \frac{\mathbf{d}C_R}{\mathrm{d}x_D}$). Coupling, on the other hand, refers to the degree to which upper-level variables are constrained with lower-level variables. For the purpose of this demonstration, we define three levels of coupling:

– In an *uncoupled* problem, the lower-level optimization is merely a refinement of the upper-level optimization, meaning that there is a direct path from $[x_D^*, x_R^0]$ to $[x_D^*, x_R^*]$.

– In a *loosely coupled* problem, the optimal choice of design variables **x**_{D}* may depend on the choice of resilience variables; however, the choices do not directly depend on each other and the following relationship holds: $C(x+\delta x) \approx C(x+\delta x_D) + C(x+\delta x_R) - C(x)$.

– In a *fully coupled* problem, this relationship does not hold, and as a result, the design and resilience variables must be jointly explored.
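The alignment and coupling properties defined above can be probed numerically on a given cost model. The sketch below is one possible way to do so, not a method from the article: `alignment` returns the cosine between the design-cost gradient and the total-cost gradient (near +1 when the objectives support each other, near −1 when they oppose), and `coupling_residual` evaluates how far the loosely coupled relationship is from holding for a chosen pair of perturbations. All function names and the toy cost functions are assumptions made for illustration.

```python
import math

def grad(f, x, h=1e-6):
    """Central finite-difference gradient of f at the point x (a list)."""
    g = []
    for i in range(len(x)):
        up = list(x)
        dn = list(x)
        up[i] += h
        dn[i] -= h
        g.append((f(up) - f(dn)) / (2 * h))
    return g

def alignment(c_d, c_total, x_d):
    """Cosine between dC_D/dx_D and dC/dx_D at x_d."""
    g_d = grad(c_d, x_d)
    g = grad(c_total, x_d)
    dot = sum(a * b for a, b in zip(g_d, g))
    norm = math.sqrt(sum(a * a for a in g_d)) * math.sqrt(sum(b * b for b in g))
    return dot / norm if norm > 0 else 0.0

def coupling_residual(c, x_d, x_r, dx_d, dx_r):
    """C(x + dx) - [C(x + dx_D) + C(x + dx_R) - C(x)].

    Close to zero for the tested perturbations when the problem is loosely
    coupled; clearly nonzero when design and resilience variables interact.
    """
    xd2 = [a + b for a, b in zip(x_d, dx_d)]
    xr2 = [a + b for a, b in zip(x_r, dx_r)]
    return c(xd2, xr2) - (c(xd2, x_r) + c(x_d, xr2) - c(x_d, x_r))

# Hypothetical toy costs (not the article's models).
def toy_design_cost(x_d):
    return x_d[0] ** 2

def toy_opposed_total(x_d):
    # Resilience term pulls x_d toward 3, against the design cost.
    return x_d[0] ** 2 + (x_d[0] - 3.0) ** 2

def toy_separable(x_d, x_r):
    return x_d[0] ** 2 + x_r[0] ** 2

def toy_coupled(x_d, x_r):
    return x_d[0] * x_r[0]

if __name__ == "__main__":
    print(alignment(toy_design_cost, toy_opposed_total, [1.0]))
    print(coupling_residual(toy_separable, [1.0], [1.0], [0.5], [0.5]))
    print(coupling_residual(toy_coupled, [1.0], [1.0], [0.5], [0.5]))
```

A check like this only samples the relationship at the chosen point and perturbations, so it can suggest, but not prove, which coupling level applies.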

### 3.1 Notional System.

*x*_{p}, which then, because of a hazardous event, drops by an amount *x*_{a} given by the slack in the system *x*_{s}. This continues for the time *x*_{b} until the system recovers, which takes time *x*_{c}. Cost functions and constraints were constructed to form the optimization problem shown below:

*x*_{r} is the rate of the fault, and *a*, *b*, *c*, *d*, *e*, *f*, and *n* are all problem constants. In this problem, the system produces revenue from performing *x*_{s} operations subject to hazards and *x*_{a} operations not subject to hazards and is also subject to the costs of maintaining a reliable system. The constraint *h*_{D1} relates slack, performance, and the drop in performance due to the fault, while *g*_{D2} specifies a peak performance level, which bounds reliability.

#### 3.1.1 Optimization.

This is a nonlinear programming problem. In this work, this problem is solved using the trust-region algorithm in Python's SciPy package (see Ref. [100]) with all-at-once, bilevel, alternating, and sequential strategies. While a full description of the implementation is out of the scope of this section, it is important to know that the bilevel method was run for 50 upper-level iterations with 20 corresponding lower-level iterations (since other convergence criteria were not met during optimization at either level) and the alternating approaches were set to terminate when the improvement between upper- and lower-level optimizations was below a tolerance of *f*_{tol} = 10^{−4}. The starting point used was *x* = [1, 0.5, 10^{−4}, 0.5, 1, 1]. Figure 8 shows the progress of the bilevel, all-at-once, and alternating strategies over the computational time used for the optimization. As shown, both the all-at-once and alternating strategies complete the optimization in reasonable computational time, whereas the bilevel strategy takes an order of magnitude longer to approach the same solution. The alternating and sequential strategies without *C*_{R} converge to a poor design, since they have no ability to account for resilience in the design optimization problem. These results are also reflected in the final results comparison in Table 4, which additionally shows the performance of sequential strategies with and without the resilience cost *C*_{R} in the upper level. As shown, the sequential strategy with the resilience cost in the upper level performs nearly as well as the all-at-once strategy at reduced computational cost because of the reduced space of the problems (and the lack of the subsequent iterations present in alternating architectures).

Strategy | x_{p} | x_{a} | x_{r} | x_{s} | x_{b} | x_{c} | f* | time |
---|---|---|---|---|---|---|---|---|
All-at-once | 1.5 | 1.1 | 0.0022 | 0.41 | 0.62 | 10 | −1.3 × 10^{+06} | 0.51 |
Bilevel | 1.5 | 0.8 | 0.0024 | 0.7 | 0.71 | 10 | −1.2 × 10^{+06} | 14 |
Alternating (with C_{R}) | 1.5 | 1.1 | 0.0022 | 0.41 | 0.62 | 10 | −1.3 × 10^{+06} | 0.96 |
Alternating (no C_{R}) | 1 × 10^{+02} | 80 | 1 × 10^{+02} | 20 | 0.071 | 10 | 4.1 × 10^{+11} | 0.53 |
Seq. (with C_{R}) | 1.3 | 0.78 | 0.00083 | 0.51 | 0.72 | 10 | −1.2 × 10^{+06} | 0.37 |
Seq. (no C_{R}) | 1 × 10^{+02} | 80 | 1 × 10^{+02} | 20 | 0.071 | 10 | 4.1 × 10^{+11} | 0.26 |


The comparative performance of these strategies flows out of the characteristics of the problem and optimization methods. Because the design and resilience optimization problems are only loosely coupled (resilience variables do not significantly impact the upper-level cost), the sequential and alternating approaches perform nearly as well as a monolithic approach in terms of solution found and computational time. However, this is only the case when the resilience cost is included in the upper-level model, which is consistent with Ref. [44]. The bilevel strategy performs poorly for two reasons: first, establishing a gradient in the upper level is unnecessarily costly because each point evaluated to find the gradient using the finite difference method in the algorithm corresponds to a full optimization of the lower level; second, many of the iterations used solving the lower-level optimization are wasted because they are unrelated to establishing feasibility in the upper level. To summarize, the alternating strategy can perform comparably to the all-at-once strategy because of the loose coupling in the problem, while the bilevel strategy performs poorly because of the inherent computational expense of re-optimizing the lower level to calculate the gradient in the upper level.
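The cost of finite-difference gradients in a bilevel strategy can be made concrete by counting lower-level solves. The sketch below is an illustration under stated assumptions, not the article's implementation: it uses a hypothetical one-variable toy problem, plain gradient descent at both levels rather than SciPy's trust-region solver, and a forward-difference gradient, so each upper-level iteration triggers two full lower-level optimizations (in general, one more than the number of upper-level variables). The iteration counts mirror the 50 upper-level and 20 lower-level iterations quoted above.

```python
def lower_level(x_d, counter, n_iter=20):
    """Hypothetical lower-level solve: fixed-step gradient descent on x_r.

    Minimizes C_R(x_d, x_r) = (x_r - 2)^2 + 0.5 * (x_d - x_r)^2 over x_r.
    """
    counter["solves"] += 1
    x_r = 0.0
    for _ in range(n_iter):
        counter["model_evals"] += 1
        grad_r = 2.0 * (x_r - 2.0) - (x_d - x_r)  # dC_R/dx_r
        x_r -= 0.25 * grad_r
    return (x_r - 2.0) ** 2 + 0.5 * (x_d - x_r) ** 2

def upper_objective(x_d, counter):
    """Design cost plus the optimized resilience cost for this design."""
    return (x_d - 1.0) ** 2 + lower_level(x_d, counter)

def bilevel(x_d=0.0, n_upper=50, step=0.2, h=1e-6):
    """Gradient descent in the upper level with a forward-difference gradient."""
    counter = {"solves": 0, "model_evals": 0}
    for _ in range(n_upper):
        f0 = upper_objective(x_d, counter)      # one full lower-level solve
        f1 = upper_objective(x_d + h, counter)  # a second one for the gradient
        x_d -= step * (f1 - f0) / h
    return x_d, counter

if __name__ == "__main__":
    x_d, counter = bilevel()
    print(x_d, counter)
```

In this toy setting, 50 upper-level iterations require 100 lower-level optimizations and 2000 resilience-model evaluations to move a single design variable, which illustrates why the expense grows quickly with the number of upper-level variables.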

### 3.2 Cooling Tank Problem.

*t* = 20, meaning the system must be designed in such a way to be optimally resilient in this window. To optimize the resilience of this system, the size of the tank *x*_{T} and the size buffer of the valve inlet *x*_{I} (above the required size) are the design variables, while the response signals to reconfigure the input valve **x**_{ip} and output valve **x**_{op} (to decrease, increase, or maintain flow) are the resilience variables. The contingency management input states are $\vec{z} = [i, t, o]$, where *i* is a given state in the input valve (leak, blockage, or nominal), *t* is the level of the tank (low, high, or nominal), and *o* is the state of the output valve (leak, blockage, or nominal). The resulting optimization problem is expressed as follows:

*C*_{D}(*x*_{T}, *x*_{I}) is the design cost of both implementation and efficiency (which increases quadratically with buffer size), *C*_{R}(**x**_{ip}, **x**_{op}) is the resilience cost, which is taken over the set of fault simulations, and *g*_{R} is the (resilience-level) constraint determining whether the given set of variables results in a nominal mission profile in the nominal scenario. The scenario costs take the following form:

*l*(*t*) is the level of coolant in the tank, *b*(*t*) is the amount of useful unspent buffer coolant in the tank, and *iv*(*t*) and *ov*(*t*) are the policies used by the input and output valves, respectively. As shown, the primary failure costs result from the tank overfilling, emptying, and no longer having enough buffer coolant to cool the heat source. The valve policy costs are used to discourage unnecessary valve reconfiguration in cases where it is not needed, such as in policy states that are not entered in the simulations in the set of considered scenarios. These costs are summed over the number of time-steps in the simulation where the condition is present to account for the increased risk from greater time exposure resulting from each condition. Five scenarios are included in the set of scenarios *S* in the resilience model, which constitute blockages and leaks in each system, as shown in Table 5, along with their rates and costs (assuming no mitigation).

Scenario | Rate | Cost | Expected cost |
---|---|---|---|
Import coolant leak | 1.7 × 10^{−06} | 2.1 × 10^{+06} | 3.5 × 10^{+05} |
Import coolant blockage | 1.7 × 10^{−06} | 2.1 × 10^{+06} | 3.5 × 10^{+05} |
Store coolant leak | 1.7 × 10^{−06} | 1 × 10^{+06} | 1.7 × 10^{+05} |
Export coolant leak | 1.7 × 10^{−06} | 1 × 10^{+06} | 1.7 × 10^{+05} |
Export coolant blockage | 1.7 × 10^{−06} | 1 × 10^{+05} | 1.7 × 10^{+04} |


#### 3.2.1 Optimization.

This problem is difficult to solve in an all-at-once strategy because of the high resilience model dimensionality (54 variables) and the mix of variable types (continuous in the design model and discrete in the resilience model). In addition, because variables are state based (and not scenario based), some variables may be coupled (e.g., raising a level when it is too low may cause the level to become too high, resulting in a new set of actions). Thus, this work uses a custom evolutionary algorithm in a monolithic resilience model to generate and refine solutions. This makes it difficult to solve design and resilience models in tandem since the result of the lower-level optimization may not necessarily be continuous (or act like a continuous function to an upper-level solver). To solve this problem, the design model in this work is searched using the Nelder–Mead method in SciPy [100], a gradient-free direct search method, which maintains a simplex of points and iteratively replaces its vertices based on the objective values at the current points.

The alternating strategy was implemented on this problem using a population size of 50 and 100 iterations, while a population size of 20 and 20 iterations were used in the bilevel strategy. This was done to mitigate computational costs in the bilevel strategy and to take as much advantage of each successive optimization step in the alternating strategy as possible, since it was observed to converge quickly. Both strategies used a seeding method in the evolutionary algorithm, which enabled the best populations found at the end of each lower-level optimization to be saved and carried over to the next optimization. Since the lower-level optimization was performed using an evolutionary algorithm, these optimizations were run over 20 replicates to avoid potential issues with solution and performance variability. The progression of these strategies over the computational time of the optimization is shown in Fig. 10. As shown, the sequential strategies and the alternating strategy (without the cost of resilience) follow the trends shown for the notional problem in Sec. 3.1: they either proceed to the minimum-design-cost solution (10, 0), which does not incorporate any resilience, or terminate prematurely. However, unlike the previous comparison, the alternating strategy (with *C*_{R}) very quickly reaches a plateau where each individual optimization does not improve the design significantly, while the bilevel strategy searches the space more effectively, ultimately converging to a lower-cost design.
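The seeding method mentioned above, in which the best population from one lower-level optimization is carried into the next, can be sketched as follows. This is a minimal illustration and not the article's custom evolutionary algorithm: the policy encoding (one action per state), the cost function `policy_cost`, and the mutation scheme are all hypothetical stand-ins. The only point shown is the warm start through `seed_pop`.

```python
import random

ACTIONS = (-1, 0, 1)  # decrease, maintain, or increase flow

def policy_cost(policy, target):
    """Hypothetical lower-level cost: number of states with a non-ideal action."""
    return sum(a != t for a, t in zip(policy, target))

def evolve(target, pop_size=20, n_iter=20, seed_pop=None, rng=None):
    """Elitist evolutionary search; seed_pop warm-starts the population."""
    rng = rng or random.Random(0)
    n = len(target)
    # Start from the carried-over population (if any), then fill randomly.
    pop = [list(p) for p in (seed_pop or [])][:pop_size]
    while len(pop) < pop_size:
        pop.append([rng.choice(ACTIONS) for _ in range(n)])
    for _ in range(n_iter):
        pop.sort(key=lambda p: policy_cost(p, target))
        parents = pop[: max(1, pop_size // 2)]  # keep the best half unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            child = list(rng.choice(parents))
            child[rng.randrange(n)] = rng.choice(ACTIONS)  # point mutation
            children.append(child)
        pop = parents + children
    pop.sort(key=lambda p: policy_cost(p, target))
    return pop  # sorted, best policy first

if __name__ == "__main__":
    rng = random.Random(1)
    # Ideal policy for one design, and a slightly different one for the next.
    target_a = [rng.choice(ACTIONS) for _ in range(54)]
    target_b = list(target_a)
    for i in (0, 1, 2):
        target_b[i] = 1 if target_b[i] != 1 else 0
    pop_a = evolve(target_a)
    cold = evolve(target_b)
    warm = evolve(target_b, seed_pop=pop_a)
    print(policy_cost(cold[0], target_b), policy_cost(warm[0], target_b))
```

Because the best half of the population is always retained, a seeded run can never end worse than the best carried-over policy evaluated on the new problem, which is the property that makes seeding attractive when successive lower-level problems are similar.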

This is further demonstrated in Table 6, which shows the optimization performance (over 20 runs), an optimal design output from a single run, and a summary of the chosen policy for that design. As shown, the alternating strategy (with *C*_{R}) converges to a tank size of 20, the size which *by design* mitigates leak faults by making it impossible for the tank to drain completely, while the bilevel strategy converges to a tank size of 18.6. While the optimal designs found by all approaches include corrective actions in the lower level for when states go off-nominal (i.e., |*x*_{ip}, *x*_{op} ≠ 0| > 0), the design found by the bilevel strategy increases the inlet flow more often (*x*_{ip} > 0) in the resilience policy, since the inlet has buffer that it can leverage (i.e., *x*_{l} > 0). As a result, it does not need as large a tank, since it can increase the input flow in the corresponding faulty scenarios when it might otherwise rely solely on tank buffer.

Approach | Design: x_{t}* | Design: x_{l}* | Policy (summary): \|x_{ip}, x_{op} ≠ 0\| | Policy (summary): \|x_{ip} > 0\| | Performance (20 runs): $\bar{f}^*$ | std. | Time (s) | std. |
---|---|---|---|---|---|---|---|---|
Bilevel | 18.6 | 0.62 | 22 | 12 | 287,604 | 2,785 | 1,062 | 586 |
Alt. (with C_{R}) | 20 | 0.0 | 21 | 8 | 453,567 | 831 | 373 | 83 |
Alt. (no C_{R}) | 10 | 0.0 | 23 | 5 | 893,333 | 0 | 181 | 6 |
Seq. (with C_{R}) | 22 | 0.0 | 20 | 7 | 467,133 | 980 | 65 | 2 |
Seq. (no C_{R}) | 10 | 0.0 | 24 | 6 | 893,333 | 0 | 60 | 2 |


The superior performance of the bilevel strategy in this instance is a result of the coupling relationship between the upper- and lower-level problems. The pipe buffer has no intrinsic value outside its ability to be leveraged by a lower-level policy that has been correspondingly optimized. As a result, the alternating strategy (which optimizes each separately) reduces pipe margin to 0, while the bilevel strategy (which optimizes each in conjunction) finds an optimal pipe margin of 0.62, which it leverages in the resilience policy by increasing the input flow when appropriate to mitigate the fault scenario. This demonstrates how tightly coupled design and resilience problems can necessitate a bilevel strategy, since in these cases, the resilience provided by the design variables depends on an optimal resilience policy that leverages them, and that policy may itself be sensitive to changes in the design.

## 4 Discussion and Theoretical Implications

Using multilevel decomposition strategies on integrated resilience optimization problems can improve the computational efficiency of the optimization process by reducing the complexity of the design and resilience optimization problems. It is particularly desirable to reduce the number of evaluations of the resilience model, which can be computationally expensive because evaluation time scales as $O(E \cdot T \cdot S)$, where *E* is the number of equations, *T* is the number of time-steps, and *S* is the number of scenarios. Multilevel strategies can reduce this complexity by enabling the use of scenario-based decomposition strategies in the resilience optimization problem (as in the commonly used two-stage architecture), which greatly decrease the computational complexity of the resilience optimization problem by breaking up the lower-level optimization over scenarios or sets of scenarios. Sequential or alternating strategies additionally have the potential to reduce the space of the problems and proceed more efficiently through the design space (as was the case in the notional system problem). Finally, a monolithic formulation may not be able to readily solve a given problem when the design and resilience models have differing variable types (as was the case in the cooling tank problem). In this context, multilevel strategies can improve the ability to optimize the problem by enabling the design and resilience problems to be solved by methods applicable to each problem type.
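The saving from scenario-based decomposition can be illustrated with a small counting example. The sketch below assumes a hypothetical separable resilience problem in which each scenario has its own independent response (as in the two-stage setting); the scenario data, the `scenario_cost` function, and the exhaustive searches are illustrative assumptions, not the article's models. It compares a joint search over all full policies with a per-scenario search.

```python
from itertools import product

RESPONSES = (-1, 0, 1)  # hypothetical response options for each scenario

# Hypothetical scenarios: (rate, severity, ideal response). Not the article's data.
SCENARIOS = [
    (1e-3, 100.0, 1),
    (2e-3, 50.0, -1),
    (5e-4, 400.0, 0),
    (1e-3, 80.0, 1),
    (3e-3, 20.0, -1),
]

def scenario_cost(scenario, response):
    """Expected cost of one scenario under one response (one 'simulation')."""
    rate, severity, ideal = scenario
    return rate * severity * (1 + abs(response - ideal))

def joint_search(scenarios):
    """Enumerate every full policy: K**S policies with S simulations each."""
    best_policy, best_cost, n_sims = None, float("inf"), 0
    for policy in product(RESPONSES, repeat=len(scenarios)):
        cost = 0.0
        for s, r in zip(scenarios, policy):
            cost += scenario_cost(s, r)
            n_sims += 1
        if cost < best_cost:
            best_policy, best_cost = policy, cost
    return best_policy, best_cost, n_sims

def decomposed_search(scenarios):
    """Optimize each scenario's response independently: K*S simulations."""
    policy, total, n_sims = [], 0.0, 0
    for s in scenarios:
        costs = []
        for r in RESPONSES:
            costs.append(scenario_cost(s, r))
            n_sims += 1
        best = min(range(len(RESPONSES)), key=costs.__getitem__)
        policy.append(RESPONSES[best])
        total += costs[best]
    return tuple(policy), total, n_sims

if __name__ == "__main__":
    print(joint_search(SCENARIOS))
    print(decomposed_search(SCENARIOS))
```

With five scenarios and three responses, the joint search runs 1215 scenario simulations while the decomposed search runs 15 and returns the same policy. This equivalence relies on the responses being independent across scenarios; it would not hold for state-based policies where one variable affects several scenarios.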

To better understand how (and when) to apply different multilevel decomposition strategies, Sec. 3 compared alternating and bilevel strategies (and their variants) on two different problems with different levels of coupling. In this comparison (summarized in Fig. 11), it was shown that the *alignment* and *coupling* of the design/operational and resilience problems drive the relative effectiveness of multilevel decomposition strategies. When the design/operational and resilience problems are not aligned (which constitutes all problems where there is a trade-off between design cost and resilience), the resilience model (or a surrogate) must be included in alternating and sequential strategies for them to perform adequately. Coupling additionally drives the choice of multilevel decomposition strategies: while alternating strategies can be used when the design and resilience problems are uncoupled or loosely coupled, a fully coupled problem requires a bilevel or all-at-once strategy.

Finally, the relative efficiency and effectiveness of given decomposition strategies is heavily dependent on the algorithm used. In the exhaustive search used in the drone model in Ref. [44], for example, the bilevel architecture was able to reduce (some) computational cost by shrinking the search space, which reduced the number of iterations, and by enabling a lower-level decomposition strategy. However, as shown in the notional example presented here, the large number of lower-level optimizations necessary to approximate a gradient in the upper-level problem can increase the computational cost by orders of magnitude when using a gradient-based solver. This was not the case in the tank problem, in part because the design/operations model was optimized using the Nelder–Mead method, which does not have a gradient-finding step at each iteration. Thus, to perform efficiently, the choice of strategy must be connected to the solution algorithm, whether because a strategy is inherently quicker with a given algorithm or because upper- and lower-level problems require different methods to solve efficiently. In particular, bilevel formulations are significantly hindered by the use of gradient-based approaches in the upper level, since each point in the upper-level design/operations optimization results in a re-optimization of the lower level. There is some potential to mitigate the efficiency issues of the bilevel architecture by adapting the number of scenarios and iterations of the resilience optimization steps depending on the stage of optimization (e.g., to speed gradient approximation steps). Thus, future work should develop specialized architectures for resilience optimization that help manage the computational costs specific to this type of problem.

## 5 Conclusions

Effective application of resilience optimization architectures requires knowledge of the underlying optimization problem formulation. Because resilience optimization problems can be formulated in a number of different ways, characterizing the applicability of given architectures requires considering their ability to optimize each type of formulation. As presented in Sec. 2, existing architectures use a variety of scenario-based and multilevel decomposition strategies to effectively solve different formulations of the resilience optimization problem. To understand the applicability of the multilevel decomposition strategies in the literature, as well as the alternating strategy (which has not been used or studied rigorously in this application), this work then applied these architectures to two example problems. There were two main lessons learned from these two applications. First, the ability of alternating architectures to optimize effectively depends on the *alignment* and *coupling* of the design and resilience problems. Second, the performance of the bilevel strategy varies widely based on the algorithms used at each level, since it can enable different solution strategies at each level but also requires a full optimization of the lower level at each upper-level design point. More broadly, knowing when to apply decomposition architectures requires an understanding of the formulation of the problem (in terms of variable types), the alignment and coupling properties of the underlying models, and the efficiency characteristics of the algorithms used.

There are a few limitations of the insights presented here, which should be resolved in future work. First, while the comparison here can help one understand when to use an alternating or bilevel architecture, it may be difficult to identify the coupling relationships prior to optimization. Future work should develop methods for identifying coupling relationships in a given resilience optimization problem before selecting an architecture. Second, while gradient-based optimization methods are shown to lead to high computational costs in bilevel approaches, these costs can be reduced by using gradient-free optimization algorithms, using a surrogate of the resilience model in the upper level, or limiting the amount of re-optimization of the resilience model during the gradient-finding steps. Future work should investigate these approaches to best understand how to solve integrated resilience optimization problems. Finally, a large number of possible formulations and architectures were presented in Sec. 2.2, which demonstrates the large space of possible resilience optimization approaches. In addition, methods in the larger field of multidisciplinary design optimization (e.g., analytic target cascading) have largely been neglected in the field of resilience optimization despite the mature and readily available tools that exist for this purpose (e.g., OpenMDAO, AGILE). Future work should continue to explore and compare architectures that have not seen substantial use for resilience optimization so that they can be understood and used appropriately in practice.

## Acknowledgment

This research was partially conducted at NASA Ames Research Center. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not constitute or imply its endorsement by the United States Government.

## Conflict of Interest

There are no conflicts of interest.

## Data Availability Statement

The datasets generated and supporting the findings of this article are obtainable from the corresponding author upon reasonable request.