Resilience Engineering is an approach to system design focusing on a system’s ability to adapt, absorb disturbances, recover from unexpected events, and continue to operate effectively under varying and often challenging conditions.
Rather than eliminating all failures, it emphasizes building flexibility, responsiveness, and capacity. These qualities help in responding to the unforeseen, so the system can sustain performance even when things go wrong.
Resilience is defined as a system’s ability to withstand and minimize the impact of disruptions caused by an external event, as well as its ability to maintain or restore its performance after the disruption.
Resilience Engineering considers the entire system, including its human and organizational aspects, rather than just technical components.
Resilience Model
Figure 1 shows the trend we are analyzing. On the vertical axis, we see capability. This refers to how well our system or equipment is performing. The horizontal axis represents time. Under normal conditions, the equipment operates at full capacity — 100%.

But then an unexpected event occurs. It may be a mechanical fault, a cyber intrusion, or even extreme weather. Each results in a sudden and sharp drop in capability.
The degree to which a system can absorb the impact depends on its design. It also depends on the level of redundancy and on the condition of the equipment. This is what we call the ability to absorb.
The five main pillars of Resilience Engineering are:
- Anticipating Disruptions: Identifying potential threats and vulnerabilities in advance.
- Detecting Deviations: Recognizing when the system is starting to move away from its normal state.
- Responding Effectively: Developing and implementing appropriate actions to address the disruption.
- Recovering Quickly: Restoring the system to its normal or a new stable state after a disruption.
- Continuous Learning: Incorporating lessons learned from past events to improve future resilience.
Anticipating Disruptions
Identifying potential threats and vulnerabilities in advance.
Power transformers face a range of operational and environmental risks — from overloading and insulation degradation to lightning surges and contamination. Resilience starts by forecasting and preparing for these challenges before they occur.
- Condition-based risk assessments: Utilities analyze historical DGA, temperature, and load data to predict emerging failures such as overheating or arcing.
- Design stage anticipation: Transformers near coastal areas are built with corrosion-resistant radiators and sealed oil preservation systems to handle salt-laden air.
- Grid contingency planning: Operators model how transformers will behave under N-1 or N-2 contingencies (loss of one or two units) to ensure continuity of supply.
- Aging fleet management: Transformer fleets are ranked by criticality, age, and condition so that the highest-risk assets are prioritized for refurbishment or replacement.
- Addressing single points of vulnerability: This can be done through a simple redesign or by using technological advances. For example, where glitches in temperature probes have caused spurious trips, a 1-second time delay can be added to the trip, together with a rate-of-change alarm. This way, potential failures can be caught before they occur or affect the process.
- Quality maintenance: Doing maintenance in-house as far as possible helps ensure higher quality and repeatability, which improves the reliability of the system and equipment.
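The time delay and rate-of-change alarm described in the list above can be sketched in a few lines. This is a minimal illustration, not a protection-grade implementation: the class name `TempTripLogic`, the 120 °C trip setpoint, and the 2 °C/s rate limit are assumptions made for the example. Only the 1-second delay comes from the text.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass
class TempTripLogic:
    """Time-delayed over-temperature trip plus a rate-of-change alarm."""
    trip_limit_c: float = 120.0     # hypothetical trip setpoint, deg C
    trip_delay_s: float = 1.0       # the 1-second delay mentioned in the text
    roc_limit_c_per_s: float = 2.0  # hypothetical rate-of-change alarm limit
    _above_since: Optional[float] = field(default=None, init=False, repr=False)
    _last: Optional[Tuple[float, float]] = field(default=None, init=False, repr=False)

    def update(self, time_s: float, temp_c: float) -> dict:
        # Rate-of-change alarm: compare against the previous sample.
        roc_alarm = False
        if self._last is not None:
            dt = time_s - self._last[0]
            if dt > 0:
                rate = (temp_c - self._last[1]) / dt
                roc_alarm = abs(rate) > self.roc_limit_c_per_s
        self._last = (time_s, temp_c)

        # Delayed trip: the reading must stay above the limit for the full delay.
        if temp_c >= self.trip_limit_c:
            if self._above_since is None:
                self._above_since = time_s
            trip = (time_s - self._above_since) >= self.trip_delay_s
        else:
            self._above_since = None  # a reading below the limit resets the timer
            trip = False
        return {"trip": trip, "roc_alarm": roc_alarm}


# A single-sample probe glitch: alarmed by rate of change, but no trip.
glitch_logic = TempTripLogic()
glitch = [glitch_logic.update(t, c)
          for t, c in [(0.0, 80.0), (0.5, 150.0), (1.0, 80.0)]]

# A sustained high temperature: trips once the delay has elapsed.
sustained_logic = TempTripLogic()
sustained = [sustained_logic.update(t, c)
             for t, c in [(0.0, 125.0), (0.5, 125.0), (1.0, 125.0)]]
```

The design choice is to let a short spike raise an alarm for investigation while reserving the trip for a condition that persists. Real delay and limit values would come from the protection study for the specific equipment.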
Detecting Deviations
Recognizing when the system starts to move away from its normal state.
Early detection prevents minor issues from escalating into catastrophic failures. This pillar relies heavily on monitoring, data analytics, and pattern recognition.
- Online DGA monitoring: Detects rising acetylene or ethylene levels, indicating arcing or thermal faults before alarms or trips occur.
- Partial discharge (PD) sensors: Identify internal insulation breakdown well before it becomes a fault.
- Bushing power factor (tan δ) monitoring: Detects internal moisture or contamination trends.
- Load and temperature trending: Identifies deviations such as overloading patterns during peak hours or fan/pump malfunctions.
- AI-driven anomaly detection: Machine learning algorithms now compare real-time readings with historical baselines to automatically flag subtle deviations in behavior.
- Newer technology: Flow switches used to indicate process flow report only two states, open or closed, and they tend to stick and give false indications. Replacing them with analogue flow sensors allows the full flow range to be monitored. Flow performance can then be trended against alarm limits to indicate a decrease in flow before it affects the process.
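Comparing a live reading with a historical baseline, as the monitoring items above describe, can be illustrated with a simple statistical check. The sketch below uses a plain z-score as a stand-in for the machine learning methods mentioned in the list. The function name `flag_deviation`, the 3-sigma limit, and the gas readings are all hypothetical.

```python
from statistics import mean, stdev


def flag_deviation(baseline, reading, z_limit=3.0):
    """Return (z_score, is_anomaly) for a reading against baseline values."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        raise ValueError("baseline has no variation; z-score is undefined")
    z = (reading - mu) / sigma
    return z, abs(z) > z_limit


# Hypothetical historical ethylene readings in ppm (illustrative values only).
ethylene_baseline_ppm = [10, 11, 9, 10, 12, 10, 11, 9]

z_normal, normal_flag = flag_deviation(ethylene_baseline_ppm, 11)
z_high, high_flag = flag_deviation(ethylene_baseline_ppm, 18)
```

With these made-up values the reading of 11 ppm stays within the baseline spread and the reading of 18 ppm is flagged. The same check could be applied to a trended flow or temperature signal, with limits chosen for that signal.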
Responding Effectively
Developing and implementing appropriate actions to address the disruption.
When a transformer shows signs of distress, the effectiveness and speed of the operational response determine whether the event becomes a failure or a learning opportunity.
- Automated protection and trip coordination: Buchholz relays, pressure relief devices, and differential protection isolate faults immediately to prevent catastrophic damage.
- Dynamic load reduction: Operators remotely reduce transformer loading after early fault detection to limit further deterioration.
- Deploying mobile oil treatment units: Used to restore dielectric strength and reduce moisture during early warning signs of oil degradation.
- Switching redundancy: Critical substations are designed with parallel transformer operation or alternate feeders, allowing quick reconfiguration during an outage.
- Operator competence: Resilience engineering also depends on the competence of the people operating and maintaining the plant or equipment. Operators discuss risks and implement operating procedures, which allows them to manage these risks proactively instead of waiting for a trip event.
Recovering Quickly
Restoring the system to its normal or a new stable state after a disruption.
A resilient system minimizes downtime by restoring service safely and efficiently, sometimes through creative temporary solutions.
- Mobile substations or spare transformers: Deployed to restore supply within hours of a transformer failure — widely practiced by utilities such as Eskom, National Grid (UK), and Hydro-Québec.
- Rapid bushing or tap changer replacement programs: Field teams trained for modular replacement to reduce repair duration from weeks to days.
- Oil reclamation and reconditioning: Allows partial restoration of insulation performance after overheating or contamination events.
- Use of 3D scanning and digital twins: Enables fast fabrication of replacement components and simulation of re-energization conditions to reduce commissioning time.
Continuous Learning
Incorporating lessons learned from past events to improve future resilience.
Resilient organizations use every incident as a learning opportunity to strengthen future designs, maintenance practices, and operational culture.
- Failure forensics: After a bushing explosion or winding fault, utilities perform root cause analysis — lessons lead to updated specifications and improved procurement standards.
- Fleet analytics: Trends from fleetwide monitoring are used to adjust maintenance intervals and refine life-extension models.
- Knowledge sharing platforms: Lessons from incidents (e.g., OLTC contact erosion, corrosion in conservators) are shared across stations and maintenance teams.
- Design evolution: Older mineral-oil designs are replaced by ester-based fluids and sealed tanks, reflecting lessons on fire safety and moisture control.
- Training and human reliability programs: Operators learn from near misses and simulated fault events to improve situational awareness and decision-making.
The Recovery Phase that follows is where we see the full value of resilience. The system may take the quickest recovery path. Alternatively, it might take the normal, more gradual one. This choice depends not just on technology and spare parts. Crucially, it relies on the skills of the workforce.
Skilled technicians and engineers can diagnose faults faster, implement workarounds, and safely restore operations with minimal risk. Having critical spares and the right skills makes all the difference during the recovery phase.
Another key factor is quick decision-making. When an incident occurs, the team must convene immediately. They need to identify the root cause. The team should brainstorm solutions and make decisions. It is important to get the system or equipment back online, even with emergency modifications.
Slow or restrictive processes, such as procurement, can be mitigated by proactively setting up overarching support and spares contracts with OEMs. This ensures a quicker response when a problem occurs.
Finally, as shown on the far right of the diagram, recovery may not always fully restore the system. Sometimes it must be stabilized at a slightly reduced capability, a new normal. But the goal is clear: limit the damage, bounce back fast, and maintain critical operations, then return to full capacity as soon as possible.
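One way to put numbers on the capability-versus-time curve is to compare the area under it with the area of uninterrupted full capability. The sketch below does this for a quick and a gradual recovery path. The metric and the curve points are illustrative assumptions, not figures from the article.

```python
def capability_retained(curve, nominal=100.0):
    """Fraction of nominal capability delivered over the curve's time span.

    curve: list of (time, capability) points with strictly increasing time.
    Capability between points is interpolated linearly (trapezoid rule).
    """
    area = 0.0
    for (t0, c0), (t1, c1) in zip(curve, curve[1:]):
        area += 0.5 * (c0 + c1) * (t1 - t0)
    duration = curve[-1][0] - curve[0][0]
    return area / (nominal * duration)


# Hypothetical curves: time in hours, capability in percent.
# Both drop to 40% at the event and settle at a 95% "new normal".
quick_path = [(0, 100), (10, 100), (10.5, 40), (20, 40), (30, 95), (100, 95)]
gradual_path = [(0, 100), (10, 100), (10.5, 40), (40, 40), (90, 95), (100, 95)]

quick_score = capability_retained(quick_path)
gradual_score = capability_retained(gradual_path)
```

With these example points the quick path retains a larger share of nominal capability than the gradual one. That difference is the value that critical spares, skilled people, and fast decisions contribute during recovery.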
Case Study: Resilience Engineering for Power Transformers
Power transformers are the backbone of modern electricity networks, yet they operate under harsh conditions and face stresses such as overloads, short circuits, and aging. By applying resilience engineering principles, we can see how transformers anticipate, absorb, and recover from faults.
Normal State
In healthy operation, a transformer delivers its rated output reliably:
- Insulation is dry and strong.
- Oil is clean and stable.
- Cooling fans and pumps are operating smoothly.
- Sensors show no abnormality.
Event: A Disturbance Occurs
Transformers face multiple types of disruptive events:
- Through faults → Mechanical stress and winding displacement.
- Overloading during heatwaves → Excessive hot spot temperatures, accelerating aging.
- Bushing failure → Sudden flashover or even catastrophic failure.
- Cooling system failure → Rising oil and winding temperatures.
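The link between hot spot temperature and accelerated aging can be made concrete with the Arrhenius-type aging acceleration factor used in the IEEE C57.91 loading guide. The sketch assumes thermally upgraded paper with a 110 °C reference hot spot. The constant 15000 and the reference temperature should be checked against the edition of the guide in use, and the heatwave profile is hypothetical.

```python
import math


def aging_acceleration_factor(hot_spot_c, ref_hot_spot_c=110.0, b=15000.0):
    """Relative insulation aging rate at a given hot spot temperature.

    Equals 1.0 at the reference hot spot temperature; both temperatures in deg C.
    """
    return math.exp(b / (ref_hot_spot_c + 273.0) - b / (hot_spot_c + 273.0))


def equivalent_aging_hours(profile):
    """Hours of reference-rate aging consumed by (hours, hot_spot_c) segments."""
    return sum(hours * aging_acceleration_factor(temp_c)
               for hours, temp_c in profile)


# Hypothetical heatwave day: 18 h at 98 deg C and 6 h at 125 deg C.
heatwave_day = [(18.0, 98.0), (6.0, 125.0)]
consumed_hours = equivalent_aging_hours(heatwave_day)

factor_at_ref = aging_acceleration_factor(110.0)
factor_at_120 = aging_acceleration_factor(120.0)
```

Because the relationship is exponential, a few hours above the reference temperature can consume more insulation life than the rest of the day combined.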
Mitigation: Absorbing the Shock
Resilient transformers are designed with features that soften the impact:
- Redundant cooling banks ensure continued heat dissipation even if one fan fails.
- Buchholz relays detect incipient faults before they escalate.
- Robust winding bracing withstands short-circuit forces.
- Oil processing and moisture control preserve dielectric strength.
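The benefit of a redundant cooling bank can be estimated with a k-out-of-n calculation. The sketch assumes that fans fail independently and share the same availability, which is a simplification. The fan counts and the 0.95 availability are hypothetical.

```python
from math import comb


def k_of_n_availability(n, k, p):
    """Probability that at least k of n identical, independent units are available."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


# Three fans are needed for full cooling; each is available 95% of the time.
no_spare = k_of_n_availability(n=3, k=3, p=0.95)   # no redundancy
one_spare = k_of_n_availability(n=4, k=3, p=0.95)  # one redundant fan
```

Under these assumptions a single spare fan raises the chance of having full cooling available from about 86% to about 99%. Common-cause failures, such as loss of the auxiliary supply, would reduce that gain.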
Recovery: Restoring Capability
After an event, recovery depends on response speed and available resources:
- Quick recovery: Mobile oil purification units or temporary cooling systems.
- Normal recovery: Replacing bushings, repairing OLTC contacts, or rewinding damaged windings.
New Normal State
Sometimes, full recovery is not possible:
- Aged insulation may shorten the remaining life expectancy.
- Repairs may restore service but at a reduced MVA rating.
- Retrofits may improve operation but limit overload margins.
Resilience Through Robust Design and New Technologies
Resilience in transformers is not only about responding to failures, but also about designing for adaptability and longevity. Emerging technologies and improved engineering practices are reshaping resilience strategies:
- Robust Mechanical Design:
  - Short-circuit–resistant windings with enhanced clamping and bracing.
  - Improved bushing technologies, such as resin-impregnated paper (RIP) bushings, which reduce fire and explosion risks.
- Smart Monitoring and Diagnostics:
  - Online dissolved gas analysis (DGA) for real-time fault detection.
  - Fiber-optic sensors embedded in windings for precise hot spot temperature monitoring.
  - Partial discharge monitoring to detect insulation breakdown before catastrophic failure.
- Advanced Cooling and Thermal Management:
  - High-efficiency fans and pumps with built-in redundancy.
  - Intelligent control systems that optimize cooling based on transformer load and ambient conditions.
- Oil and Insulation Life Extension:
  - On-line oil filtration and reclamation systems to slow down insulation aging.
  - Ester-based fluids with higher fire points and better moisture tolerance than mineral oil.
- Digital Twin Technology:
  - Virtual models of transformers simulate operating stresses and predict degradation pathways, enabling proactive interventions.
- Modular and Mobile Resilience Measures:
  - Mobile transformers or skid-mounted spares for rapid replacement.
  - Plug-and-play monitoring kits that can be deployed temporarily during high-risk periods.
Conclusion
Resilience engineering teaches us that no system — whether mechanical or human — can avoid disruption forever. The real test is how much capability we lose during the shock and how effectively we recover afterwards.
In the case of transformers, resilience is measured in uptime, safety, life extension, and cost savings. In our personal and professional lives, it is measured in our ability to adapt, recover, and continue progressing despite setbacks.
References
- N. M. Pilanawithana, Y. Feng, K. London, and P. Zhang, “Developing resilience for safety management systems in building repair and maintenance: A conceptual model,” Safety Science, vol. 152, 105768, 2022. DOI: 10.1016/j.ssci.2022.105768.
- M. Meira, C. R. Ruschetti, R. E. Álvarez, and C. J. Verucchi, “Power transformers monitoring based on electrical measurements: state of the art,” IET Gener. Transm. Distrib., vol. 12, no. 12, pp. 2805–2815, 2018. DOI: 10.1049/iet-gtd.2017.2086.
