Leveraging Security Chaos Engineering for Cloud Cyber Resilience - Part II


31.7.2023

Kennedy Torkura


Contributors

Kennedy Torkura

Co-Founder & CTO

The first part of this blog series discussed the difference between cyber resilience and cyber security and unpacked the concept of cyber resilience through the lens of the people, process, and technology framework. This concluding part dives deeper into cloud cyber resilience by examining the current state of the art, highlighting cyber resilience engineering, and connecting the dots between the Cyber Resiliency Engineering Framework and Security Chaos Engineering.

Cloud Cyber Resilience - State of the Art

A recent study by Cisco found that cyber resilience is a high priority for 96 percent of executives. Similarly, the World Economic Forum recently formulated the Cyber Resilience Framework and Cyber Resilience Index to accelerate organizations' adoption of cyber resilience. These are just two of several initiatives designed to sensitize organizations to cyber resilience and encourage its adoption. However, adoption has not grown noticeably, which raises the question: why is the adoption of cyber resilience low, and what approaches could accelerate it?

We explore the above questions within the scope of cloud infrastructure. Some useful insights can be gained by examining three dominant cloud indicators: cloud services, cloud architectural blueprints, and cloud reference implementations.

Cloud Services

The first indicator to consider is the services offered by the three main cloud service providers (CSPs): Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure. AWS and Azure each offer over 200 services, while GCP offers slightly over 100. However, none of these services is dedicated to enabling cyber resilience. One might argue that the services supporting data resilience (e.g., AWS Backup) fall under cyber resilience. However, the goal of an efficient cyber-resilient system would be to keep the storage layer far beyond the reach of adversaries by implementing preceding resilience layers. Another argument is that several existing cyber security services could be leveraged to enable cyber resilience. Agreed, some CSP security services might be leveraged this way, e.g., services that enable zero trust architecture and Threat Detection and Incident Response. But is that not still cloud security? Cyber resilience has its own design principles, goals, and objectives, and CSP security services must be deliberately aligned with them to deliver resilience. CSPs can address this impasse by creating services specifically to enable, support, and manage cyber resilience. Such services would allow cyber resilience architects, engineers, etc., to design, implement, and deploy cyber resilience across cloud infrastructure.

Cloud Architectural Blueprints

Each of the three leading CSPs provides a Well-Architected Framework (WAF) as an architectural blueprint for building applications and infrastructure. Generally, the WAFs include the following pillars: Reliability, Security (including compliance and privacy), Cost Optimization, Operational Excellence, and Performance Efficiency/Optimization. (AWS has an additional pillar, Sustainability.) None of these frameworks offers guidance on cyber resilience, although you might be tempted to point to the reliability and security pillars. However, the reliability pillar covers non-security requirements, most of which fall under or overlap with operational and system resilience. Hence, comparing them to cyber resilience is akin to comparing application unit tests to penetration tests: similar goals but completely different methodologies, skill sets, etc.

Similarly, the security pillar lays the foundations for cyber security, which is not the same as cyber resilience. The first part of this blog series covers the difference between the two in detail.

**Architectural Diagram of an Azure Landing Zone**

Cloud Reference Implementations

Due to the complexity of cloud infrastructure, it's not sufficient to provide WAFs and other kinds of best practices. Cloud users need some kind of automation that simplifies practical adoption. This need has paved the way for cloud vending machines as part of the Cloud Adoption Framework (CAF). The names differ across CSPs; however, the aims and objectives are similar. AWS offers Control Tower, which provides landing zones with best-practice implementations. Similarly, Microsoft Azure provides landing zones as a component of its CAF. GCP also provides landing zones as reference implementations for several use cases in the form of Infrastructure-as-Code. Looking through these CSP landing zones, there are no example implementations that enable cyber resilience. This is not surprising, since the landing zones are based on the WAFs and the services the CSPs already offer.

Cyber Resilience Engineering

Efficient adoption of cyber resilience for cloud infrastructure requires engineering-driven approaches. In recent years, engineering-driven approaches have been force multipliers in adopting and maturing several technologies. This is borne out by the evolution of several cloud-native computing concepts that are rapidly becoming commonplace in modern infrastructure, including continuous integration/delivery/deployment, DevOps, and microservices. Footprints of these concepts are found across organizations regardless of size, industry, geographical location, etc. Whilst theoretical definitions remain foundational and important, clear directions on the engineering and practitioner aspects are critical. Clear engineering-driven approaches allow adopters to hit the ground running, turn concepts into reality, and iterate to maturity. However, there haven’t been adequate engineering-driven initiatives for cyber resilience. The Cyber Resiliency Engineering Framework (CREF) recently appeared to fill this void.

Cyber Resiliency Engineering Framework

Despite the theoretical understanding of cyber resilience, its adoption is limited, especially as practical implementation guidelines have been lacking. CREF addresses this challenge by providing a comprehensive approach to designing, developing, and maintaining cyber-resilient systems. CREF is an initiative driven by the US National Institute of Standards and Technology (NIST). It differentiates itself from other cyber resilience initiatives by providing precise constructs that define cyber resilience as a domain, including goals, design principles, and implementation techniques and approaches. CREF draws many of its constructs from system resilience while clarifying the relationship between cyber resilience and risk management. This allows for a tangible understanding of these constructs from several perspectives, in particular the organizational, management, and engineering perspectives.

Leveraging CREF for Cloud Cyber Resilience

Typical of a framework, CREF provides high-level guidance that allows for contextual implementation against specific end-user requirements. Though this allows for flexible implementation, it is challenging for newer technologies, e.g., cloud computing. Just as other frameworks eventually publish cloud-specific versions, it would be very useful to have a version of CREF focused specifically on cloud technologies. For example, mapping the CREF techniques to cloud services (as done in the table below for Adaptive Response) would be a huge relief to cloud cyber resilience practitioners. Furthermore, the entirety of the CREF constructs would need to be applied to cloud technologies. This would empower cloud stakeholders to enable cloud cyber resiliency more easily. Ultimately, these challenges require a joint effort from cloud stakeholders: CSPs, regulatory organizations, third-party providers, etc.

**A Mapping of the Adaptive Response CREF Technique to AWS Services**

Security Chaos Engineering - A Discipline Forged In Resilience

Security chaos engineering is a direct spin-off of chaos engineering, a discipline that emerged over a decade ago from Netflix’s need for high levels of system resilience in its cloud infrastructure. Over the years, several reputable companies have achieved strong system and operational resilience by leveraging chaos engineering. However, the cyber security industry is yet to leverage chaos engineering to achieve cyber resilience, even though cyber resilience is critical for coping with the rapidly increasing rate of successful cyber attacks.

**SCE Uses Adversarial Techniques to Verify Cyber Resilience**

Cloud Security Verification

The central intuition behind chaos engineering is the intentional and organized injection of faults into a system to see whether the system can handle the threats to resilience that come with those faults. If not handled appropriately, the injected faults are likely to result in system failures, eventually impacting several desired properties, including availability and performance. Similarly, when this approach is applied in a cyber security context, the injected faults aim to impact confidentiality, integrity, and availability. Furthermore, these security faults may be designed to target the cyber resiliency measures implemented in the system to verify how well those implementations ensure resilience. Essentially, these security faults are adversarial in nature and align with the kind of testing CREF specifies as a means to verify cyber resilience and make continuous improvements. CREF recommends using adversarial testing to verify the effectiveness of cyber resilience approaches, design principles, goals, and objectives. These verification processes can be conducted at several stages in the lifecycle of cyber-resilient systems.

Security Chaos Engineering (SCE) is a form of adversarial testing that allows defenders to easily emulate threat behavior against systems to verify the effectiveness of security controls. The same approach can be used to verify the effectiveness of cyber-resilient systems. Cyber resilience aims to allow anticipation and adaptation under unfavorable adversarial conditions, so it is important to test cyber-resilient systems continuously to sustain that resilience.
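The experiment loop described above (state a hypothesis, inject a security fault, verify the response, clean up) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of any particular SCE tool: the `Experiment` class, the `run_experiment` function, and the in-memory `env` dictionary that stands in for a cloud account are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Env = Dict[str, Any]  # hypothetical in-memory stand-in for a cloud environment


@dataclass
class Experiment:
    """One security chaos experiment: a hypothesis plus inject/verify/rollback steps."""
    hypothesis: str
    inject: Callable[[Env], None]    # introduces the security fault
    verify: Callable[[Env], bool]    # True if the system behaved as hypothesised
    rollback: Callable[[Env], None]  # restores the pre-experiment state
    evidence: List[str] = field(default_factory=list)


def run_experiment(exp: Experiment, env: Env) -> bool:
    """Run the experiment and always roll back, even if a step raises."""
    exp.evidence.append(f"hypothesis: {exp.hypothesis}")
    try:
        exp.inject(env)
        exp.evidence.append("fault injected")
        held = exp.verify(env)
        exp.evidence.append(f"hypothesis held: {held}")
        return held
    finally:
        exp.rollback(env)
        exp.evidence.append("rolled back")


# Toy fault: expose a storage bucket publicly (simulated, no real cloud calls).
def make_bucket_public(env: Env) -> None:
    env["bucket_public"] = True


# Toy detection control: raises an alert only when detection is enabled.
def detector_raises_alert(env: Env) -> bool:
    if env["bucket_public"] and env["detection_enabled"]:
        env["alerts"].append("public-bucket")
    return "public-bucket" in env["alerts"]


def restore_bucket(env: Env) -> None:
    env["bucket_public"] = False


env: Env = {"bucket_public": False, "detection_enabled": True, "alerts": []}
experiment = Experiment(
    hypothesis="A publicly exposed bucket triggers an alert",
    inject=make_bucket_public,
    verify=detector_raises_alert,
    rollback=restore_bucket,
)
held = run_experiment(experiment, env)
print(held, experiment.evidence)
```

Putting the rollback in a `finally` block is a deliberate choice in this sketch: the injected fault is removed even when verification fails or raises, so the experiment never leaves the environment less secure than it found it.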

**Adaptive Response Technique Showing Some Example Metrics**

Cloud Cyber Resiliency Verification

Testing for system properties is imperative because systems often exhibit unexpected characteristics, and this challenge is exacerbated in complex systems. The same challenge is central to the current state of cloud security, hence the need to adopt cloud cyber resilience. Like cyber security, cyber resilience is not an attribute that can be acquired off the shelf or “bolted on”. Cyber-resilient-by-default systems would provide optimal resilience outcomes.

Similarly, a cyber-resilient culture is imperative! This implies using a flexible, cloud-native approach for verifying that cyber resilience requirements are well designed and implemented throughout a system’s lifecycle. CREF has 14 implementation techniques. However, most systems would need to select and implement only some of these techniques based on several factors, chiefly the organization's risk profile. Furthermore, objective measurement and validation of implemented cyber resilience constructs require metrics. The CREF document provides some example metrics that could be leveraged. For example, the number of attempted intrusions stopped at a network perimeter is an example metric for the Adaptive Response implementation technique. This metric can be computed following SCE experiments and used as a guide for subsequent improvements. The same approach applies to existing cyber security activities such as evaluating ransomware countermeasures: by injecting a ransomware attack, it becomes evident whether the countermeasures are effective.
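As a rough illustration of how such a metric might be computed after a round of SCE experiments, the sketch below counts the intrusion attempts stopped at the perimeter and derives the stop rate. The record type `IntrusionAttempt`, the function `perimeter_stop_metric`, and the sample technique labels are hypothetical; a real pipeline would populate the records from experiment evidence rather than a hard-coded list.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class IntrusionAttempt:
    """Outcome of one injected intrusion attempt (hypothetical record type)."""
    technique: str
    stopped_at_perimeter: bool


def perimeter_stop_metric(attempts: Iterable[IntrusionAttempt]) -> Tuple[int, float]:
    """Return (number of attempts stopped at the perimeter, fraction stopped)."""
    records = list(attempts)
    stopped = sum(1 for a in records if a.stopped_at_perimeter)
    # Avoid division by zero when no experiments have been run yet.
    rate = stopped / len(records) if records else 0.0
    return stopped, rate


# Illustrative results only; the labels and outcomes are made up for this sketch.
results = [
    IntrusionAttempt("port-scan", True),
    IntrusionAttempt("brute-force-login", True),
    IntrusionAttempt("exposed-admin-endpoint", False),
]
stopped, rate = perimeter_stop_metric(results)
print(f"stopped={stopped}, rate={rate:.2f}")
```

Tracking both the raw count and the rate across successive experiment runs gives a simple trend line for judging whether changes to perimeter controls actually improve outcomes.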

**Mitigant Helps Enterprises To Find the Sweet Spot Between Compliance, Cloud Security and Cyber Resilience**

Cloud Cyber Resilience With Mitigant

Mitigant Cloud Immunity is the first implementation of SCE! It is designed to give cloud infrastructure a balance of cloud compliance, security, and cyber resilience. Mitigant Cloud Immunity consists of several attacks designed to validate the effectiveness of cyber resilience implementations on AWS. All attacks are mapped to the MITRE ATT&CK framework; this enables the implementation of real-world attacks for various use cases, including incident response exercises, threat emulation to validate detection efficiency, and atomic attacks that can be run in a penetration-testing style without waiting for the next scheduled penetration test.

With Mitigant Cloud Immunity, different phases of these activities are automatically and seamlessly handled, e.g., documentation of hypothesis, clean-up after attacks, and evidence gathering. Don’t hesitate to sign up for your free trial today and kick-start your journey to operating a cyber-resilient cloud infrastructure. Sign up here - https://mitigant.io/sign-up

‍
