Demystifying Security Chaos Engineering - Part I
We are witnessing an upsurge of high-profile attacks in recent times, and the attacks that impacted prominent companies are the most appalling! One of the most chilling facts about these attacks is the success rate against security controls considered robust, e.g., Multi-Factor Authentication. Clearly, cybercriminals are outpacing modern cyber security mechanisms, and novel approaches are imperative to address these concerns.
Consequently, two schools of thought have emerged: one believes the industry needs to evolve more security approaches, and the other argues for cyber-resilient mechanisms. This blog post supports the latter, given the potential of Security Chaos Engineering to enable cyber resilience.
Note: This is the first part of a two-part series. Please subscribe to be informed when Part II is published. Also, this blog is based on a talk presented at the Cyber Security & Cloud Expo (Europe) 2022.
Chaos Engineering - The Origins
The origins of chaos engineering can be traced to Netflix's migration from an on-premises data center to cloud infrastructure - Amazon Web Services (AWS). During Netflix's early days in the cloud, workloads were primarily deployed on EC2 instances (the state of the art in cloud compute back then). Strangely though, EC2 instances would shut down without warning. As you can imagine, the impact was unacceptable, as this behavior introduced serious availability issues. Netflix customers could be cut off for a while and probably reconnected afterward; the immediate impact was a bad customer experience.
Netflix Chaos Monkey
Unfortunately for Netflix, AWS had no solution to these failures, so innovating a solution was imperative! Enter chaos engineering: the basic idea was to evolve systems that could tolerate the menace of unpredictably dying EC2 instances. Consequently, Netflix implemented Chaos Monkey, a system that would automatically and intentionally inject availability failures. The main job of Chaos Monkey was to randomly kill EC2 instances and other services, deliberately causing the very failures that otherwise occurred unpredictably.
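To make the idea concrete, here is a minimal, purely in-memory sketch of random instance termination. It is not Netflix's implementation and it calls no AWS API; the `Instance`, `Fleet`, and `chaos_monkey` names are hypothetical and exist only for illustration.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Instance:
    """Stand-in for a compute instance (hypothetical, not an AWS object)."""
    instance_id: str
    running: bool = True


@dataclass
class Fleet:
    """Toy service that stays available while at least one instance runs."""
    instances: list = field(default_factory=list)

    def is_available(self) -> bool:
        return any(i.running for i in self.instances)


def chaos_monkey(fleet: Fleet, rng: random.Random) -> Instance:
    """Terminate one randomly chosen running instance and return it."""
    candidates = [i for i in fleet.instances if i.running]
    if not candidates:
        raise ValueError("no running instances left to terminate")
    victim = rng.choice(candidates)
    victim.running = False
    return victim


rng = random.Random(42)  # seeded so the experiment is repeatable
fleet = Fleet([Instance(f"i-{n}") for n in range(3)])
victim = chaos_monkey(fleet, rng)
print(f"killed {victim.instance_id}; service available: {fleet.is_available()}")
```

With three instances the toy service survives losing one. With a single instance, the same experiment would expose the availability gap immediately, which is exactly the kind of lesson the failure injection is meant to surface.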
Crazy as it sounds, Chaos Monkey performed remarkably well; the engineering teams evolved by implementing systems that survived dying EC2 instances. Due to the success of Chaos Monkey, Netflix developed other tools based on the Principles of Chaos Engineering; this set of tools was known as the Netflix Simian Army. By continually deploying these tools and adopting a learning-from-failure mindset, Netflix has survived outages that took out entire AWS regions, e.g., the US East 1 outage.
Growing Adoption of Chaos Engineering
The success of Netflix's Simian Army popularized chaos engineering and has encouraged its adoption. Today, several open-source projects and commercial products offer relatively easy-to-use chaos engineering capabilities. Similarly, most cloud service providers offer chaos engineering services: AWS Fault Injection Simulator, AWS Resilience Hub, and Azure Chaos Studio. These tools and services focus on leveraging chaos engineering to prevent availability failures. Unfortunately, the security industry is yet to jump on this bandwagon despite the unique benefits of applying chaos engineering principles to cyber security.
Cloud Native Security Landscape
Primarily, Chaos Monkey enabled availability resilience for Netflix, i.e., their infrastructure is resilient against availability failures. Interestingly, availability is one of the three key attributes of a secure system; the other two are confidentiality and integrity, and together they are known as the CIA triad. Essentially, every security control aims to prevent violations of one or more attributes of the CIA triad. However, current cloud-native security mechanisms struggle to achieve this aim, and the reasons are not far-fetched. Here are some of our thoughts on why attacks are still successful regardless of the evolving cloud-native security mechanisms:
Complexity: The Enemy of Security
Cloud-native infrastructure enables several advantages, including scalability, elasticity, and (perceived) cost savings; however, these advantages come with inherited complexity. The complexity results from the multiple abstraction layers that underpin cloud-native infrastructure. Bruce Schneier asserted that “complexity is the greatest enemy of security,” and we are experiencing exactly that: the effect of complex systems on security objectives. Complex systems are hard to understand, particularly through the lens of cyber security, and the efficiency of any security architecture depends on how deeply defenders understand the system. Furthermore, better insights into the workings of any system ultimately facilitate creative tooling support and innovative deployment of security controls when standard approaches are limited.
Dynamic Security Posture
Cloud infrastructure allows for agility, empowering teams to continuously deploy infrastructure to meet market demands while gaining competitive advantages. This directly increases productivity and paves the path for practicing modern techniques, e.g., DevOps and GitOps. However, each cloud infrastructure change potentially introduces security issues, e.g., security misconfiguration. Hence, these changes make it harder to maintain a consistent security posture, and this is a challenge! CISOs and other security leaders want to have an educated perception of their infrastructure's security posture; unfortunately, this is hard to achieve due to ephemeral cloud-native infrastructure.
Misconfigurations - Root Cause of Cloud Attacks
Misconfigured cloud assets remain one of the most prevalent causes of cloud breaches. Gartner predicted that, through 2025, 99% of cloud attacks would be directly caused by misconfigurations. It is important to note that this prediction covers all cloud resources, including the cloud security resources. So regardless of how efficient a cloud security mechanism might be, its effectiveness is eroded if it is not well configured. Furthermore, misconfigurations are introduced from various sources, including deployments and routine maintenance.
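One simple way to picture how a misconfiguration slips in is to compare the current state of a resource against a known-good baseline. The sketch below is an illustration only: the setting names are hypothetical and do not correspond to any particular cloud provider's API.

```python
def find_drift(baseline: dict, current: dict) -> dict:
    """Return {setting: (expected, actual)} for every setting that deviates."""
    drift = {}
    for key, expected in baseline.items():
        actual = current.get(key)  # None if the setting is missing entirely
        if actual != expected:
            drift[key] = (expected, actual)
    return drift


# Hypothetical settings for a storage resource and its security controls.
baseline = {
    "public_access_blocked": True,
    "logging_enabled": True,
    "mfa_required": True,
}
# State after a routine maintenance change flipped one setting.
current = {
    "public_access_blocked": False,
    "logging_enabled": True,
    "mfa_required": True,
}

drift = find_drift(baseline, current)
for setting, (expected, actual) in drift.items():
    print(f"{setting}: expected {expected}, found {actual}")
```

Note that the check applies equally to security resources themselves: if the drifted setting belongs to a security control, the control keeps running but no longer protects anything.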
Security Silos - Introduce Blindspots
The cloud operating model builds on multiple layers of abstraction. Accordingly, security mechanisms are designed to align with these abstraction layers to achieve a `Defense-in-Depth` model. This model, also referred to as the 4Cs of cloud-native security, proposes the positioning of security systems at the four abstraction layers: code, container, cluster, and cloud. While this model has multiple advantages and provides security to a large extent, it fails to address multi-layered attacks. This failure results from a siloed security architecture, mainly when the cloud-native security systems deployed at the various abstraction layers operate independently, i.e., without synergizing. The impact is that attacks that transpire across two or more abstraction layers are not easily detected. Ultimately, end users risk having a false sense of security, a situation where all seems normal and secure until an attack becomes successful, aka security theater.
Security Chaos Engineering
It is becoming increasingly clear that security in cloud-native infrastructure is more about resilience than “just” security. Despite the huge number of cloud-native security products appearing daily, breaches still occur!
Firefighting Versus Fire Resilience
The immense breadth of the cloud-native attack surface and the many possibilities for attacks require a shift of mindset from `firefighting` to becoming resilient against fires (fire resilience), as rightly asserted by Dino Dai Zovi. Cloud-native security ought to strike a balance between keeping attackers out and fighting/resisting attacks. This calls for a shift from attack prevention to an `Assume Breach` mindset. Werner Vogels, CTO of Amazon, declared: “Failures are a given, and everything will eventually fail over time.” Similarly, security failures are inevitable in the cloud. It is, therefore, imperative to shift focus to attack detection, recovery, and resistance.
Assume Breach Mindset
Security Chaos Engineering is an enabler for confidently adopting an assume breach mindset. Similar to how Chaos Engineering has enabled resilience against availability failures, Security Chaos Engineering enables resilience against confidentiality and integrity failures, in addition to availability failures. The same principles of chaos engineering apply, though adapted to fit the desired security objectives. By injecting security failures into cloud-native infrastructure, the actual behavior of security controls becomes apparent through observation. These observations result in empirical and tangible knowledge, which can be leveraged for proactive and iterative security hardening.
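The inject-and-observe loop described above can be sketched in a few lines. This is a toy, in-memory illustration under stated assumptions, not Mitigant's platform or any real cloud API: `Bucket`, `Detector`, and `run_experiment` are hypothetical names, and the injected failure is a storage bucket being made public.

```python
from dataclasses import dataclass, field


@dataclass
class Bucket:
    """Stand-in for a storage bucket (hypothetical)."""
    name: str
    public: bool = False


@dataclass
class Detector:
    """Toy security control: flags public buckets and optionally reverts them."""
    auto_remediate: bool = True
    alerts: list = field(default_factory=list)

    def scan(self, bucket: Bucket) -> None:
        if bucket.public:
            self.alerts.append(f"public bucket: {bucket.name}")
            if self.auto_remediate:
                bucket.public = False


def run_experiment(bucket: Bucket, detector: Detector) -> dict:
    """Inject a security failure, let the control react, record observations."""
    bucket.public = True        # inject the failure
    detector.scan(bucket)       # give the control a chance to react
    observations = {
        "detected": len(detector.alerts) > 0,
        "recovered": not bucket.public,
    }
    bucket.public = False       # roll back so the experiment leaves no residue
    return observations


# Hypothesis: a publicly exposed bucket is detected AND automatically reverted.
result = run_experiment(Bucket("demo"), Detector(auto_remediate=False))
print(result)
```

In this run the control detects the exposure but does not recover from it, so the hypothesis is refuted. That gap is the empirical knowledge the experiment is after: it points directly at what to harden before repeating the experiment.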
Mitigant’s Security Chaos Engineering Platform
Implementing Security Chaos Engineering from the ground up could be an overwhelming task for most enterprises. The technical know-how is scarce, and the time and effort required are barely affordable for most enterprises. Given the knowledge gained from an academic research background and industry experience, Mitigant's founders are well positioned to commoditize Security Chaos Engineering.
We want a future where every enterprise can leverage Security Chaos Engineering to become resilient to cloud attacks. Hence, we have built a SaaS offering that allows for easy adoption, drastically reducing the knowledge and skills otherwise required. For some practical use cases, read Part I and Part II of our blog posts on defeating ransomware with Security Chaos Engineering. The second part of this blog article will provide more interesting insights on Security Chaos Engineering, so subscribe to our blog to stay informed.
Co-Founder & CTO, Mitigant. | Contributing Author - O'Reilly Security Chaos Engineering Book. | AWS Community Builder