Managing cloud drifts is one of the challenges faced by cloud-native enterprises. While it might be easy to orchestrate complex cloud infrastructure via a single command, keeping track of the resulting resources rapidly becomes a hurdle. Enterprises are also reluctant to implement gatekeeper structures, which often hinder agility. Yet unattended cloud drifts could introduce huge costs and security issues. This blog post gives an overview of cloud drifts and different drift management techniques.
Mind the Gap - How Wide Has Your Cloud Drifted
The cloud has made enterprises agile; infrastructure can be launched quickly with a few buttons or commands. In addition, features like autoscaling allow infrastructure to grow in response to traffic and other factors. However, while this might help a business achieve its objectives, it could bite back if proper measures are not implemented. The phenomenon where a cloud infrastructure moves away from its planned or perceived state is called cloud drift.
Drifts in cloud infrastructure occur when cloud resources (including configurations) move away from the desired state. Unfortunately, most enterprises feel the pain of cloud drift before implementing countermeasures. Moreover, implementing countermeasures is also a challenge, given that they could become hindrances to the agility of engineering and product teams.
Types of Cloud Drifts
Drifts can occur at different levels of a cloud infrastructure, mainly due to resource or configuration changes.
Cloud Resource Drifts
Every asset in the cloud is a resource; thus, the number of resources in a cloud infrastructure changes owing to several factors, both intentional and unintentional. Resource drifts occur when cloud resources are created, modified, or deleted.
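To make the three kinds of resource drift concrete, here is a minimal Python sketch that compares a desired inventory with an actual one. The resource IDs, attributes, and the classify_resource_drift helper are hypothetical and only serve as an illustration; real inventories would come from an IaC state or a cloud provider API.

```python
from typing import Any, Dict, List

Resource = Dict[str, Any]


def classify_resource_drift(
    desired: Dict[str, Resource], actual: Dict[str, Resource]
) -> Dict[str, List[str]]:
    """Group resource IDs by the kind of drift they show."""
    # Present in the cloud but never planned.
    created = sorted(set(actual) - set(desired))
    # Planned but no longer present in the cloud.
    deleted = sorted(set(desired) - set(actual))
    # Present on both sides, but with different attributes.
    modified = sorted(
        rid for rid in set(desired) & set(actual) if desired[rid] != actual[rid]
    )
    return {"created": created, "modified": modified, "deleted": deleted}


# Hypothetical inventories keyed by resource ID.
desired = {
    "vm-1": {"type": "vm", "size": "small"},
    "bucket-1": {"type": "bucket", "versioning": True},
}
actual = {
    "vm-1": {"type": "vm", "size": "large"},
    "vm-2": {"type": "vm", "size": "small"},
}

drift = classify_resource_drift(desired, actual)
print(drift)
# {'created': ['vm-2'], 'modified': ['vm-1'], 'deleted': ['bucket-1']}
```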
Cloud Configuration Drifts
Most cloud resources have some configuration that is critical for expressing desired behaviour, including security, performance, and availability. Therefore, changes to these configurations could have considerable implications, to varying degrees. For example, a change to an AWS S3 access policy could make the bucket publicly accessible.
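The S3 example can be sketched as a simple check on the policy document. The Python snippet below flags Allow statements whose principal is the wildcard "*" and compares a desired policy with a drifted one. It is a simplified illustration: the bucket name and policies are hypothetical, and the check deliberately ignores Condition elements and other access paths (such as ACLs), so it is not a complete public-access analysis.

```python
import json
from typing import Any, Dict, List


def public_statements(policy_json: str) -> List[Dict[str, Any]]:
    """Return the Allow statements whose principal is the wildcard '*'.

    Simplification: Condition elements are ignored, so a statement that is
    restricted by a condition would still be flagged.
    """
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    flagged = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        principal = stmt.get("Principal")
        if principal == "*":
            flagged.append(stmt)
        elif isinstance(principal, dict):
            aws = principal.get("AWS")
            values = aws if isinstance(aws, list) else [aws]
            if "*" in values:
                flagged.append(stmt)
    return flagged


# Hypothetical desired policy: only one account may read objects.
desired_policy = json.dumps({
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::111122223333:root"},
        "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::example-bucket/*",
    }],
})

# Hypothetical drifted policy: the principal was changed to a wildcard.
actual_policy = json.dumps({
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": "*",
        "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::example-bucket/*",
    }],
})

became_public = (
    not public_statements(desired_policy) and bool(public_statements(actual_policy))
)
print("bucket became public:", became_public)
```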
Implications of Cloud Drifts
Every cloud asset is essentially a resource, and most cloud resources are billed by the cloud providers via a pay-as-you-go (PAYG) model. Resources that are created or resized without being noticed therefore translate directly into unplanned costs. Hence, keeping track of how the cloud infrastructure evolves requires efficient mechanisms that identify when resources are orchestrated and determine whether they are correctly or wrongly deployed.
Change management has been traditionally considered a critical aspect of a robust security architecture. Though managing infrastructure changes is not a core security responsibility, it turns out that changes in the cloud environment are critical for identifying security events. However, like most events that might fire alerts, there is a considerable chance for false positives, thus adding to the complexity of drift management.
Drift Management Lifecycle
Most drift management mechanisms detect drifts by employing the reconciler pattern, a software engineering pattern that aims to address the issue of drifts. It achieves this by establishing two states: the desired state (also known as the expected state) and the actual state (also known as the real-world state). The desired state is defined at orchestration time via different methods, e.g., a DSL or IaC, and persisted in a data store (e.g., file-based, in-memory, object storage, or an RDBMS). Drifts are identified by comparing the desired state with the actual state to calculate the differences, i.e., the changes resulting from the creation, modification, or deletion of cloud resources and configurations. The reconciler pattern has four standard methods: getActual(), getExpected(), reconcile(), and destroy(). These methods are used for drift detection and resolution.
It is critical to understand that drift management is the broader umbrella that encapsulates different aspects related to cloud drifts: drift detection, drift analysis, and drift reconciliation. Let's examine these aspects briefly:
Drift detection involves steps to identify drifts and, most likely, inform a cloud administrator via a CLI or user interface. It is the most common form of drift management. However, it addresses only a small fraction of the implications and leaves other issues unaddressed.
Drift analysis goes beyond detecting drifts: it is sometimes critical to understand the root cause of drifts, as this might help proactively prevent future drifts or better understand a cloud system. Drift analysis often involves different mechanisms, including log event analysis. A more critical aspect that is yet to be commonly practiced is analyzing the security events that lead to drift, which might identify Indicators of Compromise.
Drift reconciliation (also called drift resolution) aims to reconcile the differences between the expected and actual states. This is one of the most challenging aspects of drift management, as it might involve the deletion or creation of resources.
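The four methods of the reconciler pattern can be sketched in Python as follows. This is an illustration under stated assumptions, not how any particular tool implements it: the method names are the snake_case equivalents of getActual(), getExpected(), reconcile(), and destroy(), and a plain dictionary stands in for the cloud environment so that the example runs without credentials. In a real system, get_actual() would query the cloud provider and reconcile() would call its APIs.

```python
import copy
from typing import Any, Dict

State = Dict[str, Dict[str, Any]]


class InMemoryReconciler:
    """Toy reconciler; a dict stands in for the real cloud environment."""

    def __init__(self, expected: State, cloud: State) -> None:
        self._expected = copy.deepcopy(expected)
        self._cloud = cloud  # mutated in place to simulate real changes

    def get_actual(self) -> State:
        """Return the real-world state (here: a copy of the dict)."""
        return copy.deepcopy(self._cloud)

    def get_expected(self) -> State:
        """Return the desired state persisted at orchestration time."""
        return copy.deepcopy(self._expected)

    def reconcile(self) -> State:
        """Change the environment until it matches the expected state."""
        actual = self.get_actual()
        expected = self.get_expected()
        # Remove resources that should not exist.
        for rid in set(actual) - set(expected):
            del self._cloud[rid]
        # Create missing resources and revert modified ones.
        for rid, config in expected.items():
            if actual.get(rid) != config:
                self._cloud[rid] = config
        return self.get_actual()

    def destroy(self) -> None:
        """Tear down every resource described by the expected state."""
        for rid in self.get_expected():
            self._cloud.pop(rid, None)


# Hypothetical states: vm-1 was resized, vm-9 is unplanned, db-1 is missing.
cloud = {"vm-1": {"size": "large"}, "vm-9": {"size": "small"}}
expected = {"vm-1": {"size": "small"}, "db-1": {"engine": "postgres"}}

reconciler = InMemoryReconciler(expected, cloud)
drifted = reconciler.get_actual() != reconciler.get_expected()
after = reconciler.reconcile()
print("drift detected:", drifted)
print("in sync after reconcile:", after == reconciler.get_expected())
```

Note how reconcile() both deletes and creates resources, which is exactly why reconciliation is the riskiest part of the lifecycle and is often gated behind a review step.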
Drift Management Techniques
Drift management techniques are divided into two main categories: static and dynamic drift management techniques. Let's examine them briefly.
Static Drift Management
The most popular drift management techniques are static and rely on IaC systems, e.g., Terraform. For example, the Terraform commands terraform refresh, terraform plan, and terraform apply aim to compute the drift and update the desired state (Terraform state files).
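As a minimal sketch of static drift detection, terraform plan can be wrapped in a small Python script. With the -detailed-exitcode flag, the command exits with 0 when the diff is empty, 1 on error, and 2 when the diff is not empty. The helper names below are hypothetical, and the script assumes the terraform binary is installed and the working directory has already been initialized. Keep in mind that a non-empty diff can also come from configuration changes that have not been applied yet, not only from drift.

```python
import subprocess
from enum import Enum


class PlanResult(Enum):
    NO_CHANGES = 0       # plan succeeded, diff is empty
    ERROR = 1            # plan failed
    CHANGES_PRESENT = 2  # plan succeeded, diff is not empty


def interpret_plan_exit_code(code: int) -> PlanResult:
    """Map the exit code of `terraform plan -detailed-exitcode` to a result."""
    try:
        return PlanResult(code)
    except ValueError:
        # Any unexpected exit code is treated as an error.
        return PlanResult.ERROR


def run_terraform_plan(workdir: str) -> PlanResult:
    """Run `terraform plan` in workdir and report whether changes are pending."""
    completed = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
        check=False,  # exit code 2 is a valid outcome, not a failure
    )
    return interpret_plan_exit_code(completed.returncode)
```

Calling run_terraform_plan on a directory that holds the Terraform configuration returns PlanResult.CHANGES_PRESENT when the plan is not empty, which a scheduled job could turn into a notification.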
Dynamic Drift Management
This approach establishes the desired state by directly enumerating cloud accounts at a specific point in time and persisting the result as a state. After that, the established state can be compared with the actual state to determine drifts. This approach is more comprehensive, given that some resources might not be provisioned via IaC and would hence be missed by IaC-based drift detection. Furthermore, reconciliation is more efficient since it is done directly by leveraging the cloud provider SDKs. Essentially, dynamic drift management leverages an asset management system or CMDB to maintain a detailed form of the expected state and allow for advanced operations such as CRUD and versioning. This approach is also known as Infrastructure-as-Software.
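The enumerate, persist, and compare cycle described above can be sketched as a tiny versioned snapshot store in Python. Everything here is an assumption made for illustration: the SnapshotStore class, the JSON-file layout, and the account dictionary, which stands in for the result of enumerating a cloud account through the provider SDK.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List

Inventory = Dict[str, Dict[str, Any]]


class SnapshotStore:
    """Persist inventories as numbered JSON files (1.json, 2.json, ...)."""

    def __init__(self, directory: Path) -> None:
        self.directory = directory
        self.directory.mkdir(parents=True, exist_ok=True)

    def _versions(self) -> List[int]:
        return sorted(int(p.stem) for p in self.directory.glob("*.json"))

    def save(self, inventory: Inventory) -> int:
        """Store a new snapshot and return its version number."""
        versions = self._versions()
        version = versions[-1] + 1 if versions else 1
        path = self.directory / f"{version}.json"
        path.write_text(json.dumps(inventory, sort_keys=True))
        return version

    def load(self, version: int) -> Inventory:
        return json.loads((self.directory / f"{version}.json").read_text())

    def latest(self) -> Inventory:
        return self.load(self._versions()[-1])


def changed_ids(expected: Inventory, actual: Inventory) -> List[str]:
    """IDs that were created, modified, or deleted since the snapshot."""
    return sorted(
        rid
        for rid in set(expected) | set(actual)
        if expected.get(rid) != actual.get(rid)
    )


# Stand-in for a cloud account; a real system would enumerate it via the SDK.
account: Inventory = {"sg-1": {"ingress": ["443"]}}

store = SnapshotStore(Path(tempfile.mkdtemp()))
v1 = store.save(account)  # establish the expected state

# Later, the account changes outside of any IaC workflow.
account["sg-1"]["ingress"].append("22")
account["vm-7"] = {"size": "small"}

print(changed_ids(store.latest(), account))
# ['sg-1', 'vm-7']
```

Because every snapshot keeps its own version, an accepted change can be recorded by saving a new snapshot, while older versions remain available for analysis.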
Static Versus Dynamic Drift Management
As with other computing mechanisms, there is a continuing discussion about the pros and cons of static and dynamic drift management techniques. A significant limitation of the static approach is that it only covers the infrastructure orchestrated via the established desired state and has no knowledge of infrastructure deployed via other means. For example, Terraform has no awareness of infrastructure deployed via AWS CDK, Pulumi, cloud APIs, or cloud web consoles. In contrast, dynamic drift management approaches scan the entire cloud infrastructure regardless of the orchestration source and can thereafter resolve the drifts. On the other hand, dynamic drift management systems are not easily implemented: they are traditional software applications and therefore require a much longer time to build. An alternative is to use managed dynamic drift systems, e.g., AWS Config.
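The coverage gap of static approaches can be illustrated with a few lines of Python. The sketch below assumes that the resource IDs tracked by each IaC tool and the IDs found by scanning the account are already available as sets; the tool names, IDs, and the unmanaged_resources helper are hypothetical.

```python
from typing import Dict, Set


def unmanaged_resources(
    tracked_by_tool: Dict[str, Set[str]], enumerated: Set[str]
) -> Set[str]:
    """Resources found in the account that no IaC tool knows about."""
    tracked: Set[str] = set().union(*tracked_by_tool.values())
    return enumerated - tracked


# Hypothetical IDs tracked per tool and IDs found by scanning the account.
tracked_by_tool = {
    "terraform": {"vm-1", "bucket-1"},
    "pulumi": {"db-1"},
}
enumerated = {"vm-1", "bucket-1", "db-1", "vm-console-1"}

# Invisible to Terraform alone: everything it did not deploy itself.
print(sorted(enumerated - tracked_by_tool["terraform"]))
# ['db-1', 'vm-console-1']

# Invisible to every IaC tool: only a full account scan finds it.
print(sorted(unmanaged_resources(tracked_by_tool, enumerated)))
# ['vm-console-1']
```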
Mitigant's Drift Management System
Mitigant's drift management system uses dynamic drift management approaches to allow for continuous detection, analysis, and resolution of drifts. Because it is a SaaS platform, engineering teams do not need to spend time building a system; instead, all the features come out of the box. For example, drifts in AWS infrastructure are automatically tracked, and notifications are sent to enable prompt response. In addition, Mitigant's drift analysis focuses on investigating the security events that might have led to drifts, which allows for proactive security countermeasures.
Co-Founder & CTO, Mitigant. | Contributing Author - O'Reilly Security Chaos Engineering Book. | AWS Community Builder