Cloud Governance and the case for Cloud Custodian

Cloud Governance - What is it?

“the ability to provide strategic direction, track performance, allocate resources, and make adjustments to ensure that organizational objectives are met, without breaching the parameters of risk tolerance or compliance obligations” - Source: Adapt Your Governance Framework For The Cloud, Forrester Research, Inc

I first came across the term cloud governance when researching best practices for configuring public cloud environments. I’m sure this term has been discussed a lot in different communities across the tech landscape, but coming from a traditional infrastructure background, I found that best practices in the cloud were a slight shift from what I was familiar with. After some time, I began to see a clear distinction in what cloud governance actually entailed and how it was broken down.

Governance vs Management

We can think of Cloud Governance as a set of standards that we define for our environments to adhere to. These standards are based on organizational objectives and industry best practices. We can think of Cloud Management as the processes, patterns and tooling we implement to achieve cloud governance.

Why do we need it?

The idea of cloud environments is to give some level of control to development teams, as opposed to a small team of administrators or operators. Decisions are made in a decentralized manner. Without appropriate management in place, this can result in misconfigurations affecting areas like billing and security, as well as the efficient delivery of end products. The speed at which we can provision and configure resources becomes a key metric in measuring our success in fast-paced environments. However, it is vital that we do not sacrifice proper configuration in favour of speed. Having appropriate governance policies and management in place prevents anyone from taking shortcuts that may cause problems down the line.

How do we achieve it?

There are some simple quick wins we can implement to immediately gain some level of control and accountability in our cloud environments. A few examples include:

Separate environments for Production, Staging, Test, etc.
Centralized authentication with a single identity provider, using role-based access.
Embracing the principle of least privilege.
Proper tagging conventions and tag enforcement. (Tags can be used to drive automation and to gain deeper insight into billing.)

These options are probably the most common you will see and, in my opinion, set a good foundation.

Immutable Infrastructure & Infrastructure as Code

“Cattle, not pets”

This familiar saying carries a lot of weight in today’s public cloud. Immutable infrastructure means no more in-place changes on live infrastructure: resources don’t get modified after deployment. If a change needs to be made, the existing resources get destroyed and redeployed with any updates.

When done properly, this eliminates the chance of configuration drift and increases the reliability, simplicity and predictability of infrastructure provisioning. Treating our infrastructure like any other code allows us to version it, peer review it, and gain a better level of accountability for the resources that get provisioned.

Having all our resources defined in an IaC tool such as Terraform is the end goal, but we still need to enforce our governance policies and have some way to automate some of the administrative and cleanup tasks that come with running multiple cloud environments…

Enter Cloud Custodian…

https://github.com/cloud-custodian/cloud-custodian

Cloud Custodian is a rules engine for cloud environments that allows us to define our policies in a familiar format (YAML) and enforce them automatically, either in reaction to changes in the environment or on a recurring schedule.

When using an event-driven mode such as cloudtrail, for example, Custodian will automatically provision a Lambda function that is invoked on the relevant CloudTrail events. The function applies the policy’s filters to the affected resources and runs its actions on anything that matches. Actions can be simple alerts sent to email or Slack, or they can trigger specific reactions to quickly deal with policy violations.
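To illustrate the "recurring schedule" side, here is a minimal sketch of a policy that runs once a day and ties back to the tag enforcement quick win from earlier. The policy name, the required tag keys (Owner, CostCenter), the marker tag c7n_tag_compliance and the role ARN are all assumptions made for illustration, not values from a real environment.

```yaml
policies:
  - name: ec2-tag-compliance-mark        # hypothetical policy name
    resource: ec2
    description: |
      Once a day, find instances that are missing required tags and
      mark them for a stop operation in 4 days.
    mode:
      type: periodic                     # run on a schedule rather than on events
      schedule: "rate(1 day)"
      role: "arn:aws:iam::<account_id>:role/<custodian_role>"   # placeholder
    filters:
      - "tag:c7n_tag_compliance": absent # skip instances that are already marked
      - or:
          - "tag:Owner": absent          # assumed required tag
          - "tag:CostCenter": absent     # assumed required tag
    actions:
      - type: mark-for-op
        tag: c7n_tag_compliance          # assumed marker tag name
        op: stop
        days: 4
```

Note that mark-for-op only tags the instance. A companion policy using the marked-for-op filter would be needed to actually carry out the stop once the grace period has passed.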

Using Cloud Custodian in tandem with an IaC tool like Terraform provides us with a layered approach to Cloud Management. It also allows us to schedule recurring tasks, which helps with cost control, for example by stopping resources that are not needed outside working hours.
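As a rough sketch of how scheduling can help with billing, the pair of policies below uses Custodian’s offhour and onhour filters to stop EC2 instances in the evening and start them again in the morning. The policy names, the tag name custodian_downtime, the time zone, the hours and the role ARN are assumptions chosen for illustration.

```yaml
policies:
  - name: offhours-stop-ec2              # hypothetical policy name
    resource: ec2
    mode:
      type: periodic
      schedule: "rate(1 hour)"           # checked hourly so the off hour is not missed
      role: "arn:aws:iam::<account_id>:role/<custodian_role>"   # placeholder
    filters:
      - type: offhour
        tag: custodian_downtime          # assumed opt-in tag name
        default_tz: et
        offhour: 19                      # stop at 19:00
    actions:
      - stop

  - name: onhours-start-ec2              # hypothetical policy name
    resource: ec2
    mode:
      type: periodic
      schedule: "rate(1 hour)"
      role: "arn:aws:iam::<account_id>:role/<custodian_role>"   # placeholder
    filters:
      - type: onhour
        tag: custodian_downtime
        default_tz: et
        onhour: 7                        # start at 07:00
    actions:
      - start
```

By default these filters are opt-in, so only instances carrying the tag are considered. That keeps the decision with the team that owns the resource.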

The anatomy of a policy

Below is an example of a policy I currently use across multiple environments to actively remove insecure rules from security groups, and alert an appropriate Slack channel.

There are three key components to the policy: the mode, the filters and the actions. In the example below, we specify the mode as type: cloudtrail, watching for the specific events AuthorizeSecurityGroupIngress and RevokeSecurityGroupIngress.

The policy starts by filtering on tags. If a security group has a tag with the key c7n_sg_public_access_exempt, Cloud Custodian will ignore it. It then checks whether the security group belongs to a particular VPC. If it does, Cloud Custodian will also ignore it. The final filter checks for an ingress rule with a source CIDR of 0.0.0.0/0 or the IPv6 equivalent ::/0.

If the conditions in the filters are met, the actions are triggered on any matching resources. In this example there are two actions: the first is a remove-permissions action on any matched ingress rule, and the second is a notify action, which uses a predefined template and sends the details to an SQS queue that is used as a relay to a Slack webhook.

There is a lot of underlying configuration needed to get this policy running in multiple accounts. The detail of the setup is well documented in the official repo.

policies:
  - name: high-risk-security-groups-remediate
    resource: security-group
    description: |
      Remove any rule from a security group that allows 0.0.0.0/0 or ::/0 (IPv6) ingress
      and notify the user who added the violating rule.
    # Mode: react to CloudTrail events for ingress rule changes
    mode:
      type: cloudtrail
      events:
        - source: ec2.amazonaws.com
          event: AuthorizeSecurityGroupIngress
          ids: "requestParameters.groupId"
        - source: ec2.amazonaws.com
          event: RevokeSecurityGroupIngress
          ids: "requestParameters.groupId"
    filters:
      # Skip groups that are explicitly tagged as exempt
      - "tag:c7n_sg_public_access_exempt": absent
      # Skip groups that belong to the excluded VPC
      - not:
          - type: value
            key: VpcId
            value: <vpc_id>
      # Match ingress rules open to the whole internet (IPv4 or IPv6)
      - or:
          - type: ingress
            Cidr:
              value: "0.0.0.0/0"
          - type: ingress
            CidrV6:
              value: "::/0"
    actions:
      # Remove only the rules matched by the ingress filters
      - type: remove-permissions
        ingress: matched
      # Queue a notification on SQS for relay to Slack
      - type: notify
        slack_template: sg-rule-violation
        priority_header: '2'
        subject: Security Group Rule Violation
        to:
          - slack://#cloud-custodian
        transport:
          type: sqs
          queue: <sqs_url>
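Because the policy above only reacts to new CloudTrail events, it will not pick up open rules that already existed before it was deployed. One way to find those, sketched below under the assumption that a read-only report is wanted first, is a variant with no mode and no actions. The policy name is hypothetical, and the VPC exclusion is left out to keep the sketch free of placeholders.

```yaml
policies:
  - name: high-risk-security-groups-audit   # hypothetical policy name
    resource: security-group
    description: |
      Report security groups that allow ingress from 0.0.0.0/0 or ::/0.
      No mode is set, so this runs only when Custodian is invoked directly,
      and no actions are defined, so nothing is modified.
    filters:
      - "tag:c7n_sg_public_access_exempt": absent
      - or:
          - type: ingress
            Cidr:
              value: "0.0.0.0/0"
          - type: ingress
            CidrV6:
              value: "::/0"
```

Running a report like this before enabling remediation gives teams a chance to tag legitimate exceptions, which fits the balance between guardrails and usability discussed in the summary.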

Summary

As an engineer, I’ve seen many different environments and configurations. Some lacked security and proper guardrails, while others enforced such heavy restrictions that it was difficult to provision critical resources efficiently. In my opinion, it’s a matter of balance. Well-documented best practices are a good starting point, but governance policies should also be driven by the needs of the business, and constantly reviewed for further improvements.

As a member of a small operational team in a large Engineering department, my biggest hurdle was trying to fix all the misconfigurations in an environment without having the proper cloud governance and management frameworks in place. This meant constantly chasing my tail to ensure resources had been provisioned correctly. It wasn’t until I drew a line in the sand and began enforcing governance through tools like Cloud Custodian that I began to feel like I was regaining control and closing the gap on security holes and runaway billing issues.

I do feel that, although Cloud Custodian is a fantastic tool for tackling some of the issues that spawn from neglected cloud accounts, cultural shifts are just as important and go hand in hand with tooling. Implement sensible guardrails in your environment, but ensure that you are sharing knowledge across all your technical teams as to what the best practices are and why you are implementing them. THIS IS VITAL!

The aim of this post was to provide context as to why a tool like Cloud Custodian can be so valuable to engineers trying to get a good handle on single or even multiple cloud environments. I didn’t want to dig too deep into the configuration and deployment of Cloud Custodian and its policies, but I may cover those in another post.