UK Home Office: Building a customer cost analytics service for a self-managed Kubernetes environment

As a specialist AWS partner, Cloudscaler recognises that not all organisations see the expected cost savings after moving to the cloud.

The good news is that by using best practice and AWS cost optimisation techniques, we have been able to reduce the annual spend on some of our enterprise platforms by over 40%. Below, our technical lead, Andy Astley, talks to Stas Vonholsky as part of the AWS "This Is My Architecture" series about how we achieved consistent AWS savings.

Stas: Hi, my name is Stas and welcome to another episode of This Is My Architecture. Today I’m joined by Andy from the UK Home Office. Hi Andy, how are you?

Andy: Hi there, thanks for having me. 

S: Thanks for coming. 

A: I’m here today to talk to you about one specific piece of our shared platform. 

We run a shared platform specifically for the immigration area of the Home Office. We have a set of portfolio projects within the immigration sphere, about 20 to 25 different projects. 

We run a whole set of services on that platform, but I’m here today to talk to you about the AWS cost optimisation work we’ve been doing. 

S: That’s really cool. We can just dive into the architecture, just to understand how it works a bit. I notice you have a Kubernetes cluster over there. Is that part of the shared platform? Do you offer it to users as a service?

A: Yes, we have a whole set of projects that build their business services in Docker. The platform team runs a Kubernetes cluster. At the moment it’s EC2-based; we specifically use a COTS product. In the future we might look at EKS and Fargate, but at the moment it gives us a bit of a challenge in terms of cost.

We have the Cost and Usage Report from Amazon, and within that report, through tagging, we get costs for standard Amazon resources such as EC2 and RDS. But the real challenge is that we can’t see who owns the pods within the Kubernetes cluster. Really, that’s what this top part of the architecture is all about.

S: Got it. So if I understand correctly, you have CloudWatch invoking Lambda functions. These Lambda functions hook into the Kubernetes environment, fetch information about utilisation, and then drop it into an S3 bucket, where it’s essentially ready to be processed alongside the standard, let’s call it, detailed usage reporting?

A: Exactly. So specifically, every hour, we have a schedule within CloudWatch that triggers a Lambda. We have a Python script within that Lambda that calls into the Kubernetes API across our different clusters.

First, we enforce tagging when our users deploy pods: we have an environment tag and a cost sector tag against each pod. That information is reported by the Lambda and effectively fed into the same S3 bucket as our Cost and Usage Report.
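To make that concrete, an hourly collector along these lines might look like the sketch below. It is a minimal illustration rather than the actual Home Office code: the cluster endpoint, token handling, label keys (`environment`, `cost-sector`) and bucket name are all assumptions.

```python
"""Hourly Lambda: collect pod ownership data from Kubernetes and land it in S3.

A minimal sketch, not the production code: cluster endpoint, token handling,
label keys and bucket name are illustrative assumptions.
"""
import json
import os
from datetime import datetime, timezone

import boto3
from kubernetes import client


def handler(event, context):
    # Authenticate to the cluster with a service-account token (assumed to be
    # supplied via environment variables or a secrets store in practice).
    configuration = client.Configuration()
    configuration.host = os.environ["K8S_API_URL"]
    configuration.api_key = {"authorization": "Bearer " + os.environ["K8S_TOKEN"]}
    api = client.CoreV1Api(client.ApiClient(configuration))

    records = []
    for pod in api.list_pod_for_all_namespaces().items:
        labels = pod.metadata.labels or {}
        records.append({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "cluster": os.environ.get("CLUSTER_NAME", "unknown"),
            "namespace": pod.metadata.namespace,
            "pod": pod.metadata.name,
            "node": pod.spec.node_name,
            # Tags enforced at deployment time; these label keys are illustrative.
            "environment": labels.get("environment", "untagged"),
            "cost_sector": labels.get("cost-sector", "untagged"),
        })

    # Drop the snapshot alongside the Cost and Usage Report data in S3.
    boto3.client("s3").put_object(
        Bucket=os.environ["REPORT_BUCKET"],
        Key=f"kubernetes-usage/{datetime.now(timezone.utc):%Y/%m/%d/%H}.json",
        Body=json.dumps(records).encode("utf-8"),
    )
    return {"pods": len(records)}
```

Each run drops a timestamped snapshot of pod ownership next to the Cost and Usage Report data, ready for the ETL stage described next.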

S: You collect the data, so you have, let’s call it, semi-structured raw data sitting here, and I noticed that the next step is ETL. Can you share a bit more about how you process the data and the framework you use?

A: Sure. We’re quite heavy users of Python, so we have Python scripts in here. We’re running them on EC2, and we’re looking at changing that in future – maybe use Glue, which is essentially a managed ETL service under the covers.

This takes the raw data from the Cost and Usage Report from Amazon, throws away the bits that we don’t need, and combines it with the data from Kubernetes.

So essentially what we’re doing is removing the cost of the Kubernetes compute nodes and replacing it with a breakdown of who owns the pods on those compute nodes and which environments those pods are in. After the ETL processing we get another raw data report in S3, combining Amazon resources and Kubernetes pods, ready for us to start the reporting process in the bottom right.
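As an illustration of that reallocation step, the sketch below uses pandas to strip out the Kubernetes worker-node line items and split each node’s cost across the pods that ran on it. The column names, the node-identifying tag, the even split and the assumption that node names match resource IDs are simplifications, not the production logic.

```python
"""ETL sketch: replace Kubernetes node line items in the Cost and Usage Report
with a per-pod breakdown. Column names, the even cost split and file layout
are simplifying assumptions, not the production logic."""
import pandas as pd


def combine(cur_path: str, pods_path: str, out_path: str) -> None:
    cur = pd.read_csv(cur_path)     # raw Cost and Usage Report extract
    pods = pd.read_json(pods_path)  # hourly pod snapshots from the Lambda

    # Keep only the columns we need, and separate the Kubernetes worker nodes
    # (identified here by an assumed resource tag) from everything else.
    cur = cur[["lineItem/ResourceId", "lineItem/UsageStartDate",
               "lineItem/ProductCode", "lineItem/UnblendedCost",
               "resourceTags/user:Role"]]
    is_node = cur["resourceTags/user:Role"] == "k8s-worker"
    nodes, other = cur[is_node], cur[~is_node]

    # Split each node's cost evenly across the pods running on it, carrying
    # the pod's environment and cost-sector tags onto the cost line.
    # Assumes the pod snapshot's node name matches the CUR resource ID.
    pods_per_node = pods.groupby("node").size().rename("pod_count").reset_index()
    allocated = (
        pods.merge(nodes, left_on="node", right_on="lineItem/ResourceId")
            .merge(pods_per_node, on="node")
    )
    allocated["allocated_cost"] = (
        allocated["lineItem/UnblendedCost"] / allocated["pod_count"]
    )

    # Write the combined report out, ready for the Glue crawler and Athena.
    combined = pd.concat([
        other.rename(columns={"lineItem/UnblendedCost": "allocated_cost"}),
        allocated[["pod", "environment", "cost_sector",
                   "lineItem/ProductCode", "allocated_cost"]],
    ])
    combined.to_csv(out_path, index=False)
```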

S: Got it. So it looks like a very elegant ETL in Python – we have the data coming in, some processing, S3 again – and I’m guessing that because you’re using Athena to query the data in S3, you essentially need a metastore, a data catalogue, to hold the schema and the location of the data.

A: Exactly. We run a Glue crawler over the raw data in S3 to catalogue it into a format that allows QuickSight to run Athena queries against it.
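A minimal boto3 sketch of that cataloguing step might look like this – the crawler, role, database and bucket names are illustrative placeholders:

```python
"""Catalogue the combined cost data so Athena and QuickSight can query it.
A minimal boto3 sketch; bucket, database, role and crawler names are
illustrative placeholders."""
import boto3

glue = boto3.client("glue")

# One-off setup: a crawler that infers the schema of the combined report
# and registers a table in the Glue Data Catalog.
glue.create_crawler(
    Name="cost-analytics-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="cost_analytics",
    Targets={"S3Targets": [{"Path": "s3://example-cost-bucket/combined-report/"}]},
    SchemaChangePolicy={"UpdateBehavior": "UPDATE_IN_DATABASE",
                        "DeleteBehavior": "LOG"},
)

# Run after each ETL cycle so new partitions and columns are picked up.
glue.start_crawler(Name="cost-analytics-crawler")
```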

S: That’s really cool. So if this is a kind of cost analytics service, what does the rest of the world look like – who actually uses it?

A: This is one of the challenges we have: because we’re a shared platform, it’s not just about managing our own costs. As a platform team we’re hosting budgets we can’t directly control – we’ve empowered a bunch of delivery teams to build on our platform.

Really, this kind of dashboard here is there to give the individual teams visibility of the breakdown of their services from a cost perspective. So we allow a user here, sitting over with the project team, to go into QuickSight – pulling the raw data through Athena allows them to get a breakdown by Amazon account, Amazon service, environment and team, based on the tags we apply to Kubernetes pods and Amazon resources.
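Behind a dashboard like that, QuickSight effectively runs Athena queries over the combined data. The sketch below shows the kind of query involved and how it could be started with boto3; the table and column names are assumptions based on the description above.

```python
"""Run an Athena query like the one behind the QuickSight cost breakdown.
Table, column and bucket names are illustrative assumptions."""
import boto3

# Hypothetical breakdown: spend per account, service, environment and team.
QUERY = """
SELECT account_id,
       product_code        AS aws_service,
       environment,
       cost_sector         AS team,
       SUM(allocated_cost) AS total_cost
FROM cost_analytics.combined_report
GROUP BY account_id, product_code, environment, cost_sector
ORDER BY total_cost DESC
"""

athena = boto3.client("athena")
execution = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "cost_analytics"},
    ResultConfiguration={"OutputLocation": "s3://example-cost-bucket/athena-results/"},
)
print("Query started:", execution["QueryExecutionId"])
```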

S: Nice, so what is next? 

A: Just by using the reporting on its own, we’ve found that it has driven some behavioural changes. What we’ve also done is run a whole series of education sessions with our project teams, to teach them how to build to cost optimisation standards. We’ve got a whole set of techniques – things like right-sizing of storage and compute, the use of Spot, reserved instances and scheduling – and that has resulted in a number of behavioural changes.
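As a flavour of the scheduling technique mentioned there, the sketch below stops non-production instances that have opted in via a tag, triggered out of hours by a CloudWatch/EventBridge schedule. The tag key and value are illustrative assumptions, not the Home Office convention.

```python
"""Scheduling sketch: stop tagged non-production instances out of hours.
Triggered by a CloudWatch/EventBridge schedule; the tag key and value are
illustrative assumptions."""
import boto3


def handler(event, context):
    ec2 = boto3.client("ec2")
    # Find running instances that have opted in to out-of-hours shutdown.
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:Schedule", "Values": ["office-hours"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]

    instance_ids = [
        instance["InstanceId"]
        for reservation in reservations
        for instance in reservation["Instances"]
    ]
    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)
    return {"stopped": instance_ids}
```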

But what I want to show you really is the cost-efficiency rating report. 

S: Just to understand, let’s imagine I’m a team leader, I need infrastructure and I use reserved instances. Where do I land on this efficiency rating scale?

A: Reserved instances are weighted and ranked, and given a score. But you have to use a number of different cost optimisation techniques to get the highest score possible for that service. 

This efficiency rating is similar to something used to rate white goods. What we’re doing is taking all of the different techniques and for each service, based on weighting those different techniques, we generate a score today and a potential score for the future – what you could achieve if you were as optimal as possible. 
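The rating model itself isn’t shown in the interview, but the idea of weighting techniques into a current score and a potential score could be sketched like this – the techniques, weights and letter bands are illustrative assumptions:

```python
"""Efficiency-rating sketch: weight each cost-optimisation technique and
score a service against it. Techniques, weights and letter bands are
illustrative assumptions, not the actual report's model."""

# Relative importance of each technique (assumed weights, summing to 1).
WEIGHTS = {
    "right_sizing": 0.3,
    "spot_usage": 0.2,
    "reserved_instances": 0.3,
    "scheduling": 0.2,
}

# Score thresholds mapped to letter bands, like a white-goods energy label.
BANDS = [(0.9, "A"), (0.75, "B"), (0.6, "C"), (0.4, "D"), (0.0, "E")]


def efficiency_rating(adoption: dict) -> tuple[float, str]:
    """adoption maps technique -> 0..1 (how fully the service applies it)."""
    score = sum(WEIGHTS[t] * adoption.get(t, 0.0) for t in WEIGHTS)
    band = next(letter for threshold, letter in BANDS if score >= threshold)
    return score, band


# Current score for a service today, and its potential if fully optimised.
current = efficiency_rating({"right_sizing": 0.8, "reserved_instances": 0.5})
potential = efficiency_rating({technique: 1.0 for technique in WEIGHTS})
print("today:", current, "potential:", potential)
```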

This AWS efficiency rating combined with the cost dashboard that we saw are really important for us and it’s really driving down costs. 

S: Sweet. So Andy this was really interesting, thank you so much for sharing. 

A: You’re welcome. 

Hopefully, that provided some insight into how you can use AWS best practice and optimisation techniques to gain greater visibility of your AWS spend and ultimately reduce your costs. If you would like to know more, or if you would like Cloudscaler to help you optimise your AWS spend, then please connect with us or find out more about our AWS Services & Management.