Your Company's Superpower: Platform Engineering

@kjuulh

2024-09-25

Platform Engineering basically has two angles in the industry, either the hype for it is overwhelming because it is the new hot thing on the block, or it is underestimated because it just looks like operations doing software engineering. I aim with this piece of content to define why you should care about Platform Engineering.

Platform Engineering defined

Platform Engineering to its namesake, is the practice of engineering a platform. Often as Software Engineers we forget what the Engineering in our titles actually mean. Engineering means building the right thing, reliably and securely. That we understand, but the platform is harder to define. When we think of a Platform for a company, it is the foundation that other users within the company utilize. An important detail here is that Platform Engineering is not just for developers, it can be for business users, data analysts, operations people, etc. It is basically a vehicle to make someone elses job easier within the company.

An example of a platform to be built could be: A specific individual or team, owning the deployment pipeline that ships software into production, the metrics solution that is used across the company, the tool analysts interface with when they want to query the data in the company.

These might sound mundane, and you may have a question, but hey, I could just setup github actions, and boom, my service is now going to production. You would be correct, Platform Engineering is taking a holistic view of the company's portfolio of tools, and make decision based on those. Often tooling starts grass roots, a developer is missing a feature, he/she goes and implements said feature, they now have a pipeline to build and deploy their code. But so does the other 15 teams in the company, and now you've got 15 bespoke solutions. Another step would be a DevOps department, being consultants to the different teams to have some homogeniety.

A Platform Engineering team would take those requirements; we need a build and deployment capability across the company, lets come up with as simple solution as possible to capture the largest amount of complexity. As such the team might end up building a complete build pipeline, but only for Golang Services, and as such keep the complexity around for Python Services or what have you. The Platform Engineering team would then treat the pipeline as a product, handle user feedback, user interviews, track data for adoption, make sure the right features are available, etc.

The end product is that you've got a dedicated team, which can capture the partial complexity for 15 teams, and if the product is good enough, it can be complete, i.e. the friction between the development teams and the platform teams is minimal. If you extrapolate this mindset, you can go from:

Here is the example of the scope of the complexity a normal organisation might have in software products, internal to the company, but without dedicated stewardship.

5 programming languages, 30 pipelines, 5 types of software libraries, 10 libraries pr language, 30 types of deployment, 3 clouds

5 + 30 + 5 * 10 + 30 + 3 = 83 products spread out over the organisation

Versus a stewarded Platform

1 programming language, 1 pipeline, 10 libraries, 1 deployment, 1 cloud

1 + 1 + 10 + 1 + 1 = 14 products dedicated to a specialized team

You might say, that it is inrealistic to succeed with a single programming language, or 1 pipeline, but it can be done. It can be a long journey to get there, and you may not want to go that far, if you've got enough people to maintain the software. But in general, the goal of Platform Engineering in such an organisation is to move out complexity of feature teams to let them focus on what they're best at; building features.

As is often the case, Platform Engineers are basically Software Engineering working in a specialized field, with specialized tooling, as such it is more approachable to tackle familiar problems, i.e. building out a deployment strategy for Kubernetes, AWS, whatever. But it can also be so much more. How do SQL analysts interface with their tools, what is slowing them down, do they achieve the quality they want from the products they rely on, are their workflows as effecient and tight as can be? Often this isn't the case, in the same vein as a software engineer hacking together a pipeline, and analyst might cook up a workflow that is borderline masochism. As Platform Engineers we've got the knowledge and tools to help shape some of the workflows and tools to fit the needs of our users.

Platform Engineering doesn't mean invented here

Platform Engineering can be taken to its extremes, where we basically build all the tools from scratch, define all workflows, templates by hand, and rely on a massive team to support said complexity. But it shouldn't be our first approach, one of the most interesting things about Platform Engineering is the creativity it invites. Hey, I need to build a build pipeline, what tools do I have available, and how can I turn this into a good product and abstraction for my users. Do I really want to provide a small layer of abstraction on top of github actions/jenkins etc. Or will we build a turnkey solution that basically builds our company's version of what a Golang services looks like.

Are we supposed to build the entire build pipeline software ourselves, or can we leverage either open-source offerings, or SaaS solutions and provide a small opinionated layer on top to make it a product internally. That is really the goal of Platform Engineering, to think creatively about problems, such that we can build the most reliable thing, with the lowest complexity, in the most secure manner.

Platform Engineering A Superpower

I gotta make up for the title, so how is Platform Engineering a superpower? Lets say a new security requirement comes down, that you now need to calculate software bills of materials across all of your software because you want to sell your services to an organisation that requires that level of security. You can basically measure your profecciency in Platform Engineering in how well you're able to execute cross cutting concerns.

If you had 30 pipelines, for 5 languages. You'd need to basically copy paste/modify whatever the same product to produce software bills of materials to each and every pipeline, that may run on different types of CI systems, have different level of compatibility with the language in question.

If you however had 1 language and 1 pipeline, across the entire fleet of services, you could basically build the feature in the afternoon, append it to the build pipeline, track all the builds and see that all artifacts required were produced. With Platform Engineering you've basically transformed a challenging multi month effort into an afternoon project.

For the first approach developer teams will also have to own the changes, because it is their pipelines, how difficult would it be to prioritise that across 30 teams? From experience there are always a handful that are absolutely strapped for time, so you probably wouldn't make it in time. The second approach is completely automated, they wouldn't even know their pipelines are producing bills of materials. The same could be said for signing artifacts, producing artifacts for other architectures, swapping out library internals and whatnot.

It can be extremely fulfilling work to build a project that can basically bootstrap a service from scratch to production in minutes, without any handoff to other teams, or requring complex manuals for actually being allowed to go in production. As a Platform Engineering team, we can offload a lot of requirements from teams, such they can focus on delivering value for our paying customers, and in turn make the organisation much more nimble, scaleable and so on.

Platform Engineering is a double edged sword

If your Platform Engineering team isn't able to build whole abstractions, which by the way are extremely difficult to build. They will have a lot of maintenance on their hands, if they aren't able to fulfill requirements from their customers, the developers, analysts and so on. You might end up with a munity on your hands. People simply going rogue, because the complexity of the platform has increased to an almost absurd level.

This can happen and it needs to be careful considered, as engineers we should continuously defer products to maintenance, and pick them up once offerings become available via. open-source or proprietary software caches up with our own abstractions.

You might build a state of the art build system one year, let it run for 5, and suddenly it is clunky to work with, because it has been in maintenance mode for years, but you may discover that an open-source tool has matured that is able to fill some of those requirements you had, because you're fully in control of the platform you might even be able to swap out the "engine" with a less complex one, and free up some maintenance budget for your team.

I'll discuss when you should make those decision and what they look like in another piece of content.

Teams require help to be in control of their software

At the end of the day, feature teams should still be fully responsible for their products, including the operations. So as a Platform Team you've got to carefully consider how you allow them to be in control. Do you send back all the right signals for them to be in control? Do they know much many applications they're running, how much CPU, memory they're using. What is their SQL latency, when did they deploy, which version is deployed, how are their gRPC latencies, what is the HTTP Error rate.

Again you can treat these as products, it doens't have to be fancy, but it takes a long time to get this right, and as you build abstractions on top of products, you'll continuously see that as services demand more of the platform that more and more of the internals needs to be exposed, such that the feature teams can engineer their services to the implicit requirements of the platform. I.e. they might need to tune memory settings, connection pools, http authentication, which ports are open. To what the platform expects.