Introducing the Digital Platform
This post describes the vision and direction for what I'm tentatively calling our "Digital Platform"[1]: a collection of our tools, services, and operational functionality arranged as a compelling internal product that allows teams to operate more quickly and independently. It imagines how we can transform our DevOps team into a product team that manages our platform as a product.
Let's dig in by starting with why this is important!
Motivation
All product teams, such as the Integrations team or the Smartwork team, have common needs that are not strictly germane to their core mission. As a principle of lean development, these teams should spend as little time and mental energy on these extraneous needs as possible so that they can focus on building solutions to support their core mission. That is not currently the case for most of our teams.
These needs include everything required on the delivery path, such as turning code into software (build), checking the software for errors and regressions (verify), installing the software on a server that people can access (deploy), and watching the software in production (monitor).
As our digital product offerings grow, each team must either build the capabilities to do these things itself or work with other teams to accomplish its goals. The former results in a lot of duplicated effort, and the latter results in tight coupling between teams: each team's rate of delivery depends on another team's rate of delivery, effectively bottlenecking every team.
We've seen the effects of this in profound ways across every team, whether it's Norton Illumine Ebook work slowing significantly as they worked with the other teams to connect to our systems, the Ecommerce team needing to work across many codebases and teams to accomplish their goals, or the Integrations team needing to work across teams to design how customers will use the new LTI implementation while we still support the old implementation.
Aside from specific project work, teams experience this nearly daily whenever they need to open a DevOps ticket for any reason at all, creating backlog coupling that ties the team's rate of delivery to the capacity of the DevOps team.
The solution
Thankfully, this problem is not unique to us. It's such a common problem that there are now idiomatic patterns and anti-patterns to help guide us to our solution: the platform as a product pattern.
In fact, at the time of this post, applying product management to internal platforms is listed as the #1 recommended adoption technique on the Thoughtworks Technology Radar, an "opinionated guide to technology frontiers" and a useful indicator of technology trends that helps lend confidence to solutions.
In practical terms, this is an evolution of our DevOps team into a platform-as-a-product team. It is more a different way of doing things than a wholly new team, and it will mean some changes in how we think about our needs as developers and product team members. Under this model, our product teams will be the customers of the platform and should voice their needs accordingly so that the platform team can build solutions that meet those needs.
And the platform team will practice all the hallmarks of a product mindset:
- A keen interest in and commitment to their customers' unmet, unrealized, and evolving needs.
- Building a great user experience (in this case, typically a developer experience).
- An eagerness to discard solutions that aren't working for their customers.
And to capture and drive this product mindset, the team will have a new mission.
Mission
Empower teams to build, monitor, and ship rapidly, securely, reliably, and with substantial autonomy.
Let's break this mission into its parts since a good mission helps us answer questions about scope and principles as they arise.
- The leading verb is empower because the platform should give our product teams superpowers: new capabilities that help them do their work faster, more securely, and more reliably, all at once.
  - Contrast this with our current state, where a separate team (DevOps) "owns" key capabilities along the delivery pipeline (deployment, for instance). The new expectation is that product teams will use these capabilities to execute the entire delivery pipeline themselves.
- Build, monitor, and ship describe the key parts of the DevOps lifecycle where the platform will focus its improvements.
- The adverbs rapidly, securely, and reliably hint at the principles of fast delivery (measured via DORA metrics, perhaps), security as a core tenet of the work, and site reliability (measured via common metrics, perhaps).
- Finally, substantial autonomy underscores the goal of creating a platform that doesn't require high-bandwidth coordination like meetings or deeply reading extensive documentation.
  - While occasional support is to be expected, the success of solutions should be measured by how rarely teams need help using them.
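As one hedged illustration of how "rapidly" might be measured, the sketch below computes two of the DORA metrics, deployment frequency and lead time for changes, from a list of deployment records. The `Deployment` record and its field names are assumptions made for this example and do not describe an existing data source.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass(frozen=True)
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime  # when the change was committed
    deployed_at: datetime   # when the change reached production


def deployment_frequency(deployments: list[Deployment], days: int) -> float:
    """Average number of deployments per day over a window of `days` days."""
    if days <= 0:
        raise ValueError("days must be positive")
    return len(deployments) / days


def median_lead_time(deployments: list[Deployment]) -> timedelta:
    """Median time from commit to production across the deployments."""
    if not deployments:
        raise ValueError("need at least one deployment")
    return median(d.deployed_at - d.committed_at for d in deployments)
```

A team could track these two numbers over time to see whether platform changes are actually speeding up delivery.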
Scope
The key capabilities listed in the mission outline the type of needs that the platform may address.
- Build – needs related to turning an idea into software.
  - For example, functionality like source code management, continuous integration (CI), or even future-forward tooling like remote development.
- Ship – needs related to getting software onto a server.
  - For example, functionality like continuous delivery (CD), deployment automation, and environment provisioning.
- Monitor – needs related to observing software running on a server.
  - For example, functionality like server metrics, logging, tracing, and general observability.
Domains
Another useful way to break down the platform's scope is to outline the domains, or functional areas, that the platform team might manage. This is a common domain-driven design approach to help us create boundaries between work and systems.
I've seeded some of those possible domains based on the needs that we currently have, but the names and boundaries of these domains are likely to change. It will ultimately be up to the team to define these domains and find the solutions that best meet our teams' needs.
- Runtime management
- Pipeline management
- Database management
- System cost management
- Secrets/credentials management
- Modernization of systems / service remediation and migrations
- Shared service infrastructure
- Cloud provisioning
- Security & compliance
- Reliability
- Monitoring and observability
Domain details
Runtime management
Runtimes are effectively the "servers" of our computing platform. They include AWS EC2 instances and our Kubernetes infrastructure but could include other options should the need or opportunity arise.
Features of this part of the platform might include a safe and secure way for developers to independently provision new runtimes for experiments and for production, consistent tagging conventions to help with resource monitoring, and guardrails or tooling to help manage costs since runtimes are our primary cloud expense.
Pipeline management
The "pipeline" is everything that happens between when I check in my code and when it's running in production. At a bare minimum, this includes three stages:
- Build – turning code into software, typically an image.
- Validate – quality checks and acceptance tests.
- Deploy – getting the software on a runtime (see "runtime management").
Features of this part of the platform should include self-serve continuous integration capabilities, which address stages 1 and 2 (we already have this running on GitLab but could provide more templates to help developers get started), as well as self-serve continuous delivery and deployment automation capabilities to address stage 3 (deploy).
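The three stages can be pictured as an ordered list of steps where a failure stops everything after it. The following is a minimal sketch of that idea only; `run_pipeline`, `Stage`, and the placeholder steps are hypothetical names for illustration and are not how our GitLab pipelines are defined.

```python
from typing import Callable

# Each stage is a name plus a callable that returns True on success.
Stage = tuple[str, Callable[[], bool]]


def run_pipeline(stages: list[Stage]) -> list[str]:
    """Run stages in order and stop at the first failure.

    Returns the names of the stages that completed successfully.
    """
    completed: list[str] = []
    for name, step in stages:
        if not step():
            break  # later stages never run after a failed stage
        completed.append(name)
    return completed


# Placeholder steps; a real pipeline would call out to build and test tooling.
example_stages: list[Stage] = [
    ("build", lambda: True),
    ("validate", lambda: False),  # simulate a failing quality check
    ("deploy", lambda: True),
]
```

In this example the simulated validation failure means the deploy stage never runs, which is the property that makes the validate stage a meaningful gate.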
Database management
This is everything to do with our databases, from provisioning (creating a new database) to masking personal data and upgrade paths.
See database management needs (ENG-15) for a detailed breakdown of some of our current known needs.
System cost management
Cloud platforms like AWS are "you build it, you run it" solutions that can have unexpected costs if not managed well. The platform should manage this so that individual developers can build with the confidence that they're not going to run up crazy charges.
Features of this part of the platform might include resource monitoring rules to identify under-utilized resources, guardrails for developers with permission to provision resources, and early warning notifications for cost anomalies.
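To make these features concrete, here is a rough sketch of two such checks: one that lists under-utilized resources and one that flags a daily spend far above the recent norm. The function names, the 10% CPU threshold, and the three-standard-deviation rule are assumptions chosen for illustration; the team would pick its own rules and data sources.

```python
from statistics import mean, pstdev


def underutilized(cpu_by_resource: dict[str, float], threshold: float = 10.0) -> list[str]:
    """Return resource names whose average CPU percentage is below `threshold`."""
    return sorted(name for name, cpu in cpu_by_resource.items() if cpu < threshold)


def is_cost_anomaly(history: list[float], today: float, sigmas: float = 3.0) -> bool:
    """Flag today's spend if it exceeds the historical mean by `sigmas` standard deviations."""
    if len(history) < 2:
        return False  # not enough data to judge
    limit = mean(history) + sigmas * pstdev(history)
    return today > limit
```

A check like `is_cost_anomaly` could run daily and feed an early warning notification, while `underutilized` could feed a periodic cleanup report.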
Secrets/credentials management
Effectively managing credentials (often required for apps to communicate with each other) and secrets (more general sensitive information) is a key part of a secure platform. We do not currently have a comprehensive solution or strategy in place for this, so it will largely be a greenfield area for the team.
Modernization of systems / service remediation and migrations
As the platform is built out, the platform team should build solutions that help teams use the new systems.
Features of this part of the platform might include migration tools to help automatically modify code, migration documentation that provides a step-by-step process for teams, or even short-term work to help do the work of migrating for critical systems.
Shared service infrastructure
This is a catch-all for services that sit between teams and might be better-suited as part of our "platform," as well as the infrastructure to coordinate a growing microservice architecture.
Features of this part of the platform would be things that improve the experience of building, using, and deploying a service. This might include a unified API gateway or router, a service mesh, or even related components like in-memory data stores (e.g., Redis), or message queuing.
Cloud provisioning
Creating resources like S3 buckets, EC2 instances, or CloudFront distributions in AWS is called "provisioning" and our platform should allow anyone to create their own resources, with guardrails. Managing this requires clear design and rules for how resources are created, layers of abstraction that allow us to change the backend without affecting how developers work with the system (for instance, switching to Azure), clear expectations about resource lifecycles (monitoring and terminating, for instance), and ideally automation to make the process simple and easy for developers.
Features of this part of the platform may include infrastructure-as-code templates for common stacks, and very clear and well-designed IAM rules to give users the access they need without risking breaking anything.
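One possible guardrail is a check that runs before any resource is created and rejects requests that don't follow the tagging conventions. The sketch below assumes a hypothetical convention: the required tag keys and the allowed environment values are invented for this example and are not an existing standard of ours.

```python
REQUIRED_TAGS = {"team", "environment", "cost-center"}  # hypothetical convention
ALLOWED_ENVIRONMENTS = {"dev", "staging", "production"}  # hypothetical values


def tag_violations(tags: dict[str, str]) -> list[str]:
    """Return human-readable problems with a request's tags; empty means it passes."""
    problems = [f"missing tag: {key}" for key in sorted(REQUIRED_TAGS - set(tags))]
    env = tags.get("environment")
    if env is not None and env not in ALLOWED_ENVIRONMENTS:
        problems.append(f"unknown environment: {env}")
    return problems
```

Returning a list of problems rather than a single pass/fail value lets a developer fix everything in one attempt, which fits the goal of needing as little support as possible.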
Security & compliance
As we grow, we need strategies and solutions to help us ensure that our applications are secure and in compliance with growing customer requirements around accessibility, privacy, and other horizontal frontiers.
Features of this part of the platform may include security templates for CI jobs, guidelines for security, and tooling to help provide greater visibility into the areas of compliance in our platform.
Reliability
Reliability is a catch-all for the areas of performance, stability, scalability, availability, and disaster recovery.
Features of this part of the platform may include tools that allow us to measure and improve SLIs, SLAs, and SLOs; common incident response metrics; and streamlined notification systems (tools, processes, practices).
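As a small worked example of how an SLI and an SLO relate, the sketch below computes request availability (an SLI) and how much of the error budget implied by an SLO target remains. The function names and the 99.9% target used in the usage note are assumptions for illustration.

```python
def availability(successes: int, total: int) -> float:
    """SLI: fraction of requests that succeeded."""
    if total <= 0:
        raise ValueError("total must be positive")
    return successes / total


def error_budget_remaining(successes: int, total: int, slo: float) -> float:
    """Fraction of the error budget still unspent (negative if the SLO is breached)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    allowed_failures = (1.0 - slo) * total
    actual_failures = total - successes
    return 1.0 - actual_failures / allowed_failures
```

With a 99.9% target over 10,000 requests, roughly 10 failures are allowed, so 5 failures would leave about half of the budget unspent.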
Monitoring and observability
From the DORA monitoring and observability capability:
Monitoring is tooling or a technical solution that allows teams to watch and understand the state of their systems. Monitoring is based on gathering predefined sets of metrics or logs. Observability is tooling or a technical solution that allows teams to actively debug their system. Observability is based on exploring properties and patterns not defined in advance.
Features of this part of the platform may include Kubernetes logging, tracing, and metrics, as well as helping teams set up their own dashboard for monitoring.
The transition
So far, we've focused entirely on the end state for the platform: a vision of what it might look like if we take on a product mindset for our shared capabilities. But this isn't going to change overnight. It will be a gradual change that will undoubtedly expose new questions, invalidate some of the assumptions I've made in this post, and require all of us to work together to drive the change.
I expect us, as problem solvers, to treat new questions the same way we treat any tough technical problem: deconstructing them and working together to figure out solutions, all while keeping the spirit of the mission in mind to help us make decisions.
But there are three key changes that we will be making to facilitate this transition:
- We will be hiring a new Product Manager who will own and drive the goals and priorities for the team's work. This will be a completely new role and we won't fully transition the team until this position is filled.
- We will be doing away with DevOps as a role entirely because it sends the message that DevOps is a specific group's job, rather than a software development process and cultural philosophy. Details are outlined below under changing roles.
- All current DevOps team members will be part of the new Digital Platform team.
Changing roles
These changes will not take effect until the new Product Manager is hired and the team has formally launched. They are shared here as part of the vision and direction for the team, not as immediate action items.
Since we are doing away with DevOps as a role, this means changing titles for our DevOps Engineers and our DevOps Manager to reflect their evolving roles on the team.
- DevOps Engineers will become Reliability Engineers.
  At the outset, Reliability Engineers will continue doing the exact same work they're doing today as DevOps Engineers. But as the self-service platform is defined and built, they should expect to shift their work toward more automation, eventually spending around 50% of their time on operations/support (responding to tickets, incident response, keeping the lights on, etc.) and 50% of their time on automating systems, such as writing infrastructure as code. IBM's video, "What is Site Reliability Engineering (SRE)?", illustrates an example of what this role might look like in the future.
- DevOps Manager will become an Engineering Manager for the Digital Platform team.
  In the short term, the Engineering Manager will continue to function the same way the DevOps Manager does today. Longer term, this role will focus less on communicating and managing work priorities, since that will be the role of the Product Manager, and more on typical engineering management tasks like supporting and growing people; aligning, setting, and amplifying quality standards both vertically and horizontally; and improving or eliminating process.
Our specialist roles[2] on the current DevOps team will not be changing, but they will be part of the new Digital Platform team, where their subject matter expertise will be essential in describing work around their domains. Since they will have a team of engineers helping them accomplish their goals and a product manager prioritizing the work in a backlog, we can expect to have a much better mechanism for getting work prioritized and completed.
Additional resources
The vision for this came from many sources, including conversations with many of you, as well as technology leaders at other companies. But I also got a lot of this from readings, talks, podcasts, and videos, which I'm sharing here for anyone who's interested in learning more.
- What I Talk About When I Talk About Platforms – a recent, clear, comprehensive description and story of building a successful platform. The engineering team discussed this article at our June 2023 monthly.
- Building Infrastructure Platforms – a slightly older description of platforms, also on Martin Fowler's site.
- The Art of Platform Thinking – an even older and generally more opinionated concept of a platform.
- Applying product management to internal platforms – the entry for Thoughtworks' Technology Radar, "an opinionated guide to technology frontiers."
- Managing Platform Teams: How to Build and Structure Platform Engineering? (podcast)
- Managing Platform Teams: How to Build and Structure Platform Engineering? (video of the same podcast)
- Team Topologies – a great book on team types.
- Platform as a Product – describes the "platform" topology from Team Topologies.
- How to Shift from a Project to Product Mindset (podcast)
- Tips for Building Successful Platform Teams (video) – a short, dense video with some practical tips.
Footnotes
1. Calling this thing the "Digital Platform" may understandably create confusion. As Evan Bottcher observes, "'Platform' is just about the most ambiguous term we could use for an approach that is super-important for increasing delivery speed and efficiency at scale." At Norton, we sometimes call NCIA our "platform," we have a group of codebases called our "Platform services," we already sometimes call an array of tools our "Norton Publishing Platform" (NPP), and there are probably numerous other things that we regularly refer to as a platform.
   I've chosen to use "platform" here simply because it's the term that is most often used for this solution and team pattern. Once launched, we will leave it to the team to decide how to resolve those points of confusion.
2. Our specialist roles are Database Engineer and Application Security Engineer.
