Learning: Subnet scaling maintenance cancellation

August 14, 2025 · 5 min read

Summary

Planned maintenance

The Platform team scheduled an operation to expand the range of IP addresses available to RDS databases in production. This would let us scale up our databases and associated resources more aggressively under heavy demand without "running out" of address space. We pursued this because of issues measured during load testing of event-service in the staging environment (you can read more about that in PLAT-425).
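To make "running out" of address space concrete, here is a minimal sketch of the arithmetic using Python's standard `ipaddress` module. The CIDR ranges are hypothetical placeholders, not our real production subnets; the only AWS-specific fact assumed is that AWS reserves 5 addresses in every subnet.

```python
import ipaddress

# AWS reserves 5 addresses in every subnet (the first four and the last one).
AWS_RESERVED_PER_SUBNET = 5


def usable_addresses(cidrs):
    """Return how many addresses remain available across the given subnets."""
    total = 0
    for cidr in cidrs:
        network = ipaddress.ip_network(cidr)
        total += network.num_addresses - AWS_RESERVED_PER_SUBNET
    return total


# Hypothetical CIDRs for illustration only, not our real ranges.
current_subnets = ["10.0.1.0/27", "10.0.2.0/27"]
expanded_subnets = ["10.0.16.0/24", "10.0.17.0/24"]

print(usable_addresses(current_subnets))   # 2 * (32 - 5) = 54
print(usable_addresses(expanded_subnets))  # 2 * (256 - 5) = 502
```

Every database instance, replica, and related network interface draws from this pool, which is why a small subnet group can cap how far we scale under load.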

Actual event

The cutover should have been fast and low risk, but the configuration of RDS in production prevented us from safely doing it as previously communicated. Had we proceeded as planned, we would also have needed to fully recreate the databases along with their data, which would have been:

Much riskier than the original proposal, and
More time consuming, so maintenance would have gone past the 30-minute window we expected

Because of this, David decided to cancel the planned maintenance and reschedule it once we have more information and more normalized configurations.

Background

Terraform is an "infrastructure as code" solution that allows any engineer to "declare" infrastructure configurations via code changes. Terraform interprets these declarations and applies them to add, change, or remove infrastructure in our cloud provider, AWS.

Example below - Engineer makes declarative changes in code files that are interpreted by Terraform and then affect our cloud environments

Screenshot

This is highly preferable to the current way of managing infrastructure manually because of improvements to auditability, approval mechanisms, and self-service capabilities for engineering teams.

Screenshot

Migration of AWS infrastructure to be managed by Terraform is currently in progress.

Root Cause Analysis

Why wasn't this identified before

Two main reasons:

Fractured infrastructure management - Between the Platform team's testing and execution in lower environments and the planned execution in prod, 2 of the 3 DBs that would have been affected (GenAI and Knewton) were "migrated" to be managed by Terraform in production instead of manually. The subnet changes we tested in lower environments were not managed by Terraform, so we only had experience making this change manually in AWS.

Below - rough timeline of RDS changes over the last 1-2 months

Screenshot

Unproven Terraform process - Terraform documentation gave no clear indication that changing VPCs or subnets could be destructive. It lists the option and some operational parameters, but says nothing about databases essentially needing to be "reconstructed" after making the changes we needed.
1. https://developer.hashicorp.com/terraform/tutorials/aws/aws-rds
2. https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/db_subnet_group

Because of this, we had no clear indication that these specific changes could be problematic other than by testing them directly, which we were eventually able to do.
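One way to make that kind of direct test repeatable is to inspect the machine-readable plan before applying it. The sketch below assumes a plan exported with `terraform show -json <planfile>`, where each entry in `resource_changes` carries a `change.actions` list and a replacement appears as a delete paired with a create. The sample plan is hand-written for illustration with hypothetical resource names; it is not a plan captured from our environments.

```python
import json


def find_destructive_changes(plan):
    """Return (address, actions) for every resource the plan would delete or replace."""
    flagged = []
    for resource_change in plan.get("resource_changes", []):
        actions = resource_change.get("change", {}).get("actions", [])
        # A replacement is reported as a delete paired with a create.
        if "delete" in actions:
            flagged.append((resource_change["address"], actions))
    return flagged


# Hand-written sample in the shape of `terraform show -json` plan output.
# Resource names are hypothetical.
sample_plan = json.loads("""
{
  "resource_changes": [
    {
      "address": "aws_db_subnet_group.example",
      "type": "aws_db_subnet_group",
      "change": {"actions": ["update"]}
    },
    {
      "address": "aws_db_instance.example",
      "type": "aws_db_instance",
      "change": {"actions": ["delete", "create"]}
    }
  ]
}
""")

for address, actions in find_destructive_changes(sample_plan):
    print(f"DESTRUCTIVE: {address} -> {actions}")
```

Run against a real plan from a lower environment, a check like this would surface a database recreation before anyone applies the change.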

The Result

Rob had a feeling that doing this via Terraform might be more destructive than previously thought, which further research later confirmed. He worked late the night before (on 7/31) and eventually proved that these changes would be potentially destructive. David elected to cancel the maintenance window because it represented a much higher-risk change than previously communicated, and every way of getting it done was untested.

Takeaways

Stale changes / Drift - The subnets effort dragged on for too long, so the infrastructure no longer matched what we had tested. The subnet migration in lower environments was completed on July 1st; prod was planned for August 1st.
Incomplete TF adoption - Some resources in a given AWS product, like RDS, are managed by TF and some aren't, which we've seen is a recipe for disaster.
Fractured dev efforts - Too many engineers are working on too many different things at once. The PLAT team needs to keep its efforts concerted from start to finish.
Untested and unknown - As we have now seen, Terraform documentation may be unreliable in some cases. We can no longer rely solely on it and must directly test any TF changes 1-to-1 in lower envs before attempting them in production.

Action Items

Who's involved

We've previously proven that importing existing RDS databases to be managed by TF is a non-disruptive operation. The Platform team will work on these migrations and coordinate with application teams to do smoke testing as needed before rescheduling the maintenance window.

Breakdown

Most items will be done as part of PLAT Epic - Normalize RDS using Terraform

Normalize management patterns - Get all RDS resources uniformly managed via TF - PLAT-516
Highlight risky changes - Identify destructive vs non-destructive changes that can be made in TF RDS modules since this will be "Self-service for developers" - PLAT-517
Create guardrails - Add in extra protection and approval mechanisms for identified TF fields - PLAT-518
Streamline TF adoption - Lower the barrier to entry for all engineers, not just on Platform, to use and manage RDS infrastructure in this way - PLAT-519
Dry run subnet changes - Create and test a TF-based workflow in lower envs to have 1-to-1 confidence in this cutover in production.
Reschedule maintenance window - Once we have a more concrete understanding of outage time frame and risk factors, we'll reschedule for another time via DEPLOY-2724
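As a starting point for the "Normalize management patterns" item, the sketch below shows one possible way to report which RDS instances Terraform does not yet manage. It assumes state exported with `terraform show -json`, where resources sit under `values.root_module` and nested `child_modules`, and it assumes the list of all instance identifiers is gathered separately (for example from the AWS console or API). All identifiers and resource names are hypothetical.

```python
def managed_rds_identifiers(state):
    """Collect the `identifier` of every aws_db_instance tracked in the state."""
    found = set()

    def walk(module):
        for resource in module.get("resources", []):
            if resource.get("mode") == "managed" and resource.get("type") == "aws_db_instance":
                found.add(resource["values"]["identifier"])
        # Resources declared inside modules live under child_modules.
        for child in module.get("child_modules", []):
            walk(child)

    walk(state.get("values", {}).get("root_module", {}))
    return found


# Hand-written sample in the shape of `terraform show -json` state output.
sample_state = {
    "values": {
        "root_module": {
            "resources": [
                {"mode": "managed", "type": "aws_db_instance",
                 "name": "a", "values": {"identifier": "db-a"}},
            ],
            "child_modules": [
                {"resources": [
                    {"mode": "managed", "type": "aws_db_instance",
                     "name": "b", "values": {"identifier": "db-b"}},
                ]},
            ],
        }
    }
}

# Hypothetical list of every RDS instance that exists in the account.
all_instances = {"db-a", "db-b", "db-c"}

unmanaged = sorted(all_instances - managed_rds_identifiers(sample_state))
print(unmanaged)  # ['db-c']
```

An empty result would be one signal that RDS is uniformly managed and that the lower-env dry run matches what production will see.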
