Learning: Subnet scaling maintenance cancellation
Summary
Planned maintenance
The Platform team scheduled an operation to expand the range of IP addresses available to RDS databases in production. This would let us scale our databases and associated resources more aggressively under heavy demand without "running out" of available address space. We pursued this because of issues measured during load testing of event-service in the staging environment (you can read more about that in PLAT-425).
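To make "running out" of address space concrete, here is a minimal sketch of the arithmetic. The CIDR blocks are hypothetical and do not reflect our real subnets; the only fixed fact is that AWS reserves five addresses in every subnet.

```python
import ipaddress

# AWS reserves five addresses in every subnet (the first four and the last one).
AWS_RESERVED_PER_SUBNET = 5


def usable_addresses(cidrs):
    """Return how many addresses can actually be assigned across the given subnets."""
    total = 0
    for cidr in cidrs:
        network = ipaddress.ip_network(cidr)
        total += max(network.num_addresses - AWS_RESERVED_PER_SUBNET, 0)
    return total


# Hypothetical CIDR blocks, for illustration only.
current_subnets = ["10.0.1.0/28", "10.0.2.0/28"]
expanded_subnets = ["10.0.16.0/24", "10.0.17.0/24"]

print(usable_addresses(current_subnets))   # 22
print(usable_addresses(expanded_subnets))  # 502
```

Every database instance, replica, or proxy endpoint placed in a subnet consumes at least one of these addresses, so small subnets cap how far we can scale.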
Actual event
However, the configuration of RDS in production prevented us from safely performing this cutover as previously communicated, that is, as a fast and low-risk change. Had we proceeded as planned, we would also have needed to fully recreate the databases themselves, along with the data they contain, which would have been:
- Much riskier than originally proposed, and
- More time-consuming, so the maintenance would have run past the 30-minute window we expected
Because of this, David decided to cancel the planned maintenance and reschedule it for later, when we have more information and more normalized configurations.
Background
Terraform is an "infrastructure as code" solution that allows any engineer to "declare" infrastructure configurations via code changes. Terraform then interprets these declarations and applies them to add, change, or remove infrastructure in our cloud provider, AWS.
Example below - Engineer makes declarative changes in code files that are interpreted by Terraform and then affect our cloud environments

This is highly preferable to the current way of managing infrastructure manually because of improvements to auditability, approval mechanisms, and self-service capabilities for engineering teams.

Migration of AWS infrastructure to be managed by Terraform is currently in progress.
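The declarative model described in this section can be sketched in a few lines. This is a conceptual illustration only, not how Terraform is implemented; the resource names and attributes are hypothetical.

```python
def plan(desired, actual):
    """Compare desired and actual resources (keyed by name) and list the actions needed."""
    actions = {}
    for name in sorted(desired.keys() | actual.keys()):
        if name not in actual:
            actions[name] = "add"       # declared in code, missing in the cloud
        elif name not in desired:
            actions[name] = "remove"    # exists in the cloud, no longer declared
        elif desired[name] != actual[name]:
            actions[name] = "change"    # exists, but attributes differ
    return actions


# Hypothetical resources, for illustration only.
desired = {
    "orders-db": {"instance_class": "db.r5.xlarge"},
    "reports-db": {"instance_class": "db.t3.medium"},
}
actual = {
    "orders-db": {"instance_class": "db.r5.large"},
    "legacy-db": {"instance_class": "db.t3.small"},
}

print(plan(desired, actual))
# {'legacy-db': 'remove', 'orders-db': 'change', 'reports-db': 'add'}
```

The engineer only edits the desired state; the tool works out which operations are required. What this sketch does not show is that a "change" can sometimes only be carried out by destroying and recreating the resource.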
Root Cause Analysis
Why wasn't this identified before
Two main reasons:
- Fractured infrastructure management - Between the Platform team's testing and execution in the lower env and the planned execution in prod, 2 of the 3 DBs that would have been affected (GenAI and Knewton) were "migrated" in production from manual management to Terraform. The subnet changes we tested in the lower env were not managed by Terraform, so we only had experience doing this manually in AWS.
Below - rough timeline of RDS changes over the last 1-2 months

- Unproven Terraform process - Terraform documentation gave no clear indication that changing VPCs or subnets could be destructive. It lists the option and some operational parameters, but says nothing about databases needing to be essentially "reconstructed" after the changes we would have needed to make.
As a result, we had no clear indication that these specific changes could be problematic, short of testing them directly, which we were eventually able to do.
The Result
Rob had a feeling that doing this via Terraform might be more destructive than previously thought, which further research later confirmed. He worked late the night before (on 7/31) and eventually proved that these changes would be potentially destructive. David elected to cancel the maintenance window because it represented a much higher-risk change than previously communicated, and every available way of getting it done was untested.
Takeaways
- Stale changes / Drift - The subnets effort dragged on for too long, so the infrastructure no longer matched what we had tested. The subnet migration in lower envs was completed on July 1st; prod was planned for August 1st.
- Incomplete TF adoption - Some resources in a given AWS product, like RDS, are managed by TF and some aren't, which we've seen is a recipe for disaster.
- Fractured dev efforts - Too many engineers are working on too many different things at once. The PLAT team needs to make a more concerted effort to carry work through from start to finish.
- Untested and unknown - As we have now seen, Terraform documentation may be unreliable in some cases. We can no longer rely on it alone and must directly test any TF changes 1-to-1 in lower envs before attempting them in production.
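The "Incomplete TF adoption" takeaway can be checked mechanically. The sketch below assumes we already have two lists of instance identifiers, for example gathered from `aws rds describe-db-instances` and `terraform state list`; the function name and the identifiers shown are placeholders, not our real resource names.

```python
def classify_instances(in_aws, in_terraform):
    """Split instance identifiers by whether Terraform manages them."""
    in_aws, in_terraform = set(in_aws), set(in_terraform)
    return {
        "managed": sorted(in_aws & in_terraform),
        "unmanaged": sorted(in_aws - in_terraform),
        # Tracked by Terraform but no longer present in AWS.
        "stale_in_state": sorted(in_terraform - in_aws),
    }


# Placeholder identifiers, for illustration only.
report = classify_instances(
    in_aws=["genai", "knewton", "third-db"],
    in_terraform=["genai", "knewton"],
)
print(report["unmanaged"])  # ['third-db']
```

A non-empty "unmanaged" list for a product like RDS is the mixed state described above, where a tested procedure in one environment may not match how the same resources are managed in another.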
Action Items
Who's involved
We have previously shown that importing existing RDS databases into TF management is a non-disruptive operation. The Platform team will work on these migrations and coordinate with application teams to do smoke testing as needed before rescheduling the maintenance window.
Breakdown
Most items will be done as part of PLAT Epic - Normalize RDS using Terraform
- Normalize management patterns - Get all RDS resources uniformly managed via TF - PLAT-516
- Highlight risky changes - Identify destructive vs non-destructive changes that can be made in TF RDS modules since this will be "Self-service for developers" - PLAT-517
- Create guardrails - Add in extra protection and approval mechanisms for identified TF fields - PLAT-518
- Streamline TF adoption - Lower the barrier to entry for all engineers, not just on Platform, to use and manage RDS infrastructure in this way - PLAT-519
- Dry run subnet changes - Create and test a TF-based workflow in lower envs to have 1-to-1 confidence in this cutover in production.
- Reschedule maintenance window - Once we have a more concrete understanding of outage time frame and risk factors, we'll reschedule for another time via DEPLOY-2724
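As one possible starting point for the "Highlight risky changes" and "Create guardrails" items, a saved plan can be checked for destructive actions before anyone applies it. This is a minimal sketch, not an agreed design: it assumes the plan has been exported with `terraform show -json` on a saved plan file, and the sample below is a trimmed, hand-written stand-in with hypothetical resource addresses.

```python
import json


def destructive_changes(plan):
    """Return (address, actions) for every resource the plan would delete or replace."""
    risky = []
    for resource in plan.get("resource_changes", []):
        actions = resource.get("change", {}).get("actions", [])
        # A replacement appears as ["delete", "create"] or ["create", "delete"].
        if "delete" in actions:
            risky.append((resource.get("address"), actions))
    return risky


# Hand-written sample in the shape of Terraform's JSON plan output (addresses are hypothetical).
sample_plan = json.loads("""
{
  "resource_changes": [
    {"address": "aws_db_instance.example_a", "type": "aws_db_instance",
     "change": {"actions": ["delete", "create"]}},
    {"address": "aws_db_subnet_group.example", "type": "aws_db_subnet_group",
     "change": {"actions": ["update"]}},
    {"address": "aws_db_instance.example_b", "type": "aws_db_instance",
     "change": {"actions": ["no-op"]}}
  ]
}
""")

for address, actions in destructive_changes(sample_plan):
    print(f"{address} would be destroyed or replaced: {actions}")
```

A check like this could gate the approval step so that any plan that would destroy or replace a database needs explicit sign-off, which is the situation that forced this cancellation.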