Explanation: Understanding RDS Infrastructure
Introduction
This document explains the architecture, design decisions, and policy guardrails behind Norton's self-service RDS infrastructure. It covers how the Terraform module works, how the CI/CD pipeline validates and applies changes, what OPA policies enforce, and the rationale behind the choices made. Read the how-to guide first if you need practical step-by-step instructions for creating or modifying databases.
The Problem: Manual Database Provisioning
Before the self-service RDS workflow, creating or modifying a database required:
- A ticket to the Platform team describing the desired configuration
- Platform engineers manually writing Terraform code
- Back-and-forth to clarify requirements (sizing, networking, access)
- Manual review and apply cycles
This created bottlenecks where simple database changes could take days. Development teams couldn't iterate quickly on database configurations, and the Platform team spent significant time on routine provisioning tasks.
What Self-Service Solves
The self-service model shifts the workflow: developers own the configuration, OPA policies provide automated guardrails, and the Platform team provides oversight without being a bottleneck.
Architecture Overview
Module Structure
The RDS Terraform module lives at aws/rds/ in the Infrastructure repository:
aws/rds/
├── rds.tf # DB instances, subnet groups, security groups, read replicas
├── variables.tf # Input variable definitions with types and defaults
├── outputs.tf # Exported attributes (endpoints, ARNs, etc.)
├── locals.tf # Subnet/SG resolution logic, password lookup, replica flattening
└── data.tf # Data sources for existing subnet groups, SGs, and secrets
How Instances Are Configured
Each environment has its own variables file at accounts/{environment}/rds/terraform.tfvars. These files contain a map called rds_instances where each key is a logical name for a database and the value is an object with all configuration attributes.
The module iterates over this map and, for each entry, creates the corresponding AWS resources (DB instances, subnet groups, security groups, and read replicas).
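The iteration can be modeled in a few lines of Python. This is only a sketch of the idea, not the module's code: the attribute names inside each entry, the read_replicas key, the "<primary>-<replica>" key format, and the resource names this and replica are all assumptions made for illustration.

```python
# Hypothetical shape of the rds_instances map. Everything except the map
# name itself is an illustrative assumption, not the module's real schema.
rds_instances = {
    "orders": {
        "engine": "postgres",
        "instance_class": "db.t3.medium",
        "read_replicas": {"reporting": {"instance_class": "db.t3.large"}},
    },
    "catalog": {"engine": "mysql", "instance_class": "db.t3.small"},
}


def planned_resources(instances):
    """Return one resource address per DB instance, plus one per replica."""
    addresses = []
    for name, cfg in sorted(instances.items()):
        addresses.append(f'aws_db_instance.this["{name}"]')
        # Flatten nested replicas into "<primary>-<replica>" keys so each
        # replica can be addressed on its own.
        for replica in sorted(cfg.get("read_replicas", {})):
            addresses.append(f'aws_db_instance.replica["{name}-{replica}"]')
    return addresses


for address in planned_resources(rds_instances):
    print(address)
```

Keying every resource by its logical name means that adding or removing one map entry should affect only that database's resources.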
Two Password Management Approaches
The module supports two approaches for managing the master database password:
Traditional Password (most common for dev):
- You create a secret in AWS Secrets Manager with {"password": "your-password"}
- Set password_secret_name to the secret path
- Terraform reads the password at apply time via a data source
- You manage rotation manually
Managed Password (recommended for production):
- Set manage_master_user_password = true
- AWS RDS creates and manages the password in Secrets Manager automatically
- AWS handles rotation on a schedule
- Optionally specify master_user_secret_kms_key_id for custom encryption
When to use which: Use traditional passwords for development databases where you need to know the password for local development tools. Use managed passwords for production databases where automatic rotation and reduced human access to credentials are important.
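The choice between the two approaches can be sketched as a small classifier. This is a model under stated assumptions, not the module's logic: it assumes the two settings are mutually exclusive and that an instance must pick one; the function names are hypothetical.

```python
import json


def password_source(cfg):
    """Classify one instance config as 'managed' or 'secret'.

    Assumption: the two approaches are mutually exclusive and one of them
    must be configured. The real module may behave differently.
    """
    managed = cfg.get("manage_master_user_password", False)
    secret_name = cfg.get("password_secret_name")
    if managed and secret_name:
        raise ValueError(
            "set only one of manage_master_user_password / password_secret_name"
        )
    if managed:
        return "managed"
    if secret_name:
        return "secret"
    raise ValueError("no password approach configured")


def read_password(secret_string):
    """Extract the password from a secret stored as {"password": "..."}."""
    return json.loads(secret_string)["password"]
```

For example, password_source({"password_secret_name": "dev/rds/orders"}) classifies the instance as using a traditional secret (the secret path here is a placeholder).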
Networking
Understanding the networking layer is important for choosing the right configuration values.
VPC (Virtual Private Cloud)
Every RDS instance runs inside a VPC. Norton's AWS accounts have multiple VPCs for different purposes. You must specify the vpc_id where your database should be created. This determines which network your database is reachable from.
The VPC you choose should match the VPC where your application runs. If your EKS pods are in VPC A, your database should also be in VPC A (or have cross-VPC connectivity configured by the Platform team).
Subnet Groups
A DB subnet group is a collection of subnets (in different Availability Zones) where RDS can place your database. This is required for Multi-AZ deployments and general high availability.
How the module handles subnet groups:
- If you set subnet_group_name to an existing group name → the module uses it
- If you don't set subnet_group_name → the module creates a new subnet group from subnet_ids
For most development databases, use the shared subnet group (dev-group). This is pre-configured with subnets across multiple AZs.
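The subnet group decision can be modeled as follows. This is a sketch only: the generated group name format and the error raised when neither value is supplied are assumptions, not documented module behavior.

```python
def resolve_subnet_group(name, cfg):
    """Return (subnet_group_name, needs_create) for one instance.

    The "<name>-subnet-group" format is a hypothetical naming scheme.
    """
    existing = cfg.get("subnet_group_name")
    if existing:
        # Reuse an existing group, e.g. the shared dev-group.
        return existing, False
    subnet_ids = cfg.get("subnet_ids", [])
    if not subnet_ids:
        raise ValueError(f"{name}: set subnet_group_name or provide subnet_ids")
    return f"{name}-subnet-group", True


print(resolve_subnet_group("orders", {"subnet_group_name": "dev-group"}))
```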
Security Groups
Security groups act as firewalls controlling which traffic can reach your database.
How the module handles security groups:
- If you set security_group_ids → the module uses those existing security groups
- If you set security_group_name → the module looks up the named security group
- If neither is set → the module creates a new security group allowing traffic from allowed_cidr_blocks on your database's port
For most databases, use the shared security group for your environment. Custom security groups are typically only needed for databases with strict network isolation requirements.
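The three security group branches can be sketched as one function. The precedence shown (IDs first, then name, then create) follows the order the cases are listed in and is an assumption about the module; the returned dictionary shape is purely illustrative.

```python
def resolve_security_groups(cfg):
    """Decide how security groups are obtained for one instance.

    Assumed precedence: security_group_ids, then security_group_name,
    then a newly created group.
    """
    if cfg.get("security_group_ids"):
        return {"mode": "existing_ids", "ids": list(cfg["security_group_ids"])}
    if cfg.get("security_group_name"):
        return {"mode": "lookup_by_name", "name": cfg["security_group_name"]}
    # Neither is set: create a group with one ingress rule per CIDR block.
    return {
        "mode": "create",
        "ingress": [
            {"cidr": cidr, "port": cfg["port"], "protocol": "tcp"}
            for cidr in cfg.get("allowed_cidr_blocks", [])
        ],
    }
```

Note that in the create branch an empty allowed_cidr_blocks yields a group with no ingress rules, which would leave the database unreachable.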
OPA Policy Guardrails
The CI pipeline evaluates every RDS change against Open Policy Agent (OPA) policies before Terraform can apply. These policies enforce organizational standards and prevent misconfigurations.
How OPA Evaluation Works
- The CI pipeline runs terraform plan and converts the output to JSON
- OPA evaluates the plan JSON against policy rules in policies/accounts/{environment}/rds/policy.rego
- If any deny rules match, the pipeline fails and posts the violation messages as MR comments
- The developer fixes the violations and pushes updated commits
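The real policy is written in Rego, but the shape of the evaluation can be modeled in Python. The sketch below walks the resource_changes array of a Terraform plan JSON and collects deny messages. The allowlist values, the sample plan, and the message wording are placeholders, not the live policy.

```python
import json

# Placeholder values only; the live values are in allowlist.json.
ALLOWLIST = {
    "allowed_engines": ["postgres"],
    "min_backup_retention_days": 7,
    "storage_encrypted_required": True,
}

PLAN_JSON = """
{"resource_changes": [
  {"address": "aws_db_instance.this[\\"orders\\"]",
   "type": "aws_db_instance",
   "change": {"actions": ["create"], "before": null,
              "after": {"engine": "mysql",
                        "backup_retention_period": 1,
                        "storage_encrypted": true}}}
]}
"""


def deny(plan, allowlist):
    """Return violation messages for aws_db_instance changes in a plan."""
    messages = []
    for rc in plan.get("resource_changes", []):
        if rc.get("type") != "aws_db_instance":
            continue
        after = rc.get("change", {}).get("after")
        if after is None:  # the resource is being deleted
            continue
        addr = rc["address"]
        engine = after.get("engine")
        if engine not in allowlist["allowed_engines"]:
            messages.append(f"{addr}: engine '{engine}' is not in allowed_engines")
        retention = after.get("backup_retention_period") or 0
        minimum = allowlist["min_backup_retention_days"]
        if retention < minimum:
            messages.append(f"{addr}: backup retention {retention} < {minimum}")
        if allowlist["storage_encrypted_required"] and not after.get("storage_encrypted"):
            messages.append(f"{addr}: storage encryption must be enabled")
    return messages


violations = deny(json.loads(PLAN_JSON), ALLOWLIST)
for message in violations:
    print(message)
```

An empty list corresponds to a passing pipeline; any message corresponds to a failed pipeline and an MR comment.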
What the Policies Enforce
Every key named in the table below (the allowed_* lists, the *_required flags, min_backup_retention_days, and allow_public_access_identifiers) is a field in policies/data/rds/allowlist.json. Open that file to see the current values and which engine, version, instance class, port, VPC, and subnet group entries are live right now.
| Check | What's Validated | How It Works |
|---|---|---|
| Engine type | Only allowed engines | Must be in allowed_engines list |
| Engine version | Must be in allowlist | Must be in allowed_engine_versions list |
| Instance class | Must be in allowlist | Must be in allowed_instance_classes list |
| Port | Must match engine | Must match the engine-to-port mapping in allowed_ports |
| Storage encryption | Must be enabled | Controlled by storage_encrypted_required flag |
| Backup retention | Minimum days required | Must be >= min_backup_retention_days |
| Public access | Only specific identifiers | Identifier must be in allow_public_access_identifiers |
| Performance Insights | May be required | Controlled by performance_insights_required flag |
| VPC | Must be allowlisted | Must be in allowed_vpc_ids (empty = unrestricted) |
| Subnet group | Must be allowlisted | Must be in allowed_subnet_group_names (empty = unrestricted) |
These values change over time. Engine versions are deprecated, new instance classes are added, and networking resources evolve. Always check the current allowlist before submitting your MR rather than relying on values printed in this document.
The source of truth for all allowed values is policies/data/rds/allowlist.json in the Infrastructure repository. This file contains separate dev and prod sections with all current allowlists. The Platform team updates this file when new engine versions, instance classes, or networking resources need to be supported.
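Two of the checks in the table have semantics worth spelling out: the engine-to-port mapping, and the "empty = unrestricted" rule for VPCs and subnet groups. The sketch below assumes the file has top-level dev and prod keys and that allowed_ports maps each engine to a single port. Both shapes, and every value shown, are illustrative guesses rather than the real file contents.

```python
# Assumed shape of allowlist.json; all values are placeholders.
allowlist = {
    "dev": {
        "allowed_ports": {"postgres": 5432, "mysql": 3306},
        "allowed_vpc_ids": [],  # empty list = unrestricted
        "allowed_subnet_group_names": ["dev-group"],
    },
    "prod": {
        "allowed_ports": {"postgres": 5432},
        "allowed_vpc_ids": ["vpc-EXAMPLE"],
        "allowed_subnet_group_names": [],
    },
}


def port_matches_engine(engine, port, section):
    """True when the port is the one mapped to this engine."""
    return section["allowed_ports"].get(engine) == port


def value_allowed(value, allowed):
    """An empty allowlist places no restriction on the value."""
    return not allowed or value in allowed


dev = allowlist["dev"]
print(port_matches_engine("postgres", 5432, dev))
print(value_allowed("vpc-anything", dev["allowed_vpc_ids"]))
```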
Policy Change Detection
OPA policies only evaluate attributes that are changing in the Terraform plan. If you're only updating tags on an existing database, the engine version check won't trigger even if the current version isn't in the latest allowlist. This prevents existing databases from being blocked by new policy additions.
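Change detection can be modeled by comparing the before and after objects that Terraform's plan JSON records for each resource change. The sketch below is a Python stand-in for the Rego logic; the version strings and the message text are placeholders.

```python
def changed_attributes(change):
    """Names of attributes whose value differs between before and after.

    On create, before is null in the plan JSON, so every attribute in
    after counts as changed.
    """
    before = change.get("before") or {}
    after = change.get("after") or {}
    return {key for key in after if before.get(key) != after[key]}


def engine_version_violation(change, allowed_versions):
    """Check engine_version only when the plan actually changes it."""
    if "engine_version" not in changed_attributes(change):
        return None
    version = change["after"]["engine_version"]
    if version in allowed_versions:
        return None
    return f"engine_version '{version}' is not in allowed_engine_versions"


# A tags-only update on a database running a version that is no longer
# allowlisted: the version check is skipped.
tags_only = {
    "before": {"engine_version": "13.4", "tags": {"Team": "a"}},
    "after": {"engine_version": "13.4", "tags": {"Team": "b"}},
}
print(engine_version_violation(tags_only, ["16.3"]))
```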
Known Policy Gaps
The following areas are documented for transparency and are tracked for future improvement:
- Tag enforcement: OPA does not currently enforce that required tags (Team, Product) are present. This is documented as a best practice but not blocking.
- Storage size bounds: Instance classes are limited but allocated_storage has no upper bound in OPA. The practical limit is what exists in production today.
- Encryption changes: The policy prevents disabling encryption but does not restrict who can toggle encryption-related settings. Changes to storage_encrypted or kms_key_id on existing databases can be destructive (recreates the instance). These are flagged as READ-MORE in the tfvars.
- Production RDS policy: Currently only the development account has a dedicated RDS OPA policy. Production uses the shared allowlist but evaluation may not be active for all fields.
SAFE vs READ-MORE Properties
The development tfvars file includes annotations (as comments) categorizing every configurable property by risk level. This is a developer experience feature to help you understand the impact of changes before making them.
What SAFE Means
SAFE properties can be changed without risk of data loss or service destruction. They may still cause brief restarts (like instance_class) but won't destroy your database or require recreation.
Examples:
- tags: Metadata only, no infrastructure impact
- allocated_storage: Storage increases are online operations
- multi_az: Adds/removes standby replica without downtime
- backup_retention_period: Changes backup policy only
What READ-MORE Means
READ-MORE properties require you to understand the implications before changing them. In some cases, changing these values on an existing database will cause Terraform to destroy and recreate the instance, resulting in data loss if not handled properly.
Examples:
- engine: Changing engine type (e.g., postgres → mysql) destroys the instance
- db_name: Changing on an existing DB forces recreation
- username: Changing on an existing DB forces recreation
- storage_encrypted: Enabling encryption on an unencrypted DB requires a snapshot restore (destructive)
- kms_key_id: Changing the encryption key forces recreation
If a property is marked READ-MORE, always review the linked CI job proof or the Terraform aws_db_instance documentation before making the change. When in doubt, contact the Platform team.
The SAFE/READ-MORE annotations appear as comments at the top of the accounts/development/rds/terraform.tfvars file, with a link to the main documentation for each property.
Read Replicas
The RDS module supports creating read replicas from any primary database instance. Read replicas are useful for:
- Offloading read traffic from the primary database
- Disaster recovery within a region (when the replica is configured in a different AZ; cross-region recovery requires a replica in another region)
- Analytics queries that shouldn't impact production write performance
How Read Replicas Work
A read replica is an asynchronous copy of the primary database. It receives updates from the primary via replication and can serve read-only queries. In the Terraform module:
- Replicas inherit most settings from the primary (VPC, subnets, security groups, encryption)
- You can override specific settings per replica (instance class, availability zone, etc.)
- Each replica gets its own endpoint for read traffic
For the step-by-step to add one, see How-To → Use Case 5: Adding a Read Replica.
Read Replica Drawbacks and Complexity
Read replicas are not a free performance boost — they add real operational complexity. Weigh these before adding one:
- Replication lag is real and unbounded. Replicas are asynchronous. Under load, writes to the primary can take seconds (occasionally longer) to appear on a replica. Applications that read-after-write from a replica will see stale data. If your workload can't tolerate eventual consistency, you need read-your-own-writes routing back to the primary.
- Connection-string routing is your problem. RDS does not give you an automatic reader endpoint outside of Aurora — each replica has its own hostname. You (the application) need to decide which queries go to which endpoint, how to fail over between replicas if one is down, and how to redirect reads back to the primary in a pinch.
- Upgrades have ordering constraints. Minor/major version upgrades have to be applied in a specific order across primary and replicas, and a major version upgrade on the primary typically requires the replicas to be rebuilt (or upgraded in sequence). Plan upgrades as a coordinated operation, not one instance at a time.
- Failover behavior is different from Multi-AZ. A read replica is not a Multi-AZ standby. Promoting a replica to replace a failed primary is a manual operation that breaks replication to the other replicas and changes the endpoint. For automatic failover you still need multi_az = true on the primary.
- Storage and backup cost multiplies. Each replica is a full-size instance with its own storage, IOPS, and backup footprint. Three replicas ≈ 4× the cost of the primary alone.
- Parameter drift. Replica-level parameter group changes, storage type mismatches, or instance-class mismatches can cause replication to fall behind or break entirely. Keep replicas as close to the primary's shape as possible unless you have a specific reason to diverge.
- Backups still come from the primary. Replicas do not reduce backup load on the primary. If your motivation is "offload backups," a read replica is not the right tool — look at snapshot schedules or PITR configuration instead.
If after reading this you're still sure you need a replica, follow the how-to and reach out to @platform early — we can help you right-size it and set up endpoint routing.
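To make the routing burden concrete, here is a minimal sketch of application-side read routing with read-your-own-writes pinning. It is one possible approach, not a Platform-provided component: the class, its parameters, and the fixed pin window are hypothetical, and a fixed window does not bound replication lag, so it reduces stale reads rather than eliminating them.

```python
class ReadRouter:
    """Route reads to replicas, pinning recent writers to the primary."""

    def __init__(self, primary, replicas, pin_seconds=5.0):
        self.primary = primary
        self.replicas = list(replicas)
        self.pin_seconds = pin_seconds
        self._last_write = {}  # session id -> time of last write
        self._next = 0  # round-robin cursor over replicas

    def record_write(self, session_id, now):
        """Remember when this session last wrote to the primary."""
        self._last_write[session_id] = now

    def endpoint_for_read(self, session_id, now):
        """Pick the endpoint that should serve this session's read."""
        last = self._last_write.get(session_id)
        if last is not None and now - last < self.pin_seconds:
            return self.primary  # read-your-own-writes
        if not self.replicas:
            return self.primary  # no replica available, fall back
        endpoint = self.replicas[self._next % len(self.replicas)]
        self._next += 1
        return endpoint
```

The caller passes the current time explicitly (for example from time.monotonic()), which keeps the router easy to test. Health checks and failover between replicas are left out and would still need to be handled by the application.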
Environment Differences
Development vs Production
| Aspect | Development | Production |
|---|---|---|
| Engines allowed | PostgreSQL, MySQL | PostgreSQL only |
| Performance Insights | Optional | Required |
| Min backup retention | 1 day | 7 days |
| Deletion protection | Recommended | Strongly recommended |
| Multi-AZ | Optional | Recommended for critical DBs |
| Public access | Allowed for specific DBs | Restricted |
| Password management | Traditional (manual) | Managed (auto-rotation) recommended |
Account Structure
Norton's AWS accounts map to environments as follows:
- Development account (637244866643): Hosts dev, QA, and staging workloads
- Production account (100478842646): Hosts production workloads
Each account has its own:
- VPCs and networking configuration
- KMS encryption keys
- Secrets Manager secrets
- IAM roles and policies
- Terraform state (stored in S3)
CI/CD Pipeline Flow
What Happens on a Merge Request
If OPA finds violations, the pipeline fails and posts the specific issues as comments on the MR. Fix the violations and push again — the pipeline re-runs automatically.
What Happens on Merge to Main
Changes are applied to the specific environment based on the directory path:
- Changes in accounts/development/rds/ → applied to the Development AWS account
- Changes in accounts/production/rds/ → applied to the Production AWS account
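The path-based targeting can be pictured as a prefix match on the files changed by a merge. How the pipeline actually detects changed paths is not described here, so treat this as a sketch; the function name and the example file paths are hypothetical.

```python
ENVIRONMENT_PREFIXES = {
    "accounts/development/rds/": "development",
    "accounts/production/rds/": "production",
}


def target_environments(changed_paths):
    """Return the environments whose RDS config a merge touches."""
    targets = set()
    for path in changed_paths:
        for prefix, environment in ENVIRONMENT_PREFIXES.items():
            if path.startswith(prefix):
                targets.add(environment)
    return sorted(targets)


print(target_environments(["accounts/development/rds/terraform.tfvars"]))
```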
Drift Detection
The Platform team runs periodic drift detection to identify changes made outside of Terraform (e.g., manual AWS Console changes). When drift is detected, the team reconciles state: it compares what is declared in the tfvars on main against what is actually running in AWS, then forces the running resources back to match the tfvars. In practice that usually means reverting your manual change. If you added a parameter group tweak, replica, tag, or connection setting via the Console, reconciliation will undo it on the next apply.
What this means for you:
- Any change made outside the Infrastructure repository is temporary. It will be reverted — usually within a pipeline run or two.
- If you need an emergency change (e.g., mid-incident), make it in the Console to unblock, then immediately open an MR to codify the change so reconciliation doesn't wipe it out.
- If you see drift reverted on something you still need, open an MR — don't re-apply via the Console.
The drift report is published at W.W. Norton Drift Report.
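Conceptually, drift detection is a comparison between declared and actual attributes, with the declared value winning. The sketch below models that comparison on plain dictionaries. It does not describe the tooling the Platform team actually runs, and the attribute values are made up.

```python
def detect_drift(declared, actual):
    """Return {attribute: (declared_value, actual_value)} per mismatch."""
    drift = {}
    for key in sorted(set(declared) | set(actual)):
        if declared.get(key) != actual.get(key):
            drift[key] = (declared.get(key), actual.get(key))
    return drift


declared = {"instance_class": "db.t3.medium", "tags": {"Team": "platform"}}
# Someone added a tag in the Console.
actual = {
    "instance_class": "db.t3.medium",
    "tags": {"Team": "platform", "Temp": "debug"},
}

for attribute, (want, have) in detect_drift(declared, actual).items():
    print(f"{attribute}: declared={want!r} actual={have!r}")
```

Reconciliation then amounts to applying the declared side of every reported pair, which is why a Console-only change does not survive.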
Design Rationale
Why a Single Variables File Per Environment
All RDS instances for an environment are defined in a single terraform.tfvars file rather than individual files per database. This approach:
- Provides visibility — Developers can see all databases in an environment at a glance
- Enables reference — New databases can copy patterns from existing ones
- Simplifies pipeline — One plan and apply per environment, not per database
- Reduces duplication — Shared environment values (like VPC IDs) are visible in context
Why OPA Over Terraform Sentinel
OPA was chosen over Terraform Cloud's Sentinel because:
- Open source — No vendor lock-in to Terraform Cloud
- Data-driven — Allowlists are JSON files, not code changes
- CI-native — Runs in any CI pipeline, not tied to a specific Terraform backend
- Flexibility and expressiveness — Rego, OPA's policy language, lets us express complex, relational rules (e.g., "deep archive must be strictly greater than glacier transition days, and both must fall inside per-environment ranges") cleanly. As our policy needs grow — owner-based deletion guards, tag enforcement, conditional rules per account — Rego gives us room to expand without rewriting the evaluation layer.
- Reusable across resource types — One OPA toolchain covers S3, RDS, and future modules. Sentinel would couple us to a specific Terraform workflow, whereas OPA lets Platform share policy-writing patterns with Kubernetes admission control and other adjacent systems.
Why SAFE/READ-MORE Annotations
Rather than preventing all changes, the annotations educate developers about risk:
- SAFE changes can be self-served with confidence
- READ-MORE changes signal "pause and understand before proceeding"
- This builds developer knowledge over time rather than creating dependency on the Platform team
References
Internal (Norton)
- How-to guide: Managing RDS Databases with Terraform
- Infrastructure Repository: wwnorton/ops/infrastructure
- OPA Policy: policies/accounts/development/rds/policy.rego
- Allowlist Data: policies/data/rds/allowlist.json
External (AWS & HashiCorp)
- AWS RDS Documentation: Amazon RDS User Guide
- Terraform Resource: aws_db_instance
- OPA Documentation: Open Policy Agent