Explanation: Understanding RDS Infrastructure

Introduction

This document explains the architecture, design decisions, and policy guardrails behind Norton's self-service RDS infrastructure. It covers how the Terraform module works, how the CI/CD pipeline validates and applies changes, what OPA policies enforce, and the rationale behind the choices made. Read the how-to guide first if you need practical step-by-step instructions for creating or modifying databases.

The Problem: Manual Database Provisioning

Before the self-service RDS workflow, creating or modifying a database required:

A ticket to the Platform team describing the desired configuration
Platform engineers manually writing Terraform code
Back-and-forth to clarify requirements (sizing, networking, access)
Manual review and apply cycles

This created bottlenecks where simple database changes could take days. Development teams couldn't iterate quickly on database configurations, and the Platform team spent significant time on routine provisioning tasks.

What Self-Service Solves

The self-service model shifts the workflow:

Developers own the configuration. OPA policies provide automated guardrails. The Platform team provides oversight without being a bottleneck.

Architecture Overview

Module Structure

The RDS Terraform module lives at aws/rds/ in the Infrastructure repository:

aws/rds/
├── rds.tf         # DB instances, subnet groups, security groups, read replicas
├── variables.tf   # Input variable definitions with types and defaults
├── outputs.tf     # Exported attributes (endpoints, ARNs, etc.)
├── locals.tf      # Subnet/SG resolution logic, password lookup, replica flattening
└── data.tf        # Data sources for existing subnet groups, SGs, and secrets

How Instances Are Configured

Each environment has its own variables file at accounts/{environment}/rds/terraform.tfvars. These files contain a map called rds_instances where each key is a logical name for a database and the value is an object with all configuration attributes.

The module iterates over this map and creates the corresponding AWS resources for each entry.
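The shape of that map can be pictured with a small Python model. This is an illustration only: the real input is HCL in terraform.tfvars and the iteration happens inside the Terraform module. The instance names (orders-db, reports-db) and all values are hypothetical; the attribute names are ones mentioned elsewhere in this document.

```python
# Illustrative model of the rds_instances map. The real input is HCL in
# terraform.tfvars; the keys and values below are hypothetical examples.
rds_instances = {
    "orders-db": {"engine": "postgres", "instance_class": "db.t3.medium", "multi_az": False},
    "reports-db": {"engine": "postgres", "instance_class": "db.t3.large", "multi_az": True},
}


def planned_instances(instances):
    """Return one (logical_name, config) pair per map entry, sorted by name.

    This mirrors the idea that each key of the map yields one database
    instance. It does not call AWS or Terraform.
    """
    return [(name, dict(cfg)) for name, cfg in sorted(instances.items())]


for name, cfg in planned_instances(rds_instances):
    print(f"{name}: {cfg['engine']} on {cfg['instance_class']}")
```

Because each database is keyed by its logical name, adding or removing one entry affects only that entry's resources in the model above.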

Two Password Management Approaches

The module supports two approaches for managing the master database password:

Traditional Password (most common for dev):

You create a secret in AWS Secrets Manager with {"password": "your-password"}
Set password_secret_name to the secret path
Terraform reads the password at apply time via a data source
You manage rotation manually

Managed Password (recommended for production):

Set manage_master_user_password = true
AWS RDS creates and manages the password in Secrets Manager automatically
AWS handles rotation on a schedule
Optionally specify master_user_secret_kms_key_id for custom encryption

When to use which: Use traditional passwords for development databases where you need to know the password for local development tools. Use managed passwords for production databases where automatic rotation and reduced human access to credentials are important.
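The difference between the two approaches can be sketched as a small decision function. This is a Python model, not the module's Terraform code. The `secrets` dictionary stands in for Secrets Manager, the secret path used in the usage note is hypothetical, and the choice to let the managed flag win when both settings are present is an assumption made for the sketch.

```python
import json


def resolve_master_password(instance, secrets):
    """Return the password to pass to RDS, or None if AWS manages it.

    `instance` is one entry of rds_instances. `secrets` stands in for
    Secrets Manager and maps a secret name to its JSON string value.
    """
    if instance.get("manage_master_user_password"):
        # Managed approach: RDS creates and rotates the secret itself,
        # so no password value is read or supplied here.
        return None
    secret_name = instance.get("password_secret_name")
    if not secret_name:
        raise ValueError("set password_secret_name or manage_master_user_password")
    # Traditional approach: the secret body is {"password": "..."}.
    return json.loads(secrets[secret_name])["password"]
```

For example, with a hypothetical secret stored under `dev/rds/orders`, the traditional path returns the stored password, while an instance with `manage_master_user_password` set returns `None` because no password value passes through the configuration.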

Networking

Understanding the networking layer is important for choosing the right configuration values.

VPC (Virtual Private Cloud)

Every RDS instance runs inside a VPC. Norton's AWS accounts have multiple VPCs for different purposes. You must specify the vpc_id where your database should be created. This determines which network your database is reachable from.

The VPC you choose should match the VPC where your application runs. If your EKS pods are in VPC A, your database should also be in VPC A (or have cross-VPC connectivity configured by the Platform team).

Subnet Groups

A DB subnet group is a collection of subnets (in different Availability Zones) where RDS can place your database. This is required for Multi-AZ deployments and general high availability.

How the module handles subnet groups:

If you set subnet_group_name to an existing group name → the module uses it
If you don't set subnet_group_name → the module creates a new subnet group from subnet_ids

For most development databases, use the shared subnet group (dev-group). This is pre-configured with subnets across multiple AZs.
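The two-branch rule above can be modeled in a few lines of Python. This is a sketch of the decision only. The actual resolution lives in the module's locals.tf, and the error raised when neither value is set is an assumption, not documented behavior.

```python
def resolve_subnet_group(instance):
    """Return ("existing", name) or ("create", subnet_ids) for one instance."""
    name = instance.get("subnet_group_name")
    if name:
        # An existing group was named, so it is used as-is.
        return ("existing", name)
    subnet_ids = list(instance.get("subnet_ids") or [])
    if not subnet_ids:
        # Assumed behavior for the sketch: with no group name there must be
        # subnets to build a new group from.
        raise ValueError("set subnet_group_name or provide subnet_ids")
    return ("create", subnet_ids)
```

With the shared group, `resolve_subnet_group({"subnet_group_name": "dev-group"})` takes the first branch and no new subnet group is created.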

Security Groups

Security groups act as firewalls controlling which traffic can reach your database.

How the module handles security groups:

If you set security_group_ids → the module uses those existing security groups
If you set security_group_name → the module looks up the named security group
If neither is set → the module creates a new security group allowing traffic from allowed_cidr_blocks on your database's port

For most databases, use the shared security group for your environment. Custom security groups are typically only needed for databases with strict network isolation requirements.
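The three security group cases can be sketched the same way. This Python model assumes that security_group_ids is checked before security_group_name when both are set; the document does not state the precedence, so treat that ordering as hypothetical.

```python
def resolve_security_groups(instance, port):
    """Describe which security group path applies to one instance."""
    ids = instance.get("security_group_ids")
    if ids:
        # Existing groups were given by ID and are used directly.
        return {"mode": "existing_ids", "security_group_ids": list(ids)}
    name = instance.get("security_group_name")
    if name:
        # A named group is looked up rather than created.
        return {"mode": "lookup_by_name", "security_group_name": name}
    # Neither was set: a new group allows the listed CIDR blocks on the DB port.
    cidrs = instance.get("allowed_cidr_blocks") or []
    return {
        "mode": "create",
        "ingress": [{"port": port, "cidr": cidr} for cidr in cidrs],
    }
```

Only the last branch creates anything new, which is why the shared security group is the lower-effort choice for most databases.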

OPA Policy Guardrails

The CI pipeline evaluates every RDS change against Open Policy Agent (OPA) policies before Terraform can apply. These policies enforce organizational standards and prevent misconfigurations.

How OPA Evaluation Works

The CI pipeline runs terraform plan and converts the output to JSON
OPA evaluates the plan JSON against policy rules in policies/accounts/{environment}/rds/policy.rego
If any deny rules match, the pipeline fails and posts the violation messages as MR comments
The developer fixes the violations and pushes updated commits
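The evaluation steps above can be approximated in Python to show the data flow. The real rules are written in Rego in policy.rego; this sketch only mimics the idea of deny rules that emit messages. It relies on the `resource_changes`, `type`, `address`, and `change.after` fields of Terraform's JSON plan output, and it checks just two of the policies as examples.

```python
def deny_messages(plan, allowlist):
    """Collect violation messages for aws_db_instance changes in a JSON plan."""
    messages = []
    for rc in plan.get("resource_changes", []):
        if rc.get("type") != "aws_db_instance":
            continue
        after = rc["change"].get("after")
        if after is None:
            # Deletions have no planned state to validate.
            continue
        engine = after.get("engine")
        if engine not in allowlist["allowed_engines"]:
            messages.append(f"{rc['address']}: engine {engine!r} is not allowed")
        if allowlist.get("storage_encrypted_required") and not after.get("storage_encrypted"):
            messages.append(f"{rc['address']}: storage encryption must be enabled")
    return messages


def pipeline_passes(plan, allowlist):
    """The pipeline fails as soon as any deny message exists."""
    return not deny_messages(plan, allowlist)
```

In this model the list returned by `deny_messages` corresponds to the comments posted on the MR, and an empty list corresponds to a passing policy stage.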

What the Policies Enforce

Every key named in the table below (the allowed_* lists, the *_required flags, min_backup_retention_days, and allow_public_access_identifiers) is a field in policies/data/rds/allowlist.json. Open that file to see the current values and which engine, version, instance class, port, VPC, and subnet group entries are live right now.

Check	What's Validated	How It Works
Engine type	Only allowed engines	Must be in `allowed_engines` list
Engine version	Must be in allowlist	Must be in `allowed_engine_versions` list
Instance class	Must be in allowlist	Must be in `allowed_instance_classes` list
Port	Must match engine	Must match the engine-to-port mapping in `allowed_ports`
Storage encryption	Must be enabled	Controlled by `storage_encrypted_required` flag
Backup retention	Minimum days required	Must be >= `min_backup_retention_days`
Public access	Only specific identifiers	Identifier must be in `allow_public_access_identifiers`
Performance Insights	May be required	Controlled by `performance_insights_required` flag
VPC	Must be allowlisted	Must be in `allowed_vpc_ids` (empty = unrestricted)
Subnet group	Must be allowlisted	Must be in `allowed_subnet_group_names` (empty = unrestricted)

These values change over time. Engine versions are deprecated, new instance classes are added, and networking resources evolve. Always check the current allowlist before submitting your MR rather than relying on values printed in this document.

The source of truth for all allowed values is policies/data/rds/allowlist.json in the Infrastructure repository. This file contains separate dev and prod sections with all current allowlists. The Platform team updates this file when new engine versions, instance classes, or networking resources need to be supported.
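Three of the table's rules have semantics that are easy to misread: the engine-to-port mapping, the minimum backup retention, and the "empty = unrestricted" behavior of the VPC allowlist. The Python sketch below models them. The shape of `allowed_ports` (a dictionary from engine to port) is an assumption based on the table's wording; check allowlist.json for the real structure and values.

```python
def in_allowlist(value, allowed):
    """An empty allowlist means the check is unrestricted."""
    return not allowed or value in allowed


def check_instance(cfg, allowlist):
    """Return the names of the attributes that violate the modeled checks."""
    problems = []
    # Port must match the port mapped to the instance's engine.
    if cfg["port"] != allowlist["allowed_ports"].get(cfg["engine"]):
        problems.append("port")
    # Backup retention must be at least the configured minimum.
    if cfg["backup_retention_period"] < allowlist["min_backup_retention_days"]:
        problems.append("backup_retention_period")
    # VPC must be allowlisted unless the allowlist is empty.
    if not in_allowlist(cfg["vpc_id"], allowlist.get("allowed_vpc_ids", [])):
        problems.append("vpc_id")
    return problems
```

A PostgreSQL instance configured with port 3306 would fail the port check in this model even if 3306 is a permitted port for MySQL, because the comparison is made per engine.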

Policy Change Detection

OPA policies only evaluate attributes that are changing in the Terraform plan. If you're only updating tags on an existing database, the engine version check won't trigger even if the current version isn't in the latest allowlist. This prevents existing databases from being blocked by new policy additions.
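The change-detection behavior can be sketched by comparing the `before` and `after` objects that Terraform's JSON plan records for each resource change. This Python model illustrates the idea for the engine version check only; the actual logic is in the Rego policy, and the version strings in the usage note are hypothetical.

```python
def changed_attributes(change):
    """Return the set of attribute names whose value differs in the plan."""
    before = change.get("before") or {}
    after = change.get("after") or {}
    return {k for k in set(before) | set(after) if before.get(k) != after.get(k)}


def engine_version_violation(change, allowed_versions):
    """Check engine_version only when the plan actually changes it."""
    if "engine_version" not in changed_attributes(change):
        # Unchanged attributes are not re-evaluated against the allowlist.
        return None
    version = (change.get("after") or {}).get("engine_version")
    if version in allowed_versions:
        return None
    return f"engine_version {version!r} is not in the allowlist"
```

A tags-only update on a database running a version that has since left the allowlist returns `None` here, while an MR that moves the same database to another disallowed version returns a violation message.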

Known Policy Gaps

The following areas are documented for transparency and are tracked for future improvement:

Tag enforcement: OPA does not currently enforce that required tags (Team, Product) are present. This is documented as a best practice but not blocking.
Storage size bounds: Instance classes are limited but allocated_storage has no upper bound in OPA. The practical limit is what exists in production today.
Encryption changes: The policy prevents disabling encryption but does not restrict who can change encryption-related settings. Changing storage_encrypted or kms_key_id on an existing database can be destructive because it forces Terraform to recreate the instance. These properties are flagged as READ-MORE in the tfvars.
Production RDS policy: Currently only the development account has a dedicated RDS OPA policy. Production uses the shared allowlist but evaluation may not be active for all fields.

SAFE vs READ-MORE Properties

The development tfvars file includes annotations (as comments) categorizing every configurable property by risk level. This is a developer experience feature to help you understand the impact of changes before making them.

What SAFE Means

SAFE properties can be changed without risk of data loss or service destruction. Some, such as instance_class, may still cause a brief restart, but none of them destroy your database or require recreation.

Examples:

tags — Metadata only, no infrastructure impact
allocated_storage — Storage increases are online operations
multi_az — Adds/removes standby replica without downtime
backup_retention_period — Changes backup policy only

What READ-MORE Means

READ-MORE properties require you to understand the implications before changing them. In some cases, changing these values on an existing database will cause Terraform to destroy and recreate the instance, resulting in data loss if not handled properly.

Examples:

engine — Changing engine type (e.g., postgres → mysql) destroys the instance
db_name — Changing on an existing DB forces recreation
username — Changing on an existing DB forces recreation
storage_encrypted — Enabling encryption on an unencrypted DB requires a snapshot restore (destructive)
kms_key_id — Changing the encryption key forces recreation

If a property is marked READ-MORE, always review the linked CI job proof or the Terraform aws_db_instance documentation before making the change. When in doubt, contact the Platform team.

The SAFE/READ-MORE annotations appear as comments at the top of the accounts/development/rds/terraform.tfvars file, along with a link to the main documentation for each property.
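As a rough self-check before opening an MR, the example properties listed in this section can be turned into a lookup. The Python sketch below is hypothetical tooling, not something the pipeline runs, and it covers only the properties named as examples here. The annotations in the tfvars file remain the authoritative list.

```python
# Example properties taken from this section only; the tfvars comments are
# the authoritative source for the full categorization.
SAFE = {"tags", "allocated_storage", "multi_az", "backup_retention_period", "instance_class"}
READ_MORE = {"engine", "db_name", "username", "storage_encrypted", "kms_key_id"}


def classify_changes(changed_keys):
    """Group changed property names by the risk label used in this document."""
    result = {"SAFE": [], "READ-MORE": [], "UNLISTED": []}
    for key in sorted(changed_keys):
        if key in READ_MORE:
            result["READ-MORE"].append(key)
        elif key in SAFE:
            result["SAFE"].append(key)
        else:
            # Not one of the examples above: look it up in the tfvars comments.
            result["UNLISTED"].append(key)
    return result
```

Any non-empty READ-MORE list is the cue to review the linked documentation before pushing the change.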

Read Replicas

The RDS module supports creating read replicas from any primary database instance. Read replicas are useful for:

Offloading read traffic from the primary database
Disaster recovery (a replica in a different AZ protects reads against an AZ failure; cross-region recovery requires a replica in a different Region)
Analytics queries that shouldn't impact production write performance

How Read Replicas Work

A read replica is an asynchronous copy of the primary database. It receives updates from the primary via replication and can serve read-only queries. In the Terraform module:

Replicas inherit most settings from the primary (VPC, subnets, security groups, encryption)
You can override specific settings per replica (instance class, availability zone, etc.)
Each replica gets its own endpoint for read traffic

For the step-by-step to add one, see How-To → Use Case 5: Adding a Read Replica.

Read Replica Drawbacks and Complexity

Read replicas are not a free performance boost — they add real operational complexity. Weigh these before adding one:

Replication lag is real and unbounded. Replicas are asynchronous. Under load, writes to the primary can take seconds (occasionally longer) to appear on a replica. Applications that read-after-write from a replica will see stale data. If your workload can't tolerate eventual consistency, you need read-your-own-writes routing back to the primary.
Connection-string routing is your problem. RDS does not give you an automatic reader endpoint outside of Aurora — each replica has its own hostname. You (the application) need to decide which queries go to which endpoint, how to fail over between replicas if one is down, and how to redirect reads back to the primary in a pinch.
Upgrades have ordering constraints. Minor/major version upgrades have to be applied in a specific order across primary and replicas, and a major version upgrade on the primary typically requires the replicas to be rebuilt (or upgraded in sequence). Plan upgrades as a coordinated operation, not one instance at a time.
Failover behavior is different from Multi-AZ. A read replica is not a Multi-AZ standby. Promoting a replica to replace a failed primary is a manual operation that breaks replication to the other replicas and changes the endpoint. For automatic failover you still need multi_az = true on the primary.
Storage and backup cost multiplies. Each replica is a full-size instance with its own storage, IOPS, and backup footprint. Three replicas ≈ 4× the cost of the primary alone.
Parameter drift. Replica-level parameter group changes, storage type mismatches, or instance-class mismatches can cause replication to fall behind or break entirely. Keep replicas as close to the primary's shape as possible unless you have a specific reason to diverge.
Backups still come from the primary. Replicas do not reduce backup load on the primary. If your motivation is "offload backups," a read replica is not the right tool — look at snapshot schedules or PITR configuration instead.
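The read-your-own-writes routing mentioned in the first two points can be sketched as a small application-side router. This is a minimal Python illustration under stated assumptions: the endpoint names, the session identifier, and the fixed pin window are all hypothetical. Because replication lag is unbounded, a fixed window reduces stale reads but does not guarantee fresh ones.

```python
import time


class ReadRouter:
    """Send reads to a replica unless the session wrote recently."""

    def __init__(self, primary, replicas, pin_seconds=5.0, clock=time.monotonic):
        self.primary = primary
        self.replicas = list(replicas)
        self.pin_seconds = pin_seconds
        self.clock = clock
        self._last_write = {}  # session_id -> time of the most recent write
        self._next = 0         # round-robin cursor over replicas

    def record_write(self, session_id):
        """Call after every write so later reads can be pinned to the primary."""
        self._last_write[session_id] = self.clock()

    def endpoint_for_read(self, session_id):
        last = self._last_write.get(session_id)
        if last is not None and self.clock() - last < self.pin_seconds:
            # Recent writer: read from the primary to avoid stale data.
            return self.primary
        if not self.replicas:
            # No replica available: fall back to the primary.
            return self.primary
        endpoint = self.replicas[self._next % len(self.replicas)]
        self._next += 1
        return endpoint
```

The application calls `record_write` after each write and `endpoint_for_read` before each read. Replica health checks and failover between replicas are left out of the sketch and would still need to be handled by the application.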

If after reading this you're still sure you need a replica, follow the how-to and reach out to @platform early — we can help you right-size it and set up endpoint routing.

Environment Differences

Development vs Production

Aspect	Development	Production
Engines allowed	PostgreSQL, MySQL	PostgreSQL only
Performance Insights	Optional	Required
Min backup retention	1 day	7 days
Deletion protection	Recommended	Strongly recommended
Multi-AZ	Optional	Recommended for critical DBs
Public access	Allowed for specific DBs	Restricted
Password management	Traditional (manual)	Managed (auto-rotation) recommended

Account Structure

Norton's AWS accounts map to environments as follows:

Development account (637244866643): Hosts dev, QA, and staging workloads
Production account (100478842646): Hosts production workloads

Each account has its own:

VPCs and networking configuration
KMS encryption keys
Secrets Manager secrets
IAM roles and policies
Terraform state (stored in S3)

CI/CD Pipeline Flow

What Happens on a Merge Request

If OPA finds violations, the pipeline fails and posts the specific issues as comments on the MR. Fix the violations and push again — the pipeline re-runs automatically.

What Happens on Merge to Main

Changes are applied to the specific environment based on the directory path:

Changes in accounts/development/rds/ → applied to the Development AWS account
Changes in accounts/production/rds/ → applied to the Production AWS account
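The path-to-environment mapping can be expressed as a short function. This Python sketch only models the rule stated above. How the real pipeline detects changed paths, and what it does for files outside accounts/, is not covered here.

```python
def target_environments(changed_paths):
    """Return the environments whose RDS configuration changed, sorted."""
    envs = set()
    for path in changed_paths:
        parts = path.split("/")
        # Match accounts/{environment}/rds/<file>.
        if len(parts) >= 4 and parts[0] == "accounts" and parts[2] == "rds":
            envs.add(parts[1])
    return sorted(envs)
```

A merge that touches only accounts/development/rds/terraform.tfvars resolves to the development environment alone in this model, so production is left untouched.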

Drift Detection

The Platform team runs periodic drift detection to identify changes made outside of Terraform (e.g., manual AWS Console changes). When drift is detected, the team reconciles state: it compares what is declared in the tfvars on main against what is actually running in AWS, then forces reality back to match the tfvars. In practice that usually means reverting the manual change. If you added a parameter group tweak, replica, tag, or connection setting via the Console, reconciliation will undo it on the next apply.

What this means for you:

Any change made outside the Infrastructure repository is temporary. It will be reverted — usually within a pipeline run or two.
If you need an emergency change (e.g., mid-incident), make it in the Console to unblock, then immediately open an MR to codify the change so reconciliation doesn't wipe it out.
If you see drift reverted on something you still need, open an MR — don't re-apply via the Console.
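The comparison at the heart of drift detection can be pictured as a diff between the declared and the running configuration. The Python sketch below is a simplified model with hypothetical attribute values. The real comparison is performed by Terraform against its state and the AWS APIs.

```python
def detect_drift(declared, actual):
    """Return {attribute: (declared_value, actual_value)} for every mismatch."""
    keys = sorted(set(declared) | set(actual))
    return {
        key: (declared.get(key), actual.get(key))
        for key in keys
        if declared.get(key) != actual.get(key)
    }


# Hypothetical example: a tag was added in the Console but not in the tfvars.
declared = {"backup_retention_period": 7, "tags": {"Team": "payments"}}
actual = {"backup_retention_period": 7, "tags": {"Team": "payments", "Debug": "true"}}
print(detect_drift(declared, actual))
```

Since the declared side always wins during reconciliation, the only way to keep the extra tag in this example is to add it to the tfvars through an MR.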

The drift report is published at W.W. Norton Drift Report.

Design Rationale

Why a Single Variables File Per Environment

All RDS instances for an environment are defined in a single terraform.tfvars file rather than individual files per database. This approach:

Provides visibility — Developers can see all databases in an environment at a glance
Enables reference — New databases can copy patterns from existing ones
Simplifies pipeline — One plan and apply per environment, not per database
Reduces duplication — Shared environment values (like VPC IDs) are visible in context

Why OPA Over Terraform Sentinel

OPA was chosen over Terraform Cloud's Sentinel because:

Open source — No vendor lock-in to Terraform Cloud
Data-driven — Allowlists are JSON files, not code changes
CI-native — Runs in any CI pipeline, not tied to a specific Terraform backend
Flexibility and expressiveness — Rego, OPA's policy language, lets us express complex, relational rules (e.g., "deep archive must be strictly greater than glacier transition days, and both must fall inside per-environment ranges") cleanly. As our policy needs grow — owner-based deletion guards, tag enforcement, conditional rules per account — Rego gives us room to expand without rewriting the evaluation layer.
Reusable across resource types — One OPA toolchain covers S3, RDS, and future modules. Sentinel would couple us to a specific Terraform workflow, whereas OPA lets Platform share policy-writing patterns with Kubernetes admission control and other adjacent systems.

Why SAFE/READ-MORE Annotations

Rather than preventing all changes, the annotations educate developers about risk:

SAFE changes can be self-served with confidence
READ-MORE changes signal "pause and understand before proceeding"
This builds developer knowledge over time rather than creating dependency on the Platform team
