2 posts tagged with "Platform" | Norton Digital Product Guidebook

Incident: Testmaker Export Outage

November 7, 2025 · 7 min read

Technical Product Manager

Date of incident: October 28, 2025
Recovery time: ~17 hours total impact (5:25 PM EDT Oct 28 - 10:30 AM EDT Oct 29)
Impacted product(s): Testmaker, potentially more
Primary investigator(s): Ankur Wase, Roberto Crisial, Bo Kaung
Incident repair team: Sanjay Akula, Roberto Crisial, Steve Hurlock

Summary

A short summary of the incident, impact, and resolution

On October 28-29, 2025, Testmaker experienced a global outage affecting all export and import functionality for instructors. The incident was caused by an AWS account-level public access block configuration change implemented as a security measure on October 28th. This configuration blocked S3 uploads that used the ACL: 'public-read' parameter, which multiple Testmaker applications (scheduler-api, document-api, content-extraction-api) relied upon for file exports and imports. The issue likely also impacted other organizational S3 buckets, including Marketing team's CDN content delivery systems, though these impacts went unreported. The issue was resolved by rolling back the account-level public access block configuration, allowing bucket-level ACL policies to govern access control instead.

Impact:

967 4XX errors recorded in CloudWatch for the Testmaker S3 bucket alone, indicative of export/import failures
16 Service Desk cases from 15 instructors across 13 colleges/universities and 2 high schools
Backend manuscript import functionality also affected (internal users only)
Likely unreported issues with other S3-dependent systems (e.g., Marketing team's CDN content delivery)

Five+ Whys

Why was Testmaker failing to export files?

Testmaker's export function relies on S3: it creates files and uploads them to public blob storage, from which they are served to the users who need to download them. Application logs identified the S3 upload function as the point of failure.

Why did the S3 upload fail?

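An account-level AWS S3 public access block configuration was enabled on October 28th as a security measure to prevent public access to S3 resources across the AWS account. With the block in place, S3 rejected any upload that set the ACL: 'public-read' parameter, which the Testmaker upload code did.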
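
The interaction between the account-level block and the upload's ACL can be modeled in a few lines. This is a minimal sketch of the documented S3 behavior, not a reproduction of the Testmaker code: the class and function names are hypothetical, and the set of public canned ACLs is illustrative rather than exhaustive.

```python
from dataclasses import dataclass
from typing import Optional

# Canned ACLs that grant access to everyone (illustrative, not exhaustive).
PUBLIC_CANNED_ACLS = {"public-read", "public-read-write"}


@dataclass(frozen=True)
class PublicAccessBlock:
    """The four S3 Block Public Access settings at one level (account or bucket)."""
    block_public_acls: bool = False
    ignore_public_acls: bool = False
    block_public_policy: bool = False
    restrict_public_buckets: bool = False


def effective_block(account: PublicAccessBlock, bucket: PublicAccessBlock) -> PublicAccessBlock:
    """S3 applies the most restrictive combination: a setting is on if either level enables it."""
    return PublicAccessBlock(
        block_public_acls=account.block_public_acls or bucket.block_public_acls,
        ignore_public_acls=account.ignore_public_acls or bucket.ignore_public_acls,
        block_public_policy=account.block_public_policy or bucket.block_public_policy,
        restrict_public_buckets=account.restrict_public_buckets or bucket.restrict_public_buckets,
    )


def upload_allowed(acl: Optional[str], account: PublicAccessBlock, bucket: PublicAccessBlock) -> bool:
    """Return False when the upload requests a public ACL and public ACLs are blocked."""
    blocked = effective_block(account, bucket).block_public_acls
    return not (blocked and acl in PUBLIC_CANNED_ACLS)


if __name__ == "__main__":
    open_settings = PublicAccessBlock()
    account_locked = PublicAccessBlock(block_public_acls=True)
    print(upload_allowed("public-read", open_settings, open_settings))   # before the change
    print(upload_allowed("public-read", account_locked, open_settings))  # after the change
    print(upload_allowed(None, account_locked, open_settings))           # upload without an ACL
```

Under this model an upload without a public ACL still succeeds while the block is on, which would be consistent with the plain AWS CLI uploads that worked during triage.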

Why was this change made?

A bug in “AWS Security Command Center”, our home-grown security tool built with Amazon Q code assistance, accidentally turned on the public access block feature. The tool was designed to run only a hardening CHECK on the Norton accounts, but it was applying the hardening fixes as well as checking for them.
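
One common way to keep a check from turning into a change is to make remediation an explicit opt-in. The sketch below is an assumption about how such a guard could look, not the actual code of the tool: `audit_public_access_block`, `DESIRED_PUBLIC_ACCESS_BLOCK`, and the `apply` flag are hypothetical names. The client is expected to expose `get_public_access_block` and `put_public_access_block` in the shape of the boto3 `s3control` client, and the sketch does not handle the error the real call may raise when no configuration exists yet.

```python
from typing import Any, Dict

# Hypothetical hardening target: all four Block Public Access settings enabled.
DESIRED_PUBLIC_ACCESS_BLOCK: Dict[str, bool] = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": True,
    "RestrictPublicBuckets": True,
}


def audit_public_access_block(client: Any, account_id: str, apply: bool = False) -> Dict[str, Any]:
    """Report drift from the hardening target; change nothing unless apply=True.

    `client` is assumed to behave like boto3.client("s3control").
    """
    response = client.get_public_access_block(AccountId=account_id)
    current = response["PublicAccessBlockConfiguration"]

    # Settings whose current value differs from the desired value.
    drift = {
        name: wanted
        for name, wanted in DESIRED_PUBLIC_ACCESS_BLOCK.items()
        if current.get(name) != wanted
    }

    applied = False
    if drift and apply:
        # The only code path that mutates the account, reachable only on explicit opt-in.
        client.put_public_access_block(
            AccountId=account_id,
            PublicAccessBlockConfiguration=dict(DESIRED_PUBLIC_ACCESS_BLOCK),
        )
        applied = True

    return {"drift": drift, "applied": applied}
```

With the default `apply=False`, the function only reads and reports, so a check-only run cannot alter the account even when drift is found.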

Why was this tool developed?

The tool was initially developed to manage WAF IPs, specifically the 2 IP lists (Deny IP Lists) for LBs and CloudFront, and to automate the common tasks needed to keep those lists consistent. It also helps in emergency scenarios where we need to quickly block threat IPs. This was the first time the tool was used for something other than managing WAF IP lists.

Why was this running in production account?

Because production is where the compliance checks and audits need to happen and are regularly reviewed. In addition, the tool automatically runs on all accounts, so production was included by default along with the others.

Why was this done without previous lower env testing?

We didn’t think this needed review because the intended behavior, a check only, should have had no impact.

Timeline

Timeline of significant known events and updates before, during, and after the incident window

All times in EDT

October 28, 2025

3:00 PM: Security measures implemented in response to threat activity: account-level S3 public access block configuration enabled and WAF rules deployed to block suspected threat actor IP addresses
5:25 PM: First user reports received - global Testmaker export issue identified, Service Desk notified with case tracking protocol; developers alerted

October 29, 2025

7:30 AM: Developers identify S3 upload failures across multiple Testmaker applications (scheduler-api, document-api, content-extraction-api)
7:30 AM: Error pattern identified: ERROR::File-003 : Failed to upload file to given s3 testmaker bucket
7:45 AM: Platform team engaged to investigate potential AWS environment or configuration changes
8:15 AM: Issue escalated as blocker; team huddle initiated
8:30 AM: Credentials validated; AWS CLI uploads successful from workstation using same credentials
9:00 AM: Investigation confirms: Credentials valid, bucket accessible, no recent code deployments related to export functionality
9:15 AM: Code review identifies ACL: 'public-read' parameters in uploadSingleFile and uploadLocalFile methods
9:20 AM: Team investigates both S3 policy changes and concurrent WAF IP blocking changes to isolate root cause
9:30 AM: Triage call started in Incident Response channel; cross-team investigation intensifies
9:45 AM: Connection made between code's ACL requirements and potential AWS policy changes; WAF changes ruled out as cause
10:00 AM: Account-level S3 public access block configuration identified as root cause; lack of AWS audit trail slowed identification process
10:10 AM: Decision made to roll back public access block configuration
10:30 AM: Configuration rollback completed (~20 minutes after decision); testing confirms exports functioning normally
10:35 AM: Status page and service desk notified of resolution
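
The key intervals implied by the timeline can be checked with a few lines of date arithmetic. This is a small sketch using the timestamps listed above (all EDT); the event labels are names chosen for the example.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"

# Timestamps taken from the timeline above; the dictionary keys are illustrative labels.
EVENTS = {
    "block_enabled": "2025-10-28 15:00",
    "first_report": "2025-10-28 17:25",
    "rollback_decision": "2025-10-29 10:10",
    "rollback_complete": "2025-10-29 10:30",
}


def elapsed(start: str, end: str) -> timedelta:
    """Time between two named timeline events."""
    return datetime.strptime(EVENTS[end], FMT) - datetime.strptime(EVENTS[start], FMT)


if __name__ == "__main__":
    print("change to resolution:      ", elapsed("block_enabled", "rollback_complete"))
    print("first report to resolution:", elapsed("first_report", "rollback_complete"))
    print("decision to resolution:    ", elapsed("rollback_decision", "rollback_complete"))
```

Measured from the first user report, the impact window is 17 hours 5 minutes; measured from the configuration change, it is 19 hours 30 minutes.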

Learned

What didn’t we know before this incident that we now know?

Account-level vs. bucket-level policy hierarchy: AWS account-level public access block configurations override bucket-level ACL policies, creating a cascading failure mode that wasn't previously understood by the development teams.
Gap in change management process: Infrastructure and security changes at the AWS account level can be implemented without review, documentation, or communication to application development teams or the broader organization, creating blind spots for impact assessment.
Limited AWS audit trail visibility: We had no comprehensive, easily accessible audit trail for account-level policy configuration changes, which significantly extended the time required to identify what changed and when during incident triage.
Cross-functional dependency mapping: We lack comprehensive documentation and tooling to map application-level dependencies on infrastructure configurations, making impact analysis for infrastructure changes difficult and error-prone.

Open Questions

Problem urgency wasn’t clear from the original incident message. Platform may have been able to investigate and address this sooner, but the product team was not able to triage until the next morning. What (if anything) can we do about this?
Who is responsible for the status page/banner shown to all customers during “global” issues? Whom should we ask to add status page updates, and when?

Action Items

What action items will we prioritize to mitigate incidents like this in the future?

Collated action items from Platform post-mortem meeting

Immediate (1 Sprint)

Implement a change management process for any security change going into production
Set up a testing & promotion process for Platform operations before they are applied to Prod
Check other AWS accounts for potential impact and blast radius
Add monitoring to alert when unexpected 4XX errors occur on S3 buckets
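
As one possible shape for the S3 4XX monitoring item, the sketch below builds the parameters for a CloudWatch alarm on the S3 `4xxErrors` request metric. It assumes request metrics are enabled on the bucket (they are opt-in) and that the result would be passed to something like boto3's `cloudwatch.put_metric_alarm(**params)`. The function name, alarm name, filter ID, and threshold are all hypothetical placeholders to be tuned.

```python
from typing import Any, Dict


def s3_4xx_alarm_params(
    bucket_name: str,
    filter_id: str = "EntireBucket",  # assumed ID of the bucket's request-metrics filter
    threshold: int = 10,              # placeholder; tune against normal 4XX volume
    period_seconds: int = 300,
) -> Dict[str, Any]:
    """Build keyword arguments for a CloudWatch alarm on S3 4xx request errors."""
    return {
        "AlarmName": f"s3-4xx-errors-{bucket_name}",
        "Namespace": "AWS/S3",
        "MetricName": "4xxErrors",
        "Dimensions": [
            {"Name": "BucketName", "Value": bucket_name},
            {"Name": "FilterId", "Value": filter_id},
        ],
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        # Quiet periods with no requests should not trigger the alarm.
        "TreatMissingData": "notBreaching",
    }
```

Keeping the parameters in a pure function makes the alarm definition easy to review and to reuse across buckets before anything is created in an account.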

Short-term (1 Month)

Establish guardrails around AI tool/script creation and use
Understand and document S3 public bucket usage vs private usage
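
A first pass at documenting public usage could be a simple scan of application source for uploads that request a public ACL, such as the ACL: 'public-read' parameter found during triage. This is a rough sketch under stated assumptions: the function name and pattern are hypothetical, the regex only catches literal string values, and it would miss ACLs set through variables or configuration.

```python
import re
from typing import List, Tuple

# Matches e.g.  ACL: 'public-read'  or  ACL = "public-read-write"
PUBLIC_ACL_PATTERN = re.compile(r"""\bACL\b\s*[:=]\s*['"](public-read(?:-write)?)['"]""")


def find_public_acl_usages(source: str) -> List[Tuple[int, str]]:
    """Return (line_number, acl_value) for each literal public ACL found in the source text."""
    hits: List[Tuple[int, str]] = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for match in PUBLIC_ACL_PATTERN.finditer(line):
            hits.append((line_no, match.group(1)))
    return hits


if __name__ == "__main__":
    sample = (
        "const params = {\n"
        "  Bucket: bucket,\n"
        "  ACL: 'public-read',\n"
        "};\n"
    )
    for line_no, acl in find_public_acl_usages(sample):
        print(f"line {line_no}: {acl}")
```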

Long-term (1 Quarter)

Logging/tracking for access & config changes (from retro)
Synthetic monitoring on user workflows to function as an early warning and detection system
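
A synthetic monitor for a user workflow can be as small as an ordered list of steps that reports the first one to fail. The sketch below shows only that control flow; the step names and callables are hypothetical stand-ins for real actions such as creating an export and downloading the resulting file.

```python
from typing import Callable, List, Optional, Tuple

# A step is a human-readable name plus a callable that raises on failure.
Step = Tuple[str, Callable[[], None]]


def run_synthetic_check(steps: List[Step]) -> Optional[str]:
    """Run the steps in order; return a description of the first failure, or None if all pass."""
    for name, action in steps:
        try:
            action()
        except Exception as exc:  # any failure in a step fails the whole workflow
            return f"{name}: {exc}"
    return None
```

Run on a schedule, a check like this would surface a broken export within minutes of a change instead of waiting for the first user report.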

Introducing the Digital Platform

July 27, 2023 · 17 min read

Evan Yamanishi

Director of Engineering

This post describes the vision and direction for what I'm tentatively calling our "Digital Platform"[^1]—a collection of our tools, services, and operational functionality arranged as a compelling internal product that allows teams to operate more quickly and independently. It imagines how we can transform our DevOps team into a product team that manages our platform as a product.
