Incident: Testmaker Export Outage
- Date of incident: October 28, 2025
- Recovery time: ~17 hours total impact (5:25 PM EDT Oct 28 - 10:30 AM EDT Oct 29)
- Impacted product(s): Testmaker, potentially more
- Primary investigator(s): Ankur Wase, Roberto Crisial, Bo Kaung
- Incident repair team: Sanjay Akula, Roberto Crisial, Steve Hurlock
Summary
A short summary of the incident, impact, and resolution
On October 28-29, 2025, Testmaker experienced a global outage affecting all export and import functionality for instructors. The incident was caused by an AWS account-level public access block configuration change implemented as a security measure on October 28th. This configuration blocked S3 uploads that used the ACL: 'public-read' parameter, which multiple Testmaker applications (scheduler-api, document-api, content-extraction-api) relied upon for file exports and imports. The issue likely also impacted other organizational S3 buckets, including Marketing team's CDN content delivery systems, though these impacts went unreported. The issue was resolved by rolling back the account-level public access block configuration, allowing bucket-level ACL policies to govern access control instead.
Impact:
- 967 4XX errors recorded in CloudWatch for the Testmaker S3 bucket alone, indicative of export/import errors
- 16 Service Desk cases from 15 instructors across 13 colleges/universities and 2 high schools
- Backend manuscript import functionality also affected (internal users only)
- Likely unreported issues with other S3-dependent systems (e.g., the Marketing team's CDN content delivery)
Five+ Whys
Why was Testmaker failing to export files?
The Testmaker export function relies on S3: it creates files and uploads them to public blob storage, from which they are served to users who need to download them. The application logs identified the S3 upload function as the point of failure.
Why did the S3 upload fail?
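An account-level AWS S3 public access block configuration was enabled on October 28th as a security measure to prevent public access to S3 resources across the AWS account. With that block in place, S3 rejected uploads that set the ACL: 'public-read' parameter, which the Testmaker upload code does.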
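The interaction between a public ACL on upload and the public access block can be sketched with a small model. This is a simplified illustration under stated assumptions, not S3's implementation: it assumes the BlockPublicAcls setting is the one in effect, it covers only canned ACLs, and the function name `put_object_allowed` is hypothetical.

```python
from typing import Optional

# Canned ACLs that grant access to the AllUsers or AuthenticatedUsers groups,
# which S3 treats as public ACLs.
PUBLIC_CANNED_ACLS = {"public-read", "public-read-write", "authenticated-read"}


def put_object_allowed(acl: Optional[str], block_public_acls: bool) -> bool:
    """Simplified model: is an upload with this canned ACL accepted?"""
    if acl is None:
        # No ACL on the request, so the public-ACL block does not apply.
        return True
    return not (block_public_acls and acl in PUBLIC_CANNED_ACLS)


print(put_object_allowed("public-read", block_public_acls=False))  # True
print(put_object_allowed("public-read", block_public_acls=True))   # False
print(put_object_allowed(None, block_public_acls=True))            # True
```

The last case is consistent with the AWS CLI uploads in the timeline succeeding with the same credentials, assuming those uploads did not set an ACL.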
Why was this change made?
A bug in our home-grown security tool, “AWS Security Command Center” (built with Amazon Q code assistance), accidentally turned on the public access block feature. The tool was designed only to run a hardening CHECK on the Norton accounts, but it was not only checking: it was also applying the hardening fixes.
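One common way to keep a check from silently becoming a fix is to make remediation an explicit opt-in. The sketch below is illustrative only and is not the code of the tool described above; `HardeningCheck`, `run_hardening`, and the in-memory `account` dictionary are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class HardeningCheck:
    name: str
    is_compliant: Callable[[], bool]   # read-only inspection
    remediate: Callable[[], None]      # mutating fix


def run_hardening(checks: List[HardeningCheck], apply_fixes: bool = False) -> List[str]:
    """Return the names of failing checks; mutate only if apply_fixes is True."""
    findings = []
    for check in checks:
        if check.is_compliant():
            continue
        findings.append(check.name)
        if apply_fixes:
            check.remediate()
    return findings


# Hypothetical in-memory stand-in for an account-level setting.
account = {"block_public_access": False}
checks = [
    HardeningCheck(
        name="s3-account-public-access-block",
        is_compliant=lambda: account["block_public_access"],
        remediate=lambda: account.update(block_public_access=True),
    )
]

print(run_hardening(checks))  # ['s3-account-public-access-block']
print(account)                # {'block_public_access': False}
```

With the default `apply_fixes=False`, the run reports the finding and leaves the setting untouched; changing anything requires the caller to pass `apply_fixes=True` deliberately.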
Why was this tool developed?
The tool was initially developed to manage WAF IPs, specifically two deny IP lists for load balancers and CloudFront. Common tasks needed to be automated as part of this to ensure parity and consistency. The tool also helps in emergencies, when we need to quickly block IPs in response to threats. This was the first time the tool was used for something other than managing WAF IP lists.
Why was this running in the production account?
Because that is where compliance checks and audits need to happen and are regularly reviewed. In addition, the tool runs automatically on all accounts, so production was included by default with all the others.
Why was this done without prior testing in a lower environment?
We didn’t think this needed review, because what was supposed to happen (a check only) should have had no impact.
Timeline
Timeline of significant known events and updates before, during, and after the incident window
All times in EDT
October 28, 2025
- 3:00 PM: Security measures implemented in response to threat activity: account-level S3 public access block configuration enabled and WAF rules deployed to block suspected threat actor IP addresses
- 5:25 PM: First user reports received - global Testmaker export issue identified, Service Desk notified with case tracking protocol; developers alerted
October 29, 2025
- 7:30 AM: Developers identify S3 upload failures across multiple Testmaker applications (scheduler-api, document-api, content-extraction-api)
- 7:30 AM: Error pattern identified: ERROR::File-003 : Failed to upload file to given s3 testmaker bucket
- 7:45 AM: Platform team engaged to investigate potential AWS environment or configuration changes
- 8:15 AM: Issue escalated as blocker; team huddle initiated
- 8:30 AM: Credentials validated; AWS CLI uploads successful from workstation using same credentials
- 9:00 AM: Investigation confirms: Credentials valid, bucket accessible, no recent code deployments related to export functionality
- 9:15 AM: Code review identifies ACL: 'public-read' parameters in uploadSingleFile and uploadLocalFile methods
- 9:20 AM: Team investigates both S3 policy changes and concurrent WAF IP blocking changes to isolate root cause
- 9:30 AM: Triage call started in Incident Response channel; cross-team investigation intensifies
- 9:45 AM: Connection made between code's ACL requirements and potential AWS policy changes; WAF changes ruled out as cause
- 10:00 AM: Account-level S3 public access block configuration identified as root cause; lack of AWS audit trail slowed identification process
- 10:10 AM: Decision made to roll back public access block configuration
- 10:30 AM: Configuration rollback completed (~20 minutes after decision); testing confirms exports functioning normally
- 10:35 AM: Status page and service desk notified of resolution
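The key intervals in the timeline above can be checked with a short calculation. This is a convenience sketch: the timestamps are copied from the timeline (all EDT), while the event labels and helper names are made up for illustration.

```python
from datetime import datetime, timedelta

# Timestamps taken from the timeline; all times are EDT.
events = {
    "config_enabled": datetime(2025, 10, 28, 15, 0),
    "first_report": datetime(2025, 10, 28, 17, 25),
    "devs_identify_failures": datetime(2025, 10, 29, 7, 30),
    "root_cause_identified": datetime(2025, 10, 29, 10, 0),
    "rollback_decision": datetime(2025, 10, 29, 10, 10),
    "rollback_completed": datetime(2025, 10, 29, 10, 30),
}


def fmt(delta: timedelta) -> str:
    """Format a duration as hours and minutes."""
    minutes = int(delta.total_seconds() // 60)
    return f"{minutes // 60}h {minutes % 60:02d}m"


def elapsed(start: str, end: str) -> str:
    return fmt(events[end] - events[start])


print(elapsed("config_enabled", "first_report"))                   # 2h 25m
print(elapsed("first_report", "rollback_completed"))               # 17h 05m
print(elapsed("devs_identify_failures", "root_cause_identified"))  # 2h 30m
print(elapsed("rollback_decision", "rollback_completed"))          # 0h 20m
```

Most of the window between the first report and the rollback falls overnight, before developers began investigating at 7:30 AM.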
Learned
What didn’t we know before this incident that we now know?
- Account-level vs. bucket-level policy hierarchy: AWS account-level public access block configurations override bucket-level ACL policies, creating a cascading failure mode that wasn't previously understood by the development teams.
- Gap in change management process: Infrastructure and security changes at the AWS account level can be implemented without review, documentation, or communication to application development teams or the broader organization, creating blind spots for impact assessment.
- Limited AWS audit trail visibility: AWS does not provide comprehensive, easily accessible audit trails for account-level policy configuration changes, significantly extending the time required to identify what changed and when during incident triage.
- Cross-functional dependency mapping: We lack comprehensive documentation and tooling to map application-level dependencies on infrastructure configurations, making impact analysis for infrastructure changes difficult and error-prone.
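The account-level versus bucket-level hierarchy from the first point can be illustrated with a small model. S3 applies the most restrictive combination of the account and bucket settings, so a flag is effectively on if either level enables it. The function name and the sample dictionaries are hypothetical; in practice the two inputs would come from the account-level and bucket-level GetPublicAccessBlock responses.

```python
from typing import Dict

# The four Block Public Access flags.
FLAGS = (
    "BlockPublicAcls",
    "IgnorePublicAcls",
    "BlockPublicPolicy",
    "RestrictPublicBuckets",
)


def effective_public_access_block(
    account: Dict[str, bool], bucket: Dict[str, bool]
) -> Dict[str, bool]:
    """Most restrictive combination: a flag applies if either level sets it."""
    return {
        flag: bool(account.get(flag, False)) or bool(bucket.get(flag, False))
        for flag in FLAGS
    }


# Hypothetical state during the incident: account locked down, bucket permissive.
account_cfg = {flag: True for flag in FLAGS}
bucket_cfg = {flag: False for flag in FLAGS}

print(effective_public_access_block(account_cfg, bucket_cfg)["BlockPublicAcls"])  # True
```

A bucket whose own settings allow public ACLs therefore still rejects them once the account-level flag is enabled.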
Open Questions
- Problem urgency wasn’t clear from the original incident message. Platform may have been able to investigate and address this sooner, but the product team was not able to triage until the next morning. What (if anything) can we do about this?
- Who is responsible for the status page/banner shown to all customers during “global” issues? Whom should we ask, and when, about adding status page updates?
Action Items
What action items will we prioritize to mitigate incidents like this in the future?
Collated action items from Platform post-mortem meeting
Immediate (1 Sprint)
- Implement a change management process for any security change going into production
- Set up a testing and promotion process for Platform operations before they are applied to Prod
- Check other AWS accounts for potential impact and blast radius
- Add monitoring to alert when unexpected 4XX errors occur on S3 buckets
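As a starting point for the S3 4XX monitoring item, the sketch below builds the parameters for a CloudWatch alarm on the S3 `4xxErrors` request metric. It assumes request metrics are enabled on the bucket (S3 does not publish them by default) and that the metrics filter is named `EntireBucket`. The bucket name, threshold, and function name are placeholders, not values from this incident.

```python
from typing import Any, Dict


def s3_4xx_alarm_params(
    bucket: str,
    filter_id: str = "EntireBucket",  # assumed name of the request-metrics filter
    threshold: float = 10.0,          # placeholder; tune to normal traffic
    period_seconds: int = 300,
) -> Dict[str, Any]:
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    return {
        "AlarmName": f"{bucket}-s3-4xx-errors",
        "Namespace": "AWS/S3",
        "MetricName": "4xxErrors",
        "Dimensions": [
            {"Name": "BucketName", "Value": bucket},
            {"Name": "FilterId", "Value": filter_id},
        ],
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }


params = s3_4xx_alarm_params("example-testmaker-bucket")
print(params["AlarmName"])  # example-testmaker-bucket-s3-4xx-errors
```

The resulting dictionary could be passed to `boto3.client("cloudwatch").put_metric_alarm(**params)`; an alarm action such as an SNS topic would still need to be added for anyone to be notified.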
Short-term (1 Month)
- Establish guardrails around AI tool/script creation and use
- Understand and document S3 public bucket usage vs. private usage
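One way to start documenting public usage is to inventory where application code requests a public ACL. The sketch below scans source text for `ACL: 'public-read'` style parameters. The regular expression, function name, and sample snippet are illustrative assumptions and would miss ACLs set through variables or configuration.

```python
import re
from typing import List, Tuple

# Matches e.g. ACL: 'public-read' or ACL = "public-read-write".
PUBLIC_ACL_PATTERN = re.compile(
    r"""ACL\s*[:=]\s*['"](public-read(?:-write)?)['"]"""
)


def find_public_acl_usages(source: str) -> List[Tuple[int, str]]:
    """Return (line number, ACL value) for each line that sets a public ACL."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = PUBLIC_ACL_PATTERN.search(line)
        if match:
            hits.append((lineno, match.group(1)))
    return hits


# Hypothetical upload parameters, not the actual Testmaker code.
sample = """\
const params = {
  Bucket: bucketName,
  Key: key,
  ACL: 'public-read',
};
"""

print(find_public_acl_usages(sample))  # [(4, 'public-read')]
```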
Long-term (1 Quarter)
- Logging/tracking for access & config changes - (From retro)
- Synthetic monitoring on user workflows to function as an early warning and detection system

