Blog | Norton Digital Product Guidebook

Engineering Manager

Date of incident: Sat March 8, 2025
Recovery time: 3 hours
Impacted product(s): Support Site
Primary investigator(s): Anand Patil, Sanjay Akula, Roberto Crisial
Incident repair team: Sanjay Akula, Roberto Crisial, Steve Hurlock

Timeline

All times EST on Saturday 03/08/25.
The outage lasted 8 hours, from 3:30 am to 11:30 am.
We became aware of it at 9 am, making our recovery time 2.5 hours.

5:21pm – Kshitij Tilekar reports in the Incident channel, under the subject "Unable to submit a Support Ticket Request": "I am unable to submit a Support Ticket Request. The Category and Type of Issue drop-downs show no options in them."

5:36pm – There is a discussion about what info comes from Salesforce (since we saw recent issues with Salesforce connections on support2), but Rob and Sanjay see that there are no Categories being populated.

6:26pm – Jared confirms that Categories do not come from Salesforce and contacts Anand.

7:12pm – Anand finds that the support site is not able to connect to MongoDB.

7:55pm – Jared posts an alert about the issue but can't update the support banner.

~8:15pm – Sanjay and Anand restart the support app, but it still can't connect to MongoDB. Sanjay reboots the support server, but the MongoDB connection is still not established. Sanjay attaches an Elastic IP (EIP) to the EC2 instance so that its IP address does not change on reboot. Steve adds the new IP address to the MongoDB allow list, resolving the issue.

8:22pm – Kashish confirms the fix

8:31pm – Anand posts the following summary:

Investigation Summary

Anand and Sanjay identified the issue: the support application was attempting to connect to the database but encountering a timeout error.
We engaged Rob for further investigation, and he suggested creating an Elastic IP address that could be added to the MongoDB allow list.
Sanjay created the Elastic IP, and Steve added it to MongoDB Atlas.
Rob confirmed successful communication.
Sanjay restarted the support application.
The database is now connecting successfully.
The support application is loading categories as expected.
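
The timeout described in this summary can be told apart from other connection failures with a plain TCP probe. The following is a minimal sketch, not the check the team actually ran: the function name `probe_tcp` and the default timeout are assumptions, and it only tests whether a TCP connection opens, not whether MongoDB authentication succeeds.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt one TCP connection and classify the outcome.

    Returns "ok", "timeout", "refused", or "error: <details>".
    """
    try:
        # create_connection resolves the host and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except socket.timeout:
        # No answer at all within the timeout window.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except OSError as exc:
        # DNS failures, unreachable networks, and similar errors.
        return f"error: {exc}"
```

Run from the support server against a cluster host on MongoDB's default port 27017, a result of "timeout" rather than "refused" would be consistent with traffic being dropped before it reaches the database, which is one way an allow-list mismatch can show up.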

Root Cause Analysis

No tickets could be submitted on the Support.wwnorton.com website.
The Support EC2 instance had no MongoDB connection. Why? No restarts happened. At what time does the timeout appear in the log?
a. The instance was not reaching the MongoDB cluster through an interface on the MongoDB allow list; this could be due to AWS maintenance.
b. A change to the MongoDB allow list, or another action, affected this connection.
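
Both hypotheses come down to whether the address the instance connects from matches an entry on the allow list. A minimal sketch of that comparison follows, assuming the allow list is available as a list of single IPs or CIDR blocks. The function name `ip_is_allowed` is hypothetical, and the addresses in the usage note are documentation-range placeholders, not the real ones.

```python
import ipaddress


def ip_is_allowed(ip: str, allow_list: list[str]) -> bool:
    """Return True if `ip` matches any allow-list entry.

    Entries may be single addresses ("203.0.113.10") or CIDR blocks
    ("198.51.100.0/24"). A single address is treated as a /32 network.
    """
    addr = ipaddress.ip_address(ip)
    return any(
        addr in ipaddress.ip_network(entry, strict=False)
        for entry in allow_list
    )
```

For example, `ip_is_allowed("203.0.113.11", ["203.0.113.10", "198.51.100.0/24"])` returns `False`, the situation an instance would be in if its public IP changed while the allow list still held the old one.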

Learned

What didn’t we know before this incident that we now know?

Should we move the support app to EKS instead of running on EC2?

Action Items

Get logs showing the MongoDB disconnect time and error message from the Docker container. Are there other logs in Graylog and on EC2? - Anand & Sanjay
Identify the original IP address for the support site. Was it affected by the IP address cleanup? We think not, but need to confirm. Check in CloudTrail. - Rob
Check with MongoDB for actions taken during this time. - Steve
Discuss pros and cons of moving to EKS. - separate meeting
Define what actions should be taken if this happens again, once the root cause is identified.
Discuss monitoring for the site and its MongoDB and Salesforce connections; alerts are needed. - next meeting
Why are logs not being collected from the support site? Check the credential files; do we need to move to DNS instead of IP? - Anand & Sanjay
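
As input for the monitoring discussion, one common pattern is to alert only after a dependency fails several checks in a row, so that a single blip does not page anyone. The sketch below is a hypothetical illustration of that pattern, not an existing tool in our stack. The class name `FailureTracker`, the threshold value, and the dependency names are assumptions, and running the checks and delivering the alerts are out of scope.

```python
from dataclasses import dataclass, field


@dataclass
class FailureTracker:
    """Count consecutive failed checks per dependency."""

    threshold: int = 3  # assumed value; tune per dependency
    failures: dict[str, int] = field(default_factory=dict)

    def record(self, name: str, ok: bool) -> bool:
        """Record one check result; return True when an alert should fire."""
        if ok:
            # Any success resets the streak for this dependency.
            self.failures[name] = 0
            return False
        self.failures[name] = self.failures.get(name, 0) + 1
        # Fire once, at the moment the streak reaches the threshold.
        return self.failures[name] == self.threshold
```

A scheduler would call `tracker.record("mongodb", check_result)` after each probe and send an alert whenever it returns `True`. Firing only when the streak reaches the threshold keeps a long outage from producing a new alert on every check.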

Incident: DLT Release 19 Login Issue

February 19, 2025 · 4 min read

Engineering Manager

https://teams.microsoft.com/l/message/19:8e766a3059544fe68f47ef73d351165f@thread.tacv2/1739989495934?tenantId=2916ea14-8f24-4be6-8a2d-3afc0c7a4892&groupId=f6944feb-4a08-482b-ade9-3f240f442054&parentMessageId=1739989495934&teamName=Digital%20Product%20Team&channelName=App%20-%20Integrations%20-%20Deployment&createdTime=1739989495934

Engineering Manager

During the DLT release 19 deployment, we found and fixed an issue where new instructors could not launch the DLT application from Canvas PI (Willow and VitalSource).

Why Logging Is Critical for Visibility

February 13, 2025 · 5 min read

Engineering Manager

When managing complex systems, four overarching themes (Visibility, Testing, Resilience, and Leverage) serve as guiding principles. These themes are interconnected, but at their core, Visibility is foundational. Without visibility, the other three themes lose their efficacy. This post focuses on the critical role of logging in achieving system visibility: it explores why logging matters and presents a set of logging guidelines and next steps for development teams.

Blameless Postmortem Culture

December 12, 2024 · 5 min read

Engineering Manager

Director of Engineering

Engineering Manager

Incidents are an opportunity to improve our systems, build an environment of trust, and create a psychologically safe workplace that respects our dignity.

Incident: Deep Linking Tool (DLT) Login flow

December 10, 2024 · 9 min read

Engineering Manager

On 11/25 Patricia Lochary found issues launching DLT and asked the Integrations dev team to look into it (related ticket: INT-2173). After some initial research on 11/26, the dev team found clues that the Login Web / Testmaker API Gateway was having issues with token generation/validation and asked the team lead (Ankur) to review it. He suggested restarting pods, as we had done on previous occasions. During that initial check we were notified that at least one instructor had reported being unable to log in to DLT. While we later determined that the login/auth flow was affected, we received no indication that any other service or product was affected, which is why this incident focuses on the DLT Login flow.

Incident: Videos not found

October 4, 2024 · 7 min read

David Dushaj

Technical Product Manager

Director of Engineering

Early on 10/4, teams began reporting that some videos on the website and Smartwork were not loading. After investigating, we discovered that the NGINX proxy pointing to Liquid Web was not resolving requests properly. This happened because the NGINX proxy pod restarted in the middle of the night, pulling a new image with code that made videos in Liquid Web unreachable. We rolled back to the previous image to restore service.

While we later determined that routing to most resources on Liquid Web was affected, we received no indication that any other service or asset type was affected, which is why this incident focuses on the videos.

CI/CD Part 1: Continuous Integration

September 5, 2023 · 12 min read

Director of Engineering

Don Jordan

Business Analyst

Engineering Manager

Spending time on menial, repetitive tasks can feel demotivating and aimless, like our work isn't contributing to something meaningful. And time spent on these kinds of tasks is also time away from delivering business or customer value.

Introducing the Digital Platform

July 27, 2023 · 17 min read