Incident: Videos not found

October 4, 2024 · 7 min read

Technical Product Manager

Director of Engineering

Early on 10/4, teams began reporting that some videos on the website and Smartwork were not loading. After investigating, we discovered that the NGINX proxy pointing to Liquid Web was not resolving requests properly. This happened because the NGINX proxy pod restarted in the middle of the night, pulling a new image with code that made videos in Liquid Web unreachable. We rolled back to the previous image to restore service.

While we later determined that routing to most resources on Liquid Web was affected, we received no reports about any other service or asset type, which is why this incident report focuses on the videos.

Date of incident: Friday, October 4th, 2024
Recovery time: 2.5 hours
Impacted product(s): Smartwork, wwnorton.com, likely others
Primary investigator(s)¹: Roberto Crisial
Incident repair team²: David Dushaj, Roberto Crisial, Don Jordan, Camilo Joga, Jaime Zelada, Sanjay Akula, Mac Penn, Bo Kaung, Steve Hurlock, Evan Yamanishi

Root Cause Analysis

Videos hosted through wwnorton.com/common/mplay returned a 404 not found error, affecting Smartwork and the website.
Requests to wwnorton.com/common/ were directed through an NGINX proxy, which was not resolving those requests.
The NGINX proxy pod was ejected earlier in the morning. When it restarted, it pulled a different image than the one that had been running.
- The pod was ejected because a NoExecute taint was set.
A different image was pulled because the production values.yaml file didn’t use a pinned, immutable image tag.
- If no tag is provided, Docker Engine uses the :latest tag as a default.
The image name was managed manually, rather than by the pipeline workflow.
The NGINX proxy was not initially accounted for in the EKS production cutover, so the deploy repo was created at the last minute to resolve a production issue that we experienced on the day of the cutover (7/13).
Nobody involved in the production cutover work knew that the NGINX proxy was required because the lower environments didn’t experience any related issues before the cutover and because nobody involved in its design still works at Norton.
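The tag defaulting described above can be illustrated with a minimal Python sketch. It is not the tooling used in this incident; the image names and registry host are hypothetical, and the parsing is deliberately simplified (it does not validate references the way a container runtime does).

```python
from typing import Optional, Tuple


def split_image_reference(image: str) -> Tuple[str, Optional[str], Optional[str]]:
    """Split an image reference into (name, tag, digest)."""
    name, _, digest = image.partition("@")
    # A colon only marks a tag when it comes after the last slash;
    # otherwise it is a registry port, e.g. registry.example.com:5000.
    last_slash = name.rfind("/")
    colon = name.rfind(":")
    tag = None
    if colon > last_slash:
        name, tag = name[:colon], name[colon + 1:]
    return name, tag, (digest or None)


def effective_tag(image: str) -> Optional[str]:
    """Return the tag that will be pulled, or None when pinned by digest."""
    _, tag, digest = split_image_reference(image)
    if digest:
        return None
    # With no tag given, the reference falls back to "latest".
    return tag or "latest"


def is_pinned(image: str) -> bool:
    """True when the reference names a digest or an explicit non-latest tag.

    An explicit tag is only truly immutable if the registry forbids
    overwriting tags; this check cannot verify that.
    """
    _, tag, digest = split_image_reference(image)
    if digest:
        return True
    return tag is not None and tag != "latest"


# Hypothetical image references, for illustration only.
print(effective_tag("registry.example.com:5000/nginx-proxy"))
print(is_pinned("registry.example.com:5000/nginx-proxy:1.4.2"))
```

A check like `is_pinned` could run against the image values in each deploy repo, although a digest is the only reference that is immutable by construction.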

Timeline

All times Eastern Time (ET) on October 4th, 2024.
The outage lasted 8 hours, from 3:30 am to 11:30 am.
We became aware of it at 9 am, making our recovery time 2.5 hours.

3:30 – NGINX Proxy pod was ejected and pulled a different image than was running previously.

9:03 – Vinod reported in the Teams Alerts channel that the video at wwnorton.com/inquizitive was not working.

9:04 – In the same Teams thread, Kuldeep noted that videos were not working in Smartwork.

9:09 – Jared noted that all affected videos used the wwnorton.com/common/mplay/ base URL.

9:20 – Confirmed these were working 24 hours ago.

9:30 – Ruled out the following, as each reported no recent changes or deployments: Liquid Web, Smartwork, Website, and Inquizitive.

9:40 – Jared posted a message on the status page to alert users that we were investigating.

10:00 – David started a war room to triage together with everyone on the thread mentioned previously.

10:15 – We began investigating the proxy and site forwarding, as the videos were accessible locally and over AWS VPN using the "web.*" subdomain.

10:30 – Rob continued investigating and believed the NGINX proxy to be involved somehow. We confirmed that the NGINX proxy pod had restarted about 7 hours earlier.

11:01 – Rob discovered that the NGINX proxy’s production image name did not have a tag, which meant that it used the :latest tag by default.

11:02 – Rob pinned the proxy image tag to a known working one and deployed it.

11:30 – Confirmed internally and with reporting teams that videos were working and that the issue was resolved.

Learned

What didn’t we know before this incident that we now know?

The NGINX Proxy repo does not automatically deploy to lower environments because it is missing deploy repo configs for everything but production.
The change in the NGINX Proxy project broke most routing to Liquid Web, not just videos.
Despite most apps and assets on Liquid Web going down, stakeholders only reported issues with videos, and we received no customer reports. While not reliable on its own since customers could have experienced issues they didn't report, this helped us better understand what’s being used on Liquid Web, and that we’re on the right track with our work to shut down Liquid Web.
There was at least one repository that used mutable tags in production.
At least Smartwork and the website are using videos that are hosted on Liquid Web.
- Evan Yamanishi:
  
  "We learned here that the website and smartwork teams are still using the mplay video player. I'd like them to migrate all of those videos to player.wwnorton.com."
- Melissa Bilski:
  
  "A ticket for the WWNorton.com marketing videos was opened here: WEB-1302"
A Kubernetes pod can be "ejected" (evicted) from a node when the node has a NoExecute taint set and the pod has no matching toleration (covered in Taints and Tolerations). The pod does not simply restart: its replacement starts fresh and re-pulls the image specified in the deploy config.
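The taint and toleration matching behind that eviction can be sketched as follows. This is a simplified Python model of the matching rules, not Kubernetes code: it ignores `tolerationSeconds`, and the taint key `example.com/maintenance` is hypothetical, since the taint that triggered this incident has not been identified.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key matches any taint key
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""         # empty effect matches any taint effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True when the toleration matches the taint."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.key and toleration.key != taint.key:
        return False
    if toleration.operator == "Exists":
        return True
    return toleration.operator == "Equal" and toleration.value == taint.value


def is_evicted(taints: List[Taint], tolerations: List[Toleration]) -> bool:
    """A running pod is evicted if any NoExecute taint is left untolerated."""
    return any(
        taint.effect == "NoExecute"
        and not any(tolerates(t, taint) for t in tolerations)
        for taint in taints
    )


# Hypothetical taint; the proxy pod carried no matching toleration.
maintenance = Taint(key="example.com/maintenance", effect="NoExecute", value="true")
print(is_evicted([maintenance], []))
```

In this model, adding a toleration for the taint's key with the `Exists` operator would keep the pod in place, whereas a `NoSchedule` taint alone would never evict a pod that is already running.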

Action Items

Figure out why NGINX pods restarted and if this was normal or not. (ENG-695)
Add lower environment deployments to the NGINX Proxy deploy repo. (ENG-679)
Check all other deploy repos for missing deployment configurations. (Epic: ENG-696)
Make sure all deploy repos are using immutable tags in production. (Epic: ENG-696)
Back out the change in the NGINX proxy repo that was made to stop pointing to Liquid Web. Alternatively, wrap the change in a feature flag. (ENG-680)
Explore synthetic monitoring solutions to catch issues as soon as they occur (or before), rather than waiting for a human report. (ENG-40)
Discuss the pros and cons of version mutability across our supply chain with engineers, reinforcing that production artifacts should be pinned and immutable. (discussed at the 10/10 engineering monthly call)
Do more Liquid Web investigation to find out what needs to be migrated. Based on this incident and the usage logs from ENG-571, there is more activity on Liquid Web than anyone expected. This will help us better outline the steps required to shut down Liquid Web with minimal business/customer impact.
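As an illustration of the synthetic monitoring item (ENG-40), here is a minimal Python sketch of a probe that requests a list of URLs and reports any that do not return 200. It is an assumption about what such a check might look like, not a chosen solution: the function names are hypothetical, the probed path is a placeholder, and it assumes the server answers HEAD requests.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for a HEAD request to url."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urlopen raises on 4xx/5xx; the status code is still what we want.
        return error.code


def failing_probes(
    urls: Iterable[str],
    fetch: Callable[[str], int] = http_status,
) -> Dict[str, str]:
    """Map each unhealthy URL to a short description of the failure."""
    failures: Dict[str, str] = {}
    for url in urls:
        try:
            status = fetch(url)
        except OSError as error:  # DNS failures, refused connections, timeouts
            failures[url] = f"request failed: {error}"
            continue
        if status != 200:
            failures[url] = f"unexpected status {status}"
    return failures


if __name__ == "__main__":
    # Placeholder path under the base URL from this incident.
    probes = ["https://wwnorton.com/common/mplay/<known-video-path>"]
    for url, reason in failing_probes(probes).items():
        print(f"ALERT {url}: {reason}")
```

Run on a schedule and wired to an alert channel, a probe like this would have flagged the 404s shortly after 3:30 rather than at 9:03. The `fetch` parameter is injectable so the logic can be exercised without network access.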

Footnotes

¹ The primary investigators are the individuals who discovered the technical cause(s) of the incident and/or found the solution to the issue.
² The incident repair team were involved in the work to fix the issue, whether they found the cause and solution or not. They are called out because they are the next most likely to be able to help explain the details of the incident, after the primary investigator(s).
