Incident: DLT Release 19 Login Issue

· 4 min read
Steve Hurlock
Engineering Manager
Claudio Caviglia
Engineering Manager

During the DLT release 19 deployment, we found and fixed an issue that prevented new instructors from launching the DLT application from Canvas PI (Willow and VitalSource).

https://teams.microsoft.com/l/message/19:8e766a3059544fe68f47ef73d351165f@thread.tacv2/1739989495934?tenantId=2916ea14-8f24-4be6-8a2d-3afc0c7a4892&groupId=f6944feb-4a08-482b-ade9-3f240f442054&parentMessageId=1739989495934&teamName=Digital%20Product%20Team&channelName=App%20-%20Integrations%20-%20Deployment&createdTime=1739989495934

  • Date of incident: 2/19/25
  • Recovery time: 3.5 hours
  • Impacted product(s): login-web, DLT
  • Impacted user(s): New instructors
  • Primary investigator(s): Kevin Temes, Ankur Wase, Roberto Crisial
  • Incident repair team: testmaker, platform

Root Cause Analysis

  1. Why can't new instructors log in?
  2. Why can existing users log in?
  3. Why is internalservices returning 404?
  4. Why did an ingress rule affect the login flow?

Timeline

  • All times EST on 2/19/25.
  • The outage lasted 3.5 hours.

Issue Detection

8:50 AM - Deployment starts

11:19 AM - Rajaram Rane reports that new instructors cannot launch the DLT application from Canvas PI (Willo and VitalSource). The issue is observed only for new users on Canvas, and Blackboard with VitalSource returns the same error. Existing users are working fine for all LMSs and PI.

11:25 AM - Kevin Temes reports errors appearing in the login-web console, possibly related to api-gateway.

11:49 AM - Morgan Rinehart notes that Ankur Wase is offline due to offshore time and asks whether the team can sign off without a fix today.

11:51 AM - Kevin Temes clarifies that new instructors using DLT for the first time cannot access the tool, while existing users remain unaffected.

11:55 AM - Morgan Rinehart suggests the platform team may assist and flags it for Ankur and Dipak to check in the morning. However, Ankur logs back in to investigate.

Investigation

11:59 AM - Claudio Caviglia mentions that restarting the pod would fix the issue but would erase root cause details.

12:01 PM - Jessica Fix logs the issue in MAINT: MAINT-4587.

12:11 PM - Ankur Wase identifies that login-web cannot connect to api-gateway, receiving a 404 error. He asks Steve Hurlock and Roberto Crisial if any ingress rule changes were made.

12:13 PM - Roberto Crisial checks the logs and requests the affected URL.

12:15 PM - Ankur Wase provides the URL: https://internalservices.wwnorton.com/testmaker-api-gateway/v1/login.

12:15 PM - Roberto Crisial finds the issue and begins fixing it.

Resolution

12:21 PM - Roberto Crisial confirms the issue was due to a misconfiguration in the rconsole ingress rule and states it is fixed.

12:22 PM - Ankur Wase confirms that login now works and asks the team to verify.

12:30 PM - Claudio Caviglia states that Ruby Tiwari has confirmed the fix in the ticket. Roberto Crisial provides the following explanation of the root cause (https://wwnorton.atlassian.net/browse/MAINT-4587?focusedCommentId=247258):

This is related to PLAT-89: setup new application for RConsole using CI/CD Production env. The ingress controller seems to have been misconfigured. The applied configuration routed all internalservices.wwnorton.com traffic to the ncia/rconsole solution, so the rules that follow it, which route traffic based on path conditions, were overruled.

The configuration has been updated to fix the issue. The issue seems to be related to the request to update the URL from internalservices.wwnorton.com/rconsole/* to rconsole.wwnorton.com/*.
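
The failure mode in this explanation can be illustrated with a small model. This is a minimal sketch, assuming the load balancer evaluates rules in order and stops at the first match. The Rule type, the route helper, the rule lists, and the "testmaker-api-gateway" target name are hypothetical and are not the actual production configuration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Rule:
    host: str          # exact host to match
    path_prefix: str   # "/" matches every path on the host
    target: str        # backend that receives the request


def route(rules: List[Rule], host: str, path: str) -> Optional[str]:
    """Return the target of the first rule that matches, or None."""
    for rule in rules:
        if rule.host == host and path.startswith(rule.path_prefix):
            return rule.target
    return None


HOST = "internalservices.wwnorton.com"
LOGIN_PATH = "/testmaker-api-gateway/v1/login"

# Hypothetical misconfiguration: a host-wide rule sits ahead of the path rules.
broken = [
    Rule(HOST, "/", "ncia/rconsole"),
    Rule(HOST, "/testmaker-api-gateway/", "testmaker-api-gateway"),
]

# Hypothetical fix: rconsole gets its own subdomain, so it no longer competes
# with the path rules on the shared host.
fixed = [
    Rule("rconsole.wwnorton.com", "/", "ncia/rconsole"),
    Rule(HOST, "/testmaker-api-gateway/", "testmaker-api-gateway"),
]

broken_target = route(broken, HOST, LOGIN_PATH)
fixed_target = route(fixed, HOST, LOGIN_PATH)
print(broken_target)
print(fixed_target)
```

In the broken ordering the login request is handed to the rconsole backend, which has no such route, which is consistent with the 404 that login-web received. With the subdomain in place the path rule is reached again.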

Learned

What didn’t we know before this incident that we now know?

  1. There are specific target groups depending on the app/domain. These need to be reviewed carefully before adding or updating any rule or application domain.
  2. Using subdomains can help apply specific rules without overwriting general ones or unintentionally taking precedence over other rules.
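
One way to support that review is to check a proposed rule list for rules that can never match. The following is a minimal sketch, assuming first-match evaluation on exact host plus path prefix; the find_shadowed helper and the example rule lists are hypothetical.

```python
from typing import List, Tuple

# (host, path_prefix) pairs in evaluation order; the first match wins.
RuleKey = Tuple[str, str]


def find_shadowed(rules: List[RuleKey]) -> List[Tuple[RuleKey, RuleKey]]:
    """Return (shadowed_rule, earlier_rule) pairs where the later rule can never match."""
    shadowed = []
    for i, (host, prefix) in enumerate(rules):
        for earlier_host, earlier_prefix in rules[:i]:
            # An earlier rule on the same host with a shorter-or-equal prefix
            # catches every request the later rule would have matched.
            if earlier_host == host and prefix.startswith(earlier_prefix):
                shadowed.append(((host, prefix), (earlier_host, earlier_prefix)))
                break
    return shadowed


# Hypothetical proposal that repeats the incident: a host-wide rule first.
shared_host = [
    ("internalservices.wwnorton.com", "/"),
    ("internalservices.wwnorton.com", "/testmaker-api-gateway/"),
]

# Hypothetical proposal using a dedicated subdomain instead.
with_subdomain = [
    ("rconsole.wwnorton.com", "/"),
    ("internalservices.wwnorton.com", "/testmaker-api-gateway/"),
]

shared_report = find_shadowed(shared_host)
subdomain_report = find_shadowed(with_subdomain)
print(shared_report)
print(subdomain_report)
```

The first proposal reports the path rule as shadowed by the host-wide rule, while the subdomain proposal reports nothing, which matches the second lesson above.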

Action Items

  1. Avoid making manual changes to ingress rules; instead, implement changes via a Merge Request (MR). This ensures the change is properly reviewed and approved, while also being documented for future reference.
  2. After applying a new ingress rule to the load balancer, the team should monitor the network for any sudden spike in errors, such as 404 Not Found, to quickly identify potential issues.
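
The monitoring step in item 2 could be backed by a simple threshold check on 404 counts per interval. This is a minimal sketch, assuming per-interval 404 counts are already available from the load balancer or application logs; the is_404_spike helper, the factor of 3, and the minimum count of 10 are illustrative assumptions, not tuned values.

```python
from typing import Sequence


def is_404_spike(
    baseline: Sequence[int],
    current: int,
    factor: float = 3.0,
    min_count: int = 10,
) -> bool:
    """Flag the current interval if its 404 count is well above the recent average."""
    if current < min_count:
        return False  # ignore noise at very low volumes
    if not baseline:
        return True  # no history: any count at or above min_count is suspicious
    average = sum(baseline) / len(baseline)
    # Clamp the average to 1 so a near-zero baseline does not make
    # every small count look like a spike.
    return current > factor * max(average, 1.0)


# Hypothetical counts for the intervals before and after a rule change.
quiet_before = [2, 3, 1, 2]
print(is_404_spike(quiet_before, 40))
print(is_404_spike(quiet_before, 5))
print(is_404_spike([50, 60, 55], 70))
```

A check like this would be run for the first intervals after each ingress change, so that a rule that starts swallowing traffic is noticed shortly after it is applied.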