Incident: Deep Linking Tool (DLT) Login flow
On 11/25, Patricia Lochary found issues when launching DLT and asked the Integrations dev team to look into it (related ticket: INT-2173). After initial research on 11/26, the dev team found indications that Login Web / Testmaker API Gateway was having issues with token generation/validation and asked the team lead (Ankur) to review it. He suggested restarting the pods, as we had done on previous occasions. During that initial check we were notified that at least one instructor had reported being unable to log in to DLT. While we later determined that the login/auth flow was affected, we received no indication that any other service or product was affected, which is why this incident report focuses on the DLT Login flow.
While going through the error logs and debugging Testmaker's api-gateway code, we noticed two things:
- The error message being logged came from a piece of code that was attempting to run a query against an in-memory instance of a collection that had not been properly initialized (although the application was properly connected to the database).
- The database connection string used by the api-gateway had an explicit database shard specified.
Our hypothesis, based on those two details, is that when the external-api-gateway pod booted up and connected to the MongoDB database, that particular shard did not have access to that particular collection.
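To make the first observation concrete, the following TypeScript sketch shows how a missing in-memory collection handle can lead to the logged TypeError, and how a guarded lookup could fail at boot instead. It is a minimal illustration under stated assumptions, not the actual api-gateway code: `SessionCollection`, `StoreRegistry`, `unsafeGet`, `require`, and `assertReady` are hypothetical names, and `tssStore` is used only as an example collection name.

```typescript
// Minimal stand-in for a driver collection handle (hypothetical shape).
interface SessionCollection {
  insertOne(doc: Record<string, unknown>): Promise<{ insertedId: string }>;
}

// Hypothetical in-memory registry of collection handles, filled at boot.
class StoreRegistry {
  private readonly collections = new Map<string, SessionCollection>();

  register(name: string, collection: SessionCollection): void {
    this.collections.set(name, collection);
  }

  // Unsafe lookup: returns undefined when the handle was never registered.
  // Calling `.insertOne(...)` on that result is what raises
  // "Cannot read properties of undefined (reading 'insertOne')".
  unsafeGet(name: string): SessionCollection | undefined {
    return this.collections.get(name);
  }

  // Guarded lookup: fails with an error that names the missing collection.
  require(name: string): SessionCollection {
    const collection = this.collections.get(name);
    if (collection === undefined) {
      throw new Error(`Collection "${name}" was not initialized at startup`);
    }
    return collection;
  }

  // Boot-time check: throws if any required handle is missing, so the
  // process can refuse to serve traffic instead of failing per request.
  assertReady(requiredNames: string[]): void {
    for (const name of requiredNames) {
      this.require(name);
    }
  }
}
```

If a startup or readiness check called something like `assertReady(["tssStore"])` after connecting, a pod in this state would presumably surface the problem immediately, with a message that names the collection, rather than on each login attempt. Whether that fits the real service is something the follow-up investigation would need to confirm.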
- Date of incident: Wednesday, November 27th, 2024
- Recovery time: 26 hours
- Impacted product(s): Deep Linking Tool (DLT), Auth Service
- Primary investigator(s) [1]: Ankur Wase, Kevin Temes
- Incident repair team [2]: Christian Hardy, Nestor Pastor, Rajaram Rane, Roberto Crisial, Sanjay Akula, Steve Hurlock
Root Cause Analysis
- Instructors trying to launch a course landed in DLT login and were not able to login.
- The session id was not generated.
- There was an error trying to insert a value in a DB collection used to store session information.
- The DB collection was not accessible from the code responsible for handling the flow.
- The database shard used by the pod did not have access to the DB collection.
- Open question: why the database shard was specified in the credentials.
Timeline
- All times EST on November 25th-27th, 2024.
- The issue lasted more than 48 hours.
- Dev team became aware of it at 10:00 am on 11/26 during standup, and there were no new reports at that time. Measured from the moment we started reviewing the issue, the recovery time was 26 hours.
11/25
Afternoon – During partner integration testing, Patricia Lochary faced the login issue.
11/26
10:00 am – During standup (SU), the issue was mentioned for the first time.
10:19 am – INT-2173 created for dev review.
10:30 am – Dev team started investigating, gathering logs and testing flows with different users.
3:18 pm – Dev team concluded that the issue was related to the login web app, asked Ankur to review it with a suggestion to restart the pods, and moved the ticket to feedback required.
4:50 pm – Another report was received of instructors unable to log in.
4:55 pm – Ticket was reprioritized to major.
11/27
4:44 am – Ankur confirmed that a ticket had been created to analyze previous occurrences of the issue and requested that the pods be restarted if the issue persisted.
8:48 am – Steve created a chat with the primary investigator(s) and the incident repair team to confirm that the login-web pods should be restarted.
8:52 am – Roberto restarted login-web pods.
9:15 am – Kevin confirmed that the error was still happening. The log was slightly different but belonged to the same login flow, and the new clue pointed to norton-auth-service.
9:21 am – Ankur confirmed that he was not able to reproduce the issue in Testmaker.
9:22 am – Kevin’s log showed a ValidateToken error (/validatetoken in norton-auth-service).
9:31 am – Robert and Steve looked into another known issue (EAI_AGAIN) but did not find anything.
9:48 am – Ankur and Kevin continued with the hypothesis of a session id / id token issue and the refresh token flow.
10:32 am – Steve found many invalid token messages, but they were related to nerd-service and were part of the expected flow.
10:47 am – Ankur suggested reviewing the refresh token process in api-gateway.
11:00 am – Claudio started a war room to triage together with everyone listed under Primary investigator(s) and Incident repair team, and posted it in the Incident channel.
11:16 am – Steve and Robert continued checking logs with the hypothesis of a DB issue raised by Kevin in the initial ticket comment (TypeError: Cannot read properties of undefined (reading 'insertOne')).
11:20 am – Integrations QA Lead, Rajaram Rane, was pulled into the call to help reproduce the issue.
11:35 am – Ankur reviewed the same log message (“Cannot read properties of undefined”). It came from a package that, if broken, should have been failing for all flows and even for other products, so he was not convinced it was the root cause, but he suggested restarting the api-gateway pods.
11:39 am – Robert restarted api-gateway pods.
11:44 am – The issue was still present. The team reviewed the credentials in case they had been overwritten during the last deployment, and checked the logs for DB errors. No clues at this point.
11:45 am – Robert pointed out the external-testmaker-api-gateway pods, which run the same solution but are accessible from outside the VPC.
11:52 am – STG was confirmed to be working as expected with the same DLT codebase as Production.
11:55 am – Team reviewed the use of the token id in DLT, the logs, and the expected flow. There was no conclusion that the token was wrong at that point, but an invalid signature was found. The main clue was still that the session id could not be created.
12:02 pm – Nestor Pastor confirmed that he was unable to log in, just as Christian Hardy had been in previous tests.
12:04 pm – Ankur reviewed the package wwnortontask-scheduler-service, related to the message TypeError: Cannot read properties of undefined (reading 'insertOne'), and the tssStore collection of the Testmaker DB in the ncia-prod-aux cluster.
12:12 pm – Steve found that the api-gateway credentials specified a particular shard for the DB connection. At the time this did not seem to be an issue, because the connection was using the SRV protocol.
12:19 pm – Patricia sent an email (Subject: “Instructors unable to sign into Norton Learning Tools”) to notify about the issue and set a protocol for communication in case new reports were received.
12:23 pm – Ankur found in the logs that the affected pod was actually external-testmaker-api-gateway and requested that it be restarted.
12:25 pm – The external-testmaker-api-gateway pod was restarted.
12:30 pm – The call ended after the team confirmed internally that the issue was solved. This was also confirmed in the channel thread.
Learned
What didn’t we know before this incident that we now know?
- api-gateway has a replica pod (external-testmaker-api-gateway) for external calls (from outside the VPC), because DLT is a frontend app that needs to connect directly to validate the token (through appservices).
- We had been assuming that the logs were coming from the api-gateway pods, when they were actually coming from external-testmaker-api-gateway.
- Restarting pods can resolve the issue temporarily, but the root cause needs to be addressed so that we do not face the same issue again and spend time repeating the same triage research and actions.
- We need to treat a login issue as high severity until someone confirms it is not widespread. We lost valuable time before we figured out that the issue had a wider impact.
- We need to double-check every change request to confirm that the changes were actually applied (Yamanishi, Evan: MongoDB connections).
Action Items
- Force a reproduction of the issue in STG (INT-2193).
- Figure out why a pod connected with DB is unable to retrieve a valid collection (ENG-774).
- Review MongoDB logs (ENG-775).
- Review the invalid signature in the token id in DLT and improve logging (INT-2198).
- Review invalid token error in nerd-service and improve logging (NERD-3966).
- Prioritize a task to document each common and legacy package / service (Eng Monthly).
- Promote tasks to improve logging (Epic: ENG-777 / + all teams).
- Review and remove the use of DB shards on applications (ENG-778 / impacted teams).
- Decouple DLT from api-gateway (INT-2184).
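As a starting point for the review of DB shards in application credentials, a small audit helper along the following lines could flag connection strings that pin individual hosts. This is a sketch under assumptions rather than an agreed procedure: `auditMongoUri`, `UriAudit`, and `SHARD_HOST_PATTERN` are hypothetical names, the example URIs are made up, and the `-shard-NN-NN` host naming pattern is an assumption about how a pinned shard would appear in our credentials.

```typescript
// Result of inspecting one MongoDB connection string (hypothetical shape).
interface UriAudit {
  scheme: string;        // "mongodb" or "mongodb+srv"
  hosts: string[];       // host names listed in the URI, ports removed
  pinnedHosts: string[]; // hosts that look like individual shard members
}

// Assumption: a pinned shard member has "-shard-NN-NN." in its host name.
const SHARD_HOST_PATTERN = /-shard-\d+-\d+\./i;

function auditMongoUri(uri: string): UriAudit {
  // Capture the scheme, skip optional "user:password@", capture the host list.
  const match = /^(mongodb(?:\+srv)?):\/\/(?:[^@\/]*@)?([^\/?]+)/.exec(uri);
  if (!match) {
    throw new Error("Not a MongoDB connection string");
  }
  const scheme = match[1];
  // Standard URIs may list several comma-separated hosts, each with a port.
  const hosts = match[2].split(",").map((host) => host.replace(/:\d+$/, ""));
  const pinnedHosts = hosts.filter((host) => SHARD_HOST_PATTERN.test(host));
  return { scheme, hosts, pinnedHosts };
}
```

For example, `auditMongoUri("mongodb://user:pw@example-shard-00-01.abc.example.net:27017/testmaker")` would report one pinned host, while a `mongodb+srv://` URI pointing at a cluster-level host name would report none. Any non-empty `pinnedHosts` result is only a prompt for manual review, not proof of a misconfiguration.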
Footnotes
- [1] The primary investigators are the individuals who discovered the technical cause(s) of the incident and/or found the solution to the issue.
- [2] The incident repair team were involved in the work to fix the issue, whether they found the cause and solution or not. They are called out because they are the next most likely to be able to help explain the details of the incident, after the primary investigator(s).
