Downtime

Discord Tickets is offline

Mar 28 at 12:57am GMT

Affected services

Discord Tickets (Public)

Resolved
Mar 31 at 04:32am BST

Post-mortem

Timeline

The crash

The public instance went offline at 00:47 UTC on Friday, March 28th. The cause of the crash is unknown, and it did not coincide with a scheduled update. The cause is not particularly important, however, as occasional crashes followed by automatic recovery within a few minutes are relatively normal.
Unfortunately, although the container restarted automatically, the service was unable to recover this time.

Token is revoked

At 06:29 UTC, Discord revoked the bot's token because it had reconnected 1,000 times. There were no obvious signs as to why; the container did not restart during this time.

New token generated

At approximately 17:00 UTC, the container was recreated with a new bot token. The bot was observed slowly receiving GUILD_DELETE events but did not become READY within a normal timeframe.

Another new token

At approximately 22:00 UTC, although the previous token had not been revoked, the bot still had not become READY so the token was regenerated again and the container was recreated. This made no difference, so the bot was left to reconnect overnight.

Token is revoked again

At 04:11 UTC on Saturday, March 29th, Discord revoked the bot's token for a second time.

Root cause identified

The root cause was finally identified at approximately 01:30 UTC on Monday, March 31st. A solution was found 10 minutes later.

Service recovers

After the fix was implemented and deployed, the service recovered at 02:04 UTC.

The cause

As this incident wasn't caused by a change that could be rolled back, it required investigating a production service, which was made more difficult by the lack of error logs. Even with debug logging at its most verbose, no additional context was revealed, leaving just two clues:

Suspicious GUILD_DELETE events,
Apparent repeated reconnection.

However, discord.js was not surfacing any indication of its connection state. This was suspected to be a limitation of the library's internal sharding feature, which Discord Tickets still uses.
Attempting to connect to Discord using the dedicated ShardingManager revealed shards dying with an error:
Client took too long to become ready.

After a quick search, the suspicions were confirmed and a solution was found. The slowly arriving GUILD_DELETE events were holding up the connection for so long that discord.js gave up and tried again. This can be resolved by increasing the timeout option.

The proposed solution would require migrating the codebase to use ShardingManager, which is an overdue change needed to improve performance and, as this incident has shown, observability too. However, doing this would have added a day to an already long incident.

Thankfully, internal sharding supports a similar option, allowing the service to be recovered in time for Monday morning.
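The failure mode described above can be modelled with a small, dependency-free sketch. It is a hypothetical simulation and does not reproduce discord.js internals: the names `attemptStartup` and `connectWithRetries`, the one-second gap between events, and both timeout values are assumptions chosen for illustration. It shows how a fixed ready deadline combined with slowly arriving guild events can turn into an endless reconnect loop, and how raising the deadline lets the first attempt succeed.

```typescript
// Hypothetical model for illustration only; this is not discord.js code.

interface AttemptResult {
  ready: boolean;
  elapsedMs: number;
}

/**
 * Simulates one connection attempt. The client counts as ready once every
 * pending guild event has arrived, unless the ready deadline passes first.
 */
function attemptStartup(eventDelaysMs: number[], readyTimeoutMs: number): AttemptResult {
  let elapsedMs = 0;
  for (const delayMs of eventDelaysMs) {
    elapsedMs += delayMs;
    if (elapsedMs > readyTimeoutMs) {
      // Deadline hit before the last event arrived: the attempt is abandoned.
      return { ready: false, elapsedMs: readyTimeoutMs };
    }
  }
  return { ready: true, elapsedMs };
}

/**
 * Retries until an attempt succeeds or maxAttempts is reached, and reports
 * how many connections were opened along the way.
 */
function connectWithRetries(
  eventDelaysMs: number[],
  readyTimeoutMs: number,
  maxAttempts: number
): { ready: boolean; attempts: number } {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (attemptStartup(eventDelaysMs, readyTimeoutMs).ready) {
      return { ready: true, attempts: attempt };
    }
  }
  return { ready: false, attempts: maxAttempts };
}

// Assumed numbers: 50 guild events arriving one second apart (50 s in total).
const slowEvents: number[] = Array.from({ length: 50 }, () => 1000);

// With an assumed 30 s deadline every attempt times out, so all 1,000 attempts are used.
console.log(connectWithRetries(slowEvents, 30000, 1000));

// With an assumed 120 s deadline the first attempt becomes ready.
console.log(connectWithRetries(slowEvents, 120000, 1000));
```

In this model nothing changes between attempts, so retrying alone can never succeed. Only a longer deadline or faster events break the loop, which is consistent with the token regenerations making no difference during the incident.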

The `GUILD_DELETE` events

This event is delivered when a guild becomes unavailable, often because the bot has been removed from it. When it is received at startup, however, it is usually due to an outage or other problem.
The connection timeouts were likely caused by an increase in the number of unavailable guilds, which had reached almost 50.

Updated
Mar 29 at 03:51pm GMT

The cause is still being investigated. Please join the Discord server at https://lnk.earth/discord for updates.

Created
Mar 28 at 12:57am GMT

Discord Tickets (Public instance) went down.