Discord Tickets is offline
Resolved
Mar 31 at 04:32am BST
Post-mortem
Timeline
The crash
The public instance went offline at 00:47 UTC on Friday, March 28th. The cause is unknown, and it was not a scheduled update. However, the cause is not particularly important, as occasional crashes followed by automatic recovery within a few minutes are relatively normal.
Unfortunately, despite the container automatically restarting, the service was unable to recover this time.
Token is revoked
At 06:29 UTC, Discord revoked the bot's token because it had reconnected 1,000 times. There were no obvious signs as to why; the container did not restart during this time.
New token generated
At approximately 17:00 UTC, the container was recreated with a new bot token. The bot was observed slowly receiving GUILD_DELETE events but did not become READY within a normal timeframe.
Another new token
At approximately 22:00 UTC, although the previous token had not been revoked, the bot still had not become READY, so the token was regenerated again and the container was recreated. This made no difference, so the bot was left to reconnect overnight.
Token is revoked again
At 04:11 UTC on Saturday, March 29th, Discord revoked the bot's token for a second time.
Root cause identified
The root cause was finally identified at approximately 01:30 UTC on Monday, March 31st. A solution was found 10 minutes later.
Service recovers
After the fix was implemented and deployed, the service recovered at 02:04 UTC.
The cause
As this incident wasn't caused by a change that could be rolled back, it required investigating a production service, which was made more difficult by the lack of error logs. Even with maximum debugging logs, no additional context was revealed, leaving just two clues:
- Suspicious GUILD_DELETE events,
- Apparent repeated reconnection.
However, discord.js was not surfacing any indication of its connection state. This was suspected of being a limitation of the library's internal sharding feature, which Discord Tickets still uses.
Attempting to connect to Discord using the dedicated ShardingManager revealed shards dying with an error:
Client took too long to become ready.
After a quick search, suspicions were confirmed and a solution was found. The slowly arriving GUILD_DELETE events were holding up the connection for so long that discord.js gave up and tried again. This can be resolved by increasing the timeout option.
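The retry loop described above can be modelled with a small, library-independent sketch. It does not use discord.js; the function name, the 45-second startup time and both timeout values are hypothetical, and only the 1,000-attempt figure comes from this incident. It assumes every attempt starts from scratch and takes the same time to become ready, which is why retrying alone never helps while raising the timeout does.

```typescript
// Hypothetical model of a "ready" timeout; this is NOT discord.js code.
interface StartupOutcome {
  ready: boolean;     // whether any attempt became ready
  attempts: number;   // connection attempts used
  elapsedMs: number;  // total simulated time spent
}

// Each attempt needs `startupMs` to become ready but is abandoned after
// `readyTimeoutMs`. Abandoned attempts restart from scratch.
function simulateStartup(
  startupMs: number,
  readyTimeoutMs: number,
  maxAttempts: number,
): StartupOutcome {
  let elapsedMs = 0;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (startupMs <= readyTimeoutMs) {
      // Startup finishes before the timeout fires: the client is ready.
      return { ready: true, attempts: attempt, elapsedMs: elapsedMs + startupMs };
    }
    // The timeout fires first: give up on this attempt and reconnect.
    elapsedMs += readyTimeoutMs;
  }
  return { ready: false, attempts: maxAttempts, elapsedMs };
}

// Assumed numbers: startup takes 45 s, the timeout is 30 s, then 120 s.
const stuck = simulateStartup(45_000, 30_000, 1_000);
const recovered = simulateStartup(45_000, 120_000, 1_000);
console.log(stuck.ready, stuck.attempts);         // false 1000
console.log(recovered.ready, recovered.attempts); // true 1
```

In this model the short timeout burns through all 1,000 attempts without ever becoming ready, while the longer timeout succeeds on the first attempt.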
The proposed solution would require migrating the codebase to use ShardingManager, which is an overdue change needed to increase performance and, as this incident has shown, observability too. However, doing this would add a day to an already long incident.
Thankfully, internal sharding supports a similar option, allowing the service to be recovered in time for Monday morning.
The GUILD_DELETE events
This event is delivered when a guild becomes unavailable, often due to the bot being removed, but when it is received at startup it is usually due to an outage or other problem.
The connection timeouts were likely caused by an increase in the number of unavailable guilds, which had reached almost 50.
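As a rough illustration of telling the two cases apart, the sketch below tallies GUILD_DELETE payloads by their `unavailable` field. It assumes the payload shape described in Discord's gateway documentation, where `unavailable: true` indicates an outage and a missing field means the bot was removed. The interface, the function name and the guild IDs are hypothetical and are not taken from the Discord Tickets codebase.

```typescript
// Minimal shape of a GUILD_DELETE payload (assumed from Discord's gateway docs).
interface GuildDeletePayload {
  id: string;
  unavailable?: boolean; // true: outage; absent: bot removed from the guild
}

interface GuildDeleteTally {
  unavailable: number; // guilds that are temporarily unavailable
  removed: number;     // guilds the bot is no longer a member of
}

// Count how many payloads signal an unavailable guild versus a removal.
function tallyGuildDeletes(payloads: GuildDeletePayload[]): GuildDeleteTally {
  const tally: GuildDeleteTally = { unavailable: 0, removed: 0 };
  for (const payload of payloads) {
    if (payload.unavailable === true) {
      tally.unavailable += 1;
    } else {
      tally.removed += 1;
    }
  }
  return tally;
}

// Made-up example payloads.
const sample: GuildDeletePayload[] = [
  { id: "1", unavailable: true },
  { id: "2" },
  { id: "3", unavailable: true },
];
console.log(tallyGuildDeletes(sample)); // { unavailable: 2, removed: 1 }
```

Logging a tally like this at startup would make a rise in unavailable guilds visible before it shows up as connection timeouts.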
Affected services
Discord Tickets (Public instance)
Updated
Mar 29 at 03:51pm GMT
The cause is still being investigated. Please join the Discord server at https://lnk.earth/discord for updates.
Affected services
Discord Tickets (Public instance)
Created
Mar 28 at 12:57am GMT
Discord Tickets (Public instance) went down.
Affected services
Discord Tickets (Public instance)