What do you do when 3% of requests fail?
That was my reality one painful day in September. Deploy a new feature, pat yourself on the back, go to lunch. A big feature, shipped. Nothing can touch you. 💪
A Slack message? Who's working through lunch this time ...
Weird, that's a lot of messages for lunchtime ...
Everything is on fire. High API error rate 🔥
~ an alarm SMS
Our system sent alarms to Slack. Serious alarms to SMS.
The high API error rate was the biggest alarm of all: a catch-all that triggers when something is wrong and you can't be sure the more specific alarms even work.
Back to my desk, my heart sank: 3% of all API requests were failing. Reasons unknown.
Every user action, every background process, on the web, on iOS, and on Android, every time you opened the site or the app 👉 a 3% chance of failure. At our usual 10 requests per second, that's 18 errors every minute!
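A quick sanity check on that arithmetic, using integer math so the numbers come out exact:

```python
# 10 requests/second at a 3% failure rate -> how many errors per minute?
requests_per_second = 10
failure_rate_percent = 3

requests_per_minute = requests_per_second * 60  # 600 requests/minute
errors_per_minute = requests_per_minute * failure_rate_percent // 100

print(errors_per_minute)  # 18 failed requests every minute
```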
Wanna know the best part?
Nobody noticed. ✌️
I knew something was wrong because of that SMS. The system kept hobbling along. Slower, in pain, but getting the job done.
How can a system with 18 failures per minute keep working?
The design principles behind every resilient backend architecture state:
- Everything can and will fail
- Your system should work anyway
- Make failures easy to fix