Robust backend design
Imagine you're a Roman general leading a vast and powerful army. You're about to attack a city.
But you can't do it alone.
Your buddy with another vast and powerful army hides behind a hill on the other side. You need their help to win.
Attack together and win. Attack alone and die.
How do you ensure a joint attack?
Smoke signals would reveal your plan to the city. It's too far to shout and phones are 2000 years in the future.
A messenger is your best bet. Run to the other army, deliver the message, come back with confirmation.
Unless they're caught. 🤔
The messenger could fail to deliver your message. Or get caught on their way back. You'll never know.
Send more messengers until one makes it back? How does your friend know that any messenger made it back? Nobody wants to attack alone.
This puzzle is known as the Two Generals' Problem. There is no solution.
No algorithm can guarantee that both sides agree when messages can get lost. The best you can do is "Pretty sure."
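Here's a rough way to put numbers on "Pretty sure." This is a back-of-the-envelope sketch, not a model of any real network. It assumes each leg of the trip fails independently with the same probability. The function name and the 30% loss rate are made up for illustration.

```python
def chance_of_confirmation(loss_rate: float, messengers: int) -> float:
    """Probability that at least one messenger completes the round trip.

    Assumption: each leg (there and back) is lost independently
    with probability `loss_rate`.
    """
    round_trip_ok = (1 - loss_rate) ** 2       # survive both legs
    all_fail = (1 - round_trip_ok) ** messengers
    return 1 - all_fail


if __name__ == "__main__":
    # Hypothetical 30% chance of losing a messenger on each leg
    for n in (1, 3, 10, 50):
        print(f"{n:>2} messengers: {chance_of_confirmation(0.3, n):.6f}")
```

More messengers push the number closer to 1, but with any loss at all the math never reaches it. And this only covers your side. Your friend still can't tell whether the confirmation made it back.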
And that's why distributed systems are hard.
You cannot have 100% reliability. As soon as servers talk to each other, you're doomed to probabilities.
Serverless systems are always distributed. 😅
Build a robust backend
A robust backend keeps working in the face of failure.
As we mentioned in the Architecture Principles chapter, your backend follows 3 principles:
- Everything can and will fail
- Your system should keep working
- Failures should be easy to fix
You get there with a combination of error recovery, error isolation, and knowing when your system needs help.
The strategies mentioned in Architecture Principles were:
- isolate errors
- retry until success
- make operations replayable
- be debuggable
- remove bad requests
- alert the engineer when something's wrong
- control your flow
This chapter talks about how.
Isolate errors
In February 2017, Amazon S3 went down and took half the internet with it. Root cause was a typo.
AWS engineers were debugging a slow billing system and meant to take a few servers offline. A typo in the command took out far more than intended, including servers that core S3 subsystems relied on. Those subsystems had to restart from scratch.
Soon the whole system was down.
And because AWS relies on S3 to store files ... much of AWS went down. And because half the internet runs on AWS ... it went down.
AWS couldn't even update their status dashboard because the error icons lived on S3.
To isolate errors you have to reduce inter-dependency. Always think: "What can I do to make moving pieces less dependent on each other?"
In your car, the brakes keep working even if your brake lights go out. The systems work together, but independently.
Inter-dependency can be subtle and hard to spot. The specifics are different each time.
Here are 3 rules:
- Give each operation a single responsibility
- Do the whole operation in one atomic go
- Avoid coupling
Serverless functions are optimized for this approach by default. They encourage you to keep code light and isolated 🤘
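A minimal sketch of the three rules, assuming a made-up order flow. Every name here (`save_order`, `send_receipts`, the in-memory `orders` and `queue`) is hypothetical. A real system would use a database and a managed queue instead of Python containers.

```python
import json
from typing import Callable

# Hypothetical stand-ins for a database table and a message queue
orders = {}
queue = []


def save_order(event: dict) -> dict:
    """Single responsibility: store the order, then hand off."""
    order = json.loads(event["body"])
    orders[order["id"]] = order
    queue.append(order["id"])  # the receipt function picks this up later
    return {"statusCode": 200, "body": json.dumps({"saved": order["id"]})}


def send_receipts(send_email: Callable[[dict], None]) -> list:
    """Separate operation: drain the queue, return the ids that failed."""
    failed = []
    while queue:
        order_id = queue.pop(0)
        try:
            send_email(orders[order_id])
        except Exception:
            # The email provider is down. The order is still saved.
            failed.append(order_id)
    return failed
```

Like the brakes and the brake lights: if email breaks, saving orders keeps working. The two functions only share the queue, so a failure in one stays in that one.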