Chapter 1 - Reliability, Scalability, Maintainability

Reliability

Definition: continuing to work correctly, even when things go wrong.

  • Typical expectations of a reliable system:
    • The application performs the function that the user expected.
    • It can tolerate the user making mistakes or using the software in unexpected ways.
    • Its performance is good enough for the required use case, under the expected load and data volume.
    • The system prevents unauthorized access and abuse.
  • Faults
    • A fault is one component of the system deviating from its spec.
    • Examples of hardware faults:
      • Hard disk crash
      • Faulty RAM
      • Power grid blackout
      • Unplugged network cable
      • Redundancy is the key to tolerating such faults
    • Examples of software faults:
      • A bug that makes servers crash when given a particular bad user input
      • A runaway process that uses up shared resources: CPU time, RAM, disk space, ...
      • A service that becomes unresponsive or starts returning corrupted responses
      • Cascading failures, where a small fault in one component triggers faults in other components, which in turn trigger further failures
    • Human errors, e.g. configuration mistakes by operators, which are a leading cause of outages
  • Failures
    • The system as a whole stops providing the required service
  • We need to design fault-tolerant systems which prevent faults from causing failures (see the sketch below)
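A minimal sketch of one common fault-tolerance technique, assuming a hypothetical flaky dependency: retry transient faults with exponential backoff so that they do not surface to the user as failures.

```python
import random
import time

def call_with_retries(operation, max_attempts=3, base_delay=0.1):
    """Retry a flaky operation so a transient fault in one component
    does not propagate into a user-visible failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except IOError:  # hypothetical transient fault type
            if attempt == max_attempts:
                raise  # out of retries: the fault becomes a failure
            # Exponential backoff with jitter, so we do not hammer the
            # struggling component and trigger a cascading failure.
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.05))

# Usage (flaky_service is hypothetical):
# result = call_with_retries(lambda: flaky_service.fetch("user-42"))
```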

Scalability

Measuring load and performance.

  • Describing load well is key. It can be (these are load parameters):
    • Requests per second on a web server
    • Ratio of reads to writes on a database
    • Amount of data to process in batches
  • Once the load is described, ask two questions:
    • If you increase a load parameter and keep all resources unchanged (CPU, memory, network, ...), how is performance affected?
    • If you increase a load parameter and want to keep performance unchanged, how much do you need to increase each resource?
  • Measure performance in percentiles
    • Performance varies a lot from request to request, so measure it as a distribution of values rather than a single number
    • The average (mean) is a poor metric; the median, abbreviated p50 (half of requests are faster, half are slower), is already a better indicator
    • Look at higher percentiles too, in particular p95, p99 and p999 (the 95th, 99th and 99.9th percentiles)
      • p999 isolates the slowest 1 in 1,000 requests. Amazon puts a lot of effort into these because they often belong to the customers with the most data on their accounts (the most purchases, the most services, etc.): in short, their best customers.
  • Track percentiles in monitoring dashboards (a rough percentile computation is sketched after this list)
    • For efficient percentile approximation over streams, read about: forward decay, t-digest, HdrHistogram
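A minimal sketch of computing latency percentiles from a collected sample using the nearest-rank method; the sample values below are made up, and a production system would use one of the streaming approaches listed above.

```python
def percentile(latencies_ms, p):
    """Return the latency below which roughly p% of requests fall
    (nearest-rank method on a sorted sample)."""
    ordered = sorted(latencies_ms)
    rank = max(0, round(p / 100 * len(ordered)) - 1)
    return ordered[rank]

# Hypothetical response times (ms) collected over a time window.
samples = [12, 15, 14, 20, 480, 16, 13, 19, 22, 1500]

for p in (50, 95, 99, 99.9):
    print(f"p{p}: {percentile(samples, p)} ms")
```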

How to scale?

  • Vertically
    • Move to a more powerful machine (often the case for databases, which are stateful, until the cost justifies scaling horizontally)
  • Horizontally
    • Distribute the load across multiple smaller machines (easy for stateless applications, hence it’s the usual approach for web app servers)

Maintainability

Operability, simplicity and evolvability.

  • Operability
    • Make it easy for operations teams to keep the system running
    • Operations care about:
      • Monitoring health of the system and restore it quickly when things go bad
      • Tracking down cause of failures (system failures or degraded performance)
      • Keeping up to date (particularly security patches)
      • Anticipating future problems and solving them before they occur (e.g. capacity planning)
      • Performing complex maintenance tasks, such as moving an application from one platform to another
    • Good systems should therefore (see the sketch after this list):
      • Provide visibility into the runtime behaviour and internals of the system for good monitoring
      • Avoid dependency on individual machines
      • Provide default behaviour with possibility to be overridden when needed
      • Be able to self-heal where appropriate
  • Simplicity
    • Make it easy for new engineers to understand the system
  • Evolvability
    • Make it easy for engineers to make changes to the system
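A rough sketch of the operability points above (visibility into health plus self-healing), assuming a hypothetical /health endpoint and systemd service name: a supervisor loop that probes the service and restarts it when the check fails.

```python
import subprocess
import time
import urllib.request

HEALTH_URL = "http://localhost:8080/health"      # hypothetical health endpoint
RESTART_CMD = ["systemctl", "restart", "myapp"]  # hypothetical service name

def is_healthy(url, timeout=2):
    """Probe the service's health endpoint; any error counts as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def supervise(check_interval=10):
    """Crude self-healing loop: restart the service whenever the check fails."""
    while True:
        if not is_healthy(HEALTH_URL):
            print("health check failed, restarting service")
            subprocess.run(RESTART_CMD, check=False)
        time.sleep(check_interval)

# supervise()  # runs forever; shown here only as a sketch
```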
