Chapter 1 - Reliability, Scalability, Maintainability

Reliability

Definition: continuing to work correctly, even when things go wrong.

  • Typical expectations of a reliable system:
    • The application performs the function that the user expected.
    • It can tolerate the user making mistakes or using the software in unexpected ways.
    • Its performance is good enough for the required use case, under the expected load and data volume.
    • The system prevents unauthorized access and abuse.
  • Faults
    • A fault is one component of the system deviating from its spec.
    • Examples of hardware faults:
      • Hard disk crash
      • Faulty RAM
      • Power grid blackout
      • Unplugged network cable
      • Redundancy is the key to tolerating such faults
    • Examples of software faults:
      • A bug that makes servers crash when given a particular bad user input
      • A runaway process that uses up shared resources: CPU time, RAM, disk space, ...
      • A service that becomes unresponsive or starts returning corrupted responses
      • Cascading failures, where a small fault in one component triggers faults in other components, which in turn trigger further failures
    • Human errors, e.g. configuration mistakes by operators, which are a leading cause of outages
  • Failures
    • The system as a whole stops providing the required service
  • We need to design fault-tolerant systems which prevent faults from causing failures (see the sketch below)
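A minimal sketch of one common fault-tolerance technique, assuming a hypothetical flaky dependency: retry transient faults with exponential backoff so that they do not surface to the user as failures.

```python
import random
import time

def call_with_retries(operation, max_attempts=3, base_delay=0.1):
    """Retry a flaky operation so a transient fault in one component
    does not propagate into a user-visible failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except IOError:  # hypothetical transient fault type
            if attempt == max_attempts:
                raise  # out of retries: the fault becomes a failure
            # Exponential backoff with jitter, so we do not hammer the
            # struggling component and trigger a cascading failure.
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.05))

# Usage (flaky_service is hypothetical):
# result = call_with_retries(lambda: flaky_service.fetch("user-42"))
```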

Scalability

Measuring load and performance.

  • Describing load well is key. It can be (these are load parameters):
    • Requests per second on a web server
    • Ratio of reads to writes on a database
    • Amount of data to process in batches
  • Once the load is described, ask two questions:
    • If you increase a load parameter and keep all resources unchanged (CPU, memory, network, ...), how is performance affected?
    • If you increase a load parameter and want to keep performance unchanged, how much do you need to increase each resource?
  • Measure performance in percentiles
    • Performance varies a lot from request to request, so measure it as a distribution of values rather than a single number
    • The average (mean) is a poor metric; the median, abbreviated p50 (half of requests are faster, half are slower), is already a better indicator
    • Look at higher percentiles too, in particular p95, p99 and p999 (the 95th, 99th and 99.9th percentiles)
      • p999 isolates the slowest 1 in 1,000 requests. Amazon puts a lot of effort into these because they often belong to the customers with the most data on their accounts (the most purchases, the most services, etc.): in short, their best customers.
  • Track percentiles in monitoring dashboards (a rough percentile computation is sketched after this list)
    • For efficient percentile approximation over streams, read about: forward decay, t-digest, HdrHistogram
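A minimal sketch of computing latency percentiles from a collected sample using the nearest-rank method; the sample values below are made up, and a production system would use one of the streaming approaches listed above.

```python
def percentile(latencies_ms, p):
    """Return the latency below which roughly p% of requests fall
    (nearest-rank method on a sorted sample)."""
    ordered = sorted(latencies_ms)
    rank = max(0, round(p / 100 * len(ordered)) - 1)
    return ordered[rank]

# Hypothetical response times (ms) collected over a time window.
samples = [12, 15, 14, 20, 480, 16, 13, 19, 22, 1500]

for p in (50, 95, 99, 99.9):
    print(f"p{p}: {percentile(samples, p)} ms")
```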

How to scale?

  • Vertically
    • Move to a more powerful machine (often the case for databases, which are stateful, until the cost justifies scaling horizontally)
  • Horizontally
    • Distribute the load across multiple smaller machines (easy for stateless applications, hence it’s the usual approach for web app servers)

Maintainability

Operability, simplicity and evolvability.

  • Operability
    • Make it easy for operations teams to keep the system running
    • Operations care about:
      • Monitoring health of the system and restore it quickly when things go bad
      • Tracking down cause of failures (system failures or degraded performance)
      • Keeping up to date (particularly security patches)
      • Anticipating future problems and solving them before they occur (e.g. capacity planning)
      • Performing complex maintenance tasks, such as moving an application from one platform to another
    • Good systems should therefore (see the sketch after this list):
      • Provide visibility into the runtime behaviour and internals of the system for good monitoring
      • Avoid dependency on individual machines
      • Provide default behaviour with possibility to be overridden when needed
      • Be able to self-heal where appropriate
  • Simplicity
    • Make it easy for new engineers to understand the system
  • Evolvability
    • Make it easy for engineers to make changes to the system
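A rough sketch of the operability points above (visibility into health plus self-healing), assuming a hypothetical /health endpoint and systemd service name: a supervisor loop that probes the service and restarts it when the check fails.

```python
import subprocess
import time
import urllib.request

HEALTH_URL = "http://localhost:8080/health"      # hypothetical health endpoint
RESTART_CMD = ["systemctl", "restart", "myapp"]  # hypothetical service name

def is_healthy(url, timeout=2):
    """Probe the service's health endpoint; any error counts as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def supervise(check_interval=10):
    """Crude self-healing loop: restart the service whenever the check fails."""
    while True:
        if not is_healthy(HEALTH_URL):
            print("health check failed, restarting service")
            subprocess.run(RESTART_CMD, check=False)
        time.sleep(check_interval)

# supervise()  # runs forever; shown here only as a sketch
```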
