March 10, 2020 | #book review
Let's dig into Martin Kleppmann's "Designing Data-Intensive Applications" and talk about the facets of reliable, scalable, and maintainable applications!
The first chapter of Designing Data-Intensive Applications (from here on referred to as DDIA) sets the stage for the rest of the book by presenting the author's definitions of three terms: reliability, scalability, and maintainability.
At a high level, the author notes that most people share a few common expectations of what it takes to call a system "reliable".
I found the author's comments on this topic to be extremely relevant to my career experiences. Software is challenging; the reality is that software is often created at an extremely rapid pace, and something as obvious as reliability gets overlooked. In my own career, I have seen proofs of concept (which we never expected to reach a production environment) morph into something real.
Alright, so that is all fine and dandy, but what about when things go wrong? The things that can go wrong are called faults. Common types of faults are hardware faults, software errors, and human errors. It's important to clarify that there is a difference between a fault and a failure.
A fault is when one part of the system behaves unexpectedly but the system is able to keep on trucking, whereas a failure is when the system as a whole stops providing its service. In general, the most common approach is to tolerate faults rather than try to prevent them completely. This brings up an interesting point about code: higher-level languages such as C# have the concepts of exceptions and try/catch, which differs from a language such as C, where errors are typically reported through return values that the caller has to check.
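To make the fault versus failure distinction a bit more concrete, here is a minimal sketch of tolerating a fault with exception handling. It is written in Python purely for illustration; `process_all` and its parameters are hypothetical names of my own, not something from the book.

```python
def process_all(records, handler):
    """Apply `handler` to every record, collecting faults instead of stopping.

    A record that raises is a fault; the loop as a whole keeps running,
    so one bad record does not turn into a failure of the whole batch.
    """
    results = []
    faults = []
    for record in records:
        try:
            results.append(handler(record))
        except Exception as exc:
            # Contain the fault: remember it for later inspection and move on.
            faults.append((record, exc))
    return results, faults


if __name__ == "__main__":
    ok, bad = process_all(["1", "x", "3"], int)
    print(ok)        # the records that were handled successfully
    print(len(bad))  # how many records faulted
```

Without the try/except, the first bad record would end the whole run, which is much closer to a failure. Catching every `Exception` is a deliberate simplification here; real code would usually catch only the errors it knows how to handle.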
The section on hardware faults discusses the many ways hardware can fail and how cloud providers have helped popularize a different approach, one that prioritizes flexibility and elasticity over single-machine reliability.
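One way to picture favoring elasticity over single-machine reliability is a caller that simply moves on to another machine when one disappears. The sketch below assumes each node is a plain callable and uses a made-up `NodeDown` exception; both are illustrative assumptions, not an API from the book or from any cloud provider.

```python
class NodeDown(Exception):
    """Raised by a (hypothetical) node that has been lost."""


def call_with_failover(nodes, request):
    """Try each node in order and return the first successful response."""
    last_error = None
    for node in nodes:
        try:
            return node(request)
        except NodeDown as exc:
            # Losing one machine is an expected fault; try the next one.
            last_error = exc
    # Only when every node is gone does the fault become a failure.
    raise RuntimeError("all nodes are down") from last_error
```

The point of the design is that no individual node needs to be especially reliable, as long as there are enough of them to absorb the loss of one.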
Software errors are something all developers can empathize with. Bugs are inevitable, and as much as we strive to prevent them, they creep into our systems. Sometimes the engineers working on the code introduce them; other times it's an outside source that we seemingly have little control over.
Some examples include:
We can combat these issues by testing individual systems, writing meaningful integration tests, using monitoring platforms, and adding fault tolerance and recovery strategies in code.
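As one example of a recovery strategy in code, here is a rough sketch of retrying a flaky operation with exponential backoff. The `retry` helper, its parameter names, and its default values are my own assumptions for illustration.

```python
import time


def retry(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call `operation` until it succeeds or `attempts` calls have been made.

    The wait doubles after each failed attempt (base_delay, 2x, 4x, ...).
    `sleep` is injectable so tests do not have to actually wait.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                # Out of attempts: let the last error reach the caller.
                raise
            sleep(base_delay * 2 ** attempt)
```

Retrying only makes sense for operations that are safe to repeat, and this sketch retries on any exception, which a production version would likely narrow down.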
The author points out that humans are known to be unreliable. I tell my team that we should strive to automate anything that currently requires manual human intervention. The reality is that operators (humans) are responsible for significantly more errors than hardware failures are.