Book Summary: Designing Data-Intensive Applications

Designing Data-Intensive Applications by Martin Kleppmann is a great book about designing software applications with data at their core. Below are my notes on the book:

  • Reliable, Scalable, and Maintainable Applications:
    • Reliable means a system continues to work correctly even when faults or other difficulties occur. To achieve reliability, we add redundancy to the system, i.e., duplicates of system components.
    • Scalable means a system should be able to handle the growth of the usage, data, or complexity.
    • Maintainable means a system should be flexible to support existing and new use cases.
  • Data Models and Query Languages:
    • SQL is the most common query language.
    • Object-Relational Mismatch: application code is generally written in object-oriented programming languages, so we need to translate data between the tabular (relational) format and the objects in application code. ORM tools like Hibernate help simplify this translation.
    • Which data model is appropriate for the application: if the application's data forms a tree of one-to-many relationships, the document model may make more sense, and the document model can be faster for writes. A graph database is suitable for many-to-many relationships. Document databases fit best when documents are mostly self-contained, with few connections to other datasets.
    • Cypher Query Language: a declarative query language created for the Neo4j graph database.
  • Storage and Retrieval:
    • An index is a structure derived from the primary data. Indexes speed up reads but slow down writes, so databases leave the decision of which columns to index to application developers.
    • SSTables and LSM-Trees: a Sorted String Table (SSTable) stores its sequence of key-value pairs sorted by key. LSM-Trees: yet to add notes about it.
    • B-Trees: each key lives in exactly one place. I am yet to add more notes about it.
    • Row storage: all values of a data row are stored at the same location. Most RDBMS systems work this way.
    • Column storage: In column storages, data rows are not stored together. Instead, data values are split into columns. An example of a columnar database is Snowflake.
    • OLTP (Online Transaction Processing) databases: designed for low latency needs. Examples are RDBMS systems like MySQL and Oracle.
    • OLAP (Online Analytical Processing) databases: designed for analytic queries that scan large numbers of rows. Examples are HBase, Hive, and Spark.
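The SSTable idea above can be sketched in a few lines of Python. This is a minimal illustration of the LSM-style write path, not a real storage engine: writes go into an in-memory table, and when it grows past a threshold it is flushed to a sorted segment (the SSTable). The names `Memtable` and `flush_threshold` are my own, not from the book or any database.

```python
class Memtable:
    """Toy LSM write path: buffer writes in memory, flush sorted segments."""

    def __init__(self, flush_threshold=3):
        self.data = {}            # in-memory table (memtable)
        self.flush_threshold = flush_threshold
        self.sstables = []        # flushed segments, oldest first

    def put(self, key, value):
        self.data[key] = value
        if len(self.data) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # An SSTable stores its key-value pairs sorted by key, which
        # makes merging segments and range scans efficient.
        segment = sorted(self.data.items())
        self.sstables.append(segment)
        self.data = {}

    def get(self, key):
        # Check the memtable first, then segments from newest to oldest.
        if key in self.data:
            return self.data[key]
        for segment in reversed(self.sstables):
            for k, v in segment:  # real engines use binary search + a sparse index
                if k == key:
                    return v
        return None
```

Real LSM engines also compact and merge segments in the background; here lookups simply scan segments newest-first.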
  • Encoding and Evolution:
    • Rolling upgrade: rolling out a change to production a few nodes at a time.
    • Backward compatibility means newer code can read the data that was written by older code. Forward compatibility means older code can read the data that was written by newer code.
    • Encoding/Decoding: converting in-memory data to a sequence of bytes is encoding (also called serialization or marshalling). Decoding is the reverse.
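The backward/forward compatibility rules above can be shown with a tiny JSON example. This is my own illustration, assuming an encoding where readers ignore unknown fields and supply defaults for missing ones; the record shape and an added "email" field are invented.

```python
import json

def encode_v2(user):
    # Newer code writes an extra "email" field.
    return json.dumps({"name": user["name"], "email": user.get("email", "")})

def decode_v1(data):
    # Older code only knows about "name". By ignoring unknown fields it
    # can still read data written by newer code (forward compatibility).
    record = json.loads(data)
    return {"name": record["name"]}

def decode_v2(data):
    # Newer code supplies a default when "email" is missing, so it can
    # read data written by older code (backward compatibility).
    record = json.loads(data)
    return {"name": record["name"], "email": record.get("email", "")}
```

Schema-based formats like Avro, Thrift, and Protocol Buffers make these same rules explicit in the schema instead of leaving them to reader conventions.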
  • Distributed Data:
    • Scaling to higher load:
      • The simplest approach is vertical scaling (scaling up): moving to a more powerful machine.
      • Shared-nothing architectures: also called horizontal scaling (scaling out), distributing the load across multiple machines.
      • Replication versus Partitioning: Replication means keeping a copy of the data on several nodes. Partitioning means splitting a big database into smaller subsets called partitions.
  • Replication:
    • Throughout the replication chapter, we assume the dataset is small enough that each machine or node can hold a copy of the entire dataset.
    • Different approaches to replicate datasets:
      • Single-leader replication: an active/passive setup. One node is designated the leader; the other replicas are known as followers. All writes go through the leader, and every change is propagated to the followers in the same order it was applied on the leader. The leader may copy data to every replica itself, or some replicas may relay the data on to other replicas.
      • Multi-leader replication: in this configuration, we can have one leader per datacenter. Compared with a single-leader configuration it offers better performance, tolerance of datacenter outages, and tolerance of network problems. Its big disadvantage is write conflicts, for which there are various resolution strategies.
      • Leaderless replication: the client (or a coordinator acting on its behalf) is responsible for writing data to all replicas and for resolving conflicts. An example is Amazon's Dynamo.
  • Partitioning:
    • The main reason to partition data is scalability.
    • The book covers partitioning strategies in detail. I am yet to understand the details. As I learn, I will add notes about it.
    • Partitioning is also known as data splitting or sharding.
    • Skewed means some partitions hold more data or traffic than others. A partition with disproportionately high load is called a hot spot.
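One common partitioning strategy is hashing the key, which I can sketch as follows. The 4-partition setup and the choice of MD5 are just for illustration; any stable hash works.

```python
import hashlib

NUM_PARTITIONS = 4

def partition_for(key: str) -> int:
    # Use a stable hash (unlike Python's built-in hash(), which is
    # randomized per process) so a key always maps to the same partition.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS
```

Note that hashing spreads *keys* evenly but not *per-key load*: if one key (say, a celebrity user) receives most of the traffic, its partition still becomes a hot spot.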
  • Transactions:
    • A transaction groups several reads and writes so that they either all succeed or all fail together.
    • The book covers ACID and two-phase commit in detail. I am yet to understand the details. As I learn, I will add notes about it.
    • ACID properties:
      • Atomicity means the transaction's actions are all-or-nothing.
      • Consistency means the data in the database satisfies the invariants you define; for example, no foreign-key constraint is ever violated.
      • Isolation means that concurrently running transactions are isolated from each other; in the strongest form they behave as if they had run serially.
      • Durability means that once a transaction commits, the database guarantees its data will not be lost, even after a crash.
    • Reading data that has not yet been committed is a dirty read. Overwriting data that another in-flight transaction has written but not yet committed is a dirty write.
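Atomicity is easy to see with a small SQLite example: either both updates of a transfer commit together, or the rollback undoes both. The accounts table and the simulated failure are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

def transfer(conn, src, dst, amount):
    try:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src))
        # Simulate a failure between the two writes.
        if amount > 100:
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst))
        conn.commit()
    except Exception:
        conn.rollback()  # atomicity: the first update is undone too

transfer(conn, "alice", "bob", 30)   # succeeds: both rows change
transfer(conn, "alice", "bob", 500)  # fails: neither row changes
```

After both calls, alice has 70 and bob has 30: the failed transfer left no partial debit behind.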
  • Troubles with distributed systems: this chapter is about the kinds of faults that arise in distributed systems and how to cope with them.
  • Consistency and consensus: this chapter is about maintaining consistency of the data.
  • Derived data:
    • The source of truth holds the authentic dataset.
    • Derived data is redundant, derived from the source of truth. Cache is an example of the derived data.
  • Batch processing:
    • Services (online systems) are real-time systems that respond to requests as quickly as possible.
    • Batch processing systems run jobs on a schedule over a large, bounded dataset.
    • Stream processing systems (or near realtime systems) process the data as they receive the inputs.
  • Stream processing: Stream refers to the data that is available incrementally. Stream processing is the way to process the data near real time.
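The incremental nature of stream processing can be illustrated with a generator: each event updates a running aggregate as it arrives, instead of re-scanning a complete dataset. The page-view event shape is invented for this sketch.

```python
from collections import Counter

def process_stream(events):
    """Maintain a running count of page views per URL, one event at a time."""
    counts = Counter()
    for event in events:  # in a real system this would be an unbounded stream
        counts[event["url"]] += 1
        yield dict(counts)  # emit the updated state after each event

stream = [{"url": "/home"}, {"url": "/about"}, {"url": "/home"}]
snapshots = list(process_stream(stream))
```

A batch job would compute the same counts, but only after the whole input is available; the stream version has an up-to-date answer after every event.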
  • Future of Data Systems: I am yet to understand the details. As I learn, I will add notes about it.

As I learn more, I will update this page. Thank you.

