Who should read it: It is for you if you are looking for an overview of this topic for a project, to conduct/appear in an interview, or in general. As we learn more, we will update this article.
Why we need distributed databases:
- Difficult to store entire data set into a single database
- Single point of failure
- Slow in performance
- If we make a big computer, it will be more expensive
For the distributed database architectures, we have a master database and can have multiple secondary databases. Single node databases are classic databases like PostgreSQL, MySQL, etc. Distributed databases are made of multiple nodes. These are fault tolerant. Database clusters means multiple database instances. In general, we have leader nodes and follower nodes. Leader node is in charge of returning the final data results. Followers receive the data. If the leader node fails, a follower node can become the leader node.
Types of distributed databases:
- Big compute databases: Split data across multiple nodes. These are suitable for analytical workloads.
- High availability databases: are extremely fault tolerant. Each node has a full copy of the data.
Some key points about distributed databases:
- Imbalance node: the problem when a node has more data load. Moving the data between nodes is slow.
- Asking data from hard disk is slow. Asking data from RAM is fast.
- Leader node examines the query. Leader node distributes the jobs to different nodes.
- Sharding: It is a model in that all database instances acts as the primary databases. We segment data into multiple instances. Problem is that if we have more load on one segmented database, it will cause problems. Also, if we have to join data with two databases, we will have network connections.
CAP theorem:
- C stands for Consistency. If we write information, we want to get same data. That is the consistency. To maintain consistency, data in the primary and secondary databases should replicate asap.
- A stands for Availability. If we have two databases, if one machine goes down, as a whole system, users should be able to read or write.
- P stands for Partitions. Partition tells us that even if connections to two machines not working, we should still be able to read/write the data.
- RBDMS databases provide strong consistency. NoSQL databases generally prioritize availability, partition tolerance, and provide eventual consistency.
As per CAP theorem, generally, databases can achieve up to two features out of three. For distributed databases, assume network failures will be inevitable. So, for distributed databases, we need to choose between C and A.
PACELC theorem:
As per CAPLEC theorem, if partition happens, choose Availability and Consistency. Else, choose Latency and Consistency.
Other theorems:
BASE: BASE stands for Basically Available Soft state Eventually Consistent. NoSQL is an example of a BASE.
- Basically Available: The system is guaranteed to be available in event of failure.
- Soft State: The state of the data could change without application interactions due to eventual consistency.
- Eventual Consistency: The system will be eventually consistent after the application input. The data will be replicated to different nodes and will eventually reach a consistent state. But the consistency is not guaranteed at a transaction level.
Indexes:
Database index is a data structure that helps to retrieve data faster from a table. Indexes are like library catalog that helps to know the location of a book. For more about index, refer here.
Relational databases:
Relational databases store data in rows and columns. Some famous relational databases are MySQL, Oracle, and Postgres.
Advantages of relational databases:
- Well defined relationships and structured: data in relational databases is structured, with foreign and primary key constraints. It helps in organizing the data. Defined relationships and structure also helps in retrieving the data effectively.
- ACID (Atomicity, Consistency, Isolation, Durability): As relational databases support ACID properties, it’s helpful in ensuring the data changes for a transaction.
Disadvantages of relational databases:
- Rigidity due to structured data: As the data is well defined and structured, it’s not easy to store a new data set for that a structure is unknown. For example, to add a new column into a table, the table has to be changed, to support it.
- Difficult to scale: scaling means supporting more volume of data. For relational databases, scaling is difficult. For read-only operations, it’s easier to replicate the data. For write operations, a general approach is to add more capacity (vertical scaling) to the primary database server, which is costlier than replacing read-only databases.
NoSQL Databases:
There are multiple types of NoSQL databases like:
- Key-Value storage type: Data is stored in key-value pairs in arrays. Some examples of such databases are Redis and Dynamo databases.
- Document databases: In these databases, data is stored in the documents. A collection is a group of documents. Each document can have a different structure. An example of such a database is MongoDB.
- Wide-column databases: In these databases, the number of columns can vary per row in the same table. We can consider it as a two dimensional key-value storage. Some examples of wide-column databases are Cassandra and HBase. For more, refer here.
- Graph databases: These represent data in a form of a graph. Examples of such databases are Neo4J and Infinite graph.
- Some NoSQL databases:
- Couchbase: It is a NoSQL database that stores the data either in key/value pair or in JSON document format. In a traditional database model, we begin with a schema. We add tables and the columns in the tables.
- MongoDB: MongoDB is an open source document database. It works on concept of collections and documents. A collection is a document which is equivalent to an RDBMS table. A document is a set of key-value pairs. It is a schema-less database. It is easy to scale. It is a good choice for a Big Data need.
Advantages of NoSQL databases:
- Flexibility with unstructured data: As the data in NoSQL databases is unstructured, these databases provide more flexibility to store the data.
- Horizontal scaling: Horizontal scaling means distributing data into multiple server instances. Data in NoSQL databases are distributed, by using sharding. These databases support horizontal scaling for both, read and write operations.
SQL versus NoSQL databases:
- Storage: Data in SQL databases is stored in rows. NoSQL databases have different data storage models like key-value or graph
- Schema: SQL databases have a fixed schema. noSQL databases can have different schemas.
- Querying: SQL databases use Structural Query Language (SQL), to retrieve the data. NoSQL databases uses UnQL (Unstructured Query Language). NoSQL are focused as a collection of documents.
- Scalability: Horizontal scaling in SQL databases is difficult as compared to NoSQL databases.
- Reliability: Most SQL databases are reliable and ACID compliant. Whereas, NoSQL databases may compromise reliability and ACID compliance.
- Language: SQL databases use transactional SQL. They support core ANSI/ISO language elements. Whereas, NoSQL databases are not limited to one particular language. For example, MongoDb uses Javascript based query language.
How to choose between SQL and NoSQL databases:
- Consider SQL databases when:
- Data is structured and structure is not changed frequently.
- Supporting transaction-oriented use cases.
- No need to scale the database.
- Consider NoSQL databases when:
- Data is not structured and the structure can change frequently.
- A flexibility of dynamic schema is needed.
- We anticipate the scaling of the database in the future.
- Level of data integrity is not needed.
Other databases terms:
- Purpose built databases: There are many options like relational database, key-value database, document database, graph database, in-memory database, time series database, and ledger database. Depending on the situation, today’s application developers need to pick a right database for the use case, by analyzing pros and cons of the situation.
Partitioning and Sharding: it is the process of splitting the data into columns or features. Vertical partitioning splits the data into the same database by columns or features within the same tables. Horizontal partitioning splits the table data to multiple shards ( e.g. multiple database locations). In case of sharding, a table may have a customer ID 1 on one server and customer ID 2 on another server. In case of partitioning, other customer Ids 1 and 2 are on the same database servers and in the same table.
Distributed transaction types:
- Two-phase commit (2PC): In case of 2 Phase commit, there is a coordinator that prepares multiple transactions. Then, the coordinator either commits or rolls back all transactions together. While preparing each transactions, the database table rows (that to be updated in DB) are locked using local transactions. This prevents any updates during the 2-phase transaction. We also need to plan for a time limit for each transaction so that the coordinator is waiting to commit or rollback within a defined time period.
- 3 phase commit: First phase is that the coordinator asks if it’s ok to commit. Second phase is to pre-commit the transaction. Third step is to commit a transaction. I’m yet to learn more and add more notes about it.
- Try-Confirm/Cancel (TC/C): In the first phase, the coordinator asks all databases to reserve resources. In the second phase, the coordinator captures the replies from all databases. If the response is yes, the coordinator asks all databases to confirm the transaction. If any of the databases respond as no, the coordinator asks all databases to cancel the transaction.
- Saga: It’s an asynchronous way to achieve the transaction. It’s an event driven process. Micro-services generally use Saga as their default choice. In Saga, all operations are executed in a sequence. When one operation finishes, then the next operation is executed. For the rollback purpose, we need to prepare double operations: one for execution and the another one for the rollback. To coordinate operations in Saga, there are two ways:
- Choreography: All services do their jobs by subscribing to other services’ events. This is a decentralized coordination.
- Orchestration: In this way, there is a single coordinator to instruct all services in a defined order.
References:
- https://www.youtube.com/watch?v=5co1CuTPtkg
- https://www.tutorialspoint.com/mongodb/index.htm
- https://aws.amazon.com/getting-started/hands-on/purpose-built-databases/
- https://pages.awscloud.com/rs/112-TZM-766/images/Enter_the_Purpose-Built-Database-Era.pdf
- https://towardsdatascience.com/datastore-choices-sql-vs-nosql-database-ebec24d56106
- Distributed transactions: https://www.youtube.com/watch?v=S4FnmSeRpAY
- Distributed transactions: https://www.alibabacloud.com/blog/an-in-depth-analysis-of-distributed-transaction-solutions_597232#:~:text=Try%2C%20Commit%2C%20and%20Cancel%20(,%2C%20commit%2C%20and%20cancel%20interfaces.
- Distributed Transactions: https://experiencestack.co/the-core-knowledge-of-sql-database-development-2-29304458eaf3