Wtf is ACID in database transactions?

Wtf is ACID in database transactions?

Introduction

ACID helps make sure database actions are done dependably.

It stands for Atomicity, Consistency, Isolation and Durability.

These principles are important for keeping a database trustworthy and stable, especially when handling many tasks at once.

  • Atomicity means each action is all or nothing.

  • Consistency keeps the database in a valid state by following rules.

  • Isolation makes actions separate, even when they happen together.

  • Durability ensures completed actions stay done, even if the system fails.

What is a transaction?

A transaction is a group of tasks that a database does together as one step.

Here's what makes transactions special:

  1. Group of Operations: It can include various operations like adding, updating, or deleting data.

  2. All-or-Nothing: Transactions follow the "all-or-nothing" rule. This means all tasks in the transaction must finish, or none will. If there's a problem during the transaction, it goes back to how it was before it began.

  3. Maintaining Integrity: Transactions help keep your database consistent. For example, when moving money between two bank accounts, the transaction makes sure the money is taken out of one account and put into the other.

  4. Isolation from Others: Transactions are usually separate from each other. This means that what happens in one transaction stays hidden from others until it's done. This separation stops confusion from happening when transactions happen at the same time.

Atomicity

Atomicity means "all or nothing." In a database, many actions are grouped together.

Atomicity makes sure that all actions in the group are finished, or none of them are. This is important to keep data safe and avoid problems with incomplete transactions.

How it works

Consider a transaction involving two operations: A and B.

If both A and B complete successfully, the transaction is committed, and the changes are permanently made to the database.

If either A or B fails, the transaction is rolled back, meaning no changes are made, and the database remains in its original state before the transaction started.

If both A and B work, the transaction is done, and the database is updated. If A or B doesn't work, the transaction is undone, and the database stays the same as before.

Internals

Databases use logs to record every action in a transaction. If something goes wrong, the database checks these logs to undo changes, making sure incomplete transactions don't stay in the system.

Locking mechanisms

Locks stop other actions from using data during a transaction. This helps easily undo transactions if required.

Isolation

Isolation keeps transactions separate until they're complete. It hides actions in one transaction from others until it's finished.

Isolation is important for keeping data correct and consistent. Without good isolation, transactions might cause problems with each other, leading to wrong data.

You can imagine as we scale applications. The database will be busy. Multiple transactions often happen at the same time. That's where things get scary.

Levels of isolation

There are different levels of isolation. Think of these levels as different ways of managing how much transactions know about each other's business.

Read Uncommitted

This is like having no secrets between transactions. One transaction can see the changes made by others even before they're finalized.

Pros: Super fast because there's no waiting for other transactions to finish.

Cons: Can lead to dirty reads where you see data that might not stick around because it's not yet committed.

Read Committed

A bit more private than Read Uncommitted. Transactions only see changes from others after they're fully done and committed.

Pros: Avoids dirty reads, so what you see is more reliable.

Cons: You might get non-repeatable reads. If you look at data twice in the same transaction, it could change in between if another transaction updates it.

Repeatable Read

This is like putting a temporary do not disturb sign. If a transaction reads data, other transactions can't change that data until the first one finishes.

Pros: Solves non-repeatable reads. You get the same data every time you look at it in the same transaction.

Cons: Can lead to phantom reads. While existing data stays the same, new data can still be added by others, changing the overall picture.

Serializable

The highest level of privacy. It's like transactions are in their own bubbles, fully isolated from each other.

Pros: Super consistent. Avoids dirty reads, non-repeatable reads, and phantom reads. Everything's predictable.

Cons: Can be slow. Since transactions don't see each other, they might have to wait in line, reducing speed and efficiency.

Dirty reads

Imagine reading a note someone is still writing. A dirty read occurs when a transaction reads data that another transaction hasn't finished with yet.

Pros: The only pro here is speed. Since transactions don't wait for others to finish, things can move quickly.

Cons: The big downside is unreliability.

Non-repeatable reads

Asking the same question twice and getting different answers. You read data, read it again later, and it changed because another action changed it in between.

Pros: This lets more transactions happen at the same time because it doesn't restrict data as much.

Cons: The inconsistency can be problematic. You can't be sure the data you're working with remains the same throughout your transaction, which can lead to errors or confusion.

Phantom reads

A phantom read happens when a transaction reads a set of rows that match a condition, then reads the same set again and finds additional rows have been added by another transaction.

This is different from non-repeatable reads.

Think of it like searching for books priced under $20. You find 10 books initially. While you're still in the transaction, someone else adds or changes prices, and now your search finds 12 books.

Pros: Like non-repeatable reads, this level allows for greater concurrency and can improve performance in a busy database.

Cons: The main problem is things can change unexpectedly. The data you're using might get bigger because others add to it, making it harder to work with and understand.

Consistency

Consistency in databases means that all information follows specific rules, like data types and relationships.

The main goal is to keep the database in a good state before and after a transaction, based on these rules.

For example, if a rule states that all users must have a valid email address, a transaction that tries to create a user without an email address should be rejected to maintain consistency.

Inconsistent account balances in a banking system

Imagine a banking system where a key rule is that the sum of all transactions for an account must equal the account's current balance.

Scenario

  • Starting Point: An account has a balance of $1000.

  • Transaction 1: A withdrawal transaction of $200 is initiated.

  • Transaction 2: Almost simultaneously, another transaction to withdraw $500 is initiated.

  • Inconsistent State: Imagine both transactions look at the balance. They see $1000 and go ahead. The account has two withdrawals adding up to $700, but the balance might be reduced only once because of the checks happening at the same time. So, the account balance shows $800 instead of the right amount, $300.

Durability

Durability in databases means that once a transaction has been committed, it is guaranteed to be a permanent part of the database. This should be true even if the system breaks down or loses power.

The important time for durability is the commitment point. When a transaction gets to this point, all its changes are finished and need to be saved.

After commitment, the system must ensure these changes aren't lost.

Implementation

There are several techniques to achieve durability. Write-ahead logging (WAL) is the most common one where changes are first recorded in a transaction log.

Techniques

Transaction Logs

Logs every transaction detail before applying them to the database. If there's a crash, this log helps replay and finish those transactions.

Pros:

  • Reliability: Ensures no transaction is lost, enhancing data integrity.

  • Efficient Recovery: Can quickly restore the database to its last consistent state.

Cons:

  • Performance Overhead: Writing to logs can slow down transaction processing.

  • Storage Requirement: Requires additional storage for log data.

Checkpoints

Periodically creates a database checkpoint. For recovery, it replays only transactions post the last checkpoint, not the full log.

Pros:

  • Faster Recovery: Reduces the amount of data to process during recovery.

  • Resource Optimization: Helps manage the size of transaction logs.

Cons:

  • Data Loss Risk: Data between the last checkpoint and the crash could be lost.

  • Performance Impact: Creating checkpoints can temporarily slow down the database.

Replication

Copies data to multiple servers or locations. If one fails, the others can continue to provide the data.

Pros:

  • High Availability: Offers fault tolerance and data availability.

  • Load Distribution: Can spread out the read load, improving performance.

Cons:

  • Complexity: More complicated to set up and maintain.

  • Synchronization Challenge: Keeping all copies up-to-date can be challenging.

Write-Ahead Logging (WAL)

Logs changes before updating the database, ensuring transactions can be finished if recovery is needed.

Pros:

  • Data Integrity: Ensures that every committed transaction is recorded.

  • Consistent Performance: Balances durability with good overall performance.

Cons:

  • Log Management: Requires careful handling of the log file size and complexity.

  • Recovery Time: Can be slower compared to other methods in some scenarios.