Distributed Transactions Demystified: From 2PC to Sagas in Microservices

Aug 24, 2024

As modern software architecture evolves, the management of transactions becomes increasingly complex. In monolithic applications, transaction management is often straightforward, especially when dealing with a single database. However, as systems grow and transition to microservices architectures, transactions span multiple services, each with its own database. This introduces significant challenges in maintaining consistency, reliability, and integrity across distributed systems.

Transaction Management in Monolithic Applications

In a monolithic application that accesses a single database, transaction management is relatively simple. Transactions can be managed using well-established ACID (Atomicity, Consistency, Isolation, Durability) properties, ensuring that all operations within a transaction are either fully completed or fully rolled back in the event of a failure. This guarantees data consistency and integrity within the application.

However, as monolithic applications grow, they often start using multiple databases and message brokers to handle increased load, performance, and scalability requirements. This is where transaction management starts to become more challenging, as coordinating transactions across multiple databases and brokers requires more elaborate mechanisms.

The Challenges of Distributed Transactions

In a microservices architecture, transactions typically span multiple services, each with its own database. Managing these distributed transactions is far more complex than in a monolithic environment. One traditional approach to handling distributed transactions is the Two-Phase Commit (2PC) protocol. However, 2PC comes with several significant challenges:

Lack of Support in Modern Technologies: Many modern technologies, including popular NoSQL databases like MongoDB and Cassandra, do not support 2PC-style (XA) distributed transactions that span multiple systems. Similarly, modern message brokers such as RabbitMQ and Apache Kafka cannot take part in such transactions. As a result, if you insist on using 2PC, you are limited in your choice of technologies, potentially hindering the adoption of modern, scalable solutions.
Reduced Availability: Distributed transactions inherently rely on synchronous Inter-Process Communication (IPC). For a distributed transaction to commit, all participating services must be available and able to communicate. This synchronous dependency reduces the overall availability of the system, as a failure in one service can prevent the entire transaction from completing.
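
To make the availability problem concrete, here is a minimal sketch of a 2PC coordinator. The `Participant` class and `two_phase_commit` function are hypothetical names made up for illustration, and the sketch leaves out everything a real protocol needs (durable logs, timeouts, recovery). It only shows that one unreachable participant is enough to abort the whole transaction.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    """Hypothetical stand-in for a service taking part in a 2PC transaction."""
    name: str
    available: bool = True
    state: str = "INIT"

    def prepare(self) -> bool:
        # Phase 1 (voting): an unreachable participant cannot vote yes.
        if not self.available:
            return False
        self.state = "PREPARED"
        return True

    def commit(self) -> None:
        self.state = "COMMITTED"

    def rollback(self) -> None:
        self.state = "ABORTED"


def two_phase_commit(participants: list[Participant]) -> bool:
    """Commit only if every participant votes yes; otherwise abort."""
    votes = [p.prepare() for p in participants]
    if all(votes):
        # Phase 2 (completion): everyone voted yes, so everyone commits.
        for p in participants:
            p.commit()
        return True
    # At least one "no" vote: roll back the participants that had prepared.
    for p in participants:
        if p.state == "PREPARED":
            p.rollback()
    return False
```

With two healthy participants the call returns True. If one is created with `available=False`, the call returns False and the healthy participant ends up in the ABORTED state even though nothing was wrong on its side.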

Due to these limitations, many organizations are moving away from traditional distributed transactions and exploring alternative approaches like the Saga pattern to manage distributed transactions in microservices architectures.

The Saga Pattern: A Solution for Distributed Transactions

The Saga pattern is an alternative approach to managing distributed transactions without relying on the rigid requirements of 2PC. A saga is a sequence of local transactions, each of which updates data within a single service using the familiar ACID transaction frameworks and libraries. However, unlike traditional transactions, sagas use asynchronous messaging to maintain data consistency across services.

How the Saga Pattern Works:

Asynchronous Messaging: An essential benefit of the saga pattern is its reliance on asynchronous messaging. When a local transaction within a service completes, the service publishes an event message, which triggers the next step in the saga. Because the message broker buffers each message until the next service can consume it, every step of the saga eventually runs, even if a participant is temporarily unavailable.
Compensating Transactions: Because each step of a saga commits its changes immediately, those changes are visible to other transactions and cannot be undone by an automatic rollback, unlike in a traditional ACID transaction. If an error occurs in a later step, the system must execute compensating transactions that undo the changes made by earlier steps. A compensating transaction is defined for each step and semantically reverses the effects of that step's local transaction.
For example, in an e-commerce application:
- T1 (Order Service): Create an order with a PENDING state.
- TC1 (Order Service): If a failure occurs, move the order state to an appropriate FAILED state.
- T2 (Kitchen Service): Create a food ticket with an AWAITING_ACCEPTANCE state.
- TC2 (Kitchen Service): If a failure occurs, move the food ticket to an appropriate FAILED state.

This approach ensures that even if one part of the saga fails, the system can compensate and maintain overall data consistency.
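
The T1/TC1 and T2/TC2 pairs above can be sketched as a tiny saga runner. This is a simplified, in-process illustration and not a production implementation: `run_saga` and the step functions are hypothetical names, and `authorize_card` is an invented third step that always fails so that the compensations have something to react to.

```python
from typing import Callable

# A step is (name, local transaction, compensating transaction).
Step = tuple[str, Callable[[dict], None], Callable[[dict], None]]


def run_saga(steps: list[Step], ctx: dict) -> bool:
    """Run each local transaction in order; on failure, compensate in reverse."""
    completed: list[Step] = []
    for step in steps:
        _name, action, _compensate = step
        try:
            action(ctx)
            completed.append(step)
        except Exception:
            # Undo the already committed steps, most recent first.
            for _, _, undo in reversed(completed):
                undo(ctx)
            return False
    return True


def create_order(ctx):      # T1
    ctx["order"] = "PENDING"

def fail_order(ctx):        # TC1
    ctx["order"] = "FAILED"

def create_ticket(ctx):     # T2
    ctx["ticket"] = "AWAITING_ACCEPTANCE"

def fail_ticket(ctx):       # TC2
    ctx["ticket"] = "FAILED"

def authorize_card(ctx):    # hypothetical T3 that always fails
    raise RuntimeError("card declined")

def noop(ctx):              # T3 committed nothing, so nothing to undo
    pass
```

Running T1, T2, and the failing T3 against an empty dict returns False and leaves both the order and the ticket in the FAILED state. Note that the compensations do not delete anything; they move each record to a state that records the failure.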

Coordinating Sagas: Choreography vs. Orchestration

There are two main ways to coordinate the execution of sagas: Choreography and Orchestration.

1. Choreography: Distributed Decision-Making

In a choreography-based saga, the decision-making and sequencing of operations are distributed among the participating services. Unlike an orchestration-based saga, there is no central coordinator directing the flow of the saga. Instead, each service involved in the saga operates independently and communicates with other services through events.

Here’s how it works:

Event-Driven Communication:
Services in a choreography-based saga publish events to signal changes or updates, such as the creation, modification, or deletion of business objects. These events are then consumed by other services that are interested in or need to react to these changes.
Decentralized Control:
Each service subscribes to relevant events and acts based on the information contained in those events. For example, if a service receives an event that an order has been created, it may perform tasks such as validating the order or updating its own state in response.
Autonomous Participants:
Since there is no central coordinator, each service independently handles its own part of the saga. The services manage their own transactions and communicate their status or updates through events, which helps maintain loose coupling between them.
Dynamic Sequencing:
The sequence of operations in the saga is determined by the events and responses between services rather than being predefined by a central authority. Each service’s response to an event may trigger subsequent events and actions in other services, leading to a flexible and dynamic workflow.

Example of Choreography in an E-commerce Saga:
(example from the book "Microservices Patterns" by Chris Richardson)

1. Order Service creates an Order in the APPROVAL_PENDING state and publishes an OrderCreated event.
2. Consumer Service consumes the OrderCreated event, verifies that the consumer can place the order, and publishes a ConsumerVerified event.
3. Kitchen Service consumes the OrderCreated event, validates the Order, creates a Ticket in a CREATE_PENDING state, and publishes the TicketCreated event.
4. Accounting Service consumes the OrderCreated event and creates a CreditCardAuthorization in a PENDING state.
5. Accounting Service consumes the TicketCreated and ConsumerVerified events, charges the consumer’s credit card, and publishes the CreditCardAuthorized event.
6. Kitchen Service consumes the CreditCardAuthorized event and changes the state of the Ticket to AWAITING_ACCEPTANCE.
7. Order Service consumes the CreditCardAuthorized event, changes the state of the Order to APPROVED, and publishes an OrderApproved event.

Handling Failures in Choreography:

When credit card authorization fails, the sequence of events is as follows:

1. Order Service creates an Order in the APPROVAL_PENDING state and publishes an OrderCreated event.
2. Consumer Service consumes the OrderCreated event, verifies that the consumer can place the order, and publishes a ConsumerVerified event.
3. Kitchen Service consumes the OrderCreated event, validates the Order, creates a Ticket in a CREATE_PENDING state, and publishes the TicketCreated event.
4. Accounting Service consumes the OrderCreated event and creates a CreditCardAuthorization in a PENDING state.
5. Accounting Service consumes the TicketCreated and ConsumerVerified events, attempts to charge the consumer’s credit card, and publishes a Credit Card Authorization Failed event.
6. Kitchen Service consumes the Credit Card Authorization Failed event and changes the state of the Ticket to REJECTED.
7. Order Service consumes the Credit Card Authorization Failed event and changes the state of the Order to REJECTED.
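
Both the happy path and the failure path can be sketched with a toy in-memory event bus. This is only an illustration under several assumptions: `EventBus` and `run_create_order_saga` are hypothetical names, the event name `CreditCardAuthorizationFailed` is a made-up spelling of the failure event, all services share one dict instead of owning separate databases, and the PENDING authorization record and the OrderApproved event are left out for brevity.

```python
from collections import defaultdict, deque


class EventBus:
    """Toy broker: events are queued and delivered in publication order."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.queue = deque()
        self.log = []

    def subscribe(self, event, handler):
        self.handlers[event].append(handler)

    def publish(self, event):
        self.queue.append(event)

    def run(self):
        while self.queue:
            event = self.queue.popleft()
            self.log.append(event)
            for handler in self.handlers[event]:
                handler()


def run_create_order_saga(card_ok: bool) -> dict:
    bus = EventBus()
    state = {"order": None, "ticket": None}
    seen_by_accounting = set()

    def consumer_on_order_created():
        bus.publish("ConsumerVerified")

    def kitchen_on_order_created():
        state["ticket"] = "CREATE_PENDING"
        bus.publish("TicketCreated")

    def accounting_on(event):
        # Accounting acts only after it has seen both prerequisite events.
        def handler():
            seen_by_accounting.add(event)
            if seen_by_accounting == {"TicketCreated", "ConsumerVerified"}:
                bus.publish("CreditCardAuthorized" if card_ok
                            else "CreditCardAuthorizationFailed")
        return handler

    bus.subscribe("OrderCreated", consumer_on_order_created)
    bus.subscribe("OrderCreated", kitchen_on_order_created)
    bus.subscribe("TicketCreated", accounting_on("TicketCreated"))
    bus.subscribe("ConsumerVerified", accounting_on("ConsumerVerified"))
    # Kitchen Service and Order Service each react to the outcome event.
    bus.subscribe("CreditCardAuthorized", lambda: state.update(ticket="AWAITING_ACCEPTANCE"))
    bus.subscribe("CreditCardAuthorized", lambda: state.update(order="APPROVED"))
    bus.subscribe("CreditCardAuthorizationFailed", lambda: state.update(ticket="REJECTED"))
    bus.subscribe("CreditCardAuthorizationFailed", lambda: state.update(order="REJECTED"))

    # Order Service starts the saga.
    state["order"] = "APPROVAL_PENDING"
    bus.publish("OrderCreated")
    bus.run()
    return {"order": state["order"], "ticket": state["ticket"], "events": bus.log}
```

Calling `run_create_order_saga(card_ok=True)` ends with an APPROVED order, while `card_ok=False` ends with both the order and the ticket REJECTED. Notice that no function describes the whole flow: the sequence only emerges from the subscriptions, which is exactly what makes choreography loosely coupled and also harder to follow.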

Benefits of Choreography-based Sagas:

Simplicity: Each service simply publishes events whenever it creates, updates, or deletes business objects, making it straightforward to implement.
Loose Coupling: Services are only aware of the events they subscribe to, allowing them to operate independently without needing direct knowledge of each other’s existence.
Resilience: Choreography allows services to function autonomously. If one service fails, others can continue to operate and the failed service can catch up once it’s back online, thanks to the event-driven model.

However, this approach has some drawbacks:

Complexity in Understanding: Unlike orchestration, where the entire saga’s logic is centralized, choreography spreads the implementation across various services. This distribution can make it difficult for developers to grasp the full picture of how the saga functions.
Cyclic Dependencies Among Services: The event-driven nature of choreography can lead to services becoming cyclically dependent on each other’s events. For instance, the Order Service might depend on the Accounting Service, which in turn relies on the Order Service. These cyclic dependencies, while not always problematic, are often seen as a sign of poor design.
Potential for Tight Coupling: Each service must subscribe to all relevant events that impact its operations for complex use cases. For example, the Accounting Service needs to track events that involve charging or refunding a credit card. This requirement can lead to situations where services must be updated in tandem, particularly if changes occur in the order lifecycle managed by the Order Service.
Coordination Complexity: As the number of services increases, the coordination among them can become more complex. Managing the event flows and ensuring that all services are correctly subscribing to and processing events can be challenging in large systems.

2. Orchestration: Centralized Control

In the orchestration approach, the coordination logic for a saga is centralized in a dedicated saga orchestrator class. This orchestrator is responsible for managing the entire flow of the saga by sending specific command messages to the various participating services, instructing them on which actions to take.

When using orchestration, the orchestrator acts as the central controller of the saga. It interacts with the participating services using a command-and-response pattern, where it sends a command to a service to perform a particular operation. Once the service completes the operation, it sends a response back to the orchestrator. The orchestrator then processes the response and decides the next step in the saga, continuing this process until the entire transaction is complete.

You can think of the saga orchestrator as a state machine, where each state represents a step in the saga, and transitions between states are determined by the responses from the participating services.

Example of Orchestration in an E-commerce Saga:

(example from the book "Microservices Patterns" by Chris Richardson)

Order Service first creates an Order and a Create Order Saga orchestrator. After that, the flow for the happy path is as follows:

1. The saga orchestrator sends a Verify Consumer command to Consumer Service.
2. Consumer Service replies with a Consumer Verified message.
3. The saga orchestrator sends a Create Ticket command to Kitchen Service.
4. Kitchen Service replies with a Ticket Created message.
5. The saga orchestrator sends an Authorize Card command to Accounting Service.
6. Accounting Service replies with a Card Authorized message.
7. The saga orchestrator sends an Approve Ticket command to Kitchen Service.
8. The saga orchestrator sends an Approve Order command to Order Service.
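
The happy path above maps naturally onto a state machine. Here is a minimal sketch, assuming a hypothetical `send(service, command)` callable that stands in for the real command/reply messaging and returns True on a success reply. The state names and the `CreateOrderSaga` class are invented for illustration, and compensation on failure is deliberately left out (the saga simply stops in a FAILED state).

```python
# Each state maps to (command to send, target service, next state on success).
HAPPY_PATH = {
    "VERIFYING_CONSUMER": ("VerifyConsumer", "consumer", "CREATING_TICKET"),
    "CREATING_TICKET": ("CreateTicket", "kitchen", "AUTHORIZING_CARD"),
    "AUTHORIZING_CARD": ("AuthorizeCard", "accounting", "APPROVING_TICKET"),
    "APPROVING_TICKET": ("ApproveTicket", "kitchen", "APPROVING_ORDER"),
    "APPROVING_ORDER": ("ApproveOrder", "order", "DONE"),
}


class CreateOrderSaga:
    """Toy orchestrator: one command per state, driven by the replies."""

    def __init__(self, send):
        # `send(service, command) -> bool` is an assumed stand-in for
        # asynchronous command/reply messaging.
        self.send = send
        self.state = "VERIFYING_CONSUMER"
        self.sent = []

    def run(self) -> str:
        # DONE and FAILED are terminal: they have no entry in HAPPY_PATH.
        while self.state in HAPPY_PATH:
            command, service, next_state = HAPPY_PATH[self.state]
            self.sent.append(command)
            ok = self.send(service, command)
            self.state = next_state if ok else "FAILED"
        return self.state
```

For example, `CreateOrderSaga(lambda service, command: True).run()` walks through all five commands and returns "DONE". The whole flow is readable in one table, which is the main appeal of orchestration compared with the choreography version.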

Pros of Orchestration:

Centralized Logic: The entire saga’s logic is centralized in the orchestrator, making it easier to understand, manage, and modify.
No Cyclic Dependencies: The orchestrator avoids the cyclic dependencies often seen in choreography.

Cons of Orchestration:

Single Point of Failure: The orchestrator becomes a single point of failure, and its complexity grows as the saga becomes more intricate.

Choosing Between Choreography and Orchestration

The choice between choreography and orchestration depends on the complexity and requirements of the system. For simpler sagas, choreography’s simplicity and loose coupling might be preferable. However, for more complex sagas where the flow of events needs to be tightly controlled, orchestration provides better clarity and maintainability.

ACID Properties in the Context of Sagas:

Sagas do not fully adhere to the ACID properties typically associated with traditional transactions. The primary challenge with sagas is their lack of isolation, which is one of the ACID properties. Here is how sagas relate to each property and which issues arise from the lack of isolation:

Atomicity:
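A saga does not provide atomicity through a single commit. Each local transaction is atomic within its own service, and the saga as a whole achieves all-or-nothing behavior through compensating transactions: either every step completes, or the changes made by the completed steps are undone. Intermediate states are visible while the saga runs, but no partial result is left behind once it finishes.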
Consistency:
Consistency is managed within individual services through their local databases. Each service maintains referential integrity and consistency of its own data. Across different services, consistency is managed by ensuring that each service operates correctly in response to the events and commands it receives.
Durability:
Durability is handled by the local databases of each service. Once a transaction in a saga commits, the changes are permanently recorded in the database, ensuring that they are not lost even in the case of failures.
Isolation:
This is where sagas differ significantly from traditional transactions. Sagas lack isolation because updates made by one saga’s local transactions are visible to other sagas as soon as they commit. This visibility can lead to several issues:
- Lost Updates:
  If one saga updates data without taking into account changes made by another concurrent saga, it may unintentionally overwrite those changes, leading to lost updates.
- Dirty Reads:
  A saga might read data that has been modified by another saga which has not yet completed its operations. This means that the reading saga could encounter data that is in an intermediate or inconsistent state.
- Fuzzy or Nonrepeatable Reads:
  Different steps of a saga might read the same data at different times and get varying results because another saga has altered the data in between those reads. This inconsistency can arise from the lack of isolation between concurrent sagas.

Countermeasures for handling the lack of isolation

To handle the lack of isolation in sagas and mitigate issues such as lost updates, dirty reads, and nonrepeatable reads, several countermeasures can be employed:

1. Semantic Lock
Semantic locking involves setting a flag in records that are being created or updated by a saga. This flag indicates that the record is currently being processed and may not be in a final state. The flag can either:

Prevent Access: Act as a lock to block other transactions from accessing the record.
Warn Other Transactions: Serve as a warning for other transactions to be cautious when interacting with the record.

Example:
You might use a status like *_PENDING (for example, APPROVAL_PENDING) to signify that an order is in the process of being updated. This helps other sagas or transactions recognize that the record is not yet finalized and might still change.

Considerations:
Managing the lock is only part of the solution. You also need to determine how to handle scenarios where a locked record needs to be accessed by other operations. For instance, if an order is being approved but the client decides to cancel it, you need to define how to manage this conflict.
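
The cancel-while-approving conflict can be sketched as follows. This is one possible policy among several (fail fast and let the caller retry), and the `Order` class, `OrderLockedError`, and the CANCELLED state are hypothetical names used only for illustration.

```python
class OrderLockedError(Exception):
    """Raised when an operation hits a record that a saga is still working on."""


class Order:
    def __init__(self):
        # The *_PENDING suffix acts as the semantic lock.
        self.state = "APPROVAL_PENDING"

    def is_locked(self) -> bool:
        return self.state.endswith("_PENDING")

    def approve(self) -> None:
        # Final step of the owning saga: releases the lock.
        self.state = "APPROVED"

    def cancel(self) -> None:
        # Assumed policy: refuse while locked and let the client retry later.
        if self.is_locked():
            raise OrderLockedError(f"order is {self.state}; retry later")
        self.state = "CANCELLED"
```

Calling `cancel()` on a fresh order raises `OrderLockedError`. After `approve()` has run, the same call succeeds. An alternative policy would be to block the cancel request until the lock is released, at the cost of having to deal with possible deadlocks.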

2. Commutative Updates
Designing update operations to be commutative means they can be performed in any order without affecting the final outcome. Commutative operations ensure that the order of execution does not impact the consistency of the data.

Example:
Operations like debit() and credit() on an account are commutative (ignoring overdraft checks) because the order in which they are applied does not change the final balance.
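
A quick way to convince yourself is to apply the same operations in every possible order. The sketch below uses a hypothetical `apply_ops` helper with amounts in cents, and it assumes there is no overdraft check (such a check would make the result depend on the order).

```python
from itertools import permutations


def apply_ops(balance: int, ops) -> int:
    """Apply (kind, amount) operations in the given order; amounts in cents."""
    for kind, amount in ops:
        balance += amount if kind == "credit" else -amount
    return balance


ops = [("credit", 500), ("debit", 200), ("credit", 50)]

# Collect the final balance for all six orderings of the three operations.
results = {apply_ops(1000, order) for order in permutations(ops)}
```

Here `results` contains a single value, 1350, which shows that every ordering ends with the same balance.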

3. Pessimistic View
The pessimistic view countermeasure involves reordering the steps in a saga to reduce the risk of issues arising from dirty reads. This approach aims to minimize the chances of inconsistencies by carefully arranging the sequence of operations.

Consideration:
By adjusting the order of operations, you can avoid situations where a saga might read outdated or inconsistent data, thereby reducing potential conflicts and errors.

4. Reread Value
To prevent lost updates, the reread value approach involves checking the current state of a record before making an update. The saga rereads the record to verify that it has not changed since the last read. If changes are detected, the saga aborts and may restart to ensure consistency.

Example:
Before updating an order, a saga rereads the order’s details to confirm that no other transactions have modified it. If the record has been updated, the saga rolls back and retries the operation.
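
One common way to implement the reread is a version counter that is checked right before the write. The sketch below is a simplified in-memory illustration: `OrderStore` and `StaleRecordError` are hypothetical names, and a real service would perform the check and the write atomically in its database.

```python
class StaleRecordError(Exception):
    """The record changed since the saga last read it."""


class OrderStore:
    """Hypothetical in-memory store; each record carries a version counter."""

    def __init__(self):
        self.records = {}

    def insert(self, order_id, data):
        self.records[order_id] = (1, dict(data))

    def read(self, order_id):
        version, data = self.records[order_id]
        return version, dict(data)

    def update(self, order_id, expected_version, data):
        # Reread the current version right before writing.
        current_version, _ = self.records[order_id]
        if current_version != expected_version:
            raise StaleRecordError(order_id)
        self.records[order_id] = (current_version + 1, dict(data))
```

If two sagas read version 1 of the same order and one of them writes first, the second update raises `StaleRecordError` instead of silently overwriting the first change. The second saga can then abort, or reread the record and retry.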

5. Version File
The version file countermeasure tracks operations performed on a record by maintaining a version history. This helps reorder non-commutative operations to achieve consistency.

Example:
By recording each update and its version, you can manage the order of operations more effectively, ensuring that conflicting updates are applied in a consistent sequence.
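
As an illustration, consider a hypothetical situation in which a request to cancel a card authorization reaches a service before the request to authorize the card. The sketch below records every operation per order and uses that history to skip the late authorization. The `AccountingService` class and the operation names are assumptions made for this example.

```python
class AccountingService:
    """Toy participant that records operations per order as they arrive."""

    def __init__(self):
        self.version_file = {}   # order_id -> list of operations seen so far
        self.charged = set()     # orders with an active authorization

    def handle(self, order_id: str, operation: str) -> str:
        history = self.version_file.setdefault(order_id, [])
        history.append(operation)
        if operation == "AuthorizeCard":
            if "CancelAuthorization" in history:
                # The cancel arrived first, so the authorization is skipped.
                return "skipped"
            self.charged.add(order_id)
            return "authorized"
        if operation == "CancelAuthorization":
            self.charged.discard(order_id)
            return "cancelled"
        raise ValueError(operation)
```

Whether the two requests arrive as authorize-then-cancel or cancel-then-authorize, the order ends up without an active authorization, which is the point of keeping the history.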

6. By Value
The by value approach involves choosing between sagas and distributed transactions based on the risk level of each request. Low-risk operations can use sagas with appropriate countermeasures, while high-risk operations, such as those involving significant financial transactions, might require the stronger guarantees provided by distributed transactions.

Example:
For low-risk tasks, such as updating user preferences, you can use sagas. For high-risk tasks, like processing large monetary transactions, you might opt for distributed transactions to ensure greater consistency and reliability.
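
In code, this can be as simple as a routing decision made per request. The threshold and the names below are hypothetical; the right cut-off is a business decision and not something the pattern prescribes.

```python
# Hypothetical cut-off in cents; tune it to the business risk you accept.
HIGH_RISK_THRESHOLD_CENTS = 100_000


def choose_mechanism(amount_cents: int) -> str:
    """Pick the concurrency mechanism based on the value at stake."""
    if amount_cents >= HIGH_RISK_THRESHOLD_CENTS:
        return "distributed_transaction"
    return "saga"
```

With this threshold, a 15.00 payment (`choose_mechanism(1_500)`) is routed to a saga, while a 2,500.00 payment (`choose_mechanism(250_000)`) is routed to a distributed transaction.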

Conclusion

Managing distributed transactions in microservices is challenging but essential for maintaining data consistency across services. While traditional approaches like 2PC have significant limitations, the Saga pattern offers a robust alternative that can be tailored to the needs of modern distributed systems. Whether you choose choreography or orchestration to coordinate your sagas, the key is to carefully design your transactions to handle failures gracefully and ensure the integrity of your data. As microservices continue to evolve, so too will the strategies for managing distributed transactions, making it an exciting and dynamic area of software engineering.

References:

https://microservices.io/book

https://learn.microsoft.com/en-us/azure/architecture/reference-architectures/saga/saga
