Implement Distributed Lock For a Microservices Software System
Are you facing problems with data consistency in a highly concurrent environment? Are you looking for a way to limit and synchronize access to a shared resource across services in a distributed system? Let's see how a distributed lock can help in this situation.
1) Why do we need a distributed lock?
Imagine if you are building a delivery application where multiple shippers can pick up the orders at any given time. You don’t want an order to be picked up by two or more shippers at the same time. Or, in the inventory management business, you want to deduct the available amount of an SKU synchronously so you could avoid mistakes in operation. Exactly, a lock mechanism could be one of the solutions in these cases.
Figure 1. Lock mechanism example
Locking is a way to limit access to a shared resource and allow mutual exclusivity. In that way, we can ensure the consistency of the resource or data we are working with. There is a well-known problem called Race condition (referenced: Race condition - Wikipedia), which can be avoided by using locks. If you are a software developer, you must have been familiar with the lock mechanism provided by the programming language/framework you are using. In C# we have lock, Monitor, Threading API (Semaphore, Mutex, Interlocked, etc). In Java, we have synchronized, Lock, etc. All of these techniques help us limit access, or synchronize the manipulation of objects or data in our application. However, what if our system consists of multiple applications or instances of applications distributed across servers and physical locations? And our resources are shared among those applications or instances as well. That’s where a distributed lock is helpful.
Today, we will learn how to implement mutually exclusive locks in a micro-services or distributed software system. In the end, you will have a sense of distributed lock and be able to enhance your implementation to support more advanced lock modes (to be discussed later in this post) if necessary.
2) Distributed Lock Manager (DLM)
Distributed Lock Manager (DLM) is a software component that manages locks that can be accessed by multiple clients in a distributed system. It is a logical concept and can be implemented in various ways. DLM does not actually limit access to any specific resources other than the locks. It provides mechanisms for managing, acquiring, and releasing locks so that applications can ensure they access to the data in an intended or mutually exclusive manner.
Figure 2. Distributed Lock flow
A resource is an entity that we should control the shared access to it.
Resources in a database system might be:
Resources in an ordering application might be:
We have some common lock modes:
- Null (NL)
- Concurrent Read (CR)
- Concurrent Write (CW)
- Protected Read (PR)
- Protected Write (PW)
- Exclusive (EX): a traditional exclusive lock, allow the client that acquired and is holding the lock to read/update resource while preventing the others to access it. Exclusive mode is used for our demo purposes in this post.
More details can be found here: https://en.wikipedia.org/wiki/Distributed_lock_manager
Lock mechanism at a glance
If multiple clients request the same lock, their requests should be placed into a requests queue, so the lock can be given to clients in FIFO manner. After a client successfully acquires a lock, others must wait until the lock is released by that client.
So we have some basic steps here:
- Wait for a lock
- Acquire the lock
- Access the resource
- Release the lock
“Redis is an open-source, in-memory data store used by millions of developers as a database, cache, streaming engine, and message broker.”
We talk about Redis in this post because it can be used as a DLM and is also well-known for that. For the rest of this post, many concepts and discussions are referenced from the original documentation here:
Later in this post, we will see how to implement a DLM using a custom implementation versus using Redis.
4) Distributed lock: Safety and Liveness Guarantee
From Redis’ point of view, there are three properties that must be at least satisfied to guarantee we used distributed locks effectively.
- Safety property: Mutual exclusion
- Liveness property A: Deadlock free, which means eventually, a lock is always available for acquisition, even if it is not released explicitly by the client (due to crashing or partitioning).
- Liveness property B: Fault tolerance. Clients can acquire and release the locks as long as major DLM instances are up and running.
What is a Deadlock?
“In concurrent computing, deadlock is any situation in which no member of some group of entities can proceed because each waits for another member, including itself, to take action, such as sending a message or, more commonly, releasing a lock.” - Wikipedia contributors. (2022, October 17). Deadlock. Wikipedia. Retrieved October 22, 2022, from https://en.wikipedia.org/wiki/Deadlock
Image 3. Traffic deadlock
Figure 4. Deadlock in software
5) Custom implementation
This custom implementation by me is just a demo project to get a sense of how a Distributed Lock Manager works and know some various ways to do that. There’re still many things to improve in the solution and it is not ready for production. See Limitation for more details.
All of the demo projects in this post are placed in the following repository: https://github.com/trung-tran-sts/DLock
Project requirements/technical stack:
- C# / .NET Core 3.1
- ASP.NET Core
- Redis (Docker image)
In the demo, reference the following projects/files:
- SimpleDLM / DLock.SimpleDLM
- DLock.LockClient / Services / SimpleDLMLockClient.cs
Figure 5. Custom implementation flow diagram
The custom DLM implementation has some following steps in the overall flow:
1) DLM Locks Hub (SignalR) listens for connections and requests.
2) Lock client sends a lock acquisition request with resource, expiry time, and wait time.
3) Lock client waits for a local lock (a locked semaphore slim) to be acquired.
4) DLM generates the lock id for the request and adds the SignalR connection to the group of that lock id (for lock-acquired notification).
5) DLM tries giving the lock to the client:
a) If a lock for a specified resource was acquired before:
i) If the existing lock is still valid, it cannot be given to the client.
ii) If the existing lock is expired, release the lock.
b) If the lock can be acquired, lock it and then notify the client so they could safely access the resources.
c) If the lock couldn’t be acquired, add the lock request to the queue.
6) Lock client listens for lock-acquired notification.
7) If triggered, the lock client releases the local lock (semaphore slim) and allows the client to access resources
8) If the wait timeout passed, throw an exception that lock couldn’t be acquired.
9) After processing with the resources, send a lock release request (with lock id) to the DLM.
10) DLM releases the lock with the specified lock id and cleans associated resources.
11) DLM dequeues the lock request queue to find the next valid lock request (is still in waiting time) to give the lock (go to step 5).
- Basic mutual exclusion lock
- Realtime lock acquisition notification (no need retrying)
- Lock request queue managed by DLM
- Blocking/waiting for lock acquisition
- Wait timeout and lock timeout (deadlock-free)
- Single point of failure
- Multiple connections can lead to performance degradation
- Locks can be expired without the knowledge of the client acquiring the lock due to process pauses or long processing. This may cause the other clients to manage to acquire the lock and modify the state of the data before the first client finishes updating that data.
- Lack of availability
- Incomplete memory management
6) Single Redis instance as a DLM
From this section, I assume we’re all familiar with how to use and set up Redis in basic since there’s no specific configuration in Redis to enable Distributed lock feature, it’s just a client implementation that takes advantage of Redis mechanism to implement a DLM.
Redis can be helpful for a DLM with just a single instance.
To acquire the lock, execute the following command:
SET resource_name a_random_value NX PX 10000
It will set the key only if it does not already exist (using NX option), with an expiration time (TTL) of 10000 milliseconds (PX option). The important thing here is to set the key to a unique value “a_random_value” across all clients and all lock requests. It is used in order to release the lock in a safe way. We only want to release the lock if it is the expected value we received when the client managed to acquire that lock, so we could avoid releasing the lock acquired by other clients. See the following Lua script for releasing a lock:
if redis.call("get",KEYS) == ARGV then
Now, we already have two core operations: acquiring and releasing locks. It’s our responsibility to implement the complete DLM to support blocking, retrying acquiring locks, lock requests queue, etc.
More details can be found here: https://redis.io/docs/reference/patterns/distributed-locks/#correct-implementation-with-a-single-instance
- Lack of Liveness guarantee mentioned in Distributed lock: Safety and Liveness Guarantee
- Single point of failure
- Lack of automatic communication handling, blocking/waiting mechanism
- Require additional implementation
7) RedLock algorithm
Here, it is a distributed version of the algorithm. We assume we have N DLM masters (which are Redis master nodes in our case). Those masters nodes are totally independent, we don’t use any replication or coordination system. This version will still use the similar above approach to manage locks in each single Redis DLM instance. If N=5, means we have 5 Redis masters on different servers.
Figure 6. RedLock algorithm
So now, in order to acquire the lock, clients in the RedLock version will have to perform the following things:
- Get the current time in ms
- Try to acquire the lock in all N instances sequentially, with the same key name and a random value. During this step, when trying to acquire the lock in each instance, the client will just wait for a small timeout, e.g, 5-50 milliseconds. This prevents the client from waiting for a long time trying to talk with a Redis node that is unavailable
- Compute how much time is taken in order to acquire the lock, by comparing it with the start acquiring timestamp in step 1. Only when the client manages to acquire locks in the majority of the instances (at least 3 if N=5 in this case), and the total time to acquire the lock is less than the lock TTL, it’s considered to be acquired successfully.
- If the lock was acquired, its current TTL is considered to be the time left (initial TTL minus the lock acquisition time in step 3).
- If the client failed to acquire the lock (not able to lock N/2+1 instances or the TTL is invalid), it will try to unlock all the instances (even the instances that were not able to lock before).
Above is the overall algorithm of RedLock, if you want to deep dive into its analysis, go to the Redis documentation for more information: https://redis.io/docs/reference/patterns/distributed-locks/#the-redlock-algorithm
8) RedLock with Redis
In the demo, reference the following projects/files:
- DLock.LockClient / Services / RedLockRedisLockClient.cs
Install NuGet Gallery | RedLock.net 2.3.2 in our Lock client project
Set up Redis DLM instances
Sample docker-compose Redis DLM instance
We will need 5 instances like that in this case, each with a different port for demo purposes.
Run docker-compose up -d to start our DLM instances.
Set up the RedLock client
Import the libraries' namespaces, then construct multiple connection multiplexers to our Redis DLM instances.
Create the RedLock.net RedLockFactory using the list of connection multiplexers.
Acquire the lock
Use RedLockFactory to acquire the lock.
If redLock.IsAcquired is true, we can safely access our resources.
We need to keep that redLock instance for releasing purposes.
i) Release the lock
Releasing the lock is easy. We just need to dispose the red lock instance obtained in the previous step.
- See Discussion for more details about RedLock’s concerns and limitations.
- RedLock.net does not implement a lock request queue, instead it’s using retry mechanism, so it’s possible for the clients that come later to acquire the locks faster than early-waiting clients.
- It is possible for every client successfully acquire the locks in some instances, resulting in no client eventually managing to acquire the major locks.
Following is the comparison table between 3 implementations:
|Custom||Single Redis||RedLock Redis|
|High availability||Not yet||No||Yes|
|Correctness even with long-running processes||Need fencing token||Need fencing token||Need fencing token|
|Memory management||Not yet||Not yet||Managed by library|
|Lock request queue||Yes||Not yet||No|
|Retrying acquiring lock||No||Yes||Yes|
|Realtime notification for lock acquisition||Yes||No||No|
Redis is fine for a reliable distributed lock manager. However, if consistency and correctness are important factors in your system, you should review and plan the following points carefully:
- Using fencing tokens is especially important for processes that can take significant time, and is always a safer way. Locks can be expired without the knowledge of the client acquiring the lock due to process pauses. This may cause the other clients to manage to acquire the lock and modify the state of the data before the first client finishes updating that data. A workaround is to extend the locks' lifetime, however, it is still just an assumption.
- Redis is not using a single clock for the expiration mechanism. A wall-clock shift - differences in time across machines - may result in multiple clients can acquire a single lock.
- Consider using a single instance approach if downtime and availability are not a big concern or it hardly happens in your case.
- Use the heartbeat pattern to extend the lock TTL, however, this will add additional complexity to the overall architecture.
More details can be found here: https://redis.io/docs/reference/patterns/distributed-locks/#disclaimer-about-consistency
We just learned some ways to enable distributed lock in our micro-services/distributed software system. We now know how to apply a distributed lock to solve our resources' consistency and concurrent access problems. We also know some of the limitations and concerns around those implementations.
Some last words from me: keep calm and learn, then deep dive into the problems if you face them in your real-life work. Hope this post helps you gain some interesting and new views on designing the architecture of software systems.
- Demo source code: https://github.com/trung-tran-sts/DLock
- Race condition: https://en.wikipedia.org/wiki/Race_condition
- Distributed lock manager: https://en.wikipedia.org/wiki/Distributed_lock_manager
- Distributed lock manager by Oracle: https://docs.oracle.com/cd/A57673_01/DOC/server/doc/SPS73/chap8.htm
- Distributed lock with Redis: https://redis.io/docs/reference/patterns/distributed-locks/
- Analysis of RedLock with Redis: https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html
- Deadlock: https://en.wikipedia.org/wiki/Deadlock
OTHER ARTICLES FROM TRUNG TRAN