Probably and No: Redis, the Probabilistic Module, and Bloom Filters
Data deduplication, also known as dedupe, delivers improved user experiences and engagement as well as more efficient data processing. Although most people think of deduplication as removing multiple copies of the same data in a database, in reality the need for deduplication arises in many common situations, including:
Sending messages to customers is an essential part of any business. Before sending an email or text to a given user, deduplication can determine if that user has seen that email or text before. If so, the system can send that user a different message or decide not to contact that user for a given time period (if at all).
Extract, transform, and load (ETL) is a common technique for working with large data sets such as event logs. Typically ETL is used to collate data from disparate systems so it can be analyzed. Adding deduplication to this process means the system doesn’t perform the same analysis multiple times on multiple copies of the same data.
When a user creates a new account, it’s important that their ID be unique. With deduplication, it’s trivial to determine if a given ID already exists.
It’s good to get an alert that something important has happened in a system. It’s bad to get no alert at all, but it’s almost as bad to get 300 alerts for the same event. Or 30,000 alerts. Deduplication can suppress an alert that has already fired, and combining it with a timer suppresses only recent duplicates, so the system issues a given alert no more than once in a given time frame (see the sketch after this list).
Fraud detection is a great use case for AI/ML processing. Unfortunately, AI/ML processing is expensive. Deduplication can examine a transaction and see if it is similar to a transaction that was approved in the past. If the new transaction is similar enough, the system can approve the transaction automatically without incurring the cost of fraud detection. (In this use case, a duplicate value is a good thing.)
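As a minimal sketch of the alerting use case above, a plain Redis string set with the NX and EX options can serve as a time-boxed duplicate check. The key name alert:disk-full:host42 and the one-hour window are assumptions for illustration; only the first SET within the window succeeds, so only that attempt needs to trigger a notification:
redis> SET alert:disk-full:host42 1 NX EX 3600
OK
redis> SET alert:disk-full:host42 1 NX EX 3600
(nil)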
A Redis set stores any given value only once. No matter how many times a value is added to a set, the set contains only one copy of that value. Redis sets are also optimized to answer the question, “Is this value in this set?” with extreme speed. And sets are deterministic: the answers to that question are “Definitely” and “Definitely not.” Be aware, however, that a set can use a substantial amount of memory as the number of unique values in the set grows.
The Probabilistic module provides Bloom filters, a probabilistic data structure that is useful for deduplication. Unlike a set, a Bloom filter stores only a compact pattern of bits derived from hashes of each value, not the value itself. That means a Bloom filter can take as little as 2% of the memory a set requires. Bloom filters are also typically slightly faster than sets, which are themselves very fast. But there is a tradeoff for this speed and memory efficiency: a Bloom filter is not deterministic. The answers to the question, “Have we seen this value before?” are “Definitely not” and “Probably so.” In other words, false positives are possible with a Bloom filter.
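The false-positive rate is tunable when the filter is created. As a minimal sketch, assuming a hypothetical filter named usernames, the BF.RESERVE command sizes the filter for a target error rate and expected number of items; a lower error rate costs more memory:
redis> BF.RESERVE usernames 0.001 10000000
OK
Calling BF.ADD on a key that doesn’t exist creates a filter with default settings, so BF.RESERVE is only needed when you want to control the error rate and capacity yourself.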
Customer success story
“A major gaming company uses Bloom filters in Redis Enterprise to add an event-level deduplication layer to real-time data streams, enabling them to process more than 2 billion events per day. Compared to their original implementation, they now use roughly 90% less memory and they reduced latency from roughly 2 milliseconds to less than 0.25 milliseconds.”
To have world-class user engagement, your solutions have to anticipate a user’s needs and understand their previous experiences. A great deduplication solution lets you make sure that every user interaction is unique, relevant to the user’s interests, and based on the user’s history.
A deduplication solution should scale linearly to handle massive data sets without sacrificing performance. Redis sets and the Probabilistic module do just that. Whichever you choose, both offer lightning-fast performance at any data set size, including multi-cloud and geo-distributed deployments.
An efficient deduplication solution gives you the opportunity to avoid redundant processing, making sure that no data is handled more than once. In some cases, deduplication can be used to avoid processing altogether. And using deduplication to limit or eliminate alerts makes it easier for operators to respond appropriately to system events while avoiding alert fatigue.
A set contains at most one copy of any given value. In other words, we can add the same value to a set a million times (call SADD pets dog over and over), but the value will occur only once in the set. With a set defined, we can use the SISMEMBER command to tell whether a given value is in the set. If we create a set named pets and add the values dog, cat, and parrot to it, the value dog is a member of the set, while the value monkey is not. And if we add dog again, SADD returns 0 because the set already contains that value:
redis> SADD pets dog cat parrot
(integer) 3
redis> SISMEMBER pets dog
(integer) 1
redis> SISMEMBER pets monkey
(integer) 0
redis> SADD pets dog
(integer) 0
Here’s how easy it is to set up and use a Bloom filter:
We want to make sure that every user has a unique name. We add doug and lisa to the Bloom filter. (BF.ADD creates the filter if necessary.) If someone tries to create another account named doug, the Bloom filter will say that we probably have a user with that name already. On the other hand, if someone tries to create a new account with the username bob, the Bloom filter will tell us that it is definitely not a name we’ve seen before, so it’s safe to create a new account with that name. And just as with the set, adding lisa a second time returns 0 because the filter (probably) already contains that value:
redis> BF.ADD users doug
(integer) 1
redis> BF.ADD users lisa
(integer) 1
redis> BF.EXISTS users doug
(integer) 1
redis> BF.EXISTS users bob
(integer) 0
redis> BF.ADD users lisa
(integer) 0
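At high event rates, like the stream deduplication in the customer success story above, it can help to batch these calls. As a minimal sketch against the same users filter (maria and chen are hypothetical names), BF.MADD adds several items in one round trip and BF.MEXISTS checks several at once:
redis> BF.MADD users maria chen doug
1) (integer) 1
2) (integer) 1
3) (integer) 0
redis> BF.MEXISTS users maria bob
1) (integer) 1
2) (integer) 0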