Optimizing Redis Key-Value Storage for Large-Scale Applications
When dealing with massive amounts of data, such as the hundreds of millions of key-value pairs we store at Instagram, finding an efficient storage solution is crucial. In this case, we needed to map approximately 300 million photos to the user IDs that created them, which called for a storage system that is both scalable and persistent.
Legacy System and Data Fragmentation
Although our client and API applications have been updated, plenty of old cached data still needs to be handled. To serve those requests, we need to be able to look up the user ID for any given photo, and this mapping has to be stored in a way that is compatible with our existing infrastructure. This is where our legacy system comes into play.
The Problem: Storing Large Amounts of Data
We initially considered storing the data in a database table with columns for media ID and user ID. However, these IDs never change (rows are only ever inserted), we have no need for transactions, and no other tables have relationships with this data, so a full SQL database seemed like overkill. This led us to explore alternative storage solutions.
Redis: The Swiss Army Knife of Key-Value Stores
We turned to Redis, a key-value store already used widely at Instagram. Redis provides a straightforward set-key/get-key interface as well as aggregate types such as sets and sorted sets. It also has a configurable persistence model that snapshots data to disk at specified intervals, and it supports master-slave replication.
Initial Solution: Simple Key-Value Pairing
Our initial solution was to use Redis as a simple key-value store, where the key would be the media ID, and the value would be the user ID. We would use the following commands:
SET media:1155315 939
GET media:1155315
> 939
However, this solution had a major limitation: it required about 21 GB of memory for 300 million keys, more than we could afford on our existing infrastructure.
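A quick back-of-the-envelope calculation shows why: dividing the measured footprint by the key count gives the per-entry cost (plain Python, using the figures quoted above):

```python
# Back-of-the-envelope check of the naive solution's overhead.
# Figures from the text: 300 million keys need about 21 GB of memory.
total_bytes = 21 * 10**9   # 21 GB (decimal)
num_keys = 300 * 10**6     # 300 million media-ID -> user-ID pairs

bytes_per_key = total_bytes / num_keys
print(bytes_per_key)  # 70.0
```

Each entry is just a short key string like "media:1155315" plus a small integer, so roughly 70 bytes per key means most of the memory is going to Redis's per-key bookkeeping in the top-level keyspace, not to the payload itself.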
Redis Hash: The Solution
Pieter Noordhuis, one of Redis's core developers, suggested using Redis hashes to solve our problem. A Redis hash is a dictionary-like data structure that can be encoded very compactly in memory when it stays small. We found that capping each hash at 1000 entries (via the hash-max-zipmap-entries configuration option) was the most efficient setup.
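In redis.conf, the relevant settings look roughly like the sketch below. The 64-byte value limit is Redis's default and an assumption on our part, not a figure from these tests; note also that later Redis versions renamed these options to hash-max-ziplist-entries/value.

```
# Hashes with up to 1000 fields keep the compact zipmap encoding
hash-max-zipmap-entries 1000
# Fields or values longer than 64 bytes force the dense encoding
hash-max-zipmap-value 64
```

Both limits must hold for a hash to stay in the compact encoding; exceed either one and Redis silently converts the hash to a regular hash table, losing the memory savings.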
Hash-Based Storage
We assign each media ID to a bucket by dividing it by 1000 and discarding the remainder, so each bucket covers up to 1000 IDs. The bucket number becomes the hash key, and every ID in that bucket is stored as a field of the same hash. For example, media ID 1155315 falls into bucket 1155 (1155315 / 1000 = 1155), so we would use the following commands:
HSET "mediabucket:1155" "1155315" "939"
HGET "mediabucket:1155" "1155315"
> "939"
Memory Savings
The memory difference is striking: stored in hashes, 1 million keys take only about 16 MB, and extrapolating to 300 million keys puts the total under 5 GB. That fits comfortably in memory on a single Amazon EC2 m1.large instance, at roughly 1/3 the cost of what our original solution would have required.
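Extrapolating the measured per-million figure confirms the claim (plain Python):

```python
# Extrapolate the measured hash-encoded footprint to the full data set.
mb_per_million_keys = 16   # measured: 1 million keys ~ 16 MB in hashes
total_keys_millions = 300  # 300 million media-ID -> user-ID pairs

total_gb = mb_per_million_keys * total_keys_millions / 1000  # decimal GB
print(total_gb)  # 4.8
```

At 4.8 GB versus 21 GB, the hash encoding cuts the footprint by more than 4x for the same data.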
Hash Performance
Most importantly, hash lookups remain effectively O(1) and very fast. (Strictly speaking, a lookup inside a zipmap-encoded hash is linear, but with at most 1000 small entries per hash the cost is negligible.) If you are interested in trying this solution yourself, the Gist scripts we used to run these tests are available on GitHub (we have included Memcached for comparison).
By using Redis hashes, we were able to store this large data set in a scalable and persistent way. The solution has been running successfully at Instagram, and we hope it will be helpful to others facing similar challenges.