I’m excited to announce that we’ve opensourced redset, a flexible tool we’ve been using internally to help coordinate work in distributed Python systems. Redset helps to deduplicate and manage time-sensitive processing tasks, which oft-used queuing solutions don’t handle very well. That might read like Martian to some, so I’ll try to illustrate the motivations that drove us to write redset.
From an engineering standpoint, Percolate has become a very rich, almost kaleidoscopic system encompassing many different threads of activity. The product is responsible for monitoring social services, crawling content, retrieving and processing analytics data, generating notifications, and managing a variety of cached views on the data that Percolate concerns itself with. Most of these processes happen independently and concurrently, despite being members of the same virtual ecosystem.
Many of these tasks share a similar structure in terms of how they can be successfully organized and processed, and often engineering patterns crop up that are flexible enough to be reused across problems. One pattern is the queuing strategy, which is an effective solution for our publishing, scraping, and notification problems. Most engineers these days are familiar with using a message queue for these sorts of tasks: You want to scrape the content from site A, so you push a message onto the queue that says “scrape site A” and some consumer of that queue (usually one of many) eventually pops the message off and does the requested work. If you need to scrape more sites, you add more consumers. Easy. MQs are as regular a fixture in the contemporary tech scene as Aeron chairs or Thunderbolt monitors.
There is a similar but subtly different class of activities that message queues don’t organize very well. Queuing is fine if you can rely on the message producers not to generate duplicate messages, or if each piece of work is so easy that you don’t mind a little overlap, but what if you have multiple producers who might generate duplicate requests, and the work is really expensive? You might, then, be faced with doing redundant, expensive work. Because the message queue has no good way of determining whether the contents of a candidate message are already represented in the queue (this is the nature of a queue), you can end up with duplicate messages and a bunch of wasted cycles.
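To make the duplication problem concrete, here is a minimal sketch that uses an in-process `collections.deque` as a stand-in for a message queue. The two "producers" and the message strings are hypothetical; the point is only that a plain queue accepts whatever it is given, so the consumer ends up doing the same work twice.

```python
from collections import deque

# A plain FIFO queue: it records every message and has no notion of membership.
queue = deque()

# Two independent (hypothetical) producers request the same expensive job.
queue.append("scrape site A")  # producer 1
queue.append("scrape site A")  # producer 2, unaware of producer 1
queue.append("scrape site B")

# A consumer drains the queue and performs every request it finds.
work_done = []
while queue:
    work_done.append(queue.popleft())

# Count how many units of work were redundant.
duplicates = len(work_done) - len(set(work_done))
```

Checking for an existing message before enqueueing would mean scanning the whole queue, which is exactly the kind of operation a queue is not built for.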
Message queues also don’t support priority very well. Say you want to scrape site A at some point, but you want to scrape site B as soon as possible. A typical first-in, first-out queue gives you no way to express that: B waits behind everything that was enqueued before it.
Given that we do expensive analytics crawling and processing, Percolate has a few tasks that require both uniqueness and priority across multiple producers. Message queues just don’t fit the bill, despite being a well understood, supported technology in the Python milieu.
Fortunately, a data store called redis offers built-in data structures that solve these sorts of problems nicely. A redis sorted set provides us with a centralized data container that has both the uniqueness and priority properties that we’re after: each member appears at most once, and members are kept ordered by a numeric score.
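The two properties can be illustrated without a running server. The sketch below is a toy, in-memory model of sorted-set behavior, not redis itself and not redset's implementation; the class name `ToySortedSet` and its methods are made up for illustration. Adding an existing member only updates its score, and popping returns the member with the lowest score.

```python
class ToySortedSet:
    """In-memory stand-in for a sorted set: unique members ordered by score."""

    def __init__(self):
        self._scores = {}  # member -> score

    def add(self, member, score):
        """Insert a member or update its score; return True only if it was new."""
        is_new = member not in self._scores
        self._scores[member] = score
        return is_new

    def pop(self):
        """Remove and return the member with the lowest score, or None if empty."""
        if not self._scores:
            return None
        member = min(self._scores, key=self._scores.get)
        del self._scores[member]
        return member

    def __len__(self):
        return len(self._scores)


tasks = ToySortedSet()
tasks.add("scrape site A", score=10)
tasks.add("scrape site A", score=10)  # duplicate request: no second entry
tasks.add("scrape site B", score=1)   # lower score, so it is served first

size = len(tasks)
order = [tasks.pop(), tasks.pop(), tasks.pop()]
```

Here the score plays the role of priority. Using a timestamp as the score instead would give "oldest request first" ordering while still collapsing duplicates.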
The only catch is that adapting the redis sorted set for use across multiple processes in Python is a little cumbersome. Not only must the serializing and deserializing of redis datatypes to Python objects be managed, but in order to support concurrent usage across process boundaries, certain critical parts of the code must be protected with mutual exclusion locks to avoid duplicate reads.
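Both chores can be sketched in a few lines. The example below is an assumption-laden illustration rather than how redset is written: it keeps the set in memory, serializes tasks with `json`, and uses a `threading.Lock`, which only protects threads within one process. A real multi-process setup would need a lock that every process can see. The class name `LockedTaskSet` is hypothetical.

```python
import json
import threading


class LockedTaskSet:
    """Toy task set: JSON-serialized members, with a lock around read-and-remove."""

    def __init__(self):
        self._scores = {}              # serialized task -> score
        self._lock = threading.Lock()  # stand-in for a cross-process lock

    def add(self, task, score):
        # sort_keys makes equal dicts serialize to the same string.
        key = json.dumps(task, sort_keys=True)
        with self._lock:
            self._scores[key] = score

    def pop(self):
        # Finding and deleting the lowest-scored member must happen as one
        # step; otherwise two consumers could read the same member.
        with self._lock:
            if not self._scores:
                return None
            key = min(self._scores, key=self._scores.get)
            del self._scores[key]
        return json.loads(key)


tasks = LockedTaskSet()
for n in range(100):
    tasks.add({"site": f"site-{n}"}, score=n)

popped = []


def consume():
    """Pop tasks until the set is empty, recording each one."""
    while True:
        task = tasks.pop()
        if task is None:
            return
        popped.append(task["site"])


workers = [threading.Thread(target=consume) for _ in range(4)]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
```

With the lock in place, each of the 100 tasks is handed to exactly one of the four consumers.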
Since we’ve found this pattern to be useful in our own system, we’ve invested some time into writing and packaging a small library that helps bridge the gap between redis sorted sets and Python. So redset was born, and we’ve found it to be a very helpful tool for managing prioritized tasksets and cached views. We use it primarily for analytics activities, where we can easily monitor how quickly tasks are being processed with redset.TimeSortedSet objects, but the potential applications are broad.
I like redset in particular because it’s an example of a small, focused opensource tool. Its only dependency is the redis-py package, and its lightweight requirements make it easily usable in a variety of contexts. Its small size makes testing it comprehensively a straightforward task.
The opensource community has given Percolate so much, and it’s my genuine hope that redset helps engineers in the same community simplify their distributed Python systems in a satisfying way.
If you’re interested in writing tools to make distributed systems more fun to work with, consider dropping us a note.