You may not need Redis with Elixir

If you have participated in a discussion about Elixir, you may have heard “you may not need Redis with Elixir”. Given that Redis has many use cases, this sentence may confuse developers as they try to match Elixir’s different features against Redis’ capabilities. This article explores scenarios where the statement above holds true, scenarios where it does not, and the trade-offs you may want to consider. We will discuss four cases:

  1. Distributed PubSub
  2. Presence
  3. Caching
  4. Asynchronous processing

Before we start, I want to emphasize we find Redis a fantastic piece of technology. This is not a critique of Redis but rather a discussion of the different options Elixir developers may have available.

Case #1: Distributed PubSub

The first scenario where you may not need Redis with Elixir is Distributed PubSub. Throughout this section, we will consider PubSub systems to provide at-most-once delivery: they broadcast events to the currently available subscribers. If a subscriber is not around, they won’t receive the message later.

For this reason, PubSub systems are often paired with databases to offer persistence. For example, every time someone sends a message in a chat application, the system can save the contents to the database and then broadcast it to all users. This means everyone connected at a given moment sees the update immediately, but disconnected users can catch up later.

Imagine that you have multiple nodes, and you want to exchange messages between said nodes. In Elixir, thanks to the Erlang VM, which ships with distribution support, this can be as simple as:

# Send :hello_world to the process registered under :known_name
# on every other node connected to this one
for node <- Node.list() do
  send({:known_name, node}, :hello_world)
end

In 200LOC or less, you can implement a PubSub system that broadcasts to all subscribers on the same node or anywhere else in a cluster, without bringing in any third-party tools. At most, you will need libcluster - an Elixir library - to establish the connections between nodes based on some strategy (K8s, AWS, DNS, etc.).
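To make this concrete, here is a minimal single-node PubSub sketch built only on Elixir’s standard library `Registry` (the `MyPubSub` and `PubSub` names are illustrative, not from any library). Phoenix.PubSub builds on similar ideas and adds the cross-node layer:

```elixir
# A duplicate-key registry: many subscribers may register under one topic.
{:ok, _} = Registry.start_link(keys: :duplicate, name: MyPubSub)

defmodule PubSub do
  # Subscribe the calling process to a topic.
  def subscribe(topic), do: Registry.register(MyPubSub, topic, [])

  # Broadcast a message to every subscriber of a topic on this node.
  def broadcast(topic, message) do
    Registry.dispatch(MyPubSub, topic, fn entries ->
      for {pid, _meta} <- entries, do: send(pid, message)
    end)
  end
end

PubSub.subscribe("room:lobby")
PubSub.broadcast("room:lobby", {:new_message, "hello"})

receive do
  {:new_message, text} -> IO.puts("got: #{text}")
end
```

Broadcasting to other nodes is then a matter of forwarding the message to each node’s local registry, as in the loop above.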

In other words, PubSub pretty much ships out of the box with Elixir. Technologies without distribution would need to rely on Redis PubSub, PostgreSQL Notifications, or similar to achieve the same.

Of course, the above assumes your infrastructure allows you to directly establish connections between nodes, which may not be possible in some PaaS, such as Heroku. In those cases, you can use any of the technologies above (Phoenix has a Redis adapter for its PubSub), or alternatively use platforms, such as Gigalixir, that make it trivial to set up a cluster.

Case #2: Presence

Presence is the ability to track who is connected in a cluster right now — the “who” may be users, phones, IoT devices, etc. For example, if Alice is connected to node A, she wants to see that Bob is also available, even if he has joined node B.

Presence is one of those problems that is more complicated to implement than it sounds. For example, consider implementing Presence by storing the connected entities in a database. What happens if a node crashes or leaves the cluster? All the users connected to it must be removed, but the crashed node cannot do so itself. Therefore the other nodes need to detect those failure scenarios and act accordingly. But observing failures in a distributed system is also complicated: how do you differentiate a temporarily unresponsive node from one that has permanently failed?

Another common approach is to write to a database frequently while users are connected. If no writes have been seen within a given timeframe, you consider those users disconnected. However, such solutions have to choose between being write-intensive or inaccurate. For instance, let’s say that users are considered disconnected after 1 minute without a write. This means writing to the database every minute for every user. If you have 10k users, that’s roughly 167 writes per second, only to track that the users are connected. Meanwhile, the gap between a user leaving and having their status reflected in the UI is, in the worst case, also 1 minute. Any attempt at reducing the number of writes implies an increased gap.

Given Elixir’s clustering support, we can once more implement Presence without third-party dependencies! We build Presence on top of a PubSub system, since we need to broadcast notifications as users join and leave. Instead of relying on centralized storage, the nodes communicate directly and exchange information about who is around. This removes the need for frequent writes, and when a user leaves, the change is reflected immediately.
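Within a single node, the core of this technique is process monitoring: the VM notifies a tracker process the moment a connection process dies, so no heartbeats or timed database writes are needed. Here is a minimal single-node sketch (the `Presence` module is illustrative; Phoenix.Presence adds the cross-node replication and conflict resolution described above):

```elixir
defmodule Presence do
  use GenServer

  def start_link, do: GenServer.start_link(__MODULE__, %{}, name: __MODULE__)

  # Track `pid` under `key` (e.g. a username). The pid is monitored,
  # so its entry is removed automatically when the process exits.
  def track(pid, key), do: GenServer.call(__MODULE__, {:track, pid, key})

  def list, do: GenServer.call(__MODULE__, :list)

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:track, pid, key}, _from, state) do
    ref = Process.monitor(pid)
    {:reply, :ok, Map.put(state, ref, key)}
  end

  def handle_call(:list, _from, state) do
    {:reply, Map.values(state), state}
  end

  # The VM delivers :DOWN when a tracked process dies; drop its entry.
  @impl true
  def handle_info({:DOWN, ref, :process, _pid, _reason}, state) do
    {:noreply, Map.delete(state, ref)}
  end
end
```

If the user’s connection process crashes or disconnects, `Process.monitor/1` guarantees the `:DOWN` message arrives, so the presence list self-heals without any polling.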

So while you can use Redis or another storage to provide Presence, Elixir can deliver a solution that is efficient and doesn’t require third-party tools.

Case #3: Caching

The solutions to the previous cases were built on top of Erlang’s unique distribution capabilities. In the following sections, the distinguishing factor between needing Redis or not will be multi-core concurrency, so the discussion is more generally applicable. Therefore, when we say Elixir in this section, it also applies to the JVM, Go, and other environments. These contrast with Ruby, Python, and Node.js, whose primary runtimes do not provide adequate multi-core concurrency within a single Operating System process.

Let’s start with the non-concurrent scenario. Say you are building a web application in Ruby, Python, etc. To deploy it, you get two eight-core machines. In languages that do not provide satisfactory multi-core concurrency, a common deployment option is to start 8 instances of your web application, one per core, on each node. Overall, you will have C×N instances, where C is the number of cores and N is the number of nodes: 16 instances in this example.

Now consider a particular operation in this application that is expensive, and you want to cache its results. The easiest solution, regardless of your programming environment, is to cache it in memory. However, given we have 16 instances of this application, caching in memory is suboptimal: we would have to perform the expensive operation at least 16 times, once per instance. For this reason, it is common to use Redis, Memcached, or similar for caching in environments like Ruby, Python, etc. With Redis, you compute the result only once, and it is shared across all instances. The trade-off is that we are replacing a memory access with a network round-trip, and the latter is orders of magnitude more expensive.

Now let’s consider environments with multi-core concurrency. In languages like Elixir, you start one instance per node, regardless of the number of cores, since the runtime shares memory and efficiently spreads the work across all cores. When it comes to caching, keeping the cache in memory becomes much more affordable, as you only have to compute each result once per node. Therefore, you have the option to skip Redis or Memcached altogether and avoid the network round-trip.
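Because all requests on a node share memory, an in-memory cache can be as small as an ETS table. Here is a minimal read-through sketch (the table and module names are illustrative; libraries such as Cachex add TTLs, eviction, and more on top of the same idea):

```elixir
defmodule Cache do
  # Create a public, named ETS table optimized for concurrent reads,
  # so every request handler on this node can use it directly.
  def init do
    :ets.new(:my_cache, [:named_table, :public, read_concurrency: true])
  end

  # Return the cached value for `key`, or run `fun`,
  # store its result, and return it.
  def fetch(key, fun) do
    case :ets.lookup(:my_cache, key) do
      [{^key, value}] ->
        value

      [] ->
        value = fun.()
        :ets.insert(:my_cache, {key, value})
        value
    end
  end
end

Cache.init()
# The expensive computation runs once; later lookups are memory reads.
Cache.fetch(:monthly_report, fn -> :timer.sleep(100); :expensive_result end)
```

Every request served by this node now shares the same cache, with no serialization or network hop involved.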

Of course, this depends on how many nodes you are effectively running in production. Luckily, many companies report being able to run Elixir with an order of magnitude fewer nodes than the technologies they migrated from.

You can also choose a mixed approach and store the cache both in memory and in Redis. First, you look it up in memory and, if missing, you fall back to Redis. If it is unavailable in both, you execute the operation and cache the result in each layer. The critical point is that multi-core environments give you more flexibility to tackle these problems while reducing resource utilization. In Elixir/Erlang, you can also keep the cache in memory and use PubSub to distribute it across nodes. You can see this last approach in action in the excellent FunWithFlags library.
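The mixed approach is simply a chain of fallbacks. The sketch below shows the lookup order, with the shared Redis layer stubbed by an Agent holding a map so the example is self-contained (all names are illustrative; a real system would issue Redix commands in the `shared/1` and `put/2` functions instead):

```elixir
defmodule LayeredCache do
  # Local layer: an ETS table on this node.
  # Shared layer: an Agent standing in for Redis, for illustration only.
  def init do
    :ets.new(:local, [:named_table, :public])
    Agent.start_link(fn -> %{} end, name: :shared)
  end

  # Look up locally, then in the shared layer, then compute and store.
  def fetch(key, fun) do
    with :miss <- local(key),
         :miss <- shared(key) do
      value = fun.()
      put(key, value)
      value
    end
  end

  defp local(key) do
    case :ets.lookup(:local, key) do
      [{^key, value}] -> value
      [] -> :miss
    end
  end

  defp shared(key) do
    case Agent.get(:shared, &Map.fetch(&1, key)) do
      {:ok, value} ->
        # Warm the local layer so the next lookup skips the round-trip.
        :ets.insert(:local, {key, value})
        value

      :error ->
        :miss
    end
  end

  defp put(key, value) do
    :ets.insert(:local, {key, value})
    Agent.update(:shared, &Map.put(&1, key, value))
  end
end

LayeredCache.init()
LayeredCache.fetch(:report, fn -> :computed end)
```

A freshly deployed node starts with an empty local layer but immediately benefits from the shared one, which is precisely the persistence-across-deploys property discussed next.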

Another trade-off to consider is that any in-memory cache will be gone once you deploy new nodes. Therefore, if you need the data to persist across deployments, you will want to use Redis as a cache layer, as detailed above, or dump the cache to storage, such as a database, S3, or Redis, before each deployment.

Case #4: Asynchronous processing

Another scenario where you may not need Redis with Elixir is asynchronous processing. Let’s continue the discussion from the previous case.

In environments with limited or no multi-core concurrency, given each instance is pinned to one core, instances are limited in their ability to handle requests concurrently. This has led to the common saying that “you should avoid blocking the main thread”. For example, imagine that your application has to deliver emails on sign-up or generate computationally expensive reports. While one of your 16 web instances is doing this, it cannot efficiently handle other incoming requests. For this reason, a common choice is to move the work elsewhere, typically to a background-job processing queue. First, you store the work to be done in Redis or similar. Then one of the 16 web instances (or, more commonly, a completely different set of workers) grabs it from the queue.

In multi-core concurrent environments, requests can be handled concurrently regardless of whether they are doing CPU or IO work. Sending the email from the request itself won’t block other requests. Generating the report is not a problem either, as other requests can be served by other cores. These platforms typically receive as many requests as they can handle and distribute the work over the machine’s resources. Even if you prefer to deliver emails outside of the request, in order to respond to users sooner, you can spawn an asynchronous worker without moving the delivery to an external queue or another machine. Once again, concurrency gives us a more straightforward option for tackling these scenarios.
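For instance, moving an email delivery off the request path is a one-liner with a Task.Supervisor; no queue and no extra machine involved (the `MyApp.Mailer` module and supervisor name below are illustrative stand-ins, not real APIs):

```elixir
# In a real application this supervisor would sit in the supervision tree.
{:ok, _} = Task.Supervisor.start_link(name: MyApp.TaskSupervisor)

defmodule MyApp.Mailer do
  # Stand-in for a real email delivery, for illustration only.
  def deliver_welcome_email(address) do
    IO.puts("Delivering welcome email to #{address}")
  end
end

# Inside the request handler: fire and forget, then respond immediately.
Task.Supervisor.start_child(MyApp.TaskSupervisor, fn ->
  MyApp.Mailer.deliver_welcome_email("alice@example.com")
end)
```

The task runs concurrently with the request that spawned it, and supervision gives you visibility and clean shutdown for free.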

Note the Erlang VM takes care of multiplexing CPU and IO work without requiring developers to tag functions as async or similar. Workers in Erlang/Elixir are also preemptively scheduled, so it is not possible for a group of workers to monopolize the machine’s resources and block other workers from making progress. This is quite similar to how Operating Systems manage their own processes, albeit much more lightweight.

There is one big caveat here: background-job processing queues often come with multiple features, such as retries, job visibility, etc. If you need any of these features, then I strongly suggest using a tool that relies on storage and provides all the bells and whistles. Note that a background-job tool may use Redis, such as Elixir’s exq, but it doesn’t have to. It can use a database, as seen in Oban, or a conventional messaging system, such as RabbitMQ or Amazon SQS. In any case, for something as trivial as sending an email in Elixir, I would send the email within the request, especially if the user needs to open the email before proceeding.

This caveat has led to some confusion, where some would claim that “you don’t need background jobs in Elixir”, which can be misleading. In Elixir, background jobs are a choice you make when your requirements demand it, not a necessity from day one.

I want to finish this section with a tale from one of my last consulting gigs as a Ruby developer, as it was an insightful example of when background jobs are not the answer and can even be harmful.

The gig was with a company having scalability issues with Ruby. In particular, their problems were related to payment processing. They had to integrate with a specific payment processor, which would often take north of 3 seconds to handle a request. As per the above, while their Ruby servers were waiting for the payment processor, they could not do any other work, which slowed down their service. Their first course of action was to ramp up the number of servers. However, as the application gained users, latency remained unpredictable, operations became more complicated, and other parts of their architecture came under strain, leading to a lot of sunk development time.

They tried threaded web servers, but that did not address the problem satisfactorily. They also explored moving to JRuby, which would have solved the problem at the runtime level, but they had little experience operating the Java VM, which blocked the migration.

The quick workaround (and common practice) was to move the payment processing to a background job. However, if the processing failed, they could not merely retry the job. Due to payment processing requirements, the user input was necessary on every attempt. So when it failed, they chose to send an email to users with a link to try again, which ultimately affected their conversion rates.

When we were brought in to work on the system, we developed a separate application to communicate with the payment processor, so we could scale it in isolation and try different deployment options with minimal impact. Then we added client-side polling to show the payment state while it was processed. The problem was addressed, but it cost hundreds of hours of development time and lost revenue until they arrived at the solution, a difficulty that would not exist on platforms with rich and robust tools for async processing and concurrency.

Summary

In this article, we discussed cases where you can reduce your operational complexity by using the features that ship as part of Elixir. The goal is to provide an in-depth reference that developers can link to when someone says that “you may not need Redis in Elixir”.

If I had to summarize what all of these cases have in common, the answer is ephemeral state. PubSub, Presence, caching, etc. are all temporary. PubSub delivers messages to whoever is available right now. Presence tracks who is connected right now. Whatever is cached can be lost and recomputed. Therefore, if you are handling ephemeral data in Elixir, the odds are you may not need Redis. However, if you need to persist or back up this state, then Redis or any other database will come in handy.

It is also worth saying that, if you would rather just use Redis, for whatever reason, then go ahead and use Redis! You certainly won’t be alone as you join other companies using libraries like Redix to run Elixir and Redis together in production.