Remix's concurrent submissions are fundamentally flawed (without causal ordering)
- José Valim
- September 12th, 2024
- concurrency, liveview
I recently heard that ChatGPT launched a new version of its UI built with Remix, so I decided to give it a try and chase some UI/UX bugs. One of my motivations is that I consider Remix to be a library/framework trying to further integrate client and server, similar to Phoenix LiveView, but with different trade-offs.
As I dug deeper, I realized that the trade-offs made by Remix’s submission and revalidation are flawed: they cannot reliably deliver the properties outlined in their concurrency page for the majority of applications (if not all).
Submission and revalidation
Submission and revalidation is the idea that, if you submit a form, press a button, or do anything that leads to a POST/PATCH/DELETE on the server, you first send a request with the mutation and then issue another request to reload the data.
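To make this concrete, here is roughly what the pattern looks like as a Remix-style route module. This is a hedged sketch: the db helper and its API are placeholders, and only the action/loader split matters.

```typescript
// Hedged sketch of "submission and revalidation" as a Remix-style route module.
// The `db` helper is a placeholder; only the action/loader split matters here.
import { json, type ActionFunctionArgs } from "@remix-run/node";
import { db } from "~/db.server";

// Round-trip #1: the submission hits the action and performs the mutation.
export async function action({ request }: ActionFunctionArgs) {
  const form = await request.formData();
  await db.row.delete({ where: { id: String(form.get("id")) } });
  return json({ ok: true });
}

// Round-trip #2: after the action resolves, the framework revalidates by
// running the loader again and committing its result to the UI.
export async function loader() {
  return json({ rows: await db.row.findMany() });
}
```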
The first obvious issue with this approach is that, for any mutation, you are doing two round-trips to the server. For example, ChatGPT’s UI does perform two round-trips and the lag is quite noticeable. The two most common reasons I have heard for going down this route are:
- It supports workflows with no JavaScript. However, in ChatGPT’s case, that’s not a possibility. So why pay the price for a feature that is not there?
- It benefits caching. Which is partially pointless: why am I paying the price of two requests for the possibility of eventually using the cached value in the future? Why not do a single request and, if I need to read the data again, then I cache it?
Anyway, assuming you are fine with paying the price of two round-trips, Remix documentation says that it allows concurrent submissions and that Remix “safeguards against potential pitfalls by refraining from committing stale data when other actions introduce race conditions”. Unfortunately, that’s not quite true.
Hello Database
Remix’s documentation includes diagrams with some examples of how it deals with network requests. Let’s build on top of them. In particular, the diagrams use the following keys:
- |: Submission begins
- ✓: Action complete, data revalidation begins
- ✅: Revalidated data is committed to the UI
- ❌: Request cancelled
And here is one example they show:
submission 1: |----✓-----✅
submission 2: |-----✓-----✅
submission 3: |-----✓-----✅
There is a wrong assumption here: it assumes that the revalidation that finishes first contains an earlier version of the data. Given that most Remix applications interact with a database, let’s add a new key, called R, which marks when the revalidation reads from the database. Most people would expect it to always run like this:
submission 1: |----✓--R-----------------✅
submission 2: |-----✓--R----------------✅
submission 3: |-----✓--R---------✅
But the following is also a possible execution:
submission 1: |----✓---------------R----✅
submission 2: |-----✓--R----------------✅
submission 3: |-----✓------R-----✅
As you can see above, R1 will see all submissions, and that will be reflected in the UI. But R2 won’t see the effects of the third submission, reverting the UI to a previous state, only for it to correct itself once again.
Let’s make things more concrete. Imagine you have a table with three rows. Each row has a delete button. If you delete the three rows one after the other, you will issue three submissions, one to delete each row. On the revalidate step, submission 1 will see all rows deleted, removing them from the page. Then submission 2’s revalidation comes in and brings the third row back to life, only for it to be removed again. In this particular example, you could somehow track that the third row has been removed permanently, but for any non-trivial case, a submission may affect too many different properties and UI elements to track.
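To see this interleaving with your own eyes, here is a toy simulation of the scenario above in plain TypeScript. It is not Remix code and the timings are arbitrary; it only mimics three concurrent delete-then-revalidate requests:

```typescript
// Toy simulation of the interleaving above: three deletes, where submission 2's
// revalidation reads before submission 3's write and its late-arriving response
// resurrects the already-deleted row in the UI.
const rows = new Set(["row-1", "row-2", "row-3"]);
let ui: string[] = [];

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function submit(id: string, writeAt: number, readAt: number, commitAt: number) {
  await sleep(writeAt);            // ✓ action completes: the row is deleted
  rows.delete(id);
  await sleep(readAt - writeAt);   // R revalidation reads whatever is visible *now*
  const snapshot = [...rows];
  await sleep(commitAt - readAt);  // ✅ response arrives and is committed to the UI
  ui = snapshot;
  console.log(`UI after the revalidation for ${id}:`, ui);
}

async function main() {
  await Promise.all([
    submit("row-1", 50, 300, 350),  // reads late: sees all deletes, commits []
    submit("row-2", 100, 120, 400), // reads *before* row-3 is deleted, commits last but one
    submit("row-3", 150, 500, 550), // reads late as well, commits [] at the very end
  ]);
}

main();
```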
Overall, the assumption that the first response to arrive carries an earlier version of the data is wrong for concurrent requests, and Remix does not safeguard against these race conditions. The safest thing for Remix to do is to issue the revalidation only after all submissions have completed, which may further penalize the user experience by stalling updates until the last one arrives:
submission 1: |----✓
submission 2: |-----✓
submission 3: |-----✓------R-----✅
In fact, you cannot even guarantee the submissions are processed in order! It may be that submission 2 updates the database before submission 1! If the concurrent submissions modify overlapping resources in the database, there is no guarantee the last submission sent by the user will be the last one applied by the server, unless the submissions converge or are made serial. So not only may it show the wrong data, it may also persist stale data to the database.
Intermission: Q&A
At this point, you may have several questions and suggestions, so let’s get some of the quick ones out of the way, before we jump into the big one.
Q: Couldn’t I store locally that an item has been updated/deleted?
Yes, you can definitely do that, and that’s what I assume most client frameworks are doing. The issue above arises from the “submission and revalidation” approach, especially when the properties returned by the server are out of sync with the client changes (spoiler alert: single fetch mutation is worse). Of course, you could start tracking the updates and deletes in your Remix app as well, to keep your UI consistent, but then why bother with “submission and revalidation” in the first place, if you cannot trust the properties returned by the server?
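For completeness, such client-side tracking could look like the sketch below. To be clear, this is my own illustration of the workaround, not a pattern prescribed by Remix:

```typescript
// Hedged sketch of the client-side workaround: remember which ids the user
// already deleted and filter them out of whatever the server returns, instead
// of trusting the revalidated payload as-is.
const deletedIds = new Set<string>();

export function markDeleted(id: string) {
  deletedIds.add(id);
}

export function reconcile<T extends { id: string }>(serverRows: T[]): T[] {
  // Even if a stale revalidation resurrects a row, it stays hidden in the UI.
  return serverRows.filter((row) => !deletedIds.has(row.id));
}
```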
Q: What if I disallow double submissions?
The issues described here can also happen when deleting two entries in the same table. So you would have to block all interactions within the table/component. Blocking the user from using your UI because your framework cannot deal with concurrent requests is the opposite of good UX/DX.
Q: Isn’t the submission and revalidate pattern, as described, eventually consistent?
Not quite. The pattern is eventually consistent in the sense that you will eventually have the same version as the server, but we should not expect an eventually consistent system to return data which we have previously seen as deleted.
Q: Can the scenario above actually happen?
A typical web request will pass through proxies and load balancers/gateways, then be thrown into JavaScript’s event loop and garbage collector, then go through database connection pooling and any transaction locking your database may use, and then make its way back. If a single iteration of your event loop blocks for too long, for example while decoding/encoding large JSON payloads, that’s enough to shuffle the order around. You should also consider the fallacies of distributed systems. Those provide plenty of opportunities for your requests and responses to be processed out of order.
What about single fetch/round-trip mutation?
The first time I brought up the latency issues from submission and revalidation, a common response was: you can do a single request instead!
And while I agree a single request would be preferable, it is worth pointing out that it does not solve the underlying problem. In fact, single fetch mutations will worsen stale data issues. A simple way to think about it is that, under the submission and revalidate pattern, you are guaranteed to have at least one read request after all three submissions, but this guarantee is gone under single fetch.
Let’s see some diagrams, starting with the keys:
- |: Submission begins
- U: Submission updated/deleted
- R: Data read
- ✅: Revalidated data is committed to the UI
This is how most people would expect it to behave:
submission 1: |----U--R---✅
submission 2: |----U--R---✅
submission 3: |----U--R---✅
But submission 2 could be delayed and you end up with this:
submission 1: |----U--R---✅
submission 2: |--------------U--R---✅
submission 3: |----U--R---------------✅
If you assume the last submission is correct, it will show the result of submission 3 in the UI, but the server state is actually the one from submission 2. While users may see stale data in web applications when another user changes it, a user must not see stale data that they submitted themselves, and the above is just one possible variation of what may actually happen.
Since each request is now updating/deleting the data and then reading it, you still cannot know nor guarantee which submission read the actual latest version of the data, even if you do it all inside a transaction. For example, PostgreSQL does not guarantee that a transaction T1, which was started before T2, will commit before T2. So the potential for showing stale data is even greater here.
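As an illustration, a single fetch action that mutates and reads in one go could look like the sketch below. The db client and its transaction API are assumptions; the point is that even with the read inside the transaction, the commit order across submissions remains unknown to the client:

```typescript
// Hedged sketch of a "single fetch" action: mutate and read inside one
// transaction. The db client and its transaction API are placeholders.
import { json, type ActionFunctionArgs } from "@remix-run/node";
import { db } from "~/db.server";

export async function action({ request }: ActionFunctionArgs) {
  const form = await request.formData();
  const rows = await db.transaction(async (tx) => {
    await tx.row.delete({ where: { id: String(form.get("id")) } });
    return tx.row.findMany(); // the read (R) happens inside the same transaction...
  });
  // ...but the client still cannot tell which response carries the latest commit.
  return json({ rows });
}
```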
The simplest way to address these issues is to disable concurrent requests and deal with the impact on the user experience, as shown next:
submission 1: |----U--R---✅
submission 2:               |----U--R---✅
submission 3:                             |----U--R---✅
Perhaps we could do better?
In search of solutions
It is generally not possible to know, within the transaction itself, the order in which transactions will be committed, except by making transactions serializable, which would cause a huge impact on performance. You could use something akin to PostgreSQL’s pg_current_snapshot() to tell you which transactions are currently running, and that can give you some feedback, but if the three transactions from the three submissions overlap each other, you are still stuck.
Someone may also consider using sticky sessions/server affinity to guarantee the submissions are sent to the same instance and processed in order, but you still have to deal with the event loop and hope that the transactions are started in order and end in the same order they started, which, once again, is not guaranteed unless you serialize all database transactions, drastically impacting performance.
Remix’s own documentation mentions the potential for stale data, and one of the solutions they suggest is to include timestamps in the form and compare them on the server, updating entries only if updated_at < requested_at. However, that’s not enough unless you are locking rows, which pushes complexity onto all server updates and introduces the possibility of deadlocks.
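For reference, here is one way the timestamp guard could be written, against a placeholder sql tagged-template client. Whether this is enough depends on your database’s locking behavior and on keeping the comparison and the write together:

```typescript
// Hedged sketch of the timestamp guard: skip the write if a newer one already
// landed. The `sql` tagged-template client and the posts table are placeholders.
import { sql } from "~/db.server";

export async function updateTitle(id: string, title: string, requestedAt: Date) {
  const result = await sql`
    UPDATE posts
       SET title = ${title}, updated_at = now()
     WHERE id = ${id}
       AND updated_at < ${requestedAt}
  `;
  return result.count > 0; // false: the submission was considered stale
}
```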
The simplest solution I can think of for this problem requires at least causal ordering (but I may have missed simpler models).
Solution #1: causal ordering
The idea with causal ordering is that, if I perform three submissions, #1, #2, and #3, the submission #2 should carry with itself the information that it depends on the execution of submission #1. And submission #3 depends on #2.
Assuming we are using sticky sessions, we can now route all requests to the same Node.js instance. Then, you can make it so submission #2 blocks until submission #1 is completed, using some eventing system within the JavaScript runtime, to guarantee they are processed in the correct order. On the other hand, because the server may receive submission #2 after submission #1 has been fully completed, the notification that submission #1 has completed may already have been emitted. To address this, the server would need to keep a log of all completed submissions within a time period.
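A minimal sketch of such a gate, in TypeScript, could look like this. All names are made up, and pruning of the completed log and error handling are left out:

```typescript
// Hedged sketch of the in-process causal gate described above. Each submission
// carries the id it depends on; we wait for that id (or find it in the log of
// already-completed submissions) before running the handler.
const completed = new Set<string>();                  // should be pruned after a time window
const waiters = new Map<string, Array<() => void>>(); // submissions blocked on a dependency

function awaitDependency(dependsOn: string | null): Promise<void> {
  if (dependsOn === null || completed.has(dependsOn)) return Promise.resolve();
  return new Promise<void>((resolve) => {
    const pending = waiters.get(dependsOn) ?? [];
    pending.push(() => resolve());
    waiters.set(dependsOn, pending);
  });
}

export async function handleSubmission(
  id: string,
  dependsOn: string | null,
  run: () => Promise<void>,
) {
  await awaitDependency(dependsOn); // block until the causal predecessor is done
  try {
    await run();                    // perform the update and the read, in order
  } finally {
    completed.add(id);              // record completion and wake up any successor
    for (const wake of waiters.get(id) ?? []) wake();
    waiters.delete(id);
  }
}
```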
The benefit of this approach is that the client can fire requests immediately and the server can also send concurrent responses, as long as it orders the updates and reads within:
submission 1: |----U--R------------✅
submission 2: |--------U--R--------✅
submission 3: |------------U--R----✅
Of course, the response for submission #2 may still arrive before the response for submission #1, but because the server has ordered them, it is completely safe to ignore the result of submission #1.
While I believe this would solve the problem, it comes with the complexity of ordering concurrent events by keeping history in each Node.js process and you still can only deploy it to infrastructure that supports sticky sessions.
One possible alternative to sticky sessions, suggested by Dev Agrawal, is to use database transactions and locks to maintain the causal order. Each client gets a database row with the last submission ID, and a submission may only continue if the relevant last submission ID has been committed. Locks would be used to ensure submissions from the same client are not processed concurrently by the server. This approach requires you to hold a transactional lock for the duration of each request, which may put additional pressure on your database pool and increase the likelihood of deadlocks if any locking mechanism is used within your application for actual data integrity. The overall implementation, feasibility, and costs will depend on the database of choice.
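A rough sketch of that variant, assuming a hypothetical sql client, a client_sessions table, and a caller that retries when the predecessor has not committed yet:

```typescript
// Hedged sketch of the database-backed variant: a per-client row stores the last
// committed submission id, and SELECT ... FOR UPDATE keeps submissions from the
// same client from running concurrently. All names and APIs are assumptions.
import { sql } from "~/db.server";

export async function runCausally(
  clientId: string,
  submissionId: string,
  dependsOn: string | null,
  run: () => Promise<void>,
) {
  await sql.transaction(async (tx) => {
    // Lock this client's ordering row for the duration of the request.
    const [session] = await tx`
      SELECT last_submission_id FROM client_sessions
       WHERE client_id = ${clientId}
         FOR UPDATE
    `;
    if (dependsOn !== null && session.last_submission_id !== dependsOn) {
      // The causal predecessor has not committed yet; bail out so the caller can retry.
      throw new Error("predecessor not committed");
    }
    await run(); // the mutation and the read happen while the lock is held
    await tx`
      UPDATE client_sessions
         SET last_submission_id = ${submissionId}
       WHERE client_id = ${clientId}
    `;
  });
}
```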
Solution #2: persistence all the way
Another solution, which is the one employed by Phoenix LiveView, is to keep an open connection between the client and the server, using WebSockets. This way, all events are received and can be processed in order, which guarantees the database reads and all updates will be delivered in order (but you can also easily process them concurrently when using Elixir, if you deem it safe to do so).
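To illustrate the property outside of Elixir, here is a minimal TypeScript sketch using the ws package: a per-session promise chain applies events strictly in the order the client sent them. This is the general shape of the idea, not LiveView’s implementation:

```typescript
// Minimal sketch of in-order processing over a persistent connection, using the
// Node.js "ws" package. Each connection gets its own queue (a promise chain), so
// events are applied in exactly the order the client sent them.
import { WebSocketServer, type WebSocket } from "ws";

const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", (socket) => {
  let tail: Promise<void> = Promise.resolve(); // per-session queue

  socket.on("message", (raw) => {
    const event = JSON.parse(raw.toString());
    // Chaining onto the previous promise preserves arrival order,
    // even when each handler performs async database work.
    tail = tail.then(() => handleEvent(event, socket)).catch(console.error);
  });
});

async function handleEvent(event: { type: string }, socket: WebSocket) {
  // Update the database, read the fresh state, and push the result down the socket.
  socket.send(JSON.stringify({ applied: event.type }));
}
```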
One potential caveat here is the requirement to use WebSockets. Of course, you can always fall back to long-polling… or can you?
The issue with long-polling is that you are back to issuing separate HTTP requests, which can be routed to different servers, and now we are back to solution one: you need sticky sessions and some causal ordering between the requests.
Therefore you may be wondering: how does Phoenix LiveView solve this? I am glad you asked!
When you start a long polling connection in LiveView, imagine it goes to Server #1. LiveView starts a lightweight Erlang VM process (you can literally spawn millions of those) to be responsible for that particular session and assigns a session identifier to it. Once the long polling request concludes, we include the session identifier in the response.
Now when the client does the next long polling request, it may go to Server #2, but it also includes the session identifier. Because Phoenix runs on top of the Erlang VM, it uses the Erlang Distribution to find the process on the other node, preserving the persistence property we are interested in! I actually recommend checking out the long polling implementation in Phoenix, since this is all achieved with ~450 lines of code (here and here).
Unfortunately, if you do not have a distribution channel readily accessible, there is a reasonable amount of work required to enable persistent connections with long polling.
Of course, you could always try to bring another service (paid or self-hosted), but I am drawing a line at bringing in additional complexity and services just to guarantee a framework won’t serve stale data or race updates.
We still have to talk about cancelled submissions
So far, we have explored the downsides of the “submission and revalidation” approach. It causes the user experience to lag unnecessarily and Remix, in particular, does not deliver on the promise of safeguarding most applications from race conditions.
However, it is worth noting Remix may also cancel submissions, which can become a massive problem.
First of all, you can only cancel a submission in favor of a subsequent one if they are idempotent. While we ideally want to implement endpoints as idempotent whenever possible, in my opinion, it is too high of an assumption (or requirement) for a web framework to impose by default.
Still, the biggest issue is that a cancelled request may still be received by the server, and based on everything we discussed, be processed after the subsequent submission. Remix actually recognizes this in their documentation with the following diagram:
        👇 interruption with new submission
|----❌----------------------✓
        |-------✓-----✅
                             👆
                initial request reaches the server
                after the interrupting submission
                has completed revalidation
But then they proceed to dismiss this scenario as an issue only possible with “inconsistent infrastructure”. It happens that your network and infrastructure won’t be homogeneous, and they don’t consider that, after a request is sent, it will pass through proxies and load balancers/gateways, then be thrown into JavaScript’s event loop and garbage collector, then go through database connection pooling and any transaction locking your database may use before it performs any write. So even if you are willing to accept the double round-trip of “submission and revalidation” and its race conditions (or the lack of concurrency, should they choose to disable the feature), you still have to contend with the fact that your users may see stale data immediately after a submission. That can range from minor UI nuisances to leading them to wrong decisions, such as clicking on “Buy Now” thinking a particular order had 2 line items, when the server actually stored 3 thanks to a “cancelled” submission.
While this particular problem could happen in web applications written 20 years ago, for example by double submitting a form, encouraging users to rely on concurrent requests and active cancellation may make it more frequent. I also believe we should aim to improve on the limitations of the past, rather than reaffirm them. Luckily, introducing causal ordering (or persistence) would fully address this problem too.
Overall, I hope this article shows that, if you are going to use the server state to drive the UI, concurrent submissions can be the source of pitfalls, race conditions, and inconsistencies, which can be addressed by introducing causal ordering.