How to debug Elixir/Erlang compiler performance

Recently someone opened an issue on Gettext reporting that compilation of Gettext modules got slower on Erlang/OTP 23. In this article, we will explore how I debugged this problem and the three separate pull requests I sent to the Erlang/OTP repository to improve compiler performance.

For those not familiar, the Gettext project converts .po files like this:

# pt
msgid "Hello world"
msgstr "Olá mundo"

# pl
msgid "Hello world"
msgstr "Witaj świecie"

Into a module with functions:

def translate("pt", "Hello world"), do: "Olá mundo"
def translate("pl", "Hello world"), do: "Witaj świecie"
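For intuition, here is a heavily simplified sketch of the pattern Gettext relies on (this is not Gettext's actual implementation, and the module and data below are made up): generating one function clause per translation at compile time, so lookups compile down to binary pattern matching.

```elixir
defmodule TranslationsSketch do
  # Hypothetical data; the real Gettext parses .po files to build this.
  translations = [
    {"pt", "Hello world", "Olá mundo"},
    {"pl", "Hello world", "Witaj świecie"}
  ]

  # One clause per {locale, msgid} pair, via unquote fragments.
  for {locale, msgid, msgstr} <- translations do
    def translate(unquote(locale), unquote(msgid)), do: unquote(msgstr)
  end
end
```

With many locales and messages, such modules accumulate thousands of clauses full of string literals, which is exactly the shape of code that stresses the compiler passes discussed below.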

While we start with an Elixir application, we end up doing most of the work with the Erlang compiler and tools, so most of the lessons here are applicable to the wider ecosystem. Be sure to read until the end for a welcome surprise.

Isolating the slow file

When project compilation is slow, the first step is to identify which files are slow. In Elixir v1.11, this can be done like this:

$ mix compile --force --profile time

The command above will print:

[profile] lib/ecto/query/planner.ex compiled in 1376ms (plus 596ms waiting)
[profile] lib/ecto/association.ex compiled in 904ms (plus 1168ms waiting)
[profile] lib/ecto/changeset.ex compiled in 869ms (plus 1301ms waiting)
[profile] Finished compilation cycle of 95 modules in 2579ms
[profile] Finished group pass check of 95 modules in 104ms

Files in your project are compiled in parallel. Each message follows the format:

[profile] FILE compiled in COMPILE_TIME (plus WAITING_TIME waiting)

COMPILE_TIME is the time we were effectively compiling code. However, since a file may depend on a module defined in another file, WAITING_TIME is the time we wait until the file we depend on becomes available. High waiting times are not usually a concern, so we focus on the files with high compilation times.

At the end, we print two summaries:

[profile] Finished compilation cycle of 95 modules in 2579ms
[profile] Finished group pass check of 95 modules in 104ms

The first reports the time to compile all files in parallel and how many modules were defined. The second is the time of a group pass that looks at all modules at once, in order to find undefined functions, emit deprecation warnings, and so on.

Unless the “group pass check” is the slow one - which would be a bug in the Elixir compiler - we are often looking at a single file being the root cause of slow compilation. With this file in hand, it is time to dig deeper.

Timing the slow file

Once we have identified the slow file, we need to understand why it is slow. When Elixir compiles a file, it executes code at three distinct levels. For example, let's assume the slowdown is in lib/problematic_file.ex, which looks like this:

# code outside modules runs when the file is compiled (CODE LEVEL)
defmodule ProblematicModule do
  # the module body runs when the module is defined (MODULE LEVEL)
  def function do
    # the function body runs when the function is invoked (FUNCTION LEVEL)
  end
end
When compiling the file above, Elixir will execute each level in order. If that file has multiple modules, then compilation will happen for each module in the file, first at MODULE LEVEL and then FUNCTION LEVEL.

TIP: If a file with multiple modules is slow, I suggest breaking those modules into separate files and repeating the steps in the previous section.

With this knowledge in hand, we want to compile the file once again, but now with the ERL_COMPILER_OPTIONS=time environment variable set, which makes the underlying Erlang compiler print time reports. One option is to do this:

$ mix compile
$ touch lib/problematic_file.ex
$ ERL_COMPILER_OPTIONS=time mix compile

Then, for each module being compiled (which includes the one in your mix.exs), you will see a report like this:

core             :  0.653 s   72136.4 kB
sys_core_fold    :  0.482 s   69055.3 kB
sys_core_alias   :  0.146 s   69055.3 kB
core_transforms  :  0.000 s   69055.3 kB
sys_core_bsm     :  0.098 s   69055.3 kB
v3_kernel        :  2.250 s  169439.0 kB

Most compilers work by doing multiple passes over your code. Above we can see how much time was spent on each pass and how much memory the code representation, also known as the Abstract Syntax Tree (AST), takes after each pass.
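If "AST" is an unfamiliar term, Elixir makes it easy to peek at one: quote/2 returns the tree that the Elixir compiler manipulates. (The Erlang passes above use different but analogous representations.)

```elixir
# `1 + 2` is represented as a call node: {operator, metadata, arguments}.
{:+, _meta, [1, 2]} = quote do: 1 + 2
```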

The ERL_COMPILER_OPTIONS=time mix compile command above has one issue though. If other files depend on the problematic file, they may be recompiled too, and that will add noise to your output. If that’s the case, you can also do this:

$ ERL_COMPILER_OPTIONS=time mix run lib/problematic_file.ex

This is a rather neat trick: we are re-running a file that we have just compiled. You will get warnings about modules being redefined but they are safe to ignore.

With the time reports in hand, there are two possible scenarios here:

  1. One (or several) of the passes in the report is slow. This means the slowdown happens when compiling at the FUNCTION LEVEL and it will be associated with the generation of the .beam file for ProblematicModule.

  2. All passes are fast and the slowdown happens before the reports emitted by ERL_COMPILER_OPTIONS=time are printed. If this is the case, the slowdown is actually happening at the MODULE LEVEL, before the generation of the .beam file.

Most of the time, the slowdown is at the FUNCTION LEVEL - as was the case in the Gettext issue - so that's the one we will explore. Performance issues at the MODULE LEVEL may still happen though, especially in large module bodies such as Phoenix's Router - but don't worry, those have often already been optimized throughout the years!

Moving to Erlang

At this point, we have found a module that is slow to compile. Given the original Gettext issue pointed to a performance difference between Erlang versions, my next step was to remove Elixir from the equation.

Luckily, this is very easy to do with the decompile project:

$ mix archive.install github michalmuskala/decompile
$ mix decompile ProblematicModule --to erl

This command will emit an Elixir.ProblematicModule.erl file, which is literally the compiled Elixir code, represented in Erlang. Now, let's compile it again, but without involving Elixir at all:

$ erlc +time Elixir.ProblematicModule.erl

TIP: The command above may not work out of the box, because the .erl file generated by decompile may have invalid syntax. In those cases, you can manually fix those errors. They are often small nits.

If you want to try it yourself, you can find the .erl file for the Gettext report here:

$ erlc +time Elixir.GettextCompile.Gettext.erl

Here are the relevant snippets of the report I got on my machine:

expand_records         :      0.065 s   19988.0 kB
core                   :      3.295 s  373293.3 kB
beam_ssa_bool          :      1.125 s   39252.7 kB
beam_ssa_bsm           :      2.432 s   39263.1 kB
beam_ssa_funs          :      0.119 s   39263.1 kB
beam_ssa_opt           :      6.242 s   39298.0 kB
beam_ssa_pre_codegen   :      3.426 s   48897.5 kB

Looking at the report you can start building an intuition about which passes are slow. Given we were also told the code compiled fast on Erlang/OTP 22.3, I compiled the same file with that Erlang version and compared the reports side by side. Here are my notes:

  1. The core pass got considerably slower between Erlang/OTP versions (from 1.8s to 3.2s)

  2. Going from the expand_records pass to core increases the memory usage by almost 20 times (although this behaviour was also there on Erlang/OTP 22)

  3. The beam_ssa_bool pass did not exist on Erlang/OTP 22

In Erlang/OTP 22.3, the module takes 22 seconds to compile. On version 23.1, it takes 32 seconds. We have some notes and a reasonable target of 22 seconds to optimize towards. Let’s get to work.

Note: it is worth saying that it is very natural for new passes to be added and others to be removed between Erlang/OTP versions, precisely because the compiler is getting smarter all the time! As part of this process, some passes get faster and others get slower. Such is life. :)

Pull request #1: the profiler option

The Erlang compiler also has a neat feature that allows us to profile any compiler pass. Since we have detected the slowdown in the core pass, let's profile it:

$ erlc +'{eprof, core}' Elixir.ProblematicModule.erl

It will print a report like this:

core: Running eprof

****** Process <0.111.0>    -- 100.00 % of profiled time ***
FUNCTION                   CALLS        %      TIME  [uS / CALLS]
--------                   -----  -------      ----  [----------]
gen:do_for_proc/2              1     0.00         0  [      0.00]
gen:'-call/4-fun-0-'/4         1     0.00         0  [      0.00]
v3_core:unforce/2              2     0.00         0  [      0.00]
v3_core:make_bool_switch/5     2     0.00         0  [      0.00]
v3_core:expr_map/4             1     0.00         0  [      0.00]
v3_core:safe_map/2             1     0.00         0  [      0.00]

The slowest entries come at the bottom. For this Gettext module, the slowest entry was:

cerl_trees:mapfold/4     3220377    19.14   2447684  [      0.76]

Jackpot! Almost 20% of the compilation time was spent on a single function. This is a great opportunity for optimization.

I usually like to say there are two types of performance improvements. You have semantic improvements, which you can only pull off by having a grasp of the domain. The more you understand, the more likely you are to be able to come up with an improved algorithm (or the more you will be certain you are already implementing the state of the art). There are also mechanical improvements, which are more about how the runtime and the data structures in the language work. Often you work with a mixture of both.

In this case, cerl_trees:mapfold/4 traverses all AST nodes recursively. You can also see it was called more than 3 million times. The caller of this function in the core pass has the following goal:

Lower a receive to more primitive operations. Rewrite patterns that use and bind the same variable as nested cases.

To be honest, I don’t quite understand the work being done by the linked code but I checked the module being compiled and I learned that:

  1. The module does not use receive
  2. The module does not have patterns that use and bind the same variable

In other words, the pass is looking for a construct that does not happen anywhere in the compiled code. Therefore, can we avoid doing the work if we know we don’t have to do it?

That’s when I realized there are many constructs that can never have a case or a receive in them! For example, a list with integer elements, such as [1, 2, 3], will never have a case/receive inside. More importantly, a string, such as “123”, won’t either. Those are known as literals. As we have seen, the Gettext module is full of literals, such as strings, and perhaps traversing them looking for these constructs is part of the issue. What if we tell cerl_trees:mapfold/4 to stop traversing whenever it finds a literal?

This is exactly what my first pull request does. By skipping literals and profiling again, I got these results:

cerl_trees:mapfold/4     2002931    11.14   1647204  [      0.72]

This brought this particular pass from 3.2s to 2.4s! Skipping literals indeed yields a solid improvement but still not quite as fast as Erlang/OTP 22.3, which took only 1.8s.
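To make the idea concrete, here is a toy Elixir sketch of the technique (illustrative only - the real patch is in Erlang's cerl_trees module, and the mini-AST below is made up): a mapfold that returns literal subtrees untouched instead of walking into them.

```elixir
defmodule ToyCore do
  # Toy AST: nodes are {tag, meta, children}; {:literal, term} wraps literals.
  def literal?({:literal, _term}), do: true
  def literal?(_node), do: false

  def mapfold(fun, acc, node) do
    if literal?(node) do
      # The optimization: a literal can never contain a case/receive,
      # so we skip its entire subtree without calling `fun` on its contents.
      {node, acc}
    else
      {tag, meta, children} = node
      {children, acc} = Enum.map_reduce(children, acc, &mapfold(fun, &2, &1))
      fun.({tag, meta, children}, acc)
    end
  end
end
```

Folding a counter such as `fn node, n -> {node, n + 1} end` over `{:call, [], [{:literal, [1, 2, 3]}, {:var, [], []}]}` visits only two nodes, because the literal's contents are never traversed - the same saving the patch brings to modules full of string literals.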

Luckily, we can go even deeper! We know we can’t have a case/receive inside a literal. But are there any other constructs that can’t have a case/receive in them? The answer is yes! The core pass performs variable hoisting out of expressions, which means that code like this:

[x, case y do
  true -> foo()
  false -> bar()
end]

is rewritten to:

_compilervar =
  case y do
    true -> foo()
    false -> bar()
  end

[x, _compilervar]

Thanks to hoisting, many more constructs are guaranteed not to have a receive or case in them, expanding the set of nodes we no longer have to traverse. I updated the pull request accordingly and overall was able to improve two distinct passes by 25% and 33%. Those gains are not huge, but I will take them!

Pull request #2: the memory jump

The first patch took most of a day. While debugging and working on it, I jumped around the source code and learned a lot. At some point, my brain started nagging me about the second note: the AST becomes considerably larger at the end of the core pass. That’s when I wondered: what if cerl_trees:mapfold/4 is running millions of times because the AST is too large? And more importantly, why is the AST so large?

While investigating the core pass, I noticed that strings such as "Hello" in patterns would come in roughly as:

{bin, Metadata0, [
  {bitstr, Metadata1, {string, Metadata2, "Hello"}, Size, Type}
]}

and come out as:

{bin, Metadata0, [
  {bitstr, Metadata1, {char, Metadata2, $H}, Size, Type},
  {bitstr, Metadata1, {char, Metadata2, $e}, Size, Type},
  {bitstr, Metadata1, {char, Metadata2, $l}, Size, Type},
  {bitstr, Metadata1, {char, Metadata2, $l}, Size, Type},
  {bitstr, Metadata1, {char, Metadata2, $o}, Size, Type}
]}

Strings are a higher-level representation and we want to convert them to lower-level ones in order to run compiler optimizations later on. However, the new representation consumes much more memory. Given the Gettext module matches on a bunch of strings, this explains such a huge growth in memory usage.

Luckily, the core pass already had an optimization for this scenario, which converts strings to large integers, so that the “Hello” string actually comes out as:

{bin, Metadata0, [
  {bitstr, Metadata1, {integer, Metadata2, 310939249775}, 40, integer}
]}
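As a sanity check of that number, we can pack the bytes of "Hello" into a single 40-bit integer ourselves with a bitstring match:

```elixir
# Each byte contributes 8 bits of one big integer: 5 bytes -> 40 bits.
<<packed::integer-size(40)>> = "Hello"
packed
```

This evaluates to 310939249775, matching the integer the compiler emits above.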

However, this optimization was only applied to strings outside of patterns. We could try to apply this optimization in more cases but we need to be careful with not making compiled pattern matching slower at runtime. Fortunately, about a year ago, I sent a pull request to the compiler that made it apply string matching optimizations more consistently. This means we can collapse strings into large integers now without affecting the result of later compiler passes!

This led to the second pull request. Before the patch:

expand_records   :  0.077 s   19988.7 kB
core             :  3.295 s  373293.3 kB
sys_core_fold    :  0.868 s  370212.9 kB
sys_core_alias   :  0.237 s  370212.9 kB
core_transforms  :  0.000 s  370212.9 kB
sys_core_bsm     :  0.677 s  370212.9 kB
v3_kernel        :  2.662 s  169439.0 kB

After this patch:

expand_records   :  0.077 s   19988.7 kB
core             :  0.653 s   72136.4 kB
sys_core_fold    :  0.482 s   69055.3 kB
sys_core_alias   :  0.146 s   69055.3 kB
core_transforms  :  0.000 s   69055.3 kB
sys_core_bsm     :  0.098 s   69055.3 kB
v3_kernel        :  2.250 s  169439.0 kB

Not only did this make the core pass 75% faster, it made all of the other passes faster too, all the way up to the v3_kernel pass, which changes the AST representation. Also notice the size of the AST is the same after v3_kernel, which supports our theory that we are not ultimately changing the end result. Overall, memory usage was reduced by 75% across the core passes.

Ironically, if I had started with this patch, I probably wouldn’t have worked on the first pull request, because cerl_trees:mapfold/4 most likely wouldn’t have shown up as a bottleneck.

These results were very exciting but there is still one last note to explore.

Pull request #3: the mechanical fix

To finish the day, I also profiled beam_ssa_bool, beam_ssa_pre_codegen, and friends. Curiously, in almost all of them, the slowest call was related to a recursive function named beam_ssa:rpo/1, which would be invoked hundreds of thousands of times:

beam_ssa:rpo_1/4   802391    10.72   375843  [      0.47]

While exploring the code, I learned that we could often skip these calls by precomputing the rpo value and explicitly passing it in as an argument. Take this code:

  map_size(DefVars) > 1 ->
    Dom = beam_ssa:dominators(Blocks1),
    Uses = beam_ssa:uses(Blocks1),
    St0 = #st{defs=DefVars,count=Count1,dom=Dom,uses=Uses},
    {Blocks2,St} = bool_opt(Blocks1, St0),

Each of beam_ssa:dominators/1, beam_ssa:uses/1, and bool_opt/2 calls beam_ssa:rpo/1 with the same argument. Therefore, if I rewrite the code and change the supporting APIs to this:

  map_size(DefVars) > 1 ->
    RPO = beam_ssa:rpo(Blocks1),
    Dom = beam_ssa:dominators(RPO, Blocks1),
    Uses = beam_ssa:uses(RPO, Blocks1),
    St0 = #st{defs=DefVars,count=Count1,dom=Dom,uses=Uses},
    {Blocks2,St} = bool_opt(RPO, Blocks1, St0),

The profiler now reports much better numbers for beam_ssa:rpo/1, with the number of calls cut almost in half:

beam_ssa:rpo_1/4   481526     6.35   203949  [      0.42]
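Stripped of all compiler specifics, the pattern is simply "derive once, pass it down". Here is a minimal Elixir sketch with made-up names (derive/1 stands in for an expensive derivation such as beam_ssa:rpo/1):

```elixir
defmodule PrecomputeSketch do
  # Stand-in for an expensive derivation like beam_ssa:rpo/1.
  def derive(blocks), do: Enum.sort(blocks)

  # After the refactor, helpers take the precomputed value as an argument
  # instead of each rederiving it from `blocks` internally.
  def dominators(order, _blocks), do: List.first(order)
  def uses(order, _blocks), do: List.last(order)

  def analyze(blocks) do
    order = derive(blocks)  # computed exactly once
    {dominators(order, blocks), uses(order, blocks)}
  end
end
```

Since the functions are pure, hoisting derive/1 to the single caller cannot change any result - only how often the derivation runs.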

To me, this is a mechanical change because I literally had no idea what rpo meant while writing the patch - and I still don’t! I assume it is something that is generally cheap to compute, but given our problematic module is almost 100k lines of code, it exercises code paths that 99% of the code out there doesn’t.

The other interesting aspect is that this type of mechanical refactoring is extremely easy to perform in functional languages precisely because they are immutable and tend to isolate side-effects. I can move function calls around because I know they are not changing something else under my feet.

I asked the Erlang Compiler Team if they were interested in making the calls to beam_ssa:rpo/1 upfront, as in the code snippet above, to which they kindly agreed. This led to the third and last pull request.

Putting it all together

At this point, you may be wondering: did I reach my target? Did I make it faster? To our general excitement, the target was reached before I even started! It turns out Erlang/OTP had already landed JIT support on master, and the JIT (most likely alongside other optimizations) made Erlang master compile the module faster than Erlang 22.3, beating it by 1 second, down to 21s.

Putting the JIT and all of the pull requests above together, the problematic module compiles in 18s, shaving off an extra 3 seconds and reducing the memory usage spike by more than half! The performance benefits yielded by the JIT are generally applicable, while the changes in these pull requests will mostly benefit modules with many strings inside patterns (such as the Phoenix Router, Gettext, Elixir’s Unicode module, etc).

In case you are a Gettext user and you don’t want to wait until the next Erlang version comes out to benefit from faster compilation times, I have also pushed improvements to the Gettext library that break these problematic modules into a bunch of smaller ones, by partitioning them by locale and domain. Those improvements are in master and we would welcome it if you gave them a try and provided feedback as we prepare for a Hex release.

I am a person who absolutely loves doing optimization work and I have to say the 36 hours that encompassed debugging these issues, up to writing this article, have been extremely fun! I hope you have learned a couple of things too.