What's new in Nx - March/2021

Three weeks ago we made Nx publicly available, and since then we have already grown to 25 contributors!

In this article I will describe the new features we have added during this period and what is next for us.

Ahead-of-time (AOT) compilation

Nx finally supports AOT compilation. Nx has a pluggable compiler architecture, and one of our compilers is the EXLA compiler, which provides bindings to Google’s XLA. While the XLA compiler emits efficient code that can run on the CPU or the GPU, compilation itself takes a long time and has some pitfalls, which can bring complications around deployment.

The simplest way to perform ahead-of-time compilation is by calling Nx.Defn.aot/4 (exposed by EXLA as EXLA.aot, which we use below). For instance, we changed our "MNIST neural network from scratch" example to AOT-compile the neural network with its trained parameters:

# Define the arguments to the AOT-compiled neural network
args = [Nx.template({30, 784}, {:f, 32})]

# Now define a MNIST.Trained module with the trained params embedded
EXLA.aot(
  MNIST.Trained,
  [{:predict, &MNIST.predict(trained_params, &1), args}]
)

# Run the AOT-compiled neural network by giving it a batch of images
IO.inspect(MNIST.Trained.predict(hd(train_images)))

However, the full power of AOT comes with the Nx.Defn.export_aot/5 function (also exposed as EXLA.export_aot/4), which allows us to export the AOT definition to a directory. Once exported, the AOT definition can be fully imported in a separate environment, without requiring EXLA (or XLA) to be installed! It goes like this:

  1. Add {:exla, only: :export_aot} as a dependency, so EXLA is only fetched and compiled in a dedicated :export_aot Mix environment

  2. Define a script that trains the neural network and exports the AOT definition at script/train_and_export.exs (see the sketch after this list)

  3. Train and export the AOT definition with MIX_ENV=export_aot mix run script/train_and_export.exs

  4. Now, inside lib/trained.ex, you can import it:

    if File.exists?("priv/Trained.nx.aot") do
      defmodule Trained do
        Nx.Defn.import_aot("priv", __MODULE__)
      end
    else
      IO.warn("Skipping Trained because the AOT definition was not found")
    end
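
Putting steps 2 and 3 together, the export script could look like the sketch below. The EXLA.export_aot argument order and the MNIST.train/0 entry point are assumptions for illustration; check the EXLA docs for the actual signature:

# script/train_and_export.exs
# MNIST.train/0 is a hypothetical function returning the trained params
trained_params = MNIST.train()

# The same template arguments used for EXLA.aot above
args = [Nx.template({30, 784}, {:f, 32})]

# Export the AOT definition to priv/, producing priv/Trained.nx.aot
EXLA.export_aot("priv", Trained, [
  {:predict, &MNIST.predict(trained_params, &1), args}
])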

I believe this can be particularly useful for embedded devices and the Nerves community, as you should be able to cross-compile a neural network and then deploy it to the device. The ahead-of-time compilation happens in two steps:

  1. First we compile the numerical definition to a shared object file with headers. This step already supports cross-compilation via the :target_triple option.

  2. We then compile the shared object, the TensorFlow headers, and the Erlang VM NIF headers into a NIF, using Bazel, the build tool used by TensorFlow.

The second step does not support cross-compilation out of the box, as it requires customizing each toolchain within Bazel. Luckily, the export_aot function allows developers to specify Bazel flags and environment variables for this purpose.
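
For instance, cross-compiling the first step for a 64-bit ARM target could look like the sketch below. The :target_triple option is named above, but this particular triple and the call shape are illustrative assumptions:

# Cross-compile the shared object for a 64-bit ARM device;
# trained_params and args are the same as in the export script above
EXLA.export_aot("priv", Trained, [
  {:predict, &MNIST.predict(trained_params, &1), args}
], target_triple: "aarch64-linux-gnu")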

Bindings for PyTorch

There are two pluggable mechanisms in Nx: backends and compilers. Backends are eager: once you call Nx.add/2 with two tensors, the operation is immediately dispatched to the tensors’ backend. Compilers, on the other hand, are lazy and work on numerical definitions, which receive the whole tensor expression with all of its operations and compile it to highly specialized code.
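
As a minimal sketch of the difference (the module and function names here are illustrative):

# Eager: dispatched immediately to whatever backend the tensors use
# (the pure Elixir backend by default)
Nx.add(Nx.tensor([1, 2]), Nx.tensor([3, 4]))

defmodule MyDefn do
  import Nx.Defn

  # Lazy: the whole expression below is handed to a compiler, such as
  # EXLA, which compiles it to specialized code in one pass
  defn softplus(x), do: Nx.log(Nx.add(Nx.exp(x), 1))
end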

When we first announced Nx, we shipped with a backend implemented in pure Elixir and with the EXLA compiler. So there was a gap: while EXLA gave us an extremely efficient lazy mode, our default backend was too slow for practical eager use cases.

Thanks to the amazing work led by Stas Versilov, of Matrex fame, we now have early bindings for PyTorch, or, to be more precise, for LibTorch. Our LibTorch bindings are named Torchx, and they come with an Nx backend.

We have also added Nx.default_backend/2 to make it easier to swap the default backend, ensuring all of your operations are handled by the Torchx backend. It is still early days, and many operations remain to be implemented, but you can already check the examples directory inside Torchx to see it in action (including another MNIST neural network from scratch, this time with Torchx).
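
Swapping the default backend could look like the sketch below. The one-argument call mirrors how current Nx exposes it; the exact arity and options may differ from the version described here:

# Dispatch all eager operations to LibTorch via the Torchx backend
Nx.default_backend(Torchx.Backend)

a = Nx.tensor([[1.0, 2.0], [3.0, 4.0]])
Nx.multiply(a, a)  # executed by LibTorch instead of pure Elixir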

Linear algebra decompositions and autograd

Thanks to the excellent contributions of Paulo Valente, we now support a handful of decompositions, such as QR, LU, and SVD, as well as the L0, L1, and L2 norms.
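
As a quick illustration, here is a sketch of those operations, assuming the Nx.LinAlg module layout used by current Nx versions (they may be exposed elsewhere in the version described here):

t = Nx.tensor([[1.0, 2.0], [3.0, 4.0]])

{q, r} = Nx.LinAlg.qr(t)       # QR decomposition
{p, l, u} = Nx.LinAlg.lu(t)    # LU decomposition
{u, s, vt} = Nx.LinAlg.svd(t)  # singular value decomposition

Nx.LinAlg.norm(t, ord: 1)      # L1 norm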

Thanks to contributors, we have also considerably shrunk our list of pending gradients, with only a handful more to go.
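
To see the gradients in action, here is a minimal sketch using grad/2 inside a numerical definition (the module and function names are illustrative):

defmodule Autograd do
  import Nx.Defn

  defn f(x), do: Nx.sum(Nx.multiply(x, x))

  # Compute df/dx at the given point
  defn df(x), do: grad(x, &f/1)
end

Autograd.df(Nx.tensor([1.0, 2.0, 3.0]))
# => 2 * x, i.e. [2.0, 4.0, 6.0]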

The Machine Learning Working Group

On a related note, we now have a Machine Learning Working Group in the Erlang Ecosystem Foundation, which Sean Moriarity and I chair. Nx is one of the projects discussed within the Working Group, and we hope more projects and active members from other BEAM languages will join us in our efforts!

What’s new for the next what’s new in Nx

With AOT out of the way, Sean and I have turned our focus to distributed numerical definitions, which will enable, among other things, distributed training. It is important to highlight that this is orthogonal to Erlang’s own distribution, as the goal is to transfer data directly between CPU/GPU/TPU devices. For this purpose, Sean is exploring TPU support while I am working on streaming, which will allow us to stream data to multiple devices at once.

We would love for you to join us in our efforts and help us push Nx forward!