SDKs with Req: S3

Welcome to the “SDKs with Req” mini-series.

In the previous article, SDKs with Req: Stripe, I presented my take on SDKs: instead of using packages with tens/hundreds/thousands of modules, let’s implement just what we need ourselves. The Stripe API is known among developers for its ease of use, so it is no surprise that rolling our own layer was straightforward. Let’s see which challenges we face when writing our own wrapper around AWS S3.

S3

Let’s say we want to add some persistence to our app: write some data somewhere and read it back later. A very popular choice for that is S3. One option is to use something like the aws package, which gives us the AWS.S3.put_object and AWS.S3.get_object functions:

Mix.install([
  {:aws, "~> 1.0"},
  {:hackney, "~> 1.16"}
])

access_key_id = System.fetch_env!("AWS_ACCESS_KEY_ID")
secret_access_key = System.fetch_env!("AWS_SECRET_ACCESS_KEY")
region = "us-east-1"
bucket = "bucket1"
key = "key1"

client = AWS.Client.create(access_key_id, secret_access_key, region)

{:ok, _, %{status_code: 200}} =
  AWS.S3.put_object(client, bucket, key, %{
    "Body" => "foo"
  })

{:ok, _, %{status_code: 200, body: "foo"}} =
  AWS.S3.get_object(client, bucket, key)

This is pretty good! The library is really well designed: there is no global configuration, so we can easily use it in multi-tenancy setups, for example. It embraces the underlying HTTP protocol: we can see the returned HTTP status, headers, etc. in the responses, which really helps debugging. It has built-in support for the hackney and finch HTTP clients, and it is easy to write your own.

On the other hand, taking a step back, we are basically using two functions (AWS.S3.put_object and AWS.S3.get_object) out of almost 100 in that module, out of almost 400 modules total in the package. That is a lot of code that we don’t use! (We could have used a much smaller dependency, ExAws.S3, though we’d still be using only a couple of functions. We will talk about that package soon!)

For this particular use case, writing to and reading back from a bucket, there is another way. Instead of calling the PutObject and GetObject API endpoints, we can simply make PUT and GET requests with the URL pointing to the bucket key:

PUT https://s3.amazonaws.com/:bucket/:key
GET https://s3.amazonaws.com/:bucket/:key

This is really convenient and the only missing piece is authenticating these requests. As you may know, S3 is so popular that many storage services use the S3 API as their own API. To name just a few, we have Cloudflare R2, DigitalOcean Spaces, Backblaze B2, and my current go-to, Tigris. Due to this S3 API ubiquity, and following in curl’s footsteps, Req ships with built-in support for authenticating requests by creating the so-called AWS signature. In Req, this is done through the put_aws_sigv4 step. Here is an example of using Tigris on Fly:

access_key_id = System.fetch_env!("AWS_ACCESS_KEY_ID")
secret_access_key = System.fetch_env!("AWS_SECRET_ACCESS_KEY")
endpoint_url = System.fetch_env!("AWS_ENDPOINT_URL_S3")
bucket = System.fetch_env!("BUCKET_NAME")
key = "key1"

req =
  Req.new(
    aws_sigv4: [
      service: :s3,
      access_key_id: access_key_id,
      secret_access_key: secret_access_key
    ],
    url: "#{endpoint_url}/#{bucket}/#{key}"
  )

%{status: 200} =
  Req.put!(req, body: "Hello, World!")

%{status: 200, body: "Hello, World!"} =
  Req.get!(req)

These AWS_* and BUCKET_NAME system environment variables are automatically set by Fly when using Tigris. Pretty easy!

(Full disclosure: Fly and Tigris are sponsoring Livebook, a Dashbit project. For what it’s worth I’m a happy user and definitely would recommend them regardless!)

Here is a full module supporting these basic features:

defmodule MyApp.S3 do
  def new(options \\ []) when is_list(options) do
    s3 = Application.fetch_env!(:my_app, :s3)
    endpoint_url = Keyword.fetch!(s3, :endpoint_url)
    bucket = Keyword.fetch!(s3, :bucket)

    Req.new(
      aws_sigv4: [service: :s3] ++ Keyword.take(s3, [:access_key_id, :secret_access_key, :region]),
      base_url: "#{endpoint_url}/#{bucket}",
      retry: :transient
    )
    |> Req.merge(Keyword.get(s3, :req_options, []) ++ options)
  end

  def request(options \\ []) do
    Req.request(new(options))
  end

  def request!(options \\ []) do
    Req.request!(new(options))
  end
end

and the updated usage:

%{status: 200} =
  MyApp.S3.request!(method: :put, url: key, body: "Hello, World!")

%{status: 200, body: "Hello, World!"} =
  MyApp.S3.request!(url: key)

So far so good! We were able to replace a full-blown SDK with a couple of functions. You could also define MyApp.S3.get_object(key, options \\ []) and MyApp.S3.put_object(key, value, options \\ []) for extra convenience.
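
For instance, a minimal sketch of such wrappers, assuming they live in MyApp.S3 and simply delegate to request!/1 above:

def get_object(key, options \\ []) do
  request!([url: key] ++ options).body
end

def put_object(key, value, options \\ []) do
  %{status: 200} = request!([method: :put, url: key, body: value] ++ options)
  :ok
end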

Listing objects and signed URLs

Let’s talk about other functionality we commonly use from S3 services.

Imagine we want to list bucket objects. Using Req, here is an example of the XML we’d get from an S3 API:

iex> Req.get!("https://#{bucket}.s3.amazonaws.com", options).body
"""
<?xml version="1.0" encoding="UTF-8"?>
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>bucket</Name>
  ...
  <Contents>
    <Key>key1</Key>
    ...
  </Contents>
  <Contents>
    <Key>key2</Key>
    ...
  </Contents>
  ....
</ListBucketResult>
"""

We could parse the XML ourselves but it is a little bit tricky. While Erlang/OTP ships with an XML library, xmerl, it is not very ergonomic to use and there are security caveats. Fortunately, alternatives exist, such as Saxy, which I’d definitely recommend checking out. At this point, we are being forced to make decisions that an existing SDK would already have made for us.
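
For instance, here is a minimal sketch of extracting the keys from the listing above using Saxy’s SimpleForm API (assuming a saxy dependency and that xml holds the response body):

{:ok, {"ListBucketResult", _attrs, contents}} = Saxy.SimpleForm.parse_string(xml)

# Collect the text content of every <Key> inside every <Contents> element,
# skipping the whitespace-only text nodes in between.
keys =
  for {"Contents", _attrs, children} <- contents,
      {"Key", _attrs, [key]} <- children,
      do: key

#=> ["key1", "key2"]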

Another example is pre-signed URLs. S3 buckets are private by default; they can’t be accessed without authentication. Therefore, if we want to allow a third party to get or set bucket objects, we need to generate pre-signed URLs. These URLs are valid for a configurable duration and are cryptographically signed, so we can share them with our customers without sharing the full credentials. Similar to parsing XML, we could write this code ourselves, but we can agree it would be more convenient to just use something off the shelf.

Does this mean “small development kits” are doomed to fail?

Introducing ReqS3

We are solving these problems in Req with plugins. The goal of a plugin is to provide the minimum set of functionality needed to augment Req for use with a particular service.

This brings us to the very first Req plugin ever created, ReqS3. It teaches Req how to handle the s3:// URL scheme, parse XML responses, generate pre-signed URLs, and more. Here’s how we would list the objects in a bucket using ReqS3:

iex> Mix.install([:req_s3])
iex> req = Req.new() |> ReqS3.attach()
iex> Req.get!(req, url: "s3://ossci-datasets").body
%{
  "ListBucketResult" => %{
    "Contents" => [
      %{
        "ETag" => "\"d41d8cd98f00b204e9800998ecf8427e\"",
        "Key" => "mnist/",
        ...
      },
      %{
        "ETag" => "\"9fb629c4189551a2d022fa330f9573f3\"",
        "Key" => "mnist/t10k-images-idx3-ubyte.gz",
        ...
      }
    ],
    "Name" => "ossci-datasets",
    ...
  }
}

iex> Req.get!(req, url: "s3://ossci-datasets/mnist/t10k-images-idx3-ubyte.gz").body
<<0, 0, 8, 3, ...>>

Or, if you want to generate a pre-signed URL, you can use ReqS3.presign_url(options) just for that.
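
For example, something along these lines (a sketch; I’m assuming credentials come from the usual environment variables, and the exact options are described in the ReqS3 docs):

url =
  ReqS3.presign_url(
    access_key_id: System.fetch_env!("AWS_ACCESS_KEY_ID"),
    secret_access_key: System.fetch_env!("AWS_SECRET_ACCESS_KEY"),
    url: "s3://bucket1/key1"
  )

# The signature and expiry are encoded in the query string, so the URL can be
# fetched without any extra authentication until it expires:
%{status: 200} = Req.get!(url)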

One of the main goals of Req is to be “batteries-included”: easy to get started with, handling most common tasks seamlessly. Continuing the analogy, Req has “replaceable batteries”: virtually all of its functionality is implemented as steps, and you can easily reuse existing steps and write new ones. Req plugins are nothing more than “battery packs”, collections of steps that augment Req for certain purposes. Unlike SDKs, however, Req plugins focus on the low-level bits, such as authentication and content handling, and aim to remain small.

Pre-signing Forms

Speaking of S3 pre-signing, besides URLs we can also pre-sign form uploads. That is, instead of users uploading data to our backend servers and our backend uploading it to S3, they upload data directly to S3. ReqS3 ships with a ReqS3.presign_form(options) function to do just that. Here’s an example: using Phoenix Playground, I created a single-file Phoenix LiveView app following the Phoenix “External Uploads: Direct to S3” guide. The full gist is at https://gist.github.com/wojtekmach/8310bf1d8725715a2801f334caa0c339; some excerpts are below, first mounting the view and pre-signing on external uploads:

@impl true
def mount(_params, _session, socket) do
  {:ok,
   socket
   |> allow_upload(
     :photo,
     accept: ~w[.png .jpeg .jpg],
     max_entries: 1,
     auto_upload: true,
     external: &presign_upload/2
   )}
end

defp presign_upload(entry, socket) do
  s3_options = s3_options(entry)
  form = ReqS3.presign_form(s3_options ++ [content_type: entry.client_type])

  meta = %{
    uploader: "S3",
    key: s3_options[:key],
    url: form.url,
    fields: Map.new(form.fields)
  }

  {:ok, meta, socket}
end

# s3_options/1, defined in the full gist, builds the bucket/key and credential options
defp presign_url(entry) do
  ReqS3.presign_url(s3_options(entry))
end

and in the template we show a preview and, when the upload is done, the (pre-signed) link to the uploaded photo:

<%= for entry <- @uploads.photo.entries do %>
<div>
  <.live_img_preview entry={entry} height="100" />
  <div><%= entry.progress %>%</div>
  <%= if entry.done? do %>
    <.link href={presign_url(entry)}>Uploaded</.link>
  <% end %>
</div>
<% end %>

By the way, on the latest Req we can replicate the form upload using the :form_multipart option:

form = ReqS3.presign_form(options)
Req.post!(form.url, form_multipart: form.fields ++ [file: body])

More on Authentication

So far, for authentication, we’ve been using the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables. Where do we get their values from? From a lot of possible places, it turns out! Here are a few that we can programmatically retrieve credentials from:

  1. If we’re using the AWS CLI and have authenticated, credentials are written to ~/.aws/credentials.
  2. If we’re using AWS EC2, we can retrieve instance metadata.
  3. If we’re using AWS ECS, we can retrieve task metadata.
  4. If we configure an “Identity Provider” and authenticate, we can use AWS STS AssumeRoleWithWebIdentity. If you’re on Fly, check out their excellent “AWS without Access Keys” blog post!

Fortunately, the aws-beam team behind the aws and aws_erlang packages created another one that encapsulates these different ways of retrieving credentials: aws_credentials. Here’s an example where we grab information about the user who is making the request:

iex> Mix.install([:req, :aws_credentials])
iex> resp = Req.post!(
...>   url: "https://iam.amazonaws.com",
...>   aws_sigv4: :aws_credentials.get_credentials(),
...>   headers: [accept: "application/json"],
...>   form: [Action: "GetUser", Version: "2010-05-08"]
...> )
iex> resp.body["GetUserResponse"]["GetUserResult"]["User"]["UserName"]
"wojtekmach"

I’d imagine the main driver for maintaining separate low-level libraries is being able to use them from both the Elixir AND Erlang SDKs, but as an added benefit, we can all use them standalone too!

More S3 Functionality

So far we’ve talked about listing, uploading, and downloading bucket objects, and pre-signing URLs and forms. This covers a fair chunk of the functionality that, in my experience, most people need most of the time, but it is nowhere near all the capabilities of the platform. Even within uploads, per the S3 guidelines, when your object size reaches 100MB you should consider using multipart uploads, which are quite a bit more complicated than a single PUT operation.
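
To give a sense of the extra complexity, here is a rough sketch of the three-call multipart flow using the MyApp.S3 module from earlier (error handling is omitted and the regex-based UploadId extraction stands in for proper XML parsing; illustration only):

defmodule MyApp.S3.Multipart do
  def upload(key, parts) do
    # 1. CreateMultipartUpload: POST /key?uploads returns an UploadId in XML.
    %{status: 200, body: body} = MyApp.S3.request!(method: :post, url: "#{key}?uploads")
    [_, upload_id] = Regex.run(~r|<UploadId>([^<]+)</UploadId>|, body)

    # 2. UploadPart: PUT each part (5MB minimum, except the last one),
    #    collecting the ETag response headers.
    etags =
      for {part, n} <- Enum.with_index(parts, 1) do
        resp =
          MyApp.S3.request!(
            method: :put,
            url: "#{key}?partNumber=#{n}&uploadId=#{upload_id}",
            body: part
          )

        %{status: 200} = resp
        [etag] = Req.Response.get_header(resp, "etag")
        {n, etag}
      end

    # 3. CompleteMultipartUpload: POST the part number/ETag pairs as XML.
    xml = [
      "<CompleteMultipartUpload>",
      Enum.map(etags, fn {n, etag} ->
        "<Part><PartNumber>#{n}</PartNumber><ETag>#{etag}</ETag></Part>"
      end),
      "</CompleteMultipartUpload>"
    ]

    %{status: 200} = MyApp.S3.request!(method: :post, url: "#{key}?uploadId=#{upload_id}", body: xml)
    :ok
  end
end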

Another example is deleting objects: while Req.delete!("https://bucket1.s3.amazonaws.com/key1", ...) will work just fine, if you want to delete multiple objects it is much more efficient to use the dedicated DeleteObjects API endpoint, sketched below. ReqS3 might add support for multipart uploads and conveniences for making XML REST API requests in the future, but there are no plans at the moment. If you need features beyond what exists in Req/ReqS3 today, my recommendation would be to look into ExAws.S3 or AWS.S3.
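
For illustration, a DeleteObjects request with the MyApp.S3 module might look roughly like this (a sketch: the XML shape and the required Content-MD5 header come from the S3 documentation; error handling is omitted):

keys = ["key1", "key2"]

xml =
  "<Delete>" <>
    Enum.map_join(keys, fn key -> "<Object><Key>#{key}</Key></Object>" end) <>
    "</Delete>"

# DeleteObjects requires a Content-MD5 header computed over the request body.
md5 = Base.encode64(:crypto.hash(:md5, xml))

%{status: 200} =
  MyApp.S3.request!(
    method: :post,
    # relative to the bucket base_url configured in MyApp.S3.new/1
    url: "?delete",
    body: xml,
    headers: [{"content-md5", md5}]
  )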

One could ask (very reasonably so!): if you’re going to end up using a “big” SDK anyway, why bother with Req/ReqS3? To that I’d repeat: start small and implement just what you need. But if you start feeling like you’re writing a lot of library code not specific to your app, by all means reach for an existing SDK. Different teams, applications, and services will have different thresholds, so it is important to weigh your options.

Conclusion

In the previous article, our custom Stripe module ended up being very straightforward because that platform’s API is very straightforward. Stripe uses simple bearer token authentication, JSON, prefixed IDs, and in general seems consistent and predictable.

S3 (and AWS in general!) is quite a bit more complex. For example, it uses a completely custom authentication scheme (though one that is basically a standard in the object storage space) and XML, which is harder to parse generically. As another example, check out this hand-written XML parser for ExAws.SQS. I don’t think people should be re-implementing that, so I am glad it exists in a package!

Start small and bring complexity as needed. Happy hacking!