Porting Software to ARM: Here’s What We Learned

AWS’s release of its Graviton instance in 2019 has sparked a lot of interest in server-side ARM. And for good reason: moving to Graviton can save up to one-third of compute costs without sacrificing performance. With demand high and infrastructure budgets constrained, this is a big deal!

There has been great interest in using ARM in the data center for a long time, but adoption has been slow: it was too hard to source ARM-based servers and port the entire software stack to ARM. With the introduction of Graviton, it’s now as easy to spin up an ARM instance as an x86 instance. But what about porting your software?

In this post, we’ll go through our 10 lessons learned when porting our software to ARM. We’ll help you scope, avoid pitfalls, and plan the project. 

10 Lessons Learned When Porting to ARM


1 - Don’t start by setting up any build infrastructure - you only need one Graviton VM to start!

At the beginning of a port, when you are making lots of small config changes, it’s easier to work interactively. We started a Graviton instance, pulled down our source code, and started running container builds. Each time, the build would fail on a different step. We’d tweak the code until it moved to the next step. We did this until we had a container build running on ARM. This way, we only waited seconds between builds rather than 30 minutes for our full build and test pipeline to execute. After we had things working, we adjusted our automated build and then began building the ARM container target with our CI system.

2 - Container base images already compiled for ARM really simplify the port. 

For example, the official Ubuntu image has already been ported to ARM. This means that Canonical handled porting the OS and most of the packages to ARM for you.

3 - Make sure your test suite isn’t flaky; otherwise, it’s hard to verify the recompile.

A flaky test doesn’t only fail when the code is truly broken; it sometimes fails even though the code is fine. If you have flaky tests, the recompile can be particularly fraught. How will you tell the difference between a subtle error introduced by the change of architecture and a flaky test? Therefore, make sure you minimize test flakiness before you begin the recompilation stage of the port. That way, you’ll have confidence that any test failures are due to the architecture/config change rather than to flakiness in the tests themselves.
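One way to find flaky tests before the port is to rerun each test many times on the architecture you already trust and see whether the outcome ever changes. The following is only a minimal sketch of that idea, not the tooling we used; the `classify_test` helper and the default of 20 runs are assumptions chosen for illustration.

```python
from collections import Counter
from typing import Callable


def classify_test(test: Callable[[], None], runs: int = 20) -> str:
    """Run `test` repeatedly and report 'stable-pass', 'stable-fail', or 'flaky'.

    A test "fails" here if it raises any exception (including AssertionError).
    """
    outcomes = Counter()
    for _ in range(runs):
        try:
            test()
            outcomes["pass"] += 1
        except Exception:
            outcomes["fail"] += 1

    if outcomes["fail"] == 0:
        return "stable-pass"
    if outcomes["pass"] == 0:
        return "stable-fail"
    # Mixed results on unchanged code: the test itself is unreliable.
    return "flaky"
```

Anything reported as flaky on x86 is worth fixing or quarantining first, so that a failure on ARM points at the port rather than at the test.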

4 - Interpreted languages work well, so long as you don’t use native extensions.

The JVM, Erlang (BEAM), and Python virtual machines allow for platform-independent execution of code in Java, Erlang, or Python respectively. But if you extend any of these languages with native code (e.g., JNI, NIFs, or Python C extensions), you’ll need to recompile these portions and retest them on ARM. Even if you didn’t directly extend these languages, a library you depend on might have.
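For Python, a quick way to see which dependencies carry native code is to look for compiled shared libraries among the files each installed package ships. This is a rough sketch using the standard library's `importlib.metadata`; the helper names and the list of suffixes are assumptions, and the scan only sees packages whose install records list their files.

```python
import platform
from importlib import metadata

# File suffixes that usually indicate compiled code (an illustrative list).
NATIVE_SUFFIXES = (".so", ".pyd", ".dylib", ".dll")


def is_native_file(filename: str) -> bool:
    """True if the filename looks like a compiled shared library."""
    name = filename.lower()
    # Versioned shared objects such as libfoo.so.1 also count.
    return name.endswith(NATIVE_SUFFIXES) or ".so." in name


def packages_with_native_code() -> dict[str, list[str]]:
    """Map each installed distribution to the native files it ships."""
    found: dict[str, list[str]] = {}
    for dist in metadata.distributions():
        files = dist.files or []  # `files` is None when no file record exists
        native = [str(f) for f in files if is_native_file(str(f))]
        if native:
            name = dist.metadata["Name"] or "unknown"
            found[name] = native
    return found


if __name__ == "__main__":
    print(f"Host architecture: {platform.machine()}")
    for name, files in sorted(packages_with_native_code().items()):
        print(f"{name}: {len(files)} native file(s), e.g. {files[0]}")
```

Every package this prints is one that needs an ARM build, either from the publisher or from your own pipeline.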

5 - Service frameworks may require the regeneration of clients, servers, and objects.

For example, with gRPC, for some languages such as C++, native code is generated for the Protobufs. As a result, you need to regenerate and recompile these. 

6 - Packages you depend upon might not have been compiled yet. 

We depended on some particularly new packages in the ML/AI space. While they compiled for ARM, they didn’t yet distribute ARM binaries. As a result, we couldn’t depend on the published binaries and instead needed to run our own build processes for these packages.
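For Python packages, you can often tell whether a published binary covers ARM from the wheel filename alone, because the last dash-separated field is the platform tag. Below is a small sketch of that check; the function names and the `examplepkg` filenames are hypothetical, and it only considers Linux on 64-bit ARM.

```python
def wheel_platform_tags(filename: str) -> set[str]:
    """Return the platform tags encoded in a wheel filename.

    Wheel filenames follow name-version[-build]-python-abi-platform.whl,
    so the platform field is always the last dash-separated component.
    """
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename}")
    stem = filename[: -len(".whl")]
    platform_field = stem.rsplit("-", 1)[1]
    # Several platform tags can be joined with dots in a single filename.
    return set(platform_field.split("."))


def runs_on_arm64_linux(filename: str) -> bool:
    """True for pure-Python wheels ('any') or wheels built for aarch64 Linux."""
    tags = wheel_platform_tags(filename)
    return "any" in tags or any(tag.endswith("_aarch64") for tag in tags)


if __name__ == "__main__":
    candidates = [
        "examplepkg-1.0.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
        "examplepkg-1.0.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl",
        "examplepkg-1.0.0-py3-none-any.whl",
    ]
    for wheel in candidates:
        print(wheel, "->", runs_on_arm64_linux(wheel))
```

If none of a package's published wheels pass this check, plan on building it from source as part of your own pipeline.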

7 - Dynamically-generated architecture-specific code requires a rewrite.

We have a compiler at the core of our metric query engine. While we get a performance boost from executing machine code, we needed to update our code to generate ARM binaries when running on our backend x86 servers. In other words, we needed to support cross-compilation: compiling on x86 for an ARM target. This required changes at the code level: the compiler had to become aware of every target we need to support and generate code for each of them.
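To make "target-aware" concrete, here is a deliberately simplified sketch of the shape such code can take: produce one artifact per target, then pick the one matching the machine that will run it. This is not our compiler; `compile_for_targets` is a hypothetical stand-in that returns placeholder bytes rather than machine code, and all names here are assumptions.

```python
import platform

# Common spellings of the two architectures as reported by different OSes.
_ALIASES = {
    "x86_64": "x86_64",
    "amd64": "x86_64",
    "aarch64": "arm64",
    "arm64": "arm64",
}


def normalize_arch(machine: str) -> str:
    """Map a platform.machine()-style string onto a canonical target name."""
    try:
        return _ALIASES[machine.lower()]
    except KeyError:
        raise ValueError(f"unsupported architecture: {machine}") from None


def compile_for_targets(query: str, targets: list[str]) -> dict[str, bytes]:
    """Hypothetical code generator: returns one artifact per requested target."""
    # Placeholder bytes only; a real backend would emit machine code here.
    return {target: f"{target}:{query}".encode() for target in targets}


def select_artifact(artifacts: dict[str, bytes], machine: str) -> bytes:
    """Pick the artifact for the executing machine, failing loudly if absent."""
    arch = normalize_arch(machine)
    if arch not in artifacts:
        raise LookupError(f"no artifact compiled for {arch}")
    return artifacts[arch]


if __name__ == "__main__":
    artifacts = compile_for_targets("sum(x)", ["x86_64", "arm64"])
    print(select_artifact(artifacts, platform.machine()))
```

Raising an error when no matching artifact exists, instead of silently falling back, makes a missing or unloaded target show up immediately in testing.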

8 - Be prepared for the proliferation of build targets.

Your build targets grow with the product of all the possible combinations. For instance:

[number of operating systems] times [number of packaging formats] times [number of architectures]

For example, we package for (RHEL, CentOS, Amazon Linux, and Ubuntu) times (Docker or VM) times (x86 or ARM). This doubling of build targets caused us to scale out our CI system to handle the increased builds. Increasing the scale and complexity of the CI system requires not only adjusting config but also preparing for more incidents - like any other system, the more complex and scaled out it is, the more support it requires.
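The arithmetic is easy to check by enumerating the combinations. This short sketch uses the operating systems, packaging formats, and architectures listed above; the target naming scheme is an assumption made for illustration.

```python
from itertools import product

operating_systems = ["rhel", "centos", "amazon-linux", "ubuntu"]
packagings = ["docker", "vm"]
architectures = ["x86_64", "arm64"]


def build_matrix(oses, packagings, archs):
    """Return one build-target name per (OS, packaging, architecture) combination."""
    return [f"{o}-{p}-{a}" for o, p, a in product(oses, packagings, archs)]


targets = build_matrix(operating_systems, packagings, architectures)
x86_only = build_matrix(operating_systems, packagings, ["x86_64"])

print(len(x86_only))  # 8 targets before the port
print(len(targets))   # 16 targets once ARM is added
```

Adding one more value to any dimension multiplies the total rather than adding to it, which is why the matrix grows faster than people expect.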

A tool that really helped us with creating multiple packages for different architectures was Docker Buildx. We use Docker Buildx in our CI so we can build for ARM and x86 on the same machine. This decreases runtime and the number of resources needed in our CI runs. We currently use x86 machines for our CI. However, Docker Buildx opens up the possibility of moving our entire CI to ARM machines, which would once again cut costs in our overall infrastructure.
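As an illustration of what a multi-architecture Buildx invocation looks like when driven from a CI script, here is a small sketch that assembles the command line. The image name and helper function are hypothetical and this is not our actual CI configuration; `--platform`, `--tag`, and `--push` are standard `docker buildx build` flags. Building for a non-native platform typically also requires emulation or cross-compilation support on the build machine.

```python
import shlex


def buildx_command(image: str, platforms: list[str],
                   context: str = ".", push: bool = False) -> list[str]:
    """Assemble a `docker buildx build` command for several platforms at once."""
    cmd = [
        "docker", "buildx", "build",
        "--platform", ",".join(platforms),  # e.g. linux/amd64,linux/arm64
        "--tag", image,
    ]
    if push:
        # Push the resulting multi-platform image to the registry.
        cmd.append("--push")
    cmd.append(context)
    return cmd


if __name__ == "__main__":
    # Hypothetical image name, for illustration only.
    cmd = buildx_command("registry.example.com/agent:dev",
                         ["linux/amd64", "linux/arm64"])
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Keeping the platform list as data means adding or dropping an architecture is a one-line change in the pipeline.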

9 - The most complex testing is the mixed mode.

Our agent runs on x86 and ARM. This means we needed to test all-ARM clusters along with mixed x86 and ARM clusters. We also had to run regression tests for our original x86 agent. This did uncover some bugs in our cross compiler: we realized the cross-compiled executables weren’t always getting loaded.

10 - You probably can’t port all of your software at the same time.

It took 4 weeks to rebuild and then test our agent. We didn’t port every container that our backend runs to ARM at the time. We'll do our backend later when we want to gain those cost reductions within our own platform. This actually made testing easier, because it meant that we could hold most of the system constant, allowing us to isolate issues to the port of the agent. 

Conclusion 

If I can leave you with one bonus lesson, it would be that porting to ARM requires some thought and planning, but the payoff is well worth it. Get excited and have patience! We hope this post helps you scope, avoid pitfalls, and more easily port your software to ARM.
