Atlas Protocol

Authors: Phineas Walton

This service is internal to how Leap functions. It likely isn't helpful to developers wanting to write Leap implementations; it's targeted at our existing engineers, to help them understand how our internal services work.

Outline

Leap, specifically Leap Edge, is written in Elixir. Elixir, or more specifically OTP, provides useful behaviors and functions for creating and managing an actor-based model across a cluster of nodes. Individual actors are called "processes", and each process can implement a behavior called GenServer to manage its own state.
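For example, a session process could be a small GenServer that keeps its own state in memory. The module below is a minimal sketch; the module and message names are illustrative, not Leap's actual code.

defmodule Leap.Session do
  use GenServer

  # Start a session process holding its own in-memory state.
  def start_link(session_id), do: GenServer.start_link(__MODULE__, session_id)

  @impl true
  def init(session_id), do: {:ok, %{id: session_id, messages: []}}

  # Each session manages its own state; messages cast to it are kept locally.
  @impl true
  def handle_cast({:push, message}, state) do
    {:noreply, %{state | messages: [message | state.messages]}}
  end
end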

Processes in Elixir are represented by their PID (process ID), denoted by three integers, which remain valid when passed between Elixir nodes. As you can imagine, developers want to give their own internal names and IDs to processes so they can refer to them using identifiers they already have from external input, like an API or a database.
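Within a single node, Elixir's built-in Registry already gives us this kind of name-based lookup. The sketch below (registry name and key are hypothetical) shows what we want, but it only works on the local node, which is exactly the limitation discussed next.

# Node-local naming: other nodes in the cluster cannot resolve these keys.
{:ok, _} = Registry.start_link(keys: :unique, name: Leap.LocalRegistry)

# Register the calling process under a domain ID we got from the API/database.
{:ok, _} = Registry.register(Leap.LocalRegistry, "project_0:channel_0", nil)

# Look the PID back up by that ID, on this node only.
[{pid, _value}] = Registry.lookup(Leap.LocalRegistry, "project_0:channel_0")
pid == self() #=> true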

The Elixir & Erlang community has created many different solutions to this problem, including some in the Erlang/OTP standard library; however, they all come with different drawbacks and tradeoffs. Setting aside local process managers, which are trivial to implement, inter-node process management solutions include:

All of these solutions embrace an eventually consistent model, where some nodes in the network may have a different view of where processes live. That model does not work for our use case: every node needs to know where a resource lives, no matter what; otherwise we could lose messages or route them to the wrong session process.

One way to solve this would be to implement a cluster-wide lock or semaphore for resource synchronization; however, this would add latency and introduce a lot of unnecessary gossip between nodes. What we need is an atomic, low-latency process registry shared across all nodes.

CAP Theorem

The CAP theorem states that, in the event of a network partition, a system can favor either availability or consistency, but not both. In our case, we need both: availability, so that any node can look up a resource and tell whether the response might be inconsistent, and consistency, so that we know exactly where a resource lives.

The Solution

The solution to this problem requires two new concepts to be introduced into the system:

RRPs (resource reference pointers)

Hop operates multiple availability regions around the world, and within them, many different physical machines, services, and, most importantly, the Elixir clusters that run Leap.

We need an easy way to reference, internally, where a given resource ID lives at any point in time. Since our system is made up of interconnected Elixir nodes, we don't have to worry about the other layers (physical machines, overlay networks) and just need to know three things: the region, the resource type, and the node reference that the process lives on. An RRP pointing to a channel resource running in the US East region could look like this:

rrp:us-east-1/hop@node-1/channels/project_0:channel_0

which takes the format of:

rrp:<region>/<node>/<resource_type>/<resource_id>
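Building and parsing these strings is straightforward. Here is a minimal sketch; the Leap.RRP module name and the returned map shape are assumptions for illustration, not the actual implementation.

defmodule Leap.RRP do
  # Compose an RRP string from its four components.
  def build(region, node, resource_type, resource_id) do
    "rrp:#{region}/#{node}/#{resource_type}/#{resource_id}"
  end

  # Split an RRP string back into its components.
  def parse("rrp:" <> rest) do
    [region, node, resource_type, resource_id] = String.split(rest, "/", parts: 4)
    %{region: region, node: node, resource_type: resource_type, resource_id: resource_id}
  end
end

Leap.RRP.build("us-east-1", "hop@node-1", "channels", "project_0:channel_0")
#=> "rrp:us-east-1/hop@node-1/channels/project_0:channel_0"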

Redis as a Process Node Registry

By using Redis (KeyDB in our case), every node in the network, globally, can rely on the extremely fast and safe lookups it provides. It acts as an SSoT (single source of truth), even during network partitions, and we also use it to resolve conflicts by running sanity checks against RRPs.
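As a rough sketch of what this looks like from a node, assuming the Redix client, a hypothetical connection URL, and a key scheme of <resource_type>/<resource_id> mapping to the full RRP string (all assumptions here, not the actual implementation):

{:ok, conn} = Redix.start_link("redis://keydb.internal:6379")

rrp = "rrp:us-east-1/hop@node-1/channels/project_0:channel_0"

# Claim the resource only if no other node has (SET ... NX).
# {:ok, "OK"} means we own it; {:ok, nil} means another node got there first.
{:ok, _result} = Redix.command(conn, ["SET", "channels/project_0:channel_0", rrp, "NX"])

# Any node, in any region, can resolve where the process currently lives.
{:ok, current_rrp} = Redix.command(conn, ["GET", "channels/project_0:channel_0"])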

Putting these Concepts to Work

Schrödinger’s Node

When a node looks up an RRP from Redis, it can cheaply check locally whether the node that the RRP claims the process is running on is actually up (using OTP's :erlang.nodes/1). If it isn't, it can start the process on the current node instead and update the RRP value.
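A sketch of that check, assuming an RRP already parsed into a map (as in the Leap.RRP.parse/1 sketch above); the takeover function and module name are hypothetical.

defmodule Leap.Atlas.Resolver do
  # Return the RRP as-is if its claimed node is reachable from here;
  # otherwise take the resource over locally and rewrite the RRP in Redis.
  def resolve_or_takeover(%{node: node_name} = rrp) do
    claimed_node = String.to_atom(node_name)

    if claimed_node == Node.self() or claimed_node in :erlang.nodes(:connected) do
      {:ok, rrp}
    else
      start_locally_and_update_rrp(rrp)
    end
  end

  # Hypothetical: start the process on this node and SET the new RRP value.
  defp start_locally_and_update_rrp(rrp), do: {:takeover, rrp}
end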

Race Condition Resolution

When a node claims ownership of an existing RRP, it optimistically assigns the RRP value (before the resource is actually created) in an attempt to prevent call race conditions. However, there is still a margin of error within the latency of reading and updating an RRP value where two or more nodes may create the same resource at once.

The round-trip time to set an RRP is measured while an ownership re-claim is executed, and an internal timer is set to re-check the RRP once that RTT has passed. Because a resource-claim SET operation should take roughly half of the total RTT, we can safely assume that, if the value hasn't changed by then, this node safely owns the resource.
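A hedged sketch of that timer inside the owning GenServer; the module name, state shape, and message names are assumptions, and the Redix calls mirror the earlier sketch.

defmodule Leap.Atlas.Claim do
  use GenServer

  def start_link(redis_conn), do: GenServer.start_link(__MODULE__, redis_conn)

  @impl true
  def init(redis_conn), do: {:ok, %{redis: redis_conn}}

  # Meant to run inside the server process: measure the RTT of the claiming SET,
  # then schedule a verification read once that RTT has elapsed.
  def claim(state, key, rrp) do
    started = System.monotonic_time(:millisecond)
    {:ok, "OK"} = Redix.command(state.redis, ["SET", key, rrp])
    rtt = System.monotonic_time(:millisecond) - started

    Process.send_after(self(), {:verify_claim, key, rrp}, rtt)
    state
  end

  @impl true
  def handle_info({:verify_claim, key, rrp}, state) do
    case Redix.command(state.redis, ["GET", key]) do
      # Value unchanged after a full RTT: this node safely owns the resource.
      {:ok, ^rrp} -> {:noreply, state}
      # Another node won the race: give up the local claim.
      {:ok, _other} -> {:stop, :lost_claim, state}
    end
  end
end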

Hacking the CAP

We're effectively splitting the CAP theorem's guarantees: we take the availability we get from one service (Redis) and pair it with a layer acting as the consistent side (the individual Elixir nodes), which optimistically take ownership of resources if a partition is observed.