Atlas Protocol
The Hop Atlas protocol
Authors: Phineas Walton
This service is internal to how Leap functions. This likely isn’t helpful to developers wanting to write Leap implementations - it’s more targeted toward our existing engineers to help them understand how our internal services work.
Outline
Leap, specifically Leap Edge, is written in Elixir. Elixir, or more specifically OTP, provides useful behaviors and functions for creating and managing an actor-based model across a cluster of nodes. Individual actors are called “processes”, and each process can implement a behavior called GenServer to manage its own state.
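To make that concrete, here is a minimal sketch of a per-resource GenServer; the Leap.Channel module name, its API and its state shape are illustrative assumptions rather than Leap’s actual code.

```elixir
# Minimal sketch of a per-resource process implementing the GenServer
# behaviour; module name and state shape are illustrative only.
defmodule Leap.Channel do
  use GenServer

  # Client API: start a process that owns the state for a single channel.
  def start_link(channel_id) do
    GenServer.start_link(__MODULE__, channel_id)
  end

  def publish(server, message) do
    GenServer.cast(server, {:publish, message})
  end

  # Server callbacks: each process manages its own isolated state.
  @impl true
  def init(channel_id) do
    {:ok, %{channel_id: channel_id, messages: []}}
  end

  @impl true
  def handle_cast({:publish, message}, state) do
    {:noreply, %{state | messages: [message | state.messages]}}
  end
end
```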
Processes in Elixir are identified by their PID (process ID), written as three integers, and PIDs remain usable across Elixir nodes in a cluster. As you can imagine, developers want to give processes their own internal names and IDs so they can refer to them using terms they already have from external input, like an API or a database.
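As a quick illustration of that naming problem, reusing the hypothetical Leap.Channel module above (Leap.LocalRegistry is also an assumed name): a local Registry lets callers address a process by a resource ID, but only within a single node.

```elixir
# A local Registry plus a via-tuple lets callers address a process by the
# resource ID they already hold, instead of by an opaque PID (#PID<0.250.0>).
{:ok, _} = Registry.start_link(keys: :unique, name: Leap.LocalRegistry)

name = {:via, Registry, {Leap.LocalRegistry, "project_0:channel_0"}}
{:ok, _pid} = GenServer.start_link(Leap.Channel, "project_0:channel_0", name: name)

# Callers can now use the resource ID they got from an API or database...
Leap.Channel.publish(name, "hello")

# ...but Registry only resolves names on the node it runs on, which is
# exactly the gap the rest of this document addresses.
```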
There have been many different solutions created by the Elixir & Erlang community to solve this issue, including some in the Erlang/OTP standard library, but they all have different drawbacks & tradeoffs. Without focusing on local process managers, which are trivial to implement, inter-node process management solutions include:
All of these solutions embrace an eventually consistent model, where certain nodes in the network may have a different view of where processes live. This model does not work for our use case. We need every node to be able to know where a resource lives, no matter what - if not, we could lose messages, or route them to the wrong session process.
One way to solve this would be to implement a cluster-wide lock or semaphore for resource synchronization; however, this would add latency and introduce a lot of unnecessary gossip between nodes. We need an atomic, low-latency process registry across all nodes.
CAP Theorem
The CAP theorem states that, in the event of a network partition, a system can favor either availability or consistency, but not both. In our case, we need availability, so that a node can always look up a resource and tell whether the response might be inconsistent, and consistency, so that we know where a resource lives.
The Solution
The solution to this problem requires introducing two new concepts into the system:
RRPs (resource reference pointers)
Hop operates multiple availability regions around the world, and within them, many different physical machines, services and, most importantly, the Elixir clusters that run Leap.
We need an easy way to reference, internally, where a given resource ID lives at any point in time. Since our system is made up of interconnected Elixir nodes, we don’t have to worry about the other layers (physical machines, overlay networks); we only need to know three things: the region, the resource type and the node that the process lives on. An RRP pointing to a channel resource running in the US East region could look like this:
rrp:us-east-1/hop@node-1/channels/project_0:channel_0
which takes the format of:
rrp:<region>/<node>/<resource_type>/<resource_id>
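As an illustration, a minimal sketch of building and parsing strings in this format might look like the following; the Atlas.RRP module name and return shapes are assumptions, not Leap’s actual implementation.

```elixir
# Illustrative helpers for the RRP format above; module and function
# names are assumptions, not Leap's actual code.
defmodule Atlas.RRP do
  @doc "Builds an RRP string: rrp:<region>/<node>/<resource_type>/<resource_id>"
  def build(region, node, resource_type, resource_id) do
    "rrp:#{region}/#{node}/#{resource_type}/#{resource_id}"
  end

  @doc "Parses an RRP string back into its components."
  def parse("rrp:" <> rest) do
    case String.split(rest, "/", parts: 4) do
      [region, node, resource_type, resource_id] ->
        {:ok,
         %{region: region, node: node, resource_type: resource_type, resource_id: resource_id}}

      _ ->
        {:error, :invalid_rrp}
    end
  end

  def parse(_other), do: {:error, :invalid_rrp}
end

# Atlas.RRP.parse("rrp:us-east-1/hop@node-1/channels/project_0:channel_0")
# => {:ok, %{region: "us-east-1", node: "hop@node-1",
#            resource_type: "channels", resource_id: "project_0:channel_0"}}
```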
Redis as a Process Node Registry
By using Redis (or KeyDB, in our case), every node in the network gets the extremely fast and safe lookups it provides, globally. We can rely on it as an SSoT (single source of truth), even during network partitions, and we can also use it to resolve conflicts by running sanity checks against RRPs.
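A hedged sketch of what lookups and claims against Redis could look like, assuming the Redix client and an illustrative "atlas:<resource_type>:<resource_id>" key scheme (neither is specified by this document):

```elixir
# Sketch of the Redis-backed lookups, assuming the Redix client and an
# illustrative "atlas:<resource_type>:<resource_id>" key scheme.
defmodule Atlas.Registry do
  @doc "Looks up the RRP for a resource; returns :unregistered when none is set."
  def lookup(conn, resource_type, resource_id) do
    case Redix.command(conn, ["GET", key(resource_type, resource_id)]) do
      {:ok, nil} -> :unregistered
      {:ok, rrp} -> {:ok, rrp}
      {:error, reason} -> {:error, reason}
    end
  end

  @doc "Writes (or overwrites) the RRP for a resource."
  def claim(conn, resource_type, resource_id, rrp) do
    case Redix.command(conn, ["SET", key(resource_type, resource_id), rrp]) do
      {:ok, "OK"} -> :ok
      {:error, reason} -> {:error, reason}
    end
  end

  defp key(resource_type, resource_id), do: "atlas:#{resource_type}:#{resource_id}"
end
```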
Putting these Concepts to Work
Schrödinger’s Node
When a node looks up an RRP from Redis, it can easily perform a local lookup to check whether the node that the RRP claims a process is running on is actually up (with OTP’s :erlang.nodes/1). If not, it can start the process on the current node instead and update the RRP value.
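Putting the earlier sketches together, that check could look roughly like this; the Atlas.Resolver module, its return values and the hard-coded region are assumptions for illustration.

```elixir
# Sketch of the liveness check, reusing the hypothetical Atlas.RRP and
# Atlas.Registry modules from the earlier sketches.
defmodule Atlas.Resolver do
  # The local region would normally come from configuration; hard-coded here.
  @region "us-east-1"

  def resolve(conn, resource_type, resource_id) do
    with {:ok, rrp} <- Atlas.Registry.lookup(conn, resource_type, resource_id),
         {:ok, %{node: node_name}} <- Atlas.RRP.parse(rrp) do
      claimed_node = String.to_atom(node_name)

      if claimed_node == Node.self() or claimed_node in :erlang.nodes(:connected) do
        # The node the RRP points at is reachable; route the message there.
        {:route_to, claimed_node}
      else
        # The claimed node is down or unreachable: start the process on the
        # current node instead (omitted here) and repoint the RRP at it.
        new_rrp = Atlas.RRP.build(@region, Node.self(), resource_type, resource_id)
        :ok = Atlas.Registry.claim(conn, resource_type, resource_id, new_rrp)
        {:route_to, Node.self()}
      end
    end
  end
end
```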
Race Condition Resolution
When a node claims ownership of an existing RRP, it optimistically assigns the RRP value (before the resource is actually created) in an attempt to prevent call race conditions. However, there is still an error margin, within the latency of reading and updating an RRP value, where two or more nodes may create the same resource at once.
The round-trip time of the SET is measured while the ownership re-claim is executed, and a timer is set internally to re-check the RRP after that RTT has passed. Because the resource claim’s SET operation should take roughly half of the total RTT, we can safely assume that, if the value hasn’t changed by then, this node safely owns the resource.
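A sketch of that timing-based verification inside the resource’s own process; the Atlas.ClaimedResource module, the named :atlas_redis Redix connection and the message shape are all assumptions for illustration.

```elixir
# Sketch of the RTT-measured re-check running inside the resource process.
# The :atlas_redis connection name and message shape are illustrative.
defmodule Atlas.ClaimedResource do
  use GenServer

  def start_link(key, rrp), do: GenServer.start_link(__MODULE__, {key, rrp})

  @impl true
  def init({key, rrp}) do
    # Measure the round trip of the claiming SET...
    started = System.monotonic_time(:millisecond)
    {:ok, "OK"} = Redix.command(:atlas_redis, ["SET", key, rrp])
    rtt = System.monotonic_time(:millisecond) - started

    # ...and schedule a re-check once that RTT has elapsed, by which point
    # any competing claim made in the same window should be visible.
    Process.send_after(self(), :recheck_rrp, rtt)
    {:ok, %{key: key, rrp: rrp}}
  end

  @impl true
  def handle_info(:recheck_rrp, %{key: key, rrp: rrp} = state) do
    case Redix.command(:atlas_redis, ["GET", key]) do
      {:ok, ^rrp} ->
        # Value unchanged: this node safely owns the resource.
        {:noreply, state}

      {:ok, _competing_rrp} ->
        # Another node won the race; stop this duplicate local process.
        {:stop, :normal, state}
    end
  end
end
```

Stopping the losing process is just one possible way to resolve the conflict; the point of the sketch is the RTT-measured re-check.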
Hacking the CAP
We’re effectively partitioning the CAP theorem’s guarantees: we use the availability we get from one service (Redis) and pair it with a service acting as a consistent layer (the individual Elixir nodes) by optimistically taking ownership of resources if a partition is observed.