Atlas Protocol
Authors: Phineas Walton

This service is internal to how Leap functions. This likely isn’t helpful to developers wanting to write Leap implementations - it’s more targeted toward our existing engineers to help them understand how our internal services work.
Outline
Leap, specifically Leap Edge, is written in Elixir. Elixir, or more specifically OTP, provides useful behaviors and functions for creating and managing an actor-based model across a cluster of nodes. Individual actors are called “processes”, and each process can implement a behavior called GenServer to manage its own state.
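To make this concrete, here is a minimal sketch of a GenServer-backed process. The module name and state shape are illustrative, not taken from the Leap codebase:

```elixir
defmodule Leap.SessionServer do
  # Illustrative sketch: each process owns its own state and serves
  # synchronous calls and asynchronous casts through the GenServer behavior.
  use GenServer

  def start_link(session_id) do
    GenServer.start_link(__MODULE__, session_id)
  end

  @impl true
  def init(session_id) do
    {:ok, %{session_id: session_id, messages: []}}
  end

  @impl true
  def handle_call(:state, _from, state), do: {:reply, state, state}

  @impl true
  def handle_cast({:push, msg}, state) do
    {:noreply, %{state | messages: [msg | state.messages]}}
  end
end
```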
Processes in Elixir are represented by their PID (process ID), denoted by three integers, and a PID remains valid when passed between Elixir nodes. As you can imagine, developers want to give their own internal names and IDs to processes so they can refer to them using existing internal terms gathered from external input like an API or database.
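Continuing the sketch above: a process can be registered under a friendlier name, but OTP’s Process.register/2 is local to the current node, which is exactly why a cluster-wide answer is needed:

```elixir
# Names are illustrative. A PID renders as #PID<0.105.0>; the three
# integers identify the process, and the value stays meaningful when
# sent to other nodes.
{:ok, pid} = Leap.SessionServer.start_link("session_0")

# This registration is per-node only: other nodes cannot resolve :session_0.
Process.register(pid, :session_0)
GenServer.call(:session_0, :state)
# => %{session_id: "session_0", messages: []}
```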
There have been many different solutions created by the Elixir & Erlang community for this problem, including some in the Erlang/OTP standard library; however, they all come with different drawbacks & tradeoffs. Setting aside local process registries, which are trivial to implement, the inter-node process management solutions include the likes of OTP’s :pg module and community libraries such as Horde, Swarm, and syn.
All of these solutions embrace an eventually consistent model, where certain
nodes in the network may have a different view of where processes live. This
model does not work for our use case. We need every node to be able to know
where a resource lives, no matter what - if not, we could lose messages, or
route them to the wrong session process.
One way to solve this would be to implement a cluster-wide lock or semaphore for resource sync; however, this would add latency and introduce a lot of unnecessary gossip between nodes. What we need is an atomic, low-latency process registry shared across all nodes.
CAP Theorem
The CAP theorem states that in the event of a network partition, a distributed system can favor either availability or consistency, but not both. In our case, we need availability so that a resource lookup always returns (and we can tell whether the response may be inconsistent), and consistency so that every node agrees on where a resource lives.

The Solution
The solution to this problem requires two new concepts to be introduced into the system:

RRPs (Resource Reference Pointers)
Hop operates multiple availability regions around the world and, within them, many different physical machines, services, and, most importantly, Elixir clusters that run Leap. We need an easy way to reference, internally, where certain resource IDs live at any given time. Because our system is made up of interconnected Elixir nodes, we don’t have to worry about the other layers (physical machines, overlay networks); we just need to know three things: the region, the resource type, and the node reference that the process lives on. An RRP pointing to a channel resource running in the US East region could look like this:

rrp:us-east-1/hop@node-1/channels/project_0:channel_0
which takes the format of:
rrp:<region>/<node>/<resource_type>/<resource_id>
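A small helper for building and parsing RRPs could look like the following sketch; the module name and struct fields are assumptions, not production code:

```elixir
defmodule Leap.RRP do
  # Hypothetical parser for rrp:<region>/<node>/<resource_type>/<resource_id>.
  defstruct [:region, :node, :resource_type, :resource_id]

  def parse("rrp:" <> rest) do
    case String.split(rest, "/", parts: 4) do
      [region, node, type, id] ->
        {:ok, %__MODULE__{region: region, node: node, resource_type: type, resource_id: id}}

      _ ->
        {:error, :invalid_rrp}
    end
  end

  def parse(_), do: {:error, :invalid_rrp}

  def format(%__MODULE__{} = rrp) do
    "rrp:#{rrp.region}/#{rrp.node}/#{rrp.resource_type}/#{rrp.resource_id}"
  end
end

Leap.RRP.parse("rrp:us-east-1/hop@node-1/channels/project_0:channel_0")
# => {:ok, %Leap.RRP{region: "us-east-1", node: "hop@node-1",
#      resource_type: "channels", resource_id: "project_0:channel_0"}}
```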
Redis as a Process Node Registry
By using Redis (KeyDB, in our case), every node in the network can perform extremely fast, safe lookups against it, globally. We can treat it as an SSoT (single source of truth), even during network partitions, and also use it to resolve conflicts by running sanity checks against RRPs.
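As a sketch of how a node might claim a resource in the registry, assuming the Redix client and an illustrative key layout (neither is confirmed by this document):

```elixir
# Hypothetical claim flow. SET ... NX is atomic: exactly one node in the
# cluster wins the registration, which is what makes Redis usable as an SSoT.
{:ok, conn} = Redix.start_link("redis://localhost:6379")

key = "channels/project_0:channel_0"
rrp = "rrp:us-east-1/hop@node-1/channels/project_0:channel_0"

case Redix.command(conn, ["SET", key, rrp, "NX"]) do
  {:ok, "OK"} -> :claimed
  {:ok, nil}  -> {:already_registered, Redix.command!(conn, ["GET", key])}
end
```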
Putting these Concepts to Work
Schrödinger’s Node
When a node looks up an RRP from Redis, it can easily perform a local check to see whether the node that the RRP claims a process is running on is actually up (with OTP’s :erlang.nodes/1). If not, it can start the process on the current node instead and update the RRP value.
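A hedged sketch of that recovery path, reusing the hypothetical Leap.RRP and Leap.SessionServer modules from the earlier sketches (local_region/0 and update_registry/1 are stand-in stubs, not real Leap functions):

```elixir
defmodule Leap.Atlas.Resolver do
  # Illustrative "Schrödinger's node" check, not the production implementation.

  def resolve(rrp_string) do
    {:ok, rrp} = Leap.RRP.parse(rrp_string)
    claimed = String.to_atom(rrp.node)

    # :erlang.nodes(:connected) lists the nodes currently visible to this one.
    if claimed == node() or claimed in :erlang.nodes(:connected) do
      {:ok, rrp}
    else
      # The claimed node is unreachable: adopt the resource locally and
      # repoint the RRP so subsequent lookups route here.
      {:ok, _pid} = Leap.SessionServer.start_link(rrp.resource_id)
      new_rrp = %{rrp | region: local_region(), node: Atom.to_string(node())}
      :ok = update_registry(new_rrp)
      {:ok, new_rrp}
    end
  end

  defp local_region, do: System.get_env("REGION", "us-east-1")
  defp update_registry(_rrp), do: :ok # e.g. a Redis SET, as sketched earlier
end
```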