The Interplay of Nakamoto Consensus and Byzantine Fault Tolerance
Trying to understand the interplay of Byzantine Fault Tolerance (BFT) and Nakamoto Consensus (NC) is not easy. Reading the original papers without a guide is challenging because BFT research predates NC, and the author of NC was likely unaware of it.
This basic primer should help the reader understand the design trade-offs between NC and BFT algorithms.
Looking forward, it’s important to understand these trade-offs because most next-generation blockchains incorporate both traditional BFT concepts and NC concepts.
SIMILARITIES AND DIFFERENCES
BFT and NC are not directly comparable. BFT describes a family of scenarios, while NC is a specific algorithm. The scenario NC is built for had not been previously posed by BFT related research, and in fact, the Bitcoin whitepaper does not reference any BFT research among its many citations. That said, it’s possible to consider NC in the context of BFT by looking at some of the similarities and differences.
- Both replicate state machines to increase reliability and availability, and accept inefficiency as a trade-off.
- Byzantine Behavior: Both assume nodes can attempt to act maliciously by lying as well as saying different things to different peer nodes.
- Network: NC assumes a synchronous network. BFT considers both synchronous and asynchronous networks, but usually focuses on the harder problem of asynchronous networks.
- Permissioning: NC does not assume nodes are known in advance, and allows any node to join without permission (i.e. permissionless). BFT does assume nodes are known in advance, requiring some kind of permissioning.
- Consistency: NC allows temporary inconsistency about the state of the system. BFT algorithms are always consistent (as long as assumptions are met).
- Durability: NC allows historical data to be rewritten under a certain attack called a reorganization. In BFT, historical data does not change.
- Voting: NC relies on a simple majority: more than half of the nodes (or hash power, or stake) must be honest. BFT relies on a supermajority: more than two thirds of the nodes must be honest.
Let’s unpack this a bit.
REPLICATION FOR RELIABILITY AND AVAILABILITY
A common method to increase reliability is to rely on redundancy. A car normally has four brakes, for example. Even if one brake fails, the other three can safely stop the car.
Similarly, availability is also boosted by redundancy. A traffic intersection typically has at least two traffic lights, in case one is obscured.
When you access a web service like Gmail, there is typically around 9x redundancy to provide high levels of reliability and availability.
Redundancy is by definition inefficient. If a system only needs 1 machine but keeps 99 spares around, then it’s only 1% efficient.
Both NC and BFT depend on replication to increase reliability and availability, and accept the resulting inefficiency. NC did so because it wanted to secure a valuable asset, Bitcoin, and BFT’s original research accepted inefficiency because it was designed for scenarios where mistakes were unacceptable, such as weapon systems.
NC and BFT both assume that among the many replicated nodes, some could malfunction. The simpler malfunction is simply stopping, called fail-stop. The more complex malfunction is saying the wrong thing, or saying different things to different nodes. This is called byzantine behavior.
Both BFT and NC deal with byzantine nodes.
Most BFT algorithms use signed messages to detect when a node is saying multiple things. When this happens, receiving nodes compare notes to identify the faulty node.
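That note-comparing step can be sketched in a few lines. This is a toy illustration, not any real BFT implementation: HMAC with shared per-node keys stands in for public-key signatures, and the node names, slots, and payloads are all hypothetical.

```python
import hashlib
import hmac

# Illustrative per-node signing keys (HMAC stands in for real signatures).
KEYS = {"alice": b"key-1", "bob": b"key-2"}

def sign(node, slot, payload):
    """Produce a signed message: (node, slot, payload, signature)."""
    mac = hmac.new(KEYS[node], f"{slot}:{payload}".encode(), hashlib.sha256).hexdigest()
    return (node, slot, payload, mac)

def detect_equivocation(messages):
    """Return the set of nodes caught signing two different payloads
    for the same slot (i.e. saying different things to different peers)."""
    seen, faulty = {}, set()
    for node, slot, payload, mac in messages:
        expected = hmac.new(KEYS[node], f"{slot}:{payload}".encode(), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(mac, expected):
            continue  # forged message: signature doesn't verify, so it can't frame a node
        if (node, slot) in seen and seen[(node, slot)] != payload:
            faulty.add(node)  # same node, same slot, two different statements
        seen.setdefault((node, slot), payload)
    return faulty

msgs = [sign("alice", 1, "commit A"),
        sign("alice", 1, "commit B"),  # alice equivocates
        sign("bob", 1, "commit A")]
print(detect_equivocation(msgs))  # {'alice'}
```

The key property is that signatures make statements non-repudiable: once two conflicting signed statements from the same node are gathered in one place, the fault is provable to everyone.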
Nakamoto Consensus instead creates a random delay so that any one node can effectively say only one thing at a time. This delay is created by the well-known Proof of Work (PoW).
A network may be able to deliver all messages between nodes within a fixed time (synchronous), or not (asynchronous).
The former is comparable to being in the same room with someone. Within the bounds of the speed of sound, if someone talks, you’ll hear it.
The latter is like snail mail. It’s hard to predict when a letter will be received.
The Internet and most computer systems are usually synchronous in practice, but scenarios can arise that resemble asynchronous behavior.
The problem with an asynchronous network is that not hearing from a node could mean two things:
- Either the node isn’t talking, or,
- The node is talking but you just haven’t heard it yet.
NC does not specifically talk about network synchrony, but, it is clear that the author assumed that data would travel throughout the network quickly, relative to block production time (10 minutes). In other words, NC assumes a synchronous network.
Most modern BFT algorithms assume an asynchronous (or partially synchronous) network, including PBFT, Tendermint (used by Cosmos), Algorand, HoneyBadgerBFT, and HotStuff (used by Facebook’s Libra).
PERMISSIONING AND CONSISTENCY
A system can either have a fixed number of known nodes (e.g. 4 brakes in your car) or allow any node to join (e.g. connecting your computer to the Internet).
NC assumes any node can join the system at any time without telling anyone. This is permissionless because no one has the authority to grant permission to join the system.
BFT thinking, on the other hand, assumes nodes are known in advance. Since NC’s invention, a system with known nodes is called permissioned if the nodes are admitted by a central authority, or goes by another name, like delegated proof of stake, if the nodes are voted in.
Permissioning is intimately related to consistency.
Consistency means that data doesn’t change unless it’s supposed to.
A known set of nodes running a good BFT algorithm within defined parameters can ensure consistency. NC, on the other hand, allows the possibility that the latest data you see may soon change. This is called a temporary fork. When two miners produce valid blocks at the same height, only one can ultimately remain in the chain; in NC, the other block will eventually be orphaned. If you were relying on data in the orphaned block, you’d be in trouble. And there’s an even deeper attack, the reorg, which allows very old data to be rewritten.
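How a fork resolves can be sketched with NC’s fork-choice rule. This is a simplification under stated assumptions: real Bitcoin follows the chain with the most accumulated proof of work, which plain block count stands in for here, and the block names are hypothetical.

```python
# Toy chain: each block maps to its parent. Chains A and B fork at genesis.
blocks = {
    "genesis": None,
    "a1": "genesis",
    "a2": "a1",        # chain A: genesis -> a1 -> a2
    "b1": "genesis",   # chain B: genesis -> b1 (competes with a1 at the same height)
}

def height(block):
    """Count ancestors back to genesis (a stand-in for accumulated work)."""
    h = 0
    while blocks[block] is not None:
        block = blocks[block]
        h += 1
    return h

def best_tip(tips):
    """NC's fork choice: follow the longest chain (ties broken arbitrarily here)."""
    return max(tips, key=height)

print(best_tip(["a2", "b1"]))  # 'a2' — b1 and anything built on it is orphaned
```

Once chain A extends past the fork point, every node applying this rule abandons b1, which is exactly why data that only appears in b1 cannot be trusted yet.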
NC’s permissionless model introduces another attack vector: the Sybil attack. In a Sybil attack, a single adversary creates multiple fake identities to stuff the ballot box. Solving this is the second use of PoW. Not only does PoW prevent nodes from talking over one another, it also prevents a single node from easily saying two things at once. To say two things, an attacker would need to spend twice the money on hardware and electricity to perform the work. It’s very elegant that PoW solves both Sybil attacks and byzantine behavior.
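The mechanism behind that cost can be shown with a toy proof of work. This is a minimal sketch, not Bitcoin’s actual format: the header string and difficulty are illustrative, and real difficulty is far higher.

```python
import hashlib

def mine(header, difficulty_bits=16):
    """Search for a nonce so that sha256(header + nonce) falls below a
    target — a toy version of proof of work. Expected cost is about
    2**difficulty_bits hash evaluations."""
    target = 1 << (256 - difficulty_bits)
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{header}:{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce
        nonce += 1

nonce = mine("block data")
# Anyone can verify the result with a single hash, but producing it
# required ~2**16 attempts — so each additional "identity" or
# simultaneous statement costs real hardware and electricity.
```

This asymmetry (expensive to produce, cheap to verify) is what makes ballot stuffing uneconomical: votes are weighted by work done, not by how many names an attacker invents.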
By comparison, BFT does not suffer from Sybil attacks because the problem of identifying the permissioned nodes is solved outside the system, such as by a central authority.
CONSENSUS: MAJORITY / SUPERMAJORITY
In both BFT and NC, a majority or supermajority is required to achieve consensus.
BFT has a classic result that more than 2/3 of the nodes must be honest for the system to continue working. This is often cited. However, the 2/3 result is actually correct for two different scenarios.
First, oral messages (OM), which are forgeable and allow a node to talk out of both sides of its mouth, require more than 2/3 honest nodes within a synchronous network. The result here is quite unintuitive and therefore valuable: we generally think of a simple majority as enough to achieve consensus, not a supermajority. The OM scenario is not considered much today because good public-key cryptography exists.
Second, signed messages (SM) within an asynchronous network require more than 2/3 honest nodes. Signed messages use public-key signatures to prevent forgeries, making it easy to detect when a node is talking out of both sides of its mouth. But the fact that not hearing from a node could mean it isn’t talking, or could mean it is talking but the message is taking a while, makes this scenario difficult to solve. The SM scenario more closely resembles the actual conditions of the Internet today, so this is where research focuses.
There’s actually a third scenario: signed messages within a synchronous network. There’s been recent research in this area as well.
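The two thresholds above can be made concrete with a little arithmetic. A sketch, assuming the classic formulation that a BFT system of n nodes tolerates f byzantine nodes when n ≥ 3f + 1, while NC tolerates any dishonest share of hash power strictly below one half:

```python
def bft_fault_tolerance(n):
    """Max byzantine nodes f that n BFT nodes tolerate (n >= 3f + 1),
    i.e. the 'more than 2/3 honest' supermajority requirement."""
    return (n - 1) // 3

def nc_fault_tolerance(units):
    """Max dishonest units of hash power NC tolerates out of `units`
    total: any amount strictly less than half."""
    return (units - 1) // 2

for n in (4, 7, 10, 100):
    print(n, bft_fault_tolerance(n), nc_fault_tolerance(n))
# e.g. with 100 units, BFT tolerates 33 faulty nodes; NC tolerates 49.
```

The comparison shows why 4 is the smallest interesting BFT deployment size (it tolerates a single byzantine node), and why NC’s simple-majority bound is strictly more permissive than BFT’s supermajority bound.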
NC made very different assumptions from BFT thinking and ultimately led to a new wave of innovation. It’s reasonable to call the problem NC solves a part of the BFT family, although it is at best a long-lost cousin rather than a direct descendant.
Don’t confuse consensus with performance. Replication by nature is inefficient. When a blockchain advertises higher transactions per second (TPS), it’s almost always based less on the core consensus algorithm, and more on design choices like the number of nodes, network speed, and storage capabilities.
Thanks to Evan Mair for review.