# English

## What is Distributed Consensus?

### An example

Let’s say we have a single-node system. For this example, you can think of our node as a database server that stores a single value. We also have a client that can send a value to the server. Coming to agreement, or consensus, on that value is easy with one node. But how do we come to consensus if we have multiple nodes? That’s the problem of distributed consensus.

Raft is a protocol for implementing distributed consensus.

### How it works

Let’s look at a high level overview of how it works.

#### Three states

A node can be in one of three states:

• The Follower state

• The Candidate state

• The Leader state

All our nodes start in the follower state. If a follower doesn’t hear from a leader, it can become a candidate. The candidate then requests votes from other nodes, and the nodes reply with their votes. The candidate becomes the leader if it gets votes from a majority of nodes. This process is called Leader Election.
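
The majority rule for winning an election can be sketched in Python. This is a minimal illustration, not a reference implementation: the `State` enum and the `majority` and `election_result` helpers are hypothetical names, and the sketch covers only the vote-counting rule.

```python
from enum import Enum


# Hypothetical names for illustration: the three roles a Raft node can be in.
class State(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"


def majority(cluster_size: int) -> int:
    """Smallest number of nodes that forms a majority of the cluster."""
    return cluster_size // 2 + 1


def election_result(votes_received: int, cluster_size: int) -> State:
    """A candidate becomes leader only with votes from a majority of nodes
    (its own vote counts); otherwise it is still a candidate."""
    if votes_received >= majority(cluster_size):
        return State.LEADER
    return State.CANDIDATE


# In a 3-node cluster, 2 votes (including the candidate's own) are enough.
print(election_result(2, 3))  # State.LEADER
print(election_result(1, 3))  # State.CANDIDATE
```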

All changes to the system now go through the leader. Each change is added as an entry in the node’s log. This log entry is initially uncommitted, so it doesn’t update the node’s value. To commit the entry, the leader first replicates it to the follower nodes, then waits until a majority of nodes have written the entry.

The entry is now committed on the leader node, so the node’s value is updated (in this example, to “5”). The leader then notifies the followers that the entry is committed.

The cluster has now come to consensus about the system state. This process is called Log Replication.
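
Commitment on a majority can be expressed in terms of how far each node’s log has been replicated. The sketch below uses a hypothetical helper, `commit_index`, and assumes each node’s progress is a single number: the index of the last entry it has written, the leader included. It omits one detail of the full protocol: a leader only commits entries from its own current term by counting replicas this way.

```python
def commit_index(match_index: list[int]) -> int:
    """Highest log index that is stored on a majority of nodes.

    match_index[i] is the last log index known to be written on node i,
    including the leader's own log (hypothetical representation).
    """
    # Sort descending; the value at position n // 2 is held by at least
    # n // 2 + 1 nodes, i.e. by a majority.
    ordered = sorted(match_index, reverse=True)
    return ordered[len(ordered) // 2]


# 5 nodes: the leader and two followers have entry 1, two followers do not.
print(commit_index([1, 1, 1, 0, 0]))  # 1: entry 1 is committed
print(commit_index([1, 1, 0, 0, 0]))  # 0: only 2 of 5 nodes have entry 1
```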

In Raft there are two timeout settings which control elections.

• The election timeout is the amount of time a follower waits before becoming a candidate. It is randomized to be between 150 ms and 300 ms.

• The heartbeat timeout is the interval at which the leader sends Append Entries messages to its followers.

An election then proceeds as follows:

• After its election timeout expires, the follower becomes a candidate, starts a new election term, votes for itself, and sends Request Vote messages to the other nodes.
• If a receiving node hasn’t voted yet in this term, it votes for the candidate and resets its own election timeout.
• Once a candidate has votes from a majority of nodes, it becomes the leader.
• The leader begins sending Append Entries messages to its followers.
• These messages are sent at intervals specified by the heartbeat timeout.
• Followers respond to each Append Entries message.
• This election term continues until a follower stops receiving heartbeats and becomes a candidate.
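
A follower’s voting rule and its randomized timeout could look like the sketch below. `Voter` and `new_election_timeout_ms` are hypothetical names; only the 150 to 300 ms range comes from the text. The sketch assumes vote requests arrive one at a time and leaves out the check, made in full Raft, that the candidate’s log is at least as up to date as the voter’s.

```python
import random


def new_election_timeout_ms(rng: random.Random) -> float:
    """Randomized election timeout between 150 ms and 300 ms."""
    return rng.uniform(150.0, 300.0)


class Voter:
    """Hypothetical minimal follower that grants at most one vote per term."""

    def __init__(self, rng: random.Random) -> None:
        self.current_term = 0
        self.voted_for = None  # candidate id voted for in current_term
        self.rng = rng
        self.election_timeout_ms = new_election_timeout_ms(rng)

    def handle_request_vote(self, term: int, candidate_id: str) -> bool:
        if term < self.current_term:
            return False  # request from a stale term
        if term > self.current_term:
            # A newer term has started: the vote from the old term no longer applies.
            self.current_term = term
            self.voted_for = None
        if self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            # Granting a vote resets this node's election timeout.
            self.election_timeout_ms = new_election_timeout_ms(self.rng)
            return True
        return False


voter = Voter(random.Random(42))
print(voter.handle_request_vote(1, "A"))  # True
print(voter.handle_request_vote(1, "B"))  # False: already voted in term 1
print(voter.handle_request_vote(2, "B"))  # True: a new term
```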

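• If the leader stops sending heartbeats (for example, because it fails), the follower whose election timeout expires first becomes a candidate for the next term. Suppose node B times out first and wins votes from a majority: node B is now leader of term 2.

• Requiring a majority of votes guarantees that only one leader can be elected per term, because any two majorities share at least one node and each node votes only once per term.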
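
The overlap argument can be checked by brute force for a small cluster. The helpers below are hypothetical and only enumerate every majority of a given node set and test that each pair of them intersects.

```python
from itertools import combinations


def majorities(nodes: list[str]) -> list[frozenset[str]]:
    """All subsets of nodes that form a strict majority."""
    need = len(nodes) // 2 + 1
    return [
        frozenset(group)
        for size in range(need, len(nodes) + 1)
        for group in combinations(nodes, size)
    ]


def every_two_majorities_overlap(nodes: list[str]) -> bool:
    """True if any two majorities share at least one node."""
    groups = majorities(nodes)
    return all(a & b for a in groups for b in groups)


print(every_two_majorities_overlap(["A", "B", "C", "D", "E"]))  # True
```

Since the shared node can vote for only one candidate in a given term, two different candidates can never both hold a majority for that term.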

##### Split vote

If two nodes become candidates at the same time, a split vote can occur. Let’s take a look at a split vote example…

• Two nodes both start an election for the same term, and each reaches a single follower node before the other.

• In this four-node cluster, each candidate now has 2 votes (its own and one follower’s), short of the 3 needed for a majority, and can receive no more votes for this term.

• The nodes wait for their election timeouts to expire again, then start a new election in the next term and try again.
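
A split vote is a tally in which no candidate reaches a majority. In this sketch the cluster has four nodes, so a candidate needs 3 votes; `election_winner` is a hypothetical helper and the vote assignment is an illustrative scenario, not data from a real run.

```python
from collections import Counter
from typing import Optional


def election_winner(votes: dict[str, str], cluster_size: int) -> Optional[str]:
    """Return the candidate holding a majority of votes, or None on a split vote.

    votes maps each voter to the candidate it voted for in this term
    (candidates vote for themselves).
    """
    need = cluster_size // 2 + 1
    for candidate, count in Counter(votes.values()).items():
        if count >= need:
            return candidate
    return None


# Hypothetical 4-node cluster: B and D become candidates at the same time,
# and each reaches one follower first.
split = {"B": "B", "A": "B", "D": "D", "C": "D"}
print(election_winner(split, cluster_size=4))  # None: 2 votes each, 3 needed
```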

#### Log Replication

Once we have a leader elected, we need to replicate all changes to our system to all nodes.

This is done by using the same Append Entries message that was used for heartbeats.

##### Log Replication process

Let’s walk through the process.

• First, a client sends a change to the leader, for example setting the value to “5”.
• The change is appended to the leader’s log, then sent to the followers on the next heartbeat.
• An entry is committed once a majority of the nodes (the leader included) have written it, and a response is then sent to the client.
• Now let’s send a command to increment the value by “2”.
• Our system value is now updated to “7”.
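
Only committed entries change the node’s value. Assuming commands are encoded as hypothetical `("set", n)` and `("add", n)` tuples and the value starts at 0, replaying the committed prefix of the log reproduces the walk-through: “5” after the first entry, “7” after the second.

```python
def apply(value: int, command: tuple[str, int]) -> int:
    """Apply one committed log entry (hypothetical command format)."""
    op, arg = command
    if op == "set":
        return arg
    if op == "add":
        return value + arg
    raise ValueError(f"unknown command: {op}")


def state_after(log: list[tuple[str, int]], commit_index: int) -> int:
    """Replay only the committed entries, i.e. the first commit_index ones."""
    value = 0  # assumed initial value
    for command in log[:commit_index]:
        value = apply(value, command)
    return value


log = [("set", 5), ("add", 2)]
print(state_after(log, commit_index=1))  # 5: "add 2" is not committed yet
print(state_after(log, commit_index=2))  # 7
```
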

##### Network partitions

Raft can even stay consistent in the face of network partitions.

• Let’s add a partition to separate A & B from C, D & E.
• Because of the partition, we now have two leaders in different terms: node B on the A & B side, and a leader with a higher term on the C, D & E side.
• Let’s add another client and try to update both leaders.
• One client will try to set the value of node B to “3”.
• Node B cannot replicate to a majority, so its log entry stays uncommitted.
• The other client will try to set the value of node D to “8”.
• This will succeed because it can replicate to a majority.
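
The partition outcome follows from the same majority rule, applied to the whole cluster rather than to the nodes a leader can currently reach. `can_commit` is a hypothetical helper for illustration.

```python
def can_commit(partition: set[str], cluster: set[str]) -> bool:
    """A leader can commit only if its side of the partition holds a majority
    of the whole cluster, not just of the nodes it can currently reach."""
    return len(partition & cluster) > len(cluster) // 2


cluster = {"A", "B", "C", "D", "E"}
print(can_commit({"A", "B"}, cluster))       # False: B's "3" stays uncommitted
print(can_commit({"C", "D", "E"}, cluster))  # True: the "8" entry commits
```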

Now let’s heal the network partition.

• Node B will see the higher election term and step down.
• Both nodes A & B will roll back their uncommitted entries and match the new leader’s log.
• Our log is now consistent across our cluster.
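
Healing the partition can be sketched as a node handling an Append Entries message from the new leader. The `Node` dataclass, its fields, and the example logs are hypothetical. As a simplification, the node compares its whole log with the leader’s, keeps the prefix they agree on, and replaces the rest; full Raft performs this consistency check incrementally, using the index and term of the entry that precedes the new entries.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical minimal node state for this sketch."""

    name: str
    term: int
    role: str = "follower"
    # Each entry is (term in which it was created, command).
    log: list[tuple[int, str]] = field(default_factory=list)

    def on_append_entries(self, leader_term: int,
                          leader_log: list[tuple[int, str]]) -> bool:
        """Handle an Append Entries message carrying the leader's full log."""
        if leader_term < self.term:
            return False  # reject a leader from an older term
        # A term at least as high as ours: adopt it and step down.
        self.term = leader_term
        self.role = "follower"
        # Keep the prefix on which our log agrees with the leader's...
        match = 0
        while (match < len(self.log) and match < len(leader_log)
               and self.log[match] == leader_log[match]):
            match += 1
        # ...then drop our conflicting (uncommitted) entries and copy the leader's.
        self.log = self.log[:match] + leader_log[match:]
        return True


# Hypothetical logs: "set 3" is B's uncommitted entry from the partition.
b = Node("B", term=1, role="leader", log=[(1, "set 5"), (1, "set 3")])
b.on_append_entries(3, [(1, "set 5"), (3, "set 8")])
print(b.role, b.log)  # follower [(1, 'set 5'), (3, 'set 8')]
```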

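# Chinese

## What is Distributed Consensus?

### An example

Raft is a protocol for implementing distributed consensus.

### How it works

#### Three node states

• Follower state
• Candidate state
• Leader state

① Receive a client change request → record it in the log → replicate the log → wait for responses → reach agreement with the follower nodes → commit the change and notify the follower nodes → respond to the client

② Receive a client change request → record it in the log → replicate the log → wait for responses → fail to reach agreement with the follower nodes → roll back the change and notify the follower nodes → respond to the client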

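#### Leader Election

• Election timeout: the time a follower waits before becoming a candidate, randomized between 150 ms and 300 ms.
• Heartbeat timeout: the interval at which the leader sends Append Entries messages to its followers.

##### The first election

① After the election timeout, a follower becomes a candidate, starts a new election term, votes for itself, and sends Request Vote messages to the other nodes.

② If the receiving node hasn’t voted yet in this term, it votes for the candidate and resets its election timeout.

③ Once a candidate has a majority of votes, it becomes the leader.

④ The leader begins sending Append Entries messages to its followers. These messages are sent at intervals specified by the heartbeat timeout.

⑤ Followers respond to each Append Entries message.

⑥ This election term continues until a follower stops receiving heartbeats and becomes a candidate.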

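##### After the leader fails

① Node B is now a candidate for term 2.

② A candidate needs votes from a majority of nodes; this guarantees that only one leader is elected per term.

③ Node B receives votes from itself and node C and becomes the new leader.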
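
Which follower becomes the candidate after a leader failure comes down to the randomized election timeouts. The hypothetical helper below picks the live node whose timeout expires first, moves to the next term, and checks whether its votes reach a majority of the full cluster. It assumes every other live node grants its vote, and the timeout values are made up for illustration.

```python
def elect_after_leader_failure(timeouts_ms: dict[str, float], cluster_size: int,
                               term: int) -> tuple[str, int, bool]:
    """Return (candidate, new term, won majority?) under the stated assumptions."""
    # The live follower whose election timeout expires first becomes a candidate.
    candidate = min(timeouts_ms, key=timeouts_ms.get)
    new_term = term + 1
    # Its own vote plus one from each other live node (assumed to grant it).
    votes = len(timeouts_ms)
    won = votes > cluster_size // 2
    return candidate, new_term, won


# 3-node cluster; the term-1 leader has failed, B and C are still running.
print(elect_after_leader_failure({"B": 170.0, "C": 260.0}, cluster_size=3, term=1))
# ('B', 2, True)
```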

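##### Split vote

• Two nodes both start an election for the same term, term 4.
• Each reaches a single follower node first.
• Now each candidate has 2 votes and cannot receive any more votes in this term.
• A new round of elections begins.
• Node B receives a majority of votes in term 5 and becomes the leader.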
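
Retrying after a split vote relies on every node drawing a fresh random timeout each term, which makes a repeated tie unlikely. The model below is a deliberately coarse, hypothetical simulation: timeouts are whole milliseconds in the 150 to 300 ms range, nodes that draw the same earliest timeout are treated as a split vote, and a node that times out alone is assumed to win.

```python
import random


def elect_with_retries(nodes: list[str], start_term: int, rng: random.Random,
                       max_terms: int = 100) -> tuple[int, str]:
    """Return (term, leader) under a coarse, hypothetical election model.

    Each term every node draws a fresh whole-millisecond election timeout in
    [150, 300]. If several nodes share the earliest timeout, they start
    elections at the same moment and the vote is treated as split; a node
    that times out alone is assumed to win a majority.
    """
    term = start_term
    for _ in range(max_terms):
        timeouts = {node: rng.randint(150, 300) for node in nodes}
        earliest = min(timeouts.values())
        candidates = [node for node, t in timeouts.items() if t == earliest]
        if len(candidates) == 1:
            return term, candidates[0]
        term += 1  # split vote: time out again and try in the next term
    raise RuntimeError("no leader elected within max_terms")


print(elect_with_retries(["A", "B", "C", "D"], start_term=4, rng=random.Random(7)))
```

With the four nodes from the example and `start_term=4`, the function returns the term and node that end up winning; the exact result depends on the random seed.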

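#### Log Replication

##### When the network is healthy

• First, a client sends the request “set 5” to the leader.
• The change is then sent to the followers on the next heartbeat.
• Once a majority of followers acknowledge the entry, the leader commits it.
• A response is then sent to the client.
• Now let’s send a command to increment the value by 2: “add 2”.
• Our system value is now updated to 7.

##### With a network partition

• Let’s add a partition to separate A & B from C, D & E.
• Because of the partition, we now have two leaders using different terms: Term 1 and Term 3.
• Let’s add another client and try to update both leaders.
• One client tries to set the value of node B to 3. Node B cannot replicate to a majority of nodes, so its log entry stays uncommitted.
• The other client tries to set the value of node D to 8. Because it can replicate to a majority of nodes, the entry is committed.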

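Now let’s heal the network partition.

• Node B sees the higher election term, Term 3, and steps down from the leader role.
• Nodes A and B both roll back their uncommitted entries and match the new leader’s log.
• Our log is now consistent across the whole cluster again.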