Yuerer's Blog

An Understandable Tracing JIT Compiler

2023-10-12T12:00:20.000Z

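Introduction

Lua developers have usually heard of or used LuaJIT, but for various reasons may never have understood how it works. Here I share a master's thesis by Jakob Erlandsson and Simon Kärrman, TigerShrimp: An Understandable Tracing JIT Compiler. The thesis describes how to build a tracing JIT for the JVM and comes with source code and a visualization tool. The rest of this post briefly walks through parts of its implementation.

Compilation pipeline

TigerShrimp works on JVM bytecode: Java source files are compiled to .class files with javac, and the .class files are then decoded directly. This bypasses the parser stage and yields the bytecode.

Execution flow

TigerShrimp contains a simple interpreter that executes the bytecode directly. For every instruction it notes the current pc (a pair of method index and instruction index, since an instruction index alone could repeat across methods) and checks whether that pc is on a hot path. If it is, the recording flow starts and every executed instruction is recorded. (Loops are what usually gets recorded: a loop has a back edge, its executions are counted, and once the count exceeds a threshold the loop is considered a hot path.)

If native code already exists, that is, the path is hot and machine code has been generated for it, the native code is executed directly.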
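The hot-path detection described above can be sketched as follows. This is a minimal illustration, not TigerShrimp's code: the class name `HotPathProfiler`, the threshold value, and the rule "a jump to an equal or lower instruction index in the same method is a back edge" are all assumptions made for the example.

```python
from collections import defaultdict

HOT_THRESHOLD = 3  # hypothetical value; the thesis may use a different one


class HotPathProfiler:
    """Counts back edges per loop header, keyed by pc = (method, instruction)."""

    def __init__(self, threshold=HOT_THRESHOLD):
        self.threshold = threshold
        self.counts = defaultdict(int)

    def note_jump(self, source_pc, target_pc):
        """Record a jump; return True once the target loop header is hot."""
        same_method = source_pc[0] == target_pc[0]
        is_back_edge = same_method and target_pc[1] <= source_pc[1]
        if not is_back_edge:
            return False
        self.counts[target_pc] += 1
        return self.counts[target_pc] > self.threshold


profiler = HotPathProfiler()
loop_header, loop_end = (0, 2), (0, 9)
results = [profiler.note_jump(loop_end, loop_header) for _ in range(5)]
print(results)
```

Keying the counter by the pair rather than by the instruction index alone is what keeps two methods with the same instruction index from sharing a counter.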

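Recording flow

Recording is routine: every instruction is recorded, and only branch instructions need special handling. Because recording follows actual execution, the recorded sequence is linear and contains no branches, so the instructions effectively form a single basic block. The original instructions do branch, however, so a branch condition may have to be flipped. For example: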

1: if (a < b):
2:     x()
3: else:
4:     y()

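If a >= b, execution reaches y(), i.e. pc = 4. Recording the instruction a < b as-is would be logically wrong for this path, so the instruction is flipped to a >= b.

Recording of a hot path is complete when a return is recorded or execution comes back to the start of the loop.

Compiling the recording

Once a hot path has been recorded it is compiled to machine code; TigerShrimp uses the asmjit library to help generate it. The translation of individual bytecodes is skipped here. We only look at the preparation done on trace entry and at what happens when a branch check fails (how to fall back to the interpreter correctly).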
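A small sketch of the flipping rule, under the assumption that the recorder knows whether the branch condition evaluated to true while recording. The names `guard_for` and `guard_holds` are made up for this illustration and do not come from TigerShrimp.

```python
import operator

COMPARE = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
           ">=": operator.ge, "==": operator.eq, "!=": operator.ne}
NEGATE = {"<": ">=", ">=": "<", ">": "<=", "<=": ">", "==": "!=", "!=": "=="}


def guard_for(op, condition_was_true):
    """Return the comparison that must hold to stay on the recorded path."""
    return op if condition_was_true else NEGATE[op]


def guard_holds(guard, a, b):
    """Evaluate a recorded guard; False means the trace has to bail out."""
    return COMPARE[guard](a, b)


# While recording `if (a < b)`, the condition was false and y() ran.
guard = guard_for("<", condition_was_true=False)
print(guard, guard_holds(guard, 5, 3), guard_holds(guard, 1, 3))
```

The recorded trace then contains a single guard, and any later execution for which the guard fails leaves the trace through a bailout.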

Prologue

initCode.push_back(
      {x86::ENTER, {IMMEDIATE, .val = Value(0)}, {IMMEDIATE, .val = Value(0)}});

Op rdi = {REGISTER, .reg = RDI};
Op rsi = {REGISTER, .reg = RSI};
Op rax = {REGISTER, .reg = RAX};
Op rdiPtr = {MEMORY, .mem = {RDI, 0}};
initCode.push_back({x86::PUSH, rdi}); // arg1 ExitInformation
initCode.push_back({x86::PUSH, rsi}); // arg2 handleTraceExit
initCode.push_back({x86::MOV, rdi, rdiPtr}); // ExitInformation->variables

struct ExitInformation {
  Value::Data* variables;
  uint8_t** traces;
};
extern "C" int handleTraceExit(ExitInformation*, int);

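ExitInformation describes the stack state of the current execution. An array simulates the stack, so that when native code jumps back to the interpreter because a branch check failed, the stack state can be restored and interpretation can continue.

traces stores the native code address for every exit point and is used to implement trace stitching. Put simply, when a branch check fails, control does not return to the interpreter right away; the exit handler first checks whether another hot path exists for this exit point and, if so, transfers control to it directly.

Bailout

If a branch check fails, execution jumps straight to the bailout code. By then the RSI register holds the id of the current exit point (which maps back to a pc), so that execution can later resume in the interpreter.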

void Compiler::compileBailoutFor(Op label) {
  if (operandStack.size() != 0) {
    cerr << "Operand stack not empty upon leaving trace" << endl;
    throw;
  }
  bailoutCode.push_back({x86::LABEL, label});
  long idForPc = exitId++;
  exitPoints[idForPc] = label.pc;
  bailoutCode.push_back(
      {x86::MOV, {REGISTER, .reg = RSI}, {IMMEDIATE, .val = Value(idForPc)}});
  bailoutCode.push_back({x86::JMP, exitLabel});
}

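Because the physical stack is not used during execution (everything goes through the ExitInformation->variables array), at this point RAX holds handleTraceExit and RDI holds ExitInformation, and control jumps into _handleTraceExit.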

bailoutCode.push_back({x86::LABEL, exitLabel});
bailoutCode.push_back({x86::POP, rax});
bailoutCode.push_back({x86::POP, rdi});
bailoutCode.push_back({x86::LEAVE});
bailoutCode.push_back({x86::JMP, rax});

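asm("_handleTraceExit:;"
    "push %rdi;" // save ExitInformation
    "mov $8, %r8;" // r8 = 8, the offset of traces inside ExitInformation
    "mov (%rdi, %r8), %rdi;" // rdi = *(rdi + 8) = traces
    "mov (%rdi, %rsi, 8), %r8;" // r8 = traces[rsi]; rsi is the exit id
    "pop %rdi;" // rdi = ExitInformation
    "cmp $0, %r8;" // is there a trace for this exit?
    "je return;" // no: jump to the return label
    "lea _handleTraceExit(%rip), %rsi;" // rsi = &_handleTraceExit
    "jmp *%r8;" // yes: tail-jump into the stitched trace
    "return:;" // return label
    "mov %rsi, %rax;" // return value = rax = exit id
    "ret;");

The handler looks up whether a hot path exists for the current exit. If so, it jumps into that trace and keeps executing; otherwise it returns the exit id to the caller.

JMP is used as a tail call, which avoids the cost of repeated function calls.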
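The control flow of the exit handler can be modelled in a few lines. This is only a sketch: compiled traces are stood in for by Python callables that return the id of the exit they leave through, and the loop plays the role of the tail jump. The function name mirrors handleTraceExit, but the list-of-callables representation is an assumption for the example.

```python
def handle_trace_exit(traces, exit_id):
    """Follow stitched traces until an exit has no trace attached.

    traces[exit_id] is either None or a callable standing in for native code.
    Returns the exit id the interpreter should resume from.
    """
    while True:
        trace = traces[exit_id]
        if trace is None:
            return exit_id       # nothing stitched here: back to the interpreter
        exit_id = trace()        # run the stitched trace, get its exit id


# Exit 0 is stitched to a trace leaving through exit 2,
# exit 2 is stitched to a trace leaving through exit 1, exit 1 has no trace.
traces = [lambda: 2, None, lambda: 1]
print(handle_trace_exit(traces, 0))
```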

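Summary

To keep the implementation simple, TigerShrimp interprets bytecode directly and skips the laborious parser and AST stage. To make on-stack replacement (OSR) simple, it does not use the physical stack at all and simulates it with an array, which makes falling back to the interpreter easy and the whole design easy to understand.

Dissecting the Skynet Timing Wheel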

2023-08-26T00:20:20.000Z

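Introduction

Timers are usually implemented on top of an ordered data structure, typically a red-black tree, a skip list, a min-heap, or a timing wheel.

Of these, the min-heap is the easiest to implement and the red-black tree the hardest.

Skynet probably chose a timing wheel because of multithreading: insertion into a timing wheel has a lower average complexity than the alternatives, which suits multithreaded use well.

This post briefly dissects Skynet's timing wheel. The code below has been trimmed for readability.

The timing wheel

The implementation is an array of linked lists.

First a linked list is defined whose nodes store an expiry time. The tail pointer shows that nodes are appended at the tail, which makes sense: timers inserted later run later.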

struct timer_node {
struct timer_node *next;
uint32_t expire;
};

struct link_list {
struct timer_node head;
struct timer_node *tail;
};

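The timing wheel structure contains a spinlock. The wheel is accessed from multiple threads, and because the critical section on insertion is short, a spinlock is used rather than a mutex.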

#define TIME_NEAR_SHIFT 12
#define TIME_NEAR (1 << TIME_NEAR_SHIFT)
#define TIME_LEVEL_SHIFT 5
#define TIME_LEVEL (1 << TIME_LEVEL_SHIFT)
#define TIME_NEAR_MASK (TIME_NEAR-1)
#define TIME_LEVEL_MASK (TIME_LEVEL-1)

struct timer {
struct link_list near[TIME_NEAR];
struct link_list t[4][TIME_LEVEL];
struct spinlock lock;
uint32_t time;
uint64_t current;
uint64_t current_point;
};

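As the definition shows, Skynet's timing wheel has 5 levels. Only the near array is ever executed; the other 4 levels are never executed directly.

The 5 levels exist to save memory; otherwise one could simply define one huge array in which every slot holds the tasks for one tick.

The first level has 1 << 12 = 4096 slots, and levels 2 to 5 have 1 << 5 = 32 slots each.

The largest timer value therefore spans 12 + (4 * 5) = 32 bits.

Each time the whole near array has been traversed, one slot is taken from the higher levels and redistributed into the near array, and timing continues.

With the data structure roughly understood, let's look at the initialization logic.

Timer initialization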

void
skynet_timer_init(void) {
TI = timer_create_timer();
uint32_t current = 0;
TI->current = current;
TI->current_point = gettime();
}

It all comes down to initializing the linked lists.

static struct timer *
timer_create_timer() {
struct timer *r=(struct timer *)skynet_malloc(sizeof(struct timer));
memset(r,0,sizeof(*r));

int i,j;

for (i=0;i<TIME_NEAR;i++) {
link_clear(&r->near[i]);
}

for (i=0;i<4;i++) {
for (j=0;j<TIME_LEVEL;j++) {
link_clear(&r->t[i][j]);
}
}

SPIN_INIT(r)

r->current = 0;

return r;
}

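gettime uses clock_gettime with the monotonic clock, so changes to the system time have no effect. clock_gettime has nanosecond resolution; this function converts the value into units of 10 milliseconds, from which we can infer that the timing wheel's precision is 10 ms.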

static uint64_t
gettime() {
uint64_t t;
struct timespec ti;
clock_gettime(CLOCK_MONOTONIC, &ti);
t = (uint64_t)ti.tv_sec * 100; // seconds -> ticks of 10 ms
t += ti.tv_nsec / 10000000; // nanoseconds -> ticks of 10 ms
return t;
}

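Timer update

Timer updates are performed by the timer thread, which triggers an update every 100 microseconds (0.1 ms). The outer loop runs every 0.1 ms even though the wheel's precision is 10 ms in order to leave enough time for timer callbacks to run; otherwise a slow callback could make timers fire later and later.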

static void *
thread_timer(void *p) {
for (;;) {
skynet_updatetime();
usleep(100);
}
}

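skynet_updatetime also handles time going backwards, although I believe this can never trigger, because clock_gettime reads CLOCK_MONOTONIC, the time elapsed since boot, which never goes backwards.

The (cp != TI->current_point) check is there because updates run every 0.1 ms while the wheel's precision is 10 ms, so an update may run before a full tick has elapsed. Only when at least one tick has passed is timer_update called, once per elapsed tick.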

int
skynet_updatetime(void) {
int count = 0;
uint64_t cp = gettime();
if(cp < TI->current_point) {
skynet_error(NULL, "ERROR: time diff error: change from %lld to %lld", cp, TI->current_point);
TI->current_point = cp;
} else if (cp != TI->current_point) {
uint32_t diff = (uint32_t)(cp - TI->current_point);
TI->current_point = cp;
TI->current += diff;
int i;
for (i=0;i<diff;i++) {
count += timer_update(TI);
}
}
return count;
}

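Running the wheel

The spinlock is taken first, then callbacks whose timeout is 0 are executed. Otherwise, once the later shift step moves timers from the higher levels into the near wheel and the wheel advances, those callbacks could never be reached again.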

static int
timer_update(struct timer *T) {
int count = 0;

SPIN_LOCK(T);

// try to dispatch timeout 0 (rare condition)
count += timer_execute(T);

// shift time first, and then dispatch timer message
timer_shift(T);

count += timer_execute(T);

SPIN_UNLOCK(T);
return count;
}

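The execution logic is simple: the current time (in ticks) % 4096 selects the slot whose list should run; the code uses a bitwise & for speed. Note one detail: the lock is released while the callback list is being dispatched.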

static inline int
timer_execute(struct timer *T) {
int count = 0;
int idx = T->time & TIME_NEAR_MASK;

while (T->near[idx].head.next) {
struct timer_node *current = link_clear(&T->near[idx]);
SPIN_UNLOCK(T);
// dispatch_list don't need lock T
count += dispatch_list(current);
SPIN_LOCK(T);
}
return count;
}

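The shift operation

This function uses a lot of bit operations, which makes it look hard to read. Start with the while condition: it takes the current tick, ct, modulo 4096. If the result is 0, every slot of the near level has been visited, which means 4096 * 10 ms = 40.96 seconds have passed. The next step is to find the right level and the right slot within it and redistribute that slot into the near array, i.e. ct / 4096 % 32. If that is 0, the slot is not on this level and we divide by another 32. In short, read & as modulo and >> as division and the logic of the function becomes self-evident.

T->time keeps increasing and eventually wraps around to 0. The maximum uint32 value is 4294967295; when it wraps to 0 the true value is 4294967296, and 4294967296 / 4096 / 32 / 32 / 32 / 32 is 1. This means slot 0 of the fourth level, t[3][0], has to be moved into the near array.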

static void
timer_shift(struct timer *T) {
int mask = TIME_NEAR;
uint32_t ct = ++T->time;
if (ct == 0) {
move_list(T, 3, 0);
} else {
uint32_t time = ct >> TIME_NEAR_SHIFT;
int i=0;

while ((ct & (mask-1))==0) {
int idx=time & TIME_LEVEL_MASK;
if (idx!=0) {
move_list(T, i, idx);
break;
}
mask <<= TIME_LEVEL_SHIFT;
time >>= TIME_LEVEL_SHIFT;
++i;
}
}
}
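Rewriting the shift logic with division and modulo, as suggested above, might look like the sketch below. It only computes which (level, slot) list would be passed to move_list for a given tick; the function name `cascade_target` is made up, and the constants are copied from the macros shown earlier.

```python
TIME_NEAR = 4096   # 1 << TIME_NEAR_SHIFT
TIME_LEVEL = 32    # 1 << TIME_LEVEL_SHIFT


def cascade_target(ct):
    """Return the (level, slot) to move into near at tick ct, or None."""
    ct &= 0xFFFFFFFF                 # T->time is a uint32_t and wraps around
    if ct == 0:
        return (3, 0)                # wrap-around: the top level's slot 0
    if ct % TIME_NEAR != 0:
        return None                  # still inside the near array
    rest = ct // TIME_NEAR
    for level in range(4):
        slot = rest % TIME_LEVEL
        if slot != 0:
            return (level, slot)
        rest //= TIME_LEVEL          # not on this level, look one level up
    return None


print(cascade_target(4096), cascade_target(4096 * 32), cascade_target(2 ** 32))
```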

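Adding to the wheel

timer_add shows that the time argument is not a duration in wall-clock units but a number of ticks. For example, time = 10 means the timer expires after 10 ticks, i.e. after 10 * 10 ms. expire is the absolute tick at which the timer fires: the current tick plus time.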

static void
timer_add(struct timer *T,void *arg,size_t sz,int time) {
struct timer_node *node = (struct timer_node *)skynet_malloc(sizeof(*node)+sz);
memcpy(node+1,arg,sz);

SPIN_LOCK(T);

node->expire=time+T->time;
add_node(T,node);

SPIN_UNLOCK(T);
}

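Another pile of bit operations: it decides which level expire should be inserted into. Rewriting it without them would take a large number of if checks.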

static void
add_node(struct timer *T,struct timer_node *node) {
uint32_t time=node->expire;
uint32_t current_time=T->time;

if ((time|TIME_NEAR_MASK)==(current_time|TIME_NEAR_MASK)) {
link(&T->near[time&TIME_NEAR_MASK],node);
} else {
int i;
uint32_t mask=TIME_NEAR << TIME_LEVEL_SHIFT;
for (i=0;i<3;i++) {
if ((time|(mask-1))==(current_time|(mask-1))) {
break;
}
mask <<= TIME_LEVEL_SHIFT;
}
link(&T->t[i][((time>>(TIME_NEAR_SHIFT + i*TIME_LEVEL_SHIFT)) & TIME_LEVEL_MASK)],node);
}
}
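To make the level selection easier to follow, here is a line-by-line transcription of add_node into Python that returns the chosen slot instead of linking a node. The function name `slot_for` and the tuple return value are choices made for this sketch.

```python
TIME_NEAR_SHIFT = 12
TIME_LEVEL_SHIFT = 5
TIME_NEAR_MASK = (1 << TIME_NEAR_SHIFT) - 1
TIME_LEVEL_MASK = (1 << TIME_LEVEL_SHIFT) - 1


def slot_for(expire, current):
    """Return ("near", index) or (level, index) for an expiry tick."""
    # Same upper bits as the current tick: fires within this near round.
    if (expire | TIME_NEAR_MASK) == (current | TIME_NEAR_MASK):
        return ("near", expire & TIME_NEAR_MASK)
    mask = (1 << TIME_NEAR_SHIFT) << TIME_LEVEL_SHIFT
    level = 0
    while level < 3:
        if (expire | (mask - 1)) == (current | (mask - 1)):
            break
        mask <<= TIME_LEVEL_SHIFT
        level += 1
    shift = TIME_NEAR_SHIFT + level * TIME_LEVEL_SHIFT
    return (level, (expire >> shift) & TIME_LEVEL_MASK)


print(slot_for(10, 0), slot_for(4096, 0), slot_for(1 << 17, 0))
```

A timer lands in the lowest level whose range still covers the distance between the current tick and the expiry tick, which is why insertion costs only a handful of comparisons.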

A Lua Memory Monitoring Tool

2023-08-25T14:42:20.000Z

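Background

Lua projects usually need tooling for memory monitoring. Among the open-source tools there is lua-snapshot, but its drawback is a high overhead: when its interface is called it scans the entire GC list, finds and tallies every GC object, and then creates a large number of Lua objects to hold the result. When memory is already high, using the tool may well trigger an OOM or a long stop-the-world pause, leaving the service unable to operate normally.

As a complement, we wanted a tool that records where every object is allocated and how large it is, so that problems can be pinned down to the exact code location.

The final result is shown in the figure below:

Implementation

Lua allows frealloc to be replaced, which makes monitoring memory allocation possible. The remaining question is how to record the information while keeping the performance cost down. The information we need is the file name and the line number.

Memory allocation

All memory allocation in Lua goes through realloc, so at allocation time we can simply walk the CallInfo call stack, take the file name and line number of the topmost Lua function, and record them.

Memory release

When an object is freed, the right Lua call stack is no longer available (whatever is found is the file name and line number at the moment garbage collection was triggered). The allocation site therefore has to be recorded on the memory object at allocation time. To look up the allocation site of an address quickly, a cookie is added to each memory object.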

struct mem_cookie {
    uint16_t proto_id; // file name ID
    uint16_t line;
}; // 4 bytes

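Growing and shrinking memory

Most growing and shrinking happens in the array or hash part of a table. Taking the call stack at the point of the resize would yield a file name and line number that do not match where the table was created. Instead, the table pointer is recorded in global_State and its cookie is read, which avoids walking the call stack and is more accurate.

File name optimization

We use proto_id to stand for the file name, for the following reasons:

A proto may be freed, so the proto->source pointer cannot be passed around.
Copying the proto->source string would be comparatively expensive.

So each proto is given a numeric ID and a proto_id -> source table is maintained. This can be done by modifying the implementation of loadfile.

Line number optimization

In Lua 5.4 line numbers are stored as relative offsets, and memory allocation is a high-frequency operation, so the absolute line number table of Lua 5.3 is ported over and the line is looked up directly without any computation.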
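The cookie and the proto_id -> source mapping can be illustrated with a short model. This is not the tool's code: `ProtoTable`, `pack_cookie` and `unpack_cookie` are hypothetical names, and the little-endian layout is an assumption made so that the 4-byte size is visible.

```python
import struct


class ProtoTable:
    """Interns source names so a cookie only has to carry a 16-bit ID."""

    def __init__(self):
        self._ids = {}
        self._sources = []

    def intern(self, source):
        proto_id = self._ids.get(source)
        if proto_id is None:
            proto_id = len(self._sources)
            if proto_id > 0xFFFF:
                raise OverflowError("proto_id does not fit in uint16_t")
            self._ids[source] = proto_id
            self._sources.append(source)
        return proto_id

    def source(self, proto_id):
        return self._sources[proto_id]


def pack_cookie(proto_id, line):
    """Pack (proto_id, line) as two unsigned 16-bit fields: 4 bytes."""
    return struct.pack("<HH", proto_id, line)


def unpack_cookie(raw):
    return struct.unpack("<HH", raw)


table = ProtoTable()
cookie = pack_cookie(table.intern("@main.lua"), 42)
proto_id, line = unpack_cookie(cookie)
print(len(cookie), table.source(proto_id), line)
```

Because the cookie stores an ID rather than a pointer, it stays valid even after the proto it came from has been freed.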

Python 3 Source Code Analysis (3): Garbage Collection

2022-05-01T06:34:28.000Z

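This post dissects CPython's automatic garbage collection and tries to suggest some ideas for improving it.

Reference counting

Anyone with a computing background, even without much knowledge of garbage collection, will have heard of reference counting. It dates back to the last century. The main idea is to attach a counter to every object; when the counter reaches 0, nobody is using the object any more and it can safely be deleted.

The method looks crude, but in real projects it has the advantage of releasing memory more promptly and at a more predictable moment. This is why some projects try to add reference-counting collection to Lua to keep memory from growing too fast.

Everything has trade-offs, and its weakness is just as obvious: it cannot handle reference cycles.

Here is a very common example in Python.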

class A:  
    pass  
  
class B:  
    pass

a = A()
b = B()
a.b = b
b.a = a
del a
del b

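In the code above we deleted a and b by hand, so they should be freed. But because a and b reference each other, their reference counts never drop to 0, which would leak memory. CPython's solution is very simple: every object that could take part in a reference cycle is linked into a doubly linked list, which is scanned in a mark-sweep fashion starting from the root objects; objects that cannot be reached can be freed. Ordinary objects are freed directly by reference counting, which is simple and fast.

How can this be verified? By contradiction: after del a and del b, call gc.collect() and see whether the objects get collected. If they do, reference counting alone had failed to free them.

import gc

# Set the debug flag so that collected objects are stored in the gc.garbage list
gc.set_debug(gc.DEBUG_SAVEALL)

# Collect the generation 0 garbage
gc.collect(0)

# Print the collected garbage objects
print(gc.garbage)

The output shows that reference counting did fail here, because the mark-sweep collection was able to collect these two objects.

[<__main__.A object at 0x10adefc10>, <__main__.B object at 0x10adeff70>, {'b': <__main__.B object at 0x10adeff70>}, {'a': <__main__.A object at 0x10adefc10>}]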
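The same experiment can be written without the debug flag by watching a weak reference. This sketch assumes CPython, where an object is freed as soon as its reference count reaches 0; the class `Node` and the helper `make_cycle` are names chosen for the example.

```python
import gc
import weakref


class Node:
    pass


def make_cycle():
    """Create two objects that reference each other; return a weak probe."""
    a, b = Node(), Node()
    a.other, b.other = b, a
    return weakref.ref(a)


gc.disable()                      # rule out an automatic collection in between
probe = make_cycle()              # a and b are out of scope now
alive_after_del = probe() is not None
gc.collect()                      # the cycle collector frees the pair
alive_after_collect = probe() is not None
gc.enable()
print(alive_after_del, alive_after_collect)
```

The weak reference does not keep the object alive, so it turns into None exactly when the cycle collector, not reference counting, reclaims the pair.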

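Next, let's look in the CPython source at how reference counting manages an object. We take integers as the example and first look at the object model of the integer object.

// Base of every object: holds the doubly linked list pointers and the reference count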
typedef struct _object {
    struct _object *_ob_next;
    struct _object *_ob_prev;
    Py_ssize_t ob_refcnt;
    struct _typeobject *ob_type;
} PyObject;

typedef struct {
    PyObject ob_base;
    Py_ssize_t ob_size;
} PyVarObject;

struct _longobject {
    PyVarObject ob_base;
    digit ob_digit[1];
};

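Every object appears to carry a doubly linked list, but note that this is not the list later used by the mark-sweep collector. This list links every object into refchain, and as far as the code shows it is only used for debugging (note the Py_TRACE_REFS guard below).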

void
_Py_AddToAllObjects(PyObject *op, int force)
{
    if (force || op->_ob_prev == NULL) {
        op->_ob_next = refchain._ob_next;
        op->_ob_prev = &refchain;
        refchain._ob_next->_ob_prev = op;
        refchain._ob_next = op;
    }
}

void
_Py_NewReference(PyObject *op)
{
    if (_Py_tracemalloc_config.tracing) {
        _PyTraceMalloc_NewReference(op);
    }
#ifdef Py_REF_DEBUG
    _Py_RefTotal++;
#endif
    Py_SET_REFCNT(op, 1);
#ifdef Py_TRACE_REFS
    _Py_AddToAllObjects(op, 1);
#endif
}

static inline PyObject*
_PyObject_INIT(PyObject *op, PyTypeObject *typeobj)
{
    Py_TYPE(op) = typeobj;
    if (PyType_GetFlags(typeobj) & Py_TPFLAGS_HEAPTYPE) {
        Py_INCREF(typeobj);
    }
    _Py_NewReference(op);
    return op;
}

PyLongObject *
_PyLong_New(Py_ssize_t size)
{
    PyLongObject *result;
    ...
    result = PyObject_MALLOC(offsetof(PyLongObject, ob_digit) +
                             size*sizeof(digit));
    ...
    return (PyLongObject*)PyObject_INIT_VAR(result, &PyLong_Type, size);
}

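That covers reference counting in CPython; overall it is very simple. Next we look at how container objects (the objects that can form reference cycles) are garbage collected.

Mark-sweep

Garbage collection has long been divided into a few major schools. The two most prominent are mark-sweep and mark-compact. Let's start with mark-compact, described here in its copying form.

Suppose the language's memory pool is split into two halves, one left unused and the other used for creating objects. When garbage collection starts, all reachable (that is, live) objects are marked, the marked objects are recreated in the other half, and the original half is released in one go. That completes the compaction.

This approach relies on one assumption: garbage objects far outnumber live ones. In that case compaction is much faster, because the whole pool is destroyed at once.

CPython chose mark-sweep, so we won't digress further and will focus on mark-sweep.

Starting from the root objects, any object that can be reached by scanning is a reachable object, which means it is in use and must not be cleaned up. In the end we obtain a list of unreachable objects and clean those up.

Because mark-sweep performs marking and sweeping in one go, the stop-the-world pause can be very long. This led to generational garbage collection, which is what CPython currently uses.

Generational garbage collection

Generational collection rests on one assumption: most objects live briefly and a few live long. The young generation can therefore be collected first and the old generation less often, which addresses the long pauses of mark-sweep.

Let's take a quick look at how generational collection is implemented, using a container object as the example: list.

Below is the object model of list. Since this post is about garbage collection, the other members are ignored.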

typedef struct {
    PyVarObject ob_base;
    PyObject **ob_item;
    Py_ssize_t allocated;
} PyListObject;

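The structure is again very simple, with the same reference count and doubly linked list (inside PyVarObject). That raises a question: this doubly linked list links all objects into refchain, so where is the list that the generational collector scans?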

/* GC information is stored BEFORE the object structure. */
typedef struct {
    // Pointer to next object in the list.
    // 0 means the object is not tracked
    uintptr_t _gc_next;

    // Pointer to previous object in the list.
    // Lowest two bits are used for flags documented later.
    uintptr_t _gc_prev;
} PyGC_Head;

void
_PyObject_GC_Link(PyObject *op)
{
    PyGC_Head *g = AS_GC(op);
    PyThreadState *tstate = _PyThreadState_GET();
    GCState *gcstate = &tstate->interp->gc;
    g->_gc_next = 0;
    g->_gc_prev = 0;
    gcstate->generations[0].count++; /* number of allocated GC objects */
    if (gcstate->generations[0].count > gcstate->generations[0].threshold &&
        gcstate->enabled &&
        gcstate->generations[0].threshold &&
        !gcstate->collecting &&
        !_PyErr_Occurred(tstate))
    {
        gcstate->collecting = 1;
        gc_collect_generations(tstate);
        gcstate->collecting = 0;
    }
}

PyObject *
_PyType_AllocNoTrack(PyTypeObject *type, Py_ssize_t nitems)
{
    PyObject *obj;
    const size_t size = _PyObject_VAR_SIZE(type, nitems+1);

    // Compute the real allocation size (object plus pre-header)
    const size_t presize = _PyType_PreHeaderSize(type);
    char *alloc = PyObject_Malloc(size + presize);
    if (alloc  == NULL) {
        return PyErr_NoMemory();
    }
    obj = (PyObject *)(alloc + presize);
    if (presize) {
        ((PyObject **)alloc)[0] = NULL;
        ((PyObject **)alloc)[1] = NULL;
        _PyObject_GC_Link(obj);
    }
    memset(obj, '\0', size);

    if (type->tp_itemsize == 0) {
        _PyObject_Init(obj, type);
    }
    else {
        _PyObject_InitVar((PyVarObject *)obj, type, nitems);
    }
    return obj;
}

// The allocator used when constructing a list
PyObject *
PyType_GenericAlloc(PyTypeObject *type, Py_ssize_t nitems)
{
    PyObject *obj = _PyType_AllocNoTrack(type, nitems);
    if (obj == NULL) {
        return NULL;
    }

    if (_PyType_IS_GC(type)) {
        _PyObject_GC_TRACK(obj);
    }
    return obj;
}

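As the code shows, when an object that needs scanning is created, the extra header size is computed up front and a PyGC_Head is placed in front of the object to serve as the list node for generational collection. _PyObject_GC_Link is then called and may trigger a collection: if creating the object pushes the generation's count over its threshold, a collection runs. Only at the end is _PyObject_GC_TRACK called to link the object into the generation 0 GC list.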

// Lowest bit of _gc_next is used for flags only in GC.
// But it is always 0 for normal code.
#define _PyGCHead_NEXT(g)        ((PyGC_Head*)(g)->_gc_next)
#define _PyGCHead_SET_NEXT(g, p) _Py_RVALUE((g)->_gc_next = (uintptr_t)(p))

// Lowest two bits of _gc_prev is used for _PyGC_PREV_MASK_* flags.
#define _PyGCHead_PREV(g) ((PyGC_Head*)((g)->_gc_prev & _PyGC_PREV_MASK))
#define _PyGCHead_SET_PREV(g, p) do { \
    assert(((uintptr_t)p & ~_PyGC_PREV_MASK) == 0); \
    (g)->_gc_prev = ((g)->_gc_prev & ~_PyGC_PREV_MASK) \
        | ((uintptr_t)(p)); \
    } while (0)

static inline void _PyObject_GC_TRACK(
    PyObject *op)
{
    PyGC_Head *gc = _Py_AS_GC(op);

    PyInterpreterState *interp = _PyInterpreterState_GET();
    PyGC_Head *generation0 = interp->gc.generation0;
    PyGC_Head *last = (PyGC_Head*)(generation0->_gc_prev);
    _PyGCHead_SET_NEXT(last, gc);
    _PyGCHead_SET_PREV(gc, last);
    _PyGCHead_SET_NEXT(gc, generation0);
    generation0->_gc_prev = (uintptr_t)gc;
}

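The macros show that CPython uses the lowest two bits of the address for flags. This works because CPython has its own small-object allocator whose addresses are aligned to at least 4 bytes, which guarantees that the low two bits are 0. It is a common trick and needs no further comment.

Now let's turn to the most important part, the collection process itself.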

static Py_ssize_t
gc_collect_generations(PyThreadState *tstate)
{
    GCState *gcstate = &tstate->interp->gc;
    Py_ssize_t n = 0;
    for (int i = NUM_GENERATIONS-1; i >= 0; i--) {
        if (gcstate->generations[i].count > gcstate->generations[i].threshold) {
            if (i == NUM_GENERATIONS - 1
                && gcstate->long_lived_pending < gcstate->long_lived_total / 4)
                continue;
            n = gc_collect_with_callback(tstate, i);
            break;
        }
    }
    return n;
}

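Collection starts from the oldest generation. CPython has 3 generations by default: 0, 1, and 2. To avoid running full collections too often there is an extra condition: the oldest generation is collected only if the number of objects that survived the younger collections and are waiting for their first full collection (long_lived_pending) exceeds 25% of the objects that survived the last full collection (long_lived_total). The main reason is that the mark-sweep pass runs in one go, so full collections should be avoided as far as possible.

Then we enter the main collection function.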

static Py_ssize_t
gc_collect_with_callback(PyThreadState *tstate, int generation)
{
    ...
    Py_ssize_t result, collected, uncollectable;
    result = gc_collect_main(tstate, generation, &collected, &uncollectable, 0);
    ...
    return result;
}

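Before reading on, one more piece of background: the three generations have thresholds. Only the threshold of generation 0, the youngest, is a number of objects; the other two count collections of the next younger generation. The defaults are (700, 10, 10): triggering a generation 0 collection takes about 700 newly created objects, triggering a generation 1 collection takes 10 generation 0 collections, and triggering a generation 2 collection takes 10 generation 1 collections (plus the extra condition described above).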
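The thresholds can be inspected and changed from Python through the gc module. The snippet below is a small sketch; the value 500 is arbitrary, and the defaults printed on your interpreter may differ from (700, 10, 10) depending on the CPython version.

```python
import gc

original = gc.get_threshold()     # a 3-tuple, one entry per generation
gc.set_threshold(500, 5, 5)       # arbitrary values, for illustration only
changed = gc.get_threshold()
gc.set_threshold(*original)       # restore the interpreter's settings
counts = gc.get_count()           # current counters of the three generations
print(original, changed, counts)
```

Lowering the first threshold makes generation 0 collections more frequent and shorter; raising it trades memory for fewer pauses.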

static Py_ssize_t
gc_collect_main(PyThreadState *tstate, int generation,
                Py_ssize_t *n_collected, Py_ssize_t *n_uncollectable,
                int nofail)
{
    int i;
    Py_ssize_t m = 0; /* # objects collected */
    Py_ssize_t n = 0; /* # unreachable objects that couldn't be collected */
    PyGC_Head *young; /* the generation we are examining */
    PyGC_Head *old; /* next older generation */
    PyGC_Head unreachable; /* non-problematic unreachable trash */
    PyGC_Head finalizers;  /* objects with, & reachable from, __del__ */
    PyGC_Head *gc;
    _PyTime_t t1 = 0;   /* initialize to prevent a compiler warning */
    GCState *gcstate = &tstate->interp->gc;

    // Bump the count of the next older generation so that it gets collected eventually
    if (generation+1 < NUM_GENERATIONS)
        gcstate->generations[generation+1].count += 1;

    // Reset the counts of this generation and all younger ones, since [0, generation] is processed in full
    for (i = 0; i <= generation; i++)
        gcstate->generations[i].count = 0;

    // Merge the younger generations into the list of the current generation
    for (i = 0; i < generation; i++) {
        gc_list_merge(GEN_HEAD(gcstate, i), GEN_HEAD(gcstate, generation));
    }

    // young = [0, generation]
    young = GEN_HEAD(gcstate, generation);
    if (generation < NUM_GENERATIONS-1)
        // collecting generation 1: old is generation 2; collecting generation 0: old is generation 1
        old = GEN_HEAD(gcstate, generation+1);
    else
        // collecting generation 2: old is generation 2 as well
        old = young;

    // The core: copy the reference count of every object in young into _gc_prev.
    // Modifying the real reference counts could free objects that must not be freed,
    // whereas computing on a copy is safe. This function is discussed in detail later.
    deduce_unreachable(young, &unreachable);
    // Find tuples that can stop being tracked to reduce GC work; an example follows later
    untrack_tuples(young);

    // Promote the reachable objects to the next older generation
    if (young != old) {
        if (generation == NUM_GENERATIONS - 2) {
            gcstate->long_lived_pending += gc_list_size(young);
        }
        gc_list_merge(young, old);
    }
    else {
        // Likewise, find dicts that can stop being tracked; only done in a full GC, explained later
        untrack_dicts(young);
        gcstate->long_lived_pending = 0;
        gcstate->long_lived_total = gc_list_size(young);
    }

    // Handle objects with legacy finalizers (__del__, similar to Lua's __gc)
    gc_list_init(&finalizers);
    move_legacy_finalizers(&unreachable, &finalizers);
    move_legacy_finalizer_reachable(&finalizers);

    /* Clear weakrefs and invoke callbacks as necessary. */
    m += handle_weakrefs(&unreachable, old);

    // Call tp_finalize, i.e. __del__
    finalize_garbage(tstate, &unreachable);

    // Handle resurrected objects
    PyGC_Head final_unreachable;
    handle_resurrected_objects(&unreachable, &final_unreachable, old);

    // Break the reference cycles by clearing the containers
    m += gc_list_size(&final_unreachable);
    delete_garbage(tstate, gcstate, &final_unreachable, old);

    // Handle the uncollectable objects with legacy finalizers (debug output, gc.garbage)
    handle_legacy_finalizers(tstate, gcstate, &finalizers, old);

    // A collection of the oldest generation also clears the free lists
    if (generation == NUM_GENERATIONS-1) {
        clear_freelists(tstate->interp);
    }
    ...
}

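Apart from the amount of code, the whole thing is quite simple. Next we resolve the questions left open above.

How are root objects found?
What is untrack_tuples?
Why is untrack_dicts only called during a full GC?

Starting with the second question: to speed up collection, a tuple that contains no nested containers is removed from GC tracking and relies on plain reference counting alone. This is easy to demonstrate.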

a = (1, 2)  
print(gc.is_tracked(a))  # True
gc.collect()
print(gc.is_tracked(a))  # False

So untracking a tuple is a lazy process. Now let's extend this to dict.

a = {"a": 1}
print(gc.is_tracked(a))  # False
a["b"] = {}
print(gc.is_tracked(a))  # True
a.pop("b")
print(gc.is_tracked(a))  # True

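So a dict that holds no complex objects is not tracked. Could the same idea be applied to list?

Back to question 1: how are root objects found? Readers who know Lua know that every Lua object can be traced from the global registry table, but that is not feasible in Python. The main reason is that the way Python extension modules work makes it impossible to determine the root set, which raises the complexity considerably.

CPython's solution is simple too: combine reference counting with mark-sweep. It copies each reference count (operating on the real count would be too dangerous, since accidentally reaching 0 would trigger refcount deallocation), then traverses the containers and subtracts 1 from the copy for every reference found, which yields a relative reference count. An object whose relative count is 0 may be unreachable, so it is tentatively assumed to be. The reachable objects are then traversed, and any object with a relative count of 0 that can be reached from them is reachable after all and is restored.

This is a bit convoluted, but quite simple once you think it through.

Now the third question: why is untrack_dicts only called during a generation 2 (full) collection?
Mainly because when an object is inserted into a dict, the dict checks whether the object is a container and, if so, starts tracking itself. Checking in untrack_dicts on every collection whether the dict can be untracked again would be wasteful. If you are interested, read Issue #14775.

There is more I would have liked to cover. To pick one topic: in the Python 2 era, two objects that referenced each other and both had __del__ were not collected by the garbage collector. I'm too tired to expand on it here; if you are interested, read PEP 442.

Ideas

Incremental generational GC?

CPython's GC stops the world, even though it already tries hard to reduce the cost with generations. Could it be made incremental? My current idea is to add a barrier to container operations and maintain an intermediate state, so that the scanning phase becomes incremental and only the final garbage handling stops the world, shortening the pause. But this does not seem workable, because the root set is undetermined.

Tracking fewer objects?

Could other common containers be untracked as well, dropping tracking when a container holds no nested containers, to reduce the traversal cost? This idea has to be careful not to repeat the untrack_dicts mistake described above.

We could even extend the gcmodule interface so that long-lived resident objects can be marked and excluded from tracking.

Use __del__ as little as possible

Not much to say here; in Lua it is best to avoid __gc as well.

Finally, thanks to CPython for its beautifully designed code, which taught me a lot over this May Day holiday. Next I may go back to Lua 5.2 and read its "failed" generational GC; I think learning from failures matters much more than learning from successes.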
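The relative reference count idea can be simulated on a toy object graph. This is a sketch of the principle only, not CPython's deduce_unreachable: objects are plain names, `refcounts` holds their total counts, and `edges` lists the references held by each tracked container. All names are invented for the example.

```python
def find_unreachable(refcounts, edges):
    """Return the tracked objects that are only kept alive by each other."""
    gc_refs = dict(refcounts)            # work on a copy of the counts
    for targets in edges.values():
        for target in targets:
            gc_refs[target] -= 1         # cancel references from inside the set
    # A positive remainder means something outside the set refers to the object.
    stack = [name for name, count in gc_refs.items() if count > 0]
    reachable = set()
    while stack:
        name = stack.pop()
        if name in reachable:
            continue
        reachable.add(name)
        stack.extend(edges.get(name, ()))  # everything it refers to is alive too
    return set(refcounts) - reachable


# a and b only reference each other; c is held by a variable and references d.
refcounts = {"a": 1, "b": 1, "c": 1, "d": 1}
edges = {"a": ["b"], "b": ["a"], "c": ["d"], "d": []}
print(sorted(find_unreachable(refcounts, edges)))
```

Note that d also ends up with a relative count of 0, but it is rescued in the second pass because it is reachable from c. That is the "restore" step described above.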

Python 3 Source Code Analysis (2): Instruction Specialization

2022-04-09T07:19:28.000Z

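The previous post in this series, on the implementation of float, was based on the Python 3.10 source. Then I came across PEP 659, the proposal for the Specializing Adaptive Interpreter, and was hooked, because it is one of the features I had wanted to add to Lua to speed it up. Out of enthusiasm I decided to move to CPython 3.11, and this post dissects the implementation of instruction specialization by following the addition of two objects.

Adding two objects

First we use Python's built-in dis tool to analyse how two objects are added.

from dis import dis
def test():  
    a = 1.5  
    c = a + 1.3
dis(test)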

  3           0 RESUME                   0

  4           2 LOAD_CONST               1 (1.5)
              4 STORE_FAST               0 (a)

  5           6 LOAD_FAST                0 (a)
              8 LOAD_CONST               2 (1.3)
             10 BINARY_OP                0 (+)
             14 POP_TOP
             16 LOAD_CONST               0 (None)
             18 RETURN_VALUE

As shown, the opcode for adding two objects is BINARY_OP. Tracing it into CPython confirms that it ends up calling PyNumber_Add.

static const binaryfunc binary_ops[] = {
    [NB_ADD] = PyNumber_Add,
    [NB_AND] = PyNumber_And,
    .....
};
TARGET(BINARY_OP) {
    PREDICTED(BINARY_OP);
    PyObject *rhs = POP();
    PyObject *lhs = TOP();
    PyObject *res = binary_ops[oparg](lhs, rhs);
    Py_DECREF(lhs);
    Py_DECREF(rhs);
    SET_TOP(res);
    if (res == NULL) {
        goto error;
    }
    JUMPBY(INLINE_CACHE_ENTRIES_BINARY_OP);
    DISPATCH();
}

PyNumber_Add is simple as well: it first checks whether the two objects support this binary operator, and if not, whether they support concat.

PyObject *
PyNumber_Add(PyObject *v, PyObject *w)
{
    PyObject *result = BINARY_OP1(v, w, NB_SLOT(nb_add), "+");
    if (result != Py_NotImplemented) {
        return result;
    }
    Py_DECREF(result);

    PySequenceMethods *m = Py_TYPE(v)->tp_as_sequence;
    if (m && m->sq_concat) {
        result = (*m->sq_concat)(v, w);
        assert(_Py_CheckSlotResult(v, "+", result != NULL));
        return result;
    }

    return binop_type_error(v, w, "+");
}

binary_op1 examines the left and right objects in turn to see whether they support the addition.

static PyObject *
binary_op1(PyObject *v, PyObject *w, const int op_slot
#ifndef NDEBUG
           , const char *op_name
#endif
           )
{
    binaryfunc slotv;
    if (Py_TYPE(v)->tp_as_number != NULL) {
        slotv = NB_BINOP(Py_TYPE(v)->tp_as_number, op_slot);
    }
    else {
        slotv = NULL;
    }

    binaryfunc slotw;
    if (!Py_IS_TYPE(w, Py_TYPE(v)) && Py_TYPE(w)->tp_as_number != NULL) {
        slotw = NB_BINOP(Py_TYPE(w)->tp_as_number, op_slot);
        if (slotw == slotv) {
            slotw = NULL;
        }
    }
    else {
        slotw = NULL;
    }

    if (slotv) {
        PyObject *x;
        if (slotw && PyType_IsSubtype(Py_TYPE(w), Py_TYPE(v))) {
            x = slotw(v, w);
            if (x != Py_NotImplemented)
                return x;
            Py_DECREF(x); /* can't do it */
            slotw = NULL;
        }
        x = slotv(v, w);
        assert(_Py_CheckSlotResult(v, op_name, x != NULL));
        if (x != Py_NotImplemented) {
            return x;
        }
        Py_DECREF(x); /* can't do it */
    }
    if (slotw) {
        PyObject *x = slotw(v, w);
        assert(_Py_CheckSlotResult(w, op_name, x != NULL));
        if (x != Py_NotImplemented) {
            return x;
        }
        Py_DECREF(x); /* can't do it */
    }
    Py_RETURN_NOTIMPLEMENTED;
}

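A small binary operation thus goes through the following steps.

Determine the kind of binary operation (addition).
Determine the types of the two objects and check whether they support addition.
Check whether concat is supported.

If the types of the two objects were known in advance, and with them the binary operation to use, the whole chain of conditionals could be bypassed and execution could go straight to the core, saving a large number of branches and improving performance.

Instruction specialization

Approach

With that background in place, let's first think about how instruction specialization could be done.

The first question is when to specialize. Specializing every function each time it runs would likely cost more time than it saves. The reasoning is the same as for a JIT: only frequently called functions are worth optimizing.

The second question is what to do when specialization fails. Python is a scripting language, so the object passed in next time may well have a different type, and the specialization becomes invalid. Checking the object types on every run of the specialized path would bring us back to square one with little to gain. The answer is to cache some data right after the instruction and reduce the number of conditionals.

In practice

Now let's put instruction specialization into practice. From the analysis above, we first need to count how many times each function is executed. Remember the bytecode earlier? RESUME exists for exactly this.

When bytecode is generated at compile time, a RESUME instruction is emitted whenever a new scope is entered. This is specific to the new version.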

static int
compiler_enter_scope(struct compiler *c, identifier name,
                     int scope_type, void *key, int lineno)
{
    ....
    ADDOP_I(c, RESUME, 0);
}

co_warmup is incremented every time the scope is entered, and when it reaches 0 the quicken operation runs. Its initial value is currently -8.

#define QUICKENING_WARMUP_DELAY 8
#define QUICKENING_INITIAL_WARMUP_VALUE (-QUICKENING_WARMUP_DELAY)

static inline void
_PyCode_Warmup(PyCodeObject *code)
{
    if (code->co_warmup != 0) {
        code->co_warmup++;
        if (code->co_warmup == 0) {
            _PyCode_Quicken(code);
        }
    }
}

TARGET(RESUME) {
    _PyCode_Warmup(frame->f_code);
    JUMP_TO_INSTRUCTION(RESUME_QUICK);
}

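What is the quicken operation? It replaces the original instructions with adaptive instructions. An adaptive instruction also keeps a variable recording how often it has been entered, and only after a certain number of executions is specialization considered. The reason adaptive binary-operation instructions are not generated from the start is presumably to avoid some overhead, since some functions are called only a few times.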

uint8_t _PyOpcode_Adaptive[256] = {
    [LOAD_ATTR] = LOAD_ATTR_ADAPTIVE,
    [LOAD_GLOBAL] = LOAD_GLOBAL_ADAPTIVE,
    [LOAD_METHOD] = LOAD_METHOD_ADAPTIVE,
    [BINARY_SUBSCR] = BINARY_SUBSCR_ADAPTIVE,
    [STORE_SUBSCR] = STORE_SUBSCR_ADAPTIVE,
    [CALL] = CALL_ADAPTIVE,
    [PRECALL] = PRECALL_ADAPTIVE,
    [STORE_ATTR] = STORE_ATTR_ADAPTIVE,
    [BINARY_OP] = BINARY_OP_ADAPTIVE,
    [COMPARE_OP] = COMPARE_OP_ADAPTIVE,
    [UNPACK_SEQUENCE] = UNPACK_SEQUENCE_ADAPTIVE,
};

void
_PyCode_Quicken(PyCodeObject *code)
{
    _Py_QuickenedCount++;
    int previous_opcode = -1;
    _Py_CODEUNIT *instructions = _PyCode_CODE(code);
    for (int i = 0; i < Py_SIZE(code); i++) {
        int opcode = _Py_OPCODE(instructions[i]);
        uint8_t adaptive_opcode = _PyOpcode_Adaptive[opcode];
        if (adaptive_opcode) {
            _Py_SET_OPCODE(instructions[i], adaptive_opcode);
            // Make sure the adaptive counter is zero:
            assert(instructions[i + 1] == 0);
            previous_opcode = -1;
            i += _PyOpcode_Caches[opcode];
        }
        else {
            assert(!_PyOpcode_Caches[opcode]);
            switch (opcode) {
                ....
                case RESUME:
                    _Py_SET_OPCODE(instructions[i], RESUME_QUICK);
                    break;
                ....
            }
            previous_opcode = opcode;
        }
    }
}

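Here the adaptive instruction for BINARY_OP is BINARY_OP_ADAPTIVE. Attentive readers will also notice that quickening replaces RESUME with RESUME_QUICK. The reason is that once the function has been selected for specialization, counting every entry into it serves little purpose, so that work is specialized away as well to save some overhead.

BINARY_OP_ADAPTIVE hides a cache right after the instruction that stores how many more executions remain before specialization is attempted (my guess is that it counts down because comparing against 0 is cheap). When the counter is 0, specialization is attempted.

The default counter is currently 53. According to the author, larger values specialize too rarely and smaller values specialize all the time; something around 50 works well, but rather than 50 itself the prime 53 was chosen.

On backoff, that is, when specialization fails, the counter is set to 64.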

TARGET(BINARY_OP_ADAPTIVE) {
    _PyBinaryOpCache *cache = (_PyBinaryOpCache *)next_instr;
    if (cache->counter == 0) {
        PyObject *lhs = SECOND();
        PyObject *rhs = TOP();
        next_instr--;
        _Py_Specialize_BinaryOp(lhs, rhs, next_instr, oparg, &GETLOCAL(0));
        DISPATCH();
    }
    else {
        cache->counter--;
        JUMP_TO_INSTRUCTION(BINARY_OP);
    }
}

_Py_Specialize_BinaryOp is very simple too: it checks the object types and the kind of operation and decides accordingly.

void
_Py_Specialize_BinaryOp(PyObject *lhs, PyObject *rhs, _Py_CODEUNIT *instr,
                        int oparg, PyObject **locals)
{
    assert(_PyOpcode_Caches[BINARY_OP] == INLINE_CACHE_ENTRIES_BINARY_OP);
    _PyBinaryOpCache *cache = (_PyBinaryOpCache *)(instr + 1);
    switch (oparg) {
        case NB_ADD:
        case NB_INPLACE_ADD:
            if (!Py_IS_TYPE(lhs, Py_TYPE(rhs))) {
                break;
            }
            if (PyUnicode_CheckExact(lhs)) {
                _Py_CODEUNIT next = instr[INLINE_CACHE_ENTRIES_BINARY_OP + 1];
                bool to_store = (_Py_OPCODE(next) == STORE_FAST ||
                                 _Py_OPCODE(next) == STORE_FAST__LOAD_FAST);
                if (to_store && locals[_Py_OPARG(next)] == lhs) {
                    _Py_SET_OPCODE(*instr, BINARY_OP_INPLACE_ADD_UNICODE);
                    goto success;
                }
                _Py_SET_OPCODE(*instr, BINARY_OP_ADD_UNICODE);
                goto success;
            }
            if (PyLong_CheckExact(lhs)) {
                _Py_SET_OPCODE(*instr, BINARY_OP_ADD_INT);
                goto success;
            }
            if (PyFloat_CheckExact(lhs)) {
                _Py_SET_OPCODE(*instr, BINARY_OP_ADD_FLOAT);
                goto success;
            }
            break;
        case NB_MULTIPLY:
        case NB_INPLACE_MULTIPLY:
            if (!Py_IS_TYPE(lhs, Py_TYPE(rhs))) {
                break;
            }
            if (PyLong_CheckExact(lhs)) {
                _Py_SET_OPCODE(*instr, BINARY_OP_MULTIPLY_INT);
                goto success;
            }
            if (PyFloat_CheckExact(lhs)) {
                _Py_SET_OPCODE(*instr, BINARY_OP_MULTIPLY_FLOAT);
                goto success;
            }
            break;
        case NB_SUBTRACT:
        case NB_INPLACE_SUBTRACT:
            if (!Py_IS_TYPE(lhs, Py_TYPE(rhs))) {
                break;
            }
            if (PyLong_CheckExact(lhs)) {
                _Py_SET_OPCODE(*instr, BINARY_OP_SUBTRACT_INT);
                goto success;
            }
            if (PyFloat_CheckExact(lhs)) {
                _Py_SET_OPCODE(*instr, BINARY_OP_SUBTRACT_FLOAT);
                goto success;
            }
            break;
#ifndef Py_STATS
        default:
            _Py_SET_OPCODE(*instr, BINARY_OP);
            return;
#endif
    }
    SPECIALIZATION_FAIL(BINARY_OP, binary_op_fail_kind(oparg, lhs, rhs));
    STAT_INC(BINARY_OP, failure);
    cache->counter = ADAPTIVE_CACHE_BACKOFF; // 64
    return;
success:
    STAT_INC(BINARY_OP, success);
    cache->counter = initial_counter_value(); // 53
}

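The key question is: if specialization succeeds at first but later objects no longer have the original type, how does the interpreter fall back? With that question in mind, let's look at the specialized instruction BINARY_OP_ADD_FLOAT.

As the code shows, it only checks the types of the two operands and then adds them quickly as floats. This is where the speed-up comes from.

DEOPT_IF is the macro that checks whether the specialization no longer applies; on failure it goes to miss.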

#define DEOPT_IF(cond, instname) if (cond) { goto miss; }

TARGET(BINARY_OP_ADD_FLOAT) {
    assert(cframe.use_tracing == 0);
    PyObject *left = SECOND();
    PyObject *right = TOP();
    DEOPT_IF(!PyFloat_CheckExact(left), BINARY_OP);
    DEOPT_IF(Py_TYPE(right) != Py_TYPE(left), BINARY_OP);
    STAT_INC(BINARY_OP, hit);
    double dsum = ((PyFloatObject *)left)->ob_fval +
        ((PyFloatObject *)right)->ob_fval;
    PyObject *sum = PyFloat_FromDouble(dsum);
    SET_SECOND(sum);
    Py_DECREF(right);
    Py_DECREF(left);
    STACK_SHRINK(1);
    if (sum == NULL) {
        goto error;
    }
    JUMPBY(INLINE_CACHE_ENTRIES_BINARY_OP);
    NOTRACE_DISPATCH();
}

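When a specialized instruction misses, execution falls back to the original (generic) instruction it was derived from. Each miss also decrements the counter, and once the counter reaches zero the instruction is rewritten to its adaptive form, which will try to specialize again.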

miss:
    {
        STAT_INC(opcode, miss);
        opcode = _PyOpcode_Deopt[opcode];
        STAT_INC(opcode, miss);
        /* The counter is always the first cache entry: */
        _Py_CODEUNIT *counter = (_Py_CODEUNIT *)next_instr;
        *counter -= 1;
        if (*counter == 0) {
            int adaptive_opcode = _PyOpcode_Adaptive[opcode];
            assert(adaptive_opcode);
            _Py_SET_OPCODE(next_instr[-1], adaptive_opcode);
            STAT_INC(opcode, deopt);
            *counter = ADAPTIVE_CACHE_BACKOFF;
        }
        next_instr--;
        DISPATCH_GOTO();
    }
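The specialize / miss / de-optimize cycle can be modeled in a few lines of Python. This is only a toy sketch, not CPython's implementation: the class name BinaryAdd and the miss_budget parameter are hypothetical, and the real interpreter stores the counter in the inline cache that follows the instruction.

```python
class BinaryAdd:
    """Toy instruction slot that flips between an adaptive and a float-only form."""

    def __init__(self, miss_budget=3):
        self.opcode = "BINARY_OP_ADAPTIVE"
        self.miss_budget = miss_budget  # hypothetical: misses tolerated before de-optimizing
        self.counter = 0

    def execute(self, left, right):
        if self.opcode == "BINARY_OP_ADD_FLOAT":
            # Guards, in the spirit of DEOPT_IF: both operands must be exact floats.
            if type(left) is float and type(right) is float:
                return left + right  # fast path
            # Miss: count it, and fall back to the adaptive form when the budget runs out.
            self.counter -= 1
            if self.counter == 0:
                self.opcode = "BINARY_OP_ADAPTIVE"
            return left + right  # generic path still produces the right answer
        # Adaptive form: try to specialize based on the operands seen right now.
        if type(left) is float and type(right) is float:
            self.opcode = "BINARY_OP_ADD_FLOAT"
            self.counter = self.miss_budget
        return left + right


op = BinaryAdd(miss_budget=2)
op.execute(1.0, 2.0)  # specializes to BINARY_OP_ADD_FLOAT
op.execute(1, 2)      # first miss, still specialized
op.execute(1, 2)      # second miss, back to BINARY_OP_ADAPTIVE
print(op.opcode)
```

A wrong guess never produces a wrong result here, because the generic path is always available; a miss only costs the failed guard check.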

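See the figure below for the full set of instruction transitions.

That concludes the analysis. Instruction specialization is great fun; next time (for sure) I'll implement it in Lua.

Python 3 Source Code Analysis (1): The Birth of float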

2022-04-05T02:30:54.000Z

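In 2021 my work focused mainly on improving the Lua virtual machine. After a job change my main working language became Python, so I decided to read the Python 3.10 source code, study its design, and see what advantages it has over Lua.

I hope that reading it will bring the satisfying feeling of coming home.

This post takes float as the starting point: how a float object is created, and how float is implemented internally.

Everything Is an Object

The phrase "everything is an object" has been repeated to death, but it bears repeating once more.

Python is an object-oriented, strongly typed, dynamic language in which everything is an object. Take floating-point numbers as an example.

# a is a float instance object; its type is float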
>>> a = 3.14159
>>> type(a)
<class 'float'>

# float is an object too, but it is a type object
>>> float
<class 'float'>
>>> float()
0.0

# the type of the float type object is type
>>> type(float)
<class 'type'>

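From the above we can confirm that in Python, types are objects too.

Furthermore, the type of every type object is type, which can be called the metaclass, and every object inherits from object.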

>>> type(int)
<class 'type'>
>>> type(dict)
<class 'type'>
>>> type(list)
<class 'type'>

>>> float.__base__
<class 'object'>
>>> dict.__base__
<class 'object'>

The type of object is also type, and the type of type is type as well.

>>> type(object)
<class 'type'>

>>> type(type)
<class 'type'>

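At this point we can draw the following conclusions, which will help when reading the float implementation.

Everything is an object, including types.
Every class inherits from object.
The type of every class is type.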

>>> type.__base__
<class 'object'>

>>> object.__class__
<class 'type'>

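The base class of type is object.
The type of type is type.
The type of object is type.
The base class of object is None.

The two are two sides of the same coin and depend on each other.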
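The relationships listed above can be checked mechanically. The snippet below is a small sketch that uses only built-in introspection; the checks dictionary is just an illustrative container.

```python
# Each entry pairs a claim from the text with the expression that verifies it.
checks = {
    "the type of float is type": type(float) is type,
    "float inherits from object": float.__base__ is object,
    "the base class of type is object": type.__base__ is object,
    "the type of type is type": type(type) is type,
    "the type of object is type": type(object) is type,
    "the base class of object is None": object.__base__ is None,
    "type is an instance of object": isinstance(type, object),
    "object is an instance of type": isinstance(object, type),
}

for claim, holds in checks.items():
    print(f"{claim}: {holds}")
```

The last two entries show the circularity directly: type is an instance of object and object is an instance of type.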

PyObject

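With the above understood, we can start reading the source. CPython needs to express an inheritance relationship, but C has no such mechanism, so it simulates one by hand and factors out PyObject as the base class.

PyObject's structure is quite simple. Like Lua, CPython needs automatic garbage collection. The _PyObject_HEAD_EXTRA macro shown below adds a doubly linked list to every object header, so that all objects are chained together as they are created. Note that these two pointers exist only in builds compiled with Py_TRACE_REFS (a debugging aid); in a regular build the macro expands to nothing, and the generational cyclic collector keeps its own list in a separate PyGC_Head.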

#define _PyObject_HEAD_EXTRA            \
    struct _object *_ob_next;           \
    struct _object *_ob_prev;

typedef struct _object {
    _PyObject_HEAD_EXTRA
    Py_ssize_t ob_refcnt;
    PyTypeObject *ob_type;
} PyObject;

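PyObject is the starting point of every object; every later object "inherits" from it. It holds the reference count (ob_refcnt), which drives reference-counting memory management, plus the debug-only doubly linked list. Objects that can form reference cycles are additionally tracked by the generational cyclic collector.

ob_type is the type pointer. It points to the object's actual type, which describes the object's behavior and is how polymorphism is implemented.

PyVarObject is an extended PyObject that supports variable-length objects.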

typedef struct {
    PyObject ob_base;
    Py_ssize_t ob_size;
} PyVarObject;

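Variable-length objects are needed because some types are containers whose storage changes size dynamically, such as list. Since PyVarObject is the variable-length object, PyObject can be regarded as the fixed-length one.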
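The fixed-length versus variable-length split is visible from Python through the __basicsize__ and __itemsize__ attributes of a type, which mirror tp_basicsize and tp_itemsize. The sketch below assumes CPython; tuple is used as the variable-length example because its items are stored inline after the header.

```python
import sys

# A float is fixed-length: no per-item storage.
float_itemsize = float.__itemsize__

# A tuple is variable-length: each extra element adds tp_itemsize bytes.
tuple_itemsize = tuple.__itemsize__
tuple_sizes = [sys.getsizeof(()), sys.getsizeof((1,)), sys.getsizeof((1, 2))]

print("float  basicsize/itemsize:", float.__basicsize__, float_itemsize)
print("tuple  basicsize/itemsize:", tuple.__basicsize__, tuple_itemsize)
print("tuple sizes for 0, 1, 2 items:", tuple_sizes)
```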

PyTypeObject

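As we saw earlier, in the Python world types are objects too, and instances are created from type objects. PyTypeObject is the struct behind every type object (each type object is an instance of it), while PyType_Type is the type of type objects. It describes how a type behaves, and the instances it produces follow its rules. Be sure to get the relationship between the two straight before going further.

The full PyTypeObject structure is not expanded here; it will be explained later as we read each built-in object.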

#define PyObject_VAR_HEAD      PyVarObject ob_base;

typedef struct _typeobject PyTypeObject; // instance objects of type

struct _typeobject {
    PyObject_VAR_HEAD
    const char *tp_name; /* For printing, in format "." */
    Py_ssize_t tp_basicsize, tp_itemsize; /* For allocation */
...
};

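Once the Python virtual machine has started, built-in type objects can be used to instantiate objects right away, which means they are prepared at startup.

PyType_Type is one such pre-built type object.

// Linked-list pointers (present only under Py_TRACE_REFS); both are zero because these objects are statically allocated rather than created at run time
#define _PyObject_EXTRA_INIT 0, 0,

// Sets the object's type; this is where the type of type turns out to be type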
#define PyObject_HEAD_INIT(type)        \
    { _PyObject_EXTRA_INIT              \
    1, type },

#define PyVarObject_HEAD_INIT(type, size)       \
    { PyObject_HEAD_INIT(type) size },

PyDoc_STRVAR(type_doc,
"type(object) -> the object's type\n"
"type(name, bases, dict, **kwds) -> a new type");

PyTypeObject PyType_Type = { // the type object
    PyVarObject_HEAD_INIT(&PyType_Type, 0)
    "type",                                     /* tp_name: the type's name */
    sizeof(PyHeapTypeObject),                   /* tp_basicsize: base size */
    sizeof(PyMemberDef),                        /* tp_itemsize: per-item size */
    (destructor)type_dealloc,                   /* tp_dealloc: destructor */
    offsetof(PyTypeObject, tp_vectorcall),      /* tp_vectorcall_offset */
    0,                                          /* tp_getattr */
    0,                                          /* tp_setattr */
    0,                                          /* tp_as_async */
    (reprfunc)type_repr,                        /* tp_repr: debug representation */
    &type_as_number,                            /* tp_as_number: numeric operations */
    0,                                          /* tp_as_sequence */
    0,                                          /* tp_as_mapping */
    0,                                          /* tp_hash */
    (ternaryfunc)type_call,                     /* tp_call: makes type objects callable */
    0,                                          /* tp_str */
    (getattrofunc)type_getattro,                /* tp_getattro: attribute lookup */
    (setattrofunc)type_setattro,                /* tp_setattro: attribute assignment */
    0,                                          /* tp_as_buffer */
    Py_TPFLAGS_DEFAULT | Py_TPFLAGS_HAVE_GC |
    Py_TPFLAGS_BASETYPE | Py_TPFLAGS_TYPE_SUBCLASS |
    Py_TPFLAGS_HAVE_VECTORCALL,                 /* tp_flags */
    type_doc,                                   /* tp_doc */
    (traverseproc)type_traverse,                /* tp_traverse */
    (inquiry)type_clear,                        /* tp_clear */
    0,                                          /* tp_richcompare */
    offsetof(PyTypeObject, tp_weaklist),        /* tp_weaklistoffset */
    0,                                          /* tp_iter */
    0,                                          /* tp_iternext */
    type_methods,                               /* tp_methods */
    type_members,                               /* tp_members */
    type_getsets,                               /* tp_getset */
    0,                                          /* tp_base */
    0,                                          /* tp_dict */
    0,                                          /* tp_descr_get */
    0,                                          /* tp_descr_set */
    offsetof(PyTypeObject, tp_dict),            /* tp_dictoffset */
    type_init,                                  /* tp_init */
    0,                                          /* tp_alloc */
    type_new,                                   /* tp_new */
    PyObject_GC_Del,                            /* tp_free */
    (inquiry)type_is_gc,                        /* tp_is_gc */
    .tp_vectorcall = type_vectorcall,
};

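We can see that the type of type is still type. Many slots are left empty because they are filled in later by PyType_Ready; built-in types are all initialized in _PyTypes_Init.

Now that we know every type object is produced by the metaclass type, how are objects themselves created?

Objects are created in two main ways. One is calling a type object, i.e. going through the type object's __call__. The other applies when the object's type is already known at compile time, in which case the internal C API is called directly (the corresponding instruction is LOAD_CONST).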

# 1
>>> float(1.5)
1.5

# 2
>>> f = 1.5
>>> f
1.5

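The difference is mainly performance: an object whose type is known at compile time is created much faster than one produced by calling a type object.

float(1.5) ⇒ float.__class__.__call__(float, 1.5) ⇒ type.__call__(float, 1.5) ⇒ type_call(float, 1.5). The call path then still has to check whether the argument can be converted to a float, so it is naturally slower.

f = 1.5 ⇒ PyFloat_FromDouble(1.5) in a single step, with no further type checks.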
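The two paths are easy to tell apart in the bytecode. The sketch below uses the standard dis module; the helper name opnames is an assumption made for this example, and the exact name of the call instruction differs between Python versions, so only its CALL prefix is relied on.

```python
import dis


def opnames(source):
    """Return the opcode names produced by compiling a statement."""
    code = compile(source, "<demo>", "exec")
    return [ins.opname for ins in dis.get_instructions(code)]


literal_ops = opnames("f = 1.5")         # the constant is built at compile time
call_ops = opnames("f = float(1.5)")     # the type object is called at run time

print(literal_ops)
print(call_ops)
```

The literal form only loads a ready-made constant, while the call form has to look up the name float and execute a call instruction every time the statement runs.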

How can we confirm that calling a type object really goes through type.__call__? There is a simple way.

>>> print(float.__call__)
<method-wrapper '__call__' of type object at 0x103f65d70>
>>> print(int.__call__)
<method-wrapper '__call__' of type object at 0x103f67f90>

>>> print(type.__call__)
<slot wrapper '__call__' of 'type' objects>

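So the __call__ of a type object is really the __call__ of type. We also learn that the function pointers stored in struct slots are objects in the Python world too: PyWrapperDescrObject wraps the function pointer and adds some descriptive information.

With that background, the next step is to follow how an object is created, starting from type_call, since type's __call__ invokes type_call.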

static PyObject *
type_call(PyTypeObject *type, PyObject *args, PyObject *kwds)
{
    PyObject *obj;
    PyThreadState *tstate = _PyThreadState_GET();

    // Support the type(x) form, which returns the type object of x
    if (type == &PyType_Type) {
        Py_ssize_t nargs = PyTuple_GET_SIZE(args);
        if (nargs == 1 && (kwds == NULL || !PyDict_GET_SIZE(kwds))) {
            obj = (PyObject *) Py_TYPE(PyTuple_GET_ITEM(args, 0));
            Py_INCREF(obj);
            return obj;
        }
        if (nargs != 3) {
            PyErr_SetString(PyExc_TypeError,
                            "type() takes 1 or 3 arguments");
            return NULL;
        }
    }
    // 1. Call __new__ first
    if (type->tp_new == NULL) {
        _PyErr_Format(tstate, PyExc_TypeError,
                      "cannot create '%s' instances", type->tp_name);
        return NULL;
    }

    obj = type->tp_new(type, args, kwds);
    obj = _Py_CheckFunctionResult(tstate, (PyObject*)type, obj, NULL);
    if (obj == NULL)
        return NULL;

    // 2. Check whether the object returned by __new__ is an instance of the requested type (or a subtype)
    if (!PyType_IsSubtype(Py_TYPE(obj), type))
        return obj;

    // 3. Only then call __init__
    type = Py_TYPE(obj);
    if (type->tp_init != NULL) {
        int res = type->tp_init(obj, args, kwds);
        if (res < 0) {
            Py_DECREF(obj);
            obj = NULL;
        }
    }
    return obj;
}

This makes it much simpler: instantiating through a type object first runs __new__, and if the returned object has the right type, __init__ is called next.
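The same rule can be observed from pure Python. In this sketch the classes Normal and Odd and the calls list are made up for the demonstration: Odd.__new__ deliberately returns an object of another type, so step 2 of type_call returns early and __init__ is never invoked.

```python
calls = []


class Normal:
    def __new__(cls, *args):
        calls.append("Normal.__new__")
        return super().__new__(cls)

    def __init__(self, *args):
        calls.append("Normal.__init__")


class Odd:
    def __new__(cls, *args):
        calls.append("Odd.__new__")
        return 42  # not an instance of Odd

    def __init__(self, *args):
        calls.append("Odd.__init__")  # skipped: the result is not an Odd


n = Normal()
o = Odd()
print(calls)
print(o)
```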

PyBaseObject

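If PyTypeObject is the metaclass of everything, then PyBaseObject is the base class of everything. Yet this parent was itself created by the creator, type; the two form an inseparable whole (because the type of object is also type).

Overall it is quite ordinary: it mainly defines the most basic methods, such as comparison, for subclasses to use.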

PyDoc_STRVAR(object_doc,
"object()\n--\n\n"
"The base class of the class hierarchy.\n\n"
"When called, it accepts no arguments and returns a new featureless\n"
"instance that has no instance attributes and cannot be given any.\n");

PyTypeObject PyBaseObject_Type = {
    PyVarObject_HEAD_INIT(&PyType_Type, 0)
    "object",                                   /* tp_name */
    sizeof(PyObject),                           /* tp_basicsize */
    0,                                          /* tp_itemsize */
    object_dealloc,                             /* tp_dealloc */
    0,                                          /* tp_vectorcall_offset */
    0,                                          /* tp_getattr */
    0,                                          /* tp_setattr */
    0,                                          /* tp_as_async */
    object_repr,                                /* tp_repr */
    0,                                          /* tp_as_number */
    0,                                          /* tp_as_sequence */
    0,                                          /* tp_as_mapping */
    (hashfunc)_Py_HashPointer,                  /* tp_hash */
    0,                                          /* tp_call */
    object_str,                                 /* tp_str */
    PyObject_GenericGetAttr,                    /* tp_getattro */
    PyObject_GenericSetAttr,                    /* tp_setattro */
    0,                                          /* tp_as_buffer */
    Py_TPFLAGS_DEFAULT | Py_TPFLAGS_BASETYPE,   /* tp_flags */
    object_doc,                                 /* tp_doc */
    0,                                          /* tp_traverse */
    0,                                          /* tp_clear */
    object_richcompare,                         /* tp_richcompare */
    0,                                          /* tp_weaklistoffset */
    0,                                          /* tp_iter */
    0,                                          /* tp_iternext */
    object_methods,                             /* tp_methods */
    0,                                          /* tp_members */
    object_getsets,                             /* tp_getset */
    0,                                          /* tp_base */
    0,                                          /* tp_dict */
    0,                                          /* tp_descr_get */
    0,                                          /* tp_descr_set */
    0,                                          /* tp_dictoffset */
    object_init,                                /* tp_init */
    PyType_GenericAlloc,                        /* tp_alloc */
    object_new,                                 /* tp_new */
    PyObject_Del,                               /* tp_free */
};

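We won't dig into the contents now. Once we know the other objects well enough, we will come back to analyze type and object. The benefit is that reading top-down is less likely to cause confusion.

PyFloatObject

Now for the focus of this post. PyFloatObject is a float instance object, and we use it as the entry point for exploring the design. It was chosen because it is the simplest of all objects.

// Clearly a fixed-length object: it holds nothing but a double
// (of course it is fixed-length; what would vary in a single float?)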
typedef struct {
    PyObject_HEAD
    double ob_fval;
} PyFloatObject;

PyFloat_Type, as the name suggests, is the type object for floats.

Its behavior is fairly simple. Note that there is no __init__: a float is simple enough to be fully populated in __new__.

PyTypeObject PyFloat_Type = {
    // Type is set to type; the linked-list pointers are zero because the object is static
    PyVarObject_HEAD_INIT(&PyType_Type, 0)
    // Name of the float type object
    "float",
    // Size of an instance of this type
    sizeof(PyFloatObject),
    0,
    // Effectively the destructor; also feeds the float free list
    (destructor)float_dealloc,                  /* tp_dealloc */
    0,                                          /* tp_vectorcall_offset */
    0,                                          /* tp_getattr */
    0,                                          /* tp_setattr */
    0,                                          /* tp_as_async */
    // Representation meant for machines: __repr__
    (reprfunc)float_repr,                       /* tp_repr */
    // Numeric operations on floats
    &float_as_number,                           /* tp_as_number */
    0,                                          /* tp_as_sequence */
    0,                                          /* tp_as_mapping */
    // Float hashing
    (hashfunc)float_hash,                       /* tp_hash */
    0,                                          /* tp_call */
    0,                                          /* tp_str */
    // Standard attribute lookup, e.g. float.__doc__
    PyObject_GenericGetAttr,                    /* tp_getattro */
    0,                                          /* tp_setattro */
    0,                                          /* tp_as_buffer */
    // Default flags, subclassing allowed
    Py_TPFLAGS_DEFAULT | Py_TPFLAGS_BASETYPE |
        _Py_TPFLAGS_MATCH_SELF,                 /* tp_flags */
    // Documentation
    float_new__doc__,                           /* tp_doc */
    0,                                          /* tp_traverse */
    0,                                          /* tp_clear */
    // Float comparison, covered in detail later; it is hellish
    float_richcompare,                          /* tp_richcompare */
    0,                                          /* tp_weaklistoffset */
    0,                                          /* tp_iter */
    0,                                          /* tp_iternext */
    // Float methods
    float_methods,                              /* tp_methods */
    0,                                          /* tp_members */
    float_getset,                               /* tp_getset */
    0,                                          /* tp_base */
    0,                                          /* tp_dict */
    0,                                          /* tp_descr_get */
    0,                                          /* tp_descr_set */
    0,                                          /* tp_dictoffset */
    0,                                          /* tp_init */
    0,                                          /* tp_alloc */
    // Effectively the constructor
    float_new,                                  /* tp_new */
    // Effectively a fast-path constructor, introduced after Python 3.9
    .tp_vectorcall = (vectorcallfunc)float_vectorcall,
};

To make the rest easier to follow, here is part of floatobject.h with the macros annotated.

// Size of the float free list
#ifndef PyFloat_MAXFREELIST
#  define PyFloat_MAXFREELIST   100
#endif

// Float type check; accepts float and its subclasses
#define PyFloat_Check(op) PyObject_TypeCheck(op, &PyFloat_Type)

// Exact float type check; excludes subclasses
#define PyFloat_CheckExact(op) Py_IS_TYPE(op, &PyFloat_Type)

// Not a number
#ifdef Py_NAN
#define Py_RETURN_NAN return PyFloat_FromDouble(Py_NAN)
#endif

// Return positive or negative infinity
#define Py_RETURN_INF(sign) do                     \
    if (copysign(1., sign) == 1.) {                \
        return PyFloat_FromDouble(Py_HUGE_VAL);    \
    } else {                        \
        return PyFloat_FromDouble(-Py_HUGE_VAL);   \
    } while(0)

// Convert a PyObject to a C double; the macro skips the type check for speed
#ifndef Py_LIMITED_API
#define PyFloat_AS_DOUBLE(op) (((PyFloatObject *)(op))->ob_fval)
#endif

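Float Initialization

After the virtual machine starts, it performs some float initialization, which consists mainly of the following two operations.

Determine whether the current machine stores IEEE 754 doubles in big-endian or little-endian byte order.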

void
_PyFloat_Init(void)
{
// https://tooltt.com/ieee/
// 01000011 43
// 00111111 3f
// 11111111 ff
// 00000001 1
// 00000010 2
// 00000011 3
// 00000100 4
// 00000101 5
#if SIZEOF_DOUBLE == 8
    {
        // 01000011 00111111 11111111 00000001 00000010 00000011 00000100 00000101
        double x = 9006104071832581.0;
        if (memcmp(&x, "\x43\x3f\xff\x01\x02\x03\x04\x05", 8) == 0)
            detected_double_format = ieee_big_endian_format;
        else if (memcmp(&x, "\x05\x04\x03\x02\x01\xff\x3f\x43", 8) == 0)
            detected_double_format = ieee_little_endian_format;
        else
            detected_double_format = unknown_format;
    }
#else
    detected_double_format = unknown_format;
#endif
    ...
    double_format = detected_double_format;
    float_format = detected_float_format;
}
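The same probe can be reproduced from Python with the struct module. This is a sketch of the idea rather than what CPython runs: the names PROBE, BIG and detect_double_format are assumptions for the example, and the return strings merely echo the C enum names.

```python
import struct
import sys

PROBE = 9006104071832581.0                  # the constant used in _PyFloat_Init
BIG = bytes.fromhex("433fff0102030405")     # its IEEE 754 big-endian encoding


def detect_double_format():
    """Compare the native byte layout of PROBE against both IEEE layouts."""
    raw = struct.pack("@d", PROBE)          # "@" means native byte order
    if raw == BIG:
        return "ieee_big_endian_format"
    if raw == BIG[::-1]:
        return "ieee_little_endian_format"
    return "unknown_format"


print(detect_double_format())
print(float.__getformat__("double"))        # CPython's own answer
```

The probe value works because every one of its eight bytes is distinct, so a byte-for-byte comparison pins down the ordering unambiguously.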

Populate the float info data.

// floatinfo: information about the float type
PyDoc_STRVAR(floatinfo__doc__,
"sys.float_info\n\
\n\
A named tuple holding information about the float type. It contains low level\n\
information about the precision and internal representation. Please study\n\
your system's :file:`float.h` for more information.");
static PyStructSequence_Field floatinfo_fields[] = {
    {"max",             "DBL_MAX -- maximum representable finite float"},
    {"max_exp",         "DBL_MAX_EXP -- maximum int e such that radix**(e-1) "
                    "is representable"},
    {"max_10_exp",      "DBL_MAX_10_EXP -- maximum int e such that 10**e "
                    "is representable"},
    {"min",             "DBL_MIN -- Minimum positive normalized float"},
    {"min_exp",         "DBL_MIN_EXP -- minimum int e such that radix**(e-1) "
                    "is a normalized float"},
    {"min_10_exp",      "DBL_MIN_10_EXP -- minimum int e such that 10**e is "
                    "a normalized"},
    {"dig",             "DBL_DIG -- maximum number of decimal digits that "
                    "can be faithfully represented in a float"},
    {"mant_dig",        "DBL_MANT_DIG -- mantissa digits"},
    {"epsilon",         "DBL_EPSILON -- Difference between 1 and the next "
                    "representable float"},
    {"radix",           "FLT_RADIX -- radix of exponent"},
    {"rounds",          "FLT_ROUNDS -- rounding mode used for arithmetic "
                    "operations"},
    {0}
};

static PyStructSequence_Desc floatinfo_desc = {
    "sys.float_info",           /* name */
    floatinfo__doc__,           /* doc */
    floatinfo_fields,           /* fields */
    11
};

int
_PyFloat_InitTypes(void)
{
    /* Init float info */
    // A StructSequence here is essentially a wrapper around tuple
    if (FloatInfoType.tp_name == NULL) {
        if (PyStructSequence_InitType2(&FloatInfoType, &floatinfo_desc) < 0) {
            return -1;
        }
    }
    return 0;
}

This makes the float parameters of the current environment available through sys.float_info.

>>> import sys
>>> sys.float_info
sys.float_info(max=1.7976931348623157e+308, max_exp=1024, max_10_exp=308, min=2.2250738585072014e-308, min_exp=-1021, min_10_exp=-307, dig=15, mant_dig=53, epsilon=2.220446049250313e-16, radix=2, rounds=1)

Creating and Destroying Floats

Float Creation

Float creation happens mainly in float_new_impl.

static PyObject *
float_new(PyTypeObject *type, PyObject *args, PyObject *kwargs)
{
    PyObject *return_value = NULL;
    PyObject *x = NULL;

    if ((type == &PyFloat_Type) &&
        !_PyArg_NoKeywords("float", kwargs)) {
        goto exit;
    }
    if (!_PyArg_CheckPositional("float", PyTuple_GET_SIZE(args), 0, 1)) {
        goto exit;
    }
    if (PyTuple_GET_SIZE(args) < 1) {
        goto skip_optional;
    }
    x = PyTuple_GET_ITEM(args, 0);
skip_optional:
    return_value = float_new_impl(type, x);

exit:
    return return_value;
}

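If the requested type is not exactly PyFloat_Type, it is a subclass of float and float_subtype_new handles it. Otherwise, a missing argument yields 0.0, an exact str is parsed with PyFloat_FromString, and anything else goes through PyNumber_Float.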

static PyObject *
float_new_impl(PyTypeObject *type, PyObject *x)
{
    if (type != &PyFloat_Type) {
        if (x == NULL) {
            x = _PyLong_GetZero();
        }
        return float_subtype_new(type, x); /* Wimp out */
    }

    if (x == NULL) {
        return PyFloat_FromDouble(0.0);
    }
    /* If it's a string, but not a string subclass, use
       PyFloat_FromString. */
    if (PyUnicode_CheckExact(x))
        return PyFloat_FromString(x);
    return PyNumber_Float(x);
}

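The function to focus on is PyFloat_FromDouble. It shows that float keeps a free list of cached objects, chained together through ob_type.

// Build a Python float object from a C double; note that the VM keeps a float free list.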
PyObject *
PyFloat_FromDouble(double fval)
{
    // Fetch the float free list
    struct _Py_float_state *state = get_float_state();
    PyFloatObject *op = state->free_list;
    if (op != NULL) {
        // ob_type doubles as the next pointer linking the float objects in the free list
        state->free_list = (PyFloatObject *) Py_TYPE(op);
        state->numfree--;
    }
    else {
        op = PyObject_Malloc(sizeof(PyFloatObject));
        if (!op) {
            return PyErr_NoMemory();
        }
    }
    _PyObject_Init((PyObject*)op, &PyFloat_Type);
    op->ob_fval = fval;
    return (PyObject *) op;
}

float_vectorcall

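Besides float_new there is a newer way to create a float, float_vectorcall, which also calls float_new_impl internally and exists to improve performance. PyFloat_Type itself does not set Py_TPFLAGS_HAVE_VECTORCALL, but that flag describes calls on float instances. Calls on the type object go through PyType_Type, which does set the flag and points tp_vectorcall_offset at tp_vectorcall, so float(x) can be dispatched to float_vectorcall. We skip the details here.

Float Destruction

// Destructor; the free list only accepts objects of exactly PyFloat_Type
// The list holds at most 100 objects
// ob_type is reused as the link, since every object on this list is known to be a PyFloat_Type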
static void
float_dealloc(PyFloatObject *op)
{
    if (PyFloat_CheckExact(op)) {
        struct _Py_float_state *state = get_float_state();
        if (state->numfree >= PyFloat_MAXFREELIST)  {
            PyObject_Free(op);
            return;
        }
        state->numfree++;
        Py_SET_TYPE(op, (PyTypeObject *)state->free_list);
        state->free_list = op;
    }
    else {
        Py_TYPE(op)->tp_free((PyObject *)op);
    }
}

How can we verify that floats really use the free list? There is a simple check.

>>> a = 1.3
>>> id(a)
4500913328
>>> del a

>>> b = 1.3
>>> id(b)
4500913328

a and b have the same id, which indicates that the float object was reused.
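The free-list mechanics of PyFloat_FromDouble and float_dealloc can be sketched in Python. This is a toy model under stated assumptions: Cell and FloatFreeList are hypothetical names, and the link slot stands in for the ob_type field that CPython temporarily reuses as a next pointer.

```python
class Cell:
    """Stand-in for a PyFloatObject; 'link' plays the role of the reused ob_type."""
    __slots__ = ("link", "value")


class FloatFreeList:
    def __init__(self, max_size=100):
        self.head = None          # like state->free_list
        self.numfree = 0          # like state->numfree
        self.max_size = max_size  # like PyFloat_MAXFREELIST

    def alloc(self, value):
        cell = self.head
        if cell is not None:
            self.head = cell.link  # pop the first cached cell
            self.numfree -= 1
        else:
            cell = Cell()          # stands in for PyObject_Malloc
        cell.link = None           # stands in for _PyObject_Init restoring ob_type
        cell.value = value
        return cell

    def free(self, cell):
        if self.numfree >= self.max_size:
            return                 # the real code calls PyObject_Free here
        cell.link = self.head      # push onto the list through the reused field
        self.head = cell
        self.numfree += 1


pool = FloatFreeList(max_size=1)
a = pool.alloc(1.3)
pool.free(a)
b = pool.alloc(2.5)
print(b is a)  # the freed cell is handed out again
```

Reusing a field that is dead while the object sits on the list keeps the cache free of any extra per-object memory, which is the point of threading the list through ob_type.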

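Float Operations

Most float operations are fairly simple; comparison alone is a very troublesome one.

Float Comparison

The CPython authors themselves note that float comparison is a nightmare. The trouble comes from comparing a float with an integer: converting the float to an integer loses the fractional part, and converting the integer to a float does not work either, because a C long can carry 63 bits of precision (and a Python int even more) while a double has only 53 bits of significand. The two cannot be compared directly.

The rough steps, where i is the float on the left and j is the right-hand operand, are:

If i is not finite (infinity or NaN), the result can be decided immediately whatever integer j is.
If j is an integer, check the signs; if they differ, the result can also be decided immediately.
If j is an integer with the same sign, check whether it can be converted to a float exactly (by counting its bits; at most 48 bits converts safely), then compare directly.
If i is negative, negate it and swap the comparison operator. Then compute the exponent of i; if it differs from the bit count of j, the result can be decided immediately (the exponent is effectively a bit count).
Otherwise split the float i into its integer and fractional parts. If the fractional part is nonzero, shift the integer part of i left by one bit and OR it with 1 to keep the lost precision, shift j left by one bit, and compare the two integers.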

static PyObject*
float_richcompare(PyObject *v, PyObject *w, int op)
{
    double i, j;
    int r = 0;

    assert(PyFloat_Check(v));
    i = PyFloat_AS_DOUBLE(v);

    if (PyFloat_Check(w))
        j = PyFloat_AS_DOUBLE(w);
    // Not finite means i is infinity or NaN, so the magnitude of j does not matter...
    else if (!Py_IS_FINITE(i)) {
        if (PyLong_Check(w))
            j = 0.0;
        else
            goto Unimplemented;
    }
    // j is an integer: the troublesome case
    else if (PyLong_Check(w)) {
        // Check the signs
        int vsign = i == 0.0 ? 0 : i < 0.0 ? -1 : 1;
        int wsign = _PyLong_Sign(w);
        size_t nbits;
        int exponent;
        // Different signs are easy to decide, since the magnitudes no longer matter
        if (vsign != wsign) {
            i = (double)vsign;
            j = (double)wsign;
            goto Compare;
        }
        // Same sign: try converting j to a float, which is safe for 0 in particular
        // When is it safe in general? When the integer w has few enough bits.
        nbits = _PyLong_NumBits(w);
        if (nbits == (size_t)-1 && PyErr_Occurred()) {
            PyErr_Clear();
            i = (double)vsign;
            assert(wsign != 0);
            j = wsign * 2.0;
            goto Compare;
        }
        // A double has a 53-bit significand, so an integer of at most 48 bits always converts exactly
        if (nbits <= 48) {
            j = PyLong_AsDouble(w);

            assert(j != -1.0 || ! PyErr_Occurred());
            goto Compare;
        }
        if (vsign < 0) {
            i = -i;
            op = _Py_SwappedOp[op];
        }
        assert(i > 0.0);
        (void) frexp(i, &exponent); // i=returnvalue * 2^exponent
        // Comparing magnitudes via the exponent is a neat trick
        if (exponent < 0 || (size_t)exponent < nbits) {
            i = 1.0;
            j = 2.0;
            goto Compare;
        }
        if ((size_t)exponent > nbits) {
            i = 2.0;
            j = 1.0;
            goto Compare;
        }
        {
            double fracpart;
            double intpart;
            PyObject *result = NULL;
            PyObject *vv = NULL;
            PyObject *ww = w;

            if (wsign < 0) {
                ww = PyNumber_Negative(w);
                if (ww == NULL)
                    goto Error;
            }
            else
                Py_INCREF(ww);

            fracpart = modf(i, &intpart);
            vv = PyLong_FromDouble(intpart);
            if (vv == NULL)
                goto Error;

            if (fracpart != 0.0) {
                PyObject *temp;

                temp = _PyLong_Lshift(ww, 1);
                if (temp == NULL)
                    goto Error;
                Py_DECREF(ww);
                ww = temp;

                temp = _PyLong_Lshift(vv, 1);
                if (temp == NULL)
                    goto Error;
                Py_DECREF(vv);
                vv = temp;

                temp = PyNumber_Or(vv, _PyLong_GetOne());
                if (temp == NULL)
                    goto Error;
                Py_DECREF(vv);
                vv = temp;
            }

            r = PyObject_RichCompareBool(vv, ww, op);
            if (r < 0)
                goto Error;
            result = PyBool_FromLong(r);
         Error:
            Py_XDECREF(vv);
            Py_XDECREF(ww);
            return result;
        }
    } /* else if (PyLong_Check(w)) */

    else        /* w isn't float or int */
        goto Unimplemented;

 Compare:
    switch (op) {
    case Py_EQ:
        r = i == j;
        break;
    case Py_NE:
        r = i != j;
        break;
    case Py_LE:
        r = i <= j;
        break;
    case Py_GE:
        r = i >= j;
        break;
    case Py_LT:
        r = i < j;
        break;
    case Py_GT:
        r = i > j;
        break;
    }
    return PyBool_FromLong(r);

 Unimplemented:
    Py_RETURN_NOTIMPLEMENTED;
}
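The final branch of float_richcompare is easier to follow in isolation. The sketch below reimplements only that step for the less-than operator; the function name positive_float_lt_int is an assumption for this example, and, like the C code at that point, it expects both operands to be positive.

```python
import math


def positive_float_lt_int(f, n):
    """Exact 'f < n' for a finite float f > 0 and an int n > 0."""
    frac, whole = math.modf(f)   # split into fractional and integer parts
    vv = int(whole)
    ww = n
    if frac != 0.0:
        # Double both sides and set the low bit of vv: the odd number 2*whole + 1
        # stands for "somewhere strictly between whole and whole + 1".
        vv = (vv << 1) | 1
        ww = ww << 1
    return vv < ww


big = 2**53 + 1
print(float(big) == 2.0**53)                    # converting the int loses the +1
print(positive_float_lt_int(2.0**53, big))      # the exact comparison still sees it
print(positive_float_lt_int(2.5, 3), positive_float_lt_int(3.5, 3))
```

The first line of output shows why simply converting the integer to a double is not enough: 2**53 + 1 has no exact double representation, so the naive comparison would report equality.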

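This left me puzzled, because I remembered Lua's number comparison being very simple. Checking the Lua 5.3.6 source shows that when an integer and a float are compared for equality, Lua simply tries to convert both operands to integers and compares those. The conversion only succeeds when the float holds an exact integer value (the default LUA_FLOORN2I mode), so a float with a fractional part just compares unequal.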

int luaV_equalobj (lua_State *L, const TValue *t1, const TValue *t2) {
  const TValue *tm;
  if (ttype(t1) != ttype(t2)) {  /* not the same variant? */
    if (ttnov(t1) != ttnov(t2) || ttnov(t1) != LUA_TNUMBER)
      return 0;  /* only numbers can be equal with different variants */
    else {  /* two numbers with different variants */
      lua_Integer i1, i2;  /* compare them as integers */
      return (tointeger(t1, &i1) && tointeger(t2, &i2) && i1 == i2);
    }
  }

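copysign

copysign is an auxiliary function that IEEE 754 defines for floating-point numbers; it is used to work with the sign of a float. CPython provides this fallback implementation so that signed zero is handled.

The function copies the sign of y onto x and returns the result.
The implementation is rather clever. It relies on atan2(y, -1.) for y equal to zero: on a machine that supports signed zero, atan2(-0., -1.) returns -PI while atan2(+0., -1.) returns +PI, so the sign of the result reveals the sign of the zero. Without signed-zero support the result is always +PI.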

double
copysign(double x, double y)
{
    /* use atan2 to distinguish -0. from 0. */
    if (y > 0. || (y == 0. && atan2(y, -1.) > 0.)) {
        return fabs(x);
    } else {
        return -fabs(x);
    }
}
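The atan2 trick can be tried directly in Python, whose math module exposes the same C functions. The function copysign_via_atan2 below is a line-by-line transcription of the C fallback above, written only for illustration; in practice math.copysign does the job.

```python
import math


def copysign_via_atan2(x, y):
    """Transcription of the C fallback: use atan2 to tell -0.0 from +0.0."""
    if y > 0.0 or (y == 0.0 and math.atan2(y, -1.0) > 0.0):
        return math.fabs(x)
    return -math.fabs(x)


print(-0.0 == 0.0)                  # the two zeros compare equal...
print(math.atan2(0.0, -1.0))        # ...but atan2 gives +pi for +0.0
print(math.atan2(-0.0, -1.0))       # and -pi for -0.0
print(copysign_via_atan2(3.0, -0.0), math.copysign(3.0, -0.0))
```

An ordinary comparison cannot distinguish the two zeros, which is why the implementation needs a function whose result depends on the sign bit.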

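Summary

This post analyzed the internal structure and implementation of the float object in Python 3.10. Compared with Lua, its advantages are:

It has a float free list.
Its comparison function is more reliable.
It accounts for whether the machine supports signed zero, through copysign.

The LuaJIT 5.3.6 Approach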

2021-07-04T02:02:20.000Z

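Nearly three months have passed since my last post. During that time I mainly worked on two things: a multithreaded garbage collection scheme for Lua, and the LuaJIT 5.3.6 implementation. It did not really take three months; the code plus testing took about one month. The other two months went mostly to answering the call to get vaccinated, followed by diarrhea, fever, and rest 😓.

Multithreaded Garbage Collection for Lua

The idea was mainly borrowed from how Redis frees memory in a background thread. With this small optimization, our game server no longer suffers large frame drops when many players go offline at once; the effect is quite significant.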

LuaJIT 5.3.6

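The reason for building this Lua 5.3.6 JIT is that LuaJIT 2.0 does not support the new 5.3 extensions, and our project was already in its middle-to-late stage with no time to adjust the code. In the end I spent half a month implementing a small version, which passes the official Lua test suite and is now used in the project. Performance improved by 2-5x, and the integration cost is zero: no logic code needs to change.

As for the interpreter, I borrowed a small optimization from Lua 5.4 and replaced the switch-case dispatch with computed goto, gaining about 5% in performance. Later I may study the extended bytecode of Lua 5.4. If the project continues (and I have time), I would like to have the interpreter record the operand types as it executes a script and replace bytecodes dynamically, cutting unnecessary type checks and speeding up the interpreter. That approach is too risky for now, so it is shelved.

While implementing LuaJIT 5.3.6 I also reviewed the front-end part of compiler theory and implemented some syntax that official Lua does not support, such as '+=' and increment expressions (not committed, of course). It was great fun.

Closing

The code described above is open source and incorporates the earlier NOGC optimization: LuaJIT-5.3.6. Stars are welcome; it means a lot to me.

Redis 6 Internals (2): Master-Replica Synchronization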

2021-04-03T15:31:20.000Z

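This is the second post in the Redis 6 series. It explores how Redis performs master-replica synchronization; the code shown has been trimmed.

SLAVE

To enable replication you normally just run SLAVEOF HOST PORT on the replica, which ends up in replicaofCommand. Since replication is initiated by the replica, we start the analysis from the Slave side.

repl_state

Redis replication is driven by a state machine, so before going any further it is worth looking at the states.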

typedef enum {
    REPL_STATE_NONE = 0,            /* No active replication */
    REPL_STATE_CONNECT,             /* Must connect to master */
    REPL_STATE_CONNECTING,          /* Connecting to master */
    /* --- Handshake states, must be ordered --- */
    REPL_STATE_RECEIVE_PING_REPLY,  /* Wait for PING reply */
    REPL_STATE_SEND_HANDSHAKE,      /* Send handshake sequance to master */
    REPL_STATE_RECEIVE_AUTH_REPLY,  /* Wait for AUTH reply */
    REPL_STATE_RECEIVE_PORT_REPLY,  /* Wait for REPLCONF reply */
    REPL_STATE_RECEIVE_IP_REPLY,    /* Wait for REPLCONF reply */
    REPL_STATE_RECEIVE_CAPA_REPLY,  /* Wait for REPLCONF reply */
    REPL_STATE_SEND_PSYNC,          /* Send PSYNC */
    REPL_STATE_RECEIVE_PSYNC_REPLY, /* Wait for PSYNC reply */
    /* --- End of handshake states --- */
    REPL_STATE_TRANSFER,        /* Receiving .rdb from master */
    REPL_STATE_CONNECTED,       /* Connected to master */
} repl_state;

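REPL_STATE_NONE: replication is not active.
REPL_STATE_CONNECT: a connection to the Master is required.
REPL_STATE_CONNECTING: the connection to the Master is in progress.
REPL_STATE_RECEIVE_PING_REPLY: waiting for the reply to PING.
REPL_STATE_SEND_HANDSHAKE: send the handshake sequence (authentication and REPLCONF).
REPL_STATE_RECEIVE_AUTH_REPLY: waiting for the reply to AUTH.
REPL_STATE_RECEIVE_PORT_REPLY: waiting for the REPLCONF reply about the port.
REPL_STATE_RECEIVE_IP_REPLY: waiting for the REPLCONF reply about the IP.
REPL_STATE_RECEIVE_CAPA_REPLY: waiting for the REPLCONF reply about capabilities (the supported features).
REPL_STATE_SEND_PSYNC: send PSYNC.
REPL_STATE_RECEIVE_PSYNC_REPLY: waiting for the reply to PSYNC.
REPL_STATE_TRANSFER: receiving the snapshot.
REPL_STATE_CONNECTED: replication is established.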
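The "must be ordered" remark in the enum matters because ordered states allow simple range checks. The Python sketch below mirrors the C enum with IntEnum; the class name ReplState and the helper in_handshake are assumptions made for this illustration and are not Redis identifiers.

```python
from enum import IntEnum, auto


class ReplState(IntEnum):
    NONE = 0
    CONNECT = auto()
    CONNECTING = auto()
    # --- handshake states, must stay in this order ---
    RECEIVE_PING_REPLY = auto()
    SEND_HANDSHAKE = auto()
    RECEIVE_AUTH_REPLY = auto()
    RECEIVE_PORT_REPLY = auto()
    RECEIVE_IP_REPLY = auto()
    RECEIVE_CAPA_REPLY = auto()
    SEND_PSYNC = auto()
    RECEIVE_PSYNC_REPLY = auto()
    # --- end of handshake states ---
    TRANSFER = auto()
    CONNECTED = auto()


def in_handshake(state):
    """Range check that only works because the handshake states are contiguous."""
    return ReplState.RECEIVE_PING_REPLY <= state <= ReplState.RECEIVE_PSYNC_REPLY


print(in_handshake(ReplState.SEND_PSYNC))   # inside the handshake range
print(in_handshake(ReplState.CONNECTING))   # before it
```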

replicaofCommand

Obtain the Master's IP and port.

void replicaofCommand(client *c) {
    ....
    long port;
    replicationSetMaster(c->argv[1]->ptr, port);
    addReply(c,shared.ok);
}

replicationSetMaster

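Disconnect all Slaves, then cancel the existing replication handshake (if any). If this instance was previously a master, create a cached master from its own state so that PSYNC can be reused later (keeping the current progress instead of doing a full resynchronization).

Set the state to REPL_STATE_CONNECT, meaning a connection to the Master is required.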

void replicationSetMaster(char *ip, int port) {
    server.masterhost = sdsnew(ip);
    server.masterport = port;

    /* Force our slaves to resync with us as well. They may hopefully be able
     * to partially resync with us, but we can notify the replid change. */
    disconnectSlaves();
    cancelReplicationHandshake(0);
    /* Before destroying our master state, create a cached master using
     * our own parameters, to later PSYNC with the new master. */
    if (was_master) {
        replicationDiscardCachedMaster();
        replicationCacheMasterUsingMyself();
    }

    server.repl_state = REPL_STATE_CONNECT;
    connectWithMaster();
}

connectWithMaster

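Redis 6 supports TLS. To keep the analysis simple, we assume TLS is not used here.

server.repl_transfer_lastio records the time of the last I/O and is used for timeout handling.
Set the state to REPL_STATE_CONNECTING, meaning the non-blocking connection to the Master is in progress.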

int connectWithMaster(void) {
    server.repl_transfer_s = server.tls_replication ? connCreateTLS() : connCreateSocket();
    if (connConnect(server.repl_transfer_s, server.masterhost, server.masterport,
                NET_FIRST_BIND_ADDR, syncWithMaster) == C_ERR) {
        serverLog(LL_WARNING,"Unable to connect to MASTER: %s",
                connGetLastError(server.repl_transfer_s));
        connClose(server.repl_transfer_s);
        server.repl_transfer_s = NULL;
        return C_ERR;
    }

    server.repl_transfer_lastio = server.unixtime;
    server.repl_state = REPL_STATE_CONNECTING;
    serverLog(LL_NOTICE,"MASTER <-> REPLICA sync started");
    return C_OK;
}

syncWithMaster

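Once the Slave → Master connection is established, the syncWithMaster callback runs. The function is over 300 lines long, so it is explained in several parts.

If the state machine is in REPL_STATE_NONE, return immediately.
Check that the connection is healthy.

The first case arises mainly when the client changes its mind (SLAVEOF NO ONE) after the Slave has already connected to the Master.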

void syncWithMaster(connection *conn) {
    char tmpfile[256], *err = NULL;
    int dfd = -1, maxtries = 5;
    int psync_result;

    /* If this event fired after the user turned the instance into a master
     * with SLAVEOF NO ONE we must just return ASAP. */
    if (server.repl_state == REPL_STATE_NONE) {
        connClose(conn);
        return;
    }

    /* Check for errors in the socket: after a non blocking connect() we
     * may find that the socket is in error state. */
    if (connGetState(conn) != CONN_STATE_CONNECTED) {
        serverLog(LL_WARNING,"Error condition on socket for SYNC: %s",
                connGetLastError(conn));
        goto error;
    }

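In REPL_STATE_CONNECTING, set the read handler to the current function.
Send a PING command to the Master.
Set the state to REPL_STATE_RECEIVE_PING_REPLY, meaning we are waiting for the Master to reply with PONG.

The reason is that the connect handler runs only once, while the rest of the state machine lives in this same function, so the function has to be entered again.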

/* Send a PING to check the master is able to reply without errors. */
if (server.repl_state == REPL_STATE_CONNECTING) {
    serverLog(LL_NOTICE,"Non blocking connect for SYNC fired the event.");
    /* Delete the writable event so that the readable event remains
     * registered and we can wait for the PONG reply. */
    connSetReadHandler(conn, syncWithMaster);
    connSetWriteHandler(conn, NULL);
    server.repl_state = REPL_STATE_RECEIVE_PING_REPLY;
    /* Send the PING, don't check for errors at all, we have the timeout
     * that will take care about this. */
    err = sendCommand(conn,"PING",NULL);
    if (err) goto write_error;
    return;
}
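The re-entry pattern described above can be sketched in isolation. This is a minimal illustration, not Redis code: the `replica` struct, the shortened state list, and `on_event` are hypothetical names. It only shows how one handler, driven by a stored state, can serve every event.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down copy of the handshake states; the real enum in
 * Redis has more members. */
typedef enum {
    ST_CONNECTING,
    ST_RECEIVE_PING_REPLY,
    ST_SEND_HANDSHAKE,
    ST_DONE
} repl_state;

typedef struct {
    repl_state state;
    int pings_sent;   /* how many PINGs were "written" */
} replica;

/* One handler serves every event. Each call inspects the current state,
 * performs that state's step, stores the next state and returns, so the
 * event loop can call it again when the next reply is readable. */
void on_event(replica *r) {
    assert(r != NULL);
    if (r->state == ST_CONNECTING) {
        r->pings_sent++;                 /* stands in for sending PING */
        r->state = ST_RECEIVE_PING_REPLY;
        return;                          /* wait for the PONG */
    }
    if (r->state == ST_RECEIVE_PING_REPLY) {
        r->state = ST_SEND_HANDSHAKE;    /* reply consumed, fall through */
    }
    if (r->state == ST_SEND_HANDSHAKE) {
        r->state = ST_DONE;              /* stands in for AUTH/REPLCONF */
        return;
    }
}
```

Calling `on_event` twice on a fresh `replica` walks it from `ST_CONNECTING` to `ST_DONE`. Some states return to wait for I/O while others fall through, which mirrors the mix of `return` and fall-through in syncWithMaster.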

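Read the Master's reply to PING synchronously. A positive reply (+PONG) is accepted, and so is an authentication error (-NOAUTH, -NOPERM, or the "operation not permitted" message used by older versions), because authentication happens in the next step. Any other reply aborts the handshake.
Set the state to REPL_STATE_SEND_HANDSHAKE, meaning the handshake can be sent.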

/* Receive the PONG command. */
if (server.repl_state == REPL_STATE_RECEIVE_PING_REPLY) {
    err = receiveSynchronousResponse(conn);

    /* We accept only two replies as valid, a positive +PONG reply
     * (we just check for "+") or an authentication error.
     * Note that older versions of Redis replied with "operation not
     * permitted" instead of using a proper error code, so we test
     * both. */
    if (err[0] != '+' &&
        strncmp(err,"-NOAUTH",7) != 0 &&
        strncmp(err,"-NOPERM",7) != 0 &&
        strncmp(err,"-ERR operation not permitted",28) != 0)
    {
        serverLog(LL_WARNING,"Error reply to PING from master: '%s'",err);
        sdsfree(err);
        goto error;
    } else {
        serverLog(LL_NOTICE,
            "Master replied to PING, replication can continue...");
    }
    sdsfree(err);
    err = NULL;
    server.repl_state = REPL_STATE_SEND_HANDSHAKE;
}
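The acceptance rule for the PING reply can be pulled out into a small predicate. This is a sketch: `ping_reply_ok` is a hypothetical helper name, and the prefixes are the ones tested in the snippet above.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 when the reply to PING lets the handshake continue:
 * either a positive reply or an authentication-related error. */
int ping_reply_ok(const char *reply) {
    assert(reply != NULL);
    return reply[0] == '+' ||
           strncmp(reply, "-NOAUTH", 7) == 0 ||
           strncmp(reply, "-NOPERM", 7) == 0 ||
           strncmp(reply, "-ERR operation not permitted", 28) == 0;
}
```

With this rule, a reply such as `-LOADING ...` is rejected, while `-NOAUTH ...` is tolerated because the AUTH step follows.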

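The handshake authenticates with the password if one is configured, tells the Master the Slave's IP and port so that the Master can report them, and announces the Slave's capabilities: eof means diskless (EOF-delimited) RDB transfer is supported, and psync2 means PSYNC v2 partial resynchronization is supported.
Set the state to REPL_STATE_RECEIVE_AUTH_REPLY, meaning we are waiting for the authentication reply.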

if (server.repl_state == REPL_STATE_SEND_HANDSHAKE) {
    /* AUTH with the master if required. */
    if (server.masterauth) {
        char *args[3] = {"AUTH",NULL,NULL};
        size_t lens[3] = {4,0,0};
        int argc = 1;
        if (server.masteruser) {
            args[argc] = server.masteruser;
            lens[argc] = strlen(server.masteruser);
            argc++;
        }
        args[argc] = server.masterauth;
        lens[argc] = sdslen(server.masterauth);
        argc++;
        err = sendCommandArgv(conn, argc, args, lens);
        if (err) goto write_error;
    }

    /* Set the slave port, so that Master's INFO command can list the
     * slave listening port correctly. */
    {
        int port;
        if (server.slave_announce_port)
            port = server.slave_announce_port;
        else if (server.tls_replication && server.tls_port)
            port = server.tls_port;
        else
            port = server.port;
        sds portstr = sdsfromlonglong(port);
        err = sendCommand(conn,"REPLCONF",
                "listening-port",portstr, NULL);
        sdsfree(portstr);
        if (err) goto write_error;
    }

    /* Set the slave ip, so that Master's INFO command can list the
     * slave IP address port correctly in case of port forwarding or NAT.
     * Skip REPLCONF ip-address if there is no slave-announce-ip option set. */
    if (server.slave_announce_ip) {
        err = sendCommand(conn,"REPLCONF",
                "ip-address",server.slave_announce_ip, NULL);
        if (err) goto write_error;
    }

    /* Inform the master of our (slave) capabilities.
     *
     * EOF: supports EOF-style RDB transfer for diskless replication.
     * PSYNC2: supports PSYNC v2, so understands +CONTINUE <new repl ID>.
     *
     * The master will ignore capabilities it does not understand. */
    err = sendCommand(conn,"REPLCONF",
            "capa","eof","capa","psync2",NULL);
    if (err) goto write_error;

    server.repl_state = REPL_STATE_RECEIVE_AUTH_REPLY;
    return;
}
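To make the handshake commands concrete, here is a sketch of how a command such as REPLCONF capa eof capa psync2 looks on the wire when encoded as a RESP array of bulk strings, the standard request format of the Redis protocol. `resp_encode` is a hypothetical helper written for this illustration. It relies on `strlen`, so unlike a real implementation it is not binary-safe.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Encodes argv[0..argc-1] as a RESP array of bulk strings into out.
 * Returns the number of bytes written (excluding the terminating NUL),
 * or -1 if out is too small. */
int resp_encode(int argc, const char **argv, char *out, size_t outlen) {
    assert(argv != NULL && out != NULL);
    int n = snprintf(out, outlen, "*%d\r\n", argc);
    if (n < 0 || (size_t)n >= outlen) return -1;
    size_t used = (size_t)n;
    for (int i = 0; i < argc; i++) {
        n = snprintf(out + used, outlen - used, "$%zu\r\n%s\r\n",
                     strlen(argv[i]), argv[i]);
        if (n < 0 || (size_t)n >= outlen - used) return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

Each argument is prefixed with its length, so the receiver never has to scan for separators inside an argument.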

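Check the result of authentication (the step is skipped when masterauth is not configured).
Set the state to REPL_STATE_RECEIVE_PORT_REPLY, meaning we are waiting for the Master to acknowledge the port setting.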

if (server.repl_state == REPL_STATE_RECEIVE_AUTH_REPLY && !server.masterauth)
    server.repl_state = REPL_STATE_RECEIVE_PORT_REPLY;

/* Receive AUTH reply. */
if (server.repl_state == REPL_STATE_RECEIVE_AUTH_REPLY) {
    err = receiveSynchronousResponse(conn);
    if (err[0] == '-') {
        serverLog(LL_WARNING,"Unable to AUTH to MASTER: %s",err);
        sdsfree(err);
        goto error;
    }
    sdsfree(err);
    err = NULL;
    server.repl_state = REPL_STATE_RECEIVE_PORT_REPLY;
    return;
}

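Check the replies to the port and IP settings. The snippet below shows the ip-address step, which is skipped when slave-announce-ip is not set.
Set the state to REPL_STATE_RECEIVE_CAPA_REPLY, meaning we are waiting for the Master to acknowledge the capabilities.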

if (server.repl_state == REPL_STATE_RECEIVE_IP_REPLY && !server.slave_announce_ip)
    server.repl_state = REPL_STATE_RECEIVE_CAPA_REPLY;

/* Receive REPLCONF ip-address reply. */
if (server.repl_state == REPL_STATE_RECEIVE_IP_REPLY) {
    err = receiveSynchronousResponse(conn);
    /* Ignore the error if any, not all the Redis versions support
     * REPLCONF ip-address. */
    if (err[0] == '-') {
        serverLog(LL_NOTICE,"(Non critical) Master does not understand "
                            "REPLCONF ip-address: %s", err);
    }
    sdsfree(err);
    server.repl_state = REPL_STATE_RECEIVE_CAPA_REPLY;
    return;
}

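Check whether the capabilities were accepted. An error here is logged but is not fatal.
Set the state to REPL_STATE_SEND_PSYNC, meaning synchronization can begin.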

/* Receive CAPA reply. */
if (server.repl_state == REPL_STATE_RECEIVE_CAPA_REPLY) {
    err = receiveSynchronousResponse(conn);
    /* Ignore the error if any, not all the Redis versions support
     * REPLCONF capa. */
    if (err[0] == '-') {
        serverLog(LL_NOTICE,"(Non critical) Master does not understand "
                              "REPLCONF capa: %s", err);
    }
    sdsfree(err);
    err = NULL;
    server.repl_state = REPL_STATE_SEND_PSYNC;
}

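slaveTryPartialResynchronization(conn, 0) sends PSYNC to the Master. Without a cached master the command is PSYNC ? -1, where ? stands for the unknown Master replication ID (run ID) and -1 for the offset.
Set the state to REPL_STATE_RECEIVE_PSYNC_REPLY, meaning we are waiting for the Master's reply to PSYNC.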

/* Try a partial resynchronization. If we don't have a cached master
 * slaveTryPartialResynchronization() will at least try to use PSYNC
 * to start a full resynchronization so that we get the master replid
 * and the global offset, to try a partial resync at the next
 * reconnection attempt. */
if (server.repl_state == REPL_STATE_SEND_PSYNC) {
    if (slaveTryPartialResynchronization(conn,0) == PSYNC_WRITE_ERROR) {
        err = sdsnew("Write error sending the PSYNC command.");
        abortFailover("Write error to failover target");
        goto write_error;
    }
    server.repl_state = REPL_STATE_RECEIVE_PSYNC_REPLY;
    return;
}
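The choice of PSYNC arguments can be sketched as a small function. `psync_request` is a hypothetical name, and the output is a human-readable line rather than the wire encoding. It assumes, as in the write half of slaveTryPartialResynchronization, that the requested offset is the cached offset plus one.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Formats the PSYNC request as text. cached_replid == NULL models
 * "no cached master": ask with "?" and -1 so the Master answers with
 * a full resynchronization. */
void psync_request(const char *cached_replid, long long cached_reploff,
                   char *line, size_t len) {
    assert(line != NULL && len > 0);
    if (cached_replid != NULL) {
        /* Ask for the first byte we have not received yet. */
        snprintf(line, len, "PSYNC %s %lld", cached_replid, cached_reploff + 1);
    } else {
        snprintf(line, len, "PSYNC ? -1");
    }
}
```

A Slave that has never synchronized therefore always starts with `PSYNC ? -1`, and the full resynchronization that follows gives it the replid and offset to use on later reconnections.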

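slaveTryPartialResynchronization(conn,1) synchronously reads the Master's reply to PSYNC to decide between a full and a partial (incremental) resynchronization. If the Master does not support PSYNC, a full resynchronization is performed.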

psync_result = slaveTryPartialResynchronization(conn,1);
if (psync_result == PSYNC_WAIT_REPLY) return; /* Try again later... */

/* Check the status of the planned failover. We expect PSYNC_CONTINUE,
 * but there is nothing technically wrong with a full resync which
 * could happen in edge cases. */
if (server.failover_state == FAILOVER_IN_PROGRESS) {
    if (psync_result == PSYNC_CONTINUE || psync_result == PSYNC_FULLRESYNC) {
        clearFailoverState();
    } else {
        abortFailover("Failover target rejected psync request");
        return;
    }
}

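If a partial resynchronization is possible, slaveTryPartialResynchronization has already set the state to REPL_STATE_CONNECTED (connection established), so the function simply returns.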

/* If the master is in an transient error, we should try to PSYNC
 * from scratch later, so go to the error path. This happens when
 * the server is loading the dataset or is not connected with its
 * master and so forth. */
if (psync_result == PSYNC_TRY_LATER) goto error;

/* Note: if PSYNC does not return WAIT_REPLY, it will take care of
 * uninstalling the read handler from the file descriptor. */

if (psync_result == PSYNC_CONTINUE) {
    serverLog(LL_NOTICE, "MASTER <-> REPLICA sync: Master accepted a Partial Resynchronization.");
    if (server.supervised_mode == SUPERVISED_SYSTEMD) {
        redisCommunicateSystemd("STATUS=MASTER <-> REPLICA sync: Partial Resynchronization accepted. Ready to accept connections in read-write mode.\n");
    }
    return;
}

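If a partial resynchronization is not possible, disconnect all of our own Slaves and free the backlog, since everything starts over from scratch. If the Master does not support PSYNC at all, fall back to SYNC.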

/* PSYNC failed or is not supported: we want our slaves to resync with us
 * as well, if we have any sub-slaves. The master may transfer us an
 * entirely different data set and we have no way to incrementally feed
 * our slaves after that. */
disconnectSlaves(); /* Force our slaves to resync with us as well. */
freeReplicationBacklog(); /* Don't allow our chained slaves to PSYNC. */

/* Fall back to SYNC if needed. Otherwise psync_result == PSYNC_FULLRESYNC
 * and the server.master_replid and master_initial_offset are
 * already populated. */
if (psync_result == PSYNC_NOT_SUPPORTED) {
    serverLog(LL_NOTICE,"Retrying with SYNC...");
    if (connSyncWrite(conn,"SYNC\r\n",6,server.repl_syncio_timeout*1000) == -1) {
        serverLog(LL_WARNING,"I/O error writing to MASTER: %s",
            strerror(errno));
        goto error;
    }
}
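The branching on psync_result across the last few snippets can be summarized as a lookup. This is only a sketch: the `next_action` enum and both helper functions are hypothetical, while the PSYNC_* values are the ones defined alongside slaveTryPartialResynchronization.

```c
#include <assert.h>

#define PSYNC_WRITE_ERROR 0
#define PSYNC_WAIT_REPLY 1
#define PSYNC_CONTINUE 2
#define PSYNC_FULLRESYNC 3
#define PSYNC_NOT_SUPPORTED 4
#define PSYNC_TRY_LATER 5

typedef enum {
    ACT_WAIT_REPLY,   /* return and wait for more data */
    ACT_ERROR_PATH,   /* clean up and retry the whole handshake later */
    ACT_CONNECTED,    /* partial resync accepted, nothing to transfer */
    ACT_FULL_RESYNC,  /* receive an RDB payload */
    ACT_LEGACY_SYNC   /* send SYNC, then receive an RDB payload */
} next_action;

/* Maps a PSYNC result code to what the caller does next. */
next_action after_psync(int result) {
    assert(result >= PSYNC_WRITE_ERROR && result <= PSYNC_TRY_LATER);
    switch (result) {
    case PSYNC_WAIT_REPLY:    return ACT_WAIT_REPLY;
    case PSYNC_CONTINUE:      return ACT_CONNECTED;
    case PSYNC_FULLRESYNC:    return ACT_FULL_RESYNC;
    case PSYNC_NOT_SUPPORTED: return ACT_LEGACY_SYNC;
    default:                  return ACT_ERROR_PATH; /* TRY_LATER, WRITE_ERROR */
    }
}

/* Both full-transfer paths invalidate what our own sub-replicas hold. */
int must_drop_sub_replicas(next_action a) {
    return a == ACT_FULL_RESYNC || a == ACT_LEGACY_SYNC;
}
```

Only the two paths that end in an RDB transfer disconnect sub-replicas and free the backlog, because the incoming data set may be entirely different from the current one.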

When the RDB is received into a file (that is, diskless load is not used), first create a temporary file.

/* Prepare a suitable temp file for bulk transfer */
if (!useDisklessLoad()) {
    while(maxtries--) {
        snprintf(tmpfile,256,
            "temp-%d.%ld.rdb",(int)server.unixtime,(long int)getpid());
        dfd = open(tmpfile,O_CREAT|O_WRONLY|O_EXCL,0644);
        if (dfd != -1) break;
        sleep(1);
    }
    if (dfd == -1) {
        serverLog(LL_WARNING,"Opening the temp file needed for MASTER <-> REPLICA synchronization: %s",strerror(errno));
        goto error;
    }
    server.repl_transfer_tmpfile = zstrdup(tmpfile);
    server.repl_transfer_fd = dfd;
}

Install the Read Handler that reads the payload, and set the state to REPL_STATE_TRANSFER, meaning the transfer is in progress.

/* Setup the non blocking download of the bulk file. */
if (connSetReadHandler(conn, readSyncBulkPayload)
        == C_ERR)
{
    char conninfo[CONN_INFO_LEN];
    serverLog(LL_WARNING,
        "Can't create readable event for SYNC: %s (%s)",
        strerror(errno), connGetInfo(conn, conninfo, sizeof(conninfo)));
    goto error;
}

server.repl_state = REPL_STATE_TRANSFER;
server.repl_transfer_size = -1;
server.repl_transfer_read = 0;
server.repl_transfer_last_fsync_off = 0;
server.repl_transfer_lastio = server.unixtime;
return;

slaveTryPartialResynchronization

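slaveTryPartialResynchronization talks to the Master to find out whether a partial resynchronization is possible.

The first half sends the PSYNC command. cached_master holds the information about the Master that was previously disconnected unexpectedly.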

#define PSYNC_WRITE_ERROR 0
#define PSYNC_WAIT_REPLY 1
#define PSYNC_CONTINUE 2
#define PSYNC_FULLRESYNC 3
#define PSYNC_NOT_SUPPORTED 4
#define PSYNC_TRY_LATER 5

int slaveTryPartialResynchronization(connection *conn, int read_reply) {
    char *psync_replid;
    char psync_offset[32];
    sds reply;

    /* Writing half */
    if (!read_reply) {
        /* Initially set master_initial_offset to -1 to mark the current
         * master replid and offset as not valid. Later if we'll be able to do
         * a FULL resync using the PSYNC command we'll set the offset at the
         * right value, so that this information will be propagated to the
         * client structure representing the master into server.master. */
        server.master_initial_offset = -1;

        if (server.cached_master) {
            psync_replid = server.cached_master->replid;
            snprintf(psync_offset,sizeof(psync_offset),"%lld", server.cached_master->reploff+1);
            serverLog(LL_NOTICE,"Trying a partial resynchronization (request %s:%s).", psync_replid, psync_offset);
        } else {
            serverLog(LL_NOTICE,"Partial resynchronization not possible (no cached master)");
            psync_replid = "?";
            memcpy(psync_offset,"-1",3);
        }

        /* Issue the PSYNC command, if this is a master with a failover in
         * progress then send the failover argument to the replica to cause it
         * to become a master */
        if (server.failover_state == FAILOVER_IN_PROGRESS) {
            reply = sendCommand(conn,"PSYNC",psync_replid,psync_offset,"FAILOVER",NULL);
        } else {
            reply = sendCommand(conn,"PSYNC",psync_replid,psync_offset,NULL);
        }

        if (reply != NULL) {
            serverLog(LL_WARNING,"Unable to send PSYNC to master: %s",reply);
            sdsfree(reply);
            connSetReadHandler(conn, NULL);
            return PSYNC_WRITE_ERROR;
        }
        return PSYNC_WAIT_REPLY;
    }

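The second half reads the Master's reply and determines whether it is a full resynchronization (+FULLRESYNC) or a partial one (+CONTINUE).

The RUN_ID is a 40-character random value generated each time the instance starts. The offset is the position in the replication stream, used to continue with incremental replication after the RDB has been loaded.

replid2 exists because when a Slave is promoted to Master, the other Slaves that connect to the new Master would always fail a check against replid alone. This variable therefore keeps the ID of the Master it previously replicated from.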
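To illustrate why replid2 helps, here is a sketch of the kind of check a promoted Master could apply to an incoming PSYNC request. Everything here is an assumption made for illustration: the `master_view` struct and `can_partial_resync` are hypothetical, and the exact conditions in Redis's masterTryPartialResynchronization are not shown in this article.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical snapshot of the fields a Master consults. */
typedef struct {
    const char *replid;             /* current replication ID */
    const char *replid2;            /* ID inherited from the previous Master */
    long long second_replid_offset; /* replid2 is valid up to this offset */
    long long backlog_off;          /* offset of the first byte in the backlog */
    long long backlog_histlen;      /* number of bytes held in the backlog */
} master_view;

/* Returns 1 if a request (req_replid, req_offset) could be served
 * incrementally: the ID must match either the current ID, or the previous
 * ID within its valid range, and the offset must still be in the backlog. */
int can_partial_resync(const master_view *m, const char *req_replid,
                       long long req_offset) {
    assert(m != NULL && req_replid != NULL);
    int id_ok = strcmp(req_replid, m->replid) == 0 ||
                (strcmp(req_replid, m->replid2) == 0 &&
                 req_offset <= m->second_replid_offset);
    if (!id_ok) return 0;
    return req_offset >= m->backlog_off &&
           req_offset <= m->backlog_off + m->backlog_histlen;
}
```

A Slave that still presents the old Master's ID is accepted as long as its offset does not go past the point where the histories diverged. This matches the role of second_replid_offset in the +CONTINUE handling in the code that follows.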

    /* Reading half */
    reply = receiveSynchronousResponse(conn);
    if (sdslen(reply) == 0) {
        /* The master may send empty newlines after it receives PSYNC
         * and before to reply, just to keep the connection alive. */
        sdsfree(reply);
        return PSYNC_WAIT_REPLY;
    }

    connSetReadHandler(conn, NULL);

    if (!strncmp(reply,"+FULLRESYNC",11)) {
        char *replid = NULL, *offset = NULL;

        /* FULL RESYNC, parse the reply in order to extract the replid
         * and the replication offset. */
        replid = strchr(reply,' ');
        if (replid) {
            replid++;
            offset = strchr(replid,' ');
            if (offset) offset++;
        }
        if (!replid || !offset || (offset-replid-1) != CONFIG_RUN_ID_SIZE) {
            serverLog(LL_WARNING,
                "Master replied with wrong +FULLRESYNC syntax.");
            /* This is an unexpected condition, actually the +FULLRESYNC
             * reply means that the master supports PSYNC, but the reply
             * format seems wrong. To stay safe we blank the master
             * replid to make sure next PSYNCs will fail. */
            memset(server.master_replid,0,CONFIG_RUN_ID_SIZE+1);
        } else {
            memcpy(server.master_replid, replid, offset-replid-1);
            server.master_replid[CONFIG_RUN_ID_SIZE] = '\0';
            server.master_initial_offset = strtoll(offset,NULL,10);
            serverLog(LL_NOTICE,"Full resync from master: %s:%lld",
                server.master_replid,
                server.master_initial_offset);
        }
        /* We are going to full resync, discard the cached master structure. */
        replicationDiscardCachedMaster();
        sdsfree(reply);
        return PSYNC_FULLRESYNC;
    }

    if (!strncmp(reply,"+CONTINUE",9)) {
        /* Partial resync was accepted. */
        serverLog(LL_NOTICE,
            "Successful partial resynchronization with master.");

        /* Check the new replication ID advertised by the master. If it
         * changed, we need to set the new ID as primary ID, and set or
         * secondary ID as the old master ID up to the current offset, so
         * that our sub-slaves will be able to PSYNC with us after a
         * disconnection. */
        char *start = reply+10;
        char *end = reply+9;
        while(end[0] != '\r' && end[0] != '\n' && end[0] != '\0') end++;
        if (end-start == CONFIG_RUN_ID_SIZE) {
            char new[CONFIG_RUN_ID_SIZE+1];
            memcpy(new,start,CONFIG_RUN_ID_SIZE);
            new[CONFIG_RUN_ID_SIZE] = '\0';

            if (strcmp(new,server.cached_master->replid)) {
                /* Master ID changed. */
                serverLog(LL_WARNING,"Master replication ID changed to %s",new);

                /* Set the old ID as our ID2, up to the current offset+1. */
                memcpy(server.replid2,server.cached_master->replid,
                    sizeof(server.replid2));
                server.second_replid_offset = server.master_repl_offset+1;

                /* Update the cached master ID and our own primary ID to the
                 * new one. */
                memcpy(server.replid,new,sizeof(server.replid));
                memcpy(server.cached_master->replid,new,sizeof(server.replid));

                /* Disconnect all the sub-slaves: they need to be notified. */
                disconnectSlaves();
            }
        }

        /* Setup the replication to continue. */
        sdsfree(reply);
        replicationResurrectCachedMaster(conn);

        /* If this instance was restarted and we read the metadata to
         * PSYNC from the persistence file, our replication backlog could
         * be still not initialized. Create it. */
        if (server.repl_backlog == NULL) createReplicationBacklog();
        return PSYNC_CONTINUE;
    }

    /* If we reach this point we received either an error (since the master does
     * not understand PSYNC or because it is in a special state and cannot
     * serve our request), or an unexpected reply from the master.
     *
     * Return PSYNC_NOT_SUPPORTED on errors we don't understand, otherwise
     * return PSYNC_TRY_LATER if we believe this is a transient error. */

    if (!strncmp(reply,"-NOMASTERLINK",13) ||
        !strncmp(reply,"-LOADING",8))
    {
        serverLog(LL_NOTICE,
            "Master is currently unable to PSYNC "
            "but should be in the future: %s", reply);
        sdsfree(reply);
        return PSYNC_TRY_LATER;
    }

    if (strncmp(reply,"-ERR",4)) {
        /* If it's not an error, log the unexpected event. */
        serverLog(LL_WARNING,
            "Unexpected reply to PSYNC from master: %s", reply);
    } else {
        serverLog(LL_NOTICE,
            "Master does not support PSYNC or is in "
            "error state (reply: %s)", reply);
    }
    sdsfree(reply);
    replicationDiscardCachedMaster();
    return PSYNC_NOT_SUPPORTED;
}
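The +FULLRESYNC parsing step can be isolated into a standalone function. This sketch follows the same approach as the code above (two `strchr` calls and a length check). `parse_fullresync` and `RUN_ID_SIZE` are names chosen for this illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define RUN_ID_SIZE 40

/* Parses "+FULLRESYNC <replid> <offset>". On success copies the replid
 * into replid_out (which must hold RUN_ID_SIZE+1 bytes), stores the
 * offset, and returns 1. Returns 0 if the reply is malformed. */
int parse_fullresync(const char *reply, char *replid_out,
                     long long *offset_out) {
    assert(reply != NULL && replid_out != NULL && offset_out != NULL);
    if (strncmp(reply, "+FULLRESYNC", 11) != 0) return 0;
    const char *replid = strchr(reply, ' ');
    if (replid == NULL) return 0;
    replid++;                               /* skip the space */
    const char *offset = strchr(replid, ' ');
    if (offset == NULL) return 0;
    if (offset - replid != RUN_ID_SIZE) return 0;
    offset++;                               /* skip the space */
    memcpy(replid_out, replid, RUN_ID_SIZE);
    replid_out[RUN_ID_SIZE] = '\0';
    *offset_out = strtoll(offset, NULL, 10);
    return 1;
}
```

Unlike the Redis code, which blanks master_replid and still proceeds with a full resynchronization on a malformed reply, this sketch simply reports failure to the caller.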

readSyncBulkPayload

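readSyncBulkPayload reads the RDB payload from the Master (which may also arrive via diskless transfer).

If the transfer has just started (detected by server.repl_transfer_size == -1), first validate the protocol and determine whether this is a file-based or a diskless transfer. For a file, the size is known in advance. Otherwise the EOF prefix indicates a diskless transfer, and eofmark is the delimiter that ends the stream.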

#define REPL_MAX_WRITTEN_BEFORE_FSYNC (1024*1024*8) /* 8 MB */
void readSyncBulkPayload(connection *conn) {
    char buf[PROTO_IOBUF_LEN];
    ssize_t nread, readlen, nwritten;
    int use_diskless_load = useDisklessLoad();
    dbBackup *diskless_load_backup = NULL;
    int empty_db_flags = server.repl_slave_lazy_flush ? EMPTYDB_ASYNC :
                                                        EMPTYDB_NO_FLAGS;
    off_t left;

    /* Static vars used to hold the EOF mark, and the last bytes received
     * from the server: when they match, we reached the end of the transfer. */
    static char eofmark[CONFIG_RUN_ID_SIZE];
    static char lastbytes[CONFIG_RUN_ID_SIZE];
    static int usemark = 0;

    /* If repl_transfer_size == -1 we still have to read the bulk length
     * from the master reply. */
    if (server.repl_transfer_size == -1) {
        if (connSyncReadLine(conn,buf,1024,server.repl_syncio_timeout*1000) == -1) {
            serverLog(LL_WARNING,
                "I/O error reading bulk count from MASTER: %s",
                strerror(errno));
            goto error;
        }

        if (buf[0] == '-') {
            serverLog(LL_WARNING,
                "MASTER aborted replication with an error: %s",
                buf+1);
            goto error;
        } else if (buf[0] == '\0') {
            /* At this stage just a newline works as a PING in order to take
             * the connection live. So we refresh our last interaction
             * timestamp. */
            server.repl_transfer_lastio = server.unixtime;
            return;
        } else if (buf[0] != '$') {
            serverLog(LL_WARNING,"Bad protocol from MASTER, the first byte is not '$' (we received '%s'), are you sure the host and port are right?", buf);
            goto error;
        }

        /* There are two possible forms for the bulk payload. One is the
         * usual $<count> bulk format. The other is used for diskless transfers
         * when the master does not know beforehand the size of the file to
         * transfer. In the latter case, the following format is used:
         *
         * $EOF:<40 bytes delimiter>
         *
         * At the end of the file the announced delimiter is transmitted. The
         * delimiter is long and random enough that the probability of a
         * collision with the actual file content can be ignored. */
        if (strncmp(buf+1,"EOF:",4) == 0 && strlen(buf+5) >= CONFIG_RUN_ID_SIZE) {
            usemark = 1;
            memcpy(eofmark,buf+5,CONFIG_RUN_ID_SIZE);
            memset(lastbytes,0,CONFIG_RUN_ID_SIZE);
            /* Set any repl_transfer_size to avoid entering this code path
             * at the next call. */
            server.repl_transfer_size = 0;
            serverLog(LL_NOTICE,
                "MASTER <-> REPLICA sync: receiving streamed RDB from master with EOF %s",
                use_diskless_load? "to parser":"to disk");
        } else {
            usemark = 0;
            server.repl_transfer_size = strtol(buf+1,NULL,10);
            serverLog(LL_NOTICE,
                "MASTER <-> REPLICA sync: receiving %lld bytes from master %s",
                (long long) server.repl_transfer_size,
                use_diskless_load? "to parser":"to disk");
        }
        return;
    }
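The two header forms can be told apart with a few string checks. The sketch below assumes the line has already been read without its trailing CRLF. `parse_bulk_header`, `bulk_kind`, and `MARK_LEN` are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MARK_LEN 40

typedef enum { BULK_INVALID, BULK_SIZED, BULK_EOF_MARK } bulk_kind;

/* Classifies the first line of the bulk transfer.
 * "$<count>"           -> BULK_SIZED, *size receives the byte count.
 * "$EOF:<40 bytes>"    -> BULK_EOF_MARK, mark receives the delimiter.
 * Anything else        -> BULK_INVALID. */
bulk_kind parse_bulk_header(const char *line, long long *size, char *mark) {
    assert(line != NULL && size != NULL && mark != NULL);
    if (line[0] != '$') return BULK_INVALID;
    if (strncmp(line + 1, "EOF:", 4) == 0 && strlen(line + 5) >= MARK_LEN) {
        memcpy(mark, line + 5, MARK_LEN);   /* not NUL-terminated */
        return BULK_EOF_MARK;
    }
    *size = strtoll(line + 1, NULL, 10);
    return BULK_SIZED;
}
```

With a sized header the receiver knows exactly when to stop reading. With an EOF mark it has to watch the tail of the stream for the delimiter.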

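Without diskless load, the data is first written to a file and then read back, whether the Master used file-based or diskless transfer.

For a diskless transfer, completion is detected by comparing eofmark with lastbytes.

The Redis source splits the diskless-load and disk-based-load code paths. They are merged here to make the walkthrough easier to follow.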

if (!use_diskless_load) {
    /* Read the data from the socket, store it to a file and search
     * for the EOF. */
    if (usemark) {
        readlen = sizeof(buf);
    } else {
        left = server.repl_transfer_size - server.repl_transfer_read;
        readlen = (left < (signed)sizeof(buf)) ? left : (signed)sizeof(buf);
    }

    nread = connRead(conn,buf,readlen);
    if (nread <= 0) {
        if (connGetState(conn) == CONN_STATE_CONNECTED) {
            /* equivalent to EAGAIN */
            return;
        }
        serverLog(LL_WARNING,"I/O error trying to sync with MASTER: %s",
            (nread == -1) ? strerror(errno) : "connection lost");
        cancelReplicationHandshake(1);
        return;
    }
    atomicIncr(server.stat_net_input_bytes, nread);

    /* When a mark is used, we want to detect EOF asap in order to avoid
     * writing the EOF mark into the file... */
    int eof_reached = 0;

    if (usemark) {
        /* Update the last bytes array, and check if it matches our
         * delimiter. */
        if (nread >= CONFIG_RUN_ID_SIZE) {
            memcpy(lastbytes,buf+nread-CONFIG_RUN_ID_SIZE,
                   CONFIG_RUN_ID_SIZE);
        } else {
            int rem = CONFIG_RUN_ID_SIZE-nread;
            memmove(lastbytes,lastbytes+nread,rem);
            memcpy(lastbytes+rem,buf,nread);
        }
        if (memcmp(lastbytes,eofmark,CONFIG_RUN_ID_SIZE) == 0)
            eof_reached = 1;
    }

    /* Update the last I/O time for the replication transfer (used in
     * order to detect timeouts during replication), and write what we
     * got from the socket to the dump file on disk. */
    server.repl_transfer_lastio = server.unixtime;
    if ((nwritten = write(server.repl_transfer_fd,buf,nread)) != nread) {
        serverLog(LL_WARNING,
            "Write error or short write writing to the DB dump file "
            "needed for MASTER <-> REPLICA synchronization: %s",
            (nwritten == -1) ? strerror(errno) : "short write");
        goto error;
    }
    server.repl_transfer_read += nread;

    /* Delete the last 40 bytes from the file if we reached EOF. */
    if (usemark && eof_reached) {
        if (ftruncate(server.repl_transfer_fd,
            server.repl_transfer_read - CONFIG_RUN_ID_SIZE) == -1)
        {
            serverLog(LL_WARNING,
                "Error truncating the RDB file received from the master "
                "for SYNC: %s", strerror(errno));
            goto error;
        }
    }

    /* Sync data on disk from time to time, otherwise at the end of the
     * transfer we may suffer a big delay as the memory buffers are copied
     * into the actual disk. */
    if (server.repl_transfer_read >=
        server.repl_transfer_last_fsync_off + REPL_MAX_WRITTEN_BEFORE_FSYNC)
    {
        off_t sync_size = server.repl_transfer_read -
                          server.repl_transfer_last_fsync_off;
        rdb_fsync_range(server.repl_transfer_fd,
            server.repl_transfer_last_fsync_off, sync_size);
        server.repl_transfer_last_fsync_off += sync_size;
    }

    /* Check if the transfer is now complete */
    if (!usemark) {
        if (server.repl_transfer_read == server.repl_transfer_size)
            eof_reached = 1;
    }

    /* If the transfer is yet not complete, we need to read more, so
     * return ASAP and wait for the handler to be called again. */
    if (!eof_reached) return;
}
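The lastbytes bookkeeping is the subtle part of the code above: the delimiter may arrive split across several reads, so the last 40 bytes seen so far are kept in a sliding window. A minimal standalone sketch of that technique follows. `eof_detector`, `eof_init`, and `eof_feed` are hypothetical names, and chunks are assumed to be non-empty.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MARK_LEN 40

typedef struct {
    char mark[MARK_LEN];  /* delimiter announced in the $EOF: header */
    char last[MARK_LEN];  /* the most recent MARK_LEN bytes received */
} eof_detector;

void eof_init(eof_detector *d, const char *mark) {
    assert(d != NULL && mark != NULL);
    memcpy(d->mark, mark, MARK_LEN);
    memset(d->last, 0, MARK_LEN);
}

/* Feeds one chunk and returns 1 once the stream ends with the mark. */
int eof_feed(eof_detector *d, const char *buf, size_t n) {
    assert(d != NULL && buf != NULL && n > 0);
    if (n >= MARK_LEN) {
        /* The chunk alone covers the whole window. */
        memcpy(d->last, buf + n - MARK_LEN, MARK_LEN);
    } else {
        /* Shift the window left by n bytes and append the chunk. */
        size_t rem = MARK_LEN - n;
        memmove(d->last, d->last + n, rem);
        memcpy(d->last + rem, buf, n);
    }
    return memcmp(d->last, d->mark, MARK_LEN) == 0;
}
```

Because the window always holds exactly the last 40 bytes, the check costs the same whether the delimiter arrives in one read or in many small ones.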

/* We reach this point in one of the following cases:
 *
 * 1. The replica is using diskless replication, that is, it reads data
 *    directly from the socket to the Redis memory, without using
 *    a temporary RDB file on disk. In that case we just block and
 *    read everything from the socket.
 *
 * 2. Or when we are done reading from the socket to the RDB file, in
 *    such case we want just to read the RDB file in memory. */
serverLog(LL_NOTICE, "MASTER <-> REPLICA sync: Flushing old data");

/* We need to stop any AOF rewriting child before flushing and parsing
 * the RDB, otherwise we'll create a copy-on-write disaster. */
if (server.aof_state != AOF_OFF) stopAppendOnly();

/* When diskless RDB loading is used by replicas, it may be configured
 * in order to save the current DB instead of throwing it away,
 * so that we can restore it in case of failed transfer. */

    /* Ensure background save doesn't overwrite synced data */
    if (server.child_type == CHILD_TYPE_RDB) {
        serverLog(LL_NOTICE,
            "Replica is about to load the RDB file received from the "
            "master, but there is a pending RDB child running. "
            "Killing process %ld and removing its temp file to avoid "
            "any race",
            (long) server.child_pid);
        killRDBChild();
    }

    /* Make sure the new file (also used for persistence) is fully synced
     * (not covered by earlier calls to rdb_fsync_range). */
    if (fsync(server.repl_transfer_fd) == -1) {
        serverLog(LL_WARNING,
            "Failed trying to sync the temp DB to disk in "
            "MASTER <-> REPLICA synchronization: %s",
            strerror(errno));
        cancelReplicationHandshake(1);
        return;
    }

    /* Rename rdb like renaming rewrite aof asynchronously. */
    int old_rdb_fd = open(server.rdb_filename,O_RDONLY|O_NONBLOCK);
    if (rename(server.repl_transfer_tmpfile,server.rdb_filename) == -1) {
        serverLog(LL_WARNING,
            "Failed trying to rename the temp DB into %s in "
            "MASTER <-> REPLICA synchronization: %s",
            server.rdb_filename, strerror(errno));
        cancelReplicationHandshake(1);
        if (old_rdb_fd != -1) close(old_rdb_fd);
        return;
    }
    /* Close old rdb asynchronously. */
    if (old_rdb_fd != -1) bioCreateCloseJob(old_rdb_fd);

    if (rdbLoad(server.rdb_filename,&rsi,RDBFLAGS_REPLICATION) != C_OK) {
        serverLog(LL_WARNING,
            "Failed trying to load the MASTER synchronization "
            "DB from disk");
        cancelReplicationHandshake(1);
        if (server.rdb_del_sync_files && allPersistenceDisabled()) {
            serverLog(LL_NOTICE,"Removing the RDB file obtained from "
                                "the master. This replica has persistence "
                                "disabled");
            bg_unlink(server.rdb_filename);
        }
        /* Note that there's no point in restarting the AOF on sync failure,
           it'll be restarted when sync succeeds or replica promoted. */
        return;
    }

    /* Cleanup. */
    if (server.rdb_del_sync_files && allPersistenceDisabled()) {
        serverLog(LL_NOTICE,"Removing the RDB file obtained from "
                            "the master. This replica has persistence "
                            "disabled");
        bg_unlink(server.rdb_filename);
    }

    zfree(server.repl_transfer_tmpfile);
    close(server.repl_transfer_fd);
    server.repl_transfer_fd = -1;
    server.repl_transfer_tmpfile = NULL;
}

Diskless loading

Remove the socket's Read Handler, because the subsequent load goes through RIO, which reads the TCP stream and loads the data at the same time.

if (use_diskless_load &&
    server.repl_diskless_load == REPL_DISKLESS_LOAD_SWAPDB)
{
    /* Create a backup of server.db[] and initialize to empty
     * dictionaries. */
    diskless_load_backup = disklessLoadMakeBackup();
}
/* We call to emptyDb even in case of REPL_DISKLESS_LOAD_SWAPDB
 * (Where disklessLoadMakeBackup left server.db empty) because we
 * want to execute all the auxiliary logic of emptyDb (Namely,
 * fire module events) */
emptyDb(-1,empty_db_flags,replicationEmptyDbCallback);

/* Before loading the DB into memory we need to delete the readable
 * handler, otherwise it will get called recursively since
 * rdbLoad() will call the event loop to process events from time to
 * time for non blocking loading. */
connSetReadHandler(conn, NULL);
serverLog(LL_NOTICE, "MASTER <-> REPLICA sync: Loading DB in memory");
rdbSaveInfo rsi = RDB_SAVE_INFO_INIT;
if (use_diskless_load) {
    rio rdb;
    rioInitWithConn(&rdb,conn,server.repl_transfer_size);

    /* Put the socket in blocking mode to simplify RDB transfer.
     * We'll restore it when the RDB is received. */
    connBlock(conn);
    connRecvTimeout(conn, server.repl_timeout*1000);
    startLoading(server.repl_transfer_size, RDBFLAGS_REPLICATION);

    if (rdbLoadRio(&rdb,RDBFLAGS_REPLICATION,&rsi) != C_OK) {
        /* RDB loading failed. */
        stopLoading(0);
        serverLog(LL_WARNING,
            "Failed trying to load the MASTER synchronization DB "
            "from socket");
        cancelReplicationHandshake(1);
        rioFreeConn(&rdb, NULL);

        /* Remove the half-loaded data in case we started with
         * an empty replica. */
        emptyDb(-1,empty_db_flags,replicationEmptyDbCallback);

        if (server.repl_diskless_load == REPL_DISKLESS_LOAD_SWAPDB) {
            /* Restore the backed up databases. */
            disklessLoadRestoreBackup(diskless_load_backup);
        }

        /* Note that there's no point in restarting the AOF on SYNC
         * failure, it'll be restarted when sync succeeds or the replica
         * gets promoted. */
        return;
    }

    /* RDB loading succeeded if we reach this point. */
    if (server.repl_diskless_load == REPL_DISKLESS_LOAD_SWAPDB) {
        /* Delete the backup databases we created before starting to load
         * the new RDB. Now the RDB was loaded with success so the old
         * data is useless. */
        disklessLoadDiscardBackup(diskless_load_backup, empty_db_flags);
    }

    /* Verify the end mark is correct. */
    if (usemark) {
        if (!rioRead(&rdb,buf,CONFIG_RUN_ID_SIZE) ||
            memcmp(buf,eofmark,CONFIG_RUN_ID_SIZE) != 0)
        {
            stopLoading(0);
            serverLog(LL_WARNING,"Replication stream EOF marker is broken");
            cancelReplicationHandshake(1);
            rioFreeConn(&rdb, NULL);
            return;
        }
    }

    stopLoading(1);

    /* Cleanup and restore the socket to the original state to continue
     * with the normal replication. */
    rioFreeConn(&rdb, NULL);
    connNonBlock(conn);
    connRecvTimeout(conn,0);

replicationCron

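replicationCron runs on both the Master and the Slave. The Master sends heartbeats to its Slaves, while the Slave reports its current replication offset to the Master, which the Master uses when displaying replication status.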

void replicationCron(void) {
    /* Non blocking connection timeout? */
    if (server.masterhost &&
        (server.repl_state == REPL_STATE_CONNECTING ||
         slaveIsInHandshakeState()) &&
         (time(NULL)-server.repl_transfer_lastio) > server.repl_timeout)
    {
        serverLog(LL_WARNING,"Timeout connecting to the MASTER...");
        cancelReplicationHandshake(1);
    }

    /* Bulk transfer I/O timeout? */
    if (server.masterhost && server.repl_state == REPL_STATE_TRANSFER &&
        (time(NULL)-server.repl_transfer_lastio) > server.repl_timeout)
    {
        serverLog(LL_WARNING,"Timeout receiving bulk data from MASTER... If the problem persists try to set the 'repl-timeout' parameter in redis.conf to a larger value.");
        cancelReplicationHandshake(1);
    }

    /* Timed out master when we are an already connected slave? */
    if (server.masterhost && server.repl_state == REPL_STATE_CONNECTED &&
        (time(NULL)-server.master->lastinteraction) > server.repl_timeout)
    {
        serverLog(LL_WARNING,"MASTER timeout: no data nor PING received...");
        freeClient(server.master);
    }

    /* Check if we should connect to a MASTER */
    if (server.repl_state == REPL_STATE_CONNECT) {
        serverLog(LL_NOTICE,"Connecting to MASTER %s:%d",
            server.masterhost, server.masterport);
        connectWithMaster();
    }

    /* Send ACK to master from time to time.
     * Note that we do not send periodic acks to masters that don't
     * support PSYNC and replication offsets. */
    if (server.masterhost && server.master &&
        !(server.master->flags & CLIENT_PRE_PSYNC))
        replicationSendAck();
}
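The three timeout branches above all apply the same test: the link is considered dead when the time since the last observed I/O exceeds repl_timeout. A minimal sketch of that comparison, assuming a hypothetical helper named repl_link_timed_out that is not part of Redis:

```c
#include <assert.h>
#include <time.h>

/* Hypothetical helper (not a Redis function): returns 1 when more than
 * `timeout` seconds have passed since `last_io`, mirroring the
 * (time(NULL) - last_io) > server.repl_timeout checks in replicationCron. */
int repl_link_timed_out(time_t now, time_t last_io, time_t timeout) {
    assert(timeout >= 0);
    return (now - last_io) > timeout;
}
```

The comparison is strict, so a link that has been silent for exactly repl_timeout seconds is still treated as alive.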

Master

After receiving PSYNC or SYNC, the Master calls syncCommand.

syncCommand

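For PSYNC, it calls masterTryPartialResynchronization to decide whether an incremental (partial) resync is possible, by looking up the requested offset in the repl_backlog buffer; otherwise it falls back to a full resync.
For SYNC, it sets CLIENT_PRE_PSYNC in c->flags, which marks a Slave that will never send ACKs, so the Master must not treat the missing ACKs as a sign that the Slave is down.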

void syncCommand(client *c) {
....
    /* Try a partial resynchronization if this is a PSYNC command.
     * If it fails, we continue with usual full resynchronization, however
     * when this happens masterTryPartialResynchronization() already
     * replied with:
     *
     * +FULLRESYNC <replid> <offset>
     *
     * So the slave knows the new replid and offset to try a PSYNC later
     * if the connection with the master is lost. */
    if (!strcasecmp(c->argv[0]->ptr,"psync")) {
        if (masterTryPartialResynchronization(c) == C_OK) {
            server.stat_sync_partial_ok++;
            return; /* No full resync needed, return. */
        }
    } else {
        /* If a slave uses SYNC, we are dealing with an old implementation
         * of the replication protocol (like redis-cli --slave). Flag the client
         * so that we don't expect to receive REPLCONF ACK feedbacks. */
        c->flags |= CLIENT_PRE_PSYNC;
    }
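The backlog lookup mentioned above can be pictured as a window check: the offset the Slave asks for must still be inside the range of bytes the backlog holds. The sketch below is a simplification under stated assumptions: the struct and function names are hypothetical, and the replication ID comparison that the real code also performs is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified view of the backlog window. The field names are
 * illustrative and only loosely follow server.repl_backlog_*. */
typedef struct {
    long long first_offset; /* replication offset of the oldest byte kept */
    long long histlen;      /* number of valid bytes currently stored */
} backlog_window;

/* Returns 1 when a Slave that wants to continue from `psync_offset` can be
 * served from the backlog, 0 when a full resync is needed. */
int can_partial_resync(const backlog_window *w, long long psync_offset) {
    assert(w != NULL && w->histlen >= 0);
    return psync_offset >= w->first_offset &&
           psync_offset <= w->first_offset + w->histlen;
}
```

An offset equal to first_offset + histlen is accepted: the Slave is fully caught up and simply has nothing to replay yet.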

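Everything below this point is the full-resync path. If a BGSAVE is already in progress, the Master tries to reuse the RDB it is producing: it copies the output buffer of another Slave that is waiting on that BGSAVE into the current Slave, so both end up in sync.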

c->replstate = SLAVE_STATE_WAIT_BGSAVE_START;
if (server.repl_disable_tcp_nodelay)
    connDisableTcpNoDelay(c->conn); /* Non critical if it fails. */
c->repldbfd = -1;
c->flags |= CLIENT_SLAVE;
listAddNodeTail(server.slaves,c);

/* Create the replication backlog if needed. */
if (listLength(server.slaves) == 1 && server.repl_backlog == NULL) {
    /* When we create the backlog from scratch, we always use a new
     * replication ID and clear the ID2, since there is no valid
     * past history. */
    changeReplicationId();
    clearReplicationId2();
    createReplicationBacklog();
    serverLog(LL_NOTICE,"Replication backlog created, my new "
                        "replication IDs are '%s' and '%s'",
                        server.replid, server.replid2);
}

/* CASE 1: BGSAVE is in progress, with disk target. */
if (server.child_type == CHILD_TYPE_RDB &&
    server.rdb_child_type == RDB_CHILD_TYPE_DISK)
{
    /* Ok a background save is in progress. Let's check if it is a good
     * one for replication, i.e. if there is another slave that is
     * registering differences since the server forked to save. */
    client *slave;
    listNode *ln;
    listIter li;

    listRewind(server.slaves,&li);
    while((ln = listNext(&li))) {
        slave = ln->value;
        /* If the client needs a buffer of commands, we can't use
         * a replica without replication buffer. */
        if (slave->replstate == SLAVE_STATE_WAIT_BGSAVE_END &&
            (!(slave->flags & CLIENT_REPL_RDBONLY) ||
             (c->flags & CLIENT_REPL_RDBONLY)))
            break;
    }
    /* To attach this slave, we check that it has at least all the
     * capabilities of the slave that triggered the current BGSAVE. */
    if (ln && ((c->slave_capa & slave->slave_capa) == slave->slave_capa)) {
        /* Perfect, the server is already registering differences for
         * another slave. Set the right state, and copy the buffer.
         * We don't copy buffer if clients don't want. */
        if (!(c->flags & CLIENT_REPL_RDBONLY)) copyClientOutputBuffer(c,slave);
        replicationSetupSlaveForFullResync(c,slave->psync_initial_offset);
        serverLog(LL_NOTICE,"Waiting for end of BGSAVE for SYNC");
    } else {
        /* No way, we need to wait for the next BGSAVE in order to
         * register differences. */
        serverLog(LL_NOTICE,"Can't attach the replica to the current BGSAVE. Waiting for next BGSAVE for SYNC");
    }
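The attach test in CASE 1 is a bitmask subset check: the new Slave may share the running BGSAVE only if it supports every capability of the Slave that triggered it. A small sketch, where the DEMO_ flag names are hypothetical stand-ins for the SLAVE_CAPA_* bits:

```c
#include <assert.h>

/* Hypothetical capability bits standing in for SLAVE_CAPA_*. */
#define DEMO_CAPA_EOF    (1 << 0)
#define DEMO_CAPA_PSYNC2 (1 << 1)

/* Returns 1 when `candidate` has at least every bit set in `required`,
 * the same shape as (c->slave_capa & slave->slave_capa) == slave->slave_capa. */
int has_all_capabilities(int candidate, int required) {
    assert(candidate >= 0 && required >= 0);
    return (candidate & required) == required;
}
```

A Slave with extra capabilities can still attach; one that is missing any required bit has to wait for the next BGSAVE.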

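If a diskless transfer is in progress, a child process is serializing the dataset and passing it to the parent through an anonymous pipe, and the parent forwards it to the Slaves over their sockets. A new Slave cannot join that transfer, so it has to wait for the next BGSAVE.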

/* CASE 2: BGSAVE is in progress, with socket target. */
} else if (server.child_type == CHILD_TYPE_RDB &&
           server.rdb_child_type == RDB_CHILD_TYPE_SOCKET)
{
    /* There is an RDB child process but it is writing directly to
     * children sockets. We need to wait for the next BGSAVE
     * in order to synchronize. */
    serverLog(LL_NOTICE,"Current BGSAVE has socket target. Waiting for next BGSAVE for SYNC");

If no background process is running, both diskless sync and disk-based RDB sync end up in startBgsaveForReplication.

/* CASE 3: There is no BGSAVE in progress. */
} else {
    if (server.repl_diskless_sync && (c->slave_capa & SLAVE_CAPA_EOF) &&
        server.repl_diskless_sync_delay)
    {
        /* Diskless replication RDB child is created inside
         * replicationCron() since we want to delay its start a
         * few seconds to wait for more slaves to arrive. */
        serverLog(LL_NOTICE,"Delay next BGSAVE for diskless SYNC");
    } else {
        /* We don't have a BGSAVE in progress, let's start one. Diskless
         * or disk-based mode is determined by replica's capacity. */
        if (!hasActiveChildProcess()) {
            startBgsaveForReplication(c->slave_capa);
        } else {
            serverLog(LL_NOTICE,
                "No BGSAVE in progress, but another BG operation is active. "
                "BGSAVE for replication delayed");
        }
    }
}

startBgsaveForReplication

This function chooses between diskless sync and disk-based RDB sync; the names rdbSaveToSlavesSockets and rdbSaveBackground speak for themselves.

int startBgsaveForReplication(int mincapa) {
    int retval;
    int socket_target = server.repl_diskless_sync && (mincapa & SLAVE_CAPA_EOF);
    listIter li;
    listNode *ln;

    serverLog(LL_NOTICE,"Starting BGSAVE for SYNC with target: %s",
        socket_target ? "replicas sockets" : "disk");

    rdbSaveInfo rsi, *rsiptr;
    rsiptr = rdbPopulateSaveInfo(&rsi);
    /* Only do rdbSave* when rsiptr is not NULL,
     * otherwise slave will miss repl-stream-db. */
    if (rsiptr) {
        if (socket_target)
            retval = rdbSaveToSlavesSockets(rsiptr);
        else
            retval = rdbSaveBackground(server.rdb_filename,rsiptr);
    } else {
        serverLog(LL_WARNING,"BGSAVE for replication: replication information not available, can't generate the RDB file right now. Try later.");
        retval = C_ERR;
    }

    /* If the target is socket, rdbSaveToSlavesSockets() already setup
     * the slaves for a full resync. Otherwise for disk target do it now.*/
    if (!socket_target) {
        listRewind(server.slaves,&li);
        while((ln = listNext(&li))) {
            client *slave = ln->value;

            if (slave->replstate == SLAVE_STATE_WAIT_BGSAVE_START) {
                    replicationSetupSlaveForFullResync(slave,
                            getPsyncInitialOffset());
            }
        }
    }
    return retval;
}

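Note that a diskless transfer also uses a child process, but the child never sends data to the Slaves itself. It serializes the dataset and writes it to an anonymous pipe; the parent reads from the pipe and forwards the data to the Slaves.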

rdbSaveToSlavesSockets

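It creates an anonymous pipe. The child serializes the in-memory dataset through RIO and writes it into the pipe, and the parent reads from the pipe and sends the data to the Slaves.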

/* Spawn an RDB child that writes the RDB to the sockets of the slaves
 * that are currently in SLAVE_STATE_WAIT_BGSAVE_START state. */
int rdbSaveToSlavesSockets(rdbSaveInfo *rsi) {
    listNode *ln;
    listIter li;
    pid_t childpid;
    int pipefds[2], rdb_pipe_write, safe_to_exit_pipe;

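    /* Create the pipe used to transfer the RDB bytes from the child to the
     * parent. */
    if (pipe(pipefds) == -1) return C_ERR;
    server.rdb_pipe_read = pipefds[0]; /* read end */
    rdb_pipe_write = pipefds[1]; /* write end */
    anetNonBlock(NULL, server.rdb_pipe_read);

    /* Create a second pipe that the parent uses to tell the child it is
     * safe to exit. */
    if (pipe(pipefds) == -1) {
        close(rdb_pipe_write);
        close(server.rdb_pipe_read);
        return C_ERR;
    }
    safe_to_exit_pipe = pipefds[0]; /* read end */
    server.rdb_child_exit_pipe = pipefds[1]; /* write end */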

    /* Collect the connections of the replicas we want to transfer
     * the RDB to, which are in WAIT_BGSAVE_START state. */
    server.rdb_pipe_conns = zmalloc(sizeof(connection *)*listLength(server.slaves));
    server.rdb_pipe_numconns = 0;
    server.rdb_pipe_numconns_writing = 0;
    listRewind(server.slaves,&li);
    while((ln = listNext(&li))) {
        client *slave = ln->value;
        if (slave->replstate == SLAVE_STATE_WAIT_BGSAVE_START) {
            server.rdb_pipe_conns[server.rdb_pipe_numconns++] = slave->conn;
            replicationSetupSlaveForFullResync(slave,getPsyncInitialOffset());
        }
    }

    /* Create the child process. */
    if ((childpid = redisFork(CHILD_TYPE_RDB)) == 0) {
        /* Child */
        int retval, dummy;
        rio rdb;

        rioInitWithFd(&rdb,rdb_pipe_write);

        retval = rdbSaveRioWithEOFMark(&rdb,NULL,rsi);
        if (retval == C_OK && rioFlush(&rdb) == 0)
            retval = C_ERR;

        if (retval == C_OK) {
            sendChildCowInfo(CHILD_INFO_TYPE_RDB_COW_SIZE, "RDB");
        }

        rioFreeFd(&rdb);
        /* wake up the reader, tell it we're done. */
        close(rdb_pipe_write);
        close(server.rdb_child_exit_pipe); /* close write end so that we can detect the close on the parent. */
        /* hold exit until the parent tells us it's safe. we're not expecting
         * to read anything, just get the error when the pipe is closed. */
        dummy = read(safe_to_exit_pipe, pipefds, 1);
        UNUSED(dummy);
        exitFromChild((retval == C_OK) ? 0 : 1);

The parent registers a readable event on the pipe and reads the data in rdbPipeReadHandler.

    } else {
        /* Parent */
        close(safe_to_exit_pipe);
        if (childpid == -1) {
            serverLog(LL_WARNING,"Can't save in background: fork: %s",
                strerror(errno));

            /* Undo the state change. The caller will perform cleanup on
             * all the slaves in BGSAVE_START state, but an early call to
             * replicationSetupSlaveForFullResync() turned it into BGSAVE_END */
            listRewind(server.slaves,&li);
            while((ln = listNext(&li))) {
                client *slave = ln->value;
                if (slave->replstate == SLAVE_STATE_WAIT_BGSAVE_END) {
                    slave->replstate = SLAVE_STATE_WAIT_BGSAVE_START;
                }
            }
            close(rdb_pipe_write);
            close(server.rdb_pipe_read);
            zfree(server.rdb_pipe_conns);
            server.rdb_pipe_conns = NULL;
            server.rdb_pipe_numconns = 0;
            server.rdb_pipe_numconns_writing = 0;
        } else {
            serverLog(LL_NOTICE,"Background RDB transfer started by pid %ld",
                (long) childpid);
            server.rdb_save_time_start = time(NULL);
            server.rdb_child_type = RDB_CHILD_TYPE_SOCKET;
            close(rdb_pipe_write); /* close write in parent so that it can detect the close on the child. */
            if (aeCreateFileEvent(server.el, server.rdb_pipe_read, AE_READABLE, rdbPipeReadHandler,NULL) == AE_ERR) {
                serverPanic("Unrecoverable error creating server.rdb_pipe_read file event.");
            }
        }
        return (childpid == -1) ? C_ERR : C_OK;
    }
    return C_OK; /* Unreached. */
}
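Stripped of the Redis bookkeeping, the child-to-parent handoff is an ordinary fork plus pipe pattern. The sketch below is a toy version under stated assumptions: relay_through_pipe is a hypothetical name, the "RDB" is just a short string, and the parent reads synchronously instead of registering an event-loop handler.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks a child that writes `payload` into a pipe and reads it back in the
 * parent. Returns the number of bytes received, or -1 on error. */
ssize_t relay_through_pipe(const char *payload, char *out, size_t outlen) {
    int fds[2];
    assert(payload != NULL && out != NULL);
    if (pipe(fds) == -1) return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: stands in for the RDB child serializing the dataset. */
        close(fds[0]);
        size_t len = strlen(payload);
        ssize_t written = write(fds[1], payload, len);
        close(fds[1]);
        _exit(written == (ssize_t)len ? 0 : 1);
    }

    /* Parent: close the write end so read() reports EOF once the child
     * has closed its copy. */
    close(fds[1]);
    size_t total = 0;
    while (total < outlen) {
        ssize_t n = read(fds[0], out + total, outlen - total);
        if (n <= 0) break; /* 0 means EOF, negative means error */
        total += (size_t)n;
    }
    close(fds[0]);

    int status = 0;
    if (waitpid(pid, &status, 0) == -1) return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) return -1;
    return (ssize_t)total;
}
```

Closing the unused write end in the parent is the detail that matters most: without it the parent would never see EOF, which is the same reason the Redis code closes rdb_pipe_write right after the fork.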

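That completes the walkthrough of master-slave synchronization. From here on, commands are forwarded to the Slaves by the propagate function.

Pitfalls of master-slave synchronization

Master and slave data can diverge

Replication is asynchronous by design, so this has to be handled at the deployment level.

Reading expired data

A command that sets a relative expiry is already late by the time it reaches the Slave, so the key expires later there than on the Master; prefer commands that take an absolute expiry time.
Old Redis versions return an expired key as-is when it is read from a Slave; newer versions return an empty reply.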
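The drift is easy to see with numbers. In the sketch below, absolute_expiry_ms is a hypothetical helper, and the timestamps are made-up millisecond values rather than anything measured.

```c
#include <assert.h>

/* Hypothetical helper: the absolute deadline a node computes when it applies
 * a relative TTL at local time `now_ms`. */
long long absolute_expiry_ms(long long now_ms, long long ttl_ms) {
    assert(ttl_ms >= 0);
    return now_ms + ttl_ms;
}
```

If the Master applies a 5000 ms TTL at t = 1000 and the Slave applies the same relative command at t = 1200, the deadlines are 6000 and 6200: the key lives 200 ms longer on the Slave. Sending the absolute deadline instead makes both nodes agree, as long as their clocks do.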

Redis 6 Internals (1): Asynchronous Mechanisms

2021-03-26T15:46:20.000Z

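I have always found relational databases hard to use: the table structure has to be fixed up front, changing the storage layout midway is painful, and foreign keys in particular are something I have not seen since leaving school. Fortunately, in the game industry NoSQL is what gets used the most.

Readers familiar with my style will notice that the title of this series says only "Internals" rather than "Source Code Internals". Redis 6.0 is already a large code base, and the web is full of source walkthroughs of the Redis data structures, so there is no point in rehashing them.

For these reasons, I split Redis into several parts to analyze and discuss.

Asynchronous mechanisms
Master-slave synchronization
Cluster
Data structures

This post looks at how Redis uses two asynchronous mechanisms, multiple processes and multiple threads, to avoid blocking.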

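Blocking points

Redis typically has the following blocking points.

On the network side:

Network I/O (multi-threaded)
Client interaction (some deletions go to the BIO threads)
RDB snapshot transfer (multi-process)

On the disk side:

Closing files (BIO thread)
Writing the AOF log (BIO thread)
AOF log rewrite (multi-process)
RDB snapshot generation (multi-process)

Network I/O (multi-threaded)

Early versions of Redis used a single thread plus I/O multiplexing. Redis 6.0 adds Threaded I/O, which is off by default; to enable it, add the following two directives to the configuration.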

# Also use the I/O threads for reading and parsing requests
io-threads-do-reads yes
# Number of I/O threads; must be greater than 1 to enable threading
io-threads 2

During initialization, Redis calls initThreadedIO.

initThreadedIO

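It sets up server.io_threads_num threads according to the configuration (slot 0 is the main thread, so io_threads_num - 1 threads are actually spawned). If the value is 1, it returns immediately and network I/O is handled on the main thread, which is equivalent to single-threaded I/O.

Each I/O thread gets its own mutex, which lets the main thread pause and resume it on demand. This is needed because each I/O thread runs an infinite loop that busy-waits (spins) for work; with nothing to do, it would otherwise keep a CPU core at 100%.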

/* Initialize the data structures needed for threaded I/O. */
void initThreadedIO(void) {
    server.io_threads_active = 0; /* We start with threads not active. */

    /* Don't spawn any thread if the user selected a single thread:
     * we'll handle I/O directly from the main thread. */
    if (server.io_threads_num == 1) return;

    if (server.io_threads_num > IO_THREADS_MAX_NUM) {
        serverLog(LL_WARNING,"Fatal: too many I/O threads configured. "
                             "The maximum number is %d.", IO_THREADS_MAX_NUM);
        exit(1);
    }

    /* Spawn and initialize the I/O threads. */
    for (int i = 0; i < server.io_threads_num; i++) {
        /* Things we do for all the threads including the main thread. */
        io_threads_list[i] = listCreate();
        if (i == 0) continue; /* Thread 0 is the main thread. */

        /* Things we do only for the additional threads. */
        pthread_t tid;
        pthread_mutex_init(&io_threads_mutex[i],NULL);
        setIOPendingCount(i, 0);
        pthread_mutex_lock(&io_threads_mutex[i]); /* Thread will be stopped. */
        if (pthread_create(&tid,NULL,IOThreadMain,(void*)(long)i) != 0) {
            serverLog(LL_WARNING,"Fatal: Can't initialize IO thread.");
            exit(1);
        }
        io_threads[i] = tid;
    }
}

IOThreadMain

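The thread busy-waits on an atomic counter until work is available, then walks its task list and performs either writes or reads. When the spin finds no work, the lock/unlock pair on its mutex gives the main thread a chance to pause the thread.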

#define IO_THREADS_MAX_NUM 128
#define IO_THREADS_OP_READ 0
#define IO_THREADS_OP_WRITE 1

pthread_t io_threads[IO_THREADS_MAX_NUM];
pthread_mutex_t io_threads_mutex[IO_THREADS_MAX_NUM];
redisAtomic unsigned long io_threads_pending[IO_THREADS_MAX_NUM];
int io_threads_op;      /* IO_THREADS_OP_WRITE or IO_THREADS_OP_READ. */

list *io_threads_list[IO_THREADS_MAX_NUM];

static inline unsigned long getIOPendingCount(int i) {
    unsigned long count = 0;
    atomicGetWithSync(io_threads_pending[i], count);
    return count;
}

static inline void setIOPendingCount(int i, unsigned long count) {
    atomicSetWithSync(io_threads_pending[i], count);
}

void *IOThreadMain(void *myid) {
    /* The ID is the thread number (from 0 to server.iothreads_num-1), and is
     * used by the thread to just manipulate a single sub-array of clients. */
    long id = (unsigned long)myid;
    char thdname[16];

    snprintf(thdname, sizeof(thdname), "io_thd_%ld", id);
    redis_set_thread_title(thdname);
    redisSetCpuAffinity(server.server_cpulist);
    makeThreadKillable();

    while(1) {
        /* Wait for start */
        for (int j = 0; j < 1000000; j++) {
            if (getIOPendingCount(id) != 0) break;
        }

        /* Give the main thread a chance to stop this thread. */
        if (getIOPendingCount(id) == 0) {
            pthread_mutex_lock(&io_threads_mutex[id]);
            pthread_mutex_unlock(&io_threads_mutex[id]);
            continue;
        }

        serverAssert(getIOPendingCount(id) != 0);

        /* Process: note that the main thread will never touch our list
         * before we drop the pending count to 0. */
        listIter li;
        listNode *ln;
        listRewind(io_threads_list[id],&li);
        while((ln = listNext(&li))) {
            client *c = listNodeValue(ln);
            if (io_threads_op == IO_THREADS_OP_WRITE) {
                writeToClient(c,0);
            } else if (io_threads_op == IO_THREADS_OP_READ) {
                readQueryFromClient(c->conn);
            } else {
                serverPanic("io_threads_op value is unknown");
            }
        }
        listEmpty(io_threads_list[id]);
        setIOPendingCount(id, 0);
    }
}
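The handshake between the main thread and an I/O thread can be reduced to a single atomic counter. The sketch below is a toy version with one worker, using C11 atomics and pthreads; io_slot, worker_main, and run_batch are hypothetical names, the "work" is just summing integers, and the mutex used for pausing the thread is not modelled.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical single-worker version of the pending-count handshake. */
typedef struct {
    atomic_ulong pending; /* set by main to start a batch, reset by worker */
    atomic_int quit;      /* set by main to stop the worker */
    const int *items;     /* touched by the worker only while pending != 0 */
    long sum;             /* result written by the worker */
} io_slot;

static void *worker_main(void *arg) {
    io_slot *s = arg;
    for (;;) {
        unsigned long n = atomic_load(&s->pending);
        if (n == 0) {
            if (atomic_load(&s->quit)) return NULL;
            continue; /* busy-wait, like the Redis I/O threads */
        }
        long sum = 0;
        for (unsigned long i = 0; i < n; i++) sum += s->items[i];
        s->sum = sum;
        atomic_store(&s->pending, 0); /* hand the batch back to main */
    }
}

/* Runs one batch through a worker thread and returns the sum it computed,
 * or -1 if the thread could not be created. */
long run_batch(const int *items, unsigned long n) {
    io_slot s;
    pthread_t tid;

    assert(items != NULL || n == 0);
    atomic_init(&s.pending, 0);
    atomic_init(&s.quit, 0);
    s.items = items;
    s.sum = 0;
    if (pthread_create(&tid, NULL, worker_main, &s) != 0) return -1;

    atomic_store(&s.pending, n);             /* publish the batch */
    while (atomic_load(&s.pending) != 0) { } /* spin until it is done */
    long result = s.sum;

    atomic_store(&s.quit, 1);
    pthread_join(tid, NULL);
    return result;
}
```

Because main only reads sum after it observes pending back at 0, and the worker only reads items after it observes pending above 0, neither side needs a lock around the shared data. That is the same ownership rule the comment in IOThreadMain describes for io_threads_list.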

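Threaded I/O read and write flow

beforeSleep first walks all clients with pending reads and distributes them across the threads round-robin.
The main thread sets each thread's pending count with an atomic store, hands the work to the I/O threads, spins until they finish, then executes the commands on the main thread and adds the clients to clients_pending_write.
It then walks all clients with pending writes and distributes them with the same policy.
It sets the pending counts atomically again, hands the work to the I/O threads, and spins until they finish.
For any client whose reply is not fully written, it installs a write handler in epoll, and the main thread finishes the remaining writes later.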

handleClientsWithPendingReadsUsingThreads

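Read path: it first checks whether the I/O threads are active, then takes the clients from clients_pending_read and distributes them to the threads. Accessing io_threads_list needs no lock: io_threads_list[i] is only touched by the main thread and by I/O thread i, and those two synchronize through an atomic counter, with spinning on that counter taking care of the data race. While waiting, the main thread processes its own share of the reads. Finally the clients are added to the clients_pending_write list.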

int handleClientsWithPendingReadsUsingThreads(void) {
    if (!server.io_threads_active || !server.io_threads_do_reads) return 0;
    int processed = listLength(server.clients_pending_read);
    if (processed == 0) return 0;

    /* Distribute the clients across N different lists. */
    listIter li;
    listNode *ln;
    listRewind(server.clients_pending_read,&li);
    int item_id = 0;
    while((ln = listNext(&li))) {
        client *c = listNodeValue(ln);
        int target_id = item_id % server.io_threads_num;
        listAddNodeTail(io_threads_list[target_id],c);
        item_id++;
    }

    /* Give the start condition to the waiting threads, by setting the
     * start condition atomic var. */
    io_threads_op = IO_THREADS_OP_READ;
    for (int j = 1; j < server.io_threads_num; j++) {
        int count = listLength(io_threads_list[j]);
        setIOPendingCount(j, count);
    }

    /* Also use the main thread to process a slice of clients. */
    listRewind(io_threads_list[0],&li);
    while((ln = listNext(&li))) {
        client *c = listNodeValue(ln);
        readQueryFromClient(c->conn);
    }
    listEmpty(io_threads_list[0]);

    /* Wait for all the other threads to end their work. */
    while(1) {
        unsigned long pending = 0;
        for (int j = 1; j < server.io_threads_num; j++)
            pending += getIOPendingCount(j);
        if (pending == 0) break;
    }

    /* Run the list of clients again to process the new buffers. */
    while(listLength(server.clients_pending_read)) {
        ln = listFirst(server.clients_pending_read);
        client *c = listNodeValue(ln);
        c->flags &= ~CLIENT_PENDING_READ;
        listDelNode(server.clients_pending_read,ln);

        if (processPendingCommandsAndResetClient(c) == C_ERR) {
            /* If the client is no longer valid, we avoid
             * processing the client later. So we just go
             * to the next. */
            continue;
        }

        processInputBuffer(c);

        /* We may have pending replies if a thread readQueryFromClient() produced
         * replies and did not install a write handler (it can't).
         */
        if (!(c->flags & CLIENT_PENDING_WRITE) && clientHasPendingReplies(c))
            clientInstallWriteHandler(c);
    }

    /* Update processed count on server */
    server.stat_io_reads_processed += processed;

    return processed;
}
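The distribution step is plain round-robin by item index. A sketch that only counts how many clients each thread would receive, with distribute_round_robin as a hypothetical name:

```c
#include <assert.h>
#include <stddef.h>

/* Fills counts[0..nthreads-1] with the number of clients each thread gets
 * when client number `item` goes to thread `item % nthreads`. Index 0 is
 * the main thread, as in Redis. */
void distribute_round_robin(size_t nclients, int nthreads, size_t *counts) {
    assert(nthreads > 0 && counts != NULL);
    for (int t = 0; t < nthreads; t++) counts[t] = 0;
    for (size_t item = 0; item < nclients; item++)
        counts[item % (size_t)nthreads]++;
}
```

With 10 clients and 4 threads the split is 3, 3, 2, 2, so the main thread always carries a share at least as large as any I/O thread's.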

handleClientsWithPendingWritesUsingThreads

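Write path: it checks whether the I/O threads should be used. When there is little work, the main thread temporarily blocks the I/O threads by locking their mutexes, because each of them is an infinite loop that would burn a full core even with nothing to do. If a reply could not be written completely, a write callback is registered with epoll and the main thread writes the rest next time.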

int stopThreadedIOIfNeeded(void) {
    int pending = listLength(server.clients_pending_write);

    /* Return ASAP if IO threads are disabled (single threaded mode). */
    if (server.io_threads_num == 1) return 1;

    if (pending < (server.io_threads_num*2)) {
        if (server.io_threads_active) stopThreadedIO();
        return 1;
    } else {
        return 0;
    }
}

int handleClientsWithPendingWritesUsingThreads(void) {
    int processed = listLength(server.clients_pending_write);
    if (processed == 0) return 0; /* Return ASAP if there are no clients. */

    /* If I/O threads are disabled or we have few clients to serve, don't
     * use I/O threads, but the boring synchronous code. */
    if (server.io_threads_num == 1 || stopThreadedIOIfNeeded()) {
        return handleClientsWithPendingWrites();
    }

    /* Start threads if needed. */
    if (!server.io_threads_active) startThreadedIO();

    /* Distribute the clients across N different lists. */
    listIter li;
    listNode *ln;
    listRewind(server.clients_pending_write,&li);
    int item_id = 0;
    while((ln = listNext(&li))) {
        client *c = listNodeValue(ln);
        c->flags &= ~CLIENT_PENDING_WRITE;

        /* Remove clients from the list of pending writes since
         * they are going to be closed ASAP. */
        if (c->flags & CLIENT_CLOSE_ASAP) {
            listDelNode(server.clients_pending_write, ln);
            continue;
        }

        int target_id = item_id % server.io_threads_num;
        listAddNodeTail(io_threads_list[target_id],c);
        item_id++;
    }

    /* Give the start condition to the waiting threads, by setting the
     * start condition atomic var. */
    io_threads_op = IO_THREADS_OP_WRITE;
    for (int j = 1; j < server.io_threads_num; j++) {
        int count = listLength(io_threads_list[j]);
        setIOPendingCount(j, count);
    }

    /* Also use the main thread to process a slice of clients. */
    listRewind(io_threads_list[0],&li);
    while((ln = listNext(&li))) {
        client *c = listNodeValue(ln);
        writeToClient(c,0);
    }
    listEmpty(io_threads_list[0]);

    /* Wait for all the other threads to end their work. */
    while(1) {
        unsigned long pending = 0;
        for (int j = 1; j < server.io_threads_num; j++)
            pending += getIOPendingCount(j);
        if (pending == 0) break;
    }

    /* Run the list of clients again to install the write handler where
     * needed. */
    listRewind(server.clients_pending_write,&li);
    while((ln = listNext(&li))) {
        client *c = listNodeValue(ln);

        /* Install the write handler if there are pending writes in some
         * of the clients. */
        if (clientHasPendingReplies(c) &&
                connSetWriteHandler(c->conn, sendReplyToClient) == AE_ERR)
        {
            freeClientAsync(c);
        }
    }
    listEmpty(server.clients_pending_write);

    /* Update processed count on server */
    server.stat_io_writes_processed += processed;

    return processed;
}
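The decision made by stopThreadedIOIfNeeded comes down to one comparison: threads are worth waking only when there are at least two pending clients per thread. A sketch with a hypothetical function name:

```c
#include <assert.h>

/* Returns 1 when the write batch should go through the I/O threads, 0 when
 * the main thread should handle it alone. Mirrors the
 * pending < io_threads_num * 2 test in stopThreadedIOIfNeeded. */
int should_use_io_threads(int io_threads_num, unsigned long pending_clients) {
    assert(io_threads_num >= 1);
    if (io_threads_num == 1) return 0; /* threading disabled */
    return pending_clients >= (unsigned long)io_threads_num * 2;
}
```

With 4 threads configured, 7 pending clients are served synchronously and 8 are fanned out.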

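As you can see, Redis's multi-threading model is not particularly elegant: the main thread does not really need to wait for every thread to finish its reads or writes, and the I/O threads are rather brute-force, spinning in an infinite loop and burning CPU. Still, it does relieve single-threaded Redis of the cost of the read and write system calls, namely the copies between user-space and kernel buffers.

I have seen quite a few people online criticize Redis's use of spinning as a step backwards, but I disagree. Whether to use a mutex or a spinlock depends on the situation: when the critical section is very short, a spinlock avoids the cost of unnecessary context switches.

BIO (three background threads)

BIO is Redis's background thread subsystem. It handles the following three job types, each served by its own dedicated thread.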

/* Background job opcodes */
#define BIO_CLOSE_FILE    0 /* Deferred close(2) syscall. */
#define BIO_AOF_FSYNC     1 /* Deferred AOF fsync. */
#define BIO_LAZY_FREE     2 /* Deferred objects freeing. */
#define BIO_NUM_OPS       3

Closing file descriptors.
Flushing AOF data from the kernel buffers to the file (fsync).
Lazy freeing: releasing some memory on another thread.

bioInit

Initializes the mutexes and condition variables for the three background threads, then spawns the threads.

static pthread_t bio_threads[BIO_NUM_OPS];
static pthread_mutex_t bio_mutex[BIO_NUM_OPS];
static pthread_cond_t bio_newjob_cond[BIO_NUM_OPS];
static pthread_cond_t bio_step_cond[BIO_NUM_OPS];
static list *bio_jobs[BIO_NUM_OPS];
static unsigned long long bio_pending[BIO_NUM_OPS];
#define REDIS_THREAD_STACK_SIZE (1024*1024*4)

void bioInit(void) {
    pthread_attr_t attr;
    pthread_t thread;
    size_t stacksize;
    int j;

    /* Initialization of state vars and objects */
    for (j = 0; j < BIO_NUM_OPS; j++) {
        pthread_mutex_init(&bio_mutex[j],NULL);
        pthread_cond_init(&bio_newjob_cond[j],NULL);
        pthread_cond_init(&bio_step_cond[j],NULL);
        bio_jobs[j] = listCreate();
        bio_pending[j] = 0;
    }

    /* Set the stack size as by default it may be small in some system */
    pthread_attr_init(&attr);
    pthread_attr_getstacksize(&attr,&stacksize);
    if (!stacksize) stacksize = 1; /* The world is full of Solaris Fixes */
    while (stacksize < REDIS_THREAD_STACK_SIZE) stacksize *= 2;
    pthread_attr_setstacksize(&attr, stacksize);

    /* Ready to spawn our threads. We use the single argument the thread
     * function accepts in order to pass the job ID the thread is
     * responsible of. */
    for (j = 0; j < BIO_NUM_OPS; j++) {
        void *arg = (void*)(unsigned long) j;
        if (pthread_create(&thread,&attr,bioProcessBackgroundJobs,arg) != 0) {
            serverLog(LL_WARNING,"Fatal: Can't initialize Background Jobs.");
            exit(1);
        }
        bio_threads[j] = thread;
    }
}

bioProcessBackgroundJobs

Sets the thread name, blocks the SIGALRM signal, then loops forever fetching jobs and handling each according to its type.

struct bio_job {
    time_t time; /* Time at which the job was created. */
    /* Job specific arguments.*/
    int fd; /* Fd for file based background jobs */
    lazy_free_fn *free_fn; /* Function that will free the provided arguments */
    void *free_args[]; /* List of arguments to be passed to the free function */
};

void *bioProcessBackgroundJobs(void *arg) {
    struct bio_job *job;
    unsigned long type = (unsigned long) arg;
    sigset_t sigset;

    /* Check that the type is within the right interval. */
    if (type >= BIO_NUM_OPS) {
        serverLog(LL_WARNING,
            "Warning: bio thread started with wrong type %lu",type);
        return NULL;
    }

    switch (type) {
    case BIO_CLOSE_FILE:
        redis_set_thread_title("bio_close_file");
        break;
    case BIO_AOF_FSYNC:
        redis_set_thread_title("bio_aof_fsync");
        break;
    case BIO_LAZY_FREE:
        redis_set_thread_title("bio_lazy_free");
        break;
    }

    redisSetCpuAffinity(server.bio_cpulist);

    makeThreadKillable();

    pthread_mutex_lock(&bio_mutex[type]);
    /* Block SIGALRM so we are sure that only the main thread will
     * receive the watchdog signal. */
    sigemptyset(&sigset);
    sigaddset(&sigset, SIGALRM);
    if (pthread_sigmask(SIG_BLOCK, &sigset, NULL))
        serverLog(LL_WARNING,
            "Warning: can't mask SIGALRM in bio.c thread: %s", strerror(errno));

    while(1) {
        listNode *ln;

        /* The loop always starts with the lock hold. */
        if (listLength(bio_jobs[type]) == 0) {
            pthread_cond_wait(&bio_newjob_cond[type],&bio_mutex[type]);
            continue;
        }
        /* Pop the job from the queue. */
        ln = listFirst(bio_jobs[type]);
        job = ln->value;
        /* It is now possible to unlock the background system as we know have
         * a stand alone job structure to process.*/
        pthread_mutex_unlock(&bio_mutex[type]);

        /* Process the job accordingly to its type. */
        if (type == BIO_CLOSE_FILE) {
            close(job->fd);
        } else if (type == BIO_AOF_FSYNC) {
            redis_fsync(job->fd);
        } else if (type == BIO_LAZY_FREE) {
            job->free_fn(job->free_args);
        } else {
            serverPanic("Wrong job type in bioProcessBackgroundJobs().");
        }
        zfree(job);

        /* Lock again before reiterating the loop, if there are no longer
         * jobs to process we'll block again in pthread_cond_wait(). */
        pthread_mutex_lock(&bio_mutex[type]);
        listDelNode(bio_jobs[type],ln);
        bio_pending[type]--;

        /* Unblock threads blocked on bioWaitStepOfType() if any. */
        pthread_cond_broadcast(&bio_step_cond[type]);
    }
}
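Unlike the I/O threads, the BIO threads sleep on a condition variable when idle. The sketch below is a toy single-queue version of that loop; all demo_ names are hypothetical, the job is just an integer to add up, and a shutdown flag is added so the example can terminate (the real BIO threads never exit on their own).

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define DEMO_QUEUE_CAP 16

typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t newjob;  /* signalled on enqueue and on shutdown */
    int jobs[DEMO_QUEUE_CAP];
    size_t head, count;
    int shutdown;
    long processed_sum;
} demo_bio;

static void *demo_bio_worker(void *arg) {
    demo_bio *b = arg;
    pthread_mutex_lock(&b->mutex);
    for (;;) {
        /* The loop always starts with the lock held. */
        if (b->count == 0) {
            if (b->shutdown) break;
            pthread_cond_wait(&b->newjob, &b->mutex);
            continue;
        }
        int job = b->jobs[b->head];
        b->head = (b->head + 1) % DEMO_QUEUE_CAP;
        b->count--;
        pthread_mutex_unlock(&b->mutex);
        long result = job; /* the slow work would run here, lock released */
        pthread_mutex_lock(&b->mutex);
        b->processed_sum += result;
    }
    pthread_mutex_unlock(&b->mutex);
    return NULL;
}

/* Queues `n` jobs, waits for the worker to drain them, returns their sum,
 * or -1 if the thread could not be created. */
long demo_bio_run(const int *jobs, size_t n) {
    assert(n <= DEMO_QUEUE_CAP);
    demo_bio b = { .head = 0, .count = 0, .shutdown = 0, .processed_sum = 0 };
    pthread_t tid;
    pthread_mutex_init(&b.mutex, NULL);
    pthread_cond_init(&b.newjob, NULL);
    if (pthread_create(&tid, NULL, demo_bio_worker, &b) != 0) return -1;

    for (size_t i = 0; i < n; i++) {
        pthread_mutex_lock(&b.mutex);
        b.jobs[(b.head + b.count) % DEMO_QUEUE_CAP] = jobs[i];
        b.count++;
        pthread_cond_signal(&b.newjob);
        pthread_mutex_unlock(&b.mutex);
    }

    pthread_mutex_lock(&b.mutex);
    b.shutdown = 1;
    pthread_cond_signal(&b.newjob);
    pthread_mutex_unlock(&b.mutex);
    pthread_join(tid, NULL);

    pthread_cond_destroy(&b.newjob);
    pthread_mutex_destroy(&b.mutex);
    return b.processed_sum;
}
```

The key point carried over from bioProcessBackgroundJobs is that the lock is dropped while the job runs, so producers are never blocked behind a slow close, fsync, or free.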

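Closing file descriptors

Closing a file descriptor may end up deleting the file and therefore block. Redis replaces existing files with rename, so the old file is only freed when its last descriptor is closed; handing that close to a bio thread keeps the main thread from blocking.

Client interaction (lazy free)

Client operations boil down to creating, reading, updating, and deleting data structures, and most of them are O(1). The ones to watch are queries and aggregations over collections. Deleting a big key is also expensive, even though the jemalloc allocator Redis uses is already fast. Redis therefore releases such memory on a separate thread.

Two conditions must hold.

The object is no longer shared with anyone else.
The object can never be reached again (the case where it is a value in the hash table).

With both in place, no locking is needed. (Lua fits this case nicely.)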

void freeObjAsync(robj *key, robj *obj) {
    size_t free_effort = lazyfreeGetFreeEffort(key,obj);
    if (free_effort > LAZYFREE_THRESHOLD && obj->refcount == 1) {
        atomicIncr(lazyfree_objects,1);
        bioCreateLazyFreeJob(lazyfreeFreeObject,1,obj);
    } else {
        decrRefCount(obj);
    }
}
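The branch in freeObjAsync is a cost threshold plus a sharing check. A sketch of just that decision, where the function name is hypothetical and DEMO_LAZYFREE_THRESHOLD is set to 64 on the assumption that it matches LAZYFREE_THRESHOLD in the Redis source:

```c
#include <assert.h>
#include <stddef.h>

/* Assumed to match LAZYFREE_THRESHOLD in Redis; treat the value as
 * illustrative. */
#define DEMO_LAZYFREE_THRESHOLD 64

/* Returns 1 when the object should be handed to the background thread:
 * it must be expensive to free and not shared (refcount of exactly 1). */
int should_free_in_background(size_t free_effort, int refcount) {
    assert(refcount >= 1);
    return free_effort > DEMO_LAZYFREE_THRESHOLD && refcount == 1;
}
```

Small objects are freed inline because queuing a job would cost more than the free itself, and shared objects are only decremented because another owner may still read them.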

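So it is best to delete keys with unlink: when the key is a big key, the release is handed to bio. Likewise, flushdb can clear the database asynchronously.

AOF log

After each command executes, if the AOF log is enabled the command is appended to the AOF buffer (a write-after log).

propagate(c->cmd,c->db->id,c->argv,c->argc,propagate_flags);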

propagate

If the AOF log is enabled, propagate calls feedAppendOnlyFile to write the command into server.aof_buf.

void propagate(struct redisCommand *cmd, int dbid, robj **argv, int argc,
               int flags)
{
    if (!server.replication_allowed)
        return;

    /* Propagate a MULTI request once we encounter the first command which
     * is a write command.
     * This way we'll deliver the MULTI/..../EXEC block as a whole and
     * both the AOF and the replication link will have the same consistency
     * and atomicity guarantees. */
    if (server.in_exec && !server.propagate_in_transaction)
        execCommandPropagateMulti(dbid);

    /* This needs to be unreachable since the dataset should be fixed during 
     * client pause, otherwise data may be lossed during a failover. */
    serverAssert(!(areClientsPaused() && !server.client_pause_in_transaction));

    if (server.aof_state != AOF_OFF && flags & PROPAGATE_AOF)
        feedAppendOnlyFile(cmd,dbid,argv,argc);
    if (flags & PROPAGATE_REPL)
        replicationFeedSlaves(server.slaves,dbid,argv,argc);
}

feedAppendOnlyFile

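It first checks which db the command targets and emits a SELECT if it differs from the db of the last command written (Redis has 16 dbs by default, REDIS_DEFAULT_DBNUM). Commands with a relative expiry are then converted to an absolute time. If an AOF child process is rewriting the log, the command is also appended to the server.aof_rewrite_buf_blocks list and streamed to the child through a pipe. Even if the child crashes, the parent's AOF log remains complete.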

void feedAppendOnlyFile(struct redisCommand *cmd, int dictid, robj **argv, int argc) {
    sds buf = sdsempty();
    /* The DB this command was targeting is not the same as the last command
     * we appended. To issue a SELECT command is needed. */
    if (dictid != server.aof_selected_db) {
        char seldb[64];

        snprintf(seldb,sizeof(seldb),"%d",dictid);
        buf = sdscatprintf(buf,"*2\r\n$6\r\nSELECT\r\n$%lu\r\n%s\r\n",
            (unsigned long)strlen(seldb),seldb);
        server.aof_selected_db = dictid;
    }

    if (cmd->proc == expireCommand || cmd->proc == pexpireCommand ||
        cmd->proc == expireatCommand) {
        /* Translate EXPIRE/PEXPIRE/EXPIREAT into PEXPIREAT */
        buf = catAppendOnlyExpireAtCommand(buf,cmd,argv[1],argv[2]);
    } else if (cmd->proc == setCommand && argc > 3) {
        robj *pxarg = NULL;
        /* When SET is used with EX/PX argument setGenericCommand propagates them with PX millisecond argument.
         * So since the command arguments are re-written there, we can rely here on the index of PX being 3. */
        if (!strcasecmp(argv[3]->ptr, "px")) {
            pxarg = argv[4];
        }
        /* For AOF we convert SET key value relative time in milliseconds to SET key value absolute time in
         * millisecond. Whenever the condition is true it implies that original SET has been transformed
         * to SET PX with millisecond time argument so we do not need to worry about unit here.*/
        if (pxarg) {
            robj *millisecond = getDecodedObject(pxarg);
            long long when = strtoll(millisecond->ptr,NULL,10);
            when += mstime();

            decrRefCount(millisecond);

            robj *newargs[5];
            newargs[0] = argv[0];
            newargs[1] = argv[1];
            newargs[2] = argv[2];
            newargs[3] = shared.pxat;
            newargs[4] = createStringObjectFromLongLong(when);
            buf = catAppendOnlyGenericCommand(buf,5,newargs);
            decrRefCount(newargs[4]);
        } else {
            buf = catAppendOnlyGenericCommand(buf,argc,argv);
        }
    } else {
        /* All the other commands don't need translation or need the
         * same translation already operated in the command vector
         * for the replication itself. */
        buf = catAppendOnlyGenericCommand(buf,argc,argv);
    }

    /* Append to the AOF buffer. This will be flushed on disk just before
     * of re-entering the event loop, so before the client will get a
     * positive reply about the operation performed. */
    if (server.aof_state == AOF_ON)
        server.aof_buf = sdscatlen(server.aof_buf,buf,sdslen(buf));

    /* If a background append only file rewriting is in progress we want to
     * accumulate the differences between the child DB and the current one
     * in a buffer, so that when the child process will do its work we
     * can append the differences to the new append only file. */
    if (server.child_type == CHILD_TYPE_AOF)
        aofRewriteBufferAppend((unsigned char*)buf,sdslen(buf));

    sdsfree(buf);
}

flushAppendOnlyFile

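There are three policies for syncing the AOF log to disk. With the first, Redis never syncs explicitly and leaves the flush timing to the kernel. With the second, it syncs on every write, but fsync blocks. Hence the third policy, which syncs once per second: a BIO background thread issues the fsync every second. Strictly speaking, calling it fsync is not quite accurate either: on Linux, Redis uses fdatasync, which saves the cost of writing the file's metadata.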
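The three policies can be summarized as a small decision function. The Go sketch below is illustrative only: FsyncPolicy, its constants, and shouldFsync are hypothetical names rather than Redis identifiers, and times are passed as plain milliseconds to keep the example self-contained.

```go
package main

import "fmt"

// FsyncPolicy models the three sync policies described above.
// The names are hypothetical, not taken from the Redis source.
type FsyncPolicy int

const (
	FsyncNo       FsyncPolicy = iota // never sync explicitly; the kernel decides
	FsyncAlways                      // sync after every write (blocks the caller)
	FsyncEverySec                    // sync at most once per second, off the main thread
)

// shouldFsync reports whether a sync is due at nowMs, given the time of the
// previous sync. Both times are in milliseconds.
func shouldFsync(p FsyncPolicy, nowMs, lastSyncMs int64) bool {
	switch p {
	case FsyncAlways:
		return true
	case FsyncEverySec:
		return nowMs-lastSyncMs >= 1000
	default:
		return false
	}
}

func main() {
	fmt.Println(shouldFsync(FsyncEverySec, 1500, 1000)) // only 500 ms elapsed
	fmt.Println(shouldFsync(FsyncEverySec, 2000, 1000)) // a full second elapsed
}
```

In the every-second case the actual sync would still be handed to a background worker, as bioCreateFsyncJob does below; the sketch only covers the timing decision.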

void bioCreateFsyncJob(int fd) {
    struct bio_job *job = zmalloc(sizeof(*job));
    job->fd = fd;

    bioSubmitJob(BIO_AOF_FSYNC, job);
}

void aof_background_fsync(int fd) {
    bioCreateFsyncJob(fd);
}

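AOF Log Rewrite (Multi-process)

The AOF append path described above uses a background thread to run fsync, whereas the rewrite uses a child process. A rewrite builds a brand-new AOF log from the current state of the database; doing that on the main thread would leave the server unable to handle client requests for a long time.

An AOF rewrite is started either by a client issuing BGREWRITEAOF or by serverCron, which periodically checks whether the rewrite conditions are met.

The automatic trigger requires that no other child process is currently active, for example an RDB snapshot, an AOF rewrite, or a child forked by a loaded module.

In addition, by default the AOF file must be larger than 64*1024*1024 bytes (64 MB) and must have grown by at least 100% relative to its size after the previous rewrite.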

/* Trigger an AOF rewrite if needed. */
if (server.aof_state == AOF_ON &&
    !hasActiveChildProcess() &&
    server.aof_rewrite_perc &&
    server.aof_current_size > server.aof_rewrite_min_size)
{
    long long base = server.aof_rewrite_base_size ?
        server.aof_rewrite_base_size : 1;
    long long growth = (server.aof_current_size*100/base) - 100;
    if (growth >= server.aof_rewrite_perc) {
        serverLog(LL_NOTICE,"Starting automatic rewriting of AOF on %lld%% growth",growth);
        rewriteAppendOnlyFileBackground();
    }
}

rewriteAppendOnlyFileBackground

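The function forks a child process. While a child exists, the parent does not resize its dicts. fork relies on copy-on-write, so the parent should modify as little memory as possible; otherwise copied physical pages make memory usage balloon. For the same reason, huge pages must not be enabled.

Finally, the child rewrites the database contents and receives the data written in the meantime from the parent through a pipe.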

int rewriteAppendOnlyFileBackground(void) {
    pid_t childpid;

    if (hasActiveChildProcess()) return C_ERR;
    if (aofCreatePipes() != C_OK) return C_ERR;
    if ((childpid = redisFork(CHILD_TYPE_AOF)) == 0) {
        char tmpfile[256];

        /* Child */
        redisSetProcTitle("redis-aof-rewrite");
        redisSetCpuAffinity(server.aof_rewrite_cpulist);
        snprintf(tmpfile,256,"temp-rewriteaof-bg-%d.aof", (int) getpid());
        if (rewriteAppendOnlyFile(tmpfile) == C_OK) {
            sendChildCowInfo(CHILD_INFO_TYPE_AOF_COW_SIZE, "AOF rewrite");
            exitFromChild(0);
        } else {
            exitFromChild(1);
        }
    } else {
        /* Parent */
        if (childpid == -1) {
            serverLog(LL_WARNING,
                "Can't rewrite append only file in background: fork: %s",
                strerror(errno));
            aofClosePipes();
            return C_ERR;
        }
        serverLog(LL_NOTICE,
            "Background append only file rewriting started by pid %ld",(long) childpid);
        server.aof_rewrite_scheduled = 0;
        server.aof_rewrite_time_start = time(NULL);

        /* We set appendseldb to -1 in order to force the next call to the
         * feedAppendOnlyFile() to issue a SELECT command, so the differences
         * accumulated by the parent into server.aof_rewrite_buf will start
         * with a SELECT statement and it will be safe to merge. */
        server.aof_selected_db = -1;
        replicationScriptCacheFlush();
        return C_OK;
    }
    return C_OK; /* unreached */
}

After the child finishes, the parent collects its exit status in checkChildrenDone.

checkChildrenDone

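For an AOF rewrite, the parent renames the new AOF log file into place and hands the old file's descriptor to bio to close, so that the close does not block the main thread.

The presence of ModuleForkDoneHandler suggests that modules are also given a fork interface, so a module can run its own tasks in a child process.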

void checkChildrenDone(void) {
    int statloc;
    pid_t pid;

    if ((pid = wait3(&statloc,WNOHANG,NULL)) != 0) {
        int exitcode = WEXITSTATUS(statloc);
        int bysignal = 0;

        if (WIFSIGNALED(statloc)) bysignal = WTERMSIG(statloc);

        /* sigKillChildHandler catches the signal and calls exit(), but we
         * must make sure not to flag lastbgsave_status, etc incorrectly.
         * We could directly terminate the child process via SIGUSR1
         * without handling it, but in this case Valgrind will log an
         * annoying error. */
        if (exitcode == SERVER_CHILD_NOERROR_RETVAL) {
            bysignal = SIGUSR1;
            exitcode = 1;
        }

        if (pid == -1) {
            serverLog(LL_WARNING,"wait3() returned an error: %s. "
                "child_type: %s, child_pid = %d",
                strerror(errno),
                strChildType(server.child_type),
                (int) server.child_pid);
        } else if (pid == server.child_pid) {
            if (server.child_type == CHILD_TYPE_RDB) {
                backgroundSaveDoneHandler(exitcode, bysignal);
            } else if (server.child_type == CHILD_TYPE_AOF) {
                backgroundRewriteDoneHandler(exitcode, bysignal);
            } else if (server.child_type == CHILD_TYPE_MODULE) {
                ModuleForkDoneHandler(exitcode, bysignal);
            } else {
                serverPanic("Unknown child type %d for child pid %d", server.child_type, server.child_pid);
                exit(1);
            }
            if (!bysignal && exitcode == 0) receiveChildInfo();
            resetChildState();
        } else {
            if (!ldbRemoveChild(pid)) {
                serverLog(LL_WARNING,
                          "Warning, detected child with unmatched pid: %ld",
                          (long) pid);
            }
        }

        /* start any pending forks immediately. */
        replicationStartPendingFork();
    }
}

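RDB Snapshot (Multi-process)

The bgsaveCommand command, like an AOF rewrite, does its work in a forked child process, which avoids locking and reduces memory copying. It can also be triggered automatically.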

/* If there is not a background saving/rewrite in progress check if
 * we have to save/rewrite now. */
for (j = 0; j < server.saveparamslen; j++) {
    struct saveparam *sp = server.saveparams+j;

    /* Save if we reached the given amount of changes,
     * the given amount of seconds, and if the latest bgsave was
     * successful or if, in case of an error, at least
     * CONFIG_BGSAVE_RETRY_DELAY seconds already elapsed. */
    if (server.dirty >= sp->changes &&
        server.unixtime-server.lastsave > sp->seconds &&
        (server.unixtime-server.lastbgsave_try >
         CONFIG_BGSAVE_RETRY_DELAY ||
         server.lastbgsave_status == C_OK))
    {
        serverLog(LL_NOTICE,"%d changes in %d seconds. Saving...",
            sp->changes, (int)sp->seconds);
        rdbSaveInfo rsi, *rsiptr;
        rsiptr = rdbPopulateSaveInfo(&rsi);
        rdbSaveBackground(server.rdb_filename,rsiptr);
        break;
    }
}

There can be several save points; each one is checked to see whether a save should be triggered.

int rdbSaveBackground(char *filename, rdbSaveInfo *rsi) {
    pid_t childpid;

    if (hasActiveChildProcess()) return C_ERR;

    server.dirty_before_bgsave = server.dirty;
    server.lastbgsave_try = time(NULL);

    if ((childpid = redisFork(CHILD_TYPE_RDB)) == 0) {
        int retval;

        /* Child */
        redisSetProcTitle("redis-rdb-bgsave");
        redisSetCpuAffinity(server.bgsave_cpulist);
        retval = rdbSave(filename,rsi);
        if (retval == C_OK) {
            sendChildCowInfo(CHILD_INFO_TYPE_RDB_COW_SIZE, "RDB");
        }
        exitFromChild((retval == C_OK) ? 0 : 1);
    } else {
        /* Parent */
        if (childpid == -1) {
            server.lastbgsave_status = C_ERR;
            serverLog(LL_WARNING,"Can't save in background: fork: %s",
                strerror(errno));
            return C_ERR;
        }
        serverLog(LL_NOTICE,"Background saving started by pid %ld",(long) childpid);
        server.rdb_save_time_start = time(NULL);
        server.rdb_child_type = RDB_CHILD_TYPE_DISK;
        return C_OK;
    }
    return C_OK; /* unreached */
}

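RDB snapshot transfer for replication works the same way: the child generates the snapshot and the parent sends it. With diskless transfer, the child serializes the data and passes it to the parent through a pipe, and the parent forwards it to the replicas. The next post discusses this in more detail, so it is not elaborated here.

Summary

Apart from command execution, which is single-threaded, Redis moves networking and time-consuming operations into other processes or threads wherever it can. This simplifies development and is well worth borrowing for game servers.

In addition, Redis frees memory on a background thread. I think the same idea could be applied to Lua's garbage collection to shorten stop-the-world pauses. When I find the time I will write a multi-threaded garbage collection version and see how it performs. LuaJIT-5.3.6 (updated 2021-07-04: the multi-threaded Lua garbage collection version has been implemented)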

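Raft Consensus Algorithm Explained

2021-03-14T02:01:20.000Z

The previous post analyzed the Gossip protocol; this one looks at how Raft is implemented.

This post has two parts. The first gives a rough walkthrough of Raft's design ideas and deliberately does not cover the RPC fields (they would add nothing at that stage except cognitive load). The second part reads through a high-quality Raft implementation and goes into the individual RPC fields there.

If it does not click on the first read, that is fine. I recommend reading the Raft paper several times; it took me two weeks of rereading to get a rough grasp of its guiding ideas.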

Raft

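Mention consensus algorithms and most people immediately think of Paxos. I would argue, though, that Paxos is not really an algorithm: what its paper describes is at most a guiding idea, and few people can read it and then build a reliable consensus algorithm (the hard part being to verify its correctness). Raft is different. Several of its design choices are very clever, its guiding ideas are easy to follow, and it is comparatively easy to implement. Raft was created precisely to make up for Paxos's lack of understandability; just look at the title of its paper, In Search of an Understandable Consensus Algorithm.

Those who have used Zookeeper may know that its internal protocol, Zab, was built following the ideas of Paxos. It does not use Raft simply because Raft did not exist yet at the time.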

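Three Roles

A Raft node is in one of only three roles.

Leader
Candidate
Follower

Leader

The leader handles all writes. When it receives a log entry (a request) from a client, it first records it (think of jotting the request down in a notebook without committing it) and then replicates it to the other servers by broadcast (via heartbeats). Once a majority of servers have responded successfully, the leader commits the entry to its own state machine (only at this point does it actually take effect in the KV store). Finally, the heartbeat broadcast tells all servers that they can apply it too.

Candidate

If the leader goes down, candidates have to run for leadership. Whoever first collects enough votes wins.

Follower

While the leader is alive, the cluster contains only the leader and followers, and heartbeats maintain that relationship. When the leader goes down, a follower steps forward to stand as a candidate and switches to the candidate role.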

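Leader Election

In my view the biggest difference between Raft and Paxos is that Raft introduces a strong leader, which makes the whole distributed system simpler. Multiple leaders are a disaster: it becomes very hard to guarantee the order of commands across the system.

Initially every node should be a follower. Since no leader is sending heartbeats yet, some follower will hit a heartbeat timeout, and whichever times out first becomes a candidate. There is a "first" because Raft randomizes the timeout. I find this mechanism very clever: it greatly reduces the system's complexity, removes the need for any priority scheme, and, simply by being random, also largely avoids split votes that would leave the cluster unable to serve for a long time.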
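The randomized timeout can be sketched in a few lines of Go. This is a minimal illustration under stated assumptions: newTimeoutSource is a hypothetical helper, the range [base, 2*base) is one common choice rather than something the text above prescribes, and each node is assumed to seed its own generator.

```go
package main

import (
	"fmt"
	"math/rand"
)

// newTimeoutSource returns a function that yields a timeout in milliseconds
// drawn uniformly from [baseMs, 2*baseMs). Each node would use its own seed.
func newTimeoutSource(seed int64) func(baseMs int64) int64 {
	rng := rand.New(rand.NewSource(seed))
	return func(baseMs int64) int64 {
		if baseMs <= 0 {
			return 0
		}
		return baseMs + rng.Int63n(baseMs)
	}
}

func main() {
	// Followers with different seeds rarely time out at the same instant,
	// so one of them usually becomes a candidate before the others.
	for node := int64(1); node <= 3; node++ {
		next := newTimeoutSource(node)
		fmt.Printf("node %d times out after %d ms\n", node, next(150))
	}
}
```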

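A candidate first votes for itself and then sends RequestVote. Once it has enough votes it becomes the leader. If no leader emerges, the election times out and a new round starts. If the candidate receives a message from a leader, it compares terms: if the leader's term is at least as large as its own, it steps back to follower; if smaller, it ignores the message.

Deciding whether to grant a vote is also simple: votes are granted first come, first served (one vote per term), on the precondition that the candidate's log is at least as up to date as the voter's.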
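The voting rule can be written out as a short Go sketch. The voter and voteRequest types and the grant method are hypothetical names for illustration, not the hashicorp/raft API; the log comparison follows the rule from the Raft paper (compare last terms first, then last indexes).

```go
package main

import "fmt"

// voter holds the state a node needs to decide on a vote.
type voter struct {
	currentTerm  uint64
	votedFor     string // "" means no vote cast in currentTerm
	lastLogIndex uint64
	lastLogTerm  uint64
}

type voteRequest struct {
	term         uint64
	candidate    string
	lastLogIndex uint64
	lastLogTerm  uint64
}

// grant applies the rule: one vote per term, first come first served,
// and only for a candidate whose log is at least as up to date.
func (v *voter) grant(req voteRequest) bool {
	if req.term < v.currentTerm {
		return false // stale candidate
	}
	if req.term > v.currentTerm {
		v.currentTerm = req.term // a newer term resets the vote
		v.votedFor = ""
	}
	if v.votedFor != "" && v.votedFor != req.candidate {
		return false // already voted for someone else this term
	}
	if req.lastLogTerm < v.lastLogTerm ||
		(req.lastLogTerm == v.lastLogTerm && req.lastLogIndex < v.lastLogIndex) {
		return false // candidate's log is behind ours
	}
	v.votedFor = req.candidate
	return true
}

func main() {
	v := &voter{currentTerm: 1, lastLogIndex: 5, lastLogTerm: 1}
	fmt.Println(v.grant(voteRequest{term: 2, candidate: "A", lastLogIndex: 5, lastLogTerm: 1}))
	fmt.Println(v.grant(voteRequest{term: 2, candidate: "B", lastLogIndex: 9, lastLogTerm: 1}))
}
```

In the example, candidate B has the longer log but asks second in the same term, so it is refused.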

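If this part is hard to follow, the animated walkthrough at thesecretlivesofdata is a good resource.

Log Replication

First, what is a log entry? It has three parts: the term in which it was created, its index, and its content.

After receiving a client request, the leader wraps it in a log entry and stores it first (recorded only, not applied), then broadcasts it to the other nodes. Once a majority of nodes respond successfully, the leader applies it to its own KV store (its state machine) and can reply to the client. The heartbeat broadcast also carries the latest commit index to all nodes, which then apply the entry themselves. Replying early like this effectively optimizes a two-phase commit into a single phase (only a majority has to respond), which halves the message latency.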
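"A majority has responded" can be made concrete by tracking, per node, the highest log index known to be stored there. The Go sketch below uses a hypothetical helper, majorityMatchIndex, and assumes the slice includes the leader's own last index. A real implementation additionally only commits entries of the leader's current term this way, which the sketch leaves out.

```go
package main

import (
	"fmt"
	"sort"
)

// majorityMatchIndex returns the highest log index that is stored on a
// majority of nodes. matchIndex holds one value per node, leader included.
func majorityMatchIndex(matchIndex []uint64) uint64 {
	if len(matchIndex) == 0 {
		return 0
	}
	sorted := append([]uint64(nil), matchIndex...)
	// Sort descending: the value at position len/2 is held by at least
	// len/2+1 nodes, which is a majority.
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })
	return sorted[len(sorted)/2]
}

func main() {
	// Five nodes: three of them store index 3 or higher.
	fmt.Println(majorityMatchIndex([]uint64{5, 3, 3, 1, 0}))
}
```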

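If a follower receives a client write request, there are several options: reject it and return the leader's address to the client, or forward it to the leader and relay the leader's result back to the client, acting as a proxy.

Why have I only mentioned writes? Because reads depend on the situation. Raft is a consensus algorithm, and many people assume that anything built on it is automatically strongly consistent. In fact, whether it is depends on how the client side is implemented. If you want strong consistency, force reads to go through the leader and have a majority of nodes confirm them; this never returns stale data. If strong consistency does not matter, reads can be served by any node, which may well return stale data. A third mode reads from the leader but returns without majority confirmation, which can also return stale data. For example, a new leader may already have been elected while, because of a network problem, the heartbeat broadcast has not yet told the old leader to step down. Still believing it is the leader, it returns data that the new leader may already have modified.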
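The three read modes can be summarized as a small Go sketch. ReadMode, its constants, canServe, and mayBeStale are hypothetical names chosen for this illustration; they only encode the rules stated above and are not any library's API.

```go
package main

import "fmt"

// ReadMode names the three ways of serving a read described above.
type ReadMode int

const (
	ReadConsistent ReadMode = iota // leader only, confirmed by a majority
	ReadDefault                    // leader only, no confirmation round
	ReadStale                      // any node may answer
)

// canServe reports whether a node may answer a read locally.
func canServe(mode ReadMode, isLeader, majorityConfirmed bool) bool {
	switch mode {
	case ReadConsistent:
		return isLeader && majorityConfirmed
	case ReadDefault:
		return isLeader
	case ReadStale:
		return true
	default:
		return false
	}
}

// mayBeStale reports whether an answer served under mode can be outdated.
func mayBeStale(mode ReadMode) bool {
	return mode != ReadConsistent
}

func main() {
	// A deposed leader that has not heard of its successor still passes the
	// ReadDefault check, which is exactly the stale-read window.
	fmt.Println(canServe(ReadDefault, true, false), mayBeStale(ReadDefault))
	fmt.Println(canServe(ReadConsistent, true, false))
}
```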

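These three read consistency models are exactly what Consul implements.

Seen this way, Raft's weakness is obvious: the strong leader makes write performance weak, roughly that of a single machine. That is why distributed storage systems mostly use it with sharding (effectively multiple Raft groups) rather than as one large group.

The safety of log replication comes from several properties.

First, the leader never deletes or overwrites entries in its log; it only appends. If a follower's log disagrees with the leader's, the follower is forced to match the leader. This is safe because the leader's log is guaranteed to be the newest and most complete.

How is that guaranteed, and how are mismatches detected? As noted in the voting rule, a node only votes for a candidate whose log is at least as up to date as its own, so the winner holds every committed entry. And since each entry consists of a term, an index, and content, every replication request checks whether the term and index of the preceding entry match. If they do, all earlier entries can be concluded to be identical as well.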
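The previous-entry check on the follower side can be sketched as follows. This is a simplified illustration, not the hashicorp/raft code: entry and appendEntries are hypothetical names, the log lives in memory, and log[i] is assumed to hold the entry with index i+1.

```go
package main

import "fmt"

type entry struct {
	term uint64
	data string
}

// appendEntries applies the consistency check and returns the new log.
// prevIndex/prevTerm describe the entry that precedes newEntries on the
// leader; prevIndex 0 means the new entries start at the very beginning.
func appendEntries(log []entry, prevIndex, prevTerm uint64, newEntries []entry) ([]entry, bool) {
	if prevIndex > uint64(len(log)) {
		return log, false // the follower is missing entries before prevIndex
	}
	if prevIndex > 0 && log[prevIndex-1].term != prevTerm {
		return log, false // same index, different term: the histories diverge
	}
	out := append([]entry(nil), log...) // copy so the caller's slice is untouched
	for i, e := range newEntries {
		slot := prevIndex + uint64(i)
		if slot < uint64(len(out)) {
			if out[slot].term == e.term {
				continue // already have this entry
			}
			out = out[:slot] // conflict: drop it and everything after it
		}
		out = append(out, e)
	}
	return out, true
}

func main() {
	log := []entry{{1, "a"}, {1, "b"}, {2, "c"}}
	_, ok := appendEntries(log, 3, 1, []entry{{3, "d"}})
	fmt.Println(ok) // the previous entry's term does not match, so the append is rejected
}
```

On rejection the leader would retry with an earlier prevIndex until the logs meet, then overwrite the follower's divergent suffix.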

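Next, what if the leader replicated an entry to followers and then crashed before the entry was applied to the state machine? This is where the term field comes in. The new leader's log is guaranteed to be complete by the mechanism above, and its term is higher, so it can commit the uncommitted entries from earlier terms and sync them to the other nodes (in practice indirectly, by committing an entry of its own term, which is what the noop entry in runLeader is for). On top of that, the system is implemented to be idempotent, so re-executing a command after a timeout or for some other reason causes no side effects.

One might then ask: the log keeps growing, so do I have to keep every entry forever to compare against followers? The Snapshot mechanism in the paper handles this by compacting the log into a snapshot, much like a Redis AOF rewrite, and then shipping the snapshot out.

If anything about log replication is unclear, Raft Visualization offers a very detailed animated demonstration.

Cluster Membership Changes

Raft uses single-server changes to avoid split brain during membership changes: adding only one node at a time can never create a second, disjoint majority, so multiple leaders are avoided. The alternative is joint consensus (essentially a two-phase rule in which the old and new configurations probe each other), but it is hard to implement.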
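Why one node at a time is safe comes down to arithmetic: a majority of the old configuration and a majority of the new one must share a node whenever their sizes add up to more than the number of nodes involved. The Go sketch below checks this; quorum and majoritiesMustOverlap are hypothetical helpers, and the check assumes nodes are only added or only removed in a single change.

```go
package main

import "fmt"

// quorum is the size of a majority in a cluster of n nodes.
func quorum(n int) int { return n/2 + 1 }

// majoritiesMustOverlap reports whether every majority of the old
// configuration shares at least one node with every majority of the new one.
// It assumes one configuration is a superset of the other.
func majoritiesMustOverlap(oldSize, newSize int) bool {
	union := oldSize
	if newSize > union {
		union = newSize
	}
	// Two sets inside a universe of `union` nodes must intersect when
	// their sizes add up to more than `union`.
	return quorum(oldSize)+quorum(newSize) > union
}

func main() {
	fmt.Println(majoritiesMustOverlap(3, 4)) // one node added
	fmt.Println(majoritiesMustOverlap(3, 5)) // two nodes added at once
}
```

Going from 3 to 5 nodes in one step allows a majority of 2 old nodes and a majority of 3 new nodes that share nothing, which is the split-brain case.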

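hashicorp/raft Source Walkthrough

With that background, we can read the well-known hashicorp/raft implementation to understand Raft more deeply.

A Few Data Structures

RaftState is the state a Raft node is currently in: the three roles described above, plus Shutdown.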

type RaftState uint32

const (
	// Follower is the initial state of a Raft node.
	Follower RaftState = iota
	// Candidate is one of the valid states of a Raft node.
	Candidate
	// Leader is one of the valid states of a Raft node.
	Leader
	// Shutdown is the terminal state of a Raft node.
	Shutdown // shut down
)

raftState holds the node's own Raft state.

type raftState struct {
	// The current term, cache of StableStore
	currentTerm uint64 // current term

	// Highest committed log entry
	commitIndex uint64 // highest committed log index

	// Last applied log to the FSM
	lastApplied uint64 // index of the last entry applied to the state machine

	// Cache the latest snapshot index/term
	lastSnapshotIndex uint64 // snapshot index
	lastSnapshotTerm  uint64 // snapshot term

	// Cache the latest log from LogStore
	lastLogIndex uint64 // index of the last log entry
	lastLogTerm  uint64 // term of the last log entry

	// The current state
	state RaftState // node state, one of the states above
}

The AppendEntries RPC request. From here on it helps to read along with the paper.

type AppendEntriesRequest struct {
	RPCHeader // protocol version

	// Provide the current term and leader
	Term   uint64 // term
	Leader []byte // leader information

	// Provide the previous entries for integrity checking
	PrevLogEntry uint64 // index of the previous log entry
	PrevLogTerm  uint64 // term of the previous log entry

	// New entries to commit
	Entries []*Log // new log entries

	// Commit index on the leader
	LeaderCommitIndex uint64 // highest committed index, carried by heartbeats so followers can apply too
}

The AppendEntries RPC response.

type AppendEntriesResponse struct {
	RPCHeader

	// Newer term if leader is out of date
	Term uint64 // current term

	// Last Log is a hint to help accelerate rebuilding slow nodes
	LastLog uint64 // index of the last log entry, used to find missing entries quickly (not in the paper)

	// We may not succeed if we have a conflicting entry
	Success bool // whether the append succeeded; false on a mismatch

	// There are scenarios where this request didn't succeed
	// but there's no need to wait/back-off the next attempt.
	NoRetryBackoff bool // retry immediately without waiting; an extension not in the paper, used for speed
}

The RequestVote RPC request.

type RequestVoteRequest struct {
	RPCHeader

	// Provide the term and our id
	Term      uint64 // term
	Candidate []byte // candidate information

	// Used to ensure safety
	LastLogIndex uint64 // index of the last log entry
	LastLogTerm  uint64 // term of the last log entry

	// Used to indicate to peers if this vote was triggered by a leadership
	// transfer. It is required for leadership transfer to work, because servers
	// wouldn't vote otherwise if they are aware of an existing leader.
	LeadershipTransfer bool // hashicorp's fast path for actively transferring leadership; not in the paper
}

The RequestVote RPC response.

type RequestVoteResponse struct {
	RPCHeader

	// Newer term if leader is out of date.
	Term uint64 // term

	// Is the vote granted.
	Granted bool // whether the vote was granted
}

The InstallSnapshot RPC request.

When there are too many log entries, they are merged into a single snapshot, which is replicated instead.

type InstallSnapshotRequest struct {
	RPCHeader
	SnapshotVersion SnapshotVersion // snapshot version (extension)

	Term   uint64 // term
	Leader []byte // leader information

	// These are the last index/term included in the snapshot
	LastLogIndex uint64 // index of the last log entry in the snapshot
	LastLogTerm  uint64 // term of the last log entry in the snapshot

	// Cluster membership.
	Configuration []byte // configuration
	// Log index where 'Configuration' entry was originally written.
	ConfigurationIndex uint64 // index of the configuration entry

	// Size of the snapshot
	Size int64 // size
}

The InstallSnapshot RPC response.

type InstallSnapshotResponse struct {
	RPCHeader

	Term    uint64 // term
	Success bool   // whether the install succeeded
}

NewRaft

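NewRaft creates a Raft node. It validates the configuration, initializes the log, loads any existing data from the store, starts in the Follower state by default, and then launches three goroutines.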

func NewRaft(conf *Config, fsm FSM, logs LogStore, stable StableStore, snaps SnapshotStore, trans Transport) (*Raft, error) {
	....
	// Initialize as a follower.
	r.setState(Follower)

	// Start as leader if specified. This should only be used
	// for testing purposes.
	if conf.StartAsLeader {
		r.setState(Leader)
		r.setLeader(r.localAddr)
	}
	....
	// Start the background work.
	r.goFunc(r.run)
	r.goFunc(r.runFSM)
	r.goFunc(r.runSnapshots)
	return r, nil
}

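The discussion below is organized around these three goroutines.

run

The run goroutine calls the function that matches the node's current state.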

func (r *Raft) run() {
	for {
		// Check if we are doing a shutdown
		select {
		case <-r.shutdownCh:
			// Clear the leader to prevent forwarding
			r.setLeader("")
			return
		default:
		}

		// Enter into a sub-FSM
		switch r.getState() {
		case Follower:
			r.runFollower()
		case Candidate:
			r.runCandidate()
		case Leader:
			r.runLeader()
		}
	}
}

runFollower

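In the follower state the node receives RPC requests. There is also a bootstrapCh, which receives the cluster configuration at startup.

Apart from AppendEntries, RequestVote, and InstallSnapshot requests, no other requests are supported (that code is omitted).

After a heartbeat timeout the node becomes a Candidate.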

func (r *Raft) runFollower() {
	didWarn := false
	heartbeatTimer := randomTimeout(r.conf.HeartbeatTimeout)

	for r.getState() == Follower {
		select {
		case rpc := <-r.rpcCh:
			r.processRPC(rpc)

		....

		case b := <-r.bootstrapCh:
			b.respond(r.liveBootstrap(b.configuration))

		case <-heartbeatTimer:
			// Restart the heartbeat timer
			heartbeatTimer = randomTimeout(r.conf.HeartbeatTimeout)

			// Check if we have had a successful contact
			lastContact := r.LastContact()
			if time.Now().Sub(lastContact) < r.conf.HeartbeatTimeout {
				continue
			}

			// Heartbeat failed! Transition to the candidate state
			lastLeader := r.Leader()
			r.setLeader("")

			if r.configurations.latestIndex == 0 {
				if !didWarn {
					r.logger.Warn("no known peers, aborting election")
					didWarn = true
				}
			} else if r.configurations.latestIndex == r.configurations.committedIndex &&
				!hasVote(r.configurations.latest, r.localID) {
				if !didWarn {
					r.logger.Warn("not part of stable configuration, aborting election")
					didWarn = true
				}
			} else {
				r.logger.Warn(fmt.Sprintf("Heartbeat timeout from %q reached, starting election", lastLeader))
				r.setState(Candidate)
				return
			}

		case <-r.shutdownCh:
			return
		}
	}
}

runCandidate

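A candidate first casts a vote for itself and then asks everyone else for votes. Depending on the outcome it either falls back to follower or becomes the leader.

Apart from log and vote requests, everything else gets an error straight away. On an election timeout the function simply returns; since the state is still Candidate, run calls runCandidate again and a new election round begins.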

func (r *Raft) runCandidate() {
// Start vote for us, and set a timeout
voteCh := r.electSelf()
....
electionTimer := randomTimeout(r.conf.ElectionTimeout)

// Tally the votes, need a simple majority
grantedVotes := 0
votesNeeded := r.quorumSize()
r.logger.Debug(fmt.Sprintf("Votes needed: %d", votesNeeded))

for r.getState() == Candidate {
select {
case rpc := <-r.rpcCh:
r.processRPC(rpc)

case vote := <-voteCh:
// Check if the term is greater than ours, bail
if vote.Term > r.getCurrentTerm() {
r.logger.Debug("Newer term discovered, fallback to follower")
r.setState(Follower)
r.setCurrentTerm(vote.Term)
return
}

// Check if the vote is granted
if vote.Granted {
grantedVotes++
r.logger.Debug(fmt.Sprintf("Vote granted from %s in term %v. Tally: %d",
vote.voterID, vote.Term, grantedVotes))
}

// Check if we've become the leader
if grantedVotes >= votesNeeded {
r.logger.Info(fmt.Sprintf("Election won. Tally: %d", grantedVotes))
r.setState(Leader)
r.setLeader(r.localAddr)
return
}

case c := <-r.configurationChangeCh:
// Reject any operations since we are not the leader
c.respond(ErrNotLeader)

case a := <-r.applyCh:
// Reject any operations since we are not the leader
a.respond(ErrNotLeader)

case v := <-r.verifyCh:
// Reject any operations since we are not the leader
v.respond(ErrNotLeader)

case r := <-r.userRestoreCh:
// Reject any restores since we are not the leader
r.respond(ErrNotLeader)

case c := <-r.configurationsCh:
c.configurations = r.configurations.Clone()
c.respond(nil)

case b := <-r.bootstrapCh:
b.respond(ErrCantBootstrap)

case <-electionTimer:
// Election failed! Restart the election. We simply return,
// which will kick us back into runCandidate
r.logger.Warn("Election timeout reached, restarting election")
return

case <-r.shutdownCh:
return
}
}
}

runLeader

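The leader starts a replication goroutine for each peer and then creates a noop log entry (an entry that is not applied to the state machine). This entry is very important: as soon as the leader is elected, it effectively tells the followers to commit the entries left uncommitted from earlier terms (an implicit commit).

The noop entry acts as a dividing line. The leader only starts serving once the other nodes have replicated it, which keeps clients from reading uncommitted (stale) data from other nodes.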

func (r *Raft) runLeader() {
	....
	// setup leader state. This is only supposed to be accessed within the
	// leaderloop.
	r.setupLeaderState()
	....
	// Start a replication routine for each peer
	r.startStopReplication()

	// Dispatch a no-op log entry first. This gets this leader up to the latest
	// possible commit index, even in the absence of client commands. This used
	// to append a configuration entry instead of a noop. However, that permits
	// an unbounded number of uncommitted configurations in the log. We now
	// maintain that there exists at most one uncommitted configuration entry in
	// any log, so we have to do proper no-ops here.
	noop := &logFuture{
		log: Log{
			Type: LogNoop,
		},
	}
	r.dispatchLogs([]*logFuture{noop})

	// Sit in the leader loop until we step down
	r.leaderLoop()
}

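The handling of the remaining RPC requests is not analyzed further here. It boils down to comparing the node's own state with the information carried by the incoming heartbeat.

Some Improvements

Pipelined log transfer.
Use MultiRaft, because Raft is strong-leader based and its performance is that of a single node.
Before turning into a candidate, a follower first checks with the cluster whether there really is no Leader. This helps avoid pushing a cluster that clearly has a Leader into an election under a symmetric network partition (three nodes, two data centers, two of the nodes in the same data center).
An asymmetric network partition (three nodes, three data centers, each node in a different one) can cause endless re-elections. Checking whether the time since the last contact from the Leader exceeds the election timeout avoids this.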
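The "ask before becoming a candidate" improvement can be sketched as a simple tally. This is an assumption-laden illustration, not hashicorp/raft code: shouldStartElection and quorum are hypothetical helpers, and each reachable peer is assumed to answer whether it still hears from a leader.

```go
package main

import "fmt"

// quorum is the size of a majority in a cluster of n nodes.
func quorum(clusterSize int) int { return clusterSize/2 + 1 }

// shouldStartElection decides whether a follower whose heartbeat timed out
// should really become a candidate. peerSeesLeader holds one answer per peer
// that could be reached; unreachable peers simply have no entry.
func shouldStartElection(clusterSize int, peerSeesLeader []bool) bool {
	willing := 1 // the follower itself
	for _, seesLeader := range peerSeesLeader {
		if !seesLeader {
			willing++
		}
	}
	return willing >= quorum(clusterSize)
}

func main() {
	// Three nodes, one cut off from the other two: it reaches nobody,
	// so it does not disturb the healthy majority with an election.
	fmt.Println(shouldStartElection(3, nil))
	// The leader really is gone and a reachable peer agrees.
	fmt.Println(shouldStartElection(3, []bool{false}))
}
```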

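Summary

Raft gets remarkable mileage out of timeouts. Introducing them (a heartbeat timeout to elect a leader, an election timeout to restart the election) lowers the complexity of the whole system. Appending and committing log entries via heartbeats, without waiting for full acknowledgment, turns a two-phase commit into a single phase. After taking office, a Leader uses a noop entry to neatly avoid inconsistent logs and stale reads. Membership changes are done one node at a time, which avoids split brain. Raft really does push understandability to the limit.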

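Gossip Protocol Explained

2021-03-09T08:45:20.000Z

I have long been interested in distributed protocols and picked Gossip and Raft as a starting point, for two reasons.

They are simple enough.
One is AP and the other CP, so they represent availability-first and consistency-first designs respectively.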

Gossip

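The Gossip protocol spreads information to other nodes the way a rumor spreads.

I call Gossip a protocol rather than an algorithm because it is only an idea, and there are many variants built on it.

For Gossip to work properly, the following three mechanisms have to be combined.

Broadcast
Anti-entropy
Rumor mongering

Anti-entropy

Anti-entropy exchanges data between two nodes by push and pull so that they reach the same state. It is needed on top of broadcast because a buffer may fill up and drop data, or a node may have just come online and cannot have received earlier broadcasts; anti-entropy repairs those gaps.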
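A push-pull exchange can be illustrated with a versioned key-value state. The Go sketch below is a minimal model under stated assumptions: versioned and merge are hypothetical names, and "highest version wins" is one simple conflict rule (memberlist itself compares incarnation numbers of node states).

```go
package main

import "fmt"

// versioned is a value tagged with a monotonically increasing version.
type versioned struct {
	value   string
	version uint64
}

// merge returns the union of local and remote, keeping the higher version
// for keys present on both sides. Neither input map is modified.
func merge(local, remote map[string]versioned) map[string]versioned {
	out := make(map[string]versioned, len(local))
	for k, v := range local {
		out[k] = v
	}
	for k, r := range remote {
		if l, ok := out[k]; !ok || r.version > l.version {
			out[k] = r
		}
	}
	return out
}

func main() {
	a := map[string]versioned{"x": {"old", 1}, "y": {"only-a", 4}}
	b := map[string]versioned{"x": {"new", 2}, "z": {"only-b", 1}}
	// Push-pull: each side merges the other's state, so both converge.
	fmt.Println(merge(a, b))
	fmt.Println(merge(b, a))
}
```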

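Rumor Mongering

Rumor mongering and broadcast are usually implemented together. Put differently, rumor mongering picks K random nodes and broadcasts to them.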
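Picking K random targets can be sketched as below. pickGossipTargets is a hypothetical helper, not the memberlist API (memberlist has its own kRandomNodes, shown later); the seed parameter only keeps the example deterministic, and a long-running node would keep one generator instead.

```go
package main

import (
	"fmt"
	"math/rand"
)

// pickGossipTargets returns up to k distinct nodes chosen at random,
// never including self.
func pickGossipTargets(nodes []string, self string, k int, seed int64) []string {
	candidates := make([]string, 0, len(nodes))
	for _, n := range nodes {
		if n != self {
			candidates = append(candidates, n)
		}
	}
	rng := rand.New(rand.NewSource(seed))
	rng.Shuffle(len(candidates), func(i, j int) {
		candidates[i], candidates[j] = candidates[j], candidates[i]
	})
	if k < 0 {
		k = 0
	}
	if k > len(candidates) {
		k = len(candidates)
	}
	return candidates[:k]
}

func main() {
	nodes := []string{"a", "b", "c", "d", "e"}
	fmt.Println(pickGossipTargets(nodes, "a", 3, 7))
}
```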

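Gossip Implementation

This section mainly analyzes the memberlist implementation, which is based on a Gossip variant, SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol.

Each node is in one of four states: alive, suspect, dead, or left (a complement to dead).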

const (
	StateAlive NodeStateType = iota
	StateSuspect
	StateDead
	StateLeft
)

Create

Creates a node from the configuration.

func Create(conf *Config) (*Memberlist, error) {
	m, err := newMemberlist(conf)
	if err != nil {
		return nil, err
	}
	if err := m.setAlive(); err != nil {
		m.Shutdown()
		return nil, err
	}
	m.schedule()
	return m, nil
}

newMemberlist

Fills in the struct and sets up the TCP and UDP connections.

func newMemberlist(conf *Config) (*Memberlist, error) {
	....
	m := &Memberlist{
		config:               conf,
		shutdownCh:           make(chan struct{}),
		leaveBroadcast:       make(chan struct{}, 1),
		transport:            nodeAwareTransport,
		handoffCh:            make(chan struct{}, 1),
		highPriorityMsgQueue: list.New(),
		lowPriorityMsgQueue:  list.New(),
		nodeMap:              make(map[string]*nodeState),
		nodeTimers:           make(map[string]*suspicion),
		awareness:            newAwareness(conf.AwarenessMaxMultiplier),
		ackHandlers:          make(map[uint32]*ackHandler),
		broadcasts:           &TransmitLimitedQueue{RetransmitMult: conf.RetransmitMult},
		logger:               logger,
	}
	....
	go m.streamListen()
	go m.packetListen()
	go m.packetHandler()
	return m, nil
}

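TCP Handling

Node state sync, push-pull, and user data sync.

The handler reads the data and acts according to the message type. Anti-entropy shows up in the pushPullMsg type.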

func (m *Memberlist) handleConn(conn net.Conn) {
defer conn.Close()
m.logger.Printf("[DEBUG] memberlist: Stream connection %s", LogConn(conn))

metrics.IncrCounter([]string{"memberlist", "tcp", "accept"}, 1)

conn.SetDeadline(time.Now().Add(m.config.TCPTimeout))
msgType, bufConn, dec, err := m.readStream(conn)

switch msgType {
case userMsg:
if err := m.readUserMsg(bufConn, dec); err != nil {
m.logger.Printf("[ERR] memberlist: Failed to receive user message: %s %s", err, LogConn(conn))
}
case pushPullMsg:
// Increment counter of pending push/pulls
numConcurrent := atomic.AddUint32(&m.pushPullReq, 1)
defer atomic.AddUint32(&m.pushPullReq, ^uint32(0))

// Check if we have too many open push/pull requests
if numConcurrent >= maxPushPullRequests {
m.logger.Printf("[ERR] memberlist: Too many pending push/pull requests")
return
}

join, remoteNodes, userState, err := m.readRemoteState(bufConn, dec)
if err != nil {
m.logger.Printf("[ERR] memberlist: Failed to read remote state: %s %s", err, LogConn(conn))
return
}

if err := m.sendLocalState(conn, join); err != nil {
m.logger.Printf("[ERR] memberlist: Failed to push local state: %s %s", err, LogConn(conn))
return
}

if err := m.mergeRemoteState(join, remoteNodes, userState); err != nil {
m.logger.Printf("[ERR] memberlist: Failed push/pull merge: %s %s", err, LogConn(conn))
return
}
case pingMsg:
var p ping
if err := dec.Decode(&p); err != nil {
m.logger.Printf("[ERR] memberlist: Failed to decode ping: %s %s", err, LogConn(conn))
return
}

if p.Node != "" && p.Node != m.config.Name {
m.logger.Printf("[WARN] memberlist: Got ping for unexpected node %s %s", p.Node, LogConn(conn))
return
}

ack := ackResp{p.SeqNo, nil}
out, err := encode(ackRespMsg, &ack)
if err != nil {
m.logger.Printf("[ERR] memberlist: Failed to encode ack: %s", err)
return
}

err = m.rawSendMsgStream(conn, out.Bytes())
if err != nil {
m.logger.Printf("[ERR] memberlist: Failed to send ack: %s %s", err, LogConn(conn))
return
}
default:
m.logger.Printf("[ERR] memberlist: Received invalid msgType (%d) %s", msgType, LogConn(conn))
}
}

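UDP Handling

Handling of the various messages.

Incoming data is turned into commands and processed. Queued messages are prioritized: alive messages go to the high-priority queue, while suspect, dead, and user messages go to the low-priority one.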

func (m *Memberlist) ingestPacket(buf []byte, from net.Addr, timestamp time.Time) {
....
m.handleCommand(buf, from, timestamp)
}

func (m *Memberlist) handleCommand(buf []byte, from net.Addr, timestamp time.Time) {
// Decode the message type
msgType := messageType(buf[0])
buf = buf[1:]

// Switch on the msgType
switch msgType {
case compoundMsg:
m.handleCompound(buf, from, timestamp)
case compressMsg:
m.handleCompressed(buf, from, timestamp)

case pingMsg:
m.handlePing(buf, from)
case indirectPingMsg:
m.handleIndirectPing(buf, from)
case ackRespMsg:
m.handleAck(buf, from, timestamp)
case nackRespMsg:
m.handleNack(buf, from)

case suspectMsg:
fallthrough
case aliveMsg:
fallthrough
case deadMsg:
fallthrough
case userMsg:
// Determine the message queue, prioritize alive
queue := m.lowPriorityMsgQueue
if msgType == aliveMsg {
queue = m.highPriorityMsgQueue
}

// Check for overflow and append if not full
m.msgQueueLock.Lock()
if queue.Len() >= m.config.HandoffQueueDepth {
m.logger.Printf("[WARN] memberlist: handler queue full, dropping message (%d) %s", msgType, LogAddress(from))
} else {
queue.PushBack(msgHandoff{msgType, buf, from})
}
m.msgQueueLock.Unlock()

// Notify of pending message
select {
case m.handoffCh <- struct{}{}:
default:
}

default:
m.logger.Printf("[ERR] memberlist: msg type (%d) not supported %s", msgType, LogAddress(from))
}
}

schedule

Starts three goroutines:

probe goroutine
push-pull goroutine
gossip goroutine

func (m *Memberlist) schedule() {
	m.tickerLock.Lock()
	defer m.tickerLock.Unlock()

	// If we already have tickers, then don't do anything, since we're
	// scheduled
	if len(m.tickers) > 0 {
		return
	}

	// Create the stop tick channel, a blocking channel. We close this
	// when we should stop the tickers.
	stopCh := make(chan struct{})

	// Create a new probeTicker
	if m.config.ProbeInterval > 0 {
		t := time.NewTicker(m.config.ProbeInterval)
		go m.triggerFunc(m.config.ProbeInterval, t.C, stopCh, m.probe)
		m.tickers = append(m.tickers, t)
	}

	// Create a push pull ticker if needed
	if m.config.PushPullInterval > 0 {
		go m.pushPullTrigger(stopCh)
	}

	// Create a gossip ticker if needed
	if m.config.GossipInterval > 0 && m.config.GossipNodes > 0 {
		t := time.NewTicker(m.config.GossipInterval)
		go m.triggerFunc(m.config.GossipInterval, t.C, stopCh, m.gossip)
		m.tickers = append(m.tickers, t)
	}

	// If we made any tickers, then record the stopTick channel for
	// later.
	if len(m.tickers) > 0 {
		m.stopTick = stopCh
	}
}

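probe Goroutine

It picks a random node and sends it a ping message over UDP. If that does not get through, it falls back to indirect-ping messages: the request goes to a few other random nodes, which ping the target on your behalf.

If the TCP option is enabled in the configuration, it also pings over TCP. (If TCP says the node is alive while the indirect UDP probes say it is not, the node is still considered alive.)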

func (m *Memberlist) probeNode(node *nodeState) {
defer metrics.MeasureSince([]string{"memberlist", "probeNode"}, time.Now())

// We use our health awareness to scale the overall probe interval, so we
// slow down if we detect problems. The ticker that calls us can handle
// us running over the base interval, and will skip missed ticks.
probeInterval := m.awareness.ScaleTimeout(m.config.ProbeInterval)
if probeInterval > m.config.ProbeInterval {
metrics.IncrCounter([]string{"memberlist", "degraded", "probe"}, 1)
}

// Prepare a ping message and setup an ack handler.
selfAddr, selfPort := m.getAdvertise()
ping := ping{
SeqNo:      m.nextSeqNo(),
Node:       node.Name,
SourceAddr: selfAddr,
SourcePort: selfPort,
SourceNode: m.config.Name,
}
ackCh := make(chan ackMessage, m.config.IndirectChecks+1)
nackCh := make(chan struct{}, m.config.IndirectChecks+1)
m.setProbeChannels(ping.SeqNo, ackCh, nackCh, probeInterval)

// Mark the sent time here, which should be after any pre-processing but
// before system calls to do the actual send. This probably over-reports
// a bit, but it's the best we can do. We had originally put this right
// after the I/O, but that would sometimes give negative RTT measurements
// which was not desirable.
sent := time.Now()

// Send a ping to the node. If this node looks like it's suspect or dead,
// also tack on a suspect message so that it has a chance to refute as
// soon as possible.
deadline := sent.Add(probeInterval)
addr := node.Address()

// Arrange for our self-awareness to get updated.
var awarenessDelta int
defer func() {
m.awareness.ApplyDelta(awarenessDelta)
}()
if node.State == StateAlive {
if err := m.encodeAndSendMsg(node.FullAddress(), pingMsg, &ping); err != nil {
m.logger.Printf("[ERR] memberlist: Failed to send ping: %s", err)
if failedRemote(err) {
goto HANDLE_REMOTE_FAILURE
} else {
return
}
}
} else {
var msgs [][]byte
if buf, err := encode(pingMsg, &ping); err != nil {
m.logger.Printf("[ERR] memberlist: Failed to encode ping message: %s", err)
return
} else {
msgs = append(msgs, buf.Bytes())
}
s := suspect{Incarnation: node.Incarnation, Node: node.Name, From: m.config.Name}
if buf, err := encode(suspectMsg, &s); err != nil {
m.logger.Printf("[ERR] memberlist: Failed to encode suspect message: %s", err)
return
} else {
msgs = append(msgs, buf.Bytes())
}

compound := makeCompoundMessage(msgs)
if err := m.rawSendMsgPacket(node.FullAddress(), &node.Node, compound.Bytes()); err != nil {
m.logger.Printf("[ERR] memberlist: Failed to send compound ping and suspect message to %s: %s", addr, err)
if failedRemote(err) {
goto HANDLE_REMOTE_FAILURE
} else {
return
}
}
}

// Arrange for our self-awareness to get updated. At this point we've
// sent the ping, so any return statement means the probe succeeded
// which will improve our health until we get to the failure scenarios
// at the end of this function, which will alter this delta variable
// accordingly.
awarenessDelta = -1

// Wait for response or round-trip-time.
select {
case v := <-ackCh:
if v.Complete == true {
if m.config.Ping != nil {
rtt := v.Timestamp.Sub(sent)
m.config.Ping.NotifyPingComplete(&node.Node, rtt, v.Payload)
}
return
}

// As an edge case, if we get a timeout, we need to re-enqueue it
// here to break out of the select below.
if v.Complete == false {
ackCh <- v
}
case <-time.After(m.config.ProbeTimeout):
// Note that we don't scale this timeout based on awareness and
// the health score. That's because we don't really expect waiting
// longer to help get UDP through. Since health does extend the
// probe interval it will give the TCP fallback more time, which
// is more active in dealing with lost packets, and it gives more
// time to wait for indirect acks/nacks.
m.logger.Printf("[DEBUG] memberlist: Failed ping: %s (timeout reached)", node.Name)
}

HANDLE_REMOTE_FAILURE:
// Get some random live nodes.
m.nodeLock.RLock()
kNodes := kRandomNodes(m.config.IndirectChecks, m.nodes, func(n *nodeState) bool {
return n.Name == m.config.Name ||
n.Name == node.Name ||
n.State != StateAlive
})
m.nodeLock.RUnlock()

// Attempt an indirect ping.
expectedNacks := 0
selfAddr, selfPort = m.getAdvertise()
ind := indirectPingReq{
SeqNo:      ping.SeqNo,
Target:     node.Addr,
Port:       node.Port,
Node:       node.Name,
SourceAddr: selfAddr,
SourcePort: selfPort,
SourceNode: m.config.Name,
}
for _, peer := range kNodes {
// We only expect nack to be sent from peers who understand
// version 4 of the protocol.
if ind.Nack = peer.PMax >= 4; ind.Nack {
expectedNacks++
}

if err := m.encodeAndSendMsg(peer.FullAddress(), indirectPingMsg, &ind); err != nil {
m.logger.Printf("[ERR] memberlist: Failed to send indirect ping: %s", err)
}
}

// Also make an attempt to contact the node directly over TCP. This
// helps prevent confused clients who get isolated from UDP traffic
// but can still speak TCP (which also means they can possibly report
// misinformation to other nodes via anti-entropy), avoiding flapping in
// the cluster.
//
// This is a little unusual because we will attempt a TCP ping to any
// member who understands version 3 of the protocol, regardless of
// which protocol version we are speaking. That's why we've included a
// config option to turn this off if desired.
fallbackCh := make(chan bool, 1)

disableTcpPings := m.config.DisableTcpPings ||
(m.config.DisableTcpPingsForNode != nil && m.config.DisableTcpPingsForNode(node.Name))
if (!disableTcpPings) && (node.PMax >= 3) {
go func() {
defer close(fallbackCh)
didContact, err := m.sendPingAndWaitForAck(node.FullAddress(), ping, deadline)
if err != nil {
m.logger.Printf("[ERR] memberlist: Failed fallback ping: %s", err)
} else {
fallbackCh <- didContact
}
}()
} else {
close(fallbackCh)
}

// Wait for the acks or timeout. Note that we don't check the fallback
// channel here because we want to issue a warning below if that's the
// *only* way we hear back from the peer, so we have to let this time
// out first to allow the normal UDP-based acks to come in.
select {
case v := <-ackCh:
if v.Complete == true {
return
}
}

// Finally, poll the fallback channel. The timeouts are set such that
// the channel will have something or be closed without having to wait
// any additional time here.
for didContact := range fallbackCh {
if didContact {
m.logger.Printf("[WARN] memberlist: Was able to connect to %s but other probes failed, network may be misconfigured", node.Name)
return
}
}

// Update our self-awareness based on the results of this failed probe.
// If we don't have peers who will send nacks then we penalize for any
// failed probe as a simple health metric. If we do have peers to nack
// verify, then we can use that as a more sophisticated measure of self-
// health because we assume them to be working, and they can help us
// decide if the probed node was really dead or if it was something wrong
// with ourselves.
awarenessDelta = 0
if expectedNacks > 0 {
if nackCount := len(nackCh); nackCount < expectedNacks {
awarenessDelta += (expectedNacks - nackCount)
}
} else {
awarenessDelta += 1
}

// No acks received from target, suspect it as failed.
m.logger.Printf("[INFO] memberlist: Suspect %s has failed, no acks received", node.Name)
s := suspect{Incarnation: node.Incarnation, Node: node.Name, From: m.config.Name}
m.suspectNode(&s)
}

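push-pull goroutine

Picks one random node and does a full state exchange (push/pull) with it over TCP, repairing divergent state via anti-entropy.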

func (m *Memberlist) pushPull() {
// Get a random live node
m.nodeLock.RLock()
nodes := kRandomNodes(1, m.nodes, func(n *nodeState) bool {
return n.Name == m.config.Name ||
n.State != StateAlive
})
m.nodeLock.RUnlock()

// If no nodes, bail
if len(nodes) == 0 {
return
}
node := nodes[0]

// Attempt a push pull
if err := m.pushPullNode(node.FullAddress(), false); err != nil {
m.logger.Printf("[ERR] memberlist: Push/Pull with %s failed: %s", node.Name, err)
}
}

// pushPullNode does a complete state exchange with a specific node.
func (m *Memberlist) pushPullNode(a Address, join bool) error {
defer metrics.MeasureSince([]string{"memberlist", "pushPullNode"}, time.Now())

// Attempt to send and receive with the node
remote, userState, err := m.sendAndReceiveState(a, join)
if err != nil {
return err
}

if err := m.mergeRemoteState(join, remote, userState); err != nil {
return err
}
return nil
}

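gossip goroutine

Picks a configurable number of random nodes and spreads rumors to them over UDP: messages are taken from the broadcast queue (the TCP state sync puts messages into this queue) and broadcast.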

func (m *Memberlist) gossip() {
defer metrics.MeasureSince([]string{"memberlist", "gossip"}, time.Now())

// Get some random live, suspect, or recently dead nodes
m.nodeLock.RLock()
kNodes := kRandomNodes(m.config.GossipNodes, m.nodes, func(n *nodeState) bool {
if n.Name == m.config.Name {
return true
}

switch n.State {
case StateAlive, StateSuspect:
return false

case StateDead:
return time.Since(n.StateChange) > m.config.GossipToTheDeadTime

default:
return true
}
})
m.nodeLock.RUnlock()

// Compute the bytes available
bytesAvail := m.config.UDPBufferSize - compoundHeaderOverhead
if m.config.EncryptionEnabled() {
bytesAvail -= encryptOverhead(m.encryptionVersion())
}

for _, node := range kNodes {
// Get any pending broadcasts
msgs := m.getBroadcasts(compoundOverhead, bytesAvail)
if len(msgs) == 0 {
return
}

addr := node.Address()
if len(msgs) == 1 {
// Send single message as is
if err := m.rawSendMsgPacket(node.FullAddress(), &node.Node, msgs[0]); err != nil {
m.logger.Printf("[ERR] memberlist: Failed to send gossip to %s: %s", addr, err)
}
} else {
// Otherwise create and send a compound message
compound := makeCompoundMessage(msgs)
if err := m.rawSendMsgPacket(node.FullAddress(), &node.Node, compound.Bytes()); err != nil {
m.logger.Printf("[ERR] memberlist: Failed to send gossip to %s: %s", addr, err)
}
}
}
}

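If TCP pings are enabled in the config, the node is also pinged over TCP (if the TCP ping says the node is alive while the indirect UDP probes say it is not, the node is still treated as alive).

Conclusion

Gossip is an AP distributed protocol, and on the whole it is fairly simple.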

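ptmalloc2 Memory Management

2021-02-28T02:45:20.000Z

While load testing a server recently, I noticed that memory usage surged as the number of users grew, but was not returned to the kernel when the number of users dropped. At first I suspected a memory leak. After investigating with tools, the cause turned out to be that glibc's memory management was not releasing memory back to the OS. To solve this problem, I dug into ptmalloc2.

This post does not cover how to use the brk and mmap system calls. The environment is assumed to be Linux x86-64, and the ptmalloc2 version discussed is the one shipped with glibc 2.17.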

chunk

All memory that ptmalloc2 hands out to the user is represented as chunks; a chunk can be thought of as the unit that carries allocated and freed memory.

#ifndef INTERNAL_SIZE_T
#define INTERNAL_SIZE_T size_t
#endif

/* The corresponding word size */
#define SIZE_SZ                (sizeof(INTERNAL_SIZE_T))

struct malloc_chunk {
  INTERNAL_SIZE_T      prev_size;  /* Size of previous chunk (if free).  */
  INTERNAL_SIZE_T      size;       /* Size in bytes, including overhead. */
// -----------------------------------------------------------------------
  struct malloc_chunk* fd;         /* double links -- used only if free. */
  struct malloc_chunk* bk;

  /* Only used for large blocks: pointer to next larger size.  */
  struct malloc_chunk* fd_nextsize; /* double links -- used only if free. */
  struct malloc_chunk* bk_nextsize;
};

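A chunk consists of the fields above. INTERNAL_SIZE_T is size_t, used to hide differences between platforms. Only 64-bit platforms are discussed here, where it is 8 bytes.

prev_size is the size of the previous chunk; whether it is valid depends on the P flag bit in size.
size holds the size and the attributes of the current chunk; its low 3 bits are the flag bits [A|M|P].
When the chunk is free, fd and bk link it into a list for bookkeeping.
Likewise, fd_nextsize and bk_nextsize are used only in large bins; they point to the next/previous chunk of a different size, which speeds up list traversal.

From the above we can conclude:

When the previous chunk is in use, prev_size is meaningless and its space can be used by the previous chunk.
The low 3 bits of size are flag bits, which implies that the size is always a multiple of 8. A says whether the chunk belongs to a non-main arena (1 yes, 0 no), M whether it was obtained through mmap, and P whether the previous chunk is in use.
There are two kinds of arenas: the main arena and the non-main arenas.
When a chunk is in use, fd, bk, fd_nextsize and bk_nextsize are all meaningless, so the usable memory returned to the user starts right after size.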
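
The flag bits can be separated from the size with a few masks. The following is a minimal sketch that uses the same bit values as ptmalloc2's PREV_INUSE, IS_MMAPPED and NON_MAIN_ARENA; the names FLAG_MASK, size_info and decode_size_field are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Flag bits kept in the low 3 bits of the size field. */
#define PREV_INUSE     0x1  /* P: previous chunk is in use   */
#define IS_MMAPPED     0x2  /* M: chunk was obtained by mmap */
#define NON_MAIN_ARENA 0x4  /* A: chunk is in a non-main arena */
#define FLAG_MASK      (PREV_INUSE | IS_MMAPPED | NON_MAIN_ARENA) /* hypothetical name */

/* Hypothetical helper type: a decoded view of one size field. */
struct size_info {
    size_t size;        /* chunk size with the flag bits masked off */
    int prev_inuse;     /* P */
    int is_mmapped;     /* M */
    int non_main_arena; /* A */
};

static struct size_info decode_size_field(size_t field)
{
    struct size_info info;
    info.size           = field & ~(size_t)FLAG_MASK;
    info.prev_inuse     = (field & PREV_INUSE) != 0;
    info.is_mmapped     = (field & IS_MMAPPED) != 0;
    info.non_main_arena = (field & NON_MAIN_ARENA) != 0;
    return info;
}
```

For example, a size field of 0x91 decodes to a 0x90-byte chunk whose previous neighbour is in use, and 0x25 decodes to a 0x20-byte chunk from a non-main arena whose previous neighbour is in use.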

/* size field is or'ed with PREV_INUSE when previous adjacent chunk in use */
#define PREV_INUSE 0x1
/* extract inuse bit of previous chunk */
#define prev_inuse(p)       ((p)->size & PREV_INUSE)
/* size field is or'ed with IS_MMAPPED if the chunk was obtained with mmap() */
#define IS_MMAPPED 0x2
/* check for mmap()'ed chunk */
#define chunk_is_mmapped(p) ((p)->size & IS_MMAPPED)

/* size field is or'ed with NON_MAIN_ARENA if the chunk was obtained
   from a non-main arena.  This is only set immediately before handing
   the chunk to the user, if necessary.  */
#define NON_MAIN_ARENA 0x4

/* check for chunk from non-main arena */
#define chunk_non_main_arena(p) ((p)->size & NON_MAIN_ARENA)

typedef struct malloc_chunk* mchunkptr;

#define chunk2mem(p)   ((void*)((char*)(p) + 2*SIZE_SZ))
#define mem2chunk(mem) ((mchunkptr)((char*)(mem) - 2*SIZE_SZ))

/* The smallest possible chunk */
#define MIN_CHUNK_SIZE        (offsetof(struct malloc_chunk, fd_nextsize))

#define MALLOC_ALIGN_MASK      (MALLOC_ALIGNMENT - 1)

#define MINSIZE  \
  (unsigned long)(((MIN_CHUNK_SIZE+MALLOC_ALIGN_MASK) & ~MALLOC_ALIGN_MASK))

#define request2size(req)                                         \
  (((req) + SIZE_SZ + MALLOC_ALIGN_MASK < MINSIZE)  ?             \
   MINSIZE :                                                      \
   ((req) + SIZE_SZ + MALLOC_ALIGN_MASK) & ~MALLOC_ALIGN_MASK)

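mem is the start address of the memory the user can actually use. The smallest chunk must be at least 4 * 8 = 32 bytes, because fd_nextsize and bk_nextsize are only needed by large chunks.

request2size converts the size requested by the user into the chunk size that must be allocated. Naively that would be req + prev_size + size = req + 16B, but thanks to memory reuse the prev_size field of the next chunk can be borrowed (for the next chunk, the previous chunk is in use, so its size does not need to be recorded there). The result is therefore req + prev_size + size - prev_size(next chunk) = req + 8B, rounded up to the 16-byte alignment. It also must not be smaller than MINSIZE, so the larger of the two is taken: max(req + 8B rounded up to 16B, 32B).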
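
To make the arithmetic concrete, here is a small sketch of request2size written as a function. It assumes the 64-bit values discussed in this post (SIZE_SZ = 8, 16-byte alignment, MINSIZE = 32) and hard-codes them so the results do not depend on the machine it is compiled on. It also ignores the overflow check that checked_request2size adds.

```c
#include <assert.h>
#include <stddef.h>

/* 64-bit values, fixed on purpose for this sketch. */
enum {
    SIZE_SZ           = 8,
    MALLOC_ALIGNMENT  = 2 * SIZE_SZ,
    MALLOC_ALIGN_MASK = MALLOC_ALIGNMENT - 1,
    MINSIZE           = 32
};

/* Same logic as the request2size macro: add one size word of overhead,
   round up to the alignment, and never go below MINSIZE. */
static size_t request2size(size_t req)
{
    size_t padded = req + SIZE_SZ + MALLOC_ALIGN_MASK;
    if (padded < (size_t)MINSIZE)
        return MINSIZE;
    return padded & ~(size_t)MALLOC_ALIGN_MASK;
}
```

With these values a request of 24 bytes still fits a 32-byte chunk, while 25 bytes already needs a 48-byte chunk, and a 1000-byte request becomes a 1008-byte chunk.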

bin

A bin can be thought of as a bucket that holds chunks. In the world of ptmalloc there are four kinds of bins.

fast bins
unsorted bin
small bins
large bins

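fast bins are a cache of small memory blocks. When a small block is freed it is first put into the fast bins, and the next small allocation looks there first, which saves time.

There is only one unsorted bin. A freed chunk larger than the fast bin threshold, global_max_fast, is put into the unsorted bin.

small bins, as the name suggests, hold the chunks that ptmalloc considers small. Their sizes grow as an arithmetic sequence, every chunk in one bin has the same size, and the chunks are linked into a list through fd and bk.

large bins are similar, except that the chunks in each bin are sorted by size, largest first, and fd_nextsize and bk_nextsize give fast access to the next/previous node of a different size.

#define NBINS 128

There are 62 small bins, 63 large bins and 1 unsorted bin, 62 + 63 + 1 = 126 in total. bin[0] and bin[127] are not used, which gives the 128 bins of NBINS. Note that fast bins are not stored in the same array.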

fast bins

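fast bins are a cache of small memory blocks. Chunks no larger than DEFAULT_MXFAST, 128 bytes on 64-bit, are looked up in the fast bins first on both allocation and free. The threshold can be changed with mallopt, up to a maximum of 160B. The fastbinsY array has NFASTBINS = 10 slots; fastbin_index subtracts 2 because no chunk is smaller than 32B, so the first two index values would otherwise go unused. With the default threshold only the first 7 slots (32B to 128B) are in use, and they cover the same sizes as the 7 smallest small bins.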

#ifndef DEFAULT_MXFAST
#define DEFAULT_MXFAST     (64 * SIZE_SZ / 4)
#endif

typedef struct malloc_chunk* mfastbinptr;
#define fastbin(ar_ptr, idx) ((ar_ptr)->fastbinsY[idx])

/* offset 2 to use otherwise unindexable first 2 bins */
#define fastbin_index(sz) \
  ((((unsigned int)(sz)) >> (SIZE_SZ == 8 ? 4 : 3)) - 2)

/* The maximum fastbin request size we support */
#define MAX_FAST_SIZE     (80 * SIZE_SZ / 4)

#define NFASTBINS  (fastbin_index(request2size(MAX_FAST_SIZE))+1)

#define FASTBIN_CONSOLIDATION_THRESHOLD  (65536UL)

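FASTBIN_CONSOLIDATION_THRESHOLD means: when a freed chunk, after being merged with its neighbouring free chunks, is larger than this value (64KB), all chunks in the fast bins are consolidated and put back into the unsorted bin.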
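
The index arithmetic in the fast bin macros above is easy to check by hand. The sketch below rewrites fastbin_index and NFASTBINS as functions, again assuming the 64-bit values (SIZE_SZ = 8, 16-byte alignment); nfastbins is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>

/* 64-bit values, fixed on purpose for this sketch. */
enum { SIZE_SZ = 8, MALLOC_ALIGN_MASK = 15, MINSIZE = 32 };

#define DEFAULT_MXFAST (64 * SIZE_SZ / 4)  /* 128 on 64-bit */
#define MAX_FAST_SIZE  (80 * SIZE_SZ / 4)  /* 160 on 64-bit */

static size_t request2size(size_t req)
{
    size_t padded = req + SIZE_SZ + MALLOC_ALIGN_MASK;
    if (padded < (size_t)MINSIZE)
        return MINSIZE;
    return padded & ~(size_t)MALLOC_ALIGN_MASK;
}

/* Chunk sizes are multiples of 16 starting at 32, so size >> 4 starts at 2;
   subtracting 2 maps the smallest chunk to slot 0. */
static unsigned fastbin_index(size_t chunk_size)
{
    return (unsigned)(chunk_size >> 4) - 2;
}

/* Number of slots needed to cover the largest configurable fast size. */
static unsigned nfastbins(void)
{
    return fastbin_index(request2size(MAX_FAST_SIZE)) + 1;
}
```

A 32-byte chunk lands in slot 0 and a 128-byte chunk in slot 6, so the default threshold uses 7 slots, while the array itself is sized for 10.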

unsorted bin

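There is only one unsorted bin. Chunks produced by consolidating the fast bins are put here first, and as the name says, the chunks in it are not sorted. If an allocation cannot find a suitable chunk here, the chunks it walked over are moved into the correct small bin or large bin.

#define unsorted_chunks(M) (bin_at(M, 1))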

small bins

On 64-bit platforms there are 62 small bins. The smallest chunk is 32 bytes, the sizes increase in steps of 16B (SMALLBIN_WIDTH), and the largest is 1008B.

#define NSMALLBINS         64
#define SMALLBIN_WIDTH    MALLOC_ALIGNMENT
#define SMALLBIN_CORRECTION (MALLOC_ALIGNMENT > 2 * SIZE_SZ)

#define in_smallbin_range(sz)  \
  ((unsigned long)(sz) < (unsigned long)MIN_LARGE_SIZE)

#define smallbin_index(sz) \
  ((SMALLBIN_WIDTH == 16 ? (((unsigned)(sz)) >> 4) : (((unsigned)(sz)) >> 3)) \
   + SMALLBIN_CORRECTION)

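Plugging numbers into smallbin_index shows that the smallest chunk lands in bin[2]. Starting the small bins at index 2 is a programming convenience: it yields the arithmetic sequence chunk size = 2 * SIZE_SZ * index. bin[1] holds the unsorted bin and bin[0] is unused.

All chunks in a bin have the same size and are linked into a doubly linked list.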
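
The small bin mapping can be verified with a few lines of C. This sketch assumes the 64-bit case, where SMALLBIN_WIDTH is 16 and SMALLBIN_CORRECTION is 0; smallbin_chunk_size is a hypothetical inverse helper added for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* 64-bit values, fixed on purpose for this sketch. */
enum { SMALLBIN_WIDTH = 16, NSMALLBINS = 64, MIN_LARGE_SIZE = NSMALLBINS * SMALLBIN_WIDTH };

static int in_smallbin_range(size_t sz)
{
    return sz < (size_t)MIN_LARGE_SIZE;
}

/* With a 16-byte bin width the index is simply size / 16. */
static unsigned smallbin_index(size_t sz)
{
    return (unsigned)(sz >> 4);
}

/* Hypothetical inverse: the single chunk size stored in a small bin. */
static size_t smallbin_chunk_size(unsigned idx)
{
    return (size_t)idx * SMALLBIN_WIDTH;
}
```

The smallest chunk (32B) maps to bin 2 and the largest small chunk (1008B) to bin 63, which gives the 62 small bins mentioned above; 1024B is already outside the small bin range.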

large bins

large bins follow the small bins. MIN_LARGE_SIZE shows that the smallest large chunk is 1024B. There are 63 of them.

#define MIN_LARGE_SIZE    ((NSMALLBINS - SMALLBIN_CORRECTION) * SMALLBIN_WIDTH)

#define largebin_index_32(sz)                                                \
(((((unsigned long)(sz)) >>  6) <= 38)?  56 + (((unsigned long)(sz)) >>  6): \
 ((((unsigned long)(sz)) >>  9) <= 20)?  91 + (((unsigned long)(sz)) >>  9): \
 ((((unsigned long)(sz)) >> 12) <= 10)? 110 + (((unsigned long)(sz)) >> 12): \
 ((((unsigned long)(sz)) >> 15) <=  4)? 119 + (((unsigned long)(sz)) >> 15): \
 ((((unsigned long)(sz)) >> 18) <=  2)? 124 + (((unsigned long)(sz)) >> 18): \
126)

#define largebin_index_32_big(sz)                                            \
(((((unsigned long)(sz)) >>  6) <= 45)?  49 + (((unsigned long)(sz)) >>  6): \
 ((((unsigned long)(sz)) >>  9) <= 20)?  91 + (((unsigned long)(sz)) >>  9): \
 ((((unsigned long)(sz)) >> 12) <= 10)? 110 + (((unsigned long)(sz)) >> 12): \
 ((((unsigned long)(sz)) >> 15) <=  4)? 119 + (((unsigned long)(sz)) >> 15): \
 ((((unsigned long)(sz)) >> 18) <=  2)? 124 + (((unsigned long)(sz)) >> 18): \
                                        126)
#define largebin_index_64(sz)                                                \
(((((unsigned long)(sz)) >>  6) <= 48)?  48 + (((unsigned long)(sz)) >>  6): \
 ((((unsigned long)(sz)) >>  9) <= 20)?  91 + (((unsigned long)(sz)) >>  9): \
 ((((unsigned long)(sz)) >> 12) <= 10)? 110 + (((unsigned long)(sz)) >> 12): \
 ((((unsigned long)(sz)) >> 15) <=  4)? 119 + (((unsigned long)(sz)) >> 15): \
 ((((unsigned long)(sz)) >> 18) <=  2)? 124 + (((unsigned long)(sz)) >> 18): \
126)

#define largebin_index(sz) \
  (SIZE_SZ == 8 ? largebin_index_64 (sz)                                     \
   : MALLOC_ALIGNMENT == 16 ? largebin_index_32_big (sz)                     \
   : largebin_index_32 (sz))

#define bin_index(sz) \
 ((in_smallbin_range(sz)) ? smallbin_index(sz) : largebin_index(sz))

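Each large bin holds chunks from a range of sizes, sorted from largest to smallest and linked into a doubly linked list. To speed up traversal, fd_nextsize and bk_nextsize additionally link chunks of different sizes to each other.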
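
To see how a size range maps to one large bin, the largebin_index_64 macro can be rewritten as a plain function. This is only a readability sketch of the macro shown above, with the same constants.

```c
#include <assert.h>

/* Function form of largebin_index_64: the bins get wider as sizes grow
   (64B, 512B, 4KB, 32KB, 256KB wide), and everything else falls into bin 126. */
static unsigned largebin_index_64(unsigned long sz)
{
    if ((sz >> 6)  <= 48) return 48  + (unsigned)(sz >> 6);
    if ((sz >> 9)  <= 20) return 91  + (unsigned)(sz >> 9);
    if ((sz >> 12) <= 10) return 110 + (unsigned)(sz >> 12);
    if ((sz >> 15) <= 4)  return 119 + (unsigned)(sz >> 15);
    if ((sz >> 18) <= 2)  return 124 + (unsigned)(sz >> 18);
    return 126;
}
```

Chunks of 1024B up to 1087B share bin 64, 1088B starts bin 65, and the spacing switches from 64B to 512B once size >> 6 exceeds 48.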

typedef struct malloc_chunk* mbinptr;

/* addressing -- note that bin_at(0) does not exist */
#define bin_at(m, i) \
  (mbinptr) (((char *) &((m)->bins[((i) - 1) * 2]))      \
     - offsetof (struct malloc_chunk, fd))

/* analog of ++bin */
#define next_bin(b)  ((mbinptr)((char*)(b) + (sizeof(mchunkptr)<<1)))

/* Reminders about list directionality within bins */
#define first(b)     ((b)->fd)
#define last(b)      ((b)->bk)

/* Take a chunk off a bin list */
#define unlink(P, BK, FD) {                                            \
  FD = P->fd;                                                          \
  BK = P->bk;                                                          \
  if (__builtin_expect (FD->bk != P || BK->fd != P, 0))                \
    malloc_printerr (check_action, "corrupted double-linked list", P); \
  else {                                                               \
    FD->bk = BK;                                                       \
    BK->fd = FD;                                                       \
    if (!in_smallbin_range (P->size)       \
&& __builtin_expect (P->fd_nextsize != NULL, 0)) {       \
      assert (P->fd_nextsize->bk_nextsize == P);       \
      assert (P->bk_nextsize->fd_nextsize == P);       \
      if (FD->fd_nextsize == NULL) {       \
if (P->fd_nextsize == P)       \
  FD->fd_nextsize = FD->bk_nextsize = FD;       \
else {       \
  FD->fd_nextsize = P->fd_nextsize;       \
  FD->bk_nextsize = P->bk_nextsize;       \
  P->fd_nextsize->bk_nextsize = FD;       \
  P->bk_nextsize->fd_nextsize = FD;       \
}       \
      }else {       \
P->fd_nextsize->bk_nextsize = P->bk_nextsize;       \
P->bk_nextsize->fd_nextsize = P->fd_nextsize;       \
      }       \
    }       \
  }                                                                    \
}

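These are some helper macros. You may wonder why there are so many typedefs of struct malloc_chunk*: ptmalloc simply looks at the same memory from different angles, much like a union in C++.

malloc_par

malloc_par can be thought of as a set of global parameters.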

struct malloc_par {
  /* Tunable parameters */
  unsigned long    trim_threshold;
  INTERNAL_SIZE_T  top_pad;
  INTERNAL_SIZE_T  mmap_threshold;
#ifdef PER_THREAD
  INTERNAL_SIZE_T  arena_test;
  INTERNAL_SIZE_T  arena_max;
#endif

  /* Memory map support */
  int              n_mmaps;
  int              n_mmaps_max;
  int              max_n_mmaps;
  /* the mmap_threshold is dynamic, until the user sets
     it manually, at which point we need to disable any
     dynamic behavior. */
  int              no_dyn_threshold;

  /* Statistics */
  INTERNAL_SIZE_T  mmapped_mem;
  /*INTERNAL_SIZE_T  sbrked_mem;*/
  /*INTERNAL_SIZE_T  max_sbrked_mem;*/
  INTERNAL_SIZE_T  max_mmapped_mem;
  INTERNAL_SIZE_T  max_total_mem; /* only kept for NO_THREADS */

  /* First address handed out by MORECORE/sbrk.  */
  char*            sbrk_base;
};

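The most important parameters are:

trim_threshold: the threshold above which free memory at the top of the heap is trimmed and returned to the kernel, 128KB by default
mmap_threshold: the request size from which mmap is used for allocation, 128KB by default
n_mmaps_max: the maximum number of blocks allocated with mmap
no_dyn_threshold: whether dynamic adjustment of the mmap threshold is turned off; dynamic adjustment is on by default

Changing any of the tunables above turns off the dynamic adjustment of the mmap threshold. The mechanism exists to reduce the number of mmap calls, because mmap is far less efficient than brk. For more details, see mallopt(3) in the Linux manual pages.

Memory allocated with mmap does have one advantage: when it is freed it can be returned to the kernel directly. Also, when the virtual address space has holes, mmap is the only way to allocate. During this load test I got the memory released by changing the settings below, but I strongly advise against it: mmap allocates in units of pages, so even a 1B request turns into a request to the kernel for a whole page. It is only suitable for finding out where in ptmalloc2 the memory is being held.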

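#include <malloc.h>

/* Diagnosis only: serve every request with mmap so free() returns it to the kernel. */
mallopt(M_MMAP_THRESHOLD, 0);
mallopt(M_MMAP_MAX, 1000000000);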

malloc_state

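As mentioned earlier, an allocated chunk can come from three places.

Directly from mmap
From the main arena
From a non-main arena

malloc_state is what manages an arena. Non-main arenas exist mainly to reduce lock contention in multithreaded programs. Normally each thread gets its own non-main arena, but locking still takes place even then, which hurts performance. Once the number of arenas reaches a limit derived from the number of CPU cores, no more non-main arenas are created and existing ones are reused instead. Reuse is simple: the arenas are polled in turn until one can be locked.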

#define FASTCHUNKS_BIT        (1U)

#define have_fastchunks(M)     (((M)->flags &  FASTCHUNKS_BIT) == 0)
#define clear_fastchunks(M)    catomic_or (&(M)->flags, FASTCHUNKS_BIT)
#define set_fastchunks(M)      catomic_and (&(M)->flags, ~FASTCHUNKS_BIT)

#define NONCONTIGUOUS_BIT     (2U)

#define contiguous(M)          (((M)->flags &  NONCONTIGUOUS_BIT) == 0)
#define noncontiguous(M)       (((M)->flags &  NONCONTIGUOUS_BIT) != 0)
#define set_noncontiguous(M)   ((M)->flags |=  NONCONTIGUOUS_BIT)
#define set_contiguous(M)      ((M)->flags &= ~NONCONTIGUOUS_BIT)

struct malloc_state {
  /* Serialize access.  */
  mutex_t mutex;

  /* Flags (formerly in max_fast).  */
  int flags;

  /* Fastbins */
  mfastbinptr      fastbinsY[NFASTBINS];

  /* Base of the topmost chunk -- not otherwise kept in a bin */
  mchunkptr        top;

  /* The remainder from the most recent split of a small request */
  mchunkptr        last_remainder;

  /* Normal bins packed as described above */
  mchunkptr        bins[NBINS * 2 - 2];

  /* Bitmap of bins */
  unsigned int     binmap[BINMAPSIZE];

  /* Linked list */
  struct malloc_state *next;

#ifdef PER_THREAD
  /* Linked list for free arenas.  */
  struct malloc_state *next_free;
#endif

  /* Memory allocated from the system in this arena.  */
  INTERNAL_SIZE_T system_mem;
  INTERNAL_SIZE_T max_system_mem;
};

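mutex: for multithreading support.
flags: bit0 indicates whether there are fast bin chunks, bit1 whether the arena can return a contiguous virtual address range. Obviously only the main arena can do that, because below the mmap threshold only the main arena allocates with brk, while non-main arenas always use mmap. There is one case where the main arena uses mmap as well: when glibc is statically linked, brk is disabled, I think out of concern about holes.
fastbinsY: the array that stores the fast bins; NFASTBINS is 10.
top: the top chunk. When none of the bins can satisfy a request, memory is taken from the top chunk. Releasing memory back to the kernel also starts from the top chunk, that is, from the high addresses, similar to a stack.
last_remainder: if the arena's last small chunk allocation split a chunk and left a remainder, the remainder is stored in this pointer.
bins: unsorted bin + small bins + large bins = 1 + 62 + 63 = 126; bin[0] and bin[127] are not used. The array nevertheless has 254 entries, mainly to save memory: the array only provides the storage for one fd/bk pair per bin, and bin_at reinterprets each pair as the link fields of a doubly linked list node.
binmap: each bit indicates whether the corresponding bin has free chunks.
next: links the arenas together.
system_mem: the amount of memory this arena has obtained from the system; it can be inspected with malloc_stats(3).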
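
The bins entry is the least obvious one, so here is a reduced sketch of the bin_at trick. The struct arena below is a hypothetical, trimmed-down stand-in for malloc_state that keeps only top, last_remainder and bins; bin_at is written as a function but does the same pointer arithmetic as the macro. Like the original, it relies on treating a pair of array slots as the fd/bk fields of a chunk header that does not really exist, which works with common compilers but is not strictly portable C.

```c
#include <assert.h>
#include <stddef.h>

#define NBINS 128

struct chunk {
    size_t prev_size;
    size_t size;
    struct chunk *fd;
    struct chunk *bk;
};

/* Hypothetical reduced arena: the two fields before bins occupy the space
   where the fake header of bin 1 would otherwise fall outside the struct. */
struct arena {
    struct chunk *top;
    struct chunk *last_remainder;
    struct chunk *bins[NBINS * 2 - 2];
};

/* Step back by offsetof(fd) so that bin->fd and bin->bk alias
   bins[2*(i-1)] and bins[2*(i-1)+1]. */
static struct chunk *bin_at(struct arena *a, unsigned i)
{
    return (struct chunk *)((char *)&a->bins[(i - 1) * 2]
                            - offsetof(struct chunk, fd));
}

/* An empty bin is a circular list whose head points to itself. */
static void init_bins(struct arena *a)
{
    for (unsigned i = 1; i < NBINS; i++) {
        struct chunk *bin = bin_at(a, i);
        bin->fd = bin->bk = bin;
    }
}

static int bin_is_empty(struct arena *a, unsigned i)
{
    struct chunk *bin = bin_at(a, i);
    return bin->bk == bin;
}
```

Each bin therefore costs two pointers instead of a full chunk header, which is where the array size NBINS * 2 - 2 = 254 comes from.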

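Memory allocation

This section first describes the allocation flow in words and then walks through the code. Inside glibc, malloc is named __libc_malloc.

1. Check whether ptmalloc is uninitialized or a hook function is installed; if so, call the designated function (if another malloc is in use, the call returns here).
2. Find a suitable arena, lock it, and call _int_malloc to allocate from it. If the allocation fails, unlock the arena and try another one. If the number of arenas is still below the limit derived from the CPU core count, the default is to create a new non-main arena, mmap a large block of memory for it, and set up its top chunk.
3. Enter the _int_malloc logic.
4. If chunk_size ≤ 128B, look in the fast bins and return on success; otherwise go to the next step.
5. If chunk_size < 1024B (that is, at most 1008B), allocate from the small bins. Chunks are taken from the tail of the list and returned at the head, so every chunk gets its turn. Return on success; otherwise go to the next step.
6. Reaching this step means either that no suitable memory has been found yet or that chunk_size is a large request. First walk the fast bins, merge adjacent chunks and put them into the unsorted bin. Then search the unsorted bin, moving each chunk that is walked over into its correct bin and marking that bin in binmap. If a fitting chunk is found, it is returned to the user. If the unsorted bin holds a single chunk, that chunk is the last remainder chunk, and the request is for a small bin chunk, the chunk is split and the rest stays where it is. This step runs for at most MAX_ITERS (10000) iterations, so that an overly long unsorted bin does not hurt allocation performance.
7. If that still finds nothing, search the large bins by best fit, moving on to larger bins by walking binmap to find a suitable chunk, which is then split. Return on success; otherwise go to the next step.
8. The only option left is to split the top chunk (memory is also given back starting from the top chunk, which sets up the problem that long-lived memory near the top prevents release and lets memory balloon). If that fails, go to the next step.
9. Now the fast bins come into play again. Freeing into the fast bins does not take the lock but is done lock-free (Compare-and-Swap), so chunks may have appeared there in the meantime. They are consolidated into the unsorted bin again, and the search is repeated from the small bins or large bins. The reason is that in steps 5 and 6, when the small bins had no suitable chunk, the fast bins were consolidated into the unsorted bin and then sorted into the right small bins and large bins, but those bins were not scanned again afterwards. This step makes up for that.
10. Everything is exhausted, so sysmalloc is called to request memory from the kernel. It first checks whether the request exceeds the mmap threshold. If not, the main arena grows the top chunk with brk (when statically linked, brk is disabled and mmap is used instead), while non-main arenas grow with mmap by default. If it does, the memory is mmapped directly for the user and is simply unmapped when freed.
11. Allocation succeeded. Unlock the arena and return.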

#define arena_get(ptr, size) do { \
  arena_lookup(ptr); \
  arena_lock(ptr, size); \
} while(0)

#define arena_lookup(ptr) do { \
  void *vptr = NULL; \
  ptr = (mstate)tsd_getspecific(arena_key, vptr); \
} while(0)

#ifdef PER_THREAD
# define arena_lock(ptr, size) do { \
  if(ptr) \
    (void)mutex_lock(&ptr->mutex); \
  else \
    ptr = arena_get2(ptr, (size), NULL); \
} while(0)
#else
# define arena_lock(ptr, size) do { \
  if(ptr && !mutex_trylock(&ptr->mutex)) { \
    THREAD_STAT(++(ptr->stat_lock_direct)); \
  } else \
    ptr = arena_get2(ptr, (size), NULL); \
} while(0)
#endif

static mstate
internal_function
arena_get2(mstate a_tsd, size_t size, mstate avoid_arena)
{
  mstate a;

#ifdef PER_THREAD
  static size_t narenas_limit;

  a = get_free_list ();
  if (a == NULL)
    {
      /* Nothing immediately available, so generate a new arena.  */
      if (narenas_limit == 0)
{
  if (mp_.arena_max != 0)
    narenas_limit = mp_.arena_max;
  else if (narenas > mp_.arena_test)
    {
      int n  = __get_nprocs ();

      if (n >= 1)
narenas_limit = NARENAS_FROM_NCORES (n);
      else
/* We have no information about the system.  Assume two
   cores.  */
narenas_limit = NARENAS_FROM_NCORES (2);
    }
}
    repeat:;
      size_t n = narenas;
      /* NB: the following depends on the fact that (size_t)0 - 1 is a
 very large number and that the underflow is OK.  If arena_max
 is set the value of arena_test is irrelevant.  If arena_test
 is set but narenas is not yet larger or equal to arena_test
 narenas_limit is 0.  There is no possibility for narenas to
 be too big for the test to always fail since there is not
 enough address space to create that many arenas.  */
      if (__builtin_expect (n <= narenas_limit - 1, 0))
{
  if (catomic_compare_and_exchange_bool_acq (&narenas, n + 1, n))
    goto repeat;
  a = _int_new_arena (size);
  if (__builtin_expect (a == NULL, 0))
    catomic_decrement (&narenas);
}
      else
a = reused_arena (avoid_arena);
    }
#else
  if(!a_tsd)
    a = a_tsd = &main_arena;
  else {
    a = a_tsd->next;
    if(!a) {
      /* This can only happen while initializing the new arena. */
      (void)mutex_lock(&main_arena.mutex);
      THREAD_STAT(++(main_arena.stat_lock_wait));
      return &main_arena;
    }
  }

  /* Check the global, circularly linked list for available arenas. */
  bool retried = false;
 repeat:
  do {
    if(!mutex_trylock(&a->mutex)) {
      if (retried)
(void)mutex_unlock(&list_lock);
      THREAD_STAT(++(a->stat_lock_loop));
      tsd_setspecific(arena_key, (void *)a);
      return a;
    }
    a = a->next;
  } while(a != a_tsd);

  /* If not even the list_lock can be obtained, try again.  This can
     happen during `atfork', or for example on systems where thread
     creation makes it temporarily impossible to obtain _any_
     locks. */
  if(!retried && mutex_trylock(&list_lock)) {
    /* We will block to not run in a busy loop.  */
    (void)mutex_lock(&list_lock);

    /* Since we blocked there might be an arena available now.  */
    retried = true;
    a = a_tsd;
    goto repeat;
  }

  /* Nothing immediately available, so generate a new arena.  */
  a = _int_new_arena(size);
  (void)mutex_unlock(&list_lock);
#endif

  return a;
}

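The code shows that arenas are bound to threads, but that does not mean each thread owns an arena exclusively, so a lock is always taken, which hurts performance in single-threaded and multithreaded programs alike. The maximum number of arenas is derived from the number of CPU cores via NARENAS_FROM_NCORES; if the core count cannot be obtained, two cores are assumed.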
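
The reuse loop in the non-PER_THREAD branch (walk the circular arena list and take the first arena whose lock can be acquired without blocking) can be sketched as follows. This is a simplified model, not glibc's code: the arena struct is hypothetical, a C11 atomic_flag stands in for mutex_t, and arena_trylock returns 1 on success, whereas glibc's mutex_trylock returns 0 on success. When every arena is busy the sketch returns NULL, where glibc would go on to block or create a new arena.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical reduced arena: a try-lock and the circular list link. */
struct arena {
    atomic_flag busy;     /* set while some thread holds the arena */
    struct arena *next;   /* circular list of all arenas */
};

static void arena_init(struct arena *a, struct arena *next)
{
    atomic_flag_clear(&a->busy);
    a->next = next;
}

/* Returns 1 if the lock was acquired, 0 if the arena is already held. */
static int arena_trylock(struct arena *a)
{
    return !atomic_flag_test_and_set(&a->busy);
}

static void arena_unlock(struct arena *a)
{
    atomic_flag_clear(&a->busy);
}

/* Walk the circle once, starting at the thread's last arena, and return
   the first arena that could be locked, or NULL if all are busy. */
static struct arena *pick_arena(struct arena *start)
{
    struct arena *a = start;
    do {
        if (arena_trylock(a))
            return a;
        a = a->next;
    } while (a != start);
    return NULL;
}
```

Starting the walk at the arena the thread used last keeps a thread on the same arena as long as nobody else holds it, which is the effect of storing the arena in thread-specific data.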

void*
__libc_malloc(size_t bytes)
{
  mstate ar_ptr;
  void *victim;

  __malloc_ptr_t (*hook) (size_t, const __malloc_ptr_t)
    = force_reg (__malloc_hook);
  if (__builtin_expect (hook != NULL, 0))
    return (*hook)(bytes, RETURN_ADDRESS (0));

  arena_lookup(ar_ptr);

  arena_lock(ar_ptr, bytes);
  if(!ar_ptr)
    return 0;
  victim = _int_malloc(ar_ptr, bytes);
  if(!victim) {
    ar_ptr = arena_get_retry(ar_ptr, bytes);
    if (__builtin_expect(ar_ptr != NULL, 1)) {
      victim = _int_malloc(ar_ptr, bytes);
      (void)mutex_unlock(&ar_ptr->mutex);
    }
  } else
    (void)mutex_unlock(&ar_ptr->mutex);
  assert(!victim || chunk_is_mmapped(mem2chunk(victim)) ||
 ar_ptr == arena_for_chunk(mem2chunk(victim)));
  return victim;
}

_int_malloc

_int_malloc is one of the most important functions in ptmalloc2: it is the implementation of ptmalloc2's allocation strategy.

static void*
_int_malloc(mstate av, size_t bytes)
{
  INTERNAL_SIZE_T nb;               /* normalized request size */
  unsigned int    idx;              /* associated bin index */
  mbinptr         bin;              /* associated bin */

  mchunkptr       victim;           /* inspected/selected chunk */
  INTERNAL_SIZE_T size;             /* its size */
  int             victim_index;     /* its bin index */

  mchunkptr       remainder;        /* remainder from a split */
  unsigned long   remainder_size;   /* its size */

  unsigned int    block;            /* bit map traverser */
  unsigned int    bit;              /* bit map traverser */
  unsigned int    map;              /* current word of binmap */

  mchunkptr       fwd;              /* misc temp for linking */
  mchunkptr       bck;              /* misc temp for linking */

  const char *errstr = NULL;

  /*
    Convert request size to internal form by adding SIZE_SZ bytes
    overhead plus possibly more to obtain necessary alignment and/or
    to obtain a size of at least MINSIZE, the smallest allocatable
    size. Also, checked_request2size traps (returning 0) request sizes
    that are so large that they wrap around zero when padded and
    aligned.
  */

  checked_request2size(bytes, nb);

  /*
    If the size qualifies as a fastbin, first check corresponding bin.
    This code is safe to execute even if av is not yet initialized, so we
    can try it without checking, which saves some time on this fast path.
  */

  if ((unsigned long)(nb) <= (unsigned long)(get_max_fast ())) {
    idx = fastbin_index(nb);
    mfastbinptr* fb = &fastbin (av, idx);
    mchunkptr pp = *fb;
    do
      {
victim = pp;
if (victim == NULL)
  break;
      }
    while ((pp = catomic_compare_and_exchange_val_acq (fb, victim->fd, victim))
   != victim);
    if (victim != 0) {
      if (__builtin_expect (fastbin_index (chunksize (victim)) != idx, 0))
{
  errstr = "malloc(): memory corruption (fast)";
errout:
  malloc_printerr (check_action, errstr, chunk2mem (victim));
  return NULL;
}
      check_remalloced_chunk(av, victim, nb);
      void *p = chunk2mem(victim);
      if (__builtin_expect (perturb_byte, 0))
alloc_perturb (p, bytes);
      return p;
    }
  }

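The code above is step 4 of the allocation flow, the fast bins. It uses a CAS operation to take a chunk; in other words, freeing into the fast bins does not need a lock.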
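
The CAS loop is easier to follow in isolation. The sketch below models a single fast bin as a lock-free LIFO list using C11 atomics instead of glibc's catomic_compare_and_exchange_val_acq; the node type and the fastbin_push/fastbin_pop names are hypothetical. It leaves out the consistency checks of the real code and does not deal with the ABA problem.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a free chunk: only the forward link matters. */
struct node {
    struct node *fd;
};

typedef _Atomic(struct node *) fastbin_head;

/* Free path: link the node in front of the current head, retry on contention. */
static void fastbin_push(fastbin_head *fb, struct node *p)
{
    struct node *old = atomic_load(fb);
    do {
        p->fd = old;
    } while (!atomic_compare_exchange_weak(fb, &old, p));
}

/* Allocation path: swing the head to head->fd; on failure the CAS stores
   the current head into victim and the loop tries again. */
static struct node *fastbin_pop(fastbin_head *fb)
{
    struct node *victim = atomic_load(fb);
    while (victim != NULL) {
        struct node *next = victim->fd;
        if (atomic_compare_exchange_weak(fb, &victim, next))
            break;
    }
    return victim;
}
```

Because both operations work on the head only, the most recently freed chunk is the first one handed out again.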

if (in_smallbin_range(nb)) {
    idx = smallbin_index(nb);
    bin = bin_at(av,idx);

    if ( (victim = last(bin)) != bin) {
      if (victim == 0) /* initialization check */
malloc_consolidate(av);
      else {
bck = victim->bk;
if (__builtin_expect (bck->fd != victim, 0))
  {
    errstr = "malloc(): smallbin double linked list corrupted";
    goto errout;
  }
set_inuse_bit_at_offset(victim, nb);
bin->bk = bck;
bck->fd = bin;

if (av != &main_arena)
  victim->size |= NON_MAIN_ARENA;
check_malloced_chunk(av, victim, nb);
void *p = chunk2mem(victim);
if (__builtin_expect (perturb_byte, 0))
  alloc_perturb (p, bytes);
return p;
      }
    }
  }

This concludes step 5 of the allocation flow, the small bins.
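
The small bin path takes the chunk at the tail (last(bin)) while chunks are linked in at the head, so a bin behaves like a FIFO queue. A minimal sketch of that list discipline, with a hypothetical chunk type reduced to its two links and hypothetical helper names:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduced chunk: only the list links. */
struct chunk {
    struct chunk *fd;
    struct chunk *bk;
};

/* An empty bin is a circular list whose head points to itself. */
static void bin_init(struct chunk *bin)
{
    bin->fd = bin->bk = bin;
}

/* Chunks are linked in right behind the bin head (the fd side). */
static void bin_push_front(struct chunk *bin, struct chunk *p)
{
    p->fd = bin->fd;
    p->bk = bin;
    bin->fd->bk = p;
    bin->fd = p;
}

/* Allocation takes the chunk at the tail (the bk side), like last(bin). */
static struct chunk *bin_take_last(struct chunk *bin)
{
    struct chunk *victim = bin->bk;
    if (victim == bin)
        return NULL;              /* bin is empty */
    struct chunk *bck = victim->bk;
    if (bck->fd != victim)
        return NULL;              /* corrupted list; glibc reports an error here */
    bin->bk = bck;
    bck->fd = bin;
    return victim;
}
```

The chunk that has been in the bin the longest is handed out first, which is what gives every chunk a chance to be reused.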

else {
    idx = largebin_index(nb);
    if (have_fastchunks(av))
      malloc_consolidate(av);
  }

  /*
    Process recently freed or remaindered chunks, taking one only if
    it is exact fit, or, if this a small request, the chunk is remainder from
    the most recent non-exact fit.  Place other traversed chunks in
    bins.  Note that this step is the only place in any routine where
    chunks are placed in bins.

    The outer loop here is needed because we might not realize until
    near the end of malloc that we should have consolidated, so must
    do so and retry. This happens at most once, and only when we would
    otherwise need to expand memory to service a "small" request.
  */

  for(;;) {

    int iters = 0;
    while ( (victim = unsorted_chunks(av)->bk) != unsorted_chunks(av)) {
      bck = victim->bk;
      if (__builtin_expect (victim->size <= 2 * SIZE_SZ, 0)
  || __builtin_expect (victim->size > av->system_mem, 0))
malloc_printerr (check_action, "malloc(): memory corruption",
 chunk2mem (victim));
      size = chunksize(victim);

      /*
 If a small request, try to use last remainder if it is the
 only chunk in unsorted bin.  This helps promote locality for
 runs of consecutive small requests. This is the only
 exception to best-fit, and applies only when there is
 no exact fit for a small chunk.
      */

      if (in_smallbin_range(nb) &&
  bck == unsorted_chunks(av) &&
  victim == av->last_remainder &&
  (unsigned long)(size) > (unsigned long)(nb + MINSIZE)) {

/* split and reattach remainder */
remainder_size = size - nb;
remainder = chunk_at_offset(victim, nb);
unsorted_chunks(av)->bk = unsorted_chunks(av)->fd = remainder;
av->last_remainder = remainder;
remainder->bk = remainder->fd = unsorted_chunks(av);
if (!in_smallbin_range(remainder_size))
  {
    remainder->fd_nextsize = NULL;
    remainder->bk_nextsize = NULL;
  }

set_head(victim, nb | PREV_INUSE |
 (av != &main_arena ? NON_MAIN_ARENA : 0));
set_head(remainder, remainder_size | PREV_INUSE);
set_foot(remainder, remainder_size);

check_malloced_chunk(av, victim, nb);
void *p = chunk2mem(victim);
if (__builtin_expect (perturb_byte, 0))
  alloc_perturb (p, bytes);
return p;
      }

      /* remove from unsorted list */
      unsorted_chunks(av)->bk = bck;
      bck->fd = unsorted_chunks(av);

      /* Take now instead of binning if exact fit */

      if (size == nb) {
set_inuse_bit_at_offset(victim, size);
if (av != &main_arena)
  victim->size |= NON_MAIN_ARENA;
check_malloced_chunk(av, victim, nb);
void *p = chunk2mem(victim);
if (__builtin_expect (perturb_byte, 0))
  alloc_perturb (p, bytes);
return p;
      }

      /* place chunk in bin */

      if (in_smallbin_range(size)) {
victim_index = smallbin_index(size);
bck = bin_at(av, victim_index);
fwd = bck->fd;
      }
      else {
victim_index = largebin_index(size);
bck = bin_at(av, victim_index);
fwd = bck->fd;

/* maintain large bins in sorted order */
if (fwd != bck) {
  /* Or with inuse bit to speed comparisons */
  size |= PREV_INUSE;
  /* if smaller than smallest, bypass loop below */
  assert((bck->bk->size & NON_MAIN_ARENA) == 0);
  if ((unsigned long)(size) < (unsigned long)(bck->bk->size)) {
    fwd = bck;
    bck = bck->bk;

    victim->fd_nextsize = fwd->fd;
    victim->bk_nextsize = fwd->fd->bk_nextsize;
    fwd->fd->bk_nextsize = victim->bk_nextsize->fd_nextsize = victim;
  }
  else {
    assert((fwd->size & NON_MAIN_ARENA) == 0);
    while ((unsigned long) size < fwd->size)
      {
fwd = fwd->fd_nextsize;
assert((fwd->size & NON_MAIN_ARENA) == 0);
      }

    if ((unsigned long) size == (unsigned long) fwd->size)
      /* Always insert in the second position.  */
      fwd = fwd->fd;
    else
      {
victim->fd_nextsize = fwd;
victim->bk_nextsize = fwd->bk_nextsize;
fwd->bk_nextsize = victim;
victim->bk_nextsize->fd_nextsize = victim;
      }
    bck = fwd->bk;
  }
} else
  victim->fd_nextsize = victim->bk_nextsize = victim;
      }

      mark_bin(av, victim_index);
      victim->bk = bck;
      victim->fd = fwd;
      fwd->bk = victim;
      bck->fd = victim;

#define MAX_ITERS 10000
      if (++iters >= MAX_ITERS)
break;
    }

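This concludes step 6. Getting here means that either the small bins could not satisfy the request or the request was large to begin with. The fast bin chunks are therefore consolidated and put into the unsorted bin, and the unsorted bin is searched while each chunk that is passed over is moved into its correct bin. If an exact fit happens to turn up, it is returned right away. binmap is updated along the way to make later searches easier.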
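
The binmap bookkeeping used by mark_bin and by the later bin scan can be sketched like this. The macros idx2block, idx2bit and BINMAPSHIFT are used but not shown in this post, so the definitions below (32 bits per map word) are my assumption of how they work; next_marked_bin is a hypothetical helper that scans linearly, whereas the real code skips whole empty words at a time.

```c
#include <assert.h>

#define NBINS       128
#define BINMAPSHIFT 5                       /* assumed: 32 bins per map word */
#define BITSPERMAP  (1U << BINMAPSHIFT)
#define BINMAPSIZE  (NBINS / BITSPERMAP)

/* Which word of the bitmap holds the bit for bin i. */
static unsigned idx2block(unsigned i)
{
    return i >> BINMAPSHIFT;
}

/* Which bit inside that word stands for bin i. */
static unsigned idx2bit(unsigned i)
{
    return 1U << (i & (BITSPERMAP - 1));
}

/* Record that bin i may contain free chunks. */
static void mark_bin(unsigned map[BINMAPSIZE], unsigned i)
{
    map[idx2block(i)] |= idx2bit(i);
}

/* Hypothetical helper: smallest marked bin index >= start, or -1 if none. */
static int next_marked_bin(const unsigned map[BINMAPSIZE], unsigned start)
{
    for (unsigned i = start; i < NBINS; i++)
        if (map[idx2block(i)] & idx2bit(i))
            return (int)i;
    return -1;
}
```

A set bit is only a hint: as the scan code shows, a marked bin can turn out to be empty, in which case the bit is cleared and the search continues.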

/*
      If a large request, scan through the chunks of current bin in
      sorted order to find smallest that fits.  Use the skip list for this.
    */

    if (!in_smallbin_range(nb)) {
      bin = bin_at(av, idx);

      /* skip scan if empty or largest chunk is too small */
      if ((victim = first(bin)) != bin &&
  (unsigned long)(victim->size) >= (unsigned long)(nb)) {

victim = victim->bk_nextsize;
while (((unsigned long)(size = chunksize(victim)) <
(unsigned long)(nb)))
  victim = victim->bk_nextsize;

/* Avoid removing the first entry for a size so that the skip
   list does not have to be rerouted.  */
if (victim != last(bin) && victim->size == victim->fd->size)
  victim = victim->fd;

remainder_size = size - nb;
unlink(victim, bck, fwd);

/* Exhaust */
if (remainder_size < MINSIZE)  {
  set_inuse_bit_at_offset(victim, size);
  if (av != &main_arena)
    victim->size |= NON_MAIN_ARENA;
}
/* Split */
else {
  remainder = chunk_at_offset(victim, nb);
  /* We cannot assume the unsorted list is empty and therefore
     have to perform a complete insert here.  */
  bck = unsorted_chunks(av);
  fwd = bck->fd;
  if (__builtin_expect (fwd->bk != bck, 0))
    {
      errstr = "malloc(): corrupted unsorted chunks";
      goto errout;
    }
  remainder->bk = bck;
  remainder->fd = fwd;
  bck->fd = remainder;
  fwd->bk = remainder;
  if (!in_smallbin_range(remainder_size))
    {
      remainder->fd_nextsize = NULL;
      remainder->bk_nextsize = NULL;
    }
  set_head(victim, nb | PREV_INUSE |
   (av != &main_arena ? NON_MAIN_ARENA : 0));
  set_head(remainder, remainder_size | PREV_INUSE);
  set_foot(remainder, remainder_size);
}
check_malloced_chunk(av, victim, nb);
void *p = chunk2mem(victim);
if (__builtin_expect (perturb_byte, 0))
  alloc_perturb (p, bytes);
return p;
      }
    }

    /*
      Search for a chunk by scanning bins, starting with next largest
      bin. This search is strictly by best-fit; i.e., the smallest
      (with ties going to approximately the least recently used) chunk
      that fits is selected.

      The bitmap avoids needing to check that most blocks are nonempty.
      The particular case of skipping all bins during warm-up phases
      when no chunks have been returned yet is faster than it might look.
    */

    ++idx;
    bin = bin_at(av,idx);
    block = idx2block(idx);
    map = av->binmap[block];
    bit = idx2bit(idx);

    for (;;) {

      /* Skip rest of block if there are no more set bits in this block.  */
      if (bit > map || bit == 0) {
do {
  if (++block >= BINMAPSIZE)  /* out of bins */
    goto use_top;
} while ( (map = av->binmap[block]) == 0);

bin = bin_at(av, (block << BINMAPSHIFT));
bit = 1;
      }

      /* Advance to bin with set bit. There must be one. */
      while ((bit & map) == 0) {
bin = next_bin(bin);
bit <<= 1;
assert(bit != 0);
      }

      /* Inspect the bin. It is likely to be non-empty */
      victim = last(bin);

      /*  If a false alarm (empty bin), clear the bit. */
      if (victim == bin) {
av->binmap[block] = map &= ~bit; /* Write through */
bin = next_bin(bin);
bit <<= 1;
      }

      else {
size = chunksize(victim);

/*  We know the first chunk in this bin is big enough to use. */
assert((unsigned long)(size) >= (unsigned long)(nb));

remainder_size = size - nb;

/* unlink */
unlink(victim, bck, fwd);

/* Exhaust */
if (remainder_size < MINSIZE) {
  set_inuse_bit_at_offset(victim, size);
  if (av != &main_arena)
    victim->size |= NON_MAIN_ARENA;
}

/* Split */
else {
  remainder = chunk_at_offset(victim, nb);

  /* We cannot assume the unsorted list is empty and therefore
     have to perform a complete insert here.  */
  bck = unsorted_chunks(av);
  fwd = bck->fd;
  if (__builtin_expect (fwd->bk != bck, 0))
    {
      errstr = "malloc(): corrupted unsorted chunks 2";
      goto errout;
    }
  remainder->bk = bck;
  remainder->fd = fwd;
  bck->fd = remainder;
  fwd->bk = remainder;

  /* advertise as last remainder */
  if (in_smallbin_range(nb))
    av->last_remainder = remainder;
  if (!in_smallbin_range(remainder_size))
    {
      remainder->fd_nextsize = NULL;
      remainder->bk_nextsize = NULL;
    }
  set_head(victim, nb | PREV_INUSE |
   (av != &main_arena ? NON_MAIN_ARENA : 0));
  set_head(remainder, remainder_size | PREV_INUSE);
  set_foot(remainder, remainder_size);
}
check_malloced_chunk(av, victim, nb);
void *p = chunk2mem(victim);
if (__builtin_expect (perturb_byte, 0))
  alloc_perturb (p, bytes);
return p;
      }
    }

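Step 7 searches the larger bins and splits the chunk it finds. If the remainder after splitting would be smaller than MINSIZE, the whole chunk is handed to the user; otherwise the remainder is inserted into the unsorted bin. When the request is in the small-bin range, the remainder is also recorded in last_remainder so that the next small request can be served from it first.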
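The bitmap lookup used by this step can be sketched in isolation. The following is a simplified model, not glibc's code: the constants match glibc's NBINS and BINMAPSHIFT, but mark_bin and next_marked_bin are hypothetical helpers that only show how one bit per bin lets the search skip whole words of empty bins.

```c
#include <assert.h>
#include <stdint.h>

#define NBINS        128
#define BINMAPSHIFT  5                        /* 32 bins per map word */
#define BITSPERMAP   (1U << BINMAPSHIFT)
#define BINMAPSIZE   (NBINS / BITSPERMAP)

/* Same index arithmetic as the idx2block / idx2bit macros. */
static unsigned idx2block(unsigned idx) { return idx >> BINMAPSHIFT; }
static uint32_t idx2bit(unsigned idx)   { return 1U << (idx & (BITSPERMAP - 1)); }

/* Hypothetical helper: record that bin idx may hold free chunks. */
static void mark_bin(uint32_t binmap[BINMAPSIZE], unsigned idx) {
    binmap[idx2block(idx)] |= idx2bit(idx);
}

/* Hypothetical helper: first bin index >= start whose bit is set, or -1. */
static int next_marked_bin(const uint32_t binmap[BINMAPSIZE], unsigned start) {
    for (unsigned idx = start; idx < NBINS; idx++) {
        unsigned block = idx2block(idx);
        if (binmap[block] == 0) {
            /* no bits in this word: jump to its last index, the loop
               increment then moves on to the next word */
            idx = ((block + 1) << BINMAPSHIFT) - 1;
            continue;
        }
        if (binmap[block] & idx2bit(idx))
            return (int)idx;
    }
    return -1;
}
```

The real loop additionally clears a bit when the bin turns out to be empty (the "false alarm" case above), which this sketch leaves out.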

use_top:
    /*
      If large enough, split off the chunk bordering the end of memory
      (held in av->top). Note that this is in accord with the best-fit
      search rule.  In effect, av->top is treated as larger (and thus
      less well fitting) than any other available chunk since it can
      be extended to be as large as necessary (up to system
      limitations).

      We require that av->top always exists (i.e., has size >=
      MINSIZE) after initialization, so if it would otherwise be
      exhausted by current request, it is replenished. (The main
      reason for ensuring it exists is that we may need MINSIZE space
      to put in fenceposts in sysmalloc.)
    */

    victim = av->top;
    size = chunksize(victim);

    if ((unsigned long)(size) >= (unsigned long)(nb + MINSIZE)) {
      remainder_size = size - nb;
      remainder = chunk_at_offset(victim, nb);
      av->top = remainder;
      set_head(victim, nb | PREV_INUSE |
       (av != &main_arena ? NON_MAIN_ARENA : 0));
      set_head(remainder, remainder_size | PREV_INUSE);

      check_malloced_chunk(av, victim, nb);
      void *p = chunk2mem(victim);
      if (__builtin_expect (perturb_byte, 0))
alloc_perturb (p, bytes);
      return p;
    }

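Step 8 splits the request off the top chunk. Memory is also returned to the kernel from the top chunk, shrinking from high addresses toward low ones, so a later allocation that has not been freed prevents earlier allocations, even ones already freed, from being returned to the kernel.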
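The consequence described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions (equally sized blocks, no bins); releasable_blocks is a hypothetical name and not part of glibc.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model: the heap is a row of equally sized blocks, index 0 at the
   bottom and index n-1 next to the top of the heap.  Only the run of
   free blocks that touches the top can be handed back to the kernel. */
static size_t releasable_blocks(const bool in_use[], size_t n) {
    size_t count = 0;
    while (count < n && !in_use[n - 1 - count])
        count++;
    return count;
}
```

With the blocks {free, free, free, used} nothing can be released, because the single live block sits next to the top; with {used, free, free, free} three blocks can be.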

    /* When we are using atomic ops to free fast chunks we can get
       here for all block sizes.  */
    else if (have_fastchunks(av)) {
      malloc_consolidate(av);
      /* restore original bin index */
      if (in_smallbin_range(nb))
        idx = smallbin_index(nb);
      else
        idx = largebin_index(nb);
    }

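Step 9 checks the fast bins again. Chunks are freed into the fast bins without taking the arena lock, so some may have arrived in the meantime; if so, they are consolidated and the search is retried.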

    /*
       Otherwise, relay to handle system-dependent cases
    */
    else {
      void *p = sysmalloc(nb, av);
      if (p != NULL && __builtin_expect (perturb_byte, 0))
        alloc_perturb (p, bytes);
      return p;
    }
  }
}

Step 10: nothing is left, so sysmalloc requests memory from the kernel.

sysmalloc

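The main arena grows its heap with brk and allocates from it; if glibc is statically linked, only mmap can be used, to avoid holes in the address space. Non-main arenas can only use mmap. Before any of that, sysmalloc checks whether the request is at least the mmap threshold, and if so it serves the request directly with mmap. mmap is comparatively slow, however, because the kernel serializes these calls, so ptmalloc2 adjusts the threshold dynamically (128 KB by default, up to 32 MB on 64-bit systems). In other words, to be certain that a request is served by mmap, ask for more than 32 MB.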
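The dynamic threshold can be modelled in a few lines. This is a sketch under stated assumptions: struct malloc_params_model is a hypothetical stand-in for glibc's mp_, the limit on the number of mmapped regions (n_mmaps_max) is ignored, and the maximum is the 64-bit default.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEFAULT_MMAP_THRESHOLD      (128UL * 1024)
#define DEFAULT_MMAP_THRESHOLD_MAX  (32UL * 1024 * 1024)  /* 64-bit default */

struct malloc_params_model {        /* hypothetical stand-in for mp_ */
    size_t mmap_threshold;
    size_t trim_threshold;
    bool   no_dyn_threshold;
};

/* Would a request of nb bytes be served directly by mmap? */
static bool served_by_mmap(const struct malloc_params_model *mp, size_t nb) {
    return nb >= mp->mmap_threshold;
}

/* Adjustment applied when an mmapped chunk of chunk_size bytes is freed:
   the threshold only ever grows, and never beyond the maximum. */
static void on_free_mmapped(struct malloc_params_model *mp, size_t chunk_size) {
    if (!mp->no_dyn_threshold
        && chunk_size > mp->mmap_threshold
        && chunk_size <= DEFAULT_MMAP_THRESHOLD_MAX) {
        mp->mmap_threshold = chunk_size;
        mp->trim_threshold = 2 * chunk_size;
    }
}
```

After a 1 MB mmapped chunk is freed, a later 512 KB request no longer qualifies for mmap in this model, which is why only requests above the maximum are guaranteed to use it.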

static void* sysmalloc(INTERNAL_SIZE_T nb, mstate av)
{
  mchunkptr       old_top;        /* incoming value of av->top */
  INTERNAL_SIZE_T old_size;       /* its size */
  char*           old_end;        /* its end address */

  long            size;           /* arg to first MORECORE or mmap call */
  char*           brk;            /* return value from MORECORE */

  long            correction;     /* arg to 2nd MORECORE call */
  char*           snd_brk;        /* 2nd return val */

  INTERNAL_SIZE_T front_misalign; /* unusable bytes at front of new space */
  INTERNAL_SIZE_T end_misalign;   /* partial page left at end of new space */
  char*           aligned_brk;    /* aligned offset into brk */

  mchunkptr       p;              /* the allocated/returned chunk */
  mchunkptr       remainder;      /* remainder from allocation */
  unsigned long   remainder_size; /* its size */

  unsigned long   sum;            /* for updating stats */

  size_t          pagemask  = GLRO(dl_pagesize) - 1;
  bool            tried_mmap = false;

  /*
    If have mmap, and the request size meets the mmap threshold, and
    the system supports mmap, and there are few enough currently
    allocated mmapped regions, try to directly map this request
    rather than expanding top.
  */

  if ((unsigned long)(nb) >= (unsigned long)(mp_.mmap_threshold) &&
      (mp_.n_mmaps < mp_.n_mmaps_max)) {

    char* mm;             /* return value from mmap call*/

  try_mmap:
    /*
      Round up size to nearest page.  For mmapped chunks, the overhead
      is one SIZE_SZ unit larger than for normal chunks, because there
      is no following chunk whose prev_size field could be used.

      See the front_misalign handling below, for glibc there is no
      need for further alignments unless we have have high alignment.
    */
    if (MALLOC_ALIGNMENT == 2 * SIZE_SZ)
      size = (nb + SIZE_SZ + pagemask) & ~pagemask;
    else
      size = (nb + SIZE_SZ + MALLOC_ALIGN_MASK + pagemask) & ~pagemask;
    tried_mmap = true;

    /* Don't try if size wraps around 0 */
    if ((unsigned long)(size) > (unsigned long)(nb)) {

      mm = (char*)(MMAP(0, size, PROT_READ|PROT_WRITE, 0));

      if (mm != MAP_FAILED) {

/*
  The offset to the start of the mmapped region is stored
  in the prev_size field of the chunk. This allows us to adjust
  returned start address to meet alignment requirements here
  and in memalign(), and still be able to compute proper
  address argument for later munmap in free() and realloc().
*/

if (MALLOC_ALIGNMENT == 2 * SIZE_SZ)
  {
    /* For glibc, chunk2mem increases the address by 2*SIZE_SZ and
       MALLOC_ALIGN_MASK is 2*SIZE_SZ-1.  Each mmap'ed area is page
       aligned and therefore definitely MALLOC_ALIGN_MASK-aligned.  */
    assert (((INTERNAL_SIZE_T)chunk2mem(mm) & MALLOC_ALIGN_MASK) == 0);
    front_misalign = 0;
  }
else
  front_misalign = (INTERNAL_SIZE_T)chunk2mem(mm) & MALLOC_ALIGN_MASK;
if (front_misalign > 0) {
  correction = MALLOC_ALIGNMENT - front_misalign;
  p = (mchunkptr)(mm + correction);
  p->prev_size = correction;
  set_head(p, (size - correction) |IS_MMAPPED);
}
else
  {
    p = (mchunkptr)mm;
    set_head(p, size|IS_MMAPPED);
  }

/* update statistics */

if (++mp_.n_mmaps > mp_.max_n_mmaps)
  mp_.max_n_mmaps = mp_.n_mmaps;

sum = mp_.mmapped_mem += size;
if (sum > (unsigned long)(mp_.max_mmapped_mem))
  mp_.max_mmapped_mem = sum;

check_chunk(av, p);

return chunk2mem(p);
      }
    }
  }

  /* Record incoming configuration of top */

  old_top  = av->top;
  old_size = chunksize(old_top);
  old_end  = (char*)(chunk_at_offset(old_top, old_size));

  brk = snd_brk = (char*)(MORECORE_FAILURE);

  /*
     If not the first time through, we require old_size to be
     at least MINSIZE and to have prev_inuse set.
  */

  assert((old_top == initial_top(av) && old_size == 0) ||
 ((unsigned long) (old_size) >= MINSIZE &&
  prev_inuse(old_top) &&
  ((unsigned long)old_end & pagemask) == 0));

  /* Precondition: not enough current space to satisfy nb request */
  assert((unsigned long)(old_size) < (unsigned long)(nb + MINSIZE));

  if (av != &main_arena) {

    heap_info *old_heap, *heap;
    size_t old_heap_size;

    /* First try to extend the current heap. */
    old_heap = heap_for_ptr(old_top);
    old_heap_size = old_heap->size;
    if ((long) (MINSIZE + nb - old_size) > 0
&& grow_heap(old_heap, MINSIZE + nb - old_size) == 0) {
      av->system_mem += old_heap->size - old_heap_size;
      arena_mem += old_heap->size - old_heap_size;
      set_head(old_top, (((char *)old_heap + old_heap->size) - (char *)old_top)
       | PREV_INUSE);
    }
    else if ((heap = new_heap(nb + (MINSIZE + sizeof(*heap)), mp_.top_pad))) {
      /* Use a newly allocated heap.  */
      heap->ar_ptr = av;
      heap->prev = old_heap;
      av->system_mem += heap->size;
      arena_mem += heap->size;
      /* Set up the new top.  */
      top(av) = chunk_at_offset(heap, sizeof(*heap));
      set_head(top(av), (heap->size - sizeof(*heap)) | PREV_INUSE);

      /* Setup fencepost and free the old top chunk with a multiple of
 MALLOC_ALIGNMENT in size. */
      /* The fencepost takes at least MINSIZE bytes, because it might
 become the top chunk again later.  Note that a footer is set
 up, too, although the chunk is marked in use. */
      old_size = (old_size - MINSIZE) & ~MALLOC_ALIGN_MASK;
      set_head(chunk_at_offset(old_top, old_size + 2*SIZE_SZ), 0|PREV_INUSE);
      if (old_size >= MINSIZE) {
set_head(chunk_at_offset(old_top, old_size), (2*SIZE_SZ)|PREV_INUSE);
set_foot(chunk_at_offset(old_top, old_size), (2*SIZE_SZ));
set_head(old_top, old_size|PREV_INUSE|NON_MAIN_ARENA);
_int_free(av, old_top, 1);
      } else {
set_head(old_top, (old_size + 2*SIZE_SZ)|PREV_INUSE);
set_foot(old_top, (old_size + 2*SIZE_SZ));
      }
    }
    else if (!tried_mmap)
      /* We can at least try to use to mmap memory.  */
      goto try_mmap;

  } else { /* av == main_arena */

  /* Request enough space for nb + pad + overhead */

  size = nb + mp_.top_pad + MINSIZE;

  /*
    If contiguous, we can subtract out existing space that we hope to
    combine with new space. We add it back later only if
    we don't actually get contiguous space.
  */

  if (contiguous(av))
    size -= old_size;

  /*
    Round to a multiple of page size.
    If MORECORE is not contiguous, this ensures that we only call it
    with whole-page arguments.  And if MORECORE is contiguous and
    this is not first time through, this preserves page-alignment of
    previous calls. Otherwise, we correct to page-align below.
  */

  size = (size + pagemask) & ~pagemask;

  /*
    Don't try to call MORECORE if argument is so big as to appear
    negative. Note that since mmap takes size_t arg, it may succeed
    below even if we cannot call MORECORE.
  */

  if (size > 0)
    brk = (char*)(MORECORE(size));

  if (brk != (char*)(MORECORE_FAILURE)) {
    /* Call the `morecore' hook if necessary.  */
    void (*hook) (void) = force_reg (__after_morecore_hook);
    if (__builtin_expect (hook != NULL, 0))
      (*hook) ();
  } else {
  /*
    If have mmap, try using it as a backup when MORECORE fails or
    cannot be used. This is worth doing on systems that have "holes" in
    address space, so sbrk cannot extend to give contiguous space, but
    space is available elsewhere.  Note that we ignore mmap max count
    and threshold limits, since the space will not be used as a
    segregated mmap region.
  */

    /* Cannot merge with old top, so add its size back in */
    if (contiguous(av))
      size = (size + old_size + pagemask) & ~pagemask;

    /* If we are relying on mmap as backup, then use larger units */
    if ((unsigned long)(size) < (unsigned long)(MMAP_AS_MORECORE_SIZE))
      size = MMAP_AS_MORECORE_SIZE;

    /* Don't try if size wraps around 0 */
    if ((unsigned long)(size) > (unsigned long)(nb)) {

      char *mbrk = (char*)(MMAP(0, size, PROT_READ|PROT_WRITE, 0));

      if (mbrk != MAP_FAILED) {

/* We do not need, and cannot use, another sbrk call to find end */
brk = mbrk;
snd_brk = brk + size;

/*
   Record that we no longer have a contiguous sbrk region.
   After the first time mmap is used as backup, we do not
   ever rely on contiguous space since this could incorrectly
   bridge regions.
*/
set_noncontiguous(av);
      }
    }
  }

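Freeing memory

As before, here is a summary of the flow in words first.

Check for a hook function first; if one is set, call it and return.
If the chunk was allocated by mmap, release it with munmap. If the chunk is larger than the current mmap threshold and dynamic threshold adjustment has not been disabled, raise the mmap threshold to the size of this chunk.
Call _int_free to free the memory.
If chunk_size < 128 B and the chunk is not adjacent to the top chunk, put it into the fast bins and return. This path takes no lock; it uses CAS.
Lock the arena. If the previous chunk is free, merge with it.
If the next chunk is the top chunk, merge into the top chunk; otherwise merge with the next chunk as well if it is free, and put the result into the unsorted bin.
If the merged chunk is at least 64 KB, consolidate the fast bins into the unsorted bin. If the top chunk then exceeds the trim threshold (128 KB by default), shrink the heap, that is, return memory to the kernel.
In other words, memory goes back to the kernel only when two conditions hold: the merged chunk is at least 64 KB, and the top chunk exceeds the trim threshold.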
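The two conditions in the last item can be written as a single predicate. A minimal sketch for the main arena only; main_arena_would_trim is a hypothetical name, and the constants are the defaults quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FASTBIN_CONSOLIDATION_THRESHOLD (64UL * 1024)
#define DEFAULT_TRIM_THRESHOLD          (128UL * 1024)

/* True when freeing a chunk whose size after merging is merged_size
   would make the main arena try to shrink the heap. */
static bool main_arena_would_trim(size_t merged_size, size_t top_size,
                                  size_t trim_threshold) {
    if (merged_size < FASTBIN_CONSOLIDATION_THRESHOLD)
        return false;           /* small frees never reach the trim check */
    return top_size >= trim_threshold;
}
```

Freeing many small chunks therefore does not shrink the heap by itself, even if the top chunk is large; a sufficiently large merged chunk has to trigger the check.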

__libc_free

void
__libc_free(void* mem)
{
  mstate ar_ptr;
  mchunkptr p;                          /* chunk corresponding to mem */

  void (*hook) (__malloc_ptr_t, const __malloc_ptr_t)
    = force_reg (__free_hook);
  if (__builtin_expect (hook != NULL, 0)) {
    (*hook)(mem, RETURN_ADDRESS (0));
    return;
  }

  if (mem == 0)                              /* free(0) has no effect */
    return;

  p = mem2chunk(mem);

  if (chunk_is_mmapped(p))                       /* release mmapped memory. */
  {
    /* see if the dynamic brk/mmap threshold needs adjusting */
    if (!mp_.no_dyn_threshold
        && p->size > mp_.mmap_threshold
        && p->size <= DEFAULT_MMAP_THRESHOLD_MAX)
      {
        mp_.mmap_threshold = chunksize (p);
        mp_.trim_threshold = 2 * mp_.mmap_threshold;
      }
    munmap_chunk(p);
    return;
  }

  ar_ptr = arena_for_chunk(p);
  _int_free(ar_ptr, p, 0);
}

_int_free

Only the most important part is shown here: the condition for shrinking the heap.

    if ((unsigned long)(size) >= FASTBIN_CONSOLIDATION_THRESHOLD) {
      if (have_fastchunks(av))
        malloc_consolidate(av);

      if (av == &main_arena) {
#ifndef MORECORE_CANNOT_TRIM
        if ((unsigned long)(chunksize(av->top)) >=
            (unsigned long)(mp_.trim_threshold))
          systrim(mp_.top_pad, av);
#endif
      } else {
        /* Always try heap_trim(), even if the top chunk is not
           large, because the corresponding heap might go away.  */
        heap_info *heap = heap_for_ptr(top(av));

        assert(heap->ar_ptr == av);
        heap_trim(heap, mp_.top_pad);
      }
    }

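Other details

ptmalloc uses mutexes. When a multithreaded process calls fork, only the thread that called fork is copied into the child; the other threads simply vanish. The mutexes in the child are therefore in an unsafe state and can only be reinitialized. See the function ptmalloc_unlock_all2 for this.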

thread_atfork(ptmalloc_lock_all, ptmalloc_unlock_all, ptmalloc_unlock_all2);
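The same pattern is available to application code through POSIX pthread_atfork. The sketch below is an illustration, not glibc's implementation: pool_lock and the handler names are hypothetical, and it assumes a single lock protecting a user-level allocator.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical lock protecting a user-level allocator. */
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

/* Take the lock before fork so that no other thread holds it at the
   moment the address space is copied. */
static void before_fork(void)       { pthread_mutex_lock(&pool_lock); }

/* The parent simply releases the lock again. */
static void after_fork_parent(void) { pthread_mutex_unlock(&pool_lock); }

/* The child contains only the forking thread, so the lock is
   re-created instead of unlocked. */
static void after_fork_child(void)  { pthread_mutex_init(&pool_lock, NULL); }

/* Returns 0 on success, like pthread_atfork itself. */
static int install_fork_handlers(void) {
    return pthread_atfork(before_fork, after_fork_parent, after_fork_child);
}
```

The three handlers correspond to the prepare, parent, and child roles that ptmalloc_lock_all, ptmalloc_unlock_all, and ptmalloc_unlock_all2 play in the call above.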

For completeness, here are the routines that grow, shrink, and delete a heap.

/* Grow a heap.  size is automatically rounded up to a
   multiple of the page size. */

static int
grow_heap(heap_info *h, long diff)
{
  size_t page_mask = GLRO(dl_pagesize) - 1;
  long new_size;

  diff = (diff + page_mask) & ~page_mask;
  new_size = (long)h->size + diff;
  if((unsigned long) new_size > (unsigned long) HEAP_MAX_SIZE)
    return -1;
  if((unsigned long) new_size > h->mprotect_size) {
    if (__mprotect((char *)h + h->mprotect_size,
   (unsigned long) new_size - h->mprotect_size,
   PROT_READ|PROT_WRITE) != 0)
      return -2;
    h->mprotect_size = new_size;
  }

  h->size = new_size;
  return 0;
}

/* Shrink a heap.  */

static int
shrink_heap(heap_info *h, long diff)
{
  long new_size;

  new_size = (long)h->size - diff;
  if(new_size < (long)sizeof(*h))
    return -1;
  /* Try to re-map the extra heap space freshly to save memory, and make it
     inaccessible.  See malloc-sysdep.h to know when this is true.  */
  if (__builtin_expect (check_may_shrink_heap (), 0))
    {
      if((char *)MMAP((char *)h + new_size, diff, PROT_NONE,
      MAP_FIXED) == (char *) MAP_FAILED)
return -2;
      h->mprotect_size = new_size;
    }
  else
    __madvise ((char *)h + new_size, diff, MADV_DONTNEED);
  /*fprintf(stderr, "shrink %p %08lx\n", h, new_size);*/

  h->size = new_size;
  return 0;
}

/* Delete a heap. */

#define delete_heap(heap) \
  do {\
    if ((char *)(heap) + HEAP_MAX_SIZE == aligned_heap_area)\
      aligned_heap_area = NULL;\
    __munmap((char*)(heap), HEAP_MAX_SIZE);\
  } while (0)

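Conclusion

Knowing how ptmalloc2 allocates and frees memory, we can see that it suits short-lived allocations. Long-lived allocations keep pushing the top chunk higher, so memory cannot be returned to the kernel and the footprint balloons. In a game server, a player's in-memory data may well live for an hour or more before it is freed, which is a long lifetime, so the best approach is still to write your own mmap-based memory pool (this contradicts what I wrote in the Lua GC optimization post, where I argued that a memory pool was unnecessary). I stress mmap because brk behaves like a stack: it raises the top of the heap, and if the memory at the top is not freed, nothing below it can be returned to the kernel, which again makes memory balloon.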

Lua GC Optimization

2020-12-19T02:09:20.000Z

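The game project I recently took over is Lua-heavy (the network module is in C++, all the remaining logic is in Lua). Like many game projects that use Lua, it ran into performance problems with Lua's garbage collection and would frequently drop frames while running. So I spent a week writing a module for the Lua VM that speeds up Lua's garbage collection by an order of magnitude.

Approach

I mentioned this approach in an earlier post. There are only a few ways to make garbage collection faster.

Use a memory pool
Create fewer objects
Speed up the collection itself

Use a memory pool

I don't think the first approach is reasonable: modern memory allocators already include memory-pool designs, so a hand-written pool gains little.

Create fewer objects

The second approach is reasonable, because I found many places in the project's code that create closures dynamically.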

function test()
  local fn = function()
    print("test")
  end
  fn()
  fn()
end

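In the example above, every call to test creates a new Closure from the function prototype of fn.

Some may ask: doesn't Proto have a cache pointing to the Closure? There are no upvalues here (so the upvalues are trivially identical), so the Closure should be reused. Unfortunately, once the function finishes nothing references the Closure any more, so it is collected again in the near future.

So writing less code like this reduces the number of objects created.

Speed up the collection itself

The third approach: my idea is to drastically cut the number of objects the collector has to traverse. Because our framework is Lua-heavy, all of our configuration lives in Lua tables. That part never needs to be collected, yet every collection cycle scans and recursively traverses it, which is wasteful. Many global functions in the code never need to be collected either, but they are scanned too. Hence the idea: tag these objects so that they are neither traversed nor swept, which speeds things up considerably.

The principle is simple, but doing it is rather painful; note in particular that upvalues have to be closed manually so that they are preserved.

It is now open source: the LuaJIT-5.3.6 source.

How to use it

That all sounds nice, so how is it used?

Four interfaces are provided for now.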

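nogc("open", Table) -- the whole Table is neither scanned nor swept
nogc("close", Table) -- the inverse of open
nogc("len") -- number of objects currently not managed by the garbage collector
nogc("count") -- total memory of the objects currently not managed by the garbage collector, in KB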

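The elements of the Table may be strings, integers, floats, booleans, tables, and Lua closures.

Weak tables are not supported.

Note that once a Table has been tagged, its contents must not be modified any more: a modification may create a new object that the Lua garbage collector never scans, so the object gets collected and a segmentation fault follows.

Integrating with Lua

First bring in my two files, YGC.c and YGC.h.

Then modify the following files step by step.

lbaselib.c

Add the header, then export the nogc function to Lua.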

#include "YGC.h"

static const luaL_Reg base_funcs[] = {
  ....
  {"nogc", nogc},
  {NULL, NULL}
};

lvm.c

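Add the header, and in the pushclosure function change if (!isblack(p)) to the following code.

The reason: when a closure contained in a tagged Table is executed, it creates Closures dynamically. Such a Closure cannot be tagged, precisely because it is created dynamically, so the cache should not point to it.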

#include "YGC.h"

  /* at the end of pushclosure() */
  if (!isblack(p) && !Y_isnogc(p) && !Y_isnogc(ncl))
    p->cache = ncl;  /* save it on cache for reuse */
}

lstate.h

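Record two auxiliary values in global_State: one is the memory size of the nogc objects, the other is the linked list of objects that do not take part in GC. Both are there to make debugging easier.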

typedef struct global_State {
  ....
  lu_mem Y_GCmemnogc; /* memory size of nogc linked list */
  GCObject *Y_nogc;  /* list of objects not to be traversed or collected */
  ....
} global_State;

lstate.c

Initialize the fields above.

LUA_API lua_State *lua_newstate (lua_Alloc f, void *ud) {
  ....
  g->Y_GCmemnogc = 0;
  g->Y_nogc = NULL;
  ....
}

lgc.c

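This is the last file. Again add the header, then compare and replace the following code.

#include "YGC.h"

Return early for these objects to cut garbage-collection time.

Replace the following code. Put simply, objects that do not need GC are moved off the allgc list.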

static GCObject **sweeplist (lua_State *L, GCObject **p, lu_mem count) {
  global_State *g = G(L);
  int ow = otherwhite(g);
  int white = luaC_white(g);  /* current white */
  while (*p != NULL && count-- > 0) {
    GCObject *curr = *p;
    int marked = curr->marked;
    if (isdeadm(ow, marked)) {  /* is 'curr' dead? */
      *p = curr->next;  /* remove 'curr' from list */
      freeobj(L, curr);  /* erase 'curr' */
    }
    else {  /* change mark to 'white' */
      curr->marked = cast_byte((marked & maskcolors) | white);
      p = &curr->next;  /* go to next element */
    }
  }
  return (*p == NULL) ? NULL : p;
}

Replace it with the following.

static GCObject **sweeplist (lua_State *L, GCObject **p, lu_mem count) {
  global_State *g = G(L);
  int ow = otherwhite(g);
  int white = luaC_white(g);  /* current white */
  while (*p != NULL && count-- > 0) {
    GCObject *curr = *p;
    if (g->gcstate == GCSswpallgc && Y_isnogc(curr)) {
      *p = curr->next;
      curr->next = g->Y_nogc;
      g->Y_nogc = curr;
      continue;
    }
    int marked = curr->marked;
    if (isdeadm(ow, marked)) {  /* is 'curr' dead? */
      *p = curr->next;  /* remove 'curr' from list */
      freeobj(L, curr);  /* erase 'curr' */
    }
    else {  /* change mark to 'white' */
      curr->marked = cast_byte((marked & maskcolors) | white);
      p = &curr->next;  /* go to next element */
    }
  }
  return (*p == NULL) ? NULL : p;
}

The change to propagatemark mainly returns early so that objects that do not need GC are not traversed.

static void propagatemark (global_State *g) {
  lu_mem size;
  GCObject *o = g->gray;
  lua_assert(isgray(o));
  gray2black(o);
  switch (o->tt) {
    case LUA_TTABLE: {
      Table *h = gco2t(o);
      g->gray = h->gclist;  /* remove from 'gray' list */
      size = traversetable(g, h);
      break;
    }
    case LUA_TLCL: {
      LClosure *cl = gco2lcl(o);
      g->gray = cl->gclist;  /* remove from 'gray' list */
      size = traverseLclosure(g, cl);
      break;
    }
    case LUA_TCCL: {
      CClosure *cl = gco2ccl(o);
      g->gray = cl->gclist;  /* remove from 'gray' list */
      size = traverseCclosure(g, cl);
      break;
    }
    case LUA_TTHREAD: {
      lua_State *th = gco2th(o);
      g->gray = th->gclist;  /* remove from 'gray' list */
      linkgclist(th, g->grayagain);  /* insert into 'grayagain' list */
      black2gray(o);
      size = traversethread(g, th);
      break;
    }
    case LUA_TPROTO: {
      Proto *p = gco2p(o);
      g->gray = p->gclist;  /* remove from 'gray' list */
      size = traverseproto(g, p);
      break;
    }
    default: lua_assert(0); return;
  }
  g->GCmemtrav += size;
}

Replace it with:

static void propagatemark (global_State *g) {
  lu_mem size;
  GCObject *o = g->gray;
  lua_assert(isgray(o));
  gray2black(o);
  switch (o->tt) {
    case LUA_TTABLE: {
      Table *h = gco2t(o);
      g->gray = h->gclist;  /* remove from 'gray' list */
      size = (Y_isnogc(o) ? 0 : traversetable(g, h));
      break;
    }
    case LUA_TLCL: {
      LClosure *cl = gco2lcl(o);
      g->gray = cl->gclist;  /* remove from 'gray' list */
      size = (Y_isnogc(cl) ? 0 : traverseLclosure(g, cl));
      break;
    }
    case LUA_TCCL: {
      CClosure *cl = gco2ccl(o);
      g->gray = cl->gclist;  /* remove from 'gray' list */
      size = traverseCclosure(g, cl);
      break;
    }
    case LUA_TTHREAD: {
      lua_State *th = gco2th(o);
      g->gray = th->gclist;  /* remove from 'gray' list */
      linkgclist(th, g->grayagain);  /* insert into 'grayagain' list */
      black2gray(o);
      size = traversethread(g, th);
      break;
    }
    case LUA_TPROTO: {
      Proto *p = gco2p(o);
      g->gray = p->gclist;  /* remove from 'gray' list */
      size = (Y_isnogc(p) ? 0 : traverseproto(g, p));
      break;
    }
    default: lua_assert(0); return;
  }
  g->GCmemtrav += size;
}

That's all. Enjoy the speedup.

Lua Server-Side Hot Update

2020-12-15T04:19:20.000Z

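Game servers use Lua mostly because Lua makes hot updates easy. Generally, a hot update in Lua boils down to the following two statements.

package.loaded[name] = nil
require(name)

The advantage of this kind of hot update is simplicity, but some code then has to be written very carefully. When you see a fragment like the following in the code, it is most likely a compromise made for the sake of hot updates.

Activity.c2sFun = Activity.c2sFun or {};

Also, if the Lua code holds a lot of upvalues, you have to remember to save their existing state, or the original values are lost. For developers, this style of hot update takes a lot of care and effort.

Lua HotFix therefore exists to lift these restrictions, or at least to reduce what developers have to think about, so that hot updates become simpler. I wrote my own because the open-source hot-update solutions online did not fit the project: either the Lua version they support is too old, or they impose too many constraints. The project was already in its middle-to-late stage, too late to introduce new conventions. They also contain quite a few mistakes, which I discuss in the second part of this post.

This post has two parts: the first covers the HotFix implementation, the second covers hot-update mistakes.

HotFix implementation

First, here is the HotFix source.

The file is read into Lua with loadfile, which yields a function, i.e. the chunk. The function's environment is set to a fake environment table, which I call fakeEnv, in which some functions are replaced. Running the chunk then leaves a set of functions and global variables in fakeEnv.

Next comes deciding what may be updated and what may not. Functions are always updated: a hot update that doesn't update logic would be pointless. Data is not updated by default. This default is chosen mainly because of upvalues, and to keep the server running normally first (even if the hot update itself fails). Tables do have to be updated, but only the functions in them; the data in a table also follows the not-updated-by-default rule (because the project keeps state in tables).

At this point the new logic is in place, and data updates have to be considered. Since we cannot tell which data needs updating (configuration, for example), data is not updated by default. To update data, add a __RELOAD function to the module: the user knows best which data must change. Using __RELOAD also makes committing the code easy, since the file modified for the hot update can basically be committed as is.

Example code fragment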

yuerer = {}
yuerer.age = 21

function __RELOAD()
  yuerer.age = 22
end

__RELOAD() -- after the hot update the file can be committed as is

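Using this hot-update module therefore comes with the following constraints:

Only functions are updated; everything else is left alone by default (data inside tables is not updated by default either, because some developers like to keep state in tables)
To update or add anything other than functions, define and implement a __RELOAD function yourself
The userdata and thread types are not supported
Do not keep references to functions or tables (or reset those references explicitly in __RELOAD)
Do not hot-update the metatable of _ENV

How to use

Suppose we have required a module fix1 and now want to update the code in fix1.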

require("fix1")
local HotFix = require("HotFix")
HotFix:UpdateModule("fix1")

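This gives basic hot update for everything except the userdata and thread types.

To update data, write a __RELOAD function in the fix1 module.
There are two main reasons for this:
- data may carry state
- it makes committing the code easy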

function __RELOAD()
    -- do some things
end

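Hot-update mistakes

There are many hot-update solutions online. I pick a few common mistakes to discuss here.

Mistake 1

The function before the update has no _ENV upvalue, so a hot update that relies on patching tables does not take effect.

Suppose a function error1 has a bug and needs a hot update. The two fragments below show the code before and after the update. If I replace the function, does the update take effect?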

--------------- before the hot update
local count = 0
function error1()
  count = count + 1
end

--------------- after the hot update
local count = 0
function error1()
  count = count + 2
end

Obviously not. Look first at the bytecode of error1 before the update: the function body has a single upvalue.

function <test.lua:3,5> (4 instructions at 0x1687da0)
0 params, 2 slots, 1 upvalue, 0 locals, 1 constant, 0 functions
    1   [4]   GETUPVAL   0 0     ; count
    2   [4]   ADD        0 0 -1  ; - 1
    3   [4]   SETUPVAL   0 0     ; count
    4   [5]   RETURN     0 1
constants (1) for 0x1687da0:
    1   1
locals (0) for 0x1687da0:
upvalues (1) for 0x1687da0:
    0   count   1   0

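Because the function body calls no global function and touches no global variable, it naturally has no _ENV environment table as an upvalue, so there is no way to change the address of error1 in the global table through it by writing _ENV["error1"] = error1 (unless you explicitly add rawset(_ENV, "error1", xxxx); most open-source solutions overlook this). As a result, calling error1 after the hot update still runs the pre-update version.

The following version, by contrast, updates successfully.

--------------- before the hot update
local count = 0
function error1()
  count = count + 1
  print()
end

--------------- after the hot update
local count = 0
function error1()
  count = count + 2
  print()
end

Now look at the pre-update bytecode of this version. This time the function has _ENV, so we can replace error1 by rewriting the contents of this upvalue.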

function <test.lua:3,6> (6 instructions at 0x733da0)
0 params, 2 slots, 2 upvalues, 0 locals, 2 constants, 0 functions
    1   [4]   GETUPVAL   0 0     ; count
    2   [4]   ADD        0 0 -1  ; - 1
    3   [4]   SETUPVAL   0 0     ; count
    4   [5]   GETTABUP   0 1 -2  ; _ENV "print"
    5   [5]   CALL       0 1 1
    6   [6]   RETURN     0 1
constants (2) for 0x733da0:
    1   1
    2   "print"
locals (0) for 0x733da0:
upvalues (2) for 0x733da0:
    0   count   1   0
    1   _ENV    0   0

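I see two possible solutions.

Play ostrich. The chance that a module never calls a global function or touches a global variable from start to finish is very low (it is enough for any function in the hot-updated module to do so, which is probably why many open-source solutions never caught this mistake).
When hot-updating a function, check whether it uses _ENV; if it does not, patch the environment table with rawset(_ENV, k, v).

Mistake 2

Repairing upvalues with debug.setupvalue. Guess what the following code prints.

local count = 0
function error2()
  print(count)
end

function error2_another()
  print(count)
end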

error2()

error2_another()

debug.setupvalue(error2, 2, 1000) -- set error2 upvalue:count = 1000

error2()
error2_another()

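The answer: the first two calls print 0, and after debug.setupvalue both functions print 1000. Most hot-update implementations do not consider that one upvalue may be used by more than one function at the same time.

The correct way to hot-update is to associate the upvalues with debug.upvaluejoin.

Mistakes like this come mainly from not understanding the difference between debug.setupvalue and debug.upvaluejoin.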
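The difference can be modelled with plain pointers. The following C sketch is a simplified picture and not Lua's actual data structures: UpvalCell, Closure, set_upvalue, and join_upvalue are hypothetical names, and slots are 0-based here while Lua's debug functions count from 1.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int value; } UpvalCell;        /* one shared variable slot */
typedef struct { UpvalCell *up[2]; } Closure;   /* closure with two upvalue slots */

/* Like debug.setupvalue: writes through the cell, so every closure that
   shares the cell sees the new value. */
static void set_upvalue(Closure *f, int n, int v) {
    f->up[n]->value = v;
}

/* Like debug.upvaluejoin: slot n1 of f1 now refers to the same cell as
   slot n2 of f2. */
static void join_upvalue(Closure *f1, int n1, const Closure *f2, int n2) {
    f1->up[n1] = f2->up[n2];
}
```

Copying a value with set_upvalue leaves the new function with its own private cell, so later changes made through the old functions are not seen; after join_upvalue both functions share one cell, which is what a hot update needs.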

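Mistake 3

Remember debug.setupvalue from the previous mistake? This one concerns it and its sibling debug.getupvalue.

debug.getupvalue (f, up): returns the name and the value of the upvalue with index up of the function f. If there is no upvalue with the given index, it returns nil.

As you can see, its second argument is an index. Now consider the following fragment: can the hot update succeed?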

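--------------- before the hot update
local count = 0
function error3()
  count = count + 1
end

--------------- after the hot update
function error3()
  print(count)
end

The sharp-eyed may already have noticed that fetching an upvalue by index and copying it into the new function goes wrong. Look at the bytecode of each.

In the pre-update function, the first upvalue is count.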

function <test.lua:3,5> (4 instructions at 0x19acda0)
0 params, 2 slots, 1 upvalue, 0 locals, 1 constant, 0 functions
    1   [4]   GETUPVAL   0 0     ; count
    2   [4]   ADD        0 0 -1  ; - 1
    3   [4]   SETUPVAL   0 0     ; count
    4   [5]   RETURN     0 1
constants (1) for 0x19acda0:
    1   1
locals (0) for 0x19acda0:
upvalues (1) for 0x19acda0:
    0   count   1   0

In the post-update function, the first upvalue is _ENV, because here the global function print is referenced first.

function <test.lua:8,10> (4 instructions at 0x19aceb0)
0 params, 2 slots, 2 upvalues, 0 locals, 1 constant, 0 functions
    1   [9]    GETTABUP   0 0 -1  ; _ENV "print"
    2   [9]    GETUPVAL   1 1     ; count
    3   [9]    CALL       0 2 1
    4   [10]   RETURN     0 1
constants (1) for 0x19aceb0:
    1   "print"
locals (0) for 0x19aceb0:
upvalues (2) for 0x19aceb0:
    0   _ENV    0   0
    1   count   1   0

So when hot-updating upvalues, never assume that the order of a function's upvalues is the same before and after the update.
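One way to avoid depending on the order is to match upvalues by name. The sketch below models only the lookup, in C, over the name lists from the two listings above; find_upvalue_by_name is a hypothetical helper, and positions are 0-based as in the luac output.

```c
#include <assert.h>
#include <string.h>

/* Return the 0-based position of `name` in an upvalue name list,
   or -1 if the function has no upvalue with that name. */
static int find_upvalue_by_name(const char *const names[], int n,
                                const char *name) {
    for (int i = 0; i < n; i++)
        if (strcmp(names[i], name) == 0)
            return i;
    return -1;
}
```

For the old function the name list is {"count"} and for the new one {"_ENV", "count"}, so count is found at position 0 in the old function and position 1 in the new one, and the two slots can then be joined regardless of order.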

Mistake 4

Beware of updating the same thing twice. The code below shows a failure case.

--------------- before the hot update
local k = 0
local count = function()
  return k
end

function c1()
  local i = count()
  print("c1", i)
end

function c2()
  local i = count()
  print("c2", i)
end

--------------- after the hot update
local k = 0
local j = 1
local count = function()
  k = k + j
  return k
end

function c1()
  local i = count()
  print("c1", i)
  c2()
end

function c2()
  local i = count()
  print("c2", i)
  c3()
end

function c3()
  print("c3", j)
end

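As said earlier, the hot-update module does nothing more than replace functions and the functions inside tables, while data keeps its old values by default. In my framework, data that needs updating is handled by defining a __RELOAD() function and listing the data there (if you don't remember this part, go back and see why I designed it this way).

Consider first the state before the update: there is no c3 function.

Now the state after the update, which adds a c3 function. When the update reaches c1, it follows the _ENV of c1 and updates c2 (because _ENV is a table and has to be updated). c2 also has _ENV, so c1 is updated again, and so is c3. Since c3 did not exist before, it is simply set into _ENV.

Then c2 is updated. You may wonder: wasn't c2 already updated? That earlier update of c2 was reached through the _ENV of c1; this one comes from the fake environment table. c2 follows its own _ENV and reaches c3 again. On this second update of c3, the function has already been set into the real _ENV, so its upvalues are updated again. But this "old function" was produced in the same hot update, and in some cases this links the upvalues of c3 to the wrong address.

So beware of duplicate updates, whether of tables or of functions.

Summary

Lua hot update is well worth studying. It is fun, it deepens your understanding of how Lua runs, and it lightens the load on the project's developers. Given the time I had, it currently copes with the odd coding styles of the various project teams, and it should be among the more complete open-source implementations. The current project uses no coroutines and is an all-Lua framework, so there is no userdata either; for now this is enough.

The one shortcoming I see: when some other place stores a function as a callback, that reference cannot be updated directly. The usual workaround is to call the function by name as a string (which really just looks up the function's address in _ENV). If this could be built into the Lua VM, hot update would be even closer to perfect.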

Lua 5.3 Design and Implementation (6): Garbage Collection

2020-12-11T12:20:20.000Z

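This series is mainly about the implementation in Lua 5.3, but in this post I want to start with the history of Lua's garbage collection. Only by knowing the history can you see why it is designed this way.

History of Lua GC

Before Lua 5.0

Before Lua 5.0, Lua had no userdata, so garbage collection was simple: without userdata there was no __gc metamethod, and objects with special finalization needed no special handling.

From its early days up to Lua 5.4, the latest version released in 2020, Lua has used mark-and-sweep. Garbage collection algorithms generally fall into two families.

Mark-and-sweep
Reference counting

With reference counting, every object needs an extra piece of memory, and the counts have to be incremented and decremented frequently, especially for values on the stack. The Lua interpreter is kept very simple, and adopting reference counting would also require optimizing the instructions.

Early mark-and-sweep was simple too. Each mark-and-sweep pass had to run to completion in one go, there were only two marks (in use or not in use), and a GC ran every time a new object was created.

Obviously nobody would dare use a collector like that... Running a GC for every object I create? Who could put up with that?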

Lua 5.0

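Lua 5.0 adopted a compromise: a full GC runs when allocated memory exceeds twice the amount in use after the previous GC. This version also supports userdata. A userdata with a __gc metamethod needs special handling: it is taken off the list of all objects, allgc, and put on a separate list, finobj, because its __gc method has to be called before it is freed. (This happens when a metatable is set on the userdata, since a userdata without a metatable cannot have a __gc metamethod. Tables can have a __gc metamethod too.)

It is still a full GC, so still nobody would dare use it; it is only somewhat better, because a GC runs only when allocation exceeds twice the level after the previous GC.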

Lua 5.1

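Lua 5.1 added incremental garbage collection. It is based on tri-color marking: two whites mark the objects to be collected in alternating cycles, gray means not yet fully scanned, and black means "definitely do not collect this".

This has a problem too. Because scanning is incremental, if a table has already been fully scanned and a new object is then added to it, the new object is white by default and would end up being collected.

There are two remedies: a forward barrier, which turns the white object gray, and a back barrier, which turns the black table back to gray.

In Lua's implementation, modifying a table that has been fully scanned turns the table gray and adds it to grayagain, to be scanned in one go at atomic time. A table that has been modified once is likely to be modified again; rather than letting it flip back and forth between black and gray, it is put straight on the grayagain list and dealt with once, in the atomic phase.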
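The two barriers can be sketched on a toy object type. This is a minimal model in C, not Lua's source: Obj, barrier_forward, and barrier_back are hypothetical names, and a single white is used instead of Lua's two.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { WHITE, GRAY, BLACK } Color;

typedef struct Obj {
    Color color;
    struct Obj *next_grayagain;   /* link used only by the back barrier */
} Obj;

/* Forward barrier: a black parent now points to a white child, so the
   child is marked right away and the parent stays black. */
static void barrier_forward(const Obj *parent, Obj *child) {
    if (parent->color == BLACK && child->color == WHITE)
        child->color = GRAY;
}

/* Back barrier: the black parent itself goes back to gray and is queued
   on the grayagain list to be traversed again in the atomic phase. */
static void barrier_back(Obj *parent, Obj **grayagain) {
    if (parent->color == BLACK) {
        parent->color = GRAY;
        parent->next_grayagain = *grayagain;
        *grayagain = parent;
    }
}
```

Once a table has gone through barrier_back it is gray, so further writes to it trigger no more barrier work until it is traversed again, which is the reason given above for preferring the back barrier on tables.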

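If the object is on the stack, the object itself is turned gray instead of the stack, which reduces work on the stack.

The key point is objects with a __gc metamethod. From Lua, only two kinds of value can be given a metatable, table and userdata; from C, any type can have its own metatable.

If a metatable is set on a black object, it is enough to turn the metatable gray.

An object with a __gc metamethod is taken off the allgc list and added to the finobj list at the moment the metatable is set.

In the atomic phase, the finobj list is scanned once. Collectable objects are moved to the tobefnz list and marked gray so that they are not collected. This way, in the final phase, __gc is run once and the object is then linked back into allgc to go through the regular GC flow.

So avoid having too many objects with __gc metamethods: they are all scanned in the atomic phase, which cannot be split up.

Next, weak tables. A weak table keeps a reference from preventing the GC from cleaning an object up. Weak tables are also scanned in the atomic phase, so using as few __gc metamethods and weak tables as possible reduces GC time.

Tables with weak keys and weak values go on the allweak list, tables with weak keys go on the ephemeron list, and tables with weak values go on the weak list.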
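The classification by __mode can be written as a small function. A sketch in C, assuming the mode string is inspected for the characters 'k' and 'v'; WeakList and classify_weak_table are hypothetical names, and the list names follow the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { LIST_NONE, LIST_WEAK, LIST_EPHEMERON, LIST_ALLWEAK } WeakList;

/* Decide which list a table belongs on from its __mode string
   (NULL when the table has no __mode). */
static WeakList classify_weak_table(const char *mode) {
    int weak_key   = mode != NULL && strchr(mode, 'k') != NULL;
    int weak_value = mode != NULL && strchr(mode, 'v') != NULL;
    if (weak_key && weak_value) return LIST_ALLWEAK;
    if (weak_key)               return LIST_EPHEMERON;
    if (weak_value)             return LIST_WEAK;
    return LIST_NONE;           /* ordinary strong table */
}
```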

Lua 5.2

Lua 5.2 introduced generational GC, Lua 5.3 removed it, and Lua 5.4 added it again.

Lua 5.4

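Generational GC is back. Generational GC divides objects into an old generation and a young generation. The old generation holds resident objects that need no GC for a long time. In earlier versions, much of the time went into scanning and marking these "old" objects, so reducing that work improves GC performance.

The young generation holds objects that were just created and may well need cleaning up, such as objects created on the stack. This improves GC efficiency and also keeps memory usage stable (a newly created object that is no longer used is collected right away instead of lingering).

Generational GC looks good so far, but mixing it with incremental GC is painful, because the forward and back barriers cannot be reused. This is not about colors or marks but about old versus young. Suppose a new object is created during an incremental step: should the old object that references it become young, or should the new object become old? That is the problem. If new objects become old, the old generation grows huge and collection stops being effective; turning old objects young has the same drawback.

A third state is therefore needed, similar to the third color in tri-color marking: a touched state. If an old object points to a new object, it is considered touched and is scanned along with the young objects in the next pass.

A young or touched object that is reached in two consecutive scans is probably used often, so it is promoted to the old generation.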
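The promotion rule can be sketched as a small state machine. This is a deliberately simplified model: the Age values and function names are hypothetical, and Lua 5.4's actual set of ages has more intermediate states than shown here.

```c
#include <assert.h>

typedef enum { AGE_NEW, AGE_SURVIVAL, AGE_OLD,
               AGE_TOUCHED1, AGE_TOUCHED2 } Age;

/* Age of an object that is still reachable after one young collection:
   two consecutive survivals lead to the old generation. */
static Age age_after_collection(Age a) {
    switch (a) {
    case AGE_NEW:      return AGE_SURVIVAL;
    case AGE_SURVIVAL: return AGE_OLD;
    case AGE_TOUCHED1: return AGE_TOUCHED2;
    case AGE_TOUCHED2: return AGE_OLD;
    default:           return AGE_OLD;
    }
}

/* An old object that starts pointing to a young object becomes touched,
   so the next young collection scans it as well. */
static Age age_after_write(Age a) {
    return a == AGE_OLD ? AGE_TOUCHED1 : a;
}
```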

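Generational GC lowers the cost of repeatedly scanning and marking old objects and improves GC performance. A full GC still happens at some point, but its cost is lower, since most objects are collected while still young. If a project moves to Lua 5.4, watch out for this full GC; it is best to switch to incremental mode proactively, let one cycle complete, and then switch back to generational mode.

Ideas for optimizing GC

From the above we know that every object is linked onto the allgc list when it is created. In a game server, though, many objects never need GC, configuration tables in particular. (Our current projects are all Lua-heavy and read all configuration in Lua. Even with Lua 5.4, where these objects end up in the old generation, they are still scanned and marked during a full collection.) So we can consider adding a function to table, say table.nogc(), that takes all configuration-table objects off the allgc list and saves O(N) time. That alone is not enough: the mark phase also has to return early when it reaches a table tagged as not needing GC, which cuts the marking time. In theory, the more and the larger the configuration, the more GC time is saved.

Likewise, the same can be done for some global functions, with the aim of reducing the number of objects that have to be scanned and marked.

Without this optimization, almost every time a GC cycle restarts, a long stretch at the beginning is spent marking objects that can never be collected, which is very wasteful.

I will try this idea later and link to the result.

Another idea is to work on allocation and deallocation: put simply, write a memory pool for small allocations. Personally I think the gain is small and performance would be about the same as the native malloc, since modern allocators have been through many iterations.

Before that, let's go through the design and implementation of GC in Lua 5.3.

A tour of the Lua 5.3 GC source

When GC runs is controlled mainly by the following macros, which show that the default is incremental (step-by-step) GC.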

#define luaC_condGC(L,pre,pos) \
{ if (G(L)->GCdebt > 0) { pre; luaC_step(L); pos;}; \
  condchangemem(L,pre,pos); }

/* more often than not, 'pre'/'pos' are empty */
#define luaC_checkGC(L)		luaC_condGC(L,(void)0,(void)0)

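Besides that, you can also call the lua_gc API manually.

For incremental GC, LUA_GCSETPAUSE controls when a new cycle starts. The default is 200, meaning a cycle starts once memory in use reaches twice what it was after the previous collection.

LUA_GCSETSTEPMUL controls the speed of the collector relative to allocation. The default is 200, i.e. the collector works at twice the speed of allocation. This value cannot go below 40 (0.4x); anything smaller is clamped to 40.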
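Both knobs are percentages, which a short sketch makes concrete. The helper names `next_gc_threshold` and `clamp_stepmul` are hypothetical; they model only the arithmetic described above, not the actual bookkeeping inside the collector.

```c
#include <assert.h>
#include <stddef.h>

/* Memory level at which the next cycle starts. With pause = 200 the
   collector waits until memory in use is 2x the estimate left after
   the previous collection. */
static size_t next_gc_threshold(size_t estimate_bytes, int pause_percent) {
  assert(pause_percent >= 0);
  return (estimate_bytes / 100) * (size_t)pause_percent;
}

/* The step multiplier has a floor of 40 (0.4x), as stated in the text. */
static int clamp_stepmul(int stepmul_percent) {
  return stepmul_percent < 40 ? 40 : stepmul_percent;
}
```

For instance, with 1,000,000 bytes left after a collection and a pause of 200, the next cycle would start at 2,000,000 bytes.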

static lu_mem singlestep (lua_State *L) {
  global_State *g = G(L);
  switch (g->gcstate) {
    case GCSpause: {
      g->GCmemtrav = g->strt.size * sizeof(GCObject*);
      restartcollection(g);
      g->gcstate = GCSpropagate;
      return g->GCmemtrav;
    }
    case GCSpropagate: {
      g->GCmemtrav = 0;
      lua_assert(g->gray);
      propagatemark(g);
      if (g->gray == NULL)  /* no more gray objects? */
        g->gcstate = GCSatomic;  /* finish propagate phase */
      return g->GCmemtrav;  /* memory traversed in this step */
    }
    case GCSatomic: {
      lu_mem work;
      propagateall(g);  /* make sure gray list is empty */
      work = atomic(L);  /* work is what was traversed by 'atomic' */
      entersweep(L);
      g->GCestimate = gettotalbytes(g);  /* first estimate */;
      return work;
    }
    case GCSswpallgc: {  /* sweep "regular" objects */
      return sweepstep(L, g, GCSswpfinobj, &g->finobj);// __gc
    }
    case GCSswpfinobj: {  /* sweep objects with finalizers */
      return sweepstep(L, g, GCSswptobefnz, &g->tobefnz);
    }
    case GCSswptobefnz: {  /* sweep objects to be finalized */
      return sweepstep(L, g, GCSswpend, NULL);
    }
    case GCSswpend: {  /* finish sweeps */
      makewhite(g, g->mainthread);  /* sweep main thread */
      checkSizes(L, g);
      g->gcstate = GCScallfin;
      return 0;
    }
    case GCScallfin: {  /* call remaining finalizers */
      if (g->tobefnz && g->gckind != KGC_EMERGENCY) {
        int n = runafewfinalizers(L);
        return (n * GCFINALIZECOST);
      }
      else {  /* emergency mode or no more finalizers */
        g->gcstate = GCSpause;  /* finish collection */
        return 0;
      }
    }
    default: lua_assert(0); return 0;
  }
}

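GCSpause (single step)
- Mark the roots: the main thread, the registry, the metatables stored in the global state, and any tobefnz objects left over from the previous cycle (their __gc metamethod must run first, after which they go back to allgc for regular collection).
GCSpropagate (multiple steps)
- Traverse the gray list, gradually turning gray objects black.
GCSatomic (single step)
- Atomic operation: mainly rescans the grayagain and finobj lists, handles weak-key, weak-value, and all-weak tables, and moves collectable finobj objects to the tobefnz list.
- Enters the sweep phase and flips the current white to the other white.
GCSswpallgc (multiple steps)
- Sweep regular objects.
GCSswpfinobj (multiple steps)
- Sweep finobj objects. This confused me at first: weren't the collectable objects in finobj already moved to tobefnz during the atomic phase? Why sweep finobj again?
- My best guess is that sweepstep is called here only to repaint the objects with the other white.
GCSswptobefnz (multiple steps)
- Sweep tobefnz objects; the same reasoning applies.
GCSswpend (single step)
- Sweeping is done; move on to GCScallfin.
GCScallfin (multiple steps)
- Call the __gc functions of the tobefnz objects, then move them to the allgc list for regular collection.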
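The phase order and the single-step versus multi-step split listed above can be written down as a small table. This is only an illustrative summary: the `Phase` enum, `multi_step`, and `next_phase` are hypothetical names, not part of the Lua source.

```c
#include <assert.h>

/* Phase names mirror the states in singlestep; the table is illustrative. */
typedef enum {
  PH_PAUSE, PH_PROPAGATE, PH_ATOMIC, PH_SWPALLGC,
  PH_SWPFINOBJ, PH_SWPTOBEFNZ, PH_SWPEND, PH_CALLFIN, PH_COUNT
} Phase;

/* 1 if the phase may span many incremental steps, 0 if it completes in one. */
static const int multi_step[PH_COUNT] = {0, 1, 0, 1, 1, 1, 0, 1};

/* Phase entered once the current one has finished its work. */
static Phase next_phase(Phase p) {
  assert(p < PH_COUNT);
  return (Phase)((p + 1) % PH_COUNT);
}
```

The two single-step phases in the middle of the cycle, atomic and sweep-end, are the ones whose cost cannot be spread out, which is why the text warns about work that lands in the atomic phase.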

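There is also a fixedgc list, which holds objects that are never collected; at present these are all short strings. They may still be scanned, which wastes a little time, but there are few of them so it hardly matters.

How are upvalues collected?

Whether an upvalue is subject to mark-and-scan depends on its state. While the value an upvalue refers to is still on the stack, the stack references it, so it gets marked, but it is not reached by scanning through the closure's upvalue.

Once the upvalue is closed (when the function returns and the variable leaves scope), the value is copied into the UpVal that the closure holds. From then on the UpVal is managed by reference counting rather than by mark-and-scan.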

void luaF_close (lua_State *L, StkId level) {
  UpVal *uv;
  while (L->openupval != NULL && (uv = L->openupval)->v >= level) {
    lua_assert(upisopen(uv));
    L->openupval = uv->u.open.next;  /* remove from 'open' list */
    if (uv->refcount == 0)  /* no references? */
      luaM_free(L, uv);  /* free upvalue */
    else {
      setobj(L, &uv->u.value, uv->v);  /* move value to upvalue slot */
      uv->v = &uv->u.value;  /* now current value lives here */
      luaC_upvalbarrier(L, uv);
    }
  }
}

Only when a Closure is collected is the reference count of its UpVal decremented, so whether a closed UpVal is freed depends on the Closure it lives in.

static void freeLclosure (lua_State *L, LClosure *cl) {
  int i;
  for (i = 0; i < cl->nupvalues; i++) {
    UpVal *uv = cl->upvals[i];
    if (uv)
      luaC_upvdeccount(L, uv);
  }
  luaM_freemem(L, cl, sizeLclosure(cl->nupvalues));
}

This means that the memory an upvalue will live in once it is closed has to be set aside in advance, which can be seen in the following code.

struct UpVal {
  TValue *v;  /* points to stack or to its own value */
  lu_mem refcount;  /* reference counter */
  union {
    struct {  /* (when open) */
      UpVal *next;  /* linked list */
      int touched;  /* mark to avoid cycles with dead threads */
    } open;
    TValue value;  /* the value (when closed) */
  } u;
};

LClosure *luaF_newLclosure (lua_State *L, int n) {
  GCObject *o = luaC_newobj(L, LUA_TLCL, sizeLclosure(n));
  LClosure *c = gco2lcl(o);
  c->p = NULL;
  c->nupvalues = cast_byte(n);
  while (n--) c->upvals[n] = NULL;
  return c;
}

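Lua 5.3 Design and Implementation (5): Coroutine

2020-12-09T04:20:20.000Z

Lua coroutines differ from Golang's goroutines: they all run on the same main thread. Personally I don't find them all that useful, since they gain nothing from multiple cores. Still, many people consider them a highlight of Lua, because they let asynchronous code be rewritten in a synchronous style and reduce mental load. Yet many users don't know how Lua handles the case where a coroutine calls a C function, the C function calls back into a Lua function, and that Lua function yields. This post explores the design of Lua coroutines.

The design of Lua coroutines

Imagine you had to design coroutines that run on a single main thread with no scheduler. How would you do it?

You might say it's easy: we already know the CallInfo structure, so just create a new Lua stack, set the new function in its CallInfo, and when resume is executed switch to that stack and run the new instructions.

That would be fine if Lua stayed in its own world and never called C functions. But Lua interacts with its host language, C, and calls C functions. If such a C function calls a Lua function which in turn calls yield, then on the next resume there is no way to continue the C function that had not finished.

The flow looks roughly like this: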

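// Lua's resume is itself exported from C
(1)Lua:resume->[C:resume]
// the C function calls a Lua function, so execution goes through lua_call
->Lua:Function->[C:Function]->[C:lua_call]
// the Lua function runs and then calls yield
->(Lua:Function)->Lua:yield

// later the coroutine is resumed, but there is no way back into C:lua_call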

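One workable approach is to bind every Lua coroutine to an OS thread, which is expensive (though I'd argue that is how you actually benefit from multithreading).

Lua's approach instead is to save the C function and its state and flag the call; on resume it uses that information to return to the point where the C function left off.

lua_pcallk below is a usage example: the second-to-last argument is the context, and the last argument is the continuation, that is, what should run next if this C function is interrupted.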

static int luaB_pcall (lua_State *L) {
  int status;
  luaL_checkany(L, 1);
  lua_pushboolean(L, 1);  /* first result if no errors */
  lua_insert(L, 1);  /* put it in place */
  status = lua_pcallk(L, lua_gettop(L) - 2, LUA_MULTRET, 0, 0, finishpcall);
  return finishpcall(L, status, 0);
}

Coroutine

create

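Let's look at creation first. lua_newthread creates a new coroutine; a coroutine's state is still a lua_State, and the data shared by all coroutines lives in global_State.

lua_xmove moves values between the stacks of two lua_State objects.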

LUA_API void 
lua_xmove (lua_State *from, lua_State *to, int n) {
  int i;
....
  from->top -= n;
  for (i = 0; i < n; i++) {
    setobj2s(to, to->top, from->top + i);
    to->top++;  /* stack already checked by previous 'api_check' */
  }
  lua_unlock(to);
}

static int luaB_cocreate (lua_State *L) {
  lua_State *NL;
  luaL_checktype(L, 1, LUA_TFUNCTION);
  NL = lua_newthread(L);
  lua_pushvalue(L, 1);  /* move function to top */
  lua_xmove(L, NL, 1);  /* move function from L to NL */
  return 1;
}

resume

A created coroutine does not run until resume is called explicitly. The work is done by auxresume, which copies the arguments into the coroutine and calls lua_resume.

static int auxresume (lua_State *L, lua_State *co, int narg) {
  int status;
  if (!lua_checkstack(co, narg)) {
    lua_pushliteral(L, "too many arguments to resume");
    return -1;  /* error flag */
  }
  if (lua_status(co) == LUA_OK && lua_gettop(co) == 0) {
    lua_pushliteral(L, "cannot resume dead coroutine");
    return -1;  /* error flag */
  }

  lua_xmove(L, co, narg);
  status = lua_resume(co, L, narg);
  if (status == LUA_OK || status == LUA_YIELD) {
    int nres = lua_gettop(co);
    if (!lua_checkstack(L, nres + 1)) {
      lua_pop(co, nres);  /* remove results anyway */
      lua_pushliteral(L, "too many results to resume");
      return -1;  /* error flag */
    }
    lua_xmove(co, L, nres);  /* move yielded values */
    return nres;
  }
  else {
    lua_xmove(co, L, 1);  /* move error message */
    return -1;  /* error flag */
  }
}

static int luaB_coresume (lua_State *L) {
  lua_State *co = getco(L);
  int r;
  r = auxresume(L, co, lua_gettop(L) - 1);
  if (r < 0) {
    lua_pushboolean(L, 0);
    lua_insert(L, -2);
    return 2;  /* return false + error message */
  }
  else {
    lua_pushboolean(L, 1);
    lua_insert(L, -(r + 1));
    return r + 1;  /* return true + 'resume' returns */
  }
}

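lua_resume checks various conditions, including the coroutine status and the call depth.

It then sets nny to 0. nny stands for "number of non-yieldable calls" and controls whether yield is allowed. Finally resume is invoked in protected mode.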

LUA_API int lua_resume (lua_State *L, lua_State *from, int nargs) {
  int status;
  unsigned short oldnny = L->nny;  /* save "number of non-yieldable" calls */
  lua_lock(L);
  if (L->status == LUA_OK) {  /* may be starting a coroutine */
    if (L->ci != &L->base_ci)  /* not in base level? */
      return resume_error(L, "cannot resume non-suspended coroutine", nargs);
  }
  else if (L->status != LUA_YIELD)
    return resume_error(L, "cannot resume dead coroutine", nargs);
  L->nCcalls = (from) ? from->nCcalls + 1 : 1;
  if (L->nCcalls >= LUAI_MAXCCALLS)
    return resume_error(L, "C stack overflow", nargs);
  luai_userstateresume(L, nargs);
  L->nny = 0;  /* allow yields */
  api_checknelems(L, (L->status == LUA_OK) ? nargs + 1 : nargs);
  status = luaD_rawrunprotected(L, resume, &nargs);
  if (status == -1)  /* error calling 'lua_resume'? */
    status = LUA_ERRRUN;
  else {  /* continue running after recoverable errors */
    while (errorstatus(status) && recover(L, status)) {
      /* unroll continuation */
      status = luaD_rawrunprotected(L, unroll, &status);
    }
    if (errorstatus(status)) {  /* unrecoverable error? */
      L->status = cast_byte(status);  /* mark thread as 'dead' */
      seterrorobj(L, status, L->top);  /* push error message */
      L->ci->top = L->top;
    }
    else lua_assert(status == L->status);  /* normal end or yield */
  }
  L->nny = oldnny;  /* restore 'nny' */
  L->nCcalls--;
  lua_assert(L->nCcalls == ((from) ? from->nCcalls : 0));
  lua_unlock(L);
  return status;
}

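If the coroutine is just starting, this is as simple as calling a function. If it is coming back from a yield, it necessarily comes from a C function, because a yield called from Lua still goes through a C function.

If the current CallInfo is a Lua function, execution simply continues with the bytecode. This case is really a yield from inside a hook: it looks like a Lua function, but the yield still came from C. Note that the function's stack position was saved in CallInfo->extra before yielding and is restored here, so after resume there is no Lua function to call; the interpreter just carries on with the remaining Lua instructions.

If a Lua function called from C was the one that yielded, check whether the continuation function and context saved earlier exist and, if so, call them (this runs the remainder of the C function).

Finishing that leftover work only restores execution to the right place. The Lua side may still have code to run, so unroll is called.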

static void resume (lua_State *L, void *ud) {
  int n = *(cast(int*, ud));  /* number of arguments */
  StkId firstArg = L->top - n;  /* first argument */
  CallInfo *ci = L->ci;
  if (L->status == LUA_OK) {  /* starting a coroutine? */
    if (!luaD_precall(L, firstArg - 1, LUA_MULTRET))  /* Lua function? */
      luaV_execute(L);  /* call it */
  }
  else {  /* resuming from previous yield */
    lua_assert(L->status == LUA_YIELD);
    L->status = LUA_OK;  /* mark that it is running (again) */
    ci->func = restorestack(L, ci->extra);
    if (isLua(ci))  /* yielded inside a hook? */
      luaV_execute(L);  /* just continue running Lua code */
    else {  /* 'common' yield */
      if (ci->u.c.k != NULL) {  /* does it have a continuation function? */
        lua_unlock(L);
        n = (*ci->u.c.k)(L, LUA_YIELD, ci->u.c.ctx); /* call continuation */
        lua_lock(L);
        api_checknelems(L, n);
        firstArg = L->top - n;  /* yield results come from continuation */
      }
      luaD_poscall(L, ci, firstArg, n);  /* finish 'luaD_precall' */
    }
    unroll(L, NULL);  /* run continuation */
  }
}

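unroll is fairly simple: it keeps executing bytecode, and when execution is stopped in a C function it calls finishCcall to run the rest of that C function.

adjustresults(L, ci->nresults); is there because the interrupted call must have been made through lua_callk or lua_pcallk, where this statement follows the call, so it can be hard-coded here. The luaD_poscall at the end likewise finishes the luaD_precall that was in progress.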

static void finishCcall (lua_State *L, int status) {
  CallInfo *ci = L->ci;
  int n;
  /* must have a continuation and must be able to call it */
  lua_assert(ci->u.c.k != NULL && L->nny == 0);
  /* error status can only happen in a protected call */
  lua_assert((ci->callstatus & CIST_YPCALL) || status == LUA_YIELD);
  if (ci->callstatus & CIST_YPCALL) {  /* was inside a pcall? */
    ci->callstatus &= ~CIST_YPCALL;  /* continuation is also inside it */
    L->errfunc = ci->u.c.old_errfunc;  /* with the same error function */
  }
  /* finish 'lua_callk'/'lua_pcall'; CIST_YPCALL and 'errfunc' already
     handled */
  adjustresults(L, ci->nresults);
  lua_unlock(L);
  n = (*ci->u.c.k)(L, status, ci->u.c.ctx);  /* call continuation function */
  lua_lock(L);
  api_checknelems(L, n);
  luaD_poscall(L, ci, L->top - n, n);  /* finish 'luaD_precall' */
}

static void unroll (lua_State *L, void *ud) {
  if (ud != NULL)  /* error status? */
    finishCcall(L, *(int *)ud);  /* finish 'lua_pcallk' callee */
  while (L->ci != &L->base_ci) {  /* something in the stack */
    if (!isLua(L->ci))  /* C function? */
      finishCcall(L, LUA_YIELD);  /* complete its execution */
    else {  /* Lua function */
      luaV_finishOp(L);  /* finish interrupted instruction */
      luaV_execute(L);  /* execute down to higher C 'boundary' */
    }
  }
}

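lua_resume runs resume in protected mode; if an error is raised, it calls recover to repair the state. As shown here, recover looks for the CallInfo that is running a pcall. An error inside a pcall is meant to be caught, so Lua finds where the pcall is and completes its unfinished business (that is, what luaD_pcall would do after an error).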

static CallInfo *findpcall (lua_State *L) {
  CallInfo *ci;
  for (ci = L->ci; ci != NULL; ci = ci->previous) {  /* search for a pcall */
    if (ci->callstatus & CIST_YPCALL)
      return ci;
  }
  return NULL;  /* no pending pcall */
}

static int recover (lua_State *L, int status) {
  StkId oldtop;
  CallInfo *ci = findpcall(L);
  if (ci == NULL) return 0;  /* no recovery point */
  /* "finish" luaD_pcall */
  oldtop = restorestack(L, ci->extra);
  luaF_close(L, oldtop);
  seterrorobj(L, status, oldtop);
  L->ci = ci;
  L->allowhook = getoah(ci->callstatus);  /* restore original 'allowhook' */
  L->nny = 0;  /* should be zero to be yieldable */
  luaD_shrinkstack(L);
  L->errfunc = ci->u.c.old_errfunc;
  return 1;  /* continue running the coroutine */
}

yield

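yield gives up the CPU so that other coroutines get a chance. With the groundwork above it is easy to follow: it saves the C continuation function and the context that should run on the next resume.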

LUA_API int lua_yieldk (lua_State *L, int nresults, lua_KContext ctx,
                        lua_KFunction k) {
  CallInfo *ci = L->ci;
  luai_userstateyield(L, nresults);
  lua_lock(L);
  api_checknelems(L, nresults);
  if (L->nny > 0) {
    if (L != G(L)->mainthread)
      luaG_runerror(L, "attempt to yield across a C-call boundary");
    else
      luaG_runerror(L, "attempt to yield from outside a coroutine");
  }
  L->status = LUA_YIELD;
  ci->extra = savestack(L, ci->func);  /* save current 'func' */
  if (isLua(ci)) {  /* inside a hook? */
    api_check(L, k == NULL, "hooks cannot continue after yielding");
  }
  else {
    if ((ci->u.c.k = k) != NULL)  /* is there a continuation? */
      ci->u.c.ctx = ctx;  /* save context */
    ci->func = L->top - nresults - 1;  /* protect stack below results */
    luaD_throw(L, LUA_YIELD);
  }
  lua_assert(ci->callstatus & CIST_HOOKED);  /* must be inside a hook */
  lua_unlock(L);
  return 0;  /* return to 'luaD_hook' */
}

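Lua 5.3 Design and Implementation (4): Closure and Upvalues

2020-12-05T14:29:20.000Z

For C/C++ programmers a Closure can simply be thought of as a function. The notion of upvalues makes it a bit harder to grasp, but every function in Lua is in fact a closure, including the main function from the execution flow described in the first post, Lua 5.3 Design and Implementation (1): How Lua Runs.

The terms function and closure are used interchangeably in this post; tell them apart by whether upvalues are involved.

Closure

A closure is a function prototype (Proto) combined with upvalues (UpVal).

Proto holds all the information needed for execution. It was covered in the first post, so most of it is skipped here.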

typedef struct Proto {
  CommonHeader;
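  lu_byte numparams;  // number of fixed parameters
  lu_byte is_vararg;  // whether the function takes variable arguments
  lu_byte maxstacksize;  // number of registers (simulated on the stack)
  int sizeupvalues;  // number of upvalues

  int sizek;  /* size of 'k' */
  int sizecode;
  int sizelineinfo;
  int sizep;  /* size of 'p' */
  int sizelocvars;

  int linedefined;  // first line of the definition
  int lastlinedefined;  // last line of the definition
  TString  *source; // source file name

  TValue *k;  // constant table
  Instruction *code;  // instruction array
  struct Proto **p;  // prototypes of nested functions
  int *lineinfo;  // line table: maps each instruction to a source line
  LocVar *locvars;  // local variable table
  Upvaldesc *upvalues;  // upvalue descriptions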

  struct LClosure *cache;  /* last-created closure with this prototype */
  GCObject *gclist;
} Proto;

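What we care about more is upvalues.

Upvalues

An upvalue consists mainly of a TValue pointer and a union. There is one concept to understand here:

the open state of an upvalue.

open: an upvalue is open when its original value is still on the data stack (so if the value is a collectable object, it is managed by mark-and-scan).
closed: an upvalue is closed when its value is no longer on the stack. Having left scope, the value is copied into UpVal.u.value, and the UpVal is managed by reference counting rather than by the collector.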

struct UpVal {
  TValue *v;  /* points to stack or to its own value */
  lu_mem refcount;  /* reference counter */
  union {
    struct {  /* (when open) */
      UpVal *next;  /* linked list */
      int touched;  /* mark to avoid cycles with dead threads */
    } open;
    TValue value;  /* the value (when closed) */
  } u;
};

#define upisopen(up)	((up)->v != &(up)->u.value)

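So while an upvalue is open, v points to the address of the original value on the stack. Once it is closed, the value is stored inside the UpVal structure itself.

This is also why the following code works correctly.

function Counter()
  local t = 0
  return function()
    t = t + 1
    return t
  end
end

When this function is returned, t is no longer on the stack, so its value is stored in the UpVal structure and travels with the function.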
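The open-to-closed switch can be sketched in isolation. In this toy model a value is just an `int`, and `ToyUpVal`, `toy_isopen`, and `toy_close` are hypothetical names; the pointer trick is the same one that `upisopen` relies on.

```c
#include <assert.h>

/* Simplified stand-in: a "TValue" is just an int here. */
typedef struct ToyUpVal {
  int *v;     /* points into the stack while open, to 'value' once closed */
  int value;  /* storage used after closing */
} ToyUpVal;

/* Open means the pointer does not refer to the upvalue's own storage. */
static int toy_isopen(const ToyUpVal *uv) {
  return uv->v != &uv->value;
}

/* Close: copy the stack value into the upvalue and redirect the pointer. */
static void toy_close(ToyUpVal *uv) {
  assert(toy_isopen(uv));
  uv->value = *uv->v;
  uv->v = &uv->value;
}
```

Readers and writers always go through `v`, so code using the upvalue does not need to know whether it is open or closed.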

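The open struct inside the union is used while the UpVal is open: it links all open UpVals together so they can be found later. touched is there so that upvalues still pointing at stack objects are handled correctly during collection: the atomic phase has a remarkupvals function that re-marks such upvalues.

Closure

Whether for C functions or Lua functions, the upvalues are kept separate from the function itself yet wrapped together with it in one structure.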

typedef struct CClosure {
  ClosureHeader;
  lua_CFunction f;
  TValue upvalue[1];  /* list of upvalues */
} CClosure;

typedef struct LClosure {
  ClosureHeader;
  struct Proto *p;
  UpVal *upvals[1];  /* list of upvalues */
} LClosure;

typedef union Closure {
  CClosure c;
  LClosure l;
} Closure;

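A C function quite often has no upvalues at all, so Lua also offers the light C function: the function pointer is stored directly in the stack slot, and its lifetime is managed by the host.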

LUA_API void lua_pushcclosure (lua_State *L, lua_CFunction fn, int n) {
  lua_lock(L);
  if (n == 0) {
    setfvalue(L->top, fn);
    api_incr_top(L);
  }
  else {
    CClosure *cl;
    api_checknelems(L, n);
    api_check(L, n <= MAXUPVAL, "upvalue index too large");
    cl = luaF_newCclosure(L, n);
    cl->f = fn;
    L->top -= n;
    while (n--) {
      setobj2n(L, &cl->upvalue[n], L->top + n);
      /* does not need barrier because closure is white */
    }
    setclCvalue(L, L->top, cl);
    api_incr_top(L);
    luaC_checkGC(L);
  }
  lua_unlock(L);
}

#define setfvalue(obj,x) \
  { TValue *io=(obj); val_(io).f=(x); settt_(io, LUA_TLCF); }

Lua closures are more complex.

First the closure is created, and only then are its upvalues set.

LClosure *luaF_newLclosure (lua_State *L, int n) {
  GCObject *o = luaC_newobj(L, LUA_TLCL, sizeLclosure(n));
  LClosure *c = gco2lcl(o);
  c->p = NULL;
  c->nupvalues = cast_byte(n);
  while (n--) c->upvals[n] = NULL;
  return c;
}

static void pushclosure (lua_State *L, Proto *p, UpVal **encup, StkId base,
                         StkId ra) {
  int nup = p->sizeupvalues;
  Upvaldesc *uv = p->upvalues;
  int i;
  LClosure *ncl = luaF_newLclosure(L, nup);
  ncl->p = p;
  setclLvalue(L, ra, ncl);  /* anchor new closure in stack */
  for (i = 0; i < nup; i++) {  /* fill in its upvalues */
    if (uv[i].instack)  /* upvalue refers to local variable? */
      ncl->upvals[i] = luaF_findupval(L, base + uv[i].idx);
    else  /* get upvalue from enclosing function */
      ncl->upvals[i] = encup[uv[i].idx];
    ncl->upvals[i]->refcount++;
    /* new closure is white, so we do not need a barrier here */
  }
  if (!isblack(p))  /* cache will not break GC invariant? */
    p->cache = ncl;  /* save it on cache for reuse */
}

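Whether an upvalue is on the stack is recorded in the instack field of Upvaldesc. (It is normally written when the code is compiled, by checking whether the name is a local of the enclosing function, or read back from a dumped chunk.) Being on the stack here does not mean the upvalue is open. If it is not on the stack, it is taken from the upvalues of the enclosing function.

Finally the closure is stored in the Proto's cache. The next time a Closure is built from this Proto, the cached Closure is checked first: if its upvalues match exactly, it is reused. It is therefore best not to write code that keeps generating closures dynamically, to avoid the performance cost.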

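// built during compilation: if the variable is a local, it is on the stack
f->upvalues[fs->nups].instack = (v->k == VLOCAL);
// read from a dumped chunk
f->upvalues[i].instack = LoadByte(S);

If it is on the stack, luaF_findupval is called.

This function searches the openupval list and creates a new UpVal if none is found.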

UpVal *luaF_findupval (lua_State *L, StkId level) {
  UpVal **pp = &L->openupval;
  UpVal *p;
  UpVal *uv;
  lua_assert(isintwups(L) || L->openupval == NULL);
  while (*pp != NULL && (p = *pp)->v >= level) {
    lua_assert(upisopen(p));
    if (p->v == level)  /* found a corresponding upvalue? */
      return p;  /* return it */
    pp = &p->u.open.next;
  }
  /* not found: create a new upvalue */
  uv = luaM_new(L, UpVal);
  uv->refcount = 0;
  uv->u.open.next = *pp;  /* link it to list of open upvalues */
  uv->u.open.touched = 1;
  *pp = uv;
  uv->v = level;  /* current value lives in the stack */
  if (!isintwups(L)) {  /* thread not in list of threads with upvalues? */
    L->twups = G(L)->twups;  /* link it to the list */
    G(L)->twups = L;
  }
  return uv;
}

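Exercise

If you can answer the following questions, you have fully understood this section.

Given the code below:

1. How many upvalues are there?
2. How many copies of the upvalue exist in memory?
3. How many times is the upvalue copied on return?

local _table = {}

function _table.test1()
  _table.i = 10
end

function _table.test2()
  _table.j = 100
end

Let's look at the bytecode first.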

[root@localhost src]# luac -l -l main.lua 

main <main.lua:0,0> (6 instructions at 0x2216a20)
0+ params, 2 slots, 1 upvalue, 1 local, 2 constants, 2 functions
  1  [1]  NEWTABLE  0 0 0
  2  [5]  CLOSURE   1 0     ; 0x2216cc0
  3  [3]  SETTABLE  0 -1 1  ; "test1" -
  4  [9]  CLOSURE   1 1     ; 0x2216ed0
  5  [7]  SETTABLE  0 -2 1  ; "test2" -
  6  [9]  RETURN    0 1
constants (2) for 0x2216a20:
  1  "test1"
  2  "test2"
locals (1) for 0x2216a20:
  0  _table  2  7
upvalues (1) for 0x2216a20:
  0  _ENV  1  0

function <main.lua:3,5> (2 instructions at 0x2216cc0)
0 params, 2 slots, 1 upvalue, 0 locals, 2 constants, 0 functions
  1  [4]  SETTABUP  0 -1 -2  ; _table "i" 10
  2  [5]  RETURN    0 1
constants (2) for 0x2216cc0:
  1  "i"
  2  10
locals (0) for 0x2216cc0:
upvalues (1) for 0x2216cc0:
  0  _table  1  0

function <main.lua:7,9> (2 instructions at 0x2216ed0)
0 params, 2 slots, 1 upvalue, 0 locals, 2 constants, 0 functions
  1  [8]  SETTABUP  0 -1 -2  ; _table "j" 100
  2  [9]  RETURN    0 1
constants (2) for 0x2216ed0:
  1  "j"
  2  100
locals (0) for 0x2216ed0:
upvalues (1) for 0x2216ed0:
  0  _table  1  0

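1. Each of the two functions has one upvalue, and it refers to _table.
2. There is only one copy of the upvalue in memory. The first luaF_findupval finds nothing in openupval and creates a new UpVal; the second pushclosure also reaches luaF_findupval, which now finds the existing one and points to it.
3. From question 2 we know that the two closures point to the very same upvalue, so when the chunk returns the value is copied only once, into the UpVal that was created for the first closure and is shared by the second.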
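The sharing in answer 2 can be reproduced with a toy version of the open-upvalue list. This is a sketch under simplifying assumptions: stack slots are plain `int`s, and `ToyUp` and `toy_findupval` are hypothetical names that follow the search-then-create logic of luaF_findupval shown earlier.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct ToyUp {
  int *v;              /* stack slot this open upvalue refers to */
  int refcount;        /* number of closures referencing it */
  struct ToyUp *next;  /* next open upvalue (lower stack address) */
} ToyUp;

/* Find the open upvalue for 'level', or create and link a new one.
   The list is kept sorted by descending stack address. */
static ToyUp *toy_findupval(ToyUp **open, int *level) {
  ToyUp **pp = open;
  ToyUp *uv;
  while (*pp != NULL && (*pp)->v >= level) {
    if ((*pp)->v == level)
      return *pp;  /* an upvalue for this slot already exists: share it */
    pp = &(*pp)->next;
  }
  uv = malloc(sizeof *uv);  /* not found: create a new upvalue */
  assert(uv != NULL);
  uv->v = level;
  uv->refcount = 0;
  uv->next = *pp;
  *pp = uv;
  return uv;
}
```

Two closures that capture the same local therefore receive the same pointer, and closing that one upvalue later updates what both of them see.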

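Lua 5.3 Design and Implementation (3): Long and Short Strings

2020-10-28T05:12:20.000Z

The previous post covered the design and implementation of Table and metamethods, mentioned that Lua caches the string names of metamethods, and noted that Lua strings come in long and short variants. This post looks at how long and short strings are designed, and why there are two kinds.

The TString structure

As shown below, a string records its hash internally. A string cannot be modified once created, so to save memory Lua reuses identical strings. Comparing byte by byte is too slow, so the string is hashed up front and the result is stored in its hash field.

The actual characters are appended right after the TString header.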

typedef struct TString {
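  CommonHeader;  // GC header
  // short string: 0 = managed by the GC, 1 = never collected
  // long string: 0 = not hashed yet, 1 = hash already computed
  lu_byte extra;
  lu_byte shrlen; // length of a short string; meaningless for long strings
  unsigned int hash; // hash of the string
  union {
    size_t lnglen; // length of a long string; unused for short strings
    struct TString* hnext; // short strings: next string in the same hash bucket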
  } u;
} TString;

Each short string exists only once globally; the interpreter stores them in the stringtable structure. String hashes are computed using the seed in global_State.

typedef struct stringtable {
  TString** hash; // array of hash buckets
  int nuse; // number of short strings stored in the table
  int size; // number of buckets
} stringtable;

typedef struct global_State {
....
  stringtable strt; // hash table for strings
  unsigned int seed; // random seed used for string hashing
// string cache for C strings that are frequently converted to TString
  TString *strcache[STRCACHE_N][STRCACHE_M];
....
 
} global_State;

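Why are strings split into long and short?

LUAI_MAXSHORTLEN is the boundary between short and long strings; it defaults to 40 bytes.

#define LUAI_MAXSHORTLEN	40

#define LUA_TSTRING		4

/* Variant tags for strings */
#define LUA_TSHRSTR	(LUA_TSTRING | (0 << 4))  /* short strings */
#define LUA_TLNGSTR	(LUA_TSTRING | (1 << 4))  /* long strings */

Earlier versions of Lua (up to 5.2.0, shown below) did not distinguish long strings from short ones. The main reason for the split is the hash DoS attack.

Strings in Lua are hashed and put into strt. On a collision the simplest form of separate chaining is used: strings that land in the same bucket are linked into a list.

In Lua 5.2.0 the hashing rule was simple enough that anyone who had read the source could construct large numbers of strings with the same hash. The interpreter then had to compare characters against every string on the chain, until string comparison exhausted the CPU. That is why, from Lua 5.2.1 on, the seed in global_State is used to randomize the hash.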

TString *luaS_newlstr (lua_State *L, const char *str, size_t l) {
   GCObject *o; 
   unsigned int h = cast(unsigned int, l);  /* seed */
   size_t step = (l>>5)+1;  /* if string is too long, don't hash all its chars */
   size_t l1; 
   for (l1=l; l1>=step; l1-=step)  /* compute hash */
     h = h ^ ((h<<5)+(h>>2)+cast(unsigned char, str[l1-1]));
   for (o = G(L)->strt.hash[lmod(h, G(L)->strt.size)];
        o != NULL;
        o = gch(o)->next) {
     TString *ts = rawgco2ts(o);
     if (h == ts->tsv.hash &&
         ts->tsv.len == l &&
         (memcmp(str, getstr(ts), l * sizeof(char)) == 0)) {
       if (isdead(G(L), o))  /* string is dead (but was not collected yet)? */
         changewhite(o);  /* resurrect it */
       return ts; 
     }   
   }
   return newlstr(L, str, l, h);  /* not found; create a new string */
 }

Lua 5.3.6 generates the seed randomly.

#define luai_makeseed() cast(unsigned int, time(NULL))
// create the seed
static unsigned int makeseed(lua_State* L) {
  char buff[4 * sizeof(size_t)];
  unsigned int h = luai_makeseed();
  int p = 0;
  addbuff(buff, p, L); /* heap variable */
  addbuff(buff, p, &h); /* local variable */
  addbuff(buff, p, luaO_nilobject); /* global variable */
  addbuff(buff, p, &lua_newstate); /* public function */
  return luaS_hash(buff, p, h);
}

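How the seed is generated is quite interesting. It is derived from the following:

- the address of the lua_State
- the current time (through luai_makeseed) and the address of the local variable h
- the address of the luaO_nilobject constant
- the address of the lua_newstate function

Finally it calls luaS_hash() to produce the hash seed; this function is used both for hashing strings and for creating the seed.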

#define LUAI_HASHLIMIT 5

unsigned int luaS_hash(const char* str, size_t l, unsigned int seed) {
  unsigned int h = seed ^ cast(unsigned int, l); // ^ means Bitwise XOR
  // strings shorter than 2^5 bytes are hashed in full; longer ones skip characters for speed
  size_t step = (l >> LUAI_HASHLIMIT) + 1;
  for (; l >= step; l -= step)
    h ^= ((h << 5) + (h >> 2) + cast_byte(str[l - 1]));
  return h;
}

As you can see, when luaS_hash hashes a string that is too long, it skips some characters to improve performance.
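To see the skipping in numbers, the loop can be copied into a standalone program. `toy_hash` reproduces the loop shown above with plain C casts in place of Lua's macros; `toy_sampled` is a hypothetical helper added here only to count how many characters the loop reads.

```c
#include <assert.h>
#include <stddef.h>

/* Standalone copy of the hashing loop shown above, for experimentation. */
static unsigned int toy_hash(const char *str, size_t l, unsigned int seed) {
  unsigned int h = seed ^ (unsigned int)l;
  size_t step = (l >> 5) + 1;  /* 1 for strings shorter than 32 bytes */
  assert(str != NULL || l == 0);
  for (; l >= step; l -= step)
    h ^= ((h << 5) + (h >> 2) + (unsigned char)str[l - 1]);
  return h;
}

/* Number of characters the loop actually reads for a string of length l. */
static size_t toy_sampled(size_t l) {
  size_t step = (l >> 5) + 1;
  size_t n = 0;
  for (; l >= step; l -= step)
    n++;
  return n;
}
```

A 31-byte string is read in full, while a 1000-byte string is sampled at only 31 positions, so the hashing cost stays bounded no matter how long the string is. The seed feeds into the initial value of `h`, which is what makes the result hard to predict from outside.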

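As more strings collide, looking up an identical string gets slower and slower. That is fine: when the number of strings reaches the size of strt, a table twice as large is allocated and the existing strings are rehashed into it. Likewise, when the number of strings drops below a quarter of strt's size, strt shrinks to half its size.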

static void checkSizes(lua_State* L, global_State* g) {
  if (g->gckind != KGC_EMERGENCY) {
    l_mem olddebt = g->GCdebt;
    if (g->strt.nuse < g->strt.size / 4) // table is more than 4x the string count: halve it
      luaS_resize(L, g->strt.size / 2);
    g->GCestimate += g->GCdebt - olddebt;  /* update estimate */
  }
}
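The grow and shrink rules can be combined into one small function for illustration. `toy_next_size` is a hypothetical helper: in the code shown in this post the two rules live in different places (growth on insertion, shrinking during GC), and the overflow guard on growth is left out here.

```c
#include <assert.h>

/* Size the bucket array should have next, given 'nuse' strings stored
   in 'size' buckets. 'size' is assumed to be a power of two. */
static int toy_next_size(int nuse, int size) {
  assert(size > 0 && (size & (size - 1)) == 0);
  if (nuse >= size)
    return size * 2;  /* as many strings as buckets: double */
  if (nuse < size / 4)
    return size / 2;  /* fewer than a quarter in use: halve */
  return size;        /* otherwise leave it alone */
}
```

The gap between the two thresholds keeps the table from flipping back and forth between two sizes when the string count hovers around a boundary.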

Creating strings

TString* luaS_newlstr(lua_State* L, const char* str, size_t l) {
// short-string path: l <= 40
  if (l <= LUAI_MAXSHORTLEN)
    return internshrstr(L, str, l);
  else {
    TString* ts;
    if (l >= (MAX_SIZE - sizeof(TString)) / sizeof(char))
      luaM_toobig(L);
// long-string path: l > 40
    ts = luaS_createlngstrobj(L, l);
    memcpy(getstr(ts), str, l * sizeof(char));
    return ts;
  }
}

Creating short strings

A short string is hashed immediately; on a collision it is chained into the bucket's list.

static TString* internshrstr(lua_State* L, const char* str, size_t l) {
  TString* ts;
  global_State* g = G(L);
  unsigned int h = luaS_hash(str, l, g->seed);
  // g->strt.size is always 2^n, such as 128
  // lmod takes the lowest n bits of h
  // list is the bucket inside g->strt.hash
  // hash the string to find its bucket in the strt hash table
  TString** list = &g->strt.hash[lmod(h, g->strt.size)];
  for (ts = *list; ts != NULL; ts = ts->u.hnext) {
    if (l == ts->shrlen && (memcmp(str, getstr(ts), l * sizeof(char)) == 0)) {
      // found an identical string (same bucket, same length, same bytes)
      // check whether it is about to be collected (covered in a later post)
      if (isdead(g, ts))
        changewhite(ts); // keep it alive and reuse it
      return ts;
    }
  }
// if the number of strings has reached the table size, double the table
  if (g->strt.nuse >= g->strt.size && g->strt.size <= MAX_INT / 2) {
    luaS_resize(L, g->strt.size * 2);
    // after resizing, recompute the bucket for the string being created
    list = &g->strt.hash[lmod(h, g->strt.size)];
  }
  // the function that actually creates the string object, used for both kinds
  ts = createstrobj(L, l, LUA_TSHRSTR, h);
  memcpy(getstr(ts), str, l * sizeof(char));
  ts->shrlen = cast_byte(l);
  // short strings in the same bucket are chained together
  ts->u.hnext = *list;
  *list = ts;
  g->strt.nuse++;
  return ts;
}

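Creating long strings

A long string is not hashed right away; hashing is deferred until later.

TString* luaS_createlngstrobj(lua_State* L, size_t l) {
  TString* ts = createstrobj(L, l, LUA_TLNGSTR, G(L)->seed);
  // no hashing here!
  ts->u.lnglen = l;
  return ts;
}

In the source I found only one place that hashes a long string: in the table code from the previous post, where a string key is hashed only when it actually has to be. (It hashes only when forced to; could that be read as Lua not wanting to hash long strings at all?)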

static Node *mainposition (const Table *t, const TValue *key) {
  switch (ttype(key)) {
....
    case LUA_TLNGSTR:
      return hashpow2(t, luaS_hashlongstr(tsvalue(key)));
....
  }
}

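Lua 5.3 Design and Implementation (2): Table and Metatable

2020-10-26T05:12:20.000Z

The previous post covered how Lua code runs; this one is about Lua tables and the metamethods built on metatables.

I think Lua has been so successful because, first, it is very small; second, it is open source; and third, its metamethod design.

Lua types

This post is mainly about tables, but before that it helps to see how Lua's other types are implemented in the interpreter.

Leaving userdata aside, NUMBER is subdivided into floats and integers, strings into long and short strings, and functions into Lua functions, C functions, and light C functions. These are discussed in the posts on strings and closures.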

#define LUA_TNIL		0
#define LUA_TBOOLEAN		1
#define LUA_TLIGHTUSERDATA	2
#define LUA_TNUMBER		3
#define LUA_TSTRING		4
#define LUA_TTABLE		5
#define LUA_TFUNCTION		6
#define LUA_TUSERDATA		7
#define LUA_TTHREAD		8

#define LUA_TNUMFLT	(LUA_TNUMBER | (0 << 4))  /* float numbers */
#define LUA_TNUMINT	(LUA_TNUMBER | (1 << 4))  /* integer numbers */

#define LUA_TSHRSTR	(LUA_TSTRING | (0 << 4))  /* short strings */
#define LUA_TLNGSTR	(LUA_TSTRING | (1 << 4))  /* long strings */

#define LUA_TLCL	(LUA_TFUNCTION | (0 << 4))  /* Lua closure */
#define LUA_TLCF	(LUA_TFUNCTION | (1 << 4))  /* light C function */
#define LUA_TCCL	(LUA_TFUNCTION | (2 << 4))  /* C closure */

Table

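Think about how we usually use a table: most of the time it serves both as an array and as a hash table.

So it is natural to guess that a table is implemented with a hash table underneath. Early versions of Lua did exactly that; later the performance of tables used as arrays was optimized (by adding an array part).

The Table structure has a metatable pointer, an array, and a hash table, almost exactly as guessed. It is even more aggressive than that: both parts are in use at the same time!

Note that lsizenode is the base-2 logarithm of the hash part's size, not the size itself.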

typedef struct Table {
....
  lu_byte flags;  /* 1<<p means tagmethod(p) is not present */
  lu_byte lsizenode;  /* log2 of size of 'node' array */
  unsigned int sizearray;  /* size of 'array' array */
  TValue *array;  /* array part */
  Node *node;
  Node *lastfree;  /* any free position is before this position */
  struct Table *metatable;
....
} Table;
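Storing the logarithm rather than the size means the size is recovered with a shift. The two helpers below are hypothetical names for a minimal sketch of that round trip; they are not functions from the Lua source.

```c
#include <assert.h>

/* Number of nodes in the hash part for a given lsizenode. */
static unsigned int toy_sizenode(unsigned char lsizenode) {
  assert(lsizenode < 32);
  return 1u << lsizenode;
}

/* Smallest exponent whose power of two can hold n nodes (n >= 1). */
static unsigned char toy_lsize_for(unsigned int n) {
  unsigned char l = 0;
  assert(n >= 1 && n <= (1u << 31));
  while ((1u << l) < n)
    l++;
  return l;
}
```

Because the size is always a power of two, reducing a hash to a bucket index can be done with a bit mask instead of a division.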

The array part is unremarkable, so we focus on the hash part. The nk member of TKey carries the next field, which is used to chain nodes when keys collide.

typedef union TKey {
  struct {
    TValuefields;
    int next;  /* for chaining (offset for next node) */
  } nk;
  TValue tvk;
} TKey;

typedef struct Node {
  TValue i_val;
  TKey i_key;
} Node;

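Creating a table

Creating a table mainly initializes the structure. Note one thing: a table's node defaults to dummynode. By design, when a table's hash part is empty it points at a shared global dummynode object. It is only ever read, so there is no thread-safety issue. Setting it to NULL would presumably work too, but remember that lsizenode is a power-of-two exponent: 2^0 == 1, so having one dummy node makes the logic more natural. However, if you accidentally link the Lua library twice, there are two copies of dummynode in memory, and any logic that compares against dummynode becomes undefined behavior.

#define dummynode		(&dummynode_)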

static const Node dummynode_ = {
  {NILCONSTANT},  /* value */
  {{NILCONSTANT, 0}}  /* key */
};

static void setnodevector (lua_State *L, Table *t, unsigned int size) {
  if (size == 0) {  /* no elements to hash part? */
    t->node = cast(Node *, dummynode);  /* use common 'dummynode' */
    t->lsizenode = 0;
    t->lastfree = NULL;  /* signal that it is using dummy node */
  }
....
}

Table *luaH_new (lua_State *L) {
  GCObject *o = luaC_newobj(L, LUA_TTABLE, sizeof(Table));
  Table *t = gco2t(o);
  t->metatable = NULL;
  t->flags = cast_byte(~0);
  t->array = NULL;
  t->sizearray = 0;
  setnodevector(L, t, 0);
  return t;
}

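Array or hash table?

After all this you may wonder: does a given operation on a table touch the array or the hash table? Let's look at the following operations.

local a = {1, 2, 3}
a[3] = 5

The first line compiles to SETLIST, while the second compiles to SETTABLE.

SETLIST

SETLIST targets the array part by default, so it first checks the size of the table's array part and then assigns. When the key is not present yet, luaH_setint calls luaH_newkey, which finds where the key belongs by hashing and stores it there.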

vmcase(OP_SETLIST) {
        int n = GETARG_B(i);
        int c = GETARG_C(i);
        unsigned int last;
        Table *h;
        if (n == 0) n = cast_int(L->top - ra) - 1;
        if (c == 0) {
          lua_assert(GET_OPCODE(*ci->u.l.savedpc) == OP_EXTRAARG);
          c = GETARG_Ax(*ci->u.l.savedpc++);
        }
        h = hvalue(ra);
        last = ((c-1)*LFIELDS_PER_FLUSH) + n;
        if (last > h->sizearray)  /* needs more space? */
          luaH_resizearray(L, h, last);  /* preallocate it at once */
        for (; n > 0; n--) {
          TValue *val = ra+n;
          luaH_setint(L, h, last--, val);
          luaC_barrierback(L, h, val);
        }
        L->top = ci->top;  /* correct top (in case of previous open call) */
        vmbreak;
      }

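luaH_resize grows or shrinks both the array and the hash part; nil values in the array are dropped.

SETTABLE

This operation depends on the situation, but for a table it ultimately ends up in luaH_newkey. If the value is not a table, its metamethod is looked up. For a table, the check uses the flags field of the Table structure, one bit per metamethod. The metamethod chain must not be too long; the limit is MAXTAGLOOP, 2000 by default.

#define MAXTAGLOOP	2000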

void luaV_finishset (lua_State *L, const TValue *t, TValue *key,
                     StkId val, const TValue *slot) {
  int loop;  /* counter to avoid infinite loops */
  for (loop = 0; loop < MAXTAGLOOP; loop++) {
    const TValue *tm;  /* '__newindex' metamethod */
    if (slot != NULL) {  /* is 't' a table? */
      Table *h = hvalue(t);  /* save 't' table */
      lua_assert(ttisnil(slot));  /* old value must be nil */
      tm = fasttm(L, h->metatable, TM_NEWINDEX);  /* get metamethod */
      if (tm == NULL) {  /* no metamethod? */
        if (slot == luaO_nilobject)  /* no previous entry? */
          slot = luaH_newkey(L, h, key);  /* create one */
        /* no metamethod and (now) there is an entry with given key */
        setobj2t(L, cast(TValue *, slot), val);  /* set its new value */
        invalidateTMcache(h);
        luaC_barrierback(L, h, val);
        return;
      }
      /* else will try the metamethod */
    }
    else {  /* not a table; check metamethod */
      if (ttisnil(tm = luaT_gettmbyobj(L, t, TM_NEWINDEX)))
        luaG_typeerror(L, t, "index");
    }
    /* try the metamethod */
    if (ttisfunction(tm)) {
      luaT_callTM(L, tm, t, key, val, 0);
      return;
    }
    t = tm;  /* else repeat assignment over 'tm' */
    if (luaV_fastset(L, t, key, slot, luaH_get, val))
      return;  /* done */
    /* else loop */
  }
  luaG_runerror(L, "'__newindex' chain too long; possible loop");
}

luaH_newkey

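The key is hashed to its main position mp, the slot where it should live in the hash part. If that slot is already taken, Lua checks whether the key occupying it is really in its own main position (by hashing that key again). If it is, the new key goes into a free slot, found by moving the lastfree pointer to the left, and that slot is linked into the chain that starts at mp. If it is not, the old key gives way: it is moved to the free slot, the chain it belongs to is patched, and the new key takes mp. (Personally, I see it as a hybrid of linear probing and separate chaining.)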

TValue *luaH_newkey (lua_State *L, Table *t, const TValue *key) {
  Node *mp;
  TValue aux;
  if (ttisnil(key)) luaG_runerror(L, "table index is nil");
  else if (ttisfloat(key)) {
    lua_Integer k;
    if (luaV_tointeger(key, &k, 0)) {  /* does index fit in an integer? */
      setivalue(&aux, k);
      key = &aux;  /* insert it as an integer */
    }
    else if (luai_numisnan(fltvalue(key)))
      luaG_runerror(L, "table index is NaN");
  }
  mp = mainposition(t, key);
  if (!ttisnil(gval(mp)) || isdummy(t)) {  /* main position is taken? */
    Node *othern;
    Node *f = getfreepos(t);  /* get a free place */
    if (f == NULL) {  /* cannot find a free place? */
      rehash(L, t, key);  /* grow table */
      /* whatever called 'newkey' takes care of TM cache */
      return luaH_set(L, t, key);  /* insert key into grown table */
    }
    lua_assert(!isdummy(t));
    othern = mainposition(t, gkey(mp));
    if (othern != mp) {  /* is colliding node out of its main position? */
      /* yes; move colliding node into free position */
      while (othern + gnext(othern) != mp)  /* find previous */
        othern += gnext(othern);
      gnext(othern) = cast_int(f - othern);  /* rechain to point to 'f' */
      *f = *mp;  /* copy colliding node into free pos. (mp->next also goes) */
      if (gnext(mp) != 0) {
        gnext(f) += cast_int(mp - f);  /* correct 'next' */
        gnext(mp) = 0;  /* now 'mp' is free */
      }
      setnilvalue(gval(mp));
    }
    else {  /* colliding node is in its own main position */
      /* new node will go into free position */
      if (gnext(mp) != 0)
        gnext(f) = cast_int((mp + gnext(mp)) - f);  /* chain new position */
      else lua_assert(gnext(f) == 0);
      gnext(mp) = cast_int(f - mp);
      mp = f;
    }
  }
  setnodekey(L, &mp->i_key, key);
  luaC_barrierback(L, t, key);
  lua_assert(ttisnil(gval(mp)));
  return gval(mp);
}
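
The collision handling in luaH_newkey is easier to follow on a toy table. The following is a simplified sketch under several assumptions: keys are non-negative ints, the hash is key % NSLOTS, there is no deletion, and a full table returns -1 where Lua would rehash. MiniTable, mini_newkey and the other names are hypothetical; only the chaining logic mirrors the function above.

```c
#include <assert.h>

#define NSLOTS 8     /* a power of two, like Lua's node array */
#define EMPTY  (-1)  /* marks an unused slot (Lua checks for a nil value) */

typedef struct {
  int key;   /* EMPTY or a non-negative key */
  int next;  /* relative offset to the next node in the chain, 0 = end */
} Slot;

typedef struct {
  Slot slot[NSLOTS];
  int lastfree;  /* every slot at index >= lastfree is known to be in use */
} MiniTable;

static void mini_init(MiniTable *t) {
  for (int i = 0; i < NSLOTS; i++) {
    t->slot[i].key = EMPTY;
    t->slot[i].next = 0;
  }
  t->lastfree = NSLOTS;
}

static int main_position(int key) { return key % NSLOTS; }

/* Move lastfree to the left until a free slot shows up; -1 if none is left. */
static int get_free_pos(MiniTable *t) {
  while (t->lastfree > 0) {
    t->lastfree--;
    if (t->slot[t->lastfree].key == EMPTY) return t->lastfree;
  }
  return -1;
}

/* Insert a key that is not in the table yet; returns its slot index. */
static int mini_newkey(MiniTable *t, int key) {
  int mp = main_position(key);
  if (t->slot[mp].key != EMPTY) {          /* main position is taken */
    int f = get_free_pos(t);
    if (f < 0) return -1;                  /* Lua would call rehash here */
    int othern = main_position(t->slot[mp].key);
    if (othern != mp) {                    /* occupant is a squatter: evict it */
      while (othern + t->slot[othern].next != mp)
        othern += t->slot[othern].next;    /* find its predecessor */
      t->slot[othern].next = f - othern;   /* rechain predecessor to f */
      t->slot[f] = t->slot[mp];            /* copy occupant into f */
      if (t->slot[mp].next != 0) {
        t->slot[f].next += mp - f;         /* offsets are relative: fix it */
        t->slot[mp].next = 0;
      }
      t->slot[mp].key = EMPTY;
    } else {                               /* occupant is at home: use f */
      if (t->slot[mp].next != 0)
        t->slot[f].next = (mp + t->slot[mp].next) - f;
      else
        assert(t->slot[f].next == 0);
      t->slot[mp].next = f - mp;
      mp = f;
    }
  }
  t->slot[mp].key = key;
  return mp;
}

/* Follow the chain that starts at the key's main position. */
static int mini_find(const MiniTable *t, int key) {
  int n = main_position(key);
  for (;;) {
    if (t->slot[n].key == key) return n;
    if (t->slot[n].next == 0) return -1;
    n += t->slot[n].next;
  }
}
```

Inserting 1, 9 and 7 in that order exercises both branches: 9 collides with 1 (which is at home) and lands in free slot 7; 7 then finds slot 7 occupied by a key that does not belong there, so 9 is moved to slot 6 and 7 takes its own main position.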

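If getfreepos cannot find a free slot (lastfree has reached the leftmost node), rehash is called.

rehash counts the keys in the array part, plus the keys in the hash part that could move into the array part (that is, keys that are integers in the valid array-index range), and then computes new sizes for both parts.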

static void rehash (lua_State *L, Table *t, const TValue *ek) {
  unsigned int asize;  /* optimal size for array part */
  unsigned int na;  /* number of keys in the array part */
  unsigned int nums[MAXABITS + 1];
  int i;
  int totaluse;
  for (i = 0; i <= MAXABITS; i++) nums[i] = 0;  /* reset counts */
  na = numusearray(t, nums);  /* count keys in array part */
  totaluse = na;  /* all those keys are integer keys */
  totaluse += numusehash(t, nums, &na);  /* count keys in hash part */
  /* count extra key */
  na += countint(ek, nums);
  totaluse++;
  /* compute new size for array part */
  asize = computesizes(nums, &na);
  /* resize the table to new computed sizes */
  luaH_resize(L, t, asize, totaluse - na);
}
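
The sizing rule that rehash delegates to computesizes can be stated as: pick the largest power of two n such that more than half of the slots 1..n would be in use. The sketch below recomputes that rule directly from a list of keys instead of using Lua's nums[] histogram; optimal_array_size is a hypothetical name, and the keys are assumed to be distinct positive integers.

```c
#include <assert.h>

/* Hypothetical helper (not Lua's computesizes): returns the array-part size
   that the "more than half full" rule would choose for the given keys.
   Assumes the keys are distinct and all >= 1. */
static unsigned int optimal_array_size(const unsigned int *keys,
                                       unsigned int nkeys) {
  unsigned int optimal = 0;
  /* A candidate size n is only worth trying while the total number of
     integer keys could still fill more than half of it. */
  for (unsigned int n = 1; n != 0 && nkeys > n / 2; n *= 2) {
    unsigned int used = 0;  /* keys that would land in slots 1..n */
    for (unsigned int i = 0; i < nkeys; i++) {
      assert(keys[i] >= 1);
      if (keys[i] <= n) used++;
    }
    if (used > n / 2) optimal = n;  /* more than half full: keep this size */
  }
  return optimal;
}
```

With keys {1, 2, 3, 100} the result is 4: keys 1 to 3 fit the array part and 100 stays in the hash part. With the single key {100} the result is 0, so everything stays in the hash part.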

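How is the table length computed?

Lua gets the length with the # operator, which ends up calling the functions below.

If the array part is non-empty and its last element is nil, there is a border inside the array part, and a binary search finds one: an index i where t[i] is non-nil (or i is 0) and t[i + 1] is nil. Only when the array part is completely full (or empty) does the hash part come into play. isdummy is defined as ((t)->lastfree == NULL); if the hash part is empty, the array size is the answer. Otherwise unbound_search continues in the hash part: it doubles j until t[j] is nil, then binary-searches between the last present index and j. Any border is a valid result, so never take the length of a table that is not a sequence.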

static lua_Unsigned unbound_search (Table *t, lua_Unsigned j) {
  lua_Unsigned i = j;  /* i is zero or a present index */
  j++;
  /* find 'i' and 'j' such that i is present and j is not */
  while (!ttisnil(luaH_getint(t, j))) {
    i = j;
    if (j > l_castS2U(LUA_MAXINTEGER) / 2) {  /* overflow? */
      /* table was built with bad purposes: resort to linear search */
      i = 1;
      while (!ttisnil(luaH_getint(t, i))) i++;
      return i - 1;
    }
    j *= 2;
  }
  /* now do a binary search between them */
  while (j - i > 1) {
    lua_Unsigned m = (i+j)/2;
    if (ttisnil(luaH_getint(t, m))) j = m;
    else i = m;
  }
  return i;
}

lua_Unsigned luaH_getn (Table *t) {
  unsigned int j = t->sizearray;
  if (j > 0 && ttisnil(&t->array[j - 1])) {
    /* there is a boundary in the array part: (binary) search for it */
    unsigned int i = 0;
    while (j - i > 1) {
      unsigned int m = (i+j)/2;
      if (ttisnil(&t->array[m - 1])) j = m;
      else i = m;
    }
    return i;
  }
  /* else must find a boundary in hash part */
  else if (isdummy(t))  /* hash part is empty? */
    return j;  /* that is easy... */
  else return unbound_search(t, j);
}
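
The binary search in luaH_getn can be tried on a plain C array. In this sketch (an illustration, not Lua's code) 0 stands for nil, a[0..size-1] models t[1..size], and array_border is a hypothetical name; as in luaH_getn, it is only meant to be called when the last slot is nil.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns a border of the modelled table, i.e. an
   index i (1-based, or 0) such that t[i] is non-nil or i == 0, and
   t[i + 1] is nil. Precondition: size > 0 and the last slot is nil. */
static size_t array_border(const int *a, size_t size) {
  assert(size > 0 && a[size - 1] == 0);
  size_t i = 0;     /* invariant: i == 0 or t[i] is non-nil */
  size_t j = size;  /* invariant: t[j] is nil */
  while (j - i > 1) {
    size_t m = (i + j) / 2;
    if (a[m - 1] == 0) j = m;  /* t[m] is nil: border is below m */
    else i = m;                /* t[m] is present: border is at or above m */
  }
  return i;
}
```

For {1, 2, 3, nil} this returns 3. For {1, 2, nil, 4, 5, nil, 7, nil} it returns 5 rather than 2: both are borders, and which one the search lands on depends only on the midpoints it happens to probe.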

MetaMethod

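As mentioned earlier, the Table struct has a flags field that records which metamethods are known to be absent. When an operation is applied to a value, Lua checks for the corresponding metamethod and tries to call it if present; the chain may be followed at most 2000 times (MAXTAGLOOP) before an error is raised. The metamethod names are optimized as well: the string objects are created up front, cached, and marked so that they are never collected.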

void luaT_init (lua_State *L) {
  static const char *const luaT_eventname[] = {  /* ORDER TM */
    "__index", "__newindex",
    "__gc", "__mode", "__len", "__eq",
    "__add", "__sub", "__mul", "__mod", "__pow",
    "__div", "__idiv",
    "__band", "__bor", "__bxor", "__shl", "__shr",
    "__unm", "__bnot", "__lt", "__le",
    "__concat", "__call"
  };
  int i;
  for (i=0; i<TM_N; i++) {
    G(L)->tmname[i] = luaS_new(L, luaT_eventname[i]);
    luaC_fix(L, obj2gco(G(L)->tmname[i]));  /* never collect these names */
  }
}
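
The "known to be absent" cache in flags can be sketched in a few lines. This is a toy model under stated assumptions, not Lua's fasttm: MiniMeta, its handler array (a stand-in for the metatable's fields) and the three-event enum are hypothetical. What it keeps from Lua is the idea that a set bit means "do not bother looking" and that any write to the metatable clears the cache.

```c
#include <stddef.h>

enum { EV_INDEX, EV_NEWINDEX, EV_GC, EV_COUNT };  /* hypothetical subset */

typedef struct {
  unsigned char flags;            /* bit e set: event e is known to be absent */
  const char *handler[EV_COUNT];  /* stand-in for metatable fields; NULL = none */
  int slow_lookups;               /* how many times the slow path ran */
} MiniMeta;

/* Slow path: a real lookup; remembers a miss in the flags byte. */
static const char *slow_get(MiniMeta *mt, int e) {
  mt->slow_lookups++;
  if (mt->handler[e] == NULL)
    mt->flags |= (unsigned char)(1u << e);
  return mt->handler[e];
}

/* Fast path: answer "absent" from the cache without any lookup. */
static const char *fast_get(MiniMeta *mt, int e) {
  if (mt == NULL) return NULL;                /* no metatable at all */
  if (mt->flags & (1u << e)) return NULL;     /* cached miss */
  return slow_get(mt, e);
}

/* Any write may add a metamethod, so the whole cache is dropped. */
static void set_handler(MiniMeta *mt, int e, const char *h) {
  mt->handler[e] = h;
  mt->flags = 0;
}
```

Two consecutive fast_get calls for a missing event run the slow path only once; after set_handler the bit is cleared and the next lookup goes through the slow path again and finds the handler.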

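pairs and ipairs

These are the two most common ways to traverse a table. pairs is implemented on top of luaH_next; when the key is nil, the traversal starts from the beginning.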

int luaH_next (lua_State *L, Table *t, StkId key) {
  unsigned int i = findindex(L, t, key);  /* find original element */
  for (; i < t->sizearray; i++) {  /* try first array part */
    if (!ttisnil(&t->array[i])) {  /* a non-nil value? */
      setivalue(key, i + 1);
      setobj2s(L, key+1, &t->array[i]);
      return 1;
    }
  }
  for (i -= t->sizearray; cast_int(i) < sizenode(t); i++) {  /* hash part */
    if (!ttisnil(gval(gnode(t, i)))) {  /* a non-nil value? */
      setobj2s(L, key, gkey(gnode(t, i)));
      setobj2s(L, key+1, gval(gnode(t, i)));
      return 1;
    }
  }
  return 0;  /* no more elements */
}

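One thing to note: if the value of a key is set to nil during traversal, the key object may be collected by the GC while the traversal is still in progress. Lua calls such a key a dead key.

Nothing special actually happens here: a key marked dead is not removed from the table. A rehash would clear it out of the hash part, but a rehash is only triggered when a new key is added and there is no free slot left. So as long as you do not add new keys, traversal is safe.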

static unsigned int findindex (lua_State *L, Table *t, StkId key) {
  unsigned int i;
  if (ttisnil(key)) return 0;  /* first iteration */
  i = arrayindex(key);
  if (i != 0 && i <= t->sizearray)  /* is 'key' inside array part? */
    return i;  /* yes; that's the index */
  else {
    int nx;
    Node *n = mainposition(t, key);
    for (;;) {  /* check whether 'key' is somewhere in the chain */
      /* key may be dead already, but it is ok to use it in 'next' */
      if (luaV_rawequalobj(gkey(n), key) ||
            (ttisdeadkey(gkey(n)) && iscollectable(key) &&
             deadvalue(gkey(n)) == gcvalue(key))) {
        i = cast_int(n - gnode(t, 0));  /* key index in hash table */
        /* hash elements are numbered after array ones */
        return (i + 1) + t->sizearray;
      }
      nx = gnext(n);
      if (nx == 0)
        luaG_runerror(L, "invalid key to 'next'");  /* key not found */
      else n += nx;
    }
  }
}

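ipairs, on the other hand, is implemented with lua_geti. The real work happens in luaH_getint; if the value is not found there, luaV_finishget looks for an __index metamethod. ipairs stops at the first nil, so take care that the table has no holes.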

LUA_API int lua_geti (lua_State *L, int idx, lua_Integer n) {
  StkId t;
  const TValue *slot;
  lua_lock(L);
  t = index2addr(L, idx);
  if (luaV_fastget(L, t, n, slot, luaH_getint)) {
    setobj2s(L, L->top, slot);
    api_incr_top(L);
  }
  else {
    setivalue(L->top, n);
    api_incr_top(L);
    luaV_finishget(L, t, L->top - 1, L->top - 1, slot);
  }
  lua_unlock(L);
  return ttnov(L->top - 1);
}

Lua 5.3 Design and Implementation (1): How Does Lua Run?

2020-10-25T05:12:20.000Z

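I had actually already written a Lua 5.3 source code analysis series, with several drafts still unpublished. Why did I suddenly stop posting (and even delete some)? Because that way of studying felt too painful and mentally draining (the last thing you should do in life is torture yourself), it failed to separate what matters from what does not, and it overlapped heavily with blog posts already on the web. So I decided to read the Lua source once more, this time Lua 5.3.6, the last release of the Lua 5.3 series.

This series does not cover Lua syntax and assumes the reader already has experience using Lua. We will skip the Lua compiler (mostly lexing and parsing) and go straight into the interpreter to learn how the Lua source we write actually runs. To make things easier to follow, the code is heavily trimmed down to its core.

The Lua compilation process

Although we agreed at the start not to discuss the Lua compiler, we still need to understand how Lua runs. Briefly: your xxx.lua file is compiled into a binary form, either ahead of time by the luac tool or in memory by the compiler built into the interpreter. The Lua authors call this unit a Chunk in the code. The interpreter then loads it and executes it. So although Lua looks like it compiles while it runs, it actually compiles the source into a Chunk first and then loads and executes that Chunk.

Loading a Chunk

Suppose we have some Lua code that has already been compiled into a Chunk by luac. How does the Lua interpreter load it?

We can boldly guess that Lua has a load function that loads our Chunk.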

LUA_API int lua_load (lua_State *L, lua_Reader reader, void *data,
                      const char *chunkname, const char *mode) {
  ....
  status = luaD_protectedparser(L, &z, chunkname, mode);
  if (status == LUA_OK) {  /* no errors? */
    LClosure *f = clLvalue(L->top - 1);  /* get newly created function */
....
  }
  return status;
}

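It does. At its core, lua_load calls luaD_protectedparser, which in turn calls f_parser. Do not be intimidated by luaD_pcall: it simply calls the function pointer passed to it, here f_parser. The p in the name stands for protected, meaning a call that can catch errors. Since C has no exception mechanism, Lua uses setjmp and longjmp internally to jump across functions and emulate exceptions.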

int luaD_protectedparser (lua_State *L, ZIO *z, const char *name,
                                        const char *mode) {
....
  status = luaD_pcall(L, f_parser, &p, savestack(L, L->top), L->errfunc);
  return status;
}
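
The setjmp-based protected call can be sketched on its own. The code below is a minimal model, not Lua's implementation: MiniState, Recovery, mini_pcall and mini_throw are hypothetical names, loosely patterned on the way a chain of recovery points with a per-level status lets nested protected calls work.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

typedef struct Recovery {
  struct Recovery *previous;  /* enclosing recovery point, if any */
  jmp_buf b;
  volatile int status;        /* 0 = ok, anything else = error code */
} Recovery;

typedef struct MiniState {
  Recovery *errjmp;           /* innermost active recovery point */
} MiniState;

typedef void (*Pfunc)(MiniState *S, void *ud);

/* Raise an error: record the code and jump back to the innermost pcall. */
static void mini_throw(MiniState *S, int errcode) {
  assert(S->errjmp != NULL);
  S->errjmp->status = errcode;
  longjmp(S->errjmp->b, 1);
}

/* Run f in protected mode; returns 0 or the error code that was thrown. */
static int mini_pcall(MiniState *S, Pfunc f, void *ud) {
  Recovery r;
  r.status = 0;
  r.previous = S->errjmp;     /* chain the new recovery point */
  S->errjmp = &r;
  if (setjmp(r.b) == 0)
    f(S, ud);                 /* normal path: just call the function */
  S->errjmp = r.previous;     /* restore the old recovery point */
  return r.status;
}

/* Demo bodies, all hypothetical. */
static void ok_body(MiniState *S, void *ud) { (void)S; *(int *)ud = 42; }
static void failing_body(MiniState *S, void *ud) { (void)ud; mini_throw(S, 2); }
static void nested_body(MiniState *S, void *ud) {
  /* The inner error is caught here and does not reach the outer call. */
  *(int *)ud = mini_pcall(S, failing_body, NULL);
}
```

Calling mini_pcall with ok_body returns 0, with failing_body it returns 2, and with nested_body the outer call returns 0 while the inner status (2) is handed back through ud.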

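f_parser decides, based on the first character of the input, whether to parse the Chunk from binary or from text. To keep things simple we only follow the binary path, luaU_undump.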

static void f_parser (lua_State *L, void *ud) {
  LClosure *cl;
  struct SParser *p = cast(struct SParser *, ud);
  int c = zgetc(p->z);  /* read first character */
  if (c == LUA_SIGNATURE[0]) {
    checkmode(L, p->mode, "binary");
    cl = luaU_undump(L, p->z, p->name);
  }
  else {
    checkmode(L, p->mode, "text");
    cl = luaY_parser(L, p->z, &p->buff, &p->dyd, p->name, c);
  }
  luaF_initupvals(L, cl);
}

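luaU_undump first checks the header, then creates a closure (think of it as a function that carries everything needed to execute it), pushes it onto the top of the VM stack, and finally returns it.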

LClosure *luaU_undump(lua_State *L, ZIO *Z, const char *name) {
  LoadState S;
  LClosure *cl;
....
  checkHeader(&S);
  cl = luaF_newLclosure(L, LoadByte(&S));
  setclLvalue(L, L->top, cl);
  luaD_inctop(L);
  cl->p = luaF_newproto(L);
  LoadFunction(&S, cl->p, NULL);
  ....
  return cl;
}

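checkHeader mainly checks the Chunk's Lua version, the sizes of the basic types, the byte order (big or little endian), and how floating-point numbers are stored. Lua's design stance is clear: if the version or format differs, it flatly refuses to run the Chunk. Quite uncompromising.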

static void checkHeader (LoadState *S) {
  checkliteral(S, LUA_SIGNATURE + 1, "not a");  /* 1st char already checked */
  if (LoadByte(S) != LUAC_VERSION)
    error(S, "version mismatch in");
  if (LoadByte(S) != LUAC_FORMAT)
    error(S, "format mismatch in");
  checkliteral(S, LUAC_DATA, "corrupted");
  checksize(S, int);
  checksize(S, size_t);
  checksize(S, Instruction);
  checksize(S, lua_Integer);
  checksize(S, lua_Number);
  if (LoadInteger(S) != LUAC_INT)
    error(S, "endianness mismatch in");
  if (LoadNumber(S) != LUAC_NUM)
    error(S, "float format mismatch in");
}
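
The endianness check works because the dumping side writes a known integer as raw bytes and the loading side reads those bytes back as a native integer: if the byte order differs, the value no longer matches. The sketch below illustrates the idea with hypothetical helpers (write_sample, byte_order_matches, reverse_bytes); the constant 0x5678 is the value of LUAC_INT in Lua 5.3, and long long is an assumption standing in for lua_Integer.

```c
#include <stddef.h>
#include <string.h>

#define SAMPLE_INT 0x5678LL  /* same value as LUAC_INT in Lua 5.3 */

/* Writer side: store the sample integer as raw native bytes. */
static void write_sample(unsigned char out[sizeof(long long)]) {
  long long x = SAMPLE_INT;
  memcpy(out, &x, sizeof x);
}

/* Reader side: reinterpret the bytes as a native integer and compare. */
static int byte_order_matches(const unsigned char in[sizeof(long long)]) {
  long long x;
  memcpy(&x, in, sizeof x);
  return x == SAMPLE_INT;
}

/* Simulates a chunk produced on a machine with the opposite byte order. */
static void reverse_bytes(unsigned char *b, size_t n) {
  for (size_t i = 0; i < n / 2; i++) {
    unsigned char tmp = b[i];
    b[i] = b[n - 1 - i];
    b[n - 1 - i] = tmp;
  }
}
```

A buffer produced by write_sample passes byte_order_matches on the same machine; after reverse_bytes it fails, which is the situation reported as "endianness mismatch" above.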

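Now let's go back and look at the closure definitions. The Proto inside cl is the actual function prototype, and closures come in two kinds: Lua functions and C functions. upvals, literally "upvalues", are a Lua-specific concept: Lua supports nested functions, functions are first-class values, and scoping is lexical, so a closure binds variables from its enclosing scopes. For now you can think of it as binding global variables in.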

typedef struct Proto {
  CommonHeader;
  lu_byte numparams;  // number of fixed parameters
  lu_byte is_vararg;  // whether the function is variadic
  lu_byte maxstacksize;  // number of registers (emulated on the stack)
  int sizeupvalues;  // number of upvalues

  int sizek;  /* size of 'k' */
  int sizecode;
  int sizelineinfo;
  int sizep;  /* size of 'p' */
  int sizelocvars;

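  int linedefined;  // first line of the definition
  int lastlinedefined;  // last line of the definition
  TString  *source; // source file name

  TValue *k;  // constant table
  Instruction *code;  // instruction array
  struct Proto **p;  // prototypes of nested functions
  int *lineinfo;  // line table: maps each instruction to a source line
  LocVar *locvars;  // local variable table
  Upvaldesc *upvalues;  // upvalue table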

  struct LClosure *cache;  /* last-created closure with this prototype */
} Proto;

typedef struct LClosure {
  unsigned char nupvalues;
  struct Proto *p;
  UpVal *upvals[1];  /* list of upvalues */
} LClosure;

typedef struct CClosure {
  ClosureHeader;
  lua_CFunction f;
  TValue upvalue[1];  /* list of upvalues */
} CClosure;

typedef union Closure {
  CClosure c;
  LClosure l;
} Closure;

LoadFunction fills in the Proto. Note that Protos are nested when the Chunk contains more than one function.

static void LoadFunction (LoadState *S, Proto *f, TString *psource) {
  f->source = LoadString(S, f);
  if (f->source == NULL)  /* no source in dump? */
    f->source = psource;  /* reuse parent's source */
  f->linedefined = LoadInt(S);
  f->lastlinedefined = LoadInt(S);
  f->numparams = LoadByte(S);
  f->is_vararg = LoadByte(S);
  f->maxstacksize = LoadByte(S);
  LoadCode(S, f);
  LoadConstants(S, f);
  LoadUpvalues(S, f);
  LoadProtos(S, f);
  LoadDebug(S, f);
}

With the Chunk loaded, look back at f_parser: its last step calls luaF_initupvals, which initializes the upvalues by creating them and setting them to nil.

void luaF_initupvals (lua_State *L, LClosure *cl) {
  int i;
  for (i = 0; i < cl->nupvalues; i++) {
    UpVal *uv = luaM_new(L, UpVal);
    uv->refcount = 1;
    uv->v = &uv->u.value;  /* make it closed */
    setnilvalue(uv->v);
    cl->upvals[i] = uv;
  }
}

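After loading, we can guess there must also be a call function to actually run what was loaded. It records func in a CallInfo (think of it as a frame of the interpreter's call stack), handles details such as whether the function is variadic and how many results are expected, and finally calls luaV_execute to run the instructions.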

int luaD_precall (lua_State *L, StkId func, int nresults) {
  CallInfo *ci;
  switch (ttype(func)) {
....
    case LUA_TLCL: {  /* Lua function: prepare its call */
      StkId base;
      Proto *p = clLvalue(func)->p;
      int n = cast_int(L->top - func) - 1;  /* number of real arguments */
      int fsize = p->maxstacksize;  /* frame size */
      checkstackp(L, fsize, func);
      if (p->is_vararg)
        base = adjust_varargs(L, p, n);
      else {  /* non vararg function */
        for (; n < p->numparams; n++)
          setnilvalue(L->top++);  /* complete missing arguments */
        base = func + 1;
      }
      ci = next_ci(L);  /* now 'enter' new function */
      ci->nresults = nresults;
      ci->func = func;
      ci->u.l.base = base;
      L->top = ci->top = base + fsize;
      ci->u.l.savedpc = p->code;  /* starting point */
      ci->callstatus = CIST_LUA;
      if (L->hookmask & LUA_MASKCALL)
        callhook(L, ci);
      return 0;
    }
  }
}
void luaD_call (lua_State *L, StkId func, int nResults) {
....
  if (!luaD_precall(L, func, nResults))  /* is a Lua function? */
    luaV_execute(L);  /* call it */
....
}

luaV_execute fetches instructions and executes them. A Lua instruction is 32 bits wide: 6 bits hold the opcode and the remaining bits hold the operands.

void luaV_execute (lua_State *L) {
  CallInfo *ci = L->ci;
  LClosure *cl;
  TValue *k;
  StkId base;
  ci->callstatus |= CIST_FRESH;  /* fresh invocation of 'luaV_execute" */
 newframe:  /* reentry point when frame changes (call/return) */
  lua_assert(ci == L->ci);
  cl = clLvalue(ci->func);  /* local reference to function's closure */
  k = cl->p->k;  /* local reference to function's constant table */
  base = ci->u.l.base;  /* local copy of function's base */
  /* main loop of interpreter */
  for (;;) {
    Instruction i;
    StkId ra;
    vmfetch();
    vmdispatch (GET_OPCODE(i)) {
...
      vmcase(OP_LOADNIL) {
        int b = GETARG_B(i);
        do {
          setnilvalue(ra++);
        } while (b--);
        vmbreak;
      }
....

luaD_precall stores the function to run (the closure) in a CallInfo, and luaV_execute then uses vmfetch to fetch each instruction. savedpc points to the instruction we are currently at.

#define vmfetch() { \
  i = *(ci->u.l.savedpc++); \
.... \
  ra = RA(i); /* WARNING: any stack reallocation invalidates 'ra' */ \
}
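
The fixed 32-bit layout can be decoded with plain shifts and masks. The sketch below is an illustration with hypothetical helper names (field, make_abc, get_opcode and so on); the field widths and positions follow the iABC format in Lua 5.3's lopcodes.h: a 6-bit opcode in the lowest bits, then A (8 bits), C (9 bits) and B (9 bits).

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t Instr;  /* stand-in for Lua's Instruction type */

/* Field widths and positions of the iABC format in Lua 5.3. */
enum { SIZE_OP = 6, SIZE_A = 8, SIZE_C = 9, SIZE_B = 9 };
enum {
  POS_OP = 0,
  POS_A = POS_OP + SIZE_OP,  /* 6 */
  POS_C = POS_A + SIZE_A,    /* 14 */
  POS_B = POS_C + SIZE_C     /* 23 */
};

/* Extract 'size' bits starting at bit 'pos'. */
static uint32_t field(Instr i, int pos, int size) {
  return (i >> pos) & ((1u << size) - 1u);
}

/* Pack an opcode and three operands into one instruction word. */
static Instr make_abc(uint32_t op, uint32_t a, uint32_t b, uint32_t c) {
  assert(op < (1u << SIZE_OP) && a < (1u << SIZE_A));
  assert(b < (1u << SIZE_B) && c < (1u << SIZE_C));
  return (op << POS_OP) | (a << POS_A) | (b << POS_B) | (c << POS_C);
}

static uint32_t get_opcode(Instr i) { return field(i, POS_OP, SIZE_OP); }
static uint32_t get_a(Instr i) { return field(i, POS_A, SIZE_A); }
static uint32_t get_b(Instr i) { return field(i, POS_B, SIZE_B); }
static uint32_t get_c(Instr i) { return field(i, POS_C, SIZE_C); }
```

Packing make_abc(4, 1, 2, 0) and reading the fields back yields opcode 4, A = 1, B = 2 and C = 0, which is the same kind of extraction GET_OPCODE and GETARG_B perform in the interpreter loop above.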

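It is worth looking at the CallInfo struct here. Because the function being executed may be either a C function or a Lua function, the source wraps the two cases in a union; for now we only care about the Lua part. savedpc points into the instruction array, and each Instruction is a 32-bit unsigned integer. Instructions are fixed-length, and the lowest 6 bits are the opcode.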

typedef struct CallInfo {
  StkId func;  /* function index in the stack */
  StkId top;  /* top for this function */
  struct CallInfo *previous, *next;  /* dynamic call link */
  union {
    struct {  /* only for Lua functions */
      StkId base;  /* base for this function */
      const Instruction *savedpc;
    } l;
    struct {  /* only for C functions */
      lua_KFunction k;  /* continuation in case of yields */
      ptrdiff_t old_errfunc;
      lua_KContext ctx;  /* context info. in case of yields */
    } c;
  } u;
  ptrdiff_t extra;
  short nresults;  /* expected number of results from this function */
  unsigned short callstatus;
} CallInfo;

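And that is the whole path, from loading a Chunk to executing it.

But that is not quite enough. Most of the functions above take a lua_State as a parameter, so we need to instantiate a lua_State first. Before that, let's take a quick look at how lua_State is defined.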

lua_State

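With a lot of unrelated fields stripped away, a Lua interpreter needs only the following to operate: the stack information (if you have Lua experience you probably already know that Lua emulates registers on a stack) and the call stack information, that is, CallInfo.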

struct lua_State {
....
  unsigned short nci;  /* number of items in 'ci' list */
  StkId top;  /* first free slot in the stack */
....
  CallInfo *ci;  /* call info for current function */
  StkId stack_last;  /* last free slot in the stack */
  StkId stack;  /* stack base */
....
};

lua_newstate

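lua_newstate simply initializes a lua_State; I have removed the unrelated parts here. After the basic initialization it calls f_luaopen in protected mode to set up the rest of the state (stack, registry, string table, and so on).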

LUA_API lua_State *lua_newstate (lua_Alloc f, void *ud) {
  int i;
  lua_State *L;
  LG *l = cast(LG *, (*f)(ud, NULL, LUA_TTHREAD, sizeof(LG)));
  if (l == NULL) return NULL;
  L = &l->l.l;
....
  for (i=0; i < LUA_NUMTAGS; i++) g->mt[i] = NULL;
  if (luaD_rawrunprotected(L, f_luaopen, NULL) != LUA_OK) {
    /* memory allocation error: free partial state */
    close_state(L);
    L = NULL;
  }
  return L;
}

stack_init initializes the stack and the call stack (CallInfo). init_registry initializes the registry, where global objects and some C functions will be registered later on.

static void f_luaopen (lua_State *L, void *ud) {
  stack_init(L, L);  /* init stack */
  init_registry(L, g);
  luaS_init(L); // initialize the string table so that equal strings are shared whenever possible
....
  luaX_init(L); // pre-register the keyword strings and mark them as never collected
....
}

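Conclusion

After all of the above, we can see that to run a Chunk, Lua first creates the interpreter state, then loads the Chunk into memory according to its defined format, then calls precall to prepare the call, and finally executes the instructions one by one.

When I read the Lua source before, it felt complicated and very hard to follow, especially with macros everywhere (a common affliction of C code): read a bit further and you have already forgotten what the macro contained. This time I tried a new way of reading: first ask yourself how you would implement the feature. Your approach will usually not be far off, and following your own line of thought to find the author's is much easier.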