Golang Panic 行为探秘

为了满足分布式系统测试的需求，我们经常需要在代码中埋下断点，以便于通过修改编译参数或者注册特定 Hook 的方式来强迫程序走特定的逻辑。这篇文章主要分享了我在实现 BreakPoint 时发现的 Golang Panic && Recover 的一个好玩行为及其背后的原因。

复现

package runtime

import (
	"testing"
)

func TestRecover(t *testing.T) {
	defer func() {
		recover()
	}()
	panic("panic in test")
}

func TestRecoverInClosure(t *testing.T) {
	defer func() {
		// This should be the callback function of a break point.
		// Let's call them directly for simpler example.
		func() {
			recover()
		}()
	}()
	panic("panic in test")
}

TestRecover 演示的是一个比较常见的情况，业务逻辑中可能会出现 panic，我们在 defer 的函数中执行 recover 并做进一步的处理。而 TestRecoverInClosure 中演示的则是我原本想要实现的逻辑，断点在触发时去调用在注册断点时传入的回调函数，在回调函数中去执行 recover 并获得 panic 的现场内容。但是事实证明这样是行不通的，在 TestRecoverInClosure 中，panic 并没有被捕获，而是直接抛到了最外层，在闭包中的 recover 也自然是什么都没有拿到，翻车现场如下：

=== RUN   TestRecoverInClosure
--- FAIL: TestRecoverInClosure (0.00s)
panic: panic in test [recovered]
	panic: panic in test

goroutine 6 [running]:
testing.tRunner.func1(0xc000138400)
	/usr/lib/go/src/testing/testing.go:830 +0x392
panic(0x8c1140, 0xb4d1a0)
	/usr/lib/go/src/runtime/panic.go:522 +0x1b5
xuanwo/playground/runtime.TestRecoverInClosure(0xc000138400)
	/home/xuanwo/Code/xuanwo/playground/runtime/panic_test.go:22 +0x55
testing.tRunner(0xc000138400, 0xad0678)
	/usr/lib/go/src/testing/testing.go:865 +0xc0
created by testing.(*T).Run
	/usr/lib/go/src/testing/testing.go:916 +0x35a

原因

为了搞清楚问题的原因，首先需要知道 panic && defer 是怎么工作。Golang 中 panic 和 defer 实现的相关代码主要是在 /usr/lib/go/src/runtime/panic.go 中，下文贴出来的代码来自于 Go 1.12.5。

defer

在了解 panic 之前，首先看看 defer 是如何实现并存储的：

// Allocate a Defer, usually using per-P pool.
// Each defer must be released with freedefer.
//
// This must not grow the stack because there may be a frame without
// stack map information when this is called.
//
//go:nosplit
func newdefer(siz int32) *_defer {
	var d *_defer
	sc := deferclass(uintptr(siz))
	gp := getg()
	if sc < uintptr(len(p{}.deferpool)) {
		...
	}
	if d == nil {
		// Allocate new defer+args.
		systemstack(func() {
			total := roundupsize(totaldefersize(uintptr(siz)))
			d = (*_defer)(mallocgc(total, deferType, true))
		})
		...
	}
	d.siz = siz
	d.link = gp._defer
	gp._defer = d
	return d
}

这里的 getg() 返回的是当前正在执行的 goroutine。

这里可以忽略掉具体的实现细节，只需要关注初始化 defer 和更新 gp._defer 的过程。不难看出 _defer 结构体是以链表的形式存储在 gouroutine 中的，下面 panic 的实现会高度依赖这一点。

panic

下面来看一下 panic 的实现，首先看一下整体的结构，然后挑出一些我认为需要关注的地方展开聊一聊。

// The implementation of the predeclared function panic.
func gopanic(e interface{}) {
	gp := getg()

	...

	var p _panic
	p.arg = e
	p.link = gp._panic
	gp._panic = (*_panic)(noescape(unsafe.Pointer(&p)))

	atomic.Xadd(&runningPanicDefers, 1)

	for {
		d := gp._defer
		if d == nil {
			break
		}

		// If defer was started by earlier panic or Goexit (and, since we're back here, that triggered a new panic),
		// take defer off list. The earlier panic or Goexit will not continue running.
		if d.started {
			if d._panic != nil {
				d._panic.aborted = true
			}
			d._panic = nil
			d.fn = nil
			gp._defer = d.link
			freedefer(d)
			continue
		}

		// Mark defer as started, but keep on list, so that traceback
		// can find and update the defer's argument frame if stack growth
		// or a garbage collection happens before reflectcall starts executing d.fn.
		d.started = true

		// Record the panic that is running the defer.
		// If there is a new panic during the deferred call, that panic
		// will find d in the list and will mark d._panic (this panic) aborted.
		d._panic = (*_panic)(noescape(unsafe.Pointer(&p)))

		p.argp = unsafe.Pointer(getargp(0))
		reflectcall(nil, unsafe.Pointer(d.fn), deferArgs(d), uint32(d.siz), uint32(d.siz))
		p.argp = nil

		...

		pc := d.pc
		sp := unsafe.Pointer(d.sp) // must be pointer so it gets adjusted during stack copy
		freedefer(d)
		if p.recovered {
			atomic.Xadd(&runningPanicDefers, -1)

			gp._panic = p.link
			// Aborted panics are marked but remain on the g.panic list.
			// Remove them from the list.
			for gp._panic != nil && gp._panic.aborted {
				gp._panic = gp._panic.link
			}
			if gp._panic == nil { // must be done with signal
				gp.sig = 0
			}
			// Pass information about recovering frame to recovery.
			gp.sigcode0 = uintptr(sp)
			gp.sigcode1 = pc
			mcall(recovery)
			throw("recovery failed") // mcall should not return
		}
	}

	// ran out of deferred calls - old-school panic now
	// Because it is unsafe to call arbitrary user code after freezing
	// the world, we call preprintpanics to invoke all necessary Error
	// and String methods to prepare the panic strings before startpanic.
	preprintpanics(gp._panic)

	fatalpanic(gp._panic) // should not return
	*(*int)(nil) = 0      // not reached
}

跟 _defer 一样，_panic 结构也是以链表形式存储在 goroutine 中的。

var p _panic
p.arg = e
p.link = gp._panic
gp._panic = (*_panic)(noescape(unsafe.Pointer(&p)))

首先取出第一个 panic 节点，然后进入 for 循环。

d := gp._defer
if d == nil {
	break
}

取出对头的第一个 _defer 结构，开始执行 defer 函数，如果为空的话会直接 break 并抛出错误的堆栈。

// If defer was started by earlier panic or Goexit (and, since we're back here, that triggered a new panic),
// take defer off list. The earlier panic or Goexit will not continue running.
if d.started {
	if d._panic != nil {
		d._panic.aborted = true
	}
	d._panic = nil
	d.fn = nil
	gp._defer = d.link
	freedefer(d)
	continue
}

// Mark defer as started, but keep on list, so that traceback
// can find and update the defer's argument frame if stack growth
// or a garbage collection happens before reflectcall starts executing d.fn.
d.started = true

// Record the panic that is running the defer.
// If there is a new panic during the deferred call, that panic
// will find d in the list and will mark d._panic (this panic) aborted.
d._panic = (*_panic)(noescape(unsafe.Pointer(&p)))

当一个 defer 函数开始执行时会将 started 标志置为 true，这样就可以知道是不是在这个 defer 函数执行过程中再次出现了 panic。下面修改 _panic 指针也是类似的操作，这些与我本次分享主题无关，就不展开叙述了。

p.argp = unsafe.Pointer(getargp(0))
reflectcall(nil, unsafe.Pointer(d.fn), deferArgs(d), uint32(d.siz), uint32(d.siz))
p.argp = nil

这里出现了函数执行逻辑的切换，gopanic 中会调用 reflectcall 去复制 defer 函数的参数并执行 defer 函数。

在 reflectcall 执行前修改 p.argp 为 unsafe.Pointer(getargp(0)) ，是当前 defer 函数调用的参数指针，或者说是 defer 函数的内存地址（这个地方我理解的可能有些问题），在 reflectcall 执行成功后再修改为 nil 避免影响下一次的循环。

if p.recovered {
	atomic.Xadd(&runningPanicDefers, -1)

	...

	mcall(recovery)
	throw("recovery failed") // mcall should not return
}

在 defer 函数执行成功后，通过 p.recovered 来判断是否已经成功 recover 并执行 recovery，这里不再展开。

recover

func gorecover(argp uintptr) interface{} {
	// Must be in a function running as part of a deferred call during the panic.
	// Must be called from the topmost function of the call
	// (the function used in the defer statement).
	// p.argp is the argument pointer of that topmost deferred function call.
	// Compare against argp reported by caller.
	// If they match, the caller is the one who can recover.
	gp := getg()
	p := gp._panic
	if p != nil && !p.recovered && argp == uintptr(p.argp) {
		p.recovered = true
		return p.arg
	}
	return nil
}

传入 gorecover 函数的 argp 是 recover 这个函数的调用者的地址。

recover 主要做的事情就是检查当前 goroutine 中是否存在 panic，panic 是否已经被 recover，以及调用者是否一致。如果检查通过的话就修改 p.recovered 为 true，并返回 panic 创建时传入的参数，否则就直接返回 nil。

解释

刚才简单分析了一下 defer && panic && recover 是如何工作的，下面可以利用刚才了解到的原理来解释我遇到的现象了：

func TestRecoverInClosure(t *testing.T) {
	defer func() {          <---- argp: 0x01
		// This should be the callback function of a break point.
		// Let's call them directly for simpler example.
		func() {
			recover()       <---- argp: 0x02
		}()
	}()
	panic("panic in test")
}

将这个 defer 函数加入 goroutine 的 _defer 列表
执行 panic，检查是否存在 defer 函数并执行
修改 p.argp 为 0x01，开始执行内部的匿名函数
recover 取到当前的调用者 argp 为 0x02，判断不通过，直接返回 nil
此时 p.recovered 仍然为 false，又没有更多的 defer 函数，进入 fatalpanic

困惑

上面对照着分析可以大概解释明白为什么 TestRecoverInClosure 中的 panic 捕获不到，但是很多被忽略的细节还是没有搞明白。

getargp

getargp 实现非常简单：

// getargp returns the location where the caller
// writes outgoing function call arguments.
//go:nosplit
//go:noinline
func getargp(x int) uintptr {
	// x is an argument mainly so that we can return its address.
	return uintptr(noescape(unsafe.Pointer(&x)))
}

为什么这就是当前 defer 函数调用的参数指针呢？

recover && gorecover

recover 是没有参数的，但是 gorecover 却有 argp 作为参数，跟下去可以看到这样的调用：

mkcall("gorecover", n.Type, init, nod(OADDR, nodfp, nil))

所以是 nod(OADDR, nodfp, nil) 取到了调用者的地址么？

总结

搞明白这个问题花费的时间比我想象的要更久，一方面是因为我对 go 内部的实现确实不太熟悉，另一方面是因为大多数的分享都集中在如何使用或者最佳实践之类的，讨论内部实现的文章不是很多。我要特别的推荐一下 @伊布的文章，他写的 Golang: 深入理解panic and recover 非常赞，对 panic && recover 切换和恢复过程具体实现感兴趣的同学不妨一读，定会有所收获。

Xuanwo's Blog