为了满足分布式系统测试的需求,我们经常需要在代码中埋下断点,以便于通过修改编译参数或者注册特定 Hook 的方式来强迫程序走特定的逻辑。这篇文章主要分享了我在实现 BreakPoint 时发现的 Golang Panic && Recover 的一个好玩行为及其背后的原因。
复现
package runtime
import (
"testing"
)
func TestRecover(t *testing.T) {
defer func() {
recover()
}()
panic("panic in test")
}
func TestRecoverInClosure(t *testing.T) {
defer func() {
// This should be the callback function of a break point.
// Let's call them directly for simpler example.
func() {
recover()
}()
}()
panic("panic in test")
}
TestRecover
演示的是一个比较常见的情况,业务逻辑中可能会出现 panic,我们在 defer 的函数中执行 recover 并做进一步的处理。而 TestRecoverInClosure
中演示的则是我原本想要实现的逻辑,断点在触发时去调用在注册断点时传入的回调函数,在回调函数中去执行 recover 并获得 panic 的现场内容。但是事实证明这样是行不通的,在 TestRecoverInClosure
中,panic 并没有被捕获,而是直接抛到了最外层,在闭包中的 recover 也自然是什么都没有拿到,翻车现场如下:
=== RUN TestRecoverInClosure
--- FAIL: TestRecoverInClosure (0.00s)
panic: panic in test [recovered]
panic: panic in test
goroutine 6 [running]:
testing.tRunner.func1(0xc000138400)
/usr/lib/go/src/testing/testing.go:830 +0x392
panic(0x8c1140, 0xb4d1a0)
/usr/lib/go/src/runtime/panic.go:522 +0x1b5
xuanwo/playground/runtime.TestRecoverInClosure(0xc000138400)
/home/xuanwo/Code/xuanwo/playground/runtime/panic_test.go:22 +0x55
testing.tRunner(0xc000138400, 0xad0678)
/usr/lib/go/src/testing/testing.go:865 +0xc0
created by testing.(*T).Run
/usr/lib/go/src/testing/testing.go:916 +0x35a
原因
为了搞清楚问题的原因,首先需要知道 panic && defer 是怎么工作。Golang 中 panic 和 defer 实现的相关代码主要是在 /usr/lib/go/src/runtime/panic.go
中,下文贴出来的代码来自于 Go 1.12.5。
defer
在了解 panic 之前,首先看看 defer 是如何实现并存储的:
// Allocate a Defer, usually using per-P pool.
// Each defer must be released with freedefer.
//
// This must not grow the stack because there may be a frame without
// stack map information when this is called.
//
//go:nosplit
func newdefer(siz int32) *_defer {
var d *_defer
sc := deferclass(uintptr(siz))
gp := getg()
if sc < uintptr(len(p{}.deferpool)) {
...
}
if d == nil {
// Allocate new defer+args.
systemstack(func() {
total := roundupsize(totaldefersize(uintptr(siz)))
d = (*_defer)(mallocgc(total, deferType, true))
})
...
}
d.siz = siz
d.link = gp._defer
gp._defer = d
return d
}
这里的
getg()
返回的是当前正在执行的 goroutine。
这里可以忽略掉具体的实现细节,只需要关注初始化 defer 和更新 gp._defer
的过程。不难看出 _defer
结构体是以链表的形式存储在 gouroutine 中的,下面 panic 的实现会高度依赖这一点。
panic
下面来看一下 panic 的实现,首先看一下整体的结构,然后挑出一些我认为需要关注的地方展开聊一聊。
// The implementation of the predeclared function panic.
func gopanic(e interface{}) {
gp := getg()
...
var p _panic
p.arg = e
p.link = gp._panic
gp._panic = (*_panic)(noescape(unsafe.Pointer(&p)))
atomic.Xadd(&runningPanicDefers, 1)
for {
d := gp._defer
if d == nil {
break
}
// If defer was started by earlier panic or Goexit (and, since we're back here, that triggered a new panic),
// take defer off list. The earlier panic or Goexit will not continue running.
if d.started {
if d._panic != nil {
d._panic.aborted = true
}
d._panic = nil
d.fn = nil
gp._defer = d.link
freedefer(d)
continue
}
// Mark defer as started, but keep on list, so that traceback
// can find and update the defer's argument frame if stack growth
// or a garbage collection happens before reflectcall starts executing d.fn.
d.started = true
// Record the panic that is running the defer.
// If there is a new panic during the deferred call, that panic
// will find d in the list and will mark d._panic (this panic) aborted.
d._panic = (*_panic)(noescape(unsafe.Pointer(&p)))
p.argp = unsafe.Pointer(getargp(0))
reflectcall(nil, unsafe.Pointer(d.fn), deferArgs(d), uint32(d.siz), uint32(d.siz))
p.argp = nil
...
pc := d.pc
sp := unsafe.Pointer(d.sp) // must be pointer so it gets adjusted during stack copy
freedefer(d)
if p.recovered {
atomic.Xadd(&runningPanicDefers, -1)
gp._panic = p.link
// Aborted panics are marked but remain on the g.panic list.
// Remove them from the list.
for gp._panic != nil && gp._panic.aborted {
gp._panic = gp._panic.link
}
if gp._panic == nil { // must be done with signal
gp.sig = 0
}
// Pass information about recovering frame to recovery.
gp.sigcode0 = uintptr(sp)
gp.sigcode1 = pc
mcall(recovery)
throw("recovery failed") // mcall should not return
}
}
// ran out of deferred calls - old-school panic now
// Because it is unsafe to call arbitrary user code after freezing
// the world, we call preprintpanics to invoke all necessary Error
// and String methods to prepare the panic strings before startpanic.
preprintpanics(gp._panic)
fatalpanic(gp._panic) // should not return
*(*int)(nil) = 0 // not reached
}
跟 _defer
一样,_panic
结构也是以链表形式存储在 goroutine 中的。
var p _panic
p.arg = e
p.link = gp._panic
gp._panic = (*_panic)(noescape(unsafe.Pointer(&p)))
首先取出第一个 panic 节点,然后进入 for 循环。
d := gp._defer
if d == nil {
break
}
取出对头的第一个 _defer
结构,开始执行 defer 函数,如果为空的话会直接 break 并抛出错误的堆栈。
// If defer was started by earlier panic or Goexit (and, since we're back here, that triggered a new panic),
// take defer off list. The earlier panic or Goexit will not continue running.
if d.started {
if d._panic != nil {
d._panic.aborted = true
}
d._panic = nil
d.fn = nil
gp._defer = d.link
freedefer(d)
continue
}
// Mark defer as started, but keep on list, so that traceback
// can find and update the defer's argument frame if stack growth
// or a garbage collection happens before reflectcall starts executing d.fn.
d.started = true
// Record the panic that is running the defer.
// If there is a new panic during the deferred call, that panic
// will find d in the list and will mark d._panic (this panic) aborted.
d._panic = (*_panic)(noescape(unsafe.Pointer(&p)))
当一个 defer 函数开始执行时会将 started
标志置为 true
,这样就可以知道是不是在这个 defer 函数执行过程中再次出现了 panic。下面修改 _panic
指针也是类似的操作,这些与我本次分享主题无关,就不展开叙述了。
p.argp = unsafe.Pointer(getargp(0))
reflectcall(nil, unsafe.Pointer(d.fn), deferArgs(d), uint32(d.siz), uint32(d.siz))
p.argp = nil
这里出现了函数执行逻辑的切换,gopanic 中会调用 reflectcall
去复制 defer 函数的参数并执行 defer 函数。
在 reflectcall
执行前修改 p.argp
为 unsafe.Pointer(getargp(0))
,是当前 defer 函数调用的参数指针,或者说是 defer 函数的内存地址(这个地方我理解的可能有些问题),在 reflectcall
执行成功后再修改为 nil 避免影响下一次的循环。
if p.recovered {
atomic.Xadd(&runningPanicDefers, -1)
...
mcall(recovery)
throw("recovery failed") // mcall should not return
}
在 defer 函数执行成功后,通过 p.recovered
来判断是否已经成功 recover 并执行 recovery,这里不再展开。
recover
func gorecover(argp uintptr) interface{} {
// Must be in a function running as part of a deferred call during the panic.
// Must be called from the topmost function of the call
// (the function used in the defer statement).
// p.argp is the argument pointer of that topmost deferred function call.
// Compare against argp reported by caller.
// If they match, the caller is the one who can recover.
gp := getg()
p := gp._panic
if p != nil && !p.recovered && argp == uintptr(p.argp) {
p.recovered = true
return p.arg
}
return nil
}
传入
gorecover
函数的argp
是recover
这个函数的调用者的地址。
recover 主要做的事情就是检查当前 goroutine 中是否存在 panic,panic 是否已经被 recover,以及调用者是否一致。如果检查通过的话就修改 p.recovered
为 true,并返回 panic 创建时传入的参数,否则就直接返回 nil。
解释
刚才简单分析了一下 defer && panic && recover 是如何工作的,下面可以利用刚才了解到的原理来解释我遇到的现象了:
func TestRecoverInClosure(t *testing.T) {
defer func() { <---- argp: 0x01
// This should be the callback function of a break point.
// Let's call them directly for simpler example.
func() {
recover() <---- argp: 0x02
}()
}()
panic("panic in test")
}
- 将这个 defer 函数加入 goroutine 的
_defer
列表 - 执行 panic,检查是否存在 defer 函数并执行
- 修改
p.argp
为 0x01,开始执行内部的匿名函数 - recover 取到当前的调用者 argp 为 0x02,判断不通过,直接返回 nil
- 此时
p.recovered
仍然为false
,又没有更多的 defer 函数,进入 fatalpanic
困惑
上面对照着分析可以大概解释明白为什么 TestRecoverInClosure 中的 panic 捕获不到,但是很多被忽略的细节还是没有搞明白。
getargp
getargp
实现非常简单:
// getargp returns the location where the caller
// writes outgoing function call arguments.
//go:nosplit
//go:noinline
func getargp(x int) uintptr {
// x is an argument mainly so that we can return its address.
return uintptr(noescape(unsafe.Pointer(&x)))
}
为什么这就是当前 defer 函数调用的参数指针呢?
recover && gorecover
recover
是没有参数的,但是 gorecover
却有 argp 作为参数,跟下去可以看到这样的调用:
mkcall("gorecover", n.Type, init, nod(OADDR, nodfp, nil))
所以是 nod(OADDR, nodfp, nil)
取到了调用者的地址么?
总结
搞明白这个问题花费的时间比我想象的要更久,一方面是因为我对 go 内部的实现确实不太熟悉,另一方面是因为大多数的分享都集中在如何使用 或者最佳实践之类的,讨论内部实现的文章不是很多。我要特别的推荐一下 @伊布 的文章,他写的 Golang: 深入理解panic and recover 非常赞,对 panic && recover 切换和恢复过程具体实现感兴趣的同学不妨一读,定会有所收获。