为了满足分布式系统测试的需求,我们经常需要在代码中埋下断点,以便于通过修改编译参数或者注册特定 Hook 的方式来强迫程序走特定的逻辑。这篇文章主要分享了我在实现 BreakPoint 时发现的 Golang Panic && Recover 的一个好玩行为及其背后的原因。

复现

package runtime

import (
	"testing"
)

func TestRecover(t *testing.T) {
	defer func() {
		recover()
	}()
	panic("panic in test")
}

func TestRecoverInClosure(t *testing.T) {
	defer func() {
		// This should be the callback function of a break point.
		// Let's call them directly for simpler example.
		func() {
			recover()
		}()
	}()
	panic("panic in test")
}

TestRecover 演示的是一个比较常见的情况,业务逻辑中可能会出现 panic,我们在 defer 的函数中执行 recover 并做进一步的处理。而 TestRecoverInClosure 中演示的则是我原本想要实现的逻辑,断点在触发时去调用在注册断点时传入的回调函数,在回调函数中去执行 recover 并获得 panic 的现场内容。但是事实证明这样是行不通的,在 TestRecoverInClosure 中,panic 并没有被捕获,而是直接抛到了最外层,在闭包中的 recover 也自然是什么都没有拿到,翻车现场如下:

=== RUN   TestRecoverInClosure
--- FAIL: TestRecoverInClosure (0.00s)
panic: panic in test [recovered]
	panic: panic in test

goroutine 6 [running]:
testing.tRunner.func1(0xc000138400)
	/usr/lib/go/src/testing/testing.go:830 +0x392
panic(0x8c1140, 0xb4d1a0)
	/usr/lib/go/src/runtime/panic.go:522 +0x1b5
xuanwo/playground/runtime.TestRecoverInClosure(0xc000138400)
	/home/xuanwo/Code/xuanwo/playground/runtime/panic_test.go:22 +0x55
testing.tRunner(0xc000138400, 0xad0678)
	/usr/lib/go/src/testing/testing.go:865 +0xc0
created by testing.(*T).Run
	/usr/lib/go/src/testing/testing.go:916 +0x35a

原因

为了搞清楚问题的原因,首先需要知道 panic && defer 是怎么工作。Golang 中 panic 和 defer 实现的相关代码主要是在 /usr/lib/go/src/runtime/panic.go 中,下文贴出来的代码来自于 Go 1.12.5。

defer

在了解 panic 之前,首先看看 defer 是如何实现并存储的:

// Allocate a Defer, usually using per-P pool.
// Each defer must be released with freedefer.
//
// This must not grow the stack because there may be a frame without
// stack map information when this is called.
//
//go:nosplit
func newdefer(siz int32) *_defer {
	var d *_defer
	sc := deferclass(uintptr(siz))
	gp := getg()
	if sc < uintptr(len(p{}.deferpool)) {
		...
	}
	if d == nil {
		// Allocate new defer+args.
		systemstack(func() {
			total := roundupsize(totaldefersize(uintptr(siz)))
			d = (*_defer)(mallocgc(total, deferType, true))
		})
		...
	}
	d.siz = siz
	d.link = gp._defer
	gp._defer = d
	return d
}

这里的 getg() 返回的是当前正在执行的 goroutine。

这里可以忽略掉具体的实现细节,只需要关注初始化 defer 和更新 gp._defer 的过程。不难看出 _defer 结构体是以链表的形式存储在 gouroutine 中的,下面 panic 的实现会高度依赖这一点。

panic

下面来看一下 panic 的实现,首先看一下整体的结构,然后挑出一些我认为需要关注的地方展开聊一聊。

// The implementation of the predeclared function panic.
func gopanic(e interface{}) {
	gp := getg()

	...

	var p _panic
	p.arg = e
	p.link = gp._panic
	gp._panic = (*_panic)(noescape(unsafe.Pointer(&p)))

	atomic.Xadd(&runningPanicDefers, 1)

	for {
		d := gp._defer
		if d == nil {
			break
		}

		// If defer was started by earlier panic or Goexit (and, since we're back here, that triggered a new panic),
		// take defer off list. The earlier panic or Goexit will not continue running.
		if d.started {
			if d._panic != nil {
				d._panic.aborted = true
			}
			d._panic = nil
			d.fn = nil
			gp._defer = d.link
			freedefer(d)
			continue
		}

		// Mark defer as started, but keep on list, so that traceback
		// can find and update the defer's argument frame if stack growth
		// or a garbage collection happens before reflectcall starts executing d.fn.
		d.started = true

		// Record the panic that is running the defer.
		// If there is a new panic during the deferred call, that panic
		// will find d in the list and will mark d._panic (this panic) aborted.
		d._panic = (*_panic)(noescape(unsafe.Pointer(&p)))

		p.argp = unsafe.Pointer(getargp(0))
		reflectcall(nil, unsafe.Pointer(d.fn), deferArgs(d), uint32(d.siz), uint32(d.siz))
		p.argp = nil

		...

		pc := d.pc
		sp := unsafe.Pointer(d.sp) // must be pointer so it gets adjusted during stack copy
		freedefer(d)
		if p.recovered {
			atomic.Xadd(&runningPanicDefers, -1)

			gp._panic = p.link
			// Aborted panics are marked but remain on the g.panic list.
			// Remove them from the list.
			for gp._panic != nil && gp._panic.aborted {
				gp._panic = gp._panic.link
			}
			if gp._panic == nil { // must be done with signal
				gp.sig = 0
			}
			// Pass information about recovering frame to recovery.
			gp.sigcode0 = uintptr(sp)
			gp.sigcode1 = pc
			mcall(recovery)
			throw("recovery failed") // mcall should not return
		}
	}

	// ran out of deferred calls - old-school panic now
	// Because it is unsafe to call arbitrary user code after freezing
	// the world, we call preprintpanics to invoke all necessary Error
	// and String methods to prepare the panic strings before startpanic.
	preprintpanics(gp._panic)

	fatalpanic(gp._panic) // should not return
	*(*int)(nil) = 0      // not reached
}

_defer 一样,_panic 结构也是以链表形式存储在 goroutine 中的。

var p _panic
p.arg = e
p.link = gp._panic
gp._panic = (*_panic)(noescape(unsafe.Pointer(&p)))

首先取出第一个 panic 节点,然后进入 for 循环。

d := gp._defer
if d == nil {
	break
}

取出对头的第一个 _defer 结构,开始执行 defer 函数,如果为空的话会直接 break 并抛出错误的堆栈。

// If defer was started by earlier panic or Goexit (and, since we're back here, that triggered a new panic),
// take defer off list. The earlier panic or Goexit will not continue running.
if d.started {
	if d._panic != nil {
		d._panic.aborted = true
	}
	d._panic = nil
	d.fn = nil
	gp._defer = d.link
	freedefer(d)
	continue
}

// Mark defer as started, but keep on list, so that traceback
// can find and update the defer's argument frame if stack growth
// or a garbage collection happens before reflectcall starts executing d.fn.
d.started = true

// Record the panic that is running the defer.
// If there is a new panic during the deferred call, that panic
// will find d in the list and will mark d._panic (this panic) aborted.
d._panic = (*_panic)(noescape(unsafe.Pointer(&p)))

当一个 defer 函数开始执行时会将 started 标志置为 true,这样就可以知道是不是在这个 defer 函数执行过程中再次出现了 panic。下面修改 _panic 指针也是类似的操作,这些与我本次分享主题无关,就不展开叙述了。

p.argp = unsafe.Pointer(getargp(0))
reflectcall(nil, unsafe.Pointer(d.fn), deferArgs(d), uint32(d.siz), uint32(d.siz))
p.argp = nil

这里出现了函数执行逻辑的切换,gopanic 中会调用 reflectcall 去复制 defer 函数的参数并执行 defer 函数。

reflectcall 执行前修改 p.argpunsafe.Pointer(getargp(0)) ,是当前 defer 函数调用的参数指针,或者说是 defer 函数的内存地址(这个地方我理解的可能有些问题),在 reflectcall 执行成功后再修改为 nil 避免影响下一次的循环。

if p.recovered {
	atomic.Xadd(&runningPanicDefers, -1)

	...

	mcall(recovery)
	throw("recovery failed") // mcall should not return
}

在 defer 函数执行成功后,通过 p.recovered 来判断是否已经成功 recover 并执行 recovery,这里不再展开。

recover

func gorecover(argp uintptr) interface{} {
	// Must be in a function running as part of a deferred call during the panic.
	// Must be called from the topmost function of the call
	// (the function used in the defer statement).
	// p.argp is the argument pointer of that topmost deferred function call.
	// Compare against argp reported by caller.
	// If they match, the caller is the one who can recover.
	gp := getg()
	p := gp._panic
	if p != nil && !p.recovered && argp == uintptr(p.argp) {
		p.recovered = true
		return p.arg
	}
	return nil
}

传入 gorecover 函数的 argprecover 这个函数的调用者的地址。

recover 主要做的事情就是检查当前 goroutine 中是否存在 panic,panic 是否已经被 recover,以及调用者是否一致。如果检查通过的话就修改 p.recovered 为 true,并返回 panic 创建时传入的参数,否则就直接返回 nil。

解释

刚才简单分析了一下 defer && panic && recover 是如何工作的,下面可以利用刚才了解到的原理来解释我遇到的现象了:

func TestRecoverInClosure(t *testing.T) {
	defer func() {          <---- argp: 0x01
		// This should be the callback function of a break point.
		// Let's call them directly for simpler example.
		func() {
			recover()       <---- argp: 0x02
		}()
	}()
	panic("panic in test")
}
  1. 将这个 defer 函数加入 goroutine 的 _defer 列表
  2. 执行 panic,检查是否存在 defer 函数并执行
  3. 修改 p.argp 为 0x01,开始执行内部的匿名函数
  4. recover 取到当前的调用者 argp 为 0x02,判断不通过,直接返回 nil
  5. 此时 p.recovered 仍然为 false,又没有更多的 defer 函数,进入 fatalpanic

困惑

上面对照着分析可以大概解释明白为什么 TestRecoverInClosure 中的 panic 捕获不到,但是很多被忽略的细节还是没有搞明白。

getargp

getargp 实现非常简单:

// getargp returns the location where the caller
// writes outgoing function call arguments.
//go:nosplit
//go:noinline
func getargp(x int) uintptr {
	// x is an argument mainly so that we can return its address.
	return uintptr(noescape(unsafe.Pointer(&x)))
}

为什么这就是当前 defer 函数调用的参数指针呢?

recover && gorecover

recover 是没有参数的,但是 gorecover 却有 argp 作为参数,跟下去可以看到这样的调用:

mkcall("gorecover", n.Type, init, nod(OADDR, nodfp, nil))

所以是 nod(OADDR, nodfp, nil) 取到了调用者的地址么?

总结

搞明白这个问题花费的时间比我想象的要更久,一方面是因为我对 go 内部的实现确实不太熟悉,另一方面是因为大多数的分享都集中在如何使用 或者最佳实践之类的,讨论内部实现的文章不是很多。我要特别的推荐一下 @伊布 的文章,他写的 Golang: 深入理解panic and recover 非常赞,对 panic && recover 切换和恢复过程具体实现感兴趣的同学不妨一读,定会有所收获。

参考资料