转载

Golang Profiling

主要介绍golang profiling的一些相关实践和参考链接，这一块内容比较多，先列个标题在这里，一点一点慢慢补充吧。

关于goprofile

比如程序写好了，如何对其进行定量的分析？

主要是通过golang提供的profilling tools来找到程序运行的bentchmark

之前陆续整理过一些相关的，这次以一个实际项目为例，分析下go profile的使用。

竞态分析

go test -v -race 发现原程序的哪些方面存在竞态问题。

benchmark testing

bench是go test

cpu profiling

注意，只有在 pprof.StopCPUProfile() 被调用了之后，数据才会被写入到摘要文件中，像那种server类的程序，一直阻塞在中间的某个地方，就是main函数一直不会结束。可以通过捕获中断信号，比如中断的时候执行停止服务，等等类似的操作，在这个地方执行 pprof.StopCPUProfile() 操作，将采样信息输出到文件中。

比如下面一个输出

(pprof) top5 140ms of 160ms total (87.50%) Showing top 5 nodes out of 52 (cum >= 10ms)       flat  flat%   sum%        cum   cum%       90ms 56.25% 56.25%       90ms 56.25%  syscall.Syscall       20ms 12.50% 68.75%       20ms 12.50%  runtime.futex       10ms  6.25% 75.00%       10ms  6.25%  runtime.duffzero       10ms  6.25% 81.25%       10ms  6.25%  runtime.mapaccess2       10ms  6.25% 87.50%       10ms  6.25%  runtime.stackmapdata

???指标的含义？ flat表示本地采样率，就是运行这个函数的时间，以及所占的百分比。sum%是累计起来的所占的百分比。cum累计取样？

实际情况下，注意对这里的cum的理解，因为实际情况下，各个方法的调用都是重叠的。通过 web 命令的图中，可以观察到各种重叠，比如系统调用，可能好多函数都用到了，通过web显示也可以说明各个函数的cpu占用时间具体都被分散到了哪些地方。

通过web画出的图中，每个方框都表示一个单的函数，从A方框到指向B方框的箭头就表示A函数对B函数的调用。边上的时间就表示这次调用所占用的时间。（在之前的版本这里显示的是调用次数？）也可以在web命令之后添加参数，只划出包含此次调用的函数拓扑图。

注意一下 flat cum 两个参数的属性：

The first three columns are the number of samples taken while running that line, the number of samples taken while running that line or in code called from that line, and the line number in the file.

旧版本的时候使用的是samples数目，新版本使用的是时间，blog上还是旧的信息，大概理解一下，flat参数应该是说：运行到这一行的采用时间，cum应该是运行这一行的采样时间加上从这一行开始调用的其他代码的采样时间。所以cum应该是>=flat的。然而这一块太具体的还不是很理解。。。

实际调优中比较关注我们自己写的那些代码，而不是golang本身的调用。这样的话，按照cum来排序比较能说明问题 top -cum

实例演示

下面参考官方博客的流程，跟着走一遍，体会一下具体的profile过程。

官房博客使用的go tool 版本较旧，具体的代码可以从这里下到。

首先是havlak1.go与havlak2.go的区别，首先是havlak1.go中比较费时的DFS函数：

(pprof) list DFS Total: 21.85s ROUTINE ======================== main.DFS in /home/wangzhe/goworkspace/src/benchgraffiti/havlak/havlak1.go      1.44s      7.06s (flat, cum) 32.31% of Total          .          .    234: return false          .          .    235:}          .          .    236:          .          .    237:// DFS - Depth-First-Search and node numbering.          .          .    238://       10ms       10ms    239:func DFS(currentNode *BasicBlock, nodes []*UnionFindNode, number map[*BasicBlock]int, last []int, current int) int {       30ms       80ms    240: nodes[current].Init(currentNode, current)       10ms      300ms    241: number[currentNode] = current          .          .    242:          .          .    243: lastid := current      1.07s      1.07s    244: for _, target := range currentNode.OutEdges {      120ms      1.42s    245:     if number[target] == unvisited {       60ms      3.56s    246:         lastid = DFS(target, nodes, number, last, lastid+1)          .          .    247:     }          .          .    248: }      110ms      590ms    249: last[number[currentNode]] = lastid       30ms       30ms    250: return lastid          .          .    251:}          .          .    252:          .          .    253:// FindLoops          .          .    254://          .          .    255:// Find loops and build loop forest using Havlak's algorithm, which

可以看到，241 249 这些从map中取数据的操作都比较费时。实际上这里的index是一个BasicBlock指针，每次map的时候，需要对index进行hash操作，况且这个函数是一个递归的调用，显然这里浪费了许多时间，进一步观察，由于这里的BasicBlock的Name是一个int值，所以直接使用int数组[]int这种"特殊的map"来进行存储即可，可以看下havlak2.go中的修改，以及修改之后的cpuprofile。

修改之后的方法，可以看到numers改成了[]int的形式：

func DFS(currentNode *BasicBlock, nodes []*UnionFindNode, number []int, last []int, current int) int {         nodes[current].Init(currentNode, current)         number[currentNode.Name] = current          lastid := current         for _, target := range currentNode.OutEdges {                 if number[target.Name] == unvisited {                         lastid = DFS(target, nodes, number, last, lastid+1)                 }         }         last[number[currentNode.Name]] = lastid         return lastid }

修改之后的cpuprofile结果：

18200ms of 30990ms total (58.73%) Dropped 38 nodes (cum <= 154.95ms) Showing top 10 nodes out of 69 (cum >= 2350ms)       flat  flat%   sum%        cum   cum%     6260ms 20.20% 20.20%     6260ms 20.20%  scanblock     3090ms  9.97% 30.17%    28390ms 91.61%  main.FindLoops     1760ms  5.68% 35.85%    17320ms 55.89%  runtime.mallocgc     1730ms  5.58% 41.43%     3970ms 12.81%  sweep     1010ms  3.26% 44.69%     1110ms  3.58%  main.DFS      970ms  3.13% 47.82%     1150ms  3.71%  MCentral_Alloc      890ms  2.87% 50.69%     2090ms  6.74%  hash_insert_internal      860ms  2.78% 53.47%     4610ms 14.88%  runtime.MCache_Alloc      830ms  2.68% 56.15%      830ms  2.68%  runtime.rnd      800ms  2.58% 58.73%     2350ms  7.58%  runtime.MCache_Free

在修改之前，main.DFS的结果是16.29%，正确使用数据结构后减少不必要的开销。和之前的数据进行比较，发现虽然DFS减少了许多，但是 runtime.mallocgc 这个函数占用了不少时间。这个函数是进行garbage collection的时候执行的。为了找到gc运行占用cpu的原因，需要找到究竟是哪些操作占用了内存，这就需要引入memory profiling。

golang的gc问题会有比较多的坑，具体的几个例子可以看这个文章，这个不做为重点。

在havlak3.go中添加了memoryprofile的操作，具体方式与cpuprofile比较类似。

参考资料：

https://www.zhihu.com/question/21615032/answer/18781477

http://tonybai.com/2015/08/25/go-debugging-profiling-optimization/

net/pprof 也可以通过pprof工具来查看 https://golang.org/pkg/net/http/pprof/

pprof使用中文版介绍（很详细） https://github.com/hyper-carrot/go_command_tutorial/blob/master/0.12.md

关于profile文件的输出问题： http://golanghome.com/post/576

一次优化实例 http://studygolang.com/wr?u=http%3a%2f%2fwww.zenlife.tk%2fgo-pprof.md

雨痕学堂微信公众号各种性能优化技巧

原文 http://wangzhezhe.github.io/blog/2016/04/30/golang-profiling/

正文到此结束