Golang 性能调优

3 年 ago

韵, 科

4 minutes

这篇文章是2014年12月17日的Go圣诞日历的文章。

我会谈论在Go中进行性能调优的话题。这些都是从Denco和Kocha等性能调优经验中得来的见解。请注意，这并不涉及处理系统的讨论。

前提 (Qian ti)

プロファイリングを取った後、じゃあどうやって最適化するかというところの話です

「推測するな、計測せよ」

アルゴリズムやデータ構造は最適なものが選択されていると仮定します

小手先の最適化を行うよりアルゴリズム自体を変えたほうが圧倒的に良くなります。

この記事の各ベンチマークは Go 1.4 （go version go1.4 linux/amd64）で下記のコマンドにて取っています

go test -run NONE -bench . -benchmem

ベンチマークの結果は環境に左右されるので、あくまで筆者の環境での結果です

减少内存分配的次数

在Go的代码中，导致性能下降最大的是内存分配。换句话说，减少内存分配次数可以显著提升性能。

package bench_test

import "testing"

func BenchmarkMemAllocOndemand(b *testing.B) {
    n := 10
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        s := make([]string, 0)
        for j := 0; j < n; j++ {
            s = append(s, "alice")
        }
    }
}

func BenchmarkMemAllocAllBeforeUsing(b *testing.B) {
    n := 10
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        s := make([]string, 0, n)
        for j := 0; j < n; j++ {
            s = append(s, "alice")
        }
    }
}

BenchmarkMemAllocOndemand        1000000              1953 ns/op             496 B/op         5 allocs/op
BenchmarkMemAllocAllBeforeUsing  3000000               571 ns/op             160 B/op         1 allocs/op

如果已经事先知道要素数，就不要使用 append。

在填充切片元素时，使用索引进行赋值比使用append方法追加元素更快。

package bench_test

import "testing"

func BenchmarkFillSliceByAppend(b *testing.B) {
    n := 100
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        s := make([]int, 0, n)
        for j := 0; j < n; j++ {
            s = append(s, j)
        }
    }
}

func BenchmarkFillSliceByIndex(b *testing.B) {
    n := 100
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        s := make([]int, n)
        for j := 0; j < n; j++ {
            s[j] = j
        }
    }
}

BenchmarkFillSliceByAppend        500000              2694 ns/op             896 B/op         1 allocs/op
BenchmarkFillSliceByIndex         500000              2487 ns/op             896 B/op         1 allocs/op

假设不使用频道

使用sync.Mutex或sync.RWMutex比使用channel进行互斥控制更快。

package bench_test

import (
    "sync"
    "testing"
)

func BenchmarkExclusiveWithChannel(b *testing.B) {
    c := make(chan struct{}, 1)
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        c <- struct{}{}
        // do something.
        <-c
    }
}

func BenchmarkExclusiveWithMutex(b *testing.B) {
    mu := new(sync.Mutex)
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        mu.Lock()
        // do something.
        mu.Unlock()
    }
}

BenchmarkExclusiveWithChannel   20000000                70.2 ns/op             0 B/op         0 allocs/op
BenchmarkExclusiveWithMutex     100000000               21.1 ns/op             0 B/op         0 allocs/op

同样，对于同步处理，使用sync.WaitGroup会更快速。

package bench_test

import (
    "sync"
    "testing"
)

func BenchmarkSyncWithChannel(b *testing.B) {
    n := 10
    c := make(chan struct{}, n)
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        for j := 0; j < n; j++ {
            go func() {
                // do something.
                c <- struct{}{}
            }()
        }
        for j := 0; j < n; j++ {
            <-c
        }
    }
}

func BenchmarkSyncWithWaitGroup(b *testing.B) {
    n := 10
    var wg sync.WaitGroup
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        wg.Add(n)
        for j := 0; j < n; j++ {
            go func() {
                // do something.
                wg.Done()
            }()
        }
        wg.Wait()
    }
}

BenchmarkSyncWithChannel          500000              3511 ns/op             160 B/op        10 allocs/op
BenchmarkSyncWithWaitGroup        500000              3086 ns/op             164 B/op        11 allocs/op

举个例子，不调用函数（方法）

Go语言中的函数和方法调用速度较慢。如果只涉及数行代码，编译器会进行内联展开，但不涉及内联展开的代码会比较慢。
以Denco的代码为例，故意使用GOTO语句来提高性能，在能实现递归调用的情况下，可以编写出更简洁的代码。

正则表达式不是瑞士军刀。

Go的正则表达式包与Ruby和Python等语言不同，它采用了一种保证线性时间完成处理的算法。

然而，代价是它非常慢，所以如果像其他语言那样随意使用它，性能会严重下降。
因此，在可读性等方面存在权衡的情况下，当真正需要速度时，最好自己编写字符串匹配和替换等功能。

package bench_test

import (
    "regexp"
    "testing"
)

func BenchmarkStringMatchWithRegexp(b *testing.B) {
    s := "0xDeadBeef"
    re := regexp.MustCompile(`^0[xX][0-9A-Fa-f]+$`)
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        re.MatchString(s)
    }
}

func BenchmarkStringMatchWithoutRegexp(b *testing.B) {
    s := "0xDeadBeef"
    isHexString := func(s string) bool {
        if len(s) < 3 || s[0] != '0' || s[1] != 'x' && s[1] != 'X' {
            return false
        }
        for _, c := range s[2:] {
            if c < '0' || '9' < c && c < 'A' || 'F' < c && c < 'a' || 'f' < c {
                return false
            }
        }
        return true
    }
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        isHexString(s)
    }
}

BenchmarkStringMatchWithRegexp      2000000               643 ns/op             0 B/op         0 allocs/op
BenchmarkStringMatchWithoutRegexp   30000000               55.0 ns/op           0 B/op         0 allocs/op

第一個番外編

虽然与性能调优不同，但启动goroutine也是需要成本的，因此不是仅仅加上”go”就能让程序变得更快，相反，对于轻量级的处理来说，顺序处理反而更快。

package bench_test

import (
    "sync"
    "testing"
)

func BenchmarkGoroutine(b *testing.B) {
    n := 10
    var wg sync.WaitGroup
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        wg.Add(n)
        for j := 0; j < n; j++ {
            go func() {
                wg.Done()
            }()
        }
        wg.Wait()
    }
}

func BenchmarkSequential(b *testing.B) {
    n := 10
    var wg sync.WaitGroup
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        wg.Add(n)
        for j := 0; j < n; j++ {
            func() {
                wg.Done()
            }()
        }
        wg.Wait()
    }
}

BenchmarkGoroutine        500000              3097 ns/op             164 B/op       11 allocs/op
BenchmarkSequential     10000000               224 ns/op               0 B/op        0 allocs/op

我对这个反转的阈值有兴趣，想知道它会在多大程度上发生。

第二個額外章節

无论是直接在循环中的条件部分写入len()，还是先将其赋给变量再使用，都不会有明显的差异。

package bench_test

import "testing"

func BenchmarkLenDirect(b *testing.B) {
    n := 1000
    s := make([]string, n)
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        for j := 0; j < len(s); j++ {
            _ = s[j]
        }
    }
}

func BenchmarkLenCached(b *testing.B) {
    n := 1000
    s := make([]string, n)
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        for j, length := 0, len(s); j < length; j++ {
            _ = s[j]
        }
    }
}

BenchmarkLenDirect       1000000              1221 ns/op               0 B/op        0 allocs/op
BenchmarkLenCached       1000000              1221 ns/op               0 B/op        0 allocs/op

总结

不要盲目地相信这样的文章，你需要根据自己的标准来做评估！

这篇文章所使用的基准测试可以在 https://github.com/naoina/go-adventcalendar-2014-bench 找到。