go - The Go programming language

Age	Commit message (Collapse)	Author
2020-11-02	runtime: improve memmove performance on arm64	Jonathan Swinney
	Replace the memmove implementation for moves of 17 bytes or larger with an implementation from ARM optimized software. The moves of 16 bytes or fewer are unchanged, but the registers used are updated to match the rest of the implementation. This implementation makes use of new optimizations: - software pipelined loop for large (>128 byte) moves - medium size moves (17..128 bytes) have a new implementation - address realignment when src or dst is unaligned - preference for aligned src (loads) or dst (stores) depending on CPU To support preference for aligned loads or aligned stores, a new CPU flag is added. This flag indicates that the detected micro architecture performs better with aligned loads. Some tested CPUs did not exhibit a significant difference and are left with the default behavior of realigning based on the destination address (stores). Neoverse N1 (Tested on Graviton 2) name old time/op new time/op delta Memmove/0-4 1.88ns ± 1% 1.87ns ± 1% -0.58% (p=0.020 n=10+10) Memmove/1-4 4.40ns ± 0% 4.40ns ± 0% ~ (all equal) Memmove/8-4 3.88ns ± 3% 3.80ns ± 0% -1.97% (p=0.001 n=10+9) Memmove/16-4 3.90ns ± 3% 3.80ns ± 0% -2.49% (p=0.000 n=10+9) Memmove/32-4 4.80ns ± 0% 4.40ns ± 0% -8.33% (p=0.000 n=9+8) Memmove/64-4 5.86ns ± 0% 5.00ns ± 0% -14.76% (p=0.000 n=8+8) Memmove/128-4 8.46ns ± 0% 8.06ns ± 0% -4.62% (p=0.000 n=10+10) Memmove/256-4 12.4ns ± 0% 12.2ns ± 0% -1.61% (p=0.000 n=10+10) Memmove/512-4 19.5ns ± 0% 19.1ns ± 0% -2.05% (p=0.000 n=10+10) Memmove/1024-4 33.7ns ± 0% 33.5ns ± 0% -0.59% (p=0.000 n=10+10) Memmove/2048-4 62.1ns ± 0% 59.0ns ± 0% -4.99% (p=0.000 n=10+10) Memmove/4096-4 117ns ± 1% 110ns ± 0% -5.66% (p=0.000 n=10+10) MemmoveUnalignedDst/64-4 6.41ns ± 0% 5.62ns ± 0% -12.32% (p=0.000 n=10+7) MemmoveUnalignedDst/128-4 9.40ns ± 0% 8.34ns ± 0% -11.24% (p=0.000 n=10+10) MemmoveUnalignedDst/256-4 12.8ns ± 0% 12.8ns ± 0% ~ (all equal) MemmoveUnalignedDst/512-4 20.4ns ± 0% 19.7ns ± 0% -3.43% (p=0.000 n=9+10) MemmoveUnalignedDst/1024-4 34.1ns ± 0% 35.1ns ± 0% +2.93% (p=0.000 n=9+9) MemmoveUnalignedDst/2048-4 61.5ns ± 0% 60.4ns ± 0% -1.77% (p=0.000 n=10+10) MemmoveUnalignedDst/4096-4 122ns ± 0% 113ns ± 0% -7.38% (p=0.002 n=8+10) MemmoveUnalignedSrc/64-4 7.25ns ± 1% 6.26ns ± 0% -13.64% (p=0.000 n=9+9) MemmoveUnalignedSrc/128-4 10.5ns ± 0% 9.7ns ± 0% -7.52% (p=0.000 n=10+10) MemmoveUnalignedSrc/256-4 17.1ns ± 0% 17.3ns ± 0% +1.17% (p=0.000 n=10+10) MemmoveUnalignedSrc/512-4 27.0ns ± 0% 27.0ns ± 0% ~ (all equal) MemmoveUnalignedSrc/1024-4 46.7ns ± 0% 35.7ns ± 0% -23.55% (p=0.000 n=10+9) MemmoveUnalignedSrc/2048-4 85.2ns ± 0% 61.2ns ± 0% -28.17% (p=0.000 n=10+8) MemmoveUnalignedSrc/4096-4 162ns ± 0% 113ns ± 0% -30.25% (p=0.000 n=10+10) name old speed new speed delta Memmove/4096-4 35.2GB/s ± 0% 37.1GB/s ± 0% +5.56% (p=0.000 n=10+9) MemmoveUnalignedSrc/1024-4 21.9GB/s ± 0% 28.7GB/s ± 0% +30.90% (p=0.000 n=10+10) MemmoveUnalignedSrc/2048-4 24.0GB/s ± 0% 33.5GB/s ± 0% +39.18% (p=0.000 n=10+9) MemmoveUnalignedSrc/4096-4 25.3GB/s ± 0% 36.2GB/s ± 0% +43.50% (p=0.000 n=10+7) Cortex-A72 (Graviton 1) name old time/op new time/op delta Memmove/0-4 3.06ns ± 3% 3.08ns ± 1% ~ (p=0.958 n=10+9) Memmove/1-4 8.72ns ± 0% 7.85ns ± 0% -9.98% (p=0.002 n=8+10) Memmove/8-4 8.29ns ± 0% 8.29ns ± 0% ~ (all equal) Memmove/16-4 8.29ns ± 0% 8.29ns ± 0% ~ (all equal) Memmove/32-4 8.19ns ± 2% 8.29ns ± 0% ~ (p=0.114 n=10+10) Memmove/64-4 18.3ns ± 4% 10.0ns ± 0% -45.36% (p=0.000 n=10+10) Memmove/128-4 14.8ns ± 0% 17.4ns ± 0% +17.77% (p=0.000 n=10+10) Memmove/256-4 21.8ns ± 0% 23.1ns ± 0% +5.96% (p=0.000 n=10+10) Memmove/512-4 35.8ns ± 0% 37.2ns ± 0% +3.91% (p=0.000 n=10+10) Memmove/1024-4 63.7ns ± 0% 67.2ns ± 0% +5.49% (p=0.000 n=10+10) Memmove/2048-4 126ns ± 0% 123ns ± 0% -2.38% (p=0.000 n=10+10) Memmove/4096-4 238ns ± 1% 243ns ± 1% +1.93% (p=0.000 n=10+10) MemmoveUnalignedDst/64-4 19.3ns ± 1% 12.0ns ± 1% -37.49% (p=0.000 n=10+10) MemmoveUnalignedDst/128-4 17.2ns ± 0% 17.4ns ± 0% +1.16% (p=0.000 n=10+10) MemmoveUnalignedDst/256-4 28.2ns ± 8% 29.2ns ± 0% ~ (p=0.352 n=10+10) MemmoveUnalignedDst/512-4 49.8ns ± 3% 48.9ns ± 0% ~ (p=1.000 n=10+10) MemmoveUnalignedDst/1024-4 89.5ns ± 0% 80.5ns ± 1% -10.02% (p=0.000 n=10+10) MemmoveUnalignedDst/2048-4 180ns ± 0% 127ns ± 0% -29.44% (p=0.000 n=9+10) MemmoveUnalignedDst/4096-4 347ns ± 0% 244ns ± 0% -29.59% (p=0.000 n=10+9) MemmoveUnalignedSrc/128-4 16.1ns ± 0% 21.8ns ± 0% +35.40% (p=0.000 n=10+10) MemmoveUnalignedSrc/256-4 24.9ns ± 8% 26.6ns ± 0% +6.70% (p=0.015 n=10+10) MemmoveUnalignedSrc/512-4 39.4ns ± 6% 40.6ns ± 0% ~ (p=0.352 n=10+10) MemmoveUnalignedSrc/1024-4 72.5ns ± 0% 83.0ns ± 1% +14.44% (p=0.000 n=9+10) MemmoveUnalignedSrc/2048-4 129ns ± 1% 128ns ± 1% ~ (p=0.179 n=10+10) MemmoveUnalignedSrc/4096-4 241ns ± 0% 253ns ± 1% +4.99% (p=0.000 n=9+9) Cortex-A53 (Raspberry Pi 3) name old time/op new time/op delta Memmove/0-4 11.0ns ± 0% 11.0ns ± 1% ~ (p=0.294 n=8+10) Memmove/1-4 29.6ns ± 0% 28.0ns ± 1% -5.41% (p=0.000 n=9+10) Memmove/8-4 23.5ns ± 0% 22.1ns ± 0% -6.11% (p=0.000 n=8+8) Memmove/16-4 23.7ns ± 1% 22.1ns ± 0% -6.59% (p=0.000 n=10+8) Memmove/32-4 27.9ns ± 0% 27.1ns ± 0% -3.13% (p=0.000 n=8+8) Memmove/64-4 33.8ns ± 0% 31.5ns ± 1% -6.99% (p=0.000 n=8+10) Memmove/128-4 45.6ns ± 0% 44.2ns ± 1% -3.23% (p=0.000 n=9+10) Memmove/256-4 69.3ns ± 0% 69.3ns ± 0% ~ (p=0.072 n=8+8) Memmove/512-4 127ns ± 0% 110ns ± 0% -13.39% (p=0.000 n=8+8) Memmove/1024-4 222ns ± 0% 205ns ± 1% -7.66% (p=0.000 n=7+10) Memmove/2048-4 411ns ± 0% 366ns ± 0% -10.98% (p=0.000 n=8+9) Memmove/4096-4 795ns ± 1% 695ns ± 1% -12.63% (p=0.000 n=10+10) MemmoveUnalignedDst/64-4 44.0ns ± 0% 40.5ns ± 0% -7.93% (p=0.000 n=8+8) MemmoveUnalignedDst/128-4 59.6ns ± 0% 54.9ns ± 0% -7.85% (p=0.000 n=9+9) MemmoveUnalignedDst/256-4 98.2ns ±11% 90.0ns ± 1% ~ (p=0.130 n=10+10) MemmoveUnalignedDst/512-4 161ns ± 2% 145ns ± 1% -9.96% (p=0.000 n=10+10) MemmoveUnalignedDst/1024-4 281ns ± 0% 265ns ± 0% -5.65% (p=0.000 n=9+8) MemmoveUnalignedDst/2048-4 528ns ± 0% 482ns ± 0% -8.73% (p=0.000 n=8+9) MemmoveUnalignedDst/4096-4 1.02µs ± 1% 0.92µs ± 0% -10.00% (p=0.000 n=10+8) MemmoveUnalignedSrc/64-4 42.4ns ± 1% 40.5ns ± 0% -4.39% (p=0.000 n=10+8) MemmoveUnalignedSrc/128-4 57.4ns ± 0% 57.0ns ± 1% -0.75% (p=0.048 n=9+10) MemmoveUnalignedSrc/256-4 88.1ns ± 1% 89.6ns ± 0% +1.70% (p=0.000 n=9+8) MemmoveUnalignedSrc/512-4 160ns ± 2% 144ns ± 0% -9.89% (p=0.000 n=10+8) MemmoveUnalignedSrc/1024-4 286ns ± 0% 266ns ± 1% -6.69% (p=0.000 n=8+10) MemmoveUnalignedSrc/2048-4 525ns ± 0% 483ns ± 1% -7.96% (p=0.000 n=9+10) MemmoveUnalignedSrc/4096-4 1.01µs ± 0% 0.92µs ± 1% -9.40% (p=0.000 n=8+10) Change-Id: Ia1144e9d4dfafdece6e167c5e576bf80f254c8ab Reviewed-on: https://go-review.googlesource.com/c/go/+/243357 TryBot-Result: Go Bot <gobot@golang.org> Reviewed-by: Martin Möhrmann <moehrmann@google.com> Reviewed-by: eric fang <eric.fang@arm.com> Reviewed-by: Cherry Zhang <cherryyz@google.com>
2020-10-27	runtime: add 2-byte and 8-byte sub-benchmarks for memmove load/store	Heisenberg
	Change-Id: I6389d7efe90836b6ece44d2e75053d1ad9f35d08 Reviewed-on: https://go-review.googlesource.com/c/go/+/253417 Trust: Emmanuel Odeke <emmanuel@orijtech.com> Reviewed-by: Keith Randall <khr@golang.org>
2020-01-06	runtime: test memmove writes pointers atomically	Cherry Zhang
	In the previous CL we ensures that memmove writes pointers atomically, so the concurrent GC won't observe a partially updated pointer. This CL adds a test. Change-Id: Icd1124bf3a15ef25bac20c7fb8933f1a642d897c Reviewed-on: https://go-review.googlesource.com/c/go/+/212627 Reviewed-by: Austin Clements <austin@google.com> Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org>
2019-05-16	runtime: disable some tests in -quick mode	Russ Cox
	Speeds up the "go test runtime -cpu=1,2,4 -short -quick" phase of all.bash. For #26473. Change-Id: I090f5a5aa754462b3253a2156dc31fa67ce7af2a Reviewed-on: https://go-review.googlesource.com/c/go/+/177399 Run-TryBot: Russ Cox <rsc@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org> Reviewed-by: Austin Clements <austin@google.com>
2018-03-06	runtime, cmd/compile: use ldp for DUFFCOPY on ARM64	Meng Zhuo
	name old time/op new time/op delta CopyFat8 2.15ns ± 1% 2.19ns ± 6% ~ (p=0.171 n=8+9) CopyFat12 2.15ns ± 0% 2.17ns ± 2% ~ (p=0.137 n=8+10) CopyFat16 2.17ns ± 3% 2.15ns ± 0% ~ (p=0.211 n=10+10) CopyFat24 2.16ns ± 1% 2.15ns ± 0% ~ (p=0.087 n=10+10) CopyFat32 11.5ns ± 0% 12.8ns ± 2% +10.87% (p=0.000 n=8+10) CopyFat64 20.2ns ± 2% 12.9ns ± 0% -36.11% (p=0.000 n=10+10) CopyFat128 37.2ns ± 0% 21.5ns ± 0% -42.20% (p=0.000 n=10+10) CopyFat256 71.6ns ± 0% 38.7ns ± 0% -45.95% (p=0.000 n=10+10) CopyFat512 140ns ± 0% 73ns ± 0% -47.86% (p=0.000 n=10+9) CopyFat520 142ns ± 0% 74ns ± 0% -47.54% (p=0.000 n=10+10) CopyFat1024 277ns ± 0% 141ns ± 0% -49.10% (p=0.000 n=10+10) Change-Id: If54bc571add5db674d5e081579c87e80153d0a5a Reviewed-on: https://go-review.googlesource.com/97395 Reviewed-by: Cherry Zhang <cherryyz@google.com>
2017-10-31	runtime: shorten tests in all.bash	Russ Cox
	This cuts 23 seconds from all.bash on my MacBook Pro. Change-Id: Ibc4d7c01660b9e9ebd088dd55ba993f0d7ec6aa3 Reviewed-on: https://go-review.googlesource.com/73991 Reviewed-by: Ian Lance Taylor <iant@golang.org>
2017-01-23	runtime: amd64, use 4-byte ops for memmove of 4 bytes	Keith Randall
	memmove used to use 2 2-byte load/store pairs to move 4 bytes. When the result is loaded with a single 4-byte load, it caused a store to load fowarding stall. To avoid the stall, special case memmove to use 4 byte ops for the 4 byte copy case. We already have a special case for 8-byte copies. 386 already specializes 4-byte copies. I'll do 2-byte copies also, but not for 1.8. benchmark old ns/op new ns/op delta BenchmarkIssue18740-8 7567 4799 -36.58% 3-byte copies get a bit slower. Other copies are unchanged. name old time/op new time/op delta Memmove/3-8 4.76ns ± 5% 5.26ns ± 3% +10.50% (p=0.000 n=10+10) Fixes #18740 Change-Id: Iec82cbac0ecfee80fa3c8fc83828f9a1819c3c74 Reviewed-on: https://go-review.googlesource.com/35567 Run-TryBot: Keith Randall <khr@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: David Chase <drchase@google.com>
2016-11-04	all: sprinkle t.Parallel on some slow tests	Brad Fitzpatrick
	I used the slowtests.go tool as described in https://golang.org/cl/32684 on packages that stood out. go test -short std drops from ~56 to ~52 seconds. This isn't a huge win, but it was mostly an exercise. Updates #17751 Change-Id: I9f3402e36a038d71e662d06ce2c1d52f6c4b674d Reviewed-on: https://go-review.googlesource.com/32751 Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Ian Lance Taylor <iant@golang.org>
2016-10-06	runtime: improve memmove for amd64	Denis Nagorny
	Use AVX if available on 4th generation of Intel(TM) Core(TM) processors. (collected on E5 2609v3 @1.9GHz) name old speed new speed delta Memmove/1-6 158MB/s ± 0% 172MB/s ± 0% +9.09% (p=0.000 n=16+16) Memmove/2-6 316MB/s ± 0% 345MB/s ± 0% +9.09% (p=0.000 n=18+16) Memmove/3-6 517MB/s ± 0% 517MB/s ± 0% ~ (p=0.445 n=16+16) Memmove/4-6 687MB/s ± 1% 690MB/s ± 0% +0.35% (p=0.000 n=20+17) Memmove/5-6 729MB/s ± 0% 729MB/s ± 0% +0.01% (p=0.000 n=16+18) Memmove/6-6 875MB/s ± 0% 875MB/s ± 0% +0.01% (p=0.000 n=18+18) Memmove/7-6 1.02GB/s ± 0% 1.02GB/s ± 1% ~ (p=0.139 n=19+20) Memmove/8-6 1.26GB/s ± 0% 1.26GB/s ± 0% +0.00% (p=0.000 n=18+18) Memmove/9-6 1.42GB/s ± 0% 1.42GB/s ± 0% +0.00% (p=0.000 n=17+18) Memmove/10-6 1.58GB/s ± 0% 1.58GB/s ± 0% +0.00% (p=0.000 n=19+19) Memmove/11-6 1.74GB/s ± 0% 1.74GB/s ± 0% +0.00% (p=0.001 n=18+17) Memmove/12-6 1.90GB/s ± 0% 1.90GB/s ± 0% +0.00% (p=0.000 n=19+19) Memmove/13-6 2.05GB/s ± 0% 2.05GB/s ± 0% +0.00% (p=0.000 n=18+19) Memmove/14-6 2.21GB/s ± 0% 2.21GB/s ± 0% +0.00% (p=0.000 n=16+20) Memmove/15-6 2.37GB/s ± 0% 2.37GB/s ± 0% +0.00% (p=0.004 n=19+20) Memmove/16-6 2.53GB/s ± 0% 2.53GB/s ± 0% +0.00% (p=0.000 n=16+16) Memmove/32-6 4.67GB/s ± 0% 4.67GB/s ± 0% +0.00% (p=0.000 n=17+17) Memmove/64-6 8.67GB/s ± 0% 8.64GB/s ± 0% -0.33% (p=0.000 n=18+17) Memmove/128-6 12.6GB/s ± 0% 11.6GB/s ± 0% -8.05% (p=0.000 n=16+19) Memmove/256-6 16.3GB/s ± 0% 16.6GB/s ± 0% +1.66% (p=0.000 n=20+18) Memmove/512-6 21.5GB/s ± 0% 24.4GB/s ± 0% +13.35% (p=0.000 n=18+17) Memmove/1024-6 24.7GB/s ± 0% 33.7GB/s ± 0% +36.12% (p=0.000 n=18+18) Memmove/2048-6 27.3GB/s ± 0% 43.3GB/s ± 0% +58.77% (p=0.000 n=19+17) Memmove/4096-6 37.5GB/s ± 0% 50.5GB/s ± 0% +34.56% (p=0.000 n=19+19) MemmoveUnalignedDst/1-6 135MB/s ± 0% 146MB/s ± 0% +7.69% (p=0.000 n=16+14) MemmoveUnalignedDst/2-6 271MB/s ± 0% 292MB/s ± 0% +7.69% (p=0.000 n=18+18) MemmoveUnalignedDst/3-6 438MB/s ± 0% 438MB/s ± 0% ~ (p=0.352 n=16+19) MemmoveUnalignedDst/4-6 584MB/s ± 0% 584MB/s ± 0% ~ (p=0.876 n=17+17) MemmoveUnalignedDst/5-6 631MB/s ± 1% 632MB/s ± 0% +0.25% (p=0.000 n=20+17) MemmoveUnalignedDst/6-6 759MB/s ± 0% 759MB/s ± 0% +0.00% (p=0.000 n=19+16) MemmoveUnalignedDst/7-6 885MB/s ± 0% 883MB/s ± 1% ~ (p=0.647 n=18+20) MemmoveUnalignedDst/8-6 1.08GB/s ± 0% 1.08GB/s ± 0% +0.00% (p=0.035 n=19+18) MemmoveUnalignedDst/9-6 1.22GB/s ± 0% 1.22GB/s ± 0% ~ (p=0.251 n=18+17) MemmoveUnalignedDst/10-6 1.35GB/s ± 0% 1.35GB/s ± 0% ~ (p=0.327 n=17+18) MemmoveUnalignedDst/11-6 1.49GB/s ± 0% 1.49GB/s ± 0% ~ (p=0.531 n=18+19) MemmoveUnalignedDst/12-6 1.63GB/s ± 0% 1.63GB/s ± 0% ~ (p=0.886 n=19+18) MemmoveUnalignedDst/13-6 1.76GB/s ± 0% 1.76GB/s ± 1% -0.24% (p=0.006 n=18+20) MemmoveUnalignedDst/14-6 1.90GB/s ± 0% 1.90GB/s ± 0% ~ (p=0.818 n=20+19) MemmoveUnalignedDst/15-6 2.03GB/s ± 0% 2.03GB/s ± 0% ~ (p=0.294 n=17+16) MemmoveUnalignedDst/16-6 2.17GB/s ± 0% 2.17GB/s ± 0% ~ (p=0.602 n=16+18) MemmoveUnalignedDst/32-6 4.05GB/s ± 0% 4.05GB/s ± 0% +0.00% (p=0.010 n=18+17) MemmoveUnalignedDst/64-6 7.59GB/s ± 0% 7.59GB/s ± 0% +0.00% (p=0.022 n=18+16) MemmoveUnalignedDst/128-6 11.1GB/s ± 0% 11.4GB/s ± 0% +2.79% (p=0.000 n=18+17) MemmoveUnalignedDst/256-6 16.4GB/s ± 0% 16.7GB/s ± 0% +1.59% (p=0.000 n=20+17) MemmoveUnalignedDst/512-6 15.7GB/s ± 0% 21.3GB/s ± 0% +35.87% (p=0.000 n=18+20) MemmoveUnalignedDst/1024-6 16.0GB/s ±20% 31.5GB/s ± 0% +96.93% (p=0.000 n=20+14) MemmoveUnalignedDst/2048-6 19.6GB/s ± 0% 42.1GB/s ± 0% +115.16% (p=0.000 n=17+18) MemmoveUnalignedDst/4096-6 6.41GB/s ± 0% 33.18GB/s ± 0% +417.56% (p=0.000 n=17+18) MemmoveUnalignedSrc/1-6 171MB/s ± 0% 166MB/s ± 0% -3.33% (p=0.000 n=19+16) MemmoveUnalignedSrc/2-6 343MB/s ± 0% 342MB/s ± 1% -0.41% (p=0.000 n=17+20) MemmoveUnalignedSrc/3-6 508MB/s ± 0% 493MB/s ± 1% -2.90% (p=0.000 n=17+17) MemmoveUnalignedSrc/4-6 677MB/s ± 0% 660MB/s ± 2% -2.55% (p=0.000 n=17+20) MemmoveUnalignedSrc/5-6 790MB/s ± 0% 790MB/s ± 0% ~ (p=0.139 n=17+17) MemmoveUnalignedSrc/6-6 948MB/s ± 0% 946MB/s ± 1% ~ (p=0.330 n=17+19) MemmoveUnalignedSrc/7-6 1.11GB/s ± 0% 1.11GB/s ± 0% -0.05% (p=0.026 n=17+17) MemmoveUnalignedSrc/8-6 1.38GB/s ± 0% 1.38GB/s ± 0% ~ (p=0.091 n=18+16) MemmoveUnalignedSrc/9-6 1.42GB/s ± 0% 1.40GB/s ± 1% -1.04% (p=0.000 n=19+20) MemmoveUnalignedSrc/10-6 1.58GB/s ± 0% 1.56GB/s ± 1% -1.15% (p=0.000 n=18+19) MemmoveUnalignedSrc/11-6 1.73GB/s ± 0% 1.71GB/s ± 1% -1.30% (p=0.000 n=20+20) MemmoveUnalignedSrc/12-6 1.89GB/s ± 0% 1.87GB/s ± 1% -1.18% (p=0.000 n=17+20) MemmoveUnalignedSrc/13-6 2.05GB/s ± 0% 2.02GB/s ± 1% -1.18% (p=0.000 n=17+20) MemmoveUnalignedSrc/14-6 2.21GB/s ± 0% 2.18GB/s ± 1% -1.14% (p=0.000 n=17+20) MemmoveUnalignedSrc/15-6 2.36GB/s ± 0% 2.34GB/s ± 1% -1.04% (p=0.000 n=17+20) MemmoveUnalignedSrc/16-6 2.52GB/s ± 0% 2.49GB/s ± 1% -1.26% (p=0.000 n=19+20) MemmoveUnalignedSrc/32-6 4.82GB/s ± 0% 4.61GB/s ± 0% -4.40% (p=0.000 n=19+20) MemmoveUnalignedSrc/64-6 5.03GB/s ± 4% 7.97GB/s ± 0% +58.55% (p=0.000 n=20+16) MemmoveUnalignedSrc/128-6 11.1GB/s ± 0% 11.2GB/s ± 0% +0.52% (p=0.000 n=17+18) MemmoveUnalignedSrc/256-6 16.5GB/s ± 0% 16.4GB/s ± 0% -0.10% (p=0.000 n=20+18) MemmoveUnalignedSrc/512-6 21.0GB/s ± 0% 22.1GB/s ± 0% +5.48% (p=0.000 n=14+17) MemmoveUnalignedSrc/1024-6 24.9GB/s ± 0% 31.9GB/s ± 0% +28.20% (p=0.000 n=19+20) MemmoveUnalignedSrc/2048-6 23.3GB/s ± 0% 33.8GB/s ± 0% +45.22% (p=0.000 n=17+19) MemmoveUnalignedSrc/4096-6 37.3GB/s ± 0% 42.7GB/s ± 0% +14.30% (p=0.000 n=17+17) Change-Id: Id66aa3e499ccfb117cb99d623ef326b50d057b64 Reviewed-on: https://go-review.googlesource.com/29590 Run-TryBot: Denis Nagorny <denis.nagorny@intel.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>
2016-08-31	Revert "runtime: improve memmove for amd64"	Joe Tsai
	This reverts commit 3607c5f4f18ad4d423e40996ebf7f46b2f79ce02. This was causing failures on amd64 machines without AVX. Fixes #16939 Change-Id: I70080fbb4e7ae791857334f2bffd847d08cb25fa Reviewed-on: https://go-review.googlesource.com/28274 Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org> Reviewed-by: Keith Randall <khr@golang.org>
2016-08-31	runtime: improve memmove for amd64	Denis Nagorny
	Use AVX if available on 4th generation of Intel(TM) Core(TM) processors. (collected on E5 2609v3 @1.9GHz) name old speed new speed delta Memmove/1-6 158MB/s ± 0% 172MB/s ± 0% +9.09% (p=0.000 n=16+16) Memmove/2-6 316MB/s ± 0% 345MB/s ± 0% +9.09% (p=0.000 n=18+16) Memmove/3-6 517MB/s ± 0% 517MB/s ± 0% ~ (p=0.445 n=16+16) Memmove/4-6 687MB/s ± 1% 690MB/s ± 0% +0.35% (p=0.000 n=20+17) Memmove/5-6 729MB/s ± 0% 729MB/s ± 0% +0.01% (p=0.000 n=16+18) Memmove/6-6 875MB/s ± 0% 875MB/s ± 0% +0.01% (p=0.000 n=18+18) Memmove/7-6 1.02GB/s ± 0% 1.02GB/s ± 1% ~ (p=0.139 n=19+20) Memmove/8-6 1.26GB/s ± 0% 1.26GB/s ± 0% +0.00% (p=0.000 n=18+18) Memmove/9-6 1.42GB/s ± 0% 1.42GB/s ± 0% +0.00% (p=0.000 n=17+18) Memmove/10-6 1.58GB/s ± 0% 1.58GB/s ± 0% +0.00% (p=0.000 n=19+19) Memmove/11-6 1.74GB/s ± 0% 1.74GB/s ± 0% +0.00% (p=0.001 n=18+17) Memmove/12-6 1.90GB/s ± 0% 1.90GB/s ± 0% +0.00% (p=0.000 n=19+19) Memmove/13-6 2.05GB/s ± 0% 2.05GB/s ± 0% +0.00% (p=0.000 n=18+19) Memmove/14-6 2.21GB/s ± 0% 2.21GB/s ± 0% +0.00% (p=0.000 n=16+20) Memmove/15-6 2.37GB/s ± 0% 2.37GB/s ± 0% +0.00% (p=0.004 n=19+20) Memmove/16-6 2.53GB/s ± 0% 2.53GB/s ± 0% +0.00% (p=0.000 n=16+16) Memmove/32-6 4.67GB/s ± 0% 4.67GB/s ± 0% +0.00% (p=0.000 n=17+17) Memmove/64-6 8.67GB/s ± 0% 8.64GB/s ± 0% -0.33% (p=0.000 n=18+17) Memmove/128-6 12.6GB/s ± 0% 11.6GB/s ± 0% -8.05% (p=0.000 n=16+19) Memmove/256-6 16.3GB/s ± 0% 16.6GB/s ± 0% +1.66% (p=0.000 n=20+18) Memmove/512-6 21.5GB/s ± 0% 24.4GB/s ± 0% +13.35% (p=0.000 n=18+17) Memmove/1024-6 24.7GB/s ± 0% 33.7GB/s ± 0% +36.12% (p=0.000 n=18+18) Memmove/2048-6 27.3GB/s ± 0% 43.3GB/s ± 0% +58.77% (p=0.000 n=19+17) Memmove/4096-6 37.5GB/s ± 0% 50.5GB/s ± 0% +34.56% (p=0.000 n=19+19) MemmoveUnalignedDst/1-6 135MB/s ± 0% 146MB/s ± 0% +7.69% (p=0.000 n=16+14) MemmoveUnalignedDst/2-6 271MB/s ± 0% 292MB/s ± 0% +7.69% (p=0.000 n=18+18) MemmoveUnalignedDst/3-6 438MB/s ± 0% 438MB/s ± 0% ~ (p=0.352 n=16+19) MemmoveUnalignedDst/4-6 584MB/s ± 0% 584MB/s ± 0% ~ (p=0.876 n=17+17) MemmoveUnalignedDst/5-6 631MB/s ± 1% 632MB/s ± 0% +0.25% (p=0.000 n=20+17) MemmoveUnalignedDst/6-6 759MB/s ± 0% 759MB/s ± 0% +0.00% (p=0.000 n=19+16) MemmoveUnalignedDst/7-6 885MB/s ± 0% 883MB/s ± 1% ~ (p=0.647 n=18+20) MemmoveUnalignedDst/8-6 1.08GB/s ± 0% 1.08GB/s ± 0% +0.00% (p=0.035 n=19+18) MemmoveUnalignedDst/9-6 1.22GB/s ± 0% 1.22GB/s ± 0% ~ (p=0.251 n=18+17) MemmoveUnalignedDst/10-6 1.35GB/s ± 0% 1.35GB/s ± 0% ~ (p=0.327 n=17+18) MemmoveUnalignedDst/11-6 1.49GB/s ± 0% 1.49GB/s ± 0% ~ (p=0.531 n=18+19) MemmoveUnalignedDst/12-6 1.63GB/s ± 0% 1.63GB/s ± 0% ~ (p=0.886 n=19+18) MemmoveUnalignedDst/13-6 1.76GB/s ± 0% 1.76GB/s ± 1% -0.24% (p=0.006 n=18+20) MemmoveUnalignedDst/14-6 1.90GB/s ± 0% 1.90GB/s ± 0% ~ (p=0.818 n=20+19) MemmoveUnalignedDst/15-6 2.03GB/s ± 0% 2.03GB/s ± 0% ~ (p=0.294 n=17+16) MemmoveUnalignedDst/16-6 2.17GB/s ± 0% 2.17GB/s ± 0% ~ (p=0.602 n=16+18) MemmoveUnalignedDst/32-6 4.05GB/s ± 0% 4.05GB/s ± 0% +0.00% (p=0.010 n=18+17) MemmoveUnalignedDst/64-6 7.59GB/s ± 0% 7.59GB/s ± 0% +0.00% (p=0.022 n=18+16) MemmoveUnalignedDst/128-6 11.1GB/s ± 0% 11.4GB/s ± 0% +2.79% (p=0.000 n=18+17) MemmoveUnalignedDst/256-6 16.4GB/s ± 0% 16.7GB/s ± 0% +1.59% (p=0.000 n=20+17) MemmoveUnalignedDst/512-6 15.7GB/s ± 0% 21.3GB/s ± 0% +35.87% (p=0.000 n=18+20) MemmoveUnalignedDst/1024-6 16.0GB/s ±20% 31.5GB/s ± 0% +96.93% (p=0.000 n=20+14) MemmoveUnalignedDst/2048-6 19.6GB/s ± 0% 42.1GB/s ± 0% +115.16% (p=0.000 n=17+18) MemmoveUnalignedDst/4096-6 6.41GB/s ± 0% 33.18GB/s ± 0% +417.56% (p=0.000 n=17+18) MemmoveUnalignedSrc/1-6 171MB/s ± 0% 166MB/s ± 0% -3.33% (p=0.000 n=19+16) MemmoveUnalignedSrc/2-6 343MB/s ± 0% 342MB/s ± 1% -0.41% (p=0.000 n=17+20) MemmoveUnalignedSrc/3-6 508MB/s ± 0% 493MB/s ± 1% -2.90% (p=0.000 n=17+17) MemmoveUnalignedSrc/4-6 677MB/s ± 0% 660MB/s ± 2% -2.55% (p=0.000 n=17+20) MemmoveUnalignedSrc/5-6 790MB/s ± 0% 790MB/s ± 0% ~ (p=0.139 n=17+17) MemmoveUnalignedSrc/6-6 948MB/s ± 0% 946MB/s ± 1% ~ (p=0.330 n=17+19) MemmoveUnalignedSrc/7-6 1.11GB/s ± 0% 1.11GB/s ± 0% -0.05% (p=0.026 n=17+17) MemmoveUnalignedSrc/8-6 1.38GB/s ± 0% 1.38GB/s ± 0% ~ (p=0.091 n=18+16) MemmoveUnalignedSrc/9-6 1.42GB/s ± 0% 1.40GB/s ± 1% -1.04% (p=0.000 n=19+20) MemmoveUnalignedSrc/10-6 1.58GB/s ± 0% 1.56GB/s ± 1% -1.15% (p=0.000 n=18+19) MemmoveUnalignedSrc/11-6 1.73GB/s ± 0% 1.71GB/s ± 1% -1.30% (p=0.000 n=20+20) MemmoveUnalignedSrc/12-6 1.89GB/s ± 0% 1.87GB/s ± 1% -1.18% (p=0.000 n=17+20) MemmoveUnalignedSrc/13-6 2.05GB/s ± 0% 2.02GB/s ± 1% -1.18% (p=0.000 n=17+20) MemmoveUnalignedSrc/14-6 2.21GB/s ± 0% 2.18GB/s ± 1% -1.14% (p=0.000 n=17+20) MemmoveUnalignedSrc/15-6 2.36GB/s ± 0% 2.34GB/s ± 1% -1.04% (p=0.000 n=17+20) MemmoveUnalignedSrc/16-6 2.52GB/s ± 0% 2.49GB/s ± 1% -1.26% (p=0.000 n=19+20) MemmoveUnalignedSrc/32-6 4.82GB/s ± 0% 4.61GB/s ± 0% -4.40% (p=0.000 n=19+20) MemmoveUnalignedSrc/64-6 5.03GB/s ± 4% 7.97GB/s ± 0% +58.55% (p=0.000 n=20+16) MemmoveUnalignedSrc/128-6 11.1GB/s ± 0% 11.2GB/s ± 0% +0.52% (p=0.000 n=17+18) MemmoveUnalignedSrc/256-6 16.5GB/s ± 0% 16.4GB/s ± 0% -0.10% (p=0.000 n=20+18) MemmoveUnalignedSrc/512-6 21.0GB/s ± 0% 22.1GB/s ± 0% +5.48% (p=0.000 n=14+17) MemmoveUnalignedSrc/1024-6 24.9GB/s ± 0% 31.9GB/s ± 0% +28.20% (p=0.000 n=19+20) MemmoveUnalignedSrc/2048-6 23.3GB/s ± 0% 33.8GB/s ± 0% +45.22% (p=0.000 n=17+19) MemmoveUnalignedSrc/4096-6 37.3GB/s ± 0% 42.7GB/s ± 0% +14.30% (p=0.000 n=17+17) Change-Id: Iab488d93a293cdf573ab5cd89b95a818bbb5d531 Reviewed-on: https://go-review.googlesource.com/22515 Run-TryBot: Denis Nagorny <denis.nagorny@intel.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>
2016-05-25	runtime: use of Run for some benchmarks	Marcel van Lohuizen
	Names of sub-benchmarks are preserved, short of the additional slash. Change-Id: I9b3f82964f9a44b0d28724413320afd091ed3106 Reviewed-on: https://go-review.googlesource.com/23425 Reviewed-by: Russ Cox <rsc@golang.org> Run-TryBot: Marcel van Lohuizen <mpvl@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org>
2016-03-21	runtime: use MOVSB instead of MOVSQ for unaligned moves	Keith Randall
	MOVSB is quite a bit faster for unaligned moves. Possibly we should use MOVSB all of the time, but Intel folks say it might be a bit faster to use MOVSQ on some processors (but not any I have access to at the moment). benchmark old ns/op new ns/op delta BenchmarkMemmove4096-8 93.9 93.2 -0.75% BenchmarkMemmoveUnalignedDst4096-8 256 151 -41.02% BenchmarkMemmoveUnalignedSrc4096-8 175 90.5 -48.29% Fixes #14630 Change-Id: I568e6d6590eb3615e6a699fb474020596be665ff Reviewed-on: https://go-review.googlesource.com/20293 Reviewed-by: Ian Lance Taylor <iant@golang.org>
2016-03-01	all: make copyright headers consistent with one space after period	Brad Fitzpatrick
	This is a subset of https://golang.org/cl/20022 with only the copyright header lines, so the next CL will be smaller and more reviewable. Go policy has been single space after periods in comments for some time. The copyright header template at: https://golang.org/doc/contribute.html#copyright also uses a single space. Make them all consistent. Change-Id: Icc26c6b8495c3820da6b171ca96a74701b4a01b0 Reviewed-on: https://go-review.googlesource.com/20111 Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org> Reviewed-by: Ian Lance Taylor <iant@golang.org> Reviewed-by: Matthew Dempsky <mdempsky@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org>
2015-11-24	runtime: speed up memclr with avx2 on amd64	Ilya Tocar
	Results are a bit noisy, but show good improvement (haswell) name old time/op new time/op delta Memclr5-48 6.06ns ± 8% 5.65ns ± 8% -6.81% (p=0.000 n=20+20) Memclr16-48 5.75ns ± 6% 5.71ns ± 6% ~ (p=0.545 n=20+19) Memclr64-48 6.54ns ± 5% 6.14ns ± 9% -6.12% (p=0.000 n=18+20) Memclr256-48 10.1ns ±12% 9.9ns ±14% ~ (p=0.285 n=20+19) Memclr4096-48 104ns ± 8% 57ns ±15% -44.98% (p=0.000 n=20+20) Memclr65536-48 2.45µs ± 5% 2.43µs ± 8% ~ (p=0.665 n=16+20) Memclr1M-48 58.7µs ±13% 56.4µs ±11% -3.92% (p=0.033 n=20+19) Memclr4M-48 233µs ± 9% 234µs ± 9% ~ (p=0.728 n=20+19) Memclr8M-48 469µs ±11% 472µs ±16% ~ (p=0.947 n=20+20) Memclr16M-48 947µs ±10% 916µs ±10% ~ (p=0.050 n=20+19) Memclr64M-48 10.9ms ±10% 4.5ms ± 9% -58.43% (p=0.000 n=20+20) GoMemclr5-48 3.80ns ±13% 3.38ns ± 6% -11.02% (p=0.000 n=20+20) GoMemclr16-48 3.34ns ±15% 3.40ns ± 9% ~ (p=0.351 n=20+20) GoMemclr64-48 4.10ns ±15% 4.04ns ±10% ~ (p=1.000 n=20+19) GoMemclr256-48 7.75ns ±20% 7.88ns ± 9% ~ (p=0.227 n=20+19) name old speed new speed delta Memclr5-48 826MB/s ± 7% 886MB/s ± 8% +7.32% (p=0.000 n=20+20) Memclr16-48 2.78GB/s ± 5% 2.81GB/s ± 6% ~ (p=0.550 n=20+19) Memclr64-48 9.79GB/s ± 5% 10.44GB/s ±10% +6.64% (p=0.000 n=18+20) Memclr256-48 25.4GB/s ±14% 25.6GB/s ±12% ~ (p=0.647 n=20+19) Memclr4096-48 39.4GB/s ± 8% 72.0GB/s ±13% +82.81% (p=0.000 n=20+20) Memclr65536-48 26.6GB/s ± 6% 27.0GB/s ± 9% ~ (p=0.517 n=17+20) Memclr1M-48 17.9GB/s ±12% 18.5GB/s ±11% ~ (p=0.068 n=20+20) Memclr4M-48 18.0GB/s ± 9% 17.8GB/s ±14% ~ (p=0.547 n=20+20) Memclr8M-48 17.9GB/s ±10% 17.8GB/s ±14% ~ (p=0.947 n=20+20) Memclr16M-48 17.8GB/s ± 9% 18.4GB/s ± 9% ~ (p=0.050 n=20+19) Memclr64M-48 6.19GB/s ±10% 14.87GB/s ± 9% +140.11% (p=0.000 n=20+20) GoMemclr5-48 1.31GB/s ±10% 1.48GB/s ± 6% +13.06% (p=0.000 n=19+20) GoMemclr16-48 4.81GB/s ±14% 4.71GB/s ± 8% ~ (p=0.341 n=20+20) GoMemclr64-48 15.7GB/s ±13% 15.8GB/s ±11% ~ (p=0.967 n=20+19) Change-Id: I393f3f20e2f31538d1b1dd70d6e5c201c106a095 Reviewed-on: https://go-review.googlesource.com/16773 Run-TryBot: Ilya Tocar <ilya.tocar@intel.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Klaus Post <klauspost@gmail.com> Reviewed-by: Keith Randall <khr@golang.org>
2015-10-08	runtime: adjust the arm64 memmove and memclr to operate by word as much as ↵	Michael Hudson-Doyle
	they can Not only is this an obvious optimization: benchmark old MB/s new MB/s speedup BenchmarkMemmove1-4 35.35 29.65 0.84x BenchmarkMemmove2-4 63.78 52.53 0.82x BenchmarkMemmove3-4 89.72 73.96 0.82x BenchmarkMemmove4-4 109.94 95.73 0.87x BenchmarkMemmove5-4 127.60 112.80 0.88x BenchmarkMemmove6-4 143.59 126.67 0.88x BenchmarkMemmove7-4 157.90 138.92 0.88x BenchmarkMemmove8-4 167.18 231.81 1.39x BenchmarkMemmove9-4 175.23 252.07 1.44x BenchmarkMemmove10-4 165.68 261.10 1.58x BenchmarkMemmove11-4 174.43 263.31 1.51x BenchmarkMemmove12-4 180.76 267.56 1.48x BenchmarkMemmove13-4 189.06 284.93 1.51x BenchmarkMemmove14-4 186.31 284.72 1.53x BenchmarkMemmove15-4 195.75 281.62 1.44x BenchmarkMemmove16-4 202.96 439.23 2.16x BenchmarkMemmove32-4 264.77 775.77 2.93x BenchmarkMemmove64-4 306.81 1209.64 3.94x BenchmarkMemmove128-4 357.03 1515.41 4.24x BenchmarkMemmove256-4 380.77 2066.01 5.43x BenchmarkMemmove512-4 385.05 2556.45 6.64x BenchmarkMemmove1024-4 381.23 2804.10 7.36x BenchmarkMemmove2048-4 379.06 2814.83 7.43x BenchmarkMemmove4096-4 387.43 3064.96 7.91x BenchmarkMemmoveUnaligned1-4 28.91 25.40 0.88x BenchmarkMemmoveUnaligned2-4 56.13 47.56 0.85x BenchmarkMemmoveUnaligned3-4 74.32 69.31 0.93x BenchmarkMemmoveUnaligned4-4 97.02 83.58 0.86x BenchmarkMemmoveUnaligned5-4 110.17 103.62 0.94x BenchmarkMemmoveUnaligned6-4 124.95 113.26 0.91x BenchmarkMemmoveUnaligned7-4 142.37 130.82 0.92x BenchmarkMemmoveUnaligned8-4 151.20 205.64 1.36x BenchmarkMemmoveUnaligned9-4 166.97 215.42 1.29x BenchmarkMemmoveUnaligned10-4 148.49 221.22 1.49x BenchmarkMemmoveUnaligned11-4 159.47 239.57 1.50x BenchmarkMemmoveUnaligned12-4 163.52 247.32 1.51x BenchmarkMemmoveUnaligned13-4 167.55 256.54 1.53x BenchmarkMemmoveUnaligned14-4 175.12 251.03 1.43x BenchmarkMemmoveUnaligned15-4 192.10 267.13 1.39x BenchmarkMemmoveUnaligned16-4 190.76 378.87 1.99x BenchmarkMemmoveUnaligned32-4 259.02 562.98 2.17x BenchmarkMemmoveUnaligned64-4 317.72 842.44 2.65x BenchmarkMemmoveUnaligned128-4 355.43 1274.49 3.59x BenchmarkMemmoveUnaligned256-4 378.17 1815.74 4.80x BenchmarkMemmoveUnaligned512-4 362.15 2180.81 6.02x BenchmarkMemmoveUnaligned1024-4 376.07 2453.58 6.52x BenchmarkMemmoveUnaligned2048-4 381.66 2568.32 6.73x BenchmarkMemmoveUnaligned4096-4 398.51 2669.36 6.70x BenchmarkMemclr5-4 113.83 107.93 0.95x BenchmarkMemclr16-4 223.84 389.63 1.74x BenchmarkMemclr64-4 421.99 1209.58 2.87x BenchmarkMemclr256-4 525.94 2411.58 4.59x BenchmarkMemclr4096-4 581.66 4372.20 7.52x BenchmarkMemclr65536-4 565.84 4747.48 8.39x BenchmarkGoMemclr5-4 194.63 160.31 0.82x BenchmarkGoMemclr16-4 295.30 630.07 2.13x BenchmarkGoMemclr64-4 480.24 1884.03 3.92x BenchmarkGoMemclr256-4 540.23 2926.49 5.42x but it turns out that it's necessary to avoid the GC seeing partially written pointers. It's of course possible to be more sophisticated (using ldp/stp to move 16 bytes at a time in the core loop and unrolling the tail copying loops being the obvious ideas) but I wanted something simple and (reasonably) obviously correct. Fixes #12552 Change-Id: Iaeaf8a812cd06f4747ba2f792de1ded738890735 Reviewed-on: https://go-review.googlesource.com/14813 Reviewed-by: Austin Clements <austin@google.com>
2015-04-15	cmd/6g, runtime: improve duffzero throughput	Josh Bleecher Snyder
	It is faster to execute MOVQ AX,(DI) MOVQ AX,8(DI) MOVQ AX,16(DI) MOVQ AX,24(DI) ADDQ $32,DI than STOSQ STOSQ STOSQ STOSQ However, in order to be able to jump into the middle of a block of MOVQs, the call site needs to pre-adjust DI. If we're clearing a small area, the cost of that DI pre-adjustment isn't repaid. This CL switches the DUFFZERO implementation to use a hybrid strategy, in which small clears use STOSQ as before, but large clears use mostly MOVQ/ADDQ blocks. benchmark old ns/op new ns/op delta BenchmarkClearFat8 0.55 0.55 +0.00% BenchmarkClearFat12 0.82 0.83 +1.22% BenchmarkClearFat16 0.55 0.55 +0.00% BenchmarkClearFat24 0.82 0.82 +0.00% BenchmarkClearFat32 2.20 1.94 -11.82% BenchmarkClearFat40 1.92 1.66 -13.54% BenchmarkClearFat48 2.21 1.93 -12.67% BenchmarkClearFat56 3.03 2.20 -27.39% BenchmarkClearFat64 3.26 2.48 -23.93% BenchmarkClearFat72 3.57 2.76 -22.69% BenchmarkClearFat80 3.83 3.05 -20.37% BenchmarkClearFat88 4.14 3.30 -20.29% BenchmarkClearFat128 5.54 4.69 -15.34% BenchmarkClearFat256 9.95 9.09 -8.64% BenchmarkClearFat512 18.7 17.9 -4.28% BenchmarkClearFat1024 36.2 35.4 -2.21% Change-Id: Ic786406d9b3cab68d5a231688f9e66fcd1bd7103 Reviewed-on: https://go-review.googlesource.com/2585 Reviewed-by: Keith Randall <khr@golang.org>
2015-01-09	cmd/gc: optimize memclr of slices and arrays	Josh Bleecher Snyder
	Recognize loops of the form for i := range a { a[i] = zero } in which the evaluation of a is free from side effects. Replace these loops with calls to memclr. This occurs in the stdlib in 18 places. The motivating example is clearing a byte slice: benchmark old ns/op new ns/op delta BenchmarkGoMemclr5 3.31 3.26 -1.51% BenchmarkGoMemclr16 13.7 3.28 -76.06% BenchmarkGoMemclr64 50.8 4.14 -91.85% BenchmarkGoMemclr256 157 6.02 -96.17% Update #5373. Change-Id: I99d3e6f5f268e8c6499b7e661df46403e5eb83e4 Reviewed-on: https://go-review.googlesource.com/2520 Reviewed-by: Keith Randall <khr@golang.org>
2014-09-08	build: move package sources from src/pkg to src	Russ Cox
	Preparation was in CL 134570043. This CL contains only the effect of 'hg mv src/pkg/* src'. For more about the move, see golang.org/s/go14nopkg.

Replace the memmove implementation for moves of 17 bytes or larger with an implementation from ARM optimized software. The moves of 16 bytes or fewer are unchanged, but the registers used are updated to match the rest of the implementation. This implementation makes use of new optimizations: - software pipelined loop for large (>128 byte) moves - medium size moves (17..128 bytes) have a new implementation - address realignment when src or dst is unaligned - preference for aligned src (loads) or dst (stores) depending on CPU To support preference for aligned loads or aligned stores, a new CPU flag is added. This flag indicates that the detected micro architecture performs better with aligned loads. Some tested CPUs did not exhibit a significant difference and are left with the default behavior of realigning based on the destination address (stores). Neoverse N1 (Tested on Graviton 2) name old time/op new time/op delta Memmove/0-4 1.88ns ± 1% 1.87ns ± 1% -0.58% (p=0.020 n=10+10) Memmove/1-4 4.40ns ± 0% 4.40ns ± 0% ~ (all equal) Memmove/8-4 3.88ns ± 3% 3.80ns ± 0% -1.97% (p=0.001 n=10+9) Memmove/16-4 3.90ns ± 3% 3.80ns ± 0% -2.49% (p=0.000 n=10+9) Memmove/32-4 4.80ns ± 0% 4.40ns ± 0% -8.33% (p=0.000 n=9+8) Memmove/64-4 5.86ns ± 0% 5.00ns ± 0% -14.76% (p=0.000 n=8+8) Memmove/128-4 8.46ns ± 0% 8.06ns ± 0% -4.62% (p=0.000 n=10+10) Memmove/256-4 12.4ns ± 0% 12.2ns ± 0% -1.61% (p=0.000 n=10+10) Memmove/512-4 19.5ns ± 0% 19.1ns ± 0% -2.05% (p=0.000 n=10+10) Memmove/1024-4 33.7ns ± 0% 33.5ns ± 0% -0.59% (p=0.000 n=10+10) Memmove/2048-4 62.1ns ± 0% 59.0ns ± 0% -4.99% (p=0.000 n=10+10) Memmove/4096-4 117ns ± 1% 110ns ± 0% -5.66% (p=0.000 n=10+10) MemmoveUnalignedDst/64-4 6.41ns ± 0% 5.62ns ± 0% -12.32% (p=0.000 n=10+7) MemmoveUnalignedDst/128-4 9.40ns ± 0% 8.34ns ± 0% -11.24% (p=0.000 n=10+10) MemmoveUnalignedDst/256-4 12.8ns ± 0% 12.8ns ± 0% ~ (all equal) MemmoveUnalignedDst/512-4 20.4ns ± 0% 19.7ns ± 0% -3.43% (p=0.000 n=9+10) MemmoveUnalignedDst/1024-4 34.1ns ± 0% 35.1ns ± 0% +2.93% (p=0.000 n=9+9) MemmoveUnalignedDst/2048-4 61.5ns ± 0% 60.4ns ± 0% -1.77% (p=0.000 n=10+10) MemmoveUnalignedDst/4096-4 122ns ± 0% 113ns ± 0% -7.38% (p=0.002 n=8+10) MemmoveUnalignedSrc/64-4 7.25ns ± 1% 6.26ns ± 0% -13.64% (p=0.000 n=9+9) MemmoveUnalignedSrc/128-4 10.5ns ± 0% 9.7ns ± 0% -7.52% (p=0.000 n=10+10) MemmoveUnalignedSrc/256-4 17.1ns ± 0% 17.3ns ± 0% +1.17% (p=0.000 n=10+10) MemmoveUnalignedSrc/512-4 27.0ns ± 0% 27.0ns ± 0% ~ (all equal) MemmoveUnalignedSrc/1024-4 46.7ns ± 0% 35.7ns ± 0% -23.55% (p=0.000 n=10+9) MemmoveUnalignedSrc/2048-4 85.2ns ± 0% 61.2ns ± 0% -28.17% (p=0.000 n=10+8) MemmoveUnalignedSrc/4096-4 162ns ± 0% 113ns ± 0% -30.25% (p=0.000 n=10+10) name old speed new speed delta Memmove/4096-4 35.2GB/s ± 0% 37.1GB/s ± 0% +5.56% (p=0.000 n=10+9) MemmoveUnalignedSrc/1024-4 21.9GB/s ± 0% 28.7GB/s ± 0% +30.90% (p=0.000 n=10+10) MemmoveUnalignedSrc/2048-4 24.0GB/s ± 0% 33.5GB/s ± 0% +39.18% (p=0.000 n=10+9) MemmoveUnalignedSrc/4096-4 25.3GB/s ± 0% 36.2GB/s ± 0% +43.50% (p=0.000 n=10+7) Cortex-A72 (Graviton 1) name old time/op new time/op delta Memmove/0-4 3.06ns ± 3% 3.08ns ± 1% ~ (p=0.958 n=10+9) Memmove/1-4 8.72ns ± 0% 7.85ns ± 0% -9.98% (p=0.002 n=8+10) Memmove/8-4 8.29ns ± 0% 8.29ns ± 0% ~ (all equal) Memmove/16-4 8.29ns ± 0% 8.29ns ± 0% ~ (all equal) Memmove/32-4 8.19ns ± 2% 8.29ns ± 0% ~ (p=0.114 n=10+10) Memmove/64-4 18.3ns ± 4% 10.0ns ± 0% -45.36% (p=0.000 n=10+10) Memmove/128-4 14.8ns ± 0% 17.4ns ± 0% +17.77% (p=0.000 n=10+10) Memmove/256-4 21.8ns ± 0% 23.1ns ± 0% +5.96% (p=0.000 n=10+10) Memmove/512-4 35.8ns ± 0% 37.2ns ± 0% +3.91% (p=0.000 n=10+10) Memmove/1024-4 63.7ns ± 0% 67.2ns ± 0% +5.49% (p=0.000 n=10+10) Memmove/2048-4 126ns ± 0% 123ns ± 0% -2.38% (p=0.000 n=10+10) Memmove/4096-4 238ns ± 1% 243ns ± 1% +1.93% (p=0.000 n=10+10) MemmoveUnalignedDst/64-4 19.3ns ± 1% 12.0ns ± 1% -37.49% (p=0.000 n=10+10) MemmoveUnalignedDst/128-4 17.2ns ± 0% 17.4ns ± 0% +1.16% (p=0.000 n=10+10) MemmoveUnalignedDst/256-4 28.2ns ± 8% 29.2ns ± 0% ~ (p=0.352 n=10+10) MemmoveUnalignedDst/512-4 49.8ns ± 3% 48.9ns ± 0% ~ (p=1.000 n=10+10) MemmoveUnalignedDst/1024-4 89.5ns ± 0% 80.5ns ± 1% -10.02% (p=0.000 n=10+10) MemmoveUnalignedDst/2048-4 180ns ± 0% 127ns ± 0% -29.44% (p=0.000 n=9+10) MemmoveUnalignedDst/4096-4 347ns ± 0% 244ns ± 0% -29.59% (p=0.000 n=10+9) MemmoveUnalignedSrc/128-4 16.1ns ± 0% 21.8ns ± 0% +35.40% (p=0.000 n=10+10) MemmoveUnalignedSrc/256-4 24.9ns ± 8% 26.6ns ± 0% +6.70% (p=0.015 n=10+10) MemmoveUnalignedSrc/512-4 39.4ns ± 6% 40.6ns ± 0% ~ (p=0.352 n=10+10) MemmoveUnalignedSrc/1024-4 72.5ns ± 0% 83.0ns ± 1% +14.44% (p=0.000 n=9+10) MemmoveUnalignedSrc/2048-4 129ns ± 1% 128ns ± 1% ~ (p=0.179 n=10+10) MemmoveUnalignedSrc/4096-4 241ns ± 0% 253ns ± 1% +4.99% (p=0.000 n=9+9) Cortex-A53 (Raspberry Pi 3) name old time/op new time/op delta Memmove/0-4 11.0ns ± 0% 11.0ns ± 1% ~ (p=0.294 n=8+10) Memmove/1-4 29.6ns ± 0% 28.0ns ± 1% -5.41% (p=0.000 n=9+10) Memmove/8-4 23.5ns ± 0% 22.1ns ± 0% -6.11% (p=0.000 n=8+8) Memmove/16-4 23.7ns ± 1% 22.1ns ± 0% -6.59% (p=0.000 n=10+8) Memmove/32-4 27.9ns ± 0% 27.1ns ± 0% -3.13% (p=0.000 n=8+8) Memmove/64-4 33.8ns ± 0% 31.5ns ± 1% -6.99% (p=0.000 n=8+10) Memmove/128-4 45.6ns ± 0% 44.2ns ± 1% -3.23% (p=0.000 n=9+10) Memmove/256-4 69.3ns ± 0% 69.3ns ± 0% ~ (p=0.072 n=8+8) Memmove/512-4 127ns ± 0% 110ns ± 0% -13.39% (p=0.000 n=8+8) Memmove/1024-4 222ns ± 0% 205ns ± 1% -7.66% (p=0.000 n=7+10) Memmove/2048-4 411ns ± 0% 366ns ± 0% -10.98% (p=0.000 n=8+9) Memmove/4096-4 795ns ± 1% 695ns ± 1% -12.63% (p=0.000 n=10+10) MemmoveUnalignedDst/64-4 44.0ns ± 0% 40.5ns ± 0% -7.93% (p=0.000 n=8+8) MemmoveUnalignedDst/128-4 59.6ns ± 0% 54.9ns ± 0% -7.85% (p=0.000 n=9+9) MemmoveUnalignedDst/256-4 98.2ns ±11% 90.0ns ± 1% ~ (p=0.130 n=10+10) MemmoveUnalignedDst/512-4 161ns ± 2% 145ns ± 1% -9.96% (p=0.000 n=10+10) MemmoveUnalignedDst/1024-4 281ns ± 0% 265ns ± 0% -5.65% (p=0.000 n=9+8) MemmoveUnalignedDst/2048-4 528ns ± 0% 482ns ± 0% -8.73% (p=0.000 n=8+9) MemmoveUnalignedDst/4096-4 1.02µs ± 1% 0.92µs ± 0% -10.00% (p=0.000 n=10+8) MemmoveUnalignedSrc/64-4 42.4ns ± 1% 40.5ns ± 0% -4.39% (p=0.000 n=10+8) MemmoveUnalignedSrc/128-4 57.4ns ± 0% 57.0ns ± 1% -0.75% (p=0.048 n=9+10) MemmoveUnalignedSrc/256-4 88.1ns ± 1% 89.6ns ± 0% +1.70% (p=0.000 n=9+8) MemmoveUnalignedSrc/512-4 160ns ± 2% 144ns ± 0% -9.89% (p=0.000 n=10+8) MemmoveUnalignedSrc/1024-4 286ns ± 0% 266ns ± 1% -6.69% (p=0.000 n=8+10) MemmoveUnalignedSrc/2048-4 525ns ± 0% 483ns ± 1% -7.96% (p=0.000 n=9+10) MemmoveUnalignedSrc/4096-4 1.01µs ± 0% 0.92µs ± 1% -9.40% (p=0.000 n=8+10) Change-Id: Ia1144e9d4dfafdece6e167c5e576bf80f254c8ab Reviewed-on: https://go-review.googlesource.com/c/go/+/243357 TryBot-Result: Go Bot <gobot@golang.org> Reviewed-by: Martin Möhrmann <moehrmann@google.com> Reviewed-by: eric fang <eric.fang@arm.com> Reviewed-by: Cherry Zhang <cherryyz@google.com>

Change-Id: I6389d7efe90836b6ece44d2e75053d1ad9f35d08 Reviewed-on: https://go-review.googlesource.com/c/go/+/253417 Trust: Emmanuel Odeke <emmanuel@orijtech.com> Reviewed-by: Keith Randall <khr@golang.org>

In the previous CL we ensures that memmove writes pointers atomically, so the concurrent GC won't observe a partially updated pointer. This CL adds a test. Change-Id: Icd1124bf3a15ef25bac20c7fb8933f1a642d897c Reviewed-on: https://go-review.googlesource.com/c/go/+/212627 Reviewed-by: Austin Clements <austin@google.com> Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org>

Speeds up the "go test runtime -cpu=1,2,4 -short -quick" phase of all.bash. For #26473. Change-Id: I090f5a5aa754462b3253a2156dc31fa67ce7af2a Reviewed-on: https://go-review.googlesource.com/c/go/+/177399 Run-TryBot: Russ Cox <rsc@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org> Reviewed-by: Austin Clements <austin@google.com>

name old time/op new time/op delta CopyFat8 2.15ns ± 1% 2.19ns ± 6% ~ (p=0.171 n=8+9) CopyFat12 2.15ns ± 0% 2.17ns ± 2% ~ (p=0.137 n=8+10) CopyFat16 2.17ns ± 3% 2.15ns ± 0% ~ (p=0.211 n=10+10) CopyFat24 2.16ns ± 1% 2.15ns ± 0% ~ (p=0.087 n=10+10) CopyFat32 11.5ns ± 0% 12.8ns ± 2% +10.87% (p=0.000 n=8+10) CopyFat64 20.2ns ± 2% 12.9ns ± 0% -36.11% (p=0.000 n=10+10) CopyFat128 37.2ns ± 0% 21.5ns ± 0% -42.20% (p=0.000 n=10+10) CopyFat256 71.6ns ± 0% 38.7ns ± 0% -45.95% (p=0.000 n=10+10) CopyFat512 140ns ± 0% 73ns ± 0% -47.86% (p=0.000 n=10+9) CopyFat520 142ns ± 0% 74ns ± 0% -47.54% (p=0.000 n=10+10) CopyFat1024 277ns ± 0% 141ns ± 0% -49.10% (p=0.000 n=10+10) Change-Id: If54bc571add5db674d5e081579c87e80153d0a5a Reviewed-on: https://go-review.googlesource.com/97395 Reviewed-by: Cherry Zhang <cherryyz@google.com>

This cuts 23 seconds from all.bash on my MacBook Pro. Change-Id: Ibc4d7c01660b9e9ebd088dd55ba993f0d7ec6aa3 Reviewed-on: https://go-review.googlesource.com/73991 Reviewed-by: Ian Lance Taylor <iant@golang.org>

memmove used to use 2 2-byte load/store pairs to move 4 bytes. When the result is loaded with a single 4-byte load, it caused a store to load fowarding stall. To avoid the stall, special case memmove to use 4 byte ops for the 4 byte copy case. We already have a special case for 8-byte copies. 386 already specializes 4-byte copies. I'll do 2-byte copies also, but not for 1.8. benchmark old ns/op new ns/op delta BenchmarkIssue18740-8 7567 4799 -36.58% 3-byte copies get a bit slower. Other copies are unchanged. name old time/op new time/op delta Memmove/3-8 4.76ns ± 5% 5.26ns ± 3% +10.50% (p=0.000 n=10+10) Fixes #18740 Change-Id: Iec82cbac0ecfee80fa3c8fc83828f9a1819c3c74 Reviewed-on: https://go-review.googlesource.com/35567 Run-TryBot: Keith Randall <khr@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: David Chase <drchase@google.com>

I used the slowtests.go tool as described in https://golang.org/cl/32684 on packages that stood out. go test -short std drops from ~56 to ~52 seconds. This isn't a huge win, but it was mostly an exercise. Updates #17751 Change-Id: I9f3402e36a038d71e662d06ce2c1d52f6c4b674d Reviewed-on: https://go-review.googlesource.com/32751 Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Ian Lance Taylor <iant@golang.org>

Use AVX if available on 4th generation of Intel(TM) Core(TM) processors. (collected on E5 2609v3 @1.9GHz) name old speed new speed delta Memmove/1-6 158MB/s ± 0% 172MB/s ± 0% +9.09% (p=0.000 n=16+16) Memmove/2-6 316MB/s ± 0% 345MB/s ± 0% +9.09% (p=0.000 n=18+16) Memmove/3-6 517MB/s ± 0% 517MB/s ± 0% ~ (p=0.445 n=16+16) Memmove/4-6 687MB/s ± 1% 690MB/s ± 0% +0.35% (p=0.000 n=20+17) Memmove/5-6 729MB/s ± 0% 729MB/s ± 0% +0.01% (p=0.000 n=16+18) Memmove/6-6 875MB/s ± 0% 875MB/s ± 0% +0.01% (p=0.000 n=18+18) Memmove/7-6 1.02GB/s ± 0% 1.02GB/s ± 1% ~ (p=0.139 n=19+20) Memmove/8-6 1.26GB/s ± 0% 1.26GB/s ± 0% +0.00% (p=0.000 n=18+18) Memmove/9-6 1.42GB/s ± 0% 1.42GB/s ± 0% +0.00% (p=0.000 n=17+18) Memmove/10-6 1.58GB/s ± 0% 1.58GB/s ± 0% +0.00% (p=0.000 n=19+19) Memmove/11-6 1.74GB/s ± 0% 1.74GB/s ± 0% +0.00% (p=0.001 n=18+17) Memmove/12-6 1.90GB/s ± 0% 1.90GB/s ± 0% +0.00% (p=0.000 n=19+19) Memmove/13-6 2.05GB/s ± 0% 2.05GB/s ± 0% +0.00% (p=0.000 n=18+19) Memmove/14-6 2.21GB/s ± 0% 2.21GB/s ± 0% +0.00% (p=0.000 n=16+20) Memmove/15-6 2.37GB/s ± 0% 2.37GB/s ± 0% +0.00% (p=0.004 n=19+20) Memmove/16-6 2.53GB/s ± 0% 2.53GB/s ± 0% +0.00% (p=0.000 n=16+16) Memmove/32-6 4.67GB/s ± 0% 4.67GB/s ± 0% +0.00% (p=0.000 n=17+17) Memmove/64-6 8.67GB/s ± 0% 8.64GB/s ± 0% -0.33% (p=0.000 n=18+17) Memmove/128-6 12.6GB/s ± 0% 11.6GB/s ± 0% -8.05% (p=0.000 n=16+19) Memmove/256-6 16.3GB/s ± 0% 16.6GB/s ± 0% +1.66% (p=0.000 n=20+18) Memmove/512-6 21.5GB/s ± 0% 24.4GB/s ± 0% +13.35% (p=0.000 n=18+17) Memmove/1024-6 24.7GB/s ± 0% 33.7GB/s ± 0% +36.12% (p=0.000 n=18+18) Memmove/2048-6 27.3GB/s ± 0% 43.3GB/s ± 0% +58.77% (p=0.000 n=19+17) Memmove/4096-6 37.5GB/s ± 0% 50.5GB/s ± 0% +34.56% (p=0.000 n=19+19) MemmoveUnalignedDst/1-6 135MB/s ± 0% 146MB/s ± 0% +7.69% (p=0.000 n=16+14) MemmoveUnalignedDst/2-6 271MB/s ± 0% 292MB/s ± 0% +7.69% (p=0.000 n=18+18) MemmoveUnalignedDst/3-6 438MB/s ± 0% 438MB/s ± 0% ~ (p=0.352 n=16+19) MemmoveUnalignedDst/4-6 584MB/s ± 0% 584MB/s ± 0% ~ (p=0.876 n=17+17) MemmoveUnalignedDst/5-6 631MB/s ± 1% 632MB/s ± 0% +0.25% (p=0.000 n=20+17) MemmoveUnalignedDst/6-6 759MB/s ± 0% 759MB/s ± 0% +0.00% (p=0.000 n=19+16) MemmoveUnalignedDst/7-6 885MB/s ± 0% 883MB/s ± 1% ~ (p=0.647 n=18+20) MemmoveUnalignedDst/8-6 1.08GB/s ± 0% 1.08GB/s ± 0% +0.00% (p=0.035 n=19+18) MemmoveUnalignedDst/9-6 1.22GB/s ± 0% 1.22GB/s ± 0% ~ (p=0.251 n=18+17) MemmoveUnalignedDst/10-6 1.35GB/s ± 0% 1.35GB/s ± 0% ~ (p=0.327 n=17+18) MemmoveUnalignedDst/11-6 1.49GB/s ± 0% 1.49GB/s ± 0% ~ (p=0.531 n=18+19) MemmoveUnalignedDst/12-6 1.63GB/s ± 0% 1.63GB/s ± 0% ~ (p=0.886 n=19+18) MemmoveUnalignedDst/13-6 1.76GB/s ± 0% 1.76GB/s ± 1% -0.24% (p=0.006 n=18+20) MemmoveUnalignedDst/14-6 1.90GB/s ± 0% 1.90GB/s ± 0% ~ (p=0.818 n=20+19) MemmoveUnalignedDst/15-6 2.03GB/s ± 0% 2.03GB/s ± 0% ~ (p=0.294 n=17+16) MemmoveUnalignedDst/16-6 2.17GB/s ± 0% 2.17GB/s ± 0% ~ (p=0.602 n=16+18) MemmoveUnalignedDst/32-6 4.05GB/s ± 0% 4.05GB/s ± 0% +0.00% (p=0.010 n=18+17) MemmoveUnalignedDst/64-6 7.59GB/s ± 0% 7.59GB/s ± 0% +0.00% (p=0.022 n=18+16) MemmoveUnalignedDst/128-6 11.1GB/s ± 0% 11.4GB/s ± 0% +2.79% (p=0.000 n=18+17) MemmoveUnalignedDst/256-6 16.4GB/s ± 0% 16.7GB/s ± 0% +1.59% (p=0.000 n=20+17) MemmoveUnalignedDst/512-6 15.7GB/s ± 0% 21.3GB/s ± 0% +35.87% (p=0.000 n=18+20) MemmoveUnalignedDst/1024-6 16.0GB/s ±20% 31.5GB/s ± 0% +96.93% (p=0.000 n=20+14) MemmoveUnalignedDst/2048-6 19.6GB/s ± 0% 42.1GB/s ± 0% +115.16% (p=0.000 n=17+18) MemmoveUnalignedDst/4096-6 6.41GB/s ± 0% 33.18GB/s ± 0% +417.56% (p=0.000 n=17+18) MemmoveUnalignedSrc/1-6 171MB/s ± 0% 166MB/s ± 0% -3.33% (p=0.000 n=19+16) MemmoveUnalignedSrc/2-6 343MB/s ± 0% 342MB/s ± 1% -0.41% (p=0.000 n=17+20) MemmoveUnalignedSrc/3-6 508MB/s ± 0% 493MB/s ± 1% -2.90% (p=0.000 n=17+17) MemmoveUnalignedSrc/4-6 677MB/s ± 0% 660MB/s ± 2% -2.55% (p=0.000 n=17+20) MemmoveUnalignedSrc/5-6 790MB/s ± 0% 790MB/s ± 0% ~ (p=0.139 n=17+17) MemmoveUnalignedSrc/6-6 948MB/s ± 0% 946MB/s ± 1% ~ (p=0.330 n=17+19) MemmoveUnalignedSrc/7-6 1.11GB/s ± 0% 1.11GB/s ± 0% -0.05% (p=0.026 n=17+17) MemmoveUnalignedSrc/8-6 1.38GB/s ± 0% 1.38GB/s ± 0% ~ (p=0.091 n=18+16) MemmoveUnalignedSrc/9-6 1.42GB/s ± 0% 1.40GB/s ± 1% -1.04% (p=0.000 n=19+20) MemmoveUnalignedSrc/10-6 1.58GB/s ± 0% 1.56GB/s ± 1% -1.15% (p=0.000 n=18+19) MemmoveUnalignedSrc/11-6 1.73GB/s ± 0% 1.71GB/s ± 1% -1.30% (p=0.000 n=20+20) MemmoveUnalignedSrc/12-6 1.89GB/s ± 0% 1.87GB/s ± 1% -1.18% (p=0.000 n=17+20) MemmoveUnalignedSrc/13-6 2.05GB/s ± 0% 2.02GB/s ± 1% -1.18% (p=0.000 n=17+20) MemmoveUnalignedSrc/14-6 2.21GB/s ± 0% 2.18GB/s ± 1% -1.14% (p=0.000 n=17+20) MemmoveUnalignedSrc/15-6 2.36GB/s ± 0% 2.34GB/s ± 1% -1.04% (p=0.000 n=17+20) MemmoveUnalignedSrc/16-6 2.52GB/s ± 0% 2.49GB/s ± 1% -1.26% (p=0.000 n=19+20) MemmoveUnalignedSrc/32-6 4.82GB/s ± 0% 4.61GB/s ± 0% -4.40% (p=0.000 n=19+20) MemmoveUnalignedSrc/64-6 5.03GB/s ± 4% 7.97GB/s ± 0% +58.55% (p=0.000 n=20+16) MemmoveUnalignedSrc/128-6 11.1GB/s ± 0% 11.2GB/s ± 0% +0.52% (p=0.000 n=17+18) MemmoveUnalignedSrc/256-6 16.5GB/s ± 0% 16.4GB/s ± 0% -0.10% (p=0.000 n=20+18) MemmoveUnalignedSrc/512-6 21.0GB/s ± 0% 22.1GB/s ± 0% +5.48% (p=0.000 n=14+17) MemmoveUnalignedSrc/1024-6 24.9GB/s ± 0% 31.9GB/s ± 0% +28.20% (p=0.000 n=19+20) MemmoveUnalignedSrc/2048-6 23.3GB/s ± 0% 33.8GB/s ± 0% +45.22% (p=0.000 n=17+19) MemmoveUnalignedSrc/4096-6 37.3GB/s ± 0% 42.7GB/s ± 0% +14.30% (p=0.000 n=17+17) Change-Id: Id66aa3e499ccfb117cb99d623ef326b50d057b64 Reviewed-on: https://go-review.googlesource.com/29590 Run-TryBot: Denis Nagorny <denis.nagorny@intel.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>

This reverts commit 3607c5f4f18ad4d423e40996ebf7f46b2f79ce02. This was causing failures on amd64 machines without AVX. Fixes #16939 Change-Id: I70080fbb4e7ae791857334f2bffd847d08cb25fa Reviewed-on: https://go-review.googlesource.com/28274 Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org> Reviewed-by: Keith Randall <khr@golang.org>

Use AVX if available on 4th generation of Intel(TM) Core(TM) processors. (collected on E5 2609v3 @1.9GHz) name old speed new speed delta Memmove/1-6 158MB/s ± 0% 172MB/s ± 0% +9.09% (p=0.000 n=16+16) Memmove/2-6 316MB/s ± 0% 345MB/s ± 0% +9.09% (p=0.000 n=18+16) Memmove/3-6 517MB/s ± 0% 517MB/s ± 0% ~ (p=0.445 n=16+16) Memmove/4-6 687MB/s ± 1% 690MB/s ± 0% +0.35% (p=0.000 n=20+17) Memmove/5-6 729MB/s ± 0% 729MB/s ± 0% +0.01% (p=0.000 n=16+18) Memmove/6-6 875MB/s ± 0% 875MB/s ± 0% +0.01% (p=0.000 n=18+18) Memmove/7-6 1.02GB/s ± 0% 1.02GB/s ± 1% ~ (p=0.139 n=19+20) Memmove/8-6 1.26GB/s ± 0% 1.26GB/s ± 0% +0.00% (p=0.000 n=18+18) Memmove/9-6 1.42GB/s ± 0% 1.42GB/s ± 0% +0.00% (p=0.000 n=17+18) Memmove/10-6 1.58GB/s ± 0% 1.58GB/s ± 0% +0.00% (p=0.000 n=19+19) Memmove/11-6 1.74GB/s ± 0% 1.74GB/s ± 0% +0.00% (p=0.001 n=18+17) Memmove/12-6 1.90GB/s ± 0% 1.90GB/s ± 0% +0.00% (p=0.000 n=19+19) Memmove/13-6 2.05GB/s ± 0% 2.05GB/s ± 0% +0.00% (p=0.000 n=18+19) Memmove/14-6 2.21GB/s ± 0% 2.21GB/s ± 0% +0.00% (p=0.000 n=16+20) Memmove/15-6 2.37GB/s ± 0% 2.37GB/s ± 0% +0.00% (p=0.004 n=19+20) Memmove/16-6 2.53GB/s ± 0% 2.53GB/s ± 0% +0.00% (p=0.000 n=16+16) Memmove/32-6 4.67GB/s ± 0% 4.67GB/s ± 0% +0.00% (p=0.000 n=17+17) Memmove/64-6 8.67GB/s ± 0% 8.64GB/s ± 0% -0.33% (p=0.000 n=18+17) Memmove/128-6 12.6GB/s ± 0% 11.6GB/s ± 0% -8.05% (p=0.000 n=16+19) Memmove/256-6 16.3GB/s ± 0% 16.6GB/s ± 0% +1.66% (p=0.000 n=20+18) Memmove/512-6 21.5GB/s ± 0% 24.4GB/s ± 0% +13.35% (p=0.000 n=18+17) Memmove/1024-6 24.7GB/s ± 0% 33.7GB/s ± 0% +36.12% (p=0.000 n=18+18) Memmove/2048-6 27.3GB/s ± 0% 43.3GB/s ± 0% +58.77% (p=0.000 n=19+17) Memmove/4096-6 37.5GB/s ± 0% 50.5GB/s ± 0% +34.56% (p=0.000 n=19+19) MemmoveUnalignedDst/1-6 135MB/s ± 0% 146MB/s ± 0% +7.69% (p=0.000 n=16+14) MemmoveUnalignedDst/2-6 271MB/s ± 0% 292MB/s ± 0% +7.69% (p=0.000 n=18+18) MemmoveUnalignedDst/3-6 438MB/s ± 0% 438MB/s ± 0% ~ (p=0.352 n=16+19) MemmoveUnalignedDst/4-6 584MB/s ± 0% 584MB/s ± 0% ~ (p=0.876 n=17+17) MemmoveUnalignedDst/5-6 631MB/s ± 1% 632MB/s ± 0% +0.25% (p=0.000 n=20+17) MemmoveUnalignedDst/6-6 759MB/s ± 0% 759MB/s ± 0% +0.00% (p=0.000 n=19+16) MemmoveUnalignedDst/7-6 885MB/s ± 0% 883MB/s ± 1% ~ (p=0.647 n=18+20) MemmoveUnalignedDst/8-6 1.08GB/s ± 0% 1.08GB/s ± 0% +0.00% (p=0.035 n=19+18) MemmoveUnalignedDst/9-6 1.22GB/s ± 0% 1.22GB/s ± 0% ~ (p=0.251 n=18+17) MemmoveUnalignedDst/10-6 1.35GB/s ± 0% 1.35GB/s ± 0% ~ (p=0.327 n=17+18) MemmoveUnalignedDst/11-6 1.49GB/s ± 0% 1.49GB/s ± 0% ~ (p=0.531 n=18+19) MemmoveUnalignedDst/12-6 1.63GB/s ± 0% 1.63GB/s ± 0% ~ (p=0.886 n=19+18) MemmoveUnalignedDst/13-6 1.76GB/s ± 0% 1.76GB/s ± 1% -0.24% (p=0.006 n=18+20) MemmoveUnalignedDst/14-6 1.90GB/s ± 0% 1.90GB/s ± 0% ~ (p=0.818 n=20+19) MemmoveUnalignedDst/15-6 2.03GB/s ± 0% 2.03GB/s ± 0% ~ (p=0.294 n=17+16) MemmoveUnalignedDst/16-6 2.17GB/s ± 0% 2.17GB/s ± 0% ~ (p=0.602 n=16+18) MemmoveUnalignedDst/32-6 4.05GB/s ± 0% 4.05GB/s ± 0% +0.00% (p=0.010 n=18+17) MemmoveUnalignedDst/64-6 7.59GB/s ± 0% 7.59GB/s ± 0% +0.00% (p=0.022 n=18+16) MemmoveUnalignedDst/128-6 11.1GB/s ± 0% 11.4GB/s ± 0% +2.79% (p=0.000 n=18+17) MemmoveUnalignedDst/256-6 16.4GB/s ± 0% 16.7GB/s ± 0% +1.59% (p=0.000 n=20+17) MemmoveUnalignedDst/512-6 15.7GB/s ± 0% 21.3GB/s ± 0% +35.87% (p=0.000 n=18+20) MemmoveUnalignedDst/1024-6 16.0GB/s ±20% 31.5GB/s ± 0% +96.93% (p=0.000 n=20+14) MemmoveUnalignedDst/2048-6 19.6GB/s ± 0% 42.1GB/s ± 0% +115.16% (p=0.000 n=17+18) MemmoveUnalignedDst/4096-6 6.41GB/s ± 0% 33.18GB/s ± 0% +417.56% (p=0.000 n=17+18) MemmoveUnalignedSrc/1-6 171MB/s ± 0% 166MB/s ± 0% -3.33% (p=0.000 n=19+16) MemmoveUnalignedSrc/2-6 343MB/s ± 0% 342MB/s ± 1% -0.41% (p=0.000 n=17+20) MemmoveUnalignedSrc/3-6 508MB/s ± 0% 493MB/s ± 1% -2.90% (p=0.000 n=17+17) MemmoveUnalignedSrc/4-6 677MB/s ± 0% 660MB/s ± 2% -2.55% (p=0.000 n=17+20) MemmoveUnalignedSrc/5-6 790MB/s ± 0% 790MB/s ± 0% ~ (p=0.139 n=17+17) MemmoveUnalignedSrc/6-6 948MB/s ± 0% 946MB/s ± 1% ~ (p=0.330 n=17+19) MemmoveUnalignedSrc/7-6 1.11GB/s ± 0% 1.11GB/s ± 0% -0.05% (p=0.026 n=17+17) MemmoveUnalignedSrc/8-6 1.38GB/s ± 0% 1.38GB/s ± 0% ~ (p=0.091 n=18+16) MemmoveUnalignedSrc/9-6 1.42GB/s ± 0% 1.40GB/s ± 1% -1.04% (p=0.000 n=19+20) MemmoveUnalignedSrc/10-6 1.58GB/s ± 0% 1.56GB/s ± 1% -1.15% (p=0.000 n=18+19) MemmoveUnalignedSrc/11-6 1.73GB/s ± 0% 1.71GB/s ± 1% -1.30% (p=0.000 n=20+20) MemmoveUnalignedSrc/12-6 1.89GB/s ± 0% 1.87GB/s ± 1% -1.18% (p=0.000 n=17+20) MemmoveUnalignedSrc/13-6 2.05GB/s ± 0% 2.02GB/s ± 1% -1.18% (p=0.000 n=17+20) MemmoveUnalignedSrc/14-6 2.21GB/s ± 0% 2.18GB/s ± 1% -1.14% (p=0.000 n=17+20) MemmoveUnalignedSrc/15-6 2.36GB/s ± 0% 2.34GB/s ± 1% -1.04% (p=0.000 n=17+20) MemmoveUnalignedSrc/16-6 2.52GB/s ± 0% 2.49GB/s ± 1% -1.26% (p=0.000 n=19+20) MemmoveUnalignedSrc/32-6 4.82GB/s ± 0% 4.61GB/s ± 0% -4.40% (p=0.000 n=19+20) MemmoveUnalignedSrc/64-6 5.03GB/s ± 4% 7.97GB/s ± 0% +58.55% (p=0.000 n=20+16) MemmoveUnalignedSrc/128-6 11.1GB/s ± 0% 11.2GB/s ± 0% +0.52% (p=0.000 n=17+18) MemmoveUnalignedSrc/256-6 16.5GB/s ± 0% 16.4GB/s ± 0% -0.10% (p=0.000 n=20+18) MemmoveUnalignedSrc/512-6 21.0GB/s ± 0% 22.1GB/s ± 0% +5.48% (p=0.000 n=14+17) MemmoveUnalignedSrc/1024-6 24.9GB/s ± 0% 31.9GB/s ± 0% +28.20% (p=0.000 n=19+20) MemmoveUnalignedSrc/2048-6 23.3GB/s ± 0% 33.8GB/s ± 0% +45.22% (p=0.000 n=17+19) MemmoveUnalignedSrc/4096-6 37.3GB/s ± 0% 42.7GB/s ± 0% +14.30% (p=0.000 n=17+17) Change-Id: Iab488d93a293cdf573ab5cd89b95a818bbb5d531 Reviewed-on: https://go-review.googlesource.com/22515 Run-TryBot: Denis Nagorny <denis.nagorny@intel.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>

Names of sub-benchmarks are preserved, short of the additional slash. Change-Id: I9b3f82964f9a44b0d28724413320afd091ed3106 Reviewed-on: https://go-review.googlesource.com/23425 Reviewed-by: Russ Cox <rsc@golang.org> Run-TryBot: Marcel van Lohuizen <mpvl@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org>

MOVSB is quite a bit faster for unaligned moves. Possibly we should use MOVSB all of the time, but Intel folks say it might be a bit faster to use MOVSQ on some processors (but not any I have access to at the moment). benchmark old ns/op new ns/op delta BenchmarkMemmove4096-8 93.9 93.2 -0.75% BenchmarkMemmoveUnalignedDst4096-8 256 151 -41.02% BenchmarkMemmoveUnalignedSrc4096-8 175 90.5 -48.29% Fixes #14630 Change-Id: I568e6d6590eb3615e6a699fb474020596be665ff Reviewed-on: https://go-review.googlesource.com/20293 Reviewed-by: Ian Lance Taylor <iant@golang.org>

This is a subset of https://golang.org/cl/20022 with only the copyright header lines, so the next CL will be smaller and more reviewable. Go policy has been single space after periods in comments for some time. The copyright header template at: https://golang.org/doc/contribute.html#copyright also uses a single space. Make them all consistent. Change-Id: Icc26c6b8495c3820da6b171ca96a74701b4a01b0 Reviewed-on: https://go-review.googlesource.com/20111 Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org> Reviewed-by: Ian Lance Taylor <iant@golang.org> Reviewed-by: Matthew Dempsky <mdempsky@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org>

Results are a bit noisy, but show good improvement (haswell) name old time/op new time/op delta Memclr5-48 6.06ns ± 8% 5.65ns ± 8% -6.81% (p=0.000 n=20+20) Memclr16-48 5.75ns ± 6% 5.71ns ± 6% ~ (p=0.545 n=20+19) Memclr64-48 6.54ns ± 5% 6.14ns ± 9% -6.12% (p=0.000 n=18+20) Memclr256-48 10.1ns ±12% 9.9ns ±14% ~ (p=0.285 n=20+19) Memclr4096-48 104ns ± 8% 57ns ±15% -44.98% (p=0.000 n=20+20) Memclr65536-48 2.45µs ± 5% 2.43µs ± 8% ~ (p=0.665 n=16+20) Memclr1M-48 58.7µs ±13% 56.4µs ±11% -3.92% (p=0.033 n=20+19) Memclr4M-48 233µs ± 9% 234µs ± 9% ~ (p=0.728 n=20+19) Memclr8M-48 469µs ±11% 472µs ±16% ~ (p=0.947 n=20+20) Memclr16M-48 947µs ±10% 916µs ±10% ~ (p=0.050 n=20+19) Memclr64M-48 10.9ms ±10% 4.5ms ± 9% -58.43% (p=0.000 n=20+20) GoMemclr5-48 3.80ns ±13% 3.38ns ± 6% -11.02% (p=0.000 n=20+20) GoMemclr16-48 3.34ns ±15% 3.40ns ± 9% ~ (p=0.351 n=20+20) GoMemclr64-48 4.10ns ±15% 4.04ns ±10% ~ (p=1.000 n=20+19) GoMemclr256-48 7.75ns ±20% 7.88ns ± 9% ~ (p=0.227 n=20+19) name old speed new speed delta Memclr5-48 826MB/s ± 7% 886MB/s ± 8% +7.32% (p=0.000 n=20+20) Memclr16-48 2.78GB/s ± 5% 2.81GB/s ± 6% ~ (p=0.550 n=20+19) Memclr64-48 9.79GB/s ± 5% 10.44GB/s ±10% +6.64% (p=0.000 n=18+20) Memclr256-48 25.4GB/s ±14% 25.6GB/s ±12% ~ (p=0.647 n=20+19) Memclr4096-48 39.4GB/s ± 8% 72.0GB/s ±13% +82.81% (p=0.000 n=20+20) Memclr65536-48 26.6GB/s ± 6% 27.0GB/s ± 9% ~ (p=0.517 n=17+20) Memclr1M-48 17.9GB/s ±12% 18.5GB/s ±11% ~ (p=0.068 n=20+20) Memclr4M-48 18.0GB/s ± 9% 17.8GB/s ±14% ~ (p=0.547 n=20+20) Memclr8M-48 17.9GB/s ±10% 17.8GB/s ±14% ~ (p=0.947 n=20+20) Memclr16M-48 17.8GB/s ± 9% 18.4GB/s ± 9% ~ (p=0.050 n=20+19) Memclr64M-48 6.19GB/s ±10% 14.87GB/s ± 9% +140.11% (p=0.000 n=20+20) GoMemclr5-48 1.31GB/s ±10% 1.48GB/s ± 6% +13.06% (p=0.000 n=19+20) GoMemclr16-48 4.81GB/s ±14% 4.71GB/s ± 8% ~ (p=0.341 n=20+20) GoMemclr64-48 15.7GB/s ±13% 15.8GB/s ±11% ~ (p=0.967 n=20+19) Change-Id: I393f3f20e2f31538d1b1dd70d6e5c201c106a095 Reviewed-on: https://go-review.googlesource.com/16773 Run-TryBot: Ilya Tocar <ilya.tocar@intel.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Klaus Post <klauspost@gmail.com> Reviewed-by: Keith Randall <khr@golang.org>

they can Not only is this an obvious optimization: benchmark old MB/s new MB/s speedup BenchmarkMemmove1-4 35.35 29.65 0.84x BenchmarkMemmove2-4 63.78 52.53 0.82x BenchmarkMemmove3-4 89.72 73.96 0.82x BenchmarkMemmove4-4 109.94 95.73 0.87x BenchmarkMemmove5-4 127.60 112.80 0.88x BenchmarkMemmove6-4 143.59 126.67 0.88x BenchmarkMemmove7-4 157.90 138.92 0.88x BenchmarkMemmove8-4 167.18 231.81 1.39x BenchmarkMemmove9-4 175.23 252.07 1.44x BenchmarkMemmove10-4 165.68 261.10 1.58x BenchmarkMemmove11-4 174.43 263.31 1.51x BenchmarkMemmove12-4 180.76 267.56 1.48x BenchmarkMemmove13-4 189.06 284.93 1.51x BenchmarkMemmove14-4 186.31 284.72 1.53x BenchmarkMemmove15-4 195.75 281.62 1.44x BenchmarkMemmove16-4 202.96 439.23 2.16x BenchmarkMemmove32-4 264.77 775.77 2.93x BenchmarkMemmove64-4 306.81 1209.64 3.94x BenchmarkMemmove128-4 357.03 1515.41 4.24x BenchmarkMemmove256-4 380.77 2066.01 5.43x BenchmarkMemmove512-4 385.05 2556.45 6.64x BenchmarkMemmove1024-4 381.23 2804.10 7.36x BenchmarkMemmove2048-4 379.06 2814.83 7.43x BenchmarkMemmove4096-4 387.43 3064.96 7.91x BenchmarkMemmoveUnaligned1-4 28.91 25.40 0.88x BenchmarkMemmoveUnaligned2-4 56.13 47.56 0.85x BenchmarkMemmoveUnaligned3-4 74.32 69.31 0.93x BenchmarkMemmoveUnaligned4-4 97.02 83.58 0.86x BenchmarkMemmoveUnaligned5-4 110.17 103.62 0.94x BenchmarkMemmoveUnaligned6-4 124.95 113.26 0.91x BenchmarkMemmoveUnaligned7-4 142.37 130.82 0.92x BenchmarkMemmoveUnaligned8-4 151.20 205.64 1.36x BenchmarkMemmoveUnaligned9-4 166.97 215.42 1.29x BenchmarkMemmoveUnaligned10-4 148.49 221.22 1.49x BenchmarkMemmoveUnaligned11-4 159.47 239.57 1.50x BenchmarkMemmoveUnaligned12-4 163.52 247.32 1.51x BenchmarkMemmoveUnaligned13-4 167.55 256.54 1.53x BenchmarkMemmoveUnaligned14-4 175.12 251.03 1.43x BenchmarkMemmoveUnaligned15-4 192.10 267.13 1.39x BenchmarkMemmoveUnaligned16-4 190.76 378.87 1.99x BenchmarkMemmoveUnaligned32-4 259.02 562.98 2.17x BenchmarkMemmoveUnaligned64-4 317.72 842.44 2.65x BenchmarkMemmoveUnaligned128-4 355.43 1274.49 3.59x BenchmarkMemmoveUnaligned256-4 378.17 1815.74 4.80x BenchmarkMemmoveUnaligned512-4 362.15 2180.81 6.02x BenchmarkMemmoveUnaligned1024-4 376.07 2453.58 6.52x BenchmarkMemmoveUnaligned2048-4 381.66 2568.32 6.73x BenchmarkMemmoveUnaligned4096-4 398.51 2669.36 6.70x BenchmarkMemclr5-4 113.83 107.93 0.95x BenchmarkMemclr16-4 223.84 389.63 1.74x BenchmarkMemclr64-4 421.99 1209.58 2.87x BenchmarkMemclr256-4 525.94 2411.58 4.59x BenchmarkMemclr4096-4 581.66 4372.20 7.52x BenchmarkMemclr65536-4 565.84 4747.48 8.39x BenchmarkGoMemclr5-4 194.63 160.31 0.82x BenchmarkGoMemclr16-4 295.30 630.07 2.13x BenchmarkGoMemclr64-4 480.24 1884.03 3.92x BenchmarkGoMemclr256-4 540.23 2926.49 5.42x but it turns out that it's necessary to avoid the GC seeing partially written pointers. It's of course possible to be more sophisticated (using ldp/stp to move 16 bytes at a time in the core loop and unrolling the tail copying loops being the obvious ideas) but I wanted something simple and (reasonably) obviously correct. Fixes #12552 Change-Id: Iaeaf8a812cd06f4747ba2f792de1ded738890735 Reviewed-on: https://go-review.googlesource.com/14813 Reviewed-by: Austin Clements <austin@google.com>

It is faster to execute MOVQ AX,(DI) MOVQ AX,8(DI) MOVQ AX,16(DI) MOVQ AX,24(DI) ADDQ $32,DI than STOSQ STOSQ STOSQ STOSQ However, in order to be able to jump into the middle of a block of MOVQs, the call site needs to pre-adjust DI. If we're clearing a small area, the cost of that DI pre-adjustment isn't repaid. This CL switches the DUFFZERO implementation to use a hybrid strategy, in which small clears use STOSQ as before, but large clears use mostly MOVQ/ADDQ blocks. benchmark old ns/op new ns/op delta BenchmarkClearFat8 0.55 0.55 +0.00% BenchmarkClearFat12 0.82 0.83 +1.22% BenchmarkClearFat16 0.55 0.55 +0.00% BenchmarkClearFat24 0.82 0.82 +0.00% BenchmarkClearFat32 2.20 1.94 -11.82% BenchmarkClearFat40 1.92 1.66 -13.54% BenchmarkClearFat48 2.21 1.93 -12.67% BenchmarkClearFat56 3.03 2.20 -27.39% BenchmarkClearFat64 3.26 2.48 -23.93% BenchmarkClearFat72 3.57 2.76 -22.69% BenchmarkClearFat80 3.83 3.05 -20.37% BenchmarkClearFat88 4.14 3.30 -20.29% BenchmarkClearFat128 5.54 4.69 -15.34% BenchmarkClearFat256 9.95 9.09 -8.64% BenchmarkClearFat512 18.7 17.9 -4.28% BenchmarkClearFat1024 36.2 35.4 -2.21% Change-Id: Ic786406d9b3cab68d5a231688f9e66fcd1bd7103 Reviewed-on: https://go-review.googlesource.com/2585 Reviewed-by: Keith Randall <khr@golang.org>

Recognize loops of the form for i := range a { a[i] = zero } in which the evaluation of a is free from side effects. Replace these loops with calls to memclr. This occurs in the stdlib in 18 places. The motivating example is clearing a byte slice: benchmark old ns/op new ns/op delta BenchmarkGoMemclr5 3.31 3.26 -1.51% BenchmarkGoMemclr16 13.7 3.28 -76.06% BenchmarkGoMemclr64 50.8 4.14 -91.85% BenchmarkGoMemclr256 157 6.02 -96.17% Update #5373. Change-Id: I99d3e6f5f268e8c6499b7e661df46403e5eb83e4 Reviewed-on: https://go-review.googlesource.com/2520 Reviewed-by: Keith Randall <khr@golang.org>

Preparation was in CL 134570043. This CL contains only the effect of 'hg mv src/pkg/* src'. For more about the move, see golang.org/s/go14nopkg.