aboutsummaryrefslogtreecommitdiff
path: root/src/cmd/compile/internal/ssa/gen/ARM64Ops.go
AgeCommit message (Collapse)Author
2021-08-02[release-branch.go1.15] cmd/compile: mark R16, R17 clobbered for ↵Cherry Zhang
non-standard calls on ARM64 On ARM64, (external) linker generated trampoline may clobber R16 and R17. In CL 183842 we change Duff's devices not to use those registers. However, this is not enough. The register allocator also needs to know that these registers may be clobbered in any calls that don't follow the standard Go calling convention. This include Duff's devices and the write barrier. Fixes #46927. Updates #32773. Change-Id: Ia52a891d9bbb8515c927617dd53aee5af5bd9aa4 Reviewed-on: https://go-review.googlesource.com/c/go/+/184437 Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Go Bot <gobot@golang.org> Reviewed-by: Meng Zhuo <mzh@golangcn.org> Reviewed-by: Keith Randall <khr@golang.org> Trust: Meng Zhuo <mzh@golangcn.org> (cherry picked from commit 11b4aee05bfe83513cf08f83091e5aef8b33e766) Reviewed-on: https://go-review.googlesource.com/c/go/+/331030 Trust: Cherry Mui <cherryyz@google.com> Run-TryBot: Cherry Mui <cherryyz@google.com>
2020-06-18cmd/compile: redo flag constant ops for arm64Keith Randall
Fixes the *noov opcodes so they handle a constant argument properly. Most of the infrastructure for this CL is in CL 238077 (the arm32 one). Fixes #39505 Change-Id: Id424a4e18964b848f05aa42f4d78e5f2e2cdf43b Reviewed-on: https://go-review.googlesource.com/c/go/+/237999 Reviewed-by: Cherry Zhang <cherryyz@google.com>
2020-05-29cmd/compile: fix incorrect rewriting to if conditionXiangdong Ji
Some ARM64 rewriting rules convert 'comparing to zero' conditions of if statements to a simplified version utilizing CMN and CMP instructions to branch over condition flags, in order to save one Add or Sub caculation. Such optimizations lead to wrong branching in case an overflow/underflow occurs when executing CMN or CMP. Fix the issue by introducing new block opcodes that don't honor the overflow/underflow flag, in the following categories: Block-Op Meaning ARM condition codes 1. LTnoov less than MI 2. GEnoov greater than or equal PL 3. LEnoov less than or equal MI || EQ 4. GTnoov greater than NEQ & PL The backend generates two consecutive branch instructions for 'LEnoov' and 'GTnoov' to model their expected behavior. A slight change to 'gc' and amd64/386 backends is made to unify the code generation. Add a test 'TestCondRewrite' as justification, it covers 32 incorrect rules identified on arm64, more might be needed on other arches, like 32-bit arm. Add two benchmarks profiling the aforementioned category 1&2 and category 3&4 separetely, we expect the first two categories will show performance improvement and the second will not result in visible regression compared with the non-optimized version. This change also updates TestFormats to support using %#x. Examples exhibiting where does the issue come from: 1: 'if x + 3 < 0' might be converted to: before: CMN $3, R0 BGE <else branch> // wrong branch is taken if 'x+3' overflows after: CMN $3, R0 BPL <else branch> 2: 'if y - 3 > 0' might be converted to: before: CMP $3, R0 BLE <else branch> // wrong branch is taken if 'y-3' underflows after: CMP $3, R0 BMI <else branch> BEQ <else branch> Benchmark data from different kinds of arm64 servers, 'old' is the non-optimized version (not the parent commit), generally the optimization version outperforms. S1: name old time/op new time/op delta CondRewrite/SoloJump 13.6ns ± 0% 12.9ns ± 0% -5.15% (p=0.000 n=10+10) CondRewrite/CombJump 13.8ns ± 1% 12.9ns ± 0% -6.32% (p=0.000 n=10+10) S2: name old time/op new time/op delta CondRewrite/SoloJump 11.6ns ± 0% 10.9ns ± 0% -6.03% (p=0.000 n=10+10) CondRewrite/CombJump 11.4ns ± 0% 10.8ns ± 1% -5.53% (p=0.000 n=10+10) S3: name old time/op new time/op delta CondRewrite/SoloJump 7.36ns ± 0% 7.50ns ± 0% +1.79% (p=0.000 n=9+10) CondRewrite/CombJump 7.35ns ± 0% 7.75ns ± 0% +5.51% (p=0.000 n=8+9) S4: name old time/op new time/op delta CondRewrite/SoloJump-224 11.5ns ± 1% 10.9ns ± 0% -4.97% (p=0.000 n=10+10) CondRewrite/CombJump-224 11.9ns ± 0% 11.5ns ± 0% -2.95% (p=0.000 n=10+10) S5: name old time/op new time/op delta CondRewrite/SoloJump 10.0ns ± 0% 10.0ns ± 0% -0.45% (p=0.000 n=9+10) CondRewrite/CombJump 9.93ns ± 0% 9.77ns ± 0% -1.53% (p=0.000 n=10+9) Go1 perf. data: name old time/op new time/op delta BinaryTree17 6.29s ± 1% 6.30s ± 1% ~ (p=1.000 n=5+5) Fannkuch11 5.40s ± 0% 5.40s ± 0% ~ (p=0.841 n=5+5) FmtFprintfEmpty 97.9ns ± 0% 98.9ns ± 3% ~ (p=0.937 n=4+5) FmtFprintfString 171ns ± 3% 171ns ± 2% ~ (p=0.754 n=5+5) FmtFprintfInt 212ns ± 0% 217ns ± 6% +2.55% (p=0.008 n=5+5) FmtFprintfIntInt 296ns ± 1% 297ns ± 2% ~ (p=0.516 n=5+5) FmtFprintfPrefixedInt 371ns ± 2% 374ns ± 7% ~ (p=1.000 n=5+5) FmtFprintfFloat 435ns ± 1% 439ns ± 2% ~ (p=0.056 n=5+5) FmtManyArgs 1.37µs ± 1% 1.36µs ± 1% ~ (p=0.730 n=5+5) GobDecode 14.6ms ± 4% 14.4ms ± 4% ~ (p=0.690 n=5+5) GobEncode 11.8ms ±20% 11.6ms ±15% ~ (p=1.000 n=5+5) Gzip 507ms ± 0% 491ms ± 0% -3.22% (p=0.008 n=5+5) Gunzip 73.8ms ± 0% 73.9ms ± 0% ~ (p=0.690 n=5+5) HTTPClientServer 116µs ± 0% 116µs ± 0% ~ (p=0.686 n=4+4) JSONEncode 21.8ms ± 1% 21.6ms ± 2% ~ (p=0.151 n=5+5) JSONDecode 104ms ± 1% 103ms ± 1% -1.08% (p=0.016 n=5+5) Mandelbrot200 9.53ms ± 0% 9.53ms ± 0% ~ (p=0.421 n=5+5) GoParse 7.55ms ± 1% 7.51ms ± 1% ~ (p=0.151 n=5+5) RegexpMatchEasy0_32 158ns ± 0% 158ns ± 0% ~ (all equal) RegexpMatchEasy0_1K 606ns ± 1% 608ns ± 3% ~ (p=0.937 n=5+5) RegexpMatchEasy1_32 143ns ± 0% 144ns ± 1% ~ (p=0.095 n=5+4) RegexpMatchEasy1_1K 927ns ± 2% 944ns ± 2% ~ (p=0.056 n=5+5) RegexpMatchMedium_32 16.0ns ± 0% 16.0ns ± 0% ~ (all equal) RegexpMatchMedium_1K 69.3µs ± 2% 69.7µs ± 0% ~ (p=0.690 n=5+5) RegexpMatchHard_32 3.73µs ± 0% 3.73µs ± 1% ~ (p=0.984 n=5+5) RegexpMatchHard_1K 111µs ± 1% 110µs ± 0% ~ (p=0.151 n=5+5) Revcomp 1.91s ±47% 1.77s ±68% ~ (p=1.000 n=5+5) Template 138ms ± 1% 138ms ± 1% ~ (p=1.000 n=5+5) TimeParse 787ns ± 2% 785ns ± 1% ~ (p=0.540 n=5+5) TimeFormat 729ns ± 1% 726ns ± 1% ~ (p=0.151 n=5+5) Updates #38740 Change-Id: I06c604874acdc1e63e66452dadee5df053045222 Reviewed-on: https://go-review.googlesource.com/c/go/+/233097 Reviewed-by: Keith Randall <khr@golang.org> Run-TryBot: Keith Randall <khr@golang.org>
2020-05-04cmd/compile: use typed aux in arm64 MOVstore rulesAlberto Donizetti
Introduces a few casts, mostly to fix rules that mix int64 and int32 off1 and off2. Passes GOARCH=arm64 gotip build -toolexec 'toolstash -cmp' -a std Change-Id: I1ec75211f3bb8e521dcc5217cf29ab0655a84d79 Reviewed-on: https://go-review.googlesource.com/c/go/+/230840 Run-TryBot: Alberto Donizetti <alb.donizetti@gmail.com> Reviewed-by: Keith Randall <khr@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org>
2020-04-30cmd/compile: switch to typed auxint for arm64 TBZ/TBNZ blockAlberto Donizetti
This CL changes the arm64 TBZ/TBNZ block from using Aux to using a (typed) AuxInt. The corresponding rules have also been changed to be typed. Passes GOARCH=arm64 gotip build -toolexec 'toolstash -cmp' -a std Change-Id: I98d0cd2a791948f1db13259c17fb1b9b2807a043 Reviewed-on: https://go-review.googlesource.com/c/go/+/230839 Run-TryBot: Alberto Donizetti <alb.donizetti@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>
2020-04-29cmd/compile: mark PanicBounds/Extend as callsAustin Clements
PanicBounds and PanicExtend are lowered to runtime calls (with a non-Go ABI), but are not currently marked as calls. Since liveness analysis only emits stack maps at calls in the runtime, this means these panic call sites in the runtime won't get a stack map. These almost immediately turn into throws in the runtime, but there's still a chance they'll try to grow the stack first, which would lead to a different panic. To fix this, mark these operations as calls. Outside the runtime, we currently emit stack maps for everything that isn't an unsafe-point, so these panic calls get stack maps by default. However, we're about to move to emitting stack maps only at call sites, at which point this will start to matter outside the runtime as well. I confirmed that this has no effect on anything but PCDATA/FUNCDATA in runtime and net/http. For #36365. Change-Id: Ic5bb463fd152cc320c815dc04cf62005261ae169 Reviewed-on: https://go-review.googlesource.com/c/go/+/230539 Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com>
2020-02-28cmd/compile: add dedicated ARM64BitField aux typeJosh Bleecher Snyder
The goal here is improved AuxInt printing in ssa.html. Instead of displaying an inscrutable encoded integer, it displays something like v25 (28) = UBFX <int> [lsb=4,width=8] v52 which is much nicer for debugging. Change-Id: I40713ff7f4a857c4557486cdf73c2dff137511ca Reviewed-on: https://go-review.googlesource.com/c/go/+/221420 Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com>
2019-11-07runtime: add async preemption support on ARM64Cherry Zhang
This CL adds support of call injection and async preemption on ARM64. There seems no way to return from the injected call without clobbering *any* register. So we have to clobber one, which is chosen to be REGTMP. Previous CLs have marked code sequences that use REGTMP async-nonpreemtible. Change-Id: Ieca4e3ba5557adf3d0f5d923bce5f1769b58e30b Reviewed-on: https://go-review.googlesource.com/c/go/+/203461 Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: Austin Clements <austin@google.com>
2019-11-05cmd/compile: mark architecture-specific unsafe pointsCherry Zhang
Introduce a mechanism for marking architecture-specific Ops unsafe. And mark ones that use REGTMP on ARM64, as for async preemption we will be using REGTMP as a temporary register in the injected call. Change-Id: I8ff22e87d8f9cb10d02a2f0af7c12ad6d7d58f54 Reviewed-on: https://go-review.googlesource.com/c/go/+/203459 Run-TryBot: Cherry Zhang <cherryyz@google.com> Reviewed-by: Austin Clements <austin@google.com>
2019-10-29cmd/compile: intrinsics for runtime/internal/atomic.Store8Austin Clements
For #10958, #24543, but makes sense on its own. Change-Id: I2a87dab66b82a1863e4b6512b1f8def51463ce2a Reviewed-on: https://go-review.googlesource.com/c/go/+/203284 Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com>
2019-10-02cmd/compile: allow multiple SSA block control valuesMichael Munday
Control values are used to choose which successor of a block is jumped to. Typically a control value takes the form of a 'flags' value that represents the result of a comparison. Some architectures however use a variable in a register as a control value. Up until now we have managed with a single control value per block. However some architectures (e.g. s390x and riscv64) have combined compare-and-branch instructions that take two variables in registers as parameters. To generate these instructions we need to support 2 control values per block. This CL allows up to 2 control values to be used in a block in order to support the addition of compare-and-branch instructions. I have implemented s390x compare-and-branch instructions in a different CL. Passes toolstash-check -all. Results of compilebench: name old time/op new time/op delta Template 208ms ± 1% 209ms ± 1% ~ (p=0.289 n=20+20) Unicode 83.7ms ± 1% 83.3ms ± 3% -0.49% (p=0.017 n=18+18) GoTypes 748ms ± 1% 748ms ± 0% ~ (p=0.460 n=20+18) Compiler 3.47s ± 1% 3.48s ± 1% ~ (p=0.070 n=19+18) SSA 11.5s ± 1% 11.7s ± 1% +1.64% (p=0.000 n=19+18) Flate 130ms ± 1% 130ms ± 1% ~ (p=0.588 n=19+20) GoParser 160ms ± 1% 161ms ± 1% ~ (p=0.211 n=20+20) Reflect 465ms ± 1% 467ms ± 1% +0.42% (p=0.007 n=20+20) Tar 184ms ± 1% 185ms ± 2% ~ (p=0.087 n=18+20) XML 253ms ± 1% 253ms ± 1% ~ (p=0.377 n=20+18) LinkCompiler 769ms ± 2% 774ms ± 2% ~ (p=0.070 n=19+19) ExternalLinkCompiler 3.59s ±11% 3.68s ± 6% ~ (p=0.072 n=20+20) LinkWithoutDebugCompiler 446ms ± 5% 454ms ± 3% +1.79% (p=0.002 n=19+20) StdCmd 26.0s ± 2% 26.0s ± 2% ~ (p=0.799 n=20+20) name old user-time/op new user-time/op delta Template 238ms ± 5% 240ms ± 5% ~ (p=0.142 n=20+20) Unicode 105ms ±11% 106ms ±10% ~ (p=0.512 n=20+20) GoTypes 876ms ± 2% 873ms ± 4% ~ (p=0.647 n=20+19) Compiler 4.17s ± 2% 4.19s ± 1% ~ (p=0.093 n=20+18) SSA 13.9s ± 1% 14.1s ± 1% +1.45% (p=0.000 n=18+18) Flate 145ms ±13% 146ms ± 5% ~ (p=0.851 n=20+18) GoParser 185ms ± 5% 188ms ± 7% ~ (p=0.174 n=20+20) Reflect 534ms ± 3% 538ms ± 2% ~ (p=0.105 n=20+18) Tar 215ms ± 4% 211ms ± 9% ~ (p=0.079 n=19+20) XML 295ms ± 6% 295ms ± 5% ~ (p=0.968 n=20+20) LinkCompiler 832ms ± 4% 837ms ± 7% ~ (p=0.707 n=17+20) ExternalLinkCompiler 1.58s ± 8% 1.60s ± 4% ~ (p=0.296 n=20+19) LinkWithoutDebugCompiler 478ms ±12% 489ms ±10% ~ (p=0.429 n=20+20) name old object-bytes new object-bytes delta Template 559kB ± 0% 559kB ± 0% ~ (all equal) Unicode 216kB ± 0% 216kB ± 0% ~ (all equal) GoTypes 2.03MB ± 0% 2.03MB ± 0% ~ (all equal) Compiler 8.07MB ± 0% 8.07MB ± 0% -0.06% (p=0.000 n=20+20) SSA 27.1MB ± 0% 27.3MB ± 0% +0.89% (p=0.000 n=20+20) Flate 343kB ± 0% 343kB ± 0% ~ (all equal) GoParser 441kB ± 0% 441kB ± 0% ~ (all equal) Reflect 1.36MB ± 0% 1.36MB ± 0% ~ (all equal) Tar 487kB ± 0% 487kB ± 0% ~ (all equal) XML 632kB ± 0% 632kB ± 0% ~ (all equal) name old export-bytes new export-bytes delta Template 18.5kB ± 0% 18.5kB ± 0% ~ (all equal) Unicode 7.92kB ± 0% 7.92kB ± 0% ~ (all equal) GoTypes 35.0kB ± 0% 35.0kB ± 0% ~ (all equal) Compiler 109kB ± 0% 110kB ± 0% +0.72% (p=0.000 n=20+20) SSA 137kB ± 0% 138kB ± 0% +0.58% (p=0.000 n=20+20) Flate 4.89kB ± 0% 4.89kB ± 0% ~ (all equal) GoParser 8.49kB ± 0% 8.49kB ± 0% ~ (all equal) Reflect 11.4kB ± 0% 11.4kB ± 0% ~ (all equal) Tar 10.5kB ± 0% 10.5kB ± 0% ~ (all equal) XML 16.7kB ± 0% 16.7kB ± 0% ~ (all equal) name old text-bytes new text-bytes delta HelloSize 761kB ± 0% 761kB ± 0% ~ (all equal) CmdGoSize 10.8MB ± 0% 10.8MB ± 0% ~ (all equal) name old data-bytes new data-bytes delta HelloSize 10.7kB ± 0% 10.7kB ± 0% ~ (all equal) CmdGoSize 312kB ± 0% 312kB ± 0% ~ (all equal) name old bss-bytes new bss-bytes delta HelloSize 122kB ± 0% 122kB ± 0% ~ (all equal) CmdGoSize 146kB ± 0% 146kB ± 0% ~ (all equal) name old exe-bytes new exe-bytes delta HelloSize 1.13MB ± 0% 1.13MB ± 0% ~ (all equal) CmdGoSize 15.1MB ± 0% 15.1MB ± 0% ~ (all equal) Change-Id: I3cc2f9829a109543d9a68be4a21775d2d3e9801f Reviewed-on: https://go-review.googlesource.com/c/go/+/196557 Run-TryBot: Michael Munday <mike.munday@ibm.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Daniel Martí <mvdan@mvdan.cc> Reviewed-by: Keith Randall <khr@golang.org>
2019-06-26cmd/compile, runtime: use R20, R21 in ARM64's Duff's devicesCherry Zhang
Currently we use R16 and R17 for ARM64's Duff's devices. According to ARM64 ABI, R16 and R17 can be used by the (external) linker as scratch registers in trampolines. So don't use these registers to pass information across functions. It seems unlikely that calling Duff's devices would need a trampoline in normal cases. But it could happen if the call target is out of the 128 MB direct jump limit. The choice of R20 and R21 is kind of arbitrary. The register allocator allocates from low-numbered registers. High numbered registers are chosen so it is unlikely to hold a live value and forces a spill. Fixes #32773. Change-Id: Id22d555b5afeadd4efcf62797d1580d641c39218 Reviewed-on: https://go-review.googlesource.com/c/go/+/183842 Run-TryBot: Cherry Zhang <cherryyz@google.com> Reviewed-by: Keith Randall <khr@golang.org>
2019-05-03cmd/compile,runtime/internal/atomic: add Load8Austin Clements
Change-Id: Id52a5730cf9207ee7ccebac4ef12791dc5720e7c Reviewed-on: https://go-review.googlesource.com/c/go/+/172283 Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: David Chase <drchase@google.com>
2019-04-22cmd/compile: intrinsify math/bits.Sub64 for arm64erifan01
This CL instrinsifies Sub64 with arm64 instruction sequence NEGS, SBCS, NGC and NEG, and optimzes the case of borrowing chains. Benchmarks: name old time/op new time/op delta Sub-64 2.500000ns +- 0% 2.048000ns +- 1% -18.08% (p=0.000 n=10+10) Sub32-64 2.500000ns +- 0% 2.500000ns +- 0% ~ (all equal) Sub64-64 2.500000ns +- 0% 2.080000ns +- 0% -16.80% (p=0.000 n=10+7) Sub64multiple-64 7.090000ns +- 0% 2.090000ns +- 0% -70.52% (p=0.000 n=10+10) Change-Id: I3d2664e009a9635e13b55d2c4567c7b34c2c0655 Reviewed-on: https://go-review.googlesource.com/c/go/+/159018 Reviewed-by: Cherry Zhang <cherryyz@google.com> Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org>
2019-03-22cmd/compile: follow up intrinsifying math/bits.Add64 for arm64erifan01
This CL deals with the additional comments of CL 159017. Change-Id: I4ad3c60c834646d58dc0c544c741b92bfe83fb8b Reviewed-on: https://go-review.googlesource.com/c/go/+/168857 Reviewed-by: Cherry Zhang <cherryyz@google.com> Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org>
2019-03-20cmd/compile: intrinsify math/bits.Add64 for arm64erifan01
This CL instrinsifies Add64 with arm64 instruction sequence ADDS, ADCS and ADC, and optimzes the case of carry chains.The CL also changes the test code so that the intrinsic implementation can be tested. Benchmarks: name old time/op new time/op delta Add-224 2.500000ns +- 0% 2.090000ns +- 4% -16.40% (p=0.000 n=9+10) Add32-224 2.500000ns +- 0% 2.500000ns +- 0% ~ (all equal) Add64-224 2.500000ns +- 0% 1.577778ns +- 2% -36.89% (p=0.000 n=10+9) Add64multiple-224 6.000000ns +- 0% 2.000000ns +- 0% -66.67% (p=0.000 n=10+10) Change-Id: I6ee91c9a85c16cc72ade5fd94868c579f16c7615 Reviewed-on: https://go-review.googlesource.com/c/go/+/159017 Run-TryBot: Ben Shi <powerman1st@163.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com>
2019-03-18cmd/compile,runtime: provide index information on bounds check failureKeith Randall
A few examples (for accessing a slice of length 3): s[-1] runtime error: index out of range [-1] s[3] runtime error: index out of range [3] with length 3 s[-1:0] runtime error: slice bounds out of range [-1:] s[3:0] runtime error: slice bounds out of range [3:0] s[3:-1] runtime error: slice bounds out of range [:-1] s[3:4] runtime error: slice bounds out of range [:4] with capacity 3 s[0:3:4] runtime error: slice bounds out of range [::4] with capacity 3 Note that in cases where there are multiple things wrong with the indexes (e.g. s[3:-1]), we report one of those errors kind of arbitrarily, currently the rightmost one. An exhaustive set of examples is in issue30116[u].out in the CL. The message text has the same prefix as the old message text. That leads to slightly awkward phrasing but hopefully minimizes the chance that code depending on the error text will break. Increases the size of the go binary by 0.5% (amd64). The panic functions take arguments in registers in order to keep the size of the compiled code as small as possible. Fixes #30116 Change-Id: Idb99a827b7888822ca34c240eca87b7e44a04fdd Reviewed-on: https://go-review.googlesource.com/c/go/+/161477 Run-TryBot: Keith Randall <khr@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: David Chase <drchase@google.com>
2019-03-07cmd/compile: optimize arm64 comparison of x and 0.0 with "FCMP $(0.0), Fn"fanzha02
Code: func comp(x float64) bool {return x < 0} Previous version: FMOVD "".x(FP), F0 FMOVD ZR, F1 FCMPD F1, F0 CSET MI, R0 MOVB R0, "".~r1+8(FP) RET (R30) Optimized version: FMOVD "".x(FP), F0 FCMPD $(0.0), F0 CSET MI, R0 MOVB R0, "".~r1+8(FP) RET (R30) Math package benchmark results: name old time/op new time/op delta Acos-8 77.500000ns +- 0% 77.400000ns +- 0% -0.13% (p=0.000 n=9+10) Acosh-8 98.600000ns +- 0% 98.100000ns +- 0% -0.51% (p=0.000 n=10+9) Asin-8 67.600000ns +- 0% 66.600000ns +- 0% -1.48% (p=0.000 n=9+10) Asinh-8 108.000000ns +- 0% 109.000000ns +- 0% +0.93% (p=0.000 n=10+10) Atan-8 36.788889ns +- 0% 36.000000ns +- 0% -2.14% (p=0.000 n=9+10) Atanh-8 104.000000ns +- 0% 105.000000ns +- 0% +0.96% (p=0.000 n=10+10) Atan2-8 67.100000ns +- 0% 66.600000ns +- 0% -0.75% (p=0.000 n=10+10) Cbrt-8 89.100000ns +- 0% 82.000000ns +- 0% -7.97% (p=0.000 n=10+10) Erf-8 43.500000ns +- 0% 43.000000ns +- 0% -1.15% (p=0.000 n=10+10) Erfc-8 49.000000ns +- 0% 48.220000ns +- 0% -1.59% (p=0.000 n=9+10) Erfinv-8 59.100000ns +- 0% 58.600000ns +- 0% -0.85% (p=0.000 n=10+10) Erfcinv-8 59.100000ns +- 0% 58.600000ns +- 0% -0.85% (p=0.000 n=10+10) Expm1-8 56.600000ns +- 0% 56.040000ns +- 0% -0.99% (p=0.000 n=8+10) Exp2Go-8 97.600000ns +- 0% 99.400000ns +- 0% +1.84% (p=0.000 n=10+10) Dim-8 2.500000ns +- 0% 2.250000ns +- 0% -10.00% (p=0.000 n=10+10) Mod-8 108.000000ns +- 0% 106.000000ns +- 0% -1.85% (p=0.000 n=8+8) Frexp-8 12.000000ns +- 0% 12.500000ns +- 0% +4.17% (p=0.000 n=10+10) Gamma-8 67.100000ns +- 0% 67.600000ns +- 0% +0.75% (p=0.000 n=10+10) Hypot-8 17.100000ns +- 0% 17.000000ns +- 0% -0.58% (p=0.002 n=8+10) Ilogb-8 9.010000ns +- 0% 8.510000ns +- 0% -5.55% (p=0.000 n=10+9) J1-8 288.000000ns +- 0% 287.000000ns +- 0% -0.35% (p=0.000 n=10+10) Jn-8 605.000000ns +- 0% 604.000000ns +- 0% -0.17% (p=0.001 n=8+9) Logb-8 10.600000ns +- 0% 10.500000ns +- 0% -0.94% (p=0.000 n=9+10) Log2-8 16.500000ns +- 0% 17.000000ns +- 0% +3.03% (p=0.000 n=10+10) PowFrac-8 232.000000ns +- 0% 233.000000ns +- 0% +0.43% (p=0.000 n=10+10) Remainder-8 70.600000ns +- 0% 69.600000ns +- 0% -1.42% (p=0.000 n=10+10) SqrtGoLatency-8 77.600000ns +- 0% 76.600000ns +- 0% -1.29% (p=0.000 n=10+10) Tanh-8 97.600000ns +- 0% 94.100000ns +- 0% -3.59% (p=0.000 n=10+10) Y1-8 289.000000ns +- 0% 288.000000ns +- 0% -0.35% (p=0.000 n=10+10) Yn-8 603.000000ns +- 0% 589.000000ns +- 0% -2.32% (p=0.000 n=10+10) Change-Id: I6920734f8662b329aa58f5b8e4eeae73b409984d Reviewed-on: https://go-review.googlesource.com/c/go/+/164719 Reviewed-by: Cherry Zhang <cherryyz@google.com> Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org>
2019-03-07cmd/compile: change the condition flags of floating-point comparisons in ↵fanzha02
arm64 backend Current compiler reverses operands to work around NaN in "less than" and "less equal than" comparisons. But if we want to use "FCMPD/FCMPS $(0.0), Fn" to do some optimization, the workaround way does not work. Because assembler does not support instruction "FCMPD/FCMPS Fn, $(0.0)". This CL sets condition flags for floating-point comparisons to resolve this problem. Change-Id: Ia48076a1da95da64596d6e68304018cb301ebe33 Reviewed-on: https://go-review.googlesource.com/c/go/+/164718 Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com>
2018-09-28cmd/compile: optimize arm64's code with more shifted operationsBen Shi
This CL optimizes arm64's NEG/MVN/TST/CMN with a shifted operand. 1. The total size of pkg/android_arm64 decreases about 0.2KB, excluding cmd/compile/ . 2. The go1 benchmark shows no regression, excluding noise. name old time/op new time/op delta BinaryTree17-4 16.4s ± 1% 16.4s ± 1% ~ (p=0.914 n=29+29) Fannkuch11-4 8.72s ± 0% 8.72s ± 0% ~ (p=0.274 n=30+29) FmtFprintfEmpty-4 174ns ± 0% 174ns ± 0% ~ (all equal) FmtFprintfString-4 370ns ± 0% 370ns ± 0% ~ (all equal) FmtFprintfInt-4 419ns ± 0% 419ns ± 0% ~ (all equal) FmtFprintfIntInt-4 672ns ± 1% 675ns ± 2% ~ (p=0.217 n=28+30) FmtFprintfPrefixedInt-4 806ns ± 0% 806ns ± 0% ~ (p=0.402 n=30+28) FmtFprintfFloat-4 1.09µs ± 0% 1.09µs ± 0% +0.02% (p=0.011 n=22+27) FmtManyArgs-4 2.67µs ± 0% 2.68µs ± 0% ~ (p=0.279 n=29+30) GobDecode-4 33.1ms ± 1% 33.1ms ± 0% ~ (p=0.052 n=28+29) GobEncode-4 29.6ms ± 0% 29.6ms ± 0% +0.08% (p=0.013 n=28+29) Gzip-4 1.38s ± 2% 1.39s ± 2% ~ (p=0.071 n=29+29) Gunzip-4 139ms ± 0% 139ms ± 0% ~ (p=0.265 n=29+29) HTTPClientServer-4 789µs ± 4% 785µs ± 4% ~ (p=0.206 n=29+28) JSONEncode-4 49.7ms ± 0% 49.6ms ± 0% -0.24% (p=0.000 n=30+30) JSONDecode-4 266ms ± 1% 267ms ± 1% +0.34% (p=0.000 n=30+30) Mandelbrot200-4 16.6ms ± 0% 16.6ms ± 0% ~ (p=0.835 n=28+30) GoParse-4 15.9ms ± 0% 15.8ms ± 0% -0.29% (p=0.000 n=27+30) RegexpMatchEasy0_32-4 380ns ± 0% 381ns ± 0% +0.18% (p=0.000 n=30+30) RegexpMatchEasy0_1K-4 1.18µs ± 0% 1.19µs ± 0% +0.23% (p=0.000 n=30+30) RegexpMatchEasy1_32-4 357ns ± 0% 358ns ± 0% +0.28% (p=0.000 n=29+29) RegexpMatchEasy1_1K-4 2.04µs ± 0% 2.04µs ± 0% +0.06% (p=0.006 n=30+30) RegexpMatchMedium_32-4 589ns ± 0% 590ns ± 0% +0.24% (p=0.000 n=28+30) RegexpMatchMedium_1K-4 162µs ± 0% 162µs ± 0% -0.01% (p=0.027 n=26+29) RegexpMatchHard_32-4 9.58µs ± 0% 9.58µs ± 0% ~ (p=0.935 n=30+30) RegexpMatchHard_1K-4 287µs ± 0% 287µs ± 0% ~ (p=0.387 n=29+30) Revcomp-4 2.50s ± 0% 2.50s ± 0% -0.10% (p=0.020 n=28+28) Template-4 310ms ± 0% 310ms ± 1% ~ (p=0.406 n=30+30) TimeParse-4 1.68µs ± 0% 1.68µs ± 0% +0.03% (p=0.014 n=30+17) TimeFormat-4 1.65µs ± 0% 1.66µs ± 0% +0.32% (p=0.000 n=27+29) [Geo mean] 247µs 247µs +0.05% name old speed new speed delta GobDecode-4 23.2MB/s ± 0% 23.2MB/s ± 0% -0.08% (p=0.032 n=27+29) GobEncode-4 26.0MB/s ± 0% 25.9MB/s ± 0% -0.10% (p=0.011 n=29+29) Gzip-4 14.1MB/s ± 2% 14.0MB/s ± 2% ~ (p=0.081 n=29+29) Gunzip-4 139MB/s ± 0% 139MB/s ± 0% ~ (p=0.290 n=29+29) JSONEncode-4 39.0MB/s ± 0% 39.1MB/s ± 0% +0.25% (p=0.000 n=29+30) JSONDecode-4 7.30MB/s ± 1% 7.28MB/s ± 1% -0.33% (p=0.000 n=30+30) GoParse-4 3.65MB/s ± 0% 3.66MB/s ± 0% +0.29% (p=0.000 n=27+30) RegexpMatchEasy0_32-4 84.1MB/s ± 0% 84.0MB/s ± 0% -0.17% (p=0.000 n=30+28) RegexpMatchEasy0_1K-4 864MB/s ± 0% 862MB/s ± 0% -0.24% (p=0.000 n=30+30) RegexpMatchEasy1_32-4 89.5MB/s ± 0% 89.3MB/s ± 0% -0.18% (p=0.000 n=28+24) RegexpMatchEasy1_1K-4 502MB/s ± 0% 502MB/s ± 0% -0.05% (p=0.008 n=30+29) RegexpMatchMedium_32-4 1.70MB/s ± 0% 1.69MB/s ± 0% -0.59% (p=0.000 n=29+30) RegexpMatchMedium_1K-4 6.31MB/s ± 0% 6.31MB/s ± 0% +0.05% (p=0.005 n=30+26) RegexpMatchHard_32-4 3.34MB/s ± 0% 3.34MB/s ± 0% ~ (all equal) RegexpMatchHard_1K-4 3.57MB/s ± 0% 3.57MB/s ± 0% ~ (all equal) Revcomp-4 102MB/s ± 0% 102MB/s ± 0% +0.10% (p=0.022 n=28+28) Template-4 6.26MB/s ± 0% 6.26MB/s ± 1% ~ (p=0.768 n=30+30) [Geo mean] 24.2MB/s 24.1MB/s -0.08% Change-Id: I494f9db7f8a568a00e9c74ae25086a58b2221683 Reviewed-on: https://go-review.googlesource.com/137976 Run-TryBot: Ben Shi <powerman1st@163.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com>
2018-09-17cmd/compile: optimize math.Float64(32)bits and math.Float64(32)frombits on arm64fanzha02
Use float <-> int register moves without conversion instead of stores and loads to move float <-> int values. Math package benchmark results. name old time/op new time/op delta Acosh 153ns ± 0% 147ns ± 0% -3.92% (p=0.000 n=10+10) Asinh 183ns ± 0% 177ns ± 0% -3.28% (p=0.000 n=10+10) Atanh 157ns ± 0% 155ns ± 0% -1.27% (p=0.000 n=10+10) Atan2 118ns ± 0% 117ns ± 1% -0.59% (p=0.003 n=10+10) Cbrt 119ns ± 0% 114ns ± 0% -4.20% (p=0.000 n=10+10) Copysign 7.51ns ± 0% 6.51ns ± 0% -13.32% (p=0.000 n=9+10) Cos 73.1ns ± 0% 70.6ns ± 0% -3.42% (p=0.000 n=10+10) Cosh 119ns ± 0% 121ns ± 0% +1.68% (p=0.000 n=10+9) ExpGo 154ns ± 0% 149ns ± 0% -3.05% (p=0.000 n=9+10) Expm1 101ns ± 0% 99ns ± 0% -1.88% (p=0.000 n=10+10) Exp2Go 150ns ± 0% 146ns ± 0% -2.67% (p=0.000 n=10+10) Abs 7.01ns ± 0% 6.01ns ± 0% -14.27% (p=0.000 n=10+9) Mod 234ns ± 0% 212ns ± 0% -9.40% (p=0.000 n=9+10) Frexp 34.5ns ± 0% 30.0ns ± 0% -13.04% (p=0.000 n=10+10) Gamma 112ns ± 0% 111ns ± 0% -0.89% (p=0.000 n=10+10) Hypot 73.6ns ± 0% 68.6ns ± 0% -6.79% (p=0.000 n=10+10) HypotGo 77.1ns ± 0% 72.1ns ± 0% -6.49% (p=0.000 n=10+10) Ilogb 31.0ns ± 0% 28.0ns ± 0% -9.68% (p=0.000 n=10+10) J0 437ns ± 0% 434ns ± 0% -0.62% (p=0.000 n=10+10) J1 433ns ± 0% 431ns ± 0% -0.46% (p=0.000 n=10+10) Jn 927ns ± 0% 922ns ± 0% -0.54% (p=0.000 n=10+10) Ldexp 41.5ns ± 0% 37.0ns ± 0% -10.84% (p=0.000 n=9+10) Log 124ns ± 0% 118ns ± 0% -4.84% (p=0.000 n=10+9) Logb 34.0ns ± 0% 32.0ns ± 0% -5.88% (p=0.000 n=10+10) Log1p 110ns ± 0% 108ns ± 0% -1.82% (p=0.000 n=10+10) Log10 136ns ± 0% 132ns ± 0% -2.94% (p=0.000 n=10+10) Log2 51.6ns ± 0% 47.1ns ± 0% -8.72% (p=0.000 n=10+10) Nextafter32 33.0ns ± 0% 30.5ns ± 0% -7.58% (p=0.000 n=10+10) Nextafter64 29.0ns ± 0% 26.5ns ± 0% -8.62% (p=0.000 n=10+10) PowInt 169ns ± 0% 160ns ± 0% -5.33% (p=0.000 n=10+10) PowFrac 375ns ± 0% 361ns ± 0% -3.73% (p=0.000 n=10+10) RoundToEven 14.0ns ± 0% 12.5ns ± 0% -10.71% (p=0.000 n=10+10) Remainder 206ns ± 0% 192ns ± 0% -6.80% (p=0.000 n=10+9) Signbit 6.01ns ± 0% 5.51ns ± 0% -8.32% (p=0.000 n=10+9) Sin 70.1ns ± 0% 69.6ns ± 0% -0.71% (p=0.000 n=10+10) Sincos 99.1ns ± 0% 99.6ns ± 0% +0.50% (p=0.000 n=9+10) SqrtGoLatency 178ns ± 0% 146ns ± 0% -17.70% (p=0.000 n=8+10) SqrtPrime 9.19µs ± 0% 9.20µs ± 0% +0.01% (p=0.000 n=9+9) Tanh 125ns ± 1% 127ns ± 0% +1.36% (p=0.000 n=10+10) Y0 428ns ± 0% 426ns ± 0% -0.47% (p=0.000 n=10+10) Y1 431ns ± 0% 429ns ± 0% -0.46% (p=0.000 n=10+9) Yn 906ns ± 0% 901ns ± 0% -0.55% (p=0.000 n=10+10) Float64bits 4.50ns ± 0% 3.50ns ± 0% -22.22% (p=0.000 n=10+10) Float64frombits 4.00ns ± 0% 3.50ns ± 0% -12.50% (p=0.000 n=10+9) Float32bits 4.50ns ± 0% 3.50ns ± 0% -22.22% (p=0.002 n=8+10) Float32frombits 4.00ns ± 0% 3.50ns ± 0% -12.50% (p=0.000 n=10+10) Change-Id: Iba829e15d5624962fe0c699139ea783efeefabc2 Reviewed-on: https://go-review.googlesource.com/129715 Reviewed-by: Cherry Zhang <cherryyz@google.com> Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org>
2018-09-13cmd/compile: intrinsify math.RoundToEven and math.Abs on arm64erifan01
math.RoundToEven can be done by one arm64 instruction FRINTND, intrinsify it to improve performance. The current pure Go implementation of the function Abs is translated into five instructions on arm64: str, ldr, and, str, ldr. The intrinsic implementation requires only one instruction, so in terms of performance, intrinsify it is worthwhile. Benchmarks: name old time/op new time/op delta Abs-8 3.50ns ± 0% 1.50ns ± 0% -57.14% (p=0.000 n=10+10) RoundToEven-8 9.26ns ± 0% 1.50ns ± 0% -83.80% (p=0.000 n=10+10) Change-Id: I9456b26ab282b544dfac0154fc86f17aed96ac3d Reviewed-on: https://go-review.googlesource.com/116535 Reviewed-by: Cherry Zhang <cherryyz@google.com> Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org>
2018-09-07cmd/compile: implement non-constant rotates using ROR on arm64erifan01
Add some rules to match the Go code like: y &= 63 x << y | x >> (64-y) or y &= 63 x >> y | x << (64-y) as a ROR instruction. Make math/bits.RotateLeft faster on arm64. Extends CL 132435 to arm64. Benchmarks of math/bits.RotateLeftxxN: name old time/op new time/op delta RotateLeft-8 3.548750ns +- 1% 2.003750ns +- 0% -43.54% (p=0.000 n=8+8) RotateLeft8-8 3.925000ns +- 0% 3.925000ns +- 0% ~ (p=1.000 n=8+8) RotateLeft16-8 3.925000ns +- 0% 3.927500ns +- 0% ~ (p=0.608 n=8+8) RotateLeft32-8 3.925000ns +- 0% 2.002500ns +- 0% -48.98% (p=0.000 n=8+8) RotateLeft64-8 3.536250ns +- 0% 2.003750ns +- 0% -43.34% (p=0.000 n=8+8) Change-Id: I77622cd7f39b917427e060647321f5513973232c Reviewed-on: https://go-review.googlesource.com/122542 Run-TryBot: Ben Shi <powerman1st@163.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com>
2018-09-05cmd/compile: optimize arm64's comparisonBen Shi
Add more optimization with TST/CMN. 1. A tiny benchmark shows more than 12% improvement. TSTCMN-4 378µs ± 0% 332µs ± 0% -12.15% (p=0.000 n=30+27) (https://github.com/benshi001/ugo1/blob/master/tstcmn_test.go) 2. There is little regression in the go1 benchmark, excluding noise. name old time/op new time/op delta BinaryTree17-4 19.1s ± 0% 19.1s ± 0% ~ (p=0.994 n=28+29) Fannkuch11-4 10.0s ± 0% 10.0s ± 0% ~ (p=0.198 n=30+25) FmtFprintfEmpty-4 233ns ± 0% 233ns ± 0% +0.14% (p=0.002 n=24+30) FmtFprintfString-4 428ns ± 0% 428ns ± 0% ~ (all equal) FmtFprintfInt-4 472ns ± 0% 472ns ± 0% ~ (all equal) FmtFprintfIntInt-4 725ns ± 0% 725ns ± 0% ~ (all equal) FmtFprintfPrefixedInt-4 889ns ± 0% 888ns ± 0% ~ (p=0.632 n=28+30) FmtFprintfFloat-4 1.20µs ± 0% 1.20µs ± 0% +0.05% (p=0.001 n=18+30) FmtManyArgs-4 3.00µs ± 0% 2.99µs ± 0% -0.07% (p=0.001 n=27+30) GobDecode-4 42.1ms ± 0% 42.2ms ± 0% +0.29% (p=0.000 n=28+28) GobEncode-4 38.6ms ± 9% 38.8ms ± 9% ~ (p=0.912 n=30+30) Gzip-4 2.07s ± 1% 2.05s ± 1% -0.64% (p=0.000 n=29+30) Gunzip-4 175ms ± 0% 175ms ± 0% -0.15% (p=0.001 n=30+30) HTTPClientServer-4 872µs ± 5% 880µs ± 6% ~ (p=0.196 n=30+29) JSONEncode-4 88.5ms ± 1% 89.8ms ± 1% +1.49% (p=0.000 n=23+24) JSONDecode-4 393ms ± 1% 390ms ± 1% -0.89% (p=0.000 n=28+30) Mandelbrot200-4 19.5ms ± 0% 19.5ms ± 0% ~ (p=0.405 n=29+28) GoParse-4 19.9ms ± 0% 20.0ms ± 0% +0.27% (p=0.000 n=30+30) RegexpMatchEasy0_32-4 431ns ± 0% 431ns ± 0% ~ (p=1.000 n=30+30) RegexpMatchEasy0_1K-4 1.61µs ± 0% 1.61µs ± 0% ~ (p=0.527 n=26+26) RegexpMatchEasy1_32-4 443ns ± 0% 443ns ± 0% ~ (all equal) RegexpMatchEasy1_1K-4 2.58µs ± 1% 2.58µs ± 1% ~ (p=0.578 n=27+25) RegexpMatchMedium_32-4 740ns ± 0% 740ns ± 0% ~ (p=0.357 n=30+30) RegexpMatchMedium_1K-4 223µs ± 0% 223µs ± 0% +0.16% (p=0.000 n=30+29) RegexpMatchHard_32-4 12.3µs ± 0% 12.3µs ± 0% ~ (p=0.236 n=27+27) RegexpMatchHard_1K-4 371µs ± 0% 371µs ± 0% +0.09% (p=0.000 n=30+27) Revcomp-4 2.85s ± 0% 2.85s ± 0% ~ (p=0.057 n=28+25) Template-4 408ms ± 1% 409ms ± 1% ~ (p=0.117 n=29+29) TimeParse-4 1.93µs ± 0% 1.93µs ± 0% ~ (p=0.535 n=29+28) TimeFormat-4 1.99µs ± 0% 1.99µs ± 0% ~ (p=0.168 n=29+28) [Geo mean] 306µs 307µs +0.07% name old speed new speed delta GobDecode-4 18.3MB/s ± 0% 18.2MB/s ± 0% -0.31% (p=0.000 n=28+29) GobEncode-4 19.9MB/s ± 8% 19.8MB/s ± 9% ~ (p=0.923 n=30+30) Gzip-4 9.39MB/s ± 1% 9.45MB/s ± 1% +0.65% (p=0.000 n=29+30) Gunzip-4 111MB/s ± 0% 111MB/s ± 0% +0.15% (p=0.001 n=30+30) JSONEncode-4 21.9MB/s ± 1% 21.6MB/s ± 1% -1.45% (p=0.000 n=23+23) JSONDecode-4 4.94MB/s ± 1% 4.98MB/s ± 1% +0.84% (p=0.000 n=27+30) GoParse-4 2.91MB/s ± 0% 2.90MB/s ± 0% -0.34% (p=0.000 n=21+22) RegexpMatchEasy0_32-4 74.1MB/s ± 0% 74.1MB/s ± 0% ~ (p=0.469 n=29+28) RegexpMatchEasy0_1K-4 634MB/s ± 0% 634MB/s ± 0% ~ (p=0.978 n=24+28) RegexpMatchEasy1_32-4 72.2MB/s ± 0% 72.2MB/s ± 0% ~ (p=0.064 n=27+29) RegexpMatchEasy1_1K-4 396MB/s ± 1% 396MB/s ± 1% ~ (p=0.583 n=27+25) RegexpMatchMedium_32-4 1.35MB/s ± 0% 1.35MB/s ± 0% ~ (all equal) RegexpMatchMedium_1K-4 4.60MB/s ± 0% 4.59MB/s ± 0% -0.14% (p=0.000 n=30+26) RegexpMatchHard_32-4 2.61MB/s ± 0% 2.61MB/s ± 0% ~ (all equal) RegexpMatchHard_1K-4 2.76MB/s ± 0% 2.76MB/s ± 0% ~ (all equal) Revcomp-4 89.1MB/s ± 0% 89.1MB/s ± 0% ~ (p=0.059 n=28+25) Template-4 4.75MB/s ± 1% 4.75MB/s ± 1% ~ (p=0.106 n=29+29) [Geo mean] 18.3MB/s 18.3MB/s -0.07% Change-Id: I3cd76ce63e84b0c3cebabf9fa3573b76a7343899 Reviewed-on: https://go-review.googlesource.com/124935 Run-TryBot: Ben Shi <powerman1st@163.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com>
2018-09-04cmd/compile: optimize ARM64's code with MADD/MSUBBen Shi
MADD does MUL-ADD in a single instruction, and MSUB does the similiar simplification for MUL-SUB. The CL implements the optimization with MADD/MSUB. 1. The total size of pkg/android_arm64/ decreases about 20KB, excluding cmd/compile/. 2. The go1 benchmark shows a little improvement for RegexpMatchHard_32-4 and Template-4, excluding noise. name old time/op new time/op delta BinaryTree17-4 16.3s ± 1% 16.5s ± 1% +1.41% (p=0.000 n=26+28) Fannkuch11-4 8.79s ± 1% 8.76s ± 0% -0.36% (p=0.000 n=26+28) FmtFprintfEmpty-4 172ns ± 0% 172ns ± 0% ~ (all equal) FmtFprintfString-4 362ns ± 1% 364ns ± 0% +0.55% (p=0.000 n=30+30) FmtFprintfInt-4 416ns ± 0% 416ns ± 0% ~ (p=0.099 n=22+30) FmtFprintfIntInt-4 655ns ± 1% 660ns ± 1% +0.76% (p=0.000 n=30+30) FmtFprintfPrefixedInt-4 810ns ± 0% 809ns ± 0% -0.08% (p=0.009 n=29+29) FmtFprintfFloat-4 1.08µs ± 0% 1.09µs ± 0% +0.61% (p=0.000 n=30+29) FmtManyArgs-4 2.70µs ± 0% 2.69µs ± 0% -0.23% (p=0.000 n=29+28) GobDecode-4 32.2ms ± 1% 32.1ms ± 1% -0.39% (p=0.000 n=27+26) GobEncode-4 27.4ms ± 2% 27.4ms ± 1% ~ (p=0.864 n=28+28) Gzip-4 1.53s ± 1% 1.52s ± 1% -0.30% (p=0.031 n=29+29) Gunzip-4 146ms ± 0% 146ms ± 0% -0.14% (p=0.001 n=25+30) HTTPClientServer-4 1.00ms ± 4% 0.98ms ± 6% -1.65% (p=0.001 n=29+30) JSONEncode-4 67.3ms ± 1% 67.2ms ± 1% ~ (p=0.520 n=28+28) JSONDecode-4 329ms ± 5% 330ms ± 4% ~ (p=0.142 n=30+30) Mandelbrot200-4 17.3ms ± 0% 17.3ms ± 0% ~ (p=0.055 n=26+29) GoParse-4 16.9ms ± 1% 17.0ms ± 1% +0.82% (p=0.000 n=30+30) RegexpMatchEasy0_32-4 382ns ± 0% 382ns ± 0% ~ (all equal) RegexpMatchEasy0_1K-4 1.33µs ± 0% 1.33µs ± 0% -0.25% (p=0.000 n=30+27) RegexpMatchEasy1_32-4 361ns ± 0% 361ns ± 0% -0.08% (p=0.002 n=30+28) RegexpMatchEasy1_1K-4 2.11µs ± 0% 2.09µs ± 0% -0.54% (p=0.000 n=30+29) RegexpMatchMedium_32-4 594ns ± 0% 592ns ± 0% -0.32% (p=0.000 n=30+30) RegexpMatchMedium_1K-4 173µs ± 0% 172µs ± 0% -0.77% (p=0.000 n=29+27) RegexpMatchHard_32-4 10.4µs ± 0% 10.1µs ± 0% -3.63% (p=0.000 n=28+27) RegexpMatchHard_1K-4 306µs ± 0% 301µs ± 0% -1.64% (p=0.000 n=29+30) Revcomp-4 2.51s ± 1% 2.52s ± 0% +0.18% (p=0.017 n=26+27) Template-4 394ms ± 3% 382ms ± 3% -3.22% (p=0.000 n=28+28) TimeParse-4 1.67µs ± 0% 1.67µs ± 0% +0.05% (p=0.030 n=27+30) TimeFormat-4 1.72µs ± 0% 1.70µs ± 0% -0.79% (p=0.000 n=28+26) [Geo mean] 259µs 259µs -0.33% name old speed new speed delta GobDecode-4 23.8MB/s ± 1% 23.9MB/s ± 1% +0.40% (p=0.001 n=27+26) GobEncode-4 28.0MB/s ± 2% 28.0MB/s ± 1% ~ (p=0.863 n=28+28) Gzip-4 12.7MB/s ± 1% 12.7MB/s ± 1% +0.32% (p=0.026 n=29+29) Gunzip-4 133MB/s ± 0% 133MB/s ± 0% +0.15% (p=0.001 n=24+30) JSONEncode-4 28.8MB/s ± 1% 28.9MB/s ± 1% ~ (p=0.475 n=28+28) JSONDecode-4 5.89MB/s ± 4% 5.87MB/s ± 5% ~ (p=0.174 n=29+30) GoParse-4 3.43MB/s ± 0% 3.40MB/s ± 1% -0.83% (p=0.000 n=28+30) RegexpMatchEasy0_32-4 83.6MB/s ± 0% 83.6MB/s ± 0% ~ (p=0.848 n=28+29) RegexpMatchEasy0_1K-4 768MB/s ± 0% 770MB/s ± 0% +0.25% (p=0.000 n=30+27) RegexpMatchEasy1_32-4 88.5MB/s ± 0% 88.5MB/s ± 0% ~ (p=0.086 n=29+29) RegexpMatchEasy1_1K-4 486MB/s ± 0% 489MB/s ± 0% +0.54% (p=0.000 n=30+29) RegexpMatchMedium_32-4 1.68MB/s ± 0% 1.69MB/s ± 0% +0.60% (p=0.000 n=30+23) RegexpMatchMedium_1K-4 5.90MB/s ± 0% 5.95MB/s ± 0% +0.85% (p=0.000 n=18+20) RegexpMatchHard_32-4 3.07MB/s ± 0% 3.18MB/s ± 0% +3.72% (p=0.000 n=29+26) RegexpMatchHard_1K-4 3.35MB/s ± 0% 3.40MB/s ± 0% +1.69% (p=0.000 n=30+30) Revcomp-4 101MB/s ± 0% 101MB/s ± 0% -0.18% (p=0.018 n=26+27) Template-4 4.92MB/s ± 4% 5.09MB/s ± 3% +3.31% (p=0.000 n=28+28) [Geo mean] 22.4MB/s 22.6MB/s +0.62% Change-Id: I8f304b272785739f57b3c8f736316f658f8c1b2a Reviewed-on: https://go-review.googlesource.com/129119 Run-TryBot: Ben Shi <powerman1st@163.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com>
2018-08-28cmd/compile: optimize arm64 with indexed FP load/storeBen Shi
The FP load/store on arm64 have register indexed forms. And this CL implements this optimization. 1. The total size of pkg/android_arm64 (excluding cmd/compile) decreases about 400 bytes. 2. There is no regression in the go1 benchmark, the test case GobEncode even gets slight improvement, excluding noise. name old time/op new time/op delta BinaryTree17-4 19.0s ± 0% 19.0s ± 1% ~ (p=0.817 n=29+29) Fannkuch11-4 9.94s ± 0% 9.95s ± 0% +0.03% (p=0.010 n=24+30) FmtFprintfEmpty-4 233ns ± 0% 233ns ± 0% ~ (all equal) FmtFprintfString-4 427ns ± 0% 427ns ± 0% ~ (p=0.649 n=30+30) FmtFprintfInt-4 471ns ± 0% 471ns ± 0% ~ (all equal) FmtFprintfIntInt-4 730ns ± 0% 730ns ± 0% ~ (all equal) FmtFprintfPrefixedInt-4 889ns ± 0% 889ns ± 0% ~ (all equal) FmtFprintfFloat-4 1.21µs ± 0% 1.21µs ± 0% +0.04% (p=0.012 n=20+30) FmtManyArgs-4 2.99µs ± 0% 2.99µs ± 0% ~ (p=0.651 n=29+29) GobDecode-4 42.4ms ± 1% 42.3ms ± 1% -0.27% (p=0.001 n=29+28) GobEncode-4 37.8ms ±11% 36.0ms ± 0% -4.67% (p=0.000 n=30+26) Gzip-4 1.98s ± 1% 1.96s ± 1% -1.26% (p=0.000 n=30+30) Gunzip-4 175ms ± 0% 175ms ± 0% ~ (p=0.988 n=29+29) HTTPClientServer-4 854µs ± 5% 860µs ± 5% ~ (p=0.236 n=28+29) JSONEncode-4 88.8ms ± 0% 87.9ms ± 0% -1.00% (p=0.000 n=24+26) JSONDecode-4 390ms ± 1% 392ms ± 2% +0.48% (p=0.025 n=30+30) Mandelbrot200-4 19.5ms ± 0% 19.5ms ± 0% ~ (p=0.894 n=24+29) GoParse-4 20.3ms ± 0% 20.1ms ± 1% -0.94% (p=0.000 n=27+26) RegexpMatchEasy0_32-4 451ns ± 0% 451ns ± 0% ~ (p=0.578 n=30+30) RegexpMatchEasy0_1K-4 1.63µs ± 0% 1.63µs ± 0% ~ (p=0.298 n=30+28) RegexpMatchEasy1_32-4 431ns ± 0% 434ns ± 0% +0.67% (p=0.000 n=30+29) RegexpMatchEasy1_1K-4 2.60µs ± 0% 2.64µs ± 0% +1.36% (p=0.000 n=28+26) RegexpMatchMedium_32-4 744ns ± 0% 744ns ± 0% ~ (p=0.474 n=29+29) RegexpMatchMedium_1K-4 223µs ± 0% 223µs ± 0% -0.08% (p=0.038 n=26+30) RegexpMatchHard_32-4 12.2µs ± 0% 12.3µs ± 0% +0.27% (p=0.000 n=29+30) RegexpMatchHard_1K-4 373µs ± 0% 373µs ± 0% ~ (p=0.219 n=29+28) Revcomp-4 2.84s ± 0% 2.84s ± 0% ~ (p=0.130 n=28+28) Template-4 394ms ± 1% 392ms ± 1% -0.52% (p=0.001 n=30+30) TimeParse-4 1.93µs ± 0% 1.93µs ± 0% ~ (p=0.587 n=29+30) TimeFormat-4 2.00µs ± 0% 2.00µs ± 0% +0.07% (p=0.001 n=28+27) [Geo mean] 306µs 305µs -0.17% name old speed new speed delta GobDecode-4 18.1MB/s ± 1% 18.2MB/s ± 1% +0.27% (p=0.001 n=29+28) GobEncode-4 20.3MB/s ±10% 21.3MB/s ± 0% +4.64% (p=0.000 n=30+26) Gzip-4 9.79MB/s ± 1% 9.91MB/s ± 1% +1.28% (p=0.000 n=30+30) Gunzip-4 111MB/s ± 0% 111MB/s ± 0% ~ (p=0.988 n=29+29) JSONEncode-4 21.8MB/s ± 0% 22.1MB/s ± 0% +1.02% (p=0.000 n=24+26) JSONDecode-4 4.97MB/s ± 1% 4.95MB/s ± 2% -0.45% (p=0.031 n=30+30) GoParse-4 2.85MB/s ± 1% 2.88MB/s ± 1% +1.03% (p=0.000 n=30+26) RegexpMatchEasy0_32-4 70.9MB/s ± 0% 70.9MB/s ± 0% ~ (p=0.904 n=29+28) RegexpMatchEasy0_1K-4 627MB/s ± 0% 627MB/s ± 0% ~ (p=0.156 n=30+30) RegexpMatchEasy1_32-4 74.2MB/s ± 0% 73.7MB/s ± 0% -0.67% (p=0.000 n=30+29) RegexpMatchEasy1_1K-4 393MB/s ± 0% 388MB/s ± 0% -1.34% (p=0.000 n=28+26) RegexpMatchMedium_32-4 1.34MB/s ± 0% 1.34MB/s ± 0% ~ (all equal) RegexpMatchMedium_1K-4 4.59MB/s ± 0% 4.59MB/s ± 0% +0.07% (p=0.035 n=25+30) RegexpMatchHard_32-4 2.61MB/s ± 0% 2.61MB/s ± 0% -0.11% (p=0.002 n=28+30) RegexpMatchHard_1K-4 2.75MB/s ± 0% 2.75MB/s ± 0% +0.15% (p=0.001 n=30+24) Revcomp-4 89.4MB/s ± 0% 89.4MB/s ± 0% ~ (p=0.140 n=28+28) Template-4 4.93MB/s ± 1% 4.95MB/s ± 1% +0.51% (p=0.001 n=30+30) [Geo mean] 18.4MB/s 18.4MB/s +0.37% Change-Id: I9a6b521a971b21cfb51064e8e9b853cef8a1d071 Reviewed-on: https://go-review.googlesource.com/124636 Run-TryBot: Ben Shi <powerman1st@163.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com>
2018-08-27cmd/compile: add missing type information for some arm/arm64 rulesBen Shi
Some indexed load/store rules lack of type information, and this CL adds that for them. Change-Id: Icac315ccb83a2f5bf30b056d4667d5b59eb4e5e2 Reviewed-on: https://go-review.googlesource.com/128455 Reviewed-by: Cherry Zhang <cherryyz@google.com> Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org>
2018-06-21cmd/compile: improve atomic add intrinsics with ARMv8.1 new instructionWei Xiao
ARMv8.1 has added new instruction (LDADDAL) for atomic memory operations. This CL improves existing atomic add intrinsics with the new instruction. Since the new instruction is only guaranteed to be present after ARMv8.1, we guard its usage with a conditional on CPU feature. Performance result on ARMv8.1 machine: name old time/op new time/op delta Xadd-224 1.05µs ± 6% 0.02µs ± 4% -98.06% (p=0.000 n=10+8) Xadd64-224 1.05µs ± 3% 0.02µs ±13% -98.10% (p=0.000 n=9+10) [Geo mean] 1.05µs 0.02µs -98.08% Performance result on ARMv8.0 machine: name old time/op new time/op delta Xadd-46 538ns ± 1% 541ns ± 1% +0.62% (p=0.000 n=9+9) Xadd64-46 505ns ± 1% 508ns ± 0% +0.48% (p=0.003 n=9+8) [Geo mean] 521ns 524ns +0.55% Change-Id: If4b5d8d0e2d6f84fe1492a4f5de0789910ad0ee9 Reviewed-on: https://go-review.googlesource.com/81877 Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com>
2018-06-12cmd/compile: use a different register for updated value in AtomicAnd8/Or8 on ↵Cherry Zhang
ARM64 ARM64 manual says it is "constrained unpredictable" if the src and dst registers of STLXRB are same, although it doesn't seem to cause any problem on real hardwares so far. Fix by allocating a different register to hold the updated value for AtomicAnd8/Or8. We do this by making the ops returns <val,mem> like AtomicAdd, although val will not be used elsewhere. Fixes #25823. Change-Id: I735b9822f99877b3c7aee67a65e62b7278dc40df Reviewed-on: https://go-review.googlesource.com/117976 Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: Wei Xiao <Wei.Xiao@arm.com>
2018-04-30cmd/compile: intrinsify runtime.getcallerpc on arm64Wei Xiao
Add a compiler intrinsic for getcallerpc on arm64 for better code generation. Change-Id: I897e670a2b8ffa1a8c2fdc638f5b2c44bda26318 Reviewed-on: https://go-review.googlesource.com/109276 Reviewed-by: Cherry Zhang <cherryyz@google.com> Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org>
2018-04-27cmd/compile: optimize ARM64 with shifted register indexed load/storeBen Shi
ARM64 supports efficient instructions which combine shift, addition, load/store together. Such as "MOVD (R0)(R1<<3), R2" and "MOVWU R6, (R4)(R1<<2)". This CL optimizes the compiler to emit such efficient instuctions. And below is some test data. 1. binary size before/after binary size change pkg/linux_arm64 +80.1KB pkg/tool/linux_arm64 +121.9KB go -4.3KB gofmt -64KB 2. go1 benchmark There is big improvement for the test case Fannkuch11, and slight improvement for sme others, excluding noise. name old time/op new time/op delta BinaryTree17-4 43.9s ± 2% 44.0s ± 2% ~ (p=0.820 n=30+30) Fannkuch11-4 30.6s ± 2% 24.5s ± 3% -19.93% (p=0.000 n=25+30) FmtFprintfEmpty-4 500ns ± 0% 499ns ± 0% -0.11% (p=0.000 n=23+25) FmtFprintfString-4 1.03µs ± 0% 1.04µs ± 3% ~ (p=0.065 n=29+30) FmtFprintfInt-4 1.15µs ± 3% 1.15µs ± 4% -0.56% (p=0.000 n=30+30) FmtFprintfIntInt-4 1.80µs ± 5% 1.82µs ± 0% ~ (p=0.094 n=30+24) FmtFprintfPrefixedInt-4 2.17µs ± 5% 2.20µs ± 0% ~ (p=0.100 n=30+23) FmtFprintfFloat-4 3.08µs ± 3% 3.09µs ± 4% ~ (p=0.123 n=30+30) FmtManyArgs-4 7.41µs ± 4% 7.17µs ± 1% -3.26% (p=0.000 n=30+23) GobDecode-4 93.7ms ± 0% 94.7ms ± 4% ~ (p=0.685 n=24+30) GobEncode-4 78.7ms ± 7% 77.1ms ± 0% ~ (p=0.729 n=30+23) Gzip-4 4.01s ± 0% 3.97s ± 5% -1.11% (p=0.037 n=24+30) Gunzip-4 389ms ± 4% 384ms ± 0% ~ (p=0.155 n=30+23) HTTPClientServer-4 536µs ± 1% 537µs ± 1% ~ (p=0.236 n=30+30) JSONEncode-4 179ms ± 1% 182ms ± 6% ~ (p=0.763 n=24+30) JSONDecode-4 843ms ± 0% 839ms ± 6% -0.42% (p=0.003 n=25+30) Mandelbrot200-4 46.5ms ± 0% 46.5ms ± 0% +0.02% (p=0.000 n=26+26) GoParse-4 44.3ms ± 6% 43.3ms ± 0% ~ (p=0.067 n=30+27) RegexpMatchEasy0_32-4 1.07µs ± 7% 1.07µs ± 4% ~ (p=0.835 n=30+30) RegexpMatchEasy0_1K-4 5.51µs ± 0% 5.49µs ± 0% -0.35% (p=0.000 n=23+26) RegexpMatchEasy1_32-4 1.01µs ± 0% 1.02µs ± 4% +0.96% (p=0.014 n=24+30) RegexpMatchEasy1_1K-4 7.43µs ± 0% 7.18µs ± 0% -3.41% (p=0.000 n=23+24) RegexpMatchMedium_32-4 1.78µs ± 0% 1.81µs ± 4% +1.47% (p=0.012 n=23+30) RegexpMatchMedium_1K-4 547µs ± 1% 542µs ± 3% -0.90% (p=0.003 n=24+30) RegexpMatchHard_32-4 30.4µs ± 0% 29.7µs ± 0% -2.15% (p=0.000 n=19+23) RegexpMatchHard_1K-4 913µs ± 0% 915µs ± 6% +0.25% (p=0.012 n=24+30) Revcomp-4 6.32s ± 1% 6.42s ± 4% ~ (p=0.342 n=25+30) Template-4 868ms ± 6% 878ms ± 6% +1.15% (p=0.000 n=30+30) TimeParse-4 4.57µs ± 4% 4.59µs ± 3% +0.65% (p=0.010 n=29+30) TimeFormat-4 4.51µs ± 0% 4.50µs ± 0% -0.27% (p=0.000 n=27+24) [Geo mean] 695µs 689µs -0.92% name old speed new speed delta GobDecode-4 8.19MB/s ± 0% 8.12MB/s ± 4% ~ (p=0.680 n=24+30) GobEncode-4 9.76MB/s ± 7% 9.96MB/s ± 0% ~ (p=0.616 n=30+23) Gzip-4 4.84MB/s ± 0% 4.89MB/s ± 4% +1.16% (p=0.030 n=24+30) Gunzip-4 49.9MB/s ± 4% 50.6MB/s ± 0% ~ (p=0.162 n=30+23) JSONEncode-4 10.9MB/s ± 1% 10.7MB/s ± 6% ~ (p=0.575 n=24+30) JSONDecode-4 2.30MB/s ± 0% 2.32MB/s ± 5% +0.72% (p=0.003 n=22+30) GoParse-4 1.31MB/s ± 6% 1.34MB/s ± 0% +2.26% (p=0.002 n=30+27) RegexpMatchEasy0_32-4 30.0MB/s ± 6% 30.0MB/s ± 4% ~ (p=1.000 n=30+30) RegexpMatchEasy0_1K-4 186MB/s ± 0% 187MB/s ± 0% +0.35% (p=0.000 n=23+26) RegexpMatchEasy1_32-4 31.8MB/s ± 0% 31.5MB/s ± 4% -0.92% (p=0.012 n=25+30) RegexpMatchEasy1_1K-4 138MB/s ± 0% 143MB/s ± 0% +3.53% (p=0.000 n=23+24) RegexpMatchMedium_32-4 560kB/s ± 0% 553kB/s ± 4% -1.19% (p=0.005 n=23+30) RegexpMatchMedium_1K-4 1.87MB/s ± 0% 1.89MB/s ± 3% +1.04% (p=0.002 n=24+30) RegexpMatchHard_32-4 1.05MB/s ± 0% 1.08MB/s ± 0% +2.40% (p=0.000 n=19+23) RegexpMatchHard_1K-4 1.12MB/s ± 0% 1.12MB/s ± 5% +0.12% (p=0.006 n=25+30) Revcomp-4 40.2MB/s ± 1% 39.6MB/s ± 4% ~ (p=0.242 n=25+30) Template-4 2.24MB/s ± 6% 2.21MB/s ± 6% -1.15% (p=0.000 n=30+30) [Geo mean] 7.87MB/s 7.91MB/s +0.44% Change-Id: If374cb7abf83537aa0a176f73c0f736f7800db03 Reviewed-on: https://go-review.googlesource.com/108735 Reviewed-by: Cherry Zhang <cherryyz@google.com> Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org>
2018-04-26cmd/compile: optimize ARM64 code with CMN/TSTBalaram Makam
Use CMN/TST to simplify comparisons. This can reduce the register pressure by removing single def/use registers for example: ADDW R0, R1, R8 -> CMNW R1, R0 ; CMN is an alias of ADDS. CBZW R8, label -> BEQ label ; single def/use of R8 removed. Little change in performance of go1 benchmark on Amberwing: name old time/op new time/op delta RegexpMatchEasy0_32 247ns ± 0% 246ns ± 0% -0.40% (p=0.008 n=5+5) RegexpMatchEasy0_1K 581ns ± 0% 580ns ± 0% ~ (p=0.079 n=4+5) RegexpMatchEasy1_32 244ns ± 0% 243ns ± 0% -0.41% (p=0.008 n=5+5) RegexpMatchEasy1_1K 804ns ± 0% 806ns ± 0% +0.25% (p=0.016 n=5+4) RegexpMatchMedium_32 313ns ± 0% 311ns ± 0% -0.64% (p=0.008 n=5+5) RegexpMatchMedium_1K 52.2µs ± 0% 51.9µs ± 0% -0.51% (p=0.008 n=5+5) RegexpMatchHard_32 2.76µs ± 3% 2.74µs ± 0% ~ (p=0.683 n=5+5) RegexpMatchHard_1K 78.8µs ± 0% 78.9µs ± 0% +0.04% (p=0.008 n=5+5) FmtFprintfEmpty 58.6ns ± 0% 57.7ns ± 0% -1.54% (p=0.008 n=5+5) FmtFprintfString 118ns ± 0% 115ns ± 0% -2.54% (p=0.008 n=5+5) FmtFprintfInt 119ns ± 0% 119ns ± 0% ~ (all equal) FmtFprintfIntInt 192ns ± 0% 192ns ± 0% ~ (all equal) FmtFprintfPrefixedInt 224ns ± 0% 205ns ± 0% -8.48% (p=0.008 n=5+5) FmtFprintfFloat 336ns ± 0% 333ns ± 1% ~ (p=0.683 n=5+5) FmtManyArgs 779ns ± 1% 760ns ± 1% -2.41% (p=0.008 n=5+5) Gzip 437ms ± 0% 436ms ± 0% -0.27% (p=0.008 n=5+5) HTTPClientServer 90.1µs ± 1% 91.1µs ± 0% +1.19% (p=0.008 n=5+5) JSONEncode 20.1ms ± 0% 20.2ms ± 1% ~ (p=0.690 n=5+5) JSONDecode 94.5ms ± 1% 94.1ms ± 1% ~ (p=0.095 n=5+5) Mandelbrot200 5.37ms ± 0% 5.37ms ± 0% ~ (p=0.421 n=5+5) TimeParse 450ns ± 0% 446ns ± 0% -0.89% (p=0.000 n=5+4) TimeFormat 483ns ± 1% 473ns ± 0% -2.19% (p=0.008 n=5+5) Template 90.6ms ± 0% 89.7ms ± 0% -0.93% (p=0.008 n=5+5) GoParse 5.97ms ± 0% 6.01ms ± 0% +0.65% (p=0.008 n=5+5) BinaryTree17 11.8s ± 0% 11.7s ± 0% -0.28% (p=0.016 n=5+5) Revcomp 669ms ± 0% 669ms ± 0% ~ (p=0.222 n=5+5) Fannkuch11 3.28s ± 0% 3.34s ± 0% +1.72% (p=0.016 n=4+5) [Geo mean] 46.6µs 46.3µs -0.74% name old speed new speed delta RegexpMatchEasy0_32 129MB/s ± 0% 130MB/s ± 0% +0.32% (p=0.016 n=5+4) RegexpMatchEasy0_1K 1.76GB/s ± 0% 1.76GB/s ± 0% +0.13% (p=0.016 n=4+5) RegexpMatchEasy1_32 131MB/s ± 0% 132MB/s ± 0% +0.32% (p=0.008 n=5+5) RegexpMatchEasy1_1K 1.27GB/s ± 0% 1.27GB/s ± 0% -0.24% (p=0.016 n=5+4) RegexpMatchMedium_32 3.19MB/s ± 0% 3.21MB/s ± 0% +0.63% (p=0.008 n=5+5) RegexpMatchMedium_1K 19.6MB/s ± 0% 19.7MB/s ± 0% +0.51% (p=0.029 n=4+4) RegexpMatchHard_32 11.6MB/s ± 2% 11.7MB/s ± 0% ~ (p=1.000 n=5+5) RegexpMatchHard_1K 13.0MB/s ± 0% 13.0MB/s ± 0% ~ (p=0.079 n=4+5) Gzip 44.4MB/s ± 0% 44.5MB/s ± 0% +0.27% (p=0.008 n=5+5) JSONEncode 96.4MB/s ± 0% 96.2MB/s ± 1% ~ (p=0.579 n=5+5) JSONDecode 20.5MB/s ± 1% 20.6MB/s ± 1% ~ (p=0.111 n=5+5) Template 21.4MB/s ± 0% 21.6MB/s ± 0% +0.94% (p=0.008 n=5+5) GoParse 9.70MB/s ± 0% 9.63MB/s ± 0% -0.68% (p=0.016 n=4+5) Revcomp 380MB/s ± 0% 380MB/s ± 0% ~ (p=0.222 n=5+5) [Geo mean] 55.3MB/s 55.4MB/s +0.23% Change-Id: I2e5338138991d9bc984e67b51212aa5d1b0f2a6b Reviewed-on: https://go-review.googlesource.com/97335 Reviewed-by: Cherry Zhang <cherryyz@google.com> Run-TryBot: Cherry Zhang <cherryyz@google.com>
2018-04-20cmd/compile: don't lower OpConvertAustin Clements
Currently, each architecture lowers OpConvert to an arch-specific OpXXXconvert. This is silly because OpConvert means the same thing on all architectures and is logically a no-op that exists only to keep track of conversions to and from unsafe.Pointer. Furthermore, lowering it makes it harder to recognize in other analyses, particularly liveness analysis. This CL eliminates the lowering of OpConvert, leaving it as the generic op until code generation time. The main complexity here is that we still need to register-allocate OpConvert operations. Currently, each arch's lowered OpConvert specifies all GP registers in its register mask. Ideally, OpConvert wouldn't affect value homing at all, and we could just copy the home of OpConvert's source, but this can potentially home an OpConvert in a LocalSlot, which neither regalloc nor stackalloc expect. Rather than try to disentangle this assumption from regalloc and stackalloc, we continue to register-allocate OpConvert, but teach regalloc that OpConvert can be allocated to any allocatable GP register. For #24543. Change-Id: I795a6aee5fd94d4444a7bafac3838a400c9f7bb6 Reviewed-on: https://go-review.googlesource.com/108496 Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: David Chase <drchase@google.com>
2018-04-19cmd/compile: optimize ARM64 with register indexed load/storeBen Shi
ARM64 supports load/store instructions with a memory operand that the address is calculated by base register + index register. In this CL, 1. Some rules are added to the compile's ARM64 backend to emit such efficient instructions. 2. A wrong rule of load combination is fixed. The go1 benchmark does show improvement. name old time/op new time/op delta BinaryTree17-4 44.5s ± 2% 44.1s ± 1% -0.81% (p=0.000 n=28+29) Fannkuch11-4 32.7s ± 3% 30.5s ± 0% -6.79% (p=0.000 n=30+26) FmtFprintfEmpty-4 499ns ± 0% 506ns ± 5% +1.39% (p=0.003 n=25+30) FmtFprintfString-4 1.07µs ± 0% 1.04µs ± 4% -3.17% (p=0.000 n=23+30) FmtFprintfInt-4 1.15µs ± 4% 1.13µs ± 0% -1.55% (p=0.000 n=30+23) FmtFprintfIntInt-4 1.77µs ± 4% 1.74µs ± 0% -1.71% (p=0.000 n=30+24) FmtFprintfPrefixedInt-4 2.37µs ± 5% 2.12µs ± 0% -10.56% (p=0.000 n=30+23) FmtFprintfFloat-4 3.03µs ± 1% 3.03µs ± 4% -0.13% (p=0.003 n=25+30) FmtManyArgs-4 7.38µs ± 1% 7.43µs ± 4% +0.59% (p=0.003 n=25+30) GobDecode-4 101ms ± 6% 95ms ± 5% -5.55% (p=0.000 n=30+30) GobEncode-4 78.0ms ± 4% 78.8ms ± 6% +1.05% (p=0.000 n=30+30) Gzip-4 4.25s ± 0% 4.27s ± 4% +0.45% (p=0.003 n=24+30) Gunzip-4 428ms ± 1% 420ms ± 0% -1.88% (p=0.000 n=23+23) HTTPClientServer-4 549µs ± 1% 541µs ± 1% -1.56% (p=0.000 n=29+29) JSONEncode-4 194ms ± 0% 188ms ± 4% ~ (p=0.417 n=23+30) JSONDecode-4 890ms ± 5% 831ms ± 0% -6.55% (p=0.000 n=30+23) Mandelbrot200-4 47.3ms ± 2% 46.5ms ± 0% ~ (p=0.980 n=30+26) GoParse-4 43.1ms ± 6% 43.8ms ± 6% +1.65% (p=0.000 n=30+30) RegexpMatchEasy0_32-4 1.06µs ± 0% 1.07µs ± 3% ~ (p=0.092 n=23+30) RegexpMatchEasy0_1K-4 5.53µs ± 0% 5.51µs ± 0% -0.24% (p=0.000 n=25+25) RegexpMatchEasy1_32-4 1.02µs ± 3% 1.01µs ± 0% -1.27% (p=0.000 n=30+24) RegexpMatchEasy1_1K-4 7.26µs ± 0% 7.33µs ± 0% +0.95% (p=0.000 n=23+26) RegexpMatchMedium_32-4 1.84µs ± 7% 1.79µs ± 1% ~ (p=0.333 n=30+23) RegexpMatchMedium_1K-4 553µs ± 0% 547µs ± 0% -1.14% (p=0.000 n=24+22) RegexpMatchHard_32-4 30.8µs ± 1% 30.3µs ± 0% -1.40% (p=0.000 n=24+24) RegexpMatchHard_1K-4 928µs ± 0% 929µs ± 5% +0.12% (p=0.013 n=23+30) Revcomp-4 8.13s ± 4% 6.32s ± 1% -22.23% (p=0.000 n=30+23) Template-4 899ms ± 6% 854ms ± 1% -5.01% (p=0.000 n=30+24) TimeParse-4 4.66µs ± 4% 4.59µs ± 1% -1.57% (p=0.000 n=30+23) TimeFormat-4 4.58µs ± 0% 4.61µs ± 0% +0.57% (p=0.000 n=26+24) [Geo mean] 717µs 698µs -2.55% name old speed new speed delta GobDecode-4 7.63MB/s ± 6% 8.08MB/s ± 5% +5.88% (p=0.000 n=30+30) GobEncode-4 9.85MB/s ± 4% 9.75MB/s ± 6% -1.04% (p=0.000 n=30+30) Gzip-4 4.56MB/s ± 0% 4.55MB/s ± 4% -0.36% (p=0.003 n=24+30) Gunzip-4 45.3MB/s ± 1% 46.2MB/s ± 0% +1.92% (p=0.000 n=23+23) JSONEncode-4 10.0MB/s ± 0% 10.4MB/s ± 4% ~ (p=0.403 n=23+30) JSONDecode-4 2.18MB/s ± 5% 2.33MB/s ± 0% +6.91% (p=0.000 n=30+23) GoParse-4 1.34MB/s ± 5% 1.32MB/s ± 5% -1.66% (p=0.000 n=30+30) RegexpMatchEasy0_32-4 30.2MB/s ± 0% 29.8MB/s ± 3% ~ (p=0.099 n=23+30) RegexpMatchEasy0_1K-4 185MB/s ± 0% 186MB/s ± 0% +0.24% (p=0.000 n=25+25) RegexpMatchEasy1_32-4 31.4MB/s ± 3% 31.8MB/s ± 0% +1.24% (p=0.000 n=30+24) RegexpMatchEasy1_1K-4 141MB/s ± 0% 140MB/s ± 0% -0.94% (p=0.000 n=23+26) RegexpMatchMedium_32-4 541kB/s ± 6% 560kB/s ± 0% +3.45% (p=0.000 n=30+23) RegexpMatchMedium_1K-4 1.85MB/s ± 0% 1.87MB/s ± 0% +1.08% (p=0.000 n=24+23) RegexpMatchHard_32-4 1.04MB/s ± 1% 1.06MB/s ± 1% +1.48% (p=0.000 n=24+24) RegexpMatchHard_1K-4 1.10MB/s ± 0% 1.10MB/s ± 5% +0.15% (p=0.004 n=23+30) Revcomp-4 31.3MB/s ± 4% 40.2MB/s ± 1% +28.52% (p=0.000 n=30+23) Template-4 2.16MB/s ± 6% 2.27MB/s ± 1% +5.18% (p=0.000 n=30+24) [Geo mean] 7.57MB/s 7.79MB/s +2.98% fixes #24907 Change-Id: I94afd0e3f53d62a1cf5e452f3dd6daf61be21785 Reviewed-on: https://go-review.googlesource.com/107376 Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com>
2018-04-04cmd/compile: intrinsify math/big.mulWW on ARM64Balaram Makam
Performance numbers on amberwing: pkg: math/big name old time/op new time/op delta QuoRem 3.08µs ± 0% 2.93µs ± 1% -4.89% (p=0.008 n=5+5) ModSqrt225_Tonelli 721µs ± 0% 718µs ± 0% -0.46% (p=0.008 n=5+5) ModSqrt224_3Mod4 218µs ± 0% 217µs ± 0% -0.27% (p=0.008 n=5+5) ModSqrt5430_Tonelli 2.91s ± 0% 2.91s ± 0% ~ (p=0.222 n=5+5) ModSqrt5430_3Mod4 970ms ± 0% 970ms ± 0% ~ (p=0.151 n=5+5) Sqrt 45.9µs ± 0% 43.8µs ± 0% -4.63% (p=0.008 n=5+5) IntSqr/1 19.9ns ± 0% 17.3ns ± 0% -13.07% (p=0.008 n=5+5) IntSqr/2 52.6ns ± 0% 50.8ns ± 0% -3.35% (p=0.008 n=5+5) IntSqr/3 70.4ns ± 0% 69.4ns ± 0% ~ (p=0.079 n=4+5) IntSqr/5 103ns ± 0% 99ns ± 0% -3.98% (p=0.008 n=5+5) IntSqr/8 179ns ± 0% 178ns ± 0% -0.56% (p=0.008 n=5+5) IntSqr/10 272ns ± 0% 272ns ± 0% ~ (all equal) IntSqr/20 763ns ± 0% 787ns ± 0% +3.15% (p=0.016 n=5+4) IntSqr/30 1.25µs ± 1% 1.29µs ± 1% +3.27% (p=0.008 n=5+5) IntSqr/50 2.64µs ± 0% 2.71µs ± 0% +2.61% (p=0.008 n=5+5) IntSqr/80 5.67µs ± 0% 5.72µs ± 0% +0.88% (p=0.008 n=5+5) IntSqr/100 8.05µs ± 0% 8.09µs ± 0% +0.45% (p=0.008 n=5+5) IntSqr/200 28.0µs ± 0% 28.1µs ± 0% ~ (p=0.151 n=5+5) IntSqr/300 59.4µs ± 0% 59.6µs ± 0% +0.36% (p=0.008 n=5+5) IntSqr/500 141µs ± 0% 141µs ± 0% +0.08% (p=0.008 n=5+5) IntSqr/800 280µs ± 0% 280µs ± 0% -0.12% (p=0.008 n=5+5) IntSqr/1000 429µs ± 0% 428µs ± 0% -0.27% (p=0.008 n=5+5) pkg: crypto-ecdsa name old time/op new time/op delta SignP384 7.85ms ± 1% 7.61ms ± 1% -3.12% (p=0.008 n=5+5) Change-Id: I1ab30856cc0e570f6312f0bd8914779b55adbc16 Reviewed-on: https://go-review.googlesource.com/104135 Reviewed-by: Cherry Zhang <cherryyz@google.com> Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org>
2018-03-15cmd/compile/internal/ssa: add patterns for arm64 bitfield opcodesGeoff Berry
Add patterns to match common idioms for EXTR, BFI, BFXIL, SBFIZ, SBFX, UBFIZ and UBFX opcodes. go1 benchmarks results on Amberwing: name old time/op new time/op delta FmtManyArgs 786ns ± 2% 714ns ± 1% -9.20% (p=0.000 n=10+10) Gzip 437ms ± 0% 402ms ± 0% -7.99% (p=0.000 n=10+10) FmtFprintfIntInt 196ns ± 0% 182ns ± 0% -7.28% (p=0.000 n=10+9) FmtFprintfPrefixedInt 207ns ± 0% 199ns ± 0% -3.86% (p=0.000 n=10+10) FmtFprintfFloat 324ns ± 0% 316ns ± 0% -2.47% (p=0.000 n=10+8) FmtFprintfInt 119ns ± 0% 117ns ± 0% -1.68% (p=0.000 n=10+9) GobDecode 12.8ms ± 2% 12.6ms ± 1% -1.62% (p=0.002 n=10+10) JSONDecode 94.4ms ± 1% 93.4ms ± 0% -1.10% (p=0.000 n=10+10) RegexpMatchEasy0_32 247ns ± 0% 245ns ± 0% -0.65% (p=0.000 n=10+10) RegexpMatchMedium_32 314ns ± 0% 312ns ± 0% -0.64% (p=0.000 n=10+10) RegexpMatchEasy0_1K 541ns ± 0% 538ns ± 0% -0.55% (p=0.000 n=10+9) TimeParse 450ns ± 1% 448ns ± 1% -0.42% (p=0.035 n=9+9) RegexpMatchEasy1_32 244ns ± 0% 243ns ± 0% -0.41% (p=0.000 n=10+10) GoParse 6.03ms ± 0% 6.00ms ± 0% -0.40% (p=0.002 n=10+10) RegexpMatchEasy1_1K 779ns ± 0% 777ns ± 0% -0.26% (p=0.000 n=10+10) RegexpMatchHard_32 2.75µs ± 0% 2.74µs ± 1% -0.06% (p=0.026 n=9+9) BinaryTree17 11.7s ± 0% 11.6s ± 0% ~ (p=0.089 n=10+10) HTTPClientServer 89.1µs ± 1% 89.5µs ± 2% ~ (p=0.436 n=10+10) RegexpMatchHard_1K 78.9µs ± 0% 79.5µs ± 2% ~ (p=0.469 n=10+10) FmtFprintfEmpty 58.5ns ± 0% 58.5ns ± 0% ~ (all equal) GobEncode 12.0ms ± 1% 12.1ms ± 0% ~ (p=0.075 n=10+10) Revcomp 669ms ± 0% 668ms ± 0% ~ (p=0.091 n=7+9) Mandelbrot200 5.35ms ± 0% 5.36ms ± 0% +0.07% (p=0.000 n=9+9) RegexpMatchMedium_1K 52.1µs ± 0% 52.1µs ± 0% +0.10% (p=0.000 n=9+9) Fannkuch11 3.25s ± 0% 3.26s ± 0% +0.36% (p=0.000 n=9+10) FmtFprintfString 114ns ± 1% 115ns ± 0% +0.52% (p=0.011 n=10+10) JSONEncode 20.2ms ± 0% 20.3ms ± 0% +0.65% (p=0.000 n=10+10) Template 91.3ms ± 0% 92.3ms ± 0% +1.08% (p=0.000 n=10+10) TimeFormat 484ns ± 0% 495ns ± 1% +2.30% (p=0.000 n=9+10) There are some opportunities to improve this change further by adding patterns to match the "extended register" versions of ADD/SUB/CMP, but I think that should be evaluated on its own. The regressions in Template and TimeFormat would likely be recovered by this, as they seem to be due to generating: ubfiz x0, x0, #3, #8 add x1, x2, x0 instead of add x1, x2, x0, lsl #3 Change-Id: I5644a8d70ac7a98e784a377a2b76ab47a3415a4b Reviewed-on: https://go-review.googlesource.com/88355 Reviewed-by: Cherry Zhang <cherryyz@google.com> Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org>
2018-03-06runtime, cmd/compile: use ldp for DUFFCOPY on ARM64Meng Zhuo
name old time/op new time/op delta CopyFat8 2.15ns ± 1% 2.19ns ± 6% ~ (p=0.171 n=8+9) CopyFat12 2.15ns ± 0% 2.17ns ± 2% ~ (p=0.137 n=8+10) CopyFat16 2.17ns ± 3% 2.15ns ± 0% ~ (p=0.211 n=10+10) CopyFat24 2.16ns ± 1% 2.15ns ± 0% ~ (p=0.087 n=10+10) CopyFat32 11.5ns ± 0% 12.8ns ± 2% +10.87% (p=0.000 n=8+10) CopyFat64 20.2ns ± 2% 12.9ns ± 0% -36.11% (p=0.000 n=10+10) CopyFat128 37.2ns ± 0% 21.5ns ± 0% -42.20% (p=0.000 n=10+10) CopyFat256 71.6ns ± 0% 38.7ns ± 0% -45.95% (p=0.000 n=10+10) CopyFat512 140ns ± 0% 73ns ± 0% -47.86% (p=0.000 n=10+9) CopyFat520 142ns ± 0% 74ns ± 0% -47.54% (p=0.000 n=10+10) CopyFat1024 277ns ± 0% 141ns ± 0% -49.10% (p=0.000 n=10+10) Change-Id: If54bc571add5db674d5e081579c87e80153d0a5a Reviewed-on: https://go-review.googlesource.com/97395 Reviewed-by: Cherry Zhang <cherryyz@google.com>
2018-03-02cmd/compile/internal/ssa: note zero-width OpsHeschi Kreinick
Add a bool to opInfo to indicate if an Op never results in any instructions. This is a conservative approximation: some operations, like Copy, may or may not generate code depending on their arguments. I built the list by reading each arch's ssaGenValue function. Hopefully I got them all. Change-Id: I130b251b65f18208294e129bb7ddc3f91d57d31d Reviewed-on: https://go-review.googlesource.com/97957 Reviewed-by: Keith Randall <khr@golang.org>
2018-02-28cmd/compile: optimize ARM64 code with EON/ORNBen Shi
EON and ORN are efficient ARM64 instructions. EON combines (x ^ ^y) into a single operation, and so ORN does for (x | ^y). This CL implements that optimization. And here are benchmark results with RaspberryPi3/ArchLinux. 1. A specific test gets about 13% improvement. EONORN 181µs ± 0% 157µs ± 0% -13.26% (p=0.000 n=26+23) (https://github.com/benshi001/ugo1/blob/master/eonorn_test.go) 2. There is little change in the go1 benchmark, excluding noise. name old time/op new time/op delta BinaryTree17-4 44.1s ± 2% 44.0s ± 2% ~ (p=0.513 n=30+30) Fannkuch11-4 32.9s ± 3% 32.8s ± 3% -0.12% (p=0.024 n=30+30) FmtFprintfEmpty-4 561ns ± 9% 558ns ± 9% ~ (p=0.654 n=30+30) FmtFprintfString-4 1.09µs ± 4% 1.09µs ± 3% ~ (p=0.158 n=30+30) FmtFprintfInt-4 1.12µs ± 0% 1.12µs ± 0% ~ (p=0.917 n=23+28) FmtFprintfIntInt-4 1.73µs ± 0% 1.76µs ± 4% ~ (p=0.665 n=23+30) FmtFprintfPrefixedInt-4 2.15µs ± 1% 2.15µs ± 0% ~ (p=0.389 n=27+26) FmtFprintfFloat-4 3.18µs ± 4% 3.13µs ± 0% -1.50% (p=0.003 n=30+23) FmtManyArgs-4 7.32µs ± 4% 7.21µs ± 0% ~ (p=0.220 n=30+25) GobDecode-4 99.1ms ± 9% 97.0ms ± 0% -2.07% (p=0.000 n=30+23) GobEncode-4 83.3ms ± 3% 82.4ms ± 4% ~ (p=0.321 n=30+30) Gzip-4 4.39s ± 4% 4.32s ± 2% -1.42% (p=0.017 n=30+23) Gunzip-4 440ms ± 0% 447ms ± 4% +1.54% (p=0.006 n=24+30) HTTPClientServer-4 547µs ± 1% 537µs ± 1% -1.91% (p=0.000 n=30+30) JSONEncode-4 211ms ± 0% 211ms ± 0% +0.04% (p=0.000 n=23+24) JSONDecode-4 847ms ± 0% 847ms ± 0% ~ (p=0.158 n=25+25) Mandelbrot200-4 46.5ms ± 0% 46.5ms ± 0% -0.04% (p=0.000 n=25+24) GoParse-4 43.4ms ± 0% 43.4ms ± 0% ~ (p=0.494 n=24+25) RegexpMatchEasy0_32-4 1.03µs ± 0% 1.03µs ± 0% ~ (all equal) RegexpMatchEasy0_1K-4 4.02µs ± 3% 3.98µs ± 0% -0.95% (p=0.003 n=30+24) RegexpMatchEasy1_32-4 1.01µs ± 3% 1.01µs ± 2% ~ (p=0.629 n=30+30) RegexpMatchEasy1_1K-4 6.39µs ± 0% 6.39µs ± 0% ~ (p=0.564 n=24+23) RegexpMatchMedium_32-4 1.80µs ± 3% 1.78µs ± 0% ~ (p=0.155 n=30+24) RegexpMatchMedium_1K-4 555µs ± 0% 563µs ± 3% +1.55% (p=0.004 n=27+30) RegexpMatchHard_32-4 31.0µs ± 4% 30.5µs ± 1% -1.58% (p=0.000 n=30+23) RegexpMatchHard_1K-4 947µs ± 4% 931µs ± 0% -1.66% (p=0.009 n=30+24) Revcomp-4 7.71s ± 4% 7.71s ± 4% ~ (p=0.196 n=29+30) Template-4 877ms ± 0% 878ms ± 0% +0.16% (p=0.018 n=23+27) TimeParse-4 4.75µs ± 1% 4.74µs ± 0% ~ (p=0.895 n=24+23) TimeFormat-4 4.83µs ± 4% 4.83µs ± 4% ~ (p=0.767 n=30+30) [Geo mean] 709µs 707µs -0.35% name old speed new speed delta GobDecode-4 7.75MB/s ± 8% 7.91MB/s ± 0% +2.03% (p=0.001 n=30+23) GobEncode-4 9.22MB/s ± 3% 9.32MB/s ± 4% ~ (p=0.389 n=30+30) Gzip-4 4.43MB/s ± 4% 4.43MB/s ± 4% ~ (p=0.888 n=30+30) Gunzip-4 44.1MB/s ± 0% 43.4MB/s ± 4% -1.46% (p=0.009 n=24+30) JSONEncode-4 9.18MB/s ± 0% 9.18MB/s ± 0% ~ (p=0.308 n=16+24) JSONDecode-4 2.29MB/s ± 0% 2.29MB/s ± 0% ~ (all equal) GoParse-4 1.33MB/s ± 0% 1.33MB/s ± 0% ~ (all equal) RegexpMatchEasy0_32-4 30.9MB/s ± 0% 30.9MB/s ± 0% ~ (p=1.000 n=23+24) RegexpMatchEasy0_1K-4 255MB/s ± 3% 257MB/s ± 0% +0.92% (p=0.004 n=30+24) RegexpMatchEasy1_32-4 31.7MB/s ± 3% 31.6MB/s ± 2% ~ (p=0.603 n=30+30) RegexpMatchEasy1_1K-4 160MB/s ± 0% 160MB/s ± 0% ~ (p=0.435 n=24+23) RegexpMatchMedium_32-4 554kB/s ± 3% 560kB/s ± 0% +1.08% (p=0.004 n=30+24) RegexpMatchMedium_1K-4 1.85MB/s ± 0% 1.82MB/s ± 3% -1.48% (p=0.001 n=27+30) RegexpMatchHard_32-4 1.03MB/s ± 4% 1.05MB/s ± 1% +1.51% (p=0.027 n=30+23) RegexpMatchHard_1K-4 1.08MB/s ± 4% 1.10MB/s ± 0% +1.69% (p=0.002 n=30+25) Revcomp-4 33.0MB/s ± 4% 33.0MB/s ± 4% ~ (p=0.272 n=29+30) Template-4 2.21MB/s ± 0% 2.21MB/s ± 0% ~ (all equal) [Geo mean] 7.75MB/s 7.77MB/s +0.29% 3. There is little regression in the compilecmp benchmark. name old time/op new time/op delta Template 2.28s ± 3% 2.28s ± 4% ~ (p=0.739 n=10+10) Unicode 1.34s ± 4% 1.32s ± 3% ~ (p=0.113 n=10+9) GoTypes 8.10s ± 3% 8.18s ± 3% ~ (p=0.393 n=10+10) Compiler 39.0s ± 3% 39.2s ± 3% ~ (p=0.393 n=10+10) SSA 114s ± 3% 115s ± 2% ~ (p=0.631 n=10+10) Flate 1.41s ± 2% 1.42s ± 3% ~ (p=0.353 n=10+10) GoParser 1.81s ± 1% 1.83s ± 2% ~ (p=0.211 n=10+9) Reflect 5.06s ± 2% 5.06s ± 2% ~ (p=0.912 n=10+10) Tar 2.19s ± 3% 2.20s ± 3% ~ (p=0.247 n=10+10) XML 2.65s ± 2% 2.67s ± 5% ~ (p=0.796 n=10+10) [Geo mean] 4.92s 4.93s +0.27% name old user-time/op new user-time/op delta Template 2.81s ± 2% 2.81s ± 3% ~ (p=0.971 n=10+10) Unicode 1.70s ± 3% 1.67s ± 5% ~ (p=0.315 n=10+10) GoTypes 9.71s ± 1% 9.78s ± 1% +0.71% (p=0.023 n=10+10) Compiler 47.3s ± 1% 47.1s ± 3% ~ (p=0.579 n=10+10) SSA 143s ± 2% 143s ± 2% ~ (p=0.280 n=10+10) Flate 1.70s ± 3% 1.71s ± 3% ~ (p=0.481 n=10+10) GoParser 2.21s ± 3% 2.21s ± 1% ~ (p=0.549 n=10+9) Reflect 5.89s ± 1% 5.87s ± 2% ~ (p=0.739 n=10+10) Tar 2.66s ± 2% 2.63s ± 2% ~ (p=0.105 n=10+10) XML 3.16s ± 3% 3.18s ± 2% ~ (p=0.143 n=10+10) [Geo mean] 5.97s 5.97s -0.06% name old text-bytes new text-bytes delta HelloSize 637kB ± 0% 637kB ± 0% ~ (all equal) name old data-bytes new data-bytes delta HelloSize 9.46kB ± 0% 9.46kB ± 0% ~ (all equal) name old bss-bytes new bss-bytes delta HelloSize 125kB ± 0% 125kB ± 0% ~ (all equal) name old exe-bytes new exe-bytes delta HelloSize 1.24MB ± 0% 1.24MB ± 0% ~ (all equal) Change-Id: Ie27357d65c5ce9d07afdffebe1e2daadcaa3369f Reviewed-on: https://go-review.googlesource.com/97036 Reviewed-by: Cherry Zhang <cherryyz@google.com> Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org>
2018-02-22cmd/compile: fix FP accuracy issue introduced by FMA optimization on ARM64Ben Shi
Two ARM64 rules are added to avoid FP accuracy issue, which causes build failure. https://build.golang.org/log/1360f5c9ef3f37968216350283c1013e9681725d fixes #24033 Change-Id: I9b74b584ab5cc53fa49476de275dc549adf97610 Reviewed-on: https://go-review.googlesource.com/96355 Reviewed-by: Cherry Zhang <cherryyz@google.com> Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org>
2018-02-22cmd/compile: improve FP performance on ARM64Ben Shi
FMADD/FMSUB/FNMADD/FNMSUB are efficient FP instructions, which can be used by the comiler to improve FP performance. This CL implements this optimization. 1. The compilecmp benchmark shows little change. name old time/op new time/op delta Template 2.35s ± 4% 2.38s ± 4% ~ (p=0.161 n=15+15) Unicode 1.36s ± 5% 1.36s ± 4% ~ (p=0.685 n=14+13) GoTypes 8.11s ± 3% 8.13s ± 2% ~ (p=0.624 n=15+15) Compiler 40.5s ± 2% 40.7s ± 2% ~ (p=0.137 n=15+15) SSA 115s ± 3% 116s ± 1% ~ (p=0.270 n=15+14) Flate 1.46s ± 4% 1.45s ± 5% ~ (p=0.870 n=15+15) GoParser 1.85s ± 2% 1.87s ± 3% ~ (p=0.477 n=14+15) Reflect 5.11s ± 4% 5.10s ± 2% ~ (p=0.624 n=15+15) Tar 2.23s ± 3% 2.23s ± 5% ~ (p=0.624 n=15+15) XML 2.72s ± 5% 2.74s ± 3% ~ (p=0.290 n=15+14) [Geo mean] 5.02s 5.03s +0.29% name old user-time/op new user-time/op delta Template 2.90s ± 2% 2.90s ± 3% ~ (p=0.780 n=14+15) Unicode 1.71s ± 5% 1.70s ± 3% ~ (p=0.458 n=14+13) GoTypes 9.77s ± 2% 9.76s ± 2% ~ (p=0.838 n=15+15) Compiler 49.1s ± 2% 49.1s ± 2% ~ (p=0.902 n=15+15) SSA 144s ± 1% 144s ± 2% ~ (p=0.567 n=15+15) Flate 1.75s ± 5% 1.74s ± 3% ~ (p=0.461 n=15+15) GoParser 2.22s ± 2% 2.21s ± 3% ~ (p=0.233 n=15+15) Reflect 5.99s ± 2% 5.95s ± 1% ~ (p=0.093 n=14+15) Tar 2.68s ± 2% 2.67s ± 3% ~ (p=0.310 n=14+15) XML 3.22s ± 2% 3.24s ± 3% ~ (p=0.512 n=15+15) [Geo mean] 6.08s 6.07s -0.19% name old text-bytes new text-bytes delta HelloSize 641kB ± 0% 641kB ± 0% ~ (all equal) name old data-bytes new data-bytes delta HelloSize 9.46kB ± 0% 9.46kB ± 0% ~ (all equal) name old bss-bytes new bss-bytes delta HelloSize 125kB ± 0% 125kB ± 0% ~ (all equal) name old exe-bytes new exe-bytes delta HelloSize 1.24MB ± 0% 1.24MB ± 0% ~ (all equal) 2. The go1 benchmark shows little improvement in total (excluding noise), but some improvement in test case Mandelbrot200 and FmtFprintfFloat. name old time/op new time/op delta BinaryTree17-4 42.1s ± 2% 42.0s ± 2% ~ (p=0.453 n=30+28) Fannkuch11-4 33.5s ± 3% 33.3s ± 3% -0.38% (p=0.045 n=30+30) FmtFprintfEmpty-4 534ns ± 0% 534ns ± 0% ~ (all equal) FmtFprintfString-4 1.09µs ± 0% 1.09µs ± 0% -0.27% (p=0.000 n=23+17) FmtFprintfInt-4 1.16µs ± 3% 1.16µs ± 3% ~ (p=0.714 n=30+30) FmtFprintfIntInt-4 1.76µs ± 1% 1.77µs ± 0% +0.15% (p=0.002 n=23+23) FmtFprintfPrefixedInt-4 2.21µs ± 3% 2.20µs ± 3% ~ (p=0.390 n=30+30) FmtFprintfFloat-4 3.28µs ± 0% 3.11µs ± 0% -5.01% (p=0.000 n=25+26) FmtManyArgs-4 7.18µs ± 0% 7.19µs ± 0% +0.13% (p=0.000 n=24+25) GobDecode-4 94.9ms ± 0% 95.6ms ± 5% +0.83% (p=0.002 n=23+29) GobEncode-4 80.7ms ± 4% 79.8ms ± 0% -1.11% (p=0.003 n=30+24) Gzip-4 4.58s ± 4% 4.59s ± 3% +0.26% (p=0.002 n=30+26) Gunzip-4 449ms ± 4% 443ms ± 0% ~ (p=0.096 n=30+26) HTTPClientServer-4 553µs ± 1% 548µs ± 1% -0.96% (p=0.000 n=30+30) JSONEncode-4 215ms ± 4% 214ms ± 4% -0.29% (p=0.000 n=30+30) JSONDecode-4 868ms ± 4% 875ms ± 5% +0.79% (p=0.008 n=30+30) Mandelbrot200-4 51.4ms ± 0% 46.7ms ± 3% -9.09% (p=0.000 n=25+26) GoParse-4 42.1ms ± 0% 41.8ms ± 0% -0.61% (p=0.000 n=25+24) RegexpMatchEasy0_32-4 1.02µs ± 4% 1.02µs ± 4% -0.17% (p=0.000 n=30+30) RegexpMatchEasy0_1K-4 3.90µs ± 0% 3.95µs ± 4% ~ (p=0.516 n=23+30) RegexpMatchEasy1_32-4 970ns ± 3% 973ns ± 3% ~ (p=0.951 n=30+30) RegexpMatchEasy1_1K-4 6.43µs ± 3% 6.33µs ± 0% -1.62% (p=0.000 n=30+25) RegexpMatchMedium_32-4 1.75µs ± 0% 1.75µs ± 0% ~ (p=0.422 n=25+24) RegexpMatchMedium_1K-4 568µs ± 3% 562µs ± 0% ~ (p=0.079 n=30+24) RegexpMatchHard_32-4 30.8µs ± 0% 31.2µs ± 4% +1.46% (p=0.018 n=23+30) RegexpMatchHard_1K-4 932µs ± 0% 946µs ± 3% +1.49% (p=0.000 n=24+30) Revcomp-4 7.69s ± 3% 7.69s ± 2% +0.04% (p=0.032 n=24+25) Template-4 893ms ± 5% 880ms ± 6% -1.53% (p=0.000 n=30+30) TimeParse-4 4.90µs ± 3% 4.84µs ± 0% ~ (p=0.080 n=30+25) TimeFormat-4 4.70µs ± 1% 4.76µs ± 0% +1.21% (p=0.000 n=23+26) [Geo mean] 710µs 706µs -0.63% name old speed new speed delta GobDecode-4 8.09MB/s ± 0% 8.03MB/s ± 5% -0.77% (p=0.002 n=23+29) GobEncode-4 9.52MB/s ± 4% 9.62MB/s ± 0% +1.07% (p=0.003 n=30+24) Gzip-4 4.24MB/s ± 4% 4.23MB/s ± 3% -0.35% (p=0.002 n=30+26) Gunzip-4 43.2MB/s ± 4% 43.8MB/s ± 0% ~ (p=0.123 n=30+26) JSONEncode-4 9.03MB/s ± 4% 9.06MB/s ± 4% +0.28% (p=0.000 n=30+30) JSONDecode-4 2.24MB/s ± 4% 2.22MB/s ± 5% -0.79% (p=0.008 n=30+30) GoParse-4 1.38MB/s ± 1% 1.38MB/s ± 0% ~ (p=0.401 n=25+17) RegexpMatchEasy0_32-4 31.4MB/s ± 4% 31.5MB/s ± 3% +0.16% (p=0.000 n=30+30) RegexpMatchEasy0_1K-4 262MB/s ± 0% 259MB/s ± 4% ~ (p=0.693 n=23+30) RegexpMatchEasy1_32-4 33.0MB/s ± 3% 32.9MB/s ± 3% ~ (p=0.139 n=30+30) RegexpMatchEasy1_1K-4 159MB/s ± 3% 162MB/s ± 0% +1.60% (p=0.000 n=30+25) RegexpMatchMedium_32-4 570kB/s ± 0% 570kB/s ± 0% ~ (all equal) RegexpMatchMedium_1K-4 1.80MB/s ± 3% 1.82MB/s ± 0% +1.09% (p=0.007 n=30+24) RegexpMatchHard_32-4 1.04MB/s ± 0% 1.03MB/s ± 3% -1.38% (p=0.003 n=23+30) RegexpMatchHard_1K-4 1.10MB/s ± 0% 1.08MB/s ± 3% -1.52% (p=0.000 n=24+30) Revcomp-4 33.0MB/s ± 3% 33.0MB/s ± 2% ~ (p=0.128 n=24+25) Template-4 2.17MB/s ± 5% 2.21MB/s ± 6% +1.61% (p=0.000 n=30+30) [Geo mean] 7.79MB/s 7.79MB/s +0.05% Change-Id: Ied3dbdb5ba8e386168629cba06fcd4263bbb83e1 Reviewed-on: https://go-review.googlesource.com/94901 Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
2018-02-20cmd/compile: optimize ARM64 code with MNEGBen Shi
A pair of MUL/NEG instructions can be combined to a single MNEG on ARM64. This CL implements this optimization. 1. A special test case gets big improvement. (https://github.com/benshi001/ugo1/blob/master/mneg_test.go) name old time/op new time/op delta MNEG-4 315µs ± 0% 260µs ± 0% -17.39% (p=0.000 n=24+25) 2. There is little change in the go1 benchmark, excluding noise. name old time/op new time/op delta BinaryTree17-4 42.2s ± 2% 41.9s ± 2% -0.82% (p=0.001 n=30+26) Fannkuch11-4 32.9s ± 0% 32.9s ± 0% -0.01% (p=0.006 n=20+26) FmtFprintfEmpty-4 541ns ± 3% 534ns ± 0% -1.24% (p=0.003 n=30+26) FmtFprintfString-4 1.09µs ± 0% 1.10µs ± 3% ~ (p=0.142 n=23+30) FmtFprintfInt-4 1.14µs ± 0% 1.14µs ± 0% ~ (p=0.435 n=24+24) FmtFprintfIntInt-4 1.76µs ± 0% 1.76µs ± 0% ~ (p=0.508 n=24+26) FmtFprintfPrefixedInt-4 2.20µs ± 3% 2.17µs ± 0% -1.10% (p=0.017 n=30+24) FmtFprintfFloat-4 3.28µs ± 0% 3.28µs ± 0% ~ (p=0.579 n=24+24) FmtManyArgs-4 7.30µs ± 0% 7.30µs ± 0% ~ (p=0.662 n=26+27) GobDecode-4 94.8ms ± 0% 94.8ms ± 0% +0.07% (p=0.010 n=25+23) GobEncode-4 80.9ms ± 4% 80.6ms ± 4% ~ (p=0.901 n=30+30) Gzip-4 4.45s ± 0% 4.49s ± 0% +0.98% (p=0.000 n=25+24) Gunzip-4 450ms ± 3% 443ms ± 0% ~ (p=0.942 n=30+26) HTTPClientServer-4 548µs ± 1% 551µs ± 1% +0.60% (p=0.000 n=29+30) JSONEncode-4 210ms ± 0% 211ms ± 0% +0.03% (p=0.000 n=23+25) JSONDecode-4 866ms ± 5% 877ms ± 5% ~ (p=0.187 n=30+30) Mandelbrot200-4 51.4ms ± 0% 52.0ms ± 3% +1.15% (p=0.001 n=24+30) GoParse-4 42.9ms ± 5% 41.9ms ± 0% -2.24% (p=0.000 n=30+26) RegexpMatchEasy0_32-4 1.02µs ± 3% 1.01µs ± 0% ~ (p=0.247 n=30+26) RegexpMatchEasy0_1K-4 3.90µs ± 0% 3.90µs ± 0% ~ (p=0.062 n=24+24) RegexpMatchEasy1_32-4 955ns ± 0% 956ns ± 0% +0.16% (p=0.000 n=25+23) RegexpMatchEasy1_1K-4 6.42µs ± 3% 6.37µs ± 0% -0.81% (p=0.012 n=30+24) RegexpMatchMedium_32-4 1.77µs ± 3% 1.79µs ± 0% +1.28% (p=0.003 n=30+24) RegexpMatchMedium_1K-4 561µs ± 0% 569µs ± 3% +1.50% (p=0.000 n=25+30) RegexpMatchHard_32-4 31.0µs ± 4% 30.8µs ± 0% ~ (p=1.000 n=26+26) RegexpMatchHard_1K-4 945µs ± 3% 945µs ± 3% ~ (p=0.513 n=30+30) Revcomp-4 7.76s ± 4% 7.68s ± 0% ~ (p=0.464 n=29+23) Template-4 903ms ± 5% 904ms ± 5% ~ (p=0.248 n=30+30) TimeParse-4 4.80µs ± 0% 4.80µs ± 0% ~ (p=0.081 n=25+26) TimeFormat-4 4.70µs ± 1% 4.70µs ± 1% ~ (p=0.763 n=24+26) [Geo mean] 709µs 708µs -0.09% name old speed new speed delta GobDecode-4 8.10MB/s ± 0% 8.09MB/s ± 0% ~ (p=0.160 n=25+23) GobEncode-4 9.49MB/s ± 4% 9.53MB/s ± 4% ~ (p=0.360 n=30+30) Gzip-4 4.36MB/s ± 0% 4.32MB/s ± 0% -0.92% (p=0.000 n=25+24) Gunzip-4 43.2MB/s ± 3% 43.8MB/s ± 0% ~ (p=0.980 n=30+26) JSONEncode-4 9.22MB/s ± 0% 9.22MB/s ± 0% -0.04% (p=0.005 n=23+25) JSONDecode-4 2.24MB/s ± 5% 2.21MB/s ± 4% ~ (p=0.252 n=30+30) GoParse-4 1.35MB/s ± 5% 1.38MB/s ± 0% +2.00% (p=0.003 n=30+26) RegexpMatchEasy0_32-4 31.5MB/s ± 3% 31.8MB/s ± 0% ~ (p=0.110 n=30+26) RegexpMatchEasy0_1K-4 263MB/s ± 0% 263MB/s ± 0% ~ (p=0.111 n=24+24) RegexpMatchEasy1_32-4 33.5MB/s ± 0% 33.4MB/s ± 0% -0.16% (p=0.003 n=25+23) RegexpMatchEasy1_1K-4 160MB/s ± 3% 161MB/s ± 0% +0.78% (p=0.012 n=30+24) RegexpMatchMedium_32-4 565kB/s ± 3% 560kB/s ± 0% -0.83% (p=0.001 n=30+24) RegexpMatchMedium_1K-4 1.83MB/s ± 0% 1.80MB/s ± 3% -1.56% (p=0.000 n=25+30) RegexpMatchHard_32-4 1.03MB/s ± 3% 1.04MB/s ± 0% +1.46% (p=0.000 n=30+26) RegexpMatchHard_1K-4 1.08MB/s ± 3% 1.09MB/s ± 3% ~ (p=0.444 n=30+30) Revcomp-4 32.8MB/s ± 4% 33.1MB/s ± 0% ~ (p=0.858 n=29+23) Template-4 2.15MB/s ± 5% 2.15MB/s ± 5% ~ (p=0.646 n=30+30) [Geo mean] 7.79MB/s 7.81MB/s +0.21% 3. There is no regression in the compilecmp benchmark. name old time/op new time/op delta Template 2.35s ± 4% 2.33s ± 3% ~ (p=0.796 n=10+10) Unicode 1.35s ± 6% 1.35s ± 5% ~ (p=1.000 n=9+10) GoTypes 8.10s ± 3% 8.14s ± 3% ~ (p=0.604 n=9+10) Compiler 40.5s ± 2% 40.2s ± 2% ~ (p=0.065 n=10+9) SSA 115s ± 2% 115s ± 2% ~ (p=0.447 n=9+10) Flate 1.45s ± 3% 1.45s ± 4% ~ (p=0.739 n=10+10) GoParser 1.85s ± 3% 1.86s ± 2% ~ (p=0.853 n=10+10) Reflect 5.11s ± 2% 5.10s ± 2% ~ (p=0.971 n=10+10) Tar 2.23s ± 5% 2.23s ± 3% ~ (p=0.796 n=10+10) XML 2.67s ± 2% 2.69s ± 2% ~ (p=0.549 n=9+10) [Geo mean] 5.00s 5.00s +0.02% name old user-time/op new user-time/op delta Template 2.88s ± 2% 2.86s ± 2% ~ (p=0.529 n=10+10) Unicode 1.70s ± 7% 1.69s ± 5% ~ (p=0.853 n=10+10) GoTypes 9.72s ± 1% 9.73s ± 1% ~ (p=0.684 n=10+10) Compiler 49.0s ± 1% 48.9s ± 1% ~ (p=0.631 n=10+10) SSA 144s ± 1% 144s ± 2% ~ (p=0.684 n=10+10) Flate 1.71s ± 4% 1.72s ± 4% ~ (p=0.853 n=10+10) GoParser 2.23s ± 2% 2.23s ± 2% ~ (p=0.971 n=10+10) Reflect 5.98s ± 2% 5.96s ± 2% ~ (p=0.481 n=10+10) Tar 2.68s ± 3% 2.67s ± 2% ~ (p=0.393 n=10+10) XML 3.21s ± 3% 3.22s ± 1% ~ (p=0.604 n=10+9) [Geo mean] 6.05s 6.05s -0.04% name old text-bytes new text-bytes delta HelloSize 641kB ± 0% 641kB ± 0% ~ (all equal) name old data-bytes new data-bytes delta HelloSize 9.46kB ± 0% 9.46kB ± 0% ~ (all equal) name old bss-bytes new bss-bytes delta HelloSize 125kB ± 0% 125kB ± 0% ~ (all equal) name old exe-bytes new exe-bytes delta HelloSize 1.24MB ± 0% 1.24MB ± 0% ~ (all equal) Change-Id: I9ed9128f0114e0f1ebb08ca2d042c90fcb2b1dcd Reviewed-on: https://go-review.googlesource.com/95075 Reviewed-by: Cherry Zhang <cherryyz@google.com> Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org>
2018-02-20cmd/compile/internal/ssa: emit csel on arm64philhofer
Introduce a new SSA pass to generate CondSelect intstrutions, and add CondSelect lowering rules for arm64. In order to make the CSEL instruction easier to optimize, and to simplify the introduction of CSNEG, CSINC, and CSINV in the future, modify the CSEL instruction to accept a condition code in the aux field. Notably, this change makes the go1 Gzip benchmark more than 10% faster. Benchmarks on a Cavium ThunderX: name old time/op new time/op delta BinaryTree17-96 15.9s ± 6% 16.0s ± 4% ~ (p=0.968 n=10+9) Fannkuch11-96 7.17s ± 0% 7.00s ± 0% -2.43% (p=0.000 n=8+9) FmtFprintfEmpty-96 208ns ± 1% 207ns ± 0% ~ (p=0.152 n=10+8) FmtFprintfString-96 379ns ± 0% 375ns ± 0% -0.95% (p=0.000 n=10+9) FmtFprintfInt-96 385ns ± 0% 383ns ± 0% -0.52% (p=0.000 n=9+10) FmtFprintfIntInt-96 591ns ± 0% 586ns ± 0% -0.85% (p=0.006 n=7+9) FmtFprintfPrefixedInt-96 656ns ± 0% 667ns ± 0% +1.71% (p=0.000 n=10+10) FmtFprintfFloat-96 967ns ± 0% 984ns ± 0% +1.78% (p=0.000 n=10+10) FmtManyArgs-96 2.35µs ± 0% 2.25µs ± 0% -4.63% (p=0.000 n=9+8) GobDecode-96 31.0ms ± 0% 30.8ms ± 0% -0.36% (p=0.006 n=9+9) GobEncode-96 24.4ms ± 0% 24.5ms ± 0% +0.30% (p=0.000 n=9+9) Gzip-96 1.60s ± 0% 1.43s ± 0% -10.58% (p=0.000 n=9+10) Gunzip-96 167ms ± 0% 169ms ± 0% +0.83% (p=0.000 n=8+9) HTTPClientServer-96 311µs ± 1% 308µs ± 0% -0.75% (p=0.000 n=10+10) JSONEncode-96 65.0ms ± 0% 64.8ms ± 0% -0.25% (p=0.000 n=9+8) JSONDecode-96 262ms ± 1% 261ms ± 1% ~ (p=0.579 n=10+10) Mandelbrot200-96 18.0ms ± 0% 18.1ms ± 0% +0.17% (p=0.000 n=8+10) GoParse-96 14.0ms ± 0% 14.1ms ± 1% +0.42% (p=0.003 n=9+10) RegexpMatchEasy0_32-96 644ns ± 2% 645ns ± 2% ~ (p=0.836 n=10+10) RegexpMatchEasy0_1K-96 3.70µs ± 0% 3.49µs ± 0% -5.58% (p=0.000 n=10+10) RegexpMatchEasy1_32-96 662ns ± 2% 657ns ± 2% ~ (p=0.137 n=10+10) RegexpMatchEasy1_1K-96 4.47µs ± 0% 4.31µs ± 0% -3.48% (p=0.000 n=10+10) RegexpMatchMedium_32-96 844ns ± 2% 849ns ± 1% ~ (p=0.208 n=10+10) RegexpMatchMedium_1K-96 179µs ± 0% 182µs ± 0% +1.20% (p=0.000 n=10+10) RegexpMatchHard_32-96 10.0µs ± 0% 10.1µs ± 0% +0.48% (p=0.000 n=10+9) RegexpMatchHard_1K-96 297µs ± 0% 297µs ± 0% -0.14% (p=0.000 n=10+10) Revcomp-96 3.08s ± 0% 3.13s ± 0% +1.56% (p=0.000 n=9+9) Template-96 276ms ± 2% 275ms ± 1% ~ (p=0.393 n=10+10) TimeParse-96 1.37µs ± 0% 1.36µs ± 0% -0.53% (p=0.000 n=10+7) TimeFormat-96 1.40µs ± 0% 1.42µs ± 0% +0.97% (p=0.000 n=10+10) [Geo mean] 264µs 262µs -0.77% Change-Id: Ie54eee4b3092af53e6da3baa6d1755098f57f3a2 Reviewed-on: https://go-review.googlesource.com/55670 Run-TryBot: Philip Hofer <phofer@umich.edu> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com> Reviewed-by: Keith Randall <khr@golang.org>
2018-02-16cmd/compile: make math.Ceil/Floor/Round/Trunc intrinsics on arm64Chad Rosier
name old time/op new time/op delta Ceil 550ns ± 0% 486ns ± 7% -11.64% (p=0.000 n=13+18) Floor 495ns ±19% 512ns ±12% ~ (p=0.164 n=20+20) Round 550ns ± 0% 487ns ± 8% -11.49% (p=0.000 n=12+19) Trunc 563ns ± 7% 488ns ±13% -13.44% (p=0.000 n=15+2) Change-Id: I53f234b160b3c026a277506e2cf977d150379464 Reviewed-on: https://go-review.googlesource.com/88295 Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com>
2018-02-15cmd/compile: arm64 intrinsics for math/bits.OnesCountBalaram Makam
This adds math/bits intrinsics for OnesCount on arm64. name old time/op new time/op delta OnesCount 3.81ns ± 0% 1.60ns ± 0% -57.96% (p=0.000 n=7+8) OnesCount8 1.60ns ± 0% 1.60ns ± 0% ~ (all equal) OnesCount16 2.41ns ± 0% 1.60ns ± 0% -33.61% (p=0.000 n=8+8) OnesCount32 4.17ns ± 0% 1.60ns ± 0% -61.58% (p=0.000 n=8+8) OnesCount64 3.80ns ± 0% 1.60ns ± 0% -57.84% (p=0.000 n=8+8) Update #18616 Conflicts: src/cmd/compile/internal/gc/asm_test.go Change-Id: I63ac2f63acafdb1f60656ab8a56be0b326eec5cb Reviewed-on: https://go-review.googlesource.com/90835 Run-TryBot: Cherry Zhang <cherryyz@google.com> Reviewed-by: Cherry Zhang <cherryyz@google.com>
2018-02-14cmd/compile/internal/ssa: optimize arm64 with FNMULS/FNMULDBen Shi
FNMULS&FNMULD are efficient arm64 instructions, which can be used to improve FP performance. This CL use them to optimize pairs of neg-mul operations. Here are benchmark test results on Raspberry Pi 3 with ArchLinux. 1. A special test case gets about 15% improvement. (https://github.com/benshi001/ugo1/blob/master/fpmul_test.go) FPMul-4 485µs ± 0% 410µs ± 0% -15.49% (p=0.000 n=26+23) 2. There is little regression in the go1 benchmark (excluding noise). name old time/op new time/op delta BinaryTree17-4 42.0s ± 3% 42.1s ± 2% ~ (p=0.542 n=39+40) Fannkuch11-4 33.3s ± 3% 32.9s ± 1% ~ (p=0.200 n=40+32) FmtFprintfEmpty-4 534ns ± 0% 534ns ± 0% ~ (all equal) FmtFprintfString-4 1.09µs ± 1% 1.09µs ± 0% ~ (p=0.950 n=32+32) FmtFprintfInt-4 1.14µs ± 0% 1.14µs ± 1% ~ (p=0.571 n=32+31) FmtFprintfIntInt-4 1.79µs ± 3% 1.76µs ± 0% -1.42% (p=0.004 n=40+34) FmtFprintfPrefixedInt-4 2.17µs ± 0% 2.17µs ± 0% ~ (p=0.073 n=31+34) FmtFprintfFloat-4 3.33µs ± 3% 3.28µs ± 0% -1.46% (p=0.001 n=40+34) FmtManyArgs-4 7.28µs ± 6% 7.19µs ± 0% ~ (p=0.641 n=40+33) GobDecode-4 96.5ms ± 4% 96.5ms ± 9% ~ (p=0.214 n=40+40) GobEncode-4 79.5ms ± 0% 80.7ms ± 4% +1.51% (p=0.000 n=34+40) Gzip-4 4.53s ± 4% 4.56s ± 4% +0.60% (p=0.000 n=40+40) Gunzip-4 451ms ± 3% 442ms ± 0% -1.93% (p=0.000 n=40+32) HTTPClientServer-4 530µs ± 1% 535µs ± 1% +0.88% (p=0.000 n=39+39) JSONEncode-4 214ms ± 4% 211ms ± 0% ~ (p=0.059 n=40+31) JSONDecode-4 865ms ± 5% 864ms ± 4% -0.06% (p=0.003 n=40+40) Mandelbrot200-4 52.0ms ± 3% 52.1ms ± 3% ~ (p=0.556 n=40+40) GoParse-4 43.1ms ± 8% 42.1ms ± 0% ~ (p=0.083 n=40+33) RegexpMatchEasy0_32-4 1.02µs ± 3% 1.02µs ± 4% +0.06% (p=0.020 n=40+40) RegexpMatchEasy0_1K-4 3.90µs ± 0% 3.96µs ± 3% +1.58% (p=0.000 n=31+40) RegexpMatchEasy1_32-4 967ns ± 4% 981ns ± 3% +1.40% (p=0.000 n=40+40) RegexpMatchEasy1_1K-4 6.41µs ± 4% 6.43µs ± 3% ~ (p=0.386 n=40+40) RegexpMatchMedium_32-4 1.76µs ± 3% 1.78µs ± 3% +1.08% (p=0.000 n=40+40) RegexpMatchMedium_1K-4 561µs ± 0% 562µs ± 0% +0.09% (p=0.003 n=34+31) RegexpMatchHard_32-4 31.5µs ± 2% 31.1µs ± 4% -1.17% (p=0.000 n=30+40) RegexpMatchHard_1K-4 960µs ± 3% 950µs ± 4% -1.02% (p=0.016 n=40+40) Revcomp-4 7.79s ± 7% 7.79s ± 4% ~ (p=0.859 n=40+40) Template-4 889ms ± 6% 872ms ± 3% -1.86% (p=0.025 n=40+31) TimeParse-4 4.80µs ± 0% 4.89µs ± 3% +1.71% (p=0.001 n=31+40) TimeFormat-4 4.70µs ± 1% 4.78µs ± 3% +1.57% (p=0.000 n=33+40) [Geo mean] 710µs 709µs -0.13% name old speed new speed delta GobDecode-4 7.96MB/s ± 4% 7.96MB/s ± 9% ~ (p=0.174 n=40+40) GobEncode-4 9.65MB/s ± 0% 9.51MB/s ± 4% -1.45% (p=0.000 n=34+40) Gzip-4 4.29MB/s ± 4% 4.26MB/s ± 4% -0.59% (p=0.000 n=40+40) Gunzip-4 43.0MB/s ± 3% 43.9MB/s ± 0% +1.90% (p=0.000 n=40+32) JSONEncode-4 9.09MB/s ± 4% 9.22MB/s ± 0% ~ (p=0.429 n=40+31) JSONDecode-4 2.25MB/s ± 5% 2.25MB/s ± 4% ~ (p=0.278 n=40+40) GoParse-4 1.35MB/s ± 7% 1.37MB/s ± 0% ~ (p=0.071 n=40+25) RegexpMatchEasy0_32-4 31.5MB/s ± 3% 31.5MB/s ± 4% -0.08% (p=0.018 n=40+40) RegexpMatchEasy0_1K-4 263MB/s ± 0% 259MB/s ± 3% -1.51% (p=0.000 n=31+40) RegexpMatchEasy1_32-4 33.1MB/s ± 4% 32.6MB/s ± 3% -1.38% (p=0.000 n=40+40) RegexpMatchEasy1_1K-4 160MB/s ± 4% 159MB/s ± 3% ~ (p=0.364 n=40+40) RegexpMatchMedium_32-4 565kB/s ± 3% 562kB/s ± 2% ~ (p=0.208 n=40+40) RegexpMatchMedium_1K-4 1.82MB/s ± 0% 1.82MB/s ± 0% -0.27% (p=0.000 n=34+31) RegexpMatchHard_32-4 1.02MB/s ± 3% 1.03MB/s ± 4% +1.04% (p=0.000 n=32+40) RegexpMatchHard_1K-4 1.07MB/s ± 4% 1.08MB/s ± 4% +0.94% (p=0.003 n=40+40) Revcomp-4 32.6MB/s ± 7% 32.6MB/s ± 4% ~ (p=0.965 n=40+40) Template-4 2.18MB/s ± 6% 2.22MB/s ± 3% +1.83% (p=0.020 n=40+31) [Geo mean] 7.77MB/s 7.78MB/s +0.16% 3. There is little change in the compilecmp benchmark (excluding noise). name old time/op new time/op delta Template 2.37s ± 3% 2.35s ± 4% ~ (p=0.529 n=10+10) Unicode 1.38s ± 8% 1.36s ± 5% ~ (p=0.247 n=10+10) GoTypes 8.10s ± 2% 8.10s ± 2% ~ (p=0.971 n=10+10) Compiler 40.5s ± 4% 40.8s ± 1% ~ (p=0.529 n=10+10) SSA 115s ± 2% 115s ± 3% ~ (p=0.684 n=10+10) Flate 1.45s ± 5% 1.46s ± 3% ~ (p=0.796 n=10+10) GoParser 1.86s ± 4% 1.84s ± 2% ~ (p=0.095 n=9+10) Reflect 5.11s ± 2% 5.13s ± 2% ~ (p=0.315 n=10+10) Tar 2.22s ± 3% 2.23s ± 1% ~ (p=0.299 n=9+7) XML 2.72s ± 3% 2.72s ± 3% ~ (p=0.912 n=10+10) [Geo mean] 5.03s 5.02s -0.21% name old user-time/op new user-time/op delta Template 2.92s ± 2% 2.89s ± 1% ~ (p=0.247 n=10+10) Unicode 1.71s ± 5% 1.69s ± 4% ~ (p=0.393 n=10+10) GoTypes 9.78s ± 2% 9.76s ± 2% ~ (p=0.631 n=10+10) Compiler 49.1s ± 2% 49.1s ± 1% ~ (p=0.796 n=10+10) SSA 144s ± 1% 144s ± 2% ~ (p=0.796 n=10+10) Flate 1.74s ± 2% 1.73s ± 3% ~ (p=0.842 n=10+9) GoParser 2.23s ± 3% 2.25s ± 2% ~ (p=0.143 n=10+10) Reflect 5.93s ± 3% 5.98s ± 2% ~ (p=0.211 n=10+9) Tar 2.65s ± 2% 2.69s ± 3% +1.51% (p=0.010 n=9+10) XML 3.25s ± 2% 3.21s ± 1% -1.24% (p=0.035 n=10+9) [Geo mean] 6.07s 6.07s -0.08% name old text-bytes new text-bytes delta HelloSize 641kB ± 0% 641kB ± 0% ~ (all equal) name old data-bytes new data-bytes delta HelloSize 9.46kB ± 0% 9.46kB ± 0% ~ (all equal) name old bss-bytes new bss-bytes delta HelloSize 125kB ± 0% 125kB ± 0% ~ (all equal) name old exe-bytes new exe-bytes delta HelloSize 1.24MB ± 0% 1.24MB ± 0% ~ (all equal) Change-Id: Id095d998c380eef929755124084df02446a6b7c1 Reviewed-on: https://go-review.googlesource.com/92555 Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com>
2018-02-13runtime: buffered write barrier for arm64Austin Clements
Updates #22460. Change-Id: I5f8fbece9545840f5fc4c9834e2050b0920776f0 Reviewed-on: https://go-review.googlesource.com/92699 Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com>
2017-10-10cmd/compile: intrinsify runtime.getcallerspCherry Zhang
Add a compiler intrinsic for getcallersp. So we are able to get rid of the argument (not done in this CL). Change-Id: Ic38fda1c694f918328659ab44654198fb116668d Reviewed-on: https://go-review.googlesource.com/69350 Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Austin Clements <austin@google.com> Reviewed-by: David Chase <drchase@google.com>
2017-08-25cmd/compile: memory clearing optimization for arm64Wei Xiao
Use "STP (ZR, ZR), O(R)" instead of "MOVD ZR, O(R)" to implement memory clearing. Also improve assembler supports to STP/LDP. Results (A57@2GHzx8): benchmark old ns/op new ns/op delta BenchmarkClearFat8-8 1.00 1.00 +0.00% BenchmarkClearFat12-8 1.01 1.01 +0.00% BenchmarkClearFat16-8 1.01 1.01 +0.00% BenchmarkClearFat24-8 1.52 1.52 +0.00% BenchmarkClearFat32-8 3.00 2.02 -32.67% BenchmarkClearFat40-8 3.50 2.52 -28.00% BenchmarkClearFat48-8 3.50 3.03 -13.43% BenchmarkClearFat56-8 4.00 3.50 -12.50% BenchmarkClearFat64-8 4.25 4.00 -5.88% BenchmarkClearFat128-8 8.01 8.01 +0.00% BenchmarkClearFat256-8 16.1 16.0 -0.62% BenchmarkClearFat512-8 32.1 32.0 -0.31% BenchmarkClearFat1024-8 64.1 64.1 +0.00% Change-Id: Ie5f5eac271ff685884775005825f206167a5c146 Reviewed-on: https://go-review.googlesource.com/55610 Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com>
2017-08-15cmd/compile: add support for arm64 bit-test instructionsphilhofer
Add support for generating TBZ/TBNZ instructions. The bit-test-and-branch pattern shows up in a number of important places, including the runtime (gc bitmaps). Before this change, there were 3 TB[N]?Z instructions in the Go tool, all of which were in hand-written assembly. After this change, there are 285. Also, the go1 benchmark binary gets about 4.5kB smaller. Fixes #21361 Change-Id: I170c138b852754b9b8df149966ca5e62e6dfa771 Reviewed-on: https://go-review.googlesource.com/54470 Run-TryBot: Philip Hofer <phofer@umich.edu> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com>