path: root/test/codegen
Age         Commit message                                              Author
2021-08-25  cmd/compile: generic SSA rules for simplifying 2 and 3 operand integer arithmetic expressions  (Jake Ciolek)

This applies the following generic integer addition/subtraction
transformations:

    x - (x + y) = -y
    (x - y) - x = -y
    y + (x - y) = x
    y + (z + (x - y)) = x + z

There are over 40 unique matching functions in Go. It hits 2 funcs in
the runtime itself:

    runtime.stackfree()
    runtime.runqdrain()

Go binary size reduced by 0.05% on Linux x86_64.

StackCopy bench (perflocked Cascade Lake x86):

name                old time/op  new time/op  delta
StackCopyPtr-8      87.3ms ± 1%  86.9ms ± 0%  -0.45%  (p=0.000 n=20+20)
StackCopy-8         77.6ms ± 1%  77.0ms ± 0%  -0.76%  (p=0.000 n=20+20)
StackCopyNoCache-8  2.28ms ± 2%  2.26ms ± 2%  -0.93%  (p=0.008 n=19+20)

test/bench/go1 benchmarks (perflocked Cascade Lake x86):

name old time/op new time/op delta
BinaryTree17-8 1.88s ± 1% 1.88s ± 0% ~ (p=0.373 n=15+12)
Fannkuch11-8 2.31s ± 0% 2.35s ± 0% +1.52% (p=0.000 n=15+14)
FmtFprintfEmpty-8 26.6ns ± 0% 26.6ns ± 0% ~ (p=0.081 n=14+13)
FmtFprintfString-8 48.6ns ± 0% 50.0ns ± 0% +2.86% (p=0.000 n=15+14)
FmtFprintfInt-8 56.9ns ± 0% 54.8ns ± 0% -3.70% (p=0.000 n=15+15)
FmtFprintfIntInt-8 90.4ns ± 0% 88.8ns ± 0% -1.78% (p=0.000 n=15+15)
FmtFprintfPrefixedInt-8 104ns ± 0% 104ns ± 0% ~ (p=0.905 n=14+13)
FmtFprintfFloat-8 148ns ± 0% 144ns ± 0% -2.19% (p=0.000 n=14+15)
FmtManyArgs-8 389ns ± 0% 390ns ± 0% +0.35% (p=0.000 n=12+15)
GobDecode-8 3.90ms ± 1% 3.88ms ± 0% -0.49% (p=0.000 n=15+14)
GobEncode-8 2.73ms ± 0% 2.73ms ± 0% ~ (p=0.425 n=15+14)
Gzip-8 169ms ± 0% 168ms ± 0% -0.52% (p=0.000 n=13+13)
Gunzip-8 24.7ms ± 0% 24.8ms ± 0% +0.61% (p=0.000 n=15+15)
HTTPClientServer-8 60.5µs ± 6% 60.4µs ± 7% ~ (p=0.595 n=15+15)
JSONEncode-8 6.97ms ± 1% 6.93ms ± 0% -0.69% (p=0.000 n=14+14)
JSONDecode-8 31.2ms ± 1% 30.8ms ± 1% -1.27% (p=0.000 n=14+14)
Mandelbrot200-8 3.87ms ± 0% 3.87ms ± 0% ~ (p=0.652 n=15+14)
GoParse-8 2.65ms ± 2% 2.64ms ± 1% ~ (p=0.202 n=15+15)
RegexpMatchEasy0_32-8 45.1ns ± 0% 45.9ns ± 0% +1.68% (p=0.000 n=14+15)
RegexpMatchEasy0_1K-8 140ns ± 0% 139ns ± 0% -0.44% (p=0.000 n=15+14)
RegexpMatchEasy1_32-8 40.9ns ± 3% 40.5ns ± 0% -0.88% (p=0.000 n=15+13)
RegexpMatchEasy1_1K-8 215ns ± 1% 220ns ± 1% +2.27% (p=0.000 n=15+15)
RegexpMatchMedium_32-8 783ns ± 7% 738ns ± 0% ~ (p=0.361 n=15+15)
RegexpMatchMedium_1K-8 24.1µs ± 6% 23.4µs ± 6% -2.94% (p=0.004 n=15+15)
RegexpMatchHard_32-8 1.10µs ± 1% 1.09µs ± 1% -0.40% (p=0.006 n=15+14)
RegexpMatchHard_1K-8 33.0µs ± 0% 33.0µs ± 0% ~ (p=0.535 n=12+14)
Revcomp-8 354ms ± 0% 353ms ± 0% -0.23% (p=0.002 n=15+13)
Template-8 42.0ms ± 1% 41.8ms ± 2% -0.37% (p=0.023 n=14+15)
TimeParse-8 181ns ± 0% 180ns ± 1% -0.18% (p=0.014 n=12+13)
TimeFormat-8 240ns ± 0% 242ns ± 1% +0.69% (p=0.000 n=12+15)
[Geo mean] 35.2µs 35.1µs -0.43%

name old speed new speed delta
GobDecode-8 197MB/s ± 1% 198MB/s ± 0% +0.49% (p=0.000 n=15+14)
GobEncode-8 281MB/s ± 0% 281MB/s ± 0% ~ (p=0.419 n=15+14)
Gzip-8 115MB/s ± 0% 115MB/s ± 0% +0.52% (p=0.000 n=13+13)
Gunzip-8 786MB/s ± 0% 781MB/s ± 0% -0.60% (p=0.000 n=15+15)
JSONEncode-8 278MB/s ± 1% 280MB/s ± 0% +0.69% (p=0.000 n=14+14)
JSONDecode-8 62.3MB/s ± 1% 63.1MB/s ± 1% +1.29% (p=0.000 n=14+14)
GoParse-8 21.9MB/s ± 2% 22.0MB/s ± 1% ~ (p=0.205 n=15+15)
RegexpMatchEasy0_32-8 709MB/s ± 0% 697MB/s ± 0% -1.65% (p=0.000 n=14+15)
RegexpMatchEasy0_1K-8 7.34GB/s ± 0% 7.37GB/s ± 0% +0.43% (p=0.000 n=15+15)
RegexpMatchEasy1_32-8 783MB/s ± 2% 790MB/s ± 0% +0.88% (p=0.000 n=15+13)
RegexpMatchEasy1_1K-8 4.77GB/s ± 1% 4.66GB/s ± 1% -2.23% (p=0.000 n=15+15)
RegexpMatchMedium_32-8 41.0MB/s ± 7% 43.3MB/s ± 0% ~ (p=0.360 n=15+15)
RegexpMatchMedium_1K-8 42.5MB/s ± 6% 43.8MB/s ± 6% +3.07% (p=0.004 n=15+15)
RegexpMatchHard_32-8 29.2MB/s ± 1% 29.3MB/s ± 1% +0.41% (p=0.006 n=15+14)
RegexpMatchHard_1K-8 31.1MB/s ± 0% 31.1MB/s ± 0% ~ (p=0.495 n=12+14)
Revcomp-8 718MB/s ± 0% 720MB/s ± 0% +0.23% (p=0.002 n=15+13)
Template-8 46.3MB/s ± 1% 46.4MB/s ± 2% +0.38% (p=0.021 n=14+15)
[Geo mean] 205MB/s 206MB/s +0.57%

Change-Id: Ibd1afdf8b6c0b08087dcc3acd8f943637eb95ac0
Reviewed-on: https://go-review.googlesource.com/c/go/+/344930
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Go Bot <gobot@golang.org>
Trust: Josh Bleecher Snyder <josharian@gmail.com>
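For illustration, a minimal Go sketch (not from the CL; function names
are hypothetical) of source shapes these generic rules fold:

    // f folds to a single negation: x - (x + y) = -y.
    func f(x, y int) int {
    	return x - (x + y)
    }

    // g folds to a plain addition: y + (z + (x - y)) = x + z.
    func g(x, y, z int) int {
    	return y + (z + (x - y))
    }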
2021-08-16  cmd/compile: intrinsify Mul64 on riscv64  (Meng Zhuo)

According to the RISCV instruction set manual v2.2, Sec 6.1, MULHU
followed by MUL will be fused into one multiply by the
microarchitecture.

Benchstat on HiFive Unmatched:

name old time/op new time/op delta
Hash8Bytes 245ns ± 3% 186ns ± 4% -23.99% (p=0.000 n=10+10)
Hash320Bytes 1.94µs ± 1% 1.31µs ± 1% -32.38% (p=0.000 n=9+10)
Hash1K 5.84µs ± 0% 3.84µs ± 0% -34.20% (p=0.000 n=10+9)
Hash8K 45.3µs ± 0% 29.4µs ± 0% -35.04% (p=0.000 n=10+10)

name old speed new speed delta
Hash8Bytes 32.7MB/s ± 3% 43.0MB/s ± 4% +31.61% (p=0.000 n=10+10)
Hash320Bytes 165MB/s ± 1% 244MB/s ± 1% +47.88% (p=0.000 n=9+10)
Hash1K 175MB/s ± 0% 266MB/s ± 0% +51.98% (p=0.000 n=10+9)
Hash8K 181MB/s ± 0% 279MB/s ± 0% +53.94% (p=0.000 n=10+10)

Change-Id: I3561495d02a4a0ad8578e9b9819bf0a4eaca5d12
Reviewed-on: https://go-review.googlesource.com/c/go/+/329970
Reviewed-by: Joel Sing <joel@sing.id.au>
Run-TryBot: Joel Sing <joel@sing.id.au>
TryBot-Result: Go Bot <gobot@golang.org>
Trust: Meng Zhuo <mzh@golangcn.org>
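The intrinsified API is math/bits.Mul64; a small usage sketch
(hypothetical wrapper name):

    import "math/bits"

    // mul128 returns the full 128-bit product of x and y. On riscv64
    // this now lowers to the MULHU+MUL pair the hardware can fuse.
    func mul128(x, y uint64) (hi, lo uint64) {
    	return bits.Mul64(x, y)
    }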
2021-06-03  [dev.typeparams] cmd/compile: implement clobberdead mode on ARM64  (Cherry Mui)

For debugging.

Change-Id: I5875ccd2413b8ffd2ec97a0ace66b5cae7893b24
Reviewed-on: https://go-review.googlesource.com/c/go/+/324765
Trust: Cherry Mui <cherryyz@google.com>
Run-TryBot: Cherry Mui <cherryyz@google.com>
Reviewed-by: Than McIntosh <thanm@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
2021-06-03  [dev.typeparams] test: adjust codegen test for register ABI on ARM64  (Cherry Mui)

In codegen/arithmetic.go, there were previously MOVDs that matched
loads of arguments. With the register ABI there are no more such
loads. Remove the MOVD matches.

Change-Id: I920ee2629c8c04d454f13a0c08e283d3528d9a64
Reviewed-on: https://go-review.googlesource.com/c/go/+/324251
Trust: Cherry Mui <cherryyz@google.com>
Run-TryBot: Cherry Mui <cherryyz@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Than McIntosh <thanm@google.com>
2021-05-12  cmd/compile: add arch-specific inlining for runtime.memmove  (Ruslan Andreev)

This CL adds runtime.memmove inlining for AMD64 and ARM64. According
to the ssa dump from the test cases, the generic rules can't inline
memmove properly because one of the arguments is a Phi operation. But
this Phi op will be optimized out by later optimization stages. As a
result, memmove can be inlined by arch-specific rules. The commit adds
new rules to the arch-specific rule sets that inline runtime.memmove
when possible during the lowering stage.

The optimization fires 5 times in the Go source code using regabi.

Fixes #41662

Change-Id: Iaffaf4c482d068b5f0683d141863892202cc8824
Reviewed-on: https://go-review.googlesource.com/c/go/+/289151
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Go Bot <gobot@golang.org>
Trust: David Chase <drchase@google.com>
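A hedged sketch (hypothetical name, not from the CL) of a copy that
lowers to a runtime.memmove call, which rules like these can expand
inline when the size works out to a small constant:

    // copy8 copies up to 8 bytes between possibly-overlapping slices;
    // copy compiles to runtime.memmove, which the arch-specific rules
    // can expand into plain loads and stores during lowering.
    func copy8(dst, src []byte) int {
    	return copy(dst[:8], src[:8])
    }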
2021-05-08  cmd/compile: remove bit operations that modify memory directly  (Keith Randall)

These operations (BT{S,R,C}{Q,L}modify) are quite a bit slower than
other ways of doing the same thing.

Without the BTxmodify operations, there are two fallback ways the
compiler performs these operations: AND/OR/XOR operations directly on
memory, or load-BTx-write sequences. The compiler kinda chooses one
arbitrarily depending on rewrite rule application order. Currently, it
uses load-BTx-write for the Const benchmarks and AND/OR/XOR directly
to memory for the non-Const benchmarks. TBD, someone might investigate
which of the two fallback strategies is really better. For now, they
are both better than BTx ops.

name old time/op new time/op delta
BitSet-8 1.09µs ± 2% 0.64µs ± 5% -41.60% (p=0.000 n=9+10)
BitClear-8 1.15µs ± 3% 0.68µs ± 6% -41.00% (p=0.000 n=10+10)
BitToggle-8 1.18µs ± 4% 0.73µs ± 2% -38.36% (p=0.000 n=10+8)
BitSetConst-8 37.0ns ± 7% 25.8ns ± 2% -30.24% (p=0.000 n=10+10)
BitClearConst-8 30.7ns ± 2% 25.0ns ±12% -18.46% (p=0.000 n=10+10)
BitToggleConst-8 36.9ns ± 1% 23.8ns ± 3% -35.46% (p=0.000 n=9+10)

Fixes #45790
Update #45242

Change-Id: Ie33a72dc139f261af82db15d446cd0855afb4e59
Reviewed-on: https://go-review.googlesource.com/c/go/+/318149
Trust: Keith Randall <khr@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Ben Shi <powerman1st@163.com>
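A sketch (hypothetical, not from the CL) of the read-modify-write
source pattern these ops were generated for:

    // Each statement is a read-modify-write on memory. The compiler
    // now emits AND/OR/XOR to memory (or a load-BTx-store sequence)
    // instead of BTSQmodify/BTRQmodify/BTCQmodify.
    func bitOps(x *uint64, bit uint) {
    	*x |= 1 << (bit & 63)  // set
    	*x &^= 1 << (bit & 63) // clear
    	*x ^= 1 << (bit & 63)  // toggle
    }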
2021-04-19  cmd/compile: reduce redundant register moves for regabi calls  (Cherry Zhang)

Currently, if we have AX=a and BX=b, and we want to make a call
F(1, a, b), to move arguments into the desired registers it emits

    MOVQ AX, CX
    MOVL $1, AX // AX=1
    MOVQ BX, DX
    MOVQ CX, BX // BX=a
    MOVQ DX, CX // CX=b

This has a few redundant moves. This is because we process inputs in
order. First, allocate 1 to AX, which kicks out a (in AX) to CX (a
free register at the moment). Then, allocate a to BX, which kicks out
b (in BX) to DX. Finally, put b to CX.

Notice that if we start with allocating CX=b, then BX=a, AX=1, we will
not have redundant moves. This CL reduces redundant moves by
allocating them in a different order: first, for inputs that are
already in place, keep them there; then allocate free registers; then
everything else.

                          before    after
cmd/compile binary size  23703888  23609680
text size                 8565899   8533291

(with regabiargs enabled.)

Change-Id: I69e1bdf745f2c90bb791f6d7c45b37384af1e874
Reviewed-on: https://go-review.googlesource.com/c/go/+/311371
Trust: Cherry Zhang <cherryyz@google.com>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Than McIntosh <thanm@google.com>
2021-04-14  cmd/compile: rescue stmt boundaries from OpArgXXXReg and OpSelectN  (David Chase)

Fixes this failure:

    go test cmd/compile/internal/ssa -run TestStmtLines -v
    === RUN   TestStmtLines
        stmtlines_test.go:115: Saw too many (amd64, > 1%) lines without
        statement marks, total=88263, nostmt=1930
        ('-run TestStmtLines -v' lists failing lines)

The failure has two causes.

One is that the first-line adjuster in code generation was relocating
"first lines" to instructions that would either not have any code
generated, or would have the statement marker removed by a different
believed-good heuristic.

The other was that statement boundaries were getting attached to
register values (which, with the old ABI, were loads from the stack,
hence real instructions). The register values disappear at code
generation.

The fixes are to (1) note that certain instructions are not good
choices for "first value" and skip them, and (2) in an expandCalls
post-pass, look for register-valued instructions and under appropriate
conditions move their statement marker to a compatible use.

Also updates TestStmtLines to always log the score, for easier
comparison of minor compiler changes.

Updates #40724.

Change-Id: I485573ce900e292d7c44574adb7629cdb4695c3f
Reviewed-on: https://go-review.googlesource.com/c/go/+/309649
Trust: David Chase <drchase@google.com>
Run-TryBot: David Chase <drchase@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2021-04-13  test: make codegen/memops.go work with both ABIs  (Cherry Zhang)

Following CL 309335, this fixes memops.go.

Change-Id: Ia2458b5267deee9f906f76c50e70a021ea2fcb5b
Reviewed-on: https://go-review.googlesource.com/c/go/+/309552
Trust: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Than McIntosh <thanm@google.com>
2021-04-12  test: make codegen tests work with both ABIs  (Cherry Zhang)

Some codegen tests were written with the assumption that arguments and
results are in memory, and with a specific stack layout. With the
register ABI, that assumption no longer holds. Adjust the tests to
work in both cases.

- For tests expecting in-memory arguments/results, change them to use
  global variables or memory-assigned arguments/results.
- Allow more registers. E.g. some tests expect register names that
  contain only letters (e.g. AX), but they can also contain numbers
  (e.g. R10).
- Some instruction selection changes when operating on a register vs.
  memory, e.g. ADDQ vs. LEAQ, MOVB vs. MOVL. Accept both.

TODO: mathbits.go and memops.go still need fixing.

Change-Id: Ic5932b4b5dd3f5d30ed078d296476b641420c4c5
Reviewed-on: https://go-review.googlesource.com/c/go/+/309335
Trust: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
2021-03-26  cmd/compile: fix long RMW bit operations on AMD64  (Pat Gavlin)

Under certain circumstances, the existing rules for bit operations can
produce code that writes beyond its intended bounds. For example,
consider the following code:

    func repro(b []byte, addr, bit int32) {
    	_ = b[3]
    	v := uint32(b[0]) | uint32(b[1])<<8 | uint32(b[2])<<16 | uint32(b[3])<<24 | 1<<(bit&31)
    	b[0] = byte(v)
    	b[1] = byte(v >> 8)
    	b[2] = byte(v >> 16)
    	b[3] = byte(v >> 24)
    }

Roughly speaking:

1. The expression `1 << (bit & 31)` is rewritten into `(SHLL 1 bit)`.
2. The expression `uint32(b[0]) | uint32(b[1])<<8 | uint32(b[2])<<16 |
   uint32(b[3])<<24` is rewritten into `(MOVLload &b[0])`.
3. The statements `b[0] = byte(v) ... b[3] = byte(v >> 24)` are
   rewritten into `(MOVLstore &b[0], v)`.
4. `(ORL (SHLL 1, bit) (MOVLload &b[0]))` is rewritten into
   `(BTSL (MOVLload &b[0]) bit)`. This is a valid transformation
   because the destination is a register: in this case, the bit offset
   is masked by the number of bits in the destination register. This
   is identical to the masking performed by `SHL`.
5. `(MOVLstore &b[0] (BTSL (MOVLload &b[0]) bit))` is rewritten into
   `(BTSLmodify &b[0] bit)`. This is an invalid transformation because
   the destination is memory: in this case, the bit offset is not
   masked, and the chosen instruction may write outside its intended
   32-bit location.

These changes fix the invalid rewrite performed in step (5) by
explicitly masking the bit offset operand to `BT(S|R|C)(L|Q)modify`.
In the example above, the adjusted rules produce
`(BTSLmodify &b[0] (ANDLconst [31] bit))` in step (5).

These changes also add several new rules to rewrite bit sets, toggles,
and clears that are rooted at `(OR|XOR|AND)(L|Q)modify` operators into
appropriate `BT(S|R|C)(L|Q)modify` operators. These rules catch cases
where `MOV(L|Q)store ((OR|XOR|AND)(L|Q) ...)` is rewritten to
`(OR|XOR|AND)(L|Q)modify` before the `(OR|XOR|AND)(L|Q) ...` can be
rewritten to `BT(S|R|C)(L|Q) ...`.

Overall, compilecmp reports small improvements in code size on
darwin/amd64 when the changes to the compiler itself are excluded:

file before after Δ %
runtime.s 536464 536412 -52 -0.010%
bytes.s 32629 32593 -36 -0.110%
strings.s 44565 44529 -36 -0.081%
os/signal.s 7967 7959 -8 -0.100%
cmd/vendor/golang.org/x/sys/unix.s 81686 81678 -8 -0.010%
math/big.s 188235 188253 +18 +0.010%
cmd/link/internal/loader.s 89295 89056 -239 -0.268%
cmd/link/internal/ld.s 633551 633232 -319 -0.050%
cmd/link/internal/arm.s 18934 18928 -6 -0.032%
cmd/link/internal/arm64.s 31814 31801 -13 -0.041%
cmd/link/internal/riscv64.s 7347 7345 -2 -0.027%
cmd/compile/internal/ssa.s 4029173 4033066 +3893 +0.097%
total 21298280 21301472 +3192 +0.015%

Change-Id: I2e560548b515865129e1724e150e30540e9d29ce
GitHub-Last-Rev: 9a42bd29a55b3917651aecab6932074df96535ae
GitHub-Pull-Request: golang/go#45242
Reviewed-on: https://go-review.googlesource.com/c/go/+/304869
Reviewed-by: Keith Randall <khr@golang.org>
Trust: Josh Bleecher Snyder <josharian@gmail.com>
2021-03-26  cmd/compile: add arm64 rules to optimize go codes to constant 0  (fanzha02)

Optimize the following code to constant 0:

    func shift(x uint32) uint64 {
    	return uint64(x) >> 32
    }

Change-Id: Ida6b39d713cc119ad5a2f01fd54bfd252cf2c975
Reviewed-on: https://go-review.googlesource.com/c/go/+/303830
Trust: fannie zhang <Fannie.Zhang@arm.com>
Run-TryBot: fannie zhang <Fannie.Zhang@arm.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2021-03-23  cmd/compile: optimize codes with arm64 REV16 instruction  (fanzha02)

Optimize some patterns into the rev16/rev16w instruction.

Pattern1:
    (c & 0xff00ff00)>>8 | (c & 0x00ff00ff)<<8
To:
    rev16w c

Pattern2:
    (c & 0xff00ff00ff00ff00)>>8 | (c & 0x00ff00ff00ff00ff)<<8
To:
    rev16 c

This patch is a copy of CL 239637, contributed by Alice Xu
(dianhong.xu@arm.com).

Change-Id: I96936c1db87618bc1903c04221c7e9b2779455b3
Reviewed-on: https://go-review.googlesource.com/c/go/+/268377
Trust: fannie zhang <Fannie.Zhang@arm.com>
Run-TryBot: fannie zhang <Fannie.Zhang@arm.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
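A Go-level sketch of Pattern1 (hypothetical function name, not from
the CL):

    // swapBytesInHalfwords swaps adjacent bytes within each 16-bit
    // lane of c; on arm64 the whole expression can now lower to a
    // single REV16W instruction.
    func swapBytesInHalfwords(c uint32) uint32 {
    	return (c&0xff00ff00)>>8 | (c&0x00ff00ff)<<8
    }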
2021-03-19  cmd/compile: add clobberdeadreg mode  (Cherry Zhang)

When the -clobberdeadreg flag is set, the compiler inserts code that
clobbers integer registers at call sites. This may be helpful for
debugging the register ABI.

Only implemented on AMD64 for now.

Change-Id: Ia203d3f891c30fd95d0103489056fe01d63a2899
Reviewed-on: https://go-review.googlesource.com/c/go/+/302809
Trust: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
2021-03-18  cmd/compile: add rewrite rules for conditional instructions on arm64  (fanzha02)

This CL adds rewrite rules for CSETM, CSINC, CSINV, and CSNEG. By
adding these rules, we can save one instruction. For example,

    func test(cond bool, a int) int {
    	if cond {
    		a++
    	}
    	return a
    }

Before:

    MOVD "".a+8(RSP), R0
    ADD $1, R0, R1
    MOVBU "".cond(RSP), R2
    CMPW $0, R2
    CSEL NE, R1, R0, R0

After:

    MOVBU "".cond(RSP), R0
    CMPW $0, R0
    MOVD "".a+8(RSP), R0
    CSINC EQ, R0, R0, R0

This patch is a copy of CL 285694.
Co-authored-by: JunchenLi <junchen.li@arm.com>

Change-Id: Ic1a79e8b8ece409b533becfcb7950f11e7b76f24
Reviewed-on: https://go-review.googlesource.com/c/go/+/302231
Trust: fannie zhang <Fannie.Zhang@arm.com>
Run-TryBot: fannie zhang <Fannie.Zhang@arm.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2021-03-17  cmd/compile: resurrect clobberdead mode  (Cherry Zhang)

This CL resurrects the clobberdead debugging mode (CL 23924). When the
-clobberdead flag is set (TODO: make it GOEXPERIMENT?), the compiler
inserts code that clobbers all dead stack slots that contain pointers.

Mark windows syscall functions cgo_unsafe_args, as the code actually
does that, by taking the address of one argument and passing it to
cgocall.

Change-Id: Ie09a015f4bd14ae6053cc707866e30ae509b9d6f
Reviewed-on: https://go-review.googlesource.com/c/go/+/301791
Trust: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Than McIntosh <thanm@google.com>
2021-03-16  cmd/compile: loads from readonly globals into const for mips64x  (Meng Zhuo)

Ref: CL 141118
Update #26498

Change-Id: If4ea55c080b9aa10183eefe81fefbd4072deaf3a
Reviewed-on: https://go-review.googlesource.com/c/go/+/280646
Trust: Meng Zhuo <mzh@golangcn.org>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2021-03-16  cmd/compile: use depth first topological sort algorithm for layout  (erifan01)

The current layout algorithm tries to put consecutive blocks together,
so the priority of the successor block is higher than the priority of
the zero-indegree block. This algorithm is beneficial for subsequent
register allocation, but will result in more branch instructions. The
depth-first topological sorting algorithm is a well-known layout
algorithm with applications in many languages, and it helps to reduce
branch instructions. This CL applies it to the layout pass. The test
results show that it helps to reduce the code size.

This CL also includes the following changes:

1. Removed the primary predecessor mechanism. The new layout algorithm
   is not very friendly to the register allocator in some cases, so in
   order to adapt to the new layout algorithm, a new primary
   predecessor selection strategy is introduced.
2. Since the new layout implementation may place non-loop blocks
   between loop blocks, some adaptive modifications have also been
   made to the looprotate pass.
3. The layout also affects the results of codegen, so this CL also
   adjusted several codegen tests accordingly.

It is inevitable that this CL will cause the code size or performance
of a few functions to decrease, but the number of cases it improves is
much larger than the number of cases it drops.

Statistical data from compilecmp on linux/amd64 is as follows:

name old time/op new time/op delta
Template 382ms ± 4% 382ms ± 4% ~ (p=0.497 n=49+50)
Unicode 170ms ± 9% 169ms ± 8% ~ (p=0.344 n=48+50)
GoTypes 2.01s ± 4% 2.01s ± 4% ~ (p=0.628 n=50+48)
Compiler 190ms ±10% 189ms ± 9% ~ (p=0.734 n=50+50)
SSA 11.8s ± 2% 11.8s ± 3% ~ (p=0.877 n=50+50)
Flate 241ms ± 9% 241ms ± 8% ~ (p=0.897 n=50+49)
GoParser 366ms ± 3% 361ms ± 4% -1.21% (p=0.004 n=47+50)
Reflect 835ms ± 3% 838ms ± 3% ~ (p=0.275 n=50+49)
Tar 336ms ± 4% 335ms ± 3% ~ (p=0.454 n=48+48)
XML 433ms ± 4% 431ms ± 3% ~ (p=0.071 n=49+48)
LinkCompiler 706ms ± 4% 705ms ± 4% ~ (p=0.608 n=50+49)
ExternalLinkCompiler 1.85s ± 3% 1.83s ± 2% -1.47% (p=0.000 n=49+48)
LinkWithoutDebugCompiler 437ms ± 5% 437ms ± 6% ~ (p=0.953 n=49+50)
[Geo mean] 615ms 613ms -0.37%

name old alloc/op new alloc/op delta
Template 38.7MB ± 1% 38.7MB ± 1% ~ (p=0.834 n=50+50)
Unicode 28.1MB ± 0% 28.1MB ± 0% -0.22% (p=0.000 n=49+50)
GoTypes 168MB ± 1% 168MB ± 1% ~ (p=0.054 n=47+47)
Compiler 23.0MB ± 1% 23.0MB ± 1% ~ (p=0.432 n=50+50)
SSA 1.54GB ± 0% 1.54GB ± 0% +0.21% (p=0.000 n=50+50)
Flate 23.6MB ± 1% 23.6MB ± 1% ~ (p=0.153 n=43+46)
GoParser 35.1MB ± 1% 35.1MB ± 2% ~ (p=0.202 n=50+50)
Reflect 84.7MB ± 1% 84.7MB ± 1% ~ (p=0.333 n=48+49)
Tar 34.5MB ± 1% 34.5MB ± 1% ~ (p=0.406 n=46+49)
XML 44.3MB ± 2% 44.2MB ± 3% ~ (p=0.981 n=50+50)
LinkCompiler 131MB ± 0% 128MB ± 0% -2.74% (p=0.000 n=50+50)
ExternalLinkCompiler 120MB ± 0% 120MB ± 0% +0.01% (p=0.007 n=50+50)
LinkWithoutDebugCompiler 77.3MB ± 0% 77.3MB ± 0% -0.02% (p=0.000 n=50+50)
[Geo mean] 69.3MB 69.1MB -0.22%

file before after Δ %
addr2line 4104220 4043684 -60536 -1.475%
api 5342502 5249678 -92824 -1.737%
asm 4973785 4858257 -115528 -2.323%
buildid 2667844 2625660 -42184 -1.581%
cgo 4686849 4616313 -70536 -1.505%
compile 23667431 23268406 -399025 -1.686%
cover 4959676 4874108 -85568 -1.725%
dist 3515934 3450422 -65512 -1.863%
doc 3995581 3925469 -70112 -1.755%
fix 3379202 3318522 -60680 -1.796%
link 6743249 6629913 -113336 -1.681%
nm 4047529 3991777 -55752 -1.377%
objdump 4456151 4388151 -68000 -1.526%
pack 2435040 2398072 -36968 -1.518%
pprof 13804080 13565808 -238272 -1.726%
test2json 2690043 2645987 -44056 -1.638%
trace 10418492 10232716 -185776 -1.783%
vet 7258259 7121259 -137000 -1.888%
total 113145867 111204202 -1941665 -1.716%

The situation on linux/arm64 is as follows:

name old time/op new time/op delta
Template 280ms ± 1% 282ms ± 1% +0.75% (p=0.000 n=46+48)
Unicode 124ms ± 2% 124ms ± 2% +0.37% (p=0.045 n=50+50)
GoTypes 1.69s ± 1% 1.70s ± 1% +0.56% (p=0.000 n=49+50)
Compiler 122ms ± 1% 123ms ± 1% +0.93% (p=0.000 n=50+50)
SSA 12.6s ± 1% 12.7s ± 0% +0.72% (p=0.000 n=50+50)
Flate 170ms ± 1% 172ms ± 1% +0.97% (p=0.000 n=49+49)
GoParser 262ms ± 1% 263ms ± 1% +0.39% (p=0.000 n=49+48)
Reflect 639ms ± 1% 650ms ± 1% +1.63% (p=0.000 n=49+49)
Tar 243ms ± 1% 245ms ± 1% +0.82% (p=0.000 n=50+50)
XML 324ms ± 1% 327ms ± 1% +0.72% (p=0.000 n=50+49)
LinkCompiler 597ms ± 1% 596ms ± 1% -0.27% (p=0.001 n=48+47)
ExternalLinkCompiler 1.90s ± 1% 1.88s ± 1% -1.00% (p=0.000 n=50+50)
LinkWithoutDebugCompiler 364ms ± 1% 363ms ± 1% ~ (p=0.220 n=49+50)
[Geo mean] 485ms 488ms +0.49%

name old alloc/op new alloc/op delta
Template 38.7MB ± 0% 38.8MB ± 1% ~ (p=0.093 n=43+49)
Unicode 28.4MB ± 0% 28.4MB ± 0% +0.03% (p=0.000 n=49+45)
GoTypes 169MB ± 1% 169MB ± 1% +0.23% (p=0.010 n=50+50)
Compiler 23.2MB ± 1% 23.2MB ± 1% +0.11% (p=0.000 n=40+44)
SSA 1.54GB ± 0% 1.55GB ± 0% +0.45% (p=0.000 n=47+49)
Flate 23.8MB ± 2% 23.8MB ± 1% ~ (p=0.543 n=50+50)
GoParser 35.3MB ± 1% 35.4MB ± 1% ~ (p=0.792 n=50+50)
Reflect 85.2MB ± 1% 85.2MB ± 0% ~ (p=0.055 n=50+47)
Tar 34.5MB ± 1% 34.5MB ± 1% +0.06% (p=0.015 n=50+50)
XML 43.8MB ± 2% 43.9MB ± 2% +0.19% (p=0.000 n=48+48)
LinkCompiler 137MB ± 0% 136MB ± 0% -0.92% (p=0.000 n=50+50)
ExternalLinkCompiler 127MB ± 0% 127MB ± 0% ~ (p=0.516 n=50+50)
LinkWithoutDebugCompiler 84.0MB ± 0% 84.0MB ± 0% ~ (p=0.057 n=50+50)
[Geo mean] 70.4MB 70.4MB +0.01%

file before after Δ %
addr2line 4021557 4002933 -18624 -0.463%
api 5127847 5028503 -99344 -1.937%
asm 5034716 4936836 -97880 -1.944%
buildid 2608118 2594094 -14024 -0.538%
cgo 4488592 4398320 -90272 -2.011%
compile 22501129 22213592 -287537 -1.278%
cover 4742301 4713573 -28728 -0.606%
dist 3388071 3365311 -22760 -0.672%
doc 3802250 3776082 -26168 -0.688%
fix 3306147 3216939 -89208 -2.698%
link 6404483 6363699 -40784 -0.637%
nm 3941026 3921930 -19096 -0.485%
objdump 4383330 4295122 -88208 -2.012%
pack 2404547 2389515 -15032 -0.625%
pprof 12996234 12856818 -139416 -1.073%
test2json 2668500 2586788 -81712 -3.062%
trace 9816276 9609580 -206696 -2.106%
vet 6900682 6787338 -113344 -1.643%
total 108535806 107056973 -1478833 -1.363%

Change-Id: Iaec1cdcaacca8025e9babb0fb8a532fddb70c87d
Reviewed-on: https://go-review.googlesource.com/c/go/+/255239
Reviewed-by: eric fang <eric.fang@arm.com>
Reviewed-by: Keith Randall <khr@golang.org>
Trust: eric fang <eric.fang@arm.com>
2021-03-11  cmd/compile: optimize multi-register shifts on amd64  (Josh Bleecher Snyder)

amd64 can shift in bits from another register instead of filling with
0/1. This pattern is helpful when implementing 128-bit shifts or
arbitrary-length shifts. In the standard library, it shows up in pure
Go math/big.

Benchmark results on amd64 with -tags=math_big_pure_go.

name old time/op new time/op delta
NonZeroShifts/1/shrVU-8 4.45ns ± 3% 4.39ns ± 1% -1.28% (p=0.000 n=30+27)
NonZeroShifts/1/shlVU-8 4.13ns ± 4% 4.10ns ± 2% ~ (p=0.254 n=29+28)
NonZeroShifts/2/shrVU-8 5.55ns ± 1% 5.63ns ± 2% +1.42% (p=0.000 n=28+29)
NonZeroShifts/2/shlVU-8 5.70ns ± 2% 5.14ns ± 1% -9.82% (p=0.000 n=29+28)
NonZeroShifts/3/shrVU-8 6.79ns ± 2% 6.35ns ± 2% -6.46% (p=0.000 n=28+29)
NonZeroShifts/3/shlVU-8 6.69ns ± 1% 6.25ns ± 1% -6.60% (p=0.000 n=28+27)
NonZeroShifts/4/shrVU-8 7.79ns ± 2% 7.06ns ± 2% -9.48% (p=0.000 n=30+30)
NonZeroShifts/4/shlVU-8 7.82ns ± 1% 7.24ns ± 1% -7.37% (p=0.000 n=28+29)
NonZeroShifts/5/shrVU-8 8.90ns ± 3% 7.93ns ± 1% -10.84% (p=0.000 n=29+26)
NonZeroShifts/5/shlVU-8 8.68ns ± 1% 7.92ns ± 1% -8.76% (p=0.000 n=29+29)
NonZeroShifts/10/shrVU-8 14.4ns ± 1% 12.3ns ± 2% -14.79% (p=0.000 n=28+29)
NonZeroShifts/10/shlVU-8 14.1ns ± 1% 11.9ns ± 2% -15.55% (p=0.000 n=28+27)
NonZeroShifts/100/shrVU-8 118ns ± 1% 96ns ± 3% -18.82% (p=0.000 n=30+29)
NonZeroShifts/100/shlVU-8 120ns ± 2% 98ns ± 2% -18.46% (p=0.000 n=29+28)
NonZeroShifts/1000/shrVU-8 1.10µs ± 1% 0.88µs ± 2% -19.63% (p=0.000 n=29+30)
NonZeroShifts/1000/shlVU-8 1.10µs ± 2% 0.88µs ± 2% -20.28% (p=0.000 n=29+28)
NonZeroShifts/10000/shrVU-8 10.9µs ± 1% 8.7µs ± 1% -19.78% (p=0.000 n=28+27)
NonZeroShifts/10000/shlVU-8 10.9µs ± 2% 8.7µs ± 1% -19.64% (p=0.000 n=29+27)
NonZeroShifts/100000/shrVU-8 111µs ± 2% 90µs ± 2% -19.39% (p=0.000 n=28+29)
NonZeroShifts/100000/shlVU-8 113µs ± 2% 90µs ± 2% -20.43% (p=0.000 n=30+27)

The assembly version is still faster, unfortunately, but the gap is
narrowing. Speedup from pure Go to assembly:

name old time/op new time/op delta
NonZeroShifts/1/shrVU-8 4.39ns ± 1% 3.45ns ± 2% -21.36% (p=0.000 n=27+29)
NonZeroShifts/1/shlVU-8 4.10ns ± 2% 3.47ns ± 3% -15.42% (p=0.000 n=28+30)
NonZeroShifts/2/shrVU-8 5.63ns ± 2% 3.97ns ± 0% -29.40% (p=0.000 n=29+25)
NonZeroShifts/2/shlVU-8 5.14ns ± 1% 3.77ns ± 2% -26.65% (p=0.000 n=28+26)
NonZeroShifts/3/shrVU-8 6.35ns ± 2% 4.79ns ± 2% -24.52% (p=0.000 n=29+29)
NonZeroShifts/3/shlVU-8 6.25ns ± 1% 4.42ns ± 1% -29.29% (p=0.000 n=27+26)
NonZeroShifts/4/shrVU-8 7.06ns ± 2% 5.64ns ± 1% -20.05% (p=0.000 n=30+29)
NonZeroShifts/4/shlVU-8 7.24ns ± 1% 5.34ns ± 2% -26.23% (p=0.000 n=29+29)
NonZeroShifts/5/shrVU-8 7.93ns ± 1% 6.56ns ± 2% -17.26% (p=0.000 n=26+30)
NonZeroShifts/5/shlVU-8 7.92ns ± 1% 6.27ns ± 1% -20.79% (p=0.000 n=29+25)
NonZeroShifts/10/shrVU-8 12.3ns ± 2% 10.2ns ± 2% -17.21% (p=0.000 n=29+29)
NonZeroShifts/10/shlVU-8 11.9ns ± 2% 10.5ns ± 2% -12.45% (p=0.000 n=27+29)
NonZeroShifts/100/shrVU-8 95.9ns ± 3% 77.7ns ± 1% -19.00% (p=0.000 n=29+30)
NonZeroShifts/100/shlVU-8 97.5ns ± 2% 66.8ns ± 2% -31.47% (p=0.000 n=28+30)
NonZeroShifts/1000/shrVU-8 884ns ± 2% 705ns ± 1% -20.17% (p=0.000 n=30+28)
NonZeroShifts/1000/shlVU-8 880ns ± 2% 590ns ± 1% -32.96% (p=0.000 n=28+25)
NonZeroShifts/10000/shrVU-8 8.74µs ± 1% 7.34µs ± 3% -15.94% (p=0.000 n=27+30)
NonZeroShifts/10000/shlVU-8 8.73µs ± 1% 6.00µs ± 1% -31.25% (p=0.000 n=27+28)
NonZeroShifts/100000/shrVU-8 89.6µs ± 2% 75.5µs ± 2% -15.80% (p=0.000 n=29+29)
NonZeroShifts/100000/shlVU-8 89.6µs ± 2% 68.0µs ± 3% -24.09% (p=0.000 n=27+30)

Change-Id: I18f58d8f5513d737d9cdf09b8f9d14011ffe3958
Reviewed-on: https://go-review.googlesource.com/c/go/+/297050
Trust: Josh Bleecher Snyder <josharian@gmail.com>
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
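A hedged sketch (hypothetical helper, not from the CL) of the shape
that can now compile to a double-width shift on amd64:

    // shl128 shifts the 128-bit value hi:lo left by s, for 0 < s < 64.
    // The hi<<s | lo>>(64-s) term is the pattern that lowers to a
    // multi-register shift, filling the vacated bits from lo.
    func shl128(hi, lo uint64, s uint) (uint64, uint64) {
    	return hi<<s | lo>>(64-s), lo << s
    }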
2021-03-09  cmd/asm,cmd/compile: support 5 operand RLWNM/RLWMI on ppc64  (Paul E. Murphy)

These instructions are actually 5 argument opcodes as specified by the
ISA. Prior to this patch, the MB and ME arguments were merged into a
single bitmask operand to work around the limitations of the ppc64
assembler backend.

This limitation no longer exists. Thus, we can pass operands for these
opcodes without having to merge the MB and ME arguments in the
assembler frontend or compiler backend.

Likewise, support for 4 operand variants is unchanged.

Change-Id: Ib086774f3581edeaadfd2190d652aaaa8a90daeb
Reviewed-on: https://go-review.googlesource.com/c/go/+/298750
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
Reviewed-by: Carlos Eduardo Seo <carlos.seo@linaro.org>
Trust: Carlos Eduardo Seo <carlos.seo@linaro.org>
2021-03-02  cmd/compile: optimize single-precision floating point square root  (fanzha02)

Add a generic rule to rewrite the single-precision square root
expression with one single-precision instruction. The optimization
removes two precision conversions between double precision and single
precision.

On the arm64 platform, previously:

    FCVTSD F0, F0
    FSQRTD F0, F0
    FCVTDS F0, F0

optimized:

    FSQRTS S0, S0

And this patch adds a test case to check the correctness.

This patch refers to CL 241877, contributed by Alice Xu
(dianhong.xu@arm.com).

Change-Id: I6de5d02281c693017ac4bd4c10963dd55989bd7e
Reviewed-on: https://go-review.googlesource.com/c/go/+/276873
Trust: fannie zhang <Fannie.Zhang@arm.com>
Run-TryBot: fannie zhang <Fannie.Zhang@arm.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
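A minimal sketch of the source form this targets (hypothetical
function name):

    import "math"

    // sqrt32 is the usual way to take a single-precision square root
    // in Go; it can now compile to a single FSQRTS instead of the
    // FCVTSD+FSQRTD+FCVTDS sequence.
    func sqrt32(x float32) float32 {
    	return float32(math.Sqrt(float64(x)))
    }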
2021-02-24  cmd/compile: ARM64 optimize []float64 and []float32 access  (Egon Elbre)

Optimize load and store to []float64 and []float32. Previously it used
LSL instead of a shifted-register indexed load/store.

Before:

    LSL $3, R0, R0
    FMOVD F0, (R1)(R0)

After:

    FMOVD F0, (R1)(R0<<3)

Fixes #42798

Change-Id: I0c0912140c3dce5aa6abc27097c0eb93833cc589
Reviewed-on: https://go-review.googlesource.com/c/go/+/273706
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Trust: Giovanni Bajo <rasky@develer.com>
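A hypothetical example (not from the CL) of the affected access
pattern:

    // store writes v at index i; on arm64 the address arithmetic can
    // now fold into a shifted-register indexed store such as
    // FMOVD F0, (R1)(R0<<3) instead of a separate LSL.
    func store(xs []float64, i int, v float64) {
    	xs[i] = v
    }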
2021-02-24  cmd/compile: add rule to coalesce writes  (Alejandro García Montoro)

The code generated when storing eight bytes loaded from memory created
a series of small writes instead of a single, large one. The specific
pattern of instructions generated stored 1 byte, then 2 bytes, then 4
bytes, and finally 1 byte.

The new rules match this specific pattern both for amd64 and for
s390x, and convert it into a single instruction to store the 8 bytes.
arm64 and ppc64le already generated the right code, but the new
codegen test covers also those architectures.

Fixes #41663

Change-Id: Ifb9b464be2d59c2ed5034acf7b9c3e473f344030
Reviewed-on: https://go-review.googlesource.com/c/go/+/280456
Reviewed-by: Josh Bleecher Snyder <josharian@gmail.com>
Trust: Josh Bleecher Snyder <josharian@gmail.com>
Trust: Jason A. Donenfeld <Jason@zx2c4.com>
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Go Bot <gobot@golang.org>
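A sketch of the byte-at-a-time store pattern being coalesced
(hypothetical name; assumes little-endian layout):

    // putUint64 stores the 8 bytes of v. The eight single-byte stores
    // can now be coalesced into one 8-byte store on amd64 and s390x.
    func putUint64(b []byte, v uint64) {
    	_ = b[7]
    	b[0] = byte(v)
    	b[1] = byte(v >> 8)
    	b[2] = byte(v >> 16)
    	b[3] = byte(v >> 24)
    	b[4] = byte(v >> 32)
    	b[5] = byte(v >> 40)
    	b[6] = byte(v >> 48)
    	b[7] = byte(v >> 56)
    }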
2021-02-23  cmd/compile: improve bit test code  (Keith Randall)

Some bit test instruction generation stopped triggering after the
change to addressing modes. I suspect this was just because ANDQload
was being generated before the rewrite rules could discover the BTQ.
Fix that by decomposing the ANDQload when it is surrounded by a TESTQ
(thus re-enabling the BTQ rules).

Fixes #44228

Change-Id: I489b4a5a7eb01c65fc8db0753f8cec54097cadb2
Reviewed-on: https://go-review.googlesource.com/c/go/+/291749
Trust: Keith Randall <khr@golang.org>
Trust: Josh Bleecher Snyder <josharian@gmail.com>
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: Josh Bleecher Snyder <josharian@gmail.com>
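A hedged example (hypothetical name) of the kind of test that should
again compile to a BTQ:

    // isSet reports whether bit n of *x is set. The x&(1<<n) != 0 form
    // is what the BTQ rewrite rules look for; with x loaded from
    // memory, the ANDQload no longer hides the pattern from them.
    func isSet(x *uint64, n uint) bool {
    	return *x&(1<<(n&63)) != 0
    }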
2021-02-03  [dev.regabi] cmd/compile: reserve X15 as zero register on AMD64  (Cherry Zhang)

In ABIInternal, reserve X15 as constant zero, and use it to zero
memory. (Maybe there can be more use of it?)

The register is zeroed when transitioning to ABIInternal from ABI0.

Caveat: using X15 generates longer instructions than using X0. Maybe
we want to use X0?

Change-Id: I12d5ee92a01fc0b59dad4e5ab023ac71bc2a8b7d
Reviewed-on: https://go-review.googlesource.com/c/go/+/288093
Trust: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
2021-01-13  [dev.regabi] cmd/compile: make ordering for InvertFlags more stable  (David Chase)

Currently many architectures use a rule along the lines of

    // Canonicalize the order of arguments to comparisons - helps with CSE.
    ((CMP|CMPW) x y) && x.ID > y.ID => (InvertFlags ((CMP|CMPW) y x))

to normalize comparisons as much as possible for CSE. Replace the ID
comparison with something less variable across compiler changes. This
helps avoid spurious failures in some of the codegen-comparison tests
(though the current choice of comparison is sensitive to Op ordering).

Two tests changed to accommodate modified instruction choice.

Change-Id: Ib35f450bd2bae9d4f9f7838ceaf7ec682bcf1e1a
Reviewed-on: https://go-review.googlesource.com/c/go/+/280155
Trust: David Chase <drchase@google.com>
Run-TryBot: David Chase <drchase@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2020-11-17  cmd/compile: fix rules regression with shifts on PPC64  (Lynn Boger)

Some rules for PPC64 were checking for a case where a shift followed
by an 'and' of a mask could be lowered, depending on the format of the
mask. The function to verify whether the mask was valid for this
purpose was not checking if the mask was 0, which we don't want to
allow. This case can happen if previous optimizations resulted in that
mask value.

This fixes isPPC64ValidShiftMask to check for a mask of 0 and return
false. It also adds a codegen testcase to verify the rules don't try
to match a zero mask in the future.

Fixes #42610

Change-Id: I565d94e88495f51321ab365d6388c01e791b4dbb
Reviewed-on: https://go-review.googlesource.com/c/go/+/270358
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Paul Murphy <murp@ibm.com>
Reviewed-by: Carlos Eduardo Seo <carlos.seo@linaro.org>
Trust: Lynn Boger <laboger@linux.vnet.ibm.com>
2020-11-08  test/codegen: go fmt  (Alberto Donizetti)

Fixes #42445

Change-Id: I9653ef094dba2a1ac2e3daaa98279d10df17a2a1
Reviewed-on: https://go-review.googlesource.com/c/go/+/268257
Trust: Alberto Donizetti <alb.donizetti@gmail.com>
Trust: Martin Möhrmann <moehrmann@google.com>
Run-TryBot: Alberto Donizetti <alb.donizetti@gmail.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Martin Möhrmann <moehrmann@google.com>
2020-11-06  cmd/compile: optimize shift pairs and masks on s390x  (Michael Munday)

Optimize combinations of left and right shifts by a constant value
into a 'rotate then insert selected bits [into zero]' instruction. Use
the same instruction for contiguous masks since it has some benefits
over 'and immediate' (not restricted to 32 bits, does not overwrite
the source register).

To keep the complexity of this change under control I've only
implemented 64-bit operations for now.

There are a lot more optimizations that can be done with this
instruction family. However, since their function overlaps with other
instructions we need to be somewhat careful not to break existing
optimization rules by creating optimization dead ends. This is
particularly true of the load/store merging rules which contain lots
of zero extensions and shifts.

This CL does interfere with the store merging rules when an operand is
shifted left before it is stored:

    binary.BigEndian.PutUint64(b, x << 1)

This is unfortunate but it's not critical and somewhat complex so I
plan to fix that in a follow-up CL.

file before after Δ %
addr2line 4117446 4117282 -164 -0.004%
api 4945184 4942752 -2432 -0.049%
asm 4998079 4991891 -6188 -0.124%
buildid 2685158 2684074 -1084 -0.040%
cgo 4553732 4553394 -338 -0.007%
compile 19294446 19245070 -49376 -0.256%
cover 4897105 4891319 -5786 -0.118%
dist 3544389 3542785 -1604 -0.045%
doc 3926795 3927617 +822 +0.021%
fix 3302958 3293868 -9090 -0.275%
link 6546274 6543456 -2818 -0.043%
nm 4102021 4100825 -1196 -0.029%
objdump 4542431 4548483 +6052 +0.133%
pack 2482465 2416389 -66076 -2.662%
pprof 13366541 13363915 -2626 -0.020%
test2json 2829007 2761515 -67492 -2.386%
trace 10216164 10219684 +3520 +0.034%
vet 6773956 6773572 -384 -0.006%
total 107124151 106917891 -206260 -0.193%

Change-Id: I7591cce41e06867ba10a745daae9333513062746
Reviewed-on: https://go-review.googlesource.com/c/go/+/233317
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Trust: Michael Munday <mike.munday@ibm.com>
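A hedged sketch (hypothetical name, not from the CL) of a shift pair
that a single rotate-then-insert-selected-bits instruction can cover:

    // extract16 pulls bits 32..47 out of x. A constant left shift
    // followed by a constant right shift like this is a candidate for
    // one rotate-and-insert instruction on s390x.
    func extract16(x uint64) uint64 {
    	return (x << 16) >> 48
    }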
2020-11-02  cmd/compile: remove racefuncenterfp when it is not needed  (Cherry Zhang)

We already remove racefuncenter and racefuncexit if they are not
needed (i.e. the function doesn't have any other race calls).
racefuncenterfp is like racefuncenter but used on LR machines. Remove
unnecessary racefuncenterfp as well.

Change-Id: I65edb00e19c6d9ab55a204cbbb93e9fb710559f1
Reviewed-on: https://go-review.googlesource.com/c/go/+/267099
Trust: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
2020-10-27  cmd/compile: combine more 32 bit shift and mask operations on ppc64  (Paul E. Murphy)

Combine (AND m (SRWconst x)) or (SRWconst (AND m x)) when mask m and
the shift value produce a constant which can be encoded into an RLWINM
instruction.

Combine (CLRLSLDI (SRWconst x)) if the combining of the underlying
rotate masks produces a constant which can be encoded into RLWINM.
Likewise for (SLDconst (SRWconst x)) and (CLRLSLDI (RLWINM x)).

Combine rotate word + and operations which can be encoded as a single
RLWINM/RLWNM instruction.

The most notable performance improvements arise from the crypto
benchmarks below (GOARCH=power8 on a ppc64le/linux):

pkg:golang.org/x/crypto/blowfish goos:linux goarch:ppc64le
ExpandKeyWithSalt 52.2µs ± 0% 47.5µs ± 0% -8.88%
ExpandKey 44.4µs ± 0% 40.3µs ± 0% -9.15%

pkg:golang.org/x/crypto/ssh/internal/bcrypt_pbkdf goos:linux goarch:ppc64le
Key 57.6ms ± 0% 52.3ms ± 0% -9.13%

pkg:golang.org/x/crypto/bcrypt goos:linux goarch:ppc64le
Equal 90.9ms ± 0% 82.6ms ± 0% -9.13%
DefaultCost 91.0ms ± 0% 82.7ms ± 0% -9.12%

Change-Id: I59a0ca29face38f4ab46e37124c32906f216c4ce
Reviewed-on: https://go-review.googlesource.com/c/go/+/260798
Run-TryBot: Carlos Eduardo Seo <carlos.seo@linaro.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
Reviewed-by: Carlos Eduardo Seo <carlos.seo@linaro.com>
Trust: Lynn Boger <laboger@linux.vnet.ibm.com>
2020-10-06  all: implement GO386=softfloat  (Keith Randall)

Backstop support for non-sse2 chips now that 387 is gone.

RELNOTE=yes

Change-Id: Ib10e69c4a3654c15a03568f93393437e1939e013
Reviewed-on: https://go-review.googlesource.com/c/go/+/260017
Trust: Keith Randall <khr@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
2020-10-02  all: drop 387 support  (Keith Randall)

My last 387 CL. So sad ... ... ... ... not!

Fixes #40255

Change-Id: I8d4ddb744b234b8adc735db2f7c3c7b6d8bbdfa4
Reviewed-on: https://go-review.googlesource.com/c/go/+/258957
Trust: Keith Randall <khr@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2020-10-01  cmd/compile,cmd/internal/obj/ppc64: fix some shift rules due to a regression  (Lynn Boger)

A recent change to improve shifts was generating some invalid cases
when the rule was based on an AND. The extended mnemonics CLRLSLDI and
CLRLSLWI only allow certain values for the operands, and in the mask
case those values were not being checked properly. This adds a check
to those rules to verify that the 'b' and 'n' values used when an AND
was part of the rule have correct values.

There was a bug in some diag messages in asm9. The messages expected 3
values but only provided 2. Those are corrected here also.

The test/codegen/shift.go was updated to add a few more cases to check
for the case mentioned here.

Some of the comments that mention the order of operands in these
extended mnemonics were wrong and those have been corrected.

Fixes #41683.

Change-Id: If5bb860acaa5051b9e0cd80784b2868b85898c31
Reviewed-on: https://go-review.googlesource.com/c/go/+/258138
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
Reviewed-by: Paul Murphy <murp@ibm.com>
Reviewed-by: Carlos Eduardo Seo <carlos.seo@gmail.com>
TryBot-Result: Go Bot <gobot@golang.org>
Trust: Lynn Boger <laboger@linux.vnet.ibm.com>
2020-09-28  cmd/asm,cmd/compile,cmd/internal/obj/ppc64: add extswsli support on power9  (Lynn Boger)

This adds support for the extswsli instruction, which combines extsw
followed by a shift.

A new benchmark demonstrates the improvement:

name old time/op new time/op delta
ExtShift 1.34µs ± 0% 1.30µs ± 0% -3.15% (p=0.057 n=4+3)

Change-Id: I21b410676fdf15d20e0cbbaa75d7c6dcd3bbb7b0
Reviewed-on: https://go-review.googlesource.com/c/go/+/257017
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Carlos Eduardo Seo <carlos.seo@gmail.com>
Trust: Lynn Boger <laboger@linux.vnet.ibm.com>
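A hypothetical sketch (not from the CL) of the sign-extend-then-shift
shape this fuses:

    // scale sign-extends a 32-bit value and scales it; on power9 the
    // extsw followed by a constant left shift can now be emitted as a
    // single extswsli.
    func scale(x int32) int64 {
    	return int64(x) << 4
    }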
2020-09-17  cmd/compile: use combined shifts to improve array addressing on ppc64x  (Lynn Boger)

This change adds rules to find pairs of instructions that can be
combined into a single shift. These instruction sequences are common
in array addressing within loops. Improvements can be seen in many
crypto packages and the hash packages. These are based on the extended
mnemonics found in the ISA sections C.8.1 and C.8.2.

Some rules in PPC64.rules were moved because the ordering prevented
some matching. The following results were generated on power9.

hash/crc32:
CRC32/poly=Koopman/size=40/align=0 195ns ± 0% 163ns ± 0% -16.41%
CRC32/poly=Koopman/size=40/align=1 200ns ± 0% 163ns ± 0% -18.50%
CRC32/poly=Koopman/size=512/align=0 1.98µs ± 0% 1.67µs ± 0% -15.46%
CRC32/poly=Koopman/size=512/align=1 1.98µs ± 0% 1.69µs ± 0% -14.80%
CRC32/poly=Koopman/size=1kB/align=0 3.90µs ± 0% 3.31µs ± 0% -15.27%
CRC32/poly=Koopman/size=1kB/align=1 3.85µs ± 0% 3.31µs ± 0% -14.15%
CRC32/poly=Koopman/size=4kB/align=0 15.3µs ± 0% 13.1µs ± 0% -14.22%
CRC32/poly=Koopman/size=4kB/align=1 15.4µs ± 0% 13.1µs ± 0% -14.79%
CRC32/poly=Koopman/size=32kB/align=0 137µs ± 0% 105µs ± 0% -23.56%
CRC32/poly=Koopman/size=32kB/align=1 137µs ± 0% 105µs ± 0% -23.53%

crypto/rc4:
RC4_128 733ns ± 0% 650ns ± 0% -11.32% (p=1.000 n=1+1)
RC4_1K 5.80µs ± 0% 5.17µs ± 0% -10.89% (p=1.000 n=1+1)
RC4_8K 45.7µs ± 0% 40.8µs ± 0% -10.73% (p=1.000 n=1+1)

crypto/sha1:
Hash8Bytes 635ns ± 0% 613ns ± 0% -3.46% (p=1.000 n=1+1)
Hash320Bytes 2.30µs ± 0% 2.18µs ± 0% -5.38% (p=1.000 n=1+1)
Hash1K 5.88µs ± 0% 5.38µs ± 0% -8.62% (p=1.000 n=1+1)
Hash8K 42.0µs ± 0% 37.9µs ± 0% -9.75% (p=1.000 n=1+1)

There are other improvements found in golang.org/x/crypto which are
all in the range of 5-15%.

Change-Id: I193471fbcf674151ffe2edab212799d9b08dfb8c
Reviewed-on: https://go-review.googlesource.com/c/go/+/252097
Trust: Lynn Boger <laboger@linux.vnet.ibm.com>
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Carlos Eduardo Seo <cseo@linux.vnet.ibm.com>
2020-08-29  cmd/compile,runtime: skip zero'ing order array for select statements  (Cuong Manh Le)

The order array was zero-initialized by the compiler, but ends up
being overwritten by the runtime anyway. So let the runtime take full
responsibility for initializing it, saving us one instruction per
select.

Fixes #40399

Change-Id: Iec1eca27ad7180d4fcb3cc9ef97348206b7fe6b8
Reviewed-on: https://go-review.googlesource.com/c/go/+/251517
Run-TryBot: Cuong Manh Le <cuong.manhle.vn@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
2020-08-27  cmd/compile: generate subfic on ppc64  (Paul E. Murphy)

This merges an lis + subf into subfic, and for 32b constants
lwa + subf into oris + ori + subf.

The carry bit is no longer used in code generation, therefore I think
we can clobber it as needed. Note, lowered borrow/carry arithmetic is
self-contained and thus is not affected.

A few extra rules are added to ensure early transformations to
SUBFCconst don't trip up earlier rules, fold constant operations, or
otherwise simplify lowering. Likewise, tests are added to ensure all
rules are hit. Generic constant folding catches trivial cases, however
some lowering rules insert arithmetic which can introduce new
opportunities (e.g. BitLen or Slicemask).

I couldn't find a specific benchmark to demonstrate noteworthy
improvements, but this is generating subfic in many of the default
bent test binaries, so we are at least saving a little code space.

Change-Id: Iad7c6e5767eaa9dc24dc1c989bd1c8cfe1982012
Reviewed-on: https://go-review.googlesource.com/c/go/+/249461
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Carlos Eduardo Seo <cseo@linux.vnet.ibm.com>
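A hedged example (hypothetical name, not from the CL) of the
'constant minus register' shape this targets:

    // fromEnd computes a constant minus a variable; on ppc64 the
    // load-immediate plus subf pair can now become a single subfic.
    func fromEnd(i int64) int64 {
    	return 64 - i
    }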
2020-08-24  cmd/compile: add more generic rewrite rules to reassociate (op (op y C) x|C)  (fanzha02)

With this patch, the opt pass can expose more obvious constant-folding
opportunities.

Example:

    func test(i int) int { return (i+8) - (i+4) }

The previous version:

    MOVD "".i(FP), R0
    ADD $8, R0, R1
    ADD $4, R0, R0
    SUB R0, R1, R0
    MOVD R0, "".~r1+8(FP)
    RET (R30)

The optimized version:

    MOVD $4, R0
    MOVD R0, "".~r1+8(FP)
    RET (R30)

This patch removes some existing reassociation rules, such as
"x+(z-C)", because the current generic rewrite rules will canonicalize
"x-const" to "x+(-const)", making "x+(z-C)" equal to "x+(z+(-C))".

This patch also adds test cases.

Change-Id: I857108ba0b5fcc18a879eeab38e2551bc4277797
Reviewed-on: https://go-review.googlesource.com/c/go/+/237137
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2020-08-22  cmd/compile: optimize unsigned comparisons with 0/1 on wasm  (Agniva De Sarker)

Updates #21439

Change-Id: I0fbcde6e0c2fc368fe686b271670f9d8be4a7900
Reviewed-on: https://go-review.googlesource.com/c/go/+/247557
Run-TryBot: Agniva De Sarker <agniva.quicksilver@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Richard Musiol <neelance@gmail.com>
2020-08-19  cmd/compile: Optimize ARM64's code with EON  (diaxu01)

This patch fuses the pattern '(MVN (XOR x y))' into '(EON x y)'.

Change-Id: I269c98ce198d51a4945ce8bd0e1024acbd1b7609
Reviewed-on: https://go-review.googlesource.com/c/go/+/239638
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
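A small hypothetical example of the fused pattern:

    // eon computes ^(x ^ y) (XNOR); on arm64 the MVN+EOR pair can now
    // be emitted as a single EON instruction.
    func eon(x, y uint64) uint64 {
    	return ^(x ^ y)
    }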
2020-08-18  cmd/compile: combine multiply/add into maddld on ppc64le/power9  (Paul E. Murphy)

Add a new lowering rule to match and replace such instances with the
MADDLD instruction available on power9 where possible. Likewise, this
plumbs in a new ppc64 ssa opcode to house the newly generated MADDLD
instructions.

When testing ed25519, this reduced binary size by 936B. Similarly,
MADDLD combination occurs in a few other less obvious cases such as
division by constant.

Testing of golang.org/x/crypto/ed25519 shows non-trivial speedup
during key generation:

name old time/op new time/op delta
KeyGeneration 65.2µs ± 0% 63.1µs ± 0% -3.19%
Signing 64.3µs ± 0% 64.4µs ± 0% +0.16%
Verification 147µs ± 0% 147µs ± 0% +0.11%

Similarly, this test binary has shrunk by 66488B.

Change-Id: I077aeda7943119b41f07e4e62e44a648f16e4ad0
Reviewed-on: https://go-review.googlesource.com/c/go/+/248723
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
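A hedged sketch (hypothetical name) of the fused multiply-add form:

    // mulAdd computes a*b+c in one expression; on ppc64le/power9 the
    // multiply and add can now be emitted as a single MADDLD.
    func mulAdd(a, b, c int64) int64 {
    	return a*b + c
    }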
2020-08-18  cmd/compile: clean up and optimize s390x multiplication rules  (Michael Munday)

Some of the existing optimizations aren't triggered because they are
handled by the generic rules, so this CL removes them. Also, some
constraints were copied without much thought from the amd64 rules and
they don't make sense on s390x, so we remove those constraints.

Finally, add a 'multiply by the sum of two powers of two'
optimization. This makes sense on s390x as shifts are low latency and
can also sometimes be optimized further (especially if we add support
for RISBG instructions).

name old time/op new time/op delta
IntMulByConst/3-8 1.70ns ±11% 1.10ns ± 5% -35.26% (p=0.000 n=10+10)
IntMulByConst/5-8 1.64ns ± 7% 1.10ns ± 4% -32.94% (p=0.000 n=10+9)
IntMulByConst/12-8 1.65ns ± 6% 1.20ns ± 4% -27.16% (p=0.000 n=10+9)
IntMulByConst/120-8 1.66ns ± 4% 1.22ns ±13% -26.43% (p=0.000 n=10+10)
IntMulByConst/-120-8 1.65ns ± 7% 1.19ns ± 4% -28.06% (p=0.000 n=9+10)
IntMulByConst/65537-8 0.86ns ± 9% 1.12ns ±12% +30.41% (p=0.000 n=10+10)
IntMulByConst/65538-8 1.65ns ± 5% 1.23ns ± 5% -25.11% (p=0.000 n=10+10)

Change-Id: Ib196e6bff1e97febfd266134d0a2b2a62897989f
Reviewed-on: https://go-review.googlesource.com/c/go/+/248937
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
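A hypothetical illustration of the 'sum of two powers of two' case:

    // times12 multiplies by 12 = 8 + 4; on s390x this can now lower
    // to two shifts and an add, (x<<3)+(x<<2), instead of a multiply.
    func times12(x int64) int64 {
    	return x * 12
    }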
2020-08-18  cmd/compile: optimize unsigned comparisons to 0/1 on arm64  (Junchen Li)

For an unsigned integer, it's useful to convert its order test with
0/1 to its equality test with 0. We can save a comparison instruction
followed by a conditional branch on arm64, since it supports
compare-with-zero-and-branch instructions. For example,

    if x > 0 { ... } else { ... }

the original version:

    CMP $0, R0
    BLS 9

the optimized version:

    CBZ R0, 8

Updates #21439

Change-Id: Id1de6f865f6aa72c5d45b29f7894818857288425
Reviewed-on: https://go-review.googlesource.com/c/go/+/246857
Reviewed-by: Keith Randall <khr@golang.org>
2020-08-17  cmd/compile: don't rewrite (CMP (AND x y) 0) to TEST if AND has other uses  (Keith Randall)

If the AND has other uses, we end up saving an argument to the AND in
another register, so we can use it for the TEST. No point in doing
that.

Change-Id: I73444a6aeddd6f55e2328ce04d77c3e6cf4a83e0
Reviewed-on: https://go-review.googlesource.com/c/go/+/241280
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2020-08-17  cmd/compile: optimize unsigned comparisons to 0  (Junchen Li)

There are some architecture-independent rules in #21439, since an
unsigned integer >= 0 is always true and < 0 is always false. This CL
adds these optimizations to the generic rules.

Updates #21439

Change-Id: Iec7e3040b761ecb1e60908f764815fdd9bc62495
Reviewed-on: https://go-review.googlesource.com/c/go/+/246617
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
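A hedged example (hypothetical name) of the always-true/always-false
comparisons the generic rules now fold:

    // alwaysTrue holds for every unsigned x, so the comparison folds
    // to the constant true; likewise x < 0 folds to false.
    func alwaysTrue(x uint64) bool {
    	return x >= 0
    }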
2020-07-27  cmd/compile: add floating point load+op operations to addressing modes pass  (Keith Randall)

They were missed as part of the refactoring to use a separate
addressing modes pass.

Fixes #40426

Change-Id: Ie0418b2fac4ba1ffe720644ac918f6d728d5e420
Reviewed-on: https://go-review.googlesource.com/c/go/+/244859
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2020-06-09  cmd/compile: ARM comparisons with 0 incorrect on overflow  (Xiangdong Ji)

Some ARM rewriting rules convert 'comparing to zero' conditions of if
statements to a simplified version utilizing CMN and CMP instructions
to branch over condition flags, in order to save one Add or Sub
calculation. Such optimizations lead to wrong branching in case an
overflow/underflow occurs when executing CMN or CMP.

Fix the issue by introducing new block opcodes that don't honor the
overflow/underflow flag:

    Block-Op  Meaning                ARM condition codes
    1. LTnoov less than              MI
    2. GEnoov greater than or equal  PL
    3. LEnoov less than or equal     MI || EQ
    4. GTnoov greater than           NEQ & PL

The patch also adds a few test cases to cover scenarios that are
specific to ARM and fine-tunes the code generation tests for
'x-const'.

For more details please refer to the previous fix on 64-bit ARM:
https://go-review.googlesource.com/c/go/+/233097

Go1 perf, 'old' is the non-optimized version, that is, with all
concerned rewriting rules removed.

name old time/op new time/op delta
BinaryTree17-8 7.73s ± 0% 7.81s ± 0% +0.97% (p=0.000 n=7+8)
Fannkuch11-8 7.06s ± 0% 7.00s ± 0% -0.83% (p=0.000 n=8+8)
FmtFprintfEmpty-8 181ns ± 1% 183ns ± 1% +1.31% (p=0.001 n=8+8)
FmtFprintfString-8 319ns ± 1% 325ns ± 2% +1.71% (p=0.009 n=7+8)
FmtFprintfInt-8 358ns ± 1% 359ns ± 1% ~ (p=0.293 n=7+7)
FmtFprintfIntInt-8 459ns ± 3% 456ns ± 1% ~ (p=0.869 n=8+8)
FmtFprintfPrefixedInt-8 535ns ± 4% 538ns ± 4% ~ (p=0.572 n=8+8)
FmtFprintfFloat-8 1.01µs ± 2% 1.01µs ± 2% ~ (p=0.625 n=8+8)
FmtManyArgs-8 1.93µs ± 2% 1.93µs ± 1% ~ (p=0.979 n=8+7)
GobDecode-8 16.1ms ± 1% 16.5ms ± 1% +2.32% (p=0.000 n=8+8)
GobEncode-8 15.9ms ± 0% 15.8ms ± 1% -1.00% (p=0.000 n=8+7)
Gzip-8 690ms ± 1% 670ms ± 0% -2.90% (p=0.000 n=8+8)
Gunzip-8 109ms ± 1% 109ms ± 1% ~ (p=0.694 n=7+8)
HTTPClientServer-8 149µs ± 3% 146µs ± 2% -1.70% (p=0.028 n=8+8)
JSONEncode-8 50.5ms ± 1% 49.2ms ± 0% -2.60% (p=0.001 n=7+7)
JSONDecode-8 135ms ± 2% 137ms ± 1% ~ (p=0.054 n=8+7)
Mandelbrot200-8 951ms ± 0% 952ms ± 0% ~ (p=0.852 n=6+8)
GoParse-8 9.47ms ± 1% 9.66ms ± 1% +2.01% (p=0.000 n=8+8)
RegexpMatchEasy0_32-8 288ns ± 2% 277ns ± 2% -3.61% (p=0.000 n=8+8)
RegexpMatchEasy0_1K-8 1.66µs ± 1% 1.69µs ± 2% +2.21% (p=0.001 n=7+7)
RegexpMatchEasy1_32-8 334ns ± 1% 305ns ± 2% -8.86% (p=0.000 n=8+8)
RegexpMatchEasy1_1K-8 2.14µs ± 2% 2.15µs ± 0% ~ (p=0.099 n=8+8)
RegexpMatchMedium_32-8 13.3ns ± 1% 13.3ns ± 0% ~ (p=1.000 n=7+7)
RegexpMatchMedium_1K-8 81.1µs ± 3% 80.7µs ± 1% ~ (p=0.955 n=7+8)
RegexpMatchHard_32-8 4.26µs ± 0% 4.26µs ± 0% ~ (p=0.933 n=7+8)
RegexpMatchHard_1K-8 124µs ± 0% 124µs ± 0% +0.31% (p=0.000 n=8+8)
Revcomp-8 14.7ms ± 2% 14.5ms ± 1% -1.66% (p=0.003 n=8+8)
Template-8 197ms ± 2% 200ms ± 3% +1.62% (p=0.021 n=8+8)
TimeParse-8 1.33µs ± 1% 1.30µs ± 1% -1.86% (p=0.002 n=8+8)
TimeFormat-8 3.04µs ± 1% 3.02µs ± 0% -0.60% (p=0.000 n=8+8)

name old speed new speed delta
GobDecode-8 47.6MB/s ± 1% 46.5MB/s ± 1% -2.28% (p=0.000 n=8+8)
GobEncode-8 48.1MB/s ± 0% 48.6MB/s ± 1% +1.02% (p=0.000 n=8+7)
Gzip-8 28.1MB/s ± 1% 29.0MB/s ± 0% +2.97% (p=0.000 n=8+8)
Gunzip-8 178MB/s ± 1% 179MB/s ± 2% ~ (p=0.694 n=7+8)
JSONEncode-8 38.4MB/s ± 1% 39.4MB/s ± 0% +2.67% (p=0.001 n=7+7)
JSONDecode-8 14.3MB/s ± 2% 14.2MB/s ± 1% -0.81% (p=0.043 n=8+7)
GoParse-8 6.12MB/s ± 1% 5.99MB/s ± 1% -2.00% (p=0.000 n=8+8)
RegexpMatchEasy0_32-8 111MB/s ± 2% 115MB/s ± 2% +3.77% (p=0.000 n=8+8)
RegexpMatchEasy0_1K-8 618MB/s ± 1% 604MB/s ± 2% -2.16% (p=0.001 n=7+7)
RegexpMatchEasy1_32-8 95.7MB/s ± 1% 105.1MB/s ± 2% +9.76% (p=0.000 n=8+8)
RegexpMatchEasy1_1K-8 479MB/s ± 2% 477MB/s ± 0% ~ (p=0.105 n=8+8)
RegexpMatchMedium_32-8 75.2MB/s ± 1% 75.2MB/s ± 0% ~ (p=0.247 n=7+7)
RegexpMatchMedium_1K-8 12.6MB/s ± 3% 12.7MB/s ± 1% ~ (p=0.538 n=7+8)
RegexpMatchHard_32-8 7.52MB/s ± 0% 7.52MB/s ± 0% ~ (p=0.968 n=7+8)
RegexpMatchHard_1K-8 8.26MB/s ± 0% 8.24MB/s ± 0% -0.30% (p=0.001 n=8+8)
Revcomp-8 173MB/s ± 2% 176MB/s ± 1% +1.68% (p=0.003 n=8+8)
Template-8 9.85MB/s ± 2% 9.69MB/s ± 3% -1.59% (p=0.021 n=8+8)

Fixes #39303
Updates #38740

Change-Id: I0a5f87bfda679f66414c0041ace2ca2e28363f36
Reviewed-on: https://go-review.googlesource.com/c/go/+/236637
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2020-05-29  cmd/compile: fix incorrect rewriting to if condition  (Xiangdong Ji)

Some ARM64 rewriting rules convert 'comparing to zero' conditions of
if statements to a simplified version utilizing CMN and CMP
instructions to branch over condition flags, in order to save one Add
or Sub calculation. Such optimizations lead to wrong branching in case
an overflow/underflow occurs when executing CMN or CMP.

Fix the issue by introducing new block opcodes that don't honor the
overflow/underflow flag, in the following categories:

    Block-Op  Meaning                ARM condition codes
    1. LTnoov less than              MI
    2. GEnoov greater than or equal  PL
    3. LEnoov less than or equal     MI || EQ
    4. GTnoov greater than           NEQ & PL

The backend generates two consecutive branch instructions for 'LEnoov'
and 'GTnoov' to model their expected behavior. A slight change to 'gc'
and the amd64/386 backends is made to unify the code generation.

Add a test 'TestCondRewrite' as justification; it covers 32 incorrect
rules identified on arm64, and more might be needed on other arches,
like 32-bit arm.

Add two benchmarks profiling the aforementioned categories 1&2 and
categories 3&4 separately; we expect the first two categories to show
performance improvement, and the second to show no visible regression
compared with the non-optimized version.

This change also updates TestFormats to support using %#x.

Examples exhibiting where the issue comes from:

1: 'if x + 3 < 0' might be converted to:

    before:
        CMN $3, R0
        BGE <else branch> // wrong branch is taken if 'x+3' overflows
    after:
        CMN $3, R0
        BPL <else branch>

2: 'if y - 3 > 0' might be converted to:

    before:
        CMP $3, R0
        BLE <else branch> // wrong branch is taken if 'y-3' underflows
    after:
        CMP $3, R0
        BMI <else branch>
        BEQ <else branch>

Benchmark data from different kinds of arm64 servers, 'old' is the
non-optimized version (not the parent commit); generally the optimized
version outperforms.

S1:
name old time/op new time/op delta
CondRewrite/SoloJump 13.6ns ± 0% 12.9ns ± 0% -5.15% (p=0.000 n=10+10)
CondRewrite/CombJump 13.8ns ± 1% 12.9ns ± 0% -6.32% (p=0.000 n=10+10)

S2:
name old time/op new time/op delta
CondRewrite/SoloJump 11.6ns ± 0% 10.9ns ± 0% -6.03% (p=0.000 n=10+10)
CondRewrite/CombJump 11.4ns ± 0% 10.8ns ± 1% -5.53% (p=0.000 n=10+10)

S3:
name old time/op new time/op delta
CondRewrite/SoloJump 7.36ns ± 0% 7.50ns ± 0% +1.79% (p=0.000 n=9+10)
CondRewrite/CombJump 7.35ns ± 0% 7.75ns ± 0% +5.51% (p=0.000 n=8+9)

S4:
name old time/op new time/op delta
CondRewrite/SoloJump-224 11.5ns ± 1% 10.9ns ± 0% -4.97% (p=0.000 n=10+10)
CondRewrite/CombJump-224 11.9ns ± 0% 11.5ns ± 0% -2.95% (p=0.000 n=10+10)

S5:
name old time/op new time/op delta
CondRewrite/SoloJump 10.0ns ± 0% 10.0ns ± 0% -0.45% (p=0.000 n=9+10)
CondRewrite/CombJump 9.93ns ± 0% 9.77ns ± 0% -1.53% (p=0.000 n=10+9)

Go1 perf. data:

name old time/op new time/op delta
BinaryTree17 6.29s ± 1% 6.30s ± 1% ~ (p=1.000 n=5+5)
Fannkuch11 5.40s ± 0% 5.40s ± 0% ~ (p=0.841 n=5+5)
FmtFprintfEmpty 97.9ns ± 0% 98.9ns ± 3% ~ (p=0.937 n=4+5)
FmtFprintfString 171ns ± 3% 171ns ± 2% ~ (p=0.754 n=5+5)
FmtFprintfInt 212ns ± 0% 217ns ± 6% +2.55% (p=0.008 n=5+5)
FmtFprintfIntInt 296ns ± 1% 297ns ± 2% ~ (p=0.516 n=5+5)
FmtFprintfPrefixedInt 371ns ± 2% 374ns ± 7% ~ (p=1.000 n=5+5)
FmtFprintfFloat 435ns ± 1% 439ns ± 2% ~ (p=0.056 n=5+5)
FmtManyArgs 1.37µs ± 1% 1.36µs ± 1% ~ (p=0.730 n=5+5)
GobDecode 14.6ms ± 4% 14.4ms ± 4% ~ (p=0.690 n=5+5)
GobEncode 11.8ms ±20% 11.6ms ±15% ~ (p=1.000 n=5+5)
Gzip 507ms ± 0% 491ms ± 0% -3.22% (p=0.008 n=5+5)
Gunzip 73.8ms ± 0% 73.9ms ± 0% ~ (p=0.690 n=5+5)
HTTPClientServer 116µs ± 0% 116µs ± 0% ~ (p=0.686 n=4+4)
JSONEncode 21.8ms ± 1% 21.6ms ± 2% ~ (p=0.151 n=5+5)
JSONDecode 104ms ± 1% 103ms ± 1% -1.08% (p=0.016 n=5+5)
Mandelbrot200 9.53ms ± 0% 9.53ms ± 0% ~ (p=0.421 n=5+5)
GoParse 7.55ms ± 1% 7.51ms ± 1% ~ (p=0.151 n=5+5)
RegexpMatchEasy0_32 158ns ± 0% 158ns ± 0% ~ (all equal)
RegexpMatchEasy0_1K 606ns ± 1% 608ns ± 3% ~ (p=0.937 n=5+5)
RegexpMatchEasy1_32 143ns ± 0% 144ns ± 1% ~ (p=0.095 n=5+4)
RegexpMatchEasy1_1K 927ns ± 2% 944ns ± 2% ~ (p=0.056 n=5+5)
RegexpMatchMedium_32 16.0ns ± 0% 16.0ns ± 0% ~ (all equal)
RegexpMatchMedium_1K 69.3µs ± 2% 69.7µs ± 0% ~ (p=0.690 n=5+5)
RegexpMatchHard_32 3.73µs ± 0% 3.73µs ± 1% ~ (p=0.984 n=5+5)
RegexpMatchHard_1K 111µs ± 1% 110µs ± 0% ~ (p=0.151 n=5+5)
Revcomp 1.91s ±47% 1.77s ±68% ~ (p=1.000 n=5+5)
Template 138ms ± 1% 138ms ± 1% ~ (p=1.000 n=5+5)
TimeParse 787ns ± 2% 785ns ± 1% ~ (p=0.540 n=5+5)
TimeFormat 729ns ± 1% 726ns ± 1% ~ (p=0.151 n=5+5)

Updates #38740

Change-Id: I06c604874acdc1e63e66452dadee5df053045222
Reviewed-on: https://go-review.googlesource.com/c/go/+/233097
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
2020-05-11  cmd/compile: in prove, zero right shifts of positive int by #bits - 1  (Keith Randall)

Taking over Zach's CL 212277. Just cleaned up and added a test.

For a positive, signed integer, an arithmetic right shift of count
(bit-width - 1) equals zero, e.g. int64(22) >> 63 -> 0. This CL makes
prove replace these right shifts with a zero-valued constant.

These shifts may arise in source code explicitly, but can also be
created by the generic rewrite of signed division by a power of 2:

    // Signed divide by power of 2.
    // n / c =       n >> log(c) if n >= 0
    //       = (n+c-1) >> log(c) if n < 0
    // We conditionally add c-1 by adding n>>63>>(64-log(c))
    // (first shift signed, second shift unsigned).
    (Div64 <t> n (Const64 [c])) && isPowerOfTwo(c) ->
      (Rsh64x64
        (Add64 <t> n
          (Rsh64Ux64 <t>
            (Rsh64x64 <t> n (Const64 <typ.UInt64> [63]))
            (Const64 <typ.UInt64> [64-log2(c)])))
        (Const64 <typ.UInt64> [log2(c)]))

If n is known to be positive, this rewrite includes an extra Add and 2
extra Rsh. This CL will allow prove to replace one of the extra Rsh
with a 0. That replacement then allows lateopt to remove all the
unnecessary fixups from the generic rewrite.

There is a rewrite rule to handle this case directly:

    (Div64 n (Const64 [c])) && isNonNegative(n) && isPowerOfTwo(c) ->
      (Rsh64Ux64 n (Const64 <typ.UInt64> [log2(c)]))

But this implementation of isNonNegative really only handles constants
and a few special operations like len/cap. The division could be
handled if the factsTable version of isNonNegative were available.
Unfortunately, the first opt pass happens before prove even has a
chance to deduce the numerator is non-negative, so the generic rewrite
has already fired and created the extra Ops discussed above.

Fixes #36159

By Printf count, this zeroes 137 right shifts when building std and
cmd.

Change-Id: Iab486910ac9d7cfb86ace2835456002732b384a2
Reviewed-on: https://go-review.googlesource.com/c/go/+/232857
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Taking over Zach's CL 212277. Just cleaned up and added a test. For a positive, signed integer, an arithmetic right shift of count (bit-width - 1) equals zero. e.g. int64(22) >> 63 -> 0. This CL makes prove replace these right shifts with a zero-valued constant. These shifts may arise in source code explicitly, but can also be created by the generic rewrite of signed division by a power of 2. // Signed divide by power of 2. // n / c = n >> log(c) if n >= 0 // = (n+c-1) >> log(c) if n < 0 // We conditionally add c-1 by adding n>>63>>(64-log(c)) (first shift signed, second shift unsigned). (Div64 <t> n (Const64 [c])) && isPowerOfTwo(c) -> (Rsh64x64 (Add64 <t> n (Rsh64Ux64 <t> (Rsh64x64 <t> n (Const64 <typ.UInt64> [63])) (Const64 <typ.UInt64> [64-log2(c)]))) (Const64 <typ.UInt64> [log2(c)])) If n is known to be positive, this rewrite includes an extra Add and 2 extra Rsh. This CL will allow prove to replace one of the extra Rsh with a 0. That replacement then allows lateopt to remove all the unneccesary fixups from the generic rewrite. There is a rewrite rule to handle this case directly: (Div64 n (Const64 [c])) && isNonNegative(n) && isPowerOfTwo(c) -> (Rsh64Ux64 n (Const64 <typ.UInt64> [log2(c)])) But this implementation of isNonNegative really only handles constants and a few special operations like len/cap. The division could be handled if the factsTable version of isNonNegative were available. Unfortunately, the first opt pass happens before prove even has a chance to deduce the numerator is non-negative, so the generic rewrite has already fired and created the extra Ops discussed above. Fixes #36159 By Printf count, this zeroes 137 right shifts when building std and cmd. Change-Id: Iab486910ac9d7cfb86ace2835456002732b384a2 Reviewed-on: https://go-review.googlesource.com/c/go/+/232857 Run-TryBot: Keith Randall <khr@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com>