path: root/src/cmd/compile/internal/ssa/opGen.go
2021-08-17  cmd/compile: lowered MulUintptr on riscv64  (Meng Zhuo)

    According to the RISC-V instruction set manual v2.2 Sec 6.1, MULHU
    followed by MUL will be fused into one multiply by the
    microarchitecture.

    name              old time/op  new time/op  delta
    MulUintptr/small  11.2ns ±24%  9.2ns ± 0%   -17.54%  (p=0.000 n=10+9)
    MulUintptr/large  15.9ns ± 0%  10.9ns ± 0%  -31.55%  (p=0.000 n=8+8)

    Change-Id: I3d152218f83948cbc5c576bda29dc86e9b4206ee
    Reviewed-on: https://go-review.googlesource.com/c/go/+/338753
    Trust: Meng Zhuo <mzh@golangcn.org>
    Reviewed-by: Joel Sing <joel@sing.id.au>

2021-08-16  cmd/compile: intrinsify Mul64 on riscv64  (Meng Zhuo)

    According to the RISC-V instruction set manual v2.2 Sec 6.1, MULHU
    followed by MUL will be fused into one multiply by the
    microarchitecture.

    Benchstat on HiFive Unmatched:

    name          old time/op  new time/op  delta
    Hash8Bytes    245ns ± 3%   186ns ± 4%   -23.99%  (p=0.000 n=10+10)
    Hash320Bytes  1.94µs ± 1%  1.31µs ± 1%  -32.38%  (p=0.000 n=9+10)
    Hash1K        5.84µs ± 0%  3.84µs ± 0%  -34.20%  (p=0.000 n=10+9)
    Hash8K        45.3µs ± 0%  29.4µs ± 0%  -35.04%  (p=0.000 n=10+10)

    name          old speed      new speed      delta
    Hash8Bytes    32.7MB/s ± 3%  43.0MB/s ± 4%  +31.61%  (p=0.000 n=10+10)
    Hash320Bytes  165MB/s ± 1%   244MB/s ± 1%   +47.88%  (p=0.000 n=9+10)
    Hash1K        175MB/s ± 0%   266MB/s ± 0%   +51.98%  (p=0.000 n=10+9)
    Hash8K        181MB/s ± 0%   279MB/s ± 0%   +53.94%  (p=0.000 n=10+10)

    Change-Id: I3561495d02a4a0ad8578e9b9819bf0a4eaca5d12
    Reviewed-on: https://go-review.googlesource.com/c/go/+/329970
    Reviewed-by: Joel Sing <joel@sing.id.au>
    Run-TryBot: Joel Sing <joel@sing.id.au>
    TryBot-Result: Go Bot <gobot@golang.org>
    Trust: Meng Zhuo <mzh@golangcn.org>

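For context, the shape of Go code that hits this intrinsic; a minimal sketch
(ours, not the CL's), with math/bits.Mul64 being the function that now
lowers to the MULHU/MUL pair on riscv64:

    package main

    import (
        "fmt"
        "math/bits"
    )

    // mul64 returns the 128-bit product of a and b. After this CL the
    // riscv64 backend emits MULHU (high half) and MUL (low half), which
    // the hardware can fuse into a single multiply.
    func mul64(a, b uint64) (hi, lo uint64) {
        return bits.Mul64(a, b)
    }

    func main() {
        hi, lo := mul64(1<<63, 3)
        fmt.Println(hi, lo) // 1 9223372036854775808
    }
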
2021-06-01  [dev.typeparams] cmd/compile: update ARM64 CALL* ops for register ABI  (Cherry Mui)

    Now they take a variable number of args.

    Change-Id: I49c8bce9c3a403947eac03e397ae264a8f4fdd2c
    Reviewed-on: https://go-review.googlesource.com/c/go/+/323929
    Trust: Cherry Mui <cherryyz@google.com>
    Run-TryBot: Cherry Mui <cherryyz@google.com>
    TryBot-Result: Go Bot <gobot@golang.org>
    Reviewed-by: David Chase <drchase@google.com>

2021-05-26  [dev.typeparams] cmd/compile: define ARM64 parameter registers  (Cherry Mui)

    Define the registers. They are not really enabled for now; otherwise
    the compiler would start using them for go:registerparams functions,
    and that is not fully working yet, so some tests would fail.

    Now we can compile a simple Add function with registerparams (with
    registers enabled).

    Change-Id: Ifdfac931052c0196096a1dd8b0687b5fdedb14d5
    Reviewed-on: https://go-review.googlesource.com/c/go/+/322850
    Trust: Cherry Mui <cherryyz@google.com>
    Reviewed-by: David Chase <drchase@google.com>
    Reviewed-by: Than McIntosh <thanm@google.com>

2021-05-08  cmd/compile: remove bit operations that modify memory directly  (Keith Randall)

    These operations (BT{S,R,C}{Q,L}modify) are quite a bit slower than
    other ways of doing the same thing. Without the BTxmodify operations,
    there are two fallback ways the compiler performs these operations:
    AND/OR/XOR operations directly on memory, or load-BTx-write
    sequences. The compiler chooses one somewhat arbitrarily depending on
    rewrite rule application order. Currently, it uses load-BTx-write for
    the Const benchmarks and AND/OR/XOR directly to memory for the
    non-Const benchmarks. TBD: someone might investigate which of the two
    fallback strategies is really better. For now, they are both better
    than BTx ops.

    name              old time/op  new time/op  delta
    BitSet-8          1.09µs ± 2%  0.64µs ± 5%  -41.60%  (p=0.000 n=9+10)
    BitClear-8        1.15µs ± 3%  0.68µs ± 6%  -41.00%  (p=0.000 n=10+10)
    BitToggle-8       1.18µs ± 4%  0.73µs ± 2%  -38.36%  (p=0.000 n=10+8)
    BitSetConst-8     37.0ns ± 7%  25.8ns ± 2%  -30.24%  (p=0.000 n=10+10)
    BitClearConst-8   30.7ns ± 2%  25.0ns ±12%  -18.46%  (p=0.000 n=10+10)
    BitToggleConst-8  36.9ns ± 1%  23.8ns ± 3%  -35.46%  (p=0.000 n=9+10)

    Fixes #45790
    Update #45242

    Change-Id: Ie33a72dc139f261af82db15d446cd0855afb4e59
    Reviewed-on: https://go-review.googlesource.com/c/go/+/318149
    Trust: Keith Randall <khr@golang.org>
    Run-TryBot: Keith Randall <khr@golang.org>
    TryBot-Result: Go Bot <gobot@golang.org>
    Reviewed-by: Ben Shi <powerman1st@163.com>

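For reference, a small illustrative sketch (our names, not the benchmark's)
of the bit operations involved; with the BTxmodify ops removed, bodies like
these lower to AND/OR/XOR on memory or to load-BTx-store sequences:

    package bitset

    // bitset is an illustrative []uint64-backed bit set.
    type bitset []uint64

    func (b bitset) set(i uint)    { b[i/64] |= 1 << (i % 64) }  // BTS-style
    func (b bitset) clear(i uint)  { b[i/64] &^= 1 << (i % 64) } // BTR-style
    func (b bitset) toggle(i uint) { b[i/64] ^= 1 << (i % 64) }  // BTC-style
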
2021-04-28  cmd/compile: mark R12 clobbered for special calls  (Cherry Zhang)

    In external linking mode the external linker may insert trampolines,
    which use R12 as a scratch register. So a call could potentially
    clobber R12 if the target is laid out too far away. Mark R12
    clobbered. We will also use R12 for trampolines in the Go linker.

    CL 310731 updated the generated rewrite files so imports are grouped,
    but the generator was not updated to do so. Grouped imports are nice,
    but as those are generated files, for simplicity (and my laziness)
    just regenerate with the current generator (which makes imports not
    grouped).

    Change-Id: Iddb741ff7314a291ade5fbffc7d315f555808409
    Reviewed-on: https://go-review.googlesource.com/c/go/+/314453
    Trust: Cherry Zhang <cherryyz@google.com>
    Run-TryBot: Cherry Zhang <cherryyz@google.com>
    TryBot-Result: Go Bot <gobot@golang.org>
    Reviewed-by: Than McIntosh <thanm@google.com>

2021-04-21  cmd/compile: allow conversion from slice to array ptr  (Josh Bleecher Snyder)

    Panic if the slice is too short.

    Updates #395

    Change-Id: I90f4bff2da5d8f3148ba06d2482084f32b25c29a
    Reviewed-on: https://go-review.googlesource.com/c/go/+/301650
    Trust: Josh Bleecher Snyder <josharian@gmail.com>
    Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
    Reviewed-by: Matthew Dempsky <mdempsky@google.com>

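The new conversion in use; a minimal example (ours) showing both the
successful case and the too-short case that panics:

    package main

    import "fmt"

    func main() {
        s := []int{1, 2, 3, 4}
        p := (*[4]int)(s) // ok: len(s) >= 4
        fmt.Println(p[3]) // 4

        defer func() { fmt.Println("recovered:", recover()) }()
        _ = (*[8]int)(s) // panics: len(s) < 8
    }
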
2021-03-25  cmd/compile: mark R16, R17 clobbered for non-standard calls on ARM64  (Cherry Zhang)

    On ARM64, (external) linker-generated trampolines may clobber R16 and
    R17. In CL 183842 we changed Duff's devices not to use those
    registers. However, this is not enough. The register allocator also
    needs to know that these registers may be clobbered in any calls that
    don't follow the standard Go calling convention. This includes Duff's
    devices and the write barrier.

    Fixes #32773, second attempt.

    Change-Id: Ia52a891d9bbb8515c927617dd53aee5af5bd9aa4
    Reviewed-on: https://go-review.googlesource.com/c/go/+/184437
    Run-TryBot: Cherry Zhang <cherryyz@google.com>
    TryBot-Result: Go Bot <gobot@golang.org>
    Reviewed-by: Meng Zhuo <mzh@golangcn.org>
    Reviewed-by: Keith Randall <khr@golang.org>
    Trust: Meng Zhuo <mzh@golangcn.org>

2021-03-23  cmd/compile: optimize codes with arm64 REV16 instruction  (fanzha02)

    Optimize some patterns into the rev16/rev16w instructions.

    Pattern 1:
        (c & 0xff00ff00)>>8 | (c & 0x00ff00ff)<<8
    To:
        rev16w c

    Pattern 2:
        (c & 0xff00ff00ff00ff00)>>8 | (c & 0x00ff00ff00ff00ff)<<8
    To:
        rev16 c

    This patch is a copy of CL 239637, contributed by Alice Xu
    (dianhong.xu@arm.com).

    Change-Id: I96936c1db87618bc1903c04221c7e9b2779455b3
    Reviewed-on: https://go-review.googlesource.com/c/go/+/268377
    Trust: fannie zhang <Fannie.Zhang@arm.com>
    Run-TryBot: fannie zhang <Fannie.Zhang@arm.com>
    TryBot-Result: Go Bot <gobot@golang.org>
    Reviewed-by: Cherry Zhang <cherryyz@google.com>

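Written as Go functions (names ours), the two patterns from the message
look like this and should now each compile to a single instruction on
arm64:

    package rev

    // compiles to: rev16w c
    func swap16w(c uint32) uint32 {
        return (c&0xff00ff00)>>8 | (c&0x00ff00ff)<<8
    }

    // compiles to: rev16 c
    func swap16(c uint64) uint64 {
        return (c&0xff00ff00ff00ff00)>>8 | (c&0x00ff00ff00ff00ff)<<8
    }
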
2021-03-19  cmd/compile: add clobberdeadreg mode  (Cherry Zhang)

    When the -clobberdeadreg flag is set, the compiler inserts code that
    clobbers integer registers at call sites. This may be helpful for
    debugging the register ABI.

    Only implemented on AMD64 for now.

    Change-Id: Ia203d3f891c30fd95d0103489056fe01d63a2899
    Reviewed-on: https://go-review.googlesource.com/c/go/+/302809
    Trust: Cherry Zhang <cherryyz@google.com>
    Run-TryBot: Cherry Zhang <cherryyz@google.com>
    TryBot-Result: Go Bot <gobot@golang.org>
    Reviewed-by: David Chase <drchase@google.com>

2021-03-18  cmd/compile: use a single const MOV operand for riscv64  (Joel Sing)

    Most platforms only use a single MOV const operand. Remove the
    MOV{B,H,W}const operands from riscv64 and consistently use MOVDconst
    instead. The implementation of all four is the same, and there is no
    benefit gained from having multiple const operands (in fact, it
    requires a lot more rewrite rules).

    Change-Id: I0ba7d7554e371a1de762ef5f3745e9c0c30d41ac
    Reviewed-on: https://go-review.googlesource.com/c/go/+/302610
    Trust: Joel Sing <joel@sing.id.au>
    Reviewed-by: Cherry Zhang <cherryyz@google.com>
    Reviewed-by: Michael Munday <mike.munday@lowrisc.org>
    Run-TryBot: Cherry Zhang <cherryyz@google.com>
    TryBot-Result: Go Bot <gobot@golang.org>

2021-03-18  cmd/compile: add rewrite rules for conditional instructions on arm64  (fanzha02)

    This CL adds rewrite rules for CSETM, CSINC, CSINV, and CSNEG. By
    adding these rules, we can save one instruction. For example,

        func test(cond bool, a int) int {
            if cond {
                a++
            }
            return a
        }

    Before:

        MOVD  "".a+8(RSP), R0
        ADD   $1, R0, R1
        MOVBU "".cond(RSP), R2
        CMPW  $0, R2
        CSEL  NE, R1, R0, R0

    After:

        MOVBU "".cond(RSP), R0
        CMPW  $0, R0
        MOVD  "".a+8(RSP), R0
        CSINC EQ, R0, R0, R0

    This patch is a copy of CL 285694.

    Co-authored-by: JunchenLi <junchen.li@arm.com>
    Change-Id: Ic1a79e8b8ece409b533becfcb7950f11e7b76f24
    Reviewed-on: https://go-review.googlesource.com/c/go/+/302231
    Trust: fannie zhang <Fannie.Zhang@arm.com>
    Run-TryBot: fannie zhang <Fannie.Zhang@arm.com>
    TryBot-Result: Go Bot <gobot@golang.org>
    Reviewed-by: Keith Randall <khr@golang.org>

2021-03-12  cmd/compile: test register ABI for method, interface, closure calls  (David Chase)

    This is enabled with a ridiculous magic name for the method, or for
    the last input type passed, that needs to be changed to something
    unutterable before actual release.

    Ridiculous method name: MagicMethodNameForTestingRegisterABI
    Ridiculous last (input) type name: MagicLastTypeNameForTestingRegisterABI

    RLTN is tested with strings.Contains, so you can have
    MagicLastTypeNameForTestingRegisterABI1 and
    MagicLastTypeNameForTestingRegisterABI2 if that is helpful.

    Includes test test/abi/fibish2.go.

    Updates #44816.

    Change-Id: I592a6edc71ca9bebdd1d00e24edee1ceebb3e43f
    Reviewed-on: https://go-review.googlesource.com/c/go/+/299410
    Trust: David Chase <drchase@google.com>
    Run-TryBot: David Chase <drchase@google.com>
    TryBot-Result: Go Bot <gobot@golang.org>
    Reviewed-by: Cherry Zhang <cherryyz@google.com>

2021-03-11  cmd/compile: optimize multi-register shifts on amd64  (Josh Bleecher Snyder)

    amd64 can shift in bits from another register instead of filling with
    0/1. This pattern is helpful when implementing 128-bit shifts or
    arbitrary length shifts. In the standard library, it shows up in pure
    Go math/big.

    Benchmark results on amd64 with -tags=math_big_pure_go:

    name                          old time/op  new time/op  delta
    NonZeroShifts/1/shrVU-8       4.45ns ± 3%  4.39ns ± 1%  -1.28%   (p=0.000 n=30+27)
    NonZeroShifts/1/shlVU-8       4.13ns ± 4%  4.10ns ± 2%  ~        (p=0.254 n=29+28)
    NonZeroShifts/2/shrVU-8       5.55ns ± 1%  5.63ns ± 2%  +1.42%   (p=0.000 n=28+29)
    NonZeroShifts/2/shlVU-8       5.70ns ± 2%  5.14ns ± 1%  -9.82%   (p=0.000 n=29+28)
    NonZeroShifts/3/shrVU-8       6.79ns ± 2%  6.35ns ± 2%  -6.46%   (p=0.000 n=28+29)
    NonZeroShifts/3/shlVU-8       6.69ns ± 1%  6.25ns ± 1%  -6.60%   (p=0.000 n=28+27)
    NonZeroShifts/4/shrVU-8       7.79ns ± 2%  7.06ns ± 2%  -9.48%   (p=0.000 n=30+30)
    NonZeroShifts/4/shlVU-8       7.82ns ± 1%  7.24ns ± 1%  -7.37%   (p=0.000 n=28+29)
    NonZeroShifts/5/shrVU-8       8.90ns ± 3%  7.93ns ± 1%  -10.84%  (p=0.000 n=29+26)
    NonZeroShifts/5/shlVU-8       8.68ns ± 1%  7.92ns ± 1%  -8.76%   (p=0.000 n=29+29)
    NonZeroShifts/10/shrVU-8      14.4ns ± 1%  12.3ns ± 2%  -14.79%  (p=0.000 n=28+29)
    NonZeroShifts/10/shlVU-8      14.1ns ± 1%  11.9ns ± 2%  -15.55%  (p=0.000 n=28+27)
    NonZeroShifts/100/shrVU-8     118ns ± 1%   96ns ± 3%    -18.82%  (p=0.000 n=30+29)
    NonZeroShifts/100/shlVU-8     120ns ± 2%   98ns ± 2%    -18.46%  (p=0.000 n=29+28)
    NonZeroShifts/1000/shrVU-8    1.10µs ± 1%  0.88µs ± 2%  -19.63%  (p=0.000 n=29+30)
    NonZeroShifts/1000/shlVU-8    1.10µs ± 2%  0.88µs ± 2%  -20.28%  (p=0.000 n=29+28)
    NonZeroShifts/10000/shrVU-8   10.9µs ± 1%  8.7µs ± 1%   -19.78%  (p=0.000 n=28+27)
    NonZeroShifts/10000/shlVU-8   10.9µs ± 2%  8.7µs ± 1%   -19.64%  (p=0.000 n=29+27)
    NonZeroShifts/100000/shrVU-8  111µs ± 2%   90µs ± 2%    -19.39%  (p=0.000 n=28+29)
    NonZeroShifts/100000/shlVU-8  113µs ± 2%   90µs ± 2%    -20.43%  (p=0.000 n=30+27)

    The assembly version is still faster, unfortunately, but the gap is
    narrowing. Speedup from pure Go to assembly:

    name                          old time/op  new time/op  delta
    NonZeroShifts/1/shrVU-8       4.39ns ± 1%  3.45ns ± 2%  -21.36%  (p=0.000 n=27+29)
    NonZeroShifts/1/shlVU-8       4.10ns ± 2%  3.47ns ± 3%  -15.42%  (p=0.000 n=28+30)
    NonZeroShifts/2/shrVU-8       5.63ns ± 2%  3.97ns ± 0%  -29.40%  (p=0.000 n=29+25)
    NonZeroShifts/2/shlVU-8       5.14ns ± 1%  3.77ns ± 2%  -26.65%  (p=0.000 n=28+26)
    NonZeroShifts/3/shrVU-8       6.35ns ± 2%  4.79ns ± 2%  -24.52%  (p=0.000 n=29+29)
    NonZeroShifts/3/shlVU-8       6.25ns ± 1%  4.42ns ± 1%  -29.29%  (p=0.000 n=27+26)
    NonZeroShifts/4/shrVU-8       7.06ns ± 2%  5.64ns ± 1%  -20.05%  (p=0.000 n=30+29)
    NonZeroShifts/4/shlVU-8       7.24ns ± 1%  5.34ns ± 2%  -26.23%  (p=0.000 n=29+29)
    NonZeroShifts/5/shrVU-8       7.93ns ± 1%  6.56ns ± 2%  -17.26%  (p=0.000 n=26+30)
    NonZeroShifts/5/shlVU-8       7.92ns ± 1%  6.27ns ± 1%  -20.79%  (p=0.000 n=29+25)
    NonZeroShifts/10/shrVU-8      12.3ns ± 2%  10.2ns ± 2%  -17.21%  (p=0.000 n=29+29)
    NonZeroShifts/10/shlVU-8      11.9ns ± 2%  10.5ns ± 2%  -12.45%  (p=0.000 n=27+29)
    NonZeroShifts/100/shrVU-8     95.9ns ± 3%  77.7ns ± 1%  -19.00%  (p=0.000 n=29+30)
    NonZeroShifts/100/shlVU-8     97.5ns ± 2%  66.8ns ± 2%  -31.47%  (p=0.000 n=28+30)
    NonZeroShifts/1000/shrVU-8    884ns ± 2%   705ns ± 1%   -20.17%  (p=0.000 n=30+28)
    NonZeroShifts/1000/shlVU-8    880ns ± 2%   590ns ± 1%   -32.96%  (p=0.000 n=28+25)
    NonZeroShifts/10000/shrVU-8   8.74µs ± 1%  7.34µs ± 3%  -15.94%  (p=0.000 n=27+30)
    NonZeroShifts/10000/shlVU-8   8.73µs ± 1%  6.00µs ± 1%  -31.25%  (p=0.000 n=27+28)
    NonZeroShifts/100000/shrVU-8  89.6µs ± 2%  75.5µs ± 2%  -15.80%  (p=0.000 n=29+29)
    NonZeroShifts/100000/shlVU-8  89.6µs ± 2%  68.0µs ± 3%  -24.09%  (p=0.000 n=27+30)

    Change-Id: I18f58d8f5513d737d9cdf09b8f9d14011ffe3958
    Reviewed-on: https://go-review.googlesource.com/c/go/+/297050
    Trust: Josh Bleecher Snyder <josharian@gmail.com>
    Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
    TryBot-Result: Go Bot <gobot@golang.org>
    Reviewed-by: Keith Randall <khr@golang.org>

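A sketch (ours, not math/big's actual shlVU) of the double-word shift shape
that benefits: the vacated bits of one word are filled from its neighbor,
which is exactly what the new multi-register shift ops provide in a single
instruction:

    package shift

    // shl128 shifts the 128-bit value hi:lo left by s bits, assuming
    // 0 < s < 64. The low bits of hi are filled from lo rather than
    // with zeros.
    func shl128(hi, lo uint64, s uint) (uint64, uint64) {
        return hi<<s | lo>>(64-s), lo << s
    }
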
2021-03-04  cmd/compile: pass register parameters to called function  (David Chase)

    Still needs morestack; still needs results; lots of corner cases also
    not dealt with.

    For #40724.

    Change-Id: I03abdf1e8363d75c52969560b427e488a48cd37a
    Reviewed-on: https://go-review.googlesource.com/c/go/+/293889
    Trust: David Chase <drchase@google.com>
    Run-TryBot: David Chase <drchase@google.com>
    TryBot-Result: Go Bot <gobot@golang.org>
    Reviewed-by: Cherry Zhang <cherryyz@google.com>
    Reviewed-by: Jeremy Faller <jeremy@golang.org>

2021-03-03  cmd/compile: make modified Aux type for OpArgXXXX pass ssa/check  (David Chase)

    For #40724.

    Change-Id: I7d1e76139d187cd15a6e0df9d19542b7200589f6
    Reviewed-on: https://go-review.googlesource.com/c/go/+/297911
    Trust: David Chase <drchase@google.com>
    Reviewed-by: Cherry Zhang <cherryyz@google.com>

2021-03-03  cmd/compile: intrinsify runtime/internal/atomic.{And,Or}{8,} on RISCV64  (Joel Sing)

    The 32 bit versions are easily implemented with a single instruction,
    while the 8 bit versions require a bit more effort but use the same
    atomic instructions via rewrite rules.

    Change-Id: I42e8d457b239c8f75e39a8e282fc88c1bb292a99
    Reviewed-on: https://go-review.googlesource.com/c/go/+/268098
    Trust: Joel Sing <joel@sing.id.au>
    Run-TryBot: Joel Sing <joel@sing.id.au>
    TryBot-Result: Go Bot <gobot@golang.org>
    Reviewed-by: Cherry Zhang <cherryyz@google.com>

2021-03-02  cmd/compile: optimize single-precision floating point square root  (fanzha02)

    Add a generic rule to rewrite the single-precision square root
    expression with one single-precision instruction. The optimization
    removes the two precision conversions between double precision and
    single precision.

    On the arm64 platform, previously:

        FCVTSD F0, F0
        FSQRTD F0, F0
        FCVTDS F0, F0

    optimized:

        FSQRTS S0, S0

    This patch also adds a test case to check the correctness.

    This patch refers to CL 241877, contributed by Alice Xu
    (dianhong.xu@arm.com).

    Change-Id: I6de5d02281c693017ac4bd4c10963dd55989bd7e
    Reviewed-on: https://go-review.googlesource.com/c/go/+/276873
    Trust: fannie zhang <Fannie.Zhang@arm.com>
    Run-TryBot: fannie zhang <Fannie.Zhang@arm.com>
    TryBot-Result: Go Bot <gobot@golang.org>
    Reviewed-by: Keith Randall <khr@golang.org>

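The source-level shape the new generic rule matches; a minimal example
(ours):

    package main

    import (
        "fmt"
        "math"
    )

    // sqrt32 spells a single-precision square root via math.Sqrt; after
    // this CL it compiles to a single FSQRTS on arm64 with no float64
    // round trip.
    func sqrt32(x float32) float32 {
        return float32(math.Sqrt(float64(x)))
    }

    func main() {
        fmt.Println(sqrt32(2)) // 1.4142135
    }
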
2021-02-26  cmd/compile: change StaticCall to return a "Results"  (David Chase)

    StaticLECall (multiple value in +mem, multiple value result +mem) ->
    StaticCall (multiple register value in +mem, multiple
    register-sized-value result +mem) ->
    ARCH CallStatic (multiple register value in +mem, multiple
    register-sized-value result +mem)

    But the architecture-dependent stuff is indifferent to whether it is
    mem->mem or (mem)->(mem) until Prog generation. Deal with OpSelectN
    -> Prog in ssagen/ssa.go, and others as they appear.

    For #40724.

    Change-Id: I1d0436f6371054f1881862641d8e7e418e4a6a16
    Reviewed-on: https://go-review.googlesource.com/c/go/+/293391
    Trust: David Chase <drchase@google.com>
    Run-TryBot: David Chase <drchase@google.com>
    TryBot-Result: Go Bot <gobot@golang.org>
    Reviewed-by: Cherry Zhang <cherryyz@google.com>

2021-02-25  cmd/compile: automate resultInArg0 register checks  (Josh Bleecher Snyder)

    No functional changes; passes toolstash-check. No measurable
    performance changes.

    Change-Id: I2629f73d4a3cc56d80f512f33cf57cf41d8f15d3
    Reviewed-on: https://go-review.googlesource.com/c/go/+/296010
    Trust: Josh Bleecher Snyder <josharian@gmail.com>
    Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
    TryBot-Result: Go Bot <gobot@golang.org>
    Reviewed-by: Keith Randall <khr@golang.org>

2021-02-24  cmd/compile: ARM64 optimize []float64 and []float32 access  (Egon Elbre)

    Optimize load and store to []float64 and []float32. Previously it
    used LSL instead of a shifted-register indexed load/store.

    Before:

        LSL   $3, R0, R0
        FMOVD F0, (R1)(R0)

    After:

        FMOVD F0, (R1)(R0<<3)

    Fixes #42798

    Change-Id: I0c0912140c3dce5aa6abc27097c0eb93833cc589
    Reviewed-on: https://go-review.googlesource.com/c/go/+/273706
    Reviewed-by: Cherry Zhang <cherryyz@google.com>
    Run-TryBot: Cherry Zhang <cherryyz@google.com>
    TryBot-Result: Go Bot <gobot@golang.org>
    Trust: Giovanni Bajo <rasky@develer.com>

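An illustrative function (ours) whose indexed access now uses the
shifted-register addressing mode directly:

    package access

    // elem loads xs[i]; after this CL the load uses a form like
    // FMOVD (R1)(R0<<3) instead of a separate LSL followed by an
    // unshifted indexed load.
    func elem(xs []float64, i int) float64 {
        return xs[i]
    }
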
2021-02-23  cmd/compile: add AMD64 parameter register defs, Arg ops, plumb to ssa.Config  (David Chase)

    This is partial plumbing recycled from the original register ABI test
    work; these are the parts that translate easily. Some other bits are
    deferred till later when they are ready to be used.

    For #40724.

    Change-Id: Ica8c55a4526793446189725a2bc3839124feb38f
    Reviewed-on: https://go-review.googlesource.com/c/go/+/260539
    Trust: David Chase <drchase@google.com>
    Reviewed-by: Cherry Zhang <cherryyz@google.com>

2021-02-23  cmd/compile: remove backend's "scratch mem" support  (Cuong Manh Le)

    This CL rebases CL 273987 on top of master with @mdempsky's
    permission.

    The last (only?) use for this feature was 387 support, which was
    removed in golang.org/cl/258957.

    Change-Id: I4f79fee8d0c336c9b6082bcd5eb6ece52c032dc0
    Reviewed-on: https://go-review.googlesource.com/c/go/+/292893
    Trust: Cuong Manh Le <cuong.manhle.vn@gmail.com>
    Run-TryBot: Cuong Manh Le <cuong.manhle.vn@gmail.com>
    TryBot-Result: Go Bot <gobot@golang.org>
    Reviewed-by: Matthew Dempsky <mdempsky@google.com>

2021-02-08  [dev.regabi] cmd/compile, runtime: reserve R14 as g registers on AMD64  (Cherry Zhang)

    This is a proof-of-concept change for using the g register on AMD64.
    getg is now lowered to R14 in the new ABI. The g register is not yet
    used in all places where it can be used (e.g. stack bounds check,
    runtime assembly code).

    Change-Id: I10123ddf38e31782cf58bafcdff170aee0ff0d1b
    Reviewed-on: https://go-review.googlesource.com/c/go/+/289196
    Trust: Cherry Zhang <cherryyz@google.com>
    Run-TryBot: Cherry Zhang <cherryyz@google.com>
    TryBot-Result: Go Bot <gobot@golang.org>
    Reviewed-by: Than McIntosh <thanm@google.com>
    Reviewed-by: David Chase <drchase@google.com>

2021-02-03  [dev.regabi] cmd/compile: reserve X15 as zero register on AMD64  (Cherry Zhang)

    In ABIInternal, reserve X15 as constant zero, and use it to zero
    memory. (Maybe there can be more uses of it?)

    The register is zeroed on transition to ABIInternal from ABI0.

    Caveat: using X15 generates longer instructions than using X0. Maybe
    we want to use X0?

    Change-Id: I12d5ee92a01fc0b59dad4e5ab023ac71bc2a8b7d
    Reviewed-on: https://go-review.googlesource.com/c/go/+/288093
    Trust: Cherry Zhang <cherryyz@google.com>
    Run-TryBot: Cherry Zhang <cherryyz@google.com>
    TryBot-Result: Go Bot <gobot@golang.org>
    Reviewed-by: David Chase <drchase@google.com>

2021-01-14  cmd/compile: fix wrong complement for arm64 floating-point comparisons  (Junchen Li)

    Consider the following example,

        func test(a, b float64, x uint64) uint64 {
            if a < b {
                x = 0
            }
            return x
        }

        func main() {
            fmt.Println(test(1, math.NaN(), 123))
        }

    The output is 0, but the expectation is 123. This is because the
    rewrite rule

        (CSEL [cc] (MOVDconst [0]) y flag) => (CSEL0 [arm64Negate(cc)] y flag)

    converts

        FCMP NaN, 1
        CSEL MI, 0, 123, R0 // if 1 < NaN then R0 = 0 else R0 = 123

    to

        FCMP NaN, 1
        CSEL GE, 123, 0, R0 // if 1 >= NaN then R0 = 123 else R0 = 0

    But both 1 < NaN and 1 >= NaN are false. So the output is 0, not 123.

    The root cause is that arm64Negate does not handle the negation of
    floating comparisons correctly. According to the ARM manual, the
    meanings of MI, GE, and PL are

        MI: Less than
        GE: Greater than or equal to
        PL: Greater than, equal to, or unordered

    Because NaN cannot be compared with other numbers, the result of such
    a comparison is unordered. So when NaN is involved, unlike integers,
    the result of !(a < b) is not a >= b; it is
    a >= b || a is NaN || b is NaN. This is exactly what PL means. We add
    NotLessThanF to represent PL. Then the negation of LessThanF is
    NotLessThanF rather than GreaterEqualF. The same reasoning applies to
    the other floating comparison operations.

    Fixes #43619

    Change-Id: Ia511b0027ad067436bace9fbfd261dbeaae01bcd
    Reviewed-on: https://go-review.googlesource.com/c/go/+/283572
    Reviewed-by: Cherry Zhang <cherryyz@google.com>
    Run-TryBot: Cherry Zhang <cherryyz@google.com>
    TryBot-Result: Go Bot <gobot@golang.org>
    Trust: Keith Randall <khr@golang.org>

2020-12-14  cmd/compile: fix incorrect shift count type with s390x rules  (Ruixin Bao)

    The type of a shift count must be an unsigned integer. Some s390x
    rules for shifts had their auxint typed as int8. This resulted in a
    compilation failure on s390x with an invalid operation when running
    make.bash using older versions of go (e.g. go1.10.4).

    This CL adds an auxint type of uint8 and changes the ops for shift
    and rotate to use auxint with type uint8. The related rules are also
    modified to address this change.

    Fixes #43090

    Change-Id: I594274b6e3d9b23092fc9e9f4b354870164f2f19
    Reviewed-on: https://go-review.googlesource.com/c/go/+/277078
    Reviewed-by: Keith Randall <khr@golang.org>
    Trust: Dmitri Shuralyov <dmitshur@golang.org>

2020-11-06  cmd/compile: optimize shift pairs and masks on s390x  (Michael Munday)

    Optimize combinations of left and right shifts by a constant value
    into a 'rotate then insert selected bits [into zero]' instruction.
    Use the same instruction for contiguous masks since it has some
    benefits over 'and immediate' (not restricted to 32 bits, does not
    overwrite the source register).

    To keep the complexity of this change under control I've only
    implemented 64 bit operations for now.

    There are a lot more optimizations that can be done with this
    instruction family. However, since their function overlaps with other
    instructions we need to be somewhat careful not to break existing
    optimization rules by creating optimization dead ends. This is
    particularly true of the load/store merging rules which contain lots
    of zero extensions and shifts.

    This CL does interfere with the store merging rules when an operand
    is shifted left before it is stored:

        binary.BigEndian.PutUint64(b, x << 1)

    This is unfortunate but it's not critical and somewhat complex, so I
    plan to fix that in a follow-up CL.

    file       before     after      Δ        %
    addr2line  4117446    4117282    -164     -0.004%
    api        4945184    4942752    -2432    -0.049%
    asm        4998079    4991891    -6188    -0.124%
    buildid    2685158    2684074    -1084    -0.040%
    cgo        4553732    4553394    -338     -0.007%
    compile    19294446   19245070   -49376   -0.256%
    cover      4897105    4891319    -5786    -0.118%
    dist       3544389    3542785    -1604    -0.045%
    doc        3926795    3927617    +822     +0.021%
    fix        3302958    3293868    -9090    -0.275%
    link       6546274    6543456    -2818    -0.043%
    nm         4102021    4100825    -1196    -0.029%
    objdump    4542431    4548483    +6052    +0.133%
    pack       2482465    2416389    -66076   -2.662%
    pprof      13366541   13363915   -2626    -0.020%
    test2json  2829007    2761515    -67492   -2.386%
    trace      10216164   10219684   +3520    +0.034%
    vet        6773956    6773572    -384     -0.006%
    total      107124151  106917891  -206260  -0.193%

    Change-Id: I7591cce41e06867ba10a745daae9333513062746
    Reviewed-on: https://go-review.googlesource.com/c/go/+/233317
    Run-TryBot: Michael Munday <mike.munday@ibm.com>
    TryBot-Result: Go Bot <gobot@golang.org>
    Reviewed-by: Keith Randall <khr@golang.org>
    Trust: Michael Munday <mike.munday@ibm.com>

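A left/right shift pair of the kind described; an illustrative Go function
(ours), where the two constant shifts are a candidate for a single 'rotate
then insert selected bits into zero' instruction:

    package extract

    // byte6 returns bits 48-55 of x via a constant left shift followed
    // by a constant right shift.
    func byte6(x uint64) uint64 {
        return (x << 8) >> 56
    }
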
2020-11-05  cmd/compile: improve atomic swap intrinsics on arm64  (Jonathan Swinney)

    ARMv8.1 has added new instructions for atomic memory operations. This
    change builds on the previous change which added support for atomic
    add, 0a7ac93c27c9ade79fe0f66ae0bb81484c241ae5, to include similar
    support for atomic-compare-and-swap, atomic-swap, atomic-or, and
    atomic-and intrinsics. Since the new instructions are not guaranteed
    to be present, we guard their usages with a branch on a CPU feature.

    Performance on an ARMv8.1 machine:

    name                 old time/op  new time/op  delta
    CompareAndSwap-16    37.9ns ±16%  24.1ns ± 4%  -36.44%  (p=0.000 n=10+9)
    CompareAndSwap64-16  38.6ns ±15%  24.1ns ± 3%  -37.47%  (p=0.000 n=10+10)

    name       old time/op  new time/op  delta
    Swap-16    46.9ns ±32%  12.5ns ± 6%  -73.40%  (p=0.000 n=10+10)
    Swap64-16  53.4ns ± 1%  12.5ns ± 6%  -76.56%  (p=0.000 n=10+10)

    name            old time/op  new time/op  delta
    Or8-16          8.81ns ± 0%  5.61ns ± 0%  -36.32%  (p=0.000 n=10+10)
    Or-16           7.21ns ± 0%  5.61ns ± 0%  -22.19%  (p=0.000 n=10+10)
    Or8Parallel-16  59.8ns ± 3%  12.5ns ± 2%  -79.10%  (p=0.000 n=10+10)
    OrParallel-16   51.7ns ± 3%  12.5ns ± 2%  -75.84%  (p=0.000 n=10+10)

    name             old time/op  new time/op  delta
    And8-16          8.81ns ± 0%  5.61ns ± 0%  -36.32%  (p=0.000 n=10+10)
    And-16           7.21ns ± 0%  5.61ns ± 0%  -22.19%  (p=0.000 n=10+10)
    And8Parallel-16  59.1ns ± 6%  12.8ns ± 3%  -78.33%  (p=0.000 n=10+10)
    AndParallel-16   51.4ns ± 7%  12.8ns ± 3%  -75.03%  (p=0.000 n=10+10)

    Performance on an ARMv8.0 machine (no atomics instructions):

    name                 old time/op  new time/op  delta
    CompareAndSwap-16    61.3ns ± 0%  62.4ns ± 0%  +1.70%  (p=0.000 n=8+9)
    CompareAndSwap64-16  62.0ns ± 3%  61.3ns ± 2%  ~       (p=0.093 n=10+10)

    name       old time/op  new time/op  delta
    Swap-16    127ns ± 2%   131ns ± 2%   +2.91%  (p=0.001 n=10+10)
    Swap64-16  128ns ± 1%   131ns ± 2%   +2.43%  (p=0.001 n=10+10)

    name            old time/op  new time/op  delta
    Or8-16          14.9ns ± 0%  15.3ns ± 0%  +2.68%  (p=0.000 n=10+10)
    Or-16           11.8ns ± 0%  12.3ns ± 0%  +4.24%  (p=0.000 n=10+10)
    Or8Parallel-16  137ns ± 1%   144ns ± 1%   +4.97%  (p=0.000 n=10+10)
    OrParallel-16   128ns ± 1%   136ns ± 1%   +6.34%  (p=0.000 n=10+10)

    name             old time/op  new time/op  delta
    And8-16          14.9ns ± 0%  15.3ns ± 0%  +2.68%  (p=0.000 n=10+10)
    And-16           11.8ns ± 0%  12.3ns ± 0%  +4.24%  (p=0.000 n=10+10)
    And8Parallel-16  134ns ± 2%   141ns ± 1%   +5.29%  (p=0.000 n=10+10)
    AndParallel-16   125ns ± 2%   134ns ± 1%   +7.10%  (p=0.000 n=10+10)

    Fixes #39304

    Change-Id: Idaca68701d4751650be6b4bedca3d57f51571712
    Reviewed-on: https://go-review.googlesource.com/c/go/+/234217
    Run-TryBot: Emmanuel Odeke <emmanuel@orijtech.com>
    TryBot-Result: Go Bot <gobot@golang.org>
    Reviewed-by: Cherry Zhang <cherryyz@google.com>
    Trust: fannie zhang <Fannie.Zhang@arm.com>

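The runtime intrinsics themselves are internal, but sync/atomic goes
through the same lowering; a minimal example (ours) of the swap and
compare-and-swap operations that now use the ARMv8.1 instructions behind a
CPU-feature branch:

    package main

    import (
        "fmt"
        "sync/atomic"
    )

    func main() {
        var v uint64
        old := atomic.SwapUint64(&v, 1)             // atomic swap
        ok := atomic.CompareAndSwapUint64(&v, 1, 2) // atomic CAS
        fmt.Println(old, ok, v)                     // 0 true 2
    }
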
2020-11-02  cmd/compile: using new calls, optimize runtime.memequal(x,constant,1)  (David Chase)

    Proof of concept; also an actual optimization that fires 180 times in
    the Go source base.

    Change-Id: I5cb87474be764264cde6e4cbcb471ef109306f08
    Reviewed-on: https://go-review.googlesource.com/c/go/+/248404
    Trust: David Chase <drchase@google.com>
    Run-TryBot: David Chase <drchase@google.com>
    TryBot-Result: Go Bot <gobot@golang.org>
    Reviewed-by: Cherry Zhang <cherryyz@google.com>

2020-10-28  cmd/compile,cmd/internal/obj/riscv,runtime: use Duff's devices on riscv64  (Michał Derkacz)

    Implement runtime.duffzero and runtime.duffcopy for riscv64. Use
    obj.ADUFFZERO/obj.ADUFFCOPY for medium-size, word-aligned
    zeroing/moving.

    Change-Id: I42ec622055630c94cb77e286d8d33dbe7c9f846c
    Reviewed-on: https://go-review.googlesource.com/c/go/+/237797
    Run-TryBot: Cherry Zhang <cherryyz@google.com>
    Reviewed-by: Joel Sing <joel@sing.id.au>
    Reviewed-by: Cherry Zhang <cherryyz@google.com>

2020-10-27  cmd/compile: combine more 32 bit shift and mask operations on ppc64  (Paul E. Murphy)

    Combine (AND m (SRWconst x)) or (SRWconst (AND m x)) when the mask m
    and the shift value produce a constant which can be encoded into an
    RLWINM instruction.

    Combine (CLRLSLDI (SRWconst x)) if combining the underlying rotate
    masks produces a constant which can be encoded into RLWINM. Likewise
    for (SLDconst (SRWconst x)) and (CLRLSLDI (RLWINM x)).

    Combine rotate word + and operations which can be encoded as a single
    RLWINM/RLWNM instruction.

    The most notable performance improvements arise from the crypto
    benchmarks below (GOARCH=power8 on a ppc64le/linux):

    pkg:golang.org/x/crypto/blowfish goos:linux goarch:ppc64le
    ExpandKeyWithSalt  52.2µs ± 0%  47.5µs ± 0%  -8.88%
    ExpandKey          44.4µs ± 0%  40.3µs ± 0%  -9.15%

    pkg:golang.org/x/crypto/ssh/internal/bcrypt_pbkdf goos:linux goarch:ppc64le
    Key  57.6ms ± 0%  52.3ms ± 0%  -9.13%

    pkg:golang.org/x/crypto/bcrypt goos:linux goarch:ppc64le
    Equal        90.9ms ± 0%  82.6ms ± 0%  -9.13%
    DefaultCost  91.0ms ± 0%  82.7ms ± 0%  -9.12%

    Change-Id: I59a0ca29face38f4ab46e37124c32906f216c4ce
    Reviewed-on: https://go-review.googlesource.com/c/go/+/260798
    Run-TryBot: Carlos Eduardo Seo <carlos.seo@linaro.com>
    TryBot-Result: Go Bot <gobot@golang.org>
    Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
    Reviewed-by: Carlos Eduardo Seo <carlos.seo@linaro.com>
    Trust: Lynn Boger <laboger@linux.vnet.ibm.com>

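An illustrative shift-and-mask shape (function ours) of the kind these
rules now encode as a single RLWINM:

    package field

    // low5 extracts a 5-bit field starting at bit 3: a right shift by a
    // constant followed by a contiguous mask, which together fit one
    // RLWINM encoding.
    func low5(x uint32) uint32 {
        return (x >> 3) & 0x1f
    }
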
2020-10-27  cmd/compile: eliminate unnecessary sign/zero extension for riscv64  (Joel Sing)

    Add additional rules to eliminate unnecessary sign/zero extension for
    riscv64. Also, where possible, replace an extension following a load
    with a differently typed load. This removes almost another 8,000
    instructions from the go binary.

    Of particular note, change Eq16/Eq8/Neq16/Neq8 to zero extend each
    value before subtraction, rather than zero extending after
    subtraction. While this appears to double the number of zero
    extensions, it often lets us completely eliminate them, as the load
    can already be performed in a properly typed manner.

    As an example, prior to this change runtime.memequal16 was:

        0000000000013028 <runtime.memequal16>:
        13028: 00813183  ld   gp,8(sp)
        1302c: 00019183  lh   gp,0(gp)
        13030: 01013283  ld   t0,16(sp)
        13034: 00029283  lh   t0,0(t0)
        13038: 405181b3  sub  gp,gp,t0
        1303c: 03019193  slli gp,gp,0x30
        13040: 0301d193  srli gp,gp,0x30
        13044: 0011b193  seqz gp,gp
        13048: 00310c23  sb   gp,24(sp)
        1304c: 00008067  ret

    Whereas it now becomes:

        0000000000012fa8 <runtime.memequal16>:
        12fa8: 00813183  ld   gp,8(sp)
        12fac: 0001d183  lhu  gp,0(gp)
        12fb0: 01013283  ld   t0,16(sp)
        12fb4: 0002d283  lhu  t0,0(t0)
        12fb8: 405181b3  sub  gp,gp,t0
        12fbc: 0011b193  seqz gp,gp
        12fc0: 00310c23  sb   gp,24(sp)
        12fc4: 00008067  ret

    Change-Id: I16321feb18381241cab121c0097a126104c56c2c
    Reviewed-on: https://go-review.googlesource.com/c/go/+/264659
    Trust: Joel Sing <joel@sing.id.au>
    Run-TryBot: Joel Sing <joel@sing.id.au>
    TryBot-Result: Go Bot <gobot@golang.org>
    Reviewed-by: Cherry Zhang <cherryyz@google.com>

2020-10-27  cmd/compile: use MOV pseudo-instructions for sign/zero extension  (Joel Sing)

    Rather than handling sign and zero extension via rules, defer to the
    assembler and use MOV pseudo-instructions. The instruction can also
    be omitted where the type and size are already correct.

    This change results in more than 6,000 instructions being removed
    from the go binary (in part due to omitted instructions, in part due
    to MOVBU having a more efficient implementation in the assembler than
    what is used in the current ZeroExt8to{16,32,64} rules). It will also
    allow for further rewriting to remove redundant sign/zero extension.

    Change-Id: I05e42fd9f09f40a69948be7de772cce8946c8744
    Reviewed-on: https://go-review.googlesource.com/c/go/+/264658
    Trust: Joel Sing <joel@sing.id.au>
    Reviewed-by: Cherry Zhang <cherryyz@google.com>

2020-10-23  cmd/compile: intrinsify runtime/internal/atomic.{And,Or} on S390X  (Michael Pratt)

    This is a simplification of LANfloor/LAOfloor, since we have a whole
    word.

    Change-Id: I791641fb4068cad3f73660ce51699ed4653ae0e6
    Reviewed-on: https://go-review.googlesource.com/c/go/+/263151
    Run-TryBot: Michael Pratt <mpratt@google.com>
    TryBot-Result: Go Bot <gobot@golang.org>
    Trust: Michael Pratt <mpratt@google.com>
    Reviewed-by: Cherry Zhang <cherryyz@google.com>

2020-10-23  cmd/compile: intrinsify runtime/internal/atomic.{And,Or} on PPC64  (Michael Pratt)

    This is a simple case of changing the operand size of the existing
    8-bit And/Or. I've also updated a few operand descriptions that were
    out of sync with the implementation.

    Change-Id: I95ac4445d08f7958768aec9a233698a2d652a39a
    Reviewed-on: https://go-review.googlesource.com/c/go/+/263150
    Run-TryBot: Michael Pratt <mpratt@google.com>
    TryBot-Result: Go Bot <gobot@golang.org>
    Trust: Michael Pratt <mpratt@google.com>
    Reviewed-by: Cherry Zhang <cherryyz@google.com>

2020-10-23  cmd/compile: intrinsify runtime/internal/atomic.{And,Or} on ARM64  (Michael Pratt)

    These are identical to And8 and Or8, just using LDAXRW/STLXRW instead
    of LDAXRB/STLXRB.

    Change-Id: I5308832ae165064550bee4bb245809ab952f4cc8
    Reviewed-on: https://go-review.googlesource.com/c/go/+/263148
    Run-TryBot: Michael Pratt <mpratt@google.com>
    TryBot-Result: Go Bot <gobot@golang.org>
    Trust: Michael Pratt <mpratt@google.com>
    Reviewed-by: Cherry Zhang <cherryyz@google.com>

2020-10-23  cmd/compile: intrinsify runtime/internal/atomic.{And,Or} on AMD64  (Michael Pratt)

    These are identical to And8 and Or8, just using ANDL/ORL instead of
    ANDB/ORB.

    Change-Id: I99cf90a8b0b5f211fb23325dddd55821875f0c8f
    Reviewed-on: https://go-review.googlesource.com/c/go/+/263140
    Run-TryBot: Michael Pratt <mpratt@google.com>
    TryBot-Result: Go Bot <gobot@golang.org>
    Trust: Michael Pratt <mpratt@google.com>
    Reviewed-by: Cherry Zhang <cherryyz@google.com>

2020-10-21  cmd/compile,cmd/internal/obj/riscv: move g register on riscv64  (Joel Sing)

    The original riscv64 port used the thread pointer (TP aka X4)
    register for the g pointer; however, this register is also used when
    TLS support is required, resulting in a conflict (for example, when a
    signal is received we have no way of readily knowing if X4 contains a
    pointer to the TCB or a pointer to a g).

    In order to support cgo, free up the X4 register by moving g to X27.
    This unfortunately means that the X4 register is unused in non-cgo
    mode, but the alternative is to not support cgo on this platform.

    Update #36641

    Change-Id: Idcaf3e8ccbe42972a1b8943aeefde7149d9c960a
    Reviewed-on: https://go-review.googlesource.com/c/go/+/263477
    Trust: Joel Sing <joel@sing.id.au>
    Reviewed-by: Cherry Zhang <cherryyz@google.com>

2020-10-21  cmd/compiler,cmd/go,sync: add internal {LoadAcq,StoreRel}64 on ppc64  (Paul E. Murphy)

    Add an internal atomic intrinsic for load with acquire semantics
    (extending LoadAcq to 64 bits) and add LoadAcquintptr for internal
    use within the sync package. For other arches, this remaps to the
    appropriate atomic.Load{,64} intrinsic, which should not alter code
    generation.

    Similarly, add StoreRel{uintptr,64} for consistency, and inline.

    Finally, add an exception to allow sync to directly use the
    runtime/internal/atomic package, which avoids more convoluted
    workarounds (contributed by Lynn Boger).

    In an extreme example, sync.(*Pool).pin consumes 20% of wall time
    during fmt tests. This is reduced to 5% on ppc64le/power9.

    From the fmt benchmarks on ppc64le:

    name                           old time/op  new time/op  delta
    SprintfPadding                 468ns ± 0%   451ns ± 0%   -3.63%
    SprintfEmpty                   73.3ns ± 0%  51.9ns ± 0%  -29.20%
    SprintfString                  135ns ± 0%   122ns ± 0%   -9.63%
    SprintfTruncateString          232ns ± 0%   214ns ± 0%   -7.76%
    SprintfTruncateBytes           216ns ± 0%   202ns ± 0%   -6.48%
    SprintfSlowParsingPath         162ns ± 0%   142ns ± 0%   -12.35%
    SprintfQuoteString             1.00µs ± 0%  0.99µs ± 0%  -1.39%
    SprintfInt                     117ns ± 0%   104ns ± 0%   -11.11%
    SprintfIntInt                  190ns ± 0%   175ns ± 0%   -7.89%
    SprintfPrefixedInt             232ns ± 0%   212ns ± 0%   -8.62%
    SprintfFloat                   270ns ± 0%   255ns ± 0%   -5.56%
    SprintfComplex                 1.01µs ± 0%  0.99µs ± 0%  -1.68%
    SprintfBoolean                 127ns ± 0%   111ns ± 0%   -12.60%
    SprintfHexString               220ns ± 0%   198ns ± 0%   -10.00%
    SprintfHexBytes                261ns ± 0%   252ns ± 0%   -3.45%
    SprintfBytes                   600ns ± 0%   590ns ± 0%   -1.67%
    SprintfStringer                684ns ± 0%   658ns ± 0%   -3.80%
    SprintfStructure               2.57µs ± 0%  2.57µs ± 0%  -0.12%
    ManyArgs                       669ns ± 0%   646ns ± 0%   -3.44%
    FprintInt                      140ns ± 0%   136ns ± 0%   -2.86%
    FprintfBytes                   184ns ± 0%   181ns ± 0%   -1.63%
    FprintIntNoAlloc               140ns ± 0%   136ns ± 0%   -2.86%
    ScanInts                       929µs ± 0%   921µs ± 0%   -0.79%
    ScanRecursiveInt               122ms ± 0%   121ms ± 0%   -0.11%
    ScanRecursiveIntReaderWrapper  122ms ± 0%   122ms ± 0%   -0.18%

    Change-Id: I4d66780261b57b06ef600229e475462e7313f0d6
    Reviewed-on: https://go-review.googlesource.com/c/go/+/253748
    Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
    Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
    Reviewed-by: Keith Randall <khr@golang.org>
    Trust: Lynn Boger <laboger@linux.vnet.ibm.com>
    TryBot-Result: Go Bot <gobot@golang.org>

2020-10-06  cmd/compile,cmd/internal/obj/ppc64: use mulli where possible  (Lynn Boger)

    This adds support to allow the use of mulli when one of the multiply
    operands is a constant that fits in 16 bits. This especially helps in
    the case where this instruction appears in a loop, since the load of
    the constant is not being moved out of the loop.

    Some improvements seen in compress/flate on power9:

    Decode/Digits/Huffman/1e4      259µs ± 0%   261µs ± 0%   +0.57%  (p=1.000 n=1+1)
    Decode/Digits/Huffman/1e5      2.43ms ± 0%  2.45ms ± 0%  +0.79%  (p=1.000 n=1+1)
    Decode/Digits/Huffman/1e6      23.9ms ± 0%  24.2ms ± 0%  +0.86%  (p=1.000 n=1+1)
    Decode/Digits/Speed/1e4        278µs ± 0%   279µs ± 0%   +0.34%  (p=1.000 n=1+1)
    Decode/Digits/Speed/1e5        2.80ms ± 0%  2.81ms ± 0%  +0.29%  (p=1.000 n=1+1)
    Decode/Digits/Speed/1e6        28.0ms ± 0%  28.1ms ± 0%  +0.28%  (p=1.000 n=1+1)
    Decode/Digits/Default/1e4      278µs ± 0%   278µs ± 0%   +0.28%  (p=1.000 n=1+1)
    Decode/Digits/Default/1e5      2.68ms ± 0%  2.69ms ± 0%  +0.19%  (p=1.000 n=1+1)
    Decode/Digits/Default/1e6      26.6ms ± 0%  26.6ms ± 0%  +0.21%  (p=1.000 n=1+1)
    Decode/Digits/Compression/1e4  278µs ± 0%   278µs ± 0%   +0.00%  (p=1.000 n=1+1)
    Decode/Digits/Compression/1e5  2.68ms ± 0%  2.69ms ± 0%  +0.21%  (p=1.000 n=1+1)
    Decode/Digits/Compression/1e6  26.6ms ± 0%  26.6ms ± 0%  +0.07%  (p=1.000 n=1+1)
    Decode/Newton/Huffman/1e4      322µs ± 0%   312µs ± 0%   -2.84%  (p=1.000 n=1+1)
    Decode/Newton/Huffman/1e5      3.11ms ± 0%  2.91ms ± 0%  -6.41%  (p=1.000 n=1+1)
    Decode/Newton/Huffman/1e6      31.4ms ± 0%  29.3ms ± 0%  -6.85%  (p=1.000 n=1+1)
    Decode/Newton/Speed/1e4        282µs ± 0%   269µs ± 0%   -4.69%  (p=1.000 n=1+1)
    Decode/Newton/Speed/1e5        2.29ms ± 0%  2.20ms ± 0%  -4.13%  (p=1.000 n=1+1)
    Decode/Newton/Speed/1e6        22.7ms ± 0%  21.3ms ± 0%  -6.06%  (p=1.000 n=1+1)
    Decode/Newton/Default/1e4      254µs ± 0%   237µs ± 0%   -6.60%  (p=1.000 n=1+1)
    Decode/Newton/Default/1e5      1.86ms ± 0%  1.75ms ± 0%  -5.99%  (p=1.000 n=1+1)
    Decode/Newton/Default/1e6      18.1ms ± 0%  17.4ms ± 0%  -4.10%  (p=1.000 n=1+1)
    Decode/Newton/Compression/1e4  254µs ± 0%   244µs ± 0%   -3.91%  (p=1.000 n=1+1)
    Decode/Newton/Compression/1e5  1.85ms ± 0%  1.79ms ± 0%  -3.10%  (p=1.000 n=1+1)
    Decode/Newton/Compression/1e6  18.0ms ± 0%  17.3ms ± 0%  -3.88%  (p=1.000 n=1+1)

    Change-Id: I840320fab1c4bf64c76b001c2651ab79f23df4eb
    Reviewed-on: https://go-review.googlesource.com/c/go/+/259444
    Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
    TryBot-Result: Go Bot <gobot@golang.org>
    Reviewed-by: Paul Murphy <murp@ibm.com>
    Reviewed-by: Carlos Eduardo Seo <carlos.seo@gmail.com>
    Trust: Lynn Boger <laboger@linux.vnet.ibm.com>

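An illustrative loop (ours) where the constant multiplier now encodes
directly into the mulli immediate instead of being loaded separately and
kept live across the loop:

    package scale

    // scale multiplies each element by a constant that fits in a signed
    // 16-bit immediate, so the multiply needs no constant load.
    func scale(xs []int64) {
        for i := range xs {
            xs[i] *= 12345
        }
    }
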
2020-10-02  all: drop 387 support  (Keith Randall)

    My last 387 CL. So sad ... ... ... ... not!

    Fixes #40255

    Change-Id: I8d4ddb744b234b8adc735db2f7c3c7b6d8bbdfa4
    Reviewed-on: https://go-review.googlesource.com/c/go/+/258957
    Trust: Keith Randall <khr@golang.org>
    Run-TryBot: Keith Randall <khr@golang.org>
    TryBot-Result: Go Bot <gobot@golang.org>
    Reviewed-by: Cherry Zhang <cherryyz@google.com>

2020-10-01  cmd/compile: enable late expansion for interface calls  (David Chase)

    Includes a few tweaks to Value.copyOf(a) (make it a no-op for a
    self-copy) and a new pattern hack: "___" (3 underscores) is like an
    ellipsis, except the replacement doesn't need to have matching
    ellipsis/underscores.

    Moved the arg-length check in generated pattern-matching code BEFORE
    the args are probed, because not all instances of a variable-length
    OpFoo will have all the args mentioned in some rule for OpFoo, and
    when that happens, the compiler panics without the early check.

    Change-Id: I66de40672b3794a6427890ff96c805a488d783f4
    Reviewed-on: https://go-review.googlesource.com/c/go/+/247537
    Trust: David Chase <drchase@google.com>
    Run-TryBot: David Chase <drchase@google.com>
    TryBot-Result: Go Bot <gobot@golang.org>
    Reviewed-by: Cherry Zhang <cherryyz@google.com>

2020-09-28  cmd/asm,cmd/compile,cmd/internal/obj/ppc64: add extswsli support on power9  (Lynn Boger)

    This adds support for the extswsli instruction, which combines extsw
    followed by a shift.

    A new benchmark demonstrates the improvement:

    name      old time/op  new time/op  delta
    ExtShift  1.34µs ± 0%  1.30µs ± 0%  -3.15%  (p=0.057 n=4+3)

    Change-Id: I21b410676fdf15d20e0cbbaa75d7c6dcd3bbb7b0
    Reviewed-on: https://go-review.googlesource.com/c/go/+/257017
    Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
    TryBot-Result: Go Bot <gobot@golang.org>
    Reviewed-by: Carlos Eduardo Seo <carlos.seo@gmail.com>
    Trust: Lynn Boger <laboger@linux.vnet.ibm.com>

2020-09-17  cmd/compile: add new ops for experiment with late call expansion  (David Chase)

    Added Dereference, StaticLECall, SelectN, SelectNAddr.

    Change-Id: I5426ae488e83956539aa07f7415b8acc26c933e6
    Reviewed-on: https://go-review.googlesource.com/c/go/+/239082
    Trust: David Chase <drchase@google.com>
    Run-TryBot: David Chase <drchase@google.com>
    Reviewed-by: Cherry Zhang <cherryyz@google.com>
    TryBot-Result: Go Bot <gobot@golang.org>

2020-09-17  cmd/compile: use combined shifts to improve array addressing on ppc64x  (Lynn Boger)

    This change adds rules to find pairs of instructions that can be
    combined into a single shift. These instruction sequences are common
    in array addressing within loops. Improvements can be seen in many
    crypto packages and the hash packages. These are based on the
    extended mnemonics found in the ISA sections C.8.1 and C.8.2. Some
    rules in PPC64.rules were moved because the ordering prevented some
    matching.

    The following results were generated on power9.

    hash/crc32:
    CRC32/poly=Koopman/size=40/align=0    195ns ± 0%   163ns ± 0%   -16.41%
    CRC32/poly=Koopman/size=40/align=1    200ns ± 0%   163ns ± 0%   -18.50%
    CRC32/poly=Koopman/size=512/align=0   1.98µs ± 0%  1.67µs ± 0%  -15.46%
    CRC32/poly=Koopman/size=512/align=1   1.98µs ± 0%  1.69µs ± 0%  -14.80%
    CRC32/poly=Koopman/size=1kB/align=0   3.90µs ± 0%  3.31µs ± 0%  -15.27%
    CRC32/poly=Koopman/size=1kB/align=1   3.85µs ± 0%  3.31µs ± 0%  -14.15%
    CRC32/poly=Koopman/size=4kB/align=0   15.3µs ± 0%  13.1µs ± 0%  -14.22%
    CRC32/poly=Koopman/size=4kB/align=1   15.4µs ± 0%  13.1µs ± 0%  -14.79%
    CRC32/poly=Koopman/size=32kB/align=0  137µs ± 0%   105µs ± 0%   -23.56%
    CRC32/poly=Koopman/size=32kB/align=1  137µs ± 0%   105µs ± 0%   -23.53%

    crypto/rc4:
    RC4_128  733ns ± 0%   650ns ± 0%   -11.32%  (p=1.000 n=1+1)
    RC4_1K   5.80µs ± 0%  5.17µs ± 0%  -10.89%  (p=1.000 n=1+1)
    RC4_8K   45.7µs ± 0%  40.8µs ± 0%  -10.73%  (p=1.000 n=1+1)

    crypto/sha1:
    Hash8Bytes    635ns ± 0%   613ns ± 0%   -3.46%  (p=1.000 n=1+1)
    Hash320Bytes  2.30µs ± 0%  2.18µs ± 0%  -5.38%  (p=1.000 n=1+1)
    Hash1K        5.88µs ± 0%  5.38µs ± 0%  -8.62%  (p=1.000 n=1+1)
    Hash8K        42.0µs ± 0%  37.9µs ± 0%  -9.75%  (p=1.000 n=1+1)

    There are other improvements found in golang.org/x/crypto which are
    all in the range of 5-15%.

    Change-Id: I193471fbcf674151ffe2edab212799d9b08dfb8c
    Reviewed-on: https://go-review.googlesource.com/c/go/+/252097
    Trust: Lynn Boger <laboger@linux.vnet.ibm.com>
    Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
    TryBot-Result: Go Bot <gobot@golang.org>
    Reviewed-by: Carlos Eduardo Seo <cseo@linux.vnet.ibm.com>

2020-09-16  cmd/compile: extend ssa.AuxCall to closure and interface calls  (David Chase)

    Also introduce helper methods.

    Change-Id: I11a744ed002bae0ca9ebabba3206e1c14147e03d
    Reviewed-on: https://go-review.googlesource.com/c/go/+/239080
    Trust: David Chase <drchase@google.com>
    Run-TryBot: David Chase <drchase@google.com>
    TryBot-Result: Go Bot <gobot@golang.org>
    Reviewed-by: Cherry Zhang <cherryyz@google.com>

2020-09-16  cmd/compile: introduce special ssa Aux type for calls  (David Chase)

    This is prerequisite to moving call expansion later into SSA, and
    probably a good idea anyway. Passes tests.

    This is the first minimal CL that does a 1-for-1 substitution of
    *ssa.AuxCall for *obj.LSym. The next step (next CL) is to make this
    change for all calls so that additional information can be stored in
    AuxCall.

    Change-Id: Ia3a7715648fd9fb1a176850767a726e6f5b959eb
    Reviewed-on: https://go-review.googlesource.com/c/go/+/237680
    Trust: David Chase <drchase@google.com>
    Run-TryBot: David Chase <drchase@google.com>
    TryBot-Result: Go Bot <gobot@golang.org>
    Reviewed-by: Cherry Zhang <cherryyz@google.com>

2020-08-27  cmd/compile: generate subfic on ppc64  (Paul E. Murphy)

    This merges an lis + subf into subfic, and for 32b constants lwa +
    subf into oris + ori + subf.

    The carry bit is no longer used in code generation; therefore, I
    think we can clobber it as needed. Note, lowered borrow/carry
    arithmetic is self-contained and thus is not affected.

    A few extra rules are added to ensure early transformations to
    SUBFCconst don't trip up earlier rules, fold constant operations, or
    otherwise simplify lowering. Likewise, tests are added to ensure all
    rules are hit. Generic constant folding catches trivial cases;
    however, some lowering rules insert arithmetic which can introduce
    new opportunities (e.g. BitLen or Slicemask).

    I couldn't find a specific benchmark to demonstrate noteworthy
    improvements, but this is generating subfic in many of the default
    bent test binaries, so we are at least saving a little code space.

    Change-Id: Iad7c6e5767eaa9dc24dc1c989bd1c8cfe1982012
    Reviewed-on: https://go-review.googlesource.com/c/go/+/249461
    Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
    TryBot-Result: Gobot Gobot <gobot@golang.org>
    Reviewed-by: Carlos Eduardo Seo <cseo@linux.vnet.ibm.com>

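An illustrative constant-minus-register shape (function ours) of the kind
that now lowers to a single subfic when the constant fits in 16 bits:

    package sub

    // fromConst computes 1000 - x; previously this needed a separate
    // constant load plus subf, now a single subtract-from-immediate.
    func fromConst(x int64) int64 {
        return 1000 - x
    }
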
2020-08-27  cmd/compile: remove unused carry related ssa ops in ppc64  (Paul E. Murphy)

    The intermediate SSA opcodes listed below are no longer generated
    during the lowering pass. The shifting rules have been improved using
    ISEL. Therefore, we can remove them and the rules which expand them.

    The removed opcodes are:

        LoweredAdd64Carry
        ADDconstForCarry
        MaskIfNotCarry
        FlagCarryClear
        FlagCarrySet

    Change-Id: I1ebe2726ed988f29ed4800c8f57b428f7a214cd0
    Reviewed-on: https://go-review.googlesource.com/c/go/+/249462
    Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
    TryBot-Result: Gobot Gobot <gobot@golang.org>
    Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>