diff options
author | wei xiao <wei.xiao@arm.com> | 2016-11-28 10:55:43 +0800 |
---|---|---|
committer | Brad Fitzpatrick <bradfitz@golang.org> | 2017-11-15 02:58:03 +0000 |
commit | d259815ccbf747dd159eabc8f28bfc4a94d47659 (patch) | |
tree | 35fb914ea2cf6025ca5b9680825d3e2917849cc6 /src/runtime/asm_arm64.s | |
parent | 466e299d6b6f98e86a5284afab1e6068867de66b (diff) | |
download | go-d259815ccbf747dd159eabc8f28bfc4a94d47659.tar.gz go-d259815ccbf747dd159eabc8f28bfc4a94d47659.zip |
runtime: IndexByte and memclr perf improvements on arm64
Update runtime asm_arm64.s and memclr_arm64.s to improve performance by using
SIMD instructions to do more in parallel. It shows improvement on bytes, html
and go1 benchmarks (particualrly regexp, which uses IndexByte frequently).
Benchmark results of bytes:
name old time/op new time/op delta
IndexByte/10-8 28.5ns ± 0% 19.5ns ± 0% -31.58% (p=0.000 n=10+10)
IndexByte/32-8 52.6ns ± 0% 19.0ns ± 0% -63.88% (p=0.000 n=10+10)
IndexByte/4K-8 4.12µs ± 0% 0.49µs ± 0% -88.16% (p=0.000 n=10+10)
IndexByte/4M-8 4.29ms ± 1% 0.70ms ±26% -83.65% (p=0.000 n=10+10)
IndexByte/64M-8 69.7ms ± 0% 16.0ms ± 0% -76.97% (p=0.000 n=9+10)
IndexBytePortable/10-8 34.0ns ± 0% 34.0ns ± 0% ~ (all equal)
IndexBytePortable/32-8 66.1ns ± 0% 66.1ns ± 0% ~ (p=0.471 n=9+9)
IndexBytePortable/4K-8 6.17µs ± 0% 6.17µs ± 0% ~ (all equal)
IndexBytePortable/4M-8 6.33ms ± 0% 6.35ms ± 0% +0.21% (p=0.002 n=10+9)
IndexBytePortable/64M-8 103ms ± 0% 103ms ± 0% +0.01% (p=0.017 n=9+10)
name old speed new speed delta
IndexByte/10-8 351MB/s ± 0% 512MB/s ± 0% +46.14% (p=0.000 n=9+10)
IndexByte/32-8 609MB/s ± 0% 1683MB/s ± 0% +176.40% (p=0.000 n=10+10)
IndexByte/4K-8 994MB/s ± 0% 8378MB/s ± 0% +742.75% (p=0.000 n=10+10)
IndexByte/4M-8 977MB/s ± 1% 6149MB/s ±32% +529.29% (p=0.000 n=10+10)
IndexByte/64M-8 963MB/s ± 0% 4182MB/s ± 0% +334.29% (p=0.000 n=9+10)
IndexBytePortable/10-8 294MB/s ± 0% 294MB/s ± 0% +0.17% (p=0.000 n=8+8)
IndexBytePortable/32-8 484MB/s ± 0% 484MB/s ± 0% ~ (p=0.877 n=9+9)
IndexBytePortable/4K-8 664MB/s ± 0% 664MB/s ± 0% ~ (p=0.242 n=8+9)
IndexBytePortable/4M-8 662MB/s ± 0% 661MB/s ± 0% -0.21% (p=0.002 n=10+9)
IndexBytePortable/64M-8 652MB/s ± 0% 652MB/s ± 0% ~ (p=0.065 n=10+10)
Benchmark results of html:
name old time/op new time/op delta
Escape-8 62.0µs ± 1% 61.0µs ± 1% -1.69% (p=0.000 n=9+10)
EscapeNone-8 10.2µs ± 0% 10.2µs ± 0% -0.09% (p=0.022 n=9+10)
Unescape-8 71.9µs ± 0% 68.7µs ± 0% -4.35% (p=0.000 n=10+10)
UnescapeNone-8 4.03µs ± 0% 0.48µs ± 0% -88.08% (p=0.000 n=10+10)
UnescapeSparse-8 10.7µs ± 2% 7.1µs ± 3% -33.91% (p=0.000 n=10+10)
UnescapeDense-8 53.2µs ± 1% 53.5µs ± 1% ~ (p=0.143 n=10+10)
Benchmark results of go1:
name old time/op new time/op delta
BinaryTree17-8 6.53s ± 0% 6.48s ± 2% ~ (p=0.190 n=4+5)
Fannkuch11-8 6.35s ± 1% 6.35s ± 0% ~ (p=1.000 n=5+5)
FmtFprintfEmpty-8 108ns ± 1% 101ns ± 2% -6.32% (p=0.008 n=5+5)
FmtFprintfString-8 172ns ± 1% 182ns ± 2% +5.70% (p=0.008 n=5+5)
FmtFprintfInt-8 207ns ± 0% 207ns ± 0% ~ (p=0.444 n=5+5)
FmtFprintfIntInt-8 277ns ± 1% 276ns ± 1% ~ (p=0.873 n=5+5)
FmtFprintfPrefixedInt-8 386ns ± 0% 382ns ± 1% -1.04% (p=0.024 n=5+5)
FmtFprintfFloat-8 492ns ± 0% 492ns ± 1% ~ (p=0.571 n=4+5)
FmtManyArgs-8 1.32µs ± 1% 1.33µs ± 0% ~ (p=0.087 n=5+5)
GobDecode-8 16.8ms ± 2% 16.7ms ± 1% ~ (p=1.000 n=5+5)
GobEncode-8 14.1ms ± 1% 14.0ms ± 1% ~ (p=0.056 n=5+5)
Gzip-8 788ms ± 0% 802ms ± 0% +1.71% (p=0.008 n=5+5)
Gunzip-8 83.6ms ± 0% 83.9ms ± 0% +0.40% (p=0.008 n=5+5)
HTTPClientServer-8 120µs ± 0% 120µs ± 1% ~ (p=0.548 n=5+5)
JSONEncode-8 33.2ms ± 0% 33.0ms ± 1% -0.71% (p=0.008 n=5+5)
JSONDecode-8 152ms ± 1% 152ms ± 1% ~ (p=1.000 n=5+5)
Mandelbrot200-8 10.0ms ± 0% 10.0ms ± 0% -0.05% (p=0.008 n=5+5)
GoParse-8 7.97ms ± 0% 7.98ms ± 0% ~ (p=0.690 n=5+5)
RegexpMatchEasy0_32-8 233ns ± 1% 206ns ± 0% -11.44% (p=0.016 n=5+4)
RegexpMatchEasy0_1K-8 1.86µs ± 0% 0.77µs ± 1% -58.54% (p=0.008 n=5+5)
RegexpMatchEasy1_32-8 250ns ± 0% 205ns ± 0% -18.07% (p=0.008 n=5+5)
RegexpMatchEasy1_1K-8 2.28µs ± 0% 1.11µs ± 0% -51.09% (p=0.029 n=4+4)
RegexpMatchMedium_32-8 332ns ± 1% 301ns ± 2% -9.45% (p=0.008 n=5+5)
RegexpMatchMedium_1K-8 85.5µs ± 2% 78.8µs ± 0% -7.83% (p=0.008 n=5+5)
RegexpMatchHard_32-8 4.34µs ± 1% 4.27µs ± 0% -1.49% (p=0.008 n=5+5)
RegexpMatchHard_1K-8 130µs ± 1% 127µs ± 0% -2.53% (p=0.008 n=5+5)
Revcomp-8 1.35s ± 1% 1.13s ± 1% -16.17% (p=0.008 n=5+5)
Template-8 160ms ± 2% 162ms ± 2% ~ (p=0.222 n=5+5)
TimeParse-8 795ns ± 2% 778ns ± 1% ~ (p=0.095 n=5+5)
TimeFormat-8 782ns ± 0% 786ns ± 1% +0.59% (p=0.040 n=5+5)
name old speed new speed delta
GobDecode-8 45.8MB/s ± 2% 45.9MB/s ± 1% ~ (p=1.000 n=5+5)
GobEncode-8 54.3MB/s ± 1% 55.0MB/s ± 1% ~ (p=0.056 n=5+5)
Gzip-8 24.6MB/s ± 0% 24.2MB/s ± 0% -1.69% (p=0.008 n=5+5)
Gunzip-8 232MB/s ± 0% 231MB/s ± 0% -0.40% (p=0.008 n=5+5)
JSONEncode-8 58.4MB/s ± 0% 58.8MB/s ± 1% +0.71% (p=0.008 n=5+5)
JSONDecode-8 12.8MB/s ± 1% 12.8MB/s ± 1% ~ (p=1.000 n=5+5)
GoParse-8 7.27MB/s ± 0% 7.26MB/s ± 0% ~ (p=0.762 n=5+5)
RegexpMatchEasy0_32-8 137MB/s ± 1% 155MB/s ± 0% +12.93% (p=0.008 n=5+5)
RegexpMatchEasy0_1K-8 551MB/s ± 0% 1329MB/s ± 1% +141.11% (p=0.008 n=5+5)
RegexpMatchEasy1_32-8 128MB/s ± 0% 156MB/s ± 0% +22.00% (p=0.008 n=5+5)
RegexpMatchEasy1_1K-8 449MB/s ± 0% 920MB/s ± 0% +104.68% (p=0.016 n=4+5)
RegexpMatchMedium_32-8 3.00MB/s ± 0% 3.32MB/s ± 2% +10.60% (p=0.016 n=4+5)
RegexpMatchMedium_1K-8 12.0MB/s ± 2% 13.0MB/s ± 0% +8.48% (p=0.008 n=5+5)
RegexpMatchHard_32-8 7.38MB/s ± 1% 7.49MB/s ± 0% +1.49% (p=0.008 n=5+5)
RegexpMatchHard_1K-8 7.88MB/s ± 1% 8.08MB/s ± 0% +2.59% (p=0.008 n=5+5)
Revcomp-8 188MB/s ± 1% 224MB/s ± 1% +19.29% (p=0.008 n=5+5)
Template-8 12.2MB/s ± 2% 12.0MB/s ± 2% ~ (p=0.206 n=5+5)
Change-Id: I94116620a287d173a6f60510684362e500f54887
Reviewed-on: https://go-review.googlesource.com/33597
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Diffstat (limited to 'src/runtime/asm_arm64.s')
-rw-r--r-- | src/runtime/asm_arm64.s | 143 |
1 files changed, 109 insertions, 34 deletions
diff --git a/src/runtime/asm_arm64.s b/src/runtime/asm_arm64.s index 5e202e7a87..9bf0646c8d 100644 --- a/src/runtime/asm_arm64.s +++ b/src/runtime/asm_arm64.s @@ -802,48 +802,123 @@ samebytes: // TEXT bytes·IndexByte(SB),NOSPLIT,$0-40 MOVD b+0(FP), R0 - MOVD b_len+8(FP), R1 - MOVBU c+24(FP), R2 // byte to find - MOVD R0, R4 // store base for later - ADD R0, R1 // end -loop: - CMP R0, R1 - BEQ notfound - MOVBU.P 1(R0), R3 - CMP R2, R3 - BNE loop - - SUB $1, R0 // R0 will be one beyond the position we want - SUB R4, R0 // remove base - MOVD R0, ret+32(FP) - RET - -notfound: - MOVD $-1, R0 - MOVD R0, ret+32(FP) - RET + MOVD b_len+8(FP), R2 + MOVBU c+24(FP), R1 + MOVD $ret+32(FP), R8 + B runtime·indexbytebody<>(SB) TEXT strings·IndexByte(SB),NOSPLIT,$0-32 MOVD s+0(FP), R0 - MOVD s_len+8(FP), R1 - MOVBU c+16(FP), R2 // byte to find - MOVD R0, R4 // store base for later - ADD R0, R1 // end + MOVD s_len+8(FP), R2 + MOVBU c+16(FP), R1 + MOVD $ret+24(FP), R8 + B runtime·indexbytebody<>(SB) + +// input: +// R0: data +// R1: byte to search +// R2: data len +// R8: address to put result +TEXT runtime·indexbytebody<>(SB),NOSPLIT,$0 + // Core algorithm: + // For each 32-byte chunk we calculate a 64-bit syndrome value, + // with two bits per byte. For each tuple, bit 0 is set if the + // relevant byte matched the requested character and bit 1 is + // not used (faster than using a 32bit syndrome). Since the bits + // in the syndrome reflect exactly the order in which things occur + // in the original string, counting trailing zeros allows to + // identify exactly which byte has matched. + + CBZ R2, fail + MOVD R0, R11 + // Magic constant 0x40100401 allows us to identify + // which lane matches the requested byte. + // 0x40100401 = ((1<<0) + (4<<8) + (16<<16) + (64<<24)) + // Different bytes have different bit masks (i.e: 1, 4, 16, 64) + MOVD $0x40100401, R5 + VMOV R1, V0.B16 + // Work with aligned 32-byte chunks + BIC $0x1f, R0, R3 + VMOV R5, V5.S4 + ANDS $0x1f, R0, R9 + AND $0x1f, R2, R10 + BEQ loop + + // Input string is not 32-byte aligned. We calculate the + // syndrome value for the aligned 32 bytes block containing + // the first bytes and mask off the irrelevant part. + VLD1.P (R3), [V1.B16, V2.B16] + SUB $0x20, R9, R4 + ADDS R4, R2, R2 + VCMEQ V0.B16, V1.B16, V3.B16 + VCMEQ V0.B16, V2.B16, V4.B16 + VAND V5.B16, V3.B16, V3.B16 + VAND V5.B16, V4.B16, V4.B16 + VADDP V4.B16, V3.B16, V6.B16 // 256->128 + VADDP V6.B16, V6.B16, V6.B16 // 128->64 + VMOV V6.D[0], R6 + // Clear the irrelevant lower bits + LSL $1, R9, R4 + LSR R4, R6, R6 + LSL R4, R6, R6 + // The first block can also be the last + BLS masklast + // Have we found something already? + CBNZ R6, tail + loop: - CMP R0, R1 - BEQ notfound - MOVBU.P 1(R0), R3 - CMP R2, R3 - BNE loop + VLD1.P (R3), [V1.B16, V2.B16] + SUBS $0x20, R2, R2 + VCMEQ V0.B16, V1.B16, V3.B16 + VCMEQ V0.B16, V2.B16, V4.B16 + // If we're out of data we finish regardless of the result + BLS end + // Use a fast check for the termination condition + VORR V4.B16, V3.B16, V6.B16 + VADDP V6.D2, V6.D2, V6.D2 + VMOV V6.D[0], R6 + // We're not out of data, loop if we haven't found the character + CBZ R6, loop + +end: + // Termination condition found, let's calculate the syndrome value + VAND V5.B16, V3.B16, V3.B16 + VAND V5.B16, V4.B16, V4.B16 + VADDP V4.B16, V3.B16, V6.B16 + VADDP V6.B16, V6.B16, V6.B16 + VMOV V6.D[0], R6 + // Only do the clear for the last possible block with less than 32 bytes + // Condition flags come from SUBS in the loop + BHS tail + +masklast: + // Clear the irrelevant upper bits + ADD R9, R10, R4 + AND $0x1f, R4, R4 + SUB $0x20, R4, R4 + NEG R4<<1, R4 + LSL R4, R6, R6 + LSR R4, R6, R6 - SUB $1, R0 // R0 will be one beyond the position we want - SUB R4, R0 // remove base - MOVD R0, ret+24(FP) +tail: + // Check that we have found a character + CBZ R6, fail + // Count the trailing zeros using bit reversing + RBIT R6, R6 + // Compensate the last post-increment + SUB $0x20, R3, R3 + // And count the leading zeros + CLZ R6, R6 + // R6 is twice the offset into the fragment + ADD R6>>1, R3, R0 + // Compute the offset result + SUB R11, R0, R0 + MOVD R0, (R8) RET -notfound: +fail: MOVD $-1, R0 - MOVD R0, ret+24(FP) + MOVD R0, (R8) RET // Equal(a, b []byte) bool |