x86_64 assembly pack: "optimize" for Knights Landing, add AVX-512 results.

"Optimize" is in quotes because it's rather a "salvage operation" for now. Idea is to identify processor capability flags that drive Knights Landing to suboptimial code paths and mask them. Two flags were identified, XSAVE and ADCX/ADOX. Former affects choice of AES-NI code path specific for Silvermont (Knights Landing is of Silvermont "ancestry"). And 64-bit ADCX/ADOX instructions are effectively mishandled at decode time. In both cases we are looking at ~2x improvement. AVX-512 results cover even Skylake-X :-) Hardware used for benchmarking courtesy of Atos, experiments run by Romain Dolbeau <romain.dolbeau@atos.net>. Kudos! Reviewed-by: Rich Salz <rsalz@openssl.org>
author: Andy Polyakov <appro@openssl.org> 2017-07-20 09:48:35 +0200
committer: Andy Polyakov <appro@openssl.org> 2017-07-21 14:07:32 +0200
commit: 64d92d74985ebb3d0be58a9718f9e080a14a8e7f (patch)
tree: 036456c8d139587371300824d273d1c500411d1a /crypto/chacha/asm/chacha-x86_64.pl
parent: bbb4ceb86eb6ea0300f744443c36fb6e980fff9d (diff)
1 files changed, 4 insertions, 2 deletions
diff --git a/crypto/chacha/asm/chacha-x86_64.pl b/crypto/chacha/asm/chacha-x86_64.pl
index e2c6a32440..0cfe8990fa 100755
--- a/crypto/chacha/asm/chacha-x86_64.pl
+++ b/crypto/chacha/asm/chacha-x86_64.pl
@@ -24,7 +24,7 @@
 #
 # Performance in cycles per byte out of large buffer.
 #
-#		IALU/gcc 4.8(i)	1xSSSE3/SSE2	4xSSSE3	    8xAVX2
+#		IALU/gcc 4.8(i)	1xSSSE3/SSE2	4xSSSE3	    NxAVX(v)
 #
 # P4		9.48/+99%	-/22.7(ii)	-
 # Core2		7.83/+55%	7.90/8.08	4.35
@@ -32,8 +32,9 @@
 # Sandy Bridge	8.31/+42%	5.45/6.76	2.72
 # Ivy Bridge	6.71/+46%	5.40/6.49	2.41
 # Haswell	5.92/+43%	5.20/6.45	2.42	    1.23
-# Skylake	5.87/+39%	4.70/-		2.31	    1.19
+# Skylake[-X]	5.87/+39%	4.70/-		2.31	    1.19[0.57]
 # Silvermont	12.0/+33%	7.75/7.40	7.03(iii)
+# Knights L	11.7/-		-		9.60(iii)   0.80
 # Goldmont	10.6/+17%	5.10/-		3.28
 # Sledgehammer	7.28/+52%	-/14.2(ii)	-
 # Bulldozer	9.66/+28%	9.85/11.1	3.06(iv)
@@ -50,6 +51,7 @@
 #	limitations, SSE2 can do better, but gain is considered too
 #	low to justify the [maintenance] effort;
 # (iv)	Bulldozer actually executes 4xXOP code path that delivers 2.20;
+# (v)	8xAVX2 or 16xAVX-512, whichever best applicable;
 
 $flavour = shift;
 $output  = shift;