Intel needs to see what has happened to their AVX instructions and why NVidia has taken over.
My approach to this is to write a bunch of tiny “kernels” which are obvious SIMD candidates and then inline them all; it does a pretty good job on x86 and ARM.
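Something like this (illustrative sketch only, the kernel names are made up):

    /* Each helper is a trivially vectorizable loop over plain arrays, marked
       inline so the compiler can fuse it into callers and auto-vectorize it
       (SSE/AVX/AVX-512 on x86, NEON on ARM) without any intrinsics. */
    #include <stddef.h>

    static inline void scale_add(float *restrict dst, const float *restrict src,
                                 float scale, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] += scale * src[i];
    }

    static inline void clamp01(float *restrict x, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            x[i] = x[i] < 0.0f ? 0.0f : (x[i] > 1.0f ? 1.0f : x[i]);
    }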
dragontamer ·8 days ago
If you just wrote your SIMD in CUDA 15 years ago, NVidia's compilers would have given you maximum performance across all NVidia GPUs, rather than forcing you to write and rewrite for SSE vs AVX vs AVX-512.
GPU SIMD is still SIMD. Just... better at it. I think AMD and Intel GPUs can keep up btw. But the software advantage and long-term benefits of rewriting into CUDA are heavily apparent.
Intel ISPC is a great project btw if you need high-level code that targets SSE, AVX, AVX-512, and even ARM NEON, all with one codebase, auto-compiled across all the architectures.
-------
Intel's AVX-512 is pretty good at the hardware level. But a software methodology for programming SIMD through GPU-like languages should be a priority.
Intrinsics are good for maximum performance but they are too hard for mainstream programmers.
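To make that concrete, here is the same trivial a[i] += b[i] loop written both ways (illustrative only; the AVX-512 version assumes an AVX-512F target):

    #include <immintrin.h>
    #include <stddef.h>

    /* What most programmers want to write, and roughly what a GPU-like or
       ISPC-style model lets them write: */
    void add_scalar(float *a, const float *b, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            a[i] += b[i];
    }

    /* The hand-written AVX-512 intrinsics version: explicit 16-wide registers,
       unaligned loads/stores, and a scalar remainder loop. */
    void add_avx512(float *a, const float *b, size_t n)
    {
        size_t i = 0;
        for (; i + 16 <= n; i += 16) {
            __m512 va = _mm512_loadu_ps(a + i);
            __m512 vb = _mm512_loadu_ps(b + i);
            _mm512_storeu_ps(a + i, _mm512_add_ps(va, vb));
        }
        for (; i < n; i++)
            a[i] += b[i];
    }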
Joker_vD ·8 days ago
Thank you for saying it out loud. x86's XLAT/XLATB is positively tame compared to e.g. RISC-V's vrgatherei16.vv/vrgather.vv.
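For anyone who hasn't met them, rough scalar models of the two (semantics paraphrased, masking and VLMAX details ignored; function names are just illustrative):

    #include <stddef.h>

    /* x86 XLAT/XLATB: one byte looked up in one table, AL = [BX + AL]. */
    unsigned char xlat(const unsigned char *table, unsigned char al)
    {
        return table[al];
    }

    /* RISC-V vrgather.vv: a full gather/permute across a vector register,
       dst[i] = src[idx[i]] for every lane (out-of-range indices give 0);
       vrgatherei16.vv is the same but with 16-bit indices. */
    void vrgather(unsigned char *dst, const unsigned char *src,
                  const unsigned short *idx, size_t vl, size_t vlmax)
    {
        for (size_t i = 0; i < vl; i++)
            dst[i] = (idx[i] < vlmax) ? src[idx[i]] : 0;
    }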
TinkersW ·7 days ago
As this would only use one lane, perhaps you could vectorize it if you have multiple of these to normalize.
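E.g. lay the data out structure-of-arrays so each SIMD lane handles a different one (purely illustrative; I'm assuming "normalize" here means something like scaling vectors to unit length):

    #include <math.h>
    #include <stddef.h>

    /* With x/y/z in separate arrays, the loop is trivially vectorizable:
       every lane normalizes an independent vector. */
    void normalize_batch(float *restrict x, float *restrict y, float *restrict z,
                         size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            float inv = 1.0f / sqrtf(x[i] * x[i] + y[i] * y[i] + z[i] * z[i]);
            x[i] *= inv;
            y[i] *= inv;
            z[i] *= inv;
        }
    }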
marmaduke ·7 days ago
https://github.com/maedoc/tvbk/blob/nb-again/src/util.h