The search is less than ideal. Search for FMA and it finds multiple pages of NEON intrinsics, but no AMD64 ones like _mm256_fmadd_pd
Now the search is indeed less than ideal; we're working on replacing our search engine with a much more robust one that doesn't favour one architecture over the other, especially for terms like these.
In any case, thank you for your feedback. It's still in beta but it is already very useful for us, as we're actually using it for development on our own projects.
Apart from that, I find the search results too sparse (they don't contain the prototype) and the result page too verbose (way too much fluff in the description, and way too much setup in the example; honestly, who cares about [1]?). Also, the contrast is set so low that I literally cannot read all of the example.
1. Empty string -> zero results: obviously a bug, we'll add some default value.
2. The sparse results are because of VSX: VSX provides multiple prototypes per intrinsic, which we thought would increase the size of the results a bit too much. Including the prototypes in the results is not a problem in itself, but on the other hand we don't want so much information that it becomes hard for the developer to find the relevant intrinsic. We'll take another look at this.
The description is indeed rather bare at the moment; we intend to include a lot more information, like diagrams, pseudocode for the operation, etc.
Examples are meant to be run as self-contained compile units, in Compiler Explorer or locally, to demonstrate the intrinsic, hence the extra setup. This will not change.
We also think that nothing will replace the official ISA references, which is why we include links to those anyway.
3. Regarding the contrast, we're already working on a light/dark theme.
Thank you for your comments.
I actually like that the example is a complete, standalone program that you can compile or send to Compiler Explorer.
Claude seems to enjoy storing 16-bit masks in 512-bit vectors, but the compiler will catch that easily.
The biggest issue I encountered was that when converting nested if statements into mask operations, it would frequently forget to AND the inner and outer masks together.
Yeah, the plan is to get all SIMD engines in there; RVV is the hardest though (20k intrinsics). Currently we're doing IBM Z, which should be done within the month, probably. It still needs some work, and progress is slow because we're funding it ourselves. The plan is IBM Z (currently in progress), then Loongson LSX/LASX, MIPS MSA, ARM SVE/SVE2, and finally RVV 1.0. LSX/LASX and MSA are very easy. Ideally, I'd like to open source everything, but I can't just now, as I would be handing all the data over to big players like OpenAI. Once I manage to secure adequate funding, we're going to open source the data (SIMD.info) and probably the model itself (SIMD.ai).
It is interesting how often SIMD stuff is discussed on here. Are people really directly dealing with SIMD calls a lot?
I get the draw -- this sort of to-the-metal hyper-optimization is legitimately fun and intellectually rewarding -- but I suspect that in the overwhelming majority of cases, simply using the appropriate library, ideally one that is cross-platform and utilizes whatever SIMD a given target hosts, is a far better choice than bothering with the esoterica of every platform and generation of SIMD offerings.
Point being, while not everyone is in a position where this stuff is relevant (and not everyone who sometimes finds it relevant can say it's relevant often), it's more widely applicable than you're suggesting.
If you are an average developer building a game, a corporate information or financial system, or even a neural network implementation, and you are touching SIMD code directly, you're probably approaching things in a less-than-optimal fashion; there are much better ways to utilize whatever features your hardware, current or future, may offer up.
Now of course, many of these transformations can be wrapped in a higher-level API (think of sums over an array, reductions, string length, string encoding, etc.) but not all of them.
Of course, multithreading also exists to improve performance, but for many tasks it's more worth it to run on one core without the sync overhead, especially with data-parallel algorithms where you're doing the exact same thing to all of the data and you have a fairly large dataset. Or, even better, you can combine the two: partition the data into a few small sets across multiple cores, then use a SIMD "kernel" to process each one. With embarrassingly parallel problems, you can achieve 1000x speedups this way, not exaggerating. A typical speedup is much smaller, but still, using your machine well can easily produce an order-of-magnitude difference in performance. If you read up on ISPC benchmarks, you'll find that even for existing, very branchy, not particularly SIMD-friendly code, they regularly got a free 4x speedup without changing the behaviour or the result of the program.
Seriously, it's not that using SIMD-powered libraries sets you up for performance by itself; if you have a holistic view of your system's performance, you can do really amazing things.
[1]: https://simd.info/c_intrinsic/_mm256_permute_pd
[2]: https://www.felixcloutier.com/x86/vpermilpd
[3]: https://officedaytime.com/simd512e/simdimg/si.php?f=vpermilp...
[4]: https://www.intel.com/content/www/us/en/docs/intrinsics-guid...
RISC-V and wasm are hosted here: https://dzaima.github.io/intrinsics-viewer/
You need to download it yourself if you want to use the others.
We would actually like to include more information, but our goal is to complement the official documentation, not replace it. We already provide links to Felix Cloutier's and Intel's sites, and the same for Arm and Power, where we can.
The biggest problem is generating the diagrams; we're investigating a way to generate them in a common manner for all architectures, but this will take time.