SIMD.info – Reference tool for C intrinsics of all major SIMD engines

https://simd.info/

Discussion

syockit
While the search feature is nice, the reference itself still lacks detail about what an instruction actually does. Take, for example, [1], and compare it with, say, [2] (with a diagram), [3] (ditto), or [4] (only pseudocode, but helpful nonetheless; see the sketch after the links). Of course, the alternatives mentioned only cater to x86, but it'd still be great if this site followed the approach taken by those three.

[1]: https://simd.info/c_intrinsic/_mm256_permute_pd [2]: https://www.felixcloutier.com/x86/vpermilpd [3]: https://officedaytime.com/simd512e/simdimg/si.php?f=vpermilp... [4]: https://www.intel.com/content/www/us/en/docs/intrinsics-guid...
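
For illustration, the operation behind [1] boils down to roughly the following scalar model (a sketch paraphrasing the pseudocode in [4], not the official text):

    /* Rough scalar model of _mm256_permute_pd(a, imm): each of the low
       four immediate bits selects one of the two doubles within its
       128-bit lane. */
    #include <immintrin.h>

    static __m256d permute_pd_model(__m256d a, int imm) {
        double in[4], out[4];
        _mm256_storeu_pd(in, a);
        out[0] = (imm & 1) ? in[1] : in[0];   /* low lane  */
        out[1] = (imm & 2) ? in[1] : in[0];
        out[2] = (imm & 4) ? in[3] : in[2];   /* high lane */
        out[3] = (imm & 8) ? in[3] : in[2];
        return _mm256_loadu_pd(out);
    }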

camel-cdr
https://github.com/dzaima/intrinsics-viewer is like Intel's Guide, but also covers Arm, RISC-V and wasm.

RISC-V and wasm are hosted here: https://dzaima.github.io/intrinsics-viewer/

You need to download it yourself if you want to use the others.

vectorcamp
Hi, I'm one of the SIMD.info team, thanks for your feedback.

We would actually like to include more information, but our goal is to complement the official documentation, not replace it. We already provide links to Felix Cloutier's site and Intel's anyway, and the same for Arm and Power where we can.

The biggest problem is the generation of the diagrams; we're investigating a way to generate them in a common manner for all architectures, but this will take time.

skavi
https://dougallj.github.io/asil/ is like officedaytime but for SVE.
Const-me
The ISA extension tags are mostly incorrect. According to that web site, all SSE2, SSE3, SSSE3, and SSE4.1 intrinsics are part of SSE 4.2, and all FMA3 intrinsics are part of AVX2. BTW there’s one processor which supports AVX2 but lacks FMA3: https://en.wikipedia.org/wiki/List_of_VIA_Eden_microprocesso...

The search is less than ideal. Search for FMA and it finds multiple pages of NEON intrinsics, but no AMD64 ones like _mm256_fmadd_pd.
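
For context, AVX2 and FMA3 really are independent features that have to be probed separately at run time; a quick sketch (assuming GCC or Clang, whose __builtin_cpu_supports knows both feature strings):

    /* Prints the run-time availability of AVX2 and FMA3 separately.
       On the VIA Eden parts linked above, the first would be 1 and
       the second 0. */
    #include <stdio.h>

    int main(void) {
        printf("AVX2: %d  FMA3: %d\n",
               __builtin_cpu_supports("avx2"),
               __builtin_cpu_supports("fma"));
        return 0;
    }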

vectorcamp
Hi, thanks for your feedback; we are being "incorrect" on purpose. All intrinsics up to and including SSE4.2 are listed as part of SSE4.2. We have no intention of providing full granularity for every ISA extension, especially one that is 20 years old. For the same reason, we list VSX as included in Power ISA 3.0, but not e.g. AltiVec or Power7/Power8 VSX. If you need such granularity, you are better off visiting the Intel Intrinsics Guide or the ISA manuals. So the x86 intrinsics are split into 3 groups: SSE4.2 (everything up to and including it), AVX2 (including AVX) and AVX512 (also including some, but not all, variants), similar to the x86-64-v1, x86-64-v2, etc. levels used by compilers. We will probably add finer granularity by listing the exact extension in the description in the future, but not as part of the categorization.

Now, the search is indeed less than ideal; we're working on replacing our search engine with a much more robust one that doesn't favour one architecture over another, especially for terms like these.

In any case, thank you for your feedback. The site is still in beta, but it is already very useful for us, as we're actually using it for development on our own projects.

ack_complete
Note that this issue also affects NEON. Two examples are vmull_p64(), which requires the Crypto extension -- notably absent on RPi3/4 -- and vqrdmlah_s32(), which requires FEAT_RDM, not guaranteed until ARMv8.1. Unlike Intel, ARM doesn't do a very good job of surfacing this in their intrinsics guide.
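
For illustration, the compile-time gating involved looks roughly like this (a sketch using the ACLE feature macros; fallback paths omitted):

    #include <arm_neon.h>

    #if defined(__ARM_FEATURE_CRYPTO)
    /* vmull_p64 needs the Crypto extension -- absent on RPi3/4. */
    poly128_t clmul(poly64_t a, poly64_t b) {
        return vmull_p64(a, b);
    }
    #endif

    #if defined(__ARM_FEATURE_QRDMX)
    /* vqrdmlah_s32 needs FEAT_RDM, only guaranteed from ARMv8.1 on. */
    int32x2_t rdm(int32x2_t acc, int32x2_t a, int32x2_t b) {
        return vqrdmlah_s32(acc, a, b);
    }
    #endif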
vient
Would also be nice to remove empty categories from tree view. For example, right now you can uncheck VSX and still see "Memory Operations - VSX Unaligned ..." full of empty tags.
gMermigkis
Thank you for this comment; it will be taken into consideration.
Sesse__
I clicked the “go” button just to see the typical format, and it gave… zero results. Because the example is “e.g. integer vector addition” and it doesn't strip away the “e.g.” part!

Apart from that, I find the search results too sparse (they don't contain the prototype) and the result page too verbose (way too much fluff in the description, and way too much setup in the example; honestly, who cares about [1]), so I'll probably stick to the existing x86/Arm references.

[1] Also, the contrast is set so low that I literally cannot read all of the example.

vectorcamp
You make some good points. I represent Vectorcamp (creators of simd.info). It's still in beta because we know there are some limitations currently, but we are already using it in production for our own projects. Now to comment on your points:

1. Empty string -> zero results: obviously a bug, we'll add some default value.

2. The sparse results are because of VSX: VSX provides multiple prototypes per intrinsic, which we thought would bloat the results too much. Including the prototypes in the results is not a problem, but on the other hand we don't want so much information that it becomes hard for the developer to find the relevant intrinsic. We'll take another look at this.

The description is admittedly rather bare right now; we intend to include a lot more information, like diagrams, pseudocode for the operation, etc.

Examples are meant to be run as self-contained compilation units, in Compiler Explorer or locally, to demonstrate the intrinsic; hence the extra setup. This will not change.
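
For instance, a page's example might look something like this (a hypothetical sample, not one of the actual SIMD.info pages):

    /* Self-contained demo of _mm_add_epi32: adds two vectors of four
       32-bit integers (SSE2). */
    #include <emmintrin.h>
    #include <stdio.h>

    int main(void) {
        __m128i a = _mm_setr_epi32(1, 2, 3, 4);
        __m128i b = _mm_setr_epi32(10, 20, 30, 40);
        __m128i c = _mm_add_epi32(a, b);

        int out[4];
        _mm_storeu_si128((__m128i *)out, c);
        printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);
        return 0;
    }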

We also think that nothing will replace the official ISA references; we include links to those anyway.

3. Regarding the contrast, we're already working on a light/dark theme.

Thank you for your comments.

SloopJon
I don't think it's that it's not stripping "e.g.", but that the search criteria are empty. The empty result set is prefaced by "Search results for:".

I actually like that the example is a complete, standalone program that you can compile or send to Compiler Explorer.

convery
Neat idea, though the 'search' feature is a bit odd if you don't know which instruction you're looking for. E.g. searching for 'SHA' shows autocomplete entries for platforms that aren't selected and then 0 results due to the filters (the SSE/AVX ones haven't been added yet), but searching for 'hash' gets you 100 results like '_mm256_castsi256_ph', which has nothing to do with the search.
gMermigkis
Thanks for your comment. We have noticed some strange behavior with the “search” feature; you are right to mention it, and we are currently trying to improve it. Regarding SHA: you don’t get any results when filtering out NEON or VSX because the AVX512 SHA intrinsics haven’t been added yet (under development at the moment). When searching for “HASH”, the first 3 results you get are correct (NEON); the others are, as mentioned before, bad behavior of the search component - it must have found some similarity.
fancyfredbot
The link to SIMD.AI is interesting. I didn't have a perfect experience trying to get Claude to convert scalar code to AVX512.

Claude seems to enjoy storing 16-bit masks in 512-bit vectors, but the compiler will find that easily.

The biggest issue I encountered was that, when converting nested if statements into mask operations, it would frequently forget to AND the inner and outer masks together.
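
Concretely, the failure mode looks something like this (hypothetical conditions, AVX-512F):

    /* Scalar logic: if (x > 0.0f) { if (x < 1.0f) { x += 1.0f; } }
       When vectorised, the inner mask must be ANDed with the outer one. */
    #include <immintrin.h>

    __m512 nested_if(__m512 x) {
        __mmask16 outer = _mm512_cmp_ps_mask(x, _mm512_setzero_ps(), _CMP_GT_OQ);
        __mmask16 inner = outer &  /* the AND Claude kept forgetting */
            _mm512_cmp_ps_mask(x, _mm512_set1_ps(1.0f), _CMP_LT_OQ);
        return _mm512_mask_add_ps(x, inner, x, _mm512_set1_ps(1.0f));
    }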

vectorcamp
Getting an LLM to translate code is very tricky; we haven't included AVX2 and AVX512 in our SIMD.ai yet because they require a lot more work. However, translating code between similarly sized vector engines became doable once we fine-tuned the LLM on our own data. We tested both ChatGPT and Claude (and more), but none could do even the simplest translations between e.g. SSE4.2 and Neon or VSX, so trying something harder like AVX512 felt like a bit of a stretch. But we're working on it.
mshockwave
This is pretty useful! Any plans for adding ARM SVE and the RISC-V V extension?
pabs3
A response from the SIMD.info folks:

Yeah, the plan is to get all SIMD engines in there; RVV is the hardest though (20k intrinsics). Currently we're doing IBM Z, which should be done probably within the month? It still needs some work, and progress is slow because we're just using our own funds. The plan is IBM Z (currently being worked on), Loongson LSX/LASX, MIPS MSA, ARM SVE/SVE2 and finally RVV 1.0. LSX/LASX and MSA are very easy. Ideally, I'd like to open source everything, but I can't just now, as I would just hand all the data over to big players like OpenAI. Once I manage to ensure adequate funding, we're going to open source the data (SIMD.info) and probably the model itself (SIMD.ai).

CalChris
Maybe std::simd could be worked into this.
llm_nerd
Neat tool.

It is interesting how often SIMD stuff is discussed on here. Are people really directly dealing with SIMD calls a lot?

I get the draw -- this sort of to-the-metal hyper-optimization is legitimately fun and intellectually rewarding -- but I suspect that in the overwhelming majority of cases, simply using the appropriate library, ideally one that is cross-platform and utilizes whatever SIMD a given target offers, is a far better choice than bothering with the esoterica of every platform and generation of SIMD offerings.

vectorcamp
I agree, it's always best to use something that already exists and is optimized for your platform, unless it doesn't exist or you need extra features that aren't covered. In those cases you need to read large ISA manuals, use each vendor's intrinsics site, or use our tool SIMD.info :)
sophacles
I kinda agree with the main point, but keep in mind those libraries with SIMD optimizations don't just appear out of nowhere... people write those. Also, it's pretty common for someone to write software for an org that has 10^5 or more identical cores running in a datacenter (or datacenters)... some specialized optimization can easily be cost-effective in those situations. Then there's crazy distributed-systems stuff, where a small latency reduction in the right place can have a significant impact on an entire cluster. And on and on...

Point being, while not everyone is in a position where this stuff is relevant (and not everyone who sometimes finds it relevant can say it's relevant often), it's more widely applicable than you're suggesting.

llm_nerd
For sure, there are developers building those computation libraries: numpy, compilers, R, and so on. These people exist and are grinding out great code and abstractions for the rest of us to use, and many of them are regulars on HN. But these people are seldom the target of the "learn SIMD" content that appears here regularly.

If someone is an average developer building a game, a corporate information or financial system, or even a neural network implementation, and they're touching SIMD code directly, they're probably approaching things in a less than optimal fashion; there are much better ways to utilize whatever features their hardware, or future hardware, may offer up.

Pannoniae
This is not entirely accurate... think about it this way: every time you issue a scalar floating-point addition or multiplication, you're using an eighth or a sixteenth of your CPU's theoretical throughput. Of course, it's a bit more complicated than that, but that's the general gist. Compilers won't generate SIMD code for you (autovectorisation) except in the simplest cases, and they certainly won't do the data-layout transformations (AoS to SoA or AoSoA) necessary to use SIMD efficiently.
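
To make the layout point concrete, a small sketch (field and function names made up for illustration):

    /* Array-of-Structures: the x fields sit 16 bytes apart, so one
       SIMD load can't fetch four of them at once. */
    struct vec4 { float x, y, z, w; };

    void scale_aos(struct vec4 *v, int n, float s) {
        for (int i = 0; i < n; i++)
            v[i].x *= s;        /* strided access, hard to vectorise */
    }

    /* Structure-of-Arrays: all the x values are contiguous, so this
       loop maps straight onto packed multiplies and autovectorises. */
    void scale_soa(float *x, int n, float s) {
        for (int i = 0; i < n; i++)
            x[i] *= s;          /* one contiguous stream */
    }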

Now, of course, many of these transformations can be wrapped in a higher-level API (think of sums over an array, reductions, string length, string encoding, etc.), but not all of them.

Of course, multithreading also exists to improve performance, but for many tasks it's more worthwhile to run on one core without the sync overhead, especially with data-parallel algorithms where you're doing the exact same thing to all of the data and you have a fairly large dataset. Or even better, you can combine the two: partition the data between multiple cores into a few smaller sets, then use a SIMD "kernel" to process each one. With embarrassingly parallel problems you can achieve 1000x speedups this way, not exaggerating. A typical speedup is much smaller, but still, using your machine well can easily produce an order-of-magnitude difference in performance. If you read up on the ISPC benchmarks, you'll find that even for existing, very branchy, not-SIMD-friendly code, they regularly got a free 4x speedup without changing the behaviour or the result of the program.
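
That combine-the-two pattern might look roughly like this (a sketch only: OpenMP for the partitioning, AVX for the kernel, and n assumed to be a multiple of the chunk size for brevity):

    /* Partition the array across cores, then run a SIMD kernel on
       each partition: a parallel float sum. Build with -mavx -fopenmp. */
    #include <immintrin.h>

    #define CHUNK 8192   /* elements per partition; assumes n % CHUNK == 0 */

    float parallel_sum(const float *a, int n) {
        float total = 0.0f;
        #pragma omp parallel for reduction(+:total)
        for (int chunk = 0; chunk < n; chunk += CHUNK) {
            __m256 acc = _mm256_setzero_ps();
            for (int i = chunk; i < chunk + CHUNK; i += 8)
                acc = _mm256_add_ps(acc, _mm256_loadu_ps(a + i));
            float lane[8];
            _mm256_storeu_ps(lane, acc);
            for (int j = 0; j < 8; j++)
                total += lane[j];
        }
        return total;
    }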

Seriously, it's not that using SIMD-powered libraries sets you up for performance by itself; if you have a holistic view of your system's performance, you can do really amazing things.

varispeed
SIMD for MCUs would also be awesome!
vectorcamp
Do you mean Helium from Arm? Yes, that would be nice to include, and relatively easy, as it's mostly the same as Neon.