AVX-512 stand-alone C code for neural nets
Open-source program generates fully AVX-512 vectorized, human-readable, stand-alone C implementations of convolutional neural nets. An example of AVX-512 programming using GCC's AVX-512 intrinsics: https://NN-512.com
- Posts: 1
- Joined: 2021-08-24, 19:43:16
Re: AVX-512 stand-alone C code for neural nets
NN-512 is an open-source Go program that generates fully AVX-512 vectorized, human-readable, stand-alone C implementations of convolutional neural nets
The generated C code is an example of AVX-512 programming using GCC's AVX-512 intrinsics. AVX-512 is exciting because masking simplifies edge cases (partial loads, partial stores, etc.), it provides 32 wide vector registers, and it has really excellent shuffle/permutation instructions (in particular, the two-input permute by variable). Recent versions of GCC produce very good object code from C intrinsics
The goal of NN-512 is efficient neural net inference on inexpensive, CPU-only cloud instances. For example, a Skylake-X cloud compute instance costs $10 per CPU-core per month at Vultr, and the NN-512 generated code does about 18 DenseNet121 inferences per CPU-core per second (in series, not batched)
As AVX-512 becomes better supported by Intel and AMD chips, it becomes more attractive as an alternative to expensive GPU instances for workloads with small amounts of inference mixed with other computation