Difference between revisions of "NEON HelloWorld"
From ArmadeusWiki
(Add a "Hello World" example with SIMD intrinsics) |
(→How to compile ?) |
||
(8 intermediate revisions by one other user not shown) | |||
Line 13: | Line 13: | ||
{{Warning | Not all ARM Cortex processors have a NEON unit and old ARM processors may have a SIMD unit not compliant with NEON (cf. ARM reference manuals).}} | {{Warning | Not all ARM Cortex processors have a NEON unit and old ARM processors may have a SIMD unit not compliant with NEON (cf. ARM reference manuals).}} | ||
+ | |||
+ | === Source code === | ||
<source lang="c"> | <source lang="c"> | ||
Line 45: | Line 47: | ||
int main () { | int main () { | ||
− | |||
− | |||
/* Create custom arbitrary data. */ | /* Create custom arbitrary data. */ | ||
const uint8_t uint8_data[] = { 1, 2, 3, 4, 5, 6, 7, 8, | const uint8_t uint8_data[] = { 1, 2, 3, 4, 5, 6, 7, 8, | ||
9, 10, 11, 12, 13, 14, 15, 16 }; | 9, 10, 11, 12, 13, 14, 15, 16 }; | ||
+ | |||
/* Create the vector with our data. */ | /* Create the vector with our data. */ | ||
uint8x16_t data; | uint8x16_t data; | ||
Line 67: | Line 68: | ||
</source> | </source> | ||
− | What is done ? | + | === What is done ? === |
* We first create custom data: an array of 16 uint8_t values. | * We first create custom data: an array of 16 uint8_t values. | ||
* We load the custom values in a vector register through the '''vld1q_u8''' intrinsic (note the 'q' suffix which means we use 128 registers) | * We load the custom values in a vector register through the '''vld1q_u8''' intrinsic (note the 'q' suffix which means we use 128 registers) | ||
* We call our '''add3''' function | * We call our '''add3''' function | ||
− | And in the '''add3''' function ? | + | === And in the '''add3''' function ? === |
* Because we use SIMD vectors and want to add 3 to each value we need to set the same value (3) with the '''vmovq_n_u8''' intrinsic | * Because we use SIMD vectors and want to add 3 to each value we need to set the same value (3) with the '''vmovq_n_u8''' intrinsic | ||
* We call the '''vaddq_u8''' intrinsic to do the addition | * We call the '''vaddq_u8''' intrinsic to do the addition | ||
Line 78: | Line 79: | ||
{{Note | all sixteen values of each vector are compute in real parallelism}} | {{Note | all sixteen values of each vector are compute in real parallelism}} | ||
+ | |||
+ | === Result === | ||
+ | |||
+ | <pre class="host"> | ||
+ | $ ./a.out | ||
+ | data = 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 | ||
+ | data (new) = 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 | ||
+ | </pre> | ||
=== How to compile ? === | === How to compile ? === | ||
− | The gcc options to give in order to compile NEON intrinsics in case of an ARM Cortex-A8 is : | + | The gcc options to give in order to compile NEON intrinsics in case of an ARM Cortex-A8/9 is : |
<pre class="host"> | <pre class="host"> | ||
− | $ gcc -mcpu=cortex- | + | $ ./buildroot/output/host/usr/bin/arm-linux-gcc -mfpu=neon -mcpu=cortex-a9 neon.c (APF6) |
+ | $ ./buildroot/output/host/usr/bin/arm-linux-gcc -mfpu=neon -mcpu=cortex-a8 neon.c (APF51) | ||
</pre> | </pre> | ||
{{Note | Do not forget to include the "arm_neon.h" header file.}} | {{Note | Do not forget to include the "arm_neon.h" header file.}} | ||
+ | |||
+ | == Links == | ||
+ | |||
+ | * [http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html GCC NEON intrinsics] | ||
+ | * [http://code.google.com/p/math-neon/source/browse/#svn/trunk Example of a mathematic library using NEON] | ||
+ | |||
+ | |||
+ | [[Category:Software]] | ||
+ | [[Category:Architecture dependent]] |
Latest revision as of 17:42, 1 June 2018
Developing with SIMD instructions or intrinsics need a bit of experience. That is a complete different way of thinking and programming.
Contents
Prerequisites
- Install a emulation environment
- Build a GCC toolchain which support NEON intrinsics
Let's go programming
Here is a brief example of what is possible with SIMD programming. This piece of code only add the value "3" to each value of the SIMD vector.
On the Cortex-A platform there is both 64 bits and 128 bits vector registers. Here we use 128 bits ones then we can code sixteen 8 bits values (unsigned integers).
Warning: Not all ARM Cortex processors have a NEON unit and old ARM processors may have a SIMD unit not compliant with NEON (cf. ARM reference manuals). |
Source code
#include <stdio.h>
#include "arm_neon.h"
void add3 (uint8x16_t *data) {
/* Set each sixteen values of the vector to 3.
*
* Remark: a 'q' suffix to intrinsics indicates
* the instruction run for 128 bits registers.
*/
uint8x16_t three = vmovq_n_u8 (3);
/* Add 3 to the value given in argument. */
*data = vaddq_u8 (*data, three);
}
void print_uint8 (uint8x16_t data, char* name) {
int i;
static uint8_t p[16];
vst1q_u8 (p, data);
printf ("%s = ", name);
for (i = 0; i < 16; i++) {
printf ("%02d ", p[i]);
}
printf ("\n");
}
int main () {
/* Create custom arbitrary data. */
const uint8_t uint8_data[] = { 1, 2, 3, 4, 5, 6, 7, 8,
9, 10, 11, 12, 13, 14, 15, 16 };
/* Create the vector with our data. */
uint8x16_t data;
/* Load our custom data into the vector register. */
data = vld1q_u8 (uint8_data);
print_uint8 (data, "data");
/* Call of the add3 function. */
add3(&data);
print_uint8 (data, "data (new)");
return 0;
}
What is done ?
- We first create custom data: an array of 16 uint8_t values.
- We load the custom values in a vector register through the vld1q_u8 intrinsic (note the 'q' suffix which means we use 128 registers)
- We call our add3 function
And in the add3 function ?
- Because we use SIMD vectors and want to add 3 to each value we need to set the same value (3) with the vmovq_n_u8 intrinsic
- We call the vaddq_u8 intrinsic to do the addition
Result
$ ./a.out data = 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 data (new) = 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19
How to compile ?
The gcc options to give in order to compile NEON intrinsics in case of an ARM Cortex-A8/9 is :
$ ./buildroot/output/host/usr/bin/arm-linux-gcc -mfpu=neon -mcpu=cortex-a9 neon.c (APF6) $ ./buildroot/output/host/usr/bin/arm-linux-gcc -mfpu=neon -mcpu=cortex-a8 neon.c (APF51)