Difference between revisions of "NEON HelloWorld"

Latest revision as of 17:42, 1 June 2018

Developing with SIMD instructions or intrinsics need a bit of experience. That is a complete different way of thinking and programming.

Prerequisites

Install a emulation environment
Build a GCC toolchain which support NEON intrinsics

Let's go programming

Here is a brief example of what is possible with SIMD programming. This piece of code only add the value "3" to each value of the SIMD vector.

On the Cortex-A platform there is both 64 bits and 128 bits vector registers. Here we use 128 bits ones then we can code sixteen 8 bits values (unsigned integers).

Warning: Not all ARM Cortex processors have a NEON unit and old ARM processors may have a SIMD unit not compliant with NEON (cf. ARM reference manuals).

Source code

#include <stdio.h>

#include "arm_neon.h"

void add3 (uint8x16_t *data) {
    /* Set each sixteen values of the vector to 3.
     *
     * Remark: a 'q' suffix to intrinsics indicates
     * the instruction run for 128 bits registers.
     */
    uint8x16_t three = vmovq_n_u8 (3);

    /* Add 3 to the value given in argument. */
    *data = vaddq_u8 (*data, three);
}

void print_uint8 (uint8x16_t data, char* name) {
    int i;
    static uint8_t p[16];

    vst1q_u8 (p, data);

    printf ("%s = ", name);
    for (i = 0; i < 16; i++) {
	printf ("%02d ", p[i]);
    }
    printf ("\n");
}

int main () {
    /* Create custom arbitrary data. */
    const uint8_t uint8_data[] = { 1, 2, 3, 4, 5, 6, 7, 8,
				   9, 10, 11, 12, 13, 14, 15, 16 };

    /* Create the vector with our data. */
    uint8x16_t data;

    /* Load our custom data into the vector register. */
    data = vld1q_u8 (uint8_data);

    print_uint8 (data, "data");
    
    /* Call of the add3 function. */
    add3(&data);

    print_uint8 (data, "data (new)");
    
    return 0;
}

What is done ?

We first create custom data: an array of 16 uint8_t values.
We load the custom values in a vector register through the vld1q_u8 intrinsic (note the 'q' suffix which means we use 128 registers)
We call our add3 function

And in the add3 function ?

Because we use SIMD vectors and want to add 3 to each value we need to set the same value (3) with the vmovq_n_u8 intrinsic
We call the vaddq_u8 intrinsic to do the addition

Note: all sixteen values of each vector are compute in real parallelism

Result

$ ./a.out 
data = 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 
data (new) = 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19

How to compile ?

The gcc options to give in order to compile NEON intrinsics in case of an ARM Cortex-A8/9 is :

$ ./buildroot/output/host/usr/bin/arm-linux-gcc -mfpu=neon -mcpu=cortex-a9 neon.c    (APF6)
$ ./buildroot/output/host/usr/bin/arm-linux-gcc -mfpu=neon -mcpu=cortex-a8 neon.c    (APF51)

Note: Do not forget to include the "arm_neon.h" header file.

Links

@@ Line 13: / Line 13: @@
 {{Warning | Not all ARM Cortex processors have a NEON unit and old ARM processors may have a SIMD unit not compliant with NEON (cf. ARM reference manuals).}}
+=== Source code ===
 <source lang="c">
@@ Line 45: / Line 47: @@
 int main () {
-    int i;
      /* Create custom arbitrary data. */
      const uint8_t uint8_data[] = { 1, 2, 3, 4, 5, 6, 7, 8,
 , 10, 11, 12, 13, 14, 15, 16 };
      /* Create the vector with our data. */
      uint8x16_t data;
@@ Line 67: / Line 68: @@
 </source>
-What is done ?
+=== What is done ? ===
 * We first create custom data: an array of 16 uint8_t values.
 * We load the custom values in a vector register through the '''vld1q_u8''' intrinsic (note the 'q' suffix which means we use 128 registers)
 * We call our '''add3''' function
-And in the '''add3''' function ?
+=== And in the '''add3''' function ? ===
 * Because we use SIMD vectors and want to add 3 to each value we need to set the same value (3) with the '''vmovq_n_u8''' intrinsic
 * We call the '''vaddq_u8''' intrinsic to do the addition
@@ Line 78: / Line 79: @@
 {{Note | all sixteen values of each vector are compute in real parallelism}}
+=== Result ===
+<pre class="host">
+$ ./a.out
+data = 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16
+data (new) = 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19
+</pre>
 === How to compile ? ===
-The gcc options to give in order to compile NEON intrinsics in case of an ARM Cortex-A8 is :
+The gcc options to give in order to compile NEON intrinsics in case of an ARM Cortex-A8/9 is :
 <pre class="host">
-$ gcc -mcpu=cortex-a8 -mfloat-abi=softfp -mfpu=neon <sources>
+$ ./buildroot/output/host/usr/bin/arm-linux-gcc -mfpu=neon -mcpu=cortex-a9 neon.c    (APF6)
+$ ./buildroot/output/host/usr/bin/arm-linux-gcc -mfpu=neon -mcpu=cortex-a8 neon.c    (APF51)
 </pre>
 {{Note | Do not forget to include the "arm_neon.h" header file.}}
+== Links ==
+* [http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html GCC NEON intrinsics]
+* [http://code.google.com/p/math-neon/source/browse/#svn/trunk Example of a mathematic library using NEON]
+[[Category:Software]]
+[[Category:Architecture dependent]]

Difference between revisions of "NEON HelloWorld"

Latest revision as of 17:42, 1 June 2018

Contents

Prerequisites

Let's go programming

Source code

What is done ?

And in the add3 function ?

Result

How to compile ?

Links

Navigation menu

Views

Personal tools

Navigation

Support/community

Development

Search

Tools