1×1 Convolution based on the RISC-V Vector Extension

We have created a 1×1 convolution kernel based on the RISC-V Vector Extension (RVV) and evaluated its performance using an RTL simulator.

Related articles:

Running Auto-Vectorized Program on RISC-V Vector RTL Simulator
Matrix Multiplication based on the RISC-V Vector Extension
1×1 Convolution based on the RISC-V Vector Extension (this article)

1×1 Convolution

For each spatial position (h, w) of an H×W×C input, a 1×1 convolution computes the dot product of the C-element channel vector at that position with a 1×1×C filter, once per output channel.

The scalar code before vectorization, written with reference to TensorFlow Lite for Microcontrollers, looks like this:

// output_data = 1x1_conv((input_data + input_offset), filter_data) with
// input_data = [B, H, W, C], filter_data = [OC, 1, 1, C],
// output_data = [B, H, W, OC]
void OneByOneConvInt8(const int8_t* input_data, const int8_t* filter_data,
                      int32_t* output_data, const int32_t input_offset, ...) {
  ...
  for (int batch = 0; batch < batches; ++batch) {
    for (int out_y = 0; out_y < output_height; ++out_y) {
      const int in_y = out_y;
      for (int out_x = 0; out_x < output_width; ++out_x) {
        const int in_x = out_x;
        for (int out_channel = 0; out_channel < output_depth; ++out_channel) {
          int32_t acc = 0;
          for (int in_channel = 0; in_channel < input_depth; ++in_channel) {
            int32_t input_val =
                input_data[Offset(input_shape, batch, in_y, in_x, in_channel)];
            int32_t filter_val = filter_data[Offset(filter_shape, out_channel,
                                                    0, 0, in_channel)];
            acc += (input_val + input_offset) * filter_val;
          }
          output_data[Offset(output_shape, batch, out_y, out_x, out_channel)] =
              acc;
        }
      }
    }
  }
}

Note that input_data and filter_data are arrays of int8_t, while output_data is an array of int32_t.
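As a self-contained illustration of the loop nest above, here is a minimal sketch that replaces the elided parts with assumptions: a flat, contiguous layout, explicit dimension parameters instead of shape objects, and a hypothetical function name, OneByOneConvInt8Flat. It is not the kernel measured in this article nor the TensorFlow Lite for Microcontrollers code, only a small version with the same arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical flat-index variant of the loop nest above. Assumed layout:
// input [B, H, W, C], filter [OC, 1, 1, C], output [B, H, W, OC], all
// contiguous in memory.
void OneByOneConvInt8Flat(const int8_t* input_data, const int8_t* filter_data,
                          int32_t* output_data, int32_t input_offset,
                          int batches, int height, int width,
                          int input_depth, int output_depth) {
  assert(input_data && filter_data && output_data);
  // A 1x1 filter never mixes neighboring pixels, so batch, y and x can be
  // collapsed into a single pixel index.
  const size_t pixels = (size_t)batches * (size_t)height * (size_t)width;
  for (size_t p = 0; p < pixels; ++p) {
    const int8_t* in = input_data + p * (size_t)input_depth;
    int32_t* out = output_data + p * (size_t)output_depth;
    for (int oc = 0; oc < output_depth; ++oc) {
      const int8_t* w = filter_data + (size_t)oc * (size_t)input_depth;
      // Accumulate in int32_t: the products do not fit in int8_t.
      int32_t acc = 0;
      for (int c = 0; c < input_depth; ++c) {
        acc += ((int32_t)in[c] + input_offset) * (int32_t)w[c];
      }
      out[oc] = acc;
    }
  }
}
```

Seen this way, the computation is a matrix multiplication of a (B·H·W) × C input matrix by a C × OC filter matrix, with input_offset added to every input element first.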

Ara

Ara, developed by the Parallel Ultra Low Power (PULP) project, is an implementation of the RISC-V Vector Extension. For an overview of Ara, see the related article Matrix Multiplication based on the RISC-V Vector Extension.

1×1 Convolution on Ara RTL Simulator

The Ara repository can build an RTL simulator, using Verilator, for an Ara system that combines CVA6 and Ara. This time, we used the minimal configuration, 2_lanes.mk, which has two 64-bit vector units.

Below is the console output when running the 1x1 convolution kernel.

$ cd $ARA/hardware
$ app=1x1_conv_int8 make simv

...

===================
=  1X1 CONV INT8  =
===================

--------------------------------------------------------------------
Calculating a (1 x 16 x 16 x 32) x (32 x 1 x 1 x 32) convolution...
--------------------------------------------------------------------

Initializing data...
Calculating 1x1 conv without vector extension...
The execution took 3300106 cycles.
The performance is 0.159 OP/cycle.
Calculating 1x1 conv with vector extension...
The execution took 141214 cycles.
The performance is 3.713 OP/cycle.
Verifying result...
Passed.

--------------------------------------------------------------------
Calculating a (1 x 8 x 8 x 64) x (64 x 1 x 1 x 64) convolution...
--------------------------------------------------------------------

Initializing data...
Calculating 1x1 conv without vector extension...
The execution took 3224180 cycles.
The performance is 0.163 OP/cycle.
Calculating 1x1 conv with vector extension...
The execution took 109951 cycles.
The performance is 4.768 OP/cycle.
Verifying result...
Passed.

--------------------------------------------------------------------
Calculating a (1 x 4 x 4 x 128) x (128 x 1 x 1 x 128) convolution...
--------------------------------------------------------------------

Initializing data...
Calculating 1x1 conv without vector extension...
The execution took 3247290 cycles.
The performance is 0.161 OP/cycle.
Calculating 1x1 conv with vector extension...
The execution took 103775 cycles.
The performance is 5.052 OP/cycle.
Verifying result...
Passed.

With 32, 64, and 128 input/output channels, the vectorized kernel runs about 23, 29, and 31 times faster, respectively, than the scalar code on CVA6.
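The OP/cycle and speedup figures can be reproduced from the cycle counts in the log. The sketch below assumes that each multiply-accumulate counts as two operations, which is consistent with the reported numbers; the helper names are hypothetical and are not part of the benchmark program.

```c
#include <assert.h>
#include <stdint.h>

// Operation count of a 1x1 convolution, assuming one multiply-accumulate
// counts as two operations (an assumption that matches the log above).
int64_t ConvOps(int64_t batches, int64_t height, int64_t width,
                int64_t input_depth, int64_t output_depth) {
  return 2 * batches * height * width * input_depth * output_depth;
}

// Throughput in operations per cycle.
double OpsPerCycle(int64_t ops, int64_t cycles) {
  assert(cycles > 0);
  return (double)ops / (double)cycles;
}

// Speedup of the vector run over the scalar run.
double Speedup(int64_t scalar_cycles, int64_t vector_cycles) {
  assert(vector_cycles > 0);
  return (double)scalar_cycles / (double)vector_cycles;
}
```

For the first configuration, ConvOps(1, 16, 16, 32, 32) is 524288, OpsPerCycle(524288, 141214) is about 3.713, and Speedup(3300106, 141214) is about 23.4. All three configurations have the same operation count, so their cycle counts can be compared directly.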