1×1 Convolution based on the RISC-V Vector Extension

We have created a 1×1 convolution kernel based on the RISC-V Vector Extension (RVV) and evaluated its performance using an RTL simulator.
Related articles:
- Running Auto-Vectorized Program on RISC-V Vector RTL Simulator
- Matrix Multiplication based on the RISC-V Vector Extension
- 1×1 Convolution based on the RISC-V Vector Extension (this article)
1×1 Convolution
For an input of size H×W×C, a 1×1 convolution takes the C-element vector at each spatial position (h, w) and computes its dot product with a filter of size 1×1×C, once for each output channel.
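Written as a formula, and assuming the same layouts and input_offset term as the reference code in this article, each output element is a single sum over the input channels. The index names b, h, w, c, and o are chosen here for illustration only:

```latex
\mathrm{output}[b, h, w, o] \;=\; \sum_{c=0}^{C-1}
  \bigl(\mathrm{input}[b, h, w, c] + \mathrm{input\_offset}\bigr)
  \cdot \mathrm{filter}[o, 0, 0, c]
```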
The scalar code before vectorization, written with reference to TensorFlow Lite for Microcontrollers, looks like this:
// output_data = 1x1_conv((input_data + input_offset), filter_data) with
// input_data = [B, H, W, C], filter_data = [OC, 1, 1, C],
// output_data = [B, H, W, OC]
void OneByOneConvInt8(const int8_t* input_data, const int8_t* filter_data,
                      int32_t* output_data, const int32_t input_offset, ...) {
  ...
  for (int batch = 0; batch < batches; ++batch) {
    for (int out_y = 0; out_y < output_height; ++out_y) {
      const int in_y = out_y;
      for (int out_x = 0; out_x < output_width; ++out_x) {
        const int in_x = out_x;
        for (int out_channel = 0; out_channel < output_depth; ++out_channel) {
          int32_t acc = 0;
          for (int in_channel = 0; in_channel < input_depth; ++in_channel) {
            int32_t input_val =
                input_data[Offset(input_shape, batch, in_y, in_x, in_channel)];
            int32_t filter_val = filter_data[Offset(filter_shape, out_channel,
                                                    0, 0, in_channel)];
            acc += (input_val + input_offset) * filter_val;
          }
          output_data[Offset(output_shape, batch, out_y, out_x, out_channel)] =
              acc;
        }
      }
    }
  }
}
Note that input_data and filter_data are int8_t arrays, while output_data is an int32_t array.
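To make the data layout concrete, here is a minimal self-contained sketch of the same scalar computation. It is not the article's kernel: the helper Offset4, the function name OneByOneConvInt8Ref, and the way shapes are passed as plain int[4] arrays are all assumptions made for illustration, standing in for the elided parts of the listing above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: flat index into a dense 4-D array with extents dims[0..3].
 * It plays the role that Offset() plays in the listing above. */
size_t Offset4(const int dims[4], int i0, int i1, int i2, int i3) {
  return (((size_t)i0 * dims[1] + i1) * dims[2] + i2) * dims[3] + i3;
}

/* Scalar 1x1 convolution sketch.
 * input  is [B, H, W, C]   (int8_t), described by input_dims
 * filter is [OC, 1, 1, C]  (int8_t)
 * output is [B, H, W, OC]  (int32_t) */
void OneByOneConvInt8Ref(const int8_t* input_data, const int input_dims[4],
                         const int8_t* filter_data, int output_depth,
                         int32_t input_offset, int32_t* output_data) {
  const int batches = input_dims[0];
  const int height = input_dims[1];
  const int width = input_dims[2];
  const int input_depth = input_dims[3];
  assert(input_depth > 0 && output_depth > 0);

  const int filter_dims[4] = {output_depth, 1, 1, input_depth};
  const int output_dims[4] = {batches, height, width, output_depth};

  for (int b = 0; b < batches; ++b) {
    for (int y = 0; y < height; ++y) {
      for (int x = 0; x < width; ++x) {
        for (int oc = 0; oc < output_depth; ++oc) {
          /* Accumulate in 32 bits: int8 values are widened before the multiply. */
          int32_t acc = 0;
          for (int ic = 0; ic < input_depth; ++ic) {
            int32_t input_val = input_data[Offset4(input_dims, b, y, x, ic)];
            int32_t filter_val = filter_data[Offset4(filter_dims, oc, 0, 0, ic)];
            acc += (input_val + input_offset) * filter_val;
          }
          output_data[Offset4(output_dims, b, y, x, oc)] = acc;
        }
      }
    }
  }
}
```

For example, with a 1×1×2×2 input {1, 2, 3, 4}, two filters {1, 1} and {1, -1}, and input_offset = 1, this sketch writes {5, -1, 9, -1} to the output.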
Ara
Ara is an implementation of the RISC-V Vector Extension developed by the Parallel Ultra Low Power (PULP) project. For an overview of Ara, see the related article Matrix Multiplication based on the RISC-V Vector Extension.
1×1 Convolution on the Ara RTL Simulator
The Ara repository can build an RTL simulator, using Verilator, for the Ara system, which combines the CVA6 scalar core with the Ara vector unit. This time, we used the minimal configuration, 2_lanes.mk, which has two 64-bit vector units.
Below is the console output from running the 1×1 convolution kernel.
$ cd $ARA/hardware
$ app=1x1_conv_int8 make simv
...
===================
= 1X1 CONV INT8 =
===================
--------------------------------------------------------------------
Calculating a (1 x 16 x 16 x 32) x (32 x 1 x 1 x 32) convolution...
--------------------------------------------------------------------
Initializing data...
Calculating 1x1 conv without vector extension...
The execution took 3300106 cycles.
The performance is 0.159 OP/cycle.
Calculating 1x1 conv with vector extension...
The execution took 141214 cycles.
The performance is 3.713 OP/cycle.
Verifying result...
Passed.
--------------------------------------------------------------------
Calculating a (1 x 8 x 8 x 64) x (64 x 1 x 1 x 64) convolution...
--------------------------------------------------------------------
Initializing data...
Calculating 1x1 conv without vector extension...
The execution took 3224180 cycles.
The performance is 0.163 OP/cycle.
Calculating 1x1 conv with vector extension...
The execution took 109951 cycles.
The performance is 4.768 OP/cycle.
Verifying result...
Passed.
--------------------------------------------------------------------
Calculating a (1 x 4 x 4 x 128) x (128 x 1 x 1 x 128) convolution...
--------------------------------------------------------------------
Initializing data...
Calculating 1x1 conv without vector extension...
The execution took 3247290 cycles.
The performance is 0.161 OP/cycle.
Calculating 1x1 conv with vector extension...
The execution took 103775 cycles.
The performance is 5.052 OP/cycle.
Verifying result...
Passed.
For 32, 64, and 128 input/output channels, the vectorized kernel runs about 23.4, 29.3, and 31.3 times faster, respectively, than the scalar code on CVA6, a speedup of 23 to 31 times.
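The OP/cycle and speedup figures can be reproduced from the cycle counts with a few lines of arithmetic. The sketch below assumes that each multiply-accumulate is counted as two operations, which is consistent with the reported numbers (for example, 524288 / 3300106 ≈ 0.159) but is not stated by the benchmark itself. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Operation count of a 1x1 convolution, assuming one multiply-accumulate
 * counts as two operations (an assumption that matches the log above). */
int64_t Conv1x1Ops(int64_t batches, int64_t height, int64_t width,
                   int64_t input_depth, int64_t output_depth) {
  return 2 * batches * height * width * input_depth * output_depth;
}

/* Throughput in operations per cycle. */
double OpsPerCycle(int64_t ops, int64_t cycles) {
  assert(cycles > 0);
  return (double)ops / (double)cycles;
}

/* Speedup of the vector run over the scalar run, from their cycle counts. */
double Speedup(int64_t scalar_cycles, int64_t vector_cycles) {
  assert(vector_cycles > 0);
  return (double)scalar_cycles / (double)vector_cycles;
}
```

Under this counting, all three benchmark shapes perform the same 524288 operations, so the differences in OP/cycle come entirely from the cycle counts. For instance, Speedup(3300106, 141214) is roughly 23.4 for the 32-channel case.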
Summary
We created a 1×1 convolution kernel based on the RISC-V Vector Extension and evaluated its performance using Ara's RTL simulator, where it ran 23 to 31 times faster than the scalar code on CVA6.


