
Implementation of Parametric

Haar-like Transformations on FPGA

Mikko Koverola

School of Electrical Engineering

Thesis submitted for examination for the degree of Master of

Science in Technology.

Espoo 14.5.2018

Supervisor

Prof. Jussi Ryynänen

Advisor

Ph.D. David Guevorkian


Copyright © 2018 Mikko Koverola


of the C++ code is translated correctly to the RTL model, a channel class can be used in Catapult HLS. The ac_channel&lt;T&gt; class defines FIFO data transfer between the RTL model's processes that are generated from the C++ functions. The template parameter T defines the data type that is passed through the channel. For example, ac_channel&lt;ac_fixed&lt;4,2,false&gt; &gt; defines a channel that passes unsigned fixed-point values. [38]

3.2.2 Processing Elements

A high-performance PE is the enabling factor for a high-performance Haar-like transformation, and the inverse square root logic block lies at the heart of an efficient PE. As discussed in Chapter 2, it is of utmost importance to choose a well-suited calculation method for the inverse square root logic. Catapult HLS 10.2/754530 provides two methods for implementing the inverse square root: one is to implement the functionality with the standard AC square root and division functions, and the other is a piecewise linear (PWL) inverse square root function included in the AC math libraries.

The PWL inverse square root function is a polynomial approximation method that can process both fixed- and floating-point inputs. The square root function, on the other hand, is implemented as an iterative calculation and accepts only fixed-point inputs. For the fixed-point PE implementation, the results suggest that the PWL inverse square root function uses fewer resources than the square root and division functions, but it also produces less accurate outputs than the two functions combined. The results are discussed in Section 4.1.

The accuracy constraints for the PEs depend largely on how the Haar-like transformation is used. If information about the expected input range is available, it can be exploited in the design process. For example, if it is known that the transformations will process input vectors whose elements' magnitudes vary only from 0 to 1, the PEs can be designed to accommodate this range of inputs with a given level of accuracy while minimizing the bitwidths. For the investigation done in this master's thesis, the input test vectors were restricted to integer values between 1 and 32.

The fixed-point implementation of the PEs was done similarly to the VHDL models. Conceptually, the same architecture as in Figure 15 applies, although without the delay logic. An example of the C++ implementation of the real-number fixed-point PEs is presented below. The names of the type definitions for each operation's output type are included in the example code comments.

Each operation's output type is defined to be an AC fixed-point number. By adjusting the lengths of the integer and fractional parts of the output types, the accuracy level of the whole PE can be defined. Also, by defining the output types as AC complex types, easy migration to a complex-number implementation can be achieved. The bitwidths used for each output type for every transformation size are presented in Appendix B.


void PE(
    input_type in1,
    input_type in2,
    int K_ind1,
    int K_ind2,
    ctrl_type gen,
    ctrl_type sel,
    output_type &out1,
    output_type &out2
){
    if (gen == 1) { // Kernel generation logic
        if (!((in1 == 0) && (in2 == 0))) {
            sqr1 = in1*in1;              // sqr1 = sqr_type
            sqr2 = in2*in2;              // sqr2 = sqr_type
            Temp = sqr1 + sqr2;          // Temp = sum_type
            sqrt(Temp, sqroot);          // sqroot = sqroot_type
            div(One, sqroot, invNorm);   // invNorm = norm_type
        }
        if ((in1 == 0) && (in2 == 0)) {  // If norm is zero, identity kernels
            K[K_ind1] = 1;               // K = kernel_type
            K[K_ind2] = 1;               // K = kernel_type
        } else if (sel == 1) {           // Generate bypass kernels
            K[K_ind1] = 0;               // K = kernel_type
            K[K_ind2] = 1;               // K = kernel_type
        } else {                         // Generate kernels
            K[K_ind1] = in1*invNorm;     // K = kernel_type
            K[K_ind2] = in2*invNorm;     // K = kernel_type
        }
    }
    // Multiplication logic
    out1 = (K[K_ind1]*in1) + (K[K_ind2]*in2);
    out2 = (K[K_ind2]*in1) - (K[K_ind1]*in2);
}

In the PEs, the most complicated calculations are done in the kernel generation logic. Floating-point MATLAB reference calculations were generated for the operations in the kernel generation logic, and the reference output values were compared with the output values of the fixed-point operations. The accuracy of the C++ fixed-point values was set by first determining the needed integer part widths and then adjusting the bitwidth of the fractional part to achieve the desired accuracy level (see Figure 16). The accuracy and bitwidth optimization results are discussed in Section 4.1.1 for the real-number and in Section 4.1.2 for the complex-number implementations.

The Catapult HLS tool is able to unroll loops to parallelize the loop iterations. This is done either by using a pragma or by setting the unrolling option in the tool GUI. To investigate latency reductions, the iterative square root and division


C Accuracy Results for PE Operations

Table C1: The accuracy results for the fixed-point real number PE implementations’

kernel generation calculations.

Operation output   PE, Sqrt&Div    PE, Inv. Sqrt   PE, Reference
In1                1.0000000000    1.0000000000    1.0000000000
In2                2.0000000000    2.0000000000    2.0000000000
Inverse Norm       0.4472122192    0.4467773437    0.4472135954
Kernel1            0.4472045898    0.4467773437    0.4472135954
Kernel2            0.8944396972    0.8935546875    0.8944271909
Out1               2.2360839843    2.2338867187    2.2360679774
Out2               0.0000000000    0.0000000000    0.0000000000

Table C2: The accuracy results for the fixed-point complex number PE implementations' kernel generation calculations.

Operation output   PE, Sqrt&Div      PE, Inv. Sqrt    PE, Reference
In1 Re             1.0000000000000   1.000000000000   1.000000000000
In1 Im             2.0000000000000   2.000000000000   2.000000000000
In2 Re             2.0000000000000   2.000000000000   2.000000000000
In2 Im             3.0000000000000   3.000000000000   3.000000000000
Inverse Norm       0.2357025146484   0.235534667968   0.235702260395
Kernel1 Re         0.2357025146484   0.235534667968   0.235702260395
Kernel1 Im         0.4714050292968   0.471069335937   0.471404520791
Kernel2 Re         0.4714050292968   0.471069335937   0.471404520791
Kernel2 Im         0.7071075439453   0.706604003906   0.707106781186
Out1 Re            4.2426452636718   4.239624023437   4.242640687119
Out1 Im            0.0000000000000   0.000000000000   0.000000000000
Out2 Re            0.0000000000000   0.000000000000   0.000000000000
Out2 Im            0.0000000000000   0.000000000000   0.000000000000


Table C3: The accuracy results for the fixed-point and floating-point real number PE implementations' kernel generation calculations when using the PWL inverse square root function.

Operation output   Fixed-point     Floating-point   Reference
In1                1.0000000000    1.0000000000     1.0000000000
In2                2.0000000000    2.0000000000     2.0000000000
Inverse Norm       0.4467773437    0.4468140602     0.4472135954
Kernel1            0.4467773437    0.4468140602     0.4472135954
Kernel2            0.8935546875    0.8936281204     0.8944271909
Out1               2.2338867187    2.2340707778     2.2360679774
Out2               0.0000000000    0.0000000000     0.0000000000
