

## Joël Cathébras

Vendredi 14 décembre 2018 Séminaire sécurité des systèmes électroniques embarqués Campus de Beaulieu, 263 avenue du Général Leclerc, Rennes

#### AN APPROACH FOR THE HARDWARE ACCELERATION OF HOMOMORPHIC CRYPTOGRAPHY



## THE PERFECT NEUTRAL MATCHMAKER



#### Homomorphic encryption: processing encrypted data without decrypting them!

## A MORE SERIOUS EXAMPLE: THIRD-PARTY MEDICAL MONITORING



With classical cryptography

list

Ceatech

With homomorphic cryptography



**Decryption is an homomorphism:** for all ciphertexts ct<sub>a</sub> and ct<sub>b</sub>

 $Dec(ct_a \otimes ct_b) = Dec(ct_a) * Dec(ct_b) = m_a * m_b$ 

Cleartext space Ciphertext space Message space err  $\sigma$  $(\mathcal{H}, +_{\mathcal{H}}, \times_{\mathcal{H}})$  $(\mathcal{C}, +_{\mathcal{C}}, \times_{\mathcal{C}})$  $(\{0,1\}, \oplus, \wedge)$ Encode Encrypt  $C_b$  $C_a$ b  $m_a$ a  $m_b$  $\oplus$ Decrypt Decode  $a \oplus b$  $a \wedge b$  $m_{a \oplus b}$  $m_{a \wedge b}$  $C_{a \oplus b} = C_{a \wedge b}$ → **Ж** *C → m* C

Homomorphic encryption scheme

**Theoretical problematics** Find decryption homomorphism Noise management for correctness

list

Ceatech

**Different generation of HE schemes** 

#### **Practical problematics**

Data size expansion (1 bit  $\rightarrow \sim 10$  kbit) Computational complexity (1 AND  $\rightarrow \sim ms$ )

#### **Need hardware acceleration**



## STATE OF THE ART OF HOMOMORPHIC SCHEMES



■ Misc. ■ A-GCD ■ NTRU/NTRU' ■ LWE ■ RLWE ■ TLWE



## MAIN PROBLEMATIC OF RLWE-BASED L-FHE SCHEMES

**Ring Learning With Errors:** handling polynomial ring elements



**Decision**: Distinguish (*A*, *B*) from a random pair in  $R_a^2$ 

Security and correctness depend on parameters:



*n* increases security & *q* increases correctness

#### Leveled-FHE parameters depend on application:

**Grow** with the complexity of encrypted evaluation

Flexible acceleration strategy for  $\begin{cases} Polynomial arithmetic over <math>R_q \\ Integer arithemetic over \mathbb{Z}_q \end{cases}$ 



# STATE OF THE ART OF HARDWARE ACCELERATION FOR RLWE-BASED L-FHE SCHEMES





Among related works on coupling RNS and NTT strategies

Öztürk et al. 2015 Memory-access iterative NTT External computation of twiddle factors Doubling communication bandwidth Cousins et al. 2017

Dataflow oriented NTT Local storage of twiddle factors **High storage cost**  Sinha Roy et al. 2015

Memory-access iterative NTT On-the-fly computation of twiddle factors. Generation insert bubbles

impacting NTT throughput

#### Dataflow oriented NTT-based convolutions with on-the-fly computation of twiddle factors



Analysis of the FV scheme towards its hardware acceleration.

Proposal of a data flow oriented Residue Polynomial Multiplier (RPM).

In-depth on our design of fully-streaming multi-field NTT circuits.

Analysis of the scalability of our NTT-based RPM.

Other contributions, conclusion and perspectives.



## ANALYSIS OF FV TOWARD ITS HARDWARE ACCELERATION (1) PROFILING OF FV HOMOMORPHIC EVALUATION

Performance bottlenecks of FV homomorphic evaluation? Profiling of an homomorphic evaluation of Trivium



FV implementation from the Cingulata compiler  $\lambda < 80, L = 19, n = 8192, \log_2 q = 913, \log_2 \sigma = 383$ Valgrind 3.10 on Intel Core i7-3770.

# Hardware acceleration should target ciphertext multiplications

## ANALYSIS OF FV TOWARD ITS HARDWARE ACCELERATION (2) FV PARAMETERS ANALYSIS

list Ceatech

> Is there parameter choices more suitable for hardware acceleration? Parameter analysis w.r.t. security and multiplicative depth requirements





### ANALYSIS OF FV TOWARD ITS HARDWARE ACCELERATION (3) PRESENTATION OF THE RNS/NTT STRATEGY

**Residue Number System (RNS):** 





## ANALYSIS OF FV TOWARD ITS HARDWARE ACCELERATION (4) VALIDATION OF THE RNS/NTT STRATEGY

Is RNS/NTT strategy possible up to very large FV parameter sets? Look for suitable RNS basis element respecting the strategy constraints



#### Enough primes > 32-bit for very large parameters ( $\lambda > 128, L \sim 100$ )



### THE FULL RNS VARIANT OF FV



#### DATA FLOW ORIENTED RPM THROUGH NEGATIVE WRAPPED CONVOLUTION ARCHITECTURE PRINCIPLE

- Negative Wrapped Convolution over  $\mathbb{Z}_{q_i}^n \Leftrightarrow \mathsf{RPM}$  over  $\mathbb{Z}_{q_i}[X]/(X^n+1)$ :
  - Let  $\psi_i$  be a *n*-th primitive root of -1 over  $\mathbb{Z}_{q_i}^*$ , exists if 2*n* divides  $q_i 1$ .

list

Clatech



#### High throughput NTT circuits with on-the-fly change of twiddle factor sets



### TOWARDS AUTOMATIC GENERATION OF MULTI-FIELD NTT DESIGN (1) PRINCIPLE OF A TWIDDLE BANK

SPIRAL DFT hardware generator  $\Rightarrow$  Multi-field NTT hardware generator

**Example of NTT (**r = 2, n = 16, w = 4**)** 





## TOWARDS AUTOMATIC GENERATION OF MULTI-FIELD NTT DESIGN (1) PRINCIPLE OF A TWIDDLE BANK

SPIRAL DFT hardware generator  $\Rightarrow$  Multi-field NTT hardware generator

**Example of NTT (**r = 2, n = 16, w = 4**)** 















data flow twiddle flow









How to make a TWB reprogrammable?



How to dispatch the twiddle factors in the TWB memory elements?





## AUTOMATIC GENERATION OF MULTI-FIELD NTT DESIGN (3) REPROGRAMMING OF TWIDDLE BANKS





## AUTOMATIC GENERATION OF MULTI-FIELD NTT DESIGN (4) TWIDDLE PATH SYNTHESIS RESULTS

| L: Multiplicative depth |  |
|-------------------------|--|
| $n: \deg(F)$            |  |
| $S_q: \log_2 q$         |  |

s: log<sub>2</sub> q<sub>i</sub> k, k': RNS basis sizes w: streaming width

|                           |   |                      | (                      | TWB                       |                        |                       |                        |
|---------------------------|---|----------------------|------------------------|---------------------------|------------------------|-----------------------|------------------------|
| (n, w, s)                 | G | Res.                 | %                      | Total                     | STWB                   | Misc.                 | 1112                   |
| $(2^{12}, 2, 30)$         | 4 | LUTs<br>FFs<br>BRAMs | $0.59 \\ 0.45 \\ 1.9$  | $2,567 \\ 3,912 \\ 28$    | $2,043 \\ 3,512 \\ 28$ | 523<br>400<br>–       | $562 \\944 \\7$        |
| (214, 2, 30)              | 4 | LUTs<br>FFs<br>BRAMs | $0.71 \\ 0.52 \\ 5.17$ | $3,055 \\ 4,495 \\ 76$    | $2,350 \\ 3,968 \\ 76$ | 705<br>× G            | $658 \\ 1,083 \\ 19$   |
| (2 <sup>14</sup> , 8, 30) | 4 | LUTs<br>FFs<br>BRAMs | $1.83 \\ 1.7 \\ 7.62$  | $7,950 \\ 14,658 \\ 112$  | 6,117<br>13,568<br>112 | 1, 30<br>994<br>–     | $1,581 \\ 3,458 \\ 28$ |
| $(2^{14}, 8, 46)$         | 4 | LUTs<br>FFs<br>BRAMs | $2.5 \\ 2.47 \\ 16.3$  | $10,\!826\\21,\!350\\240$ | 8,997<br>20,064<br>240 | $1,826 \\ 1,174 \\ -$ | $2,301 \\ 5,082 \\ 60$ |

#### Synthesis result of our twiddle path

Vivado 2018.1 targeting Xilinx Virtex 7 XC7VX690T-2-FFG1157C

## Comparison with local storage of all twiddle factors

| Parameters |          |       |    |      |   | Local | Storage | Our Twiddle Path |             |  |
|------------|----------|-------|----|------|---|-------|---------|------------------|-------------|--|
| L          | n        | $S_q$ | s  | k+k' | w | MB    | BRAM    | MB               | BRAM(%)     |  |
| 1          | $2^{11}$ | 54    |    | 5    |   | 0.31  | 25      | 0.25             | 20 (-20 %)  |  |
| 5          | $2^{12}$ | 108   |    | 8    |   | 0.98  | 56      | 0.49             | 28 (-56 %)  |  |
| 10         | $2^{13}$ | 216   | 30 | 16   | 2 | 3.93  | 176     | 0.98             | 44 (-75 %)  |  |
| 20         | $2^{14}$ | 432   |    | 30   |   | 14.75 | 570     | 1.97             | 76 (-87 %)  |  |
| 30         | $2^{15}$ | 594   |    | 41   |   | 40.30 | 1,394   | 3.93             | 136 (-90 %) |  |
|            |          |       |    |      |   |       |         |                  |             |  |
|            |          |       |    |      |   |       |         |                  | Y           |  |

G times more costly than for a single-field NTT Up to -90 % of memory utilization compared to some state-of-art approaches

Empirically  $4 \le G \le 6$ 

No impact on NTT throughput



### **RPM CHARACTERIZATION (1) PROOF-OF-CONCEPT INTEGRATION**



Xilinx Virtex 7 xc7vx690t-ffg1157c-2 & PCIe Gen3 ×8

| n - L          | ,10691 -   | 50, 11 |     |       |
|----------------|------------|--------|-----|-------|
| Vivado 2016.3: | simulated, | placed | and | route |
|                |            |        |     |       |

|        |                  | Ressources    | 8         |        |        | RPM   |       |                                       | BCHI   |
|--------|------------------|---------------|-----------|--------|--------|-------|-------|---------------------------------------|--------|
|        | $\left( \right)$ | type          | available | total  | NTT    | MM    | GTW   | Others                                | & WRAP |
| LUT    | 12.5%            | LUT           | 432,368   | 54,188 | 41,964 | 5,198 | 5,906 | 1,120                                 | 27,775 |
| LUTRAM | 8.3%             | LUTRAM        | 173,992   | 14,402 | 10,710 | 2,056 | 1,550 | 86                                    | 5,425  |
| FF     | 7.7%             | $\mathbf{FF}$ | 864,736   | 66,444 | 50,961 | 6,755 | 7,761 | 967                                   | 39,614 |
| BRAM   | 14.1%            | BRAM          | 1,470     | 208    | 147    | 0     | 21    | 40                                    | 153    |
| DSP    | 14.4%            | DSP           | 3,600     | 517    | 363    | 88    | 66    | 0                                     | 48     |
|        |                  | IO            | 600       | 0      | 0      | 0     | 0     | 0                                     | 59     |
|        |                  | Pcie          | 3         | 0      | 0      | 0     | 0     | 0                                     | 1      |
|        |                  |               |           |        |        |       |       | · · · · · · · · · · · · · · · · · · · | 4      |

Running frequency  $f_{RPM} = 200 \text{ MHz}$ 

#### How does RPM scale in FV context?

#### **Post-implementation resources utilization**



## **RPM CHARACTERIZATION (2) PROJECTION: INFLUENCE OF DEGREE** *n*

**Impact of the polynomial degree** n (w = 2 and  $\log_2 q_i = 30$ ):

Xilinx Virtex 7: XC7VX690T-2-FFG1157C

Resource limitation (FPGA / PCIe Gen3 x8)





#### **RPM CHARACTERIZATION (3)**

**PROJECTION: INFLUENCE OF STREAMING WIDTH** *w* 

**Impact of the polynomial degree** w ( $n = 2^{14}$  and  $\log_2 q_i = 30$ ): Xilinx Virtex 7: XC7VX690T-2-FFG1157C

— Resource limitation (FPGA / PCIe Gen3 x8)





#### **RPM CHARACTERIZATION (4)**

**PROJECTION: INFLUENCE OF PRIME SIZE**  $LOG_2 q_i$ 

**Impact of the polynomial degree**  $\log_2 q_i$  ( $n = 2^{14}$  and w = 2): Xilinx Virtex 7: XC7VX690T-2-FFG1157C

Resource limitation (FPGA / PCIe Gen3 x8)



#### **Performance projection @200MHz:**

With respect to timing from [HPS18] ( $\lambda > 128$ ) (su = speedup)

| Parameters |          |       |                              | RPM                     | Ν                       | /ul.RPM           | Relin.RPM                     |                            |                                                                                                                                           |                                |                                                                                                                                  |
|------------|----------|-------|------------------------------|-------------------------|-------------------------|-------------------|-------------------------------|----------------------------|-------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------|----------------------------------------------------------------------------------------------------------------------------------|
| L          | n        | $S_q$ | s                            | k                       | k'                      | w                 | $1/\mathrm{ms}$               | #                          | $ms(\mathbf{su})$                                                                                                                         | #                              | $ms(\mathbf{su})$                                                                                                                |
| 1          | $2^{12}$ | 54    |                              | 2                       | 3                       |                   | 195.3                         | 15                         | 0.08(	imes 23.4)                                                                                                                          | 8                              | 0.04(	imes 9.5)                                                                                                                  |
| <b>5</b>   | $2^{13}$ | 108   |                              | 4                       | <b>5</b>                |                   | 97.7                          | 27                         | $0.28(\times 22.2)$                                                                                                                       | 32                             | 0.33(	imes 7.5)                                                                                                                  |
| 10         | $2^{13}$ | 216   | 30                           | 8                       | 8                       | <b>2</b>          | 48.8                          | 48                         | $0.98(\times 24.4)$                                                                                                                       | 128                            | 2.62(	imes 6.8)                                                                                                                  |
| 20         | $2^{14}$ | 432   |                              | 15                      | 15                      |                   | 24.4                          | 93                         | $3.69(\times 27.4)$                                                                                                                       | 450                            | 18.4(× <b>4.0</b> )                                                                                                              |
| 30         | $2^{15}$ | 594   | J                            | 20                      | 21                      |                   | 12.2                          | 126                        | $10.1 (\times \textbf{31.3})$                                                                                                             | 882                            | 65.5(	imes 4.8)                                                                                                                  |
| 20         | $2^{14}$ | 432   | 30                           | 15                      | 15                      | 2<br>4<br>8<br>16 | 24.4<br>48.8<br>97.7<br>195.3 | 93                         | $\begin{array}{c} 3.69 (\times 27.4) \\ 1.84 (\times 54.8) \\ 0.92 (\times 109.5) \\ 0.46 (\times 219) \end{array}$                       | 450                            | $\begin{array}{c} 18.4(\times {\bf 4})\\ 9.22(\times {\bf 8.1})\\ 4.61(\times {\bf 16.1})\\ 2.30(\times {\bf 32.3})\end{array}$  |
| 20         | $2^{14}$ | 432   | $30 \\ 41 \\ 51 \\ 58 \\ 62$ | 15<br>11<br>9<br>8<br>7 | 15<br>11<br>9<br>8<br>8 | 2                 | 24.4                          | 93<br>69<br>54<br>48<br>45 | $\begin{array}{c} 3.69 (\times 27.4) \\ 2.70 (\times 37.3) \\ 2.21 (\times 45.6) \\ 1.97 (\times 51.3) \\ 1.84 (\times 54.8) \end{array}$ | 450<br>242<br>162<br>128<br>98 | $\begin{array}{c} 18.4(\times 4) \\ 9.91(\times 7.5) \\ 6.64(\times 11.2) \\ 5.24(\times 14.2) \\ 4.01(\times 18.5) \end{array}$ |

Scalable speedup w.r.t. multiplicative depth (FV parameter growth)

Parallelism improves speedup but is costly

Prime size reduces number of operations at reasonable cost



## **OTHER CONTRIBUTIONS**

• Design of a generator of twiddle factor sets



• Prototype of GPU acceleration for RNS specific functions



• Exploration of potential acceleration with a hybrid computing system for FV



One order of magnitude faster for ciphertext multiplication

Communication intensive application



#### CONCLUSION





- Improving bandwidth for communication intensive system
  - High bandwidth & low latency interfaces (IBM Capi, Intel UltraPath, ...)
  - 3D memory
- Leveled homomorphic cryptography with the FV scheme
  - Target application: low-depth and batching intensive applications
  - Batching compatible NTT-based Residue Polynomial Multiplication

#### • Hardware acceleration for TFHE

- Polynomial multiplications with real coefficients modulo 1
- Exploration of SPIRAL opportunities for FFT-based convolutions
- Hardware acceleration for Post-Quantum cryptography
  - NIST competition: 1/3 lattice-based with Polynomial Multiplication required
  - Explore single-field NTT generation with SPIRAL

## Thanks!

## Questions?

Centre de Saclay Nano-Innov PC 172 - 91191 Gif sur Yvette Cedex

\_\_\_\_\_





## **MODULAR ARITHMETIC**

• Modular Addition:



• Modular Multiplication (NFLlib):

• Modular Subtraction:



