This paper presents benchmarking and profiling of the two lattice-based signature scheme finalists, Dilithium and Falcon, on the ARM Cortex M7 using the STM32F767ZI NUCLEO-144 development board. This research is motivated by the Cortex M7 being the only processor in the Cortex-M family to offer a double-precision (i.e., 64-bit) floating-point unit, which allows Falcon’s implementations, which require 53 bits of precision, to run entirely on native floating-point operations without any emulation. Falcon shows significant speed-ups of 6.2-8.3x in clock cycles and 6.2-11.8x in runtime, whereas Dilithium shows little improvement beyond that gained from the slightly faster processor.
We then present profiling results for the two schemes on the Cortex M7 to identify their respective bottlenecks and the operations where improvements have been, and can still be, made; these results show that some operations in Falcon’s procedures see speed-ups of an order of magnitude. Finally, we test the native FPU instructions on the Cortex M7, used in Falcon’s FPR instructions, for constant runtime and find irregularities on four different STM32 boards, as well as on the Raspberry Pi 3, which was used in previous benchmarking results for Falcon.