Spinpack using FPGA

Idea

Setting up the test environment

Adaptations for FPGA

Overview of the data flow

The most data-intensive object is the sparse matrix (kept in memory or on disk), followed by the vectors and the configuration space (kept in memory). The symmetries (permutations) normally fit into the CPU cache. The matrix is read sequentially, so latency is not a problem; for big systems it lives on disk, which makes bandwidth the limit. Computing the configuration space is mainly integer and bit manipulation, but because CPUs lack an atomic bit-permutation instruction it is very CPU-intensive.
code and data flow
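
To illustrate why a missing bit-permutation instruction is expensive, here is a minimal sketch of a software bit permutation in C. It is not taken from Spinpack; the name permute_bits and the convention that perm[i] names the source site of bit i are assumptions made for this example. Every site costs a shift, a mask, another shift and an OR, so one permutation of an N-site configuration takes on the order of N loop iterations.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of Spinpack: apply a site permutation to a
 * spin configuration stored as a bit string (bit i = spin on site i).
 * Assumed convention: perm[i] is the source site whose spin moves to site i. */
uint64_t permute_bits(uint64_t cfg, const uint8_t *perm, int n)
{
    assert(n >= 0 && n <= 64);
    uint64_t out = 0;
    for (int i = 0; i < n; i++) {
        assert(perm[i] < n);
        /* extract the source bit and deposit it at position i */
        out |= ((cfg >> perm[i]) & UINT64_C(1)) << i;
    }
    return out;
}
```

On an FPGA the same operation is only a fixed rewiring of the input bits, which is the motivation for moving it into hardware.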


As a first test, space generation could be done completely within the FPGA, replacing the numsymconf() function and writing the minimum symmetric configurations out to memory (as a byte packet or a long array).
base space generation
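
The idea of keeping only the minimum symmetric configuration can be sketched in C as follows. This is a simplified stand-in and not the actual numsymconf() code: it assumes the only symmetry is translation on a ring of n sites, scans all 2^n configurations, and stores those that are their own minimum in an array of 64-bit integers. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rotate an n-bit configuration left by one site (ring translation). */
static uint64_t rotate1(uint64_t cfg, int n)
{
    uint64_t mask = (n == 64) ? ~UINT64_C(0) : ((UINT64_C(1) << n) - 1);
    return ((cfg << 1) | (cfg >> (n - 1))) & mask;
}

/* Smallest configuration among all n translations of cfg. */
uint64_t min_translation(uint64_t cfg, int n)
{
    uint64_t best = cfg, cur = cfg;
    for (int k = 1; k < n; k++) {
        cur = rotate1(cur, n);
        if (cur < best)
            best = cur;
    }
    return best;
}

/* Hypothetical stand-in for the role of numsymconf(): keep every
 * configuration that equals the minimum of its symmetry class.
 * Returns the number of representatives; at most cap are written to out. */
size_t gen_representatives(int n, uint64_t *out, size_t cap)
{
    assert(n >= 1 && n < 64);
    size_t count = 0;
    for (uint64_t cfg = 0; cfg < (UINT64_C(1) << n); cfg++) {
        if (min_translation(cfg, n) == cfg) {
            if (count < cap)
                out[count] = cfg;
            count++;
        }
    }
    return count;
}
```

For a ring of 4 sites this keeps 6 of the 16 configurations. With a full lattice symmetry group, the inner loop would run over all permutations instead of n rotations, and that minimum search is the part that would move into the FPGA.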


The second test would be to implement part or all of the Hamiltonian matrix generation on the FPGA. If the speedup is about 100, the matrix could be generated on the fly in every iteration without having to store it. This would reduce the disk bandwidth problem for bigger spin systems. At present we are limited by disk bandwidth (100 MB/s), whereas FPGA streams could reach about 1 GB/s per node (a speedup of 10, with no need for disks and with better scaling).
matrix generation
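
What "generating the matrix on the fly" means can be shown with a small matrix-free matrix-vector product in C. The sketch assumes a spin-1/2 Heisenberg model H = sum over bonds of S_a . S_b in the full S^z basis without symmetries; the text above does not fix the model, and the struct bond type and the function name are hypothetical. No matrix element is ever stored: each one is recomputed from the configuration bits whenever it is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bond description: indices of the two coupled sites. */
struct bond { int a, b; };

/* y = H x with H = sum over bonds of S_a . S_b (spin 1/2), matrix elements
 * generated on the fly. x and y hold 2^n entries and must not overlap. */
void apply_heisenberg(int n, const struct bond *bonds, size_t nbonds,
                      const double *x, double *y)
{
    assert(n >= 1 && n < 31);
    size_t dim = (size_t)1 << n;
    for (size_t s = 0; s < dim; s++)
        y[s] = 0.0;
    for (size_t s = 0; s < dim; s++) {
        for (size_t k = 0; k < nbonds; k++) {
            size_t ba = (size_t)1 << bonds[k].a;
            size_t bb = (size_t)1 << bonds[k].b;
            int parallel = ((s & ba) != 0) == ((s & bb) != 0);
            if (parallel) {
                y[s] += 0.25 * x[s];           /* Sz Sz term only */
            } else {
                y[s] -= 0.25 * x[s];           /* Sz Sz term */
                y[s ^ ba ^ bb] += 0.5 * x[s];  /* spin-flip term */
            }
        }
    }
}
```

For two sites and one bond, applying this to the singlet vector (0, 1, -1, 0) returns -3/4 times the same vector. In the scheme described above, the inner loop would additionally map each generated configuration to its symmetric representative, which is where the FPGA would help.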


Estimate of the FPGA logic needed to compare 40-bit configurations and find the minimum. Permutations at zero cost (just wires)? Conventional 1.5 GHz x86 CPUs need around 8000 CPU clocks to find the minimum configuration for an N=40 square system. An FPGA needs one FPGA clock.
Logic needs for comparison
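
One way to picture the comparison logic is a comparator tree: the permuted configurations are compared in pairs, the winners are compared again, and so on. The C sketch below models that reduction in software and counts the levels and comparators; it is an illustration, not a hardware description, and the name tree_min is hypothetical. For n inputs it always uses n-1 two-input comparators; for 160 inputs that is 159 comparators in 8 levels.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reduce v[0..n-1] to its minimum by pairwise comparison, level by level,
 * the way a comparator tree would be wired. Overwrites v. Stores the number
 * of levels in *depth and the number of two-input comparators in *ncmp. */
uint64_t tree_min(uint64_t *v, size_t n, int *depth, size_t *ncmp)
{
    assert(n >= 1);
    *depth = 0;
    *ncmp = 0;
    while (n > 1) {
        size_t half = n / 2;
        for (size_t i = 0; i < half; i++) {
            uint64_t a = v[2 * i], b = v[2 * i + 1];
            v[i] = (a < b) ? a : b;
            (*ncmp)++;
        }
        if (n & 1) {            /* odd element passes through unchanged */
            v[half] = v[n - 1];
            half++;
        }
        n = half;
        (*depth)++;
    }
    return v[0];
}
```

Whether all levels fit into a single FPGA clock depends on the device and the clock rate, which are not specified here.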


ToDo:

 - add ImpulseC codes and results
   in short: the C-to-VHDL and VHDL compilers did a bad job; it looks like
     only the demo codes compile,
     when trying to compile more complex networks (160 permutations + minimum)
     the compiler hangs, it is no fun to work with
     I assume memory-wasting algorithms and
     badly scaling code, which cause OOMs
 - my hope is that someone writes better (open) compilers for FPGAs
 - the other way is to make the circuit much simpler,
   one could design a butterfly network for permutations
   (of course it would be better to integrate it into the CPU)
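
As an illustration of the butterfly idea, here is a minimal software model in C. It is a sketch under assumptions, not a circuit design: the word is padded to a power of two (N=40 would be padded to 64 bits), stage s exchanges bits at distance 2^(log2n-1-s), and one mask per stage selects which pairs are swapped. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Swap bit i with bit i+d for every i whose bit is set in mask.
 * The mask must only mark the lower bit of each pair. */
static uint64_t delta_swap(uint64_t x, uint64_t mask, int d)
{
    uint64_t t = ((x >> d) ^ x) & mask;
    return x ^ t ^ (t << d);
}

/* Apply a butterfly network on 2^log2n bits. Stage s exchanges bits at
 * distance 2^(log2n-1-s); masks[s] selects which pairs are swapped. */
uint64_t butterfly(uint64_t x, const uint64_t *masks, int log2n)
{
    assert(log2n >= 1 && log2n <= 6);
    for (int s = 0; s < log2n; s++)
        x = delta_swap(x, masks[s], 1 << (log2n - 1 - s));
    return x;
}
```

With the masks 0x0F, 0x33, 0x55 on 8 bits, this network reverses the bit order. A single butterfly cannot realize every permutation; a butterfly followed by an inverse butterfly (a Benes network) can, so a general permutation unit would need about twice as many stages.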