Sorry for the language mix (german, english).
- virtual test cluster (rid-replacement)
- Zi != 0 for square Sz=0 k=-9999
- after release v2.40, make FPGA ready (b_smallest blockwise) + FPGA
- Ising matrix to Ising + 1st-order excitations?
- storeh2 for MPI (for ns() too)
  MC-like method: 10000 random configs divided into 100 blocks,
   defining the start-cfg of each block,
  Ising energy as the mean value of the neighbouring Ising energies?
    like e0 + mean excitations
- remove writing l1 (allows starting 2 jobs within same path)  
  save as one file on node0?
- rename zahl to lfloat (long float), mzahl to sfloat (short float)
- completely new strategy? build n1 via storeh2 (sorted by Ising energy)
  and balance the number of nonzero elements per rank/thread;
  b2i probably has to query more than one other rank?
- remove n2, parallel ns()
- err9200: set the nzxmax overflow flag and reset hbuf->n,
           stop later! ??? or better: dynamic malloc,
      or hbuf as a compressed part of hbuf[NUM_
- MPI ns() without NFS transfer
- fix compile errors on SunOS (isut1), check MRule for kago39
- ns(): replace fwrite(l1_xxx.tmp) ??? (tina has NFS problems)
  by MPI_Sendrecv (see the sketch below)
   1st: send buffer to thread_i (i+1 if mem_i=full),
        or simpler: send buf_i to thread i%mpi_n (buf_mpi_i)
   2nd: balance l1[], send l1[]begin(i+1)..end to
        or simpler: send buf_mpi_i to thread i (l1_i)
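  A minimal sketch of such an MPI_Sendrecv exchange (the l1buf_t layout and the
  one-buffer-per-destination organisation are assumptions, not the real data layout;
  the (i+j)/(i-j) pairing keeps the exchange deadlock-free):

  #include <mpi.h>

  /* hypothetical layout of one l1 buffer (packed entries for one destination rank) */
  typedef struct { long n; unsigned long cfg[4096]; } l1buf_t;

  /* every rank has prepared one buffer per destination rank (sbuf[d]) and wants
     the buffer each other rank prepared for it (rbuf[s]); nothing touches NFS */
  void exchange_l1(l1buf_t *sbuf, l1buf_t *rbuf, int mpi_i, int mpi_n)
  {
    rbuf[mpi_i] = sbuf[mpi_i];                    /* local part, no transfer needed */
    for (int j = 1; j < mpi_n; j++) {
      int dst = (mpi_i + j) % mpi_n;              /* partner we send to in this round */
      int src = (mpi_i - j + mpi_n) % mpi_n;      /* partner we receive from */
      MPI_Sendrecv(&sbuf[dst], sizeof(l1buf_t), MPI_BYTE, dst, 0,
                   &rbuf[src], sizeof(l1buf_t), MPI_BYTE, src, 0,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
  }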
- speedup by local malloc? if not, change back to v[i+b_ofs[blk]]
  and node_ofs[..], b_ofs[]=0..node_len[node]-1;
- autobalancer: redistribute lines among threads after the 1st iteration?
  !ps -m -o time -p $PPID at the end to calculate the imbalance
- first mpi_n MPI calls per hbuf line, later block a fixed number of lines (see the sketch below)
  split nhm()
  - hamilton_nhv(i) -> nhm(i,jnz) -> store hbuf[].cfg,rr
  - smallest(hbuf) -> jnz*b_smallest -> store hbuf[].scfg,rr*=(sign,norm2)
  - scfg2blk(hbuf) -> [blk].hbuf...
  - mpi_n*MPI_Sendrecv([i+j].hbuf to i+j, [i].hbuf from i-j) cfg
          scfg2idx -> hbuf[].idx
          MPI_Sendrecv([i+j].hbuf to i-j, [i].hbuf from i+j) idx+v0?
    for all b_len[0] (send 0 if b_len[i]; remove the last up-spin and search for a new place until smallest
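  A minimal sketch of the blocked cfg->idx resolution with the (i+j)/(i-j) exchange
  (hentry_t, HBUF_LEN and scfg2idx() are placeholders for the real hbuf layout and
  the local lookup):

  #include <mpi.h>

  #define HBUF_LEN 4096                           /* placeholder block size */
  typedef struct { unsigned long scfg; long idx; double rr; } hentry_t;  /* placeholder */

  extern long scfg2idx(unsigned long scfg);       /* placeholder: local lookup on the owner */

  /* blk[d][] holds the entries of this block whose representative lives on rank d */
  void resolve_block(hentry_t blk[][HBUF_LEN], int blk_len[], int mpi_i, int mpi_n)
  {
    hentry_t tmp[HBUF_LEN];
    for (int j = 1; j < mpi_n; j++) {
      int up = (mpi_i + j) % mpi_n, down = (mpi_i - j + mpi_n) % mpi_n;
      int nrecv;
      MPI_Status st;
      /* 1st round: ship our cfgs to their owner 'up', take the cfgs of 'down' */
      MPI_Sendrecv(blk[up], blk_len[up] * (int)sizeof(hentry_t), MPI_BYTE, up, 0,
                   tmp, HBUF_LEN * (int)sizeof(hentry_t), MPI_BYTE, down, 0,
                   MPI_COMM_WORLD, &st);
      MPI_Get_count(&st, MPI_BYTE, &nrecv);
      nrecv /= (int)sizeof(hentry_t);
      for (int k = 0; k < nrecv; k++)             /* resolve the foreign cfgs locally */
        tmp[k].idx = scfg2idx(tmp[k].scfg);
      /* 2nd round: resolved entries go back to 'down', ours come back from 'up' */
      MPI_Sendrecv(tmp, nrecv * (int)sizeof(hentry_t), MPI_BYTE, down, 1,
                   blk[up], blk_len[up] * (int)sizeof(hentry_t), MPI_BYTE, up, 1,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    for (int k = 0; k < blk_len[mpi_i]; k++)      /* entries we own ourselves */
      blk[mpi_i][k].idx = scfg2idx(blk[mpi_i][k].scfg);
  }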
- alloc h_arrays within threads (h_xxxx[B_NUM] not needed),
  open/close within threads?
- better speed measurement to find the bottleneck?
  - MP scaling, size scaling in hnz/s (where meaningful)
  - prepared for FPGAs
- smaller code for fewer bugs
  - remove noSBase (k=-1 can be used, add kud=0 or -1, test speed)
- pictures of most probable 40 states! mark flips (@ critical J2=0.6)
  show first perturbation terms
- long-term goal: FPGAs + MPI (block the b_smallest calls)
- MPI async: MPI_Probe, MPI_Get_count, MPI_Recv, MPI_Isend, MPI_Irecv, MPI_Waitall/Waitany (sketch below)
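  A minimal sketch of the probe-based receive with an asynchronous send (buffer
  types, tag and partner ranks are placeholders):

  #include <mpi.h>
  #include <stdlib.h>

  /* receive a message whose length is not known in advance, while our own
     send proceeds asynchronously */
  void async_exchange(double *sendbuf, int sendcount, int dst, int src, int tag)
  {
    MPI_Request req;
    MPI_Status st;
    int recvcount;
    double *recvbuf;

    MPI_Isend(sendbuf, sendcount, MPI_DOUBLE, dst, tag, MPI_COMM_WORLD, &req);
    MPI_Probe(src, tag, MPI_COMM_WORLD, &st);      /* block until a message has arrived */
    MPI_Get_count(&st, MPI_DOUBLE, &recvcount);    /* ask how many doubles it carries */
    recvbuf = malloc(recvcount * sizeof(double));
    MPI_Recv(recvbuf, recvcount, MPI_DOUBLE, src, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Wait(&req, MPI_STATUS_IGNORE);             /* make sure our own send has finished */
    /* ... consume recvbuf ... */
    free(recvbuf);
  }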
- async read, n-to-m threads (m<=n), operations in blocks (speedup)
  too costly? better in memory only, on-the-fly computation or mmap to disk
- tJ has no .3 symmetry like tU (Ham2), does that make sense?
- store j1-j2-t-U as a parameter index or via an index (long 40-site j1-j2 calculations)
  (because of a2 no separate parameter files), better an index?
  store index to [factor{0,+1/2,-1/2,-1,+1}, parameter index]
- n-site terms in H, n>2
- SiSj with symmetries is much too slow, why?
- a4 parallel (world leader? <-> AHonecker)
- test/improve a2 parallel scaling?
- data multi-stream concept (pipeline concept of vector machines) for new HPC?
  serially, dynamically coupled programmable units (e.g. CPUs, FPGAs);
  map them onto sequential processors as threads + stream buffers + a stop/
  start mechanism if a stream buffer is full/empty (wait for data), see the sketch below;
  not sufficient for v2=H*v1+v2, random access is also necessary (stream+RAM)
  MIPS per watt? cost-performance per watt (ARM SA-1100, 1997, 133MHz, max 250mW)
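  A minimal sketch of such a stream buffer mapped onto threads (a bounded buffer
  with condition variables; BUF_LEN and the double payload are placeholders):

  #include <pthread.h>

  #define BUF_LEN 1024                     /* placeholder stream-buffer length */

  typedef struct {
    double data[BUF_LEN];                  /* placeholder payload type */
    int head, tail, count;
    pthread_mutex_t lock;                  /* init with pthread_mutex_init() */
    pthread_cond_t not_full, not_empty;    /* init with pthread_cond_init() */
  } stream_t;

  void stream_put(stream_t *s, double x)   /* producer side */
  {
    pthread_mutex_lock(&s->lock);
    while (s->count == BUF_LEN)            /* stop: buffer full, wait */
      pthread_cond_wait(&s->not_full, &s->lock);
    s->data[s->head] = x;
    s->head = (s->head + 1) % BUF_LEN;
    s->count++;
    pthread_cond_signal(&s->not_empty);    /* start: wake a waiting consumer */
    pthread_mutex_unlock(&s->lock);
  }

  double stream_get(stream_t *s)           /* consumer side */
  {
    pthread_mutex_lock(&s->lock);
    while (s->count == 0)                  /* stop: buffer empty, wait for data */
      pthread_cond_wait(&s->not_empty, &s->lock);
    double x = s->data[s->tail];
    s->tail = (s->tail + 1) % BUF_LEN;
    s->count--;
    pthread_cond_signal(&s->not_full);     /* start: wake a waiting producer */
    pthread_mutex_unlock(&s->lock);
    return x;
  }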
- nice graphic?: xy-array of coloured Ising-energy matrix weights
    minIsing=0 (Neel), maxIsing=maxNumBonds=N*Dim
    Ising = num-uu-bonds + dd-bonds, H1diff=0..2maxNN-2, H2diff=0..2maxNN
  + state overlap <PisingstatesE1|H|PisingstatesE2>  => x_out enhanced
  highest/lowest Ising recursively? min..maxE1.E2.E3....
 - check also: grep ToDo src/*.[ch]
 - simplify/generalize LM bonds in H?
  (use a more general (simple) method for local symmetries)
   example:
  ..O   O---O   O---O   O...   this is a sample-chain, with 5*N sites
     \ /     \ /     \ /       and N vertical symmetries, which are
      O       O       O        completely decoupled from each other
     / \     / \     / \       and from the other symmetries (very similar to LM),
  ..O   O---O   O---O   O...   a future version should care about this
 speedup for local singlets (see 4site_exchange_diamond36.def)
 commuting symmetries (German: vertauschende Symmetrien)
 - find a solution to avoid 64-bit overflow for the spin chain N=6, s=7 and bigger
 - instead of doing pthread_create/join on every iteration, do it only once
    and reuse the threads with a mutex in an MPI-compatible way (sketch below)
    - improves sun_top_pcpu (pcpu is reset to 0 after pwd_create)
    - improves linux_top logging (new pid creation on p_create)
    - could be made MPI compatible (MY_MPI)
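  A minimal sketch of the reuse idea (barriers are used here instead of a hand-rolled
  mutex/condition scheme; NUM_THREADS and do_iteration_part() are placeholders):

  #include <pthread.h>

  #define NUM_THREADS 4                        /* placeholder */
  extern void do_iteration_part(int tid);      /* placeholder: per-thread share of one iteration */

  static pthread_t tids[NUM_THREADS];
  static pthread_barrier_t start_bar, done_bar;
  static volatile int quit = 0;

  static void *worker(void *arg)
  {
    int tid = (int)(long)arg;
    for (;;) {
      pthread_barrier_wait(&start_bar);        /* sleep until the next iteration starts */
      if (quit) break;
      do_iteration_part(tid);
      pthread_barrier_wait(&done_bar);         /* report completion */
    }
    return NULL;
  }

  void workers_start(void)                     /* called once, not per iteration */
  {
    pthread_barrier_init(&start_bar, NULL, NUM_THREADS + 1);
    pthread_barrier_init(&done_bar,  NULL, NUM_THREADS + 1);
    for (long i = 0; i < NUM_THREADS; i++)
      pthread_create(&tids[i], NULL, worker, (void *)i);
  }

  void workers_run_iteration(void)             /* called for every Lanczos iteration */
  {
    pthread_barrier_wait(&start_bar);          /* release the workers */
    pthread_barrier_wait(&done_bar);           /* wait until all have finished */
  }

  void workers_stop(void)
  {
    quit = 1;
    pthread_barrier_wait(&start_bar);
    for (int i = 0; i < NUM_THREADS; i++) pthread_join(tids[i], NULL);
  }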
 - calculate SiSj for twisted boundary conditions TBC
  (using posx,posy, ww[NN*NN] ?)
 - translate docs to english (partly done)
 - new design via pipes (dataflow) and threads (e.g. a generate-H thread
   writes elements sorted into 4 pipes for the blocks, the system does the caching); sketch below
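  A minimal sketch of the generator side (the pipe setup, the block readers and
  next_element() are placeholders):

  #include <unistd.h>
  #include <pthread.h>

  #define NBLK 4                                    /* one pipe per block */
  typedef struct { long idx; double val; } helem_t; /* placeholder element layout */

  static int blkfd[NBLK];                           /* write ends of the block pipes */
  extern int next_element(helem_t *e, int *blk);    /* placeholder H generator */

  static void *generate_H(void *arg)
  {
    helem_t e; int blk;
    (void)arg;
    while (next_element(&e, &blk))                  /* stream elements, presorted per block */
      if (write(blkfd[blk], &e, sizeof e) != (ssize_t)sizeof e)
        break;                                      /* reader gone or pipe error */
    for (int b = 0; b < NBLK; b++)
      close(blkfd[b]);                              /* EOF tells the block readers to finish */
    return NULL;
  }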
 - coding/indexing by Ising energy with all possible bonds within the
    same symmetry (advantage: S=1, LM automatically integrated, faster?)

 - optionally introduce a C++ type ULLLong with >64 bit for N>64 (sketch below)
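  A minimal sketch, written in plain C here to match the rest of the code
  (the 128-bit name and layout are assumptions):

  #include <stdint.h>

  typedef struct { uint64_t lo, hi; } ulllong_t;   /* 128-bit configuration word */

  static ulllong_t ull_xor(ulllong_t a, ulllong_t b)           /* e.g. kfg1^kfg2 */
  { ulllong_t r = { a.lo ^ b.lo, a.hi ^ b.hi }; return r; }

  static int ull_getbit(ulllong_t a, int i)                    /* spin on site i */
  { return (int)(((i < 64) ? (a.lo >> i) : (a.hi >> (i - 64))) & 1ULL); }

  static ulllong_t ull_setbit(ulllong_t a, int i)
  { if (i < 64) a.lo |= 1ULL << i; else a.hi |= 1ULL << (i - 64); return a; }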
 - converting to C++ would make the program clearer!
   at least for the data types
 - compute the expectation values H_t, H_J, H_U etc. (their sum is <H>),
   allows a better interpretation? possibly H_J1, H_J2
 - H_J1={ny, array of pointers to {iy,nx, array of {ix,value[y,x]}}}
 - store H_J1, H_J2 separately and calculate H = J1*H_J1 + J2*H_J2 (sketch below)
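  A minimal sketch of that layout and of the accumulation y += coupling*H_term*x,
  so that H = J1*H_J1 + J2*H_J2 never has to be stored explicitly (names are placeholders):

  typedef struct { int ix; double value; } hcol_t;          /* one nonzero element  */
  typedef struct { int iy; int nx; hcol_t *cols; } hrow_t;  /* one stored row       */
  typedef struct { int ny; hrow_t *rows; } hterm_t;         /* one term, e.g. H_J1  */

  /* y += coupling * H_term * x */
  void hterm_axpy(const hterm_t *h, double coupling, const double *x, double *y)
  {
    for (int r = 0; r < h->ny; r++) {
      const hrow_t *row = &h->rows[r];
      double sum = 0.0;
      for (int c = 0; c < row->nx; c++)
        sum += row->cols[c].value * x[row->cols[c].ix];
      y[row->iy] += coupling * sum;
    }
  }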
- try starting from the Neel state and count the nonzero elements per Lanczos step (a8?)
  also check the overlap to the preceding Lanczos vectors
- repeat with debug++ if fatal error, reduce debug output 
- speed 600MHz      lt=2:05 SH=2:15,    --- 1s/It ---  nu+nd=32+8 n1=482e3
  call   b_smallest lt=2:07 SH=2:13=133s 
  call 2*b_smallest lt=2:05 SH=3:50=230s => b_smallest=1:37=97s=72%
  call   b_getbase  lt=2:07 SH=13:45
  call 2*b2i        lt=2:06 SH=2:29 => +12..14s=10%
       nhmline()                       +1..3s=1%
       1*nhm                SH=2:21
       2*nhm                SH=4:32 => +131s=98%
       nhm=return           SH=0:07s
  build lt with H? sorted by E_Ising? 3bonds=3dimIsing (see the bond-count sketch below)
   H*000111=001011+100110 (6)->(6)  better uu=dd=0 ud=du=1 (for +-Operators)
   H*4.2.0 =2.2.2+2.2.2
   H*001011=010011+000111b+001101+101010 (6)->(6,2)
   H*2.2.2 =2.2.2+4.2.0b+2.2.2+0.6.0
   hash(Ising string)? trees of Ising energies (level = bond type)
    store the config with the minimal bit distance to older representatives as the representative?
    (kfg1^kfg2 yields <ij>)
    n1=555e6 hnz=23e9=41.4*n1
    idx  -> (IsingRep -> kfg) -> H*kfg -> kfgs -> IsingRep -> idxs
          H*kfg, kfgs->IsingRep via FPGAs?
          idx <-> IsingRep via a (hash) table?
    v fully dynamic, starting from lowestIsing?
    1st iteration super fast, 2nd iteration 41x slower? etc.
    but the problem of finding the index remains?
   numBonds? (topology-index 1dist.2dist.3dist...N/2dist for chain)
   0=01010101 2(8.0.8.0) -> 10010101 + 01100101 + 00110101 + ... 8(6.4.4.)
   1=10010101 8(6.4.4. ) -> 01010101=0
                             + 10100101=1
                             + 10001101 + 10010011 + ... 16(4.6.)
   2=10001101 16(4.6.)  ->  10010101=1
                             + 10001011 (4.4.)
   3=10001011 16(4.4.)  ->  10001101=2
                             + 01001011 (6.4.2.)
                             + 10000111 (2.4.)
   4a=01001011 8(6.4.2)
   4b=10000111 8(2.4)
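  A minimal sketch of the classification step for a periodic chain (Ising energy
  counted as the number of uu+dd bonds, as defined above; the function name is a
  placeholder):

  #include <stdint.h>

  /* number of equal-spin nearest-neighbour bonds (uu + dd) of a periodic chain
     configuration cfg with n <= 64 sites, bit i = spin on site i;
     Neel (..010101) gives 0, the ferromagnetic state gives n */
  int ising_bonds(uint64_t cfg, int n)
  {
    uint64_t mask = (n < 64) ? ((1ULL << n) - 1) : ~0ULL;
    uint64_t shifted = ((cfg >> 1) | (cfg << (n - 1))) & mask;  /* cyclic neighbour shift */
    uint64_t equal = ~(cfg ^ shifted) & mask;   /* bit set where neighbours are parallel */
    int count = 0;
    while (equal) { count += (int)(equal & 1ULL); equal >>= 1; }
    return count;
  }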
- choose code2
- kill SIGUSR sh parallel => no meaningful value
- LAPACK without EV (as an option), implement zheev for sparse + parallel!? (see the calling sketch below)
  license? http://www.netlib.org/lapack/faq.html#1.2
  - change the name of routines if modified, 
  - We only ask that proper credit be given to the authors.
  complex: zheev (JOBZ='N'|'V', UPLO='U', N, A[LDA,N], LDA>=max[1,N], W[N],..)
   wantz = LSAME( jobz,'v'); // test option
   lower =
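  A minimal sketch of calling zheev from C (column-major input; the wrapper name
  and error handling are simplified, and the Fortran symbol name may differ per
  platform):

  #include <complex.h>
  #include <stdlib.h>

  extern void zheev_(const char *jobz, const char *uplo, const int *n,
                     double complex *a, const int *lda, double *w,
                     double complex *work, const int *lwork, double *rwork,
                     int *info);

  /* diagonalize a dense hermitian n x n matrix a (overwritten by the
     eigenvectors for JOBZ='V'); eigenvalues end up in w[0..n-1] */
  int diag_hermitian(double complex *a, int n, double *w)
  {
    int lda = n, lwork = -1, info;
    double complex wsize;
    double *rwork = malloc((n > 1 ? 3 * n - 2 : 1) * sizeof(double)); /* max(1,3N-2) */
    zheev_("V", "U", &n, a, &lda, w, &wsize, &lwork, rwork, &info);   /* workspace query */
    lwork = (int)creal(wsize);
    double complex *work = malloc(lwork * sizeof(double complex));
    zheev_("V", "U", &n, a, &lda, w, work, &lwork, rwork, &info);
    free(work);
    free(rwork);
    return info;                              /* 0 on success */
  }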
- translate/redesign symmetry.tex
- check and document the Oles term + possibly only pure Coulomb U(i,j)
  6-site U/|i-j|
- Reimar's patch = ok
- check OP/sec, theoretical limits MBps MOps etc. no disk/IO?


pid=...
while ps -p $pid >/dev/null; do
 echo -n "$(date +"%j %H:%M:%S") "
 # only for OSF: -g $pid (for subprocesses, e.g. gzip)
 ps -p $pid -o "pid,pgid,ppid,time,etime,usertime,systime,pcpu,pagein,vsz,rss,inblock,oublock" | tail -1
 sleep 30
done
# compare process + system pagein/inblock/oublock if possible

#
plot [700:840] "aab.log" u 0:3 t "cpu/%" w lp,\
 "<awk '{print  ($9-x)/3.e3; x=$9}' aab.log" u 0:1 t "read/3e3" w lp,\
 "<awk '{print  ($10-x)/1.e2; x=$10}' aab.log" u 0:1 t "write/1e2" w lp

marvel: full load
time gzip  -fc1 tmp/htmp001.001 >tmp/htmp001.1.gz   1m32.381s # v1.2.4
time gzip  -fc6 tmp/htmp001.001 >tmp/htmp001.6.gz   3m07.496s
time gzip  -fc9 tmp/htmp001.001 >tmp/htmp001.9.gz   4m34.404s
time bzip2 -fc1 tmp/htmp001.001 >tmp/htmp001.1.bz2  7m43.521s # v1.0.1
time bzip2 -fc9 tmp/htmp001.001 >tmp/htmp001.9.bz2 12m14.325s
ls -l tmp/htmp001.*
-rw-r--r--   1 jschulen urzs     778485760 Jan 17 22:11 tmp/htmp001.001
-rw-r--r--   1 jschulen urzs     303230293 Jan 19 13:19 tmp/htmp001.1.gz  39%
-rw-r--r--   1 jschulen urzs     297352257 Jan 19 14:31 tmp/htmp001.6.gz  38%
-rw-r--r--   1 jschulen urzs     296760506 Jan 19 13:54 tmp/htmp001.9.gz  38%
-rw-r--r--   1 jschulen urzs     296796144 Jan 19 14:07 tmp/htmp001.1.bz2 38%
-rw-r--r--   1 jschulen urzs     332110892 Jan 19 14:20 tmp/htmp001.9.bz2 42% ?
time cat tmp/htmp001.001          >/dev/null 0m11.624s
time gunzip  -c tmp/htmp001.1.gz  >/dev/null 0m24.802s
time gunzip  -c tmp/htmp001.9.gz  >/dev/null 0m24.043s
time bunzip2 -c tmp/htmp001.1.bz2 >/dev/null 1m55.676s

Performance and efficiency

The efficiency of spinpack-2.19 on a Pentium-M 1.4GHz was estimated using valgrind-20030725 for the 40-site square-lattice s=1/2 model: 37461525713 instructions / 49s = 764M I/s (600MHz), 12647793092 Drefs / 49s = 258M rw/s.