This page is obsolete. Communication should be done in the background (ToDo)!

MPI-SpeedUp for SpinPack (matrix in memory, updated Mar09)

[Plot: SpinPack MPI-SpeedUp measured (Mar09)]   [Plot: MPI-Stress Benchmark (Mar09)]

Now we can compare the new SC5832 machine (light green) to an Infiniband cluster with two QuadOpteron nodes (brown curve), which has enough bandwidth to show good scaling. Both scale well, but the SC5832 does better when normalized to peak GFLOPS or energy consumption. The MPI-Stress benchmark shows that the Infiniband cluster has much higher latencies for collective communication. The SMP machines have latencies of 1.9us for a 4-socket DualOpteron and 3.1us for an 8-socket QuadOpteron system using OpenMPI-1.2.6. The Altix4700 at the LRZ has 510 usable IA64 processors per NUMAlink partition. The MPI speed depends very strongly on the MPI packet size (vertical lines in the plot; the middle line marks 128kB).
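
As a rough illustration of how the collective latencies quoted above can be measured (a generic sketch, not the actual MPI-Stress code), one can time a loop of MPI_Barrier calls and report the average time per call:

    /* Generic sketch of a collective-latency measurement (not the actual
     * MPI-Stress code): time many MPI_Barrier calls and print the average
     * per-call latency on rank 0. Compile with mpicc, run with mpirun. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        const int iterations = 10000;          /* arbitrary repeat count */
        int rank, size, i;
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        MPI_Barrier(MPI_COMM_WORLD);           /* warm up and synchronize */
        t0 = MPI_Wtime();
        for (i = 0; i < iterations; i++)
            MPI_Barrier(MPI_COMM_WORLD);
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("%d ranks: %.2f us per MPI_Barrier\n",
                   size, (t1 - t0) / iterations * 1e6);

        MPI_Finalize();
        return 0;
    }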

    Checked for up to 1000 cores now (Jul08)!

    linear in plot: x=log(CPUs) y=log(t1/t)
     lg2(t1/t) =     b * lg2(CPUs)   # t1 extrapolated 1CPU-time
         t1/t  = (2^(b))^lg2(CPUs)   # 2^b = SpeedUp2 (CPU-doubling)
         t1/t2 = (2^(b))^lg2(2) = 2^b = SpeedUp2
      BWFactor = t1/t(1CPU)          # Bandwidth Factor
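
    The slope b (and thus SpeedUp2 = 2^b) can be taken from two measured run times; a minimal C sketch with made-up example timings:

      /* Sketch (not part of SpinPack): derive the slope b and the
       * CPU-doubling speedup SpeedUp2 = 2^b from two measured run times,
       * using the fit lg2(t1/t) = b * lg2(CPUs). Example numbers only. */
      #include <math.h>
      #include <stdio.h>

      int main(void)
      {
          double n1 = 8.0,  t_n1 = 120.0;   /* run time [s] on  8 CPUs (example) */
          double n2 = 64.0, t_n2 =  30.0;   /* run time [s] on 64 CPUs (example) */

          /* slope between the two measured points */
          double b        = log2(t_n1 / t_n2) / log2(n2 / n1);
          double speedup2 = pow(2.0, b);    /* speedup gained per CPU doubling */

          printf("b = %.3f  SpeedUp2 = %.3f\n", b, speedup2);   /* 0.667, 1.587 */
          return 0;
      }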

    OverallSpeedUp = SMPSpeedUp * MPISpeedUp
        SMPSpeedUp =            SMPSpeedUp2 ^ lg2(SMPCores)
        MPISpeedUp = BWFactor * MPISpeedUp2 ^ lg2(MPINodes)
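
    A minimal C sketch of evaluating this model (core/node counts are example values; the SpeedUp2 and BWFactor values correspond to the tables below):

      /* Sketch of evaluating the speedup model above (not part of SpinPack).
       * SpeedUp2 values correspond to v2.36, BWFactor to a >=1Gbit/s network
       * (see the tables below); core/node counts are example values. */
      #include <math.h>
      #include <stdio.h>

      int main(void)
      {
          double smp_speedup2 = 1.66;   /* SMP SpeedUp2, v2.36           */
          double mpi_speedup2 = 1.69;   /* MPI SpeedUp2, v2.36           */
          double bw_factor    = 1.0;    /* approx. 100% for >=1Gbit/s    */
          double smp_cores    = 8.0;    /* cores per node (example)      */
          double mpi_nodes    = 50.0;   /* number of MPI nodes (example) */

          double smp_speedup = pow(smp_speedup2, log2(smp_cores));
          double mpi_speedup = bw_factor * pow(mpi_speedup2, log2(mpi_nodes));

          printf("SMPSpeedUp = %.1f  MPISpeedUp = %.1f  OverallSpeedUp = %.1f\n",
                 smp_speedup, mpi_speedup, smp_speedup * mpi_speedup);
          return 0;
      }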

    SpeedUp2
      v2.33: SMP = 1.66 (up to 32 CPUs),  MPI = 1.46 (up to 64 nodes)
      v2.36: SMP = 1.66 (up to 32 CPUs),  MPI = 1.69 (up to 50 nodes)

      BWFactor 100Mbit/s = approx.  40% ( 40% float, 25% double, 2*2GHz)
                 1Gbit/s = approx. 100% (100% float, 70% double, 4*2GHz)
              2*10Gbit/s = approx. 100% (estimated, BW*Cores/Node)

    extrapolation:
      v2.33: SpeedUp = 1.66^lg2(SMPCores) * 1.46^lg2(MPINodes)
      v2.36: SpeedUp = 1.66^lg2(SMPCores) * 1.69^lg2(MPINodes) ??
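
    Worked example for v2.36 (8 SMP cores per node and 64 MPI nodes chosen for illustration):
      SpeedUp = 1.66^lg2(8) * 1.69^lg2(64)
              = 1.66^3      * 1.69^6
              = approx. 4.57 * 23.3 = approx. 107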