
Computation slower with float than double.

Discussion in 'Intel' started by Michele Guidolin, Jun 7, 2005.

  1. Hello everybody.

    I'm running some benchmarks of a red-black Gauss-Seidel algorithm on
    two-dimensional grids of different sizes and element types, and I get
    a strange result when I change the computation from double to float.

    Here are the test times for the different grid SIZEs and types:

    SIZE      128      256      512
    float     2.20s    2.76s    7.86s
    double    2.30s    2.47s    2.59s

    As you can see, when the grid has a size of 512 nodes the float
    version gets drastically slower.
    The number of iterations is scaled with the grid SIZE (see ITERATIONS
    in the code below), so the total work is roughly constant and the
    times should be similar for the different grid sizes.

    Shouldn't float computation always be faster than double?
    I would like to know whether this is a gcc problem (I don't have
    another compiler) and, if not, what the problem could be.

    Hope to receive an answer as soon as possible,
    Thanks

    Michele Guidolin.

    P.S.
    Here is some more information about the test:

    The code I'm testing is the following; the double version is identical
    except that the constants are 0.25 instead of 0.25f.

    ------------- CODE -------------

    #include <stdio.h>
    #include <math.h>
    #include <sys/time.h>

    #define SHIFT_S 9
    #define SIZE (1<<SHIFT_S)
    #define DUMP 0

    /* row-major index into a SIZE x SIZE grid */
    #define MAT(i,j) (((i)<<SHIFT_S) + (j))

    /* timing globals and helpers: defined elsewhere in the test harness */
    struct timeval submit_time, complete_time;
    void init_boundaries(float *u, float *rhs);
    double timeval_diff(struct timeval *start, struct timeval *end);

    /* one Gauss-Seidel relaxation of point (i,j) */
    inline void gs_relax(int i, int j, float *u, float *rhs)
    {
        u[MAT(i,j)] = (float)( rhs[MAT(i,j)] +
                               0.0f * u[MAT(i,j)]   +
                               0.25f* u[MAT(i+1,j)] +
                               0.25f* u[MAT(i-1,j)] +
                               0.25f* u[MAT(i,j+1)] +
                               0.25f* u[MAT(i,j-1)] );
    }

    /* one fused red-black sweep over the whole grid */
    void gs_step_fusion(float *u, float *rhs)
    {
        int i, j;

        /* update the red points of the first interior row */
        for (j = 1; j < SIZE-1; j = j+2)
        {
            gs_relax(1, j, u, rhs);
        }
        /* fused loop: relax row i and the row above it in the same pass */
        for (i = 2; i < SIZE-1; i++)
        {
            for (j = 1+(i+1)%2; j < SIZE-1; j = j+2)
            {
                gs_relax(i, j, u, rhs);
                gs_relax(i-1, j, u, rhs);
            }
        }
        /* finish the last interior row */
        for (j = 1; j < SIZE-1; j = j+2)
        {
            gs_relax(SIZE-2, j, u, rhs);
        }
    }

    int main(void)
    {
        int iter;

        /* scale the iteration count so the total work is the same for
           every grid size: ITERATIONS = 2^28 / SIZE^2 */
        int ITERATIONS = ((int)(pow(2.0,28.0)) / (pow((double)SIZE,2.0)));

        float u[SIZE*SIZE];
        float rhs[SIZE*SIZE];

        double time;

        printf("-----START SEQUENTIAL FUSION------------\n\n");
        printf("size: %d\n", SIZE);
        printf("loops: %d\n", ITERATIONS);
        init_boundaries(u, rhs);

        gettimeofday(&submit_time, 0);

        for (iter = 0; iter < ITERATIONS; iter++)
            gs_step_fusion(u, rhs);

        gettimeofday(&complete_time, 0);

        time = timeval_diff(&submit_time, &complete_time);
        printf("\ntime: %fs\n", time);

        printf("-----END SEQUENTIAL FUSION------------\n\n");

        return 0;
    }
    ---------------CODE--------------

    I'm testing this code on this machine:

    processor : 0
    vendor_id : GenuineIntel
    cpu family : 15
    model : 4
    model name : Intel(R) Pentium(R) 4 CPU 3.20GHz
    stepping : 1
    cpu MHz : 3192.311
    cache size : 1024 KB
    physical id : 0
    siblings : 2
    fdiv_bug : no
    hlt_bug : no
    f00f_bug : no
    coma_bug : no
    fpu : yes
    fpu_exception : yes
    cpuid level : 3
    wp : yes
    flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
    mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pni
    monitor ds_cpl cid
    bogomips : 6324.22

    with Hyper-Threading enabled, on GNU/Linux 2.6.8.

    The compiler is gcc 3.4.4 and the flags are:
    CFLAGS = -g -O2 -funroll-loops -msse2 -march=pentium4 -Wall

    I also tried -ffast-math and -mfpmath=sse, but I get the same result.
     
    Michele Guidolin, Jun 7, 2005
    #1
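
The listing above calls init_boundaries() and timeval_diff() and uses the submit_time / complete_time globals, which belong to the rest of the test harness and are not shown. A hypothetical minimal version of those two helpers, purely so the listing can be built as one file (the real harness code may differ), could look like this:

------------- CODE (hypothetical helpers) -------------

/* Stand-ins for the helpers referenced in the listing above; the real
   harness code may differ.  timeval_diff() returns elapsed seconds. */

double timeval_diff(struct timeval *start, struct timeval *end)
{
    return (end->tv_sec  - start->tv_sec) +
           (end->tv_usec - start->tv_usec) / 1000000.0;
}

void init_boundaries(float *u, float *rhs)
{
    int i, j;

    /* zero the interior and the right-hand side, put 1.0 on the boundary */
    for (i = 0; i < SIZE; i++)
        for (j = 0; j < SIZE; j++)
        {
            u[MAT(i,j)]   = 0.0f;
            rhs[MAT(i,j)] = 0.0f;
        }
    for (i = 0; i < SIZE; i++)
    {
        u[MAT(i,0)]      = 1.0f;
        u[MAT(i,SIZE-1)] = 1.0f;
        u[MAT(0,i)]      = 1.0f;
        u[MAT(SIZE-1,i)] = 1.0f;
    }
}
---------------CODE--------------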

  2. I wonder if the problem with float being slower might be an alignment issue.

    Later

    Mark Hittinger
     
    Mark Hittinger, Jun 7, 2005
    #2
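
One quick way to check Mark's alignment theory is to see where the two arrays actually land, then force a known alignment and re-time the float run. A rough sketch using GCC's aligned attribute (making the arrays static also moves them out of the stack frame, which is itself a useful data point):

------------- CODE (alignment check sketch) -------------

#include <stdio.h>
#include <stdint.h>

#define SHIFT_S 9
#define SIZE (1<<SHIFT_S)

/* force a known 16-byte alignment; static also takes the arrays
   off the stack */
static float u[SIZE*SIZE]   __attribute__((aligned(16)));
static float rhs[SIZE*SIZE] __attribute__((aligned(16)));

int main(void)
{
    /* report how each array sits relative to 16- and 64-byte boundaries */
    printf("u   %% 16 = %u, %% 64 = %u\n",
           (unsigned)((uintptr_t)u % 16), (unsigned)((uintptr_t)u % 64));
    printf("rhs %% 16 = %u, %% 64 = %u\n",
           (unsigned)((uintptr_t)rhs % 16), (unsigned)((uintptr_t)rhs % 64));
    return 0;
}
---------------CODE--------------

If the 512-size float run speeds up once the arrays are aligned (or simply moved off the stack), placement is worth pursuing; if nothing changes, alignment can probably be ruled out.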

  3. Seeing as you're using a P4 processor, you're presumably using SSE2.
    If so, I've seen discussions in the past where it's been shown that
    the P4's single-precision float doesn't perform nearly as well as its
    double-precision float. It might have something to do with how it
    groups the floating-point operands together prior to performing the
    operations. Apparently, the AMD implementation of SSE2 doesn't show
    any difference in performance whether you're using single or double
    precision. It's just one of those weird architectural quirks of the P4.

    Yousuf Khan
     
    Yousuf Khan, Jun 7, 2005
    #3
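
If the P4's single-precision path really were slower than its double-precision path, the gap should show up even without the big grid arrays. A small sketch that times the same multiply-add chain in float and in double (the volatile qualifiers only keep the compiler from folding the loops away; the loop count and seed values are arbitrary):

------------- CODE (arithmetic-only comparison sketch) -------------

#include <stdio.h>
#include <sys/time.h>

#define N 100000000

/* wall-clock time in seconds */
static double seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec + tv.tv_usec / 1000000.0;
}

int main(void)
{
    volatile float  xf = 0.5f, af = 0.0f;
    volatile double xd = 0.5,  ad = 0.0;
    double t;
    int i;

    /* single-precision chain */
    t = seconds();
    for (i = 0; i < N; i++)
        af = 0.25f * xf + 0.25f * af;
    printf("float : %.3fs (%f)\n", seconds() - t, (double)af);

    /* double-precision chain */
    t = seconds();
    for (i = 0; i < N; i++)
        ad = 0.25 * xd + 0.25 * ad;
    printf("double: %.3fs (%f)\n", seconds() - t, ad);

    return 0;
}
---------------CODE--------------

If these two loops come out about the same but the grid code still shows the jump at SIZE 512, the difference is more likely in memory behaviour than in the arithmetic itself.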
  4. Look at the assembly code and see if the compiler is converting float
    to double in the above code. It could be that the doubles are being
    loaded directly onto the floating-point stack while the singles are
    being converted in a general-purpose register and then loaded onto the
    FP stack. Recompile with double and then look at the difference in the
    assembly code.
     
    Beemer Biker, Jun 8, 2005
    #4
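
For reference, the kind of implicit conversion Beemer describes usually comes from mixing float operands with double constants. A minimal, hypothetical illustration: by the C rules the first function promotes its float operand to double for the multiply and converts the result back on return (with SSE math this usually shows up as cvtss2sd/cvtsd2ss in the assembly), while the second stays in single precision; with plain x87 code both end up on the 80-bit FP stack anyway, so their listings look almost identical.

------------- CODE (promotion example) -------------

/* Hypothetical functions, only to show where conversions come from. */

float scale_d(float x)
{
    /* double constant: x is promoted to double for the multiply,
       and the result is converted back to float on return */
    return 0.25 * x;
}

float scale_f(float x)
{
    /* float constant: the whole expression stays in single precision */
    return 0.25f * x;
}
---------------CODE--------------

Michele's float version already uses 0.25f constants, so this particular promotion shouldn't be the issue, but comparing the two assembly listings, as suggested here, is the way to confirm it.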
  5. He's using SSE2. Check out his compiler flags.

    Yousuf Khan
     
    Yousuf Khan, Jun 8, 2005
    #5
  6. Here is the assembler code of the float version:

    ------------- ASM -----------------
    inline void gs_relax(int i,int j,float *u, float *rhs)
    {
    fb: 55 push %ebp

    u[MAT(i,j)] = (float)( rhs[MAT(i,j)] +
    fc: d9 ee fldz
    fe: d9 05 00 00 00 00 flds 0x0
    104: d9 c9 fxch %st(1)
    106: 89 e5 mov %esp,%ebp
    108: 56 push %esi
    109: 8b 45 08 mov 0x8(%ebp),%eax
    10c: 8b 75 0c mov 0xc(%ebp),%esi
    10f: 53 push %ebx
    110: c1 e0 09 shl $0x9,%eax
    113: 8b 4d 10 mov 0x10(%ebp),%ecx
    116: 8b 55 14 mov 0x14(%ebp),%edx
    119: 8d 1c 30 lea (%eax,%esi,1),%ebx
    11c: c1 e0 00 shl $0x0,%eax
    11f: d8 0c 99 fmuls (%ecx,%ebx,4)
    122: d8 04 9a fadds (%edx,%ebx,4)
    125: 8d 94 30 00 02 00 00 lea 0x200(%eax,%esi,1),%edx
    12c: c1 e0 00 shl $0x0,%eax
    12f: d9 04 91 flds (%ecx,%edx,4)
    132: 8d 84 30 00 fe ff ff lea 0xfffffe00(%eax,%esi,1),%eax
    139: d8 ca fmul %st(2),%st
    13b: de c1 faddp %st,%st(1)
    13d: d9 04 81 flds (%ecx,%eax,4)
    140: d8 ca fmul %st(2),%st
    142: de c1 faddp %st,%st(1)
    144: d9 44 99 04 flds 0x4(%ecx,%ebx,4)
    148: d8 ca fmul %st(2),%st
    14a: d9 ca fxch %st(2)
    14c: d8 4c 99 fc fmuls 0xfffffffc(%ecx,%ebx,4)
    150: d9 c9 fxch %st(1)
    152: de c2 faddp %st,%st(2)
    154: de c1 faddp %st,%st(1)
    156: d9 1c 99 fstps (%ecx,%ebx,4)
    159: 5b pop %ebx
    15a: 5e pop %esi
    15b: 5d pop %ebp
    15c: c3 ret
    ------------- ASM -----------------

    and here is the assembler code of the double version:

    ------------- ASM -----------------
    inline void gs_relax(int i,int j,double *u, double *rhs)
    {
    112: 55 push %ebp

    u[MAT(i,j)] = ( rhs[MAT(i,j)] +
    113: d9 ee fldz
    115: d9 05 00 00 00 00 flds 0x0
    11b: d9 c9 fxch %st(1)
    11d: 89 e5 mov %esp,%ebp
    11f: 56 push %esi
    120: 8b 45 08 mov 0x8(%ebp),%eax
    123: 8b 75 0c mov 0xc(%ebp),%esi
    126: 53 push %ebx
    127: c1 e0 09 shl $0x9,%eax
    12a: 8b 4d 10 mov 0x10(%ebp),%ecx
    12d: 8b 55 14 mov 0x14(%ebp),%edx
    130: 8d 1c 30 lea (%eax,%esi,1),%ebx
    133: c1 e0 00 shl $0x0,%eax
    136: dc 0c d9 fmull (%ecx,%ebx,8)
    139: dc 04 da faddl (%edx,%ebx,8)
    13c: 8d 94 30 00 02 00 00 lea 0x200(%eax,%esi,1),%edx
    143: c1 e0 00 shl $0x0,%eax
    146: dd 04 d1 fldl (%ecx,%edx,8)
    149: 8d 84 30 00 fe ff ff lea 0xfffffe00(%eax,%esi,1),%eax
    150: d8 ca fmul %st(2),%st
    152: de c1 faddp %st,%st(1)
    154: dd 04 c1 fldl (%ecx,%eax,8)
    157: d8 ca fmul %st(2),%st
    159: de c1 faddp %st,%st(1)
    15b: dd 44 d9 08 fldl 0x8(%ecx,%ebx,8)
    15f: d8 ca fmul %st(2),%st
    161: d9 ca fxch %st(2)
    163: dc 4c d9 f8 fmull 0xfffffff8(%ecx,%ebx,8)
    167: d9 c9 fxch %st(1)
    169: de c2 faddp %st,%st(2)
    16b: de c1 faddp %st,%st(1)
    16d: dd 1c d9 fstpl (%ecx,%ebx,8)
    170: 5b pop %ebx
    171: 5e pop %esi
    172: 5d pop %ebp
    173: c3 ret
    ------------- ASM -----------------

    It's been a long time since I last looked at assembler code, but it
    looks to me like the float version is doing all the operations in
    float. Maybe Yousuf is right and it's the P4 that handles this badly.

    Can somebody who knows the asm better than I do help me?
    Thanks.

    Michele.
     
    Michele Guidolin, Jun 8, 2005
    #6
