# Computation slower with float than double

Discussion in 'Intel' started by Michele Guidolin, Jun 7, 2005.

1. ### Michele Guidolin (Guest)

Hello everybody.

I'm benchmarking a red-black Gauss-Seidel algorithm on 2-dimensional
grids of different sizes and types, and I get some strange results
when I change the computation from double to float.

Here are the test times for different grid SIZEs and types:

| SIZE   | 128   | 256   | 512   |
|--------|-------|-------|-------|
| float  | 2.20s | 2.76s | 7.86s |
| double | 2.30s | 2.47s | 2.59s |

As you can see, when the grid has a size of 512 nodes the time of the
float code increases drastically.
The number of loops is scaled inversely to the number of grid points
(2^28 / SIZE^2), so the total work is the same and the time should be
similar for every SIZE of grid.

Shouldn't the float computation always be faster than double?
I would like to know whether this is a gcc problem (I don't have another
compiler) and, if it is not, what the problem could be.

Thanks

Michele Guidolin.

P.S.

The code that I'm testing follows. The double version is the same,
except that the constants are 0.25 instead of 0.25f.

------------- CODE -------------

#include <stdio.h>
#include <math.h>
#include <sys/time.h>

#define SHIFT_S 9
#define SIZE (1<<SHIFT_S)
#define DUMP 0

/* row-major index of grid point (i,j); each row holds SIZE elements */
#define MAT(i,j) (((i)<<SHIFT_S) + (j))

/* init_boundaries() and timeval_diff() are defined elsewhere (not shown).
   The declaration of the two timestamps is not shown in the post either;
   struct timeval is assumed here because they are passed to gettimeofday(). */
struct timeval submit_time, complete_time;

/* relax one grid point from its four neighbours */
inline void gs_relax(int i,int j,float *u, float *rhs)
{
    u[MAT(i,j)] = (float)( rhs[MAT(i,j)] +
                           0.0f * u[MAT(i,j)] +
                           0.25f* u[MAT(i+1,j)]+
                           0.25f* u[MAT(i-1,j)]+
                           0.25f* u[MAT(i,j+1)]+
                           0.25f* u[MAT(i,j-1)]);
}

void gs_step_fusion(float *u, float *rhs)
{
    int i,j;

    /* update the red points:
     */
    for(j=1; j<SIZE-1; j=j+2)
    {
        gs_relax(1,j,u,rhs);
    }
    for(i=2; i<SIZE-1; i++)
    {
        for(j=1+(i+1)%2; j<SIZE-1; j=j+2)
        {
            gs_relax(i,j,u,rhs);
            gs_relax(i-1,j,u,rhs);
        }
    }
    for(j=1; j<SIZE-1; j=j+2)
    {
        gs_relax(SIZE-2,j,u,rhs);
    }
}

int main(void) {
    int iter;

    /* 2^28 / SIZE^2, so the total work is the same for every SIZE */
    int ITERATIONS = ((int)(pow(2.0,28.0))/(pow((double)SIZE,2.0)));

    float u[SIZE*SIZE];
    float rhs[SIZE*SIZE];

    double time;

    printf("-----START SEQUENTIAL FUSION------------\n\n");
    printf("size: %d\n",SIZE);
    printf("loops: %d\n",ITERATIONS);
    init_boundaries(u,rhs);

    gettimeofday(&submit_time, 0);

    for(iter=0; iter<ITERATIONS; iter++)
        gs_step_fusion(u,rhs);

    gettimeofday(&complete_time, 0);

    time = timeval_diff(&submit_time, &complete_time);
    printf("\ntime: %fs\n",time);

    printf("-----END SEQUENTIAL FUSION------------\n\n");

    return 0;
}
---------------CODE--------------

I'm testing this code on this machine:

processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 4
model name : Intel(R) Pentium(R) 4 CPU 3.20GHz
stepping : 1
cpu MHz : 3192.311
cache size : 1024 KB
physical id : 0
siblings : 2
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 3
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pni
monitor ds_cpl cid
bogomips : 6324.22

with Hyper-Threading enabled on GNU/Linux 2.6.8.

The compiler is gcc 3.4.4 and the flags are:
CFLAGS = -g -O2 -funroll-loops -msse2 -march=pentium4 -Wall

I also tried -ffast-math and -mfpmath=sse, but I get the same result.

Michele Guidolin, Jun 7, 2005

2. ### Mark Hittinger (Guest)

I wonder if the problem with float being slower might be an alignment issue.

Later

Mark Hittinger

Mark Hittinger, Jun 7, 2005

3. ### Yousuf Khan (Guest)

Seeing as you're using a P4 processor with SSE2: I've seen past
discussions where it was shown that the P4's single-precision float
doesn't perform nearly as well as its double-precision float. It might
have something to do with how it conglomerates the floating-point
operands together prior to performing the operations. Apparently, the
AMD version of SSE2 doesn't show any difference in performance whether
you're using single or double. It's just one of those weird
architectural issues in the P4.

Yousuf Khan

Yousuf Khan, Jun 7, 2005
4. ### Beemer Biker (Guest)

Look at the assembly code and see if the compiler is converting float to
double in the above code. It could be that the doubles are being loaded
directly onto the floating-point stack while the singles are being
converted in a GP register and then loaded onto the FP stack. Recompile
with double, then look at the difference in the assembly code.

Beemer Biker, Jun 8, 2005
5. ### Yousuf Khan (Guest)

He's using SSE2. Check out his compiler flags.

Yousuf Khan

Yousuf Khan, Jun 8, 2005
6. ### Michele Guidolin (Guest)

Here is the assembler code of the float version:

------------- ASM -----------------
inline void gs_relax(int i,int j,float *u, float *rhs)
{
fb: 55 push %ebp

u[MAT(i,j)] = (float)( rhs[MAT(i,j)] +
fc: d9 ee fldz
fe: d9 05 00 00 00 00 flds 0x0
104: d9 c9 fxch %st(1)
106: 89 e5 mov %esp,%ebp
108: 56 push %esi
109: 8b 45 08 mov 0x8(%ebp),%eax
10c: 8b 75 0c mov 0xc(%ebp),%esi
10f: 53 push %ebx
110: c1 e0 09 shl $0x9,%eax
113: 8b 4d 10 mov 0x10(%ebp),%ecx
116: 8b 55 14 mov 0x14(%ebp),%edx
119: 8d 1c 30 lea (%eax,%esi,1),%ebx
11c: c1 e0 00 shl $0x0,%eax
11f: d8 0c 99 fmuls (%ecx,%ebx,4)
122: d8 04 9a fadds (%edx,%ebx,4)
125: 8d 94 30 00 02 00 00 lea 0x200(%eax,%esi,1),%edx
12c: c1 e0 00 shl $0x0,%eax
12f: d9 04 91 flds (%ecx,%edx,4)
132: 8d 84 30 00 fe ff ff lea 0xfffffe00(%eax,%esi,1),%eax
139: d8 ca fmul %st(2),%st
13b: de c1 faddp %st,%st(1)
13d: d9 04 81 flds (%ecx,%eax,4)
140: d8 ca fmul %st(2),%st
142: de c1 faddp %st,%st(1)
144: d9 44 99 04 flds 0x4(%ecx,%ebx,4)
148: d8 ca fmul %st(2),%st
14a: d9 ca fxch %st(2)
14c: d8 4c 99 fc fmuls 0xfffffffc(%ecx,%ebx,4)
150: d9 c9 fxch %st(1)
152: de c2 faddp %st,%st(2)
154: de c1 faddp %st,%st(1)
156: d9 1c 99 fstps (%ecx,%ebx,4)
159: 5b pop %ebx
15a: 5e pop %esi
15b: 5d pop %ebp
15c: c3 ret
------------- ASM -----------------

and here is the assembler code of the double version:

------------- ASM -----------------
inline void gs_relax(int i,int j,double *u, double *rhs)
{
112: 55 push %ebp

u[MAT(i,j)] = ( rhs[MAT(i,j)] +
113: d9 ee fldz
115: d9 05 00 00 00 00 flds 0x0
11b: d9 c9 fxch %st(1)
11d: 89 e5 mov %esp,%ebp
11f: 56 push %esi
120: 8b 45 08 mov 0x8(%ebp),%eax
123: 8b 75 0c mov 0xc(%ebp),%esi
126: 53 push %ebx
127: c1 e0 09 shl $0x9,%eax
12a: 8b 4d 10 mov 0x10(%ebp),%ecx
12d: 8b 55 14 mov 0x14(%ebp),%edx
130: 8d 1c 30 lea (%eax,%esi,1),%ebx
133: c1 e0 00 shl $0x0,%eax
136: dc 0c d9 fmull (%ecx,%ebx,8)
139: dc 04 da faddl (%edx,%ebx,8)
13c: 8d 94 30 00 02 00 00 lea 0x200(%eax,%esi,1),%edx
143: c1 e0 00 shl $0x0,%eax
146: dd 04 d1 fldl (%ecx,%edx,8)
149: 8d 84 30 00 fe ff ff lea 0xfffffe00(%eax,%esi,1),%eax
150: d8 ca fmul %st(2),%st
154: dd 04 c1 fldl (%ecx,%eax,8)
157: d8 ca fmul %st(2),%st
15b: dd 44 d9 08 fldl 0x8(%ecx,%ebx,8)
15f: d8 ca fmul %st(2),%st
161: d9 ca fxch %st(2)
163: dc 4c d9 f8 fmull 0xfffffff8(%ecx,%ebx,8)
167: d9 c9 fxch %st(1)
16d: dd 1c d9 fstpl (%ecx,%ebx,8)
170: 5b pop %ebx
171: 5e pop %esi
172: 5d pop %ebp
173: c3 ret
------------- ASM -----------------

It's been a long time since I last looked at assembler code, but to me it
looks like the float version is doing all the operations in float.
Maybe Yousuf is right and it is the P4 that performs very badly.

Can somebody who knows asm better help me?
Thanks.

Michele.

Michele Guidolin, Jun 8, 2005