I have built a fully functional ARM7 prototype board based on the
Atmel AT91R40008 processor. Everything works, but the performance of
the processor is approximately 1/10th of what it should be. In a simple
internal-SRAM write test, I first copy my code to SRAM, then run from
SRAM and write blocks of 32 bytes to consecutive locations in an
unrolled loop over a 9600-byte test buffer. I repeat this loop 8 times
so the scope can get a good lock. The original C/C++ code and the
disassembled ARM code are below for reference. The key point is that,
other than the looping overhead, the instruction stream should be
nothing but fetch, decode, and execute of store-byte instructions with
an immediate offset to internal SRAM, of the form:
STRB Rn,[ip,#dd]
In the worst case this should take 1-3 cycles per operation, yet when I
scope it I see a memory write approximately every 40 ("FORTY") cycles!
This is bizarre. Of course the External Bus Interface settings are
irrelevant for the internal bus, and I am not pulling on the external
nWait pin. I hypothesize that the processor comes out of reset in some
mode that runs slower. Maybe it has something to do with the debug
interface; I am not sure. Nothing I have found in all 3000+ pages of
ARM docs leads me to any conclusions...
As another brief example, this is the C/C++ code for a max speed I/O
toggle. I have a scope on one of the I/O pins, toggle the pin in a
loop at max speed, and look at the waveform:
******** C/C++ code
while(1)
{
    pio_base_ptr[PIO_SODR/4] = 0x00020000;  // set the pin
    pio_base_ptr[PIO_CODR/4] = 0x00020000;  // clear the pin
}
And here's the disassembled ARM code: 5 instructions, yet it is
taking nearly 400 clocks to run them! Again, the code is running
out of SRAM and that's it. Bizarre.
************* ARM CODE
|L000630.J10.C_Entry|
LDR a2,[v2,#4]
STR a1,[a2,#&30]!
LDR a2,[v2,#4]
STR a1,[a2,#&34]!
B |L000630.J10.C_Entry|
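
For comparison, a rough cycle budget for the five-instruction loop above can be sketched in C. The per-instruction figures are the nominal ARM7TDMI timings (LDR 3 clocks, STR 2, taken branch 3) and assume zero-wait-state memory, so that every N, S and I cycle takes one clock. Accesses to the PIO registers may cost extra clocks on real silicon, which this sketch does not model. The enum and function names are made up for illustration.

```c
/* Nominal ARM7TDMI timings in core clocks, ASSUMING zero wait states
   (one clock per N, S or I cycle). Peripheral accesses are not modeled. */
enum { CYC_LDR = 3, CYC_STR = 2, CYC_B = 3 };

/* Ideal clocks for one trip around the loop: LDR, STR, LDR, STR, B. */
static int toggle_loop_cycles(void)
{
    return 2 * CYC_LDR + 2 * CYC_STR + CYC_B;
}

/* Whole-number slowdown of an observed clock count versus the ideal. */
static int toggle_slowdown(int observed_cycles)
{
    return observed_cycles / toggle_loop_cycles();
}
```

Under these assumptions one trip around the loop should cost about 13 clocks, so the roughly 400 clocks seen on the scope is on the order of 30 times slower than the ideal.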
There are very few resources with HARDCORE info, so any insight would
be greatly appreciated.
Desperately seeking a GURU,
Xander.
*********** C/C++ version of the memory fill
// fill memory up with incremental values
for (t=0; t < 8; t++)
    for (ram_index = 0; ram_index < 9600/1-32; ram_index+=32)
    {
        work_ptr[ram_index+0] = 1;
        work_ptr[ram_index+1] = 2;
        work_ptr[ram_index+2] = 3;
        work_ptr[ram_index+3] = 4;
        work_ptr[ram_index+4] = 1;
        work_ptr[ram_index+5] = 2;
        work_ptr[ram_index+6] = 3;
        work_ptr[ram_index+7] = 4;
        work_ptr[ram_index+8] = 1;
        work_ptr[ram_index+9] = 2;
        work_ptr[ram_index+10] = 3;
        work_ptr[ram_index+11] = 4;
        work_ptr[ram_index+12] = 1;
        work_ptr[ram_index+13] = 2;
        work_ptr[ram_index+14] = 3;
        work_ptr[ram_index+15] = 4;
        work_ptr[ram_index+16] = 1;
        work_ptr[ram_index+17] = 2;
        work_ptr[ram_index+18] = 3;
        work_ptr[ram_index+19] = 4;
        work_ptr[ram_index+20] = 1;
        work_ptr[ram_index+21] = 2;
        work_ptr[ram_index+22] = 3;
        work_ptr[ram_index+23] = 4;
        work_ptr[ram_index+24] = 1;
        work_ptr[ram_index+25] = 2;
        work_ptr[ram_index+26] = 3;
        work_ptr[ram_index+27] = 4;
        work_ptr[ram_index+28] = 1;
        work_ptr[ram_index+29] = 2;
        work_ptr[ram_index+30] = 3;
        work_ptr[ram_index+31] = 4;
    }
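
A side note on the loop bound, unrelated to the speed problem: in C the expression 9600/1-32 parses as (9600/1) - 32 = 9568, so the inner loop appears to stop one block short of the 9600-byte buffer. A small host-side check, with helper names invented for illustration, could look like this:

```c
/* Count how many times the inner loop body runs. The bound expression
   is copied verbatim from the test code above. */
static int fill_iterations(void)
{
    int count = 0;
    int ram_index;
    for (ram_index = 0; ram_index < 9600/1-32; ram_index += 32)
        count++;
    return count;
}

/* Each iteration stores 32 bytes. */
static int fill_bytes_written(void)
{
    return fill_iterations() * 32;
}
```

This gives 299 iterations and 9568 bytes per pass, so the final 32 bytes of the buffer are left unwritten. It does not affect the per-store timing seen on the scope.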
********* ARM ASM version of the memory fill
|L000638.J8.C_Entry|
STR v2,[v4,#&c5c]
MOV a2,#0
STR v2,[v4,#&c60]
|L000644.J10.C_Entry|
MOV a1,#0
|L000648.J11.C_Entry|
STRB v2,[v1,a1]
ADD ip,v1,a1
STRB a4,[ip,#1]
STRB v3,[ip,#2]
STRB lr,[ip,#3]
STRB v2,[ip,#4]
STRB a4,[ip,#5]
STRB v3,[ip,#6]
STRB lr,[ip,#7]
STRB v2,[ip,#8]
STRB a4,[ip,#9]
STRB v3,[ip,#&a]
STRB lr,[ip,#&b]
STRB v2,[ip,#&c]
STRB a4,[ip,#&d]
STRB v3,[ip,#&e]
STRB lr,[ip,#&f]
STRB v2,[ip,#&10]
STRB a4,[ip,#&11]
STRB v3,[ip,#&12]
STRB lr,[ip,#&13]
STRB v2,[ip,#&14]
STRB a4,[ip,#&15]
STRB v3,[ip,#&16]
STRB lr,[ip,#&17]
STRB v2,[ip,#&18]
STRB a4,[ip,#&19]
STRB v3,[ip,#&1a]
STRB lr,[ip,#&1b]
STRB v2,[ip,#&1c]
STRB a4,[ip,#&1d]
STRB v3,[ip,#&1e]
STRB lr,[ip,#&1f]
ADD a1,a1,#&20
CMP a1,a3
BLT |L000648.J11.C_Entry|
ADD a2,a2,#1
CMP a2,#8
BLT |L000644.J10.C_Entry|
B |L000638.J8.C_Entry|
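
As a sanity check on the numbers, the inner loop at L000648 can be budgeted the same way. This is only a sketch: it uses the nominal ARM7TDMI timings (STRB 2 clocks, data-processing instruction 1, taken branch 3) and assumes zero-wait-state internal SRAM. The names below are hypothetical.

```c
/* Nominal ARM7TDMI timings in core clocks, ASSUMING zero wait states. */
enum { CYC_STRB = 2, CYC_ALU = 1, CYC_BRANCH_TAKEN = 3 };
enum { STORES_PER_PASS = 32 };

/* Ideal clocks for one pass of the inner loop (one 32-byte block). */
static int fill_pass_ideal_cycles(void)
{
    return STORES_PER_PASS * CYC_STRB   /* 32 x STRB                  */
         + 3 * CYC_ALU                  /* ADD ip, ADD a1, CMP        */
         + CYC_BRANCH_TAKEN;            /* BLT back to the loop top   */
}

/* Clocks per pass implied by a measured clocks-per-store figure. */
static int fill_pass_observed_cycles(int cycles_per_store)
{
    return STORES_PER_PASS * cycles_per_store;
}
```

With these assumptions a 32-byte block should take about 70 clocks, a little over 2 clocks per byte. A store every 40 clocks implies about 1280 clocks per block, roughly 18 times the ideal.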