Motherboard Forums


Reply
Thread Tools Display Modes

Any Large Integer Libraries Out There for the ARM?

 
 
Boudewijn Dijkstra
Guest
Posts: n/a
 
      04-25-2012, 07:23 AM
Op Mon, 23 Apr 2012 20:36:48 +0200 schreef David T. Ashley
<(E-Mail Removed)>:
> On Fri, 20 Apr 2012 00:57:28 +0200, "Boudewijn Dijkstra"
> <(E-Mail Removed)> wrote:
>
>> Op Wed, 18 Apr 2012 16:22:10 +0200 schreef David T. Ashley
>> <(E-Mail Removed)>:
>>> I needed to multiply U128 = U96 * U32 on an ARM the other day, and did
>>> a hack job in C. It is logically correct, but it probably takes 3-10
>>> times as long as assembly-language.

>>
>> Unlikely. With Cortex-M3 and the IAR compiler I came to the following
>> results:
>> * C-only: 63 cycles
>> * C-with-asm: 50 cycles
>> * asm-only: 39 cycles
>>
>> In how many cycles does your version run?

>
> Can you post your C code that led to 63-cycle execution?


Note that I did not perform exhaustive testing on the below code.

#include <inttypes.h>

typedef struct uint96_s {
uint32_t w0;
uint32_t w1;
uint32_t w2;
} uint96_t;

typedef struct uint128_s {
uint32_t w0;
uint32_t w1;
uint32_t w2;
uint32_t w3;
} uint128_t;

void umul_96x32c(uint96_t *a, uint32_t b, uint128_t *r)
{
uint64_t p0, p1, p2;
uint32_t r0, r1a, r1b, r2a, r2b, r3;
uint64_t r1, r2;

p0 = (uint64_t)a->w0 * (uint64_t)b;
r0 = p0;
r1a = p0 >> 32;
r->w0 = r0;
p1 = (uint64_t)a->w1 * (uint64_t)b;
r1b = p1;
r2a = p1 >> 32;
r1 = (uint64_t)r1a + (uint64_t)r1b;
r->w1 = r1;
p2 = (uint64_t)a->w2 * (uint64_t)b;
r2b = p2;
r3 = p2 >> 32;
r2 = (r1 >> 32) + (uint64_t)r2a + (uint64_t)r2b;
r->w2 = r2;
r3 += r2 >> 32;
r->w3 = r3;
}

> And your C with ASM?


The obvious thing to do is to add manual ADC instructions.

#pragma inline=never
void umul_96x32i(uint96_t *a, uint32_t b, uint128_t *r)
{
uint64_t p0, p1, p2;
uint32_t r0, r1a, r1b, r2a, r2b, r3;
uint32_t r1, r2;

p0 = (uint64_t)a->w0 * (uint64_t)b;
r0 = p0;
r1a = p0 >> 32;
r->w0 = r0;
p1 = (uint64_t)a->w1 * (uint64_t)b;
r1b = p1;
r2a = p1 >> 32;
r1 = r1a + r1b;
r->w1 = r1;
p2 = (uint64_t)a->w2 * (uint64_t)b;
r2b = p2;
r3 = p2 >> 32;
__asm("adcs %[Rd],%[Rn],%[Rm]"
: [Rd]"=r"(r2)
: [Rn]"r" (r2a), [Rm]"r" (r2b)
: "cc");
r->w2 = r2;
__asm("adc %[Rd],%[Rn],#0"
: [Rd]"=r"(r3)
: [Rn]"r" (r3));
r->w3 = r3;
}

> And your ASM-only?


The only further possible optimization was register allocation.

umul_96x32:
PUSH {R4}
LDR R3,[R0, #+0]
UMULL R3,R12,R1,R3
STR R3,[R2, #+0]
LDR R3,[R0, #+4]
UMULL R3,R4,R1,R3
ADDS R12,R3,R12
STR R12,[R2, #+4]
LDR R0,[R0, #+8]
UMULL R0,R1,R1,R0
ADCS R3,R4,R0
STR R3,[R2, #+8]
ADC R0,R1,#0
STR R0,[R2, #+12]
POP {R4}
BX LR ;; return



--
Gemaakt met Opera's revolutionaire e-mailprogramma:
http://www.opera.com/mail/
(Remove the obvious prefix to reply privately.)
 
Reply With Quote
 
 
 
 
Noob
Guest
Posts: n/a
 
      04-25-2012, 08:19 AM
Boudewijn Dijkstra wrote:

> The obvious thing to do is to add manual ADC instructions.


The somewhat less obvious optimization is to use umlal,
which handles the 64-bit add.

> The only further possible optimization was register allocation.


I'm not sure what the pipeline looks like, but spacing producers
and consumers, by using more temporary registers /might/ slightly
improve performance.

Regards.
 
Reply With Quote
 
 
 
 
Noob
Guest
Posts: n/a
 
      04-25-2012, 08:23 AM
David T. Ashley wrote:

> By changing to one temporary variable (rather than 3) the compiler
> seems to have shaved out two instructions (one at the start to
> allocate stack space and one at the end to deallocate it).


In the parameters, you should provide the value of in_u32,
instead of a pointer to the value, because the compiler is
reloading the value every time it is needed.

You didn't specify. Which CPU(s) are you targeting?

Regards.
 
Reply With Quote
 
Boudewijn Dijkstra
Guest
Posts: n/a
 
      04-25-2012, 12:21 PM
Op Wed, 25 Apr 2012 10:19:23 +0200 schreef Noob <root@127.0.0.1>:
> Boudewijn Dijkstra wrote:
>
>> The obvious thing to do is to add manual ADC instructions.

>
> The somewhat less obvious optimization is to use umlal,
> which handles the 64-bit add.


Indeed. But I'm not sure which is faster.

>> The only further possible optimization was register allocation.

>
> I'm not sure what the pipeline looks like, but spacing producers
> and consumers, by using more temporary registers /might/ slightly
> improve performance.


No, it wouldn't. Cortex-M3 only stalls on memory operations.


--
Gemaakt met Opera's revolutionaire e-mailprogramma:
http://www.opera.com/mail/
(Remove the obvious prefix to reply privately.)
 
Reply With Quote
 
David T. Ashley
Guest
Posts: n/a
 
      04-25-2012, 01:06 PM
On Wed, 25 Apr 2012 09:23:16 +0200, "Boudewijn Dijkstra"
<(E-Mail Removed)> wrote:
>
>> And your ASM-only?

>
>The only further possible optimization was register allocation.
>
>umul_96x32:
> PUSH {R4}
> LDR R3,[R0, #+0]
> UMULL R3,R12,R1,R3
> STR R3,[R2, #+0]
> LDR R3,[R0, #+4]
> UMULL R3,R4,R1,R3
> ADDS R12,R3,R12
> STR R12,[R2, #+4]
> LDR R0,[R0, #+8]
> UMULL R0,R1,R1,R0
> ADCS R3,R4,R0
> STR R3,[R2, #+8]
> ADC R0,R1,#0
> STR R0,[R2, #+12]
> POP {R4}
> BX LR ;; return


Thanks for all the help. The code posted allowed me to optimize my
multiply function to almost as efficient as hand-written
assembly-language.

I did post my original code elsewhere on this thread. I handled
carries badly.

Thanks and best regards, Dave Ashley
 
Reply With Quote
 
Noob
Guest
Posts: n/a
 
      04-25-2012, 03:56 PM
Boudewijn Dijkstra wrote:

> Noob wrote:
>
>> Boudewijn Dijkstra wrote:
>>
>>> The obvious thing to do is to add manual ADC instructions.

>>
>> The somewhat less obvious optimization is to use umlal,
>> which handles the 64-bit add.

>
> Indeed. But I'm not sure which is faster.


According to ARM's "Cortex-M3 Technical Reference Manual",

UMULL/SMULL/UMLAL/SMLAL use early termination depending on the size
of source values. These are interruptible (abandoned/restarted), with
worst case latency of one cycle. MLAL versions take four to seven
cycles and MULL versions take three to five cycles. For MLAL, the
signed version is one cycle longer than the unsigned.

and

Cycle count based on input sizes. That is, ABS(inputs) < 64K
terminates early.

Therefore, we are comparing
(1 umull + 2 umlal) vs (3 umull + 3 add) = 11-19 vs 12-18
Hard to give a definitive answer...

>>> The only further possible optimization was register allocation.

>>
>> I'm not sure what the pipeline looks like, but spacing producers
>> and consumers, by using more temporary registers /might/ slightly
>> improve performance.

>
> No, it wouldn't. Cortex-M3 only stalls on memory operations.


I don't understand. Are multi-cycle instructions (such as umull)
pipelined or not?

If I write

umull r4, r5, r0, r3
add r6, r7, r8

add can be executed one cycle after umull (mult and ALU are
two different hardware blocks).

If I write

umull r4, r5, r0, r3
add r6, r5, r4

the pipeline must stall to wait for the result of umull to
become available. What am I missing?

Regards.
 
Reply With Quote
 
Boudewijn Dijkstra
Guest
Posts: n/a
 
      04-26-2012, 09:11 AM
Op Wed, 25 Apr 2012 17:56:06 +0200 schreef Noob <root@127.0.0.1>:
> Boudewijn Dijkstra wrote:
>> Noob wrote:
>>> Boudewijn Dijkstra wrote:
>>>
>>>> The only further possible optimization was register allocation.
>>>
>>> I'm not sure what the pipeline looks like, but spacing producers
>>> and consumers, by using more temporary registers /might/ slightly
>>> improve performance.

>>
>> No, it wouldn't. Cortex-M3 only stalls on memory operations.

>
> I don't understand. Are multi-cycle instructions (such as umull)
> pipelined or not?
>
> If I write
>
> umull r4, r5, r0, r3
> add r6, r7, r8
>
> add can be executed one cycle after umull (mult and ALU are
> two different hardware blocks).
>
> If I write
>
> umull r4, r5, r0, r3
> add r6, r5, r4
>
> the pipeline must stall to wait for the result of umull to
> become available. What am I missing?


AFAIK Cortex-M does not feature parallel execution paths. Cortex-R does.


--
Gemaakt met Opera's revolutionaire e-mailprogramma:
http://www.opera.com/mail/
(Remove the obvious prefix to reply privately.)
 
Reply With Quote
 
David T. Ashley
Guest
Posts: n/a
 
      04-26-2012, 01:52 PM
On Wed, 25 Apr 2012 10:23:50 +0200, Noob <root@127.0.0.1> wrote:

>David T. Ashley wrote:
>
>> By changing to one temporary variable (rather than 3) the compiler
>> seems to have shaved out two instructions (one at the start to
>> allocate stack space and one at the end to deallocate it).

>
>In the parameters, you should provide the value of in_u32,
>instead of a pointer to the value, because the compiler is
>reloading the value every time it is needed.
>
>You didn't specify. Which CPU(s) are you targeting?


The compiler is set for Thumb. (I believe the Cortex-M3 handles only
Thumb-2 instructions, which are a superset of Thumb.)

The parameter observation is interesting. I'd have to look at the
code in more detail (which I won't do now).

But you'd think when you have a CPU with 16 registers, maybe 12 of
which you are free to use assuming you push a few, the compiler could
allocate 4 registers just to buffer the pass-by-reference inputs.

DTA
 
Reply With Quote
 
Paul
Guest
Posts: n/a
 
      04-26-2012, 06:30 PM
In article <(E-Mail Removed)>,
(E-Mail Removed) says...
>
> On Wed, 25 Apr 2012 10:23:50 +0200, Noob <root@127.0.0.1> wrote:
>
> >David T. Ashley wrote:
> >
> >> By changing to one temporary variable (rather than 3) the compiler
> >> seems to have shaved out two instructions (one at the start to
> >> allocate stack space and one at the end to deallocate it).

> >
> >In the parameters, you should provide the value of in_u32,
> >instead of a pointer to the value, because the compiler is
> >reloading the value every time it is needed.
> >
> >You didn't specify. Which CPU(s) are you targeting?

>
> The compiler is set for Thumb. (I believe the Cortex-M3 handles only
> Thumb-2 instructions, which are a superset of Thumb.)
>
> The parameter observation is interesting. I'd have to look at the
> code in more detail (which I won't do now).
>
> But you'd think when you have a CPU with 16 registers, maybe 12 of
> which you are free to use assuming you push a few, the compiler could
> allocate 4 registers just to buffer the pass-by-reference inputs.
>


Why all compiler manufacturers assume you are making a
tablet/phone/laptop therefor just put in more RAM to blaot around the
problem. Just like the chip manufacturers, nothing is made in quantities
less than 10,000 and always ahs TerraBytes of RAM.

Thats what everybody does of course..



[Hint just a tad of sarcasm in the above statements]


--
Paul Carpenter | (E-Mail Removed)
<http://www.pcserviceselectronics.co.uk/> PC Services
<http://www.pcserviceselectronics.co.uk/fonts/> Timing Diagram Font
<http://www.gnuh8.org.uk/> GNU H8 - compiler & Renesas H8/H8S/H8 Tiny
<http://www.badweb.org.uk/> For those web sites you hate
 
Reply With Quote
 
Tauno Voipio
Guest
Posts: n/a
 
      04-26-2012, 06:37 PM
On 26.4.12 4:52 , David T. Ashley wrote:
> On Wed, 25 Apr 2012 10:23:50 +0200, Noob<root@127.0.0.1> wrote:
>
>> David T. Ashley wrote:
>>
>>> By changing to one temporary variable (rather than 3) the compiler
>>> seems to have shaved out two instructions (one at the start to
>>> allocate stack space and one at the end to deallocate it).

>>
>> In the parameters, you should provide the value of in_u32,
>> instead of a pointer to the value, because the compiler is
>> reloading the value every time it is needed.
>>
>> You didn't specify. Which CPU(s) are you targeting?

>
> The compiler is set for Thumb. (I believe the Cortex-M3 handles only
> Thumb-2 instructions, which are a superset of Thumb.)
>
> The parameter observation is interesting. I'd have to look at the
> code in more detail (which I won't do now).
>
> But you'd think when you have a CPU with 16 registers, maybe 12 of
> which you are free to use assuming you push a few, the compiler could
> allocate 4 registers just to buffer the pass-by-reference inputs.
>
> DTA


At least GCC does it - R0 to R3.

--

Tauno Voipio

 
Reply With Quote
 
 
 
Reply

Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Any Interest in Open-Source Math/Utility Libraries for Small Microcontrollers? David T. Ashley Embedded 5 08-28-2007 03:18 PM
AI booster error message: "-" is not a valid Integer wally Asus 1 11-11-2005 09:59 AM
Are there any laptops out there with 4GB RAM capacity? Craig Wanke Laptops 3 12-09-2003 12:43 PM
Any PCM audio amplitude manipulation libraries available? Deathstar Embedded 4 09-12-2003 05:20 PM
pIII sustained integer performance problem Kamen Yotov Intel 0 07-03-2003 08:03 PM


All times are GMT. The time now is 12:40 AM.


Welcome!
Welcome to Motherboard Point
 

Advertisment