
Any Large Integer Libraries Out There for the ARM?

Discussion in 'Embedded' started by David T. Ashley, Apr 18, 2012.

  1. I needed to multiply U128 = U96 * U32 on an ARM the other day, and did
    a hack job in C. It is logically correct, but it probably takes 3-10
    times as long as assembly-language.

    Are there any large integer and similar libraries out there for the
    ARM?

    The GMP is a great resource, of course, but it isn't very applicable
    to embedded systems because of:

    (a)Dynamic allocation.

    (b)Variable-size operands, which mean looping and branching, which
    causes pipelined RISC machines not to perform as well as they would if
    the code were written in a linear way.

    Anything out there?

    Thanks, Dave Ashley
     
    David T. Ashley, Apr 18, 2012
    #1

  2. David T. Ashley

    Guest

    David T. Ashley wrote:

    > I needed to multiply U128 = U96 * U32 on an ARM the other day, and did
    > a hack job in C. It is logically correct, but it probably takes 3-10
    > times as long as assembly-language.
    >
    > Are there any large integer and similar libraries out there for the
    > ARM?
    >
    > The GMP is a great resource, of course, but it isn't very applicable
    > to embedded systems because of:
    >
    > (a)Dynamic allocation.
    >
    > (b)Variable-size operands, which mean looping and branching, which
    > causes pipelined RISC machines not to perform as well as they would if
    > the code were written in a linear way.
    >
    > Anything out there?


    I've been working on untangling the OpenSSL bignum library free from
    OpenSSL. I've found it confusing because of a lot of upward dependencies,
    and the fact that the ./configure intermingles option setting for the whole
    package. One target that may be nearby is to group the options that only
    bignum needs into a single .h file.

    Mel.
     
    , Apr 18, 2012
    #2

  3. On Wed, 18 Apr 2012 10:54:20 -0400, wrote:

    >David T. Ashley wrote:
    >
    >> I needed to multiply U128 = U96 * U32 on an ARM the other day, and did
    >> a hack job in C. It is logically correct, but it probably takes 3-10
    >> times as long as assembly-language.
    >>
    >> Are there any large integer and similar libraries out there for the
    >> ARM?
    >>
    >> The GMP is a great resource, of course, but it isn't very applicable
    >> to embedded systems because of:
    >>
    >> (a)Dynamic allocation.
    >>
    >> (b)Variable-size operands, which mean looping and branching, which
    >> causes pipelined RISC machines not to perform as well as they would if
    >> the code were written in a linear way.
    >>
    >> Anything out there?

    >
    >I've been working on untangling the OpenSSL bignum library free from
    >OpenSSL. I've found it confusing because of a lot of upward dependencies,
    >and the fact that the ./configure intermingles option setting for the whole
    >package. One target that may be nearby is to group the options that only
    >bignum needs into a single .h file.
    >
    > Mel.


    Hi Mel,

    I have obtained this domain name:

    http://www.armlib.com/

    and I'm slugging away on the weekends.

    If there is enough synergy where you feel we might be able to work
    together or exchange technical information, please write me directly.

    My feeling based on the embedded work I've done is that there need to
    be two variants of the library--classic ARM and Thumb. That
    assessment may need to be adjusted downward or upward.

    Thanks and best regards, Dave A.
     
    David T. Ashley, Apr 18, 2012
    #3
  4. On Wed, 18 Apr 2012 11:12:24 -0400, David T. Ashley
    <> wrote:

    >On Wed, 18 Apr 2012 10:54:20 -0400, wrote:
    >
    >>David T. Ashley wrote:
    >>
    >>> I needed to multiply U128 = U96 * U32 on an ARM the other day, and did
    >>> a hack job in C. It is logically correct, but it probably takes 3-10
    >>> times as long as assembly-language.
    >>>
    >>> Are there any large integer and similar libraries out there for the
    >>> ARM?
    >>>
    >>> The GMP is a great resource, of course, but it isn't very applicable
    >>> to embedded systems because of:
    >>>
    >>> (a)Dynamic allocation.
    >>>
    >>> (b)Variable-size operands, which mean looping and branching, which
    >>> causes pipelined RISC machines not to perform as well as they would if
    >>> the code were written in a linear way.
    >>>
    >>> Anything out there?

    >>
    >>I've been working on untangling the OpenSSL bignum library free from
    >>OpenSSL. I've found it confusing because of a lot of upward dependencies,
    >>and the fact that the ./configure intermingles option setting for the whole
    >>package. One target that may be nearby is to group the options that only
    >>bignum needs into a single .h file.
    >>
    >> Mel.

    >
    >Hi Mel,
    >
    >I have obtained this domain name:
    >
    >http://www.armlib.com/
    >
    >and I'm slugging away on the weekends.
    >
    >If there is enough synergy where you feel we might be able to work
    >together or exchange technical information, please write me directly.
    >
    >My feeling based on the embedded work I've done is that there need to
    >be two variants of the library--classic ARM and Thumb. That
    >assessment may need to be adjusted downward or upward.
    >
    >Thanks and best regards, Dave A.


    Also, I'm not sure if my news client actually provides information to
    write me directly.

    My e-mail address is:

    gmail.com@dashley

    and I'm sure you'll figure out the sophisticated encryption algorithm
    used to prevent harvesting robots from making use of my e-mail
    address.

    Thanks, Dave A.
     
    David T. Ashley, Apr 18, 2012
    #4
  5. David T. Ashley

    Guest

    David T. Ashley wrote:

    > On Wed, 18 Apr 2012 10:54:20 -0400, wrote:


    >>I've been working on untangling the OpenSSL bignum library free from
    >>OpenSSL. [ ... ]

    >
    > Hi Mel,
    >
    > I have obtained this domain name:
    >
    > http://www.armlib.com/
    >
    > and I'm slugging away on the weekends.
    >
    > If there is enough synergy where you feel we might be able to work
    > together or exchange technical information, please write me directly.
    >
    > My feeling based on the embedded work I've done is that there need to
    > be two variants of the library--classic ARM and Thumb. That
    > assessment may need to be adjusted downward or upward.
    >
    > Thanks and best regards, Dave A.


    I'm still working to get my head around the physical requirements for
    packaging a release. It seems as though a source->.a process would suit my
    needs best. That way, ARM/THUMB can be decided project-by-project.

    I've just started some due diligence and registered with github.

    Somebody there has <https://github.com/wbhart/bsdnt> under a BSD license
    which might be what we're both looking for.

    Otherwise, it might be a plan to put up a BSD licensed project based on the
    BSD licensed OpenSSL code.

    Mel.
     
    , Apr 18, 2012
    #5
  6. On Wed, 18 Apr 2012 11:59:46 -0400, wrote:
    >
    >I'm still working to get my head around the physical requirements for
    >packaging a release. It seems as though a source->.a process would suit my
    >needs best. That way, ARM/THUMB can be decided project-by-project.
    >
    >I've just started some due diligence and registered with github.


    Hi Mel,

    I have the same thoughts. Assuming that ".a" is the standard suffix
    for an ARM library, a library would be built from the smallest
    compilable modules (typically a single function each), and the linker
    will automatically bring in only what it needs to satisfy references.

    With few exceptions, the source code would consist of source files
    that each contain a single function or a single constant or variable
    data structure.

    There would have to be two libraries--one for ARM classic and one for
    THUMB.

    Nearly all of the assembly-language modules would be different for the
    ARM vs. the THUMB libraries, but nearly all of the C functions would
    be the same for both libraries, with the development tool settings
    compiling for ARM or THUMB.

    >Somebody there has <https://github.com/wbhart/bsdnt> under a BSD license
    >which might be what we're both looking for.


    Thanks. I will look at that.

    >Otherwise, it might be a plan to put up a BSD licensed project based on the
    >BSD licensed OpenSSL code.


    Thanks. I'm not sure about the BSD license. I'll look at that.

    I believe it was Steve Ballmer of Microsoft who once wrote that
    "open-source licenses attach themselves like a cancer to anything they
    touch".

    I disagree with that point of view (I fully support open-source
    development), but they just aren't practical for embedded systems.

    Someone who buys a washing machine doesn't want to see information
    about software licenses anywhere, including the manufacturer's web
    page. It doesn't help them to use or understand the product.

    Open-source is great. Open-source licenses applied to small embedded
    systems are not.

    DTA
     
    David T. Ashley, Apr 18, 2012
    #6
  7. David T. Ashley

    David Brown Guest

    On 18/04/12 19:43, David T. Ashley wrote:
    > On Wed, 18 Apr 2012 11:59:46 -0400, wrote:
    >>
    >> I'm still working to get my head around the physical requirements for
    >> packaging a release. It seems as though a source->.a process would suit my
    >> needs best. That way, ARM/THUMB can be decided project-by-project.
    >>
    >> I've just started some due diligence and registered with github.

    >
    > Hi Mel,
    >
    > I have the same thoughts. Assuming that ".a" is the standard suffix
    > for an ARM library, a library would be built from the smallest
    > compileable modules (typically a single function each), and the linker
    > will automatically bring in only what it needs to satisfy references.
    >
    > With few exceptions, the source code would consist of source files
    > that each contain a single function or a single constant or variable
    > data structure.
    >
    > There would have to be two libraries--one for ARM classic and one for
    > THUMB.
    >
    > Nearly all of the assembly-language modules would be different for the
    > ARM vs. the THUMB libraries, but nearly all of the C functions would
    > be the same for both libraries, with the development tool settings
    > compiling for ARM or THUMB.
    >


    A library (.a) with a header file is a convenient format to use - and as
    you say, each source file should only contain a few functions so that
    only those needed are linked in.

    However, this will lead to sub-optimal code because it does not allow
    cross-function and cross-file optimisations. If you are happy giving
    out the source code, then people can use that to get the smallest and
    fastest code.

    Where possible, it is best to write such a library in C, using inline
    assembly only where it makes a big difference. If you can write your
    functions as inlineable C code, they will often produce smaller and
    faster object code, because the compiler can re-arrange registers,
    re-use results, avoid function call overheads, etc. When you write the
    routines in fixed assembly, the compiler is forced to move data into and
    out of registers to suit the assembly code - that overhead can quickly
    outweigh the benefits of the tuned assembly.

    Where you do need assembly, it is often worth writing it as inline
    assembly using the extended syntax supported by the compiler. Then at
    least the compiler has the freedom to pick registers and to include the
    code in its optimisations. gcc's syntax is supported by several
    compilers, including (obviously) all gcc-based tools and CodeWarrior. I
    don't know whether tools like IAR and Keil support it, or whether they
    have their own syntax.
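
    For a concrete picture, here is a minimal sketch of what such an
    extended-asm helper might look like with gcc for ARM; the function name
    and constraints below are purely illustrative, not taken from any
    existing library:

    #include <stdint.h>

    static inline uint64_t umull32(uint32_t a, uint32_t b)
    {
        uint32_t lo, hi;
        /* "=&r" (early-clobber) keeps the two outputs in registers distinct
           from the inputs; which registers are used is left to the compiler,
           so the call can still take part in its optimisations. */
        __asm__ ("umull %0, %1, %2, %3"
                 : "=&r" (lo), "=&r" (hi)
                 : "r" (a), "r" (b));
        return ((uint64_t)hi << 32) | lo;
    }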

    >> Somebody there has<https://github.com/wbhart/bsdnt> under a BSD license
    >> which might be what we're both looking for.

    >
    > Thanks. I will look at that.
    >
    >> Otherwise, it might be a plan to put up a BSD licensed project based on the
    >> BSD licensed OpenSSL code.

    >
    > Thanks. I'm not sure about the BSD license. I'll look at that.
    >
    > I believe it was Steve Ballmer of Microsoft who once wrote that
    > "open-source licenses attach themselves like a cancer to anything they
    > touch".
    >
    > I disagree with that point of view (I fully support open-source
    > development), but they just aren't practical for embedded systems.
    >
    > Someone who buys a washing machine doesn't want to see information
    > about software licenses anywhere, including the manufacturer's web
    > page. It doesn't help them to use or understand the product.
    >
    > Open-source is great. Open-source licenses applied to small embedded
    > systems is not.
    >


    Don't take advice about open-source licenses from Steve Ballmer - the
    only thing he knows about them is that they might slightly reduce his
    gazillion dollar income. Anything he says is pure FUD.

    What you are thinking about here is pure GPL - if you make your library
    GPL, then the user (i.e., another programmer) has to give the end user
    the right to all the source code, and the person buying the washing
    machine gets confusing information about the software license.

    With BSD-style licenses, no restrictions are imposed on your users
    (other developers), and the end user knows nothing about it. The only
    requirement is that no one can remove the copyright notices you put on
    the files and claim they wrote them. It's a perfectly good license for
    embedded libraries. (And Microsoft likes it too - they took the
    BSD-licensed network stack from FreeBSD to form the first version of the
    Windows TCP/IP stack. Apple likes it even more - they took the entire
    BSD-licensed FreeBSD OS, put a nice face on it, and called it "OS X".)

    An alternative is to use a modified GPL which forces developers to
    publish enhancements and modifications to the library code, without
    affecting anything in their own code. See FreeRTOS for an example.
     
    David Brown, Apr 18, 2012
    #7
  8. On Wed, 18 Apr 2012 21:00:34 +0200, David Brown
    <> wrote:

    >Apple likes [BSD license] even more - they took the entire
    >BSD-licensed FreeBSD OS, put a nice face on it, and called it "OS X".)


    That's an exaggeration. It's true that OS X contains some FreeBSD
    code, but Apple certainly did not "take FreeBSD and put a nice face on
    it".

    OS X is based on NeXTSTEP, which itself was based on CMU's Mach 2.0
    operating system. NeXTSTEP took Mach and replaced the existing Unix
    compatibility layer with new code that included ports of both FreeBSD
    and NetBSD modules. Apple chose NeXTSTEP to be the basis for the new
    Mac OS and purchased NeXT to get it.

    [And yes, I know Steve Jobs founded NeXT - after he was ousted from
    Apple. Jobs came back to Apple after a 10 year absence as part of the
    deal to purchase NeXT. Jobs did not retake control of Apple until the
    company narrowly avoided bankruptcy 2 years later.]

    George
     
    George Neuner, Apr 19, 2012
    #8
  9. David T. Ashley

    Noob Guest

    David T. Ashley wrote:

    > Also, I'm not sure if my news client actually provides information
    > to write me directly.
    >
    > My e-mail address is:
    >
    > gmail.com@dashley
    >
    > and I'm sure you'll figure out the sophisticated encryption algorithm
    > used to prevent harvesting robots from making use of my e-mail
    > address.


    Errr... You provide your true e-mail address in the message headers
    (From: field). Do you think crawlers ignore the headers?
     
    Noob, Apr 19, 2012
    #9
  10. David T. Ashley

    Rocky Guest

    On Apr 19, 10:06 am, Noob <r...@127.0.0.1> wrote:
    > David T. Ashley wrote:
    > > Also, I'm not sure if my news client actually provides information
    > > to write me directly.

    >
    > > My e-mail address is:

    >
    > > gmail.com@dashley

    >
    > > and I'm sure you'll figure out the sophisticated encryption algorithm
    > > used to prevent harvesting robots from making use of my e-mail
    > > address.

    >
    > Errr... You provide your true e-mail address in the message headers
    > (From: field). Do you think crawlers ignore the headers?

    However, gmail is pretty good at filtering spam. I use a gmail address
    for most of my public activity and gmail successfully gets rid of
    99.999 percent of the spam.
     
    Rocky, Apr 19, 2012
    #10
  11. On Wed, 18 Apr 2012 16:22:10 +0200, David T. Ashley
    <> wrote:
    > I needed to multiply U128 = U96 * U32 on an ARM the other day, and did
    > a hack job in C. It is logically correct, but it probably takes 3-10
    > times as long as assembly-language.


    Unlikely. With Cortex-M3 and the IAR compiler I came to the following
    results:
    * C-only: 63 cycles
    * C-with-asm: 50 cycles
    * asm-only: 39 cycles

    In how many cycles does your version run?
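
    (As an aside: one common way to obtain such numbers on a Cortex-M3 is
    the DWT cycle counter. The sketch below is purely illustrative, not
    code from this thread, but the register addresses are the standard
    ARMv7-M debug registers.)

    #include <stdint.h>

    #define DEMCR       (*(volatile uint32_t *)0xE000EDFCuL)  /* Debug Exception and Monitor Control */
    #define DWT_CTRL    (*(volatile uint32_t *)0xE0001000uL)
    #define DWT_CYCCNT  (*(volatile uint32_t *)0xE0001004uL)

    static uint32_t cycles(void (*fn)(void))
    {
        DEMCR      |= (1uL << 24);      /* TRCENA: enable the DWT unit        */
        DWT_CYCCNT  = 0;
        DWT_CTRL   |= 1uL;              /* CYCCNTENA: start the cycle counter */

        uint32_t start = DWT_CYCCNT;
        fn();                           /* routine under test                 */
        return DWT_CYCCNT - start;      /* includes call/return overhead      */
    }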

    --
    Made with Opera's revolutionary e-mail program:
    http://www.opera.com/mail/
    (Remove the obvious prefix to reply privately.)
     
    Boudewijn Dijkstra, Apr 19, 2012
    #11
  12. David T. Ashley

    boB Guest

    On Fri, 20 Apr 2012 00:57:28 +0200, "Boudewijn Dijkstra"
    <> wrote:

    >On Wed, 18 Apr 2012 16:22:10 +0200, David T. Ashley
    ><> wrote:
    >> I needed to multiply U128 = U96 * U32 on an ARM the other day, and did
    >> a hack job in C. It is logically correct, but it probably takes 3-10
    >> times as long as assembly-language.




    Was that with the GNU ARM compiler?

    If so, that's why I use IAR also.
    Looks like I'd better see if I can optimize some code the way Boudewijn
    has done.

    boB



    >
    >Unlikely. With Cortex-M3 and the IAR compiler I came to the following
    >results:
    >* C-only: 63 cycles
    >* C-with-asm: 50 cycles
    >* asm-only: 39 cycles
    >
    >In how many cycles does your version run?
     
    boB, Apr 20, 2012
    #12
  13. David T. Ashley

    Guest

    David T. Ashley wrote:

    > On Wed, 18 Apr 2012 11:59:46 -0400, wrote:
    >>
    >>I'm still working to get my head around the physical requirements for
    >>packaging a release. It seems as though a source->.a process would suit
    >>my
    >>needs best. That way, ARM/THUMB can be decided project-by-project.
    >>
    >>I've just started some due diligence and registered with github.

    >
    > Hi Mel,
    >
    > I have the same thoughts. Assuming that ".a" is the standard suffix
    > for an ARM library, a library would be built from the smallest
    > compileable modules (typically a single function each), and the linker
    > will automatically bring in only what it needs to satisfy references.
    >

    [ ... ]
    I've taken a highly opportunistic, cargo-cult-style first pass at the OpenSSL
    bignum code. The results can be obtained via

    git clone https://github.com/melwilson/bignum-embedded.git

    The results don't look bad for a couple of days' work, but they clearly have
    to be massively cleaned up and made consistent before they constitute a real
    library.

    Mel.
     
    , Apr 23, 2012
    #13
  14. On Fri, 20 Apr 2012 00:57:28 +0200, "Boudewijn Dijkstra"
    <> wrote:

    >On Wed, 18 Apr 2012 16:22:10 +0200, David T. Ashley
    ><> wrote:
    >> I needed to multiply U128 = U96 * U32 on an ARM the other day, and did
    >> a hack job in C. It is logically correct, but it probably takes 3-10
    >> times as long as assembly-language.

    >
    >Unlikely. With Cortex-M3 and the IAR compiler I came to the following
    >results:
    >* C-only: 63 cycles
    >* C-with-asm: 50 cycles
    >* asm-only: 39 cycles
    >
    >In how many cycles does your version run?


    Can you post your C code that led to 63-cycle execution?

    And your C with ASM?

    And your ASM-only?

    If you've posted it and I missed it, I apologize.

    Thanks, Dave Ashley
     
    David T. Ashley, Apr 23, 2012
    #14
  15. David T. Ashley

    Noob Guest

    David T. Ashley wrote:

    > I needed to multiply U128 = U96 * U32 on an ARM the other day, and did
    > a hack job in C. It is logically correct, but it probably takes 3-10
    > times as long as assembly-language.


    gcc "understands" the intent of the following code, and produces
    fairly decent assembly code for SH-4:

    #include <stdint.h>
    typedef uint32_t U32; typedef U32 U96[3]; typedef U32 U128[4];

    void mul(U32 u, U96 v, U128 res)
    {
        uint64_t m0 = (uint64_t)u * v[0];
        res[0] = m0;

        uint64_t m1 = (uint64_t)u * v[1] + (m0 >> 32);
        res[1] = m1;

        uint64_t m2 = (uint64_t)u * v[2] + (m1 >> 32);
        res[2] = m2;

        res[3] = (m2 >> 32);
    }

    Do you get acceptable results with this function?
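
    (A quick way to sanity-check a routine like this on a development host
    is to compare it against the compiler's own 128-bit arithmetic. The
    harness below is only a sketch: it assumes a gcc/clang host with the
    unsigned __int128 extension, is compiled together with the mul() above,
    and uses made-up test values.)

    #include <stdio.h>

    int main(void)
    {
        U96  v = { 0xFFFFFFFFu, 0x12345678u, 0xDEADBEEFu };
        U32  u = 0xCAFEBABEu;
        U128 r;

        mul(u, v, r);                        /* routine under test */

        /* reference result computed with 128-bit host arithmetic */
        unsigned __int128 ref = ((unsigned __int128)v[2] << 64)
                              | ((unsigned __int128)v[1] << 32)
                              |  (unsigned __int128)v[0];
        ref *= u;

        for (int i = 0; i < 4; i++)
            printf("word %d: got %08lx, expected %08lx\n", i,
                   (unsigned long)r[i],
                   (unsigned long)(uint32_t)(ref >> (32 * i)));
        return 0;
    }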

    Regards.
     
    Noob, Apr 24, 2012
    #15
  16. On Tue, 24 Apr 2012 13:55:32 +0200, Noob <root@127.0.0.1> wrote:

    >David T. Ashley wrote:
    >
    >> I needed to multiply U128 = U96 * U32 on an ARM the other day, and did
    >> a hack job in C. It is logically correct, but it probably takes 3-10
    >> times as long as assembly-language.

    >
    >gcc "understands" the intent of the following code, and produces
    >fairly decent assembly code for SH-4:
    >
    >#include <stdint.h>
    >typedef uint32_t U32; typedef U32 U96[3]; typedef U32 U128[4];
    >void mul(U32 u, U96 v, U128 res)
    >{
    > uint64_t m0 = (uint64_t)u * v[0];
    > res[0] = m0;
    >
    > uint64_t m1 = (uint64_t)u * v[1] + (m0 >> 32);
    > res[1] = m1;
    >
    > uint64_t m2 = (uint64_t)u * v[2] + (m1 >> 32);
    > res[2] = m2;
    >
    > res[3] = (m2 >> 32);
    >}
    >
    >Do you get acceptable results with this function?


    I'll give that a try.

    What I used was this ... similar but different.

    Your handling of carries looks more efficient.

    ----------

    void APP_BIGINT_UmulU128U96U32( APP_BIGINT_U128_T *out_u128,
                                    const APP_BIGINT_U96_T *in_u96,
                                    const APP_BIGINT_U32_T *in_u32)
    {
        APP_BIGINT_U64_T term_96w0_32w0;
        APP_BIGINT_U64_T term_96w1_32w0;
        APP_BIGINT_U64_T term_96w2_32w0;
        APP_BIGINT_U32_NATIVE_T carry;
        APP_BIGINT_U64_NATIVE_T temp_u64;

        //****************************************************************
        //Calculate all of the intermediate terms. It was discovered that
        //the compiler does a good job of using the UMULL instruction.
        //
        //96w0 * 32w0
        temp_u64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[0] *
                   (APP_BIGINT_U64_NATIVE_T)in_u32->w[0];
        term_96w0_32w0.w[1] = temp_u64 >> 32;
        term_96w0_32w0.w[0] = temp_u64;
        //
        //96w1 * 32w0
        temp_u64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[1] *
                   (APP_BIGINT_U64_NATIVE_T)in_u32->w[0];
        term_96w1_32w0.w[1] = temp_u64 >> 32;
        term_96w1_32w0.w[0] = temp_u64;
        //
        //96w2 * 32w0
        temp_u64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[2] *
                   (APP_BIGINT_U64_NATIVE_T)in_u32->w[0];
        term_96w2_32w0.w[1] = temp_u64 >> 32;
        term_96w2_32w0.w[0] = temp_u64;

        //****************************************************************
        //Add together the terms. The best way to envision this is the way
        //that children are sometimes taught to multiply with a rectangle
        //and then adding diagonally.
        //
        //Word [0] of the result. No addition is required, because this
        //term comes from the least significant term alone.
        out_u128->w[0] = term_96w0_32w0.w[0];
        //
        //Word [1] of the result.
        temp_u64 = (APP_BIGINT_U64_NATIVE_T)term_96w1_32w0.w[0] +
                   (APP_BIGINT_U64_NATIVE_T)term_96w0_32w0.w[1];
        carry = temp_u64 >> 32;
        out_u128->w[1] = temp_u64;

        //Word [2] of the result.
        temp_u64 = (APP_BIGINT_U64_NATIVE_T)carry +
                   (APP_BIGINT_U64_NATIVE_T)term_96w2_32w0.w[0] +
                   (APP_BIGINT_U64_NATIVE_T)term_96w1_32w0.w[1];
        carry = temp_u64 >> 32;
        out_u128->w[2] = temp_u64;

        //Word [3] of the result.
        temp_u64 = (APP_BIGINT_U64_NATIVE_T)carry +
                   (APP_BIGINT_U64_NATIVE_T)term_96w2_32w0.w[1];
        out_u128->w[3] = temp_u64;
    }

    Thanks, DTA
     
    David T. Ashley, Apr 24, 2012
    #16
  17. David T. Ashley

    Noob Guest

    David T. Ashley wrote:

    > Noob wrote:
    >
    >> David T. Ashley wrote:
    >>
    >>> I needed to multiply U128 = U96 * U32 on an ARM the other day, and did
    >>> a hack job in C. It is logically correct, but it probably takes 3-10
    >>> times as long as assembly-language.

    >>
    >> gcc "understands" the intent of the following code, and produces
    >> fairly decent assembly code for SH-4:
    >>
    >> #include <stdint.h>
    >> typedef uint32_t U32; typedef U32 U96[3]; typedef U32 U128[4];
    >> void mul(U32 u, U96 v, U128 res)
    >> {
    >> uint64_t m0 = (uint64_t)u * v[0];
    >> res[0] = m0;
    >>
    >> uint64_t m1 = (uint64_t)u * v[1] + (m0 >> 32);
    >> res[1] = m1;
    >>
    >> uint64_t m2 = (uint64_t)u * v[2] + (m1 >> 32);
    >> res[2] = m2;
    >>
    >> res[3] = (m2 >> 32);
    >> }
    >>
    >> Do you get acceptable results with this function?

    >
    > I'll give that a try.
    >
    > What I used was this ... similar but different.
    >
    > Your handling of carries looks more efficient.


    IMHO, the biggest economy, in terms of code density, comes from the
    umlal instruction (32x32->64_mul followed by add_64). (On SH-4, it's
    all done "by hand" using 4-6 instructions.)

    FWIW, here's the assembly code gcc 4.6.1 generates for Cortex-M3.
    What CPU are you targeting? ARM or Thumb mode?

    mul:
    push {r4, r5, r6, r7, r8, r9}
    ldr r6, [r1, #0]
    movs r5, #0
    umull r6, r7, r6, r0
    str r6, [r2, #0]
    ldr r3, [r1, #4]
    mov r8, r7
    mov r9, r5
    umlal r8, r9, r0, r3
    str r8, [r2, #4]
    ldr r3, [r1, #8]
    mov r6, r9
    mov r7, r5
    umlal r6, r7, r0, r3
    str r6, [r2, #8]
    str r7, [r2, #12]
    pop {r4, r5, r6, r7, r8, r9}
    bx lr

    AFAICT, the register copies before the 2 umlal are unnecessary,
    which means we could shave 3 instructions.

    Regards.
     
    Noob, Apr 24, 2012
    #17
  18. On Tue, 24 Apr 2012 15:55:59 +0200, Noob <root@127.0.0.1> wrote:

    >David T. Ashley wrote:
    >
    >> Noob wrote:
    >>
    >>> David T. Ashley wrote:
    >>>
    >>>> I needed to multiply U128 = U96 * U32 on an ARM the other day, and did
    >>>> a hack job in C. It is logically correct, but it probably takes 3-10
    >>>> times as long as assembly-language.
    >>>
    >>> gcc "understands" the intent of the following code, and produces
    >>> fairly decent assembly code for SH-4:
    >>>
    >>> #include <stdint.h>
    >>> typedef uint32_t U32; typedef U32 U96[3]; typedef U32 U128[4];
    >>> void mul(U32 u, U96 v, U128 res)
    >>> {
    >>> uint64_t m0 = (uint64_t)u * v[0];
    >>> res[0] = m0;
    >>>
    >>> uint64_t m1 = (uint64_t)u * v[1] + (m0 >> 32);
    >>> res[1] = m1;
    >>>
    >>> uint64_t m2 = (uint64_t)u * v[2] + (m1 >> 32);
    >>> res[2] = m2;
    >>>
    >>> res[3] = (m2 >> 32);
    >>> }
    >>>
    >>> Do you get acceptable results with this function?

    >>
    >> I'll give that a try.
    >>
    >> What I used was this ... similar but different.
    >>
    >> Your handling of carries looks more efficient.

    >
    >IMHO, the biggest economy, in terms of code density, comes from the
    >umlal instruction (32x32->64_mul followed by add_64). (On SH-4, it's
    >all done "by hand" using 4-6 instructions.)
    >
    >FWIW, here's the assembly code gcc 4.6.1 generates for Cortex-M3.
    >What CPU are you targeting? ARM or Thumb mode?
    >
    >mul:
    > push {r4, r5, r6, r7, r8, r9}
    > ldr r6, [r1, #0]
    > movs r5, #0
    > umull r6, r7, r6, r0
    > str r6, [r2, #0]
    > ldr r3, [r1, #4]
    > mov r8, r7
    > mov r9, r5
    > umlal r8, r9, r0, r3
    > str r8, [r2, #4]
    > ldr r3, [r1, #8]
    > mov r6, r9
    > mov r7, r5
    > umlal r6, r7, r0, r3
    > str r6, [r2, #8]
    > str r7, [r2, #12]
    > pop {r4, r5, r6, r7, r8, r9}
    > bx lr
    >
    >AFAICT, the register copies before the 2 umlal are unnecessary,
    >which means we could shave 3 instructions.


    Thanks.

    Examining your code ...

    The possibility of a carry makes me nervous. I'd first like to verify
    that the sum

    >>> uint64_t m1 = (uint64_t)u * v[1] + (m0 >> 32);


    can't exceed 64 bits and generate a carry.

    It doesn't need to be verified that

    >>> uint64_t m2 = (uint64_t)u * v[2] + (m1 >> 32);


    can't generate a carry, because there is a simple proof of that
    relying on the operand sizes and the fact that this is the most
    significant product.

    Squaring the maximum 32-bit integer yields:

    (2^32-1) * (2^32-1) = 2^64 - 2^33 + 1.

    If the second term is also a maximum square, adjusting for the
    placement, its maximum value is:

    2^96 - 2^65 + 2^32

    The question of whether there can be a carry is equivalent to the
    question of whether

    2^96 - 2^65 + 2^32 + 2^64 - 2^33 + 1 < 2^96

    Looking at the terms, this is definitely the case.

    OK, I believe that

    >>> uint64_t m1 = (uint64_t)u * v[1] + (m0 >> 32);


    cannot generate a carry. This seems plausible because one of the
    arguments is 32-bits ... I'm just not sure about how larger
    multiplications (more terms to add) would work out.
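
    (A direct per-word bound gives the same reassurance for any operand
    length, as long as the multiplier is a single 32-bit word as it is
    here: the largest value one step can take is

        (2^32 - 1)*(2^32 - 1) + (2^32 - 1) = 2^64 - 2^33 + 1 + 2^32 - 1
                                           = 2^64 - 2^32,

    which is strictly less than 2^64, so a 32x32 product plus a 32-bit
    carry-in always fits in a 64-bit accumulator.)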

    I believe your implementation is correct.

    Below is the source code and assembly-language for my original plus
    your improved version.

    Quite a size and speed difference!

    Thanks!

    DTA

    --------

    void APP_BIGINT_UmulU128U96U32Old( APP_BIGINT_U128_T *out_u128,
                                       const APP_BIGINT_U96_T *in_u96,
                                       const APP_BIGINT_U32_T *in_u32)
    {
        APP_BIGINT_U64_T term_96w0_32w0;
        APP_BIGINT_U64_T term_96w1_32w0;
        APP_BIGINT_U64_T term_96w2_32w0;
        APP_BIGINT_U32_NATIVE_T carry;
        APP_BIGINT_U64_NATIVE_T temp_u64;

        //****************************************************************
        //Calculate all of the intermediate terms. It was discovered that
        //the compiler does a good job of using the UMULL instruction.
        //
        //96w0 * 32w0
        temp_u64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[0] *
                   (APP_BIGINT_U64_NATIVE_T)in_u32->w[0];
        term_96w0_32w0.w[1] = temp_u64 >> 32;
        term_96w0_32w0.w[0] = temp_u64;
        //
        //96w1 * 32w0
        temp_u64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[1] *
                   (APP_BIGINT_U64_NATIVE_T)in_u32->w[0];
        term_96w1_32w0.w[1] = temp_u64 >> 32;
        term_96w1_32w0.w[0] = temp_u64;
        //
        //96w2 * 32w0
        temp_u64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[2] *
                   (APP_BIGINT_U64_NATIVE_T)in_u32->w[0];
        term_96w2_32w0.w[1] = temp_u64 >> 32;
        term_96w2_32w0.w[0] = temp_u64;

        //****************************************************************
        //Add together the terms. The best way to envision this is the way
        //that children are sometimes taught to multiply with a rectangle
        //and then adding diagonally.
        //
        //Word [0] of the result. No addition is required, because this
        //term comes from the least significant term alone.
        out_u128->w[0] = term_96w0_32w0.w[0];
        //
        //Word [1] of the result.
        temp_u64 = (APP_BIGINT_U64_NATIVE_T)term_96w1_32w0.w[0] +
                   (APP_BIGINT_U64_NATIVE_T)term_96w0_32w0.w[1];
        carry = temp_u64 >> 32;
        out_u128->w[1] = temp_u64;

        //Word [2] of the result.
        temp_u64 = (APP_BIGINT_U64_NATIVE_T)carry +
                   (APP_BIGINT_U64_NATIVE_T)term_96w2_32w0.w[0] +
                   (APP_BIGINT_U64_NATIVE_T)term_96w1_32w0.w[1];
        carry = temp_u64 >> 32;
        out_u128->w[2] = temp_u64;

        //Word [3] of the result.
        temp_u64 = (APP_BIGINT_U64_NATIVE_T)carry +
                   (APP_BIGINT_U64_NATIVE_T)term_96w2_32w0.w[1];
        out_u128->w[3] = temp_u64;
    }

    APP_BIGINT_UmulU128U96U32Old PROC
    ;;;621
    ;;;622 void APP_BIGINT_UmulU128U96U32Old( APP_BIGINT_U128_T
    *out_u128,
    0001e0 e92d47f0 PUSH {r4-r10,lr}
    ;;;623 const APP_BIGINT_U96_T
    *in_u96,
    ;;;624 const APP_BIGINT_U32_T
    *in_u32)
    ;;;625 {
    0001e4 b090 SUB sp,sp,#0x40
    0001e6 4604 MOV r4,r0
    0001e8 460d MOV r5,r1
    0001ea 4616 MOV r6,r2
    ;;;626 APP_BIGINT_U64_T term_96w0_32w0;
    ;;;627 APP_BIGINT_U64_T term_96w1_32w0;
    ;;;628 APP_BIGINT_U64_T term_96w2_32w0;
    ;;;629 APP_BIGINT_U32_NATIVE_T carry;
    ;;;630 APP_BIGINT_U64_NATIVE_T temp_u64;
    ;;;631
    ;;;632
    //****************************************************************************************************
    ;;;633
    //****************************************************************************************************
    ;;;634
    //****************************************************************************************************
    ;;;635 //Calculate all of the intermediate terms. It was
    discovered that the compiler does a good
    ;;;636 //job of using the UMULL instruction.
    ;;;637 //
    ;;;638 //96w0 * 32w0
    ;;;639 temp_u64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[0] *
    (APP_BIGINT_U64_NATIVE_T)in_u32->w[0];
    0001ec 6828 LDR r0,[r5,#0]
    0001ee 6831 LDR r1,[r6,#0]
    0001f0 fba00101 UMULL r0,r1,r0,r1
    0001f4 e9cd0108 STRD r0,r1,[sp,#0x20]
    ;;;640 term_96w0_32w0.w[1] = temp_u64 >> 32;
    0001f8 2100 MOVS r1,#0
    0001fa 9809 LDR r0,[sp,#0x24]
    0001fc e9cd0106 STRD r0,r1,[sp,#0x18]
    000200 900f STR r0,[sp,#0x3c]
    ;;;641 term_96w0_32w0.w[0] = temp_u64;
    000202 9808 LDR r0,[sp,#0x20]
    000204 900e STR r0,[sp,#0x38]
    ;;;642 //
    ;;;643 //96w1 * 32w0
    ;;;644 temp_u64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[1] *
    (APP_BIGINT_U64_NATIVE_T)in_u32->w[0];
    000206 6868 LDR r0,[r5,#4]
    000208 6831 LDR r1,[r6,#0]
    00020a fba00101 UMULL r0,r1,r0,r1
    00020e e9cd0108 STRD r0,r1,[sp,#0x20]
    ;;;645 term_96w1_32w0.w[1] = temp_u64 >> 32;
    000212 2100 MOVS r1,#0
    000214 9809 LDR r0,[sp,#0x24]
    000216 e9cd0106 STRD r0,r1,[sp,#0x18]
    00021a 900d STR r0,[sp,#0x34]
    ;;;646 term_96w1_32w0.w[0] = temp_u64;
    00021c 9808 LDR r0,[sp,#0x20]
    00021e 900c STR r0,[sp,#0x30]
    ;;;647 //
    ;;;648 //96w2 * 32w0
    ;;;649 temp_u64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[2] *
    (APP_BIGINT_U64_NATIVE_T)in_u32->w[0];
    000220 68a8 LDR r0,[r5,#8]
    000222 6831 LDR r1,[r6,#0]
    000224 fba00101 UMULL r0,r1,r0,r1
    000228 e9cd0108 STRD r0,r1,[sp,#0x20]
    ;;;650 term_96w2_32w0.w[1] = temp_u64 >> 32;
    00022c 2100 MOVS r1,#0
    00022e 9809 LDR r0,[sp,#0x24]
    000230 e9cd0106 STRD r0,r1,[sp,#0x18]
    000234 900b STR r0,[sp,#0x2c]
    ;;;651 term_96w2_32w0.w[0] = temp_u64;
    000236 9808 LDR r0,[sp,#0x20]
    000238 900a STR r0,[sp,#0x28]
    ;;;652 //
    ;;;653
    //****************************************************************************************************
    ;;;654
    //****************************************************************************************************
    ;;;655
    //****************************************************************************************************
    ;;;656 //Add together the terms. The best way to envision this
    is the way that children are sometimes
    ;;;657 //taught to multiply with a rectangle and then adding
    diagonally.
    ;;;658 //
    ;;;659 //Word [0] of the result. No addition is required,
    because this term comes from the least significant
    ;;;660 //term alone.
    ;;;661 out_u128->w[0] = term_96w0_32w0.w[0];
    00023a 980e LDR r0,[sp,#0x38]
    00023c 6020 STR r0,[r4,#0]
    ;;;662 //
    ;;;663 //Word [1] of the result.
    ;;;664 temp_u64 = (APP_BIGINT_U64_NATIVE_T)term_96w1_32w0.w[0]
    00023e 2000 MOVS r0,#0
    000240 f8dd9030 LDR r9,[sp,#0x30]
    000244 e9cd9004 STRD r9,r0,[sp,#0x10]
    000248 f8dd803c LDR r8,[sp,#0x3c]
    00024c 4603 MOV r3,r0
    00024e 4601 MOV r1,r0
    000250 e9cd8006 STRD r8,r0,[sp,#0x18]
    000254 eb190008 ADDS r0,r9,r8
    000258 4159 ADCS r1,r1,r3
    00025a e9cd0108 STRD r0,r1,[sp,#0x20]
    ;;;665 +
    ;;;666 (APP_BIGINT_U64_NATIVE_T)term_96w0_32w0.w[1];
    ;;;667 carry = temp_u64 >> 32;
    00025e 2100 MOVS r1,#0
    000260 9809 LDR r0,[sp,#0x24]
    000262 4607 MOV r7,r0
    000264 e9cd0106 STRD r0,r1,[sp,#0x18]
    ;;;668 out_u128->w[1] = temp_u64;
    000268 9808 LDR r0,[sp,#0x20]
    00026a 6060 STR r0,[r4,#4]
    ;;;669
    ;;;670 //Word [2] of the result.
    ;;;671 temp_u64 = (APP_BIGINT_U64_NATIVE_T)carry
    00026c 46ba MOV r10,r7
    00026e 2000 MOVS r0,#0
    000270 e9cd7002 STRD r7,r0,[sp,#8]
    000274 f8dd9028 LDR r9,[sp,#0x28]
    000278 4603 MOV r3,r0
    00027a 4601 MOV r1,r0
    00027c e9cd9004 STRD r9,r0,[sp,#0x10]
    000280 eb170009 ADDS r0,r7,r9
    000284 4159 ADCS r1,r1,r3
    000286 e9cd0100 STRD r0,r1,[sp,#0]
    00028a 2000 MOVS r0,#0
    00028c f8dd8034 LDR r8,[sp,#0x34]
    000290 4603 MOV r3,r0
    000292 e9cd8006 STRD r8,r0,[sp,#0x18]
    000296 9800 LDR r0,[sp,#0]
    000298 eb100008 ADDS r0,r0,r8
    00029c 4159 ADCS r1,r1,r3
    00029e e9cd0108 STRD r0,r1,[sp,#0x20]
    ;;;672 +
    ;;;673 (APP_BIGINT_U64_NATIVE_T)term_96w2_32w0.w[0]
    ;;;674 +
    ;;;675 (APP_BIGINT_U64_NATIVE_T)term_96w1_32w0.w[1];
    ;;;676 carry = temp_u64 >> 32;
    0002a2 2100 MOVS r1,#0
    0002a4 9809 LDR r0,[sp,#0x24]
    0002a6 4607 MOV r7,r0
    0002a8 e9cd0106 STRD r0,r1,[sp,#0x18]
    ;;;677 out_u128->w[2] = temp_u64;
    0002ac 9808 LDR r0,[sp,#0x20]
    0002ae 60a0 STR r0,[r4,#8]
    ;;;678
    ;;;679 //Word [3] of the result.
    ;;;680 temp_u64 = (APP_BIGINT_U64_NATIVE_T)carry
    0002b0 46b8 MOV r8,r7
    0002b2 2000 MOVS r0,#0
    0002b4 e9cd7004 STRD r7,r0,[sp,#0x10]
    0002b8 f8dd902c LDR r9,[sp,#0x2c]
    0002bc 4603 MOV r3,r0
    0002be 4601 MOV r1,r0
    0002c0 e9cd9006 STRD r9,r0,[sp,#0x18]
    0002c4 eb170009 ADDS r0,r7,r9
    0002c8 4159 ADCS r1,r1,r3
    0002ca e9cd0108 STRD r0,r1,[sp,#0x20]
    ;;;681 +
    ;;;682 (APP_BIGINT_U64_NATIVE_T)term_96w2_32w0.w[1];
    ;;;683 out_u128->w[3] = temp_u64;
    0002ce 9808 LDR r0,[sp,#0x20]
    0002d0 60e0 STR r0,[r4,#0xc]
    ;;;684 }
    0002d2 b010 ADD sp,sp,#0x40
    0002d4 e8bd87f0 POP {r4-r10,pc}
    ;;;685
    ENDP

    ---------

    void APP_BIGINT_UmulU128U96U32( APP_BIGINT_U128_T *out_u128,
                                    const APP_BIGINT_U96_T *in_u96,
                                    const APP_BIGINT_U32_T *in_u32)
    {
        APP_BIGINT_U64_NATIVE_T m2, m1, m0;

        m0 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[0] *
             (APP_BIGINT_U64_NATIVE_T)in_u32->w[0];
        out_u128->w[0] = m0;

        m1 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[1] *
             (APP_BIGINT_U64_NATIVE_T)in_u32->w[0] + (m0 >> 32);
        out_u128->w[1] = m1;

        m2 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[2] *
             (APP_BIGINT_U64_NATIVE_T)in_u32->w[0] + (m1 >> 32);
        out_u128->w[2] = m2;

        out_u128->w[3] = (m2 >> 32);
    }

    APP_BIGINT_UmulU128U96U32 PROC
    ;;;602
    //----------------------------------------------------------------------------------------------------
    ;;;603 void APP_BIGINT_UmulU128U96U32( APP_BIGINT_U128_T
    *out_u128,
    00018e b570 PUSH {r4-r6,lr}
    ;;;604 const APP_BIGINT_U96_T
    *in_u96,
    ;;;605 const APP_BIGINT_U32_T
    *in_u32)
    ;;;606 {
    000190 b088 SUB sp,sp,#0x20
    ;;;607 APP_BIGINT_U64_NATIVE_T m2, m1, m0;
    ;;;608
    ;;;609 m0 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[0] *
    (APP_BIGINT_U64_NATIVE_T)in_u32->w[0];
    000192 680b LDR r3,[r1,#0]
    000194 6814 LDR r4,[r2,#0]
    000196 fba33404 UMULL r3,r4,r3,r4
    00019a e9cd3402 STRD r3,r4,[sp,#8]
    ;;;610 out_u128->w[0] = m0;
    00019e 9b02 LDR r3,[sp,#8]
    0001a0 6003 STR r3,[r0,#0]
    ;;;611
    ;;;612 m1 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[1] *
    (APP_BIGINT_U64_NATIVE_T)in_u32->w[0] + (m0 >> 32);
    0001a2 684c LDR r4,[r1,#4]
    0001a4 6815 LDR r5,[r2,#0]
    0001a6 2600 MOVS r6,#0
    0001a8 9b03 LDR r3,[sp,#0xc]
    0001aa e9cd3600 STRD r3,r6,[sp,#0]
    0001ae fbe43605 UMLAL r3,r6,r4,r5
    0001b2 e9cd3604 STRD r3,r6,[sp,#0x10]
    ;;;613 out_u128->w[1] = m1;
    0001b6 9b04 LDR r3,[sp,#0x10]
    0001b8 6043 STR r3,[r0,#4]
    ;;;614
    ;;;615 m2 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[2] *
    (APP_BIGINT_U64_NATIVE_T)in_u32->w[0] + (m1 >> 32);
    0001ba 688c LDR r4,[r1,#8]
    0001bc 6815 LDR r5,[r2,#0]
    0001be 2600 MOVS r6,#0
    0001c0 9b05 LDR r3,[sp,#0x14]
    0001c2 e9cd3600 STRD r3,r6,[sp,#0]
    0001c6 fbe43605 UMLAL r3,r6,r4,r5
    0001ca e9cd3606 STRD r3,r6,[sp,#0x18]
    ;;;616 out_u128->w[2] = m2;
    0001ce 9b06 LDR r3,[sp,#0x18]
    0001d0 6083 STR r3,[r0,#8]
    ;;;617
    ;;;618 out_u128->w[3] = (m2 >> 32);
    0001d2 2400 MOVS r4,#0
    0001d4 9b07 LDR r3,[sp,#0x1c]
    0001d6 e9cd3400 STRD r3,r4,[sp,#0]
    0001da 60c3 STR r3,[r0,#0xc]
    ;;;619 }
    0001dc b008 ADD sp,sp,#0x20
    0001de bd70 POP {r4-r6,pc}
    ;;;620
    ENDP
     
    David T. Ashley, Apr 24, 2012
    #18
  19. David T. Ashley

    Noob Guest

    Noob wrote:

    > IMHO, the biggest economy, in terms of code density, comes from the
    > umlal instruction (32x32->64_mul followed by add_64). (On SH-4, it's
    > all done "by hand" using 4-6 instructions.)
    >
    > FWIW, here's the assembly code gcc 4.6.1 generates for Cortex-M3.
    > What CPU are you targeting? ARM or Thumb mode?
    >
    > mul:
    > push {r4, r5, r6, r7, r8, r9}
    > ldr r6, [r1, #0]
    > movs r5, #0
    > umull r6, r7, r6, r0
    > str r6, [r2, #0]
    > ldr r3, [r1, #4]
    > mov r8, r7
    > mov r9, r5
    > umlal r8, r9, r0, r3
    > str r8, [r2, #4]
    > ldr r3, [r1, #8]
    > mov r6, r9
    > mov r7, r5
    > umlal r6, r7, r0, r3
    > str r6, [r2, #8]
    > str r7, [r2, #12]
    > pop {r4, r5, r6, r7, r8, r9}
    > bx lr
    >
    > AFAICT, the register copies before the 2 umlal are unnecessary,
    > which means we could shave 3 instructions.


    Hand-written version (direct translation of the C source)

    mul:
    push {r4, r5}

    ldr r3, [ r1, #0 ]
    umull r4, r5, r0, r3
    str r4, [ r2, #0 ]

    ldr r3, [ r1, #4 ]
    mov r4, #0
    umlal r5, r4, r0, r3
    str r5, [ r2, #4 ]

    ldr r3, [ r1, #8 ]
    mov r5, #0
    umlal r4, r5, r0, r3
    str r4, [ r2, #8 ]
    str r5, [ r2, #12 ]

    pop {r4, r5}
    bx lr

    Depending on the specific CPU being targeted, it might be possible
    to use more registers to improve performance.
     
    Noob, Apr 24, 2012
    #19
  20. On Tue, 24 Apr 2012 11:32:54 -0400, David T. Ashley
    <> wrote:
    >
    >Below is the source code and assembly-language for my original plus
    >your improved version.


    One more note ...

    By changing to one temporary variable (rather than 3) the compiler
    seems to have shaved out two instructions (one at the start to
    allocate stack space and one at the end to deallocate it).

    I believe that the definition of C guarantees that

    temp = expression + temp;

    will use the "old" value of temp before updating the new one, so that
    should be OK.

    Here is the new code and assembly-language (with two fewer
    instructions).

    I really appreciate the help. The C version you provided is
    SUBSTANTIALLY more efficient.

    DTA

    ---------

    void APP_BIGINT_UmulU128U96U32( APP_BIGINT_U128_T *out_u128,
                                    const APP_BIGINT_U96_T *in_u96,
                                    const APP_BIGINT_U32_T *in_u32)
    {
        APP_BIGINT_U64_NATIVE_T temp64;

        temp64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[0] *
                 (APP_BIGINT_U64_NATIVE_T)in_u32->w[0];
        out_u128->w[0] = temp64;

        temp64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[1] *
                 (APP_BIGINT_U64_NATIVE_T)in_u32->w[0] + (temp64 >> 32);
        out_u128->w[1] = temp64;

        temp64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[2] *
                 (APP_BIGINT_U64_NATIVE_T)in_u32->w[0] + (temp64 >> 32);
        out_u128->w[2] = temp64;

        out_u128->w[3] = (temp64 >> 32);
    }

    APP_BIGINT_UmulU128U96U32 PROC
    ;;;602
    //----------------------------------------------------------------------------------------------------
    ;;;603 void APP_BIGINT_UmulU128U96U32( APP_BIGINT_U128_T
    *out_u128,
    00018e b57f PUSH {r0-r6,lr}
    ;;;604 const APP_BIGINT_U96_T
    *in_u96,
    ;;;605 const APP_BIGINT_U32_T
    *in_u32)
    ;;;606 {
    ;;;607 APP_BIGINT_U64_NATIVE_T temp64;
    ;;;608
    ;;;609 temp64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[0] *
    (APP_BIGINT_U64_NATIVE_T)in_u32->w[0];
    000190 680b LDR r3,[r1,#0]
    000192 6814 LDR r4,[r2,#0]
    000194 fba33404 UMULL r3,r4,r3,r4
    000198 e9cd3402 STRD r3,r4,[sp,#8]
    ;;;610 out_u128->w[0] = temp64;
    00019c 9b02 LDR r3,[sp,#8]
    00019e 6003 STR r3,[r0,#0]
    ;;;611
    ;;;612 temp64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[1] *
    (APP_BIGINT_U64_NATIVE_T)in_u32->w[0] + (temp64 >> 32);
    0001a0 684c LDR r4,[r1,#4]
    0001a2 6815 LDR r5,[r2,#0]
    0001a4 2600 MOVS r6,#0
    0001a6 9b03 LDR r3,[sp,#0xc]
    0001a8 e9cd3600 STRD r3,r6,[sp,#0]
    0001ac fbe43605 UMLAL r3,r6,r4,r5
    0001b0 e9cd3602 STRD r3,r6,[sp,#8]
    ;;;613 out_u128->w[1] = temp64;
    0001b4 9b02 LDR r3,[sp,#8]
    0001b6 6043 STR r3,[r0,#4]
    ;;;614
    ;;;615 temp64 = (APP_BIGINT_U64_NATIVE_T)in_u96->w[2] *
    (APP_BIGINT_U64_NATIVE_T)in_u32->w[0] + (temp64 >> 32);
    0001b8 688c LDR r4,[r1,#8]
    0001ba 6815 LDR r5,[r2,#0]
    0001bc 2600 MOVS r6,#0
    0001be 9b03 LDR r3,[sp,#0xc]
    0001c0 e9cd3600 STRD r3,r6,[sp,#0]
    0001c4 fbe43605 UMLAL r3,r6,r4,r5
    0001c8 e9cd3602 STRD r3,r6,[sp,#8]
    ;;;616 out_u128->w[2] = temp64;
    0001cc 9b02 LDR r3,[sp,#8]
    0001ce 6083 STR r3,[r0,#8]
    ;;;617
    ;;;618 out_u128->w[3] = (temp64 >> 32);
    0001d0 2400 MOVS r4,#0
    0001d2 9b03 LDR r3,[sp,#0xc]
    0001d4 e9cd3400 STRD r3,r4,[sp,#0]
    0001d8 60c3 STR r3,[r0,#0xc]
    ;;;619 }
    0001da bd7f POP {r0-r6,pc}
    ;;;620
    ENDP
     
    David T. Ashley, Apr 24, 2012
    #20