FPU vs soft library vs. fixed point

Discussion in 'Embedded' started by Don Y, May 25, 2014.

  1. Don Y

    Don Y Guest

    Hi,

    I'm exploring tradeoffs in implementation of some computationally
    expensive routines.

    The easy (coding) solution is to just use doubles everywhere and
    *assume* the noise floor is sufficiently far down that the ULPs
    don't impact the results in any meaningful way.
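
    (For scale: a double's 52-bit mantissa puts one ulp near 1.0 at
    2^-52, about 2.2e-16 -- some 36 bits below even a 16-bit noise
    floor.)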

    But, that requires either hardware support (FPU) or a library
    implementation or some "trickery" on my part.

    Hardware FPU adds cost and increases average power consumption
    (for a generic workload). It also limits the choices I have
    (severe cost and space constraints).

    OTOH, a straightforward library implementation burns more CPU
    cycles to achieve the same result. Eventually, I will have to
    instrument a design to see where the tipping point lies -- how
    many transistors are switching in each case, etc.

    Fixed point solutions mean a lot more up-front work verifying
    no loss of precision throughout the calculations. Do-able but
    a nightmare for anyone having to maintain the codebase.
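
    (A taste of that up-front work -- a minimal sketch, assuming Q15
    throughout; illustrative only, not code from any of these projects:

        /* Q15 * Q15 -> Q15 with round-to-nearest.  16-bit operands,
           32-bit intermediate.  Every call site still has to be checked
           against the known ranges of its inputs -- the result can wrap. */
        #include <stdint.h>

        static inline int16_t q15_mul(int16_t a, int16_t b)
        {
            int32_t p = (int32_t)a * b;    /* exact Q30 intermediate */
            p += 1 << 14;                  /* round to nearest */
            return (int16_t)(p >> 15);     /* narrow back to Q15 */
        }

    ...and that's the *easy* operation.)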

    OTOOH, a non-generic library approach *could*, possibly, eke out
    a win by eliminating unnecessary operations that an FPU (or a
    generic library approach) would naively undertake. But, this
    similarly encumbers code maintainers -- to know *why* certain
    operations can be elided at certain points in the algorithms, etc.

    So, for a specific question: anyone have any *real* metrics
    regarding how efficient (power, cost) hardware FPU (or not!)
    is in FP-intensive applications? (by "FP-intensive", assume
    20% of the operations performed by the processor fall into
    that category).

    Thx,
    --don
     
    Don Y, May 25, 2014
    #1

  2. rickman

    rickman Guest

    So far I haven't seen any requirements -- for the computations
    vis-a-vis the noise floor, for power consumption, or for anything
    else. You seem to understand the basic concepts and tradeoffs, but
    perhaps you don't have the practical experience to know where, even
    approximately, the tradeoffs work best.

    I also don't have lots of experience with floating point, but I would
    expect if you are doing a lot of floating point the hardware would use
    less power than a software emulation. I can't imagine the cost would be
    very significant unless you are building a million of them.

    I think finding general "metrics" on FP approaches will be a lot harder
    than defining your requirements and looking for a solution that suits.
    Do you have requirements at this point?
     
    rickman, May 25, 2014
    #2

  3. Tim Wescott

    Tim Wescott Guest

    The cost is often more in that the pool of available processors shrinks
    dramatically, and it's hard to get physically small parts.
     
    Tim Wescott, May 25, 2014
    #3
  4. Don Y

    Don Y Guest

    Hi Rick,

    I'm pricing in 100K quantities -- which *tends* to make cost
    differences diminish.

    But, regardless of quantity, physics dictates the volume of a
    battery/cell required to power the things! Increased quantity
    doesn't make it draw less power, etc. :<
    I can't discuss two of the applications. But, to "earn my training
    wheels", I set out to redesign another app with similar constraints.
    It's a (formant) speech synthesizer that runs *in* a BT earpiece.
    (i.e., the size of the device is severely constrained -- which has
    repercussions on power available, etc.)

    A shirt-cuff analysis of the basic processing loop shows ~60 FMUL,
    ~30 FADD and a couple of trig/transcend operations per iteration.
    That's running at ~20KHz (lower sample rates make it hard to
    synthesize female and child voices with any quality). Not a tough
    requirement to meet WITHOUT the power+size constraints. But, throw
    it in a box with a ~0.5WHr power source and see how long it lasts
    before you're back on the charger! :-/
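
    (Back-of-the-envelope: (60 + 30 + 2) ops * 20K iterations/sec is
    ~1.8M FP ops/sec. At a ballpark ~100 cycles per emulated op -- an
    assumption, not a measurement -- that's ~180MHz worth of CPU burned
    on the math alone.)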

    There are some computations that I've just omitted from this tally
    as they would be hard to quantify in their current forms -- one would
    be foolish to naively implement them (e.g., glottal waveform synthesis).

    I think this would be a good "get my feet wet" application because
    all of the math is constrained a priori. While I can't *know* what
    the synthesizer will be called upon to speak, I *do* know what all
    of the CONSTANTS are that drive the algorithms.

    As such, I *know* there is no need to support exceptions, I know
    the maximum and minimum values to be expected at each stage in
    the computation, etc.
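
    (E.g., if a stage's inputs are known to stay below 4.0 in magnitude,
    a 16-bit Q4.12 format covers them -- and the sum of any two of
    them -- since its range is +/-8.)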

    At the same time, it has fixed (hard) processing limits -- I can't
    "preprocess" speech and play it back out of a large (RAM) buffer...
    there's no RAM lying around to exploit in that wasteful a manner.

    It also highlights the potential problem of including FPU hardware
    in a design if it isn't *always* in use -- does having an *idle*
    FPU carry any other recurring (operational) costs? (in theory,
    CMOS should have primarily dynamic currents... can I be sure the
    FPU is truly "idle" when I'm not executing FP opcodes?)

    And, how much "assist" do the various FPAs require? Where is the
    break-even point for a more "tailored" approach?

    Note that hardware FPU and software *generic* libraries have to
    accommodate all sorts of use/abuse. They can't know anything about
    the data they are being called upon to process so always have to
    "play it safe". (imagine how much extra work they do when summing
    a bunch of numbers of similar magnitudes!)
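
    (To make that concrete -- a hypothetical sketch, assuming the terms
    are known a priori to share a Q15 scale. The whole sum is plain
    integer adds, with none of the per-add alignment/normalization a
    generic FADD has to perform:

        #include <stdint.h>
        #include <stddef.h>

        int32_t q15_block_sum(const int16_t *x, size_t n)
        {
            int32_t acc = 0;         /* guard bits: safe for n <= 65536 */
            for (size_t i = 0; i < n; i++)
                acc += x[i];         /* no exponent work at all */
            return acc;              /* still Q15; rescale once at the end */
        }
    )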

    I'm hoping someone has either measured hardware vs. software
    implementations (if not, that's the route I'll be pursuing)
    *or* looked at the power requirements of each approach...

    --don
     
    Don Y, May 25, 2014
    #4
  5. Don Y

    Don Y Guest

    Hi Tim,

    Exactly. Especially if the "other" functions that the processor
    is performing do not benefit from the extra (hardware) complexity.

    "advanced RISC machine"

    Would you use a processor with a GPU (G, not F) to control a
    CNC lathe? Even if it had a GUI? Or, would you "tough it out"
    and write your own blit'er and take the performance knock on
    the (largely static!) display tasks?
     
    Don Y, May 25, 2014
    #5
  6. Tim Wescott

    Tim Wescott Guest

    If you're a direct employee and you're working at 100K quantities, it
    should be exceedingly easy to get the attention of the chip company's
    applications engineers. Maybe too easy -- there have been times in my
    career that I haven't asked an app engineer a question because I couldn't
    handle the thought of fending him off for the next six months. At any
    rate, you could ask for white papers, or at those quantities you may even
    prompt someone to do some testing.

    Assuming that a processor with software emulation could get the job done,
    a processor with an FPU may still be more power efficient because it
    could be mostly turned off.

    Be careful shopping for processors -- there are a lot of processors out
    there that have a single-precision FPU that does nothing for double-
    precision operations: you often have to dig to find out what a processor
    can really do.

    My gut feel is that at 100K quantities you can do the work in fixed-
    point, document the hell out of it, and come out ahead. If time-to-
    market is an issue, call me! Matrix multiplies of fixed-point numbers
    with block-floating point coefficients can be very efficient on DSP-ish
    processors, and doing the math to make sure you're staying inside the
    lines isn't all that hard.
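
    The shape of it, as an untested sketch (names made up, Q15 mantissas
    assumed):

        #include <stdint.h>
        #include <stddef.h>

        /* One shared exponent per coefficient block keeps the inner
           loop to pure integer MACs -- single-cycle on DSP-ish parts. */
        typedef struct {
            const int16_t *mant;    /* Q15 mantissas */
            int            exp;     /* shared block exponent */
        } bfp_vec;

        int64_t bfp_dot(bfp_vec a, const int16_t *x, size_t n)
        {
            int64_t acc = 0;
            for (size_t i = 0; i < n; i++)
                acc += (int32_t)a.mant[i] * x[i];   /* integer MAC */
            /* apply the shared exponent once, after the loop */
            return (a.exp >= 0) ? (acc << a.exp) : (acc >> -a.exp);
        }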
     
    Tim Wescott, May 26, 2014
    #6
  7. Don Y

    Don Y Guest

    Hi Tim,

    On 5/25/2014 4:19 PM, Tim Wescott wrote:

    [attrs elided]
    I found that running big numbers by vendors usually ended up with them
    camped out on my doorstep as if they were expecting the delivery of
    their first child! :<

    Often, I think clients farm out projects deliberately so any queries
    that might be "revealing" don't originate from their offices. More
    than once I've had sales reps "gossip" about projects at competitors'
    firms -- so why would they expect *me* to trust them with proprietary
    details?? :<
    The sort of testing that they would do, I could *easily* do! Turn off
    FP support, drag in an emulation library, measure elapsed time, power
    consumption, etc.

    That's an unoptimized comparison. It forces the "software" approach to
    bear the same sorts of costs that the hardware implementation *must*
    bear (i.e., you can't purchase a 27b FPU... or, one that elides support
    for any ops that you never use, etc. OTOH, you *can* do this with a
    software approach!)
    Yes, assuming the static currents are effectively zero. And, that the
    dynamic currents don't alter the average power capacity of the battery,
    etc. (e.g., a prolonged low power drain *may* be better for the battery
    than one that exhibits lots of dynamism. Esp given battery chemistries
    intended for quick charge/discharge cycles)
    Yes. And, many "accelerators" instead of genuine FPUs. This just
    complicates finding the sweet spot (e.g., normalizing and denormalizing
    values is expensive in software... perhaps a win for a *limited* FPA?)
    I have become ever-increasingly verbose in my documentation. To
    the point that current docs include animations, interactive
    presentations, etc. (can't embed this sort of stuff in sources)

    Yet, folks still seem to act as if they didn't understand the docs
    *or* didn't bother to review them! (Start coding on day #1. Figure
    out *what* you are supposed to be coding around day #38...)

    "Why is this <whatever> happening? I only changed two constants
    in the code! ..."

    "Um, did you read the footnote on page 13 of the accompanying
    description of the algorithm? And, the constraints that it
    clearly described that applied to those constants? Did you
    have all the DEBUG/invariants enabled when you compiled the
    code? If so, it would have thrown a compile-time error
    alerting you to your folly..."

    "No, I just put all that stuff in a folder when I started work
    on this. I figured it would just be easier to ask *you* as you
    wrote it all..."

    "Ah! Unable to read at your age? Amazing you made it through
    life thus far! Tell me... how good are you with *numbers*?
    I.e., are you sure your paycheck is right?? If you'd like, I
    can speak to your employer about that. I suspect I can make
    *him* much happier with my suggested adjustments... :-/ "
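
    (The compile-time guard can be as simple as a C11 static assertion
    tied to each tunable constant -- names here are made up:

        /* constraints derived in the accompanying algorithm description */
        #define F1_MIN_HZ 90
        #define F1_MAX_HZ 1000
        #define F1_HZ     250    /* the constant someone "just changed" */

        _Static_assert(F1_HZ >= F1_MIN_HZ && F1_HZ <= F1_MAX_HZ,
                       "F1_HZ outside the range the scaling was derived for");
    )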

    People seem to just want to *poke* at working code in the hope
    that it ends up -- miraculously -- doing what they want it to do.
    Rather than *understanding* how it works so they can make intelligent
    changes to it. I think the rapid edit-compile-link-debug cycle times
    contribute to this. It's *too* easy to just make a change and see
    what happens -- without thinking about whether or not the change is
    what you *really* want. How do you ever *know* you got it right?

    Anyway... the appeal of an "all doubles" approach is there's little
    someone can do to *break* it (other than OBVIOUSLY breaking it!).
    I'm just not keen on throwing away that cost/performance *just*
    to allow "lower cost" developers to be hired... (unless I can put
    that on the BoM and get a bean-counter to price it! :> )

    End of today's cynicism. :) Time for ice cream!
     
    Don Y, May 26, 2014
    #7
  8. rickman

    rickman Guest

    Can't say since "small" is not really anything I can measure. There are
    very small packages available (some much smaller than I want to work
    with). I have to assume you mean that many of the parts with FP also
    have a larger pin count, meaning 100 pins and up. But unless I have a
    set of requirements, that is getting into some very hard-to-compare
    features.

    What? If one part of a design needs an I2C interface you think you
    should not use a hardware I2C interface because the entire project
    doesn't need it??? That makes no sense to me.

    This is a total non-sequitur. Reminds me of a supposedly true story where
    one employee said you get floor area by dividing length by width rather
    than multiplying. When that was questioned he reasoned with, "How many
    quarters in a dollar? How many quarters in two dollars?... SEE!" lol
     
    rickman, May 26, 2014
    #8
  9. Paul Rubin

    Paul Rubin Guest

    http://www.ti.com/product/tm4c123gh6pm used in the TI Tiva Launchpad is
    a 64LQFP, still not exactly tiny. There is supposedly a new comparable
    Freescale part (MK22FN1M0VLH12, also 64LQFP) that is pin compatible with
    the part in the Teensy 3.1 (pjrc.com), and that has floating point (the
    Teensy cpu is integer-only). I wonder if the pjrc guy will make a
    Teensy 3.2 with the new part, which also has more memory. It's a cute
    little board. The FP on all these parts is unfortunately single
    precision.
     
    Paul Rubin, May 26, 2014
    #9
  10. rickman

    rickman Guest

    I'm not sure what you are saying about cost differences diminishing.
    High volume makes cost differences jump out and be noticed! Or are you
    saying everyone quotes you great prices at those volumes?

    I may have something completely different for you to consider.

    I highly recommend that you not use pejoratives like "wasteful" when it
    comes to engineering. One man's waste is another man's efficiency. It
    only depends on the numbers. I don't know what your requirements are so
    I can't say having RAM is wasteful or not.

    I think this is a red herring. If you are worried about power
    consumption, worry about power consumption. Don't start worrying about
    what is idle and what is not before you even get started. Do you really
    think the FP instructions are going to be hammering away at the power
    draw when they are not being executed? Do you worry about the return
    from interrupt instruction when you aren't using that?

    What is an FPA?

    If you are designing a device for 100k production run, it would seem
    reasonable to do some basic testing and get real answers to your
    questions rather than to ask others for their opinions and biases.

    Ok, I'm still not clear on your application requirements, but if you
    need some significant amount of computation ability with analog I/O and
    power is a constraint, I know of a device you might want to look at.

    The GA144 from Green Arrays is an array of 144 async processors, each of
    which can run instructions at up to 700 MIPS. Floating point would need
    to be software, but that should not be a significant issue in this case.
    The features that could be great for your app are...

    1) Low standby power consumption of 8 uA, active power of 5 mW/processor
    2) Instant start up on trigger
    3) Integrated ADC and DAC (5 each) with variable resolution/sample rate
    (can do 20 kHz at ~15 bits)
    4) Small device in a 1 cm square, 88-pin QFP
    5) Small processors use little power and suspend in a single instruction
    time, reducing power to 55 nA each, with instant wakeup

    This device has its drawbacks too. It is programmed in a Forth like
    language which many are not familiar with. The I/Os are 1.8 volts which
    should not be a problem in your app. Each processor is 18 bits with
    only 64 words of memory, not sure what your requirements might be. You
    can hang an external memory on the device. It needs a separate SPI
    flash to boot and for program storage.

    The price is in the $10 ball park in lower volumes, not sure what it is
    at 100k units.

    One of the claimed apps that has been prototyped on this processor is a
    hearing aid app which requires a pair of TMS320C6xxx processors using a
    watt of power (or was it a watt each?). Sounds a bit like your app. :)

    Using this device will require you to forget everything you think you
    know about embedded processors and let yourself be guided by the
    force. But your app might just be a good one for the GA144.
     
    rickman, May 26, 2014
    #10
  11. rickman

    rickman Guest

    I really don't get your point. What are you comparing this to?
     
    rickman, May 26, 2014
    #11
  12. George Neuner

    George Neuner Guest

    Hi Don,

    Only 3 hands?
    This is being kicked around in comp.arch right now in a wandering
    thread called "RISC versus the Pentium Pro". Haven't seen the numbers
    you're asking for but likely you can get them if you ask nicely.

    They are discussing a closely related question involving the tradeoffs
    between providing an all-up FPU (e.g., IEEE-754) vs providing
    sub-units and allowing software to drive them. I haven't followed the
    whole thread [it's wandering a lot (even for c.a.)] but there's been
    some mentions of break even points of HW vs SW for general code.
    A number of comp.arch participants are present/past CPU designers.
    Quite a few others come from HPC ... when they are talking about FP
    intensive code, they mean 70+%.
    George
     
    George Neuner, May 26, 2014
    #12
  13. Don Y

    Don Y Guest

    Hi Rick,

    Things like FPU, MMU, GPU, etc. *tend* to find their way into
    more expensive/capable/larger devices. The thinking seems to be
    that -- once you've "graduated" to that level of complexity -- you
    aren't pinching pennies/watts/mm^3, etc.
    If the interface carried other baggage with it (size, cost, power)
    and COULD BE IMPLEMENTED SOME OTHER WAY (e.g., an FP library can
    produce the exact same results as an FPU), then why would you take on
    that extra baggage?

    Why not put EVERYTHING into EVERY DESIGN? And, just increase package
    dimensions, power requirements, cost, etc. accordingly...
    You've missed the point.

    You can provide the *functionality* of a GPU (FPU) in software at
    the expense of some execution speed. If you don't *need* that speed
    for the operation of the CNC lathe (i.e., the GPU *might* speed up
    some of the routines for moving TEXT and LINE DRAWINGS around the
    LARGELY STATIC display screen), then why take on that cost?

    You might grumble that updating the screen takes a full 1/10th of a
    second and COULD BE *SO* MUCH FASTER (with the GPU) but do you think
    the marketing guys are going to brag about a *faster* update rate than
    that? Especially if there are other consequences to this choice?
     
    Don Y, May 26, 2014
    #13
  14. rickman

    rickman Guest

    Ok, but I think you have me confused with another poster. I'm agnostic
    on that particular issue. My cross to bear is the lack of reasonable
    packages for FPGAs.
     
    rickman, May 26, 2014
    #14
  15. Don Y

    Don Y Guest

    Hi Rick,

    On 5/25/2014 9:46 PM, rickman wrote:

    [attrs elided]
    The differences amount to a lot FOR THE LOT. And, when reflected to
    retail pricing. But, the differences in piece part prices drop
    dramatically. At some point, you're just "buying plastic" (regardless
    of what sort of silicon is inside).
    If you don't have a resource available, then any use of that resource
    that can be accomplished by other means is wasteful. E.g., the speech
    synthesizer does most of its work out of ROM to avoid using RAM.
    An FPU represents a lot of gates! Depending on the processor it's
    attached to, often 20-30% of the gates *in* the processor. That's
    a lot of silicon to "ignore" on the assumption that it doesn't
    cost anything while not being used. I'd much rather have assurances
    that it doesn't than assume it doesn't -- and later learn that it has
    dynamic structures within.
    Floating Point Accelerator.
    It sure seems *most* efficient to poll others who *might* have
    similar experience (the hardware/software tradeoff re: floating point),
    as I suspect *most* of us have made that decision at least a few
    times in our careers!

    I didn't ask for "opinion" or "bias" as both of those suggest
    an arbitrariness not born of fact. To be clear, my question was:

    ---------------------------------------------------VVVVVVVVVVVVV
    So, for a specific question: anyone have any *real* metrics
    regarding how efficient (power, cost) hardware FPU (or not!)
    is in FP-intensive applications? (by "FP-intensive", assume
    20% of the operations performed by the processor fall into
    that category).
    Try
    <http://www.hpcwire.com/hpcwire/2012-08-22/adapteva_unveils_64-core_chip.html>
    -- 100GFLOPS/2W (FLOPS, not IPS)

    Of course, the speech synthesizer example is in the sub-40mW (avg)
    budget (including audio and radio) so ain't gonna work for me! :>
     
    Don Y, May 26, 2014
    #15
  16. rickman

    rickman Guest

    Ok, we have left the realm of an engineering discussion. The point is
    that if FP is useful, use it. If it is not useful, don't use it. But
    don't assume, before you have actually looked, that you won't be able to
    find a device with the feature set you want or can use.

    Yes, but what exactly are you saying...

    Well, yeah. Wonderful analogy. Now can we get back to discussing the
    issue?

    Ok, you are still in analogy land. I'm happy to discuss the original
    issue if that is what you want.

    We have deviated far from my original statement that I would expect
    floating point instructions to use less power than floating point in
    software. Of course the devil is in the details and this is just one
    feature of your design. The chip you choose to use will depend on many
    factors.
     
    rickman, May 26, 2014
    #16
  17. Don Y

    Don Y Guest

    Hey George!

    Finally warm up, (and dry out) there?? :> Broke 100F last week... :<
    July's gonna be a bitch!

    The others were busy at the time... (but I reserve the right NOT to
    disclose what they were doing! :> )
    Thanks, I will look at the thread!
    I think you can get even finer in choosing how little you implement
    based on application domain (of course, I haven't yet read their claims
    but still assume they are operating within some "rational" sense of
    partitioning... e.g., not willing to allow the actual number format
    to be rendered arbitrary, etc.)

    E.g., in the speech synthesizer example (elsewhere), I can point to
    any operator/operation/argument and *know* what sorts of values it
    will take on at any time during the life of the algorithm. And, the
    consequences of trimming precision or range, etc. I'm not sure
    how easily a generalized solution could be tweaked to shed unnecessary
    capability in those situations.
    Well, I *do* have other things to do besides crunch numbers! :>

    Hope you are well. Really busy, here! :-/
    --don
     
    Don Y, May 26, 2014
    #17
  18. upsidedown

    upsidedown Guest

    Also verify that the FPU supports 64 bits in hardware, not just 32
    bits.
    If you do not need strict IEEE float/double conformance and can live
    without denormals, infinity and NaN cases, those libraries can be
    somewhat simplified.
    Perhaps some other FP format would be better suited to emulation, like
    the 48-bit (6-byte) Borland Turbo Pascal Real data type, which uses
    integer arithmetic more efficiently.

    One needs to look carefully at the integer instruction set of the
    processor. FMUL is easy; it just needs a fast NxN integer
    multiplication and some auxiliary instructions. FADD/FSUB are more
    complicated, requiring a fast right shift by a variable number
    of bits for denormalization and a fast find-first-bit-set instruction
    for normalization. Without these instructions, you may have to do up
    to 64 iteration cycles in a loop with a shift right/left instruction
    and some conditional instructions, which can take a lot of time.
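
    In miniature (signs, rounding and IEEE specials omitted; mantissas
    assumed normalized with the leading 1 in bit 23):

        #include <stdint.h>

        uint32_t fadd_core(uint32_t ma, int ea, uint32_t mb, int eb,
                           int *e_out)
        {
            /* alignment: variable right shift of the smaller operand */
            uint32_t sum = ma + (mb >> (ea - eb));     /* assumes ea >= eb */

            /* normalization: find the leading 1 again -- the loop that a
               find-first-bit-set instruction collapses to a single step */
            while (sum >= (1ul << 24)) { sum >>= 1; ea++; }       /* carry */
            while (sum && sum < (1ul << 23)) { sum <<= 1; ea--; } /* cancellation */

            *e_out = ea;
            return sum;
        }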
     
    upsidedown, May 26, 2014
    #18
  19. rickman

    rickman Guest

    What won't work for you, the GA144 or the 100GFLOPS unit? As I
    mentioned, the GA144 has already been evaluated by someone for a hearing
    aid app which is very low power. 40 mW is not at all unreasonable for
    an audio app on this device.
     
    rickman, May 26, 2014
    #19
  20. Theo Markettos

    Theo Markettos Guest

    One of my colleagues has a C++ library that does precision checking
    through the calculations and tells you at which point precision was
    lost. When you're happy and ready to go to production, you just turn
    off the checking and it's optimised away to just doing the
    calculations.

    That makes it easier to maintain than having a separate fixed-point
    codebase.
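
    Not the actual library, but the general trick looks something like
    this (C-flavored with a macro here; the real thing is C++):

        #include <assert.h>
        #include <stdint.h>

        #ifdef PRECISION_CHECKS
        static inline int16_t q15_narrow(int32_t v)
        {
            /* development build: trap the exact point precision is lost */
            assert(v >= INT16_MIN && v <= INT16_MAX);
            return (int16_t)v;
        }
        #else
        /* production build: the check compiles away entirely */
        #define q15_narrow(v) ((int16_t)(v))
        #endif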
    It's something we've been looking at (how to do scientific compute on
    very constrained processors), but we've been focusing more on
    accelerating the fixed-point side than the FP side, so we don't have
    any numbers to hand.

    Theo
     
    Theo Markettos, May 26, 2014
    #20