Hidden latencies and delays for a running program?

Discussion in 'Embedded' started by haiticare2011, May 25, 2014.

  1. Hi all,

    I've been a SW developer, but one question I've never addressed is: What OS
    latencies and CPU delays are there in a compiled, running program? Is there any
    simple way to minimize them?

    I am thinking of a simple C program that reads data off a PCI card and then
    writes it to storage such as a PCIe SSD drive. I understand there will be
    various hardware latencies and delays in the data input.

    But what happens while the assembled program is executing? Does the OS "butt
    in" and context-switch/multi-task during execution of a continuous compiled
    program? If so, how does one shut that off?

    I've read about this somewhere, but never paid attention to it.

    Thanks in advance
    jb
     
    haiticare2011, May 25, 2014
    #1

  2. haiticare2011

    Paul Rubin Guest

    Lots. At the cpu level alone: variable instruction timing, cache
    misses, pipeline stalls, etc. At OS level: swapping and page faults,
    contention for machine resources by other tasks, etc.
    If you have absolute deadlines ("hard real time") then it's complicated,
    and there are books written about it.
    Some OS's offer real time scheduling which basically means you can give
    an absolute priority to your real time task, so no other tasks can run
    until the priority task has released the cpu.
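
    In POSIX terms that looks something like the sketch below -- a minimal
    illustration assuming Linux and sufficient privileges; sched_setscheduler()
    and mlockall() are standard calls, but the priority value is arbitrary:

        #include <sched.h>
        #include <stdio.h>
        #include <sys/mman.h>

        int main(void)
        {
            /* Lock current and future pages in RAM so page faults can't
               stall the time-critical path. */
            if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
                perror("mlockall");

            /* SCHED_FIFO: run ahead of all normal tasks until we block or
               yield.  Needs root (or CAP_SYS_NICE on Linux). */
            struct sched_param sp = { .sched_priority = 50 };  /* illustrative */
            if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
                perror("sched_setscheduler");

            /* ... the time-critical work goes here ... */
            return 0;
        }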
     
    Paul Rubin, May 25, 2014
    #2

  3. haiticare2011

    Tauno Voipio Guest

    Yes, it does, and you should not attempt to prevent it,
    as you may make the whole system totally unresponsive.

    There is little difference between a compiled C program
    and an assembly program performing the same algorithm.

    The write to the SSD drive is far from simple, if you
    have a file system on the card. Also, the SSD may have
    an internal controller which needs time slots for its
    own purposes. Examples are SD (camera) cards and USB sticks.
     
    Tauno Voipio, May 25, 2014
    #3
  4. haiticare2011

    Don Y Guest

    Hi jb,

    That, of course, depends on the choice of processor ("CPU delays")
    and the choice/characteristics of the OS you are using (if any).

    CPU's often include instruction pipelines, I/D caches, and
    (instruction) scheduling algorithms that can cause what you *think*
    is happening (i.e., by examining the assembly language code that
    is actually executing) to differ from what is *actually* happening
    (i.e., by examining the CPU's *state*, dynamically).

    Add a second (or fourth) core and things get even messier!

    OS's range from *nothing* (e.g., running your code in a big loop)
    to those with virtual memory subsystems, and dynamic scheduling
    algorithms, preemption, resource reservations, deadline handlers,
    etc.

    Of course, if it's *your* hardware (and OS choice), you can opt to
    bypass all of those mechanisms by *carefully* designing your
    "system" to run at the highest hardware priority available. In
    essence, claiming the CPU for your exclusive use.
    Again, that depends on the choice of processor and the actual code
    that gets executed (recall, what you *write* can be rewritten by an
    aggressive compiler so you need to look at what the actual instruction
    stream will be). You can, of course, mix and match your tools to
    the tasks best suited. E.g., if there are timing constraints and
    relationships that must be observed in accessing the PCI card, code
    that in ASM. If the OS already knows how to *talk* to the SSD
    (assuming you are using a supported file system and not just writing
    to the raw device), then just pass the results of the ASM routine
    to a higher level routine that allows the OS to do the actual write.

    Of course, you have to be sure your *average* throughput meets the
    needs of the data source. Often, that means an elastic store,
    somewhere, so your ASM routine can *always* be invoked to get the
    next batch of data even if the OS hasn't caught up with the *last*
    batch of data. Make this store easily resizable and then measure
    to see just how much gets consumed (max) in your worst case scenario.
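
    A minimal sketch of such an elastic store -- a single-producer,
    single-consumer ring buffer that tracks its own high-water mark (the
    size and byte granularity here are illustrative):

        #include <stddef.h>
        #include <stdint.h>

        #define RING_SIZE 4096           /* illustrative; keep it easy to resize */

        static uint8_t ring[RING_SIZE];
        static volatile size_t head, tail;  /* producer owns head, consumer tail */
        static size_t high_water;           /* worst-case fill level observed */

        /* Producer side (the acquisition routine).  Returns 0 on overflow. */
        int ring_put(uint8_t b)
        {
            size_t next = (head + 1) % RING_SIZE;
            if (next == tail)
                return 0;                   /* full: caller must handle the loss */
            ring[head] = b;
            head = next;
            size_t fill = (head + RING_SIZE - tail) % RING_SIZE;
            if (fill > high_water)
                high_water = fill;          /* measure max consumption */
            return 1;
        }

        /* Consumer side (the task feeding the OS's write path). */
        int ring_get(uint8_t *b)
        {
            if (tail == head)
                return 0;                   /* empty */
            *b = ring[tail];
            tail = (tail + 1) % RING_SIZE;
            return 1;
        }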

    [Hint, if you are using a COTS OS, you probably will never be able
    to get *published* data to allow you to make these computations
    a priori. And, if the OS will support a variety of unconstrained
    *other* applications, all bets are off -- unless you can constrain
    them to suit your requirements!]
    Again, depends on the OS and how you've installed your "program".
    E.g., if you have ensured that your code always runs at highest
    privilege, then the OS waits for *you* (which could bodge other
    applications that are expecting the OS to "be fair").

    If, OTOH, you are just a userland application, then your code
    could "pause" for INDEFINITE periods of time: milliseconds to
    *days* (exaggeration).

    All the "writing in ASM" buys you is the ability to see what the
    sequence of opcodes available to the CPU will be. Writing in a
    HLL hides that detail from you (though you can often tell your
    compiler to show it to you) *and* limits your ability to make
    arbitrary changes to that sequence (because the compiler has
    liberties to alter what you've told it -- in "compatible ways").
    Much effort goes into system designs to *free* people from
    having to think about these sorts of details. But, when you
    are dealing with hardware, there are often other constraints that
    force you to work around/through those abstractions.

    Typically (i.e., even in a custom OS/MTOS/RTOS) a high(er) priority
    task deals with events that have timeliness constraints. E.g.,
    fetching packets off a network interface (if you "miss" one, it
    either is lost forever *or* you have to request/wait for its
    retransmission -- a loss of efficiency... especially if you are
    likely to miss *that* one, too!).

    The data acquired (or *delivered* -- when pumping a data sink), is
    then buffered and a lower priority (though this might still be a
    relatively high priority... based on the overall needs of the
    system) task removes data from that buffer and "consumes" it.

    Note that this *adds* latency to the overall task. And, allows
    that latency to exhibit a greater degree of variability (based
    on how much of the elastic store gets consumed -- or not -- over
    the course of execution). So, if you expect a close temporal
    relationship between "input" and "output", you have to address
    this with other mechanisms (e.g., if you wanted something to
    happen AS SOON AS -- or, some predictable, constant time
    thereafter -- an input event was detected, the variability in
    this approach is directly reflected in that "output").

    Of course, if it can't be consumed as fast as it is sourced, then
    your system is too slow for the task you've set for it!

    "Why not just do the output in the same high priority task as the
    input?"

    What if the SSD (in your case) is not *ready* for more input at the
    *moment* your new input comes along? Perhaps the SSD is doing
    internal housekeeping? Do you twiddle your thumbs in that HIGH
    PRIORITY task *waiting* for it to be ready? How long can you twiddle
    before your *next* input comes along AND GETS *MISSED*?

    OS's (particularly full-fledged RTOS's) can provide varying degrees
    of support to remove some of the details of this task management.
    E.g., it may provide support for shared circular buffers. Or, allow
    buffers to be dynamically memory-mapped into recipient tasks (to eliminate
    bcopy()'s). Signaling between the producer and consumer can be
    *part* of the OS (instead of forcing you to spin-wait on a flag).
    Deadline handlers can be created (by you) that the OS can then
    invoke *if* the associated task fails to meet its agreed upon
    deadline (e.g., what happens if you *can't* get back to look at
    the PCI card before the next data arrives? or, if you can't pull
    the data out of the buffer before the buffer *fills*/overflows?
    Do you *break*? Or, do you gracefully recover?)
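
    A sketch of that producer/consumer signaling with POSIX semaphores
    (names are illustrative; on an RTOS the equivalent primitive will
    differ, but the shape is the same):

        #include <semaphore.h>

        static sem_t data_ready;          /* counts items waiting in the buffer */

        void setup(void)
        {
            sem_init(&data_ready, 0, 0);  /* start with nothing buffered */
        }

        void producer(void)               /* the high priority acquisition task */
        {
            /* ... push one item into the shared buffer ... */
            sem_post(&data_ready);        /* wake the consumer; no spin-wait */
        }

        void consumer(void)               /* the lower priority "drain" task */
        {
            for (;;) {
                sem_wait(&data_ready);    /* sleep until data is available */
                /* ... pull one item and hand it to the OS's write path ... */
            }
        }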

    Best piece of advice: figure out how *not* to have timing constraints
    on your task. And, if unavoidable, figure out best to handle their
    violation: "hard" constraints can be handled easiest -- you simply
    stop working on them once you're "late"! ("Sorry, the ship has already
    sailed!"). "Soft" requires far more thought and effort -- it assumes
    there is still *value* to achieving the goal -- albeit *late*. ("But,
    if you charter a speedboat, you could probably catch up to that ship
    and arrange to board her AT SEA -- or in the next port. Yeah, that's
    a more expensive proposition but that's what happens when you miss
    your deadline!").

    Any more *specific* answer requires far more specifics about your
    execution environment (processor, hardware involved, choice of OS, etc.)

    HTH,
    --don
     
    Don Y, May 25, 2014
    #4
  5. haiticare2011

    Paul Rubin Guest

    Oh I remember now, you had the other post about some kind of data
    logging application. As others said, it sounds like you don't really
    have a strict latency bound as long as you don't lose data: given enough
    RAM to buffer stuff while I/O is blocked, the probability of overflow can
    be made small enough that the failure modes are dominated by the
    reliability of the hardware.

    Anyway my guess is that the main source of delays may be the SSD itself.
    Those have unpredictable delays as they sometimes have to reorganize the
    data internally, which on some units can take a VERY long time on rare
    occasions. If you use an "enterprise" SSD, the vendors try harder to
    control those delays, including by overprovisioning the device so that
    the reorganization can happen using the extra capacity in the
    background. For that reason the enterprise SSD's cost more.
     
    Paul Rubin, May 25, 2014
    #5
  6. haiticare2011

    rickman Guest

    I worked on a real time PC in which we had installed a board. It ran NT
    with a real time extension. First pass of my board had a bug which hung
    the bus transfer, and the *entire* machine hung with it. Wow! The only
    way out was a hardware reset.

    JB seems to have a lot to learn about real time systems. The part I
    don't quite get is why the PC side has to be real time. If he uses a
    separate MCU board to capture the ADC data (the important real time part
    of the problem) it can then send the data to a PC, not in "real time",
    just with a throughput that exceeds the data rate. Adequate buffering
    on the MCU card will ensure no loss of data. Then the PC can store the
    data on any media it wishes. Sounds simple enough to me but I don't get
    why he continues to flog this horse.
     
    rickman, May 25, 2014
    #6
  7. haiticare2011

    Tauno Voipio Guest


    Maybe the PHB has ordered him to make the PC a real-time
    capturing system. Anyway, he'll have a stiff climb up the
    learning curve.
     
    Tauno Voipio, May 26, 2014
    #7
  8. haiticare2011

    rickman Guest

    PHB? Do you mean powers that be? He has been asking about embedded,
    but seems to think he has to put the entire system on the embedded
    device. I don't want to give the guy grief, but it sounds like he is
    not familiar enough with embedded design to even know if his task can
    use it effectively or not. He seems to reject a lot of suggestions
    before he understands them. I'm also very unclear on what data rate he
    really needs from the front end to the storage.
     
    rickman, May 26, 2014
    #8
  9. haiticare2011

    Tauno Voipio Guest

    Sorry - Pointy-Haired Boss, from Dilbert.
     
    Tauno Voipio, May 26, 2014
    #9
  10. haiticare2011

    David Brown Guest

    The OP is very unclear about the data rate he needs (he alternates over
    several orders of magnitude), and has no idea at all about the sample
    size. The worrying thing is that he does not seem to consider this a
    problem, and does not realise that this project needs a lot of thought
    and planning, then a lot of research and prototyping, before he can
    start looking at implementation and development.

    He also has virtually no idea about the technologies for implementing
    the system. He has some fixed pre-conceived ideas that he won't change
    no matter what people tell him - he believes USB latency will cause
    trouble, he believes SSD is the greatest invention since sliced bread,
    he believes assembly programming will be more "real time" than C
    programming.

    The guy may be a good SW developer for all I know, but he is clearly far
    out of his depth with this project. I don't know if this is his own
    fault, or that of a PHB, but he desperately needs help here (of a kind
    that we cannot give him) before he wastes lots of time and money.
     
    David Brown, May 26, 2014
    #10
  11. haiticare2011

    MK Guest

    From this and your other posts I think you are trying to make a data
    acquisition system which will store up to 10Mbyte/s on a PC hard drive.
    You've got three ways (at least) to get the data into the PC: USB,
    Ethernet and PCI. USB and Ethernet are relatively easy and work with any
    kind of PC and won't need fancy driver-level code, so they will probably
    work with any OS.
    Ethernet is the simplest from the PC software point of view.
    10 Mbyte/s maxes out the wire speed of 100 Mb Ethernet, so you'll struggle
    if you try to use a typical micro's on-chip MAC. You can get ARM-based
    micros with high-speed USB.
    If I were doing this (and I have, many times) I'd use an FPGA to
    control the ADC, buffer the data and drive Ethernet via an off-chip
    Gigabit PHY.
    You will need to buffer the data from the ADC, and unless you are very
    clever with the host computer you'll need a decent sized buffer for the
    data. How big depends on so many variables that it's very risky to guess
    - you'll need to check, but I would start with enough to store 500 ms
    worth of data (5 Mbytes in your case, so use a 32 Mbyte or so SDRAM).
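
    Written out, that sizing rule is just rate times hold time (the
    10 Mbyte/s rate and 500 ms figure are from above; rounding up to
    32 Mbyte is simply to hit a real SDRAM part):

        /* Buffer sizing: hold HOLD_MS worth of data at RATE bytes/second. */
        #define RATE_BYTES_PER_SEC  (10u * 1000u * 1000u)      /* 10 Mbyte/s */
        #define HOLD_MS             500u
        #define MIN_BUFFER_BYTES    ((RATE_BYTES_PER_SEC / 1000u) * HOLD_MS)
        /* = 5,000,000 bytes; round up to the next convenient SDRAM size.  */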

    In order to control the Ethernet interface you'll need to be quite
    confident with VHDL or Verilog or use a soft micro on the FPGA and get
    into a different kind of mess.

    If all your experience is with software you might do better with a
    micro with built-in high-speed USB, but you'll need one which supports
    external SDRAM at the same time, and your data throughput will be
    challenging.

    PCI has all the problems of USB and Ethernet interfaces and a lot of
    additional ones as well - don't go that way unless there is a really
    good reason for it.

    Unless you need a lot of these I suggest you just buy something, and of
    course if you want a good design done you could always email me :)

    Michael Kellett
     
    MK, May 26, 2014
    #11
  12. Rick
    If this is as trivial as you say, then there would be more examples of how to
    do it that work. But there aren't. There is little consensus on how to achieve
    good data throughput. Solutions range all over the place, and few work. For
    example, there is "Starter Ware," a low-overhead OS for ARM from TI. But if
    you read the forums, much of the documentation is incorrect and unworkable.

    Now, you recommend an "MCU board." Now we're getting somewhere. Do you have
    any actual examples of this working? Which MCU? How was the bus to the PC
    configured? Since you say I have a lot to learn, teach me with a concrete
    system example.

    JB
     
    haiticare2011, May 26, 2014
    #12
  13. Thanks for the compliments. :) I'm convinced the rank-and-file developers out
    there don't have their ducks in a row on this one, either. Judging by the BBB
    developers' attempts, it's still the Wild West. :)
     
    haiticare2011, May 26, 2014
    #13
  14. haiticare2011

    upsidedown Guest

    If that is all you need, what do you need an OS for?

    Just use an ISR (Interrupt Service Routine) for reading your input
    card (such as an ADC) and another ISR for writing the data to the SSD
    drive (write-complete interrupt).

    The main program then just initializes those two interrupt service
    routines and enters an eternal loop built around a (low-power)
    wait-for-interrupt instruction.
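
    In outline -- a bare-metal sketch; the ISR names are illustrative and
    the wait-for-interrupt instruction shown is ARM's, it varies by CPU:

        void adc_isr(void)           /* fires when the input card has data */
        {
            /* read the sample and push it into the buffer ... */
        }

        void ssd_done_isr(void)      /* fires when a write completes */
        {
            /* start the next write if the buffer holds more data ... */
        }

        int main(void)
        {
            /* install both ISRs and enable their interrupt sources ... */
            for (;;)
                __asm__("wfi");      /* low-power wait for the next interrupt */
        }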
     
    upsidedown, May 26, 2014
    #14
  15. haiticare2011

    rickman Guest

    No, I have not built your system for you already. In the other thread I
    have given you lots of material for you to work with. On the other hand
    you have not given us a set of requirements to work from. When I get
    the requirements I will consider if I want to take on the job. :)
     
    rickman, May 26, 2014
    #15
  16. haiticare2011

    Don Y Guest

    Um, are you requiring an "example" to be of the form:
    _Application Note 1234: Using the C Language Under <OS> to
    Copy <arbitrary> Data from a PCI Card in a PC(?) to an SSD
    in the Same PC without any Constraints on Timeliness using
    Free Tools_?
    If that's the case, I can save you a lot of time...
    Which *specifically* "don't work"? And, do they not work because of
    omissions on YOUR part? If not, please identify *why* they "don't
    work". E.g., the solution I provided *does* work as I have used it
    on dozens of projects. If you can't see how use on a non-PC applies,
    then I can cite my 9-track tape driver that runs on a PC... not a
    PCI card (ISA) and not an SSD (IDE) but if you can't work "in the
    abstract", you'll never work in the *specific*!
    Then "Starter Ware" requires more of you than you are able to provide.
    Fine. Pick something else.

    You probably can't use Limbo, either -- due to your unspecified
    timing constraints, hardware interface, file formats, filesystem
    choice, etc.

    Jaluna would intimidate you with its build environment.

    RTEMS might not provide the (unspecified) user interface you
    need.

    QNX costs money.

    You probably can't write on bare iron...

    etc.

    Hey, maybe the Linux folks can entertain your queries! I'm
    sure there's a newsgroup/forum for that!

    In all seriousness, until *you* know (meaning "can put in unambiguous
    quantifiable terms") what your complete set of criteria are, you're
    just going to be squeezing balloons -- always chasing, never achieving.

    Good luck!
     
    Don Y, May 26, 2014
    #16
  17. Actually, the failure of the ARM community to achieve any serious IO is
    embarrassingly apparent and does not require any bureaucratic structure to
    see it.
    The GPIO data rate was coaxed into the MHz range, but with great difficulty.
    It is natively in the low kHz range.
    General material is offered, which evaporates under scrutiny...
     
    haiticare2011, May 27, 2014
    #17
