
Detailed article about Mars Rover failure in EE Times

Discussion in 'Embedded' started by Jim Stewart, Feb 22, 2004.

  1. Jim Stewart Guest


  2. Steve at fivetrees, Feb 22, 2004
    #2

  3. Janvi Guest

    That is a good article, but I believe many more things went wrong.
    If you read this link from February 17th:

    http://origin.mars5.jpl.nasa.gov/newsroom/pressreleases/20040217a.html

    simply click "View all Opportunity images from this press release",
    found below the photo on the right-hand side. This changes the release
    date of the event from February 17th to January 17th. If you read
    on below "The Road Less Traveled", they obviously turned the
    1.4 meter distance into 4.6 meters instead of 4.6 feet. The world is not
    laughing only at ESA.

    Thinking more about "interplanetary assistance": most people on Earth do
    not have any access to the Internet, and we cannot even manage to provide
    sufficient potable water for survival in many huge areas ...
     
    Janvi, Feb 22, 2004
    #3
  4. CBFalconer Guest

    I would class it more as operator error. They allowed the first
    half of a composite command to execute and leave the system in a
    critical state, even though the second half had not been
    uploaded. Sounds like something E. Robert Tisdale might do.

    The design problem seems to have been more in the creation of that
    composite command in the first place. There must have been safer
    paths.
     
    CBFalconer, Feb 22, 2004
    #4
  5. Scott Moore Guest

    And failed for the standard reasons.

    First, the case of being out of memory
    was not extensively tested for. That is, one test would have been
    to fill the file system with nonsense files and check the behavior
    during the full condition, with frequent program faults.

    Second, there was no planned fallback action for the software to
    perform. A fault was generated, and the "program" (actually, more
    appropriately termed a "task" here), "faulted", which simply means
    to terminate. No attempt was made to suspend it until file space was freed up,
    no attempt to do anything about the file full condition, etc.

    Third, the system was allowed to create a directory system that
    would not have room enough to load during a restart condition.

    These are all standard failure modes for off the shelf software.
    The case of "edge conditions" are inadequately checked for. Windows
    used to crash very reliably when placed deliberately in a full
    ram or disk enivronment. I suspect the fact it no longer does has
    more to do with virtual memory than any program improvement, but
    it remains a fact that most systems, even major operating systems,
    don't do well with their disk full or nearly full.

    Further, programmers rarely think of the consequences of hitting
    an error such as file system full. The current program/task can be
    killed, but this is not only likely to keep producing failed programs;
    it may make things worse if simply starting these doomed-to-fail
    programs takes more memory. The answer is to DO something about
    it. If the OS does something about it, the program/task need
    not even know about it, it can be held in suspend until the
    problem is cleared.

    I would guess for the above situation, dumping files by either
    age or priority, or both would be appropriate.
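
    As a rough illustration of that fallback, here is a minimal C sketch. The
    file-system API is hypothetical (fs_free_bytes(), fs_oldest_deletable(),
    fs_remove() and fs_write() are invented names): rather than letting the
    task fault, the write retries after pruning the oldest deletable files.

        #include <stddef.h>

        /* Hypothetical flash file-system API -- names are illustrative only. */
        extern size_t fs_free_bytes(void);
        extern int    fs_oldest_deletable(char *name, size_t len); /* 0 = found */
        extern int    fs_remove(const char *name);
        extern int    fs_write(const char *name, const void *buf, size_t len);

        /* Write a file; if space is short, prune old deletable files first
           instead of letting the calling task fault and terminate. */
        int safe_write(const char *name, const void *buf, size_t len)
        {
            char victim[64];

            while (fs_free_bytes() < len) {
                if (fs_oldest_deletable(victim, sizeof victim) != 0)
                    return -1;           /* nothing left to prune: report, don't fault */
                (void)fs_remove(victim); /* free space by age/priority policy */
            }
            return fs_write(name, buf, len);
        }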

    I am not stating this to "prove I am a genius", but simply to
    state what I have stated all along: the current state of software
    is NOT good, and using industry standard practices and operating
    systems is not the path to reliability.
     
    Scott Moore, Feb 24, 2004
    #5
  6. I totally agree.

    In the early days of NASA, Bob Gilruth (NASA head) used to implore his guys
    to "keep it simple". The simpler the system, the fewer failures modes needed
    to be considered. Fairly elementary stuff.

    As I've said before, as a hardware/software engineer I'm fascinated by the
    difference between these two disciplines. In hardware, we're fairly mature,
    and we're adept at "complexity management". In software, we seem determined
    to throw out all the lessons learned and start over - in an undisciplined
    and sloppy kind of way (so far). We approach it horizontally when we should
    be thinking vertically i.e. hierarchically.

    Time and time again I have seen projects compromised by a decision to "save
    time" by buying in an RTOS, or basing the product on Windows CE, etc etc -
    decisions which increase the complexity (and hence failure modes) of the
    product by leaps and bounds. When comprehensibility is reduced, so is
    reliability. Bob Gilruth would turn in his grave - from his obit at
    http://www.space.com/peopleinterviews/gilruth_obit_000817.html:

    >> "[...] things are important," [Alan Bean] said. "With the quality control and
    documentation, you had the history of everything and could lay your hands on
    it in a flash." <<

    Compare this with the current state of software engineering. Actually, I
    hesitate to call it "engineering" at the moment. There is still far too much
    emphasis on "hack and debug" and not enough on complexity management i.e.
    good, solid hierarchical design. A complex design should be broken down into
    many, simple, pieces. If the designers can't understand the hierarchy,
    WARNING. If the elements are too clever to be comprehensible, WARNING. If
    the designers can't "lay [their] hands on it in a flash", WARNING. We need
    to be more disciplined, and to see simplicity and comprehensibility as
    virtues. Too often the design evolves from the code in an ad-hoc fashion -
    analogous to designing a car by starting with a piece of metal and a
    hacksaw.

    In the original "Mars Rover" thread here there was much emphasis on coding
    issues, e.g. type-safe languages. This, in my view, is a symptom of the
    problem - coding is NOT the main issue. (Sure, it matters, but it's not the
    root problem.) Good *design* is the issue. Given a good, comprehensible
    design and a competent, disciplined coder, I don't care if it's coded in C,
    assembler, or Ada - it'll work, and stay working. If it's not 100%
    comprehensible, it WILL fail.

    </rant>

    Steve
    http://www.fivetrees.com
    http://www.sfdesign.co.uk
     
    Steve at fivetrees, Feb 24, 2004
    #6
  7. Bob Stephens Guest

    I agree with your points about the importance of structured, disciplined
    design as opposed to hack and patch, but given the fallibility of humans in
    general, and NASA in particular - design by the lowest bidder - I am
    surprised that some redundancy wasn't built in to accommodate a catastrophic
    unanticipated failure mode. I realize that weight is at a premium, but even
    so.

    Bob
     
    Bob Stephens, Feb 24, 2004
    #7
  8. Scott Moore Guest

    I am also cross discipline, and this is really key to getting better software.
    In the old days, you could say that hardware is "simply simpler" than software.
    But nowadays, the two disciplines are converging. Verilog/VHDL designs are reaching
    or have reached massive complexity, and appear as complex textual software descriptions
    of hardware. Even C occasionally gets compiled to hardware. So the question is
    not academic: "why does hardware quality lead software quality by such a large
    margin?" The answer is simple to anyone doing hardware work today. The mindsets
    are completely different. Here are some of the points:

    1. Hardware is completely simulated. Although some PCB-level (printed circuit
    board, or multiple-chip level) designs remain unsimulated, virtually no designers
    just roll the dice by trying design iterations empirically, even
    using FPGA chips that make such behavior possible (FPGAs can be downloaded
    and run in a few minutes' time). Hardware engineers know that simulation
    delivers better observability and consistency than empirical tests.

    2. Hardware is fully verified. Chip designs are not considered done until all
    sections have been exercised and automatically verified. Tools exist for hardware
    to discover sections that have not been exercised by the tests, and more tests
    are added until "100% test coverage" is achieved.

    There is interest in applying these methods to software, and it can be done.
    Profilers can find if sections of code have been run, even down to the statement
    and machine instruction level. Automatic test methods are not making as much
    progress, but there is no fundamental reason why they won't work. Finally,
    software engineers need to understand that ANYTHING can be simulated. There is
    far too much temptation to simply defer testing until the hardware comes
    back. But this serializes software and hardware development, and I believe it
    significantly degrades software reliability by deferring simple software
    bugs (i.e., not related to timing or interface issues) to the environment with
    the least potential for observability and automatic verification.
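
    As a toy illustration of that kind of coverage measurement, here is a C
    sketch with an invented COVER() macro; real projects would more likely use
    a profiler such as gcov, but the idea is the same: each instrumented point
    records that it ran, and the report lists points that never executed.

        #include <stdio.h>

        #define MAX_POINTS 3
        static unsigned char hit[MAX_POINTS];

        #define COVER(id) (hit[(id)] = 1)      /* mark coverage point (id) as executed */

        static int clamp(int x, int lo, int hi)
        {
            if (x < lo) { COVER(0); return lo; }
            if (x > hi) { COVER(1); return hi; }
            COVER(2);
            return x;
        }

        int main(void)
        {
            clamp(5, 0, 10);
            clamp(-3, 0, 10);                  /* the x > hi branch is never exercised */

            for (int id = 0; id < MAX_POINTS; id++)
                if (!hit[id])
                    printf("coverage point %d never executed\n", id);
            return 0;
        }
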
    "Hackaneering" :)
    I would only add that the counter to the "it takes good programmers" idea is that,
    certainly, the programmers must do the job, but there is also the idea of
    "best practices". This is the idea that good programmers produce their best work
    by adopting the best practices they can. For example, a type safe language
    does not make for automatic quality, but adding a type safe language as a
    tool for a good programmer, along with other best practices like simulation,
    automatic verification, modularity and other things will allow the maximum
    reliability to be achieved.
     
    Scott Moore, Feb 24, 2004
    #8
  9. Scott Moore Guest

    You mean like a complete second processor? I suspect that NASA decided that
    this covers hardware problems, but not software problems, since the
    same software runs on both processors. I suspect the correct answer would
    be to have three voting processors whose software was written by three
    totally separate groups who were not allowed to communicate with each other
    (the groups, that is). Of course, that's probably a recipe for high costs!
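
    The voting part of such a scheme is simple; a minimal C sketch of a 2-of-3
    majority vote over three independently computed results (the surrounding
    fault logging and isolation are omitted):

        #include <stdio.h>

        /* 2-of-3 majority vote: returns 0 and the agreed value, or -1 if no
           two results agree.  Sketch only -- a real TMR system would also log
           and isolate the disagreeing channel. */
        static int vote3(int a, int b, int c, int *out)
        {
            if (a == b || a == c) { *out = a; return 0; }
            if (b == c)           { *out = b; return 0; }
            return -1;
        }

        int main(void)
        {
            int v;
            if (vote3(42, 42, 17, &v) == 0)
                printf("agreed result: %d\n", v);   /* prints 42 */
            else
                printf("no majority\n");
            return 0;
        }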
     
    Scott Moore, Feb 24, 2004
    #9
  10. The answer is definitely *not* simple.

    Without getting into a theological discussion of why software
    *is* more complicated than hardware:

    o Hardware is usually applied to well bounded problems,
    often with the unstated idea that the software will
    magically fill in the gaps.

    o Software interfaces vary much more widely than the
    ones, zeros and clocks of digital logic.
    Simulation allows empirical testing. Without vectors
    (test harness), there is no test.

    I agree that simulation is valuable for executing
    test harnesses before the target hardware is available.

    One of the nice things about the trend toward using
    Linux in embedded systems is that much of the application
    (and even driver) work can often be done on a PC, improving
    development throughput.

    Usually, the difficult thing about using simulators
    for developing embedded software is that much of the
    software must interact with the target hardware, and
    most target simulators don't provide a good way to
    model the hardware behavior. Even if the simulator
    *does* provide a hardware modelling method, building
    the models is time-consuming and error-prone.
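
    One common compromise is to hide the hardware behind a thin access layer so
    the same application code can run against a host-side model. A hedged C
    sketch follows; the register addresses, the HOST_BUILD switch and the UART
    names are all invented for illustration:

        #include <stdint.h>
        #include <stdio.h>

        /* Thin register-access layer: the application only calls uart_put(),
           and only this layer knows whether a real UART or a model sits behind it. */
        #ifdef HOST_BUILD
        /* Host-side model: "transmit" by writing to stdout. */
        static void uart_put(uint8_t ch) { putchar(ch); }
        #else
        #define UART_TX   (*(volatile uint8_t *)0x40001000u)  /* invented addresses */
        #define UART_BUSY (*(volatile uint8_t *)0x40001004u)
        static void uart_put(uint8_t ch)
        {
            while (UART_BUSY)       /* spin until the transmitter is free */
                ;
            UART_TX = ch;
        }
        #endif

        int main(void)
        {
            const char *msg = "same application code, host or target\n";
            while (*msg)
                uart_put((uint8_t)*msg++);
            return 0;
        }
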
    Hmmm. If hardware were "fully verified" I would expect
    *much* shorter errata sheets ;-)
    Many of us have an interest in improving quality,
    unfortunately, there are many counter forces.

    o Apathy and narrow-mindedness of the engineers.
    o Business pressures.
    o Lack of experience, both individually and collectively.

    Notice that the first bullet places the blame
    squarely on the shoulders of the practitioners.

    Just read this news-group and you will find
    endless presentations by those who claim that
    they know the answer, and that most of their
    colleagues are fools. C rocks! C++ sucks!
    Real men write assembler! RTOS users are fools!
    You get the idea.

    I agree that being consistent and using a
    "best practice" approach (reducing the number
    of variables) is an excellent way to improve
    the stability of any software. However, this
    can also lead to stagnation and narrow-mindedness.

    Under-stating and over-simplifying the problem
    of software by saying "all-you-have-to-do-is-xyz"
    does not contribute to a solution.

    There is no substitute for experienced engineers.
    If they're experienced enough, then they're in it
    for love ... not money ;-)

    Let's not forget the most important best practices
    of all: Solid requirements, careful analysis and
    a design that covers the temporal aspects as well
    as procedural aspects.

    Wouldn't it be nice if those things took as little
    time as many *think* they take! :)


    --
    Michael N. Moran (h) 770 516 7918
    5009 Old Field Ct. (c) 678 521 5460
    Kennesaw, GA, USA 30144

    "... abstractions save us time working, but they don't
    save us time learning."
    Joel Spolsky, The Law of Leaky Abstractions

    The Beatles were wrong: 1 & 1 & 1 is 1
     
    Michael N. Moran, Feb 24, 2004
    #10
  11. Just a few points:
    FWIW, that's not my position. The right tool for the job etc. However, I do
    notice a tendency to overcomplicate. Recently I worked on a project that,
    quite typically, had grown out of proportion over the years and that no one
    now fully understood. It had been based on an RTOS when a simple round-robin
    scheduler would have been adequate - and far more comprehensible. In this
    case, as in many that I've had first-hand experience of, there were none of
    the real justifications for using an RTOS. IME, this is far from unusual.
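
    For comparison, the kind of round-robin scheduler meant here can be a handful
    of lines of C. A minimal cooperative sketch (the task names are invented):

        #include <stddef.h>

        /* Minimal cooperative round-robin "scheduler": each task is a function
           that does a small amount of work and returns promptly.  Sketch only. */
        typedef void (*task_fn)(void);

        static void poll_sensors(void)  { /* read inputs, update state */ }
        static void run_control(void)   { /* control law, set outputs  */ }
        static void service_comms(void) { /* handle serial traffic     */ }

        static task_fn tasks[] = { poll_sensors, run_control, service_comms };

        int main(void)
        {
            for (;;)                                          /* superloop */
                for (size_t i = 0; i < sizeof tasks / sizeof tasks[0]; i++)
                    tasks[i]();                               /* tasks must not block */
        }
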
    If you mean that a true craftsman is always ready to revise and refine his
    definition of "best practices", then I agree.
    Not sure about this one. As designers, I feel our job is to reduce
    complexity into a collection of simple elements, with simple well-defined
    interfaces and no side-effects. Sorry to belabour this point, but I do see
    it being missed far more often than makes sense.

    I am *not* saying it's easy, BTW. Good decompositional skills are, I
    believe, the hardest thing to learn, and, in my experience, far more rare in
    the software domain than I'd reasonably expect. I think this is
    significant - in hardware many of the subassemblies are available already
    partitioned (ICs, discretes, modules, products). Not so in software
    (usually).
    Now *this* is an interesting point. I have noticed that one of the
    side-benefits of a simple, comprehensible design is a reduction in
    implementation time. The design time is increased, but the overall time is
    reduced - often considerably.

    Steve
    http://www.fivetrees.com
    http://www.sfdesign.co.uk
     
    Steve at fivetrees, Feb 25, 2004
    #11
  12. Scott Moore Guest

    I guess to keep this from being a "do as I say, not as I do"
    discussion, I should outline how I use these principles in my
    own projects. The proprietary details have been removed.

    Current project. Large, actually stunningly large (hundreds of
    thousands of lines), written in stages since 1980, and maintained
    since.

    o Currently a Windows XP based system, formerly other computers.
    o Written using type safe language.
    o No simulation, since there is no target (not an embedded system).
    o Generous use of "assert" type constructs. I long ago learned to
    check virtually every possible bad condition, even if done
    redundantly. Now it is very rare not to have even serious problems
    trip one of my asserts, which result in a message giving the exact
    location, in source, of the fault.
    o Formal testing methodology. A series of extensive test files give
    automatic coverage of software faults. After that, an extensive
    series of real world examples are run for further testing.

    Results: Virtually all errors are caught in a sensible way. The errors
    that don't result in asserts are then caught by the type protections
    in the language. Program development proceeds virtually without system
    faults from the operating system, even on new, untried sections of the
    code.
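
    The assert-style construct described in the list above takes only a few
    lines in C; here is a sketch using the standard __FILE__ and __LINE__
    macros (report_fault() is an invented handler name; a real system might
    log the location and reset rather than abort):

        #include <stdio.h>
        #include <stdlib.h>

        /* Report the exact source location of a failed check, then stop. */
        static void report_fault(const char *expr, const char *file, int line)
        {
            fprintf(stderr, "check failed: %s at %s:%d\n", expr, file, line);
            abort();
        }

        #define CHECK(cond) \
            ((cond) ? (void)0 : report_fault(#cond, __FILE__, __LINE__))

        int main(void)
        {
            int index = 12;
            CHECK(index >= 0 && index < 10);   /* fails and pinpoints this line */
            return 0;
        }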

    Last project:

    o Embedded, targeting new hardware that used IBM-PC standard hardware.
    o Written using C.
    o Simulation of chip-related tests. Basic workability of the test platform
    was established by "emulating" the full software on a standard PC, made
    possible by the commonality of target and PC hardware. However, one
    version used POSIX I/O and worked on both Windows and Linux,
    and was used to give a complete preview of the system, long before
    hardware was even designed.
    o Formal test methodology was used for chip tests, which ran both against
    a full hardware simulation, then transferred to real hardware.
    Test platform was fully scriptable, and allowed for building complete
    regression tests.

    Results: full functionality with real hardware in 2 weeks after hardware
    proved functional.

    Before that:

    o Embedded, custom platform.
    o Written using C.
    o Full simulation, and test of code in simulation environment. An
    arrangement was used where the code used to compile the hardware
    chips in Verilog was coupled with a CPU model that executed the
    code. The result was a full simulation with real code, against
    the actual code used to construct the hardware.
    o Formal test methodology. TCL was used to drive the system under
    test for full regression testing.

    Results: We brought up several platforms. It was very common to have
    the software running the same DAY that the hardware team declared the unit
    to be running.

    Final comment: I design in whatever language my client wants. Personally,
    however, I find that projects proceed twice as fast using type safe
    languages. The development time to write the code is the same, and most
    debugging occurs at the same pace. However, it has been my experience
    that C, and probably all type unsafe languages, will throw up several
    problems per year that consume more than one week to solve, such as
    lost pointer errors or array overruns. These problems typically cause
    random, serious schedule slips. The occurrence of these types
    of problems has effects beyond just the problem itself. For example,
    I typically run a much tighter write-to-test cycle for C, because if a
    serious fault shows up I want a better idea of just what code might
    have introduced it. This kind of defensive
    programming costs development time. Also, debugging in type safe languages
    goes faster because of the better diagnostics produced by even
    minor errors.

    Because I design perhaps 70%, and typically design tens to hundreds of
    thousands of lines each year, I don't believe that the above effects
    occur because I have "better knowledge" of one language or another.
    On the contrary, that should favor C. I also don't believe that it
    is an effect of programming knowledge in general, since I am often
    the debugger of choice for serious system errors, especially in
    compiled low-level code, since I have written several compilers,
    including a C compiler.

    Do I make much progress convincing my clients to use type safe languages?
    No, unless you consider my work with TCL or Perl. Since I am a low
    level (close to hardware) specialist, I rarely even see requests
    for C++ code, much less anything higher level than that. It is my
    experience that C is usually picked for projects without any debate, or
    even asking the programmers what they would like to use.

    I like to be paid, so I don't bring language issues to work unless
    asked. It's enough work just to try to get my clients to use modular
    concepts. However, I would draw the line, and have drawn the line,
    at systems critical to human life. I would most certainly avoid
    working on any life-support system, such as medical or aircraft
    navigation, etc., if it were written in C.
     
    Scott Moore, Feb 25, 2004
    #12
  13. Interesting. My "angle" is just slightly different.

    - My background is primarily true embedded products (firmware) in a market
    (process control) where there is *no* tolerance of s/w bugs: if it fails,
    it's broken, and the result might be lawsuits and/or product recalls.
    - I consider defensive programming (within reason) to represent good value
    for money - it usually saves time further down the line, and as a hardware
    engineer, I still embrace the catechism that debug costs escalate at every
    stage.
    - I tend to write my own "type-safeness" into the application ;). That is,
    I use strong typing (more than C actually supports) and I explicitly check
    bounds etc. Anything a language can do, I figure I can do too, but with more
    insight into what's actually going on at runtime. (Which is one reason I'm
    not a fan of C++.) A sketch of this approach follows after this list.
    - I almost never run into stray pointers etc, whether I'm using C,
    assembler or whatever. When I do, it's at an early stage - like you I use
    asserts etc, along with a variety of other means of making oversights jump
    out at me.
    - Many of my applications are close to life-critical (in all but legal
    terms), and most are certainly mission-critical. Beyond the process control
    work, I've written safety monitoring applications involving naval
    navigation, fire alarm reporting, and personnel-at-risk monitoring, amongst
    others. I wouldn't be able to sleep nights if I had anything other than 100%
    confidence in them ;).
    - As I've said here before, I avoid debugging. I hate it ;). Instead, I
    basically write lots of small, trivially simple elements, test them
    individually and collectively, and ensure that *all* runtime errors are
    caught and dealt with sensibly. Many of my colleagues find this process
    strange and tedious - but I find it far more constructive than debugging,
    which can only yield empirical results.
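
    As a sketch of the home-grown type-safety idea mentioned in the list above
    (the channel_t type, NUM_CHANNELS and read_channel() are invented for
    illustration): wrapping a raw value in a distinct struct lets the compiler
    reject mix-ups, and array access goes through an explicitly bounds-checked
    helper.

        #include <stdio.h>

        typedef struct { int raw; } channel_t;   /* not interchangeable with a plain int */

        #define NUM_CHANNELS 8
        static int readings[NUM_CHANNELS];

        static int read_channel(channel_t ch, int *value)
        {
            if (ch.raw < 0 || ch.raw >= NUM_CHANNELS)
                return -1;                        /* explicit bounds check */
            *value = readings[ch.raw];
            return 0;
        }

        int main(void)
        {
            channel_t ch = { 3 };
            int v;
            if (read_channel(ch, &v) == 0)
                printf("channel 3 = %d\n", v);
            /* read_channel(3, &v); would not compile: an int is not a channel_t */
            return 0;
        }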

    I apologise if I'm blowing my own trumpet, and reiterating points I've made
    before in other threads. I'm genuinely curious as to why my bug count is so
    low when the average for the industry is so high. It ain't because I'm a
    genius - if anything, I like things simple because I'm *not* a genius. I
    assume it's because I was a hardware engineer first, and have been trained
    and indoctrinated in *engineering*. Possibly also the exposure to
    mission-critical apps, which has forced me to find ways of making software
    robust.

    NASA, sounds like you need me ;).

    Steve
    http://www.fivetrees.com
    http://www.sfdesign.co.uk
     
    Steve at fivetrees, Feb 25, 2004
    #13
  14. Rene Guest


    Haha, I would have laughed if it were funny. Hardware 100% verified... Have
    you ever come close to a PowerPC CPU, AMD Ethernet Chip (Lance), Infineon
    DuSLIC, Motorola MPC 180 Crypto Chip and so on?

    I am doing low-level software and some hardware design. I have stopped
    counting the hardware bugs in the components we used. The lesson is that we
    now evaluate parts by their errata sheets.

    - Rene
     
    Rene, Feb 25, 2004
    #14
  15. Elder Costa Guest

    Adding another point of view on the hardware side: I am reading a book
    about worst-case analysis that was recommended by somebody in this NG (I
    don't remember the title/author off the top of my head). Hardware can
    become a nightmare if one doesn't take worst-case figures into account when
    designing, only the typical ones. I'm afraid that this is more the rule
    than the exception. Not to mention digital hardware based on software
    (VHDL, Verilog etc.) :)

    Still, software is IMHO far more complex than hardware.

    Just my $0,00999999999.
     
    Elder Costa, Feb 25, 2004
    #15
  16. Brian Dean Guest

    Oh, that reminds me:

    "We are Pentium of Borg.
    Division is Futile.
    You will be approximated."

    Cheers,
    -Brian
     
    Brian Dean, Feb 26, 2004
    #16
  17. Jim Stewart Guest

    A Pentium FPU engineer goes into a bar
    and orders a drink. The bartender says
    "That'll be five dollars", the Pentium
    engineer slaps a 5 dollar bill on the bar
    and says "keep the change"
     
    Jim Stewart, Feb 26, 2004
    #17
  18. and the variant that has "Precision is futile"....
     
    Jim Granville, Feb 26, 2004
    #18
  19. Brian Dean Guest

    Your version is more appropriate - I probably just remembered it
    incorrectly. My memory was "approximated" :) Would that be a
    software bug or a hardware glitch?

    -Brian
     
    Brian Dean, Feb 26, 2004
    #19
  20. Scott Moore Guest

    Didn't say 100% verified. Said 100% coverage. There is a difference.
     
    Scott Moore, Feb 26, 2004
    #20
