
Re: Can someone explain step by step how one avoid many conditional in forth as described in Moore Fourth essay?

Discussion in 'Intel' started by Rod Pemberton, Jan 11, 2012.

  1. "Arnold Doray" <> wrote in message
    news:jehfpu$9de$...
    ....

    > > [Forth code]

    >
    > CPU pipelining is improved by reducing conditionals. Modern CPUs have
    > branch prediction, but these aren't always successful, in which case the
    > pipeline needs to be flushed, lowering the CPU's throughput.


    Is that still true for multiple cores?

    I.e., I would think the following is entirely possible and plausible, but I
    haven't studied a CPU design in decades. The CPU's designers could execute
    the process in parallel with one core taking one direction for the branch
    and another core taking the other branch direction. Once the correct
    branch is determined, the bad execution path is discarded. If the primary
    core had the good execution path, it just continues execution. If the
    alternate core had the good execution path, its internal state could be
    "pushed" to the primary core. If they used static RAM for the internal
    state, then it could be "pushed" asynchronously, i.e., between clocks or
    sub-clocks. It would require reserving a core for the branch path
    execution, at least temporarily.


    Rod Pemberton
     
    Rod Pemberton, Jan 11, 2012
    #1

  2. Re: Can someone explain step by step how one avoid many conditional in forth as described in Moore Fourth essay?

    On Jan 11, 12:21 pm, "Rod Pemberton" <>
    wrote:
    > "Arnold Doray" <> wrote in message
    >
    > news:jehfpu$9de$...
    > ...
    >
    > > > [Forth code]

    >
    > > CPU pipelining is improved by reducing conditionals. Modern CPUs have
    > > branch prediction, but these aren't always successful, in which case the
    > > pipeline needs to be flushed, lowering the CPU's throughput.

    >
    > Is that still true for multiple cores?
    >
    > I.e., I would think the following is entirely possible and plausible, but I
    > haven't studied a CPU design in decades.  The CPU's designers could execute
    > the process in parallel with one core taking one direction for the branch
    > and another core taking the other branch direction.  Once the correct
    > branch is determined, the bad execution path is discarded.  If the primary
    > core had the good execution path, it just continues execution.  If the
    > alternate core had the good execution path, its internal state could be
    > "pushed" to the primary core.  If they used static RAM for the internal
    > state, then it could be "pushed" asynchronously, i.e., between clocks or
    > sub-clocks.  It would require reserving a core for the branch path
    > execution, at least temporarily.
    >
    > Rod Pemberton


    Search on "speculative execution". It can also be done at the compiler
    level; that doesn't require processor support. Intel's P6 was their
    first chip to support it iirc.
     
    Alex McDonald, Jan 11, 2012
    #2
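    As a minimal C sketch of what the compiler-level speculation mentioned
    above can look like (illustrative only; the function and array names are
    made up, not from the thread): the load is hoisted above the branch, and
    its result is simply dropped if the branch goes the other way. This is
    only legal when the compiler can prove the access is safe.

        /* Hypothetical example of speculation done by the compiler (or by
           hand): work from the likely path is executed before the branch
           resolves, and thrown away if the branch goes the other way. */
        int lookup(const int *a, int i, int use_table, int fallback)
        {
            int speculated = a[i];   /* hoisted above the branch */
            if (use_table)
                return speculated;   /* the speculated load was useful */
            return fallback;         /* the speculated load is discarded */
        }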

  3. Re: Can someone explain step by step how one avoid many conditional in forth as described in Moore Fourth essay?

    In article <>,
    Alex McDonald <> wrote:
    >
    >Search on "speculative execution". It can also be done at the compiler
    >level; that doesn't require processor support. Intel's P6 was their
    >first chip to support it iirc.


    But don't believe any page that says that it is a solution to the
    problem of memory latency (on cache misses). It isn't, and won't
    ever be, for any 'normal' language and application. That's a MUCH
    harder task :-(


    Regards,
    Nick Maclaren.
     
    Nick Maclaren, Jan 11, 2012
    #3
  4. Re: Can someone explain step by step how one avoid many conditional in forth as described in Moore Fourth essay?

    Rod Pemberton wrote:
    > "Arnold Doray"<> wrote in message
    > news:jehfpu$9de$...
    > ....
    >
    >>> [Forth code]

    >>
    >> CPU pipelining is improved by reducing conditionals. Modern CPUs have
    >> branch prediction, but these aren't always successful, in which case the
    >> pipeline needs to be flushed, lowering the CPU's throughput.

    >
    > Is that still true for multiple cores?
    >
    > I.e., I would think the following is entirely possible and plausible, but I
    > haven't studied a CPU design in decades. The CPU's designers could execute
    > the process in parallel with one core taking one direction for the branch
    > and another core taking the other branch direction. Once the correct
    > branch is determined, the bad execution path is discarded. If the primary
    > core had the good execution path, it just continues execution. If the
    > alternate core had the good execution path, its internal state could be
    > "pushed" to the primary core. If they used static RAM for the internal
    > state, then it could be "pushed" asynchronously, i.e., between clocks or
    > sub-clocks. It would require reserving a core for the branch path
    > execution, at least temporarily.


    That's "Eager Execution" (EE), something which good branch prediction
    makes even less of a win: You can never go through more than 2-3
    branches like this before the power loss becomes totally horrendous
    (running 4-8 threads in parallel means 75-88% wasted resources), while
    lots of current code can be branch-predicted past 6-8 branches before the
    probability of a branch miss passes 50%, and all those branches could
    actually be inside the pipeline at once.

    EE should only be considered as a last resort for branches where the
    runtime branch history shows very low predictability, i.e. closer to 50%
    branch misses.

    BTW, back in March 1995 I visited Intel in Santa Clara, in order to meet
    some people from the P6 team and to get an engineering sample PentiumPro
    machine.

    I had to sign an NDA up front, so before doing that I insisted that I'd
    be allowed to write down on a whiteboard everything I then knew or had
    been able to guess about the architecture, based on nothing but public
    sources. I.e. this was to properly delineate what the NDA could cover.

    It turned out that I was correct in every particular, except for one
    thing: I expected EE because I didn't realize how much they had been
    able to improve the branch predictor.

    The engineers on the other side of the table were very happy, while some
    of the management and legal people seemed quite worried. :)

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
     
    Terje Mathisen, Jan 11, 2012
    #4
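    A quick back-of-the-envelope check of those figures, as a small C
    program (the 0.90 per-branch prediction accuracy is an assumed number,
    not taken from the post): eager execution through k branches runs 2^k
    paths and keeps only one, which gives 75% waste at k=2 and 87.5% at
    k=3, while 0.90^n drops below 50% between n=6 and n=7.

        /* Back-of-the-envelope check of the percentages quoted above.
           Assumes a hypothetical per-branch prediction accuracy of 0.90.
           Link with -lm. */
        #include <stdio.h>
        #include <math.h>

        int main(void)
        {
            /* Eager execution: 2^k paths, only one of which is kept. */
            for (int k = 2; k <= 3; k++)
                printf("eager through %d branches: %.1f%% of work wasted\n",
                       k, 100.0 * (pow(2, k) - 1) / pow(2, k));

            /* Prediction: chance that the first n predictions are all right. */
            double p = 0.90;
            for (int n = 1; n <= 8; n++)
                printf("%d predicted branches: %.0f%% chance all correct\n",
                       n, 100.0 * pow(p, n));
            return 0;
        }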
  5. Re: Can someone explain step by step how one avoid many conditional in forth as described in Moore Fourth essay?

    On Wed, 11 Jan 2012 07:21:53 -0500, Rod Pemberton wrote:

    > "Arnold Doray" <> wrote in message
    > news:jehfpu$9de$...
    > ...
    >
    >> > [Forth code]

    >>
    >> CPU pipelining is improved by reducing conditionals. Modern CPUs have
    >> branch prediction, but these aren't always successful, in which case
    >> the pipeline needs to be flushed, lowering the CPU's throughput.

    >
    > Is that still true for multiple cores?
    >
    > I.e., I would think the following is entirely possible and plausible,
    > but I haven't studied a CPU design in decades. The CPU's designers
    > could execute the process in parallel with one core taking one direction
    > for the branch and another core taking the other branch direction.
    > Once the correct branch is determined, the bad execution path is
    > discarded. If the primary core had the good execution path, it just
    > continues execution. If the alternate core had the good execution path,
    > its internal state could be "pushed" to the primary core. If they used
    > static RAM for the internal state, then it could be "pushed"
    > asynchronously, i.e., between clocks or sub-clocks. It would require
    > reserving a core for the branch path execution, at least temporarily.
    >
    >
    > Rod Pemberton


    I don't think this is how it's done on modern CPUs. They use branch
    prediction, which can be as good as 70% - 80% right, if I recall
    correctly. When the wrong branch is taken, the entire pipeline needs to
    be flushed. That's the reason to avoid conditionals.

    But let's say for argument's sake that you're right. There are a few
    problems with your scheme:

    1) It's slow: because you're tying up another core. So you effectively
    have one core less than previously.

    2) Also, the switching overhead may be large if it has to be done often.

    3) What happens with code containing multiple branches? You can't follow
    them all. What to do?

    4) Let's suppose you have lots of cores so that (3) is not a big issue.
    Still, the number of instructions before the next branching could be less
    than your pipeline's depth. So you're wasting the advantage offered by a
    deep pipeline.

    Thinking about Gavino's question again, it occurs to me that with
    branch prediction, a long loop will almost never flush the pipeline
    (because the prediction is done by keeping statistics of the historical
    branch choices). So this reason alone (pipeline flushing) is not a good
    reason to avoid using loops. Perhaps CM's comments were made when the
    prediction was not as sophisticated as it is now.

    Of course, avoiding conditionals entirely will give a faster solution.
    How much faster is the question. Does anyone have comments/experience with this?

    Cheers,
    Arnold
     
    Arnold Doray, Jan 11, 2012
    #5
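    As a concrete, hypothetical example of avoiding the conditional
    entirely, here is a branchy and a branchless min in C; whether the
    second form is actually faster depends on how predictable the branch
    was to begin with.

        #include <stdint.h>

        /* Ordinary version: contains a conditional branch. */
        int32_t min_branchy(int32_t a, int32_t b)
        {
            if (a < b)
                return a;
            return b;
        }

        /* Branchless version: a mask selects one operand, so there is
           no branch to predict or mispredict. */
        int32_t min_branchless(int32_t a, int32_t b)
        {
            int32_t mask = -(int32_t)(a < b);   /* all ones if a < b, else 0 */
            return (a & mask) | (b & ~mask);
        }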
  6. Re: Can someone explain step by step how one avoid many conditional in forth as described in Moore Fourth essay?

    On 1/11/2012 4:29 AM, Alex McDonald wrote:
    > On Jan 11, 12:21 pm, "Rod Pemberton"<>
    > wrote:
    >> "Arnold Doray"<> wrote in message
    >>
    >> news:jehfpu$9de$...
    >> ...
    >>
    >>>> [Forth code]

    >>
    >>> CPU pipelining is improved by reducing conditionals. Modern CPUs have
    >>> branch prediction, but these aren't always successful, in which case the
    >>> pipeline needs to be flushed, lowering the CPU's throughput.

    >>
    >> Is that still true for multiple cores?
    >>
    >> I.e., I would think the following is entirely possible and plausible, but I
    >> haven't studied a CPU design in decades. The CPU's designers could execute
    >> the process in parallel with one core taking one direction for the branch
    >> and another core taking the other branch direction. Once the correct
    >> branch is determined, the bad execution path is discarded. If the primary
    >> core had the good execution path, it just continues execution. If the
    >> alternate core had the good execution path, its internal state could be
    >> "pushed" to the primary core. If they used static RAM for the internal
    >> state, then it could be "pushed" asynchronously, i.e., between clocks or
    >> sub-clocks. It would require reserving a core for the branch path
    >> execution, at least temporarily.
    >>
    >> Rod Pemberton

    >
    > Search on "speculative execution". It can also be done at the compiler
    > level; that doesn't require processor support. Intel's P6 was their
    > first chip to support it iirc.


    Taking both paths on a branch is currently called "eager execution".

    Branch prediction is one form of "speculative execution".

    P6 did speculative, but not eager. As, for that matter, did P5 (Pentium).

    I don't know of anyone doing full eager execution, although it has been
    studied out the bejeezus. Branch prediction usually beats it. I
    believe an IBM chip did eager ifetch - fetching both sides of a branch -
    but did not actually execute, stalled at decoder or thereabouts.
     
    Andy (Super) Glew, Jan 11, 2012
    #6
  7. Re: Can someone explain step by step how one avoid many conditional in forth as described in Moore Fourth essay?

    In article <-arch.net>,
    Andy (Super) Glew <-arch.net> wrote:
    >I believe an IBM chip did eager ifetch - fetching both sides of a
    >branch - but did not actually execute, stalled at decoder or
    >thereabouts.


    Doing some 'early' loads seems like a good idea, assuming that you have
    some IF units free.

    But that is a *lot* different from the other suggestion, copying the
    machine state to a different core.
     
    Joe keane, Jan 13, 2012
    #7
  8. Re: Can someone explain step by step how one avoid many conditional in forth as described in Moore Fourth essay?

    On 1/13/2012 9:47 AM, Joe keane wrote:
    > In article<-arch.net>,
    > Andy (Super) Glew<-arch.net> wrote:
    >> I believe an IBM chip did eager ifetch - fetching both sides of a
    >> branch - but did not actually execute, stalled at decoder or
    >> thereabouts.

    >
    > Doing some 'early' loads seems like a good idea, assuming that you have
    > some IF units free.
    >
    > But that is a *lot* different from the other suggestion, copying the
    > machine state to a different core.


    Microarchitectures proposed for eager execution:

    * using the same pipeline(s) as ordinary non-eager
    ** e.g. if stalled on the main path, execute on the alternative path
    ** DEE (Disjoint Eager Execution): e.g. if you have gone too far down one path, execute on an
    alternative path that has higher relative probability of success than
    predicting yet another branch on the current path.

    Actually, the DEE technique applies to all forms of eager execution.

    * using eager threads on the same CPU core
    ** although not necessarily the same pipeline(s) - eager threading was
    one of the techniques I wanted to investigate on top of my multicluster
    multithreading (MCMT) substrate -- although in my experience eager
    nearly always loses out to non-eager forms of speculation.

    Note that this is not necessarily the same pipeline resources, since in
    my terminology an MCMT machine like Bulldozer is really a single CPU
    with multiple execution clusters.

    * using eager threads of execution on different CPU cores.

    Problem is that eager threads seldom last long, so it is hard for eager
    threads to overcome the cost of migrating to another core.
     
    Andy (Super) Glew, Jan 13, 2012
    #8
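    A very rough software analogue of the "eager threads on different CPU
    cores" idea, sketched in C with POSIX threads (hypothetical; the real
    proposals work per branch inside the pipeline, at far finer granularity
    than this). The thread create/join overhead here dwarfs the work on
    either path, which is essentially the point above: eager paths rarely
    live long enough to pay for moving work to another core.

        /* Both "paths" of an unpredictable decision are started before the
           condition is examined; one result is kept, the other discarded. */
        #include <pthread.h>
        #include <stdio.h>

        static long path_taken(void)     { return 1; /* stand-in for real work */ }
        static long path_not_taken(void) { return 2; /* stand-in for real work */ }

        static void *run_taken(void *out)
        {
            *(long *)out = path_taken();
            return NULL;
        }

        static void *run_not_taken(void *out)
        {
            *(long *)out = path_not_taken();
            return NULL;
        }

        long eager_select(int condition)
        {
            long r_taken, r_not_taken;
            pthread_t t1, t2;

            /* eagerly start both sides before looking at the condition */
            pthread_create(&t1, NULL, run_taken, &r_taken);
            pthread_create(&t2, NULL, run_not_taken, &r_not_taken);
            pthread_join(t1, NULL);
            pthread_join(t2, NULL);

            /* the "branch" resolves: keep one result, discard the other */
            return condition ? r_taken : r_not_taken;
        }

        int main(void)
        {
            printf("%ld\n", eager_select(1));
            return 0;
        }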
  9. Re: Can someone explain step by step how one avoid many conditional in forth as described in Moore Fourth essay?

    In article <-arch.net>,
    Andy (Super) Glew <-arch.net> wrote:
    >Problem is that eager threads seldom last long, so it is hard for eager
    >threads to overcome the cost of migrating to another core.


    Essentially creating a new thread, for every -branch-, falls into the
    'that don't sound right'.
     
    Joe keane, Jan 14, 2012
    #9
  10. Re: Can someone explain step by step how one avoid many conditional in forth as described in Moore Fourth essay?

    Joe keane wrote:
    > In article<-arch.net>,
    > Andy (Super) Glew<-arch.net> wrote:
    >> Problem is that eager threads seldom last long, so it is hard for eager
    >> threads to overcome the cost of migrating to another core.

    >
    > Essentially creating a new thread, for every -branch-, falls into the
    > 'that don't sound right'.


    I've thought about this in the context "execute alternating instructions
    from both sides of the branch, adding implicit predicates to them", so
    that when the branch retires the predicates become known and half the
    instructions are cancelled.

    The problem is of course that you need to fetch from two paths at the
    same time, i.e. you need a pair of virtualized IP registers as well.

    The first criterion must be that you only ever do this for code that has
    been executed a number of times, and where the branch history has turned
    out to be very unpredictable.

    Another idea could be to start with the predicted branch, then switch to
    the other on the first stall/cache miss or new branch.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
     
    Terje Mathisen, Jan 14, 2012
    #10
  11. Re: Can someone explain step by step how one avoid many conditional in forth as described in Moore Fourth essay?

    On Jan 14, 4:23 am, Terje Mathisen <"terje.mathisen at tmsw.no">
    wrote:
    > Joe keane wrote:
    > > In article<-arch.net>,
    > > Andy (Super) Glew<-arch.net>  wrote:
    > >> Problem is that eager threads seldom last long, so it is hard for eager
    > >> threads to overcome the cost of migrating to another core.

    >
    > > Essentially creating a new thread, for every -branch-, falls into the
    > > 'that don't sound right'.

    >
    > I've thought about this in the context "execute alternating instructions
    > from both sides of the branch, adding implicit predicates to them", so
    > that when the branch retires the predicates become known and half the
    > instructions are cancelled.


    This reminds me of a (not so useful) idea I posted here
    21 Oct 2010, "Move on 'commit' predication?" Message-ID:
    <>
    http://groups.google.com/group/comp.arch/browse_thread/thread/f9af23c9e400a263/e4b5e80e5fd36288

    It was basically just a (wrong-headed?) microarchitecture
    for supporting dynamic predication (of hammock branches)
    where one path was less likely but uncertain enough to
    justify predication.

    (One of the issues of eager execution is recognizing joining.)
     
    Paul A. Clayton, Jan 14, 2012
    #11
  12. Re: Can someone explain step by step how one avoid many conditional in forth as described in Moore Fourth essay?

    On 1/13/2012 8:16 PM, Joe keane wrote:
    > In article<-arch.net>,
    > Andy (Super) Glew<-arch.net> wrote:
    >> Problem is that eager threads seldom last long, so it is hard for eager
    >> threads to overcome the cost of migrating to another core.

    >
    > Essentially creating a new thread, for every -branch-, falls into the
    > 'that don't sound right'.


    Nobody would seriously propose a new thread for every branch.

    You would only consider eager threading for flakey branches, where you
    don't know which way they are going to go. And won't know for a while.

    Myself, I think non-threading approaches to eager execution are better.
    E.g. the discussion of short forward branches - emit code that
    computes both sides of such a hammock, with operations like Phi nodes to
    bring it back to a single dataflow.
     
    Andy (Super) Glew, Jan 15, 2012
    #12
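    A minimal C sketch of the hammock transformation described above (the
    example computations are hypothetical): both sides of a short if/else
    are computed unconditionally and merged with a final select, the
    software analogue of a Phi node. Compilers typically lower that select
    to a conditional move rather than a branch.

        /* Branchy form of a short "hammock". */
        int hammock_branchy(int x, int c)
        {
            int y;
            if (c)
                y = x + 1;   /* then-side */
            else
                y = x * 2;   /* else-side */
            return y;
        }

        /* Eager form: both sides computed, one value selected at the end. */
        int hammock_eager(int x, int c)
        {
            int then_side = x + 1;
            int else_side = x * 2;
            return c ? then_side : else_side;   /* phi-like merge */
        }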
