
Macro-Op fusion does not work in 64-bit mode

Discussion in 'Intel' started by YKhan, Aug 1, 2006.

  1. YKhan

    YKhan Guest

    "The thing is though, while MOF may be touted as the best thing since
    sliced bread, it does not cause many performance problems when it is
    off. It appears that the bottleneck in the CPU is not in that aspect of
    the pipeline, so its loss has little speed impact. More on this when
    the testing is complete."
    http://www.theinquirer.net/default.aspx?article=33347

    Macro-op Fusion was one of the big hype items of the
    Conroe/Merom/Woodcrest. This feature is supposed to be one of the
    things giving Intel its edge over AMD in the performance wars. Now it
    turns out that it doesn't even work in 64-bit mode. But apparently it's
    no big deal. Most of us have already figured out that the real secret
    behind CMW is its big L2 cache, but Intel downplayed that. So Intel
    can't have it both ways: either MOF is important, and Intel will have
    to explain why it isn't available in 64-bit mode and why CMW is
    crippled in that mode; or MOF isn't important, and Intel has to admit
    that it's all due to the cache.

    Yousuf Khan
     
    YKhan, Aug 1, 2006
    #1

  2. Is it really just the cache and nothing else? :p
     
    The little lost angel, Aug 1, 2006
    #2

  3. Tony Hill

    Tony Hill Guest

    Actually I've been rather adamant that there are a LOT of factors
    affecting performance in the Core architecture. Sure, the extra
    cache helps. Faster bus speed helps too, and more pipelines, better
    decoders, an excellent branch predictor, improved TLBs and hey, even
    Macro-Op Fusion, just to name a few. Take away any one of these and
    you are going to lose some performance. Going from 4MB to 2MB of
    cache costs about 3.5% performance (see:
    http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2795&p=4 ), while
    1MB of L2 would probably drop performance further. Substantial, yes,
    but not nearly enough to account for the improvements over either the
    Athlon64 X2 or the Core Duo (Yonah) chips before it.
    Or they could just tell the truth: Macro-Op Fusion is just one of many
    features that help performance. It's also supposed to reduce power
    consumption slightly. All in all, it's damn near impossible to predict
    just how much the loss of this one feature will really change things,
    though, since there are many other variables that come into play here.
     
    Tony Hill, Aug 1, 2006
    #3
  4. Mark

    Mark Guest

    If you actually looked at the benchmarks, you would realize that the
    improved performance cannot be attributed to the cache alone.
     
    Mark, Aug 1, 2006
    #4
  5. Yousuf Khan

    Yousuf Khan Guest

    Well, it might also be the predictive algorithms for populating the
    cache, but that's really part of the cache.

    Yousuf Khan
     
    Yousuf Khan, Aug 6, 2006
    #5
  6. Yousuf Khan

    Yousuf Khan Guest

    The cache is 4 times bigger than anything AMD has. What else would it
    be? We've already shown it's not macro-op fusion.

    Yousuf Khan
     
    Yousuf Khan, Aug 6, 2006
    #6
  7. Seraphim

    Seraphim Guest

    What about other things like the out-of-order load/store? That's memory
    and not cache. It seems that everything just adds a small %, thus adding
    up, while individually the large cache or whatever does not appear to
    be the "key" component.
     
    Seraphim, Aug 6, 2006
    #7
  8. The out-of-order load/store *is* predictive, in particular the
    disambiguation, and was said to include speculative components, without
    further elucidation from Intel. The large cache is an important part of such
    a strategy to avoid/minimize negative effects. It's quite rare for
    microarchitecture tweaks like op-fusion, or additional pipeline paths, to
    yield benefits which are consistently measurable.

    I *do* wish that the benchmarkers would quit quoting "latency" performance
    using a program which is now clearly insufficient for the job.
     
    George Macdonald, Aug 6, 2006
    #8
  9. krw

    krw Guest

    Predictive algorithms are part of the load/store or fetch units,
    of which the dcache and icache are part, but I wouldn't say any
    prefetching was part of the cache, per se. Caches are pretty dumb.

    Sorta like saying the multiply algorithm is part of the register
    file...
     
    krw, Aug 6, 2006
    #9
  10. How much impact would something like a wider execution path make? This is
    coming from someone who is more of a layman than anything else when it comes
    to the specifics of how CPUs actually perform their duties, so I'm asking
    out of curiosity. Having read an analysis on the AnandTech website, one
    of the key architectural changes they point out is how much wider the Core 2
    is compared to a PIII/P4/Athlon64. Core 2, for instance, is the only core
    among those that can execute 128-bit SSE instructions in a single cycle. Is
    this the type of thing that might add up to create a real impact?

    Carlo
     
    Carlo Razzeto, Aug 6, 2006
    #10
  11. Yousuf Khan

    Yousuf Khan Guest

    I'm sure it helps during SSE instructions. Can't see it being a big part
    of the equation though, just like SSE itself isn't a big part of programs.

    Yousuf Khan
     
    Yousuf Khan, Aug 6, 2006
    #11
  12. Tony Hill

    Tony Hill Guest

    AMD has chips with 2MB of cache (2 x 1MB) and so does Intel. Intel
    chips are MUCH faster, clock for clock, when compared with equal
    quantities of cache.
    How about the fact that Intel has 4 instruction decoders to AMD's 3,
    an extra LOAD/STORE unit, 3 fully pipelined SSE units vs. K8's 2
    partially pipelined ones, more and better branch predictors, much
    larger TLBs, a larger OoO reorder buffer, a more advanced scheduler...
    to name a few. And that's entirely separate from the better data
    prefetching and greater cache bandwidth that, as you mentioned in
    another message, are all related to cache.

    Besides, we don't really know how much macro-op fusion is really
    helping, since we haven't seen any apples-to-apples comparison. 32-bit
    with macro-op fusion vs. 64-bit without it doesn't really tell us much,
    even if we compare relative to AMD's 32-bit vs. 64-bit numbers. Intel might have
    just done a better implementation of 64-bit x86 (AMD's K8 does have a
    compromise or two in 64-bit mode as well) and that made up for the
    loss in performance from Macro-op Fusion.

    Long story short, there is a LOT more to the Core architecture than
    just cache. Other than the integrated memory controller, Core is a
    more advanced chip start to finish when compared to AMD's K8.
    Fortunately for AMD, most of these advantages are incremental in
    nature and their more modular K8L design could theoretically allow
    them to phase such features into future processors.
     
    Tony Hill, Aug 9, 2006
    #12
  13. I think what Yousuf is getting at is that in a single task benchmark
    situation, you have 4MB of L2 cache for that single task, multithreaded or
    not.
    Looking back, it's not often that inner core microarchitecture tweaks have
    yielded that much performance benefit. To me there are two clues here:

    1) The fact that there are benchmarks where C2D shows near-zero benefit vs.
    AMD64 points to the memory/cache subsystem, and how it's manipulated, as the
    important provider of performance in the other benchmarks where C2D wins
    handily. In particular, when disambiguation "hits", it hits *big*; when it
    "misses", the penalty drags performance back down. When it "hits", it
    depends heavily on the large cache and associativity to avoid thrashing.

    2) The ridiculous C2D "latency" measurements being published, all using the
    same chipset where a P4 is a latency dog, are an indication that
    speculation on stride size and Load/Store re-ordering make a *huge*
    contribution to performance. Of course what this really means is that the
    current latency benchmark is obsolete but it makes no sense that a system
    with FSB, where the real round-trip latency is illustrated by the P4
    measurements, can beat a system with an on-board memory controller. Again,
    without the large L2 cache, the strategy would fall down.
    "Incremental" is correct. ;-)
     
    George Macdonald, Aug 10, 2006
    #13
