
Re: An idea how to speed up computer programs and avoid waiting. ("event driven memory system")

Discussion in 'Nvidia' started by Skybuck Flying, Aug 1, 2011.

  1. "Skybuck Flying" wrote in message news:...

    Also, in case you have trouble understanding the Pascal code, I shall
    try to write down some simple and easy to understand pseudo code and some
    explanations/comments. Let's see:

    What the code does and assumes is the following:

    General description (this information is not that important, but it might
    give you the general idea):

    "Memory" is a pointer towards a big linear/sequentially memory block of
    bytes/integers. The memory block contains many integers. These integers are
    "virtually" subdivided into seperate blocks.

    Each block contains 8000 integers. This is indicated by ElementCount, so
    ElementCount equals 8000.

    There are 4000 blocks. This is indicated by BlockCount, so BlockCount equals
    4000.

    So in total there are 4000 blocks * 8000 elements * 4 bytes = 128,000,000
    bytes of memory.

    Each block is like a chain of indexes.

    Each index points towards the next index within its block.

    The indexes are thrown around inside the block during initialisation to
    create a random access pattern.

    This is also known as the "pointer chasing problem" so it could also be
    called the "index chasing problem".

    Index -> Index -> Index -> Index -> Index and so forth.

    The first index must be retrieved from memory, only then does one know where
    the next index is located and so on.

    There is an ElementIndex variable which is initialized to zero.

    So each ElementIndex simply starts at element 0 of block 0.

    There could be 8000 ElementIndexes in parallel all starting at element 0 of
    block X.

    Or there could simply be one ElementIndex, processing each block in turn.

    In version 0.03 "BlockBase" was introduced.

    BlockBase is an index which points towards the first element of a block.

    So it's a stored value that avoids doing the multiplication all the time.

    In the code below 3 blocks in parallel are attempted.

    Now for some further pseudo code:

    // routine

    // variables:

    // each block is processed repeatedly loop count times.
    // loop index indicates the current loop/round
    LoopIndex

    BlockIndexA // indicates block a number
    BlockIndexB // indicates block b number
    BlockIndexC // indicates block c number

    ElementIndexA // indicates block a element index
    ElementIndexB // indicates block b element index
    ElementIndexC // indicates block c element index

    ElementCount // number of elements per block
    BlockCount // number of blocks
    LoopCount // number of loops/chases per block

    BlockBaseA // starting index of block a
    BlockBaseB // starting index of block b
    BlockBaseC // starting index of block c

    Memory // contains all integers of all blocks of all elements. So it's an
    array/block of elements/integers.

    RoutineBegin

    ElementCount = 8000
    BlockCount = 4000
    LoopCount = 80000

    BlockIndexA = 0
    BlockIndexB = 1
    BlockIndexC = 2

    // loop over all blocks, process 3 blocks per loop iteration.
    // slight correction:
    // wrong: "BlockIndex goes from 0 to 7999 divided over 3 indexes"
    // somewhat correct: "BlockIndex goes from 0 to 3999 divided over 3 indexes"
    // so BlockIndexA is 0, 3, 6, 9, 12, etc
    // so BlockIndexB is 1, 4, 7, 10, 13, etc
    // so BlockIndexC is 2, 5, 8, 11, 14, etc
    FirstLoopBegin

    // calculate the starting index of each block.
    // formula is: Base/Index/Offset = block number * number of elements per block.
    BlockBaseA = BlockIndexA * vElementCount;
    BlockBaseB = BlockIndexB * vElementCount;
    BlockBaseC = BlockIndexC * vElementCount;

    // initialise each element index to the first index of the block.
    ElementIndexA = 0
    ElementIndexB = 0
    ElementIndexC = 0

    // loop 80000 times through the block arrays/elements
    SecondLoopBegin "LoopIndex goes from 0 to 79999"

    // Seek into memory at location BlockBase + ElementIndex
    // do this for A, B and C:
    // retrieve the new element index via the old element index
    ElementIndexA = mMemory[ vBlockBaseA + vElementIndexA ]
    ElementIndexB = mMemory[ vBlockBaseB + vElementIndexB ]
    ElementIndexC = mMemory[ vBlockBaseC + vElementIndexC ]

    SecondLoopEnd

    // store some bullshit in block result, this step could be skipped.
    // but let's not so compiler doesn't optimize anything away
    // perhaps block result should be printed just in case.
    // the last retrieved index is stored into it.
    // do this for A,B,C

    BlockResult[ BlockIndexA ] = vElementIndexA
    BlockResult[ BlockIndexB ] = vElementIndexB
    BlockResult[ BlockIndexC ] = vElementIndexC

    // once the 3 blocks have been processed go to the next 3 blocks
    BlockIndexA = vBlockIndexA + 3
    BlockIndexB = vBlockIndexB + 3
    BlockIndexC = vBlockIndexC + 3

    FirstLoopEnd
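
    For illustration, here is a minimal stand-alone Pascal sketch of the pseudo
    code above. The chain initialisation and the handling of the last leftover
    block are simplified assumptions, and the names are made up; the real unit
    keeps mMemory and mBlockResult as fields and shuffles the indexes randomly:

    program ChaseSketch;

    {$APPTYPE CONSOLE}

    const
      ElementCount = 8000;
      BlockCount   = 4000;
      LoopCount    = 80000;

    var
      Memory      : array of Integer; // BlockCount * ElementCount elements
      BlockResult : array of Integer; // one result per block
      BlockIndexA, BlockIndexB, BlockIndexC       : Integer;
      BlockBaseA, BlockBaseB, BlockBaseC          : Integer;
      ElementIndexA, ElementIndexB, ElementIndexC : Integer;
      LoopIndex, I : Integer;

    begin
      SetLength( Memory, BlockCount * ElementCount );
      SetLength( BlockResult, BlockCount );

      // trivial chain: each element points to the next element of its block
      // (the real test shuffles these indexes to get a random access pattern)
      for I := 0 to (BlockCount * ElementCount) - 1 do
        Memory[ I ] := (I + 1) mod ElementCount;

      BlockIndexA := 0;
      BlockIndexB := 1;
      BlockIndexC := 2;

      // process 3 blocks per iteration; the one leftover block (4000 is not
      // divisible by 3) is ignored here, the real code has a remainder loop
      while BlockIndexC <= BlockCount - 1 do
      begin
        BlockBaseA := BlockIndexA * ElementCount;
        BlockBaseB := BlockIndexB * ElementCount;
        BlockBaseC := BlockIndexC * ElementCount;

        ElementIndexA := 0;
        ElementIndexB := 0;
        ElementIndexC := 0;

        // three dependent load chains per iteration ("3 blocks in parallel")
        for LoopIndex := 0 to LoopCount - 1 do
        begin
          ElementIndexA := Memory[ BlockBaseA + ElementIndexA ];
          ElementIndexB := Memory[ BlockBaseB + ElementIndexB ];
          ElementIndexC := Memory[ BlockBaseC + ElementIndexC ];
        end;

        // keep the last retrieved indexes so the compiler cannot remove the loop
        BlockResult[ BlockIndexA ] := ElementIndexA;
        BlockResult[ BlockIndexB ] := ElementIndexB;
        BlockResult[ BlockIndexC ] := ElementIndexC;

        Inc( BlockIndexA, 3 );
        Inc( BlockIndexB, 3 );
        Inc( BlockIndexC, 3 );
      end;

      WriteLn( BlockResult[ 0 ], ' ', BlockResult[ 1 ] );
    end.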

    Perhaps this clarifies it a little bit.

    Let me know if you have any further questions ;)

    Bye,
    Skybuck.
     

  2. Re: An idea how to speed up computer programs and avoid waiting. ("event driven memory system")

    Skybuck Flying wrote:


    > "Skybuck Flying" wrote in message news:...
    >
    > Also in case you have troubles understanding the pascal code then I shall
    > try and write down some simply and easy to understand pseudo code and some
    > explanations/comments, let's see:
    >
    > What the code does and assumes is the following:
    >
    > general description, this information not that important but might give you
    > the general idea:
    >
    > "Memory" is a pointer towards a big linear/sequentially memory block of
    > bytes/integers. The memory block contains many integers. These integers are
    > "virtually" subdivided into seperate blocks.
    >
    > Each block contains 8000 integers. This is indicated by ElementCount, so
    > ElementCount equals 8000.
    >
    > There are 4000 blocks. This is indicated by BlockCount, so BlockCount equals
    > 4000.
    >
    > So in total there are 4000 blocks * 8000 elements * 4 bytes = xxxx bytes of
    > memory.
    >
    > Each block is like a chain of indexes.
    >
    > Each index point towards the next index within it's block.
    >
    > The indexes are thrown around inside the block during initialisation to
    > create a random access pattern.
    >
    > This is also known as the "pointer chasing problem" so it could also be
    > called the "index chasing problem".
    >
    > Index -> Index -> Index -> Index -> Index and so forth.
    >
    > The first index must be retrieved from memory, only then does one know where
    > the next index is located and so on.
    >
    > There is an ElementIndex variable which is initialized to zero.
    >
    > So each ElementIndex simply starts at element 0 of block 0.
    >
    > There could be 8000 ElementIndexes in parallel all starting at element 0 of
    > block X.
    >
    > Or there could simply be on ElementIndex and process each block in turn.
    >
    > in version 0.03 "BlockBase" was introduced.
    >
    > BlockBase as an index which points towards the first element of a block.
    >
    > So it's a storage to prevent multiplications all the time.



    4000 is 0x0FA0, 8000 is 0x1F40. If we expand these sizes
    to 4096 and 8192, we get ranges 0x0000...0x0FFF(0x1FFF).
    This allows simple bit masks to sort out invalid indices
    and simplifies address calculations markedly.

    If we see the entire thing as an array with 4096 pages à
    8192 elements, we just have to create a large array with
    a size of 4096 * 8192 * 4 = 134,217,728 byte.
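
    In Pascal terms the masking and the shift would look roughly like the
    sketch below (sketch only; all constant and variable names are made up
    for illustration):

    program MaskSketch;
    {$APPTYPE CONSOLE}
    const
      cElementCount = 8192;              // 2^13 elements per block
      cElementMask  = cElementCount - 1; // $1FFF
      cElementShift = 13;                // log2(8192)
      cBlockCount   = 4096;              // 2^12 blocks
      cBlockMask    = cBlockCount - 1;   // $0FFF
    var
      BlockIndex, BlockBase, ElementIndex : Integer;
    begin
      BlockIndex   := 4100 and cBlockMask;          // out-of-range 4100 wraps to 4
      BlockBase    := BlockIndex shl cElementShift; // 4 * 8192 without a MUL
      ElementIndex := 9000 and cElementMask;        // out-of-range 9000 wraps to 808
      WriteLn( BlockIndex, ' ', BlockBase, ' ', ElementIndex );
    end.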


    > In the code below 3 blocks in parallel are attempted.



    A very bad idea. L1 cache is 65,536 byte on recent CPUs,
    but only 32,768 byte on future processors, e.g. Zambezi.
    Three blocks occupy 3 * 32,768 = 98,304 byte, so 32,768
    (or 65,536!) byte do not fit into L1. The result: memory
    access is slowed down from 3 (L1) to 6 (L2), 12 (L3) or
    27 (RAM) clocks per access, depending on where the cache
    lines have to be fetched from or written to.

    If you want to process multiple blocks in parallel, it is
    better to check how many physical processors exist on a
    user's machine, then create that many threads (minus one
    for the OS), each running on one physical processor.
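
    A rough Windows/Delphi sketch of that suggestion follows. Note that
    GetSystemInfo reports logical processors, not physical ones, and the
    partitioning below is only an illustration:

    program WorkerCountSketch;
    {$APPTYPE CONSOLE}
    uses
      Windows;
    var
      SysInfo         : TSystemInfo;
      WorkerCount     : Integer;
      BlocksPerWorker : Integer;
    begin
      GetSystemInfo( SysInfo );

      // one worker thread per processor, minus one left for the OS
      WorkerCount := Integer( SysInfo.dwNumberOfProcessors ) - 1;
      if WorkerCount < 1 then
        WorkerCount := 1;

      // each worker would then process a contiguous range of whole blocks
      BlocksPerWorker := 4000 div WorkerCount;

      WriteLn( WorkerCount, ' worker(s), about ', BlocksPerWorker, ' blocks each' );
    end.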


    > Now for some further pseudo code:
    >
    > // routine
    >
    > // variables:
    >
    > // each block is processed repeatedly loop count times.
    > // loop index indicates the current loop/round
    > LoopIndex
    >
    > BlockIndexA // indicates block a number
    > BlockIndexB // indicates block b number
    > BlockIndexC // indicates block c number
    >
    > ElementIndexA // indicates block a element index
    > ElementIndexB // indicates block b element index
    > ElementIndexC // indicates block c element index
    >
    > ElementCount // number of elements per block
    > BlockCount // number of blocks
    > LoopCount // number of loops/chases per block
    >
    > BlockBaseA // starting index of block a
    > BlockBaseB // starting index of block b
    > BlockBaseC // starting index of block c
    >
    > Memory // contains all integers of all blocks of all elements. So it's an
    > array/block of elements/integers.
    >
    > RoutineBegin
    >
    > ElementCount = 8000
    > BlockCount = 4000
    > LoopCount = 80000
    >
    > BlockIndexA = 0
    > BlockIndexB = 1
    > BlockIndexC = 2
    >
    > // loop over all blocks, process 3 blocks per loop iteration.
    > // slight correction:
    > // wrong: "BlockIndex goes from 0 to 7999 divided over 3 indexes"
    > // somewhat correct: "BlockIndex goes from 0 to 3999 divided over 3 indexes"
    > // so BlockIndexA is 0, 3, 6, 9, 12, etc
    > // so BlockIndexB is 1, 4, 7, 10, 13, etc
    > // so BlockIndexC is 2, 5, 8, 11, 14, etc
    > FirstLoopBegin
    >
    > // calculate the starting index of each block.
    > // formula is: Base/Index/Offset = block number * number of elements per
    > block.
    > BlockBaseA = BlockIndexA * vElementCount;
    > BlockBaseB = BlockIndexB * vElementCount;
    > BlockBaseC = BlockIndexC * vElementCount;



    With 8,192 elements per block, the multiplication can be
    replaced by a shift:

    BlockBaseN = BlockIndexN << 15

    reducing the VectorPath MUL (blocking all pipes for five
    clocks) to a DirectPath SHL (one clock, one pipe).
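
    In the Pascal code BlockBase is an element index (it is added to
    ElementIndex before indexing mMemory), so there the shift count would be
    13; shifting by 15 gives the corresponding byte offset, because 8,192
    elements * 4 bytes = 32,768 = 2^15. A sketch with made-up helper names:

    // element-index form, as used by mMemory[ vBlockBase + vElementIndex ]
    function BlockBaseInElements( const ABlockIndex: Integer ): Integer;
    begin
      Result := ABlockIndex shl 13;   // 8192 elements per block = 2^13
    end;

    // byte-offset form, as used in assembly address calculations
    function BlockBaseInBytes( const ABlockIndex: Integer ): Integer;
    begin
      Result := ABlockIndex shl 15;   // 8192 * 4 bytes = 32768 = 2^15
    end;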


    > // initialise each element index to the first index of the block.
    > ElementIndexA = 0
    > ElementIndexB = 0
    > ElementIndexC = 0
    >
    > // loop 80000 times through the block arrays/elements
    > SecondLoopBegin "LoopIndex goes from 0 to 79999"
    >
    > // Seek into memory at location BlockBase + ElementIndex
    > // do this for A, B and C:
    > // retrieve the new element index via the old element index
    > ElementIndexA = mMemory[ vBlockBaseA + vElementIndexA ]
    > ElementIndexB = mMemory[ vBlockBaseB + vElementIndexB ]
    > ElementIndexC = mMemory[ vBlockBaseC + vElementIndexC ]
    >
    > SecondLoopEnd



    I still do not understand what this is good for, but you
    should consider processing one block after the other. If
    you are eager to introduce something like "parallel pro-
    cessing", divide your entire array into as many parts as
    there are processors (hopefully a multiple of two), then
    let those threads process a contiguous part of memory in
    parallel.

    If you prefer a single threaded solution, it's better to
    process one block after the other, because entire blocks
    fit into L1 completely. Any "touched" element is present
    in L1 after the first access, three registers are enough
    to manage the entire loop, faster algorithms can be used
    to process multiple elements in parallel (all pipes busy
    most of the time), and so on.
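
    In Pascal terms the single block variant would look roughly like the
    sketch below (names and signature are assumptions for illustration, not
    the code of the test unit):

    // process one block after the other; AMemory holds
    // ABlockCount * AElementCount chained indexes, ABlockResult receives
    // the last retrieved index of each block
    procedure ChaseBlocksOneByOne( const AMemory : array of Integer;
                                   var ABlockResult : array of Integer;
                                   ABlockCount, AElementCount, ALoopCount : Integer );
    var
      BlockIndex, LoopIndex, BlockBase, ElementIndex : Integer;
    begin
      for BlockIndex := 0 to ABlockCount - 1 do
      begin
        BlockBase    := BlockIndex * AElementCount;
        ElementIndex := 0;

        // a single dependent load chain; one whole 32 KB block stays in L1
        for LoopIndex := 0 to ALoopCount - 1 do
          ElementIndex := AMemory[ BlockBase + ElementIndex ];

        ABlockResult[ BlockIndex ] := ElementIndex;
      end;
    end;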


    > // store some bullshit in block result, this step could be skipped.
    > // but let's not so compiler doesn't optimize anything away
    > // perhaps block result should be printed just in case.
    > // the last retrieved index is store into it.
    > // do this for A,B,C
    >
    > BlockResult[ BlockIndexA ] = vElementIndexA
    > BlockResult[ BlockIndexB ] = vElementIndexB
    > BlockResult[ BlockIndexC ] = vElementIndexC
    >
    > // once the 3 blocks have been processed go to the next 3 blocks
    > BlockIndexA = vBlockIndexA + 3
    > BlockIndexB = vBlockIndexB + 3
    > BlockIndexC = vBlockIndexC + 3
    >
    > FirstLoopEnd
    >
    > Perhaps this clearifies it a little bit.
    >
    > Let me know if you have any further questions ;)



    As shown, your current approach is quite slow - it might
    be worth a thought to change your current design towards
    something making use of built-in acceleration mechanisms
    like caching and multiple execution pipelines.

    Using powers of two (rather than ten) simplifies calcu-
    lations markedly. Hence, the first thing to change would
    be the size of your array - 4,096 blocks and 8,192 ele-
    ments don't occupy that much additional memory; 6,217,728
    byte are peanuts compared to the entire array. In return,
    you save a lot of time wasted with calculating addresses
    and validating indices (AND automatically keeps your in-
    dices within the valid range).

    The slightly changed design used addressing modes like

    mov REG,block_offset[block_base + index * 4] or

    mov block_offset[block_base + index * 4],REG

    allowing us to step through the entire array with some
    very simple (and fast) additions and shifts. For example,
    the given function required one register for block_base
    plus block_offset. This had to be updated only once per
    outer loop, while the inner loop just fetches indexed
    elements and writes them back. With address calculations
    reduced to a single register, 12 registers are left for
    the real work (two registers are required as loop
    counters).


    Greetings from Augsburg

    Bernhard Schornak
     

  3. "
    4000 is 0x0FA0, 8000 is 0x1F40.
    "

    OK, this could be handy to know, in order to potentially spot these
    constants/parameters in your code.

    "
    If we expand these sizes
    to 4096 and 8192, we get ranges 0x0000...0x0FFF(0x1FFF).
    This allows simple bit masks to sort out invalid indices
    and simplifies address calculations markably.
    "

    Why would you want to spot invalid indices? At least my Pascal code seems
    to be OK already. Perhaps this is to help you debug your assembler code?

    Address calculations should be possible for any setting; this allows
    flexibility in testing different scenarios.

    "
    If we see the entire thing as an array with 4096 pages à
    8192 elements, we just have to create a large array with
    a size of 4096 * 8192 * 4 = 134,217,728 byte.
    "

    Irrelevant; any size can be allocated easily, at least in a high-level
    language. Perhaps this helps you in assembler?

    > In the code below 3 blocks in parallel are attempted.


    "
    A very bad idea.
    "

    This was your idea after all ;) :)

    It might be bad for current settings, but maybe there is some value in it
    for other settings.

    "
    L1 cache is 65,536 byte on recent CPUs,
    but only 32,768 byte on future processors. e.g. Zambezi.
    Three blocks occupy 3 * 32,768 = 98,304 byte, so: 32,768
    (or 65,536!) byte do not fit into L1. The result: Memory
    access is slowed down from 3 (L1) to 6 (L2), 12 (L3) or
    27 (RAM) clocks per access, depending on where the cache
    lines have to be fetched from or written to.

    If you want to process multiple blocks in parallel, it's
    better to check, how many physical processors exist on a
    user's machine, then create that amount threads (minus 1
    for the OS) running on one physical processor, each.
    "

    So far the examples have all executed well inside the cache because of the
    low settings, which almost fit in the L1 cache.

    However it would also be interesting to change the settings to, for example,
    a block size of 1920x1200 elements to mimic a single frame of a video codec,
    which might want to do some random memory access.

    So then the L1 cache will probably be woefully short, and then you can still
    try and see if your "pairing/parallel" processing of frames might give
    higher throughput.

    However I already do see a little problem with this idea. Usually video
    codec frames are processed sequentially, so it might be difficult or
    impossible to process multiple frames at the same time.

    However a frame does consist of r, g, b and sometimes maybe even an alpha
    channel.

    So there could be 3 to 4 channels, so your idea of processing 3 blocks at
    the same time might still have some merit, when each color channel is
    considered a block.

    Another interesting setting could be to reduce the settings to make it fit
    entirely into the L1 cache, to see if your pairing/multiple-loading code
    also becomes faster when it does fit in the cache.

    Writing multithreading code for "random access memory" could become tricky,
    especially if writes are involved as well: processors might conflict with
    each other.

    As long as processors operate within their own blocks and can never access
    each other's blocks, it should be OK.
    Some care should still be taken at the borders of the blocks; inserting some
    padding there could help prevent accidental/unlucky simultaneous access. So
    it does become a little bit more tricky to write code like that... because
    of the padding, allocations have to be done a little bit differently, as
    well as the formulas. I can see it becoming a bit messy, though some
    variable names could help solve the mess, as follows:

    Allocate( mElementCount + mPaddingCount );

    ^ Then the padding count could be used in formulas/code where necessary, or
    left out where not necessary.

    Like mod mElementCount to set up the indexes correctly.
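
    A rough stand-alone sketch of that idea (mElementCount and mPaddingCount
    as above; the remaining names and numbers are assumptions for
    illustration):

    program PaddingSketch;
    {$APPTYPE CONSOLE}
    var
      mElementCount, mPaddingCount, mBlockCount, mStride : Integer;
      mMemory : array of Integer;
    begin
      mElementCount := 8000;  // logical elements that are actually chased
      mPaddingCount := 192;   // pads each block up to 8192 = 2^13 elements
      mBlockCount   := 4000;

      // allocate per block: logical elements plus padding
      mStride := mElementCount + mPaddingCount;
      SetLength( mMemory, mBlockCount * mStride );

      // address calculations use the padded stride (a shift when it is 2^n),
      // while the chain indexes keep wrapping on the logical count only:
      //   BlockBase := BlockIndex * mStride;          // or: BlockIndex shl 13
      //   NextIndex := SomeValue mod mElementCount;   // padded tail stays unused
      WriteLn( 'allocated ', Length( mMemory ) * SizeOf( Integer ), ' bytes' );
    end.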

    "
    With 8,192 elements per block, the multiplication can be
    replaced by a shift:

    BlockBaseN = BlockIndexN << 15

    reducing the VectorPath MUL (blocking all pipes for five
    clocks) to a DirectPath SHL (one clock, one pipe).
    "

    While the explanation of this optimization idea/trick is interesting,
    unfortunately it cannot really be used in the example code.

    The example code needs to remain flexible, to try out different scenarios
    and see if your idea can be useful for something ;)

    > // initialise each element index to the first index of the block.
    > ElementIndexA = 0
    > ElementIndexB = 0
    > ElementIndexC = 0
    >
    > // loop 80000 times through the block arrays/elements
    > SecondLoopBegin "LoopIndex goes from 0 to 79999"
    >
    > // Seek into memory at location BlockBase + ElementIndex
    > // do this for A, B and C:
    > // retrieve the new element index via the old element index
    > ElementIndexA = mMemory[ vBlockBaseA + vElementIndexA ]
    > ElementIndexB = mMemory[ vBlockBaseB + vElementIndexB ]
    > ElementIndexC = mMemory[ vBlockBaseC + vElementIndexC ]
    >
    > SecondLoopEnd


    "
    I still do not understand what this is good for, but you
    should consider to process one block after the other. If
    you are eager to introduce something like "parallel pro-
    cessing", divide your entire array into as much parts as
    there are processors (hopefully a multiple of two), then
    let those threads process a contiguous part of memory in
    parallel.
    "

    The code above represents your "parallel block" idea, where there would be 3
    loads in parallel.

    The mMemory[ ... ] is a load. See one of your earlier postings, where you
    write about that idea.

    You "claimed" the X2 processor can do 3 loads in parallel, more or less.

    The code above tries to take advantage of that.

    "
    If you prefer a single threaded solution, it's better to
    process one block after the other, because entire blocks
    fit into L1 completely.
    "

    Perhaps your parallel loading idea might also be faster in such a situation.

    If not then the opposite situation of not fitting into L1 could be tested as
    well to see if your parallel loading idea gives more performance.

    "
    Any "touched" element is present
    in L1 after the first access, three registers are enough
    to manage the entire loop, faster algorithms can be used
    to process multiple elements in parallel (all pipes busy
    most of the time), and so on,
    "

    ?


    > // store some bullshit in block result, this step could be skipped.
    > // but let's not so compiler doesn't optimize anything away
    > // perhaps block result should be printed just in case.
    > // the last retrieved index is store into it.
    > // do this for A,B,C
    >
    > BlockResult[ BlockIndexA ] = vElementIndexA
    > BlockResult[ BlockIndexB ] = vElementIndexB
    > BlockResult[ BlockIndexC ] = vElementIndexC
    >
    > // once the 3 blocks have been processed go to the next 3 blocks
    > BlockIndexA = vBlockIndexA + 3
    > BlockIndexB = vBlockIndexB + 3
    > BlockIndexC = vBlockIndexC + 3
    >
    > FirstLoopEnd
    >
    > Perhaps this clearifies it a little bit.
    >
    > Let me know if you have any further questions ;)


    "
    As shown, your current approach is quite slow
    "

    Which one?

    The single block version (which should be quite fast) or the three block
    version (which was your idea! ;) :))?

    "
    - it might
    be worth a thought to change your current design towards
    something making use of built-in acceleration mechanisms
    like caching and multiple execution pipelines.
    "

    It already does caching by accident, more or less; multiple execution
    pipelines - that was tried with your idea.

    I think it does do both a bit... I don't know exactly how much it already
    does these things.

    A processor simulator would be necessary to see what exactly is happening.

    But I can tell a little bit from re-arranging the code that it does some of
    these things that you mentioned.

    So perhaps the "three block" version is already optimal but it's still
    slower than the "one block" version, at least for 32 bit, because then it
    fits in the L1 cache; however, different settings should still be tried to
    see which one is faster then.

    Also, your 64 bit assembly code should still be tried; it hasn't been tried
    yet.

    "
    Using multiples of two (rather than ten) simplifies cal-
    culations markably. Hence, the first thing to change was
    the size of your array - 4,096 blocks and 8,192 elements
    don't occupy that much additional memory, 6,217,728 byte
    are peanuts compared to the entire array. In return, you
    save a lot of time wasted with calculating addresses and
    validating indices (AND automatically keeps your indices
    within the valid range).
    "

    As long as loading and writing are not affected by the extra padding bytes,
    and as long as the example executes correctly, the arrays could be padded
    with extra bytes, if that makes address calculations faster without changing
    the nature of the example.

    However, if the example suddenly produced different results then it would be
    a problem.

    "
    The slightly changed design used addressing modes like

    mov REG,block_offset[block_base + index * 4] or

    mov block_offset[block_base + index * 4],REG
    "

    What's the difference here?

    One seems to be reading, the other seems to be writing?

    So apparently you are comparing this code to something else? But to what?

    The shifting code?

    "
    allowing to step through the entire array with some very
    simple (and fast) additions and shifts. For example, the
    given function required one register for block_base plus
    block_offset. This had to be updated only once per outer
    loop, while the inner loop just fetches indexed elements
    and writes them back. Reducing address calculations to a
    single register, 12 registers are left for the real work
    (two registers are required as loop counters).
    "

    Ok, I think I now grasp your concept of using the shifting tricks and why it
    would lead to higher throughput.

    Now that I look at your shifting code again it seems to be applied to the
    block number.

    So at least the blocks could be padded for the elements. So the number of
    actual elements would be a bit higher; some elements would be unused.

    The example should remain at 4000 blocks and 8000 elements "virtually", but
    in reality there could be more... however the "more" should not be processed
    and should only be there to speed up the addressing.

    However it remains to be seen what the effect of this padding could be; for
    single block processing it doesn't seem to be relevant and caching would be
    unaffected - the padded elements would not be cached, except at the very end
    if hit.

    For the multiple blocks idea this could waste a little bit of caching space,
    but it might not be too bad.

    So there is some merit in your idea ;)

    It could be interesting to write some general code to try and always apply
    this padding trick and the shifting trick ;)

    Chances are high that I will try to write this soon and see what happens; I
    expect it to run a bit faster after reading your comments and explanation of
    it ;)

    Bye,
    Skybuck =D
     
  4. I tried the shift left trick you mentioned.

    For the 32 bit version it doesn't really matter and doesn't give significant
    performance increases, at least with current settings.

    For what it's worth here is the assembly output ;):

    So from the looks of it, it isn't worth the trouble, perhaps results might
    be different for 64 bit or other settings.

    // *** Begin of SHL trick, assembly output for 32 bit: ***

    unit_TCPUMemoryTest_version_001.pas.364: begin
    0040FD44 53 push ebx
    0040FD45 56 push esi
    0040FD46 57 push edi
    0040FD47 83C4F4 add esp,-$0c
    unit_TCPUMemoryTest_version_001.pas.365: vElementShift := mElementShift;
    0040FD4A 0FB6501C movzx edx,[eax+$1c]
    0040FD4E 891424 mov [esp],edx
    unit_TCPUMemoryTest_version_001.pas.366: vBlockCount := mBlockCount;
    0040FD51 8B5010 mov edx,[eax+$10]
    unit_TCPUMemoryTest_version_001.pas.367: vLoopCount := mLoopCount;
    0040FD54 8B4814 mov ecx,[eax+$14]
    0040FD57 894C2404 mov [esp+$04],ecx
    unit_TCPUMemoryTest_version_001.pas.370: for vBlockIndex := 0 to vBlockCount-1 do
    0040FD5B 4A dec edx
    0040FD5C 85D2 test edx,edx
    0040FD5E 7232 jb $0040fd92
    0040FD60 42 inc edx
    0040FD61 89542408 mov [esp+$08],edx
    0040FD65 33F6 xor esi,esi
    unit_TCPUMemoryTest_version_001.pas.372: vBlockBase := vBlockIndex shl vElementShift;
    0040FD67 8B0C24 mov ecx,[esp]
    0040FD6A 8BDE mov ebx,esi
    0040FD6C D3E3 shl ebx,cl
    unit_TCPUMemoryTest_version_001.pas.374: vElementIndex := 0;
    0040FD6E 33D2 xor edx,edx
    unit_TCPUMemoryTest_version_001.pas.377: for vLoopIndex := 0 to vLoopCount-1 do
    0040FD70 8B4C2404 mov ecx,[esp+$04]
    0040FD74 49 dec ecx
    0040FD75 85C9 test ecx,ecx
    0040FD77 720C jb $0040fd85
    0040FD79 41 inc ecx
    unit_TCPUMemoryTest_version_001.pas.379: vElementIndex := mMemory[ vBlockBase + vElementIndex ];
    0040FD7A 03D3 add edx,ebx
    0040FD7C 8B7804 mov edi,[eax+$04]
    0040FD7F 8B1497 mov edx,[edi+edx*4]
    unit_TCPUMemoryTest_version_001.pas.377: for vLoopIndex := 0 to vLoopCount-1 do
    0040FD82 49 dec ecx
    0040FD83 75F5 jnz $0040fd7a
    unit_TCPUMemoryTest_version_001.pas.382: mBlockResult[ vBlockIndex ] := vElementIndex;
    0040FD85 8B4808 mov ecx,[eax+$08]
    0040FD88 8914B1 mov [ecx+esi*4],edx
    unit_TCPUMemoryTest_version_001.pas.383: end;
    0040FD8B 46 inc esi
    unit_TCPUMemoryTest_version_001.pas.370: for vBlockIndex := 0 to vBlockCount-1 do
    0040FD8C FF4C2408 dec dword ptr [esp+$08]
    0040FD90 75D5 jnz $0040fd67
    unit_TCPUMemoryTest_version_001.pas.384: end;
    0040FD92 83C40C add esp,$0c
    0040FD95 5F pop edi
    0040FD96 5E pop esi
    0040FD97 5B pop ebx
    0040FD98 C3 ret

    // *** End of SHL trick, assembly output for 32 bit: ***

    Bye,
    Skybuck.
     
  5. Re: An idea how to speed up computer programs and avoid waiting. ("event driven memory system")

    Skybuck Flying wrote:


    >> In the code below 3 blocks in parallel are attempted.

    >
    > "
    > A very bad idea.
    > "
    >
    > This was your idea after all ;) :)



    Not that I knew. If I remember correctly, you posted this
    stuff...


    > It might be bad for current settings, but maybe there is some value in it for other settings.
    >
    > "
    > L1 cache is 65,536 byte on recent CPUs,
    > but only 32,768 byte on future processors. e.g. Zambezi.
    > Three blocks occupy 3 * 32,768 = 98,304 byte, so: 32,768
    > (or 65,536!) byte do not fit into L1. The result: Memory
    > access is slowed down from 3 (L1) to 6 (L2), 12 (L3) or
    > 27 (RAM) clocks per access, depending on where the cache
    > lines have to be fetched from or written to.
    >
    > If you want to process multiple blocks in parallel, it's
    > better to check, how many physical processors exist on a
    > user's machine, then create that amount threads (minus 1
    > for the OS) running on one physical processor, each.
    > "
    >
    > So far the examples have all executed well inside the cache because of the low settings
    > which almost fit in L1 cache.
    >
    > However it would also be interesting to change settings to for example a block size of
    > 1920x1200 elements to mimic a single frame of a video codec for example, which might want
    > to do some random access memory.
    >
    > So then the L1 cache will probably be whoefully short and then you can still try and see
    > if your "pairing/parallel" processing of frames might give higher throughput.



    Read some recent processor manuals to get an overview.


    > However I already do see a little problem with this idea. Usually video codec frames are
    > processed sequentially, so it might be difficult to impossible to process multiple at the
    > same time.



    Possible with multiple cores. GPUs have hundreds of them.


    > However a frame does consists out of r,g,b and sometimes maybe even alpha channel.



    These four byte fit into a dword.


    > So there could be 3 to 4 channels, so your idea of processing 3 blocks at the same time
    > might still have some merit, when each color channel is considered a block.



    I never wrote anything about "processing three blocks at
    the same time" - this probably is the result of your mis-
    understanding of how my examples really work.


    > Another interesting setting could be to reduce the settings to make it fit entirely into
    > the L1 cache to see if your pairing/multiple loading code also becomes faster for when it
    > does fit in cache.



    As you can read everywhere else, it does. If cache didn't
    work as an accelerator, no recent processor would have
    -any- cache. Caches are the most expensive part of any
    processor. They occupy huge areas on a die and increase
    production costs.


    > Writing multi threading code for "random access memory" could become tricky. Especially if
    > writes are involved as well.
    > Processors might conflict with each other.



    Writes are not the bottleneck, because they are scheduled
    by the memory controller (no wait cycles). The real brake
    is reads. The processor has to wait until the requested
    data are present - it cannot work with "guessed" data.


    > As long as processors operate within their own blocks and can never access each other
    > blocks then it should be ok.



    As far as I understood your postings, your definitions do
    not allow inter-block access.


    > Some care should still be taken at the border of the blocks, inserting some paddings there
    > could help prevent accidently/unlucky simultanious access. So it does become a little bit
    > more tricky to write code like that... because of the padding, allocations have to be done
    > a little bit different, as well as formula's. I can see it become a bit messy though some
    > variable names could help solve the mess as follows:
    >
    > Allocate( mElementCount + mPaddingCount );
    >
    > ^ Then padding count could be used in formula's/code where necessary or left out where not
    > necessary.
    >
    > Like mod mElementCount to setup indexes correctly.



    Which is faster than an "and reg,mask" (1 clock, 1 pipe),
    executed while the processor waits for data?


    > "
    > With 8,192 elements per block, the multiplication can be
    > replaced by a shift:
    >
    > BlockBaseN = BlockIndexN << 15
    >
    > reducing the VectorPath MUL (blocking all pipes for five
    > clocks) to a DirectPath SHL (one clock, one pipe).
    > "
    >
    > While to explanation of this optimization idea/trick is interesting unfortunately it
    > cannot be really used in the example code,
    >
    > The example code needs to remain flexible to try out different scenerio's to see if your
    > idea can be usefull for something ;)



    Which idea? What I posted until now were suggestions for
    how your ideas could be improved. If I wanted to advertise
    my own ideas, I would start some threads rather than reply
    to other people's postings.


    >> // initialise each element index to the first index of the block.
    >> ElementIndexA = 0
    >> ElementIndexB = 0
    >> ElementIndexC = 0
    >>
    >> // loop 80000 times through the block arrays/elements
    >> SecondLoopBegin "LoopIndex goes from 0 to 79999"
    >>
    >> // Seek into memory at location BlockBase + ElementIndex
    >> // do this for A, B and C:
    >> // retrieve the new element index via the old element index
    >> ElementIndexA = mMemory[ vBlockBaseA + vElementIndexA ]
    >> ElementIndexB = mMemory[ vBlockBaseB + vElementIndexB ]
    >> ElementIndexC = mMemory[ vBlockBaseC + vElementIndexC ]
    >>
    >> SecondLoopEnd

    >
    > "
    > I still do not understand what this is good for, but you
    > should consider to process one block after the other. If
    > you are eager to introduce something like "parallel pro-
    > cessing", divide your entire array into as much parts as
    > there are processors (hopefully a multiple of two), then
    > let those threads process a contiguous part of memory in
    > parallel.
    > "
    >
    > The code above represents your "parallel block" idea, where there would be 3 loads in
    > parallel.



    This still is your code, and this stupid idea is based on
    your misunderstanding of my explanations how Athlons work
    (three execution pipes do not speed up reads - especially
    not if each of them has to read data from a different, in
    most cases uncached, memory location).


    > The mMemory[ ... ] is a load. See one of your earlier postings of yourself where you write
    > about that idea.
    >
    > You "claimed" the X2 processor can do 3 loads in parallel more or less.



    <j10mg4$2km$>

    I explain, but don't claim anything. What I wrote here is
    a short summary of AMD's reference manuals and practice -
    not more, not less.


    > The code above tries to take adventage of that.



    No. The fastest way to perform this task is to make some
    -minor- changes to avoid time consuming instructions like
    MUL and to change to sequential (not random) reads as long
    as this is possible.


    > "
    > If you prefer a single threaded solution, it's better to
    > process one block after the other, because entire blocks
    > fit into L1 completely.
    > "
    >
    > Perhaps your parallel loading idea might also be faster in such a situation.



    If you implement it properly, yes. Accessing three memory
    areas with distances of 32,768 byte surely isn't a proper
    implementation. My example in

    <j10mg4$2km$>

    still shows how to speed up reads. Clearly in sequential,
    not in random order...


    > If not then the opposite situation of not fitting into L1 could be tested as well to see
    > if your parallel loading idea gives more performance.
    >
    > "
    > Any "touched" element is present
    > in L1 after the first access, three registers are enough
    > to manage the entire loop, faster algorithms can be used
    > to process multiple elements in parallel (all pipes busy
    > most of the time), and so on,
    > "
    >
    > ?
    >
    >
    >> // store some bullshit in block result, this step could be skipped.
    >> // but let's not so compiler doesn't optimize anything away
    >> // perhaps block result should be printed just in case.
    >> // the last retrieved index is store into it.
    >> // do this for A,B,C
    >>
    >> BlockResult[ BlockIndexA ] = vElementIndexA
    >> BlockResult[ BlockIndexB ] = vElementIndexB
    >> BlockResult[ BlockIndexC ] = vElementIndexC
    >>
    >> // once the 3 blocks have been processed go to the next 3 blocks
    >> BlockIndexA = vBlockIndexA + 3
    >> BlockIndexB = vBlockIndexB + 3
    >> BlockIndexC = vBlockIndexC + 3
    >>
    >> FirstLoopEnd
    >>
    >> Perhaps this clearifies it a little bit.
    >>
    >> Let me know if you have any further questions ;)

    >
    > "
    > As shown, your current approach is quite slow
    > "
    >
    > Which one ?
    >
    > The single block version (which should be quite fast) or the three block version (which
    > was your idea ! ;) :)).
    >
    > "
    > - it might
    > be worth a thought to change your current design towards
    > something making use of built-in acceleration mechanisms
    > like caching and multiple execution pipelines.
    > "
    >
    > It already does caching by accident more or less, multiple execution pipelines that was
    > tried with your idea.
    >
    > I think it does do both a bit... I don't know exactly how much it already does these things.
    >
    > A processor simulator would be necessary to see what exactly is happening.



    The best processor simulator is the human brain (and some
    knowledge about the target machine).


    > But I can tell a little bit from re-arranging the code that it does some of these things
    > that you mentioned.
    >
    > So perhaps the "three block" version is already optimal but it's still slower than the
    > "one block" version, at least for 32 bit because then it fits in L1 cache, however
    > different settings should still be tried to see which one if faster then.
    >
    > Also 64 bit assembly code of yours should still be tried, it's not tried yet.
    >
    > "
    > Using multiples of two (rather than ten) simplifies cal-
    > culations markably. Hence, the first thing to change was
    > the size of your array - 4,096 blocks and 8,192 elements
    > don't occupy that much additional memory, 6,217,728 byte
    > are peanuts compared to the entire array. In return, you
    > save a lot of time wasted with calculating addresses and
    > validating indices (AND automatically keeps your indices
    > within the valid range).
    > "
    >
    > As long as loading and writing is not affected by the extra padding bytes, and as long as
    > the example executes correctly then arrays could be padding with extra bytes if it makes
    > address calculations faster without changing the nature of the example.
    >
    > However if the example suddenly produces different results then it would be a problem.
    >
    > "
    > The slightly changed design used addressing modes like
    >
    > mov REG,block_offset[block_base + index * 4] or
    >
    > mov block_offset[block_base + index * 4],REG
    > "
    >
    > What's the difference here ?
    >
    > One seems reading the other seems writing ?



    Yes.


    > So apperently you comparing this code to something else ? But what ?



    RDI = block count
    RSI = current block address
    RBP = current index to read

    prepare:
    prefetch rsi
    mov edi,block_count
    inner_loop:
    mov eax,[rsi+0x00+rbp*4] # movl 0x00(%rsi, %rbp, 4),%eax
    mov ebx,[rsi+0x04+rbp*4]
    mov ecx,[rsi+0x08+rbp*4]
    mov edx,[rsi+0x0C+rbp*4]
    mov r10,[rsi+0x10+rbp*4]
    mov r11,[rsi+0x14+rbp*4]
    mov r12,[rsi+0x18+rbp*4]
    mov r13,[rsi+0x1C+rbp*4]
    nop
    ....
    [process data here]
    ....
    sub ebp, 8
    jae inner_loop
    outer_loop:
    mov ebp,(loop_cnt -1)
    add esi,block_size
    dec edi
    jne inner_loop
    ....


    Reads 8 elements per iteration. EAX is present after NOP,
    EBX...R13 are present in the following cycles (one access
    per cycle). The loop reads half of an Athlon's cache line
    size (one Bulldozer cache line). Replacing the NOP with a
    PREFETCH [RSI+RBP*4] preloads the next cache line to save
    a few clock cycles. This is the code for the entire loop,
    just add some processing of the retrieved data. There are
    four registers left for calculations, temporary stores or
    whatever you want to do.


    Greetings from Augsburg

    Bernhard Schornak
     
  6. Re: An idea how to speed up computer programs and avoid waiting. ("event driven memory system")

    Skybuck Flying wrote:


    > I tried the shift left trick you mentioned.
    >
    > For the 32 bit version it doesn't really matter and doesn't give significant performance
    > increases, at least with current settings.
    >
    > For what it's worth here is the assembly output ;):
    >
    > So from the looks of it, it isn't worth the trouble, perhaps results might be different
    > for 64 bit or other settings.
    >
    > // *** Begin of SHL trick, assembly output for 32 bit: ***
    >
    > unit_TCPUMemoryTest_version_001.pas.364: begin
    > 0040FD44 53 push ebx
    > 0040FD45 56 push esi
    > 0040FD46 57 push edi
    > 0040FD47 83C4F4 add esp,-$0c
    > unit_TCPUMemoryTest_version_001.pas.365: vElementShift := mElementShift;
    > 0040FD4A 0FB6501C movzx edx,[eax+$1c]
    > 0040FD4E 891424 mov [esp],edx
    > unit_TCPUMemoryTest_version_001.pas.366: vBlockCount := mBlockCount;
    > 0040FD51 8B5010 mov edx,[eax+$10]
    > unit_TCPUMemoryTest_version_001.pas.367: vLoopCount := mLoopCount;
    > 0040FD54 8B4814 mov ecx,[eax+$14]
    > 0040FD57 894C2404 mov [esp+$04],ecx
    > unit_TCPUMemoryTest_version_001.pas.370: for vBlockIndex := 0 to vBlockCount-1 do
    > 0040FD5B 4A dec edx
    > 0040FD5C 85D2 test edx,edx
    > 0040FD5E 7232 jb $0040fd92
    > 0040FD60 42 inc edx
    > 0040FD61 89542408 mov [esp+$08],edx
    > 0040FD65 33F6 xor esi,esi
    > unit_TCPUMemoryTest_version_001.pas.372: vBlockBase := vBlockIndex shl vElementShift;
    > 0040FD67 8B0C24 mov ecx,[esp]
    > 0040FD6A 8BDE mov ebx,esi
    > 0040FD6C D3E3 shl ebx,cl
    > unit_TCPUMemoryTest_version_001.pas.374: vElementIndex := 0;
    > 0040FD6E 33D2 xor edx,edx
    > unit_TCPUMemoryTest_version_001.pas.377: for vLoopIndex := 0 to vLoopCount-1 do
    > 0040FD70 8B4C2404 mov ecx,[esp+$04]
    > 0040FD74 49 dec ecx
    > 0040FD75 85C9 test ecx,ecx
    > 0040FD77 720C jb $0040fd85
    > 0040FD79 41 inc ecx
    > unit_TCPUMemoryTest_version_001.pas.379: vElementIndex := mMemory[ vBlockBase +
    > vElementIndex ];
    > 0040FD7A 03D3 add edx,ebx
    > 0040FD7C 8B7804 mov edi,[eax+$04]
    > 0040FD7F 8B1497 mov edx,[edi+edx*4]
    > unit_TCPUMemoryTest_version_001.pas.377: for vLoopIndex := 0 to vLoopCount-1 do
    > 0040FD82 49 dec ecx
    > 0040FD83 75F5 jnz $0040fd7a
    > unit_TCPUMemoryTest_version_001.pas.382: mBlockResult[ vBlockIndex ] := vElementIndex;
    > 0040FD85 8B4808 mov ecx,[eax+$08]
    > 0040FD88 8914B1 mov [ecx+esi*4],edx
    > unit_TCPUMemoryTest_version_001.pas.383: end;
    > 0040FD8B 46 inc esi
    > unit_TCPUMemoryTest_version_001.pas.370: for vBlockIndex := 0 to vBlockCount-1 do
    > 0040FD8C FF4C2408 dec dword ptr [esp+$08]
    > 0040FD90 75D5 jnz $0040fd67
    > unit_TCPUMemoryTest_version_001.pas.384: end;
    > 0040FD92 83C40C add esp,$0c
    > 0040FD95 5F pop edi
    > 0040FD96 5E pop esi
    > 0040FD97 5B pop ebx
    > 0040FD98 C3 ret
    >
    > // *** End of SHL trick, assembly output for 32 bit: ***
    >
    > Bye,
    > Skybuck.



    I just see one SHL in that code. What do you expect the
    posted code to improve? I see at least 10 totally
    superfluous lines moving one register into another to do
    something with that copy or other -unnecessary- actions.
    Removing all redundant instructions will gain more speed
    than a (pointlessly inserted) SHL...


    Greetings from Augsburg

    Bernhard Schornak
     
  7. The Pascal code with the 3 blocks is roughly based on this code/posting of
    yours:

    "
    > An example which does one load per pipe would be nice ! ;)


    ....
    mov eax,[mem] \
    mov ebx,[mem + 0x04] cycle 1
    mov ecx,[mem + 0x08] /
    nop \
    nop cycle 2
    nop /
    nop \
    nop cycle 3
    nop /
    eax present \
    nop cycle 4
    nop /
    ebx present \
    nop cycle 5
    nop /
    ecx present \
    nop cycle 6
    nop /
    ....


    It takes 3 clocks to load EAX - Athlons can fetch
    32 bit (or 64 bit in long mode) per cycle. Hence,
    the new content of EAX will be available in cycle
    four, while the other two loads still are in pro-
    gress. Repeat this with EBX and ECX - NOPs should
    be replaced by some instructions not depending on
    the new content of EAX, EBX or ECX. It is the
    programmer's job to schedule instructions wisely.
    Unfortunately, an overwhelming majority of coders
    does not know what is going on inside the machine
    they write code for (=> "machine independent").
    "

    I do twist this text a little bit: when I write "parallel" I kinda mean 3
    loads in whatever number of clock cycles it takes.

    So it's like a more efficient form of batch processing.

    I requested an example of a "load per pipe".

    I also doubted if it was possible.

    Yet somehow you claimed it was possible, but then you came with some kind
    of story.

    Not sure if it was an answer to my question/request or a half answer.

    It's also not clear what is meant with a "load per pipe".

    Is that parallel, or really short sequential?

    For me it's more or less the same as long as it's faster than what the code
    is currently doing.

    So what it's actually called/named doesn't really matter for me as long as
    it's somehow faster.

    But now it seems we are starting to misunderstand each other on both sides
    ;)

    The manuals seem very cluttered with operating system specific information
    which I don't need.

    I only need to know hardware architecture for optimization purposes.

    I already read "optimization tricks" manual.

    I feel I could be wasting my time reading an instruction set like x86 or x64
    which could die any day now.

    It's probably also quite a bad instruction set with many oddities.

    I'd rather read and switch to a more properly designed instruction set.

    I had a fun time reading about CUDA/PTX; I do not think reading x86 would
    be any fun at all ;)

    But perhaps I will give it a looksy to see if I can find something valuable
    in it... I doubt it though.

    The processors seem also quite complex and change a lot...

    Bye,
    Skybuck.
     
  8. "Bernhard Schornak" wrote in message news:j1saub$fi0$...

    Skybuck Flying wrote:


    > I tried the shift left trick you mentioned.
    >
    > For the 32 bit version it doesn't really matter and doesn't give
    > significant performance
    > increases, at least with current settings.
    >
    > For what it's worth here is the assembly output ;):
    >
    > So from the looks of it, it isn't worth the trouble, perhaps results might
    > be different
    > for 64 bit or other settings.
    >
    > // *** Begin of SHL trick, assembly output for 32 bit: ***
    >
    > unit_TCPUMemoryTest_version_001.pas.364: begin
    > 0040FD44 53 push ebx
    > 0040FD45 56 push esi
    > 0040FD46 57 push edi
    > 0040FD47 83C4F4 add esp,-$0c
    > unit_TCPUMemoryTest_version_001.pas.365: vElementShift := mElementShift;
    > 0040FD4A 0FB6501C movzx edx,[eax+$1c]
    > 0040FD4E 891424 mov [esp],edx
    > unit_TCPUMemoryTest_version_001.pas.366: vBlockCount := mBlockCount;
    > 0040FD51 8B5010 mov edx,[eax+$10]
    > unit_TCPUMemoryTest_version_001.pas.367: vLoopCount := mLoopCount;
    > 0040FD54 8B4814 mov ecx,[eax+$14]
    > 0040FD57 894C2404 mov [esp+$04],ecx
    > unit_TCPUMemoryTest_version_001.pas.370: for vBlockIndex := 0 to
    > vBlockCount-1 do
    > 0040FD5B 4A dec edx
    > 0040FD5C 85D2 test edx,edx
    > 0040FD5E 7232 jb $0040fd92
    > 0040FD60 42 inc edx
    > 0040FD61 89542408 mov [esp+$08],edx
    > 0040FD65 33F6 xor esi,esi
    > unit_TCPUMemoryTest_version_001.pas.372: vBlockBase := vBlockIndex shl
    > vElementShift;
    > 0040FD67 8B0C24 mov ecx,[esp]
    > 0040FD6A 8BDE mov ebx,esi
    > 0040FD6C D3E3 shl ebx,cl
    > unit_TCPUMemoryTest_version_001.pas.374: vElementIndex := 0;
    > 0040FD6E 33D2 xor edx,edx
    > unit_TCPUMemoryTest_version_001.pas.377: for vLoopIndex := 0 to
    > vLoopCount-1 do
    > 0040FD70 8B4C2404 mov ecx,[esp+$04]
    > 0040FD74 49 dec ecx
    > 0040FD75 85C9 test ecx,ecx
    > 0040FD77 720C jb $0040fd85
    > 0040FD79 41 inc ecx
    > unit_TCPUMemoryTest_version_001.pas.379: vElementIndex := mMemory[
    > vBlockBase +
    > vElementIndex ];
    > 0040FD7A 03D3 add edx,ebx
    > 0040FD7C 8B7804 mov edi,[eax+$04]
    > 0040FD7F 8B1497 mov edx,[edi+edx*4]
    > unit_TCPUMemoryTest_version_001.pas.377: for vLoopIndex := 0 to
    > vLoopCount-1 do
    > 0040FD82 49 dec ecx
    > 0040FD83 75F5 jnz $0040fd7a
    > unit_TCPUMemoryTest_version_001.pas.382: mBlockResult[ vBlockIndex ] :=
    > vElementIndex;
    > 0040FD85 8B4808 mov ecx,[eax+$08]
    > 0040FD88 8914B1 mov [ecx+esi*4],edx
    > unit_TCPUMemoryTest_version_001.pas.383: end;
    > 0040FD8B 46 inc esi
    > unit_TCPUMemoryTest_version_001.pas.370: for vBlockIndex := 0 to
    > vBlockCount-1 do
    > 0040FD8C FF4C2408 dec dword ptr [esp+$08]
    > 0040FD90 75D5 jnz $0040fd67
    > unit_TCPUMemoryTest_version_001.pas.384: end;
    > 0040FD92 83C40C add esp,$0c
    > 0040FD95 5F pop edi
    > 0040FD96 5E pop esi
    > 0040FD97 5B pop ebx
    > 0040FD98 C3 ret
    >
    > // *** End of SHL trick, assembly output for 32 bit: ***
    >
    > Bye,
    > Skybuck.



    "
    I just see one SHL in that code.
    "

    That's because this is the single version which I posted; of course I also
    tried the 3 pair version, but didn't feel like posting that too since it
    doesn't become any faster.

    But perhaps you are interested in it, so here ya go; it's maybe 1 percent
    faster, but this is doubtful:

    // *** Begin of 3 pair SHL trick of 32 bit code ***:

    unit_TCPUMemoryTest_version_001.pas.421: begin
    0040FD44 53 push ebx
    0040FD45 56 push esi
    0040FD46 57 push edi
    0040FD47 83C4D8 add esp,-$28
    unit_TCPUMemoryTest_version_001.pas.423: vElementShift := mElementShift;
    0040FD4A 0FB6501C movzx edx,[eax+$1c]
    0040FD4E 89542418 mov [esp+$18],edx
    unit_TCPUMemoryTest_version_001.pas.424: vBlockCount := mBlockCount;
    0040FD52 8B5010 mov edx,[eax+$10]
    0040FD55 89542410 mov [esp+$10],edx
    unit_TCPUMemoryTest_version_001.pas.425: vLoopCount := mLoopCount;
    0040FD59 8B5014 mov edx,[eax+$14]
    0040FD5C 89542414 mov [esp+$14],edx
    unit_TCPUMemoryTest_version_001.pas.427: vBlockIndexA := 0;
    0040FD60 33D2 xor edx,edx
    0040FD62 89542404 mov [esp+$04],edx
    unit_TCPUMemoryTest_version_001.pas.428: vBlockIndexB := 1;
    0040FD66 C744240801000000 mov [esp+$08],$00000001
    unit_TCPUMemoryTest_version_001.pas.429: vBlockIndexC := 2;
    0040FD6E C744240C02000000 mov [esp+$0c],$00000002
    0040FD76 E988000000 jmp $0040fe03
    unit_TCPUMemoryTest_version_001.pas.432: vBlockBaseA := vBlockIndexA shl vElementShift;
    0040FD7B 8B4C2418 mov ecx,[esp+$18]
    0040FD7F 8B542404 mov edx,[esp+$04]
    0040FD83 D3E2 shl edx,cl
    0040FD85 8954241C mov [esp+$1c],edx
    unit_TCPUMemoryTest_version_001.pas.433: vBlockBaseB := vBlockIndexB shl vElementShift;
    0040FD89 8B4C2418 mov ecx,[esp+$18]
    0040FD8D 8B542408 mov edx,[esp+$08]
    0040FD91 D3E2 shl edx,cl
    0040FD93 89542420 mov [esp+$20],edx
    unit_TCPUMemoryTest_version_001.pas.434: vBlockBaseC := vBlockIndexC shl vElementShift;
    0040FD97 8B4C2418 mov ecx,[esp+$18]
    0040FD9B 8B54240C mov edx,[esp+$0c]
    0040FD9F D3E2 shl edx,cl
    0040FDA1 89542424 mov [esp+$24],edx
    unit_TCPUMemoryTest_version_001.pas.436: vElementIndexA := 0;
    0040FDA5 33D2 xor edx,edx
    unit_TCPUMemoryTest_version_001.pas.437: vElementIndexB := 0;
    0040FDA7 33F6 xor esi,esi
    unit_TCPUMemoryTest_version_001.pas.438: vElementIndexC := 0;
    0040FDA9 33FF xor edi,edi
    unit_TCPUMemoryTest_version_001.pas.440: for vLoopIndex := 0 to vLoopCount-1 do
    0040FDAB 8B5C2414 mov ebx,[esp+$14]
    0040FDAF 4B dec ebx
    0040FDB0 85DB test ebx,ebx
    0040FDB2 7222 jb $0040fdd6
    0040FDB4 43 inc ebx
    unit_TCPUMemoryTest_version_001.pas.442: vElementIndexA := mMemory[ vBlockBaseA + vElementIndexA ];
    0040FDB5 0354241C add edx,[esp+$1c]
    0040FDB9 8B4804 mov ecx,[eax+$04]
    0040FDBC 8B1491 mov edx,[ecx+edx*4]
    unit_TCPUMemoryTest_version_001.pas.443: vElementIndexB := mMemory[ vBlockBaseB + vElementIndexB ];
    0040FDBF 03742420 add esi,[esp+$20]
    0040FDC3 8B4804 mov ecx,[eax+$04]
    0040FDC6 8B34B1 mov esi,[ecx+esi*4]
    unit_TCPUMemoryTest_version_001.pas.444: vElementIndexC := mMemory[ vBlockBaseC + vElementIndexC ];
    0040FDC9 037C2424 add edi,[esp+$24]
    0040FDCD 8B4804 mov ecx,[eax+$04]
    0040FDD0 8B3CB9 mov edi,[ecx+edi*4]
    unit_TCPUMemoryTest_version_001.pas.440: for vLoopIndex := 0 to vLoopCount-1 do
    0040FDD3 4B dec ebx
    0040FDD4 75DF jnz $0040fdb5
    unit_TCPUMemoryTest_version_001.pas.447: mBlockResult[ vBlockIndexA ] := vElementIndexA;
    0040FDD6 8B4808 mov ecx,[eax+$08]
    0040FDD9 8B5C2404 mov ebx,[esp+$04]
    0040FDDD 891499 mov [ecx+ebx*4],edx
    unit_TCPUMemoryTest_version_001.pas.448: mBlockResult[ vBlockIndexB ] := vElementIndexB;
    0040FDE0 8B5008 mov edx,[eax+$08]
    0040FDE3 8B4C2408 mov ecx,[esp+$08]
    0040FDE7 89348A mov [edx+ecx*4],esi
    unit_TCPUMemoryTest_version_001.pas.449: mBlockResult[ vBlockIndexC ] := vElementIndexC;
    0040FDEA 8B5008 mov edx,[eax+$08]
    0040FDED 8B4C240C mov ecx,[esp+$0c]
    0040FDF1 893C8A mov [edx+ecx*4],edi
    unit_TCPUMemoryTest_version_001.pas.451: vBlockIndexA := vBlockIndexA + 3;
    0040FDF4 8344240403 add dword ptr [esp+$04],$03
    unit_TCPUMemoryTest_version_001.pas.452: vBlockIndexB := vBlockIndexB + 3;
    0040FDF9 8344240803 add dword ptr [esp+$08],$03
    unit_TCPUMemoryTest_version_001.pas.453: vBlockIndexC := vBlockIndexC + 3;
    0040FDFE 8344240C03 add dword ptr [esp+$0c],$03
    unit_TCPUMemoryTest_version_001.pas.430: while vBlockIndexA <= (vBlockCount-4) do
    0040FE03 8B542410 mov edx,[esp+$10]
    0040FE07 83EA04 sub edx,$04
    0040FE0A 3B542404 cmp edx,[esp+$04]
    0040FE0E 0F8367FFFFFF jnb $0040fd7b
    0040FE14 EB35 jmp $0040fe4b
    unit_TCPUMemoryTest_version_001.pas.458: vBlockBaseA := vBlockIndexA shl vElementShift;
    0040FE16 8B4C2418 mov ecx,[esp+$18]
    0040FE1A 8B542404 mov edx,[esp+$04]
    0040FE1E D3E2 shl edx,cl
    0040FE20 8954241C mov [esp+$1c],edx
    unit_TCPUMemoryTest_version_001.pas.460: vElementIndexA := 0;
    0040FE24 33D2 xor edx,edx
    unit_TCPUMemoryTest_version_001.pas.462: for vLoopIndex := 0 to vLoopCount-1 do
    0040FE26 8B5C2414 mov ebx,[esp+$14]
    0040FE2A 4B dec ebx
    0040FE2B 85DB test ebx,ebx
    0040FE2D 720E jb $0040fe3d
    0040FE2F 43 inc ebx
    unit_TCPUMemoryTest_version_001.pas.464: vElementIndexA := mMemory[ vBlockBaseA + vElementIndexA ];
    0040FE30 0354241C add edx,[esp+$1c]
    0040FE34 8B4804 mov ecx,[eax+$04]
    0040FE37 8B1491 mov edx,[ecx+edx*4]
    unit_TCPUMemoryTest_version_001.pas.462: for vLoopIndex := 0 to vLoopCount-1 do
    0040FE3A 4B dec ebx
    0040FE3B 75F3 jnz $0040fe30
    unit_TCPUMemoryTest_version_001.pas.467: mBlockResult[ vBlockIndexA ] := vElementIndexA;
    0040FE3D 8B4808 mov ecx,[eax+$08]
    0040FE40 8B5C2404 mov ebx,[esp+$04]
    0040FE44 891499 mov [ecx+ebx*4],edx
    unit_TCPUMemoryTest_version_001.pas.469: vBlockIndexA := vBlockIndexA + 1;
    0040FE47 FF442404 inc dword ptr [esp+$04]
    unit_TCPUMemoryTest_version_001.pas.456: while vBlockIndexA <= (vBlockCount-1) do
    0040FE4B 8B542410 mov edx,[esp+$10]
    0040FE4F 4A dec edx
    0040FE50 3B542404 cmp edx,[esp+$04]
    0040FE54 73C0 jnb $0040fe16
    unit_TCPUMemoryTest_version_001.pas.471: end;
    0040FE56 83C428 add esp,$28
    0040FE59 5F pop edi
    0040FE5A 5E pop esi
    0040FE5B 5B pop ebx
    0040FE5C C3 ret

    // *** End of 3 pair SHL trick of 32 bit code ***:

    "
    What do you expect that the posted code might improve?
    "

    You wrote the "mul" blocks the processor somehow, the pipes were blocked
    because of it.

    You wrote the "shl" does not block the processor/pipelines and multiple
    shl's can be executed faster.

    So if the program was limited by instruction throughput then switching to
    shl would have given more performance.

    As far as I can recall it did not give more performance, not in the single
    version and and not in the 3 pair version.
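
    For reference, the difference being discussed boils down to this (a minimal
    sketch in Pascal; vStride and the value 13 for vElementShift are
    illustrative assumptions, not taken from the actual program):

    // Multiply version: one mul per block base.
    vBlockBaseA := vBlockIndexA * vStride;

    // Shift version: only equivalent when vStride is a power of two,
    // for example vStride = 8192 with vElementShift = 13.
    vBlockBaseA := vBlockIndexA shl vElementShift;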

    "I see at least 10 totally superfluous lines moving one register into
    another to do
    something with that copy or other -unnecessary- actions.
    "

    ?

    "
    Removing all redundant instructions will gain more speed
    than a (pointlessly inserted) SHL...
    "

    You are welcome to take these x86 outputs and remove any redundant
    instructions as you see fit; it should be easy to re-integrate your modified
    assembly code.

    Bye,
    Skybuck.
     
  9. Flying Bucket posted an excellent Branch_Never variant:

    ....
    0040FE2A 4B dec ebx
    0040FE2B 85DB test ebx,ebx
    0040FE2D 720E jb $0040fe3d
    0040FE2F 43 inc ebx
    ....

    was this your idea or is it just the output of your smart compiler ?

    __
    wolfgang
     
  10. "
    ....
    0040FE2A 4B dec ebx
    0040FE2B 85DB test ebx,ebx
    0040FE2D 720E jb $0040fe3d
    0040FE2F 43 inc ebx
    ....

    was this your idea or is it just the output of your smart compiler ?
    "

    This is indeed a bit strange, since dec leaves CF alone, test resets CF and
    JB looks for CF=1 (inc also leaves CF alone).

    Fortunately this code is only executed once at the start of the for loop;
    for the rest of the loop iterations it is skipped.

    Why it generates this code is a bit of a mystery; it could have different
    explanations/reasons.

    The most likely reason is that this is a placeholder for any additional
    checking, for example when the code is changed.

    It could also be a failed attempt at seeing if the loop needs to be skipped
    over.

    It could also be caused by the loop count being a longword: the -1 would
    wrap back around, so the loop will always execute no matter what. The JB is
    supposed to jump over the loop, but as you can see it will never actually do
    so.

    So in a way the code is unnecessary. Had the type been a signed integer then
    it could/would have been necessary.
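
    For illustration, a guard written out in Pascal (a minimal sketch reusing
    the variable names from the listing, shown for chain A only, and assuming,
    as discussed above, that vLoopCount is a longword) would make the zero-count
    case explicit and remove the need for such a check:

    // If vLoopCount is a longword, vLoopCount - 1 wraps around to $FFFFFFFF
    // when the count is zero; guarding the loop avoids relying on the jb.
    if vLoopCount > 0 then
    begin
      for vLoopIndex := 0 to vLoopCount - 1 do
      begin
        vElementIndexA := mMemory[ vBlockBaseA + vElementIndexA ];
      end;
    end;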

    It could also be some strange bug in the compiler where it doesn't recognize
    this useless code or generates it wrongly, or perhaps there is no real
    solution to detecting the wrap back from the -1 to $FFFFFFFF.

    Since the loop has a high iteration count it's not much of a problem; for
    smaller iteration counts it could have been a performance/execution problem.

    The only real problem I see with it is that it's taking up precious L1
    instruction cache space.

    One last possible explanation could be "alignment/padding" of the
    instruction codes to make them nicely aligned, but this seems unlikely.

    Nicely spotted though by you ! ;) :)

    Bye,
    Skybuck.
     
  11. Re: An idea how to speed up computer programs and avoid waiting.("event driven memory system")

    Skybuck Flying wrote:


    > The pascal code with the 3 blocks is roughly based on this code/posting of yours:



    Definitely not!


    > "
    >> An example which does one load per pipe would be nice ! ;)

    >
    > ...
    > mov eax,[mem] \
    > mov ebx,[mem + 0x04] cycle 1
    > mov ecx,[mem + 0x08] /
    > nop \
    > nop cycle 2
    > nop /
    > nop \
    > nop cycle 3
    > nop /
    > eax present \
    > nop cycle 4
    > nop /
    > ebx present \
    > nop cycle 5
    > nop /
    > ecx present \
    > nop cycle 6
    > nop /
    > ...
    >
    >
    > It takes 3 clocks to load EAX - Athlons can fetch
    > 32 bit (or 64 bit in long mode) per cycle. Hence,
    > the new content of EAX will be available in cycle
    > four, while the other two loads still are in pro-
    > gress. Repeat this with EBX and ECX - NOPs should
    > be replaced by some instructions not depending on
    > the new content of EAX, EBX or ECX. It is the
    > programmer's job to schedule instructions wisely.
    > Unfortunately, an overwhelming majority of coders
    > does not know what is going on inside the machine
    > they write code for (=> "machine independent").
    > "
    >
    > I do twist this text a little bit when I write "parallel" I kinda mean 3 loads in whatever
    > clock cycles it takes.
    >
    > So it's like a more efficient form of batch processing.
    >
    > I requested for an example of a "load per pipe".
    >
    > I also doubted if it was possible.



    Your profound misunderstanding of this code snippet led
    to these assumptions. Hint:

    [mem], [mem+4] and [mem+8] are subsequent dwords stored
    in contiguous order

    0x00401000 = address of 1st read
    0x00401004 = 2nd read
    0x00401008 = 3rd read.

    Cache line 0x00401000 will be loaded after the 1st read
    (if it was not in L1, yet). Your implementation of this
    example does this

    0x00401000 = address of 1st read
    0x00507000 = 2nd read
    0x0080F000 = 3rd read,

    which is not the same thing - three cache lines have to
    be loaded if you access memory like that.
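
    In Pascal terms the difference looks roughly like this (an illustrative
    sketch; vBase, vA, vB and vC are made-up names):

    // Consecutive dwords: all three reads usually hit the same cache line.
    vA := mMemory[ vBase ];
    vB := mMemory[ vBase + 1 ];
    vC := mMemory[ vBase + 2 ];

    // The test program instead reads from three blocks that lie far apart,
    // so each read touches (and may have to load) a different cache line.
    vElementIndexA := mMemory[ vBlockBaseA + vElementIndexA ];
    vElementIndexB := mMemory[ vBlockBaseB + vElementIndexB ];
    vElementIndexC := mMemory[ vBlockBaseC + vElementIndexC ];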


    > Yet somehow you claimed it was possible, but then you come with some kind of story.
    >
    > Not sure if it was an answer to my question/request or a half answer.
    >
    > It's also not clear what is meant by a "load per pipe".



    The term speaks for itself.


    > Is that parallel, or really short sequential.



    Depends on the code. If you use sequential addresses
    for reads or writes, it is both. Otherwise, it still is
    parallel, but performs multiple random reads or writes.


    > For me it's more or less the same as long as it's faster than what the code is currently
    > doing.



    One reason why you did not understand what I told you.


    > So what it's actually called/named doesn't really matter for me as long as it's somehow
    > faster.



    As long as you don't know what you are doing, you can't
    expect your code will become faster.


    > But now it seems we are starting to miss understand each other on both sides ;)



    No. I perfectly understand your position. However, it's
    one thing to offer help or do some work for someone, and
    another to -demand- more help than was already given.

    I hope, you are able to understand this point.


    > The manuals seems very cluttered with operating system specific information which I don't
    > need.



    Which manuals? Try

    http://support.amd.com/us/Processor_TechDocs/24592.pdf
    http://support.amd.com/us/Processor_TechDocs/24593.pdf
    http://support.amd.com/us/Processor_TechDocs/24594.pdf
    http://support.amd.com/us/Processor_TechDocs/26568.pdf
    http://support.amd.com/us/Processor_TechDocs/26569.pdf
    http://support.amd.com/us/Processor_TechDocs/40546.pdf

    for AMD or

    http://www.intel.com/Assets/PDF/manual/253665.pdf
    http://www.intel.com/Assets/PDF/manual/253666.pdf
    http://www.intel.com/Assets/PDF/manual/253667.pdf
    http://www.intel.com/Assets/PDF/manual/253668.pdf
    http://www.intel.com/Assets/PDF/manual/253669.pdf

    for LETNi references and manuals.


    > I only need to know hardware architecture for optimization purposes.
    >
    > I already read "optimization tricks" manual.



    This is not a -manual-. It's a paper analysing code and
    providing hints to improve programming techniques.


    > I feel I could be wasting my time reading an instruction set like x86 or x64 which could
    > die any day now.
    >
    > It's probably also quite a bad instruction set with many oddities.



    It's the most popular platform worldwide. After all, it
    became that popular, 'cause it was the first (and only)
    open standard where you can plug in any hardware of any
    manufacturer without being bound to proprietary systems
    where you are forced to buy every add-on from the manu-
    facturer who already sold you the "naked" computer.


    > I'd rather read and switch to a more properly designed instruction set.



    Which?

    If you think x86 assembler is that disgusting, you are
    better off if you stay with Pascal. It is machine inde-
    pendent, much more comfortable and does most of the re-
    quired work for you. Of course, you've to pay the price
    in form of slow and bloated code, but, hey: Who cares?


    Greetings from Augsburg

    Bernhard Schornak
     
  12. Re: An idea how to speed up computer programs and avoid waiting.("event driven memory system")

    Skybuck Flying wrote:


    > "Bernhard Schornak" wrote in message news:j1saub$fi0$...
    >
    > Skybuck Flying wrote:



    <snip>

    > What do you expect that the posted code might improve?
    > "
    >
    > You wrote the "mul" blocks the processor somehow, the pipes were blocked because of it.
    >
    > You wrote the "shl" does not block the processor/pipelines and multiple shl's can be
    > executed faster.
    >
    > So if the program was limited by instruction throughput then switching to shl would have
    > given more performance.
    >
    > As far as I can recall it did not give more performance, not in the single version and and
    > not in the 3 pair version.



    If you cut things out of their context and apply them
    to a very different context, you shouldn't expect too
    much. Leave everything at the place it belongs to...


    > "I see at least 10 totally superfluous lines moving one register into another to do
    > something with that copy or other -unnecessary- actions.
    > "



    movzx edx,[eax+$1c] # EDX = 0x1c[EAX] (byte!)
    mov [esp],edx # EDX > 0x00[ESP]
    mov edx,[eax+$10] # EDX = 0x10[EAX]
    mov ecx,[eax+$14] # ECX = 0x14[EAX]
    mov [esp+$04],ecx # ECX > 0x04[ESP]
    dec edx # EDX - 1
    test edx,edx # redundant
    jb L09 # > done if sign
    inc edx # EDX + 1
    mov [esp+$08],edx # EDX > 0x08[ESP]
    xor esi,esi # ESI = 0

    L00:
    mov ecx,[esp] # ECX = 0x00[ESP]
    mov ebx,esi # EBX = ESI
    shl ebx,cl # EBX = EBX << CL
    xor edx,edx # EDX = 0
    mov ecx,[esp+$04] # ECX = 0x04[ESP]
    dec ecx # ECX - 1
    test ecx,ecx # redundant
    jb L02 # > outer loop if sign
    inc ecx # ECX + 1

    L01:
    add edx,ebx # EDX + EBX
    mov edi,[eax+$04] # EDI = 0x04[EAX]
    mov edx,[edi+edx*4] # EDX > 0x00[EDI+EDX*4]
    dec ecx # ECX - 1 (*)
    jnz L01 # > inner loop if not zero

    L02:
    mov ecx,[eax+$08] # ECX = 0x08[EAX]
    mov [ecx+esi*4],edx # EDX > 0x00[ECX+ESI*4]
    inc esi # ESI + 1
    dec dword ptr [esp+$08] # 0x08[ESP] - 1 (*)
    jnz L00 # outer loop if not zero

    (*) If ECX = 0 or 0x08[ESP] = 0, the inner/outer loop
    runs 4,294,967,295 times, until the tested value will
    be zero, again.

    Consider this fact if your program seems to "hang".

    @ Wolfgang: Both loops do work properly. In the worst
    case (value is zero), these loops count down the full
    32 bit range.


    BTW: I still do not see what that shift is good for.


    > "
    > Removing all redundant instructions will gain more speed
    > than a (pointlessly inserted) SHL...
    > "
    >
    > You are welcome to take these x86 outputs and remove any redundant instructions you see
    > fit it should be easy to re-integrate your modified assembly code.



    It is your program. Apply what you have learned until
    now.

    If you have questions, ask, but please stop demanding
    anything from others if you cannot do it on your own.


    Greetings from Augsburg

    Bernhard Schornak
     
  13. Bernhard Schornak wrote:
    ....
    > L00:
    > mov ecx,[esp] # ECX = 0x00[ESP]
    > mov ebx,esi # EBX = ESI
    > shl ebx,cl # EBX = EBX << CL
    > xor edx,edx # EDX = 0
    > mov ecx,[esp+$04] # ECX = 0x04[ESP]
    > dec ecx # ECX - 1
    > test ecx,ecx # redundant
    > jb L02 # > outer loop if sign
    > inc ecx # ECX + 1


    DEC won't alter carry, so "jb" aka "jc" should be replaced by "jng" or "js".
    ....
    > @ Wolfgang: Both loops do work properly. In the worst
    > case (value is zero), these loops count down the full
    > 32 bit range.


    OTOH what I see is:

    dec ecx
    jng ...

    actually checks whether ecx was zero or negative before the DEC,
    so I'd have just

    test ecx,ecx
    jng ... ;jumps on zero- or sign- or overflow -flag

    as this will imply a zero detection.

    And for a biased range, i.e.:

    cmp ecx,3
    jng ... ;jumps if ecx = 3 or less (signed)
    __
    wolfgang
     
  14. Re: An idea how to speed up computer programs and avoid waiting.("event driven memory system")

    wolfgang kern wrote:


    > Bernhard Schornak wrote:
    > ...
    >> L00:
    >> mov ecx,[esp] # ECX = 0x00[ESP]
    >> mov ebx,esi # EBX = ESI
    >> shl ebx,cl # EBX = EBX<< CL
    >> xor edx,edx # EDX = 0
    >> mov ecx,[esp+$04] # ECX = 0x04[ESP]
    >> dec ecx # ECX - 1
    >> test ecx,ecx # redundant
    >> jb L02 #> outer loop if sign
    >> inc ecx # ECX + 1

    >
    > DEC wont alter carry, so "jb" aka "jc" should be replaced by "jng"or "js".



    Hi! Thanks for refreshing my knowledge base ... I
    haven't seen much more than the steering wheel of
    an Atego 1222 and thousands of waybills for about
    nine months, now. ;)


    > ...
    >> @ Wolfgang: Both loops do work properly. In the worst
    >> case (value is zero), these loops count down the full
    >> 32 bit range.

    >
    > OTOH what I see is:
    >
    > dec ecx
    > jng ...
    >
    > actually checks whether ecx was zero or negative before the DEC,
    > so I'd have just
    >
    > test ecx,ecx
    > jng ... ;jumps on zero- or sign- or overflow -flag
    >
    > as this will imply a zero detection.



    Right. Hence, the dec/inc pairs are redundant for
    checking the range of EDX and ECX. Freeing EBP as
    GPR allows replacing the outer loop counter [ESP
    +0x08] with a register. Just these three cosmetic
    changes saved 5 * 4,000 = 20,000 clocks...

    ....not a real improvement for processing a 130 MB
    array, randomly accessed 320,000,000 times...

    My suggestion to expand the 8,000 to 8,192 dwords
    could reduce all range checks to

    and ecx,0x1FFF
    je ...

    which leaves a valid index in ECX and skips processing
    if ECX = 0. Same with EDX (ANDed with 0x0FFF).
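
    In Pascal terms the general trick looks something like this (purely an
    illustrative sketch, with the element count rounded up to a power of two):

    // Assume the block size is rounded up from 8,000 to 8,192 elements.
    // Then 8192 - 1 = $1FFF acts as a wrap-around mask: the AND keeps the
    // index in range without a compare, and a zero result can be used to
    // skip further processing, as with the "je" above.
    vElementIndexA := vElementIndexA and $1FFF;
    if vElementIndexA = 0 then
      Break;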


    > And for a biased range, i.e.:
    >
    > cmp ecx,3
    > jng ... ;jumps if ecx = 3 or less (signed)



    A sometimes required operation. In most cases, it
    is better to define a valid range and "transpose"
    it to something counted up or down to zero, using
    appropriate offsets "compensating" the transposed
    index.
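
    As an illustration in Pascal (a made-up sketch; Process, vLow, vHigh and
    vCounter are hypothetical names), such a transposed loop might look like:

    // Count down to zero instead of comparing against an upper bound each
    // pass; the offset compensates for the transposed index.
    vCounter := vHigh - vLow + 1;
    while vCounter > 0 do
    begin
      Process( mMemory[ vLow + vCounter - 1 ] );
      Dec( vCounter );
    end;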

    Unfortunately, there seem to be some addresses in
    the first elements of each block, so the properly
    coded loop had to check for the lower limit - the
    real array starts at offset 0x18 - as well. Slows
    down the code with two additional branches. Looks
    like HeLL, smells like HeLL, only the design is not
    quite so bright ("hell")... ;)


    Greetings from Augsburg

    Bernhard Schornak
     
  15. Ok,

    I am a bit surprised I didn't respond yet to this posting of yours, but that
    might be because most of it is true and I have nothing further to add,
    except of course confirming that it was indeed somewhat of a
    misunderstanding.

    But it was also wishful thinking.

    I was hoping that

    [mem + 0x04] somehow meant that it was using "mem" as a base address and
    that "0x04" would contain another pointer which would point to the random
    memory location.

    I am not much of an x86 programmer and each assembler probably has its own
    pointer syntax/whatever.

    I am glad you cleared this up.

    I also don't hold your example against you because at the time of writing I
    probably had not yet explained what the program was doing.

    Only after your posting did I explain further...

    So I think it's good of me to clarify that further just to be clear that you
    were not trying to sabotage this thread with a wrong example.

    At the time you probably simply didn't understand what I was trying to do...
    or maybe you did understand but provided a sequential example anyway.

    Only you know the truth of what was in your head at the time and your true
    intention... though a sequential example which more or less assumes that the
    data is inside the cache could still be interesting.

    So there could still be some value in this... perhaps my code is already
    doing something like this... but I guess not because it's not really giving
    any higher performance but that might be because the data set is too large.

    If the elements were reduced from 8000 to 800 then 3 pairs of blocks might
    fit in the cache (800 elements * 4 bytes is only about 3.2 KB per block) and
    then maybe your code example might still have some value.

    However not really since the single program is probably already maxing out
    the system/cpu.

    It's not the cache or memory which is holding back the program; it's simply
    the amount of instructions.

    SSE is not going to help since there is no SSE instruction which can
    retrieve multiple random memory locations. So too bad for that.

    If you truly want to write a faster program in assembler, and especially a
    significantly faster program, then you would either:

    1. Have to reduce the number of instructions further, which will probably be
    hard to do (inside the inner loop).

    or

    2. Find a way to "fetch" multiple elements per cycle (per instruction).

    My take on it is: 1 might be possible, but 2 probably not.

    Also:

    3. Writing a 64 bit program is probably useless, since 64 bit instructions
    execute twice as slowly, at least on my processor; perhaps you should test
    this on your Phenom and see what happens ;)

    However under different circumstances perhaps you can do better, like
    not-fitting-in-cache-circumstances. But this again would probably require
    some kind of "anti-stalling" "anti-waiting" code ;)

    Which was what my original posting was more or less about... letting code
    proceed as much as possible and jumping back when memory results are in...
    something along those lines...

    Perhaps you are now starting to see that my posting is about something new
    or extra which might not be possible with current hardware; though you do
    keep insisting that it is possible, I believe you 50%, a little bit... even
    if it's possible it will be little, and you still have to convince me 100%
    though... my lack of time has still prevented me from running your program.

    Perhaps I don't want to know the results for now since they wouldn't really
    be that helpful I guess ;) :)

    However if you could somehow re-write your program from "optimized
    assembler" back to "free pascal code" in such a way that the free pascal
    compiler produces more or less the same code then that would be highly
    interesting ! ;) Especially if the code is indeed faster for certain
    circumstances ! ;) It would probably still not be interesting for my current
    project but maybe for future projects.

    However I see a big mistake in your reasoning of optimizing your program,
    which is at the same time a big mistake in my reasoning and original post.

    "CUDA/GPU'S" can probably already do what I propose to a certain extent at
    least... and they probably do it much better than CPU... which means
    x86/intel/amd potentially has huge problem, since GPU can do something which
    their older processors apperently cannot which is: process very large data
    sets in a random fashion.

    However I am unsure about ATI cards and the newer AMD gpu's; even INTEL
    seems to have "gpu's embedded"... I find intel kinda secretive about that...
    they are not coming forward with details... they probably did the same
    during the 80486 age... they keep "secrets" "secret" to give their selected
    programmers an advantage over others... not a good practice for us it seems,
    which makes me less enthusiastic to buy their processors.

    nvidia's gpu's seem better documented, especially cuda... but I am thinking
    this is necessary to attract more programmers to it... cuda has little
    benefit for now... but maybe that will change in the future... nvidia is in
    business with big companies: apple and microsoft and probably also google
    and maybe even more... they are positioned pretty well... I do wonder what
    will happen to their documentation if they do make a big x86 killer...
    perhaps they will become more secretive again, which would suck. At least
    that's how I experience it a little bit... perhaps the things I wrote about
    intel might not be entirely true... but that is how I feel about the
    latest/greatest... or maybe their documentation is just simply lagging
    behind a little bit... or I didn't look... or it doesn't concern me yet
    since I don't have those chips or proper simulators ;)

    I have been programming x86 for at least 18 years or so... and I still don't
    have a proper x86 simulator, which kinda sucks !

    Having one which could show the "amount of cycles" taken by programs would
    be great. Quite strange that it apparently doesn't exist, or is too
    shitty/complex/I can't understand it or whatever, or takes too long to
    execute ?! Hmm...

    This is what I do like about "virtual instruction sets"... total insight
    into how everything executes and such ! ;) :)

    This probably explains why there are so many virtual instruction sets:

    javascript, java, php, .net, flash, other scripting languages.

    ^ Sad thing is these all have security issues, so it doesn't really help at
    protecting computers; it takes only one hole to screw up a computer ;)
    But it does make computers run slower :(

    Bye,
    Skybuck.
     
  16. "
    This probably explains why there are so many virtual instruction sets:

    javascript, java, php, .net, flash, other scripting languages.

    ^ Sad thing is these all have security issues, so it doesn't really help at
    protecting computers; it takes only one hole to screw up a computer ;)
    But it does make computers run slower :(
    "

    Oh I forgot one, which could be a pretty serious one in future:

    How could I forget my favorite one at the moment:

    "PTX"

    And perhaps I should also mention "redcode", but it's not really that
    serious, more meant for fun ;)

    But PTX is of course serious.

    And there is also AMD's version "vm" or something...

    Which suddenly makes me wonder what intel's instruction set is for their
    integrated cpu+gpu thingy... hmmm ;)

    Bye,
    Skybuck.
     