
Skybuck's RAM Test version 0.07 (For CUDA and CPU) now available !

Discussion in 'Nvidia' started by Skybuck Flying, Jul 12, 2011.

  1. (To run this test successfully you will probably need an NVIDIA CUDA-enabled
    graphics card, probably with compute capability 2.0 as a minimum).

    Hello,

    Skybuck's RAM Test version 0.07 is now available at the following link,
    either as a WinRAR archive or as loose files (3):

    File:

    http://www.skybuck.org/CUDA/RAMTest/SkybuckRAMTestV007b.rar

    Folder:

    http://www.skybuck.org/CUDA/RAMTest/

    What the test does is the following:

    It creates 20.000 blocks. Each block has 8.000 elements. Each element is a
    32-bit integer (4 bytes).

    Each block has one execution thread.

    The execution thread "travels" through the elements in a random fashion.
    (RAM read test).

    It stores/writes the last element it processed in BlockResult[BlockIndex]
    to verify that it indeed did any processing at all.

    This test is performed on GPU and CPU. (On the CPU only one thread/core is
    used for now; perhaps a future test will include multi-threading.)
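
    As a rough illustration only (the actual kernel source is not posted in this
    thread, it is in the archive above), a per-block random-walk kernel along
    these lines might look like the sketch below; the kernel name, the parameter
    names and the chained-index layout are assumptions on my part:

    // Hedged sketch: one execution thread per block walks a pre-randomized chain
    // of indices, so every read depends on the previous one (names/layout assumed).
    __global__ void RandomWalk(const int *Memory, int *BlockResult,
                               int ElementCount, int LoopCount)
    {
      int BlockIndex = blockIdx.x;                    // one thread per block
      const int *Block = Memory + BlockIndex * ElementCount;

      int Element = 0;
      for (int Loop = 0; Loop < LoopCount; Loop++)
      {
        Element = Block[Element];                     // dependent (random) global read
      }

      BlockResult[BlockIndex] = Element;              // proves the walk actually ran
    }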

    The timing and performance results are then displayed at the bottom.

    The GT 520 GPU and a single thread of the AMD X2 3800+ dual-core CPU performed
    as follows:

    Kernel execution time in seconds: 25.0870683593750000
    CPU execution time in seconds : 11.8696194628088207

    Cuda memory transactions per second: 63777878.5898704829000000
    CPU memory transactions per second : 134797918.7549603890000000

    Conclusion: the CPU's single thread is about twice as fast as the GPU.

    Note: this test requires 611 megabytes (20.000 blocks x 8.000 elements x 4
    bytes = 640.000.000 bytes) to be free/available on both the CPU and the GPU.

    I would be very much interested in how this test performs on your
    system/cpu/gpu.

    So if you do run this test on your system, please post the results below
    (just the 4 lines shown above are enough/fine).

    (Also additional info about system would be nice too but is not required ;))

    You can also e-mail results to:



    Bye,
    Skybuck.
     
    Skybuck Flying, Jul 12, 2011
    #1

  2. I did some more tests with different settings after seeing the depressing
    results for random memory access with CUDA, probably caused by register
    dependencies and such.

    These graphics cards are supposed to be good at linear/vector-like access, so
    I tested that as well somewhat.
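
    For comparison, a fully linear/coalesced read pattern (one element per thread,
    neighbouring threads touching neighbouring addresses) would look roughly like
    the sketch below; the names are assumptions, not the actual test code:

    // Hedged sketch of a coalesced read: adjacent threads read adjacent 32-bit
    // elements, which is the access pattern the GPU memory system handles best.
    __global__ void LinearRead(const int *Memory, int *Result, int ElementCount)
    {
      int Index = blockIdx.x * blockDim.x + threadIdx.x;
      if (Index < ElementCount)
      {
        Result[Index] = Memory[Index];                // coalesced load and store
      }
    }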

    If the number of elements is just one, CUDA performs extremely well: 10 times
    as fast as the CPU.

    If the number of elements is 10, CUDA still performs roughly 5 times as fast
    as the CPU.

    So there is still some hope inside of me that CUDA will be useful for, for
    example, video codecs.

    I was hoping to use CUDA for something else, but I guess that will have to go
    back into the freezer for now.

    Or I could give OpenCL a try and see if somebody's ATI card does better, but
    OpenCL seems somewhat boring and there is very little information about the
    instruction set used by OpenCL programs.

    So perhaps I should spend some time giving my lossless video codec another
    try, but this time use CUDA to see if it can achieve faster performance and
    perhaps even higher resolution, which would be nice.

    It needs to be at least twice as fast for somewhat decent frame rates at
    normal resolution, and then 4 times as fast again for double resolution (four
    times the pixels)... so it needs to be 8 times as fast in total.

    Seeing a speed up of 10 is nice.

    However, a parallel algorithm might also require some rounds... but the test
    settings covered that somewhat: loops were set to 10. I just did another test
    with 100 loops, and CUDA is still 3 times faster than the CPU.

    Time for a more serious test. I set elements to 1, which would mean 32-bit
    colors. I set blocks to 1920x1200 (2.304.000 blocks) and loops to 22 for a
    parallel-scan simulation * 60 for video frequency (1.320 loops in total).

    I won't reveal the numbers, lol. But I can tell you: the GPU is 40 times as
    fast as the CPU ! LOL.

    That puts a big smile on my face ! ;) =D

    CUDA just made my day real happy ! ;) =D

    Sigh... so CUDA should be perfectly suited for writing video codecs, as long
    as the video codecs access their memory as sequentially as possible ;)

    Bye,
    Skybuck.
     
    Skybuck Flying, Jul 13, 2011
    #2

  3. The number of blocks doesn't really matter.

    I tested with 2000 and it gives the same performance results; it just takes
    less time to test, since the result is after all divided by the seconds taken.

    The higher block counts were just to test whether they might help.

    Anyway, I have managed to find a little optimization trick via PTX.

    It's quite significant too.

    By adding a "cop" directive which stands for "cache operation" specifier
    cuda can be made to run faster:

    The following instruction was changed from:

    ld.global.s32 %r34, [%r38+0];
    To:
    ld.global.cg.s32 %r34, [%r38+0];

    This seems to give 50% more performance for random memory access with CUDA !

    However, more careful tests should be done to be sure... maybe it's just for
    this particular situation, but the difference is so big there is probably
    something to it ! ;)
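
    For anyone who wants to try the same thing without hand-editing the generated
    PTX, an inline-PTX wrapper should produce the same instruction; this is only a
    sketch under that assumption (the helper name is mine, not from the test code):

    // Hedged sketch: force an ld.global.cg load (cache in L2, skip L1) from CUDA C
    // via inline PTX. Assumes 64-bit device addressing ("l" register constraint).
    __device__ __forceinline__ int LoadGlobalCG(const int *Address)
    {
      int Value;
      asm volatile("ld.global.cg.s32 %0, [%1];" : "=r"(Value) : "l"(Address));
      return Value;
    }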

    Surprisingly the "cop" .cs did not give more performance, which is what I
    tried first.

    I still have others to try, but this is already pretty spectacular ! ;)

    Since everything else I tried with code adjustments didn't help ! ;)

    So there is still hope yet to squeeze some more performance out of it ! ;)
    =D

    The CPU is still faster by a large margin (roughly twice as fast) though ! ;)

    Bye,
    Skybuck.
     
    Skybuck Flying, Jul 14, 2011
    #3
  4. The following technique works more or less the same way at the source level:

    The following parameter is altered

    from:

    int *Memory,

    to:

    volatile int *Memory,

    This produces the instruction:

    ld.volatile.global.s32 %r34, [%r38+0];

    I also tried adding .cg behind .global, but that is not allowed; it cannot be
    combined with volatile...

    volatile already indicates that no cache operations are allowed.

    This gives the same 50% performance increase which is very nice ! ;)

    Bye,
    Skybuck.
     
    Skybuck Flying, Jul 14, 2011
    #4
  5. Interesting news in short: GPU cache 4 times faster than CPU cache ! ;) =D

    (Version 0.10, which still uses GPU RAM instead of GPU cache, is also
    available.)

    (Version 0.12 is the GPU cache version but is still unreleased ;) =D)

    Ok, the shared memory kernel is done... it also executes 4000 blocks but
    this time sequentially...

    These test results made my jaw drop ! LOL... which offers possibilities/hope
    for CUDA:

    Just a single CUDA thread did this:

    http://www.skybuck.org/CUDA/RAMTest/version 0.12/SharedMemoryTest.png

    Text:

    "
    Test Cuda Random Memory Access Performance.
    version 0.12 created on 21 july 2011 by Skybuck Flying.
    program started.
    Device[0].Name: GeForce GT 520
    Device[0].MemorySize: 1008402432
    Device[0].MemoryClockFrequency: 600000000
    Device[0].GlobalMemoryBusWidthInBits: 64
    Device[0].Level2CacheSize: 65536
    Device[0].MultiProcessorCount: 1
    Device[0].ClockFrequency: 1620000000
    Device[0].MaxWarpSize: 32
    Setup...
    ElementCount: 8000
    BlockCount: 4000
    LoopCount: 80000
    Initialize...
    LoadModule...
    OpenEvents...
    OpenStream...
    SetupKernel...
    mKernel.Parameters.CalculateOptimalDimensions successfull.
    mKernel.Parameters.ComputeCapability: 2.1
    mKernel.Parameters.MaxResidentThreadsPerMultiProcessor: 1536
    mKernel.Parameters.MaxResidentWarpsPerMultiProcessor: 48
    mKernel.Parameters.MaxResidentBlocksPerMultiProcessor: 8
    mKernel.Parameters.OptimalThreadsPerBlock: 256
    mKernel.Parameters.OptimalWarpsPerBlock: 6
    mKernel.Parameters.ThreadWidth: 256
    mKernel.Parameters.ThreadHeight: 1
    mKernel.Parameters.ThreadDepth: 1
    mKernel.Parameters.BlockWidth: 16
    mKernel.Parameters.BlockHeight: 1
    mKernel.Parameters.BlockDepth: 1
    ExecuteKernel...
    ReadBackResults...
    DisplayResults...
    CloseStream...
    CloseEvents...
    UnloadModule...
    ExecuteCPU...
    Kernel execution time in seconds: 0.3385913085937500
    CPU execution time in seconds : 1.4263124922301578
    Cuda memory transactions per second: 945092186.0015719590000000
    CPU memory transactions per second : 224354762.1879504710000000
    program finished.
    "

    Conclusion: shared memory is HELL/SUPER FAST !

    Almost 4 times faster than the CPU ?!?!

    I am gonna do a little debug test with VS 2010, because this is almost
    unbelievable ! LOL. But I believe it... jeez ?! Cool.

    Though the GPU L1 cache is probably smaller than the CPU L1 cache, which could
    explain its higher speed.

    For real purposes I might require an even larger cache and then maybe the
    results will be different... but for now it's hopeful.
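
    For what it's worth, a single-thread shared-memory variant along the lines
    described in this post might look like the sketch below; the kernel name,
    parameter names and the chained-index layout are assumptions, not the actual
    version 0.12 source:

    // Hedged sketch: one thread walks the data blocks one after another, staging
    // each block in on-chip shared memory first (8000 x 4 bytes = 32000 bytes,
    // which fits in the 48 KB of shared memory per block on this device).
    __global__ void RandomWalkShared(const int *Memory, int *BlockResult,
                                     int BlockCount, int ElementCount, int LoopCount)
    {
      __shared__ int Cache[8000];                     // assumes ElementCount == 8000

      for (int BlockIndex = 0; BlockIndex < BlockCount; BlockIndex++)
      {
        const int *Block = Memory + BlockIndex * ElementCount;

        for (int i = 0; i < ElementCount; i++)        // copy one data block on-chip
        {
          Cache[i] = Block[i];
        }

        int Element = 0;
        for (int Loop = 0; Loop < LoopCount; Loop++)  // dependent walk in shared memory
        {
          Element = Cache[Element];
        }
        BlockResult[BlockIndex] = Element;
      }
    }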

    Bye,
    Skybuck.
     
    Skybuck Flying, Jul 22, 2011
    #5
  6. In reality this probably means the GPU is twice as fast as a dual core,
    since the dual core will also probably be twice as fast as a single core.

    So if a quad-core processor were to face a GT 520, they would both be about
    the same speed, would be my estimate, unless newer CPUs have even faster
    caches ;)

    Bye,
    Skybuck.
     
    Skybuck Flying, Jul 22, 2011
    #6
  7. Whoops, there was something wrong with the kernel and also with the kernel
    launch parameters.

    The kernel was doing only 1 block, and the launch parameters were 4000
    threads.

    Now the situation has been corrected.

    The kernel is doing 4000 blocks and only 1 thread.
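
    In launch-configuration terms the difference is roughly the sketch below; the
    kernel name is an assumption, and the grid/block shapes are taken from the two
    logs (16 x 256 before, 1 x 1 now):

    // Hedged sketch: the earlier (wrong) measurement effectively launched a
    // 16 x 256 grid, while the corrected run launches one block with one thread.
    __global__ void RandomWalkShared(const int *Memory, int *BlockResult,
                                     int BlockCount, int ElementCount, int LoopCount);

    void LaunchCorrectedTest(const int *Memory, int *BlockResult,
                             int BlockCount, int ElementCount, int LoopCount)
    {
      // RandomWalkShared<<<16, 256>>>(...);          // what was measured before, by accident
      RandomWalkShared<<<1, 1>>>(Memory, BlockResult, // one block, one thread
                                 BlockCount, ElementCount, LoopCount);
      cudaDeviceSynchronize();                        // wait for the single-thread walk
    }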

    It turns out it's fricking slow !


    Test Cuda Random Memory Access Performance.
    version 0.12 created on 21 july 2011 by Skybuck Flying.
    program started.
    Device[0].Name: GeForce GT 520
    Device[0].MemorySize: 1008402432
    Device[0].MemoryClockFrequency: 600000000
    Device[0].GlobalMemoryBusWidthInBits: 64
    Device[0].Level2CacheSize: 65536
    Device[0].SharedMemoryPerMultiProcessor: 49152
    Device[0].RegistersPerMultiProcessor: 32768
    Device[0].ConstantMemory: 65536
    Device[0].MultiProcessorCount: 1
    Device[0].ClockFrequency: 1620000000
    Device[0].MaxWarpSize: 32
    Setup...
    ElementCount: 8000
    BlockCount: 4000
    LoopCount: 80000
    Initialize...
    LoadModule...
    OpenEvents...
    OpenStream...
    SetupKernel...
    mKernel.Parameters.CalculateOptimalDimensions successfull.
    mKernel.Parameters.ComputeCapability: 2.1
    mKernel.Parameters.MaxResidentThreadsPerMultiProcessor: 1536
    mKernel.Parameters.MaxResidentWarpsPerMultiProcessor: 48
    mKernel.Parameters.MaxResidentBlocksPerMultiProcessor: 8
    mKernel.Parameters.OptimalThreadsPerBlock: 256
    mKernel.Parameters.OptimalWarpsPerBlock: 6
    mKernel.Parameters.ThreadWidth: 1
    mKernel.Parameters.ThreadHeight: 1
    mKernel.Parameters.ThreadDepth: 1
    mKernel.Parameters.BlockWidth: 1
    mKernel.Parameters.BlockHeight: 1
    mKernel.Parameters.BlockDepth: 1
    ExecuteKernel...
    ReadBackResults...
    DisplayResults...
    CloseStream...
    CloseEvents...
    UnloadModule...
    ExecuteCPU...
    Kernel execution time in seconds: 24.2583750000000000
    CPU execution time in seconds : 1.4263193366754714
    Cuda memory transactions per second: 13191320.5233244183900000
    CPU memory transactions per second : 224353685.5819891260000000
    program finished.

    (Picture already updated above).
     
    Skybuck Flying, Jul 22, 2011
    #7
  8. Just for the record,

    I also wrote a CPU test which can run on any x86 CPU.

    For the AMD X2 3800+ the results for a single core were as follows:

    Test CPU Random Memory Access Performance.
    version 0.01 created on 1 august 2011 by Skybuck Flying.
    program started.
    Setup...
    ElementCount: 8000
    BlockCount: 4000
    LoopCount: 80000
    Initialize...
    ExecuteCPU...
    CPU execution time in seconds : 0.7778037432131737
    CPU memory transactions per second : 411414836.7016757590000000
    program finished.

    So that's:

    411.414.836 random 32-bit integers per second (mostly from the CPU cache).

    This higher number is because of optimized code (no slow dynamic indexing and
    no slow get-element routine, so no call overhead).
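
    To give an idea of what such an optimized inner loop might look like, here is
    a hedged sketch (plain host-side C code); the names and the chained-index
    layout are assumptions, not the actual version 0.01 source:

    // Hedged sketch: a tight per-block random walk over a raw pointer with no
    // per-element accessor call. An 8000-element block is 32 KB, so the walk
    // stays mostly inside the CPU cache.
    static int WalkBlockOnCPU(const int *Block, int LoopCount)
    {
      int Element = 0;
      for (int Loop = 0; Loop < LoopCount; Loop++)
      {
        Element = Block[Element];                     // dependent read, index chain assumed
      }
      return Element;                                 // caller stores this in BlockResult[BlockIndex]
    }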

    Bye,
    Skybuck.
     
    Skybuck Flying, Aug 1, 2011
    #8
