Interesting news in short: GPU cache is about 4 times faster than CPU cache!

=D
(Version 0.10, which still uses GPU RAM instead of GPU cache, is also available.)
(Version 0.12 is the GPU cache version, but it is still unreleased. =D)
OK, the shared memory kernel is done. It also executes 4000 blocks, but
this time sequentially.
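For readers who want a feel for what such a test does, here is a rough CPU-side sketch in Python. It is not the actual kernel: the names make_block and run_blocks are made up for illustration, and it assumes that each block is a table of indices that gets followed LoopCount times, one block after the other.

```python
import random
import time

def make_block(element_count, rng):
    """Build one block: every entry holds the index to visit next.
    The entries form a single random cycle (an assumption, not the original layout)."""
    order = list(range(element_count))
    rng.shuffle(order)
    block = [0] * element_count
    for i in range(element_count):
        block[order[i]] = order[(i + 1) % element_count]
    return block

def run_blocks(blocks, loop_count):
    """Process the blocks sequentially; each step is one dependent random read."""
    results = []
    for block in blocks:
        index = 0
        for _ in range(loop_count):
            index = block[index]
        results.append(index)
    return results

if __name__ == "__main__":
    rng = random.Random(2011)
    # Much smaller than ElementCount 8000 / BlockCount 4000 / LoopCount 80000,
    # so that plain Python finishes quickly.
    blocks = [make_block(800, rng) for _ in range(40)]
    start = time.perf_counter()
    run_blocks(blocks, 8000)
    elapsed = time.perf_counter() - start
    print("transactions per second:", 40 * 8000 / elapsed)
```

Because every read depends on the previous one, the loop measures access latency rather than bandwidth, which is why the size of the table relative to the cache matters so much.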
These results made my jaw drop! LOL... They offer possibilities/hope
for CUDA.
Just a single CUDA thread did this:
http://www.skybuck.org/CUDA/RAMTest/...MemoryTest.png
Text:
"
Test Cuda Random Memory Access Performance.
version 0.12 created on 21 july 2011 by Skybuck Flying.
program started.
Device[0].Name: GeForce GT 520
Device[0].MemorySize: 1008402432
Device[0].MemoryClockFrequency: 600000000
Device[0].GlobalMemoryBusWidthInBits: 64
Device[0].Level2CacheSize: 65536
Device[0].MultiProcessorCount: 1
Device[0].ClockFrequency: 1620000000
Device[0].MaxWarpSize: 32
Setup...
ElementCount: 8000
BlockCount: 4000
LoopCount: 80000
Initialize...
LoadModule...
OpenEvents...
OpenStream...
SetupKernel...
mKernel.Parameters.CalculateOptimalDimensions successfull.
mKernel.Parameters.ComputeCapability: 2.1
mKernel.Parameters.MaxResidentThreadsPerMultiProcessor: 1536
mKernel.Parameters.MaxResidentWarpsPerMultiProcessor: 48
mKernel.Parameters.MaxResidentBlocksPerMultiProcessor: 8
mKernel.Parameters.OptimalThreadsPerBlock: 256
mKernel.Parameters.OptimalWarpsPerBlock: 6
mKernel.Parameters.ThreadWidth: 256
mKernel.Parameters.ThreadHeight: 1
mKernel.Parameters.ThreadDepth: 1
mKernel.Parameters.BlockWidth: 16
mKernel.Parameters.BlockHeight: 1
mKernel.Parameters.BlockDepth: 1
ExecuteKernel...
ReadBackResults...
DisplayResults...
CloseStream...
CloseEvents...
UnloadModule...
ExecuteCPU...
Kernel execution time in seconds: 0.3385913085937500
CPU execution time in seconds : 1.4263124922301578
Cuda memory transactions per second: 945092186.0015719590000000
CPU memory transactions per second : 224354762.1879504710000000
program finished.
"
Conclusion: shared memory is HELL/SUPER FAST!
A bit more than 4 times faster than the CPU (0.339 s versus 1.426 s)?!?!
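The numbers in the log can be checked with a few lines of Python. This is only a sanity check, assuming that one "memory transaction" is one loop iteration in one block, so the total is BlockCount * LoopCount. The constants are copied from the log above.

```python
# Values copied from the log above.
BLOCK_COUNT = 4000
LOOP_COUNT = 80000
KERNEL_SECONDS = 0.3385913085937500
CPU_SECONDS = 1.4263124922301578

# Assumption: one transaction per loop iteration per block.
transactions = BLOCK_COUNT * LOOP_COUNT

gpu_rate = transactions / KERNEL_SECONDS
cpu_rate = transactions / CPU_SECONDS
speedup = CPU_SECONDS / KERNEL_SECONDS

print("total transactions:", transactions)
print("GPU transactions per second:", round(gpu_rate))
print("CPU transactions per second:", round(cpu_rate))
print("speedup:", round(speedup, 2))
```

The two rates agree with the "transactions per second" lines in the log, and the speedup comes out a little above 4.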
I am going to do a little debug test with VS 2010, because this is almost
unbelievable! LOL. I do believe it, but geez?! Cool.
Though the GPU L1 cache is probably smaller than the CPU L1 cache, which could
explain its higher speed.
For real purposes I might require an even larger cache, and then the
results may be different... but for now it's hopeful.
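To get an idea of how much room is left, here is a small Python sketch. It assumes 4-byte elements (the log does not say) and the 48 KB per-block shared memory maximum of compute capability 2.x devices. The helper fits_in_shared is just a name made up for this example.

```python
# Assumptions: 4-byte elements and at most 48 KB of shared memory per block
# (the larger of the two configurations on compute capability 2.x).
ELEMENT_SIZE_BYTES = 4
SHARED_MEMORY_BYTES = 48 * 1024

def fits_in_shared(element_count,
                   element_size=ELEMENT_SIZE_BYTES,
                   capacity=SHARED_MEMORY_BYTES):
    """Return True if a table of element_count entries fits in shared memory."""
    return element_count * element_size <= capacity

max_elements = SHARED_MEMORY_BYTES // ELEMENT_SIZE_BYTES

print("bytes needed for 8000 elements:", 8000 * ELEMENT_SIZE_BYTES)
print("fits:", fits_in_shared(8000))
print("largest table that still fits:", max_elements, "elements")
```

Under these assumptions the 8000-element table from the log fits with room to spare, but a table that is much larger would have to fall back to global memory.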
Bye,
Skybuck.