My latest insights into the possibility of executing corewars on a gpu have
made me doubt whether the performance is going to be any good... it's
probably not going to be any faster than a cpu... maybe even significantly
slower, depending on the number of passes that are needed.
The calculations also assume that all executors would actually run in
parallel at full speed, which is probably a flawed assumption... so the
ultimate performance could be far worse for the gpu.
Conclusions for parallel processors:
1. Huge memory requirements just to be able to store stuff and also cache
stuff.
This is mostly where my current graphics card is kinda lacking... only 512
MB... that's not really that much for parallel stuff... where each parallel
element only gets a little bit of work done.
I could continue trying to develop something... but I now have serious
doubts that it would achieve any good speed... at least with the current
design... which is probably a very good design... maybe the best one... only
the other idea, the speculative execution one, might give some performance
benefit... but I doubt that will be any good for sequential warriors...
unless something more complex is done, like loop iteration prediction per
processing element or so... that's a bit too advanced for my taste...
I think it's time to start spending my time on other projects...
Maybe in the future, when programming has become easier and more resources
are available, I might give it another try... but using opengl/cg shaders
probably has too much programming overhead and especially too little
resources available... hardware wise as well... too little memory.
It's kinda a bummer...
I shall do one last calculation which would be an optimistic calculation
just to see if something can be done:
(4 input textures + 4 output textures) * 4 elements per texture * 3 bytes =
96 bytes.
512 MB / 96 bytes = 5.33333333 mega elements per texture.
sqrt(5.3333333 mega) = 2364x2364 texture size or so.
core size 8000 + 2 warriors * (8000 processes + 500 pspace) = 8.000 +
17.000 = 25.000 elements + 10 for a little overhead or so...
That means 2364*2364 / 25010 = 223 simulators in the gpu at best.
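
As a rough sketch, the capacity estimate above can be redone in a few lines
of Python. All constant and function names are made up for illustration, MB
is taken as 1024*1024 bytes, and the "4 elements" figure is read as 4
elements per texture entry, which is an assumption:

```python
import math

# Figures from the estimate above; names are hypothetical.
VIDEO_MEMORY_BYTES = 512 * 1024 * 1024  # 512 MB graphics card
TEXTURES = 4 + 4                        # 4 input + 4 output textures
ELEMENTS_PER_ENTRY = 4                  # assumed: 4 elements per texture entry
BYTES_PER_ELEMENT = 3
CORE_SIZE = 8000
WARRIORS = 2
PROCESSES_PER_WARRIOR = 8000
PSPACE_PER_WARRIOR = 500
OVERHEAD_ELEMENTS = 10


def bytes_per_entry():
    """Memory needed for one entry across all input and output textures."""
    return TEXTURES * ELEMENTS_PER_ENTRY * BYTES_PER_ELEMENT


def texture_side():
    """Side length of the largest square texture set that fits in memory."""
    entries = VIDEO_MEMORY_BYTES // bytes_per_entry()
    return math.isqrt(entries)


def elements_per_simulator():
    """Core plus process queues and pspace for both warriors, plus overhead."""
    per_warrior = PROCESSES_PER_WARRIOR + PSPACE_PER_WARRIOR
    return CORE_SIZE + WARRIORS * per_warrior + OVERHEAD_ELEMENTS


def simulators_that_fit():
    """How many complete simulators fit into one square texture."""
    side = texture_side()
    return (side * side) // elements_per_simulator()


print(bytes_per_entry(), texture_side(), elements_per_simulator(),
      simulators_that_fit())
```

Changing the memory size or the per-warrior limits shows quickly how
sensitive the simulator count is to those assumptions.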
Cycles per simulator could be anywhere from 1000 to 100.000 cycles per
second.
Worst case scenario: 223 * 1000 = 223.000 cycles per second... could even be
worse if not fully executed in parallel... but the gpu does have many
cores... like 200, so it might actually execute in parallel.
Best case scenario: 223 * 100.000 = 22.300.000 cycles per second for the
entire gpu.
This is pretty optimistic... probably a bit too optimistic... probably more
passes required... or maybe not...
but let's say 22 million cycles per second for the gpu.
Cpu achieves 16 million for dual core... so gpu is not really spectacular...
and I need something spectacular...
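
A small sketch of the worst case / best case comparison, using the
rounded-down count of 223 simulators and the 16 million cycles per second
quoted for the dual core cpu. The names are invented for this example, and
it assumes all simulators really do run fully in parallel:

```python
# Figures from the estimate above; names are hypothetical.
SIMULATORS = 223
WORST_CYCLES_PER_SIMULATOR = 1_000     # cycles per second per simulator
BEST_CYCLES_PER_SIMULATOR = 100_000    # cycles per second per simulator
CPU_CYCLES_PER_SECOND = 16_000_000     # dual core cpu figure


def gpu_cycles_per_second(cycles_per_simulator):
    """Total gpu throughput, assuming every simulator runs in parallel."""
    return SIMULATORS * cycles_per_simulator


def speedup_over_cpu(cycles_per_simulator):
    """Ratio of gpu throughput to cpu throughput (above 1 means gpu wins)."""
    return gpu_cycles_per_second(cycles_per_simulator) / CPU_CYCLES_PER_SECOND


for label, rate in (("worst", WORST_CYCLES_PER_SIMULATOR),
                    ("best", BEST_CYCLES_PER_SIMULATOR)):
    print(label, gpu_cycles_per_second(rate), round(speedup_over_cpu(rate), 3))
```

Even in the best case the ratio stays below 1.4, and in the worst case the
gpu ends up far below the cpu.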
The 100.000 above is assuming that opengl doesn't need to bind the cg
program all the time...
It probably would need to rebind... so that would make it 10 times slower
or so... so the gpu might actually achieve only about 2 million cycles per
second, which would be bad.
So the conclusion in short:
It's like having a cpu which can do 223 cycles in parallel... but it can
only do that 10.000 times per second or so... so the final speed would be:
2.230.000 cycles per second... which is just miserable.
So that's my latest guess at what the performance would be... miserable!
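
To see how much the rebinding guess matters, here is a minimal sketch that
treats the slowdown as a parameter. The 10 times figure is only the guess
from above, the names are hypothetical, and the model simply divides the
best case rate by the slowdown:

```python
# Figures from the estimate above; names are hypothetical.
SIMULATORS = 223
PASSES_PER_SECOND_NO_REBIND = 100_000  # best case, no rebinding needed
CPU_CYCLES_PER_SECOND = 16_000_000     # dual core cpu figure


def gpu_cycles_per_second(rebind_slowdown):
    """Gpu throughput when rebinding makes every pass this many times slower."""
    return SIMULATORS * PASSES_PER_SECOND_NO_REBIND / rebind_slowdown


def break_even_slowdown():
    """Largest slowdown at which the gpu still matches the cpu."""
    return SIMULATORS * PASSES_PER_SECOND_NO_REBIND / CPU_CYCLES_PER_SECOND


for slowdown in (1, 2, 5, 10):
    print(slowdown, int(gpu_cycles_per_second(slowdown)))
print("break even at about", round(break_even_slowdown(), 2))
```

Under these assumptions, any rebinding cost that slows the passes down by
more than roughly 1.4 times already puts the gpu behind the cpu.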
Bye,
Skybuck.