Time for a little performance testing.
With the Delphi conversion code included it's:
0.0254 seconds for one frame of 500x400 with 32 bit colors, and a 3x32 bit
floating point texture format (r,g,b).
That's about 25 milliseconds...
Even using this shitty conversion code... this could mean:
1000 / 25.4 = 39 frames per second haha !
Actually the first time it seems to require 50 milliseconds, not sure why...
maybe the cpu cache getting filled or maybe it's disk activity from the
Delphi IDE or so... could be...
Now let's see how fast the draw is without this shitty conversion code and
canvas draw code.
It's about 0.006 seconds for opengl draw + read from texture.
That's about 6 milliseconds (with 500x400 vertex points as well !)
Now let's leave texture reading out of it... to see how fast it goes then !
It's 0.00032 seconds.
That's 0.32 milliseconds !
Holy shit batman... that's already pretty fast !
And this includes 500x400 vertices ! HAHA.
Let's see what frame rate would be for this:
1000 / 0.32 = 3125
Not bad.
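The three timings so far can be turned into frame rates with one small helper. This is only a sketch of the arithmetic above; the function name and the labels are made up for illustration, the numbers are the ones quoted in this post.

```python
def frames_per_second(frame_time_seconds):
    """Convert a measured per-frame time (in seconds) into frames per second."""
    return 1.0 / frame_time_seconds

# Frame times quoted in the post, in seconds.
timings = [
    ("with conversion + canvas draw", 0.0254),
    ("opengl draw + texture read", 0.006),
    ("opengl draw only", 0.00032),
]

for label, seconds in timings:
    print(f"{label:30s} {seconds * 1000:7.2f} ms  {frames_per_second(seconds):7.0f} fps")
```

The first call only counts as a rough figure, since the very first frame was seen to take about twice as long.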
I did notice the occasional hiccup... this could be the first time because
of loading the texture... and/or disk activity... I think it's disk activity
mostly.
Ok... now I reduce vertex points to something more realistic...
According to my last calculations posted in another sub thread... the number
of simulators would be:
1198.
There are two fields... each one can probably do a number of cell updates...
one pointer is needed for itself...
one pointer for somewhere else... and again a pointer for somewhere else...
So for itself at least 1, then maybe 2, then maybe another 2... so I think at
most 5 pointers are needed.
So 1195 * 5 = 5975 vertices needed... Now I go test its speed:
Time is: 0.0003126 seconds.
So it remains at 0.32 milliseconds per frame. Hmmm...
This is not so good.
This means:
3125 frames per second * 1198 simulators = 3.743.750 cycles per second.
This probably needs to be divided by 2, which gives 1.871.875 cycles per
second.
The dual core CPU was probably something like 80.000 * 200 = 16.000.000
cycles per second.
I must know for sure so I am going to start it to make sure.
Yup confirmed...
The dual core CPU can do 10 battles of 1 v 1 warriors with 100 rounds of
80.000 cycles each in 5 seconds.
This means the dual core is executing:
10 * 100 * 80.000 = 80.000.000 cycles in 5 seconds (no early kills, those
were disabled).
Which means it's executing 80.000.000 / 5 = 16.000.000 cycles per second on
the dual core.
Which means roughly 8.000.000 cycles per second per core.
So far a single cpu core seems about 4.3 times faster than the gpu estimate
(8.000.000 / 1.871.875)...
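The cpu side of the comparison is plain arithmetic, sketched here with the benchmark figures from this post. The variable names are made up; the gpu estimate is the 1.871.875 figure from above, which itself rests on the guessed division by 2.

```python
# Measured cpu benchmark from the post (early kills disabled).
battles = 10
rounds_per_battle = 100
cycles_per_round = 80_000
elapsed_seconds = 5

total_cycles = battles * rounds_per_battle * cycles_per_round
dual_core_cycles_per_second = total_cycles / elapsed_seconds
single_core_cycles_per_second = dual_core_cycles_per_second / 2

# Rough gpu estimate from the simple draw test (3125 fps * 1198 / 2).
gpu_estimate_cycles_per_second = 1_871_875

ratio = single_core_cycles_per_second / gpu_estimate_cycles_per_second
print(f"dual core:   {dual_core_cycles_per_second:,.0f} cycles/s")
print(f"single core: {single_core_cycles_per_second:,.0f} cycles/s")
print(f"single core vs gpu estimate: {ratio:.2f}x")
```

Splitting the dual core figure evenly over both cores is an assumption; the two cores were not measured separately.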
However I don't have a decent gpu implementation yet...
However seeing these numbers for this simple test raises big doubts about
whether the gpu version is gonna be any faster...
Maybe the clearing of the framebuffer has something to do with it... gonna
disable it and retest...
Depth test was disabled... clearing was disabled...
Time is now:
0.000237 seconds.
0.237 milliseconds
1000 / 0.237 = 4219
about 35% more performance (3125 -> 4219 frames per second).. still not
enough methinks.
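How big the gain looks depends on which earlier measurement is taken as the baseline: the rounded 0.32 milliseconds or the 0.3126 milliseconds from the reduced vertex test. A small sketch (helper name made up) shows both:

```python
def speedup_percent(old_time, new_time):
    """Extra frames per second, in percent, when the frame time drops
    from old_time to new_time (both in the same unit)."""
    return (old_time / new_time - 1.0) * 100.0

new_time = 0.000237  # seconds, clear and depth test disabled
for baseline in (0.00032, 0.0003126):
    print(f"baseline {baseline * 1000:.4f} ms -> "
          f"{speedup_percent(baseline, new_time):.1f}% more frames per second")
```

With timings that fluctuate this much, anything between roughly 30% and 35% is within the noise.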
Ok reloading the identity matrix does not seem necessary...
setting the cg world view thingy only needs to be done once it seems...
Performance is now
0.00019 seconds (fluctuating a bit... maybe at full speed it would be lower,
not sure),
which is about 0.19 milliseconds.
1000 / 0.19 = 5263 frames per second.
Let's see what this would give:
5263 * 1198 simulators = 6.305.074 cycles
divide by 2 : 3.152.537
Still very poor.
Hmmm I see I have three frame buffer textures active... I only need 2...
Gonna disable one to see if that helps.
Yes that helped a bit...
0.000156 seconds.
Ok I don't need 32 bit floating points...
Only 16 bit floating points... gonna change textures to 16 bit...
This should give good improvement.
Hmmm nope still 0.00015 seconds...
I am starting to wonder if the time I am measuring is actually the cpu
overhead of the api calls... hmm...
Hmm could be... if that's the case... then reducing the number of api calls
could give more speed... I wonder if enabling the profiles all the time is
necessary, maybe not... is binding the programs every time necessary ? I
don't know...
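One way to check the api overhead suspicion is to fit a straight line (fixed cost per frame + cost per vertex) through the two earliest draw-only measurements: 0.32 milliseconds at 500x400 vertices and 0.3126 milliseconds at 5975 vertices. This is only a sketch: two noisy points are a very thin basis, and the linear cost model is an assumption, not something measured.

```python
def fit_line(point_a, point_b):
    """Return (slope, intercept) of the line through two (x, y) points."""
    (x1, y1), (x2, y2) = point_a, point_b
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept

# (vertex count, frame time in seconds), both taken from the post.
many_vertices = (500 * 400, 0.00032)
few_vertices = (5975, 0.0003126)

per_vertex, fixed = fit_line(few_vertices, many_vertices)
fixed_share = fixed / few_vertices[1]

print(f"fixed cost per frame: {fixed * 1e3:.4f} ms")
print(f"cost per vertex:      {per_vertex * 1e9:.4f} ns")
print(f"fixed share at 5975 vertices: {fixed_share:.1%}")
```

If the fixed part really dominates like this, the vertex count hardly matters and the remaining time is per-frame cost such as api calls or round trips, which fits the suspicion above.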
For now the speed would be:
1000 / 0.156 = 6410 frames per second, assuming the bigger texture is no
problem.
Final speed would be:
(1198 * 6410) / 2 = 3.839.743 cycles per second, something like that.
Almost half of what a single cpu core would achieve.
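The optimization steps so far can be put side by side. A minimal sketch, assuming each frame advances every one of the 1198 simulators once and keeping the guessed division by 2; the step labels are my own summary of the changes described above.

```python
SIMULATORS = 1198  # simulator count quoted in the post

def cycles_per_second(frame_time_s, simulators=SIMULATORS, divisor=2):
    """Estimated simulator cycles per second for one measured frame time.

    Assumes every frame advances each simulator once; the divisor of 2
    is the post's own guess.
    """
    return simulators / frame_time_s / divisor

steps = [
    ("reduced vertex count", 0.00032),
    ("clear + depth test disabled", 0.000237),
    ("matrix / cg setup done once", 0.00019),
    ("third framebuffer texture off", 0.000156),
]

for label, seconds in steps:
    print(f"{label:30s} {seconds * 1000:6.3f} ms "
          f"{1 / seconds:6.0f} fps {cycles_per_second(seconds):10.0f} cycles/s")
```

The last row uses the unrounded frame rate, which is why it lands on about 3.839.743 rather than exactly (1198 * 6410) / 2.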
This is assuming the vertex/pixel shaders and texture lookups don't add any
significant delays or overheads... even in the largest case.
I go do a little test with a large texture to see what happens...
4096x4096 gives about the same speed...
For now I am worried this is not a good situation.
But there might be a solution... instead of stuffing the entire cores into
the frame buffer... the opposite could happen...
Only the instructions are stuffed into the frame buffer, plus vertices and
such for processing... and the cores themselves are stuffed into textures
which will not be rendered to...
Instead something else will update those textures... this could be done by
the cpu...
I was thinking about doing a single 4096x4096 texture map executor with 1198
simulators... but seeing these presumed api call overheads or round trip
times to the gpu makes it doubtful that it would be any faster... it probably
would not... therefore the strategy has to be rethought and changed.
For now the number of texture inputs seems to be limited to 6 to 8
TEXCOORDs... these texcoords are necessary to supply the necessary
information to tex2D or texRECT or something like that...
But not really... each vertex could simply ignore those texcoord
semantics... and simply use the first one...
The texture coordinates themselves could use an additional coordinate, for
example the Z coordinate or an additional TEXCOORD2 or maybe the NORMAL
coordinate, to indicate which texture to use.
This way the number of texture maps in the gpu could be endless... however
the memory is limited.
Now let's try to fit as many simulators/cores/warriors as possible into the
gpu... the new calculation becomes:
512 MB / (14000 * 6 bytes) =
536870912 / 84000 = 6391 simulators in the gpu. The necessary instruction
pointers plus additional fields per simulator should fit easily in one
framebuffer so I am not worried about that.
Using this new figure the number of cycles would be:
6391 * 6410 = 40.966.310, divided by 2 = 20.483.155 cycles per second.
Compared to the dual core cpu which does 16.000.000 cycles per second this is
only about 1.3 times faster, which is still very weak ?!
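The capacity and throughput estimate can be redone in a few lines. A sketch only: it takes the 14000 * 6 bytes per simulator, the 6410 frames per second and the division by 2 from this post at face value, and small differences from hand-computed figures come down to rounding.

```python
GPU_MEMORY_BYTES = 512 * 1024 * 1024   # 512 MB, as in the post
BYTES_PER_SIMULATOR = 14_000 * 6       # the post's 14000 * 6 bytes
FRAMES_PER_SECOND = 6410               # from the 0.156 ms measurement
CPU_CYCLES_PER_SECOND = 16_000_000     # measured dual core figure

# Whole simulators that fit; framebuffer and other overhead are ignored.
simulators = GPU_MEMORY_BYTES // BYTES_PER_SIMULATOR
gpu_cycles_per_second = simulators * FRAMES_PER_SECOND / 2
ratio = gpu_cycles_per_second / CPU_CYCLES_PER_SECOND

print(f"simulators that fit: {simulators}")
print(f"gpu estimate:        {gpu_cycles_per_second:,.0f} cycles/s")
print(f"gpu vs dual core:    {ratio:.2f}x")
```

This still assumes the frame time stays at 0.156 milliseconds once real shader work and texture lookups are added, which has not been tested.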
However the gpu code I used to test is not running at full speed.. so who
knows what will happen... but for now I base it on what I see...
At best with a little bit of luck... two extra threads could be added which
feed the necessary data to the gpu... so that the gpu can do some processing
as well... this way the final speed would be doubled.
However this does require feeding the gpu with 6391 battles ?! which is
quite a lot...
This would be a battlefield size of almost 80x80, and then for the cpu a
little bit of extra fields are necessary...
2x2 or so...
Hmm... this makes it difficult to distribute the battles across gpu or
cpu...
Gpu needs a lot of battles to be efficient... when gpu is done it would have
to wait for the cpu to finish the remaining battles...
Then the next round of battles could occur..
Figuring out the sweet spot for gpu and cpu is what would be necessary...
Also the results for the gpu could vary... it might finish sooner because of
favorable battles... then the gpu has nothing to do anymore... not enough
extra battles... unless the gpu is made flexible... but that would be
dangerous because then it could take long...
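A first guess at the sweet spot can be computed from the two rate estimates: how many battles can the cpu finish in the time the gpu needs for its full batch? This is a sketch under strong assumptions (every battle costs the same number of cycles, both rates stay constant, and feeding the gpu costs the cpu nothing), and the function name is made up.

```python
def cpu_battles_for_balance(gpu_battles, gpu_rate, cpu_rate):
    """Battles the cpu can finish while the gpu works through gpu_battles.

    Assumes equal cost per battle and constant rates (cycles per second),
    which early kills and api call overhead would both break.
    """
    return int(gpu_battles * cpu_rate / gpu_rate)

gpu_battles = 6391                    # simulators that fit in 512 MB
gpu_rate = gpu_battles * 6410 / 2     # estimated gpu cycles per second
cpu_rate = 16_000_000                 # measured dual core cycles per second

cpu_battles = cpu_battles_for_balance(gpu_battles, gpu_rate, cpu_rate)
print(f"gpu: {gpu_battles} battles, cpu: {cpu_battles} battles per round")
```

The result is only as good as the two rate estimates, so the real split would have to be found by measuring.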
For now it does seem like the gpu might add some performance benefits... but
the cpu would also be tasked with doing lots of api calls which might eat
into the performance of the cpu simulators themselves which would not be a
good thing...
I have also started wondering if the cpu code/simulator code can be changed
to represent a more gpu like approach... maybe the cpu can act like a gpu as
well... more like in a streaming fashion...
Two approaches are possible:
1. Assume this theory and change/reimplement the simulator on cpu to see if
cpu can do stream processing.
or
2. Convert delphi simulator code to c/c++ to analyze it in visual studio to
see what the actual bottleneck is on the cpu... is it bandwidth ? is it cpu
execution ? is it stalls because of reads and writes ?
If it's the last then maybe the cpu code can be altered to run even faster.
I am very curious about that... seeing these poor results for the gpu has
made me doubt if I should continue with this corewars/gpu project.
The GPU might be nice for a video codec though.
But I am even more curious about getting more simulator speed... so I have
different directions to try... I am having heavy doubts about which path to
choose...
I think it's best if I try to do some cpu benchmarking test to try out the
"cpu streaming" theory !
Streaming vs non-streaming cpu test !
That's what I should do first !
And if no difference is found... maybe an analysis in c/c++ to see what
actual bottleneck is ?!
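A skeleton for such a test could look like this. It reads "streaming" as visiting the core cells in sequential order and "non-streaming" as visiting them in a scattered order, which is my interpretation and an assumption. In Python the interpreter overhead hides most cache effects, so this only shows the shape of the benchmark; the real test would have to be written in Delphi or C/C++.

```python
import random
import time

def make_orders(cell_count, seed=1):
    """Return a sequential and a shuffled visiting order over all cells."""
    sequential = list(range(cell_count))
    scattered = sequential[:]
    random.Random(seed).shuffle(scattered)
    return sequential, scattered

def touch_all(core, order):
    """Read every cell once in the given order; return (checksum, seconds)."""
    start = time.perf_counter()
    checksum = 0
    for index in order:
        checksum += core[index]
    return checksum, time.perf_counter() - start

CELLS = 200_000  # arbitrary size, chosen only to keep the run short
core = list(range(CELLS))
sequential, scattered = make_orders(CELLS)

sum_seq, time_seq = touch_all(core, sequential)
sum_rnd, time_rnd = touch_all(core, scattered)

print(f"sequential: {time_seq * 1000:.2f} ms")
print(f"scattered:  {time_rnd * 1000:.2f} ms")
```

Both runs do the same work and produce the same checksum, so any timing difference comes from the access order alone.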
However there is another possibility...
What if the gpu code could run entirely inside a shader with multiple
passes ?!
Then all these opengl api calls would not be necessary anymore...
However I am not sure if such an algorithm is possible... it might be
though... if the output from the pixel shaders can be redirected into the
vertex shaders/textures for the next pass...
I don't know if that's possible, so that's another "research/benchmark"
direction to try out.
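The feedback idea itself, where the output of one pass becomes the input of the next, is usually done with two buffers that swap roles after each pass (often called ping-pong buffering). Here is a cpu-side sketch of that pattern with made-up names; it says nothing about whether a gpu can do this without api calls between passes, which is exactly the open question.

```python
def run_passes(state, kernel, passes):
    """Apply kernel to every cell for a number of passes using two buffers.

    Each pass reads only from src and writes only to dst, like a shader
    reading one texture while rendering into another.
    """
    src = list(state)
    dst = [0] * len(src)
    for _ in range(passes):
        for i in range(len(src)):
            dst[i] = kernel(src, i)
        src, dst = dst, src  # the output of this pass feeds the next one
    return src

def copy_from_left(src, i):
    """Toy kernel: each cell takes the value of its left neighbour (wrapping)."""
    return src[i - 1]

print(run_passes([1, 2, 3, 4], copy_from_left, 2))
```

The toy kernel reads a neighbouring cell on purpose: with a single buffer the second cell would already see the updated first cell, which is why the read and write buffers are kept separate.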
So two early benchmarks to try out:
1. Running code entirely inside a gpu shader with multiple passes and frame
buffer/texture feedback ?!
^ Somehow only a few texels on the textures need to be updated for good
speed though ?!? If that's not possible, use full frame shaders; maybe that's
not so bad for performance ? But I doubt it !
2. Running streaming like vs non-streaming like code on the cpu to see if
there is a difference.
^ Final conclusion: more research needed into possibilities !
Bye,
Skybuck.