Hello,
Below I will describe my preliminary ideas/concepts for parallel
battle/core executors for Core War/Redcode on the GPU.
I will focus on sequential execution per battle. However, the battles
themselves will be executed in parallel so that multiple battles can take
place at the same time.
This would require detecting when a battle is done so that it can be
replaced by a new battle if more battles are to be processed. Otherwise
the "done" battle would not be executed anymore and would simply wait
until all other battles are done, so that for example an evolution
algorithm can start after all battles are done.
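To make the "replace a done battle" idea concrete, here is a minimal CPU-side sketch in Python. It is only an illustration, not GPU code: the names run_battles, step and max_cycles are hypothetical, and step stands in for "execute one cycle of this battle and report whether it is finished".

```python
from collections import deque

def run_battles(pending, num_slots, step, max_cycles=80000):
    """Advance up to num_slots battles in lock-step and refill finished slots.

    pending    : iterable of battle objects (any type)
    step       : hypothetical callback, runs one cycle and returns True when done
    max_cycles : assumed cycle limit after which a battle counts as finished
    """
    queue = deque(pending)
    slots = [queue.popleft() if queue else None for _ in range(num_slots)]
    cycles = [0] * num_slots
    finished = []
    while any(battle is not None for battle in slots):
        for i, battle in enumerate(slots):
            if battle is None:
                continue  # idle slot: nothing left to schedule here
            done = step(battle)
            cycles[i] += 1
            if done or cycles[i] >= max_cycles:
                finished.append(battle)
                # Replace the finished battle, or leave the slot idle.
                slots[i] = queue.popleft() if queue else None
                cycles[i] = 0
    return finished
```

Once run_battles returns, all battles are done and something like an evolution algorithm could start.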
The idea is to stuff all the cores into texture(s), as well as all other
necessary data like instruction pointers, process queues and possibly even
p-space.
For now I will focus on cores and instruction pointers/instruction execution
only... because if that can be done then the rest should be easily doable as
well.
It seems like primarily two things need to happen:
1. Instructions need to be executed.
2. Cores need to be updated.
I will now explain these two steps further.
Step 1:
Executing an instruction may require reads and writes. Reading is not a
problem on the GPU. Writing is a problem: a shader cannot write to an
arbitrary cell, it can only write to the cell it is currently processing.
That is not sufficient, which is why step 2 is needed.
However to solve this problem for step 1 the following might be possible:
"Artificial registers will be created".
These registers will hold all information about the execution of an
instruction.
So for example:
1. What data/memory cells were read by the instruction ? (might not be
necessary)
2. What data/memory cells were written by the instruction ?
3. What was the read data's content ?
4. What was the written data's content ?
There will be a worst case scenario, meaning that even the most complex
instruction will only affect X cells, say 5 or so.
So there only need to be enough registers to hold all possible data for a
"worst case instruction".
Now the GPU can happily execute an instruction and record all the necessary
information.
This could be done in a vertex program/shader as step 1. Maybe vertex
shaders could be used to set depth values for step 2/textures so that only
the necessary cells become active.
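A minimal sketch of step 1 in Python, to show what "execute and record" could look like. This is not real Redcode: Cell, ExecRecord and execute are hypothetical names, only MOV, ADD, JMP and DAT are handled, all addressing is treated as direct, and ADD treats its A field as an immediate value. The point is only that the core is read but never modified; everything that should happen is returned as a record.

```python
from typing import NamedTuple, Optional

class Cell(NamedTuple):
    op: str  # opcode, e.g. "MOV"
    a: int   # A field
    b: int   # B field

class ExecRecord(NamedTuple):
    write_addr: Optional[int]    # which cell was written, if any
    write_value: Optional[Cell]  # what must be written there
    next_ip: Optional[int]       # None means the process dies

def execute(core, ip):
    """Step 1: read the core and record the effect, without touching the core."""
    size = len(core)
    ins = core[ip]
    src = (ip + ins.a) % size  # assumption: direct addressing only
    dst = (ip + ins.b) % size
    if ins.op == "MOV":
        return ExecRecord(dst, core[src], (ip + 1) % size)
    if ins.op == "ADD":
        # Simplification: add the A field as an immediate to the target's B field.
        target = core[dst]
        new_cell = target._replace(b=(target.b + ins.a) % size)
        return ExecRecord(dst, new_cell, (ip + 1) % size)
    if ins.op == "JMP":
        return ExecRecord(None, None, src)
    return ExecRecord(None, None, None)  # DAT or anything else: process dies
```

A real executor would need more register fields (up to the "worst case instruction"), but the shape stays the same.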
Ok I will now explain step 2.
Step 2:
In step 2 the cores need to be updated, along with any other
variables/arrays like p-space, process queues and whatnot. Again I will
limit myself to cores for now, since the rest could more or less use the
same technique.
I am not sure what implementation possibilities exist, but in the worst
case:
All core cells would be examined by a fragment/pixel shader. It moves
across the cells and at each cell it checks whether this cell was affected
by the instruction that was executed and, if so, how the cell needs to be
altered to comply with the instruction.
This is the update.
Should be pretty easy to do: just a few comparisons would be needed against
the recorded information/fields about the instruction.
For example, to give you an idea if you don't have one:
We are at cell 5, so ask the question: Did the instruction write to cell 5
(us) ?
If so, then ask the next question: What is the content that must be written
to cell 5 ? The answer is again in the recorded instruction information
fields.
I am pretty much convinced that this should be easily doable.
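Here is a small sketch of that per-cell question in Python. The names update_cell and update_core are made up for illustration, and writes is assumed to be the list of (address, value) pairs recorded for the executed instruction. update_cell is what a single fragment would compute for its own cell; update_core just applies it to every cell, which is the worst case where nothing is deactivated.

```python
def update_cell(index, old_value, writes):
    """What one fragment would compute for the cell at `index`.

    writes: recorded (address, new_value) pairs of the executed instruction.
    """
    for addr, value in writes:
        if addr == index:      # "Did the instruction write to us?"
            return value       # "What content must be written?"
    return old_value           # not affected: keep the old content

def update_core(core, writes):
    """Worst case: every cell is visited and asks the same question."""
    return [update_cell(i, cell, writes) for i, cell in enumerate(core)]

core = ["DAT"] * 8
new_core = update_core(core, [(5, "MOV 0, 1")])
```

Note that the old core is only read and a new core is produced, which matches how a fragment shader reads one texture and writes another.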
The question is: can all cells except those that the instruction affected
somehow be deactivated in the vertex program ?
So to cut the question down to something simple that graphics programmers
can understand:
1. Is it possible for a vertex shader to disable certain
processing/pixels/fragments in the fragment shader ? (Can vertex shaders
dismiss some of the work ?)
The answer is probably yes.
2. The second question would then be: how is this possible
implementation-wise ?
(I am not sure, but I read something about the following.)
2.1 Maybe scissors.
2.2 Maybe occlusion queries.
2.3 Maybe depths.
Maybe you graphics guys can explain further or have some ideas ?
Anyway, vertex shaders can also access textures on more recent graphics
cards. So the vertex shaders could also already do some processing of the
instruction if necessary. I would not expect any problems here...
So lots of possibilities, methinks.
So let's see if I covered everything concerning instruction execution and
core updates.
The last thing that needs to be done is to prepare for the next
cycle/round. So all that needs to be done is to specify the next
instruction pointer, if any, and possibly a spawn instruction pointer, and
that is also done at step 1. This information could simply be written back
to the instruction/vertices themselves, unless they need to be preserved
for step 2. In that case some extra output registers are needed, but all in
all it should be doable.
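As a rough sketch of "preparing the next cycle", this is how the two pointers from step 1 could feed a warrior's process queue. advance_queue and max_processes are hypothetical names. The order (next instruction pointer first, then the spawn pointer) follows how SPL is usually described for Redcode, and the limit of 8000 processes is just an assumed setting.

```python
from collections import deque

def advance_queue(queue, next_ip, spawn_ip, max_processes=8000):
    """Return the process queue for the next cycle.

    queue    : current instruction pointers, front = process that just ran
    next_ip  : where that process continues, or None if it died
    spawn_ip : start of a newly spawned process, or None
    """
    q = deque(queue)
    q.popleft()                    # the process that just executed
    if next_ip is not None:
        q.append(next_ip)          # it goes to the back of the queue
    if spawn_ip is not None and len(q) < max_processes:
        q.append(spawn_ip)         # the spawned process is queued after it
    return list(q)
```

For example, advance_queue([10, 20], 11, 50) gives [20, 11, 50], and a process that dies without spawning simply disappears from the queue.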
Some further explanations about step 1:
Even at step 1 vertex shaders can probably only modify the vertices
themselves.
So input vertex = output vertex.
So to be able to do a sort of 1 to many, there need to be many input
vertices = output vertices.
Each vertex would be a certain type/data.
The vertex could be the instruction pointer or
The vertex could be the instruction field A or
The vertex could be the instruction field B or
The vertex could be the instruction modifier or
The vertex could be the next instruction pointer or
The vertex could be the spawn instruction pointer.
So for every possibility/output there is an input.
Depending on the location of the vertex it would be a certain type.
So the vertex at the start of its processing only has to figure out what
kind of vertex it is and what it should be looking for/what kind of
processing it should do for the instruction.
So for example:
If vertex location is number 4 then vertex is the instruction modifier.
So it's like an array of integers/floats, where each element performs a
specific role.
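A tiny sketch of that "location decides the role" idea in Python. The role names and the function vertex_role are made up for illustration, and locations are assumed to be counted from 1 so that location 4 is the instruction modifier, as in the example above.

```python
# Assumed order, taken from the list of vertex types above (counted from 1).
ROLES = (
    "instruction_pointer",        # location 1
    "field_a",                    # location 2
    "field_b",                    # location 3
    "modifier",                   # location 4
    "next_instruction_pointer",   # location 5
    "spawn_instruction_pointer",  # location 6
)

def vertex_role(location):
    """Map a vertex location (1-based) to the role that vertex plays."""
    if not 1 <= location <= len(ROLES):
        raise ValueError("no role defined for location %d" % location)
    return ROLES[location - 1]
```

Each vertex would first call something like vertex_role on its own location and then do only the processing that belongs to that role.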
I hope I made myself perfectly clear and I think I did !
Maybe a little bit of parallel processing for the instruction could happen.
If a vertex needs information from multiple locations to do its thing then
that should not be a problem since the vertex can do multiple reads/gathers.
Alternatively larger data types like float4 could be used to do more in the
same shader, but that is probably not really necessary. An array of
single floats might be enough. The choice would be arbitrary except maybe
for performance reasons, or maybe it doesn't matter performance-wise. I
don't know yet. But any implementation at this point would do ! So later
this is something that could be experimented with to see if one or the other
style gives more performance.
Ok, I am pretty much done now, I think I gave you guys a pretty good idea
how a core executor and therefore multiple core executors could be
implemented on a gpu !
One last explanation about that last concept:
Multiple cores/battles/information could be stored into one big texture.
Battle information 1 would be at offset 0 to 31999.
Battle information 2 would be at offset 32000 to 63999.
Battle information 3 would be at offset 64000 to 95999.
And so forth...
This way all vertices and pixel shaders can use arithmetic to work out
which battle they belong to by simply determining the battle base address
first, for example:
Battle Number = Offset div 32000; <- indicates to what battle they belong
Battle Vertex offset = Offset mod 32000; <- indicates where the vertex is
within the "battle information array" so it can understand what type it is.
Same goes for core cells.
Cores could start at for example Battle Base + 8000;
if (Offset >= Battle Base + Core Start) and (Offset <= Battle Base + Core End) then
begin
  // pixel is a core cell.
end;
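The same arithmetic as a small runnable Python sketch. BATTLE_SIZE and CORE_START come from the numbers above. CORE_END is not given in this post, so the value here (an 8000-cell core) is purely an assumption, and the function names are made up.

```python
BATTLE_SIZE = 32000   # size of one "battle information array"
CORE_START = 8000     # core begins at Battle Base + 8000
CORE_END = 15999      # assumption: an 8000-cell core, not stated in the post

def locate(offset):
    """Split a global offset into (battle number, offset within that battle)."""
    battle_number = offset // BATTLE_SIZE   # Offset div 32000
    battle_offset = offset % BATTLE_SIZE    # Offset mod 32000
    return battle_number, battle_offset

def is_core_cell(offset):
    """True if the pixel at this global offset is a core cell of its battle."""
    battle_number, _ = locate(offset)
    battle_base = battle_number * BATTLE_SIZE
    return battle_base + CORE_START <= offset <= battle_base + CORE_END
```

With offsets counted from 0, battle 0 covers 0 to 31999 and offset 32000 is already the first element of battle 1, so div and mod line up without any special cases.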
Ok that should be enough for now !
Bye,
Skybuck.