
Converting a floating point texture to a rgba texture so it's ready to be flipped to the screen ?! ;

Discussion in 'Nvidia' started by Skybuck Flying, Oct 2, 2009.

  1. I just gave this pipeline simulation a test... without actually using any
    simulator code yet...

    And it seems very limited... only 100.000 instructions can be recorded or
    so... maybe a 1.000.000 but that's very little... just an initializing loop
    takes like 8000 * 10 instructions = 80.000 instructions or so...

    So this pipeline simulation is not worth much... though maybe it could give
    some insight into some cycles or so...

    All in all probably not worth investigating any further since it's pretty
    clear that memory lookups slow it down... and other tests already show the
    cpu can't do anything else while it's waiting for memory or so ?!?

    At least it seemed like that for me... I could be wrong though ;)

    Skybuck Flying, Oct 10, 2009

  2. My latest insights into the possibility of executing corewars on a gpu have
    made me doubt if the performance is going to be any good... it's probably
    not going to be any faster than a cpu... maybe even significantly slower
    depending on the number of passes that are needed.

    Calculations also assume that all executors would actually run in parallel
    at full speed which is also probably a flawed assumption... this could mean
    that ultimate performance could even be far worse for gpu.

    Conclusions for parallel processors:

    1. Huge memory requirements just to be able to store stuff and also cache

    This is mostly where my current graphics card is kinda lacking... only 512
    MB... that's not really that much for parallel stuff... where for each
    parallel stuff only a little bit of work would be done ;)

    I could continue trying to develop something... but I now have serious
    doubts that it would achieve any good speed... at least with the current
    design... which is probably a very good design... maybe the best one... only
    the other idea, the speculative execution one, might give some performance
    benefit... but I doubt that will be any good for sequential warriors... unless
    something more complex is done with loop iteration prediction per processing
    element or so... that's a bit too advanced for my taste...

    I think it's time to start spending my time on other projects...

    Maybe in the future when programming has become easier... and when more
    resources are available I might give it another try... but using opengl/cg
    shaders probably has too much programming overhead and especially too little
    resources available... hardware wise as well.. too little memory.

    It's kinda a bummer...

    I shall do one last calculation which would be an optimistic calculation
    just to see if something can be done:

    (4 input textures + 4 output textures) * 4 elements per texture * 3 bytes = 96

    512 MB / 96 = 5.33333333 mega elements per texture.

    5.3333333 mega sqrt = 2364x2364 texture size or so.

    core size = 8000 + warriors 2 * (8000 processes + 500 pspace) = 8.000 +
    17.000 = 25.000 elements + 10 for little overhead or so...

    Means 2364*2364 / 25010 = 223 simulators in gpu at best.
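    A quick sketch of the arithmetic above in Python (not from the original post; the function and parameter names are made up for illustration, and 512 MB is assumed to mean 512 * 1024 * 1024 bytes):

```python
import math

def simulators_in_textures(memory_bytes=512 * 1024 * 1024,
                           textures=8, elements_per_texel=4, bytes_per_element=3,
                           core_size=8000, warriors=2, processes=8000,
                           pspace=500, overhead=10):
    """Redo the post's estimate; every default is a figure quoted in the post."""
    # (4 input + 4 output textures) * 4 elements * 3 bytes per texel position
    bytes_per_texel = textures * elements_per_texel * bytes_per_element
    texels = memory_bytes // bytes_per_texel
    side = math.isqrt(texels)  # largest square texture that fits
    # core + per-warrior process queue and pspace + a little overhead
    cells_per_simulator = core_size + warriors * (processes + pspace) + overhead
    simulators = (side * side) // cells_per_simulator
    return bytes_per_texel, side, cells_per_simulator, simulators

print(simulators_in_textures())
```

    Changing the defaults (for example a card with more memory) shows how the simulator count scales.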

    cycles per simulator could be anywhere from 1000 to 100.000 cycles per second.

    Worst case scenario: 223 * 1000 = 223.000 cycles per second... could even be
    worse if not fully executed in parallel... but gpu does have many cores...
    like 200 so might actually execute in parallel.

    Best case scenario: 223 * 100.000 = 22.300.000 cycles per second for entire gpu.

    This is pretty optimistic... probably a bit too optimistic... probably more
    passes required... or maybe not...

    but let's say 22 million cycles per second for gpu.

    Cpu achieves 16 million for dual core... so gpu is not really spectacular...
    and I need something spectacular...

    The 100.000 above is assuming that opengl doesn't need to bind the cg
    program all the time...

    It probably would need to re bind... so that would make it 10x times slower
    or so... so gpu might actually achieve only 2 million cycles per second
    which would be bad.

    So conclusion in other short words:

    It's like having a cpu which can do 223 cycles in parallel... but it can
    only do it 10.000 per second or so... so finally speed would be: 2.230.000
    cycles per second... which is just miserable.
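    The three throughput guesses in this post can be put side by side with a small Python sketch (not from the original post; the names are hypothetical, and it assumes all 223 simulators really do advance in parallel, which the post itself doubts):

```python
SIMULATORS = 223                     # from the texture budget estimate
CPU_CYCLES_PER_SECOND = 16_000_000   # dual core figure quoted in the post

def gpu_cycles_per_second(simulators, cycles_per_simulator_per_second):
    """Total corewar cycles per second if every simulator runs in parallel."""
    return simulators * cycles_per_simulator_per_second

scenarios = {
    "worst case": 1_000,
    "with re-binding": 10_000,
    "best case": 100_000,
}

for label, rate in scenarios.items():
    total = gpu_cycles_per_second(SIMULATORS, rate)
    ratio = total / CPU_CYCLES_PER_SECOND
    print(f"{label}: {total} cycles/sec, {ratio:.3f}x the cpu")
```

    Only the best case gets past the cpu figure, and not by much.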

    So that's my latest guess at what the performance would be... miserable ! ;)

    Skybuck Flying, Oct 23, 2009

  3. However I just had a radically new idea...

    What if the shader itself uses 50.000 local integers or so...

    Then the shader could use all those local integers as if it was local
    memory... and simply execute everything in one pass... this would/should
    greatly increase the execution speed.

    The question is now how much local memory/integers/variables can a shader
    have ?!

    A simple test with an array of ints could shed some light on this for

    void myshader()
    {
        int myvar[50000];
    }

    ^ if something like that compiles then that could be very interesting ! ;)

    Skybuck Flying, Oct 23, 2009
  4. Ok,

    I tested this theory (from last posting) and it seems to compile with some
    slight modifications.

    It seems for loops are limited to 4096 ? Not sure what that is...

    What if it was a while loop ?

    Maybe ints limited to range 4096 ? I am not sure...

    For now the core could be split into a lower and upper half and then this
    code works:

    No idea yet of what performance would be... also no idea how many of these
    could run in parallel without blowing things up ?! ;)

    Time will tell... now time for some performance indication testing with fx
    composer 2.5.
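    For illustration, here is how an 8000-cell core could be addressed when it is split into two 4000-element halves, sketched in Python rather than Cg (not from the original post; the helper names are made up, and the wrap-around is the usual corewar modulo addressing):

```python
HALF = 4000            # each array stays under the apparent 4096 limit
CORE_SIZE = 2 * HALF   # 8000 cells in total

def make_core():
    """Return the lower and upper halves, all cells zeroed."""
    return [0] * HALF, [0] * HALF

def core_read(lower, upper, address):
    """Read one cell; addresses wrap around the full core."""
    address %= CORE_SIZE
    return lower[address] if address < HALF else upper[address - HALF]

def core_write(lower, upper, address, value):
    """Write one cell, picking the half that holds the address."""
    address %= CORE_SIZE
    if address < HALF:
        lower[address] = value
    else:
        upper[address - HALF] = value

lower, upper = make_core()
core_write(lower, upper, 7999, 42)     # last cell, lives in the upper half
print(core_read(lower, upper, -1))     # -1 wraps to 7999
```

    In a shader the same branch on the address would be needed on every core access, so the split is not free.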

    Fingers crossed, code example:


    /*

    % Description of my shader.
    % Second line of description for my shader.

    keywords: material classic

    date: YYMMDD

    */

    // One core cell: three 16-bit words, so 6 bytes per instruction.
    struct Tinstruction
    {
        short mWord1;
        short mWord2;
        short mWord3;
    };

    typedef short Tprocess;

    float4x4 WorldViewProj : WorldViewProjection;

    float4 mainVS(float3 pos : POSITION) : POSITION{
        return mul(WorldViewProj, float4(pos.xyz, 1.0));
    }

    float4 mainPS() : COLOR
    {
        int vIndex;

        // Note: the arrays are left uninitialized, this is only a compile test.

        /*
        // works (first variant, commented out here because the second
        // variant below reuses the same array names):
        int vLowerCore[4000];
        int vHigherCore[4000];

        for (vIndex=0; vIndex < 4000; vIndex++)
            vLowerCore[vIndex] = vLowerCore[vIndex] + 1;

        for (vIndex=0; vIndex < 4000; vIndex++)
            vHigherCore[vIndex] = vHigherCore[vIndex] + 1;
        */

        // works as well... highly interesting !
        Tinstruction vLowerCore[4000];
        Tinstruction vHigherCore[4000];

        for (vIndex=0; vIndex < 4000; vIndex++)
            vLowerCore[vIndex].mWord1 = vLowerCore[vIndex].mWord1 + 1;

        for (vIndex=0; vIndex < 4000; vIndex++)
            vHigherCore[vIndex].mWord1 = vHigherCore[vIndex].mWord1 + 1;

        Tprocess vLowerProcess[4000];
        Tprocess vHigherProcess[4000];

        for (vIndex=0; vIndex < 4000; vIndex++)
            vLowerProcess[vIndex] = vLowerProcess[vIndex] + 1;

        for (vIndex=0; vIndex < 4000; vIndex++)
            vHigherProcess[vIndex] = vHigherProcess[vIndex] + 1;

        return float4(1.0, 1.0, 1.0, 1.0);
    }

    technique technique0 {
        pass p0 {
            CullFaceEnable = false;
            VertexProgram = compile vp40 mainVS();
            FragmentProgram = compile fp40 mainPS();
        }
    }
    Skybuck Flying, Oct 23, 2009
  5. I just tried to do some performance testing with fx composer 2.5...

    It gives some error "GPuPerformanceUnsupported" ?!?

    It did give some indication 10 Gpixels / sec ?!?

    Probably flawed indication...

    I think I could use this technique to try and implement a parallel corewar

    The data would be loaded from a texture map just once at the start of the

    Then the shader runs a full simulator battle, maybe even multiple in one

    And then it simply returns the battle results in a little output texture...

    Could be nice if it works ! ;)

    Example for two warriors in core:

    This way the constraints would be:

    First constraint:

    Maximum amount of simulators in gpu memory possible:

    512 MB / ( 8000*6 bytes + 2 * (8000 + 500+4) * 2 ) =
    512 MB / (48000 + 34016) =
    512 MB / 82016 =
    536870912 / 82016 = 6545 simulators in core !
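    The same memory budget as a small Python sketch (not from the original post; the names are hypothetical, and the extra 4 entries per warrior are taken straight from the formula above without knowing what they hold):

```python
def simulators_in_memory(memory_bytes=512 * 1024 * 1024,
                         core_size=8000, instruction_bytes=6,
                         warriors=2, processes=8000, pspace=500,
                         extra_entries=4, entry_bytes=2):
    """Return (bytes per simulator, simulators that fit in gpu memory)."""
    core_bytes = core_size * instruction_bytes                    # 48000
    warrior_bytes = (warriors * (processes + pspace + extra_entries)
                     * entry_bytes)                               # 34016
    bytes_per_simulator = core_bytes + warrior_bytes
    return bytes_per_simulator, memory_bytes // bytes_per_simulator

print(simulators_in_memory())
```

    Note that this counts only storage; it says nothing yet about how many of those simulators the gpu can actually run at once.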

    Now the pixel shaders would simply run each simulator side by side for as
    far as possible...

    I have no idea what the performance for the pixel shader would be...

    But for now I will take a guess...

    6545 simulators * 80.000 cycles * 2 warriors * 100 battles =

    104.720.000.000 instructions to execute at least.

    Each instruction is about 6 bytes...

    So that's a bandwidth requirement of:

    628.320.000.000 bytes

    The true bandwidth is something like:

    50 GB/sec which is: 5.368.709.1200 bytes

    So clearly the bandwidth is a limiter/constraint...

    So estimated time for shader to complete based on bandwidth constraint would be:

    628.320.000.000 bytes / 5.368.709.1200 bytes / sec =

    628320000000 bytes / 53687091200 bytes / sec = 11.7 seconds.

    So instructions per second executed would be:

    104.720.000.000 / 11.7 = 8.950.427.350 instructions per second.

    For two warriors that would mean 4.475.213.675 cycles per second.

    Let's see.. a dual core cpu achieves 16.000.000 cycles per second.

    The gpu performance would be staggering/very good.. however I have a feeling
    there must be another bottleneck/constraint somewhere....

    There could also be an execution constraint for the gpu.

    Stats/specs say something like: Fill rate: 15.7 billion pixels/sec.

    I think that's about:
    15.7 * 1000 * 1000 * 1000 = 15.700.000.000

    So far this seems within range of the number above.

    Conclusion: performance could be staggering/super speed !

    Speed up over cpu would be:

    4.475.213.675 / 16.000.000 =
    4475213675 / 16000000 = 279.7

    The gpu would be about 280 times faster than a cpu !
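    The whole bandwidth-bound estimate can be written down as one Python sketch (not from the original post; the names are made up, and it assumes every executed instruction costs exactly one 6 byte fetch and nothing else, with 50 GB/sec taken as 50 * 1024^3 bytes/sec):

```python
def bandwidth_bound_estimate(simulators=6545, cycles=80_000, warriors=2,
                             battles=100, instruction_bytes=6,
                             bandwidth_bytes_per_sec=50 * 1024 ** 3,
                             cpu_cycles_per_sec=16_000_000):
    """Return (instructions, bytes moved, seconds, speed up over the cpu)."""
    instructions = simulators * cycles * warriors * battles
    traffic_bytes = instructions * instruction_bytes
    seconds = traffic_bytes / bandwidth_bytes_per_sec
    # one corewar cycle executes one instruction per warrior
    gpu_cycles_per_sec = instructions / seconds / warriors
    speedup = gpu_cycles_per_sec / cpu_cycles_per_sec
    return instructions, traffic_bytes, seconds, speedup

instructions, traffic, seconds, speedup = bandwidth_bound_estimate()
print(instructions, traffic, round(seconds, 1), round(speedup))
```

    Without rounding the time to 11.7 seconds the speed up comes out slightly below 279.7, but it still rounds to about 280. The number of simulators cancels out of the speed up: it only depends on bandwidth, bytes per instruction and warriors per core.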

    That's the kind of performance gain I am looking for ! ;)

    Me very happy about that number ! =D

    As long as the code compiles this should definitely be achievable !

    However there is still a little catch... these numbers do not include the
    initialization... this would
    need to be done for each battle... but that's probably pretty quickly done
    as well...

    Even a 200 speed up would be real nice ! ;)

    So these numbers are very encouraging and I will definitely continue my
    development efforts to get a parallel gpu corewars executor going ! ;)

    Skybuck =D
    Skybuck Flying, Oct 23, 2009
  6. I made a little typo there in the dots:

    Correct dotted value is:

    53.687.091.200

    However the calculations were still done properly... because I removed
    the dots later on ! ;)

    So calculations are correct ! ;)

    Skybuck ! ;) :)
    Skybuck Flying, Oct 23, 2009
  7. The error was probably related to gtx 7900 which doesn't support certain
    performance benchmarks... the gtx 8800 does...

    Anyway back to the story...:

    Even more interesting could be to completely leave the core, processes and
    pspace out of the texture maps...

    Since those "entities" can be done/initialized in the shader itself.

    What remains is the warrior's code... that could be supplied into the
    texture map... parameters maybe not possible... I would be worried that it
    would be pre-compiled/computed which is unwanted.

    To keep it simple each warrior could be stuffed into 100 cells... even if
    they are not all used... plus a size indicating how large it really is...

    This means the number of simulators could be:

    512 MB / (100 * 6 bytes + 2) =
    536870912 / 602 = 891812 simulators ! LOL.

    This could allow a "battlefield" of 944 x 944 ;)

    Hmm seems a bit overkill for now... my battlefield would be 60x60 or so...
    but maybe later I try 944x944 or so...

    For now I shall not do any calculations how long this would take... just
    want to "document" the idea a little bit ;)

    Skybuck Flying, Oct 25, 2009
  8. Hmm program start needed as well

    So this becomes:

    512 MB / (100 * 6 bytes + 4) =

    536870912 / 604 = 888859 simulators

    Max battlefield 942 x 942
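    A sketch of what such a fixed-size warrior record could look like, in Python with the struct module (not from the original post; the field order, the 2 byte size and 2 byte program start, and the function names are assumptions chosen to match the 100 * 6 + 4 bytes above):

```python
import math
import struct

CELLS = 100  # every warrior is padded to 100 cells

# size (2 bytes), program start (2 bytes), then 100 cells of three 16-bit words
RECORD = struct.Struct("<HH" + "hhh" * CELLS)

def pack_warrior(instructions, start=0):
    """Pack a list of (word1, word2, word3) tuples into one fixed record."""
    if len(instructions) > CELLS:
        raise ValueError("warrior does not fit in 100 cells")
    padded = list(instructions) + [(0, 0, 0)] * (CELLS - len(instructions))
    flat = [word for cell in padded for word in cell]
    return RECORD.pack(len(instructions), start, *flat)

def capacity(memory_bytes=512 * 1024 * 1024):
    """Return (records that fit, side of the largest square battlefield)."""
    count = memory_bytes // RECORD.size
    return count, math.isqrt(count)

record = pack_warrior([(1, 2, 3), (4, 5, 6)], start=1)
print(len(record), capacity())
```

    Unused cells are simply zero filled; the size field tells the shader how much of the record is real code.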

    Skybuck Flying, Oct 25, 2009
  9. I was losing confidence if it's gonna work because I don't know what will
    happen if a shader uses many variables...

    So I decided to do a little test... a little input texture... and some local
    variables like 8000*4*32 bits.

    And some code to try and force the gpu/cg compiler to actually use all of
    them and not eliminate them...

    Surprisingly it did seem to work... only problem is that FX Composer takes
    multiple seconds to render something... it also allocates gigabytes of
    memory... and then the whole application freezes.

    I tried to make the shader only work for a few pixels... but alas.. it still
    uses gigabytes.

    It does seem to render some white now and then which was probably the result
    of the shader which summed everything up more or less.

    Maybe I need to develop my own minimalistic cg editor/development
    environment which is more aimed at large scale or so...


    Skybuck Flying, Oct 27, 2009
