Motherboard Forums



Converting a floating point texture to a rgba texture so it's ready to be flipped to the screen ?! ;)

Skybuck Flying
Guest
Posts: n/a

 
      10-02-2009, 06:59 AM


Hello,

One thing is annoying me a little bit with my current Delphi
program/example/OpenGL acceleration experiment.

I cannot enjoy the fast speed of OpenGL because for now I am using
TCanvas.Pixels[x,y] to draw the texture map to the screen. And since the
texture map is in range 0.0 to 1.0 for the color components, these first need
to be converted to RGB bytes, which means many multiplications and rounds.
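
A minimal sketch of that per-component conversion, written in Python rather than Delphi so it can be run anywhere. The function names are made up for illustration; the clamp is an assumption, added because float textures can hold values outside 0.0 to 1.0.

```python
def float_to_byte(value: float) -> int:
    """Map a colour component in [0.0, 1.0] to an integer in [0, 255]."""
    # Clamp first: a float texture may contain out-of-range values.
    clamped = min(1.0, max(0.0, value))
    # One multiplication and one round-to-nearest per component.
    return int(clamped * 255.0 + 0.5)


def float_rgb_to_bytes(r: float, g: float, b: float) -> tuple:
    """Convert one floating point RGB pixel to three byte values."""
    return (float_to_byte(r), float_to_byte(g), float_to_byte(b))
```

For a 500x400 image that is 600,000 multiplications and rounds per frame, before any drawing happens.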

But the biggest problem is the slowness of Tcanvas.Pixels.

Anyway I probably already have a "CopyMemoryToBitmap" routine somewhere
which would help with flipping the memory into bitmap format

So the remaining problem is:

Converting floating point textures to rgba textures so they can be flipped
to screen.

I guess I could use an additional render to texture target... in rgba
mode... and use an extra shader... just for recalculating the floating
points to rgba's...

However doing this seems a bit weird... but it would probably be possible as
follows:

1. Draw a quad with 4 vertices which would activate all pixel shaders.
2. Shade the pixels and output them to the texture... preferably y-flipped
if necessary.
3. Read texture to cpu/system memory.
4. Flip memory to Tbitmap/canvas etc.
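
The y-flip in step 2 can also be done on the CPU after step 3. OpenGL read-backs return rows bottom-up, while a TBitmap's first scanline is the top row. A small Python sketch, assuming a tightly packed buffer with no row padding (the function name is hypothetical):

```python
def flip_rows(pixels: bytes, width: int, height: int,
              bytes_per_pixel: int = 4) -> bytes:
    """Reverse the row order of a tightly packed pixel buffer."""
    stride = width * bytes_per_pixel
    if len(pixels) != stride * height:
        raise ValueError("buffer size does not match width * height")
    # Slice the buffer into rows, then join them in reverse order.
    rows = [pixels[y * stride:(y + 1) * stride] for y in range(height)]
    return b"".join(reversed(rows))
```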

However I wonder if OpenGL has a better method of converting a floating
point texture/framebuffer into a bitmap ?!

So that it doesn't need to go through the vertex and pixel shaders ?!?

Hmmmm...

Maybe there is even a faster way ?

Maybe by re-enabling the "default framebuffer" ? But it would be empty...
hmm...

Bye,
Skybuck.


 
 
Skybuck Flying
Guest
Posts: n/a

 
      10-02-2009, 07:03 AM
Ok,

For now I am gonna get rid of the Tcanvas.Pixels... and simply use an extra
memory buffer to convert the 3x16 bit or 3x32 bit floating point texture
to rgba 4x8 bits in cpu.
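
A rough Python sketch of such a buffer conversion for the 3x32 bit case. It assumes the read-back data is little-endian float32 in r,g,b order with no padding; the function name and the fixed alpha value are illustrative only.

```python
import struct


def rgb_float32_to_rgba8(data: bytes, alpha: int = 255) -> bytes:
    """Convert packed float32 RGB pixels to packed 8-bit RGBA pixels."""
    if len(data) % 12 != 0:
        raise ValueError("expected 12 bytes (3 x float32) per pixel")
    count = len(data) // 4
    floats = struct.unpack("<%df" % count, data)
    out = bytearray()
    for i in range(0, count, 3):
        for component in floats[i:i + 3]:
            clamped = min(1.0, max(0.0, component))
            out.append(int(clamped * 255.0 + 0.5))
        out.append(alpha)  # the source texture has no alpha channel
    return bytes(out)
```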

That way cpu can do something too... hopefully cpu not gonna be too slow at
it ! LOL.

Would be funny if the cpu is still fricking slow even for something like...

I have bad feeling about that !

But gonna try anyway... have to do this anyway !

Bye,
Skybuck.


 
 
Skybuck Flying
Guest
Posts: n/a

 
      10-02-2009, 07:38 AM
Hmmm something fishy going on here...

The color component order of the Delphi form seems to be:

R,G,B,A

The color component order of the Delphi bitmap seems to be:

B,G,R,A

?!?!?

What the **** ?!?

Weirdness !

Bye,
Skybuck.


 
 
Skybuck Flying
Guest
Posts: n/a

 
      10-02-2009, 07:40 AM
Or maybe I just flipped the texture color components...

Me confused...

Should it be record:
r,g,b,a : float
end;

or

should it be record
b,g,r,a : float
end;

for floating point texture maps for GL_RGBA ?!? (and/or GL_RGB)

Hmm...

Bye,
Skybuck.


 
 
Skybuck Flying
Guest
Posts: n/a

 
      10-02-2009, 07:43 AM
Well I think I got the floating point format correct if I recall
correctly...

Since the OpenGL window seemed to draw ok...

So record for floating point texture format is probably:

r,g,b,a : float;

Then why does Delphi need it the other way around ?!

Weird...

Especially form vs bitmap... double weird ?!

Bye,
Skybuck.


 
 
Skybuck Flying
Guest
Posts: n/a

 
      10-02-2009, 07:46 AM
Anyway I am using Tbitmap.Scanline for fast access...

According to other postings it indeed seems to be reversed: B,G,R,A...

The reason for this I don't understand...

For now I will have to use a special type for it...

TbgraByte = record b,g,r,a : byte; end; // considered a bitmap rgba

And use the one which is appropriate.
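
For illustration, the byte swap between the two layouts looks like this in Python. It is a sketch on a raw byte buffer, not the Delphi record types above, and the function name is made up.

```python
def rgba_to_bgra(pixels: bytes) -> bytes:
    """Swap the red and blue bytes of every 4-byte pixel."""
    if len(pixels) % 4 != 0:
        raise ValueError("expected 4 bytes per pixel")
    out = bytearray(pixels)
    out[0::4] = pixels[2::4]  # blue moves to the first byte
    out[2::4] = pixels[0::4]  # red moves to the third byte
    return bytes(out)
```

Applying it twice gives back the original buffer, so the same routine converts in both directions.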

Bye,
Skybuck.


 
 
Skybuck Flying
Guest
Posts: n/a

 
      10-02-2009, 07:50 AM
Ok this solves problem nicely !

Delphi even helped me prevent a stupid error thanks to strong type checking
like so:

Faulty:

var
vBitmapColor : TbgraByte;

begin

TrgbaByte( scanline pointer etc ) := vBitmapColor; // compiler type error


Good:

TbgraByte( scanline pointer etc ) := vBitmapColor;

end;



Bye,
Skybuck.


 
 
Skybuck Flying
Guest
Posts: n/a

 
      10-02-2009, 07:54 AM

"Skybuck Flying" <> wrote in message
news:d6518$4ac59fd4$d53372a9$ b.home.nl...
> Hmmm something fishy going on here...
>
> The color component order of the Delphi form seems to be:
>
> R,G,B,A


^ Not sure about that...

Form does not seem to have a scanline property...

Maybe its internal format is also b,g,r,a...

Canvas.Pixels might be doing a conversion as well...

Tcolor is in rgba mode at least... so it might be doing conversions like so:

RGBA to BGRA.

^ This might be another reason why .Pixels[x,y] is slow...

>
> The color component order of the Delphi bitmap seems to be:
>
> B,G,R,A


When accessing the scanline pointer at least...

Maybe .Pixels[ ] does a conversion <- probably.
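
A small Python sketch of the two layouts involved. A TColor holding a plain RGB value is $00BBGGRR, so red sits in the lowest byte, while a 32 bit bitmap scanline stores each pixel as B,G,R,A in memory. Whether .Pixels converts exactly like this internally is an assumption; the function names are hypothetical.

```python
def tcolor_from_rgb(r: int, g: int, b: int) -> int:
    """Pack three bytes the way a TColor RGB value is laid out: $00BBGGRR."""
    return r | (g << 8) | (b << 16)


def tcolor_to_bgra_bytes(color: int, alpha: int = 255) -> bytes:
    """Unpack a $00BBGGRR value into the B,G,R,A byte order of a scanline."""
    r = color & 0xFF
    g = (color >> 8) & 0xFF
    b = (color >> 16) & 0xFF
    return bytes([b, g, r, alpha])
```

On a little-endian machine the TColor value lies in memory as R,G,B,0, which would explain why it looks like "rgba mode" next to the bitmap's B,G,R,A.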

Bye,
Skybuck.


 
 
Skybuck Flying
Guest
Posts: n/a

 
      10-02-2009, 07:56 AM
Anyway using the Tbitmap.Scanline property + rounding seems to be fast
enough for now... for 500x400 pixels

Canvas.Draw( 0, 0, mScreenBuffer ); // mScreenBuffer : TBitmap;

Draws it real fast... like under one second at least !

Bye,
Skybuck.


 
 
Skybuck Flying
Guest
Posts: n/a

 
      10-02-2009, 09:05 AM
Time for a little performance testing

With delphi conversion code included it's:

0.0254 seconds for one frame of 500x400 with 32 bit colors, and 3x32 bit
floating point texture format r,g,b.

That's 25 milliseconds...

Even using this shitty conversion code... this could mean:

1000 / 25.4 = 39 frames per second haha !

Actually the first time it seems to require 50 milliseconds not sure why...
maybe cpu cache getting filled or maybe it's disk activity from delphi ide
or so... could be...

Now let's see how fast draw is without this shitty conversion code and
canvas draw code.

It's about 0.006 seconds for opengl draw + read from texture.

That's about 6 milliseconds (500x400 vertex points as well ! )

Now let's leave texture reading out of it... to see how fast it goes then !


It's 0.00032 seconds.

That's 0.32 milliseconds !

Holyshit batman... that's already pretty fast ! And this includes
500x400 vertices ! HAHA.

Let's see what frame rate would be for this:

1000 / 0.32 = 3125

Not bad.
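
The frame rate arithmetic above as a tiny Python helper (the function name is made up; the inputs are the measured times from this post):

```python
def frames_per_second(seconds_per_frame: float) -> float:
    """Convert a measured frame time in seconds to frames per second."""
    return 1.0 / seconds_per_frame


# Frame times measured above, in seconds.
with_conversion = frames_per_second(0.0254)   # conversion + canvas draw
draw_only = frames_per_second(0.00032)        # OpenGL draw, no read-back
```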

I did notice the occasional hiccup... this could be the first time because
of loading the texture... and/or disk activity... I think it's disk activity
mostly.

Ok... now I reduce vertex points to something more realistic...

According to my last calculations posted in another sub thread... the number
of simulators would be:

1198.

There are two fields... each one can probably do a number of cell updates...
one pointer need for itself...
one pointer for somewhere else... and again a pointer for somewhere else...

So for itself at least 1, then maybe 2 then maybe another 2... so I think at
most 5 pointers needed.

So 1195 * 5 = 5975 vertices needed... Now I go test its speed:

Time is: 0.0003126 seconds.

So it remains at 0.32 milliseconds per frame. Hmmm...

This is not so good.

This means:

3125 frames per second * 1198 simulators = 3.743.750 cycles per second.

Which probably needs to be divided by 2, which gives 1.871.875 cycles per second.
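
The same estimate as a Python sketch. The factor of 2 is taken from the text as "two frames per simulated cycle", which is an assumption here, and the function name is hypothetical.

```python
def estimated_gpu_cycles(seconds_per_frame: float, simulators: int,
                         frames_per_cycle: int = 2) -> float:
    """Estimate simulated cycles per second from a measured frame time."""
    frames_per_second = 1.0 / seconds_per_frame
    # Every frame advances all simulators at once; assume a full cycle
    # needs frames_per_cycle frames.
    return frames_per_second * simulators / frames_per_cycle
```

With 0.00032 seconds per frame and 1198 simulators this gives about 1.87 million cycles per second.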

Dual core CPU was probably something like 80.000 * 200 = 16.000.000 cycles
per second.

I must know for sure so I am going to start it to make sure.

Yup confirmed...

Dual core CPU can do: 10 battles of 1 v 1 warriors with 100 rounds with
80.000 cycles in 5 seconds.

This means the dual core is executing:

10 * 100 * 80.000 cycles in 5 seconds = 80.000.000 (no early kills those
were disabled)

Which means it's executing: 80.000.000 / 5 = 16.000.000 cycles for dual
core.

Which means roughly 8.000.000 cycles per core.
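
The CPU side of the comparison in Python, using only the figures measured above (function name made up for illustration):

```python
def cpu_cycles_per_second(battles: int, rounds: int,
                          cycles_per_round: int, seconds: float) -> float:
    """Total simulated cycles divided by the wall-clock time taken."""
    return battles * rounds * cycles_per_round / seconds


dual_core = cpu_cycles_per_second(10, 100, 80000, 5.0)
per_core = dual_core / 2  # assumes both cores share the work evenly
```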

So far the cpu seems 4.2 times faster...

However I don't have a decent gpu implementation yet...

However seeing these numbers for this simple test raises big doubts if gpu
version is gonna be any faster...

Maybe the clearing of the framebuffer has something to do with it... gonna
disable it and retest...

depth test was disabled... clearing was disabled...

Time is now:

0.000237 seconds.

0.237 milliseconds

1000 / 0.237 = 4219

about 33.3% more performance.. still not enough me thinks.

Ok reloading the identity matrix does not seem necessary...

setting the cg world view thingy only needs to be done once it seems...

Performance is now
0.00019 seconds. (fluctuating a bit... maybe at full speed it would be lower
not sure)

which is about 0.19 milliseconds

1000 / 0.19 = 5263 frames per second.

Let's see what this would give:

5263 * 1198 simulators = 6.305.074 cycles
divide by 2 : 3.152.537

Still very poor.

Hmmm I see I have three frame buffer textures active... I only need 2...

Gonna disable one to see if that helps.

Yes that helped a bit...
0.000156 seconds.

Ok I don't need 32 bit floating points...

Only 16 bit floating points... gonna change textures to 16 bit...
This should give good improvement.

Hmmm nope still 0.00015 seconds...

I am starting to wonder if the time I am measuring is actually the cpu
overhead of the api calls... hmm...

Hmm could be... if that's the case... then reducing the number of api calls
could give more speed... I wonder if enabling the profiles all the time is
necessary, maybe not... binding programs, is that necessary ? I don't know...

For now the speed would be:
1000 / 0.156 = 6410 assuming the bigger texture is no problem.

Final speed would be:
(1198 * 6410) / 2 = 3.839.743 something like that.

Almost half of what a single cpu core would achieve.

This is assuming the vertex/pixel shaders and texture lookups don't add
any significant delays or overheads... for the largest possibility.

I go do a little test with a large texture to see what happens:
4096x4096 gives about the same speed...

For now I am worried this is not a good situation.

But there might be a solution... instead of stuffing the entire cores into
the frame buffer... the opposite could happen...

only instructions are stuffed into the frame buffer, plus vertices and such
for processing... and the cores themselves are stuffed into textures which
will not be used as render targets...

Instead something else will update those textures... this could be done by
cpu...

I was thinking about doing a single 4096x4096 texture map executor with 1198
simulators... but seeing these presumably api call overheads or round trip
times to gpu makes it doubtful that it would be any faster... it probably
would not... therefore the strategy has to be rethought and changed.

For now the number of texture inputs seems to be limited to 6 to 8
TEXCOORDS... these texcoords are necessary to supply the necessary
information to tex2D or rect2D or something like that...

But not really... each vertex could simply ignore those texcoord
semantics... and simply use the first one...
The texture coordinates themselves could use an additional coordinate, for
example the Z coordinate or an additional TEXCOORD2 or maybe NORMAL
coordinate, to indicate which texture to use.
This way the number of texture maps in the gpu could be endless... however
the memory is limited.
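
The idea of selecting the texture through an extra coordinate, sketched on the CPU in Python. Textures are plain nested lists here and the function name is made up; how a real shader would index a set of textures is not shown and would depend on the hardware.

```python
def sample(textures, u: int, v: int, layer: float):
    """Fetch texel (u, v) from the texture selected by the extra coordinate."""
    index = int(layer)  # the extra coordinate arrives as a float
    if not 0 <= index < len(textures):
        raise IndexError("layer coordinate selects no texture")
    return textures[index][v][u]
```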

Now let's try to fit as many simulators/cores/warriors as possible into the
core... new calculation becomes:

512 MB / (14000 * 6 bytes) =

536870912 / 84000 = 6391 simulators in gpu. the necessary instruction
pointers plus additional fields per simulator should fit easy in one
framebuffer so I am not worried about that.
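
The capacity calculation in Python. It takes 512 MB as 512 * 1024 * 1024 bytes and 14000 * 6 bytes per simulator, as in the text; the constant names are made up.

```python
GPU_MEMORY_BYTES = 512 * 1024 * 1024   # 536870912
BYTES_PER_SIMULATOR = 14000 * 6        # 84000, figure taken from the text

# Whole simulators that fit; the remainder is left for pointers and fields.
simulators_in_gpu = GPU_MEMORY_BYTES // BYTES_PER_SIMULATOR
leftover_bytes = GPU_MEMORY_BYTES % BYTES_PER_SIMULATOR
```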

Using this new figure the number of cycles would be:
6391 * 6410 = 40.968.363 / 2 = 20.484.181

Compared to the dual core cpu which has 16.000.000 cycles this is still very
weak ?!

However the gpu code I used to test is not running at full speed.. so who
knows what will happen... but for now I base it on what I see...

At best with a little bit of luck... two extra threads could be added which
feed the necessary data to the gpu... so that the gpu can do some processing
as well... this way the final speed would be doubled.

However this does require feeding the gpu with 6391 battles ?! which is
quite a lot..

This would be a battlefield size of almost 80x80 then for the cpu a little
bit of extra fields are necessary...
2x2 or so...

Hmm... this makes it difficult to distribute the battles across gpu or
cpu...

Gpu needs a lot of battles to be efficient... when gpu is done it would have
to wait for the cpu to finish the remaining battles...

Then the next round of battles could occur..

Figuring out the sweet spot for gpu and cpu is what would be necessary...
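
One simple way to look for that sweet spot is to split the battles in proportion to measured throughput, so both sides finish at about the same time. A Python sketch under that assumption, ignoring api call overhead and battles that end early (the function name is hypothetical):

```python
def split_battles(total: int, gpu_rate: float, cpu_rate: float) -> tuple:
    """Split battles between gpu and cpu in proportion to their throughput."""
    gpu_share = round(total * gpu_rate / (gpu_rate + cpu_rate))
    return gpu_share, total - gpu_share
```

For example, split_battles(100, 3.0, 1.0) gives 75 battles to the gpu and 25 to the cpu.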

Also the results for gpu could vary... it might finish sooner because of
favorite battles... then the gpu has nothing to do anymore... not enough
extra battles... unless gpu is made flexible... but that would be dangerous
because then it could take long...

For now it does seem like gpu might add some performance benefits... but the
cpu would also be tasked with doing lots of api calls which might eat into
the performance of the cpu simulators themselves which would not be a good
thing...

I have also started wondering if the cpu code/simulator code can be changed
to represent a more gpu like approach... maybe the cpu can act like a gpu as
well... more like in a streaming fashion...

Two approaches are possible:

1. Assume this theory and change/reimplement the simulator on cpu to see if
cpu can do stream processing.

or

2. Convert delphi simulator code to c/c++ to analyze it in visual studio to
see what the actual bottleneck is on the cpu... is it bandwidth ? is it cpu
execution ? is it stalls because of reads and writes ?

If it's the last then maybe the cpu code can be altered to run even faster.

I am very curious about that... seeing these poor results for gpu has made
me doubt if I should continue with this corewars/gpu project

GPU might be nice for video codec though

But I am even more curious about getting more simulator speed... so I have
a different direction to try... I am having heavy doubts about which path to
choose...

I think it's best if I try to do some cpu benchmarking test to try out the
"cpu streaming" theory !

Streaming vs non-streaming cpu test !

That's what I should do first !

And if no difference is found... maybe an analysis in c/c++ to see what
actual bottleneck is ?!

However there is another possibility...

What if the gpu code could run entirely inside a shader with multiple passes
?!

Then all these opengl api calls would not be necessary anymore...

However I am not sure if such an algorithm is possible... it might be
though... if the output from the pixel shaders can be redirected into the
vertex shaders/textures for the next pass...

I don't know if that's possible so that's also another "research/benchmark"
direction to try out.
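
The feedback idea is usually called ping-pong rendering: two buffers swap roles each pass, so the output of one pass is the input of the next. A CPU-side Python sketch of just the buffer swapping, with the shader stood in by an ordinary function (all names are made up; whether the whole loop can stay on the gpu without api calls is exactly the open question):

```python
def run_passes(initial, shade, passes: int):
    """Run `shade` over every cell `passes` times with two swapped buffers."""
    source = list(initial)
    target = [0] * len(source)
    for _ in range(passes):
        for i in range(len(source)):
            # A pass only reads from source and only writes to target.
            target[i] = shade(source, i)
        # The freshly written buffer becomes the input of the next pass.
        source, target = target, source
    return source
```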

So two early benchmarks to try out...:

1. Running code entirely inside gpu shader with multiple passes and frame
buffer/texture feedback ?!
^ Somehow only a few texels on the textures need to be updated for good
speed though ?!? If not possible use full frame shaders maybe that not so
bad for performance ? But I doubt it !

2. Running streaming like vs non-streaming like code on the cpu to see if
there is a difference.

^ Final conclusion: more research needed into possibilities !

Bye,
Skybuck.


 