Hello,
I just had an idea how BWT, especially the wheeler transform algorithm could
possibly be successfully used in video decoders.
I haven't thought the idea through but I want to publish it just in case
it's a good idea:
The idea is as follows:
1. Usually video's have multiple "key" frames followed by some
"intermediate" frames.
2. BWT could be started in parallel on multiple key frames.
However this would require a little change in for example Microsoft's Video
for Windows... which simply always start at just a single frame.
Video for Windows would need to be turned into a "parallel" version where
decoders would be fed multiple key frames... or let's call them "key
streams".
So that a form of "video key-stream pipelining" occurs.
This way multiple processors/cores/threads could be working on decoding
different sections of the video.
The drawback is probably:
3. Requiring more bandwidth to store the results until actually needed.
Why is this a solution you might be wondering and what was the problem in
the first place ?
I will try to explain...
The problem with wheeler-transform is that it's a pointer-chasing problem
for the decoder.
The processor needs to fetch a pointer/index, wait for it... then fetch the
next pointer/index, and the next and then next etc... this slows down the
processor a lot... it looks like so:
p1->p2->p3->p4->p5->p6.....->p1000000.
The cpu is not fast enough to decode this in time... the gpu can't even do
this kind of decoding (or it's very hard/slow) so that's also not an option.
However what if the cpu could start working way in advance before the frame
is actually needed ?! this would give the cpu more time to work on it...
only problem is with intermediate frames... they also need a lot of time...
maybe limit
bwt to key frames only or so ?!?
Well anyway... by letting multiple cpu's work on more key frames the decode
speed should improve somewhat...
Let's think of it as interleaving video key-stream decoding.
Example:
CPU1: Frame 1 to 8
CPU2: Frame 9 to 16
CPU3: Frame 17 to 24
CPU4: Frame 25 to 32
This way 4 cpu's are being used instead of just one.
Now the problem is ofcourse... can cpu1 decode 8 frames in the needed time
?!?
The answer is ofcourse: no funny enough... because if it could then only one
cpu would
be necessary.
However look what happens at cpu2... by the time cpu1 is done... cpu2 will
be done, cpu3 will
be done... cpu4 will be done...
So those frames will all be done... So my conclusion so far is:
Only the startup takes a while... but after that it should run smooth.
Bye,
Skybuck.