Motherboard Forums



Can an x86/x64 CPU/memory system be changed into a barrel processor?

 
 
Skybuck Flying
Guest
Posts: n/a
 
      06-09-2011, 10:38 AM
Hello,

Question is:

Can an x86/x64 CPU/memory system be changed into a barrel processor?

I shall provide an idea here and then you guys figure out if it would be
possible or not.

What I would want as a programmer is something like the following:

1. Request memory contents/addresses with an instruction which does not
block, for example:

EnqueueReadRequest address1

Then it should be possible to "machine gun" these requests like so:

EnqueueReadRequest address1
EnqueueReadRequest address2
EnqueueReadRequest address3
EnqueueReadRequest address4
EnqueueReadRequest address5

2. Block on response queue and get memory contents

DequeueReadResponse register1

do something with register1, perhaps enqueue another read request

DequeueReadResponse register2
DequeueReadResponse register3

If the queues operate in order... then this would be sufficient.

Otherwise extra information would be necessary to know which response belongs to which request.

So if the queues were out of order, then the dequeue would need to report
which address the contents were for:

DequeueReadResponse content_register1, address_register2
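The tagged, out-of-order variant can be modeled in software: a service loop completes read requests in whatever order it likes, and the consumer matches each response to its request by the address tag. A minimal Python sketch of the idea; the toy "memory", the thread standing in for the memory controller, and all names are invented for illustration, and real hardware queues would of course look nothing like this:

```python
import queue
import threading

# Toy "memory": address -> contents.
memory = {0x10: 111, 0x20: 222, 0x30: 333}

read_requests = queue.Queue()   # EnqueueReadRequest posts here
read_responses = queue.Queue()  # DequeueReadResponse pulls (address, contents)

def memory_controller():
    # Stand-in for the memory system: collect requests, then complete
    # them deliberately out of order, tagging each response with the
    # address it belongs to.
    pending = []
    while True:
        addr = read_requests.get()
        if addr is None:  # end-of-requests marker, for this toy model only
            break
        pending.append(addr)
    for addr in reversed(pending):  # complete in reverse order
        read_responses.put((addr, memory[addr]))

t = threading.Thread(target=memory_controller)
t.start()

# "Machine gun" the requests without waiting for any data to come back.
for addr in (0x10, 0x20, 0x30):
    read_requests.put(addr)
read_requests.put(None)
t.join()

# Blocking dequeue; the address tag tells us which request each
# response answers, so arrival order does not matter.
results = {}
for _ in range(3):
    addr, contents = read_responses.get()
    results[addr] = contents

print(results)
```

Even though the responses arrive in reverse order here, the tags let the consumer reassociate every value with its address.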

The same would be done for writing as well:

EnqueueWriteRequest address1, content_register
EnqueueWriteRequest address2, content_register
EnqueueWriteRequest address3, content_register

There could then also be a response queue which notifies the thread when
certain memory addresses were written.

DequeueWriteResponse register1 (in order design)

or

DequeueWriteResponse content_register1, address_register2 (out-of-order design)


There could also be some special instructions which would return queue
status without blocking...

Like queue empty count, queue fill count, queue max count, and perhaps a
queue reset operation which could be used to restore queue status in case
something happened to the queue.

For example, each queue has a maximum amount of entries available.

The queueing/dequeueing instructions mentioned above would block until they
succeed (meaning the request is placed on the queue or the response is
removed from the queue).

The counting instructions would not block.

This way the CPU would have at least four queues:

1. Read Request Queue
2. Read Response Queue
3. Write Request Queue
4. Write Response Queue

Each queue would have a certain maximum size.

Each queue has counters to indicate how many free entries there are and
how many taken entries there are.

These counters are also queryable via instructions that do not block the
thread. The counters would be protected via hardware mutexes or similar
because of concurrent queueing and dequeueing, but as long as nothing is
in flight they should return properly.

GetReadRequestQueueEmptyCount register
GetReadRequestQueueFullCount register

GetReadResponseQueueEmptyCount register
GetReadResponseQueueFillCount register

GetWriteRequestQueueEmptyCount register
GetWriteRequestQueueFullCount register

GetWriteResponseQueueEmptyCount register
GetWriteResponseQueueFillCount register
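The non-blocking count queries map naturally onto bounded-queue bookkeeping: the fill count is the number of occupied entries, and the empty count is the capacity minus the fill count. A minimal Python sketch of that idea, with invented names (this is a software analogy for the proposed instructions, not actual hardware):

```python
import queue

QUEUE_MAX = 8  # each queue has a fixed maximum size
read_request_queue = queue.Queue(maxsize=QUEUE_MAX)

def fill_count(q):
    # Non-blocking "taken entries" query; qsize() is documented as
    # approximate when other threads are enqueueing/dequeueing.
    return q.qsize()

def empty_count(q):
    # Non-blocking "free entries" query: capacity minus fill.
    return q.maxsize - q.qsize()

read_request_queue.put(0x1000)
read_request_queue.put(0x2000)

print(fill_count(read_request_queue))   # 2
print(empty_count(read_request_queue))  # 6
```

The approximation caveat mirrors the post's point that the counters are only exact "as long as nothing is happening" on the queue.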

All instructions should be shareable by threads... so that, for example, one
thread might be posting read requests and another thread might be
retrieving those read responses.

Otherwise the first thread might block because the read request queue is
full, with nobody draining the response queue.

Alternatively, the instructions could be made non-blocking and return a
status code to indicate whether the operation succeeded. However, an
additional code or mode would then be necessary to specify whether the
instruction should block or not, which might make things a bit too
complex; that is a hardware-maker decision. If sharing among many threads
is too difficult, impossible, or too slow, then non-blocking might be
better: the thread can cycle over the read responses and see if anything
came in so it can do something. However, that polling would lead to high
CPU usage, so for efficiency's sake blocking is preferred, or perhaps a
context switch until the thread no longer blocks. The thread would still
need to deal with responses somehow, so the blocking design seems to need
multiple threads working together.
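The sharing scenario described above — one thread posting read requests while another retrieves the responses — can be modeled with ordinary threads and bounded queues. A hedged Python sketch; the "controller" thread standing in for the memory system, the queue sizes, and all names are invented for illustration:

```python
import queue
import threading

N = 100
requests = queue.Queue(maxsize=4)   # small on purpose: the poster WILL block
responses = queue.Queue(maxsize=4)  # small: fills up unless someone drains it

def controller():
    # Stand-in for the memory system: turns each request into a response.
    for _ in range(N):
        addr = requests.get()
        responses.put((addr, addr * 2))  # fake "contents"

def poster():
    # Thread 1: machine-guns read requests; put() blocks whenever
    # the request queue is full.
    for addr in range(N):
        requests.put(addr)

received = []
def drainer():
    # Thread 2: retrieves responses so the pipeline keeps moving.
    for _ in range(N):
        received.append(responses.get())

threads = [threading.Thread(target=f) for f in (controller, poster, drainer)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(received))  # every request got a response
```

If a single thread tried to finish all the put() calls before doing any get(), these small queues would fill up and it would block forever — exactly the deadlock the post warns about, and the reason the enqueueing and dequeueing sides are split across threads here.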

The memory system/chips would probably also need some modifications to be
able to deal with these memory requests and return responses.

Perhaps special wiring/protocols would also be needed to "pipeline" as many
of these requests/responses back and forth as possible.

So what do you think of a "barrel"-like addition to current AMD/Intel
x86/x64 CPUs and their memory systems? Possible or not?

The idea described above is a bit messy... but it's the idea that counts.
If CPU manufacturers are interested, I might work it out some more to see
exactly how it would flesh out/work.

Bye,
Skybuck.

 
 
 
 
 
Skybuck Flying
Guest
Posts: n/a
 
      06-09-2011, 10:41 AM


"Skybuck Flying" wrote in message
news:b3cc5$4df0a2aa$5419acc3$(E-Mail Removed)1.n b.home.nl...


A small correction: "Full" should have been "Fill":

GetReadRequestQueueEmptyCount register
GetReadRequestQueueFillCount register

GetReadResponseQueueEmptyCount register
GetReadResponseQueueFillCount register

GetWriteRequestQueueEmptyCount register
GetWriteRequestQueueFillCount register

GetWriteResponseQueueEmptyCount register
GetWriteResponseQueueFillCount register


Bye,
Skybuck.

 
Joel Koltner
Guest
Posts: n/a
 
      06-09-2011, 04:16 PM
"Skybuck Flying" <(E-Mail Removed)> wrote in message
news:b7aee$4df0a36d$5419acc3$(E-Mail Removed)1.n b.home.nl...
> Can a x86/x64 cpu/memory system be changed into a barrel processor ?


[deletia]

Not directly, but they... sort of... already are: The high-end Intel and AMD
x86 CPUs are all superscalar designs, which means that internally the CPU is
viewed as a collection of "resources" -- ALUs, instruction decoders, memory
read units, memory write units, etc. -- and that there are (typically)
multiple instances of each of these resources, and the CPU scheduler tries
very hard to always keep all the resources busy, which effectively means that
multiple instructions can be executed simultaneously (this effectively
implements your "AddRequest, AddRequest, GetResponse, GetResponse" protocol
that you'd like).

Now, add on the hyper-threading that's been around for a number of years now,
and I'd say you have a result that, in practice, is not that far from a barrel
processor. In fact, it's probably better insofar as popular metrics such as
performance/(# of transistors*clock rate*power) or somesuch in that the
dynamic scheduling that a superscalar CPU performs is often more efficient
than a straight barrel implementation when you're running "general purpose"
code such as a web browser or word processor (although I would expect that
barrel CPUs have instructions that provide "hints" to the schedulers to
suggest it not switch threads or to keep or flush the caches or whatever just
as superscalar CPUs do... but also recall that when HT was added to Intel's
x86 CPUs, for certain workloads the HT actually slowed down the overall
throughput a bit too...).

As I think you've surmised, the trick to achieving high performance with CPUs
is to prevent stalls. This is of course a non-trivial problem, and companies
like Intel and AMD invest enormous resources into trying to get just a little
bit better performance out of their designs; you can be certain that someone
at these companies has very carefully considered which aspects of a barrel
processor design they might "borrow" to improve their performance.

---Joel




 
 
Skybuck Flying
Guest
Posts: n/a
 
      06-10-2011, 12:58 AM
The only thing my program needs to do is fire off memory requests.

However it seems the x86 cpu blocks on the first memory request and does
nothing else.

This is on an AMD X2 3800+ processor.

Perhaps newer processors don't have this problem anymore but I would
seriously doubt that.

So unless you come up with any proof, I am going to dismiss your story as
complex, non-relevant bullshit.

It's not so hard to write a program which requests random memory accesses.

You apparently should try it sometime.

Bye,
Skybuck.


 
 
Joel Koltner
Guest
Posts: n/a
 
      06-10-2011, 01:19 AM
"Skybuck Flying" <(E-Mail Removed)> wrote in message
news:58426$4df16c1b$5419acc3$(E-Mail Removed)1.n b.home.nl...
> The only thing my program needs to do is fire off memory requests.
>
> However it seems the x86 cpu blocks on the first memory request and does
> nothing else.


Hmm, it shouldn't do that, assuming there aren't any dependencies between the
next handful of instructions and the first one there. (But note that if you
perform a load operation and the data isn't in the caches, it takes *many tens
to hundreds* of CPU cycles to fetch the data from external DRAM; hence you
*will* stall. There actually are instructions in the x86 architecture these
days for "warming up" the cache by pre-fetching data, though -- this can help
a lot when you know in advance you'll need data, e.g., a few hundred cycles
from now; if you're looping over big sets of data, you just pre-fetch the next
block while you work on the current one.)

A program that requests random memory accesses will very quickly stall for a
long time (after the first couple of instructions), as you quickly exhaust the
number of "memory read" resources available and have near-constant cache
misses. Few real-world programs exhibit behavior that bad AFAIK, although I
expect that some large database applications (that have to run through
multiple indices for each request, where the indices and/or data are too big
for the caches) might approach it.

---Joel

 
 
Paul
Guest
Posts: n/a
 
      06-10-2011, 02:05 AM
Joel Koltner wrote:
> "Skybuck Flying" <(E-Mail Removed)> wrote in message
> news:58426$4df16c1b$5419acc3$(E-Mail Removed)1.n b.home.nl...
>> The only thing my program needs to do is fire off memory requests.
>>
>> However it seems the x86 cpu blocks on the first memory request and
>> does nothing else.

>
> Hmm, it shouldn't do that, assuming there aren't any dependencies
> between the next handful of instructions and the first one there. (But
> note that if you perform a load operation and the data isn't in the
> caches, it takes *many tens to hundreds* of CPU cycles to fetch the data
> from external DRAM; hence you *will* stall. There actually are
> instructions in the x86 architecture these days for "warming up" the
> cache by pre-fetching data, though -- this can help a lot when you know
> in advance you'll need data, e.g., a few hundred cycles from now; if
> you're looping over big sets of data, you just pre-fetch the next block
> while you work on the current one.)
>
> A program that requests random memory accesses will very quickly stall
> for a long time (after the first couple of instructions), as you quickly
> exhaust the number of "memory read" resources available and have
> near-constant cache misses. Few real-world pograms exhibit behavior
> that bad AFAIK, although I expect that some large database applications
> (that have to run through multiple indices for each request, where the
> indices and/or data are too big for the caches) might approach it.
>
> ---Joel
>


The Intel processor also has prefetch options and works with both
incrementing and decrementing memory access patterns. Using a "warm up"
option is one thing, but the processor should also be able to handle
prefetch on its own.

Perhaps AMD has something similar ? Since this is posted to comp.arch,
someone there should know. Skybuck's processor has an integrated memory
controller, so there are possibilities.

http://blogs.utexas.edu/jdm4372/2010...ead-read-only/

Both Intel and AMD will have documentation on their websites addressing
how to optimize programs for the respective processors. That is a good
place for a programmer to start, to find the secrets of getting the best
performance.

Paul
 
 
Ken Hagan
Guest
Posts: n/a
 
      06-10-2011, 08:44 AM
On Fri, 10 Jun 2011 01:58:09 +0100, Skybuck Flying
<(E-Mail Removed)> wrote:

> The only thing my program needs to do is fire off memory requests.
>
> However it seems the x86 cpu blocks on the first memory request and does
> nothing else.


How do you know? The whole point about out-of-order execution is that it
is transparent to the software, so it is not possible to write a program
whose behaviour depends on whether blocking occurs or not.

If you have a logic analyzer and you think you have results that prove
in-order behaviour then you'll have to provide more details. That said,
such things are well outside my comfort zone so I personally won't be able
to help.
 
 
MitchAlsup
Guest
Posts: n/a
 
      06-10-2011, 04:41 PM
On Jun 10, 3:44 am, "Ken Hagan" <(E-Mail Removed)> wrote:
> On Fri, 10 Jun 2011 01:58:09 +0100, Skybuck Flying
>
> <(E-Mail Removed)> wrote:
> > The only thing my program needs to do is fire off memory requests.

>
> > However it seems the x86 cpu blocks on the first memory request and does
> > nothing else.


The CPU will not block if all of the outstanding accesses are to write-
back cacheable memory.

> How do you know? The whole point about out-of-order execution is that it
> is transparent to the software,


No, the whole point of precise exceptions is to be transparent to
software. The point of OoO is to improve performance; adding precise
exceptions to OoO gives you high performance and is relatively
transparent to software (but not entirely).

> so it is not possible to write a program
> whose behaviour depends on whether blocking occurs or not.


One can EASILY detect blocking (or not) by comparing the wall-clock
time of multi-million memory access codes. One can infer the latencies
to the entire cache hierarchy including main memory, and whether or not
main memory accesses are being processed with concurrency.

Mitch
 
 
Skybuck Flying
Guest
Posts: n/a
 
      06-10-2011, 08:39 PM


"Ken Hagan" wrote in message news(E-Mail Removed)...

On Fri, 10 Jun 2011 01:58:09 +0100, Skybuck Flying
<(E-Mail Removed)> wrote:

> The only thing my program needs to do is fire off memory requests.
>
> However it seems the x86 cpu blocks on the first memory request and does
> nothing else.


"
How do you know?
"

Good question, but not really.

Let's just say I have a lot of programming experience.

Some programs can do a lot while some can do only a little bit.

The last category falls into "a lot of memory accesses".

I have done many tests by now to confirm this.

The evidence is not 100% watertight or 100% certain, but I would be very
surprised if it was not the truth.

Especially since the GPU seems to execute it much faster, and this was even
on DX9 hardware instead of CUDA. However, those results are also in doubt,
because it's almost too good to be true and it fluctuated a bit.

"
The whole point about out-of-order execution is that it
is transparent to the software, so it is not possible to write a program
whose behaviour depends on whether blocking occurs or not.
"

What does this have to do with what I wrote...? It's up to the programmer
whether he wants to use the blocking instructions or not.

It's not that much of a big deal... Windows has plenty of thread-blocking
APIs, which are designed to be blocking on purpose, to save CPU.

Threads even have an APC queue where "messages/events" can be posted;
when the threads wake up, they can process them.

"
If you have a logic analyzer and you think you have results that prove
in-order behaviour then you'll have to provide more details. That said,
such things are well outside my comfort zone so I personally won't be able
to help.
"

I have also read reports claiming that the CPU is waiting on main memory
91% or so of the time.

Bye,
Skybuck.

 
 
Skybuck Flying
Guest
Posts: n/a
 
      06-10-2011, 08:42 PM
"Analyzing" the real world is pretty useless and the reason is very simple:

Computer programs which were slow would be dismissed by users.

Programmers try to write programs so they are fast.

Do not think that slow programs would be released.

Therefore it becomes a self-fulfilling prophecy...

And by analyzing the current situation and adapting to it... you also keep
the "chicken and egg" problem alive.

No better hardware, then no better software.

Or vice versa:

No slow software, then no faster hardware needed.

Lastly:

Ask yourself one very important big question:

What does the R stand for in RAM?

I also tried prefetch; it helps something like 1%, pretty fricking useless.

Bye,
Skybuck.

 