
whose cc recognises byte moves

Discussion in 'Embedded' started by Pat LaVarre, Jan 15, 2004.

  1. Pat LaVarre

    Pat LaVarre Guest

    Newsgroups: comp.arch.embedded
    Lately elsewhere I saw people mocking the practice of writing explicit
    byte assignments in C, which brings me here now asking, anyone with an
    8051 compiler want to try compiling the following code snippet?
    Possibly before I saw:

    1) Some compilers actually involve 32-bit arithmetic, rather than
    moving bytes, ouch.

    2) Some compilers allocate the unsigned long twice, once as a local
    variable, then again as a separate result, ouch.

    3) Many compilers fail to produce the same machine code for both of
    these expressions of the same idea e.g. my 32-bit Linux desktop `gcc
    --version` 3.2.2 here now, when run as c.a.e. lately helpfully
    suggested:

    gcc -c -fomit-frame-pointer -O3 -Wall -W hi.c
    objdump -dS hi.o

    A frustrating failure of the C compilers for 8 and 16 bit
    microcontrollers to understand what I meant by what I said plainly and
    simply in the first place, yes?

    Pat LaVarre

    /// ways of fetching a potentially misaligned big-endian 32-bits ...
    /// (op x25 Read Capacity bytes[4:5:6:7] is disc bytes per read block)

    long sub1(char * chars) {
        long result;
        result = (chars[4] << 0x18) |
                 (chars[5] << 0x10) |
                 (chars[6] << 0x08) |
                 (chars[7] << 0x00);
        return result;
    }

    #include <endian.h>

    #if __BYTE_ORDER != __BIG_ENDIAN /* if byte 0 lsb */
    #define LIL(I, N) (((unsigned char *)&(I))[N])
    #define BIG(I, N) LIL(I, sizeof (I) - 1 - (N))

    #else /* else if byte 0 msb */
    #define BIG(I, N) (((unsigned char *)&(I))[N])
    #define LIL(I, N) BIG(I, sizeof (I) - 1 - (N))
    #endif

    long sub2(char * chars) {
        long result;
        LIL(result, 3) = chars[4];
        LIL(result, 2) = chars[5];
        LIL(result, 1) = chars[6];
        LIL(result, 0) = chars[7];
        return result;
    }
     
    Pat LaVarre, Jan 15, 2004
    #1

  2. I do not agree that both functions are the same - what if long and char
    are signed?
    I have tried it on KEIL, and sub2 works pretty neatly - only a few moves
    and that's it...
    sub1 depends on the signedness of long and char... Anyway, a lot
    of shifting occurred. On the other hand - you told your compiler to shift
    the bytes, so it did...
    Also, I declared the pointers to be of a data pointer type - that makes
    the code clearer to follow.

    BTW, you have forgotten that we do not have an endian.h file ;)

    regards

    Dejan
     
    Dejan Durdenic, Jan 15, 2004
    #2

  3. Pat LaVarre

    Thad Smith Guest

    Some may have optimizers that recognize the shift-by-8/shift-by-16.
    Many don't.
    This code has two bugs. For targets with 16-bit ints, (chars[4]<<0x18)
    results in converting a char to a 16-bit int, then shifting left 24 bits,
    with a result of type int. This is undefined, since the result doesn't
    fit into a 16-bit int.

    Secondly, whether type char is signed or unsigned is not standard. If
    type char is signed and the msb of the byte is one and the target uses
    two's complement representation, then the sign bit is propagated by the
    conversion from char to int, which is not the obviously desired result.

    What I suggest (for this approach) is the following:

    unsigned long conv (unsigned char *bige) {
        unsigned long result;
        result = ((unsigned long)bige[4] << 0x18) |
                 ((unsigned long)bige[5] << 0x10) |
                 ((unsigned long)bige[6] << 0x08) |
                 ((unsigned long)bige[7] << 0x00);
        return result;
    }

    If you really need a signed result, you can simply cast the result
    unless the target uses non-twos-complement representation, in which case
    you need to make an explicit adjustment.

    If the target uses 16-bit ints, the following code is probably more
    efficient, since it defers long arithmetic:

    unsigned long conv (unsigned char *bige) {
        unsigned long result;
        result = ((unsigned long)(((unsigned)bige[4] << 0x8) | bige[5]) << 0x10) |
                 (((unsigned)bige[6] << 0x8) | bige[7]);
        return result;
    }

    It does just-in-time conversion from unsigned int to unsigned long.
    This version will, in most (all?) cases, generate much better code for
    8-bit processors (and probably 16-bit processors). It doesn't work for
    mixed-endian targets, though. It also makes an assumption that chars
    are 8 bits in length, which the earlier version doesn't.

    Thad
     
    Thad Smith, Jan 15, 2004
    #3
  4. Pat LaVarre

    Pat LaVarre Guest

    From: Pat LaVarre ...
    No comment?
    Aye, writing explicit byte moves works.

    But to have to jump thru such hoops to get a reasonable result
    frustrates me. Mostly when someone else didn't jump thru the
    appropriate hoop and I have to fix their code. I remember once with
    nothing but source-to-source transformations I dropped to twenty ms
    from over a thousand.

    Help I don't quite understand: can you easily give an example?

    But if we want to compile the same C for a variety of platforms, we
    don't want our C understood this way. A byte move is a byte move. On
    a processor with a single cycle barrel shifter that likes 32-bit
    arithmetic and aligned memory access, we want shifts. On a processor
    with byte-wide registers and memory, we want moves. Writing either
    gives me the wrong answer on the other platform, ouch.
    What is a "mixed-endian" target?

    To pretend to have a Gnu endian.h was the clearest way I knew to say
    in C that all the processors that I ever programmed much were either
    big-endian or else little-endian. I imagine the C standard allows
    more variability than that.

    Pat LaVarre

    P.S.
    Sorry I sparked those digressions. Now that we emphasise these
    truths, I imagine the original code came from the xFF-is-char-mask
    tradition of 32-bit int Unix/ Java/ etc. e.g.

    int sub1(char * chars)
    {
        int result;
        result = ((chars[4] & 0xFF) << 0x18) |
                 ((chars[5] & 0xFF) << 0x10) |
                 ((chars[6] & 0xFF) << 0x08) |
                 ((chars[7] & 0xFF) << 0x00);
        return result;
    }
     
    Pat LaVarre, Jan 16, 2004
    #4
  5. Pat LaVarre

    Thad Smith Guest

    I haven't seen it and don't expect it with a reasonable compiler, but I
    wouldn't be terribly surprised, either.
    Good work. That's sometimes part of the job.
    Then it isn't written correctly. The shift method is robust if written
    correctly. The byte move method is appropriate for most cases, assuming
    adjustment is made for endianness.

    Sometimes when making these kinds of optimizations, I use conditional
    code to specify two versions -- a generic version that should work on
    all platforms and an optimized version for a particular platform. The
    optimized version is only included if preprocessor symbols indicate the
    associated target. That makes the code portable and the generic version
    usually serves as a more readable description of what is being done.
    One in which the byte order is something other than 0123 or 3210. Was
    that the VAX?
    OK, that should work with a 32-bit int computer. On the other hand, if
    you or others are concerned about 8-bit processors, why not at least
    modify it to be well defined for implementations with 16-bit ints (most
    8-bit systems included). It doesn't cost anything for 32-bit systems,
    just makes it portable to ALL standard C implementations. Just replace
    0xFF with 0xFFL.

    Thad
     
    Thad Smith, Jan 16, 2004
    #5
  6. Pat LaVarre

    CBFalconer Guest

    No, it isn't portable. Left shifting bits into the sign bit of an
    int results in undefined behavior. Bytes may be more than 8 bits,
    although this is fairly rare, and chars may be signed. The above
    is obviously intended to convert a bigendian (Least sig. byte
    first) stream. A portable version is:

    /* Convert 4 bytes of bigendian to an unsigned long */
    unsigned long bigendian4toul(const unsigned char *chars)
    {
        int i;
        unsigned long result;

        result = 0; i = 4;
        do {
            result = result * 256 + (chars[--i] & 0xff);
        } while (i);
        return result;
    } /* untested */

    And you can replace "result * 256" with "result << 8". Most
    compilers will generate the same code. If you don't want the
    local variable, and can afford more code, the following should
    also be portable:

    #define octet(n) ((unsigned long)chars[n] & 0xff)

    /* Convert 4 bytes of bigendian to an unsigned long */
    unsigned long bigendian4toul(const unsigned char *chars)
    {
        return (((((octet(3) << 8) +
                     octet(2)) << 8) +
                     octet(1)) << 8) +
                     octet(0);
    } /* untested */

    The "& 0xff" in octet is only needed if CHAR_BIT is greater than
    8. The use of '+' in place of '|' emphasizes the arithmetical
    nature, but should give the same result.

    The critical things from the user's viewpoint are "How many bytes of
    input am I using" and "What is the endianness of the input
    stream". The sizes and endianness of the local entities should not
    affect the code. The "How many" affects the output type, and the
    input endianness affects the actual code organization.

    Once you have routines that depend ONLY on things defined by the C
    standard, you can reuse them in safety. The only reason to modify
    and specialize them is for performance in particular environments,
    and then only when experience shows the necessity.
     
    CBFalconer, Jan 16, 2004
    #6
  7. Pat LaVarre

    Pat LaVarre Guest

    What is a "mixed-endian" target?
    Hmmm. Yes to me the orders 1:0:3:2 and 2:3:0:1 feel almost as
    familiar as 0:1:2:3 and 3:2:1:0 do, and I did work on vaxen in the
    early 80's ...
    Thank you, yes I see, on C platforms where long essentially means int,
    0xFFL essentially means 0xFF. Possibly I could be pushed as far as:

    result = ((chars[4] & 0xffL) << 0x18) |
    ((chars[5] & 0xffL) << 0x10) |
    ((chars[6] & 0xFF) << 0x08) |
    ((chars[7] & 0xFF) << 0x00);

    I suspect I hesitate because a Java long means a C long long so I can
    easily misread this C snippet as if it were a pointlessly 64-bit C
    snippet:

    result = ((chars[4] & 0xffLL) << 0x18) |
    ((chars[5] & 0xffLL) << 0x10) |
    ((chars[6] & 0xFF) << 0x08) |
    ((chars[7] & 0xFF) << 0x00);

    Ouch I see I misled you sorry. I meant to say:

    Writing shifts or byte moves commonly provokes actual C compilers to
    produce machine code that expends unreasonable amounts of time or
    space in order to calculate the correct value. Compilers for
    processors that prefer shifts misunderstand byte moves, compilers for
    processors that prefer byte moves misunderstand shifts.
    In theory I like writing a plain version and also a tuned version, but
    in practice my colleagues and I rarely find the time to maintain the
    code we never run. Sometimes I do persuade people to run the naive
    code as a check on the clever code.
    Portability isn't free: some people find platform-specific code, in
    particular cast-free code, easier to read.

    I actually launched this c.a.e. thread because I saw some Linux folk
    who mostly target processors that prefer aligned shifts casually mock
    people who write byte moves. I was trying to remember why I tended to
    favour writing byte moves.
    No.

    Writing code like everyone else writes leads to correct
    binary-code-only compiler behaviour, because everyone else has already
    fixed the bugs I'd otherwise discover.

    Writing compliant code only guarantees correct output from a
    hypothetical standard C compiler whose authors have fixed its last
    bug, and doesn't relate to the time and space expended.

    Pat LaVarre
     
    Pat LaVarre, Jan 16, 2004
    #7
  8. Pat LaVarre

    Pat LaVarre Guest

    the xFF-is-char-mask
    Sorry I neglected to mention that. Fortunately many (all?) two's
    complement machines agree on what left shifting a 1 into the sign
    bit means.
    Some compilers reward deferring the promotion to long e.g.

    #define octet(n) ((unsigned long)(chars[n] & 0xff))
    Some compilers reward substituting | for + when in fact they give the
    same result. Using | says explicitly that the addition needn't carry,
    which helps, but also says explicitly that the addition shall not
    carry, which can hurt.
    Do we have a practical way to establish whether or not the source code
    we wrote depends only on what the C standard promises a hypothetically
    compliant compiler will provide?
    Hello here we are. Actual C compilers for actual 8-bit and 16-bit
    micros commonly do waste much code space and time by wrongly inferring
    we meant 32-bit shifts, multiplication, and addition when in fact we
    wanted byte moves.
    That kind of "portable" is hypothetical. Me, I care about time and
    space and plainly correct machine code, not just arguably correct C
    source code.

    If actual C compilers actually understood the equivalence between
    shift and byte move, then I'd get precisely the same result from
    saying either, and I'd not be tempted to observe, model, discuss, and
    exploit the difference.

    Pat LaVarre
     
    Pat LaVarre, Jan 16, 2004
    #8
  9. The VAX is (was) a little endian machine when dealing with integers.

    However, the floating point format inherited from PDP-11 looked quite
    strange, with the first 16 bit word containing the sign, exponent and
    the most significant part of the mantissa, while the second word
    contained the least significant word.

    However, when looking at byte addresses, the lowest byte contained the
    most significant part of the mantissa (and the least significant bit
    from the exponent), the second byte the sign and most of the exponent,
    the third byte the least significant part of the mantissa and the last
    byte the middle bits from the mantissa.

    However, on the little endian PDP-11, some compilers might put the
    most significant word of a 32 bit variable into the lower address and
    the least significant word in the higher address, although the
    hardware supported only 16 bit memory references for integers. With
    such compilers, the byte order was 2301 :).

    Paul
     
    Paul Keinanen, Jan 16, 2004
    #9
  10. Pat LaVarre

    Thad Smith Guest

    You are correct. If the intent is to convert a 4-octet
    twos-complement big-endian representation into a native 32-bit signed
    value, more logic is needed.

    Here's one candidate:

    long bigendian2c4otolong (const unsigned char *be) {
        unsigned long r;
        r = ((be[0] & 0xffLU) << 24) |
            ((be[1] & 0xffLU) << 16) |
            ((be[2] & 0xffLU) << 8) |
            ((be[3] & 0xffLU) << 0);
        if (be[0] & 0x80) return (long)(r - 0x80000000LU) - 0x7fffffffL - 1;
        else return (long)r;
    }

    That isn't necessarily the best for any particular platform, but I think
    that it should work, except for the case of -2^31 on a sign-magnitude or
    ones-complement host.
    That looks like a good LITTLE ENDIAN converter. ;-)
    Agreed.

    Thad
     
    Thad Smith, Jan 17, 2004
    #10
  11. Characters in c are signed.
    I have as one of my rules of thumb :
    only use shift operators on unsigned quantities.
     
    Albert van der Horst, Jan 20, 2004
    #11
  12. Pat LaVarre

    Dave Hansen Guest

    [...]
    No they are not. They are encodings. The range of encoded values of
    the execution character set must be able to be represented by the
    (char) type. Character string literals (e.g. "Hello world!") have
    type (char *). Character literals (e.g. 'A') have type (int).

    The types (char), (signed char) and (unsigned char) are three distinct
    types. The (char) type itself may be signed or unsigned: you don't
    know unless you read the documentation.

    If sign matters, it is better to use (signed char) or (unsigned char).
    Good idea. Regards,

    -=Dave
     
    Dave Hansen, Jan 21, 2004
    #12
  13. Pat LaVarre

    Pat LaVarre Guest

    Characters in c are signed.

    Yes we corrected that tupo once already. Sorry it exists.
    Specifically I omitted the & 0xFF.
    A comparably popular competing rule of thumb is: use signed quantities
    always, e.g. in C always say signed char.

    Pat LaVarre
     
    Pat LaVarre, Jan 21, 2004
    #13
  14. I have as one of my rules of thumb :
    Fully agreed.
    I strongly doubt that this rule is anywhere near popular. The average
    C programmer isn't that silly --- or so I keep hoping.

    To be perfectly clear about this: this rule-of-thumb is too stupid to
    live. For starters, there's no way you can correctly use the standard
    <ctype.h> functions/macros in a portable manner while sticking to that
    rule.

    Signed chars have exactly one kind of non-silly usage: as very small
    integers. Using them to represent characters in actual text is just a
    disaster waiting to happen.
     
    Hans-Bernhard Broeker, Jan 21, 2004
    #14
  15. Pat LaVarre

    Pat LaVarre Guest

    A comparably popular competing rule of thumb is: use signed quantities
    I see this in code fragments written to run as C or as Java. I
    thought these might be coming out of a tradition of signed-char
    32-bit-int Unix, but personally I'm decidedly vague on how the world
    has split between signed and unsigned bytes expressed in C as type char.
    Sorry I was unclear. Java uses 16 unsigned bits, perhaps unsigned
    short in C, to mean UTF-16 char. I meant to be talking about signed
    bytes. Java has only signed bytes, no unsigned bytes.

    Pat LaVarre
     
    Pat LaVarre, Jan 21, 2004
    #15
  16. And in that case, it's entirely the *Java* side of that which would be
    governing it. IIRC, Java has no unsigned types at all.

    OTOH, trying to write code that works in more than one different
    language is even sillier than that rule of thumb. It can be a nice
    game (the record achievement being more than 20 languages that can all
    execute a given file, IIRC), but it's quite certainly useless in any
    productive environment. The idea of there being a programming
    language "C/C++" you can write programs in has wreaked enough havoc
    on people's education already --- no point in making that even worse.
     
    Hans-Bernhard Broeker, Jan 22, 2004
    #16
  17. Pat LaVarre

    Pat LaVarre Guest

    ...

    Whether Java does or does not accurately represent a signed-char
    32-bit-int Unix C design tradition passed person-to-person, I do not
    know.
    That is, in Java as specified, signedness is always implicit, never
    explicit or otherwise indeterminate. Instead, all of byte, short,
    int, long are always signed and char is always unsigned. That is, in
    Java (96?) we see a different way of spelling the C99 <stdint.h> ideas
    of int8_t, int16_t, int32_t, int64_t and uint16_t.

    In particular, Java char by definition works like the unsigned
    two's-complement sixteen bits that often we can get via C unsigned
    short (and perhaps wchar?) e.g.

    $ cat hi.java
    class hi {

        public static void main(String[] args)
        {
            int i = -1;
            char ch = (char) i;
            int j = (int) ch;
            System.err.println(j);
            System.err.println("x" + Integer.toHexString(j));
        }

    }
    $ javac hi.java
    $ java hi
    65535
    xffff
    $
    $ # Pat LaVarre
     
    Pat LaVarre, Jan 22, 2004
    #17
  18. I've always wondered about this. A char is a char, unless
    it's a character in which case it's an int! Makes about as
    much sense as some of the other strange "features" of C.
    Right. And to just make matters "interesting", the developers
    of an OS with which I've worked for a number of years decided
    that char strings should be UBYTE which then causes the compilers
    to complain because UBYTE != char!
    Why?
     
    Everett M. Greene, Jan 22, 2004
    #18
  19. Pat LaVarre

    Dave Hansen Guest

    Shifting a bit left into the sign bit is undefined. Shifting the sign
    bit to the right is implementation-defined. Summary: you don't know
    what you're gonna get.

    Of course, using unsigned types doesn't always help if the type is
    narrower than int. In that case, it gets promoted to (signed) int
    before the shift. For example, assuming 16-bit int:

    unsigned char x = 0xFF, y;
    unsigned int z;

    y = x << 4; /* OK, y = 0xF0 */
    z = x << 4; /* OK, z = 0xFF0 */
    y = x << 8; /* undefined */
    z = x << 8; /* undefined */

    In each case, the (unsigned) value of x is converted to (signed) int,
    the int value is shifted, and the result converted to the type of the
    LHS of the assignment. In the last two cases, the shifted value can't
    be represented by the int type, and the result is undefined.

    I've written C programs for 20 years. I love the power, flexibility,
    and expressiveness of the language. But sometimes, C just p*sses me
    off.

    Regards,

    -=Dave
     
    Dave Hansen, Jan 22, 2004
    #19
  20. Pat LaVarre

    Pat LaVarre Guest

    For example, assuming 16-bit int:
    Yes, on paper. But how often is this issue real? Anybody actually
    selling a 16-bit processor these days whose int's are not two's
    complement, such that (0x00FF << 8) is (0x00FF * 0x0100) is (0xFF00)?

    Pat LaVarre
     
    Pat LaVarre, Jan 23, 2004
    #20
