Would you like signs with those chars?
As a general rule, C integer types are signed unless specified otherwise; short, int, long all work that way. But char, which is usually a single byte on current machines, is different; it can be signed or not, depending on whatever is most convenient to implement on any given architecture. On x86 systems, a char variable is signed unless declared as unsigned char. On Arm systems, though, char variables are unsigned (unless explicitly declared signed) instead.
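To make the difference concrete, here is a tiny test program (ours, not from the discussion); which branch it takes depends entirely on the architecture's choice, or on a compiler flag that overrides it:

#include <stdio.h>

int main(void)
{
	char c = '\xff';	/* the byte 0xff */

	/* If char is signed, c holds -1 and the test is true; if char is
	   unsigned, c holds 255 and the test is false.  Building with
	   -fsigned-char or -funsigned-char shows both behaviors on one
	   machine. */
	if (c < 0)
		printf("char is signed here (c = %d)\n", c);
	else
		printf("char is unsigned here (c = %d)\n", c);
	return 0;
}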
The fact that a char variable may or may not be signed is an easy thing for a developer to forget, especially if that developer's work is focused on a single architecture. Thus, x86 developers can get into the habit of thinking of char as always being signed and, as a result, write code that will misbehave on some other systems. Jason Donenfeld recently encountered this sort of bug and, after fixing it, posted a patch meant to address this problem kernel-wide. In an attempt to "just eliminate this particular variety of heisensigned bugs entirely", the patch added the -fsigned-char flag to the compiler command line, forcing the bare char type to be signed across all architectures.
This change turned out not to be popular. Segher Boessenkool pointed out that it constitutes an ABI change, and could hurt performance on systems that naturally want char to be unsigned. Linus Torvalds agreed, saying: "We should just accept the standard wording, and be aware that 'char' has indeterminate signedness". He disagreed, however, with Boessenkool's suggestion to remove the -Wno-pointer-sign option used now (thus enabling -Wpointer-sign warnings). That change would enable a warning that results from the mixing of pointers to signed and unsigned char types; Torvalds complained that it fails to warn when using char variables, but produces a lot of false-positive warnings with correct code.
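As a contrived illustration of those false positives (the function here is invented for the example): the following code is correct, but gcc with -Wpointer-sign warns that the pointer targets in the call differ in signedness, because buf has type char * while the function expects unsigned char *.

#include <string.h>

/* Count bytes with the high bit set (e.g. UTF-8 non-ASCII bytes). */
static size_t count_high_bytes(const unsigned char *s, size_t len)
{
	size_t n = 0;

	while (len--)
		if (*s++ & 0x80)
			n++;
	return n;
}

void example(void)
{
	char buf[] = "caf\xc3\xa9";		/* "café" in UTF-8 */

	count_high_bytes(buf, strlen(buf));	/* -Wpointer-sign warns here */
}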
Later in the discussion, though, Torvalds wondered whether it might be a good idea to nail down the signedness of char variables after all, but to force them to be unsigned by default rather than signed. That, he said, shouldn't generate worse code on any of the commonly used architectures. "And I do think that having odd architecture differences is generally a bad idea, and making the language rules stricter to avoid differences is a good thing".
Amusingly, he noted that, with this option, code like:

const unsigned char *c = "Subscribe to LWN";

will still, with the -Wpointer-sign option, generate a warning, since a string-constant pointer is still considered to be a bare char * type, which is then treated as being different from an explicit unsigned char * type. "You *really* can't win this thing. The game is rigged like some geeky carnival game".
Donenfeld saw merit in the idea, even though he thinks that the potential to break some code exists. He sent out a new patch adding -funsigned-char to the compiler command line to effect this change. He suggested that it could perhaps be merged immediately, given that there is time to fix any fallout before the 6.1 release, but Torvalds declined that opportunity: "if we were still in the merge window, I'd probably apply this, but as things stand, I think it should go into linux-next and cook there for the next merge window". He added that any problems resulting from the change are likely to be subtle and to lurk in driver code that isn't widely used across architectures. The core kernel code, instead, has always had to work across architectures, so he does not believe that problems will show up there.
So Donenfeld's patch is sitting in linux-next instead, waiting for the 6.2 merge window in December. That gives the community until late February to find any problems that might be caused by forcing bare char variables to be unsigned across all architectures supported by Linux. That is a fair amount of time, but it is also certainly not too soon to begin testing this change in as many different environments as possible. It is, after all, a fundamental change to the language in which the kernel is written; a lack of resulting surprises would, itself, be surprising.
One way to identify potential problems is to find the places where the
generated code changes when char is forced to be unsigned.
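As a simple, hypothetical example of how the generated code can change, consider a function like:

int is_high(char c)
{
	/* With signed char, this tests the sign bit and can be true; with
	   -funsigned-char, c < 0 is always false and the compiler can fold
	   the whole function down to "return 0". */
	return c < 0;
}

Comparisons, right shifts, and integer promotions involving char all change meaning in this way, so the object code can differ even when the source is correct either way.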
Torvalds has already made
some efforts in that direction, and Kees Cook has used a system
designed for checking reproducible builds to find a lot of changes.
Many of those changes will turn out to be harmless, but the only way to
know for sure is to actually look at them. Meanwhile, the posting of
one fix by Alexey Dobriyan has caused Torvalds to request
that the char fixes be collected into a single tree. As those
fixes accumulate, the result should be a sign of just how much
disruption this change is actually going to cause.
Index entries for this article: Kernel/Build system
Posted Oct 24, 2022 18:14 UTC (Mon) by ballombe (subscriber, #9523)

Posted Oct 24, 2022 21:29 UTC (Mon) by NYKevin (subscriber, #129325)
If you can find an environment where someone actually declared isalpha and relatives with arrays rather than ints, I would be very surprised, because the standard specifies that the argument must be an int, and any subsequent int-to-array conversion is the callee's problem.
* This problem is theoretically also possible if you are using UTF-8, but in that case, you'll get nonsense anyway, because UTF-8 has to be decoded before you can call functions like isalpha on it - and at that point, you've already widened everything to 32 bit, so hopefully you did it correctly.
Posted Oct 24, 2022 22:45 UTC (Mon) by wahern (subscriber, #37304)
I can't find conclusive examples for is- ctype routines, but here is how tolower was defined during the first few releases of OpenBSD, as forked from NetBSD:
#define tolower(c) ((_tolower_tab_ + 1)[c])
It's still defined similarly on NetBSD, today: http://cvsweb.netbsd.org/bsdweb.cgi/src/sys/sys/ctype_inl...
Also, EOF is a permitted value and typically -1 (thus the +1 in the above), though that would typically only be an issue for non-C locales.
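(A minimal sketch, assuming only standard C, of the portable calling idiom all of this implies: the ctype functions must be given either EOF or a value representable as unsigned char, so plain char arguments get cast first.)

#include <ctype.h>

static int count_letters(const char *s)
{
	int n = 0;

	for (; *s; s++)
		if (isalpha((unsigned char)*s))	/* cast avoids a negative index */
			n++;
	return n;
}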
Posted Oct 24, 2022 23:41 UTC (Mon) by NYKevin (subscriber, #129325)
Even so, I don't believe that the standard *actually* says that c has to be unsigned in that expression - just that the "usual arithmetic conversions" happen (i.e. the compiler magicks it into an int when you're not looking). Compilers presumably added that warning because there were instances of arrays being indexed with negative char, but not negative int or any other signed type. And, again, that presumably had something to do with ASCII supersets and other nonsense involving dirty 7 bit channels.
> Also, EOF is a permitted value and typically -1 (thus the +1 in the above), though that would typically only be an issue for non-C locales.
The argument is of type int (according to the standard, not that untyped macro), not char, so it's completely unambiguous: You are allowed to pass negative numbers to those routines, because int is always signed, and if it is implemented as a macro, it has to accept signed values in the int range. Of course, if you pass negatives other than EOF (or whatever EOF is #define'd to), then the standard presumably gives you UB (which is why it's OK for the array implementation to walk off the end in that case).
Posted Oct 25, 2022 0:22 UTC (Tue) by wahern (subscriber, #37304)

Posted Oct 25, 2022 15:13 UTC (Tue) by mrvn42 (guest, #161806)
> The argument is of type int (according to the standard, not that untyped macro), not char, so it's completely unambiguous: You are allowed to pass negative numbers to those routines, because int is always signed, and if it is implemented as a macro, it has to accept signed values in the int range. Of course, if you pass negatives other than EOF (or whatever EOF is #define'd to), then the standard presumably gives you UB (which is why it's OK for the array implementation to walk off the end in that case).
The problem is that this will only work for values between -1 and 127 for an array of 129 bytes. A value of -2 (or any other non-ASCII value other than EOF with signed char) would access memory before the array, and a value of 255 (EOF mistakenly stored in an unsigned char, or anything non-ASCII) would access memory after the array.
Looking at the source link in the other comments, the BSD code seems to assume chars are unsigned. The test for ASCII doesn't work with signed chars at all.
So I assume that "_tolower_tab_" is actually 257 bytes long to cover all unsigned chars and EOF (which is -1 when stored as int).
Posted Oct 25, 2022 0:31 UTC (Tue) by tialaramex (subscriber, #21167)
ASCII is a subset of UTF-8. So if you have a C library which is content to implement these functions for ASCII, they do that fine on UTF-8 data without any decoding.
You run into a problem, as you should expect, if the C library thinks the data is 8859-1 when it's actually UTF-8, but otherwise it just provides the very useful answer to the question: is this (alphabetic / a digit / punctuation / whitespace / etc.) in ASCII?
Rust deliberately provides the ASCII variants of these functions on both char (a Unicode scalar value) and u8 (an unsigned 8-bit integer ie like C's unsigned char) named is_ascii_digit and so forth. You often do not need the fancy Unicode is_digit but only is_ascii_digit for real software because overwhelmingly the "is it a digit?" question is not cultural but purely technical.
Posted Oct 25, 2022 3:29 UTC (Tue) by dvdeug (subscriber, #10998)
isxdigit is always 0-9A-F, and as far as I can tell from the manpage, isdigit is always 0-9. Barring those, why is it useful to ask "is this an alphabetic/punctuation/whitespace character in ASCII?" Pretty much everything is defined in terms of Unicode now; even if you're processing C code, you should still be prepared for identifiers in Russian, Greek or Chinese. I have a hard time thinking about a case where it's the right thing to check if something is some unspecified alphabetic character, but only those in ASCII.
Posted Oct 25, 2022 9:48 UTC (Tue) by pbonzini (subscriber, #60935)

Posted Oct 25, 2022 15:46 UTC (Tue) by dvdeug (subscriber, #10998)

Posted Oct 25, 2022 15:46 UTC (Tue) by khim (subscriber, #9252)
> Pretty much everything is defined in terms of Unicode now; even if you're processing C code, you should still be prepared for identifiers in Russian, Greek or Chinese.

Sure, but only the part of your program which deals with identifiers needs adjustment. You can write int foo = 42; with non-ASCII characters in the identifier, but you can not put them in the keywords, which means that you can easily use the “C” locale and all ASCII-only functions with a simple change: where before Unicode you used isalpha(c) or isalnum(c), now you would use c < 0 || isalpha(c) and c < 0 || isalnum(c). That's how doxygen handles it; I would assume someone may use isalpha(c) and/or isalnum(c) in a similar way.
Posted Oct 25, 2022 16:01 UTC (Tue) by dvdeug (subscriber, #10998)

Posted Oct 25, 2022 16:10 UTC (Tue) by khim (subscriber, #9252)
> It's not looking for ASCII letters; it's looking for 'i', 'n', 't', or the identifier in Unicode.

Only identifier is not “Unicode”. It's “alpha or Unicode”, then “alnum or Unicode” (where Unicode is defined as “anything with a high bit set”). Doxygen does that with lex, but in simpler cases you may do the same with ctype.h.
Posted Oct 25, 2022 23:41 UTC (Tue) by dvdeug (subscriber, #10998)

Posted Oct 26, 2022 0:36 UTC (Wed) by khim (subscriber, #9252)
Sure, but the standard doesn't say what a compiler (or, even worse, a non-compiler) has to do with broken programs. And if you ignore what the standard says and just go with isalpha/isalnum + Unicode (where Unicode == “high bit is set”), then you would handle all correct programs perfectly. And if someone feeds in an incorrect one… who cares how it would be handled? It's not as if we live in a world where everyone cares all that much about following the standard to a T.
Posted Oct 26, 2022 7:32 UTC (Wed) by NYKevin (subscriber, #129325)
No they won't, at best they will pass through non-ASCII without doing whatever the function is defined to do (e.g. tolower won't actually lowercase your letters), and at worst they will silently corrupt it (if they think it's one of the legacy 8-bit encodings).
> You often do not need the fancy Unicode is_digit but only is_ascii_digit for real software because overwhelmingly the "is it a digit?" question is not cultural but purely technical.
There is a subset of edge cases where a string does not contain linguistically useful information, like a phone number or UUID. In those cases, these ASCII-only functions are somewhat useful, but most of them could just as easily be done with regular expressions like [0-9]+. Realistically, you need nontrivial parsing logic anyway, to deal with things like embedded dashes and other formatting vagaries, so you may as well solve both problems with the same tool (which can and should be Unicode-capable, because ASCII is ultimately "just" a subset of UTF-8). In that context, these ASCII-only functions look rather less useful to me.
The problem is, ASCII-only functions are also an attractive nuisance. They make things a little too comfortable for the programmer who's still living in 1974, the programmer who still thinks that strings are either ASCII or "uh, I dunno, those funny letters that use the high bit, I guess?" Those programmers are the reason that so many Americans "can't" have diacritical marks in their names (on their IDs, their airline tickets, etc.). If you are writing string logic in 2022, and your strings have anything to do with real text that will be read or written by actual humans, then your strings are or should be Unicode. Unicode is the default, not the "fancy" rare exception. If you have strings, and they're not some variety of Unicode, then one of the following is true:
1. They're encoding something that sort of looks like text, but is not really text, like a phone number.
2. They are raw bytes in some binary format, and not text at all.
3. In practice, they mostly are Unicode, but that's not your problem (e.g. because you're a filesystem and the strings are paths).
4. You hate your non-English-speaking users (and the English speakers who have diacritical marks anywhere in their strings for whatever reason - we shouldn't make assumptions).
5. You inherited a pile of tech debt and it's too late to fix it now.
Posted Oct 26, 2022 12:03 UTC (Wed) by tialaramex (subscriber, #21167)
Sure enough Rust provides to_ascii_uppercase and to_ascii_lowercase here too.
[ Rust also provides to_uppercase and to_lowercase on char, but because this is a tricky problem these are appropriately more complicated ]
I already mentioned (but you snipped) that this will go wrong if your C library thinks it knows the byte is from some legacy encoding like 8859-1.
> most of them could just as easily be done with regular expressions like [0-9]+
This sort of completely inappropriate use of technology (resorting to regular expressions to just match ASCII digits) is how we get software that is hundreds of times bigger and slower than necessary.
> Realistically, you need nontrivial parsing logic anyway
Again, you seem to have determined that people would be looking at these functions where they're completely inappropriate, but C itself isn't the right choice in the applications you're thinking about.
> If you are writing string logic in 2022, and your strings have anything to do with real text that will be read or written by actual humans, then your strings are or should be Unicode.
Certainly, but again, we're not asking about Javascript or C# or even Rust, we're talking about C and most specifically about the Linux kernel. Whether the people implementing a driver for a bluetooth device are "actual humans" is I guess up for question, but they're focused very tightly on low level technical details where the fact that the Han writing system has numbers is *irrelevant* to the question of whether this byte is "a digit" in the sense they mean.
C only provides these 1970s functions, and so you're correct that you should not try to write user-facing software in C in 2022. But, the 1970s style ASCII-only functions are actually useful, like it or not, because a bunch of technical infrastructure we rely on works this way, even if maybe it wouldn't if you designed it today (or maybe it would, hard to say)
Example: DNS name labels are (a subset of) ASCII, and to_ascii_lowercase or to_ascii_uppercase is exactly appropriate for comparing such labels. You might say "Surely we should just agree on the case on the wire", but actually we must not do that, at least not before all DNS traffic is DoH: it turns out our security mitigation for some issues relies on random bits in DNS queries, and there aren't really enough, so we actually put more randomness in the case bits of the letters in the DNS name. Your code therefore needs to work properly with DNS queries that have the case bits set or unset seemingly at random, so as to prevent attackers guessing the exact name sent...
The end user doesn't see any of this, you aren't expected to type randomly cased hostnames in URLs, nor to apply Punycode rules, you can type in a (Unicode) hostname, and your browser or other software just makes everything work. But the code to do that may be written in C (or perhaps these days Rust) and its DNS name matching only cares about ASCII, even though the actual names are Unicode.
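(A minimal sketch of such an ASCII-only, case-insensitive label comparison; the function name is invented here, and tolower() folds only ASCII letters in the "C" locale, which is exactly what 0x20-style case randomization requires.)

#include <ctype.h>

/* Compare two DNS labels, ignoring ASCII letter case only. */
static int dns_label_eq(const unsigned char *a, const unsigned char *b)
{
	while (*a && *b) {
		if (tolower(*a) != tolower(*b))
			return 0;
		a++;
		b++;
	}
	return *a == *b;
}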
Posted Oct 25, 2022 9:14 UTC (Tue) by jengelh (subscriber, #33263)
int lowertbl[] = {-1, 0, 1, ..., 0x40, 0x61, 0x62, ...};
#define tolower(c) ((lowertbl+1)[c])

Sign extension need not produce nonsense. Thanks to the equivalence of the expressions (lowertbl+1)[c] <=> *(lowertbl+1+c) <=> *(lowertbl+c+1) <=> lowertbl[c+1], what matters is whether the pointer still points to something sensible.
Posted Oct 26, 2022 7:35 UTC (Wed) by NYKevin (subscriber, #129325)

Posted Oct 24, 2022 19:41 UTC (Mon) by mss (subscriber, #138799)

That signed / unsigned char difference is a potent source of subtle bugs. For example the following code:

char c = 0xff;
printf("%x\n", c - 1);

will print fffffffe if char is signed but just (the intuitively expected) fe if it is unsigned.
Posted Oct 24, 2022 21:29 UTC (Mon) by mb (subscriber, #50428)

Well, char in the signed case is consistent with short (always signed):

short c = 0xffff;
printf("%x\n", c - 1);

prints fffffffe. Therefore, char = unsigned char actually is inconsistent.
Posted Oct 25, 2022 6:35 UTC (Tue) by eru (subscriber, #2753)

Posted Oct 25, 2022 10:36 UTC (Tue) by geert (subscriber, #98403)
8-bit ASCII (e.g. ISO-8859-*) was only standardized in the eighties [*], which is about the same time as PowerPC and ARM saw the light of day, so making char default to unsigned may have made sense for them (modulo the compatibility issues).

[*] EBCDIC is 8-bit, and seems to have an intentional division in characters with and without bit 7 (the "sign" bit) set: https://en.wikipedia.org/wiki/EBCDIC#Code_page_layout
Posted Oct 25, 2022 13:15 UTC (Tue) by eru (subscriber, #2753)

8-bit character sets were not that new, although I agree they mostly post-date C.
Posted Oct 25, 2022 19:58 UTC (Tue) by khim (subscriber, #9252)

> 8-bit character sets were not that new, although I agree they mostly post-date C.

Note that a bunch of these were in wide use way before C was a thing.
Posted Nov 4, 2022 20:07 UTC (Fri) by fest3er (guest, #60379)

Posted Oct 25, 2022 4:56 UTC (Tue) by SLi (subscriber, #53131)

Posted Oct 25, 2022 7:06 UTC (Tue) by Villemoes (subscriber, #91911)
C99, 6.5.16.1 Simple assignment:

(2) In simple assignment (=), the value of the right operand is converted to the type of the assignment expression and replaces the value stored in the object designated by the left operand.

and that "converted to the type of" is covered in 6.3 Conversions, 6.3.1.3 Signed and unsigned integers:

(1) When a value with integer type is converted to another integer type other than _Bool, if the value can be represented by the new type, it is unchanged.

(2) Otherwise, if the new type is unsigned, the value is converted by repeatedly adding or subtracting one more than the maximum value that can be represented in the new type until the value is in the range of the new type.

(3) Otherwise, the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an implementation-defined signal is raised.

And in practice, any relevant compiler (at least when we're talking the linux kernel, but I'd be surprised if any non-academic compiler did otherwise) does as gcc:

* 'The result of, or the signal raised by, converting an integer to a signed integer type when the value cannot be represented in an object of that type (C90 6.2.1.2, C99 and C11 6.3.1.3).' For conversion to a type of width N, the value is reduced modulo 2^N to be within range of the type; no signal is raised.
Posted Oct 25, 2022 9:31 UTC (Tue) by SLi (subscriber, #53131)

I don't think it's only a matter of being "completely precise": relying on implementation-defined behavior can be a reasonable choice; relying on undefined behavior that a compiler vendor has not explicitly declared they define, to me, usually cannot. (I know Linus would disagree with me :D)

Allowing an implementation-defined signal seems a bit horrible here, though, but I think I understand where the committee was coming from...
Posted Oct 25, 2022 7:39 UTC (Tue) by matthias (subscriber, #94967)

Posted Oct 25, 2022 9:59 UTC (Tue) by SLi (subscriber, #53131)

Posted Oct 24, 2022 20:08 UTC (Mon) by cesarb (subscriber, #6266)

Posted Oct 24, 2022 21:14 UTC (Mon) by NYKevin (subscriber, #129325)

2. If you must use char, unsigned will work perfectly well (does your Bloom Filter have a negative count?!), so this isn't actually a use case for signed char.
3. If you need a sentinel value, you can use 255; there is no magical rule that says sentinel values have to be negative.
Posted Oct 24, 2022 22:13 UTC (Mon) by Sesse (subscriber, #53779)
I do wish C had made a separate “byte” type, though, for aliasing reasons. char has too many tasks.
Posted Oct 24, 2022 23:26 UTC (Mon) by NYKevin (subscriber, #129325)

Posted Oct 25, 2022 7:30 UTC (Tue) by gspr (subscriber, #91542)

Posted Oct 25, 2022 7:34 UTC (Tue) by Sesse (subscriber, #53779)

Posted Oct 25, 2022 19:18 UTC (Tue) by wahern (subscriber, #37304)

Posted Oct 25, 2022 5:09 UTC (Tue) by SLi (subscriber, #53131)

Posted Oct 25, 2022 16:15 UTC (Tue) by khim (subscriber, #9252)
I suspect it was just added when the C Standard committee realized that on some platforms simple char is unsigned. They needed a single-byte signed type, thus signed char was born.
Posted Oct 27, 2022 2:09 UTC (Thu) by gdt (subscriber, #6284)
If the language insists on "char" being "unsigned char" then some processors will need to follow the register load with an AND instruction to clear the sign extension. If loading that register also sets register flags (eg, Negative) then you'll need to clear those register flags too. You could, of course, perhaps avoid this with careful compiler optimisations, but that's asking too much of the compilers of the era.
Well before the ANSI standards committee started work, the convention in C was to let these differences in processor implementations shine through, with an obligation on people writing code intended to be 'portable' between differing processors to deal with the results. Considering that C was a systems programming language, this wasn't an unreasonable choice.
Posted Oct 27, 2022 6:01 UTC (Thu) by SLi (subscriber, #53131)

Posted Oct 27, 2022 7:18 UTC (Thu) by joib (subscriber, #8541)

(That blog post is a few years old and doesn't include results for RISC-V, but I understand that RISC-V is like ARM, in that it provides both zero-extending and sign-extending byte loads but the ABI has chosen chars to be unsigned.)
Posted Oct 24, 2022 21:19 UTC (Mon) by joib (subscriber, #8541)

When passing strings between C and Rust, the pointers can/must be cast to the correct type anyway. And when working in Rust-land, the Rust compiler handles strings per its own rules, and in C-land the C compiler handles chars per its rules.

(Adding to the potential confusion, Rust also has a 'char' type, which is a 4-byte value capable of storing a single UTF-32 codepoint; presumably under the covers it's considered unsigned. But indeed a string (str/String) is a u8 array containing UTF-8.)
Posted Oct 25, 2022 3:30 UTC (Tue) by tialaramex (subscriber, #21167)

Rust defines char as a Unicode scalar value. Unicode only has one set of codepoints, and some of them aren't valid scalar values because they were used to make UTF-16 work; UTF-32 maps all the ones which are scalar values to single code *units* and the rest are invalid.

Thus char::from_u32(0xDE01).unwrap(); will panic, because this claims 0xDE01 is a Unicode scalar value, but it isn't: code point U+DE01 is a surrogate used for UTF-16 and has no meaning on its own.

char is not "considered" to be signed or unsigned. You can't do arithmetic on Rust types unless they implement the (traits signifying) arithmetic operators, which char does not, so in Rust neither 'A' + 'B' nor false + true (bools only implement some logical and bitwise boolean operators, not general arithmetic) work.
Posted Oct 24, 2022 21:31 UTC (Mon) by lisch (subscriber, #36574)

> Thus, x86 developers can get into the habit of thinking of char as always being signed and, as a result, write code that will misbehave on some other systems.

In my experience, x86 developers are more likely to unwittingly think of char as unsigned. For example:

char c = '\xfc';
int array[256] = { ... };
do_something(array[c]); // BOOM!
Posted Oct 25, 2022 6:42 UTC (Tue) by eru (subscriber, #2753)

> In my experience, x86 developers are more likely to unwittingly think of char as unsigned.

Maybe because some early C compilers for the PC worked this way. Like Lattice C. The original 16-bit 8086 did not have a sign-extending 8-bit load as a single instruction, so it was more efficient to treat char as unsigned.
Posted Oct 25, 2022 9:51 UTC (Tue) by pbonzini (subscriber, #60935)
> The original 16-bit 8086 did not have the sign-extending 8-bit load as a single instruction, so it was more efficient to treat char as unsigned.
"mov al, [x] + cbw" (sign extending) is even one byte shorter than "xor ax, ax + mov al, [x]", so there is no particular x86-specific reason to pick unsigned over signed. Unsigned chars are simply more intuitive; signed-by-default chars are a relic of K&R C not having the signed keyword at all.
Posted Oct 25, 2022 16:20 UTC (Tue) by khim (subscriber, #9252)

> I suspect most x86 developers have never heard of Lattice C. :)

They may not have heard about it, but they definitely know its properties. Lattice C is the maiden name of Microsoft C, and I doubt x86 developers who have never heard about Microsoft C exist.

P.S. Yes, Microsoft C 3.0 is an independent rewrite, but it had to be compatible with Microsoft C 1.x and 2.x, and thus it had to be compatible with Lattice C, too.
Posted Oct 26, 2022 14:29 UTC (Wed) by eru (subscriber, #2753)

Posted Oct 25, 2022 16:34 UTC (Tue) by khim (subscriber, #9252)
cbw only works with the accumulator, though. If you want speed and use register variables (meaning you use di and si) then you cannot use it. Whether that was the actual reason for the use of unsigned char or not we would never know, of course.
Posted Oct 26, 2022 8:18 UTC (Wed) by pbonzini (subscriber, #60935)

Posted Oct 25, 2022 10:16 UTC (Tue) by adobriyan (subscriber, #30858)

const unsigned char *c = u"Subscribe to LWN";

would be fine, if it was an option :-)
Posted Oct 25, 2022 4:16 UTC (Tue) by scientes (guest, #83068)
I want to point out that the LDRB and LDRBT instructions also zero-extend byte loads, so this unsignedness is part of the ABI.
Posted Oct 26, 2022 15:42 UTC (Wed) by jrtc27 (subscriber, #107748)

Posted Oct 25, 2022 12:54 UTC (Tue) by dskoll (subscriber, #1630)
Mostly OT... I first learned assembler on the Motorola 6809. It had two sets of conditional branch instructions: One set interpreted the comparisons as signed, and the other as unsigned. It's been decades since I did any Intel assembler, but I assume the same thing holds? Are there any real architectural reasons to prefer signed over unsigned or vice-versa?
The 6809 had an instruction called SEX. It sign-extended accumulator B into double-accumulator D. It also had a BRA instruction, which was an unconditional branch. Someone tended toward the sophomoric when naming the instructions. :)
Posted Oct 25, 2022 14:00 UTC (Tue) by pbonzini (subscriber, #60935)
Yes, JL/JG/JLE/JGE are for signed comparisons while JB/JA/JBE/JAE are for unsigned comparisons.
Other architectures use LT/GT/LE/GE for signed and LTU/GTU/LEU/GEU for unsigned.
POWER is special in that it has separate "compare signed" and "compare unsigned" instructions, and the branches are just less than/equal/greater than.
Posted Oct 25, 2022 19:27 UTC (Tue) by mm7323 (subscriber, #87386)

Posted Oct 25, 2022 20:03 UTC (Tue) by khim (subscriber, #9252)

Because C is not made that way: uint8_t and int8_t are not types, they are type aliases. You still end up dealing with char/signed char/unsigned char whether you like it or not.
Posted Oct 25, 2022 19:31 UTC (Tue) by daniel.glasser (subscriber, #97146)
Back in the 1980s, I gave up explicitly using "char" for anything other than strings and/or API compatibility, and instead moved to a precursor of the types defined by the header file <stdint.h>, which are now "uint8_t" for unsigned and "int8_t" for signed (I think that the non-standard version was just "uint8" and "int8", respectively, but I'm not sure anymore). Not all compiler toolchains included a header file with consistent names, so I had my own. Using this convention, my code was portable between PDP-11, VAX-11, M68000, Intel8086, and Zilog Z8000, all of which I was using at the time, and eventually PPC, DEC Alpha, ARM, and other architectures with a number of different compilers with, at most, a change in a #define for whether "char" was signed or unsigned if the compiler didn't provide a pre-defined macro for that.
When ISO/IEC 9899:1999 (C99) added "<stdint.h>", I switched to using the standard typedef names for new code and my projects included a "include/compat" directory with versions of "stdint.h" and "stdbool.h" (and a few other "standard" header files introduced by C99) to be used for pre-C99 conforming compilers.
Adopting this sort of thing in the kernel at this stage would likely be a herculean task, but for any new software project in C it is trivial to standardize on using typedefs with explicit properties (signed/unsigned, width) where the built-in C types may vary between targets.
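(A minimal sketch of that convention, using only the standard C99 header:)

#include <stdint.h>

int8_t  s = -1;		/* always signed, always 8 bits, on every target */
uint8_t u = 0xff;	/* always unsigned, always 8 bits, on every target */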
Posted Oct 26, 2022 8:27 UTC (Wed) by geert (subscriber, #98403)
Fortunately that herculean task has already been completed, using {,__}[us]{8,16,32,64}.
Posted Oct 26, 2022 6:04 UTC (Wed) by wtarreau (subscriber, #51152)

char c;
while ((c = getchar()) != EOF) {
	/* do something */
}

would turn into an infinite loop on PPC, because chars were unsigned there and would never match EOF (-1). Since then I've been insisting on building and running my code on various architectures, among which Arm, in part for unsigned chars.
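(The usual fix, for the record, is to make the variable an int; getchar() returns an int precisely so that EOF stays distinguishable from every byte value:)

int c;	/* int, not char */
while ((c = getchar()) != EOF) {
	/* do something */
}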
Posted Oct 26, 2022 8:29 UTC (Wed) by geert (subscriber, #98403)

Posted Oct 26, 2022 13:15 UTC (Wed) by wtarreau (subscriber, #51152)

Posted Oct 27, 2022 15:10 UTC (Thu) by cesarb (subscriber, #6266)
As luck would have it, 0xFF never appears in any valid UTF-8 stream (even in the original definition of UTF-8 which allowed for longer sequences), so the strong push to use UTF-8 everywhere lessens the impact of these bugs.
Posted Oct 27, 2022 17:45 UTC (Thu) by wtarreau (subscriber, #51152)

Posted Oct 27, 2022 13:59 UTC (Thu) by kpfleming (subscriber, #23250)

Posted Nov 3, 2022 14:49 UTC (Thu) by welinder (guest, #4699)
True.

> short, int, long all work that way.

Ah, no. They mostly work that way. "int" has unspecified signedness in the following code:

struct X {
	int m : 8;
};

Luckily bitfields are rare enough that you can just always insist on specifying "unsigned int" or "signed int".