
Would you like signs with those chars?

By Jonathan Corbet
October 24, 2022
Among the many quirks that make the C language so charming is the set of behaviors that it does not define; these include whether a char variable is a signed quantity or not. The distinction often does not make a difference, but there are exceptions. Kernel code, which runs on many different architectures, is where exceptions can certainly be found. A recent attempt to eliminate the uncertain signedness of char variables did not get far — at least not in the direction it originally attempted to go.

As a general rule, C integer types are signed unless specified otherwise; short, int, long all work that way. But char, which is usually a single byte on current machines, is different; it can be signed or not, depending on whatever is most convenient to implement on any given architecture. On x86 systems, a char variable is signed unless declared as unsigned char. On Arm systems, though, char variables are unsigned (unless explicitly declared signed) instead.
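
The difference is easy to demonstrate; this minimal sketch (assuming the usual defaults on each architecture) prints -1 on x86 and 255 on Arm:

    #include <stdio.h>

    int main(void)
    {
        char c = '\xff';

        /* c is promoted to int before the call; whether the value
         * sign-extends depends on whether plain char is signed */
        printf("%d\n", c);
        return 0;
    }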

The fact that a char variable may or may not be signed is an easy thing for a developer to forget, especially if that developer's work is focused on a single architecture. Thus, x86 developers can get into the habit of thinking of char as always being signed and, as a result, write code that will misbehave on some other systems. Jason Donenfeld recently encountered this sort of bug and, after fixing it, posted a patch meant to address this problem kernel-wide. In an attempt to "just eliminate this particular variety of heisensigned bugs entirely", it added the -fsigned-char flag to the compiler command line, forcing the bare char type to be signed across all architectures.
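
A hypothetical example of this class of bug (not Donenfeld's actual one, and next_byte() is an invented helper) stores a byte that may have the high bit set, then tests its sign:

    extern int next_byte(void);    /* hypothetical: returns a byte, 0-255 */

    void handle(void)
    {
        char c = next_byte();

        if (c < 0) {
            /* reached on x86 for bytes 0x80-0xff; never
             * reached on Arm, where plain char is unsigned */
        }
    }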

This change turned out to be unpopular. Segher Boessenkool pointed out that it constitutes an ABI change, and could hurt performance on systems that naturally want char to be unsigned. Linus Torvalds agreed, saying: "We should just accept the standard wording, and be aware that 'char' has indeterminate signedness". He disagreed, however, with Boessenkool's suggestion to remove the -Wno-pointer-sign option used now (thus enabling -Wpointer-sign warnings). That change would enable a warning about mixing pointers to signed and unsigned char types; Torvalds complained that it fails to warn when bare char variables are used, but produces a lot of false-positive warnings on correct code.
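
A sketch of the kind of code -Wpointer-sign complains about, even though it is arguably correct:

    void count_chars(const unsigned char *s);

    void f(void)
    {
        const char *name = "LWN";
        count_chars(name);   /* warning: pointer targets differ in signedness */
    }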

Later in the discussion, though, Torvalds wondered whether it might be a good idea to nail down the signedness of char variables after all — but to force them to be unsigned by default rather than signed. That, he said, shouldn't generate worse code on any of the commonly used architectures. "And I do think that having odd architecture differences is generally a bad idea, and making the language rules stricter to avoid differences is a good thing". Amusingly, he noted that, with this option, code like:

    const unsigned char *c = "Subscribe to LWN";

will still, with the -Wpointer-sign option, generate a warning, since a string constant pointer is still considered to be a bare char * type, which is then treated as being different from an explicit unsigned char * type. "You *really* can't win this thing. The game is rigged like some geeky carnival game".

Donenfeld saw merit in the idea, even though he thought it had the potential to break some code. He sent out a new patch adding -funsigned-char to the compiler command line to effect this change. He suggested that it could perhaps be merged immediately, since there would be time to fix any fallout before the 6.1 release, but Torvalds declined that opportunity: "if we were still in the merge window, I'd probably apply this, but as things stand, I think it should go into linux-next and cook there for the next merge window". He added that any problems resulting from the change are likely to be subtle and to lurk in driver code that isn't widely used across architectures. The core kernel code, instead, has always had to work across architectures, so he does not believe that problems will show up there.

So Donenfeld's patch is sitting in linux-next instead, waiting for the 6.2 merge window in December. That gives the community until late February to find any problems that might be caused by forcing bare char variables to be unsigned across all architectures supported by Linux. That is a fair amount of time, but it is also certainly not too soon to begin testing this change in as many different environments as possible. It is, after all, a fundamental change to the language in which the kernel is written; a lack of resulting surprises would, itself, be surprising.

One way to identify potential problems is to find the places where the generated code changes when char is forced to be unsigned. Torvalds has already made some efforts in that direction, and Kees Cook has used a system designed for checking reproducible builds to find a lot of changes. Many of those changes will turn out to be harmless, but the only way to know for sure is to actually look at them. Meanwhile, the posting of one fix by Alexey Dobriyan has caused Torvalds to request that the char fixes be collected into a single tree. As those fixes accumulate, the result should be a sign of just how much disruption this change is actually going to cause.


Would you like signs with those chars?

Posted Oct 24, 2022 18:14 UTC (Mon) by ballombe (subscriber, #9523) [Link] (18 responses)

C also has the isalpha et al class of functions that conveniently take an int argument, but are sometimes implemented as an array lookup, so you have to cast any char argument to unsigned char to shut up the compiler, even if your strings are 7-bit.
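
The usual defensive pattern is a wrapper along these lines (a sketch, not any particular libc's code):

#include <ctype.h>

/* The cast keeps a high-bit byte from being sign-extended into a
 * negative int, which would be undefined behavior for isalpha(). */
int my_isalpha(char c)
{
    return isalpha((unsigned char)c);
}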

Would you like signs with those chars?

Posted Oct 24, 2022 21:29 UTC (Mon) by NYKevin (subscriber, #129325) [Link] (17 responses)

That isn't the problem, as far as I can tell. The problem is that, if you accidentally have a character with the high bit set (because you're using ISO-8859-1 or Windows-1252 or some other 8-bit ASCII superset instead of* UTF-8), then it will sign extend and you will get nonsense. Since those non-Unicode encodings used to be fairly popular, compilers started warning on this, even though I believe there's nothing in the standard that explicitly requires a diagnostic on conversion from char to int. Taking care to do this is good practice, because if you offer a 7 bit channel, people will (ab)use it as an 8 bit channel whether it is intended to support that or not.

If you can find an environment where someone actually declared isalpha and relatives with arrays rather than ints, I would be very surprised, because the standard specifies that the argument must be an int, and any subsequent int-to-array conversion is the callee's problem.

* This problem is theoretically also possible if you are using UTF-8, but in that case, you'll get nonsense anyway, because UTF-8 has to be decoded before you can call functions like isalpha on it - and at that point, you've already widened everything to 32 bit, so hopefully you did it correctly.

Would you like signs with those chars?

Posted Oct 24, 2022 22:45 UTC (Mon) by wahern (subscriber, #37304) [Link] (3 responses)

> That isn't the problem, as far as I can tell.

I can't find conclusive examples for is- ctype routines, but here is how tolower was defined during the first few releases of OpenBSD, as forked from NetBSD:

#define tolower(c) ((_tolower_tab_ + 1)[c])

It's still defined similarly on NetBSD, today: http://cvsweb.netbsd.org/bsdweb.cgi/src/sys/sys/ctype_inl...

Also, EOF is a permitted value and typically -1 (thus the +1 in the above), though that would typically only be an issue for non-C locales.

Would you like signs with those chars?

Posted Oct 24, 2022 23:41 UTC (Mon) by NYKevin (subscriber, #129325) [Link] (2 responses)

> #define tolower(c) ((_tolower_tab_ + 1)[c])

Even so, I don't believe that the standard *actually* says that c has to be unsigned in that expression - just that the "usual arithmetic conversions" happen (i.e. the compiler magicks it into an int when you're not looking). Compilers presumably added that warning because there were instances of arrays being indexed with negative char, but not negative int or any other signed type. And, again, that presumably had something to do with ASCII supersets and other nonsense involving dirty 7 bit channels.

> Also, EOF is a permitted value and typically -1 (thus the +1 in the above), though that would typically only be an issue for non-C locales.

The argument is of type int (according to the standard, not that untyped macro), not char, so it's completely unambiguous: You are allowed to pass negative numbers to those routines, because int is always signed, and if it is implemented as a macro, it has to accept signed values in the int range. Of course, if you pass negatives other than EOF (or whatever EOF is #define'd to), then the standard presumably gives you UB (which is why it's OK for the array implementation to walk off the end in that case).

Would you like signs with those chars?

Posted Oct 25, 2022 0:22 UTC (Tue) by wahern (subscriber, #37304) [Link]

The C standard says, "The header <ctype.h> declares several functions useful for classifying and mapping characters. In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF. If the argument has any other value, the behavior is undefined."

Would you like signs with those chars?

Posted Oct 25, 2022 15:13 UTC (Tue) by mrvn42 (guest, #161806) [Link]

>> #define tolower(c) ((_tolower_tab_ + 1)[c])
>> Also, EOF is a permitted value and typically -1 (thus the +1 in the above), though that would typically only be an issue for non-C locales.

> The argument is of type int (according to the standard, not that untyped macro), not char, so it's completely unambiguous: You are allowed to pass negative numbers to those routines, because int is always signed, and if it is implemented as a macro, it has to accept signed values in the int range. Of course, if you pass negatives other than EOF (or whatever EOF is #define'd to), then the standard presumably gives you UB (which is why it's OK for the array implementation to walk off the end in that case).

The problem is that this will only work for values between -1 and 127 for an array of 129 bytes. A value of -2 (or, with signed char, any other non-ASCII value other than EOF) would access memory before the array, and a value of 255 (EOF mistakenly stored in an unsigned char, or anything non-ASCII) would access memory after the array.

Looking at the source link in the other comments, the BSD code seems to assume chars are unsigned. The test for ASCII doesn't work with signed chars at all.

So I assume that "_tolower_tab_" is actually 257 bytes long to cover all unsigned chars and EOF (which is -1 when stored as int).
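
A sketch of that assumed layout (types and values illustrative, not NetBSD's actual code):

/* 257 entries: index 0 holds the mapping for EOF (-1); indexes
 * 1..256 hold the mappings for the byte values 0x00..0xff. */
static const short _tolower_tab_[257] = {
    -1,     /* so that (_tolower_tab_ + 1)[EOF] == EOF */
    /* ... identity for 0x00..0x40, 'a'..'z' for 'A'..'Z', ... */
};
#define tolower(c) ((_tolower_tab_ + 1)[(c)])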

Would you like signs with those chars?

Posted Oct 25, 2022 0:31 UTC (Tue) by tialaramex (subscriber, #21167) [Link] (10 responses)

> you'll get nonsense anyway, because UTF-8 has to be decoded before you can call functions like isalpha on it

ASCII is a subset of UTF-8. So if you have a C library which is content to implement these functions for ASCII, they do that fine on UTF-8 data without any decoding.

You run into a problem, as you should expect, if the C library thinks the data is 8859-1 when it's actually UTF-8, but otherwise it just provides the very useful answer to the question: is this (alphabetic / a digit / punctuation / whitespace / etc.) in ASCII?

Rust deliberately provides the ASCII variants of these functions, named is_ascii_digit and so forth, on both char (a Unicode scalar value) and u8 (an unsigned 8-bit integer, i.e. like C's unsigned char). You often do not need the fancy Unicode is_digit but only is_ascii_digit for real software, because overwhelmingly the "is it a digit?" question is not cultural but purely technical.

Would you like signs with those chars?

Posted Oct 25, 2022 3:29 UTC (Tue) by dvdeug (subscriber, #10998) [Link] (7 responses)

isalpha and friends are defined to work on the current locale; you can't trust them to work on just ASCII.

isxdigit is always 0-9A-F, and as far as I can tell from the manpage, isdigit is always 0-9. Barring those, why is it useful to ask "is this an alphabetic/punctuation/whitespace character in ASCII?" Pretty much everything is defined in terms of Unicode now; even if you're processing C code, you should still be prepared for identifiers in Russian, Greek or Chinese. I have a hard time thinking about a case where it's the right thing to check if something is some unspecified alphabetic character, but only those in ASCII.

Would you like signs with those chars?

Posted Oct 25, 2022 9:48 UTC (Tue) by pbonzini (subscriber, #60935) [Link] (1 responses)

The typical example is configuration files where you *can* restrict identifiers to ASCII. Using locale functions will cause a mess for Turkish and Azerbaijani speakers, thanks to the "dotless i" and "dotted I" characters in their alphabets.

Would you like signs with those chars?

Posted Oct 25, 2022 15:46 UTC (Tue) by dvdeug (subscriber, #10998) [Link]

You can restrict it to ASCII, but the question is *should* you. The dotted I / dotless i issues only matter if you're doing case-insensitive comparisons, which isn't a very Unix thing.

Would you like signs with those chars?

Posted Oct 25, 2022 15:46 UTC (Tue) by khim (subscriber, #9252) [Link] (4 responses)

> Pretty much everything is defined in terms of Unicode now; even if you're processing C code, you should still be prepared for identifiers in Russian, Greek or Chinese.

Sure, but only the part of your program which deals with identifiers needs adjustment.

You can write int foo = 42; but can not write int fоо = 42; (the second spelled with Cyrillic "о" lookalikes), which means that you can easily use the "C" locale and all ASCII-only functions with a simple change: where before Unicode you used isalpha(c) or isalnum(c), now you would use c < 0 || isalpha(c) and c < 0 || isalnum(c).

That's how doxygen handles it; I would assume someone may use isalpha(c) and/or isalnum(c) in a similar way.
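
A minimal sketch of that change (it assumes char is signed, so that any byte with the high bit set compares as negative):

#include <ctype.h>

/* An ASCII letter, or any high-bit byte taken to belong to a UTF-8
 * identifier; the c < 0 test short-circuits before isalpha() could
 * see a negative (undefined-behavior) argument. */
static int ident_start(char c)
{
    return c < 0 || isalpha((unsigned char)c);
}

static int ident_cont(char c)
{
    return c < 0 || isalnum((unsigned char)c);
}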

Would you like signs with those chars?

Posted Oct 25, 2022 16:01 UTC (Tue) by dvdeug (subscriber, #10998) [Link] (3 responses)

That diff shows doxygen keeping UTF-8 characters together. The question is why do you need to know that a character is alphabetic but only in ASCII. In your example, the lex is "int" identifier "=" integer ";". It's not looking for ASCII letters; it's looking for 'i', 'n', 't', or the identifier in Unicode.

Would you like signs with those chars?

Posted Oct 25, 2022 16:10 UTC (Tue) by khim (subscriber, #9252) [Link] (2 responses)

> It's not looking for ASCII letters; it's looking for 'i', 'n', 't', or the identifier in Unicode.

Only the identifier is not "Unicode". It's alpha-or-Unicode followed by alnum-or-Unicode (where "Unicode" is defined as "anything with a high bit set").

Doxygen does that with lex, but in simpler cases you may do the same with ctype.h.

Would you like signs with those chars?

Posted Oct 25, 2022 23:41 UTC (Tue) by dvdeug (subscriber, #10998) [Link] (1 responses)

No, not if that was C or C++ code. I dug up a copy of the C++2003 standard I had lying around, and it specifically defines the set of letters in an identifier, and there's a limited number of usable Unicode characters. I'm pretty sure that any standard updated this century will have been made with reference to Unicode Standard Annex #31. The JVM (not Java) standard goes the other way and only restricts . ; [ / from being in a name.

Would you like signs with those chars?

Posted Oct 26, 2022 0:36 UTC (Wed) by khim (subscriber, #9252) [Link]

Sure, but the standard doesn't say what a compiler (or, even worse, a non-compiler) has to do with broken programs.

And if you ignore what the standard says and just go with isalpha/isalnum + Unicode (where Unicode == "high bit is set") then you would handle all correct programs perfectly. And if someone feeds you an incorrect one… who cares how it would be handled?

It's not as if we live in a world where everyone cares all that much about following the standard to a T.

Would you like signs with those chars?

Posted Oct 26, 2022 7:32 UTC (Wed) by NYKevin (subscriber, #129325) [Link] (1 responses)

> they do that fine on UTF-8 data without any decoding.

No they won't: at best they will pass through non-ASCII without doing whatever the function is defined to do (e.g. tolower won't actually lowercase your letters), and at worst they will silently corrupt it (if they think it's one of the legacy 8-bit encodings).

> You often do not need the fancy Unicode is_digit but only is_ascii_digit for real software because overwhelmingly the "is it a digit?" question is not cultural but purely technical.

There is a subset of edge cases where a string does not contain linguistically useful information, like a phone number or UUID. In those cases, these ASCII-only functions are somewhat useful, but most of them could just as easily be done with regular expressions like [0-9]+. Realistically, you need nontrivial parsing logic anyway, to deal with things like embedded dashes and other formatting vagaries, so you may as well solve both problems with the same tool (which can and should be Unicode-capable, because ASCII is ultimately "just" a subset of UTF-8). In that context, these ASCII-only functions look rather less useful to me.

The problem is, ASCII-only functions are also an attractive nuisance. They make things a little too comfortable for the programmer who's still living in 1974, the programmer who still thinks that strings are either ASCII or "uh, I dunno, those funny letters that use the high bit, I guess?" Those programmers are the reason that so many Americans "can't" have diacritical marks in their names (on their IDs, their airline tickets, etc.). If you are writing string logic in 2022, and your strings have anything to do with real text that will be read or written by actual humans, then your strings are or should be Unicode. Unicode is the default, not the "fancy" rare exception. If you have strings, and they're not some variety of Unicode, then one of the following is true:

1. They're encoding something that sort of looks like text, but is not really text, like a phone number.
2. They are raw bytes in some binary format, and not text at all.
3. In practice, they mostly are Unicode, but that's not your problem (e.g. because you're a filesystem and the strings are paths).
4. You hate your non-English-speaking users (and the English-speakers who have diacritical marks anywhere in their string for whatever reason - we shouldn't make assumptions).
5. You inherited a pile of tech debt and it's too late to fix it now.

Would you like signs with those chars?

Posted Oct 26, 2022 12:03 UTC (Wed) by tialaramex (subscriber, #21167) [Link]

The thread you're replying in is about the "isalpha et al class of function" - to my mind that's specifically the predicates, but if you insist on also including tolower and toupper from the same part of the standard library, then that's still fine, although more narrowly useful; they perform exactly as anticipated.

Sure enough Rust provides to_ascii_uppercase and to_ascii_lowercase here too.

[ Rust also provides to_uppercase and to_lowercase on char, but because this is a tricky problem these are appropriately more complicated ]

I already mentioned (but you snipped) that this will go wrong if your C library thinks it knows the byte is from some legacy encoding like 8859-1.

> most of them could just as easily be done with regular expressions like [0-9]+

This sort of completely inappropriate use of technology (resorting to regular expressions to just match ASCII digits) is how we get software that is hundreds of times bigger and slower than necessary.

> Realistically, you need nontrivial parsing logic anyway

Again, you seem to have determined that people would be looking at these functions where they're completely inappropriate, but C itself isn't the right choice in the applications you're thinking about.

> If you are writing string logic in 2022, and your strings have anything to do with real text that will be read or written by actual humans, then your strings are or should be Unicode.

Certainly, but again, we're not asking about Javascript or C# or even Rust, we're talking about C and most specifically about the Linux kernel. Whether the people implementing a driver for a bluetooth device are "actual humans" is I guess up for question, but they're focused very tightly on low level technical details where the fact that the Han writing system has numbers is *irrelevant* to the question of whether this byte is "a digit" in the sense they mean.

C only provides these 1970s functions, and so you're correct that you should not try to write user-facing software in C in 2022. But, the 1970s style ASCII-only functions are actually useful, like it or not, because a bunch of technical infrastructure we rely on works this way, even if maybe it wouldn't if you designed it today (or maybe it would, hard to say)

Example: DNS name labels are (a subset of) ASCII. to_ascii_lowercase or to_ascii_uppercase is exactly appropriate for comparing such labels. You might say "Surely we should just agree on the case on the wire", but actually we must not do that, at least not before all DNS traffic is DoH: it turns out our security mitigation for some issues relies on random bits in DNS queries, and there aren't really enough, so we actually put more randomness in the case bits of the letters in the DNS name. Your code therefore needs to work properly with DNS queries that have the case bits set or unset seemingly at random, so as to prevent attackers guessing the exact name sent...

The end user doesn't see any of this, you aren't expected to type randomly cased hostnames in URLs, nor to apply Punycode rules, you can type in a (Unicode) hostname, and your browser or other software just makes everything work. But the code to do that may be written in C (or perhaps these days Rust) and its DNS name matching only cares about ASCII, even though the actual names are Unicode.
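
A sketch of that ASCII-only matching (names invented for illustration): case-randomized labels compare equal precisely because only the ASCII case bits are folded, with no locale involvement:

#include <stddef.h>

static unsigned char ascii_lower(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

static int dns_label_eq(const unsigned char *a, const unsigned char *b,
                        size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (ascii_lower(a[i]) != ascii_lower(b[i]))
            return 0;
    return 1;
}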

Would you like signs with those chars?

Posted Oct 25, 2022 9:14 UTC (Tue) by jengelh (subscriber, #33263) [Link] (1 responses)

>if you accidentally have a character with the high bit set (because you're using ISO-8859-1 or Windows-1252 or some other 8-bit ASCII superset instead of* UTF-8), then it will sign extend and you will get nonsense.

int lowertbl[] = {-1, 0, 1, ..., 0x40, 0x61, 0x62, ...};
#define tolower(c) ((lowertbl+1)[c])

Sign extension need not produce nonsense. Thanks to the equivalence of the expressions (lowertbl+1)[c] <=> *(lowertbl+1+c) <=> *(lowertbl+c+1) <=> lowertbl[c+1], what matters is whether the pointer still points to something sensible.

Would you like signs with those chars?

Posted Oct 26, 2022 7:35 UTC (Wed) by NYKevin (subscriber, #129325) [Link]

The standard requires that all functions implemented as macros must also be implemented as functions, so that you can take their addresses. If implemented as a function, the type must be declared as int, and then c gets coerced to a negative number by sign extension before you even get to the indexing expression. You would have to pad out the table with an extra 128 entries, not an extra 1 entry.

Would you like signs with those chars?

Posted Oct 24, 2022 19:41 UTC (Mon) by mss (subscriber, #138799) [Link] (11 responses)

That signed / unsigned char difference is a potent source of subtle bugs.

For example the following code:
char c = 0xff;
printf ("%x\n", c - 1);
will print fffffffe if char is signed, but just the (intuitively expected) fe if it is unsigned.

Would you like signs with those chars?

Posted Oct 24, 2022 21:29 UTC (Mon) by mb (subscriber, #50428) [Link] (5 responses)

>will print fffffffe if char is signed but just (intuitively expected) fe if it is unsigned.

Well, char being signed is consistent with short (which is always signed).

short c = 0xffff;
printf ("%x\n", c - 1);

prints fffffffe.

Therefore, char = unsigned char actually is the inconsistent choice.

Would you like signs with those chars?

Posted Oct 25, 2022 6:35 UTC (Tue) by eru (subscriber, #2753) [Link] (4 responses)

The difference between char and short is that people primarily expect char to contain a character (it's even in its name, and string literals are arrays of char), but shorts are used for numbers. This is why having char as a signed type is insane. Whoever has heard of a negative letter 'A'? Back in the eighties, when first learning C and porting C programs from Usenet to my crummy PC/XT clone, I often tripped up because of this, as my language requires more than A-Z to write. It was a real pain.

Would you like signs with those chars?

Posted Oct 25, 2022 10:36 UTC (Tue) by geert (subscriber, #98403) [Link] (3 responses)

Don't forget all of this was introduced in the days of 7-bit ASCII[*], so a signed char was fine. In fact it also allows you to store a negative error code, without wasting precious memory on expanding beyond a single byte.
8-bit ASCII (e.g. ISO-8859-*) was only standardized in the eighties, which is about the same time as PowerPC and ARM saw the light of day, so making char default to unsigned may have made sense for them (modulo the compatibility issues).

[*] EBCDIC is 8-bit, and seems to have an intentional division in characters with and without bit 7 (the "sign" bit) set.
https://en.wikipedia.org/wiki/EBCDIC#Code_page_layout

Would you like signs with those chars?

Posted Oct 25, 2022 13:15 UTC (Tue) by eru (subscriber, #2753) [Link] (1 responses)

8-bit character sets were not that new, although I agree they mostly post-date C. The Commodore micros since the 1970's used "PETSCII" with graphics characters in the upper half, and the IBM PC has since 1981 had its own character set with ASCII in the lower half, and a mixture of graphics and accented letters in upper (selection depending on the region).

Would you like signs with those chars?

Posted Oct 25, 2022 19:58 UTC (Tue) by khim (subscriber, #9252) [Link]

> 8-bit character sets were not that new, although I agree they mostly post-date C.

Note that a bunch of these were in wide use way before C was a thing.

Would you like signs with those chars?

Posted Nov 4, 2022 20:07 UTC (Fri) by fest3er (guest, #60379) [Link]

This is why the DEC-10 was nice. Bytes could be defined from 1 to 36 bits in length. Granted, this could result in unused bits in words, but that's the price of flexibility. :)

Would you like signs with those chars?

Posted Oct 25, 2022 4:56 UTC (Tue) by SLi (subscriber, #53131) [Link] (4 responses)

Though in the signed case it might print or do anything else too (in standard C), since assigning an overflowing value to a signed integer is undefined behavior. Unless there is some exception for chars (C is quirky and I don't remember).

Would you like signs with those chars?

Posted Oct 25, 2022 7:06 UTC (Tue) by Villemoes (subscriber, #91911) [Link] (1 responses)

Let's be completely precise here. It is not undefined, it is "implementation-defined or an implementation-defined signal is raised".

C99, 6.5.16.1 Simple assignment

(2) In simple assignment (=), the value of the right operand is converted to the type of the
assignment expression and replaces the value stored in the object designated by the left
operand.

and that "converted to the type of" is covered in

6.3 Conversions, 6.3.1.3 Signed and unsigned integers:

(1) When a value with integer type is converted to another integer type other than _Bool, if
the value can be represented by the new type, it is unchanged.
(2) Otherwise, if the new type is unsigned, the value is converted by repeatedly adding or
subtracting one more than the maximum value that can be represented in the new type
until the value is in the range of the new type.
(3) Otherwise, the new type is signed and the value cannot be represented in it; either the
result is implementation-defined or an implementation-defined signal is raised.

And in practice, any relevant compiler (at least when we're talking the linux kernel, but I'd be surprised if any non-academic compiler did otherwise) does as gcc:

* 'The result of, or the signal raised by, converting an integer to a
signed integer type when the value cannot be represented in an
object of that type (C90 6.2.1.2, C99 and C11 6.3.1.3).'

For conversion to a type of width N, the value is reduced modulo
2^N to be within range of the type; no signal is raised.

Would you like signs with those chars?

Posted Oct 25, 2022 9:31 UTC (Tue) by SLi (subscriber, #53131) [Link]

Ah, indeed; I forgot that there's a difference between conversions (which is implementation-defined) and general signed overflow, which is undefined. Thanks for pointing that out :)

I don't think it's only a matter of being "completely precise"; relying on implementation-defined behavior can be a reasonable choice; relying on undefined behavior that a compiler vendor has not explicitly declared they define, to me, usually cannot. (I know Linus would disagree with me :D)

Allowing an implementation-defined signal seems a bit horrible here, though, but I think I understand where the committee was coming from...

Would you like signs with those chars?

Posted Oct 25, 2022 7:39 UTC (Tue) by matthias (subscriber, #94967) [Link] (1 responses)

In this case, there is no overflow. Subtracting 1 from 0xff is non-overflowing regardless of whether the type is signed or unsigned. If it is signed, the result is -2 (0xfe). And in both cases, the result is well within the bounds of an int. The difference is whether it gets sign-extended (signed char) or not (unsigned char).

Would you like signs with those chars?

Posted Oct 25, 2022 9:59 UTC (Tue) by SLi (subscriber, #53131) [Link]

True. I was thinking that the assigning 255 to a signed char part would be UB (like signed integer overflow), but as was pointed above, conversion when the number does not fit is merely implementation-defined.

Would you like signs with those chars?

Posted Oct 24, 2022 20:08 UTC (Mon) by cesarb (subscriber, #6266) [Link] (13 responses)

Another advantage of making "char" always unsigned: it better matches Rust. In Rust, a string constant can be viewed as an &[u8] (actually &str, but str is nothing more than an [u8] which is guaranteed to contain valid UTF-8), so it can be argued that Rust's equivalent to a bare "char" would be u8, which is unsigned.

Would you like signs with those chars?

Posted Oct 24, 2022 21:14 UTC (Mon) by NYKevin (subscriber, #129325) [Link] (10 responses)

More to the point, it better matches reality. A byte is 8 bits, not a number. You can use it to store numbers, but in practice it will be a component of a short or int (or a UTF-8 code point sequence, which sometimes might happen to only be one byte long), not a number in its own right. I don't visualize bytes as getting magically sign-extended when I do bitwise operations on them, and I pretty much never use char for math or counting things. I suppose there might be some situations where you really are extremely sure that you'll never need to count higher than 127 or lower than -128, but it's difficult to imagine a specific example (that does not involve the phrase "for historical reasons").

Would you like signs with those chars?

Posted Oct 24, 2022 22:13 UTC (Mon) by Sesse (subscriber, #53779) [Link] (4 responses)

It's fairly common to use these types to save memory. Just to take a random example from work: A counting Bloom filter will almost never need to count higher than 255, so why waste four times the memory (and cache space)?
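
A minimal sketch of such a counter array (sizes and names invented for illustration):

#include <stddef.h>
#include <stdint.h>

static uint8_t counters[1 << 20];   /* 1 MiB, versus 4 MiB of uint32_t */

static void cbf_inc(size_t i)
{
    if (counters[i] < UINT8_MAX)    /* saturate rather than wrap */
        counters[i]++;
}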

I do wish C had made a separate “byte” type, though, for aliasing reasons. char has too many tasks.

Would you like signs with those chars?

Posted Oct 24, 2022 23:26 UTC (Mon) by NYKevin (subscriber, #129325) [Link]

1. short is very often good enough for use cases like that. Not always, but often.
2. If you must use char, unsigned will work perfectly well (does your Bloom Filter have a negative count?!), so this isn't actually a use case for signed char.
3. If you need a sentinel value, you can use 255; there is no magical rule that says sentinel values have to be negative.

Would you like signs with those chars?

Posted Oct 25, 2022 7:30 UTC (Tue) by gspr (subscriber, #91542) [Link] (2 responses)

Re the aliasing of char: wouldn't uint8_t be a nicely named type for your use?

Would you like signs with those chars?

Posted Oct 25, 2022 7:34 UTC (Tue) by Sesse (subscriber, #53779) [Link] (1 responses)

It would, except it maps onto unsigned char, which can alias on anything as it stands. And it's about fifty years too late to change that :-)

Would you like signs with those chars?

Posted Oct 25, 2022 19:18 UTC (Tue) by wahern (subscriber, #37304) [Link]

uint8_t isn't required to be typedef'd to unsigned char. An implementation could choose to treat it differently from unsigned char, precisely to avoid the aliasing behavior of char. It seems that in the case of GCC, it may have been C++ partly to blame for the current state of affairs: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66110#c13

Would you like signs with those chars?

Posted Oct 25, 2022 5:09 UTC (Tue) by SLi (subscriber, #53131) [Link] (4 responses)

I really wonder (but am too lazy to Google now) what even is the history of signed chars. It seemed like a weird thing to me anyway. Was it some kind of wish to have a type that represents "all the characters we care about" and -1 for EOF, in times before the relevant routines were int?

Would you like signs with those chars?

Posted Oct 25, 2022 16:15 UTC (Tue) by khim (subscriber, #9252) [Link]

I suspect it was just added when the C Standard committee realized that on some platforms simple char is unsigned. They needed a single-byte signed type, thus signed char was born.

Would you like signs with those chars?

Posted Oct 27, 2022 2:09 UTC (Thu) by gdt (subscriber, #6284) [Link] (2 responses)

The problem being solved is moving a byte in memory to a word-sized register. If you want to do that load in one processor instruction then you have to accept the processor's choice of sign extension or otherwise as that byte is expanded into the register.

If the language insists on "char" being "unsigned char" then some processors will need to follow the register load with an AND instruction to clear the sign extension. If loading that register also sets register flags (eg, Negative) then you'll need to clear those register flags too. You could, of course, perhaps avoid this with careful compiler optimisations, but that's asking too much of the compilers of the era.

Well before the ANSI standards committee started work, the convention in C was to let these differences in processor implementations shine through, with an obligation on people writing code intended to be 'portable' between differing processors to deal with the results. Considering that C was a systems programming language, this wasn't an unreasonable choice.

Would you like signs with those chars?

Posted Oct 27, 2022 6:01 UTC (Thu) by SLi (subscriber, #53131) [Link]

Ah, right, makes total sense. Thank you!

Would you like signs with those chars?

Posted Oct 27, 2022 7:18 UTC (Thu) by joib (subscriber, #8541) [Link]

https://trofi.github.io/posts/203-signed-char-or-unsigned... has some investigation on this issue. Turns out there are quite a few architectures that only provide zero-extending byte loads but whose ABI has chosen char to be signed, thus requiring an extra instruction to patch it up.

(That blog post is a few years old and doesn't include results for RISC-V, but I understand that RISC-V is like ARM, in that it provides both zero-extending and sign-extending byte loads but the ABI has chosen char to be unsigned)

Would you like signs with those chars?

Posted Oct 24, 2022 21:19 UTC (Mon) by joib (subscriber, #8541) [Link] (1 responses)

Not sure why that would matter?

When passing strings between C and Rust, the pointers can/must be cast to the correct type anyway. And when working in Rust-land, the Rust compiler handles strings per its own rules, and in C-land the C compiler handles chars per its rules.

(Adding to the potential confusion, Rust also has a 'char' type, which is a 4-byte value capable of storing a single UTF-32 codepoint; presumably under the covers it's considered unsigned. But indeed a string (str/String) is a u8 array containing UTF-8.)

Would you like signs with those chars?

Posted Oct 25, 2022 3:30 UTC (Tue) by tialaramex (subscriber, #21167) [Link]

> Rust also has a 'char' type, which is a 4-byte value capable of storing a single UTF-32 codepoint, presumably under the covers it's considered unsigned.

Rust defines char as a Unicode scalar value. Unicode only has one set of codepoints, and some of them aren't valid scalar values because they were used to make UTF-16 work; UTF-32 maps all the ones which are scalar values to single code *units* and the rest are invalid.

Thus char::from_u32(0xDE01).unwrap(); will panic because this claims 0xDE01 is a Unicode scalar value, but it isn't because code point U+DE01 is a surrogate used for UTF-16 and has no meaning on its own.

char is not "considered" to be signed or unsigned. You can't do arithmetic on Rust types unless they implement the (traits signifying) arithmetic operators, which char does not, so in Rust neither 'A' + 'B' nor false + true (bools only implement some logical and bitwise boolean operators, not general arithmetic) works.

Would you like signs with those chars?

Posted Oct 24, 2022 21:31 UTC (Mon) by lisch (subscriber, #36574) [Link] (7 responses)

> Thus, x86 developers can get into the habit of thinking of char as always being signed and, as a result, write code that will misbehave on some other systems.
In my experience, x86 developers are more likely to unwittingly think of char as unsigned. For example,
char c = '\xfc';
int array[256] = { ... };
do_something(array[c]);   // BOOM!
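
The usual fix is to force the index into the 0..255 range before the lookup:

do_something(array[(unsigned char)c]);   /* no sign extension, no BOOM */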

Would you like signs with those chars?

Posted Oct 25, 2022 6:42 UTC (Tue) by eru (subscriber, #2753) [Link] (6 responses)

> In my experience, x86 developers are more likely to unwittingly think of char as unsigned.

Maybe because some early C compilers for the PC worked this way. Like Lattice C. The original 16-bit 8086 did not have the sign-extending 8-bit load as a single instruction, so it was more efficient to treat char as unsigned.

Would you like signs with those chars?

Posted Oct 25, 2022 9:51 UTC (Tue) by pbonzini (subscriber, #60935) [Link] (4 responses)

I suspect most x86 developers have never heard of Lattice C. :)

> The original 16-bit 8086 did not have the sign-extending 8-bit load as a single instruction, so it was more efficient to treat char as unsigned.

"mov al, [x] + cbw" (sign extending) is even one byte shorter than "xor ax, ax + mov al, [x]", so there is no particular x86-specific reason to pick unsigned over signed. Unsigned chars are simply more intuitive; signed-by-default chars are a relic of K&R C not having the signed keyword at all.

Would you like signs with those chars?

Posted Oct 25, 2022 16:20 UTC (Tue) by khim (subscriber, #9252) [Link] (1 responses)

> I suspect most x86 developers have never heard of Lattice C. :)

They may not have heard about it, but they definitely know its properties. Lattice C is the maiden name of Microsoft C, and I doubt x86 developers who have never heard about Microsoft C exist.

P.S. Yes, Microsoft C 3.0 is an independent rewrite, but it had to be compatible with Microsoft C 1.x and 2.x, and thus it had to be compatible with Lattice C, too.

Would you like signs with those chars?

Posted Oct 26, 2022 14:29 UTC (Wed) by eru (subscriber, #2753) [Link]

I recall Microsoft C switched to signed characters in that rewrite, probably to be compatible with most Unix compilers. Xenix used Microsoft C as its system compiler, so compatibility helped porting. That is when I found out about the indeterminate behaviour of char in real life.

Would you like signs with those chars?

Posted Oct 25, 2022 16:34 UTC (Tue) by khim (subscriber, #9252) [Link] (1 responses)

cbw only works with the accumulator, though. If you want speed and use register variables (meaning you use di and si) then you cannot use it.

Whether that was the actual reason for the use of unsigned char or not we will never know, of course.

Would you like signs with those chars?

Posted Oct 26, 2022 8:18 UTC (Wed) by pbonzini (subscriber, #60935) [Link]

Compilers at the time weren't super smart on register allocation (also because only AX and BX were both 8-bit accessible and not given special duties by the ISA; there simply wasn't a lot of freedom). Loading from memory into the accumulator was by far the common case. Even though a simple char+char addition would have had to use "XCHG BX,AX" or something like that in order to sign extend both operands, CBW+XCHG would be the same length as clearing BH.

Would you like signs with those chars?

Posted Oct 25, 2022 10:16 UTC (Tue) by adobriyan (subscriber, #30858) [Link]

XLATB treats AL as unsigned, too.

Would you like signs with those chars?

Posted Oct 25, 2022 4:16 UTC (Tue) by scientes (guest, #83068) [Link] (1 responses)

> On Arm systems, though, char variables are unsigned (unless explicitly declared signed) instead.

I want to point out that the LDRB and LDRBT instructions also zero-extend byte loads, so this unsignedness is part of the ABI.

Would you like signs with those chars?

Posted Oct 26, 2022 15:42 UTC (Wed) by jrtc27 (subscriber, #107748) [Link]

LDRx for B/H(/W on AArch64) zero-extends in general, yet short and int are still signed. Plus if you put an S after the R you get the sign-extending version of the instruction (which even exists for the LDRxT (AArch32) / LDTRx (AArch64) unprivileged loads). Thus nothing about the ABI is forced, or even made more efficient if a particular choice is made, by the instruction set here, just whether the mnemonics have an S in them or not.

Would you like signs with those chars?

Posted Oct 25, 2022 12:54 UTC (Tue) by dskoll (subscriber, #1630) [Link] (1 responses)

Mostly OT... I first learned assembler on the Motorola 6809. It had two sets of conditional branch instructions: One set interpreted the comparisons as signed, and the other as unsigned. It's been decades since I did any Intel assembler, but I assume the same thing holds? Are there any real architectural reasons to prefer signed over unsigned or vice-versa?

The 6809 had an instruction called SEX. It sign-extended accumulator B into double-accumulator D. It also had a BRA instruction, which was an unconditional branch. Someone tended toward the sophomoric when naming the instructions. :)

Would you like signs with those chars?

Posted Oct 25, 2022 14:00 UTC (Tue) by pbonzini (subscriber, #60935) [Link]

> I assume the same thing holds?

Yes, JL/JG/JLE/JGE are for signed comparisons while JB/JA/JBE/JAE are for unsigned comparisons.

Other architectures use LT/GT/LE/GE for signed and LTU/GTU/LEU/GEU for unsigned.

POWER is special in that it has separate "compare signed" and "compare unsigned" instructions, and the branches are just less than/equal/greater than.

Would you like signs with those chars?

Posted Oct 25, 2022 19:27 UTC (Tue) by mm7323 (subscriber, #87386) [Link] (1 responses)

Why not use uint8_t and int8_t when dealing with numeric values or octets, and keep char just for characters and strings?

Would you like signs with those chars?

Posted Oct 25, 2022 20:03 UTC (Tue) by khim (subscriber, #9252) [Link]

Because C is not made that way. uint8_t and int8_t are not types, they are type aliases.

You still end up dealing with that char/signed char/unsigned char whether you like it or not.
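
A small illustration of the point (on typical implementations, where uint8_t is a typedef for unsigned char):

#include <stdint.h>

void alias_demo(void)
{
    uint8_t *p = 0;
    unsigned char *q = p;   /* no cast needed: the very same type */
    (void)q;                /* so uint8_t inherits unsigned char's
                             * licence to alias any object */
}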

Would you like signs with those chars?

Posted Oct 25, 2022 19:31 UTC (Tue) by daniel.glasser (subscriber, #97146) [Link] (1 responses)

Sometimes code uses a byte as a very short signed integer in one instance, an unsigned integer in another instance, and as a character where sign did not matter in other instances. I do a lot of coding for embedded (and otherwise resource limited) targets, so this comes up a lot; using compact data structures, and on some targets eliminating generation of sign extension instructions can make the difference between an application that fits on the target and code that doesn't fit, and also affects the runtime efficiency of said code. (Anyone who's written firmware for microcontrollers in C can probably back me up on this.)

Back in the 1980s, I gave up explicitly using "char" for anything other than strings and/or API compatibility, and instead moved to a precursor of the types defined by the header file <stdint.h>, which are now "uint8_t" for unsigned and "int8_t" for signed (I think that the non-standard version was just "uint8" and "int8", respectively, but I'm not sure anymore). Not all compiler toolchains included a header file with consistent names, so I had my own. Using this convention, my code was portable between PDP-11, VAX-11, M68000, Intel8086, and Zilog Z8000, all of which I was using at the time, and eventually PPC, DEC Alpha, ARM, and other architectures with a number of different compilers with, at most, a change in a #define for whether "char" was signed or unsigned if the compiler didn't provide a pre-defined macro for that.

When ISO/IEC 9899:1999 (C99) added "<stdint.h>", I switched to using the standard typedef names for new code and my projects included a "include/compat" directory with versions of "stdint.h" and "stdbool.h" (and a few other "standard" header files introduced by C99) to be used for pre-C99 conforming compilers.

Adopting this sort of thing in the kernel at this stage would likely be a herculean task, but for any new software project in C it is trivial to standardize on using typedefs with explicit properties (signed/unsigned, width) where the built-in C types may vary between targets.
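
A sketch of the sort of compatibility header described above (the names and guard macro are illustrative, not the author's actual code):

/* for pre-C99 compilers that ship no <stdint.h> of their own */
#ifndef COMPAT_STDINT_H
#define COMPAT_STDINT_H

typedef unsigned char  uint8_t;
typedef signed char    int8_t;
typedef unsigned short uint16_t;
typedef short          int16_t;

#endif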

Would you like signs with those chars?

Posted Oct 26, 2022 8:27 UTC (Wed) by geert (subscriber, #98403) [Link]

> Adopting this sort of thing in the kernel at this stage would likely be a herculean task, but for any new software project in C it is trivial to standardize on using typedefs with explicit properties (signed/unsigned, width) where the built-in C types may vary between targets.

Fortunately that herculean task has already been completed, using {,__}[us]{8,16,32,64}.

Would you like signs with those chars?

Posted Oct 26, 2022 6:04 UTC (Wed) by wtarreau (subscriber, #51152) [Link] (4 responses)

I was fortunate to get hit by this a few decades ago, when seeing that my old dirty code doing this:

char c;

while ((c = getchar()) != EOF) {
    /* do something */
}

would turn into an infinite loop on PPC, because chars were unsigned there and would never match EOF (-1). Since then I've been insisting on building and running my code on various architectures, among which Arm, in part for its unsigned chars.

Would you like signs with those chars?

Posted Oct 26, 2022 8:29 UTC (Wed) by geert (subscriber, #98403) [Link] (3 responses)

BTW, getchar() returns int, not (signed) char, so assigning to char causes truncation.
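
The conventional form keeps the int around until EOF has been checked:

int c;   /* int, so EOF (-1) stays distinguishable from byte 0xff */

while ((c = getchar()) != EOF) {
    /* do something with (unsigned char)c */
}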

Would you like signs with those chars?

Posted Oct 26, 2022 13:15 UTC (Wed) by wtarreau (subscriber, #51152) [Link] (2 responses)

Yes exactly! But when you're young you don't notice ;-) It's just that a 0xFF byte also ends the stream. Nowadays you can still occasionally spot programs which stop parsing input after 0xFF, and when I see that it just reminds me that they made the same mistake I made, believing that getchar() would return... a char!

Would you like signs with those chars?

Posted Oct 27, 2022 15:10 UTC (Thu) by cesarb (subscriber, #6266) [Link] (1 responses)

> Nowadays it's still possible to seldom spot programs which stop parsing input after 0xFF

As luck would have it, 0xFF never appears in any valid UTF-8 stream (even in the original definition of UTF-8 which allowed for longer sequences), so the strong push to use UTF-8 everywhere lessens the impact of these bugs.

Would you like signs with those chars?

Posted Oct 27, 2022 17:45 UTC (Thu) by wtarreau (subscriber, #51152) [Link]

No, it makes them harder to test, in fact. FWIW it was sound processing, so any char value is valid. And when you do that with unsigned values, you only face the problem when reaching saturation ;-)

Would you like signs with those chars?

Posted Oct 27, 2022 13:59 UTC (Thu) by kpfleming (subscriber, #23250) [Link]

const unsigned char *c = u"Subscribe to LWN";
Would be fine, if it was an option :-)

Would you like signs with those chars?

Posted Nov 3, 2022 14:49 UTC (Thu) by welinder (guest, #4699) [Link]

> As a general rule, C integer types are signed unless specified otherwise;

True.

> short, int, long all work that way.

Ah, no. They mostly work that way.

"int" has unspecified signedness in the following code:

struct X {
    int m : 8;
};

Luckily bitfields are rare enough that you can just always insist on specifying "unsigned int" or "signed int".
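
For example:

struct X {
    int          m : 8;   /* signedness is implementation-defined */
    signed int   s : 8;   /* always signed */
    unsigned int u : 8;   /* always unsigned */
};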


Copyright © 2022, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds