No way to parse integers in C
There are a few ways to attempt to parse a string into a number in the C standard library. They are ALL broken.
Update at the bottom: Actually C++’s std::from_chars() looks useful.
Leaving aside the wide character versions, and staying with long (skipping int, long long or intmax_t, these variants all having the same problem) there are three ways I can think of:
- atol()
- strtol()/strtoul()
- sscanf()
They are all broken.
What is the correct behavior, anyway?
I’ll start by claiming a common sense “I know it when I see it”. The number that I see in the string with my eyeballs must be the numerical value stored in the appropriate data type. “123” must be turned into the number 123.
Another criterion is that the WHOLE number must be parsed. It is not OK to stop at the first sign of trouble, and return whatever may be right. “123timmy” is not a number, nor is the empty string.
Failing to provide the above must be an error. Or at least as the user of the parser I must have the option to know if it happened.
First up: atol()
Input | Output |
---|---|
123timmy | 123 |
99999999999999999999999999999999 | LONG_MAX |
timmy | 0 |
empty string | 0 |
" " | 0 |
No. All wrong. And no way for the caller to know anything happened.
For the LONG_MAX overflow case the manpage is unclear if it’s supposed to do that or return as many nines as it can, but empirically on Linux this is what it does.
POSIX says “if the value cannot be represented, the behavior is undefined” (I think they mean unspecified).
Great. How am I supposed to know if the value can be represented if there is no way to check for errors? So if you pass a string to atol() then you’re basically getting a random value, with a bias towards being right most of the time.
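To make that concrete, here is a minimal sketch showing that atol() hands back the same value for valid and invalid input, so the caller has nothing to inspect. The helper name atol_same_result is made up for illustration, and the sketch stays away from the overflow case on purpose, since that one is undefined.

```c
#include <stdbool.h>
#include <stdlib.h>

// Hypothetical helper, not part of any library: returns true if atol()
// produces the same result for two different input strings.
bool atol_same_result(const char *a, const char *b)
{
    return atol(a) == atol(b);
}
```

With this, atol_same_result("timmy", "0") and atol_same_result("123timmy", "123") are both true: garbage and a real number look identical to the caller.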
I can kinda forgive atol(). It’s from a simpler time, a time when gets() seemed like a good idea. gets() famously cannot be used correctly.
Neither can atol().
Next one: strtol()
I’ll now contradict the title of this post. strtol() can actually be used correctly. strtoul() cannot, but if you’re fine with signed types only, then this’ll actually work.
But only carefully. The manpage has example code, but in function form it’s:
    #include <errno.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    bool parse_long(const char* in, long* out)
    {
        // Detect empty string.
        if (!*in) {
            fprintf(stderr, "empty string\n");
            return false;
        }
        // Parse number.
        char* endp = NULL; // This will point to end of string.
        errno = 0; // Pre-set errno to 0.
        *out = strtol(in, &endp, 0);
        // Range errors are delivered as errno.
        // I.e. on amd64 Linux it needs to be between -2^63 and 2^63-1.
        if (errno) {
            fprintf(stderr, "error parsing: %s\n", strerror(errno));
            return false;
        }
        // Check for garbage at the end of the string.
        if (*endp) {
            fprintf(stderr, "incomplete parsing\n");
            return false;
        }
        return true;
    }
It’s a matter of the API here if it’s OK to clobber *out in the error case, but that’s a minor detail.
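If clobbering is not acceptable, one option is to parse into a temporary and only write *out on success. This is a sketch under that assumption; the name parse_long_keep is hypothetical, and it folds the empty-string check into an endp comparison instead of printing diagnostics.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

// Sketch of a variant that leaves *out untouched on failure.
// parse_long_keep is a hypothetical name.
bool parse_long_keep(const char *in, long *out)
{
    char *endp = NULL;
    errno = 0;
    const long tmp = strtol(in, &endp, 0);
    // errno != 0:    value out of range for long.
    // endp == in:    no digits consumed (empty, all spaces, or text).
    // *endp != '\0': garbage after the number.
    if (errno != 0 || endp == in || *endp != '\0') {
        return false;
    }
    *out = tmp;
    return true;
}
```

A caller can then keep a default in the output variable and rely on it surviving a failed parse.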
Yay, signed numbers are parsable!
How about strtoul()/strtoull()?
Unlike its sibling, this function cannot be used correctly.
“The strtoul() function returns either the result of the conversion or, if there was a leading minus sign, the negation of the result of the conversion represented as an unsigned value”
Example outputs on amd64 Linux:
Input raw | Input | Output raw | Output |
---|---|---|---|
-1 | -1 | 18446744073709551615 | 2^64-1 |
-9223372036854775808 | -2^63 | 9223372036854775808 | 2^63 |
-9223372036854775809 | -2^63-1 | 9223372036854775807 | 2^63-1 |
" " | just spaces | Error: endp not null | |
-18446744073709551614 | -2^64+2 | 2 | 2 |
-18446744073709551615 | -2^64+1 | 1 | 1 |
-18446744073709551616 | -2^64 | Error ERANGE | |
Phew, finally an error is reported.
This is in no way useful. Or I should say: Maybe there are use cases where this is useful, but it’s absolutely not a function that returns the number I asked for.
The title in the Linux manpage is “convert a string to an unsigned long integer”. It does that. Technically it converts it into an unsigned long integer. Not the obviously correct one, but it indeed returns an unsigned long.
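The negation rule is easy to confirm with a few lines. The check below is only a sketch, and the function name strtoul_accepts_minus_one is made up for this post: it reports whether strtoul() turns "-1" into ULONG_MAX while consuming the whole string and leaving errno at zero.

```c
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

// Hypothetical check: true if strtoul() silently accepts "-1"
// and hands back ULONG_MAX with no error indication at all.
bool strtoul_accepts_minus_one(void)
{
    char *endp = NULL;
    errno = 0;
    const unsigned long v = strtoul("-1", &endp, 10);
    return errno == 0 && *endp == '\0' && v == ULONG_MAX;
}
```

When this returns true, every error check available to the caller has passed, and the value is still not the number in the string.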
Interesting note that a non-empty input of just spaces is detectable as an error. It’s obviously the right thing to do, but it’s not clear that this is intentional.
So check your implementation: If passed an input of all isspace() characters, is this correctly detected as an error?
If not then strtol() is probably broken too.
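A quick self-check could look like the sketch below. The C standard does say that when no conversion is performed, the value of nptr is stored through endptr, so a conforming libc should pass. The function name all_space_is_detectable is hypothetical.

```c
#include <stdbool.h>
#include <stdlib.h>

// Hypothetical self-check: true if strtol() and strtoul() both leave
// endp at the start of an all-whitespace input, which is what lets a
// caller notice that no number was found.
bool all_space_is_detectable(void)
{
    const char *in = " \t\n";
    char *endp = NULL;

    (void)strtol(in, &endp, 0);
    if (endp != in) {
        return false;
    }
    endp = NULL;
    (void)strtoul(in, &endp, 0);
    return endp == in;
}
```

If this ever returns false, the "garbage at the end" check in parse_long() cannot be trusted for whitespace-only input on that platform.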
Maybe sscanf()?
A bit less code needed, which is nice:
    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    bool parse_ulong(const char* in, unsigned long* out)
    {
        char ch; // Probe for trailing data.
        int len;
        // Returns 1 if only the number matched, 2 if there was also
        // a trailing character, 0 or EOF if no number was found.
        if (1 != sscanf(in, "%lu%n%c", out, &len, &ch)) {
            fprintf(stderr, "Failed to parse\n");
            return false;
        }
        // This never triggered, so seems sscanf() doesn't stop
        // parsing on overflow. So it's safe to skip the length check.
        if (len != (int)strlen(in)) {
            fprintf(stderr, "Did not parse full string\n");
            return false;
        }
        return true;
    }
Input raw | Input | Output raw | Output |
---|---|---|---|
" " | just spaces | Failed to parse | |
-1 | -1 | 18446744073709551615 | 2^64-1 |
-9223372036854775808 | -2^63 | 9223372036854775808 | 2^63 |
-9223372036854775809 | -2^63-1 | 9223372036854775807 | 2^63-1 |
-18446744073709551614 | -2^64+2 | 2 | 2 |
-18446744073709551615 | -2^64+1 | 1 | 1 |
-18446744073709551616 | -2^64 | 18446744073709551615 | 2^64-1 |
As we can see here this is of course nonsense (except the first one). The last one is extra fun. From the two before it you’d expect 0, or at least an even number. But no.
That last number is simply “out of range”, and that’s reported as ULONG_MAX.
But you cannot know this. Getting ULONG_MAX as your value could be any one of:
- The input was exactly that value.
- The input was -1.
- The input is out of range, either greater than ULONG_MAX, or less than -ULONG_MAX.
So sscanf() is out, too.
Why does this matter?
Garbage in, garbage out, right? Why does it matter that someone might give you -18446744073709551615 knowing you’ll parse it as 1?
Maybe it’s a funny little trick, like ping 0.
First of all it matters because it’s wrong. That is not, in fact, the number provided.
Maybe you’re parsing a bunch of data from a file. You really should stop on errors, or at least skip bad data. But incorrect parsing here will make you proceed with processing as if the data is correct.
Maybe some ACL only allows you to provide negative numbers, and you use this trick to make it parse as negative in some contexts (e.g. Python), but positive in others (strtoul()).
I even saw a comment saying “when you have requirements as specific as this”. As specific as “parse the number, correctly”?
It should matter that programs do the right thing for any given input. It should matter that APIs can be used correctly.
Knives should have handles. It’s fine if the knives are sharp, but no knife should be void of safe places to hold it.
It should be possible to check for errors.
Can I work around it?
You cannot even assemble the pieces here into a working parser for unsigned long.
Maybe you think you can filter out the incorrect cases, and parse the rest. But no.
You can detect negative numbers with strtol(), range checked and all, and discard all these. But you can’t tell the difference between being off scale low between -2^64…-2^63, and perfectly valid upper half of unsigned long, 2^63-1…2^64-1.
It’s not a solution to go one integer size bigger, either. On my system long, long long, and intmax_t are all the same size.
So what do I do in practice?
Do you need to be able to parse the upper half of unsigned long? If not, then:
- Use strtol()
- Check for less than zero
- Cast to unsigned long
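The three steps above can be sketched as follows. This is only an illustration: parse_ulong_lower_half is a hypothetical name, and by construction it rejects anything above LONG_MAX.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

// Sketch: parse the lower half of unsigned long (0..LONG_MAX)
// by going through strtol(). The function name is hypothetical.
bool parse_ulong_lower_half(const char *in, unsigned long *out)
{
    char *endp = NULL;
    errno = 0;
    const long v = strtol(in, &endp, 0);
    if (errno != 0 || endp == in || *endp != '\0') {
        return false; // Out of range for long, no digits, or trailing data.
    }
    if (v < 0) {
        return false; // Negative numbers are not unsigned.
    }
    *out = (unsigned long)v; // Safe: v is known to be non-negative.
    return true;
}
```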
If all you need is unsigned int, then maybe on your system sizeof(int)<sizeof(long), and this can work. Just cast to unsigned int in the last step.
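For the unsigned int case the cast needs one extra guard, since a long can hold values above UINT_MAX. A minimal sketch, with parse_uint as a made-up name, and only covering the full unsigned int range where long really is wider than int:

```c
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

// Sketch: parse an unsigned int by going through long.
// parse_uint is a hypothetical name.
bool parse_uint(const char *in, unsigned int *out)
{
    char *endp = NULL;
    errno = 0;
    const long v = strtol(in, &endp, 0);
    if (errno != 0 || endp == in || *endp != '\0') {
        return false;
    }
    // Compare as unsigned long so the check is also valid on
    // systems where long and int have the same width.
    if (v < 0 || (unsigned long)v > UINT_MAX) {
        return false;
    }
    *out = (unsigned int)v;
    return true;
}
```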
Do you need the upper half? Sorry, you’re screwed. Write your own parser.
These numbers are very high, yes, and maybe you’ll be fine without them. But one day you’ll be asked to parse a 64-bit flag field, and you can’t.
0xff02030405060708 cannot be unambiguously parsed by standard parsers, even though there’s ostensibly a perfectly cromulent strtoul() that handles hex numbers and unsigned longs.
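For what “write your own parser” might look like, here is a hedged sketch for a 64-bit value. Everything about it is an assumption made for illustration: the name parse_u64, the choice to support only base 10 and 16, the optional 0x prefix, and the ASCII-style contiguous a-f letters. It accepts no sign and no whitespace at all.

```c
#include <stdbool.h>
#include <stdint.h>

// Sketch of a hand-written strict parser; parse_u64 is a hypothetical name.
// Accepts only digits valid in `base` (10 or 16), plus an optional
// "0x"/"0X" prefix for base 16. No sign, no whitespace, no empty input.
bool parse_u64(const char *in, int base, uint64_t *out)
{
    if (base != 10 && base != 16) {
        return false;
    }
    if (base == 16 && in[0] == '0' && (in[1] == 'x' || in[1] == 'X')) {
        in += 2;
    }
    if (*in == '\0') {
        return false; // Nothing to parse.
    }
    uint64_t acc = 0;
    for (; *in != '\0'; in++) {
        int digit;
        if (*in >= '0' && *in <= '9') {
            digit = *in - '0';
        } else if (*in >= 'a' && *in <= 'f') {
            digit = *in - 'a' + 10; // Assumes contiguous letters (ASCII).
        } else if (*in >= 'A' && *in <= 'F') {
            digit = *in - 'A' + 10;
        } else {
            return false; // Not a digit at all.
        }
        if (digit >= base) {
            return false; // Hex letter in a decimal number.
        }
        // Would acc * base + digit overflow? Check before computing.
        if (acc > (UINT64_MAX - (uint64_t)digit) / (uint64_t)base) {
            return false;
        }
        acc = acc * (uint64_t)base + (uint64_t)digit;
    }
    *out = acc;
    return true;
}
```

With this, parse_u64("0xff02030405060708", 16, &v) succeeds and parse_u64("-1", 10, &v) fails, which is the behavior the standard functions cannot be made to provide together.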
Any hope for C++?
Not much, no.
C++ method std::stoul()
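    #include <cstddef>
    #include <string>

    // Note: std::stoul() reports parse failures by throwing.
    bool parse_ulong(const std::string& in, unsigned long* out)
    {
        std::size_t pos;
        *out = std::stoul(in, &pos);
        // Reject trailing data after the number.
        if (in.size() != pos) {
            return false;
        }
        return true;
    }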
Input raw | Input | Output raw | Output |
---|---|---|---|
" " | just spaces | throws std::invalid_argument | |
timmy | text | throws std::invalid_argument | |
-1 | -1 | 18446744073709551615 | 2^64-1 |
-9223372036854775808 | -2^63 | 9223372036854775808 | 2^63 |
-9223372036854775809 | -2^63-1 | throws std::out_of_range | |
Code is much shorter, again, which is nice.
And std::istringstream(in) >> *out;?
Same.
In conclusion
Why is everything broken? I don’t think it’s too much to ask to turn a string into a number.
In my day job I deal with complex systems with complex tradeoffs. There’s no tradeoff, and nothing complex, about parsing a number.
In Python it’s just int("123"), and it does the obvious thing. But only signed.
Maybe Google is right in saying just basically never use unsigned. I knew the reasons listed there, but I was not previously aware that the C and C++ standard library string to int parsers were also basically fundamentally broken for unsigned types.
But even if you follow that advice sometimes you need to parse a bit field in integer form. And you’re screwed.
Update: std::from_chars seems to work
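    #include <charconv>
    #include <iostream>
    #include <optional>
    #include <string_view>
    #include <system_error>

    template<typename T>
    std::optional<T> parse_int(std::string_view sv)
    {
        T i;
        // from_chars() takes const char*, which string_view iterators
        // are not guaranteed to be, so use data() and size().
        const char* const end = sv.data() + sv.size();
        const auto [ptr, ec] = std::from_chars(sv.data(), end, i);
        if (ec != std::errc()) {
            std::cerr << "Parse error\n";
            return {};
        }
        if (ptr != end) {
            std::cerr << "Trailing data\n";
            return {};
        }
        return i;
    }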
The standard says:
“The pattern is the expected form of the subject sequence in the "C" locale for the given nonzero base, as described for strtol, except that no "0x" or "0X" prefix shall appear if the value of base is 16, and except that a minus sign is the only sign that may appear, and only if value has a signed type.”
That sounds right. No “minus allowed even for unsigned types” BS.
Update 2023-12-11: Alejandro Colomar to the rescue!
After reading this post, Alejandro came up with what seems obvious to me in retrospect; just check for those negative values!
    inline unsigned long
    strtoul_noneg(const char *nptr, char **restrict endptr, int base)
    {
        if (strtol(nptr, endptr, base) < 0) {
            errno = ERANGE;
            return 0;
        }
        return strtoul(nptr, endptr, base);
    }
Hooray!
He also added it as a candidate answer to the stackexchange question.
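One detail seems worth watching when building on this idea: strtol() sets errno to ERANGE for inputs above LONG_MAX, and strtoul() never clears errno on success, so a caller that checks errno afterwards could see a stale error for a perfectly valid upper-half value. The sketch below is a variation on the same two-pass approach, not Alejandro’s code. The name parse_ulong_noneg is hypothetical, and it adds the whole-string checks from parse_long().

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

// Sketch of a full-string parser built on the same idea.
// parse_ulong_noneg is a hypothetical name, not Alejandro's code.
bool parse_ulong_noneg(const char *in, unsigned long *out)
{
    char *endp = NULL;

    // Pass 1: use strtol() only to detect negative input.
    // Off-scale-low input comes back as LONG_MIN, which is also < 0.
    errno = 0;
    if (strtol(in, &endp, 0) < 0) {
        return false;
    }
    // Pass 2: strtol() may have set ERANGE for values above LONG_MAX,
    // so clear errno again before the real conversion.
    errno = 0;
    const unsigned long v = strtoul(in, &endp, 0);
    if (errno != 0 || endp == in || *endp != '\0') {
        return false;
    }
    *out = v;
    return true;
}
```

Note that this still accepts "-0" as zero, which may or may not be what you want.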