No way to parse integers in C

There are a few ways to attempt to parse a string into a number in the C standard library. They are ALL broken.

Update at the bottom: Actually C++’s std::from_chars() looks useful.

Leaving aside the wide character versions, and staying with long (skipping int, long long and intmax_t, since those variants all have the same problem), there are three ways I can think of:

atol()
strtol() / strtoul()
sscanf()

They are all broken.

What is the correct behavior, anyway?

I’ll start by claiming a common sense “I know it when I see it”. The number that I see in the string with my eyeballs must be the numerical value stored in the appropriate data type. “123” must be turned into the number 123.

Another criterion is that the WHOLE number must be parsed. It is not OK to stop at the first sign of trouble, and return whatever maybe is right. “123timmy” is not a number, nor is the empty string.

Failing to provide the above must be an error. Or at least as the user of the parser I must have the option to know if it happened.

First up: `atol()`

Input	Output
123timmy	123
99999999999999999999999999999999	LONG_MAX
timmy	0
empty string	0
`" "`	0

No. All wrong. And no way for the caller to know anything happened.
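To make the “no way to know” point concrete, here is a minimal sketch. The helper name atol_same_result is made up for illustration. It leaves out the overflow row on purpose, since that behavior is undefined.

```c
#include <stdbool.h>
#include <stdlib.h>

// Hypothetical helper (not a standard function): true if atol() returns
// the same value for both strings, i.e. the caller cannot tell them apart.
bool atol_same_result(const char* a, const char* b)
{
  return atol(a) == atol(b);
}
```

With this, atol_same_result("123", "123timmy") and atol_same_result("0", "timmy") are both true: the valid input and the garbage input look identical to the caller.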

For the LONG_MAX overflow case the manpage is unclear if it’s supposed to do that or return as many nines as it can, but empirically on Linux this is what it does.

POSIX says “if the value cannot be represented, the behavior is undefined” (I think they mean unspecified).

Great. How am I supposed to know if the value can be represented if there is no way to check for errors? So if you pass a string to atol() then you’re basically getting a random value, with a bias towards being right most of the time.

I can kinda forgive atol(). It’s from a simpler time, a time when gets() seemed like a good idea. gets() famously cannot be used correctly.

Neither can atol().

Next one: `strtol()`

I’ll now contradict the title of this post. strtol() can actually be used correctly. strtoul() cannot, but if you’re fine with signed types only, then this’ll actually work.

But only carefully. The manpage has example code, but in function form it’s:

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

bool parse_long(const char* in, long* out)
{
  // Detect empty string.
  if (!*in) {
    fprintf(stderr, "empty string\n");
    return false;
  }

  // Parse number.
  char* endp = NULL;  // This will point to end of string.
  errno = 0;          // Pre-set errno to 0.
  *out  = strtol(in, &endp, 0);

  // Range errors are delivered as errno.
  // I.e. on amd64 Linux it needs to be between -2^63 and 2^63-1.
  if (errno) {
    fprintf(stderr, "error parsing: %s\n", strerror(errno));
    return false;
  }

  // Check for garbage at the end of the string.
  if (*endp) {
    fprintf(stderr, "incomplete parsing\n");
    return false;
  }
  return true;
}

It’s a matter of the API here if it’s OK to clobber *out in the error case, but that’s a minor detail.
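If clobbering is not OK for your API, a possible variant parses into a local and only writes *out on success. This is a sketch, and the name parse_long_noclobber is made up. It also checks explicitly that at least one character was consumed.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

// Hypothetical variant of parse_long(): *out is written only on success.
bool parse_long_noclobber(const char* in, long* out)
{
  char* endp = NULL;
  errno = 0;
  const long val = strtol(in, &endp, 0);

  // Range errors are delivered as errno (ERANGE).
  if (errno) {
    return false;
  }
  // endp == in means nothing was converted (empty or all-space input).
  // A non-NUL *endp means there was garbage after the number.
  if (endp == in || *endp) {
    return false;
  }
  *out = val;
  return true;
}
```

Because the base is 0, this accepts the same prefixes as the version above, so "0x10" parses as 16.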

Yay, signed numbers are parsable!

How about `strtoul()/strtoull()`?

Unlike its sibling, this function cannot be used correctly.

The strtoul() function returns either the result of the conversion or, if there was a leading minus sign, the negation of the result of the conversion represented as an unsigned value

Example outputs on amd64 Linux:

Input raw	Input	Output raw	Output
-1	-1	18446744073709551615	2^64-1
-9223372036854775808	-2^63	9223372036854775808	2^63
-9223372036854775809	-2^63-1	9223372036854775807	2^63-1
`" "`	just spaces	Error: endp not null
-18446744073709551614	-2^64+2	2	2
-18446744073709551615	-2^64+1	1	1
-18446744073709551616	-2^64	Error ERANGE

Phew, finally an error is reported.

This is in no way useful. Or I should say: Maybe there are use cases where this is useful, but it’s absolutely not a function that returns the number I asked for.

The title in the Linux manpage is “convert a string to an unsigned long integer”. It does that. Technically it converts it into an unsigned long integer. Not the obviously correct one, but it indeed returns an unsigned long.

Interesting note that a non-empty input of just spaces is detectable as an error. It’s obviously the right thing to do, but it’s not clear that this is intentional.

So check your implementation: If passed an input of all isspace() characters, is this correctly detected as an error?

If not then strtol() is probably broken too.
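One way to run that check is a sketch like the following. The helper name strtol_converted_nothing is made up. It relies on the C standard’s rule that when no conversion is performed, the end pointer is set to the start of the input.

```c
#include <stdbool.h>
#include <stdlib.h>

// Hypothetical self-check: true if strtol() performed no conversion on
// the input, which it signals by leaving endp at the start of the string.
bool strtol_converted_nothing(const char* in)
{
  char* endp = NULL;
  (void)strtol(in, &endp, 0);
  return endp == in;
}
```

If strtol_converted_nothing(" \t\n") comes back false on your implementation, then a parser built on the *endp check alone is not safe there.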

Maybe `sscanf()`?

A bit less code needed, which is nice:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

bool parse_ulong(const char* in, unsigned long* out)
{
  char ch; // Probe for trailing data.
  int len;
  if (1 != sscanf(in, "%lu%n%c", out, &len, &ch)) {
    fprintf(stderr, "Failed to parse\n");
    return false;
  }

  // This never triggered, so seems sscanf() doesn't stop
  // parsing on overflow. So it's safe to skip the length check.
  if (len != (int)strlen(in)) {
    fprintf(stderr, "Did not parse full string\n");
    return false;
  }
  return true;
}

Input raw	Input	Output raw	Output
`" "`	just spaces	Failed to parse
-1	-1	18446744073709551615	2^64-1
-9223372036854775808	-2^63	9223372036854775808	2^63
-9223372036854775809	-2^63-1	9223372036854775807	2^63-1
-18446744073709551614	-2^64+2	2	2
-18446744073709551615	-2^64+1	1	1
-18446744073709551616	-2^64	18446744073709551615	2^64-1

As we can see here this is of course nonsense (except the first one). That last one is extra fun. From the two before it you’d expect it to be 0, or at least an even number. But no.

That last number is simply “out of range”, and that’s reported as ULONG_MAX.

But you cannot know this. Getting ULONG_MAX as your value could be any one of:

The input was exactly that value.
The input was -1.
The input is out of range, either greater than ULONG_MAX, or less than negative ULONG_MAX.

There is no way to detect the difference between these.
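The first two cases can be shown with a small sketch. Both helper names are made up, and the out-of-range case is left out because the C standard does not define what sscanf() does there.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>

// Hypothetical helper: scan with %lu and report whether the stored
// result equals ULONG_MAX.
bool scans_as_ulong_max(const char* in)
{
  unsigned long val = 0;
  return sscanf(in, "%lu", &val) == 1 && val == ULONG_MAX;
}

// Hypothetical helper: write ULONG_MAX as decimal text into buf, so the
// example does not hard-code a 64-bit value.
void ulong_max_text(char* buf, size_t size)
{
  snprintf(buf, size, "%lu", ULONG_MAX);
}
```

Feeding scans_as_ulong_max() the text from ulong_max_text() and feeding it "-1" both return true, and nothing in the result says which input it was.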

So sscanf() is out, too.

Why does this matter?

Garbage in, garbage out, right? Why does it matter that someone might give you -18446744073709551615 knowing you’ll parse it as 1?

Maybe it’s a funny little trick, like ping 0.

First of all it matters because it’s wrong. That is not, in fact, the number provided.

Maybe you’re parsing a bunch of data from a file. You really should stop on errors, or at least skip bad data. But incorrect parsing here will make you proceed with processing as if the data is correct.

Maybe some ACL only allows you to provide negative numbers, and you use this trick to make it parse as negative in some contexts (e.g. Python), but positive in others (strtoul()).

I even saw a comment saying “when you have requirements as specific as this”. As specific as “parse the number, correctly”?

It should matter that programs do the right thing for any given input. It should matter that APIs can be used correctly.

Knives should have handles. It’s fine if the knives are sharp, but no knife should be devoid of safe places to hold it.

It should be possible to check for errors.

Can I work around it?

You cannot even assemble the pieces here into a working parser for unsigned long.

Maybe you think you can filter out the incorrect cases, and parse the rest. But no.

You can detect negative numbers with strtol(), range checked and all, and discard all these. But you can’t tell the difference between being off scale low between -2^64…-2^63, and perfectly valid upper half of unsigned long, 2^63…2^64-1.

It’s not a solution to go one integer size bigger, either. long is long long is intmax_t on my system.

So what do I do in practice?

Do you need to be able to parse the upper half of unsigned long? If not, then:

Use strtol().
Check for less than zero.
Cast to unsigned long.

If all you need is unsigned int, then maybe on your system sizeof(int)<sizeof(long), and this can work. Just cast to unsigned int in the last step.
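The three steps above can be sketched like this. The name parse_ulong_lower_half is made up, and by construction it rejects anything above LONG_MAX.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

// Hypothetical helper: parses values in 0..LONG_MAX only, via strtol().
bool parse_ulong_lower_half(const char* in, unsigned long* out)
{
  char* endp = NULL;
  errno = 0;
  const long val = strtol(in, &endp, 0);

  // Range error, nothing parsed, or trailing garbage.
  if (errno || endp == in || *endp) {
    return false;
  }
  // Negative numbers are not valid unsigned values.
  if (val < 0) {
    return false;
  }
  // Safe cast: val is known to be in 0..LONG_MAX here.
  *out = (unsigned long)val;
  return true;
}
```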

Do you need the upper half? Sorry, you’re screwed. Write your own parser.
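For what it’s worth, a hand-written parser does not have to be long. Here is a minimal sketch with a made-up name, parse_ulong_digits. It is deliberately strict: digits only, no sign, no whitespace, no "0x" prefix. It assumes the letters a-z and A-Z are contiguous in the character set, as in ASCII.

```c
#include <limits.h>
#include <stdbool.h>

// Hypothetical hand-written parser. base must be between 2 and 36.
// Assumes ASCII-style contiguous letters.
bool parse_ulong_digits(const char* in, int base, unsigned long* out)
{
  if (base < 2 || base > 36 || !*in) {
    return false;
  }
  const unsigned long ubase = (unsigned long)base;
  unsigned long val = 0;
  for (; *in; in++) {
    const char c = *in;
    unsigned long digit;
    if (c >= '0' && c <= '9') {
      digit = (unsigned long)(c - '0');
    } else if (c >= 'a' && c <= 'z') {
      digit = (unsigned long)(c - 'a') + 10;
    } else if (c >= 'A' && c <= 'Z') {
      digit = (unsigned long)(c - 'A') + 10;
    } else {
      return false;  // Sign, space, or other garbage.
    }
    if (digit >= ubase) {
      return false;  // Digit not valid in this base.
    }
    // Overflow check: val * ubase + digit must not exceed ULONG_MAX.
    if (val > (ULONG_MAX - digit) / ubase) {
      return false;
    }
    val = val * ubase + digit;
  }
  *out = val;
  return true;
}
```

Called with base 16, this takes a flag field written without its "0x" prefix and either returns exactly that value or fails.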

These numbers are very high, yes, and maybe you’ll be fine without them. But one day you’ll be asked to parse a 64-bit flag field, and you can’t.

0xff02030405060708 cannot be unambiguously parsed by standard parsers, even though there’s ostensibly a perfectly cromulent strtoul() that handles hex numbers and unsigned longs.

Any hope for C++?

Not much, no.

C++ method `std::stoul()`

#include <string>

bool parse_ulong(const std::string& in, unsigned long* out)
{
  size_t pos;
  *out = std::stoul(in, &pos);
  if (in.size() != pos) {
    return false;
  }
  return true;
}

Input raw	Input	Output raw	Output
`" "`	just spaces	throws `std::invalid_argument`
timmy	text	throws `std::invalid_argument`
-1	-1	18446744073709551615	2^64-1
-9223372036854775808	-2^63	9223372036854775808	2^63
-9223372036854775809	-2^63-1	throws `std::out_of_range`

Code is much shorter, again, which is nice.

And `std::istringstream(in) >> *out;`?

Same.

In conclusion

Why is everything broken? I don’t think it’s too much to ask to turn a string into a number.

In my day job I deal with complex systems with complex tradeoffs. There’s no tradeoff, and nothing complex, about parsing a number.

In Python it’s just int("123"), and it does the obvious thing. But only signed.

Maybe Google is right in saying just basically never use unsigned. I knew the reasons listed there, but I was not previously aware that the C and C++ standard library string to int parsers were also basically fundamentally broken for unsigned types.

But even if you follow that advice sometimes you need to parse a bit field in integer form. And you’re screwed.

Update: `std::from_chars` seems to work

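#include <charconv>
#include <iostream>
#include <optional>
#include <string_view>
#include <system_error>

template<typename T>
std::optional<T> parse_int(std::string_view sv)
{
  T i;
  // from_chars() takes const char*. string_view iterators are not
  // guaranteed to be raw pointers, so use data() rather than begin()/end().
  const char* const last = sv.data() + sv.size();
  const auto [ptr, ec] = std::from_chars(sv.data(), last, i);
  if (ec != std::errc()) {
    std::cerr << "Parse error\n";
    return {};
  }
  if (ptr != last) {
    std::cerr << "Trailing data\n";
    return {};
  }
  return i;
}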

The standard says:

The pattern is the expected form of the subject sequence in the "C"
locale for the given nonzero base, as described for strtol, except
that no "0x" or "0X" prefix shall appear if the value of base is 16,
and except that a minus sign is the only sign that may appear, and
only if value has a signed type.

That sounds right. No “minus allowed even for unsigned types” BS.

Update 2023-12-11: Alejandro Colomar to the rescue!

After reading this post, Alejandro came up with what seems obvious to me in retrospect; just check for those negative values!

inline unsigned long
strtoul_noneg(const char *nptr, char **restrict endptr, int base)
{
    if (strtol(nptr, endptr, base) < 0) {
        errno = ERANGE;
        return 0;
    }
    return strtoul(nptr, endptr, base);
}

Hooray!

He also added it as a candidate answer to the stackexchange question.
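To get a full parser out of this idea you still need the usual endp and errno checks. Below is a sketch of how that might look. The name parse_ulong_noneg is made up and this is not Alejandro’s code. One thing to watch: strtol() sets errno to ERANGE for values above LONG_MAX, and a successful strtoul() does not clear it, so the sketch resets errno between the two calls.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

// Hypothetical wrapper: the negative check combined with the usual
// endp and errno checks.
bool parse_ulong_noneg(const char* in, unsigned long* out)
{
  char* endp = NULL;

  // Reject anything strtol() sees as negative. Off-scale low values
  // come back as LONG_MIN, so they are caught here too.
  if (strtol(in, &endp, 0) < 0) {
    return false;
  }

  // strtol() may have set ERANGE for values above LONG_MAX. Clear it so
  // that only strtoul()'s own range check counts below.
  errno = 0;
  const unsigned long val = strtoul(in, &endp, 0);
  if (errno) {
    return false;
  }
  // Nothing parsed, or trailing garbage.
  if (endp == in || *endp) {
    return false;
  }
  *out = val;
  return true;
}
```

Note that "-0" is accepted as 0 here, since strtol() does not return a negative value for it.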
