2019-11-24

Unsigned Integers Are Dangerous

Unsigned integers are dangerous for at least two reasons:
  • Danger1: "unsigned integers are highly infectious and possibly lethal to desired arithmetic."  Unsigned integers can transform your nice signed integer math into unexpected and unwanted unsigned integer math.  Ex: the unsignedness of 1u infects the C/C++ expression -1/1u so that it yields a large unsigned integer which may go on to infect more arithmetic.
  • Danger2: "unsigned integers almost always teeter on the cliff-edge of underflow, sometimes falling and killing desired behavior."  Underflow and overflow of integers often lead to unwanted behavior, and unsigned integers often hold small values that could easily underflow after common operations like i-- or i-1.  Signed integers often hold small values that are very far away from both underflow and overflow.
Danger1 depends on how your language treats operations with mixed signedness.  C and C++ (and probably many more languages) do have the dangerous behavior of preferring to generate unsigned integers.  Danger2 is for basically all languages.

Due to the severity and generality of these dangers, I recommend the mindset of "use signed integers unless you must use an unsigned integer for a specific reason".  Some acceptable situations to use unsigned integer variables...
  • If you have some variable/constant that will only by touched by bit-wise operations and not arithmetic.
  • If you really need to be stingy with your variable sizes and need the extra positive range of unsigned integers.


You will be tempted to use an unsigned integer when keeping track of a quantity where negative values are invalid and/or impossible, like array indices or counts of elements. Resist that temptation; the bugs arising from Danger1 and Danger2 can still happen.  If you are worried about making it evident to all code readers/users that some value needs to be non-negative, then do the same thing when you have some constraint that can't be handled by the type system (like a comment, assert, or proper naming of functions and variables).  I don't know that I've ever seen a bug because someone thought that a variable like int numApplesInTheBasket could have valid negative values.

People and organizations that agree with my recommendation against unsigned integers, especially within C/C++:
  • Google C++ Style Guide: "You should not use the unsigned integer types such as uint32_t, unless there is a valid reason such as representing a bit pattern rather than a number, or you need defined overflow modulo 2^N. In particular, do not use unsigned types to say a number will never be negative. Instead, use assertions for this."
  • CppCoreGuidelines (organization lead by C++ inventor Bjarne Stroustrup, C++ mega-guru Herb Sutter, and others)
    • ES.106: "Don't try to avoid negative values by using unsigned...Choosing unsigned implies many changes to the usual behavior of integers, including modulo arithmetic, can suppress warnings related to overflow, and opens the door for errors related to signed/unsigned mixes. Using unsigned doesn't actually eliminate the possibility of negative values."
    • ES.107: "Don't use unsigned for subscripts"
  • Yes, C++ STL people are saying it was a mistake to have their interface accept and generate unsigned integers.
  • Many others I won't list.


Explanation of Danger1: Why -1/1u Is Really Big

In short: a C/C++ arithmetic operation with an int and unsigned int will yield an unsigned int.  An implication of this is a lone unsigned int can contaminate an entire chain of operations.

We need to understand what conversions happen when we do arithmetic operations on integers in C/C++.  My explanation will be simplified (wrong) in that I'm not going to tangle with exceptions to the rules below (like the exceptions carved out for _Bool).

First, each operand undergoes integer promotion:
  • If int can hold all possible operand values, then promote operand type to int.
  • Else if unsigned int can hold all possible operand values, then promote operand type to unsigned int.
  • Else no promotion.
Then a result type is decided based on the promoted operand types:
  • Result type is the larger promoted operand type, with unsigned winning ties. (i32+u32→u32; i64+u32→i64)
To turn these low-level rules into some high-level metaphors:
  • When operands are small enough, result is int (due to integer promotion) even if operands are all unsigned. "int strangely blossoms from all smaller things".
  • An int-sized-or-bigger unsigned integer will "win ties" and therefore many results will be unsigned, which create more unsigned results.  "unsigned int is an infectious bully".

Assuming int is 32 bits, Here are some examples of what result type you get for various operand types.
    Result Types For Mixed Integer Math, Assuming 32-bit int
    Original
    Operands
    Promoted
    Operands
    Result Comment
    U16+U16 I32+I32 I32 int blossoming from small things
    U16+U32 I32+U32 U32 Rare case of desired bullying?
    I32+U16 I32+I32 I32
    I32+U32 I32+U32 U32 Nasty surprise to many people
    I32+U64 I32+U64 U64 Bigger wins, may surprise some people
    I64+U32 I64+U32 I64 Bigger wins

    See this github gist for code that lets you see what operand types lead to what result types.

    If Unsigned Math Is So Infectious, Why Does Stuff Work?

    Some of the details about mixed-type integer operations might have surprised you and got you to thinking "how come I don't have bugs from unsigned integer infections all the time?".  Possibly, you've been quarantining and undoing the damage of your unsigned integer infections, like this code snippet:

        int32_t  s1 = -1;
        int32_t  s2 = -1;
        uint32_t u1 = 1;
        int32_t someCalc = s1 * u1 + s2;
        someCalc += blah1;
        someCalc += blah2; 
        return someCalc;


    The result of s1 * u1 is unsigned and thus the result of s1 * u1 + s2 is infected to be unsigned as well, but the unsigned value is silently cast to a signed value and put in someCalcsomeCalc will initially contain the -2 you were hoping for.  So, the unsigned infection spread, but the damage was reversible and contained because we assigned the unsigned result to a signed integer of the same size. But what if we later changed someCalc to int64_t or double to accommodate a wider range of values?  Then the unsigned infection will spread and cause damage that we will not contain and reverse, and someCalc will be initialized to 4294967294 instead of -2.

    Danger2 Examples In Loops

    What if you want to print the first n elements of an array in backwards order? Someone who defaults to using unsigned integers might write:

    void PrintFirstN(int[] array, unsigned int n) {
        for(unsigned int i = n - 1; i >= 0; i--) {
            Print(array[i]);

        }
    }

    The for-loop is incorrect; the i >= 0 is always true and i will underflow to really big numbers, and if you are lucky, the program crashes before doing something really harmful.  To review: Someone made an unsigned integer; it teetered on the edge of underflowing, and then it fell off that cliff and hurt someone.  Signed integers would not have that problem.

    Does looping backwards seem niche and contrived? Okay, let's do the much more normal forwards order:

        for(unsigned int i = 0; i <= n - 1; i++) {
            Print(array[i]);

        }

    Wait, no, if I call PrintFirstN with n as 0, then the condition will always be true. Okay, let's do the more typical i < n condition:

        for(unsigned int i = 0; i < n; i++) {
            ...
        }

    Are you looking forward to all the dangerous i-1 or n-1 or other things that might go inside that loop someday? I'm not.


    Also, if you have a for-loop like this:

        unsigned int startIdx = 3;
        unsigned int stopIdx = 2;
        for(unsigned int i = 0; i < stopIdx - startIdx; i++) {
            ...
        }

    Then the for-loop won't iterate correctly until all three of {startIdx,stopIdx,i} are signed.  If any of them are unsigned, the comparison will be made on unsigned numbers.

    Appendix A: Mixed Type Math In Other Languages

    C#/CSharp

    C# seems to do mixed type arithmetic more safely than C/C++.  Roughly speaking, operands will be converted into an int-or-bigger type that can hold all the values for both operands, preferring small over big and then signed over unsigned.  Ex:  u8+u8→i32 and i32+u32→i64. Nice!  So, C# has something similar to C/C++'s "int blossoming from small things" but unsigned integers are not infectious.

    From the C# Language Reference, the Numeric Promotions section says "Numeric promotion is not a distinct mechanism, but rather an effect of applying overload resolution to the predefined operators", and the predefined arithmetic operators are defined for the following operand pairs:
    • (int, int)
    • (uint, uint)
    • (long, long)
    • (ulong, ulong)
    • (float, float)
    • (double, double)
    • (decimal, decimal)
    Those function signatures combined with implicit conversions (ex: Int16→Int32) for overload resolution are the two ingredients for going from input operand types to result type.  The binary numeric promotions section goes into more detail about how the two ingredients interact, giving a series of if-then-else rules for you to follow to predict result types from operand types.

    No comments:

    Post a Comment