What's Wrong with C?

Robbert Haarman

2010-12-11

Introduction

In this article I give some of my opinions on why the C programming language has become outdated. I will discuss such weaknesses as fixed-size arrays1 and strings, the absence of overloading, and manual memory management. I will also touch on the historical reasons for these weaknesses, and discuss alternatives that don't have these problems.

This article is not meant to be an exhaustive discussion of all the issues concerned. It is also not meant to be politically correct, or even truthful. It represents my views, and possibly the views of others. It is meant to outline some of the major problems I have experienced with C, and to look at how these problems have been solved in other languages. The article is meant to be readable by a less tech-savvy audience, and provides a slightly simplified view. Writing a complete and truthful article with suggestions for the future is left as an exercise to a Real Programmer.

Genesis of C

C was developed from 1969 to 1973 as a language for implementing the Unix operating system. Its goal was to be a high-level language, facilitating both coding2 the system and porting3 it, should the need ever arise. In this, C differed from the languages then typically used for writing operating systems, which were closely related to machine code4, and, as such, hard to program and tied to a specific architecture5.

The purpose of C, to serve as a portable language for writing operating systems, goes a long way toward explaining the features of C. C's main strengths are its portability, performance, and ability to interface with the hardware. The performance and ability to interface with the hardware are the result of C's being a thin wrapper around assembly language6. Its portability is the result of abstracting away from the machine level far enough that the specifics of the hardware don't need to bother the programmer anymore.

Weaknesses of C

Just like its strengths, C's weaknesses can also be explained by its history. Because C is, in some aspects, close to assembly language, it inherits some of the problems found in assembly programming. Other weaknesses of C can be explained by its age. Since C was initially conceived, much research has been done on programming language design. This has led to new insights that were not yet available when C was designed. I would like to outline some of what I perceive to be major weaknesses of C below, along with their historical causes.

Fixed-Size Arrays and Strings

In C, strings are represented as arrays of characters. This simplifies the language specification, because it requires programmers to learn only the array-paradigm, which can then also be applied to strings. C requires variables and arrays to be declared before they can be used. For example, before you can use a variable i as a counter, you have to tell the compiler7 that you want a variable i that holds numbers. This has the advantage that the compiler can check for type-errors, a feature which most assemblers8 lack. An assembler wouldn't complain if you tried to write 4 bytes of data to a variable that only has 1 byte reserved for it, even though the result would probably be disastrous. The disadvantage is that you have to decide in advance how much space you will use. If, later, you find out that you needed more, you have a problem.

C offers two ways to circumvent the problem of fixed-size arrays. One way is the brute-force way: just declare an immense array so you are sure it's big enough. The problems with this approach are obvious. The only array that is guaranteed to be big enough is an array of infinite size. However, such an array will never fit into memory, and will waste an incredible amount of memory if not all the space is used. Whatever compromise you strike, you are always either wasting memory or running out of space in your array. Unless of course your guess happened to be exactly right. In some cases, you might actually know exactly how much space you will need, but in the majority of cases, the amount of space required is determined at run-time9; too late to tell the compiler about it.
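
The brute-force approach can be sketched as follows. This is only an illustration: the names scores, MAX_SCORES, and add_score are made up for this example, and the capacity of 8 is an arbitrary guess of the kind described above.

```c
#include <stddef.h>

#define MAX_SCORES 8   /* capacity fixed at compile time (an arbitrary guess) */

static int scores[MAX_SCORES];
static size_t score_count = 0;

/* Stores one value in the fixed-size array.
 * Returns 1 on success, 0 if the array is already full. */
int add_score(int value)
{
    if (score_count >= MAX_SCORES)
        return 0;              /* the guess was too small: no room left */
    scores[score_count++] = value;
    return 1;
}
```

If fewer than 8 values are ever stored, the remaining slots are wasted; the ninth call to add_score fails, however much memory the machine has.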

The other way to solve the array space problem is to make arrays dynamic. C does indeed offer a way to do this, namely through dynamic memory management. The trick is that an array in C is accessed through a pointer to the memory allocated to that array. To access the nth element of the array, C simply adds n times the size of one element to the pointer to get to the right position in the memory block. C's memory-management functions provide a way to obtain a pointer to a block of memory of arbitrary size. This means that, if you discover that you need to store 4 integers, you can simply obtain a pointer to space for 4 integers, and use it just as if you had declared an integer-array of size 4. And it gets even better: if later on you find that you actually need to store 2 more integers, you can tell C to grow the size of the memory allocated to this array, and store your integers there. But here comes the catch: no guarantee is made as to whether the memory block will still start at the same position. This means that you have to make sure that every piece of code always uses the up-to-date pointer. If you have multiple parts to your program that all hold a pointer to this data, this can easily become a Mission Impossible.
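
A minimal sketch of growing an array with the standard realloc function may make the catch concrete. The helper name append_int is hypothetical; only realloc itself comes from the C library.

```c
#include <stdlib.h>

/* Appends value to the block *array, which currently holds *count ints.
 * Returns 1 on success, 0 if no memory could be obtained.
 * The caller's pointer is updated through `array`, because realloc is
 * allowed to move the block to a different address. */
int append_int(int **array, size_t *count, int value)
{
    int *moved = realloc(*array, (*count + 1) * sizeof **array);
    if (moved == NULL)
        return 0;           /* the old block is still valid and unchanged */
    moved[*count] = value;
    *array = moved;         /* any other copy of the old pointer may now be stale */
    *count += 1;
    return 1;
}
```

Only the pointer passed in through array is refreshed here. Any other part of the program that saved the old address keeps pointing at memory that may no longer belong to the array.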

Many bugs10 and security flaws in today's software arise from this feature of C. Because of the way arrays are made dynamic in C, it becomes impossible for the compiler to do bounds-checking. This means that the compiler won't notice it if your program accesses memory that is actually outside the block allocated to your array. Since strings are arrays, this also applies to data that your program reads from users. One security flaw that is found from time to time in software is the so-called buffer-overrun. To exploit a buffer-overrun, a cracker sends a program more data than its buffer can hold. Depending on how the program and the platform11 are organized, this may cause the program to terminate or execute code supplied by the cracker. If C's arrays and strings were automatically grown (and shrunk) on demand, C-programs would not be vulnerable to this kind of attack.
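
Since the compiler does not check bounds, the check has to be written by hand. The following is a sketch of one way to do that, using a hypothetical helper copy_checked that refuses input that would not fit; it is not a function from the C library.

```c
#include <stddef.h>
#include <string.h>

/* Copies src into dest, which has room for dest_size bytes including
 * the terminating '\0'. Returns 1 if the string fit, 0 if it was
 * rejected because it would overrun the buffer. */
int copy_checked(char *dest, size_t dest_size, const char *src)
{
    size_t len = strlen(src);
    if (len >= dest_size)
        return 0;                 /* too long: leave dest untouched */
    memcpy(dest, src, len + 1);   /* len characters plus the '\0' */
    return 1;
}
```

Leaving out the length test turns this into exactly the kind of unchecked copy that a buffer-overrun exploits.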

Manual Memory-Management

This issue is related to the issue discussed in the previous subsection. Certain situations require the allocation of memory at run-time. However, memory that is allocated must also be freed. A good operating system frees the memory allocated to a program when the program exits, but that does not really solve the problem. Some programs run for a long time (for example, until the system is shut down), and allocate memory at various points in time. It is therefore necessary to free memory at run-time. C doesn't do this automatically, as a system that automates this is complicated to implement and impacts performance.

The programmer is thus left to do memory allocation and deallocation by hand, which is a frequent source of bugs. Failing to allocate memory before it is used usually leads to program termination, and is therefore relatively easy to detect and fix. Failing to deallocate memory when it is no longer needed leads to so-called memory-leaks and is much harder to detect and fix. Many C-programs have memory-leaks in them, which can ultimately cause the system to run out of memory and become unusable (that is, until it is reset).
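
To illustrate how little it takes to create a leak, here is a small sketch. The function joined_length is invented for this example; it uses a temporary block of memory that must be released before returning.

```c
#include <stdlib.h>
#include <string.h>

/* Returns the length of a followed by b, or -1 if no memory could be
 * obtained. The temporary buffer has to be freed on every path out of
 * the function. */
long joined_length(const char *a, const char *b)
{
    size_t len_a = strlen(a);
    size_t len_b = strlen(b);
    char *tmp = malloc(len_a + len_b + 1);
    if (tmp == NULL)
        return -1;
    memcpy(tmp, a, len_a);
    memcpy(tmp + len_a, b, len_b + 1);   /* copies the '\0' as well */
    long result = (long)strlen(tmp);
    free(tmp);   /* omit this line and every call leaks the buffer */
    return result;
}
```

Without the call to free the function still returns the right answer, which is why this kind of mistake tends to go unnoticed until memory runs out.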

Memory leaks can also be a security concern. On some systems it may be possible for one program to read the data of another program. Imagine the paradise provided to crackers if such a system runs a program that fails to deallocate the memory where it stores passwords12...

One way to avoid memory-leaks is to implement a so-called garbage-collector. A typical garbage-collector will keep track of the memory blocks allocated by a program, and periodically check if they are still being referenced. If a block is no longer referenced, the garbage-collector frees it. While this eliminates memory-leaks, it has a significant impact on performance. An alternative that is faster in most cases is reference-counting. The reference count for a block of memory is incremented whenever a reference to that block is obtained, and decremented whenever a reference is lost. When the reference count reaches zero, the block is freed. This makes obtaining references a little slower, and decrementing reference counts would still have to be done explicitly in C, so it doesn't solve the problem (it does facilitate memory-management, though).
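
Reference-counting as described above could look roughly like this in C. The struct counted and its three functions are hypothetical names chosen for this sketch, and the payload is a single integer to keep it short.

```c
#include <stdlib.h>

/* A block of memory that carries its own reference count. */
struct counted {
    int refs;    /* number of references currently held */
    int value;   /* the data itself */
};

/* Creates a block with one reference; returns NULL if out of memory. */
struct counted *counted_new(int value)
{
    struct counted *c = malloc(sizeof *c);
    if (c != NULL) {
        c->refs = 1;
        c->value = value;
    }
    return c;
}

/* Obtains an additional reference to the block. */
struct counted *counted_retain(struct counted *c)
{
    c->refs++;
    return c;
}

/* Gives up one reference and frees the block when none are left.
 * Returns the number of references that remain. */
int counted_release(struct counted *c)
{
    int remaining = --c->refs;
    if (remaining == 0)
        free(c);
    return remaining;
}
```

Note that counted_release still has to be called by hand at every place where a reference is dropped, which is the limitation mentioned above.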

Another way would be to relieve the programmer of having to do memory-management. While this is probably impossible in practice, thinking about how it could be done in theory provides some helpful ideas. Memory allocations typically occur when working with arrays, strings, or structs13. A solution would be to write functions that grow and shrink arrays, create and obtain references to structs, and operate on strings, and have these functions perform memory-management. Programmers could then use these functions and never worry about memory-management. This basically brings us into the domain of object-oriented programming, which is very painful in C due to the lack of overloading, discussed in the next subsection.
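
As a sketch of what such functions might look like for an array of integers, consider the following. The struct int_array and the int_array_* functions are assumptions made for this example, as is the choice to double the capacity each time the array fills up.

```c
#include <stdlib.h>

/* An array of ints that grows on demand. */
struct int_array {
    int *items;        /* the memory block, or NULL when empty */
    size_t count;      /* number of elements in use */
    size_t capacity;   /* number of elements the block can hold */
};

void int_array_init(struct int_array *a)
{
    a->items = NULL;
    a->count = 0;
    a->capacity = 0;
}

/* Adds value at the end, growing the block when it is full.
 * Returns 1 on success, 0 if no memory could be obtained. */
int int_array_push(struct int_array *a, int value)
{
    if (a->count == a->capacity) {
        size_t new_capacity = a->capacity ? a->capacity * 2 : 4;
        int *grown = realloc(a->items, new_capacity * sizeof *grown);
        if (grown == NULL)
            return 0;
        a->items = grown;
        a->capacity = new_capacity;
    }
    a->items[a->count++] = value;
    return 1;
}

/* Releases the block and resets the array to empty. */
void int_array_free(struct int_array *a)
{
    free(a->items);
    int_array_init(a);
}
```

Code that only goes through these functions never calls the memory-management functions directly, but a separate set of functions is needed for every element type.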

Overloading

Overloading refers to multiple functions with the same name. Typically, these functions will perform similar operations, but on differing numbers of arguments and/or arguments of differing types. C does not support overloading. If you have a function sum(int a, int b) that sums the integers a and b, you cannot also have a function sum(float a, float b) that does the same for floating-point numbers, or a function sum(int a, int b, int c) that sums three integers.

C does provide a mechanism for implementing functions with a variable number of arguments of any type. This mechanism requires that a function have at least one argument of a given type, and the rest of the arguments are traversed with a set of macros. Theoretically, this mechanism could be used to implement overloading, but it requires the function to determine the types of its arguments at run-time, which is very inefficient. Some of the functions in the C-library use this mechanism, with a format-string as the first argument that indicates the number, types, and positions of the other arguments. However, this mechanism is error-prone, because it is the programmer's responsibility to provide a correct format-string. A format-string specifying more arguments than are actually supplied, or a format-string specifying a wrong type, can have disastrous consequences.
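
The macros in question are va_start, va_arg, and va_end from the standard header stdarg.h. A minimal sketch of their use follows; the function sum_ints is a made-up example that takes a count as its fixed first argument instead of a format-string.

```c
#include <stdarg.h>

/* Sums `count` int arguments. The first argument is the only thing
 * that tells the function how many arguments follow, and nothing
 * verifies that they really are ints. */
int sum_ints(int count, ...)
{
    va_list args;
    int total = 0;

    va_start(args, count);
    for (int i = 0; i < count; i++)
        total += va_arg(args, int);   /* trusts the caller about the type */
    va_end(args);
    return total;
}
```

A call such as sum_ints(3, 1, 2, 3) works as intended, but sum_ints(3, 1, 2) compiles just as well and reads an argument that was never passed.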

An alternative approach would be affixing the function name with some characters indicating the number, type, and position of the arguments. This is referred to as name-mangling. However, this can easily become very tedious in practice (if the codes used for the affixes are too long, it requires too much typing, and if they are too short, it gets confusing). It also makes code hard to read.
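
Applied by hand to the sum example, affixing might look like this. The suffix scheme (one letter per argument, i for int and f for float) is just one possible convention invented for this sketch.

```c
/* Each variant gets its own name; the suffix spells out the
 * argument types: i = int, f = float. */
int sum_ii(int a, int b)
{
    return a + b;
}

float sum_ff(float a, float b)
{
    return a + b;
}

int sum_iii(int a, int b, int c)
{
    return a + b + c;
}
```

The caller has to pick the right suffix every time, and a short scheme like this one quickly becomes cryptic once structs and pointers are involved.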

The absence of overloading is especially cumbersome for object-oriented programming. Object-oriented programming is a technique where variables that belong together are grouped together in one object. C provides a convenient way to do this with structs. Recall the previous example of a rectangle that is represented by a struct holding its width and height. We could define a function draw(struct rectangle r) that draws the rectangle represented by the struct passed as its argument on the screen. Now if we define a struct triangle representing a triangle, it would be nice if we could define a function draw(struct triangle t) that draws a triangle on screen, but this will not work. Instead, we have to either rename one or both functions, or we have to rewrite draw() to be a variable-argument function (the latter might not be possible, for example if draw() is defined outside our code). In the end, programmers need to remember different invocations for every type of shape they want to draw.
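
The renaming option can be sketched as follows. The members of struct triangle and the names draw_rectangle and draw_triangle are assumptions made for this example, and "drawing" is reduced to writing a description into a buffer so that the sketch does not depend on any graphics library.

```c
#include <stdio.h>

struct rectangle { int width, height; };
struct triangle  { int base, height; };

/* Two shapes, two function names: C will not accept two functions
 * that are both called draw. Each returns what snprintf returns,
 * the length of the full description. */
int draw_rectangle(struct rectangle r, char *out, size_t size)
{
    return snprintf(out, size, "rectangle %dx%d", r.width, r.height);
}

int draw_triangle(struct triangle t, char *out, size_t size)
{
    return snprintf(out, size, "triangle base %d height %d",
                    t.base, t.height);
}
```

Every new shape adds another name to remember, which is the burden described above.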

Now C was not designed to do object-oriented programming, and indeed in most cases it is acceptable to do without overloading. This is especially true for programs that do not export any functionality to other programs. Even if you use objects in your program, the number of different types you have to deal with is usually small enough that you can keep track of them all, and affixing your function names with the types they operate on may even look elegant rather than tedious. However, when writing a library14, the clumsiness of not having overloading can become very obvious and even painful. Object-oriented programming feels very natural, because programmatic objects can usually be modeled after Real-World objects, making code more meaningful and easy to understand.

Languages That Overcome C's Weaknesses

The world of programming languages has not stood still since 1973, nor have the weaknesses of C gone unnoticed. Numerous attempts have been made to design languages that overcome the weaknesses of C. Some of those have been radical new designs, but for the purposes of this article I will focus on languages that derive from C.

C++

C++ (which is C for "increment C by one") is an attempt to extend and thereby improve C. Almost everything that is valid C is also valid in C++, making it very easy to switch; you don't need to learn anything new. C++ offers full object-oriented programming, with inheritance15, overloading, and everything else (for a description of everything else, refer to an introduction to object-oriented programming). It also offers an impressive set of standard functions, implementing such things as automatically growing arrays and strings, linked lists, algorithms to perform a certain action on a whole set of objects, and many other things. All this functionality makes memory-management by the programmer practically unnecessary, and C++'s object creation and destruction mechanism makes it possible to implement automated memory management. In fact, this was a main consideration in the design of C++. The reason no automated memory management is implemented by default is that there are so many ways to do it, each with its own merits and shortcomings.

While the huge set of standard functionality in C++ was provided to make a programmer's job easier, it is also the main reason why C++ hasn't replaced C (yet). Certainly, programmers don't have to write basic structures, because they are already present in the standard library. But this makes the standard library so big that it is virtually impossible for anybody to know the entire standard library. In fact, C++ is so complicated that compilers have not reached the level of maturity of C-compilers by far, so that C++ programs usually perform far worse than their C counterparts. This, and the enormous effort required to get a good knowledge of C++, has scared many programmers away from C++.

C++ is a language with a lot of potential. It fixes most problems with C, and adds a lot of functionality. It is a pity that it isn't used more widely, especially in mission-critical software like webservers, where it could completely eliminate many security problems. But until performance improves (I think that C++ has the potential to perform better than C), people will probably stick with C.

Java

Java (hackish16 for "coffee", after an Indonesian island that produces it) is a language designed and implemented by Sun Microsystems. It has many things in common with C++. Its object-oriented paradigm is a bit simpler, making it less flexible but arguably more elegant. It also comes with a big standard library that provides a load of general-purpose functionality. Major differences are that it is not a superset of C (i.e. a C program is not a valid Java program), and the inclusion of functionality intended for integrating with web browsers. In fact, Sun's implementation of Java translates Java source code to Java bytecode, which can be run on so-called Java Virtual Machines. This means that Java programs, once compiled, are equally crippled on any system, which is a way to make them platform-independent. Java also comes with garbage-collection, so that all the things that cause many C-programs to be buggy are finally history.

The main advantages of Java are its ease of programming, integration of networking, graphics, and support, and safety (no memory leaks, buffer overflows, etc.). Ideally, portability would also be in this enumeration, but alas portability is less than optimal. The software needed to run Java-programs is not available for as many platforms as C compilers are. Java bytecode is more portable than machine code, but still not as portable as Sun originally intended. This is mostly due to Microsoft's just-not-compatible VM, which leaves the world with one implementation used on the omnipresent Windows-platform, and one implementation used on all other platforms. The possibility exists for Windows-users to upgrade to a real Java Virtual Machine from Sun, but these are, unfortunately, very large (tens of megabytes). This illustrates one more weakness of Java: it requires a large run-time environment to be installed. Also, since run-time environment versions tend not to be upwards-compatible, they need to be updated every once in a while. The main flaw of Java, however, is its poor performance. This is due to the fact that bytecode has to be translated to machine code, either before or during execution, and to garbage-collection.

The Free Software Foundation provides a compiler called gcj (GNU Compiler for Java) as part of its well-known and widely-used GNU Compiler Collection. This compiler is open-source software, which means it can be used by anyone for any purpose, and may be freely modified and distributed, as long as the same rights are granted to everyone it is distributed to. It can be compiled and used on any platform that has a C-compiler, and can produce either Java bytecode or native machine code. At the time of this writing, it was not yet complete (specifically, graphics had not been implemented yet), severely limiting its usability, and the code it produces will probably always be slower than equivalent C-code, because of the garbage collection, but it provides an interesting option for the future.

Conclusion

There are a couple of deficiencies in the C programming language. Some of those cause many critical bugs in C programs, such as buffer-overflows and memory-leaks. Numerous attempts have been made to improve C, including new languages that are based on C but provide extensions that solve the problems C has. Two such languages are C++ and Java.

C++ solves the problems discussed to a lesser extent than Java, but has the advantage of being backward-compatible with C. It has the potential to exceed C even in performance, something that few other languages can claim. Unfortunately, its complexity has kept programmers from switching, and has severely complicated the development of good compilers, resulting in C++ programs performing considerably worse than their C counterparts.

Java takes a more radical approach, dumping compatibility with C-code, and including garbage-collection to eliminate memory-leaks. Unfortunately, Java is plagued by compatibility problems, and its performance will probably never parallel, let alone exceed, C's.

1 Array: A range of variables of the same type

2 Coding: Writing the source code for a program

3 Porting: Making a program work on a different type of system

4 Machine code: Binary code that is processed by machines, but hard to understand for humans

5 Architecture: Type of computer

6 Assembly Language: Language that presents machine code in a human-readable form

7 Compiler: Program that translates source code to machine code

8 Assembler: Program that translates assembly language to machine code

9 Run-Time: While the program is running

10 Bug: Programming error

11 Platform: Combination of operating system and hardware

12 In a good operating system, processes run in different address spaces, so they can't access each other's data. However, this kind of concern is a good reason to always use secret data in encrypted form, even in memory.

13 struct: The name given in C to a block of memory that holds a number of variables. For example, a struct might hold two integers specifying the width and height of a rectangle.

14 Library: Set of functions that can be used by other programs

15 Inheritance: A process whereby one type inherits all the properties of another type. All methods that can be applied to the original type will also be applicable to the new type. The new type can also define its own methods, which will then be applied instead of the ones in the old type.

16 Hackish: the language spoken by computer-experts