Better Languages for Better Software

Robbert Haarman

2013-02-27

Introduction

Software bugs are a well-known phenomenon. Virtually all software we use on a daily basis contains bugs, causing it not to function as intended in certain situations. Some of these bugs cause minor annoyances; others can cause programs to crash and lose data, or allow unauthorized access to the system. The vast majority of these bugs fall into one of four categories: buffer overflows, injection vulnerabilities, memory leaks, and unhandled errors. All of these can be completely eliminated by using different languages instead of the ones in popular use today (notably C and C++). As a bonus, these languages often allow programs to be written much more concisely and thus more quickly, leaving more time for testing and improving software.

This essay discusses four common types of bug found in current software: buffer overflow vulnerabilities, injection vulnerabilities, memory leaks, and failing to check for errors. For each type of bug, it discusses the nature of the bug, how it occurs, what it leads to, and how it can be prevented. Each type of bug is illustrated with sample code that exhibits the bug, and code that doesn't.

Although almost all the faulty code is in C, and all the good code is in Common Lisp, this is not to say that C is the only language that is liable to these vulnerabilities, or that Common Lisp is the only language that isn't. C is one example of a popular language that demonstrably leads to the bugs discussed here, and Common Lisp is one unpopular language that makes these bugs impossible.

Buffer Overflows

Computers store everything they operate on in memory. The programs they execute, the data they are processing, the information about what they were doing before their current task, and the input from the user are all in the same memory. To read input (for example, text typed by the user, or a message received over the network), a buffer is allocated to hold this input. A buffer overflow occurs when the amount of input that was read exceeds the size of the buffer that was allocated for it. Some of the input ends up overwriting other things in memory, which may be vital to the correct execution of the running program. Buffer overflows cause unexpected and usually undesired behavior.

Some C code containing a buffer overflow vulnerability is shown below.

char input[1024];
gets(input);

This code allocates a buffer that can hold 1024 characters, and then reads a line of input. Because gets also stores a terminating null character after the input, a buffer overflow will occur if the line contains more than 1023 characters.

The effects that buffer overflows have vary. Commonly, a buffer overflow causes the program in which it occurs to crash. In some cases, however, buffer overflows can be used to cause the computer to execute code supplied in the input that caused the buffer overflow. This is often used by attackers to install software on an unsuspecting victim's computer. This software can later be used by the attacker to control the victim's computer, so that it can be used for sending spam or launching attacks on other computers, without these attacks being traced back to the attacker performing them.

Buffer overflows are easy to prevent. One only has to ensure that programs never read more data than they allocated buffer space for. For example, the following C code does the same thing as the code above, but never reads more than fits in the buffer:

char input[1024];
fgets(input, 1024, stdin);

However, there are a number of problems here. First, the safe code requires some extra work. Secondly, it's prone to programmer mistakes. The size of the buffer isn't always as clear as in the above example (where the buffer is allocated right above the code that uses it). Practice shows that buffer overflows are among the most common flaws found in software, indicating that programmers either don't use the safe paradigm, or specify the bounds incorrectly. A third problem with having to specify the size of the buffer is that even legitimate users cannot enter more data than the programmer has provided space for.
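The extra work mentioned above can be made concrete. The following is a minimal sketch in C, not a definitive implementation: the function name read_bounded_line and its parameters are made up for this illustration. It reads a line with fgets and also reports whether the line had to be cut short, which the two-line example does not do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: reads one line into buf, which has room for cap
   bytes including the terminating null character. Sets *truncated when
   the line did not fit. Returns false at end of input or on error. */
bool read_bounded_line(FILE *in, char *buf, size_t cap, bool *truncated)
{
    assert(in != NULL && buf != NULL && truncated != NULL && cap > 1);
    *truncated = false;
    if (fgets(buf, (int)cap, in) == NULL)
        return false;

    size_t len = strlen(buf);
    if (len > 0 && buf[len - 1] == '\n') {
        buf[len - 1] = '\0';          /* complete line: drop the newline */
    } else {
        /* No newline was stored: look at what follows in the stream. */
        int c = getc(in);
        if (c != EOF && c != '\n') {
            *truncated = true;        /* more characters than fit */
            while ((c = getc(in)) != EOF && c != '\n') {
                /* discard the rest of the over-long line */
            }
        }
    }
    return true;
}
```

Even this small helper needs to know the buffer size, handle the newline, and decide what to do with the part of the line that did not fit. Here the excess is simply thrown away, which is exactly the restriction on legitimate users described above.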

A better paradigm for reading input would be to have the system allocate memory for the input based on how much input is provided. This is exactly what many languages other than C do. For example, in Common Lisp:

(read-line)

reads a line of input into a newly allocated buffer that is large enough to hold it. This buffer is then returned. The possibility of a buffer overflow has been completely eliminated, without any work from the programmer, and there is no arbitrary constraint on how much input can be read.
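For comparison, here is roughly what a C programmer has to write to get the same behavior by hand. This is a sketch under stated assumptions: the name read_line_alloc and the starting capacity of 16 bytes are arbitrary choices made for this illustration, not part of any standard library.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads one line of any length from `in` into a
   newly allocated, null-terminated string without the trailing newline.
   Returns NULL at end of input or when memory runs out. The caller must
   free the result. */
char *read_line_alloc(FILE *in)
{
    assert(in != NULL);
    size_t cap = 16;                  /* arbitrary starting capacity */
    size_t len = 0;
    char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    int c;
    while ((c = getc(in)) != EOF && c != '\n') {
        if (len + 1 == cap) {         /* keep one byte for the terminator */
            size_t new_cap = cap * 2;
            char *bigger = realloc(buf, new_cap);
            if (bigger == NULL) {
                free(buf);
                return NULL;
            }
            buf = bigger;
            cap = new_cap;
        }
        buf[len++] = (char)c;
    }
    if (c == EOF && len == 0) {       /* nothing left to read */
        free(buf);
        return NULL;
    }
    buf[len] = '\0';
    return buf;
}
```

The buffer grows with the input, so there is no fixed limit to overflow, but the bookkeeping that Common Lisp does automatically is now the programmer's responsibility, including remembering to free the returned string.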

A more extensive discussion of buffer overflows can be found in Countering buffer overflows.

Injection Vulnerabilities

Injection vulnerabilities occur when a program does not adequately verify the input it receives, and thereby allows an attacker to control the behavior of the program. By far the most common example of injection is SQL injection. SQL is a language used to interact with databases.

Consider the following scenario. Someone has written some sort of application, which people have to log in to before use. The database contains a table named Users, which contains the name and password for every user allowed to use the application. To verify that the user provided a valid username and password combination, the application performs the following query:

mysql_query("SELECT COUNT(*) FROM Users WHERE " .
	"username = '$username' AND password = '$password';");

Here, $username and $password would be substituted by the username and password entered by the user. The query is created by concatenating all the parts into a single string of characters, which is then sent to the database for processing. The query returns the number of matching records, which would be 1 if the user entered valid information, and 0 otherwise. Now suppose a malicious user enters ‘foo’ for the username and ‘bar'; DROP TABLE Users;’ for the password. The database will see the following:

SELECT COUNT(*) FROM Users WHERE
	username = 'foo' AND password = 'bar';
DROP TABLE Users;';

That's actually two queries! The first one will count the number of records in the table Users that have foo as the username and bar as the password (probably, it will find 0). The second one will throw away the table Users and all the information in it. Oops! Now nobody can log in anymore! (After that, the '; will probably cause the database to report an error, but the harm will already have been done.)

Like buffer overflows, injection vulnerabilities are easy to prevent. Everything the user enters must be filtered to make sure it does not contain any harmful parts. Often, this can be done by escaping characters that have special meanings to the software processing the input. In the example above, that would be done by prefixing every single quote in the user input with another single quote. This causes the database not to end the string it is reading, but to embed a single quote in the string and continue reading. The semicolon and everything after it will be considered part of the password, and nothing bad will happen. The above PHP code could be made safe by rewriting it as follows:

mysql_query("SELECT COUNT(*) FROM Users " .
	"WHERE username = '" . mysql_escape_string($username) .
	"' AND password = '" . mysql_escape_string($password) .
	"';");

Clearly, filtering input means extra work for the programmer. For this reason, filtering is often omitted. Most programs will appear to work just fine without filtering, and so these bugs often go unnoticed until something harmful happens. However, the need for filtering is symptomatic of a deficiency in the underlying system. If user input could not alter the structure of the query, no filtering would be necessary!
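The quote-doubling rule described earlier can be sketched in a few lines of C. The function name escape_single_quotes is invented for this illustration, and the sketch only handles the single quote. Some databases treat other characters (such as the backslash) as special too, so a real program should rely on the escaping or parameter binding facilities of its database library.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly allocated copy of `input` in
   which every single quote is doubled. Returns NULL when memory runs
   out. The caller must free the result. */
char *escape_single_quotes(const char *input)
{
    assert(input != NULL);
    size_t len = strlen(input);
    char *out = malloc(2 * len + 1);  /* worst case: every char is a quote */
    if (out == NULL)
        return NULL;

    size_t j = 0;
    for (size_t i = 0; i < len; i++) {
        if (input[i] == '\'')
            out[j++] = '\'';          /* extra quote in front of the original */
        out[j++] = input[i];
    }
    out[j] = '\0';
    return out;
}
```

Applied to the malicious password from the example, the quote after bar is doubled, so the database reads the whole input as one string instead of ending the string early.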

The problem with the query above is that it is composed by concatenating strings. Both the parts of the query that the program supplies and the part of the query that the user supplies are strings. The strings supplied by the user must be escaped, whereas the strings supplied by the program should not be (otherwise, the program would not be able to give structure to the query).

In Lisp, one could create a structured query as follows:

`(select ((count *)) (Users) (where (and
	(= username ,username) (= password ,password))))

Here, the query is constructed as a list, which contains other lists and symbols. The symbols preceded by a comma will be replaced by the value associated with the symbol after the comma, which would be the strings entered by the user. The result is a list containing lists, symbols, and strings. However, there is no way these strings can modify the structure of the constructed list. No matter what the user enters, each string will always just be an element of the list, and the list structure is always the same.

Alternatively, one could compose queries with a Common Lisp macro. Macros receive their arguments unevaluated (so no variable substitution is done). Now, one could specify the query as follows:

(query "SELECT COUNT(*) FROM Users WHERE username = '"
	username "' AND password = '" password "';")

This looks very much like the original, vulnerable construct in PHP. However, there is one very important difference. The macro query can distinguish between things specified by the programmer (which it receives as strings) and things specified by the user (which it receives as symbols). It can then wrap the user-specified values in code that escapes any special characters, and the query can be performed in a safe way.
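C has no macros of this kind, but the underlying idea (keeping track of which parts of the query come from the programmer and which come from the user) can be imitated with explicit tags. The following is a rough sketch, not how any real database library works. The names part_kind, query_part and render_query are hypothetical, and only single quotes are escaped.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical tags: text written by the programmer versus data
   supplied by the user. */
enum part_kind { PART_SQL, PART_VALUE };

struct query_part {
    enum part_kind kind;
    const char *text;
};

/* Concatenates the parts into a newly allocated query string.
   PART_SQL text is copied as is; PART_VALUE text is wrapped in single
   quotes and has its own single quotes doubled. Returns NULL when
   memory runs out. The caller must free the result. */
char *render_query(const struct query_part *parts, size_t count)
{
    assert(parts != NULL);
    size_t cap = 1;                   /* terminating null character */
    for (size_t i = 0; i < count; i++)
        cap += 2 * strlen(parts[i].text) + 2;   /* worst case plus quotes */

    char *out = malloc(cap);
    if (out == NULL)
        return NULL;

    size_t j = 0;
    for (size_t i = 0; i < count; i++) {
        const char *s = parts[i].text;
        if (parts[i].kind == PART_SQL) {
            size_t n = strlen(s);
            memcpy(out + j, s, n);
            j += n;
        } else {
            out[j++] = '\'';
            for (; *s != '\0'; s++) {
                if (*s == '\'')
                    out[j++] = '\'';
                out[j++] = *s;
            }
            out[j++] = '\'';
        }
    }
    out[j] = '\0';
    return out;
}
```

The difference with the Lisp macro is that here the programmer must tag every part by hand, and nothing stops a user-supplied string from being tagged PART_SQL by mistake. The macro makes the distinction automatically, based on whether it received a string or a symbol.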

The most popular interface from Common Lisp to SQL, CLSQL, does something akin to the first approach, in which the query is built as a structure rather than as a string.

More information on SQL injection can be found in SQL Injection Attacks by Example.

Memory Leaks

A memory leak is a failure of a program to release previously allocated memory. Often, memory leaks go undetected, because they don't lead to immediately observable incorrect behavior. However, a program that leaks memory eventually causes itself or other programs to crash or otherwise stop functioning, once all available memory has been allocated. This process can take days or even months.

The following code, written in C, contains a memory leak:

while(1) {
	fread(&size, 4, 1, file);
	if(feof(file)) break;
	buffer = malloc(size);
	fread(buffer, size, 1, file);
	fwrite(buffer, size, 1, stdout);
}

The code reads data from a (previously opened) file, and sends it to standard output. It does so by first reading the size of the next datum (which is encoded as a 4-byte integer), then allocating a buffer of the right size, then reading the datum into the buffer, and finally writing the datum to standard output. This process is repeated until the end of the file is reached. Note that new memory is allocated on each iteration of the loop, but the allocated memory is never released. This could be done by writing ‘free(buffer);’ just after the call to fwrite, when the buffer is no longer needed.
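A version of the loop with the leak fixed might look like the sketch below. It is wrapped in a function, copy_records, whose name and return convention are invented for this illustration. It assumes the size is stored as a 32-bit unsigned integer in the machine's own byte order, and it also checks the results of the calls it makes, which the original loop does not.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: copies size-prefixed records from `in` to `out`.
   Returns the number of records copied, or -1 on a short read, a write
   error, or when memory runs out. */
long copy_records(FILE *in, FILE *out)
{
    assert(in != NULL && out != NULL);
    long copied = 0;
    uint32_t size;                    /* assumed: native byte order */

    while (fread(&size, sizeof size, 1, in) == 1) {
        if (size == 0) {              /* empty datum: nothing to copy */
            copied++;
            continue;
        }
        char *buffer = malloc(size);
        if (buffer == NULL)
            return -1;

        int ok = fread(buffer, size, 1, in) == 1
              && fwrite(buffer, size, 1, out) == 1;
        free(buffer);                 /* released on every iteration */
        if (!ok)
            return -1;
        copied++;
    }
    return ferror(in) ? -1 : copied;
}
```

Note that free is called before the error check, so the buffer is released on the failure path as well. A real program would probably also refuse sizes above some limit, since the size comes straight from the file.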

In the above case, the memory leak is relatively easy to spot. However, there are much more complicated cases. In general, memory should be released as soon as it is no longer needed. However, it can be very difficult or even impossible to determine at what point in the program a block of memory is no longer needed and thus where the call to free should go.

The reclamation of memory that is no longer used can be automated. This is called garbage collection. A garbage collection algorithm determines which blocks of allocated memory can no longer be reached from the program, and makes these available for future use again. Thinking back to the program above, a garbage collector would find that the blocks allocated for buffer on all iterations preceding the current one are no longer reachable (because buffer has since been made to refer to a different block), and can thus be reclaimed.
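To give an impression of how reachability can be determined, here is a toy version of one well-known technique, mark and sweep, written in C. It is only a sketch: the object layout, the limit of two references per object, and the names mark and sweep are assumptions made for this illustration, and real collectors are considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_REFS 2                    /* arbitrary limit for this sketch */

/* A toy heap object: some outgoing pointers and a mark bit. */
struct object {
    struct object *refs[MAX_REFS];    /* NULL when unused */
    bool marked;
};

/* Mark phase: flags everything that can be reached from `obj`. */
void mark(struct object *obj)
{
    if (obj == NULL || obj->marked)
        return;
    obj->marked = true;
    for (size_t i = 0; i < MAX_REFS; i++)
        mark(obj->refs[i]);
}

/* Sweep phase: counts the objects that were not marked and clears the
   marks for the next cycle. A real collector would hand the unmarked
   blocks back to the allocator instead of just counting them. */
size_t sweep(struct object *heap, size_t count)
{
    assert(heap != NULL);
    size_t garbage = 0;
    for (size_t i = 0; i < count; i++) {
        if (!heap[i].marked)
            garbage++;
        heap[i].marked = false;
    }
    return garbage;
}
```

The program's variables act as the starting points (the roots). Calling mark on each root and then sweep over the whole heap finds every block that the program can no longer reach, including blocks that only refer to each other.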

Garbage collection can be built into programs written in C or C++, and there exists a library which bolts garbage collection onto an existing C or C++ program. However, C and C++ were not designed with garbage collection in mind, and thus this garbage collector cannot work optimally (in technical terms, it's a conservative collector; that is, it's not guaranteed to collect all garbage). If a language is designed or implemented to work with garbage collection, a garbage collector can work more efficiently and effectively. Common Lisp, used in previous examples, is one of the many languages that use garbage collection and are thus immune to memory leaks.

Unhandled Errors

If Murphy's Law applies to one thing in life, it has to be computers. If anything can go wrong, it will. Even simple things like allocating some memory, reading the next byte from a file or stream, or displaying a message can fail. For this reason, programming languages offer ways to detect and act on errors.

In C, errors are indicated by special return values. The same is true of other popular languages. The following example illustrates this paradigm:

FILE *file = fopen("some-file", "r");
if(file == NULL) {
	perror("fopen");
	exit(1);
}

This code tries to open the file some-file for reading. It then checks for failure, signified by fopen having returned NULL. If failure did indeed occur, an error message is printed and the program is exited with a nonzero exit code, signifying unsuccessful termination.

The sort of error handling displayed above is rather tedious. Basically, the value returned by every function call must be checked, and appropriate action must be taken in case the special failure value is returned. It is tempting to omit these checks, and indeed this is often done in practice. Often, this causes programs to crash later on, with no indication of where things initially went wrong. It's also possible to imagine scenarios where important files are overwritten with garbage, or other doom scenarios.
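To show how quickly the checks pile up, consider a function that does nothing more than read the first byte of a file. The sketch below uses invented names (read_first_byte and the read_status values) and checks the result of every library call it makes.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical status codes for this illustration. */
enum read_status { READ_OK, READ_OPEN_FAILED, READ_EMPTY, READ_IO_ERROR };

/* Hypothetical helper: stores the first byte of the file at `path` in
   *out. Every library call that can fail is checked. */
enum read_status read_first_byte(const char *path, unsigned char *out)
{
    assert(path != NULL && out != NULL);

    FILE *file = fopen(path, "rb");
    if (file == NULL)
        return READ_OPEN_FAILED;

    enum read_status status = READ_OK;
    int c = fgetc(file);
    if (c == EOF)
        status = ferror(file) ? READ_IO_ERROR : READ_EMPTY;
    else
        *out = (unsigned char)c;

    /* Even closing a file can fail. */
    if (fclose(file) != 0 && status == READ_OK)
        status = READ_IO_ERROR;
    return status;
}
```

Three library calls require three checks, and the caller of read_first_byte must in turn remember to check the status it gets back.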

Many languages use exception handling to overcome the shortcomings of error handling based on return values. In this paradigm, when an error occurs, an exception is thrown instead of returning from the function. This exception can then be caught and handled by an exception handler. If no exception handler is specified, some default action can be taken (such as displaying an error message and exiting the program). Common Lisp has the more general concept of conditions. In Common Lisp, the code above could have been written as:

(handler-case (open "some-file")
	(error ()
		(princ "Error opening some-file" *error-output*)
		(quit 1)))

Here, handler-case executes the first expression, which tries to open the file. If no condition is signalled, it simply returns the value the expression evaluated to (here, a file stream). If a condition is signalled, handler-case matches the type of the condition against the types specified in the following clauses. Here, there is only one clause, matching the type error, which is the most general type of error, and thus would match any error generated by open. If the clause is matched, the code inside it is evaluated; here, that would print an error message and exit the program.

There are two important differences between how errors are handled in C and how they are handled in Common Lisp. First and foremost, in C, if no check is performed, the program goes on as if nothing were wrong. In Common Lisp, errors that aren't handled by the program invoke the debugger, letting the user choose a way to handle the error (most languages supporting exceptions abort the program with an error when an unhandled exception occurs). Secondly, it is possible to perform multiple actions inside handler-case. A condition signalled in any of them will cause the condition handling logic to be invoked. This means that error detecting code does not have to be written around every function call.
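C has no exceptions or conditions, but the second point (one handler covering several operations) can be imitated very crudely with the standard setjmp and longjmp functions. The sketch below is only meant to illustrate the flow of control. The names do_step, run_steps and steps_completed are made up, and the technique has many pitfalls that real exception mechanisms avoid, such as resources that are never released when a function is abandoned halfway.

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf handler;               /* where control goes on failure */
static volatile int failed_step;      /* which step reported the failure */
static volatile int steps_done;       /* steps that completed normally */

/* Hypothetical operation: step `n` fails when n equals `fail_at`. */
static void do_step(int n, int fail_at)
{
    assert(n >= 1);                   /* step numbers start at 1 */
    if (n == fail_at) {
        failed_step = n;
        longjmp(handler, 1);          /* jump to the handler, never returns */
    }
    steps_done++;
}

/* Runs three steps without checking any of them individually.
   Returns 0 on success, or the number of the step that failed. */
int run_steps(int fail_at)
{
    steps_done = 0;
    if (setjmp(handler) != 0)
        return failed_step;           /* the one handler for all steps */

    do_step(1, fail_at);
    do_step(2, fail_at);
    do_step(3, fail_at);
    return 0;
}

/* Reports how many steps completed in the last call to run_steps. */
int steps_completed(void)
{
    return steps_done;
}
```

When a step fails, the steps after it are skipped and control arrives at the single handler, much as it would inside handler-case. Unlike in Common Lisp, however, a failure that occurs when no handler has been set up leads to undefined behavior rather than to a debugger.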

Common Lisp takes things a step further. When you write code that can signal conditions, you can specify available ways to handle the conditions. These so-called “restarts” will be made available to code calling yours and, ultimately, the debugger. For example, in case of open, you could offer a restart that retries the operation, one that retries the operation using a different filename, and one that aborts the program. For the curious, the code for that would look as follows:

(defun ask-other-file ()
	(princ "Enter filename to try: ")
	(multiple-value-list (read)))

(defun my-open (file)
	(restart-case (open file)
		(retry () (my-open file))
		(use-other-file (file)
			:interactive ask-other-file
			(my-open file))
		(abort ()
			(princ "Aborting." *error-output*)
			(quit))))

C++ takes a curious position on error handling. Exceptions are part of the standard, but support for them in language implementations has been so poor that virtually all programming interfaces forgo exceptions in favor of magic return values.

Links for further reading:

Beyond Exception Handling: Conditions and Restarts

Conclusion

The most common flaws found in software today are buffer overflow vulnerabilities, injection vulnerabilities, memory leaks and unhandled errors. All of these are symptomatic of the programming languages used to write this software. Other programming languages make introducing these flaws completely impossible, and make correct code a lot more natural and less tedious to write. This suggests that writing software in these languages instead would dramatically improve software quality.