Defensive programming is better than you think

Exploring defensive programming through a real-world project

Oct 15, 2024

Introduction

For a long time, I didn’t really get what defensive programming was. The tutorials and videos mostly showed examples of user input validation and assertions in “hello world” level codebases. Both of them are important, but validation is the de facto standard nowadays (done by a package or framework), and to be honest, I don’t see lots of codebases with good assertions (or any).

But the promise of defensive programming is very tempting: robust, reliable software that rarely crashes.

The need for defensive programming

In recent weeks, I have been building ReelCode. It’s a code-running platform like LeetCode. In a nutshell, this is how it works:

There are problems on the site
You select one. You’ll see a description and an online editor
You write your solution in one of the three supported languages and submit it
A worker picks up your code and executes it inside a docker container (it’s a dind container)
Once your code has been executed, the worker writes the output to a Redis hash
The frontend polls the API for a response and displays it to you

There are lots of moving parts involved in remote code execution. For example, you cannot start a container on the fly when a new submission comes in because startup is far too slow. You need a container pool, initialized with pre-created and pre-configured containers for each language. Every submission becomes a file, so you need to deal with files, which is error-prone and clunky at times. Different languages require different test cases, etc.

On top of that, I made three interesting decisions:

1) There is no database. Everything you see in the app is stored in memory. Even you, the user. The only external store is Redis. The problems and programming languages you see in the app are stored in variables.

2) No frameworks. First of all, I built everything in Golang. There’s no “full-stack” framework for Go such as Laravel for PHP, Spring for Java, etc. And to be honest, you don’t really need one. I don’t know what it is about Go but it’s awesome.

3) Not a single test was written during development. Was it a good decision? No. I mean, yes. Well, sort of. I wrote 10,887 lines of code and made 439 commits without writing a single test. I’ll talk about it later, but at any given moment I’m sure that the site is up and running and it accepts the right solutions for the available problems.

Were these good decisions? Definitely not. I have my reasons (mostly because it was fun), and I will write an article about it.

Since I don’t use a framework that does lots of things for me and the project doesn’t have tests, for every line of code I wrote the same questions were on my mind: “How can this line go wrong? What can I do to make it robust? How can I make an app as reliable as LeetCode?” I wanted to use proper defensive programming techniques to make the program reliable. By the way, this is probably the first time I took robustness this seriously.

I think I succeeded at building a somewhat robust system using defensive programming techniques. So let me share my learnings with you.

Since I built ReelCode in Golang, the examples will also be in Go. But 90% of the techniques are language-agnostic. You can use them in any language.

We’ll start with the most obvious ones.

Input validation

This is the simplest and probably the most important technique of all. You need to validate every input that comes from:

Users
3rd parties
Or even developers when running background tasks, or other scripts, etc

Fortunately, most frameworks or languages make it very easy:

type PostSubmissionRequest struct {
    User              *models.User `json:"user" validate:"required"`
    SourceCodeEncoded string `json:"source_code" validate:"required"`
    Language          string `json:"language" validate:"required"`
    ProblemId         string `json:"problem_id" validate:"required"`
}

Be as specific as possible. For example, the real code from this example also validates that:

The user and problem exist
The language is a valid one
The source code is base64 encoded

But those rules wouldn’t really fit in the code block so I left them out.
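The extra checks listed above can be sketched with nothing but the standard library. This is a minimal illustration, not ReelCode's actual code: the submissionRequest type, the validate function, and the knownLanguages and knownProblems maps (including the language names in them) are hypothetical stand-ins for the real in-memory data.

```go
package main

import (
	"encoding/base64"
	"errors"
	"fmt"
)

// Hypothetical in-memory lookups standing in for the real data.
var (
	knownLanguages = map[string]bool{"go": true, "php": true, "javascript": true}
	knownProblems  = map[string]bool{"two-sum": true}
)

// submissionRequest is a trimmed-down, hypothetical version of the request.
type submissionRequest struct {
	SourceCodeEncoded string
	Language          string
	ProblemID         string
}

// validate collects every problem it finds instead of stopping at the first.
// It returns nil when the request is valid.
func validate(r submissionRequest) error {
	var errs []error

	if !knownLanguages[r.Language] {
		errs = append(errs, fmt.Errorf("unsupported language %q", r.Language))
	}
	if !knownProblems[r.ProblemID] {
		errs = append(errs, fmt.Errorf("unknown problem %q", r.ProblemID))
	}
	if r.SourceCodeEncoded == "" {
		errs = append(errs, errors.New("source code is required"))
	} else if _, err := base64.StdEncoding.DecodeString(r.SourceCodeEncoded); err != nil {
		errs = append(errs, fmt.Errorf("source code is not valid base64: %w", err))
	}

	// errors.Join returns nil when errs is empty (Go 1.20+).
	return errors.Join(errs...)
}

func main() {
	good := submissionRequest{
		SourceCodeEncoded: base64.StdEncoding.EncodeToString([]byte("hello")),
		Language:          "go",
		ProblemID:         "two-sum",
	}
	fmt.Println(validate(good) == nil) // true

	bad := submissionRequest{SourceCodeEncoded: "%%%", Language: "cobol", ProblemID: "nope"}
	fmt.Println(validate(bad) != nil) // true
}
```

Collecting all violations in one pass is a design choice: the caller can show the user everything that is wrong with a request at once.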

Error handling

For the last 12 years of writing PHP, my error-handling strategy was:

Let 99% of exceptions out to the HTTP layer
Laravel will catch ‘em all
Return 500 (404, or 422 in cases when Laravel could detect them) to the frontend and display some generic error page to the user

There are two interesting things about it:

It’s the worst “error-handling” you can do
It works. According to Taylor Otwell, 600,000 Laravel applications are running on Forge alone. I’ve never used Forge in my life. I’ve never met anyone who uses it. So there are probably millions of Laravel sites out there. I think 99% of them do the same error handling.

The Node equivalent?

process.on('unhandledRejection', (reason, promise) => {
  console.log('Unhandled Rejection at:', promise, 'reason:', reason);
  // Log it
  // Redirect the user to an error page
});

For CRUD-heavy applications this is good enough error handling. Let’s say your POST API returns an error because your SQL query contains a mistake that has nothing to do with input validation. How do you recover from an error like that? What can you do in that situation? Log the error and return 500. There’s not much else you can do. There are errors that you cannot really handle. But of course, you can do at least two things to make this API more robust:

Write tests and have a good pipeline to ensure you don’t deploy broken APIs
Use a queue. If an endpoint is mission-critical I think you can use queues and jobs by default. Not because of scaling but because of reliability. Just queue the given request so if it fails due to bad code you can store it in a dead letter queue and retry it later when you fix the API. It also increases observability because most queue systems come with a monitoring tool by default. Of course, it makes your FE more complicated because you have to poll the API for a response or you need to send an SSE so I wouldn’t overdo it.

At the moment, ReelCode has only one mission-critical non-GET API endpoint: POST /submissions. It uses a queue called asynq and a monitoring tool called asynqmon.
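To make the dead letter queue idea concrete, here is a toy in-memory sketch. It is not asynq's API and not how ReelCode is wired up. The job and queue types, the maxRetries field, and the failB and succeed handlers are all hypothetical names chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// job is a hypothetical queued request.
type job struct {
	ID       string
	Attempts int
}

// queue is a toy in-memory queue with a dead-letter list.
type queue struct {
	pending    []job
	deadLetter []job
	maxRetries int
}

// process runs handler for every pending job. A failing job is retried
// until it has used maxRetries attempts, then it is parked in deadLetter.
func (q *queue) process(handler func(job) error) {
	for len(q.pending) > 0 {
		j := q.pending[0]
		q.pending = q.pending[1:]
		j.Attempts++

		if err := handler(j); err != nil {
			if j.Attempts >= q.maxRetries {
				q.deadLetter = append(q.deadLetter, j)
			} else {
				q.pending = append(q.pending, j)
			}
		}
	}
}

// requeueDead moves parked jobs back to pending, e.g. after a fix ships.
func (q *queue) requeueDead() {
	for _, j := range q.deadLetter {
		j.Attempts = 0
		q.pending = append(q.pending, j)
	}
	q.deadLetter = nil
}

// failB simulates broken code that cannot handle job "b".
func failB(j job) error {
	if j.ID == "b" {
		return errors.New("bad SQL query")
	}
	return nil
}

// succeed simulates the handler after the bug has been fixed.
func succeed(job) error { return nil }

func main() {
	q := &queue{pending: []job{{ID: "a"}, {ID: "b"}}, maxRetries: 3}

	q.process(failB)
	fmt.Println(len(q.deadLetter)) // 1: job "b" is parked, not lost

	q.requeueDead()
	q.process(succeed)
	fmt.Println(len(q.pending), len(q.deadLetter)) // 0 0
}
```

The point is only that a request which fails because of bad code survives the failure and can be replayed later. A real queue adds persistence, backoff, and monitoring on top of this.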

Wrapping errors

As I said, I wrote ReelCode in Golang, which is an “errors-as-values” language. There are no exceptions in the language (panics exist, but they are reserved for truly unexpected situations). Instead of throwing exceptions, functions return errors if something goes wrong. This means that every error a function can produce is visible right in its signature, and you have to deal with it explicitly. This gives us the perfect opportunity to wrap errors and add useful context to them:

var outputBuffer bytes.Buffer
_, err = stdcopy.StdCopy(&outputBuffer, &outputBuffer, attachRes.Reader)
if err != nil {
    outputSpan.SetData("status", "failed")
    outputSpan.Finish()
    return "", fmt.Errorf("unable to read container output: %w", err)
}

This snippet runs after you submit a solution on ReelCode. It demultiplexes (copies) the container’s stdout and stderr to outputBuffer.

When StdCopy fails, I’m guaranteed to see a weird, lower-level error message such as “unexpected EOF”. So when it fails I return the following error:

return "", fmt.Errorf("unable to read container output: %w", err)

From this line, I immediately know what went wrong. It’s just one sentence but it makes a perfectly generic, lower-level error message more application-specific and higher-level. When I open up Sentry, I want to see app-specific error messages instead of Linux I/O errors.
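Here is a minimal, runnable sketch of what the %w verb buys you. The readOutput function is a hypothetical stand-in for the StdCopy call that always fails with a low-level I/O error.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// readOutput is a hypothetical stand-in for reading the container output.
// It always fails with a low-level I/O error and wraps it with context.
func readOutput() (string, error) {
	err := io.ErrUnexpectedEOF
	return "", fmt.Errorf("unable to read container output: %w", err)
}

func main() {
	_, err := readOutput()

	fmt.Println(err)
	// unable to read container output: unexpected EOF

	// The original error is still reachable through the wrapper.
	fmt.Println(errors.Is(err, io.ErrUnexpectedEOF)) // true
}
```

Because %w keeps the original error in the chain, adding app-specific context does not cost you the ability to check for the underlying cause later.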

Error wrapping can have multiple levels. When you submit your solution the app creates a tar archive with a file in it (that contains your solution) and copies it into a container. Creating a tar file can go wrong for three different reasons:

Writing the header
Writing the content
Closing the file

func (e *Executor) createTarWithFile(
    filename, content string,
) (io.Reader, error) {

    var buf bytes.Buffer
    tw := tar.NewWriter(&buf)

    hdr := &tar.Header{
    	Name:     filename,
    	Mode:     0644,
    	Size:     int64(len(content)),
    	ModTime:  time.Now(),
    	Typeflag: tar.TypeReg,
    }

    if err := tw.WriteHeader(hdr); err != nil {
        return nil, fmt.Errorf("unable to create tar file: %w", err)
    }

    if _, err := tw.Write([]byte(content)); err != nil {
        return nil, fmt.Errorf("unable to create tar file: %w", err)
    }

    if err := tw.Close(); err != nil {
        return nil, fmt.Errorf("unable to create tar file: %w", err)
    }

    return &buf, nil
}

I wrap each of these errors with a generic message. Then, higher up in the call chain, I do this:

filename := fmt.Sprintf("%s%s", submission.ID, language.FileExtension)
tarContent, err := e.createTarWithFile(
    filename, 
    submission.SourceCodeDecoded,
)
if err != nil {
    copySpan.SetData("status", "failed")
    return "", fmt.Errorf("unable to put source code into container: %w", err)
}

When something goes wrong the error will be:

unable to put source code into container: unable to create tar file: <the original system error goes here>

This wrapping is important because there are multiple reasons why you cannot copy into a container. For example, the PUT /containers/<id>/archive Docker API call can go wrong. In this situation, I get an error like this:

unable to put source code into container: docker API failure: error response from daemon: Unknown runtime specified docker-runc.
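If the calling code needs to react differently to the two failure modes, the wrapped chain can be inspected with errors.Is. The sketch below assumes sentinel errors, which is my addition and not necessarily how ReelCode does it. The names errTar, errDockerAPI, copyIntoContainer, and classify are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors for the two lower-level failures.
var (
	errTar       = errors.New("unable to create tar file")
	errDockerAPI = errors.New("docker API failure")
)

// copyIntoContainer is a stand-in that fails with the given cause and
// adds the same top-level context regardless of what went wrong below.
func copyIntoContainer(cause error) error {
	if cause == nil {
		return nil
	}
	return fmt.Errorf("unable to put source code into container: %w", cause)
}

// classify walks the error chain to find out which layer failed.
func classify(err error) string {
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, errTar):
		return "tar"
	case errors.Is(err, errDockerAPI):
		return "docker"
	default:
		return "unknown"
	}
}

func main() {
	// Second level of wrapping: the tar sentinel plus a system error.
	tarErr := fmt.Errorf("%w: %v", errTar, errors.New("short write"))

	err := copyIntoContainer(tarErr)
	fmt.Println(err)
	// unable to put source code into container: unable to create tar file: short write
	fmt.Println(classify(err)) // tar

	fmt.Println(classify(copyIntoContainer(errDockerAPI))) // docker
}
```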

Here’s a real example from production (2h 26m after launching ReelCode):

Copying the tar archive failed because the container 7a38cc does not exist. This error is a perfect example of failing at defensive programming. When copying into a container, it should be guaranteed that the container exists. It’s like trying to insert into a DB when the database does not exist.

How can we fix that?

Assertions

An assertion is basically just an if statement. But a good one. Every function that accepts arguments can define requirements. For example, this is what happens to a submission in ReelCode:

The code is executed by a function called Execute
After that, the output is passed to another function called Assert. It checks if your output matches the desired output of a given problem

Every submission has a status such as Waiting, InProgress, CodeExecuted, etc. Running the Assert function only makes sense if the given submission is in the status of CodeExecuted. Once the code is executed it can check its output.

So the following if statements make perfect sense in Assert:

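func Assert(submission *models.Submission, problemID string) error {
    // Only a submission whose code has already run can be checked
    if submission.StatusId != submissionstatus.CodeExecuted {
        return ErrUnableToAssert
    }

    // No captured output means something went wrong earlier
    if submission.StdoutLines == nil {
        return ErrUnableToAssert
    }

    // ... compare StdoutLines with the expected output of the problem

    return nil
}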

It won't check the output if the submission status is not CodeExecuted. If the submission doesn’t have an output it won’t check the output because it makes no sense. Something went wrong in this case.

These if statements are called “assertions” because they ensure that a function’s argument is in the right state. Asserting a submission that is still waiting to be executed can only cause more problems.
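A self-contained version of the same idea is sketched below. The status constants mirror the ones named in the text, but the submission type, the assertOutput function, and the comparison logic are hypothetical simplifications, not the real ReelCode code.

```go
package main

import (
	"errors"
	"fmt"
)

type status int

const (
	Waiting status = iota
	InProgress
	CodeExecuted
)

var ErrUnableToAssert = errors.New("unable to assert submission")

// submission is a hypothetical, trimmed-down model.
type submission struct {
	Status      status
	StdoutLines []string
}

// assertOutput first checks its preconditions, then compares the output.
// A precondition failure is an error; a wrong answer is just false.
func assertOutput(s *submission, expected []string) (bool, error) {
	if s == nil {
		return false, fmt.Errorf("%w: submission is nil", ErrUnableToAssert)
	}
	if s.Status != CodeExecuted {
		return false, fmt.Errorf("%w: status is %d", ErrUnableToAssert, s.Status)
	}
	if s.StdoutLines == nil {
		return false, fmt.Errorf("%w: no output captured", ErrUnableToAssert)
	}

	// Happy path: the argument is in a state where checking makes sense.
	if len(s.StdoutLines) != len(expected) {
		return false, nil
	}
	for i := range expected {
		if s.StdoutLines[i] != expected[i] {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	waiting := &submission{Status: Waiting}
	_, err := assertOutput(waiting, []string{"42"})
	fmt.Println(errors.Is(err, ErrUnableToAssert)) // true

	done := &submission{Status: CodeExecuted, StdoutLines: []string{"42"}}
	passed, err := assertOutput(done, []string{"42"})
	fmt.Println(passed, err) // true <nil>
}
```

Keeping "the submission is in the wrong state" separate from "the answer is wrong" matters: the first is a bug in the system, the second is a normal outcome for the user.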

Now let’s go back to the previous problem. Copying a file into a container that doesn’t exist:

First of all, the root cause of this problem was a silly bug. When you submit a solution the system tries to get a container out of a container pool. When the submission is executed the container is pushed back to the pool. It works the same way as a connection pool. This is the PushBack function:

func (p *ContainerPool) PushBack(
    language *models.Language, 
    container string,
) error {
    if len(p.containers[language.ID]) >= p.capacity {
        err := p.cli.ContainerStop(
            context.Background(), 
            container, 
            container2.StopOptions{},
        )

        if err != nil {
            return fmt.Errorf(
                "unable to stop container in pool: %w", 
                err,
            )
        }

        err = p.cli.ContainerRemove(
            context.Background(), 
            container, 
            container2.RemoveOptions{},
        )
    	
        if err != nil {
            return fmt.Errorf(
                "unable to remove container in pool: %w", 
                err,
            )
        }
    }

    p.containers[language.ID] = append(p.containers[language.ID], container)

    return nil
}

The container pool has a fixed size. For every language, it stores a fixed number of containers so that resource usage stays bounded. When pushing a container back to the pool I need to check whether the pool has already reached its capacity.

The if statement doesn’t contain a return statement at the end. So if the capacity is exceeded, the container is stopped, removed, and then the execution continues with this line:

p.containers[language.ID] = append(p.containers[language.ID], container)

The removed container’s ID is pushed back into the pool. That was the root cause of the problem.
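The fix is a single early return. The sketch below shows it in isolation with a fake client so it can run without Docker. The containerRemover interface, the fakeDocker type, and the simplified pool are hypothetical. The sketch also takes the mutex in PushBack, which the original snippet does not show.

```go
package main

import (
	"fmt"
	"sync"
)

// containerRemover is a hypothetical, narrowed-down view of the Docker client.
type containerRemover interface {
	StopAndRemove(id string) error
}

// fakeDocker records which containers were removed.
type fakeDocker struct{ removed []string }

func (f *fakeDocker) StopAndRemove(id string) error {
	f.removed = append(f.removed, id)
	return nil
}

type pool struct {
	mu         sync.Mutex
	cli        containerRemover
	capacity   int
	containers map[string][]string
}

// PushBack returns a container to the pool, or removes it if the pool is full.
func (p *pool) PushBack(languageID, containerID string) error {
	p.mu.Lock()
	defer p.mu.Unlock()

	if len(p.containers[languageID]) >= p.capacity {
		if err := p.cli.StopAndRemove(containerID); err != nil {
			return fmt.Errorf("unable to remove container in pool: %w", err)
		}
		// The missing line: a removed container must never reach the pool.
		return nil
	}

	p.containers[languageID] = append(p.containers[languageID], containerID)
	return nil
}

func main() {
	docker := &fakeDocker{}
	p := &pool{cli: docker, capacity: 1, containers: map[string][]string{}}

	p.PushBack("go", "aaa") // fits into the pool
	p.PushBack("go", "bbb") // pool is full: removed, not stored

	fmt.Println(p.containers["go"]) // [aaa]
	fmt.Println(docker.removed)     // [bbb]
}
```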

However, I can’t be sure that something like this won’t happen again in the future. So the ultimate reliable solution is this: whenever a container is requested from the container pool, the pool has to make sure that the container is up and running before returning the ID.

At the moment, this is what the Get function looks like:

func (p *ContainerPool) Get(language *models.Language) (string, error) {
    p.mu.Lock()
    defer p.mu.Unlock()

    containers, ok := p.containers[language.ID]
    if ok && len(containers) > 0 {
        id := containers[len(containers)-1]
        p.containers[language.ID] = containers[:len(containers)-1]
        return id, nil
    }

    return p.createContainer(language)
}

In the if statement I need to make sure that the container is up and running. If, for some reason, it is stopped then I need to get another container ID.

It’s a bit more complicated than a simple if statement, but it is still an assertion. The Get function ensures that it only returns working containers.

The simplest solution would look like this:

func (p *ContainerPool) Get(language *models.Language) (string, error) {
    p.mu.Lock()
    defer p.mu.Unlock()

    // There was a bug that caused removed IDs being stored in the pool.
    // Keep popping IDs until one of them can be inspected.
    for len(p.containers[language.ID]) > 0 {
        containers := p.containers[language.ID]
        id := containers[len(containers)-1]
        p.containers[language.ID] = containers[:len(containers)-1]

        _, err := p.cli.ContainerInspect(context.Background(), id)
        if err == nil {
            return id, nil
        }
    }

    // The pool is empty, or every pooled ID was stale
    return p.createContainer(language)
}

Check if the container can be inspected. If not, grab another ID. I also restructured the function so that the happy path is no longer wrapped in an if statement as it was before. Only edge cases are wrapped in if statements.
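One way to convince yourself that Get never hands out a removed container is to run the pop-and-inspect logic against a fake client. The sketch below is a variant that keeps popping until an ID passes inspection and falls back to creating a container. The inspector interface, the fakeDocker type, and the simplified pool are hypothetical, and locking is left out for brevity.

```go
package main

import (
	"errors"
	"fmt"
)

// inspector is a hypothetical, narrowed-down view of the Docker client.
type inspector interface {
	Inspect(id string) error
}

// fakeDocker only knows which container IDs still exist.
type fakeDocker struct{ alive map[string]bool }

func (f fakeDocker) Inspect(id string) error {
	if !f.alive[id] {
		return errors.New("no such container")
	}
	return nil
}

type pool struct {
	cli        inspector
	containers map[string][]string
	created    int
}

// create stands in for starting a fresh container.
func (p *pool) create(languageID string) string {
	p.created++
	return fmt.Sprintf("new-%d", p.created)
}

// Get pops IDs until one can be inspected, discarding stale ones.
func (p *pool) Get(languageID string) string {
	for len(p.containers[languageID]) > 0 {
		ids := p.containers[languageID]
		id := ids[len(ids)-1]
		p.containers[languageID] = ids[:len(ids)-1]

		if p.cli.Inspect(id) == nil {
			return id
		}
	}
	return p.create(languageID)
}

func main() {
	p := &pool{
		cli:        fakeDocker{alive: map[string]bool{"good": true}},
		containers: map[string][]string{"go": {"good", "stale1", "stale2"}},
	}

	fmt.Println(p.Get("go")) // good
	fmt.Println(p.Get("go")) // new-1
}
```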

In a nutshell, these are assertions. They make sure that everything is in the right state.
