
Crossing Streams: a love letter to Go io.Reader

Author: Jason Moiron (@jmoiron)

Published: July 24, 2014

Jason Moiron (@jmoiron) is a software engineer at Datadog and runs his own blog, where this post was originally published.

Go has a function called ioutil.ReadAll, defined as:

func ReadAll(r io.Reader) ([]byte, error)

Use of ioutil.ReadAll is almost always a mistake.

An io.Reader is a stream of bytes. The urge to inspect them is strong. When you have a good idea of what you will get out of a Reader, but want to verify its output, ioutil.ReadAll holds a powerful allure. This isn’t necessarily a bad thing to do; after all, ReadAll exists because of this need. But it’s a bad way to program.

Readers are flexible. Their implementations are varied, meaning that APIs that use Readers can operate directly on seemingly different types of objects: os.File, bytes.Buffer, net.Conn, and http.Request.Body are all Readers.
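
For example, a function written against io.Reader can be handed any of these sources unchanged. A minimal sketch (countLines is a hypothetical helper, not part of any library):

package main

import (
    "bufio"
    "fmt"
    "io"
    "os"
    "strings"
)

// countLines works on any io.Reader: a string in memory, a file on disk,
// a network connection, or an HTTP body all look the same to it.
func countLines(r io.Reader) (int, error) {
    n := 0
    scanner := bufio.NewScanner(r)
    for scanner.Scan() {
        n++
    }
    return n, scanner.Err()
}

func main() {
    n, _ := countLines(strings.NewReader("a\nb\nc\n"))
    fmt.Println(n) // 3

    if f, err := os.Open("/etc/hosts"); err == nil {
        defer f.Close()
        n, _ = countLines(f)
        fmt.Println(n)
    }
}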

If you’re unconvinced, you may be wondering, “Isn’t the []byte I get from ReadAll just as flexible?” Byte slices and strings are prevalent in programming, but there are things they cannot represent. A stream needn’t have a “real” source, and it needn’t be read until EOF. Streams can trivially produce infinite output while using barely any memory at all; imagine an implementation behaving like /dev/zero or /dev/urandom.
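
Here is a minimal sketch of such a stream, in the spirit of /dev/zero (zeroReader is a hypothetical type, invented for illustration):

package main

import (
    "fmt"
    "io"
    "io/ioutil"
)

// zeroReader produces an endless stream of zero bytes: no backing file,
// no buffer, and no EOF, yet it satisfies io.Reader.
type zeroReader struct{}

func (zeroReader) Read(p []byte) (int, error) {
    for i := range p {
        p[i] = 0
    }
    return len(p), nil
}

func main() {
    // ReadAll on zeroReader{} alone would never return; io.LimitReader
    // lets us consume exactly as much of the infinite stream as we want.
    data, _ := ioutil.ReadAll(io.LimitReader(zeroReader{}, 16))
    fmt.Println(len(data)) // 16
}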

Memory control is an important advantage. Readers allow you to centrally or flexibly control buffering, via bufio or custom means. This is important where memory is limited, which, until computers are created with infinite memory, is everywhere.
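
A minimal sketch of choosing the buffer size yourself with bufio (the file name is arbitrary):

package main

import (
    "bufio"
    "fmt"
    "os"
)

func main() {
    f, err := os.Open("/etc/hosts")
    if err != nil {
        fmt.Println(err)
        return
    }
    defer f.Close()

    // Only 4KB of the file is resident at a time, however large it is;
    // ReadAll would instead grow a slice to hold the whole thing.
    br := bufio.NewReaderSize(f, 4096)
    line, err := br.ReadString('\n')
    fmt.Printf("first line: %q (err: %v)\n", line, err)
}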

There are many sources of data which are larger than the memory of most computers: a large database, the complete dump of Wikipedia, or a photo collection. Many video, audio, and compression formats operate on streams directly, so that files larger than available memory can be viewed or decompressed.

Being able to control buffering is also vital in highly concurrent processes, which is Go’s bread & butter. Ideally, you’d like to concurrently work on as many buffers as possible, even if they wouldn’t all fit into memory at once, because in practice waiting for data is often slow compared to processing it.

It’s also potentially faster for protocols and file types where you can operate on different parts of a stream independently. An HTTP header can easily be parsed, processed, validated, and its underlying bytes discarded before the body of the request or the response has been received. If you waited for it all to arrive, you’d have to process each part sequentially.
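
A minimal sketch of that idea, assuming a raw message whose header block ends with a blank line: the headers are parsed and can be discarded while the body is still sitting unread in the same stream.

package main

import (
    "bufio"
    "fmt"
    "io/ioutil"
    "net/textproto"
    "strings"
)

func main() {
    msg := "Content-Type: text/plain\r\nX-Example: 1\r\n\r\nthis is the body"
    br := bufio.NewReader(strings.NewReader(msg))

    // Parse and validate the header block; its bytes can be thrown away
    // before a single byte of the body has been consumed.
    hdr, err := textproto.NewReader(br).ReadMIMEHeader()
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Println(hdr.Get("Content-Type"))

    // br now sits at the start of the body, still just a stream.
    body, _ := ioutil.ReadAll(br)
    fmt.Printf("%s\n", body)
}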

So Readers are more flexible and result in faster code that uses less memory. How, then, did we get to the point where we are repeatedly writing code like this?

func LoadGzippedJSON(r io.Reader, v interface{}) error {
    data, err := ioutil.ReadAll(r)
    if err != nil {
        return err
    }
    // oh wait, we need a Reader again.. 
    raw := bytes.NewBuffer(data)
    unz, err := gzip.NewReader(raw)
    if err != nil {
        return err
    }
    buf, err := ioutil.ReadAll(unz)
    if err != nil {
        return err
    }
    return json.Unmarshal(buf, &v)
}

When we can write code like this:

func LoadGzippedJSON(r io.Reader, v interface{}) error {
    raw, err := gzip.NewReader(r)
    if err != nil {
        return err
    }
    return json.NewDecoder(raw).Decode(&v)
}
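
Because the improved version takes an io.Reader, callers can feed it whatever they happen to have on hand. A usage sketch (the file name and URL are made up, and the os and net/http imports are assumed):

func loadBoth() error {
    var v interface{}

    // From a gzipped file on disk...
    f, err := os.Open("dump.json.gz")
    if err != nil {
        return err
    }
    defer f.Close()
    if err := LoadGzippedJSON(f, &v); err != nil {
        return err
    }

    // ...or straight from a gzipped HTTP response body; no ReadAll anywhere.
    resp, err := http.Get("http://example.com/api.json.gz")
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    return LoadGzippedJSON(resp.Body, &v)
}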

This example is actually a weak one, since the json package will still need the entire document in order to decode it, but it’s exemplary of a broader pattern. GitHub searches show 15,000 uses of json.Unmarshal and only 6,500 uses of json.NewDecoder. More broadly, a GitHub search for ioutil.ReadAll yields over 22,000 results as of this writing, and it’s a fair bet that most of them are not only unnecessary but bad practice.
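
Where streaming with encoding/json does pay off is when a Reader carries a sequence of values; a Decoder can pull them off one at a time as they arrive. A minimal sketch:

package main

import (
    "encoding/json"
    "fmt"
    "io"
    "strings"
)

func main() {
    // A stream of concatenated JSON objects, e.g. a long-running feed;
    // strings.NewReader stands in for a network connection here.
    stream := strings.NewReader(`{"n":1} {"n":2} {"n":3}`)

    dec := json.NewDecoder(stream)
    for {
        var v struct{ N int }
        if err := dec.Decode(&v); err == io.EOF {
            break
        } else if err != nil {
            fmt.Println(err)
            return
        }
        fmt.Println(v.N) // each object is handled as it is decoded
    }
}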

Of course, we all already know better, even if we don’t explicitly know that we know. We write piped commands like ls | grep foo | wc -l every day without ever considering materializing their intermediate states:

ls > files.txt
grep "foo" files.txt > grepped.txt
wc -l grepped.txt
rm files.txt grepped.txt
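
The same pipeline can be expressed in Go as a composition of streams (a rough sketch, assuming a Unix-like system with ls on the PATH); the intermediate files never exist:

package main

import (
    "bufio"
    "fmt"
    "os/exec"
    "strings"
)

func main() {
    cmd := exec.Command("ls")
    out, err := cmd.StdoutPipe() // an io.Reader over ls's output
    if err != nil {
        fmt.Println(err)
        return
    }
    if err := cmd.Start(); err != nil {
        fmt.Println(err)
        return
    }

    // Count matching lines as they stream past, like grep foo | wc -l.
    count := 0
    scanner := bufio.NewScanner(out)
    for scanner.Scan() {
        if strings.Contains(scanner.Text(), "foo") {
            count++
        }
    }
    cmd.Wait()
    fmt.Println(count)
}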

I don’t know where this comes from. Some blame higher-level languages, with their powerful built-in strings and prevalence of simple read-all APIs, but those languages also tend to have streaming APIs which are similarly underused. Take Python:

# github search for "json.loads("  => 210,000 matches
feed = urllib2.urlopen("http://example.com/api.json").read()
data = json.loads(feed)

# github search for "json.load(" => 58,000 matches
data = json.load(urllib2.urlopen("http://example.com/api.json"))

Perhaps a slice of bytes is not just a simpler concept, but a much simpler one than an abstract stream. Like teaching mathematics to a child by having them count real objects, making the abstract concrete comes naturally. It takes a while before the mind can visualize summations, integrals, and infinite series.

Go’s io.Reader is one of the jewels of the Go standard library, but this conversation is a tiny facet of a much larger one about composition via interfaces, and about the growing body of work in the standard library and elsewhere on how to most effectively leverage the facilities Go has to offer. This larger topic is one I’m woefully ill-equipped to tackle at this stage, both intellectually and in this format. I strongly suspect that, rather than discovering that the rabbit hole goes deeper than we thought, we’re just now coming upon its entrance and recognizing it for what it is.

Regardless of the destination, we can start improving our code now! You can spread this knowledge in the Go community by finding uses of ioutil.ReadAll on GitHub and sending pull requests to help improve those projects. As understanding spreads, the quality of code rises, and hopefully the whole ecosystem benefits as a result.