Ben Hyrman

Counting lines with Go

I was thinking the other day about The Unix Philosophy. Broadly, you can get a lot of power from small command line utilities that do one thing well and can be chained, or composed, into more powerful use cases. For example, there's a command called wc (word count). It, well, unsurprisingly, counts words. But you can also have it count lines by passing a flag: wc -l.

I wanted to take the idea of "do one thing well" to the extreme. And, since I'm learning Go, I thought a line count utility would be a fitting exercise. In building this, we'll be able to learn how to read input from another program to support composability; we'll learn a bit about command line arguments; and we'll learn about reading files.

I mentioned composability a bit. Let's see what that entails. It means that this utility will output only the line count (so that it can be piped on to another command). And it will need to accept, as input, either the output of another program (we might call this piped input) or a file location.

This means that our little application might be called like this

> lc "path/to/your/file.txt"

or like this

> echo "Count the lines in this" | lc

(you can substitute cat, or grep, or anything else, for echo above)

In either case, the count of newline characters (\n) in the input will be printed out.

> lc "path/to/your/file.txt"
109

If you want to just see the code, it's available on GitHub.

Implementation Considerations

There are some counting assumptions that I made. I had originally chosen to have this match my editor's line count. That is, if Visual Studio Code shows x lines then my logic would also show x lines. However, I've chosen to follow the behavior of wc -l. I count newline characters (\n). If a file does not end with a newline then the last line will not be counted.

While I'm not sure how I feel about this behavior, it is consistent with other tooling. A trailing newline is required to get an accurate count. Changing this is an exercise left to the reader.

Also, I think I might want to reuse the line-counting logic in other applications. So, I'm going to separate the command-line interface from the code that understands how to read through a stream and count newlines.

Handling a file argument

Go has a nice flag library to read from the command line. We're going to abuse it a bit to read in the first argument a user passes. Remember, the first scenario is, lc "path/to/your/file.txt". In this usage, there are no named flags being passed in. I simply need arg0.

flag.Parse()
filePath := flag.Arg(0)

if filePath == "" {
	fmt.Println("Usage:\n\tlc \"path\\to\\file.txt\"")
	return
}

file, err := os.OpenFile(filePath, os.O_RDONLY, 0444)
if err != nil {
	log.Fatal(err)
}
defer file.Close()
countLines(file)

I'm opening the file for read-only access. (The 0444 permission bits would only apply if the call created a new file, which O_RDONLY won't do, so they're effectively ignored here.) We'll get a file descriptor back from the call to os.OpenFile. You can think of a file descriptor as a small reference that we'll keep around so we know where to read data from later. And, to clean up after ourselves, we'll close the file when we're done reading. We can do this with the call to defer file.Close()

Handling piped input

The other use case we have to handle is when the data is passed, or piped, directly into our utility.

> echo "Count the lines in this" | lc

This is passed in on stdin, or standard input. Unix introduced three standard streams. These are ubiquitous enough that many programming languages have some way to access them and will, as needed, abstract the implementation details for the operating system away for you. These streams are stdin, stdout (standard output), and stderr (standard error). You might read something from the user on stdin, give them some results on stdout, and log any problems to stderr. But, as you can see in my use case above, stdin might be input from anything, including another program.

And, Unix loves file descriptors. Like, it really loves them. stdin, stdout, and stderr are all file descriptors. Yep, just like the result of the os.OpenFile call earlier. This will come in handy. Trust me.

We need to probe the stdin file descriptor to see if we received any data.

stat, err := os.Stdin.Stat()
if err != nil {
	panic(err)
}

if stat.Mode()&os.ModeCharDevice == 0 {
	reader := bufio.NewReader(os.Stdin)
	countLines(reader)
}

Calling Stat() will give us information on the associated file descriptor. In this case, that's Go's reference to stdin, which Go nicely stores for us in a variable called os.Stdin.

I know stat.Mode() & os.ModeCharDevice == 0 looks a little hairy. We're asking Go for the current file mode on the file information. This is a bitmask of the modes that are set on the file. When stdin is attached to a terminal, the 'character device' bit is set. When that bit is not set (hence the == 0 check), stdin is coming from a pipe or a redirect, so we know there's piped input for us to read.

We could read directly from the stdin file. But, I want to buffer the input to more efficiently traverse it. Go provides a buffered I/O library for just such a use case. We give bufio.NewReader an io.Reader and get back an io.Reader. What a deal. But, it does a ton for us under the hood.

Now that we have an io.Reader, we can call into the package (that we haven't written yet), and count them newlines.

func countLines(r io.Reader) {
	count, err := lc.CountLines(r)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(count)
}

Count Them Lines

Long article. I know. But, we're here. All of that setup and we can deliver the actual value in a couple dozen lines of code:

func CountLines(r io.Reader) (int, error) {
	var count int
	var read int
	var err error
	target := []byte("\n")

	buffer := make([]byte, 32*1024)

	for {
		read, err = r.Read(buffer)

		// Count before checking err; a Read is allowed to return
		// data and io.EOF from the same call.
		count += bytes.Count(buffer[:read], target)

		if err != nil {
			break
		}
	}

	if err == io.EOF {
		return count, nil
	}

	return count, err
}

Just to recap, we want to know how many times a \n character appears in a given file. In a text file, at least, this would tell us how many lines long it is (emoji can't have newlines in the middle; I checked).

We don't care about any of the content. So, the question is: how can we efficiently read through the file, get what we want, and get out? I'm going to rule out bufio's ReadBytes('\n') and higher-level abstractions like bufio.Scanner, as I want to hold as little in memory as possible. With those options, I might end up trying to read an entire file into memory before I find the first newline character.

But, we can create a byte buffer, read into that buffer, and then search just that buffer. When we're done, we'll move on to the next chunk.

So, let's create a 32 kibibyte buffer

buffer := make([]byte, 32*1024)

and then read from our file into that

read, err = r.Read(buffer)

This might give us a buffer with the following content

Hello World\nThis is a\nThree line file

The variable read will tell us how many bytes were read in. This is very important information because we're reusing our buffer and not re-initializing it between reads. Suppose we read 32KB of data on the first call to r.Read(buffer) but only 5KB of data on the second call. Our buffer will still contain 32KB of data: 5KB from the last read followed by 27KB of old data...

Next, we can use the bytes.Count function from the Go standard library to find the number of newline characters. We'll store the result in our counter variable.

count += bytes.Count(buffer[:read], target)

We will loop "forever". In reality, we will read incrementally to the end of the file. Then, we'll try to read one more time and Go will return an end-of-file error. We'll check to see when this is encountered and return our results then. In our use case, an end-of-file error is expected so... well, we shouldn't return it to the caller.

if err == io.EOF {
return count, nil
}

return count, err

An Alternative Way to Read

Calling bytes.Count(buffer[:read], target) is a very specific choice that I can make for this application. However, it might not always work for us. Suppose we wanted to know where each match was, or to do something slightly more complicated with it. Go has a way for us to do that. bytes.IndexByte will return the index position of the first occurrence of a byte in a byte slice. If no occurrence is found, then -1 is returned.

So, while it's more complicated, we can look for our \n character, then look at the slice of the buffer after that character, and then the next slice... continuing on until we're out of things to look at. Then we'd move on to the next chunk of file.

In that implementation, we would replace count += bytes.Count(buffer[:read], target) with the following

...
const target byte = '\n'
...

var position int
for {
	idxOf := bytes.IndexByte(buffer[position:read], target)
	if idxOf == -1 {
		break
	}

	count++
	position += idxOf + 1
}

Is it Fast?

I am only concerned with whether this is comparatively fast when measured against wc. True benchmarking is outside the scope of my efforts here. I've run both programs several times, which ensures the operating system and my storage have done any caching they plan to do.

Using time (the Unix timing utility) to run wc on my machine (a midrange dev laptop with an NVMe SSD) against a 1.6GB text file of lorem ipsum text, I get the following averages after an initial warmup call:

real    0m0.822s
user    0m0.156s
sys     0m0.655s

Using lc to parse the same file, I get the following averages after an initial warmup call:

real    0m0.625s
user    0m0.015s
sys     0m0.015s

So, I'm happy with how this experiment went.

Wrapping Up?

I haven't demonstrated any tests for this program. I'll leave you to review them at your leisure. Or, wait for the next exciting installment.

I've previously covered how I set Go up locally and added %GOPATH%\bin to my path. So, from within the lc project directory, I can run go install and have a shiny new command line utility to use.

Honestly, thinking up bespoke little utilities has been a lot of fun. And, once you unlock the power of chaining them together, you'll think of many new use cases. Just keep the Unix philosophy in mind.

Feel free to ping me on Twitter @hyrmn with any questions or comments.