Counting lines with Go
I was thinking the other day about The Unix Philosophy. Broadly, you can get a lot of power from small command line utilities that do one thing well and can be chained, or composed, into more powerful use cases. For example, there's a command called wc (word count). It, well, unsurprisingly, counts words. But you can also have it count lines if you pass in a flag: wc -l.
I wanted to take the idea of "do one thing well" to the extreme. And, since I'm learning Go, I thought a line count utility would be a fitting exercise. In building this, we'll be able to learn how to read input from another program to support composability; we'll learn a bit about command line arguments; and we'll learn about reading files.
I mentioned composability a bit. Let's see what that entails. It means that this utility will output only the line count (so that it can be piped on to another command). And it will need to accept, as input, either the output of another program (we might call this piped input) or a file location.
This means that our little application might be called like this
> lc "path/to/your/file.txt"
or like this
> echo "Count the lines in this" | lc
(you can substitute cat, or grep, or anything else, for echo above)
In either case, the count of newline characters (\n) in the file will be printed out.
> lc "path/to/your/file.txt"
109
If you want to just see the code, it's available on GitHub.
Implementation Considerations
There are some counting assumptions that I made. I had originally chosen to have this match my editor's line count. That is, if Visual Studio Code shows x lines, then my logic would also show x lines. However, I've chosen to follow the behavior of wc -l instead: I count newline characters (\n). If a file does not end with a newline, then the last line will not be counted.
While I'm not sure how I feel about this behavior, it is consistent with other tooling. A trailing newline is required to get an accurate count. Changing this is an exercise left to the reader.
Also, I think I might want to reuse the line-counting logic in other applications. So, I'm going to separate the command-line interface from the code that understands how to read through a stream and count newlines.
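One possible layout for that split (the file names here are illustrative; the actual repository may differ):

lc.go        // package lc: the reusable CountLines logic
cmd/
    main.go  // package main: flag parsing, stdin detection, output

The command-line shell stays thin; anything that knows how to count lines lives in the lc package so other programs can import it.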
Handling a file argument
Go has a nice flag library to read from the command line. We're going to abuse it a bit to read the first argument a user passes. Remember, the first scenario is lc "path/to/your/file.txt". In this usage, there are no named flags being passed in. I simply need arg0.
flag.Parse()

filePath := flag.Arg(0)
if filePath == "" {
    fmt.Println("Usage:\n\tlc \"path\\to\\file.txt\"")
    return
}

file, err := os.OpenFile(filePath, os.O_RDONLY, 0444)
if err != nil {
    log.Fatal(err)
}
defer file.Close()

countLines(file)
I'm opening the file for read-only access. (A note on the 0444 permission argument: os.OpenFile only uses it when creating a file, and since we aren't passing os.O_CREATE, it has no effect here.) We'll get a file descriptor back from the call to os.OpenFile. You can think of a file descriptor as a small reference that we'll keep around so we know where to read data from later. And, to clean up after ourselves, we'll close the file when we're done reading. We can do this with the call to defer file.Close().
Handling piped input
The other use case we have to handle is when the data is passed, or piped, directly into our utility.
> echo "Count the lines in this" | lc
This is passed in on stdin, or standard input. Unix introduced three standard streams. These are ubiquitous enough that many programming languages have some way to access them and will, as needed, abstract the operating system's implementation details away for you. These streams are stdin, stdout (standard output), and stderr (standard error). You might read something from the user on stdin, give them some results on stdout, and log any problems to stderr. But, as you can see in my use case above, stdin might be input from anything, including another program.
And, Unix loves file descriptors. Like, it really loves them. stdin, stdout, and stderr are all file descriptors. Yep, just like the result of the os.OpenFile call earlier. This will come in handy. Trust me.
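To make that concrete in Go terms (a small aside of mine, not code from the program): os.Stdin and the value returned by os.OpenFile are both *os.File, and *os.File satisfies io.Reader. That's what will let a single counting function serve both input modes.

// Compile-time assertions: both input sources can be used anywhere an
// io.Reader is expected.
var _ io.Reader = os.Stdin        // stdin is an *os.File
var _ io.Reader = (*os.File)(nil) // and so is an opened file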
We need to probe the stdin file descriptor to see if we received any data.
stat, err := os.Stdin.Stat()
if err != nil {
    panic(err)
}

if stat.Mode()&os.ModeCharDevice == 0 {
    reader := bufio.NewReader(os.Stdin)
    countLines(reader)
}
Calling Stat() will give us information on the associated file descriptor. In this case, that's Go's reference to stdin, which Go nicely stores for us in a variable called os.Stdin.
I know stat.Mode()&os.ModeCharDevice == 0 looks a little hairy. We're asking Go for the current file mode on the file information. This is a bitmask of the modes that are set on the file. When the 'this is a character device' flag is not set (the expression equals zero), stdin isn't attached to an interactive terminal; it's being fed by a pipe or a file, so we know there's piped data to read.
We could read directly from the stdin file. But, I want to buffer the input to traverse it more efficiently. Go provides a buffered I/O library for just such a use case. We give bufio.NewReader an io.Reader and get back an io.Reader. What a deal. But, it does a ton for us under the hood.
Now that we have an io.Reader, we can call into the package (that we haven't written yet) and count them newlines.
func countLines(r io.Reader) {
    count, err := lc.CountLines(r)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(count)
}
Count Them Lines
Long article. I know. But, we're here. All of that setup and we can deliver the actual value in about 20 lines of code:
func CountLines(r io.Reader) (int, error) {
    var count int
    var read int
    var err error
    var target []byte = []byte("\n")

    // A 32KB scratch buffer, reused across reads.
    buffer := make([]byte, 32*1024)

    for {
        read, err = r.Read(buffer)
        // Per the io.Reader contract, process any bytes that were read
        // before inspecting the error: a reader may return data along
        // with io.EOF on the same call.
        count += bytes.Count(buffer[:read], target)
        if err != nil {
            break
        }
    }

    // Reaching the end of the input is expected, not an error.
    if err == io.EOF {
        return count, nil
    }

    return count, err
}
Just to recap, we want to know how many times a \n character appears in a given file. In a text file, at least, this would tell us how many lines long it is (emoji can't have newlines in the middle; I checked).
We don't care about any of the content. So, the question is: how can we efficiently read through the file, get what we want, and get out? I'm going to rule out bufio.Reader.ReadBytes('\n') and higher-level abstractions like bufio.Scanner, as I want to hold as little in memory as possible. With those options, I might end up trying to read an entire file into memory before I find the first newline character.
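For contrast, here is roughly what a bufio.Scanner version could look like (my sketch, not part of lc). It's shorter, but Scanner buffers each whole line, counts a final line even without a trailing \n, and fails with bufio.ErrTooLong on any line over bufio.MaxScanTokenSize (64KB by default):

// countLinesWithScanner is a simpler but less memory-frugal alternative.
func countLinesWithScanner(r io.Reader) (int, error) {
    scanner := bufio.NewScanner(r)
    count := 0
    for scanner.Scan() {
        count++
    }
    return count, scanner.Err()
}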
But, we can create a byte buffer, read into that buffer, and then search just that buffer. When we're done, we'll move on to the next chunk.
So, let's create a 32 kibibyte buffer
buffer := make([]byte, 32*1024)
and then read from our file into that
read, err = r.Read(buffer)
This might give us a buffer with the following content
Hello World\nThis is a\nThree line file
The variable read will tell us how many bytes were read in. This is very important information because we're reusing our buffer and not re-initializing it between reads. Suppose we read 32KB of data on the first call to r.Read(buffer) but only 5KB of data on the second call. Our buffer will still contain 32KB of data: 5KB from the last read followed by 27KB of old data...
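To make the pitfall concrete, a small sketch of mine (the read sizes are hypothetical):

// demoStaleTail shows why we slice with buffer[:read]: after a short
// read, the tail of the buffer still holds bytes from the previous read.
func demoStaleTail(r io.Reader) (wrong, right int) {
    target := []byte("\n")
    buffer := make([]byte, 32*1024)

    read, _ := r.Read(buffer) // imagine this fills all 32KB
    read, _ = r.Read(buffer)  // imagine this returns only 5KB

    wrong = bytes.Count(buffer, target)        // also counts the stale 27KB tail
    right = bytes.Count(buffer[:read], target) // counts only the fresh 5KB
    return wrong, right
}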
Next, we can use the bytes.Count function from the Go standard library to find the number of newline characters. We'll store the result in our counter variable.
count += bytes.Count(buffer[:read], target)
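Given the example buffer from above, bytes.Count behaves like this:

buffer := []byte("Hello World\nThis is a\nThree line file")
target := []byte("\n")

// Count returns the number of non-overlapping occurrences of target:
// 2 here, since the third line has no trailing newline.
fmt.Println(bytes.Count(buffer, target)) // prints 2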
We will loop "forever". In reality, we will read incrementally to the end of the file. Then, we'll try to read one more time and Go will return an end-of-file error, io.EOF. We'll check to see when this is encountered and return our results then. In our use case, an end-of-file error is expected, so... well, we shouldn't return it to the caller.
if err == io.EOF {
    return count, nil
}

return count, err
An Alternative Way to Read
Calling bytes.Count(buffer[:read], target) is a very specific choice that I can make for this application. However, it might not always work for us. Suppose we were looking for a slightly more complicated pattern. Go has a way for us to do that. bytes.IndexByte will return the index position of the first occurrence of a byte in a byte slice. If no occurrence is found, then -1 is returned.
So, while it's more complicated, we can look for our \n character, and then look at the next slice of the buffer after that character, and then the next slice... continuing on until we're out of things to look at. Then we'd move on to the next chunk of the file.
In that implementation, we would replace count += bytes.Count(buffer[:read], target) with the following:
...
const target byte = '\n'
...
// position is an offset into the current chunk; it must start back at
// 0 before each newly read chunk is scanned.
var position int
for {
    idxOf := bytes.IndexByte(buffer[position:read], target)
    if idxOf == -1 {
        break
    }

    count++
    position += idxOf + 1
}
Is it Fast?
I am only concerned with whether this is comparatively fast when measured against wc. Getting true benchmarking numbers is outside the scope of my efforts here. I've run both programs several times, which ensures the operating system and my storage have done whatever caching they plan to do.
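For reference, the measurements below come from invocations along these lines (the file name is a stand-in):
> time wc -l lipsum.txt
> time lc lipsum.txt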
Using time (the Unix timing utility) to run wc on my machine (a midrange dev laptop with an NVMe SSD) against a 1.6GB text file of lorem ipsum, I get the following averages after an initial warmup call:
real 0m0.822s
user 0m0.156s
sys 0m0.655s
Using lc to parse the same file, I get the following averages after an initial warmup call:
real 0m0.625s
user 0m0.015s
sys 0m0.015s
So, I'm happy with how this experiment went.
Wrapping Up?
I haven't demonstrated any tests for this program. I'll leave you to review them at your leisure. Or, wait for the next exciting installment.
I've previously covered how I set up Go locally and added %GOPATH%\bin to my path. So, from within the lc project directory, I can run go install and have a shiny new command line utility to use.
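Putting that together (assuming the project directory is named lc):
> cd lc
> go install
> lc "path/to/your/file.txt"
109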
Honestly, thinking up bespoke little utilities has been a lot of fun. And, once you unlock the power of chaining them together, you'll think of many new use cases. Just keep the Unix philosophy in mind.
Feel free to ping me on Twitter @hyrmn with any questions or comments.