Sorting CSVs externally

Some time ago, I had to sort a very large comma-separated-value (CSV) file. There are many good solutions to this simple problem. The better ones are probably loading the file into SQLite, or into PostgreSQL running in a Docker container, and sorting with SQL. Another option is adapting the data with a quick Python script so that it is easy to feed into Unix's sort.
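For illustration, the SQLite route could look roughly like the sketch below. It uses only Python's standard csv and sqlite3 modules. The function name, the table name csv_rows and the parameters are made up for this example, and it assumes a header row, column names without double quotes, and the same number of fields on every line.

```python
import csv
import sqlite3


def sort_csv_with_sqlite(src, dst, key_column, db_path):
    """Load the CSV at src into an SQLite file and write it back sorted.

    Hypothetical helper: db_path should point to a file on disk so the
    data does not have to fit in memory.
    """
    con = sqlite3.connect(db_path)
    try:
        with open(src, newline="") as f:
            reader = csv.reader(f)
            header = next(reader)
            # Assumption: header names contain no double quotes.
            cols = ", ".join(f'"{name}"' for name in header)
            marks = ", ".join("?" for _ in header)
            con.execute(f'CREATE TABLE csv_rows ({cols})')
            # Rows are streamed from the reader, not loaded all at once.
            con.executemany(f'INSERT INTO csv_rows VALUES ({marks})', reader)
            con.commit()

        with open(dst, "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(header)
            # Values were inserted as text, so this is a text comparison.
            cursor = con.execute(
                f'SELECT * FROM csv_rows ORDER BY "{key_column}"'
            )
            writer.writerows(cursor)
    finally:
        con.close()
```

A call would look like sort_csv_with_sqlite("big.csv", "sorted.csv", "id", "scratch.db"). A line with too many or too few fields makes the insert fail, which is one of the reasons bad lines need a decision up front.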

Instead, a friend at the time advised Spark. Although Spark does have methods for larger-than-memory datasets, the data needs to be well partitioned and it is easy to go wrong. It failed spectacularly [1]. I knew that this shouldn't be a hard problem, and because a lot of the code there was already in Python, I looked for and easily found a Python external sorter. It was not the quickest, but it got the job done.
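The sorter mentioned above is not reproduced here. As a rough sketch of the general technique (an external merge sort), the idea is to sort chunks that fit in memory, write each one to a temporary file, and then merge the sorted files. The function name and parameters are hypothetical, and the sketch assumes a header row, a plain string comparison on one column, and that every line has that column.

```python
import csv
import heapq
import os
import tempfile
from itertools import islice


def external_sort_csv(src, dst, key_index=0, chunk_rows=100_000):
    """Sort the CSV at src by one column and write the result to dst."""
    def sort_key(row):
        return row[key_index]  # compared as a string

    run_paths = []
    with open(src, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        while True:
            # Hold at most chunk_rows rows in memory at a time.
            chunk = list(islice(reader, chunk_rows))
            if not chunk:
                break
            chunk.sort(key=sort_key)
            fd, path = tempfile.mkstemp(suffix=".csv")
            with os.fdopen(fd, "w", newline="") as run:
                csv.writer(run).writerows(chunk)
            run_paths.append(path)

    files = [open(p, newline="") for p in run_paths]
    try:
        readers = [csv.reader(f) for f in files]
        with open(dst, "w", newline="") as out:
            writer = csv.writer(out)
            if header is not None:
                writer.writerow(header)
            # heapq.merge lazily merges the already sorted runs.
            writer.writerows(heapq.merge(*readers, key=sort_key))
    finally:
        for f in files:
            f.close()
        for p in run_paths:
            os.remove(p)
```

The chunk size is the knob that trades memory for the number of temporary files. With a very large number of runs, the merge step would need to happen in several passes, which this sketch leaves out.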

But it did leave me wondering. Although pandas' CSV parser is probably quite optimized (and you can turn error reporting for faulty CSV lines on and off), there is still a Python performance penalty somewhere. I also wanted to experiment with a bit of Scala coding (and not just Spark-flavoured Scala). So I made my own external CSV sorter in Scala. I hope that by using Univocity's parsing library I can make a reasonably quick external sorter that behaves predictably even when presented with bad lines. But I still need to test it properly, learn the library, and learn how to work nicely with Scala. It turns out that working around type erasure in pattern matching and making the sorter properly generic is not that easy. Hopefully I will learn more about those things, and others, by working on this project.

  1. Spark is very well suited to other kinds of tasks, but it is often misused like this.