Recently I had to write a Python script that needed to parse large gzipped CSV files. First I reached for the standard csv module, which is quite straightforward to use. Unfortunately, it proved far too slow. In fact, I gave up waiting for it to parse even a single file containing about 30 million rows! And my whole data set had more than 300 million rows altogether.
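For context, that first attempt looked roughly like this (a minimal sketch, assuming the file is opened through gzip and handled row by row; the file name and process_row are placeholders, not the actual script):

import csv
import gzip

def process_row(row):
    # placeholder for whatever work is done per record
    pass

# read the gzipped CSV row by row with the standard library
with gzip.open("data.csv.gz", mode="rt", newline="") as f:
    reader = csv.DictReader(f)
    for row in reader:
        process_row(row)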
After googling a bit I realised that my good old friend pandas has a built-in CSV parser - pandas.read_csv. It allows great flexibility, letting you set various options. One of the more useful ones in my case was usecols - a list of the columns I was interested in from the original file. Since the file contained many more columns than I needed for my purposes, I could reduce the result size considerably. You can also specify conversion functions and many other useful things.
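A rough sketch of how such a call might look (the file name, column names and converter below are made up for illustration; pandas infers gzip compression from the .gz extension):

import pandas as pd

# hypothetical column names; only these three are loaded from the file
wanted = ["user_id", "timestamp", "amount"]

df = pd.read_csv(
    "data.csv.gz",                 # gzip compression inferred from the extension
    usecols=wanted,                # skip the columns we don't need
    converters={"amount": float},  # example conversion function for one column
)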
pandas.read_csv returns the familiar DataFrame. It can be iterated over directly using this piece of code:
for ind, row in df.iterrows():
    do_something(row)
And of course you have all the power of pandas at your disposal to run all sorts of statistics and calculations on the object.
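For example (sticking with the made-up column names from the earlier sketch), an aggregation like this runs directly on the DataFrame without any explicit loop:

# hypothetical aggregation: total and mean amount per user
summary = df.groupby("user_id")["amount"].agg(["sum", "mean"])
print(summary.head())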
My final execution time using pandas was around 4-5 minutes, as opposed to an unknown number of hours with the standard csv module!