I ran across an interesting line of code today, and thought I'd share some
insights. First, though, we need a little context. Imagine reading several lines
of data from a CSV file (using Python's built-in csv module). You'll typically
have some code that looks something like this:
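import csv
# 'rb' is what Python 2's csv module expects; on Python 3, use
# open('data.csv', newline='') instead.
with open('data.csv', 'rb') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        # Do some stuff with each row,
        # where the row is a list of strings.
        pass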
So, that's what we've got, and within the for loop, I found this code:
row = map(lambda x: x.strip(), row)
What does it do? It simply strips whitespace from the beginning and end of each item in our list of strings. But how it accomplishes this is worth picking apart.
First, this code uses map to apply a function to each item in the list. Then,
we construct an anonymous lambda function which accepts a parameter, calls the
input's strip method and returns the result.
Essentially, we call a function for each item in the list. Keep in mind,
we're also doing this inside a for loop. That's a function call for each cell
in your CSV file.
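To make that per-item function call concrete, here's a small sketch of what map plus lambda is doing, next to two equivalent spellings. It assumes Python 3, where map returns a lazy iterator and needs list() around it (in Python 2, which the snippets in this post were written for, map returns a list directly). The sample row is made up for illustration.

```python
# Hypothetical sample row, standing in for one row from csv.reader.
row = ['  alice ', ' 42', 'nyc  ']

# map + lambda: the lambda is called once per item.
# list() is needed on Python 3, where map returns an iterator.
with_lambda = list(map(lambda x: x.strip(), row))

# The same work spelled out as an explicit loop.
with_loop = []
for item in row:
    with_loop.append(item.strip())

# The lambda only forwards to str.strip, so the method can be passed directly.
with_method = list(map(str.strip, row))

print(with_lambda)
```

All three produce the same list; the last form skips the extra lambda call per item.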
We can also achieve the same outcome with a list comprehension:
row = [x.strip() for x in row]
Or with a generator expression! (Note the parentheses instead of square brackets.)
row = (x.strip() for x in row)
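One thing worth knowing about the generator version: it doesn't strip anything when it's created. The work happens item by item as something iterates over it, and it can only be iterated once. A minimal sketch, using a made-up sample row:

```python
row = ['  alice ', ' 42', 'nyc  ']  # hypothetical sample row

# Creating the generator does no stripping yet.
stripped = (x.strip() for x in row)

first = next(stripped)  # strips only the first item
rest = list(stripped)   # strips the remaining items

# The generator is now exhausted; iterating again yields nothing.
leftover = list(stripped)
```

If you need to index into the row or loop over it more than once, the list comprehension is probably the better fit.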
Measure it
Out of curiosity, I decided to time this with a small row of data (10 items,
in particular). I used timeit to run this code on my laptop (a late-2011
MacBook Air) with some simple data. Here's what I found:
>>> import timeit
>>> timeit.timeit(
... stmt="map(lambda x: x.strip(), row)",
... setup="row = [' {0} '.format(i) for i in range(10)]"
... )
2.491640090942383
With 1,000,000 rows of data (the default number of test iterations, and an admittedly unlikely scenario for a one-off CSV import) this code runs in about two and a half seconds.
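Since timeit.timeit returns the total time for all of its runs, dividing by the number of runs gives a rough per-row cost. Here's a small sketch of that arithmetic. The run count is an arbitrary smaller value to keep things quick, and list() is wrapped around map on the assumption that you're on Python 3, where a bare map wouldn't actually do the stripping.

```python
import timeit

runs = 100000  # arbitrary; timeit's default is 1,000,000

total = timeit.timeit(
    stmt="list(map(lambda x: x.strip(), row))",
    setup="row = [' {0} '.format(i) for i in range(10)]",
    number=runs,
)

per_row = total / runs  # seconds spent on one 10-item row
print("total: {0:.3f}s, per row: {1:.2e}s".format(total, per_row))
```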
Now let's see how the list comprehension and generator versions of the same code stack up!
>>> timeit.timeit(
... stmt="[x.strip() for x in row]",
... setup="row = [' {0} '.format(i) for i in range(10)]"
... )
1.6442670822143555
Quite a bit better! Nearly a second faster. Let's see about the generator:
>>> timeit.timeit(
... stmt="(x.strip() for x in row)",
... setup="row = [' {0} '.format(i) for i in range(10)]"
... )
0.48253297805786133
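Yep. About two whole seconds faster than our original code. Keep in mind, though, that this statement only creates the generator; no item is actually stripped until something iterates over it.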
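To check how much of that gap is just deferred work, one option is to time the generator both ways: created only, and consumed with list(). The sketch below assumes Python 3 (hence list() around map as well); the labels and the run count are my own choices, and I've left out timings since they'll vary by machine and Python version.

```python
import timeit

setup = "row = [' {0} '.format(i) for i in range(10)]"
runs = 100000  # arbitrary; smaller than timeit's default to keep this quick

# Statements to compare. Only the last one forces the generator to run.
candidates = {
    "map + lambda": "list(map(lambda x: x.strip(), row))",
    "list comprehension": "[x.strip() for x in row]",
    "generator (created only)": "(x.strip() for x in row)",
    "generator (consumed)": "list(x.strip() for x in row)",
}

timings = {}
for label, stmt in candidates.items():
    timings[label] = timeit.timeit(stmt=stmt, setup=setup, number=runs)

for label, seconds in timings.items():
    print("{0:<26} {1:.3f}s".format(label, seconds))
```

The "consumed" row is the one to compare against the list comprehension, since both end up stripping every item.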
So what's the take-away here? Well, generator expressions are pretty amazing. Use them.
Finally, small things add up. Little decisions, like whether to use
map + lambda or a generator expression, can have a fairly significant impact
on the performance of your software.