Thursday 17 December 2015

Creating floating point arrays in Python

Note: all script output is included at the bottom of each code block and indicated with '>>>' (or sometimes the output is summarised in a code comment). Yes, it's confusing, but you're smart ;)
Let's imagine you want a simple array of evenly spaced floating point numbers in Python: [0.1, 0.2, 0.3 ... 0.9]. You start by trying to use Python's built-in range() function:
x = range(0.1, 1, 0.1) # Don't do this

You expect to get an array of numbers from 0.1 (first argument, inclusive) to 1 (second argument, exclusive) with a step size of 0.1 (third argument).
But the Python range function can only deal with integer step sizes, and complains:
>>> TypeError: range() integer step argument expected, got float.
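You could roll your own generator here (a quick sketch; frange is just an invented name):

def frange(start, stop, step):
    # yield floats from start (inclusive) until stop (exclusive)
    while start < stop:
        yield start
        start += step

print(list(frange(0.1, 1, 0.1)))
# Careful, though: the running sum accumulates rounding error, so this
# actually yields ten values, the last being 0.9999999999999999 (just under 1).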

OK, so you import numpy and use arange(), right?
import numpy as np
x = np.arange(0.1, 1, 0.1)
# x is [0.1, 0.2, ..., 0.9]

Exactly what you wanted. But what if you don't have numpy? What if you care about code footprint, portability, and all those things? What if you want someone else to be able to generate your super interesting array, and they don't have numpy? Do you ask them to install numpy, knowing their lives will be better in the long run? After internal debate, you delete the numpy dependency, and try a list comprehension instead:
x = [x * 0.1 for x in range(1, 10)]

Hah, now you must surely have the best of all worlds. One line, no dependencies, and a list comprehension. This is great. This is amazing. You print it to double-check your handiwork:
print(x)
>>> [0.1, 0.2, 0.30000000000000004, 0.4, 0.5, 0.6000000000000001, 0.7000000000000001, 0.8, 0.9]

Wat?
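What you're seeing is binary floating point: 0.1 has no exact base-2 representation, so Python stores the closest double it can (and numpy stores the same approximations; arange just rounds them for display). The decimal module shows what's really in there:

from decimal import Decimal
print(Decimal(0.1))
>>> 0.1000000000000000055511151231257827021181583404541015625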
Oh yes, computers suck at floating point numbers. Now what? Back to numpy? Round the numbers? Write a library to handle all of this? Re-write numpy? You try once more for a simple solution:
x = [x/10.0 for x in range(1,10)]
print(x)
>>> [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
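Before trusting it, you spot-check one value both ways:

print(3 * 0.1 == 0.3)   # False: multiplying by the already-approximate 0.1 compounds the error
print(3 / 10.0 == 0.3)  # True: 3 and 10 are stored exactly, so the division rounds just once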

Interesting. Division works where multiplication doesn't: x / 10.0 divides two exactly-stored numbers and rounds once, while x * 0.1 multiplies by a value that was already slightly off. You try out a few variations of each to make sure the distinction is consistent, and find that it is. You're kind of happy with the division solution, but floating point division is slower than floating point multiplication, right? What if someone wants to do this a million times? You decide to see how much time you're losing for floating point precision:
import time

N = 1000000
t1 = time.time()

for j in range(N):
    foo = [x * 0.1 for x in range(1, 10)]
print("multiplication: {}".format(time.time() - t1))

t1 = time.time()
for j in range(N):
    foo = [x / 10.0 for x in range(1, 10)]
print("division: {}".format(time.time() - t1))

>>> multiplication: 4.11618614197
>>> division: 4.24211502075
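(If you were benchmarking for real, Python's standard timeit module handles the clock and the loop for you; a roughly equivalent sketch:)

# ... (reusing N from above)
import timeit
print("multiplication: {}".format(timeit.timeit("[x * 0.1 for x in range(1, 10)]", number=N)))
print("division: {}".format(timeit.timeit("[x / 10.0 for x in range(1, 10)]", number=N)))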

That 0.1 second for every million runs hurts a bit. Division is slow. Why not just multiply the numbers as in the first attempt, and then round them to 2 places? Last try:
# ...
t1 = time.time()
for j in range(N):
    foo = [round(x * 0.1, 2) for x in range(1, 10)]
print("round: {}".format(time.time() - t1))

No more slow division! And how long can it take to shave off some decimal places?
Quite a while, it turns out. Is that nearly 30 seconds? Yes, it is.
>>> multiplication: 4.11618614197
>>> division: 4.24211502075
>>> round: 29.5979890823
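Back-of-envelope with the numbers above: nine round() calls per list, a million lists, so nine million calls adding roughly 25 seconds:

# rough cost per round() call, from the timings above
extra = 29.5979890823 - 4.11618614197  # seconds beyond the bare multiplication loop
print(extra / (9 * 1000000))  # ~2.8e-06, i.e. nearly 3 microseconds per call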

OK, that's slow. Slower than sub-string manipulation, as it turns out. You convert the broken float to a string, keep only the first three characters, and turn it back into a float, just to prove a point:
# ...
t1 = time.time()
for j in range(N):
    foo = [float(str(x * 0.1)[:3]) for x in range(1, 10)]
print("string: {}".format(time.time() - t1))

>>> multiplication: 4.10650587082
>>> division: 4.21635198593
>>> round: 30.1995661259
>>> string: 23.5980100632
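(To be fair to round(), the string hack only gets away with it because every value here fits in three characters; anything longer is silently truncated:)

print(float(str(0.25)[:3]))
>>> 0.2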

Now that you've put the 0.1 second paid for division into context, you feel OK using it. For comparison (even though you said 'last try' a while back), you add timing for numpy as well:
# ...
t1 = time.time()
for j in range(N):
    foo = np.arange(0.1, 1, 0.1)
print("numpy: {}".format(time.time() - t1))

>>> multiplication: 4.17134094238
>>> division: 4.23171901703
>>> round: 36.1641287804
>>> string: 23.5332589149
>>> numpy: 3.10889482498

A whole second faster! Maybe you should include that massive dependency after all. But you remind yourself what you've read a million times in start-up blogs: the most important thing is code readability, and the compiler will do optimization better than you can ever hope to, right? Right?? Or maybe not.
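(One last caveat on that comparison: arange hands you an ndarray, not a list, so it isn't quite apples to apples. If a plain Python list is what you actually need, the conversion back is on you:)

# ...
foo = np.arange(0.1, 1, 0.1)
print(type(foo))    # numpy.ndarray, not a list
bar = foo.tolist()  # back to a plain list of floats; costs extra, not timed here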

