performance - Storing Python objects in a Python list vs. a fixed-length Numpy array

Question

Welcome To Ask or Share your Answers For Others

performance - Storing Python objects in a Python list vs. a fixed-length Numpy array

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

performance - Storing Python objects in a Python list vs. a fixed-length Numpy array

In doing some bioinformatics work, I've been pondering the ramifications of storing object instances in a Numpy array rather than a Python list, but in all the testing I've done the performance was worse in every instance. I am using CPython. Does anyone know the reason why?

Specifically:

What are the performance impacts of using a fixed-length array numpy.ndarray(dtype=object) vs. a regular Python list? Initial tests I performed showed that accessing the Numpy array elements was slower than iteration through the Python list, especially when using object methods.
Why is it faster to instantiate objects using a list comprehension such as [ X() for i in range(n) ] instead of a numpy.empty(size=n, dtype=object)?
What is the memory overhead of each? I was not able to test this. My classes extensively use __slots__, if that has any impact.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-17T03:08:58+0000

Don't use object arrays in numpy for things like this.

They defeat the basic purpose of a numpy array, and while they're useful in a tiny handful of situations, they're almost always a poor choice.

Yes, accessing an individual element of a numpy array in python or iterating through a numpy array in python is slower than the equivalent operation with a list. (Which is why you should never do something like y = [item * 2 for item in x] when x is a numpy array.)

Numpy object arrays will have a slightly lower memory overhead than a list, but if you're storing that many individual python objects, you're going to run into other memory problems first.

Numpy is first and foremost a memory-efficient, multidimensional array container for uniform numerical data. If you want to hold arbitrary objects in a numpy array, you probably want a list, instead.

My point is that if you want to use numpy effectively, you may need to re-think how you're structuring things.

Instead of storing each object instance in a numpy array, store your numerical data in a numpy array, and if you need separate objects for each row/column/whatever, store an index into that array in each instance.

This way you can operate on the numerical arrays quickly (i.e. using numpy instead of list comprehensions).

As a quick example of what I'm talking about, here's a trivial example without using numpy:

from random import random

class PointSet(object):
    def __init__(self, numpoints):
        self.points = [Point(random(), random()) for _ in xrange(numpoints)]

    def update(self):
        for point in self.points:
            point.x += random() - 0.5
            point.y += random() - 0.5

class Point(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y

points = PointSet(100000)
point = points.points[10]

for _ in xrange(1000):
    points.update()
    print 'Position of one point out of 100000:', point.x, point.y

And a similar example using numpy arrays:

import numpy as np

class PointSet(object):
    def __init__(self, numpoints):
        self.coords = np.random.random((numpoints, 2))
        self.points = [Point(i, self.coords) for i in xrange(numpoints)]

    def update(self):
        """Update along a random walk."""
        # The "+=" is crucial here... We have to update "coords" in-place, in
        # this case. 
        self.coords += np.random.random(self.coords.shape) - 0.5

class Point(object):
    def __init__(self, i, coords):
        self.i = i
        self.coords = coords

    @property
    def x(self):
        return self.coords[self.i,0]

    @property
    def y(self):
        return self.coords[self.i,1]


points = PointSet(100000)
point = points.points[10]

for _ in xrange(1000):
    points.update()
    print 'Position of one point out of 100000:', point.x, point.y

There are other ways to do this (you may want to avoid storing a reference to a specific numpy array in each point, for example), but I hope it's a useful example.

Note the difference in speed at which they run. On my machine, it's a difference of 5 seconds for the numpy version vs 60 seconds for the pure-python version.

Categories

performance - Storing Python objects in a Python list vs. a fixed-length Numpy array

performance - Storing Python objects in a Python list vs. a fixed-length Numpy array

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags