Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
436 views
in Technique[技术] by (71.8m points)

python - For every point in an array, find the closest point to it in a second array and output that index

If I have two arrays:

X = np.random.rand(10000,2)
Y = np.random.rand(10000,2)

How can I, for each point in X, find out which point in Y is closest to it? So that in the end I have an array showing:

x1_index   y_index_of_closest
   1               7
   2               54
   3               3
  ...             ...

I want to do this for both columns in X and compare each to each column and value in Y

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

This question is pretty popular. Since similar questions keep getting closed and linked here, I think it's worth pointing out that even though the existing answers are quite fast for thousands of data points, they start to break down after that. My potato segfaults at 10k items in each array.

The potential problem with the other answers is the algorithmic complexity. They compare everything in X to everything in Y. To get around that, at least on average, we need a better strategy for ruling out some of the things in Y.

In one dimension this is easy -- just sort everything and start popping out nearest neighbors. In two dimensions there are a variety of strategies, but KD-trees are reasonably popular and are already implemented in the scipy stack. On my machine, there's a crossover between the various methods around the point where each of X and Y have 6k things in them.

from scipy.spatial import KDTree

tree = KDTree(X)
neighbor_dists, neighbor_indices = tree.query(Y)

The extremely poor performance of scipy's KDTree implementation has been a sore spot of mine for awhile, especially with as many things as have been built on top of it. There are probably data sets where it performs well, but I haven't seen one yet.

If you don't mind an extra dependency, you can get a 1000x speed boost just by switching your KDTree library. The package pykdtree is pip-installable, and I pretty much guarantee the conda packages work fine too. With this approach, my used, budget chromebook can process X and Y with 10 million points each in barely 30 seconds. That beats segfaulting at 10 thousand points with a wall time ;)

from pykdtree.kdtree import KDTree

tree = KDTree(X)
neighbor_dists, neighbor_indices = tree.query(Y)

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...