Fastest way to convert a list of indices to a 2D numpy array

Question:

I have a list of indices:

a = [
  [1,2,4],
  [0,2,3],
  [1,3,4],
  [0,2]]

What is the fastest way to convert it to a numpy array in which each index marks the position of a 1?

That is, what I want is:

output = array([
  [0,1,1,0,1],
  [1,0,1,1,0],
  [0,1,0,1,1],
  [1,0,1,0,0]])

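I know the maximum size of the array in advance. I know I could loop over each list and set a 1 at each index position, but is there a faster/vectorized way to do this?

My use case may have thousands of rows/columns, and I need to do this thousands of times, so the faster the better.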

Answer:

How about this:

import numpy as np

ncol = 5
nrow = len(a)
out = np.zeros((nrow, ncol), int)
out[np.arange(nrow).repeat([*map(len,a)]), np.concatenate(a)] = 1
out
# array([[0, 1, 1, 0, 1],
#        [1, 0, 1, 1, 0],
#        [0, 1, 0, 1, 1],
#        [1, 0, 1, 0, 0]])
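To make the one-liner above easier to follow, here is a minimal sketch that splits it into its two index arrays, using the example list from the question. The names `rows` and `cols` are introduced here for illustration only and do not appear in the original answer.

```python
import numpy as np

a = [[1, 2, 4], [0, 2, 3], [1, 3, 4], [0, 2]]
ncol = 5          # known in advance, as stated in the question
nrow = len(a)

# Row index for every entry: row i is repeated len(a[i]) times.
rows = np.arange(nrow).repeat([len(r) for r in a])
# Column index for every entry: the sublists laid end to end.
cols = np.concatenate(a)

out = np.zeros((nrow, ncol), dtype=int)
out[rows, cols] = 1   # a single fancy-indexed assignment sets all the ones

print(rows)  # [0 0 0 1 1 1 2 2 2 3 3]
print(cols)  # [1 2 4 0 2 3 1 3 4 0 2]
```

Pairing `rows[k]` with `cols[k]` gives the coordinates of the k-th 1, so the Python-level loop over rows is replaced by one indexed assignment.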

Here are timings for a 1000x1000 binary array. Note that I used an optimized version of the code above; see the function pp below:

pp 21.717635259992676 ms
ts 37.10938713003998 ms
u9 37.32933565042913 ms

Code that produced the timings:

import itertools as it
import numpy as np

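def make_data(n,m):
    # Each row gets its own random density; I, J are the row/column indices of the ones.
    I,J = np.where(np.random.random((n,m))<np.random.random((n,1)))
    # Split the column indices at the row boundaries to get one list per row.
    return [*map(np.ndarray.tolist, np.split(J, I.searchsorted(np.arange(1,n))))]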

def pp():
    sz = np.fromiter(map(len,a),int,nrow)
    out = np.zeros((nrow,ncol),int)
    out[np.arange(nrow).repeat(sz),np.fromiter(it.chain.from_iterable(a),int,sz.sum())] = 1
    return out

def ts():
    out = np.zeros((nrow,ncol),int)
    for i, ix in enumerate(a):
        out[i][ix] = 1
    return out

def u9():
    out = np.zeros((nrow,ncol),int)
    for i, (x, y) in enumerate(zip(a, out)):
        y[x] = 1
        out[i] = y
    return out

nrow,ncol = 1000,1000
a = make_data(nrow,ncol)

from timeit import timeit
assert (pp()==ts()).all()
assert (pp()==u9()).all()

# timeit returns total seconds for 100 calls; multiplying by 10 gives ms per call
print("pp", timeit(pp,number=100)*10, "ms")
print("ts", timeit(ts,number=100)*10, "ms")
print("u9", timeit(u9,number=100)*10, "ms")
