Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - Encode Python lists as indexes of unique values

I'd like to represent an arbitrary list as two other lists.

The first, call it values, contains the unique elements of the original list. The second, call it codes, contains the index in values of each element of the original list, so that the original list can be reconstructed as

orig_list = [values[c] for c in codes]

(Note: this is similar to how pandas.Categorical represents series)
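
To make the target representation concrete, here is a small hand-worked sketch. The sample list and the variable name rebuilt are made up for illustration; values is shown sorted, although the reconstruction only requires that codes agree with whatever order values uses.

```python
# Hypothetical sample data, chosen only to illustrate the encoding.
orig_list = ["b", "a", "c", "a", "b"]

# Unique elements of orig_list (sorted here, but any fixed order works).
values = ["a", "b", "c"]

# For each element of orig_list, its position in `values`.
codes = [1, 0, 2, 0, 1]

# The round trip from the question recovers the original list.
rebuilt = [values[c] for c in codes]
assert rebuilt == orig_list
```

Each entry of codes is a valid index into values, and values contains no duplicates, which is what makes the pair a lossless encoding of the list.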

I've created the function below to do this decomposition:

def decompose(x):
    values = sorted(list(set(x)))
    codes = [0 for _ in x]
    for i, value in enumerate(values):
        codes = [i if elem == value else code for elem, code in zip(x, codes)]
    return values, codes

This works, but I would like to know if there is a better/more efficient way of achieving this (no double loop?), or if there's something in the standard library that could do this for me.

Update:

The answers below are great and a big improvement to my function.

I've timed all the answers that worked as intended:

import random

test_list = [random.randint(1, 10) for _ in range(10000)]
functions = [decompose, decompose_boris1, decompose_boris2,
             decompose_alexander, decompose_stuart1, decompose_stuart2,
             decompose_dan1]
for f in functions:
    print("-- " + f.__name__)
    # Check that the decomposition round-trips before timing it
    values, codes = f(test_list)
    decoded_list = [values[c] for c in codes]
    if decoded_list == test_list:
        print("Test passed")
        # %timeit is an IPython/Jupyter magic, not plain Python
        %timeit f(test_list)
    else:
        print("Test failed")
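
The loop above relies on the %timeit magic, so it only runs inside IPython or Jupyter. As a rough sketch of the same check-then-time idea in plain Python, the standard library's timeit module could be used instead. The helper name check_and_time and its number and repeat defaults are assumptions made for this illustration, and only the question's own decompose is timed so that the block is self-contained.

```python
import random
import timeit


def decompose(x):
    # The question's original function, repeated so this sketch runs on its own.
    values = sorted(set(x))
    codes = [0 for _ in x]
    for i, value in enumerate(values):
        codes = [i if elem == value else code for elem, code in zip(x, codes)]
    return values, codes


def check_and_time(f, data, number=10, repeat=3):
    """Return the best per-call time in seconds, or None if the round trip fails."""
    values, codes = f(data)
    if [values[c] for c in codes] != data:
        return None
    # timeit.repeat returns one total time per repetition of `number` calls.
    timings = timeit.repeat(lambda: f(data), number=number, repeat=repeat)
    return min(timings) / number


random.seed(0)  # fixed seed so repeated runs use the same input
test_list = [random.randint(1, 10) for _ in range(10000)]

for f in [decompose]:
    best = check_and_time(f, test_list)
    if best is None:
        print(f"-- {f.__name__}: test failed")
    else:
        print(f"-- {f.__name__}: {best * 1e3:.2f} ms per call (best of 3)")
```

Other candidate functions can be appended to the list in the final loop. The absolute numbers will differ from the %timeit figures, since the run counts and the reporting (best run versus mean and standard deviation) are not the same.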

Results:

-- decompose
Test passed
12.4 ms ± 269 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
-- decompose_boris1
Test passed
1.69 ms ± 21.9 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
-- decompose_boris2
Test passed
1.63 ms ± 18.6 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
-- decompose_alexander
Test passed
681 μs ± 2.15 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
-- decompose_stuart1
Test passed
1.7 ms ± 3.42 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
-- decompose_stuart2
Test passed
682 μs ± 5.98 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
-- decompose_dan1
Test passed
896 μs ± 19.5 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

I'm accepting Stuart's answer for being the simplest and one of the fastest.

Asked by foglerit; translated from Stack Overflow.

1 Reply

I would expect this to be more efficient than your code, even though it has to call index to look up each value in x, and I suspect it is the fastest way for most x without using numpy or pandas:

def decompose(x):
    values = sorted(set(x))
    return values, [values.index(v) for v in x] 

Representing values as a dictionary might bring some extra speed if needed, since each lookup then becomes a dictionary access instead of a linear search through values:

def decompose(x):
    values = sorted(set(x))
    d = {value: index for index, value in enumerate(values)}
    return values, [d[v] for v in x]
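
If values does not need to be sorted, the dictionary idea can be taken one step further and the whole decomposition done in a single pass. The sketch below is a variant not taken from the answer: the name decompose_first_seen is hypothetical, and it assumes the elements are hashable and that Python 3.7+ insertion-ordered dicts are available.

```python
def decompose_first_seen(x):
    """Hypothetical variant: values in order of first appearance, built in one pass."""
    index_of = {}  # maps each distinct value to its code
    codes = []
    for elem in x:
        # setdefault stores the next free code the first time elem is seen,
        # and returns the already-stored code on later occurrences.
        codes.append(index_of.setdefault(elem, len(index_of)))
    # Dicts keep insertion order, so the keys line up with the assigned codes.
    return list(index_of), codes


values, codes = decompose_first_seen(["b", "a", "b", "c"])
print(values)  # ['b', 'a', 'c']
print(codes)   # [0, 1, 0, 2]
assert [values[c] for c in codes] == ["b", "a", "b", "c"]
```

The trade-off is that values comes back in first-seen order rather than sorted order, so the codes differ from those of the sorted versions even though the round trip still holds.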
