I have a list of paired items and I'd like to convert it into a pandas DataFrame in which both items of a pair share the same number in the same column.
So something like this:
    [('A', 'B'),
     ('A', 'C'),
     ('B', 'D')]
is converted into:
       0  1
    A  2  1
    B  3  1
    C  2  0
    D  3  0
So the columns are ordered by decreasing number of pairs encoded, the result uses the fewest possible columns, and a 0 means the item has no pair in that column.
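For illustration, the layout described above can be sketched with a greedy pass: each pair goes into the first column where neither item is already used, and the columns are then reordered by pair count. This is only a sketch under assumptions, not a known pandas or numpy routine. The name `pairs_to_frame` is hypothetical, self-pairs such as `('A', 'A')` are assumed not to occur, and a greedy pass is not guaranteed to reach the fewest possible columns for every input (the task resembles edge coloring of a graph, where an optimal answer is hard to find in general).

```python
import pandas as pd

def pairs_to_frame(pairs):
    """Greedy sketch: put each pair in the first column where both items are free."""
    used = []    # used[c] is the set of items already paired in column c
    placed = []  # (column, pair id, a, b) for every pair
    for pair_id, (a, b) in enumerate(pairs, start=1):
        col = 0
        while col < len(used) and (a in used[col] or b in used[col]):
            col += 1
        if col == len(used):
            used.append(set())
        used[col].update((a, b))
        placed.append((col, pair_id, a, b))

    # Start from an all-zero frame so that 0 means "no pair in this column".
    items = sorted({x for pair in pairs for x in pair})
    df = pd.DataFrame(0, index=items, columns=range(len(used)))
    for col, pair_id, a, b in placed:
        df.loc[[a, b], col] = pair_id

    # Order columns by how many pairs they hold, largest first, then relabel.
    counts = [len(s) // 2 for s in used]  # assumes no self-pairs
    order = sorted(range(len(used)), key=lambda c: -counts[c])
    df = df[order]
    df.columns = range(len(order))
    return df

df = pairs_to_frame([("A", "B"), ("A", "C"), ("B", "D")])
print(df)
```

For the three pairs in the example, this reproduces the table shown above: column 0 holds pairs 2 and 3, and column 1 holds pair 1.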
Is there an algorithm, preferably something in numpy or pandas, that does this?
So far I've been unable to find anything with Google, but it's been a while since I took linear algebra, so I might simply have forgotten the right terms to search for.
I wrote the following (buggy) code to build the DataFrame, but for some reason it creates as many columns as there are pairs, which is not what I want.
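    import numpy as np
    import pandas as pd

    def create_df(ps):
        # One row per distinct item; no columns yet.
        df = pd.DataFrame(index=np.unique(ps))
        cnt = 1
        for p in ps:
            col = 0
            a, b = p
            # Look for the first column where neither item already has a pair.
            while col in df.columns and (df.at[a, col] != 0 or df.at[b, col] != 0):
                col += 1
            # Assigning to a missing column creates it; other rows become NaN.
            df.loc[a, col] = cnt
            df.loc[b, col] = cnt
            cnt += 1
        return df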
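One possible cause of the extra columns, offered as an observation rather than a confirmed diagnosis: when `df.loc[a, col] = cnt` creates a new column, the remaining rows are filled with NaN rather than 0, and `NaN != 0` evaluates to true, so the `while` test treats every untouched cell as occupied. The small check below illustrates this; the variable names are only for this example.

```python
import pandas as pd

df = pd.DataFrame(index=["A", "B", "C"])
df.loc["A", 0] = 1              # creates column 0; rows B and C are filled with NaN

empty_cell = df.at["B", 0]
is_nan = pd.isna(empty_cell)            # the untouched cell is NaN, not 0
looks_occupied = bool(empty_cell != 0)  # NaN != 0 is True, so the cell looks taken

print(is_nan, looks_occupied)
```

If that is the cause, initializing each new column with zeros, or testing cells with `pd.notna` instead of `!= 0`, would be one way to make the free-cell check behave as intended.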
The ultimate goal is to integrate the output into a data pipeline so I can use groupby in pandas to calculate statistics over the pairs.
Because of this, each pair must be defined in the same column, like in the example.
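To make the intended use concrete, here is a minimal sketch of grouping by one column of pair labels at a time. The `values` Series holds made-up measurements that are not part of the question, and `labels` simply hard-codes the example table; both names are assumptions for illustration.

```python
import pandas as pd

# Pair labels as in the example table (0 = no pair in that column).
labels = pd.DataFrame({0: [2, 3, 2, 3], 1: [1, 1, 0, 0]}, index=list("ABCD"))
# Hypothetical per-item measurements; not part of the original question.
values = pd.Series([1.0, 2.0, 3.0, 4.0], index=list("ABCD"))

pair_means = {}
for col in labels.columns:
    mask = labels[col] != 0  # skip items without a pair in this column
    pair_means[col] = values[mask].groupby(labels.loc[mask, col]).mean()

print(pair_means[0])
print(pair_means[1])
```

Because both members of a pair carry the same label within a single column, each group here contains exactly the two items of one pair.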
Asked by Rob Rose (translated from SO)