Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Join two pandas dataframes based on lists columns

I have two dataframes, each with a column of lists.
I would like to join them wherever the lists share two or more values. Example:

ColumnA ColumnB        | ColumnA ColumnB
id1     ['a','b','c']  | id3     ['a','b','c','x','y','z']
id2     ['a','d','e']  |

Here id1 matches id3 because their lists share two or more values. So the output would be (the column names are not important and are only for illustration):

    ColumnA1 ColumnB1       ColumnA2 ColumnB2
    id1      ['a','b','c']  id3      ['a','b','c','x','y','z']

How can I achieve this result? I've tried iterating over each row of dataframe #1, but that doesn't seem like a good idea.
Thank you!

question from:https://stackoverflow.com/questions/66060591/join-two-pandas-dataframes-based-on-lists-columns


1 Reply


If you are using pandas 1.2.0 or newer (released on December 26, 2020), the Cartesian product (cross join) of the two dataframes can be written in a single call:

    df = df1.merge(df2, how='cross')         # simplified cross join for pandas >= 1.2.0
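For pandas versions older than 1.2.0, where how='cross' is not available, a common workaround is to merge on a constant key. The following is a minimal sketch of that idea, not part of the original answer; the column name `_key` is a hypothetical temporary name, and the sample frames mirror the question's data.

```python
import pandas as pd

df1 = pd.DataFrame({'ColumnA1': ['id1', 'id2'],
                    'ColumnB1': [['a', 'b', 'c'], ['a', 'd', 'e']]})
df2 = pd.DataFrame({'ColumnA2': ['id3'],
                    'ColumnB2': [['a', 'b', 'c', 'x', 'y', 'z']]})

# Give every row the same temporary key, merge on it, then drop the key.
# '_key' is a hypothetical column name chosen for this sketch.
cross = (df1.assign(_key=1)
            .merge(df2.assign(_key=1), on='_key')
            .drop(columns='_key'))

print(cross)
```

Every row of df1 is paired with every row of df2, which is the same pairing that how='cross' produces on newer versions.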

Also, if execution time is a concern, it is advisable to compute the overlap with list(map(...)) instead of the slower apply(..., axis=1).

Using apply(..., axis=1):

    %%timeit
    df['overlap'] = df.apply(lambda x:
                             len(set(x['ColumnB1']).intersection(
                                 set(x['ColumnB2']))), axis=1)

    800 μs ± 59.1 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Using list(map(...)):

    %%timeit
    df['overlap'] = list(map(lambda x, y: len(set(x).intersection(set(y))), df['ColumnB1'], df['ColumnB2']))

    217 μs ± 25.5 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In this measurement, list(map(...)) is roughly 3.7 times faster (217 μs vs. 800 μs).
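The %%timeit magic only works in IPython or Jupyter. To repeat the comparison in a plain Python script, something like the sketch below could be used. The helper names `overlap_apply` and `overlap_map`, the 100-row sample frame, and the repeat count are assumptions made for illustration; absolute timings will differ from the figures quoted above depending on data size and machine.

```python
import timeit

import pandas as pd

# Hypothetical sample data: 100 rows of list pairs.
df = pd.DataFrame({
    'ColumnB1': [['a', 'b', 'c'], ['a', 'd', 'e']] * 50,
    'ColumnB2': [['a', 'b', 'c', 'x', 'y', 'z']] * 100,
})


def overlap_apply(frame):
    """Row-wise overlap count via DataFrame.apply(axis=1)."""
    return frame.apply(
        lambda row: len(set(row['ColumnB1']) & set(row['ColumnB2'])),
        axis=1,
    ).tolist()


def overlap_map(frame):
    """Same overlap count via the built-in map over the two columns."""
    return list(map(lambda x, y: len(set(x) & set(y)),
                    frame['ColumnB1'], frame['ColumnB2']))


# Both approaches should agree before their speed is compared.
assert overlap_apply(df) == overlap_map(df)

t_apply = timeit.timeit(lambda: overlap_apply(df), number=20)
t_map = timeit.timeit(lambda: overlap_map(df), number=20)
print(f"apply: {t_apply:.4f} s   map: {t_map:.4f} s   (20 runs each)")
```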

Complete code for reference:

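    import pandas as pd

    # First dataframe: ids with their lists of values
    data = {'ColumnA1': ['id1', 'id2'], 'ColumnB1': [['a', 'b', 'c'], ['a', 'd', 'e']]}
    df1 = pd.DataFrame(data)

    # Second dataframe
    data = {'ColumnA2': ['id3', 'id4'], 'ColumnB2': [['a','b','c','x','y', 'z'], ['d','e','f','p','q', 'r']]}
    df2 = pd.DataFrame(data)

    # Pair every row of df1 with every row of df2 (requires pandas >= 1.2.0)
    df = df1.merge(df2, how='cross')

    # Count the distinct values shared by the two lists in each row
    df['overlap'] = list(map(lambda x, y: len(set(x).intersection(set(y))), df['ColumnB1'], df['ColumnB2']))

    # Keep only the pairs that share at least 2 values
    df = df[df['overlap'] >= 2]
    print(df)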
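A cross join materialises every pair of rows, which can become expensive when both dataframes are large. As an alternative that is not part of the original answer, the sketch below explodes the list columns, joins on the individual values, and counts shared values per id pair. It assumes the ids in ColumnA1 and ColumnA2 are unique and that the list elements are hashable; the column name `value` is a hypothetical helper name.

```python
import pandas as pd

df1 = pd.DataFrame({'ColumnA1': ['id1', 'id2'],
                    'ColumnB1': [['a', 'b', 'c'], ['a', 'd', 'e']]})
df2 = pd.DataFrame({'ColumnA2': ['id3', 'id4'],
                    'ColumnB2': [['a', 'b', 'c', 'x', 'y', 'z'],
                                 ['d', 'e', 'f', 'p', 'q', 'r']]})

# One row per (id, list element). dropna discards the NaN rows that
# explode() creates for empty lists; drop_duplicates makes a value that
# is repeated inside one list count only once.
left = (df1.explode('ColumnB1')
           .rename(columns={'ColumnB1': 'value'})
           .dropna(subset=['value'])
           .drop_duplicates())
right = (df2.explode('ColumnB2')
            .rename(columns={'ColumnB2': 'value'})
            .dropna(subset=['value'])
            .drop_duplicates())

# Join on the shared element, then count shared elements per id pair.
pairs = (left.merge(right, on='value')
             .groupby(['ColumnA1', 'ColumnA2'])
             .size()
             .reset_index(name='overlap'))
pairs = pairs[pairs['overlap'] >= 2]

# Re-attach the original list columns to the matching id pairs.
result = pairs.merge(df1, on='ColumnA1').merge(df2, on='ColumnA2')
print(result)
```

Only id pairs that share at least one value ever appear in the intermediate merge, so pairs with no overlap are never built. Whether this is faster than the cross join depends on how long the lists are and how often values repeat across rows.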
