Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - How to find the correlation between a group of values in a pandas dataframe column

I have a dataframe df:

ID    Var1     Var2
1     1.2        4
1     2.1        6
1     3.0        7
2     1.3        8
2     2.1        9
2     3.2        13

I want to find the Pearson correlation coefficient between Var1 and Var2 for every ID.

So the result should look like this:

ID    Corr_Coef
1     0.98198
2     0.97073

Update:

Make sure all of the variable columns (here Var1 and Var2) have an int or float dtype before computing the correlation.
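
A minimal sketch of that dtype check, assuming the variables arrived as strings (for example after reading a poorly typed file). The `raw` frame below is a made-up stand-in, not the asker's actual loading code; `pd.to_numeric` with `errors='coerce'` turns anything unparseable into NaN instead of raising.

```python
import pandas as pd

# Hypothetical input: same values as in the question, but stored as strings.
raw = pd.DataFrame({
    'ID':   [1, 1, 1, 2, 2, 2],
    'Var1': ['1.2', '2.1', '3.0', '1.3', '2.1', '3.2'],
    'Var2': ['4', '6', '7', '8', '9', '13'],
})

df = raw.copy()
for col in ['Var1', 'Var2']:
    # Convert to a numeric dtype; unparseable entries become NaN.
    df[col] = pd.to_numeric(df[col], errors='coerce')

print(df.dtypes)
```

Checking `df.dtypes` before grouping makes it obvious when a column is still non-numeric.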



1 Reply


To get your desired output format you could use .corrwith:

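# Keep only Var1 and the grouping key, then correlate Var1 with Var2
# within each ID group (rows are matched on the DataFrame's index).
corrs = (df[['Var1', 'ID']]
         .groupby('ID')
         .corrwith(df.Var2)
         .rename(columns={'Var1': 'Corr_Coef'}))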

print(corrs)
    Corr_Coef
ID           
1     0.98198
2     0.97073
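
For comparison, here is a hedged sketch of the same computation written with `groupby(...).apply` and `Series.corr`, which may be easier to adapt if `.corrwith` on a groupby is not available in your pandas version. The frame is rebuilt from the question's data; the `Corr_Coef` column name follows the desired output, and everything else is illustrative.

```python
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    'ID':   [1, 1, 1, 2, 2, 2],
    'Var1': [1.2, 2.1, 3.0, 1.3, 2.1, 3.2],
    'Var2': [4, 6, 7, 8, 9, 13],
})

# Pearson correlation of Var1 and Var2 computed separately for each ID.
corrs = (df.groupby('ID')[['Var1', 'Var2']]
           .apply(lambda g: g['Var1'].corr(g['Var2']))
           .rename('Corr_Coef')
           .reset_index())

print(corrs.round(5))
```

This returns ID as an ordinary column rather than as the index; drop the `.reset_index()` call to keep it as the index.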

Generalized solution:

import numpy as np
import pandas as pd

def groupby_coef(df, col1, col2, on_index=True, squeeze=True, name='coef',
                 keys=None, **kwargs):
    """Grouped correlation coefficient between two columns

    Flat result structure in contrast to `groupby.corr()`.

    Parameters
    ==========
    df : DataFrame
    col1, col2 : str
        Columns for which to calculate correlation coefs
    on_index : bool, default True
        Specify whether you're grouping on index
    squeeze : bool, default True
        True -> Series; False -> DataFrame
    name : str, default 'coef'
        Name of the resulting Series, or of the DataFrame column
        if squeeze is False
    keys : column label or list of column labels / arrays
        Passed to `pd.DataFrame.set_index`; used only if on_index is False
    **kwargs :
        Passed to `pd.DataFrame.groupby`
    """

    # If we are grouping on something other than the index, then
    #     set as index first to avoid hierarchical result.
    # Kludgy, but safer than trying to infer.
    if not on_index:
        df = df.set_index(keys=keys)
        if not kwargs:
            # Assume we're grouping on 0th level of index
            kwargs = {'level': 0}
    # Correlate the two columns row by row within each group.
    # `corrwith` aligns on index labels; with repeated group labels in the
    #     index it pairs every row of a group with every other row, which
    #     drives the coefficient to ~0.
    grouped = df.groupby(**kwargs)[[col1, col2]]
    res = grouped.apply(lambda g: g.corr().iloc[0, 1]).rename(name)
    if not squeeze:
        res = res.to_frame()
    return res

Examples:

df_1 = pd.DataFrame(np.random.randn(10, 2),
                    index=[1]*5 + [2]*5).add_prefix('var')
df_2 = df_1.reset_index().rename(columns={'index': 'var2'})

# The data are random, so the coefficients differ from run to run.
# Grouping on the index returns a Series named 'coef':
print(groupby_coef(df_1, 'var0', 'var1', level=0))

# Grouping on a column returns the same values, indexed by 'var2':
print(groupby_coef(df_2, col1='var0', col2='var1',
                   on_index=False, keys='var2'))


...