Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Handle missing DataFrame columns in groupBy aggregation

I have a DataFrame with a LONG list of columns, some of which may not always exist depending on the data source, time of day, etc. I need to aggregate this data with min/max/avg, a percentage, and some counts, but any time I do this with a DataFrame that is missing a column, the entire aggregation fails with an error. Is there a way to elegantly handle missing columns, either by ignoring the error when a column is missing, or by creating any columns that don't exist inline?

df_aggs = df.groupby(['DeviceUUID', 'year', 'month', 'day', 'hour']).agg(
    DeviceName=('DeviceName', 'first'),
    DeviceType=('DeviceType', 'first'),

    V_LD_SEC_A_min=('V_LD_SEC_A', np.min),
    V_LD_SEC_A_avg=('V_LD_SEC_A', np.mean),
    V_LD_SEC_A_max=('V_LD_SEC_A', np.max),

    V_LD_SEC_B_min=('V_LD_SEC_B', np.min),
    V_LD_SEC_B_avg=('V_LD_SEC_B', np.mean),
    V_LD_SEC_B_max=('V_LD_SEC_B', np.max),

    # [many more columns ]

    X_DOG_A_count=('X_DOG_A', np.count_nonzero),
    X_DOG_B_count=('X_DOG_B', np.count_nonzero),
    X_DOG_C_count=('X_DOG_C', np.count_nonzero),
    X_DOG_count=('X_DOG', np.count_nonzero),
    X_NEU_LO_count=('X_NEU_LO', np.count_nonzero),

    CVR_X_ENGAGED_A_pct=('CVR_X_ENGAGED_A',
                         lambda x: (np.sum(x) / np.size(x)) * 100),
    CVR_X_ENGAGED_B_pct=('CVR_X_ENGAGED_B',
                         lambda x: (np.sum(x) / np.size(x)) * 100),
    CVR_X_ENGAGED_C_pct=('CVR_X_ENGAGED_C',
                         lambda x: (np.sum(x) / np.size(x)) * 100),
    CVR_X_ENGAGED_3PH_pct=('CVR_X_ENGAGED_3PH',
                           lambda x: (np.sum(x) / np.size(x)) * 100)
).reset_index(drop=True)

If, in this example, the column 'V_LD_SEC_B' is missing from df, the entire aggregation fails. What I'd like to get back is df_aggs with the missing column(s) added, with NaN as the value. Do I have to loop through the entire DataFrame, creating columns that don't exist, or can I create them inline in some way?
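A quick sketch of the "create missing columns inline" idea from the question: `DataFrame.reindex` over the union of the existing columns and an expected-column list adds any absent columns pre-filled with NaN, without touching the ones already there. The column names and data below are illustrative, not taken from the question's real data.

```python
import pandas as pd

# Toy frame missing 'V_LD_SEC_B'
df = pd.DataFrame({'DeviceUUID': ['u1', 'u1'],
                   'V_LD_SEC_A': [1.0, 2.0]})

# Columns the aggregation expects (illustrative subset)
expected = ['V_LD_SEC_A', 'V_LD_SEC_B']

# Add any missing expected columns, filled with NaN
df = df.reindex(columns=df.columns.union(expected, sort=False))

print(df)
```

After this, the full `.agg(...)` call can run unchanged; min/max/avg of an all-NaN column simply comes back as NaN.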



1 Reply


Named aggregation allows for a variety of syntaxes. In this case it's much better to store the aggregations in a dictionary and then unpack it with ** when applying them to the DataFrame.

This lets us restrict the aggregations to the columns that actually exist, and then reindex at the end so every requested output column appears in the result, whether or not its source column was present for the aggregation. Here's an example where the DataFrame is missing a 'val2' column that we may want to aggregate:

import pandas as pd
import numpy as np
df = pd.DataFrame({'gp': list('abbcc'),
                   'val1': [1,2,3,4,5],
                   'val3': [2,4,6,8,10]})

# Store aggregations in a dict using output col names as keys, NamedAgg as values
aggs = {'val1_max': pd.NamedAgg(column='val1', aggfunc=np.max),
        'val1_min': pd.NamedAgg(column='val1', aggfunc=np.min),
        'val2_sum': pd.NamedAgg(column='val2', aggfunc=np.sum),
        'val3_sum': pd.NamedAgg(column='val3', aggfunc=np.sum)}


# Apply only the aggregations whose source column exists (checking the
# `column` attribute of each NamedAgg), then reindex to every output we want
(df.groupby('gp')
   .agg(**{k:v for k,v in aggs.items() if v.column in df.columns})
   .reindex(aggs.keys(), axis=1)
)

    val1_max  val1_min  val2_sum  val3_sum
gp                                         
a          1         1       NaN         2
b          3         2       NaN        10
c          5         4       NaN        18
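The same pattern works with the lambda-based percentage aggregations from the question, since `pd.NamedAgg` accepts any callable as `aggfunc`. A minimal sketch with hypothetical flag columns, where 'flag2' is absent from the data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'gp': list('abb'),
                   'flag1': [1, 0, 1]})   # note: no 'flag2' column

# Percentage-style aggregations, keyed by output column name
aggs = {'flag1_pct': pd.NamedAgg('flag1', lambda x: np.sum(x) / np.size(x) * 100),
        'flag2_pct': pd.NamedAgg('flag2', lambda x: np.sum(x) / np.size(x) * 100)}

# Apply only the aggregations whose source column exists,
# then reindex so missing outputs come back as NaN
out = (df.groupby('gp')
         .agg(**{k: v for k, v in aggs.items() if v.column in df.columns})
         .reindex(aggs.keys(), axis=1))
print(out)
```

Here `flag1_pct` is computed per group while `flag2_pct` shows up as an all-NaN column, which matches the behavior the question asks for.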
