python - Handle missing DataFrame columns in groupBy aggregation

Question

Welcome To Ask or Share your Answers For Others

python - Handle missing DataFrame columns in groupBy aggregation

posted Feb 19, 2021 in Technique[技术] by 深蓝 (71.8m points)

python - Handle missing DataFrame columns in groupBy aggregation

I have a dataframe with a LONG list of columns, some of which may not always exist depending on the data source, time of day, etc. I need to aggregate this data with min/max/avg, pct, and some counts, but any time I do this with a data frame that is missing a column, then entire aggregation function fails with an error. Is there a way to eloquently handle missing columns, by ignoring the error if it's a missing column, or maybe by creating any columns that don't exist inline?

df_aggs = df.groupby(['DeviceUUID', 'year', 'month', 'day', 'hour']).agg(

DeviceName=('DeviceName', 'first'),
DeviceType=('DeviceType', 'first'),

V_LD_SEC_A_min=('V_LD_SEC_A', np.min),
V_LD_SEC_A_avg=('V_LD_SEC_A', np.mean),
V_LD_SEC_A_max=('V_LD_SEC_A', np.max),

V_LD_SEC_B_min=('V_LD_SEC_B', np.min),
V_LD_SEC_B_avg=('V_LD_SEC_B', np.mean),
V_LD_SEC_B_max=('V_LD_SEC_B', np.max),

[many more columns ]

X_DOG_A_count=('X_DOG_A', np.count_nonzero),
X_DOG_B_count=('X_DOG_B', np.count_nonzero),
X_DOG_C_count=('X_DOG_C', np.count_nonzero),
X_DOG_count=('X_DOG', np.count_nonzero),
X_NEU_LO_count=('X_NEU_LO', np.count_nonzero),

CVR_X_ENGAGED_A_pct=('CVR_X_ENGAGED_A', lambda x: (
    np.sum(x) / np.size(x))*100),
CVR_X_ENGAGED_B_pct=('CVR_X_ENGAGED_B', lambda x: (
    np.sum(x) / np.size(x))*100),
CVR_X_ENGAGED_C_pct=('CVR_X_ENGAGED_C', lambda x: (
    np.sum(x) / np.size(x))*100),
CVR_X_ENGAGED_3PH_pct=('CVR_X_ENGAGED_3PH',
                       lambda x: (np.sum(x) / np.size(x))*100)
).reset_index(drop=True)

If, in this example, the column 'V_LD_SEC_B' is missing from df, this entire aggregation function fails. What I'd like to get back is the df_agg with the missing column(s) added, with NaN as the value. Do I have to loop through the entire data frame, creating columns that don't exist, or can I inline their creation in some way?

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-02-19T04:08:27+0000

Named Aggregations allow for a variety of different syntaxes. In this case it's much better to work with the dictionary format, and then unpack to apply the aggregations to the DataFrame.

This allows us to check for the intersection between existing columns and the aggregations you want to apply and then reindex to everything, regardless of if it was there for the aggregation, in the end. Here's an example where the DataFrame is missing a 'val2' column that we may want to aggregate

import pandas as pd
import numpy as np
df = pd.DataFrame({'gp': list('abbcc'),
                   'val1': [1,2,3,4,5],
                   'val3': [2,4,6,8,10]})

# Store aggregations in a dict using output col names as keys, NamedAgg as values
aggs = {'val1_max': pd.NamedAgg(column='val1', aggfunc=np.max),
        'val1_min': pd.NamedAgg(column='val1', aggfunc=np.min),
        'val2_sum': pd.NamedAgg(column='val2', aggfunc=np.sum),
        'val3_sum': pd.NamedAgg(column='val3', aggfunc=np.sum)}


# Apply only aggregations we can, checking the column of the `NamedAgg` 
# reindex to everything we want in the end
(df.groupby('gp')
   .agg(**{k:v for k,v in aggs.items() if v.column in df.columns})
   .reindex(aggs.keys(), axis=1)
)

    val1_max  val1_min  val2_sum  val3_sum
gp                                         
a          1         1       NaN         2
b          3         2       NaN        10
c          5         4       NaN        18

Categories

python - Handle missing DataFrame columns in groupBy aggregation

python - Handle missing DataFrame columns in groupBy aggregation

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags