I have a dataframe with a LONG list of columns, some of which may not always exist depending on the data source, time of day, etc. I need to aggregate this data with min/max/avg, percentages, and some counts, but any time I do this with a dataframe that is missing a column, the entire aggregation fails with an error.
Is there a way to handle missing columns elegantly, either by ignoring the error when it's caused by a missing column, or by creating any columns that don't exist inline? Here is the aggregation I'm running:
import numpy as np

df_aggs = df.groupby(['DeviceUUID', 'year', 'month', 'day', 'hour']).agg(
    DeviceName=('DeviceName', 'first'),
    DeviceType=('DeviceType', 'first'),
    V_LD_SEC_A_min=('V_LD_SEC_A', np.min),
    V_LD_SEC_A_avg=('V_LD_SEC_A', np.mean),
    V_LD_SEC_A_max=('V_LD_SEC_A', np.max),
    V_LD_SEC_B_min=('V_LD_SEC_B', np.min),
    V_LD_SEC_B_avg=('V_LD_SEC_B', np.mean),
    V_LD_SEC_B_max=('V_LD_SEC_B', np.max),
    # [many more columns]
    X_DOG_A_count=('X_DOG_A', np.count_nonzero),
    X_DOG_B_count=('X_DOG_B', np.count_nonzero),
    X_DOG_C_count=('X_DOG_C', np.count_nonzero),
    X_DOG_count=('X_DOG', np.count_nonzero),
    X_NEU_LO_count=('X_NEU_LO', np.count_nonzero),
    CVR_X_ENGAGED_A_pct=('CVR_X_ENGAGED_A', lambda x: (np.sum(x) / np.size(x)) * 100),
    CVR_X_ENGAGED_B_pct=('CVR_X_ENGAGED_B', lambda x: (np.sum(x) / np.size(x)) * 100),
    CVR_X_ENGAGED_C_pct=('CVR_X_ENGAGED_C', lambda x: (np.sum(x) / np.size(x)) * 100),
    CVR_X_ENGAGED_3PH_pct=('CVR_X_ENGAGED_3PH', lambda x: (np.sum(x) / np.size(x)) * 100),
).reset_index(drop=True)
If, in this example, the column 'V_LD_SEC_B' is missing from df, the entire aggregation fails with a KeyError. What I'd like to get back is df_aggs with the missing column(s) added, with NaN as the value. Do I have to loop over my full list of expected columns, creating each one that doesn't exist before aggregating, or can I create them inline in some way?
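For concreteness, a pre-pass like this is roughly what I mean by creating the columns "inline" (expected_cols is a hypothetical list of every input column the aggregation references):

import numpy as np

# Hypothetical list of every input column the aggregation references.
expected_cols = ['V_LD_SEC_A', 'V_LD_SEC_B', 'X_DOG_A', 'X_NEU_LO']  # etc.

# reindex keeps existing columns and adds any missing ones filled with NaN.
df = df.reindex(columns=df.columns.union(expected_cols, sort=False))

That would get past the error, though I'd have to be careful with the count aggregations, since np.count_nonzero treats NaN as nonzero.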
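The only alternative I can picture is building the spec as a plain dict, filtering it to the input columns that actually exist, and reindexing the result to bring the skipped outputs back as NaN (agg_spec and present are hypothetical names, and the spec is abbreviated):

import numpy as np

# Hypothetical: the full aggregation spec, keyed by output column name.
agg_spec = {
    'DeviceName': ('DeviceName', 'first'),
    'V_LD_SEC_A_min': ('V_LD_SEC_A', np.min),
    'V_LD_SEC_B_min': ('V_LD_SEC_B', np.min),
    # ... many more ...
}

# Keep only the aggregations whose input column is actually present.
present = {out: spec for out, spec in agg_spec.items() if spec[0] in df.columns}

df_aggs = (
    df.groupby(['DeviceUUID', 'year', 'month', 'day', 'hour'])
      .agg(**present)
      .reset_index(drop=True)
      .reindex(columns=list(agg_spec))  # skipped outputs come back as NaN
)

Is there something more idiomatic than either of these?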