Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
154 views
in Technique[技术] by (71.8m points)

python - Max Value in N days before end of week/month/quarter

I have a dataframe df which contains daily data for many ids, sample:

| yyyy_mm_dd | id   | availability |
|------------|------|--------------|
| 2020-01-01 | 1334 | 300          |
| 2020-01-02 | 1334 | 676          |
| 2020-01-03 | 1334 | 463          |
| ...        | 1334 | ...          |
| 2020-03-29 | 1334 | 564          |
| 2020-03-30 | 1334 | 765          |
| 2020-03-31 | 1334 | 7564         |
| ...        | 1334 | ...          |
| 2020-06-28 | 1334 | 4634         |
| 2020-06-29 | 1334 | 65           |
| 2020-06-30 | 1334 | 643          |
| ...        | 1334 | ...          |

How could I identify the maximum availability in the last N days prior to end of a quarter? I.e. if N = 3 then I would like to find the maximum availability per id, in the last 3 days to the prior of each quarter end.

I know I could hardcode the dates but I would like to avoid that if possible. Is this something that windowing could help with?

Expected output based on above input:

| id   | year | quarter | max_availability |
|------|------|---------|------------------|
| 1334 | 2020 | 01      | 7564             |
| 1334 | 2020 | 02      | 4634             |
question from:https://stackoverflow.com/questions/66063623/max-value-in-n-days-before-end-of-week-month-quarter

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

You can try this code below, which gets the last 10 rows before the end of a quarter and calculates the maximum availability grouped by id.

from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    'rn',
    F.row_number().over(
        Window.partitionBy(
            'id',
            F.quarter('yyyy_mm_dd'), 
            F.year('yyyy_mm_dd')
        ).orderBy(F.desc('yyyy_mm_dd'))
    )
).filter('rn <= 10').groupBy(
    'id',
    F.quarter('yyyy_mm_dd').alias('quarter'), 
    F.year('yyyy_mm_dd').alias('year')
).agg(
    F.max('availability').alias('max_availability')
)

df2.show()
+----+-------+----+----------------+
|  id|quarter|year|max_availability|
+----+-------+----+----------------+
|1334|      1|2020|            7564|
|1334|      2|2020|            4634|
+----+-------+----+----------------+

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...