You can try the code below. It ranks each row within its (id, quarter, year) partition by date descending, keeps the 10 most recent rows of each quarter, and then computes the maximum availability grouped by id, quarter, and year.
from pyspark.sql import functions as F, Window
# Rank rows within each (id, quarter, year) partition, most recent date first
df2 = df.withColumn(
    'rn',
    F.row_number().over(
        Window.partitionBy(
            'id',
            F.quarter('yyyy_mm_dd'),
            F.year('yyyy_mm_dd')
        ).orderBy(F.desc('yyyy_mm_dd'))
    )
# Keep the 10 most recent rows per quarter, then aggregate
).filter('rn <= 10').groupBy(
    'id',
    F.quarter('yyyy_mm_dd').alias('quarter'),
    F.year('yyyy_mm_dd').alias('year')
).agg(
    F.max('availability').alias('max_availability')
)
df2.show()
+----+-------+----+----------------+
| id|quarter|year|max_availability|
+----+-------+----+----------------+
|1334| 1|2020| 7564|
|1334| 2|2020| 4634|
+----+-------+----+----------------+
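If you want to test this without your real data, here is a minimal sketch of a toy DataFrame with the same column names (id, yyyy_mm_dd, availability); the values and the string date format are assumptions, since your actual schema isn't shown:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample rows -- two quarters for one id
sample_df = spark.createDataFrame(
    [
        (1334, '2020-03-25', 7564),
        (1334, '2020-03-30', 5120),
        (1334, '2020-06-28', 4634),
        (1334, '2020-06-29', 4100),
    ],
    ['id', 'yyyy_mm_dd', 'availability'],
).withColumn('yyyy_mm_dd', F.to_date('yyyy_mm_dd'))

Running the snippet above against sample_df should yield one row per (id, quarter, year) with its max_availability.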