You can use groupby
and aggregate using collect_list
and array
:
import pyspark.sql.functions as F
df2 = df.groupBy('id').agg(F.collect_list(F.array('number1', 'number2')).alias('number'))
df2.show()
+---+----------------+
| id| number|
+---+----------------+
| b|[[3, 3], [4, 4]]|
| a|[[1, 1], [2, 2]]|
+---+----------------+
And if you want to get back a list of tuples,
result = [[tuple(j) for j in i] for i in [r[0] for r in df2.select('number').orderBy('number').collect()]]
which gives result
as [[(1, 1), (2, 2)], [(3, 3), (4, 4)]]
If you want a numpy array, you can do
result = np.array([r[0] for r in df2.select('number').collect()])
which gives
array([[[3, 3],
[4, 4]],
[[1, 1],
[2, 2]]])
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…