
python - Filtering a column with an empty array in PySpark

I have a DataFrame that contains a lot of repeated values. An aggregated, distinct count looks like this:

df.groupby('fruits').count().sort(F.desc('count')).show()


| fruits         | count |
| -------------- | ----- |
| [Apples]       | 123   |
| []             | 344   |
| [Apples, plum] | 444   |

My goal is to filter all rows where the value is either [Apples] or [].

Surprisingly, the following works for a non-empty array, but for an empty one it doesn't:

import pyspark.sql.functions as F
import pyspark.sql.types as T

is_apples = F.udf(lambda arr: arr == ['Apples'], T.BooleanType())
df.filter(is_apples(df.fruits)).count()  # Works! Shows 123 correctly.

is_empty = F.udf(lambda arr: arr == [], T.BooleanType())
df.filter(is_empty(df.fruits)).count()  # Doesn't work! Should show 344 but shows zero.

Any idea what I am doing wrong?

Question from: https://stackoverflow.com/questions/65662265/filtering-a-column-with-an-empty-array-in-pyspark


1 Reply


It might be an array containing an empty string:

is_empty = F.udf(lambda arr: arr == [''], T.BooleanType())

Or it might be an array of null:

is_empty = F.udf(lambda arr: arr == [None], T.BooleanType())
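
To see which case you actually have, one quick diagnostic (my own sketch, not part of the original answer) is to count the elements in each distinct value: a truly empty array has size 0, while [''] and [None] each have size 1.

import pyspark.sql.functions as F

# Distinct array values alongside their element counts:
# [] -> 0, [''] -> 1, [None] -> 1, ['Apples'] -> 1, ...
df.select("fruits", F.size("fruits").alias("n_elems")).distinct().show(truncate=False)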

To check them all at once you can use:

is_empty = F.udf(lambda arr: arr in [[], [''], [None]], T.BooleanType())
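
Applied to the question's data, the combined check should then return the expected count:

df.filter(is_empty(df.fruits)).count()  # should report 344 instead of 0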

But you don't actually need a UDF for this; for example, you can do:

df.filter("fruits = array() or fruits = array('') or fruits = array(null)")
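
For reference, here is a minimal self-contained reproduction (the toy rows are an assumption, not the asker's actual data) showing that this filter catches all three "empty-looking" variants:

from pyspark.sql import SparkSession
import pyspark.sql.types as T

spark = SparkSession.builder.master("local[*]").getOrCreate()

schema = T.StructType([T.StructField("fruits", T.ArrayType(T.StringType()))])
df = spark.createDataFrame(
    [(["Apples"],), ([],), ([""],), ([None],), (["Apples", "plum"],)],
    schema,
)

# Keeps [], [''] and [None]; drops ['Apples'] and ['Apples', 'plum'].
df.filter("fruits = array() or fruits = array('') or fruits = array(null)").show()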
