apache spark - pyspark string matching - pick first match

I have two tables.

comment_df

| Date | Comment | 
|:---- |:------:| 
| 20/01/2020 | Transfer from Euro Account to HSBC account done on Monday but AMEX payment was on Tue. |
| 20/01/2020 | Brian initiated a Transfer from Euro Account to Natwest last Tuesday. |
| 21/01/2020 | AMEX payment to Natwest was delayed for second time in a row. |
| 21/01/2020 | AMEX receipts from Euro Account delayed. |

code_df

| Tag | Comment | 
|:---- |:------:| 
| EURO | Euro Account to HSBC |
| Natwest | Euro Account to Natwest |
| AMEX | AMEX payment |

The desired table

| Date | Comment | Tag |
|:---- |:------:| ----:|
| 20/01/2020 | Transfer from Euro Account to HSBC account done on Monday but AMEX payment was on Tue.| EURO |
| 20/01/2020 | Brian initiated a Transfer from Euro Account to Natwest last Tuesday. | Natwest |
| 21/01/2020 | AMEX payment to Natwest was delayed for second time in a row. | AMEX | 
| 21/01/2020 | AMEX receipts from Euro Account delayed. | |
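For reference, the two input DataFrames can be created like this. This is only a minimal sketch to make the example reproducible; it assumes an active SparkSession bound to a variable named spark, and the data is copied from the tables above.

# Sketch: assumes an existing SparkSession named `spark`
comment_df = spark.createDataFrame(
    [
        ('20/01/2020', 'Transfer from Euro Account to HSBC account done on Monday but AMEX payment was on Tue.'),
        ('20/01/2020', 'Brian initiated a Transfer from Euro Account to Natwest last Tuesday.'),
        ('21/01/2020', 'AMEX payment to Natwest was delayed for second time in a row.'),
        ('21/01/2020', 'AMEX receipts from Euro Account delayed.'),
    ],
    ['Date', 'Comment']
)

code_df = spark.createDataFrame(
    [
        ('EURO', 'Euro Account to HSBC'),
        ('Natwest', 'Euro Account to Natwest'),
        ('AMEX', 'AMEX payment'),
    ],
    ['Tag', 'Comment']
)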

So the first comment matches two tags (Euro Account to HSBC and AMEX payment), but I only want the tag whose phrase appears first in the comment, without duplicating the row. Below is what was previously suggested.

# Rename to avoid a name clash with comment_df.Comment in the join
code_df = code_df.withColumnRenamed('Comment', 'Commentcode')

result = comment_df.join(code_df, comment_df.Comment.contains(code_df.Commentcode), 'left').drop('Commentcode')

result.show(truncate=False)

+----------+--------------------------------------------------------------------------------------+-------+
|Date      |Comment                                                                               |Tag    |
+----------+--------------------------------------------------------------------------------------+-------+
|20/01/2020|Transfer from Euro Account to HSBC account done on Monday but AMEX payment was on Tue.|EURO   |
|20/01/2020|Brian initiated a Transfer from Euro Account to Natwest last Tuesday.                 |Natwest|
|21/01/2020|AMEX payment to Natwest was delayed for second time in a row.                         |AMEX   |
|21/01/2020|AMEX receipts from Euro Account delayed.                                              |null   |
+----------+--------------------------------------------------------------------------------------+-------+


1 Reply


You can order the matches by the position of the matched string within the comment, then keep only the first match by filtering on the row number.

from pyspark.sql import functions as F, Window

# code_df.Comment is assumed to have been renamed to Commentcode, as in the question
result = comment_df.join(
    code_df,
    comment_df.Comment.contains(code_df.Commentcode),
    'left'
).withColumn(
    'rn',
    # rank each comment's matches by where the matched phrase first appears (instr position)
    F.row_number().over(
        Window.partitionBy('Date', 'Comment')
              .orderBy(F.expr('instr(Comment, Commentcode)'))
    )
).filter('rn = 1')
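
If you want the output to match the desired table exactly, you can drop the helper columns afterwards. A small follow-up sketch; Commentcode comes from the rename in the question and rn from the window above:

# remove the join/window helper columns before showing the result
result = result.drop('rn', 'Commentcode')
result.show(truncate=False)

For the sample data this keeps EURO for the first comment, because 'Euro Account to HSBC' occurs earlier in that string than 'AMEX payment', while the unmatched fourth comment keeps a null Tag from the left join.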
