Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
425 views
in Technique[技术] by (71.8m points)

python - Parsing complicated list of strings using regex, loops, enumerate, to produce a pandas dataframe

I have a long list of many elements, each element is a string. See below sample:

data = ['BAT.A.100',   'Regulation 2020-1233',   'this is the core text of', 'the regulation referenced ',
'MOC to BAT.A.100', 'this', 'is', 'one method of demonstrating compliance to BAT.A.100', 

 'BAT.A.120',   'Regulation 2020-1599',   'core text of the regulation ...', ' more free text','more free text', 

 'BAT.A.145',   'Regulation 2019-3333',   'core text of' ,'the regulation1111',
'MOC to BAT.A.145', 'here is how you can show compliance to BAT.A.145','more free text', 
'MOC2 to BAT.A.145', ' here is yet another way of achieving compliance']

My desired output is ultimately a Pandas DataFrame as follows:

enter image description here


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

As the strings may have to be concatenated, I have firstly joining all the elements to single string using ## to separate the text which have been joined. I am going for all regex because there would be lot of conditions to check otherwise.

re_req = re.compile(r'##(?P<Short_ref>BAT.A.d{3})'
                    r'##(?P<Full_Reg_ref>Regulationsd{4}-d{4})'
                    r'##(?P<Reg_text>.*?MOC to 1|.*?(?=##BAT.A.d{3})(?!1))'
                    r'(?:##)?(?:(?P<Moc_text>.*?MOC2 to 1)(?P<MOC2>(?:##)?.*?(?=##BAT.A.d{3})(?!1)|.+)'
                    r'|(?P<Moc_text_temp>.*?(?=##BAT.A.d{3})(?!1)))')

final_list = []
for match in re_req.finditer("##" + "##".join(data)):
    inner_list = [match.group('Short_ref').replace("##", " "),
                  match.group('Full_Reg_ref').replace("##", " "),
                  match.group('Reg_text').replace("##", " ")]
    if match.group('Moc_text_temp'): # just Moc_text is present
        inner_list += [match.group('Moc_text_temp').replace("##", " "), ""]
    elif match.group('Moc_text') and match.group('MOC2'): # both Mock_text and MOC2 is present
        inner_list += [match.group('Moc_text').replace("##", " "), match.group('MOC2').replace("##", " ")]
    else: # neither Moc_text nor MOC2 is present
        inner_list += ["", ""]
    final_list.append(inner_list)
final_df = pd.DataFrame(final_list, columns=['Short_ref', 'Full_Reg_ref', 'Reg_text', 'Moc_text', 'MOC2'])

First and second line of regex is same as which you posted earlier and identifies the first two columns.

In third line of regex, r'##(?P<Reg_text>.*?MOC to 1|.*?(?=##BAT.A.d{3})(?!1))' - matches all text till MOC to Short_ref or matches all the text before the next Reg_text. (?=##BAT.A.d{3})(?!1) part is to taking the text upto Short_ref pattern and if the Short_ref is not the current Reg_text.

Fourth line is for when Moc_text and MOC2 both is present and it is or with fifth line for the case when just Moc_text is present. This part of the regex is similar to the third line.

Last looping over all the matches using finditer and constructing the rows of the dataframe final_df:

enter image description here


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...