Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Python - extracting hyphen-separated key-value pairs whose values span multiple lines

Input text file

LID - E164 [pii]
LID - 10.3390/antiox9020164 [doi]
AB  - Although prickly pear fruits have become an important part of the Canary diet,
      their native varieties are yet to be characterized in terms of betalains and
      phenolic compounds.
FAU - Gomez-Maqueo, Andrea
AU  - Gomez-Maqueo A
AUID- ORCID: 0000-0002-0579-1855

PG  - 1-13
LID - 10.1007/s00442-020-04624-w [doi]
AB  - Recent observational evidence suggests that nighttime temperatures are increasing
      faster than daytime temperatures, while in some regions precipitation events are 
      becoming less frequent and more intense.
CI  - (c) 2020 Production and hosting by Elsevier B.V. on behalf of Cairo University.
FAU - Farag, Mohamed A
AU  - Farag MA

PG  - 3044
LID - 10.3389/fmicb.2019.03044 [doi]
AB  - Microbial symbionts account for survival, development, fitness and evolution of
      eukaryotic hosts. These microorganisms together with their host form a biological
      unit known as holobiont.

AU  - Flores-Nunez VM
AD  - Departamento de Ingenieria Genetica, Centro de Investigacion y de Estudios
      Avanzados del Instituto Politecnico Nacional, Irapuato, Mexico.

I'm trying to extract the abstracts, denoted by the key AB, from the text. I iterate through each line and check whether its key is that of the abstract. If so, I set a flag and append the subsequent indented lines, separated by a space. Is there a better way to do this?

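f = "sample.txt"

abstracts = []
req_line = ""
flag = False

with open(f) as myfile:
    for line in myfile:

        # append subsequent lines if flag is set
        if flag:
            if line.startswith("      "):
                req_line = req_line + " " + line.strip()
            else:
                # any other line ends the current abstract
                abstracts.append(req_line)
                req_line = ""
                flag = False

        # find beginning of abstract
        if line.startswith("AB  - "):
            # strip() drops the trailing newline that file iteration keeps
            req_line = line.replace("AB  - ", "", 1).strip()
            flag = True

# flush an abstract that runs up to the end of the file
if flag:
    abstracts.append(req_line)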

Output:

[
"Although prickly pear fruits have become an important part of the Canary diet, their native varieties are yet to be characterized in terms of betalains and phenolic compounds.",
"Recent observational evidence suggests that nighttime temperatures are increasing faster than daytime temperatures, while in some regions precipitation events are becoming less frequent and more intense.",
"Microbial symbionts account for survival, development, fitness and evolution of eukaryotic hosts. These microorganisms together with their host form a biological unit known as holobiont."
]
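The flag idea described above can also be written as a small generator that collects every key, not only AB. This is a minimal sketch, not a definitive parser: it assumes the key sits in the first four columns, the hyphen in the fifth, and that continuation lines start with six spaces, as in the sample input. The name iter_fields and the shortened sample are made up for illustration.

```python
def iter_fields(lines):
    """Yield (key, value) pairs, joining indented continuation lines."""
    key, parts = None, []
    for line in lines:
        # Assumed layout: key in columns 0-3, "-" in column 4, value from column 6.
        if len(line) > 4 and line[4] == "-" and line[:4].strip():
            if key is not None:
                yield key, " ".join(parts)
            key, parts = line[:4].strip(), [line[6:].strip()]
        elif line.startswith("      ") and key is not None:
            # continuation of the current value
            parts.append(line.strip())
        elif key is not None:
            # a blank or unrecognised line closes the current field
            yield key, " ".join(parts)
            key, parts = None, []
    if key is not None:
        # flush the last field at end of input
        yield key, " ".join(parts)


sample = """\
LID - 10.3390/antiox9020164 [doi]
AB  - Although prickly pear fruits have become an important part of the Canary diet,
      their native varieties are yet to be characterized in terms of betalains and
      phenolic compounds.
FAU - Gomez-Maqueo, Andrea
AUID- ORCID: 0000-0002-0579-1855
"""

fields = list(iter_fields(sample.splitlines()))
abstracts = [value for key, value in fields if key == "AB"]
```

Passing an open file object instead of sample.splitlines() should work the same way, since each value is stripped before it is stored.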
Question from: https://stackoverflow.com/questions/65661339/extracting-hyphen-separated-key-value-pair-with-value-in-multiple-lines

1 Reply

Do it with regex (assuming your input string is s read via open("file.txt").read()):

import re
matches = re.findall(r"AB\W*-\W*([^-]*(?=\n))", s)
output = [" ".join(map(str.strip, i.split("\n"))) for i in matches]

gives

['Although prickly pear fruits have become an important part of the Canary diet, their native varieties are yet to be characterized in terms of betalains and phenolic compounds.',
 'Recent observational evidence suggests that nighttime temperatures are increasing faster than daytime temperatures, while in some regions precipitation events are becoming less frequent and more intense.',
 'Microbial symbionts account for survival, development, fitness and evolution of eukaryotic hosts. These microorganisms together with their host form a biological unit known as holobiont.']
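One caveat: because the capture group uses [^-]*, an abstract that itself contains a hyphen would likely be truncated or missed. A possible alternative, sketched here under the assumption that every abstract starts a line with "AB  - " and that continuation lines are indented by six spaces, anchors the pattern to line starts instead. The names ABSTRACT_RE and extract_abstracts, and the sample text with a hyphenated word, are made up for illustration.

```python
import re

# Assumed layout: "AB  - " starts a line; continuation lines start with six spaces.
ABSTRACT_RE = re.compile(r"^AB  - (.*(?:\n {6}.*)*)", re.MULTILINE)


def extract_abstracts(text):
    # " ".join(block.split()) collapses each line break plus indentation into one space.
    return [" ".join(block.split()) for block in ABSTRACT_RE.findall(text)]


# Made-up sample containing a hyphen inside the abstract.
sample = (
    "PG  - 1-13\n"
    "AB  - Night-time temperatures are increasing\n"
    "      faster than daytime temperatures.\n"
    "FAU - Farag, Mohamed A\n"
)
result = extract_abstracts(sample)
```

Here result holds a single string with both lines of the abstract joined by one space, and the hyphen in the first line does not cut the match short.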
