Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - Extracting HTML between tags

I want to extract all HTML between specific HTML tags.

<html>
<div class="class1">Included Text</div>
[...]
<h1><b>text</b></h1><span>[..]</span><div>[...]</div>
[...]
<span class="class2">
[...]</span>

So I want to grab all HTML (tags and values) between the class1 div and the class2 span:

Included Text
<h1><b>text</b></h1><span>[..]</span><div>[...]</div>

Also there are multiple occurrences within the HTML file so I want to match them all. Here is what I mean:

<html>
(first occurrence)
<div class="class1">Included Text</div>
[...]
<h1><b>text</b></h1><span>[..]</span><div>[...]</div>
[...]
<span class="class2">
[...]</span>

(2nd occurrence)
<div class="class1">Included Text</div>
[...]
<h1><b>text</b></h1><span>[..]</span><div>[...]</div>
[...]
<span class="class2">
[...]</span>  

(third occurrence)
<div class="class1">Included Text</div>
[...]
<h1><b>text</b></h1><span>[..]</span><div>[...]</div>
[...]
<span class="class2">
[...]</span>  
</html>

I've been searching for answers using BeautifulSoup 4. However, all the questions and answers I found are about extracting the text values between tags, which is not what I want. I was also wondering whether this is even possible with BeautifulSoup or whether I must use regex instead.


1 Reply


You can roll your own function using bs4 and itertools.takewhile:

h  = """<html>
 <div class="class1">Included Text</div>
[...]
<h1><b>text</b></h1><span>[..]</span><div>[...]</div>
[...]
<span class="class2">
[...]</span>"""

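from itertools import takewhile
from bs4 import BeautifulSoup

# Name the parser explicitly; "html.parser" also works for this input.
soup = BeautifulSoup(h, "lxml")

def get_html_between(start_select, end_tag, cls):
    start = soup.select_one(start_select)
    all_next = start.find_all_next()
    # Inner HTML of the start tag; decode_contents() also copes with nested tags.
    yield start.decode_contents()
    # tag.name is the element name; tag.get("name") would read a name="..." attribute.
    for t in takewhile(lambda tag: not (tag.name == end_tag and tag.get("class") == [cls]), all_next):
        yield t

for ele in get_html_between("div.class1", "span", "class2"):
    print(ele)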

Output:

Included Text
<h1><b>text</b></h1>
<b>text</b>
<span>[..]</span>
<div>[...]</div>

To make it a little more flexible, you can pass in the start tag and a cond lambda/function. For multiple class1 divs, just iterate and pass each one in:

def get_html_between(start_tag, cond):
    # Inner HTML of the start tag, then each following tag while cond holds.
    yield start_tag.decode_contents()
    all_next = start_tag.find_all_next()
    for ele in takewhile(cond, all_next):
        yield ele


# Stop at <span class="class2">; tag.name is the element name.
cond = lambda tag: not (tag.name == "span" and tag.get("class") == ["class2"])
soup = BeautifulSoup(h, "lxml")
for tag in soup.select("div.class1"):
    for ele in get_html_between(tag, cond):
        print(ele)
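
When writing cond, it may help to keep in mind how bs4 exposes a tag's name and attributes. The short sketch below uses a made-up one-tag document, and is_end_marker is a hypothetical helper rather than part of bs4. It shows that tag.name is the element name, that tag.get("name") looks up a name="..." attribute instead, and that class comes back as a list:

```python
from bs4 import BeautifulSoup

# Made-up one-tag document, used only to inspect the accessors.
soup = BeautifulSoup('<span class="class2 extra" id="end">x</span>', "html.parser")
tag = soup.span

print(tag.name)          # the element name
print(tag.get("name"))   # value of a name="..." attribute, or None if absent
print(tag.get("class"))  # multi-valued attribute, returned as a list


def is_end_marker(tag):
    # Hypothetical predicate: a <span> whose class list contains "class2".
    return tag.name == "span" and "class2" in tag.get("class", [])


print(is_end_marker(tag))
```

Checking membership with "class2" in tag.get("class", []) still matches when the end tag carries additional classes, whereas comparing against ["class2"] requires the class list to be exactly that.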

Using your newest edit:

In [15]: cond = lambda tag: not (tag.name == "span" and tag.get("class") == ["class2"])

In [16]: for tag in soup.select("div.class1"):
   ....:     for ele in get_html_between(tag, cond):
   ....:         print(ele)
   ....:     print("\n")
   ....:
Included Text
<h1><b>text</b></h1>
<b>text</b>
<span>[..]</span>
<div>[...]</div>


Included Text
<h1><b>text</b></h1>
<b>text</b>
<span>[..]</span>
<div>[...]</div>


Included Text
<h1><b>text</b></h1>
<b>text</b>
<span>[..]</span>
<div>[...]</div>
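
In the output above, <b>text</b> appears twice per occurrence because find_all_next() also visits the descendants of each tag, and the bare text between tags is skipped because only tags are returned. If one HTML string per occurrence is wanted, a possible alternative is to walk next_siblings instead. This is only a sketch: it assumes the start div and the end span share the same parent, and html_between_siblings and is_end are hypothetical names introduced here.

```python
from itertools import takewhile

from bs4 import BeautifulSoup, Tag

# Shortened sample in the shape of the question's HTML (an assumption).
html = """<div class="class1">Included Text</div>
<h1><b>text</b></h1><span>[..]</span><div>[...]</div>
<span class="class2">end</span>"""


def is_end(node):
    # Hypothetical helper: True only for the <span class="class2"> end marker.
    return isinstance(node, Tag) and node.name == "span" and "class2" in node.get("class", [])


def html_between_siblings(start_tag):
    # Inner HTML of the start tag, then every following sibling (tags and
    # text nodes) up to, but not including, the end marker.
    parts = [start_tag.decode_contents()]
    parts.extend(str(node) for node in takewhile(lambda n: not is_end(n), start_tag.next_siblings))
    return "".join(parts)


soup = BeautifulSoup(html, "html.parser")
for start in soup.select("div.class1"):
    print(html_between_siblings(start))
```

Because only direct siblings are visited, nested tags such as <b> are emitted once, inside their parent, and the text between tags is kept. If the end marker is nested at a different depth than the start tag, this sketch will not find it and the find_all_next() approach is the safer choice.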
