Regex to parse a multiline HTML

Question

posted Jan 31, 2022 in Technique[技术] by 深蓝 (71.8m points)

I am trying to parse a multi-line HTML file using regex.

HTML code:

<td>Details</td></tr>  
<tr class=d1>
<td>uss_vod_translator</td>

Regex Expression:

if ($line =~ m/Details<\/td>\s*<\/tr>\s*<tr\s*class=d1>\s*<td>(\w*)<\/td>/)
{
    print "$1";
}

I am using \s* (whitespace) to match across lines, but it is not working. I searched for a fix and even tried /? for multi-line, but that did not work either.

Can anyone please suggest how to parse multiline HTML?

I know regex is a poor solution for parsing HTML, but I have legacy HTML code that I need to parse and I have no other choice.

1 Reply

深蓝 · Answer 1 · 2022-01-31T07:26:52+0000

Can anyone please suggest how to parse multiline HTML?

Stop trying to use regular expressions and use a module that will parse it for you.

HTML::TreeBuilder is a good solution.
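As a rough sketch of what that could look like for the snippet in the question: the code below assumes the fragment sits inside an ordinary table, and the helper name first_d1_cell and the sample markup are made up for illustration. It looks up the first row with class d1 and returns the text of its first cell, regardless of how the markup is split across lines.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample input modelled on the question; the surrounding <table> is an assumption.
my $html = <<'HTML';
<table>
<tr><td>Details</td></tr>
<tr class=d1>
<td>uss_vod_translator</td>
</tr>
</table>
HTML

# Hypothetical helper: text of the first <td> in the first <tr class=d1>.
sub first_d1_cell {
    my ($content) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($content);

    # In scalar context look_down returns the first matching element (or undef).
    my $row  = $tree->look_down(_tag => 'tr', class => 'd1');
    my $cell = $row ? $row->look_down(_tag => 'td') : undef;
    my $text = $cell ? $cell->as_trimmed_text : undef;

    $tree->delete;    # release the tree explicitly
    return $text;
}

print first_d1_cell($html), "\n";
```

Because the parser works on the whole document rather than on one line at a time, line breaks between the tags stop mattering.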

HTML::TreeBuilder::LibXML gives you the same API but backed by a fast parser.

HTML::TreeBuilder::XPath adds XPath support as well as a fast parser.
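With the XPath variant, the same lookup can be written as a single path expression. This is only a sketch under the same assumptions as before (the fragment lives inside a table; the helper name d1_cell_by_xpath and the sample markup are invented for the example).

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Sample input modelled on the question; the surrounding <table> is an assumption.
my $html = <<'HTML';
<table>
<tr><td>Details</td></tr>
<tr class=d1>
<td>uss_vod_translator</td>
</tr>
</table>
HTML

# Hypothetical helper: text of the first <td> under any <tr class="d1">.
sub d1_cell_by_xpath {
    my ($content) = @_;
    my $tree = HTML::TreeBuilder::XPath->new_from_content($content);

    # In list context findnodes returns the matching elements in document order.
    my ($cell) = $tree->findnodes('//tr[@class="d1"]/td');
    my $text = $cell ? $cell->as_trimmed_text : undef;

    $tree->delete;    # release the tree explicitly
    return $text;
}

print d1_cell_by_xpath($html), "\n";
```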