Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


c# - What is the best way of removing rogue ampersands in XML?

(TLDR at the bottom)

We have a legacy system that has implemented its own XML reader/writer. The problem is that it allows a literal "&" inside a property value.

<SB nae="Name" net="HV & DD"/>

When I read the data with the XDocument.Parse() method, parsing fails, of course. I am looking for ways to sanitize the data first.
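As a minimal sketch of the failure described above, the snippet below feeds the sample element to XDocument.Parse() and catches the resulting XmlException. The class and method names (RogueAmpersandDemo, TryParse) are made up for illustration and are not part of the legacy system.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class RogueAmpersandDemo
{
    // Hypothetical helper: returns true when the text parses as XML,
    // false (plus the parser's message) when XDocument.Parse rejects it.
    public static bool TryParse(string xml, out string error)
    {
        try
        {
            XDocument.Parse(xml);
            error = string.Empty;
            return true;
        }
        catch (XmlException ex)
        {
            error = ex.Message;
            return false;
        }
    }

    public static void Main()
    {
        // The sample from the question, with the literal ampersand.
        string broken = "<SB nae=\"Name\" net=\"HV & DD\"/>";
        // The same element with the ampersand escaped correctly.
        string escaped = "<SB nae=\"Name\" net=\"HV &amp; DD\"/>";

        Console.WriteLine(TryParse(broken, out string brokenError)
            ? "broken: parsed"
            : "broken: failed (" + brokenError + ")");
        Console.WriteLine(TryParse(escaped, out _)
            ? "escaped: parsed"
            : "escaped: failed");
    }
}
```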

I am attempting to use regex to identify cases where this is happening. To illustrate, consider this:

&(?!amp;)

This matches an ampersand, using a negative lookahead to ensure it isn't already a correctly escaped ampersand. Once I have identified these cases, I can substitute a proper &amp;

Of course, the problem is that this will also match other escaped characters such as &gt;, &lt;, &quot;, etc., so I need to exclude those as well. Maybe a more general form would work, such as a regex that skips any ampersand followed by 2-4 characters and then a semicolon.
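To make the over-matching concrete, here is a rough sketch comparing the simple lookahead with one possible reading of the "2-4 characters and then semicolon" idea. The pattern &(?![A-Za-z]{2,4};) is an assumption about what that more general form could look like, and the class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NaivePatternDemo
{
    // The pattern from the question: only &amp; is protected.
    public static string EscapeNaive(string s) =>
        Regex.Replace(s, "&(?!amp;)", "&amp;");

    // Assumed "general form": skip & followed by 2-4 letters and a semicolon.
    public static string EscapeShortNames(string s) =>
        Regex.Replace(s, "&(?![A-Za-z]{2,4};)", "&amp;");

    public static void Main()
    {
        string input = "a &lt; b & c &amp; d";

        // The naive pattern double-escapes the already valid &lt;
        Console.WriteLine(EscapeNaive(input));

        // The short-name pattern leaves &lt; and &amp; alone
        // and only escapes the bare ampersand.
        Console.WriteLine(EscapeShortNames(input));
    }
}
```

Note that the short-name variant still rewrites numeric character references such as &#38;, because the character after the ampersand is not a letter.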

But my worry is that there are other cases for ampersands that I am not thinking of and that are not represented in the few samples I have got. I am looking for a safe way that will not mess up proper xml.

TLDR: How do I identify ampersands that are not part of proper xml, but are cases of unescaped ampersands in property values?


1 Reply


You can replace every match of the following regex pattern with &amp;:

&(?!(?:#\d+|#x[0-9a-f]+|\w+);)

Demo: https://regex101.com/r/3MTLY9/2
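A minimal C# sketch of applying this pattern before parsing might look like the following. The AmpersandSanitizer class and Sanitize method are illustrative names, and the IgnoreCase option is an assumption added so that uppercase hex digits (for example &#x2A;) are also recognised as valid references.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class AmpersandSanitizer
{
    // An ampersand that does NOT start a decimal reference (&#38;),
    // a hex reference (&#x26;) or a named entity (&lt;).
    private static readonly Regex RogueAmpersand = new Regex(
        @"&(?!(?:#\d+|#x[0-9a-f]+|\w+);)",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Hypothetical helper: escapes only the ampersands matched above.
    public static string Sanitize(string rawXml) =>
        RogueAmpersand.Replace(rawXml, "&amp;");

    public static void Main()
    {
        // Sample from the question, extended with references that must survive.
        string raw = "<SB nae=\"Name\" net=\"HV & DD &lt; &#38; &#x26;\"/>";

        string clean = Sanitize(raw);
        Console.WriteLine(clean);

        // The sanitized text should now be accepted by the parser.
        XElement root = XDocument.Parse(clean).Root;
        Console.WriteLine(root.Attribute("net").Value);
    }
}
```

Two caveats apply to this sketch. Because \w+; accepts any name, text such as &foo; is left untouched, and the parser will still reject it if that entity is not declared. The pattern also does not know about CDATA sections, where a literal ampersand is legal, so data containing CDATA would need separate handling.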

