This is for .NET. IgnoreCase is set and MultiLine is NOT set.
Usually I'm decent at regex, maybe I'm running low on caffeine...
Users are allowed to enter HTML-encoded entities (&lt;, &amp;, etc.), and to use the following HTML tags:
u, i, b, h3, h4, br, a, img
Self-closing <br/> and <img/> are allowed, with or without the extra space, but are not required.
I want to:
- Strip all starting and ending HTML tags other than those listed above.
- Remove attributes from the remaining tags, except anchors can have an href.
My search pattern (replaced with an empty string) so far:
<(?!i|b|h3|h4|a|img|/i|/b|/h3|/h4|/a|/img)[^>]+>
This seems to be stripping all but the start and end tags I want, but there are three problems:
- Having to include the end tag version of each allowed tag is ugly.
- The attributes survive. Can this happen in a single replacement?
- Tags starting with the allowed tag names slip through. E.g., "<abbrev>" and "<iframe>".
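A minimal sketch of one way to address the first and third problems, not a definitive fix. It assumes the allowed list from the top of the question (so `u` and `br` are included, even though my pattern above omits them), and the names `TagFilter` and `StripDisallowedTags` are made up for illustration. The optional slash moves inside the negative lookahead, so end tags need no separate entries, and the name must be followed by whitespace, `/`, or `>`, so `<iframe>` and `<abbrev>` no longer pass as `<i>` and `<a>`. Attributes still survive.

```csharp
using System.Text.RegularExpressions;

static class TagFilter
{
    // Allowed tag names, assumed from the list at the top of the question.
    const string Allowed = "i|b|u|h3|h4|br|a|img";

    // "/?" sits inside the lookahead so one entry covers both start and end tags.
    // [\s/>] requires a delimiter right after the name, which rejects prefixes
    // such as <iframe> (starts with "i") and <abbrev> (starts with "a").
    static readonly Regex Disallowed = new Regex(
        @"<(?!/?(?:" + Allowed + @")[\s/>])[^>]+>",
        RegexOptions.IgnoreCase);

    // Replaces every tag that is not on the allowed list with an empty string.
    public static string StripDisallowedTags(string html) => Disallowed.Replace(html, "");
}
```

For example, `TagFilter.StripDisallowedTags("<b>ok</b> <iframe src=\"x\">no</iframe> <abbrev>a</abbrev><br/>")` gives `<b>ok</b> no a<br/>`.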
The following suggested pattern does not strip out tags that have no attributes.
</?(?!i|b|h3|h4|a|img)[^>]*>
As mentioned below, ">" is legal in an attribute value, but it's safe to say I won't support that. Also, there will be no CDATA blocks, etc. to worry about. Just a little HTML.
Loophole's answer is the best one so far, thanks! Here's his pattern (hoping the PRE works better for me):
static string SanitizeHtml(string html)
{
    string acceptable = "script|link|title";
    string stringPattern = @"</?(?(?=" + acceptable + @")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>";
    return Regex.Replace(html, stringPattern, "sausage");
}
Some small tweaks I think could still be made to this answer:
I think this could be modified to capture simple HTML comments (those that do not themselves contain tags) by adding "!--" to the "acceptable" variable and making a small change to the end of the expression to allow for an optional trailing "\s--".
I think this would break if there are multiple whitespace characters between attributes (example: heavily-formatted HTML with line breaks and tabs between attributes).
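A hedged sketch of the whitespace tweak, assuming the only change needed is `\s+` in place of `\s` before each attribute name. The class name `WhitespaceTweak` is made up, the `acceptable` list is the sample one from the answer, and I replace with an empty string here rather than "sausage". Since `.` does not match a newline without RegexOptions.Singleline, this only helps with whitespace between attributes, not with line breaks inside an attribute value.

```csharp
using System.Text.RegularExpressions;

static class WhitespaceTweak
{
    public static string Sanitize(string html)
    {
        // Sample list from the quoted answer; tags starting with these names are left alone.
        string acceptable = "script|link|title";

        // Same shape as the quoted pattern, except \s+ (one or more whitespace
        // characters, including tabs and line breaks) precedes each attribute.
        string pattern = @"</?(?(?=" + acceptable + @")notag|[a-zA-Z0-9]+)"
                       + @"(?:\s+[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>";

        return Regex.Replace(html, pattern, "");
    }
}
```

With this change a tag such as `<div  class="x">` (two spaces before `class`), or one with a line break and a tab between attributes, is matched and removed, while `<script src="a.js">` is still left in place.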
Edit 2009-07-23: Here's the final solution I went with (in VB.NET):
Dim AcceptableTags As String = "i|b|u|sup|sub|ol|ul|li|br|h2|h3|h4|h5|span|div|p|a|img|blockquote"
Dim WhiteListPattern As String = "</?(?(?=" & AcceptableTags & _
    ")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>"
html = Regex.Replace(html, WhiteListPattern, "", RegexOptions.Compiled)
The caveat is that the HREF attribute of A tags still gets scrubbed, which is not ideal.
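One possible way around the HREF caveat, sketched in C# rather than VB.NET and not what I actually shipped: run a second pass with a MatchEvaluator that rebuilds each remaining tag without its attributes, keeping only a quoted href on opening anchor tags. The names `AttributeScrubber`, `Scrub`, `Tag`, and `Href` are made up for illustration. It assumes disallowed tags have already been stripped, that no attribute value contains ">", and that href values are quoted.

```csharp
using System;
using System.Text.RegularExpressions;

static class AttributeScrubber
{
    // Any tag: optional leading slash, tag name, then the raw attribute text.
    static readonly Regex Tag = new Regex(
        @"<(?<close>/?)(?<name>[a-z0-9]+)(?<attrs>[^>]*)>",
        RegexOptions.IgnoreCase);

    // A single- or double-quoted href attribute.
    static readonly Regex Href = new Regex(
        @"\shref\s*=\s*(?<q>[""'])(?<url>.*?)\k<q>",
        RegexOptions.IgnoreCase);

    // Rebuilds every tag from its name alone, re-adding href only on <a ...>.
    public static string Scrub(string html) => Tag.Replace(html, m =>
    {
        string name = m.Groups["name"].Value;
        bool closing = m.Groups["close"].Length > 0;
        string attrs = m.Groups["attrs"].Value;

        string kept = "";
        if (!closing && name.Equals("a", StringComparison.OrdinalIgnoreCase))
        {
            Match h = Href.Match(attrs);
            if (h.Success)
            {
                string q = h.Groups["q"].Value;   // keep the original quote character
                kept = " href=" + q + h.Groups["url"].Value + q;
            }
        }

        // Preserve a trailing slash so <br /> and <img ... /> stay self-closing.
        string selfClose = attrs.TrimEnd().EndsWith("/") ? "/" : "";
        return "<" + (closing ? "/" : "") + name + kept + selfClose + ">";
    });
}
```

For example, `<a href="http://example.com" onclick="x()">link</a>` becomes `<a href="http://example.com">link</a>`, and `<b class="x">hi</b><br />` becomes `<b>hi</b><br/>`. The href value itself is not validated, so a `javascript:` URL would pass through unchanged.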