Answers to homework #1

Question 1

Your solutions to this problem were, on the whole, excellent. I was very pleased with your work and how carefully thought-out your solutions were.

I was able to find a confirmed vulnerability in 11 (61%) of your solutions; 6 (33%) were immaculate and had no weakness or vulnerability that I could find. (Compared to how prior classes did: I broke 50% of the solutions the last year I did this, and in all previous years I was able to break 100% of the solutions.) So, well done! Of course, I have a bit of an unfair advantage: I have been collecting examples of obscure browser behavior and subtle pitfalls for the past few years.

Here were the most common attacks that I found on your solutions:

I have more details about this homework assignment below, broken down into two categories: policy (which tags and attributes should be allowed?), and enforcement architecture (what mechanism do we use to ensure that only tags allowed by the policy appear in the resulting document?).

Mechanism: the enforcement architecture

The only approach that worked reliably is to parse the HTML document to get a parse tree, apply the policy to the parse tree, and then convert it back to a string. Solutions that attempted to use regexps or otherwise work with straight text invariably could be fooled through sufficiently complex or obscure HTML.

Similarly, whitelisting was a must. I found that blacklists were almost always incomplete.

The best solutions parsed the HTML document to get a parse tree, then applied a whitelist to the tags (keeping only tags whose names are on the whitelist), to the attributes (keeping only attributes whose names are on the whitelist), to attribute values (e.g., keeping only URIs whose protocols are on a whitelist), and to the characters that are allowed to appear in the document (HTML- or URI-encoding all other characters). Everything else was dropped. Then, there was some code to print the resulting parse tree to a string---call this "unparsing", to give it a name.

The final unparsing step is crucial, because we need to make sure that the browser will interpret the resulting HTML document in the same way we did. It is important to HTML-escape values as they are unparsed: for instance, when rendering text that isn't supposed to contain any tags, replace < and > by &lt; and &gt; (respectively) to ensure that the browser doesn't treat this text as containing tags. Part of the problem is that, because erroneous markup is common on the web, many web browsers contain heuristics to try to make sense of technically-invalid syntax and helpfully accept these documents despite their syntax errors. These heuristics may cause the browser to find tags in places that we did not want the browser to find them. By HTML-escaping text that is not intended as a tag, we reduce the chances that some browser heuristic will treat it as a tag.
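As a minimal sketch of this escaping step, assuming Python and its standard html module (the helper names escape_text and escape_attr are made up for illustration):

```python
import html

def escape_text(text):
    """Escape text content so the browser cannot find tags in it."""
    # html.escape replaces &, < and >; quotes are left alone when quote=False.
    return html.escape(text, quote=False)

def escape_attr(value):
    """Escape a value that will be emitted inside a double-quoted attribute."""
    # quote=True additionally replaces " and ' so the value cannot
    # close the attribute and start a new one.
    return html.escape(value, quote=True)

print(escape_text('1 < 2 <SCRIPT SRC="foo.js" </SCRIPT'))
print('<a title="%s">' % escape_attr('x" onmouseover="alert(1)'))
```

Whatever the browser's error-recovery heuristics are, they have much less to work with once no raw < or > survives in text that was never meant to be a tag.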

During the filtering step, we must ensure that no value gets carried over into the output without being checked. One approach is for the filtering phase to accept a parse tree as input, and build up a new parse tree to produce as output, constructing new parse nodes and filling them in with only values that have been filtered. This helps avoid a case where we forgot to check some attribute/aspect of the input node because we weren't aware of its existence. Alternatively, another approach is to skip filtering the tree, and instead move all the logic into the "unparsing" step: we traverse the parse tree and build up an output string from the parsed data, based upon only the whitelisted content (we ignore unsafe tags/attributes and do not emit any output for them).
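The second approach (all of the logic lives in the unparsing step) might look roughly like the sketch below. It assumes Python's standard html.parser module, which does not necessarily parse the way any particular browser does; the policy sets ALLOWED_TAGS, ALLOWED_ATTRS and DROP_CONTENT and the names WhitelistUnparser and sanitize are hypothetical and far too small to be a real policy.

```python
import html
from html.parser import HTMLParser

# Hypothetical, deliberately tiny policy for illustration only.
ALLOWED_TAGS = {"p", "b", "i", "em", "strong", "br", "ul", "ol", "li"}
ALLOWED_ATTRS = {"title"}
VOID_TAGS = {"br"}                  # tags that take no end tag
DROP_CONTENT = {"script", "style"}  # drop these tags and their contents

class WhitelistUnparser(HTMLParser):
    """Emit output only for whitelisted content; everything else is ignored."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a SCRIPT or STYLE element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return
        parts = [tag]
        for name, value in attrs:
            if name in ALLOWED_ATTRS:
                parts.append('%s="%s"' % (name, html.escape(value or "", quote=True)))
        self.out.append("<%s>" % " ".join(parts))

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(html.escape(data, quote=False))

    # Comments, doctypes and processing instructions fall through to the
    # default handlers, which do nothing, so they never reach the output.

def sanitize(document):
    parser = WhitelistUnparser()
    parser.feed(document)
    parser.close()
    return "".join(parser.out)

print(sanitize('<p onclick="x()" title="t">hi <script>alert(1)</script><b>there</b></p><!-- c -->'))
# <p title="t">hi <b>there</b></p>
```

Because the output is assembled only from names that appear in the whitelist and from escaped text, an attribute or node type the author never thought about simply has no way into the result. This sketch does not balance tags or filter attribute values, which a real filter would also need to do.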

Escaping and sanitizing data during the "unparsing" step is important. Otherwise, some browsers may find HTML tags in all sorts of places you never would have guessed. For instance, each of the three examples below contains SCRIPT tags that will be recognized by some browsers.

  <SCRIPT SRC="foo.js" </SCRIPT
This is because many browsers try to be "helpful" and correct syntax errors in the HTML for you. They do this because many HTML pages in the wild are not valid HTML, but users want to see their content anyway.

It also helps to be conservative and drop elements that have the potential to confuse a browser. Dropping all comments is helpful, because it avoids the risk that the browser parses comments differently than we'd expect. Also, many browsers peer inside comments and interpret <SCRIPT> and <STYLE> blocks inside comments (this idiom was once used for backwards compatibility with browsers that didn't support Javascript, and once browsers add support for a strange hack like this, it's hard to remove it), so if you leave comments intact you may find that you've provided a way for someone to break your filter. I was able to break a few submissions by exploiting subtle aspects of comments.

A few people used blacklisting: they removed all tags (or attributes, or attribute values, or URIs) that are known to be dangerous. However, blacklisting is error-prone. If you forget, or are unaware of, some dangerous tag, you might fail to filter out unsafe HTML. None of the blacklist-based solutions that I saw were secure: I was able to successfully attack every one of them, usually by finding some unsafe tag or attribute or type of URL that was omitted from the blacklist.

Implementation issue: Beware mutating a list while iterating over it. An implementation bug I saw in several solutions involved iterating over a set of attributes, and removing each one that was prohibited. Unfortunately, in many languages modifying a list (e.g., deleting some of its items) while iterating over it leads to unexpected behavior. For instance, consider this Python snippet:

l = [0,1,2,3,4,5]
for i in l:
    l.remove(i)

What is the value of l after executing the above code? If you said "an empty list", you might want to try it in a Python interpreter. The actual answer is [1, 3, 5]: you can see that this loop does not actually remove every element of the list, because you violated the rules and the iterator got all confused.

I saw several solutions that iterated over all attributes of a tag, and if the attribute was not on a whitelist, removed the attribute. Unfortunately, this violates the rules, because it is mutating a list while iterating over it. As a result, if the attacker creates an input with a tag that has a bunch of malicious attributes, then half of them will be deleted but the other half will survive. For instance, the following Python snippet is buggy, because it mutates tag.attrs while iterating over it:

for name,value in tag.attrs:
    if name not in whitelist:
        del tag[name]

If tag.attrs contains multiple non-whitelisted attributes, then this loop might delete only some of them, leaving others still present.
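One way to sidestep the problem is to build a fresh list instead of deleting in place. The sketch below uses a hypothetical stand-in Tag class (not any real parser's API) whose attrs field is a list of (name, value) pairs, as in the snippet above; WHITELIST and filter_attrs are likewise made-up names.

```python
WHITELIST = {"href", "title"}

class Tag:
    """Hypothetical stand-in for a parser's tag object."""
    def __init__(self, name, attrs):
        self.name = name
        self.attrs = list(attrs)  # list of (name, value) pairs

def filter_attrs(tag, whitelist):
    # Build a new list containing only the whitelisted attributes.
    # Nothing is deleted during iteration, so no attribute can be skipped.
    tag.attrs = [(n, v) for (n, v) in tag.attrs if n.lower() in whitelist]

tag = Tag("a", [("onclick", "evil()"),
                ("onmouseover", "evil()"),
                ("href", "http://example.com/"),
                ("onfocus", "evil()")])
filter_attrs(tag, WHITELIST)
print(tag.attrs)  # [('href', 'http://example.com/')]
```

Iterating over a copy (for example, list(tag.attrs)) while deleting from the original would also work, but constructing the new list keeps the "only the good stuff" intent visible in the code.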

In some sense, this is just a little implementation bug. But when you have an implementation bug like this, it is worth pondering how you could have structured your code to be more resilient to bugs. One approach is to avoid mutating data structures in place, and build up a new data structure with "only the good stuff". Perhaps a better approach is to use a proof-carrying code trick, where you have two components, the transformer and the verifier. The transformer is supposed to transform the HTML to eliminate all bad stuff; the verifier then checks that no bad stuff is present in the output of the transformer (the verifier does not try to transform the document; if the output of the transformer has any amount of bad stuff, no matter how small, then the verifier fails and nothing is output). In this architecture, only the verifier need be trusted, and bugs in the transformer cannot harm security; this would have protected you against this sort of bug. A third possible approach is to build your own pretty-printing engine. If you write your own code to translate from a parse tree to a string, then you can write it so that it only outputs anything for parts of the parse tree that appear to be safe. In that architecture, only this pretty-printing code becomes trusted.
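A rough sketch of the transformer/verifier split, assuming Python's standard html.parser as the verifier's parser. The names Verifier, UnsafeOutput, verify and filter_html, and the tiny policy sets, are hypothetical; a real verifier would also check attribute values and would have to worry about whether its parser agrees with the browser.

```python
from html.parser import HTMLParser

# Hypothetical policy for illustration only.
ALLOWED_TAGS = {"p", "b", "i", "br"}
ALLOWED_ATTRS = {"title"}

class UnsafeOutput(Exception):
    """Raised when the transformer's output violates the policy."""

class Verifier(HTMLParser):
    """Checks a document against the policy; never tries to repair it."""

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            raise UnsafeOutput("tag %r is not allowed" % tag)
        for name, _value in attrs:
            if name not in ALLOWED_ATTRS:
                raise UnsafeOutput("attribute %r is not allowed" % name)

    def handle_endtag(self, tag):
        if tag not in ALLOWED_TAGS:
            raise UnsafeOutput("end tag %r is not allowed" % tag)

    # Anything that is neither a whitelisted tag nor text is rejected.
    def handle_comment(self, data):
        raise UnsafeOutput("comments are not allowed")

    def handle_decl(self, decl):
        raise UnsafeOutput("declarations are not allowed")

    def handle_pi(self, data):
        raise UnsafeOutput("processing instructions are not allowed")

    def unknown_decl(self, data):
        raise UnsafeOutput("unknown declarations are not allowed")

def verify(output):
    """Return output unchanged if it passes the policy, else raise."""
    checker = Verifier()
    checker.feed(output)
    checker.close()
    return output

def filter_html(document, transform):
    # The transformer may be arbitrarily buggy; nothing is returned
    # unless the verifier accepts what it produced.
    return verify(transform(document))
```

With this structure, a transformer that leaves half of the malicious attributes in place (as in the loop above) causes filter_html to raise UnsafeOutput instead of emitting a dangerous document.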


Policy: which tags and attributes to allow

Tags. To eliminate active content, we must forbid the SCRIPT, OBJECT, EMBED, and APPLET tags, as they can be used to invoke active code.

Forms may allow social engineering attacks on the user that trick the user into revealing passwords, data, or uploading files. Probably forms aren't very useful in webmail. Eliminating forms requires stripping FORM and INPUT tags.

Attributes. To eliminate active content, we must forbid the ONCLICK, ONMOUSEOVER, ONDBLCLICK, ONMOUSEDOWN, ONMOUSEUP, ONMOUSEMOVE, ONMOUSEOUT, ONKEYPRESS, ONKEYDOWN, ONKEYUP, ONFOCUS, ONBLUR attributes, as they can be used to execute Javascript. This list is not exhaustive: Microsoft Internet Explorer appears to have introduced many proprietary additional scripting-related attributes of their own -- which is a good argument for a whitelist instead of a blacklist.

URIs. Some URIs can refer to or invoke active code, so we must filter URIs. For instance, <A HREF="javascript:foo()"> executes Javascript code if that link is clicked. There are many ways to conceal the javascript: string, e.g., j%41vascript%3A, jaVaSCRiPt:, j&#x41;vascript:, j&#65;vascript:, and so on. Also, data: URIs are potentially dangerous, because they can be used to conceal malicious HTML or other content (which may itself contain Javascript). For this reason, we probably should apply some policy to URIs, e.g., requiring that they refer only to a whitelisted set of protocols (e.g., http: or https:).

URIs can appear in lots of places, including in HREF, SRC, LOWSRC, DYNSRC, LONGDESC, ACTION, CITE, USEMAP, BACKGROUND, PROFILE attributes. URIs can even occur in <META HTTP-EQUIV="Refresh" CONTENT="?: URL=..."> and <PARAM VALUETYPE=ref VALUE="...">. This makes it challenging to ensure that we have applied a policy to every URI.
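A minimal sketch of a protocol whitelist for URI-valued attributes, in Python. It assumes the attribute values have already been entity-decoded by the HTML parser; the names clean_uri and filter_uri_attrs are hypothetical, URI_ATTRS is just the attribute list from the paragraph above (so it is not exhaustive), and rejecting relative URIs outright is a conservative choice made for the sketch, not a requirement.

```python
import html
import re

ALLOWED_SCHEMES = {"http", "https"}
URI_ATTRS = {"href", "src", "lowsrc", "dynsrc", "longdesc",
             "action", "cite", "usemap", "background", "profile"}

# A URI scheme: a letter followed by letters, digits, "+", "." or "-".
_SCHEME = re.compile(r"([A-Za-z][A-Za-z0-9+.\-]*):")

def clean_uri(value):
    """Return a cleaned URI if its scheme is whitelisted, else None."""
    # Drop whitespace and control characters, which some browsers ignore
    # inside a scheme name (e.g. "java<TAB>script:").
    cleaned = "".join(ch for ch in value if ord(ch) > 0x20 and ord(ch) != 0x7F)
    match = _SCHEME.match(cleaned)
    if match is None or match.group(1).lower() not in ALLOWED_SCHEMES:
        return None  # no recognizable scheme, or a scheme we do not allow
    return cleaned

def filter_uri_attrs(attrs):
    """Drop URI-valued attributes whose URI fails the scheme whitelist.

    Assumes attrs has already been through the attribute-name whitelist.
    """
    kept = []
    for name, value in attrs:
        if name.lower() in URI_ATTRS:
            value = clean_uri(value or "")
            if value is None:
                continue
        kept.append((name, value))
    return kept

for candidate in ["http://example.com/", "jaVaSCRiPt:foo()",
                  html.unescape("j&#x41;vascript:foo()"),
                  "j%41vascript%3Afoo()", "data:text/html,x"]:
    print(repr(candidate), "->", clean_uri(candidate))
```

Since the check is a whitelist on the scheme, the various ways of disguising javascript: do not need to be enumerated: anything that does not plainly start with http: or https: is refused.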

Images and other external resources. Loading an image or other external content can cause an externally-visible side effect. For instance, consider an HTML email that contains a tiny, 1 pixel by 1 pixel, image loaded from the sender's web server. This image--sometimes called a "web bug"--will be loaded by your browser when you view the email. If each image URI is unique, the sender will be able to tell when you read the email. The sender can also learn the IP address of the machine you used to read the webmail. This is arguably a privacy leak; some email users might feel a bit creeped out if the sender can tell whether and when their email was read. "Web bugs" are widely used by spammers for address verification. The spam email includes a customized image whose URI contains your email address; when that URI is loaded, the spammer will know that you received the email and your email address was valid. Also, because the browser will send cookies along with the request for the image, "web bugs" can be used to track you, and in fact web bugs are frequently used for this purpose on the web by advertisers interested in targeting their advertising. Finally, loading an external resource can cause side effects on your browser, if the remote web server uses its response to set a cookie.

So for privacy reasons it might make sense to strip out anything that could cause images or other resources to be loaded automatically when the email is opened. In other words, arguably we should be stripping all images from the HTML document. An alternative approach is to convert images so they are embedded inside the HTML document, e.g., using a data: URI (which allows the contents of the image to be hardcoded inside the HTML document, rather than referring to a file stored separately).
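For the embedding alternative, here is a small sketch of how image bytes could be turned into a data: URI in Python. The function name image_data_uri is hypothetical, and the sketch assumes the sanitizer has already obtained the image bytes and decided the MIME type itself (rather than trusting anything the email says).

```python
import base64

def image_data_uri(image_bytes, mime_type="image/png"):
    """Return a data: URI that embeds image_bytes directly in the document."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return "data:%s;base64,%s" % (mime_type, encoded)

# The first four bytes of a PNG file, just to show the shape of the result.
print(image_data_uri(b"\x89PNG"))  # data:image/png;base64,iVBORw==
```

Note that this data: URI is generated by the sanitizer; data: URIs supplied by the email itself would still be rejected by the URI policy.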

CSS. Cascading style sheets (CSS) can contain Javascript that will be executed by some older browsers. For example:

  background: url(javascript:alert())
  background: url("")
  any: expression(alert())
CSS can also load URLs, and thus can be used to conceal a "web bug". (Methods could include @import url(""); inside a stylesheet, or <link href="" type="text/css"> inside the HTML document, along with many others.)

Hence, the safest thing is to block all CSS, e.g., by blocking STYLE tags and attributes. This is unfortunate, because CSS is a powerful and clean way to specify presentation information (e.g., font sizes, colors, and so on), but it is probably the best one can expect for this homework.

Ultimately, the right answer is probably to write a CSS sanitizer. However, this is more work than I would reasonably expect you to take on in a homework. A large webmail provider might reasonably be able to write a CSS sanitizer to go with the HTML sanitizer, and that would probably be the way to go in practice.

If you overlooked this one, I don't blame you: this is an obscure gotcha, and you aren't the only one to be unaware of it. The Samy worm spread through MySpace due to a similar oversight in MySpace's HTML sanitizer; if you're curious, you can read the story, as told by the author of the Samy worm. Two years later, it was discovered that Facebook had a similar but more complex security hole with Javascript in user-specified CSS styles. This stuff is tricky.

Crazy stuff you have to deal with

Character sets. One of the craziest corner cases in browsers has to do with character sets. You may be familiar with UTF-8, which is a way of encoding Unicode characters as variable-length byte sequences. But have you heard of UTF-7? It's a way of encoding Unicode characters as a variable-length sequence of 7-bit bytes (bytes with their high bit clear).

It turns out that the following:

is a UTF-7 encoding of
Moreover, it turns out that if the HTML document doesn't specify a character set encoding, and if the web server doesn't provide a Content-Type: header specifying a character set encoding, some browsers will helpfully try to guess what character set you intended to use. The reason this is so nasty is that the text above contains no special characters (only + and - characters) so is likely to be accepted by any HTML sanitizer that wasn't written with knowledge of the charset issue.
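To make the UTF-7 issue concrete, here is a small Python demonstration. The payload is an illustrative string chosen for this sketch, not necessarily the example used above: in UTF-7, +ADw- encodes < and +AD4- encodes >.

```python
# An illustrative payload: pure ASCII, with no angle brackets anywhere.
payload = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"

assert b"<" not in payload and b">" not in payload

# A sanitizer that reads the bytes as ASCII or UTF-8 sees harmless text.
print(payload.decode("ascii"))

# A browser that guesses UTF-7 sees a script element.
print(payload.decode("utf-7"))  # <script>alert(1)</script>
```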

The standard solution is to add a META HTTP-EQUIV tag at the head of the document you generate, specifying the character set that the HTML sanitizer used, to ensure that the browser interprets the document with the same charset as you did. For instance, your HTML sanitizer could rewrite the HTML to insert the following near the beginning of the document:

  <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">
This also implies that we should strip out any similar META tags from the input document, or whitelist the charset and make sure we are using the same charset as the original document intended.

You can find more on the charset issue here.

Fortunately, because guessing a UTF-7 character set is so problematic for security, modern browsers will not infer a UTF-7 character set. But older browsers (like IE6) still have a significant market share, so you must design a filter that will be secure for them, too--and some older browsers will guess a UTF-7 character set in some conditions. And, of course, if the document explicitly specifies a UTF-7 character set (e.g., via <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-7">), the browser will honor it--so your policy needs to ensure that the document cannot include such a declaration.

Comments. Be careful when handling comments. HTML supports a funky scheme for nested comments. Check this out:

  V<!-- W -- X -->Y-->Z

Pop quiz: After the browser removes comments, what do you think will remain? Answer: it depends, but (depending upon doctype) the answer could be "VZ". If you expected the answer to be "VY-->Z", you might have been fooled. The "<!--" starts an SGML comment; the second "--" starts what is effectively a nested comment; the first "-->" closes the nested comment (but we're still in the outer comment at that point); and the second "-->" finally closes the outermost comment. This suggests some fun evasion methods, such as:
  <!-- W: -- X -->
  <p name="Y: --evadefilter><img onerror=alert(999) src=''/><div name=">
If you mis-parse this, thinking that the first line is a comment (and so can be safely included) and the second line is a <P> tag with a very long NAME attribute, and if you don't strip the comment, you might send this on to the browser. If you do, the browser would be within its rights to interpret this as a comment, followed by an <IMG> tag containing script, followed by a <DIV> tag with an unterminated quote, which some browsers will silently terminate for you, just to be helpful. So if you have an imperfect parser and don't strip comments, this could let someone sneak script past you.

Unfortunately, exactly how comments will be parsed is dependent upon the document type (e.g., the <!DOCTYPE> tag) and the browser, so it is hard to predict with certainty how a browser will parse the sorts of comments above. This creates a trap for the unwary.

One reasonable way to ensure that you don't fall for these kinds of tricks might be to parse the HTML document, remove all comments from the parse tree, and ensure that your unparser never generates HTML comments. One way to ensure that the output of the HTML sanitizer does not contain anything that the browser might interpret as comments is to generate only tags whose names are on a whitelist, and HTML-encode any textual content to replace < with &lt; and > with &gt;. This way you are not reliant upon your HTML parser to parse HTML comments exactly the same way as the webmail user's browser does.
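A stripped-down sketch of that idea in Python, focusing only on comments and text (tag handling is omitted for brevity, so all tags are simply dropped here). It assumes Python's standard html.parser; the names CommentFreeText and strip_comments are hypothetical.

```python
import html
from html.parser import HTMLParser

class CommentFreeText(HTMLParser):
    """Keep escaped text only; never emit a comment (or, here, any tag)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_data(self, data):
        # Escaping > means a stray "-->" in text cannot close anything.
        self.out.append(html.escape(data, quote=False))

    def handle_comment(self, data):
        pass  # comments are dropped unconditionally

def strip_comments(document):
    parser = CommentFreeText()
    parser.feed(document)
    parser.close()
    return "".join(parser.out)

print(strip_comments("V<!-- W -- X -->Y-->Z"))  # VY--&gt;Z
```

Python's parser ends the comment at the first "-->", which may not be what the browser would have done with the original input. That disagreement no longer matters: the output contains no "<!--" at all and no raw "<" or ">", so there is nothing left for the browser to interpret as a comment.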

Other stuff. We must be careful with tags like <META HTTP-EQUIV="..." CONTENT="...">, since that syntax introduces HTTP headers that might affect how the document is parsed. For instance, it can change the content encoding or content MIME type, set cookies, redirect to another page, or otherwise influence how the document will be interpreted.

The TARGET attribute might be a potential risk. It is possible that <A HREF="..." TARGET="other"> may be able to navigate other frames that are open in the browser, which might not be good. Also, <A HREF="..." TARGET="_top"> allows "frame-busting": if Tepidmail loads the filtered HTML email in an IFRAME, then a malicious HTML email might contain a link <A HREF="..." TARGET="_top">. If the user clicks this link, it navigates the entire page (including any surrounding user interface from Tepidmail), which might be undesirable to Tepidmail. This seems comparatively minor but is worth considering.

Parting thoughts. This problem was very challenging, not only because of the complexity of HTML, but also because of a fundamental challenge in building an HTML filter: it is difficult to be sure that your sanitizer will parse HTML in exactly the same way that the browser does. Any difference in how your sanitizer and the browser interpret the same HTML document can easily lead to security flaws. Moreover, surviving the idiosyncrasies of all of the major browsers out there is a major challenge. This probably puts inherent limits on the amount of assurance we can gain in the security of any solution to this homework.

Alternative architecture

Here's one totally different architecture I've seen in the past. The idea was to render the HTML using a standard browser and capture the rendered email as a PNG image (e.g., using a screen capture tool or an HTML->PNG converter). The output of the htmlfilter program would then be just this PNG image, and nothing else. The PNG image could be output as a data:image/png;base64 inline image in a tiny HTML document, for instance.

As long as the (pre-)renderer can be trusted not to have any side effects and to be free of vulnerabilities, this greatly minimizes the amount of code in the end browser that must be trusted. Clever. To help ensure absence of side effects, the rendering process could be run in a sandbox with no network access and no ability to modify the filesystem (for instance). Of course, this approach does impair functionality for end users: they can no longer resize fonts, cut-and-paste text, click on links, etc.

Other references

The following sites do a nice job of discussing many subtle issues with HTML sanitization: feedparser's sanitization; Squarefree's advice; NeatHtml.

The XSS cheat sheet has a nice summary of many crazy attacks.

There are many other applications of HTML filtering. For instance, there are several web pages that talk about why RSS aggregators need to sanitize all HTML received via an RSS feed.

A curiosity: Last year, I found a bug in the Beautiful Soup parser that let me break lots of HTML sanitizers. This bug occurred because Beautiful Soup allowed angle-brackets (< and >) to appear in text, and did not HTML-escape them when converting its parse tree to a string. In other words, insufficient output sanitization in Beautiful Soup's method for emitting a parse tree to HTML enabled a "shadow parse" vulnerability. Fortunately, Beautiful Soup has since been fixed, so this attack is no longer effective.

How I evaluated your code

I used two methods to evaluate your design and implementation. First, I read your code, and tried to come up with attacks based on my understanding of browsers and the kinds of tricks mentioned above.

Also, I built a test suite of about 400 tricky test cases that illustrate the above risks. I ran your HTML sanitizer on every one of these test cases. I wrote some Javascript that loads each of the resulting filtered HTML documents into Firefox, and then walks the DOM (Firefox's parse tree) to look for any sign of a security breach. For instance, if the browser's parse tree contains a script node, then that likely indicates a vulnerability in your sanitizer. Similarly, I hooked the alert() function: if loading a filtered HTML document invoked alert(), that probably indicates a vulnerability in your sanitizer. Then I inspected each test case that exhibited anything suspicious to see whether it looked like it would lead to a successful attack.

If you're interested, you're welcome to check out my testbed.

If you found a notation on your graded homework that mentioned something like t137, that's a reference to test case 137 in my testbed (filename testcases/t137.html). Feel free to download the testbed and look at the referenced test case if you want to see a concrete example of an attack against your htmlfilter.