CS 261 Homework 1


This problem set is due Wednesday, September 12, at 2pm.

Work on your own for this homework. You may use any source you like (including other papers or textbooks), but if you use any source not discussed in class, you must cite it.

Question 1

The famous webmail provider, TepidMail, has hired you to secure their webmail service. Your job is to design and implement a way to make it safe to view untrusted HTML emails.

TepidMail is a standard webmail service, so a TepidMail user can go to the TepidMail website and view their email using a web browser. Your job is to figure out how to safely display HTML emails. If an attacker sends an HTML email containing malicious HTML content to TepidMail users, we want to be sure this can't harm the TepidMail users or their machines.

You're going to write a sanitizing filter that TepidMail can invoke on the command line, like this:

./htmlfilter < untrustedemail.html > safeforviewing.html

Before sending an HTML email to a user's browser for display, TepidMail will run it through this filter. (For instance, the TepidMail mailserver might automatically run this filter on every incoming email that contains HTML content; then when the recipient goes to view their email, the filtered HTML document might be shown in a frame.) You have two goals:

A filtered HTML document must not, under any circumstances, cause any harm to the TepidMail user's system.

Viewing a filtered HTML document should be as harmless as viewing an ASCII text file with, say, /bin/more (even if an attacker supplies the entire contents of an ASCII email, viewing it with /bin/more cannot harm your machine). In particular, reading an email from someone malicious should not cause any lasting side effects to the TepidMail user's machine that persist after their web browser is closed; it should not leak any confidential information (e.g., the contents of files on the user's hard disk; or, information about what the user is viewing in another window with the same browser); and it should not endanger the integrity of the user's machine (e.g., we must not allow it to tamper with a different web document that the user is viewing in another window using the same browser).

A filtered HTML document should be safe to view in any browser that is likely to be used by a significant number of TepidMail users: let's say, IE6 and later (e.g., IE7, IE8, etc.), Firefox 3.6 and later, Safari 6.0 and later, and a recent version of Chrome. (These all have non-trivial market share. For no good reason, I've excluded mobile browsers.)

Your code should be robust: it shouldn't crash on any input. Since TepidMail is going to run your program on malicious inputs, it would be embarrassing if any input caused your filter to crash uncleanly.
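
As a rough illustration of the robustness requirement (a sketch, not a reference solution), the filter's entry point can be wrapped so that any internal failure produces a fixed, known-safe page instead of a crash. The names run_filter, sanitize, and FALLBACK are hypothetical, and the sanitize shown here is only a placeholder policy that escapes everything; UTF-8 input is also an assumption.

```python
import html
import sys

# Fixed page emitted when anything goes wrong (hypothetical wording).
FALLBACK = b"<html><body><p>This message could not be displayed.</p></body></html>\n"


def sanitize(text: str) -> str:
    """Placeholder policy: escape everything, so no markup survives."""
    return "<pre>" + html.escape(text, quote=True) + "</pre>\n"


def run_filter(raw: bytes) -> bytes:
    """Never raises: any failure yields the fixed fallback page."""
    try:
        # Undecodable bytes become U+FFFD instead of raising an error.
        text = raw.decode("utf-8", errors="replace")
        return sanitize(text).encode("utf-8")
    except Exception:
        return FALLBACK


def main() -> int:
    # Read and write bytes so that odd encodings cannot break the I/O layer.
    sys.stdout.buffer.write(run_filter(sys.stdin.buffer.read()))
    return 0

# A script installed as ./htmlfilter would end with: sys.exit(main())
```

Failing closed like this means a bug in the sanitizer costs functionality rather than security, which matches the priorities stated in this problem.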

Your scheme must not only be secure; it must also be verifiably secure. TepidMail has informed you that an independent security consultant will be reviewing your architecture, and needs to be convinced that your system is secure. You will have to provide an argument why it is reasonable to believe that your filter achieves this goal. As much as possible, you should strive to provide positive evidence of security, not just absence of evidence of insecurity.

Ideally, your filter should retain as much of the useful HTML content from the original email as possible -- except, of course, where this might conflict with security.

For instance, a filter that ignores its input and always outputs the empty HTML page is not very useful. Thus, your solution should be at least minimally useful for viewing the textual content of HTML emails. Folks should be able to include links to external sites in their email. It would also be nice if they could include inline images. However, to simplify your life, I'll let you make some assumptions: in particular, you don't need to support other content (e.g., CSS, scripts, Flash animations, videos, etc.; most of those probably aren't appropriate in an email anyway).
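
One possible shape for such a policy is an allowlist: parse the input, then re-emit only constructs the policy names explicitly, generating the output from scratch rather than copying input bytes. The sketch below uses Python's standard html.parser module. The tag set, the class name AllowlistFilter, and the decision to drop every attribute are illustrative assumptions, not a recommended or verified policy; in particular, links and images are not supported here, and nothing in this sketch has been checked against the browsers listed above.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical policy: formatting-only elements, no attributes at all.
ALLOWED_TAGS = {"p", "br", "b", "i", "em", "strong", "u",
                "ul", "ol", "li", "blockquote", "pre"}
VOID_TAGS = {"br"}
# Elements whose text content is dropped, not just their tags.
DROP_CONTENT = {"script", "style"}


class AllowlistFilter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []   # allowed, non-void tags emitted and not yet closed
        self.skip_depth = 0   # > 0 while inside a DROP_CONTENT element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in ALLOWED_TAGS:
            self.out.append("<%s>" % tag)   # attributes are never copied
            if tag not in VOID_TAGS:
                self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in self.open_tags:
            # Close back to the matching open tag so output stays well nested.
            while self.open_tags:
                top = self.open_tags.pop()
                self.out.append("</%s>" % top)
                if top == tag:
                    break

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(escape(data, quote=True))

    # Comments, declarations, and processing instructions use the default
    # no-op handlers, so they never reach the output.


def sanitize(text):
    f = AllowlistFilter()
    f.feed(text)
    f.close()
    while f.open_tags:   # close anything the email left open
        f.out.append("</%s>" % f.open_tags.pop())
    return "".join(f.out)
```

For example, sanitize('<p onclick="x()">Hi <b>there</b><script>alert(1)</script></p>') yields <p>Hi <b>there</b></p>. Because every output tag comes from a fixed set of string templates and every piece of text passes through escape, the set of possible outputs is easy to describe, which may help with the security argument.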

For this exercise, security matters more than functionality, but your approach does need to be generalizable.

I want you to come up with a design, implement it, document your basic architecture and security argument, and submit both the document and the code. Your submission should contain at least three files:

A write-up. Document the approach you've used and the conceptual basis for it. Sketch the security argument why one should expect your scheme to be secure, in enough detail to convince an independent security consultant. This should be an ASCII text file, and it doesn't have to be too lengthy; a page or so should be enough. Describe both the policy you are enforcing (e.g., the restrictions you place on the HTML content) as well as the method you're using for enforcing that policy (e.g., the implementation strategy for ensuring that the restrictions are fully and accurately enforced).
A Makefile with everything needed to compile your program. If I run make, it should do everything needed to compile your program and finally generate in the current directory an executable file called htmlfilter. This program should read an untrusted HTML file from stdin and write a sanitized HTML file to stdout.
Source files. Include any source files needed to build the executable. Don't include the executable itself; I will run make myself. You can use any well-supported language you like (e.g., C, C++, Java, Perl, Python, Ruby, ML, OCaml, Haskell, bash script), though I do need to be able to run it on a Linux system. To avoid any difficulties, please make your program as portable as possible. You can test your program by logging onto any of the instructional Linux servers (ssh to any machine in the range bcom16.eecs - bcom23.eecs), if you wish.
From within the directory where the above files are found, run
tar cf your-lastname.tar .
Then, email this file as an attachment to cs261hw1 at taverner.cs.berkeley.edu by the due date. I will be using automated scripts to run your programs, so please do follow the above framework. If it helps, here is reference code that demonstrates the required format: ref.tar.

Some hints/suggestions:

  1. The approach you take is likely to be the most important decision you make. Focus on the approach and architecture. I'm especially looking to see a well-thought-out mechanism for enforcing your policy.
  2. You are welcome to use third-party HTML parsing libraries. Writing your own HTML parser is no fun. (But you may not reuse a library that solves exactly this homework problem.)
  3. Feel free to keep your implementation simple. If you are writing more than a few hundred lines of code, you're probably working too hard.
  4. You'll probably need to strip out all JavaScript, as by default it can cause side effects and violate the security policy outlined above (for instance it could interfere with the TepidMail web site and have other undesirable effects).
  5. You'll probably also need to do something about other executable content like Flash or Java, as by default they tend to have similar powers.
  6. You may want to review an HTML primer or reference document to refresh your memory about the format of HTML and the semantics of various aspects of HTML.
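
Regarding hints 4 and 5: script can also arrive through URLs (for example, a javascript: link), so any filter that keeps links or images needs to check attribute values as well as tag names. The sketch below shows one conservative way to do that, by rejecting anything that is not an absolute http or https URL made of a restricted character set. The function name safe_url, the scheme list, and the character set are assumptions chosen for illustration; rejecting rather than repairing odd input is a deliberate choice, since some browsers tolerate whitespace or control characters in places a naive check would not expect.

```python
import re
from urllib.parse import urlsplit

# Hypothetical policy: absolute web URLs only.
ALLOWED_SCHEMES = {"http", "https"}

# Conservative character set. Spaces, control characters, quotes, angle
# brackets, backslashes, and non-ASCII characters cause rejection.
SAFE_URL_CHARS = re.compile(r"[A-Za-z0-9\-._~:/?#@!$&()*+,;=%]+")


def safe_url(value):
    """Return the URL unchanged if it passes the allowlist, else None."""
    if not SAFE_URL_CHARS.fullmatch(value):
        return None
    try:
        parts = urlsplit(value)
    except ValueError:
        return None
    # Require an explicit allowed scheme and a host; relative and
    # scheme-relative URLs are rejected.
    if parts.scheme.lower() not in ALLOWED_SCHEMES or not parts.netloc:
        return None
    return value
```

A filter would call safe_url on each href or src value and drop the attribute (or the whole element) when it returns None. A value that passes still has to be HTML-escaped when it is written into the output attribute, since & is in the permitted set.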