CS 261 Homework 1

Instructions

This problem set is due Friday, 14 September.

Work on your own for this homework. You may use any source you like (including other papers or textbooks), but if you use any source not discussed in class, you must cite it.

Question 1

This homework asks you to design and implement an HTML filter so that I can safely view untrusted HTML content. This method must not harm my machine even though these pages come from an untrusted source, and even though my web browser is too complex for me to have full faith in its ability to safely handle totally untrusted HTML pages. The filter might be used, for instance, for filtering HTML email before displaying it.

You're going to write me a sanitizing filter that I can use something like this:

./htmlfilter < scarystuff.html > safe.html
firefox safe.html

I have two goals:
Security:
This procedure must not, under any circumstances, cause any harm to my system. Ideally, using this procedure to view HTML files should be as harmless as viewing an ASCII text file with, say, /bin/more; note that even if an attacker supplies the entire contents of an ASCII email, viewing it with /bin/more cannot harm my machine, so /bin/more is in some sense the gold standard. In particular, viewing untrusted content using your HTML filter and my favorite web browser should not cause any lasting side effects on my machine; it should not leak any confidential information (e.g., the contents of files on my hard disk, or information about what I'm viewing in another window with the same browser); and it should not endanger the integrity of my machine (e.g., by tampering with a different web document that I'm viewing in another window using the same browser). Your scheme must not only be secure; it must also be verifiably secure. You will have to provide an assurance argument for why it is reasonable to believe that your filter achieves this goal.
Functional:
In an ideal world, your filter would allow me to view as much of the HTML content as possible -- except where this would conflict with the previous requirement, in which case security is more important than functionality. For instance, a filter that ignores its input and always outputs the empty HTML page is not very useful. Thus, your solution should be at least minimally useful for viewing the textual content of HTML emails. However, I don't really care whether I get to see pretty pictures, dancing pigs and other fancy decorative stuff or not. Also, feel free to keep your implementation simple and to omit support for complex functionality. This is intended only as a proof of concept exercise. To keep this homework problem tractable, you can err on the side of omitting functionality in your implementation (though it might be nice if your approach can be generalized to support as much functionality as possible).
Security matters more than functionality; my threshold for security will be pretty high, while my threshold for functionality will be very low.
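
To make the trade-off between the two goals concrete, here is a minimal sketch of one possible approach, not the required design. It assumes a whitelist policy: a handful of structural tags are re-emitted as fixed strings with no attributes, the contents of script and style elements are discarded, and all other text is escaped. The names WhitelistFilter, ALLOWED_TAGS, DROP_CONTENT and sanitize are hypothetical, and the choice of Python's standard html.parser module is an assumption made for illustration.

```python
#!/usr/bin/env python3
"""Sketch of a whitelist-based HTML filter (illustrative names, assumed policy)."""
import sys
from html import escape
from html.parser import HTMLParser

# Assumed policy: only these structural tags survive, always without attributes.
ALLOWED_TAGS = {"p", "br", "b", "i", "em", "strong", "ul", "ol", "li",
                "pre", "blockquote"}
# Elements whose contents are discarded instead of being shown as text.
DROP_CONTENT = {"script", "style"}


class WhitelistFilter(HTMLParser):
    # Comments, doctypes and processing instructions hit the default
    # no-op handlers of HTMLParser, so they are dropped.
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # output fragments
        self.skip_depth = 0  # > 0 while inside a DROP_CONTENT element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
        elif tag in ALLOWED_TAGS and self.skip_depth == 0:
            # Emit a fixed string; attrs are never copied to the output.
            self.out.append("<" + tag + ">")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in ALLOWED_TAGS and tag != "br" and self.skip_depth == 0:
            self.out.append("</" + tag + ">")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Escape & < > " ' so that text cannot open a tag or an entity.
            self.out.append(escape(data, quote=True))


def sanitize(text):
    """Return a pure-ASCII, sanitized version of the HTML in `text`."""
    parser = WhitelistFilter()
    parser.feed(text)
    parser.close()
    joined = "".join(parser.out)
    # Non-ASCII characters become numeric character references.
    return joined.encode("ascii", "xmlcharrefreplace").decode("ascii")


def main():
    raw = sys.stdin.buffer.read().decode("utf-8", errors="replace")
    sys.stdout.write(sanitize(raw))


if __name__ == "__main__":
    main()
```

For example, under this assumed policy sanitize("<p onclick='x()'>hi</p>") yields "<p>hi</p>". An assurance argument for a design like this would probably rest on the observation that the output is assembled only from fixed tag strings and escaped text, so quirks in the input parser can change what is displayed but should not let attacker-chosen markup through. Whether that is enough for a given browser is exactly what the README has to argue.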

I want you to come up with a design, implement it, document your basic architecture and assurance argument, and submit both the document and the code. Your submission should contain at least three files:

README
Document the basic architecture you've used and the theory of operation for your scheme. Sketch the assurance argument for why one should expect your scheme to be secure. This should be an ASCII text file, and it doesn't have to be too lengthy; a page or so should be enough. You might want to describe both the policy you are enforcing (e.g., the restrictions you're trying to place on the HTML content) as well as the method you're using for enforcing that policy (e.g., the implementation strategy for ensuring that the restrictions are fully and accurately enforced).
Makefile
A Makefile with everything needed to compile your program. If I run make, it should do everything needed to compile your program and finally generate in the current directory an executable file called htmlfilter. This program should read an untrusted HTML file from stdin and write a sanitized HTML file to stdout.
Source files
Include any source files needed to build the executable. Don't include the executable itself; I will run make myself. You can use pretty much any well-supported language you like (e.g., C, C++, Java, Perl, Python, Ruby, ML, OCaml, bash script) as long as it will work on my Linux system. However, to avoid any difficulties, please take care to make your program as portable as possible. I encourage you to test your code on the EECS instructional Linux servers (ilinux1.eecs.berkeley.edu, ilinux2.eecs.berkeley.edu, etc.).
From within the directory where the above files are found, run
tar cf your-lastname.tar .
Then, email this file as an attachment to cs261hw1 at taverner.cs.berkeley.edu by the due date. Because I will be using automated scripts to run your programs, I ask you to follow the above instructions carefully. To illustrate the required format, here is reference code: ref.tar.

Feel free to keep your implementation simple. If you are writing more than a few hundred lines of code, you're probably working too hard.

Some hints: You may want to review an HTML primer or reference document to refresh your memory about the format of HTML and the semantics of various aspects of HTML. You'll probably need to do something about JavaScript and other executable content, as by default it can cause side effects and violate the security policy outlined above.
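
Script can reach the browser by more than one route: script elements, event-handler attributes such as onclick, and javascript: URLs in attributes such as href are three well-known ones. Rather than trying to enumerate every route, one conservative option is to check that the filter's output lies inside a very small sublanguage of HTML in which none of them can be written. The sketch below assumes a hypothetical policy in which the output may contain only escaped text and a few attribute-free tags; ALLOWED_TAGS, SAFE_OUTPUT and is_within_policy are illustrative names, not part of the assignment.

```python
import re

# Assumed policy: these tags only, and never with attributes.
ALLOWED_TAGS = ("p", "br", "b", "i", "em", "strong", "ul", "ol", "li",
                "pre", "blockquote")

# The whole output must be a sequence of:
#   - characters other than < > & " '
#   - one of a few fixed entities, or a decimal character reference
#   - a bare opening or closing tag from ALLOWED_TAGS
SAFE_OUTPUT = re.compile(
    r"(?:[^<>&\"']"
    r"|&(?:amp|lt|gt|quot|#x27|#[0-9]+);"
    r"|</?(?:" + "|".join(ALLOWED_TAGS) + r")>)*"
)


def is_within_policy(html_text):
    """True if html_text uses only the restricted sublanguage above."""
    return SAFE_OUTPUT.fullmatch(html_text) is not None
```

A filter could run a check like this as its last step and emit nothing if the check fails, so that a bug elsewhere in the filter fails closed. Under this assumed policy, is_within_policy("<p onclick=x>hi</p>") is False because the only accepted form of that tag is the literal string "<p>".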