This problem set is due Tuesday, September 16.
Work on your own for this homework. You may use any source you like (including other papers or textbooks), but if you use any source not discussed in class, you must cite it.
The famous web company, Gargle Inc., has hired you to design and implement a safe filter to sanitize untrusted HTML content. They have a webmail service, GargleMail. A GargleMail subscriber can go to the GargleMail website and view their email using a web browser. Gargle Inc. wants to allow people to send HTML email to GargleMail subscribers, but they don't want this to open a pathway for malicious HTML content to harm GargleMail subscribers or their machines. This is complicated by the fact that web browsers are complex and sometimes contain vulnerabilities in their handling of untrusted HTML pages, and it might be nice to find some way to reduce this risk.
You're going to write GargleMail a sanitizing filter that they can invoke on the command line, like this:
./htmlfilter < untrustedemail.html > safeforviewing.html
They will then display the resulting HTML file to the recipient of the email, serving it from the GargleMail webserver (e.g., in a frame).
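To make the command-line contract concrete, here is a minimal sketch of a filter that reads untrusted HTML on standard input and writes a page to standard output. It is a deliberately conservative baseline, not a recommended design: it displays the email's raw source as literal text, so no markup from the input stays live. The names `sanitize` and `main` are assumptions made for illustration and are not part of the assignment's required framework.

```python
import html
import sys


def sanitize(untrusted: str) -> str:
    """Return a page that shows the input as literal text, with no live markup."""
    # html.escape replaces &, <, > and (with quote=True) both quote characters,
    # so nothing from the input can be parsed as a tag or an attribute.
    body = html.escape(untrusted, quote=True)
    return "<html><body><pre>" + body + "</pre></body></html>\n"


def main(stdin=sys.stdin, stdout=sys.stdout) -> None:
    # Read the whole untrusted email, then write the filtered page.
    stdout.write(sanitize(stdin.read()))


# To use this as ./htmlfilter, call main() when the file is run as a script.
```

A baseline like this is easy to argue about, but it shows tags rather than rendered text, so it only barely meets the usefulness goal.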
They have two goals:
Your scheme must not only be secure; it must also be verifiably secure. You will have to provide an assurance argument why it is reasonable to believe that your filter achieves this goal.
For instance, a filter that ignores its input and always outputs the empty HTML page is not very useful. Thus, your solution should be at least minimally useful for viewing the textual content of HTML emails. Ideally, it would also be nice to see inline images. However, other content (e.g., scripts, Flash animations, etc.) doesn't need to be preserved and can be stripped from the original email.
Also, feel free to keep your implementation simple and to omit support for complex functionality. This is intended only as a proof of concept exercise. To keep this homework problem tractable, you can err on the side of omitting functionality in your implementation (though ideally you would choose an approach that can be generalized to support as much functionality as possible).
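One possible way to preserve textual content while omitting everything else is an allowlist: parse the input, re-emit only a small set of known formatting tags with all attributes removed, escape all text, and drop the rest. The sketch below uses Python's standard `html.parser` module. The particular tag set and the names `AllowlistFilter` and `sanitize_html` are assumptions chosen for illustration. It does not balance unclosed tags and is not by itself an assurance argument.

```python
import html
from html.parser import HTMLParser

# Hypothetical allowlist of purely structural/formatting tags.
ALLOWED_TAGS = {"p", "br", "b", "i", "em", "strong", "ul", "ol", "li",
                "pre", "blockquote"}
# Elements whose contents are code or styling rather than message text.
SKIP_CONTENT = {"script", "style"}
# Allowed tags that never take a closing tag.
VOID_TAGS = {"br"}


class AllowlistFilter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # pieces of sanitized output
        self.skip_depth = 0  # > 0 while inside a SKIP_CONTENT element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in ALLOWED_TAGS:
            # Re-emit the tag name only; every attribute is discarded.
            self.out.append("<" + tag + ">")

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            if self.skip_depth > 0:
                self.skip_depth -= 1
        elif (self.skip_depth == 0 and tag in ALLOWED_TAGS
              and tag not in VOID_TAGS):
            self.out.append("</" + tag + ">")

    def handle_data(self, data):
        # Text is escaped again on output so it can never become markup.
        if self.skip_depth == 0:
            self.out.append(html.escape(data))

    # Comments, doctypes and processing instructions use the default
    # no-op handlers, so they are dropped.


def sanitize_html(untrusted):
    parser = AllowlistFilter()
    parser.feed(untrusted)
    parser.close()  # flush any text the parser is still buffering
    return "".join(parser.out)
```

Because the output is rebuilt from scratch rather than edited in place, anything the filter does not explicitly recognize is omitted, which errs on the side of dropping functionality.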
I want you to come up with a design, implement it, document your basic architecture and assurance argument, and submit both the document and the code. Your submission should contain at least three files:
tar cf your-lastname.tar .
Then, email this file as an attachment to cs261hw1 at taverner.cs.berkeley.edu by the due date.
I will be using automated scripts to run your programs, so please do follow the above framework.
If it helps, here is reference code that demonstrates the required format: ref.tar.
Feel free to keep your implementation simple. If you are writing more than a few hundred lines of code, you're probably working too hard.
Some hints: You may want to review an HTML primer or reference document to refresh your memory about the format of HTML and the semantics of various aspects of HTML. You'll probably need to strip out all JavaScript, as by default it can cause side effects and violate the security policy outlined above (for instance, it could interfere with the GargleMail web site and have other undesirable effects). You'll probably also need to do something about other executable content like Flash or Java, as by default they tend to have similar powers.
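JavaScript can appear not only in script elements but also in event-handler attributes (such as `onerror`) and in `javascript:` URLs, so a filter that keeps any tag with attributes has to deal with those as well. As a hedged illustration, the sketch below rebuilds an `<img>` tag from the attribute list a parser reports, keeping only a `src` whose scheme is on a short allowlist and discarding every other attribute. The function name `safe_img_tag` and the choice of allowed schemes are assumptions for illustration. Whether loading remote images is acceptable at all is a design decision this sketch does not settle.

```python
import html
from urllib.parse import urlsplit

# Hypothetical policy: only absolute http(s) image URLs are kept.
ALLOWED_IMG_SCHEMES = {"http", "https"}


def safe_img_tag(attrs):
    """Given (name, value) pairs from an <img> tag, return a rebuilt tag or ''."""
    src = dict(attrs).get("src")
    if not src:
        return ""  # no usable src: drop the image entirely
    src = src.strip()
    try:
        parts = urlsplit(src)
    except ValueError:
        return ""  # unparseable URL: drop rather than guess
    # Allowlist the scheme instead of trying to blocklist "javascript:",
    # and require a host so relative URLs are rejected too.
    if parts.scheme.lower() not in ALLOWED_IMG_SCHEMES or not parts.netloc:
        return ""
    # Only src is re-emitted; event handlers and all other attributes vanish.
    return '<img src="' + html.escape(src, quote=True) + '">'
```

Checking for an allowed scheme, rather than searching for forbidden ones, means an unfamiliar or oddly encoded scheme is rejected by default.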
Clarification (9/3): You can use third-party libraries (e.g., HTML parsers, etc.) if you like.