CS 289A: Machine Learning
(Spring 2024)
Project
20% of final grade.
The project should be done in teams of 2–3 students.
Please find a partner.
Please discuss your ideas with one of the Project Teaching Assistants
before submitting your initial proposal.
The Project TAs, and their areas of expertise, are:
- Pierre Boyeau, pierreboyeau@berkeley.edu:
single-cell RNA modeling using deep generative models.
- Charles Dove, charles_dove@berkeley.edu:
machine learning, biomedical imaging, photonic design, artificial vision,
quantum mechanics, and quantum field theory.
- Norman Mu, thenorm@berkeley.edu:
deep learning robustness and safety;
multimodal visual recognition; large language models/assistants.
Deliverables
- Initial proposal, due Friday, April 12
- Project video (maximum 3 minutes), due Monday, May 6,
50% of score
- Final report (maximum 8 pages), due Tuesday, May 7,
50% of score
Overview
The project theme may be anything related to machine learning techniques
discussed in class, including
- critically revisiting a published paper (including reproducing
the experiments, generating new graphs and visual representations, and
discussing the results);
- writing a literature review in a specific domain
(e.g., adversarial examples, transfer learning, active learning,
meta-learning) and making a critical comparison
(ideally on a standard benchmark dataset);
- conducting original theoretical research (e.g., attack one of the COLT
open problems: 2014, 2015);
- conducting original practical research by applying
machine learning methods to a public or private dataset
(see project ideas below).
You are encouraged to design a project that
is related to your research outside this course;
we really hope that the project will help you make progress in
your primary research duties (or a teammate's).
However, please be honorable and don't suggest a project that
you've already fully completed as part of your research.
Initial proposal
The initial proposal is primarily a proposal, and need not be long.
Write a few paragraphs describing what you have decided to do.
You may have any number of figures and references.
- Include background information: What is the application domain or
field of research? Why is the problem important?
What specific questions will you try to answer?
- Talk about what data sources you plan to use, including
the number(s) of samples and number(s) of features.
If it's visual data, include an illustration if possible.
- Explain what methods you are planning to use and why.
- If you have done preliminary work, explain what you have already done
(e.g., downloaded and played with data, tried k-nearest neighbors,
did a mock-up of the user interface, etc.).
- What will be the core of the work for your project?
(That is, how do you expect to spend most of your time?)
Do you expect to be judged primarily on your writing, your programming,
your exhaustive exploration of methods and data, something else, or
some combination thereof?
(You are welcome to use any and all libraries and codes written by
others, but you should be clear on what your substantial new
contribution will be.)
- The initial proposal is not graded. Its purpose is to make sure
you start early and to let us give you feedback on your idea.
- Please submit the initial proposal through Gradescope.
Video
- The video should be clear and understandable, describing everything
you think is important about your project
(motivation, description, techniques, results, etc.).
- The video needs to be self-contained: any CS 289A student should
be able to understand what you did (at least at a high level)
without consulting any other materials.
- You can make the video as simple as slides with a voice overlay, or
as fancy as you want.
- As long as it is clear and understandable, you will not be graded on
the fanciness of the video. Content is what will matter.
(Fanciness might be fun, though.)
- You must upload the video on YouTube and provide us with the link.
You may choose to keep the video private (i.e., only those with
the link can view it), in which case only the instructors will view it.
You can make the video public if you want to.
- Important:
The video can be at most 3 minutes long.
This is a very strict requirement;
a video of length 3 minutes and 1 second does not count.
The length is counted as whatever YouTube says it is.
In case you are worried about how you will fit an entire class project
into three minutes, take a look at these videos, which fit
an entire Ph.D. thesis into three minutes.
Final report
- We encourage you to use a template from
your favorite machine learning conference (e.g., NIPS or ICML).
- There is no minimum length requirement. The maximum length is 8 pages.
- The submission must be made through Gradescope.
Any one person from the group can submit the report.
Please include the full names, student IDs, and email addresses for
all the members of the group.
- Also, for inspiration, here are some of
the final projects from
a Neural Networks class at Stanford.
Grading criteria
The video and the final report will be graded on five criteria.
- Relevance: should be related to machine learning techniques.
- Usefulness: should answer good questions or
solve problems worth solving.
(The questions should be clearly stated in the proposal.)
- Soundness: choose data sets with enough examples to get
statistically significant results;
conduct sound numerical experiments
(split the data into training/validation/test sets);
make comparative result tables using validation or cross-validation;
use the test set only for final assessment;
include error bars if appropriate;
add graphs and other good means of visualization
(e.g., projections onto principal components);
provide sound proofs;
if you choose a literature review, mention the most important papers in
the area and give proper credit.
- Clarity/presentation: good paper organization, good bibliography,
enough graphs and visual support, length should not exceed 8 pages.
- Novelty/originality: we require neither novelty nor originality, but
either could add a few points if you haven't already maxed out your score.
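As a rough illustration of the Soundness criterion above, the sketch below holds out a test set, selects a hyperparameter by cross-validation on the remaining data, reports an error bar, and touches the test set only once at the end. Everything in it is an assumption made for illustration: the synthetic data, the choice of k-nearest neighbors, the candidate values of k, and the helper names (make_data, knn_predict, accuracy, cross_validate). It is not a required workflow.

```python
import random
import statistics

def make_data(n, seed=0):
    """Synthetic two-class data: two 2-D Gaussian blobs (illustrative only)."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        label = i % 2
        center = 1.5 if label else -1.5
        data.append(((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label))
    rng.shuffle(data)
    return data

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (k must be odd)."""
    nearest = sorted(
        train, key=lambda p: (p[0][0] - x[0]) ** 2 + (p[0][1] - x[1]) ** 2
    )[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

def accuracy(train, held_out, k):
    """Fraction of held-out points classified correctly."""
    return sum(knn_predict(train, x, k) == y for x, y in held_out) / len(held_out)

def cross_validate(data, k, folds=5):
    """Return the per-fold validation accuracies for a given k."""
    scores = []
    for f in range(folds):
        val = data[f::folds]
        train = [p for i, p in enumerate(data) if i % folds != f]
        scores.append(accuracy(train, val, k))
    return scores

data = make_data(300)
# Hold out the test set first; it plays no part in model selection.
dev, test = data[:240], data[240:]

results = {}
for k in (1, 5, 15):
    scores = cross_validate(dev, k)
    mean = statistics.mean(scores)
    stderr = statistics.stdev(scores) / len(scores) ** 0.5  # error bar
    results[k] = (mean, stderr)
    print(f"k={k:2d}  CV accuracy = {mean:.3f} +/- {stderr:.3f}")

# Choose k from the cross-validation results, then assess once on the test set.
best_k = max(results, key=lambda k: results[k][0])
test_acc = accuracy(dev, test, best_k)
print(f"chosen k={best_k}, test accuracy = {test_acc:.3f}")
```

The printed table is the kind of comparative result the criterion asks for. The error bars here are standard errors over five folds, which is only one of several reasonable choices.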
Project ideas
The ideas in this list fall mainly under the fourth category,
practical research.
If you prefer to revisit an important paper, simply pick a paper.
If you prefer to conduct a literature review,
simply pick a machine learning topic that interests you.
If you prefer to conduct theoretical research, you'd better already know
what you're doing.
We'll start by listing projects and data sets of particular interest to
our Teaching Assistants, then add some others.
Pierre's suggestions
- Can you build a zero-shot classifier to identify the types of cells in a tissue from their gene expression? The CELLxGENE platform has collected thousands of annotated single-cell-resolution gene expression datasets in the hope of refining biologists' understanding of cell identity and function.
- Understand how drug perturbations affect gene expression in cells, e.g., with this dataset.
- Can you detect and segment cells from fluorescence microscopy images using supervised learning? The Cell Tracking Challenge contains multiple datasets of segmented multi-timepoint fluorescence microscopy images that can be used to benchmark state-of-the-art cell segmentation algorithms.
- Compare the best strategies for predicting protein function from amino acid sequences. The PFAM database contains multiple sequence alignments of protein families that can be used to benchmark protein function prediction algorithms.
Charles' suggestions
Norman's suggestions
A standard paradigm for training large neural networks has emerged across the domains of natural language processing, image recognition, robotics, and now genetics: randomly initialize a model, pre-train on a big blob of diverse data, then fine-tune on a representative sample of the specific task(s) of interest, before shipping the final model for production use. This process works extremely well on static tasks and benchmarks, but the dream of ML models which continue to learn after deployment into the real world still eludes us. Here are five papers which gesture in the same general direction; they might be interesting to build off of or combine pieces from. You might consider re-implementing one of these papers, updating methods from the “older” papers with newer techniques, or adapting existing code to another task or data modality.
Other Ideas
- Try your hand at the distribution shift challenge WILDS. Can you improve upon the basic ERM framework by leveraging diverse datasets?
- The method of integrated gradients is an attribution technique that attempts to attribute some fraction of the decision made by a neural network to individual features of the input. Implement integrated gradients and briefly evaluate it.
- Businesses love to promote themselves on Yelp by leaving fake positive reviews. With this dataset, predict which reviews are likely to have been written by the business itself.
- Browsers typically download files to the Downloads folder (or another fixed, set folder). Develop a method for automatically placing files in the appropriate folder. You can constrain yourself to text documents, or to images.
- Image captioning using RNN/LSTM is also an important topic. Try to generate meaningful text from images/videos. For example, take a look at the COCO Captioning Challenge.
- Can you train a fine-grained image classifier? There are many specialized datasets, like this one for dog breeds. This can be applied with the project above or in isolation. Architectures? See this paper.
- Reinforcement Learning tackles sequential decision-making problems where the only information about the system is received through interaction. A great resource for understanding RL algorithms is Spinning Up. Pick your favorite algorithm from there and apply it to a more complex simulated environment (e.g., MuJoCo is free for students).
- If you want harder problems, involving various aspects of ML, you can also check this: OpenAI Requests for Research
- Fake news 1. Can you train a system to decide if two articles are related and agree? The Fake News Challenge gives access to a dataset for this. You can propose your own solution and see if you get close to the winners!
- Fake news 2. Can you classify articles by bias and factuality? There is a dataset (and SVM algorithm) built for this purpose. Their code is also available, so you should attempt a substantial innovation beyond it.
- Can you fool a classifier with a real object? There is work that makes traffic-sign classification systems (trained on the LISA dataset) predict that a stop sign is a 45 mph speed limit sign, and other work that makes an image classifier predict that a 3D-printed turtle is a rifle.
- Can you visualize the features learned by a Deep Neural Network? When training Deep Neural Networks, the hidden state at each layer can be understood as features or representations useful for performing the desired task. One tool for visualizing these features is guided backpropagation, and more recently other dataset-wide visualizations have been proposed. See this blog post.
- Dependently typed programming languages, such as Idris, provide safety guarantees that are not available in the more mainstream languages. Can any of these features be used to improve the data analytic stack, or developer ergonomics?
- Try your hand at a computer vision challenge with one of these satellite image datasets. Can you predict building footprints, segment out roads, or even generate a map from a satellite image?
- Ever wish you could read someone's mind? OpenNeuro hosts a variety of fMRI, EEG, ECoG, etc. data.
- Can you transcribe music directly from audio files? MusicNet provides a curated collection of labeled classical music.
- The price of Bitcoin has been erratically rising and falling over the past couple of years. Can you build a model to predict the price of Bitcoin? Try your hand at this dataset.
- Can you predict a person’s Myers–Briggs personality type from the content they post on social media? This dataset makes for a fun challenge.
- Can you identify questions that have the same intent? Try out this dataset of questions posted on Quora.
- Can you identify the genre of a song from its spectrogram or other audio features? This dataset provides labeled audio tracks for classification.
- Build and test a computer vision model that performs image-to-image translation with adversarial networks.
- Play with offline reinforcement learning and the dataset d4rl: A benchmark for offline reinforcement learning.
- Sentiment analysis of written sentences, for example on movie reviews from the Rotten Tomatoes dataset.
- Play with robotic imitation learning in simulation, for example with robosuite: A Modular Simulation Framework and Benchmark for Robot Learning.
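For the integrated gradients idea in the list above, here is a minimal sketch of the technique on a toy model. It attributes the output F(x) to each input feature by averaging the gradient of F along the straight line from a baseline input to x, then scaling by the difference between x and the baseline. The logistic model, its weights, the all-zeros baseline, and the function names are all hypothetical choices for illustration. A real project would use a trained network and automatic differentiation instead of the hand-written gradient.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical toy model: logistic regression with fixed weights.
WEIGHTS = [2.0, -1.0, 0.5]
BIAS = -0.25

def model(x):
    """Model output F(x), a probability in (0, 1)."""
    return sigmoid(sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS)

def gradient(x):
    """Analytic gradient of the model output with respect to the input."""
    p = model(x)
    return [p * (1.0 - p) * w for w in WEIGHTS]

def integrated_gradients(x, baseline, steps=200):
    """Midpoint Riemann-sum approximation of the path integral of gradients."""
    total = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps
        # Point on the straight line from the baseline (alpha=0) to x (alpha=1).
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(gradient(point)):
            total[i] += g
    # Average gradient along the path, scaled by (x_i - baseline_i).
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]

x = [1.0, 2.0, -1.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline)
print("attributions:", [round(a, 4) for a in attributions])

# Completeness check: the attributions should sum to F(x) - F(baseline).
gap = sum(attributions) - (model(x) - model(baseline))
print("completeness gap:", abs(gap))
```

The completeness check at the end is one simple way to "briefly evaluate" an implementation: if the attributions do not sum to approximately F(x) - F(baseline), either the gradient or the path integration is wrong, or more steps are needed.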
Other useful data sources