Overview

Over the past few years, Deep Learning has become ubiquitous in our society, with applications spanning search, image understanding, apps, mapping, medicine, drones, self-driving cars, robotics, and art. At the core of many of these applications are visual recognition tasks, such as image classification and object detection. Recent developments in neural network approaches have significantly improved the performance of state-of-the-art visual recognition systems. Beyond supervised learning, self-supervised learning has gained widespread use in recent years, particularly in vision and language modeling. This approach extracts labels for free from unlabeled data, allowing models to be trained on unlabeled datasets in a supervised manner. During this course, students will gain foundational knowledge of deep learning algorithms and neural network architectures, as well as practical experience in building, training, and fine-tuning neural networks. They will also gain an understanding of cutting-edge research topics in areas such as vision, language, medicine, generative AI, robotics, and more.

  • Prerequisites:
    • Programming: You should be familiar with algorithms and data structures. Familiarity with Python or similar languages for numerical programming will be helpful but is not strictly required. Reference: Python (Basics).
    • Probability: You should have been exposed to probability distributions, random variables, expectations, etc. References: Linear Algebra (Essence, Chap 1-4), Multivariate Calculus (Essence, Chap 1, 3-4, 8-9).
    • Machine Learning: Some familiarity with machine learning will be helpful but not required; we will review important concepts that are needed for this course.
  • Lecture:
    Lectures will be Monday, Wednesday, and Friday at Fulton Hall 415, from 9:00am to 9:50am.
  • Textbooks and Materials:
    There is no required textbook for the course. However, the following books (available for free online) can be useful as references on relevant topics. You may also find the tutorial Deep Learning with PyTorch: A 60 Minute Blitz helpful.
  • Grading Policy:
    • No quizzes/exams.
    • 10%: Attendance (Including Asking Questions)
    • 60%: Homework Assignments (10% × 6)
    • 30%: Final Project (Proposal, Presentation, and Report)
    You will complete six programming assignments over the course of the semester. All homework assignments will be in Python, and will use PyTorch on Google Colab.
    Instead of a final exam, at the end of the semester you will complete a project working in groups of at most 3 students.

Staff

Instructor
Teaching Assistant

Guest Speakers

Chunyuan Li (Microsoft Research)
Tianhong Li (MIT CSAIL)
Ge Yang (MIT CSAIL)
Zifan Shi (Stanford / HKUST)
Guohao Li (University of Oxford)
Hanzi Mao (Nvidia Research)
Hao He (MIT CSAIL)
Lijie Fan (MIT CSAIL)

Tentative Schedule (subject to change)

Theme Date Topic Materials Assignments

Module I: Deep Learning Basics

ML Basics
Wed, Jan. 17 Lecture 1: Course Introduction
Course overview,
Course logistics
Fri, Jan. 19 Lecture 2: Machine Learning Basics
Machine learning overview
ML: pipeline, tasks
Linear regression, Polynomial regression
Mon, Jan. 22 Lecture 3: Linear regression
Optimization: gradient-based solution, closed-form solution
Underfit, Overfit, Regularization, Generalization
Assignment 1 out
  • [Lab1a: Python Basic]
  • [Lab1b: Linear Regression]
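For a concrete picture of the two solution strategies listed for Lecture 3, here is a minimal, illustrative sketch (not part of the assignments; the synthetic data, coefficients, and learning rate are made up for the demo) that fits the same least-squares problem with the closed-form solution and with gradient descent:

```python
import torch

# Synthetic data: y = 3x + 2 + noise (coefficients chosen arbitrarily for the demo)
torch.manual_seed(0)
x = torch.rand(100, 1)
y = 3 * x + 2 + 0.1 * torch.randn(100, 1)
X = torch.cat([x, torch.ones_like(x)], dim=1)   # append a bias column

# Closed-form least-squares solution (normal equations, solved with a stable solver)
w_closed = torch.linalg.lstsq(X, y).solution

# Gradient-descent solution of the same mean-squared-error objective
w = torch.zeros(2, 1, requires_grad=True)
for _ in range(2000):
    loss = ((X @ w - y) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        w -= 0.5 * w.grad     # fixed learning rate of 0.5
        w.grad.zero_()

print(w_closed.flatten(), w.detach().flatten())  # both should be close to [3, 2]
```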
  • Wed, Jan. 24 Lecture 4: Neural Network
    Binary Classification / Multi-Class Classification
    Sigmoid / Softmax
    Cross-Entropy Loss
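To connect the softmax output with the cross-entropy loss covered in Lecture 4, a small sketch (illustrative only; the class scores are made up) showing that a hand-computed cross-entropy matches PyTorch's built-in loss, which fuses log-softmax and negative log-likelihood:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])        # made-up scores for 3 classes
target = torch.tensor([0])                       # index of the correct class

probs = F.softmax(logits, dim=1)                 # scores -> probabilities
loss_manual = -torch.log(probs[0, target[0]])    # cross-entropy of the true class
loss_builtin = F.cross_entropy(logits, target)   # fused log-softmax + NLL

print(probs, loss_manual.item(), loss_builtin.item())   # the two losses agree
```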
    Fri, Jan. 26 Lecture 5: Multi-Layer Perceptron (MLP)
    Linear Problems / Non-Linear Problems
    Feature transforms Model: Fully-connected networks
    Computational Graph
    Optimization: Backpropagation
    Mon, Jan. 29 Lecture 6: Activation Functions and Optimization
    Activation Functions: ReLU, Sigmoid, tanh, Leaky ReLU, ELU
    Regularization
    Weight decay
    Assignment 1 due (Jan. 30)
    Deep Learning Architectures
    Wed, Jan. 31 Lecture 7: Convolutional Neural Networks (CNNs)
    Weight initialization, dropout, hyperparameters
    Universal approximation theorem
    Intro to CNNs -- Convolution
    Fri, Feb. 2 Lecture 8: Convolutional Neural Networks (CNNs)
    Convolution: kernel, receptive field, stride
    Padding
    Learning convolutional filters
    One layer (breadth): multiple kernels
    K layers (depth): nonlinearity in between
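For reference, the spatial output size of a convolution with input width W, kernel size K, padding P, and stride S is floor((W - K + 2P) / S) + 1. The snippet below (shapes chosen only for illustration, not course-provided code) checks this formula against PyTorch's nn.Conv2d:

```python
import torch
import torch.nn as nn

# Output size of a convolution: out = (W - K + 2*P) // S + 1
W, K, P, S = 32, 3, 1, 1
print((W - K + 2 * P) // S + 1)    # 32: "same" padding for a 3x3 kernel

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)
x = torch.randn(1, 3, 32, 32)      # a single 32x32 RGB image
print(conv(x).shape)               # torch.Size([1, 16, 32, 32])
```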
    Mon, Feb. 5 Lecture 9: Convolutional Neural Networks (CNNs)
    Pooling
    AlexNet
    Batch Normalization
    ResNet + Residual Blocks
    Wed, Feb. 7 Lecture 10: CNN Architectures
    AlexNet, VGGNet, GoogLeNet, BatchNorm, ResNet
    Deep Learning Framework
    Fri, Feb. 9 Lecture 11: Training Neural Networks
    Activation functions
    Data preprocessing
    Weight initialization
    Data augmentation
    Regularization (Dropout, etc)
    Learning rate schedules
    Hyperparameter optimization
    Transfer learning
    Assignment 2 out
  • [Lab2a: Gradient Descent]
  • [Lab2b: PyTorch]
  • [Lab2c: Linear Classifier]
  • Mon, Feb. 12 Lecture 12: Deep Learning Framework
    Hyperparameter optimization
    Transfer learning
    PyTorch
    Dynamic vs Static graphs
    Wed, Feb. 14 Lecture 13: PyTorch Review Session
    PyTorch
    Final project overview
    Life cycle of a Machine Learning System
    Fri, Feb. 16 Lecture 14: Recurrent Neural Networks (RNNs)
    Life cycle of a Machine Learning System
    Sequential models use cases
    CNNs for sequences
    RNNs
    Assignment 2 due (Feb. 18)
    Mon, Feb. 19 Lecture 15: Recurrent Networks: Stability analysis and LSTMs
    Gradient Explosion
    LSTM, GRU
    Language modeling
    Assignment 3 out:
  • [Lab3: Autograd and NN]
  • Wed, Feb. 21 Lecture 16: Recurrent Networks: Stability analysis and LSTMs (2)
    Fri, Feb. 23 Lecture 17: Attention and Transformers
    Self-Attention
    Transformers
    Mon, Feb. 26 Lecture 18: Attention and Transformers (2)
    Multi-head Self-Attention
    Masked Self-Attention
    Assignment 3 due (Feb. 27)
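To make the attention material from Lectures 17-18 concrete, here is a minimal single-head scaled dot-product self-attention sketch, including the causal mask used for language modeling (the dimensions and weight matrices are illustrative, not course-provided code):

```python
import math
import torch

def self_attention(x, Wq, Wk, Wv, mask=None):
    """Single-head scaled dot-product self-attention over a sequence x of shape (T, d)."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / math.sqrt(K.shape[-1])        # (T, T) pairwise similarities
    if mask is not None:                             # e.g. a causal mask for language modeling
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return torch.softmax(scores, dim=-1) @ V         # attention-weighted sum of values

T, d = 5, 8
x = torch.randn(T, d)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
causal = torch.tril(torch.ones(T, T))                # lower-triangular (masked) attention
print(self_attention(x, Wq, Wk, Wv, causal).shape)   # torch.Size([5, 8])
```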

    Module II: Advanced Topics on Deep Learning

    Vision Applications
    Wed, Feb. 28 Lecture 19: BERT and GPTs
    Encoder-Decoder Attention
    Word Embedding
    Pre-training
  • [Slides]
  • Fri, Mar. 1 Lecture 20: Training Large Language Models
    Self-Supervised Learning
    Data Scaling
  • [Slides]
  • Mon, Mar. 11 Lecture 20: Training Large Language Models (2)
    Self-Supervised Learning
    Data Scaling
  • [Slides]
  • [The Practical Guides for Large Language Models]
  • Assignment 4 out:
  • [Lab4: Neural Machine Translation]

  • Project proposal due (Mar. 12)
    Wed, Mar. 13 Lecture 21: Computer Vision: Detection and Segmentation
    Semantic segmentation
    Object detection
    Instance segmentation
  • [Slides]
  • Fri, Mar. 15 Lecture 22: Generative Models (1)
    Unsupervised Learning
    Clustering / PCA
    Autoregressive Models
  • [Slides]
  • Generative and Interactive Visual Intelligence
    Mon, Mar. 18 Lecture 23: Generative Models (2) -- VAEs
    Convolutional AEs, Transpose Convolution
    Variational Autoencoders (VAE)
  • [Slides]
  • [Reading: Convolutional AEs]
  • Assignment 4 Part 1 (LSTM and Attention) due (Mar. 19)
    Wed, Mar. 20 Lecture 24: Generative Models (2) -- VAEs (continued)
    VAE Loss - KL Divergence
    Reparameterization trick
    Conditional VAE
  • [Slides]
  • [KL Divergence]
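A minimal sketch of the reparameterization trick and the closed-form KL term discussed in this lecture (shapes are illustrative; not course-provided code):

```python
import torch

def reparameterize(mu, log_var):
    """Sample z ~ N(mu, sigma^2) while keeping gradients flowing to mu and log_var."""
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)        # the noise is sampled outside the computation graph
    return mu + eps * std

def kl_divergence(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian."""
    return -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=1).mean()

mu = torch.zeros(4, 2, requires_grad=True)
log_var = torch.zeros(4, 2, requires_grad=True)
z = reparameterize(mu, log_var)
print(z.shape, kl_divergence(mu, log_var))  # torch.Size([4, 2]); KL is 0 for N(0, I)
```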
  • Fri, Mar. 22 Lecture 25: Generative Models (3) -- GANs
    Generative Adversarial Networks (GANs)
    Training GANs and challenges
    Applications
  • [Slides]
  • Mon, Mar. 25 Lecture 26: Generative Models (4) -- Diffusion Models
    Denoising Diffusion Probabilistic Models (DDPMs)
    Conditional Diffusion Models
  • [Slides]
  • Assignment 4 Part 2 (Transformers) due (Mar. 28)
    Wed, Mar. 27 Lecture 26: Generative Models (4) -- Diffusion Models (continued)
    Denoising Diffusion Probabilistic Models (DDPMs)
    Conditional Diffusion Models
  • [Slides]
  • Project milestone due (Mar. 31)
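The forward (noising) process of a DDPM can be written in closed form as x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps. A minimal sketch of this step (the linear beta schedule and tensor shapes are illustrative only):

```python
import torch

# Forward (noising) process of a DDPM
T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # linear noise schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)   # cumulative products alpha_bar_t

def q_sample(x0, t, eps):
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps

x0 = torch.randn(8, 3, 32, 32)                   # a batch of "images"
t = torch.randint(0, T, (8,))                    # a random timestep per image
eps = torch.randn_like(x0)
x_t = q_sample(x0, t, eps)                       # the denoiser is trained to predict eps from (x_t, t)
print(x_t.shape)
```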
    Fri, Mar. 29 No Class (Good Friday)
    Mon, Apr. 1 No Class (Easter Monday)
    Wed, Apr. 3 Lecture 27: Self-supervised Learning
    Pretext tasks
    Contrastive representation learning
    Instance contrastive learning: SimCLR and MoCo
    Sequence contrastive learning: CPC

  • [Slides]
  • Fri, Apr. 5 Lecture 27: Self-supervised learning (continued)
  • [Slides]
  • [SimCLR]
  • [MoCo]
  • [MoCo v2]
  • [CPC]
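A simplified contrastive (InfoNCE-style) loss in the spirit of SimCLR, where row i of two augmented views forms a positive pair. Note this is a simplification of the full NT-Xent loss (which also uses within-view negatives), and the embedding sizes are illustrative:

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """Simplified contrastive loss: row i of z1 and row i of z2 are a positive pair."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / temperature           # cosine similarities between all pairs
    labels = torch.arange(z1.shape[0])         # the matching index is the positive
    return F.cross_entropy(logits, labels)

z1, z2 = torch.randn(16, 128), torch.randn(16, 128)  # embeddings of two augmented views
print(info_nce(z1, z2))
```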
  • Mon, Apr. 8
    LLaVA: A Vision-and-Language Approach to Computer Vision in the Wild
    Guest Speaker: Chunyuan Li (Microsoft Research)

    Abstract: The future of AI is in creating systems like foundation models that are pre-trained once and then handle countless downstream tasks directly (zero-shot) or adapt to new tasks quickly (few-shot). In this talk, I will discuss our vision-language approach to achieving “Computer Vision in the Wild (CVinW)”: building a transferable system in computer vision (CV) that can effortlessly generalize to a wide range of visual recognition tasks in the wild. I will first describe the definition and current status of CVinW, and briefly summarize our efforts on benchmarks and modeling. I will then dive into the Large Language-and-Vision Assistant (LLaVA) and its series, including LLaVA-Med, LLaVA-1.5, LLaVA-NeXT, LLaVA-Interactive, and LLaVA-Plus. The LLaVA family represents the first open-source project to exhibit GPT-4V-level capabilities in image understanding and reasoning, and it demonstrates a promising path to building customizable large multimodal models that follow human intent at an affordable cost.

    Reference Papers: [LLaVA], [LLaVA-Med], [LLaVA-1.5], [LLaVA-NeXT], [LLaVA-Interactive], [LLaVA-Plus].

    Wed, Apr. 10
    Learning to and from Predict in Computer Vision
    Guest Speaker: Tianhong Li (MIT CSAIL)

    Abstract: Predictive learning has been a long-standing topic in computer vision and has gained increased attention recently due to the success of large language models. This lecture will introduce several pivotal studies within this domain. We begin with image inpainting, a technique vital for understanding context and filling in missing information. We will then see how predictive learning can facilitate unsupervised representation learning. Finally, we will introduce how predictive learning can be used to generate novel images.


    Reference Papers: [VQGAN], [MAE], [MAGE].

    Fri, Apr. 12 Foundation Priors for Robot Perception: From Neural Radiance Fields to OpenAI Sora
    Guest Speaker: Ge Yang (MIT CSAIL)

    Abstract: Recent developments in Artificial Intelligence have produced a trifecta of new techniques in generative modeling, computer graphics, and representation learning that, once combined, will lead to radical changes in robotics. In this talk, we will study robot perception as an ill-defined inverse problem whose goal is to infer knowledge of the environment from noise and partial observability. We will start with Neural Radiance Fields (NeRFs) and study ways to combine them with prior knowledge from Foundation Models that are trained over internet-scale datasets to give robots the ability to know what is where in their surrounding environment. We will then look at the AI debate over priors vs. data, and discuss how it is affected by recent results from OpenAI Sora, the state-of-the-art AI system for generating videos from text.


    Reference Papers: [CLIP], [NeRF], [Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation (CoRL 2023 Best Paper)].

    Mon, Apr. 15 No Class (Patriots' Day)
    AI for Science
    Tue, Apr. 16 Towards Efficient and High-Quality 3D Generation
    Guest Speaker: Zifan Shi (Stanford / HKUST)

    Abstract: 3D generation has received growing attention due to its potential in modeling the 3D visual world. Despite remarkable advancements, there remains a significant journey ahead. In this talk, we will explore three key aspects of 3D generation. Firstly, we will focus on geometry quality, delving into the design of the discriminator. This crucial component has often been overlooked in many existing 3D generative approaches. Secondly, we will examine the realm of animatable human generation, probing into techniques and challenges associated with this dynamic aspect of 3D modeling. Lastly, we will discuss strategies for constructing a foundational model tailored for 3D generation, aiming to provide a robust framework for further advancements in the field.


    Reference Papers:
    [Improving 3D-aware Image Synthesis with A Geometry-aware Discriminator],
    [Learning 3D-aware Image Synthesis with Unknown Pose Distribution],
    [Gaussian Shell Maps for Efficient 3D Human Generation],
    [GRM: Large Gaussian Reconstruction Model for Efficient 3D Reconstruction and Generation],
    [3D Gaussian Splatting for Real-Time Radiance Field Rendering (Siggraph 2023 Best Paper)].

    Wed, Apr. 17 CAMEL: Communicative Agents for “Mind” Exploration of Large Language Model Society
    Guest Speaker: Guohao Li (University of Oxford)

    Abstract: The rapid advancement of chat-based language models has led to remarkable progress in complex task-solving. However, their success heavily relies on human input to guide the conversation, which can be challenging and time-consuming. This paper explores the potential of building scalable techniques to facilitate autonomous cooperation among communicative agents, and provides insight into their "cognitive" processes. To address the challenges of achieving autonomous cooperation, we propose a novel communicative agent framework named role-playing. Our approach involves using inception prompting to guide chat agents toward task completion while maintaining consistency with human intentions. We showcase how role-playing can be used to generate conversational data for studying the behaviors and capabilities of a society of agents, providing a valuable resource for investigating conversational language models. In particular, we conduct comprehensive studies on instruction-following cooperation in multi-agent settings.

    Reference Papers:
    [CAMEL: Communicative Agents for “Mind” Exploration of Large Language Model Society],
    [https://www.camel-ai.org/]

    Fri, Apr. 19 Segment Anything
    Guest Speaker: Hanzi Mao (Nvidia Deep Imagination Research)
    Time: 1:00 PM - 2:00 PM (Eastern Time), 10:00 AM - 11:00 AM (Pacific Time)

    Abstract: We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using our efficient model in a data collection loop, we built the largest segmentation dataset to date (by far), with over 1 billion masks on 11M licensed and privacy respecting images. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks. We evaluate its capabilities on numerous tasks and find that its zero-shot performance is impressive – often competitive with or even superior to prior fully supervised results. We are releasing the Segment Anything Model (SAM) and corresponding dataset (SA-1B) of 1B masks and 11M images at https://segment-anything.com to foster research into foundation models for computer vision.


    Reference Papers: [Segment Anything (ICCV 2023 Best Honorable Mention Paper)]

    Mon, Apr. 22 Respiration Intelligence: Know Your Health from Your Breathing with an AI Assistant
    Guest Speaker: Hao He (MIT CSAIL)

    Abstract: Respiration is a fundamental life-sustaining function intricately connected to various aspects of human health. With the aid of AI, we can uncover associations between respiration and numerous health conditions. In this lecture, we'll introduce three case studies demonstrating the use of breathing signals to predict blood oxygen saturation, sleep stages, and inflammation. We'll discuss the accuracy and practical applications of these predictive systems, as well as the core AI technologies that power them.


    Wed, Apr. 24 Learning from Synthetic Data from LLMs and Diffusion Models
    Guest Speaker: Lijie Fan (MIT CSAIL)
    Final Projects
    [Week 15-18]
    Fri, Apr. 26 Final Project Presentation (1)
    Mon, Apr. 29 Final Project Presentation (2)
    Wed, May. 1 Final Project Presentation (3)
    Mon, May. 6 Final project report/code due


    Office Hours

    Name Office hours
    Yuan Mon/Tue 3-4pm @ 245 Beacon Rm. 528E
    Gavin M W F 10-11am @ CS Lab
    • Office hours will take place in person (or Zoom if needed).
    • Yuan will hold additional one-on-one AMA office hours Tue 4-5pm/Wed 3-4pm (15-min by appointment)


    Course Information

    This is a challenging course and we are here to help you become a more-AI version of yourself. Please feel free to reach out if you need help in any form.

    1. Get help (besides office hours)

    • Dropbox: The lecture pdfs will be uploaded to Dropbox (follow the link) and you can ask questions there by making comments on the slides directly.
    • Discord: For labs/psets/final projects, we will create dedicated channels for you to ask public questions. If you cannot make your post public (e.g., due to revealing problem set solutions), please DM the TAs or the instructor separately, or come to office hours. Please note, however, that the course staff cannot provide help debugging code, and there is no guarantee that they'll be able to answer last-minute homework questions before the deadline. We also appreciate it when you respond to questions from other students! If you have an important question that you would prefer to discuss over email, you may email the course staff or contact the instructor directly.
    • Support: The university counseling services center provides a variety of programs and activities.
    • Accommodations for students with disabilities: If you are a student with a documented disability seeking reasonable accommodations in this course, please contact Kathy Duggan, (617) 552-8093, dugganka@bc.edu, at the Connors Family Learning Center regarding learning disabilities and ADHD, or Rory Stein, (617) 552-3470, steinr@bc.edu, in the Disability Services Office regarding all other types of disabilities, including temporary disabilities. Advance notice and appropriate documentation are required for accommodations.

    2. Homework submission

    All programming assignments are in Python on Colab, always due at midnight (11:59 pm) on the due date.
    • Install Colab in the browser: Sign in to your Google account, follow the "Link" (to be updated) to the assignments folder, click on lab0.ipynb, click "Open with" and "Connect more apps", and install "Colaboratory".
    • Submission: You need to save a copy of the file in your own Google Drive so that you can save your edits. Afterwards, download the .ipynb file and submit it to Canvas.
    • Final project: In lieu of a final exam, we'll have a final project. This project will be completed in small groups during the last weeks of the class. The direction for this project is open-ended: you can either choose from a list of project ideas that we distribute, or you can propose a topic of your own. A short project proposal will be due approximately halfway through the course. During the final exam period, you'll turn in a final report and give a short presentation. You may use an ongoing research work for your final project, as long as it meets the requirements.

    3. Academic policy

    • Late days: You'll have 4 late days for labs and 4 for psets over the course of the semester. Each time you use one, you may submit a homework assignment one day late without penalty. You are allowed to use multiple late days on a single assignment. For example, you can use all of your days at once to turn in one assignment four days late. You do not need to notify us when you use a late day; we'll deduct it automatically. If you run out of late days and still submit late, your assignment will be penalized at a rate of 10% per day. If you edit your assignment after the deadline, this will count as a late submission, and we'll use the revision time as the date of submission (after a short grace period of a few minutes). We will not provide additional late time, except under exceptional circumstances, and for these we'll require documentation (e.g., a doctor's note). Please note that the late days are provided to help you deal with minor setbacks, such as routine illness or injury, paper deadlines, interviews, and computer problems; these do not generally qualify for an additional extension.
    • Academic integrity: While you are encouraged to discuss homework assignments with other students, your programming work must be completed individually. You may not search for solutions online or use existing implementations of the algorithms in the assignments. Thus it is acceptable to learn from another student the general idea for writing program code to perform a particular task, or the technique for solving a mathematical problem, but unacceptable for two students to prepare their assignments together and submit what are essentially two copies of identical work. If you have any uncertainty about the application of this policy, please check with me. Failure to comply with these guidelines will be considered a violation of the University policies on academic integrity. Please make sure that you are familiar with these policies. We will use the MOSS (moss.pl) tool to check each lab and pset for plagiarism.
    • AI assistants policy:
      • Our policy for using ChatGPT and other AI assistants is identical to our policy for using human assistants.
      • This is a deep learning class and you should try out all the latest AI assistants (they are pretty much all using deep learning). It's very important to play with them to learn what they can do and what they can't do. That's a part of the content of this course.
      • Just like you can come to office hours and ask a human questions (about the lecture material, clarifications about pset questions, tips for getting started, etc), you are very welcome to do the same with AI assistants.
      • But: just like you are not allowed to ask an expert friend to do your homework for you, you also should not ask an expert AI.
      • If it is ever unclear, just imagine the AI as a human and apply the same norm as you would with a human.

    4. Related Classes / Online Resources
    Acknowledgements: This course draws heavily from MIT's 6.869: Advances in Computer Vision by Antonio Torralba, William Freeman, and Phillip Isola, and from Stanford's CS231n: Deep Learning for Computer Vision by Fei-Fei Li. It also includes lecture slides from other researchers, including Andrew Owens, Svetlana Lazebnik, Alexei Efros, Fei-Fei Li, Carl Vondrick, David Fouhey, Justin Johnson, Noah Snavely, and Ava Amini. Special thanks to Hao Wang for insightful and generous advice.