ECCV 2024 Tutorial:


Time is precious: Self-Supervised Learning Beyond Images


30th September, 09:00 to 13:00 CEST, Amber 7 + 8

MiCo Milano






Overview


Self-supervised learning (SSL) has allowed pretraining of neural networks to scale beyond the size of labelled datasets, delivering robust performance without the need for costly annotation. This approach has successfully scaled training datasets to billions of images.

However, state-of-the-art (SoTA) models often learn representations from single-image inputs and therefore lack temporal context. Visual representations derived from such static images are limited to disjoint snapshots of the world. This limitation is particularly pronounced in recent SSL techniques, which are typically trained on meticulously curated, object-centric datasets such as ImageNet. Attempts to scale single-image techniques to larger, less curated datasets such as Instagram-1B have not yielded substantial improvements in performance. A single image, regardless of artificial augmentation, has inherent constraints: it cannot reveal new perspectives of an object or show how events in a scene unfold.

The primary goal of this tutorial is to introduce to the computer vision community the concept of learning robust representations by leveraging the rich information in video frames. While image-based pretraining has gained recent popularity with methods such as SimCLR, the practice of pretraining models from videos dates back much earlier. This tutorial recapitulates both early and recent works that pretrain image encoders from videos using pretext tasks such as egomotion prediction, active recognition, and dense prediction. We also discuss practical implementation details relevant for practitioners and highlight connections to existing works such as VITO, TimeTuning, DoRA, and V-JEPA. Finally, we cover recent works that aim to mimic the human visual system, such as learning from one continuous video stream and learning from longitudinal audio-visual headcam recordings of young children, placing this concept in a broader context.

Motivation


Recent works from the last 6-12 months, such as VITO, TimeTuning, DoRA, and V-JEPA, demonstrate a paradigm shift, showing that SSL models pretrained on videos can outperform image-based pretraining. This rapid progress makes a tutorial on past and present advancements valuable for newcomers to the field. Key questions we aim to tackle in this tutorial include:



Speakers


Mohammadreza Salehi
University of Amsterdam
Yuki M. Asano
University of Amsterdam
João Carreira
Google DeepMind
Ishan Misra
GenAI, Meta
Emin Orhan
Independent Researcher



Schedule

Title | Speaker | Time (CEST)
Introduction | Mohammadreza Salehi | 09:00 - 09:10
Part (1): Learning image encoders from videos: prior works | Shashanka Venkataramanan | 09:10 - 09:50
Part (2): New Vision Foundation Models from Video(s): 1-video pretraining, tracking image patches | Yuki M. Asano | 09:50 - 10:30
Coffee Break | | 10:30 - 11:00
Applications (1): Learning from one continuous stream: single-stream continual learning, massively parallel video models, perceivers | João Carreira | 11:00 - 11:40
Applications (2): What makes generative video models tick? Emu Video (text-to-video), FlowVid (video-to-video), factorizing text-to-video generation, efficiency | Ishan Misra | 11:40 - 12:20
Applications (3): SSL from the perspective of a developing child: audio-visual dataset, development of early word learning, learning from children | Emin Orhan | 12:20 - 13:00
Conclusion, Open Problems & Final Remarks | Yuki M. Asano | 13:00 - 13:10




About Us

Shashanka Venkataramanan
LinkMedia, Inria

Shashanka is a final-year PhD student in the LinkMedia team at Inria, France, advised by Yannis Avrithis. He conducts research on self-supervised learning, specifically on learning image representations from videos and on data-augmentation methods. He has organized several deep learning workshops at his university, covering a broad range of topics including diffusion models, retrieval-augmented generation (RAG), and backdoor attacks.

Mohammadreza Salehi
QUVA Lab, University of Amsterdam

Mohammadreza is a third-year PhD student at the QUVA Lab, University of Amsterdam, advised by Yuki Asano, Cees Snoek, and Efstratios Gavves. His research focuses on representation learning, with a special emphasis on learning image representations from videos. In addition to his primary research, he is deeply engaged in machine learning safety, working towards ensuring that AI systems are reliable and safe for society.

Yuki Asano
QUVA Lab, University of Amsterdam

Yuki M. Asano is an assistant professor at the Video & Image Sense (VIS) Lab at the University of Amsterdam and leads the Qualcomm-UvA (QUVA) Lab. He conducts research on self-supervised learning, multi-modal learning, and augmentations, which has resulted in works such as GDT, SSB, SeLa, and single-image pretraining, and most recently self-supervised learning from videos with TimeTuning and DoRA. He has served as an Area Chair for CVPR 2022-2024, ICLR 2023, and NeurIPS 2022. He has also organized several workshops, including SSLWIN at ECCV 2020 and ECCV 2022, and BigMAC at ICCV 2023.