ECCV 2024 Tutorial:


Time is precious: Self-Supervised Learning Beyond Images


30th September, 09:00 to 13:00 CEST, Amber 7 + 8

MiCo Milano






Overview


Self-supervised learning (SSL) has allowed pretraining of neural networks to scale beyond the size of labelled datasets, delivering robust performance without the need for costly annotation. This approach has successfully scaled training datasets to billions of images.

However, state-of-the-art (SoTA) models often learn representations from single-image inputs and therefore lack temporal context. Visual representations derived from such static images are limited to disjoint snapshots of the world. This limitation is particularly pronounced in recent SSL techniques, which are typically trained on meticulously curated, object-centric datasets such as ImageNet. Attempts to scale single-image techniques to larger, less curated datasets such as Instagram-1B have not yielded substantial improvements in performance. A single image, regardless of artificial augmentation, has inherent constraints: it cannot reveal new perspectives of an object or show how events in a scene unfold.

The primary goal of this tutorial is to introduce to the computer vision community the concept of learning robust representations by leveraging the rich information in video frames. While image-based pretraining has gained recent popularity with methods such as SimCLR, the practice of pretraining models from videos dates back much earlier. This tutorial recapitulates both early and recent works that pretrain image encoders from videos using pretext tasks such as egomotion prediction, active recognition, and dense prediction. We also discuss practical implementation details relevant for practitioners and highlight connections to existing works such as VITO, TimeTuning, DoRA, and V-JEPA. Finally, we cover recent works that aim to mimic the human visual system, such as learning from one continuous video stream and learning from longitudinal audio-visual headcam recordings of young children, placing this concept in a broader context.

Motivation


Recent works from the last 6-12 months, such as VITO, TimeTuning, DoRA, and V-JEPA, demonstrate a paradigm shift, showing that SSL models pretrained on videos can outperform image-based pretraining. This rapid progress makes a tutorial on past and present advancements valuable for newcomers to the field. Key questions we aim to tackle in this tutorial include:



Speakers


Mohammadreza Salehi
University of Amsterdam
Yuki M. Asano
University of Amsterdam
João Carreira
Google DeepMind
Ishan Misra
GenAI, Meta
Emin Orhan
Independent Researcher



Schedule

Title | Speaker | Time (CEST)
Introduction | Mohammadreza Salehi | 09:00 - 09:10
Part (1): Learning image encoders from videos: prior works | Shashanka Venkataramanan | 09:10 - 09:50
Part (2): New Vision Foundation Models from Video(s): 1-video pretraining, tracking image patches | Yuki M. Asano | 09:50 - 10:30
Coffee Break | | 10:30 - 11:00
Applications (1): Learning from one continuous stream: single-stream continual learning, massively parallel video models, perceivers | João Carreira | 11:00 - 11:40
Applications (2): What makes generative video models tick? Emu Video (text-to-video), FlowVid (video-to-video), factorizing text-to-video generation, efficiency | Ishan Misra | 11:40 - 12:20
Applications (3): SSL from the perspective of a developing child: audio-visual dataset, development of early word learning, learning from children | Emin Orhan | 12:20 - 13:00
Conclusion, Open Problems & Final Remarks | Yuki M. Asano | 13:00 - 13:10




About Us

Shashanka Venkataramanan
LinkMedia, Inria

Shashanka is a final-year PhD student in the LinkMedia team at Inria, France, advised by Yannis Avrithis. He conducts research on self-supervised learning, specifically on learning image representations from videos and on data-augmentation methods. He has organized several deep learning workshops at his university, covering a broad range of topics including diffusion models, retrieval-augmented generation (RAG), and backdoor attacks.

Mohammadreza Salehi
QUVA Lab, University of Amsterdam

Mohammadreza is a third-year PhD student at the QUVA Lab, University of Amsterdam, advised by Yuki Asano, Cees Snoek, and Efstratios Gavves. His research focuses on representation learning, with a special emphasis on learning image representations from videos. In addition to his primary research, he is deeply engaged in machine learning safety, working towards ensuring that AI systems are reliable and safe for society.

Yuki Asano
QUVA Lab, University of Amsterdam

Yuki M. Asano is an assistant professor at the Video & Image Sense (VIS) Lab at the University of Amsterdam and leads the Qualcomm-UvA (QUVA) Lab. He conducts research on self-supervised learning, multi-modal learning, and augmentations, which has resulted in works such as GDT, SSB, SeLa, and single-image pretraining, and most recently self-supervised learning from videos with TimeTuning and DoRA. He has served as an Area Chair for CVPR 2022-2024, ICLR 2023, and NeurIPS 2022. He has also organized several workshops, including SSLWIN at ECCV 2020 and ECCV 2022, and BigMAC at ICCV 2023.