Saptarshi Sinha1, Alexandros Stergiou2 and Dima Damen1

1University of Bristol, 2University of Twente

ACCV 2024

Abstract


Video repetition counting infers the number of repetitions of recurring actions or motions within a video.

We propose an exemplar-based approach that discovers visual correspondence of video exemplars across repetitions within target videos. Our proposed Every Shot Counts (ESCounts) model is an attention-based encoder-decoder that encodes videos of varying lengths alongside exemplars from the same and different videos. In training, ESCounts regresses locations of high correspondence to the exemplars within the video. In tandem, our method learns a latent that encodes general repetitive motion, which we use for exemplar-free, zero-shot inference.

We are the first to introduce the use of exemplars for repetition counting. Extensive experiments on commonly used datasets (RepCount, Countix, and UCFRep) show that ESCounts achieves state-of-the-art performance across all three datasets.

Model Overview


In this paper, we train a transformer-based encoder-decoder that encodes videos of varying lengths alongside exemplars and learns representations of general repeating motions. The model regresses a density map that localises each repetition in time. During training, we also learn an exemplar latent representation, which we use at inference when no exemplars are available.
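The sketch below illustrates this design in PyTorch: video latents from a shared encoder cross-attend to exemplar latents, then pass through self-attention and a linear projection to a density map. All module names, dimensions, and depths (`ESCountsDecoder`, `dim=512`, the number of blocks) are illustrative assumptions rather than the paper's exact configuration, and the windowed self-attention is simplified to full attention.

```python
# A minimal sketch of an ESCounts-style decoder; names and sizes are assumptions.
import torch
import torch.nn as nn

class ESCountsDecoder(nn.Module):
    def __init__(self, dim=512, heads=8, num_cross_blocks=4):
        super().__init__()
        # Learnable exemplar latent, used when no exemplars are given (zero-shot).
        self.exemplar_latent = nn.Parameter(torch.randn(1, 1, dim))
        self.cross_blocks = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True)
             for _ in range(num_cross_blocks)]
        )
        # Stand-in for the windowed self-attention blocks (full attention here).
        self.self_attn = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), num_layers=2
        )
        self.to_density = nn.Linear(dim, 1)  # project each temporal token to a density value

    def forward(self, video_tokens, exemplar_tokens=None):
        # video_tokens: (B, T', dim) spatiotemporal latents from the shared encoder E
        # exemplar_tokens: (B, S, dim) encoded exemplars, or None for zero-shot inference
        if exemplar_tokens is None:
            exemplar_tokens = self.exemplar_latent.expand(video_tokens.size(0), -1, -1)
        z = video_tokens
        for block in self.cross_blocks:
            # Queries come from the video; keys/values come from the exemplars.
            attn_out, _ = block(z, exemplar_tokens, exemplar_tokens)
            z = z + attn_out
        z = self.self_attn(z)
        return self.to_density(z).squeeze(-1)  # (B, T') predicted density map
```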

The video \( \mathbf{v} \) is encoded by \( \mathcal{E} \) over sliding temporal windows into spatiotemporal latents. Exemplars \( \{\mathbf{e}_{s}\} \) are encoded with the same \( \mathcal{E} \). Video and exemplar latents are cross-attended by the decoder \( \mathcal{D} \) over cross-attention blocks. The resulting latents \( \mathbf{z}_L \) are attended over windowed self-attention blocks and projected into the density map \( \tilde{\mathbf{d}} \). The decoder \( \mathcal{D} \) is trained to minimise the Mean Squared Error between the ground-truth \( \mathbf{d} \) and predicted \( \tilde{\mathbf{d}} \) density maps, and the Mean Absolute Error between the ground-truth count \( c \) and the predicted count \( \tilde{c} \) obtained by summing the density map: $$ \mathcal{L} = \underbrace{\frac{\|\mathbf{d} - \tilde{\mathbf{d}}\|^2}{\mathcal{T}'}}_{\text{MSE}(\mathbf{d},\tilde{\mathbf{d}})} + \underbrace{\frac{|c - \sum \tilde{\mathbf{d}}|}{c}}_{\text{MAE}(c,\tilde{c})} $$
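A minimal implementation of this loss, assuming batched `(B, T')` float density maps and float count tensors; the `clamp` guard against zero counts is our addition:

```python
# Training loss from the equation above: density-map MSE normalised by the
# temporal length T', plus a count MAE normalised by the ground-truth count.
def escounts_loss(d_pred, d_gt, c_gt):
    # d_pred, d_gt: (B, T') predicted / ground-truth density maps
    # c_gt: (B,) ground-truth repetition counts (float)
    t_prime = d_gt.size(-1)
    mse = ((d_gt - d_pred) ** 2).sum(dim=-1) / t_prime     # ||d - d~||^2 / T'
    c_pred = d_pred.sum(dim=-1)                            # c~ = sum of density map
    mae = (c_gt - c_pred).abs() / c_gt.clamp(min=1.0)      # |c - c~| / c
    return (mse + mae).mean()
```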

At inference, we use the predicted count \( \tilde{c} \).
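For completeness, a usage sketch of zero-shot inference with the decoder sketched above: when no exemplars are passed, the learned exemplar latent stands in, and the count is the sum of the predicted density map.

```python
# Zero-shot inference with the sketch above; the random tokens stand in for
# encoder outputs over sliding windows.
decoder = ESCountsDecoder().eval()
video_tokens = torch.randn(2, 64, 512)       # (B, T', dim) encoder latents
with torch.no_grad():
    density = decoder(video_tokens)          # no exemplars: learned latent is used
counts = density.sum(dim=-1)                 # predicted count c~ per video
```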

Demo Results




Bibtex

@InProceedings{sinha2024every,
  title     = {Every Shot Counts: Using Exemplars for Repetition Counting in Videos},
  author    = {Sinha, Saptarshi and Stergiou, Alexandros and Damen, Dima},
  booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
  year      = {2024},
}

Acknowledgements

This work used publicly available datasets. Research was supported by the EPSRC Doctoral Training Program (DTP) and EPSRC UMPIRE (EP/T004991/1).