Human activities exhibit a strong correlation between actions and the places where they are performed, such as washing something at a sink. More specifically, in daily living environments we can identify particular locations, hereinafter named activity-centric zones, which may afford a set of homogeneous actions. Knowledge of these zones can serve as a prior that helps vision models recognize human activities.
However, the appearance of these zones is scene-specific, limiting the transferability of this prior information to unfamiliar areas and domains. This problem is particularly relevant in egocentric vision, where the environment takes up most of the image, making it even more difficult to separate the action from the context.
In this paper, we discuss the importance of decoupling the domain-specific appearance of activity-centric zones from their universal, domain-agnostic representations, and show how the latter can improve the cross-domain transferability of Egocentric Action Recognition (EAR) models. We validate our solution on the EPIC-Kitchens-100 and ARGO1M datasets.
Feature space of an EAR model. On the left, features from a model trained and tested in the same environment are well separated by location. On the right, when the same model is applied to a new environment, this clustering effect disappears and different locations map to the same region of the feature space.
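As a rough way to quantify the effect shown in the figure (not taken from the paper), one can compare how well clip features cluster by recording location in the source and target environments, e.g., with a silhouette score. The arrays below are random placeholders for real backbone features and location labels.

```python
# Hypothetical separability check: higher silhouette = features form tighter,
# better-separated location clusters. Real features/labels would come from the
# EAR backbone and the dataset annotations; here they are random placeholders.
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
src_feats, src_locs = rng.normal(size=(500, 256)), rng.integers(0, 5, 500)
tgt_feats, tgt_locs = rng.normal(size=(500, 256)), rng.integers(0, 5, 500)

print("source separability:", silhouette_score(src_feats, src_locs))
print("target separability:", silhouette_score(tgt_feats, tgt_locs))
```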
EgoZAR uses domain-agnostic features to identify activity zones in egocentric videos, improving how action recognition models generalize across different environments. It leverages CLIP features as a domain-agnostic representation of the surroundings and clusters these features into location-based groups, which are then integrated into the action recognition process. By separating location and motion information through attention-based modules, we reduce environmental bias, enabling the model to better recognize actions in various settings. The approach works across multiple modalities, enhancing action predictions with contextual location insights.
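The sketch below illustrates one way to build such a location prior; it is not the authors' implementation. Frame-level CLIP embeddings (assumed pre-extracted) are clustered with K-means into K activity-centric zones, and each clip receives a soft zone assignment that can be passed to the action classifier alongside motion features. The feature dimensionality, the number of zones K, and the softmax temperature are placeholder assumptions.

```python
# Minimal sketch (not the official EgoZAR code): cluster pre-extracted CLIP
# frame features into K "activity-centric zones" and compute a per-clip
# location prior from the soft zone assignments.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Placeholder for real CLIP image embeddings, one 512-d vector per frame.
clip_feats = rng.normal(size=(1000, 512)).astype(np.float32)
clip_feats /= np.linalg.norm(clip_feats, axis=1, keepdims=True)

K = 8  # number of activity-centric zones (assumed hyperparameter)
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(clip_feats)

def zone_posterior(frame_feats: np.ndarray, temperature: float = 0.1) -> np.ndarray:
    """Soft assignment of a clip's frames to the K zones (softmax over negative
    distances to the cluster centroids), averaged over frames."""
    d = np.linalg.norm(frame_feats[:, None, :] - kmeans.cluster_centers_[None], axis=-1)
    logits = -d / temperature
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return p.mean(axis=0)  # (K,) location prior for the whole clip

# Example: location prior for a 16-frame clip.
prior = zone_posterior(clip_feats[:16])
print(prior.round(3))
```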
Results on the validation set of the EPIC-Kitchens-100 dataset in Unsupervised Domain Adaptation (UDA) and Domain Generalization (DG) settings using RGB, Optical Flow, and Audio.
To ensure a fair comparison, we report the Source Only performance for each method.
Our results are averaged over three runs.
Best in bold. Second best is underlined.
†Reproduced.
Best results in bold, second best underlined. †: Domain labels required during training. D: distribution matching, A: adversarial learning, M: label-wise mix-up, P: domain-prompts, R: reconstruction, T: video-text association, Z: activity-centric zone learning.
@article{peirone2024,
author = {Peirone, Simone Alberto and Goletto, Gabriele and Planamente, Mirco and Bottino, Andrea and Caputo, Barbara and Averta, Giuseppe},
title = {Egocentric zone-aware action recognition across environments},
journal = {arXiv preprint arXiv:2409.14205},
year = {2024},
}
Acknowledgements
This study was supported in part by the CINI Consortium through the VIDESEC project, was carried out within the FAIR - Future Artificial Intelligence Research, and received funding from the European Union Next-GenerationEU (PIANO NAZIONALE DI RIPRESA E RESILIENZA (PNRR) - MISSIONE 4 COMPONENTE 2, INVESTIMENTO 1.3 - D.D. 1555 11/10/2022, PE00000013). This manuscript reflects only the authors' views and opinions; neither the European Union nor the European Commission can be considered responsible for them.
G. Goletto is supported by PON “Ricerca e Innovazione” 2014-2020 - DM 1061/2021 funds.