We present HOI4D, a large-scale 4D egocentric dataset with rich annotations, to catalyze
research on category-level
human-object interaction. HOI4D consists of 2.4M RGB-D egocentric video frames over 4000
sequences collected by 9 participants
interacting with 800 different object instances from 16 categories in 610 different indoor
rooms. Frame-wise annotations for
panoptic segmentation, motion segmentation, 3D hand pose, category-level object pose, and
hand action are provided,
together with reconstructed object meshes and scene point clouds. With HOI4D, we establish three
benchmarking tasks to promote
category-level HOI research from 4D visual signals: semantic segmentation of 4D dynamic
point cloud sequences, category-level
object pose tracking, and egocentric action segmentation with diverse interaction targets.
In-depth analysis shows that HOI4D
poses great challenges to existing methods and opens up rich research opportunities.