HOI4D: A 4D Egocentric Dataset for Category-Level Human-Object Interaction
HOI4D is a large-scale 4D egocentric dataset for category-level human-object interaction. The dataset and benchmarks are described in our CVPR 2022 paper; see the BibTeX entry at the end of this document.
HOI4D is constructed from human-object interaction RGB-D videos together with rich annotations, including object CAD models, action segmentation, 2D motion segmentation, 3D static scene panoptic segmentation, 4D dynamic scene panoptic segmentation, category-level object pose, and human hand pose.
Please refer to definitions/task/task_definitions.csv for the definitions of C* and T*.
For the list of released data, see release.txt. (The rest of the data is temporarily held out as a test set.)
Data Formats
Human-object Interaction RGB-D Videos
1. First, install ffmpeg.
2. Then run python utils/dec
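Assuming the decoding step shells out to ffmpeg, the invocation could look roughly like the sketch below; decode_video, the frame rate, and the output naming are illustrative assumptions, not part of the release.

```python
import subprocess

def decode_video(video_path, out_dir, fps=15):
    """Build an ffmpeg command that dumps a video into zero-padded PNG frames."""
    cmd = [
        "ffmpeg", "-i", video_path,   # input RGB or depth video
        "-vf", f"fps={fps}",          # assumed frame rate
        f"{out_dir}/%05d.png",        # 00001.png, 00002.png, ...
    ]
    return cmd

# Inspect the command first; execute it with subprocess.run(cmd, check=True)
cmd = decode_video("image.mp4", "frames")
```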
Object CAD Models
For each rigid object, we provide a single object mesh {object category}/{object id}.obj. The mesh
itself is the canonical frame that defines the pose of the object.
For each articulated object, we provide the articulated part meshes as well as joint annotations. We
utilize partnet_anno_system to segment articulated parts from
the whole object mesh. The part hierarchy is defined in {object category}/{object id}/result.json, and
the part meshes involved in the part hierarchy are provided in
{object category}/{object id}/{objs}/{part name}.obj. We provide joint annotations in {object category}/{object id}/mobility_v2.json, including the origin, direction, and rotation limit (for revolute joints) or translation limit (for prismatic joints) of each joint axis. In addition to the CAD model annotations, we also provide the canonical frame of each articulated part in {object category}/{object id}/{objs}/{part name}_align.obj, which is used to define the part pose.
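A hedged sketch of reading the joint annotations: the exact mobility_v2.json schema follows partnet_anno_system, so the field names below ("jointData", "axis", "limit") are assumptions, illustrated on synthetic data.

```python
import json

# Synthetic stand-in for a mobility_v2.json entry (schema assumed, not verified)
sample = '''[{"id": 0, "joint": "hinge",
              "jointData": {"axis": {"origin": [0, 0, 0], "direction": [0, 0, 1]},
                            "limit": {"a": 0, "b": 90}}}]'''

def load_joints(text):
    """Collect origin, direction, and limit for every annotated joint axis."""
    joints = []
    for part in json.loads(text):
        data = part.get("jointData") or {}
        axis = data.get("axis", {})
        joints.append({"type": part.get("joint"),
                       "origin": axis.get("origin"),
                       "direction": axis.get("direction"),
                       "limit": data.get("limit")})
    return joints

joints = load_joints(sample)
```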
Action Segmentation
We provide action segmentation annotations that record the per-frame action class. duration denotes the length of the video, event denotes the label of the clip, and startTime and endTime denote the time range of the clip.
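To turn the clip-level annotations into per-frame labels, something like the sketch below may help; the JSON layout and the frame rate are assumptions, shown on synthetic data.

```python
# Synthetic annotation shaped like the fields described above (layout assumed)
sample = {
    "info": {"duration": 2.0},
    "events": [
        {"event": "pick up", "startTime": 0.0, "endTime": 1.0},
        {"event": "put down", "startTime": 1.0, "endTime": 2.0},
    ],
}

def per_frame_labels(anno, fps=15):
    """Expand (event, startTime, endTime) clips into one label per frame."""
    n = int(round(anno["info"]["duration"] * fps))
    labels = ["background"] * n
    for clip in anno["events"]:
        start = int(round(clip["startTime"] * fps))
        end = min(int(round(clip["endTime"] * fps)), n)
        labels[start:end] = [clip["event"]] * (end - start)
    return labels

labels = per_frame_labels(sample)
```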
2D Motion Segmentation
We present 2D motion segmentation to annotate the human hands and the objects related to the interaction
in each video. You can first use the get_color_map function in utils/color_map.py
to convert the RGB labels to indices, and then refer to definitions/motion segmentation/label.csv to
correlate each index to its corresponding semantic meaning.
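The lookup itself amounts to mapping each RGB triple to its index; here is a minimal sketch with a placeholder palette (the real one comes from get_color_map in utils/color_map.py):

```python
# Placeholder palette; the real colors come from get_color_map in utils/color_map.py
color_map = [(0, 0, 0), (255, 0, 0), (0, 255, 0)]
rgb_to_index = {rgb: idx for idx, rgb in enumerate(color_map)}

def mask_to_indices(mask):
    """Convert an H x W x 3 nested-list RGB mask into an H x W index mask."""
    return [[rgb_to_index[tuple(px)] for px in row] for row in mask]

indices = mask_to_indices([[(255, 0, 0), (0, 0, 0)]])
```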
3D Static Scene Panoptic Segmentation
raw_pc.pcd and label.pcd are the raw point cloud and the labels of the reconstructed static scene.
For the detailed definitions, refer to definitions/3D segmentation/labels.xlsx.
output.log is the camera pose of each frame.
Note: you can easily load the camera poses using Open3D.
import open3d as o3d
outCam = o3d.io.read_pinhole_camera_trajectory("output.log").parameters
4D Dynamic Scene Panoptic Segmentation
We provide scripts to generate 4D panoptic segmentation labels.
The results are output in semantic_segmentation_label/*.txt. The penultimate column is the semantic
label and the last column is the instance label.
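Reading those text files reduces to picking out the last two whitespace-separated columns; a minimal sketch (the row layout beyond the two label columns is assumed):

```python
def read_panoptic_labels(path):
    """Return (semantic, instance) label lists from a 4D panoptic label file,
    taking the penultimate and last whitespace-separated columns of each row."""
    semantic, instance = [], []
    with open(path) as f:
        for line in f:
            cols = line.split()
            if len(cols) < 2:
                continue  # skip blank or malformed rows
            semantic.append(int(float(cols[-2])))
            instance.append(int(float(cols[-1])))
    return semantic, instance
```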
Category-level Object Pose
center refers to the translation of the part.
rotation refers to the rotation of the part.
dimensions refers to the scale of the part.
Taking rigid objects as an example, you can easily load the object pose using the following code:
from scipy.spatial.transform import Rotation as Rt
import numpy as np

def read_rtd(file, num=0):
    # The annotation file stores a Python-style dict, hence eval rather than json
    with open(file, 'r') as f:
        cont = eval(f.read())
    if "dataList" in cont:
        anno = cont["dataList"][num]
    else:
        anno = cont["objects"][num]
    trans, rot, dim = anno["center"], anno["rotation"], anno["dimensions"]
    trans = np.array([trans['x'], trans['y'], trans['z']], dtype=np.float32)
    rot = np.array([rot['x'], rot['y'], rot['z']])
    dim = np.array([dim['length'], dim['width'], dim['height']], dtype=np.float32)
    rot = Rt.from_euler('XYZ', rot).as_rotvec()  # Euler angles -> axis-angle
    return np.array(rot, dtype=np.float32), trans, dim
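To use the pose, it is often convenient to assemble a 4x4 rigid transform; below is a sketch assuming the pose maps the canonical (CAD) frame into the camera frame (pose_to_matrix is illustrative, not part of the toolkit):

```python
from scipy.spatial.transform import Rotation as Rt
import numpy as np

def pose_to_matrix(rotvec, trans):
    """Assemble a 4x4 transform from read_rtd's rotation vector and translation."""
    T = np.eye(4, dtype=np.float32)
    T[:3, :3] = Rt.from_rotvec(rotvec).as_matrix()
    T[:3, 3] = trans
    return T

# Map a canonical-frame point into the camera frame (homogeneous coordinates)
T = pose_to_matrix(np.array([0.0, 0.0, np.pi / 2]), np.array([0.1, 0.0, 0.5]))
point_cam = T @ np.array([1.0, 0.0, 0.0, 1.0])
```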
Human Hand Pose
We provide hand pose based on MANO parameters for each video. In each .pickle file:
"poseCoeff" : 3 global rotation parameters followed by 45 MANO pose parameters.
"beta" : 10 MANO shape parameters; the shape of each human ID H* is the same.
"trans" : translation of the hand in the camera frame.
"kps2D" : projected 2D coordinates of the 21 keypoints of the rendered hand pose on each image.
Install manopth from here and put the whole manopth folder in the same place as your code. You may have import problems when calling MANO; please modify the first few lines of the corresponding files according to your actual setup.
To get the 3D keypoints and camera-frame hand vertices, the following sketch might help (the ManoLayer flags, MANO model path, and pickle path are assumptions):
from manopth.manopth.manolayer import ManoLayer
import pickle
import torch

hand = pickle.load(open(pickle_path, 'rb'))  # one hand-pose .pickle file
mano = ManoLayer(mano_root='mano/models', use_pca=False, ncomps=45, flat_hand_mean=True)
verts, joints = mano(torch.from_numpy(hand['poseCoeff'])[None], torch.from_numpy(hand['beta'])[None])
verts = verts[0] / 1000.0 + torch.from_numpy(hand['trans'])  # millimeters to meters, camera frame
@InProceedings{Liu_2022_CVPR,
author = {Liu, Yunze and Liu, Yun and Jiang, Che and Lyu, Kangbo and Wan, Weikang and Shen, Hao and
Liang, Boqiang and Fu, Zhoujie and Wang, He and Yi, Li},
title = {HOI4D: A 4D Egocentric Dataset for Category-Level Human-Object Interaction},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR)},
month = {June},
year = {2022},
pages = {21013-21022}
}