SceneHub: A Dataset and Evaluation Framework for 6-DoF 4D Scenes

1Carnegie Mellon University, 2Northeastern University
Under Review

⚠️ The “Dataset” link above includes all components but, for easier access, only a subset of the RGB-D data (100 frames per scene). For full dataset access, refer to the Dataset README.

SceneHub Teaser Figure

The dataset features: (1) long, unbounded RGB-D sequences with high-quality background geometry;
(2) multiple 3D representations.

Supported Features

Camera Pose · Per-View RGB-D · Point Cloud · Mesh · Gaussian Splat · Photogrammetry Backdrop · Multi-Person · Interactive Objects · Full Scene · Metric Software Suite

Teaser Video

Abstract

We present a new dataset and evaluation framework for benchmarking 6-DoF 4D volumetric scenes. Our dataset captures long, dynamic sequences across diverse real-world indoor environments, with synchronized multi-view RGB-D streams, calibrated camera poses, and high-resolution background geometry reconstructed via photogrammetry and LiDAR. We provide a unified representation suite including point clouds, textured meshes, and Gaussian splats, along with tools for format conversion, rendering, and metric evaluation. To support structured comparison and perceptual analysis, we introduce novel metrics such as Geometry Complexity Score (GCS), SSIM-aware GCS, and Volumetric Temporal Information (V-TI). These components enable detailed characterization of spatial and temporal complexity, and facilitate benchmarking for tasks such as compression, view synthesis, and scene-aware rendering. Our dataset bridges gaps in scale, quality, and realism found in existing benchmarks, providing a comprehensive foundation for immersive 3D research.

Dataset Summary

Dataset Summary Figure

Table 1: Dataset description with scene size, actor count, frame count, triangle density, GCS, SSIM-aware GCS (SSIM ≥ 0.98), Depth SI, and V-TI.

Metric Description
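Exact formulations of GCS, SSIM-aware GCS, and V-TI are given in the paper. Purely as an illustration of the ingredients involved, the sketch below shows the SSIM gate implied by the SSIM ≥ 0.98 threshold and the classic ITU-T P.910 temporal-information statistic, which the name V-TI suggests is generalized to volumetric data; the function names and image layouts here are assumptions, not the official implementation.

```python
# Unofficial sketch of two ingredients behind the metrics above; the exact
# definitions of GCS, SSIM-aware GCS, and V-TI are given in the paper.
import numpy as np
from skimage.metrics import structural_similarity as ssim

SSIM_THRESHOLD = 0.98  # the threshold named by SSIM-aware GCS


def passes_ssim_gate(reference: np.ndarray, rendered: np.ndarray) -> bool:
    """True if a rendered view stays perceptually faithful to the reference.

    Both inputs are HxWx3 uint8 images at the same resolution.
    """
    return ssim(reference, rendered, channel_axis=2) >= SSIM_THRESHOLD


def temporal_information(luma_frames: list[np.ndarray]) -> float:
    """ITU-T P.910 TI: max over time of the spatial std of frame differences.

    V-TI presumably lifts this idea from 2D luminance video to volumetric
    data; see the paper for its exact definition.
    """
    diffs = np.stack([b.astype(np.float32) - a.astype(np.float32)
                      for a, b in zip(luma_frames, luma_frames[1:])])
    return float(diffs.std(axis=(1, 2)).max())
```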

3D Scene Viewers

⚠️ The table shows example visualizations for scene ID = 0 (a static scene) from our dataset. This demo uses the ARENA, which runs in your web browser to render interactive 3D content. You can log in anonymously or with your Google account. Loading may take a few seconds depending on your network and browser performance.

Scene | Point Cloud | Mesh | 3D Gaussian Splat (3DGS) | Photogrammetry | 3DGS (High-resolution)
Lab area | View | View | View | View | View
Couch | View | View | View | View | View
Kitchen | View | View | View | View | View
Whiteboard | View | View | View | View | View
Factory | View | View | View | View | View

Raw data size comparison across 3D representations

Raw Data Size Comparison Figure

Table 2: Raw data size comparison across 3D representations for one temporal frame t (geometry and color components).
N_cam, P, V, T, and G denote the number of camera views, points, vertices, triangles, and Gaussians, respectively.
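For intuition, the sketch below converts the table's symbols into back-of-envelope byte counts. The per-element layouts (float32 positions, 8-bit color, 16-bit depth, the standard 3DGS PLY attribute set) are common defaults and should be read as assumptions, not necessarily the exact accounting used in the table.

```python
# Back-of-envelope raw sizes for one temporal frame, using the Table 2
# symbols (N_cam, P, V, T, G). Layouts below are common defaults and are
# assumptions, not the table's exact accounting.

def rgbd_bytes(n_cam: int, width: int, height: int) -> int:
    # Per view and pixel: 3-byte RGB color + 2-byte (16-bit) depth.
    return n_cam * width * height * (3 + 2)

def point_cloud_bytes(p: int) -> int:
    # Per point: float32 xyz (12 B) + uint8 RGB (3 B).
    return p * (12 + 3)

def mesh_bytes(v: int, t: int) -> int:
    # Per vertex: float32 xyz + uv (20 B); per triangle: three uint32
    # indices (12 B). Texture atlas bytes are omitted here.
    return v * 20 + t * 12

def gaussian_splat_bytes(g: int) -> int:
    # Standard 3DGS PLY attributes: xyz (3) + normals (3) + SH color (48)
    # + opacity (1) + scale (3) + rotation (4) = 62 float32 values.
    return g * 62 * 4
```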

Camera Pose Variants

Each scene includes calibrated original camera extrinsics along with three types of virtual camera poses to enable view-aware evaluation.

  • Shifted views: Slight offsets from the original pose (up/down/left/right).
  • Interpolated views: Midpoints between camera pairs using Slerp and linear translation.
  • Random views: Uniformly sampled within the scene volume, looking at the center.

These variants support evaluation of 3D reconstruction fidelity under diverse viewpoint conditions; a minimal sketch of generating each variant is shown below.
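The sketch assumes 4×4 camera-to-world pose matrices, an OpenGL-style camera (looking down −Z with world +Z up), and SciPy for quaternion Slerp; it illustrates the three variants rather than reproducing the dataset's exact tooling.

```python
# Minimal sketch of the three virtual-pose variants, assuming 4x4
# camera-to-world matrices; illustrative only, not the dataset's tooling.
import numpy as np
from scipy.spatial.transform import Rotation, Slerp


def shifted_view(pose: np.ndarray, offset: np.ndarray) -> np.ndarray:
    """Offset the camera position slightly (e.g., up/down/left/right)."""
    out = pose.copy()
    out[:3, 3] += offset
    return out


def interpolated_view(pose_a: np.ndarray, pose_b: np.ndarray) -> np.ndarray:
    """Midpoint between two cameras: Slerp rotation, lerp translation."""
    rots = Rotation.from_matrix(np.stack([pose_a[:3, :3], pose_b[:3, :3]]))
    out = np.eye(4)
    out[:3, :3] = Slerp([0.0, 1.0], rots)(0.5).as_matrix()
    out[:3, 3] = 0.5 * (pose_a[:3, 3] + pose_b[:3, 3])
    return out


def random_view(bounds_min, bounds_max, center, rng=None):
    """Sample a position uniformly in the scene volume, looking at `center`."""
    rng = rng or np.random.default_rng()
    eye = rng.uniform(bounds_min, bounds_max)
    forward = center - eye
    forward /= np.linalg.norm(forward)
    # Assumes the view direction is never parallel to the world up axis.
    right = np.cross(forward, [0.0, 0.0, 1.0])
    right /= np.linalg.norm(right)
    up = np.cross(right, forward)
    out = np.eye(4)
    out[:3, :3] = np.column_stack([right, up, -forward])  # camera axes
    out[:3, 3] = eye
    return out
```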

Figure: Camera pose visualization in the capture room.
Original (cam1–4), Interpolated, Shifted, and Random viewpoints.

Videos by Scene

BibTeX

BibTeX code here