Data Pipeline

Every recording submitted through Mecka passes through an automated pipeline that validates quality, extracts features, and prepares the data for use by robotics teams.

Pipeline Stages

1. Ingestion

Raw recordings are uploaded to object storage via the API gateway. Metadata (device info, quest ID, contributor ID, timestamps) is stored in PostgreSQL.
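As a rough illustration, the metadata captured at ingestion could be modeled as a record like the one below. This is a minimal sketch: every field name, the `storage_key` field, and the `to_row` helper are assumptions made for illustration, not the actual PostgreSQL schema.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class RecordingMetadata:
    """Illustrative metadata record; field names are assumptions, not the real schema."""
    recording_id: str
    quest_id: str
    contributor_id: str
    device_info: dict       # e.g. device model and OS version
    recorded_at: datetime   # when the contributor captured the recording
    uploaded_at: datetime   # when the upload reached object storage
    storage_key: str        # hypothetical object-storage key of the raw file


def to_row(meta: RecordingMetadata) -> dict:
    """Flatten the record into a dict with ISO 8601 timestamps."""
    row = asdict(meta)
    row["recorded_at"] = meta.recorded_at.isoformat()
    row["uploaded_at"] = meta.uploaded_at.isoformat()
    return row


if __name__ == "__main__":
    example = RecordingMetadata(
        recording_id="rec-1",
        quest_id="quest-7",
        contributor_id="user-42",
        device_info={"model": "example-phone"},
        recorded_at=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
        uploaded_at=datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc),
        storage_key="raw/rec-1.mp4",
    )
    print(to_row(example))
```

Keeping timestamps timezone-aware avoids ambiguity when contributors upload from different regions.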

2. Pre-processing

  • Video normalization (resolution, frame rate, codec)
  • Depth map alignment
  • Audio separation
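The video normalization step can be pictured as a single transcode per recording. The sketch below assumes FFmpeg is the tool used and that the targets are 1280x720, 30 fps, and H.264; none of these are stated above, and `build_normalize_command` is a hypothetical helper. It only builds the command and does not run it.

```python
from pathlib import Path


def build_normalize_command(
    src: Path,
    dst: Path,
    width: int = 1280,      # assumed target resolution
    height: int = 720,
    fps: int = 30,          # assumed target frame rate
    codec: str = "libx264", # assumed target video codec
) -> list[str]:
    """Return an ffmpeg argument list that rescales, resamples, and re-encodes a video."""
    return [
        "ffmpeg",
        "-y",                              # overwrite the output file if it exists
        "-i", str(src),                    # input recording
        "-vf", f"scale={width}:{height}",  # normalize resolution
        "-r", str(fps),                    # normalize output frame rate
        "-c:v", codec,                     # normalize video codec
        str(dst),
    ]


if __name__ == "__main__":
    cmd = build_normalize_command(Path("raw.mov"), Path("normalized.mp4"))
    print(" ".join(cmd))
    # To execute, pass the list to subprocess.run(cmd, check=True).
```

Returning an argument list rather than a shell string avoids quoting problems with unusual file names.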

3. Quality Assurance

Automated QA checks include:

  • Completeness: Did the contributor perform the full task?
  • Visibility: Are hands and objects clearly visible throughout?
  • Stability: Is the footage stable enough for pose extraction?
  • Lighting: Is the scene adequately lit?
  • Duration: Does the recording meet minimum length requirements?

Each check produces a score. The aggregate of these scores determines whether the recording passes and sets the quality multiplier applied to rewards.
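One way to picture the aggregation is an unweighted mean over the five checks, compared against a threshold. The sketch below is illustrative only: the 0-to-1 score range, the equal weighting, the 0.7 threshold, and the rule that the multiplier equals the aggregate are all assumptions, since the actual formula is not specified here.

```python
QA_CHECKS = ("completeness", "visibility", "stability", "lighting", "duration")


def aggregate_qa(scores: dict[str, float], pass_threshold: float = 0.7) -> dict:
    """Combine per-check scores (assumed to lie in [0, 1]) into a pass/fail result."""
    missing = [name for name in QA_CHECKS if name not in scores]
    if missing:
        raise ValueError(f"missing QA scores: {missing}")

    # Assumption: every check carries equal weight.
    aggregate = sum(scores[name] for name in QA_CHECKS) / len(QA_CHECKS)
    passed = aggregate >= pass_threshold

    # Assumption: failed recordings earn nothing; passed ones scale with quality.
    multiplier = aggregate if passed else 0.0
    return {"aggregate": aggregate, "passed": passed, "multiplier": multiplier}


if __name__ == "__main__":
    result = aggregate_qa({
        "completeness": 1.0,
        "visibility": 1.0,
        "stability": 1.0,
        "lighting": 0.5,
        "duration": 0.5,
    })
    print(result)
```

An unweighted mean lets a strong check offset a weak one; requiring a minimum score on each individual check would be a stricter alternative.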

4. Feature Extraction

  • Pose estimation: 3D hand and body keypoints (21 keypoints per hand, 33 body landmarks)
  • Object detection: Bounding boxes and labels for task-relevant objects
  • Action segmentation: Temporal boundaries of discrete actions within the recording
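To make the action segmentation output concrete, the sketch below turns a per-frame label sequence into temporal boundaries. It assumes an upstream model emits one action label per frame; the label names, the output field names, and `segments_from_frame_labels` itself are hypothetical.

```python
from itertools import groupby


def segments_from_frame_labels(labels: list[str], fps: float) -> list[dict]:
    """Collapse runs of identical per-frame labels into (label, start, end) segments."""
    if fps <= 0:
        raise ValueError("fps must be positive")

    segments = []
    frame = 0  # index of the first frame of the current run
    for label, run in groupby(labels):
        length = sum(1 for _ in run)
        segments.append({
            "label": label,
            "start_s": frame / fps,             # inclusive start, in seconds
            "end_s": (frame + length) / fps,    # exclusive end, in seconds
        })
        frame += length
    return segments


if __name__ == "__main__":
    frame_labels = ["reach"] * 3 + ["grasp"] * 2 + ["lift"] * 5
    for segment in segments_from_frame_labels(frame_labels, fps=10):
        print(segment)
```

Because each segment ends exactly where the next begins, the segments tile the recording without gaps or overlaps.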

5. Classification

Recordings are tagged with ROBO categories and task-specific labels. See ROBO Categories for the full taxonomy.

6. Indexing

Processed data is indexed and made available through the HAN SDK with queryable metadata (category, quality score, task type, environment, contributor reputation).
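The snippet below illustrates the kind of metadata filtering such an index supports. It is plain in-memory Python, not the HAN SDK API: the `IndexedRecording` type, the `query` function, its parameter names, and the sample values are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IndexedRecording:
    """Illustrative index entry carrying the queryable metadata fields."""
    recording_id: str
    category: str
    quality_score: float
    task_type: str
    environment: str
    contributor_reputation: float


def query(
    records: list[IndexedRecording],
    *,
    category: Optional[str] = None,
    task_type: Optional[str] = None,
    environment: Optional[str] = None,
    min_quality: float = 0.0,
    min_reputation: float = 0.0,
) -> list[IndexedRecording]:
    """Return records matching every filter that was supplied."""
    def matches(r: IndexedRecording) -> bool:
        return (
            (category is None or r.category == category)
            and (task_type is None or r.task_type == task_type)
            and (environment is None or r.environment == environment)
            and r.quality_score >= min_quality
            and r.contributor_reputation >= min_reputation
        )

    return [r for r in records if matches(r)]


if __name__ == "__main__":
    index = [
        IndexedRecording("rec-1", "category-a", 0.9, "pouring", "kitchen", 0.8),
        IndexedRecording("rec-2", "category-a", 0.6, "pouring", "kitchen", 0.9),
        IndexedRecording("rec-3", "category-b", 0.95, "folding", "bedroom", 0.7),
    ]
    print(query(index, category="category-a", min_quality=0.8))
```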

Processing Infrastructure

The pipeline runs on Ray, a distributed computing framework. This allows horizontal scaling as submission volume grows. Each stage is an independent Ray task, enabling parallel processing and fault isolation.
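The stage-per-task layout described above might look roughly like the following. This is a sketch under stated assumptions: the stage bodies are placeholders, the function names are hypothetical, and only two stages are shown. The Ray calls used (`ray.init`, `ray.remote`, `.remote()`, `ray.get`) are part of Ray's public task API.

```python
def preprocess(recording_id: str) -> dict:
    """Placeholder stage: stands in for video normalization and alignment."""
    if not recording_id:
        raise ValueError("empty recording id")
    return {"recording_id": recording_id, "normalized": True}


def quality_check(recording: dict) -> dict:
    """Placeholder stage: stands in for the automated QA checks."""
    return {**recording, "passed": recording["normalized"]}


def run_pipeline(recording_ids: list[str]) -> dict:
    # Third-party dependency (pip install ray). Imported here so the plain
    # stage functions above can be exercised without a Ray installation.
    import ray

    ray.init(ignore_reinit_error=True)
    preprocess_task = ray.remote(preprocess)
    qa_task = ray.remote(quality_check)

    # Passing an ObjectRef as an argument chains the stages: Ray resolves it
    # to the upstream result before quality_check starts.
    refs = {rid: qa_task.remote(preprocess_task.remote(rid)) for rid in recording_ids}

    results = {}
    for rid, ref in refs.items():
        try:
            results[rid] = ray.get(ref)
        except Exception as exc:  # a failed task surfaces only when its own ref is fetched
            results[rid] = {"recording_id": rid, "error": str(exc)}
    return results


if __name__ == "__main__":
    print(run_pipeline(["rec-1", "rec-2", ""]))
```

Fetching each result separately is what keeps one failed recording from aborting the rest of the batch; a single `ray.get` over the whole list would raise on the first failure.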