Data Pipeline
Every recording submitted through Mecka passes through an automated pipeline that validates quality, extracts features, and prepares the data for use by robotics teams.
Pipeline Stages
1. Ingestion
Raw recordings are uploaded to object storage via the API gateway. Metadata (device info, quest ID, contributor ID, timestamps) is stored in PostgreSQL.
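As a rough illustration of the metadata described above, the record stored alongside each upload could be modelled as follows. This is a minimal sketch only: the class name, field names, and validation rules are assumptions made for the example and do not reflect the actual PostgreSQL schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class RecordingMetadata:
    """Hypothetical metadata row; all field names are illustrative."""

    recording_id: str
    quest_id: str
    contributor_id: str
    device_model: str      # stands in for "device info"
    storage_key: str       # location of the raw file in object storage
    recorded_at: datetime
    uploaded_at: datetime

    def validate(self) -> None:
        # Require timezone-aware timestamps so they compare unambiguously.
        for name in ("recorded_at", "uploaded_at"):
            if getattr(self, name).tzinfo is None:
                raise ValueError(f"{name} must be timezone-aware")
        # An upload cannot happen before the recording was made.
        if self.uploaded_at < self.recorded_at:
            raise ValueError("uploaded_at precedes recorded_at")
```

Validating at ingestion time keeps malformed rows out of the database, so later stages can rely on the timestamps being consistent.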
2. Pre-processing
- Video normalization (resolution, frame rate, codec)
- Depth map alignment
- Audio separation
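The normalization and audio steps above could be driven by FFmpeg. The sketch below only builds the command lines and does not run them. The target resolution, frame rate, and codecs are assumed values, the helper names are hypothetical, and "audio separation" is interpreted here as splitting the audio track into its own file, which may differ from what the pipeline actually does.

```python
from typing import List


def build_normalize_cmd(src: str, dst: str, width: int = 1280,
                        height: int = 720, fps: int = 30,
                        codec: str = "libx264") -> List[str]:
    """Return an ffmpeg command that normalizes resolution, frame rate, and codec."""
    return [
        "ffmpeg", "-y",                    # overwrite the output if it exists
        "-i", src,
        "-vf", f"scale={width}:{height}",  # target resolution (assumed)
        "-r", str(fps),                    # target frame rate (assumed)
        "-c:v", codec,                     # target video codec (assumed)
        "-an",                             # drop audio from the video output
        dst,
    ]


def build_audio_extract_cmd(src: str, dst_wav: str) -> List[str]:
    """Return an ffmpeg command that writes the audio track to a WAV file."""
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-vn",                  # drop video from the audio output
        "-c:a", "pcm_s16le",    # uncompressed 16-bit PCM
        dst_wav,
    ]
```

Either list can be passed to `subprocess.run(cmd, check=True)` on a machine where FFmpeg is installed.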
3. Quality Assurance
Automated QA checks include:
- Completeness: Did the contributor perform the full task?
- Visibility: Are hands and objects clearly visible throughout?
- Stability: Is the footage stable enough for pose extraction?
- Lighting: Is the scene adequately lit?
- Duration: Does the recording meet minimum length requirements?
Each check produces a score. The aggregate of these scores determines whether the recording passes and sets the quality multiplier applied to rewards.
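One way to picture the aggregation is a weighted average compared against a threshold. The weights, the pass threshold, the 0 to 1 score range, and the multiplier formula below are all invented for illustration; the real scoring rules are not documented here.

```python
from typing import Dict, Tuple

# Hypothetical weights for the five checks; they sum to 1.0.
WEIGHTS: Dict[str, float] = {
    "completeness": 0.30,
    "visibility": 0.25,
    "stability": 0.20,
    "lighting": 0.15,
    "duration": 0.10,
}
PASS_THRESHOLD = 0.6  # assumed cut-off, not a documented value


def aggregate_qa(scores: Dict[str, float]) -> Tuple[float, bool, float]:
    """Combine per-check scores in [0, 1] into (aggregate, passed, multiplier)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing QA scores: {sorted(missing)}")
    aggregate = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    passed = aggregate >= PASS_THRESHOLD
    # Assumed rule: failed recordings earn nothing; passing ones scale with quality.
    multiplier = round(0.5 + aggregate, 2) if passed else 0.0
    return aggregate, passed, multiplier
```

A weighted average lets a strong result on one check offset a weaker one. A stricter design might instead require every check to clear its own minimum.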
4. Feature Extraction
- Pose estimation: 3D hand and body keypoints (21 keypoints per hand, 33 body landmarks)
- Object detection: Bounding boxes and labels for task-relevant objects
- Action segmentation: Temporal boundaries of discrete actions within the recording
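To make "temporal boundaries of discrete actions" concrete, the sketch below collapses a sequence of per-frame action labels into timed segments. It assumes some upstream model has already produced one label per frame. The `ActionSegment` type and the function name are hypothetical, and this is not a description of the model the pipeline actually uses.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List, Sequence


@dataclass
class ActionSegment:
    label: str
    start_s: float  # segment start, in seconds
    end_s: float    # segment end (exclusive), in seconds


def segments_from_frame_labels(labels: Sequence[str], fps: float) -> List[ActionSegment]:
    """Merge runs of identical consecutive frame labels into segments."""
    segments: List[ActionSegment] = []
    frame = 0
    for label, run in groupby(labels):
        length = sum(1 for _ in run)  # number of frames in this run
        segments.append(ActionSegment(label, frame / fps, (frame + length) / fps))
        frame += length
    return segments
```

For example, six frames labelled `reach, reach, grasp, grasp, grasp, place` at 2 frames per second yield three segments covering 0.0 to 1.0 s, 1.0 to 2.5 s, and 2.5 to 3.0 s.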
5. Classification
Recordings are tagged with ROBO categories and task-specific labels. See ROBO Categories for the full taxonomy.
6. Indexing
Processed data is indexed and made available through the HAN SDK with queryable metadata (category, quality score, task type, environment, contributor reputation).
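The kind of metadata filtering this enables can be sketched with a plain in-memory function. This is not the HAN SDK API: the function name, parameter names, and record keys are assumptions chosen to mirror the metadata fields listed above.

```python
from typing import Any, Dict, Iterable, List, Optional

Record = Dict[str, Any]


def query_index(records: Iterable[Record],
                category: Optional[str] = None,
                task_type: Optional[str] = None,
                min_quality: float = 0.0) -> List[Record]:
    """Filter records by metadata and return them best-quality first."""
    matches: List[Record] = []
    for rec in records:
        if category is not None and rec["category"] != category:
            continue
        if task_type is not None and rec["task_type"] != task_type:
            continue
        if rec["quality_score"] < min_quality:
            continue
        matches.append(rec)
    # Highest quality score first.
    return sorted(matches, key=lambda r: r["quality_score"], reverse=True)
```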
Processing Infrastructure
The pipeline runs on Ray, a distributed computing framework. This allows horizontal scaling as submission volume grows. Each stage is an independent Ray task, enabling parallel processing and fault isolation.