Depth-Based Hand Tracking for Interactive Exhibits
Replacing RGB hand tracking with infrared depth sensing for reliable museum installations
- Camera: Orbbec Femto Mega (Time-of-Flight IR depth, 120° x 120° FOV)
- Body Tracking: Azure Kinect SDK — 32 joints per body at 30fps
- Hand Classification: 5-strategy weighted voting (joint + depth-based)
- Gestures: Proximity-based hold + velocity-based throw detection
- Transport: Pluggable packet formatter + transport (UDP, OSC, or custom)
- Latency: Single-frame pipeline at 30fps (~33ms)
- Play Zone: Configurable 3D volume filtering with body-proportional thresholds
- Deployment: Headless mode, auto-start, PoE camera connection, auto-reconnect
- Tech Stack: C# / .NET 8, Azure Kinect Body Tracking SDK, Orbbec K4A Wrapper, OpenCV, Serilog, CUDA 11.8 + cuDNN, UDP / OSC (transport)
Overview
A museum exhibit needed reliable hand tracking to let visitors grab and throw virtual objects using natural hand gestures. The original RGB-based solution using a fisheye camera and MediaPipe failed under real venue conditions — orange walls triggered false skin detections, inconsistent lighting washed out hands, multiple spectators confused the tracker, fisheye distortion warped hands at frame edges, and calibration that worked for one visitor failed for the next. RGB colour analysis simply cannot reliably isolate hands in an uncontrolled public environment.
We replaced it with a depth-based system using an Orbbec Femto Mega infrared time-of-flight camera and the Azure Kinect Body Tracking SDK. Infrared depth sensing ignores visible light and wall colours entirely, body tracking identifies individuals rather than just hands, and a configurable 3D play zone physically isolates the active player from spectators.
Solution Architecture
Orbbec Femto Mega (PoE)
→ .NET 8 Tracking App (GPU body tracking)
→ Pluggable transport (UDP, OSC, custom)
→ Game Engine / Application
The .NET tracking application captures depth frames at 30fps, runs them through the Azure Kinect Body Tracking SDK for skeleton detection, extracts hand positions and states, detects hold/throw gestures, and sends the results to the receiving application via a pluggable transport layer. The entire pipeline runs in a single frame with no buffering delay.
Tracking Pipeline
Frame capture — The Orbbec Femto Mega captures depth frames via its K4A-compatible wrapper. A centralized FrameDataService receives raw depth and body data, applies mirror transforms once, and distributes canonical data to all consumers.
Body detection — Each depth frame is processed by the Azure Kinect Body Tracking SDK (GPU-accelerated via CUDA), producing 32-joint skeletons per detected body. A play zone filter checks each body’s pelvis position against configurable depth (0.8m–2.2m), lateral, and vertical boundaries. Bodies outside the zone are rejected with specific reasons (too close, too far, out of bounds). A primary player selector picks the body closest to the zone centre with a 15-frame hold window to handle brief occlusions.
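The zone filter and primary-player selection described above can be sketched as follows. This is an illustrative Python sketch of the logic, not the production C# code; the class and function names, the zone-centre definition, and the lateral/vertical bound values are assumptions (only the 0.8m–2.2m depth range and the 15-frame hold window come from the text).

```python
import math
from dataclasses import dataclass

@dataclass
class PlayZone:
    """Axis-aligned 3D volume; lateral/vertical bounds are illustrative."""
    min_depth: float = 0.8    # metres from camera (from the text)
    max_depth: float = 2.2
    half_width: float = 1.0   # lateral bound (assumed)
    half_height: float = 1.5  # vertical bound (assumed)

    def check(self, pelvis):
        """Return (accepted, reason) for a pelvis position (x, y, z)."""
        x, y, z = pelvis
        if z < self.min_depth:
            return False, "too close"
        if z > self.max_depth:
            return False, "too far"
        if abs(x) > self.half_width or abs(y) > self.half_height:
            return False, "out of bounds"
        return True, "ok"

def select_primary(bodies, zone, previous_id=None, hold_frames_left=0):
    """Pick the in-zone body closest to the zone centre; keep the
    previous player for a 15-frame hold window during brief occlusions."""
    centre = (0.0, 0.0, (zone.min_depth + zone.max_depth) / 2)
    in_zone = [(bid, p) for bid, p in bodies.items() if zone.check(p)[0]]
    if not in_zone:
        if previous_id is not None and hold_frames_left > 0:
            return previous_id, hold_frames_left - 1  # ride out occlusion
        return None, 0
    bid, _ = min(in_zone, key=lambda item: math.dist(item[1], centre))
    return bid, 15  # reset the hold window
```

Returning a rejection reason rather than a bare boolean is what lets the calibration UI show per-body accept/reject indicators.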
Hand extraction — Four joints per hand (wrist, palm centre, fingertip, thumb) are extracted from the primary player’s skeleton. Positions are smoothed with exponential moving average (configurable factor, default 0.3) and hand states are debounced with hold-frame counting (default 3 frames) to prevent flicker.
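The smoothing and debouncing steps are simple enough to show directly. A minimal Python sketch (the production system is C#/.NET), using the defaults stated above; class names are illustrative:

```python
class JointSmoother:
    """Exponential moving average over a 3D joint position.
    Default factor 0.3 matches the document's stated default."""
    def __init__(self, alpha=0.3):
        self.alpha = alpha
        self.value = None

    def update(self, position):
        if self.value is None:
            self.value = position  # seed with the first observation
        else:
            a = self.alpha
            self.value = tuple(a * new + (1 - a) * old
                               for new, old in zip(position, self.value))
        return self.value

class StateDebouncer:
    """Report a new hand state only after it has been observed for
    `hold_frames` consecutive frames (default 3, per the text)."""
    def __init__(self, hold_frames=3, initial="Closed"):
        self.hold_frames = hold_frames
        self.stable = initial
        self.candidate = initial
        self.count = 0

    def update(self, raw_state):
        if raw_state == self.stable:
            self.candidate, self.count = self.stable, 0  # noise resolved
        elif raw_state == self.candidate:
            self.count += 1
            if self.count >= self.hold_frames:
                self.stable, self.count = raw_state, 0   # commit transition
        else:
            self.candidate, self.count = raw_state, 1    # new candidate
        return self.stable
```

A single noisy frame can never flip the reported state, because the candidate counter resets whenever the raw reading returns to the stable state.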
Hand State Classification
Hand state (Open/Closed) is determined by a composite classifier where five independent strategies vote on the result — joint compactness, spread ratio, thumb angle, depth variance, and silhouette aspect. Three use skeleton joint geometry, two use depth pixel data from a crop around the hand. Combining fundamentally different signal sources means the system stays reliable even when individual signals are noisy.
The classifier uses asymmetric thresholds — opening the hand requires a stronger signal than closing it. Ambiguous frames default to Closed, so noisy readings never trigger a false release mid-grab. The player must clearly open their hands to let go.
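The voting-plus-hysteresis logic can be sketched in a few lines. This is a hedged Python illustration, not the C# implementation; the weights and the 0.65/0.45 thresholds are placeholder values, and the strategies are assumed to each emit a normalised 0..1 "openness" score:

```python
def classify_hand(scores, weights, current_state,
                  open_threshold=0.65, close_threshold=0.45):
    """Composite Open/Closed vote with asymmetric thresholds.

    Transitioning to Open requires a stronger composite signal
    (>= open_threshold) than staying Open (>= close_threshold), so
    ambiguous frames resolve to Closed and a noisy reading never
    causes a false release mid-grab."""
    total = sum(weights.values())
    composite = sum(scores[name] * weights[name] for name in weights) / total
    if current_state == "Closed":
        return "Open" if composite >= open_threshold else "Closed"
    return "Closed" if composite < close_threshold else "Open"

# Illustrative weights for the five strategies named above.
WEIGHTS = {"compactness": 0.25, "spread": 0.25, "thumb": 0.20,
           "depth_var": 0.15, "silhouette": 0.15}
```

The gap between the two thresholds is a hysteresis band: a composite score of, say, 0.5 keeps whatever state the hand is already in, rather than flickering.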
Gesture Detection
A state machine tracks hold and throw gestures — visitors can reach out, grab a virtual ball by bringing their hands together, and throw it with a natural forward motion. The system detects when hands come together (hold), when they separate with forward velocity (throw), and includes a cooldown to prevent double-triggers.
All thresholds are body-proportional, scaling automatically to the player’s size so the experience works equally well for adults and children without recalibration.
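One step of such a state machine might look like the following Python sketch (the real system is C#). The scale factors, the cooldown length, and the use of shoulder width as the body-size reference are assumptions for illustration; only the hold/throw/cooldown structure comes from the text:

```python
import math

def detect_gesture(state, left, right, forward_velocity, shoulder_width,
                   cooldown, hold_scale=0.8, throw_speed_scale=3.0,
                   cooldown_frames=15):
    """Advance the hold/throw state machine by one frame.

    Thresholds are multiples of the player's shoulder width, so the
    same configuration works for adults and children. Returns
    (new_state, cooldown, event) where event is "hold", "throw", or None."""
    hold_dist = hold_scale * shoulder_width           # hands-together distance
    throw_speed = throw_speed_scale * shoulder_width  # forward release speed
    apart = math.dist(left, right)

    if cooldown > 0:                       # suppress double-triggers
        return state, cooldown - 1, None
    if state == "Idle" and apart < hold_dist:
        return "Holding", 0, "hold"        # hands came together: grab
    if state == "Holding" and apart >= hold_dist:
        if forward_velocity > throw_speed:
            return "Idle", cooldown_frames, "throw"
        return "Idle", 0, None             # hands separated slowly: drop
    return state, 0, None
```

Because every threshold is derived from `shoulder_width` at runtime, no per-visitor calibration step is needed.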
Network Transport
The output layer is built around two clean domain interfaces — IPacketFormatter and ITransportSender — separating what gets sent from how it gets sent. A formatter assembles the tracking packet (hand positions, states, gesture data, coordinate mappings) and a transport sender delivers it over the wire.
This makes the system protocol-agnostic. Swapping from UDP to OSC, WebSocket, or any custom transport is a single interface implementation with no changes to the tracking pipeline. The same applies to packet format — JSON, binary, protobuf, or whatever the receiving application expects. New exhibit integrations only need a formatter and a sender.
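The shape of the two interfaces is easy to convey with a sketch. Here is a Python rendering (the real interfaces are C#, named `IPacketFormatter` and `ITransportSender` as above); `JsonFormatter`, `UdpSender`, and `publish` are illustrative names:

```python
import json
import socket
from typing import Protocol

class IPacketFormatter(Protocol):
    def format(self, frame: dict) -> bytes: ...

class ITransportSender(Protocol):
    def send(self, payload: bytes) -> None: ...

class JsonFormatter:
    """One possible packet format: UTF-8 JSON."""
    def format(self, frame: dict) -> bytes:
        return json.dumps(frame).encode("utf-8")

class UdpSender:
    """One possible transport: fire-and-forget UDP datagrams."""
    def __init__(self, host: str, port: int):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def send(self, payload: bytes) -> None:
        self.sock.sendto(payload, self.addr)

def publish(frame: dict, formatter: IPacketFormatter,
            sender: ITransportSender) -> None:
    """The tracking pipeline only ever sees the two interfaces."""
    sender.send(formatter.format(frame))
```

Because `publish` depends only on the interfaces, swapping UDP for OSC (or JSON for protobuf) means writing one new class and changing the wiring, exactly as described above.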
Calibration & Debug Tools
The application includes a full calibration UI with live feedback:
- Depth view with configurable colourmap (Jet, Viridis, etc.) and depth range sliders
- Skeleton overlay with confidence-coloured joints and bones
- Play zone preview showing boundaries on the depth view with accept/reject indicators per body
- Bird’s-eye view for top-down play zone calibration
- Hand state display showing per-strategy scores and composite result
- Gesture mode with full-width diagnostic canvas: velocity history (60 frames), state transition history (120 frames), normalised hand positions, per-strategy debug scores
- Smoothing controls for position EMA factor and state hold frames
- All settings persisted to `appsettings.json` and restored on next launch
Deployment & Resilience
Built for unattended public operation in a museum:
- Headless mode (`--headless`) runs without any visualisation window — just tracking and network output with a status line showing fps, body count, and packet stats
- PoE camera connection — single Ethernet cable for power and data in deployment (USB-C for development)
- Camera disconnect resilience — auto-reconnect with exponential backoff (1s → 3s → 5s → 10s cap). On disconnect, the app falls back to simulated frames keeping the UI functional. The app never exits on its own
- Crash logging — Serilog structured logging with rolling files and dedicated crash file persistence. Global exception handlers catch unhandled errors
- Cameraless mode (`--cameraless`) for development and testing without hardware
- Recording/playback — record live K4A sessions and replay through the full pipeline for deterministic debugging
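The reconnect behaviour in the list above follows the 1s → 3s → 5s → 10s schedule. A Python sketch of that retry loop (the production code is C#; `open_camera` stands in for the real device-open call, and `reconnect` is an illustrative name):

```python
import time

BACKOFF_SCHEDULE = [1, 3, 5, 10]  # seconds; capped at 10 (per the text)

def backoff_delays():
    """Yield the sleep before each consecutive retry, capping at the
    last value in the schedule."""
    yield from BACKOFF_SCHEDULE
    while True:
        yield BACKOFF_SCHEDULE[-1]

def reconnect(open_camera, max_attempts=None, sleep=time.sleep):
    """Retry `open_camera` until it succeeds (the app never exits on
    its own), sleeping per the backoff schedule between failures.
    `max_attempts` and the injectable `sleep` exist for testing."""
    for attempt, delay in enumerate(backoff_delays(), start=1):
        try:
            return open_camera()
        except ConnectionError:
            if max_attempts is not None and attempt >= max_attempts:
                raise
            sleep(delay)
```

Injecting `sleep` keeps the loop deterministic under test, in the same spirit as the project's cameraless and playback modes.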
Test Coverage
820 unit tests (xUnit + Moq) covering all domain logic — hand state strategies, gesture detection, play zone filtering, smoothing, coordinate mapping. Tests run in ~1 second without any camera hardware, enabled by the three-layer architecture: Domain (pure C# logic) → Infrastructure (SDK wrappers) → App (wiring).
Results
The depth-based system eliminated every failure mode of the original RGB approach. Infrared sensing is immune to wall colours and lighting conditions. Body tracking provides positive identification of individuals, and the 3D play zone cleanly separates the active player from spectators. The five-strategy hand classifier with asymmetric thresholds produces reliable open/closed detection through hand rotation and SDK noise that previously caused false throws. The system has been running in production at the museum exhibit.