Depth-Based Hand Tracking for Interactive Exhibits
Replacing RGB hand tracking with infrared depth sensing for reliable museum installations
- Camera: Orbbec Femto Mega (Time-of-Flight IR depth, 120° x 120° FOV)
- Body Tracking: Azure Kinect SDK — 32 joints per body at 30fps
- Hand Classification: 5-strategy weighted voting (joint + depth-based)
- Gestures: Proximity-based hold + velocity-based throw detection
- Transport: Pluggable packet formatter + transport (UDP, OSC, or custom)
- Latency: Single-frame pipeline at 30fps (~33ms)
- Play Zone: Configurable 3D volume filtering with body-proportional thresholds
- Deployment: Headless mode, auto-start, PoE camera connection, auto-reconnect
- Tech Stack: C# / .NET 8, Azure Kinect Body Tracking SDK, Orbbec K4A Wrapper, OpenCV, Serilog, CUDA 11.8 + cuDNN, UDP / OSC (transport)
Overview
A museum exhibit needed reliable hand tracking to let visitors grab and throw virtual objects using natural hand gestures. The original RGB-based solution using a fisheye camera and MediaPipe failed under real venue conditions — orange walls triggered false skin detections, inconsistent lighting washed out hands, multiple spectators confused the tracker, fisheye distortion warped hands at frame edges, and calibration that worked for one visitor failed for the next. RGB colour analysis simply cannot reliably isolate hands in an uncontrolled public environment.
We replaced it with a depth-based system using an Orbbec Femto Mega infrared time-of-flight camera and the Azure Kinect Body Tracking SDK. Infrared depth sensing ignores visible light and wall colours entirely, body tracking identifies individuals rather than just hands, and a configurable 3D play zone physically isolates the active player from spectators.
Solution Architecture
Orbbec Femto Mega (PoE)
→ .NET 8 Tracking App (GPU body tracking)
→ Pluggable transport (UDP, OSC, custom)
→ Game Engine / Application
The .NET tracking application captures depth frames at 30fps, runs them through the Azure Kinect Body Tracking SDK for skeleton detection, extracts hand positions and states, detects hold/throw gestures, and sends the results to the receiving application via a pluggable transport layer. The entire pipeline runs in a single frame with no buffering delay.
Tracking Pipeline
Frame capture — The Orbbec Femto Mega captures depth frames via its K4A-compatible wrapper. A centralized FrameDataService receives raw depth and body data, applies mirror transforms once, and distributes canonical data to all consumers.
Body detection — Each depth frame is processed by the Azure Kinect Body Tracking SDK (GPU-accelerated via CUDA), producing 32-joint skeletons per detected body. A play zone filter checks each body’s pelvis position against configurable depth (0.8m–2.2m), lateral, and vertical boundaries. Bodies outside the zone are rejected with specific reasons (too close, too far, out of bounds). A primary player selector picks the body closest to the zone centre with a 15-frame hold window to handle brief occlusions.
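The zone filter and primary-player selection described above can be sketched as follows. This is an illustrative Python sketch of the logic, not the production C# code; the class and function names, the zone-centre definition, and the lateral/vertical bound values are assumptions (only the 0.8m–2.2m depth range and the 15-frame hold window come from the text).

```python
import math
from dataclasses import dataclass

@dataclass
class PlayZone:
    """Axis-aligned 3D volume; lateral/vertical bounds are illustrative."""
    min_depth: float = 0.8    # metres from camera (from the text)
    max_depth: float = 2.2
    half_width: float = 1.0   # lateral bound (assumed)
    half_height: float = 1.5  # vertical bound (assumed)

    def check(self, pelvis):
        """Return (accepted, reason) for a pelvis position (x, y, z)."""
        x, y, z = pelvis
        if z < self.min_depth:
            return False, "too close"
        if z > self.max_depth:
            return False, "too far"
        if abs(x) > self.half_width or abs(y) > self.half_height:
            return False, "out of bounds"
        return True, "ok"

def select_primary(bodies, zone, previous_id=None, hold_frames_left=0):
    """Pick the in-zone body closest to the zone centre; keep the
    previous player for a 15-frame hold window during brief occlusions."""
    centre = (0.0, 0.0, (zone.min_depth + zone.max_depth) / 2)
    in_zone = [(bid, p) for bid, p in bodies.items() if zone.check(p)[0]]
    if not in_zone:
        if previous_id is not None and hold_frames_left > 0:
            return previous_id, hold_frames_left - 1  # ride out occlusion
        return None, 0
    bid, _ = min(in_zone, key=lambda item: math.dist(item[1], centre))
    return bid, 15  # reset the hold window
```

Returning a rejection reason rather than a bare boolean is what lets the calibration UI show per-body accept/reject indicators.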
Hand extraction — Four joints per hand (wrist, palm centre, fingertip, thumb) are extracted from the primary player’s skeleton. Positions are smoothed with exponential moving average (configurable factor, default 0.3) and hand states are debounced with hold-frame counting (default 3 frames) to prevent flicker.
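The smoothing and debouncing steps are simple enough to show directly. A minimal Python sketch (the production system is C#/.NET), using the defaults stated above; class names are illustrative:

```python
class JointSmoother:
    """Exponential moving average over a 3D joint position.
    Default factor 0.3 matches the document's stated default."""
    def __init__(self, alpha=0.3):
        self.alpha = alpha
        self.value = None

    def update(self, position):
        if self.value is None:
            self.value = position  # seed with the first observation
        else:
            a = self.alpha
            self.value = tuple(a * new + (1 - a) * old
                               for new, old in zip(position, self.value))
        return self.value

class StateDebouncer:
    """Report a new hand state only after it has been observed for
    `hold_frames` consecutive frames (default 3, per the text)."""
    def __init__(self, hold_frames=3, initial="Closed"):
        self.hold_frames = hold_frames
        self.stable = initial
        self.candidate = initial
        self.count = 0

    def update(self, raw_state):
        if raw_state == self.stable:
            self.candidate, self.count = self.stable, 0  # noise resolved
        elif raw_state == self.candidate:
            self.count += 1
            if self.count >= self.hold_frames:
                self.stable, self.count = raw_state, 0   # commit transition
        else:
            self.candidate, self.count = raw_state, 1    # new candidate
        return self.stable
```

A single noisy frame can never flip the reported state, because the candidate counter resets whenever the raw reading returns to the stable state.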
Hand State Classification
Hand state (Open/Closed) is determined by a composite classifier where five independent strategies vote on the result — joint compactness, spread ratio, thumb angle, depth variance, and silhouette aspect. Three use skeleton joint geometry, two use depth pixel data from a crop around the hand. Combining fundamentally different signal sources means the system stays reliable even when individual signals are noisy.
The classifier uses asymmetric thresholds — opening the hand requires a stronger signal than closing it. Ambiguous frames default to Closed, so noisy readings never trigger a false release mid-grab. The player must clearly open their hands to let go.
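The voting-plus-hysteresis logic can be sketched in a few lines. This is a hedged Python illustration, not the C# implementation; the weights and the 0.65/0.45 thresholds are placeholder values, and the strategies are assumed to each emit a normalised 0..1 "openness" score:

```python
def classify_hand(scores, weights, current_state,
                  open_threshold=0.65, close_threshold=0.45):
    """Composite Open/Closed vote with asymmetric thresholds.

    Transitioning to Open requires a stronger composite signal
    (>= open_threshold) than staying Open (>= close_threshold), so
    ambiguous frames resolve to Closed and a noisy reading never
    causes a false release mid-grab."""
    total = sum(weights.values())
    composite = sum(scores[name] * weights[name] for name in weights) / total
    if current_state == "Closed":
        return "Open" if composite >= open_threshold else "Closed"
    return "Closed" if composite < close_threshold else "Open"

# Illustrative weights for the five strategies named above.
WEIGHTS = {"compactness": 0.25, "spread": 0.25, "thumb": 0.20,
           "depth_var": 0.15, "silhouette": 0.15}
```

The gap between the two thresholds is a hysteresis band: a composite score of, say, 0.5 keeps whatever state the hand is already in, rather than flickering.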
Gesture Detection
A state machine tracks hold and throw gestures — visitors can reach out, grab a virtual ball by bringing their hands together, and throw it with a natural forward motion. The system detects when hands come together (hold), when they separate with forward velocity (throw), and includes a cooldown to prevent double-triggers.
All thresholds are body-proportional, scaling automatically to the player’s size so the experience works equally well for adults and children without recalibration.
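One step of such a state machine might look like the following Python sketch (the real system is C#). The scale factors, the cooldown length, and the use of shoulder width as the body-size reference are assumptions for illustration; only the hold/throw/cooldown structure comes from the text:

```python
import math

def detect_gesture(state, left, right, forward_velocity, shoulder_width,
                   cooldown, hold_scale=0.8, throw_speed_scale=3.0,
                   cooldown_frames=15):
    """Advance the hold/throw state machine by one frame.

    Thresholds are multiples of the player's shoulder width, so the
    same configuration works for adults and children. Returns
    (new_state, cooldown, event) where event is "hold", "throw", or None."""
    hold_dist = hold_scale * shoulder_width           # hands-together distance
    throw_speed = throw_speed_scale * shoulder_width  # forward release speed
    apart = math.dist(left, right)

    if cooldown > 0:                       # suppress double-triggers
        return state, cooldown - 1, None
    if state == "Idle" and apart < hold_dist:
        return "Holding", 0, "hold"        # hands came together: grab
    if state == "Holding" and apart >= hold_dist:
        if forward_velocity > throw_speed:
            return "Idle", cooldown_frames, "throw"
        return "Idle", 0, None             # hands separated slowly: drop
    return state, 0, None
```

Because every threshold is derived from `shoulder_width` at runtime, no per-visitor calibration step is needed.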
Network Transport
The output layer is built around two clean domain interfaces — IPacketFormatter and ITransportSender — separating what gets sent from how it gets sent. A formatter assembles the tracking packet (hand positions, states, gesture data, coordinate mappings) and a transport sender delivers it over the wire.
This makes the system protocol-agnostic. Swapping from UDP to OSC, WebSocket, or any custom transport is a single interface implementation with no changes to the tracking pipeline. The same applies to packet format — JSON, binary, protobuf, or whatever the receiving application expects. New exhibit integrations only need a formatter and a sender.
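The shape of the two interfaces is easy to convey with a sketch. Here is a Python rendering (the real interfaces are C#, named `IPacketFormatter` and `ITransportSender` as above); `JsonFormatter`, `UdpSender`, and `publish` are illustrative names:

```python
import json
import socket
from typing import Protocol

class IPacketFormatter(Protocol):
    def format(self, frame: dict) -> bytes: ...

class ITransportSender(Protocol):
    def send(self, payload: bytes) -> None: ...

class JsonFormatter:
    """One possible packet format: UTF-8 JSON."""
    def format(self, frame: dict) -> bytes:
        return json.dumps(frame).encode("utf-8")

class UdpSender:
    """One possible transport: fire-and-forget UDP datagrams."""
    def __init__(self, host: str, port: int):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def send(self, payload: bytes) -> None:
        self.sock.sendto(payload, self.addr)

def publish(frame: dict, formatter: IPacketFormatter,
            sender: ITransportSender) -> None:
    """The tracking pipeline only ever sees the two interfaces."""
    sender.send(formatter.format(frame))
```

Because `publish` depends only on the interfaces, swapping UDP for OSC (or JSON for protobuf) means writing one new class and changing the wiring, exactly as described above.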
Calibration & Debug Tools
The application includes a full calibration UI with live feedback:
- Depth view with configurable colourmap (Jet, Viridis, etc.) and depth range sliders
- Skeleton overlay with confidence-coloured joints and bones
- Play zone preview showing boundaries on the depth view with accept/reject indicators per body
- Bird’s-eye view for top-down play zone calibration
- Hand state display showing per-strategy scores and composite result
- Gesture mode with full-width diagnostic canvas: velocity history (60 frames), state transition history (120 frames), normalised hand positions, per-strategy debug scores
- Smoothing controls for position EMA factor and state hold frames
- All settings persisted to `appsettings.json` and restored on next launch
Deployment & Resilience
Built for unattended public operation in a museum:
- Headless mode (`--headless`) runs without any visualisation window — just tracking and network output with a status line showing fps, body count, and packet stats
- PoE camera connection — single Ethernet cable for power and data in deployment (USB-C for development)
- Camera disconnect resilience — auto-reconnect with exponential backoff (1s → 3s → 5s → 10s cap). On disconnect, the app falls back to simulated frames keeping the UI functional. The app never exits on its own
- Crash logging — Serilog structured logging with rolling files and dedicated crash file persistence. Global exception handlers catch unhandled errors
- Cameraless mode (`--cameraless`) for development and testing without hardware
- Recording/playback — record live K4A sessions and replay through the full pipeline for deterministic debugging
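The reconnect behaviour in the list above follows the 1s → 3s → 5s → 10s schedule. A Python sketch of that retry loop (the production code is C#; `open_camera` stands in for the real device-open call, and `reconnect` is an illustrative name):

```python
import time

BACKOFF_SCHEDULE = [1, 3, 5, 10]  # seconds; capped at 10 (per the text)

def backoff_delays():
    """Yield the sleep before each consecutive retry, capping at the
    last value in the schedule."""
    yield from BACKOFF_SCHEDULE
    while True:
        yield BACKOFF_SCHEDULE[-1]

def reconnect(open_camera, max_attempts=None, sleep=time.sleep):
    """Retry `open_camera` until it succeeds (the app never exits on
    its own), sleeping per the backoff schedule between failures.
    `max_attempts` and the injectable `sleep` exist for testing."""
    for attempt, delay in enumerate(backoff_delays(), start=1):
        try:
            return open_camera()
        except ConnectionError:
            if max_attempts is not None and attempt >= max_attempts:
                raise
            sleep(delay)
```

Injecting `sleep` keeps the loop deterministic under test, in the same spirit as the project's cameraless and playback modes.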
Test Coverage
820 unit tests (xUnit + Moq) covering all domain logic — hand state strategies, gesture detection, play zone filtering, smoothing, coordinate mapping. Tests run in ~1 second without any camera hardware, enabled by the three-layer architecture: Domain (pure C# logic) → Infrastructure (SDK wrappers) → App (wiring).
Results
The depth-based system eliminated every failure mode of the original RGB approach. Infrared sensing is immune to wall colours and lighting conditions. Body tracking provides positive identification of individuals, and the 3D play zone cleanly separates the active player from spectators. The five-strategy hand classifier with asymmetric thresholds produces reliable open/closed detection through hand rotation and SDK noise that previously caused false throws. The system has been running in production at the museum exhibit.