
Seeing What Isn't There: How We Re-Engineered 3D Pose Estimation for the Edge

Sohail Zangenehpour, PhD, and Colin Brown, PhD
March 10, 2026
Accurate movement tracking even when close to the device.

The "Small Room" Problem

At Hinge Health, our mission is to help people conquer musculoskeletal pain. A cornerstone of this mission is TrueMotion, our computer vision technology that runs directly on a member's phone or tablet. It provides real-time form feedback, automatic rep counting, and high-fidelity pose measurements for movement analysis and other diagnostic applications, transforming a simple video stream into a guided physical therapy session.

But delivering clinical-grade computer vision in the real world comes with a significant logistical hurdle: space.

In a traditional computer vision laboratory, the subject is perfectly illuminated and stands at an optimal distance from a calibrated machine vision camera, with their entire body clearly visible in the frame. In reality, our members are exercising in bedrooms, cramped home offices, or living rooms with coffee tables pushed to the side. To get the full benefit of TrueMotion, users often have to prop their phone up against a book, step back five or six feet, and hope they are fully visible.

This creates friction. We found that for exercises involving extended limbs, like a "Superman" pose or a standing side leg raise, or prone exercises on the floor, members struggled to stay in frame. If a hand drifted off the edge of the screen or a foot disappeared behind a couch, the feedback loop would break. The pose estimation model would lose track of the limb, the rep count might stop, or worse, the system would provide inaccurate corrective feedback based on incomplete data.

We realized that strictly requiring members to be "fully in frame" was a barrier to access. We needed to ensure that our computer vision experience wasn't adding friction to the already difficult task of building a new exercise habit. Our goal became clear: re-engineer our vision stack to "see what isn't there", to accurately track and analyze human movement even when parts of the user are off-camera.

The Challenge of Top-Down Estimation

To understand why this is difficult, we have to look at how most state-of-the-art 3D pose estimation pipelines work. The industry standard for mobile-optimized pose estimation is a top-down approach.

In a top-down system, the pipeline performs two distinct steps (sketched in code after the list):

  1. Person Detection: The system scans the full image to find a human and draws a bounding box around them.
  2. Pose Estimation: The system crops the image to that bounding box and feeds only that cropped part of the image into a pose model to identify key points (joints, eyes, ears, etc.).
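To make the two-step structure concrete, here is a minimal sketch of a generic top-down pipeline. The `detect_person` and `estimate_pose` callables are placeholders for the two networks, not our actual models; the point is simply that the pose model only ever sees pixels inside a crop clamped to the image.

```python
import numpy as np

def top_down_pose(frame: np.ndarray, detect_person, estimate_pose):
    """Generic top-down pipeline: detect a person, crop, then estimate pose.

    `detect_person` and `estimate_pose` are stand-ins for the two networks;
    their interfaces here are illustrative only.
    """
    # Step 1: person detection -> bounding box (x0, y0, x1, y1) or None.
    box = detect_person(frame)
    if box is None:
        return None  # no person found, no pose

    h, w = frame.shape[:2]
    x0, y0, x1, y1 = map(int, box)
    # The crop is clamped to the image bounds, so a hand or foot that has
    # left the frame never reaches the pose model at all.
    crop = frame[max(0, y0):min(h, y1), max(0, x0):min(w, x1)]

    # Step 2: pose estimation on the cropped pixels only.
    return estimate_pose(crop)
```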

This approach is efficient and accurate for full-body tracking, but it is brittle when the subject touches the edge of the frame. If a user raises a hand above their head and it exits the camera view, the person detector does not grow the bounding box to account for the missing arm; the arm simply goes undetected, even when visual cues and prior information would allow a confident inference of its position. The system assumes the arm is gone.

To solve this, we couldn't just tweak the model; we had to innovate across the entire TrueMotion Computer Vision (CV) Engine, from how we detect and track bounding boxes to how we train our human pose estimation networks to understand object permanence.

Solution Part 1: "Partial-view" 3D Pose Model

Our primary 3D pose model needed to learn a new skill: inference of the unseen.

Standard pose models are penalized during training if they predict a joint that isn't annotated in the ground truth data. For example, if a hand is off-screen in the training data, the model is usually taught to output "no prediction." We flipped this logic.

For our new 3D pose model, we implemented a novel "Partial-view" data augmentation strategy. During training, we artificially masked out large sections of our training images—covering the top, bottom, or sides with black bars, sometimes occluding substantial proportions of the image area. We then forced the model to predict the entire skeleton, including the parts hidden behind the black bars.
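The augmentation itself is simple to sketch. The snippet below shows one way to black out a band along a random edge of a training image while leaving the keypoint targets untouched; the masking fractions and edge probabilities here are illustrative values, not our production settings.

```python
import numpy as np

def partial_view_mask(image: np.ndarray, rng: np.random.Generator,
                      max_fraction: float = 0.5) -> np.ndarray:
    """Black out a band along one randomly chosen edge of the image.

    The keypoint labels are intentionally left unchanged, so the model is
    still supervised on joints that are now hidden behind the black bar.
    `max_fraction` is an illustrative value.
    """
    masked = image.copy()
    h, w = masked.shape[:2]
    edge = rng.choice(["top", "bottom", "left", "right"])
    frac = rng.uniform(0.1, max_fraction)

    if edge == "top":
        masked[: int(h * frac), :] = 0
    elif edge == "bottom":
        masked[int(h * (1 - frac)):, :] = 0
    elif edge == "left":
        masked[:, : int(w * frac)] = 0
    else:  # right
        masked[:, int(w * (1 - frac)):] = 0
    return masked
```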

We also adjusted the loss function, reducing the penalty for predicting unannotated keypoints. This essentially gave the model "permission to guess." We encouraged it to use the visible kinematic chain to infer the likely position of the invisible key-points (e.g. using the position of the shoulder and elbow to infer the position of the wrist), even if those invisible key-points were outside the visible window.
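In training-framework terms, this amounts to down-weighting rather than zeroing the loss on keypoints the model cannot see, as in the masked-image case above where the ground truth is still known. A minimal sketch assuming PyTorch-style tensors and a per-keypoint visibility mask; the 0.2 weight is an illustrative value, not our production setting.

```python
import torch

def partial_view_keypoint_loss(pred: torch.Tensor,
                               target: torch.Tensor,
                               visible: torch.Tensor,
                               hidden_weight: float = 0.2) -> torch.Tensor:
    """Per-keypoint L1 loss with a reduced (not zero) penalty on
    keypoints that are hidden or outside the visible window.

    pred, target: (batch, num_keypoints, 3) predicted / ground-truth 3D points.
    visible:      (batch, num_keypoints) 1.0 where the keypoint is visible,
                  0.0 where it is hidden behind a mask or off-frame.
    """
    per_kp = (pred - target).abs().mean(dim=-1)          # (batch, num_keypoints)
    weights = visible + hidden_weight * (1.0 - visible)  # 1.0 visible, 0.2 hidden
    return (weights * per_kp).mean()
```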

Despite being a very lightweight model (~4M parameters), it generalized surprisingly well to these occlusions, extrapolating the positions of unseen key-points from visual information about the visible, upstream parts of the body.

The result was a model that developed a robust sense of anatomical structure. It no longer treats the edge of the frame as a cliff; it treats it as a window, confidently predicting 3D key-point locations that lie in the invisible space beyond the screen.

Solution Part 2: Bounding Box Detector-Tracker 2.0

Training the model to extrapolate accurate key-points was only half the battle. We also needed a detector and tracker that let the model's visual context (the bounding box) extend beyond the frame.

In our previous architectures, we used a dedicated person detector module. While effective, it was relatively costly to run a separate model simply to place a bounding box around the person, especially in our applications, where members are usually trying to stay mostly in frame and close to the camera. For the latest computer vision engine release, we made a radical architectural change: we removed the separate person detector module entirely.

Instead, we now rely on the pose estimation model itself to initialize detection using a multi-hypothesis approach. Once a person is found, we switch to our re-engineered Bounding Box Detector-Tracker 2.0, which introduces three critical behaviors for robustness:
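One way such a detector-free initialization could look in practice: run the pose model on a handful of candidate crops (the hypotheses), keep the most confident result, and seed the tracker from the resulting keypoints. The crop grid and confidence threshold below are assumptions for illustration, not the engine's actual parameters.

```python
import numpy as np

def init_from_pose_model(frame, estimate_pose, min_confidence=0.5):
    """Detector-free initialization sketch: try several candidate crops
    (hypotheses), keep the most confident pose, and derive an initial
    bounding box from its keypoints.

    `estimate_pose(crop)` is assumed to return (keypoints, confidence),
    where keypoints are (N, 2) pixel coordinates within the crop.
    """
    h, w = frame.shape[:2]
    # A few coarse hypotheses: full frame, centered crop, left/right halves.
    hypotheses = [
        (0, 0, w, h),
        (w // 4, h // 4, 3 * w // 4, 3 * h // 4),
        (0, 0, w // 2, h),
        (w // 2, 0, w, h),
    ]

    best = None
    for (x0, y0, x1, y1) in hypotheses:
        keypoints, conf = estimate_pose(frame[y0:y1, x0:x1])
        if best is None or conf > best[0]:
            # Shift keypoints back into full-frame coordinates.
            best = (conf, keypoints + np.array([x0, y0]))

    conf, keypoints = best
    if conf < min_confidence:
        return None  # no person found; try again on the next frame

    # Initial bounding box = extent of the detected keypoints.
    (x0, y0), (x1, y1) = keypoints.min(axis=0), keypoints.max(axis=0)
    return conf, keypoints, (x0, y0, x1, y1)
```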

1. Expand Fast, Shrink Slow

Previously, if a user performed a fast movement, like a jumping jack, their limbs could move faster than the tracker could update, momentarily escaping the bounding box. We determined that the penalty for missing a limb is greater than the penalty of momentarily losing accuracy because the bounding box is overly large. Based on this insight, the new logic is asymmetric (see the sketch below):

  • Expand Fast: If the model detects movement toward the edge, the bounding box expands instantly to capture the new key-point position.
  • Shrink Slow: If key points converge (e.g., person squats, moves back from camera or self-occludes), the box resists shrinking immediately. It holds its size, giving the pose model a chance to recover the limb using the inference capabilities we trained into it.

This is a fairly obvious change in hindsight, but works very well in practice.
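A minimal sketch of the asymmetric update, assuming both the tracked box and the box implied by the latest keypoints are in (x0, y0, x1, y1) form; the shrink rate is an illustrative value, not our production setting.

```python
def update_box(tracked, observed, shrink_rate=0.1):
    """Expand-fast / shrink-slow bounding box update (illustrative).

    tracked:  current tracked box (x0, y0, x1, y1)
    observed: box implied by the latest keypoint predictions
    Expansion is applied immediately; shrinking only takes a small step
    toward the observed box each frame.
    """
    def blend(t, o, grow):
        if grow:
            return o                       # expand fast: jump to the new extent
        return t + shrink_rate * (o - t)   # shrink slow: ease in gradually

    tx0, ty0, tx1, ty1 = tracked
    ox0, oy0, ox1, oy1 = observed
    return (
        blend(tx0, ox0, grow=ox0 < tx0),  # left edge grows when it moves left
        blend(ty0, oy0, grow=oy0 < ty0),  # top edge grows when it moves up
        blend(tx1, ox1, grow=ox1 > tx1),  # right edge grows when it moves right
        blend(ty1, oy1, grow=oy1 > ty1),  # bottom edge grows when it moves down
    )
```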

2. Out-of-Frame Tracking

Standard bounding boxes are limited by image resolution (e.g., 0 to 1920 pixels). Our new tracker allows the bounding box coordinates to exceed the image boundaries. If the pose model infers that a user's hand is at pixel coordinate -200 (far to the left of the screen), the tracker accepts this and maintains a "virtual" bounding box that encompasses empty space. This keeps the coordinate system stable even when the user is only partially visible.
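The key implementation detail is simply not clamping the tracked box to the image. A hedged sketch of deriving a box from keypoints without any clamping:

```python
def keypoints_to_box(keypoints, margin):
    """Derive a tracking box from keypoints WITHOUT clamping to the image.

    Keypoints inferred beyond the frame (e.g. x = -200) are allowed to push
    the box into "virtual" space outside the image, which keeps the crop's
    coordinate system stable from frame to frame.
    """
    xs = [x for x, _ in keypoints]
    ys = [y for _, y in keypoints]
    # Note: no max(0, ...) / min(width, ...) here, on purpose.
    return (min(xs) - margin, min(ys) - margin,
            max(xs) + margin, max(ys) + margin)
```

When such a crop is materialized for the pose model, the region outside the image can be padded (for example with zeros) so the model still receives a consistently shaped window; the padding strategy here is an assumption, not a description of our exact implementation.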

3. Adaptive Margins

The system now dynamically adjusts the margin around the user based on confidence. If the pose estimation is highly confident, we tighten the margin to save processing power. If confidence drops (likely due to occlusion), we expand the margin to ensure we aren't accidentally cropping out a hard-to-see key-point.
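A sketch of one way a confidence-driven margin could look; the specific values are assumptions for illustration.

```python
def adaptive_margin(confidence, base=0.10, extra=0.25):
    """Margin (as a fraction of box size) that widens as confidence drops.

    High confidence -> tight margin (less wasted computation);
    low confidence  -> generous margin (less risk of cropping out a
    hard-to-see key-point). `base` and `extra` are illustrative values.
    """
    confidence = min(max(confidence, 0.0), 1.0)
    return base + extra * (1.0 - confidence)
```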

Measuring the Invisible: A New Evaluation Methodology

One of the hardest parts of solving the "out of frame" problem is verifying that you've actually solved it. How do you measure the accuracy of a predicted hand position that isn't in the video?

We couldn't rely on our standard evaluation datasets, so we created a new evaluation protocol: Synthetic Out-of-Frame Testing.

We took our "Gold Standard" datasets—where users are fully visible and we have ground-truth 3D annotations—and systematically destroyed them.

  1. Image Augmentation: We applied random black bars to single images, occluding up to 50% of the bounding box, and measured how well the model could reconstruct the hidden key-points compared to the known ground truth.
  2. Video Cropping: We took full videos and cropped them dynamically to simulate a user moving in and out of frame. We then ran the full pipeline (Tracker + Pose Model) on these chopped-up videos.

This gave us a rigorous "Out of Frame KPI" (Key Performance Indicator), a concrete metric to measure how much accuracy we lose when the user is not fully visible.
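For the single-image variant of this protocol, the evaluation loop is essentially: mask part of a fully annotated image, run the model, and score only the keypoints hidden by the mask against the known ground truth. A simplified sketch in the spirit of the masking augmentation above; the error metric and interfaces are illustrative, not our exact KPI definition.

```python
import numpy as np

def out_of_frame_error(images, gt_keypoints, estimate_pose, mask_fn, rng):
    """Mean 3D error on keypoints hidden by a synthetic occlusion.

    images:        list of fully visible evaluation images
    gt_keypoints:  list of (N, 3) ground-truth 3D keypoints per image
    estimate_pose: model under test, returning (N, 3) predictions
    mask_fn:       augmentation returning (masked_image, hidden_mask), where
                   hidden_mask is a boolean (N,) array marking keypoints
                   covered by the synthetic black bars.
    """
    errors = []
    for image, gt in zip(images, gt_keypoints):
        masked_image, hidden = mask_fn(image, gt, rng)
        pred = estimate_pose(masked_image)
        if hidden.any():
            # Score only the keypoints the model could not see.
            errors.append(np.linalg.norm(pred[hidden] - gt[hidden], axis=-1).mean())
    return float(np.mean(errors))
```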

Results

The results of this re-engineering effort were dramatic.

When testing on our Out of Frame Video KPI (simulating partial visibility), our previous best model stack managed an average PCP (Percentage of Correct Poses) of 45.4%. It essentially failed more than half the time the user went off-screen. The new detector and model stack achieved an average PCP of 63.3%. Similarly, we saw an average reduction in angular error of about 2.2 degrees on partially occluded images.

This is not just a statistical bump; it is a fundamental shift in usability. In qualitative testing, the difference is visually obvious. With the old pose estimation model stack, a limb moving off-screen would vanish, causing the virtual skeleton to snap into a "broken" pose. With the new model stack, the skeleton remains "sticky." As a user reaches up during a lat stretch, the virtual arm extends naturally off the top of the screen, tracking the movement smoothly as if the camera were wider than it actually is.

Furthermore, this robustness allowed us to "unlock" difficult exercises that had previously failed our quality thresholds because they required wide movements (like "Snow Angels" or "Lateral Step Downs").

Looking Ahead

This technology is currently rolling out to the Hinge Health app. By enabling our CV engine to "see what isn't there," we are removing a subtle but significant layer of friction for our members.

We don't want people worrying about camera angles, lighting, or furniture placement. We want them focused on their movement and their recovery. By moving the complexity into the engineering stack, solving the hard problems of occlusion and user permanence on the edge, we make the experience simple, seamless, and effective for everyone, no matter how small their living room might be.
