
Tracking Progress: The Challenges of Precisely Measuring Movement from a Phone

Caroline Rougier, PhD, and Colin Brown, PhD
March 16, 2026
Accurate movement tracking even when close to the device.

In digital musculoskeletal (MSK) care, one of our primary goals is to empower members with objective information about their function. Along with pain reduction, the ability to move freely, to bend, reach, and twist without limitation, is often the most tangible sign of recovery.

Traditionally, measuring this range of motion (ROM) has been the domain of the clinic, relying on goniometers, trained eyes, and controlled environments. Bringing this level of clinical precision to a smartphone that can be placed in any room is a massive technical challenge. This post details our journey researching and evaluating robust computer vision-based ROM measurement for frictionless movement analysis in members' own homes.

Defining Clinical Accuracy

Before writing a single line of code to measure movement, we had to answer a fundamental question: How accurate do we actually need to be?

In computer vision, it is tempting to chase pixel-perfect precision or minimize average angular error merely for the sake of engineering excellence. However, for a physical therapy application, "accuracy" must be defined by clinical relevance. We anchored our success metrics in two established clinical concepts:

  1. Minimal Detectable Change (MDC): The smallest amount of change in a measurement that exceeds the threshold of clinical measurement error. If a member's range of motion improves by 5 degrees, but the measurement error is 6 degrees, we cannot confidently say they have improved.
  2. Minimal Clinically Important Difference (MCID): The smallest change in a measurement that a patient perceives as beneficial. This is the "meaningful" threshold—the difference between feeling stiff and feeling mobile.

We combined these concepts to create our Clinical Acceptance Thresholds. Ideally, we would use the minimum of MCID and MDC for each measurement, but in many cases the literature only provides reliable values for one or the other. For each movement, whether it's a low back flexion or a neck rotation, we derived a specific error margin (specifically, a one-sided upper error interval) from the available clinical thresholds that our technology had to meet.

For example, for a Squat, we established a strict error threshold of approximately 9 degrees on the bend at the hips. For a Knee Extension, the tolerance is even tighter, around 3 degrees, which may only represent a few pixels of difference in ankle position. These aren't arbitrary engineering targets; they are the "safe zones" where we can trust that the data we show a member reflects their actual physical progress, not digital noise.
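Concretely, a threshold like these can be assembled from published reliability statistics. The sketch below uses the standard test-retest formulas SEM = SD·√(1 − ICC) and MDC95 = 1.96·√2·SEM; the function names, fallback logic, and example numbers are illustrative, not values from our clinical review.

```python
import math

def mdc95(sd, icc):
    """Minimal Detectable Change at 95% confidence from test-retest stats:
    SEM = SD * sqrt(1 - ICC); MDC95 = 1.96 * sqrt(2) * SEM."""
    sem = sd * math.sqrt(1.0 - icc)
    return 1.96 * math.sqrt(2.0) * sem

def acceptance_threshold(mcid=None, sd=None, icc=None):
    """Prefer the minimum of MCID and MDC95; fall back to whichever
    the literature provides. SD and MCID are in degrees; ICC is unitless."""
    candidates = []
    if mcid is not None:
        candidates.append(mcid)
    if sd is not None and icc is not None:
        candidates.append(mdc95(sd, icc))
    if not candidates:
        raise ValueError("need an MCID or test-retest (SD, ICC) statistics")
    return min(candidates)

# Illustrative numbers only: SD = 4.0 deg, ICC = 0.90, MCID = 5.0 deg
print(acceptance_threshold(mcid=5.0, sd=4.0, icc=0.90))  # ~3.5 deg
```

With those made-up inputs, the MDC95 of roughly 3.5 degrees undercuts the MCID and becomes the acceptance threshold, which is the same logic that produces tighter targets for movements like Knee Extension.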

The Challenge of the "Wild"

In a motion capture lab, cameras are perfectly calibrated, lighting is controlled, and subjects wear tight-fitting suits with reflective markers. In the real world, often in our members' living rooms, conditions are chaotic.

Our computer vision models must contend with:

  • Variable Camera Placement: Users might place their phone on the floor, a coffee table, or a high shelf. They might stand too close or too far away. Critically, the angle of the camera to the person may vary.
  • Perspective Distortion: Different devices with different lenses will project the 3D world into 2D images in different ways (e.g. causing different degrees of foreshortening of limbs).
  • Lighting Conditions: Backlighting from a window can turn a user into a silhouette, while dim evening lighting introduces sensor noise.
  • Clothing: Baggy t-shirts, dresses, or robes can obscure the skeletal structure, making it difficult to locate joints like the hips or knees accurately.
  • Self-Occlusion: In movements like a side bend or a rotation, parts of the body naturally disappear behind others from the camera's perspective.

The most persistent challenge we faced was Depth Ambiguity. A standard phone camera sees in 2D. When a user performs a side bend, their hips might rotate slightly (a "hip twist"). To a 2D sensor, this rotation can look nearly identical to a lateral bend, potentially skewing the measurement. Without depth sensors, our models have to infer this 3D structure from flat pixels, a problem that required novel engineering solutions.
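The ambiguity is easy to demonstrate numerically. The sketch below (assuming a simple pinhole camera with a 1000-pixel focal length; the coordinates are illustrative) constructs two torso poses with the same bone length that project to exactly the same pixels yet differ in true lean by roughly 40 degrees: the camera ray through the shoulder's pixel intersects the sphere of possible shoulder positions at two points.

```python
import numpy as np

F = 1000.0  # assumed focal length in pixels (simple pinhole model)

def project(p):
    """Perspective projection; camera at the origin, looking down +z."""
    return F * p[:2] / p[2]

# Pose A: a genuine ~12.5-degree lateral lean, 3 m from the camera.
hip      = np.array([0.0, 1.00, 3.0])
shoulder = np.array([0.1, 1.45, 3.0])
L = np.linalg.norm(shoulder - hip)  # torso length (a fixed bone length)

# Any 3D point on the camera ray through the shoulder's pixel projects to
# the same 2D location. Intersecting that ray with the sphere of radius L
# around the hip yields a second, equally valid shoulder position.
d = shoulder / shoulder[2]                    # ray direction, z-normalized
coeffs = [d @ d, -2.0 * (d @ hip), hip @ hip - L**2]
roots = np.roots(coeffs)                      # two depths along the ray
z_alt = roots[np.argmax(np.abs(roots - shoulder[2]))]
shoulder_alt = z_alt * d

def lean_deg(top, bottom):
    v = top - bottom
    return np.degrees(np.arccos(v[1] / np.linalg.norm(v)))

print(project(shoulder), project(shoulder_alt))              # identical pixels
print(lean_deg(shoulder, hip), lean_deg(shoulder_alt, hip))  # ~12.5 vs ~53 deg
```

A 2D keypoint model sees these two poses as literally the same image, which is why depth has to be inferred from priors such as body proportions and pose plausibility rather than read off the pixels.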

Our internally-developed 3D pose estimation model is highly accurate for its size, but it had not previously been tasked with providing quantitative functional measurements that must be consistent across sessions to enable longitudinal analysis. Before rolling out Movement Analysis to our members, it was critical to evaluate our system on real data, against clinical thresholds.

Evaluating Our Measurements

To ensure we met our Clinical Acceptance Thresholds, we couldn't rely on simple testing. We implemented a rigorous three-pronged evaluation strategy using distinct datasets:

  1. Real-World Data (The "Gold Standard"): We collected several hundred videos of diverse individuals in real-world environments (different clothes, backgrounds, and lighting). Professional annotators manually pinpointed joint locations in 3D space to create a "ground truth" for key frames. Our 3D annotation tool had to account for the perspective transform, camera placement and individual body proportions.
  2. Synthetic Data: To test edge cases we couldn't easily film, we generated synthetic videos. This allowed us to rigorously test how our models handled specific variables, like camera height or distance, in isolation.
  3. Motion Capture (MoCap): For the ultimate truth, we compared our vision-based measurements against a professional optical motion capture system.
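Across all three datasets, the comparison against our Clinical Acceptance Thresholds boils down to a per-movement check like the sketch below. The threshold values echo the examples above; the normal-approximation one-sided bound and the function names are illustrative choices, not our exact production statistics.

```python
import numpy as np

# Illustrative thresholds (degrees), echoing the examples above.
THRESHOLDS_DEG = {"squat_hip_flexion": 9.0, "knee_extension": 3.0}

def passes_threshold(pred_deg, truth_deg, threshold_deg, z=1.645):
    """Compare a one-sided upper 95% bound on mean absolute error
    against a clinical acceptance threshold (normal approximation;
    z = 1.645 is the one-sided 95% quantile)."""
    err = np.abs(np.asarray(pred_deg) - np.asarray(truth_deg))
    upper = err.mean() + z * err.std(ddof=1) / np.sqrt(len(err))
    return upper, upper <= threshold_deg

# Example: per-video peak ROM from the model vs. annotated ground truth.
upper, ok = passes_threshold(
    pred_deg=[92.0, 88.5, 101.2], truth_deg=[95.0, 90.0, 99.0],
    threshold_deg=THRESHOLDS_DEG["squat_hip_flexion"],
)
print(f"upper bound: {upper:.1f} deg, within threshold: {ok}")
```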

The Findings

As expected, our evaluations revealed that our 3D pose estimator performed well on "planar" movements like Low Back Flexion (bending forward) and Extension (leaning back). In these cases, the error rates were consistently well below our clinical thresholds, giving us high confidence in the data. However, we uncovered interesting challenges with complex movements:

  • The "Smoothing" Trade-off: To prevent the digital skeleton from jittering on screen, our models apply temporal smoothing. We found that while this makes the visual experience smoother, it can introduce a slight lag (tens of milliseconds), causing the system to slightly underestimate the peak range of motion during a fast movement (see the sketch after this list). We had to balance this "visual stability" against "peak accuracy."
  • Neck mobility underestimation: Measuring neck movement proved particularly tricky. Our initial approach, which relied on calculating 3D rotations (quaternions) of the head, tended to underestimate the full range of motion, biased by the model's tendency to revert to the "mean" pose when uncertain.
  • Synthetic Data Bias: Somewhat surprisingly, our results were better on synthetic data than on our real-world data. Digging further, we identified a systematic bias in the pseudolabels of this dataset, especially around the neck, making our inferred measurements look better than they actually were. We decided to drop the synthetic data for this study to ensure an unbiased evaluation we could count on generalizing to real members (and to improve the synthetic pseudolabels later).
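The smoothing trade-off above is easy to reproduce. In the minimal sketch below, an exponential moving average (one common smoothing choice; our actual filter and its parameters may differ) applied to a synthetic two-second movement shaves a degree or two off the measured peak.

```python
import numpy as np

FPS = 30
t = np.arange(0.0, 2.0, 1.0 / FPS)
angle = 60.0 * np.sin(np.pi * t / 2.0) ** 2  # fast movement: 0 -> 60 deg -> 0

# Exponential moving average (illustrative; alpha trades stability for lag).
alpha = 0.25
smoothed = np.empty_like(angle)
smoothed[0] = angle[0]
for i in range(1, len(angle)):
    smoothed[i] = alpha * angle[i] + (1.0 - alpha) * smoothed[i - 1]

print(f"true peak: {angle.max():.1f} deg, smoothed peak: {smoothed.max():.1f} deg")
# -> the smoothed trace lags the signal and under-reports the peak ROM
```

The faster the movement, the larger this gap grows, which is exactly the tension between a stable on-screen skeleton and an accurate peak measurement.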

Refining the Approach

Identifying these gaps allowed us to engineer targeted solutions that pushed our accuracy back into the clinical "green zone."

1. Geometric Re-Anchoring

For side bends, the primary source of error was the model's uncertainty about the hip's depth orientation. We realized that while the hips might be ambiguous in a baggy shirt, the ankles are generally stable anchor points.

We refined our measurement logic to align the 3D skeleton based on the vector formed by the ankles rather than the hips. This geometric shift effectively "grounded" the measurement, significantly reducing the noise from hip rotation and bringing our side bend error rates down dramatically.
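In code, the idea looks roughly like the sketch below (the keypoint names and the exact angle convention are illustrative; the production solver handles more edge cases): estimate the stance's yaw from the ankle-to-ankle vector, rotate the skeleton so that vector lies in the frontal plane, then measure the torso's lean from vertical within that plane.

```python
import numpy as np

def side_bend_deg(kp):
    """Lateral-bend angle after re-anchoring on the ankles (sketch).

    kp maps keypoint names to 3D positions in meters (y up, z along the
    camera axis); the names used here are illustrative.
    """
    ankles = kp["right_ankle"] - kp["left_ankle"]
    yaw = np.arctan2(ankles[2], ankles[0])  # depth rotation of the stance

    # Rotate about the vertical axis so the ankle vector lies along +x,
    # i.e. in the frontal plane seen by the camera.
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c,   0.0, s],
                  [0.0, 1.0, 0.0],
                  [-s,  0.0, c]])

    torso = R @ (kp["mid_shoulder"] - kp["pelvis"])
    # Lean from vertical, measured within the frontal (x-y) plane.
    return np.degrees(np.arctan2(torso[0], torso[1]))
```

Because the ankles stay planted during a side bend, this alignment is far less sensitive to the hip twist described earlier than one anchored on the hips.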

2. Training Data Improvements

We traced the primary root cause of head orientation inaccuracy in our model to poor labels in our training set. Our 3D annotation tool did not make it easy to position head and neck joints, leading to inaccurate 3D labels. We improved the tool on this front, adding a more flexible spine model, direct manipulation of the head via keypoints on the eyes, ears, and nose, and auto-scaling of head proportions. We then re-annotated 3D poses on tens of thousands of training images with the new tool and saw a marked improvement in head and neck placement.

3. Next-Generation Models

We also accelerated the deployment of our next-generation pose estimation model. By imposing our new clinical acceptance thresholds on model accuracy, we isolated the key poses and movements where the current model lacked the fidelity needed for longitudinal quantitative assessment. We then focused data, training approaches, and architectural changes on these needs, with the intent to create a model that can still facilitate CV-based exercise therapy while also providing highly accurate measurements for specific movements.

Movement Analysis for the Whole Body

Our work on the low back and neck has laid the foundation for a scalable measurement engine. The geometric solvers, bias correction pipelines, and evaluation frameworks can be applied across the joints where members experience MSK pain. As we look to analyze movements in other parts of the body, we are considering the specific challenges of measuring ROM at these joints:

  • Hips: Ball-and-socket joints with three axes of rotation, frequently occluded by clothing, requiring us to distinguish true hip movement from pelvic tilt.
  • Shoulders: Complex joints with immense freedom of movement, especially when the clavicles are considered, requiring us to solve for self-occlusion when the arm crosses the body.
  • Elbows & Knees: Often subject to tighter clinical acceptance thresholds for assessing ROM.

The "wild" will always present challenges from bad lighting to baggy sweaters. But by anchoring our engineering in clinical thresholds and relentlessly measuring our own performance, we are bridging the gap between a clinical lab and a smartphone, ensuring that when a member sees their progress line go up, they know it's a victory they've truly earned.
