Figure 1 illustrates how daily health checks are carried out at the three facilities. At Rutgers University (fig. 1a), technicians use a flashlight (pen light) to look for signs of animal distress. However, visibility can be limited because animals are often resting inside their nests or enrichment devices (fig. 1b,c). Fully pulling cages out of the rack to inspect the animals is discouraged at both Rutgers and the École Polytechnique Fédérale de Lausanne (EPFL), as this practice can disturb the animals’ sleep and compromise the reliability of scientific results. Observations of technician workflow at Rutgers showed that each cage typically receives only 4–8 seconds of attention during routine checks.
To assess how well these daily checks work, we reviewed all clinical cases recorded for animals housed in DVC racks during the following periods: 11 months at Rutgers (USA), 3 years at EPFL (Switzerland), and 10 months at Sanofi (France). As shown in figure 2a, the highest number of detected cases occurred on cage change days: Friday at EPFL, Monday at Rutgers, and Wednesday at Sanofi. At Rutgers and EPFL, case numbers dropped sharply on weekends. This decline likely reflects reduced weekend staffing: fewer staff members must still inspect a large number of cages, so weekend checks tend to prioritize food, water, and urgent veterinary needs. In contrast, Sanofi showed little difference in case detection between cage change days and other days, possibly because tumor study cages were pulled from the rack daily for visual checks without opening them. Across the study period, the total number of confirmed cases was 229 at EPFL, 42 at Rutgers, and 65 at Sanofi, categorized in figure 2b. Supplementary figures 1 and 2 show animal density per cage and the distribution of veterinary cases, respectively. These results indicate that brief visual inspections are poor at detecting animal distress: most health problems are noticed only when technicians handle animals during cage changes.

a, Daily count of confirmed veterinary cases by weekday. b, Number of clinical cases per veterinary category at each study site. Total case counts for each category are displayed above the bars.
To improve health monitoring, we created machine learning/AI algorithms that analyze nighttime movement when mice are naturally most active. Each cage’s activity pattern over a 12-hour dark period is compared to its own average from the previous 7 days. Figure 3 shows examples of the alerts the system generates. Red lines represent the current day’s activity; green lines show the 7-day baseline. A color-coded bar at the top of each graph indicates status: green for normal, red for abnormal (labeled as hypoactivity, hyperactivity, or unusual spatial behavior). Figure 3a shows a mouse with reduced movement (hypoactivity); figure 3b shows zero movement, indicating a likely death; figure 3c displays a hyperactive pattern; and figure 3d represents normal activity with no alerts. Technicians use these signals to decide whether a closer physical examination is needed.
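The comparison of one night against the cage's own 7-day baseline can be illustrated with a minimal sketch. This is not the ensemble algorithm used in the study, whose internals are not described here; it assumes hourly activity totals per cage and uses hypothetical ratio thresholds (`low`, `high`) chosen only for illustration.

```python
from statistics import mean

def classify_night(baseline_nights, current_night, low=0.5, high=2.0):
    """Label one cage-night relative to the cage's own recent baseline.

    baseline_nights: list of 7 lists, each holding 12 hourly activity values
    current_night:   list of 12 hourly activity values for the night under review
    low, high:       hypothetical ratio thresholds (assumptions, not from the study)
    """
    current_total = sum(current_night)
    if current_total == 0:
        # No movement at all during the dark period
        return "no activity"

    # Baseline = mean total nightly activity over the previous nights
    baseline_total = mean(sum(night) for night in baseline_nights)
    if baseline_total == 0:
        # Activity appeared in a previously silent cage
        return "hyperactivity"

    ratio = current_total / baseline_total
    if ratio < low:
        return "hypoactivity"
    if ratio > high:
        return "hyperactivity"
    return "normal"

# Hypothetical cage: 10 activity units per hour on each of the last 7 nights
baseline = [[10] * 12 for _ in range(7)]
print(classify_night(baseline, [2] * 12))
print(classify_night(baseline, [10] * 12))
```

A ratio of nightly totals ignores the shape of the activity curve and its spatial distribution, both of which the system described above also evaluates.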

a–d, Green lines show the cage’s average locomotor activity over the past 7 days; red lines show activity over the past 24 hours. Daytime is shown in white, nighttime in gray. Each panel includes a header bar tracking time in days, and red blocks mark alerts triggered when the algorithm detected an anomaly. Shown: hypoactivity (a), a single-housed dead mouse (b), hyperactivity (c), and normal activity (d).
A key strength of this algorithm is that users can customize the thresholds for flagging movement anomalies, and with them the trade-off between true alarms (true positives, TP) and false alarms (false positives, FP). We tested three ensemble models on data from all three sites to find the threshold best suited for scaling the system across an entire facility. The maximum allowed FP rates were set at 5% (model 1), 15% (model 2), and 30% (model 3). This flexibility lets each facility choose settings that balance operational efficiency against detection sensitivity. Too many false warnings can lead to “user fatigue,” reducing trust in the system and making it less effective.
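To make the link between a maximum FP rate and an alert threshold concrete, the sketch below picks a score cutoff from cage-nights known to be healthy. It assumes each cage-night has a single numeric anomaly score (higher means more unusual); that score, and the function names, are hypothetical and are not taken from the study's ensemble models.

```python
def threshold_for_fp_rate(normal_scores, max_fp_percent):
    """Return a cutoff such that at most max_fp_percent of normal scores exceed it.

    normal_scores:  anomaly scores from cage-nights with no confirmed case
    max_fp_percent: integer cap on the FP rate, e.g. 5, 15, or 30
    """
    ordered = sorted(normal_scores)
    n = len(ordered)
    allowed = max_fp_percent * n // 100  # number of false alarms tolerated
    if allowed >= n:
        return float("-inf")  # every night may be flagged
    # Scores strictly above this value raise an alert
    return ordered[n - allowed - 1]

def fp_rate(normal_scores, threshold):
    """Fraction of normal cage-nights that would raise an alert."""
    flagged = sum(1 for s in normal_scores if s > threshold)
    return flagged / len(normal_scores)

# Hypothetical scores for 20 healthy cage-nights
scores = list(range(1, 21))
for cap in (5, 15, 30):
    cutoff = threshold_for_fp_rate(scores, cap)
    print(cap, cutoff, fp_rate(scores, cutoff))
```

Raising the cap lowers the cutoff, so more genuine cases are flagged at the cost of more false alarms, which is the trade-off each facility would have to weigh.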
The analysis covered 1,367 cages across the three sites over varying durations (roughly 300 days each for Rutgers and Sanofi, and over 1,000 days at EPFL), yielding 90,540 evaluations at EPFL, 51,094 at Rutgers, and 15,367 at Sanofi (see Table 1). Table 1 also reports FP detection rates for each model. The analysis started retroactively from the date each facility first recorded a confirmed veterinary case (day 0). As expected, FP rates rose from model 1 to model 3, with notable differences between sites. EPFL’s results closely matched the algorithm’s performance during optimization. In contrast, Rutgers showed relatively small changes in false positives across the three models.
Discrepancies in detection rates observed between EPFL and the other two sites (Rutgers and Sanofi) may be attributable to the larger population of wild-type B6 mice at EPFL. At both Sanofi and Rutgers, performance differences between model 2 and model 3 were negligible (Table 1). This inter-site variation could stem from the reduced cage count and shorter evaluation period at Rutgers and Sanofi.
The proportion of clinical cases (true positives, TPs) identified on day 0 generally increased from model 1 through model 3 (Fig. 4). EPFL improved from 17.9% (model 1) to 40.6% (model 3), and Sanofi rose from 50.8% (model 1) to 61.5% (model 3). Rutgers University was the exception: detection declined under model 3 (42.2%) relative to model 2 (50%). All facilities and time points showed statistically significant improvements when moving from model 1 to model 2 (McNemar test, P < 0.05, power ranging from 0.71 to 1), though Sanofi saw only a marginal gain (+6.1%). To address weekday bias in operator-reported cases, we broadened the detection window to −3 and −6 days prior to day 0, counting any detection within that window as a TP. Expanding the window increased TP detection across all three models (Fig. 4). Only EPFL showed a statistically significant improvement between model 2 and model 3, with detection rates climbing by 7–9 percentage points for both the −3 and −6 day windows (McNemar test, P < 0.05, though power was limited [0.05–0.24] due to the small number of discordant pairs). Given these findings, we selected model 2 for the subsequent results presented in this report, as it struck the best balance between operational efficiency and TP detection effectiveness. Findings from models 1 and 3 are provided in Supplementary Figs. 3–6.
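Because the same clinical cases are scored by each model, the comparison is paired, and the McNemar test depends only on the discordant pairs: cases detected by one model but not the other. The sketch below computes a two-sided exact (binomial) McNemar P value. Whether the study used the exact or the chi-squared form is not stated, so the exact form is an assumption, and the counts in the example are hypothetical.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar P value from discordant pair counts.

    b: cases detected by the first model only
    c: cases detected by the second model only
    Under the null hypothesis, each discordant pair is equally likely
    to fall in either group, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical counts: 1 case found only by the stricter model,
# 9 cases found only by the more permissive model
print(mcnemar_exact(1, 9))
```

With few discordant pairs the binomial distribution has little room to separate the two models, which is consistent with the limited power reported for the comparison between model 2 and model 3.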

The proportion of identified and missed veterinary cases is displayed across three intervals: day 0 (when the clinical case was reported), spanning day 0 to −3 days, and spanning day 0 to −6 days.
Figure 5 displays the categories of TP clinical issues identified by model 2 across the three locations on day 0. The naming conventions for clinical cases varied among the institutions depending on their available clinical records. A clinical category was included only if it encompassed at least three documented cases. The algorithm detected 93% of found-dead cases at EPFL (71 total cases), 85% at Rutgers (13 total cases), and 100% at Sanofi (3 total cases). The few missed found-dead cases at Rutgers and EPFL involved group-housed mice. Because the current technology does not track the movement of individual animals, activity from surviving cage mates can presumably mask a death, so the system likely lacks the sensitivity needed to achieve 100% detection of spontaneous deaths. At Rutgers, the algorithm detected 100% of cases of hunched posture, eye problems, and weight loss. Detection rates exceeding 80% were also recorded for health impairment, neurological disorders, and hyperactivity at EPFL, as well as for ruffled fur and tumors or ulcers at Sanofi. The lowest detection rates were noted for injuries at EPFL (73%) and for fighting and skin conditions at Rutgers (62% and 57%, respectively).

The percentage of identified cases for each veterinary case category is shown from day 0 to −6 days using ensemble model 2.
To evaluate the algorithm’s ability to catch subclinical cases within the TP subset, we extended the analysis window to −3 and −6 days. Figure 6 illustrates when the algorithm first flagged anomalous behavior. Generally, initial anomaly detections occurred between days −3 and −6, pointing to an early warning signal of the clinical issue. At EPFL and Sanofi, TP detection ranged from 59% to 100% depending on the clinical category during days −3 to −6. At Rutgers, certain TP cases involving skin issues (43%), fighting (38%), and eye-related problems (67%) went undetected on earlier days. Early stages of these conditions may not have yet altered the locomotor activity of the mice, which could explain why the algorithm failed to pick them up.
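The windowing and first-detection logic described above can be sketched as follows. The sketch assumes each confirmed case comes with the list of day offsets, relative to the report date (day 0), on which the algorithm raised an alert; that data layout and the function names are hypothetical.

```python
from collections import Counter

def detected_within(alert_days, window):
    """True if any alert fell between day -window and day 0 inclusive."""
    return any(-window <= day <= 0 for day in alert_days)

def first_detection_bin(alert_days):
    """Assign a case to the interval containing its earliest alert."""
    in_window = [day for day in alert_days if -6 <= day <= 0]
    if not in_window:
        return "missed"
    first = min(in_window)  # most negative offset = earliest alert
    if first == 0:
        return "day 0"
    if first >= -2:
        return "day -1 to -2"
    return "day -3 to -6"

# Hypothetical alert histories for four confirmed cases
cases = [[0], [-1, 0], [-5, -2, 0], []]
print(Counter(first_detection_bin(days) for days in cases))
print(sum(detected_within(days, 3) for days in cases))
```

Alerts older than day −6 are ignored here so that the bins match the three intervals reported in the figure; a case with no alert in the window counts as missed.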

The percentage of veterinary cases where ensemble model 2 generated its first detection is shown for day 0, between days −1 and −2, and between days −3 and −6.


