In the fields of data science and computational statistics, most researchers have concentrated on using machine learning techniques for identifying anomalies.
There are three primary methods for detecting anomalies:
- Supervised: The training dataset is labeled and contains both normal and anomalous data.
- Clean semi-supervised: The training dataset contains only normal data, whereas the test dataset includes anomalies.
- Unsupervised: The training dataset is unlabeled and includes both normal and anomalous data.
Supervised techniques treat anomaly detection as a binary classification task (normal versus anomaly). They utilize labeled datasets to train models that differentiate between standard and unusual data. This method can be highly effective when anomalies are relatively common.
However, in most real-world scenarios, anomalies are rare—often accounting for less than 1 percent of the total data. This scarcity makes supervised methods impractical, as gathering enough labeled anomalous data is both challenging and resource-intensive. Additionally, supervised techniques rely on the assumption that the distribution of anomalous data can be clearly defined and modeled statistically. This is referred to as the well-defined anomaly distribution (WDAD) assumption.
In manufacturing, this assumption is useful for identifying recurring machine failures where the problem is well-understood and there is ample data to characterize the distribution. This principle underpins Six Sigma methodologies, where time-invariant data are modeled using a Gaussian distribution. Any measurement that deviates beyond ±6 standard deviations from the mean is flagged as an anomaly.
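The ±6σ rule takes only a few lines of Python. This is a minimal illustration that assumes normal measurements are approximately Gaussian and time-invariant. The helper name `sigma_flags` and the synthetic torque readings are hypothetical, not Ford data.

```python
import numpy as np

def sigma_flags(reference, measurements, k=6.0):
    """Flag measurements more than k standard deviations from the reference mean."""
    mu = np.mean(reference)
    sigma = np.std(reference, ddof=1)  # sample standard deviation of known-good data
    return np.abs(np.asarray(measurements) - mu) > k * sigma

rng = np.random.default_rng(0)
reference = rng.normal(loc=50.0, scale=0.5, size=10_000)  # hypothetical torque readings, Nm
new = np.array([50.2, 49.7, 55.0])
print(sigma_flags(reference, new))
```

The mean and standard deviation are estimated from known-good data rather than from the stream being monitored. This prevents a burst of anomalies from inflating σ and hiding itself.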
Unfortunately, the WDAD assumption rarely holds true in real-world applications, as few methods can accurately model the distributions of both normal and anomalous data. This is particularly evident in manufacturing, where data complexity and variability are significantly higher.
When the WDAD assumption is not valid and anomalies are infrequent, unsupervised or clean semi-supervised methods can be employed to detect outliers. However, these approaches may also struggle if anomalies do not appear as outliers or if the normal data distribution has long tails.
Fastening data can pose challenges for machine learning algorithms because varying levels of torque are applied at different stages of the tightening process. Graph courtesy Ford Motor Co.
Types of Anomalies
Anomaly detection in manufacturing involves time-series data. Such data require specialized statistical methods because conventional techniques assume constant variance and independent observations, and time-series data rarely satisfy those assumptions. Time-series data consist of sequential observations recorded at regular intervals over time. These data often exhibit characteristics such as trends, seasonality, cycles, and levels, which can be leveraged to forecast future patterns and identify deviations from the norm.
Most research has focused on three categories of anomalies in time-series data: point anomalies, collective anomalies, and contextual anomalies.
Point anomalies occur when a single data point in the time series deviates significantly from the rest of the data. For example, a sudden day of heavy snowfall in British springtime would be considered a point anomaly in historical weather records. Point anomalies have been widely studied, with most methods assuming they are rare and independent of one another. Techniques such as neural networks, tree-based models, support vector machines (SVM), and long short-term memory (LSTM) networks have proven effective in detecting point anomalies.
Collective anomalies involve multiple data points that may appear normal individually but exhibit unusual patterns when analyzed together. Using the weather analogy, a collective anomaly would be several consecutive days of snowfall.
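The following minimal sketch uses a synthetic daily snowfall series (hypothetical values) to show the difference between the two types. A z-score test catches the single heavy-snow day. A run-length rule catches the multi-day stretch whose individual values look unremarkable. The thresholds and function names are illustrative assumptions, not methods from the study.

```python
import numpy as np

def point_anomalies(x, k=3.0):
    """Indices of single points more than k standard deviations from the series mean."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.flatnonzero(np.abs(z) > k)

def collective_anomalies(x, threshold, min_run):
    """(start, end) index pairs of runs of at least min_run consecutive values above threshold."""
    runs, start = [], None
    for i, above in enumerate(np.asarray(x) > threshold):
        if above and start is None:
            start = i                      # a run begins
        elif not above and start is not None:
            if i - start >= min_run:       # the run just ended; keep it if long enough
                runs.append((start, i - 1))
            start = None
    if start is not None and len(x) - start >= min_run:
        runs.append((start, len(x) - 1))   # a run that lasts to the end of the series
    return runs

snow = np.zeros(30)   # hypothetical daily snowfall (cm) over a spring month
snow[5] = 20.0        # one heavy-snow day
snow[20:24] = 2.0     # four consecutive light-snow days
print(point_anomalies(snow))                                  # [5]
print(collective_anomalies(snow, threshold=1.0, min_run=3))  # [(20, 23)]
```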
Contextual anomalies are data points whose status depends on their context. A point that deviates from the majority of the data may be considered normal in one context and anomalous in another. Contextual anomalies are characterized by two attributes:
- A spatial attribute that defines the local context of a data point relative to its neighbors.
- A behavioral attribute that describes the normality of the data point.
Clustering algorithms are often used to identify contextual anomalies in both real-world and synthetic datasets. A common example is credit card transaction data. For instance, if a person’s spending spikes significantly over a week in April, it might be flagged as a collective anomaly and suspected of fraud. However, the same spending pattern the week before Christmas could be deemed normal given the seasonal context.
The credit card example illustrates how different types of anomalies can overlap. Therefore, it is essential to develop systems capable of identifying all three types of anomalies.
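A contextual rule compares each observation against history from the same context rather than against all of the data. The sketch below assumes that weekly spending history is already grouped by a context key. The keys, amounts, and the 3σ threshold are hypothetical.

```python
import numpy as np

def contextual_flag(amount, context, history, k=3.0):
    """Flag an amount that is extreme relative to historical amounts from the same context."""
    past = np.asarray(history[context], dtype=float)
    return bool(abs(amount - past.mean()) > k * past.std())

rng = np.random.default_rng(1)
history = {
    "ordinary_week": rng.normal(200, 30, size=200),  # hypothetical weekly spend
    "pre_christmas": rng.normal(600, 80, size=20),
}
print(contextual_flag(650, "ordinary_week", history))  # spike in an April week
print(contextual_flag(650, "pre_christmas", history))  # same spend the week before Christmas
```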

To train their AI model, the researchers consulted a fastening expert to review sets of run-down data like those shown here, identifying instances of good and bad assemblies. Photo courtesy Ford Motor Co.
Dimensionality Reduction and Clustering
The initial step in any anomaly detection process is to leverage domain expertise to extract meaningful features from raw data. These features can then be analyzed using statistical tools to highlight outliers, which are potential anomalies.
The number of meaningful features in a dataset determines whether it is high-dimensional or low-dimensional. As dimensionality increases, it becomes more challenging to establish relationships between features. This necessitates larger datasets and greater computational resources to train models for defect detection. It also increases the risk of overfitting due to noise across multiple dimensions.
Dimensionality reduction techniques aim to project high-dimensional data into a lower-dimensional space, enabling visualization in two or three dimensions and facilitating cluster analysis methods better suited for lower-dimensional datasets. Common cluster analysis approaches applied to time-series anomaly detection include K-means clustering, fuzzy C-means clustering, Gaussian Mixture Models (GMM), and hierarchical clustering.
K-means and fuzzy C-means clustering begin with initial estimates of cluster centroids, followed by iterative optimization to minimize the distances between data points and their respective cluster centers. K-means is a hard clustering method, assigning each data point to a single cluster. In contrast, fuzzy C-means is a soft clustering approach that assigns each data point a degree of membership in every cluster, allowing it to belong to multiple clusters at once.
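Once the centroids are fixed, hard and soft assignment each reduce to a single formula. The numpy sketch below shows only the assignment step. The full algorithms alternate it with a centroid update until convergence, and that loop is omitted here. The fuzziness exponent m=2 is a common default and is assumed here.

```python
import numpy as np

def hard_assign(X, centroids):
    """K-means style: each point belongs to its single nearest centroid."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

def fuzzy_memberships(X, centroids, m=2.0):
    """Fuzzy C-means style: membership of each point in every cluster (rows sum to 1)."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    d = np.fmax(d, 1e-12)  # avoid division by zero when a point sits on a centroid
    # u_ij = 1 / sum_k (d_ij / d_ik) ** (2 / (m - 1))
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)

X = np.array([[1.0, 0.0], [4.0, 0.0], [9.0, 0.0]])
C = np.array([[0.0, 0.0], [10.0, 0.0]])
print(hard_assign(X, C))        # each point gets exactly one cluster
print(fuzzy_memberships(X, C))  # the middle point is split roughly 69/31
```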
GMM is a probabilistic clustering method that assumes the data can be described by several sub-processes, each contributing a Gaussian component to the lower-dimensional representation. GMM uses maximum-likelihood estimation algorithms, such as expectation maximization, for model fitting.
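The sketch below fits a GMM by expectation maximization with scikit-learn. It assumes two well-separated synthetic sub-processes in two dimensions. `GaussianMixture`, `predict_proba`, and `score_samples` are standard scikit-learn APIs. The data and the choice of two components are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two hypothetical sub-processes, each contributing a Gaussian blob in 2-D
X = np.vstack([
    rng.normal([0.0, 0.0], 0.5, size=(200, 2)),
    rng.normal([5.0, 5.0], 0.5, size=(200, 2)),
])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(X)  # parameters estimated by expectation maximization

resp = gmm.predict_proba(X)    # soft assignment: probability of each component
loglik = gmm.score_samples(X)  # per-point log-likelihood under the fitted mixture
print(gmm.converged_, resp.shape, loglik.shape)
```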
Unsupervised anomaly detection is a widely used approach, often advantageous because it eliminates the need for labeled training data. Even so, labeled testing datasets must still be prepared to evaluate models before deployment, and constructing high-quality labeled data is essential for building an effective screening system. When anomalies make up only a small portion of the training data, testing datasets tend to be heavily skewed, containing far more normal instances than anomalous ones. In such scenarios, it is efficient to leverage this abundance of normal data within a semi-supervised learning framework.
In hierarchical clustering, the process begins with each data point treated as its own cluster. Iteratively, points are merged with nearby clusters until all data converge into a single cluster. This bottom-up strategy is known as agglomerative hierarchical clustering. The sequence of merges is visualized using a dendrogram, where branches join or split at levels corresponding to the iteration step when those clusters were combined or divided. The dendrogram reveals the underlying structure and relationships among all data points. By slicing the dendrogram horizontally at a chosen level, you can extract any desired number of clusters. Notably, unusually small clusters may signal anomalous behavior in the system.
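The agglomerative procedure and the small-cluster heuristic can be implemented with SciPy's `linkage` and `fcluster`. Ward linkage, the two-cluster cut, and the minimum size of 10 are illustrative assumptions. `small_cluster_points` is a hypothetical helper.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def small_cluster_points(X, n_clusters, min_size):
    """Cut an agglomerative dendrogram into n_clusters and return indices of points in small clusters."""
    Z = linkage(X, method="ward")                             # bottom-up merge history (the dendrogram)
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")  # horizontal cut; labels start at 1
    sizes = np.bincount(labels)                               # size of each flat cluster
    return np.flatnonzero(sizes[labels] < min_size)

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(100, 2)),  # hypothetical main operating regime
    rng.normal(8.0, 0.1, size=(3, 2)),    # a tiny, distant group
])
print(small_cluster_points(X, n_clusters=2, min_size=10))
```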
Dimensionality reduction methods fall into two main groups: matrix factorization and neighbor-based graph techniques. Matrix factorization encompasses algorithms like linear autoencoders, generalized low-rank models, and Principal Component Analysis (PCA).
PCA is one of the most widely used dimensionality reduction tools across scientific fields, with origins dating back to the early 20th century. It works by computing the eigenvectors and eigenvalues of the dataset’s covariance matrix to generate linear projections of the data into a lower-dimensional latent space—these projections are called principal components. Components capturing the highest variance retain the most meaningful information from the original dataset and are kept for analysis or visualization, while those with minimal variance are typically discarded.
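The numpy sketch below follows the eigendecomposition route to PCA described above. The helper `pca` is hypothetical. Production code usually calls a library implementation such as scikit-learn's `PCA`, which computes the same projection via singular value decomposition.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components via eigendecomposition of the covariance matrix."""
    Xc = X - X.mean(axis=0)                 # center each feature
    cov = np.cov(Xc, rowvar=False)          # feature-by-feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: symmetric input, ascending eigenvalues
    order = np.argsort(eigvals)[::-1]       # highest-variance components first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs[:, :n_components]            # principal component scores
    explained = eigvals[:n_components] / eigvals.sum()  # share of total variance kept
    return scores, explained

rng = np.random.default_rng(0)
x = rng.normal(size=500)
# Hypothetical features: the second is almost a linear copy of the first, the third is small noise
X = np.column_stack([x, 2 * x + rng.normal(0, 0.1, 500), rng.normal(0, 0.1, 500)])
scores, explained = pca(X, n_components=2)
print(explained)  # the first component captures nearly all of the variance
```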
PCA has been extensively applied in time-series anomaly detection. However, a key limitation arises when feature relationships are nonlinear or absent: in such cases, PCA may produce false positives or fail to uncover useful patterns.
In recent years, learning-based neighbor graph methods such as t-SNE and UMAP have seen significant advancements.
While PCA preserves global data structure through high-variance eigenvectors, t-SNE focuses on local structure. It converts similarities between neighboring points into probability distributions in both the high-dimensional space and the lower-dimensional representation. It then arranges the low-dimensional points so that the two distributions match as closely as possible. This allows t-SNE to capture fine-grained local details, though at the cost of some global context. It has proven effective in detecting bearing faults and defects in superconductors.
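For comparison, the sketch below builds a t-SNE embedding with scikit-learn's `TSNE` on synthetic data: three hypothetical groups in ten dimensions. The `perplexity` parameter loosely sets the effective number of neighbors each point's probability distribution considers, and a value of 30 is assumed here.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Hypothetical 10-feature data: 150 points drawn around three group centers
centers = rng.normal(0, 10, size=(3, 10))
X = np.vstack([c + rng.normal(0, 1, size=(50, 10)) for c in centers])

tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
embedding = tsne.fit_transform(X)  # one 2-D point per input row
print(embedding.shape)
```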
UMAP, introduced more recently and gaining widespread attention since 2020, improves upon t-SNE by better preserving both local and global data structures while offering faster computation times. UMAP outperforms PCA in clustering time-series data with cyclical or seasonal patterns and has been successfully paired with density-based clustering to identify anomalous periods in time-series data.

These plots display the first two principal components for datasets 1 and 2, labeled for clarity. They reveal that when PCA is applied to nutrunner data, not all noisy points are anomalies, anomalies can appear as tight clusters rather than isolated outliers, and not every outlier is necessarily an anomaly. Photo courtesy Ford Motor Co.
Methodology
We evaluated three semi-supervised methods for detecting anomalies in nutrunner data. Each approach uses a Gaussian Mixture Model (GMM) trained exclusively on normal operation data to define outlier thresholds within a reduced feature space. To compare performance, we paired the GMM with three different dimensionality reduction techniques.
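One plausible way to turn a GMM trained only on normal data into an outlier threshold is to score points by their log-likelihood under the mixture. Any point less likely than, for example, the 1st percentile of the training scores is then flagged. The sketch below assumes exactly that. The percentile, component count, and helper names (`fit_normal_model`, `flag_anomalies`) are illustrative choices, not the study's published settings.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_normal_model(X_normal, n_components=2, percentile=1.0, seed=0):
    """Fit a GMM to normal data only and set a log-likelihood threshold from that data."""
    gmm = GaussianMixture(n_components=n_components, random_state=seed).fit(X_normal)
    threshold = np.percentile(gmm.score_samples(X_normal), percentile)
    return gmm, threshold

def flag_anomalies(gmm, threshold, X):
    """Points less likely than the threshold under the normal-data GMM are flagged."""
    return gmm.score_samples(X) < threshold

rng = np.random.default_rng(0)
X_normal = rng.normal(0.0, 1.0, size=(400, 2))  # hypothetical 2-D reduced features
gmm, thr = fit_normal_model(X_normal)
X_test = np.array([[0.1, -0.2], [6.0, 6.0]])
print(flag_anomalies(gmm, thr, X_test))
```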
Anomalies in nutrunner operations are infrequent, and historical process data are often not retained long-term. This scarcity makes it difficult to assemble enough data for robust training and testing. Even when historical fault records exist, they must be carefully reviewed by a process expert to ensure dataset quality.
Labeling nutrunner data emerged as our primary challenge. Even fully unsupervised models require high-quality labeled data for validation. In fault detection contexts, training datasets are typically highly imbalanced due to the rarity of faults. Consequently, process experts must manually review large volumes of data to collect enough examples to reliably validate the model.
To build our dataset, we provided three fastening specialists with data from 5,000 run-downs for annotation and developed a dedicated dashboard to streamline the labeling workflow. Each expert was shown 12 run-down graphs at a time and asked to assess them. Since anomalies were expected to be rare, experts were told to treat every sample as representing normal operating conditions unless they judged otherwise.
If any instance appeared potentially anomalous, experts were instructed to classify it as one of the following:
- True anomaly: A confirmed process anomaly that either compromises part quality or requires further inspection before the assembly can be approved.
- ANC (Anomalous but No Concern): An unusual observation that doesn’t require action—typically explainable by a known process variation unlikely to affect quality.
- Re-hit: A case where no data was recorded, and the time-series signal remains flat at zero amplitude.
If none of the 12 examples fit these categories, the expert clicked a refresh button to mark all as “normal” and load a new set of 12 graphs.
This method proved highly efficient. Fewer than 1% of run-downs contained true anomalies, so most sets of 12 graphs could be cleared with a single click. Using this system, our experts labeled data at a rate of roughly 1,000 observations every 30 minutes.
Early in the project, we encountered differing interpretations of what constitutes an anomaly. Test engineers define anomalies as any condition that would lead to part rejection. Data analysts, however, view anomalies as data points with features or patterns that deviate significantly from the majority.
This discrepancy led us to introduce the ANC category. Distinguishing between a true anomaly and an ANC relies on expert judgment, introducing some subjectivity. For this reason, only engineers with deep knowledge of the fastening process were assigned to the labeling task.
While this introduces some uncertainty into our test datasets, engineers tend to prioritize caution—quality takes precedence over all other production metrics. Therefore, any anomaly detection system should flag both true anomalies and ANCs. However, improving ANC detection should never come at the cost of missing true anomalies.
With our custom labeling dashboard, we addressed two critical barriers in developing machine learning for fault detection: the time-intensive process of labeling high-quality data and the knowledge gap between fastening specialists and data scientists.
In the end, we produced two distinct testing datasets: Dataset 1, representing a handheld nutrunner with high operational variability, and Dataset 2, capturing an automated process where a clear staging issue is observable.

Individual simplex shapes can be merged together to form what is known as a simplicial complex, which allows for modeling complex multi-dimensional feature spaces. Photo courtesy Ford Motor Co.
Machine Learning Models
To detect anomalies in nutrunner data, we opted to use a Gaussian Mixture Model (GMM). Before training the GMM, we evaluated three different dimensionality reduction techniques: PCA, t-SNE, and UMAP.
During the model-building phase, the visual representations generated by these techniques were invaluable in explaining our findings to colleagues. They were especially useful for emphasizing the value of high-quality data and highlighted early on that test engineers frequently had conflicting opinions on how data should be labeled. These visualizations helped us spot potential labeling errors and gather constructive feedback from the test engineers.
PCA is a dimensionality reduction method designed to preserve the global structure of the data by approximately maintaining the distances between data points. This is accomplished through a linear transformation based on matrix factorization.
At present, Ford relies on DBSCAN to find noise points within the top two principal components as a way to flag anomalies. However, when PCA is used on nutrunner data, not every noise point is an anomaly, anomalies sometimes group closely together, and not all anomalies stand out as clear outliers. In Dataset 1, many anomalies are embedded within the normal data trends, occasionally forming small clusters within the main distribution. In Dataset 2, most anomalies form compact clusters and the ANC class substantially overlaps with regular data patterns.
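The PCA-DBSCAN baseline described above can be sketched as follows. This is an illustrative reconstruction, not Ford's code. The scaling step and the `eps` and `min_samples` values are assumptions, and in practice they would need tuning for run-down features. DBSCAN marks noise points with the label -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pca_dbscan_flags(X, eps=0.5, min_samples=5):
    """Flag DBSCAN noise points (label -1) in the space of the top two principal components."""
    Z = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(Z)
    return labels == -1

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 5)),  # hypothetical normal run-down features
               np.full((1, 5), 10.0)])           # one isolated outlier
flags = pca_dbscan_flags(X, eps=1.0)
print(flags.sum(), flags[-1])
```

As the text notes, this rule only catches isolated points. An anomaly that sits inside a tight cluster, or inside the main distribution, is assigned to a cluster and never reaches the noise label.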
Given PCA’s sensitivity to variance, we normalized the data beforehand. Generally, the first two or three principal components are chosen in PCA, as they preserve the most information about the original data structure. Yet, early trials with nutrunner data showed that varying which principal components were used led to clearer anomaly clusters. Consequently, we experimented with different principal component combinations while fine-tuning hyperparameters for the semi-supervised approach.
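The sketch below shows one way to enumerate candidate two-component views after standardization. It assumes five retained components, with each pair then scored downstream, for example by the semi-supervised GMM's validation performance (not shown). `component_pairs` is a hypothetical helper.

```python
from itertools import combinations

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def component_pairs(X, n_components=5):
    """Standardize X, fit PCA, and yield every 2-D view built from a pair of components."""
    scores = PCA(n_components=n_components).fit_transform(StandardScaler().fit_transform(X))
    for i, j in combinations(range(n_components), 2):
        yield (i + 1, j + 1), scores[:, [i, j]]  # 1-based PC numbers, e.g. (1, 3)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))  # hypothetical run-down features
views = dict(component_pairs(X))
print(sorted(views)[:3])
```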
We also assessed t-SNE and UMAP against PCA for reducing data to two dimensions before feeding it into the semi-supervised GMM. Both t-SNE and UMAP are neighbor graph methods. They calculate how similar data points are to one another before projecting them into a reduced-dimensional space.
UMAP stands out as the most sophisticated method of the three. At a conceptual level, UMAP blends manifold approximation with local set representations to project data into a lower-dimensional space. These representations, called simplicial sets, use combinations of simplices defined by data points to model the high-dimensional feature space.
Since t-SNE is a probabilistic method and both t-SNE and UMAP incorporate random processes, we needed to merge the training and testing datasets before performing dimensionality reduction to guarantee both were mapped into the same reduced space. After creating the 2D projections from the combined data, we separated them back into training and testing sets before using GMM for semi-supervised clustering.
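The merge, reduce, and split procedure might look like this in Python, with scikit-learn's `TSNE` standing in for either neighbor-graph method. Scikit-learn's `TSNE` provides `fit_transform` but no `transform` for unseen points, which is another practical reason to embed test data together with training data. The data, the two-component GMM, and the 1st-percentile threshold are assumptions for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.mixture import GaussianMixture

def embed_jointly(X_train, X_test, seed=0):
    """Embed train and test together so both land in the same 2-D t-SNE space, then split."""
    combined = np.vstack([X_train, X_test])
    embedding = TSNE(n_components=2, init="pca", random_state=seed).fit_transform(combined)
    return embedding[: len(X_train)], embedding[len(X_train):]

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, size=(120, 6))             # hypothetical normal run-down features
X_test = np.vstack([rng.normal(0, 1, size=(40, 6)),
                    rng.normal(8, 1, size=(5, 6))])   # last five rows: hypothetical anomalies
Z_train, Z_test = embed_jointly(X_train, X_test)

# Semi-supervised step: fit on the embedded normal training points only
gmm = GaussianMixture(n_components=2, random_state=0).fit(Z_train)
threshold = np.percentile(gmm.score_samples(Z_train), 1.0)
flags = gmm.score_samples(Z_test) < threshold
print(flags.astype(int))
```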
The GMM was trained in a semi-supervised way on 400 normal data points, with the number of Gaussian components specified as a model parameter.

Since both t-SNE and UMAP use stochastic processes, the researchers merged the training and testing datasets and performed dimensionality reduction on the combined data to ensure consistent mapping to the same reduced feature space. This illustration shows how t-SNE was applied to the merged nutrunner training and testing data. Graph courtesy Ford Motor Co.
Results
For this study, anomalies were defined as the positive class. Given this, our algorithm had the following objectives:
- Minimize the false negative rate. The ultimate aim was to build an anomaly detection system that enhances overall product quality, so our primary focus was on reducing actual anomalies that were mistakenly labeled as normal.
- Ensure a high proportion of flagged parts are actual anomalies. Cutting down the false negative rate shouldn't come at the cost of flagging a large number of normal parts as anomalies.
- Near real-time response. The system must be capable of flagging a potential anomaly before the part progresses to the next production step. Though timing can differ, we aimed for analysis completion in under 5 seconds.
- Adaptable and transferable across applications. As processes evolve, any AI system must be easily retrainable by engineers without extensive additional development. Additionally, the system must demonstrate effectiveness across multiple nutrunner datasets to prove its adaptability to varied scenarios.
During model evaluation, we categorized ANCs as genuine anomalies. This approach is reasonable because ANCs are still outliers, and having test engineers review them is the cautious choice that ensures the highest quality output. However, we maintain that ANC information offers valuable additional clues for assessing model performance and deserves closer attention during performance analysis. This is especially relevant for nutrunner data, where dataset quality is hard to verify because data labeling involves some degree of subjective judgment.

The researchers implemented a GMM to analyze their data. These charts present a visualization of datasets 1 and 2 with the anomaly thresholds highlighted. Photo courtesy Ford Motor Co.
With this in mind, we explored two scenarios:
- ANCs are treated as genuine anomalies.
- ANCs are classified as normal.
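The two scenarios differ only in whether the ANC label is mapped to the positive class, so the same predictions can be scored both ways. The minimal sketch below uses hypothetical labels and predictions, not the study's data. The F-score here is the standard F1, the harmonic mean of precision and recall, which we assume matches the article's usage.

```python
import numpy as np

def prf(y_true, y_pred):
    """Precision, recall, and F-score with anomalies as the positive class."""
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

labels = np.array(["normal", "anomaly", "anc", "normal", "anc", "anomaly", "normal"])
flagged = np.array([False, True, True, True, False, True, False])  # hypothetical model output

scenario_a = np.isin(labels, ["anomaly", "anc"])  # ANCs count as anomalies
scenario_b = labels == "anomaly"                  # ANCs count as normal
for name, y in [("a", scenario_a), ("b", scenario_b)]:
    print(name, prf(y, flagged))
```

With these toy labels, treating ANCs as normal turns a flagged ANC into a false positive and removes a missed ANC from the false negatives. Precision therefore drops while recall rises, even though the predictions are identical.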
The effectiveness of our semi-supervised GMM algorithm was heavily dependent on the dimensionality reduction technique applied beforehand. Looking at Dataset 1a, where ANCs count as anomalies, PCA-GMM comes out on top, achieving an F-score of 0.55 versus 0.25 and 0.27 for t-SNE-GMM and UMAP-GMM, respectively. Both t-SNE and UMAP underperform in comparison: t-SNE yields poor recall, while UMAP yields low precision. Nevertheless, all three methods surpass the existing PCA-DBSCAN setup used for anomaly detection.
Turning to Dataset 1b, where ANCs are treated as normal, PCA-GMM records a decline in the number of true positives and false negatives. t-SNE is the only method whose F-score actually improves when ANCs are considered normal. This was consistent with observations made during development, as t-SNE proved helpful at identifying true anomalies that had contaminated the normal training data. That said, it wasn't quite as effective at separating ANC data from normal data.
UMAP-GMM leads in recall across all methods for Dataset 1. However, its notably low precision makes it impractical in real-world settings, since it would create a significant workload increase for test engineers dealing with false positives.
In Dataset 2a, PCA-GMM outshines all other methods, attaining the highest recall and second-best precision. Additionally, with Dataset 2b, this method flags 100 percent of the true anomalies. t-SNE-GMM delivers competitive results for Dataset 2, mirroring the PCA-GMM performance. UMAP-GMM again ranks lowest among the approaches due to elevated false positive rates, though it still achieves a higher F-score than the original PCA-DBSCAN method.
Editor’s note: This article summarizes a research paper co-authored by James Simon Flynn, Ph.D., and Cinzia Giannetti, Ph.D., associate professor of mechanical engineering at Swansea University in Swansea, UK.