Study design and data sources
Study overview
In this study, we examine the relationship between air pollution and the onset of ACS in Malaysia, focusing on its subtypes ST-Elevation Myocardial Infarction (STEMI) and Non-ST-Elevation Myocardial Infarction/Unstable Angina (NSTEMI/UA). We specifically examine key air pollutants such as NOx, SO2, O3, and PM10, known contributors to cardiovascular diseases15.
Our analysis leverages ML algorithms, including logistic regression (LR), support vector machine (SVM), random forest (RF), naïve Bayes (NB), eXtreme gradient boosting (XGBoost), and stacked ensemble learning (EL). The study combines clinical and environmental data to investigate factors influencing mortality in ACS patients, notably the impact of air pollution. The SHapley Additive exPlanations (SHAP) explainer was used to better understand and improve the predictability and transparency of these models. The study development flowchart is shown in Fig. 1 below.
Graphical workflow of ML model development.
Study data
The data for this study were collected from two main sources: the National Cardiovascular Disease Database (NCVD) for ACS data and the Department of Environment (DOE), Malaysia, for air quality measurements. Both datasets were acquired as structured data.
The NCVD, supported by Malaysia’s Ministry of Health, gathers data on cardiovascular diseases. Our focus is on the NCVD-ACS registry, which includes records from 25 Malaysian hospitals and spans 2006 to 2017. The Medical Review & Ethics Committee (MREC) of the Ministry of Health approved the registry in 2007 (Approval Code: NMRR-07-38-164), with the UiTM ethics committee (Reference number: 600-TNCPI (5/1/6)) and the National Heart Association of Malaysia (NHAM) also granting approval. Key patient data, such as demographics, clinical profiles, and treatment details, are meticulously collected. Patient mortality is verified annually by the National Registration Department of Deaths. The NCVD data were accessed on 9 July 2021. The data used in this study were anonymized prior to analysis, as our research focuses solely on the values and features, without access to any personal information about the patients. All procedures in this study were conducted in accordance with the Declaration of Helsinki. Informed consent was waived by the Medical Review & Ethics Committee (MREC) of the Ministry of Health Malaysia, as the NCVD-ACS data used in this study were anonymized prior to access and analysis.
The study analyzed air quality data from the DOE Malaysia from January 1, 2006, to April 13, 2017, including 61,816 daily measurements of NOx, SO2, O3, and PM10. The DOE air quality data were accessed on 23 June 2021. This information, which included 24-h mean concentrations from a network of monitoring stations, was combined with NCVD-ACS records from hospitals within a 15-km radius16. Air quality data at time lag 0 were aligned with the daily reporting of ACS onset in patient records to assess the impact of air pollution on ACS patients’ mortality risk. Google Earth analysis was instrumental in clarifying the spatial relationships among monitoring stations, hospitals, and air quality. This contributed to better integration of air quality data with the NCVD-ACS datasets, ensuring temporal matching of environmental exposure to ACS events.
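As an illustration only, the spatial and temporal matching described above could be sketched as follows. The great-circle (haversine) distance, the function names, the coordinates, and the dictionary layout are all assumptions made for this sketch; the study itself reports using Google Earth to establish the station and hospital relationships.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_station_within(hospital, stations, radius_km=15.0):
    """Return (station_id, distance_km) for the closest station inside the radius, else None."""
    best = None
    for station_id, (lat, lon) in stations.items():
        d = haversine_km(hospital[0], hospital[1], lat, lon)
        if d <= radius_km and (best is None or d < best[1]):
            best = (station_id, d)
    return best


def lag0_exposure(onset_date, station_id, daily_means):
    """Look up the 24-h mean pollutant values recorded on the day of ACS onset (lag 0)."""
    return daily_means.get((station_id, onset_date))


# Hypothetical coordinates and readings, for illustration only.
hospital = (3.0, 101.5)
stations = {"S1": (3.1, 101.5), "S2": (3.2, 101.5)}
daily_means = {("S1", "2010-05-01"): {"NOx": 0.03, "SO2": 0.002, "O3": 0.02, "PM10": 48.0}}

match = nearest_station_within(hospital, stations)
exposure = lag0_exposure("2010-05-01", match[0], daily_means) if match else None
```

A hospital with no station inside the radius returns `None`, which makes the exclusion of unmatched records explicit rather than silent.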
Outcomes and candidate predictors
The primary outcome of the study was the mortality of ACS patients in relation to air pollution, considering post-ACS onset. The study sought to understand the impact of air pollution on these mortality rates by investigating the relationship between ACS and air pollution predictors to predict ACS mortality outcomes.
Air pollution exposure assessment and rationale for hospital-based assignment
Recent ISI-indexed studies from 2022 to 2025 continue to highlight the relevance of short-term air pollution exposure, notably to NO2, SO2, and O3, in triggering acute coronary events or influencing mortality outcomes. These findings are especially relevant for Southeast Asia and the Western Pacific. Wang et al.17 conducted a multi-city study in China and found that short-term exposure to PM2.5, NO2, and SO2 was significantly associated with increased hospital admissions for acute myocardial infarction (AMI). He et al.14, in Shanghai, reported that hourly exposure to PM2.5 and NO2 correlated strongly with the onset of AMI, emphasizing the importance of very short-term triggers. Lee et al.18 from South Korea showed that acute exposure to PM2.5 and O3 increased the risk of out-of-hospital cardiac arrest, a severe consequence commonly linked to ACS. In Malaysia, Mohamad Roslan et al.19 investigated cardiovascular admissions in Klang and found that NO2 was significantly associated with ischemic heart disease admissions, especially in interaction with PM10, making it particularly relevant to our study. Similarly, Han et al.20 observed that higher short-term PM2.5 and O3 exposure, coupled with cold weather, was linked to increased AMI mortality in Taiwan.
Given these findings, this study uses hospital-based air pollution data as a proxy for short-term exposure in ACS patients. The attribution of exposure based on the hospital district rather than patient home addresses was chosen because of practical limitations in accessing individual residential data and the common practice in Malaysia of patients presenting to the nearest tertiary hospital within approximately 100 km of their homes. Although we acknowledge that patients may not always be in their residential or working area at symptom onset, such occurrences are relatively uncommon. The chosen approach aligns with established practices in air pollution epidemiology, where hospital or district-level exposure data are often used to estimate ambient conditions during the acute phase preceding hospitalization14,19.
From a methodological standpoint, the choice of exposure assignment in studies of acute events like ACS depends on the hypothesized exposure window and data availability. Residence-based exposure is usually preferred when evaluating long-term or cumulative effects, using pollutant levels from monitors near the home. However, this method does not account for the time patients spend away from home. Conversely, hospital-based exposure, especially in time-series and case-crossover designs, is considered appropriate for short-term exposure studies and has been widely used when fine-scale geolocation data are not available. It reflects the ambient environment where the acute episode likely culminated, particularly when analyzing lag periods of zero to seven days prior to the event. While hybrid or advanced exposure models that incorporate spatio-temporal or personal monitoring data offer improved precision, such approaches are not feasible in large-scale retrospective hospital datasets like ours. Thus, the hospital-based exposure model used here represents a pragmatic and scientifically justified method for evaluating short-term environmental contributions to ACS onset and mortality in the Malaysian context.
Data preparation
Data preprocessing
The source dataset from the NCVD registry comprised 54 variables across 54,000 records. For this study, we focused on 14 key input features identified as significant in a previous study by Kasim et al.7. This selection refined the dataset to 14,145 instances, specifically tailored for model development involving ACS patients.
The merged dataset was examined for potential errors, missing values, duplicate records, and outliers. Steps were taken to address these issues systematically: rows with incomplete data or outliers were removed, prioritizing the retention of complete cases. This approach not only improved the dataset’s quality but also minimized the risk of introducing biases or inaccuracies in the ML and stacked EL models, as supported by findings from Psychogyios et al.21.
To address the significant class imbalance inherent in mortality prediction tasks, we employed the Random OverSampling Examples (ROSE) technique during model training. Conventional classifiers often struggle with imbalanced datasets, tending to favor the majority class (survivors) while underperforming on the minority class (mortality), which is clinically the most critical. ROSE generates synthetic examples of the minority class using a smoothed bootstrap approach that estimates the conditional density of the data, thereby creating new instances that are similar but not identical to existing ones. This method enhances model sensitivity and recall for mortality prediction, improves generalizability, and reduces the risk of overfitting associated with simple duplication methods22,23. The effectiveness of ROSE in improving classification performance, particularly for rare outcomes, has been validated in biomedical and clinical informatics research, making it a suitable choice for our ACS mortality dataset.
For missing data, we opted for complete-case analysis by excluding records with missing values in key predictor variables. This approach was chosen to preserve the integrity and interpretability of the model, particularly given the risk of bias introduced by imputation methods when the missingness mechanism is uncertain or potentially not at random. Complete-case analysis provides unbiased parameter estimates when data are Missing Completely At Random (MCAR) and is particularly appropriate when the proportion of missing data is relatively low, as in our dataset24. Moreover, while multiple imputation offers an alternative, it relies heavily on the Missing At Random (MAR) assumption and can result in misestimation or implausible imputations if the imputation model is misspecified25. Given these considerations, the combined use of ROSE for class balancing and complete-case analysis for missing data represents a robust and transparent preprocessing strategy aligned with best practices in clinical prediction modeling.
The detailed breakdown of selected in-hospital variables stratified by survival outcome is presented in Supplementary Table 1. The distribution of variables between the training and testing datasets is summarized in Supplementary Table 2. Performance metrics for each machine learning model, including AUC values, confidence intervals, and statistical comparisons using DeLong’s test, are reported in Supplementary Table 3. The dataset was cleaned and merged with air pollution exposure data from the NCVD Registry and the Department of Environment (DOE), Malaysia, prior to model development and evaluation.
Feature selection in this study was guided by both clinical relevance and findings from our previous work using the same NCVD-ACS registry. In our earlier publication, we developed a validated model to predict in-hospital mortality among ACS patients in Malaysia, using variables routinely collected in the NCVD dataset. These include demographic factors (e.g., age, gender), clinical history (e.g., diabetes, hypertension), vital signs, biochemical parameters, and treatment variables, all of which have well-established links to ACS outcomes.
For this study, we extended the feature set by incorporating air pollution variables (NOx, SO2, O3, and PM10) based on data availability from the Department of Environment (DOE) and their documented associations with cardiovascular events in regional studies. Notably, Mohamad Roslan et al.19 demonstrated a strong link between NO2 and ischemic heart disease admissions in Klang, Malaysia. Our approach builds on validated variables from prior model development while introducing regionally relevant environmental predictors, ensuring both methodological continuity and enhanced insight into ACS mortality risk within the Malaysian context.
Data splitting and cross-validation
Data partitioning is depicted in Fig. 2, with 70% allocated for model training and the remaining 30% reserved for validation, following the guidelines in the literature26.

The flowchart indicating the raw number of instances before and after data cleaning in NCVD-ACS and air pollution data for the In-Hospital Variables Dataset.
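A minimal sketch of a 70/30 partition is given below for readers who want to reproduce the split in spirit. Stratifying by outcome class, the fixed seed, and the function name are assumptions of this sketch; the text does not state whether the split was stratified.

```python
import random


def stratified_split(labels, train_frac=0.7, seed=42):
    """Split row indices into train/test sets, keeping class proportions similar.

    Stratification by outcome is an assumption of this sketch, not a detail
    reported in the study.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = round(train_frac * len(idxs))
        train_idx.extend(idxs[:cut])
        test_idx.extend(idxs[cut:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical outcome vector: 90 survivors (0) and 10 deaths (1).
labels = [0] * 90 + [1] * 10
train_idx, test_idx = stratified_split(labels)
```

Splitting within each class keeps the rare mortality outcome represented in both partitions, which matters when the minority class is small.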
K-fold cross-validation was employed in this study. This technique involves dividing the input data into ‘k’ folds; for instance, with k = 5, the dataset is split into five parts. The model undergoes training and evaluation five times, using each fold once for testing and the others for training27. This method validates the performance of the developed ML models, ensuring the selection of the best model28.
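The fold rotation described above can be sketched in a few lines. This is an illustrative index generator only; the function name and the seed are assumptions, and no particular library's implementation is implied.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold.

    Each sample appears in exactly one test fold and in the training
    set of the remaining k - 1 folds.
    """
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal parts
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


# With k = 5, the model would be trained and evaluated five times.
splits = list(k_fold_indices(23, k=5))
```

Averaging a metric over the five test folds gives a more stable estimate than a single hold-out split, which is why it is useful for comparing candidate models.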
Data balancing and data normalization
Data imbalance is often found in medical datasets where class instances differ, leading to decreased classifier performance and bias towards the majority class29. To predict ACS mortality amid air pollution effectively, we applied the ROSE method to the training dataset only. ROSE is renowned for its efficacy in binary classification with imbalanced classes. It creates balanced samples for both continuous and categorical data using a smoothed bootstrap approach. This technique maintains the data’s integrity and produces synthetic samples for underrepresented classes, improving the accuracy and reducing the bias of the model23.
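To make the smoothed bootstrap idea concrete, the following is a simplified sketch in the spirit of ROSE, not a reimplementation of the R `ROSE` package. It handles continuous features only, and the Gaussian kernel with a Silverman-style rule-of-thumb bandwidth, the function name, and the toy data are assumptions of this sketch.

```python
import random
import statistics


def rose_like_sample(X, y, n_out, minority_label=1, p_minority=0.5, seed=0):
    """Draw a roughly balanced synthetic sample via a smoothed bootstrap.

    For each new row: pick a class (minority with probability p_minority),
    pick an observed row of that class, then perturb it with Gaussian noise
    whose scale is a rule-of-thumb bandwidth (assumption of this sketch).
    """
    rng = random.Random(seed)
    d = len(X[0])
    rows_by_class = {}
    for row, label in zip(X, y):
        rows_by_class.setdefault(label, []).append(row)
    majority_label = next(l for l in rows_by_class if l != minority_label)

    # Per-class, per-feature bandwidth: shrink factor times the feature's std.
    bandwidth = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        factor = (4.0 / ((d + 2) * n)) ** (1.0 / (d + 4))
        bandwidth[label] = [
            factor * statistics.pstdev([r[j] for r in rows]) for j in range(d)
        ]

    X_new, y_new = [], []
    for _ in range(n_out):
        label = minority_label if rng.random() < p_minority else majority_label
        centre = rng.choice(rows_by_class[label])
        X_new.append([rng.gauss(centre[j], bandwidth[label][j]) for j in range(d)])
        y_new.append(label)
    return X_new, y_new


# Hypothetical training data: 95 survivors (0), 5 deaths (1), two features.
X_train = [[float(i), 2.0 * i] for i in range(95)] + [[100.0 + i, 5.0] for i in range(5)]
y_train = [0] * 95 + [1] * 5
X_bal, y_bal = rose_like_sample(X_train, y_train, n_out=1000)
```

Consistent with the text, such resampling would be applied to the training partition only, leaving the test partition at its natural class distribution.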
For continuous variables such as age, heart rate, high-density lipoprotein cholesterol (HDLC), low-density lipoprotein cholesterol (LDLC), fasting blood glucose (FBG), NOx, SO2, O3, and PM10, data normalization was applied using the min–max normalization approach. Previous research has shown that data normalization can significantly improve the accuracy of ML algorithms30.
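Min–max normalization rescales each value as (x − min) / (max − min), mapping the observed range onto [0, 1]. A minimal sketch follows; estimating the minimum and maximum on the training data and reusing them for the test data is a common convention assumed here, not a detail stated in the text, and the function names are illustrative.

```python
def fit_min_max(column):
    """Return the (min, max) of a training column."""
    return min(column), max(column)


def apply_min_max(column, lo, hi):
    """Rescale values to [0, 1] using previously fitted bounds."""
    span = hi - lo
    if span == 0:
        # Constant column: map everything to 0.0 to avoid division by zero.
        return [0.0 for _ in column]
    return [(v - lo) / span for v in column]


# Hypothetical ages in the training partition.
train_age = [40, 60, 80]
lo, hi = fit_min_max(train_age)
scaled_train = apply_min_max(train_age, lo, hi)  # [0.0, 0.5, 1.0]
scaled_test = apply_min_max([50, 90], lo, hi)    # values outside the training range exceed 1
```

Reusing the training bounds means test values can fall slightly outside [0, 1], which is expected and avoids leaking test-set information into preprocessing.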
Machine learning model development
The detailed flow for classification model development presents the sequence of steps in our ML application, including model development, hyperparameter tuning, and the selection of the best-performing model (Fig. 3).

The flowchart of the classification ML predictive models’ development.
Hyperparameter tuning
Each of the models in our study underwent hyperparameter tuning, which is essential for optimizing performance and ensuring the accuracy and robustness of our analysis, particularly in the context of air pollution and ACS incidence.
For this purpose, we used the ‘caret’ package in R, known for its ability to streamline the training of complex models31. This package was chosen for its consistency in delivering results across various model complexities. The hyperparameter values for optimal performance of the classification models are included in Supplementary Table 4.
Machine learning performance evaluation
The Area Under the Receiver Operating Characteristic (AUROC) curve is a key metric for evaluating classification models, particularly in medical research, as noted by Fawcett32. AUROC provides a consistent measure: it assesses a classifier’s ability to distinguish between classes based on true positive and false positive rates, independent of class distributions. This quality makes AUROC a reliable and informative tool for comparing classifier performance.
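One way to see why AUROC does not depend on class proportions is its rank interpretation: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it directly from that definition; the function name and the toy scores are assumptions, and the pairwise loop is chosen for clarity rather than speed.

```python
def auroc(y_true, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical outcomes (1 = death) and predicted risks.
y_true = [0, 0, 1, 1]
risk = [0.1, 0.4, 0.35, 0.8]
auc = auroc(y_true, risk)  # 3 of 4 pairs ranked correctly -> 0.75
```

Because the statistic is a ratio over positive and negative pairs, duplicating either class leaves it unchanged, which is what makes it suitable for an imbalanced test set.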
The raw testing dataset was used to evaluate the model’s performance without applying the ROSE balancing method. This approach was chosen so that the evaluation reflects the model’s performance in real-world scenarios.
Model interpretation and comparison
SHAP analysis
To address the ‘black box’ nature of ML algorithms, we used SHAP to interpret ML model predictions33. Using the ‘shap’ library, we computed SHAP values, which provide a unified measure of feature importance. This involved training ML models, making predictions, and then applying the SHAP explainer to these models. SHAP values reveal the contribution of each feature to the predictions, thereby enhancing the global interpretability of the models and the understanding of feature importance.
Comparative analysis of ML models and conventional methods: NRI and performance metrics
The Net Reclassification Index (NRI) measures improvement in classifying individuals into higher or lower risk categories when a new model is compared with a pre-existing risk method, particularly for the prediction of ACS risk in relation to air pollution34.
The study adopts a mortality risk threshold for high- and low-risk patients, as proposed by Correia et al.35, which is applied to the best machine learning models for NRI calculation. The determination of suitable cut-off points for the TIMI risk score, particularly for STEMI and NSTEMI/UA patients, is aligned with recognized standards in the field.
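For a two-category (high versus low risk) comparison, the NRI is commonly written as [P(up | event) − P(down | event)] + [P(down | non-event) − P(up | non-event)], where "up" means the new model moves a patient from low to high risk. The sketch below implements that categorical form; the function name and the toy classifications are assumptions, and the actual thresholds used in the study are those cited above, not values shown here.

```python
def net_reclassification_index(events, old_high, new_high):
    """Two-category NRI comparing a new model with a reference score.

    events:   1 if the patient died, 0 otherwise
    old_high: True if the reference score (e.g. TIMI) classed the patient as high risk
    new_high: True if the new model classed the patient as high risk
    """
    up_e = down_e = n_e = 0
    up_ne = down_ne = n_ne = 0
    for event, old, new in zip(events, old_high, new_high):
        moved_up = (not old) and new
        moved_down = old and (not new)
        if event:
            n_e += 1
            up_e += moved_up
            down_e += moved_down
        else:
            n_ne += 1
            up_ne += moved_up
            down_ne += moved_down
    if n_e == 0 or n_ne == 0:
        raise ValueError("NRI needs at least one event and one non-event")
    return (up_e - down_e) / n_e + (down_ne - up_ne) / n_ne


# Hypothetical classifications for eight patients.
events = [1, 1, 1, 1, 0, 0, 0, 0]
old_high = [False, False, True, True, True, True, False, False]
new_high = [True, True, True, False, False, False, False, True]
nri = net_reclassification_index(events, old_high, new_high)  # 0.25 + 0.25 = 0.5
```

A positive value indicates that, on balance, the new model moves patients who died towards the high-risk category and survivors towards the low-risk category.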
The TIMI (Thrombolysis In Myocardial Infarction) risk score was chosen as the clinical benchmark in our study because of its widespread use and practicality in in-hospital settings, particularly in Malaysia and other regions within the Western Pacific. Compared to the GRACE score, which includes a broader set of variables and is more suitable for long-term risk prediction, TIMI is simpler, additive in structure, and relies on clinical variables that are readily available upon admission, making it especially valuable for early triage and mortality risk assessment. Several studies36,37 have shown that the TIMI score performs comparably to GRACE in predicting in-hospital mortality, particularly in acute settings like ST-elevation myocardial infarction (STEMI). Its integration into emergency protocols and ease of bedside use have led to its adoption in national and hospital-level ACS care guidelines. As such, TIMI provides a clinically relevant and contextually appropriate baseline for evaluating the added predictive value of machine learning models for ACS-related in-hospital mortality.
Best model deployment on the web
The best-performing algorithm identified in the study, the RF algorithm, has been implemented in an online platform using web programming languages. This web-based system features an interface for mortality prediction, using both ACS and air pollution parameters. Additionally, it incorporates a database for storing patient outcomes, which aids in the ongoing evaluation and enhancement of the system. A reporting mechanism is also integrated, further augmenting its utility in clinical and research settings.



