Unlocking disease prediction: How the MILTON framework utilizes multi-omics data to transform health insights.
Study: Disease prediction with multi-omics and biomarkers empowers case–control genetic discoveries in the UK Biobank.
In a recent study published in Nature Genetics, a group of researchers developed and applied an ensemble machine-learning framework (MILTON) to predict diseases and enhance genetic association analyses using multi-omics data from the United Kingdom Biobank (UKB).
Background
Identifying individuals at high risk of developing diseases is vital for preventative medicine. However, traditional risk assessment tools, which rely on factors such as age and family history, may not fully capture the complexity of disease biology.
Large-scale biobanks, such as the UKB, pair routine blood tests with omics data such as proteomics and metabolomics, which provides opportunities to discover novel biomarkers.
These comprehensive datasets enable the identification of biomarker combinations that enhance disease prediction beyond individual markers. Further research is necessary to better understand the biological processes underlying complex diseases and to improve predictive models.
About the study
The UKB cohort includes 502,226 participants aged 37 to 73 years, with a median age of 58. Of these, 54.4% are female. The data provides comprehensive information such as diagnosis records, blood biochemistry, body size measures, genomics, and proteomics data. All participants provided informed consent and participated voluntarily.
The Finnish Gene (FinnGen) cohort consists of 412,181 individuals, 55.9% of whom are female, with a median age of 63. Participants also provided informed consent and took part voluntarily.
FinnGen data was not accessed at the patient level; only Genome-Wide Association Study (GWAS) summary statistics were used. The research adhered to all ethical regulations, with approvals obtained from the appropriate ethics boards.
The UKB study received approval from the North West Centre for Research Ethics Committee, while the Coordinating Ethics Committee of the Hospital District of Helsinki and Uusimaa approved the FinnGen study.
The Finnish Institute for Health and Welfare, the Digital and Population Data Service Agency, the Social Insurance Institution, and Statistics Finland granted additional approvals for FinnGen.
Both studies carefully processed the data, ensuring accurate case and control definitions. Extensive filtering was applied to cases and controls to maintain consistency in the distribution of age, sex, and other baseline characteristics.
Study results
Clinical biomarkers play a crucial role in diagnosing and evaluating diseases by providing measurable indications of a condition’s presence and severity. In the context of phenome-wide association studies (PheWASs), biomarkers also offer an opportunity to identify misclassified or cryptic cases.
MILTON, a machine-learning method, uses quantitative biomarkers to predict disease status for 3,213 disease phenotypes. The technique works by first learning a disease-specific signature from diagnosed patients and then predicting potential novel cases among the original controls. The resulting augmented cohorts are then used in rare-variant collapsing analyses, and the results are compared with those from the baseline cohorts.
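The augmentation step can be illustrated with a minimal sketch. This is not MILTON's implementation: the function name `augment_cohort`, the participant IDs, and the 0.7 probability cutoff are assumptions made for the example.

```python
def augment_cohort(cases, control_probs, threshold=0.7):
    """Return (augmented_cases, remaining_controls).

    cases: set of participant IDs with a recorded diagnosis.
    control_probs: dict mapping control ID -> predicted case probability.
    threshold: hypothetical cutoff at or above which a control is
               treated as a putative (predicted) case.
    """
    predicted = {pid for pid, p in control_probs.items() if p >= threshold}
    augmented_cases = set(cases) | predicted
    remaining_controls = set(control_probs) - predicted
    return augmented_cases, remaining_controls


# Made-up example data, for illustration only
cases = {"p1", "p2"}
control_probs = {"p3": 0.91, "p4": 0.12, "p5": 0.70, "p6": 0.45}

augmented, controls = augment_cohort(cases, control_probs)
print(sorted(augmented))  # ['p1', 'p2', 'p3', 'p5']
print(sorted(controls))   # ['p4', 'p6']
```

In this sketch, controls reclassified as putative cases are removed from the control set, so the augmented and baseline cohorts can be analyzed side by side.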
MILTON’s disease prediction models are defined based on the time lag between biomarker sample collection and diagnosis. In the UKB, samples may have been collected up to 16.5 years before or 50 years after diagnosis.
MILTON was trained using three different time models: prognostic (cases diagnosed up to 10 years after sample collection), diagnostic (cases diagnosed up to 10 years before sample collection), and time-agnostic (all diagnosed cases). The 10-year cutoff was chosen as optimal after a sensitivity analysis on 400 randomly selected International Classification of Diseases, 10th Revision (ICD10) codes.
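As a rough sketch of how a diagnosed case could be assigned to these time models, the snippet below classifies a case by the lag between sample collection and diagnosis. The function name `time_models`, the sign convention, and the treatment of a lag of exactly zero as prognostic are assumptions for illustration, not details taken from the study.

```python
def time_models(lag_years, window=10.0):
    """List the time models a diagnosed case would fall under.

    lag_years: diagnosis date minus sample collection date, in years
               (positive means diagnosed after sampling).
    window: the 10-year cutoff described in the text.
    """
    models = ["time-agnostic"]  # every diagnosed case qualifies
    if 0 <= lag_years <= window:
        models.append("prognostic")   # diagnosed after sample collection
    elif -window <= lag_years < 0:
        models.append("diagnostic")   # diagnosed before sample collection
    return models


print(time_models(3.5))    # ['time-agnostic', 'prognostic']
print(time_models(-8.0))   # ['time-agnostic', 'diagnostic']
print(time_models(-25.0))  # ['time-agnostic']
```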
MILTON was trained on 67 features, including blood biochemistry and count measures, urine assays, body size, blood pressure, sex, age, spirometry, and fasting time. The model’s performance was assessed using the area under the curve (AUC) metric. MILTON achieved AUC ≥ 0.7 for 1,091 ICD10 codes, AUC ≥ 0.8 for 384 codes, and AUC ≥ 0.9 for 121 codes across all time models and ancestries.
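For readers unfamiliar with the metric, the sketch below computes AUC as the probability that a randomly chosen case receives a higher score than a randomly chosen control, then counts how many phenotypes clear each threshold. The helper names, scores, and phenotype labels are hypothetical; this is a plain-Python illustration rather than the evaluation code used in the study.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the fraction of (case, control) pairs in which the case
    has the higher score; ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def count_at_thresholds(auc_by_code, thresholds=(0.7, 0.8, 0.9)):
    """Count phenotypes whose AUC is at or above each threshold."""
    return {t: sum(a >= t for a in auc_by_code.values()) for t in thresholds}


# Made-up labels (1 = case, 0 = control) and model scores
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.2]
print(round(auc(labels, scores), 3))  # 0.833

# Made-up per-phenotype AUCs
aucs = {"code_A": 0.92, "code_B": 0.81, "code_C": 0.70, "code_D": 0.55}
print(count_at_thresholds(aucs))  # {0.7: 3, 0.8: 2, 0.9: 1}
```

The pairwise form is quadratic in cohort size, so it suits small illustrations only; at biobank scale a rank-based computation would be the practical choice.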
Diagnostic models generally performed better than prognostic ones across 1,466 ICD10 codes. For example, in European (EUR) ancestry participants, diagnostic models had a higher median AUC (0.668 versus 0.647) and sensitivity (0.586 versus 0.570).
MILTON also showed stable performance for EUR and African ancestries, while performance improved for South Asian diagnostic models as the number of cases increased.
MILTON’s ability to predict disease before onset was further validated. When individuals with a high predicted case probability (0.7 ≤ Pcase ≤ 1) were analyzed, this group was significantly enriched for participants who were later diagnosed with the corresponding condition for 97.41% of ICD10 codes. These results affirm MILTON’s effectiveness in identifying emerging cases and augmenting genetic association analyses.
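One way to picture such an enrichment check is a 2×2 comparison of high-probability participants against everyone else. The article does not state which statistical test was used; the sketch below assumes a one-sided Fisher's exact test, and the function name and counts are made up for illustration.

```python
from math import comb


def fisher_enrichment(a, b, c, d):
    """One-sided Fisher's exact test p-value for enrichment.

    a: high-probability participants who were later diagnosed
    b: high-probability participants who were not
    c: other participants who were later diagnosed
    d: other participants who were not
    """
    n_high = a + b          # size of the high-probability group
    n_dx = a + c            # total later-diagnosed participants
    n = a + b + c + d       # everyone in the table
    upper = min(n_high, n_dx)
    # Hypergeometric tail: probability of seeing at least `a`
    # later-diagnosed participants in the high-probability group by chance
    tail = sum(comb(n_dx, k) * comb(n - n_dx, n_high - k)
               for k in range(a, upper + 1))
    return tail / comb(n, n_high)


# Made-up counts: 8 of 10 high-probability participants later diagnosed,
# versus 2 of 10 among the rest
p = fisher_enrichment(8, 2, 2, 8)
print(f"{p:.4f}")  # 0.0115
```

A small p-value here means later-diagnosed participants are over-represented in the high-probability group beyond what chance would explain.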
Conclusions
To summarize, MILTON predicts diseases using multi-omics and biomarkers, enhancing case-control studies across five UKB ancestries. Despite the broad, non-disease-specific feature set, MILTON achieved high predictive power for numerous phenotypes, with AUC ≥ 0.7 for 1,091 ICD10 codes, AUC ≥ 0.8 for 384, and AUC ≥ 0.9 for 121.
However, for some diseases, predictive power remained low, indicating the need for more informative features.
MILTON often outperformed polygenic risk scores (PRSs) but underperformed them for diseases such as melanoma and breast cancer. Proteomics data improved predictions for 52 phenotypes. MILTON also identified 182 putative novel gene-disease signals requiring further validation.