Development of Models to Predict the Progression of Pulmonary Fibrosis

Introduction: The Challenge of Predicting Pulmonary Fibrosis Progression

Pulmonary fibrosis is a progressive, often fatal interstitial lung disease marked by excessive deposition of extracellular matrix, leading to irreversible scarring of the lung parenchyma. This fibrotic remodeling destroys the delicate alveolar architecture, progressively impairing gas exchange and causing debilitating dyspnea, cough, and reduced quality of life. Idiopathic pulmonary fibrosis (IPF) is the most common and severe form, with a median survival of only 3–5 years following diagnosis. However, the rate of progression varies widely among patients: some experience a rapid, relentless decline, while others remain stable for years, sometimes interrupted by acute exacerbations.

This unpredictability poses a major clinical challenge. Clinicians currently rely on serial measurements of forced vital capacity (FVC) and diffusing capacity of the lungs for carbon monoxide (DLCO), but these physiological markers register decline only after lung function has already been lost. There is an urgent, unmet need for robust predictive models that can forecast individual disease trajectories accurately. Such models would enable earlier identification of high-risk patients, guide personalized therapeutic decisions (e.g., earlier referral for lung transplantation, initiation of antifibrotic therapy), and improve the design of clinical trials by enriching study populations with patients likely to progress.

Recent advances in computational modeling, spanning statistical analysis, machine learning, and deep learning, are making such predictive tools practical. This article provides an overview of the current state of predictive model development for pulmonary fibrosis progression, detailing the methodologies employed, the challenges encountered, and the future directions that could transform management of this devastating disease.

Why Predictive Models Matter in Pulmonary Fibrosis

The heterogeneity of pulmonary fibrosis progression directly impacts clinical decision-making. A static treatment approach—prescribing antifibrotic agents such as pirfenidone or nintedanib to all patients—fails to account for individual risk. Predictive models offer three key benefits:

Personalized Risk Stratification: By integrating clinical, imaging, genomic, and biomarker data, models can assign a probability of rapid decline to each patient. This allows clinicians to tailor monitoring frequency, escalate therapy proactively, and prioritize lung transplant evaluation for those at highest risk.
Optimized Resource Allocation: Healthcare systems can focus intensive monitoring and interventions on patients predicted to progress, while avoiding unnecessary burdens on stable patients. This is particularly valuable for managing disease progression in resource-limited settings.
Enhanced Clinical Trial Efficiency: Pharmaceutical companies struggle with enrollment criteria that fail to ensure a sufficient number of progression events during the trial period, leading to underpowered studies or extended timelines. Predictive models can screen potential participants to select those with a high likelihood of decline, thereby reducing sample size requirements and shortening trial duration. This accelerates the development of novel therapies.
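
The effect of enrichment on trial size can be illustrated with the standard normal-approximation formula for comparing two proportions. This is a sketch only: the event rates are hypothetical, chosen to show the direction of the effect, and the function name n_per_arm is illustrative.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(p_control, p_treated, alpha=0.05, power=0.80):
    """Approximate patients per arm needed to detect p_control vs. p_treated
    (two-sided comparison of two proportions, normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_control - p_treated) ** 2)


# Hypothetical 12-month progression rates; the drug is assumed to halve the rate.
unselected = n_per_arm(p_control=0.20, p_treated=0.10)  # all-comers
enriched = n_per_arm(p_control=0.40, p_treated=0.20)    # model-selected high-risk patients
print(unselected, enriched)
```

Under these assumed rates, doubling the placebo-arm event rate through enrichment cuts the required enrollment per arm by more than half for the same relative treatment effect.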

Beyond these practical advantages, predictive models also deepen our understanding of disease mechanisms. For instance, identifying which features contribute most to predictive power, whether specific genetic variants (e.g., the MUC5B promoter polymorphism), circulating biomarkers (e.g., MMP-7, KL-6), or quantitative CT patterns (e.g., traction bronchiectasis, honeycombing), can highlight biological pathways driving fibrosis, potentially revealing new therapeutic targets.

Categories of Predictive Models

Researchers have developed a spectrum of modeling approaches, each with distinct strengths and limitations. The main categories include statistical models, machine learning models, and imaging-based models. Often, the most powerful solutions combine elements of all three.

Statistical Models

Traditional regression-based approaches, such as Cox proportional hazards models or logistic regression, have been the workhorses of clinical prediction for decades. These models directly relate predictor variables (e.g., baseline FVC, DLCO, age, sex, smoking history) to a binary or time-to-event outcome (e.g., 1-year mortality, 6-month FVC decline ≥10%). The GAP index (gender, age, physiology) is a well-known statistical model that estimates mortality risk in IPF using four variables: sex, age, FVC, and DLCO. It has been externally validated and remains a clinical standard. Statistical models are transparent, interpretable, and require modest computational resources. However, unless interaction or nonlinear terms are specified explicitly, they assume that predictors act linearly and additively on the log-odds or log-hazard scale, so they often miss the complex interactions and nonlinear associations present in pulmonary fibrosis (PF) pathophysiology.
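
To illustrate how a point-based statistical model is applied at the bedside, the sketch below scores the GAP index. The point assignments and stage cut-offs are those commonly reported from the original publication and should be checked against that source before any real use. The function and argument names are illustrative.

```python
def gap_points(sex, age, fvc_pct, dlco_pct):
    """Return GAP points (0-8).

    fvc_pct and dlco_pct are percent-predicted values;
    dlco_pct=None means the DLCO maneuver could not be performed.
    """
    points = 1 if sex == "male" else 0
    points += 0 if age <= 60 else 1 if age <= 65 else 2
    points += 0 if fvc_pct > 75 else 1 if fvc_pct >= 50 else 2
    if dlco_pct is None:
        points += 3
    else:
        points += 0 if dlco_pct > 55 else 1 if dlco_pct >= 36 else 2
    return points


def gap_stage(points):
    """Map total points to GAP stage I (0-3), II (4-5), or III (6-8)."""
    return "I" if points <= 3 else "II" if points <= 5 else "III"


# Example patient (made up): 68-year-old man, FVC 62% and DLCO 40% predicted.
total = gap_points("male", 68, 62, 40)
print(total, gap_stage(total))
```

The appeal of this design is that every point can be traced to a single variable, which is exactly the transparency that more flexible models give up.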

Machine Learning Models

Machine learning (ML) algorithms can automatically learn patterns from high-dimensional, heterogeneous data without explicit programming of rules. Common models used in PF prediction include:

Random Forests: An ensemble method that builds multiple decision trees and aggregates their outputs. Random forests handle nonlinearities and interactions well, can accommodate missing data through surrogate splits or imputation (depending on the implementation), and provide feature importance rankings. Some studies have reported random forest models outperforming logistic regression when predicting FVC decline over 52 weeks.
Support Vector Machines (SVM): SVM finds a hyperplane that best separates progression and non-progression classes. It is effective in high-dimensional spaces (e.g., combining hundreds of CT radiomic features) but can be less interpretable.
Gradient Boosting Machines (e.g., XGBoost, LightGBM): These models add trees sequentially, each one fitted to the errors of the ensemble built so far, and often achieve state-of-the-art predictive performance on tabular data. XGBoost has been successfully applied to predict mortality in PF using clinical and genetic data, yielding area under the receiver operating characteristic curve (AUC) values above 0.80.
Neural Networks: Deep learning architectures, including fully connected networks, convolutional neural networks (CNNs) for imaging data, and recurrent neural networks (RNNs) for longitudinal measurements, offer the greatest flexibility but require very large datasets and careful regularization to avoid overfitting. In PF, CNNs applied to high-resolution CT (HRCT) slices have predicted progression with AUCs exceeding 0.85.
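
The sequential error-correction idea behind gradient boosting can be sketched in a few lines. This is a toy version for squared-error loss with one-split trees (stumps) on a single predictor. It is not how XGBoost or LightGBM are implemented, it returns only training-set predictions, and the data are made up.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Find the single split on x that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # drop the largest x so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    preds = [mean(ys)] * len(ys)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        preds = [p + learning_rate * (left_mean if x <= threshold else right_mean)
                 for x, p in zip(xs, preds)]
    return preds


def mse(ys, preds):
    return mean((y - p) ** 2 for y, p in zip(ys, preds))


# Made-up data: baseline FVC (% predicted) vs. 52-week FVC change (mL).
baseline_fvc = [45, 52, 58, 63, 70, 76, 81, 88]
fvc_change = [-310, -280, -240, -150, -120, -90, -100, -60]
mse_start = mse(fvc_change, boost(baseline_fvc, fvc_change, n_rounds=0))
mse_boosted = mse(fvc_change, boost(baseline_fvc, fvc_change, n_rounds=20))
print(mse_start, mse_boosted)
```

Training error falls as rounds are added, which is also why boosted models need a held-out set or cross-validation to decide when to stop.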

One notable example is the PFRNet model, a deep learning network that integrates baseline clinical data and serial FVC measurements to forecast a patient’s FVC trajectory over the next year. When validated on independent cohorts, PFRNet demonstrated significantly lower mean absolute error compared to conventional linear mixed-effects models.

Imaging-Based Models

HRCT is a cornerstone of PF diagnosis and assessment. Imaging-based prediction models extract quantitative features from CT scans to measure disease extent and progression risk. These features include:

Quantitative lung fibrosis (QLF) scoring: Automated software calculates the percentage of lung volume affected by reticulation, honeycombing, and traction bronchiectasis. Higher QLF scores correlate with faster FVC decline and increased mortality.
Radiomics: High-throughput extraction of hundreds of texture, shape, and intensity features from lung regions. Machine learning then selects the most predictive radiomic signatures. For instance, a study in Radiology found that a radiomics nomogram incorporating texture features from HRCT significantly improved prediction of 2-year mortality beyond the GAP index alone.
Deep learning of CT patterns: CNNs trained on raw CT slice data can identify subtle fibrotic changes not visible to the human eye. A 2023 study using a 3D CNN on whole-lung CT volumes predicted 12-month progression with an AUC of 0.90, outperforming both QLF and GAP.
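
The final step of an extent score, turning a labelled segmentation into a percentage of lung volume, can be sketched as follows. This is not the algorithm used by any particular QLF software, which must first classify the voxels. The label convention and the toy mask are assumptions made for illustration.

```python
def fibrosis_extent(mask):
    """Percent of lung voxels labelled fibrotic in a 3D segmentation mask.

    Assumed label convention: 0 = outside lung, 1 = normal lung,
    2 = fibrotic pattern (reticulation, honeycombing, traction bronchiectasis).
    """
    voxels = [v for ct_slice in mask for row in ct_slice for v in row]
    lung = sum(1 for v in voxels if v in (1, 2))
    fibrotic = sum(1 for v in voxels if v == 2)
    return 100.0 * fibrotic / lung if lung else 0.0


# Toy mask: 2 slices x 2 rows x 4 columns.
toy_mask = [
    [[0, 1, 1, 0],
     [1, 1, 2, 2]],
    [[0, 1, 2, 0],
     [1, 2, 2, 1]],
]
extent = fibrosis_extent(toy_mask)
print(round(extent, 1))
```

With isotropic voxels the voxel fraction equals the volume fraction. With anisotropic or mixed-spacing scans, each voxel would need to be weighted by its physical volume.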

Combining imaging features with clinical and genetic data in multi-modal models often yields higher predictive accuracy than any single modality, as each captures complementary aspects of disease behavior.

Step-by-Step Model Development Pipeline

Building a reliable predictive model follows a structured pipeline, each stage critical to the model’s ultimate clinical utility.

1. Data Collection and Curation

High-quality, well-annotated datasets are the foundation. Sources include prospective observational cohorts (e.g., the IPF-PRO Registry, CORRELATE PF), clinical trial placebo arms (e.g., from ASCEND, CAPACITY, INPULSIS trials), and electronic health records (EHRs). Essential data types include:

Demographics: Age, sex, race, smoking history.
Physiology: Serial FVC (percent predicted), DLCO, 6-minute walk distance, oxygen saturation.
Imaging: Baseline and follow-up HRCT scans (with standardized acquisition protocols).
Biomarkers: Serum proteins (MMP-7, SP-D, KL-6), genetic variants (MUC5B, TOLLIP), and genomic signatures.
Clinical events: Mortality, hospitalization, acute exacerbation, transplant, or FVC decline thresholds.

Data curation involves handling missing values, ensuring consistent outcome definitions (e.g., “progression” defined as FVC decline ≥10% or death within 12 months), and harmonizing variables across centers. The quality of data curation directly impacts model generalizability.

2. Feature Engineering and Selection

From the raw data, relevant predictors (features) must be derived. In addition to basic clinical variables, advanced features include derived ratios (e.g., FVC/DLCO), temporal trends (e.g., the slope of FVC decline over previous visits), and composite imaging scores. Dimensionality reduction techniques (principal component analysis, autoencoders) and filter methods (mutual information ranking) remove redundant or noisy features to reduce overfitting. In high-dimensional settings (e.g., radiomics with >1000 features), feature selection is crucial; methods such as LASSO regression or recursive feature elimination with cross-validation are commonly used.
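
A temporal-trend feature such as the FVC slope can be derived with an ordinary least-squares fit over a patient's previous visits. The sketch below uses made-up measurements. In practice, a mixed-effects model across patients is often preferred when visits are sparse or irregular.

```python
def ols_slope(times, values):
    """Least-squares slope of values against times."""
    n = len(times)
    t_mean = sum(times) / n
    v_mean = sum(values) / n
    sxy = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    sxx = sum((t - t_mean) ** 2 for t in times)
    return sxy / sxx


# Made-up visits: months since baseline and FVC in litres.
months = [0, 3, 6, 9, 12]
fvc_litres = [2.80, 2.74, 2.69, 2.61, 2.55]

slope_per_year = 12 * ols_slope(months, fvc_litres)  # litres per year
print(round(slope_per_year * 1000), "mL/year")
```

For this patient the fitted slope corresponds to a loss of roughly 250 mL per year, a single number that can be fed to a downstream model alongside the baseline value.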

3. Model Training and Algorithm Selection

The curated dataset is split into training (typically 70–80%) and testing (20–30%) sets. For smaller cohorts, k-fold cross-validation (e.g., 5-fold or 10-fold) is used within the training set to tune hyperparameters. Model selection involves comparing multiple algorithms (e.g., logistic regression, random forest, XGBoost, neural network) using metrics such as AUC, sensitivity, specificity, positive predictive value, and Brier score for probabilistic calibration. For time-to-event outcomes, concordance index (C-index) and calibration plots are evaluated. The best-performing model is then applied to the held-out test set to obtain unbiased estimates of its predictive performance.
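
Two of the metrics named above are simple enough to compute by hand, which helps clarify what they measure. The sketch below computes the AUC as the fraction of (progressor, non-progressor) pairs that the model ranks correctly, and the Brier score as the mean squared difference between predicted probability and outcome. The labels and probabilities are made up.

```python
def auc(labels, scores):
    """Probability that a random positive case scores higher than a random negative case."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier(labels, scores):
    """Mean squared error of the predicted probabilities (lower is better)."""
    return sum((s - y) ** 2 for y, s in zip(labels, scores)) / len(labels)


# Made-up test-set outcomes (1 = progressed) and predicted probabilities.
y_true = [1, 0, 1, 0, 1]
y_prob = [0.9, 0.3, 0.6, 0.7, 0.8]

test_auc = auc(y_true, y_prob)
test_brier = brier(y_true, y_prob)
print(round(test_auc, 3), round(test_brier, 3))
```

The AUC depends only on the ordering of the predictions, whereas the Brier score also penalizes probabilities that are too extreme or too timid, so a model can rank well and still be poorly calibrated.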

4. Internal and External Validation

Internal validation (e.g., repeated cross-validation, bootstrap) assesses stability and optimism. External validation on an independent dataset from a different center or time period is the gold standard for evaluating generalizability. Discrepancies between internal and external performance often arise due to differences in patient demographics, disease severity at enrollment, scanning protocols, or outcome ascertainment. Models that pass external validation with AUC >0.80 and well-calibrated predictions are considered clinically plausible. A prominent example is the re-evaluation of the GAP index and a machine learning model on the CORRELATE PF registry, which confirmed their utility in real-world populations.
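
One quick calibration check on an external cohort is to compare the number of events the model expected with the number actually observed. The sketch below, using made-up data and an illustrative function name, reports the observed/expected (O/E) ratio and the mean difference between observed and predicted risk.

```python
def calibration_summary(outcomes, predicted):
    """Compare observed events with the sum of predicted probabilities."""
    observed = sum(outcomes)
    expected = sum(predicted)
    return {
        "observed_events": observed,
        "expected_events": expected,
        "oe_ratio": observed / expected,  # 1.0 means calibrated on average
        "mean_difference": (observed - expected) / len(outcomes),
    }


# Made-up external cohort: 1 = progressed within 12 months.
outcomes = [0, 0, 1, 1, 1]
predicted = [0.1, 0.2, 0.3, 0.6, 0.8]

summary = calibration_summary(outcomes, predicted)
print(summary)
```

An O/E ratio above 1, as in this toy cohort, means the model underestimates risk in the new population. Discrimination (AUC) can remain high in that situation, which is why both are reported.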

5. Deployment and Clinical Integration

Once validated, the model must be made accessible to clinicians. This often involves embedding it into EHR systems as a clinical decision support tool that automatically calculates a risk score and displays it alongside patient data. User interface design must present predictions clearly (e.g., “This patient has a 70% probability of requiring oxygen therapy within 12 months”). Compliance with regulatory requirements (e.g., FDA clearance, CE marking) and ongoing monitoring for model drift (degradation of performance over time) are essential for safe deployment.
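
Drift monitoring can be as simple as recomputing a performance metric on recent patients whose outcomes are now known and comparing it with the value obtained at validation. The sketch below uses the Brier score. The tolerance, the baseline value, and the data are hypothetical and would need to be set by the deploying institution.

```python
def brier(outcomes, predictions):
    """Mean squared error of predicted probabilities."""
    return sum((p - y) ** 2 for y, p in zip(outcomes, predictions)) / len(outcomes)


def drift_check(recent_outcomes, recent_predictions, baseline_brier, tolerance=0.05):
    """Flag drift when the recent Brier score exceeds the validation baseline by more than tolerance."""
    recent = brier(recent_outcomes, recent_predictions)
    return recent - baseline_brier > tolerance, recent


# Made-up recent window in which the model has started to mis-rank patients.
flagged, recent_brier = drift_check(
    recent_outcomes=[1, 0, 1, 1, 0, 1],
    recent_predictions=[0.2, 0.6, 0.3, 0.4, 0.7, 0.5],
    baseline_brier=0.15,
)
print(flagged, round(recent_brier, 3))
```

A flag of this kind would normally trigger review and possibly recalibration rather than automatic retraining.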

Challenges in Predictive Model Development

Despite promising advances, several obstacles impede the widespread adoption of PF progression models.

Data Heterogeneity and Variability

Clinical data from different centers are collected with varying equipment, protocols, and definitions. Pulmonary function tests may not be standardized across labs; HRCT acquisition parameters (slice thickness, reconstruction kernel, radiation dose) affect radiomic features. Inconsistent definitions of “progression” (relative vs. absolute FVC decline, inclusion of death or transplant as competing risks) complicate model comparability. Efforts like the Open Source Imaging Consortium (OSIC) aim to harmonize PF data, but heterogeneity remains a barrier.
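
The difference between relative and absolute definitions of FVC decline is easy to underestimate. In the sketch below, the same made-up patient counts as a progressor under one definition and not under the other. The 10% threshold is taken from the definition used earlier in this article, and the function names are illustrative.

```python
def absolute_decline(baseline_pct, followup_pct):
    """Decline in percentage points of predicted FVC."""
    return baseline_pct - followup_pct


def relative_decline(baseline_pct, followup_pct):
    """Decline as a percentage of the baseline value."""
    return 100.0 * (baseline_pct - followup_pct) / baseline_pct


# Made-up patient: FVC falls from 60% to 54% predicted over 12 months.
baseline, followup = 60.0, 54.0
progressed_absolute = absolute_decline(baseline, followup) >= 10  # 6 points
progressed_relative = relative_decline(baseline, followup) >= 10  # 10% of baseline
print(progressed_absolute, progressed_relative)
```

Because the two definitions label different patients as progressors, models trained under one definition cannot be compared directly with models trained under the other.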

Missing Data and Loss to Follow-up

Longitudinal PF cohorts suffer from attrition due to death, transplant, or patient dropout. When data are missing not at random (e.g., sicker patients are more likely to be lost), simple methods such as mean imputation or last observation carried forward can bias models. Multiple imputation by chained equations improves on these methods but assumes that data are missing at random given the observed covariates. Joint modeling of longitudinal and survival data can account for dropout that depends on the underlying disease trajectory. Both approaches require careful specification.
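
The bias introduced by last observation carried forward (LOCF) can be seen in a small example. In the sketch below, a made-up patient declines steadily and then misses the last two visits. Filling those visits with the last recorded value flattens the estimated slope.

```python
def ols_slope(times, values):
    """Least-squares slope of values against times."""
    n = len(times)
    t_mean = sum(times) / n
    v_mean = sum(values) / n
    sxy = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    sxx = sum((t - t_mean) ** 2 for t in times)
    return sxy / sxx


def locf(values):
    """Replace each missing value (None) with the most recent observed value."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


months = [0, 3, 6, 9, 12]
fvc_litres = [2.8, 2.7, 2.6, None, None]  # patient lost to follow-up after month 6

observed = [(t, v) for t, v in zip(months, fvc_litres) if v is not None]
slope_observed = ols_slope([t for t, _ in observed], [v for _, v in observed])
slope_locf = ols_slope(months, locf(fvc_litres))
print(round(slope_observed, 4), round(slope_locf, 4))  # litres per month
```

The LOCF slope is half as steep as the slope over the observed visits. If sicker patients drop out more often, this understatement is concentrated in exactly the patients the model most needs to identify.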

Model Interpretability

While deep learning models often achieve superior accuracy, their “black box” nature hinders clinical trust. Physicians need to understand why a model predicts high risk—is it due to a rapid decline in DLCO, a specific CT pattern, or an elevated biomarker? Explainable AI techniques (e.g., SHAP values, LIME, attention maps) are increasingly being applied to PF models to highlight influential features at the patient level. Regulatory bodies may also require explanations for high-stakes decisions.
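
For a linear model with features treated as independent, the SHAP value of each feature reduces to its coefficient multiplied by the feature's deviation from the background mean. That makes it a convenient case for showing what a patient-level explanation looks like. The coefficients, intercept, and cohort means below are hypothetical, not taken from any published model.

```python
def linear_contributions(coefs, patient, background_means):
    """Per-feature contribution to a linear score, relative to the average patient."""
    return {name: coefs[name] * (patient[name] - background_means[name]) for name in coefs}


def linear_score(coefs, intercept, values):
    return intercept + sum(coefs[name] * values[name] for name in coefs)


# Hypothetical log-odds model of 12-month progression.
coefs = {"age": 0.04, "fvc_pct": -0.03, "dlco_pct": -0.05}
intercept = 1.0
cohort_means = {"age": 68, "fvc_pct": 72, "dlco_pct": 45}
patient = {"age": 74, "fvc_pct": 60, "dlco_pct": 30}

contributions = linear_contributions(coefs, patient, cohort_means)
base_value = linear_score(coefs, intercept, cohort_means)
patient_score = linear_score(coefs, intercept, patient)
top_feature = max(contributions, key=lambda name: abs(contributions[name]))
print(contributions, top_feature)
```

The contributions sum to the difference between the patient's score and the average score, so a clinician can see that this patient's elevated risk is driven mainly by the low DLCO. For tree ensembles and neural networks, dedicated explanation libraries approximate the same additive decomposition.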

External Validation and Generalizability

Many published PF models are developed and tested on single-center or trial-only populations, which may not reflect real-world diversity. For example, the CAPACITY and INPULSIS trial populations were relatively homogeneous (predominantly Caucasian, mild to moderate disease). Models may fail in populations with different genetic backgrounds, concomitant conditions (e.g., emphysema, pulmonary hypertension), or different disease subtypes (e.g., connective tissue disease-associated ILD). Rigorous external validation in multi-ethnic, multi-center cohorts is necessary but resource-intensive.

Future Directions: Toward Precision Prediction

The next generation of predictive models will leverage richer data streams and more sophisticated analytics to achieve personalized, dynamic predictions.

Integration of Multi-Omics Data

Genomic, transcriptomic, proteomic, and metabolomic profiles can capture molecular disease activity. For example, a polygenic risk score incorporating MUC5B, TOLLIP, and other loci may stratify progression risk at diagnosis. Combining proteomic signatures—such as the recently identified FGF-2, CA-125, and OPG panel—with clinical variables has shown promise in early studies. Multi-omics integration via deep learning or Bayesian networks could reveal disease subphenotypes that respond differently to therapies.
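
A polygenic risk score is, in its simplest form, a weighted sum of risk-allele counts. The sketch below shows the arithmetic only. The per-allele weights are arbitrary placeholders rather than published effect sizes, and the locus keys are illustrative labels.

```python
def polygenic_risk_score(genotype, weights):
    """Weighted sum of risk-allele counts (0, 1, or 2 per locus).

    Loci missing from the genotype are treated as 0 risk alleles,
    which is itself an assumption; real pipelines usually impute them.
    """
    return sum(weights[locus] * genotype.get(locus, 0) for locus in weights)


# Placeholder per-allele weights (e.g., log odds ratios from an association study).
weights = {"MUC5B": 0.5, "TOLLIP": 0.2, "other_locus": 0.1}

# Made-up patient: one MUC5B risk allele, two TOLLIP risk alleles.
genotype = {"MUC5B": 1, "TOLLIP": 2, "other_locus": 0}

score = polygenic_risk_score(genotype, weights)
print(round(score, 2))
```

The resulting score can then enter a clinical model as one additional predictor, which keeps the genetic component interpretable.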

Real-Time and Wearable Monitoring

Smartphone-based spirometry, pulse oximeters, and activity trackers can generate continuous lung function and physical activity data. Predictive models that ingest this streaming information via recurrent or transformer architectures could provide near-real-time risk updates. If a patient’s oxygen saturation drops consistently during light activity, the model might alert the care team to a possible impending exacerbation. Such systems require robust data transmission, storage, and privacy safeguards.
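
The alerting logic for streaming data can be sketched with a simple rolling-mean rule, standing in for the recurrent or transformer models mentioned above. The window length and the 90% threshold are hypothetical and are not clinical recommendations. The class name is illustrative.

```python
from collections import deque


class SpO2Monitor:
    """Raise an alert when the rolling mean of recent readings falls below a threshold."""

    def __init__(self, window=5, threshold=90.0):
        self.readings = deque(maxlen=window)  # oldest reading is dropped automatically
        self.threshold = threshold

    def update(self, spo2):
        """Add one reading; return True if an alert should be raised."""
        self.readings.append(spo2)
        window_full = len(self.readings) == self.readings.maxlen
        rolling_mean = sum(self.readings) / len(self.readings)
        return window_full and rolling_mean < self.threshold


# Made-up exertional SpO2 readings (%), one per day.
monitor = SpO2Monitor(window=5, threshold=90.0)
alerts = [monitor.update(reading) for reading in [94, 93, 91, 89, 88, 87, 86]]
print(alerts)
```

Requiring a full window before alerting trades a short delay for fewer false alarms from isolated low readings, which are common with consumer pulse oximeters.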

Personalized Causal Models

Current models are mostly associative, not causal. Causal models (e.g., using directed acyclic graphs or structural equation modeling) would clarify which interventions could modify progression risk. For instance, if a model identifies that elevated KL-6 directly drives progression (not merely correlates), then targeting KL-6 with a monoclonal antibody might be a viable therapeutic strategy. Causal discovery from observational data is an active research area with potential for high impact.

Federated Learning and Data Sharing

To overcome the small sample size barrier, multi-institutional data sharing initiatives like the IPF-PRO Registry and European IPF Registry are critical. Federated learning, in which models are trained across decentralized data without sharing raw patient data, preserves privacy while enabling model development on datasets that collectively contain thousands of patients. Early federated learning experiments for PF CT segmentation have shown results comparable to centralized training.
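
The aggregation step at the core of many federated learning schemes is federated averaging: each site trains locally and sends only its model parameters, which a coordinator averages, weighted by each site's sample size. The sketch below shows that single step with made-up parameter vectors. Local training, secure transport, and repeated rounds are omitted.

```python
def federated_average(site_updates):
    """Sample-size-weighted average of parameter vectors.

    site_updates: list of (parameters, n_patients) pairs, one per site.
    """
    total_patients = sum(n for _, n in site_updates)
    n_params = len(site_updates[0][0])
    return [
        sum(params[i] * n for params, n in site_updates) / total_patients
        for i in range(n_params)
    ]


# Made-up updates from two hospitals; raw patient data never leave either site.
updates = [
    ([1.0, 2.0], 100),  # hospital A: parameters, number of patients
    ([3.0, 4.0], 300),  # hospital B
]
global_params = federated_average(updates)
print(global_params)
```

Weighting by sample size lets larger cohorts contribute proportionally more, so the averaged parameters lie closer to those of hospital B in this example.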

Conclusion

The development of accurate models to predict the progression of pulmonary fibrosis represents a pivotal step toward personalized, proactive management of this devastating disease. By moving beyond one-size-fits-all approaches and incorporating diverse data sources (clinical, imaging, genomic, and wearable-sensor), researchers are building tools that can forecast individual trajectories with increasing precision. While challenges of data heterogeneity, missing data, interpretability, and generalizability remain, ongoing advances in machine learning, multi-omics integration, and collaborative data sharing promise to overcome these hurdles. The ultimate goal is to integrate validated predictive models into routine clinical workflows, enabling earlier interventions, more efficient clinical trials, and improved patient outcomes. As these technologies mature, a future in which every pulmonary fibrosis patient receives a personalized risk assessment and treatment strategy is becoming increasingly attainable.