This report presents findings from a systematic ablation study conducted to address reviewer concerns regarding the potential circularity of using ATA risk level and RET variant features in our MTC prediction model. The concern raised was that including these features constitutes "restating consensus knowledge" rather than genuine prediction.
Key Finding: The conclusions are model-dependent. LightGBM on the expanded dataset still achieves 93.33% accuracy after removing all genetic features, while XGBoost on the original dataset preserves 100.00% recall even after removing CEA and variant one-hot encodings.
"The model already encodes cancer (the outcome) into both training and testing. This is a foundational validity problem and cannot be mitigated by cross-validation or holdout splitting. Consequently, reported performance metrics do not reflect true predictive capacity. Specifically, RET Variant and ATA Risk Level Encode Cancer a Priori. Including ATA risk as a feature means the model is given: 'This mutation usually causes cancer' and asked to predict whether cancer occurred. This is not prediction — it is restating consensus knowledge. The model cannot generalize beyond what is already encoded."
We conducted a systematic feature ablation study across:
- 5 machine learning models: Logistic Regression, Random Forest, XGBoost, LightGBM, SVM
- 2 datasets: Original (149 patients) and Expanded (1,047 samples with synthetic augmentation)
- 8 ablation configurations: Systematically removing feature groups to isolate contributions
| Configuration | Features Removed | Purpose |
|---|---|---|
| Baseline | None | Full model reference |
| No ATA Risk Level | ret_risk_level + interaction terms |
Test ATA risk contribution |
| No Variant Encodings | All variant_* one-hot features |
Test variant encoding value |
| No Genetics | All genetic features | Critical test: pure biomarker prediction |
| No Calcitonin | calcitonin_* features |
Test biomarker vs genetic reliance |
| No CEA | cea_level_numeric |
Address imputation concerns |
| Genetics Only | Remove all biomarkers | Critical test: is prediction "just consensus"? |
| Biomarkers Only | Remove all genetics | Equivalent to No Genetics |
- 80/20 stratified train-test split with fixed random seed (42)
- SMOTE applied only to training data to prevent data leakage
- StandardScaler fitted only on training data
- Consistent evaluation metrics: Accuracy, Recall, F1, ROC-AUC
When ALL genetic features (ATA risk level + variant encodings) are removed, the highest-accuracy model still achieves clinically useful performance:
| Model | Dataset | Baseline Accuracy | Biomarkers Only | Accuracy Drop |
|---|---|---|---|---|
| LightGBM | Expanded | 96.19% | 93.33% | -2.86% |
| XGBoost | Original | 83.33% | 73.33% | -10.00% |
| Random Forest | Original | 83.33% | 70.00% | -13.33% |
Interpretation: If the model were merely "restating consensus knowledge," removing genetic features should cause performance collapse. Instead, LightGBM retains strong performance without them, demonstrating that the model learns clinically meaningful patterns from biomarkers and presentation features.
For the screening-safe model, the most informative comparison is XGBoost on the original dataset:
| Configuration | Accuracy | Recall | F1 |
|---|---|---|---|
| Baseline | 83.33% | 100.00% | 0.8571 |
| Without ATA Risk Level | 76.67% | 86.67% | 0.7879 |
| Without Variant Encodings | 83.33% | 100.00% | 0.8571 |
| Without Any Genetic Features | 73.33% | 93.33% | 0.7778 |
| Without Calcitonin | 83.33% | 100.00% | 0.8571 |
| Without CEA | 90.00% | 100.00% | 0.9091 |
| Genetics Only | 90.00% | 100.00% | 0.9091 |
| Biomarkers Only | 73.33% | 93.33% | 0.7778 |
Interpretation: XGBoost maintains 100% recall across several reduced feature settings, but it loses recall once all genetic features are removed. This suggests that the screening-safe configuration depends more on retained genetic context than on individual biomarker inputs such as CEA or calcitonin.
The reviewer specifically cited ATA risk level as the source of circularity. The ablation results show a model-dependent effect:
| Model | Dataset | Baseline | Without ATA Risk | Change |
|---|---|---|---|---|
| LightGBM | Expanded | 96.19% | 95.24% | -0.95% |
| XGBoost | Original | 83.33% | 76.67% | -6.67% |
| Random Forest | Original | 83.33% | 80.00% | -3.33% |
| SVM | Original | 76.67% | 80.00% | +3.33% |
Interpretation: ATA risk level is not uniformly decisive. Its contribution varies by model and use case, which argues against the idea that performance is simply an artifact of ATA encoding.
The reviewer implied variant-specific encoding might encode cancer outcomes. The results show that one-hot variant features are often less important than broader genetic context:
| Model | Dataset | Baseline | Without Variants | Change |
|---|---|---|---|---|
| XGBoost | Original | 83.33% | 83.33% | 0.00% |
| LightGBM | Expanded | 96.19% | 94.29% | -1.90% |
| Random Forest | Original | 83.33% | 76.67% | -6.67% |
| SVM | Original | 76.67% | 80.00% | +3.33% |
Interpretation: Variant one-hot encodings are often less important than broader genetic context, especially in the screening-safe XGBoost configuration.
Clinical screening requires high recall (sensitivity), whereas triage prioritizes overall discrimination:
| Configuration | LightGBM Expanded Recall | XGBoost Original Recall |
|---|---|---|
| Baseline | 90.20% | 100.00% |
| Without Any Genetic Features | 82.35% | 93.33% |
| Without Calcitonin | 88.24% | 100.00% |
| Without CEA | 88.24% | 100.00% |
| Genetics Only | 84.31% | 100.00% |
Interpretation: The strongest screening result comes from XGBoost-original, which preserves zero-miss behavior under several ablations. The strongest accuracy result comes from LightGBM-expanded, which is more sensitive to removal of both genetic and biomarker features.
The findings are consistent with a model-dependent interpretation:
| Pattern | Screening-safe XGBoost | Accuracy-maximizing LightGBM |
|---|---|---|
| Full genetics required for peak performance | Yes | Helpful, but not essential |
| Variant dummies dispensable | Yes | Mostly yes |
| CEA helpful | No | Yes |
| Calcitonin helpful | Not required | Yes |
Finding: LightGBM on the expanded dataset still reaches 93.33% accuracy after all genetic features are removed. This directly contradicts the claim that genetic encoding is the only reason the model works.
Finding: The models do not reduce to a simple ATA lookup. In LightGBM-expanded, removing all genetic features still leaves a strong model; in XGBoost-original, removing variant dummies has no effect, while removing all genetics lowers recall.
Finding: The model generalizes using a mixture of genotype and phenotype signals. ATA risk level is neither universally dominant nor irrelevant; its effect depends on the model and task.
| Concern | Our Implementation | Verdict |
|---|---|---|
| Data leakage in splits | Train-test split before any processing | Valid |
| Feature scaling leakage | Scaler fitted only on training data | Valid |
| SMOTE applied correctly | Applied after split, training only | Valid |
| Stratified sampling | 80/20 split with stratification | Valid |
| Multiple model validation | 5 diverse algorithms tested | Valid |
| Reproducibility | Fixed random seed (42) | Valid |
Conclusion: The ablation study methodology is sound and results are reproducible.
-
The highest-accuracy model learns beyond ATA consensus encoding. Removing all genetic features lowers LightGBM-expanded accuracy from 96.19% to 93.33%, but does not cause collapse.
-
ATA risk level is not a universal source of circularity. Its effect is model-dependent and must be interpreted in context.
-
Variant-specific encodings are often dispensable. In XGBoost-original, removing variant one-hot encodings does not change performance.
-
The screening-safe model preserves clinical safety under several ablations. XGBoost-original maintains 100% recall without CEA, without calcitonin, and without variant one-hot encodings.
-
The strongest deployment conclusions come from two models: XGBoost-original for screening and LightGBM-expanded for maximum accuracy.
Removing calcitonin features has different effects in the two most relevant models:
LightGBM on Expanded Data:
| Model | Baseline | No Calcitonin | Change |
|---|---|---|---|
| LightGBM | 96.19% | 95.24% | -0.95% |
XGBoost on Original Data:
| Model | Baseline | No Calcitonin | Change |
|---|---|---|---|
| XGBoost | 83.33% | 83.33% | 0.00% |
Calcitonin remains directionally useful for the highest-accuracy model, but is not required for the strongest screening result.
This finding does not diminish calcitonin's clinical value:
- Calcitonin remains the gold-standard biomarker for MTC surveillance per ATA guidelines.
- Its modeling contribution depends on the algorithm and task.
- Feature ablations should be interpreted in the context of intended clinical use.
When interpreting feature contributions:
- Always compare the preferred screening model and the preferred accuracy model separately.
- Ablation studies are essential for identifying which signals are task-specific.
- Do not assume one feature ranking applies uniformly across all models.
| Configuration | Accuracy | Recall | F1 | ROC-AUC |
|---|---|---|---|---|
| Baseline | 83.33% | 100.00% | 0.857 | 0.916 |
| Without ATA Risk | 76.67% | 86.67% | 0.788 | 0.893 |
| Without Variants | 83.33% | 100.00% | 0.857 | 0.907 |
| No Genetics | 73.33% | 93.33% | 0.778 | 0.853 |
| No Calcitonin | 83.33% | 100.00% | 0.857 | 0.916 |
| No CEA | 90.00% | 100.00% | 0.909 | 0.938 |
| Genetics Only | 90.00% | 100.00% | 0.909 | 0.929 |
| Biomarkers Only | 73.33% | 93.33% | 0.778 | 0.853 |
Complete results for all 10 model-dataset combinations are available in:
results/ablation/{model}_{dataset}_ablation_results.txtresults/ablation/{model}_{dataset}_ablation_results.csv
Methodology: Systematic feature group removal with consistent train-test protocols.