Ablation Study Report: Feature Contribution Analysis

Response to Methodological Concerns Regarding ATA Risk Level Feature Encoding

Executive Summary

This report presents findings from a systematic ablation study conducted to address reviewer concerns regarding the potential circularity of using ATA risk level and RET variant features in our MTC prediction model. The concern raised was that including these features constitutes "restating consensus knowledge" rather than genuine prediction.

Key Finding: The conclusions are model-dependent. LightGBM on the expanded dataset still achieves 93.33% accuracy after removing all genetic features, while XGBoost on the original dataset preserves 100.00% recall even after removing CEA and variant one-hot encodings.

Reviewer Concern

"The model already encodes cancer (the outcome) into both training and testing. This is a foundational validity problem and cannot be mitigated by cross-validation or holdout splitting. Consequently, reported performance metrics do not reflect true predictive capacity. Specifically, RET Variant and ATA Risk Level Encode Cancer a Priori. Including ATA risk as a feature means the model is given: 'This mutation usually causes cancer' and asked to predict whether cancer occurred. This is not prediction — it is restating consensus knowledge. The model cannot generalize beyond what is already encoded."

Methodology

Ablation Study Design

We conducted a systematic feature ablation study across:

5 machine learning models: Logistic Regression, Random Forest, XGBoost, LightGBM, SVM
2 datasets: Original (149 patients) and Expanded (1,047 samples with synthetic augmentation)
8 ablation configurations: Systematically removing feature groups to isolate contributions

Ablation Configurations

Configuration	Features Removed	Purpose
Baseline	None	Full model reference
No ATA Risk Level	`ret_risk_level` + interaction terms	Test ATA risk contribution
No Variant Encodings	All `variant_*` one-hot features	Test variant encoding value
No Genetics	All genetic features	Critical test: pure biomarker prediction
No Calcitonin	`calcitonin_*` features	Test biomarker vs genetic reliance
No CEA	`cea_level_numeric`	Address imputation concerns
Genetics Only	Remove all biomarkers	Critical test: is prediction "just consensus"?
Biomarkers Only	Remove all genetics	Equivalent to No Genetics

Statistical Approach

80/20 stratified train-test split with fixed random seed (42)
SMOTE applied only to training data to prevent data leakage
StandardScaler fitted only on training data
Consistent evaluation metrics: Accuracy, Recall, F1, ROC-AUC

Results

Finding 1: The Highest-Accuracy Model Remains Strong Without Full Genetic Context

When ALL genetic features (ATA risk level + variant encodings) are removed, the highest-accuracy model still achieves clinically useful performance:

Model	Dataset	Baseline Accuracy	Biomarkers Only	Accuracy Drop
LightGBM	Expanded	96.19%	93.33%	-2.86%
XGBoost	Original	83.33%	73.33%	-10.00%
Random Forest	Original	83.33%	70.00%	-13.33%

Interpretation: If the model were merely "restating consensus knowledge," removing genetic features should cause performance collapse. Instead, LightGBM retains strong performance without them, demonstrating that the model learns clinically meaningful patterns from biomarkers and presentation features.

Finding 2: The Screening-Safe Model Behaves Differently

For the screening-safe model, the most informative comparison is XGBoost on the original dataset:

Configuration	Accuracy	Recall	F1
Baseline	83.33%	100.00%	0.8571
Without ATA Risk Level	76.67%	86.67%	0.7879
Without Variant Encodings	83.33%	100.00%	0.8571
Without Any Genetic Features	73.33%	93.33%	0.7778
Without Calcitonin	83.33%	100.00%	0.8571
Without CEA	90.00%	100.00%	0.9091
Genetics Only	90.00%	100.00%	0.9091
Biomarkers Only	73.33%	93.33%	0.7778

Interpretation: XGBoost maintains 100% recall across several reduced feature settings, but it loses recall once all genetic features are removed. This suggests that the screening-safe configuration depends more on retained genetic context than on individual biomarker inputs such as CEA or calcitonin.

Finding 3: ATA Risk Level Has Model-Dependent Impact

The reviewer specifically cited ATA risk level as the source of circularity. The ablation results show a model-dependent effect:

Model	Dataset	Baseline	Without ATA Risk	Change
LightGBM	Expanded	96.19%	95.24%	-0.95%
XGBoost	Original	83.33%	76.67%	-6.67%
Random Forest	Original	83.33%	80.00%	-3.33%
SVM	Original	76.67%	80.00%	+3.33%

Interpretation: ATA risk level is not uniformly decisive. Its contribution varies by model and use case, which argues against the idea that performance is simply an artifact of ATA encoding.

Finding 4: Variant One-Hot Encodings Are More Dispensable Than Full Genetic Context

The reviewer implied variant-specific encoding might encode cancer outcomes. The results show that one-hot variant features are often less important than broader genetic context:

Model	Dataset	Baseline	Without Variants	Change
XGBoost	Original	83.33%	83.33%	0.00%
LightGBM	Expanded	96.19%	94.29%	-1.90%
Random Forest	Original	83.33%	76.67%	-6.67%
SVM	Original	76.67%	80.00%	+3.33%

Interpretation: Variant one-hot encodings are often less important than broader genetic context, especially in the screening-safe XGBoost configuration.

Finding 5: Recall Priorities Differ Between Screening and Triage

Clinical screening requires high recall (sensitivity), whereas triage prioritizes overall discrimination:

Configuration	LightGBM Expanded Recall	XGBoost Original Recall
Baseline	90.20%	100.00%
Without Any Genetic Features	82.35%	93.33%
Without Calcitonin	88.24%	100.00%
Without CEA	88.24%	100.00%
Genetics Only	84.31%	100.00%

Interpretation: The strongest screening result comes from XGBoost-original, which preserves zero-miss behavior under several ablations. The strongest accuracy result comes from LightGBM-expanded, which is more sensitive to removal of both genetic and biomarker features.

Cross-Model Consistency

The findings are consistent with a model-dependent interpretation:

Pattern	Screening-safe XGBoost	Accuracy-maximizing LightGBM
Full genetics required for peak performance	Yes	Helpful, but not essential
Variant dummies dispensable	Yes	Mostly yes
CEA helpful	No	Yes
Calcitonin helpful	Not required	Yes

Response to Reviewer Critique

Claim: "The model already encodes cancer into both training and testing"

Finding: LightGBM on the expanded dataset still reaches 93.33% accuracy after all genetic features are removed. This directly contradicts the claim that genetic encoding is the only reason the model works.

Claim: "This is not prediction — it is restating consensus knowledge"

Finding: The models do not reduce to a simple ATA lookup. In LightGBM-expanded, removing all genetic features still leaves a strong model; in XGBoost-original, removing variant dummies has no effect, while removing all genetics lowers recall.

Claim: "The model cannot generalize beyond what is already encoded"

Finding: The model generalizes using a mixture of genotype and phenotype signals. ATA risk level is neither universally dominant nor irrelevant; its effect depends on the model and task.

Methodological Validity Assessment

Is Our Approach Correct?

Concern	Our Implementation	Verdict
Data leakage in splits	Train-test split before any processing	Valid
Feature scaling leakage	Scaler fitted only on training data	Valid
SMOTE applied correctly	Applied after split, training only	Valid
Stratified sampling	80/20 split with stratification	Valid
Multiple model validation	5 diverse algorithms tested	Valid
Reproducibility	Fixed random seed (42)	Valid

Conclusion: The ablation study methodology is sound and results are reproducible.

Conclusions

The highest-accuracy model learns beyond ATA consensus encoding. Removing all genetic features lowers LightGBM-expanded accuracy from 96.19% to 93.33%, but does not cause collapse.
ATA risk level is not a universal source of circularity. Its effect is model-dependent and must be interpreted in context.
Variant-specific encodings are often dispensable. In XGBoost-original, removing variant one-hot encodings does not change performance.
The screening-safe model preserves clinical safety under several ablations. XGBoost-original maintains 100% recall without CEA, without calcitonin, and without variant one-hot encodings.
The strongest deployment conclusions come from two models: XGBoost-original for screening and LightGBM-expanded for maximum accuracy.

Finding 6: Calcitonin Feature Behavior Differs by Model

Observation

Removing calcitonin features has different effects in the two most relevant models:

LightGBM on Expanded Data:

Model	Baseline	No Calcitonin	Change
LightGBM	96.19%	95.24%	-0.95%

XGBoost on Original Data:

Model	Baseline	No Calcitonin	Change
XGBoost	83.33%	83.33%	0.00%

Interpretation

Calcitonin remains directionally useful for the highest-accuracy model, but is not required for the strongest screening result.

Clinical Implications

This finding does not diminish calcitonin's clinical value:

Calcitonin remains the gold-standard biomarker for MTC surveillance per ATA guidelines.
Its modeling contribution depends on the algorithm and task.
Feature ablations should be interpreted in the context of intended clinical use.

Methodological Takeaway

When interpreting feature contributions:

Always compare the preferred screening model and the preferred accuracy model separately.
Ablation studies are essential for identifying which signals are task-specific.
Do not assume one feature ranking applies uniformly across all models.

Appendix: Raw Results by Model

XGBoost (Original Dataset)

Configuration	Accuracy	Recall	F1	ROC-AUC
Baseline	83.33%	100.00%	0.857	0.916
Without ATA Risk	76.67%	86.67%	0.788	0.893
Without Variants	83.33%	100.00%	0.857	0.907
No Genetics	73.33%	93.33%	0.778	0.853
No Calcitonin	83.33%	100.00%	0.857	0.916
No CEA	90.00%	100.00%	0.909	0.938
Genetics Only	90.00%	100.00%	0.909	0.929
Biomarkers Only	73.33%	93.33%	0.778	0.853

Full Results

Complete results for all 10 model-dataset combinations are available in:

results/ablation/{model}_{dataset}_ablation_results.txt
results/ablation/{model}_{dataset}_ablation_results.csv

Methodology: Systematic feature group removal with consistent train-test protocols.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ablation Study Report: Feature Contribution Analysis

Response to Methodological Concerns Regarding ATA Risk Level Feature Encoding

Executive Summary

Reviewer Concern

Methodology

Ablation Study Design

Ablation Configurations

Statistical Approach

Results

Finding 1: The Highest-Accuracy Model Remains Strong Without Full Genetic Context

Finding 2: The Screening-Safe Model Behaves Differently

Finding 3: ATA Risk Level Has Model-Dependent Impact

Finding 4: Variant One-Hot Encodings Are More Dispensable Than Full Genetic Context

Finding 5: Recall Priorities Differ Between Screening and Triage

Cross-Model Consistency

Response to Reviewer Critique

Claim: "The model already encodes cancer into both training and testing"

Claim: "This is not prediction — it is restating consensus knowledge"

Claim: "The model cannot generalize beyond what is already encoded"

Methodological Validity Assessment

Is Our Approach Correct?

Conclusions

Finding 6: Calcitonin Feature Behavior Differs by Model

Observation

Interpretation

Clinical Implications

Methodological Takeaway

Appendix: Raw Results by Model

XGBoost (Original Dataset)

Full Results

FilesExpand file tree

ablation_feature_contribution_analysis.md

Latest commit

History

ablation_feature_contribution_analysis.md

File metadata and controls

Ablation Study Report: Feature Contribution Analysis

Response to Methodological Concerns Regarding ATA Risk Level Feature Encoding

Executive Summary

Reviewer Concern

Methodology

Ablation Study Design

Ablation Configurations

Statistical Approach

Results

Finding 1: The Highest-Accuracy Model Remains Strong Without Full Genetic Context

Finding 2: The Screening-Safe Model Behaves Differently

Finding 3: ATA Risk Level Has Model-Dependent Impact

Finding 4: Variant One-Hot Encodings Are More Dispensable Than Full Genetic Context

Finding 5: Recall Priorities Differ Between Screening and Triage

Cross-Model Consistency

Response to Reviewer Critique

Claim: "The model already encodes cancer into both training and testing"

Claim: "This is not prediction — it is restating consensus knowledge"

Claim: "The model cannot generalize beyond what is already encoded"

Methodological Validity Assessment

Is Our Approach Correct?

Conclusions

Finding 6: Calcitonin Feature Behavior Differs by Model

Observation

Interpretation

Clinical Implications

Methodological Takeaway

Appendix: Raw Results by Model

XGBoost (Original Dataset)

Full Results