This project provides machine learning and NLP pipelines for detecting domains produced by Domain Generation Algorithms (DGAs) in DNS traffic. Malicious actors use DGAs to periodically generate rendezvous domain names, which makes botnets resilient to takedowns. The system classifies each domain as benign or malicious based on the structure of its name.
- Modular Features: Engineers token-based (ngram lengths) and raw byte-based features from domain strings.
- Ensemble Model: Uses a CatBoostClassifier ensemble model (gradient-boosted decision trees).
- Automated Pipeline: End-to-end functionality (data loading, processing, model training, evaluation) is provided in standard scripts.
The current best-performing model is catboost.0.977.26_ensemble.model (trained for 1000 iterations).
It leverages two sets of engineered features:
- Ngram Lengths: The tokenizer extracts the lengths of up to 14 tokens per domain.
- Raw Bytes: 26 features holding the ASCII byte values of the domain string, padded or cropped symmetrically to a fixed length of 26.
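The two feature sets above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code. The tokenizer rule (runs of letters, runs of digits, or single other characters), the zero padding value, and the function names `token_length_features`, `raw_byte_features`, and `featurize` are all assumptions made for the example. "Symmetric" is interpreted here as padding or cropping equally on both ends.

```python
import re

MAX_TOKENS = 14  # from the README: up to 14 token lengths
N_BYTES = 26     # from the README: 26 raw-byte features

# Hypothetical tokenizer: runs of letters, runs of digits, or single other characters.
_TOKEN_RE = re.compile(r"[a-z]+|[0-9]+|[^a-z0-9]")


def token_length_features(domain: str) -> list[int]:
    """Return the lengths of the first MAX_TOKENS tokens, zero-padded."""
    tokens = _TOKEN_RE.findall(domain.lower())
    lengths = [len(t) for t in tokens][:MAX_TOKENS]
    return lengths + [0] * (MAX_TOKENS - len(lengths))


def raw_byte_features(domain: str) -> list[int]:
    """Return exactly N_BYTES byte values, cropped or zero-padded on both ends."""
    data = domain.lower().encode("ascii", errors="replace")
    extra = len(data) - N_BYTES
    if extra > 0:
        # Too long: drop roughly half of the surplus from each end.
        start = extra // 2
        return list(data[start:start + N_BYTES])
    # Too short (or exact): split the padding between both ends.
    left = (-extra) // 2
    right = (-extra) - left
    return [0] * left + list(data) + [0] * right


def featurize(domain: str) -> list[int]:
    """Concatenate both feature sets into one row of 14 + 26 = 40 values."""
    return token_length_features(domain) + raw_byte_features(domain)
```

Under these assumptions, every domain maps to a fixed-width row of 40 integers, which is the shape a tabular classifier expects.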
You can launch the training and evaluation workflow by running the main Python script from the root directory:

    python src/train.py

Note: The original Jupyter notebook containing earlier research is available in experiments/DGA_detection.ipynb.
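For orientation, a train-and-evaluate step with CatBoost might look something like the sketch below. It is not the contents of src/train.py, which this README does not show. The helper names `split_indices` and `train_and_evaluate`, the 80/20 split, and the use of accuracy as the metric are assumptions for illustration. Only the 1000-iteration setting comes from the text above.

```python
import random


def split_indices(n_rows, test_fraction=0.2, seed=0):
    """Shuffle row indices and split them into (train, test) lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(n_rows * test_fraction))
    return indices[n_test:], indices[:n_test]


def train_and_evaluate(X, y, iterations=1000, seed=0):
    """Fit a CatBoostClassifier and return (model, held-out accuracy).

    X is a list of numeric feature rows, y a list of 0/1 labels.
    """
    # Imported lazily so split_indices stays usable without CatBoost installed.
    from catboost import CatBoostClassifier

    train_idx, test_idx = split_indices(len(X), seed=seed)
    X_train = [X[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_test = [y[i] for i in test_idx]

    model = CatBoostClassifier(iterations=iterations, random_seed=seed, verbose=False)
    # eval_set lets CatBoost track the held-out metric during training.
    model.fit(X_train, y_train, eval_set=(X_test, y_test))
    accuracy = model.score(X_test, y_test)  # mean accuracy on the held-out rows
    return model, accuracy
```

A fixed seed keeps the split and the model reproducible between runs, which makes it easier to compare changes to the features.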
For an in-depth explanation of the logic behind this project, please refer to the markdown files in the docs/ folder:
- Experiments: Background context and evolution of the models.
- Tools: The specific tech stack and libraries used.
- Workflows: The core architecture and execution logic.
- Evaluations: Key metrics and rationale behind modeling strategies.