This project builds an AI agent using a Random Forest model to detect malware by analyzing executable files (without running them!). We focus on static analysis – inspecting file characteristics and structure.
Because large external malware datasets were difficult to access, this project is a proof of concept built on a small, custom-collected dataset of benign and dummy suspicious files. It demonstrates the core methodology of AI-driven static malware detection.
Our AI learns to spot malware by looking at a file's unique "fingerprint" (features) instead of its behavior.
- Static Detection: Safe and fast analysis, no execution needed.
- AI-Powered: Random Forest model for smart pattern recognition.
- Manageable Data Handling: Demonstrates feature extraction and model training using a controlled, custom dataset.
For this project, we utilize a custom dataset consisting of:
- Benign Samples: Common Windows executable files (e.g., notepad.exe, calc.exe).
- Suspicious Sample: The harmless eicar.com test file (used to simulate a suspicious executable).
Features are extracted from these files using the pefile library. This approach allows for a self-contained demonstration of the static analysis pipeline without requiring large, external malware datasets.
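To make the idea of a static "fingerprint" concrete, here is a minimal sketch that reads a few header fields straight from a file's bytes without executing it. It uses only the standard library's `struct` module rather than pefile, and the function names, the chosen fields, and the tiny synthetic header built for the demonstration are illustrative assumptions, not the project's actual extractor. Inputs that are not valid PE files (eicar.com, for instance, is a plain DOS .com file) are reported as non-PE instead of raising an error.

```python
import struct


def pe_header_features(data: bytes) -> dict:
    """Return a few static header features, or mark the input as non-PE.

    Hypothetical helper for illustration; offsets follow the PE/COFF layout.
    """
    features = {"size": len(data), "is_pe": 0, "machine": 0,
                "num_sections": 0, "timestamp": 0}
    # A PE file starts with the DOS magic "MZ"; e_lfanew lives at offset 0x3C.
    if len(data) < 0x40 or data[:2] != b"MZ":
        return features
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    # The 20-byte COFF header follows the 4-byte "PE\0\0" signature.
    if pe_offset + 24 > len(data) or data[pe_offset:pe_offset + 4] != b"PE\0\0":
        return features
    machine, num_sections, timestamp = struct.unpack_from("<HHI", data, pe_offset + 4)
    features.update(is_pe=1, machine=machine,
                    num_sections=num_sections, timestamp=timestamp)
    return features


def make_synthetic_pe(num_sections: int = 3) -> bytes:
    """Build a minimal fake DOS + COFF header for demonstration only."""
    dos = bytearray(0x40)
    dos[:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, 0x40)  # e_lfanew points just past the DOS header
    # Machine, NumberOfSections, TimeDateStamp, PointerToSymbolTable,
    # NumberOfSymbols, SizeOfOptionalHeader, Characteristics
    coff = struct.pack("<HHIIIHH", 0x8664, num_sections, 0, 0, 0, 0, 0)
    return bytes(dos) + b"PE\0\0" + coff
```

Calling `pe_header_features(path.read_bytes())` on each sample would yield one numeric row per file; a library such as pefile exposes many more fields than the handful shown here.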
Get this project running in a few steps!
- Python (3.10 or 3.11 recommended): python.org (add it to PATH during installation). 🐍
- 7-Zip (Windows): 7-zip.org (for general use, though not strictly needed for this small dataset).
- Visual C++ Redistributable (x64): Microsoft Learn. Install & restart.
- Disk Space: A few GBs of free space are sufficient for this custom dataset. 💾
# Clone this repo
git clone https://github.com/U210709718/AI-Agents-for-malware-analysis-and-detection.git
cd AI-Agents-for-malware-analysis-and-detection
# Setup virtual environment & install libraries
py -m venv .venv
.\.venv\Scripts\activate # Windows
pip install -r requirements.txt # (You'll create this file)
- Create Data Folders:
  mkdir -p data/my_test_data/benign_samples
  mkdir -p data/my_test_data/suspicious_samples
- Collect Samples:
  - Benign: Copy notepad.exe, calc.exe, and mspaint.exe from C:\Windows\System32\ into data/my_test_data/benign_samples/.
  - Suspicious: Manually create eicar.com in data/my_test_data/suspicious_samples/.
    - Open Notepad and paste X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H* exactly.
    - Save as eicar.com, selecting "All Files (*.*)" as the file type. (Your antivirus might quarantine it; restore it if needed.)
- Extract Features: This step uses custom_feature_extractor.py to get numerical features from your samples.
  - Ensure custom_feature_extractor.py is in src/.
  - With your virtual environment active, run from the project root (AI-Agents-for-malware-analysis-and-detection/):
    py src/custom_feature_extractor.py
    This creates extracted_features.csv in data/my_test_data/.
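The extraction step above can be pictured as a loop over the two sample folders that writes one labeled row per file. The sketch below is not the repository's custom_feature_extractor.py: it substitutes two simple byte-level features (file size and Shannon entropy) for the pefile-based features so that it runs with the standard library alone, and the column names and label encoding (0 = benign, 1 = suspicious) are assumptions.

```python
import csv
import math
from collections import Counter
from pathlib import Path


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())


def build_feature_csv(data_root: Path, out_csv: Path) -> int:
    """Write one row per sample file and return the number of rows written."""
    # Assumed label encoding: folder name -> class label.
    labels = {"benign_samples": 0, "suspicious_samples": 1}
    rows = []
    for folder, label in labels.items():
        for path in sorted((data_root / folder).glob("*")):
            if path.is_file():
                data = path.read_bytes()
                rows.append([path.name, len(data), round(byte_entropy(data), 4), label])
    with out_csv.open("w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["filename", "size", "entropy", "label"])
        writer.writerows(rows)
    return len(rows)
```

From the project root, a call such as `build_feature_csv(Path("data/my_test_data"), Path("data/my_test_data/extracted_features.csv"))` would mirror the folder layout described above.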
Once features are extracted:
cd src
py malware_detector.py
This script loads your extracted_features.csv, scales the features, trains a Random Forest model, evaluates it, and saves your trained model (.pkl files) into the models/ directory.
(Training on this small dataset is very fast!)
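The load, scale, train, evaluate, and save sequence described above can be sketched as follows. This is an assumed outline, not the contents of malware_detector.py: it presumes scikit-learn and NumPy are installed, uses a small synthetic feature matrix in place of extracted_features.csv, and the function name and hyperparameters (`n_estimators`, `test_size`, the seed) are illustrative choices.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def train_detector(X, y, seed=42):
    """Scale features, fit a Random Forest, and return (scaler, model, test accuracy)."""
    # stratify keeps both classes in the test split when each class has enough samples.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    scaler = StandardScaler().fit(X_train)  # fit on the training split only
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(scaler.transform(X_train), y_train)
    accuracy = accuracy_score(y_test, model.predict(scaler.transform(X_test)))
    return scaler, model, accuracy


# Synthetic stand-in data: two well-separated clusters, NOT real malware features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (20, 3)), rng.normal(5.0, 1.0, (20, 3))])
y = np.array([0] * 20 + [1] * 20)

scaler, model, accuracy = train_detector(X, y)

# Serialize both objects; the scaler is needed again at prediction time.
scaler_bytes = pickle.dumps(scaler)
model_bytes = pickle.dumps(model)
```

Writing `scaler_bytes` and `model_bytes` to files under models/ would produce .pkl files like those the script saves. With only a handful of real samples, a split like this can leave a single class in the test set, so the scores say little about real-world performance.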
(This section is taken from the output of malware_detector.py. Because the dataset is very small, some metrics might be 0.0 or 1.0, and a UserWarning about a single label in y_test is expected and normal.)
The Random Forest model achieved the following performance on the custom test set:
- Accuracy: 1.0000
- Precision: 1.0000
- Recall: 1.0000
- F1-Score: 1.0000
Confusion Matrix:
[[2 0]
 [0 1]]
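The reported metrics follow directly from the confusion matrix. As a worked check, the sketch below recomputes them from the four cells, assuming scikit-learn's layout (rows are true labels, columns are predicted labels, with class 0 = benign listed first); the helper name is illustrative. It also shows why a metric falls back to 0.0 when its denominator is empty, which can happen on a test set this small.

```python
def metrics_from_confusion(matrix):
    """Compute accuracy, precision, recall, and F1 for the positive class (label 1)."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp

    def safe_div(num, den):
        # Return 0.0 when the denominator is zero (e.g., no positive predictions).
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, total),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


print(metrics_from_confusion([[2, 0], [0, 1]]))
# {'accuracy': 1.0, 'precision': 1.0, 'recall': 1.0, 'f1': 1.0}
```

With two benign files and one suspicious file all classified correctly, every ratio is 3/3 or 1/1, which is why all four scores are 1.0000.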
🎥 Video Demo: (https://youtu.be/bZv6dZoknXk)
The interface is built with Streamlit in Python; these are screenshots of the analysis results:

- Large Dataset Integration: Adapt the project to process and train on larger, real-world datasets (like EMBER 2018) once accessible, utilizing memory-efficient loading (e.g., np.memmap).
- Live PE File Analysis: Implement a module to extract features from any new PE file and use the trained model for real-time prediction.
- More ML Models: Explore other machine learning algorithms (e.g., Gradient Boosting, SVM, simple Neural Networks).
- Future Risk Prediction: Extend the model to predict future risks.
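
For the large-dataset item, the following sketch shows the general shape of memory-mapped loading with `np.memmap`: the feature matrix stays on disk and is read in batches rather than loaded all at once. The file name, array shape, dtype, and batch size are placeholder assumptions and do not reflect the actual EMBER file layout.

```python
import os
import tempfile

import numpy as np

n_rows, n_cols = 1000, 8  # placeholder shape, not EMBER's real dimensions
path = os.path.join(tempfile.mkdtemp(), "features.dat")

# Write a dummy float32 matrix to disk through a writable memory map.
writer = np.memmap(path, dtype=np.float32, mode="w+", shape=(n_rows, n_cols))
writer[:] = np.arange(n_rows * n_cols, dtype=np.float32).reshape(n_rows, n_cols)
writer.flush()
del writer

# Reopen read-only; only the slices that are touched are paged into memory.
features = np.memmap(path, dtype=np.float32, mode="r", shape=(n_rows, n_cols))


def iter_batches(array, batch_size):
    """Yield consecutive row batches copied into ordinary in-memory arrays."""
    for start in range(0, array.shape[0], batch_size):
        yield np.array(array[start:start + batch_size])


# Example of a streaming computation: per-column sums accumulated batch by batch.
column_sums = np.zeros(n_cols, dtype=np.float64)
for batch in iter_batches(features, batch_size=256):
    column_sums += batch.sum(axis=0, dtype=np.float64)
```

The same batching pattern could feed incremental statistics or a model that supports partial fitting; a standard Random Forest would still need the training subset it is fitted on to be in memory.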