sehamothman/AI-Malware-Detector

🛡️ AI Agent for Malware Detection (Static Analysis) 💻

✨ Project Overview

This project builds an AI agent using a Random Forest model to detect malware by analyzing executable files (without running them!). We focus on static analysis – inspecting file characteristics and structure.

Because large external datasets were difficult to access, this project is a proof-of-concept built on a small, custom-collected dataset of benign and dummy suspicious files. It demonstrates the core methodology of AI-driven static malware detection rather than production-level accuracy.

🚀 How It Works & Key Features

Our AI learns to spot malware by looking at a file's unique "fingerprint" (features) instead of its behavior.

  • Static Detection: Safe and fast analysis, no execution needed.
  • AI-Powered: Random Forest model for smart pattern recognition.
  • Manageable Data Handling: Demonstrates feature extraction and model training using a controlled, custom dataset.
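To make the idea of a static "fingerprint" concrete, here is a minimal sketch that derives two numeric features from a file's raw bytes without executing it. The helper name `byte_features` and the choice of features (size and Shannon entropy) are illustrative assumptions, not necessarily the features this project extracts.

```python
import math
from collections import Counter


def byte_features(data: bytes) -> dict:
    """Return simple static features computed from raw bytes (hypothetical helper)."""
    size = len(data)
    if size == 0:
        return {"size": 0, "entropy": 0.0}
    counts = Counter(data)
    # Shannon entropy in bits per byte: zero for constant data, 8 for uniformly spread bytes.
    entropy = sum(-(n / size) * math.log2(n / size) for n in counts.values())
    return {"size": size, "entropy": entropy}


# Constant data carries no information; a buffer with every byte value once is maximally spread.
low = byte_features(b"\x00" * 64)
high = byte_features(bytes(range(256)))
print(low["entropy"] < high["entropy"])
```

Packed or encrypted payloads tend to have high entropy, which is one reason entropy is a common static feature.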

📊 Dataset: Custom Samples

For this project, we utilize a custom dataset consisting of:

  • Benign Samples: Common Windows executable files (e.g., notepad.exe, calc.exe).
  • Suspicious Sample: The harmless eicar.com test file (used to simulate a suspicious executable).

Features are extracted from these files using the pefile library. This approach allows for a self-contained demonstration of the static analysis pipeline without requiring large, external malware datasets.
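For readers curious what header-level features look like, the sketch below reads a few COFF header fields directly from the bytes of a PE file using only the standard library. It is not the project's extractor (which uses pefile); the function names and the returned field names are assumptions made for illustration.

```python
import struct
from typing import Optional

PE_SIGNATURE = b"PE\x00\x00"
# COFF header fields: Machine, NumberOfSections, TimeDateStamp, PointerToSymbolTable,
# NumberOfSymbols, SizeOfOptionalHeader, Characteristics (20 bytes, little-endian).
COFF_FORMAT = "<HHIIIHH"


def read_coff_header(data: bytes) -> Optional[dict]:
    """Parse a few COFF header fields; return None when the bytes are not a PE file."""
    # A PE file starts with the DOS signature "MZ"; offset 0x3C stores the PE header offset.
    if len(data) < 0x40 or data[:2] != b"MZ":
        return None
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    coff_start = pe_offset + len(PE_SIGNATURE)
    if data[pe_offset:coff_start] != PE_SIGNATURE:
        return None
    if len(data) < coff_start + struct.calcsize(COFF_FORMAT):
        return None
    machine, n_sections, timestamp, _, _, opt_size, characteristics = struct.unpack_from(
        COFF_FORMAT, data, coff_start
    )
    return {
        "machine": machine,
        "number_of_sections": n_sections,
        "timestamp": timestamp,
        "size_of_optional_header": opt_size,
        "characteristics": characteristics,
    }


def make_minimal_pe(n_sections: int) -> bytes:
    """Build a synthetic header-only buffer for demonstration (not a loadable executable)."""
    dos = bytearray(0x40)
    dos[:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, 0x40)  # PE header begins right after the DOS header
    coff = struct.pack(COFF_FORMAT, 0x8664, n_sections, 0, 0, 0, 0xF0, 0x0022)
    return bytes(dos) + PE_SIGNATURE + coff


print(read_coff_header(make_minimal_pe(5)))
print(read_coff_header(b"plain text, not an executable"))
```

Files that lack the `MZ` and `PE` signatures (a plain DOS .com file, for instance) are reported as non-PE, so an extractor built on PE headers needs a fallback for such samples.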

🛠️ Quick Setup Guide

Get this project running in a few steps!

1. Prepare Your System 💻

  • Python (3.10 or 3.11 recommended): python.org (add it to PATH!). 🐍
  • 7-Zip (Windows): 7-zip.org (for general use, though not strictly needed for this small dataset).
  • Visual C++ Redistributable (x64): Microsoft Learn. Install & restart.
  • Disk Space: A few GBs of free space are sufficient for this custom dataset. 💾

2. Get Code & Setup Environment 📂

# Clone this repo
git clone https://github.com/U210709718/AI-Agents-for-malware-analysis-and-detection.git
cd AI-Agents-for-malware-analysis-and-detection

# Setup virtual environment & install libraries
py -m venv .venv
.\.venv\Scripts\activate # Windows
pip install -r requirements.txt # (You'll create this file)

3. Prepare Custom Data & Extract Features 🔬

  1. Create Data Folders:

    mkdir -p data/my_test_data/benign_samples
    mkdir -p data/my_test_data/suspicious_samples
  2. Collect Samples:

    • Benign: Copy notepad.exe, calc.exe, mspaint.exe from C:\Windows\System32\ into data/my_test_data/benign_samples/.
    • Suspicious: Manually create eicar.com in data/my_test_data/suspicious_samples/.
      • Open Notepad, paste X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H* exactly.
      • Save as eicar.com, select "All Files (*.*)" for type. (Your antivirus might quarantine it; restore it if needed).
  3. Extract Features: This step uses custom_feature_extractor.py to get numerical features from your samples.

    • Ensure custom_feature_extractor.py is present in the src/ folder.
    • With your virtual environment active, run from the project root (AI-Agents-for-malware-analysis-and-detection/):
    py src/custom_feature_extractor.py

    This creates extracted_features.csv in data/my_test_data/.
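The extraction step can be pictured roughly as follows: walk the two sample folders, compute features per file, attach a label, and write one CSV row per sample. This is a sketch under stated assumptions, not the contents of custom_feature_extractor.py: the function names, the column names, and the stand-in features (size and byte entropy instead of pefile fields) are all hypothetical.

```python
import csv
import math
from collections import Counter
from pathlib import Path


def file_features(path: Path) -> dict:
    """Stand-in features computed from raw bytes (the real extractor uses pefile)."""
    data = path.read_bytes()
    size = len(data)
    entropy = (
        sum(-(n / size) * math.log2(n / size) for n in Counter(data).values())
        if size
        else 0.0
    )
    return {"file_name": path.name, "size": size, "entropy": round(entropy, 4)}


def build_feature_table(data_dir: Path, out_csv: Path) -> int:
    """Write one CSV row per sample; label 0 = benign, 1 = suspicious. Returns the row count."""
    folders = {"benign_samples": 0, "suspicious_samples": 1}
    rows = []
    for folder, label in folders.items():
        for path in sorted((data_dir / folder).glob("*")):
            if path.is_file():
                rows.append({**file_features(path), "label": label})
    with out_csv.open("w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["file_name", "size", "entropy", "label"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

With the folder layout from step 1, a call such as `build_feature_table(Path("data/my_test_data"), Path("data/my_test_data/extracted_features.csv"))` would produce a table with one labeled row per sample.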

4. Run the Detector! ▶️

Once features are extracted:

cd src
py malware_detector.py

This script loads your extracted_features.csv, scales the features, trains a Random Forest model, evaluates it, and saves your trained model (.pkl files) into the models/ directory.

(Training on this small dataset is very fast!)
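The steps described above (scale, train, evaluate, save) can be sketched with scikit-learn as shown below. This is an illustration on synthetic data, not the code in malware_detector.py: the function names, the hyperparameters, the 70/30 split, and the file names `scaler.pkl` and `model.pkl` are assumptions.

```python
import pickle
from pathlib import Path

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def train_detector(X, y, model_dir: Path, seed: int = 42):
    """Scale features, fit a Random Forest, evaluate on a held-out split, save both objects."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )
    scaler = StandardScaler().fit(X_train)  # fit on training data only to avoid leakage
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(scaler.transform(X_train), y_train)

    y_pred = model.predict(scaler.transform(X_test))
    report = {
        "accuracy": accuracy_score(y_test, y_pred),
        "confusion_matrix": confusion_matrix(y_test, y_pred, labels=[0, 1]),
    }

    model_dir.mkdir(parents=True, exist_ok=True)
    for name, obj in (("scaler.pkl", scaler), ("model.pkl", model)):  # hypothetical file names
        with (model_dir / name).open("wb") as fh:
            pickle.dump(obj, fh)
    return model, scaler, report


def make_synthetic_data(n_per_class: int = 20, seed: int = 0):
    """Two well-separated Gaussian clusters standing in for benign (0) and suspicious (1)."""
    rng = np.random.default_rng(seed)
    benign = rng.normal(loc=0.0, scale=1.0, size=(n_per_class, 3))
    suspicious = rng.normal(loc=6.0, scale=1.0, size=(n_per_class, 3))
    X = np.vstack([benign, suspicious])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y
```

The scaler is saved alongside the model because any file analyzed later must be transformed with the same scaling the model was trained on.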

📈 Results

(This section shows the output of malware_detector.py. Because the dataset is very small, some metrics may be exactly 0.0 or 1.0, and a UserWarning about a single label in y_test is expected.)

The Random Forest model achieved the following performance on the custom test set:

  • Accuracy: 1.0000
  • Precision: 1.0000
  • Recall: 1.0000
  • F1-Score: 1.0000

Confusion Matrix:

[[2 0]
 [0 1]]
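The four scores follow directly from the confusion matrix. The sketch below recomputes them, assuming the scikit-learn layout (rows are true labels, columns are predicted labels, class order benign then suspicious); the helper name `metrics_from_confusion` is hypothetical.

```python
def metrics_from_confusion(cm):
    """cm is [[TN, FP], [FN, TP]] with class 1 (suspicious) as the positive class."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


print(metrics_from_confusion([[2, 0], [0, 1]]))
```

The matrix sums to three test samples, so a single misclassification would move accuracy by a full third. The perfect scores should be read with that in mind.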

🎥 Video Demo: https://youtu.be/bZv6dZoknXk

User Interface

The user interface is built with Streamlit in Python. Screenshots of the analysis results: [Demo Screenshot] [Demo Screenshot]

💡 Future Enhancements

  • Large Dataset Integration: Adapt the project to process and train on larger, real-world datasets (like EMBER 2018) once accessible, utilizing memory-efficient loading (e.g., np.memmap).
  • Live PE File Analysis: Implement a module to extract features from any new PE file and use the trained model for real-time prediction.
  • More ML Models: Explore other machine learning algorithms (e.g., Gradient Boosting, SVM, simple Neural Networks).
  • Risk Prediction: Extend the model to predict future risks.
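As a rough illustration of the memory-efficient loading mentioned in the first item, the sketch below stores a feature matrix on disk and computes column means chunk by chunk through `np.memmap`, so the full matrix is not loaded into RAM at once. The raw float32 layout, the shapes, and the function names are assumptions; this README does not describe the on-disk format of EMBER or any other dataset.

```python
import numpy as np


def write_features(path, n_rows: int, n_cols: int) -> None:
    """Create a demo feature matrix on disk as raw float32 values (hypothetical layout)."""
    mm = np.memmap(path, dtype=np.float32, mode="w+", shape=(n_rows, n_cols))
    mm[:] = np.arange(n_rows * n_cols, dtype=np.float32).reshape(n_rows, n_cols)
    mm.flush()
    del mm  # release the mapping


def column_means(path, n_rows: int, n_cols: int, chunk_rows: int = 1000) -> np.ndarray:
    """Average each feature column while reading only chunk_rows rows at a time."""
    mm = np.memmap(path, dtype=np.float32, mode="r", shape=(n_rows, n_cols))
    totals = np.zeros(n_cols, dtype=np.float64)
    for start in range(0, n_rows, chunk_rows):
        totals += mm[start:start + chunk_rows].sum(axis=0, dtype=np.float64)
    return totals / n_rows
```

The same chunked pattern could feed statistics such as the means and variances needed for feature scaling before training on a dataset too large for memory.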

About

AI Agent for Static Malware Analysis using Random Forest. CENG 3544 | Computer and Network Security final project, Spring 2025 | MSKU.
