This project demonstrates optimized inference for large language models using Intel's Neural Processing Unit (NPU) via IPEX-LLM.
- Optimized Inference: Leverages the Intel NPU for efficient LLM inference.
- Modular Design: Core logic is encapsulated in `NPUInferenceEngine` for easy integration.
- Hardware Verification: Includes tools to verify NPU availability.
- Example Applications: Includes a summarization example.
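To illustrate the intended integration surface, here is a hypothetical usage sketch of `NPUInferenceEngine`. The class body below is a stand-in written for this README; the constructor arguments, method names, and return values are assumptions, and the real class in this repository may differ.

```python
class NPUInferenceEngine:
    """Illustrative stand-in for the project's NPUInferenceEngine.

    The real engine would load the model with IPEX-LLM, optimize it
    for the Intel NPU, and cache the optimized weights on disk.
    """

    def __init__(self, model_path: str, save_directory: str = "./model_weights"):
        self.model_path = model_path          # Hugging Face ID or local path
        self.save_directory = save_directory  # where optimized weights are cached

    def generate(self, prompt: str, n_predict: int = 128) -> str:
        # Placeholder: the real implementation runs NPU inference here.
        return f"<generated up to {n_predict} tokens for: {prompt!r}>"


# Hypothetical integration: construct once, call generate() as needed.
engine = NPUInferenceEngine("Qwen/Qwen2.5-0.5B-Instruct")
output = engine.generate("What is life?")
```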
- Create and activate a virtual environment:

  ```
  uv venv .env
  .env\Scripts\activate
  ```

- Install dependencies:

  ```
  uv pip install -e .
  ```

- Verify NPU hardware:

  ```
  python tests/hardware_test.py
  ```
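A hardware check of this kind can be sketched as below. This is an assumption about the approach, not the actual contents of `tests/hardware_test.py`: it uses the OpenVINO runtime (which IPEX-LLM's NPU path builds on) to see whether an `NPU` device is enumerated.

```python
def npu_available() -> bool:
    """Return True if the OpenVINO runtime reports an NPU device.

    Sketch only -- the repository's tests/hardware_test.py may use a
    different mechanism.
    """
    try:
        from openvino import Core  # OpenVINO runtime enumerates devices
    except ImportError:
        return False  # runtime not installed, so the NPU path cannot work
    return "NPU" in Core().available_devices


if __name__ == "__main__":
    print("NPU available:", npu_available())
```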
Run the main script to generate text. The model is automatically downloaded, optimized, and saved to `./model_weights` by default.
```
python src/main.py --prompt "What is the flavor of water?"
```

Options:

- `--prompt`: The input prompt (default: `"What is life?"`).
- `--repo-id-or-model-path`: Hugging Face model ID or local path (default: `Qwen/Qwen2.5-0.5B-Instruct`).
- `--save-directory`: Directory in which to save the optimized model (default: `./model_weights`).
- `--n-predict`: Maximum number of tokens to predict (default: 128).
- `--disable-streaming`: Disable streaming output.
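The options and defaults above correspond to an `argparse` setup along these lines (a sketch of the CLI surface, not the project's actual `src/main.py` source):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Mirrors the documented options and their defaults.
    p = argparse.ArgumentParser(description="NPU-optimized LLM inference")
    p.add_argument("--prompt", default="What is life?",
                   help="input prompt for generation")
    p.add_argument("--repo-id-or-model-path", default="Qwen/Qwen2.5-0.5B-Instruct",
                   help="Hugging Face model ID or local path")
    p.add_argument("--save-directory", default="./model_weights",
                   help="where the optimized model is saved")
    p.add_argument("--n-predict", type=int, default=128,
                   help="maximum number of tokens to predict")
    p.add_argument("--disable-streaming", action="store_true",
                   help="disable streaming output")
    return p


# Parsing an empty argument list yields the documented defaults.
args = build_parser().parse_args([])
```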
Run the example script to summarize a text file (`examples/story.txt`):

```
python examples/summarize.py
```
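One plausible shape for such a summarization example is shown below. The helper name and prompt wording are illustrative assumptions; the actual `examples/summarize.py` may construct its prompt differently.

```python
def build_summary_prompt(text: str) -> str:
    """Wrap the input text in a simple summarization instruction."""
    return f"Summarize the following text in a few sentences:\n\n{text}"


# In the real example this string would be read from examples/story.txt.
story = "Once upon a time..."
prompt = build_summary_prompt(story)
# The prompt would then be passed to the NPU inference engine for generation.
```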