Anna

Anna is a local inference runtime for large language, multimodal, and speech models. It provides an OpenAI-compatible HTTP API and command-line tools for local generation, serving, benchmarking, Qwen3-TTS speech synthesis, and Qwen3-ASR speech recognition.

The runtime is built on PyTorch and is optimized for Intel Arc / XPU. CPU execution is useful for development and smaller tests.

Features

OpenAI-compatible endpoints: /v1/chat/completions, /v1/completions, /v1/audio/speech, /v1/audio/transcriptions, /v1/models
Non-streaming and streaming text generation
Chat, plain text completion, multimodal chat, function calling, and reasoning output
Qwen3-TTS speech synthesis and Qwen3-ASR speech recognition
Intel XPU options for torch.compile, KV-cache quantization, int4 weight quantization, MoE expert offload, prompt cache, and continuous batching
CLI tools: anna-serve, anna-generate, anna-bench, anna-speak, anna-transcribe, anna-xpu-int4-cache

Supported Models

Anna loads a local model directory. A normal Hugging Face-style directory should contain config.json and model weights. A Qwen GGUF layout is also supported for compatible Qwen3.5 MoE models.

Model family detection is based on config.json:

`model_type`	Runtime	Main use
`qwen3_tts`	Qwen3-TTS	Speech synthesis
`qwen3_asr`	Qwen3-ASR	Speech recognition
`gemma4`	Gemma 4	Text, chat, multimodal chat with audio
anything else	Qwen3.5 text / VL	Text, chat, image/video multimodal chat

Use anna-generate and anna-bench for text-generation models. Use anna-speak for Qwen3-TTS. Use anna-transcribe for Qwen3-ASR. Use anna-serve for any supported family; unsupported routes return an API error for the loaded model.

Requirements

Python 3.11+
PyTorch 2.7+ installed for your hardware
A local model directory
For Intel Arc / XPU: a PyTorch build with XPU support and the Intel GPU runtime
Optional fused XPU operator build: Intel oneAPI DPC++ compiler, and Visual Studio Build Tools on Windows

The package declares Python dependencies in pyproject.toml. PyTorch and Intel GPU drivers should be installed according to your target machine.

Qwen3-ASR depends on qwen-asr and python-multipart. Anna loads it on Intel XPU only; there is no CPU or alternate backend fallback for the ASR runtime. If XPU is unavailable, model loading fails immediately.

Installation

git clone https://github.com/YOUR_USERNAME/Anna.git
cd Anna
python -m venv .venv

Activate the virtual environment:

# Windows PowerShell
.\.venv\Scripts\Activate.ps1

# Linux / macOS
source .venv/bin/activate

Install Anna:

python -m pip install -U pip
python -m pip install -e .

For development and tests:

python -m pip install -e ".[dev]"
pytest

Check PyTorch and XPU availability:

python -c "import torch; print(torch.__version__); print(torch.xpu.is_available() if hasattr(torch, 'xpu') else False)"

Optional: build the fused XPU operator:

python tools/build_gated_delta_fused_op.py

Example local setup for Windows + Intel Arc A770:

conda activate anna
$env:ANNA_DPCPP = "D:\Intel\oneAPI\compiler\latest\bin\dpcpp.exe"
$env:ANNA_VCVARS64 = "C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\Build\vcvars64.bat"
python -m pip install -e ".[dev]"
python -c "import torch; print(torch.xpu.is_available()); print(torch.xpu.get_device_name(0))"

You do not need to build the fused operator just to run Qwen3-ASR; it is used by selected Anna custom XPU operator paths.

Quick Start

Start the OpenAI-compatible server:

anna-serve --model-dir /path/to/model --host 127.0.0.1 --port 8000

Send a chat request:

curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-model",
    "messages": [
      {"role": "user", "content": "Explain KV cache in one paragraph."}
    ],
    "max_completion_tokens": 128
  }'

Use streaming:

curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Write a short haiku about local AI."}],
    "stream": true,
    "stream_options": {"include_usage": true}
  }'

Run one-shot generation from the CLI:

anna-generate \
  --model-dir /path/to/text-model \
  --prompt "Explain KV cache in one paragraph." \
  --max-new-tokens 128

Run a benchmark:

anna-bench \
  --model-dir /path/to/model \
  --prompt "Hello" \
  --warmup 1 \
  --runs 3

Synthesize speech with Qwen3-TTS:

anna-speak \
  --model-dir /path/to/qwen3-tts \
  --input "Hello from Anna." \
  --output out.wav

Transcribe audio with Qwen3-ASR:

anna-transcribe \
  --model-dir models/Qwen3-ASR-1.7B \
  --audio input.wav \
  --device xpu \
  --language English

If your local directory is models/Qwen3-ASR-0.6B, pass that path instead. Anna detects the model family from model_type: qwen3_asr in config.json, not from the directory name.

You can also serve Qwen3-ASR through the OpenAI-compatible transcription endpoint:

anna-serve \
  --model-dir models/Qwen3-ASR-1.7B \
  --device xpu \
  --dtype bfloat16 \
  --host 127.0.0.1 \
  --port 8000

curl http://127.0.0.1:8000/v1/audio/transcriptions \
  -F model=Qwen3-ASR-1.7B \
  -F file=@input.wav \
  -F language=English \
  -F response_format=verbose_json

Omit language to let Qwen3-ASR identify it automatically. Use response_format=text for plain text, or response_format=verbose_json for model ID, language, and timing metadata. Pass return_timestamps=true to request timestamp output.

Intel XPU Examples

Select a specific XPU device:

anna-serve \
  --model-dir /path/to/model \
  --device xpu \
  --xpu-device-index 0 \
  --dtype bfloat16

Use memory-saving runtime options:

anna-serve \
  --model-dir /path/to/model \
  --device xpu \
  --dtype bfloat16 \
  --kv-cache-quantization turboquant \
  --kv-cache-quant-bits 4 \
  --weight-quant auto \
  --prompt-cache-size 4

Enable continuous batching for API serving:

anna-serve \
  --model-dir /path/to/model \
  --device xpu \
  --scheduler-max-batch-size 4 \
  --scheduler-batch-wait-ms 2

Check whether Anna will create an XPU int4 sidecar cache:

anna-xpu-int4-cache \
  --model-dir /path/to/model \
  --weight-quant auto \
  --xpu-total-memory-gib 16

Multimodal Requests

For supported vision models, use OpenAI-style content parts:

curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image."},
          {"type": "image_url", "image_url": {"url": "/path/to/image.jpg"}}
        ]
      }
    ],
    "max_completion_tokens": 128
  }'

image_url, video_url, and audio_url content parts are accepted by the API schema. Actual support depends on the loaded model family.

API Routes

Route	Method	Purpose
`/healthz`	`GET`	Runtime health and model status
`/v1/models`	`GET`	List the loaded model ID
`/v1/chat/completions`	`POST`	Chat and multimodal chat
`/v1/completions`	`POST`	Plain text completion
`/v1/audio/speech`	`POST`	Qwen3-TTS speech synthesis
`/v1/audio/transcriptions`	`POST`	Qwen3-ASR speech recognition

`anna-serve` Options

anna-serve has one required option. All other options are optional and have runtime defaults.

Required option	Meaning
`--model-dir PATH`	Local model directory. Use a Hugging Face-style model folder with `config.json` and weights, or a compatible GGUF layout.

Common service options:

Optional option	Default	Meaning
`--model-name NAME`	derived from path	Model ID exposed by `/v1/models` and API responses.
`--host HOST`	`127.0.0.1`	Server bind address. Use `0.0.0.0` to listen on all interfaces.
`--port PORT`	`8000`	Server port.
`--log-level LEVEL`	`info`	Uvicorn and Anna logging level.
`--device DEVICE`	`auto`	`auto`, `cpu`, or `xpu`. `auto` prefers XPU when available.
`--dtype DTYPE`	`auto`	Compute dtype such as `auto`, `float32`, `float16`, or `bfloat16`.
`--max-completion-tokens N`	model/default estimate	Default output token cap for API requests that omit `max_tokens` / `max_completion_tokens`.
`--temperature FLOAT`	`0.7`	Default sampling temperature when the request omits it.
`--top-p FLOAT`	`0.8`	Default nucleus sampling probability.
`--top-k N`	`20`	Default top-k sampling limit. Set `0` to disable.
`--min-p FLOAT`	`0.0`	Default min-p sampling threshold.
`--presence-penalty FLOAT`	`1.5`	Default additive presence penalty.
`--repetition-penalty FLOAT`	`1.0`	Default multiplicative repetition penalty.
`--enable-thinking` / `--disable-thinking`	enabled	Default thinking behavior for chat requests that omit thinking fields.
`--reasoning-format none\|deepseek`	`deepseek`	Reasoning output format. `deepseek` returns `reasoning_content` separately when available.

Performance and memory options:

Optional option	Default	Meaning
`--compile-mode MODE`	`auto`	`none`, `auto`, `default`, `reduce-overhead`, or `max-autotune`. First requests can pay compile cost.
`--compile-fullgraph`	off	Request full-graph capture when `torch.compile` is enabled.
`--prefill-chunk-size N`	`0`	Split long text-only prefills into chunks. `0` lets Anna auto-size on XPU.
`--prompt-cache-size N`	`0`	Keep up to N exact text prompt KV caches resident. `0` disables prompt cache.
`--prompt-cache-max-tokens N`	`0`	Only cache prompts up to N tokens. `0` means no token limit.
`--kv-cache-quantization none\|turboquant`	`none`	Quantize compatible KV caches.
`--kv-cache-quant-bits 2\|3\|4`	`4`	TurboQuant KV-cache bit width.
`--kv-cache-residual-len N`	`128`	Keep the newest N KV tokens in full precision.
`--offload-mode auto\|none\|experts`	`auto`	MoE expert offload strategy.
`--offload-vision`	off	Keep the vision tower on CPU even when the language model runs on XPU.
`--expert-quant auto\|none\|int4`	`auto`	Quantization for MoE expert weights on XPU.
`--weight-quant auto\|none\|int4`	`auto`	Quantization for dense language-model weights on XPU.
`--resident-expert-layers N`	auto	Keep the first N sparse MoE layers fully resident on the execution device.
`--resident-expert-layer-indices LIST`	unset	Comma-separated 0-based sparse layer indices to keep resident. Overrides `--resident-expert-layers`.
`--cached-experts-per-layer N`	auto	Max offloaded experts cached on XPU per sparse MoE layer. `0` disables.

XPU and server runtime options:

Optional option	Default	Meaning
`--xpu-device-index N`	unset	Select an Intel XPU with `ONEAPI_DEVICE_SELECTOR=level_zero:N`.
`--no-xpu-env-defaults`	off	Do not set Anna's recommended Level Zero environment defaults before XPU startup.
`--xpu-int4-matmul auto\|torch\|dequant`	runtime default	XPU int4 dense linear execution strategy.
`--enable-flashqla-gdn-prefill`	off	Enable the XPU SYCL GDN prefill path. Unsupported shapes/devices/dtypes raise immediately.
`--no-inference-warmup`	off	Skip the small post-load XPU warmup. First client request may then pay lazy kernel load.
`--warmup-prefill-tokens N`	`2`	Text token count used by post-load XPU warmup prefill.
`--warmup-decode-steps N`	`1`	Decode steps used by post-load XPU warmup.
`--warmup-batch-size N`	`1`	Batch size used by post-load XPU warmup.
`--profile-runtime`	off	Log synchronized XPU timing and memory stats.
`--min-free-memory-mib N`	`1024`	Minimum free XPU memory required before generation starts.
`--reserve-memory-mib N`	`512`	Extra XPU memory margin preserved during request admission.
`--max-estimated-usage-ratio R`	`0.9`	Reject requests whose estimated usage exceeds this fraction of total XPU memory.
`--generation-memory-safety-factor R`	`2.0`	Multiplier applied to estimated generation memory.
`--scheduler-max-batch-size N`	`1`	Enable continuous batching when greater than `1`.
`--scheduler-batch-wait-ms MS`	`2.0`	Wait time used to coalesce requests when batching is enabled.
`--scheduler-prefill-interval-steps N`	`1`	Prefill scheduling interval while continuous batching is active.
`--metrics-log-interval-seconds S`	`10.0`	Emit aggregated runtime metrics every S seconds. `0` disables metrics logging.

For the full option list, run:

anna-serve --help
anna-generate --help
anna-bench --help
anna-speak --help
anna-transcribe --help

Troubleshooting

If XPU is not detected, confirm that your PyTorch build supports XPU and that Intel GPU drivers are installed.
If the wrong GPU is selected on a system with multiple Intel GPUs, pass --xpu-device-index N.
If the first request is slow, it may be paying model load, kernel load, or torch.compile cost.
If memory is tight on XPU, try --dtype bfloat16, --kv-cache-quantization turboquant, --weight-quant auto, --offload-mode experts, or a lower token limit.
If a route fails, verify that the loaded model family supports that task.

License

See LICENSE.

Name		Name	Last commit message	Last commit date
Latest commit History 115 Commits
crates/anna-rust		crates/anna-rust
src/anna		src/anna
tests		tests
tools		tools
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
README_zh.md		README_zh.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Anna

Features

Supported Models

Requirements

Installation

Quick Start

Intel XPU Examples

Multimodal Requests

API Routes

`anna-serve` Options

Troubleshooting

License

About

Uh oh!

Releases 21

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Anna

Features

Supported Models

Requirements

Installation

Quick Start

Intel XPU Examples

Multimodal Requests

API Routes

anna-serve Options

Troubleshooting

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 21

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`anna-serve` Options

Packages