Self-hosted LLM API — OpenAI-compatible
All /v1/* endpoints require a Bearer token:
Authorization: Bearer YOUR_API_KEY
The /health endpoint is public.
Chat completion (messages format). Supports streaming via "stream": true.
curl https://llm.ftrz.de/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6",
    "messages": [
      {"role": "system", "content": "You are a helpful coding assistant."},
      {"role": "user", "content": "Write a fibonacci function in Python"}
    ],
    "max_tokens": 512,
    "temperature": 0.7
  }'
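With "stream": true, OpenAI-compatible servers usually return the reply as server-sent events: one `data: {json}` line per chunk, ending with `data: [DONE]`. The sketch below assumes this endpoint follows that convention (this page only states that streaming is supported). `iter_stream_text` is a hypothetical helper name, and the sample events are hand-written for illustration.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style streaming lines.

    Assumes each event is a line of the form 'data: {json}' and that the
    stream ends with 'data: [DONE]' (the OpenAI convention, not confirmed
    by this page).
    """
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not data
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample events, not captured server output.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "def fib"}}]}',
    'data: {"choices": [{"delta": {"content": "(n):"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))
```

When using the OpenAI Python SDK, passing `stream=True` to `client.chat.completions.create` returns an iterator of chunks, so this manual parsing is only needed for raw HTTP clients.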
Text completion (prompt format).
curl https://llm.ftrz.de/v1/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6",
    "prompt": "def binary_search(arr, target):",
    "max_tokens": 256
  }'
List available models. The model field in requests is informational — llama-server always serves the currently loaded model, so any identifier works (e.g. qwen3.6).
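A minimal sketch of a model-list call using only the standard library. The `/v1/models` path and the `{"data": [{"id": ...}]}` response shape are assumptions taken from the OpenAI API layout; this page does not show them. The helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "https://llm.ftrz.de/v1"


def build_models_request(api_key: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for the model list.

    The /models path is an assumption based on the OpenAI API layout.
    """
    return urllib.request.Request(
        f"{BASE_URL}/models",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


def list_model_ids(api_key: str) -> list[str]:
    """Send the request and return model ids.

    Assumes an OpenAI-style body: {"data": [{"id": "..."}, ...]}.
    """
    request = build_models_request(api_key)
    with urllib.request.urlopen(request, timeout=30) as resp:
        body = json.load(resp)
    return [entry["id"] for entry in body.get("data", [])]
```

Calling `list_model_ids("YOUR_API_KEY")` performs the actual network request; `build_models_request` alone is useful for inspecting the URL and headers first.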
Health check (no auth required). Returns server status and loaded model info.
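A small sketch of a health probe without credentials. It assumes `/health` lives at the host root (`https://llm.ftrz.de/health`) and treats any 2xx status as healthy. The response body format is not documented here, so it is returned as raw text. `is_healthy` and `check_health` are hypothetical names.

```python
import urllib.error
import urllib.request

# Assumed location of the public health endpoint (host root, no /v1 prefix).
HEALTH_URL = "https://llm.ftrz.de/health"


def is_healthy(status_code: int) -> bool:
    """Treat any 2xx HTTP status as healthy."""
    return 200 <= status_code < 300


def check_health(url: str = HEALTH_URL, timeout: float = 10.0) -> tuple[bool, str]:
    """Return (healthy, raw_body_or_error) for the health endpoint."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return is_healthy(resp.status), resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as exc:
        # Non-2xx answers still carry a body that may explain the state.
        return False, exc.read().decode("utf-8", "replace")
    except OSError as exc:
        # Covers connection failures and timeouts.
        return False, str(exc)
```

`check_health()` can be polled before sending work, for example from a deployment script that waits for the model to finish loading.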
OpenAI Python SDK
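from openai import OpenAI

# Point the official SDK at the self-hosted endpoint.
client = OpenAI(
    base_url="https://llm.ftrz.de/v1",
    api_key="YOUR_API_KEY",
)
response = client.chat.completions.create(
    model="qwen3.6",
    messages=[{"role": "user", "content": "Hello!"}],
)
# Print the assistant's reply text.
print(response.choices[0].message.content)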
| Spec | Value |
|---|---|
| Model | Qwen3.6-35B-A3B (MoE, 3B active) — Q6_K_XL |
| Hardware | AMD Ryzen AI Max 395 — 96 GB unified memory |
| Backend | llama.cpp (ROCm HIP) |
| Speed | ~40 tok/s generation, ~187 tok/s prompt processing |
| Context | 131,072 tokens |
| Rate Limit | 30 req/min per IP |
| Timeout | 600s per request |
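The figures in the table can be turned into rough client-side planning numbers. The sketch below estimates request duration from the approximate speeds and spaces calls to stay under 30 requests per minute. It assumes the timeout covers prompt processing plus generation and that even spacing satisfies the server's limiter; neither detail is stated on this page. `estimate_seconds` and `Throttle` are hypothetical names.

```python
import time

GEN_TOK_PER_S = 40.0      # approximate generation speed from the table
PROMPT_TOK_PER_S = 187.0  # approximate prompt-processing speed from the table
TIMEOUT_S = 600.0         # per-request timeout from the table
RATE_LIMIT_PER_MIN = 30   # per-IP rate limit from the table


def estimate_seconds(prompt_tokens: int, max_tokens: int) -> float:
    """Rough request duration: prompt processing plus generation."""
    return prompt_tokens / PROMPT_TOK_PER_S + max_tokens / GEN_TOK_PER_S


class Throttle:
    """Space calls at least 60 / per_minute seconds apart."""

    def __init__(self, per_minute=RATE_LIMIT_PER_MIN,
                 clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_ok = None  # earliest time the next call may start

    def wait(self) -> float:
        """Block until the next call is allowed; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_ok is not None and now < self._next_ok:
            delay = self._next_ok - now
            self._sleep(delay)
        self._next_ok = now + delay + self.interval
        return delay


# A 512-token reply with a negligible prompt takes about 12.8 s at ~40 tok/s.
print(round(estimate_seconds(0, 512), 1))
# Whether a planned request fits inside the 600 s timeout.
print(estimate_seconds(8000, 4096) < TIMEOUT_S)
```

Call `throttle.wait()` before each request. At ~40 tok/s, the 600 s timeout leaves room for roughly 24,000 generated tokens when the prompt is short, so very long generations are better split across requests.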