llm.ftrz.de

All /v1/* endpoints require a Bearer token:

Authorization: Bearer YOUR_API_KEY

The /health endpoint is public.
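
The same rule can be sketched in Python with only the standard library. The helper name build_request and the BASE_URL constant are illustrative and not part of the API. The request is built here but not sent.

```python
import json
import urllib.request

BASE_URL = "https://llm.ftrz.de"  # host taken from this page


def build_request(path, api_key=None, payload=None):
    """Build a request; /v1/* paths need a Bearer token, /health does not."""
    if path.startswith("/v1/") and not api_key:
        raise ValueError("an API key is required for /v1/* endpoints")
    headers = {}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    data = None
    if payload is not None:
        # A JSON body turns the request into a POST.
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers)


req = build_request("/v1/models", api_key="YOUR_API_KEY")
print(req.get_header("Authorization"))  # Bearer YOUR_API_KEY
```

Passing the returned object to urllib.request.urlopen would send it. Failing early on a missing key avoids a round trip that would presumably be rejected anyway.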

POST /v1/chat/completions

Chat completion (messages format). Supports streaming via "stream": true.

curl https://llm.ftrz.de/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6",
    "messages": [
      {"role": "system", "content": "You are a helpful coding assistant."},
      {"role": "user", "content": "Write a fibonacci function in Python"}
    ],
    "max_tokens": 512,
    "temperature": 0.7
  }'
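
With "stream": true, OpenAI-compatible servers usually send the reply as server-sent events: lines of the form data: {json}, closed by data: [DONE]. The sketch below assumes this server follows that convention, which this page does not confirm. The function name iter_stream_text and the sample lines are made up for illustration.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from an iterable of SSE lines (str)."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # assumed end-of-stream marker
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


# Hand-written sample in the assumed chunk format, not captured server output.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "def fib"}}]}',
    'data: {"choices": [{"delta": {"content": "(n):"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))  # def fib(n):
```

In real use, the lines would come from the HTTP response body, decoded line by line as it arrives.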

POST /v1/completions

Text completion (prompt format).

curl https://llm.ftrz.de/v1/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6",
    "prompt": "def binary_search(arr, target):",
    "max_tokens": 256
  }'

GET /v1/models

List available models. The model field in requests is informational only: llama-server always serves the currently loaded model, so any identifier (e.g. qwen3.6) is accepted.

GET /health

Health check (no auth required). Returns server status and loaded model info.
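
A small readiness probe might look like the sketch below. It makes no assumption about the fields in the response body, since this page does not list them, and only treats HTTP 200 with a JSON body as healthy. The function name check_health and the injectable opener parameter are illustrative choices.

```python
import json
import urllib.request


def check_health(url="https://llm.ftrz.de/health", opener=urllib.request.urlopen):
    """Return (is_up, body); body is the parsed JSON, or None on failure."""
    try:
        with opener(url, timeout=10) as resp:
            body = json.loads(resp.read().decode("utf-8"))
            return resp.status == 200, body
    except (OSError, ValueError):
        # OSError covers connection failures, timeouts and HTTP error statuses;
        # ValueError covers bodies that are not valid JSON.
        return False, None
```

Calling check_health() with no arguments queries the live endpoint. Passing a different opener lets the function be exercised without network access.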

OpenAI Python SDK

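from openai import OpenAI

# Point the SDK at this server instead of the default OpenAI endpoint.
client = OpenAI(
    base_url="https://llm.ftrz.de/v1",
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="qwen3.6",
    messages=[{"role": "user", "content": "Hello!"}],
)

# Print the assistant's reply.
print(response.choices[0].message.content)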

Model	Qwen3.6-35B-A3B (MoE, 3B active) — Q6_K_XL
Hardware	AMD Ryzen AI Max 395 — 96 GB unified memory
Backend	llama.cpp (ROCm HIP)
Speed	~40 tok/s generation, ~187 tok/s prompt processing
Context	131,072 tokens
Rate Limit	30 req/min per IP
Timeout	600s per request
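
To stay under the 30 req/min limit, a client can throttle itself. The sketch below uses a sliding 60-second window. How the server actually counts requests is not stated here, so treat this as a conservative client-side guard rather than a mirror of the server's logic. The class name and its parameters are illustrative.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Client-side guard: at most max_calls requests per period seconds."""

    def __init__(self, max_calls=30, period=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock
        self._sleep = sleep
        self._sent = deque()  # timestamps of requests inside the window

    def wait_time(self):
        """Seconds until the next request is allowed (0.0 if allowed now)."""
        now = self._clock()
        # Drop timestamps that have left the window.
        while self._sent and now - self._sent[0] >= self.period:
            self._sent.popleft()
        if len(self._sent) < self.max_calls:
            return 0.0
        return self.period - (now - self._sent[0])

    def acquire(self):
        """Block until a slot is free, then record the request."""
        while True:
            delay = self.wait_time()
            if delay <= 0:
                break
            self._sleep(delay)
        self._sent.append(self._clock())


limiter = SlidingWindowLimiter()  # defaults match the limit listed above
limiter.acquire()  # call once before each request; returns at once here
```

Because the limit is per IP, several clients behind one address share the same budget, so a lower max_calls may be safer. For long generations, the HTTP client's own timeout should be at least the 600 s the server allows.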
