Live on Mac Mini M4

LLM Gateway

OpenAI-compatible API serving chat, embeddings, speech-to-text, and text-to-speech. One key, multiple models, zero cloud costs.

Context: 32K

Quick Start

Python (OpenAI SDK)

# pip install openai
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.maxpetrusenko.com/v1",
    api_key="your-api-key",
)

# Chat
response = client.chat.completions.create(
    model="gemma4",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

# Vision-language
response = client.chat.completions.create(
    model="shizhengpt",
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "Analyze visible tongue features as JSON."},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}},
    ]}],
    max_tokens=2048,
)
print(response.choices[0].message.content)

# Embeddings
embeddings = client.embeddings.create(
    model="nomic-embed-text",
    input="search query",
)
print(f"Dimensions: {len(embeddings.data[0].embedding)}")
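
The vision example above passes the image as a base64 data URL with the payload elided. The following is a minimal sketch of how such a URL could be built from raw image bytes using only the standard library. The helper names `to_data_url` and `image_message` are hypothetical and are not part of the gateway or the OpenAI SDK; the JPEG default simply mirrors the example.

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw image bytes as a data URL for an image_url message part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_message(prompt: str, image_bytes: bytes) -> dict:
    """Build one user message holding a text part and an image part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": to_data_url(image_bytes)}},
        ],
    }
```

The returned dict can be passed as the single element of `messages` in the `shizhengpt` call, with the bytes read from a local file in binary mode.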

curl

# Chat
curl https://llm.maxpetrusenko.com/v1/chat/completions \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gemma4","messages":[{"role":"user","content":"Hello!"}]}'

# Text-to-Speech
curl https://llm.maxpetrusenko.com/v1/audio/speech \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"input":"Hello from the gateway."}' \
  --output speech.wav

JavaScript / TypeScript

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://llm.maxpetrusenko.com/v1",
  apiKey: "your-api-key",
});

const res = await client.chat.completions.create({
  model: "gemma4",
  messages: [{ role: "user", content: "Hello!" }],
});
console.log(res.choices[0].message.content);

Models

Model	Type	Use for	Details
`gemma4`	Chat	Reasoning, agent tasks	Google Gemma 4 8B, 32K context, Q4_K_M
`qwen3:8b`	Chat	Routing, summaries, classification	Alibaba Qwen 3 8B, fast
`nomic-embed-text`	Embedding	Semantic search, RAG retrieval	768 dimensions, 8K input
`qwen35-35b-a3b-iq2m`	Chat/code	Local 35B MoE reasoning/coding on 16GB Mac mini	Qwen3.5-35B-A3B UD-IQ2_M GGUF, 10.6 GiB; aliases: qwen35, qwen35-35b, qwen35-35b-a3b
`shizhengpt`	Vision-language	TCM tongue image observations	ShizhenGPT-7B-VL, MLX Q4
`whisper`	STT	Audio transcription	whisper.cpp base.en, Apple Silicon optimized
`piper`	TTS	Speech synthesis	piper-tts, en_US lessac medium voice

Endpoints

POST /v1/chat/completions

Send messages and receive completions. Works with the chat models and with shizhengpt for combined image and text messages.

{
  "model": "gemma4",
  "messages": [{"role": "user", "content": "Explain quantum computing"}],
  "temperature": 0.7
}

POST /v1/embeddings

Generate vector embeddings for text. Returns 768-dimensional vectors.

{
  "model": "nomic-embed-text",
  "input": "Your text here"
}
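
For semantic search, the returned vectors are typically compared with cosine similarity. The sketch below is one common way to rank documents against a query and is not something the gateway itself provides; `cosine_similarity` and `rank_documents` are illustrative names, and the vectors are assumed to be plain lists of floats taken from `embeddings.data[i].embedding`.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def rank_documents(query_vec: list[float], doc_vecs: list[list[float]]) -> list[tuple[int, float]]:
    """Return (document index, score) pairs, most similar first."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

A pure-Python loop is adequate for a handful of 768-dimensional vectors; for larger collections a vectorized library or a vector index would likely be a better fit.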

POST /v1/audio/transcriptions

Transcribe audio files to text. Send as multipart/form-data with a file field.

curl https://llm.maxpetrusenko.com/v1/audio/transcriptions \
  -H "Authorization: Bearer YOUR_KEY" \
  -F "file=@audio.wav"
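
The same upload can be sketched in Python with only the standard library. `build_multipart` and `transcribe` are hypothetical helpers. The `file` field name comes from the description above, while the `audio/wav` part type and the JSON response body are assumptions based on OpenAI compatibility, not confirmed behavior of this gateway.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(field: str, filename: str, data: bytes,
                    content_type: str = "audio/wav") -> tuple[bytes, str]:
    """Return (body, Content-Type header value) for a single-file upload."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def transcribe(path: str, api_key: str,
               base_url: str = "https://llm.maxpetrusenko.com/v1") -> dict:
    """POST an audio file to /audio/transcriptions (response assumed to be JSON)."""
    with open(path, "rb") as fh:
        body, ctype = build_multipart("file", os.path.basename(path), fh.read())
    req = urllib.request.Request(
        f"{base_url}/audio/transcriptions",
        data=body,
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": ctype},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())
```

With the OpenAI SDK already installed, `client.audio.transcriptions.create(model="whisper", file=open(path, "rb"))` is probably the shorter route, assuming the gateway accepts a `model` field on this endpoint.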

POST /v1/audio/speech

Convert text to speech. Returns a WAV audio file.

{
  "input": "Text you want spoken aloud."
}
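
Since the endpoint is described as returning a WAV file, the response bytes can be sanity-checked before they are saved or played. The sketch below uses Python's standard `wave` module; `wav_info` is an illustrative helper, and it assumes the response carries a well-formed PCM WAV header.

```python
import io
import wave


def wav_info(wav_bytes: bytes) -> dict:
    """Parse WAV bytes and report channels, sample rate, and duration."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "duration_s": frames / rate if rate else 0.0,
        }
```

`wave.open` raises `wave.Error` when the bytes are not valid WAV data, which makes this a cheap way to catch a response that is actually an error message rather than audio.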

GET /v1/models

List all available models.
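
The response format is not shown here. Assuming it follows the OpenAI list shape (an object with a `data` array whose entries each carry an `id`), the model names could be extracted as sketched below. `model_ids` and `fetch_model_ids` are hypothetical helper names.

```python
import json
import urllib.request


def model_ids(payload: dict) -> list[str]:
    """Extract sorted model IDs from an OpenAI-style model list payload."""
    return sorted(item["id"] for item in payload.get("data", []) if "id" in item)


def fetch_model_ids(api_key: str,
                    base_url: str = "https://llm.maxpetrusenko.com/v1") -> list[str]:
    """Call GET /models and return the IDs (response shape is an assumption)."""
    req = urllib.request.Request(
        f"{base_url}/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return model_ids(json.load(resp))
```

Checking the returned list before sending a request is one way to fail fast on a mistyped model name.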

GET /health

Health check. No authentication required.

Authentication

All endpoints (except /health and /docs) require a bearer token:

Authorization: Bearer your-api-key
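
One way to keep the key out of source code is to read it from the environment and attach the header in a single place. This is a sketch under assumptions: the variable name `LLM_GATEWAY_API_KEY` and the helper `authed_request` are made up for illustration and are not defined by the gateway.

```python
import os
import urllib.request
from typing import Optional

BASE_URL = "https://llm.maxpetrusenko.com/v1"


def authed_request(path: str, api_key: Optional[str] = None) -> urllib.request.Request:
    """Build a request for BASE_URL + path carrying the bearer token header."""
    # LLM_GATEWAY_API_KEY is a hypothetical variable name chosen for this sketch.
    key = api_key or os.environ.get("LLM_GATEWAY_API_KEY")
    if not key:
        raise RuntimeError("no API key provided")
    return urllib.request.Request(
        f"{BASE_URL}{path}",
        headers={"Authorization": f"Bearer {key}"},
    )
```

The resulting object can be passed to `urllib.request.urlopen`; nothing is sent over the network until that call is made.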

Need a key? Contact [email protected]

Notes

Single-tenant. The gateway runs on a 16GB Mac Mini, so only one generation model is loaded at a time. The first request to a different model takes roughly 10 to 30 seconds while the models swap. The embedding model stays loaded alongside whichever generation model is active. Best suited to development, prototyping, and personal projects.
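
Because each switch between generation models costs a reload, a batch that mixes models can be reordered so that requests for the same model run back to back. The sketch below is a client-side idea, not a gateway feature; `order_by_model` and `count_swaps` are hypothetical helpers, and jobs are assumed to be `(model, prompt)` tuples.

```python
def order_by_model(jobs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Group jobs by model, keeping first-seen model order and per-model job order."""
    groups: dict[str, list[tuple[str, str]]] = {}
    for model, prompt in jobs:
        groups.setdefault(model, []).append((model, prompt))
    return [job for batch in groups.values() for job in batch]


def count_swaps(jobs: list[tuple[str, str]]) -> int:
    """Count how often consecutive jobs target different models."""
    swaps = 0
    previous = None
    for model, _ in jobs:
        if previous is not None and model != previous:
            swaps += 1
        previous = model
    return swaps
```

`count_swaps` treats every model change as a reload, which overstates the cost for embedding requests since the embedding model stays resident. A client timeout comfortably above 30 seconds is also worth setting for the first request after a model change.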