In 1998, I co-founded a cybersecurity company called INTRAC.NET. We built and operated IRC networks on Linux and FreeBSD servers, wrote custom IRCd configurations, hardened systems against DDoS and exploit attempts, and learned network security by defending infrastructure in real time.
This was before firewalls had GUIs. Before “the cloud” meant anything other than weather. I was in my early twenties running shell accounts, compiling kernels, and obsessed with one question: how does every layer of a system talk to every other layer?
Nearly three decades later, I am still asking that question. The difference is that now the answer fits in a rack in my closet. The same skills that make you dangerous in a personal lab are the ones that make you valuable at work.
The Project
Project Kaizen is a sovereign AI platform running entirely on local hardware. No cloud subscriptions. No data leaving my house. No third-party APIs for core inference. Apple Silicon, open-source models, and a philosophy built on continuous improvement.
At the core is a Mac Studio M3 Ultra with 512 GB of unified memory, capable of running datacenter-class large language models locally. It powers 13 microservices covering LLM inference, speech-to-text, text-to-speech, semantic memory, web search, multi-provider AI routing, and a central orchestrator managing the entire stack through a single control layer.
The system has persistent memory, voice interaction, and awareness of its local environment. It knows the hardware it runs on, the state of the home security system, and external conditions in real time.
The current model stack runs a 397-billion parameter Mixture-of-Experts model — Qwen3.5-397B — where only 17 billion parameters are active per token. That is the trick. You get datacenter-scale reasoning without datacenter-scale compute. The full model weights sit in unified memory. The architecture routes each token through a sparse subset of experts. The result is a model that would cost thousands per month on cloud GPUs running at conversational speed on a machine under my desk.
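The sparse-activation math above is easy to sanity-check. This sketch just runs the arithmetic on the numbers cited in this article; the variable names are mine:

```python
# Back-of-envelope arithmetic for sparse Mixture-of-Experts inference.
# Per-token compute scales with ACTIVE parameters, while the memory
# footprint scales with TOTAL parameters held in unified memory.

TOTAL_PARAMS = 397e9   # full expert pool resident in memory
ACTIVE_PARAMS = 17e9   # experts actually routed per token

# Fraction of the model doing work on any given token.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"Active per token: {active_fraction:.1%}")  # → 4.3%

# A dense model with the same per-token compute would be ~17B params,
# but without the 397B pool of specialized experts to route through.
compute_equivalent_dense = ACTIVE_PARAMS
print(f"Dense compute equivalent: {compute_equivalent_dense / 1e9:.0f}B")
```

That roughly 4% active fraction is why conversational speed is achievable: you pay dense-17B compute per token while keeping dense-397B breadth in memory.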
The Model Stack
This is not a single-model setup. It is a purpose-built stack where each model variant serves a specific role, all sharing the same base weights but configured differently at the system prompt and inference layer.
| Model | Base | Size | Active | Speed | Role |
|---|---|---|---|---|---|
| max:voice | Qwen3.5-35B-A3B | ~23 GB | 3B | 42.9 tok/s | Voice, fast chat |
| max:deep | Qwen3.5-397B-A17B | ~189 GB | 17B | 17.6 tok/s | Primary reasoning |
| max:think | Qwen3.5-397B-A17B | ~189 GB | 17B | 17.5 tok/s | Chain-of-thought |
The difference between max:deep and max:think is a single config flag. A think: false setting in the model's YAML tells the WebSearch proxy to inject think: false into the API request, suppressing visible chain-of-thought tokens. max:think leaves them on, so you can watch the model reason step by step. Same weights. Different behavior. Config-driven.
On top of the local stack, three cloud AI proxies bring 11 additional models into the same OpenWebUI interface — Claude 4.6, GPT-5.2, and GLM-5 — all accessible without switching tools. The proxies are Flask services that bridge CLI tools to the OpenAI-compatible API format. When a cloud model needs real-time data, it hits the same /v1/search endpoint that the local models use. One search infrastructure serves everything.
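The bridging work those proxies do is mostly response-shape translation. A minimal sketch, assuming only the standard OpenAI chat-completion response fields; to_openai_response() is my illustrative name, not the proxy's:

```python
# Wrap raw text from a CLI tool in the OpenAI-compatible chat-completion
# shape so OpenWebUI can treat every backend identically. Field names
# follow the OpenAI response format; the function itself is a sketch.

import time
import uuid

def to_openai_response(model: str, text: str) -> dict:
    """Wrap CLI output as an OpenAI-style chat completion response."""
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex[:12]}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": text},
            "finish_reason": "stop",
        }],
    }

resp = to_openai_response("claude-4.6", "Hello from the proxy.")
print(resp["choices"][0]["message"]["content"])
```

Once every backend speaks this one shape, the model picker, the mobile app, and the dashboard all stay backend-agnostic for free.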
The 13-Service Architecture
Every service runs as a native process on macOS. No Docker. No container orchestration. Direct Metal GPU access, direct memory access, launchd-managed where needed. The Orchestrator at port 11440 manages it all — health checks, service control, memory proxying, and the web dashboard.
The WebSearch proxy is the brains of the operation. It sits between every client and Ollama, intercepting requests to enrich them before inference. Three things happen in order:
Step 1 — Identity enforcement. The model_router checks the system message. If it does not contain the Project Kaizen identity block, the router replaces it. This ensures Max always knows who he is regardless of what the client sends.
Step 2 — Conditional hardware injection. If the user asks about hardware, specs, or system details, the proxy injects the M3 Ultra specs into the system prompt. If they ask about the weather or write a poem, it stays out. This solved a problem where the model would volunteer hardware specs unprompted on every response.
Step 3 — Personal memory injection. The Mem0 memory client queries ChromaDB for semantically relevant memories and appends them to the system prompt. Identity queries (“What do you know about me?”) expand to 25 memories. Everything else gets 10. The model receives personal context without ever being fine-tuned.
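The three steps above can be sketched as one enrichment function. IDENTITY_BLOCK, HARDWARE_BLOCK, the keyword list, and fetch_memories() are illustrative stand-ins for the real Kaizen components; the ordering is the part taken from the design:

```python
# Sketch of the proxy's pre-inference enrichment, in the order the
# steps run. All names and blocks here are illustrative placeholders.

IDENTITY_BLOCK = "You are Max, the Project Kaizen assistant."
HARDWARE_BLOCK = "Host: Mac Studio M3 Ultra, 512 GB unified memory."
HARDWARE_KEYWORDS = ("hardware", "specs", "system", "gpu")

def fetch_memories(query: str, limit: int) -> list[str]:
    # Placeholder for the Mem0/ChromaDB semantic lookup.
    return [f"memory {i}" for i in range(limit)]

def enrich(system_msg: str, user_msg: str) -> str:
    # Step 1: identity enforcement — replace any foreign system prompt.
    if IDENTITY_BLOCK not in system_msg:
        system_msg = IDENTITY_BLOCK
    # Step 2: hardware specs only when the user actually asks for them.
    if any(k in user_msg.lower() for k in HARDWARE_KEYWORDS):
        system_msg += "\n" + HARDWARE_BLOCK
    # Step 3: memory injection — identity queries cast a wider net.
    limit = 25 if "about me" in user_msg.lower() else 10
    memories = fetch_memories(user_msg, limit)
    return system_msg + "\nRelevant memories:\n" + "\n".join(memories)

print(enrich("You are a generic bot.", "What hardware do you run on?"))
```

The key property is that the conditional steps compose: a poem request gets identity plus ten memories and nothing else, while a hardware question layers the spec block on top.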
The Infrastructure
Kaizen does not exist in isolation. It runs on top of a full homelab stack that has been evolving since long before AI was part of the picture.
UniFi networking with a 10 GbE backbone and dual-WAN failover. The UDM-SE handles routing, firewall, and IDS. A Flex-XG switch connects the three primary servers at 10 Gbps each. When AT&T drops — and it does — T-Mobile picks up automatically. No manual intervention.
Pi-hole DNS on a Raspberry Pi 5 handles all internal resolution. Every .lan hostname maps to a static IP. Over 80,000 queries processed. Ad blocking, local DNS, DHCP — all on a $75 single-board computer that has been running without a reboot for months.
Synology NAS with 18 TB across two RAID pools — one HDD for bulk storage, one SSD for fast access. NFS mounts serve media, backups, and shared data to every machine on the network.
Observability through Grafana, Prometheus, and Loki running on the Mac Mini. Dashboards cover CPU, memory, disk, network, container health, tunnel status, UPS battery levels, and Pi-hole query stats. When something is about to fail, I know before it happens.
Cloudflare Zero Trust provides secure external access. Three tunnels — one per server — route through Cloudflare’s edge network. Zero inbound ports. Zero port forwarding. Email OTP authentication at the gate. The homelab is invisible to port scanners.
Home automation operates through Hubitat with 74 local devices — Z-Wave switches, Zigbee sensors, smart locks, thermostats, garage doors. The AI can arm the house, lock every door, adjust temperature, and access camera feeds through natural voice interaction. Homebridge bridges everything to Apple Home.
The Mobile Layer
I built a native SwiftUI iOS application as the primary client. Version 2.3. Not a web wrapper. A purpose-built interface with its own networking stack, audio pipeline, and memory management.
Advanced voice mode runs a full state machine — idle, listening, processing, speaking — with interruption support. You can cut Max off mid-sentence and he stops, processes your new input, and responds. The audio visualizer runs in SwiftUI with four modes (Orb, EQ Bars, Waveform, Fluid) that react to live audio levels. It is not decorative. It tells you what the system is doing.
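The interruption behavior falls out of one transition rule. The real state machine lives in Swift inside the iOS app; this Python sketch only shows the transition logic, and the VoiceSession class and method names are mine:

```python
# Sketch of the voice-mode state machine with barge-in support.
# States come from the article; the transitions are illustrative.

from enum import Enum, auto

class VoiceState(Enum):
    IDLE = auto()
    LISTENING = auto()
    PROCESSING = auto()
    SPEAKING = auto()

class VoiceSession:
    def __init__(self):
        self.state = VoiceState.IDLE

    def start_listening(self):
        self.state = VoiceState.LISTENING

    def finish_utterance(self):
        self.state = VoiceState.PROCESSING

    def start_speaking(self):
        self.state = VoiceState.SPEAKING

    def interrupt(self):
        # Cutting Max off mid-sentence: stop TTS and go straight back
        # to listening for the new input instead of falling to idle.
        if self.state is VoiceState.SPEAKING:
            self.state = VoiceState.LISTENING

s = VoiceSession()
s.start_listening(); s.finish_utterance(); s.start_speaking()
s.interrupt()
print(s.state)  # barge-in lands back in LISTENING
```

Routing an interrupt to LISTENING rather than IDLE is what makes the cut-off feel conversational: the mic is already hot for the follow-up.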
Version 2.3 added cloud AI proxy integration. Claude, Codex, and Z.AI models appear in the same model picker alongside local Ollama models, grouped by provider with visual badges. Toggle providers on or off. The app auto-discovers available models from each proxy on launch.
The Voice Pipeline
Voice is not an add-on. It is a first-class interface. The pipeline handles everything from microphone input to spoken response with sub-second latency for transcription and a separate pass for synthesis.
The text normalization stage is critical. Without it, the TTS would try to read markdown syntax, code blocks, HTML tags, and internal <think> tokens aloud. The pipeline strips all of that, converts currency symbols and percentages to spoken equivalents, handles abbreviations, and injects strategic pauses for natural speech cadence. Ten stages, in order, every time.
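A few of those stages, sketched as chained regex passes. The real pipeline has ten stages; these patterns are representative, not the actual implementation:

```python
# Representative TTS normalization stages, applied in order.
# Illustrative regexes; the production pipeline runs ten stages.

import re

def normalize_for_tts(text: str) -> str:
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.S)  # strip reasoning tokens
    text = re.sub(r"```.*?```", "", text, flags=re.S)           # drop code blocks
    text = re.sub(r"[*_#`]", "", text)                          # markdown syntax
    text = re.sub(r"\$(\d+)", r"\1 dollars", text)              # currency to words
    text = re.sub(r"(\d+)%", r"\1 percent", text)               # percentages
    text = re.sub(r"\s+", " ", text).strip()                    # collapse whitespace
    return text

print(normalize_for_tts("<think>hmm</think>That costs $5, about 10% more."))
# → "That costs 5 dollars, about 10 percent more."
```

Ordering matters: reasoning tokens and code blocks must go before the character-level passes, or their contents leak into the spoken output.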
When a voice request triggers a web search, the proxy sends a TTS acknowledgment first — “One moment, I’m checking that for you” — so the user knows the system heard them and is working. Without this, a 2-second search delay feels like a failure. With it, it feels natural.
The Memory System
Every conversation flows through the memory layer. The system uses Mem0 with ChromaDB as the vector store and mxbai-embed-large (1024-dimension embeddings) for semantic search. A dedicated model — max:mem (qwen3:8b) — handles fact extraction, pulling structured memories out of natural conversation.
This means Max learns your preferences, remembers your projects, knows your routines — without any fine-tuning, without any cloud service, without any data leaving your network. Say “remember that I take my coffee black” once, and it is embedded, indexed, and surfaced every time coffee comes up in conversation.
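The retrieval idea is simple even though the production pieces are not. A toy illustration: the real stack uses Mem0 over ChromaDB with 1024-dimension mxbai-embed-large vectors, while this sketch substitutes a bag-of-words cosine so it runs with nothing but the standard library:

```python
# Toy semantic recall: bag-of-words cosine standing in for the real
# 1024-dim embedding. The memories and recall() function are examples.

import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

memories = [
    "user takes coffee black",
    "user is building a swiftui app",
    "user prefers freebsd on servers",
]

def recall(query: str, k: int = 1) -> list[str]:
    ranked = sorted(memories, key=lambda m: cosine(embed(m), embed(query)),
                    reverse=True)
    return ranked[:k]

print(recall("how do I take my coffee"))  # → ['user takes coffee black']
```

Swap the toy embed() for a real embedding model and the same ranking logic is what surfaces "coffee black" whenever coffee comes up.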
The MemCore Web UI at /dashboard/memory lets you browse, search, edit, and delete memories directly. Same-origin design means it works through HTTPS and Cloudflare Tunnel without CORS issues.
Why This Matters Beyond the Hobby
People ask why I build systems like this. The answer is simple. I have been doing it since the late ’90s, and the drive to understand how things work, then make them work better, does not turn off at 5 PM.
Running a homelab at this scale mirrors enterprise realities.
Infrastructure as Code
Every service has a start script, a stop script, and a health endpoint. Every configuration is YAML. Every model has a Modelfile and a proxy config. When something breaks at 2 AM — and it does — I can rebuild from scratch. When a Hubitat hub failed in a crash loop, I diagnosed root cause, executed clean removal, restored from backup, and migrated to new hardware. That is incident response and change management practiced on your own time.
The pattern: Every service exposes GET /health returning {"status": "healthy", "service": "name", "version": "x.x.x"}. The orchestrator polls these endpoints. If a service goes down, the dashboard shows it immediately. This is the same health check pattern used in Kubernetes, ECS, and every serious production system.
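The orchestrator's side of that contract is a poll-and-roll-up loop. A minimal sketch, assuming the payload shape above; check_services() and the example service names are illustrative:

```python
# Roll up polled /health responses for the dashboard. A payload of
# None models a connection error (service unreachable). Sketch only.

from typing import Optional

def check_services(responses: dict[str, Optional[dict]]) -> dict[str, str]:
    """Map each service to 'healthy' or 'down' for the dashboard."""
    status = {}
    for name, payload in responses.items():
        ok = payload is not None and payload.get("status") == "healthy"
        status[name] = "healthy" if ok else "down"
    return status

polled = {
    "websearch": {"status": "healthy", "service": "websearch", "version": "1.0.0"},
    "tts": None,  # connection refused → treated as down
}
print(check_services(polled))  # → {'websearch': 'healthy', 'tts': 'down'}
```

Treating "unreachable" and "unhealthy" as the same dashboard state is the same liveness-probe simplification Kubernetes makes.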
Zero Trust Architecture
Cloudflare Tunnels with Access policies, email OTP authentication, tiered access control, and zero inbound exposure. Each of the three tunnels terminates at Cloudflare's edge, and the UDM-SE firewall keeps zero inbound port forwards for web services.
Built for personal use, architected to enterprise standard. The access tiers mirror RBAC: Owner gets full control, Trusted users get chat access, Guests get time-limited demo tokens.
Observability
Grafana dashboards monitoring real-time metrics across four servers, 22 containers, three tunnels, 74 smart devices, and dual UPS systems. Prometheus scrapes. Loki aggregates logs. Glances provides per-server system metrics. Failures surface before they become outages.
This is the same observability stack running at companies processing millions of requests. The only difference is that mine monitors a closet instead of a data center.
AI and ML Engineering
Running large models locally requires understanding quantization, memory allocation, inference optimization, and prompt architecture at the hardware layer. When you are fitting a 189 GB model into 512 GB of unified memory alongside the operating system, voice services, a vector database, and a web search proxy — you learn to care about every gigabyte.
Quantization matters. The 397B model runs at Q3_K — aggressive, but the Mixture-of-Experts architecture is more tolerant of quantization than dense models because each expert specializes. The voice model runs at Q4_K_M for higher quality where latency matters most. These are the same tradeoffs made in enterprise ML deployment, just at a different scale.
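The footprint arithmetic behind those choices is worth making explicit. Bits-per-weight for K-quants varies with the per-tensor mix; the ~3.8 figure here is my estimate inferred from the ~189 GB footprint cited above, not a published constant:

```python
# Back-of-envelope quantization sizing. Bits-per-weight is an estimate
# (K-quant blends vary); ~3.8 bpw is inferred from the ~189 GB figure.

def quantized_size_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9

size = quantized_size_gb(397e9, 3.8)
print(f"397B at ~3.8 bpw ≈ {size:.0f} GB")  # lines up with ~189 GB above
print(f"FP16 would need {quantized_size_gb(397e9, 16):.0f} GB")  # → 794 GB
```

At FP16 the same weights would need roughly 794 GB, which no 512 GB machine can hold. Aggressive quantization is not an optimization here; it is the difference between running and not running.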
Security Mindset
INTRAC.NET began in cybersecurity. That mindset permeates everything. Alarm systems that cannot be disarmed by voice without explicit safety constraints. VPN layers that assume hostile networks. Firewall segmentation by default. A custom trash wrapper that prevents any service, script, or AI agent from permanently deleting files — everything goes to macOS Trash first.
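The trash-wrapper idea generalizes beyond macOS: move instead of delete, so nothing a script or agent touches is ever unrecoverable. A minimal sketch; trash() and its timestamp collision handling are illustrative, with the real wrapper targeting macOS's ~/.Trash:

```python
# Move files into a trash directory instead of deleting them, so every
# destructive operation is reversible. Illustrative sketch only.

import shutil
import time
from pathlib import Path

TRASH = Path.home() / ".Trash"  # macOS per-user trash location

def trash(path: str, trash_dir: Path = TRASH) -> Path:
    """Move a file into the trash directory instead of deleting it."""
    src = Path(path)
    trash_dir.mkdir(parents=True, exist_ok=True)
    dest = trash_dir / src.name
    if dest.exists():  # avoid clobbering an earlier trashed copy
        dest = trash_dir / f"{src.stem}.{int(time.time())}{src.suffix}"
    shutil.move(str(src), str(dest))
    return dest
```

Substituted for os.remove or rm in every script and agent tool, this turns "an AI deleted my file" from a disaster into an undo.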
The instincts built defending IRC infrastructure in 1998 still shape system design today. Security is not added later. It is foundational.
The Point
The strongest technologists build because they are curious, not because they were assigned a ticket.
Project Kaizen began as a question: can a real AI platform run entirely on consumer hardware inside a house?
The answer is yes. The more important outcome is everything learned building it, and how that knowledge compounds across professional environments.
The same person who debugs a failing TTS pipeline at midnight is the person who stays calm when a production deployment goes sideways at work. The same person who designs VLAN segmentation for a home network understands why network isolation matters in a corporate environment. The same person who tunes model quantization on unified memory understands the cost and performance tradeoffs that drive ML infrastructure decisions.
The lab is the classroom. The curiosity is the credential.
“Coding is thinking. AI just saves the typing.”