
LightWeight
Run the exact local model you choose, with squeeze planning, verification, and an active-set memory policy matched to your hardware.

"LightWeight focuses on exact-model local inference: hardware checks, GGUF metadata inspection, runtime planning, backend routing, and browser or OpenAI-compatible serving without silently replacing the model you chose."


"The workflow includes doctor, inspect-model, probe, and bench so users can see memory pressure, GPU layer fit, and real tokens-per-second before settling on a thermal profile for private local chat."

01. Why LightWeight?
Most tools just run a model. LightWeight plans, builds, verifies, and reports the best exact-model profile for your actual RAM, VRAM, context, and thermal limits.
Big AI Model
  Without LightWeight: manual fit math; guess quant and context.
  With LightWeight: squeeze plan. Profile before download.
Ultra-Heavy Model
  Without LightWeight: crash on load; no recovery path.
  With LightWeight: verify + report. Measured on your PC.
Giant Model (Kimi)
  Without LightWeight: custom API work; glue code required.
  With LightWeight: active profile. Chat + /v1 use the same squeeze profile.
02. How is it possible?
01 / Built for Your PC
Squeeze plan compares quality, balanced, fit, and extreme profiles before download so you know the memory and quality risk up front.
02 / No Heat, No Noise
Thermal modes cap threads, context, KV cache, and offload pressure so the model runs inside realistic laptop limits.
03 / Smart Memory Logic
LightWeight combines GGUF metadata, active squeeze profiles, MoE expert placement, KV cache quantization, mmap, prompt cache, probe results, and OOM recovery.
04 / Truly Portable
The same CLI flow works in terminal chat, browser chat, and the local /v1 API without changing the selected model identity.
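The memory reasoning in this section (context limits, KV cache quantization, fitting inside RAM and VRAM) can be illustrated with generic arithmetic. The sketch below is not LightWeight's planner, whose internals are not described here. It only estimates KV cache size from the usual transformer shape parameters and checks it against a budget. The function names, the 10% headroom, and the example model shape are assumptions made for illustration; the bytes-per-element figures follow the llama.cpp block layouts for f16, q8_0, and q4_0.

```python
# Approximate bytes per cached element for common llama.cpp KV cache types.
# q8_0 packs 32 values into 34 bytes; q4_0 packs 32 values into 18 bytes.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, cache_type="f16"):
    """Estimate KV cache size: keys plus values for every layer and position."""
    per_token = 2 * n_layers * n_kv_heads * head_dim  # 2 = one K and one V entry
    return int(per_token * context_len * BYTES_PER_ELEMENT[cache_type])


def fits(weights_bytes, cache_bytes, budget_bytes, headroom=0.10):
    """Return True if weights plus cache fit in the budget with headroom left.

    The 10% default headroom is an illustrative assumption, not a LightWeight value.
    """
    return weights_bytes + cache_bytes <= budget_bytes * (1 - headroom)


if __name__ == "__main__":
    GIB = 1024 ** 3
    # Hypothetical model shape, not read from any real GGUF file.
    for cache_type in ("f16", "q8_0", "q4_0"):
        size = kv_cache_bytes(64, 8, 128, 8192, cache_type)
        print(f"{cache_type}: {size / GIB:.2f} GiB")
```

Halving the context length or moving the cache from f16 to q8_0 roughly halves the estimate, which is why capping context and quantizing the KV cache are the first levers for a tight memory budget.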
03. Measured Runtime
04. Simple Start
Windows (PowerShell):
    irm https://lightweight.zecoryx.uz/install.ps1 | iex
macOS / Linux:
    curl -fsSL https://lightweight.zecoryx.uz/install.sh | sh
Plan the exact model, build the squeeze profile, then verify it locally:
    lightweight doctor
    lightweight squeeze plan qwen:32b --target-ram 16gb --target-vram 6gb
    lightweight squeeze build qwen:32b --profile auto --target-ram 16gb --target-vram 6gb
    lightweight squeeze verify qwen:32b --run
    lightweight squeeze report qwen:32b
    lightweight chat qwen:32b --thermal balanced
View all downloaded models on your machine:
    lightweight list
Remove a model to free up disk space:
    lightweight rm llama3:8b
Turn your machine into an OpenAI-compatible local API endpoint:
    lightweight serve --backend python --port 8000
    # Uses the active squeeze profile and serves http://localhost:8000/v1
Send a request from any app:
    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "qwen:32b", "messages": [{"role": "user", "content": "Hello!"}]}'
Endpoints
POST /v1/chat/completions
GET /v1/models
GET /v1/health
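For callers that prefer a script over curl, the same chat request can be built with the Python standard library. This is a minimal sketch, assuming the server from the serve example is listening on http://localhost:8000/v1 and that responses follow the usual OpenAI layout (`choices[0].message.content`), which the page implies but does not spell out. The helper names `build_chat_request`, `send`, and `chat` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # address shown in the serve example


def build_chat_request(model, user_text, base_url=BASE_URL):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {"model": model, "messages": [{"role": "user", "content": user_text}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=120):
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def chat(model, user_text):
    """Return the assistant text, assuming an OpenAI-style response shape."""
    reply = send(build_chat_request(model, user_text))
    return reply["choices"][0]["message"]["content"]
```

With the server running, `chat("qwen:32b", "Hello!")` sends the same body as the curl example. Keeping request construction separate from sending makes the payload easy to inspect without a live server.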
Options
--backend auto — choose python or llama-server
--thermal balanced — laptop-friendly limits
--moe-offload auto — active expert placement for MoE models
Use native llama-server when it is installed:
    lightweight serve --backend llama-server --model qwen:32b --spec ngram-cache
Compare squeeze profiles for a specific hardware target:
    lightweight squeeze plan llama3:70b --target-ram 16gb --target-vram 6gb
Switch the active squeeze profile used by chat and serve:
    lightweight squeeze profiles qwen:32b
    lightweight squeeze use qwen:32b balanced
Read GGUF metadata and benchmark real token speed:
    lightweight inspect-model qwen:32b
    lightweight probe qwen:32b --thermal balanced
    lightweight bench qwen:32b --tokens 64 --thermal balanced
Configure global CLI settings:
    lightweight config --edit
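The doctor, plan, build, verify, report, and bench steps shown in this section can also be chained from a script. The sketch below only assembles the commands exactly as they appear above and runs them in order. It assumes `lightweight` is on PATH and that the CLI exits with a nonzero status on failure, which this page does not state. The function names are hypothetical.

```python
import shutil
import subprocess


def squeeze_workflow(model, target_ram, target_vram, thermal="balanced"):
    """Return the documented commands, in order, as argument lists."""
    targets = ["--target-ram", target_ram, "--target-vram", target_vram]
    return [
        ["lightweight", "doctor"],
        ["lightweight", "squeeze", "plan", model, *targets],
        ["lightweight", "squeeze", "build", model, "--profile", "auto", *targets],
        ["lightweight", "squeeze", "verify", model, "--run"],
        ["lightweight", "squeeze", "report", model],
        ["lightweight", "bench", model, "--tokens", "64", "--thermal", thermal],
    ]


def run_workflow(commands):
    """Run each command in turn and stop at the first failure.

    Assumes a nonzero exit status signals failure (not confirmed by the source).
    """
    if shutil.which("lightweight") is None:
        raise RuntimeError("lightweight is not on PATH")
    for command in commands:
        subprocess.run(command, check=True)
```

For example, `run_workflow(squeeze_workflow("qwen:32b", "16gb", "6gb"))` would repeat the Simple Start sequence for the same hardware target used in the examples above.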