
LightWeight

Run the exact local model you choose, with squeeze planning, verification, and an active-set memory policy matched to your hardware.

Abdullayev L (@zecoryx)
Abu Bakr (@nafderlin)

01. Why LightWeight?

Most tools just run a model. LightWeight plans, builds, verifies, and reports the best exact-model profile for your actual RAM, VRAM, context, and thermal limits.

Big AI Model
  Previously needed: Manual fit math. Guess quant and context.
  With LightWeight: squeeze plan. Profile before download.

Ultra-Heavy Model
  Previously needed: Crash on load. No recovery path.
  With LightWeight: verify + report. Measured on your PC.

Giant Model (Kimi)
  Previously needed: Custom API work. Glue code required.
  With LightWeight: active profile. Chat + /v1 use the same squeeze profile.

02. How is it possible?

01 / Built for Your PC

Squeeze plan compares quality, balanced, fit, and extreme profiles before download so you know the memory and quality risk up front.
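The "fit math" that a plan step automates can be sketched in a few lines. This is only an illustration of the idea, not LightWeight's actual planner: the profile names come from the text above, but the bits-per-weight figures, the fixed overhead, and the helper names (`Profile`, `weight_bytes`, `plan`) are assumptions made up for the example. It also treats RAM and VRAM as one pooled budget and ignores the KV cache.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class Profile:
    name: str
    bits_per_weight: float  # hypothetical average, NOT LightWeight's real values


# Illustrative numbers only; the real profiles are defined by the tool.
PROFILES = [
    Profile("quality", 6.5),
    Profile("balanced", 4.8),
    Profile("fit", 3.5),
    Profile("extreme", 2.6),
]


def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in bytes."""
    return n_params * bits_per_weight / 8


def plan(n_params: float, ram_gib: float, vram_gib: float, overhead_gib: float = 2.0):
    """Return (profile name, weights in GiB, fits?) for every profile.

    The budget is RAM + VRAM minus a flat allowance for the OS and runtime,
    which is a deliberate simplification.
    """
    budget = (ram_gib + vram_gib - overhead_gib) * GIB
    rows = []
    for profile in PROFILES:
        size = weight_bytes(n_params, profile.bits_per_weight)
        rows.append((profile.name, round(size / GIB, 1), size <= budget))
    return rows
```

Calling `plan(32e9, ram_gib=16, vram_gib=6)` mirrors the 32B model on a 16 GB RAM / 6 GB VRAM target used in the commands further down. Under these made-up bit widths, only the heaviest profile exceeds the budget.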

02 / No Heat, No Noise

Thermal modes cap threads, context, KV cache, and offload pressure so the model runs inside realistic laptop limits.

03 / Smart Memory Logic

LightWeight combines GGUF metadata, active squeeze profiles, MoE expert placement, KV cache quantization, mmap, prompt cache, probe results, and OOM recovery.
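To see why KV cache quantization matters for the memory budget, the standard size estimate for a transformer KV cache can be written out directly. The function name and the example dimensions below are illustrative assumptions, not values read from any particular model or from LightWeight itself.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_len: int,
    bytes_per_element: float,
) -> float:
    """Estimate KV cache size: one K and one V tensor per layer."""
    # 2 accounts for keys and values; each stores
    # n_kv_heads * head_dim values per token per layer.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_element


# Illustrative dimensions (hypothetical model): 64 layers, 8 KV heads, head size 128.
f16_cache = kv_cache_bytes(64, 8, 128, context_len=8192, bytes_per_element=2)
q8_cache = kv_cache_bytes(64, 8, 128, context_len=8192, bytes_per_element=1)
```

With these dimensions the 16-bit cache takes 2 GiB at an 8192-token context, and an 8-bit cache roughly halves that. Real 8-bit block formats store small per-block scales, so the true figure is slightly above 1 byte per element.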

04 / Truly Portable

The same CLI flow works in terminal chat, browser chat, and the local /v1 API without changing the selected model identity.

03. Measured Runtime

Stage | Tool | What you learn
Before download | doctor / check | Squeeze profile, quant, context, and active-set policy
After download | inspect / probe | Build profile and GGUF metadata
Before daily use | bench / chat | Verify/report: load time, first token, speed, RAM delta
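The first-token and speed figures in the last row can be measured from any token stream. The sketch below is a generic timing harness, not the implementation behind `bench`; the `measure_stream` name and the returned keys are assumptions for illustration. RAM delta is left out because it needs a platform-specific memory reading.

```python
import time
from typing import Callable, Iterable


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Consume a token stream and report first-token latency and decode speed."""
    start = clock()
    first_token_s = None
    count = 0
    for _ in tokens:
        now = clock()
        if first_token_s is None:
            # Time spent before the first token is mostly prompt processing.
            first_token_s = now - start
        count += 1
    total_s = clock() - start
    # Decode speed excludes the wait for the first token.
    decode_s = total_s - (first_token_s or 0.0)
    speed = (count - 1) / decode_s if count > 1 and decode_s > 0 else 0.0
    return {
        "tokens": count,
        "first_token_s": first_token_s,
        "total_s": total_s,
        "tokens_per_s": speed,
    }
```

Any iterable of tokens works as input, for example a generator that yields chunks from a streaming response. The injectable `clock` keeps the function easy to check with fixed timestamps.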

04. Simple Start

Windows (PowerShell):

$ irm https://lightweight.zecoryx.uz/install.ps1 | iex

macOS / Linux:

$ curl -fsSL https://lightweight.zecoryx.uz/install.sh | sh

Plan the exact model, build the squeeze profile, then verify it locally:

$ lightweight doctor
$ lightweight squeeze plan qwen:32b --target-ram 16gb --target-vram 6gb
$ lightweight squeeze build qwen:32b --profile auto --target-ram 16gb --target-vram 6gb
$ lightweight squeeze verify qwen:32b --run
$ lightweight squeeze report qwen:32b
$ lightweight chat qwen:32b --thermal balanced

View all downloaded models on your machine:

$ lightweight list

Remove a model to free up disk space:

$ lightweight rm llama3:8b

Turn your machine into an OpenAI-compatible local API endpoint:

$ lightweight serve --backend python --port 8000
# Uses the active squeeze profile and serves http://localhost:8000/v1

Send a request from any app:

$ curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen:32b", "messages": [{"role": "user", "content": "Hello!"}]}'
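The same request can be sent from Python with only the standard library. This is a minimal sketch that assumes the server from the previous step is listening on port 8000 and that the response follows the OpenAI chat completions shape; the helper names `build_chat_request` and `send` are hypothetical.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for the /chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 120.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumed response shape: OpenAI-style choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, `send(build_chat_request("http://localhost:8000/v1", "qwen:32b", "Hello!"))` should return the reply as a string. Building the request separately from sending it lets you inspect the payload without a live server.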

Endpoints

POST /v1/chat/completions

GET /v1/models

GET /v1/health
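A client that starts right after the server may want to wait until the health endpoint answers. The sketch below assumes that `GET /v1/health` returns HTTP 200 once the server is ready; the page does not document the response, so treat that as a guess. The names `is_healthy` and `wait_until_ready` are hypothetical.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def is_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/health answers with status 200 (assumed contract)."""
    try:
        with urllib.request.urlopen(f"{base_url.rstrip('/')}/health", timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False


def wait_until_ready(
    check: Callable[[], bool],
    attempts: int = 30,
    delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `check` until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

A typical call would be `wait_until_ready(lambda: is_healthy("http://localhost:8000/v1"))`. Large models can take a while to load, so a generous attempt count is reasonable.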

Options

--backend auto — choose python or llama-server

--thermal balanced — laptop-friendly limits

--moe-offload auto — active expert placement for MoE models

Use native llama-server when it is installed:

$ lightweight serve --backend llama-server --model qwen:32b --spec ngram-cache

Compare squeeze profiles for a specific hardware target:

$ lightweight squeeze plan llama3:70b --target-ram 16gb --target-vram 6gb

Switch the active squeeze profile used by chat and serve:

$ lightweight squeeze profiles qwen:32b
$ lightweight squeeze use qwen:32b balanced

Read GGUF metadata and benchmark real token speed:

$ lightweight inspect-model qwen:32b
$ lightweight probe qwen:32b --thermal balanced
$ lightweight bench qwen:32b --tokens 64 --thermal balanced

Configure global CLI settings:

$ lightweight config --edit

Coming Soon

01 / Voice Tasks
02 / Image to Text
03 / Image to Video
04 / Better CLI
05 / Improve Performance