Not every workload should hit a paid API. Open-weight models run on your hardware — no data leaves your network, no per-token cost after setup, no rate limits.
Start with Ollama — one command to download and serve Llama, Mistral, Gemma, or DeepSeek locally. Use the OpenAI-compatible API so your existing code works with a URL change. LM Studio adds a GUI if you prefer it.
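A minimal sketch of that URL swap, assuming Ollama is serving on its default port (11434) and you've already pulled a model with `ollama run llama3`; the model name and prompt are placeholders:

```ts
import OpenAI from "openai";

// Point the standard OpenAI client at Ollama's OpenAI-compatible endpoint.
// An API key is required by the client but ignored by Ollama.
const client = new OpenAI({
  baseURL: "http://localhost:11434/v1",
  apiKey: "ollama",
});

const response = await client.chat.completions.create({
  model: "llama3", // any model you've pulled locally
  messages: [{ role: "user", content: "Summarise RFC 2119 in one sentence." }],
});

console.log(response.choices[0].message.content);
```

The same swap works for LM Studio, whose local server speaks the same OpenAI-compatible protocol (on port 1234 by default).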
| Model | Strengths |
|---|---|
| Llama 3 (Meta) | Default open-weight choice. 8B for local dev, 70B+ for production. |
| Mistral / Mixtral | Fast, efficient, strong at code. Mixtral is mixture-of-experts. |
| DeepSeek | Competitive with closed models on reasoning tasks. |
| Gemma (Google) | Small, efficient, good for edge deployment. |
| Qwen (Alibaba) | Strong multilingual and code performance. |
Hugging Face Hub is the registry — models, datasets, and spaces. Learn to navigate model cards, understand quantisation levels (Q4, Q5, Q8 — smaller = faster but less accurate), and pick the right model size for your hardware.
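To make the size trade-off concrete, here is a hedged sketch that queries the Hub's public REST API for a GGUF repo's file list and picks the largest quantised file that fits a RAM budget. The repo name, the 8 GiB budget, and the "largest file that fits" heuristic are all illustrative assumptions, not a recommendation:

```ts
// Sketch: list a repo's GGUF files via the public Hub API and pick a quant level.
const repo = "TheBloke/Mistral-7B-Instruct-v0.2-GGUF"; // hypothetical repo choice

type Sibling = { rfilename: string; size?: number };

// blobs=true asks the Hub to include per-file sizes in the response.
const res = await fetch(`https://huggingface.co/api/models/${repo}?blobs=true`);
const meta = (await res.json()) as { siblings: Sibling[] };

const ramBudget = 8 * 1024 ** 3; // assume ~8 GiB of free memory
const candidates = meta.siblings
  .filter((f) => f.rfilename.endsWith(".gguf") && f.size !== undefined)
  .filter((f) => f.size! < ramBudget) // model file must fit in memory
  .sort((a, b) => b.size! - a.size!); // largest that fits ≈ least quantised

console.log(candidates[0]?.rfilename); // e.g. a Q5 file on an 8 GiB budget
```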
Transformers.js runs models in the browser via WebAssembly/WebGPU. Small embedding and classification models work; large generative models don’t — yet.
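For instance, a small embedding model running entirely client-side, sketched with Transformers.js; the model is one of the ONNX-converted checkpoints on the Hub, and the pooling options follow the library's feature-extraction pipeline:

```ts
import { pipeline } from "@xenova/transformers";

// Downloads and caches a small quantised ONNX embedding model (~25 MB),
// then runs it locally — no server round-trip.
const embed = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2");

const output = await embed("Open-weight models run on your hardware.", {
  pooling: "mean",   // average token embeddings into one vector
  normalize: true,   // unit-length output, ready for cosine similarity
});

console.log(output.dims); // [1, 384] — one 384-dimensional embedding
```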
Use closed APIs for best quality and fastest iteration. Use open models when:
- data can't leave your network (privacy, compliance)
- per-token costs dominate at your volume
- you need to avoid rate limits or run offline
Ollama · LM Studio · Hugging Face Hub · vLLM · Transformers.js