Stage 1: LLM Fundamentals
⏱ Time estimate: 1 week (~5-8 hours)
👋 Coming from Stage 0? Nice — your toolchain is set. The next 5-8 hours: your first working call to Claude / GPT / Gemini, how token / context window / temperature shape the output, and per-token cost estimation. Jumped straight here? Make sure you can run a Python script and have an API key from one provider — if not, head back to Stage 0.
💡 Don't recognize a term? (LLM / token / context window / temperature / RAG / agent / …) → check resources/glossary.en.md for 30-second definitions.
3 Core Terms (memorize these; all later stages use them)
| Term | Chinese | One-liner |
|---|---|---|
| token | 詞元 | The unit LLMs use to count text length and to price usage (1 Chinese char ≈ 1.5-2 tokens; 1 English word ≈ 1.3 tokens) |
| context window | 上下文視窗 | How many tokens the model sees at once (Claude 1M / GPT ~400k / Gemini 2M) |
| temperature | 隨機程度參數 | Controls how stable or creative the output is (0 = most stable and close to deterministic, 1 = creative; use 0.0-0.3 for classification, 0.7-1.0 for creative writing) |
→ These 3 terms run through every later stage. The goal of Stage 1 is to call the API yourself and feel firsthand how they shape the output.
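The ratios in the table can be turned into a back-of-the-envelope estimator. This is a minimal sketch, not a real tokenizer: `estimate_tokens` and `fits_context` are hypothetical helper names, the ratios are the rough figures quoted above, and actual counts differ from model to model, so treat the result as a sanity check only.

```python
import re

# Rough ratios from the table above; real tokenizers differ per model.
TOKENS_PER_ENGLISH_WORD = 1.3
TOKENS_PER_CJK_CHAR = (1.5, 2.0)  # (low, high)

CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")   # common Chinese characters
ENGLISH_WORD = re.compile(r"[A-Za-z0-9']+")


def estimate_tokens(text):
    """Return a (low, high) token estimate for mixed English/Chinese text."""
    cjk_chars = len(CJK_CHAR.findall(text))
    words = len(ENGLISH_WORD.findall(text))
    english = words * TOKENS_PER_ENGLISH_WORD
    low = english + cjk_chars * TOKENS_PER_CJK_CHAR[0]
    high = english + cjk_chars * TOKENS_PER_CJK_CHAR[1]
    return low, high


def fits_context(text, context_window, reserve_for_output=1000):
    """Check the pessimistic estimate against a context window, leaving room for the reply."""
    _, high = estimate_tokens(text)
    return high + reserve_for_output <= context_window


for sample in ("Hello world", "你好"):
    low, high = estimate_tokens(sample)
    print(f"{sample!r}: roughly {low:.1f}-{high:.1f} tokens")
print("fits in 200k window:", fits_context("Hello world", 200_000))
```

Once you have made a real API call, compare this estimate with the token counts the API reports back; the gap shows how rough the rule of thumb is.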
📌 Learning Goals
After this stage you will be able to:
- Explain what an LLM is, what tokens are, and what context window means
- Make your first API call to Claude / GPT / Gemini and parse the response
- Compare the four major LLM families (Claude / GPT / Gemini / Llama) on strengths
- Estimate cost per task using per-token pricing
🌐 Major LLM Family Comparison (2026-05 snapshot)
"How is Claude different from GPT?" "Can I use Chinese models?" "Which OSS model should I run with Ollama?" This section gives you an objective side-by-side view. It does not declare a single "best" model: it compares strengths / good-fit tasks / weaknesses and includes official docs URLs so you can verify the claims yourself.
💡 First, a few terms:
- Context window = the amount of conversation an LLM can remember in one pass; it is capped (for example, 200k tokens is roughly 100k-130k Chinese characters at 1.5-2 tokens per character)
- Apache 2.0 / MIT = open-source terms that permit commercial use, modification, and closed-source redistribution; Llama Community License = open weights with conditions (for example, organizations with more than 700M monthly active users must request a separate license from Meta)
- Frontier model = each provider's strongest flagship; OSS = open-source, with weights downloadable for self-hosting
🇺🇸 US Commercial Frontier (3 providers)
These 3 are SaaS APIs: you pay per token and cannot self-host them.
| Model family | Flagship (2026-05) | Context | Strengths | Best for | Official docs |
|---|---|---|---|---|---|
| Claude (Anthropic) | Opus 4.8 / Sonnet 4.6 / Haiku 4.5 | 1M (Haiku 4.5 is 200k) | long-form / coding / agent / safety alignment | writing papers / code review / agent runtime | platform.claude.com/docs |
| GPT (OpenAI) | GPT-5.5 / GPT-5 / o-series | ~400k | general-purpose / function calling / broadest ecosystem | broad queries / function-call frameworks / GPTs ecosystem | platform.openai.com/docs/models |
| Gemini (Google) | 3.1 Pro / Flash | 2M (Pro series; Flash is 1M) | long context / native multimodal / Google integration | PDF / video and audio / large document sets / Google Workspace | ai.google.dev |
🇨🇳 Chinese Commercial + Open-Source Frontier (7 providers)
These are the main choices for Chinese-language work. Some are API-only (DeepSeek / Kimi / Hunyuan); others also release OSS weights (Qwen / GLM-5.1 / Yi can run through Ollama).
| Model family | Flagship (2026-05) | Context | Strengths | Best for | License | Official |
|---|---|---|---|---|---|---|
| DeepSeek | V3 (deepseek-chat) / R1 (deepseek-reasoner); ⚠️ V4-series weights are open-source, consumer API is not fully public yet | 128k | reasoning / coding / lowest cost | high-token workloads / code generation / math | API proprietary; some weights OSS on HF | api-docs.deepseek.com |
| Qwen (Alibaba) | Qwen3 (cloud DashScope + Apache 2.0 OSS) | 128k+ | strongest Chinese OSS / multimodal / agent | Chinese long-form writing / agent / self-host | Apache 2.0 (OSS) + proprietary (cloud) | qwen.ai · DashScope |
| Kimi (Moonshot) | K2.6 multimodal + Agent | very long context (1M+) | long context / Chinese long-form writing | whole-book reading / literature triage | Proprietary | platform.moonshot.cn |
| GLM (Zhipu) | GLM-5 proprietary / GLM-5.1 Apache 2.0 | 128k | Chinese / tool use / agent | Chinese agents / multi-turn chat | proprietary + Apache 2.0 (5.1) | open.bigmodel.cn · chatglm.cn |
| Hunyuan (Tencent) | T1 (deep-thinking, Transformer-Mamba MoE) + TurboS | 128k | DeepSeek R1-comparable reasoning, Chinese | Chinese reasoning / Tencent ecosystem | Proprietary | hunyuan.tencent.com |
| MiniMax | abab6.5 + M2.7 | 200k | multimodal / Chinese long prose | Chinese writing / video and audio multimodal | Proprietary | platform.minimax.io |
| Yi (01.AI / Kai-Fu Lee) | Yi-Lightning (new API flagship) / Yi-34B-Chat (OSS, 200k context) | 200k | Chinese OSS alternative to Llama | Chinese self-host / Chinese API | Apache 2.0 (OSS) / proprietary (Lightning) | 01.ai · GitHub |
⚠️ Xiaomi MiMo is listed in resources/cli-agents-guide.md for Hermes Agent routing, but as of 2026-05 there is no authoritative official source to verify it, so it is not included in this table. To try it, connect through Hermes Agent's 200+ provider routing.
🌍 Western Open-Source (4 providers, self-host defaults)
These are the main choices for running on your own hardware, avoiding API fees, or handling privacy-sensitive work. You can install them in one command through Ollama.
| Model family | Active size | License | Strengths | Best for | Official |
|---|---|---|---|---|---|
| Llama (Meta) | 3.3 70B (dense); Llama 4 Scout / Maverick (MoE, 17B active, released 2025-04) | Llama Community License | general-purpose / broadest ecosystem / Ollama default | self-hosting intro / fine-tune base | llama.com · HF Meta |
| Gemma (Google) | Gemma 4 26B MoE + 31B dense (released 2026-04; Arena #3) | Apache 2.0 | small and efficient / strong Apple MLX integration / multimodal | edge / mobile / 4-8 GB RAM machines | ai.google.dev/gemma |
| Mistral (Mistral AI) | 7B / Mixtral 8x7B / Codestral | Apache 2.0 (OSS parts) | strongest open-source 7B class | commercial self-host / EU sovereignty | mistral.ai · HF Mistral |
| Phi (Microsoft) | Phi-4 14B reasoning + Phi-4-multimodal-instruct (multimodal version) | MIT | small but strong / reasoning / edge-friendly | 4 GB+ RAM / mobile / reasoning intro | HF microsoft |
🎯 Which One Should I Pick? (by scenario)
| Your scenario | Pick + why |
|---|---|
| First time learning an LLM API, prioritize complete tutorials | Claude — Anthropic Cookbook + Courses are widely considered the most complete |
| Long-form writing / papers / code review | Claude Sonnet — long-form prose is a core strength |
| Multimodal (PDF / video and audio / images) | Gemini or Kimi — native multimodal |
| Broad queries + function calling frameworks | GPT — broadest ecosystem and deepest SDK integration |
| Chinese scenarios + commercial API | Kimi (strong long context; can fit whole books), DeepSeek (lowest cost), or GLM (agent-friendly) |
| Chinese scenarios + open-source self-host | Qwen 3 (Apache 2.0; currently the strongest Chinese OSS) |
| Reasoning / math (reasoning model) | DeepSeek R1 / Hunyuan T1 / OpenAI o-series |
| Privacy / offline / no API fees | Llama 3.3 / Gemma 4 / Qwen 3 OSS via Ollama |
| Edge / 4 GB RAM machine | Gemma 4 / Phi-4 / Qwen 3 (qwen3-3B or smaller variants) |
| 100k+ token large documents | Gemini 3.1 (2M context) or Kimi K2.6 (1M+) |
| Want the lowest cost (API-bill sensitive) | DeepSeek V4-Flash — lowest token price among same-tier English models |
📊 Neutral Benchmark Resources (verify for yourself; do not rely on one source)
| Resource | Use | URL | 2026-05 status |
|---|---|---|---|
| Artificial Analysis | Third-party benchmarks plus price/latency aggregation, including Chinese models | https://artificialanalysis.ai/ | ✓ Active |
| Arena AI (formerly LMSYS Chatbot Arena) | Human blind-test ELO leaderboard | https://arena.ai/leaderboard/text | ✓ Active |
| Vellum LLM leaderboard | Aggregates multiple benchmarks | https://www.vellum.ai/llm-leaderboard | ✓ Active |
| HuggingFace OpenLLM Leaderboard | Open-source model rankings | https://huggingface.co/spaces/open-llm-leaderboard | ⚠️ Occasional runtime errors as of 2026-05; use the Arena AI open-source tab as fallback |
| SuperCLUE | Authoritative benchmark for Chinese-language scenarios | https://www.superclueai.com/ | ✓ Active |
⚠️ Important Caveats
- ⚠️ Benchmark != production performance: run a small eval on your specific task (for example, paste 10 real prompts and see which model answers closest to what you need); do not pick only from rankings
- ⚠️ Frontier changes every 6 months: all numbers above are a 2026-05 snapshot; afterward, rely on official docs / Artificial Analysis
- ⚠️ "Strength" is relative, not absolute: every frontier model can handle basic tasks; differences matter at the margin
- ⚠️ For Chinese scenarios, check SuperCLUE: general international benchmarks such as MMLU are English-heavy, and Chinese-language performance may diverge
🚪 Entry Conditions
You should already:
- Be able to run a Python script
- Know what HTTP / REST is conceptually
- Have an API key from at least one provider (Anthropic / OpenAI / Google)
If not — go back to Stage 0 first.
📚 Required Reading
- Anthropic — Claude Model Overview — official model family overview, including 2026's latest Opus 4.8 / Sonnet 4.6 / Haiku 4.5
- anthropics/courses — Anthropic API Fundamentals ⭐⭐⭐⭐⭐ ★ 21k+ — Anthropic's official 5-course umbrella; module 1 "Anthropic API Fundamentals" maps to this stage. Jupyter notebooks, runs on Claude 3 Haiku (cheapest), hands-on walkthrough of API essentials.
- OpenAI Quickstart — first API call walkthrough
- A Visual Guide to LLM Tokenizers — Hugging Face's intro
- Anthropic API Pricing — read the pricing table, calculate cost for 1k input + 1k output
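The "1k input + 1k output" calculation from the last reading item can be worked through offline. This sketch assumes the per-1M-token prices used in Exercise 3 of this page; they are a snapshot, so verify them against the official pricing table before relying on the numbers. `cost_usd` is a hypothetical helper name.

```python
# USD per 1M tokens, copied from the pricing dictionary in Exercise 3 (verify before use).
PRICING = {
    "claude-haiku-4-5": {"input": 1.00, "output": 5.00},
    "claude-sonnet-4-6": {"input": 3.00, "output": 15.00},
    "claude-opus-4-8": {"input": 5.00, "output": 25.00},
}


def cost_usd(model, input_tokens, output_tokens):
    """Input and output tokens are billed at different rates, so price them separately."""
    rates = PRICING[model]
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000


for model in PRICING:
    print(f"{model:<20} ${cost_usd(model, 1_000, 1_000):.4f}")
# claude-haiku-4-5     $0.0060
# claude-sonnet-4-6    $0.0180
# claude-opus-4-8      $0.0300
```

Output tokens dominate the bill here: at these rates each output token costs five times as much as an input token.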
🛠 Hands-on Exercises (foundational, illustrative)
🦙 This stage defaults to Ollama (cost-driven; gemma4:e4b runs locally for $0/run). Every exercise has Path A (Ollama, default) + Path B (Anthropic, optional; use it when you want to see cloud-quality answers). Full three-path trade-off in examples/README.en.md.

💰 Stage 1 budget estimate (all 6 exercises, 3-5 runs each): all local = $0, all haiku ≈ $0.30, all sonnet ≈ $0.90. Full model list + Stage 1-7 total budget: examples/README.en.md#recommended-llm-list.

💡 No Ollama yet? Each exercise also ships a Path B Anthropic version; pick one. To enable Path A in one step: pip install openai && ollama pull gemma4:e4b.
Exercise 1: LLM API (hello world)
Five-line Python script that calls an LLM and prints the response. Defaults to local Ollama (free, offline); switch to Path B Anthropic when you want cloud-quality answers. Details in examples/README.en.md.
📋 Starter code — Path A (local Ollama gemma4:e4b, default) (copy to practice_1.py and run python practice_1.py)
```python
# Requires: pip install openai  (the OpenAI-compatible SDK talks to Ollama)
# Pre-req: Ollama is running (`ollama serve` if it is not already) and gemma4:e4b is pulled
import sys

# Force UTF-8 output so non-ASCII replies print cleanly on every console
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Ollama doesn't check this; anything works
)
r = client.chat.completions.create(
    model="gemma4:e4b",  # swap to qwen2.5:3b / llama3.2:3b if preferred
    max_tokens=100,
    messages=[{"role": "user", "content": "Introduce yourself in one sentence."}],
)

# === Self-check ===
text = r.choices[0].message.content
print("Response:", text)
print("usage:", r.usage)
assert r.choices[0].finish_reason in ("stop", "length"), f"unexpected finish_reason: {r.choices[0].finish_reason}"
assert len(text) > 0, "response should not be empty"
assert r.usage.completion_tokens > 0, "output token count should be > 0"
print("✅ Exercise 1 passed: local Ollama gemma4:e4b answered for $0")
```
📋 Starter code — Path B (Anthropic API, optional, when you want cloud quality) (copy to practice_1_anthropic.py)
```python
# Requires: pip install anthropic
# Env: export ANTHROPIC_API_KEY=sk-ant-...
import sys

# Force UTF-8 output so non-ASCII replies print cleanly on every console
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
msg = client.messages.create(
    model="claude-haiku-4-5",  # haiku = cheapest; switch to sonnet by changing this line
    max_tokens=100,
    messages=[{"role": "user", "content": "Introduce yourself in one sentence."}],
)

# === Self-check ===
text = msg.content[0].text
print("Response:", text)
print("usage:", msg.usage)
assert msg.stop_reason in ("end_turn", "max_tokens"), f"unexpected stop_reason: {msg.stop_reason}"
assert len(text) > 0, "response should not be empty"
assert msg.usage.input_tokens > 0 and msg.usage.output_tokens > 0, "token counts should be > 0"
print("✅ Exercise 1 passed: Anthropic API is reachable from your machine")
```
Exercise 2: Tokens
Run the same prompt repeatedly (the starter code uses 10-20 runs; raise it toward 100 if you have the time or budget) and watch token counts vary.
- Notice: temperature ≠ 0 produces variation
- Notice: token count for the SAME English vs Chinese sentence
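Why a non-zero temperature produces variation can be sketched without calling any model. The toy below applies the standard softmax-with-temperature formula to made-up next-token scores; the candidate words and scores are hypothetical, and real providers may apply further sampling settings on top, so this illustrates the idea rather than any specific API's behavior.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: all probability on the top score.
        top = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == top else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical next-token scores for three candidate words.
candidates = ["sleeping", "jumping", "meowing"]
logits = [2.0, 1.0, 0.5]

rng = random.Random(0)  # seeded so the demo is repeatable
for t in (0.0, 0.3, 1.0):
    probs = softmax_with_temperature(logits, t)
    picks = rng.choices(candidates, weights=probs, k=10)
    print(f"T={t}: probs={[round(p, 2) for p in probs]} distinct picks={len(set(picks))}")
```

At T=0 the top candidate always wins; as T rises the probabilities flatten, so repeated runs start choosing different words, and different words lead to different output lengths.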
📋 Starter code — Path A (local Ollama gemma4:e4b, default) (copy to practice_2.py)
```python
# Requires: pip install openai
# Pre-req: Ollama is running (`ollama serve` if it is not already) and gemma4:e4b is pulled
import statistics
import sys

if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

PROMPTS = {
    "Chinese": "用一句話描述一隻貓在做什麼。",
    "English": "Describe in one sentence what a cat is doing.",
}
N = 10  # local is slower; start small

for label, prompt in PROMPTS.items():
    output_tokens = []
    for _ in range(N):
        r = client.chat.completions.create(
            model="gemma4:e4b",
            max_tokens=80,
            temperature=1.0,  # high temp to amplify variance
            messages=[{"role": "user", "content": prompt}],
        )
        output_tokens.append(r.usage.completion_tokens)
    print(f"\n[{label}] prompt: {prompt}")
    print(f"  input tokens: {r.usage.prompt_tokens}")
    print(f"  output tokens: min={min(output_tokens)} max={max(output_tokens)} mean={statistics.mean(output_tokens):.1f} stdev={statistics.stdev(output_tokens):.1f}")

# === Self-check (uses the samples from the last prompt in the loop) ===
assert max(output_tokens) > min(output_tokens), "with temperature=1.0, output length should vary (if every run hit max_tokens, raise the limit)"
print("\n✅ Exercise 2 passed: observed temperature → token variance, $0/run")
print("💡 Chinese prompts typically use MORE input tokens (one Chinese character ≈ 2 tokens)")
```
📋 Starter code — Path B (Anthropic API, optional) (copy to practice_2_anthropic.py)
```python
# Requires: pip install anthropic
import statistics
import sys

if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

import anthropic

client = anthropic.Anthropic()
PROMPTS = {"Chinese": "用一句話描述一隻貓在做什麼。", "English": "Describe in one sentence what a cat is doing."}

for label, prompt in PROMPTS.items():
    output_tokens = []
    for _ in range(20):  # 20 runs per prompt
        msg = client.messages.create(
            model="claude-haiku-4-5",
            max_tokens=80,
            temperature=1.0,
            messages=[{"role": "user", "content": prompt}],
        )
        output_tokens.append(msg.usage.output_tokens)
    print(f"[{label}] input={msg.usage.input_tokens} output min/max/mean={min(output_tokens)}/{max(output_tokens)}/{statistics.mean(output_tokens):.1f}")
```
Exercise 3: Pricing / Latency
Required for cost-sensitive work: measure how long 1000 hello-world inferences would take and how much they would cost. Local Ollama costs $0 but is slower; cloud LLMs cost money but respond faster. Knowing this trade-off is how you pick the right model.
📋 Starter code — Path A (local Ollama gemma4:e4b, measure latency) (copy to practice_3.py)
```python
# Requires: pip install openai
# Pre-req: Ollama is running (`ollama serve` if it is not already) and gemma4:e4b is pulled
import sys
import time

if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

latencies, out_tokens = [], []
for _ in range(5):  # the first run may be slower while the model loads into memory
    t0 = time.perf_counter()
    r = client.chat.completions.create(
        model="gemma4:e4b",
        max_tokens=200,
        messages=[{"role": "user", "content": "Hi! Please introduce yourself."}],
    )
    latencies.append(time.perf_counter() - t0)
    out_tokens.append(r.usage.completion_tokens)

# Average over all runs (not just the last one) so tokens/sec is consistent
avg_latency = sum(latencies) / len(latencies)
out_tok_avg = sum(out_tokens) / len(out_tokens)
tps = out_tok_avg / avg_latency if avg_latency > 0 else 0

print("model: gemma4:e4b (local)")
print(f"5 latencies (sec): min={min(latencies):.2f} max={max(latencies):.2f} mean={avg_latency:.2f}")
print(f"avg output: {out_tok_avg:.0f} tokens, ~{tps:.1f} tokens/sec")
print(f"\n1000-run cost: $0 (local); projected duration: {avg_latency * 1000 / 60:.1f} minutes")

# === Self-check ===
assert avg_latency > 0, "latency should be > 0"
assert out_tok_avg > 0, "output token count should be > 0"
print(f"\n✅ Exercise 3 passed: local model is $0 but takes ~{avg_latency * 1000 / 60:.0f} min for 1000 runs")
print("💡 Compare Path B Anthropic: 1000 runs is ~10-20 min at $0.25 (haiku)")
```
📋 Starter code — Path B (Anthropic API, compute $ cost) (copy to practice_3_anthropic.py)
```python
# Requires: pip install anthropic
import sys

if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

import anthropic

# Anthropic public pricing 2026 Q2 (per 1M tokens, USD); verify at https://www.anthropic.com/pricing
PRICING = {
    "claude-haiku-4-5": {"input": 1.00, "output": 5.00},
    "claude-sonnet-4-6": {"input": 3.00, "output": 15.00},
    "claude-opus-4-8": {"input": 5.00, "output": 25.00},  # Opus 4.8 (May 2026, Dynamic Workflows), same 5/25 pricing
}

client = anthropic.Anthropic()
MODEL = "claude-haiku-4-5"
msg = client.messages.create(
    model=MODEL,
    max_tokens=200,
    messages=[{"role": "user", "content": "Hi! Please introduce yourself."}],
)
in_tok, out_tok = msg.usage.input_tokens, msg.usage.output_tokens

# Input and output tokens are priced separately; prices are per 1M tokens
rates = PRICING[MODEL]
cost_one = (in_tok * rates["input"] + out_tok * rates["output"]) / 1_000_000
print(f"model: {MODEL}")
print(f"single: input={in_tok} output={out_tok} → ${cost_one:.6f}")

# Project the same token counts onto every tier for 1000 calls
print("1000 calls cost across model tiers:")
for name, tier in PRICING.items():
    c = (in_tok * tier["input"] + out_tok * tier["output"]) / 1_000_000 * 1000
    print(f"  {name:<22} ${c:.4f}")

assert cost_one > 0, "Cloud LLM always has a cost"
print("\n✅ Exercise 3 passed (Anthropic): the table above is your 1000-run cost per tier")
```
Sample output (your token counts will vary):

```text
model: claude-haiku-4-5
single: input=14 output=48 → $0.000254
1000 calls cost across model tiers:
  claude-haiku-4-5       $0.2540
  claude-sonnet-4-6      $0.7620
  claude-opus-4-8        $1.2700
```
Exercise 4: Cross-Provider Comparison
Send the same prompt to Claude, GPT, and Gemini at the same time and compare their responses. Ask yourself why the same input produces different answers: style, length, and judgment all differ. Use the OpenAI, Anthropic, and Google SDKs side by side.
→ Starter template → examples/stage-1/04-cross-provider/ (parallel calls to all three SDKs + comparison table; missing keys are skipped gracefully; illustrative, not a chapter-length tutorial)
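The shape of such a comparison can be sketched independently of any SDK. The following is a minimal sketch, not the starter template itself: `compare_providers` is a hypothetical helper, the provider callables are stubs standing in for real SDK calls, and the environment variable names for OpenAI and Gemini are assumptions to check against each provider's docs.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def compare_providers(prompt, providers, env=os.environ):
    """Call every provider that has a key, in parallel.

    providers maps name -> (env_var_name, callable(prompt) -> str).
    Returns (results, skipped), where skipped lists providers without a key.
    """
    active = {name: fn for name, (key, fn) in providers.items() if env.get(key)}
    skipped = [name for name in providers if name not in active]
    results = {}
    if not active:
        return results, skipped
    with ThreadPoolExecutor(max_workers=len(active)) as pool:
        futures = {name: pool.submit(fn, prompt) for name, fn in active.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result(timeout=60)
            except Exception as exc:  # one failing provider should not sink the rest
                results[name] = f"ERROR: {exc}"
    return results, skipped


def fake_reply(style):
    """Stub standing in for a real SDK call; replace with your own function."""
    return lambda prompt: f"[{style}] {prompt}"


providers = {
    "claude": ("ANTHROPIC_API_KEY", fake_reply("claude")),
    "gpt": ("OPENAI_API_KEY", fake_reply("gpt")),        # assumed variable name
    "gemini": ("GEMINI_API_KEY", fake_reply("gemini")),  # assumed variable name
}
demo_env = {"ANTHROPIC_API_KEY": "sk-demo", "OPENAI_API_KEY": "sk-demo"}

results, skipped = compare_providers("Why is the sky blue?", providers, env=demo_env)
for name, reply in results.items():
    print(f"{name:<8} {len(reply):>4} chars  {reply}")
print("skipped (no key):", skipped)
```

Threads are enough here because each call spends its time waiting on the network; passing `env` as a parameter keeps the key check testable without touching your real environment.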
Exercise 5: Error Handling
Trigger error conditions deliberately and write retry logic:
- Wrong API key → see how it raises
- Over-long prompt → what happens when the context window is full
- Network drop → write a retry wrapper with exponential backoff
This is foundational for Stage 3-7's production agent code.
→ Starter template → examples/stage-1/05-error-handling/ (mock-based tests so you can verify the retry logic without unplugging your ethernet cable; illustrative, not a chapter-length tutorial)
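A retry wrapper with exponential backoff can be sketched in a few lines and verified with a mock, in the same spirit as the template. This is a minimal sketch under stated assumptions: `retry_with_backoff` is a hypothetical helper, and it retries only Python's built-in `ConnectionError` and `TimeoutError`. With a real SDK you would pass that SDK's own connection and rate-limit exception classes in `retry_on`, and leave authentication errors out, since a wrong key will not fix itself.

```python
import random
import time


def retry_with_backoff(fn, retries=4, base_delay=0.5, max_delay=8.0,
                       retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call fn(); on a retryable error wait base_delay * 2**attempt (capped), then retry."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == retries:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Small random jitter so many clients do not retry in lockstep
            sleep(delay + random.uniform(0, delay * 0.1))


# Mock: fails twice with a simulated network drop, then succeeds.
calls = {"n": 0}


def flaky_call():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network drop")
    return "ok"


waits = []  # record the delays instead of actually sleeping
result = retry_with_backoff(flaky_call, sleep=waits.append)
print(result, "after", calls["n"], "attempts; waits:", [round(w, 2) for w in waits])
```

Injecting `sleep` is the design choice that makes the test instant: the mock records each delay, so you can assert that the waits double without waiting for them.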
Exercise 6: Local LLM
No API fees, runs on your machine: use Ollama to pull a small model (we recommend llama3.2:3b or qwen2.5:3b) and call it through the OpenAI-compatible API.
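```bash
# 1. Install Ollama: https://ollama.com
# 2. Start the server if it is not already running (default port 11434)
ollama serve
# 3. In a second terminal, pull a small model
ollama pull qwen2.5:3b
```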
📋 Starter code (copy to practice_6.py)
```python
# Requires: pip install openai
# Pre-req: Ollama is running, qwen2.5:3b is pulled
import sys

if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Ollama doesn't check this; anything works
)
r = client.chat.completions.create(
    model="qwen2.5:3b",
    messages=[{"role": "user", "content": "Explain ReAct in 3 sentences."}],
)
text = r.choices[0].message.content
print("Response:", text)

# === Self-check ===
assert len(text) > 10, "response is too short; Ollama may not be running"
print("✅ Exercise 6 passed: local Ollama reachable through the OpenAI-compatible API")
print("💡 This run cost you $0 (except for electricity)")
```
🎯 Curated Projects
5 categories, 17 projects in one table. Pick by "Best for"; click through for depth on the repo / course site.
| Category | Project | ⭐ | Best for | Why / Notes |
|---|---|---|---|---|
| Official cookbook / starting point | Anthropic Cookbook | ⭐⭐⭐⭐⭐ | Starting with Claude API; reference lookup | Full-feature Claude API notebooks (tool use / batch / prompt cache), ★ 42k+, MIT |
| | Anthropic Courses | ⭐⭐⭐⭐⭐ | Systematic Claude learning from zero | Anthropic's own 5-course set (API fundamentals / prompt eval / real-world prompting / tool use), ★ 21k+. Start with anthropic_api_fundamentals |
| | OpenAI Cookbook | ⭐⭐⭐⭐⭐ | OpenAI API + structured output / function calling | Pair with Anthropic Cookbook, ★ 73k+, MIT. Much bigger than Anthropic's, so use search |
| | Anthropic Claude API Quickstart | ⭐⭐⭐⭐ | 5-minute start | Official docs, bookmark it |
| Chinese textbook (chapter-style) | datawhalechina/happy-llm | ⭐⭐⭐⭐⭐ | Chinese readers wanting LLM internals | Karpathy "Zero to Hero" Chinese counterpart, ★ 29k+. Equivalent to HF LLM Course in Chinese |
| | datawhalechina/llm-universe | ⭐⭐⭐⭐⭐ | Chinese newcomers building with LLM | API basics / knowledge base / RAG / advanced tricks, ★ 12k+ |
| | datawhalechina/llm-cookbook | ⭐⭐⭐⭐ | Full Chinese LLM learning path | Adapted Chinese translation of Andrew Ng's courses (⚠️ updates slowed after 2025-06, CC BY-NC-SA) |
| | jingyaogong/minimind | ⭐⭐⭐⭐ | Post-Karpathy, want a real training run | 2hr to train a 64M LLM from scratch: Pretrain + SFT + LoRA + DPO + RLHF, ★ 48k+, Apache-2.0 |
| English course (systematic) | HuggingFace — LLM Course | ⭐⭐⭐⭐⭐ | Transformer internals + HF ecosystem | Transformer theory + applications, Apache 2.0 |
| | LangChain Academy | ⭐⭐⭐⭐ | Visual learners who like video courses | LangChain's official free course, includes RAG / agent. Skip the LangChain marketing segments |
| Local execution (no API costs) | ollama/ollama | ⭐⭐⭐⭐⭐ | First-time local LLM | This repo's Path A default, OpenAI-compat API, ★ 170k+ |
| | ggml-org/llama.cpp | ⭐⭐⭐⭐⭐ | Understanding quantization / how 7B fits in 8GB RAM | Ollama's underlying inference engine, ★ 108k+, MIT |
| | mudler/LocalAI | ⭐⭐⭐⭐ | Team compliance, self-host full OpenAI replacement | Drop-in OpenAI API replacement (chat / embedding / image / TTS / STT), ★ 46k+ |
| | ml-explore/mlx | ⭐⭐⭐⭐ | Mac dev, squeeze Apple Silicon | Apple's ML framework for M1+, ★ 25k+. Pair with mlx-lm for ease |
| Build from scratch (understand internals) | karpathy — Let's build GPT from scratch | ⭐⭐⭐⭐⭐ | Understand LLM internals, not just API calls | 2hr high-density video, build GPT in PyTorch from scratch. Pause and code along, don't passive-watch |
| | rasbt/LLMs-from-scratch | ⭐⭐⭐⭐⭐ | Book-pace read of the same material | Book version of Karpathy's video: tokenizer → attention → pretraining → finetuning, ★ 91k+, Apache-2.0 |
| | karpathy/LLM101n | ⭐⭐ | Historical reference | ⚠️ Archived (2024-08), outline only, course never finished. Watch "Build GPT from scratch" above instead |
💡 Suggested reading order: API-first → Anthropic / OpenAI Cookbook · Chinese systematic path → happy-llm + llm-universe · deep internals → Karpathy video + rasbt book with code · local-only → start with Ollama, then llama.cpp.
✅ Self-Check Before Stage 2
Can you:
- [ ] Make a Claude API call from Python in 5 lines
- [ ] Explain why "你好" might use 2 tokens but "Hello" uses 1
- [ ] Quote roughly the per-token price for Claude Sonnet vs Opus
- [ ] Name one strength of Claude vs GPT vs Gemini vs Llama
If yes → proceed to Stage 2 — Prompt Engineering.
If no → re-read the Anthropic Claude API Quickstart and re-run Exercises 1-3 above.
✅ Done with Stage 1? Next, Stage 2 — Prompt Engineering takes 5-12 hours to walk you through writing reusable structured prompts, using few-shot and chain-of-thought for reasoning tasks, and learning to quantify prompt improvement with evals. Keep going →