Stage 1 — LLM Fundamentals¶


⏱ Time estimate: 1 week (~5-8 hours)

👋 Coming from Stage 0? Nice — your toolchain is set. The next 5-8 hours: your first working call to Claude / GPT / Gemini, how token / context window / temperature shape the output, and per-token cost estimation. Jumped straight here? Make sure you can run a Python script and have an API key from one provider — if not, head back to Stage 0.

💡 Don't recognize a term? (LLM / token / context window / temperature / RAG / agent / …) → check resources/glossary.en.md for 30-second definitions.

3 Core Terms (memorize these—all later stages use them)¶

Term	Chinese	One-liner
token	詞元	The unit LLMs use to measure text length and set prices (1 Chinese char ≈ 1.5-2 tokens; 1 English word ≈ 1.3 tokens)
context window	上下文視窗	How many tokens the model sees at once (Claude 1M / GPT ~400k / Gemini 2M)
temperature	隨機程度參數	Controls how stable or creative the output is (0 = most stable, close to deterministic; 1 = creative; use 0.0-0.3 for classification, 0.7-1.0 for creative writing)

→ These 3 terms run through every later stage. The goal of Stage 1 is to call the API yourself and feel firsthand how they shape the output.
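To get a feel for the token ratios in the table before calling any API, here is a minimal sketch of a rule-of-thumb estimator. `estimate_tokens` is a hypothetical helper, and the constants are simply the table's ratios (1.75 is the midpoint of the 1.5-2 range). Real tokenizers will give different counts, so treat the `usage` field returned by the API as the source of truth.

```python
import re

# Rule-of-thumb ratios taken from the table above; real tokenizers will differ.
TOKENS_PER_ENGLISH_WORD = 1.3
TOKENS_PER_CJK_CHAR = 1.75  # assumption: midpoint of the 1.5-2 range

CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")   # common CJK ideographs only
LATIN_WORD = re.compile(r"[A-Za-z0-9']+")


def estimate_tokens(text: str) -> int:
    """Very rough estimate: count CJK characters and Latin words separately."""
    cjk_chars = len(CJK_CHAR.findall(text))
    words = len(LATIN_WORD.findall(text))
    return round(cjk_chars * TOKENS_PER_CJK_CHAR + words * TOKENS_PER_ENGLISH_WORD)


if __name__ == "__main__":
    samples = [
        "Describe in one sentence what a cat is doing.",
        "用一句話描述一隻貓在做什麼。",
    ]
    for sample in samples:
        print(f"{estimate_tokens(sample):>3} estimated tokens <- {sample}")
```

The estimate is only good enough for budgeting; Exercise 2 shows how to read the real counts.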

📌 Learning Goals¶

After this stage you will be able to:

- Explain what an LLM is, what tokens are, and what context window means
- Make your first API call to Claude / GPT / Gemini and parse the response
- Compare the four major LLM families (Claude / GPT / Gemini / Llama) on strengths
- Estimate cost per task using per-token pricing

🌐 Major LLM Family Comparison (2026-05 snapshot)¶

"How is Claude different from GPT?" "Can I use Chinese models?" "Which OSS model should I run with Ollama?" This section gives you an objective side-by-side view. It does not declare a single "best" model: it compares strengths / good-fit tasks / weaknesses and includes official docs URLs so you can verify the claims yourself.

💡 First, a few terms:

- Context window = the amount of text an LLM can take in during one pass; it is capped (for example, 200k tokens is roughly 100k-130k Chinese characters at 1.5-2 tokens per character)
- Apache 2.0 / MIT = open-source terms that permit commercial use, modification, and closed-source redistribution; Llama Community License = open-source but with conditions (for example, orgs with >= 700M MAU need a license)
- Frontier model = each provider's strongest flagship; OSS = open-source, with weights downloadable for self-hosting
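Because the window is capped, the prompt and the output you reserve generally have to fit inside it together. The check can be sketched as follows, assuming the 200k figure from the example above; `fits_in_context` and `remaining_tokens` are hypothetical helpers, and the token counts are assumed to come from a real tokenizer or a previous response's usage data.

```python
def remaining_tokens(prompt_tokens: int, max_output_tokens: int,
                     context_window: int = 200_000) -> int:
    """Tokens left over after the prompt and the reserved output budget."""
    return context_window - prompt_tokens - max_output_tokens


def fits_in_context(prompt_tokens: int, max_output_tokens: int,
                    context_window: int = 200_000) -> bool:
    """True when the prompt plus the reserved output budget fits in the window."""
    return remaining_tokens(prompt_tokens, max_output_tokens, context_window) >= 0


if __name__ == "__main__":
    # Illustrative numbers only.
    for prompt_size in (150_000, 199_000):
        ok = fits_in_context(prompt_size, max_output_tokens=4_000)
        print(f"prompt={prompt_size} reserve=4000 fits={ok}")
```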

🇺🇸 US Commercial Frontier (3 providers)¶

These 3 are SaaS APIs: you pay per token and cannot self-host them.

Model family	Flagship (2026-05)	Context	Strengths	Best for	Official docs
Claude (Anthropic)	Opus 4.8 / Sonnet 4.6 / Haiku 4.5	1M (Haiku 4.5 is 200k)	long-form / coding / agent / safety alignment	writing papers / code review / agent runtime	platform.claude.com/docs
GPT (OpenAI)	GPT-5.5 / GPT-5 / o-series	~400k	general-purpose / function calling / broadest ecosystem	broad queries / function-call frameworks / GPTs ecosystem	platform.openai.com/docs/models
Gemini (Google)	3.1 Pro / Flash	2M (Pro series; Flash is 1M)	long context / native multimodal / Google integration	PDF / video and audio / large document sets / Google Workspace	ai.google.dev

🇨🇳 Chinese Commercial + Open-Source Frontier (7 providers)¶

These are the main choices for Chinese-language work. Some are used mainly through a hosted API (DeepSeek / Kimi / Hunyuan; DeepSeek also publishes some weights, as the table notes); others release OSS weights (Qwen / GLM-5.1 / Yi) that can run through Ollama.

Model family	Flagship (2026-05)	Context	Strengths	Best for	License	Official
DeepSeek	V3 (`deepseek-chat`) / R1 (`deepseek-reasoner`) ⚠️ V4-series weights are open-source; consumer API is not fully public yet	128k	reasoning / coding / lowest cost	high-token workloads / code generation / math	API proprietary; some weights OSS on HF	api-docs.deepseek.com
Qwen (Alibaba)	Qwen3 (cloud DashScope + Apache 2.0 OSS)	128k+	strongest Chinese OSS / multimodal / agent	Chinese long-form writing / agent / self-host	Apache 2.0 (OSS) + proprietary (cloud)	qwen.ai · DashScope
Kimi (Moonshot)	K2.6 multimodal + Agent	very long context (1M+)	long context / Chinese long-form writing	whole-book reading / literature triage	Proprietary	platform.moonshot.cn
GLM (Zhipu)	GLM-5 proprietary / GLM-5.1 Apache 2.0	128k	Chinese / tool use / agent	Chinese agents / multi-turn chat	proprietary + Apache 2.0 (5.1)	open.bigmodel.cn · chatglm.cn
Hunyuan (Tencent)	T1 (deep-thinking, Transformer-Mamba MoE) + TurboS	128k	DeepSeek R1-comparable reasoning, Chinese	Chinese reasoning / Tencent ecosystem	Proprietary	hunyuan.tencent.com
MiniMax	abab6.5 + M2.7	200k	multimodal / Chinese long prose	Chinese writing / video and audio multimodal	Proprietary	platform.minimax.io
Yi (01.AI / Kai-Fu Lee)	Yi-Lightning (new API flagship) / Yi-34B-Chat (OSS, 200k context)	200k	Chinese OSS alternative to Llama	Chinese self-host / Chinese API	Apache 2.0 (OSS) / proprietary (Lightning)	01.ai · GitHub

⚠️ Xiaomi MiMo is listed in resources/cli-agents-guide.md for Hermes Agent routing, but as of 2026-05 no authoritative official source was found to verify it, so it is not included in this table. To try it, connect through Hermes Agent's provider routing (200+ providers).

🌍 Western Open-Source (4 providers, self-host defaults)¶

These are the main choices for running on your own hardware, avoiding API fees, or handling privacy-sensitive work. You can install them in one command through Ollama.

Model family	Active size	License	Strengths	Best for	Official
Llama (Meta)	3.3 70B (Llama 4 not yet released as of 2026-05)	Llama Community License	general-purpose / broadest ecosystem / Ollama default	self-hosting intro / fine-tune base	llama.com · HF Meta
Gemma (Google)	Gemma 4 26B MoE + 31B dense (released 2026-04; Arena #3)	Apache 2.0	small and efficient / strong Apple MLX integration / multimodal	edge / mobile / 4-8 GB RAM machines	ai.google.dev/gemma
Mistral (Mistral AI)	7B / Mixtral 8x7B / Codestral	Apache 2.0 (OSS parts)	strongest open-source 7B class	commercial self-host / EU sovereignty	mistral.ai · HF Mistral
Phi (Microsoft)	Phi-4 14B reasoning + Phi-4-multimodal-instruct (multimodal version)	MIT	small but strong / reasoning / edge-friendly	4 GB+ RAM / mobile / reasoning intro	HF microsoft

🎯 Which One Should I Pick? (by scenario)¶

Your scenario	Pick + why
First time learning an LLM API, prioritize complete tutorials	Claude — Anthropic Cookbook + Courses are widely considered the most complete
Long-form writing / papers / code review	Claude Sonnet — long-form prose is a core strength
Multimodal (PDF / video and audio / images)	Gemini or Kimi — native multimodal
Broad queries + function calling frameworks	GPT — broadest ecosystem and deepest SDK integration
Chinese scenarios + commercial API	Kimi (strong long context; can fit whole books), DeepSeek (lowest cost), or GLM (agent-friendly)
Chinese scenarios + open-source self-host	Qwen 3 (Apache 2.0; currently the strongest Chinese OSS)
Reasoning / math (reasoning model)	DeepSeek R1 / Hunyuan T1 / OpenAI o-series
Privacy / offline / no API fees	Llama 3.3 / Gemma 4 / Qwen 3 OSS via Ollama
Edge / 4 GB RAM machine	Gemma 4 / Phi-4 / Qwen 3 (`qwen3-3B` or smaller variants)
100k+ token large documents	Gemini 3.1 (2M context) or Kimi K2.6 (1M+)
Want the lowest cost (API-bill sensitive)	DeepSeek V4-Flash — lowest token price among same-tier English models

📊 Neutral Benchmark Resources (verify for yourself; do not rely on one source)¶

Resource	Use	URL	2026-05 status
Artificial Analysis	Third-party benchmarks plus price/latency aggregation, including Chinese models	https://artificialanalysis.ai/	✓ Active
Arena AI (formerly LMSYS Chatbot Arena)	Human blind-test ELO leaderboard	https://arena.ai/leaderboard/text	✓ Active
Vellum LLM leaderboard	Aggregates multiple benchmarks	https://www.vellum.ai/llm-leaderboard	✓ Active
HuggingFace OpenLLM Leaderboard	Open-source model rankings	https://huggingface.co/spaces/open-llm-leaderboard	⚠️ Occasional runtime errors as of 2026-05; use the Arena AI open-source tab as fallback
SuperCLUE	Authoritative benchmark for Chinese-language scenarios	https://www.superclueai.com/	✓ Active

⚠️ Important Caveats¶

⚠️ Benchmark != production performance: run a small eval on your specific task (for example, paste 10 real prompts and see which model answers closest to what you need); do not pick only from rankings
⚠️ Frontier changes every 6 months: all numbers above are a 2026-05 snapshot; afterward, rely on official docs / Artificial Analysis
⚠️ "Strength" is relative, not absolute: every frontier model can handle basic tasks; differences matter at the margin
⚠️ For Chinese scenarios, check SuperCLUE: general international benchmarks such as MMLU are English-heavy, and Chinese-language performance may diverge
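The "run a small eval on your own task" advice can be sketched as a tiny harness. Everything here is illustrative: `run_eval` and `keyword_score` are hypothetical names, the models are stand-in functions rather than real API calls, and keyword matching is only one crude way to score an answer.

```python
from typing import Callable, Dict, List, Tuple

# Each case pairs a real prompt from your task with keywords a good answer should contain.
EvalCase = Tuple[str, List[str]]


def keyword_score(answer: str, keywords: List[str]) -> float:
    """Fraction of expected keywords found in the answer (case-insensitive)."""
    if not keywords:
        return 0.0
    lowered = answer.lower()
    return sum(1 for k in keywords if k.lower() in lowered) / len(keywords)


def run_eval(models: Dict[str, Callable[[str], str]],
             cases: List[EvalCase]) -> Dict[str, float]:
    """Average keyword score per model over all cases."""
    if not cases:
        raise ValueError("need at least one eval case")
    results = {}
    for name, ask in models.items():
        scores = [keyword_score(ask(prompt), keywords) for prompt, keywords in cases]
        results[name] = sum(scores) / len(scores)
    return results


if __name__ == "__main__":
    # Stand-in models: replace each lambda with a function that calls a real API.
    models = {
        "model_a": lambda prompt: "Paris is the capital of France.",
        "model_b": lambda prompt: "I am not sure.",
    }
    cases = [("What is the capital of France?", ["Paris"])]
    for name, score in sorted(run_eval(models, cases).items(), key=lambda kv: -kv[1]):
        print(f"{name}: {score:.2f}")
```

With around 10 real prompts per task this is usually enough to see which model fits, without relying on a leaderboard.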

🚪 Entry Conditions¶

You should already:

- Be able to run a Python script
- Know what HTTP / REST is conceptually
- Have an API key from at least one provider (Anthropic / OpenAI / Google)

If not — go back to Stage 0 first.

📚 Required Reading¶

Anthropic — Claude Model Overview — official model family overview, including 2026's latest Opus 4.8 / Sonnet 4.6 / Haiku 4.5
anthropics/courses — Anthropic API Fundamentals ⭐⭐⭐⭐⭐ ★ 21k+ — Anthropic's official 5-course umbrella; module 1 "Anthropic API Fundamentals" maps to this stage. Jupyter notebooks, runs on Claude 3 Haiku (cheapest), hands-on walkthrough of API essentials.
OpenAI Quickstart — first API call walkthrough
A Visual Guide to LLM Tokenizers — Hugging Face's intro
Anthropic API Pricing — read the pricing table, calculate cost for 1k input + 1k output
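The "1k input + 1k output" calculation from the last reading item can be sketched like this. The per-1M-token prices are the ones quoted in Exercise 3 of this stage; they are a snapshot, so verify them against the official pricing page. `cost_usd` is a hypothetical helper.

```python
# USD per 1M tokens as (input, output), copied from the Exercise 3 pricing table.
# Assumption: these are a snapshot; check the official pricing page before relying on them.
PRICING = {
    "claude-haiku-4-5": (1.00, 5.00),
    "claude-sonnet-4-6": (3.00, 15.00),
    "claude-opus-4-8": (5.00, 25.00),
}


def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one call: tokens times the per-token rate, input and output priced separately."""
    in_rate, out_rate = PRICING[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


if __name__ == "__main__":
    for model in PRICING:
        print(f"{model:<20} ${cost_usd(model, 1_000, 1_000):.4f} for 1k in + 1k out")
```

Output tokens cost more than input tokens in this table, so long answers dominate the bill.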

🛠 Hands-on Exercises (foundational, illustrative)¶

🦙 This stage defaults to Ollama (cost-driven; gemma4:e4b runs locally for $0/run). Every exercise has Path A (Ollama, default) + Path B (Anthropic, optional — use it when you want to see cloud-quality answers). Full three-path trade-off in examples/README.en.md.

💰 Stage 1 budget estimate (all 6 exercises, 3-5 runs each): all local = $0, all haiku ≈ $0.30, all sonnet ≈ $0.90. Full model list + Stage 1-7 total budget: examples/README.en.md#recommended-llm-list.

💡 No Ollama yet? Each exercise also ships a Path B Anthropic version — pick one. To enable Path A in one step: pip install openai && ollama pull gemma4:e4b.

Exercise 1: LLM API (hello world)¶

Five-line Python script that calls an LLM and prints the response. Defaults to local Ollama (free, offline); switch to Path B Anthropic when you want cloud-quality answers. Details in examples/README.en.md.

📋 Starter code — Path A (local Ollama gemma4:e4b, default) (copy to practice_1.py and run python practice_1.py)

# Requires: pip install openai      (OpenAI-compatible SDK talks to Ollama)
# Pre-req: ollama pull gemma4:e4b && ollama serve
import sys
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Ollama doesn't check this — anything works
)

r = client.chat.completions.create(
    model="gemma4:e4b",   # swap to qwen2.5:3b / llama3.2:3b if preferred
    max_tokens=100,
    messages=[{"role": "user", "content": "Introduce yourself in one sentence."}],
)

# === Self-check ===
text = r.choices[0].message.content
print("Response:", text)
print("usage:", r.usage)

assert r.choices[0].finish_reason in ("stop", "length"), f"unexpected finish_reason: {r.choices[0].finish_reason}"
assert len(text) > 0, "response should not be empty"
assert r.usage.completion_tokens > 0, "output token count should be > 0"
print("✅ Exercise 1 passed — local Ollama gemma4:e4b answered for $0")

**How slow?** `gemma4:e4b` on CPU: ~5-30 s/answer; on GPU (RTX 3060+): <2 s. For speed use `gemma3:1b`; for quality use `qwen2.5:14b` / `llama3.1:8b` (needs 8 GB+ VRAM).

📋 Starter code — Path B (Anthropic API, optional, when you want cloud quality) (copy to practice_1_anthropic.py)

# Requires: pip install anthropic
# Env: export ANTHROPIC_API_KEY=sk-ant-...
import sys
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

import anthropic

client = anthropic.Anthropic()
msg = client.messages.create(
    model="claude-haiku-4-5",  # haiku = cheapest; switch to sonnet by changing this line
    max_tokens=100,
    messages=[{"role": "user", "content": "Introduce yourself in one sentence."}],
)

# === Self-check ===
text = msg.content[0].text
print("Response:", text)
print("usage:", msg.usage)

assert msg.stop_reason in ("end_turn", "max_tokens"), f"unexpected stop_reason: {msg.stop_reason}"
assert len(text) > 0, "response should not be empty"
assert msg.usage.input_tokens > 0 and msg.usage.output_tokens > 0, "token counts should be > 0"
print("✅ Exercise 1 passed — Anthropic API is reachable from your machine")

**Cost**: ~$0.001/run (haiku) or ~$0.004/run (sonnet); this hello-world is also 5-15× faster than Ollama.

Exercise 2: Tokens¶

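Run the same prompt repeatedly (the starter code uses 10-20 runs per prompt; raise it toward 100 once it works) and watch the output token counts vary.

- Notice: temperature ≠ 0 produces variation
- Notice: the token count for the same sentence in English vs Chinese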
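Why a non-zero temperature produces variation can be sketched with the textbook softmax-with-temperature formula. This is a simplified illustration, not any provider's actual sampler (real decoders typically also apply top-p / top-k and other steps); the candidate words and scores are made up.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: all probability on the top score.
        top = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == top else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample(tokens, logits, temperature, rng):
    """Draw one token according to the temperature-adjusted probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


if __name__ == "__main__":
    tokens = ["sleeping", "eating", "jumping"]  # made-up next-word candidates
    logits = [2.0, 1.0, 0.5]                    # made-up model scores
    rng = random.Random(0)
    for t in (0.0, 0.3, 1.0):
        probs = [round(p, 2) for p in softmax_with_temperature(logits, t)]
        picks = [sample(tokens, logits, t, rng) for _ in range(8)]
        print(f"temperature={t}: probs={probs} picks={picks}")
```

At temperature 0 every draw picks the top candidate; as the temperature rises, lower-scored candidates get picked more often, which is where the run-to-run variation in this exercise comes from.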

📋 Starter code — Path A (local Ollama gemma4:e4b, default) (copy to practice_2.py)

# Requires: pip install openai
# Pre-req: ollama pull gemma4:e4b && ollama serve
import sys, statistics
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

PROMPTS = {
    "Chinese": "用一句話描述一隻貓在做什麼。",
    "English": "Describe in one sentence what a cat is doing.",
}

N = 10  # local is slower; start small
for label, prompt in PROMPTS.items():
    output_tokens = []
    for _ in range(N):
        r = client.chat.completions.create(
            model="gemma4:e4b",
            max_tokens=80,
            temperature=1.0,  # high temp to amplify variance
            messages=[{"role": "user", "content": prompt}],
        )
        output_tokens.append(r.usage.completion_tokens)
    print(f"\n[{label}] prompt: {prompt}")
    print(f"  input tokens: {r.usage.prompt_tokens}")
    print(f"  output tokens — min={min(output_tokens)} max={max(output_tokens)} mean={statistics.mean(output_tokens):.1f} stdev={statistics.stdev(output_tokens):.1f}")

# === Self-check ===
assert max(output_tokens) > min(output_tokens), "with temperature=1.0, output length should vary"
print("\n✅ Exercise 2 passed — observed temperature → token variance, $0/run")
print("💡 Chinese prompts typically use MORE input tokens (one Chinese character ≈ 2 tokens)")

📋 Starter code — Path B (Anthropic API, optional) (copy to practice_2_anthropic.py)

# Requires: pip install anthropic
import sys, statistics
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

import anthropic
client = anthropic.Anthropic()
PROMPTS = {"Chinese": "用一句話描述一隻貓在做什麼。", "English": "Describe in one sentence what a cat is doing."}

for label, prompt in PROMPTS.items():
    output_tokens = []
    for _ in range(20):
        msg = client.messages.create(model="claude-haiku-4-5", max_tokens=80, temperature=1.0,
                                     messages=[{"role": "user", "content": prompt}])
        output_tokens.append(msg.usage.output_tokens)
    print(f"[{label}] input={msg.usage.input_tokens} output min/max/mean={min(output_tokens)}/{max(output_tokens)}/{sum(output_tokens)/len(output_tokens):.1f}")

**Key SDK diffs**: `messages.create` → `chat.completions.create`; `usage.output_tokens` → `usage.completion_tokens`; `usage.input_tokens` → `usage.prompt_tokens`. **Cost**: 40 runs ≈ $0.01.
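If you switch between the two paths often, the field-name differences listed above can be hidden behind a small adapter. This is a minimal sketch: `normalize_usage` is a hypothetical helper, and the `SimpleNamespace` objects merely stand in for real SDK responses so the snippet runs without an API key.

```python
from types import SimpleNamespace


def normalize_usage(usage):
    """Return (input_tokens, output_tokens) for either SDK's usage object."""
    if hasattr(usage, "input_tokens"):
        # Anthropic-style field names
        return usage.input_tokens, usage.output_tokens
    # OpenAI-compatible chat.completions field names
    return usage.prompt_tokens, usage.completion_tokens


if __name__ == "__main__":
    # Stand-ins for msg.usage (Anthropic) and r.usage (OpenAI-compatible / Ollama).
    anthropic_like = SimpleNamespace(input_tokens=14, output_tokens=48)
    openai_like = SimpleNamespace(prompt_tokens=14, completion_tokens=48)
    print(normalize_usage(anthropic_like), normalize_usage(openai_like))
```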

Exercise 3: Pricing / Latency¶

Cost-sensitive work needs this estimate: compute how long 1000 hello-world inferences take and how much they cost. Local Ollama costs $0 but pays in latency; cloud LLMs cost money but respond faster. Knowing this trade-off is how you pick the right model.

📋 Starter code — Path A (local Ollama gemma4:e4b, measure latency) (copy to practice_3.py)

# Requires: pip install openai
# Pre-req: ollama pull gemma4:e4b && ollama serve
import sys, time
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

latencies, out_tokens = [], []
for _ in range(5):
    t0 = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
    r = client.chat.completions.create(
        model="gemma4:e4b",
        max_tokens=200,
        messages=[{"role": "user", "content": "Hi! Please introduce yourself."}],
    )
    latencies.append(time.perf_counter() - t0)
    out_tokens.append(r.usage.completion_tokens)  # record every run, not just the last one

avg_latency = sum(latencies) / len(latencies)
out_tok_avg = sum(out_tokens) / len(out_tokens)  # true average over the 5 runs
tps = out_tok_avg / avg_latency if avg_latency > 0 else 0

print("model: gemma4:e4b (local)")
print(f"5 latencies (sec): min={min(latencies):.2f} max={max(latencies):.2f} mean={avg_latency:.2f}")
print(f"avg output: {out_tok_avg:.1f} tokens, ~{tps:.1f} tokens/sec")
print(f"\n1000-run cost: $0 (local); projected duration: {avg_latency * 1000 / 60:.1f} minutes")

# === Self-check ===
assert avg_latency > 0, "latency should be > 0"
assert out_tok_avg > 0, "output token count should be > 0"
print(f"\n✅ Exercise 3 passed — local model is $0 but takes ~{avg_latency * 1000 / 60:.0f} min for 1000 runs")
print("💡 Compare Path B Anthropic: 1000 runs is ~10-20 min at $0.25 (haiku)")

📋 Starter code — Path B (Anthropic API, compute $ cost) (copy to practice_3_anthropic.py)

# Requires: pip install anthropic
import sys
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

import anthropic

# Anthropic public pricing 2026 Q2 (per 1M tokens, USD) — verify at https://www.anthropic.com/pricing
PRICING = {
    "claude-haiku-4-5":   {"input": 1.00, "output":  5.00},
    "claude-sonnet-4-6":  {"input": 3.00, "output": 15.00},
    "claude-opus-4-8":    {"input": 5.00, "output": 25.00},  # Opus 4.8 (May 2026, Dynamic Workflows) — same 5/25 pricing
}

client = anthropic.Anthropic()
MODEL = "claude-haiku-4-5"
msg = client.messages.create(model=MODEL, max_tokens=200,
                             messages=[{"role": "user", "content": "Hi! Please introduce yourself."}])
in_tok, out_tok = msg.usage.input_tokens, msg.usage.output_tokens
rates = PRICING[MODEL]
cost_one = (in_tok * rates["input"] + out_tok * rates["output"]) / 1_000_000

print(f"model: {MODEL}")
print(f"single: input={in_tok} output={out_tok} → ${cost_one:.6f}")
print(f"1000 calls cost across model tiers:")
for name, r in PRICING.items():
    c = (in_tok * r["input"] + out_tok * r["output"]) / 1_000_000 * 1000
    print(f"  {name:<22} ${c:.4f}")

assert cost_one > 0, "Cloud LLM always has a cost"
print(f"\n✅ Exercise 3 passed (Anthropic) — 1000 runs: haiku ≈ $0.25, sonnet 4.6 ≈ $0.76, opus 4.8 ≈ $1.27")

**Expected output**:

model: claude-haiku-4-5
single: input=14 output=48 → $0.000254
1000 calls cost across model tiers:
  claude-haiku-4-5       $0.2540
  claude-sonnet-4-6      $0.7620
  claude-opus-4-8        $1.2700

**Trade-off**: local Ollama is $0 for 1000 runs but takes ~2 hr; Anthropic haiku is ~10 min for $0.25; sonnet ~10 min for $0.76. **Use cloud only for production; learning / experiments / debug stay local.**

Exercise 4: Cross-Provider Comparison¶

Send the same prompt to Claude, GPT, and Gemini simultaneously, compare their responses. Notice "why does the same input produce different answers" — answer style, length, and judgment all differ. Use the OpenAI, Anthropic, and Google SDKs side-by-side.

→ Starter template → examples/stage-1/04-cross-provider/ (parallel calls to all three SDKs + comparison table; missing keys are skipped gracefully; illustrative, not a chapter-length tutorial)
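The "parallel calls, missing keys skipped gracefully" idea can be sketched independently of any SDK. This is not the starter template itself: `compare` is a hypothetical helper, the provider functions are stubs to be replaced with real SDK calls, and the environment variable names other than `ANTHROPIC_API_KEY` are assumptions to check against each provider's docs.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def compare(prompt, providers, env=os.environ):
    """Call every provider whose API-key variable is set, in parallel.

    providers maps a label to (env_var_name, ask_function).
    Returns {label: answer}; skipped or failed providers get a short status string.
    """
    results, pending = {}, {}
    with ThreadPoolExecutor(max_workers=max(1, len(providers))) as pool:
        for label, (env_var, ask) in providers.items():
            if not env.get(env_var):
                results[label] = f"skipped ({env_var} not set)"
                continue
            pending[label] = pool.submit(ask, prompt)
        for label, future in pending.items():
            try:
                results[label] = future.result()
            except Exception as exc:  # one failing provider should not sink the rest
                results[label] = f"error: {exc}"
    return results


if __name__ == "__main__":
    # Stubs: swap each lambda for a function that calls the real SDK.
    providers = {
        "claude": ("ANTHROPIC_API_KEY", lambda p: f"[claude stub] {p}"),
        "gpt": ("OPENAI_API_KEY", lambda p: f"[gpt stub] {p}"),        # assumed variable name
        "gemini": ("GEMINI_API_KEY", lambda p: f"[gemini stub] {p}"),  # assumed variable name
    }
    for label, answer in compare("Say hi in five words.", providers).items():
        print(f"{label:<8} {answer}")
```

Threads suit this case because each call spends most of its time waiting on the network.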

Exercise 5: Error Handling¶

Trigger error conditions deliberately and write retry logic:

- Wrong API key → see how it raises
- Over-long prompt → what happens when the context window is full
- Network drop → write a retry wrapper with exponential backoff

This is foundational for Stage 3-7's production agent code.

→ Starter template → examples/stage-1/05-error-handling/ (mock-based tests so you can verify the retry logic without unplugging your ethernet cable; illustrative, not a chapter-length tutorial)
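A retry wrapper with exponential backoff can be sketched as below. It is a generic illustration rather than the starter template: `retry_with_backoff` is a hypothetical helper, and it retries Python's built-in `ConnectionError` / `TimeoutError` as stand-ins, so map `retry_on` to your SDK's own transient-error classes. Errors such as a wrong API key are deliberately not retried, since repeating the call cannot fix them.

```python
import random
import time


def retry_with_backoff(fn, max_attempts=4, base_delay=0.5,
                       retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call fn(); on a retryable error wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt; jitter avoids synchronized retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky():
        """Simulated API call that fails twice, then succeeds."""
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("simulated network drop")
        return "ok"

    result = retry_with_backoff(flaky, base_delay=0.01)
    print(result, "after", calls["n"], "attempts")
```

Passing `sleep` as a parameter makes the wrapper easy to test without real waiting, which is the same idea as the mock-based tests mentioned above.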

Exercise 6: Local LLM¶

No API fees, runs on your machine: use Ollama to pull a small model (llama3.2:3b or qwen2.5:3b recommended) and call it through Ollama's OpenAI-compatible API.

# 1. Install Ollama: https://ollama.com
ollama pull qwen2.5:3b
ollama serve  # default port 11434

📋 Starter code (copy to practice_6.py)

# Requires: pip install openai
# Pre-req: Ollama is running, qwen2.5:3b is pulled
import sys
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Ollama doesn't check this — anything works
)

r = client.chat.completions.create(
    model="qwen2.5:3b",
    messages=[{"role": "user", "content": "Explain ReAct in 3 sentences."}],
)

text = r.choices[0].message.content
print("Response:", text)

# === Self-check ===
assert len(text) > 10, "response is too short — Ollama may not be running"
print(f"✅ Exercise 6 passed — local Ollama reachable through the OpenAI-compatible API")
print(f"💡 This run cost you $0 (except for electricity)")

**Why do this**: once you can run local LLMs, Stage 3-6 experiments aren't bottlenecked on API costs; privacy-sensitive work also stays offline.

🎯 Curated Projects¶

5 categories, 17 projects in one table. Pick by "Best for"; click through for depth on the repo / course site.

Category	Project	⭐	Best for	Why / Notes
Official cookbook / starting point	Anthropic Cookbook	⭐⭐⭐⭐⭐	Starting with Claude API; reference lookup	Full-feature Claude API notebooks (tool use / batch / prompt cache), ★ 42k+, MIT
	Anthropic Courses	⭐⭐⭐⭐⭐	Systematic Claude learning from zero	Anthropic's own 5-course set (API fundamentals / prompt eval / real-world prompting / tool use), ★ 21k+. Start with `anthropic_api_fundamentals`
	OpenAI Cookbook	⭐⭐⭐⭐⭐	OpenAI API + structured output / function calling	Pair with Anthropic Cookbook, ★ 73k+, MIT. Much bigger than Anthropic's — use search
	Anthropic Claude API Quickstart	⭐⭐⭐⭐	5-minute start	Official docs, bookmark it
Chinese textbook (chapter-style)	datawhalechina/happy-llm	⭐⭐⭐⭐⭐	Chinese readers wanting LLM internals	Karpathy "Zero to Hero" Chinese counterpart, ★ 29k+. Equivalent to HF LLM Course in Chinese
	datawhalechina/llm-universe	⭐⭐⭐⭐⭐	Chinese newcomers building with LLM	API basics / knowledge base / RAG / advanced tricks, ★ 12k+
	datawhalechina/llm-cookbook	⭐⭐⭐⭐	Full Chinese LLM learning path	Adapted Chinese translation of Andrew Ng's courses (⚠️ updates slowed after 2025-06, CC BY-NC-SA)
	jingyaogong/minimind	⭐⭐⭐⭐	Post-Karpathy, want a real training run	2hr to train a 64M LLM from scratch — Pretrain + SFT + LoRA + DPO + RLHF, ★ 48k+, Apache-2.0
English course (systematic)	HuggingFace — LLM Course	⭐⭐⭐⭐⭐	Transformer internals + HF ecosystem	Transformer theory + applications, Apache 2.0
	LangChain Academy	⭐⭐⭐⭐	Visual learners who like video courses	LangChain's official free course, includes RAG / agent. Skip the LangChain marketing segments
Local execution (no API costs)	ollama/ollama	⭐⭐⭐⭐⭐	First-time local LLM	This repo's Path A default, OpenAI-compat API, ★ 170k+
	ggml-org/llama.cpp	⭐⭐⭐⭐⭐	Understanding quantization / how 7B fits in 8GB RAM	Ollama's underlying inference engine, ★ 108k+, MIT
	mudler/LocalAI	⭐⭐⭐⭐	Team compliance, self-host full OpenAI replacement	Drop-in OpenAI API replacement (chat / embedding / image / TTS / STT), ★ 46k+
	ml-explore/mlx	⭐⭐⭐⭐	Mac dev, squeeze Apple Silicon	Apple's ML framework for M1+, ★ 25k+. Pair with `mlx-lm` for ease
Build from scratch (understand internals)	karpathy — Let's build GPT from scratch	⭐⭐⭐⭐⭐	Understand LLM internals, not just API calls	2hr high-density video, build GPT in PyTorch from scratch. Pause and code along, don't passive-watch
	rasbt/LLMs-from-scratch	⭐⭐⭐⭐⭐	Book-pace read of the same material	Code repo for Sebastian Raschka's book, which covers ground similar to Karpathy's video at book pace: tokenizer → attention → pretraining → finetuning, ★ 91k+, Apache-2.0
	karpathy/LLM101n	⭐⭐	Historical reference	⚠️ Archived (2024-08), outline only, course never finished. Watch "Build GPT from scratch" above instead

💡 Suggested reading order: API-first → Anthropic / OpenAI Cookbook · Chinese systematic path → happy-llm + llm-universe · deep internals → Karpathy video + rasbt book with code · local-only → start with Ollama, then llama.cpp.

✅ Self-Check Before Stage 2¶

Can you:

- [ ] Make a Claude API call from Python in 5 lines
- [ ] Explain why "你好" might use 2 tokens but "Hello" uses 1
- [ ] Quote roughly the per-token price for Claude Sonnet vs Opus
- [ ] Name one strength of Claude vs GPT vs Gemini vs Llama

If yes → proceed to Stage 2 — Prompt Engineering.

If no → re-read the Anthropic Quickstart and re-run Exercises 1-3 above.

✅ Done with Stage 1? Next, Stage 2 — Prompt Engineering takes 5-12 hours to walk you through writing reusable structured prompts, using few-shot and chain-of-thought for reasoning tasks, and learning to quantify prompt improvement with evals. Keep going →