Stage 1 — LLM Fundamentals

Time estimate: 1 week (~5-8 hours)

👋 Coming from Stage 0? Nice — your toolchain is set. The next 5-8 hours: your first working call to Claude / GPT / Gemini, how token / context window / temperature shape the output, and per-token cost estimation. Jumped straight here? Make sure you can run a Python script and have an API key from one provider — if not, head back to Stage 0.

💡 Don't recognize a term? (LLM / token / context window / temperature / RAG / agent / …) → check resources/glossary.en.md for 30-second definitions.

3 Core Terms (memorize these—all later stages use them)

| Term | Chinese | One-liner |
| --- | --- | --- |
| token | 詞元 | The unit LLMs use to count text length and set prices (1 Chinese char ≈ 1.5-2 tokens; 1 English word ≈ 1.3 tokens) |
| context window | 上下文視窗 | How many tokens the model sees at once (flagship limits: Claude 1M / GPT ~400k / Gemini 2M) |
| temperature | 隨機程度參數 | Controls how stable or creative the output is (0 = most deterministic, 1 = more creative; use 0.0-0.3 for classification, 0.7-1.0 for creative writing) |

→ These 3 terms run through every later stage. The goal of Stage 1 is to call the API yourself and feel firsthand how they shape the output.
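
The rules of thumb in the table can be turned into a rough estimator. This is a minimal sketch, not a tokenizer: `estimate_tokens` and `fits_in_window` are hypothetical helper names, the regex only covers the main CJK Unified Ideographs block, and real token counts differ by model. For real numbers, read the `usage` field returned by the API.

```python
import re

# Rule-of-thumb ratios from the table above (approximations, not tokenizer output).
TOKENS_PER_CJK_CHAR = 1.75   # midpoint of the 1.5-2 range
TOKENS_PER_EN_WORD = 1.3

def estimate_tokens(text: str) -> int:
    """Rough token estimate: count CJK characters and English words separately."""
    cjk_chars = len(re.findall(r"[\u4e00-\u9fff]", text))
    en_words = len(re.findall(r"[A-Za-z0-9']+", text))
    return round(cjk_chars * TOKENS_PER_CJK_CHAR + en_words * TOKENS_PER_EN_WORD)

def fits_in_window(text: str, context_window: int, reserved_for_output: int = 0) -> bool:
    """True if the estimated prompt size plus the output budget fits in the window."""
    return estimate_tokens(text) + reserved_for_output <= context_window

if __name__ == "__main__":
    for sample in ("Describe in one sentence what a cat is doing.", "用一句話描述一隻貓在做什麼。"):
        print(f"{estimate_tokens(sample):>3} est. tokens  {sample}")
```

Reserving room for the output matters because the context window covers the prompt and the reply together.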

📌 Learning Goals

After this stage you will be able to:
  • Explain what an LLM is, what tokens are, and what context window means
  • Make your first API call to Claude / GPT / Gemini and parse the response
  • Compare the four major LLM families (Claude / GPT / Gemini / Llama) on strengths
  • Estimate cost per task using per-token pricing

🌐 Major LLM Family Comparison (2026-05 snapshot)

"How is Claude different from GPT?" "Can I use Chinese models?" "Which OSS model should I run with Ollama?" This section gives you an objective side-by-side view. It does not declare a single "best" model: it compares strengths / good-fit tasks / weaknesses and includes official docs URLs so you can verify the claims yourself.

💡 First, a few terms:
  • Context window = the amount of conversation an LLM can remember in one pass; it is capped (for example, 200k tokens ≈ 100k-133k Chinese characters at 1.5-2 tokens per character)
  • Apache 2.0 / MIT = open-source terms that permit commercial use, modification, and closed-source redistribution; Llama Community License = open-source but with conditions (for example, orgs with >= 700M MAU need a separate license)
  • Frontier model = each provider's strongest flagship; OSS = open-source, with weights downloadable for self-hosting

🇺🇸 US Commercial Frontier (3 providers)

These 3 are SaaS APIs: you pay per token and cannot self-host them.

| Model family | Flagship (2026-05) | Context | Strengths | Best for | Official docs |
| --- | --- | --- | --- | --- | --- |
| Claude (Anthropic) | Opus 4.8 / Sonnet 4.6 / Haiku 4.5 | 1M (Haiku 4.5 is 200k) | long-form / coding / agent / safety alignment | writing papers / code review / agent runtime | platform.claude.com/docs |
| GPT (OpenAI) | GPT-5.5 / GPT-5 / o-series | ~400k | general-purpose / function calling / broadest ecosystem | broad queries / function-call frameworks / GPTs ecosystem | platform.openai.com/docs/models |
| Gemini (Google) | 3.1 Pro / Flash | 2M (Pro series; Flash is 1M) | long context / native multimodal / Google integration | PDF / video and audio / large document sets / Google Workspace | ai.google.dev |

🇨🇳 Chinese Commercial + Open-Source Frontier (7 providers)

These are the main choices for Chinese-language work. Some are API-only (DeepSeek / Kimi / Hunyuan); others also release OSS weights (Qwen / GLM-5.1 / Yi can run through Ollama).

| Model family | Flagship (2026-05) | Context | Strengths | Best for | License | Official |
| --- | --- | --- | --- | --- | --- | --- |
| DeepSeek | V3 (deepseek-chat) / R1 (deepseek-reasoner) ⚠️ V4-series weights are open-source; consumer API is not fully public yet | 128k | reasoning / coding / lowest cost | high-token workloads / code generation / math | API proprietary; some weights OSS on HF | api-docs.deepseek.com |
| Qwen (Alibaba) | Qwen3 (cloud DashScope + Apache 2.0 OSS) | 128k+ | strongest Chinese OSS / multimodal / agent | Chinese long-form writing / agent / self-host | Apache 2.0 (OSS) + proprietary (cloud) | qwen.ai · DashScope |
| Kimi (Moonshot) | K2.6 multimodal + Agent | very long context (1M+) | long context / Chinese long-form writing | whole-book reading / literature triage | Proprietary | platform.moonshot.cn |
| GLM (Zhipu) | GLM-5 proprietary / GLM-5.1 Apache 2.0 | 128k | Chinese / tool use / agent | Chinese agents / multi-turn chat | proprietary + Apache 2.0 (5.1) | open.bigmodel.cn · chatglm.cn |
| Hunyuan (Tencent) | T1 (deep-thinking, Transformer-Mamba MoE) + TurboS | 128k | DeepSeek R1-comparable reasoning, Chinese | Chinese reasoning / Tencent ecosystem | Proprietary | hunyuan.tencent.com |
| MiniMax | abab6.5 + M2.7 | 200k | multimodal / Chinese long prose | Chinese writing / video and audio multimodal | Proprietary | platform.minimax.io |
| Yi (01.AI / Kai-Fu Lee) | Yi-Lightning (new API flagship) / Yi-34B-Chat (OSS, 200k context) | 200k | Chinese OSS alternative to Llama | Chinese self-host / Chinese API | Apache 2.0 (OSS) / proprietary (Lightning) | 01.ai · GitHub |

⚠️ Xiaomi MiMo is listed in resources/cli-agents-guide.md for Hermes Agent routing, but as of 2026-05 there is no authoritative official source to verify it, so it is not included in this table. To try it, connect through Hermes Agent 200+ provider routing.

🌍 Western Open-Source (4 providers, self-host defaults)

These are the main choices for running on your own hardware, avoiding API fees, or handling privacy-sensitive work. You can install them in one command through Ollama.

| Model family | Active size | License | Strengths | Best for | Official |
| --- | --- | --- | --- | --- | --- |
| Llama (Meta) | 3.3 70B (Llama 4 Scout / Maverick were released in 2025-04 as larger MoE models) | Llama Community License | general-purpose / broadest ecosystem / Ollama default | self-hosting intro / fine-tune base | llama.com · HF Meta |
| Gemma (Google) | Gemma 4 26B MoE + 31B dense (released 2026-04; Arena #3) | Apache 2.0 | small and efficient / strong Apple MLX integration / multimodal | edge / mobile / 4-8 GB RAM machines | ai.google.dev/gemma |
| Mistral (Mistral AI) | 7B / Mixtral 8x7B / Codestral | Apache 2.0 (OSS parts) | strongest open-source 7B class | commercial self-host / EU sovereignty | mistral.ai · HF Mistral |
| Phi (Microsoft) | Phi-4 14B reasoning + Phi-4-multimodal-instruct (multimodal version) | MIT | small but strong / reasoning / edge-friendly | 4 GB+ RAM / mobile / reasoning intro | HF microsoft |

🎯 Which One Should I Pick? (by scenario)

| Your scenario | Pick + why |
| --- | --- |
| First time learning an LLM API, prioritize complete tutorials | Claude: Anthropic Cookbook + Courses are widely considered the most complete |
| Long-form writing / papers / code review | Claude Sonnet: long-form prose is a core strength |
| Multimodal (PDF / video and audio / images) | Gemini or Kimi: native multimodal |
| Broad queries + function calling frameworks | GPT: broadest ecosystem and deepest SDK integration |
| Chinese scenarios + commercial API | Kimi (strong long context; can fit whole books), DeepSeek (lowest cost), or GLM (agent-friendly) |
| Chinese scenarios + open-source self-host | Qwen 3 (Apache 2.0; currently the strongest Chinese OSS) |
| Reasoning / math (reasoning model) | DeepSeek R1 / Hunyuan T1 / OpenAI o-series |
| Privacy / offline / no API fees | Llama 3.3 / Gemma 4 / Qwen 3 OSS via Ollama |
| Edge / 4 GB RAM machine | Gemma 4 / Phi-4 / Qwen 3 (qwen3-3B or smaller variants) |
| 100k+ token large documents | Gemini 3.1 (2M context) or Kimi K2.6 (1M+) |
| Want the lowest cost (API-bill sensitive) | DeepSeek V4-Flash: lowest token price among same-tier English models |

📊 Neutral Benchmark Resources (verify for yourself; do not rely on one source)

| Resource | Use | URL | 2026-05 status |
| --- | --- | --- | --- |
| Artificial Analysis | Third-party benchmarks plus price/latency aggregation, including Chinese models | https://artificialanalysis.ai/ | ✓ Active |
| Arena AI (formerly LMSYS Chatbot Arena) | Human blind-test ELO leaderboard | https://arena.ai/leaderboard/text | ✓ Active |
| Vellum LLM leaderboard | Aggregates multiple benchmarks | https://www.vellum.ai/llm-leaderboard | ✓ Active |
| HuggingFace OpenLLM Leaderboard | Open-source model rankings | https://huggingface.co/spaces/open-llm-leaderboard | ⚠️ Occasional runtime errors as of 2026-05; use the Arena AI open-source tab as fallback |
| SuperCLUE | Authoritative benchmark for Chinese-language scenarios | https://www.superclueai.com/ | ✓ Active |

⚠️ Important Caveats

  • ⚠️ Benchmark != production performance: run a small eval on your specific task (for example, paste 10 real prompts and see which model answers closest to what you need); do not pick only from rankings
  • ⚠️ Frontier changes every 6 months: all numbers above are a 2026-05 snapshot; afterward, rely on official docs / Artificial Analysis
  • ⚠️ "Strength" is relative, not absolute: every frontier model can handle basic tasks; differences matter at the margin
  • ⚠️ For Chinese scenarios, check SuperCLUE: general international benchmarks such as MMLU are English-heavy, and Chinese-language performance may diverge
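
The "run a small eval on your specific task" advice can be sketched as a tiny harness. Everything here is illustrative: `run_mini_eval`, the stub models, and the sample cases are hypothetical, and keyword matching is a deliberately crude grader. In practice each callable would wrap a real API call and you would read the answers yourself as well.

```python
from typing import Callable, Dict, List

def run_mini_eval(
    models: Dict[str, Callable[[str], str]],
    cases: List[dict],
) -> Dict[str, float]:
    """Return each model's pass rate: the share of cases whose answer contains every required keyword."""
    scores = {}
    for name, ask in models.items():
        passed = 0
        for case in cases:
            answer = ask(case["prompt"]).lower()
            if all(kw.lower() in answer for kw in case["must_include"]):
                passed += 1
        scores[name] = passed / len(cases)
    return scores

# Stub "models" so the sketch runs offline; replace each with a real API call.
def stub_verbose(prompt: str) -> str:
    return f"Sure! Regarding '{prompt}': the refund window is 30 days."

def stub_terse(prompt: str) -> str:
    return "Please contact support within 30 days."

CASES = [
    {"prompt": "How long do I have to request a refund?", "must_include": ["30 days"]},
    {"prompt": "Who do I contact about billing?", "must_include": ["support"]},
]

if __name__ == "__main__":
    results = run_mini_eval({"verbose": stub_verbose, "terse": stub_terse}, CASES)
    for model, rate in results.items():
        print(f"{model:<8} pass rate = {rate:.0%}")
```

With about 10 real prompts from your own task, a table like this is usually more informative than a leaderboard position.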

🚪 Entry Conditions

You should already:
  • Be able to run a Python script
  • Know what HTTP / REST is conceptually
  • Have an API key from at least one provider (Anthropic / OpenAI / Google)

If not — go back to Stage 0 first.

📚 Required Reading

  1. Anthropic — Claude Model Overview — official model family overview, including 2026's latest Opus 4.8 / Sonnet 4.6 / Haiku 4.5
  2. anthropics/courses — Anthropic API Fundamentals ⭐⭐⭐⭐⭐ ★ 21k+ — Anthropic's official 5-course umbrella; module 1 "Anthropic API Fundamentals" maps to this stage. Jupyter notebooks, runs on Claude 3 Haiku (cheapest), hands-on walkthrough of API essentials.
  3. OpenAI Quickstart — first API call walkthrough
  4. A Visual Guide to LLM Tokenizers — Hugging Face's intro
  5. Anthropic API Pricing — read the pricing table, calculate cost for 1k input + 1k output
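
The "1k input + 1k output" calculation from item 5 can be checked with a few lines of arithmetic. This sketch reuses the per-1M-token prices from the PRICING table in Exercise 3 of this stage; treat them as a snapshot and confirm them on the official pricing page. `cost_usd` is a hypothetical helper name.

```python
# Per-1M-token prices (USD), copied from the PRICING table in Exercise 3 of this stage.
# They are a snapshot; re-check the official pricing page before relying on them.
PRICING = {
    "claude-haiku-4-5":  {"input": 1.00, "output":  5.00},
    "claude-sonnet-4-6": {"input": 3.00, "output": 15.00},
    "claude-opus-4-8":   {"input": 5.00, "output": 25.00},
}

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one call: tokens times the per-token rate, with rates quoted per 1M tokens."""
    rates = PRICING[model]
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000

if __name__ == "__main__":
    for model in PRICING:
        print(f"{model:<18} 1k in + 1k out = ${cost_usd(model, 1_000, 1_000):.4f}")
```

At these rates, output tokens cost five times as much as input tokens, so capping `max_tokens` is the quickest lever on a bill.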

🛠 Hands-on Exercises (foundational, illustrative)

🦙 This stage defaults to Ollama (cost-driven; gemma4:e4b runs locally for $0/run). Every exercise has Path A (Ollama, default) + Path B (Anthropic, optional — use it when you want to see cloud-quality answers). Full three-path trade-off in examples/README.en.md.

💰 Stage 1 budget estimate (all 6 exercises, 3-5 runs each): all local = $0, all haiku ≈ $0.30, all sonnet ≈ $0.90. Full model list + Stage 1-7 total budget: examples/README.en.md#recommended-llm-list.

💡 No Ollama yet? Each exercise also ships a Path B Anthropic version — pick one. To enable Path A in one step: pip install openai && ollama pull gemma4:e4b.

Exercise 1: LLM API (hello world)

Five-line Python script that calls an LLM and prints the response. Defaults to local Ollama (free, offline); switch to Path B Anthropic when you want cloud-quality answers. Details in examples/README.en.md.

📋 Starter code — Path A (local Ollama gemma4:e4b, default) (copy to practice_1.py and run python practice_1.py)
# Requires: pip install openai      (OpenAI-compatible SDK talks to Ollama)
# Pre-req: ollama pull gemma4:e4b && ollama serve
import sys
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Ollama doesn't check this — anything works
)

r = client.chat.completions.create(
    model="gemma4:e4b",   # swap to qwen2.5:3b / llama3.2:3b if preferred
    max_tokens=100,
    messages=[{"role": "user", "content": "Introduce yourself in one sentence."}],
)

# === Self-check ===
text = r.choices[0].message.content
print("Response:", text)
print("usage:", r.usage)

assert r.choices[0].finish_reason in ("stop", "length"), f"unexpected finish_reason: {r.choices[0].finish_reason}"
assert len(text) > 0, "response should not be empty"
assert r.usage.completion_tokens > 0, "output token count should be > 0"
print("✅ Exercise 1 passed — local Ollama gemma4:e4b answered for $0")
**How slow?** `gemma4:e4b` on CPU: ~5-30 s/answer; on GPU (RTX 3060+): <2 s. For speed use `gemma3:1b`; for quality use `qwen2.5:14b` / `llama3.1:8b` (needs 8 GB+ VRAM).
📋 Starter code — Path B (Anthropic API, optional, when you want cloud quality) (copy to practice_1_anthropic.py)
# Requires: pip install anthropic
# Env: export ANTHROPIC_API_KEY=sk-ant-...
import sys
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

import anthropic

client = anthropic.Anthropic()
msg = client.messages.create(
    model="claude-haiku-4-5",  # haiku = cheapest; switch to sonnet by changing this line
    max_tokens=100,
    messages=[{"role": "user", "content": "Introduce yourself in one sentence."}],
)

# === Self-check ===
text = msg.content[0].text
print("Response:", text)
print("usage:", msg.usage)

assert msg.stop_reason in ("end_turn", "max_tokens"), f"unexpected stop_reason: {msg.stop_reason}"
assert len(text) > 0, "response should not be empty"
assert msg.usage.input_tokens > 0 and msg.usage.output_tokens > 0, "token counts should be > 0"
print("✅ Exercise 1 passed — Anthropic API is reachable from your machine")
**Cost**: ~$0.001/run (haiku) or ~$0.004/run (sonnet); this hello-world is also 5-15× faster than Ollama.

Exercise 2: Tokens

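Run the same prompt repeatedly and watch the token counts vary. The starter code uses 10 runs per prompt locally and 20 on the API; raise the count toward 100 once a single run is fast enough.
  • Notice: temperature ≠ 0 produces variation
  • Notice: the token count for the SAME sentence in English vs Chinese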
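
Why does temperature produce variation? A toy sampler makes the mechanism visible. This is a simplified sketch: the three candidate tokens and their scores are made up, `softmax_with_temperature` is a hypothetical helper, and temperature 0 is treated as pure greedy decoding. Real providers layer other sampling controls on top, so their behavior can differ.

```python
import math
import random
from collections import Counter

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: all probability on the top score.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

TOKENS = ["sleeping", "eating", "jumping"]
LOGITS = [2.0, 1.0, 0.5]   # made-up scores for illustration

if __name__ == "__main__":
    rng = random.Random(0)
    for t in (0.0, 0.3, 1.0):
        probs = softmax_with_temperature(LOGITS, t)
        draws = Counter(rng.choices(TOKENS, weights=probs, k=1000))
        print(f"T={t}: probs={[round(p, 3) for p in probs]} draws={dict(draws)}")
```

As the temperature rises, probability spreads to lower-scored tokens, so repeated runs diverge in wording and therefore in length.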

📋 Starter code — Path A (local Ollama gemma4:e4b, default) (copy to practice_2.py)
# Requires: pip install openai
# Pre-req: ollama pull gemma4:e4b && ollama serve
import sys, statistics
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

PROMPTS = {
    "Chinese": "用一句話描述一隻貓在做什麼。",
    "English": "Describe in one sentence what a cat is doing.",
}

N = 10  # local is slower; start small
for label, prompt in PROMPTS.items():
    output_tokens = []
    for _ in range(N):
        r = client.chat.completions.create(
            model="gemma4:e4b",
            max_tokens=80,
            temperature=1.0,  # high temp to amplify variance
            messages=[{"role": "user", "content": prompt}],
        )
        output_tokens.append(r.usage.completion_tokens)
    print(f"\n[{label}] prompt: {prompt}")
    print(f"  input tokens: {r.usage.prompt_tokens}")
    print(f"  output tokens — min={min(output_tokens)} max={max(output_tokens)} mean={statistics.mean(output_tokens):.1f} stdev={statistics.stdev(output_tokens):.1f}")

# === Self-check ===
assert max(output_tokens) > min(output_tokens), "with temperature=1.0, output length should vary"
print("\n✅ Exercise 2 passed — observed temperature → token variance, $0/run")
print("💡 Chinese prompts typically use MORE input tokens (one Chinese character ≈ 2 tokens)")
📋 Starter code — Path B (Anthropic API, optional) (copy to practice_2_anthropic.py)
# Requires: pip install anthropic
import sys, statistics
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

import anthropic
client = anthropic.Anthropic()
PROMPTS = {"Chinese": "用一句話描述一隻貓在做什麼。", "English": "Describe in one sentence what a cat is doing."}

for label, prompt in PROMPTS.items():
    output_tokens = []
    for _ in range(20):
        msg = client.messages.create(model="claude-haiku-4-5", max_tokens=80, temperature=1.0,
                                     messages=[{"role": "user", "content": prompt}])
        output_tokens.append(msg.usage.output_tokens)
    print(f"[{label}] input={msg.usage.input_tokens} output min/max/mean={min(output_tokens)}/{max(output_tokens)}/{sum(output_tokens)/len(output_tokens):.1f}")
**Key SDK diffs**: `messages.create` → `chat.completions.create`; `usage.output_tokens` → `usage.completion_tokens`; `usage.input_tokens` → `usage.prompt_tokens`. **Cost**: 40 runs ≈ $0.01.
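
If you switch between the two paths often, the field-name differences can be hidden behind a small adapter. This is a minimal sketch: `normalize_usage` is a hypothetical helper, and the `SimpleNamespace` objects only imitate the shape of `msg.usage` and `r.usage` so that the example runs without either SDK installed.

```python
from types import SimpleNamespace

def normalize_usage(usage) -> dict:
    """Map either SDK's usage object onto one shape: {'input': int, 'output': int}."""
    if hasattr(usage, "input_tokens"):       # Anthropic-style field names
        return {"input": usage.input_tokens, "output": usage.output_tokens}
    if hasattr(usage, "prompt_tokens"):      # OpenAI-style field names (also used via Ollama's /v1 endpoint)
        return {"input": usage.prompt_tokens, "output": usage.completion_tokens}
    raise TypeError(f"unrecognized usage object: {usage!r}")

if __name__ == "__main__":
    # Stand-ins for msg.usage / r.usage.
    anthropic_like = SimpleNamespace(input_tokens=14, output_tokens=48)
    openai_like = SimpleNamespace(prompt_tokens=14, completion_tokens=48)
    print(normalize_usage(anthropic_like))
    print(normalize_usage(openai_like))
```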

Exercise 3: Pricing / Latency

For cost-sensitive work, you need to know how long 1000 hello-world inferences take and how much they cost. Local Ollama is $0 but pays in latency; cloud LLMs cost money but are faster. Knowing this trade-off is how you pick the right model.

📋 Starter code — Path A (local Ollama gemma4:e4b, measure latency) (copy to practice_3.py)
# Requires: pip install openai
# Pre-req: ollama pull gemma4:e4b && ollama serve
import sys, time
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

latencies, out_tokens = [], []
for _ in range(5):
    t0 = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
    r = client.chat.completions.create(
        model="gemma4:e4b",
        max_tokens=200,
        messages=[{"role": "user", "content": "Hi! Please introduce yourself."}],
    )
    latencies.append(time.perf_counter() - t0)
    out_tokens.append(r.usage.completion_tokens)  # record every run, not just the last one

avg_latency = sum(latencies) / len(latencies)
out_tok_avg = sum(out_tokens) / len(out_tokens)
# Throughput over all runs: total output tokens divided by total wall-clock time
tps = sum(out_tokens) / sum(latencies) if sum(latencies) > 0 else 0

print(f"model: gemma4:e4b (local)")
print(f"5 latencies (sec): min={min(latencies):.2f} max={max(latencies):.2f} mean={avg_latency:.2f}")
print(f"avg output: {out_tok_avg:.1f} tokens, ~{tps:.1f} tokens/sec")
print(f"\n1000-run cost: $0 (local); projected duration: {avg_latency * 1000 / 60:.1f} minutes")

# === Self-check ===
assert avg_latency > 0, "latency should be > 0"
assert out_tok_avg > 0, "output token count should be > 0"
print(f"\n✅ Exercise 3 passed — local model is $0 but takes ~{avg_latency * 1000 / 60:.0f} min for 1000 runs")
print("💡 Compare Path B Anthropic: 1000 runs is ~10-20 min at $0.25 (haiku)")
📋 Starter code — Path B (Anthropic API, compute $ cost) (copy to practice_3_anthropic.py)
# Requires: pip install anthropic
import sys
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

import anthropic

# Anthropic public pricing 2026 Q2 (per 1M tokens, USD) — verify at https://www.anthropic.com/pricing
PRICING = {
    "claude-haiku-4-5":   {"input": 1.00, "output":  5.00},
    "claude-sonnet-4-6":  {"input": 3.00, "output": 15.00},
    "claude-opus-4-8":    {"input": 5.00, "output": 25.00},  # Opus 4.8 (May 2026, Dynamic Workflows) — same 5/25 pricing
}

client = anthropic.Anthropic()
MODEL = "claude-haiku-4-5"
msg = client.messages.create(model=MODEL, max_tokens=200,
                             messages=[{"role": "user", "content": "Hi! Please introduce yourself."}])
in_tok, out_tok = msg.usage.input_tokens, msg.usage.output_tokens
rates = PRICING[MODEL]
cost_one = (in_tok * rates["input"] + out_tok * rates["output"]) / 1_000_000

print(f"model: {MODEL}")
print(f"single: input={in_tok} output={out_tok} → ${cost_one:.6f}")
print(f"1000 calls cost across model tiers:")
for name, r in PRICING.items():
    c = (in_tok * r["input"] + out_tok * r["output"]) / 1_000_000 * 1000
    print(f"  {name:<22} ${c:.4f}")

assert cost_one > 0, "Cloud LLM always has a cost"
print(f"\n✅ Exercise 3 passed (Anthropic) — 1000 runs: haiku ≈ $0.25, sonnet 4.6 ≈ $0.76, opus 4.8 ≈ $1.27")
**Expected output**:
model: claude-haiku-4-5
single: input=14 output=48 → $0.000254
1000 calls cost across model tiers:
  claude-haiku-4-5       $0.2540
  claude-sonnet-4-6      $0.7620
  claude-opus-4-8        $1.2700
**Trade-off**: local Ollama is $0 for 1000 runs but takes ~2 hr; Anthropic haiku is ~10 min for $0.25; sonnet ~10 min for $0.76. **Use cloud only for production; learning / experiments / debug stay local.**

Exercise 4: Cross-Provider Comparison

Send the same prompt to Claude, GPT, and Gemini simultaneously, compare their responses. Notice "why does the same input produce different answers" — answer style, length, and judgment all differ. Use the OpenAI, Anthropic, and Google SDKs side-by-side.

Starter template: examples/stage-1/04-cross-provider/ (parallel calls to all three SDKs + comparison table; missing keys are skipped gracefully; illustrative, not a chapter-length tutorial)
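
The shape of such a comparison (parallel calls, skipping providers without a key) can be sketched as follows. This is not the code in the starter template: `compare` and the stub callables are hypothetical, and the environment variable names for OpenAI and Google are assumptions you should check against each SDK's docs.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Mapping, Tuple

# name -> (env var holding the key, function that sends the prompt and returns text)
Provider = Tuple[str, Callable[[str], str]]

def compare(prompt: str, providers: Dict[str, Provider],
            env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Send one prompt to every provider that has a key set; run the calls in parallel."""
    active = {name: call for name, (key_var, call) in providers.items() if env.get(key_var)}
    results = {name: "(skipped: no API key)" for name in providers if name not in active}
    with ThreadPoolExecutor(max_workers=max(1, len(active))) as pool:
        futures = {name: pool.submit(call, prompt) for name, call in active.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result(timeout=60)
            except Exception as exc:   # one provider failing should not hide the others
                results[name] = f"(error: {exc})"
    return results

if __name__ == "__main__":
    # Stub callables keep the sketch offline; swap in real SDK calls for each provider.
    stubs = {
        "claude": ("ANTHROPIC_API_KEY", lambda p: f"[claude stub] {p}"),
        "gpt":    ("OPENAI_API_KEY",    lambda p: f"[gpt stub] {p}"),      # assumed variable name
        "gemini": ("GOOGLE_API_KEY",    lambda p: f"[gemini stub] {p}"),   # assumed variable name
    }
    for name, text in compare("Say hi.", stubs).items():
        print(f"{name:<7} {text}")
```

Threads are sufficient here because each call spends nearly all of its time waiting on the network.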

Exercise 5: Error Handling

Trigger error conditions deliberately and write retry logic:
  • Wrong API key → see how it raises
  • Over-long prompt → what happens when the context window is full
  • Network drop → write a retry wrapper with exponential backoff

This is foundational for Stage 3-7's production agent code.

Starter template: examples/stage-1/05-error-handling/ (mock-based tests so you can verify the retry logic without unplugging your ethernet cable; illustrative, not a chapter-length tutorial)
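
A retry wrapper with exponential backoff might look like the sketch below. It is not the starter template's code: `retry_with_backoff` is a hypothetical helper, and the built-in `ConnectionError` / `TimeoutError` stand in for the SDK-specific exception classes you would list in `retry_on` in real use. The `sleep` parameter is injectable so the logic can be tested without waiting.

```python
import random
import time

def retry_with_backoff(func, *, max_attempts=4, base_delay=1.0, max_delay=30.0,
                       retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func(); on a retryable error wait base_delay * 2**attempt (plus jitter) and try again."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise                                   # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))   # small jitter avoids synchronized retries

if __name__ == "__main__":
    calls = {"n": 0}

    def flaky():
        """Fail twice with a simulated network drop, then succeed."""
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("simulated network drop")
        return "ok"

    waits = []   # record the delays instead of actually sleeping
    print(retry_with_backoff(flaky, sleep=waits.append), "after", calls["n"], "attempts")
    print("waits:", [round(w, 2) for w in waits])
```

Errors outside `retry_on` propagate immediately. That is the right behavior for a wrong API key, since retrying cannot fix it.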

Exercise 6: Local LLM

No API fees, runs on your machine: use Ollama to pull a small model (recommend llama3.2:3b or qwen2.5:3b), call it via OpenAI-compatible API.

# 1. Install Ollama: https://ollama.com
ollama pull qwen2.5:3b
ollama serve  # default port 11434
📋 Starter code (copy to practice_6.py)
# Requires: pip install openai
# Pre-req: Ollama is running, qwen2.5:3b is pulled
import sys
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Ollama doesn't check this — anything works
)

r = client.chat.completions.create(
    model="qwen2.5:3b",
    messages=[{"role": "user", "content": "Explain ReAct in 3 sentences."}],
)

text = r.choices[0].message.content
print("Response:", text)

# === Self-check ===
assert len(text) > 10, "response is too short — Ollama may not be running"
print(f"✅ Exercise 6 passed — local Ollama reachable through the OpenAI-compatible API")
print(f"💡 This run cost you $0 (except for electricity)")
**Why do this**: once you can run local LLMs, Stage 3-6 experiments aren't bottlenecked on API costs; privacy-sensitive work also stays offline.

🎯 Curated Projects

5 categories, 17 projects in one table. Pick by "Best for"; click through for depth on the repo / course site.

| Category | Project | Best for | Why / Notes |
| --- | --- | --- | --- |
| Official cookbook / starting point | Anthropic Cookbook ⭐⭐⭐⭐⭐ | Starting with Claude API; reference lookup | Full-feature Claude API notebooks (tool use / batch / prompt cache), ★ 42k+, MIT |
| | Anthropic Courses ⭐⭐⭐⭐⭐ | Systematic Claude learning from zero | Anthropic's own 5-course set (API fundamentals / prompt eval / real-world prompting / tool use), ★ 21k+. Start with anthropic_api_fundamentals |
| | OpenAI Cookbook ⭐⭐⭐⭐⭐ | OpenAI API + structured output / function calling | Pair with Anthropic Cookbook, ★ 73k+, MIT. Much bigger than Anthropic's, so use search |
| | Anthropic Claude API Quickstart ⭐⭐⭐⭐ | 5-minute start | Official docs, bookmark it |
| Chinese textbook (chapter-style) | datawhalechina/happy-llm ⭐⭐⭐⭐⭐ | Chinese readers wanting LLM internals | Karpathy "Zero to Hero" Chinese counterpart, ★ 29k+. Equivalent to HF LLM Course in Chinese |
| | datawhalechina/llm-universe ⭐⭐⭐⭐⭐ | Chinese newcomers building with LLM | API basics / knowledge base / RAG / advanced tricks, ★ 12k+ |
| | datawhalechina/llm-cookbook ⭐⭐⭐⭐ | Full Chinese LLM learning path | Adapted Chinese translation of Andrew Ng's courses (⚠️ updates slowed after 2025-06, CC BY-NC-SA) |
| | jingyaogong/minimind ⭐⭐⭐⭐ | Post-Karpathy, want a real training run | 2hr to train a 64M LLM from scratch: Pretrain + SFT + LoRA + DPO + RLHF, ★ 48k+, Apache-2.0 |
| English course (systematic) | HuggingFace LLM Course ⭐⭐⭐⭐⭐ | Transformer internals + HF ecosystem | Transformer theory + applications, Apache 2.0 |
| | LangChain Academy ⭐⭐⭐⭐ | Visual learners who like video courses | LangChain's official free course, includes RAG / agent. Skip the LangChain marketing segments |
| Local execution (no API costs) | ollama/ollama ⭐⭐⭐⭐⭐ | First-time local LLM | This repo's Path A default, OpenAI-compat API, ★ 170k+ |
| | ggml-org/llama.cpp ⭐⭐⭐⭐⭐ | Understanding quantization / how 7B fits in 8GB RAM | Ollama's underlying inference engine, ★ 108k+, MIT |
| | mudler/LocalAI ⭐⭐⭐⭐ | Team compliance, self-host full OpenAI replacement | Drop-in OpenAI API replacement (chat / embedding / image / TTS / STT), ★ 46k+ |
| | ml-explore/mlx ⭐⭐⭐⭐ | Mac dev, squeeze Apple Silicon | Apple's ML framework for M1+, ★ 25k+. Pair with mlx-lm for ease |
| Build from scratch (understand internals) | karpathy: Let's build GPT from scratch ⭐⭐⭐⭐⭐ | Understand LLM internals, not just API calls | 2hr high-density video, build GPT in PyTorch from scratch. Pause and code along, don't passive-watch |
| | rasbt/LLMs-from-scratch ⭐⭐⭐⭐⭐ | Book-pace read of the same material | Book version of Karpathy's video: tokenizer → attention → pretraining → finetuning, ★ 91k+, Apache-2.0 |
| | karpathy/LLM101n ⭐⭐ | Historical reference | ⚠️ Archived (2024-08), outline only, course never finished. Watch "Build GPT from scratch" above instead |

💡 Suggested reading order: API-first → Anthropic / OpenAI Cookbook · Chinese systematic path → happy-llm + llm-universe · deep internals → Karpathy video + rasbt book with code · local-only → start with Ollama, then llama.cpp.

✅ Self-Check Before Stage 2

Can you:
- [ ] Make a Claude API call from Python in 5 lines
- [ ] Explain why "你好" might use 2 tokens but "Hello" uses 1
- [ ] Quote roughly the per-token price for Claude Sonnet vs Opus
- [ ] Name one strength of Claude vs GPT vs Gemini vs Llama

If yes → proceed to Stage 2 — Prompt Engineering.

If no → re-read the Anthropic Quickstart + run all 3 hello-X projects above.


Done with Stage 1? Next, Stage 2 — Prompt Engineering takes 5-12 hours to walk you through writing reusable structured prompts, using few-shot and chain-of-thought for reasoning tasks, and learning to quantify prompt improvement with evals. Keep going →