A thermal reading of
artificial intelligence
You point a thermal scope at the dark and it doesn't see a hog — it reads heat and predicts shape. Modern AI works the same way: it doesn't know things the way you do, it predicts them from pattern. This guide takes you from that one idea all the way up to building autonomous agents — cold to white-hot, beginner to deep.
What an AI actually is
Forget the sci-fi. At its heart, today's AI is one surprisingly simple machine doing one thing absurdly well.
A large language model — the thing inside Claude, ChatGPT, all of them — is a next-piece-of-text predictor. You give it some text, and it predicts what comes next. Then it adds that piece, looks at everything again, and predicts the next piece. Over and over.
That's it. That's the whole engine. It's the autocomplete on your phone, except instead of suggesting one word it has read a huge fraction of everything ever written, and it predicts not just the next word but paragraphs, code, arguments, plans — one small chunk at a time.
A trail cam doesn't understand "deer." It has been tuned on millions of frames until it can predict, from a pattern of pixels, "this shape, at this hour, moving this way = deer." The model is the same trick at planetary scale, but for language instead of pixels. It has seen so much text that it has absorbed the statistical shape of how ideas follow one another — and through that, a startling amount about the actual world.
Three things it is not
- Not a database. It doesn't store and retrieve facts in slots. Knowledge is smeared across billions of weights as patterns, which is why it can be fluent and confidently wrong in the same breath.
- Not a search engine. Out of the box it isn't looking anything up. It's generating plausible continuations. (Tools can give it search later — that's Module 11.)
- Not a person. No goals, no memory between conversations, no understanding in the human sense. It's a very deep pattern-completer that's good enough to look like all three.
When your Agent Bricks incident agent turns a Tronox Flash Report into clean JSON, no rule engine is parsing fields. The model is predicting "given this messy report, the next characters of a well-formed JSON object are…" — pattern completion pointed at a structured target.
Tokens & meaning-as-geometry
Before a model can predict text, text has to become numbers. How that happens explains a lot of the model's quirks.
Models don't see letters or words. They see tokens — chunks of text, usually about ¾ of a word. Common words are one token; rare ones get split. hunting is one token; WuTangNAS is several stitched-together pieces.
Every token is turned into an embedding: a long list of numbers (a vector) that places the token as a single point in a vast multi-dimensional space. The whole point of that space is that meaning becomes distance. Tokens with similar meaning sit near each other; unrelated ones sit far apart.
Because meaning lives in geometry, you can do arithmetic on concepts. The classic result: take the vector for "king," subtract "man," add "woman," and you land right next to "queen." The model never learned that rule — it fell out of the shape of the space, learned purely from reading.
This representation explains real behavior you'll hit:
- Token math, not letter math. Ask a model to count the r's in "strawberry" and it may fumble — it sees tokens, not letters. Same reason it's shaky at character-level tricks.
- Cost and limits are in tokens. API pricing, context limits, speed — all measured in tokens, not words. Roughly 750 words ≈ 1,000 tokens.
- Retrieval runs here. When you "search your journal by meaning" later (RAG, Module 12), you're finding the nearest points in exactly this kind of space.
Under the hood · what's in the vector
An embedding is typically hundreds to thousands of numbers (dimensions). Each dimension isn't a clean human label like "animal-ness" — meaning is distributed across many of them at once. During training the model arranges this space so that whatever directions help it predict the next token become the axes of meaning. Position encoding is also added so the model knows token order, since "creek near deer" and "deer near creek" share tokens but differ in arrangement.
Transformers & attention
One idea — "attention" — is why this generation of AI works at all. It's the T in GPT.
To predict the next token well, the model has to figure out which earlier tokens actually matter. The mechanism that does this is attention: for every token, the model looks across all the other tokens and weighs how relevant each one is, right now, to what it's trying to predict.
Read: "the buck crossed the creek and then it bedded down." To know what "it" means, you glance back and lock onto "buck," not "creek." Attention is that glance — done for every word, against every other word, all at once. The model learns to point each token's attention at whatever helps predict what comes next.
Stacked into layers
A single attention pass isn't enough. A transformer stacks dozens of these blocks. Each block re-mixes the tokens through attention and then a small processing step, passing richer representations up the stack. The pattern that emerges in practice:
- Early layers catch surface stuff — grammar, word shape, "this is a noun."
- Middle layers assemble relationships — who did what to whom.
- Deep layers hold abstract meaning, intent, and the threads needed to predict what's coming.
Under the hood · query, key, value (the real mechanic)
Each token produces three vectors: a query ("what am I looking for?"), a key ("what do I offer?"), and a value ("what I'll hand over if chosen"). The model compares one token's query against every token's key to get a relevance score, softmaxes those scores into weights that sum to 1, then blends everyone's values by those weights. That blend is the token's new, context-aware representation.
"Multi-head" attention just runs several of these in parallel, each free to track a different kind of relationship — one head might follow grammatical subjects, another long-range references like our "it → buck." Crucially it's all matrix multiplication, which is why GPUs eat it for breakfast and why this architecture scaled when older sequential ones (RNNs) stalled.
How it learns
Where the "intelligence" comes from: three stages that turn raw text into a helpful assistant.
The model starts as billions of random numbers (parameters, or "weights"). Training is the process of nudging those numbers until the machine gets good at prediction. It happens in three stages, each doing a different job.
Training is sighting in a rifle, a few billion times. Each training example is a shot: the model predicts, you measure how far off it landed from the true next token (the "loss"), and you nudge the weights a hair toward center. One shot teaches almost nothing. Trillions of shots, and the groupings tighten into something that writes code and explains attention. That nudging process is gradient descent — the math just tells you which direction, and how far, to turn each of the billions of knobs.
What "learning" leaves behind
- Frozen weights. After training, the numbers are fixed. When you chat with it, it is not learning — it's running. New knowledge only enters through what you put in front of it (context) or tools.
- A knowledge cutoff. It only "knows" what existed in its training data. Anything after that date has to be fed in — which is exactly why web search and retrieval exist.
- Smeared, not stored. Facts live as patterns across weights, so the model can blend two true things into one false thing with total confidence. This is hallucination, and it's structural, not a bug you can fully patch. (Verification is Module 18.)
This is why a base model can't know your 10.175.128.x subnet or your Holosun case number. That knowledge was never in training — it has to ride in through context or a tool. Most of "using AI well" is really the craft of getting the right things in front of frozen weights at the right moment.
Inference, sampling & the context window
What actually happens the instant you hit enter — and the single most important constraint to understand.
Using a trained model is called inference. It computes the probability of every possible next token, then has to actually pick one. How it picks is controlled by a dial called temperature.
This is why the same prompt can give different answers, and why a model can sound certain about something it's inventing — it's sampling from a distribution, not reciting a record.
The context window — the one limit that explains everything
The context window is everything the model can see at once: your message, the system instructions, the conversation history, any documents or tool results — all of it, counted in tokens. It is the model's entire working memory. Nothing outside the window exists to the model.
- No memory between sessions. A fresh conversation starts blank. Any "memory" feature works by re-injecting saved facts back into the window — it's not the model remembering, it's the harness reminding it.
- Bigger isn't free. Larger windows cost more and can dilute focus — bury the key instruction in 100k tokens of noise and the model may lose the thread. Curation beats dumping.
- This is the lever you control most. Everything in Modules 8–17 — prompts, tools, RAG, memory, agents — is ultimately about managing what's in this window.
Thinking before answering
Some models are trained to spend extra inference on a private scratchpad before they reply — a second way to buy accuracy, paid in tokens.
There are two ways to make a model smarter. The old lever is train-time scale — more parameters, more data. Reasoning models add a second: test-time compute. Instead of a bigger model, you let the same model do more work at the moment you ask — generating a long private chain of reasoning, then writing the answer you see. Spent well, that compute can let a smaller model match a much larger one on hard problems.
The "thinking" is not a new kind of cognition. It's the same next-token prediction from Module 5, just a lot more of it, aimed inward: more tokens means more chances to break a problem into steps, try an approach, notice a mistake, and correct course before committing. OpenAI calls these reasoning tokens; Anthropic emits thinking content blocks. Either way they're billed as output (you pay for every one), they eat context-window space, and the raw trace is usually hidden or summarized — you see a précis, not the full scratchpad.
Glassing the field vs. snapping the shot. A normal model is the snap shot — fast, fine when the hog's broadside at 40 yards. A reasoning model is glassing before the trigger: settle the AGM Taipan, range it, read the wind, check what's behind the target, confirm it's a hog and not a calf — then shoot. That costs time (latency) and you burn it on every setup whether the shot was easy or not (cost). On a hard, low-light shot it's the difference between a clean kill and a wounded animal lost in the brush. On a 40-yard gimme the glassing changed nothing. Glass when the shot is hard; don't glass a gimme. And glassing carefully isn't the same as hitting — you can range, read, and breathe perfectly and still miss.
Trained to reason vs. just told to
This is the honest distinction from "think step by step." Prompted chain-of-thought is an inference-time trick on a normal model: you write "let's think step by step" and it produces visible steps. It helps, but the model was never trained to reason — it's pattern-matching the shape of worked examples and can't reliably backtrack or check itself. A trained reasoning model has the chain-of-thought baked in by reinforcement learning: it's rewarded for reaching correct answers on checkable problems (math, code, logic), and over training it learns strategies — self-verification, trying alternatives, catching its own errors mid-stream. DeepSeek-R1 showed these behaviors emerge from RL without being programmed. That's why you don't need to tell a reasoning model to slow down; it already does, and its self-correction is real rather than performed.
The control dial: reasoning effort
You don't just toggle thinking on and off — you meter it. OpenAI exposes reasoning.effort (none → minimal → low → medium → high → xhigh, model-dependent). Anthropic exposes extended thinking with a budget_tokens cap or an adaptive effort where the model decides how much to think per request. Newer models lean adaptive: think hard on the hard parts, barely at all on the easy ones. The tradeoff stated plainly: higher effort buys accuracy on genuinely hard problems, at the cost of more latency (seconds to minutes) and more dollars — and on easy problems it buys little or nothing.
The rule: reasoning buys accuracy on hard problems, and you pay for it whether the problem was hard or not. Reach for it when correctness matters more than speed and the problem actually needs steps. Default to a normal model otherwise.
Under the hood · the harder caveats
Not a correctness guarantee. More thinking raises the odds on reasoning-heavy tasks; it does not make the model right. A reasoning model can think for 30 seconds and still confidently hand you a wrong answer — the trace can rationalize a flawed conclusion just as fluently.
Overthinking is real. Test-time compute does not monotonically improve accuracy. Past a point, extra reasoning can talk a model out of a correct first instinct and degrade its confidence calibration. Bigger budget ≠ better.
Weak on knowledge-bound tasks. When the bottleneck is what the model knows rather than how it reasons, more thinking gives little benefit and can increase hallucination. Thinking can't conjure facts it doesn't have — that's a job for retrieval (RAG, Module 12), not more reasoning tokens.
The trace is summarized. You generally can't fully audit why it concluded something; don't treat the visible "thinking" as a faithful, complete log.
Reach for effort surgically, not globally. Marked's "Ask Your Journal" is the wrong place — it's retrieval-and-summarize (knowledge-bound), where extra thinking adds cost and can increase hallucination; use a fast model + good RAG. The Tronox extraction's planning step — mapping a messy Flash Report into the schema, reconciling ambiguous fields — is a legitimate place to dial reasoning up, because a wrong structured value flowing to finance is expensive; the bulk extraction stays cheap. On OpenRange / Argus, effort is a cost knob exactly like a per-run token budget: on a Jackery-powered offline node, "think harder" literally means "burn more battery per detection," so routine motion classification wants low/none, reserving thinking for genuinely ambiguous frames.
Build a reasoning-effort A/B harness over ~20 real Flash Reports. Run the extraction twice — once at low/none, once at high — and log per report: accuracy vs. a hand-labeled gold set, total tokens (including reasoning tokens from usage / output_tokens_details), latency, and dollar cost. First step: write the gold labels for the 20 samples and a thin runner that flips one effort parameter and records the usage block. The payoff is a number you can feel — "high effort cost 6x and only fixed 2 of 20," or "it fixed the 3 that mattered." Any alerts go through ntfy only.
The model vs. the thing you talk to
Everything so far described the engine. Now we put a vehicle around it.
A raw LLM is just the prediction engine from Modules 1–6: text in, next token out. By itself it has no persona, no rules, no tools, no memory, no idea it's in a chat. The product you actually use — Claude, ChatGPT, Claude Code, your Marked chatbot — is the model plus a whole apparatus wrapped around it. That apparatus is the harness, and it's where almost all the product engineering lives.
This distinction is the hinge of the whole guide. From here up, the heat is about systems, not the engine. The engine barely changes between "a chatbot" and "an autonomous agent that refactors your codebase" — the harness is what changes.
Harnesses
The single most useful concept for a power user. Once you see the harness, you can't unsee it.
A harness is everything wrapped around the raw model to turn a next-token predictor into a useful system. It's the part you, as a builder, actually design and control.
The LLM is the bare thermal sensor. On its own it's a chip that turns heat into a signal — useless in your hands. The harness is the whole scope: the housing, the reticle, the rangefinder, the ballistic calculator, the zeroing, the trigger discipline built into how you use it. Same sensor sits inside a $300 monocular and a $5k AGM — the instrument around it is what makes one deadly. Claude Code, your Marked chatbot, your Agent Bricks agent: all the same sensor, different scopes.
The pieces
- System prompt — standing instructions injected before your message: who it is, the rules, the tone, the output format. (This guide's structure, your Marked chatbot's "land manager" persona — all system prompt.)
- Tool definitions — the menu of actions it's allowed to call: search, run code, query a DB, fire an ntfy push. (Module 11.)
- Memory / state — what survives between turns or sessions, re-loaded into context as needed. (Module 12.)
- Context manager — the logic deciding what actually gets shown to the model each call, given the window limit.
- The loop — whether the harness calls the model once and stops (chat) or repeatedly toward a goal (agent). (Modules 8–9.)
- Guardrails — budgets, max iterations, human approval gates on irreversible actions.
A great way to feel a harness is to build the thinnest possible one: a ~40-line script with a system prompt, one tool (say, a function that reads a file), and a single model call. Have Claude Code scaffold tiny-harness/ with the Anthropic SDK, then add a second tool and watch the system prompt do the steering. This makes Modules 8–10 concrete instead of abstract.
Chat vs. agents
The question you actually asked. The answer is smaller than it looks — and it lives entirely in the harness.
Same engine. The difference between "a chatbot" and "an agent" is not the model — it's how many times the harness lets the model act, and whether the model gets to decide its own next step.
- Chat — you ask, it answers. One model call per turn. You are the loop: you read the reply, decide what's next, and type again. The human is in the loop on every single step.
- Agent — you give it a goal and a set of tools, and the harness lets the model loop on its own: decide an action, take it, look at the result, decide the next action — repeating until the goal is met. The human sets the goal and the guardrails; the model drives the steps.
It's a spectrum, not a switch
- Plain chat → conversation, no actions.
- Chat + tools → it can search or run code once to answer better, but you still drive.
- Workflow → fixed steps you defined, run in order. Predictable, no improvisation.
- Agent → the model chooses the steps and the order to hit your goal.
- Multi-agent → several agents with roles, coordinated. (Module 14.)
Asking me to draft a vendor email = chat. Your incident-extraction agent (Flash Report in → fixed parse → JSON out) is really a workflow — same path every time, which is exactly what you want for something that feeds finance. Claude Code editing across files, running your tests, reading the failure, fixing, re-running = a true agent. Knowing which one a task should be is half of using AI well.
The agent loop
This is the diagram to tattoo on your brain. Every agent — Claude Code, a research agent, your future projects — is some version of this.
An agent is a loop with a goal. The model thinks about what to do, acts by calling a tool, observes the result, and repeats — each result folded back into context so the next thought is better informed. This pattern is often called ReAct: Reason + Act.
Why looping is the superpower
A chat model gets one shot. An agent gets to be wrong and recover. It can write code, run it, read the error, fix it, and re-run — exactly what a human engineer does. The loop turns a one-shot guesser into something that converges on a working result. That's why Claude Code can actually fix a failing test suite instead of just suggesting a patch and hoping.
The same autonomy that recovers from errors can compound them. Loops can spin forever, wander off-goal, burn tokens, or — worst case — take an irreversible action (delete a file, send the wrong email, push bad code). Every serious agent harness therefore has brakes: a max-iteration budget, a token/time budget, clear stopping conditions, and human checkpoints before anything irreversible. Autonomy without brakes isn't powerful, it's a liability.
- Stopping conditions — the agent must know "done." Vague goals loop forever; checkable goals ("all tests pass," "file written and validates") terminate cleanly.
- Budgets — cap iterations and spend so a confused agent fails cheap instead of expensive.
- Checkpoints — gate destructive or external actions behind a human "yes." This is the difference between a helpful agent and a loose cannon.
Building a real loop is the best way to internalize this. A clean starter for your world: a small local agent whose only tool is "send ntfy notification," with a goal like "watch this log file and alert me when pattern X appears." It exercises the full Think→Act→Observe cycle with a safe, reversible action and lands squarely on your ntfy-first rule. Natural on-ramp toward the OpenRange / Argus alerting brains.
Tools, function calling & MCP
The "act" in the loop. How a frozen text-predictor reaches out and changes something real.
A model can't natively query your database or push a notification — it only emits text. Tools bridge that gap. You describe a function to the model ("send_ntfy, takes a message"); when it wants to use it, it emits a structured request; your harness runs the real function and feeds the result back into context. That request-and-return protocol is function calling.
MCP — the standard that makes this plug-and-play
Early on, every tool had to be wired by hand into every app. MCP (Model Context Protocol) fixes that: it's an open standard for exposing tools and data so any model can plug into any tool without custom glue. Think of it as USB-C for AI — one connector shape, and your Pocket recorder, Gmail, Drive, Supabase, and a dozen others just snap in.
Your Pocket voice-recorder MCP is exactly this pattern: Pocket exposes "search my recordings" as an MCP tool, and I can call it to pull a transcript into context, then build learning materials from it. Same with your Gmail/Outlook connectors for batch inbox triage. When you wondered about Omnigent as a meta-harness for coding agents — that's a harness that orchestrates other harnesses, and MCP is the wiring that lets them all share tools.
Under the hood · the security edge of tools
Tools are where an AI stops being a sandbox and starts having real-world reach — which is where risk concentrates. Two failure modes matter. First, prompt injection: content a tool pulls in (a web page, an email, a file) can contain text trying to hijack the model's instructions. Treat tool output as untrusted data, never as commands. Second, irreversible actions: a tool that deletes, sends, pays, or changes permissions deserves a human checkpoint, because an agent's confident mistake executes instantly. The rule of thumb: read-only tools can run free; world-changing tools get a gate.
Context engineering, RAG & memory
The model only knows what's in its window (Module 5). So the real art is deciding what goes in it.
Since the weights are frozen and the window is finite, everything useful comes down to one craft: getting the right information in front of the model at the right moment. That's context engineering, and it has two big tools — retrieval and memory.
RAG — Retrieval-Augmented Generation
Instead of relying on what the model memorized, RAG fetches relevant material at question time and stuffs it into the window before the model answers. The fetch uses the embedding space from Module 2: your question becomes a vector, and you grab the nearest chunks of your own documents.
Marked's "Ask Your Journal" is RAG: your harvest logs and stand notes get embedded into Supabase, a question retrieves the most relevant entries, and the model answers grounded in your seasons — not generic deer facts. RAG is also the honest fix for hallucination and stale knowledge: instead of trusting the model's memory, you hand it the source and say "answer from this."
Memory — persistence across the gaps
The model forgets everything between sessions. Memory is a harness feature that stores durable facts and selectively re-injects them into context when relevant — so it can "remember" your subnet, your vendors, your projects. Key mental model: the model isn't remembering; the harness is reminding. Memory is a store on the side, loaded back into the window on demand.
- Context window = short-term working memory, this conversation only, wiped at the end.
- Memory store = long-term notes the harness keeps and re-surfaces — like the running picture I keep of your stack so you don't re-explain it every time.
- RAG store = a searchable body of documents pulled in on demand by relevance.
More context is not better context. A window stuffed with marginally-related junk dilutes the model's focus and can bury the one instruction that mattered. Good context engineering is curation, not accumulation: the fewest, most relevant tokens that fully specify the task. When a long agent run gets polluted with dead ends, the right move is often to start a fresh window with a clean summary — not to keep piling on.
Fine-tuning vs. RAG
Now that you know what RAG is, here's when to reach for it versus changing the model itself.
RAG changes what the model knows; fine-tuning changes how the model behaves. Pick by asking which of those your problem actually is — and most of the time the honest answer is "reach for retrieval first."
They get confused because both are sold as "customize the AI on your data," but they're completely different levers. RAG leaves the frozen weights alone (Module 4): at question time it fetches the relevant facts from an external store (Module 12) and drops them into the context window. Knowledge lives outside the model; you swap it freely. Fine-tuning continues training the model on your examples so its default behavior shifts — tone, format, the shape of an answer, a niche task it does reliably without being re-instructed. Knowledge baked into weights; changing it means training again.
Open-book vs. closed-book exam. RAG is an open-book exam: the model looks every fact up in your binder at question time — so the binder can be today's incident reports, and the answer cites the page. Fine-tuning is a closed-book exam: the model studied until the way it answers is second nature, but whatever wasn't in the studying isn't in its head. RAFT is studying and then sitting the open-book exam — it learned how to read your binder and ignore the irrelevant pages. You don't cram facts the night before; you keep facts in the binder and fine-tune the test-taking technique. On the scope: fine-tuning is re-flashing the firmware's image processing; RAG is the rangefinder feeding a live number into the ballistic calc each shot. You'd never re-flash firmware to account for today's wind — you feed today's wind in live.
Which lever, when
- Reach for RAG when facts are fresh / changing (prices, policies, this week's stand notes), proprietary to you (your subnet, your journal, your medallion tables), or you need grounding and citations traceable to a source. Auditability is structurally a RAG property — weights can't cite.
- Reach for fine-tuning when you need a consistent format, style, or voice every time without re-prompting, a narrow task done reliably (emit exactly this JSON shape) where a long prompt is brittle, or latency / cost wins — bake behavior in so each call needs a shorter prompt.
The decision question, front and center: is my problem about what the AI knows, or how it behaves? Knowledge → RAG. Behavior → fine-tuning. And they're not rivals — the strong pattern is fine-tune for behavior and layer RAG for facts. Berkeley's RAFT formalizes it: a model fine-tuned specifically to read retrieved documents — including learning to ignore irrelevant "distractor" chunks — beats either approach alone on domain-specific QA.
Failure modes — credibility lives here
- Fine-tuning to "add knowledge." The seductive mistake. Research (Gekhman et al., EMNLP 2024) shows models learn fine-tuning examples containing new facts much slower than facts they already half-know — and as those new-knowledge examples get learned, they linearly increase the model's tendency to hallucinate. The field's takeaway: models acquire facts in pretraining; fine-tuning teaches them to use what they have, not to learn new facts. Want it to know something new? That's RAG's job.
- Stale indexes. RAG is only as fresh as its store. An index not re-embedded when source data changes will confidently serve last quarter's price. RAG moves the freshness problem out of the weights — it doesn't delete it.
- Retrieval quality dominates. Garbage in, garbage out: no model rescues a bad fetch. Most "RAG is broken" pain is a retrieval problem — bad chunking, weak embeddings, distractor docs — not a generation problem. Debug the retriever before you blame the model.
- Fine-tuning the freshness away. Choosing fine-tuning for facts that change means re-training on every change — slow, expensive, and you still inherit the hallucination risk. Almost always the wrong trade.
2025/26 default: start with RAG (and good prompting). Fine-tune only once you've proven RAG can't deliver the behavior you need — it handles the large majority of "use our data" asks faster, cheaper, and reversibly.
Under the hood · flavors of fine-tuning
Full fine-tuning — update every weight. Most powerful, most expensive, needs real GPU infra, and risks "catastrophic forgetting" of general ability.
LoRA / PEFT — freeze the base model, train a tiny set of low-rank adapter matrices (often <0.5% of parameters). 10–20× less memory while keeping ~90–95% of full-tune quality; adapters can be merged back so there's no extra inference latency. QLoRA adds quantization to fit on a single consumer-ish GPU. This is what "fine-tuning" usually means in practice today.
Instruction tuning — the Module-4 stage that turned a raw predictor into an assistant; your task-specific fine-tune is the same machinery aimed narrower.
Marked's "Ask Your Journal" is the textbook RAG case — your harvest logs change every season and the answer must be grounded in your entries with the entry as the source. Fine-tuning a model on your journal would be the classic mistake: slow learning, blended hallucinated "facts," and a retrain every time you log a hunt. Where fine-tuning could actually earn its place is the Tronox incident-extraction workflow — if the long prompt forcing the exact Flash-Report → JSON shape ever gets brittle or token-heavy at volume, a small LoRA that bakes in the output shape (behavior, not facts) is legitimate; the incident content still rides in via context. And your medallion Silver/Gold tables are a natural retrieval corpus for an "ask the warehouse" assistant — governed, changing, proprietary — never something you'd freeze into weights.
Add an honest A/B inside Marked's "Ask Your Journal." Keep the existing RAG path, then wire a deliberately wrong comparison: answer from the base model with no retrieval — same question, side by side. First step: add a ?mode=noretrieval flag to the Ask-Your-Journal endpoint that skips the Supabase vector search and asks the model cold. Log both answers. You'll feel the difference — the no-retrieval path inventing plausible stand-and-wind "facts" is the hallucination-from-missing-knowledge failure made concrete, and the cleanest proof of why "fine-tune to add knowledge" is a trap.
Multi-agent orchestration
When one loop isn't enough: split the work across specialists with a coordinator.
A single agent juggling a huge task fills its window with too many concerns and starts dropping threads. The fix mirrors how you'd run a crew: break the job into roles, give each a clean context, and have a coordinator stitch the results together.
Common patterns
- Orchestrator-worker — a lead plans and farms subtasks to workers, then merges results. (How deep-research systems fan out across sources.)
- Critic / debate — one agent produces, another reviews and pushes back, raising quality through friction.
- Pipeline — agents in sequence, each transforming the previous one's output, like a Bronze→Silver→Gold medallion flow but with reasoning at each stage.
Multi-agent is more capable and dramatically more expensive, slower, and harder to debug — errors hide between agents, costs multiply, and coordination itself can fail. The discipline: start with the simplest thing that works. One good prompt beats a chat-with-tools that beats a single agent that beats a multi-agent swarm — reach up the ladder only when the rung below genuinely can't carry the task.
This is the frontier you're poking at with Databricks agentic experiments and Omnigent (a meta-harness coordinating coding agents). The same orchestrator-worker shape maps onto a future Tronox build: a planner agent that routes "extract this incident," "reconcile this logistics cost," "draft this IBP note" to specialized sub-agents — but only once each single-agent piece is proven solid on its own. How these agents actually get wired — and how they share what they know without poisoning each other — is the next module.
Orchestration & shared context
You decided one agent isn't enough — now the hard part isn't the agents, it's the wiring between them.
Module 14 answered should you go multi-agent. This answers how it's wired: which shape routes the work, and how agents that each have their own separate context window — possibly running in different harnesses — share what they've learned without flooding, contradicting, or poisoning each other.
The topologies — pick the shape that matches the work
Module 14 named three patterns in passing. Here's the fuller toolkit, and the rule for when each fits is always the same question: how coupled are the subtasks? Independent work fans out; dependent work must serialize or share state.
- Orchestrator-worker (hub-and-spoke). A lead plans, spawns workers, synthesizes their returns — the workhorse Module 14 diagrams. Fits breadth-first, parallelizable work ("find every board member across 20 companies"). The lead holds the plan and the only complete picture; the lead decides it has enough and stops spawning.
- Hierarchical / recursive. Workers are themselves orchestrators with their own workers — a tree. Fits deep decomposition, but cost compounds with depth, so cap the depth explicitly.
- Sequential pipeline. Agents in a line, each transforming the previous one's output (extract → reason → draft → check). Fits a fixed dependency order. The most deterministic shape — closest to a data pipeline — and the easiest to debug, because state flows one direction.
- Parallel fan-out / fan-in. Sibling subtasks dispatched at once, results merged — the orchestrator-worker's parallel core. Fits when latency matters and subtasks are independent. The fan-in (merge) step is where the hard problems live: dedup, conflict resolution, provenance.
- Blackboard. No one boss routes the work. Specialists all watch a shared workspace; each contributes when the current state matches what it knows how to do; a lightweight control loop picks who goes next. Fits ill-defined problems with no fixed solution path. Coordination is indirect — agents never talk to each other, only to the board — which makes it the cleanest model for the shared-state problem below.
- Debate / critic. One agent produces, another adversarially reviews. Going a level past Module 14: the critic must have a different context/prompt than the producer or it just rubber-stamps — the same reason an eval judge must be validated separately (Module 16).
Shared context — the genuinely hard part
The uncomfortable truth: each subagent has its own context window, and they cannot see into each other's. A worker never witnesses the lead's reasoning or its siblings' transcripts — it gets only what was explicitly handed to it, and the lead gets back only what the worker chose to return. There is no shared mind; there is only what you wire through the seams. Three ways agents share state, coldest to hottest:
- Message passing (handoff). The orchestrator sends a distilled task (objective, format, boundaries, the few facts it needs) and gets back a distilled result — never the raw transcript. You pass conclusions, not transcripts. Raw transcripts are huge and full of dead ends; the sender summarizes at the boundary. Dominant mode, and the one most prone to "lost in translation" loss.
- Shared memory / scratchpad / blackboard. A common store all agents read and write. Lets many agents converge on one evolving artifact without N×N messaging. The discipline that makes it safe: a single writer per slot (or append-only with provenance), so two agents don't clobber each other. Reads are cheap; uncoordinated writes are where it corrupts.
- Shared store / artifact. A durable external object — a file, a row, a doc, a task record — that outlives any single agent's context and serves as the handoff medium, especially across harnesses. Agent A in one harness writes the artifact; Agent B in another reads it. The artifact is the shared context; neither agent shares its internal memory.
Across harnesses, and why opacity is correct
When agents live in different harnesses (or vendors/frameworks), there's no shared process, no shared window, no implicit anything — they share context only through an explicit boundary protocol. The emerging standard split, stated plainly:
- MCP connects an agent to its tools (agent → tool) — Module 11.
- Agent-to-agent protocols (e.g. A2A) connect agents to each other (agent → agent), and their first design principle is opacity: agents exchange tasks, messages, and artifacts — distilled, structured handoffs — without exposing internal memory, tools, or chain-of-thought. That opacity isn't a limitation; it's the correct shape. You hand over the call, not the whole sensor feed.
Summarization at boundaries is load-bearing, not optional. Every handoff is a lossy compression, so good systems make the summary structured (schema'd fields, not prose) and keep a pointer back to the source so a claim can be re-checked.
In a two-person thermal hunt the spotter is glassing a wide field and you're on the rifle. You do not share a sensor feed — you can't see through the spotter's scope, he can't see your reticle. What crosses between you is one tight, distilled call: "Hog, far tree line, 180, quartering left." That call is the handoff — a structured summary, not the raw stream. If the spotter narrated every warm rock and deer (the full transcript), you'd drown in it and miss the shot. You pass the call, not the feed — exactly A2A opacity, and exactly why subagents return distilled findings, not their context windows. And the trust angle lands: a bad range call propagates straight into a missed shot, with full confidence.
Provenance & trust on merge
When the orchestrator merges several agents' outputs, it is ingesting text it didn't write — and an LLM can't tell instructions-from-you from instructions-hidden-in-data. A worker that read a poisoned web page can return output carrying a smuggled command; if the lead treats merged worker output as trusted instructions, that's prompt injection between your own agents. This is the same hazard as Module 17, now turned inward. So: tag every contribution with where it came from, treat cross-agent output as data, not commands, and gate any consequential action behind verification — the least-privilege, cut-a-trifecta-leg posture from Module 17, applied to the seams inside your swarm.
Coordination failure modes — where it bites
- Context fragmentation. No agent holds the whole picture; the synthesis is only as good as the distilled returns. Detail dies at every boundary.
- Duplicated / conflicting work. Vague task boundaries make two workers do the same thing or reach contradictory conclusions. (Anthropic's system hit exactly this; the fix was detailed delegation — objective, format, boundaries.)
- Lost-in-translation handoffs. The boundary summary drops the one nuance that mattered; the receiver confidently builds on a misread.
- Runaway fan-out cost. Multi-agent burns ~15× the tokens of plain chat (single agents already ~4×). Spawning subagents for a one-line question, or searching endlessly for info that doesn't exist, are real early failures.
- No clean termination. Without explicit budgets the swarm doesn't know when "enough" is — orchestrators over-invest, recursion never bottoms out, debate never converges.
- Error propagation. One worker's wrong fact, passed up as a clean conclusion, gets laundered into the final answer with false confidence — same shape as poisoned provenance, minus the adversary.
- Tight-coupling mismatch. Some work needs shared context and real-time coordination (most coding: edits conflict, order matters). Today's agents are bad at coordinating edits live, so forcing that work into a parallel fan-out backfires — it wants a pipeline or a single agent.
Practical levers — what keeps it alive
- Budgets / limits. Cap tokens, tool calls, subagent count, and recursion depth per task — and scale them to complexity (a fact-lookup gets one agent and a few calls; a broad comparison gets several). Budgets are the primary termination mechanism.
- Idempotent steps. Make each action safe to retry, so a re-run or crash-recovery doesn't double-write.
- Single writer for shared state. One owner per slot; everyone else appends with provenance. Kills the clobber-and-conflict class.
- Verification / critic stages. A dedicated checking pass before consequential output — the merge isn't trusted blindly. (Straight into Module 16, Evals.)
- Deterministic orchestration vs. model-driven delegation. The biggest lever. Where the flow is known, hard-code it as a deterministic pipeline (cheap, debuggable, repeatable) and let the model reason only inside each step. Reserve model-driven delegation for genuinely open-ended work. Don't pay for dynamic orchestration on a fixed problem.
A Databricks Bronze→Silver→Gold medallion flow is a sequential pipeline topology with zero model-driven delegation: the flow is fixed, each stage transforms the prior one's output, state moves one direction, and you can re-run any stage idempotently. That's the deterministic end of the spectrum. You'd reach for agentic delegation only when you don't know the steps in advance — when a planner has to decide at runtime which transforms even exist. Wire the known part deterministically; spend agent tokens only on the genuinely open part.
Under the hood · control, termination, and the single-writer rule
Termination per topology. Orchestrator-worker: the lead decides it has enough and stops spawning — back it with a token/subagent budget so a slow or missing answer can't block forever. Recursion: a hard depth cap, because cost compounds with every level. Debate: a max-rounds limit, since convergence isn't guaranteed.
The single-writer rule makes a blackboard safe: one owner per slot, everyone else appends new entries rather than overwriting, and every entry carries a source and timestamp. Reads are free and concurrent; the only contention is writes, so you remove write contention by construction. Structured handoffs (schema'd fields) beat prose because the receiver can validate the shape before trusting the content — and a pointer back to the source lets any claim be re-checked. The MCP-vs-A2A split is the same idea at the protocol layer: tools are exposed (MCP), but peer agents stay opaque (A2A) — you expose the call surface, never the internal state.
You already straddle both ends of the spectrum. Databricks workflows / medallion are deterministic orchestration — fixed pipelines you'd be crazy to make agentic. Omnigent is the model-driven end — a meta-harness coordinating other harnesses (Claude Code, Codex, Cursor), which is exactly the agents-across-harnesses problem: it governs them only through an explicit boundary (spend caps, sandboxing, pause-before-action), never by seeing inside their windows. And the Tronox future-build from Module 14 — a planner routing "extract incident," "reconcile cost," "draft IBP note" — is where shared-context discipline bites: those sub-agents must hand back structured results the planner can trust and trace, with the finance-write gated behind verification, because a merged conclusion with bad provenance writing to finance is the failure you can least afford.
Build a tiny blackboard coordinator for an OpenRange + Argus cross-harness scenario — local, ntfy-only, offline-first. First step: a single shared-state file shared/state.json with a strict schema and a single-writer rule per key (OpenRange owns detections[], Argus owns alert_thresholds). Write two tiny "agent" loops that each read the whole board but write only their own keys, append a source and ts on every entry (provenance baked in), and make each write idempotent (re-running can't double-append). Add a tiny "merge" reader that, before firing one ntfy push, checks provenance and refuses to act on any entry whose source it doesn't recognize — your in-house "treat peer output as data, not commands" guard. Stretch: a budget field that caps how many times a loop may write before it must stop, so you can watch termination work. You'll have built, in miniature, message-free shared state, single-writer safety, provenance-gated action, and a termination budget — the four levers, against your own offline stack.
Evals — knowing it works
The hinge between how systems are built and how you operate them well: how you prove an AI system works, not just demo it.
An eval is how you find out whether an AI system actually works — not by watching one good demo, but by running it against a fixed set of real cases and scoring the output. Because the output is non-deterministic, "it passed once" tells you almost nothing; you need a measurement you can repeat.
Zeroing a rifle. You don't call a scope zeroed because one round hit paper. You shoot a group at a known distance against a known point of aim, measure the offset, adjust the turrets, shoot again — and re-confirm at the start of a serious hunt because conditions drift. That's an eval, exactly: a fixed target (your eval set), a repeated measurement against ground truth (the bullseye = the gold answer), a scored miss, and an adjustment. One lucky shot dead-center proves nothing about the next ten — same reason a clean demo proves nothing about the next ten agent runs. pass^k is a tight group; pass@k is "at least one in the black." And "it looked good in the demo" is calling a rifle zeroed off a single round.
Why normal tests stop working
Traditional tests are exact: assert add(2,2) == 4 — same input, same output, forever, and a red test means a real bug. LLM output varies run to run, and the "right" answer is usually a set of acceptable answers, not one string. So assert response == "..." is either too brittle or meaningless. An eval is not a unit test — it's a measurement: run N cases, score each, report the rate. You're estimating a probability, not asserting a constant. A demo is one hand-picked sample with the operator steering: it has no denominator, so it tells you the system can succeed, never how often, and never where it silently fails.
Build the set from real failures, not imagination
The highest-leverage activity is error analysis: read real traces, tag what actually went wrong, group the tags into a failure taxonomy. An LLM has near-infinite ways to fail — you can't anticipate them, so don't pre-write evals before you've seen failures. The productive order: ship something small → look at outputs → discover failures → write a targeted eval for each → fix → repeat. Anthropic's guidance: start with 20–50 tasks from real failures and the manual checks you already run. Each case must be unambiguous (two domain experts independently reach the same verdict) and solvable (write a reference solution). The set is living: every new production failure becomes a new case, so that bug can never silently come back.
The grader ladder — cheap to expensive
Build evaluators in ascending cost; only climb when a cheaper rung can't capture the quality you care about.
- Assertions / code checks (cheapest, deterministic). Valid JSON? Schema matches? Required fields present? No blocked phrase? Number parses? These catch a huge share of real failures and cost nothing on every commit.
- Reference-based checks. Compare against a known-correct answer — exact match, set membership, numeric tolerance. Works when "correct" is well-defined: extraction, classification, structured output. (BLEU/ROUGE are weak as verdicts; use them only to find interesting traces.)
- LLM-as-judge (most expensive). A model scores against a rubric — for subjective qualities rules can't capture (is this summary faithful? is the tone right?), used after you've fixed the easy stuff.
Offline, online, and the CI gate
Reference-based evals have ground truth and run offline before you ship — this is where regression evals live, the safety net that says "my change didn't break the 50 cases that used to pass." Reference-free evals have no gold answer and run online on sampled live traffic — judging intrinsic properties (is the answer grounded in the retrieved context? does it address the question?) to watch for drift. Mature setups run evals at three points: offline on a curated set, in CI before any prompt/model change merges, and online on live traffic. Keep CI evals cheap and mostly deterministic; reserve the expensive judges for the slower cadence.
Agents and RAG get graded differently
- Agents have a trajectory. Capture the transcript (reasoning, tool calls, order) and the outcome (final state). Grade the outcome, not the path — pinning the agent to one "correct" tool sequence punishes valid creative solutions. Tool-use checks are strongest when execution-based (run the call in a sandbox, check the result). pass@k = ≥1 success in k tries (any success is fine); pass^k = all k succeed (when consistency is the product — a finance extraction that must be right every time).
- RAG fails in two places, so split it. Retrieval: context precision (is what we retrieved relevant) and recall (did we get everything needed). Generation: faithfulness/groundedness (does the answer stay inside the context or invent?) and answer relevancy. A wrong answer with good retrieval is a generation problem; with bad retrieval it's an indexing problem — lumping them hides the cause.
Under the hood · LLM-as-judge pitfalls & synthetic data
A judge is only trustworthy after you validate it against human labels. Collect 100+ examples a domain expert has labeled, have the judge predict on held-out ones, and measure agreement (TPR/TNR). Don't deploy an unvalidated judge — it may be grading by criteria you never intended.
Prefer binary pass/fail over 1–5 scales (everyone defaults to "3"); grade one dimension at a time with a clear rubric; give the judge an escape hatch ("Unknown") so it doesn't hallucinate a verdict. Known traps: skipping validation, feeding the wrong inputs (a faithfulness check without the retrieved context), and reading a 100% pass rate as success — it almost always means your eval is too easy. Aim for a set hard enough to sit around ~70%, where there's signal to chase.
Synthetic data done right: define dimensions of variation (report type, missing field, ambiguous date), hand-write ~20 tuples, then have a model expand and naturalize them. Generic "generate 100 test questions" produces repetitive junk that misses edge cases.
Marked — "Ask Your Journal" gets the RAG split: a reference-free faithfulness check (the answer must come from your actual entries, not the model's hunting folklore) plus retrieval checks; "Marked Intelligence" tool calls want execution-based tool-use evals. OpenRange / Argus — tight offline-first loops whose action is detection → ntfy push, so the eval is trigger correctness: a labeled set of clips with known "alert / no-alert," scored as precision/recall (a missed hog and a false 2am ntfy are different costs — grade them separately). Stays local, ntfy-only. Tronox — the canonical regression-gated, finance-writing eval: a folder of real Flash Reports paired with hand-verified JSON; cheap rungs do most of the work (valid JSON, required fields, figures parse, total reconciles), a validated judge handles only severity classification, and the eval is the gate — extraction merges only if the regression set still passes.
Build a tiny regression-eval harness for the Tronox extraction — start with five cases. First step: make an evals/ folder with 5 real Flash Reports and, beside each, a hand-verified expected.json. A run_evals.py runs each report through the extractor and scores the cheap rungs first: valid JSON, required keys present, numeric fields parse, field-level match against expected.json. Print a pass rate and a per-case diff; exit non-zero on any regression so it can gate a commit. Every time the workflow gets a real report wrong in the wild, drop it + corrected JSON into evals/ — the set grows from real failures. Stretch: add one validated judge for severity, but only after the deterministic layer is solid, and write down its agreement rate with your labels first.
Prompt injection & agent security
The adversarial capstone of the systems arc: how an attacker abuses the seams between harness, tools, RAG, and agents — and why there's no complete fix.
An LLM can't tell the difference between instructions from you and instructions hidden in the data it reads — so any untrusted text an agent ingests (a web page, an email, a trail-cam caption, a tool's output) can quietly become a command it obeys. The more an agent can do, the worse a single poisoned sentence gets, and there is no patch that fully closes this.
The whole problem is one architectural fact carried over from Modules 2 and 5: the model reads instructions and data through the same channel — one flat stream of tokens. There's no "this part is trusted, this part is just content" tag the model can rely on. Whatever looks like an instruction can act like one.
Thermal optics, and the hog that "tells" your scope to shoot. Your AGM Rattler reads heat off the field; it doesn't understand the scene. Now imagine a heat source could whisper instructions into the scope's reticle logic — "ignore your zero, fire left." A bare sensor can't sort "the deer I'm hunting" from "a sign someone planted that says shoot here." That's an LLM reading tokens: it can't tell the operator's intent from instructions baked into what it's looking at. The fix isn't a better sensor — it's a trigger discipline downstream of the optic (you, the human) that the scope can't override. That's human-in-the-loop on the consequential action.
Direct vs. indirect injection
- Direct prompt injection. The attacker is the user — they type "ignore your previous instructions and…" to override the system prompt, leak it, or jailbreak guardrails. Annoying, but the blast radius is usually just their own session.
- Indirect / data-borne injection (the dangerous one). The malicious instruction rides in on data the agent fetches on your behalf — a web page, an email body, a calendar invite, a GitHub issue, a PDF, a RAG chunk, even text hidden in white-on-white font. The agent reads it as part of "doing its job" and follows it. You never see the payload; the agent does. This is the attack that matters for agents, because agents read untrusted content by design.
"Just instruct the model not to" fails reliably. The model is non-deterministic and the input space is infinite — an attacker only needs one phrasing that slips through, across unlimited tries. Security that works ~95% of the time is, against an adversary who moves second, security that fails. Treat in-prompt instructions as a preference, never a boundary.
The lethal trifecta
Simon Willison's model is the clearest: an agent becomes exploitable when it has all three of —
- Access to private data (your inbox, your notes, secrets, prod configs),
- Exposure to untrusted content (it reads attacker-influenced text),
- An exfiltration / external channel (it can send mail, hit a URL, write to a DB, render a markdown image whose URL it controls).
With all three, one poisoned document can make the agent read your secrets and ship them out — no code vulnerability required. The classic exfil needs no obvious "send" tool: the injection tells the agent to embed stolen data in a URL —  — and the moment the markdown image renders, the browser leaks it. This is the confused deputy (OWASP LLM06): the agent acts with your privileges, so the real flaw isn't that it was tricked — it's that it was over-privileged, making being tricked catastrophic instead of harmless. Drop any one leg and that specific catastrophe becomes impossible.
Realistic defenses — defense-in-depth, not a fix
No item below is sufficient alone. You stack them and accept residual risk.
- Least-privilege tools. Scope every tool to the minimum — read-only by default, narrow row/path scopes, short-lived tokens, separate identities per agent. The single highest-leverage control: it caps the blast radius whether or not injection succeeds.
- Cut a leg off the trifecta. Meta's Agents Rule of Two (Oct 2025): an unsupervised agent may hold at most two of {untrusted input, sensitive access, external comms}. Want all three? A human gates it.
- Human-in-the-loop on consequential actions. Require explicit approval before anything irreversible or outbound (send, delete, pay, deploy). Reversible/read actions can stay autonomous.
- Provenance / tainting. Track which tokens came from untrusted sources; forbid tainted data from triggering consequential tool calls.
- Output handling (LLM05). Treat model output as untrusted too — never
evalit; sanitize before it hits a shell, SQL, or HTML; strip auto-rendered images/links to kill the exfil channel. - Sandboxing. Run tool/code execution in an isolated, network-restricted environment so even a fully hijacked step can't reach your data or the open internet.
Under the hood · design-patterns taxonomy
The principle (Beurer-Kellner et al., 2025): once an agent has ingested untrusted input, it must be impossible for that input to trigger any consequential action. Six patterns enforce it:
- Action-Selector — agent picks an action but can't read tool responses (an LLM-shaped switch statement).
- Plan-Then-Execute — fix the full plan before touching untrusted content, so content can corrupt outputs but not change which actions run.
- LLM Map-Reduce — quarantined sub-agents each chew one untrusted doc and return only a structured result a coordinator aggregates.
- Dual-LLM — a privileged LLM (tools, no untrusted text) drives a quarantined LLM (untrusted text, no tools); tainted content passes only as opaque variables (
$VAR1) the privileged side can route but never read. - Code-Then-Execute (CaMeL) — privileged LLM emits code in a sandboxed mini-language so a real interpreter can do data-flow/taint analysis (~67% of attacks blocked on AgentDojo — note: not 100%).
- Context-Minimization — strip untrusted text out of context once you've extracted what you need.
Marked's "Ask Your Journal" and "Marked Intelligence" are textbook trifecta candidates. Ask-Your-Journal does RAG over your Supabase entries (private data) and answers in chat. The day you let that chatbot (a) read a shared or web-fetched note, (b) keep access to your full journal, and (c) call a tool that sends mail or hits a URL, you've assembled all three legs in one harness. The fix isn't a cleverer system prompt — it's least-privilege tools and the Rule of Two: keep the journal chatbot read-only and outbound-free, route any "send" through a human tap. Same logic governs the Tronox workflow: it ingests untrusted Flash Report text, so the write into finance must be a gated, validated step, never autonomous. And OpenRange/Argus's offline-first rule is itself a defense — an agent with no outbound internet path (only local ntfy on WuTangNAS) has had a trifecta leg amputated by design: a poisoned caption can't phone home because there's no phone.
Build a trifecta audit + exfil-canary test for an OpenRange agent. First step: have Claude Code enumerate every tool the agent can call and tag each with the three legs (reads-private? reads-untrusted? talks-outbound?). Then write one red-team test: inject a fake instruction into a frame's caption/EXIF that tries to make the agent ntfy its config to an external URL, and assert the agent (a) doesn't, and (b) that the only notification path is local ntfy with no outbound internet egress. The test passing because the leg literally doesn't exist is the lesson — defense by architecture, not by hope.
Using AI the right way
Everything above, turned into operating procedure. This is the part you asked for most directly.
1 · Pick the right altitude for the task
The most common mistake is using an agent where chat would do, or chatting where you needed an agent. Match the tool to the shape of the work:
- Use chat for thinking, drafting, explaining, deciding — anything where you want to stay in the loop and the output is words. (Vendor emails, "explain this Databricks feature," sanity-checking an approach.)
- Use chat + tools when one lookup or one calculation makes the answer real. (Search, a quick data pull, a one-off script.)
- Use a workflow when the steps are fixed and you want the same path every time. (Incident-report → JSON. Predictability is the feature.)
- Use an agent when the task is multi-step, the path varies, and success is checkable. (Refactor across files until tests pass; triage an inbox by rules.)
2 · Prompt like you're briefing a sharp contractor
The model is capable but has zero context about your situation beyond what you give it. Good prompts front-load that:
- Be specific about the goal and the format. "Give me a 5-row markdown table comparing X on cost, speed, and lock-in" beats "tell me about X."
- Give it the context it can't have. Your constraints, your stack, your hard rules (ntfy-only, offline-first). It can't read your mind or your network diagram.
- Show an example of good output. One example of the shape you want is worth a paragraph of description — and a counter-example ("not like this") sharpens it further.
- Let it reason before it answers for anything non-trivial. "Think it through step by step, then give the answer" measurably improves hard tasks.
- Iterate. First output is a draft, not a verdict. Tell it what's off; it adjusts fast. Treat it as a conversation, not a vending machine.
3 · Verify — always, especially when it sounds confident
Hallucination is structural (Module 4): the model can be fluent and wrong simultaneously, and confidence is not a signal of correctness. So the verification load scales with the stakes. Low stakes (brainstorm) → trust and move. High stakes (code that ships, a number for finance, a network change) → verify the output yourself: run the code, check the source, confirm the fact. For agents, this means gating irreversible actions behind your approval. The model is a brilliant, tireless drafter — you remain the editor of record.
4 · Manage the context window like a campsite
- Pack what's relevant, leave the rest. Don't paste an entire repo when three files matter — noise dilutes focus (Module 12).
- Start fresh when it gets muddy. If a long thread has wandered, open a clean one with a tight summary. A polluted window quietly degrades every later answer.
- Decompose big asks. Break a mountain into checkable steps and hand each at the right altitude. Small, verifiable chunks beat one giant vague request.
5 · Know what good looks like (evals)
Before you lean on an AI for something repeated, define how you'll know it's working. "It seemed fine" is how silent failures creep into finance data — even a tiny eval set of five hand-checked examples turns "I hope" into "I checked." How you actually prove it works gets its own full treatment — see Module 16.
6 · Let your role shift up the stack
The throughline of this whole guide: as the tooling climbs from chat to agents, your job moves from doing the work to specifying it, verifying it, and orchestrating it. The leverage isn't in typing faster — it's in being the person who frames the goal precisely, sets the guardrails, and knows enough (from Modules 1–16) to tell when the machine is bluffing.
Marked — chat + RAG + tools, human-in-loop. Right altitude; keep "Ask Your Journal" grounded in retrieval, verify any prediction-y output (rut/weather) against reality. OpenRange / Argus — these want workflows and tight agent loops, not free-roaming agents: detection → ntfy is a checkable, reversible action, perfect for a budgeted loop with no destructive powers. Tronox incident extraction — keep it a workflow, build the 5-example eval set, gate anything that writes to finance systems behind a human. You're already instinctively at the right altitude on all three; now you know why.
Choosing your tools — models & harnesses
The current field of LLMs and the apps built on them, and a straight answer to "which one for what." Snapshot as of mid-2026 — this layer moves fast.
Two knobs decide your experience: the model (the engine, Module 1) and the harness (the app around it, Module 8). The thing most people get backwards: for day-to-day work the harness matters more than the model. Two people on the same model in different apps have wildly different experiences — and the top harnesses now let you swap the model underneath anyway. So pick the workflow first, the engine second.
The models — the engines
Frontier chat models are close enough that "best" usually means "best for this task." The honest differentiators:
Opus 4.8 · Sonnet 4.6 · Haiku 4.5
OpenAI
Moonshot · open weight
Inflection
DeepSeek · Qwen · Grok · Llama · Mistral
Default to the strong mid-tier (Sonnet 4.6 / GPT-5.5 / Gemini 3.1 Pro). Escalate to a top tier only when a task visibly needs it. Drop to a fast/cheap tier (Haiku, Gemini Flash, DeepSeek) for high-volume or simple work. The model only becomes the deciding factor at the extremes — hardest reasoning, cheapest scale, or an open-weight/self-host requirement.
Anthropic's surfaces — your home turf
All of these run the same Claude engine. They differ in where they run and who they're for.
Cowork vs. Claude Code — same engine, different vehicle
This is the one that trips everyone up, because they overlap heavily. Both run the identical Claude agentic core — plan, spawn subagents, use tools, edit files, run code, finish without babysitting. Both reach your local files, your connected apps, run on a schedule, and take orders from your phone. The choice is about fit and interface, not raw capability.
Do you live in a terminal? Yes → Claude Code. No → Cowork. Then: is the task complex, long-running, repeatable as a script, or does it need precision? → Claude Code. Occasional desktop knowledge work you want to watch? → Cowork. Claude Code can do almost everything Cowork can and more; Cowork mainly exists because Code's setup scares off non-developers. The strong move is to use both in sequence — Cowork to process inputs and produce a brief, Claude Code to implement it. (One caveat for work data: Cowork doesn't produce full audit logs, so keep regulated workflows off it without extra controls.)
Coding & agent harnesses beyond Anthropic
Omnigent — the meta-harness
Databricks' open-source Omnigent sits a layer above the harnesses above. Instead of being yet another coding agent, it orchestrates the ones you already use (Claude Code, Codex, Cursor) — swap the model or harness with a one-line config change, run multi-agent teams, and enforce policy at the orchestration layer (spend caps, sandboxing, "pause before this action") rather than by hoping a prompt holds. The clean mental model: Kubernetes for AI agents. It's early/alpha, but it's the answer to "how do I avoid lock-in and govern a fleet of agents."
Picking in practice
- Chat to think, an agent to do. Human-in-the-loop each turn → chat. Want a finished thing produced autonomously with your review at the end → agent.
- Non-developer path: Claude.ai → Cowork → Claude in Office → Design.
- Developer path: Claude.ai/ChatGPT → Claude Code or Codex → Cursor/Copilot in-IDE → Devin for delegated tickets → Omnigent once you're orchestrating several agents.
- Bet on the workflow, not the brand. Models leapfrog monthly and harnesses increasingly let you swap them, so don't marry an engine.
This module ages faster than any other in the guide. In the weeks around this writing, a top Claude tier got export-suspended, Google killed its old CLI for a new one, and a major IDE got acquired and renamed. Treat specific names, tiers, and benchmark numbers as a snapshot, re-check the picture each quarter, and read vendor benchmarks as directional marketing, not gospel.
You're already holding most of this loadout. Run Cowork for batch inbox triage, vendor threads, and report-building (watch-it-happen knowledge work). Keep Claude Code as the build hand for OpenRange, Argus, and Marked. Your interest in Omnigent fits the moment you're juggling multiple Databricks agentic experiments and want to swap models and govern spend from one place. And if you ever want a coding model running locally on WuTangNAS, Kimi K2.7 Code or DeepSeek are the open-weight picks.
Once you have two or three agent workflows going, have Claude Code stand up omnigent/ with a minimal config that points at your existing agents and sets a spend cap + a sandbox policy. It turns "I run a few agents" into "I orchestrate a governed fleet" — and it's the natural bridge from this guide into your Databricks agentic work.
The knowledge graph
Every concept in this guide and how it wires to the others. Tap any node to light up its connections.
Reading top-to-bottom gives you the path. This shows you the shape: foundations on the cool side feeding the central engine, systems on the hot side wrapping around it. The whole field is one connected structure — which is exactly why understanding the engine makes the agents make sense.
Your beginner → advanced roadmap
A progression that turns reading into capability, with concrete builds you can hand off at each stage.
Get the intuition cold
Modules 1–6. You can explain to someone else why an LLM is a prediction engine, what a token is, what attention does, and why the context window is the whole ballgame. No code yet — just the mental model. You're here once "it's autocomplete with a worldview" feels obviously true.
Become a power user of chat
Module 18, applied daily. Specific prompts, examples, step-by-step reasoning, ruthless verification. Use it for real work — vendor threads, explaining Databricks features, drafting docs. The goal: prompting becomes muscle memory and you instinctively smell when it's bluffing.
Build your first harness
Module 8's hand-off: a ~40-line script — system prompt, one tool, one model call, via the Anthropic SDK. Then add a second tool. Feeling the harness from the inside is the jump from "uses AI" to "builds with AI." Hand to Claude Code.
Close the loop — your first agent
Module 9's hand-off: a budgeted Think→Act→Observe loop whose only tool fires an ntfy push. Add a max-iteration brake and a clear stop condition. This is the OpenRange / Argus alerting brain in embryo — safe, reversible, ntfy-first by design.
Ground it in your own data
Module 12. Wire a RAG layer — embeddings in Supabase — so "Ask Your Journal" in Marked answers from your real seasons. You already have the stack; this is where embeddings stop being theory and start returning your own stand notes.
Orchestrate — but only when earned
Modules 14–15. Once single agents are solid, experiment with orchestrator-worker patterns (this is the Omnigent / Databricks-agentic frontier), then wire shared context with single-writer state and provenance on merge. Keep the discipline: simplest thing that works, evals at every stage, humans gating anything irreversible.
This hub is built to expand. Hand me (or Claude Code) a request like "add an embeddings-math deep-dive under Module 2" or "add a Module on AI cost & latency budgeting" and it slots into the same heat-scale structure. The knowledge graph and nav update by editing two small arrays near the bottom of the file. Treat it like a living field journal — keep adding heat as you climb.
The SDLC & Git basics
New subject area. Before CI/CD can mean anything, you need the loop software lives in and the system of record underneath it: Git.
The software development lifecycle (SDLC) is the repeating loop a change travels: plan → code → build → test → release → deploy → operate → monitor, then back to plan. CI/CD is the machinery that automates the middle of that loop — build, test, deploy — so the path from "I changed a line" to "it's running in production" is fast, repeatable, and boring. Boring is the goal.
Underneath all of it is version control, and in practice that means git. Git is the system of record for every change — who, what, when, and the exact state of the code at every point. You can't automate a pipeline over code you can't precisely name and roll back to.
Git is save-states for code. Every commit is a frame you can rewind to — like scrubbing back through a trail-cam clip to the exact frame the hog stepped into the lane. Nothing is ever truly lost; the timeline is the asset. And a quick heads-up on the heat scale: from here it resets per subject. Cold violet is "first day" again, climbing to white-hot within CI/CD — it doesn't carry over from where the AI track left off.
The vocabulary you'll actually use
- Repository (repo). The project plus its entire history. git clone copies it; the hidden .git folder holds every snapshot.
- Working tree → staging → commit. You edit files (working tree), pick what to record (git add stages it), then git commit freezes a snapshot with a message, an author, a timestamp, and a unique hash.
- Commit hash. The 7g8h9i-style id is the immutable name of one exact state. Pipelines and rollbacks key off it.
- HEAD. A pointer to "where you are now" in history — usually the tip of your current branch.
- Remote / push / pull. Your local repo and the shared one (on GitHub or Azure Repos) sync via push (send) and pull (receive).
At Tronox/Databricks this is Databricks Repos / Git folders — your notebooks and asset bundles are versioned in Git, not living as untracked workspace files. On the personal side, every St. Range project (field_guide, Marked, OpenRange) is a Git repo on your st-ranger-danger GitHub. Same primitives in both worlds; the rest of this group automates what happens after a commit lands.
Branching, merging & pull requests
How more than one change happens at once without stepping on each other — and the gate every change passes through before it joins the main line.
A branch is a cheap, throwaway parallel timeline. You snap one off main, do your work in isolation, and when it's ready you merge it back. A pull request (PR) is the formal proposal to do that merge — the place review happens, automated checks run, and the team says "yes, this can join."
A branch is scouting a new line into a stand without disturbing the main trail. You cut and test the new route on its own; if it pans out you fold it into the property map (merge), and if it doesn't you abandon it with zero impact on the trail everyone else is walking. main stays clean and walkable the whole time.
Merge vs. rebase — the one nuance that trips people up
Both integrate one branch into another; they differ in what they do to history.
The rest of the key terms
- Conflict. When two branches change the same lines, Git can't auto-decide — you resolve it by hand, then commit the resolution. Normal, not a failure.
- Code review. A human reads the diff and comments before approving. The single highest-leverage quality habit in the whole pipeline.
- Branch protection. A rule on main: no direct pushes, PR required, checks must pass, N approvals needed. This is what makes the gate real instead of optional.
Branching strategies
- Trunk-based. Tiny short-lived branches off main, merged daily. Pairs best with strong CI. The modern default.
- GitHub flow. Branch → PR → review → merge → deploy. Simple, continuous, what most small teams and your own projects use.
- GitFlow. Long-lived develop + release + hotfix branches. Heavyweight; fits scheduled, versioned enterprise releases — less so continuous delivery.
Turn on branch protection for main on one real repo — start with field_guide — requiring a PR and a passing check before merge. Then put the next guide edit through an actual PR instead of committing to main. You'll feel the whole loop from the inside, and it sets up the CI module: there's now a gate waiting for a check to fill.
Continuous Integration (CI)
The first half of CI/CD. Every push triggers an automated build-and-test, so integration problems surface in minutes — not in a painful merge at release time.
Continuous integration is a simple discipline with a big payoff: merge small changes often, and have a machine build and test every one automatically. The "check" your PR was waiting for in the last module is a CI pipeline — a defined sequence of steps a server runs on your code the moment it changes.
CI is the range check before you trust the rifle. Instead of zeroing once and hoping it holds all season, you confirm the group on every change — automatically. A drifted shot (a failing test) shows up immediately, while you still remember what you touched, instead of in the field when it counts.
The terms, decoded
- Pipeline / workflow. The sequence of automated steps, defined as code in a YAML file that lives in the repo. Versioned with everything else.
- Trigger. What kicks it off — a push, a PR opening, a schedule, or a manual run.
- Runner / agent. The machine that executes the steps. A clean, ephemeral environment each run, so "works on my machine" stops mattering. Hosted (the platform's) or self-hosted (yours).
- Job / step / stage. Steps group into jobs; jobs can run in parallel and group into stages.
- Artifact. The build output (a bundle, an image, a .whl) the pipeline produces and stores for later stages to deploy.
- Status check. The pass/fail signal CI hands back to the PR — the gate from Module 23, now filled.
field_guide already has a .github/ workflow that builds and ships the single HTML file to Cloudflare Pages via Wrangler — that's CI/CD running on your own repo right now. At the day job, the Databricks equivalent is an Azure Pipeline that runs pytest and validates a Databricks Asset Bundle on every push before it's allowed near a workspace.
Continuous Delivery & Deployment (CD)
The second half. CI proved the change is good; CD moves that proven artifact through environments and out to production — with the brakes that keep a bad release from becoming an outage.
The two D's people blur together: Continuous Delivery means every green build is always ready to release — going live is a button a human presses. Continuous Deployment goes one step further: if it's green, it ships to production automatically, no button. Same pipeline; the difference is whether a human stands at the prod gate.
Rolling out a release is like easing a new feeder onto the property. You don't swap every site at once and hope. You put one out, watch the cams for a few days, and only when it's clearly working do you roll it to the rest. That's a canary deploy — and if the herd spooks, you pull it. Same instinct as a rollback.
Deployment strategies — how the new version actually goes live
The rest of the vocabulary
- Environment / stage. A place the app runs — dev, test, staging, prod — each a checkpoint the same artifact is promoted through. You never rebuild per environment; you move the one build forward.
- Approval gate. A required human (or policy) sign-off before promotion to a sensitive environment, usually prod.
- Rollback. Re-deploying the last known-good artifact when a release goes wrong. Fast rollback > perfect releases.
- Feature flag. A switch that ships code dark and turns it on later — decouples "deployed" from "released," so you can flip a feature off without a redeploy.
This is exactly Vercel preview → production for Marked and base_camp: every PR gets a preview deploy (a throwaway environment), and promotion to prod is the gate. On the data side, promoting a Databricks Asset Bundle from a dev workspace to prod behind an Azure Pipelines approval is the same pattern. And the deploy-finished signal should land where everything else does — an ntfy push, your one notification layer.
Azure DevOps vs. GitHub
Microsoft owns both — two complete DevOps suites that overlap almost entirely. Knowing which feature maps to which lets you move fluently between your enterprise day job and your own projects.
Everything in this group — repos, branches, PRs, pipelines, environments — exists in both Azure DevOps and GitHub, just under different names and menus. They're sibling products under one owner. Pick by context, not by capability: the concepts transfer one-to-one.
So which do you reach for?
Microsoft is converging the two, not retiring either. GitHub is the go-forward platform getting the new investment; Azure DevOps stays fully supported for the enterprises standardized on it. They interoperate freely — an Azure Pipeline can build and deploy from a GitHub repo, and GitHub Actions can deploy straight into Azure. You're not locked into one house just because you started in it.
You already live in both houses. The Tronox/Databricks day job is almost certainly Azure DevOps territory — Repos, Boards, and Pipelines wired to the Azure tenant. Your St. Range projects all sit on GitHub with Actions doing the shipping. The skill that pays off is reading a pipeline in either dialect and knowing it's the same five ideas from this group wearing different labels.
Take the GitHub Actions workflow that deploys field_guide and have me write the equivalent azure-pipelines.yml beside it — same steps, Azure dialect. Translating one real pipeline across both platforms is the fastest way to make the mapping in the diagram above stick.
The CI/CD concept map
Every term in this group and how it wires to the next. Tap any node to light up its connections.
Read top-to-bottom and you get the path; this shows you the shape. The cool side is version control — a repo full of commits, branched and merged through pull requests. The hot side is automation — CI and CD, the two hubs everything turns on: CI proves a change, hands off an artifact, and CD promotes it to production. Both houses, Azure DevOps and GitHub, sit at the top because they host every node below them under different names.