AI Native Software Engineering

Source: Victor Bona's Obsidian Compendium snapshot, Knowledge base/Software Engineering/14 AI Native Software Engineering.md.

AI native software engineering applies normal engineering rigor to systems where language models assist, decide, retrieve, generate, test, review, operate, or act through tools. The phrase should not mean "trust the model more." It should mean "build a tighter control system around probabilistic work."

The quality bar is higher than ordinary automation because the system can be plausible while wrong, cheap in a demo while expensive in production, and helpful in isolation while unsafe when connected to tools, memory, private data, or deployment paths.

Existing anchors

  • AI Development
  • AI-Enhanced Software Development
  • Indexing Large Codebases for AI-Assisted Development
  • Context-Aware Systems and MCP Protocols
  • ReAct Agent Architecture & Flow
  • LLMOps and Model Deployment
  • Programming Languages for AI-Native Agents

Core claim

AI native engineering is a discipline of evidence. The model may propose, summarize, classify, plan, and execute, but the engineering system must provide:

  • Ground truth from repositories, logs, tests, specifications, and production telemetry.
  • Explicit permission boundaries for data access and tool use.
  • Repeatable evaluations for behavior that cannot be proven by type checks alone.
  • Regression tests for prompt, retrieval, model, and workflow changes.
  • Observability that explains model behavior, cost, latency, tool use, and failure modes.
  • Human review gates for irreversible, sensitive, high cost, or externally visible actions.
  • Rollback paths for prompts, indexes, tools, model versions, and agent policies.

Operating model

| Area | Old assumption | AI native assumption | Required evidence |
| --- | --- | --- | --- |
| Code generation | Generated code is a draft | Generated code is untrusted acceleration | Tests, diff review, static analysis, security review |
| Search | Keyword search is enough | Retrieval is part of runtime correctness | Retrieval evals, citation checks, freshness metrics |
| Prompting | Prompt text is informal | Prompts are production artifacts | Versioning, changelog, approvals, regression suite |
| Agents | Agent autonomy is a feature | Autonomy is a bounded risk budget | Permission matrix, audit logs, kill switch |
| Memory | More memory is better | Memory changes system behavior | Retention policy, provenance, deletion path, consent |
| Models | Latest model is best | Model choice is a tradeoff | Cost, latency, eval score, context needs, safety profile |
| Observability | Logs cover application behavior | Traces must include model reasoning surfaces | Prompt ids, retrieval ids, tool calls, token costs |
| Quality | Manual spot checks are enough | AI behavior needs continuous tests | Golden sets, adversarial cases, drift monitoring |

AI assisted development quality bar

AI output must be treated as untrusted acceleration. A strong workflow lets AI reduce toil without lowering standards.

Required controls:

  • Repo reading before edits.
  • Tests before confidence.
  • Diff review by a human or a high precision review agent with human escalation.
  • Security review for authentication, authorization, secrets, payment, data export, infrastructure, and deployment paths.
  • No fabricated APIs, flags, files, migrations, benchmarks, or release claims.
  • No silent broad refactors.
  • No secrets in prompts, generated files, logs, screenshots, or eval fixtures.
  • Traceable rationale for architecture changes.
  • Clear separation between generated suggestions and accepted engineering decisions.
  • Verification against the same commands that CI or production gates will run.

Development workflow

[Diagram: development workflow]

Quality gates for AI generated code

| Gate | What it catches | Minimum bar | Stronger bar |
| --- | --- | --- | --- |
| Type checking | Invented APIs, shape drift, null handling | Clean type check | Strict mode plus API contract tests |
| Unit tests | Local behavior errors | Relevant tests pass | New tests cover edge cases and failure paths |
| Integration tests | Broken boundaries | Critical path passes | External mocks validate retries and timeouts |
| Security checks | Secret leaks, injection, privilege mistakes | Static checks and manual review | Threat model plus abuse cases |
| Diff review | Broad unintended changes | Human reads changed files | Reviewer verifies claims against repo evidence |
| Runtime smoke | Miswired configuration | Service starts | Observed behavior through actual UI or API |
| Cost review | Expensive loops, long prompts | Cost estimate exists | Token budgets enforced and monitored |

Common failure patterns

| Pattern | Symptom | Countermeasure |
| --- | --- | --- |
| Hallucinated dependency | Code imports a package not in the lockfile | Verify package manifest and lockfile before accepting |
| Plausible API misuse | Method exists but semantics are wrong | Read official docs or local type definitions |
| Overbroad cleanup | Generated patch rewrites unrelated files | Scope diff by task and reject incidental churn |
| Test theater | Tests assert implementation details but not behavior | Add failure mode and user visible assertions |
| Hidden data leak | Prompt contains credentials or private customer data | Redact, classify, and enforce prompt data policy |
| False confidence | Summary says tests passed when none ran | Require command output and exit codes |
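
The hallucinated dependency countermeasure can be partly automated. The following is a minimal sketch, assuming Python sources and a caller that has already collected the declared module names; `undeclared_imports`, `generated`, and `declared` are hypothetical names for illustration, not part of any existing tool.

```python
import ast

def undeclared_imports(source: str, declared: set[str]) -> set[str]:
    """Return top-level module names imported by `source` but absent from `declared`."""
    found: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level > 0 is a relative import, which never names an external package
            found.add(node.module.split(".")[0])
    return found - declared

# Made-up inputs: in practice `declared` would be built from the project's
# manifest and lockfile plus the standard library.
generated = "import os\nimport fastjsonx\nfrom requests import get\n"
declared = {"os", "requests"}
print(undeclared_imports(generated, declared))  # {'fastjsonx'}
```

Import names and distribution names can differ, so a real check would likely need a mapping between the two before rejecting a patch.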

Agentic systems

Agentic systems combine models, tools, state, memory, retrieval, planning, and feedback loops. They require software architecture, not just a larger prompt.

Agentic systems require:

  • Clear task boundaries.
  • Tool permission model.
  • State management.
  • Memory policy.
  • Retry and timeout policy.
  • Human approval gates.
  • Audit logs.
  • Evaluation harness.
  • Rollback or kill switch.
  • Rate limits and spend limits.
  • Idempotent tool execution where possible.
  • Failure classification.
  • Recovery behavior that does not create duplicate external side effects.

Agent architecture

[Diagram: agent architecture]

Agent roles and responsibilities

| Component | Responsibility | Must not do |
| --- | --- | --- |
| Orchestrator | Owns workflow state, retries, cancellation, and step order | Hide side effects inside prompt text |
| Policy engine | Decides whether data and tools are allowed for this user and task | Trust model self classification as sole authority |
| Context builder | Selects minimal relevant context for the current step | Dump entire databases or repositories into prompts |
| Retriever | Finds grounded evidence with permissions and provenance | Return unauthorized or stale private content |
| Memory service | Stores durable facts, preferences, and past outcomes | Store secrets, raw sensitive data, or unverifiable claims |
| Tool gateway | Executes bounded operations with schemas and audit logs | Expose broad shell, admin, or network access by default |
| Evaluator | Scores outputs, tool choices, and policy compliance | Depend only on subjective manual review |
| Observability layer | Records traces, cost, latency, inputs, outputs, and decisions | Log secrets or sensitive payloads without controls |

Autonomy levels

| Level | Description | Example | Required control |
| --- | --- | --- | --- |
| 0 | Suggest only | Draft a code review comment | Human decides everything |
| 1 | Execute reversible local action | Format a file, run tests | Local sandbox and diff review |
| 2 | Execute bounded external action | Create a ticket or draft PR | Tool allowlist and audit log |
| 3 | Execute customer visible action | Send an email or update a support case | Human approval or policy based approval |
| 4 | Execute irreversible or financial action | Delete data, deploy production, issue refund | Strong approval, dual control, rollback plan |
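
One way to make the levels enforceable is to encode them as data that deterministic code checks before any action runs. This is a sketch under stated assumptions: the approver counts in `APPROVERS_REQUIRED` are an illustrative reading of the table (one approver for level 3, two for dual control at level 4), and all names are hypothetical.

```python
from enum import IntEnum

class Autonomy(IntEnum):
    SUGGEST_ONLY = 0
    REVERSIBLE_LOCAL = 1
    BOUNDED_EXTERNAL = 2
    CUSTOMER_VISIBLE = 3
    IRREVERSIBLE_OR_FINANCIAL = 4

# Assumed mapping from level to the number of distinct human approvers required.
APPROVERS_REQUIRED = {
    Autonomy.SUGGEST_ONLY: 0,
    Autonomy.REVERSIBLE_LOCAL: 0,
    Autonomy.BOUNDED_EXTERNAL: 0,
    Autonomy.CUSTOMER_VISIBLE: 1,
    Autonomy.IRREVERSIBLE_OR_FINANCIAL: 2,  # dual control
}

def may_execute(level: Autonomy, approvers: set[str]) -> bool:
    """Allow the action only when enough distinct approvers have signed off."""
    return len(approvers) >= APPROVERS_REQUIRED[level]

print(may_execute(Autonomy.CUSTOMER_VISIBLE, set()))                     # False
print(may_execute(Autonomy.IRREVERSIBLE_OR_FINANCIAL, {"alice", "bob"}))  # True
```

Using a set of approver identities means the same person approving twice does not satisfy dual control.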

Tool permissions

Tools turn model text into action. The safest pattern is a narrow tool gateway with explicit schemas, policy checks, rate limits, and audit logs.

| Tool class | Examples | Default permission | Review requirement |
| --- | --- | --- | --- |
| Read local repo | File reads, search, dependency inspection | Allowed for assigned scope | Confirm no unrelated file edits |
| Write local repo | Patch files, update tests | Allowed only for assigned files or task scope | Diff review and verification |
| Execute local commands | Tests, linters, build | Allowed with bounded working directory | Report failures accurately |
| Network read | Official docs, dependency metadata | Allowed when current facts are needed | Prefer primary sources |
| Network write | API mutation, ticket creation, PR creation | Deny by default | Human approval or explicit policy |
| Secrets access | Env files, credential stores | Deny by default | Need to know, never paste into prompts |
| Production operations | Deploy, scale, delete, migrate | Deny by default | Change management and rollback |
| Financial operations | Billing, refunds, purchases | Deny by default | Strong approval and audit trail |
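
The defaults above translate naturally into a deny by default lookup that runs outside the model. A minimal sketch, assuming hypothetical tool class identifiers and a simplified `approved` flag standing in for the real review requirement:

```python
# Default permission per tool class; identifiers are illustrative assumptions.
DEFAULTS = {
    "read_local_repo": "allow",
    "write_local_repo": "allow",
    "execute_local_commands": "allow",
    "network_read": "allow",
    "network_write": "deny",
    "secrets_access": "deny",
    "production_operations": "deny",
    "financial_operations": "deny",
}

def decide(tool_class: str, in_scope: bool, approved: bool = False) -> str:
    """Return "allow" or "deny"; unknown classes and out-of-scope calls are denied."""
    if tool_class not in DEFAULTS or not in_scope:
        return "deny"
    if DEFAULTS[tool_class] == "allow":
        return "allow"
    # Deny-by-default classes need an explicit approval recorded by the caller.
    return "allow" if approved else "deny"

print(decide("read_local_repo", in_scope=True))                 # allow
print(decide("network_write", in_scope=True))                   # deny
print(decide("network_write", in_scope=True, approved=True))    # allow
print(decide("raw_shell", in_scope=True, approved=True))        # deny
```

An unlisted tool class is denied even with approval, so adding a new capability requires a reviewed change to the table rather than a persuasive prompt.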

Permission decision flow

[Diagram: permission decision flow]

Tool design rules

  • Prefer structured inputs and outputs over free form instructions.
  • Validate tool arguments before execution.
  • Make dangerous operations explicit, separate, and hard to call accidentally.
  • Require idempotency keys for external mutations when possible.
  • Return machine readable errors with retry hints.
  • Attach each tool call to a trace id, user id, policy decision, and prompt version.
  • Avoid giving the model a raw shell, database superuser, cloud admin, or unrestricted browser unless the environment is intentionally sandboxed.
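
Several of these rules can be combined in one small gateway method. The sketch below is illustrative only: `ToolGateway`, `create_ticket`, and the in-memory stores are hypothetical, and a real gateway would persist idempotency keys and audit entries durably.

```python
class ToolGateway:
    """Hypothetical gateway exposing one narrow, validated, idempotent tool."""

    def __init__(self) -> None:
        self._results: dict[str, dict] = {}  # idempotency key -> earlier result
        self.audit_log: list[dict] = []

    def create_ticket(self, args: dict, idempotency_key: str, trace_id: str) -> dict:
        # Validate arguments before execution and return a machine readable error.
        title = args.get("title")
        if not isinstance(title, str) or not title.strip():
            return {"ok": False, "error": "invalid_title", "retryable": False}
        # A repeated key returns the earlier result instead of mutating again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        result = {"ok": True, "ticket_id": f"T-{len(self._results) + 1}"}
        self._results[idempotency_key] = result
        self.audit_log.append(
            {"trace_id": trace_id, "tool": "create_ticket", "key": idempotency_key}
        )
        return result

gateway = ToolGateway()
first = gateway.create_ticket({"title": "Fix login"}, "key-1", "trace-9")
retry = gateway.create_ticket({"title": "Fix login"}, "key-1", "trace-9")
print(first == retry, len(gateway.audit_log))  # True 1
```

The retry produces the same ticket id and no second audit entry, which is the property that keeps agent retries from duplicating external side effects.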

Memory

Memory is durable context that affects future behavior. It is powerful because it reduces repeated context collection, and risky because it can preserve stale, sensitive, or incorrect assumptions.

Memory types

| Type | Scope | Example | Risk | Control |
| --- | --- | --- | --- | --- |
| Session memory | Current conversation | Current task constraints | Context pollution | Reset at task boundary |
| User preference | User or team | Preferred test commands | Stale preference | Timestamp and source |
| Project memory | Repo or product | Architecture decision | Drift from code | Verify against repo |
| Episodic memory | Prior incident | Failure and fix summary | Overfitting to past | Link evidence |
| Semantic memory | Stable domain fact | Naming convention | Low if verified | Periodic refresh |
| Operational memory | Live system state | Current cluster config | High drift | Treat as stale quickly |

Memory quality bar

  • Store the smallest useful fact.
  • Record source, time, scope, confidence, and deletion path.
  • Separate user preferences from repo facts and production facts.
  • Never store secrets, private keys, access tokens, raw customer records, or regulated data without explicit policy.
  • Prefer memory as a routing hint, not a substitute for current verification.
  • Expire or refresh facts that can drift.
  • Expose memory use when it materially affects a decision.
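
A memory record that carries its own provenance and expiry makes these rules checkable in code. This is a sketch, assuming a hypothetical `MemoryRecord` shape and a per-record time to live; the field names and example values are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass(frozen=True)
class MemoryRecord:
    record_id: str       # handle used by the deletion path
    fact: str            # the smallest useful fact
    source: str          # where the fact came from
    scope: str           # e.g. "user", "project", "operational"
    confidence: float
    recorded_at: datetime
    ttl: timedelta       # how long the fact may be used without re-verification

    def is_fresh(self, now: datetime) -> bool:
        """A stale record should trigger verification, not be trusted as-is."""
        return now - self.recorded_at <= self.ttl

recorded = datetime(2024, 1, 1, tzinfo=timezone.utc)
record = MemoryRecord(
    record_id="m-1",
    fact="Tests run with `make test`",
    source="README.md",
    scope="project",
    confidence=0.9,
    recorded_at=recorded,
    ttl=timedelta(days=30),
)
print(record.is_fresh(recorded + timedelta(days=10)))  # True
print(record.is_fresh(recorded + timedelta(days=45)))  # False
```

Drift prone scopes such as operational memory would get a much shorter `ttl` than stable semantic facts.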

Memory write decision

[Diagram: memory write decision]

Retrieval and context

RAG quality depends on:

  • Chunking.
  • Metadata.
  • Freshness.
  • Ranking.
  • Deduplication.
  • Citation and provenance.
  • Permission filtering.
  • Context window budgeting.
  • Query rewriting.
  • Evaluation dataset.
  • Negative retrieval tests.
  • Index rebuild strategy.
  • Staleness detection.

RAG pipeline

[Diagram: RAG pipeline]

Retrieval design choices

| Choice | Good default | When to change |
| --- | --- | --- |
| Chunk size | 300 to 800 tokens with overlap | Use smaller chunks for API docs, larger chunks for narrative docs |
| Metadata | Source, owner, timestamp, ACL, type, version | Add domain fields needed for filtering and ranking |
| Search | Hybrid lexical plus vector | Pure vector can miss exact identifiers and error strings |
| Reranking | Cross encoder or model based rerank for top candidates | Skip only for very low latency or low value paths |
| Citations | Required for factual answers | Internal workflows may use trace links instead |
| Freshness | Incremental updates plus rebuild checks | Use full rebuild after parser, chunker, or ACL changes |
| ACLs | Filter before final context packing | Enforce at source and retrieval layers for defense in depth |
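
Chunking with overlap is simple to sketch. The example below assumes the input is already tokenized (a list of strings stands in for a real tokenizer's output) and uses hypothetical defaults inside the 300 to 800 token range; `chunk_tokens` is an illustrative name.

```python
def chunk_tokens(tokens: list[str], size: int = 500, overlap: int = 50) -> list[list[str]]:
    """Split `tokens` into windows of `size` that share `overlap` tokens with the previous window."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and overlap must be in [0, size)")
    step = size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks

tokens = [str(i) for i in range(10)]
for chunk in chunk_tokens(tokens, size=4, overlap=1):
    print(chunk)
# ['0', '1', '2', '3']
# ['3', '4', '5', '6']
# ['6', '7', '8', '9']
```

Each window repeats the last token of the previous one, so a sentence that straddles a boundary is still retrievable from at least one chunk.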

Context engineering

Context engineering is the practice of selecting, ordering, compressing, and labeling the information a model receives. It is more than prompt writing.

High quality context has:

  • A clear task objective.
  • Current user constraints.
  • Relevant source excerpts with provenance.
  • Explicit exclusions.
  • Known assumptions.
  • Tool results that are distinguishable from model claims.
  • Output schema or acceptance criteria.
  • Token budget allocation.
  • Freshness notes for time sensitive facts.
  • Safety and data handling instructions.

Context packing template

| Section | Purpose | Example contents |
| --- | --- | --- |
| System policy | Non negotiable constraints | Data boundaries, tool restrictions, role |
| Task | Current objective | "Review this diff for auth regressions" |
| User constraints | User specific requirements | "Do not push, keep changes local" |
| Ground truth | Evidence | File excerpts, logs, test output, API responses |
| Working memory | Relevant prior facts | Architecture decision, naming convention |
| Tools | Available actions | Read only search, test command, patch tool |
| Output contract | Required response shape | Findings first, file line references |
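
The template can be assembled by deterministic code so that section order and labels never depend on ad hoc string concatenation. A minimal sketch, assuming hypothetical section keys and a plain-text label format:

```python
# Fixed section order taken from the template; the key names are assumptions.
SECTION_ORDER = [
    "system_policy",
    "task",
    "user_constraints",
    "ground_truth",
    "working_memory",
    "tools",
    "output_contract",
]

def pack_context(sections: dict[str, str]) -> str:
    """Render known sections in a fixed order, each under an explicit label."""
    unknown = set(sections) - set(SECTION_ORDER)
    if unknown:
        raise ValueError(f"unknown sections: {sorted(unknown)}")
    parts = []
    for name in SECTION_ORDER:
        body = sections.get(name, "").strip()
        if body:  # empty sections are omitted rather than rendered blank
            parts.append(f"[{name}]\n{body}")
    return "\n\n".join(parts)

print(pack_context({
    "task": "Review this diff for auth regressions",
    "system_policy": "Read only. No network writes.",
    "ground_truth": "tests/auth_test.py: 12 passed",
}))
```

Because ground truth always appears under its own label, tool output stays distinguishable from instructions and from model claims.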

Context anti-patterns

| Anti-pattern | Why it fails | Replacement |
| --- | --- | --- |
| Context dump | Wastes tokens and hides relevant facts | Selective retrieval and structured summaries |
| Prompt folklore | Behavior depends on unversioned phrasing | Versioned prompt with eval coverage |
| Hidden tool output | Model cannot distinguish observation from guess | Label tool outputs and provenance |
| Stale memory | Past facts override current repo state | Verify drift prone facts before using |
| Missing negatives | Retriever only tested for happy paths | Include queries that should return no answer |

LLMOps

LLMOps is the operational discipline for model powered software. It covers prompts, models, retrieval, tools, evaluations, deployment, observability, safety, cost, and incident response.

Core concerns:

  • Model selection.
  • Prompt versioning.
  • Evaluation gates.
  • Regression tests.
  • Cost controls.
  • Latency budgets.
  • Safety filters.
  • Data retention.
  • Feedback loop.
  • Observability.
  • Incident response.
  • Dataset governance.
  • Rollback strategy.

Release lifecycle

[Diagram: release lifecycle]

LLMOps artifact inventory

| Artifact | Versioned? | Reviewed? | Testable? | Notes |
| --- | --- | --- | --- | --- |
| Prompt templates | Yes | Yes | Yes | Treat like source code |
| Tool schemas | Yes | Yes | Yes | Schema changes can break agents |
| Retrieval indexes | Yes, by build id | Yes | Yes | Record corpus snapshot and parser version |
| Eval datasets | Yes | Yes | Yes | Prevent silent benchmark drift |
| Model configuration | Yes | Yes | Yes | Include temperature, max tokens, safety settings |
| Memory policy | Yes | Yes | Partly | Test retention, deletion, and scope rules |
| Cost budgets | Yes | Yes | Yes | Enforce per route, tenant, workflow, and user |
| Safety policy | Yes | Yes | Yes | Include abuse cases and escalation rules |

Evaluations

Evaluations measure whether an AI system behaves well enough for its job. They are not a single benchmark score.

Evaluation taxonomy

| Eval type | Measures | Example |
| --- | --- | --- |
| Golden answer | Correctness against expected output | Support answer includes required steps |
| Rubric based | Quality dimensions | Accuracy, completeness, tone, grounding |
| Pairwise | Relative quality | New prompt beats old prompt on 70 percent of cases |
| Tool use | Correct action selection | Agent calls read tool before write tool |
| Retrieval | Source selection | Relevant document appears in top 5 |
| Grounding | Faithfulness to sources | No unsupported factual claims |
| Safety | Policy compliance | Refuses credential exfiltration request |
| Regression | Stability across releases | Prior failures stay fixed |
| Cost and latency | Operational fitness | P95 latency under target and cost per task under budget |
| Human review | Expert judgment | Reviewer approves high risk answers |

Evaluation dataset design

Strong eval sets include:

  • Common happy paths.
  • Known historical failures.
  • Boundary cases.
  • Ambiguous requests.
  • Permission denied cases.
  • Stale data cases.
  • Tool failure cases.
  • Retrieval misses.
  • Adversarial prompt injection attempts.
  • Cost intensive tasks.
  • Cases where the correct answer is "I do not know."

Evaluation metrics

| Metric | Good for | Watch out for |
| --- | --- | --- |
| Exact match | Structured outputs | Too brittle for natural language |
| Semantic similarity | Paraphrases | Can approve unsupported claims |
| Precision | Avoiding bad answers | May make system overly cautious |
| Recall | Finding all relevant information | Can flood context with noise |
| Faithfulness | Source grounded answers | Requires reliable source annotations |
| Tool success rate | Agent reliability | Does not prove tool should have been used |
| Human preference | Product quality | Expensive and subjective |
| Cost per successful task | Efficiency | Can hide low quality shortcuts |
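
Precision and recall are the two metrics here with a standard set based definition, shown below for a retrieval result. The document ids are made up for illustration, and returning 0.0 for an empty set is a convention chosen for this sketch.

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Precision = hits / retrieved, recall = hits / relevant; 0.0 when a set is empty."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {"doc-a", "doc-b", "doc-c", "doc-d"}
relevant = {"doc-a", "doc-b", "doc-e"}
precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.50 recall=0.67
```

Retrieving more documents can only raise or hold recall while it tends to lower precision, which is the tension behind the "flood context with noise" warning.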

Regression tests

Regression tests for AI systems should cover deterministic code and probabilistic behavior.

| Layer | Test | Example assertion |
| --- | --- | --- |
| Prompt rendering | Snapshot or schema test | Required policy section is present |
| Tool schema | Contract test | Invalid arguments are rejected |
| Retriever | Golden query test | Correct source is in top results |
| Reranker | Ranking test | More authoritative source outranks duplicate |
| Generator | Rubric eval | Answer cites source and avoids unsupported claim |
| Agent loop | Scenario test | Agent stops after approval denial |
| Memory | Scope test | User A memory is not visible to user B |
| Safety | Abuse test | Prompt injection cannot override tool policy |
| Cost | Budget test | Workflow stops before token budget breach |
| Observability | Trace test | Each model call has prompt id and cost fields |
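
The prompt rendering row is the easiest to test deterministically. A sketch, assuming a hypothetical prompt built with the standard library's `string.Template`; the prompt text and test names are illustrative, and the tests are written so a runner such as pytest could collect them.

```python
from string import Template

# Hypothetical versioned prompt; the wording is an assumption for illustration.
PROMPT_V3 = Template(
    "POLICY: answer only from the sources below.\n"
    "SOURCES:\n$sources\n"
    "QUESTION: $question"
)

def render_prompt(sources: str, question: str) -> str:
    return PROMPT_V3.substitute(sources=sources, question=question)

def test_policy_section_present() -> None:
    rendered = render_prompt("doc-1: rotate keys quarterly", "How often do keys rotate?")
    assert rendered.startswith("POLICY:")
    assert "SOURCES:" in rendered and "QUESTION:" in rendered

def test_missing_variable_is_rejected() -> None:
    # Template.substitute raises KeyError when a placeholder has no value.
    try:
        PROMPT_V3.substitute(question="How often do keys rotate?")
    except KeyError:
        return
    raise AssertionError("rendering succeeded without sources")

test_policy_section_present()
test_missing_variable_is_rejected()
```

These checks cover only the deterministic layer; the generator and agent loop rows still need rubric and scenario evals.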

Regression suite shape

[Diagram: regression suite shape]

Prompt and version management

Prompts are production code when they affect user visible behavior, data access, tool use, cost, or safety.

Prompt artifact fields

| Field | Purpose |
| --- | --- |
| Prompt id | Stable identity for logs and rollbacks |
| Version | Immutable release marker |
| Owner | Accountable maintainer |
| Change reason | Why the prompt changed |
| Model compatibility | Supported model family and settings |
| Input contract | Required variables and schemas |
| Output contract | Required format and validation |
| Safety policy | Data and behavior limits |
| Eval coverage | Tests that protect behavior |
| Rollback plan | Known previous safe version |
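
A subset of these fields can be modeled as an immutable record in an append-only registry. This is a sketch under stated assumptions: `PromptVersion` and `PromptRegistry` are hypothetical, hold only some of the fields, and keep history in memory.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)  # frozen: a published version cannot be edited in place
class PromptVersion:
    prompt_id: str
    version: int
    owner: str
    change_reason: str
    template: str

class PromptRegistry:
    def __init__(self) -> None:
        self._history: dict[str, list[PromptVersion]] = {}

    def publish(self, prompt: PromptVersion) -> None:
        history = self._history.setdefault(prompt.prompt_id, [])
        if history and prompt.version <= history[-1].version:
            raise ValueError("versions must strictly increase")
        history.append(prompt)

    def current(self, prompt_id: str) -> PromptVersion:
        return self._history[prompt_id][-1]

    def rollback_target(self, prompt_id: str) -> Optional[PromptVersion]:
        """Return the previous version, or None if there is nothing to roll back to."""
        history = self._history[prompt_id]
        return history[-2] if len(history) > 1 else None

registry = PromptRegistry()
registry.publish(PromptVersion("support-answer", 1, "team-a", "initial", "Answer: $q"))
registry.publish(PromptVersion("support-answer", 2, "team-a", "add citations", "Cite. Answer: $q"))
print(registry.current("support-answer").version)          # 2
print(registry.rollback_target("support-answer").version)  # 1
```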

Prompt review checklist

  • The prompt describes the task without smuggling hidden product requirements.
  • The prompt separates facts, instructions, examples, and tool output.
  • The prompt does not rely on fragile phrasing where a schema would work better.
  • The prompt states uncertainty behavior.
  • The prompt states citation or provenance requirements for factual answers.
  • The prompt does not ask for chain of thought disclosure.
  • The prompt includes data handling constraints.
  • The prompt version is visible in traces.
  • The change has regression coverage for the behavior it intends to alter.

AI observability

AI observability connects model behavior to product behavior. Normal application logs are insufficient because failures may come from retrieval, memory, prompt assembly, model selection, tool policy, or cost throttling.

Trace fields

| Field | Why it matters |
| --- | --- |
| Trace id | Correlates user request, model calls, retrieval, and tools |
| User and tenant scope | Supports permission debugging and cost allocation |
| Prompt id and version | Enables rollback and regression analysis |
| Model and parameters | Explains behavior and cost changes |
| Input and output token counts | Tracks cost and context pressure |
| Retrieval query and result ids | Debugs missing or wrong context |
| Source citations | Supports grounding checks |
| Tool calls and arguments | Audits side effects |
| Policy decisions | Shows why access or tools were allowed or denied |
| Latency by stage | Identifies slow retrieval, model, or tool calls |
| Safety filter results | Explains refusals and escalations |
| Eval score or online quality signal | Detects quality drift |
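
A structured record per model call keeps these fields queryable. The sketch below covers a subset of the fields; `ModelCallTrace`, the model name, and the per-1k-token prices are hypothetical placeholders, not real provider values.

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class ModelCallTrace:
    trace_id: str
    tenant_id: str
    prompt_id: str
    prompt_version: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    retrieval_ids: list[str] = field(default_factory=list)

    def cost(self, input_price_per_1k: float, output_price_per_1k: float) -> float:
        """Cost from token counts; the prices are supplied by the caller."""
        return (
            self.input_tokens / 1000 * input_price_per_1k
            + self.output_tokens / 1000 * output_price_per_1k
        )

trace = ModelCallTrace(
    trace_id="tr-1", tenant_id="acme", prompt_id="support-answer",
    prompt_version="3", model="example-model", input_tokens=2000,
    output_tokens=500, latency_ms=840.0, retrieval_ids=["doc-a", "doc-b"],
)
print(json.dumps(asdict(trace)))  # one structured log line per model call
print(trace.cost(input_price_per_1k=0.5, output_price_per_1k=1.5))  # 1.75
```

Logging ids rather than raw prompt or document text keeps the trace useful for debugging without copying sensitive payloads into logs.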

Observability dashboard

[Diagram: observability dashboard]

Useful alerts

| Alert | Signal | Likely cause |
| --- | --- | --- |
| Cost spike | Tokens or spend exceed budget | Loop, prompt bloat, model change, abuse |
| Retrieval miss spike | Low citation or low top k relevance | Index drift, parser failure, ACL bug |
| Tool failure spike | Increased tool error rate | API outage, schema mismatch, permission change |
| Refusal spike | More safety blocks | Policy change, adversarial traffic, bad classifier |
| Latency spike | P95 route latency rises | Model slowdown, reranker cost, slow tool |
| Low grounding | Unsupported claim rate rises | Prompt change, poor retrieval, stale memory |

Cost controls

Cost is an architecture concern. AI systems can scale cost faster than ordinary software because token use compounds through retrieval, agent loops, retries, summarization, and parallel tool calls.

Cost levers

| Lever | Technique | Tradeoff |
| --- | --- | --- |
| Model routing | Use smaller model for easy cases | Needs confidence classifier |
| Context limits | Pack only relevant evidence | Can omit useful background |
| Caching | Cache embeddings, retrieval, and stable answers | Must respect permissions and freshness |
| Batch processing | Group offline jobs | Increases delay |
| Early stopping | Stop low value loops | May reduce completion rate |
| Tool first design | Query database directly instead of asking model | Requires deterministic integration |
| Eval sampling | Evaluate representative subset in CI | May miss rare regressions |
| Budget enforcement | Per user, tenant, route, and workflow limits | Needs clear user experience on denial |

Budget policy example

| Workflow | Budget dimension | Example policy |
| --- | --- | --- |
| Chat answer | Tokens per request | Hard stop after context and output budget |
| Code review | Files per run | Review only changed files unless expanded |
| Agent task | Tool calls per task | Stop after N failed attempts |
| RAG ingestion | Documents per job | Backpressure and retry queue |
| Eval run | Cases per commit | Full suite nightly, focused suite per PR |
| Production tenant | Monthly spend | Alert at 70 percent, throttle at 90 percent |
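
The production tenant policy reduces to a threshold check on the fraction of budget spent. A minimal sketch with the 70 and 90 percent thresholds from the table; `budget_action` and its return labels are hypothetical names.

```python
def budget_action(
    spent: float,
    monthly_budget: float,
    alert_at: float = 0.70,
    throttle_at: float = 0.90,
) -> str:
    """Return "ok", "alert", or "throttle" for a tenant's month-to-date spend."""
    if monthly_budget <= 0:
        raise ValueError("monthly_budget must be positive")
    used = spent / monthly_budget
    if used >= throttle_at:
        return "throttle"
    if used >= alert_at:
        return "alert"
    return "ok"

for spent in (40.0, 75.0, 95.0):
    print(spent, budget_action(spent, monthly_budget=100.0))
# 40.0 ok
# 75.0 alert
# 95.0 throttle
```

Checking the throttle threshold first ensures a tenant above 90 percent is never reported as merely in the alert band.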

Safety and data boundaries

Safety is not just content moderation. It includes data access, data retention, tool authority, customer impact, legal constraints, and operational blast radius.

Data classification

| Data class | Examples | Prompt policy | Storage policy |
| --- | --- | --- | --- |
| Public | Docs, public marketing pages | Allowed | Normal retention |
| Internal | Private architecture notes | Allowed only for authorized users | Access controlled traces |
| Confidential | Contracts, private business plans | Minimize and redact | Short retention and audit |
| Customer private | Support tickets, user records | Use only for permitted task | Strict access and deletion |
| Regulated | Health, payment, government identifiers | Avoid unless approved architecture | Specialized compliance controls |
| Secrets | Tokens, private keys, passwords | Never include | Never store in AI logs |
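
For the secrets row, a redaction pass before prompt assembly is one possible last line of defense. The patterns below are deliberately simple, hypothetical examples and will miss many real secret formats; a production system would more likely rely on a maintained secret scanner plus upstream data classification.

```python
import re

# Illustrative patterns only; not a complete or authoritative secret detector.
SECRET_PATTERNS = [
    re.compile(
        r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----",
        re.DOTALL,
    ),
    re.compile(r"(?i)\b(password|token|api[_-]?key)\s*[:=]\s*\S+"),
]

def redact(text: str) -> str:
    """Replace anything matching a secret pattern with a fixed marker."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text

print(redact("deploy failed, password=hunter2 was rejected"))
# deploy failed, [REDACTED] was rejected
```

Redaction complements the "never include" policy; it does not replace keeping secrets out of the context builder's reach in the first place.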

Prompt injection defenses

  • Treat retrieved content as data, not instructions.
  • Keep system and policy instructions outside retrievable documents.
  • Strip or quarantine instructions found inside untrusted documents.
  • Require tool calls to pass policy checks independent of model text.
  • Use allowlisted tools with narrow schemas.
  • Prefer citations and structured outputs for factual workflows.
  • Add tests where documents attempt to override policies.

Boundary diagram

[Diagram: data and safety boundaries]

Practical scenarios

Scenario 1: AI code assistant changes authentication middleware

Risk:

  • Auth middleware is high impact and easy to break with plausible looking code.
  • The assistant may simplify checks that appear redundant but encode security policy.

Expected controls:

  • Read current middleware, route tests, security notes, and authorization helpers.
  • Add or update tests for allowed, denied, expired, missing, and cross tenant cases.
  • Run type checks, auth tests, and focused integration tests.
  • Review diff for permission broadening.
  • Require human review before merge.

Review questions:

  • Did any condition become less restrictive?
  • Are deny by default paths preserved?
  • Are errors safe and non revealing?
  • Are logs free of tokens and private identifiers?
  • Is the behavior covered by regression tests?

Scenario 2: Support agent drafts a refund response

Risk:

  • The agent can produce customer visible commitments.
  • Refund policy may depend on account status and jurisdiction.

Expected controls:

  • Retrieve only authorized customer and policy data.
  • Draft response without issuing refund unless approved.
  • Show cited policy and account facts to reviewer.
  • Require explicit approval for financial action.
  • Log prompt version, retrieved records, and approval identity.

Review questions:

  • Did the agent distinguish policy from customer facts?
  • Did it avoid promising an action before approval?
  • Was private data minimized in the response?
  • Is the financial tool behind a separate approval gate?

Scenario 3: RAG answer over engineering docs

Risk:

  • The answer can sound authoritative while being based on stale or wrong documents.
  • Search may retrieve old docs over current code.

Expected controls:

  • Hybrid retrieval over docs and code.
  • Freshness metadata and source priority.
  • Citation requirement.
  • "I do not know" behavior when sources conflict.
  • Eval cases for renamed APIs, deleted flags, and deprecated commands.

Review questions:

  • Are citations current and authoritative?
  • Does the answer degrade safely, for example by saying "I do not know", when a relevant document is missing?
  • Does the system identify conflicting sources?
  • Are private documents filtered by user permission?

Scenario 4: Agent opens a pull request

Risk:

  • The agent may bundle unrelated changes or claim unverified success.

Expected controls:

  • Scope file edits to task.
  • Run documented verification.
  • Include test evidence in PR description.
  • Leave unrelated dirty files untouched.
  • Avoid pushing unless explicitly authorized.

Review questions:

  • Does the PR match the requested scope?
  • Are generated claims backed by command output?
  • Are secrets, logs, and generated artifacts excluded?
  • Can the change be reverted cleanly?

Scenario 5: Production summarization job cost spike

Risk:

  • Summarization can create nested prompts, retries, and long context windows.

Expected controls:

  • Per job token budget.
  • Batch size limit.
  • Prompt length monitoring.
  • Retry cap with exponential backoff.
  • Alert on spend and output volume.
  • Degraded mode using smaller model or shorter summary.
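
The retry cap with exponential backoff can be sketched in a few lines. The base delay, ceiling, and function names are assumptions for illustration, and jitter is omitted for brevity even though production systems commonly add it.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def backoff_delays(max_retries: int, base: float = 1.0, cap: float = 30.0) -> list[float]:
    """Delay before each retry: base * 2**attempt, limited to `cap` seconds."""
    return [min(cap, base * 2 ** attempt) for attempt in range(max_retries)]

def call_with_retries(
    fn: Callable[[], T],
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying at most `max_retries` times, then re-raise the last error."""
    delays = backoff_delays(max_retries)
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # cap reached: surface the failure instead of looping
            sleep(delays[attempt])
    raise AssertionError("unreachable")

print(backoff_delays(6))  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

With a cap of three retries a single job makes at most four model calls, which bounds how far retries can multiply cost.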

Review questions:

  • Which route, tenant, prompt version, or model caused the spike?
  • Did retries multiply cost?
  • Was context packing too permissive?
  • Did caching fail or become invalidated?

Review checklists

AI assisted code review checklist

  • The assistant read relevant files before editing.
  • The diff is limited to the requested scope.
  • No unrelated formatting churn is included.
  • Tests cover intended behavior and failure cases.
  • Verification commands and results are recorded.
  • Security sensitive code has explicit review.
  • Generated code uses existing project patterns.
  • No secret, token, customer data, or private log was added.
  • Documentation changes match shipped behavior.
  • Any uncertainty is stated rather than hidden.

Agent system review checklist

  • Agent goal is narrow and measurable.
  • Autonomy level is documented.
  • Tool permissions are allowlisted and scoped.
  • Tool calls are validated independently of model text.
  • Human approval is required for high impact side effects.
  • Retries have caps and do not duplicate external actions.
  • Timeouts and cancellation are defined.
  • Memory reads and writes have policy controls.
  • Audit logs include user, tool, policy, and trace ids.
  • Kill switch and rollback path exist.

RAG review checklist

  • Corpus source list is known.
  • Ingestion preserves source, timestamp, owner, and ACL metadata.
  • Chunking strategy matches document type.
  • Hybrid search or exact identifier handling exists where needed.
  • Reranking is evaluated against realistic queries.
  • Retrieval evals include negative and stale cases.
  • Answers cite sources or expose trace links.
  • Permission filtering happens before context packing.
  • Index rebuilds are reproducible.
  • Freshness drift is monitored.

LLMOps release checklist

  • Prompt, model, tool schema, and retrieval index versions are recorded.
  • Offline evals pass against golden and adversarial cases.
  • Cost and latency are within budget.
  • Safety evals include injection, data leak, and unauthorized tool use.
  • Canary or shadow deployment exists for risky changes.
  • Rollback target is known.
  • Observability dashboards include trace, quality, cost, and safety fields.
  • Incident owner and escalation path are defined.

Prompt change checklist

  • Prompt has a stable id and version.
  • Change reason is documented.
  • Input and output contracts are explicit.
  • The prompt distinguishes instructions from context and examples.
  • The prompt avoids hidden policy changes.
  • The prompt does not reveal chain of thought.
  • Tests cover the intended behavior change.
  • Online traces can separate old and new versions.

Data boundary checklist

  • Data classes are identified.
  • User authorization is checked before retrieval.
  • Sensitive data is minimized before model calls.
  • Secrets are excluded from prompts, memory, logs, and evals.
  • Retention period is defined.
  • Deletion path is tested.
  • Cross tenant access is tested.
  • Third party model provider terms match data policy.

Design heuristics

  • Use deterministic code for rules, calculations, permissions, and irreversible actions.
  • Use models for language, ambiguity, ranking, synthesis, and fuzzy classification.
  • Keep tool policy outside the prompt.
  • Keep prompts small enough to review.
  • Prefer structured outputs when downstream code depends on the answer.
  • Prefer retrieval over memory for factual or drift prone data.
  • Prefer memory over retrieval for stable user preferences and repeated context.
  • Prefer evals over vibes.
  • Prefer canaries over big bang model changes.
  • Prefer explicit uncertainty over plausible invention.
  • Prefer source citations over unsupported fluency.

Failure mode index

| Failure | Detection | Prevention |
| --- | --- | --- |
| Hallucinated fact | Grounding eval, citation check | Require source evidence |
| Wrong tool call | Tool trace review | Policy engine and tool allowlist |
| Unauthorized retrieval | ACL test | Filter before context packing |
| Prompt injection | Safety scenario | Treat retrieved text as data |
| Memory contamination | Memory scope test | Source, confidence, expiry, deletion |
| Cost runaway | Spend alert | Budgets, loop caps, caching |
| Latency regression | P95 dashboard | Model routing and timeout budgets |
| Stale answer | Freshness metric | Reindexing and source priority |
| Regression after prompt edit | Eval suite | Versioned prompts and release gates |
| Silent data leak | Log audit | Redaction and retention policy |