Version: MVP

Differentiators

10 min read · For engineers & reviewers · Updated 2026-05-19

What you'll learn

The features that keep Craik's roadmap from collapsing into basic CLI, storage, and adapter work.
Why every durable assertion must be traceable to evidence.
How Craik separates raw agent output from organizational truth.
The runtime primitives — assumption ledger, scope locks, scratchpad expiry, context debt, runtime critic — that make agent work auditable.

Don't become a generic agent launcher.

Craik's differentiators center on durable, governed, evidence-backed agent work. This doc captures the features that should keep the roadmap honest as it grows.

Evidence-first execution

Every durable conclusion should be traceable to evidence.

Runtime rule: No durable assertion without evidence. Craik may allow low-confidence assumptions, but they must not be promoted to durable facts without evidence.

Evidence sources: file reads · command output · GitHub issues, PRs, comments, checks · Stigmem facts · user instructions · web sources · prior handoffs · generated artifacts · runner outputs.

Assumption ledger

Agents make assumptions constantly. Craik should separate assumptions from facts. Each assumption captures statement · source · confidence · task context · verification requirement · expiration · whether action is allowed before verification. Assumptions are visible in case files, handoffs, and memory diffs.
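A ledger entry and the evidence rule can be sketched as follows. This is a minimal illustration, not Craik's schema: the class name, the field names, the 0.0-1.0 confidence scale, the `evidence_refs` field, and both helper functions are assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Assumption:
    """One hypothetical ledger entry; fields mirror the list in the text."""
    statement: str
    source: str
    confidence: float                 # assumed 0.0-1.0 scale
    task_context: str
    verification_requirement: str
    expires_at: Optional[datetime]    # None means no expiry was set
    action_allowed_before_verification: bool = False
    evidence_refs: list[str] = field(default_factory=list)  # assumed field


def may_act_on(a: Assumption, now: datetime) -> bool:
    """Expired entries block action; otherwise evidence or the explicit flag allows it."""
    if a.expires_at is not None and now >= a.expires_at:
        return False
    return bool(a.evidence_refs) or a.action_allowed_before_verification


def promote_to_fact(a: Assumption) -> dict:
    """Apply the runtime rule: no durable assertion without evidence."""
    if not a.evidence_refs:
        raise ValueError("cannot promote an assumption without evidence")
    return {"statement": a.statement, "evidence": list(a.evidence_refs)}
```

An entry with no evidence and no pre-verification flag can be recorded and shown in a case file, but it can neither drive action nor become a durable fact until an evidence reference is attached.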

Belief promotion workflow

Craik should distinguish raw agent output from organizational truth.

observed → proposed → accepted → relied_upon → stale → invalidated

The lifecycle applies to memory proposals and eventually to selected Stigmem facts through metadata or companion facts.
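The lifecycle reads naturally as a small state machine. The sketch below uses the six state names from the text; the transition table is an assumption, since the source shows only the forward chain. In particular, allowing invalidation from any non-terminal state and re-acceptance of a stale belief are illustrative choices.

```python
from enum import Enum


class Belief(Enum):
    OBSERVED = "observed"
    PROPOSED = "proposed"
    ACCEPTED = "accepted"
    RELIED_UPON = "relied_upon"
    STALE = "stale"
    INVALIDATED = "invalidated"


# Assumed transition table: the forward chain from the text, plus
# invalidation from any non-terminal state and refresh of stale beliefs.
ALLOWED = {
    Belief.OBSERVED: {Belief.PROPOSED, Belief.INVALIDATED},
    Belief.PROPOSED: {Belief.ACCEPTED, Belief.INVALIDATED},
    Belief.ACCEPTED: {Belief.RELIED_UPON, Belief.STALE, Belief.INVALIDATED},
    Belief.RELIED_UPON: {Belief.STALE, Belief.INVALIDATED},
    Belief.STALE: {Belief.ACCEPTED, Belief.INVALIDATED},
    Belief.INVALIDATED: set(),
}


def transition(current: Belief, target: Belief) -> Belief:
    """Return the new state, or raise if the move skips a review step."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

The point of the table is that raw output cannot jump straight from observed to accepted: it has to pass through proposed, where review can happen.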

Context budgeting as policy

Context assembly should be explainable. Case files capture why each item was included · what was summarized · what was excluded · what was omitted due to budget · what must be fetched on demand · whether omissions create risk.
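One way to make assembly explainable is to have the packer return its omissions alongside its inclusions. The sketch below assumes a greedy, priority-ordered packer; `ContextItem`, `assemble`, the priority convention, and the output keys are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ContextItem:
    name: str
    tokens: int
    priority: int      # higher is more important (assumed convention)
    reason: str        # why the item is a candidate for inclusion


def assemble(items: list[ContextItem], budget: int) -> dict:
    """Greedy pack by priority; record what was left out and why."""
    included, omitted, used = [], [], 0
    for item in sorted(items, key=lambda i: -i.priority):
        if used + item.tokens <= budget:
            included.append({"name": item.name, "why": item.reason})
            used += item.tokens
        else:
            # Omitted for budget; flagged so it can be fetched on demand.
            omitted.append({"name": item.name, "why": "budget", "fetch_on_demand": True})
    return {"included": included, "omitted": omitted, "tokens_used": used}
```

A real policy would likely also record summarization and a risk assessment per omission; the sketch covers only the inclusion reason and the budget omission.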

Agent run reproducibility

Run records link the full provenance chain so reviewers can replay operationally — not deterministically — what an agent knew and what it was allowed to do.

| Linked record | Purpose | Why it matters |
| --- | --- | --- |
| task_request | input | Original ask the run was launched against. |
| case_file | brief | Pre-run context bundle. |
| policy_envelope | authority | What the run was allowed to do. |
| runner_adapter + metadata | execution | Which runner produced output, with model identifiers. |
| capability_grants | authority | Explicit permissions exercised. |
| relevant_facts | context | Memory facts loaded into the case file. |
| receipts | accountability | Every governed action that fired. |
| commands + outputs | execution | What ran and what came back, redacted. |
| memory_proposals | delta | Reviewable facts the run wants to land. |
| contradictions | delta | Conflicts surfaced during the run. |
| handoff | closure | The terminal continuity record. |

Trust boundaries between agents

Codex, Claude, Gemini, and future runners are not equally trusted by default. Policy controls whether a runner may propose facts · write facts · edit files · run shell · open issues or PRs · approve another agent's work · resolve contradictions · use fail-open profiles.
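A default-deny lookup is one simple way to express per-runner trust. In the sketch below the action labels and the specific grants per runner are invented for illustration; the source does not say which runner is allowed to do what.

```python
# Hypothetical grants. A real policy would come from the project's policy
# envelope rather than a module-level dict.
RUNNER_POLICY: dict[str, frozenset[str]] = {
    "codex": frozenset({"propose_facts", "edit_files", "run_shell"}),
    "claude": frozenset({"propose_facts", "edit_files", "open_pr"}),
    "gemini": frozenset({"propose_facts"}),
}


def is_allowed(runner: str, action: str) -> bool:
    """Default deny: unknown runners and unlisted actions are refused."""
    return action in RUNNER_POLICY.get(runner, frozenset())
```

Because the fallback is an empty set, a newly added runner gets no authority until a policy entry is written for it.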

Cross-agent review protocol

Craik uses explicit review roles instead of a single orchestrator/specialist decomposition.

Implementer: Does the primary work.
Verifier: Runs validation, confirms claims.
Adversarial reviewer: Finds gaps and unsupported claims.
Policy reviewer: Checks governance compliance.
Documentation reviewer: Aligns docs with implementation.
Memory curator: Hygiene over time.
Release reviewer: Gate before publication.
Adjudicator: Resolves disagreements.

Review outputs are typed, evidence-linked, and graph-connected.

Staleness as a first-class signal

Old truths are a major failure mode. Craik surfaces staleness for facts · docs · handoffs · assumptions · GitHub issue state · branch state · runner outputs · generated artifacts · project policies. Every case file says what's fresh, stale, or unknown.

Decision record suggestions

Craik notices when runtime knowledge is becoming durable project policy.

Signals: repeated reliance on the same fact · resolved contradictions that affect future behavior · recurring policy overrides · repeated docs updates from the same root cause · cross-agent agreement on an architectural constraint.

Craik suggests that maintainers create or update ADRs — it does not write them automatically.

Agent-native onboarding

craik onboard --project <project-id> outputs the canonical bundle a new runner needs.

Output: current project model · active policies · relevant ADRs · docs boundaries · recent handoffs · unresolved contradictions · validation commands · Stigmem connection status · known traps · allowed next actions.

Provenance-aware documentation

For generated or updated docs, Craik records source facts · source files · source issues/PRs · relevant policies · validation commands · authoring agent · review agent · update timestamp. Documentation stays tied to the evidence that justified it.

Policy tests

Craik policies are testable. Policy tests run in CI and as fixture-based local tests.

Immutable paths: ADRs cannot be edited under strict mode.
Memory proposal default: Memory writes become proposals unless granted.
Trusted-local receipts: Fail-open still seals receipts.
Automation fail-closed: Automation mode stops instead of widening.
Grant boundaries: Runner adapters cannot bypass grants.
Redaction regressions: Secrets are scrubbed from receipts and handoffs.

Human delegation points

Human involvement is a runtime primitive, not an interruption.

Delegation kinds: approval request · clarification request · policy override request · contradiction adjudication request · memory promotion request · release signoff request.

Delegation points become graph nodes, appear in handoffs, and produce receipts when resolved.

Budget and quota controls

Budgets bound agent work with operational limits visible in case files and receipts: context tokens · model spend · wall-clock time · shell command count · GitHub write count · memory write count · parallel worker count · retry count · human approval count.
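A budget of this kind can be modeled as a set of named counters that refuse to cross their limits. The sketch below is illustrative only; the `Budget` class, the `BudgetExceeded` exception, and the counter names are assumptions.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a charge would push a counter past its limit."""


class Budget:
    def __init__(self, limits: dict[str, int]) -> None:
        self.limits = dict(limits)
        self.used = {name: 0 for name in limits}

    def spend(self, name: str, amount: int = 1) -> int:
        """Charge a counter and return what remains; raise before the limit is crossed."""
        if name not in self.limits:
            raise KeyError(f"no budget defined for {name!r}")
        if self.used[name] + amount > self.limits[name]:
            raise BudgetExceeded(
                f"{name}: {self.used[name]} + {amount} > {self.limits[name]}"
            )
        self.used[name] += amount
        return self.limits[name] - self.used[name]
```

Checking before incrementing means a refused charge leaves the counter unchanged, so the `used` map can be copied into a receipt as-is.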

Learning without self-trust

Agents may propose facts · skills · policy refinements · validation commands · docs updates · decision record suggestions · plugin ideas. Promotion always requires evidence, policy, review, or explicit approval.

The self-trust rule

Craik may learn continuously, but it should not self-certify truth.

This principle guides every self-improving feature.

Runtime instruction distillation

Craik turns declared agent-runtime instruction files into structured runtime memory.

Recognized sources: AGENTS.md · CLAUDE.md · GEMINI.md · HERMES.md · SKILLS.md · .cursorrules · .github/copilot-instructions.md · .codex/instructions.md · project policy docs explicitly listed in the project profile.

Source Markdown remains canonical. Distilled output is a provenance-linked runtime projection.

Distilled categories: instruction · policy · preference · command · boundary · handoff rule · memory rule · security rule · stale-risk.

Distillations track source path, source hash, line/range, scope, timestamp, and extraction confidence. Extracted items become proposals by default and are invalidated when the source hash changes.
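Hash-based invalidation can be sketched in a few lines. The example assumes SHA-256 as the source hash and whole-file granularity; the `Distillation` record and `revalidate` helper are hypothetical names, and the record omits the scope, timestamp, and confidence fields for brevity.

```python
import hashlib
from dataclasses import dataclass


def source_hash(text: str) -> str:
    """SHA-256 of the source text (the hash function is an assumed choice)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class Distillation:
    source_path: str
    source_hash: str
    line_range: tuple[int, int]
    category: str
    text: str
    status: str = "proposed"   # extracted items start as proposals


def revalidate(items: list[Distillation], path: str, current_text: str) -> list[Distillation]:
    """Mark every item extracted from `path` invalidated if the file's hash changed."""
    current = source_hash(current_text)
    for item in items:
        if item.source_path == path and item.source_hash != current:
            item.status = "invalidated"
    return items
```

Whole-file hashing is conservative: any edit to the source invalidates every item extracted from it, including items whose lines did not change. Hashing per line range would be a finer-grained alternative.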

Task intent lock

Craik freezes the accepted task intent before execution. The lock captures original request · accepted interpretation · excluded work · allowed autonomy · stop conditions · scope-change rules — giving agents a stable north star and making scope drift reviewable.
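In Python terms, a lock like this maps onto an immutable record. The sketch below uses a frozen dataclass with the six captured fields; the field types and the `is_excluded` helper are assumptions for illustration.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class IntentLock:
    original_request: str
    accepted_interpretation: str
    excluded_work: tuple[str, ...]
    allowed_autonomy: str
    stop_conditions: tuple[str, ...]
    scope_change_rules: str


def is_excluded(lock: IntentLock, work: str) -> bool:
    """True if the named work item was explicitly excluded when the lock was taken."""
    return work in lock.excluded_work
```

Assigning to any field after construction raises `FrozenInstanceError`, so changing the intent requires creating a new lock, which is an event a reviewer can see.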

Scratchpad with expiry

Working memory that is not durable truth. Scratchpad space holds temporary notes · candidate hypotheses · partial findings · links to inspect · unresolved fragments — and expires at task end unless promoted to assumptions, facts, handoffs, or artifacts.
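The expiry rule can be illustrated with a small in-memory class. Everything here is a sketch under assumptions: the `Scratchpad` API is hypothetical, and only the four promotion targets are taken from the text.

```python
class Scratchpad:
    """Task-scoped working memory; nothing here is durable on its own."""

    PROMOTION_TARGETS = {"assumption", "fact", "handoff", "artifact"}

    def __init__(self) -> None:
        self._notes: dict[str, str] = {}
        self._promoted: dict[str, str] = {}   # note key -> promotion target

    def note(self, key: str, text: str) -> None:
        self._notes[key] = text

    def promote(self, key: str, target: str) -> None:
        if target not in self.PROMOTION_TARGETS:
            raise ValueError(f"unknown promotion target: {target}")
        if key not in self._notes:
            raise KeyError(key)
        self._promoted[key] = target

    def end_task(self) -> list[tuple[str, str, str]]:
        """Return promoted notes as (target, key, text) and drop everything else."""
        survivors = [(t, k, self._notes[k]) for k, t in self._promoted.items()]
        self._notes.clear()
        self._promoted.clear()
        return survivors
```

Expiry is the default path and promotion is the explicit one, which keeps half-formed hypotheses from leaking into durable memory.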

Negative knowledge

Useful dead ends are preserved with freshness rules.

Approaches rejected: What's already been tried and didn't work.
Failed commands: Commands that errored and why.
Non-existent APIs: Endpoints checked and not found.
Irrelevant files: Files inspected and found unrelated.
Disproven assumptions: Claims refuted by evidence.
Unavailable names: Package or registry names checked and not free.

Absence can change — freshness rules apply to negative knowledge too.

Capability dry run

Before granting side-effecting capabilities, an agent previews intended actions: files expected to change · shell commands expected to run · GitHub writes expected · facts expected to be proposed or written · policy triggers · approvals likely needed. The runtime then grants narrower authority.
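Narrowing can be modeled as a set intersection between what the agent previewed and the most that policy permits. The capability names, the shape of the preview, and `narrow_grant` itself are assumptions for this sketch.

```python
def narrow_grant(
    preview: dict[str, set[str]],
    policy_max: dict[str, set[str]],
) -> dict[str, set[str]]:
    """Grant only targets that were both previewed and permitted by policy."""
    grant: dict[str, set[str]] = {}
    for capability, targets in preview.items():
        permitted = targets & policy_max.get(capability, set())
        if permitted:
            grant[capability] = permitted
    return grant
```

For example, if policy permits edits to `src/a.py`, `src/b.py`, and `docs/x.md` but the preview names only `src/a.py`, the resulting grant covers `src/a.py` alone. Previewed targets that policy does not permit are dropped, not granted.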

Evidence coverage score

A real coverage signal, not a fake certainty score.

| Level | Source | What's behind the claim |
| --- | --- | --- |
| unsupported | none | No backing source. Always low-confidence. |
| single-source | one citation | One file/issue/fact. Confidence depends on the source. |
| multi-source | multiple citations | Multiple independent supports. |
| runtime-observed | live execution | Output captured at runtime via a wrapper. |
| policy-backed | runtime contract | The claim is what policy guarantees. |
| verified by command/test | execution result | A test or command confirms the claim. |
| reviewed | another actor | Another agent or human reviewed. |

Structured agent debate

When agents disagree, Craik structures the disagreement. Debate records capture claim · evidence · counterclaim · counter-evidence · missing verification · adjudicator decision · resulting memory updates.

Self-audit before handoff

Before finishing, agents run a standard self-audit.

Answered the locked intent.
Stayed in scope.
Cited evidence.
Recorded assumptions.
Recorded validation.
Created needed facts or proposals.
Avoided forbidden paths.
Left next steps.
Produced a useful handoff.

Context debt tracking

When context is omitted, summarized, or deferred because of budget, Craik tracks omitted item · reason · risk · required follow-up · whether the current task may proceed. Context debt is durable; the next run inherits it as carryover.

Tool result attestation

Different result sources have different trust profiles.

| Class | Trust profile | When acceptable |
| --- | --- | --- |
| runtime-observed | high | Captured by a runtime wrapper; receipted. |
| agent-reported | low | Agent's claim about a result. Needs verification. |
| user-reported | trust user | Operator-asserted state. |
| external API | scoped | Captured from a remote service receipt. |
| inferred | low | Derived from an artifact, not directly observed. |

Important claims like "tests passed" should require runtime-observed receipts whenever possible.
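A gate for important claims might look like the sketch below. The class names and trust profiles are taken from the text; the acceptance rule itself (important claims need a receipted runtime observation, everything else is accepted with its class recorded) is an assumed simplification, as are the function and parameter names.

```python
from typing import Optional

TRUST_PROFILE = {
    "runtime-observed": "high",
    "agent-reported": "low",
    "user-reported": "trust user",
    "external API": "scoped",
    "inferred": "low",
}


def accept_result(attestation: str, receipt_id: Optional[str], important: bool) -> bool:
    """Assumed rule: important claims need a receipted runtime observation."""
    if attestation not in TRUST_PROFILE:
        raise ValueError(f"unknown attestation class: {attestation}")
    if important:
        return attestation == "runtime-observed" and receipt_id is not None
    return True
```

Under this rule an agent saying "tests passed" is not enough for an important claim; the runtime must have captured the test run and issued a receipt for it.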

Runtime memory hygiene

Curator workflows for memory quality. Curator tasks find stale assumptions · duplicate facts · unpromoted useful proposals · weak-evidence facts · contradictions · expired handoffs · obsolete negative knowledge. Cleanup is proposed rather than applied; by default nothing is deleted automatically.

Recovery mode

Interrupted runs are resumable. Recovery uses task request · intent lock · case file · policy envelope · partial receipts · scratchpad · changed files · unfinished handoff · unresolved delegations · memory proposals. Incomplete runs still leave useful handoffs.

Runner capability matrix

Craik knows what each runner can do — and routes accordingly.

Capabilities tracked: shell access · file patching · browser/web access · MCP support · image input · structured output · long context · background tasks · approval flow · tool-call reliability.

The matrix influences runner selection, prompt compilation, and policy grants.
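Routing on the matrix reduces to a subset check. In the sketch below the runner names are placeholders and the capability assignments are invented; the source does not state which runner has which capability.

```python
# Placeholder runners with invented capabilities, for illustration only.
MATRIX: dict[str, set[str]] = {
    "runner-a": {"shell", "file_patching", "structured_output"},
    "runner-b": {"file_patching", "web", "long_context", "structured_output"},
    "runner-c": {"web", "image_input"},
}


def eligible_runners(required: set[str]) -> list[str]:
    """Return runners whose capabilities cover every requirement, sorted by name."""
    return sorted(name for name, caps in MATRIX.items() if required <= caps)
```

An empty result is a useful signal in its own right: no single runner can take the task, so it has to be split or escalated.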

Scope change protocol

When an agent finds work outside the locked intent, it files a scope-change proposal capturing requested scope change · rationale · evidence · risk · whether current work is blocked · recommended action.

Knowledge freshness probe

Before relying on stale or high-impact facts, Craik can refresh relevant state.

Probe targets: repo state · GitHub state · package registries · Stigmem facts · local command output · web sources (when allowed).

Public / internal boundary classifier

Craik classifies where content belongs and helps prevent internal-only labels or implementation tracking details from leaking into public docs.

Targets: public docs · internal docs · issue or PR comments · memory facts · handoffs · release notes · audit artifacts.

Runtime context explanations

Every case-file item is explainable. Agents should be able to ask, "Why am I seeing this?" and get a real answer.

Policy required: Included because policy mandates it.
Recent handoff: Included because a recent handoff referenced it.
Contradiction: Included because it contradicts a current assumption.
Stale + high-risk: Included because it is stale but high-risk.
Task-type: Included because the task type requires it.

Structured context requests

Agents request more context through a structured protocol. Fields: need · reason · urgency · allowed source scope · blocking status · expected output shape. Craik fulfills requests through safe channels and records the result.

First-class unknowns

Agents say "unknown" without being treated as incomplete. Unknowns identify whether resolution requires web access · user input · repo inspection · privileged tool use · Stigmem query · waiting for external state.

Runtime critic

A structured critic pass before accepting major outputs.

Unsupported claims: Claims without evidence references.
Policy violations: Actions that crossed the envelope.
Scope drift: Work outside the intent lock.
Missing validation: Claims unverified by command or test.
Stale evidence: Citations that may have moved.
Missing handoff: Run that didn't close cleanly.
Unredacted content: Sensitive data that slipped through.
Risky memory writes: Promotions without sufficient evidence.

Agent workload memory

Routing memory, not social reputation. Craik remembers which agents and runners perform well on which work.

Signal examples: strong at docs reconciliation · weak at shell-heavy debugging · strong at policy review · tends to miss stale GitHub state · needs stricter context · produces high-quality handoffs.

Known traps

Projects maintain known traps — negative knowledge appearing in onboarding and case files.

Don't edit ADRs
Public docs can't reference internal labels
Tests must run outside the sandbox
Generated docs live elsewhere
Local node advertises a non-standard port
Package version is intentionally pre-release

Evidence expiration rules

Different evidence kinds have different shelf lives.

| Source | Freshness | Why |
| --- | --- | --- |
| GitHub branch state | expires quickly | Branches advance every push. |
| Package registry availability | expires quickly | Names can be claimed at any time. |
| ADR policy | long-lived | Decisions change rarely and explicitly. |
| Command output | tied to commit | Validity depends on the worktree state. |
| Web search | time-sensitive | Content can change at any moment. |
| User instruction | until superseded | Active until the operator updates it. |
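These rules mix three kinds of expiry: time-based, commit-based, and supersession-based. The sketch below shows one way to dispatch on them. The concrete durations are placeholders chosen for the example, not values from the source, and the kind labels and function signature are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Placeholder durations; the source gives only qualitative shelf lives.
TTL = {
    "github_branch_state": timedelta(minutes=5),
    "package_registry": timedelta(minutes=5),
    "web_search": timedelta(hours=1),
    "adr_policy": timedelta(days=365),
}


def is_fresh(
    kind: str,
    captured_at: datetime,
    now: datetime,
    *,
    captured_commit: Optional[str] = None,
    head_commit: Optional[str] = None,
    superseded: bool = False,
) -> bool:
    """Dispatch on evidence kind: commit-bound, supersession-bound, or time-bound."""
    if kind == "command_output":
        # Valid only while the worktree is at the commit it was captured on.
        return captured_commit is not None and captured_commit == head_commit
    if kind == "user_instruction":
        return not superseded
    return now - captured_at <= TTL[kind]
```

Command output has no time limit in this sketch: an hour-old test run is still fresh if the commit has not moved, and a one-second-old run is stale if it has.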

Handoff quality score

Handoffs are checked for completeness.

Signals: completed work · changed files · validation · assumptions · unresolved questions · next steps · facts proposed or written · receipts · context debt · delegation status.
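A completeness check over these signals can be as simple as counting which ones a handoff records. The sketch assumes a handoff is a dictionary keyed by signal name and that an explicitly empty value (for example, no unresolved questions) still counts as recorded. The key names and the equal weighting are illustrative choices.

```python
SIGNALS = (
    "completed_work", "changed_files", "validation", "assumptions",
    "unresolved_questions", "next_steps", "facts", "receipts",
    "context_debt", "delegation_status",
)


def handoff_completeness(handoff: dict) -> tuple[float, list[str]]:
    """Return the fraction of signals recorded and the names of the missing ones."""
    missing = [s for s in SIGNALS if handoff.get(s) is None]
    return (len(SIGNALS) - len(missing)) / len(SIGNALS), missing
```

Returning the missing names alongside the score keeps the check actionable: the agent learns which fields to fill in, not just that the handoff fell short.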

Policy-aware prompt compiler

Craik compiles runner-specific prompts from the same underlying runtime contracts.

Inputs: locked task intent · policy envelope · context contract · runner capabilities · evidence · assumptions · allowed tools · output schema.

Codex, Claude, and Gemini may need different prompt shapes, but the underlying truth is shared.

Real-runner contract tests

Mocks are not enough for runner adapters. Craik periodically tests Codex, Claude, and Gemini adapters against fixture tasks and verifies that outputs conform to Craik contracts.

Memory impact preview

Before writing facts to Stigmem, Craik shows a memory-diff preview: facts to add · facts to invalidate · contradictions likely to open · affected case files / handoffs / docs · scope and visibility · confidence · evidence.

Agent exit discipline

Agents that cannot complete a task still leave useful state.

Incomplete exits include: why blocked · what was checked · what is safe to continue · what is unsafe · missing context · unresolved delegations · next best action.

Red team mode

High-risk tasks support a stricter reviewer mode. Checks include leaked secrets · public/internal boundary violations · unsupported claims · unsafe command grants · bad memory writes · policy bypasses · misleading docs updates.

Work product classification

Every artifact has a type and lifecycle, and the class drives policy.

Scratch: Expires at task end.
Proposal: Awaits review.
Implementation: The primary deliverable.
Review: Cross-agent review output.
Decision: ADR or equivalent.
Release: Versioned, signed.
Public docs: External-facing.
Internal docs: Operator-only.
Memory update: Fact-store delta.
Audit artifact: Receipt, handoff, graph export.

What changed since last time

Before an agent starts, Craik shows relevant deltas since the last related run — continuity without forcing rediscovery.

Tracked deltas: files changed · facts changed · issues changed · PRs changed · policies changed · handoffs added · contradictions opened or resolved · package versions changed.
