Roadmap

The release plan for scriptorium, synthesized from the knowledge/ evidence base. Each phase’s contents are chosen because the research justifies them — not because they’re easy or because they sound impressive.

The implementation-priority section of every knowledge document feeds this roadmap. Findings that the research concluded should not become skills are documented in Explicit non-goals so the project’s claims stay honest.

v0.1 — Foundation (in flight)

The first release proves the architectural pattern. Three leaf skills, shared state, a CLI, and the knowledge layer that grounds the skills in evidence.

Component	Grounded in
`citation-audit` skill	`citation-claim-alignment`, `citation-accuracy-evidence`, `citation-overreach-research`, `hallucination-in-llm-citations`
`reviewer-simulation` skill	`reviewer-archetypes-evidence`, `common-critiques-taxonomy`, `ai-peer-review-research`, `critique-quality-evidence`
`argumentative-flow` skill	`reader-expectation-approach`, `narrative-frameworks`, `argument-mapping`, `semantic-preservation`
`MANUSCRIPT_STATE.yaml` schema + Venice example	All of the above
`scriptorium` CLI (`install`, `validate`, `prompt-pack`, `list`)	Self-evident
Claude Code plugin packaging	—
Knowledge layer (~40 docs)	—
DESIGN.md with scope statement + defensive-design section	`ai-writing-failure-modes`, `discipline-conventions`

Success criterion for v0.1: the three skills run usefully against the Venice 2026 manuscript (and any other manuscript with a populated MANUSCRIPT_STATE.yaml). Output structure is consistent enough that a future orchestrator can consume it.

v0.2 — Coordination + targeted critique additions

After v0.1 has been used on real manuscripts and the structured-output discipline is proven, the next priorities are coordination and the two highest-ROI critique additions identified in research.

Component	Status / Justification
`manuscript-pipeline` orchestrator	Sequences leaf skills; consumes structured outputs. Spec ready; built once the leaves are stable.
`desk-rejection-risk` skill	Landed. `editorial-decision-making` — 70–90% desk-rejection rates at top journals; scriptorium’s value proposition includes catching what would trigger desk rejection.
`venue-fit` skill	Landed. Tiered venue recommendation with predatory refusal, opt-in preprint mode (PCI, Review Commons, F1000Research, eLife post-2022), and bias-managed pub-history calibration. Grounded in three new knowledge notes: `venue-selection`, `predatory-publishing`, `preprint-landscape`.
ESL-aware checks embedded in `argumentative-flow`	Landed. `esl-writers-swales-hyland`.
`author-contribution-audit` skill	Landed. Replaces the originally-planned `contributors:` schema addition. Per `declared-work-scope`, scriptorium operates on declared prose where it lives — duplicating contributions in `MANUSCRIPT_STATE.yaml` would have created a sync problem. The skill audits the Author Contributions section against ICMJE’s four authorship criteria and CRediT’s 14 contributor roles. Grounded in `credit-taxonomy-authorship`.
`reporting-guideline-fit` skill	Landed. Replaces the originally-planned `reporting_guidelines:` schema addition. Authors often don’t know which EQUATOR checklist applies — declaring it in state was the wrong-data-confidently-declared failure mode. The skill infers from the manuscript methods; the author confirms. Grounded in `reporting-guidelines`.

v0.3 — Validation skills + reporting-guideline compliance

Once the structured-output pattern handles critique and transformation reliably, validation skills become the next leverage point. Most need deterministic scripts called from the skills rather than arithmetic performed by the LLM.
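
To illustrate why such checks belong in deterministic scripts, here is a minimal sketch of GRIM, one of the tests listed for `statistics-consistency`. GRIM asks whether a reported mean of integer-valued data (for example Likert responses) is arithmetically possible at the reported sample size. The function name `grim_consistent` and its signature are hypothetical, not scriptorium's implementation, and accepting both half-up and half-even rounding is an assumption about how authors round.

```python
from decimal import Decimal, ROUND_FLOOR, ROUND_HALF_EVEN, ROUND_HALF_UP


def grim_consistent(reported_mean: str, n: int) -> bool:
    """Return True if `reported_mean` could be the mean of n integer values.

    The mean is passed as a string so that trailing zeros ("3.50")
    preserve the reported precision. Hypothetical helper for illustration.
    """
    mean = Decimal(reported_mean)
    # Number of decimal places the author reported.
    decimals = max(0, -mean.as_tuple().exponent)
    quantum = Decimal(1).scaleb(-decimals)

    # The true sum of the data must be an integer close to mean * n.
    target = mean * n
    floor_sum = int(target.to_integral_value(rounding=ROUND_FLOOR))

    for total in (floor_sum - 1, floor_sum, floor_sum + 1):
        if total < 0:
            continue
        candidate = Decimal(total) / Decimal(n)
        # Assumption: accept either common rounding convention.
        for rounding in (ROUND_HALF_UP, ROUND_HALF_EVEN):
            if candidate.quantize(quantum, rounding=rounding) == mean:
                return True
    return False


print(grim_consistent("5.18", 28))  # True: 145 / 28 rounds to 5.18
print(grim_consistent("5.19", 28))  # False: no integer sum over 28 gives 5.19
```

The check is exact integer and decimal arithmetic, so the result is reproducible in a way that in-band model arithmetic is not. It only applies to integer-valued measures and becomes uninformative once n is large relative to the reported precision.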

Component	Status / Justification
`statistics-consistency` skill	`statistical-inconsistency` — Statcheck/GRIM/GRIMMER/SPRITE/Carlisle. Skill orchestrates external scripts; does not pretend to recompute in-band. Design memo: `docs/design/v0.3-statistics-consistency.md`.
`figure-text-alignment` skill	Landed (text-only sub-skill A). `internal-consistency`, `visualization-figures`. Classifies caption ↔ body-text-reference pairs as aligned / partially aligned / misaligned / cannot-determine, plus pattern flags (orphan figure, phantom reference, panel mismatch, axis/units divergence, direction divergence). Pure text-vs-text; no image reading. Multimodal sub-skill B (LLM-vision) remains deferred until reliability is validated against a known-mismatch test set.
`terminology-normalization` skill	Landed (early). `internal-consistency`, `style-guides` — terminology drift detection; preferred-term enforcement from `MANUSCRIPT_STATE.yaml`. Shipped during v0.2 ramp because the grounding notes existed and the schema fields (`terminology.preferred` / `forbidden` / `synonyms`) were already in place.
`gap-finder` skill	Landed (early). Identifies gaps in declared draft prose, organised by a seven-category taxonomy. Each finding anchors in a specific manuscript passage; suggested directions are pasteable search strategies, never invented citations. Grounded in two new knowledge notes: `research-gap-detection`, `literature-search-strategies`.
`reporting-guideline-compliance` skill	Landed. `reporting-guidelines`, `internal-consistency`. Walks an EQUATOR Network checklist (CONSORT, STROBE, PRISMA, ARRIVE, STARD, TRIPOD/TRIPOD+AI, CARE, COREQ, CHEERS, plus AI-extensions) item by item and classifies each as present / partial / missing / not-applicable, with quoted manuscript anchors. Runs downstream of v0.2’s `reporting-guideline-fit`: that skill infers which checklist applies, and this one walks it.
`compression` skill	Landed. `narrative-frameworks`, `semantic-preservation`, `copyediting-vs-developmental`. Page-limit-driven section compression that preserves every citation, statistic, declared `core_claim`, terminology choice, and hedging stack. Per-edit suggestions; never auto-applies. Sits one editorial level below `argumentative-flow` (line-edit posture vs. block-rewrite).
`voice-profile` skill	`corpus-based-stylometry`, `author-role-evidence` — extract author writing patterns from a small single-author corpus. Design memo: `docs/design/v0.3-voice-profile.md`.
`persona-calibration` skill	`author-role-evidence`, `ai-peer-review-research` — checkpoint synthetic feedback against the real author. Design memo: `docs/design/v0.3-persona-calibration.md`.
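
Two of the pattern flags named for `figure-text-alignment` (orphan figure, phantom reference) can be computed from text alone. The sketch below assumes one reading of those terms: an orphan figure has a caption but is never cited in the body, and a phantom reference cites a figure number that has no caption. The names `FigureFlags` and `figure_pattern_flags`, the input shape, and the citation regex are all hypothetical and are not taken from the skill itself.

```python
import re
from dataclasses import dataclass

# Matches in-text citations such as "Figure 3", "Fig. 2A", "fig. 10".
FIG_REF = re.compile(r"\b(?:Figure|Fig\.)\s*(\d+)", re.IGNORECASE)


@dataclass(frozen=True)
class FigureFlags:
    orphan_figures: list[int]      # captioned, never cited in the body
    phantom_references: list[int]  # cited in the body, no caption exists


def figure_pattern_flags(captions: dict[int, str], body_text: str) -> FigureFlags:
    """Compare captioned figure numbers against figure numbers cited in prose."""
    referenced = {int(num) for num in FIG_REF.findall(body_text)}
    captioned = set(captions)
    return FigureFlags(
        orphan_figures=sorted(captioned - referenced),
        phantom_references=sorted(referenced - captioned),
    )


captions = {1: "Cohort flow diagram.", 2: "Survival by arm.", 3: "Dose response."}
body = "Enrollment is shown in Figure 1. Survival differed (Fig. 2A); see Figure 5."
flags = figure_pattern_flags(captions, body)
print(flags.orphan_figures)      # [3]
print(flags.phantom_references)  # [5]
```

Classifying a caption and its body reference as aligned or misaligned still needs judgment about meaning; this sketch covers only the set-difference flags that need none.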

v0.4 — Grant-specific skills and bounded transformations

Originally framed as “Generation skills”. Reframed once declared-work-scope landed: scriptorium does not generate prose from blankness. Generation is in scope when it transforms declared scaffolding into a known target form — the v0.4 work concentrates there. Most of the originally-planned v0.4 skills (discussion-drafting, results-narrative) involved substantial proposer-side judgment incompatible with the scope and have been dropped (see Explicit non-goals); two transformations remain in scope and one grant-specific critique is added.

The unifying frame for v0.4 is the grant-writing workflow, where the author has typically done substantial proposer-side work (mentor discussions, aim selection, significance framing) before sitting down to write, and where the value of bounded-transformation skills is highest.

Component	Justification
`specific-aims` skill	Transforms declared significance + hypotheses + methods (in `MANUSCRIPT_STATE.yaml` and the manuscript’s existing methods scaffolding) into structured NIH-aims prose. Grant-specific; the author has typically already committed to the aims via mentor discussion before invoking — the skill renders them in canonical structure. Grounded in `significance-positioning` and `nih-significance-patterns`. NIH 2025 Simplified Review Framework bundles Significance + Innovation — the skill ladders both, and the NIH Factor 1 / Factor 2 framing is the target structure. In scope because the inputs are declared.
`aims-significance-coherence` skill	Critique skill — audits whether the declared significance is coherent with the stated aims (cross-section consistency check for grants). Pairs with `reviewer-simulation`’s grant-archetype variant (study-section roles); the aims-significance gap is one of the most common NIH-reviewer flags. Grounded in `nih-significance-patterns`, `significance-positioning`, and `reviewer-archetypes-grants`.
`lay-summary` skill	Translation of declared manuscript or grant prose into plain-language form, against funder-specific length and reading-level requirements (NIH 2025 Public Access requirements; Wellcome Trust; EU Clinical Trials Regulation 536/2014). Strongly transformative — both source and target style are declared. Grounded in `plain-language-lay-summaries`.

Skills outside this list that were on the previous v0.4 roadmap — results-narrative, discussion-drafting — have been dropped under declared-work-scope. The remaining grant-side work (cover letters, biosketch-fit, funder-specific compliance checks) is plausibly v0.5+ once v0.4 lands; not committed yet.

v0.5+ — Platform reach + knowledge expansion

Component	Justification
Codex / Gemini / Hermes adapters	Audience reach beyond Claude Code. Most of the work is reusable via the `prompt.md` files already shipped per skill; what remains is thin per-platform installer scripts in `adapters/`.
Per-discipline knowledge layers (physics, CS/ML, mathematics, qualitative social science)	`discipline-conventions` — currently scope-limited to biomedical/clinical. Expand only when non-biomedical adoption emerges.
Astro/Starlight docs site (with Quarto preprocessing)	Mirrors the quartobot pattern. Phase 1.5; placeholder shipped in v0.1.

Explicit non-goals

Findings the research concluded should not become skills, with reasons. This list is load-bearing: it keeps the project honest about what it does and doesn’t claim.

No blank-slate prose generation — declared-work-scope. Scriptorium operates on prose the author has written or scaffolding the author has declared. The proposer role in Hayes’ 2012 writing-process model (generating content from nothing) is the author’s; scriptorium occupies the translator and evaluator roles. Generation skills are in scope when they transform declared scaffolding (v0.4 specific-aims, lay-summary). Generation from a blank section, or “help me figure out what to write about”, is not, and the originally-planned v0.4 discussion-drafting and results-narrative skills were dropped under this scope: the discussion involves substantial proposer judgment (what does this mean? what should be emphasized?) and the “results narrative” risks slipping in claims that go beyond the declared data. The shape of those concerns is covered by argumentative-flow + gap-finder (for an existing discussion) and figure-text-alignment (for results-prose-vs-data consistency).
No general-purpose writing-quality score — quantitative-quality-measures (pending). Flesch-Kincaid / SMOG / Coleman-Liau systematically misrate scientific prose (technical terms inflate difficulty). A quality score would be theater.
No authorial-voice preservation guarantee — ai-writing-failure-modes. Detection of “ChatGPT smell” (Kobak et al. 2024) is possible at corpus level; correction at sentence level is unreliable. Scriptorium’s conservative-edit posture mitigates this but doesn’t claim to eliminate it.
No forensic-expert replacement — forensic-methodology. Bik-style image forensics, Cabanac tortured-phrase detection, and statistical forensics (Carlisle, Statcheck, GRIM, SPRITE) require domain experts. Scriptorium is a pre-submission first pass that catches cheap errors before a manuscript reaches reviewers — not a replacement for sleuths or institutional integrity review.
No autonomous reviewing — ai-peer-review-research. Scriptorium’s reviewer-simulation is explicitly author-side: the author runs it on their own work to pressure-test before submission. Editorial-side use is contrary to current ICMJE, NIH, and major-publisher policies (and we agree with those policies).
No replacement for reference managers — reference-managers. Citation auditing works with whatever bibliography Zotero/Mendeley/Paperpile/BibTeX produces; scriptorium does not manage references itself.
No discipline-specific defaults beyond biomedical/clinical at v0.1–v0.3 — discipline-conventions. The evidence base is biomedical-coded. Expanding to physics, CS/ML, math, humanities requires per-discipline knowledge layers that don’t exist yet; PRs welcome.
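
The readability non-goal above can be made concrete with the Flesch-Kincaid grade formula, 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. The sketch below is illustrative only: the syllable counter is a crude vowel-run heuristic (an assumption, not a validated method), and the two example sentences are invented. Both sentences have the same length and structure, so the difference in grade comes from syllable count alone.

```python
import re


def count_syllables(word: str) -> int:
    """Crude heuristic: one syllable per run of vowels, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level using the heuristic syllable counter."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


plain = "We gave the drug to ten mice."
technical = "We administered immunotherapy to ten immunodeficient mice."

# Same word count and sentence count; only the vocabulary differs.
print(round(flesch_kincaid_grade(plain), 1))
print(round(flesch_kincaid_grade(technical), 1))
```

The second sentence scores many grade levels higher even though a specialist reader would find it no harder, which is the misrating the non-goal describes.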

Update cadence

This roadmap is reviewed at each release. Issues opened against deferred items are welcome but get triaged against the priority order above.