GraySearch Feature Expansion

Competitive gap analysis → 9-phase build plan · 3x adversarial-tested

Created: s200 (2026-04-06)
Phases: 0 – 8 (9 phases)
Est. Effort: 19 – 27 sessions
Pressure Tests: 3 rounds, 41 findings

Executive Summary

Build Progress: Phases 0 + 1 complete, Phase 3 started (s205)
| Phase | Theme | Effort | Status |
|---|---|---|---|
| 0 | Context + Config Surface | 2–3 sessions | Complete (s204) |
| 1 | Polish & Quick Wins + JS Extract | 2–3 sessions | Complete (s205) |
| 2 | Voice Output (TTS) | 2–3 sessions | Not Started |
| 3 | Agentic Deep Research | 4–6 sessions | In Progress (s205) |
| 4 | Visual & Rich Content | 2–3 sessions | Not Started |
| 5 | Advanced Organization | 3–5 sessions | Not Started |
| 6 | Specialty Search | 1–2 sessions each | Not Started |
| 7 | Tabular Data & Spreadsheets | 3–5 sessions | Not Started |
| 8 | Context Intelligence | 3–5 sessions | Not Started |

Build order: Phase 0 → 1 → 3 → 2 → 4 → 5 → 6. Phase 3 comes before Phase 2 because the research agent closes the highest-value competitive gap.

Guiding Principles

1. Don't chase parity for parity's sake. Only build features that serve the actual research workflow.
2. Preserve the speed advantage. Quick mode must stay under 3s.
3. Build on what's unique. Group context, personalization, multi-provider diversity are moats.
4. Incremental value delivery. Every phase ships something usable.
Multi-Provider Search Architecture

GraySearch sends every query to three independent search providers simultaneously, merges and deduplicates results, then re-ranks the unified pool. This is the same multi-retrieval pattern used by Perplexity, Google AI Mode, and ChatGPT search.

| Provider | Strength | Index | Latency |
|---|---|---|---|
| Brave | Fastest latency, strong keyword precision, independent 30B+ page index | Own | ~670ms |
| Exa | Semantic understanding, spam filtering, high-signal authoritative content | Own | ~2s |
| Parallel | Strong accuracy-to-cost ratio, independent ranking perspective | Own | ~5–14s |

Pipeline

```
User Query
  └─ ASYNC FAN-OUT
       ├─ Brave     ~670ms · keyword-strong
       ├─ Exa       ~2s    · semantic search
       └─ Parallel  ~5–14s · independent rank
  → Merge + Dedup
  → Re-Rank → Ranked Results
  → LLM Synthesis → Answer + Citations
```
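The fan-out step can be sketched with asyncio. The provider functions, result shapes, and timeouts below are illustrative stand-ins, not the real client code:

```python
import asyncio

# Stand-in provider calls -- the real versions hit the Brave, Exa, and
# Parallel HTTP APIs. Result shapes here are invented for illustration.
async def search_brave(q: str) -> list[dict]:
    return [{"url": "https://a.com", "provider": "brave"}]

async def search_exa(q: str) -> list[dict]:
    return [{"url": "https://a.com", "provider": "exa"}]

async def search_parallel(q: str) -> list[dict]:
    return [{"url": "https://b.com", "provider": "parallel"}]

# (coroutine, timeout in seconds) -- rough figures from the latency table
PROVIDERS = [(search_brave, 2.0), (search_exa, 5.0), (search_parallel, 15.0)]

async def fan_out(query: str) -> list[dict]:
    """Query all providers concurrently with per-provider timeouts, so a
    slow or failing provider degrades results instead of blocking the search."""
    async def guarded(fn, timeout):
        try:
            return await asyncio.wait_for(fn(query), timeout)
        except Exception:
            return []  # timeout or provider error: drop this pool, don't fail
    pools = await asyncio.gather(*(guarded(fn, t) for fn, t in PROVIDERS))
    return [r for pool in pools for r in pool]  # merged pool, ready for dedup

results = asyncio.run(fan_out("best espresso machine"))
```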

Why Three Providers?

| Benefit | Mechanism |
|---|---|
| Better recall | Three indexes catch what one misses |
| Better precision | Cross-provider agreement filters noise |
| Resilience | If one API goes down, the other two still work |
| Speed | Async fan-out = as fast as the fastest provider (with timeouts) |
| No vendor lock-in | Can swap providers without rewriting the system |
| Quality signal | Dedup overlap acts as an implicit relevance vote |
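A minimal sketch of the merge/dedup vote (essentially reciprocal-rank fusion with an agreement bonus). The scoring weights are invented for illustration; the real re-ranker may weigh things differently:

```python
from urllib.parse import urlsplit

def merge_and_dedup(pools: dict[str, list[dict]]) -> list[dict]:
    """Merge per-provider result pools. A URL returned by several
    independent providers gets an agreement boost -- the implicit
    relevance vote described above. Weights are illustrative."""
    seen: dict[str, dict] = {}
    for provider, results in pools.items():
        for rank, r in enumerate(results):
            # normalize the URL (drop fragment) so near-duplicates collapse
            key = urlsplit(r["url"])._replace(fragment="").geturl()
            entry = seen.setdefault(key, {**r, "providers": set(), "score": 0.0})
            entry["providers"].add(provider)
            entry["score"] += 1.0 / (rank + 1)          # reciprocal-rank term
    for e in seen.values():
        e["score"] += 0.5 * (len(e["providers"]) - 1)   # cross-provider bonus
    return sorted(seen.values(), key=lambda e: -e["score"])

merged = merge_and_dedup({
    "brave":    [{"url": "https://a.com/x", "title": "X"}],
    "exa":      [{"url": "https://a.com/x", "title": "X"},
                 {"url": "https://b.com/y", "title": "Y"}],
    "parallel": [{"url": "https://b.com/y", "title": "Y"}],
})
```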

Benchmark data (2025–2026) shows the top four search APIs are statistically indistinguishable on quality when evaluated individually. The winning strategy is to combine multiple providers and let the ensemble outperform any single one.

Already Built (Pre-Plan)

GraySearch core was built across s197–s200 before this expansion plan was created.

s201: Plan page, /plans index, inbox replay fix, config schema (38 keys), settings cleanup, thread archival + health logging, defaults unified.
s204: Phase 0 complete (0A-0D). Per-mode context scaling, XML-tagged exchanges, full config surface + tuning panel, metrics logging + rolling averages, limit warnings, mode badges, Opus option, table rendering, JS extraction to static file, validation hooks.

Table of Contents

  1. Executive Summary
  2. Multi-Provider Search Architecture
  3. Already Built (Pre-Plan)
  4. Phase 0A: Per-Mode Context Scaling
  5. Phase 0B: Context Format Upgrade
  6. Phase 0C: Unified Config Surface
  7. Phase 0D: Config UI (Tuning Panel)
  8. Phase 1A: Observability & Cleanup
  9. Phase 1B: Export / Report Generation
  10. Phase 1C: Search Progress Enhancement
  11. Phase 1D: Search Result Previews
  12. Phase 1E: Extract JS to Static File
  13. Phase 2: Voice Output (TTS)
  14. Phase 3A: Research Agent Architecture
  15. Phase 3B: Progress Streaming
  16. Phase 3C: Report Generation
  17. Phase 3D: UI Integration
  18. Phase 4: Visual & Rich Content
  19. Phase 5: Advanced Organization
  20. Phase 6: Specialty Search
  21. Phase 7: Tabular Data & Spreadsheet Intelligence
  22. Phase 8: Context Intelligence
  23. Adversarial Review Record
Phase 0A: Per-Mode Context Scaling

Replace the single max_exchanges=5 / 600-char truncation with a per-mode strategy. Current usage is only 1.6–8.2% of the 200K context window.

| Mode | max_exchanges | answer_truncation | Token Budget |
|---|---|---|---|
| Quick | 5 | 800 chars | ~1,000 tokens |
| Deep+Summary | 10 | 2,000 chars | ~5,000 tokens |
| Deep+Full | 20 | 4,000 chars | ~10,000 tokens |
| Research | 30 | No truncation | ~15,000 tokens |
  • Refactor get_thread_context() to accept max_exchanges + max_answer_chars params (s204)
  • Route handler passes mode-appropriate limits from cfg (s204)
  • Add input token logging: synthesis + expand_query (s204)
  • Quick mode query unchanged (<600 extra chars, preserves <3s target)
  • Per-search metrics logging -- rolling 20/mode to search_metrics.json (s204)
  • Rolling averages in tuning panel (muted orange, per applicable control) (s204)
  • Graceful limit handling -- amber inline warnings when limits hit (s204)
  • Opus model option + Basic/Advanced tier toggle + descriptions (s204)
  • Mode-colored labels in tuning panel matching inline badge colors (s204)
  • Modified-from-default indicator (green *) on changed values (s204)
  • Per-field tradeoff descriptions with click-to-expand (s204)
  • Averages expanded to cover all 38 config fields (s204)
  • group_context_chars metric added to all pipelines (s204)
Round 2 C-1: Token budget estimates measure conversation context ONLY. Full prompt = system (~200 tok) + user context (~750) + group context (~500-700) + search passages (up to ~10,000) + conversation context. Research synthesis could reach 30,000+ tokens ($0.10-0.50). Token+cost logging must ship before expanding limits.
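A back-of-envelope estimator for the full prompt budget C-1 describes. The component sizes mirror the note above, but the 4-chars-per-token ratio is a crude assumption, not a measured value:

```python
CHARS_PER_TOKEN = 4  # crude heuristic, not a measured ratio

FIXED_TOKENS = {
    "system": 200,         # system prompt (~200 tok, per the note above)
    "user_context": 750,   # user context
    "group_context": 600,  # midpoint of the ~500-700 range
}

def estimate_prompt_tokens(conversation_chars: int, passage_chars: int = 0) -> int:
    """Sum every prompt component, not just conversation context."""
    return (sum(FIXED_TOKENS.values())
            + passage_chars // CHARS_PER_TOKEN
            + conversation_chars // CHARS_PER_TOKEN)

# Research mode: ~60K chars of untruncated exchanges plus ~10K tokens of
# search passages lands well above any single per-mode budget in the table.
tokens = estimate_prompt_tokens(conversation_chars=60_000, passage_chars=40_000)
```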
Phase 0B: Conversation Context Format Upgrade

Upgrade from plain User:/Assistant: to XML-tagged exchanges with mode and citations.

  • XML-tagged exchanges with mode attribute (s204)
  • Include exchange mode tag (quick vs deep calibration) (s204)
  • Include citation URLs in <sources> block, top 5 per exchange (s204)
  • Mode badge on each response (color-coded top + bottom with token counts) (s204)
  • Improved thread title generation (few-shot prompt, answer-rejection guard) (s204)
  • Research mode button placeholder (disabled, Phase 3) (s204)
  • Color-coded mode selector buttons (s204)
  • Markdown renderer: tables, ### headings, --- dividers, tighter spacing (s204)
Example context block:

```xml
<exchange n="1" mode="quick">
<query>Best espresso machine under $500?</query>
<answer>The Breville Barista Express...</answer>
<sources><url>https://example.com/review</url></sources>
</exchange>
```
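A small formatter matching that shape. This is a sketch only; the real builder lives in the thread-context code, and the helper name is hypothetical:

```python
from html import escape

def format_exchange(n: int, mode: str, query: str, answer: str,
                    urls: list[str]) -> str:
    """Render one exchange in the XML-tagged format shown above."""
    sources = "".join(f"<url>{escape(u)}</url>" for u in urls[:5])  # top 5
    return (f'<exchange n="{n}" mode="{mode}">\n'
            f"<query>{escape(query)}</query>\n"
            f"<answer>{escape(answer)}</answer>\n"
            f"<sources>{sources}</sources>\n"
            f"</exchange>")

xml = format_exchange(1, "quick", "Best espresso machine under $500?",
                      "The Breville Barista Express...",
                      ["https://example.com/review"])
```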
Phase 0C: Unified Config Surface ("Sliders")

Centralize all tunable limits. Code defaults in git-tracked settings.yaml. Browser-written overrides in .gitignored config/graysearch_tuning.yaml. Runtime merges both, tuning takes precedence. Config snapshot pattern prevents mid-search TOCTOU races.
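The merge can be a plain deep-merge where tuning wins. A sketch under the assumption that both YAML files have already been parsed into dicts (the key names are an illustrative subset):

```python
import copy

DEFAULTS = {  # git-tracked settings.yaml values (illustrative subset)
    "graysearch": {"quick": {"max_exchanges": 5, "max_answer_chars": 800},
                   "research": {"max_rounds": 5}},
}

def merged_config(defaults: dict, tuning: dict) -> dict:
    """Deep-merge browser-written tuning overrides onto code defaults;
    tuning wins. Returns a fresh dict so a pipeline can snapshot it once
    and ignore later edits (the mid-search TOCTOU guard above)."""
    out = copy.deepcopy(defaults)
    def merge(dst: dict, src: dict) -> None:
        for k, v in src.items():
            if isinstance(v, dict) and isinstance(dst.get(k), dict):
                merge(dst[k], v)      # recurse into nested groups
            else:
                dst[k] = v            # override wins at the leaf
    merge(out, tuning)
    return out

tuning = {"graysearch": {"quick": {"max_exchanges": 8}}}
cfg = merged_config(DEFAULTS, tuning)  # snapshot passed downstream once
```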

| Group | Keys | Examples |
|---|---|---|
| Context Limits | 8 | max_exchanges, max_answer_chars per mode |
| Models | 6 | synthesis model per mode, planner, quality gate |
| Token/Cost | 7 | max_tokens per mode, cost ceiling, Brave rate limit |
| Research Agent | 4 | max_rounds, sub_questions, wall time, concurrency |
| Search Providers | 8 | timeouts, max results, max extract pages |
| Auto-Brief | 4 | exchanges/thread, truncation (normal vs research) |
| Thread Health | 1 | size warning threshold (KB) |
  • Add all 38 schema keys to settings.yaml under graysearch: (s204)
  • Create config/graysearch_tuning.yaml (.gitignored) for browser overrides (s204)
  • Update _get_settings(): merge defaults + tuning overrides
  • Remove dead Settings() no-arg call from _get_settings()
  • Build GRAYSEARCH_CONFIG_SCHEMA (38 keys, 8 groups) as single source of truth
  • Config snapshot: pipelines call _get_settings() once, pass cfg downstream (s204)
  • All cfg.get() fallbacks reference _default() from schema
  • Replace Path(__file__).parent.parent with env var (s204)
  • Hide config keys for unbuilt features until they ship (s204, R3-13)
Phase 0D: Config UI (Tuning Panel)

In-browser config editor on the GraySearch page. Gear icon opens settings panel. Both API endpoints behind web auth. Cost previews labeled as estimates with tooltip caveat.

| Config Type | Control | Example |
|---|---|---|
| Integer limits | Slider + stepper | 5 [---o-----] 30 |
| Cost ceilings | Stepper ($0.05) | $0.50 [-] [+] |
| Model selection | Dropdown | [claude-haiku-4-5 v] |
| 0 = unlimited | Toggle + stepper | [x] Limit: 4000 |
  • GET /api/search/config returns config + schema metadata (s204)
  • POST /api/search/config validates + merges overrides into tuning YAML (s204)
  • POST /api/search/config/reset clears all overrides (s204)
  • Config schema with type/range validation (s201)
  • Grouped sections, auto-generated from schema (s204)
  • Live cost/token impact preview (deferred — averages in tuning panel serve this need)
  • Instant apply -- no restart needed (s201)
  • "Reset all" button + modified values highlighted green (s204)
  • Tuning panel via hammer icon in header (s204)
  • JS extracted to static/js/search.js (no more {{}} escaping) (s204)
  • PostToolUse hook for rendered JS validation on views/*.py (s204)
  • Pre-restart validation: scripts/validate_views.py (s204)
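The type/range validation the POST endpoint performs might look like the following; the schema slice and key names are invented for illustration, not copied from GRAYSEARCH_CONFIG_SCHEMA:

```python
# Hypothetical slice of the schema: each key carries type and range
# metadata used by both validation and the auto-generated UI controls.
SCHEMA = {
    "quick_max_exchanges": {"type": int, "min": 1, "max": 30, "default": 5},
    "research_cost_ceiling": {"type": float, "min": 0.05, "max": 5.00,
                              "default": 0.50},
    "synthesis_model": {"type": str,
                        "choices": ["claude-haiku-4-5", "claude-sonnet-4-5"],
                        "default": "claude-haiku-4-5"},
}

def validate(key: str, value):
    """Return (ok, error) for one override, mirroring what a config
    POST handler would check before merging into the tuning YAML."""
    spec = SCHEMA.get(key)
    if spec is None:
        return False, f"unknown key: {key}"
    if not isinstance(value, spec["type"]):
        return False, f"{key}: expected {spec['type'].__name__}"
    if "choices" in spec and value not in spec["choices"]:
        return False, f"{key}: not an allowed choice"
    if "min" in spec and not (spec["min"] <= value <= spec["max"]):
        return False, f"{key}: out of range"
    return True, None
```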
Phase 1A: Observability & Cleanup
  • Add log.info for model/mode in synthesize() (s204)
  • Per-search cost logging: input_tokens, output_tokens, model, cost (s204)
  • Fix Dict[tuple, Any] type annotation (s204)
  • Fix _REPO_ROOT = Path(__file__).parent.parent (s204)

Thread Health Monitoring

  • Log file size + exchange count on every save_exchange()
  • Primary: exchange count color dot (green <10, yellow 10-20, red >20) (s205)
  • Thread list shows exchange count indicator per thread (s205)
  • MCP get_stack_status includes thread health summary (deferred — operational tooling)
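The color-dot thresholds above map to a trivial function (a sketch; the real UI code may differ):

```python
def health_color(exchange_count: int) -> str:
    """Thread health dot: green <10, yellow 10-20, red >20 exchanges."""
    if exchange_count < 10:
        return "green"
    return "yellow" if exchange_count <= 20 else "red"
```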

Thread Archival

  • Archival runs inside save_exchange() (atomic, no race conditions)
  • After N exchanges (configurable, default 20), move older to archive
  • load_thread_full() for complete history (deferred — no threads near archive threshold)

Roadmap: Per-exchange storage (solution C) if archival proves insufficient.

Phase 1B: Export / Report Generation

Markdown Export (MVP)

  • GET /api/search/thread/{id}/export?format=md (s205)
  • Title as H1, exchanges as H2, citations as footnotes (s205)
  • Export button on thread bar + mobile share sheet (s205)
  • File named {title}_{date}.md (s205)
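A sketch of the Markdown renderer, assuming a simple thread dict (title, exchanges with query/answer/sources). To keep the example short, footnote entries are appended at the end rather than wired to inline markers:

```python
from datetime import date

def export_markdown(thread: dict) -> tuple[str, str]:
    """Render a thread as Markdown (title as H1, exchanges as H2,
    citations as footnotes) and build the {title}_{date}.md filename."""
    lines = [f"# {thread['title']}", ""]
    notes: list[str] = []
    for ex in thread["exchanges"]:
        lines += [f"## {ex['query']}", "", ex["answer"], ""]
        for url in ex.get("sources", []):
            notes.append(f"[^{len(notes) + 1}]: {url}")
    lines += notes
    safe_title = thread["title"].replace(" ", "_")
    return "\n".join(lines), f"{safe_title}_{date.today().isoformat()}.md"

md, fname = export_markdown({
    "title": "Espresso Research",
    "exchanges": [{"query": "Best machine under $500?",
                   "answer": "The Breville Barista Express...",
                   "sources": ["https://example.com/review"]}],
})
```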

HTML Export (stretch)

  • Same endpoint with format=html, print-friendly (deferred — Markdown covers the need)

Platform UX: Desktop: browser download. Mobile (Safari): navigator.share() with fallback.

Phase 1C: Search Progress Enhancement
  • During expanding: yield sub-queries as detail line (s205)
  • During reading: yield URLs, show unique domains (s205)
  • During searching: show providers ("Searching Brave + Exa...") (s205)
  • Elapsed time display (running timer, 500ms update) (s205)
Phase 1D: Search Result Previews
  • Preview cards: favicon + title + domain + date + snippet (2-line clamp) (s205)
  • Cards collapse to compact chips on synthesis start (s205)
  • Mobile: 44px min-height, vertical stack (s205)
Phase 1E: Extract JS to Static File

Prerequisite for Phase 2+. views/search.py held 1,113 lines of double-brace-escaped JS embedded in Python template strings.

  • Extract search JS into static/js/search.js (s204)
  • PostToolUse validation hook + validate_views.py (s204)
  • Extract remaining JS from willy.py, pages.py, dashboard.py (deferred — S-14)
  • Migrate createScriptProcessor → AudioWorkletNode (deferred — Phase 2 prerequisite)
Phase 2: Voice Output (TTS)

Complete the voice loop. SSE with base64 audio chunks (proven tunnel-compatible).

2A. TTS Provider

  • Evaluate: Deepgram Aura, ElevenLabs, OpenAI TTS
  • Criteria: <500ms TTFB, natural voice, <$0.01/search
  • Build lib/tts.py with streaming audio

2B. Streaming Pipeline

  • SSE audio_chunk events (base64, OGG/Opus, 200-500ms)
  • Web Audio API playback with decode + queue
  • Tap/click interrupt + tunnel latency test
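The SSE framing for audio can be sketched as an async generator. Text-only frames with base64 payloads are what survived the tunnel where WebSockets failed (Round 1 C-3); the fake TTS source and chunk contents below are placeholders:

```python
import asyncio
import base64
import json

async def audio_chunk_events(chunks):
    """Wrap raw OGG/Opus chunks as SSE `audio_chunk` events with
    base64 payloads, ready to write to an SSE response."""
    async for chunk in chunks:
        payload = json.dumps({"audio": base64.b64encode(chunk).decode("ascii")})
        yield f"event: audio_chunk\ndata: {payload}\n\n"

async def _demo():
    async def fake_tts():  # stand-in for the streaming TTS provider
        for c in (b"OggS...frame1", b"OggS...frame2"):  # ~200-500ms each
            yield c
    return [e async for e in audio_chunk_events(fake_tts())]

events = asyncio.run(_demo())
```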

2C. Voice Flow

  • "Voice mode" toggle + auto-listen
  • Audio ducking: configurable delay, Bluetooth auto-detect
  • Voice preferences in search_preferences.md

2D. Smart TTS

  • Skip citations/URLs, strip markdown
  • Long answers: TTS first 2-3 paragraphs, ask to continue
Phase 3A: Research Agent Architecture

Multi-step autonomous research via non-streaming wrappers over existing search functions.

User query → [Planner/Sonnet] → [Quality Gate/Haiku]
  → [Research Loop] → [Synthesizer] → [Structured Report]

Search Wrappers

  • search_and_summarize(): consumes async generator, returns dict (s205)

Cost Control (Round 2 C-2)

  • Shared Brave rate limiter (asyncio.Semaphore)
  • Per-research cost ceiling (default $0.50) (s205)
  • Hard cap: research_max_brave_calls (default 20) (s205)
  • Cost estimate shown before research starts
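One way to structure those controls; the names and defaults mirror the bullets above, but the ledger class itself is hypothetical:

```python
import asyncio

BRAVE_SEMAPHORE = asyncio.Semaphore(1)  # shared Brave rate limiter

class CostLedger:
    """Accumulates spend for one research run and enforces both the
    cost ceiling and the hard Brave-call cap (defaults from the plan)."""
    def __init__(self, ceiling: float = 0.50, max_brave_calls: int = 20):
        self.ceiling = ceiling
        self.max_brave_calls = max_brave_calls
        self.spent = 0.0
        self.brave_calls = 0

    def charge(self, cost: float, brave_call: bool = False) -> bool:
        """Record a cost; return False when the run must stop."""
        self.spent += cost
        self.brave_calls += brave_call
        return (self.spent <= self.ceiling
                and self.brave_calls <= self.max_brave_calls)
```

Each Brave request would be wrapped in `async with BRAVE_SEMAPHORE:` so concurrent sub-questions cannot exceed the provider's rate limit.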

Lifecycle (Round 2 R-4)

  • Cancellation flag via asyncio.Event (s206)
  • Concurrent limit config: research_concurrent_limit (s205)
  • Cancellation on tab close (beforeunload)
  • Persist partial findings to disk

Agent Loop

  • SSE streaming (non-blocking via async generator) (s205)
  • Sonnet planner + Haiku quality gate (s205)
  • Max 5 rounds, 5 min wall time, 3-8 sub-questions (s205)
  • Per-sub-question mode selection (quick vs deep_summary) (s205)
  • Structured scratchpad per sub-question (s205)
Phase 3B: Progress Streaming
  • SSE research_progress event (step, total, sub_question, status) (s205)
  • Vertical timeline with status indicators (pending/spinner/check/fail/skipped) (s206)
  • Running timer + step counter in panel header (s206)
  • "Stop and summarize" button + POST /api/search/research/cancel (s206)
  • "Also consider..." redirect input (mid-research constraint injection)
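The research_progress frame is ordinary SSE; the field names follow the bullet above, while the helper itself is a sketch:

```python
import json

def research_progress(step: int, total: int, sub_question: str,
                      status: str) -> str:
    """One SSE `research_progress` frame driving the timeline UI.
    status: pending | spinner | check | fail | skipped."""
    data = json.dumps({"step": step, "total": total,
                       "sub_question": sub_question, "status": status})
    return f"event: research_progress\ndata: {data}\n\n"

frame = research_progress(2, 5, "What drives espresso machine prices?",
                          "spinner")
```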
Phase 3C: Report Generation
  • Sonnet final synthesis from all findings (s205)
  • Structured report: Summary, Findings, Open Questions, Sources (s205)
  • Saved as thread with mode: "research" (s205)
  • Auto-export to group directory
  • Brief weighting config: 4 exchanges/thread, 800-char truncation (s205)

Roadmap: Separate brief section (C) after evaluating real output.

Phase 3D: Research UI Integration
  • Fourth mode pill: "Research" (green, #10b981) (s205)
  • First-use explainer via localStorage (s206)
  • Full-width report card (.gs-report, 95% width, green border) (s206)
  • "Dig deeper" button on subsection headings (switches to Deep+Full) (s206)
Phase 4: Visual & Rich Content

4A. Image Search

  • Brave Image API + grid rendering
  • Image upload for reverse search (Claude vision)
  • Storage: 7-day retention, 10MB max

4B. Rich Result Cards

  • Weather, quick facts, comparison tables, timelines

4C. Location-Aware

  • Location preference + browser geolocation
  • "Near me" detection triggers location injection
Phase 5: Advanced Organization

5A. Cross-Thread Search

  • GET /api/search/corpus with scoring
  • Lazy-load index, JSON persist, survives restarts
  • "Search within this group" filter

5B. Research Notebook

  • Pin answers + named collections
  • Collection export + auto-suggest pins

5C. Drag-and-Drop Thread Organization

  • Desktop: HTML5 DnD with drag handles + drop targets on group headers
  • Drop targets: group headers glow on dragover, "Recent" = unassign
  • Thread reorder within groups (thread_order in groups.json)
  • Group reorder (group_order in groups.json)
  • Backend: POST /api/search/groups/{id}/reorder + POST /api/search/groups/reorder
  • Mobile: keep context menu flow (no DnD) — evaluate touch DnD as follow-on

Reuses existing api_search_group_assign for moves — no new backend for basic group assignment. Sort order APIs are new. Desktop-only initially; mobile DnD (long-press or polyfill) evaluated after desktop ships.

5D. Branched Conversations

  • "Branch here" on any exchange
  • Branch metadata + group inheritance
  • Branch does NOT bump brief counter
Phase 6: Specialty Search

6A. Product Research

  • Product query detection + review site enrichment
  • Comparison synthesis (pros/cons/price/verdict)

6B. Academic/Technical

  • Semantic Scholar API + citation scoring

6C. News Mode

  • Brave News API + recency-first sorting
  • Timeline rendering + "Follow this topic"

6D. Recurring Search

  • "Watch this" (max 5, Quick only, cost estimate)
  • 6-hour re-run + URL dedup + unread badges
Phase 7: Tabular Data & Spreadsheet Intelligence

Theme: Accept, analyze, transform, and export structured data across all modes. Dedicated Data mode for analysis-heavy workflows. Effort: 3–5 sessions.

7A. Planning & Requirements

  • Competitive analysis (ChatGPT, Gemini, Copilot tabular UX)
  • Catalog RG's actual tabular workflows from ChatGPT history
  • Formula scope ranking by usage
  • Adversarial review of spec

7B. Tabular Input (all modes)

  • Paste detection (TSV/CSV) with table preview
  • Context injection as fenced CSV block
  • File upload: CSV (client-side) + Excel (openpyxl)
  • Size limits (~5K rows / 500KB)
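Paste detection can lean on the stdlib csv.Sniffer, restricted to tab and comma; size limits would be enforced before parsing. A sketch:

```python
import csv
import io

def detect_table(pasted: str):
    """Guess whether pasted text is TSV/CSV and parse it for the
    preview. Returns rows, or None when the text isn't tabular."""
    sample = pasted[:2048]
    try:
        dialect = csv.Sniffer().sniff(sample, delimiters="\t,")
    except csv.Error:
        return None  # no consistent delimiter: treat as plain text
    rows = list(csv.reader(io.StringIO(pasted), dialect))
    # require at least 2 rows x 2 columns to call it a table
    return rows if len(rows) > 1 and len(rows[0]) > 1 else None

rows = detect_table("name\tprice\nGaggia\t449\nBreville\t699")
```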

7C. Tabular Output & Export (all modes)

  • CSV download button on each rendered table
  • Copy table as TSV to clipboard
  • Excel export (.xlsx via openpyxl)
  • Multi-table support + "Download all"

7D. Formula Generation

  • Synthesis prompt for formula requests (Excel vs Sheets toggle)
  • Monospace code blocks with copy button + explanation
  • All major categories: lookup, conditional, financial, array, text, date
  • Optional formula validation (verify output)

7E. Data Mode (dedicated)

  • "Data" mode pill — no web search, direct Claude analysis
  • Specialized analysis prompt (stats, insights, suggest visualizations)
  • Multi-turn analysis with table context carried in thread
  • Computed columns: generates formula AND fills values

7F. Future (not building yet)

Chart generation, Google Sheets integration, SQL-like queries, pivot table builder, data persistence across sessions.

Phase 8: Context Intelligence (Active + Passive Learning)

Theme: Make GraySearch progressively smarter about user preferences and research patterns. Active interviews + passive extraction + enhanced auto-briefs. Effort: 3–5 sessions.

8A. Planning & Requirements

  • Audit current context injection chain (user, project, thread, conversation)
  • Catalog preference types (source, format, domain, constraint, fact)
  • Review ChatGPT memory system (learn from their mistakes)
  • Adversarial review of spec

8B. Passive Preference Extraction (all modes)

  • Post-synthesis Haiku extraction: 0-3 new preferences per exchange
  • Category tagging: format, source, domain, constraint, fact
  • Dedup + merge against existing search_preferences.md
  • Staleness handling: timestamp entries, replace contradictions
  • Transparency: extracted prefs visible/editable in Preferences panel
  • Kill switch in tuning panel (on for Deep modes, off for Quick)
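A sketch of the dedup/merge step, assuming extracted preferences arrive as dicts with category/key/value fields; "contradiction" is simplified here to same category+key with a different value, which is looser than a real implementation would be:

```python
from datetime import datetime, timezone

def merge_preferences(existing: list[dict],
                      extracted: list[dict]) -> list[dict]:
    """Merge newly extracted preferences into the stored list.
    Exact duplicates are dropped; a new value for an existing
    category+key replaces the old entry and refreshes its timestamp."""
    now = datetime.now(timezone.utc).isoformat()
    merged = {(p["category"], p["key"]): p for p in existing}
    for p in extracted:
        slot = (p["category"], p["key"])
        old = merged.get(slot)
        if old and old["value"] == p["value"]:
            continue                          # exact dup: keep old timestamp
        merged[slot] = {**p, "updated": now}  # new or contradicting: replace
    return list(merged.values())

prefs = merge_preferences(
    [{"category": "format", "key": "tables",
      "value": "prefers tables", "updated": "2026-01-01"}],
    [{"category": "format", "key": "tables", "value": "prefers bullet lists"},
     {"category": "source", "key": "reddit", "value": "avoid reddit"}],
)
```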

8C. Active Context Interview (triggered)

  • Trigger: button in Project Notes + thread context menu + proactive suggestion
  • 3-phase flow: confirm existing → expand with probes → identify gaps
  • Questions displayed inline (conversation area, not modal)
  • Output: updated notes + extracted preferences + user review
  • Persist interview state for resume across sessions
  • Re-interview suggestion after 10+ new exchanges

8D. Thread-Level Context

  • Per-thread notes field (editable via context menu)
  • Thread auto-brief: full trajectory summary (not just recent)
  • Thread context injected into synthesis alongside project context

8E. Enhanced Auto-Brief

  • Dual-output: findings + preferences + open questions
  • Cross-project pattern extraction to global search_preferences.md

8F. Future (not building yet)

Preference confidence scoring, conflict detection, onboarding interview, preference analytics dashboard.

Adversarial Review Record

Round 1 (s200) — 15 findings

Initial confidence: MEDIUM. All addressed.

| ID | Severity | Finding | Resolution |
|---|---|---|---|
| C-1 | Critical | 1A is a non-issue | Reduced to logging + cleanup |
| C-2 | Critical | Can't reuse search generators | Non-streaming wrappers in 3A |
| C-3 | Critical | WebSocket TTS won't work through tunnel | Switched to SSE-first |
| R-1 | Risk | Research blocks uvicorn worker | Background create_task |
| R-2 | Risk | Inline JS at breaking point | Added 1E: JS extraction |
| R-3 | Risk | Recurring search unbounded cost | Cap 5 watches, Quick only |
| R-4 | Risk | Image upload lifecycle missing | Path, retention, max size defined |
| R-5 | Risk | Haiku planner poor quality | Sonnet + quality gate |
| G-1:4 | Gap | API degradation, Dict, index, ducking | All addressed in respective phases |
| Q-1:3 | Question | Phase 6 order, export UX, build order | All resolved |

Post-Round-1 confidence: HIGH

Round 2 (s200/s201) — 12 findings

All addressed in s201 review with RG.

| ID | Severity | Finding | Resolution |
|---|---|---|---|
| C-1 | Critical | Token budget undercounts | Full budget + cost logging first |
| C-2 | Critical | No research cost cap | Semaphore, ceiling, confirmation |
| R-1 | Risk | createScriptProcessor deprecated | Migrate in 1E |
| R-2 | Risk | SSE audio may buffer | 200-500ms chunks, tunnel test |
| R-3 | Risk | Thread files unbounded | Monitoring (A) + archival (B), C roadmapped |
| R-4 | Risk | Research no lifecycle | Registry, cancel, persist partial |
| G-1:4 | Gap | BT latency, model deprecation, JS risk, Path fix | All addressed |
| Q-1:2 | Question | Brief weighting, branch counter | A+B weighting, skip counter on branch |

Post-Round-2 confidence: HIGH

Round 3 (s201) — 14 findings

Post-0C/0D additions. All addressed in s201.

| ID | Severity | Finding | Resolution |
|---|---|---|---|
| R3-1 | Critical | settings.yaml git-tracked; browser writes = merge conflicts | Separate .gitignored tuning file |
| R3-2 | Critical | No config caching; TOCTOU race mid-search | Config snapshot pattern per pipeline |
| R3-3 | Risk | _get_settings() dead broken code | Remove dead Settings() call |
| R3-4 | Risk | Archival race with save_exchange() | Archival inside save (atomic) |
| R3-5 | Risk | Defaults scattered across code + schema | Schema dict as single source of truth |
| R3-6 | Risk | Cost preview impossible to compute accurately | Label as estimates with caveat tooltip |
| R3-7 | Gap | Config API needs auth | Behind web auth middleware |
| R3-8 | Gap | File KB poor proxy for context usage | Primary metric: exchange count |
| R3-9 | Gap | Cost ceiling slider unbounded | Schema max=$5.00 |
| R3-10 | Gap | A+B brief weighting undefined | Defined inline (4 exch, 800 char) |
| R3-11 | Gap | AudioWorklet migration underscoped | Worklet file + MIME + extra time noted |
| R3-12 | Question | Config change hits in-flight search | Covered by R3-2 snapshot |
| R3-13 | Question | Model keys for unbuilt features confusing | Hide until feature ships |
| R3-14 | Question | Effort unchanged after Phase 0 doubled | Revised: 19-27 sessions total |

Post-Round-3 confidence: HIGH