GraySearch Feature Expansion
Competitive gap analysis → 7-phase build plan · 3x adversarial-tested
Executive Summary
Build Progress: Phases 0+1 complete, Phase 3 started (s205)
| Phase | Theme | Effort | Status |
| --- | --- | --- | --- |
| 0 | Context + Config Surface | 2–3 sessions | Complete (s204) |
| 1 | Polish & Quick Wins + JS Extract | 2–3 sessions | Complete (s205) |
| 2 | Voice Output (TTS) | 2–3 sessions | Not Started |
| 3 | Agentic Deep Research | 4–6 sessions | In Progress (s205) |
| 4 | Visual & Rich Content | 2–3 sessions | Not Started |
| 5 | Advanced Organization | 3–5 sessions | Not Started |
| 6 | Specialty Search | 1–2 each | Not Started |
| 7 | Tabular Data & Spreadsheets | 3–5 sessions | Not Started |
| 8 | Context Intelligence | 3–5 sessions | Not Started |
Build order: Phase 0 → 1 → 3 → 2 → 4 → 5 → 6. Phase 3 comes before Phase 2 because the research agent closes the highest-value competitive gap.
1. Don't chase parity for parity's sake. Only build features that serve the actual research workflow.
2. Preserve the speed advantage. Quick mode must stay under 3s.
3. Build on what's unique. Group context, personalization, multi-provider diversity are moats.
4. Incremental value delivery. Every phase ships something usable.
Multi-Provider Search Architecture
GraySearch sends every query to three independent search providers simultaneously, merges and deduplicates results, then re-ranks the unified pool. This is the same multi-retrieval pattern used by Perplexity, Google AI Mode, and ChatGPT search.
| Provider | Strength | Index | Latency |
| --- | --- | --- | --- |
| Brave | Lowest latency, strong keyword precision, independent 30B+ page index | Own | ~670ms |
| Exa | Semantic understanding, spam filtering, high-signal authoritative content | Own | ~2s |
| Parallel | Strong accuracy-to-cost ratio, independent ranking perspective | Own | ~5-14s |
Pipeline
```
User Query
  │  async fan-out
  ├─ Brave     ~670ms · keyword-strong
  ├─ Exa       ~2s    · semantic search
  └─ Parallel  ~5-14s · independent rank
  │  merge + dedup
Ranked Results → LLM Synthesis → Answer + Citations
```
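A minimal sketch of this fan-out and merge, assuming hypothetical provider callables that each return a list of `{"url": ...}` result dicts (not the actual lib/search code):

```python
import asyncio

async def fan_out(query, providers, timeout=6.0):
    """Query all providers concurrently; a slow or failing provider is
    capped by its timeout instead of stalling the whole search."""
    async def guarded(provider):
        try:
            return await asyncio.wait_for(provider(query), timeout)
        except Exception:  # timeout or provider error: degrade gracefully
            return []
    pools = await asyncio.gather(*(guarded(p) for p in providers))
    merged = {}
    for pool in pools:
        for r in pool:
            hit = merged.setdefault(r["url"], {**r, "votes": 0})
            hit["votes"] += 1  # cross-provider overlap = implicit relevance vote
    return sorted(merged.values(), key=lambda r: -r["votes"])
```

The `votes` field captures the "quality signal" row below: a URL returned by multiple independent indexes is more likely relevant.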
Why Three Providers?
| Benefit | Mechanism |
| --- | --- |
| Better recall | Three indexes catch what one misses |
| Better precision | Cross-provider agreement filters noise |
| Resilience | If one API goes down, the other two still work |
| Speed | Async fan-out bounds latency at the slowest provider's timeout, not the sum of all three |
| No vendor lock-in | Can swap providers without rewriting the system |
| Quality signal | Dedup overlap acts as an implicit relevance vote |
Benchmark data (2025-2026) shows the top four search APIs are statistically indistinguishable on quality when used individually. The winning strategy is to combine multiple providers and let the ensemble outperform any single one.
Already Built (Pre-Plan)
GraySearch core was built across s197–s200 before this expansion plan was created.
- Multi-provider search engine — Brave + Exa + Parallel, async fan-out, dedup + ranking
- Three search modes — Quick (Haiku, <3s), Deep+Summary (Haiku), Deep+Full (Sonnet)
- SSE streaming synthesis — Real-time answer streaming via Server-Sent Events
- Content extraction — Trafilatura full-text extraction, async thread pool
- Query expansion — Haiku-generated alternative queries for deep modes
- Thread system — Persistent conversation threads with auto-generated titles
- Group context — Per-group notes.md + auto-generated brief.md, injected into synthesis
- Voice input — Deepgram STT via WebSocket, mic button in search UI
- Search preferences — Per-user preferences panel (stored in search_preferences.md)
- Mobile-responsive UI — Full mobile layout with slide-in panels (history, prefs, group notes)
- Group auto-brief — Haiku-generated summaries, dirty flag + threshold-based regen
- User context personalization — Core memory + preferences injected into every synthesis
s201: Plan page, /plans index, inbox replay fix, config schema (38 keys), settings cleanup, thread archival + health logging, defaults unified.
s204: Phase 0 complete (0A-0D). Per-mode context scaling, XML-tagged exchanges, full config surface + tuning panel, metrics logging + rolling averages, limit warnings, mode badges, Opus option, table rendering, JS extraction to static file, validation hooks.
Phase 0A: Per-Mode Context Scaling
Replace the single max_exchanges=5 / 600-char truncation with a per-mode strategy. Current usage is only 1.6–8.2% of the 200K-token context window.
| Mode | max_exchanges | answer_truncation | Token Budget |
| --- | --- | --- | --- |
| Quick | 5 | 800 chars | ~1,000 tokens |
| Deep+Summary | 10 | 2,000 chars | ~5,000 tokens |
| Deep+Full | 20 | 4,000 chars | ~10,000 tokens |
| Research | 30 | No truncation | ~15,000 tokens |
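The per-mode limits above could be applied with a small helper along these lines (a sketch: the MODE_LIMITS values come from the table, but the function name and exchange shape are assumptions, not the real get_thread_context()):

```python
# Per-mode context limits, mirroring the table above.
MODE_LIMITS = {
    "quick":        {"max_exchanges": 5,  "max_answer_chars": 800},
    "deep_summary": {"max_exchanges": 10, "max_answer_chars": 2000},
    "deep_full":    {"max_exchanges": 20, "max_answer_chars": 4000},
    "research":     {"max_exchanges": 30, "max_answer_chars": None},  # no truncation
}

def thread_context_for_mode(exchanges, mode):
    """Keep only the most recent exchanges and cap answer length per mode."""
    limits = MODE_LIMITS[mode]
    recent = exchanges[-limits["max_exchanges"]:]
    cap = limits["max_answer_chars"]
    return [{"query": ex["query"],
             "answer": ex["answer"] if cap is None else ex["answer"][:cap]}
            for ex in recent]
```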
- Refactor get_thread_context() to accept max_exchanges + max_answer_chars params (s204)
- Route handler passes mode-appropriate limits from cfg (s204)
- Add input token logging: synthesis + expand_query (s204)
- Quick mode query unchanged (<600 extra chars, preserves <3s target)
- Per-search metrics logging -- rolling 20/mode to search_metrics.json (s204)
- Rolling averages in tuning panel (muted orange, per applicable control) (s204)
- Graceful limit handling -- amber inline warnings when limits hit (s204)
- Opus model option + Basic/Advanced tier toggle + descriptions (s204)
- Mode-colored labels in tuning panel matching inline badge colors (s204)
- Modified-from-default indicator (green *) on changed values (s204)
- Per-field tradeoff descriptions with click-to-expand (s204)
- Averages expanded to cover all 38 config fields (s204)
- group_context_chars metric added to all pipelines (s204)
Round 2 C-1: Token budget estimates measure conversation context ONLY. Full prompt = system (~200 tok) + user context (~750) + group context (~500-700) + search passages (up to ~10,000) + conversation context. Research synthesis could reach 30,000+ tokens ($0.10-0.50). Token+cost logging must ship before expanding limits.
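A back-of-envelope estimator using the component figures from the C-1 note (all numbers are the rough estimates quoted above; the per-Mtok rates are caller-supplied rather than hard-coded prices, which go stale):

```python
def estimate_prompt_tokens(system=200, user_ctx=750, group_ctx=600,
                           passages=10_000, conversation=1_000):
    # Components are the rough estimates from the C-1 note above.
    return system + user_ctx + group_ctx + passages + conversation

def estimate_cost_usd(input_tokens, output_tokens,
                      usd_in_per_mtok, usd_out_per_mtok):
    # Rates are caller-supplied; real per-search costs should be logged, not guessed.
    return (input_tokens * usd_in_per_mtok + output_tokens * usd_out_per_mtok) / 1e6
```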
Phase 0B: Conversation Context Format Upgrade
Upgrade from plain User:/Assistant: to XML-tagged exchanges with mode and citations.
- XML-tagged exchanges with mode attribute (s204)
- Include exchange mode tag (quick vs deep calibration) (s204)
- Include citation URLs in <sources> block, top 5 per exchange (s204)
- Mode badge on each response (color-coded top + bottom with token counts) (s204)
- Improved thread title generation (few-shot prompt, answer-rejection guard) (s204)
- Research mode button placeholder (disabled, Phase 3) (s204)
- Color-coded mode selector buttons (s204)
- Markdown renderer: tables, ### headings, --- dividers, tighter spacing (s204)
```xml
<exchange n="1" mode="quick">
  <query>Best espresso machine under $500?</query>
  <answer>The Breville Barista Express...</answer>
  <sources><url>https://example.com/review</url></sources>
</exchange>
```
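A sketch of a formatter that emits this shape (hypothetical helper; tag names follow the example above, and values are escaped so user text cannot break the tags):

```python
from html import escape

def format_exchange(n, mode, query, answer, source_urls, max_urls=5):
    """Render one exchange in the XML-tagged context format, keeping the
    top-N citation URLs per exchange."""
    srcs = "".join(f"<url>{escape(u)}</url>" for u in source_urls[:max_urls])
    return (f'<exchange n="{n}" mode="{mode}">\n'
            f"  <query>{escape(query)}</query>\n"
            f"  <answer>{escape(answer)}</answer>\n"
            f"  <sources>{srcs}</sources>\n"
            "</exchange>")
```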
Phase 0C: Unified Config Surface ("Sliders")
Centralize all tunable limits. Code defaults in git-tracked settings.yaml. Browser-written overrides in .gitignored config/graysearch_tuning.yaml. Runtime merges both, tuning takes precedence. Config snapshot pattern prevents mid-search TOCTOU races.
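The merge + snapshot behavior might look like this (a sketch with two placeholder keys, not the real 38-key schema):

```python
# Placeholder defaults; the real schema has 38 keys across 8 groups.
DEFAULTS = {"quick_max_exchanges": 5, "research_cost_ceiling": 0.50}

def merged_settings(overrides, defaults=DEFAULTS):
    """Git-tracked defaults merged with .gitignored tuning overrides;
    overrides win, unknown keys are ignored."""
    cfg = dict(defaults)
    cfg.update({k: v for k, v in overrides.items() if k in defaults})
    return cfg
```

The snapshot pattern then follows naturally: each pipeline calls `merged_settings()` exactly once at the start of a search and passes that one dict to every downstream stage, so a tuning write that lands mid-search cannot change limits under a running pipeline (the TOCTOU guard described above).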
| Group | Keys | Examples |
| --- | --- | --- |
| Context Limits | 8 | max_exchanges, max_answer_chars per mode |
| Models | 6 | synthesis model per mode, planner, quality gate |
| Token/Cost | 7 | max_tokens per mode, cost ceiling, Brave rate limit |
| Research Agent | 4 | max_rounds, sub_questions, wall time, concurrency |
| Search Providers | 8 | timeouts, max results, max extract pages |
| Auto-Brief | 4 | exchanges/thread, truncation (normal vs research) |
| Thread Health | 1 | size warning threshold (KB) |
- Add all 38 schema keys to settings.yaml under graysearch: (s204)
- Create config/graysearch_tuning.yaml (.gitignored) for browser overrides (s204)
- Update _get_settings(): merge defaults + tuning overrides
- Remove dead Settings() no-arg call from _get_settings()
- Build GRAYSEARCH_CONFIG_SCHEMA (38 keys, 8 groups) as single source of truth
- Config snapshot: pipelines call _get_settings() once, pass cfg downstream (s204)
- All cfg.get() fallbacks reference _default() from schema
- Replace Path(__file__).parent.parent with env var (s204)
- Hide config keys for unbuilt features until they ship (s204, R3-13)
Phase 0D: Config UI (Tuning Panel)
In-browser config editor on the GraySearch page. Gear icon opens settings panel. Both API endpoints behind web auth. Cost previews labeled as estimates with tooltip caveat.
| Config Type | Control | Example |
| --- | --- | --- |
| Integer limits | Slider + stepper | 5 [---o-----] 30 |
| Cost ceilings | Stepper ($0.05) | $0.50 [-] [+] |
| Model selection | Dropdown | [claude-haiku-4-5 v] |
| 0 = unlimited | Toggle + stepper | [x] Limit: 4000 |
- GET /api/search/config returns config + schema metadata (s204)
- POST /api/search/config validates + merges overrides into tuning YAML (s204)
- POST /api/search/config/reset clears all overrides (s204)
- Config schema with type/range validation (s201)
- Grouped sections, auto-generated from schema (s204)
- Live cost/token impact preview (deferred — averages in tuning panel serve this need)
- Instant apply -- no restart needed (s201)
- "Reset all" button + modified values highlighted green (s204)
- Tuning panel via hammer icon in header (s204)
- JS extracted to static/js/search.js (no more {{}} escaping) (s204)
- PostToolUse hook for rendered JS validation on views/*.py (s204)
- Pre-restart validation: scripts/validate_views.py (s204)
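The type/range validation behind POST /api/search/config could be sketched like this (a two-key stand-in for GRAYSEARCH_CONFIG_SCHEMA; key names and ranges here are illustrative):

```python
# Hypothetical slice of the schema; the real one has 38 keys in 8 groups.
SCHEMA = {
    "quick_max_exchanges":   {"type": int,   "min": 1,   "max": 30,  "default": 5},
    "research_cost_ceiling": {"type": float, "min": 0.0, "max": 5.0, "default": 0.50},
}

def validate_override(key, value):
    """Reject unknown keys, wrong types, and out-of-range values before
    merging an override into the tuning YAML."""
    spec = SCHEMA.get(key)
    if spec is None:
        raise KeyError(f"unknown config key: {key}")
    if not isinstance(value, spec["type"]):
        raise TypeError(f"{key} expects {spec['type'].__name__}")
    if not (spec["min"] <= value <= spec["max"]):
        raise ValueError(f"{key} out of range")
    return value
```

The $5.00 ceiling on the cost key mirrors finding R3-9 (cost ceiling slider must be bounded in the schema).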
Phase 1A: Observability & Cleanup
- Add log.info for model/mode in synthesize() (s204)
- Per-search cost logging: input_tokens, output_tokens, model, cost (s204)
- Fix Dict[tuple, Any] type annotation (s204)
- Fix _REPO_ROOT = Path(__file__).parent.parent (s204)
Thread Health Monitoring
- Log file size + exchange count on every save_exchange()
- Primary: exchange count color dot (green <10, yellow 10-20, red >20) (s205)
- Thread list shows exchange count indicator per thread (s205)
- MCP get_stack_status includes thread health summary (deferred — operational tooling)
Thread Archival
- Archival runs inside save_exchange() (atomic, no race conditions)
- After N exchanges (configurable, default 20), move older to archive
- load_thread_full() for complete history (deferred — no threads near archive threshold)
Roadmap: Per-exchange storage (solution C) if archival proves insufficient.
Phase 1B: Export / Report Generation
Markdown Export (MVP)
- GET /api/search/thread/{id}/export?format=md (s205)
- Title as H1, exchanges as H2, citations as footnotes (s205)
- Export button on thread bar + mobile share sheet (s205)
- File named {title}_{date}.md (s205)
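The H1/H2/footnote layout above can be sketched as (hypothetical helper; the exchange dict shape is an assumption):

```python
def export_thread_md(title, exchanges):
    """Thread → Markdown: title as H1, each exchange as H2, citation URLs
    collected as numbered footnotes at the end."""
    refs, parts = [], [f"# {title}"]
    for ex in exchanges:
        marks = ""
        for url in ex.get("sources", []):
            refs.append(url)
            marks += f"[^{len(refs)}]"
        parts.append(f"## {ex['query']}\n\n{ex['answer']}{marks}")
    parts.extend(f"[^{i}]: {url}" for i, url in enumerate(refs, 1))
    return "\n\n".join(parts)
```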
HTML Export (stretch)
- Same endpoint with format=html, print-friendly (deferred — Markdown covers the need)
Platform UX: Desktop: browser download. Mobile (Safari): navigator.share() with fallback.
Phase 1C: Search Progress Enhancement
- During expanding: yield sub-queries as detail line (s205)
- During reading: yield URLs, show unique domains (s205)
- During searching: show providers ("Searching Brave + Exa...") (s205)
- Elapsed time display (running timer, 500ms update) (s205)
Phase 1D: Search Result Previews
- Preview cards: favicon + title + domain + date + snippet (2-line clamp) (s205)
- Cards collapse to compact chips on synthesis start (s205)
- Mobile: 44px min-height, vertical stack (s205)
Phase 1E: Extract JS to Static File
Prerequisite for Phase 2+. views/search.py holds 1,113 lines of double-brace-escaped JS inside Python template strings.
- Extract search JS into static/js/search.js (s204)
- PostToolUse validation hook + validate_views.py (s204)
- Extract remaining JS from willy.py, pages.py, dashboard.py (deferred — S-14)
- Migrate createScriptProcessor → AudioWorkletNode (deferred — Phase 2 prerequisite)
Phase 2: Voice Output (TTS)
Complete the voice loop. SSE with base64 audio chunks (proven tunnel-compatible).
2A. TTS Provider
- Evaluate: Deepgram Aura, ElevenLabs, OpenAI TTS
- Criteria: <500ms TTFB, natural voice, <$0.01/search
- Build lib/tts.py with streaming audio
2B. Streaming Pipeline
- SSE audio_chunk events (base64, OGG/Opus, 200-500ms)
- Web Audio API playback with decode + queue
- Tap/click interrupt + tunnel latency test
2C. Voice Flow
- "Voice mode" toggle + auto-listen
- Audio ducking: configurable delay, Bluetooth auto-detect
- Voice preferences in search_preferences.md
2D. Smart TTS
- Skip citations/URLs, strip markdown
- Long answers: TTS first 2-3 paragraphs, ask to continue
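The SSE framing from 2B could be sketched as follows (hypothetical generator; the event name and JSON field are assumptions, while the base64-over-SSE approach is the tunnel-compatible one described above):

```python
import base64
import json

def sse_audio_events(opus_chunks):
    """Frame OGG/Opus chunks (~200-500ms each) as SSE `audio_chunk` events
    with base64 payloads; the browser decodes and queues them via the
    Web Audio API."""
    for chunk in opus_chunks:
        data = json.dumps({"audio": base64.b64encode(chunk).decode("ascii")})
        yield f"event: audio_chunk\ndata: {data}\n\n"
```

Base64 inflates audio by ~33%, which is the price paid for staying on plain SSE instead of a WebSocket that the tunnel may not pass (Round 1 C-3).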
Phase 3A: Research Agent Architecture
Multi-step autonomous research via non-streaming wrappers over existing search functions.
```
User query → [Planner/Sonnet] → [Quality Gate/Haiku]
           → [Research Loop] → [Synthesizer] → [Structured Report]
```
Search Wrappers
- search_and_summarize(): consumes async generator, returns dict (s205)
Cost Control (Round 2 C-2)
- Shared Brave rate limiter (asyncio.Semaphore)
- Per-research cost ceiling (default $0.50) (s205)
- Hard cap: research_max_brave_calls (default 20) (s205)
- Cost estimate shown before research starts
Lifecycle (Round 2 R-4)
- Cancellation flag via asyncio.Event (s206)
- Concurrent limit config: research_concurrent_limit (s205)
- Cancellation on tab close (beforeunload)
- Persist partial findings to disk
Agent Loop
- SSE streaming (non-blocking via async generator) (s205)
- Sonnet planner + Haiku quality gate (s205)
- Max 5 rounds, 5 min wall time, 3-8 sub-questions (s205)
- Per-sub-question mode selection (quick vs deep_summary) (s205)
- Structured scratchpad per sub-question (s205)
Phase 3B: Progress Streaming
- SSE research_progress event (step, total, sub_question, status) (s205)
- Vertical timeline with status indicators (pending/spinner/check/fail/skipped) (s206)
- Running timer + step counter in panel header (s206)
- "Stop and summarize" button + POST /api/search/research/cancel (s206)
- "Also consider..." redirect input (mid-research constraint injection)
Phase 3C: Report Generation
- Sonnet final synthesis from all findings (s205)
- Structured report: Summary, Findings, Open Questions, Sources (s205)
- Saved as thread with mode: "research" (s205)
- Auto-export to group directory
- Brief weighting config: 4 exchanges/thread, 800-char truncation (s205)
Roadmap: Separate brief section (C) after evaluating real output.
Phase 3D: Research UI Integration
- Fourth mode pill: "Research" (green, #10b981) (s205)
- First-use explainer via localStorage (s206)
- Full-width report card (.gs-report, 95% width, green border) (s206)
- "Dig deeper" button on subsection headings (switches to Deep+Full) (s206)
Phase 4: Visual & Rich Content
4A. Image Search
- Brave Image API + grid rendering
- Image upload for reverse search (Claude vision)
- Storage: 7-day retention, 10MB max
4B. Rich Result Cards
- Weather, quick facts, comparison tables, timelines
4C. Location-Aware
- Location preference + browser geolocation
- "Near me" detection triggers location injection
Phase 5: Advanced Organization
5A. Cross-Thread Search
- GET /api/search/corpus with scoring
- Lazy-load index, JSON persist, survives restarts
- "Search within this group" filter
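A minimal scoring sketch for the corpus endpoint (bag-of-words term frequency; the real index layout and scoring may differ):

```python
import re
from collections import Counter

def build_index(threads):
    """One bag-of-words Counter per thread; small enough to persist as
    JSON so the index survives restarts."""
    return {tid: Counter(re.findall(r"[a-z0-9]+", text.lower()))
            for tid, text in threads.items()}

def rank_threads(index, query):
    """Score each thread by summed term frequency of the query terms."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    scored = [(sum(bag[t] for t in terms), tid) for tid, bag in index.items()]
    return [tid for score, tid in sorted(scored, reverse=True) if score > 0]
```

A "search within this group" filter is then just restricting `index` to the thread IDs in that group before ranking.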
5B. Research Notebook
- Pin answers + named collections
- Collection export + auto-suggest pins
5C. Drag-and-Drop Thread Organization
- Desktop: HTML5 DnD with drag handles + drop targets on group headers
- Drop targets: group headers glow on dragover, "Recent" = unassign
- Thread reorder within groups (thread_order in groups.json)
- Group reorder (group_order in groups.json)
- Backend: POST /api/search/groups/{id}/reorder + POST /api/search/groups/reorder
- Mobile: keep context menu flow (no DnD) — evaluate touch DnD as follow-on
Reuses existing api_search_group_assign for moves — no new backend for basic group assignment. Sort order APIs are new. Desktop-only initially; mobile DnD (long-press or polyfill) evaluated after desktop ships.
5D. Branched Conversations
- "Branch here" on any exchange
- Branch metadata + group inheritance
- Branch does NOT bump brief counter
Phase 6: Specialty Search
6A. Product Research
- Product query detection + review site enrichment
- Comparison synthesis (pros/cons/price/verdict)
6B. Academic/Technical
- Semantic Scholar API + citation scoring
6C. News Mode
- Brave News API + recency-first sorting
- Timeline rendering + "Follow this topic"
6D. Recurring Search
- "Watch this" (max 5, Quick only, cost estimate)
- 6-hour re-run + URL dedup + unread badges
Phase 7: Tabular Data & Spreadsheet Intelligence
Theme: Accept, analyze, transform, and export structured data across all modes. Dedicated Data mode for analysis-heavy workflows. Effort: 3–5 sessions.
7A. Planning & Requirements
- Competitive analysis (ChatGPT, Gemini, Copilot tabular UX)
- Catalog RG's actual tabular workflows from ChatGPT history
- Formula scope ranking by usage
- Adversarial review of spec
7B. Tabular Input (all modes)
- Paste detection (TSV/CSV) with table preview
- Context injection as fenced CSV block
- File upload: CSV (client-side) + Excel (openpyxl)
- Size limits (~5K rows / 500KB)
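Paste detection with the stdlib csv.Sniffer could look like this (a sketch enforcing the ~5K row / 500KB limits from the bullet above; returns None for non-tabular text so it falls through to normal search):

```python
import csv
import io

def detect_table(pasted, max_rows=5_000, max_bytes=500_000):
    """Detect a TSV/CSV paste and return parsed rows, or None if the text
    doesn't look tabular."""
    if len(pasted.encode()) > max_bytes:
        raise ValueError("paste exceeds 500KB limit")
    try:
        dialect = csv.Sniffer().sniff(pasted[:2048], delimiters=",\t")
    except csv.Error:
        return None                     # not tabular; treat as plain text
    rows = list(csv.reader(io.StringIO(pasted), dialect))
    if len(rows) > max_rows:
        raise ValueError("too many rows")
    return rows
```

On a positive detection, the rows feed the table preview and get injected into context as a fenced CSV block.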
7C. Tabular Output & Export (all modes)
- CSV download button on each rendered table
- Copy table as TSV to clipboard
- Excel export (.xlsx via openpyxl)
- Multi-table support + "Download all"
7D. Formula Generation
- Synthesis prompt for formula requests (Excel vs Sheets toggle)
- Monospace code blocks with copy button + explanation
- All major categories: lookup, conditional, financial, array, text, date
- Optional formula validation (verify output)
7E. Data Mode (dedicated)
- "Data" mode pill — no web search, direct Claude analysis
- Specialized analysis prompt (stats, insights, suggest visualizations)
- Multi-turn analysis with table context carried in thread
- Computed columns: generates formula AND fills values
7F. Future (not building yet)
Chart generation, Google Sheets integration, SQL-like queries, pivot table builder, data persistence across sessions.
Phase 8: Context Intelligence (Active + Passive Learning)
Theme: Make GraySearch progressively smarter about user preferences and research patterns. Active interviews + passive extraction + enhanced auto-briefs. Effort: 3–5 sessions.
8A. Planning & Requirements
- Audit current context injection chain (user, project, thread, conversation)
- Catalog preference types (source, format, domain, constraint, fact)
- Review ChatGPT memory system (learn from their mistakes)
- Adversarial review of spec
8B. Passive Preference Extraction (all modes)
- Post-synthesis Haiku extraction: 0-3 new preferences per exchange
- Category tagging: format, source, domain, constraint, fact
- Dedup + merge against existing search_preferences.md
- Staleness handling: timestamp entries, replace contradictions
- Transparency: extracted prefs visible/editable in Preferences panel
- Kill switch in tuning panel (on for Deep modes, off for Quick)
8C. Active Context Interview (triggered)
- Trigger: button in Project Notes + thread context menu + proactive suggestion
- 3-phase flow: confirm existing → expand with probes → identify gaps
- Questions displayed inline (conversation area, not modal)
- Output: updated notes + extracted preferences + user review
- Persist interview state for resume across sessions
- Re-interview suggestion after 10+ new exchanges
8D. Thread-Level Context
- Per-thread notes field (editable via context menu)
- Thread auto-brief: full trajectory summary (not just recent)
- Thread context injected into synthesis alongside project context
8E. Enhanced Auto-Brief
- Dual-output: findings + preferences + open questions
- Cross-project pattern extraction to global search_preferences.md
8F. Future (not building yet)
Preference confidence scoring, conflict detection, onboarding interview, preference analytics dashboard.
Adversarial Review Record
Round 1 (s200) — 15 findings
Initial confidence: MEDIUM. All addressed.
| ID | Severity | Finding | Resolution |
| --- | --- | --- | --- |
| C-1 | Critical | 1A is a non-issue | Reduced to logging + cleanup |
| C-2 | Critical | Can't reuse search generators | Non-streaming wrappers in 3A |
| C-3 | Critical | WebSocket TTS won't work through tunnel | Switched to SSE-first |
| R-1 | Risk | Research blocks uvicorn worker | Background create_task |
| R-2 | Risk | Inline JS at breaking point | Added 1E: JS extraction |
| R-3 | Risk | Recurring search unbounded cost | Cap 5 watches, Quick only |
| R-4 | Risk | Image upload lifecycle missing | Path, retention, max size defined |
| R-5 | Risk | Haiku planner poor quality | Sonnet + quality gate |
| G-1:4 | Gap | API degradation, Dict, index, ducking | All addressed in respective phases |
| Q-1:3 | Question | Phase 6 order, export UX, build order | All resolved |
Post-Round-1 confidence: HIGH
Round 2 (s200/s201) — 12 findings
All addressed in s201 review with RG.
| ID | Severity | Finding | Resolution |
| --- | --- | --- | --- |
| C-1 | Critical | Token budget undercounts | Full budget + cost logging first |
| C-2 | Critical | No research cost cap | Semaphore, ceiling, confirmation |
| R-1 | Risk | createScriptProcessor deprecated | Migrate in 1E |
| R-2 | Risk | SSE audio may buffer | 200-500ms chunks, tunnel test |
| R-3 | Risk | Thread files unbounded | Monitoring (A) + archival (B), C roadmapped |
| R-4 | Risk | Research no lifecycle | Registry, cancel, persist partial |
| G-1:4 | Gap | BT latency, model deprecation, JS risk, Path fix | All addressed |
| Q-1:2 | Question | Brief weighting, branch counter | A+B weighting, skip counter on branch |
Post-Round-2 confidence: HIGH
Round 3 (s201) — 14 findings
Post-0C/0D additions. All addressed in s201.
| ID | Severity | Finding | Resolution |
| --- | --- | --- | --- |
| R3-1 | Critical | settings.yaml git-tracked; browser writes = merge conflicts | Separate .gitignored tuning file |
| R3-2 | Critical | No config caching; TOCTOU race mid-search | Config snapshot pattern per pipeline |
| R3-3 | Risk | _get_settings() dead broken code | Remove dead Settings() call |
| R3-4 | Risk | Archival race with save_exchange() | Archival inside save (atomic) |
| R3-5 | Risk | Defaults scattered across code + schema | Schema dict as single source of truth |
| R3-6 | Risk | Cost preview impossible to compute accurately | Label as estimates with caveat tooltip |
| R3-7 | Gap | Config API needs auth | Behind web auth middleware |
| R3-8 | Gap | File KB poor proxy for context usage | Primary metric: exchange count |
| R3-9 | Gap | Cost ceiling slider unbounded | Schema max=$5.00 |
| R3-10 | Gap | A+B brief weighting undefined | Defined inline (4 exch, 800 char) |
| R3-11 | Gap | AudioWorklet migration underscoped | Worklet file + MIME + extra time noted |
| R3-12 | Question | Config change hits in-flight search | Covered by R3-2 snapshot |
| R3-13 | Question | Model keys for unbuilt features confusing | Hide until feature ships |
| R3-14 | Question | Effort unchanged after Phase 0 doubled | Revised: 19-27 sessions total |
Post-Round-3 confidence: HIGH