Architecture
How CodeWiki MCP works under the hood — Playwright-everywhere design.
Architecture Diagram
Why Playwright?
Google CodeWiki is an Angular SPA (<sdlc-agents-root>) — a plain HTTP GET returns an empty body. All page content is rendered client-side, so every tool uses Playwright headless Chromium to get actual content.
Browser Singleton
A persistent background event loop runs in a daemon thread. browser.py provides:
- _get_browser(): lazily launches a shared Chromium instance
- run_in_browser_loop(coro): submits async work to the persistent loop from any sync context
- fetch_rendered_html(url): navigates, waits for SPA content markers, returns the rendered HTML
All 5 tools share the same browser instance. Pages are cached in a TTLCache (5 min by default) so repeated requests skip the render entirely.
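As a rough illustration of how a time-based cache lets repeated requests skip rendering, here is a stdlib-only sketch. It is not the TTLCache the project uses; `SimpleTTLCache`, `fetch_with_cache`, and the injectable `clock` and `render` parameters are hypothetical names chosen for the example.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class SimpleTTLCache:
    """Stores values with an expiry time; expired entries are dropped on read."""

    def __init__(self, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # stale: force a fresh render
            return None
        return value

    def set(self, key: str, value: str) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)


def fetch_with_cache(url: str, cache: SimpleTTLCache,
                     render: Callable[[str], str]) -> str:
    """Return cached HTML when fresh, otherwise render once and store it."""
    cached = cache.get(url)
    if cached is not None:
        return cached
    html = render(url)
    cache.set(url, html)
    return html
```

Passing the clock in as a parameter keeps the expiry logic testable without sleeping.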
SPA-Aware Parser
CodeWiki uses custom Angular elements instead of standard HTML:
- <body-content-section>: one per wiki section
- <documentation-markdown>: rendered markdown inside each section
- <chat> → <new-message-form> → <textarea data-test-id="chat-input">: the Gemini chat
parser.py implements a dual-strategy section extractor:
- CodeWiki SPA: looks for <body-content-section> + <documentation-markdown> elements
- Standard HTML fallback: scans h1-h6 headings for non-CodeWiki pages
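The two strategies above can be sketched as a single pass that prefers SPA sections and falls back to headings. The project uses BeautifulSoup with lxml; this example swaps in the standard library's `html.parser` so it runs without extra dependencies, and `extract_sections` is a hypothetical name, not the function in parser.py.

```python
from html.parser import HTMLParser
from typing import List

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class _SectionCollector(HTMLParser):
    """Collects text from SPA sections and from standard headings in one pass."""

    def __init__(self) -> None:
        super().__init__()
        self.spa_sections: List[str] = []
        self.headings: List[str] = []
        self._in_section = False
        self._in_markdown = False
        self._in_heading = False
        self._buffer: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "body-content-section":
            self._in_section = True
        elif tag == "documentation-markdown" and self._in_section:
            self._in_markdown = True
            self._buffer = []
        elif tag in HEADINGS and not self._in_section:
            self._in_heading = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "documentation-markdown" and self._in_markdown:
            self._in_markdown = False
            self.spa_sections.append(" ".join(self._buffer))
        elif tag == "body-content-section":
            self._in_section = False
        elif tag in HEADINGS and self._in_heading:
            self._in_heading = False
            self.headings.append(" ".join(self._buffer))

    def handle_data(self, data):
        text = data.strip()
        if text and (self._in_markdown or self._in_heading):
            self._buffer.append(text)


def extract_sections(html: str) -> List[str]:
    """Strategy 1: SPA sections. Strategy 2: plain headings when none are found."""
    collector = _SectionCollector()
    collector.feed(html)
    collector.close()
    return collector.spa_sections or collector.headings
```

Headings nested inside a <documentation-markdown> element are treated as part of that section's text, so they do not leak into the fallback list.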
Structured Diagram Extraction
CodeWiki renders diagrams as <code-documentation-diagram-inline> elements containing base64-encoded SVGs. The parser detects three types:
1. CodeWiki SPA Diagrams
Decodes data:image/svg+xml;base64,... from <image class="image-diagram"> elements, then parses the Graphviz SVG structure into nodes and edges.
2. Mermaid Blocks
Captures raw Mermaid source from <code class="mermaid"> and <div class="mermaid"> elements.
3. Fallback SVGs/Images
Bare <svg> with <title> or <img> matching diagram patterns.
Graphviz SVG Parsing
For Graphviz SVGs, the parser extracts structured graph data:
- Nodes: <g class="node"> groups → {id, label}
- Edges: <g class="edge"> groups → {from, to, label}
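A minimal sketch of the decode-and-parse step, using only the standard library. It assumes the usual Graphviz SVG layout, where each node or edge group carries a <title> (the node name, or "a->b" for an edge) plus <text> labels. Using the title as the node id, and the function names themselves, are assumptions rather than the parser's confirmed behaviour.

```python
import base64
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

SVG_NS = "{http://www.w3.org/2000/svg}"
DATA_PREFIX = "data:image/svg+xml;base64,"


def decode_svg_data_uri(uri: str) -> str:
    """Turn a base64 SVG data URI into SVG source text."""
    if not uri.startswith(DATA_PREFIX):
        raise ValueError("not a base64 SVG data URI")
    return base64.b64decode(uri[len(DATA_PREFIX):]).decode("utf-8")


def _label(group: ET.Element) -> str:
    """Join all non-empty <text> children of a group into one label."""
    parts = [t.text.strip() for t in group.iter(SVG_NS + "text")
             if t.text and t.text.strip()]
    return " ".join(parts)


def parse_graphviz_svg(svg: str) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Extract nodes and edges from Graphviz-style SVG groups."""
    root = ET.fromstring(svg)
    nodes: List[Dict[str, str]] = []
    edges: List[Dict[str, str]] = []
    for group in root.iter(SVG_NS + "g"):
        title = (group.findtext(SVG_NS + "title") or "").strip()
        kind = group.get("class")
        if kind == "node":
            nodes.append({"id": title, "label": _label(group) or title})
        elif kind == "edge":
            # Graphviz titles use "->" for directed and "--" for undirected edges.
            for arrow in ("->", "--"):
                if arrow in title:
                    source, target = title.split(arrow, 1)
                    edges.append({"from": source, "to": target,
                                  "label": _label(group)})
                    break
    return nodes, edges
```

Calling `parse_graphviz_svg(decode_svg_data_uri(uri))` on an image's data URI yields the two lists that a diagram summary can be built from.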
Diagram summaries (entities + relationships) are placed at the top of tool output so they remain visible even when responses are truncated.
Session Pool (v1.0.2+)
session_pool.py maintains an LRU pool of warm browser contexts for the codewiki_search_wiki tool. Instead of creating a fresh Playwright context for every chat query, the pool keeps recently-used contexts alive with their pages already loaded.
- _get_or_create(url): returns a warm session or creates a new one; navigates to the page and opens the chat panel
- _release(url, broken): returns a session to the pool (or destroys it if marked broken)
- Pool size is configurable via CODEWIKI_SESSION_POOL_SIZE (default: 10)
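The LRU behaviour of the pool can be sketched synchronously with an `OrderedDict`. This is an illustration of the eviction idea only: the real functions are async and manage Playwright contexts, and the class name and the `create` and `destroy` callbacks here are hypothetical.

```python
import os
from collections import OrderedDict
from typing import Callable

# The environment variable name and default come from the text above.
POOL_SIZE = int(os.environ.get("CODEWIKI_SESSION_POOL_SIZE", "10"))


class LRUSessionPool:
    """Keeps at most max_size sessions, evicting the least recently used."""

    def __init__(self, create: Callable[[str], object],
                 destroy: Callable[[object], None],
                 max_size: int = POOL_SIZE) -> None:
        self._create = create
        self._destroy = destroy
        self._max_size = max_size
        self._sessions: "OrderedDict[str, object]" = OrderedDict()

    def get_or_create(self, url: str) -> object:
        if url in self._sessions:
            self._sessions.move_to_end(url)  # mark as most recently used
            return self._sessions[url]
        session = self._create(url)
        self._sessions[url] = session
        while len(self._sessions) > self._max_size:
            _, evicted = self._sessions.popitem(last=False)  # oldest first
            self._destroy(evicted)
        return session

    def release(self, url: str, broken: bool = False) -> None:
        """Keep healthy sessions warm; tear down the ones marked broken."""
        if broken and url in self._sessions:
            self._destroy(self._sessions.pop(url))
```

A healthy release is a no-op in this sketch because the session simply stays in the pool for the next query against the same URL.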
Important: The async _get_or_create / _release functions must be called directly from coroutines running on the browser event loop. The sync wrappers (get_or_create_session / release_session) are provided for non-loop callers only — calling them from the loop causes a deadlock.
Browser Stealth Layer (v1.0.4)
stealth.py injects anti-bot-detection patches into every Playwright context:
- JS property overrides: patches navigator.webdriver, chrome.runtime, plugins array, languages, WebGL vendor/renderer, permissions query, and outerWidth/outerHeight
- Human-like input: human_type() types character-by-character with 35–120 ms jitter and thinking pauses; human_click() moves the mouse to the element before clicking
- Stealth context options: viewport jitter, locale, timezone, Sec-CH-UA client hints, device scale factor
- Launch args: --disable-blink-features=AutomationControlled, --disable-infobars
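To make the typing jitter concrete, the sketch below generates per-character delays in the 35–120 ms range with occasional longer pauses. The function name, the pause probability, and the pause length are assumptions for illustration; only the jitter range and the launch arguments come from the text above.

```python
import random
from typing import List, Tuple

# Launch arguments as listed above.
LAUNCH_ARGS = [
    "--disable-blink-features=AutomationControlled",
    "--disable-infobars",
]


def typing_delays_ms(text: str, rng: random.Random,
                     min_ms: float = 35.0, max_ms: float = 120.0,
                     pause_chance: float = 0.05,
                     pause_ms: Tuple[float, float] = (300.0, 800.0)) -> List[float]:
    """Return one delay per character: base jitter plus an occasional pause."""
    delays: List[float] = []
    for _ in text:
        delay = rng.uniform(min_ms, max_ms)
        if rng.random() < pause_chance:
            # Hypothetical "thinking pause"; the real durations are not documented here.
            delay += rng.uniform(*pause_ms)
        delays.append(delay)
    return delays
```

A typing helper could then sleep for each delay between key presses; seeding the `random.Random` instance makes the sequence reproducible in tests.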
Anti-Loop Safety (v1.0.3)
When an AI agent calls tools in a loop, these mechanisms prevent runaway behaviour:
- In-flight deduplication (dedup.py): concurrent identical tool calls are collapsed into a single execution
- Rate limiting (rate_limit.py): sliding-window limiter (default: 10 calls per 60 s per repo) returns a clear RATE_LIMITED error
- Content hash + idempotency key: every response carries a SHA-256 content_hash and idempotency_key so agents can detect duplicate results
Keyword Resolver (v1.2.0+)
resolver.py enables bare keyword input — users can type just "react" instead of facebook/react. The resolution pipeline:
- CodeWiki search scraping: headless Playwright fetches search results from codewiki.google
- GitHub API fallback (v1.3.0): when CodeWiki returns zero hits, queries api.github.com/search/repositories to handle typos and lesser-known repos
- MCP Elicitation: if multiple repos match and a ctx: Context parameter is present, the server uses the MCP Elicitation protocol to ask the user to pick
- Heuristic top-pick: when elicitation is unavailable, the highest-starred result is selected automatically
Resolution results are cached and a resolution note is prepended to every tool response.
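The fallback order of the pipeline can be sketched with plain callables standing in for the network steps. `RepoHit`, `resolve_keyword`, and the three callback parameters are hypothetical names, and treating any input containing a slash as already resolved is an assumption about how owner/repo input is recognised.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class RepoHit:
    full_name: str  # e.g. "owner/repo"
    stars: int


SearchFn = Callable[[str], List[RepoHit]]
PickFn = Callable[[List[RepoHit]], RepoHit]


def resolve_keyword(keyword: str,
                    codewiki_search: SearchFn,
                    github_search: SearchFn,
                    elicit: Optional[PickFn] = None) -> Optional[str]:
    """Resolve a bare keyword to owner/repo, trying each stage in order."""
    if "/" in keyword:
        return keyword  # assumed to be owner/repo already
    # Stage 1: CodeWiki search; stage 2: GitHub fallback only on zero hits.
    hits = codewiki_search(keyword) or github_search(keyword)
    if not hits:
        return None
    if len(hits) == 1:
        return hits[0].full_name
    # Stage 3: ask the user when an elicitation channel is available.
    if elicit is not None:
        return elicit(hits).full_name
    # Stage 4: otherwise pick the highest-starred candidate.
    return max(hits, key=lambda hit: hit.stars).full_name
```

Injecting the search and elicitation steps as parameters keeps the ordering logic testable without a browser or network access.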
Key Components
| Component | Role |
|---|---|
| Playwright + shared browser | All page rendering via headless Chromium |
| Session pool | LRU pool of warm browser contexts for search |
| Stealth layer | Anti-bot-detection patches & human-like input helpers |
| TTLCache (multi-layer) | Rendered pages (5 min), search results (2 min), parsed objects, topic lists (30 min) |
| Rate limiter + dedup | Per-repo sliding window & in-flight call deduplication |
| BeautifulSoup + lxml | Fast HTML parsing with section extraction, TOC, and diagram detection |
| Pydantic schemas | Validate all inputs before processing |
| Keyword resolver | Bare keyword → owner/repo via CodeWiki search, GitHub API fallback, MCP Elicitation |
| Structured responses | JSON envelope with metadata, SHA-256 content hash, idempotency key |
| Modular tools | Each tool in its own module, registered via register_all_tools() |
| Signal handlers | SIGINT/SIGTERM for clean shutdown with browser cleanup |
Project Structure
codewiki_mcp/
├── __init__.py # Package init + version
├── __main__.py # python -m entry point
├── browser.py # Shared Playwright browser singleton + persistent event loop
├── cache.py # TTLCache for rendered pages, search results, parsed objects, topics
├── config.py # Env-var-driven configuration + SPA selectors
├── dedup.py # In-flight call deduplication
├── parser.py # Playwright renderer + BeautifulSoup section parser
├── rate_limit.py # Per-repo sliding-window rate limiter
├── resolver.py # Bare keyword → owner/repo resolution (CodeWiki + GitHub API + Elicitation)
├── server.py # MCP server setup + CLI
├── session_pool.py # LRU pool of warm browser contexts for search
├── stealth.py # Anti-bot-detection JS patches + human-like input helpers
├── types.py # Pydantic schemas + response models
└── tools/
    ├── __init__.py # Tool registration
    ├── _helpers.py # Shared URL construction + error handling
    ├── contents.py # codewiki_read_contents (paginated)
    ├── request_indexing.py # codewiki_request_indexing (Playwright form submission)
    ├── search.py # codewiki_search_wiki (Playwright chat interaction)
    ├── structure.py # codewiki_read_structure
    └── topics.py # codewiki_list_topics (with previews)
tests/
├── conftest.py # Shared fixtures + sample data
├── test_cache.py # Cache layer tests
├── test_config.py # Configuration tests
├── test_helpers.py # Helper functions (URL building, truncation, pre-resolve)
├── test_keyword_integration.py # Tool-level keyword resolution integration
├── test_parser.py # Parser + HTML extraction tests
├── test_request_indexing.py # Indexing tool elicitation + confirmation flow
├── test_resolver.py # Keyword resolver pipeline tests
├── test_session_pool.py # Session pool lifecycle + eviction tests
├── test_stealth.py # Stealth layer tests
├── test_tools.py # Server, tools & integration tests
├── test_types.py # Schema, response, dedup, rate-limit, bare keyword & SHA-256 tests
└── diagnose_search.py # Standalone headless chat diagnostic script
Dockerfile # Docker deployment
Running Tests
pip install -e ".[test]"
pytest tests/ -v
246 tests covering cache, config, parser, stealth, tools, types, dedup, rate limiting, idempotency, keyword resolution, helpers, request-indexing elicitation, and keyword integration.