Architecture

How CodeWiki MCP works under the hood: a Playwright-everywhere design.

Architecture Diagram

graph LR
    A[Editor Client] -->|requests| B[Tools Layer]
    B -->|bare keyword?| R[Keyword Resolver]
    R -->|owner/repo| B
    B -->|rate limit + dedup| H[Safety Layer]
    H -->|renders via| C[Browser Singleton]
    C --> D[Playwright Chromium]
    C --> I[Session Pool]
    I -->|stealth context| D
    D --> E[Parser]
    E --> B
    C --> F[TTL Cache]
    B --> G[JSON Response]

Why Playwright?

Google CodeWiki is an Angular SPA (<sdlc-agents-root>), so a plain HTTP GET returns an empty body. Because all page content is rendered client-side, every tool uses Playwright's headless Chromium to get the actual content.

Browser Singleton

A persistent background event loop runs in a daemon thread. browser.py provides:

All 5 tools share the same browser instance. Pages are cached in a TTLCache (5 min by default) so repeated requests skip the render entirely.
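The pattern of a persistent event loop in a daemon thread can be sketched with the standard library alone. This is a minimal illustration, not the contents of browser.py: the names run_on_browser_loop and fake_render are hypothetical, and fake_render stands in for a real Playwright render.

```python
import asyncio
import threading
from typing import Any, Coroutine

# Hypothetical module-level singleton: one loop, one daemon thread.
_loop = asyncio.new_event_loop()
_thread = threading.Thread(target=_loop.run_forever, name="browser-loop", daemon=True)
_thread.start()


def run_on_browser_loop(coro: Coroutine[Any, Any, Any], timeout: float = 30.0) -> Any:
    """Submit a coroutine to the background loop and block until it finishes."""
    if threading.current_thread() is _thread:
        # Blocking here would stall the very loop that has to run the coroutine.
        coro.close()
        raise RuntimeError("already on the browser loop; await the coroutine directly")
    future = asyncio.run_coroutine_threadsafe(coro, _loop)
    return future.result(timeout=timeout)


async def fake_render(url: str) -> str:
    """Stand-in for a Playwright render (assumption: the real one drives a page)."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"
```

A synchronous caller would use run_on_browser_loop(fake_render("https://example.com")), while code already running on the loop awaits fake_render directly.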

SPA-Aware Parser

CodeWiki uses custom Angular elements instead of standard HTML:

parser.py implements a dual-strategy section extractor:

  1. CodeWiki SPA — looks for <body-content-section> + <documentation-markdown> elements
  2. Standard HTML fallback — scans h1-h6 headings for non-CodeWiki pages
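The two strategies above can be sketched as a try-first, fall-back-second function. The project uses BeautifulSoup with lxml; this sketch swaps in the standard library's html.parser so it runs without dependencies, and the function names are hypothetical.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class _TextCollector(HTMLParser):
    """Collects the text found inside any element whose tag is in `targets`."""

    def __init__(self, targets):
        super().__init__()
        self.targets = targets
        self.depth = 0      # greater than 0 while inside a target element
        self.chunks = []    # one list of text fragments per target element

    def handle_starttag(self, tag, attrs):
        if tag in self.targets:
            if self.depth == 0:
                self.chunks.append([])
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.targets and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks[-1].append(data.strip())


def _collect(html, targets):
    parser = _TextCollector(targets)
    parser.feed(html)
    parser.close()
    return [" ".join(chunk) for chunk in parser.chunks if chunk]


def extract_sections(html):
    """Return (strategy, sections): SPA elements first, headings as fallback."""
    spa_sections = _collect(html, {"body-content-section"})
    if spa_sections:
        return "codewiki", spa_sections
    return "html", _collect(html, HEADINGS)
```

The fallback only runs when no <body-content-section> element is found, so non-CodeWiki pages still yield a heading outline.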

Structured Diagram Extraction

CodeWiki renders diagrams as <code-documentation-diagram-inline> elements containing base64-encoded SVGs. The parser detects three types:

1. CodeWiki SPA Diagrams

Decodes data:image/svg+xml;base64,... from <image class="image-diagram"> elements, then parses the Graphviz SVG structure into nodes and edges.

2. Mermaid Blocks

Captures raw Mermaid source from <code class="mermaid"> and <div class="mermaid"> elements.

3. Fallback SVGs/Images

Matches bare <svg> elements that contain a <title>, or <img> elements that match diagram patterns.

Graphviz SVG Parsing

For Graphviz SVGs, the parser extracts structured graph data:

Diagram summaries (entities + relationships) are placed at the top of tool output so they remain visible even when responses are truncated.
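As an illustration of the decode-then-parse step for Graphviz diagrams, the sketch below relies on the standard Graphviz SVG layout, where each node and edge is a <g> element with class "node" or "edge" and a <title> child (edge titles read "A->B"). The function name and return shape are assumptions, and only directed edges are handled.

```python
import base64
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"
PREFIX = "data:image/svg+xml;base64,"


def parse_graphviz_data_uri(data_uri):
    """Decode a base64 SVG data URI and list its Graphviz nodes and edges."""
    if not data_uri.startswith(PREFIX):
        raise ValueError("not a base64 SVG data URI")
    root = ET.fromstring(base64.b64decode(data_uri[len(PREFIX):]))

    nodes, edges = [], []
    for group in root.iter(f"{SVG_NS}g"):
        # Only a direct <title> child identifies the node or edge.
        title = group.findtext(f"{SVG_NS}title", default="").strip()
        kind = group.get("class")
        if kind == "node" and title:
            nodes.append(title)
        elif kind == "edge" and "->" in title:
            source, target = title.split("->", 1)
            edges.append((source.strip(), target.strip()))
    return {"nodes": nodes, "edges": edges}


SAMPLE_SVG = (
    '<svg xmlns="http://www.w3.org/2000/svg"><g class="graph">'
    '<g class="node"><title>Parser</title></g>'
    '<g class="node"><title>Cache</title></g>'
    '<g class="edge"><title>Parser&#45;&gt;Cache</title></g>'
    "</g></svg>"
)
SAMPLE_URI = PREFIX + base64.b64encode(SAMPLE_SVG.encode("utf-8")).decode("ascii")
```

Calling parse_graphviz_data_uri(SAMPLE_URI) yields the two node names and one (source, target) pair, which is the kind of entity and relationship summary that can be placed at the top of a response.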

Session Pool (v1.0.2+)

session_pool.py maintains an LRU pool of warm browser contexts for the codewiki_search_wiki tool. Instead of creating a fresh Playwright context for every chat query, the pool keeps recently-used contexts alive with their pages already loaded.

Important: The async _get_or_create / _release functions must be called directly from coroutines running on the browser event loop. The sync wrappers (get_or_create_session / release_session) are provided for non-loop callers only — calling them from the loop causes a deadlock.
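A minimal sketch of the LRU idea, assuming an OrderedDict keyed by some session identifier. The class name, the create and close callables, and the max_size parameter are all hypothetical; the real session_pool.py works with Playwright contexts and may differ.

```python
import asyncio
from collections import OrderedDict


class SessionPool:
    """Keeps at most `max_size` warm sessions, evicting the least recently used."""

    def __init__(self, create, close, max_size=3):
        self._create = create      # async callable: key -> session
        self._close = close        # async callable: session -> None
        self._max_size = max_size
        self._sessions = OrderedDict()

    async def get_or_create(self, key):
        if key in self._sessions:
            # Cache hit: mark as most recently used and reuse the warm session.
            self._sessions.move_to_end(key)
            return self._sessions[key]
        session = await self._create(key)
        self._sessions[key] = session
        while len(self._sessions) > self._max_size:
            # The oldest entry sits at the front of the OrderedDict.
            _, evicted = self._sessions.popitem(last=False)
            await self._close(evicted)
        return session
```

Because get_or_create is a coroutine, code on the browser event loop awaits it directly, which matches the note above about avoiding the sync wrappers there.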

Browser Stealth Layer (v1.0.4)

stealth.py injects anti-bot-detection patches into every Playwright context:

Anti-Loop Safety (v1.0.3)

When an AI agent calls tools in a loop, these mechanisms prevent runaway behaviour:
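One such mechanism, the per-repo sliding-window rate limiter listed under Key Components, might look roughly like the sketch below. The class name, the limits, and the injectable clock are assumptions made for illustration and testing.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allows at most `max_calls` per repo within any `window_seconds` span."""

    def __init__(self, max_calls, window_seconds, clock=time.monotonic):
        self._max_calls = max_calls
        self._window = window_seconds
        self._clock = clock                    # injectable for tests
        self._calls = defaultdict(deque)       # repo -> timestamps of recent calls

    def allow(self, repo):
        now = self._clock()
        calls = self._calls[repo]
        # Drop timestamps that have slid out of the window.
        while calls and now - calls[0] >= self._window:
            calls.popleft()
        if len(calls) >= self._max_calls:
            return False
        calls.append(now)
        return True
```

Keying the window by repo means an agent hammering one repository is throttled without blocking requests for another.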

Keyword Resolver (v1.2.0+)

resolver.py enables bare keyword input — users can type just "react" instead of facebook/react. The resolution pipeline:

  1. CodeWiki search scraping — headless Playwright fetches search results from codewiki.google
  2. GitHub API fallback (v1.3.0) — when CodeWiki returns zero hits, queries api.github.com/search/repositories to handle typos and lesser-known repos
  3. MCP Elicitation — if multiple repos match and a ctx: Context parameter is present, the server uses the MCP Elicitation protocol to ask the user to pick
  4. Heuristic top-pick — when elicitation is unavailable, the highest-starred result is selected automatically

Resolution results are cached and a resolution note is prepended to every tool response.
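The pipeline above can be sketched as a chain of fallbacks. Everything here is illustrative: the search functions are injected callables rather than real scrapers, candidates are assumed to be dicts with "full_name" and "stars" keys, and elicit is a synchronous stand-in for the MCP Elicitation round trip.

```python
import re

OWNER_REPO = re.compile(r"^[\w.-]+/[\w.-]+$")


def resolve(query, search_codewiki, search_github, elicit=None):
    """Return (owner/repo, note); note is None when no resolution was needed."""
    if OWNER_REPO.match(query):
        return query, None

    # Step 1, then step 2 only when CodeWiki returns zero hits.
    candidates = search_codewiki(query) or search_github(query)
    if not candidates:
        raise LookupError(f"no repository found for {query!r}")

    if len(candidates) > 1 and elicit is not None:
        # Step 3: let the user pick among several matches.
        chosen = elicit(candidates)
    else:
        # Step 4: heuristic top pick by star count.
        chosen = max(candidates, key=lambda c: c["stars"])

    note = f'Resolved "{query}" to {chosen["full_name"]}'
    return chosen["full_name"], note
```

The returned note is the kind of text that could be prepended to a tool response so the user can see which repository was chosen.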

Key Components

| Component | Role |
| --- | --- |
| Playwright + shared browser | All page rendering via headless Chromium |
| Session pool | LRU pool of warm browser contexts for search |
| Stealth layer | Anti-bot-detection patches & human-like input helpers |
| TTLCache (multi-layer) | Rendered pages (5 min), search results (2 min), parsed objects, topic lists (30 min) |
| Rate limiter + dedup | Per-repo sliding window & in-flight call deduplication |
| BeautifulSoup + lxml | Fast HTML parsing with section extraction, TOC, and diagram detection |
| Pydantic schemas | Validate all inputs before processing |
| Keyword resolver | Bare keyword → owner/repo via CodeWiki search, GitHub API fallback, MCP Elicitation |
| Structured responses | JSON envelope with metadata, SHA-256 content hash, idempotency key |
| Modular tools | Each tool in its own module, registered via register_all_tools() |
| Signal handlers | SIGINT/SIGTERM for clean shutdown with browser cleanup |
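To make the structured-response idea concrete, here is one way a JSON envelope with a SHA-256 content hash and an idempotency key could be assembled. The field names and the choice to derive the key from the tool name and repo are assumptions, not the project's actual schema.

```python
import hashlib
import json


def build_envelope(tool, repo, content):
    """Wrap tool output with metadata that lets clients detect repeats."""
    content_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()
    # Hypothetical key: identical (tool, repo) requests map to the same key.
    key_material = json.dumps({"tool": tool, "repo": repo}, sort_keys=True)
    idempotency_key = hashlib.sha256(key_material.encode("utf-8")).hexdigest()[:16]
    return {
        "meta": {
            "tool": tool,
            "repo": repo,
            "content_hash": content_hash,
            "idempotency_key": idempotency_key,
        },
        "content": content,
    }
```

An agent that receives the same content_hash twice for one idempotency_key can tell that repeating the call returned nothing new.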

Project Structure

codewiki_mcp/
├── __init__.py        # Package init + version
├── __main__.py        # python -m entry point
├── browser.py         # Shared Playwright browser singleton + persistent event loop
├── cache.py           # TTLCache for rendered pages, search results, parsed objects, topics
├── config.py          # Env-var-driven configuration + SPA selectors
├── dedup.py           # In-flight call deduplication
├── parser.py          # Playwright renderer + BeautifulSoup section parser
├── rate_limit.py      # Per-repo sliding-window rate limiter
├── resolver.py        # Bare keyword → owner/repo resolution (CodeWiki + GitHub API + Elicitation)
├── server.py          # MCP server setup + CLI
├── session_pool.py    # LRU pool of warm browser contexts for search
├── stealth.py         # Anti-bot-detection JS patches + human-like input helpers
├── types.py           # Pydantic schemas + response models
└── tools/
    ├── __init__.py    # Tool registration
    ├── _helpers.py    # Shared URL construction + error handling
    ├── contents.py    # codewiki_read_contents (paginated)
    ├── request_indexing.py # codewiki_request_indexing (Playwright form submission)
    ├── search.py      # codewiki_search_wiki (Playwright chat interaction)
    ├── structure.py   # codewiki_read_structure
    └── topics.py      # codewiki_list_topics (with previews)
tests/
├── conftest.py            # Shared fixtures + sample data
├── test_cache.py          # Cache layer tests
├── test_config.py         # Configuration tests
├── test_helpers.py        # Helper functions (URL building, truncation, pre-resolve)
├── test_keyword_integration.py  # Tool-level keyword resolution integration
├── test_parser.py         # Parser + HTML extraction tests
├── test_request_indexing.py # Indexing tool elicitation + confirmation flow
├── test_resolver.py       # Keyword resolver pipeline tests
├── test_session_pool.py   # Session pool lifecycle + eviction tests
├── test_stealth.py        # Stealth layer tests
├── test_tools.py          # Server, tools & integration tests
├── test_types.py          # Schema, response, dedup, rate-limit, bare keyword & SHA-256 tests
└── diagnose_search.py     # Standalone headless chat diagnostic script
Dockerfile             # Docker deployment

Running Tests

pip install -e ".[test]"
pytest tests/ -v

The suite contains 246 tests covering cache, config, parser, stealth, tools, types, dedup, rate limiting, idempotency, keyword resolution, helpers, request-indexing elicitation, and keyword integration.