From 372deffa29c47e5c2eb655ba75b532d6cb5081e1 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Torbj=C3=B8rn=20Lindahl?= Date: Fri, 27 Mar 2026 16:52:31 +0100 Subject: [PATCH] added plan --- mcp-server-plan.md | 627 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 627 insertions(+) create mode 100644 mcp-server-plan.md diff --git a/mcp-server-plan.md b/mcp-server-plan.md new file mode 100644 index 0000000..ebd6798 --- /dev/null +++ b/mcp-server-plan.md @@ -0,0 +1,627 @@ +# MCP Server for FHI Statistikk Open API + +## Overview + +An MCP (Model Context Protocol) server that exposes the FHI Statistikk Open API +as tools optimized for AI agent consumption. The server wraps the REST API at +`https://statistikk-data.fhi.no/api/open/v1/` and adds intelligent +summarization, format translation, and convenience features that make the API +practical for LLM-based agents. + +**Base API**: https://statistikk-data.fhi.no/api/open/v1/ +**API docs**: https://statistikk-data.fhi.no/swagger/index.html?urls.primaryName=Allvis%20Open%20API +**License**: CC BY 4.0 (open data) +**Auth**: None required + +## Problem Statement + +The raw API has several characteristics that make it hard for AI agents: + +1. **JSON-stat2 format** -- The data endpoint returns a multidimensional sparse + array format designed for statistical software, not LLMs. +2. **Mandatory dimension specification** -- All dimensions must be included in + every data query, even single-valued ones like `KJONN=["0"]`. +3. **Non-obvious value formats** -- Year values use `"2020_2020"` not `"2020"`. +4. **Massive dimension trees** -- The GEO dimension can have 400+ entries in a + hierarchical tree (country > county > municipality > city district). +5. **Multi-step discovery** -- Finding relevant data requires: list sources > + list tables > get dimensions > construct query > fetch data. +6. **Metadata contains raw HTML** -- `
<p>`, `<br>
    ` tags in content fields. +7. **Swagger spec is incomplete** -- Documents only `"item"` filter, but the API + actually supports `"item"`, `"all"`, `"top"`, `"bottom"`. + +## API Inventory + +### Sources (as of 2026-03-27) + +| ID | Title | Publisher | +|----------|------------------------------------------|----------------------| +| nokkel | Folkehelsestatistikk | Helsedirektoratet | +| ngs | Mikrobiologisk genomovervåkning | FHI | +| mfr | Medisinsk fødselsregister | FHI | +| abr | Abortregisteret | FHI | +| sysvak | Nasjonalt vaksinasjonsregister SYSVAK | FHI | +| daar | Dødsårsakregisteret | FHI | +| msis | Meldingssystem for smittsomme sykdommer | FHI | +| lmr | Legemiddelregisteret | FHI | +| gs | Grossiststatistikk | FHI | +| npr | Norsk pasientregister | FHI | +| kpr | Kommunalt pasient- og brukerregister | FHI | +| hkr | Hjerte- og karsykdommer | FHI | +| skast | Skadedyrstatistikk | FHI | + +### Endpoints + +| Method | Path | Purpose | +|--------|-----------------------------------------------|----------------------------| +| GET | `/Common/source` | List all sources | +| GET | `/{sourceId}/Table` | List tables in source | +| GET | `/{sourceId}/Table/{tableId}` | Table info | +| GET | `/{sourceId}/Table/{tableId}/query` | Query template | +| GET | `/{sourceId}/Table/{tableId}/dimension` | Dimensions and categories | +| POST | `/{sourceId}/Table/{tableId}/data` | Fetch data | +| GET | `/{sourceId}/Table/{tableId}/flag` | Flag/symbol definitions | +| GET | `/{sourceId}/Table/{tableId}/metadata` | Table metadata | + +### Filter Types + +| Filter | Description | Example values | +|----------|------------------------------------------------|--------------------------| +| `item` | Exact match on listed values | `["2020_2020","2021_2021"]` | +| `all` | Wildcard match with `*` | `["*"]` or `["A*","B*"]` | +| `top` | First N categories | `["5"]` | +| `bottom` | Last N categories | `["5"]` | + +### Response Formats (data endpoint) + +| Format | Content-Type | 
Description | +|------------|---------------------------------|---------------------------------| +| json-stat2 | application/json | JSON-Stat 2.0 sparse array | +| csv2 | text/csv | CSV with human-readable labels | +| csv3 | text/csv | CSV with dimension/measure codes| +| parquet | application/vnd.apache.parquet | Apache Parquet columnar format | + +## MCP Tool Design + +### Tool 1: `list_sources` + +**Purpose**: Entry point. List all available data sources. + +**Parameters**: None. + +**Returns**: Array of `{id, title, description, published_by}`. + +**Implementation**: GET `/Common/source`. Pass through with minor field renaming +(snake_case). + +**Caching**: Cache for 24 hours. Source list rarely changes. + +--- + +### Tool 2: `list_tables` + +**Purpose**: Find tables within a source, with optional keyword search. + +**Parameters**: +- `source_id` (string, required) -- Source identifier, e.g. `"nokkel"`. +- `search` (string, optional) -- Case-insensitive keyword filter on table title. + Supports multiple words (all must match). Applied client-side. +- `modified_after` (string, optional) -- ISO-8601 datetime. Only return tables + modified after this date. Passed to API server-side. + +**Returns**: Array of `{table_id, title, published_at, modified_at}`. + +**Implementation**: GET `/{sourceId}/Table?modifiedAfter=...`, then client-side +filter on `search`. Sort by `modified_at` descending. + +**Caching**: Cache per source_id for 1 hour. Table lists update throughout the +day as data is published. + +**Example**: +``` +list_tables(source_id="nokkel", search="befolkning") +→ [{table_id: 185, title: "Befolkningsvekst", ...}, + {table_id: 338, title: "Befolkningssammensetning_antall_andel", ...}, + {table_id: 171, title: "Befolkningsframskriving", ...}] +``` + +--- + +### Tool 3: `describe_table` + +**Purpose**: The primary tool for understanding a table's structure. Gives the +agent everything it needs to construct a data query. 
+ +**Parameters**: +- `source_id` (string, required) +- `table_id` (integer, required) + +**Returns**: A structured summary combining table info, dimensions, metadata, +and flags. This is a composite call (4 parallel API requests). + +**Response structure**: +``` +{ + "title": "Befolkningsvekst", + "published_at": "2025-10-21T08:56:39Z", + "modified_at": "2025-10-21T08:56:39Z", + "is_official_statistics": false, + "description": "Differansen mellom befolkningsmengden...", + "update_frequency": "Årlig", + "keywords": ["Befolkning", "Befolkningsvekst"], + "source_institution": "Statistisk sentralbyrå (SSB)", + "dimensions": [ + { + "code": "GEO", + "label": "Geografi", + "total_categories": 356, + "is_hierarchical": true, + "hierarchy_depth": 4, + "top_level_values": [ + {"value": "0", "label": "Hele landet", "child_count": 15} + ], + "note": "Use get_dimension_values to drill into sub-levels" + }, + { + "code": "AAR", + "label": "År", + "total_categories": 23, + "is_hierarchical": false, + "value_format": "YYYY_YYYY (e.g. 2020_2020)", + "range": "2002..2024", + "values": ["2002_2002", "2003_2003", ..., "2024_2024"] + }, + { + "code": "KJONN", + "label": "Kjønn", + "total_categories": 1, + "is_fixed": true, + "values": [{"value": "0", "label": "kjønn samlet"}], + "note": "Single-valued, auto-included in queries" + }, + { + "code": "ALDER", + "label": "Alder", + "total_categories": 1, + "is_fixed": true, + "values": [{"value": "0_120", "label": "alle aldre"}], + "note": "Single-valued, auto-included in queries" + }, + { + "code": "MEASURE_TYPE", + "label": "Måltall", + "total_categories": 2, + "is_fixed": false, + "values": [ + {"value": "TELLER", "label": "antall"}, + {"value": "RATE", "label": "prosent vekst"} + ] + } + ], + "flags": [ + {"symbol": "", "description": "Verdi finnes i tabellen"} + ] +} +``` + +**Key design decisions**: + +1. 
**Summarize large dimensions** -- For dimensions with >20 categories (mainly + GEO), show only top-level entries with child counts. The agent uses + `get_dimension_values` to drill down. + +2. **Mark fixed dimensions** -- Dimensions with exactly 1 category get + `is_fixed: true`. The agent knows to ignore these; `query_data` will + auto-include them. + +3. **Show value format** -- AAR values are `"2020_2020"`, not `"2020"`. Show + this explicitly so the agent gets the format right. + +4. **Include metadata inline** -- Strip HTML from metadata paragraphs. Extract + `description`, `keywords`, `update_frequency`, `source_institution` as + top-level fields. + +5. **Include flags inline** -- Flag definitions are small and always relevant. + +**Implementation**: Parallel fetch of: +- GET `/{sourceId}/Table/{tableId}` (table info) +- GET `/{sourceId}/Table/{tableId}/dimension` (dimensions) +- GET `/{sourceId}/Table/{tableId}/metadata` (metadata) +- GET `/{sourceId}/Table/{tableId}/flag` (flags) + +Then merge and transform. + +**Caching**: Cache per (source_id, table_id) for 6 hours. Dimension structure +changes rarely. + +--- + +### Tool 4: `get_dimension_values` + +**Purpose**: Drill into large hierarchical dimensions, typically GEO. + +**Parameters**: +- `source_id` (string, required) +- `table_id` (integer, required) +- `dimension_code` (string, required) -- e.g. `"GEO"`. +- `parent_value` (string, optional) -- Return only children of this category. + E.g. `"18"` for Nordland county. If omitted, returns top-level categories. +- `search` (string, optional) -- Case-insensitive search on category labels. + E.g. `"tromsø"` to find the municipality. + +**Returns**: Array of `{value, label, child_count}`. + +**Implementation**: GET `/{sourceId}/Table/{tableId}/dimension`, then navigate +the category tree client-side. The full tree is fetched and cached; filtering +is done in the MCP server. 
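The client-side tree navigation can be sketched as follows. The node shape (`value`/`label`/`children`) is an assumption for illustration; the real dimension payload from the API may differ. Accent-insensitive matching is layered on separately (see Unicode / Fuzzy Search).

```python
def list_dimension_values(categories, parent_value=None, search=None):
    """Navigate a cached dimension category tree client-side.

    Assumes nodes shaped like {"value": str, "label": str,
    "children": [...]}; the real API payload may differ.
    """
    def row(node):
        return {"value": node["value"], "label": node["label"],
                "child_count": len(node.get("children", []))}

    def walk(nodes):
        # Depth-first traversal of the whole tree.
        for node in nodes:
            yield node
            yield from walk(node.get("children", []))

    if search is not None:
        needle = search.casefold()
        return [row(n) for n in walk(categories)
                if needle in n["label"].casefold()]
    if parent_value is not None:
        for node in walk(categories):
            if node["value"] == parent_value:
                return [row(c) for c in node.get("children", [])]
        return []
    return [row(n) for n in categories]  # top-level categories
```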
+ +**Examples**: +``` +# Get all counties +get_dimension_values("nokkel", 185, "GEO") +→ [{value: "0", label: "Hele landet", child_count: 15}] + +# Get municipalities in Nordland +get_dimension_values("nokkel", 185, "GEO", parent_value="18") +→ [{value: "1804", label: "Bodø", child_count: 0}, + {value: "1806", label: "Narvik", child_count: 0}, ...] + +# Search for a municipality +get_dimension_values("nokkel", 185, "GEO", search="tromsø") +→ [{value: "5501", label: "Tromsø", child_count: 0}] +``` + +**Caching**: Shares the dimension cache with `describe_table`. + +--- + +### Tool 5: `query_data` + +**Purpose**: Fetch actual data from a table. The main data retrieval tool. + +**Parameters**: +- `source_id` (string, required) +- `table_id` (integer, required) +- `dimensions` (array, required) -- Each element: + - `code` (string) -- Dimension code, e.g. `"GEO"`. + - `filter` (string) -- One of `"item"`, `"all"`, `"top"`, `"bottom"`. + Default: `"item"`. + - `values` (array of strings) -- Filter values. +- `max_rows` (integer, optional) -- Limit returned rows. Default: 1000. + Set to 0 for no limit (be careful). + +**Returns**: Structured rows with labeled values. + +``` +{ + "table": "Befolkningsvekst", + "total_rows": 4, + "rows": [ + {"GEO": "Oslo", "AAR": "2023", "KJONN": "kjønn samlet", + "ALDER": "alle aldre", "TELLER": 3516, "RATE": 0.5}, + ... + ], + "truncated": false, + "dimensions_used": { + "GEO": {"filter": "item", "values": ["0301"]}, + "AAR": {"filter": "bottom", "values": ["2"]}, + "KJONN": {"filter": "item", "values": ["0"]}, + "ALDER": {"filter": "item", "values": ["0_120"]}, + "MEASURE_TYPE": {"filter": "all", "values": ["*"]} + } +} +``` + +**Key design decisions**: + +1. **Default to csv2 internally** -- Fetch as csv2 (human-readable labels), + parse into rows. CSV is simpler for an agent to reason about than JSON-stat2. + The tool internally requests csv2 and structures it. + +2. 
**Auto-include fixed dimensions** -- If the agent omits a dimension that has + only 1 category (like KJONN or ALDER), the tool adds it automatically with + `filter: "item"` and the single value. This means the agent only needs to + specify the dimensions it actually cares about. + +3. **Normalize year values** -- If the agent sends `"2020"` for AAR, the tool + translates to `"2020_2020"`. The `YYYY_YYYY` format is an internal API + convention the agent shouldn't need to know about. + +4. **Default MEASURE_TYPE** -- If omitted, default to `filter: "all", values: + ["*"]` to get all measures. Most agents want all available metrics. + +5. **Row limit with truncation flag** -- Default 1000 rows. Return a + `truncated: true` flag and `total_rows` count so the agent knows if there's + more data. + +6. **Echo back dimensions_used** -- Show what was actually sent to the API + (after auto-completion), so the agent can see the full query. + +**Implementation**: +1. Fetch dimension info if not cached (to know fixed dimensions and validate) +2. Auto-complete missing/fixed dimensions +3. Normalize year values +4. POST `/{sourceId}/Table/{tableId}/data` with format=csv2 +5. Parse CSV response into row objects +6. Apply row limit, compute truncation + +**Error handling**: The API returns ProblemDetails (RFC 7807) on 400/404/422. +Transform into clear error messages: +- "Dimension 'XYZ' is not valid for this table. Available: GEO, AAR, ..." +- "Value '2025_2025' not found in dimension AAR. Range: 2002..2024" +- "maxRowCount exceeded. Requested ~50000 rows, limit is 1000. Narrow filters." + +--- + +### Tool 6: `get_query_template` + +**Purpose**: Fallback tool returning the raw query template from the API. Useful +when the agent needs to see exactly what the API expects. + +**Parameters**: +- `source_id` (string, required) +- `table_id` (integer, required) + +**Returns**: The raw DataRequest JSON as returned by the API. 
+ +**Implementation**: GET `/{sourceId}/Table/{tableId}/query`. Pass through. + +**When to use**: When `query_data` auto-completion isn't behaving as expected, +or the agent wants to see the complete list of available values for all +dimensions. + +--- + +## Tools NOT included (and why) + +| Considered tool | Decision | Reason | +|------------------------------|----------|---------------------------------------------| +| `get_flags` (standalone) | Dropped | Folded into `describe_table` | +| `get_metadata` (standalone) | Dropped | Folded into `describe_table` | +| `get_table_info` (standalone)| Dropped | Folded into `describe_table` | +| `search_across_sources` | Dropped | Too expensive (13 API calls). Agent can call `list_tables` per source | +| `get_data_jsonstat` | Dropped | Agents don't need raw JSON-stat2 | +| `get_data_parquet` | Dropped | Binary format, not useful for LLM context | + +## Architecture + +### Stack + +- **Language**: Python 3.12+ +- **MCP framework**: FastMCP (`mcp[cli]`) +- **HTTP server**: Uvicorn (`uvicorn>=0.30`) for SSE/HTTP transport +- **HTTP client**: `httpx` (async) +- **CSV parsing**: stdlib `csv` +- **HTML stripping**: stdlib `html.parser` or `re` (simple tag removal) +- **Build system**: Hatchling (matches Fhi.Metadata.MCPserver pattern) + +### Transport + +The server supports multiple transports via CLI flag, following the same pattern +as `Fhi.Metadata.MCPserver`: + +| Transport | Use case | Endpoint | +|------------------|---------------------------------------|-------------------| +| `sse` | Local dev + Skybert deployment | `/sse` | +| `streamable-http`| Future HTTP-only clients | `/mcp` | +| `stdio` | Direct pipe (legacy) | stdin/stdout | + +**Default**: `sse` on `0.0.0.0:8000`. This means the server works over HTTP +both locally and when deployed to Skybert, with no transport change needed. 
+ +**CLI entry point**: +```bash +fhi-statistikk-mcp --transport sse --host 0.0.0.0 --port 8000 +``` + +### Project Structure + +``` +fhi-statistikk-mcp/ +├── .github/ +│ └── workflows/ +│ └── docker-build-push.yaml # CI/CD → crfhiskybert.azurecr.io +├── .mcp.json.local # Local dev: http://localhost:8000/sse +├── .mcp.json.public # Production: https:///sse +├── Dockerfile # Multi-stage, Python 3.12-slim +├── pyproject.toml # Hatchling build, entry point +├── README.md +├── src/ +│ └── fhi_statistikk_mcp/ +│ ├── __init__.py +│ ├── server.py # MCP server, tool definitions, main() +│ ├── api_client.py # Async httpx client for FHI API +│ ├── transformers.py # CSV parsing, dimension summarization +│ └── cache.py # Simple TTL cache +└── tests/ + ├── test_transformers.py + ├── test_cache.py + └── fixtures/ # Recorded API responses + ├── sources.json + ├── tables_nokkel.json + ├── dimensions_185.json + ├── metadata_185.json + ├── flags_185.json + └── data_185.csv +``` + +### MCP Client Configuration + +**Local development** (`.mcp.json.local`): +```json +{ + "mcpServers": { + "fhi-statistikk": { + "type": "sse", + "url": "http://localhost:8000/sse" + } + } +} +``` + +**Production** (`.mcp.json.public`): +```json +{ + "mcpServers": { + "fhi-statistikk": { + "type": "sse", + "url": "https:///sse" + } + } +} +``` + +### Dockerfile + +Following the Fhi.Metadata.MCPserver pattern: +```dockerfile +FROM python:3.12-slim AS base +WORKDIR /app +COPY pyproject.toml . +COPY src/ src/ +RUN pip install --no-cache-dir . 
+ +FROM base AS prod +EXPOSE 8000 +CMD ["fhi-statistikk-mcp", "--transport", "sse", "--host", "0.0.0.0", "--port", "8000"] +``` + +### CI/CD + +Same pipeline pattern as Fhi.Metadata.MCPserver: +- Trigger on push to `main` touching `src/`, `Dockerfile`, or `pyproject.toml` +- Azure Federated Identity (OIDC) login +- Push to `crfhiskybert.azurecr.io/fida/ki/statistikk-mcp` +- Tag: git short SHA + `latest` +- Dispatch to GitOps repo for Skybert deployment + +### Logging + +Force all loggers (uvicorn, mcp, fastmcp) to stderr with simple format. +Print startup info (API base URL, cache status) to stderr. No persistent log +files -- container logging handles that on Skybert. + +### Caching Strategy + +| Data | TTL | Key | Reason | +|------------------|----------|----------------------------|--------------------------------| +| Source list | 24h | `"sources"` | Rarely changes | +| Table list | 1h | `source_id` | New tables published daily | +| Dimensions | 6h | `(source_id, table_id)` | Dimension structure is stable | +| Metadata | 6h | `(source_id, table_id)` | Metadata edits are rare | +| Flags | 6h | `(source_id, table_id)` | Flags rarely change | +| Query templates | 6h | `(source_id, table_id)` | Follows dimension changes | +| Data responses | No cache | -- | Queries vary too much to cache | + +In-memory dict with TTL. No external dependency needed -- the data volume is +small and the server is single-process. + +### Rate Limiting + +No documented rate limits, but this is a government API. Be polite: +- Max 5 concurrent requests +- 100ms minimum between requests +- Retry with exponential backoff on 429/503 + +### Error Mapping + +| API Response | MCP Tool Error | +|-----------------------|------------------------------------------------------| +| 400 Bad Request | Descriptive message from ProblemDetails.detail | +| 404 Not Found | "Source/table not found: {id}" | +| 422 Client Error | "Query validation failed: {detail}" | +| Network timeout | "API request timed out. 
Try reducing query scope." | +| CSV parse error | "Failed to parse response. Try get_query_template." | + +### Unicode / Fuzzy Search + +Dimension value search (in `get_dimension_values`) normalizes both query and +labels for accent-insensitive matching: +- Normalize with `unicodedata.normalize("NFD")`, strip combining marks +- Case-insensitive comparison +- `"tromso"` matches `"Tromsø"`, `"barum"` matches `"Bærum"` +- Preserve original labels in output + +## Implementation Plan + +### Phase 1: Core (MVP) + +1. Set up project skeleton: `pyproject.toml` with hatchling, `src/` layout, + entry point `fhi-statistikk-mcp` +2. Set up `server.py` with FastMCP, SSE transport, CLI args (transport, host, + port), stderr logging +3. Implement `api_client.py` with async httpx client, base URL config +4. Implement `cache.py` with simple TTL dict +5. Implement `list_sources` tool +6. Implement `list_tables` tool with client-side keyword search +7. Implement `describe_table` composite tool + - Parallel fetch of 4 endpoints + - Dimension summarization (large dim truncation, fixed dim detection) + - HTML stripping for metadata + - Merge into structured response +8. Implement `query_data` tool + - Auto-completion of fixed dimensions + - Year value normalization (`"2020"` → `"2020_2020"`) + - Default MEASURE_TYPE to `all`/`["*"]` + - CSV parsing and row structuring + - Row limit and truncation +9. Implement `get_dimension_values` with hierarchy navigation and accent- + insensitive search +10. Implement `get_query_template` passthrough +11. Add `.mcp.json.local` for local dev +12. Test all tools against live API + +### Phase 2: Deployment & Polish + +13. Add `Dockerfile` (multi-stage, Python 3.12-slim) +14. Add `.github/workflows/docker-build-push.yaml` for CI/CD +15. Add `.mcp.json.public` with Skybert URL +16. Add comprehensive error handling and error messages +17. Add rate limiting +18. Record API fixtures for offline testing +19. Write unit tests for transformers and cache +20. 
Write integration tests against live API + +### Phase 3: Optional Enhancements + +21. Add a `search_all_tables` convenience tool (if agents frequently need it) +22. Add MCP resources for static reference data (source descriptions, common + dimension codes) +23. Add MCP prompt templates (e.g. "finn helsedata om ") + +## Tool Description Guidelines + +MCP tool descriptions are what the agent uses to decide which tool to call. They +should be written for an LLM audience: + +- Lead with the purpose, not the endpoint +- Include example parameter values +- Document non-obvious conventions (year format, dimension codes) +- Mention what `describe_table` returns, since it's the prerequisite for + `query_data` +- Note that Norwegian labels are the default (GEO labels are in Norwegian) + +### Example tool description for `query_data`: + +> Fetch statistical data from an FHI table. Before calling this, use +> `describe_table` to understand the table's dimensions and available values. +> +> You only need to specify the dimensions you care about. Fixed dimensions +> (single-valued, like KJONN="kjønn samlet") are auto-included. If you omit +> MEASURE_TYPE, all measures are returned. +> +> Year values: use "2020" (auto-translated to "2020_2020") or the full format. +> +> Filters: "item" (exact values), "all" (wildcard, e.g. ["*"]), +> "top" (first N), "bottom" (last N). +> +> Returns labeled rows, max 1000 by default. Check "truncated" field. + +## Resolved Decisions + +| Question | Decision | Rationale | +|----------|----------|-----------| +| Hosting | SSE locally, same for Skybert | Follow Fhi.Metadata.MCPserver pattern. HTTP from day one, no transport change on deploy. | +| JSON-stat2 output | No | csv2 is sufficient for LLM agents. JSON-stat2 is for statistical software. | +| Fuzzy dimension search | Yes, accent-insensitive | Norwegian chars (æøå) will trip up agents. Normalize NFD + strip combining marks. | +| Sample data in describe_table | No | Adds latency. 
Agent calls `query_data` with `max_rows=5` if it wants a preview. |
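The accent-insensitive matching decided above can be sketched as follows. One caveat worth encoding: `å` decomposes under NFD (a + combining ring), but `ø` and `æ` are standalone letters with no canonical decomposition, so they need an explicit mapping on top of the NFD step. Mapping `æ` to `a` (so that "barum" matches "Bærum", per the examples) is an assumption of this sketch:

```python
import unicodedata

# Explicit mapping for Norwegian letters that do NOT decompose under
# NFD; the "æ" -> "a" choice follows the plan's "barum" -> "Bærum" example.
_NO_DECOMP = {ord("ø"): "o", ord("æ"): "a"}

def fold(text: str) -> str:
    """Casefold, map ø/æ, NFD-decompose, then drop combining marks."""
    text = text.casefold().translate(_NO_DECOMP)
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed
                   if not unicodedata.combining(ch))

def label_matches(query: str, label: str) -> bool:
    # Substring match on folded forms; original labels stay untouched
    # in the tool output.
    return fold(query) in fold(label)
```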