`fhi-statistikk-mcp/mcp-server-plan.md` -- Torbjørn Lindahl, 2026-03-27

# MCP Server for FHI Statistikk Open API
## Overview
An MCP (Model Context Protocol) server that exposes the FHI Statistikk Open API
as tools optimized for AI agent consumption. The server wraps the REST API at
`https://statistikk-data.fhi.no/api/open/v1/` and adds intelligent
summarization, format translation, and convenience features that make the API
practical for LLM-based agents.
**This repo**: `/home/tlind/git/fhi/openapi-mcp`
**API documentation repo**: `/home/tlind/git/fhi/Fhi.Statistikk.OpenAPI` (code samples, Postman collection, user guide)
**Base API**: https://statistikk-data.fhi.no/api/open/v1/
**API docs**: https://statistikk-data.fhi.no/swagger/index.html?urls.primaryName=Allvis%20Open%20API
**License**: CC BY 4.0 (open data)
**Auth**: None required
## Problem Statement
The raw API has several characteristics that make it hard for AI agents:
1. **JSON-stat2 format** -- The data endpoint returns a multidimensional sparse
array format designed for statistical software, not LLMs.
2. **Mandatory dimension specification** -- All dimensions must be included in
every data query, even single-valued ones like `KJONN=["0"]`.
3. **Non-obvious value formats** -- Year values use `"2020_2020"` not `"2020"`.
4. **Massive dimension trees** -- The GEO dimension can have 400+ entries in a
hierarchical tree (country > county > municipality > city district).
5. **Multi-step discovery** -- Finding relevant data requires: list sources >
list tables > get dimensions > construct query > fetch data.
6. **Metadata contains raw HTML** -- `<p>`, `<a>`, `<ol>` tags in content fields.
7. **Swagger spec is incomplete** -- Documents only `"item"` filter, but the API
actually supports `"item"`, `"all"`, `"top"`, `"bottom"`.
## API Inventory
### Sources (as of 2026-03-27)
| ID | Title | Publisher |
|----------|------------------------------------------|----------------------|
| nokkel | Folkehelsestatistikk | Helsedirektoratet |
| ngs | Mikrobiologisk genomovervåkning | FHI |
| mfr | Medisinsk fødselsregister | FHI |
| abr | Abortregisteret | FHI |
| sysvak | Nasjonalt vaksinasjonsregister SYSVAK | FHI |
| daar     | Dødsårsaksregisteret                     | FHI                  |
| msis | Meldingssystem for smittsomme sykdommer | FHI |
| lmr | Legemiddelregisteret | FHI |
| gs | Grossiststatistikk | FHI |
| npr | Norsk pasientregister | FHI |
| kpr | Kommunalt pasient- og brukerregister | FHI |
| hkr | Hjerte- og karsykdommer | FHI |
| skast | Skadedyrstatistikk | FHI |
### Endpoints
| Method | Path | Purpose |
|--------|-----------------------------------------------|----------------------------|
| GET | `/Common/source` | List all sources |
| GET | `/{sourceId}/Table` | List tables in source |
| GET | `/{sourceId}/Table/{tableId}` | Table info |
| GET | `/{sourceId}/Table/{tableId}/query` | Query template |
| GET | `/{sourceId}/Table/{tableId}/dimension` | Dimensions and categories |
| POST | `/{sourceId}/Table/{tableId}/data` | Fetch data |
| GET | `/{sourceId}/Table/{tableId}/flag` | Flag/symbol definitions |
| GET | `/{sourceId}/Table/{tableId}/metadata` | Table metadata |
### Filter Types
| Filter | Description | Example values |
|----------|------------------------------------------------|--------------------------|
| `item` | Exact match on listed values | `["2020_2020","2021_2021"]` |
| `all` | Wildcard match with `*` | `["*"]` or `["A*","B*"]` |
| `top` | First N categories | `["5"]` |
| `bottom` | Last N categories | `["5"]` |
### Response Formats (data endpoint)
| Format | Content-Type | Description |
|------------|---------------------------------|---------------------------------|
| json-stat2 | application/json | JSON-Stat 2.0 sparse array |
| csv2 | text/csv | CSV with human-readable labels |
| csv3 | text/csv | CSV with dimension/measure codes|
| parquet | application/vnd.apache.parquet | Apache Parquet columnar format |
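To make the filter semantics concrete, here is a minimal sketch of how a data request body could be assembled. The field names (`dimensions`, `filter`, `values`, `response.format`) are assumptions inferred from the tables above; the authoritative shape comes from the `/query` template endpoint and should be confirmed against it.

```python
def build_data_request(dimension_filters: dict[str, tuple[str, list[str]]]) -> dict:
    """Map {code: (filter, values)} into a hypothetical DataRequest body.

    Field names are illustrative; verify against the API's query template.
    """
    return {
        "dimensions": [
            {"code": code, "filter": flt, "values": values}
            for code, (flt, values) in dimension_filters.items()
        ],
        "response": {"format": "csv2"},
    }

body = build_data_request({
    "GEO": ("item", ["0301"]),        # Oslo
    "AAR": ("bottom", ["2"]),         # last two year categories
    "MEASURE_TYPE": ("all", ["*"]),   # every measure
})
```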
## MCP Tool Design
### Tool 1: `list_sources`
**Purpose**: Entry point. List all available data sources.
**Parameters**: None.
**Returns**: Array of `{id, title, description, published_by}`.
**Implementation**: GET `/Common/source`. Pass through with minor field renaming
(snake_case).
**Caching**: Cache for 24 hours. Source list rarely changes.
---
### Tool 2: `list_tables`
**Purpose**: Find tables within a source, with optional keyword search.
**Parameters**:
- `source_id` (string, required) -- Source identifier, e.g. `"nokkel"`.
- `search` (string, optional) -- Case-insensitive keyword filter on table title.
Supports multiple words (all must match). Applied client-side.
- `modified_after` (string, optional) -- ISO-8601 datetime. Only return tables
modified after this date. Passed to API server-side.
**Returns**: Array of `{table_id, title, published_at, modified_at}`.
**Implementation**: GET `/{sourceId}/Table?modifiedAfter=...`, then client-side
filter on `search`. Sort by `modified_at` descending.
**Caching**: Cache per source_id for 1 hour. Table lists update throughout the
day as data is published.
**Example**:
```
list_tables(source_id="nokkel", search="befolkning")
→ [{table_id: 185, title: "Befolkningsvekst", ...},
   {table_id: 338, title: "Befolkningssammensetning_antall_andel", ...},
   {table_id: 171, title: "Befolkningsframskriving", ...}]
```
---
### Tool 3: `describe_table`
**Purpose**: The primary tool for understanding a table's structure. Gives the
agent everything it needs to construct a data query.
**Parameters**:
- `source_id` (string, required)
- `table_id` (integer, required)
**Returns**: A structured summary combining table info, dimensions, metadata,
and flags. This is a composite call (4 parallel API requests).
**Response structure**:
```
{
  "title": "Befolkningsvekst",
  "published_at": "2025-10-21T08:56:39Z",
  "modified_at": "2025-10-21T08:56:39Z",
  "is_official_statistics": false,
  "description": "Differansen mellom befolkningsmengden...",
  "update_frequency": "Årlig",
  "keywords": ["Befolkning", "Befolkningsvekst"],
  "source_institution": "Statistisk sentralbyrå (SSB)",
  "dimensions": [
    {
      "code": "GEO",
      "label": "Geografi",
      "total_categories": 356,
      "is_hierarchical": true,
      "hierarchy_depth": 4,
      "top_level_values": [
        {"value": "0", "label": "Hele landet", "child_count": 15}
      ],
      "note": "Use get_dimension_values to drill into sub-levels"
    },
    {
      "code": "AAR",
      "label": "År",
      "total_categories": 23,
      "is_hierarchical": false,
      "value_format": "YYYY_YYYY (e.g. 2020_2020)",
      "range": "2002..2024",
      "values": ["2002_2002", "2003_2003", ..., "2024_2024"]
    },
    {
      "code": "KJONN",
      "label": "Kjønn",
      "total_categories": 1,
      "is_fixed": true,
      "values": [{"value": "0", "label": "kjønn samlet"}],
      "note": "Single-valued, auto-included in queries"
    },
    {
      "code": "ALDER",
      "label": "Alder",
      "total_categories": 1,
      "is_fixed": true,
      "values": [{"value": "0_120", "label": "alle aldre"}],
      "note": "Single-valued, auto-included in queries"
    },
    {
      "code": "MEASURE_TYPE",
      "label": "Måltall",
      "total_categories": 2,
      "is_fixed": false,
      "values": [
        {"value": "TELLER", "label": "antall"},
        {"value": "RATE", "label": "prosent vekst"}
      ]
    }
  ],
  "flags": [
    {"symbol": "", "description": "Verdi finnes i tabellen"}
  ]
}
```
**Key design decisions**:
1. **Summarize large dimensions** -- For dimensions with >20 categories (mainly
GEO), show only top-level entries with child counts. The agent uses
`get_dimension_values` to drill down.
2. **Mark fixed dimensions** -- Dimensions with exactly 1 category get
`is_fixed: true`. The agent knows to ignore these; `query_data` will
auto-include them.
3. **Show value format** -- AAR values are `"2020_2020"`, not `"2020"`. Show
this explicitly so the agent gets the format right.
4. **Include metadata inline** -- Strip HTML from metadata paragraphs. Extract
`description`, `keywords`, `update_frequency`, `source_institution` as
top-level fields.
5. **Include flags inline** -- Flag definitions are small and always relevant.
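For design decision 4, the HTML stripping can stay in the stdlib as planned. A minimal sketch using `html.parser` (function names are illustrative; this keeps text content and collapses whitespace, without trying to preserve list structure):

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text content, dropping tags like <p>, <a>, <ol>."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def strip_html(fragment: str) -> str:
    """Return the fragment's text with tags removed and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(fragment)
    return " ".join("".join(parser.parts).split())
```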
**Implementation**: Parallel fetch of:
- GET `/{sourceId}/Table/{tableId}` (table info)
- GET `/{sourceId}/Table/{tableId}/dimension` (dimensions)
- GET `/{sourceId}/Table/{tableId}/metadata` (metadata)
- GET `/{sourceId}/Table/{tableId}/flag` (flags)
Then merge and transform.
**Caching**: Cache per (source_id, table_id) for 6 hours. Dimension structure
changes rarely.
---
### Tool 4: `get_dimension_values`
**Purpose**: Drill into large hierarchical dimensions, typically GEO.
**Parameters**:
- `source_id` (string, required)
- `table_id` (integer, required)
- `dimension_code` (string, required) -- e.g. `"GEO"`.
- `parent_value` (string, optional) -- Return only children of this category.
E.g. `"18"` for Nordland county. If omitted, returns top-level categories.
- `search` (string, optional) -- Case-insensitive search on category labels.
E.g. `"tromsø"` to find the municipality.
**Returns**: Array of `{value, label, child_count}`.
**Implementation**: GET `/{sourceId}/Table/{tableId}/dimension`, then navigate
the category tree client-side. The full tree is fetched and cached; filtering
is done in the MCP server.
**Examples**:
```
# Get top-level categories (the country node; its 15 children are counties)
get_dimension_values("nokkel", 185, "GEO")
→ [{value: "0", label: "Hele landet", child_count: 15}]

# Get municipalities in Nordland
get_dimension_values("nokkel", 185, "GEO", parent_value="18")
→ [{value: "1804", label: "Bodø", child_count: 0},
   {value: "1806", label: "Narvik", child_count: 0}, ...]

# Search for a municipality
get_dimension_values("nokkel", 185, "GEO", search="tromsø")
→ [{value: "5501", label: "Tromsø", child_count: 0}]
```
**Caching**: Shares the dimension cache with `describe_table`.
---
### Tool 5: `query_data`
**Purpose**: Fetch actual data from a table. The main data retrieval tool.
**Parameters**:
- `source_id` (string, required)
- `table_id` (integer, required)
- `dimensions` (array, required) -- Each element:
- `code` (string) -- Dimension code, e.g. `"GEO"`.
- `filter` (string) -- One of `"item"`, `"all"`, `"top"`, `"bottom"`.
Default: `"item"`.
- `values` (array of strings) -- Filter values.
- `max_rows` (integer, optional) -- Limit returned rows. Default: 1000.
Set to 0 for no limit (be careful).
**Returns**: Structured rows with labeled values.
```
{
  "table": "Befolkningsvekst",
  "total_rows": 4,
  "rows": [
    {"GEO": "Oslo", "AAR": "2023", "KJONN": "kjønn samlet",
     "ALDER": "alle aldre", "TELLER": 3516, "RATE": 0.5},
    ...
  ],
  "truncated": false,
  "dimensions_used": {
    "GEO": {"filter": "item", "values": ["0301"]},
    "AAR": {"filter": "bottom", "values": ["2"]},
    "KJONN": {"filter": "item", "values": ["0"]},
    "ALDER": {"filter": "item", "values": ["0_120"]},
    "MEASURE_TYPE": {"filter": "all", "values": ["*"]}
  }
}
```
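Turning a csv2 response into this row structure is mostly stdlib `csv`. A minimal sketch, assuming csv2 returns a standard header row (the real column layout should be confirmed against the API):

```python
import csv
import io


def rows_from_csv2(text: str, max_rows: int = 1000) -> dict:
    """Parse a csv2 body into labeled row dicts, applying the row limit.

    max_rows=0 means no limit, matching the tool's parameter contract.
    """
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    limited = rows[:max_rows] if max_rows else rows
    return {
        "total_rows": len(rows),
        "rows": limited,
        "truncated": len(limited) < len(rows),
    }
```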
**Key design decisions**:
1. **Default to csv2 internally** -- Fetch as csv2 (human-readable labels),
parse into rows. CSV is simpler for an agent to reason about than JSON-stat2.
The tool internally requests csv2 and structures it.
2. **Auto-include fixed dimensions** -- If the agent omits a dimension that has
only 1 category (like KJONN or ALDER), the tool adds it automatically with
`filter: "item"` and the single value. This means the agent only needs to
specify the dimensions it actually cares about.
3. **Normalize year values** -- If the agent sends `"2020"` for AAR, the tool
translates to `"2020_2020"`. The `YYYY_YYYY` format is an internal API
convention the agent shouldn't need to know about.
4. **Default MEASURE_TYPE** -- If omitted, default to `filter: "all", values:
["*"]` to get all measures. Most agents want all available metrics.
5. **Row limit with truncation flag** -- Default 1000 rows. Return a
`truncated: true` flag and `total_rows` count so the agent knows if there's
more data.
6. **Echo back dimensions_used** -- Show what was actually sent to the API
(after auto-completion), so the agent can see the full query.
**Implementation**:
1. Fetch dimension info if not cached (to know fixed dimensions and validate)
2. Auto-complete missing/fixed dimensions
3. Normalize year values
4. POST `/{sourceId}/Table/{tableId}/data` with format=csv2
5. Parse CSV response into row objects
6. Apply row limit, compute truncation
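Step 3's year normalization is a one-liner. A sketch (the pass-through behavior for already-correct values and for non-year codes like `0_120` is an assumption about how the tool should behave):

```python
import re

_BARE_YEAR = re.compile(r"\d{4}")


def normalize_year(value: str) -> str:
    """Translate a bare year like "2020" into the API's "2020_2020"
    convention; anything else (including "2020_2020") passes through."""
    return f"{value}_{value}" if _BARE_YEAR.fullmatch(value) else value
```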
**Error handling**: The API returns ProblemDetails (RFC 7807) on 400/404/422.
Transform into clear error messages:
- "Dimension 'XYZ' is not valid for this table. Available: GEO, AAR, ..."
- "Value '2025_2025' not found in dimension AAR. Range: 2002..2024"
- "maxRowCount exceeded. Requested ~50000 rows, limit is 1000. Narrow filters."
---
### Tool 6: `get_query_template`
**Purpose**: Fallback tool returning the raw query template from the API. Useful
when the agent needs to see exactly what the API expects.
**Parameters**:
- `source_id` (string, required)
- `table_id` (integer, required)
**Returns**: The raw DataRequest JSON as returned by the API.
**Implementation**: GET `/{sourceId}/Table/{tableId}/query`. Pass through.
**When to use**: When `query_data` auto-completion isn't behaving as expected,
or the agent wants to see the complete list of available values for all
dimensions.
---
## Tools NOT included (and why)
| Considered tool | Decision | Reason |
|------------------------------|----------|---------------------------------------------|
| `get_flags` (standalone) | Dropped | Folded into `describe_table` |
| `get_metadata` (standalone) | Dropped | Folded into `describe_table` |
| `get_table_info` (standalone)| Dropped | Folded into `describe_table` |
| `search_across_sources` | Dropped | Too expensive (13 API calls). Agent can call `list_tables` per source |
| `get_data_jsonstat` | Dropped | Agents don't need raw JSON-stat2 |
| `get_data_parquet` | Dropped | Binary format, not useful for LLM context |
## Architecture
### Stack
- **Language**: Python 3.12+
- **MCP framework**: FastMCP (`mcp[cli]`)
- **HTTP server**: Uvicorn (`uvicorn>=0.30`) for SSE/HTTP transport
- **HTTP client**: `httpx` (async)
- **CSV parsing**: stdlib `csv`
- **HTML stripping**: stdlib `html.parser` or `re` (simple tag removal)
- **Build system**: Hatchling (matches Fhi.Metadata.MCPserver pattern)
### Transport
The server supports multiple transports via CLI flag, following the same pattern
as `Fhi.Metadata.MCPserver`:
| Transport | Use case | Endpoint |
|------------------|---------------------------------------|-------------------|
| `sse` | Local dev + Skybert deployment | `/sse` |
| `streamable-http`| Future HTTP-only clients | `/mcp` |
| `stdio` | Direct pipe (legacy) | stdin/stdout |
**Default**: `sse` on `0.0.0.0:8000`. This means the server works over HTTP
both locally and when deployed to Skybert, with no transport change needed.
**CLI entry point**:
```bash
fhi-statistikk-mcp --transport sse --host 0.0.0.0 --port 8000
```
### Project Structure
```
openapi-mcp/
├── .github/
│   └── workflows/
│       └── docker-build-push.yaml   # CI/CD → crfhiskybert.azurecr.io
├── .mcp.json.local                  # Local dev: http://localhost:8000/sse
├── .mcp.json.public                 # Production: https://<skybert-url>/sse
├── Dockerfile                       # Multi-stage, Python 3.12-slim
├── pyproject.toml                   # Hatchling build, entry point
├── README.md
├── src/
│   └── fhi_statistikk_mcp/
│       ├── __init__.py
│       ├── server.py                # MCP server, tool definitions, main()
│       ├── api_client.py            # Async httpx client for FHI API
│       ├── transformers.py          # CSV parsing, dimension summarization
│       └── cache.py                 # Simple TTL cache
└── tests/
    ├── test_transformers.py
    ├── test_cache.py
    └── fixtures/                    # Recorded API responses
        ├── sources.json
        ├── tables_nokkel.json
        ├── dimensions_185.json
        ├── metadata_185.json
        ├── flags_185.json
        └── data_185.csv
```
### MCP Client Configuration
**Local development** (`.mcp.json.local`):
```json
{
  "mcpServers": {
    "fhi-statistikk": {
      "type": "sse",
      "url": "http://localhost:8000/sse"
    }
  }
}
```
**Production** (`.mcp.json.public`):
```json
{
  "mcpServers": {
    "fhi-statistikk": {
      "type": "sse",
      "url": "https://<skybert-url>/sse"
    }
  }
}
```
### Dockerfile
Following the Fhi.Metadata.MCPserver pattern:
```dockerfile
FROM python:3.12-slim AS base
WORKDIR /app
COPY pyproject.toml .
COPY src/ src/
RUN pip install --no-cache-dir .
FROM base AS prod
EXPOSE 8000
CMD ["fhi-statistikk-mcp", "--transport", "sse", "--host", "0.0.0.0", "--port", "8000"]
```
### CI/CD
Same pipeline pattern as Fhi.Metadata.MCPserver:
- Trigger on push to `main` touching `src/`, `Dockerfile`, or `pyproject.toml`
- Azure Federated Identity (OIDC) login
- Push to `crfhiskybert.azurecr.io/fida/ki/statistikk-mcp`
- Tag: git short SHA + `latest`
- Dispatch to GitOps repo for Skybert deployment
### Logging
Force all loggers (uvicorn, mcp, fastmcp) to stderr with simple format.
Print startup info (API base URL, cache status) to stderr. No persistent log
files -- container logging handles that on Skybert.
### Caching Strategy
| Data | TTL | Key | Reason |
|------------------|----------|----------------------------|--------------------------------|
| Source list | 24h | `"sources"` | Rarely changes |
| Table list | 1h | `source_id` | New tables published daily |
| Dimensions | 6h | `(source_id, table_id)` | Dimension structure is stable |
| Metadata | 6h | `(source_id, table_id)` | Metadata edits are rare |
| Flags | 6h | `(source_id, table_id)` | Flags rarely change |
| Query templates | 6h | `(source_id, table_id)` | Follows dimension changes |
| Data responses | No cache | -- | Queries vary too much to cache |
In-memory dict with TTL. No external dependency needed -- the data volume is
small and the server is single-process.
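The whole of `cache.py` can be a few lines. A sketch of the intended shape (class and method names are illustrative; not thread-safe, which is acceptable for a single-process async server):

```python
import time


class TTLCache:
    """Minimal in-memory cache with a per-entry expiry."""

    def __init__(self) -> None:
        self._store: dict = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]   # lazy eviction on read
            return None
        return value

    def set(self, key, value, ttl_seconds: float) -> None:
        self._store[key] = (value, time.monotonic() + ttl_seconds)
```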
### Rate Limiting
No documented rate limits, but this is a government API. Be polite:
- Max 5 concurrent requests
- 100ms minimum between requests
- Retry with exponential backoff on 429/503
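The first two politeness rules could be enforced with a small async context manager, sketched below (class name and structure are illustrative; retry/backoff would wrap the HTTP call separately):

```python
import asyncio
import time


class PoliteLimiter:
    """At most max_concurrent in-flight requests, and at least
    min_interval seconds between request starts."""

    def __init__(self, max_concurrent: int = 5, min_interval: float = 0.1):
        self._sem = asyncio.Semaphore(max_concurrent)
        self._lock = asyncio.Lock()
        self._min_interval = min_interval
        self._last_start = 0.0

    async def __aenter__(self):
        await self._sem.acquire()
        async with self._lock:          # serialize the spacing check
            wait = self._last_start + self._min_interval - time.monotonic()
            if wait > 0:
                await asyncio.sleep(wait)
            self._last_start = time.monotonic()

    async def __aexit__(self, *exc):
        self._sem.release()
```

Each API call in `api_client.py` would then run inside `async with limiter:`.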
### Error Mapping
| API Response | MCP Tool Error |
|-----------------------|------------------------------------------------------|
| 400 Bad Request | Descriptive message from ProblemDetails.detail |
| 404 Not Found | "Source/table not found: {id}" |
| 422 Client Error | "Query validation failed: {detail}" |
| Network timeout | "API request timed out. Try reducing query scope." |
| CSV parse error | "Failed to parse response. Try get_query_template." |
### Unicode / Fuzzy Search
Dimension value search (in `get_dimension_values`) normalizes both query and
labels for accent-insensitive matching:
- Normalize with `unicodedata.normalize("NFD")`, strip combining marks (handles `å` → `a`, `é` → `e`)
- Explicitly map `æ` → `ae` and `ø` → `o`: neither has a canonical decomposition, so NFD alone leaves them unchanged
- Case-insensitive comparison
- `"tromso"` matches `"Tromsø"`, `"baerum"` matches `"Bærum"`
- Preserve original labels in output
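A sketch of the folding function (names are illustrative):

```python
import unicodedata

# Norwegian letters with no canonical NFD decomposition need an
# explicit mapping; å decomposes and is handled by the NFD pass.
_SPECIALS = str.maketrans({"æ": "ae", "ø": "o"})


def fold(text: str) -> str:
    """Lowercase, map ae/oe specials, then strip combining marks."""
    text = text.lower().translate(_SPECIALS)
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed
                   if unicodedata.category(c) != "Mn")


def label_matches(query: str, label: str) -> bool:
    """Accent-insensitive substring match on a category label."""
    return fold(query) in fold(label)
```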
## Implementation Plan
### Phase 1: Core (MVP)
1. Set up project skeleton: `pyproject.toml` with hatchling, `src/` layout,
entry point `fhi-statistikk-mcp`
2. Set up `server.py` with FastMCP, SSE transport, CLI args (transport, host,
port), stderr logging
3. Implement `api_client.py` with async httpx client, base URL config
4. Implement `cache.py` with simple TTL dict
5. Implement `list_sources` tool
6. Implement `list_tables` tool with client-side keyword search
7. Implement `describe_table` composite tool
- Parallel fetch of 4 endpoints
- Dimension summarization (large dim truncation, fixed dim detection)
- HTML stripping for metadata
- Merge into structured response
8. Implement `query_data` tool
- Auto-completion of fixed dimensions
- Year value normalization (`"2020"` → `"2020_2020"`)
- Default MEASURE_TYPE to `all`/`["*"]`
- CSV parsing and row structuring
- Row limit and truncation
9. Implement `get_dimension_values` with hierarchy navigation and accent-
insensitive search
10. Implement `get_query_template` passthrough
11. Add `.mcp.json.local` for local dev
12. Test all tools against live API
### Phase 2: Deployment & Polish
13. Add `Dockerfile` (multi-stage, Python 3.12-slim)
14. Add `.github/workflows/docker-build-push.yaml` for CI/CD
15. Add `.mcp.json.public` with Skybert URL
16. Add comprehensive error handling and error messages
17. Add rate limiting
18. Record API fixtures for offline testing
19. Write unit tests for transformers and cache
20. Write integration tests against live API
### Phase 3: Optional Enhancements
21. Add a `search_all_tables` convenience tool (if agents frequently need it)
22. Add MCP resources for static reference data (source descriptions, common
dimension codes)
23. Add MCP prompt templates (e.g. "finn helsedata om <topic>")
## Tool Description Guidelines
MCP tool descriptions are what the agent uses to decide which tool to call. They
should be written for an LLM audience:
- Lead with the purpose, not the endpoint
- Include example parameter values
- Document non-obvious conventions (year format, dimension codes)
- Mention what `describe_table` returns, since it's the prerequisite for
`query_data`
- Note that Norwegian labels are the default (GEO labels are in Norwegian)
### Example tool description for `query_data`:
> Fetch statistical data from an FHI table. Before calling this, use
> `describe_table` to understand the table's dimensions and available values.
>
> You only need to specify the dimensions you care about. Fixed dimensions
> (single-valued, like KJONN="kjønn samlet") are auto-included. If you omit
> MEASURE_TYPE, all measures are returned.
>
> Year values: use "2020" (auto-translated to "2020_2020") or the full format.
>
> Filters: "item" (exact values), "all" (wildcard, e.g. ["*"]),
> "top" (first N), "bottom" (last N).
>
> Returns labeled rows, max 1000 by default. Check "truncated" field.
## Resolved Decisions
| Question | Decision | Rationale |
|----------|----------|-----------|
| Hosting | SSE locally, same for Skybert | Follow Fhi.Metadata.MCPserver pattern. HTTP from day one, no transport change on deploy. |
| JSON-stat2 output | No | csv2 is sufficient for LLM agents. JSON-stat2 is for statistical software. |
| Fuzzy dimension search | Yes, accent-insensitive | Norwegian chars (æøå) will trip up agents. Normalize NFD + strip combining marks. |
| Sample data in describe_table | No | Adds latency. Agent calls `query_data` with `max_rows=5` if it wants a preview. |