# MCP Server for FHI Statistikk Open API

## Overview

An MCP (Model Context Protocol) server that exposes the FHI Statistikk Open API
as tools optimized for AI agent consumption. The server wraps the REST API at
`https://statistikk-data.fhi.no/api/open/v1/` and adds intelligent
summarization, format translation, and convenience features that make the API
practical for LLM-based agents.

**Base API**: https://statistikk-data.fhi.no/api/open/v1/
**API docs**: https://statistikk-data.fhi.no/swagger/index.html?urls.primaryName=Allvis%20Open%20API
**License**: CC BY 4.0 (open data)
**Auth**: None required

## Problem Statement

The raw API has several characteristics that make it hard for AI agents to use:

1. **JSON-stat2 format** -- The data endpoint returns a multidimensional sparse
   array format designed for statistical software, not LLMs.
2. **Mandatory dimension specification** -- All dimensions must be included in
   every data query, even single-valued ones like `KJONN=["0"]`.
3. **Non-obvious value formats** -- Year values use `"2020_2020"`, not `"2020"`.
4. **Massive dimension trees** -- The GEO dimension can have 400+ entries in a
   hierarchical tree (country > county > municipality > city district).
5. **Multi-step discovery** -- Finding relevant data requires: list sources >
   list tables > get dimensions > construct query > fetch data.
6. **Metadata contains raw HTML** -- `<p>`, `<a>`, `<ol>` tags in content fields.
7. **Swagger spec is incomplete** -- It documents only the `"item"` filter, but
   the API actually supports `"item"`, `"all"`, `"top"`, `"bottom"`.

## API Inventory

### Sources (as of 2026-03-27)

| ID     | Title                                   | Publisher         |
|--------|-----------------------------------------|-------------------|
| nokkel | Folkehelsestatistikk                    | Helsedirektoratet |
| ngs    | Mikrobiologisk genomovervåkning         | FHI               |
| mfr    | Medisinsk fødselsregister               | FHI               |
| abr    | Abortregisteret                         | FHI               |
| sysvak | Nasjonalt vaksinasjonsregister SYSVAK   | FHI               |
| daar   | Dødsårsakregisteret                     | FHI               |
| msis   | Meldingssystem for smittsomme sykdommer | FHI               |
| lmr    | Legemiddelregisteret                    | FHI               |
| gs     | Grossiststatistikk                      | FHI               |
| npr    | Norsk pasientregister                   | FHI               |
| kpr    | Kommunalt pasient- og brukerregister    | FHI               |
| hkr    | Hjerte- og karsykdommer                 | FHI               |
| skast  | Skadedyrstatistikk                      | FHI               |

### Endpoints

| Method | Path                                    | Purpose                   |
|--------|-----------------------------------------|---------------------------|
| GET    | `/Common/source`                        | List all sources          |
| GET    | `/{sourceId}/Table`                     | List tables in source     |
| GET    | `/{sourceId}/Table/{tableId}`           | Table info                |
| GET    | `/{sourceId}/Table/{tableId}/query`     | Query template            |
| GET    | `/{sourceId}/Table/{tableId}/dimension` | Dimensions and categories |
| POST   | `/{sourceId}/Table/{tableId}/data`      | Fetch data                |
| GET    | `/{sourceId}/Table/{tableId}/flag`      | Flag/symbol definitions   |
| GET    | `/{sourceId}/Table/{tableId}/metadata`  | Table metadata            |

### Filter Types

| Filter   | Description                  | Example values              |
|----------|------------------------------|-----------------------------|
| `item`   | Exact match on listed values | `["2020_2020","2021_2021"]` |
| `all`    | Wildcard match with `*`      | `["*"]` or `["A*","B*"]`    |
| `top`    | First N categories           | `["5"]`                     |
| `bottom` | Last N categories            | `["5"]`                     |
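The four filter types can be illustrated with a small client-side sketch. This is an assumption-laden illustration of the semantics, not API code: the API applies these filters server-side, `apply_filter` is a hypothetical helper, and shell-style `*` matching is assumed for `all`.

```python
from fnmatch import fnmatch

def apply_filter(categories: list[str], filter_type: str, values: list[str]) -> list[str]:
    """Illustrate the four filter semantics on an ordered category-code list."""
    if filter_type == "item":
        return [c for c in categories if c in values]
    if filter_type == "all":
        # Shell-style wildcard matching, e.g. "202*" or "*"
        return [c for c in categories if any(fnmatch(c, pat) for pat in values)]
    if filter_type == "top":
        return categories[: int(values[0])]
    if filter_type == "bottom":
        return categories[-int(values[0]):]
    raise ValueError(f"unknown filter type: {filter_type}")

years = ["2020_2020", "2021_2021", "2022_2022", "2023_2023"]
apply_filter(years, "item", ["2021_2021"])   # → ["2021_2021"]
apply_filter(years, "bottom", ["2"])         # → ["2022_2022", "2023_2023"]
```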

### Response Formats (data endpoint)

| Format     | Content-Type                   | Description                      |
|------------|--------------------------------|----------------------------------|
| json-stat2 | application/json               | JSON-Stat 2.0 sparse array       |
| csv2       | text/csv                       | CSV with human-readable labels   |
| csv3       | text/csv                       | CSV with dimension/measure codes |
| parquet    | application/vnd.apache.parquet | Apache Parquet columnar format   |

## MCP Tool Design

### Tool 1: `list_sources`

**Purpose**: Entry point. List all available data sources.

**Parameters**: None.

**Returns**: Array of `{id, title, description, published_by}`.

**Implementation**: GET `/Common/source`. Pass through with minor field renaming
(snake_case).

**Caching**: Cache for 24 hours. The source list rarely changes.

---

### Tool 2: `list_tables`

**Purpose**: Find tables within a source, with optional keyword search.

**Parameters**:
- `source_id` (string, required) -- Source identifier, e.g. `"nokkel"`.
- `search` (string, optional) -- Case-insensitive keyword filter on table title.
  Supports multiple words (all must match). Applied client-side.
- `modified_after` (string, optional) -- ISO-8601 datetime. Only return tables
  modified after this date. Passed to the API server-side.

**Returns**: Array of `{table_id, title, published_at, modified_at}`.

**Implementation**: GET `/{sourceId}/Table?modifiedAfter=...`, then client-side
filter on `search`. Sort by `modified_at` descending.

**Caching**: Cache per source_id for 1 hour. Table lists update throughout the
day as data is published.

**Example**:
```
list_tables(source_id="nokkel", search="befolkning")
→ [{table_id: 185, title: "Befolkningsvekst", ...},
   {table_id: 338, title: "Befolkningssammensetning_antall_andel", ...},
   {table_id: 171, title: "Befolkningsframskriving", ...}]
```

---

### Tool 3: `describe_table`

**Purpose**: The primary tool for understanding a table's structure. Gives the
agent everything it needs to construct a data query.

**Parameters**:
- `source_id` (string, required)
- `table_id` (integer, required)

**Returns**: A structured summary combining table info, dimensions, metadata,
and flags. This is a composite call (4 parallel API requests).

**Response structure**:
```
{
  "title": "Befolkningsvekst",
  "published_at": "2025-10-21T08:56:39Z",
  "modified_at": "2025-10-21T08:56:39Z",
  "is_official_statistics": false,
  "description": "Differansen mellom befolkningsmengden...",
  "update_frequency": "Årlig",
  "keywords": ["Befolkning", "Befolkningsvekst"],
  "source_institution": "Statistisk sentralbyrå (SSB)",
  "dimensions": [
    {
      "code": "GEO",
      "label": "Geografi",
      "total_categories": 356,
      "is_hierarchical": true,
      "hierarchy_depth": 4,
      "top_level_values": [
        {"value": "0", "label": "Hele landet", "child_count": 15}
      ],
      "note": "Use get_dimension_values to drill into sub-levels"
    },
    {
      "code": "AAR",
      "label": "År",
      "total_categories": 23,
      "is_hierarchical": false,
      "value_format": "YYYY_YYYY (e.g. 2020_2020)",
      "range": "2002..2024",
      "values": ["2002_2002", "2003_2003", ..., "2024_2024"]
    },
    {
      "code": "KJONN",
      "label": "Kjønn",
      "total_categories": 1,
      "is_fixed": true,
      "values": [{"value": "0", "label": "kjønn samlet"}],
      "note": "Single-valued, auto-included in queries"
    },
    {
      "code": "ALDER",
      "label": "Alder",
      "total_categories": 1,
      "is_fixed": true,
      "values": [{"value": "0_120", "label": "alle aldre"}],
      "note": "Single-valued, auto-included in queries"
    },
    {
      "code": "MEASURE_TYPE",
      "label": "Måltall",
      "total_categories": 2,
      "is_fixed": false,
      "values": [
        {"value": "TELLER", "label": "antall"},
        {"value": "RATE", "label": "prosent vekst"}
      ]
    }
  ],
  "flags": [
    {"symbol": "", "description": "Verdi finnes i tabellen"}
  ]
}
```

**Key design decisions**:

1. **Summarize large dimensions** -- For dimensions with >20 categories (mainly
   GEO), show only top-level entries with child counts. The agent uses
   `get_dimension_values` to drill down.

2. **Mark fixed dimensions** -- Dimensions with exactly 1 category get
   `is_fixed: true`. The agent knows to ignore these; `query_data` will
   auto-include them.

3. **Show value format** -- AAR values are `"2020_2020"`, not `"2020"`. Show
   this explicitly so the agent gets the format right.

4. **Include metadata inline** -- Strip HTML from metadata paragraphs. Extract
   `description`, `keywords`, `update_frequency`, `source_institution` as
   top-level fields.

5. **Include flags inline** -- Flag definitions are small and always relevant.

**Implementation**: Parallel fetch of:
- GET `/{sourceId}/Table/{tableId}` (table info)
- GET `/{sourceId}/Table/{tableId}/dimension` (dimensions)
- GET `/{sourceId}/Table/{tableId}/metadata` (metadata)
- GET `/{sourceId}/Table/{tableId}/flag` (flags)

Then merge and transform.
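The parallel fetch can be sketched with `asyncio.gather`. Here `fetch_json` is a stand-in for the real httpx calls that would live in `api_client.py`:

```python
import asyncio

async def fetch_json(path: str) -> dict:
    """Stand-in for an async httpx GET; returns a dummy payload."""
    await asyncio.sleep(0)
    return {"path": path}

async def describe_table(source_id: str, table_id: int) -> dict:
    base = f"/{source_id}/Table/{table_id}"
    # Fire all four requests concurrently and wait for them together
    info, dims, meta, flags = await asyncio.gather(
        fetch_json(base),
        fetch_json(f"{base}/dimension"),
        fetch_json(f"{base}/metadata"),
        fetch_json(f"{base}/flag"),
    )
    return {"info": info, "dimensions": dims, "metadata": meta, "flags": flags}

result = asyncio.run(describe_table("nokkel", 185))
```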

**Caching**: Cache per (source_id, table_id) for 6 hours. Dimension structure
changes rarely.

---

### Tool 4: `get_dimension_values`

**Purpose**: Drill into large hierarchical dimensions, typically GEO.

**Parameters**:
- `source_id` (string, required)
- `table_id` (integer, required)
- `dimension_code` (string, required) -- e.g. `"GEO"`.
- `parent_value` (string, optional) -- Return only children of this category.
  E.g. `"18"` for Nordland county. If omitted, returns top-level categories.
- `search` (string, optional) -- Case-insensitive search on category labels.
  E.g. `"tromsø"` to find the municipality.

**Returns**: Array of `{value, label, child_count}`.

**Implementation**: GET `/{sourceId}/Table/{tableId}/dimension`, then navigate
the category tree client-side. The full tree is fetched and cached; filtering
is done in the MCP server.

**Examples**:
```
# Get the top level
get_dimension_values("nokkel", 185, "GEO")
→ [{value: "0", label: "Hele landet", child_count: 15}]

# Get municipalities in Nordland
get_dimension_values("nokkel", 185, "GEO", parent_value="18")
→ [{value: "1804", label: "Bodø", child_count: 0},
   {value: "1806", label: "Narvik", child_count: 0}, ...]

# Search for a municipality
get_dimension_values("nokkel", 185, "GEO", search="tromsø")
→ [{value: "5501", label: "Tromsø", child_count: 0}]
```

**Caching**: Shares the dimension cache with `describe_table`.

---

### Tool 5: `query_data`

**Purpose**: Fetch actual data from a table. The main data retrieval tool.

**Parameters**:
- `source_id` (string, required)
- `table_id` (integer, required)
- `dimensions` (array, required) -- Each element:
  - `code` (string) -- Dimension code, e.g. `"GEO"`.
  - `filter` (string) -- One of `"item"`, `"all"`, `"top"`, `"bottom"`.
    Default: `"item"`.
  - `values` (array of strings) -- Filter values.
- `max_rows` (integer, optional) -- Limit returned rows. Default: 1000.
  Set to 0 for no limit (be careful).

**Returns**: Structured rows with labeled values.

```
{
  "table": "Befolkningsvekst",
  "total_rows": 4,
  "rows": [
    {"GEO": "Oslo", "AAR": "2023", "KJONN": "kjønn samlet",
     "ALDER": "alle aldre", "TELLER": 3516, "RATE": 0.5},
    ...
  ],
  "truncated": false,
  "dimensions_used": {
    "GEO": {"filter": "item", "values": ["0301"]},
    "AAR": {"filter": "bottom", "values": ["2"]},
    "KJONN": {"filter": "item", "values": ["0"]},
    "ALDER": {"filter": "item", "values": ["0_120"]},
    "MEASURE_TYPE": {"filter": "all", "values": ["*"]}
  }
}
```

**Key design decisions**:

1. **Default to csv2 internally** -- Fetch as csv2 (human-readable labels) and
   parse it into rows. CSV is simpler for an agent to reason about than
   JSON-stat2.

2. **Auto-include fixed dimensions** -- If the agent omits a dimension that has
   only 1 category (like KJONN or ALDER), the tool adds it automatically with
   `filter: "item"` and the single value. The agent only needs to specify the
   dimensions it actually cares about.

3. **Normalize year values** -- If the agent sends `"2020"` for AAR, the tool
   translates it to `"2020_2020"`. The `YYYY_YYYY` format is an internal API
   convention the agent shouldn't need to know about.

4. **Default MEASURE_TYPE** -- If omitted, default to `filter: "all", values:
   ["*"]` to get all measures. Most agents want all available metrics.

5. **Row limit with truncation flag** -- Default 1000 rows. Return a
   `truncated: true` flag and a `total_rows` count so the agent knows if
   there's more data.

6. **Echo back dimensions_used** -- Show what was actually sent to the API
   (after auto-completion), so the agent can see the full query.

**Implementation**:
1. Fetch dimension info if not cached (to know fixed dimensions and validate)
2. Auto-complete omitted fixed dimensions
3. Normalize year values
4. POST `/{sourceId}/Table/{tableId}/data` with format=csv2
5. Parse the CSV response into row objects
6. Apply the row limit, compute truncation
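Steps 2 and 3 can be sketched as pure functions. The helper names are hypothetical, and the `AAR`/`MEASURE_TYPE` conventions are the ones described above:

```python
import re

def normalize_year(dim_code: str, value: str) -> str:
    """Translate a bare year like "2020" to the API's "2020_2020" convention."""
    if dim_code == "AAR" and re.fullmatch(r"\d{4}", value):
        return f"{value}_{value}"
    return value

def auto_complete(requested: dict[str, dict],
                  all_dims: dict[str, list[str]]) -> dict[str, dict]:
    """Add omitted single-valued dimensions; default MEASURE_TYPE to all/*."""
    query = dict(requested)
    for code, categories in all_dims.items():
        if code in query:
            continue
        if len(categories) == 1:
            query[code] = {"filter": "item", "values": categories}
        elif code == "MEASURE_TYPE":
            query[code] = {"filter": "all", "values": ["*"]}
    return query

dims = {"GEO": ["0", "0301"], "AAR": ["2020_2020"],
        "KJONN": ["0"], "MEASURE_TYPE": ["TELLER", "RATE"]}
q = auto_complete({"GEO": {"filter": "item", "values": ["0301"]},
                   "AAR": {"filter": "item",
                           "values": [normalize_year("AAR", "2020")]}}, dims)
# q now also contains KJONN (fixed) and MEASURE_TYPE (all/*)
```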

**Error handling**: The API returns ProblemDetails (RFC 7807) on 400/404/422.
Transform these into clear error messages:
- "Dimension 'XYZ' is not valid for this table. Available: GEO, AAR, ..."
- "Value '2025_2025' not found in dimension AAR. Range: 2002..2024"
- "maxRowCount exceeded. Requested ~50000 rows, limit is 1000. Narrow filters."

---

### Tool 6: `get_query_template`

**Purpose**: Fallback tool returning the raw query template from the API. Useful
when the agent needs to see exactly what the API expects.

**Parameters**:
- `source_id` (string, required)
- `table_id` (integer, required)

**Returns**: The raw DataRequest JSON as returned by the API.

**Implementation**: GET `/{sourceId}/Table/{tableId}/query`. Pass through.

**When to use**: When `query_data` auto-completion isn't behaving as expected,
or when the agent wants to see the complete list of available values for all
dimensions.

---

## Tools NOT included (and why)

| Considered tool               | Decision | Reason                                       |
|-------------------------------|----------|----------------------------------------------|
| `get_flags` (standalone)      | Dropped  | Folded into `describe_table`                 |
| `get_metadata` (standalone)   | Dropped  | Folded into `describe_table`                 |
| `get_table_info` (standalone) | Dropped  | Folded into `describe_table`                 |
| `search_across_sources`       | Dropped  | Too expensive (13 API calls). Agent can call `list_tables` per source |
| `get_data_jsonstat`           | Dropped  | Agents don't need raw JSON-stat2             |
| `get_data_parquet`            | Dropped  | Binary format, not useful for LLM context    |

## Architecture

### Stack

- **Language**: Python 3.12+
- **MCP framework**: FastMCP (`mcp[cli]`)
- **HTTP server**: Uvicorn (`uvicorn>=0.30`) for SSE/HTTP transport
- **HTTP client**: `httpx` (async)
- **CSV parsing**: stdlib `csv`
- **HTML stripping**: stdlib `html.parser` or `re` (simple tag removal)
- **Build system**: Hatchling (matches the Fhi.Metadata.MCPserver pattern)
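Stdlib `html.parser` is enough for stripping the HTML out of metadata fields; a sketch:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text content, dropping tags like <p>, <a>, <ol>."""
    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)

def strip_html(fragment: str) -> str:
    extractor = TextExtractor()
    extractor.feed(fragment)
    # Join fragments and collapse runs of whitespace
    return " ".join(" ".join(extractor.parts).split())

strip_html("<p>Oppdateres <b>årlig</b></p>")   # → "Oppdateres årlig"
```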

### Transport

The server supports multiple transports via CLI flag, following the same pattern
as `Fhi.Metadata.MCPserver`:

| Transport         | Use case                       | Endpoint     |
|-------------------|--------------------------------|--------------|
| `sse`             | Local dev + Skybert deployment | `/sse`       |
| `streamable-http` | Future HTTP-only clients       | `/mcp`       |
| `stdio`           | Direct pipe (legacy)           | stdin/stdout |

**Default**: `sse` on `0.0.0.0:8000`. This means the server works over HTTP
both locally and when deployed to Skybert, with no transport change needed.

**CLI entry point**:
```bash
fhi-statistikk-mcp --transport sse --host 0.0.0.0 --port 8000
```

### Project Structure

```
fhi-statistikk-mcp/
├── .github/
│   └── workflows/
│       └── docker-build-push.yaml   # CI/CD → crfhiskybert.azurecr.io
├── .mcp.json.local                  # Local dev: http://localhost:8000/sse
├── .mcp.json.public                 # Production: https://<skybert-url>/sse
├── Dockerfile                       # Multi-stage, Python 3.12-slim
├── pyproject.toml                   # Hatchling build, entry point
├── README.md
├── src/
│   └── fhi_statistikk_mcp/
│       ├── __init__.py
│       ├── server.py                # MCP server, tool definitions, main()
│       ├── api_client.py            # Async httpx client for FHI API
│       ├── transformers.py          # CSV parsing, dimension summarization
│       └── cache.py                 # Simple TTL cache
└── tests/
    ├── test_transformers.py
    ├── test_cache.py
    └── fixtures/                    # Recorded API responses
        ├── sources.json
        ├── tables_nokkel.json
        ├── dimensions_185.json
        ├── metadata_185.json
        ├── flags_185.json
        └── data_185.csv
```

### MCP Client Configuration

**Local development** (`.mcp.json.local`):
```json
{
  "mcpServers": {
    "fhi-statistikk": {
      "type": "sse",
      "url": "http://localhost:8000/sse"
    }
  }
}
```

**Production** (`.mcp.json.public`):
```json
{
  "mcpServers": {
    "fhi-statistikk": {
      "type": "sse",
      "url": "https://<skybert-url>/sse"
    }
  }
}
```

### Dockerfile

Following the Fhi.Metadata.MCPserver pattern:
```dockerfile
FROM python:3.12-slim AS base
WORKDIR /app
COPY pyproject.toml .
COPY src/ src/
RUN pip install --no-cache-dir .

FROM base AS prod
EXPOSE 8000
CMD ["fhi-statistikk-mcp", "--transport", "sse", "--host", "0.0.0.0", "--port", "8000"]
```

### CI/CD

Same pipeline pattern as Fhi.Metadata.MCPserver:
- Trigger on push to `main` touching `src/`, `Dockerfile`, or `pyproject.toml`
- Azure Federated Identity (OIDC) login
- Push to `crfhiskybert.azurecr.io/fida/ki/statistikk-mcp`
- Tag: git short SHA + `latest`
- Dispatch to the GitOps repo for Skybert deployment

### Logging

Force all loggers (uvicorn, mcp, fastmcp) to stderr with a simple format.
Print startup info (API base URL, cache status) to stderr. No persistent log
files -- container logging handles that on Skybert.

### Caching Strategy

| Data            | TTL      | Key                     | Reason                         |
|-----------------|----------|-------------------------|--------------------------------|
| Source list     | 24h      | `"sources"`             | Rarely changes                 |
| Table list      | 1h       | `source_id`             | New tables published daily     |
| Dimensions      | 6h       | `(source_id, table_id)` | Dimension structure is stable  |
| Metadata        | 6h       | `(source_id, table_id)` | Metadata edits are rare        |
| Flags           | 6h       | `(source_id, table_id)` | Flags rarely change            |
| Query templates | 6h       | `(source_id, table_id)` | Follows dimension changes      |
| Data responses  | No cache | --                      | Queries vary too much to cache |

An in-memory dict with TTL is enough. No external dependency needed -- the data
volume is small and the server is single-process.
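A minimal sketch of what `cache.py` could contain (the class name is an assumption):

```python
import time

class TTLCache:
    """Minimal in-memory TTL cache; enough for a single-process server."""
    def __init__(self) -> None:
        self._store: dict = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._store[key]       # lazily evict on read
            return None
        return value

    def set(self, key, value, ttl_seconds: float) -> None:
        self._store[key] = (time.monotonic() + ttl_seconds, value)

cache = TTLCache()
cache.set(("nokkel", 185), {"dims": []}, ttl_seconds=6 * 3600)
cache.get(("nokkel", 185))   # → {"dims": []}
```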

### Rate Limiting

No documented rate limits, but this is a government API. Be polite:
- Max 5 concurrent requests
- 100ms minimum between requests
- Retry with exponential backoff on 429/503
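The first two constraints can be sketched as an async context manager (a sketch; the class name is hypothetical and retry/backoff is omitted):

```python
import asyncio
import time

class PoliteLimiter:
    """Cap concurrency and enforce a minimum gap between request starts."""
    def __init__(self, max_concurrent: int = 5, min_interval: float = 0.1) -> None:
        self._sem = asyncio.Semaphore(max_concurrent)
        self._min_interval = min_interval
        self._last_start = 0.0
        self._lock = asyncio.Lock()

    async def __aenter__(self):
        await self._sem.acquire()
        async with self._lock:         # serialize the spacing bookkeeping
            wait = self._last_start + self._min_interval - time.monotonic()
            if wait > 0:
                await asyncio.sleep(wait)
            self._last_start = time.monotonic()
        return self

    async def __aexit__(self, *exc) -> bool:
        self._sem.release()
        return False

async def demo() -> int:
    limiter = PoliteLimiter(min_interval=0.01)
    async def call(i: int) -> int:
        async with limiter:            # each request goes through the gate
            return i
    results = await asyncio.gather(*(call(i) for i in range(3)))
    return sum(results)

asyncio.run(demo())   # → 3
```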

### Error Mapping

| API Response             | MCP Tool Error                                      |
|--------------------------|-----------------------------------------------------|
| 400 Bad Request          | Descriptive message from ProblemDetails.detail      |
| 404 Not Found            | "Source/table not found: {id}"                      |
| 422 Unprocessable Entity | "Query validation failed: {detail}"                 |
| Network timeout          | "API request timed out. Try reducing query scope."  |
| CSV parse error          | "Failed to parse response. Try get_query_template." |
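The mapping can be sketched as a small function (`detail` and `title` are standard RFC 7807 fields; the exact message strings are illustrative):

```python
def to_tool_error(status: int, problem: dict) -> str:
    """Map an RFC 7807 ProblemDetails body to an agent-friendly message."""
    detail = problem.get("detail", "")
    if status == 404:
        return f"Source/table not found: {detail or problem.get('title', 'unknown')}"
    if status == 422:
        return f"Query validation failed: {detail}"
    if status == 400:
        return detail or "Bad request"
    return f"API error {status}: {detail}"

to_tool_error(422, {"title": "Unprocessable Entity",
                    "detail": "Value '2025_2025' not found in dimension AAR"})
# → "Query validation failed: Value '2025_2025' not found in dimension AAR"
```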

### Unicode / Fuzzy Search

Dimension value search (in `get_dimension_values`) normalizes both the query and
the labels for accent-insensitive matching:
- Normalize with `unicodedata.normalize("NFD")` and strip combining marks
  (handles `å` → `a`)
- Explicitly map `æ` → `a` and `ø` → `o`, since these have no NFD decomposition
- Case-insensitive comparison
- `"tromso"` matches `"Tromsø"`, `"barum"` matches `"Bærum"`
- Preserve original labels in output
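A sketch of the folding function (the explicit `æ`/`ø` replacement table is this sketch's assumption; `å` decomposes under NFD and needs no special case):

```python
import unicodedata

# æ and ø have no canonical decomposition, so strip-combining-marks
# alone would leave them intact; map them explicitly.
_EXTRA = str.maketrans({"æ": "a", "ø": "o"})

def fold(text: str) -> str:
    """Lowercase, drop Norwegian specials, then strip combining marks."""
    text = text.lower().translate(_EXTRA)
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

fold("Tromsø") == fold("tromso")   # → True
fold("Bærum") == fold("barum")     # → True
```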

## Implementation Plan

### Phase 1: Core (MVP)

1. Set up the project skeleton: `pyproject.toml` with hatchling, `src/` layout,
   entry point `fhi-statistikk-mcp`
2. Set up `server.py` with FastMCP, SSE transport, CLI args (transport, host,
   port), stderr logging
3. Implement `api_client.py` with an async httpx client and base URL config
4. Implement `cache.py` with a simple TTL dict
5. Implement the `list_sources` tool
6. Implement the `list_tables` tool with client-side keyword search
7. Implement the `describe_table` composite tool
   - Parallel fetch of 4 endpoints
   - Dimension summarization (large dimension truncation, fixed dimension
     detection)
   - HTML stripping for metadata
   - Merge into a structured response
8. Implement the `query_data` tool
   - Auto-completion of fixed dimensions
   - Year value normalization (`"2020"` → `"2020_2020"`)
   - Default MEASURE_TYPE to `all`/`["*"]`
   - CSV parsing and row structuring
   - Row limit and truncation
9. Implement `get_dimension_values` with hierarchy navigation and
   accent-insensitive search
10. Implement the `get_query_template` passthrough
11. Add `.mcp.json.local` for local dev
12. Test all tools against the live API

### Phase 2: Deployment & Polish

13. Add `Dockerfile` (multi-stage, Python 3.12-slim)
14. Add `.github/workflows/docker-build-push.yaml` for CI/CD
15. Add `.mcp.json.public` with the Skybert URL
16. Add comprehensive error handling and error messages
17. Add rate limiting
18. Record API fixtures for offline testing
19. Write unit tests for the transformers and cache
20. Write integration tests against the live API

### Phase 3: Optional Enhancements

21. Add a `search_all_tables` convenience tool (if agents frequently need it)
22. Add MCP resources for static reference data (source descriptions, common
    dimension codes)
23. Add MCP prompt templates (e.g. "finn helsedata om <topic>" -- "find health
    data about <topic>")

## Tool Description Guidelines

MCP tool descriptions are what the agent uses to decide which tool to call. They
should be written for an LLM audience:

- Lead with the purpose, not the endpoint
- Include example parameter values
- Document non-obvious conventions (year format, dimension codes)
- Mention what `describe_table` returns, since it's the prerequisite for
  `query_data`
- Note that Norwegian labels are the default (GEO labels are in Norwegian)

### Example tool description for `query_data`

> Fetch statistical data from an FHI table. Before calling this, use
> `describe_table` to understand the table's dimensions and available values.
>
> You only need to specify the dimensions you care about. Fixed dimensions
> (single-valued, like KJONN="kjønn samlet") are auto-included. If you omit
> MEASURE_TYPE, all measures are returned.
>
> Year values: use "2020" (auto-translated to "2020_2020") or the full format.
>
> Filters: "item" (exact values), "all" (wildcard, e.g. ["*"]),
> "top" (first N), "bottom" (last N).
>
> Returns labeled rows, max 1000 by default. Check the "truncated" field.

## Resolved Decisions

| Question | Decision | Rationale |
|----------|----------|-----------|
| Hosting | SSE locally, same for Skybert | Follow the Fhi.Metadata.MCPserver pattern. HTTP from day one, no transport change on deploy. |
| JSON-stat2 output | No | csv2 is sufficient for LLM agents. JSON-stat2 is for statistical software. |
| Fuzzy dimension search | Yes, accent-insensitive | Norwegian chars (æøå) will trip up agents. Normalize NFD + strip combining marks, with explicit æ/ø mappings. |
| Sample data in describe_table | No | Adds latency. Agent calls `query_data` with `max_rows=5` if it wants a preview. |