fhi-statistikk-mcp/mcp-server-plan.md
Torbjørn Lindahl 817b90420f added repo
2026-03-27 17:01:55 +01:00

MCP Server for FHI Statistikk Open API

Overview

An MCP (Model Context Protocol) server that exposes the FHI Statistikk Open API as tools optimized for AI agent consumption. The server wraps the REST API at https://statistikk-data.fhi.no/api/open/v1/ and adds intelligent summarization, format translation, and convenience features that make the API practical for LLM-based agents.

  • This repo: /home/tlind/git/fhi/openapi-mcp
  • API documentation repo: /home/tlind/git/fhi/Fhi.Statistikk.OpenAPI (code samples, Postman collection, user guide)
  • Base API: https://statistikk-data.fhi.no/api/open/v1/
  • API docs: https://statistikk-data.fhi.no/swagger/index.html?urls.primaryName=Allvis%20Open%20API
  • License: CC BY 4.0 (open data)
  • Auth: None required

Problem Statement

The raw API has several characteristics that make it hard for AI agents to use:

  1. JSON-stat2 format -- The data endpoint returns a multidimensional sparse array format designed for statistical software, not LLMs.
  2. Mandatory dimension specification -- All dimensions must be included in every data query, even single-valued ones like KJONN=["0"].
  3. Non-obvious value formats -- Year values use "2020_2020" not "2020".
  4. Massive dimension trees -- The GEO dimension can have 400+ entries in a hierarchical tree (country > county > municipality > city district).
  5. Multi-step discovery -- Finding relevant data requires: list sources > list tables > get dimensions > construct query > fetch data.
  6. Metadata contains raw HTML -- <p>, <a>, <ol> tags in content fields.
  7. Swagger spec is incomplete -- It documents only the "item" filter, but the API actually supports "item", "all", "top", and "bottom".

API Inventory

Sources (as of 2026-03-27)

| ID | Title | Publisher |
|---|---|---|
| nokkel | Folkehelsestatistikk | Helsedirektoratet |
| ngs | Mikrobiologisk genomovervåkning | FHI |
| mfr | Medisinsk fødselsregister | FHI |
| abr | Abortregisteret | FHI |
| sysvak | Nasjonalt vaksinasjonsregister SYSVAK | FHI |
| daar | Dødsårsakregisteret | FHI |
| msis | Meldingssystem for smittsomme sykdommer | FHI |
| lmr | Legemiddelregisteret | FHI |
| gs | Grossiststatistikk | FHI |
| npr | Norsk pasientregister | FHI |
| kpr | Kommunalt pasient- og brukerregister | FHI |
| hkr | Hjerte- og karsykdommer | FHI |
| skast | Skadedyrstatistikk | FHI |

Endpoints

| Method | Path | Purpose |
|---|---|---|
| GET | /Common/source | List all sources |
| GET | /{sourceId}/Table | List tables in source |
| GET | /{sourceId}/Table/{tableId} | Table info |
| GET | /{sourceId}/Table/{tableId}/query | Query template |
| GET | /{sourceId}/Table/{tableId}/dimension | Dimensions and categories |
| POST | /{sourceId}/Table/{tableId}/data | Fetch data |
| GET | /{sourceId}/Table/{tableId}/flag | Flag/symbol definitions |
| GET | /{sourceId}/Table/{tableId}/metadata | Table metadata |

Filter Types

| Filter | Description | Example values |
|---|---|---|
| item | Exact match on listed values | ["2020_2020","2021_2021"] |
| all | Wildcard match with * | ["*"] or ["A*","B*"] |
| top | First N categories | ["5"] |
| bottom | Last N categories | ["5"] |
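Combined in a request body, these filters might look like the following sketch. The exact field names should be confirmed against GET /{sourceId}/Table/{tableId}/query; the `response` options shown here are an assumption:

```python
# Illustrative data-request body mixing filter types.  Field names are
# assumptions for this sketch -- take the authoritative shape from the
# table's query-template endpoint.
payload = {
    "dimensions": [
        {"code": "GEO", "filter": "item", "values": ["0301"]},       # Oslo only
        {"code": "AAR", "filter": "bottom", "values": ["2"]},        # two most recent years
        {"code": "MEASURE_TYPE", "filter": "all", "values": ["*"]},  # every measure
    ],
    "response": {"format": "csv2", "maxRowCount": 1000},
}
```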

Response Formats (data endpoint)

| Format | Content-Type | Description |
|---|---|---|
| json-stat2 | application/json | JSON-Stat 2.0 sparse array |
| csv2 | text/csv | CSV with human-readable labels |
| csv3 | text/csv | CSV with dimension/measure codes |
| parquet | application/vnd.apache.parquet | Apache Parquet columnar format |

MCP Tool Design

Tool 1: list_sources

Purpose: Entry point. List all available data sources.

Parameters: None.

Returns: Array of {id, title, description, published_by}.

Implementation: GET /Common/source. Pass through with minor field renaming (snake_case).

Caching: Cache for 24 hours. Source list rarely changes.


Tool 2: list_tables

Purpose: Find tables within a source, with optional keyword search.

Parameters:

  • source_id (string, required) -- Source identifier, e.g. "nokkel".
  • search (string, optional) -- Case-insensitive keyword filter on table title. Supports multiple words (all must match). Applied client-side.
  • modified_after (string, optional) -- ISO-8601 datetime. Only return tables modified after this date. Passed to API server-side.

Returns: Array of {table_id, title, published_at, modified_at}.

Implementation: GET /{sourceId}/Table?modifiedAfter=..., then client-side filter on search. Sort by modified_at descending.

Caching: Cache per source_id for 1 hour. Table lists update throughout the day as data is published.

Example:

list_tables(source_id="nokkel", search="befolkning")
→ [{table_id: 185, title: "Befolkningsvekst", ...},
   {table_id: 338, title: "Befolkningssammensetning_antall_andel", ...},
   {table_id: 171, title: "Befolkningsframskriving", ...}]

Tool 3: describe_table

Purpose: The primary tool for understanding a table's structure. Gives the agent everything it needs to construct a data query.

Parameters:

  • source_id (string, required)
  • table_id (integer, required)

Returns: A structured summary combining table info, dimensions, metadata, and flags. This is a composite call (4 parallel API requests).

Response structure:

{
  "title": "Befolkningsvekst",
  "published_at": "2025-10-21T08:56:39Z",
  "modified_at": "2025-10-21T08:56:39Z",
  "is_official_statistics": false,
  "description": "Differansen mellom befolkningsmengden...",
  "update_frequency": "Årlig",
  "keywords": ["Befolkning", "Befolkningsvekst"],
  "source_institution": "Statistisk sentralbyrå (SSB)",
  "dimensions": [
    {
      "code": "GEO",
      "label": "Geografi",
      "total_categories": 356,
      "is_hierarchical": true,
      "hierarchy_depth": 4,
      "top_level_values": [
        {"value": "0", "label": "Hele landet", "child_count": 15}
      ],
      "note": "Use get_dimension_values to drill into sub-levels"
    },
    {
      "code": "AAR",
      "label": "År",
      "total_categories": 23,
      "is_hierarchical": false,
      "value_format": "YYYY_YYYY (e.g. 2020_2020)",
      "range": "2002..2024",
      "values": ["2002_2002", "2003_2003", ..., "2024_2024"]
    },
    {
      "code": "KJONN",
      "label": "Kjønn",
      "total_categories": 1,
      "is_fixed": true,
      "values": [{"value": "0", "label": "kjønn samlet"}],
      "note": "Single-valued, auto-included in queries"
    },
    {
      "code": "ALDER",
      "label": "Alder",
      "total_categories": 1,
      "is_fixed": true,
      "values": [{"value": "0_120", "label": "alle aldre"}],
      "note": "Single-valued, auto-included in queries"
    },
    {
      "code": "MEASURE_TYPE",
      "label": "Måltall",
      "total_categories": 2,
      "is_fixed": false,
      "values": [
        {"value": "TELLER", "label": "antall"},
        {"value": "RATE", "label": "prosent vekst"}
      ]
    }
  ],
  "flags": [
    {"symbol": "", "description": "Verdi finnes i tabellen"}
  ]
}

Key design decisions:

  1. Summarize large dimensions -- For dimensions with >20 categories (mainly GEO), show only top-level entries with child counts. The agent uses get_dimension_values to drill down.

  2. Mark fixed dimensions -- Dimensions with exactly 1 category get is_fixed: true. The agent knows to ignore these; query_data will auto-include them.

  3. Show value format -- AAR values are "2020_2020", not "2020". Show this explicitly so the agent gets the format right.

  4. Include metadata inline -- Strip HTML from metadata paragraphs. Extract description, keywords, update_frequency, source_institution as top-level fields.

  5. Include flags inline -- Flag definitions are small and always relevant.
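The HTML stripping in decision 4 can be done with the stdlib parser; a minimal sketch (`strip_html` is a hypothetical helper name):

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects text content, dropping tags like <p>, <a>, <ol>."""
    def __init__(self):
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)

def strip_html(fragment: str) -> str:
    extractor = _TextExtractor()
    extractor.feed(fragment)
    # Join text nodes and collapse runs of whitespace left by removed tags
    return " ".join(" ".join(extractor.parts).split())

strip_html("<p>Kilde: <a href='#'>SSB</a></p>")  # → "Kilde: SSB"
```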

Implementation: Parallel fetch of:

  • GET /{sourceId}/Table/{tableId} (table info)
  • GET /{sourceId}/Table/{tableId}/dimension (dimensions)
  • GET /{sourceId}/Table/{tableId}/metadata (metadata)
  • GET /{sourceId}/Table/{tableId}/flag (flags)

Then merge and transform.
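The parallel fetch can be sketched with asyncio.gather. Here `get_json` stands in for an httpx-based request function and is an assumption of this sketch:

```python
import asyncio
from typing import Awaitable, Callable

def bundle_paths(source_id: str, table_id: int) -> list[str]:
    """Relative paths for the four describe_table requests."""
    base = f"/{source_id}/Table/{table_id}"
    return [base, base + "/dimension", base + "/metadata", base + "/flag"]

async def fetch_table_bundle(
    source_id: str, table_id: int, get_json: Callable[[str], Awaitable[dict]]
) -> dict:
    """Run the four requests concurrently; in the real server, get_json
    would wrap httpx.AsyncClient.get against the API base URL."""
    info, dims, meta, flags = await asyncio.gather(
        *(get_json(p) for p in bundle_paths(source_id, table_id))
    )
    return {"info": info, "dimensions": dims, "metadata": meta, "flags": flags}
```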

Caching: Cache per (source_id, table_id) for 6 hours. Dimension structure changes rarely.


Tool 4: get_dimension_values

Purpose: Drill into large hierarchical dimensions, typically GEO.

Parameters:

  • source_id (string, required)
  • table_id (integer, required)
  • dimension_code (string, required) -- e.g. "GEO".
  • parent_value (string, optional) -- Return only children of this category. E.g. "18" for Nordland county. If omitted, returns top-level categories.
  • search (string, optional) -- Case-insensitive search on category labels. E.g. "tromsø" to find the municipality.

Returns: Array of {value, label, child_count}.

Implementation: GET /{sourceId}/Table/{tableId}/dimension, then navigate the category tree client-side. The full tree is fetched and cached; filtering is done in the MCP server.

Examples:

# Get the top level (whole country; its 15 children are the counties)
get_dimension_values("nokkel", 185, "GEO")
→ [{value: "0", label: "Hele landet", child_count: 15}]

# Get municipalities in Nordland
get_dimension_values("nokkel", 185, "GEO", parent_value="18")
→ [{value: "1804", label: "Bodø", child_count: 0},
   {value: "1806", label: "Narvik", child_count: 0}, ...]

# Search for a municipality
get_dimension_values("nokkel", 185, "GEO", search="tromsø")
→ [{value: "5501", label: "Tromsø", child_count: 0}]

Caching: Shares the dimension cache with describe_table.


Tool 5: query_data

Purpose: Fetch actual data from a table. The main data retrieval tool.

Parameters:

  • source_id (string, required)
  • table_id (integer, required)
  • dimensions (array, required) -- Each element:
    • code (string) -- Dimension code, e.g. "GEO".
    • filter (string) -- One of "item", "all", "top", "bottom". Default: "item".
    • values (array of strings) -- Filter values.
  • max_rows (integer, optional) -- Limit returned rows. Default: 1000. Set to 0 for no limit (be careful).

Returns: Structured rows with labeled values.

{
  "table": "Befolkningsvekst",
  "total_rows": 4,
  "rows": [
    {"GEO": "Oslo", "AAR": "2023", "KJONN": "kjønn samlet",
     "ALDER": "alle aldre", "TELLER": 3516, "RATE": 0.5},
    ...
  ],
  "truncated": false,
  "dimensions_used": {
    "GEO": {"filter": "item", "values": ["0301"]},
    "AAR": {"filter": "bottom", "values": ["2"]},
    "KJONN": {"filter": "item", "values": ["0"]},
    "ALDER": {"filter": "item", "values": ["0_120"]},
    "MEASURE_TYPE": {"filter": "all", "values": ["*"]}
  }
}

Key design decisions:

  1. Default to csv2 internally -- Fetch as csv2 (human-readable labels), parse into rows. CSV is simpler for an agent to reason about than JSON-stat2. The tool internally requests csv2 and structures it.

  2. Auto-include fixed dimensions -- If the agent omits a dimension that has only 1 category (like KJONN or ALDER), the tool adds it automatically with filter: "item" and the single value. This means the agent only needs to specify the dimensions it actually cares about.

  3. Normalize year values -- If the agent sends "2020" for AAR, the tool translates to "2020_2020". The YYYY_YYYY format is an internal API convention the agent shouldn't need to know about.

  4. Default MEASURE_TYPE -- If omitted, default to filter: "all", values: ["*"] to get all measures. Most agents want all available metrics.

  5. Row limit with truncation flag -- Default 1000 rows. Return a truncated: true flag and total_rows count so the agent knows if there's more data.

  6. Echo back dimensions_used -- Show what was actually sent to the API (after auto-completion), so the agent can see the full query.
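Decisions 2–4 can be sketched together. The shape of `table_dims` (a code-to-available-values mapping) and the helper names are assumptions of this sketch:

```python
import re

def normalize_year(code: str, value: str) -> str:
    """Translate bare years ("2020") to the API's "2020_2020" convention
    for the AAR dimension; leave all other values untouched."""
    if code == "AAR" and re.fullmatch(r"\d{4}", value):
        return f"{value}_{value}"
    return value

def complete_query(requested: list[dict], table_dims: dict[str, list[str]]) -> list[dict]:
    """Auto-include single-valued (fixed) dimensions the agent omitted,
    default MEASURE_TYPE to all/["*"], and normalize year values."""
    query = {d["code"]: d for d in requested}
    for code, values in table_dims.items():
        if code in query:
            query[code]["values"] = [normalize_year(code, v) for v in query[code]["values"]]
        elif len(values) == 1:          # fixed dimension: auto-include its only value
            query[code] = {"code": code, "filter": "item", "values": values}
        elif code == "MEASURE_TYPE":    # default: return every measure
            query[code] = {"code": code, "filter": "all", "values": ["*"]}
    return list(query.values())
```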

Implementation:

  1. Fetch dimension info if not cached (to know fixed dimensions and validate)
  2. Auto-complete missing/fixed dimensions
  3. Normalize year values
  4. POST /{sourceId}/Table/{tableId}/data with format=csv2
  5. Parse CSV response into row objects
  6. Apply row limit, compute truncation
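Steps 5–6 can be sketched with the stdlib csv module (`rows_from_csv2` is a hypothetical helper; values stay strings here, while the real transformer would coerce numeric columns):

```python
import csv
import io

def rows_from_csv2(text: str, max_rows: int = 1000) -> dict:
    """Parse a csv2 response (labelled header row) into row dicts and
    apply the row limit with a truncation flag.  max_rows=0 disables
    the limit."""
    reader = csv.DictReader(io.StringIO(text))
    all_rows = list(reader)
    limited = all_rows[:max_rows] if max_rows else all_rows
    return {
        "total_rows": len(all_rows),
        "rows": limited,
        "truncated": len(limited) < len(all_rows),
    }
```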

Error handling: The API returns ProblemDetails (RFC 7807) on 400/404/422. Transform into clear error messages:

  • "Dimension 'XYZ' is not valid for this table. Available: GEO, AAR, ..."
  • "Value '2025_2025' not found in dimension AAR. Range: 2002..2024"
  • "maxRowCount exceeded. Requested ~50000 rows, limit is 1000. Narrow filters."

Tool 6: get_query_template

Purpose: Fallback tool returning the raw query template from the API. Useful when the agent needs to see exactly what the API expects.

Parameters:

  • source_id (string, required)
  • table_id (integer, required)

Returns: The raw DataRequest JSON as returned by the API.

Implementation: GET /{sourceId}/Table/{tableId}/query. Pass through.

When to use: When query_data auto-completion isn't behaving as expected, or the agent wants to see the complete list of available values for all dimensions.


Tools NOT included (and why)

| Considered tool | Decision | Reason |
|---|---|---|
| get_flags (standalone) | Dropped | Folded into describe_table |
| get_metadata (standalone) | Dropped | Folded into describe_table |
| get_table_info (standalone) | Dropped | Folded into describe_table |
| search_across_sources | Dropped | Too expensive (13 API calls). Agent can call list_tables per source |
| get_data_jsonstat | Dropped | Agents don't need raw JSON-stat2 |
| get_data_parquet | Dropped | Binary format, not useful for LLM context |

Architecture

Stack

  • Language: Python 3.12+
  • MCP framework: FastMCP (mcp[cli])
  • HTTP server: Uvicorn (uvicorn>=0.30) for SSE/HTTP transport
  • HTTP client: httpx (async)
  • CSV parsing: stdlib csv
  • HTML stripping: stdlib html.parser or re (simple tag removal)
  • Build system: Hatchling (matches Fhi.Metadata.MCPserver pattern)

Transport

The server supports multiple transports via CLI flag, following the same pattern as Fhi.Metadata.MCPserver:

| Transport | Use case | Endpoint |
|---|---|---|
| sse | Local dev + Skybert deployment | /sse |
| streamable-http | Future HTTP-only clients | /mcp |
| stdio | Direct pipe (legacy) | stdin/stdout |

Default: sse on 0.0.0.0:8000. This means the server works over HTTP both locally and when deployed to Skybert, with no transport change needed.

CLI entry point:

fhi-statistikk-mcp --transport sse --host 0.0.0.0 --port 8000

Project Structure

fhi-statistikk-mcp/
├── .github/
│   └── workflows/
│       └── docker-build-push.yaml  # CI/CD → crfhiskybert.azurecr.io
├── .mcp.json.local                 # Local dev: http://localhost:8000/sse
├── .mcp.json.public                # Production: https://<skybert-url>/sse
├── Dockerfile                      # Multi-stage, Python 3.12-slim
├── pyproject.toml                  # Hatchling build, entry point
├── README.md
├── src/
│   └── fhi_statistikk_mcp/
│       ├── __init__.py
│       ├── server.py               # MCP server, tool definitions, main()
│       ├── api_client.py           # Async httpx client for FHI API
│       ├── transformers.py         # CSV parsing, dimension summarization
│       └── cache.py                # Simple TTL cache
└── tests/
    ├── test_transformers.py
    ├── test_cache.py
    └── fixtures/                   # Recorded API responses
        ├── sources.json
        ├── tables_nokkel.json
        ├── dimensions_185.json
        ├── metadata_185.json
        ├── flags_185.json
        └── data_185.csv

MCP Client Configuration

Local development (.mcp.json.local):

{
  "mcpServers": {
    "fhi-statistikk": {
      "type": "sse",
      "url": "http://localhost:8000/sse"
    }
  }
}

Production (.mcp.json.public):

{
  "mcpServers": {
    "fhi-statistikk": {
      "type": "sse",
      "url": "https://<skybert-url>/sse"
    }
  }
}

Dockerfile

Following the Fhi.Metadata.MCPserver pattern:

FROM python:3.12-slim AS base
WORKDIR /app
COPY pyproject.toml .
COPY src/ src/
RUN pip install --no-cache-dir .

FROM base AS prod
EXPOSE 8000
CMD ["fhi-statistikk-mcp", "--transport", "sse", "--host", "0.0.0.0", "--port", "8000"]

CI/CD

Same pipeline pattern as Fhi.Metadata.MCPserver:

  • Trigger on push to main touching src/, Dockerfile, or pyproject.toml
  • Azure Federated Identity (OIDC) login
  • Push to crfhiskybert.azurecr.io/fida/ki/statistikk-mcp
  • Tag: git short SHA + latest
  • Dispatch to GitOps repo for Skybert deployment

Logging

Force all loggers (uvicorn, mcp, fastmcp) to stderr with simple format. Print startup info (API base URL, cache status) to stderr. No persistent log files -- container logging handles that on Skybert.

Caching Strategy

| Data | TTL | Key | Reason |
|---|---|---|---|
| Source list | 24h | "sources" | Rarely changes |
| Table list | 1h | source_id | New tables published daily |
| Dimensions | 6h | (source_id, table_id) | Dimension structure is stable |
| Metadata | 6h | (source_id, table_id) | Metadata edits are rare |
| Flags | 6h | (source_id, table_id) | Flags rarely change |
| Query templates | 6h | (source_id, table_id) | Follows dimension changes |
| Data responses | No cache | -- | Queries vary too much to cache |

In-memory dict with TTL. No external dependency needed -- the data volume is small and the server is single-process.
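A minimal sketch of such a cache (class and method names are illustrative):

```python
import time

class TTLCache:
    """In-memory TTL cache: a dict of key -> (expiry, value), checked
    lazily on read.  Single-process, no locking needed."""

    def __init__(self):
        self._store: dict = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expiry, value = entry
        if time.monotonic() >= expiry:     # expired: drop and miss
            del self._store[key]
            return None
        return value

    def set(self, key, value, ttl_seconds: float) -> None:
        self._store[key] = (time.monotonic() + ttl_seconds, value)

cache = TTLCache()
cache.set(("nokkel", 185), {"dimensions": "..."}, ttl_seconds=6 * 3600)
```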

Rate Limiting

No documented rate limits, but this is a government API. Be polite:

  • Max 5 concurrent requests
  • 100ms minimum between requests
  • Retry with exponential backoff on 429/503
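These three rules can be sketched as follows. The `fetch` callable (a zero-arg coroutine returning `(status, body)`) is an assumption standing in for an httpx call:

```python
import asyncio
import random

class PoliteLimiter:
    """At most `concurrency` requests in flight, a minimum delay between
    request starts, and exponential backoff on 429/503."""

    def __init__(self, concurrency: int = 5, min_interval: float = 0.1):
        self._sem = asyncio.Semaphore(concurrency)
        self._gate = asyncio.Lock()
        self._min_interval = min_interval
        self._last_start = 0.0

    async def run(self, fetch, *, retries: int = 3, backoff_base: float = 0.5):
        async with self._sem:
            for attempt in range(retries + 1):
                async with self._gate:  # space out request starts
                    loop = asyncio.get_running_loop()
                    wait = self._last_start + self._min_interval - loop.time()
                    if wait > 0:
                        await asyncio.sleep(wait)
                    self._last_start = loop.time()
                status, body = await fetch()
                if status not in (429, 503) or attempt == retries:
                    return status, body
                # exponential backoff with a little jitter
                await asyncio.sleep(backoff_base * (2 ** attempt) + random.random() * 0.05)
```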

Error Mapping

| API Response | MCP Tool Error |
|---|---|
| 400 Bad Request | Descriptive message from ProblemDetails.detail |
| 404 Not Found | "Source/table not found: {id}" |
| 422 Unprocessable Entity | "Query validation failed: {detail}" |
| Network timeout | "API request timed out. Try reducing query scope." |
| CSV parse error | "Failed to parse response. Try get_query_template." |

Accent-Insensitive Search

Dimension value search (in get_dimension_values) normalizes both the query and category labels for accent-insensitive matching:

  • Normalize with unicodedata.normalize("NFD") and strip combining marks (this folds "å" → "a"; "æ" and "ø" have no canonical decomposition, so map them explicitly, e.g. "æ" → "a", "ø" → "o")
  • Case-insensitive comparison
  • "tromso" matches "Tromsø", "barum" matches "Bærum"
  • Preserve original labels in output
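A sketch of this folding. Note that "å" decomposes under NFD while "æ" and "ø" do not, so the latter need an explicit mapping (the specific fold chosen here is an assumption matching the examples above):

```python
import unicodedata

# "æ" and "ø" have no canonical decomposition, so NFD alone won't fold
# them; translate them explicitly before decomposing.
_EXTRA = str.maketrans({"æ": "a", "ø": "o", "Æ": "a", "Ø": "o"})

def fold(text: str) -> str:
    """Lowercase, fold Norwegian letters, then strip combining marks."""
    decomposed = unicodedata.normalize("NFD", text.lower().translate(_EXTRA))
    return "".join(c for c in decomposed if not unicodedata.combining(c))

fold("Tromsø")  # → "tromso"
fold("Bærum")   # → "barum"
```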

Implementation Plan

Phase 1: Core (MVP)

  1. Set up project skeleton: pyproject.toml with hatchling, src/ layout, entry point fhi-statistikk-mcp
  2. Set up server.py with FastMCP, SSE transport, CLI args (transport, host, port), stderr logging
  3. Implement api_client.py with async httpx client, base URL config
  4. Implement cache.py with simple TTL dict
  5. Implement list_sources tool
  6. Implement list_tables tool with client-side keyword search
  7. Implement describe_table composite tool
    • Parallel fetch of 4 endpoints
    • Dimension summarization (large dim truncation, fixed dim detection)
    • HTML stripping for metadata
    • Merge into structured response
  8. Implement query_data tool
    • Auto-completion of fixed dimensions
    • Year value normalization ("2020" → "2020_2020")
    • Default MEASURE_TYPE to all/["*"]
    • CSV parsing and row structuring
    • Row limit and truncation
  9. Implement get_dimension_values with hierarchy navigation and accent-insensitive search
  10. Implement get_query_template passthrough
  11. Add .mcp.json.local for local dev
  12. Test all tools against live API

Phase 2: Deployment & Polish

  1. Add Dockerfile (multi-stage, Python 3.12-slim)
  2. Add .github/workflows/docker-build-push.yaml for CI/CD
  3. Add .mcp.json.public with Skybert URL
  4. Add comprehensive error handling and error messages
  5. Add rate limiting
  6. Record API fixtures for offline testing
  7. Write unit tests for transformers and cache
  8. Write integration tests against live API

Phase 3: Optional Enhancements

  1. Add a search_all_tables convenience tool (if agents frequently need it)
  2. Add MCP resources for static reference data (source descriptions, common dimension codes)
  3. Add MCP prompt templates (e.g. "finn helsedata om ")

Tool Description Guidelines

MCP tool descriptions are what the agent uses to decide which tool to call. They should be written for an LLM audience:

  • Lead with the purpose, not the endpoint
  • Include example parameter values
  • Document non-obvious conventions (year format, dimension codes)
  • Mention what describe_table returns, since it's the prerequisite for query_data
  • Note that Norwegian labels are the default (GEO labels are in Norwegian)

Example tool description for query_data:

Fetch statistical data from an FHI table. Before calling this, use describe_table to understand the table's dimensions and available values.

You only need to specify the dimensions you care about. Fixed dimensions (single-valued, like KJONN="kjønn samlet") are auto-included. If you omit MEASURE_TYPE, all measures are returned.

Year values: use "2020" (auto-translated to "2020_2020") or the full format.

Filters: "item" (exact values), "all" (wildcard, e.g. ["*"]), "top" (first N), "bottom" (last N).

Returns labeled rows, max 1000 by default. Check "truncated" field.

Resolved Decisions

| Question | Decision | Rationale |
|---|---|---|
| Hosting | SSE locally, same for Skybert | Follow Fhi.Metadata.MCPserver pattern. HTTP from day one, no transport change on deploy. |
| JSON-stat2 output | No | csv2 is sufficient for LLM agents. JSON-stat2 is for statistical software. |
| Fuzzy dimension search | Yes, accent-insensitive | Norwegian chars (æøå) will trip up agents. Normalize NFD + strip combining marks, with explicit æ/ø folding. |
| Sample data in describe_table | No | Adds latency. Agent calls query_data with max_rows=5 if it wants a preview. |