fhi-statistikk-mcp/mcp-server-plan.md
Torbjørn Lindahl 817b90420f added repo
2026-03-27 17:01:55 +01:00

MCP Server for FHI Statistikk Open API

Overview

An MCP (Model Context Protocol) server that exposes the FHI Statistikk Open API as tools optimized for AI agent consumption. The server wraps the REST API at https://statistikk-data.fhi.no/api/open/v1/ and adds intelligent summarization, format translation, and convenience features that make the API practical for LLM-based agents.

  • This repo: /home/tlind/git/fhi/openapi-mcp
  • API documentation repo: /home/tlind/git/fhi/Fhi.Statistikk.OpenAPI (code samples, Postman collection, user guide)
  • Base API: https://statistikk-data.fhi.no/api/open/v1/
  • API docs: https://statistikk-data.fhi.no/swagger/index.html?urls.primaryName=Allvis%20Open%20API
  • License: CC BY 4.0 (open data)
  • Auth: None required

Problem Statement

The raw API has several characteristics that make it hard for AI agents to use:

  1. JSON-stat2 format -- The data endpoint returns a multidimensional sparse array format designed for statistical software, not LLMs.
  2. Mandatory dimension specification -- All dimensions must be included in every data query, even single-valued ones like KJONN=["0"].
  3. Non-obvious value formats -- Year values use "2020_2020" not "2020".
  4. Massive dimension trees -- The GEO dimension can have 400+ entries in a hierarchical tree (country > county > municipality > city district).
  5. Multi-step discovery -- Finding relevant data requires: list sources > list tables > get dimensions > construct query > fetch data.
  6. Metadata contains raw HTML -- <p>, <a>, <ol> tags in content fields.
  7. Swagger spec is incomplete -- It documents only the "item" filter, but the API actually supports "item", "all", "top", and "bottom".

API Inventory

Sources (as of 2026-03-27)

| ID | Title | Publisher |
|---|---|---|
| nokkel | Folkehelsestatistikk | Helsedirektoratet |
| ngs | Mikrobiologisk genomovervåkning | FHI |
| mfr | Medisinsk fødselsregister | FHI |
| abr | Abortregisteret | FHI |
| sysvak | Nasjonalt vaksinasjonsregister SYSVAK | FHI |
| daar | Dødsårsakregisteret | FHI |
| msis | Meldingssystem for smittsomme sykdommer | FHI |
| lmr | Legemiddelregisteret | FHI |
| gs | Grossiststatistikk | FHI |
| npr | Norsk pasientregister | FHI |
| kpr | Kommunalt pasient- og brukerregister | FHI |
| hkr | Hjerte- og karsykdommer | FHI |
| skast | Skadedyrstatistikk | FHI |

Endpoints

| Method | Path | Purpose |
|---|---|---|
| GET | /Common/source | List all sources |
| GET | /{sourceId}/Table | List tables in source |
| GET | /{sourceId}/Table/{tableId} | Table info |
| GET | /{sourceId}/Table/{tableId}/query | Query template |
| GET | /{sourceId}/Table/{tableId}/dimension | Dimensions and categories |
| POST | /{sourceId}/Table/{tableId}/data | Fetch data |
| GET | /{sourceId}/Table/{tableId}/flag | Flag/symbol definitions |
| GET | /{sourceId}/Table/{tableId}/metadata | Table metadata |

Filter Types

| Filter | Description | Example values |
|---|---|---|
| item | Exact match on listed values | ["2020_2020","2021_2021"] |
| all | Wildcard match with * | ["*"] or ["A*","B*"] |
| top | First N categories | ["5"] |
| bottom | Last N categories | ["5"] |
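Combined in a request body, these filters might look like the following sketch. The exact field names should be confirmed against GET /{sourceId}/Table/{tableId}/query; the `response` options shown here are an assumption:

```python
# Illustrative data-request body mixing filter types.  Field names are
# assumptions for this sketch -- take the authoritative shape from the
# table's query-template endpoint.
payload = {
    "dimensions": [
        {"code": "GEO", "filter": "item", "values": ["0301"]},       # Oslo only
        {"code": "AAR", "filter": "bottom", "values": ["2"]},        # two most recent years
        {"code": "MEASURE_TYPE", "filter": "all", "values": ["*"]},  # every measure
    ],
    "response": {"format": "csv2", "maxRowCount": 1000},
}
```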

Response Formats (data endpoint)

| Format | Content-Type | Description |
|---|---|---|
| json-stat2 | application/json | JSON-Stat 2.0 sparse array |
| csv2 | text/csv | CSV with human-readable labels |
| csv3 | text/csv | CSV with dimension/measure codes |
| parquet | application/vnd.apache.parquet | Apache Parquet columnar format |

MCP Tool Design

Tool 1: list_sources

Purpose: Entry point. List all available data sources.

Parameters: None.

Returns: Array of {id, title, description, published_by}.

Implementation: GET /Common/source. Pass through with minor field renaming (snake_case).

Caching: Cache for 24 hours. Source list rarely changes.


Tool 2: list_tables

Purpose: Find tables within a source, with optional keyword search.

Parameters:

  • source_id (string, required) -- Source identifier, e.g. "nokkel".
  • search (string, optional) -- Case-insensitive keyword filter on table title. Supports multiple words (all must match). Applied client-side.
  • modified_after (string, optional) -- ISO-8601 datetime. Only return tables modified after this date. Passed to API server-side.

Returns: Array of {table_id, title, published_at, modified_at}.

Implementation: GET /{sourceId}/Table?modifiedAfter=..., then client-side filter on search. Sort by modified_at descending.

Caching: Cache per source_id for 1 hour. Table lists update throughout the day as data is published.

Example:

list_tables(source_id="nokkel", search="befolkning")
→ [{table_id: 185, title: "Befolkningsvekst", ...},
   {table_id: 338, title: "Befolkningssammensetning_antall_andel", ...},
   {table_id: 171, title: "Befolkningsframskriving", ...}]

Tool 3: describe_table

Purpose: The primary tool for understanding a table's structure. Gives the agent everything it needs to construct a data query.

Parameters:

  • source_id (string, required)
  • table_id (integer, required)

Returns: A structured summary combining table info, dimensions, metadata, and flags. This is a composite call (4 parallel API requests).

Response structure:

{
  "title": "Befolkningsvekst",
  "published_at": "2025-10-21T08:56:39Z",
  "modified_at": "2025-10-21T08:56:39Z",
  "is_official_statistics": false,
  "description": "Differansen mellom befolkningsmengden...",
  "update_frequency": "Årlig",
  "keywords": ["Befolkning", "Befolkningsvekst"],
  "source_institution": "Statistisk sentralbyrå (SSB)",
  "dimensions": [
    {
      "code": "GEO",
      "label": "Geografi",
      "total_categories": 356,
      "is_hierarchical": true,
      "hierarchy_depth": 4,
      "top_level_values": [
        {"value": "0", "label": "Hele landet", "child_count": 15}
      ],
      "note": "Use get_dimension_values to drill into sub-levels"
    },
    {
      "code": "AAR",
      "label": "År",
      "total_categories": 23,
      "is_hierarchical": false,
      "value_format": "YYYY_YYYY (e.g. 2020_2020)",
      "range": "2002..2024",
      "values": ["2002_2002", "2003_2003", ..., "2024_2024"]
    },
    {
      "code": "KJONN",
      "label": "Kjønn",
      "total_categories": 1,
      "is_fixed": true,
      "values": [{"value": "0", "label": "kjønn samlet"}],
      "note": "Single-valued, auto-included in queries"
    },
    {
      "code": "ALDER",
      "label": "Alder",
      "total_categories": 1,
      "is_fixed": true,
      "values": [{"value": "0_120", "label": "alle aldre"}],
      "note": "Single-valued, auto-included in queries"
    },
    {
      "code": "MEASURE_TYPE",
      "label": "Måltall",
      "total_categories": 2,
      "is_fixed": false,
      "values": [
        {"value": "TELLER", "label": "antall"},
        {"value": "RATE", "label": "prosent vekst"}
      ]
    }
  ],
  "flags": [
    {"symbol": "", "description": "Verdi finnes i tabellen"}
  ]
}

Key design decisions:

  1. Summarize large dimensions -- For dimensions with >20 categories (mainly GEO), show only top-level entries with child counts. The agent uses get_dimension_values to drill down.

  2. Mark fixed dimensions -- Dimensions with exactly 1 category get is_fixed: true. The agent knows to ignore these; query_data will auto-include them.

  3. Show value format -- AAR values are "2020_2020", not "2020". Show this explicitly so the agent gets the format right.

  4. Include metadata inline -- Strip HTML from metadata paragraphs. Extract description, keywords, update_frequency, source_institution as top-level fields.

  5. Include flags inline -- Flag definitions are small and always relevant.
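The HTML stripping in decision 4 can be done with the stdlib parser; a minimal sketch (`strip_html` is a hypothetical helper name):

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects text content, dropping tags like <p>, <a>, <ol>."""
    def __init__(self):
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)

def strip_html(fragment: str) -> str:
    extractor = _TextExtractor()
    extractor.feed(fragment)
    # Join text nodes and collapse runs of whitespace left by removed tags
    return " ".join(" ".join(extractor.parts).split())

strip_html("<p>Kilde: <a href='#'>SSB</a></p>")  # → "Kilde: SSB"
```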

Implementation: Parallel fetch of:

  • GET /{sourceId}/Table/{tableId} (table info)
  • GET /{sourceId}/Table/{tableId}/dimension (dimensions)
  • GET /{sourceId}/Table/{tableId}/metadata (metadata)
  • GET /{sourceId}/Table/{tableId}/flag (flags)

Then merge and transform.
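The parallel fetch can be sketched with asyncio.gather. Here `get_json` stands in for an httpx-based request function and is an assumption of this sketch:

```python
import asyncio
from typing import Awaitable, Callable

def bundle_paths(source_id: str, table_id: int) -> list[str]:
    """Relative paths for the four describe_table requests."""
    base = f"/{source_id}/Table/{table_id}"
    return [base, base + "/dimension", base + "/metadata", base + "/flag"]

async def fetch_table_bundle(
    source_id: str, table_id: int, get_json: Callable[[str], Awaitable[dict]]
) -> dict:
    """Run the four requests concurrently; in the real server, get_json
    would wrap httpx.AsyncClient.get against the API base URL."""
    info, dims, meta, flags = await asyncio.gather(
        *(get_json(p) for p in bundle_paths(source_id, table_id))
    )
    return {"info": info, "dimensions": dims, "metadata": meta, "flags": flags}
```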

Caching: Cache per (source_id, table_id) for 6 hours. Dimension structure changes rarely.


Tool 4: get_dimension_values

Purpose: Drill into large hierarchical dimensions, typically GEO.

Parameters:

  • source_id (string, required)
  • table_id (integer, required)
  • dimension_code (string, required) -- e.g. "GEO".
  • parent_value (string, optional) -- Return only children of this category. E.g. "18" for Nordland county. If omitted, returns top-level categories.
  • search (string, optional) -- Case-insensitive search on category labels. E.g. "tromsø" to find the municipality.

Returns: Array of {value, label, child_count}.

Implementation: GET /{sourceId}/Table/{tableId}/dimension, then navigate the category tree client-side. The full tree is fetched and cached; filtering is done in the MCP server.

Examples:

# Get the top level (whole country; its 15 children are the counties)
get_dimension_values("nokkel", 185, "GEO")
→ [{value: "0", label: "Hele landet", child_count: 15}]

# Get municipalities in Nordland
get_dimension_values("nokkel", 185, "GEO", parent_value="18")
→ [{value: "1804", label: "Bodø", child_count: 0},
   {value: "1806", label: "Narvik", child_count: 0}, ...]

# Search for a municipality
get_dimension_values("nokkel", 185, "GEO", search="tromsø")
→ [{value: "5501", label: "Tromsø", child_count: 0}]

Caching: Shares the dimension cache with describe_table.


Tool 5: query_data

Purpose: Fetch actual data from a table. The main data retrieval tool.

Parameters:

  • source_id (string, required)
  • table_id (integer, required)
  • dimensions (array, required) -- Each element:
    • code (string) -- Dimension code, e.g. "GEO".
    • filter (string) -- One of "item", "all", "top", "bottom". Default: "item".
    • values (array of strings) -- Filter values.
  • max_rows (integer, optional) -- Limit returned rows. Default: 1000. Set to 0 for no limit (be careful).

Returns: Structured rows with labeled values.

{
  "table": "Befolkningsvekst",
  "total_rows": 4,
  "rows": [
    {"GEO": "Oslo", "AAR": "2023", "KJONN": "kjønn samlet",
     "ALDER": "alle aldre", "TELLER": 3516, "RATE": 0.5},
    ...
  ],
  "truncated": false,
  "dimensions_used": {
    "GEO": {"filter": "item", "values": ["0301"]},
    "AAR": {"filter": "bottom", "values": ["2"]},
    "KJONN": {"filter": "item", "values": ["0"]},
    "ALDER": {"filter": "item", "values": ["0_120"]},
    "MEASURE_TYPE": {"filter": "all", "values": ["*"]}
  }
}

Key design decisions:

  1. Default to csv2 internally -- Fetch as csv2 (human-readable labels), parse into rows. CSV is simpler for an agent to reason about than JSON-stat2. The tool internally requests csv2 and structures it.

  2. Auto-include fixed dimensions -- If the agent omits a dimension that has only 1 category (like KJONN or ALDER), the tool adds it automatically with filter: "item" and the single value. This means the agent only needs to specify the dimensions it actually cares about.

  3. Normalize year values -- If the agent sends "2020" for AAR, the tool translates to "2020_2020". The YYYY_YYYY format is an internal API convention the agent shouldn't need to know about.

  4. Default MEASURE_TYPE -- If omitted, default to filter: "all", values: ["*"] to get all measures. Most agents want all available metrics.

  5. Row limit with truncation flag -- Default 1000 rows. Return a truncated: true flag and total_rows count so the agent knows if there's more data.

  6. Echo back dimensions_used -- Show what was actually sent to the API (after auto-completion), so the agent can see the full query.
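Decisions 2–4 can be sketched together. The shape of `table_dims` (a code-to-available-values mapping) and the helper names are assumptions of this sketch:

```python
import re

def normalize_year(code: str, value: str) -> str:
    """Translate bare years ("2020") to the API's "2020_2020" convention
    for the AAR dimension; leave all other values untouched."""
    if code == "AAR" and re.fullmatch(r"\d{4}", value):
        return f"{value}_{value}"
    return value

def complete_query(requested: list[dict], table_dims: dict[str, list[str]]) -> list[dict]:
    """Auto-include single-valued (fixed) dimensions the agent omitted,
    default MEASURE_TYPE to all/["*"], and normalize year values."""
    query = {d["code"]: d for d in requested}
    for code, values in table_dims.items():
        if code in query:
            query[code]["values"] = [normalize_year(code, v) for v in query[code]["values"]]
        elif len(values) == 1:          # fixed dimension: auto-include its only value
            query[code] = {"code": code, "filter": "item", "values": values}
        elif code == "MEASURE_TYPE":    # default: return every measure
            query[code] = {"code": code, "filter": "all", "values": ["*"]}
    return list(query.values())
```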

Implementation:

  1. Fetch dimension info if not cached (to know fixed dimensions and validate)
  2. Auto-complete missing/fixed dimensions
  3. Normalize year values
  4. POST /{sourceId}/Table/{tableId}/data with format=csv2
  5. Parse CSV response into row objects
  6. Apply row limit, compute truncation
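Steps 5–6 can be sketched with the stdlib csv module (`rows_from_csv2` is a hypothetical helper; values stay strings here, while the real transformer would coerce numeric columns):

```python
import csv
import io

def rows_from_csv2(text: str, max_rows: int = 1000) -> dict:
    """Parse a csv2 response (labelled header row) into row dicts and
    apply the row limit with a truncation flag.  max_rows=0 disables
    the limit."""
    reader = csv.DictReader(io.StringIO(text))
    all_rows = list(reader)
    limited = all_rows[:max_rows] if max_rows else all_rows
    return {
        "total_rows": len(all_rows),
        "rows": limited,
        "truncated": len(limited) < len(all_rows),
    }
```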

Error handling: The API returns ProblemDetails (RFC 7807) on 400/404/422. Transform into clear error messages:

  • "Dimension 'XYZ' is not valid for this table. Available: GEO, AAR, ..."
  • "Value '2025_2025' not found in dimension AAR. Range: 2002..2024"
  • "maxRowCount exceeded. Requested ~50000 rows, limit is 1000. Narrow filters."

Tool 6: get_query_template

Purpose: Fallback tool returning the raw query template from the API. Useful when the agent needs to see exactly what the API expects.

Parameters:

  • source_id (string, required)
  • table_id (integer, required)

Returns: The raw DataRequest JSON as returned by the API.

Implementation: GET /{sourceId}/Table/{tableId}/query. Pass through.

When to use: When query_data auto-completion isn't behaving as expected, or the agent wants to see the complete list of available values for all dimensions.


Tools NOT included (and why)

| Considered tool | Decision | Reason |
|---|---|---|
| get_flags (standalone) | Dropped | Folded into describe_table |
| get_metadata (standalone) | Dropped | Folded into describe_table |
| get_table_info (standalone) | Dropped | Folded into describe_table |
| search_across_sources | Dropped | Too expensive (13 API calls). Agent can call list_tables per source |
| get_data_jsonstat | Dropped | Agents don't need raw JSON-stat2 |
| get_data_parquet | Dropped | Binary format, not useful for LLM context |

Architecture

Stack

  • Language: Python 3.12+
  • MCP framework: FastMCP (mcp[cli])
  • HTTP server: Uvicorn (uvicorn>=0.30) for SSE/HTTP transport
  • HTTP client: httpx (async)
  • CSV parsing: stdlib csv
  • HTML stripping: stdlib html.parser or re (simple tag removal)
  • Build system: Hatchling (matches Fhi.Metadata.MCPserver pattern)

Transport

The server supports multiple transports via CLI flag, following the same pattern as Fhi.Metadata.MCPserver:

| Transport | Use case | Endpoint |
|---|---|---|
| sse | Local dev + Skybert deployment | /sse |
| streamable-http | Future HTTP-only clients | /mcp |
| stdio | Direct pipe (legacy) | stdin/stdout |

Default: sse on 0.0.0.0:8000. This means the server works over HTTP both locally and when deployed to Skybert, with no transport change needed.

CLI entry point:

fhi-statistikk-mcp --transport sse --host 0.0.0.0 --port 8000

Project Structure

fhi-statistikk-mcp/
├── .github/
│   └── workflows/
│       └── docker-build-push.yaml  # CI/CD → crfhiskybert.azurecr.io
├── .mcp.json.local                 # Local dev: http://localhost:8000/sse
├── .mcp.json.public                # Production: https://<skybert-url>/sse
├── Dockerfile                      # Multi-stage, Python 3.12-slim
├── pyproject.toml                  # Hatchling build, entry point
├── README.md
├── src/
│   └── fhi_statistikk_mcp/
│       ├── __init__.py
│       ├── server.py               # MCP server, tool definitions, main()
│       ├── api_client.py           # Async httpx client for FHI API
│       ├── transformers.py         # CSV parsing, dimension summarization
│       └── cache.py                # Simple TTL cache
└── tests/
    ├── test_transformers.py
    ├── test_cache.py
    └── fixtures/                   # Recorded API responses
        ├── sources.json
        ├── tables_nokkel.json
        ├── dimensions_185.json
        ├── metadata_185.json
        ├── flags_185.json
        └── data_185.csv

MCP Client Configuration

Local development (.mcp.json.local):

{
  "mcpServers": {
    "fhi-statistikk": {
      "type": "sse",
      "url": "http://localhost:8000/sse"
    }
  }
}

Production (.mcp.json.public):

{
  "mcpServers": {
    "fhi-statistikk": {
      "type": "sse",
      "url": "https://<skybert-url>/sse"
    }
  }
}

Dockerfile

Following the Fhi.Metadata.MCPserver pattern:

FROM python:3.12-slim AS base
WORKDIR /app
COPY pyproject.toml .
COPY src/ src/
RUN pip install --no-cache-dir .

FROM base AS prod
EXPOSE 8000
CMD ["fhi-statistikk-mcp", "--transport", "sse", "--host", "0.0.0.0", "--port", "8000"]

CI/CD

Same pipeline pattern as Fhi.Metadata.MCPserver:

  • Trigger on push to main touching src/, Dockerfile, or pyproject.toml
  • Azure Federated Identity (OIDC) login
  • Push to crfhiskybert.azurecr.io/fida/ki/statistikk-mcp
  • Tag: git short SHA + latest
  • Dispatch to GitOps repo for Skybert deployment

Logging

Force all loggers (uvicorn, mcp, fastmcp) to stderr with simple format. Print startup info (API base URL, cache status) to stderr. No persistent log files -- container logging handles that on Skybert.

Caching Strategy

| Data | TTL | Key | Reason |
|---|---|---|---|
| Source list | 24h | "sources" | Rarely changes |
| Table list | 1h | source_id | New tables published daily |
| Dimensions | 6h | (source_id, table_id) | Dimension structure is stable |
| Metadata | 6h | (source_id, table_id) | Metadata edits are rare |
| Flags | 6h | (source_id, table_id) | Flags rarely change |
| Query templates | 6h | (source_id, table_id) | Follows dimension changes |
| Data responses | No cache | -- | Queries vary too much to cache |

In-memory dict with TTL. No external dependency needed -- the data volume is small and the server is single-process.
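A minimal sketch of such a cache (class and method names are illustrative):

```python
import time

class TTLCache:
    """In-memory TTL cache: a dict of key -> (expiry, value), checked
    lazily on read.  Single-process, no locking needed."""

    def __init__(self):
        self._store: dict = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expiry, value = entry
        if time.monotonic() >= expiry:     # expired: drop and miss
            del self._store[key]
            return None
        return value

    def set(self, key, value, ttl_seconds: float) -> None:
        self._store[key] = (time.monotonic() + ttl_seconds, value)

cache = TTLCache()
cache.set(("nokkel", 185), {"dimensions": "..."}, ttl_seconds=6 * 3600)
```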

Rate Limiting

No documented rate limits, but this is a government API. Be polite:

  • Max 5 concurrent requests
  • 100ms minimum between requests
  • Retry with exponential backoff on 429/503
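These three rules can be sketched as follows. The `fetch` callable (a zero-arg coroutine returning `(status, body)`) is an assumption standing in for an httpx call:

```python
import asyncio
import random

class PoliteLimiter:
    """At most `concurrency` requests in flight, a minimum delay between
    request starts, and exponential backoff on 429/503."""

    def __init__(self, concurrency: int = 5, min_interval: float = 0.1):
        self._sem = asyncio.Semaphore(concurrency)
        self._gate = asyncio.Lock()
        self._min_interval = min_interval
        self._last_start = 0.0

    async def run(self, fetch, *, retries: int = 3, backoff_base: float = 0.5):
        async with self._sem:
            for attempt in range(retries + 1):
                async with self._gate:  # space out request starts
                    loop = asyncio.get_running_loop()
                    wait = self._last_start + self._min_interval - loop.time()
                    if wait > 0:
                        await asyncio.sleep(wait)
                    self._last_start = loop.time()
                status, body = await fetch()
                if status not in (429, 503) or attempt == retries:
                    return status, body
                # exponential backoff with a little jitter
                await asyncio.sleep(backoff_base * (2 ** attempt) + random.random() * 0.05)
```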

Error Mapping

| API Response | MCP Tool Error |
|---|---|
| 400 Bad Request | Descriptive message from ProblemDetails.detail |
| 404 Not Found | "Source/table not found: {id}" |
| 422 Unprocessable Entity | "Query validation failed: {detail}" |
| Network timeout | "API request timed out. Try reducing query scope." |
| CSV parse error | "Failed to parse response. Try get_query_template." |

Accent-Insensitive Search

Dimension value search (in get_dimension_values) normalizes both the query and category labels for accent-insensitive matching:

  • Normalize with unicodedata.normalize("NFD") and strip combining marks (this folds "å" → "a"; "æ" and "ø" have no canonical decomposition, so map them explicitly, e.g. "æ" → "a", "ø" → "o")
  • Case-insensitive comparison
  • "tromso" matches "Tromsø", "barum" matches "Bærum"
  • Preserve original labels in output
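A sketch of this folding. Note that "å" decomposes under NFD while "æ" and "ø" do not, so the latter need an explicit mapping (the specific fold chosen here is an assumption matching the examples above):

```python
import unicodedata

# "æ" and "ø" have no canonical decomposition, so NFD alone won't fold
# them; translate them explicitly before decomposing.
_EXTRA = str.maketrans({"æ": "a", "ø": "o", "Æ": "a", "Ø": "o"})

def fold(text: str) -> str:
    """Lowercase, fold Norwegian letters, then strip combining marks."""
    decomposed = unicodedata.normalize("NFD", text.lower().translate(_EXTRA))
    return "".join(c for c in decomposed if not unicodedata.combining(c))

fold("Tromsø")  # → "tromso"
fold("Bærum")   # → "barum"
```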

Implementation Plan

Phase 1: Core (MVP)

  1. Set up project skeleton: pyproject.toml with hatchling, src/ layout, entry point fhi-statistikk-mcp
  2. Set up server.py with FastMCP, SSE transport, CLI args (transport, host, port), stderr logging
  3. Implement api_client.py with async httpx client, base URL config
  4. Implement cache.py with simple TTL dict
  5. Implement list_sources tool
  6. Implement list_tables tool with client-side keyword search
  7. Implement describe_table composite tool
    • Parallel fetch of 4 endpoints
    • Dimension summarization (large dim truncation, fixed dim detection)
    • HTML stripping for metadata
    • Merge into structured response
  8. Implement query_data tool
    • Auto-completion of fixed dimensions
    • Year value normalization ("2020" → "2020_2020")
    • Default MEASURE_TYPE to all/["*"]
    • CSV parsing and row structuring
    • Row limit and truncation
  9. Implement get_dimension_values with hierarchy navigation and accent-insensitive search
  10. Implement get_query_template passthrough
  11. Add .mcp.json.local for local dev
  12. Test all tools against live API

Phase 2: Deployment & Polish

  1. Add Dockerfile (multi-stage, Python 3.12-slim)
  2. Add .github/workflows/docker-build-push.yaml for CI/CD
  3. Add .mcp.json.public with Skybert URL
  4. Add comprehensive error handling and error messages
  5. Add rate limiting
  6. Record API fixtures for offline testing
  7. Write unit tests for transformers and cache
  8. Write integration tests against live API

Phase 3: Optional Enhancements

  1. Add a search_all_tables convenience tool (if agents frequently need it)
  2. Add MCP resources for static reference data (source descriptions, common dimension codes)
  3. Add MCP prompt templates (e.g. "finn helsedata om ")

Tool Description Guidelines

MCP tool descriptions are what the agent uses to decide which tool to call. They should be written for an LLM audience:

  • Lead with the purpose, not the endpoint
  • Include example parameter values
  • Document non-obvious conventions (year format, dimension codes)
  • Mention what describe_table returns, since it's the prerequisite for query_data
  • Note that Norwegian labels are the default (GEO labels are in Norwegian)

Example tool description for query_data:

Fetch statistical data from an FHI table. Before calling this, use describe_table to understand the table's dimensions and available values.

You only need to specify the dimensions you care about. Fixed dimensions (single-valued, like KJONN="kjønn samlet") are auto-included. If you omit MEASURE_TYPE, all measures are returned.

Year values: use "2020" (auto-translated to "2020_2020") or the full format.

Filters: "item" (exact values), "all" (wildcard, e.g. ["*"]), "top" (first N), "bottom" (last N).

Returns labeled rows, max 1000 by default. Check "truncated" field.

Resolved Decisions

| Question | Decision | Rationale |
|---|---|---|
| Hosting | SSE locally, same for Skybert | Follow Fhi.Metadata.MCPserver pattern. HTTP from day one, no transport change on deploy. |
| JSON-stat2 output | No | csv2 is sufficient for LLM agents. JSON-stat2 is for statistical software. |
| Fuzzy dimension search | Yes, accent-insensitive | Norwegian chars (æøå) will trip up agents. Normalize NFD + strip combining marks, with explicit æ/ø folding. |
| Sample data in describe_table | No | Adds latency. Agent calls query_data with max_rows=5 if it wants a preview. |