# MCP Server for FHI Statistikk Open API

## Overview

An MCP (Model Context Protocol) server that exposes the FHI Statistikk Open API
as tools optimized for AI agent consumption. The server wraps the REST API at
https://statistikk-data.fhi.no/api/open/v1/ and adds intelligent
summarization, format translation, and convenience features that make the API
practical for LLM-based agents.
- Base API: https://statistikk-data.fhi.no/api/open/v1/
- API docs: https://statistikk-data.fhi.no/swagger/index.html?urls.primaryName=Allvis%20Open%20API
- License: CC BY 4.0 (open data)
- Auth: None required
## Problem Statement
The raw API has several characteristics that make it hard for AI agents:
- JSON-stat2 format -- The data endpoint returns a multidimensional sparse array format designed for statistical software, not LLMs.
- Mandatory dimension specification -- All dimensions must be included in every data query, even single-valued ones like `KJONN=["0"]`.
- Non-obvious value formats -- Year values use `"2020_2020"`, not `"2020"`.
- Massive dimension trees -- The GEO dimension can have 400+ entries in a hierarchical tree (country > county > municipality > city district).
- Multi-step discovery -- Finding relevant data requires: list sources > list tables > get dimensions > construct query > fetch data.
- Metadata contains raw HTML -- `<p>`, `<a>`, `<ol>` tags in content fields.
- Swagger spec is incomplete -- It documents only the `"item"` filter, but the API actually supports `"item"`, `"all"`, `"top"`, and `"bottom"`.
## API Inventory

### Sources (as of 2026-03-27)

| ID | Title | Publisher |
|---|---|---|
| nokkel | Folkehelsestatistikk | Helsedirektoratet |
| ngs | Mikrobiologisk genomovervåkning | FHI |
| mfr | Medisinsk fødselsregister | FHI |
| abr | Abortregisteret | FHI |
| sysvak | Nasjonalt vaksinasjonsregister SYSVAK | FHI |
| daar | Dødsårsakregisteret | FHI |
| msis | Meldingssystem for smittsomme sykdommer | FHI |
| lmr | Legemiddelregisteret | FHI |
| gs | Grossiststatistikk | FHI |
| npr | Norsk pasientregister | FHI |
| kpr | Kommunalt pasient- og brukerregister | FHI |
| hkr | Hjerte- og karsykdommer | FHI |
| skast | Skadedyrstatistikk | FHI |
### Endpoints

| Method | Path | Purpose |
|---|---|---|
| GET | `/Common/source` | List all sources |
| GET | `/{sourceId}/Table` | List tables in source |
| GET | `/{sourceId}/Table/{tableId}` | Table info |
| GET | `/{sourceId}/Table/{tableId}/query` | Query template |
| GET | `/{sourceId}/Table/{tableId}/dimension` | Dimensions and categories |
| POST | `/{sourceId}/Table/{tableId}/data` | Fetch data |
| GET | `/{sourceId}/Table/{tableId}/flag` | Flag/symbol definitions |
| GET | `/{sourceId}/Table/{tableId}/metadata` | Table metadata |

### Filter Types

| Filter | Description | Example values |
|---|---|---|
| `item` | Exact match on listed values | `["2020_2020", "2021_2021"]` |
| `all` | Wildcard match with `*` | `["*"]` or `["A*", "B*"]` |
| `top` | First N categories | `["5"]` |
| `bottom` | Last N categories | `["5"]` |

### Response Formats (data endpoint)

| Format | Content-Type | Description |
|---|---|---|
| json-stat2 | application/json | JSON-Stat 2.0 sparse array |
| csv2 | text/csv | CSV with human-readable labels |
| csv3 | text/csv | CSV with dimension/measure codes |
| parquet | application/vnd.apache.parquet | Apache Parquet columnar format |
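A concrete request body makes the filter types easier to apply. The sketch below follows the code/filter/values triple used throughout this document; the exact DataRequest field names, and whether the output format travels as a query parameter or a body field, should be confirmed against the `/query` template endpoint and the Swagger spec, so treat this as illustrative only.

```python
# Hypothetical DataRequest body for POST /{sourceId}/Table/{tableId}/data,
# combining three of the four filter types from the table above.
data_request = {
    "dimensions": [
        {"code": "GEO", "filter": "item", "values": ["0301"]},       # Oslo only
        {"code": "AAR", "filter": "bottom", "values": ["2"]},        # last 2 years
        {"code": "MEASURE_TYPE", "filter": "all", "values": ["*"]},  # every measure
    ],
    # Assumption: csv2 output is selected via ?format=csv2 or an Accept
    # header -- confirm against the Swagger spec before relying on this.
}
```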
## MCP Tool Design

### Tool 1: list_sources

Purpose: Entry point. List all available data sources.
Parameters: None.
Returns: Array of {id, title, description, published_by}.
Implementation: GET /Common/source. Pass through with minor field renaming
(snake_case).
Caching: Cache for 24 hours. Source list rarely changes.
### Tool 2: list_tables

Purpose: Find tables within a source, with optional keyword search.
Parameters:
- `source_id` (string, required) -- Source identifier, e.g. `"nokkel"`.
- `search` (string, optional) -- Case-insensitive keyword filter on table title. Supports multiple words (all must match). Applied client-side.
- `modified_after` (string, optional) -- ISO-8601 datetime. Only return tables modified after this date. Passed to the API server-side.
Returns: Array of {table_id, title, published_at, modified_at}.
Implementation: GET /{sourceId}/Table?modifiedAfter=..., then client-side
filter on search. Sort by modified_at descending.
Caching: Cache per source_id for 1 hour. Table lists update throughout the day as data is published.
Example:

```
list_tables(source_id="nokkel", search="befolkning")
→ [{table_id: 185, title: "Befolkningsvekst", ...},
   {table_id: 338, title: "Befolkningssammensetning_antall_andel", ...},
   {table_id: 171, title: "Befolkningsframskriving", ...}]
```

### Tool 3: describe_table

Purpose: The primary tool for understanding a table's structure. Gives the agent everything it needs to construct a data query.
Parameters:
- `source_id` (string, required)
- `table_id` (integer, required)
Returns: A structured summary combining table info, dimensions, metadata, and flags. This is a composite call (4 parallel API requests).
Response structure:
```json
{
  "title": "Befolkningsvekst",
  "published_at": "2025-10-21T08:56:39Z",
  "modified_at": "2025-10-21T08:56:39Z",
  "is_official_statistics": false,
  "description": "Differansen mellom befolkningsmengden...",
  "update_frequency": "Årlig",
  "keywords": ["Befolkning", "Befolkningsvekst"],
  "source_institution": "Statistisk sentralbyrå (SSB)",
  "dimensions": [
    {
      "code": "GEO",
      "label": "Geografi",
      "total_categories": 356,
      "is_hierarchical": true,
      "hierarchy_depth": 4,
      "top_level_values": [
        {"value": "0", "label": "Hele landet", "child_count": 15}
      ],
      "note": "Use get_dimension_values to drill into sub-levels"
    },
    {
      "code": "AAR",
      "label": "År",
      "total_categories": 23,
      "is_hierarchical": false,
      "value_format": "YYYY_YYYY (e.g. 2020_2020)",
      "range": "2002..2024",
      "values": ["2002_2002", "2003_2003", ..., "2024_2024"]
    },
    {
      "code": "KJONN",
      "label": "Kjønn",
      "total_categories": 1,
      "is_fixed": true,
      "values": [{"value": "0", "label": "kjønn samlet"}],
      "note": "Single-valued, auto-included in queries"
    },
    {
      "code": "ALDER",
      "label": "Alder",
      "total_categories": 1,
      "is_fixed": true,
      "values": [{"value": "0_120", "label": "alle aldre"}],
      "note": "Single-valued, auto-included in queries"
    },
    {
      "code": "MEASURE_TYPE",
      "label": "Måltall",
      "total_categories": 2,
      "is_fixed": false,
      "values": [
        {"value": "TELLER", "label": "antall"},
        {"value": "RATE", "label": "prosent vekst"}
      ]
    }
  ],
  "flags": [
    {"symbol": "", "description": "Verdi finnes i tabellen"}
  ]
}
```

Key design decisions:
- Summarize large dimensions -- For dimensions with >20 categories (mainly GEO), show only top-level entries with child counts. The agent uses `get_dimension_values` to drill down.
- Mark fixed dimensions -- Dimensions with exactly 1 category get `is_fixed: true`. The agent knows to ignore these; `query_data` will auto-include them.
- Show value format -- AAR values are `"2020_2020"`, not `"2020"`. Show this explicitly so the agent gets the format right.
- Include metadata inline -- Strip HTML from metadata paragraphs. Extract `description`, `keywords`, `update_frequency`, and `source_institution` as top-level fields.
- Include flags inline -- Flag definitions are small and always relevant.
Implementation: Parallel fetch of:
- GET `/{sourceId}/Table/{tableId}` (table info)
- GET `/{sourceId}/Table/{tableId}/dimension` (dimensions)
- GET `/{sourceId}/Table/{tableId}/metadata` (metadata)
- GET `/{sourceId}/Table/{tableId}/flag` (flags)

Then merge and transform.
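The summarization step can be sketched as a pure function over the category tree. This assumes the `/dimension` response has been normalized into nodes of shape `{"value", "label", "children"}`, which is an assumption about the implementation, not the API's wire format.

```python
def summarize_dimension(code: str, label: str, categories: list[dict]) -> dict:
    """Summarize one dimension for describe_table's response."""
    def count(nodes: list[dict]) -> int:
        # Total categories in the (possibly nested) tree.
        return sum(1 + count(n.get("children", [])) for n in nodes)

    total = count(categories)
    out = {"code": code, "label": label, "total_categories": total}
    if total == 1:
        node = categories[0]
        out["is_fixed"] = True  # agent can omit it; query_data auto-includes
        out["values"] = [{"value": node["value"], "label": node["label"]}]
    elif total > 20:
        out["is_hierarchical"] = True  # show only the top level
        out["top_level_values"] = [
            {"value": n["value"], "label": n["label"],
             "child_count": len(n.get("children", []))}
            for n in categories
        ]
        out["note"] = "Use get_dimension_values to drill into sub-levels"
    else:
        out["values"] = [{"value": n["value"], "label": n["label"]}
                         for n in categories]
    return out
```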
Caching: Cache per (source_id, table_id) for 6 hours. Dimension structure changes rarely.
### Tool 4: get_dimension_values

Purpose: Drill into large hierarchical dimensions, typically GEO.
Parameters:
- `source_id` (string, required)
- `table_id` (integer, required)
- `dimension_code` (string, required) -- e.g. `"GEO"`.
- `parent_value` (string, optional) -- Return only children of this category, e.g. `"18"` for Nordland county. If omitted, returns top-level categories.
- `search` (string, optional) -- Case-insensitive search on category labels, e.g. `"tromsø"` to find the municipality.
Returns: Array of {value, label, child_count}.
Implementation: GET /{sourceId}/Table/{tableId}/dimension, then navigate
the category tree client-side. The full tree is fetched and cached; filtering
is done in the MCP server.
Examples:

```
# Get the top level of the GEO hierarchy
get_dimension_values("nokkel", 185, "GEO")
→ [{value: "0", label: "Hele landet", child_count: 15}]

# Get municipalities in Nordland
get_dimension_values("nokkel", 185, "GEO", parent_value="18")
→ [{value: "1804", label: "Bodø", child_count: 0},
   {value: "1806", label: "Narvik", child_count: 0}, ...]

# Search for a municipality
get_dimension_values("nokkel", 185, "GEO", search="tromsø")
→ [{value: "5501", label: "Tromsø", child_count: 0}]
```

Caching: Shares the dimension cache with describe_table.
### Tool 5: query_data

Purpose: Fetch actual data from a table. The main data retrieval tool.
Parameters:
- `source_id` (string, required)
- `table_id` (integer, required)
- `dimensions` (array, required) -- Each element:
  - `code` (string) -- Dimension code, e.g. `"GEO"`.
  - `filter` (string) -- One of `"item"`, `"all"`, `"top"`, `"bottom"`. Default: `"item"`.
  - `values` (array of strings) -- Filter values.
- `max_rows` (integer, optional) -- Limit returned rows. Default: 1000. Set to 0 for no limit (be careful).
Returns: Structured rows with labeled values.
```json
{
  "table": "Befolkningsvekst",
  "total_rows": 4,
  "rows": [
    {"GEO": "Oslo", "AAR": "2023", "KJONN": "kjønn samlet",
     "ALDER": "alle aldre", "TELLER": 3516, "RATE": 0.5},
    ...
  ],
  "truncated": false,
  "dimensions_used": {
    "GEO": {"filter": "item", "values": ["0301"]},
    "AAR": {"filter": "bottom", "values": ["2"]},
    "KJONN": {"filter": "item", "values": ["0"]},
    "ALDER": {"filter": "item", "values": ["0_120"]},
    "MEASURE_TYPE": {"filter": "all", "values": ["*"]}
  }
}
```

Key design decisions:
- Default to csv2 internally -- Fetch as csv2 (human-readable labels) and parse into rows. CSV is simpler for an agent to reason about than JSON-stat2. The tool internally requests csv2 and structures it.
- Auto-include fixed dimensions -- If the agent omits a dimension that has only 1 category (like KJONN or ALDER), the tool adds it automatically with `filter: "item"` and the single value. The agent only needs to specify the dimensions it actually cares about.
- Normalize year values -- If the agent sends `"2020"` for AAR, the tool translates it to `"2020_2020"`. The `YYYY_YYYY` format is an internal API convention the agent shouldn't need to know about.
- Default MEASURE_TYPE -- If omitted, default to `filter: "all", values: ["*"]` to get all measures. Most agents want all available metrics.
- Row limit with truncation flag -- Default 1000 rows. Return a `truncated: true` flag and a `total_rows` count so the agent knows if there's more data.
- Echo back dimensions_used -- Show what was actually sent to the API (after auto-completion), so the agent can see the full query.
Implementation:
- Fetch dimension info if not cached (to know fixed dimensions and validate)
- Auto-complete missing fixed dimensions
- Normalize year values
- POST `/{sourceId}/Table/{tableId}/data` with format=csv2
- Parse the CSV response into row objects
- Apply the row limit and compute truncation
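The auto-completion and normalization steps are pure and easy to unit-test against recorded fixtures. A sketch (validation of unknown codes/values is omitted; `dims` maps each dimension code to its category values and is assumed to come from the cached `/dimension` response):

```python
def complete_query(requested: list[dict], dims: dict[str, list[str]]) -> list[dict]:
    """Fill in what the agent left out, per the query_data design above."""
    query = {d["code"]: dict(d) for d in requested}
    for code, values in dims.items():
        if code in query:
            continue
        if len(values) == 1:
            # Fixed dimension (e.g. KJONN, ALDER): auto-include its only value.
            query[code] = {"code": code, "filter": "item", "values": list(values)}
        elif code == "MEASURE_TYPE":
            # Default: return all measures.
            query[code] = {"code": code, "filter": "all", "values": ["*"]}
    for d in query.values():
        d.setdefault("filter", "item")
        if d["code"] == "AAR" and d["filter"] == "item":
            # Normalize "2020" to the API's "2020_2020" convention.
            d["values"] = [v if "_" in v else f"{v}_{v}" for v in d["values"]]
    return list(query.values())
```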
Error handling: The API returns ProblemDetails (RFC 7807) on 400/404/422. Transform into clear error messages:
- "Dimension 'XYZ' is not valid for this table. Available: GEO, AAR, ..."
- "Value '2025_2025' not found in dimension AAR. Range: 2002..2024"
- "maxRowCount exceeded. Requested ~50000 rows, limit is 1000. Narrow filters."
### Tool 6: get_query_template

Purpose: Fallback tool returning the raw query template from the API. Useful when the agent needs to see exactly what the API expects.
Parameters:
- `source_id` (string, required)
- `table_id` (integer, required)
Returns: The raw DataRequest JSON as returned by the API.
Implementation: GET /{sourceId}/Table/{tableId}/query. Pass through.
When to use: When query_data auto-completion isn't behaving as expected,
or the agent wants to see the complete list of available values for all
dimensions.
### Tools NOT included (and why)

| Considered tool | Decision | Reason |
|---|---|---|
| `get_flags` (standalone) | Dropped | Folded into `describe_table` |
| `get_metadata` (standalone) | Dropped | Folded into `describe_table` |
| `get_table_info` (standalone) | Dropped | Folded into `describe_table` |
| `search_across_sources` | Dropped | Too expensive (13 API calls). Agent can call `list_tables` per source |
| `get_data_jsonstat` | Dropped | Agents don't need raw JSON-stat2 |
| `get_data_parquet` | Dropped | Binary format, not useful for LLM context |

## Architecture

### Stack

- Language: Python 3.12+
- MCP framework: FastMCP (`mcp[cli]`)
- HTTP server: Uvicorn (`uvicorn>=0.30`) for SSE/HTTP transport
- HTTP client: `httpx` (async)
- CSV parsing: stdlib `csv`
- HTML stripping: stdlib `html.parser` or `re` (simple tag removal)
- Build system: Hatchling (matches the Fhi.Metadata.MCPserver pattern)
### Transport

The server supports multiple transports via CLI flag, following the same pattern
as Fhi.Metadata.MCPserver:
| Transport | Use case | Endpoint |
|---|---|---|
| `sse` | Local dev + Skybert deployment | `/sse` |
| `streamable-http` | Future HTTP-only clients | `/mcp` |
| `stdio` | Direct pipe (legacy) | stdin/stdout |

Default: sse on 0.0.0.0:8000. This means the server works over HTTP
both locally and when deployed to Skybert, with no transport change needed.
CLI entry point:

```shell
fhi-statistikk-mcp --transport sse --host 0.0.0.0 --port 8000
```

## Project Structure

```
fhi-statistikk-mcp/
├── .github/
│   └── workflows/
│       └── docker-build-push.yaml   # CI/CD → crfhiskybert.azurecr.io
├── .mcp.json.local                  # Local dev: http://localhost:8000/sse
├── .mcp.json.public                 # Production: https://<skybert-url>/sse
├── Dockerfile                       # Multi-stage, Python 3.12-slim
├── pyproject.toml                   # Hatchling build, entry point
├── README.md
├── src/
│   └── fhi_statistikk_mcp/
│       ├── __init__.py
│       ├── server.py                # MCP server, tool definitions, main()
│       ├── api_client.py            # Async httpx client for FHI API
│       ├── transformers.py          # CSV parsing, dimension summarization
│       └── cache.py                 # Simple TTL cache
└── tests/
    ├── test_transformers.py
    ├── test_cache.py
    └── fixtures/                    # Recorded API responses
        ├── sources.json
        ├── tables_nokkel.json
        ├── dimensions_185.json
        ├── metadata_185.json
        ├── flags_185.json
        └── data_185.csv
```

## MCP Client Configuration

Local development (`.mcp.json.local`):

```json
{
  "mcpServers": {
    "fhi-statistikk": {
      "type": "sse",
      "url": "http://localhost:8000/sse"
    }
  }
}
```

Production (`.mcp.json.public`):

```json
{
  "mcpServers": {
    "fhi-statistikk": {
      "type": "sse",
      "url": "https://<skybert-url>/sse"
    }
  }
}
```

## Dockerfile

Following the Fhi.Metadata.MCPserver pattern:

```dockerfile
FROM python:3.12-slim AS base
WORKDIR /app
COPY pyproject.toml .
COPY src/ src/
RUN pip install --no-cache-dir .

FROM base AS prod
EXPOSE 8000
CMD ["fhi-statistikk-mcp", "--transport", "sse", "--host", "0.0.0.0", "--port", "8000"]
```

## CI/CD

Same pipeline pattern as Fhi.Metadata.MCPserver:
- Trigger on push to `main` touching `src/`, `Dockerfile`, or `pyproject.toml`
- Azure Federated Identity (OIDC) login
- Push to `crfhiskybert.azurecr.io/fida/ki/statistikk-mcp`
- Tag: git short SHA + `latest`
- Dispatch to GitOps repo for Skybert deployment
## Logging

Force all loggers (uvicorn, mcp, fastmcp) to stderr with simple format. Print startup info (API base URL, cache status) to stderr. No persistent log files -- container logging handles that on Skybert.
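One way to force everything to stderr with a simple format (a sketch; the exact uvicorn/mcp logger names may vary by library version):

```python
import logging
import sys

def setup_logging(level: int = logging.INFO) -> logging.Logger:
    """Send all log records (uvicorn, mcp, our own) to stderr."""
    handler = logging.StreamHandler(sys.stderr)
    handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
    root = logging.getLogger()
    root.handlers = [handler]  # replace whatever handlers libraries installed
    root.setLevel(level)
    for name in ("uvicorn", "uvicorn.error", "uvicorn.access", "mcp"):
        lib = logging.getLogger(name)
        lib.handlers = []      # defer everything to the root handler
        lib.propagate = True
    return root
```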
## Caching Strategy

| Data | TTL | Key | Reason |
|---|---|---|---|
| Source list | 24h | `"sources"` | Rarely changes |
| Table list | 1h | `source_id` | New tables published daily |
| Dimensions | 6h | `(source_id, table_id)` | Dimension structure is stable |
| Metadata | 6h | `(source_id, table_id)` | Metadata edits are rare |
| Flags | 6h | `(source_id, table_id)` | Flags rarely change |
| Query templates | 6h | `(source_id, table_id)` | Follows dimension changes |
| Data responses | No cache | -- | Queries vary too much to cache |

In-memory dict with TTL. No external dependency needed -- the data volume is small and the server is single-process.
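A minimal sketch of such a cache (lazy eviction on read; `TTLCache` is a hypothetical name for what `cache.py` would provide):

```python
import time

class TTLCache:
    """Single-process in-memory TTL cache; entries are (expires_at, value)."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # expired: evict lazily on read
            return None
        return value

    def set(self, key, value, ttl_seconds: float):
        self._store[key] = (time.monotonic() + ttl_seconds, value)
```

Keys follow the table above, e.g. `cache.set(("nokkel", 185), dims, 6 * 3600)` for dimensions.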
## Rate Limiting

No documented rate limits, but this is a government API. Be polite:
- Max 5 concurrent requests
- 100ms minimum between requests
- Retry with exponential backoff on 429/503
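The first two rules can be sketched with asyncio primitives (a sketch; retry/backoff on 429/503 would wrap the `await do_request()` call and is omitted here):

```python
import asyncio
import time

class PoliteClient:
    """Politeness guard: at most `max_concurrent` in-flight requests, and at
    least `min_interval` seconds between request starts."""

    def __init__(self, max_concurrent: int = 5, min_interval: float = 0.1):
        self._sem = asyncio.Semaphore(max_concurrent)
        self._gate = asyncio.Lock()
        self._min_interval = min_interval
        self._last_start = 0.0

    async def request(self, do_request):
        async with self._sem:            # cap concurrency
            async with self._gate:       # space out request starts
                wait = self._last_start + self._min_interval - time.monotonic()
                if wait > 0:
                    await asyncio.sleep(wait)
                self._last_start = time.monotonic()
            return await do_request()
```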
## Error Mapping

| API Response | MCP Tool Error |
|---|---|
| 400 Bad Request | Descriptive message from ProblemDetails.detail |
| 404 Not Found | "Source/table not found: {id}" |
| 422 Unprocessable Entity | "Query validation failed: {detail}" |
| Network timeout | "API request timed out. Try reducing query scope." |
| CSV parse error | "Failed to parse response. Try get_query_template." |
## Unicode / Fuzzy Search

Dimension value search (in get_dimension_values) normalizes both query and
labels for accent-insensitive matching:
- Normalize with `unicodedata.normalize("NFD")` and strip combining marks (handles `å` → `a` and other accented characters)
- Map `ø` and `æ` explicitly -- these have no canonical decomposition, so NFD alone won't fold them
- Case-insensitive comparison
- `"tromso"` matches `"Tromsø"`, `"barum"` matches `"Bærum"`
- Preserve original labels in output
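A sketch of one folding that satisfies the examples above (`æ` is folded to plain `a` so that "barum" matches "Bærum"; a production version might additionally accept the "ae" spelling):

```python
import unicodedata

# ø and æ have no canonical decomposition, so they need an explicit mapping;
# å decomposes under NFD but is included for uniformity.
_NO_MAP = str.maketrans({"ø": "o", "æ": "a", "å": "a"})

def fold(s: str) -> str:
    """Accent- and case-insensitive key: 'Tromsø' -> 'tromso'."""
    s = s.lower().translate(_NO_MAP)
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def label_matches(query: str, label: str) -> bool:
    """Substring match on folded forms; original labels stay untouched."""
    return fold(query) in fold(label)
```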
## Implementation Plan

### Phase 1: Core (MVP)

- Set up project skeleton: `pyproject.toml` with hatchling, `src/` layout, entry point `fhi-statistikk-mcp`
- Set up `server.py` with FastMCP, SSE transport, CLI args (transport, host, port), stderr logging
- Implement `api_client.py` with async httpx client, base URL config
- Implement `cache.py` with a simple TTL dict
- Implement `list_sources` tool
- Implement `list_tables` tool with client-side keyword search
- Implement `describe_table` composite tool
  - Parallel fetch of 4 endpoints
  - Dimension summarization (large-dimension truncation, fixed-dimension detection)
  - HTML stripping for metadata
  - Merge into structured response
- Implement `query_data` tool
  - Auto-completion of fixed dimensions
  - Year value normalization (`"2020"` → `"2020_2020"`)
  - Default MEASURE_TYPE to `all`/`["*"]`
  - CSV parsing and row structuring
  - Row limit and truncation
- Implement `get_dimension_values` with hierarchy navigation and accent-insensitive search
- Implement `get_query_template` passthrough
- Add `.mcp.json.local` for local dev
- Test all tools against the live API
### Phase 2: Deployment & Polish

- Add `Dockerfile` (multi-stage, Python 3.12-slim)
- Add `.github/workflows/docker-build-push.yaml` for CI/CD
- Add `.mcp.json.public` with the Skybert URL
- Add comprehensive error handling and error messages
- Add rate limiting
- Record API fixtures for offline testing
- Write unit tests for transformers and cache
- Write integration tests against the live API
### Phase 3: Optional Enhancements

- Add a `search_all_tables` convenience tool (if agents frequently need it)
- Add MCP resources for static reference data (source descriptions, common dimension codes)
- Add MCP prompt templates (e.g. "finn helsedata om ")
## Tool Description Guidelines

MCP tool descriptions are what the agent uses to decide which tool to call. They should be written for an LLM audience:
- Lead with the purpose, not the endpoint
- Include example parameter values
- Document non-obvious conventions (year format, dimension codes)
- Mention what `describe_table` returns, since it's the prerequisite for `query_data`
- Note that Norwegian labels are the default (GEO labels are in Norwegian)
Example tool description for `query_data`:

```
Fetch statistical data from an FHI table. Before calling this, use
describe_table to understand the table's dimensions and available values.

You only need to specify the dimensions you care about. Fixed dimensions
(single-valued, like KJONN="kjønn samlet") are auto-included. If you omit
MEASURE_TYPE, all measures are returned.

Year values: use "2020" (auto-translated to "2020_2020") or the full format.
Filters: "item" (exact values), "all" (wildcard, e.g. ["*"]), "top" (first N),
"bottom" (last N).

Returns labeled rows, max 1000 by default. Check the "truncated" field.
```

## Resolved Decisions

| Question | Decision | Rationale |
|---|---|---|
| Hosting | SSE locally, same for Skybert | Follow Fhi.Metadata.MCPserver pattern. HTTP from day one, no transport change on deploy. |
| JSON-stat2 output | No | csv2 is sufficient for LLM agents. JSON-stat2 is for statistical software. |
| Fuzzy dimension search | Yes, accent-insensitive | Norwegian chars (æøå) will trip up agents. NFD-normalize and strip combining marks, plus explicit mappings for æ/ø (no canonical decomposition). |
| Sample data in describe_table | No | Adds latency. Agent calls query_data with max_rows=5 if it wants a preview. |