# Model Capability Ranking
Auto mode needs a short, trustworthy list of models to suggest. The server
computes a capability rank for every model at runtime using a simple recipe:
1. Start with the human-supplied `intelligence_score` (1–20). This is the
anchor; multiply it by five to map onto the 0–100 scale the server uses.
2. Add a few light bonuses for hard capabilities:
- **Context window:** up to +5 (log-scale bonus when the model exceeds ~1K tokens).
- **Output budget:** +2 for ≥65K tokens, +1 for ≥32K.
- **Extended thinking:** +3 when the provider supports it.
- **Function calling / JSON / images:** +1 each when available.
- **Custom endpoints:** −1 to nudge cloud-hosted defaults ahead unless tuned.
3. Clamp the final score to 0–100 so downstream callers can rely on the range.
In code this looks like:
```python
# `clamp(value, low, high)` bounds a value to [low, high]; `log10` is the base-10 logarithm.
base = clamp(intelligence_score, 1, 20) * 5             # anchor: 1-20 maps to 5-100
ctx_bonus = min(5, max(0, log10(context_window) - 3))   # up to +5 beyond ~1K tokens
output_bonus = 2 if max_output_tokens >= 65_000 else 1 if max_output_tokens >= 32_000 else 0
feature_bonus = (
    (3 if supports_extended_thinking else 0)
    + (1 if supports_function_calling else 0)
    + (1 if supports_json_mode else 0)
    + (1 if supports_images else 0)
)
penalty = 1 if provider == CUSTOM else 0                # custom/self-hosted endpoints sit just behind cloud defaults
effective_rank = clamp(base + ctx_bonus + output_bonus + feature_bonus - penalty, 0, 100)
```
The bonuses are intentionally small—the human intelligence score does most
of the work so you can enforce organisational preferences easily.
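To make the arithmetic concrete, here is a small self-contained sketch of the same recipe using invented numbers; the `effective_rank` helper and the model values below are hypothetical, not part of the server's API.
```python
from math import log10

def clamp(value, low, high):
    """Bound a value to the inclusive range [low, high]."""
    return max(low, min(high, value))

def effective_rank(intelligence_score, context_window, max_output_tokens,
                   extended_thinking, function_calling, json_mode, images,
                   is_custom_endpoint):
    base = clamp(intelligence_score, 1, 20) * 5
    ctx_bonus = min(5, max(0, log10(context_window) - 3))
    output_bonus = 2 if max_output_tokens >= 65_000 else 1 if max_output_tokens >= 32_000 else 0
    feature_bonus = ((3 if extended_thinking else 0)
                     + (1 if function_calling else 0)
                     + (1 if json_mode else 0)
                     + (1 if images else 0))
    penalty = 1 if is_custom_endpoint else 0
    return clamp(base + ctx_bonus + output_bonus + feature_bonus - penalty, 0, 100)

# Hypothetical frontier model: score 18, 1M-token context, 65K output budget,
# every feature flag on, cloud-hosted (no custom-endpoint penalty):
# 90 (base) + 3 (context) + 2 (output) + 6 (features) = 101, clamped to 100.
print(effective_rank(18, 1_000_000, 65_536, True, True, True, True, False))  # -> 100
```
Even a maxed-out set of bonuses only adds about 13 points (5 + 2 + 6), so two models more than a couple of intelligence points apart will not swap places because of them.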
## Picking an intelligence score
A straightforward rubric that mirrors typical provider tiers:
| Intelligence | Guidance |
|--------------|----------|
| 18–19 | Frontier reasoning models (Gemini 2.5 Pro, GPT-5) |
| 15–17 | Strong general models with large context (O3 Pro, DeepSeek R1) |
| 12–14 | Balanced assistants (Claude Opus/Sonnet, Mistral Large) |
| 9–11 | Fast distillations (Gemini Flash, GPT-5 Mini, Mistral Medium) |
| 6–8 | Local or efficiency-focused models (Llama 3 70B, Claude Haiku) |
| ≤5 | Experimental/lightweight models |
Record the reasoning for your scores so future updates stay consistent.
## How the rank is used
The ranked list is cached per provider and consumed by:
- Tool schemas (`model` parameter descriptions) when auto mode is active.
- The `listmodels` tool’s “top models” sections.
- Fallback messaging when a requested model is unavailable.
Because the rank is computed after restriction filters, only allowed models
appear in these summaries.
## Customising further
If you need a different weighting you can:
- Override `intelligence_score` in your provider or custom model config.
- Subclass the provider and override `get_effective_capability_rank()`.
- Post-process the rank via `get_capabilities_by_rank()` before surfacing it.
Most teams find that adjusting `intelligence_score` alone is enough to keep
auto mode honest without revisiting code.
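If you do go the subclassing route, the override looks roughly like the sketch below. Every name in it is a hypothetical stand-in (the `BaseProvider` and `PinnedRankProvider` classes, the dict-shaped capabilities, the allow-list); only the idea of overriding `get_effective_capability_rank()` comes from this document, so check the real provider classes for the actual signature.
```python
# Illustrative only: class names, the capabilities shape, and the allow-list are
# assumptions, not the server's actual API.
class BaseProvider:
    def get_effective_capability_rank(self, capabilities) -> int:
        # Stand-in for the built-in recipe described above (base score only).
        return capabilities.get("intelligence_score", 10) * 5

class PinnedRankProvider(BaseProvider):
    PREFERRED = {"vendor/flagship-model"}  # hypothetical allow-list

    def get_effective_capability_rank(self, capabilities) -> int:
        rank = super().get_effective_capability_rank(capabilities)
        # Keep everything outside the allow-list below the preferred tier so
        # auto mode surfaces vetted defaults first.
        if capabilities.get("model_name") not in self.PREFERRED:
            rank = min(rank, 60)
        return rank

# Usage with hypothetical capability dicts:
provider = PinnedRankProvider()
print(provider.get_effective_capability_rank(
    {"model_name": "vendor/flagship-model", "intelligence_score": 18}))  # -> 90
print(provider.get_effective_capability_rank(
    {"model_name": "vendor/other-model", "intelligence_score": 18}))     # -> 60
```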