# Model Capability Ranking

Auto mode needs a short, trustworthy list of models to suggest. The server computes a capability rank for every model at runtime using a simple recipe:

1. Start with the human-supplied `intelligence_score` (1–20). This is the anchor—multiply it by five to map onto the 0–100 scale the server uses.
2. Add a few light bonuses for hard capabilities:
   - Context window: up to +5 (log-scale bonus when the model exceeds ~1K tokens).
   - Output budget: +2 for ≥65K tokens, +1 for ≥32K.
   - Extended thinking: +3 when the provider supports it.
   - Function calling / JSON / images: +1 each when available.
   - Custom endpoints: -1 to nudge cloud-hosted defaults ahead unless tuned.
3. Clamp the final score to 0–100 so downstream callers can rely on the range.

In code this looks like:

```python
from math import log10

def clamp(value, low, high):
    return max(low, min(high, value))

# Anchor: human-supplied 1-20 score mapped onto the 0-100 range.
base = clamp(intelligence_score, 1, 20) * 5

ctx_bonus = min(5, max(0, log10(context_window) - 3))
output_bonus = 2 if max_output_tokens >= 65_000 else 1 if max_output_tokens >= 32_000 else 0
feature_bonus = (
    (3 if supports_extended_thinking else 0)
    + (1 if supports_function_calling else 0)
    + (1 if supports_json_mode else 0)
    + (1 if supports_images else 0)
)
penalty = 1 if is_custom else 0

effective_rank = clamp(base + ctx_bonus + output_bonus + feature_bonus - penalty, 0, 100)
```
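For example, a hypothetical model scored 18 with a 1M-token context window, a 65K-token output budget and all four feature flags (and no custom endpoint) gets `base = 90`, `ctx_bonus = 3`, `output_bonus = 2`, `feature_bonus = 6` and `penalty = 0`; the raw total of 101 is clamped back to 100.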

The bonuses are intentionally small—the human intelligence score does most of the work so you can enforce organisational preferences easily.

## Picking an intelligence score

A straightforward rubric that mirrors typical provider tiers:

| Intelligence | Guidance |
|---|---|
| 18–19 | Frontier reasoning models (Gemini 2.5 Pro, GPT-5) |
| 15–17 | Strong general models with large context (O3 Pro, DeepSeek R1) |
| 12–14 | Balanced assistants (Claude Opus/Sonnet, Mistral Large) |
| 9–11 | Fast distillations (Gemini Flash, GPT-5 Mini, Mistral Medium) |
| 6–8 | Local or efficiency-focused models (Llama 3 70B, Claude Haiku) |
| ≤5 | Experimental/lightweight models |

Record the reasoning for your scores so future updates stay consistent.

## How the rank is used

The ranked list is cached per provider and consumed by:

- Tool schemas (`model` parameter descriptions) when auto mode is active.
- The `listmodels` tool's "top models" sections.
- Fallback messaging when a requested model is unavailable.

Because the rank is computed after restriction filters, only allowed models appear in these summaries.
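As a rough sketch of how a consumer might read that cached list (names here are assumptions: `get_capabilities_by_rank()` is the helper mentioned under "Customising further", but its exact signature and the capability attributes shown are placeholders):

```python
# Hypothetical sketch: build a "top models" line for a tool schema or the
# listmodels output. Assumes get_capabilities_by_rank() returns capability
# objects sorted by effective rank and already filtered by *_ALLOWED_MODELS;
# the attribute names below are placeholders.
def top_models_summary(provider, limit: int = 5) -> str:
    ranked = provider.get_capabilities_by_rank()
    return ", ".join(
        f"{cap.model_name} (rank {cap.effective_rank})" for cap in ranked[:limit]
    )
```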

## Customising further

If you need a different weighting you can:

- Override `intelligence_score` in your provider or custom model config.
- Subclass the provider and override `get_effective_capability_rank()` (see the sketch after this list).
- Post-process the rank via `get_capabilities_by_rank()` before surfacing it.
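For the subclassing route, a minimal sketch might look like the following; everything except the `get_effective_capability_rank()` hook is a placeholder, since the real provider base class and capability model live in the server code:

```python
from dataclasses import dataclass

# Placeholder types for the sketch: the real provider base class and capability
# model are defined by the server, not here.
@dataclass
class ModelCapabilities:
    intelligence_score: int = 10
    supports_extended_thinking: bool = False

class BaseProvider:
    def get_effective_capability_rank(self, caps: ModelCapabilities) -> int:
        # Stand-in for the ranking recipe described above.
        return min(caps.intelligence_score * 5, 100)

class ThinkingBiasedProvider(BaseProvider):
    def get_effective_capability_rank(self, caps: ModelCapabilities) -> int:
        rank = super().get_effective_capability_rank(caps)
        if caps.supports_extended_thinking:
            rank += 5  # example tweak: favour extended-thinking models a little more
        return min(rank, 100)
```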

Most teams find that adjusting intelligence_score alone is enough to keep auto mode honest without revisiting code.