feat: added intelligence_score to the model capabilities schema; a 1-20 number that can be specified to influence the sort order of models presented to the CLI in auto selection mode

fix: model definition re-introduced into the schema but intelligently and only a summary is generated per tool. Required to ensure CLI calls and uses the correct model
fix: removed `model` param from some tools where this wasn't needed
fix: fixed adherence to `*_ALLOWED_MODELS` by advertising only the allowed models to the CLI
fix: removed duplicates across providers when passing canonical names back to the CLI; the first enabled provider wins
2025-10-02 21:43:44 +04:00
parent e78fe35a1b
commit 6cab9e56fc
22 changed files with 525 additions and 110 deletions
--- a/docs/model_ranking.md
+++ b/docs/model_ranking.md
@@ -0,0 +1,69 @@
+# Model Capability Ranking
+
+Auto mode needs a short, trustworthy list of models to suggest. The server
+computes a capability rank for every model at runtime using a simple recipe:
+
+1. Start with the human-supplied `intelligence_score` (1–20). This is the
+   anchor—multiply it by five to map onto the 0–100 scale the server uses.
+2. Add a few light bonuses for hard capabilities:
+   - **Context window:** up to +5 (log-scale bonus when the model exceeds ~1K tokens).
+   - **Output budget:** +2 for ≥65K tokens, +1 for ≥32K.
+   - **Extended thinking:** +3 when the provider supports it.
+   - **Function calling / JSON / images:** +1 each when available.
+   - **Custom endpoints:** −1 to nudge cloud-hosted defaults ahead unless tuned.
+3. Clamp the final score to 0–100 so downstream callers can rely on the range.
+
+In code this looks like:
+
+```python
+base = clamp(intelligence_score, 1, 20) * 5
+ctx_bonus = min(5, max(0, log10(context_window) - 3))
+output_bonus = 2 if max_output_tokens >= 65_000 else 1 if max_output_tokens >= 32_000 else 0
+feature_bonus = (
+    (3 if supports_extended_thinking else 0)
+    + (1 if supports_function_calling else 0)
+    + (1 if supports_json_mode else 0)
+    + (1 if supports_images else 0)
+)
+penalty = 1 if is_custom else 0
+
+effective_rank = clamp(base + ctx_bonus + output_bonus + feature_bonus - penalty, 0, 100)
+```
+
+The bonuses are intentionally small—the human intelligence score does most
+of the work so you can enforce organisational preferences easily.
+
+## Picking an intelligence score
+
+A straightforward rubric that mirrors typical provider tiers:
+
+| Intelligence | Guidance |
+|--------------|----------|
+| 18–19 | Frontier reasoning models (Gemini 2.5 Pro, GPT‑5) |
+| 15–17 | Strong general models with large context (O3 Pro, DeepSeek R1) |
+| 12–14 | Balanced assistants (Claude Opus/Sonnet, Mistral Large) |
+| 9–11  | Fast distillations (Gemini Flash, GPT-5 Mini, Mistral medium) |
+| 6–8   | Local or efficiency-focused models (Llama 3 70B, Claude Haiku) |
+| ≤5    | Experimental/lightweight models |
+
+Record the reasoning for your scores so future updates stay consistent.
+
+## How the rank is used
+
+The ranked list is cached per provider and consumed by:
+- Tool schemas (`model` parameter descriptions) when auto mode is active.
+- The `listmodels` tool’s “top models” sections.
+- Fallback messaging when a requested model is unavailable.
+
+Because the rank is computed after restriction filters, only allowed models
+appear in these summaries.
+
+## Customising further
+
+If you need a different weighting you can:
+- Override `intelligence_score` in your provider or custom model config.
+- Subclass the provider and override `get_effective_capability_rank()`.
+- Post-process the rank via `get_capabilities_by_rank()` before surfacing it.
+
+Most teams find that adjusting `intelligence_score` alone is enough to keep
+auto mode honest without revisiting code.
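
The ranking recipe described in `docs/model_ranking.md` can be tried out as a standalone script. This is a minimal sketch, not the server's implementation: the `ModelCaps` dataclass, the `clamp` helper, and the free function `effective_rank` are hypothetical names introduced for illustration. The guard for a non-positive context window is an added assumption. The doc does not say whether the real rank is rounded, so the sketch keeps it as a float.

```python
from dataclasses import dataclass
from math import log10


def clamp(value, low, high):
    """Restrict value to the inclusive range [low, high]."""
    return max(low, min(high, value))


@dataclass
class ModelCaps:
    """Hypothetical container for the fields the recipe reads."""
    name: str
    intelligence_score: int
    context_window: int
    max_output_tokens: int
    supports_extended_thinking: bool = False
    supports_function_calling: bool = False
    supports_json_mode: bool = False
    supports_images: bool = False
    is_custom: bool = False


def effective_rank(caps: ModelCaps) -> float:
    """Compute the 0-100 capability rank following the documented recipe."""
    # Anchor: the human-supplied score, mapped from 1-20 onto the 0-100 scale.
    base = clamp(caps.intelligence_score, 1, 20) * 5

    # Log-scale context bonus, capped at +5; zero at or below ~1K tokens.
    # Assumption: a missing or zero context window earns no bonus.
    if caps.context_window > 0:
        ctx_bonus = min(5.0, max(0.0, log10(caps.context_window) - 3))
    else:
        ctx_bonus = 0.0

    if caps.max_output_tokens >= 65_000:
        output_bonus = 2
    elif caps.max_output_tokens >= 32_000:
        output_bonus = 1
    else:
        output_bonus = 0

    feature_bonus = (
        (3 if caps.supports_extended_thinking else 0)
        + (1 if caps.supports_function_calling else 0)
        + (1 if caps.supports_json_mode else 0)
        + (1 if caps.supports_images else 0)
    )
    penalty = 1 if caps.is_custom else 0

    return float(clamp(base + ctx_bonus + output_bonus + feature_bonus - penalty, 0, 100))


# Illustrative inputs only; these are not real model definitions.
frontier = ModelCaps(
    name="example-frontier",
    intelligence_score=18,
    context_window=1_000_000,
    max_output_tokens=65_536,
    supports_extended_thinking=True,
    supports_function_calling=True,
    supports_json_mode=True,
    supports_images=True,
)
local = ModelCaps(
    name="example-local",
    intelligence_score=10,
    context_window=128_000,
    max_output_tokens=16_000,
    supports_function_calling=True,
    supports_json_mode=True,
    is_custom=True,
)

# Highest rank first, as an auto-mode list would present them.
for caps in sorted([local, frontier], key=effective_rank, reverse=True):
    print(f"{caps.name}: {effective_rank(caps):.2f}")
```

The first example shows why the final clamp matters: 90 from the score plus 3 + 2 + 6 in bonuses would give 101, so the rank is capped at 100. The second lands a little above 53, which shows how little the bonuses move a model relative to its `intelligence_score`.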