Breaking change: openrouter_models.json -> custom_models.json

* Support for Custom URLs and custom models, including locally hosted models such as ollama * Support for native + openrouter + local models (i.e. dozens of models) means you can start delegating sub-tasks to particular models or work to local models such as localizations or other boring work etc. * Several tests added * precommit to also include untracked (new) files * Logfile auto rollover * Improved logging
2025-06-13 15:22:09 +04:00
parent f5fdf7b2ed
commit f44ca326ef
27 changed files with 1692 additions and 351 deletions
--- a/README.md
+++ b/README.md
@@ -3,7 +3,7 @@
 https://github.com/user-attachments/assets/8097e18e-b926-4d8b-ba14-a979e4c58bda

 <div align="center">  
-  <b>🤖 Claude + [Gemini / O3 / OpenRouter / Any Model] = Your Ultimate AI Development Team</b>
+  <b>🤖 Claude + [Gemini / O3 / OpenRouter / Ollama / Any Model] = Your Ultimate AI Development Team</b>
 </div>

 <br/>
@@ -27,7 +27,7 @@ with context carrying forward seamlessly.
 All within a single conversation thread! Gemini Pro in step 6 _knows_ what was recommended by O3 in step 3! Taking that context
 and review into consideration to aid with its pre-commit review.

-**Think of it as Claude Code _for_ Claude Code.** This MCP isn't magic. It's just **super-glue**.
+**Think of it as Claude Code _for_ Claude Code.** This MCP isn't magic. It's just **super-glue**. 

 ## Quick Navigation

@@ -63,12 +63,13 @@ Claude is brilliant, but sometimes you need:
 - **Multiple AI perspectives** - Let Claude orchestrate between different models to get the best analysis
 - **Automatic model selection** - Claude picks the right model for each task (or you can specify)
 - **A senior developer partner** to validate and extend ideas ([`chat`](#1-chat---general-development-chat--collaborative-thinking))
- **A second opinion** on complex architectural decisions - augment Claude's thinking with perspectives from Gemini Pro, O3, or [dozens of other models via OpenRouter](docs/openrouter.md) ([`thinkdeep`](#2-thinkdeep---extended-reasoning-partner))
+- **A second opinion** on complex architectural decisions - augment Claude's thinking with perspectives from Gemini Pro, O3, or [dozens of other models via custom endpoints](docs/custom_models.md) ([`thinkdeep`](#2-thinkdeep---extended-reasoning-partner))
 - **Professional code reviews** with actionable feedback across entire repositories ([`codereview`](#3-codereview---professional-code-review))
 - **Pre-commit validation** with deep analysis using the best model for the job ([`precommit`](#4-precommit---pre-commit-validation))
 - **Expert debugging** - O3 for logical issues, Gemini for architectural problems ([`debug`](#5-debug---expert-debugging-assistant))
 - **Extended context windows beyond Claude's limits** - Delegate analysis to Gemini (1M tokens) or O3 (200K tokens) for entire codebases, large datasets, or comprehensive documentation
- **Model-specific strengths** - Extended thinking with Gemini Pro, fast iteration with Flash, strong reasoning with O3
+- **Model-specific strengths** - Extended thinking with Gemini Pro, fast iteration with Flash, strong reasoning with O3, local privacy with Ollama
+- **Local model support** - Run models like Llama 3.2 locally via Ollama, vLLM, or LM Studio for privacy and cost control
 - **Dynamic collaboration** - Models can request additional context and follow-up replies from Claude mid-analysis
 - **Smart file handling** - Automatically expands directories, manages token limits based on model capacity
 - **[Bypass MCP's token limits](#working-with-large-prompts)** - Work around MCP's 25K limit automatically
@@ -100,16 +101,25 @@ The final implementation resulted in a 26% improvement in JSON parsing performan
 ### 1. Get API Keys (at least one required)

 **Option A: OpenRouter (Access multiple models with one API)**
- **OpenRouter**: Visit [OpenRouter](https://openrouter.ai/) for access to multiple models through one API. [Setup Guide](docs/openrouter.md)
+- **OpenRouter**: Visit [OpenRouter](https://openrouter.ai/) for access to multiple models through one API. [Setup Guide](docs/custom_models.md)
  - Control model access and spending limits directly in your OpenRouter dashboard
-  - Configure model aliases in `conf/openrouter_models.json`
+  - Configure model aliases in [`conf/custom_models.json`](conf/custom_models.json)

 **Option B: Native APIs**
 - **Gemini**: Visit [Google AI Studio](https://makersuite.google.com/app/apikey) and generate an API key. For best results with Gemini 2.5 Pro, use a paid API key as the free tier has limited access to the latest models.
 - **OpenAI**: Visit [OpenAI Platform](https://platform.openai.com/api-keys) to get an API key for O3 model access.

-> **Note:** Using both OpenRouter and native APIs creates ambiguity about which provider serves each model. 
-> If both are configured, native APIs will take priority for `gemini` and `o3`.
+**Option C: Custom API Endpoints (Local models like Ollama, vLLM)**
+[Please see the setup guide](docs/custom_models.md#custom-api-setup-ollama-vllm-etc). With a custom API you can use:
+- **Ollama**: Run models like Llama 3.2 locally for free inference
+- **vLLM**: Self-hosted inference server for high-throughput inference
+- **LM Studio**: Local model hosting with OpenAI-compatible API interface
+- **Text Generation WebUI**: Popular local interface for running models
+- **Any OpenAI-compatible API**: Custom endpoints for your own infrastructure
+
+> **Note:** Using all three options may create ambiguity about which provider / model to use if there is an overlap. 
+> If all APIs are configured, native APIs will take priority when there is a clash in model name, such as for `gemini` and `o3`.
+> Configure your model aliases and give them unique names in [`conf/custom_models.json`](conf/custom_models.json)

 ### 2. Clone and Set Up

@@ -138,10 +148,16 @@ nano .env
 # The file will contain, at least one should be set:
 # GEMINI_API_KEY=your-gemini-api-key-here  # For Gemini models
 # OPENAI_API_KEY=your-openai-api-key-here  # For O3 model
-# OPENROUTER_API_KEY=your-openrouter-key  # For OpenRouter (see docs/openrouter.md)
+# OPENROUTER_API_KEY=your-openrouter-key  # For OpenRouter (see docs/custom_models.md)
+
+# For local models (Ollama, vLLM, etc.) - Note: Use host.docker.internal for Docker networking:
+# CUSTOM_API_URL=http://host.docker.internal:11434/v1  # Ollama example (NOT localhost!)
+# CUSTOM_API_KEY=                                      # Empty for Ollama
+# CUSTOM_MODEL_NAME=llama3.2                          # Default model
+
 # WORKSPACE_ROOT=/Users/your-username  (automatically configured)

-# Note: At least one API key is required
+# Note: At least one API key OR custom URL is required
 ```

 ### 4. Configure Claude
@@ -222,6 +238,8 @@ Just ask Claude naturally:
 - "Use flash to suggest how to format this code based on the specs mentioned in policy.md" → Uses Gemini Flash specifically
 - "Think deeply about this and get o3 to debug this logic error I found in the checkOrders() function" → Uses O3 specifically
 - "Brainstorm scaling strategies with pro. Study the code, pick your preferred strategy and debate with pro to settle on two best approaches" → Uses Gemini Pro specifically
+- "Use local-llama to localize and add missing translations to this project" → Uses local Llama 3.2 via custom URL
+- "First use local-llama for a quick local analysis, then use opus for a thorough security review" → Uses both providers in sequence

 > **Remember:** Claude remains in control — but **you** are the true orchestrator.  
 > You're the prompter, the guide, the puppeteer.  
@@ -245,6 +263,7 @@ Just ask Claude naturally:
 - Quick formatting check → Claude picks Flash
 - Logical debugging → Claude picks O3
 - General explanations → Claude picks Flash for speed
+- Local analysis → Claude picks your Ollama model

 **Pro Tip:** Thinking modes (for Gemini models) control depth vs token cost. Use "minimal" or "low" for quick tasks, "high" or "max" for complex problems. [Learn more](#thinking-modes---managing-token-costs--quality)

@@ -753,8 +772,12 @@ OPENAI_API_KEY=your-openai-key    # Enables O3, O3-mini
 | **`flash`** (Gemini 2.0 Flash) | Google | 1M tokens | Ultra-fast responses | Quick checks, formatting, simple analysis |
 | **`o3`** | OpenAI | 200K tokens | Strong logical reasoning | Debugging logic errors, systematic analysis |
 | **`o3-mini`** | OpenAI | 200K tokens | Balanced speed/quality | Moderate complexity tasks |
+| **`llama`** (Llama 3.2) | Custom/Local | 128K tokens | Local inference, privacy | On-device analysis, cost-free processing |
 | **Any model** | OpenRouter | Varies | Access to GPT-4, Claude, Llama, etc. | User-specified or based on task requirements |

+**Mix & Match Providers:** Use multiple providers simultaneously! Set both `OPENROUTER_API_KEY` and `CUSTOM_API_URL` to access 
+cloud models (expensive/powerful) AND local models (free/private) in the same conversation.
+
 **Manual Model Selection:**
 You can specify a default model instead of auto mode: