New tool: testgen

Generates unit tests and encourages the model to auto-detect the test framework and testing style from existing samples (if available)
Fahad
2025-06-14 15:41:47 +04:00
parent 7d33aafcab
commit 4086306c58
14 changed files with 1118 additions and 9 deletions


@@ -49,6 +49,7 @@ and review into consideration to aid with its pre-commit review.
- [`precommit`](#4-precommit---pre-commit-validation) - Pre-commit validation
- [`debug`](#5-debug---expert-debugging-assistant) - Debugging help
- [`analyze`](#6-analyze---smart-file-analysis) - File analysis
- [`testgen`](#7-testgen---comprehensive-test-generation) - Test generation with edge cases
- **Advanced Usage**
- [Advanced Features](#advanced-features) - AI-to-AI conversations, large prompts, web search
@@ -254,6 +255,7 @@ Just ask Claude naturally:
- **Pre-commit validation?** → `precommit` (validate git changes before committing)
- **Something's broken?** → `debug` (root cause analysis, error tracing)
- **Want to understand code?** → `analyze` (architecture, patterns, dependencies)
- **Need comprehensive tests?** → `testgen` (generates test suites with edge cases)
- **Server info?** → `get_version` (version and configuration details)
**Auto Mode:** When `DEFAULT_MODEL=auto`, Claude automatically picks the best model for each task. You can override with: "Use flash for quick analysis" or "Use o3 to debug this".
@@ -274,7 +276,8 @@ Just ask Claude naturally:
4. [`precommit`](#4-precommit---pre-commit-validation) - Validate git changes before committing
5. [`debug`](#5-debug---expert-debugging-assistant) - Root cause analysis and debugging
6. [`analyze`](#6-analyze---smart-file-analysis) - General-purpose file and code analysis
7. [`testgen`](#7-testgen---comprehensive-test-generation) - Comprehensive test generation with edge case coverage
8. [`get_version`](#8-get_version---server-information) - Get server version and configuration
### 1. `chat` - General Development Chat & Collaborative Thinking
**Your thinking partner - bounce ideas, get second opinions, brainstorm collaboratively**
@@ -421,7 +424,30 @@ Use zen and perform a thorough precommit ensuring there aren't any new regressio
- Uses file paths (not content) for clean terminal output
- Can identify patterns, anti-patterns, and refactoring opportunities
- **Web search capability**: When enabled with `use_websearch` (default: true), the model can request Claude to perform web searches and share results back to enhance analysis with current documentation, design patterns, and best practices
### 7. `testgen` - Comprehensive Test Generation
**Generates thorough test suites with edge case coverage** based on existing code and the test framework in use.
**Thinking Mode (Extended thinking models):** Default is `medium` (8,192 tokens). Use `high` for complex systems with many interactions or `max` for critical systems requiring exhaustive test coverage.
#### Example Prompts:
**Basic Usage:**
```
"Use zen to generate tests for User.login() method"
"Generate comprehensive tests for the sorting method in src/new_sort.py using o3"
"Create tests for edge cases not already covered in our tests using gemini pro"
```
**Key Features:**
- Multi-agent workflow analyzing code paths and identifying realistic failure modes
- Generates framework-specific tests following project conventions
- Supports test pattern following when examples are provided
- Dynamic token allocation (25% for test examples, 75% for main code) - see the budgeting sketch after this list
- Prioritizes smallest test files for pattern detection
- Can reference existing test files: `"Generate tests following patterns from tests/unit/"`
- Specific code coverage - target specific functions/classes rather than testing everything
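The dynamic token allocation mentioned above can be pictured with a small sketch. This is illustrative only (the helper name is made up, not part of the tool's API); the real implementation measures the tokens the embedded examples actually use and subtracts a small reserve, but the headline split is the same: 75% of the model's context window for content, at most 25% of that for test examples, and the remainder for the code under test.

```
# Illustrative sketch of testgen's token budgeting; split_token_budget is not part of the tool's API.
def split_token_budget(context_window: int, has_examples: bool) -> dict:
    available = int(context_window * 0.75)   # 75% of the context window for content, 25% kept for the response
    examples = int(available * 0.25) if has_examples else 0  # test examples get at most 25% of the content budget
    code = available - examples              # the rest goes to the code being tested
    return {"content_budget": available, "test_examples": examples, "code": code}

# A 200k-token context window leaves 150,000 tokens for content,
# of which at most 37,500 may be spent on test example files.
print(split_token_budget(200_000, has_examples=True))
```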
### 8. `get_version` - Server Information
```
"Get zen to show its version"
```


@@ -14,7 +14,7 @@ import os
# These values are used in server responses and for tracking releases
# IMPORTANT: This is the single source of truth for version and author info
# Semantic versioning: MAJOR.MINOR.PATCH
__version__ = "4.3.3" __version__ = "4.4.0"
# Last update date in ISO format
__updated__ = "2025-06-14"
# Primary maintainer


@@ -245,6 +245,20 @@ All tools that work with files support **both individual files and entire direct
"Use o3 to think deeper about the logical flow in this algorithm" "Use o3 to think deeper about the logical flow in this algorithm"
``` ```
**`testgen`** - Comprehensive test generation with edge case coverage
- `files`: Code files or directories to generate tests for (required)
- `prompt`: Description of what to test, testing objectives, and scope (required)
- `model`: auto|pro|flash|o3|o3-mini|o4-mini|o4-mini-high (default: server default)
- `test_examples`: Optional existing test files as style/pattern reference
- `thinking_mode`: minimal|low|medium|high|max (default: medium, Gemini only)
```
"Generate tests for User.login() method with edge cases" (auto mode picks best model)
"Use pro to generate comprehensive tests for src/payment.py with max thinking mode"
"Use o3 to generate tests for algorithm correctness in sort_functions.py"
"Generate tests following patterns from tests/unit/ for new auth module"
```
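For reference, a minimal sketch of the argument payload these parameters translate into. The file paths are placeholders, and the wrapping shape assumes a standard MCP `tools/call` request; adapt it to however your client (or the simulator's `call_mcp_tool` helper) issues tool calls.

```
import json

# Hypothetical testgen call payload; paths are placeholders.
arguments = {
    "files": ["/abs/path/src/payment.py"],                                # required, absolute paths
    "prompt": "Generate comprehensive tests for payment validation edge cases",  # required
    "model": "pro",                                                        # required only when DEFAULT_MODEL=auto
    "test_examples": ["/abs/path/tests/unit/test_payment_basic.py"],       # optional style/pattern reference
    "thinking_mode": "high",                                               # minimal|low|medium|high|max (default: medium)
}
print(json.dumps({"name": "testgen", "arguments": arguments}, indent=2))
```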
## Collaborative Workflows
### Design → Review → Implement
@@ -277,13 +291,15 @@ To help choose the right tool for your needs:
1. **Have a specific error/exception?** Use `debug`
2. **Want to find bugs/issues in code?** Use `codereview`
3. **Want to understand how code works?** Use `analyze`
4. **Need comprehensive test coverage?** Use `testgen`
5. **Have analysis that needs extension/validation?** Use `thinkdeep`
6. **Want to brainstorm or discuss?** Use `chat`
**Key Distinctions:**
- `analyze` vs `codereview`: analyze explains, codereview prescribes fixes
- `chat` vs `thinkdeep`: chat is open-ended, thinkdeep extends specific analysis
- `debug` vs `codereview`: debug diagnoses runtime errors, review finds static issues
- `testgen` vs `debug`: testgen creates test suites, debug just finds issues and recommends solutions
## Working with Large Prompts


@@ -44,6 +44,7 @@ from tools import (
CodeReviewTool,
DebugIssueTool,
Precommit,
TestGenTool,
ThinkDeepTool,
)
from tools.models import ToolOutput
@@ -144,6 +145,7 @@ TOOLS = {
"analyze": AnalyzeTool(), # General-purpose file and code analysis "analyze": AnalyzeTool(), # General-purpose file and code analysis
"chat": ChatTool(), # Interactive development chat and brainstorming "chat": ChatTool(), # Interactive development chat and brainstorming
"precommit": Precommit(), # Pre-commit validation of git changes "precommit": Precommit(), # Pre-commit validation of git changes
"testgen": TestGenTool(), # Comprehensive test generation with edge case coverage
}


@@ -19,6 +19,7 @@ from .test_openrouter_fallback import OpenRouterFallbackTest
from .test_openrouter_models import OpenRouterModelsTest
from .test_per_tool_deduplication import PerToolDeduplicationTest
from .test_redis_validation import RedisValidationTest
from .test_testgen_validation import TestGenValidationTest
from .test_token_allocation_validation import TokenAllocationValidationTest
# Test registry for dynamic loading
@@ -36,6 +37,7 @@ TEST_REGISTRY = {
"openrouter_fallback": OpenRouterFallbackTest, "openrouter_fallback": OpenRouterFallbackTest,
"openrouter_models": OpenRouterModelsTest, "openrouter_models": OpenRouterModelsTest,
"token_allocation_validation": TokenAllocationValidationTest, "token_allocation_validation": TokenAllocationValidationTest,
"testgen_validation": TestGenValidationTest,
"conversation_chain_validation": ConversationChainValidationTest, "conversation_chain_validation": ConversationChainValidationTest,
} }
@@ -54,6 +56,7 @@ __all__ = [
"OpenRouterFallbackTest", "OpenRouterFallbackTest",
"OpenRouterModelsTest", "OpenRouterModelsTest",
"TokenAllocationValidationTest", "TokenAllocationValidationTest",
"TestGenValidationTest",
"ConversationChainValidationTest", "ConversationChainValidationTest",
"TEST_REGISTRY", "TEST_REGISTRY",
] ]


@@ -0,0 +1,131 @@
#!/usr/bin/env python3
"""
TestGen Tool Validation Test
Tests the testgen tool by:
- Creating a test code file with a specific function
- Using testgen to generate tests with a specific function name
- Validating that the output contains the expected test function
- Confirming the format matches test generation patterns
"""
from .base_test import BaseSimulatorTest
class TestGenValidationTest(BaseSimulatorTest):
"""Test testgen tool validation with specific function name"""
@property
def test_name(self) -> str:
return "testgen_validation"
@property
def test_description(self) -> str:
return "TestGen tool validation with specific test function"
def run_test(self) -> bool:
"""Test testgen tool with specific function name validation"""
try:
self.logger.info("Test: TestGen tool validation")
# Setup test files
self.setup_test_files()
# Create a specific code file for test generation
test_code_content = '''"""
Sample authentication module for testing testgen
"""
class UserAuthenticator:
"""Handles user authentication logic"""
def __init__(self):
self.failed_attempts = {}
self.max_attempts = 3
def validate_password(self, username, password):
"""Validate user password with security checks"""
if not username or not password:
return False
if username in self.failed_attempts:
if self.failed_attempts[username] >= self.max_attempts:
return False # Account locked
# Simple validation for demo
if len(password) < 8:
self._record_failed_attempt(username)
return False
if password == "password123": # Demo valid password
self._reset_failed_attempts(username)
return True
self._record_failed_attempt(username)
return False
def _record_failed_attempt(self, username):
"""Record a failed login attempt"""
self.failed_attempts[username] = self.failed_attempts.get(username, 0) + 1
def _reset_failed_attempts(self, username):
"""Reset failed attempts after successful login"""
if username in self.failed_attempts:
del self.failed_attempts[username]
'''
# Create the auth code file
auth_file = self.create_additional_test_file("user_auth.py", test_code_content)
# Test testgen tool with specific requirements
self.logger.info(" 1.1: Generate tests with specific function name")
response, continuation_id = self.call_mcp_tool(
"testgen",
{
"files": [auth_file],
"prompt": "Generate comprehensive tests for the UserAuthenticator.validate_password method. Include tests for edge cases, security scenarios, and account locking. Use the specific test function name 'test_password_validation_edge_cases' for one of the test methods.",
"model": "flash",
},
)
if not response:
self.logger.error("Failed to get testgen response")
return False
self.logger.info(" 1.2: Validate response contains expected test function")
# Check that the response contains the specific test function name
if "test_password_validation_edge_cases" not in response:
self.logger.error("Response does not contain the requested test function name")
self.logger.debug(f"Response content: {response[:500]}...")
return False
# Check for common test patterns
test_patterns = [
"def test_", # Test function definition
"assert", # Assertion statements
"UserAuthenticator", # Class being tested
"validate_password", # Method being tested
]
missing_patterns = []
for pattern in test_patterns:
if pattern not in response:
missing_patterns.append(pattern)
if missing_patterns:
self.logger.error(f"Response missing expected test patterns: {missing_patterns}")
self.logger.debug(f"Response content: {response[:500]}...")
return False
self.logger.info(" ✅ TestGen tool validation successful")
self.logger.info(" ✅ Generated tests contain expected function name")
self.logger.info(" ✅ Generated tests follow proper test patterns")
return True
except Exception as e:
self.logger.error(f"TestGen validation test failed: {e}")
return False
finally:
self.cleanup_test_files()


@@ -7,6 +7,7 @@ from .chat_prompt import CHAT_PROMPT
from .codereview_prompt import CODEREVIEW_PROMPT
from .debug_prompt import DEBUG_ISSUE_PROMPT
from .precommit_prompt import PRECOMMIT_PROMPT
from .testgen_prompt import TESTGEN_PROMPT
from .thinkdeep_prompt import THINKDEEP_PROMPT
__all__ = [
@@ -16,4 +17,5 @@ __all__ = [
"ANALYZE_PROMPT", "ANALYZE_PROMPT",
"CHAT_PROMPT", "CHAT_PROMPT",
"PRECOMMIT_PROMPT", "PRECOMMIT_PROMPT",
"TESTGEN_PROMPT",
]


@@ -0,0 +1,100 @@
"""
TestGen tool system prompt
"""
TESTGEN_PROMPT = """
ROLE
You are a principal software engineer who specialises in writing bullet-proof production code **and** surgical,
high-signal test suites. You reason about control flow, data flow, mutation, concurrency, failure modes, and security
in equal measure. Your mission: design and write tests that surface real-world defects before code ever leaves CI.
IF MORE INFORMATION IS NEEDED
If you need additional context (e.g., test framework details, dependencies, existing test patterns) to provide
accurate test generation, you MUST respond ONLY with this JSON format (and nothing else). Do NOT ask for the
same file you've been provided unless for some reason its content is missing or incomplete:
{"status": "clarification_required", "question": "<your brief question>",
"files_needed": ["[file name here]", "[or some folder/]"]}
MULTI-AGENT WORKFLOW
You sequentially inhabit five expert personas—each passes a concise artefact to the next:
1. **Context Profiler** derives language(s), test framework(s), build tooling, domain constraints, and existing
test idioms from the code snapshot provided.
2. **Path Analyzer** builds a map of reachable code paths (happy, error, exceptional) plus any external interactions
that are directly involved (network, DB, file-system, IPC).
3. **Adversarial Thinker** enumerates realistic failures, boundary conditions, race conditions, and misuse patterns
that historically break similar systems.
4. **Risk Prioritizer** ranks findings by production impact and likelihood; discards speculative or out-of-scope cases.
5. **Test Scaffolder** produces deterministic, isolated tests that follow the *project's* conventions (assert style,
fixture layout, naming, any mocking strategy, language and tooling etc).
TEST-GENERATION STRATEGY
- Start from public API / interface boundaries, then walk inward to critical private helpers.
- Analyze function signatures, parameters, return types, and side effects
- Map all code paths including happy paths and error conditions
- Test behaviour, not implementation details, unless white-box inspection is required to reach untestable paths.
- Include both positive and negative test cases
- Prefer property-based or table-driven tests where inputs form simple algebraic domains.
- Stub or fake **only** the minimal surface area needed; prefer in-memory fakes over mocks when feasible.
- Flag any code that cannot be tested deterministically and suggest realistic refactors (seams, dependency injection,
pure functions).
- Surface concurrency hazards with stress or fuzz tests when the language/runtime supports them.
- Focus on realistic failure modes that actually occur in production
- Remain within scope of language, framework, project. Do not over-step. Do not add unnecessary dependencies.
EDGE-CASE TAXONOMY (REAL-WORLD, HIGH-VALUE)
- **Data Shape Issues**: `null` / `undefined`, zero-length, surrogate-pair emojis, malformed UTF-8, mixed EOLs.
- **Numeric Boundaries**: -1, 0, 1, `MAX_…`, floating-point rounding, 64-bit truncation.
- **Temporal Pitfalls**: DST shifts, leap seconds, 29 Feb, Unix epoch 2038, timezone conversions.
- **Collections & Iteration**: off-by-one, concurrent modification, empty vs singleton vs large (>10⁶ items).
- **State & Sequence**: API calls out of order, idempotency violations, replay attacks.
- **External Dependencies**: slow responses, 5xx, malformed JSON/XML, TLS errors, retry storms, cancelled promises.
- **Concurrency / Async**: race conditions, deadlocks, promise rejection leaks, thread starvation.
- **Resource Exhaustion**: memory spikes, file-descriptor leaks, connection-pool saturation.
- **Locale & Encoding**: RTL scripts, uncommon locales, locale-specific formatting.
- **Security Surfaces**: injection (SQL, shell, LDAP), path traversal, privilege escalation on shared state.
TEST QUALITY PRINCIPLES
- Clear Arrange-Act-Assert sections (or given/when/then per project style) but retain and apply project norms, language
norms and framework norms and best practices.
- One behavioural assertion per test unless grouping is conventional.
- Fast: sub-100 ms/unit test; parallelisable; no remote calls.
- Deterministic: seeded randomness only; fixed stable clocks when time matters.
- Self-documenting: names read like specs; failures explain *why*, not just *what*.
FRAMEWORK SELECTION
Always autodetect from the repository. When a test framework or existing tests are not found, detect from existing
code; examples:
- **Swift / Objective-C** → XCTest (Xcode default) or Swift Testing (Apple-provided frameworks)
- **C# / .NET** → xUnit.net preferred; fall back to NUnit or MSTest if they dominate the repo.
- **C / C++** → GoogleTest (gtest/gmock) or Catch2, matching existing tooling.
- **JS/TS** → Jest, Vitest, Mocha, or project-specific wrapper.
- **Python** → pytest, unittest.
- **Java/Kotlin** → JUnit 5, TestNG.
- **Go** → built-in `testing`, `testify`.
- **Rust** → `#[test]`, `proptest`.
- **Anything Else** → follow existing conventions; never introduce a new framework without strong justification.
IF FRAMEWORK SELECTION FAILS
If you are unable to confidently determine which framework to use based on the existing test samples supplied, or if
additional test samples would help in making a final decision, you MUST respond ONLY with this JSON
format (and nothing else). Do NOT ask for the same file you've been provided unless for some reason its content
is missing or incomplete:
{"status": "test_sample_needed", "reason": "<brief reason why additional sampling is required>"}
SCOPE CONTROL
Stay strictly within the presented codebase, tech stack, and domain.
Do **not** invent features, frameworks, or speculative integrations.
Do **not** write tests for functions or classes that do not exist.
If a test idea falls outside project scope, discard it.
If a test would be a "good to have" but seems impossible given the current structure or setup of the project, highlight
it but do not attempt it or offer refactoring ideas.
DELIVERABLE
Return only the artefacts (analysis summary, coverage plan, and generated tests) that fit the detected framework
and code / project layout.
No extra commentary, no generic boilerplate.
Must comment and document the logic and the test reasoning / hypothesis in the delivered code
Remember: your value is catching the hard bugs—not inflating coverage numbers.
"""


@@ -26,10 +26,11 @@ class TestServerTools:
assert "analyze" in tool_names assert "analyze" in tool_names
assert "chat" in tool_names assert "chat" in tool_names
assert "precommit" in tool_names assert "precommit" in tool_names
assert "testgen" in tool_names
assert "get_version" in tool_names assert "get_version" in tool_names
# Should have exactly 8 tools (including testgen)
assert len(tools) == 8
# Check descriptions are verbose
for tool in tools:

tests/test_testgen.py (new file, 381 lines)

@@ -0,0 +1,381 @@
"""
Tests for TestGen tool implementation
"""
import json
import tempfile
from pathlib import Path
from unittest.mock import Mock, patch
import pytest
from tests.mock_helpers import create_mock_provider
from tools.testgen import TestGenRequest, TestGenTool
class TestTestGenTool:
"""Test the TestGen tool"""
@pytest.fixture
def tool(self):
return TestGenTool()
@pytest.fixture
def temp_files(self):
"""Create temporary test files"""
with tempfile.TemporaryDirectory() as temp_dir:
temp_path = Path(temp_dir)
# Create sample code files
code_file = temp_path / "calculator.py"
code_file.write_text(
"""
def add(a, b):
'''Add two numbers'''
return a + b
def divide(a, b):
'''Divide two numbers'''
if b == 0:
raise ValueError("Cannot divide by zero")
return a / b
"""
)
# Create sample test files (different sizes)
small_test = temp_path / "test_small.py"
small_test.write_text(
"""
import unittest
class TestBasic(unittest.TestCase):
def test_simple(self):
self.assertEqual(1 + 1, 2)
"""
)
large_test = temp_path / "test_large.py"
large_test.write_text(
"""
import unittest
from unittest.mock import Mock, patch
class TestComprehensive(unittest.TestCase):
def setUp(self):
self.mock_data = Mock()
def test_feature_one(self):
# Comprehensive test with lots of setup
result = self.process_data()
self.assertIsNotNone(result)
def test_feature_two(self):
# Another comprehensive test
with patch('some.module') as mock_module:
mock_module.return_value = 'test'
result = self.process_data()
self.assertEqual(result, 'expected')
def process_data(self):
return "test_result"
"""
)
yield {
"temp_dir": temp_dir,
"code_file": str(code_file),
"small_test": str(small_test),
"large_test": str(large_test),
}
def test_tool_metadata(self, tool):
"""Test tool metadata"""
assert tool.get_name() == "testgen"
assert "COMPREHENSIVE TEST GENERATION" in tool.get_description()
assert "BE SPECIFIC about scope" in tool.get_description()
assert tool.get_default_temperature() == 0.2 # Analytical temperature
# Check model category
from tools.models import ToolModelCategory
assert tool.get_model_category() == ToolModelCategory.EXTENDED_REASONING
def test_input_schema_structure(self, tool):
"""Test input schema structure"""
schema = tool.get_input_schema()
# Required fields
assert "files" in schema["properties"]
assert "prompt" in schema["properties"]
assert "files" in schema["required"]
assert "prompt" in schema["required"]
# Optional fields
assert "test_examples" in schema["properties"]
assert "thinking_mode" in schema["properties"]
assert "continuation_id" in schema["properties"]
# Should not have temperature or use_websearch
assert "temperature" not in schema["properties"]
assert "use_websearch" not in schema["properties"]
# Check test_examples description
test_examples_desc = schema["properties"]["test_examples"]["description"]
assert "absolute paths" in test_examples_desc
assert "smallest representative tests" in test_examples_desc
def test_request_model_validation(self):
"""Test request model validation"""
# Valid request
valid_request = TestGenRequest(files=["/tmp/test.py"], prompt="Generate tests for calculator functions")
assert valid_request.files == ["/tmp/test.py"]
assert valid_request.prompt == "Generate tests for calculator functions"
assert valid_request.test_examples is None
# With test examples
request_with_examples = TestGenRequest(
files=["/tmp/test.py"], prompt="Generate tests", test_examples=["/tmp/test_example.py"]
)
assert request_with_examples.test_examples == ["/tmp/test_example.py"]
# Invalid request (missing required fields)
with pytest.raises(ValueError):
TestGenRequest(files=["/tmp/test.py"]) # Missing prompt
@pytest.mark.asyncio
@patch("tools.base.BaseTool.get_model_provider")
async def test_execute_success(self, mock_get_provider, tool, temp_files):
"""Test successful execution"""
# Mock provider
mock_provider = create_mock_provider()
mock_provider.get_provider_type.return_value = Mock(value="google")
mock_provider.generate_content.return_value = Mock(
content="Generated comprehensive test suite with edge cases",
usage={"input_tokens": 100, "output_tokens": 200},
model_name="gemini-2.5-flash-preview-05-20",
metadata={"finish_reason": "STOP"},
)
mock_get_provider.return_value = mock_provider
result = await tool.execute(
{"files": [temp_files["code_file"]], "prompt": "Generate comprehensive tests for the calculator functions"}
)
# Verify result structure
assert len(result) == 1
response_data = json.loads(result[0].text)
assert response_data["status"] == "success"
assert "Generated comprehensive test suite" in response_data["content"]
@pytest.mark.asyncio
@patch("tools.base.BaseTool.get_model_provider")
async def test_execute_with_test_examples(self, mock_get_provider, tool, temp_files):
"""Test execution with test examples"""
mock_provider = create_mock_provider()
mock_provider.generate_content.return_value = Mock(
content="Generated tests following the provided examples",
usage={"input_tokens": 150, "output_tokens": 250},
model_name="gemini-2.5-flash-preview-05-20",
metadata={"finish_reason": "STOP"},
)
mock_get_provider.return_value = mock_provider
result = await tool.execute(
{
"files": [temp_files["code_file"]],
"prompt": "Generate tests following existing patterns",
"test_examples": [temp_files["small_test"]],
}
)
# Verify result
assert len(result) == 1
response_data = json.loads(result[0].text)
assert response_data["status"] == "success"
def test_process_test_examples_empty(self, tool):
"""Test processing empty test examples"""
content, note = tool._process_test_examples([], None)
assert content == ""
assert note == ""
def test_process_test_examples_budget_allocation(self, tool, temp_files):
"""Test token budget allocation for test examples"""
with patch.object(tool, "filter_new_files") as mock_filter:
mock_filter.return_value = [temp_files["small_test"], temp_files["large_test"]]
with patch.object(tool, "_prepare_file_content_for_prompt") as mock_prepare:
mock_prepare.return_value = "Mocked test content"
# Test with available tokens
content, note = tool._process_test_examples(
[temp_files["small_test"], temp_files["large_test"]], None, available_tokens=100000
)
# Should allocate 25% of 100k = 25k tokens for test examples
mock_prepare.assert_called_once()
call_args = mock_prepare.call_args
assert call_args[1]["max_tokens"] == 25000 # 25% of 100k
def test_process_test_examples_size_sorting(self, tool, temp_files):
"""Test that test examples are sorted by size (smallest first)"""
with patch.object(tool, "filter_new_files") as mock_filter:
# Return files in random order
mock_filter.return_value = [temp_files["large_test"], temp_files["small_test"]]
with patch.object(tool, "_prepare_file_content_for_prompt") as mock_prepare:
mock_prepare.return_value = "test content"
tool._process_test_examples(
[temp_files["large_test"], temp_files["small_test"]], None, available_tokens=50000
)
# Check that files were passed in size order (smallest first)
call_args = mock_prepare.call_args[0]
files_passed = call_args[0]
# Verify smallest file comes first
assert files_passed[0] == temp_files["small_test"]
assert files_passed[1] == temp_files["large_test"]
@pytest.mark.asyncio
async def test_prepare_prompt_structure(self, tool, temp_files):
"""Test prompt preparation structure"""
request = TestGenRequest(files=[temp_files["code_file"]], prompt="Test the calculator functions")
with patch.object(tool, "_prepare_file_content_for_prompt") as mock_prepare:
mock_prepare.return_value = "mocked file content"
prompt = await tool.prepare_prompt(request)
# Check prompt structure
assert "=== USER CONTEXT ===" in prompt
assert "Test the calculator functions" in prompt
assert "=== CODE TO TEST ===" in prompt
assert "mocked file content" in prompt
assert tool.get_system_prompt() in prompt
@pytest.mark.asyncio
async def test_prepare_prompt_with_examples(self, tool, temp_files):
"""Test prompt preparation with test examples"""
request = TestGenRequest(
files=[temp_files["code_file"]], prompt="Generate tests", test_examples=[temp_files["small_test"]]
)
with patch.object(tool, "_prepare_file_content_for_prompt") as mock_prepare:
mock_prepare.return_value = "mocked content"
with patch.object(tool, "_process_test_examples") as mock_process:
mock_process.return_value = ("test examples content", "Note: examples included")
prompt = await tool.prepare_prompt(request)
# Check test examples section
assert "=== TEST EXAMPLES FOR STYLE REFERENCE ===" in prompt
assert "test examples content" in prompt
assert "Note: examples included" in prompt
def test_format_response(self, tool):
"""Test response formatting"""
request = TestGenRequest(files=["/tmp/test.py"], prompt="Generate tests")
raw_response = "Generated test cases with edge cases"
formatted = tool.format_response(raw_response, request)
# Check formatting includes next steps
assert raw_response in formatted
assert "**Next Steps:**" in formatted
assert "Review Generated Tests" in formatted
assert "Setup Test Environment" in formatted
@pytest.mark.asyncio
async def test_error_handling_invalid_files(self, tool):
"""Test error handling for invalid file paths"""
result = await tool.execute(
{"files": ["relative/path.py"], "prompt": "Generate tests"} # Invalid: not absolute
)
# Should return error for relative path
response_data = json.loads(result[0].text)
assert response_data["status"] == "error"
assert "absolute" in response_data["content"]
@pytest.mark.asyncio
async def test_large_prompt_handling(self, tool):
"""Test handling of large prompts"""
large_prompt = "x" * 60000 # Exceeds MCP_PROMPT_SIZE_LIMIT
result = await tool.execute({"files": ["/tmp/test.py"], "prompt": large_prompt})
# Should return resend_prompt status
response_data = json.loads(result[0].text)
assert response_data["status"] == "resend_prompt"
assert "too large" in response_data["content"]
def test_token_budget_calculation(self, tool):
"""Test token budget calculation logic"""
# Mock model capabilities
with patch.object(tool, "get_model_provider") as mock_get_provider:
mock_provider = create_mock_provider(context_window=200000)
mock_get_provider.return_value = mock_provider
# Simulate model name being set
tool._current_model_name = "test-model"
with patch.object(tool, "_process_test_examples") as mock_process:
mock_process.return_value = ("test content", "")
with patch.object(tool, "_prepare_file_content_for_prompt") as mock_prepare:
mock_prepare.return_value = "code content"
request = TestGenRequest(
files=["/tmp/test.py"], prompt="Test prompt", test_examples=["/tmp/example.py"]
)
# This should trigger token budget calculation
import asyncio
asyncio.run(tool.prepare_prompt(request))
# Verify test examples got 25% of 150k tokens (75% of 200k context)
mock_process.assert_called_once()
call_args = mock_process.call_args[0]
assert call_args[2] == 150000 # 75% of 200k context window
@pytest.mark.asyncio
async def test_continuation_support(self, tool, temp_files):
"""Test continuation ID support"""
with patch.object(tool, "_prepare_file_content_for_prompt") as mock_prepare:
mock_prepare.return_value = "code content"
request = TestGenRequest(
files=[temp_files["code_file"]], prompt="Continue testing", continuation_id="test-thread-123"
)
await tool.prepare_prompt(request)
# Verify continuation_id was passed to _prepare_file_content_for_prompt
# The method should be called twice (once for code, once for test examples logic)
assert mock_prepare.call_count >= 1
# Check that continuation_id was passed in at least one call
calls = mock_prepare.call_args_list
continuation_passed = any(
call[0][1] == "test-thread-123" for call in calls # continuation_id is second argument
)
assert continuation_passed, f"continuation_id not passed. Calls: {calls}"
def test_no_websearch_in_prompt(self, tool, temp_files):
"""Test that web search instructions are not included"""
request = TestGenRequest(files=[temp_files["code_file"]], prompt="Generate tests")
with patch.object(tool, "_prepare_file_content_for_prompt") as mock_prepare:
mock_prepare.return_value = "code content"
import asyncio
prompt = asyncio.run(tool.prepare_prompt(request))
# Should not contain web search instructions
assert "WEB SEARCH CAPABILITY" not in prompt
assert "web search" not in prompt.lower()


@@ -284,6 +284,22 @@ class TestAbsolutePathValidation:
assert "must be absolute" in response["content"] assert "must be absolute" in response["content"]
assert "code.py" in response["content"] assert "code.py" in response["content"]
@pytest.mark.asyncio
async def test_testgen_tool_relative_path_rejected(self):
"""Test that testgen tool rejects relative paths"""
from tools import TestGenTool
tool = TestGenTool()
result = await tool.execute(
{"files": ["src/main.py"], "prompt": "Generate tests for the functions"} # relative path
)
assert len(result) == 1
response = json.loads(result[0].text)
assert response["status"] == "error"
assert "must be absolute" in response["content"]
assert "src/main.py" in response["content"]
@pytest.mark.asyncio
@patch("tools.AnalyzeTool.get_model_provider")
async def test_analyze_tool_accepts_absolute_paths(self, mock_get_provider):


@@ -7,6 +7,7 @@ from .chat import ChatTool
from .codereview import CodeReviewTool
from .debug import DebugIssueTool
from .precommit import Precommit
from .testgen import TestGenTool
from .thinkdeep import ThinkDeepTool
__all__ = [
@@ -16,4 +17,5 @@ __all__ = [
"AnalyzeTool", "AnalyzeTool",
"ChatTool", "ChatTool",
"Precommit", "Precommit",
"TestGenTool",
]


@@ -2,7 +2,7 @@
Code Review tool - Comprehensive code analysis and review
This tool provides professional-grade code review capabilities using
the chosen model's understanding of code patterns, best practices, and common issues.
It can analyze individual files or entire codebases, providing actionable
feedback categorized by severity.
@@ -177,7 +177,7 @@ class CodeReviewTool(BaseTool):
request: The validated review request
Returns:
str: Complete prompt for the model
Raises:
ValueError: If the code exceeds token limits

tools/testgen.py (new file, 429 lines)

@@ -0,0 +1,429 @@
"""
TestGen tool - Comprehensive test suite generation with edge case coverage
This tool generates comprehensive test suites by analyzing code paths,
identifying edge cases, and producing test scaffolding that follows
project conventions when test examples are provided.
Key Features:
- Multi-file and directory support
- Framework detection from existing tests
- Edge case identification (nulls, boundaries, async issues, etc.)
- Test pattern following when examples provided
- Deterministic test example sampling for large test suites
"""
import logging
import os
from typing import Any, Optional
from mcp.types import TextContent
from pydantic import Field
from config import TEMPERATURE_ANALYTICAL
from systemprompts import TESTGEN_PROMPT
from .base import BaseTool, ToolRequest
from .models import ToolOutput
logger = logging.getLogger(__name__)
class TestGenRequest(ToolRequest):
"""
Request model for the test generation tool.
This model defines all parameters that can be used to customize
the test generation process, from selecting code files to providing
test examples for style consistency.
"""
files: list[str] = Field(
...,
description="Code files or directories to generate tests for (must be absolute paths)",
)
prompt: str = Field(
...,
description="Description of what to test, testing objectives, and specific scope/focus areas",
)
test_examples: Optional[list[str]] = Field(
None,
description=(
"Optional existing test files or directories to use as style/pattern reference (must be absolute paths). "
"If not provided, the tool will determine the best testing approach based on the code structure. "
"For large test directories, only the smallest representative tests should be included to determine testing patterns. "
"If similar tests exist for the code being tested, include those for the most relevant patterns."
),
)
class TestGenTool(BaseTool):
"""
Test generation tool implementation.
This tool analyzes code to generate comprehensive test suites with
edge case coverage, following existing test patterns when examples
are provided.
"""
def get_name(self) -> str:
return "testgen"
def get_description(self) -> str:
return (
"COMPREHENSIVE TEST GENERATION - Creates thorough test suites with edge case coverage. "
"Use this when you need to generate tests for code, create test scaffolding, or improve test coverage. "
"BE SPECIFIC about scope: target specific functions/classes/modules rather than testing everything. "
"Examples: 'Generate tests for User.login() method', 'Test payment processing validation', "
"'Create tests for authentication error handling'. If user request is vague, either ask for "
"clarification about specific components to test, or make focused scope decisions and explain them. "
"Analyzes code paths, identifies realistic failure modes, and generates framework-specific tests. "
"Supports test pattern following when examples are provided. "
"Choose thinking_mode based on code complexity: 'low' for simple functions, "
"'medium' for standard modules (default), 'high' for complex systems with many interactions, "
"'max' for critical systems requiring exhaustive test coverage. "
"Note: If you're not currently using a top-tier model such as Opus 4 or above, these tools can provide enhanced capabilities."
)
def get_input_schema(self) -> dict[str, Any]:
schema = {
"type": "object",
"properties": {
"files": {
"type": "array",
"items": {"type": "string"},
"description": "Code files or directories to generate tests for (must be absolute paths)",
},
"model": self.get_model_field_schema(),
"prompt": {
"type": "string",
"description": "Description of what to test, testing objectives, and specific scope/focus areas",
},
"test_examples": {
"type": "array",
"items": {"type": "string"},
"description": (
"Optional existing test files or directories to use as style/pattern reference (must be absolute paths). "
"If not provided, the tool will determine the best testing approach based on the code structure. "
"For large test directories, only the smallest representative tests will be included to determine testing patterns. "
"If similar tests exist for the code being tested, include those for the most relevant patterns."
),
},
"thinking_mode": {
"type": "string",
"enum": ["minimal", "low", "medium", "high", "max"],
"description": "Thinking depth: minimal (0.5% of model max), low (8%), medium (33%), high (67%), max (100% of model max)",
},
"continuation_id": {
"type": "string",
"description": "Thread continuation ID for multi-turn conversations. Can be used to continue conversations across different tools. Only provide this if continuing a previous conversation thread.",
},
},
"required": ["files", "prompt"] + (["model"] if self.is_effective_auto_mode() else []),
}
return schema
def get_system_prompt(self) -> str:
return TESTGEN_PROMPT
def get_default_temperature(self) -> float:
return TEMPERATURE_ANALYTICAL
def get_model_category(self):
"""TestGen requires extended reasoning for comprehensive test analysis"""
from tools.models import ToolModelCategory
return ToolModelCategory.EXTENDED_REASONING
def get_request_model(self):
return TestGenRequest
async def execute(self, arguments: dict[str, Any]) -> list[TextContent]:
"""Override execute to check prompt size before processing"""
# First validate request
request_model = self.get_request_model()
request = request_model(**arguments)
# Check prompt size if provided
if request.prompt:
size_check = self.check_prompt_size(request.prompt)
if size_check:
return [TextContent(type="text", text=ToolOutput(**size_check).model_dump_json())]
# Continue with normal execution
return await super().execute(arguments)
def _process_test_examples(
self, test_examples: list[str], continuation_id: Optional[str], available_tokens: Optional[int] = None
) -> tuple[str, str]:
"""
Process test example files using available token budget for optimal sampling.
Args:
test_examples: List of test file paths
continuation_id: Continuation ID for filtering already embedded files
available_tokens: Available token budget for test examples
Returns:
tuple: (formatted_content, summary_note)
"""
logger.debug(f"[TESTGEN] Processing {len(test_examples)} test examples")
if not test_examples:
logger.debug("[TESTGEN] No test examples provided")
return "", ""
# Use existing file filtering to avoid duplicates in continuation
examples_to_process = self.filter_new_files(test_examples, continuation_id)
logger.debug(f"[TESTGEN] After filtering: {len(examples_to_process)} new test examples to process")
if not examples_to_process:
logger.info(f"[TESTGEN] All {len(test_examples)} test examples already in conversation history")
return "", ""
# Calculate token budget for test examples (25% of available tokens, or fallback)
if available_tokens:
test_examples_budget = int(available_tokens * 0.25) # 25% for test examples
logger.debug(
f"[TESTGEN] Allocating {test_examples_budget:,} tokens (25% of {available_tokens:,}) for test examples"
)
else:
test_examples_budget = 30000 # Fallback if no budget provided
logger.debug(f"[TESTGEN] Using fallback budget of {test_examples_budget:,} tokens for test examples")
original_count = len(examples_to_process)
logger.debug(
f"[TESTGEN] Processing {original_count} test example files with {test_examples_budget:,} token budget"
)
# Sort by file size (smallest first) for pattern-focused selection
file_sizes = []
for file_path in examples_to_process:
try:
size = os.path.getsize(file_path)
file_sizes.append((file_path, size))
logger.debug(f"[TESTGEN] Test example {os.path.basename(file_path)}: {size:,} bytes")
except (OSError, FileNotFoundError) as e:
# If we can't get size, put it at the end
logger.warning(f"[TESTGEN] Could not get size for {file_path}: {e}")
file_sizes.append((file_path, float("inf")))
# Sort by size and take smallest files for pattern reference
file_sizes.sort(key=lambda x: x[1])
examples_to_process = [f[0] for f in file_sizes] # All files, sorted by size
logger.debug(
f"[TESTGEN] Sorted test examples by size (smallest first): {[os.path.basename(f) for f in examples_to_process]}"
)
# Use standard file content preparation with dynamic token budget
try:
logger.debug(f"[TESTGEN] Preparing file content for {len(examples_to_process)} test examples")
content = self._prepare_file_content_for_prompt(
examples_to_process,
continuation_id,
"Test examples",
max_tokens=test_examples_budget,
reserve_tokens=1000,
)
# Determine how many files were actually included
if content:
from utils.token_utils import estimate_tokens
used_tokens = estimate_tokens(content)
logger.info(
f"[TESTGEN] Successfully embedded test examples: {used_tokens:,} tokens used ({test_examples_budget:,} available)"
)
if original_count > 1:
truncation_note = f"Note: Used {used_tokens:,} tokens ({test_examples_budget:,} available) for test examples from {original_count} files to determine testing patterns."
else:
truncation_note = ""
else:
logger.warning("[TESTGEN] No content generated for test examples")
truncation_note = ""
return content, truncation_note
except Exception as e:
# If test example processing fails, continue without examples rather than failing
logger.error(f"[TESTGEN] Failed to process test examples: {type(e).__name__}: {e}")
return "", f"Warning: Could not process test examples: {str(e)}"
async def prepare_prompt(self, request: TestGenRequest) -> str:
"""
Prepare the test generation prompt with code analysis and optional test examples.
This method reads the requested files, processes any test examples,
and constructs a detailed prompt for comprehensive test generation.
Args:
request: The validated test generation request
Returns:
str: Complete prompt for the model
Raises:
ValueError: If the code exceeds token limits
"""
logger.debug(f"[TESTGEN] Preparing prompt for {len(request.files)} code files")
if request.test_examples:
logger.debug(f"[TESTGEN] Including {len(request.test_examples)} test examples for pattern reference")
# Check for prompt.txt in files
prompt_content, updated_files = self.handle_prompt_file(request.files)
# If prompt.txt was found, incorporate it into the prompt
if prompt_content:
logger.debug("[TESTGEN] Found prompt.txt file, incorporating content")
request.prompt = prompt_content + "\n\n" + request.prompt
# Update request files list
if updated_files is not None:
logger.debug(f"[TESTGEN] Updated files list after prompt.txt processing: {len(updated_files)} files")
request.files = updated_files
# Calculate available token budget for dynamic allocation
continuation_id = getattr(request, "continuation_id", None)
# Get model context for token budget calculation
model_name = getattr(self, "_current_model_name", None)
available_tokens = None
if model_name:
try:
provider = self.get_model_provider(model_name)
capabilities = provider.get_capabilities(model_name)
# Use 75% of context for content (code + test examples), 25% for response
available_tokens = int(capabilities.context_window * 0.75)
logger.debug(
f"[TESTGEN] Token budget calculation: {available_tokens:,} tokens (75% of {capabilities.context_window:,}) for model {model_name}"
)
except Exception as e:
# Fallback to conservative estimate
logger.warning(f"[TESTGEN] Could not get model capabilities for {model_name}: {e}")
available_tokens = 120000 # Conservative fallback
logger.debug(f"[TESTGEN] Using fallback token budget: {available_tokens:,} tokens")
# Process test examples first to determine token allocation
test_examples_content = ""
test_examples_note = ""
if request.test_examples:
logger.debug(f"[TESTGEN] Processing {len(request.test_examples)} test examples")
test_examples_content, test_examples_note = self._process_test_examples(
request.test_examples, continuation_id, available_tokens
)
if test_examples_content:
logger.info("[TESTGEN] Test examples processed successfully for pattern reference")
else:
logger.info("[TESTGEN] No test examples content after processing")
# Calculate remaining tokens for main code after test examples
if test_examples_content and available_tokens:
from utils.token_utils import estimate_tokens
test_tokens = estimate_tokens(test_examples_content)
remaining_tokens = available_tokens - test_tokens - 5000 # Reserve for prompt structure
logger.debug(
f"[TESTGEN] Token allocation: {test_tokens:,} for examples, {remaining_tokens:,} remaining for code files"
)
else:
remaining_tokens = available_tokens - 10000 if available_tokens else None
if remaining_tokens:
logger.debug(
f"[TESTGEN] Token allocation: {remaining_tokens:,} tokens available for code files (no test examples)"
)
# Use centralized file processing logic for main code files
logger.debug(f"[TESTGEN] Preparing {len(request.files)} code files for analysis")
code_content = self._prepare_file_content_for_prompt(
request.files, continuation_id, "Code to test", max_tokens=remaining_tokens, reserve_tokens=2000
)
if code_content:
from utils.token_utils import estimate_tokens
code_tokens = estimate_tokens(code_content)
logger.info(f"[TESTGEN] Code files embedded successfully: {code_tokens:,} tokens")
else:
logger.warning("[TESTGEN] No code content after file processing")
# Test generation is based on code analysis, no web search needed
logger.debug("[TESTGEN] Building complete test generation prompt")
# Build the complete prompt
prompt_parts = []
# Add system prompt
prompt_parts.append(self.get_system_prompt())
# Add user context
prompt_parts.append("=== USER CONTEXT ===")
prompt_parts.append(request.prompt)
prompt_parts.append("=== END CONTEXT ===")
# Add test examples if provided
if test_examples_content:
prompt_parts.append("\n=== TEST EXAMPLES FOR STYLE REFERENCE ===")
if test_examples_note:
prompt_parts.append(f"// {test_examples_note}")
prompt_parts.append(test_examples_content)
prompt_parts.append("=== END TEST EXAMPLES ===")
# Add main code to test
prompt_parts.append("\n=== CODE TO TEST ===")
prompt_parts.append(code_content)
prompt_parts.append("=== END CODE ===")
# Add generation instructions
prompt_parts.append(
"\nPlease analyze the code and generate comprehensive tests following the multi-agent workflow specified in the system prompt."
)
if test_examples_content:
prompt_parts.append(
"Use the provided test examples as a reference for style, framework, and testing patterns."
)
full_prompt = "\n".join(prompt_parts)
# Log final prompt statistics
from utils.token_utils import estimate_tokens
total_tokens = estimate_tokens(full_prompt)
logger.info(f"[TESTGEN] Complete prompt prepared: {total_tokens:,} tokens, {len(full_prompt):,} characters")
return full_prompt
def format_response(self, response: str, request: TestGenRequest, model_info: Optional[dict] = None) -> str:
"""
Format the test generation response.
Args:
response: The raw test generation from the model
request: The original request for context
model_info: Optional dict with model metadata
Returns:
str: Formatted response with next steps
"""
return f"""{response}
---
**Next Steps:**
1. **Review Generated Tests**: Check if the structure, coverage, and edge cases are valid and useful. Ensure they meet your requirements.
Confirm the tests cover missing scenarios, follow project conventions, and can be safely added without duplication.
2. **Setup Test Environment**: Ensure the testing framework and dependencies identified are properly configured in your project.
3. **Run Initial Tests**: Execute the generated tests to verify they work correctly with your code.
4. **Customize as Needed**: Modify the generated test code, add project-specific edge cases, and refine or adjust the test structure as needed,
based on your existing knowledge of the code.
5. **Integrate with CI/CD**: If a continuous integration pipeline is already set up and available, add the tests to it to maintain code quality.
6. Refine requirements and continue the conversation if additional coverage or improvements are needed.
Remember: Review the generated tests for completeness and adapt and integrate them to your specific project requirements and testing standards. Continue with your next step in implementation."""