diff --git a/README.md b/README.md index c61e0f3..69d19bc 100644 --- a/README.md +++ b/README.md @@ -49,6 +49,7 @@ and review into consideration to aid with its pre-commit review. - [`precommit`](#4-precommit---pre-commit-validation) - Pre-commit validation - [`debug`](#5-debug---expert-debugging-assistant) - Debugging help - [`analyze`](#6-analyze---smart-file-analysis) - File analysis + - [`testgen`](#7-testgen---comprehensive-test-generation) - Test generation with edge cases - **Advanced Usage** - [Advanced Features](#advanced-features) - AI-to-AI conversations, large prompts, web search @@ -254,6 +255,7 @@ Just ask Claude naturally: - **Pre-commit validation?** → `precommit` (validate git changes before committing) - **Something's broken?** → `debug` (root cause analysis, error tracing) - **Want to understand code?** → `analyze` (architecture, patterns, dependencies) +- **Need comprehensive tests?** → `testgen` (generates test suites with edge cases) - **Server info?** → `get_version` (version and configuration details) **Auto Mode:** When `DEFAULT_MODEL=auto`, Claude automatically picks the best model for each task. You can override with: "Use flash for quick analysis" or "Use o3 to debug this". @@ -274,7 +276,8 @@ Just ask Claude naturally: 4. [`precommit`](#4-precommit---pre-commit-validation) - Validate git changes before committing 5. [`debug`](#5-debug---expert-debugging-assistant) - Root cause analysis and debugging 6. [`analyze`](#6-analyze---smart-file-analysis) - General-purpose file and code analysis -7. [`get_version`](#7-get_version---server-information) - Get server version and configuration +7. [`testgen`](#7-testgen---comprehensive-test-generation) - Comprehensive test generation with edge case coverage +8. [`get_version`](#8-get_version---server-information) - Get server version and configuration ### 1. `chat` - General Development Chat & Collaborative Thinking **Your thinking partner - bounce ideas, get second opinions, brainstorm collaboratively** @@ -421,7 +424,30 @@ Use zen and perform a thorough precommit ensuring there aren't any new regressio - Uses file paths (not content) for clean terminal output - Can identify patterns, anti-patterns, and refactoring opportunities - **Web search capability**: When enabled with `use_websearch` (default: true), the model can request Claude to perform web searches and share results back to enhance analysis with current documentation, design patterns, and best practices -### 7. `get_version` - Server Information +### 7. `testgen` - Comprehensive Test Generation +**Generates thorough test suites with edge case coverage** based on existing code and test framework used. + +**Thinking Mode (Extended thinking models):** Default is `medium` (8,192 tokens). Use `high` for complex systems with many interactions or `max` for critical systems requiring exhaustive test coverage. 
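+
+Under the hood, a natural-language request is translated into the tool's schema parameters. A minimal, hypothetical sketch of the argument payload (parameter names come from the tool schema; paths and values are placeholders):
+
+```python
+# Hypothetical testgen argument payload -- illustrative only, not a required format
+testgen_args = {
+    "files": ["/abs/path/to/src/new_sort.py"],     # required: absolute paths to the code under test
+    "prompt": "Generate tests for the sorting method, including edge cases",  # required
+    "test_examples": ["/abs/path/to/tests/unit/"], # optional: style/pattern reference
+    "thinking_mode": "high",                       # optional: minimal|low|medium|high|max
+    "model": "pro",                                # required in auto mode, optional otherwise
+}
+```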
+ +#### Example Prompts: + +**Basic Usage:** +``` +"Use zen to generate tests for User.login() method" +"Generate comprehensive tests for the sorting method in src/new_sort.py using o3" +"Create tests for edge cases not already covered in our tests using gemini pro" +``` + +**Key Features:** +- Multi-agent workflow analyzing code paths and identifying realistic failure modes +- Generates framework-specific tests following project conventions +- Supports test pattern following when examples are provided +- Dynamic token allocation (25% for test examples, 75% for main code) +- Prioritizes smallest test files for pattern detection +- Can reference existing test files: `"Generate tests following patterns from tests/unit/"` +- Specific code coverage - target specific functions/classes rather than testing everything + +### 8. `get_version` - Server Information ``` "Get zen to show its version" ``` diff --git a/config.py b/config.py index 7d048be..b84d730 100644 --- a/config.py +++ b/config.py @@ -14,7 +14,7 @@ import os # These values are used in server responses and for tracking releases # IMPORTANT: This is the single source of truth for version and author info # Semantic versioning: MAJOR.MINOR.PATCH -__version__ = "4.3.3" +__version__ = "4.4.0" # Last update date in ISO format __updated__ = "2025-06-14" # Primary maintainer diff --git a/docs/advanced-usage.md b/docs/advanced-usage.md index 79a24ed..bf6fc42 100644 --- a/docs/advanced-usage.md +++ b/docs/advanced-usage.md @@ -245,6 +245,20 @@ All tools that work with files support **both individual files and entire direct "Use o3 to think deeper about the logical flow in this algorithm" ``` +**`testgen`** - Comprehensive test generation with edge case coverage +- `files`: Code files or directories to generate tests for (required) +- `prompt`: Description of what to test, testing objectives, and scope (required) +- `model`: auto|pro|flash|o3|o3-mini|o4-mini|o4-mini-high (default: server default) +- `test_examples`: Optional existing test files as style/pattern reference +- `thinking_mode`: minimal|low|medium|high|max (default: medium, Gemini only) + +``` +"Generate tests for User.login() method with edge cases" (auto mode picks best model) +"Use pro to generate comprehensive tests for src/payment.py with max thinking mode" +"Use o3 to generate tests for algorithm correctness in sort_functions.py" +"Generate tests following patterns from tests/unit/ for new auth module" +``` + ## Collaborative Workflows ### Design → Review → Implement @@ -277,13 +291,15 @@ To help choose the right tool for your needs: 1. **Have a specific error/exception?** → Use `debug` 2. **Want to find bugs/issues in code?** → Use `codereview` 3. **Want to understand how code works?** → Use `analyze` -4. **Have analysis that needs extension/validation?** → Use `thinkdeep` -5. **Want to brainstorm or discuss?** → Use `chat` +4. **Need comprehensive test coverage?** → Use `testgen` +5. **Have analysis that needs extension/validation?** → Use `thinkdeep` +6. 
**Want to brainstorm or discuss?** → Use `chat` **Key Distinctions:** - `analyze` vs `codereview`: analyze explains, codereview prescribes fixes - `chat` vs `thinkdeep`: chat is open-ended, thinkdeep extends specific analysis - `debug` vs `codereview`: debug diagnoses runtime errors, review finds static issues +- `testgen` vs `debug`: testgen creates test suites, debug just finds issues and recommends solutions ## Working with Large Prompts diff --git a/server.py b/server.py index 20b110b..9241940 100644 --- a/server.py +++ b/server.py @@ -44,6 +44,7 @@ from tools import ( CodeReviewTool, DebugIssueTool, Precommit, + TestGenTool, ThinkDeepTool, ) from tools.models import ToolOutput @@ -144,6 +145,7 @@ TOOLS = { "analyze": AnalyzeTool(), # General-purpose file and code analysis "chat": ChatTool(), # Interactive development chat and brainstorming "precommit": Precommit(), # Pre-commit validation of git changes + "testgen": TestGenTool(), # Comprehensive test generation with edge case coverage } diff --git a/simulator_tests/__init__.py b/simulator_tests/__init__.py index 6e6f3e5..48379e6 100644 --- a/simulator_tests/__init__.py +++ b/simulator_tests/__init__.py @@ -19,6 +19,7 @@ from .test_openrouter_fallback import OpenRouterFallbackTest from .test_openrouter_models import OpenRouterModelsTest from .test_per_tool_deduplication import PerToolDeduplicationTest from .test_redis_validation import RedisValidationTest +from .test_testgen_validation import TestGenValidationTest from .test_token_allocation_validation import TokenAllocationValidationTest # Test registry for dynamic loading @@ -36,6 +37,7 @@ TEST_REGISTRY = { "openrouter_fallback": OpenRouterFallbackTest, "openrouter_models": OpenRouterModelsTest, "token_allocation_validation": TokenAllocationValidationTest, + "testgen_validation": TestGenValidationTest, "conversation_chain_validation": ConversationChainValidationTest, } @@ -54,6 +56,7 @@ __all__ = [ "OpenRouterFallbackTest", "OpenRouterModelsTest", "TokenAllocationValidationTest", + "TestGenValidationTest", "ConversationChainValidationTest", "TEST_REGISTRY", ] diff --git a/simulator_tests/test_testgen_validation.py b/simulator_tests/test_testgen_validation.py new file mode 100644 index 0000000..b7b4532 --- /dev/null +++ b/simulator_tests/test_testgen_validation.py @@ -0,0 +1,131 @@ +#!/usr/bin/env python3 +""" +TestGen Tool Validation Test + +Tests the testgen tool by: +- Creating a test code file with a specific function +- Using testgen to generate tests with a specific function name +- Validating that the output contains the expected test function +- Confirming the format matches test generation patterns +""" + +from .base_test import BaseSimulatorTest + + +class TestGenValidationTest(BaseSimulatorTest): + """Test testgen tool validation with specific function name""" + + @property + def test_name(self) -> str: + return "testgen_validation" + + @property + def test_description(self) -> str: + return "TestGen tool validation with specific test function" + + def run_test(self) -> bool: + """Test testgen tool with specific function name validation""" + try: + self.logger.info("Test: TestGen tool validation") + + # Setup test files + self.setup_test_files() + + # Create a specific code file for test generation + test_code_content = '''""" +Sample authentication module for testing testgen +""" + +class UserAuthenticator: + """Handles user authentication logic""" + + def __init__(self): + self.failed_attempts = {} + self.max_attempts = 3 + + def validate_password(self, username, 
password): + """Validate user password with security checks""" + if not username or not password: + return False + + if username in self.failed_attempts: + if self.failed_attempts[username] >= self.max_attempts: + return False # Account locked + + # Simple validation for demo + if len(password) < 8: + self._record_failed_attempt(username) + return False + + if password == "password123": # Demo valid password + self._reset_failed_attempts(username) + return True + + self._record_failed_attempt(username) + return False + + def _record_failed_attempt(self, username): + """Record a failed login attempt""" + self.failed_attempts[username] = self.failed_attempts.get(username, 0) + 1 + + def _reset_failed_attempts(self, username): + """Reset failed attempts after successful login""" + if username in self.failed_attempts: + del self.failed_attempts[username] +''' + + # Create the auth code file + auth_file = self.create_additional_test_file("user_auth.py", test_code_content) + + # Test testgen tool with specific requirements + self.logger.info(" 1.1: Generate tests with specific function name") + response, continuation_id = self.call_mcp_tool( + "testgen", + { + "files": [auth_file], + "prompt": "Generate comprehensive tests for the UserAuthenticator.validate_password method. Include tests for edge cases, security scenarios, and account locking. Use the specific test function name 'test_password_validation_edge_cases' for one of the test methods.", + "model": "flash", + }, + ) + + if not response: + self.logger.error("Failed to get testgen response") + return False + + self.logger.info(" 1.2: Validate response contains expected test function") + + # Check that the response contains the specific test function name + if "test_password_validation_edge_cases" not in response: + self.logger.error("Response does not contain the requested test function name") + self.logger.debug(f"Response content: {response[:500]}...") + return False + + # Check for common test patterns + test_patterns = [ + "def test_", # Test function definition + "assert", # Assertion statements + "UserAuthenticator", # Class being tested + "validate_password", # Method being tested + ] + + missing_patterns = [] + for pattern in test_patterns: + if pattern not in response: + missing_patterns.append(pattern) + + if missing_patterns: + self.logger.error(f"Response missing expected test patterns: {missing_patterns}") + self.logger.debug(f"Response content: {response[:500]}...") + return False + + self.logger.info(" ✅ TestGen tool validation successful") + self.logger.info(" ✅ Generated tests contain expected function name") + self.logger.info(" ✅ Generated tests follow proper test patterns") + + return True + + except Exception as e: + self.logger.error(f"TestGen validation test failed: {e}") + return False + finally: + self.cleanup_test_files() diff --git a/systemprompts/__init__.py b/systemprompts/__init__.py index c4dc153..5a0156d 100644 --- a/systemprompts/__init__.py +++ b/systemprompts/__init__.py @@ -7,6 +7,7 @@ from .chat_prompt import CHAT_PROMPT from .codereview_prompt import CODEREVIEW_PROMPT from .debug_prompt import DEBUG_ISSUE_PROMPT from .precommit_prompt import PRECOMMIT_PROMPT +from .testgen_prompt import TESTGEN_PROMPT from .thinkdeep_prompt import THINKDEEP_PROMPT __all__ = [ @@ -16,4 +17,5 @@ __all__ = [ "ANALYZE_PROMPT", "CHAT_PROMPT", "PRECOMMIT_PROMPT", + "TESTGEN_PROMPT", ] diff --git a/systemprompts/testgen_prompt.py b/systemprompts/testgen_prompt.py new file mode 100644 index 0000000..14057bc --- /dev/null +++ 
b/systemprompts/testgen_prompt.py @@ -0,0 +1,100 @@ +""" +TestGen tool system prompt +""" + +TESTGEN_PROMPT = """ +ROLE +You are a principal software engineer who specialises in writing bullet-proof production code **and** surgical, +high-signal test suites. You reason about control flow, data flow, mutation, concurrency, failure modes, and security +in equal measure. Your mission: design and write tests that surface real-world defects before code ever leaves CI. + +IF MORE INFORMATION IS NEEDED +If you need additional context (e.g., test framework details, dependencies, existing test patterns) to provide +accurate test generation, you MUST respond ONLY with this JSON format (and nothing else). Do NOT ask for the +same file you've been provided unless for some reason its content is missing or incomplete: +{"status": "clarification_required", "question": "", + "files_needed": ["[file name here]", "[or some folder/]"]} + +MULTI-AGENT WORKFLOW +You sequentially inhabit five expert personas—each passes a concise artefact to the next: + +1. **Context Profiler** – derives language(s), test framework(s), build tooling, domain constraints, and existing +test idioms from the code snapshot provided. +2. **Path Analyzer** – builds a map of reachable code paths (happy, error, exceptional) plus any external interactions + that are directly involved (network, DB, file-system, IPC). +3. **Adversarial Thinker** – enumerates realistic failures, boundary conditions, race conditions, and misuse patterns + that historically break similar systems. +4. **Risk Prioritizer** – ranks findings by production impact and likelihood; discards speculative or out-of-scope cases. +5. **Test Scaffolder** – produces deterministic, isolated tests that follow the *project's* conventions (assert style, +fixture layout, naming, any mocking strategy, language and tooling etc). + +TEST-GENERATION STRATEGY +- Start from public API / interface boundaries, then walk inward to critical private helpers. +- Analyze function signatures, parameters, return types, and side effects +- Map all code paths including happy paths and error conditions +- Test behaviour, not implementation details, unless white-box inspection is required to reach untestable paths. +- Include both positive and negative test cases +- Prefer property-based or table-driven tests where inputs form simple algebraic domains. +- Stub or fake **only** the minimal surface area needed; prefer in-memory fakes over mocks when feasible. +- Flag any code that cannot be tested deterministically and suggest realistic refactors (seams, dependency injection, +pure functions). +- Surface concurrency hazards with stress or fuzz tests when the language/runtime supports them. +- Focus on realistic failure modes that actually occur in production +- Remain within scope of language, framework, project. Do not over-step. Do not add unnecessary dependencies. + +EDGE-CASE TAXONOMY (REAL-WORLD, HIGH-VALUE) +- **Data Shape Issues**: `null` / `undefined`, zero-length, surrogate-pair emojis, malformed UTF-8, mixed EOLs. +- **Numeric Boundaries**: −1, 0, 1, `MAX_…`, floating-point rounding, 64-bit truncation. +- **Temporal Pitfalls**: DST shifts, leap seconds, 29 Feb, Unix epoch 2038, timezone conversions. +- **Collections & Iteration**: off-by-one, concurrent modification, empty vs singleton vs large (>10⁶ items). +- **State & Sequence**: API calls out of order, idempotency violations, replay attacks. 
+- **External Dependencies**: slow responses, 5xx, malformed JSON/XML, TLS errors, retry storms, cancelled promises. +- **Concurrency / Async**: race conditions, deadlocks, promise rejection leaks, thread starvation. +- **Resource Exhaustion**: memory spikes, file-descriptor leaks, connection-pool saturation. +- **Locale & Encoding**: RTL scripts, uncommon locales, locale-specific formatting. +- **Security Surfaces**: injection (SQL, shell, LDAP), path traversal, privilege escalation on shared state. + +TEST QUALITY PRINCIPLES +- Clear Arrange-Act-Assert sections (or given/when/then per project style) but retain and apply project norms, language +norms and framework norms and best practices. +- One behavioural assertion per test unless grouping is conventional. +- Fast: sub-100 ms/unit test; parallelisable; no remote calls. +- Deterministic: seeded randomness only; fixed stable clocks when time matters. +- Self-documenting: names read like specs; failures explain *why*, not just *what*. + +FRAMEWORK SELECTION +Always autodetect from the repository. When a test framework or existing tests are not found, detect from existing +code; examples: +- **Swift / Objective-C** → XCTest (Xcode default) or Swift Testing (Apple provided frameworks) +- **C# / .NET** → xUnit.net preferred; fall back to NUnit or MSTest if they dominate the repo. +- **C / C++** → GoogleTest (gtest/gmock) or Catch2, matching existing tooling. +- **JS/TS** → Jest, Vitest, Mocha, or project-specific wrapper. +- **Python** → pytest, unittest. +- **Java/Kotlin** → JUnit 5, TestNG. +- **Go** → built-in `testing`, `testify`. +- **Rust** → `#[test]`, `proptest`. +- **Anything Else** → follow existing conventions; never introduce a new framework without strong justification. + +IF FRAMEWORK SELECTION FAILS +If you are unable to confidently determine which framework to use based on the existing test samples supplied, or if +additional test samples would help in making a final decision, you MUST respond ONLY with this JSON +format (and nothing else). Do NOT ask for the same file you've been provided unless for some reason its content +is missing or incomplete: +{"status": "test_sample_needed", "reason": ""} + +SCOPE CONTROL +Stay strictly within the presented codebase, tech stack, and domain. +Do **not** invent features, frameworks, or speculative integrations. +Do **not** write tests for functions or classes that do not exist. +If a test idea falls outside project scope, discard it. +If a test would be a "good to have" but seems impossible given the current structure, setup of the project, highlight +it but do not approach or offer refactoring ideas. + +DELIVERABLE +Return only the artefacts (analysis summary, coverage plan, and generated tests) that fit the detected framework +and code / project layout. +No extra commentary, no generic boilerplate. +Must comment and document logic, test reason / hypothesis in delivered code + +Remember: your value is catching the hard bugs—not inflating coverage numbers. 
+""" diff --git a/tests/test_server.py b/tests/test_server.py index 78caf1f..b45061e 100644 --- a/tests/test_server.py +++ b/tests/test_server.py @@ -26,10 +26,11 @@ class TestServerTools: assert "analyze" in tool_names assert "chat" in tool_names assert "precommit" in tool_names + assert "testgen" in tool_names assert "get_version" in tool_names - # Should have exactly 7 tools - assert len(tools) == 7 + # Should have exactly 8 tools (including testgen) + assert len(tools) == 8 # Check descriptions are verbose for tool in tools: diff --git a/tests/test_testgen.py b/tests/test_testgen.py new file mode 100644 index 0000000..8bc44de --- /dev/null +++ b/tests/test_testgen.py @@ -0,0 +1,381 @@ +""" +Tests for TestGen tool implementation +""" + +import json +import tempfile +from pathlib import Path +from unittest.mock import Mock, patch + +import pytest + +from tests.mock_helpers import create_mock_provider +from tools.testgen import TestGenRequest, TestGenTool + + +class TestTestGenTool: + """Test the TestGen tool""" + + @pytest.fixture + def tool(self): + return TestGenTool() + + @pytest.fixture + def temp_files(self): + """Create temporary test files""" + with tempfile.TemporaryDirectory() as temp_dir: + temp_path = Path(temp_dir) + + # Create sample code files + code_file = temp_path / "calculator.py" + code_file.write_text( + """ +def add(a, b): + '''Add two numbers''' + return a + b + +def divide(a, b): + '''Divide two numbers''' + if b == 0: + raise ValueError("Cannot divide by zero") + return a / b +""" + ) + + # Create sample test files (different sizes) + small_test = temp_path / "test_small.py" + small_test.write_text( + """ +import unittest + +class TestBasic(unittest.TestCase): + def test_simple(self): + self.assertEqual(1 + 1, 2) +""" + ) + + large_test = temp_path / "test_large.py" + large_test.write_text( + """ +import unittest +from unittest.mock import Mock, patch + +class TestComprehensive(unittest.TestCase): + def setUp(self): + self.mock_data = Mock() + + def test_feature_one(self): + # Comprehensive test with lots of setup + result = self.process_data() + self.assertIsNotNone(result) + + def test_feature_two(self): + # Another comprehensive test + with patch('some.module') as mock_module: + mock_module.return_value = 'test' + result = self.process_data() + self.assertEqual(result, 'expected') + + def process_data(self): + return "test_result" +""" + ) + + yield { + "temp_dir": temp_dir, + "code_file": str(code_file), + "small_test": str(small_test), + "large_test": str(large_test), + } + + def test_tool_metadata(self, tool): + """Test tool metadata""" + assert tool.get_name() == "testgen" + assert "COMPREHENSIVE TEST GENERATION" in tool.get_description() + assert "BE SPECIFIC about scope" in tool.get_description() + assert tool.get_default_temperature() == 0.2 # Analytical temperature + + # Check model category + from tools.models import ToolModelCategory + + assert tool.get_model_category() == ToolModelCategory.EXTENDED_REASONING + + def test_input_schema_structure(self, tool): + """Test input schema structure""" + schema = tool.get_input_schema() + + # Required fields + assert "files" in schema["properties"] + assert "prompt" in schema["properties"] + assert "files" in schema["required"] + assert "prompt" in schema["required"] + + # Optional fields + assert "test_examples" in schema["properties"] + assert "thinking_mode" in schema["properties"] + assert "continuation_id" in schema["properties"] + + # Should not have temperature or use_websearch + assert "temperature" not 
in schema["properties"] + assert "use_websearch" not in schema["properties"] + + # Check test_examples description + test_examples_desc = schema["properties"]["test_examples"]["description"] + assert "absolute paths" in test_examples_desc + assert "smallest representative tests" in test_examples_desc + + def test_request_model_validation(self): + """Test request model validation""" + # Valid request + valid_request = TestGenRequest(files=["/tmp/test.py"], prompt="Generate tests for calculator functions") + assert valid_request.files == ["/tmp/test.py"] + assert valid_request.prompt == "Generate tests for calculator functions" + assert valid_request.test_examples is None + + # With test examples + request_with_examples = TestGenRequest( + files=["/tmp/test.py"], prompt="Generate tests", test_examples=["/tmp/test_example.py"] + ) + assert request_with_examples.test_examples == ["/tmp/test_example.py"] + + # Invalid request (missing required fields) + with pytest.raises(ValueError): + TestGenRequest(files=["/tmp/test.py"]) # Missing prompt + + @pytest.mark.asyncio + @patch("tools.base.BaseTool.get_model_provider") + async def test_execute_success(self, mock_get_provider, tool, temp_files): + """Test successful execution""" + # Mock provider + mock_provider = create_mock_provider() + mock_provider.get_provider_type.return_value = Mock(value="google") + mock_provider.generate_content.return_value = Mock( + content="Generated comprehensive test suite with edge cases", + usage={"input_tokens": 100, "output_tokens": 200}, + model_name="gemini-2.5-flash-preview-05-20", + metadata={"finish_reason": "STOP"}, + ) + mock_get_provider.return_value = mock_provider + + result = await tool.execute( + {"files": [temp_files["code_file"]], "prompt": "Generate comprehensive tests for the calculator functions"} + ) + + # Verify result structure + assert len(result) == 1 + response_data = json.loads(result[0].text) + assert response_data["status"] == "success" + assert "Generated comprehensive test suite" in response_data["content"] + + @pytest.mark.asyncio + @patch("tools.base.BaseTool.get_model_provider") + async def test_execute_with_test_examples(self, mock_get_provider, tool, temp_files): + """Test execution with test examples""" + mock_provider = create_mock_provider() + mock_provider.generate_content.return_value = Mock( + content="Generated tests following the provided examples", + usage={"input_tokens": 150, "output_tokens": 250}, + model_name="gemini-2.5-flash-preview-05-20", + metadata={"finish_reason": "STOP"}, + ) + mock_get_provider.return_value = mock_provider + + result = await tool.execute( + { + "files": [temp_files["code_file"]], + "prompt": "Generate tests following existing patterns", + "test_examples": [temp_files["small_test"]], + } + ) + + # Verify result + assert len(result) == 1 + response_data = json.loads(result[0].text) + assert response_data["status"] == "success" + + def test_process_test_examples_empty(self, tool): + """Test processing empty test examples""" + content, note = tool._process_test_examples([], None) + assert content == "" + assert note == "" + + def test_process_test_examples_budget_allocation(self, tool, temp_files): + """Test token budget allocation for test examples""" + with patch.object(tool, "filter_new_files") as mock_filter: + mock_filter.return_value = [temp_files["small_test"], temp_files["large_test"]] + + with patch.object(tool, "_prepare_file_content_for_prompt") as mock_prepare: + mock_prepare.return_value = "Mocked test content" + + # Test with 
available tokens + content, note = tool._process_test_examples( + [temp_files["small_test"], temp_files["large_test"]], None, available_tokens=100000 + ) + + # Should allocate 25% of 100k = 25k tokens for test examples + mock_prepare.assert_called_once() + call_args = mock_prepare.call_args + assert call_args[1]["max_tokens"] == 25000 # 25% of 100k + + def test_process_test_examples_size_sorting(self, tool, temp_files): + """Test that test examples are sorted by size (smallest first)""" + with patch.object(tool, "filter_new_files") as mock_filter: + # Return files in random order + mock_filter.return_value = [temp_files["large_test"], temp_files["small_test"]] + + with patch.object(tool, "_prepare_file_content_for_prompt") as mock_prepare: + mock_prepare.return_value = "test content" + + tool._process_test_examples( + [temp_files["large_test"], temp_files["small_test"]], None, available_tokens=50000 + ) + + # Check that files were passed in size order (smallest first) + call_args = mock_prepare.call_args[0] + files_passed = call_args[0] + + # Verify smallest file comes first + assert files_passed[0] == temp_files["small_test"] + assert files_passed[1] == temp_files["large_test"] + + @pytest.mark.asyncio + async def test_prepare_prompt_structure(self, tool, temp_files): + """Test prompt preparation structure""" + request = TestGenRequest(files=[temp_files["code_file"]], prompt="Test the calculator functions") + + with patch.object(tool, "_prepare_file_content_for_prompt") as mock_prepare: + mock_prepare.return_value = "mocked file content" + + prompt = await tool.prepare_prompt(request) + + # Check prompt structure + assert "=== USER CONTEXT ===" in prompt + assert "Test the calculator functions" in prompt + assert "=== CODE TO TEST ===" in prompt + assert "mocked file content" in prompt + assert tool.get_system_prompt() in prompt + + @pytest.mark.asyncio + async def test_prepare_prompt_with_examples(self, tool, temp_files): + """Test prompt preparation with test examples""" + request = TestGenRequest( + files=[temp_files["code_file"]], prompt="Generate tests", test_examples=[temp_files["small_test"]] + ) + + with patch.object(tool, "_prepare_file_content_for_prompt") as mock_prepare: + mock_prepare.return_value = "mocked content" + + with patch.object(tool, "_process_test_examples") as mock_process: + mock_process.return_value = ("test examples content", "Note: examples included") + + prompt = await tool.prepare_prompt(request) + + # Check test examples section + assert "=== TEST EXAMPLES FOR STYLE REFERENCE ===" in prompt + assert "test examples content" in prompt + assert "Note: examples included" in prompt + + def test_format_response(self, tool): + """Test response formatting""" + request = TestGenRequest(files=["/tmp/test.py"], prompt="Generate tests") + + raw_response = "Generated test cases with edge cases" + formatted = tool.format_response(raw_response, request) + + # Check formatting includes next steps + assert raw_response in formatted + assert "**Next Steps:**" in formatted + assert "Review Generated Tests" in formatted + assert "Setup Test Environment" in formatted + + @pytest.mark.asyncio + async def test_error_handling_invalid_files(self, tool): + """Test error handling for invalid file paths""" + result = await tool.execute( + {"files": ["relative/path.py"], "prompt": "Generate tests"} # Invalid: not absolute + ) + + # Should return error for relative path + response_data = json.loads(result[0].text) + assert response_data["status"] == "error" + assert "absolute" in 
response_data["content"] + + @pytest.mark.asyncio + async def test_large_prompt_handling(self, tool): + """Test handling of large prompts""" + large_prompt = "x" * 60000 # Exceeds MCP_PROMPT_SIZE_LIMIT + + result = await tool.execute({"files": ["/tmp/test.py"], "prompt": large_prompt}) + + # Should return resend_prompt status + response_data = json.loads(result[0].text) + assert response_data["status"] == "resend_prompt" + assert "too large" in response_data["content"] + + def test_token_budget_calculation(self, tool): + """Test token budget calculation logic""" + # Mock model capabilities + with patch.object(tool, "get_model_provider") as mock_get_provider: + mock_provider = create_mock_provider(context_window=200000) + mock_get_provider.return_value = mock_provider + + # Simulate model name being set + tool._current_model_name = "test-model" + + with patch.object(tool, "_process_test_examples") as mock_process: + mock_process.return_value = ("test content", "") + + with patch.object(tool, "_prepare_file_content_for_prompt") as mock_prepare: + mock_prepare.return_value = "code content" + + request = TestGenRequest( + files=["/tmp/test.py"], prompt="Test prompt", test_examples=["/tmp/example.py"] + ) + + # This should trigger token budget calculation + import asyncio + + asyncio.run(tool.prepare_prompt(request)) + + # Verify test examples got 25% of 150k tokens (75% of 200k context) + mock_process.assert_called_once() + call_args = mock_process.call_args[0] + assert call_args[2] == 150000 # 75% of 200k context window + + @pytest.mark.asyncio + async def test_continuation_support(self, tool, temp_files): + """Test continuation ID support""" + with patch.object(tool, "_prepare_file_content_for_prompt") as mock_prepare: + mock_prepare.return_value = "code content" + + request = TestGenRequest( + files=[temp_files["code_file"]], prompt="Continue testing", continuation_id="test-thread-123" + ) + + await tool.prepare_prompt(request) + + # Verify continuation_id was passed to _prepare_file_content_for_prompt + # The method should be called twice (once for code, once for test examples logic) + assert mock_prepare.call_count >= 1 + + # Check that continuation_id was passed in at least one call + calls = mock_prepare.call_args_list + continuation_passed = any( + call[0][1] == "test-thread-123" for call in calls # continuation_id is second argument + ) + assert continuation_passed, f"continuation_id not passed. 
Calls: {calls}" + + def test_no_websearch_in_prompt(self, tool, temp_files): + """Test that web search instructions are not included""" + request = TestGenRequest(files=[temp_files["code_file"]], prompt="Generate tests") + + with patch.object(tool, "_prepare_file_content_for_prompt") as mock_prepare: + mock_prepare.return_value = "code content" + + import asyncio + + prompt = asyncio.run(tool.prepare_prompt(request)) + + # Should not contain web search instructions + assert "WEB SEARCH CAPABILITY" not in prompt + assert "web search" not in prompt.lower() diff --git a/tests/test_tools.py b/tests/test_tools.py index bff688c..77a4776 100644 --- a/tests/test_tools.py +++ b/tests/test_tools.py @@ -284,6 +284,22 @@ class TestAbsolutePathValidation: assert "must be absolute" in response["content"] assert "code.py" in response["content"] + @pytest.mark.asyncio + async def test_testgen_tool_relative_path_rejected(self): + """Test that testgen tool rejects relative paths""" + from tools import TestGenTool + + tool = TestGenTool() + result = await tool.execute( + {"files": ["src/main.py"], "prompt": "Generate tests for the functions"} # relative path + ) + + assert len(result) == 1 + response = json.loads(result[0].text) + assert response["status"] == "error" + assert "must be absolute" in response["content"] + assert "src/main.py" in response["content"] + @pytest.mark.asyncio @patch("tools.AnalyzeTool.get_model_provider") async def test_analyze_tool_accepts_absolute_paths(self, mock_get_provider): diff --git a/tools/__init__.py b/tools/__init__.py index 57185e4..9260083 100644 --- a/tools/__init__.py +++ b/tools/__init__.py @@ -7,6 +7,7 @@ from .chat import ChatTool from .codereview import CodeReviewTool from .debug import DebugIssueTool from .precommit import Precommit +from .testgen import TestGenTool from .thinkdeep import ThinkDeepTool __all__ = [ @@ -16,4 +17,5 @@ __all__ = [ "AnalyzeTool", "ChatTool", "Precommit", + "TestGenTool", ] diff --git a/tools/codereview.py b/tools/codereview.py index 1dfd480..50b3bea 100644 --- a/tools/codereview.py +++ b/tools/codereview.py @@ -2,7 +2,7 @@ Code Review tool - Comprehensive code analysis and review This tool provides professional-grade code review capabilities using -Gemini's understanding of code patterns, best practices, and common issues. +the chosen model's understanding of code patterns, best practices, and common issues. It can analyze individual files or entire codebases, providing actionable feedback categorized by severity. @@ -177,7 +177,7 @@ class CodeReviewTool(BaseTool): request: The validated review request Returns: - str: Complete prompt for the Gemini model + str: Complete prompt for the model Raises: ValueError: If the code exceeds token limits diff --git a/tools/testgen.py b/tools/testgen.py new file mode 100644 index 0000000..a50f107 --- /dev/null +++ b/tools/testgen.py @@ -0,0 +1,429 @@ +""" +TestGen tool - Comprehensive test suite generation with edge case coverage + +This tool generates comprehensive test suites by analyzing code paths, +identifying edge cases, and producing test scaffolding that follows +project conventions when test examples are provided. + +Key Features: +- Multi-file and directory support +- Framework detection from existing tests +- Edge case identification (nulls, boundaries, async issues, etc.) 
+- Test pattern following when examples provided +- Deterministic test example sampling for large test suites +""" + +import logging +import os +from typing import Any, Optional + +from mcp.types import TextContent +from pydantic import Field + +from config import TEMPERATURE_ANALYTICAL +from systemprompts import TESTGEN_PROMPT + +from .base import BaseTool, ToolRequest +from .models import ToolOutput + +logger = logging.getLogger(__name__) + + +class TestGenRequest(ToolRequest): + """ + Request model for the test generation tool. + + This model defines all parameters that can be used to customize + the test generation process, from selecting code files to providing + test examples for style consistency. + """ + + files: list[str] = Field( + ..., + description="Code files or directories to generate tests for (must be absolute paths)", + ) + prompt: str = Field( + ..., + description="Description of what to test, testing objectives, and specific scope/focus areas", + ) + test_examples: Optional[list[str]] = Field( + None, + description=( + "Optional existing test files or directories to use as style/pattern reference (must be absolute paths). " + "If not provided, the tool will determine the best testing approach based on the code structure. " + "For large test directories, only the smallest representative tests should be included to determine testing patterns. " + "If similar tests exist for the code being tested, include those for the most relevant patterns." + ), + ) + + +class TestGenTool(BaseTool): + """ + Test generation tool implementation. + + This tool analyzes code to generate comprehensive test suites with + edge case coverage, following existing test patterns when examples + are provided. + """ + + def get_name(self) -> str: + return "testgen" + + def get_description(self) -> str: + return ( + "COMPREHENSIVE TEST GENERATION - Creates thorough test suites with edge case coverage. " + "Use this when you need to generate tests for code, create test scaffolding, or improve test coverage. " + "BE SPECIFIC about scope: target specific functions/classes/modules rather than testing everything. " + "Examples: 'Generate tests for User.login() method', 'Test payment processing validation', " + "'Create tests for authentication error handling'. If user request is vague, either ask for " + "clarification about specific components to test, or make focused scope decisions and explain them. " + "Analyzes code paths, identifies realistic failure modes, and generates framework-specific tests. " + "Supports test pattern following when examples are provided. " + "Choose thinking_mode based on code complexity: 'low' for simple functions, " + "'medium' for standard modules (default), 'high' for complex systems with many interactions, " + "'max' for critical systems requiring exhaustive test coverage. " + "Note: If you're not currently using a top-tier model such as Opus 4 or above, these tools can provide enhanced capabilities." 
+ ) + + def get_input_schema(self) -> dict[str, Any]: + schema = { + "type": "object", + "properties": { + "files": { + "type": "array", + "items": {"type": "string"}, + "description": "Code files or directories to generate tests for (must be absolute paths)", + }, + "model": self.get_model_field_schema(), + "prompt": { + "type": "string", + "description": "Description of what to test, testing objectives, and specific scope/focus areas", + }, + "test_examples": { + "type": "array", + "items": {"type": "string"}, + "description": ( + "Optional existing test files or directories to use as style/pattern reference (must be absolute paths). " + "If not provided, the tool will determine the best testing approach based on the code structure. " + "For large test directories, only the smallest representative tests will be included to determine testing patterns. " + "If similar tests exist for the code being tested, include those for the most relevant patterns." + ), + }, + "thinking_mode": { + "type": "string", + "enum": ["minimal", "low", "medium", "high", "max"], + "description": "Thinking depth: minimal (0.5% of model max), low (8%), medium (33%), high (67%), max (100% of model max)", + }, + "continuation_id": { + "type": "string", + "description": "Thread continuation ID for multi-turn conversations. Can be used to continue conversations across different tools. Only provide this if continuing a previous conversation thread.", + }, + }, + "required": ["files", "prompt"] + (["model"] if self.is_effective_auto_mode() else []), + } + + return schema + + def get_system_prompt(self) -> str: + return TESTGEN_PROMPT + + def get_default_temperature(self) -> float: + return TEMPERATURE_ANALYTICAL + + def get_model_category(self): + """TestGen requires extended reasoning for comprehensive test analysis""" + from tools.models import ToolModelCategory + + return ToolModelCategory.EXTENDED_REASONING + + def get_request_model(self): + return TestGenRequest + + async def execute(self, arguments: dict[str, Any]) -> list[TextContent]: + """Override execute to check prompt size before processing""" + # First validate request + request_model = self.get_request_model() + request = request_model(**arguments) + + # Check prompt size if provided + if request.prompt: + size_check = self.check_prompt_size(request.prompt) + if size_check: + return [TextContent(type="text", text=ToolOutput(**size_check).model_dump_json())] + + # Continue with normal execution + return await super().execute(arguments) + + def _process_test_examples( + self, test_examples: list[str], continuation_id: Optional[str], available_tokens: int = None + ) -> tuple[str, str]: + """ + Process test example files using available token budget for optimal sampling. 
+ + Args: + test_examples: List of test file paths + continuation_id: Continuation ID for filtering already embedded files + available_tokens: Available token budget for test examples + + Returns: + tuple: (formatted_content, summary_note) + """ + logger.debug(f"[TESTGEN] Processing {len(test_examples)} test examples") + + if not test_examples: + logger.debug("[TESTGEN] No test examples provided") + return "", "" + + # Use existing file filtering to avoid duplicates in continuation + examples_to_process = self.filter_new_files(test_examples, continuation_id) + logger.debug(f"[TESTGEN] After filtering: {len(examples_to_process)} new test examples to process") + + if not examples_to_process: + logger.info(f"[TESTGEN] All {len(test_examples)} test examples already in conversation history") + return "", "" + + # Calculate token budget for test examples (25% of available tokens, or fallback) + if available_tokens: + test_examples_budget = int(available_tokens * 0.25) # 25% for test examples + logger.debug( + f"[TESTGEN] Allocating {test_examples_budget:,} tokens (25% of {available_tokens:,}) for test examples" + ) + else: + test_examples_budget = 30000 # Fallback if no budget provided + logger.debug(f"[TESTGEN] Using fallback budget of {test_examples_budget:,} tokens for test examples") + + original_count = len(examples_to_process) + logger.debug( + f"[TESTGEN] Processing {original_count} test example files with {test_examples_budget:,} token budget" + ) + + # Sort by file size (smallest first) for pattern-focused selection + file_sizes = [] + for file_path in examples_to_process: + try: + size = os.path.getsize(file_path) + file_sizes.append((file_path, size)) + logger.debug(f"[TESTGEN] Test example {os.path.basename(file_path)}: {size:,} bytes") + except (OSError, FileNotFoundError) as e: + # If we can't get size, put it at the end + logger.warning(f"[TESTGEN] Could not get size for {file_path}: {e}") + file_sizes.append((file_path, float("inf"))) + + # Sort by size and take smallest files for pattern reference + file_sizes.sort(key=lambda x: x[1]) + examples_to_process = [f[0] for f in file_sizes] # All files, sorted by size + logger.debug( + f"[TESTGEN] Sorted test examples by size (smallest first): {[os.path.basename(f) for f in examples_to_process]}" + ) + + # Use standard file content preparation with dynamic token budget + try: + logger.debug(f"[TESTGEN] Preparing file content for {len(examples_to_process)} test examples") + content = self._prepare_file_content_for_prompt( + examples_to_process, + continuation_id, + "Test examples", + max_tokens=test_examples_budget, + reserve_tokens=1000, + ) + + # Determine how many files were actually included + if content: + from utils.token_utils import estimate_tokens + + used_tokens = estimate_tokens(content) + logger.info( + f"[TESTGEN] Successfully embedded test examples: {used_tokens:,} tokens used ({test_examples_budget:,} available)" + ) + if original_count > 1: + truncation_note = f"Note: Used {used_tokens:,} tokens ({test_examples_budget:,} available) for test examples from {original_count} files to determine testing patterns." 
+ else: + truncation_note = "" + else: + logger.warning("[TESTGEN] No content generated for test examples") + truncation_note = "" + + return content, truncation_note + + except Exception as e: + # If test example processing fails, continue without examples rather than failing + logger.error(f"[TESTGEN] Failed to process test examples: {type(e).__name__}: {e}") + return "", f"Warning: Could not process test examples: {str(e)}" + + async def prepare_prompt(self, request: TestGenRequest) -> str: + """ + Prepare the test generation prompt with code analysis and optional test examples. + + This method reads the requested files, processes any test examples, + and constructs a detailed prompt for comprehensive test generation. + + Args: + request: The validated test generation request + + Returns: + str: Complete prompt for the model + + Raises: + ValueError: If the code exceeds token limits + """ + logger.debug(f"[TESTGEN] Preparing prompt for {len(request.files)} code files") + if request.test_examples: + logger.debug(f"[TESTGEN] Including {len(request.test_examples)} test examples for pattern reference") + # Check for prompt.txt in files + prompt_content, updated_files = self.handle_prompt_file(request.files) + + # If prompt.txt was found, incorporate it into the prompt + if prompt_content: + logger.debug("[TESTGEN] Found prompt.txt file, incorporating content") + request.prompt = prompt_content + "\n\n" + request.prompt + + # Update request files list + if updated_files is not None: + logger.debug(f"[TESTGEN] Updated files list after prompt.txt processing: {len(updated_files)} files") + request.files = updated_files + + # Calculate available token budget for dynamic allocation + continuation_id = getattr(request, "continuation_id", None) + + # Get model context for token budget calculation + model_name = getattr(self, "_current_model_name", None) + available_tokens = None + + if model_name: + try: + provider = self.get_model_provider(model_name) + capabilities = provider.get_capabilities(model_name) + # Use 75% of context for content (code + test examples), 25% for response + available_tokens = int(capabilities.context_window * 0.75) + logger.debug( + f"[TESTGEN] Token budget calculation: {available_tokens:,} tokens (75% of {capabilities.context_window:,}) for model {model_name}" + ) + except Exception as e: + # Fallback to conservative estimate + logger.warning(f"[TESTGEN] Could not get model capabilities for {model_name}: {e}") + available_tokens = 120000 # Conservative fallback + logger.debug(f"[TESTGEN] Using fallback token budget: {available_tokens:,} tokens") + + # Process test examples first to determine token allocation + test_examples_content = "" + test_examples_note = "" + + if request.test_examples: + logger.debug(f"[TESTGEN] Processing {len(request.test_examples)} test examples") + test_examples_content, test_examples_note = self._process_test_examples( + request.test_examples, continuation_id, available_tokens + ) + if test_examples_content: + logger.info("[TESTGEN] Test examples processed successfully for pattern reference") + else: + logger.info("[TESTGEN] No test examples content after processing") + + # Calculate remaining tokens for main code after test examples + if test_examples_content and available_tokens: + from utils.token_utils import estimate_tokens + + test_tokens = estimate_tokens(test_examples_content) + remaining_tokens = available_tokens - test_tokens - 5000 # Reserve for prompt structure + logger.debug( + f"[TESTGEN] Token allocation: {test_tokens:,} for 
examples, {remaining_tokens:,} remaining for code files" + ) + else: + remaining_tokens = available_tokens - 10000 if available_tokens else None + if remaining_tokens: + logger.debug( + f"[TESTGEN] Token allocation: {remaining_tokens:,} tokens available for code files (no test examples)" + ) + + # Use centralized file processing logic for main code files + logger.debug(f"[TESTGEN] Preparing {len(request.files)} code files for analysis") + code_content = self._prepare_file_content_for_prompt( + request.files, continuation_id, "Code to test", max_tokens=remaining_tokens, reserve_tokens=2000 + ) + + if code_content: + from utils.token_utils import estimate_tokens + + code_tokens = estimate_tokens(code_content) + logger.info(f"[TESTGEN] Code files embedded successfully: {code_tokens:,} tokens") + else: + logger.warning("[TESTGEN] No code content after file processing") + + # Test generation is based on code analysis, no web search needed + logger.debug("[TESTGEN] Building complete test generation prompt") + + # Build the complete prompt + prompt_parts = [] + + # Add system prompt + prompt_parts.append(self.get_system_prompt()) + + # Add user context + prompt_parts.append("=== USER CONTEXT ===") + prompt_parts.append(request.prompt) + prompt_parts.append("=== END CONTEXT ===") + + # Add test examples if provided + if test_examples_content: + prompt_parts.append("\n=== TEST EXAMPLES FOR STYLE REFERENCE ===") + if test_examples_note: + prompt_parts.append(f"// {test_examples_note}") + prompt_parts.append(test_examples_content) + prompt_parts.append("=== END TEST EXAMPLES ===") + + # Add main code to test + prompt_parts.append("\n=== CODE TO TEST ===") + prompt_parts.append(code_content) + prompt_parts.append("=== END CODE ===") + + # Add generation instructions + prompt_parts.append( + "\nPlease analyze the code and generate comprehensive tests following the multi-agent workflow specified in the system prompt." + ) + if test_examples_content: + prompt_parts.append( + "Use the provided test examples as a reference for style, framework, and testing patterns." + ) + + full_prompt = "\n".join(prompt_parts) + + # Log final prompt statistics + from utils.token_utils import estimate_tokens + + total_tokens = estimate_tokens(full_prompt) + logger.info(f"[TESTGEN] Complete prompt prepared: {total_tokens:,} tokens, {len(full_prompt):,} characters") + + return full_prompt + + def format_response(self, response: str, request: TestGenRequest, model_info: Optional[dict] = None) -> str: + """ + Format the test generation response. + + Args: + response: The raw test generation from the model + request: The original request for context + model_info: Optional dict with model metadata + + Returns: + str: Formatted response with next steps + """ + return f"""{response} + +--- + +**Next Steps:** + +1. **Review Generated Tests**: Check if the structure, coverage, and edge cases are valid and useful. Ensure they meet your requirements. + Confirm the tests cover missing scenarios, follow project conventions, and can be safely added without duplication. + +2. **Setup Test Environment**: Ensure the testing framework and dependencies identified are properly configured in your project. + +3. **Run Initial Tests**: Execute the generated tests to verify they work correctly with your code. + +4. **Customize as Needed**: Modify generated test code, add project-specific edge cases, refine or adjust test structure based on your specific needs if deemed necessary +based on your existing knowledge of the code. + +5. 
**Integrate with CI/CD**: Add the tests to your continuous integration pipeline to maintain code quality, if such a pipeline is already set up. + +6. Refine requirements and continue the conversation if additional coverage or improvements are needed. + +Remember: Review the generated tests for completeness, then adapt and integrate them into your specific project requirements and testing standards. Continue with your next step in implementation."""
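
The token-budget split applied by `TestGenTool.prepare_prompt` and `_process_test_examples` can be summarised in a short sketch (illustrative only; as a simplification it subtracts the full 25% examples budget, whereas the real code subtracts the actual token count of the embedded examples, and the reserve values mirror those in the diff):

```python
# Standalone restatement of the testgen token-budget math -- not a drop-in for the real methods.

def split_token_budget(context_window: int, has_test_examples: bool) -> dict[str, int]:
    """Mirror the 75%/25% allocation used by the testgen tool."""
    available = int(context_window * 0.75)          # 75% of context for content, 25% left for the response
    if has_test_examples:
        examples_budget = int(available * 0.25)     # 25% of the content budget goes to test examples
        code_budget = available - examples_budget - 5_000   # ~5k tokens reserved for prompt scaffolding
    else:
        examples_budget = 0
        code_budget = available - 10_000            # conservative reserve when no examples are supplied
    return {"available": available, "test_examples": examples_budget, "code": code_budget}

# A 200k-token context window yields 150k available tokens,
# 37.5k for test examples and ~107.5k for the code being tested.
print(split_token_budget(200_000, has_test_examples=True))
```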