Improved model response handling to support additional response statuses in the future.
Improved testgen: encourages follow-ups with less work in between and less token generation, to avoid surpassing the 25K barrier.
Improved codereview tool: requests a focused code review instead when a single-pass code review is too large or complex.
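The changes below implement these statuses as a registry of Pydantic models keyed by the "status" field of the model's JSON response. A condensed, illustrative sketch of that flow (simplified from the actual tools/base.py and tools/models.py changes in this commit; only one status shown):

```python
# Condensed sketch of the dispatch pattern introduced by this commit (not the full implementation).
import json
from typing import Literal, Optional

from pydantic import BaseModel, Field


class FocusedReviewRequired(BaseModel):
    status: Literal["focused_review_required"] = "focused_review_required"
    reason: str = Field(..., description="Why the current scope is too large")
    suggestion: str = Field(..., description="How to break the review into smaller parts")


# Registry keyed by the "status" string the model emits in its JSON response
SPECIAL_STATUS_MODELS = {"focused_review_required": FocusedReviewRequired}


def parse_special_status(raw_text: str) -> Optional[BaseModel]:
    """Return a validated special-status payload, or None to fall back to normal text handling."""
    try:
        payload = json.loads(raw_text.strip())
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict) or "status" not in payload:
        return None
    model = SPECIAL_STATUS_MODELS.get(payload["status"])
    if model is None:
        return None  # unknown statuses are ignored and treated as normal responses
    try:
        return model.model_validate(payload)
    except Exception:
        return None  # invalid payloads for known statuses also fall back to normal handling
```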
@@ -117,8 +117,6 @@ try:

     # Also keep a size-based rotation as backup (100MB max per file)
     # This prevents any single day's log from growing too large
-    from logging.handlers import RotatingFileHandler
-
     size_handler = RotatingFileHandler(
         "/tmp/mcp_server_overflow.log", maxBytes=100 * 1024 * 1024, backupCount=3  # 100MB
     )
@@ -64,5 +64,12 @@ After listing issues, add:
 • **Top 3 priority fixes** (quick bullets)
 • **Positive aspects** worth retaining

+IF SCOPE TOO LARGE FOR FOCUSED REVIEW
+If the codebase is too large or complex to review effectively in a single response, you MUST request Claude to
+provide smaller, more focused subsets for review. Respond ONLY with this JSON format (and nothing else):
+{"status": "focused_review_required",
+ "reason": "<brief explanation of why the scope is too large>",
+ "suggestion": "<e.g., 'Review authentication module (auth.py, login.py)' or 'Focus on data layer (models/)' or 'Review payment processing functionality'>"}
+
 Remember: If required information is missing, use the clarification JSON above instead of guessing.
 """
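For illustration only (not part of the diff), a response that satisfies the new instruction above could look like the following; the module and file names are hypothetical:

```python
# Hypothetical example of a focused_review_required payload a review model might return.
focused_review_response = {
    "status": "focused_review_required",
    "reason": "The snapshot spans dozens of modules; a single-pass review would be too large to be useful.",
    "suggestion": "Review authentication module (auth.py, login.py) first, then the data layer (models/).",
}
```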
@@ -24,7 +24,8 @@ test idioms from the code snapshot provided.
    that are directly involved (network, DB, file-system, IPC).
 3. **Adversarial Thinker** – enumerates realistic failures, boundary conditions, race conditions, and misuse patterns
    that historically break similar systems.
-4. **Risk Prioritizer** – ranks findings by production impact and likelihood; discards speculative or out-of-scope cases.
+4. **Risk Prioritizer** – ranks findings by production impact and likelihood; discards speculative or
+   out-of-scope cases.
 5. **Test Scaffolder** – produces deterministic, isolated tests that follow the *project's* conventions (assert style,
    fixture layout, naming, any mocking strategy, language and tooling etc).

@@ -41,6 +42,7 @@ pure functions).
 - Surface concurrency hazards with stress or fuzz tests when the language/runtime supports them.
 - Focus on realistic failure modes that actually occur in production
 - Remain within scope of language, framework, project. Do not over-step. Do not add unnecessary dependencies.
+- No bogus, fake tests that seemingly pass for no reason at all

 EDGE-CASE TAXONOMY (REAL-WORLD, HIGH-VALUE)
 - **Data Shape Issues**: `null` / `undefined`, zero-length, surrogate-pair emojis, malformed UTF-8, mixed EOLs.
@@ -93,8 +95,27 @@ it but do not approach or offer refactoring ideas.
 DELIVERABLE
 Return only the artefacts (analysis summary, coverage plan, and generated tests) that fit the detected framework
 and code / project layout.
-No extra commentary, no generic boilerplate.
-Must comment and document logic, test reason / hypothesis in delivered code
+Group related tests but separate them into files where this is the convention and most suitable for the project at hand.
+Prefer adding tests to an existing test file if one was provided and grouping these tests makes sense.
+Must document logic, test reason/hypothesis in delivered code.
+MUST NOT add any additional information, introduction, or summaries around generated code. Deliver only the essentials
+relevant to the test.
+
+IF ADDITIONAL TEST CASES ARE REQUIRED
+If you determine that comprehensive test coverage requires generating multiple test files or a large number of
+test cases for each file that would risk exceeding context limits, you MUST follow this structured approach:
+
+1. **Generate Essential Tests First**: Create only the most critical and high-impact tests (typically 3-5 key test
+   cases covering the most important paths and failure modes). Clearly state the file these tests belong to, even if
+   these should be added to an existing test file.
+
+2. **Request Continuation**: You MUST end your message with the following added in JSON format (and nothing
+   more after this). This will list the pending tests and their respective files (even if they belong to the same or
+   an existing test file) as this will be used for the next follow-up test generation request.
+   {"status": "more_tests_required",
+    "pending_tests": "test_name (file_name), another_test_name (file_name)"}
+
+This approach ensures comprehensive test coverage while maintaining quality and avoiding context overflow.
+
 Remember: your value is catching the hard bugs—not inflating coverage numbers.
 """
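To make the continuation handshake above concrete, here is a sketch of how a first testgen pass might end and hand off pending work (test and file names are hypothetical, not from the diff):

```python
# Hypothetical tail of a first testgen response, following the instruction above:
# essential tests are delivered first, then the remaining work is listed for a follow-up pass.
first_pass_tail = (
    '{"status": "more_tests_required", '
    '"pending_tests": "test_token_refresh (test_auth.py), test_rate_limit_headers (test_api.py)"}'
)
# Claude then issues a follow-up testgen request that passes this pending list back,
# so the next pass generates only the remaining tests instead of regenerating everything.
```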
tests/test_special_status_parsing.py (new file, 263 lines)
@@ -0,0 +1,263 @@
"""
Tests for special status parsing in the base tool
"""

from pydantic import BaseModel

from tools.base import BaseTool


class MockRequest(BaseModel):
    """Mock request for testing"""

    test_field: str = "test"


class TestTool(BaseTool):
    """Minimal test tool implementation"""

    def get_name(self) -> str:
        return "test_tool"

    def get_description(self) -> str:
        return "Test tool for special status parsing"

    def get_input_schema(self) -> dict:
        return {"type": "object", "properties": {}}

    def get_system_prompt(self) -> str:
        return "Test prompt"

    def get_request_model(self):
        return MockRequest

    async def prepare_prompt(self, request) -> str:
        return "test prompt"


class TestSpecialStatusParsing:
    """Test special status parsing functionality"""

    def setup_method(self):
        """Setup test tool and request"""
        self.tool = TestTool()
        self.request = MockRequest()

    def test_full_codereview_required_parsing(self):
        """Test parsing of full_codereview_required status"""
        response_json = '{"status": "full_codereview_required", "reason": "Codebase too large for quick review"}'

        result = self.tool._parse_response(response_json, self.request)

        assert result.status == "full_codereview_required"
        assert result.content_type == "json"
        assert "reason" in result.content

    def test_full_codereview_required_without_reason(self):
        """Test parsing of full_codereview_required without optional reason"""
        response_json = '{"status": "full_codereview_required"}'

        result = self.tool._parse_response(response_json, self.request)

        assert result.status == "full_codereview_required"
        assert result.content_type == "json"

    def test_test_sample_needed_parsing(self):
        """Test parsing of test_sample_needed status"""
        response_json = '{"status": "test_sample_needed", "reason": "Cannot determine test framework"}'

        result = self.tool._parse_response(response_json, self.request)

        assert result.status == "test_sample_needed"
        assert result.content_type == "json"
        assert "reason" in result.content

    def test_more_tests_required_parsing(self):
        """Test parsing of more_tests_required status"""
        response_json = (
            '{"status": "more_tests_required", "pending_tests": "test_auth (test_auth.py), test_login (test_user.py)"}'
        )

        result = self.tool._parse_response(response_json, self.request)

        assert result.status == "more_tests_required"
        assert result.content_type == "json"
        assert "pending_tests" in result.content

    def test_clarification_required_still_works(self):
        """Test that existing clarification_required still works"""
        response_json = (
            '{"status": "clarification_required", "question": "What files need review?", "files_needed": ["src/"]}'
        )

        result = self.tool._parse_response(response_json, self.request)

        assert result.status == "clarification_required"
        assert result.content_type == "json"
        assert "question" in result.content

    def test_invalid_status_payload(self):
        """Test that invalid payloads for known statuses are handled gracefully"""
        # Missing required field 'reason' for test_sample_needed
        response_json = '{"status": "test_sample_needed"}'

        result = self.tool._parse_response(response_json, self.request)

        # Should fall back to normal processing since validation failed
        assert result.status in ["success", "continuation_available"]

    def test_unknown_status_ignored(self):
        """Test that unknown status types are ignored and treated as normal responses"""
        response_json = '{"status": "unknown_status", "data": "some data"}'

        result = self.tool._parse_response(response_json, self.request)

        # Should be treated as normal response
        assert result.status in ["success", "continuation_available"]

    def test_normal_response_unchanged(self):
        """Test that normal text responses are handled normally"""
        response_text = "This is a normal response with some analysis."

        result = self.tool._parse_response(response_text, self.request)

        # Should be processed as normal response
        assert result.status in ["success", "continuation_available"]
        assert response_text in result.content

    def test_malformed_json_handled(self):
        """Test that malformed JSON is handled gracefully"""
        response_text = '{"status": "clarification_required", "question": "incomplete json'

        result = self.tool._parse_response(response_text, self.request)

        # Should fall back to normal processing
        assert result.status in ["success", "continuation_available"]

    def test_metadata_preserved(self):
        """Test that model metadata is preserved in special status responses"""
        response_json = '{"status": "full_codereview_required", "reason": "Too complex"}'
        model_info = {"model_name": "test-model", "provider": "test-provider"}

        result = self.tool._parse_response(response_json, self.request, model_info)

        assert result.status == "full_codereview_required"
        assert result.metadata["model_used"] == "test-model"
        assert "original_request" in result.metadata

    def test_more_tests_required_detailed(self):
        """Test more_tests_required with detailed pending_tests parameter"""
        # Test the exact format expected by testgen prompt
        pending_tests = "test_authentication_edge_cases (test_auth.py), test_password_validation_complex (test_auth.py), test_user_registration_flow (test_user.py)"
        response_json = f'{{"status": "more_tests_required", "pending_tests": "{pending_tests}"}}'

        result = self.tool._parse_response(response_json, self.request)

        assert result.status == "more_tests_required"
        assert result.content_type == "json"

        # Verify the content contains the validated, parsed data
        import json

        parsed_content = json.loads(result.content)
        assert parsed_content["status"] == "more_tests_required"
        assert parsed_content["pending_tests"] == pending_tests

        # Verify Claude would receive the pending_tests parameter correctly
        assert "test_authentication_edge_cases (test_auth.py)" in parsed_content["pending_tests"]
        assert "test_password_validation_complex (test_auth.py)" in parsed_content["pending_tests"]
        assert "test_user_registration_flow (test_user.py)" in parsed_content["pending_tests"]

    def test_more_tests_required_missing_pending_tests(self):
        """Test that more_tests_required without required pending_tests field fails validation"""
        response_json = '{"status": "more_tests_required"}'

        result = self.tool._parse_response(response_json, self.request)

        # Should fall back to normal processing since validation failed (missing required field)
        assert result.status in ["success", "continuation_available"]
        assert result.content_type != "json"

    def test_test_sample_needed_missing_reason(self):
        """Test that test_sample_needed without required reason field fails validation"""
        response_json = '{"status": "test_sample_needed"}'

        result = self.tool._parse_response(response_json, self.request)

        # Should fall back to normal processing since validation failed (missing required field)
        assert result.status in ["success", "continuation_available"]
        assert result.content_type != "json"

    def test_special_status_json_format_preserved(self):
        """Test that special status responses preserve exact JSON format for Claude"""
        test_cases = [
            {
                "input": '{"status": "clarification_required", "question": "What framework to use?", "files_needed": ["tests/"]}',
                "expected_fields": ["status", "question", "files_needed"],
            },
            {
                "input": '{"status": "full_codereview_required", "reason": "Codebase too large"}',
                "expected_fields": ["status", "reason"],
            },
            {
                "input": '{"status": "test_sample_needed", "reason": "Cannot determine test framework"}',
                "expected_fields": ["status", "reason"],
            },
            {
                "input": '{"status": "more_tests_required", "pending_tests": "test_auth (test_auth.py), test_login (test_user.py)"}',
                "expected_fields": ["status", "pending_tests"],
            },
        ]

        for test_case in test_cases:
            result = self.tool._parse_response(test_case["input"], self.request)

            # Verify status is correctly detected
            import json

            input_data = json.loads(test_case["input"])
            assert result.status == input_data["status"]
            assert result.content_type == "json"

            # Verify all expected fields are preserved in the response
            parsed_content = json.loads(result.content)
            for field in test_case["expected_fields"]:
                assert field in parsed_content, f"Field {field} missing from {input_data['status']} response"
                assert (
                    parsed_content[field] == input_data[field]
                ), f"Field {field} value mismatch in {input_data['status']} response"

    def test_focused_review_required_parsing(self):
        """Test that focused_review_required status is parsed correctly"""
        import json

        json_response = {
            "status": "focused_review_required",
            "reason": "Codebase too large for single review",
            "suggestion": "Review authentication module (auth.py, login.py)",
        }

        result = self.tool._parse_response(json.dumps(json_response), self.request)

        assert result.status == "focused_review_required"
        assert result.content_type == "json"
        parsed_content = json.loads(result.content)
        assert parsed_content["status"] == "focused_review_required"
        assert parsed_content["reason"] == "Codebase too large for single review"
        assert parsed_content["suggestion"] == "Review authentication module (auth.py, login.py)"

    def test_focused_review_required_missing_suggestion(self):
        """Test that focused_review_required fails validation without suggestion"""
        import json

        json_response = {
            "status": "focused_review_required",
            "reason": "Codebase too large",
            # Missing required suggestion field
        }

        result = self.tool._parse_response(json.dumps(json_response), self.request)

        # Should fall back to normal response since validation failed
        assert result.status == "success"
        assert result.content_type == "text"
@@ -284,7 +284,7 @@ class TestComprehensive(unittest.TestCase):

         # Check formatting includes new action-oriented next steps
         assert raw_response in formatted
-        assert "IMMEDIATE ACTION REQUIRED" in formatted
+        assert "IMMEDIATE NEXT ACTION" in formatted
         assert "ULTRATHINK" in formatted
         assert "CREATE" in formatted
         assert "VALIDATE BY EXECUTION" in formatted
@@ -37,7 +37,7 @@ from utils.conversation_memory import (
 )
 from utils.file_utils import read_file_content, read_files, translate_path_for_environment

-from .models import ClarificationRequest, ContinuationOffer, ToolOutput
+from .models import SPECIAL_STATUS_MODELS, ContinuationOffer, ToolOutput

 logger = logging.getLogger(__name__)

@@ -1183,31 +1183,43 @@ When recommending searches, be specific about what information you need and why
         logger = logging.getLogger(f"tools.{self.name}")

         try:
-            # Try to parse as JSON to check for clarification requests
+            # Try to parse as JSON to check for special status requests
             potential_json = json.loads(raw_text.strip())

-            if isinstance(potential_json, dict) and potential_json.get("status") == "clarification_required":
-                # Validate the clarification request structure
-                clarification = ClarificationRequest(**potential_json)
-                logger.debug(f"{self.name} tool requested clarification: {clarification.question}")
-                # Extract model information for metadata
-                metadata = {
-                    "original_request": (request.model_dump() if hasattr(request, "model_dump") else str(request))
-                }
-                if model_info:
-                    model_name = model_info.get("model_name")
-                    if model_name:
-                        metadata["model_used"] = model_name
-
-                return ToolOutput(
-                    status="clarification_required",
-                    content=clarification.model_dump_json(),
-                    content_type="json",
-                    metadata=metadata,
-                )
+            if isinstance(potential_json, dict) and "status" in potential_json:
+                status_key = potential_json.get("status")
+                status_model = SPECIAL_STATUS_MODELS.get(status_key)
+
+                if status_model:
+                    try:
+                        # Use Pydantic for robust validation of the special status
+                        parsed_status = status_model.model_validate(potential_json)
+                        logger.debug(f"{self.name} tool detected special status: {status_key}")
+
+                        # Extract model information for metadata
+                        metadata = {
+                            "original_request": (
+                                request.model_dump() if hasattr(request, "model_dump") else str(request)
+                            )
+                        }
+                        if model_info:
+                            model_name = model_info.get("model_name")
+                            if model_name:
+                                metadata["model_used"] = model_name
+
+                        return ToolOutput(
+                            status=status_key,
+                            content=parsed_status.model_dump_json(),
+                            content_type="json",
+                            metadata=metadata,
+                        )
+
+                    except Exception as e:
+                        # Invalid payload for known status, log warning and continue as normal response
+                        logger.warning(f"Invalid {status_key} payload: {e}")

         except (json.JSONDecodeError, ValueError, TypeError):
-            # Not a JSON clarification request, treat as normal response
+            # Not a JSON special status request, treat as normal response
            pass

         # Normal text response - format using tool-specific formatting
@@ -36,6 +36,10 @@ class ToolOutput(BaseModel):
         "success",
         "error",
         "clarification_required",
+        "full_codereview_required",
+        "focused_review_required",
+        "test_sample_needed",
+        "more_tests_required",
         "resend_prompt",
         "continuation_available",
     ] = "success"
@@ -50,6 +54,7 @@ class ToolOutput(BaseModel):
 class ClarificationRequest(BaseModel):
     """Request for additional context or clarification"""

+    status: Literal["clarification_required"] = "clarification_required"
     question: str = Field(..., description="Question to ask Claude for more context")
     files_needed: Optional[list[str]] = Field(
         default_factory=list, description="Specific files that are needed for analysis"
@@ -60,6 +65,48 @@ class ClarificationRequest(BaseModel):
     )


+class FullCodereviewRequired(BaseModel):
+    """Request for full code review when scope is too large for quick review"""
+
+    status: Literal["full_codereview_required"] = "full_codereview_required"
+    important: Optional[str] = Field(None, description="Important message about escalation")
+    reason: Optional[str] = Field(None, description="Reason why full review is needed")
+
+
+class FocusedReviewRequired(BaseModel):
+    """Request for Claude to provide smaller, focused subsets of code for review"""
+
+    status: Literal["focused_review_required"] = "focused_review_required"
+    reason: str = Field(..., description="Why the current scope is too large for effective review")
+    suggestion: str = Field(
+        ..., description="Suggested approach for breaking down the review into smaller, focused parts"
+    )
+
+
+class TestSampleNeeded(BaseModel):
+    """Request for additional test samples to determine testing framework"""
+
+    status: Literal["test_sample_needed"] = "test_sample_needed"
+    reason: str = Field(..., description="Reason why additional test samples are required")
+
+
+class MoreTestsRequired(BaseModel):
+    """Request for continuation to generate additional tests"""
+
+    status: Literal["more_tests_required"] = "more_tests_required"
+    pending_tests: str = Field(..., description="List of pending tests to be generated")
+
+
+# Registry mapping status strings to their corresponding Pydantic models
+SPECIAL_STATUS_MODELS = {
+    "clarification_required": ClarificationRequest,
+    "full_codereview_required": FullCodereviewRequired,
+    "focused_review_required": FocusedReviewRequired,
+    "test_sample_needed": TestSampleNeeded,
+    "more_tests_required": MoreTestsRequired,
+}
+
+
 class DiagnosticHypothesis(BaseModel):
     """A debugging hypothesis with context and next steps"""

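A practical consequence of this registry (a sketch only, not part of the diff): a future status should need just a new model plus a registry entry, with no change to the parsing logic in tools/base.py. The status name below is hypothetical:

```python
# Hypothetical future status, shown only to illustrate how SPECIAL_STATUS_MODELS extends.
# This would live in tools/models.py, next to the registry defined in the hunk above.
from typing import Literal, Optional

from pydantic import BaseModel, Field

from tools.models import SPECIAL_STATUS_MODELS


class BenchmarkRequired(BaseModel):
    """Request for Claude to run a benchmark before analysis can continue"""

    status: Literal["benchmark_required"] = "benchmark_required"
    reason: str = Field(..., description="Why a benchmark run is needed before continuing")
    command: Optional[str] = Field(None, description="Suggested benchmark command to run")


# Registering the model is all the dispatch code in tools/base.py needs.
SPECIAL_STATUS_MODELS["benchmark_required"] = BenchmarkRequired
```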
@@ -117,7 +117,10 @@ class TestGenTool(BaseTool):
                 },
                 "continuation_id": {
                     "type": "string",
-                    "description": "Thread continuation ID for multi-turn conversations. Can be used to continue conversations across different tools. Only provide this if continuing a previous conversation thread.",
+                    "description": (
+                        "Thread continuation ID for multi-turn conversations. Can be used to continue conversations "
+                        "across different tools. Only provide this if continuing a previous conversation thread."
+                    ),
                 },
             },
             "required": ["files", "prompt"] + (["model"] if self.is_effective_auto_mode() else []),
@@ -436,7 +439,7 @@ class TestGenTool(BaseTool):

 ---

-# IMMEDIATE ACTION REQUIRED
+# IMMEDIATE NEXT ACTION

 Claude, you are now in EXECUTION MODE. Take immediate action:

@@ -472,5 +475,6 @@ After creating each test file, show the user:
 ## Step 5: MOVE TO NEXT ACTION
 Once tests are confirmed working, immediately proceed to the next logical step for the project.

-**CRITICAL**: Do NOT stop after generating - you MUST create, validate, run, and confirm the tests work.
-Take full ownership of the testing implementation and move to your next work."""
+**CRITICAL**: Do NOT stop after generating - you MUST create, validate, run, and confirm the tests work. Take full
+ownership of the testing implementation and move to your next work. If you were supplied a more_tests_required request
+in the response above, you MUST honor it."""