Performance improvements when embedding files:

- Exit early at the MCP boundary if files won't fit within the chosen model's context
- Encourage Claude to re-run with a smaller, more relevant file selection
- Check file sizes before embedding
- Drop files from older conversation turns when building continuations, giving priority to newer files
- List the excluded files in the response returned to Claude
- Improved tests
- Improved precommit prompt
- Added a new Low severity level to precommit
- Improved documentation of file embedding strategy
- Refactor
This commit is contained in:
Fahad
2025-06-16 05:51:52 +04:00
parent 56333cbd86
commit 91077e3810
16 changed files with 1557 additions and 308 deletions

View File

@@ -30,12 +30,33 @@ Key Features:
- Turn-by-turn conversation history storage with tool attribution
- Cross-tool continuation support - switch tools while preserving context
- File context preservation - files shared in earlier turns remain accessible
- NEWEST-FIRST FILE PRIORITIZATION - when the same file appears in multiple turns,
references from newer turns take precedence over older ones. This ensures the
most recent file context is preserved when token limits require exclusions.
- Automatic turn limiting (20 turns max) to prevent runaway conversations
- Context reconstruction for stateless request continuity
- Redis-based persistence with automatic expiration (3 hour TTL)
- Thread-safe operations for concurrent access
- Graceful degradation when Redis is unavailable
FILE PRIORITIZATION STRATEGY:
The conversation memory system implements a sophisticated file prioritization algorithm
that ensures newer file references always take precedence over older ones:
1. When collecting files across conversation turns, the system walks BACKWARDS through
turns (newest to oldest) and builds a unique file list
2. If the same file path appears in multiple turns, only the reference from the
NEWEST turn is kept in the final list
3. This "newest-first" ordering is preserved throughout the entire pipeline:
- get_conversation_file_list() establishes the order
- build_conversation_history() maintains it during token budgeting
- When token limits are hit, OLDER files are excluded first
4. This strategy works across conversation chains - files from newer turns in ANY
thread take precedence over files from older turns in ANY thread
This approach ensures that when token limits force file exclusions, the most
recently referenced and contextually relevant files are preserved.
USAGE EXAMPLE:
1. Tool A creates thread: create_thread("analyze", request_data) → returns UUID
2. Tool A adds response: add_turn(UUID, "assistant", response, files=[...], tool_name="analyze")
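A minimal sketch of the newest-first collection described above (simplified, hypothetical turn data; the real logic lives in get_conversation_file_list() shown later in this diff):
turns = [
    {"files": ["main.py", "utils.py"]},   # Turn 1 (oldest)
    {"files": ["test.py"]},               # Turn 2
    {"files": ["main.py", "config.py"]},  # Turn 3 (newest)
]
seen, ordered = set(), []
for turn in reversed(turns):              # walk newest -> oldest
    for path in turn["files"]:
        if path not in seen:              # newest reference wins
            seen.add(path)
            ordered.append(path)
print(ordered)  # ['main.py', 'config.py', 'test.py', 'utils.py']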
@@ -262,11 +283,12 @@ def add_turn(
model_metadata: Optional[dict[str, Any]] = None,
) -> bool:
"""
Add turn to existing thread with atomic file ordering.
Appends a new conversation turn to an existing thread. This is the core
function for building conversation history and enabling cross-tool
continuation. Each turn preserves the tool and model that generated it,
and tracks file reception order using atomic Redis counters.
Args:
thread_id: UUID of the conversation thread
@@ -289,7 +311,7 @@ def add_turn(
Note:
- Refreshes thread TTL to configured timeout on successful update
- Turn limits prevent runaway conversations
- File references are preserved for cross-tool access with atomic ordering
- Model information enables cross-provider conversations
"""
logger.debug(f"[FLOW] Adding {role} turn to {thread_id} ({tool_name})")
@@ -374,77 +396,212 @@ def get_thread_chain(thread_id: str, max_depth: int = 20) -> list[ThreadContext]
def get_conversation_file_list(context: ThreadContext) -> list[str]:
"""
Extract all unique files from conversation turns with newest-first prioritization.
This function implements the core file prioritization logic used throughout the
conversation memory system. It walks backwards through conversation turns
(from newest to oldest) and collects unique file references, ensuring that
when the same file appears in multiple turns, the reference from the NEWEST
turn takes precedence.
PRIORITIZATION ALGORITHM:
1. Iterate through turns in REVERSE order (index len-1 down to 0)
2. For each turn, process files in the order they appear in turn.files
3. Add file to result list only if not already seen (newest reference wins)
4. Skip duplicate files that were already added from newer turns
This ensures that:
- Files from newer conversation turns appear first in the result
- When the same file is referenced multiple times, only the newest reference is kept
- The order reflects the most recent conversation context
Example:
Turn 1: files = ["main.py", "utils.py"]
Turn 2: files = ["test.py"]
Turn 3: files = ["main.py", "config.py"] # main.py appears again
Result: ["main.py", "config.py", "test.py", "utils.py"]
(main.py from Turn 3 takes precedence over Turn 1)
Args:
context: ThreadContext containing all conversation turns to process
Returns:
list[str]: Unique file paths ordered by newest reference first.
Empty list if no turns exist or no files are referenced.
Performance:
- Time Complexity: O(n*m) where n=turns, m=avg files per turn
- Space Complexity: O(f) where f=total unique files
- Uses set for O(1) duplicate detection
"""
if not context.turns:
logger.debug("[FILES] No turns found, returning empty file list")
return []
# Collect files by walking backwards (newest to oldest turns)
seen_files = set()
file_list = []
logger.debug(f"[FILES] Collecting files from {len(context.turns)} turns")
logger.debug(f"[FILES] Collecting files from {len(context.turns)} turns (newest first)")
# Process turns in reverse order (newest first) - this is the CORE of newest-first prioritization
# By iterating from len-1 down to 0, we encounter newer turns before older turns
# When we find a duplicate file, we skip it because the newer version is already in our list
for i in range(len(context.turns) - 1, -1, -1): # REVERSE: newest turn first
turn = context.turns[i]
if turn.files:
logger.debug(f"[FILES] Turn {i + 1} has {len(turn.files)} files: {turn.files}")
for file_path in turn.files:
if file_path not in seen_files:
# First time seeing this file - add it (this is the NEWEST reference)
seen_files.add(file_path)
file_list.append(file_path)
logger.debug(f"[FILES] Added new file: {file_path} (from turn {i + 1})")
else:
# File already seen from a NEWER turn - skip this older reference
logger.debug(f"[FILES] Skipping duplicate file: {file_path} (newer version already included)")
logger.debug(f"[FILES] Final file list ({len(file_list)}): {file_list}")
return file_list
def _plan_file_inclusion_by_size(all_files: list[str], max_file_tokens: int) -> tuple[list[str], list[str], int]:
"""
Plan which files to include based on size constraints.
This is ONLY used for conversation history building, not MCP boundary checks.
Args:
all_files: List of files to consider for inclusion
max_file_tokens: Maximum tokens available for file content
Returns:
Tuple of (files_to_include, files_to_skip, estimated_total_tokens)
"""
if not all_files:
return [], [], 0
files_to_include = []
files_to_skip = []
total_tokens = 0
logger.debug(f"[FILES] Planning inclusion for {len(all_files)} files with budget {max_file_tokens:,} tokens")
for file_path in all_files:
try:
from utils.file_utils import estimate_file_tokens, translate_path_for_environment
translated_path = translate_path_for_environment(file_path)
if os.path.exists(translated_path) and os.path.isfile(translated_path):
# Use centralized token estimation for consistency
estimated_tokens = estimate_file_tokens(file_path)
if total_tokens + estimated_tokens <= max_file_tokens:
files_to_include.append(file_path)
total_tokens += estimated_tokens
logger.debug(
f"[FILES] Including {file_path} - {estimated_tokens:,} tokens (total: {total_tokens:,})"
)
else:
files_to_skip.append(file_path)
logger.debug(
f"[FILES] Skipping {file_path} - would exceed budget (needs {estimated_tokens:,} tokens)"
)
else:
files_to_skip.append(file_path)
logger.debug(f"[FILES] Skipping {file_path} - file not accessible")
except Exception as e:
files_to_skip.append(file_path)
logger.debug(f"[FILES] Skipping {file_path} - error: {type(e).__name__}: {e}")
logger.debug(
f"[FILES] Inclusion plan: {len(files_to_include)} include, {len(files_to_skip)} skip, {total_tokens:,} tokens"
)
return files_to_include, files_to_skip, total_tokens
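A rough usage sketch (hypothetical file names and token estimates) of the planner above; note that a skipped file does not stop the loop, so a later, smaller file can still fit:
# Suppose estimate_file_tokens() returns 6,000 for recent.py, 3,500 for config.py and 2,000 for legacy.py
include, skip, used = _plan_file_inclusion_by_size(
    ["recent.py", "config.py", "legacy.py"],  # already ordered newest-first
    max_file_tokens=10_000,
)
# include == ["recent.py", "config.py"]   (6,000 + 3,500 = 9,500 tokens)
# skip    == ["legacy.py"]                (adding 2,000 would exceed the 10,000 budget)
# used    == 9500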
def build_conversation_history(context: ThreadContext, model_context=None, read_files_func=None) -> tuple[str, int]:
"""
Build formatted conversation history for tool prompts with embedded file contents.
Creates a comprehensive conversation history that includes both conversation turns and
file contents, with intelligent prioritization to maximize relevant context within
token limits. This function enables stateless tools to access complete conversation
context from previous interactions, including cross-tool continuations.
FILE PRIORITIZATION BEHAVIOR:
Files from newer conversation turns are prioritized over files from older turns.
When the same file appears in multiple turns, the reference from the NEWEST turn
takes precedence. This ensures the most recent file context is preserved when
token limits require file exclusions.
CONVERSATION CHAIN HANDLING:
If the thread has a parent_thread_id, this function traverses the entire chain
to include complete conversation history across multiple linked threads. File
prioritization works across the entire chain, not just the current thread.
TOKEN MANAGEMENT:
- Uses model-specific token allocation (file_tokens + history_tokens)
- Files are embedded ONCE at the start to prevent duplication
- Conversation turns are processed newest-first but presented chronologically
- Stops adding turns when token budget would be exceeded
- Gracefully handles token limits with informative notes
Args:
context: ThreadContext containing the conversation to format
model_context: ModelContext for token allocation (optional, uses DEFAULT_MODEL fallback)
read_files_func: Optional function to read files (primarily for testing)
Returns:
tuple[str, int]: (formatted_conversation_history, total_tokens_used)
Returns ("", 0) if no conversation turns exist
Returns ("", 0) if no conversation turns exist in the context
Output Format:
=== CONVERSATION HISTORY (CONTINUATION) ===
Thread: <thread_id>
Tool: <original_tool_name>
Turn <current>/<max_allowed>
You are continuing this conversation thread from where it left off.
=== FILES REFERENCED IN THIS CONVERSATION ===
The following files have been shared and analyzed during our conversation.
[NOTE: X files omitted due to size constraints]
Refer to these when analyzing the context and requests below:
<embedded_file_contents_with_line_numbers>
=== END REFERENCED FILES ===
Previous conversation turns:
--- Turn 1 (Claude) ---
Files used in this turn: file1.py, file2.py
<turn_content>
--- Turn 2 (Gemini using analyze via google/gemini-2.5-flash) ---
Files used in this turn: file3.py
<turn_content>
=== END CONVERSATION HISTORY ===
IMPORTANT: You are continuing an existing conversation thread...
This is turn X of the conversation - use the conversation history above...
Cross-Tool Collaboration:
This formatted history allows any tool to "see" both conversation context AND
file contents from previous tools, enabling seamless handoffs between analyze,
codereview, debug, chat, and other tools while maintaining complete context.
Performance Characteristics:
- O(n) file collection with newest-first prioritization
- Intelligent token budgeting prevents context window overflow
- Redis-based persistence with automatic TTL management
- Graceful degradation when files are inaccessible or too large
"""
# Get the complete thread chain
if context.parent_thread_id:
@@ -453,19 +610,25 @@ def build_conversation_history(context: ThreadContext, model_context=None, read_
# Collect all turns from all threads in chain
all_turns = []
total_turns = 0
for thread in chain:
all_turns.extend(thread.turns)
total_turns += len(thread.turns)
# Use centralized file collection logic for consistency across the entire chain
# This ensures files from newer turns across ALL threads take precedence
# over files from older turns, maintaining the newest-first prioritization
# even when threads are chained together
temp_context = ThreadContext(
thread_id="merged_chain",
created_at=context.created_at,
last_updated_at=context.last_updated_at,
tool_name=context.tool_name,
turns=all_turns, # All turns from entire chain in chronological order
initial_context=context.initial_context,
)
all_files = get_conversation_file_list(temp_context) # Applies newest-first logic to entire chain
logger.debug(f"[THREAD] Built history from {len(chain)} threads with {total_turns} total turns")
else:
# Single thread, no parent chain
@@ -511,101 +674,91 @@ def build_conversation_history(context: ThreadContext, model_context=None, read_
"",
]
# Embed files referenced in this conversation with size-aware selection
if all_files:
logger.debug(f"[FILES] Starting embedding for {len(all_files)} files")
# Plan file inclusion based on size constraints
# CRITICAL: all_files is already ordered by newest-first prioritization from get_conversation_file_list()
# So when _plan_file_inclusion_by_size() hits token limits, it naturally excludes OLDER files first
# while preserving the most recent file references - exactly what we want!
files_to_include, files_to_skip, estimated_tokens = _plan_file_inclusion_by_size(all_files, max_file_tokens)
if files_to_skip:
logger.info(f"[FILES] Skipping {len(files_to_skip)} files due to size constraints: {files_to_skip}")
if files_to_include:
history_parts.extend(
[
"=== FILES REFERENCED IN THIS CONVERSATION ===",
"The following files have been shared and analyzed during our conversation.",
(
""
if not files_to_skip
else f"[NOTE: {len(files_to_skip)} files omitted due to size constraints]"
),
"Refer to these when analyzing the context and requests below:",
"",
]
)
if read_files_func is None:
from utils.file_utils import read_file_content
# Process files for embedding
file_contents = []
total_tokens = 0
files_included = 0
for file_path in files_to_include:
try:
logger.debug(f"[FILES] Processing file {file_path}")
formatted_content, content_tokens = read_file_content(file_path)
if formatted_content:
file_contents.append(formatted_content)
total_tokens += content_tokens
files_included += 1
logger.debug(
f"File embedded in conversation history: {file_path} ({content_tokens:,} tokens)"
)
logger.debug(
f"[FILES] Successfully embedded {file_path} - {content_tokens:,} tokens (total: {total_tokens:,})"
)
else:
logger.debug(f"File skipped (empty content): {file_path}")
except Exception as e:
logger.warning(
f"Failed to embed file in conversation history: {file_path} - {type(e).__name__}: {e}"
)
continue
if file_contents:
files_content = "".join(file_contents)
if files_to_skip:
files_content += (
f"\n[NOTE: {len(files_to_skip)} additional file(s) were omitted due to size constraints. "
f"These were older files from earlier conversation turns.]\n"
)
history_parts.append(files_content)
logger.debug(
f"Conversation history file embedding complete: {files_included} files embedded, {len(files_to_skip)} omitted, {total_tokens:,} total tokens"
)
else:
history_parts.append("(No accessible files found)")
logger.debug(f"[FILES] No accessible files found from {len(files_to_include)} planned files")
else:
# Fallback to original read_files function for backward compatibility
files_content = read_files_func(all_files)
if files_content:
# Add token validation for the combined file content
from utils.token_utils import check_token_limit
within_limit, estimated_tokens = check_token_limit(files_content)
if within_limit:
history_parts.append(files_content)
else:
# Handle token limit exceeded for conversation files
error_message = f"ERROR: The total size of files referenced in this conversation has exceeded the context limit and cannot be displayed.\nEstimated tokens: {estimated_tokens}, but limit is {max_file_tokens}."
history_parts.append(error_message)
else:
history_parts.append("(No accessible files found)")
history_parts.extend(
[

View File

@@ -178,3 +178,65 @@ def is_binary_file(file_path: str) -> bool:
from pathlib import Path
return Path(file_path).suffix.lower() in BINARY_EXTENSIONS
# File-type specific token-to-byte ratios for accurate token estimation
# Based on empirical analysis of file compression characteristics and tokenization patterns
TOKEN_ESTIMATION_RATIOS = {
# Programming languages
".py": 3.5, # Python - moderate verbosity
".js": 3.2, # JavaScript - compact syntax
".ts": 3.3, # TypeScript - type annotations add tokens
".jsx": 3.1, # React JSX - JSX tags are tokenized efficiently
".tsx": 3.0, # React TSX - combination of TypeScript + JSX
".java": 3.6, # Java - verbose syntax, long identifiers
".cpp": 3.7, # C++ - preprocessor directives, templates
".c": 3.8, # C - function definitions, struct declarations
".go": 3.9, # Go - explicit error handling, package names
".rs": 3.5, # Rust - similar to Python in verbosity
".php": 3.3, # PHP - mixed HTML/code, variable prefixes
".rb": 3.6, # Ruby - descriptive method names
".swift": 3.4, # Swift - modern syntax, type inference
".kt": 3.5, # Kotlin - similar to modern languages
".scala": 3.2, # Scala - functional programming, concise
# Scripts and configuration
".sh": 4.1, # Shell scripts - commands and paths
".bat": 4.0, # Batch files - similar to shell
".ps1": 3.8, # PowerShell - more structured than bash
".sql": 3.8, # SQL - keywords and table/column names
# Data and configuration formats
".json": 2.5, # JSON - lots of punctuation and quotes
".yaml": 3.0, # YAML - structured but readable
".yml": 3.0, # YAML (alternative extension)
".xml": 2.8, # XML - tags and attributes
".toml": 3.2, # TOML - similar to config files
# Documentation and text
".md": 4.2, # Markdown - natural language with formatting
".txt": 4.0, # Plain text - mostly natural language
".rst": 4.1, # reStructuredText - documentation format
# Web technologies
".html": 2.9, # HTML - tags and attributes
".css": 3.4, # CSS - properties and selectors
# Logs and data
".log": 4.5, # Log files - timestamps, messages, stack traces
".csv": 3.1, # CSV - data with delimiters
# Docker and infrastructure
".dockerfile": 3.7, # Dockerfile - commands and paths
".tf": 3.5, # Terraform - infrastructure as code
}
def get_token_estimation_ratio(file_path: str) -> float:
"""
Get the token estimation ratio for a file based on its extension.
Args:
file_path: Path to the file
Returns:
Token-to-byte ratio for the file type (default: 3.5 for unknown types)
"""
from pathlib import Path
extension = Path(file_path).suffix.lower()
return TOKEN_ESTIMATION_RATIOS.get(extension, 3.5) # Conservative default
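As a quick illustration (hypothetical file and size), the ratio converts a byte count into a token estimate, which estimate_file_tokens() below applies automatically:
ratio = get_token_estimation_ratio("src/server.py")  # 3.5 for .py files
estimated_tokens = int(7_000 / ratio)                 # a 7,000-byte Python file ~= 2,000 tokens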

View File

@@ -18,10 +18,12 @@ Security Model:
- Symbolic links are resolved to ensure they stay within bounds
"""
import json
import logging
import os
import time
from pathlib import Path
from typing import Callable, Optional
from .file_types import BINARY_EXTENSIONS, CODE_EXTENSIONS, IMAGE_EXTENSIONS, TEXT_EXTENSIONS
from .security_config import CONTAINER_WORKSPACE, EXCLUDED_DIRS, MCP_SIGNATURE_FILES, SECURITY_ROOT, WORKSPACE_ROOT
@@ -689,3 +691,349 @@ def read_files(
result = "\n\n".join(content_parts) if content_parts else ""
logger.debug(f"[FILES] read_files complete: {len(result)} chars, {total_tokens:,} tokens used")
return result
def estimate_file_tokens(file_path: str) -> int:
"""
Estimate tokens for a file using file-type aware ratios.
Args:
file_path: Path to the file
Returns:
Estimated token count for the file
"""
try:
translated_path = translate_path_for_environment(file_path)
if not os.path.exists(translated_path) or not os.path.isfile(translated_path):
return 0
file_size = os.path.getsize(translated_path)
# Get the appropriate ratio for this file type
from .file_types import get_token_estimation_ratio
ratio = get_token_estimation_ratio(file_path)
return int(file_size / ratio)
except Exception:
return 0
def check_files_size_limit(files: list[str], max_tokens: int, threshold_percent: float = 1.0) -> tuple[bool, int, int]:
"""
Check if a list of files would exceed token limits.
Args:
files: List of file paths to check
max_tokens: Maximum allowed tokens
threshold_percent: Percentage of max_tokens to use as threshold (0.0-1.0)
Returns:
Tuple of (within_limit, total_estimated_tokens, file_count)
"""
if not files:
return True, 0, 0
total_estimated_tokens = 0
file_count = 0
threshold = int(max_tokens * threshold_percent)
for file_path in files:
try:
estimated_tokens = estimate_file_tokens(file_path)
total_estimated_tokens += estimated_tokens
if estimated_tokens > 0: # Only count accessible files
file_count += 1
except Exception:
# Skip files that can't be accessed for size check
continue
within_limit = total_estimated_tokens <= threshold
return within_limit, total_estimated_tokens, file_count
class LogTailer:
"""
General-purpose log file tailer with rotation detection.
This class provides a reusable way to monitor log files for new content,
automatically handling log rotation and maintaining position tracking.
"""
def __init__(self, file_path: str, initial_seek_end: bool = True):
"""
Initialize log tailer for a specific file.
Args:
file_path: Path to the log file to monitor
initial_seek_end: If True, start monitoring from end of file
"""
self.file_path = file_path
self.position = 0
self.last_size = 0
self.initial_seek_end = initial_seek_end
# Ensure file exists and initialize position
Path(self.file_path).touch()
if self.initial_seek_end and os.path.exists(self.file_path):
self.last_size = os.path.getsize(self.file_path)
self.position = self.last_size
def read_new_lines(self) -> list[str]:
"""
Read new lines since last call, handling rotation.
Returns:
List of new lines from the file
"""
if not os.path.exists(self.file_path):
return []
try:
current_size = os.path.getsize(self.file_path)
# Check for log rotation (file size decreased)
if current_size < self.last_size:
self.position = 0
self.last_size = current_size
with open(self.file_path, encoding="utf-8", errors="ignore") as f:
f.seek(self.position)
new_lines = f.readlines()
self.position = f.tell()
self.last_size = current_size
# Strip whitespace from each line
return [line.strip() for line in new_lines if line.strip()]
except OSError:
return []
def monitor_continuously(
self,
line_handler: Callable[[str], None],
check_interval: float = 0.5,
stop_condition: Optional[Callable[[], bool]] = None,
):
"""
Monitor file continuously and call handler for each new line.
Args:
line_handler: Function to call for each new line
check_interval: Seconds between file checks
stop_condition: Optional function that returns True to stop monitoring
"""
while True:
try:
if stop_condition and stop_condition():
break
new_lines = self.read_new_lines()
for line in new_lines:
line_handler(line)
time.sleep(check_interval)
except KeyboardInterrupt:
break
except Exception as e:
logger.warning(f"Error monitoring log file {self.file_path}: {e}")
time.sleep(1)
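A small usage sketch for the tailer above (hypothetical log path):
tailer = LogTailer("/tmp/mcp_server.log", initial_seek_end=True)
# One-shot poll: returns only the lines written since the previous call
for line in tailer.read_new_lines():
    print(line)
# Continuous monitoring: streams new lines until interrupted
tailer.monitor_continuously(print, check_interval=0.5)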
def read_json_file(file_path: str) -> Optional[dict]:
"""
Read and parse a JSON file with proper error handling.
Args:
file_path: Path to the JSON file
Returns:
Parsed JSON data as dict, or None if file doesn't exist or invalid
"""
try:
translated_path = translate_path_for_environment(file_path)
if not os.path.exists(translated_path):
return None
with open(translated_path, encoding="utf-8") as f:
return json.load(f)
except (json.JSONDecodeError, OSError):
return None
def write_json_file(file_path: str, data: dict, indent: int = 2) -> bool:
"""
Write data to a JSON file with proper formatting.
Args:
file_path: Path to write the JSON file
data: Dictionary data to serialize
indent: JSON indentation level
Returns:
True if successful, False otherwise
"""
try:
translated_path = translate_path_for_environment(file_path)
os.makedirs(os.path.dirname(translated_path), exist_ok=True)
with open(translated_path, "w", encoding="utf-8") as f:
json.dump(data, f, indent=indent, ensure_ascii=False)
return True
except (OSError, TypeError):
return False
def get_file_size(file_path: str) -> int:
"""
Get file size in bytes with proper error handling.
Args:
file_path: Path to the file
Returns:
File size in bytes, or 0 if file doesn't exist or error
"""
try:
translated_path = translate_path_for_environment(file_path)
if os.path.exists(translated_path) and os.path.isfile(translated_path):
return os.path.getsize(translated_path)
return 0
except OSError:
return 0
def ensure_directory_exists(file_path: str) -> bool:
"""
Ensure the parent directory of a file path exists.
Args:
file_path: Path to file (directory will be created for parent)
Returns:
True if directory exists or was created, False on error
"""
try:
translated_path = translate_path_for_environment(file_path)
directory = os.path.dirname(translated_path)
if directory:
os.makedirs(directory, exist_ok=True)
return True
except OSError:
return False
def is_text_file(file_path: str) -> bool:
"""
Check if a file is likely a text file based on extension and content.
Args:
file_path: Path to the file
Returns:
True if file appears to be text, False otherwise
"""
from .file_types import is_text_file as check_text_type
return check_text_type(file_path)
def read_file_safely(file_path: str, max_size: int = 10 * 1024 * 1024) -> Optional[str]:
"""
Read a file with size limits and encoding handling.
Args:
file_path: Path to the file
max_size: Maximum file size in bytes (default 10MB)
Returns:
File content as string, or None if file too large or unreadable
"""
try:
translated_path = translate_path_for_environment(file_path)
if not os.path.exists(translated_path) or not os.path.isfile(translated_path):
return None
file_size = os.path.getsize(translated_path)
if file_size > max_size:
return None
with open(translated_path, encoding="utf-8", errors="ignore") as f:
return f.read()
except OSError:
return None
def check_total_file_size(files: list[str], model_name: Optional[str] = None) -> Optional[dict]:
"""
Check if total file sizes would exceed token threshold before embedding.
IMPORTANT: This performs STRICT REJECTION at MCP boundary.
No partial inclusion - either all files fit or request is rejected.
This forces Claude to make better file selection decisions.
Args:
files: List of file paths to check
model_name: Model name for context-aware thresholds, or None for default
Returns:
Dict with MCP_CODE_TOO_LARGE response if too large, None if acceptable
"""
if not files:
return None
# Get model-specific token allocation (dynamic thresholds)
if not model_name:
from config import DEFAULT_MODEL
model_name = DEFAULT_MODEL
# Handle auto mode gracefully
if model_name.lower() == "auto":
from providers.registry import ModelProviderRegistry
model_name = ModelProviderRegistry.get_preferred_fallback_model()
from utils.model_context import ModelContext
model_context = ModelContext(model_name)
token_allocation = model_context.calculate_token_allocation()
# Dynamic threshold based on model capacity
context_window = token_allocation.total_tokens
if context_window >= 1_000_000: # Gemini-class models
threshold_percent = 0.8 # Can be more generous
elif context_window >= 500_000: # Mid-range models
threshold_percent = 0.7 # Moderate
else: # OpenAI-class models (200K)
threshold_percent = 0.6 # Conservative
max_file_tokens = int(token_allocation.file_tokens * threshold_percent)
# Use centralized file size checking (threshold already applied to max_file_tokens)
within_limit, total_estimated_tokens, file_count = check_files_size_limit(files, max_file_tokens)
if not within_limit:
return {
"status": "code_too_large",
"content": (
f"The selected files are too large for analysis "
f"(estimated {total_estimated_tokens:,} tokens, limit {max_file_tokens:,}). "
f"Please select fewer, more specific files that are most relevant "
f"to your question, then invoke the tool again."
),
"content_type": "text",
"metadata": {
"total_estimated_tokens": total_estimated_tokens,
"limit": max_file_tokens,
"file_count": file_count,
"threshold_percent": threshold_percent,
"model_context_window": context_window,
"instructions": "Reduce file selection and try again - all files must fit within budget",
},
}
return None # Proceed with ALL files
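A hedged sketch of how a tool might apply this check at the MCP boundary (hypothetical variable and model names; the actual call sites live in the tool implementations):
size_check = check_total_file_size(requested_files, model_name="gemini-2.5-flash")
if size_check is not None:
    # Strict rejection: surface the code_too_large response so Claude can
    # re-run with a smaller, more relevant file selection
    return size_check
# All files fit - proceed with embedding them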