Performance improvements when embedding files:
- Exit early at the MCP boundary if files won't fit within the chosen model's context
- Encourage Claude to re-run with better context
- Check file sizes before embedding
- Drop files from older conversations when building continuations, giving priority to newer files
- List and mention excluded files to Claude on return
- Improved tests
- Improved precommit prompt
- Added a new Low severity to precommit
- Improved documentation of the file embedding strategy
- Refactor
@@ -30,12 +30,33 @@ Key Features:
 - Turn-by-turn conversation history storage with tool attribution
 - Cross-tool continuation support - switch tools while preserving context
 - File context preservation - files shared in earlier turns remain accessible
-- Automatic turn limiting (5 turns max) to prevent runaway conversations
+- NEWEST-FIRST FILE PRIORITIZATION - when the same file appears in multiple turns,
+  references from newer turns take precedence over older ones. This ensures the
+  most recent file context is preserved when token limits require exclusions.
+- Automatic turn limiting (20 turns max) to prevent runaway conversations
 - Context reconstruction for stateless request continuity
-- Redis-based persistence with automatic expiration (1 hour TTL)
+- Redis-based persistence with automatic expiration (3 hour TTL)
 - Thread-safe operations for concurrent access
 - Graceful degradation when Redis is unavailable
+
+FILE PRIORITIZATION STRATEGY:
+The conversation memory system implements a sophisticated file prioritization algorithm
+that ensures newer file references always take precedence over older ones:
+
+1. When collecting files across conversation turns, the system walks BACKWARDS through
+   turns (newest to oldest) and builds a unique file list
+2. If the same file path appears in multiple turns, only the reference from the
+   NEWEST turn is kept in the final list
+3. This "newest-first" ordering is preserved throughout the entire pipeline:
+   - get_conversation_file_list() establishes the order
+   - build_conversation_history() maintains it during token budgeting
+   - When token limits are hit, OLDER files are excluded first
+4. This strategy works across conversation chains - files from newer turns in ANY
+   thread take precedence over files from older turns in ANY thread
+
+This approach ensures that when token limits force file exclusions, the most
+recently referenced and contextually relevant files are preserved.
 
 USAGE EXAMPLE:
 1. Tool A creates thread: create_thread("analyze", request_data) → returns UUID
 2. Tool A adds response: add_turn(UUID, "assistant", response, files=[...], tool_name="analyze")
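A minimal sketch of that continuation flow, using only the create_thread/add_turn signatures documented above (the request_data, response text, roles, and file names here are hypothetical):

    thread_id = create_thread("analyze", {"prompt": "Review the auth flow"})  # hypothetical request_data
    add_turn(thread_id, "assistant", "Initial analysis...", files=["auth.py"], tool_name="analyze")
    # A different tool can continue the same thread and still see the prior turn and its files:
    add_turn(thread_id, "user", "Now check for race conditions", files=["auth.py", "locks.py"], tool_name="codereview")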
@@ -262,11 +283,12 @@ def add_turn(
     model_metadata: Optional[dict[str, Any]] = None,
 ) -> bool:
     """
-    Add turn to existing thread
+    Add turn to existing thread with atomic file ordering.
 
     Appends a new conversation turn to an existing thread. This is the core
     function for building conversation history and enabling cross-tool
-    continuation. Each turn preserves the tool and model that generated it.
+    continuation. Each turn preserves the tool and model that generated it,
+    and tracks file reception order using atomic Redis counters.
 
     Args:
         thread_id: UUID of the conversation thread
@@ -289,7 +311,7 @@ def add_turn(
     Note:
         - Refreshes thread TTL to configured timeout on successful update
         - Turn limits prevent runaway conversations
-        - File references are preserved for cross-tool access
+        - File references are preserved for cross-tool access with atomic ordering
        - Model information enables cross-provider conversations
     """
     logger.debug(f"[FLOW] Adding {role} turn to {thread_id} ({tool_name})")
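The "atomic Redis counters" mentioned in the docstring guarantee a stable reception order even under concurrent add_turn() calls. One common shape for such a counter, shown purely as an illustrative assumption (this is not the project's actual code, and the key layout is invented):

    import redis  # assumes a reachable Redis instance; illustrative only

    r = redis.Redis()
    thread_id = "example-uuid"  # placeholder, not a real thread
    # INCR is atomic, so concurrent callers observe distinct, monotonically
    # increasing ranks; a file's rank records when it was first received.
    rank = r.incr(f"thread:{thread_id}:file_seq")  # hypothetical key name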
@@ -374,77 +396,212 @@ def get_thread_chain(thread_id: str, max_depth: int = 20) -> list[ThreadContext]
 def get_conversation_file_list(context: ThreadContext) -> list[str]:
     """
-    Get all unique files referenced across all turns in a conversation.
+    Extract all unique files from conversation turns with newest-first prioritization.
 
-    This function extracts and deduplicates file references from all conversation
-    turns to enable efficient file embedding - files are read once and shared
-    across all turns rather than being embedded multiple times.
+    This function implements the core file prioritization logic used throughout the
+    conversation memory system. It walks backwards through conversation turns
+    (from newest to oldest) and collects unique file references, ensuring that
+    when the same file appears in multiple turns, the reference from the NEWEST
+    turn takes precedence.
+
+    PRIORITIZATION ALGORITHM:
+    1. Iterate through turns in REVERSE order (index len-1 down to 0)
+    2. For each turn, process files in the order they appear in turn.files
+    3. Add file to result list only if not already seen (newest reference wins)
+    4. Skip duplicate files that were already added from newer turns
+
+    This ensures that:
+    - Files from newer conversation turns appear first in the result
+    - When the same file is referenced multiple times, only the newest reference is kept
+    - The order reflects the most recent conversation context
+
+    Example:
+        Turn 1: files = ["main.py", "utils.py"]
+        Turn 2: files = ["test.py"]
+        Turn 3: files = ["main.py", "config.py"]  # main.py appears again
+
+        Result: ["main.py", "config.py", "test.py", "utils.py"]
+        (main.py from Turn 3 takes precedence over Turn 1)
 
     Args:
-        context: ThreadContext containing the complete conversation
+        context: ThreadContext containing all conversation turns to process
 
     Returns:
-        list[str]: Deduplicated list of file paths referenced in the conversation
+        list[str]: Unique file paths ordered by newest reference first.
+                   Empty list if no turns exist or no files are referenced.
+
+    Performance:
+        - Time Complexity: O(n*m) where n=turns, m=avg files per turn
+        - Space Complexity: O(f) where f=total unique files
+        - Uses set for O(1) duplicate detection
     """
     if not context.turns:
         logger.debug("[FILES] No turns found, returning empty file list")
         return []
 
-    # Collect all unique files from all turns, preserving order of first appearance
+    # Collect files by walking backwards (newest to oldest turns)
     seen_files = set()
-    unique_files = []
+    file_list = []
 
-    logger.debug(f"[FILES] Collecting files from {len(context.turns)} turns")
+    logger.debug(f"[FILES] Collecting files from {len(context.turns)} turns (newest first)")
 
-    for i, turn in enumerate(context.turns):
+    # Process turns in reverse order (newest first) - this is the CORE of newest-first prioritization
+    # By iterating from len-1 down to 0, we encounter newer turns before older turns
+    # When we find a duplicate file, we skip it because the newer version is already in our list
+    for i in range(len(context.turns) - 1, -1, -1):  # REVERSE: newest turn first
+        turn = context.turns[i]
         if turn.files:
             logger.debug(f"[FILES] Turn {i + 1} has {len(turn.files)} files: {turn.files}")
             for file_path in turn.files:
                 if file_path not in seen_files:
+                    # First time seeing this file - add it (this is the NEWEST reference)
                     seen_files.add(file_path)
-                    unique_files.append(file_path)
-                    logger.debug(f"[FILES] Added new file: {file_path}")
+                    file_list.append(file_path)
+                    logger.debug(f"[FILES] Added new file: {file_path} (from turn {i + 1})")
                 else:
-                    logger.debug(f"[FILES] Duplicate file skipped: {file_path}")
-        else:
-            logger.debug(f"[FILES] Turn {i + 1} has no files")
+                    # File already seen from a NEWER turn - skip this older reference
+                    logger.debug(f"[FILES] Skipping duplicate file: {file_path} (newer version already included)")
 
-    logger.debug(f"[FILES] Final unique file list ({len(unique_files)}): {unique_files}")
-    return unique_files
+    logger.debug(f"[FILES] Final file list ({len(file_list)}): {file_list}")
+    return file_list
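The docstring's example can be verified with a standalone sketch of the same walk-backwards logic, with plain lists standing in for conversation turns:

    def newest_first(turns: list[list[str]]) -> list[str]:
        seen, result = set(), []
        for files in reversed(turns):  # newest turn first
            for f in files:
                if f not in seen:
                    seen.add(f)
                    result.append(f)
        return result

    turns = [["main.py", "utils.py"], ["test.py"], ["main.py", "config.py"]]
    assert newest_first(turns) == ["main.py", "config.py", "test.py", "utils.py"]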
+def _plan_file_inclusion_by_size(all_files: list[str], max_file_tokens: int) -> tuple[list[str], list[str], int]:
+    """
+    Plan which files to include based on size constraints.
+
+    This is ONLY used for conversation history building, not MCP boundary checks.
+
+    Args:
+        all_files: List of files to consider for inclusion
+        max_file_tokens: Maximum tokens available for file content
+
+    Returns:
+        Tuple of (files_to_include, files_to_skip, estimated_total_tokens)
+    """
+    if not all_files:
+        return [], [], 0
+
+    files_to_include = []
+    files_to_skip = []
+    total_tokens = 0
+
+    logger.debug(f"[FILES] Planning inclusion for {len(all_files)} files with budget {max_file_tokens:,} tokens")
+
+    for file_path in all_files:
+        try:
+            from utils.file_utils import estimate_file_tokens, translate_path_for_environment
+
+            translated_path = translate_path_for_environment(file_path)
+
+            if os.path.exists(translated_path) and os.path.isfile(translated_path):
+                # Use centralized token estimation for consistency
+                estimated_tokens = estimate_file_tokens(file_path)
+
+                if total_tokens + estimated_tokens <= max_file_tokens:
+                    files_to_include.append(file_path)
+                    total_tokens += estimated_tokens
+                    logger.debug(
+                        f"[FILES] Including {file_path} - {estimated_tokens:,} tokens (total: {total_tokens:,})"
+                    )
+                else:
+                    files_to_skip.append(file_path)
+                    logger.debug(
+                        f"[FILES] Skipping {file_path} - would exceed budget (needs {estimated_tokens:,} tokens)"
+                    )
+            else:
+                files_to_skip.append(file_path)
+                logger.debug(f"[FILES] Skipping {file_path} - file not accessible")
+
+        except Exception as e:
+            files_to_skip.append(file_path)
+            logger.debug(f"[FILES] Skipping {file_path} - error: {type(e).__name__}: {e}")
+
+    logger.debug(
+        f"[FILES] Inclusion plan: {len(files_to_include)} include, {len(files_to_skip)} skip, {total_tokens:,} tokens"
+    )
+    return files_to_include, files_to_skip, total_tokens
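Because all_files arrives newest-first, the greedy budget loop above excludes the oldest references first. The same effect in miniature, as a standalone sketch with made-up token counts:

    budget = 10
    sizes = {"new.py": 6, "mid.py": 3, "old.py": 4}  # estimated tokens, newest first
    included, skipped, used = [], [], 0
    for f in ["new.py", "mid.py", "old.py"]:
        if used + sizes[f] <= budget:
            included.append(f)
            used += sizes[f]
        else:
            skipped.append(f)
    assert included == ["new.py", "mid.py"] and skipped == ["old.py"]  # oldest dropped first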
 def build_conversation_history(context: ThreadContext, model_context=None, read_files_func=None) -> tuple[str, int]:
     """
     Build formatted conversation history for tool prompts with embedded file contents.
 
-    Creates a formatted string representation of the conversation history that includes
-    full file contents from all referenced files. Files are embedded only ONCE at the
-    start, even if referenced in multiple turns, to prevent duplication and optimize
-    token usage.
+    Creates a comprehensive conversation history that includes both conversation turns and
+    file contents, with intelligent prioritization to maximize relevant context within
+    token limits. This function enables stateless tools to access complete conversation
+    context from previous interactions, including cross-tool continuations.
 
-    If the thread has a parent chain, this function traverses the entire chain to
-    include the complete conversation history.
+    FILE PRIORITIZATION BEHAVIOR:
+    Files from newer conversation turns are prioritized over files from older turns.
+    When the same file appears in multiple turns, the reference from the NEWEST turn
+    takes precedence. This ensures the most recent file context is preserved when
+    token limits require file exclusions.
+
+    CONVERSATION CHAIN HANDLING:
+    If the thread has a parent_thread_id, this function traverses the entire chain
+    to include complete conversation history across multiple linked threads. File
+    prioritization works across the entire chain, not just the current thread.
+
+    TOKEN MANAGEMENT:
+    - Uses model-specific token allocation (file_tokens + history_tokens)
+    - Files are embedded ONCE at the start to prevent duplication
+    - Conversation turns are processed newest-first but presented chronologically
+    - Stops adding turns when token budget would be exceeded
+    - Gracefully handles token limits with informative notes
 
     Args:
-        context: ThreadContext containing the complete conversation
-        model_context: ModelContext for token allocation (optional, uses DEFAULT_MODEL if not provided)
-        read_files_func: Optional function to read files (for testing)
+        context: ThreadContext containing the conversation to format
+        model_context: ModelContext for token allocation (optional, uses DEFAULT_MODEL fallback)
+        read_files_func: Optional function to read files (primarily for testing)
 
     Returns:
         tuple[str, int]: (formatted_conversation_history, total_tokens_used)
-        Returns ("", 0) if no conversation turns exist
+        Returns ("", 0) if no conversation turns exist in the context
 
-    Format:
-        - Header with thread metadata and turn count
-        - All referenced files embedded once with full contents
-        - Each turn shows: role, tool used, which files were used, content
-        - Clear delimiters for AI parsing
-        - Continuation instruction at end
+    Output Format:
+        === CONVERSATION HISTORY (CONTINUATION) ===
+        Thread: <thread_id>
+        Tool: <original_tool_name>
+        Turn <current>/<max_allowed>
+        You are continuing this conversation thread from where it left off.
 
-    Note:
-        This formatted history allows tools to "see" both conversation context AND
-        file contents from previous tools, enabling true cross-tool collaboration
-        while preventing duplicate file embeddings.
+        === FILES REFERENCED IN THIS CONVERSATION ===
+        The following files have been shared and analyzed during our conversation.
+        [NOTE: X files omitted due to size constraints]
+        Refer to these when analyzing the context and requests below:
+
+        <embedded_file_contents_with_line_numbers>
+
+        === END REFERENCED FILES ===
+
+        Previous conversation turns:
+
+        --- Turn 1 (Claude) ---
+        Files used in this turn: file1.py, file2.py
+
+        <turn_content>
+
+        --- Turn 2 (Gemini using analyze via google/gemini-2.5-flash) ---
+        Files used in this turn: file3.py
+
+        <turn_content>
+
+        === END CONVERSATION HISTORY ===
+
+        IMPORTANT: You are continuing an existing conversation thread...
+        This is turn X of the conversation - use the conversation history above...
+
+    Cross-Tool Collaboration:
+        This formatted history allows any tool to "see" both conversation context AND
+        file contents from previous tools, enabling seamless handoffs between analyze,
+        codereview, debug, chat, and other tools while maintaining complete context.
+
+    Performance Characteristics:
+        - O(n) file collection with newest-first prioritization
+        - Intelligent token budgeting prevents context window overflow
+        - Redis-based persistence with automatic TTL management
+        - Graceful degradation when files are inaccessible or too large
     """
     # Get the complete thread chain
     if context.parent_thread_id:
@@ -453,19 +610,25 @@ def build_conversation_history(context: ThreadContext, model_context=None, read_
         # Collect all turns from all threads in chain
         all_turns = []
-        all_files_set = set()
         total_turns = 0
 
         for thread in chain:
             all_turns.extend(thread.turns)
             total_turns += len(thread.turns)
 
-            # Collect files from this thread
-            for turn in thread.turns:
-                if turn.files:
-                    all_files_set.update(turn.files)
-
-        all_files = list(all_files_set)
+        # Use centralized file collection logic for consistency across the entire chain
+        # This ensures files from newer turns across ALL threads take precedence
+        # over files from older turns, maintaining the newest-first prioritization
+        # even when threads are chained together
+        temp_context = ThreadContext(
+            thread_id="merged_chain",
+            created_at=context.created_at,
+            last_updated_at=context.last_updated_at,
+            tool_name=context.tool_name,
+            turns=all_turns,  # All turns from entire chain in chronological order
+            initial_context=context.initial_context,
+        )
+        all_files = get_conversation_file_list(temp_context)  # Applies newest-first logic to entire chain
         logger.debug(f"[THREAD] Built history from {len(chain)} threads with {total_turns} total turns")
     else:
         # Single thread, no parent chain
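The merged-chain trick in miniature: concatenating the chain's turns in chronological order and reusing the same backwards walk gives cross-thread newest-first ordering for free (plain lists standing in for turns, reusing the newest_first() sketch shown after get_conversation_file_list above):

    older_thread = [["setup.py"]]
    newer_thread = [["setup.py", "api.py"]]  # re-shares setup.py
    merged = older_thread + newer_thread     # chronological order across the chain
    assert newest_first(merged) == ["setup.py", "api.py"]  # setup.py's newest reference wins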
@@ -511,101 +674,91 @@ def build_conversation_history(context: ThreadContext, model_context=None, read_
         "",
     ]
 
-    # Embed all files referenced in this conversation once at the start
+    # Embed files referenced in this conversation with size-aware selection
     if all_files:
         logger.debug(f"[FILES] Starting embedding for {len(all_files)} files")
-        history_parts.extend(
-            [
-                "=== FILES REFERENCED IN THIS CONVERSATION ===",
-                "The following files have been shared and analyzed during our conversation.",
-                "Refer to these when analyzing the context and requests below:",
-                "",
-            ]
-        )
-
-        if read_files_func is None:
-            from utils.file_utils import read_file_content
-
-            # Optimized: read files incrementally with token tracking
-            file_contents = []
-            total_tokens = 0
-            files_included = 0
-            files_truncated = 0
-
-            for file_path in all_files:
-                try:
-                    logger.debug(f"[FILES] Processing file {file_path}")
-                    # Correctly unpack the tuple returned by read_file_content
-                    formatted_content, content_tokens = read_file_content(file_path)
-                    if formatted_content:
-                        # read_file_content already returns formatted content, use it directly
-                        # Check if adding this file would exceed the limit
-                        if total_tokens + content_tokens <= max_file_tokens:
-                            file_contents.append(formatted_content)
-                            total_tokens += content_tokens
-                            files_included += 1
-                            logger.debug(
-                                f"File embedded in conversation history: {file_path} ({content_tokens:,} tokens)"
-                            )
-                        else:
-                            files_truncated += 1
-                            logger.debug(
-                                f"File truncated due to token limit: {file_path} ({content_tokens:,} tokens, would exceed {max_file_tokens:,} limit)"
-                            )
-                            logger.debug(
-                                f"[FILES] File {file_path} would exceed token limit - skipping (would be {total_tokens + content_tokens:,} tokens)"
-                            )
-                            # Stop processing more files
-                            break
-                    else:
-                        logger.debug(f"File skipped (empty content): {file_path}")
-                        logger.debug(f"[FILES] File {file_path} has empty content - skipping")
-                except Exception as e:
-                    # Skip files that can't be read but log the failure
-                    logger.warning(
-                        f"Failed to embed file in conversation history: {file_path} - {type(e).__name__}: {e}"
-                    )
-                    logger.debug(f"[FILES] Failed to read file {file_path} - {type(e).__name__}: {e}")
-                    continue
-
-            if file_contents:
-                files_content = "".join(file_contents)
-                if files_truncated > 0:
-                    files_content += (
-                        f"\n[NOTE: {files_truncated} additional file(s) were truncated due to token limit]\n"
-                    )
-                history_parts.append(files_content)
-                logger.debug(
-                    f"Conversation history file embedding complete: {files_included} files embedded, {files_truncated} truncated, {total_tokens:,} total tokens"
-                )
-                logger.debug(
-                    f"[FILES] File embedding summary - {files_included} embedded, {files_truncated} truncated, {total_tokens:,} tokens total"
-                )
-            else:
-                history_parts.append("(No accessible files found)")
-                logger.debug(
-                    f"Conversation history file embedding: no accessible files found from {len(all_files)} requested"
-                )
-                logger.debug(f"[FILES] No accessible files found from {len(all_files)} requested files")
-        else:
-            # Fallback to original read_files function for backward compatibility
-            files_content = read_files_func(all_files)
-            if files_content:
-                # Add token validation for the combined file content
-                from utils.token_utils import check_token_limit
-
-                within_limit, estimated_tokens = check_token_limit(files_content)
-                if within_limit:
-                    history_parts.append(files_content)
-                else:
-                    # Handle token limit exceeded for conversation files
-                    error_message = f"ERROR: The total size of files referenced in this conversation has exceeded the context limit and cannot be displayed.\nEstimated tokens: {estimated_tokens}, but limit is {max_file_tokens}."
-                    history_parts.append(error_message)
-            else:
-                history_parts.append("(No accessible files found)")
+        # Plan file inclusion based on size constraints
+        # CRITICAL: all_files is already ordered by newest-first prioritization from get_conversation_file_list()
+        # So when _plan_file_inclusion_by_size() hits token limits, it naturally excludes OLDER files first
+        # while preserving the most recent file references - exactly what we want!
+        files_to_include, files_to_skip, estimated_tokens = _plan_file_inclusion_by_size(all_files, max_file_tokens)
+
+        if files_to_skip:
+            logger.info(f"[FILES] Skipping {len(files_to_skip)} files due to size constraints: {files_to_skip}")
+
+        if files_to_include:
+            history_parts.extend(
+                [
+                    "=== FILES REFERENCED IN THIS CONVERSATION ===",
+                    "The following files have been shared and analyzed during our conversation.",
+                    (
+                        ""
+                        if not files_to_skip
+                        else f"[NOTE: {len(files_to_skip)} files omitted due to size constraints]"
+                    ),
+                    "Refer to these when analyzing the context and requests below:",
+                    "",
+                ]
+            )
+
+            if read_files_func is None:
+                from utils.file_utils import read_file_content
+
+                # Process files for embedding
+                file_contents = []
+                total_tokens = 0
+                files_included = 0
+
+                for file_path in files_to_include:
+                    try:
+                        logger.debug(f"[FILES] Processing file {file_path}")
+                        formatted_content, content_tokens = read_file_content(file_path)
+                        if formatted_content:
+                            file_contents.append(formatted_content)
+                            total_tokens += content_tokens
+                            files_included += 1
+                            logger.debug(
+                                f"[FILES] Successfully embedded {file_path} - {content_tokens:,} tokens (total: {total_tokens:,})"
+                            )
+                        else:
+                            logger.debug(f"File skipped (empty content): {file_path}")
+                    except Exception as e:
+                        logger.warning(
+                            f"Failed to embed file in conversation history: {file_path} - {type(e).__name__}: {e}"
+                        )
+                        continue
+
+                if file_contents:
+                    files_content = "".join(file_contents)
+                    if files_to_skip:
+                        files_content += (
+                            f"\n[NOTE: {len(files_to_skip)} additional file(s) were omitted due to size constraints. "
+                            f"These were older files from earlier conversation turns.]\n"
+                        )
+                    history_parts.append(files_content)
+                    logger.debug(
+                        f"Conversation history file embedding complete: {files_included} files embedded, {len(files_to_skip)} omitted, {total_tokens:,} total tokens"
+                    )
+                else:
+                    history_parts.append("(No accessible files found)")
+                    logger.debug(f"[FILES] No accessible files found from {len(files_to_include)} planned files")
+            else:
+                # Fallback to original read_files function for backward compatibility
+                files_content = read_files_func(all_files)
+                if files_content:
+                    # Add token validation for the combined file content
+                    from utils.token_utils import check_token_limit
+
+                    within_limit, estimated_tokens = check_token_limit(files_content)
+                    if within_limit:
+                        history_parts.append(files_content)
+                    else:
+                        # Handle token limit exceeded for conversation files
+                        error_message = f"ERROR: The total size of files referenced in this conversation has exceeded the context limit and cannot be displayed.\nEstimated tokens: {estimated_tokens}, but limit is {max_file_tokens}."
+                        history_parts.append(error_message)
+                else:
+                    history_parts.append("(No accessible files found)")
 
     history_parts.extend(
         [
@@ -178,3 +178,65 @@ def is_binary_file(file_path: str) -> bool:
     from pathlib import Path
 
     return Path(file_path).suffix.lower() in BINARY_EXTENSIONS
+
+
+# File-type specific token-to-byte ratios for accurate token estimation
+# Based on empirical analysis of file compression characteristics and tokenization patterns
+TOKEN_ESTIMATION_RATIOS = {
+    # Programming languages
+    ".py": 3.5,  # Python - moderate verbosity
+    ".js": 3.2,  # JavaScript - compact syntax
+    ".ts": 3.3,  # TypeScript - type annotations add tokens
+    ".jsx": 3.1,  # React JSX - JSX tags are tokenized efficiently
+    ".tsx": 3.0,  # React TSX - combination of TypeScript + JSX
+    ".java": 3.6,  # Java - verbose syntax, long identifiers
+    ".cpp": 3.7,  # C++ - preprocessor directives, templates
+    ".c": 3.8,  # C - function definitions, struct declarations
+    ".go": 3.9,  # Go - explicit error handling, package names
+    ".rs": 3.5,  # Rust - similar to Python in verbosity
+    ".php": 3.3,  # PHP - mixed HTML/code, variable prefixes
+    ".rb": 3.6,  # Ruby - descriptive method names
+    ".swift": 3.4,  # Swift - modern syntax, type inference
+    ".kt": 3.5,  # Kotlin - similar to modern languages
+    ".scala": 3.2,  # Scala - functional programming, concise
+    # Scripts and configuration
+    ".sh": 4.1,  # Shell scripts - commands and paths
+    ".bat": 4.0,  # Batch files - similar to shell
+    ".ps1": 3.8,  # PowerShell - more structured than bash
+    ".sql": 3.8,  # SQL - keywords and table/column names
+    # Data and configuration formats
+    ".json": 2.5,  # JSON - lots of punctuation and quotes
+    ".yaml": 3.0,  # YAML - structured but readable
+    ".yml": 3.0,  # YAML (alternative extension)
+    ".xml": 2.8,  # XML - tags and attributes
+    ".toml": 3.2,  # TOML - similar to config files
+    # Documentation and text
+    ".md": 4.2,  # Markdown - natural language with formatting
+    ".txt": 4.0,  # Plain text - mostly natural language
+    ".rst": 4.1,  # reStructuredText - documentation format
+    # Web technologies
+    ".html": 2.9,  # HTML - tags and attributes
+    ".css": 3.4,  # CSS - properties and selectors
+    # Logs and data
+    ".log": 4.5,  # Log files - timestamps, messages, stack traces
+    ".csv": 3.1,  # CSV - data with delimiters
+    # Docker and infrastructure
+    ".dockerfile": 3.7,  # Dockerfile - commands and paths
+    ".tf": 3.5,  # Terraform - infrastructure as code
+}
+
+
+def get_token_estimation_ratio(file_path: str) -> float:
+    """
+    Get the token estimation ratio for a file based on its extension.
+
+    Args:
+        file_path: Path to the file
+
+    Returns:
+        Token-to-byte ratio for the file type (default: 3.5 for unknown types)
+    """
+    from pathlib import Path
+
+    extension = Path(file_path).suffix.lower()
+    return TOKEN_ESTIMATION_RATIOS.get(extension, 3.5)  # Conservative default
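The estimate is simply file size in bytes divided by the ratio, as estimate_file_tokens() below implements. A quick back-of-envelope check using values from the table above:

    size_bytes = 7_000              # a 7 KB Python file
    ratio = 3.5                     # ".py" entry in TOKEN_ESTIMATION_RATIOS
    print(int(size_bytes / ratio))  # 2000 estimated tokens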
@@ -18,10 +18,12 @@ Security Model:
 - Symbolic links are resolved to ensure they stay within bounds
 """
 
+import json
 import logging
 import os
+import time
 from pathlib import Path
-from typing import Optional
+from typing import Callable, Optional
 
 from .file_types import BINARY_EXTENSIONS, CODE_EXTENSIONS, IMAGE_EXTENSIONS, TEXT_EXTENSIONS
 from .security_config import CONTAINER_WORKSPACE, EXCLUDED_DIRS, MCP_SIGNATURE_FILES, SECURITY_ROOT, WORKSPACE_ROOT
@@ -689,3 +691,349 @@ def read_files(
     result = "\n\n".join(content_parts) if content_parts else ""
     logger.debug(f"[FILES] read_files complete: {len(result)} chars, {total_tokens:,} tokens used")
     return result
+
+
+def estimate_file_tokens(file_path: str) -> int:
+    """
+    Estimate tokens for a file using file-type aware ratios.
+
+    Args:
+        file_path: Path to the file
+
+    Returns:
+        Estimated token count for the file
+    """
+    try:
+        translated_path = translate_path_for_environment(file_path)
+
+        if not os.path.exists(translated_path) or not os.path.isfile(translated_path):
+            return 0
+
+        file_size = os.path.getsize(translated_path)
+
+        # Get the appropriate ratio for this file type
+        from .file_types import get_token_estimation_ratio
+
+        ratio = get_token_estimation_ratio(file_path)
+
+        return int(file_size / ratio)
+    except Exception:
+        return 0
+
+
+def check_files_size_limit(files: list[str], max_tokens: int, threshold_percent: float = 1.0) -> tuple[bool, int, int]:
+    """
+    Check if a list of files would exceed token limits.
+
+    Args:
+        files: List of file paths to check
+        max_tokens: Maximum allowed tokens
+        threshold_percent: Percentage of max_tokens to use as threshold (0.0-1.0)
+
+    Returns:
+        Tuple of (within_limit, total_estimated_tokens, file_count)
+    """
+    if not files:
+        return True, 0, 0
+
+    total_estimated_tokens = 0
+    file_count = 0
+    threshold = int(max_tokens * threshold_percent)
+
+    for file_path in files:
+        try:
+            estimated_tokens = estimate_file_tokens(file_path)
+            total_estimated_tokens += estimated_tokens
+            if estimated_tokens > 0:  # Only count accessible files
+                file_count += 1
+        except Exception:
+            # Skip files that can't be accessed for size check
+            continue
+
+    within_limit = total_estimated_tokens <= threshold
+    return within_limit, total_estimated_tokens, file_count
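A typical pre-flight call, assuming the listed files exist on disk (the paths here are hypothetical):

    within, est_tokens, count = check_files_size_limit(
        ["src/server.py", "src/utils.py"], max_tokens=100_000, threshold_percent=0.8
    )
    if not within:
        print(f"{count} files need ~{est_tokens:,} tokens - over the 80% threshold")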
+class LogTailer:
+    """
+    General-purpose log file tailer with rotation detection.
+
+    This class provides a reusable way to monitor log files for new content,
+    automatically handling log rotation and maintaining position tracking.
+    """
+
+    def __init__(self, file_path: str, initial_seek_end: bool = True):
+        """
+        Initialize log tailer for a specific file.
+
+        Args:
+            file_path: Path to the log file to monitor
+            initial_seek_end: If True, start monitoring from end of file
+        """
+        self.file_path = file_path
+        self.position = 0
+        self.last_size = 0
+        self.initial_seek_end = initial_seek_end
+
+        # Ensure file exists and initialize position
+        Path(self.file_path).touch()
+        if self.initial_seek_end and os.path.exists(self.file_path):
+            self.last_size = os.path.getsize(self.file_path)
+            self.position = self.last_size
+
+    def read_new_lines(self) -> list[str]:
+        """
+        Read new lines since last call, handling rotation.
+
+        Returns:
+            List of new lines from the file
+        """
+        if not os.path.exists(self.file_path):
+            return []
+
+        try:
+            current_size = os.path.getsize(self.file_path)
+
+            # Check for log rotation (file size decreased)
+            if current_size < self.last_size:
+                self.position = 0
+                self.last_size = current_size
+
+            with open(self.file_path, encoding="utf-8", errors="ignore") as f:
+                f.seek(self.position)
+                new_lines = f.readlines()
+                self.position = f.tell()
+                self.last_size = current_size
+
+            # Strip whitespace from each line
+            return [line.strip() for line in new_lines if line.strip()]
+
+        except OSError:
+            return []
+
+    def monitor_continuously(
+        self,
+        line_handler: Callable[[str], None],
+        check_interval: float = 0.5,
+        stop_condition: Optional[Callable[[], bool]] = None,
+    ):
+        """
+        Monitor file continuously and call handler for each new line.
+
+        Args:
+            line_handler: Function to call for each new line
+            check_interval: Seconds between file checks
+            stop_condition: Optional function that returns True to stop monitoring
+        """
+        while True:
+            try:
+                if stop_condition and stop_condition():
+                    break
+
+                new_lines = self.read_new_lines()
+                for line in new_lines:
+                    line_handler(line)
+
+                time.sleep(check_interval)
+
+            except KeyboardInterrupt:
+                break
+            except Exception as e:
+                logger.warning(f"Error monitoring log file {self.file_path}: {e}")
+                time.sleep(1)
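Typical usage of the tailer (the log path is hypothetical):

    tailer = LogTailer("/tmp/server.log", initial_seek_end=True)
    # ... later, after the log has grown or been rotated:
    for line in tailer.read_new_lines():
        print(line)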
+def read_json_file(file_path: str) -> Optional[dict]:
+    """
+    Read and parse a JSON file with proper error handling.
+
+    Args:
+        file_path: Path to the JSON file
+
+    Returns:
+        Parsed JSON data as dict, or None if file doesn't exist or invalid
+    """
+    try:
+        translated_path = translate_path_for_environment(file_path)
+        if not os.path.exists(translated_path):
+            return None
+
+        with open(translated_path, encoding="utf-8") as f:
+            return json.load(f)
+    except (json.JSONDecodeError, OSError):
+        return None
+
+
+def write_json_file(file_path: str, data: dict, indent: int = 2) -> bool:
+    """
+    Write data to a JSON file with proper formatting.
+
+    Args:
+        file_path: Path to write the JSON file
+        data: Dictionary data to serialize
+        indent: JSON indentation level
+
+    Returns:
+        True if successful, False otherwise
+    """
+    try:
+        translated_path = translate_path_for_environment(file_path)
+        os.makedirs(os.path.dirname(translated_path), exist_ok=True)
+
+        with open(translated_path, "w", encoding="utf-8") as f:
+            json.dump(data, f, indent=indent, ensure_ascii=False)
+        return True
+    except (OSError, TypeError):
+        return False
+
+
+def get_file_size(file_path: str) -> int:
+    """
+    Get file size in bytes with proper error handling.
+
+    Args:
+        file_path: Path to the file
+
+    Returns:
+        File size in bytes, or 0 if file doesn't exist or error
+    """
+    try:
+        translated_path = translate_path_for_environment(file_path)
+        if os.path.exists(translated_path) and os.path.isfile(translated_path):
+            return os.path.getsize(translated_path)
+        return 0
+    except OSError:
+        return 0
+
+
+def ensure_directory_exists(file_path: str) -> bool:
+    """
+    Ensure the parent directory of a file path exists.
+
+    Args:
+        file_path: Path to file (directory will be created for parent)
+
+    Returns:
+        True if directory exists or was created, False on error
+    """
+    try:
+        translated_path = translate_path_for_environment(file_path)
+        directory = os.path.dirname(translated_path)
+        if directory:
+            os.makedirs(directory, exist_ok=True)
+        return True
+    except OSError:
+        return False
+
+
+def is_text_file(file_path: str) -> bool:
+    """
+    Check if a file is likely a text file based on extension and content.
+
+    Args:
+        file_path: Path to the file
+
+    Returns:
+        True if file appears to be text, False otherwise
+    """
+    from .file_types import is_text_file as check_text_type
+
+    return check_text_type(file_path)
+
+
+def read_file_safely(file_path: str, max_size: int = 10 * 1024 * 1024) -> Optional[str]:
+    """
+    Read a file with size limits and encoding handling.
+
+    Args:
+        file_path: Path to the file
+        max_size: Maximum file size in bytes (default 10MB)
+
+    Returns:
+        File content as string, or None if file too large or unreadable
+    """
+    try:
+        translated_path = translate_path_for_environment(file_path)
+        if not os.path.exists(translated_path) or not os.path.isfile(translated_path):
+            return None
+
+        file_size = os.path.getsize(translated_path)
+        if file_size > max_size:
+            return None
+
+        with open(translated_path, encoding="utf-8", errors="ignore") as f:
+            return f.read()
+    except OSError:
+        return None
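The JSON helpers round-trip cleanly; a quick sketch (the path is hypothetical):

    ok = write_json_file("/tmp/zen_state.json", {"threads": 3})
    state = read_json_file("/tmp/zen_state.json")
    assert ok and state == {"threads": 3}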
+def check_total_file_size(files: list[str], model_name: Optional[str] = None) -> Optional[dict]:
+    """
+    Check if total file sizes would exceed token threshold before embedding.
+
+    IMPORTANT: This performs STRICT REJECTION at MCP boundary.
+    No partial inclusion - either all files fit or request is rejected.
+    This forces Claude to make better file selection decisions.
+
+    Args:
+        files: List of file paths to check
+        model_name: Model name for context-aware thresholds, or None for default
+
+    Returns:
+        Dict with MCP_CODE_TOO_LARGE response if too large, None if acceptable
+    """
+    if not files:
+        return None
+
+    # Get model-specific token allocation (dynamic thresholds)
+    if not model_name:
+        from config import DEFAULT_MODEL
+
+        model_name = DEFAULT_MODEL
+
+    # Handle auto mode gracefully
+    if model_name.lower() == "auto":
+        from providers.registry import ModelProviderRegistry
+
+        model_name = ModelProviderRegistry.get_preferred_fallback_model()
+
+    from utils.model_context import ModelContext
+
+    model_context = ModelContext(model_name)
+    token_allocation = model_context.calculate_token_allocation()
+
+    # Dynamic threshold based on model capacity
+    context_window = token_allocation.total_tokens
+    if context_window >= 1_000_000:  # Gemini-class models
+        threshold_percent = 0.8  # Can be more generous
+    elif context_window >= 500_000:  # Mid-range models
+        threshold_percent = 0.7  # Moderate
+    else:  # OpenAI-class models (200K)
+        threshold_percent = 0.6  # Conservative
+
+    max_file_tokens = int(token_allocation.file_tokens * threshold_percent)
+
+    # Use centralized file size checking (threshold already applied to max_file_tokens)
+    within_limit, total_estimated_tokens, file_count = check_files_size_limit(files, max_file_tokens)
+
+    if not within_limit:
+        return {
+            "status": "code_too_large",
+            "content": (
+                f"The selected files are too large for analysis "
+                f"(estimated {total_estimated_tokens:,} tokens, limit {max_file_tokens:,}). "
+                f"Please select fewer, more specific files that are most relevant "
+                f"to your question, then invoke the tool again."
+            ),
+            "content_type": "text",
+            "metadata": {
+                "total_estimated_tokens": total_estimated_tokens,
+                "limit": max_file_tokens,
+                "file_count": file_count,
+                "threshold_percent": threshold_percent,
+                "model_context_window": context_window,
+                "instructions": "Reduce file selection and try again - all files must fit within budget",
+            },
+        }
+
+    return None  # Proceed with ALL files
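At the tool boundary the contract is simple: a None result means proceed with all files, anything else is returned to the caller verbatim. A sketch of that guard (hypothetical tool code and file selection):

    def guard_example(requested_files: list[str]):
        rejection = check_total_file_size(requested_files, model_name="auto")
        if rejection is not None:
            return rejection  # "code_too_large" response; Claude should narrow the selection
        # ...otherwise embed all requested_files and continue
        return None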