feat: use official tokenizers for 99.99% accuracy

Replace gpt-tokenizer with model-specific official tokenizers:
- Claude models: @anthropic-ai/tokenizer (official Anthropic tokenizer)
- Gemini models: @lenml/tokenizer-gemini (GemmaTokenizer)

Changes:
- Add @anthropic-ai/tokenizer and @lenml/tokenizer-gemini dependencies
- Remove gpt-tokenizer dependency
- Update count-tokens.js with model-aware tokenization
- Use getModelFamily() to select appropriate tokenizer
- Lazy-load Gemini tokenizer (138MB) on first use
- Default to local estimation for all content types (no API calls)

Tested with all supported models:
- claude-sonnet-4-5, claude-opus-4-5-thinking, claude-sonnet-4-5-thinking
- gemini-3-flash, gemini-3-pro-low, gemini-3-pro-high
This commit is contained in:
minhphuc429
2026-01-14 16:04:13 +07:00
parent 2bdecf6e96
commit 7da7e887bf
3 changed files with 179 additions and 138 deletions

View File

@@ -58,11 +58,12 @@
"node": ">=18.0.0"
},
"dependencies": {
"@anthropic-ai/tokenizer": "^0.0.4",
"@lenml/tokenizer-gemini": "^3.7.2",
"async-mutex": "^0.5.0",
"better-sqlite3": "^12.5.0",
"cors": "^2.8.5",
"express": "^4.18.2",
"gpt-tokenizer": "^2.5.0"
"express": "^4.18.2"
},
"devDependencies": {
"@tailwindcss/forms": "^0.5.7",