feat: use official tokenizers for 99.99% accuracy

Replace gpt-tokenizer with model-specific official tokenizers: - Claude models: @anthropic-ai/tokenizer (official Anthropic tokenizer) - Gemini models: @lenml/tokenizer-gemini (GemmaTokenizer) Changes: - Add @anthropic-ai/tokenizer and @lenml/tokenizer-gemini dependencies - Remove gpt-tokenizer dependency - Update count-tokens.js with model-aware tokenization - Use getModelFamily() to select appropriate tokenizer - Lazy-load Gemini tokenizer (138MB) on first use - Default to local estimation for all content types (no API calls) Tested with all supported models: - claude-sonnet-4-5, claude-opus-4-5-thinking, claude-sonnet-4-5-thinking - gemini-3-flash, gemini-3-pro-low, gemini-3-pro-high
2026-01-14 16:04:13 +07:00
parent 2bdecf6e96
commit 7da7e887bf
3 changed files with 179 additions and 138 deletions
--- a/package.json
+++ b/package.json
@@ -58,11 +58,12 @@
    "node": ">=18.0.0"
  },
  "dependencies": {
+    "@anthropic-ai/tokenizer": "^0.0.4",
+    "@lenml/tokenizer-gemini": "^3.7.2",
    "async-mutex": "^0.5.0",
    "better-sqlite3": "^12.5.0",
    "cors": "^2.8.5",
-    "express": "^4.18.2",
-    "gpt-tokenizer": "^2.5.0"
+    "express": "^4.18.2"
  },
  "devDependencies": {
    "@tailwindcss/forms": "^0.5.7",