This is a rough estimate based on an average development setup without a GPU:
Current Stats:
- 11 active channels
- 1,389 messages over 10 days of telegram_date (Nov 23 - Dec 3)
- Processing rate: ~130-176 msg/hour (average ~135 msg/hr)
- Spam rate: 7.5% (only 7.5% filtered as spam, 92.5% kept)
- Average: 138 messages/day across all 11 channels = 12.5 msg/channel/day
Scaling to 200 channels from Feb 2022:
- February 2022 to December 2025 = ~46 months = ~1,400 days
- 200 channels
- Messages per channel per day: 12.5
- Total messages: 200 × 12.5 × 1,400 = 3,500,000 messages (before spam filter)
- After 7.5% spam filter: ~3,237,500 messages need LLM processing
Processing Time Calculation:
- Current rate: ~135 messages/hour (this is with 1 processor worker)
- Time at current rate: 3,237,500 / 135 ≈ 23,981 hours ≈ 999 days ≈ 2.7 years
But wait - the processing rate includes the LLM bottleneck of 22-26s per message.
Let me recalculate with the LLM being the constraint:
- LLM takes ~24s per message (midpoint of 22-26s)
- 1 processor can do: 3600/24 = 150 messages/hour
- That’s close to the observed ~135/hr (the gap is pipeline overhead)
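A minimal Python sketch of this arithmetic, with the figures above (12.5 msg/channel/day, ~1,400 days, 7.5% spam, ~135 msg/hr per worker) hard-coded as assumptions:

```python
# Back-of-envelope backfill estimate; all constants are assumptions taken from
# the observed dev-setup averages quoted above.
CHANNELS = 200
MSGS_PER_CHANNEL_PER_DAY = 12.5     # observed average across the 11 test channels
DAYS = 1400                         # Feb 2022 -> Dec 2025
SPAM_RATE = 0.075                   # share filtered out before LLM processing
MSGS_PER_HOUR_PER_WORKER = 135      # observed single-worker throughput (CPU LLM)

raw = CHANNELS * MSGS_PER_CHANNEL_PER_DAY * DAYS
to_process = raw * (1 - SPAM_RATE)
hours = to_process / MSGS_PER_HOUR_PER_WORKER

print(f"raw messages:      {raw:,.0f}")         # ~3,500,000
print(f"after spam filter: {to_process:,.0f}")  # ~3,237,500
print(f"one worker:        {hours:,.0f} h = {hours / 24:,.0f} days")  # ~999 days
```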
Scaling Options:
- With 1 processor worker: ~999 days (~2.7 years)
- With 4 processor workers: 999 / 4 ≈ 250 days ≈ 8 months
- With 8 processor workers: 999 / 8 ≈ 125 days ≈ 4 months
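A quick check of these worker numbers in Python; this assumes near-linear scaling, i.e. the Ollama instance can actually serve the extra workers concurrently:

```python
# Divide the single-worker estimate across N processor workers.
# Assumes near-linear scaling, which only holds if the LLM backend
# keeps up with concurrent requests.
MESSAGES = 3_237_500                # post-spam-filter estimate from above
MSGS_PER_HOUR_PER_WORKER = 135      # observed single-worker throughput

for workers in (1, 4, 8):
    days = MESSAGES / (MSGS_PER_HOUR_PER_WORKER * workers) / 24
    print(f"{workers:>2} worker(s): ~{days:,.0f} days")
```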
With 8 workers, the processing time drops significantly. However, Telegram’s rate limits add another constraint. The GetHistory API allows roughly 300 requests per minute, and each request returns about 100 messages. Fetching ~3.5M messages for 200 channels over ~1,400 days therefore takes about 35,000 requests, or roughly two hours at 300 requests per minute. In other words, Telegram retrieval is quick compared to the LLM processing bottleneck.
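A quick check of that fetch-time claim, treating the ~300 requests/minute budget and ~100 messages/request page size as given:

```python
# Estimated Telegram history fetch time under the assumed GetHistory limits.
MESSAGES_RAW = 3_500_000        # everything must be fetched, spam included
MSGS_PER_REQUEST = 100          # assumed GetHistory page size
REQUESTS_PER_MINUTE = 300       # assumed rate-limit budget

requests = MESSAGES_RAW / MSGS_PER_REQUEST
minutes = requests / REQUESTS_PER_MINUTE
print(f"~{requests:,.0f} requests -> ~{minutes:,.0f} min (~{minutes / 60:.1f} h)")
```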
● Excellent data! Now let me do the math:
📊 Backfill Scaling Calculation: 200 Channels from Feb 2022
★ Insight ───────────────────────────────────── This calculation reveals why LLM processing is the true bottleneck. Telegram fetching is fast (~300 requests/min), but the qwen2.5:3b model running on CPU takes ~24s per message for classification. This is the constraint that dominates scaling. ─────────────────────────────────────────────────
Current Baseline (from actual data):
| Metric | Value |
|---|---|
| Active Channels | 11 |
| Messages (10 days) | 1,389 |
| Avg msgs/channel/day | 12.6 |
| Processing Rate | ~135 msg/hr |
| Spam Filter Rate | 7.5% |
| LLM Time per Message | ~24 seconds |
Scaled Scenario: 200 Channels, Feb 2022 → Dec 2025
Timeline: February 2022 → December 2025 = ~46 months ≈ 1,400 days
Total Messages Estimate: 200 channels × 12.6 msg/channel/day × 1,400 days = 3,528,000 messages (raw)
After Spam Filter (7.5% removed): ≈ 3,263,400 messages need LLM processing
Processing Time Estimates:
| Scenario | Processor Workers | Time to Backfill |
|---|---|---|
| Current | 1 worker | 24,173 hours = 1,007 days = ~2.8 years |
| Scaled | 2 workers | 504 days = ~1.4 years |
| Scaled | 4 workers | 252 days = ~8 months |
| Practical | 8 workers | 126 days = ~4 months |
| Maximum | 16 workers | 63 days = 9 weeks |
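The table can be reproduced, and re-run under other assumptions such as a shorter history window or a GPU speedup factor, with a small helper. The `backfill_days` function and its defaults below are just the figures from this page, not project code:

```python
# Reproduce the scaling table from the baseline metrics; tweak the keyword
# arguments to explore other scenarios (fewer days of history, GPU speedup, ...).
def backfill_days(channels=200, msgs_per_channel_day=12.6, days_of_history=1400,
                  spam_rate=0.075, msgs_per_hour_per_worker=135,
                  workers=1, llm_speedup=1.0):
    to_process = channels * msgs_per_channel_day * days_of_history * (1 - spam_rate)
    hours = to_process / (msgs_per_hour_per_worker * workers * llm_speedup)
    return hours / 24

for w in (1, 2, 4, 8, 16):
    print(f"{w:>2} workers: ~{backfill_days(workers=w):,.0f} days")

# 8 workers plus a hypothetical 5x GPU speedup:
print(f"8 workers + GPU (5x): ~{backfill_days(workers=8, llm_speedup=5):,.0f} days")
```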
Bottleneck Analysis:
- Telegram Fetching: ~2 hours (NOT a bottleneck)
  - ~35,000 API requests at 300/min
- LLM Classification: THE BOTTLENECK
  - 24 seconds × ~3.26M messages ≈ 78M seconds
  - ≈ 906 days with 1 worker at the ideal 150 msg/hr (~1,007 days at the observed 135 msg/hr)
- Database/Media: Fast (parallel, async)
Realistic Recommendation:
With 8 processor workers (reasonable for your server): → ~4 months to complete the full backfill
With GPU-accelerated LLM (RTX 3060 or better): → Could cut time by 5-10x → 2-4 weeks
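As a rough illustration of what "N processor workers" buys, here is a hypothetical asyncio worker-pool sketch; the `classify_message` stub and queue wiring are placeholders rather than the project's actual worker implementation, and the near-linear speedup above only materialises if Ollama is configured to serve that many requests in parallel (or runs on a GPU):

```python
# Hypothetical worker-pool sketch: N workers drain a shared backlog concurrently.
# classify_message() stands in for the ~24 s LLM call; real scaling also needs
# the Ollama side to accept concurrent requests.
import asyncio

WORKERS = 8

async def classify_message(msg: str) -> str:
    await asyncio.sleep(0.01)          # placeholder for the LLM round-trip
    return "kept"

async def worker(queue: asyncio.Queue) -> None:
    while True:
        msg = await queue.get()
        try:
            await classify_message(msg)
        finally:
            queue.task_done()

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    for i in range(100):               # placeholder backlog
        queue.put_nowait(f"message {i}")
    tasks = [asyncio.create_task(worker(queue)) for _ in range(WORKERS)]
    await queue.join()                 # wait for the backlog to drain
    for t in tasks:
        t.cancel()

asyncio.run(main())
```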
★ Insight ───────────────────────────────────── Key optimizations to consider:
- Scale workers: docker-compose up -d --scale processor-worker=8
- GPU LLM: Switch Ollama to GPU mode (if hardware available) → 5-10x speedup
- Skip old content: Many Feb 2022 messages may be less relevant - consider backfilling only the last 6-12 months for most channels
- Tiered approach: Backfill high-priority channels fully, others only recent history (see the sketch below)
─────────────────────────────────────────────────
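A minimal sketch of the tiered idea, assuming a hypothetical `backfill_channel(channel, since)` helper and made-up channel names:

```python
# Hypothetical tiered backfill: full history for priority channels,
# only the last 12 months for everyone else. Channel names and the
# backfill_channel() helper are illustrative, not part of the project.
from datetime import datetime, timedelta, timezone

FULL_HISTORY_START = datetime(2022, 2, 1, tzinfo=timezone.utc)
RECENT_WINDOW = timedelta(days=365)
PRIORITY_CHANNELS = {"channel_a", "channel_b"}

def backfill_channel(channel: str, since: datetime) -> None:
    print(f"backfilling {channel} since {since:%Y-%m-%d}")

for channel in ("channel_a", "channel_b", "channel_c"):
    if channel in PRIORITY_CHANNELS:
        since = FULL_HISTORY_START
    else:
        since = datetime.now(timezone.utc) - RECENT_WINDOW
    backfill_channel(channel, since)
```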