While building an AI application to analyze research papers, I ran into Groq's free-tier API token limits. That challenge prompted me to build a multi-model LLM routing system, which ultimately made the application faster and more accurate. Initially, I had no intention of creating such a complex system; my goal was simply to summarize a 40-page research paper without incurring API costs.
My side project, Papers.ai, was born out of the frustration with academic literature reviews. Opening a dense, 30-page paper often means spending 20 minutes just to determine its relevance. I aimed to streamline this process, and crucially, to do it for free.
The Setup and Initial Roadblocks
The initial tech stack was straightforward: a React frontend, a Node.js backend, Firebase for authentication and storage, and Groq as the LLM provider. I chose Groq for its exceptional speed (genuinely, shockingly fast compared to most LLM APIs) and because its free tier seemed generous enough to build a real application on.
The plan was simple: user uploads a PDF → extract text → send to Groq → get a summary. However, it was far from that simple.
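For reference, the naive version is only a few lines of code. Here's a minimal sketch of that first pipeline (assuming pdf-parse for text extraction and the official groq-sdk client, with error handling omitted):

import Groq from 'groq-sdk';
import pdf from 'pdf-parse';

const groq = new Groq({ apiKey: process.env.GROQ_API_KEY });

// Naive pipeline: PDF buffer -> raw text -> one big completion request
async function summarize(pdfBuffer) {
  const { text } = await pdf(pdfBuffer);
  const completion = await groq.chat.completions.create({
    model: 'llama3-70b-8192',
    messages: [
      { role: 'system', content: 'Summarize this research paper.' },
      { role: 'user', content: text },
    ],
  });
  return completion.choices[0].message.content;
}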
The first significant hurdle was Groq's per-model-per-minute token limits on its free tier. Summarizing a research paper frequently involves pushing 8,000–15,000 tokens in a single request. Hitting this limit results in a 429 error, and repeated occurrences render the application unusable.
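That failure is at least easy to detect, since the SDK surfaces the HTTP status on the error it throws. A minimal guard (a sketch; I'm assuming groq-sdk follows the OpenAI-style SDK error shape with a status field):

try {
  await summarize(pdfBuffer); // the naive call from the sketch above
} catch (err) {
  if (err?.status === 429) {
    // This model's per-minute token bucket is exhausted; retrying
    // immediately just fails again until the window resets
    console.warn('Groq free-tier rate limit hit');
  } else {
    throw err;
  }
}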
My initial, obvious reaction was to truncate the paper, sending only the first N tokens. This approach "worked" in terms of avoiding the limit, but it was fundamentally flawed. Summaries often missed entire results sections, skipped methodology, or presented confidently incorrect information based solely on the abstract and introduction. Truncation was clearly not a viable solution, necessitating a smarter approach.
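For the record, the hack was about this crude (a sketch; the four-characters-per-token rule is a rough heuristic for English text, not Groq's actual tokenizer, and the 6,000-token budget is arbitrary):

// Keep roughly the first maxTokens worth of text and drop the rest
function truncateToTokens(text, maxTokens = 6000) {
  const maxChars = maxTokens * 4; // ~4 chars per token for English prose
  return text.length <= maxChars ? text : text.slice(0, maxChars);
}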
The Multi-Model Routing Concept
The key insight came from recognizing that Groq offers multiple models, each with its own separate rate limit bucket:
- llama3-8b-8192: smaller and faster, with an 8k context window.
- llama3-70b-8192: larger and smarter, also with an 8k context window.
- mixtral-8x7b-32768: a significantly larger context window, supporting up to 32k tokens.
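That menu maps naturally onto a small config table. A sketch of how it can be represented in code (the tier labels are my own shorthand, not Groq metadata):

const MODELS = {
  'llama3-8b-8192':     { contextWindow: 8192,  tier: 'fast' },
  'llama3-70b-8192':    { contextWindow: 8192,  tier: 'smart' },
  'mixtral-8x7b-32768': { contextWindow: 32768, tier: 'long-context' },
};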
The mixtral model, with its expansive context window, proved crucial. It became clear that different tasks demand different LLM capabilities. A quick keyword extraction, for instance, doesn't require a 70B model, whereas a deep synthesis of methodology across several papers likely does. Therefore, instead of routing every request to a single model and hoping for the best, I designed a simple router to select the most appropriate model based on the actual needs of the task.
How the Routing System Operates
The routing logic is almost embarrassingly simple:
function routeToModel(task, tokenCount) {
  if (tokenCount > 7000) {
    // Only mixtral can handle this much context: the llama3 models top out
    // at 8,192 tokens (prompt + completion), so this threshold leaves
    // headroom for the reply
    return 'mixtral-8x7b-32768';
  }
  if (task === 'summary' || task === 'qa') {
    // These need reasoning ability — use the big model
    return 'llama3-70b-8192';
  }
  if (task === 'extraction' || task === 'keywords') {
    // Structured extraction doesn't need a 70B model
    return 'llama3-8b-8192';
  }
  // Default fallback
  return 'llama3-70b-8192';
}
This function picks a Groq model from two inputs: the task type and an estimated token count. Anything too large for the llama3 models' 8k context windows (above roughly 7,000 input tokens, leaving headroom for the completion) goes to mixtral-8x7b-32768. Reasoning-heavy tasks like summary and qa get the more capable llama3-70b-8192, simple structured tasks like extraction and keywords get the cheaper llama3-8b-8192, and anything unrecognized falls back to llama3-70b-8192. Because each model has its own rate-limit bucket, spreading requests across them kept the app under Groq's free-tier limits, and matching model capability to the task made responses both faster and more accurate than sending everything to one model.
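For completeness, here's roughly how the router sits in the request path. A sketch (estimateTokens reuses the four-chars-per-token heuristic from earlier, and runTask is an illustrative name, not the exact shape of my backend):

import Groq from 'groq-sdk';

const groq = new Groq({ apiKey: process.env.GROQ_API_KEY });

// Rough token estimate; good enough for a routing decision
const estimateTokens = (text) => Math.ceil(text.length / 4);

async function runTask(task, paperText, instruction) {
  const model = routeToModel(task, estimateTokens(paperText));
  const completion = await groq.chat.completions.create({
    model,
    messages: [
      { role: 'system', content: instruction },
      { role: 'user', content: paperText },
    ],
  });
  return completion.choices[0].message.content;
}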