LLM usage quotas (rate limits, monthly spending caps, token limits) cause n8n workflows to fail with 429 or quota-exceeded errors. Prevent this by implementing request rate limiting with the Wait node, tracking token usage in a database, setting up spending alerts before hitting hard limits, and using fallback models when your primary provider is throttled.
Why LLM Quota Management Matters for n8n Workflows
Every LLM provider enforces usage limits: OpenAI has requests-per-minute (RPM) and tokens-per-minute (TPM) limits, Anthropic has requests-per-minute limits per model, and most providers have monthly spending caps. When your n8n workflow exceeds these limits, API calls fail with 429 (Too Many Requests) or quota-exceeded errors, breaking your automation. This tutorial shows how to build a comprehensive quota management system that tracks usage, enforces rate limits within your workflow, alerts before hitting caps, and gracefully falls back to alternative models when limits are reached.
Prerequisites
- A running n8n instance (self-hosted or cloud) on version 1.30 or later
- LLM API credentials (OpenAI, Anthropic, or Mistral)
- A PostgreSQL database for usage tracking
- An email or Slack credential for alerts
- Understanding of your LLM provider's rate limits and pricing
Step-by-step guide
Understand your provider's quota structure
Before implementing quota management, document your exact limits. OpenAI enforces RPM (requests per minute), TPM (tokens per minute), and RPD (requests per day) limits that vary by tier and model. Anthropic enforces requests-per-minute limits per model. Mistral enforces requests-per-minute and tokens-per-minute limits. Check your provider's dashboard for your current tier limits. Also note your monthly spending cap: most providers let you set one, and hitting it blocks all API calls immediately.
```javascript
// Common LLM provider limits (Tier 1 / default)
// OpenAI GPT-4o: 500 RPM, 30,000 TPM, $100/mo default cap
// OpenAI GPT-4o-mini: 500 RPM, 200,000 TPM
// Anthropic Claude 3.5 Sonnet: 50 RPM, 40,000 TPM
// Mistral Large: 30 RPM, varies by plan

// Check your actual limits:
// OpenAI: platform.openai.com → Settings → Limits
// Anthropic: console.anthropic.com → Settings → Limits
// Mistral: console.mistral.ai → Billing
```
Expected result: You have documented your exact RPM, TPM, and monthly spending limits for each LLM provider you use.
Implement request rate limiting with Wait node
The simplest way to avoid hitting RPM limits is to add a Wait node before your LLM call that enforces a minimum delay between requests. Calculate the delay as 60000 / RPM (e.g., for 50 RPM, wait 1200ms between requests). For batch processing with the SplitInBatches node, this is critical — without throttling, a batch of 100 items will fire 100 API calls simultaneously, instantly hitting RPM limits.
```javascript
// Rate limiting strategy using Code node + Wait node

// Code node before LLM: calculate required delay
const staticData = $getWorkflowStaticData('global');
const RPM_LIMIT = 50; // Your provider's RPM limit
const MIN_DELAY_MS = Math.ceil(60000 / RPM_LIMIT); // 1200ms for 50 RPM

const now = Date.now();
const lastCallTime = staticData.lastLlmCallTime || 0;
const elapsed = now - lastCallTime;
const waitTime = Math.max(0, MIN_DELAY_MS - elapsed);

staticData.lastLlmCallTime = now + waitTime;

return [{
  json: {
    ...($input.first().json),
    _waitMs: waitTime
  }
}];

// Wait node after Code node:
// Wait Amount: {{ $json._waitMs }}
// Unit: Milliseconds
```
Expected result: Requests to the LLM are spaced at least MIN_DELAY_MS apart, preventing RPM limit violations.
Track token usage in a database
Create a PostgreSQL table to track token usage per provider, per day. After every LLM call, log the token counts from the API response. A scheduled aggregation query can then compare daily usage against your limits and trigger alerts. This also provides a historical record for cost analysis and budgeting.
```sql
-- PostgreSQL: Create usage tracking table
CREATE TABLE IF NOT EXISTS llm_usage_tracking (
  id SERIAL PRIMARY KEY,
  provider VARCHAR(50) NOT NULL,
  model VARCHAR(100),
  prompt_tokens INTEGER DEFAULT 0,
  completion_tokens INTEGER DEFAULT 0,
  total_tokens INTEGER DEFAULT 0,
  estimated_cost_usd NUMERIC(10, 6),
  workflow_id VARCHAR(255),
  created_at TIMESTAMP DEFAULT NOW()
);

CREATE INDEX idx_usage_provider_date
  ON llm_usage_tracking(provider, created_at);
```
Expected result: The llm_usage_tracking table exists and is ready to receive token usage data after every LLM call.
Log usage after every LLM call
Add a Code node after each LLM node that extracts token usage from the API response and calculates the estimated cost. Different providers return usage data in different formats — the Code node normalizes them. Connect the Code node output to a Postgres node that inserts the usage record. Enable 'Continue On Fail' on the Postgres node so a database error does not block the main workflow.
```javascript
// Code node — JavaScript
// Extract and normalize token usage from LLM response

const response = $input.first().json;

// Provider-specific token extraction
let provider = 'unknown';
let model = 'unknown';
let promptTokens = 0;
let completionTokens = 0;

if (response.usage?.prompt_tokens !== undefined) {
  // OpenAI / Mistral format
  provider = response.model?.includes('gpt') ? 'openai' : 'mistral';
  model = response.model || 'unknown';
  promptTokens = response.usage.prompt_tokens;
  completionTokens = response.usage.completion_tokens;
} else if (response.usage?.input_tokens !== undefined) {
  // Anthropic format
  provider = 'anthropic';
  model = response.model || 'unknown';
  promptTokens = response.usage.input_tokens;
  completionTokens = response.usage.output_tokens;
}

// Cost estimation (approximate, per 1M tokens)
const COSTS = {
  'gpt-4o': { input: 2.50, output: 10.00 },
  'gpt-4o-mini': { input: 0.15, output: 0.60 },
  'claude-3-5-sonnet-20241022': { input: 3.00, output: 15.00 },
  'mistral-large-latest': { input: 2.00, output: 6.00 }
};

const pricing = COSTS[model] || { input: 5.00, output: 15.00 };
const estimatedCost = (promptTokens * pricing.input + completionTokens * pricing.output) / 1000000;

return [{
  json: {
    provider,
    model,
    prompt_tokens: promptTokens,
    completion_tokens: completionTokens,
    total_tokens: promptTokens + completionTokens,
    estimated_cost_usd: estimatedCost,
    workflow_id: $workflow.id
  }
}];
```
Expected result: Every LLM call's token usage and estimated cost are logged to PostgreSQL for tracking and alerting.
Set up spending alerts
Create a scheduled workflow that runs every 6 hours. It queries the usage table for the current month's total spending per provider and compares it against your spending cap. If spending exceeds 80% of the cap, it sends a warning. If spending exceeds 95%, it sends a critical alert. This gives you time to react before hitting the hard limit and having all API calls fail.
```sql
-- Postgres node query: Monthly spending check
SELECT
  provider,
  SUM(total_tokens) AS monthly_tokens,
  SUM(estimated_cost_usd) AS monthly_cost_usd,
  COUNT(*) AS total_calls
FROM llm_usage_tracking
WHERE created_at >= date_trunc('month', NOW())
GROUP BY provider;

-- Code node after query: Check against limits
-- const SPENDING_CAPS = {
--   openai: 100,    // $100/month
--   anthropic: 50,  // $50/month
--   mistral: 30     // $30/month
-- };
-- const WARNING_THRESHOLD = 0.80; // 80%
-- const CRITICAL_THRESHOLD = 0.95; // 95%
```
Expected result: Spending alerts fire at 80% and 95% of your monthly cap, giving you time to reduce usage or increase limits.
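The threshold check hinted at in the commented lines can be sketched as a small pure function. This is a minimal illustration, assuming the Postgres node returns rows shaped like the query output (provider plus monthly_cost_usd); the cap values are placeholders, not real account limits.

```javascript
// Illustrative caps (USD per month) — replace with your real limits
const SPENDING_CAPS = { openai: 100, anthropic: 50, mistral: 30 };
const WARNING_THRESHOLD = 0.80;  // warn at 80% of cap
const CRITICAL_THRESHOLD = 0.95; // critical at 95% of cap

// Classify each provider's month-to-date spend against its cap
function classifySpend(rows) {
  return rows.map(({ provider, monthly_cost_usd }) => {
    const cap = SPENDING_CAPS[provider];
    if (!cap) return { provider, level: 'unknown' };
    const ratio = monthly_cost_usd / cap;
    let level = 'ok';
    if (ratio >= CRITICAL_THRESHOLD) level = 'critical';
    else if (ratio >= WARNING_THRESHOLD) level = 'warning';
    return { provider, ratio, level };
  });
}
```

Feed the classified rows into an IF or Switch node so that only `warning` and `critical` levels reach your email or Slack node.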
Implement fallback models for quota exhaustion
When your primary model hits a rate limit (429 error), automatically fall back to a cheaper or less-limited model instead of failing the workflow. Use the HTTP Request node's 'Retry On Fail' combined with an error handler that switches models. For example, fall back from GPT-4o to GPT-4o-mini, or from Claude 3.5 Sonnet to Claude 3.5 Haiku. The fallback provides a degraded but functional experience while your rate limit resets.
```javascript
// Code node — JavaScript
// Fallback model selection after rate limit error

const FALLBACK_CHAIN = [
  { provider: 'openai', model: 'gpt-4o', priority: 1 },
  { provider: 'openai', model: 'gpt-4o-mini', priority: 2 },
  { provider: 'anthropic', model: 'claude-3-5-haiku-20241022', priority: 3 }
];

const staticData = $getWorkflowStaticData('global');
const rateLimitedModels = staticData.rateLimitedModels || {};
staticData.rateLimitedModels = rateLimitedModels; // persist the map across runs

const now = Date.now();

// Clean expired rate limit flags (reset after 60 seconds)
for (const [model, timestamp] of Object.entries(rateLimitedModels)) {
  if (now - timestamp > 60000) {
    delete rateLimitedModels[model];
  }
}

// Find first available model
const available = FALLBACK_CHAIN.find(
  m => !rateLimitedModels[m.model]
);

if (!available) {
  // All models rate limited — wait and retry primary
  return [{ json: { _waitMs: 30000, model: FALLBACK_CHAIN[0].model, isFallback: false } }];
}

return [{
  json: {
    model: available.model,
    provider: available.provider,
    isFallback: available.priority > 1,
    fallbackLevel: available.priority
  }
}];
```
Expected result: When the primary model is rate-limited, the workflow automatically switches to a fallback model without failing.
Complete working example
```javascript
// Complete Code node: LLM quota manager
// Place before LLM node to enforce rate limits and select models

const staticData = $getWorkflowStaticData('global');
const now = Date.now();

// Initialize tracking
if (!staticData.quotaManager) {
  staticData.quotaManager = {
    requestLog: [],   // Timestamps of recent requests
    rateLimited: {},  // Model → timestamp of its last 429
    dailyTokens: {},  // Provider → token count today
    dailyReset: now   // When to reset daily counters
  };
}

const qm = staticData.quotaManager;

// Reset daily counters every 24 hours
if (now - qm.dailyReset > 86400000) {
  qm.dailyTokens = {};
  qm.dailyReset = now;
}

// Clean old request log (keep last 60 seconds)
qm.requestLog = qm.requestLog.filter(ts => now - ts < 60000);

// Configuration
const RPM_LIMIT = 45; // Stay 10% under actual limit
const DAILY_TOKEN_LIMIT = 1000000;
const FALLBACKS = [
  { provider: 'openai', model: 'gpt-4o' },
  { provider: 'openai', model: 'gpt-4o-mini' },
  { provider: 'anthropic', model: 'claude-3-5-haiku-20241022' }
];

// Check RPM
const currentRPM = qm.requestLog.length;
let waitMs = 0;

if (currentRPM >= RPM_LIMIT) {
  // Wait until oldest request falls out of the window
  const oldestInWindow = Math.min(...qm.requestLog);
  waitMs = 60000 - (now - oldestInWindow) + 100; // +100ms buffer
}

// Select model (skip any model whose last 429 was under 60s ago)
let selectedModel = FALLBACKS[0];
for (const model of FALLBACKS) {
  const lastRateLimited = qm.rateLimited[model.model] || 0;
  if (now - lastRateLimited > 60000) {
    selectedModel = model;
    break;
  }
}

// Record this request
qm.requestLog.push(now + waitMs);

// Check daily token limit
const dailyUsed = qm.dailyTokens[selectedModel.provider] || 0;
const quotaRemaining = DAILY_TOKEN_LIMIT - dailyUsed;

return [{
  json: {
    ...$input.first().json,
    _selectedModel: selectedModel.model,
    _selectedProvider: selectedModel.provider,
    _waitMs: waitMs,
    _currentRPM: currentRPM,
    _dailyTokensUsed: dailyUsed,
    _quotaRemaining: quotaRemaining,
    _isFallback: selectedModel !== FALLBACKS[0]
  }
}];
```
Common mistakes when managing LLM usage quotas in n8n
- Mistake: sending batch items to the LLM in parallel without rate limiting, which instantly hits RPM caps. How to avoid: use SplitInBatches with a batch size of 1 and a Wait node between batches to enforce RPM spacing.
- Mistake: tracking only request count and not token usage, which misses TPM limits. How to avoid: track both requests per minute AND tokens per minute; some providers enforce both independently.
- Mistake: not setting a spending cap at the provider level, risking unlimited charges from a bug or an infinite loop. How to avoid: set monthly spending limits in your OpenAI/Anthropic/Mistral dashboard immediately.
- Mistake: using the same model for all tasks regardless of complexity, wasting quota on simple tasks. How to avoid: route simple tasks (classification, yes/no) to cheap models (GPT-4o-mini) and complex tasks to powerful models.
- Mistake: retrying 429 errors immediately without backing off, which makes the rate limit situation worse. How to avoid: use exponential backoff (wait 1s, then 2s, then 4s), or switch to a fallback model immediately.
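The exponential backoff pattern from the last point can be sketched as a small retry wrapper. This is a minimal sketch: callLlm is a placeholder for your actual API call, assumed here to throw an error carrying a numeric status property on 429 responses.

```javascript
// Retry a call with exponential backoff on 429 errors.
// Delays grow as baseMs, 2*baseMs, 4*baseMs, ... (1s, 2s, 4s by default).
async function withBackoff(callLlm, maxRetries = 3, baseMs = 1000) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await callLlm();
    } catch (err) {
      // Only retry rate limit errors, and only up to maxRetries times
      if (err.status !== 429 || attempt >= maxRetries) throw err;
      const delayMs = baseMs * 2 ** attempt;
      await new Promise(resolve => setTimeout(resolve, delayMs));
    }
  }
}
```

In n8n itself, the HTTP Request node's built-in Retry On Fail covers the simple case; a wrapper like this is only needed inside a Code node that makes its own API calls.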
Best practices
- Set your n8n rate limit to 90% of your actual provider limit to leave headroom for manual API usage
- Track token usage in a database, not just static data, for historical analysis and cost reporting
- Set spending alerts at 80% of your monthly cap to give time for action
- Always set provider-side spending caps as a safety net in addition to n8n-side tracking
- Use fallback model chains so workflows degrade gracefully instead of failing completely
- Space batch processing requests with Wait nodes to stay within RPM limits
- Monitor daily and monthly token usage trends to right-size your provider tier
- Use cheaper models (GPT-4o-mini, Claude 3.5 Haiku) for low-complexity tasks to stretch your quota
Still stuck?
Copy one of these prompts to get a personalized, step-by-step explanation.
My n8n workflow processes hundreds of items through OpenAI and keeps hitting 429 rate limit errors. How do I implement rate limiting, token tracking, and fallback models in n8n to prevent quota issues?
I need to add rate limiting to my n8n workflow that calls the OpenAI API. How do I use the Wait node and Code node to enforce RPM limits, and how do I track token usage in PostgreSQL?
Frequently asked questions
What happens when I hit my OpenAI spending cap?
All API calls immediately fail with a 429 error and a message about exceeding your billing limit. No requests go through until the next billing cycle or until you increase the cap in Settings → Limits on platform.openai.com.
Can I increase my rate limits without paying more?
Yes, OpenAI automatically increases rate limits as your account ages and payment history builds. You can also request a limit increase through their support portal. Anthropic and Mistral have similar tier systems.
How do I handle rate limits when multiple n8n workflows share the same API key?
Use a shared rate limiting mechanism: either a centralized Redis counter that all workflows check before making calls, or a shared PostgreSQL table that tracks requests per minute across all workflows.
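The decision half of such a shared limiter can be sketched as a pure function. The recentTimestamps array stands in for whatever your shared store returns (a Redis sorted set or a SELECT against the llm_usage_tracking table); the names are illustrative.

```javascript
// Decide whether a new request fits in the shared 60-second window.
// recentTimestamps: ms timestamps of recent calls from ALL workflows,
// read from the shared store before each call.
function canMakeRequest(recentTimestamps, nowMs, rpmLimit) {
  const windowStart = nowMs - 60000;
  const inWindow = recentTimestamps.filter(ts => ts > windowStart);
  return inWindow.length < rpmLimit;
}
```

Each workflow reads the shared timestamps, calls this check, and either proceeds (recording its own timestamp back to the store) or waits and retries.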
Is it cheaper to use one large prompt or multiple small prompts?
One large prompt is usually cheaper because you pay input token costs only once. Multiple small prompts repeat the system prompt and context in each call. However, one large prompt may hit token limits — balance cost against reliability.
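A back-of-envelope calculation makes the difference concrete. The token counts and the per-token price below are illustrative assumptions, not real measurements.

```javascript
// Compare input cost: one combined prompt vs. N small prompts that
// each repeat the same system prompt. Price is an assumed example rate.
const PRICE_PER_M_INPUT = 2.50; // USD per 1M input tokens (illustrative)

function inputCostUsd(tokens) {
  return (tokens * PRICE_PER_M_INPUT) / 1000000;
}

const systemPromptTokens = 800; // shared instructions + context
const itemTokens = 200;         // per-item payload
const items = 20;

// One large prompt: the system prompt is paid for once
const oneCall = inputCostUsd(systemPromptTokens + items * itemTokens);

// Twenty small prompts: the system prompt is paid for every call
const manyCalls = inputCostUsd(items * (systemPromptTokens + itemTokens));
```

With these numbers the batched call uses 4,800 input tokens while the twenty separate calls use 20,000, so the repeated system prompt roughly quadruples the input cost.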
How accurate is the cost estimation in the tracking Code node?
The estimation uses published per-token pricing and is accurate for standard usage. It does not account for cached input tokens (OpenAI discount), batch API pricing, or promotional credits. Treat it as an approximation and verify against your provider's billing dashboard monthly.
Can RapidDev help optimize LLM costs for high-volume n8n workflows?
Yes, RapidDev specializes in optimizing LLM costs for n8n workflows. Their team implements rate limiting, caching, model routing (expensive models for complex tasks, cheap models for simple ones), and token budget management. Clients typically see 40-60% cost reductions after optimization.