How to Prevent Hitting Usage Quotas with LLM Calls in n8n

What you'll learn

  • How to implement request rate limiting using the Wait node and Code node
  • How to track token usage and spending in a PostgreSQL database
  • How to set up spending alerts that fire before hitting hard limits
  • How to configure fallback models for graceful degradation when quotas are reached
Level: Advanced · Read time: 10 min · Time to implement: 35-45 minutes
Requires: n8n 1.30+, any LLM provider (OpenAI, Anthropic, Mistral), PostgreSQL for tracking, Code node
March 2026 · RapidDev Engineering Team
TL;DR

LLM usage quotas (rate limits, monthly spending caps, token limits) cause n8n workflows to fail with 429 or quota-exceeded errors. Prevent this by implementing request rate limiting with the Wait node, tracking token usage in a database, setting up spending alerts before hitting hard limits, and using fallback models when your primary provider is throttled.

Why LLM Quota Management Matters for n8n Workflows

Every LLM provider enforces usage limits: OpenAI has requests-per-minute (RPM) and tokens-per-minute (TPM) limits, Anthropic has requests-per-minute limits per model, and most providers have monthly spending caps. When your n8n workflow exceeds these limits, API calls fail with 429 (Too Many Requests) or quota-exceeded errors, breaking your automation. This tutorial shows how to build a comprehensive quota management system that tracks usage, enforces rate limits within your workflow, alerts before hitting caps, and gracefully falls back to alternative models when limits are reached.

Prerequisites

  • A running n8n instance (self-hosted or cloud) on version 1.30 or later
  • LLM API credentials (OpenAI, Anthropic, or Mistral)
  • A PostgreSQL database for usage tracking
  • An email or Slack credential for alerts
  • Understanding of your LLM provider's rate limits and pricing

Step-by-step guide

1. Understand your provider's quota structure

Before implementing quota management, document your exact limits. OpenAI has RPM (requests per minute), TPM (tokens per minute), and RPD (requests per day) limits that vary by tier and model. Anthropic enforces requests-per-minute limits per model. Mistral has requests-per-minute and tokens-per-minute limits. Check your provider's dashboard for your current tier limits. Also note your monthly spending cap: most providers let you set one, and once you hit it, every API call fails until you raise the cap or the billing cycle resets.

javascript
// Common LLM provider limits (Tier 1 / default)
// OpenAI GPT-4o: 500 RPM, 30,000 TPM, $100/mo default cap
// OpenAI GPT-4o-mini: 500 RPM, 200,000 TPM
// Anthropic Claude 3.5 Sonnet: 50 RPM, 40,000 TPM
// Mistral Large: 30 RPM, varies by plan

// Check your actual limits:
// OpenAI: platform.openai.com → Settings → Limits
// Anthropic: console.anthropic.com → Settings → Limits
// Mistral: console.mistral.ai → Billing

Expected result: You have documented your exact RPM, TPM, and monthly spending limits for each LLM provider you use.

2. Implement request rate limiting with the Wait node

The simplest way to avoid hitting RPM limits is to add a Wait node before your LLM call that enforces a minimum delay between requests. Calculate the delay as 60000 / RPM (e.g., for 50 RPM, wait 1200ms between requests). For batch processing with the SplitInBatches node this is critical: without throttling, a run of 100 items fires 100 API calls in rapid succession and immediately hits RPM limits.

javascript
// Rate limiting strategy using Code node + Wait node

// Code node before the LLM call: calculate the required delay
const staticData = $getWorkflowStaticData('global');
const RPM_LIMIT = 50; // Your provider's RPM limit
const MIN_DELAY_MS = Math.ceil(60000 / RPM_LIMIT); // 1200ms for 50 RPM

const now = Date.now();
const lastCallTime = staticData.lastLlmCallTime || 0;
const elapsed = now - lastCallTime;
const waitTime = Math.max(0, MIN_DELAY_MS - elapsed);

// Record when this request will actually fire
staticData.lastLlmCallTime = now + waitTime;

return [{
  json: {
    ...$input.first().json,
    _waitMs: waitTime
  }
}];

// Wait node after the Code node (its smallest unit is seconds, so convert):
// Wait Amount: {{ $json._waitMs / 1000 }}
// Unit: Seconds

Expected result: Requests to the LLM are spaced at least MIN_DELAY_MS apart, preventing RPM limit violations.

3. Track token usage in a database

Create a PostgreSQL table to track token usage per provider, per day. After every LLM call, log the token counts from the API response. A scheduled aggregation query can then compare daily usage against your limits and trigger alerts. This also provides a historical record for cost analysis and budgeting.

sql
-- PostgreSQL: Create usage tracking table
CREATE TABLE IF NOT EXISTS llm_usage_tracking (
  id SERIAL PRIMARY KEY,
  provider VARCHAR(50) NOT NULL,
  model VARCHAR(100),
  prompt_tokens INTEGER DEFAULT 0,
  completion_tokens INTEGER DEFAULT 0,
  total_tokens INTEGER DEFAULT 0,
  estimated_cost_usd NUMERIC(10, 6),
  workflow_id VARCHAR(255),
  created_at TIMESTAMP DEFAULT NOW()
);

CREATE INDEX idx_usage_provider_date
  ON llm_usage_tracking(provider, created_at);
Expected result: The llm_usage_tracking table exists and is ready to receive token usage data after every LLM call.

4. Log usage after every LLM call

Add a Code node after each LLM node that extracts token usage from the API response and calculates the estimated cost. Different providers return usage data in different formats — the Code node normalizes them. Connect the Code node output to a Postgres node that inserts the usage record. Enable 'Continue On Fail' on the Postgres node so a database error does not block the main workflow.

javascript
// Code node — JavaScript
// Extract and normalize token usage from the LLM response

const response = $input.first().json;

// Provider-specific token extraction
let provider = 'unknown';
let model = 'unknown';
let promptTokens = 0;
let completionTokens = 0;

if (response.usage?.prompt_tokens !== undefined) {
  // OpenAI / Mistral format
  provider = response.model?.includes('gpt') ? 'openai' : 'mistral';
  model = response.model || 'unknown';
  promptTokens = response.usage.prompt_tokens;
  completionTokens = response.usage.completion_tokens;
} else if (response.usage?.input_tokens !== undefined) {
  // Anthropic format
  provider = 'anthropic';
  model = response.model || 'unknown';
  promptTokens = response.usage.input_tokens;
  completionTokens = response.usage.output_tokens;
}

// Cost estimation (approximate, USD per 1M tokens)
const COSTS = {
  'gpt-4o': { input: 2.50, output: 10.00 },
  'gpt-4o-mini': { input: 0.15, output: 0.60 },
  'claude-3-5-sonnet-20241022': { input: 3.00, output: 15.00 },
  'mistral-large-latest': { input: 2.00, output: 6.00 }
};

const pricing = COSTS[model] || { input: 5.00, output: 15.00 };
const estimatedCost = (promptTokens * pricing.input + completionTokens * pricing.output) / 1000000;

return [{
  json: {
    provider,
    model,
    prompt_tokens: promptTokens,
    completion_tokens: completionTokens,
    total_tokens: promptTokens + completionTokens,
    estimated_cost_usd: estimatedCost,
    workflow_id: $workflow.id
  }
}];

Expected result: Every LLM call's token usage and estimated cost are logged to PostgreSQL for tracking and alerting.

5. Set up spending alerts

Create a scheduled workflow that runs every 6 hours. It queries the usage table for the current month's total spending per provider and compares it against your spending cap. If spending exceeds 80% of the cap, it sends a warning. If spending exceeds 95%, it sends a critical alert. This gives you time to react before hitting the hard limit and having all API calls fail.

sql
-- Postgres node query: monthly spending per provider
SELECT
  provider,
  SUM(total_tokens) AS monthly_tokens,
  SUM(estimated_cost_usd) AS monthly_cost_usd,
  COUNT(*) AS total_calls
FROM llm_usage_tracking
WHERE created_at >= date_trunc('month', NOW())
GROUP BY provider;
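After the query, a Code node compares each provider's monthly spend against your caps and emits alert items. A minimal sketch of that node; the caps below are examples to adjust to your accounts, and the output should be routed to your Slack or email node:

javascript
// Code node — JavaScript
// Flag providers approaching their monthly spending caps

const SPENDING_CAPS = { openai: 100, anthropic: 50, mistral: 30 }; // USD/month; adjust to your accounts
const WARNING_THRESHOLD = 0.80;  // warn at 80% of cap
const CRITICAL_THRESHOLD = 0.95; // critical at 95% of cap

const alerts = [];

for (const item of $input.all()) {
  const { provider, monthly_cost_usd } = item.json;
  const cap = SPENDING_CAPS[provider];
  if (!cap) continue; // unknown provider: skip

  const ratio = Number(monthly_cost_usd) / cap;
  if (ratio >= CRITICAL_THRESHOLD) {
    alerts.push({ json: { provider, monthly_cost_usd, cap, level: 'critical' } });
  } else if (ratio >= WARNING_THRESHOLD) {
    alerts.push({ json: { provider, monthly_cost_usd, cap, level: 'warning' } });
  }
}

// One output item per alert; no items means nothing fires downstream
return alerts;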

Expected result: Spending alerts fire at 80% and 95% of your monthly cap, giving you time to reduce usage or increase limits.

6. Implement fallback models for quota exhaustion

When your primary model hits a rate limit (429 error), automatically fall back to a cheaper or less-limited model instead of failing the workflow. Use the HTTP Request node's 'Retry On Fail' combined with an error handler that switches models. For example, fall back from GPT-4o to GPT-4o-mini, or from Claude 3.5 Sonnet to Claude 3.5 Haiku. The fallback provides a degraded but functional experience while your rate limit resets.

javascript
// Code node — JavaScript
// Fallback model selection after rate limit error

const FALLBACK_CHAIN = [
  { provider: 'openai', model: 'gpt-4o', priority: 1 },
  { provider: 'openai', model: 'gpt-4o-mini', priority: 2 },
  { provider: 'anthropic', model: 'claude-3-5-haiku-20241022', priority: 3 }
];

const staticData = $getWorkflowStaticData('global');
const rateLimitedModels = staticData.rateLimitedModels || {};
staticData.rateLimitedModels = rateLimitedModels; // persist the object if it was just created
const now = Date.now();

// Clean expired rate limit flags (reset after 60 seconds)
for (const [model, timestamp] of Object.entries(rateLimitedModels)) {
  if (now - timestamp > 60000) {
    delete rateLimitedModels[model];
  }
}

// Find the first model without an active rate limit flag
const available = FALLBACK_CHAIN.find(
  m => !rateLimitedModels[m.model]
);

if (!available) {
  // All models are rate limited: wait, then retry the primary
  return [{ json: { _waitMs: 30000, model: FALLBACK_CHAIN[0].model, isFallback: false } }];
}

return [{
  json: {
    model: available.model,
    provider: available.provider,
    isFallback: available.priority > 1,
    fallbackLevel: available.priority
  }
}];

Expected result: When the primary model is rate-limited, the workflow automatically switches to a fallback model without failing.
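Note that the selector above only reads the rate limit flags; a node on the error path must set them. A minimal sketch, assuming the item carries the model that was called in a `_selectedModel` field and that your error branch exposes the HTTP status as shown (both are assumptions to adapt to your workflow):

javascript
// Code node — JavaScript (error path)
// Flag the model that just returned 429 so the selector skips it for 60 seconds

const staticData = $getWorkflowStaticData('global');
staticData.rateLimitedModels = staticData.rateLimitedModels || {};

const item = $input.first().json;
const failedModel = item._selectedModel; // assumed field: the model used in the failed call
const status = item.statusCode ?? item.error?.httpCode; // depends on how your error branch surfaces the status

if (String(status) === '429' && failedModel) {
  staticData.rateLimitedModels[failedModel] = Date.now(); // the selector clears this after 60s
}

return [{ json: { ...item, _flaggedModel: failedModel ?? null } }];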

Complete working example

quota-manager.js
// Complete Code node: LLM quota manager
// Place before the LLM node to enforce rate limits and select models

const staticData = $getWorkflowStaticData('global');
const now = Date.now();

// Initialize tracking
if (!staticData.quotaManager) {
  staticData.quotaManager = {
    requestLog: [],   // Timestamps of recent requests
    rateLimited: {},  // Model → timestamp until which the model is rate limited
    dailyTokens: {},  // Provider → token count today
    dailyReset: now   // When the daily counters were last reset
  };
}

const qm = staticData.quotaManager;

// Reset daily counters every 24 hours
if (now - qm.dailyReset > 86400000) {
  qm.dailyTokens = {};
  qm.dailyReset = now;
}

// Clean old request log entries (keep the last 60 seconds)
qm.requestLog = qm.requestLog.filter(ts => now - ts < 60000);

// Configuration
const RPM_LIMIT = 45; // Stay 10% under the actual limit
const DAILY_TOKEN_LIMIT = 1000000;
const FALLBACKS = [
  { provider: 'openai', model: 'gpt-4o' },
  { provider: 'openai', model: 'gpt-4o-mini' },
  { provider: 'anthropic', model: 'claude-3-5-haiku-20241022' }
];

// Check RPM
const currentRPM = qm.requestLog.length;
let waitMs = 0;

if (currentRPM >= RPM_LIMIT) {
  // Wait until the oldest request falls out of the window
  const oldestInWindow = Math.min(...qm.requestLog);
  waitMs = 60000 - (now - oldestInWindow) + 100; // +100ms buffer
}

// Select a model, skipping rate-limited ones
// (a 429 handler on the error path should set qm.rateLimited[model])
let selectedModel = FALLBACKS[0];
for (const model of FALLBACKS) {
  const limitedUntil = qm.rateLimited[model.model] || 0;
  if (now > limitedUntil) {
    selectedModel = model;
    break;
  }
}

// Record this request
qm.requestLog.push(now + waitMs);

// Check the daily token budget
// (the post-call logging node should add response tokens to qm.dailyTokens)
const dailyUsed = qm.dailyTokens[selectedModel.provider] || 0;
const quotaRemaining = DAILY_TOKEN_LIMIT - dailyUsed;

return [{
  json: {
    ...$input.first().json,
    _selectedModel: selectedModel.model,
    _selectedProvider: selectedModel.provider,
    _waitMs: waitMs,
    _currentRPM: currentRPM,
    _dailyTokensUsed: dailyUsed,
    _quotaRemaining: quotaRemaining,
    _isFallback: selectedModel !== FALLBACKS[0]
  }
}];

Common mistakes when managing LLM usage quotas in n8n

Mistake: Sending batch items to the LLM in parallel without rate limiting, instantly hitting RPM caps

How to avoid: Use SplitInBatches with a batch size of 1 and a Wait node between batches to enforce RPM spacing

Mistake: Tracking only request count but not token usage, missing TPM limits

How to avoid: Track both requests per minute AND tokens per minute, since some providers enforce both independently

Mistake: Not setting a spending cap at the provider level, risking unlimited charges from a bug or loop

How to avoid: Set monthly spending limits in your OpenAI/Anthropic/Mistral dashboard immediately

Mistake: Using the same model for all tasks regardless of complexity, wasting quota on simple tasks

How to avoid: Route simple tasks (classification, yes/no) to cheap models (GPT-4o-mini) and complex tasks to powerful models

Mistake: Retrying 429 errors immediately without backing off, making the rate limit situation worse

How to avoid: Use exponential backoff (wait 1s, then 2s, then 4s), as in the sketch below, or switch to a fallback model immediately.
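A minimal backoff sketch for a Code node that feeds a Wait node, looping back after each failed attempt; the `_retryCount` field and the cap of five attempts are illustrative assumptions:

javascript
// Code node — JavaScript
// Compute an exponential backoff delay after a 429, then feed it to a Wait node

const item = $input.first().json;
const retryCount = item._retryCount ?? 0; // assumed field, carried around the retry loop

const BASE_DELAY_MS = 1000; // 1s, 2s, 4s, 8s, ...
const MAX_RETRIES = 5;

if (retryCount >= MAX_RETRIES) {
  throw new Error(`Giving up after ${retryCount} retries (still rate limited)`);
}

// Full jitter: randomize within the window to avoid synchronized retries
const delayMs = Math.round(Math.random() * BASE_DELAY_MS * 2 ** retryCount);

return [{
  json: {
    ...item,
    _retryCount: retryCount + 1,
    _waitMs: delayMs
  }
}];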

Best practices

  • Set your n8n rate limit to 90% of your actual provider limit to leave headroom for manual API usage
  • Track token usage in a database, not just static data, for historical analysis and cost reporting
  • Set spending alerts at 80% of your monthly cap to give time for action
  • Always set provider-side spending caps as a safety net in addition to n8n-side tracking
  • Use fallback model chains so workflows degrade gracefully instead of failing completely
  • Space batch processing requests with Wait nodes to stay within RPM limits
  • Monitor daily and monthly token usage trends to right-size your provider tier
  • Use cheaper models (GPT-4o-mini, Claude 3.5 Haiku) for low-complexity tasks to stretch your quota

Still stuck?

Copy one of these prompts to get a personalized, step-by-step explanation.

ChatGPT Prompt

My n8n workflow processes hundreds of items through OpenAI and keeps hitting 429 rate limit errors. How do I implement rate limiting, token tracking, and fallback models in n8n to prevent quota issues?

n8n Prompt

I need to add rate limiting to my n8n workflow that calls the OpenAI API. How do I use the Wait node and Code node to enforce RPM limits, and how do I track token usage in PostgreSQL?

Frequently asked questions

What happens when I hit my OpenAI spending cap?

All API calls immediately fail with a 429 error and a message about exceeding your billing limit. No requests go through until the next billing cycle or until you increase the cap in Settings → Limits on platform.openai.com.

Can I increase my rate limits without paying more?

Yes, OpenAI automatically increases rate limits as your account ages and payment history builds. You can also request a limit increase through their support portal. Anthropic and Mistral have similar tier systems.

How do I handle rate limits when multiple n8n workflows share the same API key?

Use a shared rate limiting mechanism: either a centralized Redis counter that all workflows check before making calls, or a shared PostgreSQL table that tracks requests per minute across all workflows.
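A minimal sketch of the shared-Postgres variant, assuming a hypothetical `llm_request_log` table that every workflow inserts into, and a preceding Postgres node that counts rows from the last 60 seconds:

javascript
// Code node — JavaScript, placed after a Postgres node running (hypothetical table):
// SELECT COUNT(*) AS recent_calls FROM llm_request_log
// WHERE created_at > NOW() - INTERVAL '60 seconds';

const SHARED_RPM_LIMIT = 50; // combined limit for all workflows on this key
const recentCalls = Number($input.first().json.recent_calls ?? 0);

if (recentCalls >= SHARED_RPM_LIMIT) {
  // Over the shared budget: wait out the rest of the window before calling the LLM
  return [{ json: { _waitMs: 60000, _allowed: false, recentCalls } }];
}

return [{ json: { _waitMs: 0, _allowed: true, recentCalls } }];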

Is it cheaper to use one large prompt or multiple small prompts?

One large prompt is usually cheaper because you pay input token costs only once. Multiple small prompts repeat the system prompt and context in each call. However, one large prompt may hit token limits — balance cost against reliability.
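A back-of-the-envelope illustration using the gpt-4o-mini input price from step 4; the token counts are made up for the example:

javascript
// Rough input-cost comparison: one combined call vs. 20 separate calls
// Assumes gpt-4o-mini input pricing of $0.15 per 1M tokens (illustrative)

const INPUT_PRICE_PER_TOKEN = 0.15 / 1e6;
const SYSTEM_PROMPT_TOKENS = 500; // repeated in every separate call
const PER_ITEM_TOKENS = 100;
const ITEMS = 20;

// One large prompt: the system prompt is paid for once
const combined = (SYSTEM_PROMPT_TOKENS + ITEMS * PER_ITEM_TOKENS) * INPUT_PRICE_PER_TOKEN;

// Twenty small prompts: the system prompt is paid for every time
const separate = ITEMS * (SYSTEM_PROMPT_TOKENS + PER_ITEM_TOKENS) * INPUT_PRICE_PER_TOKEN;

console.log({ combined, separate, ratio: separate / combined });
// In this example, separate calls cost ~4.8x the input tokens of one combined call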

How accurate is the cost estimation in the tracking Code node?

The estimation uses published per-token pricing and is accurate for standard usage. It does not account for cached input tokens (OpenAI discount), batch API pricing, or promotional credits. Treat it as an approximation and verify against your provider's billing dashboard monthly.

Can RapidDev help optimize LLM costs for high-volume n8n workflows?

Yes, RapidDev specializes in optimizing LLM costs for n8n workflows. Their team implements rate limiting, caching, model routing (expensive models for complex tasks, cheap models for simple ones), and token budget management. Clients typically see 40-60% cost reductions after optimization.
