Long LLM responses get cut off in n8n when the max_tokens setting is too low, the model hits its output token limit, or n8n's execution timeout kills the workflow before the response arrives. Fix this by increasing max_tokens in the LLM node, raising n8n's execution timeout via N8N_EXECUTIONS_TIMEOUT, splitting long generation tasks into chunks, and using the HTTP Request node for direct API control over streaming and token limits.
Why n8n Cuts Off Long LLM Responses
There are three distinct reasons why LLM responses get truncated in n8n, and each requires a different fix. First, the max_tokens parameter in the LLM node caps the output length — if set to 1024, the model stops generating at 1024 tokens regardless of whether it finished. Second, n8n's execution timeout (default 300 seconds) kills long-running workflows, and large LLM responses can take 30-60+ seconds to generate. Third, the model itself may hit its hard output limit (e.g., Claude 3.5 Sonnet has an 8,192-token output limit by default). This tutorial addresses all three causes with specific fixes.
Prerequisites
- A running n8n instance (v1.30 or later)
- An active credential for at least one LLM provider
- A workflow that generates long responses (articles, reports, documentation)
- Access to n8n environment configuration (for timeout settings)
Step-by-step guide
Diagnose which limit is causing the truncation
Open your failed or truncated execution in n8n. Click the LLM node's output and inspect the response. Look for three clues: (1) a 'finish_reason' of 'length' (OpenAI) or a 'stop_reason' of 'max_tokens' (Claude) means the max_tokens limit was hit; (2) an 'Execution timed out' error means the n8n timeout killed the workflow; (3) a response cut mid-sentence with no error means the model's built-in output limit was reached. Each requires a different fix.
// Check the LLM node output for these indicators:
//
// OpenAI: response.choices[0].finish_reason
//   'stop'   = completed normally
//   'length' = max_tokens limit reached (TRUNCATED)
//
// Claude: response.stop_reason
//   'end_turn'   = completed normally
//   'max_tokens' = output limit reached (TRUNCATED)
//
// Gemini: response.candidates[0].finishReason
//   'STOP'       = completed normally
//   'MAX_TOKENS' = limit reached (TRUNCATED)

Expected result: You identify whether the truncation is caused by max_tokens, execution timeout, or model output limits.
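To make this check part of the workflow itself, a minimal Code node sketch placed directly after the LLM node can surface the finish reason (the field paths follow the provider responses above; adjust them if your node wraps the response differently):

// Quick diagnosis: surface the finish reason from the previous node's output
const r = $input.first().json;

const finishReason = r.choices?.[0]?.finish_reason   // OpenAI
  || r.stop_reason                                   // Claude
  || r.candidates?.[0]?.finishReason                 // Gemini
  || 'unknown';

return [{
  json: {
    finishReason,
    truncated: ['length', 'max_tokens', 'MAX_TOKENS'].includes(finishReason)
  }
}];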
Increase max_tokens in the LLM node
The most common cause of truncation is the max_tokens setting being too low. In the LLM node configuration, find the Max Tokens (or Maximum Length) parameter and increase it. For the built-in OpenAI node, this is under Options > Maximum Number of Tokens. For the Claude node, it is Max Tokens to Sample. Set it based on your needs: 2048 for medium responses, 4096 for long responses, or 8192 for very long content like articles.
// Recommended max_tokens by use case:
//
//   Short answers (support replies):          512-1024
//   Medium content (summaries, explanations): 2048-4096
//   Long content (articles, reports):         4096-8192
//
// Maximum output by model:
//   GPT-4o:            16,384 output tokens
//   Claude 3.5 Sonnet:  8,192 output tokens
//   Claude 3 Opus:      4,096 output tokens
//   Gemini 2.0 Flash:   8,192 output tokens
//   Gemini 1.5 Pro:     8,192 output tokens

Expected result: The LLM generates longer responses without truncation at the token level.
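For comparison, this is where the same limit lives if you call the OpenAI Chat Completions API directly from an HTTP Request node (a sketch; the model name and expression are illustrative):

{
  "model": "gpt-4o",
  "max_tokens": 4096,
  "messages": [
    { "role": "user", "content": "{{ $json.message }}" }
  ]
}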
Increase n8n execution timeout for long-running LLM calls
Large LLM responses (4000+ tokens) can take 30-120 seconds to generate, especially with larger models like GPT-4o or Claude Opus. If n8n's execution timeout is set too low, it kills the workflow before the response arrives. Set the N8N_EXECUTIONS_TIMEOUT environment variable to a higher value (in seconds). For LLM-heavy workflows, 600 seconds (10 minutes) is a safe starting point.
# Environment variable configuration:
N8N_EXECUTIONS_TIMEOUT=600

# For Docker:
# environment:
#   - N8N_EXECUTIONS_TIMEOUT=600

# Per-workflow override (in workflow settings):
# Settings → Timeout Workflow After → 600 seconds

Expected result: Workflows have enough time to wait for large LLM responses without being killed.
Use the HTTP Request node for direct API control
The built-in LLM nodes in n8n abstract away some API parameters. For maximum control over token limits and response handling, use the HTTP Request node to call the LLM API directly. This lets you set exact max_tokens values, access streaming endpoints, and read response metadata like finish_reason to detect truncation programmatically.
// HTTP Request node calling the Claude API directly:
// Method: POST
// URL: https://api.anthropic.com/v1/messages
// Headers:
//   x-api-key: {{ $env.ANTHROPIC_API_KEY }}
//   anthropic-version: 2023-06-01
//   content-type: application/json
// Body:
{
  "model": "claude-sonnet-4-20250514",
  "max_tokens": 8192,
  "messages": [
    {
      "role": "user",
      "content": "{{ $json.message }}"
    }
  ],
  "system": "You are a technical writer. Write complete, detailed responses. Never truncate or summarize unless explicitly asked."
}

Expected result: You have full control over max_tokens, model selection, and can inspect finish_reason in the response.
Split long generation tasks into sequential chunks
When you need responses longer than any single model's output limit (e.g., a 5000-word article), split the task into sections. Use a Code node to create an array of section prompts (Introduction, Section 1, Section 2, etc.), process them with SplitInBatches through the LLM node, and concatenate the results. Each chunk stays within token limits while the combined output can be arbitrarily long.
// Code node: Split a long article into section prompts
const topic = $input.first().json.topic;

const sections = [
  { section: 'Introduction', prompt: `Write a 500-word introduction about ${topic}. End with a transition to the first main section.` },
  { section: 'Background', prompt: `Write a 600-word background section about ${topic}. Cover the history and current state. Start from where the introduction left off.` },
  { section: 'Analysis', prompt: `Write a 700-word analysis section about ${topic}. Include data points and expert opinions.` },
  { section: 'Recommendations', prompt: `Write a 500-word recommendations section about ${topic}. Provide actionable advice.` },
  { section: 'Conclusion', prompt: `Write a 300-word conclusion about ${topic}. Summarize key points and end with a call to action.` }
];

return sections.map(s => ({ json: s }));

Expected result: A roughly 2,600-word article is generated in five section-sized chunks, bypassing any single-call output limit.
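The concatenation step happens after the loop finishes. A minimal sketch of that Code node, assuming the LLM node is named 'LLM Node' and each item exposes its generated text on json.text (both are assumptions; adjust to your workflow):

// Code node: join all generated sections into one article
const parts = $('LLM Node').all().map(item => item.json.text);

return [{
  json: {
    article: parts.join('\n\n'),
    sectionCount: parts.length
  }
}];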
Add a truncation detection node after the LLM
Add a Code node after the LLM that checks whether the response was truncated. If it was, the node can automatically retry with a continuation prompt ('Continue from where you left off: [last 200 chars]') to get the remaining content. This handles edge cases where even high max_tokens settings are not quite enough.
const response = $input.first().json;

// Detect truncation based on provider
const finishReason = response.choices?.[0]?.finish_reason   // OpenAI
  || response.stop_reason                                    // Claude
  || response.candidates?.[0]?.finishReason                  // Gemini
  || 'unknown';

const isTruncated = ['length', 'max_tokens', 'MAX_TOKENS'].includes(finishReason);
const text = response.choices?.[0]?.message?.content || response.content?.[0]?.text || response.text || '';

if (isTruncated) {
  // Get the last 200 characters for the continuation prompt
  const lastChunk = text.slice(-200);
  return [{
    json: {
      text,
      isTruncated: true,
      continuationPrompt: `Continue writing from exactly where this text ends. Do not repeat any content. Previous text ended with: "${lastChunk}"`
    }
  }];
}

return [{ json: { text, isTruncated: false } }];

Expected result: Truncated responses are detected and can be continued automatically.
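To act on the continuation prompt, one option is an IF node that routes isTruncated = true items into a second LLM call, followed by a Code node that merges the two parts. A minimal merge sketch, assuming the detection node is named 'Detect Truncation' and the continuation LLM's text arrives on json.text (both names are placeholders):

// Code node: append the continuation to the original (truncated) text
const original = $('Detect Truncation').first().json.text;
const continuation = $input.first().json.text || '';

return [{
  json: {
    text: original + continuation,
    wasContinued: true
  }
}];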
Complete working example
// ====== Truncation Detector & Auto-Continuator — Code Node ======
// Place after the LLM node to detect and handle truncated responses

const response = $input.first().json;

// Extract text and finish reason across providers
const text = response.choices?.[0]?.message?.content       // OpenAI
  || response.content?.[0]?.text                            // Claude
  || response.candidates?.[0]?.content?.parts?.[0]?.text    // Gemini
  || response.output || response.text || '';

const finishReason = response.choices?.[0]?.finish_reason   // OpenAI
  || response.stop_reason                                    // Claude
  || response.candidates?.[0]?.finishReason                  // Gemini
  || 'unknown';

const isTruncated = [
  'length',      // OpenAI
  'max_tokens',  // Claude
  'MAX_TOKENS'   // Gemini
].includes(finishReason);

const usage = {
  inputTokens: response.usage?.prompt_tokens || response.usage?.input_tokens || 0,
  outputTokens: response.usage?.completion_tokens || response.usage?.output_tokens || 0,
  finishReason
};

if (isTruncated) {
  const lastChunk = text.slice(-300);
  return [{
    json: {
      text,
      isTruncated: true,
      usage,
      continuation: {
        needed: true,
        prompt: `Continue from where this text ends. Do not repeat content. The text ended with: "...${lastChunk}"`,
        previousLength: text.length
      }
    }
  }];
}

return [{
  json: {
    text,
    isTruncated: false,
    usage,
    continuation: { needed: false }
  }
}];

Common mistakes when preventing n8n from cutting off long LLM responses
Mistake: Leaving max_tokens at the default value (often 256 or 1024) when generating long content.
How to avoid: Explicitly set max_tokens to match your content needs: 4096 for articles, 8192 for reports.
Mistake: Confusing n8n's execution timeout with the LLM's generation time; they are separate limits.
How to avoid: Increase both: max_tokens in the LLM node AND N8N_EXECUTIONS_TIMEOUT in the environment variables.
Mistake: Setting max_tokens higher than the model supports and expecting more output.
How to avoid: Check the model's actual output limit. Claude 3.5 Sonnet maxes out at 8,192 output tokens regardless of what you set.
Mistake: Using the Webhook Response Mode 'Immediately' for long LLM calls, which returns an empty response to the caller.
How to avoid: Use 'Using Respond to Webhook Node' so the HTTP connection stays open until the LLM finishes (see the sketch below).
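A sketch of the wiring this implies, with node names and options as they appear in recent n8n versions (verify the exact labels against your instance):

// Webhook node            → Respond: Using 'Respond to Webhook' Node
// LLM / HTTP Request node → generates the long response
// Respond to Webhook node → Respond With: First Incoming Item
//
// The HTTP connection stays open until the Respond to Webhook node runs,
// so the caller receives the complete text instead of an empty early reply.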
Best practices
- Always check finish_reason/stop_reason in LLM responses to detect truncation programmatically
- Set max_tokens to at least twice your expected output length as a safety margin
- Use per-workflow timeout settings instead of global N8N_EXECUTIONS_TIMEOUT for targeted control
- For content longer than 8,000 tokens, split into sections and generate sequentially
- Use the HTTP Request node instead of built-in LLM nodes when you need full control over parameters
- Include 'Write a complete response. Do not truncate.' in your system prompt as a behavioral nudge
- Monitor token usage to optimize cost — higher max_tokens does not increase cost, only actual tokens generated do
- Set the Webhook Response Mode to 'Using Respond to Webhook Node' to prevent HTTP timeouts on long generations
Still stuck?
Copy one of these prompts to get a personalized, step-by-step explanation.
My n8n workflow cuts off long responses from the language model. The text stops mid-sentence. How do I fix this? I need to handle max_tokens limits, n8n execution timeouts, and model output caps.
Increase max_tokens in the LLM node (Options > Maximum Tokens) to 4096-8192. Set N8N_EXECUTIONS_TIMEOUT=600 in environment variables. Add a Code node after the LLM to check finish_reason — if it's 'length' or 'max_tokens', auto-continue with a follow-up prompt. For very long content, use SplitInBatches to generate sections sequentially.
Frequently asked questions
Does setting a higher max_tokens cost more money?
No. You are only charged for tokens actually generated, not the max_tokens limit. Setting max_tokens to 8192 but receiving a 2000-token response means you pay for 2000 output tokens. There is no penalty for setting the limit high.
What is the maximum output length for each major model?
GPT-4o supports up to 16,384 output tokens. Claude 3.5 Sonnet supports 8,192 output tokens. Gemini 2.0 Flash supports 8,192 output tokens. Gemini 1.5 Pro supports 8,192 output tokens. These are hard limits that cannot be exceeded in a single API call.
Can I use streaming to get longer responses?
Streaming does not increase the maximum output length — it delivers the same content token-by-token instead of all at once. However, streaming prevents HTTP timeouts because the connection receives data continuously. Use the HTTP Request node with streaming enabled for the best timeout behavior.
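As a sketch, streaming is enabled in the direct Claude request shown earlier by adding a single field to the body; note that the HTTP Request node typically buffers the whole event stream and returns it as text, which you would then reassemble (for example in a Code node):

{
  "model": "claude-sonnet-4-20250514",
  "max_tokens": 8192,
  "stream": true,
  "messages": [
    { "role": "user", "content": "{{ $json.message }}" }
  ]
}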
How do I concatenate chunked responses from SplitInBatches?
After SplitInBatches completes all iterations, add a Code node that uses $('LLM Node').all() to get all outputs, then join them: $('LLM Node').all().map(item => item.json.text).join('\n\n'). This produces the complete concatenated text.
Why does my response get cut off even with max_tokens set to 8192?
Check three things: (1) the model's actual output limit may be lower than 8192, (2) n8n's execution timeout may be killing the workflow, or (3) the Webhook response mode may be set to 'Immediately' which returns before the LLM finishes. Fix all three independently.
Can RapidDev help optimize my n8n workflow for long-form content generation?
Yes. RapidDev builds n8n content generation pipelines with automatic chunking, truncation recovery, and quality verification for teams producing articles, reports, and documentation at scale.
Talk to an Expert
Our team has built 600+ apps. Get personalized help with your project.
Book a free consultation