December 8, 2025 · 10 min read

    Merging PDFs at Scale: Performance Tips and Tricks

    Discover how to efficiently merge thousands of PDFs using our API. Optimize memory usage, handle large files, and process documents in parallel.

    Merging PDFs seems simple: combine files into one document. But at scale, naive implementations quickly hit performance bottlenecks. Let's explore how to merge PDFs efficiently for production workloads.

    The Challenge of Scale

    Merging a few PDFs is easy. Merging thousands reveals challenges:

    • Memory constraints: Loading all files into memory crashes your application
    • API rate limits: Sequential processing is too slow
    • File size limits: Individual files may be huge
    • Network reliability: Long operations fail on timeouts

    Strategy 1: Batch Processing

    Instead of merging all files at once, process in manageable batches:

    // mergeFiles(batch) is assumed to wrap a single call to the merge API
    async function mergeLargeBatch(files) {
      const BATCH_SIZE = 10; // Merge 10 files at a time
      const batches = chunkArray(files, BATCH_SIZE);
      const mergedBatches = [];
      
      // Process each batch
      for (const batch of batches) {
        const merged = await mergeFiles(batch);
        mergedBatches.push(merged);
      }
      
      // Final merge of all batches
      return await mergeFiles(mergedBatches);
    }
    
    function chunkArray(array, size) {
      const chunks = [];
      for (let i = 0; i < array.length; i += size) {
        chunks.push(array.slice(i, i + size));
      }
      return chunks;
    }

    Strategy 2: Parallel Processing

    Process multiple batches simultaneously to maximize throughput:

    async function mergeParallel(files) {
      const BATCH_SIZE = 10;
      const PARALLEL_BATCHES = 3;
      
      const batches = chunkArray(files, BATCH_SIZE);
      const results = [];
      
      // Process up to PARALLEL_BATCHES batches concurrently
      for (let i = 0; i < batches.length; i += PARALLEL_BATCHES) {
        const parallelBatches = batches.slice(i, i + PARALLEL_BATCHES);
        const batchResults = await Promise.all(
          parallelBatches.map(batch => mergeFiles(batch))
        );
        results.push(...batchResults);
      }
      
      return await mergeFiles(results);
    }
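    A caveat on the pattern above: Promise.all over fixed groups waits for the slowest batch in each group before starting the next three, so one slow merge stalls the whole pipeline. A small worker-pool helper keeps every slot busy instead. This is a generic sketch in plain JavaScript; the name mapWithConcurrency is ours, and it assumes the same mergeFiles wrapper as above:

```javascript
// Run worker(items[i]) for every item, with at most `limit` calls in flight.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // shared cursor: each slot claims the next unprocessed index

  async function runSlot() {
    while (next < items.length) {
      const i = next++; // claiming is synchronous, so no two slots share an index
      results[i] = await worker(items[i]);
    }
  }

  // Start `limit` slots; each pulls new work as soon as it finishes the last.
  const slots = Array.from({ length: Math.min(limit, items.length) }, runSlot);
  await Promise.all(slots);
  return results;
}

// Usage (hypothetical): merge all batches with 3 requests in flight at once.
// const merged = await mapWithConcurrency(batches, 3, mergeFiles);
```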

    Strategy 3: Streaming for Memory Efficiency

    Stream files instead of loading them entirely into memory:

    const fs = require('fs');
    const FormData = require('form-data');
    // The form-data package needs node-fetch; Node's built-in fetch rejects its streams
    const fetch = require('node-fetch');
    
    async function streamMerge(filePaths) {
      const form = new FormData();
      
      // Stream each file
      filePaths.forEach((path, index) => {
        const stream = fs.createReadStream(path);
        form.append(`file${index}`, stream);
      });
      
      const response = await fetch('https://pdfmunk.com/api/merge', {
        method: 'POST',
        headers: {
          'Authorization': 'Bearer YOUR_API_KEY',
          ...form.getHeaders()
        },
        body: form
      });
      
      return response.body; // Stream the response too
    }

    Strategy 4: Progressive Merging

    For very large operations, merge progressively and store intermediate results:

    async function progressiveMerge(files, onProgress) {
      let currentMerged = files[0];
      
      for (let i = 1; i < files.length; i++) {
        // Merge current result with next file
        currentMerged = await mergeTwo(currentMerged, files[i]);
        
        // Report progress: i + 1 files merged so far (counting files[0])
        onProgress({
          processed: i + 1,
          total: files.length,
          percent: ((i + 1) / files.length) * 100
        });
        
        // Optional: Save checkpoint
        if (i % 100 === 0) {
          await saveCheckpoint(currentMerged, i);
        }
      }
      
      return currentMerged;
    }

    Handling Large Individual Files

    When individual PDFs are large (50MB+), special care is needed:

    • Compress first: Reduce file size before merging
    • Split then merge: Break large files into smaller chunks
    • Direct upload: Use presigned URLs for very large files
    • Increase timeouts: Allow more time for large operations
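    The timeout point deserves a concrete shape: the built-in fetch has no timeout option, so you pass an AbortSignal instead. The sketch below scales the deadline with payload size; the base and per-megabyte values are illustrative assumptions, not documented limits:

```javascript
// Size-aware deadline: 30s base plus 1s per MB of payload (illustrative numbers).
function timeoutForSize(bytes, baseMs = 30_000, perMbMs = 1_000) {
  return baseMs + Math.ceil(bytes / (1024 * 1024)) * perMbMs;
}

// Upload a large buffer with a deadline proportional to its size.
async function uploadLargeFile(url, buffer, apiKey) {
  const response = await fetch(url, {
    method: 'POST',
    headers: { Authorization: `Bearer ${apiKey}` },
    body: buffer,
    // AbortSignal.timeout (Node 18+) aborts the request once the deadline passes.
    signal: AbortSignal.timeout(timeoutForSize(buffer.length)),
  });
  if (!response.ok) throw new Error(`Upload failed: ${response.status}`);
  return response;
}
```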

    Optimizing API Usage

    Maximize efficiency and minimize costs:

    // ❌ Bad: many small merges
    let result = files[0];
    for (const file of files.slice(1)) {
      result = await merge([result, file]); // N - 1 API calls
    }
    
    // ✅ Good: One large merge
    result = await merge(files); // 1 API call
    
    // ✅ Better: Balanced batching
    const batches = chunk(files, 20);
    const merged = await Promise.all(
      batches.map(batch => merge(batch))
    );
    result = await merge(merged); // 1 + ceil(N/20) calls

    Error Recovery at Scale

    With large batches, failures are inevitable. Build in resilience:

    // Simple promise-based delay used for exponential backoff
    const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
    
    async function robustMerge(files) {
      const MAX_RETRIES = 3;
      const failed = [];
      
      for (const file of files) {
        let success = false;
        
        for (let attempt = 0; attempt < MAX_RETRIES; attempt++) {
          try {
            await processFile(file);
            success = true;
            break;
          } catch (error) {
            console.error(`Attempt ${attempt + 1} failed:`, error);
            await sleep(Math.pow(2, attempt) * 1000);
          }
        }
        
        if (!success) {
          failed.push(file);
        }
      }
      
      if (failed.length > 0) {
        throw new Error(`Failed to process ${failed.length} files`);
      }
    }

    Performance Benchmarks

    • Sequential: ~1,000 files/hour
    • Batched (10 per batch): ~5,000 files/hour
    • Parallel (3 workers): ~15,000 files/hour
    • Optimized pipeline: ~25,000+ files/hour

    *Benchmarks based on average 2MB PDFs with 10 pages each
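    As a rough sanity check on these figures, throughput is approximately (calls per hour per worker) × workers × batch size until rate limits bite. A toy estimator, with all constants illustrative:

```javascript
// Rough files-per-hour estimate: each API call merges one batch of `batchSize`
// files and takes `callSeconds`; `workers` calls run concurrently.
// Ignores the final merge pass and any rate limiting.
function estimateThroughput(batchSize, workers, callSeconds) {
  const callsPerHour = (3600 / callSeconds) * workers;
  return Math.floor(callsPerHour * batchSize);
}

// e.g. estimateThroughput(10, 3, 36) -> 3000 files/hour under these assumptions
```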

    Conclusion

    Merging PDFs at scale requires thoughtful architecture. Start with batch processing, add parallelization for speed, and implement proper error handling. Monitor performance and adjust batch sizes based on your specific workload.

    Want to try these techniques? Our API documentation includes working examples and performance guidelines.
