One GPU to Rule Them All

Comparing GPU Performance for ComfyUI Workflows

Choosing the right GPU for your AI workflows can be overwhelming. With so many options, it’s hard to know which one will deliver the performance you need. That’s why we launched the "One GPU to Rule Them All" blog series. In this series, we’ll test various GPUs and workflows to help you make informed decisions.

In this first post, we tested a simple text-to-image workflow using ComfyUI across different models: SDXL, SD1.5, and two variations of the Flux model (default and FP8 precision). We tested six different GPUs—A5000, A40, A100, RTX 4090, L40, and H100 SXM—to give you a comprehensive view. Additionally, we factored in each GPU’s hourly cost and VRAM capacity to make recommendations based on value for performance.

The Test Setup

  1. Workflow: Simple text-to-image generation.
  2. Resolution:
    • SD1.5 outputs: 512x512.
    • Other models (SDXL, Flux): 1024x1024.
  3. GPUs Tested: A5000, A40, A100, RTX 4090, L40, and H100 SXM.
  4. Runs: Each workflow was run multiple times; we excluded the initial warm-up run to focus on stabilized performance.
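The timing procedure above can be sketched in a few lines. This is a minimal sketch, not our actual harness: `run_workflow` stands in for whatever submits the text-to-image prompt to ComfyUI and blocks until the image is ready.

```python
import time
import statistics

def benchmark(run_workflow, warmup=1, runs=5):
    """Average runtime over `runs` executions, excluding `warmup` initial runs."""
    for _ in range(warmup):
        run_workflow()  # warm-up: model loading, kernel compilation, caches
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_workflow()
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings)

# Stand-in workload for illustration; replace with a real ComfyUI call.
avg = benchmark(lambda: time.sleep(0.01), warmup=1, runs=3)
print(f"average runtime: {avg:.3f}s")
```

Excluding warm-up matters: the first run includes one-time costs such as loading model weights into VRAM, which would otherwise skew the comparison between GPUs.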

Key Findings

Performance-to-Cost Analysis

We analyzed the average runtime across workflows for each GPU relative to its hourly cost. Here’s how they compare:

Performance-to-Cost Ratio
  • H100 SXM: Offers the best performance-to-cost ratio, making it ideal for high-performance needs despite its premium cost.
  • L40: Provides a strong balance of cost and performance, suitable for workflows requiring substantial memory.
  • RTX 4090: Delivers excellent performance for its cost, particularly for non-memory-intensive tasks.
  • A100: Performs well for memory-intensive tasks like Flux, but does not offer significant advantages over the RTX 4090 for models like SDXL or SD1.5 given its higher cost.
  • A40: Consistently underperforms compared to the RTX 4090 across all workflows, offering limited value for its price.
  • A5000: While the cheapest option, it lags in performance, making it suitable for light, budget-conscious tasks.
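The ratio above boils down to dollars per generated image: hourly price times runtime, divided by seconds per hour. The sketch below illustrates the calculation; the runtimes and hourly prices are placeholders, not our measured values.

```python
# Placeholder figures for illustration -- not the measured benchmark numbers.
gpus = {
    "RTX 4090": {"runtime_s": 8.0,  "usd_per_hr": 0.69},
    "A100":     {"runtime_s": 12.0, "usd_per_hr": 1.89},
}

def cost_per_image(runtime_s, usd_per_hr):
    """Effective dollar cost of a single generation."""
    return usd_per_hr * runtime_s / 3600

# Rank GPUs from cheapest to most expensive per image.
for name, g in sorted(gpus.items(), key=lambda kv: cost_per_image(**kv[1])):
    print(f"{name}: ${cost_per_image(**g):.5f} per image")
```

A GPU with a higher hourly price can still win this ranking if its runtime is short enough, which is exactly why the H100 SXM scores well despite its premium.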

Flux Model Insights

We noted an interesting pattern in the Flux workflows:

  1. 24GB GPUs (A5000, RTX 4090):
    • The Flux FP8 model outperforms Flux-Dev, as its smaller memory footprint fits more comfortably within 24GB of VRAM.
  2. Larger VRAM GPUs (A40, L40, A100, H100 SXM):
    • Flux-Dev performs better on GPUs with 48GB or more VRAM, leveraging the additional memory effectively.
Flux Performance Comparison Across GPUs
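A rough back-of-the-envelope calculation shows why precision matters so much on 24GB cards. Assuming roughly 12B parameters for the Flux transformer (an approximation on our part), weight storage alone works out to:

```python
def weight_gb(n_params, bytes_per_param):
    """Approximate weight storage in GiB for a given parameter count and precision."""
    return n_params * bytes_per_param / 1024**3

FLUX_PARAMS = 12e9  # ~12B parameters, approximate

print(f"FP16/BF16 weights: ~{weight_gb(FLUX_PARAMS, 2):.1f} GB")  # ~22.4 GB
print(f"FP8 weights:       ~{weight_gb(FLUX_PARAMS, 1):.1f} GB")  # ~11.2 GB
```

At 16-bit precision the weights alone nearly fill a 24GB card before activations and the text encoder are accounted for, while FP8 leaves comfortable headroom; on 48GB+ GPUs this pressure disappears and Flux-Dev can run unconstrained.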

Runtime and Cost Comparison

The chart below compares average runtimes and hourly costs across GPUs, highlighting trade-offs between speed and expense:

Average Runtime by Model Type and Relative Costs Across GPUs
  • H100 SXM: Fastest across all workflows, with costs justified for demanding tasks.
  • RTX 4090: Great runtime at a reasonable price, ideal for general-purpose tasks.
  • L40: Excels in memory-heavy workflows with balanced cost efficiency.

Recommendations

When selecting a GPU, consider how you plan to use the instance:

  1. For Workflow Ideation and Refinement:
    • If you’re iterating on workflows, running a few inferences, or refining your process, smaller, more cost-effective GPUs like the A5000 or RTX 4090 are ideal. They provide sufficient performance for development without incurring unnecessary costs.
  2. For High-Volume Inference or Batch Processing:
    • If your workflow is finalized and involves processing large batches or running a significant number of inferences, larger GPUs like the H100 SXM, L40, or A100 offer better cost-to-performance efficiency.
  3. For Memory-Intensive Workflows:
    • Use higher-VRAM GPUs like the L40, A100, or H100 SXM for Flux workflows (FP8 and Flux-Dev); they provide the memory capacity needed to run larger models efficiently.
    • For Flux-Dev specifically, the A100 or H100 SXM delivers the best performance.
  4. For Balanced Performance and Cost:
    • The RTX 4090 offers the best overall value for most use cases, excelling in workflows like SDXL and SD1.5 while also handling lighter memory-intensive tasks.
    • The L40 is suitable for more memory-demanding workflows, but only when tasks require substantial VRAM.
    • Avoid the A40, as it underperforms relative to its cost across all workflows tested.
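The decision logic above can be condensed into a toy chooser. This mirrors our recommendations, not any official sizing guide, and the two inputs (development stage and VRAM need) are a deliberate simplification.

```python
def recommend_gpu(stage, needs_high_vram=False):
    """Toy GPU chooser reflecting this post's recommendations (opinions, not a spec)."""
    if stage == "ideation":
        # Iterating on workflows: favor cheap, fast GPUs.
        return "L40" if needs_high_vram else "RTX 4090"
    if stage == "production":
        # High-volume inference: favor throughput per dollar.
        return "H100 SXM" if needs_high_vram else "RTX 4090"
    raise ValueError(f"unknown stage: {stage}")

print(recommend_gpu("ideation"))           # RTX 4090
print(recommend_gpu("production", True))   # H100 SXM
```

Note that no branch returns the A40, consistent with our finding that it underperforms relative to its cost.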

The Winner: RTX 4090

After analyzing performance, cost, and memory utilization across all tested GPUs, the RTX 4090 emerges as the best all-around GPU. It strikes an excellent balance of cost and performance, excelling in workflows like SDXL and SD1.5 while maintaining strong runtime efficiency. Compared to the A100, the RTX 4090 is approximately 35% faster on SDXL and SD1.5 workflows while being significantly more cost-effective (50% lower hourly cost). Additionally, the RTX 4090 performs respectably well for Flux workflows, especially the Flux FP8 variant, making it a versatile option for most use cases. For most users, whether working on ideation or finalizing workflows, the RTX 4090 provides the greatest value.

For ideation or general-purpose workflows requiring higher memory, the L40 stands out as an excellent choice. Its balanced cost-to-performance ratio and higher VRAM make it ideal for users who need to refine workflows with occasional large memory demands.

For users with highly memory-intensive workloads or large-scale batch processing needs, the H100 SXM remains the top choice, with up to 60% better performance on Flux-Dev compared to the RTX 4090, albeit at a premium cost.

What’s Next?

This is just the beginning! We’ll continue testing different GPUs and workflows, expanding this series to include real-world use cases and advanced optimizations.

Stay tuned for the next post in the "One GPU to Rule Them All" series, where we’ll dive into more complex workflows with ControlNets and LoRAs.

Need Help?

Have questions about ComfyUI or GPU selection? Reach out, and we’ll be happy to help you pick the perfect setup for your needs!
