This article is part of the "One GPU to Rule Them All" series, where we test GPUs under real-world AI workloads. This time, we examined HunyuanVideo using different GPUs and applied optimizations like Sage Attention, FP8 Quantization, and Triton + Torch Compile to evaluate their impact. Our findings can help users who are running Hunyuan on ComfyUI and looking for the best GPU for their workflow.
Test Setup
For consistency, we applied the following parameters to all runs:
Model: Hunyuan 720_cfgdistill_fp8
LoRA: Hunyuan fast video LoRA
Teacache: Enabled (We chose to enable TeaCache because it skips redundant computation by caching and reusing intermediate model outputs across denoising steps, and we consider it a sensible default for most workflows.)
Additionally, we opted for an FP8 quantized base model to ensure compatibility across all tested GPUs. This choice also aligns with what most users will be able to run, providing a practical benchmark for real-world usage.
GPUs Tested & Optimization Methods
Each GPU was tested under four conditions:
Base Run - Standard execution of the FP8-quantized base model with no additional optimizations.
Sage Attention - A drop-in attention kernel that quantizes parts of the attention computation to lower precision, speeding up inference and reducing memory traffic during each denoising step.
Sage Attention + FP8 Quantization - Used fp8_e4m3fn_fast quantization. This method is only available on GPUs with Compute Capability 8.9+ (RTX 4090, L40, H100 SXM); older GPUs (A5000, A40, A100) lack the required FP8 tensor cores, so it could not be tested on them.
Sage Attention + FP8 Quantization + Triton + Torch Compile - Added Triton and Torch Compile on top of the previous stack to fuse and optimize kernel execution, further improving inference efficiency on supported hardware.
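To illustrate how the FP8 availability described above is gated in practice, here is a minimal sketch (not part of our test harness; the helper names are our own) that decides which quantization mode a GPU can use based on its CUDA compute capability. On a live system you would obtain the capability from torch.cuda.get_device_capability() rather than a lookup table.

```python
def supports_fp8_fast(major: int, minor: int) -> bool:
    """FP8 tensor-core math requires compute capability 8.9 or newer."""
    return (major, minor) >= (8, 9)

# CUDA compute capabilities of the GPUs in this test
CAPABILITIES = {
    "A5000": (8, 6),
    "A40": (8, 6),
    "A100": (8, 0),
    "RTX 4090": (8, 9),
    "L40": (8, 9),
    "H100 SXM": (9, 0),
}

def pick_quantization(gpu: str) -> str:
    """Fall back to plain fp8_e4m3fn weights when fast FP8 math is unsupported."""
    major, minor = CAPABILITIES[gpu]
    return "fp8_e4m3fn_fast" if supports_fp8_fast(major, minor) else "fp8_e4m3fn"

for gpu in CAPABILITIES:
    print(f"{gpu}: {pick_quantization(gpu)}")
```

Note that all six GPUs can load the FP8-quantized model weights; only the 8.9+ cards can additionally execute the fp8_e4m3fn_fast math path tested in conditions three and four.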
The following GPUs were tested:
A5000
A40
A100
RTX 4090
L40
H100 SXM
Performance Analysis
Understanding how different GPUs handle HunyuanVideo workflows is crucial when selecting the right hardware for your needs. Our testing reveals distinct performance differences across GPUs, with optimizations playing a significant role in reducing rendering times. Below, we dive deeper into each test result to highlight key insights and practical takeaways.
Baseline (No Optimization) Results
Without any optimizations, we measured the raw compute performance of each GPU to establish a baseline. This test shows how each GPU performs under default conditions, without additional acceleration techniques. The results reveal that the H100 SXM offers the best out-of-the-box performance, while the A5000 and A40 lag significantly behind.
Baseline Average Runtimes:
H100 SXM: 36.7s (fastest)
A100: 73.2s
RTX 4090: 77.3s
L40: 87.4s
A40: 115.9s
A5000: 139.6s (slowest)
Baseline Performance
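For a quick sense of scale, the snippet below (our own illustration, using the averages listed above) normalizes each baseline runtime against the fastest card, the H100 SXM:

```python
# Average baseline runtimes in seconds, from the results above
baseline = {
    "H100 SXM": 36.7,
    "A100": 73.2,
    "RTX 4090": 77.3,
    "L40": 87.4,
    "A40": 115.9,
    "A5000": 139.6,
}

fastest = min(baseline.values())  # 36.7s on the H100 SXM
for gpu, seconds in sorted(baseline.items(), key=lambda kv: kv[1]):
    print(f"{gpu:>9}: {seconds:6.1f}s  ({seconds / fastest:.2f}x the H100 SXM)")
```

Even before any optimizations, the spread is stark: the A5000 takes roughly 3.8x as long as the H100 SXM for the same workload.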
Impact of Optimizations
Applying optimizations significantly altered the performance landscape across GPUs. Our tests show that combining Sage Attention, FP8 Quantization, and Triton + Torch Compile leads to major speed improvements, particularly on high-end GPUs like the H100 SXM, RTX 4090, and L40.
Sage Attention: By improving memory efficiency, it resulted in noticeable speedups across all GPUs, with the most significant impact observed on A100 and RTX 4090, reducing runtimes by 22-26%.
FP8 Quantization: Only available on GPUs with Compute Capability 8.9+ (RTX 4090, L40, H100 SXM), this optimization further reduced rendering times by utilizing FP8 tensor cores for faster calculations.
Triton + Torch Compile: This final optimization step, when combined with Sage Attention and FP8 Quantization, resulted in the fastest observed runtimes. On the H100 SXM, the fully optimized workflow reduced inference time to 18.8s, a nearly 50% improvement from the baseline run. This demonstrates the importance of leveraging multiple optimizations together to maximize efficiency.
Performance Comparison Across Optimizations
Optimization Stack and Percentage Reductions
Since optimizations are applied sequentially, we also analyzed how much each step contributed to the final performance gains. The stacked bar chart below illustrates the cumulative impact of each optimization as a percentage reduction from the baseline.
H100 SXM saw nearly 50% total runtime reduction when using all optimizations.
RTX 4090 and L40 showed strong improvements, with FP8 quantization contributing significantly to their performance.
Older GPUs like A5000 and A40 did not benefit from FP8 acceleration, resulting in lower total gains.
Stacked Impact of Optimizations
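The headline H100 SXM figure can be sanity-checked with simple arithmetic on the two runtimes reported earlier (36.7s baseline, 18.8s fully optimized):

```python
def pct_reduction(baseline_s: float, optimized_s: float) -> float:
    """Percentage runtime reduction relative to the baseline."""
    return (baseline_s - optimized_s) / baseline_s * 100

# H100 SXM: baseline vs. Sage Attention + FP8 + Triton/Torch Compile
print(f"{pct_reduction(36.7, 18.8):.1f}% faster")  # prints "48.8% faster"
```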
Choosing the Best GPU for HunyuanVideo
Selecting the right GPU depends on your specific requirements, budget, and workload demands. However, optimizations play a crucial role regardless of the GPU you choose. Without optimizations, even the highest-end GPUs can underperform compared to an optimized mid-tier GPU.
Best Overall Performance: H100 SXM – Fastest execution, with a nearly 2x speedup under full optimizations. If speed is your priority, this is the best choice.
Best Value for Performance: RTX 4090 – Competitive speeds at a more affordable cost. With the right optimizations, it can close the gap with more expensive options.
Balanced Performance for Cloud Users: L40 – A great option for users who need a balance of speed and memory efficiency, especially in cloud environments.
Avoid for HunyuanVideo: A5000 & A40 – High runtimes, no FP8 support, and poor performance on large workloads. Even with optimizations, these GPUs cannot compete with newer hardware.
Regardless of which GPU you choose, applying the right optimizations is key to maximizing performance. A well-optimized workflow can significantly reduce rendering times and improve efficiency, allowing users to get the most out of their hardware investments.
All of these GPUs are available for one-click deployment of ComfyUI workflows on InstaSD. Users can run their workflows online, deploy them as APIs, and instantly access high-performance GPUs without requiring manual configuration. This makes it easy to experiment with different optimizations and GPU choices to find the best balance of speed, efficiency, and cost for their AI video generation needs.
Launch it now!
Want to jump right into generating incredible images and videos? We've prepared this and many more workflows to get you started.