One GPU to Rule Them All: HiDream Performance Testing Across GPUs
In this edition of our "One GPU to Rule Them All" series, we benchmark HiDream, one of the newest and most advanced image generation models available. Released with support for multiple precision formats and step counts, HiDream offers a balance between visual quality and generation speed—making GPU selection more important than ever. We tested six HiDream variants across six GPUs to evaluate both generation speed and VRAM usage, identifying the ideal setup for different needs.
All of these tests were done inside ComfyUI, using our cloud platform InstaSD, which lets anyone run ComfyUI online with powerful GPUs and zero setup. Whether you're experimenting or building production pipelines, performance matters, and HiDream performance was something we wanted to measure properly.
🚀 What Makes HiDream Different?
HiDream-I1, released by HiDream.ai in April 2025 under the MIT license, is a 17-billion-parameter diffusion model designed for high-performance text-to-image generation. Here’s why it stands out:
Hybrid Architecture: Combines Diffusion Transformers (DiT) with Mixture of Experts (MoE). It uses dual-stream MMDiT modules for early multimodal fusion and single-stream DiT modules for global consistency and image refinement.
Multimodal Text Encoding: Leverages four text encoders—OpenCLIP ViT-bigG, OpenAI CLIP ViT-L, T5-XXL, and Llama3.1-8B—to parse complex prompts with impressive accuracy.
Flexible Routing: Uses dynamic expert activation to allocate GPU resources efficiently, allowing it to process complex scenes while maintaining detail in lighting, colors, and edges.
🧪 What We Tested
HiDream-I1 is based on the Diffusion Transformer (DiT) architecture, offering flexibility through three distinct checkpoints: HiDream-I1-Full, HiDream-I1-Dev, and HiDream-I1-Fast. These models vary in both inference speed and output quality:
Full: Delivers the highest quality but requires the most resources.
Dev: A balanced option suitable for development.
Fast: Prioritizes speed with slightly lower visual fidelity.
Each variant supports multiple inference modes, including FP16, BF16, and FP8 precision levels. This modularity makes HiDream adaptable for a wide range of hardware, from consumer-grade GPUs to high-end enterprise setups.
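To see why precision matters so much for GPU compatibility, it helps to do the back-of-the-envelope math: at 17 billion parameters, FP16/BF16 weights take 2 bytes each while FP8 takes 1 byte. The sketch below estimates the VRAM needed just to hold the weights (text encoders and activations add more on top, which is why real-world usage runs higher):

```python
def weight_vram_gb(params_billion: float, bytes_per_param: float) -> float:
    """Rough VRAM needed just to hold the model weights, in GB."""
    return params_billion * 1e9 * bytes_per_param / 1024**3

# HiDream-I1 has ~17B parameters.
PARAMS_B = 17
# fp16/bf16 store 2 bytes per parameter, fp8 stores 1.
for name, nbytes in [("fp16/bf16", 2), ("fp8", 1)]:
    print(f"{name}: ~{weight_vram_gb(PARAMS_B, nbytes):.0f} GB for weights alone")
```

The ~32 GB figure for FP16/BF16 weights alone makes it clear why those variants cannot fit on a 24GB card, while the ~16 GB FP8 footprint leaves headroom even on an A5000 or RTX 4090.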
We benchmarked six HiDream variants:
HiDream Full - 50 Steps: FP16 (high precision) and FP8 (optimized for speed)
HiDream Dev - 28 Steps: BF16 (mid-speed, mid-quality) and FP8
HiDream Fast - 16 Steps: BF16 (fastest among the precision formats) and FP8
Note: the BF16 and FP16 variants could not run on 24GB GPUs (A5000, RTX 4090) due to memory limitations. Our test devices ranged from A5000s and A40s to A100s, RTX 4090s, L40s, and H100s.
📊 Generation Time (in Seconds)
Here’s how long each variant took to generate a single image across different GPUs:
| Model Variant | A5000 | A40 | A100 | RTX 4090 | L40 | H100 |
| --- | --- | --- | --- | --- | --- | --- |
| Full - 50 steps - fp16 | X | 169s | 52s | X | 83s | 31s |
| Dev - 28 steps - bf16 | X | 105s | 14.8s | X | 22s | 9.1s |
| Fast - 16 steps - bf16 | X | 17s | 8.6s | X | 12.9s | 4.4s |
| Full - 50 steps - fp8 | 138s | 106s | 51s | 74s | 83s | 31s |
| Dev - 28 steps - fp8 | 34s | 30s | 14.8s | 20s | 24s | 9.1s |
| Fast - 16 steps - fp8 | 10s | 9.4s | 5.2s | 6.8s | 6.7s | 2.6s |

X = could not run due to insufficient VRAM.
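The raw timings are easier to compare as relative speedups. The sketch below encodes the fp8 rows from the table above (the only configurations that ran on every card) and computes how much faster the quickest GPU is than the slowest:

```python
# Generation times in seconds for the fp8 variants, taken from the table above.
FP8_TIMES = {
    "Full - 50 steps": {"A5000": 138, "A40": 106, "A100": 51, "RTX 4090": 74, "L40": 83, "H100": 31},
    "Dev - 28 steps":  {"A5000": 34, "A40": 30, "A100": 14.8, "RTX 4090": 20, "L40": 24, "H100": 9.1},
    "Fast - 16 steps": {"A5000": 10, "A40": 9.4, "A100": 5.2, "RTX 4090": 6.8, "L40": 6.7, "H100": 2.6},
}

for variant, times in FP8_TIMES.items():
    fastest_gpu = min(times, key=times.get)
    speedup = max(times.values()) / min(times.values())
    print(f"{variant}: {fastest_gpu} is {speedup:.1f}x faster than the slowest card")
```

In every fp8 row the H100 comes out on top, roughly 3.5-4.5x faster than the A5000 at the other end of the range.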
Below are bar charts visualizing the generation times for each HiDream model variant across different GPUs:
HiDream Full Performance
HiDream Dev Performance
HiDream Fast Performance
Observations:
HiDream Full (fp16) is the slowest but highest quality, taking up to 169 seconds on the A40.
HiDream Dev (bf16/fp8) offers a great middle ground, especially on A100 and H100 cards.
HiDream Fast (bf16/fp8) delivers incredible speed — just 4.4 seconds on an H100.
This makes the Fast fp8 version perfect for real-time iterations, while Dev fp8 is ideal for high-speed, decent-quality generation.
VRAM Insights:
HiDream Full demands up to 48GB VRAM, making it suitable only for top-tier cards.
HiDream Dev and Fast variants stay around 16–32GB, depending on precision.
If you want to run HiDream online, fp8 variants are your best bet for broader GPU compatibility and faster performance.
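The VRAM figures above translate directly into a compatibility check. Here is a hedged sketch using the nominal memory sizes of the cards we tested (note that the A100 also ships in a 40GB configuration; we assume the 80GB model here):

```python
# Nominal VRAM per tested GPU, in GB (assuming the 80GB A100).
GPU_VRAM_GB = {"A5000": 24, "RTX 4090": 24, "A40": 48, "L40": 48, "A100": 80, "H100": 80}

def compatible_gpus(required_gb: float) -> list[str]:
    """Return the tested GPUs with at least `required_gb` of VRAM."""
    return sorted(g for g, v in GPU_VRAM_GB.items() if v >= required_gb)

# HiDream Full needs up to 48GB, so the 24GB cards drop out:
print(compatible_gpus(48))
```

This matches the X entries in the benchmark table: the A5000 and RTX 4090 only appear in the fp8 rows.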
Key Observations
FP8 formats are the most versatile, working across all GPUs tested—including the 24GB A5000 and 4090.
H100 outperforms all others across every model variant with extremely fast generation times.
A100 consistently performs well, especially with BF16 and FP8, offering a great balance between speed and compatibility.
A5000 and RTX 4090, while limited to FP8, still deliver acceptable results for fast inference.
L40 and A40 support more variants but show slower generation times relative to A100 and H100.
Choosing the Best GPU for HiDream
Best Overall Speed
H100 – Unmatched speed across all variants, including FP16.
Best Mid-Range Option
A100 – Excellent performance with broad compatibility, especially with BF16/FP8.
Best for Speed-Conscious Developers
RTX 4090 or A5000 – Solid for FP8 fast/dev models when you're limited to 24GB VRAM.
🌐 Try HiDream Online in ComfyUI
You can try any of these HiDream variants online in ComfyUI with just a few clicks — no installation needed.
Whether you’re working on a creative project or stress-testing a production pipeline, InstaSD gives you instant access to powerful GPUs and preloaded workflows.
Launch it now!
Want to jump right into generating incredible images and videos? We've prepared this and many more workflows to get you started.