One GPU to Rule Them All: HiDream Performance Testing Across GPUs

In this edition of our "One GPU to Rule Them All" series, we benchmark HiDream, one of the newest and most advanced image generation models available. Released with support for multiple precision formats and step counts, HiDream offers a balance between visual quality and generation speed—making GPU selection more important than ever. We tested six HiDream variants across six GPUs to evaluate both generation speed and VRAM usage, identifying the ideal setup for different needs.

All of these tests were done inside ComfyUI, using our cloud platform InstaSD, which lets anyone run ComfyUI online with powerful GPUs and zero setup. Whether you’re experimenting or building production pipelines, performance matters — and HiDream performance was something we wanted to measure properly.


🚀 What Makes HiDream Different?

HiDream-I1, released by HiDream.ai in April 2025 under the MIT license, is a 17-billion-parameter diffusion model designed for high-performance text-to-image generation. Here’s why it stands out:

  • Hybrid Architecture: Combines Diffusion Transformers (DiT) with Mixture of Experts (MoE). It uses dual-stream MMDiT modules for early multimodal fusion and single-stream DiT modules for global consistency and image refinement.
  • Multimodal Text Encoding: Leverages four text encoders—OpenCLIP ViT-bigG, OpenAI CLIP ViT-L, T5-XXL, and Llama3.1-8B—to parse complex prompts with impressive accuracy.
  • Flexible Routing: Uses dynamic expert activation to allocate GPU resources efficiently, allowing it to process complex scenes while maintaining detail in lighting, colors, and edges.
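HiDream’s exact routing code isn’t public in detail, but the general idea behind dynamic expert activation can be illustrated with a toy top-k gating function. Everything below (names, shapes, the choice of k) is illustrative, not HiDream’s actual implementation:

```python
import numpy as np

def topk_expert_routing(tokens, gate_weights, k=2):
    """Toy Mixture-of-Experts router: for each token, pick the k experts
    with the highest gate scores and normalize their mixing weights.

    tokens:       (n_tokens, d_model) activations
    gate_weights: (d_model, n_experts) learned gating matrix
    Returns (indices, weights) for the selected experts per token.
    """
    logits = tokens @ gate_weights                     # (n_tokens, n_experts)
    topk_idx = np.argsort(logits, axis=1)[:, -k:]      # k best experts per token
    topk_logits = np.take_along_axis(logits, topk_idx, axis=1)
    # Softmax over only the selected experts; the rest stay inactive,
    # which is what saves compute at inference time.
    exp = np.exp(topk_logits - topk_logits.max(axis=1, keepdims=True))
    weights = exp / exp.sum(axis=1, keepdims=True)
    return topk_idx, weights

rng = np.random.default_rng(0)
idx, w = topk_expert_routing(rng.normal(size=(4, 8)), rng.normal(size=(8, 4)), k=2)
print(idx.shape, w.shape)  # (4, 2) (4, 2)
```

Only the chosen experts run for each token, which is how an MoE model can carry 17B parameters without paying the full cost on every forward pass.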

🧪 What We Tested

HiDream-I1 is based on the Diffusion Transformer (DiT) architecture, offering flexibility through three distinct checkpoints: HiDream-I1-Full, HiDream-I1-Dev, and HiDream-I1-Fast. These models vary in both inference speed and output quality:

  • Full: Delivers the highest quality but requires the most resources.
  • Dev: A balanced option suitable for development.
  • Fast: Prioritizes speed with slightly lower visual fidelity.

Each variant supports multiple inference modes, including FP16, BF16, and FP8 precision levels. This modularity makes HiDream adaptable for a wide range of hardware, from consumer-grade GPUs to high-end enterprise setups.
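A quick back-of-the-envelope calculation shows why precision matters so much for a 17-billion-parameter model. This sketch counts weight storage only; real VRAM use is higher because of activations, the four text encoders, and the VAE:

```python
def weight_memory_gb(n_params, precision):
    """Approximate memory for model weights alone, ignoring
    activations, text encoders, and the VAE."""
    bytes_per_param = {"fp16": 2, "bf16": 2, "fp8": 1}[precision]
    return n_params * bytes_per_param / 1e9

N = 17e9  # HiDream-I1 parameter count
for p in ("fp16", "bf16", "fp8"):
    print(f"{p}: ~{weight_memory_gb(N, p):.0f} GB of weights")
# fp16/bf16 weights alone (~34 GB) already exceed a 24GB card,
# while fp8 (~17 GB) leaves headroom -- consistent with our results below.
```

This is exactly why the fp8 variants open HiDream up to consumer-grade 24GB GPUs.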

We benchmarked six HiDream variants:

  1. HiDream Full - 50 Steps
    • FP16 (high precision)
    • FP8 (optimized for speed)
  2. HiDream Dev - 28 Steps
    • BF16 (mid-speed, mid-quality)
    • FP8
  3. HiDream Fast - 16 Steps
    • BF16 (fastest among precision formats)
    • FP8

Note: the bf16 and fp16 variants could not run on 24GB GPUs due to memory limitations. Our test devices ranged from A5000s and A40s to A100s, RTX 4090s, L40s, and H100s.

📊 Generation Time (in Seconds)

Here’s how long each variant took to generate a single image across different GPUs:

| Model Variant | A5000 | A40 | A100 | RTX 4090 | L40 | H100 |
|---|---|---|---|---|---|---|
| Full - 50 steps - fp16 | X | 169s | 52s | X | 83s | 31s |
| Dev - 28 steps - bf16 | X | 105s | 14.8s | X | 22s | 9.1s |
| Fast - 16 steps - bf16 | X | 17s | 8.6s | X | 12.9s | 4.4s |
| Full - 50 steps - fp8 | 138s | 106s | 51s | 74s | 83s | 31s |
| Dev - 28 steps - fp8 | 34s | 30s | 14.8s | 20s | 24s | 9.1s |
| Fast - 16 steps - fp8 | 10s | 9.4s | 5.2s | 6.8s | 6.7s | 2.6s |

Below are bar charts visualizing the generation times for each HiDream model variant across different GPUs:

HiDream Full Performance
HiDream Dev Performance
HiDream Fast Performance

Observations:

  • HiDream Full (fp16) is the slowest but highest quality, taking up to 169 seconds on the A40.
  • HiDream Dev (bf16/fp8) offers a great middle ground, especially on A100 and H100 cards.
  • HiDream Fast (bf16/fp8) delivers incredible speed — just 4.4 seconds on an H100.

This makes the Fast fp8 version perfect for real-time iterations, while Dev fp8 is ideal for high-speed, decent-quality generation.
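Using the measured times from the table above, we can quantify exactly how much fp8 buys you on the GPUs that could run both precisions (a small script over our own numbers):

```python
# Measured generation times (seconds) from the table above.
# "hi" = the high-precision run (fp16 for Full, bf16 for Dev).
times = {
    "Full": {"hi":  {"A40": 169, "A100": 52,   "L40": 83, "H100": 31},
             "fp8": {"A40": 106, "A100": 51,   "L40": 83, "H100": 31}},
    "Dev":  {"hi":  {"A40": 105, "A100": 14.8, "L40": 22, "H100": 9.1},
             "fp8": {"A40": 30,  "A100": 14.8, "L40": 24, "H100": 9.1}},
}

for variant, t in times.items():
    for gpu in t["hi"]:
        speedup = t["hi"][gpu] / t["fp8"][gpu]
        print(f"{variant} on {gpu}: fp8 is {speedup:.1f}x the high-precision speed")
```

Note the pattern this exposes: on the A100 and H100, fp8 times match the high-precision runs almost exactly, so fp8’s benefit on those cards is memory headroom rather than speed, while on the A40 it delivers a real 1.6x (Full) to 3.5x (Dev) speedup.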

VRAM Insights:

  • HiDream Full demands up to 48GB of VRAM, making it suitable only for top-tier cards.
  • HiDream Dev and Fast variants stay around 16–32GB, depending on precision.
  • If you want to run HiDream online, fp8 variants are your best bet for broader GPU compatibility and faster performance.
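The compatibility pattern in our table (the "X" entries) can be captured in a few lines. The VRAM thresholds below are rough approximations from our runs, not official requirements, and we assume the 80GB A100 (the 40GB model may be tighter for Full in high precision):

```python
GPU_VRAM_GB = {"A5000": 24, "RTX 4090": 24, "A40": 48, "L40": 48,
               "A100": 80,  # assuming the 80GB A100; the 40GB model is tighter
               "H100": 80}

def can_run(precision, gpu):
    """In our tests every fp8 variant fit on a 24GB card, while the
    fp16/bf16 variants all needed more than 24GB. Thresholds are
    approximate, based on observed behavior rather than a spec."""
    min_vram = 24 if precision == "fp8" else 32
    return GPU_VRAM_GB[gpu] >= min_vram

for gpu in GPU_VRAM_GB:
    print(f"{gpu}: fp16/bf16 {can_run('fp16', gpu)} | fp8 {can_run('fp8', gpu)}")
```

This reproduces the table’s layout: the two 24GB cards (A5000, RTX 4090) are fp8-only, while everything from 48GB up runs every variant we tested.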

Key Observations

  • FP8 formats are the most versatile, working across all GPUs tested—including the 24GB A5000 and 4090.
  • H100 outperforms all others across every model variant with extremely fast generation times.
  • A100 consistently performs well, especially with BF16 and FP8, offering a great balance between speed and compatibility.
  • A5000 and RTX 4090, while limited to FP8, still deliver acceptable results for fast inference.
  • L40 and A40 support more variants but show slower generation times relative to A100 and H100.

Choosing the Best GPU for HiDream

Best Overall Speed

H100 – Unmatched speed across all variants, including FP16.

Best Mid-Range Option

A100 – Excellent performance with broad compatibility, especially with BF16/FP8.

Best for Speed-Conscious Developers

RTX 4090 or A5000 – Solid for FP8 fast/dev models when you're limited to 24GB VRAM.

🌐 Try HiDream Online in ComfyUI

You can try any of these HiDream variants online in ComfyUI with just a few clicks — no installation needed.

👉 Run ComfyUI online with HiDream

Whether you’re working on a creative project or stress-testing a production pipeline, InstaSD gives you instant access to powerful GPUs and preloaded workflows.
