Whisper Local Node Documentation

Overview

The Whisper Local node is part of the ComfyUI LLM Party node suite and provides speech-to-text conversion using models that run locally. It uses the Whisper automatic speech recognition (ASR) model to transcribe audio input into text, and it accepts either audio passed directly into the node or an audio file on the local file system, so it can be integrated into any ComfyUI workflow that requires audio transcription.

Functionality

What This Node Does

The Whisper Local node processes audio input and converts it into text. It performs speech recognition with a locally specified model (typically an OpenAI Whisper checkpoint) and supports real-time audio processing as well as transcription from saved audio files.
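
The core transcription step can be reproduced outside ComfyUI with the Hugging Face transformers ASR pipeline. The snippet below is a minimal sketch, assuming the node wraps this pipeline internally (its exact internals may differ); the file name is a placeholder.

```python
# Minimal sketch of local Whisper transcription via the Hugging Face
# transformers pipeline; the node's actual internals may differ.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",  # the node's default model
)

# "speech.wav" is a placeholder path to any local audio file.
result = asr("speech.wav")
print(result["text"])
```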

Inputs

The Whisper Local node accepts the following inputs (a declaration sketch follows the list):

  1. Model Name or Path (model_name_or_path):

    • Type: String
    • Description: The model to use for transcription, given as a Hugging Face model ID or a local path. Defaults to OpenAI's "openai/whisper-small".
  2. Audio (audio):

    • Type: Audio Object
    • Description: The raw audio data to be transcribed. This input is optional if the audio file path is provided.
  3. Enable (is_enable):

    • Type: Boolean
    • Default: True
    • Description: Controls whether the node runs. If set to False, the node skips all processing and performs no transcription.
  4. Audio File Path (audio_path):

    • Type: String
    • Description: Specifies the path of the audio file to transcribe. If an audio object is provided instead, this path is used to save the transcribed text.
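
To show how these parameters map onto ComfyUI's node API, here is a hedged declaration sketch. The overall structure (INPUT_TYPES, RETURN_TYPES, FUNCTION, CATEGORY) follows standard ComfyUI custom-node conventions, but the exact type names, defaults, method name, and category used by the real node are assumptions.

```python
# Hypothetical declaration of the node's interface in ComfyUI's custom-node API.
class WhisperLocal:
    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {
                "model_name_or_path": ("STRING", {"default": "openai/whisper-small"}),
                "is_enable": ("BOOLEAN", {"default": True}),
            },
            "optional": {
                "audio": ("AUDIO",),                        # raw audio object
                "audio_path": ("STRING", {"default": ""}),  # path to an audio file
            },
        }

    RETURN_TYPES = ("STRING",)
    RETURN_NAMES = ("text",)
    FUNCTION = "transcribe"  # assumed method name
    CATEGORY = "llm_party"   # assumed category
```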

Outputs

The Whisper Local node produces the following output:

  1. Text (text):
    • Type: String
    • Description: The transcribed text produced from the audio input. This is the node's primary output, representing the speech-to-text conversion.
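
Continuing the declaration sketch above, the execution method would return the transcribed text as a one-element tuple, per ComfyUI convention. transcribe_audio below is a hypothetical helper standing in for the actual Whisper call.

```python
# Sketch of the execution method for the declaration above (assumptions noted).
def transcribe(self, model_name_or_path, is_enable=True, audio=None, audio_path=""):
    if not is_enable:
        return (None,)  # disabled: skip all processing
    source = audio if audio is not None else audio_path
    text = transcribe_audio(source, model_name_or_path)  # hypothetical helper
    return (text,)  # ComfyUI outputs are tuples matching RETURN_TYPES
```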

Usage in ComfyUI Workflows

The Whisper Local node can be an integral component of workflows that convert speech into text. Below are some guidelines on how it can be used within ComfyUI; a programmatic example follows the list:

  1. Speech-to-Text Applications: Incorporate this node in applications that need to process and interpret spoken inputs. It can transform spoken commands into actionable text for automation or control systems.

  2. Media Analysis: Use this node to transcribe audio from media files, aiding in subtitling, indexing, or content analysis.

  3. Accessibility Enhancements: Include the node in workflows aimed at improving accessibility for users with hearing impairments by converting spoken words into readable text.
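
As a concrete example of driving such a workflow programmatically, the snippet below queues a one-node graph through ComfyUI's HTTP API. The class_type string "whisper_local", the server address, and the file path are assumptions; check your installation for the real values.

```python
import json
import urllib.request

# Hypothetical one-node workflow graph; "whisper_local" is an assumed class_type.
prompt = {
    "1": {
        "class_type": "whisper_local",
        "inputs": {
            "model_name_or_path": "openai/whisper-small",
            "is_enable": True,
            "audio_path": "/path/to/recording.wav",  # placeholder file
        },
    }
}

req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",  # default ComfyUI address
    data=json.dumps({"prompt": prompt}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).read().decode("utf-8"))
```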

Special Features and Considerations

  • Device Compatibility: The node automatically selects the best available compute device (CUDA for NVIDIA GPUs, MPS for Apple silicon, otherwise CPU), so it performs efficiently on whatever hardware is present; see the sketch after this list.

  • Real-time Transcription: By supporting both live audio input and file-based transcription, the node offers flexibility across speech recognition tasks.

  • Language Configuration: The display name of the node can be configured for different languages based on the system settings or project configuration, making it user-friendly across various locales.

  • Timestamped Records: The node saves audio recordings with a timestamp, which helps in organizing and referencing transcriptions.

  • Enable/Disable Functionality: The node includes an option to disable processing if needed, providing control over its operation within larger workflows.
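
The device selection described above typically follows a simple cascade. This is a sketch of that logic using PyTorch, assuming the node performs the same checks.

```python
import torch

# Pick the best available device: CUDA > MPS (Apple silicon) > CPU.
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"
```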

The Whisper Local node thus offers a robust solution for integrating speech recognition into ComfyUI applications, with flexibility and adaptability for diverse use cases.