Fish Whisper Node Documentation
Overview
The Fish Whisper node is a component of the ComfyUI LLM Party project. This node is designed to perform Automatic Speech Recognition (ASR), converting spoken language in audio files into text format. It leverages the Fish Audio SDK to provide this functionality, allowing users to incorporate speech-to-text conversion as a part of their larger LLM workflows.
Node Functionality
The Fish Whisper node accepts audio either as a file path or as waveform data and outputs the transcribed text. It can be integrated into any ComfyUI workflow that needs speech input converted to text.
Inputs
The Fish Whisper node accepts the following inputs:
Required Inputs
- is_enable: A boolean value (default: True) indicating whether the node should be active. If set to False, the node will not process any inputs and will not produce an output.
- audio_path: A string representing the file path to the audio file that needs transcription.
Optional Inputs
- api_key: A string for the API key required by the Fish Audio SDK. If not provided, the node will attempt to use an API key stored in the configuration file.
- audio: Audio input in the form of waveform data that can be used as an alternative to an audio file path. This option allows for dynamic audio input during workflow execution.
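This page does not describe how the node handles waveform input internally. As a rough illustration only, one plausible way to turn in-memory samples into something a speech-to-text service can accept is to encode them as WAV bytes. The sketch below assumes mono float samples in the range [-1.0, 1.0] held in a plain Python sequence; the function name waveform_to_wav_bytes is hypothetical and is not part of the node or the Fish Audio SDK.

```python
import io
import struct
import wave
from typing import Sequence


def waveform_to_wav_bytes(samples: Sequence[float], sample_rate: int) -> bytes:
    """Encode mono float samples in [-1.0, 1.0] as 16-bit PCM WAV bytes.

    Hypothetical helper for illustration; not the node's actual code.
    """
    # Clamp each sample to [-1.0, 1.0], then scale to the signed 16-bit range.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]

    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)          # mono (assumption)
        wav_file.setsampwidth(2)          # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        # Pack all samples as little-endian signed 16-bit integers.
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return buffer.getvalue()
```

In a real workflow the waveform would more likely arrive as a tensor along with its sample rate, so a conversion to a flat sequence of samples would be needed first.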
Outputs
The Fish Whisper node produces the following output:
- text: A string that contains the transcribed text from the given audio input.
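The inputs and output listed above can be pictured as a ComfyUI-style node declaration. The following is a minimal sketch under stated assumptions, not the project's actual source: the class name FishWhisperSketch, the method name run, the category string, and the injected transcribe callable (a stand-in for the Fish Audio SDK call) are all hypothetical. Returning None when the node is disabled is also an assumption about what "will not produce an output" means in practice.

```python
from typing import Callable, Optional


class FishWhisperSketch:
    """Illustrative stand-in for the node; not the project's actual class."""

    @classmethod
    def INPUT_TYPES(cls):
        # Mirrors the required and optional inputs described in this document.
        return {
            "required": {
                "is_enable": ("BOOLEAN", {"default": True}),
                "audio_path": ("STRING", {"default": ""}),
            },
            "optional": {
                "api_key": ("STRING", {"default": ""}),
                "audio": ("AUDIO",),
            },
        }

    RETURN_TYPES = ("STRING",)
    RETURN_NAMES = ("text",)
    FUNCTION = "run"
    CATEGORY = "examples/asr"  # hypothetical category name

    def __init__(self, transcribe: Optional[Callable[[str], str]] = None):
        # `transcribe` is a hypothetical hook standing in for the real SDK call.
        self._transcribe = transcribe or (lambda path: "")

    def run(self, is_enable=True, audio_path="", api_key="", audio=None):
        # When disabled, skip all processing (assumed to yield an empty output).
        if not is_enable:
            return (None,)
        return (self._transcribe(audio_path),)
```

Injecting the transcription function keeps the sketch runnable without network access or an API key, for example FishWhisperSketch(transcribe=lambda path: "hello").run(True, "a.wav").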
Usage in ComfyUI Workflows
Within ComfyUI workflows, the Fish Whisper node can be incorporated to process speech inputs. It is particularly useful in workflows that involve interaction with Large Language Models (LLMs) where speech input from users needs to be converted to text for further processing and interaction with other nodes.
Example Use Cases
- Voice Command Recognition: In a workflow designed to execute commands based on voice input, the Fish Whisper node can translate spoken commands into text, which can then be interpreted and executed by subsequent nodes.
- Speech Interaction Workflows: In workflows that involve human-computer interaction through speech, this node can facilitate the translation of audio to text, enhancing interaction capabilities.
- Transcription Services: The node can be part of a workflow that needs to transcribe audio files into text, which can then be stored or further processed for insights or records.
Special Features and Considerations
- API Key Management: Ensure that a valid API key is provided either through the api_key input or configured within the configuration file. This is crucial for authentication and access to the Fish Audio API services.
- Audio Formats: The node supports audio input either by file path or waveform data, providing flexibility in how audio data is processed.
- Configuration: Users may want to configure language settings or API keys in the configuration file to automate and customize the node’s behavior within their workflows.
- Enable/Disable Functionality: The is_enable input parameter allows users to control when the node processes data, providing the ability to dynamically enable or disable its function within a larger workflow context.
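The API key fallback described above (use the api_key input if given, otherwise read the configuration file) could look roughly like the sketch below. It assumes an INI-style configuration file; the section name API_KEYS, the option name fish_api_key, and the function resolve_api_key are hypothetical placeholders, since this page does not specify the file's format or key names.

```python
import configparser
from typing import Optional


def resolve_api_key(
    api_key: Optional[str],
    config_path: str,
    section: str = "API_KEYS",      # hypothetical section name
    option: str = "fish_api_key",   # hypothetical option name
) -> str:
    """Return the explicit key if present, else the key stored in the config file."""
    # An explicit, non-blank input takes precedence.
    if api_key and api_key.strip():
        return api_key.strip()

    # Disable interpolation so keys containing '%' are read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(config_path)  # a missing file is silently ignored
    stored = parser.get(section, option, fallback="").strip()
    if not stored:
        raise ValueError("No API key supplied and none found in the configuration file")
    return stored
```

Failing early with a clear error when neither source provides a key tends to be easier to debug than letting an authentication error surface later from the remote service.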
By leveraging the Fish Whisper node, users of ComfyUI can seamlessly integrate speech-to-text capabilities into their workflows, expanding the potential applications of their LLM setups.