Low-latency push-to-talk speech-to-text for Linux, built on faster-whisper with
GPU acceleration and clean fallbacks. Works on Wayland and X11, with hotkey
bindings provided by your window manager or keybinding tool.
This repo is a cleaned, production-ready version of a working F7/F9 push-to-talk workflow. It focuses on:
- Fast local transcription (GPU when available, CPU fallback when not)
- Configurable capture + output (type, clipboard, file, stdout)
- Wayland + X11 support
- Distro-agnostic setup with clear dependencies
git clone <your-repo-url>
cd SpeechToText
./scripts/install.sh
# Bind hotkeys (example: Hyprland)
# bind = , F7, exec, stt-ptt press en
# bindr = , F7, exec, stt-ptt release en
If you're using the whisper.cpp backend and want to skip Python setup:
STT_SKIP_PYTHON=1 ./scripts/install.sh
Push-to-talk commands:
stt-ptt press en
stt-ptt release en
Check status / cleanup:
stt-ptt status
stt-ptt cleanup
Copy the sample config:
cp config/config.env.example ~/.config/speech-to-text/config.env
Key settings:
- VOICE_MODEL: whisper model (e.g. large-v3-turbo)
- VOICE_DEVICE: cuda, cpu, or auto
- VOICE_SOURCE: Pulse/PipeWire source name
- VOICE_OUTPUT: type, clipboard, file, or stdout
- VOICE_ENGINE: faster-whisper (default) or whispercpp
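As a rough illustration, a config.env built from the settings above might look like the following. The values are examples rather than project defaults, and the VOICE_SOURCE value is a placeholder that has to be replaced with a source name from your own system.

```sh
# ~/.config/speech-to-text/config.env
# Illustrative values only; adjust each one to your setup.

# Whisper model name
VOICE_MODEL=large-v3-turbo
# cuda, cpu, or auto
VOICE_DEVICE=auto
# Placeholder (assumption): substitute a real Pulse/PipeWire source name
VOICE_SOURCE=your-source-name
# type, clipboard, file, or stdout
VOICE_OUTPUT=type
# faster-whisper or whispercpp
VOICE_ENGINE=faster-whisper
```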
Find audio sources:
pactl list sources short
Recording backends (any one works):
- pw-record (PipeWire)
- ffmpeg (Pulse/PipeWire)
- parec (PulseAudio / PipeWire-pulse)
- arecord (ALSA)
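To see which of the recording backends above is present on a machine, a small probe like the one below can help. This is a minimal sketch: first_available is a hypothetical helper, not part of the project's scripts, and the probing order simply follows the list above, which may not match the order the project itself uses.

```sh
#!/bin/sh
# first_available: print the first argument that resolves to a command.
# Hypothetical helper, not taken from the project's scripts.
first_available() {
    for candidate in "$@"; do
        if command -v "$candidate" >/dev/null 2>&1; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Probe the backends in the order they are listed above (an assumption).
if recorder=$(first_available pw-record ffmpeg parec arecord); then
    echo "recording backend: $recorder"
else
    echo "no recording backend found" >&2
fi
```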
Output tools (choose one):
- Wayland: wtype or ydotool
- X11: xdotool
- Clipboard: wl-clipboard, xclip, or xsel
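The Wayland/X11 split above can be sketched as a lookup keyed on the session type. pick_typing_tool is a hypothetical helper, and reading XDG_SESSION_TYPE (with x11 as the fallback when it is unset) is an assumption about how a session might be detected; the project's own logic may differ.

```sh
#!/bin/sh
# pick_typing_tool: choose a typing tool for a session type ("wayland" or "x11").
# Hypothetical helper; the project's own selection logic may differ.
pick_typing_tool() {
    case "$1" in
        wayland) candidates="wtype ydotool" ;;
        x11)     candidates="xdotool" ;;
        *)       return 1 ;;
    esac
    # $candidates is intentionally unquoted so it splits into separate names.
    for tool in $candidates; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            return 0
        fi
    done
    return 1
}

# XDG_SESSION_TYPE is set by most login managers; fall back to x11 if unset.
session="${XDG_SESSION_TYPE:-x11}"
if tool=$(pick_typing_tool "$session"); then
    echo "typing tool: $tool"
else
    echo "no typing tool found for session type: $session" >&2
fi
```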
Python:
- Python 3.10+
- faster-whisper (installed via requirements.txt)
- Hyprland: examples/hyprland.conf
- Sway: examples/sway.conf
- GNOME: examples/gnome.md
- KDE: examples/kde.md
- keyd: examples/keyd.conf
If you want a Python-free backend, you can use whisper.cpp instead of
faster-whisper.
- Build or install whisper.cpp and get a .gguf model.
- Set these in ~/.config/speech-to-text/config.env:
VOICE_ENGINE=whispercpp
VOICE_WHISPER_CPP_BIN=whisper-cli
VOICE_WHISPER_CPP_MODEL=/path/to/gguf/model.gguf
# Optional:
# VOICE_WHISPER_CPP_ARGS="-nt"
Notes:
- The script defaults to using -otxt/-of output if supported.
- If your build doesn't support those flags, set VOICE_WHISPER_CPP_OUTPUT_MODE=stdout and add the right output flags in VOICE_WHISPER_CPP_ARGS.
Run diagnostics:
./scripts/diagnose.sh
Common issues:
- No audio captured → check VOICE_SOURCE against the output of pactl list sources short
- No typing → install wtype (Wayland) or xdotool (X11)
- CUDA errors → set VOICE_DEVICE=cpu or install the correct CUDA libraries
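When chasing CUDA errors, it can help to reason about what VOICE_DEVICE=auto would plausibly resolve to. The sketch below is an assumption about that behaviour, not the project's code: resolve_device is a hypothetical helper, and counting lines from nvidia-smi -L only shows that a GPU is visible, not that the CUDA libraries faster-whisper needs are installed.

```sh
#!/bin/sh
# resolve_device: map a VOICE_DEVICE value plus a GPU count to "cuda" or "cpu".
# Hypothetical helper for reasoning about the fallback; not the project's code.
resolve_device() {
    _requested="$1"
    _count="$2"
    case "$_requested" in
        cpu)  echo cpu ;;
        cuda) echo cuda ;;
        auto)
            if [ "$_count" -gt 0 ]; then echo cuda; else echo cpu; fi
            ;;
        *)    return 1 ;;
    esac
}

# nvidia-smi -L prints one "GPU n: ..." line per visible NVIDIA GPU.
# If nvidia-smi is missing, grep counts zero lines and the result is 0.
gpu_count=$(nvidia-smi -L 2>/dev/null | grep -c '^GPU ' || true)
resolve_device "${VOICE_DEVICE:-auto}" "${gpu_count:-0}" \
    || echo "unrecognised VOICE_DEVICE value" >&2
```

If this prints cuda but transcription still fails with CUDA errors, the GPU is visible and the missing piece is more likely the CUDA libraries; setting VOICE_DEVICE=cpu sidesteps the problem.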
Audio is recorded to a temporary file in /tmp/speech-to-text and deleted after
transcription. Set VOICE_KEEP_AUDIO=1 to keep the last recording for debugging.
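The keep-or-delete behaviour described above can be sketched as follows. finish_recording is a hypothetical helper written for illustration, and the demo uses a throwaway mktemp file rather than a real recording under /tmp/speech-to-text.

```sh
#!/bin/sh
# finish_recording: delete a recording unless VOICE_KEEP_AUDIO=1.
# Hypothetical helper sketching the behaviour described above.
finish_recording() {
    audio_file="$1"
    if [ "${VOICE_KEEP_AUDIO:-0}" = "1" ]; then
        echo "kept: $audio_file"
    else
        rm -f -- "$audio_file"
    fi
}

# Demo with a throwaway file standing in for a real recording.
demo_file=$(mktemp)
VOICE_KEEP_AUDIO=1
finish_recording "$demo_file"   # file survives
VOICE_KEEP_AUDIO=0
finish_recording "$demo_file"   # file is removed
```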
License: MIT