SpeechToText

Low-latency push-to-talk speech-to-text for Linux, built on faster-whisper with GPU acceleration and a CPU fallback. It works on Wayland and X11; hotkeys are bound through your window manager or a keybinding tool.

Why this exists

This repo is a cleaned-up, production-ready version of a working F7/F9 push-to-talk workflow. It focuses on:

Fast local transcription (GPU when available, CPU fallback when not)
Configurable capture + output (type, clipboard, file, stdout)
Wayland + X11 support
Distro-agnostic setup with clear dependencies

Quick Start

git clone <your-repo-url>
cd SpeechToText
./scripts/install.sh

# Bind hotkeys (example: Hyprland)
# bind = , F7, exec, stt-ptt press en
# bindr = , F7, exec, stt-ptt release en

If you're using the whisper.cpp backend and want to skip Python setup:

STT_SKIP_PYTHON=1 ./scripts/install.sh

Usage

Push-to-talk commands:

stt-ptt press en
stt-ptt release en

Check status / cleanup:

stt-ptt status
stt-ptt cleanup

Configuration

Copy the sample config:

cp config/config.env.example ~/.config/speech-to-text/config.env

Key settings:

VOICE_MODEL — whisper model (e.g. large-v3-turbo)
VOICE_DEVICE — cuda, cpu, or auto
VOICE_SOURCE — Pulse/PipeWire source name
VOICE_OUTPUT — type, clipboard, file, or stdout
VOICE_ENGINE — faster-whisper (default) or whispercpp
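
As an illustration, a filled-in config.env using only the keys listed above might look like the sketch below. The VOICE_SOURCE value is a made-up placeholder, and the other values are examples taken from this README rather than confirmed defaults.

```shell
# ~/.config/speech-to-text/config.env (example values only)
VOICE_MODEL=large-v3-turbo
VOICE_DEVICE=auto
# Placeholder: replace with a real name from `pactl list sources short`
VOICE_SOURCE=alsa_input.example-mic
VOICE_OUTPUT=type
VOICE_ENGINE=faster-whisper
```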

Find audio sources:

pactl list sources short
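
The short listing is tab-separated, with the source name in the second column. A small helper can pull out the candidate names for VOICE_SOURCE and skip the ".monitor" sources, which capture playback rather than a microphone. The function name list_input_sources is hypothetical and not part of this repo.

```shell
# Print non-monitor source names from `pactl list sources short` output on stdin.
# Assumes the usual tab-separated layout with the source name in column 2.
list_input_sources() {
    awk -F '\t' '$2 != "" && $2 !~ /\.monitor$/ { print $2 }'
}

# Usage (requires pactl):
#   pactl list sources short | list_input_sources
```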

Dependencies

Recording backends (any one works):

pw-record (PipeWire)
ffmpeg (Pulse/PipeWire)
parec (PulseAudio / PipeWire-pulse)
arecord (ALSA)
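
To see which of these backends is installed, a check along the following lines can help. It is a minimal sketch: first_available is a hypothetical helper, and the order in the usage line simply mirrors the list above, which may not be the priority the scripts actually use.

```shell
# Print the first command from the argument list that exists on PATH.
# Returns non-zero if none is found.
first_available() {
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            printf '%s\n' "$cmd"
            return 0
        fi
    done
    return 1
}

# Usage:
#   first_available pw-record ffmpeg parec arecord \
#       || echo "no recording backend found" >&2
```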

Output tools (install what matches your session and VOICE_OUTPUT mode):

Wayland: wtype or ydotool
X11: xdotool
Clipboard: wl-clipboard, xclip, or xsel
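
If you are unsure which typing tool applies, the session type is usually exposed in XDG_SESSION_TYPE. The sketch below maps it to the tools named above; typing_tool_for_session is a hypothetical helper, and it suggests wtype on Wayland even though ydotool is an equally valid choice.

```shell
# Suggest a typing tool for the current session.
# $1: session type ("wayland" or "x11"); defaults to $XDG_SESSION_TYPE.
typing_tool_for_session() {
    session="${1:-${XDG_SESSION_TYPE:-}}"
    case "$session" in
        wayland) echo "wtype" ;;
        x11)     echo "xdotool" ;;
        *)       return 1 ;;   # unknown or unset session type
    esac
}

# Usage:
#   typing_tool_for_session || echo "could not detect session type" >&2
```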

Python:

Python 3.10+
faster-whisper (installed via requirements.txt)

Examples

Hyprland: examples/hyprland.conf
Sway: examples/sway.conf
GNOME: examples/gnome.md
KDE: examples/kde.md
keyd: examples/keyd.conf

whisper.cpp backend (optional)

If you want a Python-free backend, you can use whisper.cpp instead of faster-whisper.

Build or install whisper.cpp and get a .gguf model.
Set these in ~/.config/speech-to-text/config.env:

VOICE_ENGINE=whispercpp
VOICE_WHISPER_CPP_BIN=whisper-cli
VOICE_WHISPER_CPP_MODEL=/path/to/gguf/model.gguf
# Optional:
# VOICE_WHISPER_CPP_ARGS="-nt"

Notes:

The script defaults to using -otxt/-of output if supported.
If your build doesn't support those flags, set VOICE_WHISPER_CPP_OUTPUT_MODE=stdout and add the appropriate output flags to VOICE_WHISPER_CPP_ARGS.

Troubleshooting

Run diagnostics:

./scripts/diagnose.sh

Common issues:

No audio captured → check that VOICE_SOURCE matches a name listed by pactl list sources short
No typing → install wtype (Wayland) or xdotool (X11)
CUDA errors → set VOICE_DEVICE=cpu or install the CUDA libraries that match your setup
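
When chasing CUDA errors, it can help to check whether an NVIDIA driver is visible at all before editing the config. The sketch below is a rough heuristic and not how this project resolves VOICE_DEVICE=auto: resolve_device is a hypothetical helper, and a working nvidia-smi only shows that the driver is present, not that every CUDA library faster-whisper needs is installed.

```shell
# Resolve a VOICE_DEVICE-style value to "cuda" or "cpu".
# Heuristic only: explicit values pass through, "auto" probes nvidia-smi.
resolve_device() {
    case "${1:-auto}" in
        cuda|cpu) printf '%s\n' "$1" ;;
        auto)
            if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
                echo "cuda"
            else
                echo "cpu"
            fi
            ;;
        *) return 1 ;;   # unrecognised value
    esac
}

# Usage:
#   resolve_device auto
```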

Privacy

Audio is recorded to a temporary file in /tmp/speech-to-text and deleted after transcription. Set VOICE_KEEP_AUDIO=1 to keep the last recording for debugging.
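
The behaviour described above can be pictured with a short sketch. This is an illustration, not the project's actual code: cleanup_recording is a hypothetical function, and only the VOICE_KEEP_AUDIO variable comes from this README.

```shell
# Delete a recording unless VOICE_KEEP_AUDIO=1 is set.
# $1: path to the recording.
cleanup_recording() {
    if [ "${VOICE_KEEP_AUDIO:-0}" = "1" ]; then
        return 0   # keep the file for debugging
    fi
    rm -f -- "$1"
}

# Usage:
#   cleanup_recording /tmp/speech-to-text/recording.wav
```

The path in the usage line is a made-up file name under the directory mentioned above.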

License

MIT

