A fast, lightweight REST API for text summarization using modern semantic understanding with Sentence-BERT. Built for the LLM-speech browser extension, but works as a standalone service.
- Modern Semantic Understanding: Uses Sentence-BERT contextual embeddings, a substantial quality improvement over static Word2Vec/TF-IDF representations
- Fast Response: ~500ms processing time on CPU
- Lightweight: ~100MB RAM footprint, perfect for free-tier hosting
- Open Source: Apache 2.0 licensed, no API costs
- Web Demo: Interactive interface for testing
- Production Ready: Includes error handling, CORS, health checks, and monitoring
- Python 3.11+
- pip
1. Clone the repository

```bash
git clone <your-repo-url>
cd speech-summarizer-api
```

2. Create a virtual environment

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

3. Install dependencies

```bash
pip install -r requirements.txt
```

4. Download NLTK data

```bash
python -c "import nltk; nltk.download('punkt'); nltk.download('punkt_tab')"
```

5. Run the server

```bash
python app.py
```

The API will be available at http://localhost:5000.
Open your browser and navigate to:
http://localhost:5000
Base URLs:

- Local: http://localhost:5000
- Production: https://your-app.onrender.com (after deployment)
### `POST /summarize`

Generate a summary from input text.

Request:

```json
{
  "text": "Your long text here...",
  "summary_length": 3,
  "min_text_length": 500
}
```

`summary_length` (default: 3) and `min_text_length` (default: 500) are optional.

Response (success):

```json
{
  "status": "success",
  "original_length": 1250,
  "summary_length": 320,
  "summary_text": "Summary goes here...",
  "processing_time_ms": 234.56,
  "compression_ratio": 0.256
}
```

Response (text too short):

```json
{
  "status": "skipped",
  "message": "Text too short for summarization",
  "original_length": 300,
  "original_text": "..."
}
```

Response (error):

```json
{
  "status": "error",
  "message": "Error description",
  "error_type": "ValidationError"
}
```

Example with curl:

```bash
curl -X POST http://localhost:5000/summarize \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Your long text here...",
    "summary_length": 3
  }'
```

### `GET /health`

Health check endpoint for monitoring.
Response:

```json
{
  "status": "healthy",
  "version": "1.0.0",
  "model": "sentence-transformers/all-MiniLM-L6-v2",
  "model_loaded": true,
  "embedding_dimension": 384
}
```

### `GET /model-info`

Get detailed model information.
Response:

```json
{
  "model_name": "sentence-transformers/all-MiniLM-L6-v2",
  "embedding_dimension": 384,
  "max_sequence_length": 256,
  "parameters": "22M",
  "license": "Apache-2.0",
  "type": "extractive_with_contextual_embeddings"
}
```

Usage from JavaScript (e.g., in the browser extension):

```javascript
async function summarizeText(text) {
  try {
    const response = await fetch('https://your-api.onrender.com/summarize', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        text: text,
        summary_length: 3,
        min_text_length: 500
      })
    });

    const data = await response.json();

    if (data.status === 'success') {
      return data.summary_text;
    } else {
      console.warn('Summarization skipped:', data.message);
      return text; // Fall back to the original text
    }
  } catch (error) {
    console.error('Summarization failed:', error);
    return text; // Fall back to the original text
  }
}
```

Deploy to Render:

1. Push to GitHub
```bash
git add .
git commit -m "Initial commit"
git push origin main
```

2. Connect to Render

- Go to render.com
- Click "New +" → "Web Service"
- Connect your GitHub repository
- Render will auto-detect `render.yaml` and configure everything

3. Deploy

- Click "Create Web Service"
- Wait 5-10 minutes for the first deployment
- Your API will be live at https://your-app-name.onrender.com
Free Tier Limitations:
- Spins down after 15 minutes of inactivity
- First request after sleep: 20-60 seconds (cold start)
- 512MB RAM
Upgrade to Paid ($7/month):
- Always-on (no cold starts)
- 2GB RAM
- Better performance
Docker:

```bash
# Build image
docker build -t speech-summarizer-api .

# Run container
docker run -p 8080:8080 speech-summarizer-api

# Test
curl http://localhost:8080/health
```

Google Cloud Run:

```bash
gcloud run deploy speech-summarizer \
  --source . \
  --platform managed \
  --region us-central1 \
  --allow-unauthenticated \
  --memory 1Gi
```

Fly.io:

```bash
flyctl launch
flyctl deploy
```

Project structure:

```
speech-summarizer-api/
├── app.py              # Flask application
├── summarizer.py       # Sentence-BERT summarizer
├── requirements.txt    # Python dependencies
├── runtime.txt         # Python version
├── Procfile            # Heroku/Railway config
├── render.yaml         # Render.com config
├── Dockerfile          # Docker config
├── .env.example        # Environment variables template
├── templates/
│   └── index.html      # Web demo interface
├── static/
│   ├── style.css       # Demo page styles
│   └── script.js       # Demo page logic
├── .gitignore
├── .dockerignore
└── README.md
```
- Flask: Web framework
- Sentence-BERT: Modern contextual embeddings
- NLTK: Sentence tokenization
- scikit-learn: Similarity calculations
- NumPy: Numerical operations
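For reference, a requirements.txt for this stack might look like the following (the package list is inferred from the dependencies above plus the Procfile's gunicorn setup; it is illustrative, not the project's actual pinned file):

```
flask
flask-cors
sentence-transformers
nltk
scikit-learn
numpy
gunicorn
```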
- Tokenize: Split text into sentences using NLTK
- Embed: Generate 384-dimensional contextual embeddings for each sentence
- Analyze: Calculate document centroid (average of all sentence embeddings)
- Rank: Find sentences most similar to document centroid
- Select: Return top N sentences in original order
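The five steps above can be sketched in a few lines. This is a toy illustration of the centroid-ranking logic, not the service's actual summarizer.py: random vectors stand in for real Sentence-BERT embeddings, and sentence splitting is assumed already done.

```python
import numpy as np

def rank_sentences(sentences, embeddings, summary_length=3):
    """Pick the sentences whose embeddings are closest (cosine) to the document centroid."""
    emb = np.asarray(embeddings, dtype=float)
    centroid = emb.mean(axis=0)                            # Analyze: average of all sentence embeddings
    norms = np.linalg.norm(emb, axis=1) * np.linalg.norm(centroid)
    sims = emb @ centroid / np.clip(norms, 1e-12, None)    # Rank: cosine similarity to the centroid
    top = sorted(np.argsort(sims)[::-1][:summary_length])  # Select: top N, restored to original order
    return [sentences[i] for i in top]

# Toy demo: random 384-d vectors stand in for SBERT embeddings
rng = np.random.default_rng(0)
sentences = [f"Sentence {i}." for i in range(6)]
print(rank_sentences(sentences, rng.normal(size=(6, 384)), summary_length=3))
```

In the real pipeline, the embeddings would come from `SentenceTransformer("all-MiniLM-L6-v2").encode(sentences)` and the sentence list from NLTK's tokenizer.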
Advantages:
- ✅ Contextual understanding (e.g., "bank" means different things in different contexts)
- ✅ Fast inference (~500ms)
- ✅ Small memory footprint (~100MB)
- ✅ No API costs
- ✅ No hallucinations (extractive only)
- ✅ Privacy-friendly (runs on your server)
Limitations:
- Cannot generate novel sentences (extractive only)
- Limited to 256 tokens per sentence
- Less creative than LLMs (GPT, Claude)
| Metric | Value |
|---|---|
| Model Size | ~100MB |
| RAM Usage | ~100-200MB |
| Response Time | 200-500ms (CPU) |
| Cold Start | 20-60s (free tier) |
| Embedding Dimension | 384 |
| Parameters | 22M |
Test the summarizer directly:

```bash
python summarizer.py
```

Test the API endpoints:

```bash
# Health check
curl http://localhost:5000/health

# Model info
curl http://localhost:5000/model-info

# Summarize
curl -X POST http://localhost:5000/summarize \
  -H "Content-Type: application/json" \
  -d '{"text": "Your text here...", "summary_length": 3}'
```

Test the web demo (open http://localhost:5000 and run in the browser console):

```javascript
testAPI()
```

Create a .env file (see .env.example):

```bash
FLASK_ENV=development
PORT=5000
```

- Add Rate Limiting (optional)
```bash
pip install flask-limiter
```

- Add API Key Authentication (optional)

```python
# In app.py
API_KEY = os.environ.get('API_KEY')
```

- Restrict CORS (production)

```python
# In app.py, change CORS origins from "*" to specific domains
CORS(app, resources={r"/*": {"origins": ["https://your-extension.com"]}})
```
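Putting the API-key idea above into a runnable form, here is a minimal sketch of a request gate in a Flask app. The header name `X-API-Key`, the `dev-key` fallback, and the 401 payload shape are assumptions for illustration, not the project's actual scheme:

```python
import os
from flask import Flask, jsonify, request

app = Flask(__name__)
API_KEY = os.environ.get('API_KEY', 'dev-key')  # set a real key via the environment in production

@app.before_request
def require_api_key():
    # Reject any request that lacks the expected X-API-Key header
    if request.headers.get('X-API-Key') != API_KEY:
        return jsonify({"status": "error",
                        "message": "Invalid or missing API key",
                        "error_type": "AuthenticationError"}), 401

@app.route('/health')
def health():
    return jsonify({"status": "healthy"})
```

With Flask's test client, a request without the header gets a 401, while one sending `X-API-Key: dev-key` reaches the route normally.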
View logs via the Render dashboard or CLI:

```bash
render logs
```

Set up uptime monitoring with:
- UptimeRobot
- Pingdom
- StatusCake
Ping the /health endpoint every 5 minutes.
Problem: The first request takes 30-60 seconds.
Solution:
- Upgrade to paid plan ($7/month)
- Or show loading message to users
- Or use keep-alive service (check host TOS)
Problem: The app crashes on the free tier.
Solution:
- Upgrade to 1-2GB RAM plan
- Or use smaller model (already using smallest production-ready model)
Problem: Summaries take more than 2 seconds.
Solution:
- Check server CPU
- Reduce summary_length
- Consider GPU hosting
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
Apache 2.0 License - see LICENSE file for details
| Approach | Speed | Quality | Cost | RAM |
|---|---|---|---|---|
| This API (Sentence-BERT) | ⚡⚡⚡ Fast | ⭐⭐⭐ Good | 💰 Free | 100MB |
| Word2Vec + TextRank | ⚡⚡⚡ Fast | ⭐⭐ Okay | 💰 Free | 100MB |
| BART/T5 Local | ⚡⚡ Medium | ⭐⭐⭐⭐ Very Good | 💰 Free | 2GB |
| OpenAI API | ⚡⚡⚡ Fast | ⭐⭐⭐⭐⭐ Excellent | 💰💰 $$$ | N/A |
For issues, questions, or feedback:
- Open an issue on GitHub
- Check existing documentation
- Review API logs
Planned improvements:

- Add caching for repeated requests
- Implement API key authentication
- Add more summarization strategies
- Support multiple languages
- WebSocket support for streaming
- Add usage analytics dashboard
Built with Sentence-BERT | Open Source | Production Ready