A high-performance web scraper for AIBase News built with Rust and React. Features incremental scraping, full-text search, real-time progress tracking, and a modern dark-themed UI.
The system follows a three-tier architecture:
Backend: a high-performance async scraper with rate limiting. A brief throttling sketch follows the table below.
| Component | Technology |
|---|---|
| Web Framework | Axum |
| HTTP Client | reqwest |
| HTML Parser | scraper |
| Database | sqlx (PostgreSQL) |
| Rate Limiter | governor |
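For orientation, here is a minimal, hypothetical sketch of how throttling with governor and reqwest might fit together. It is not taken from the project's source; the article URL pattern and the 2 requests/second limit are assumptions based on the defaults documented later in this README.

```rust
// Hypothetical sketch (not the project's actual code): throttle article fetches
// with the `governor` crate, assuming a limit of 2 requests/second.
use std::num::NonZeroU32;

use governor::{Quota, RateLimiter};

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    // Allow at most 2 requests per second across the whole scrape run.
    let limiter = RateLimiter::direct(Quota::per_second(NonZeroU32::new(2).unwrap()));
    let client = reqwest::Client::new();

    for id in 24_000..24_005 {
        // Wait until the limiter grants a permit before issuing the request.
        limiter.until_ready().await;
        // Assumed URL pattern, for illustration only.
        let url = format!("https://www.aibase.com/news/{id}");
        let resp = client.get(&url).send().await?;
        println!("{id}: {}", resp.status());
    }
    Ok(())
}
```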
Database: PostgreSQL with full-text search via tsvector indexing. A brief query sketch follows the table below.
| Feature | Implementation |
|---|---|
| Full-text Search | GIN index on tsvector |
| Deduplication | Unique constraint on external_id |
| Migrations | sqlx-migrate |
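As a rough illustration (not the project's actual schema), the sketch below shows how a sqlx query might combine the unique constraint on external_id for deduplication with a tsvector search. The `articles` table and `search_vector` column names are assumptions.

```rust
// Hypothetical sketch with assumed table/column names, not the project's schema.
use sqlx::postgres::PgPoolOptions;

#[tokio::main]
async fn main() -> Result<(), sqlx::Error> {
    let pool = PgPoolOptions::new()
        .connect("postgres://scraper:scraper_password@localhost:5432/aibase_scraper")
        .await?;

    // Insert an article; the unique constraint on external_id makes re-runs idempotent.
    sqlx::query(
        "INSERT INTO articles (external_id, title, content)
         VALUES ($1, $2, $3)
         ON CONFLICT (external_id) DO NOTHING",
    )
    .bind(24_178_i64)
    .bind("Example title")
    .bind("Example body")
    .execute(&pool)
    .await?;

    // Full-text search: match the query against the GIN-indexed tsvector column.
    let rows: Vec<(i64, String)> = sqlx::query_as(
        "SELECT id, title FROM articles
         WHERE search_vector @@ websearch_to_tsquery('english', $1)
         LIMIT 20",
    )
    .bind("large language models")
    .fetch_all(&pool)
    .await?;

    for (id, title) in rows {
        println!("{id}: {title}");
    }
    Ok(())
}
```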
Frontend: a modern SPA with real-time updates.
| Component | Technology |
|---|---|
| Runtime | Bun |
| Build Tool | Vite |
| UI Framework | React 18 |
| Components | shadcn/ui |
| State | TanStack Query |
| Real-time | WebSocket |
The frontend includes a Knowledge Base section that displays markdown notes from the social_presence repository via symlink.
| Feature | Technology |
|---|---|
| Rendering | react-markdown |
| GFM Support | remark-gfm |
| Symlink | /public/knowledge -> social_presence |
Set up the Knowledge Base (optional):

```bash
# Clone the social_presence repo
git clone https://github.com/divital-coder/social_presence ~/Desktop/social_presence

# Create symlink in frontend
ln -s ~/Desktop/social_presence frontend/public/knowledge
```

Before starting, ensure you have at least the following installed: the Rust toolchain (cargo), Bun, Docker with Docker Compose, and Git.
Follow these steps in order to get the application running:
```bash
git clone https://github.com/divital-coder/aibase-scraper.git
cd aibase-scraper
```

The database runs in a Docker container. Start it with:

```bash
docker compose up -d
```

Verify it's running:

```bash
docker ps
# Should show: aibase-scraper-db running on port 5432
```

Copy the example environment file for the backend:

```bash
cp .env.example backend/.env
```

The default configuration works out of the box with the Docker PostgreSQL setup.
Open a terminal and run:
```bash
cd backend
cargo run --release
```

Wait for the message:

```
INFO aibase_scraper: Server listening on 127.0.0.1:3001
```

The backend is now running at http://localhost:3001.
Open a new terminal and run:
```bash
cd frontend
bun install
bun dev --port 3002
```

The frontend is now running at http://localhost:3002.
Open your browser and navigate to:
http://localhost:3002
You should see the dashboard. Navigate to the Scraper page to start collecting articles.
Incremental scrape: scrapes articles from listing pages. Best for regular updates.

```bash
curl -X POST http://localhost:3001/api/scraper/start \
  -H "Content-Type: application/json" \
  -d '{"scrape_type": "incremental", "max_pages": 10}'
```

ID range scrape: scrapes articles by ID range. Best for initial data collection.

```bash
curl -X POST http://localhost:3001/api/scraper/start-range \
  -H "Content-Type: application/json" \
  -d '{"start_id": 14000, "end_id": 24178}'
```

To stop the application, press Ctrl+C in the frontend terminal, then press Ctrl+C in the backend terminal. To stop the database:

```bash
# Stop but keep data
docker compose stop

# Stop and remove all data
docker compose down -v
```

If you've stopped the application and want to restart:
```bash
# 1. Start database (from project root)
docker compose up -d

# 2. Start backend (in one terminal)
cd backend && cargo run --release

# 3. Start frontend (in another terminal)
cd frontend && bun dev --port 3002
```

Article endpoints:

| Method | Endpoint | Description |
|---|---|---|
| GET | /api/articles | List articles with pagination and search |
| GET | /api/articles/:id | Get single article by ID |
| DELETE | /api/articles/:id | Delete article |
Query parameters for listing (see the example request below):

- `page` - Page number (default: 1)
- `per_page` - Items per page (default: 20)
- `search` - Full-text search query
- `tag` - Filter by tag
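For example, a client might request the first page of results matching a search term like this. This is a hypothetical reqwest sketch, not project code; the response body is printed as raw JSON since its exact shape is defined by the backend.

```rust
// Hypothetical client sketch: list articles with pagination and full-text search.
#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let client = reqwest::Client::new();
    let body: serde_json::Value = client
        .get("http://localhost:3001/api/articles")
        .query(&[("page", "1"), ("per_page", "20"), ("search", "large language models")])
        .send()
        .await?
        .json()
        .await?;
    // Pretty-print whatever JSON the API returns.
    println!("{body:#}");
    Ok(())
}
```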
Scraper endpoints:

| Method | Endpoint | Description |
|---|---|---|
| POST | /api/scraper/start | Start pagination scrape |
| POST | /api/scraper/start-range | Start ID range scrape |
| POST | /api/scraper/stop | Stop current job |
| GET | /api/scraper/status | Get current job status |
| GET | /api/scraper/runs | List past scrape runs |
Statistics endpoints:

| Method | Endpoint | Description |
|---|---|---|
| GET | /api/stats | Dashboard statistics |
| GET | /api/stats/tags | Tag distribution |
Settings endpoints:

| Method | Endpoint | Description |
|---|---|---|
| GET | /api/settings | Get all settings |
| PATCH | /api/settings/:key | Update setting |
WebSocket endpoint (a client sketch follows the table):

| Endpoint | Description |
|---|---|
| WS /ws/scrape-progress | Real-time scrape progress |
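A client can subscribe to this endpoint to follow scrape progress. The sketch below is a hypothetical Rust client using tokio-tungstenite (not part of the project); it simply prints each text frame it receives, since the exact JSON payload shape is defined by the backend and not documented here.

```rust
// Hypothetical client sketch: listen to /ws/scrape-progress and print updates.
use futures_util::StreamExt;
use tokio_tungstenite::connect_async;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let (mut ws, _response) =
        connect_async("ws://127.0.0.1:3001/ws/scrape-progress").await?;

    // Each text frame is expected to be a JSON progress update for the current run.
    while let Some(msg) = ws.next().await {
        let msg = msg?;
        if msg.is_text() {
            println!("progress: {}", msg.into_text()?);
        }
    }
    Ok(())
}
```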
Environment variables (backend/.env):

```env
# Database
DATABASE_URL=postgres://scraper:scraper_password@localhost:5432/aibase_scraper

# Server
SERVER_HOST=127.0.0.1
SERVER_PORT=3001

# Scraper
SCRAPER_RATE_LIMIT=2       # Requests per second
SCRAPER_MAX_RETRIES=3      # Retry attempts on failure

# Logging
RUST_LOG=info,aibase_scraper=debug
```

Backend development commands:

```bash
cd backend
cargo watch -x run # Auto-reload on changes
cargo build --release    # Production build
```

Frontend development commands:

```bash
cd frontend
bun dev                  # Vite dev server with HMR
bun run build            # Production build
bun run preview          # Preview production build
```

Database commands:

```bash
# Reset database
docker compose down -v
docker compose up -d
# View logs
docker compose logs -f postgres
```

If a port is already in use, kill the process occupying it:

```bash
# For backend (port 3001)
lsof -ti :3001 | xargs kill -9
# For frontend (port 3002)
lsof -ti :3002 | xargs kill -9
```

If the backend cannot reach the database, ensure PostgreSQL is running:

```bash
docker compose up -d
docker ps    # Verify container is running
```

Check that the backend is running and accessible:

```bash
curl http://localhost:3001/api/stats
```

Scraper performance:

| Metric | Value |
|---|---|
| Rate Limit | 2 requests/second |
| Max Range | 50,000 articles/run |
| Estimated Time (10k articles) | ~85 minutes |
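For reference, at the default 2 requests/second the request time alone for 10,000 articles is 10,000 / 2 = 5,000 seconds, roughly 83 minutes, which is consistent with the ~85 minute estimate once retries and parsing overhead are accounted for.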
The scraper handles 404s gracefully and skips non-existent article IDs automatically.
License: MIT