Interactive Pokemon browser with all 1025 Pokemon, featuring visual similarity search powered by CLIP embeddings.
The visual similarity feature loads ~11MB of data on first visit, which may take some time. This data is cached after the first load.
# Install UV if you haven't already
# See: https://github.com/astral-sh/uv
# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate # macOS/Linux
# or
.venv\Scripts\activate # Windows
uv pip install -r requirements.txt

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate # macOS/Linux
# or
.venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt

python3 serve.py
# Open http://localhost:8000

- pokemon_data.csv - Pokemon metadata (names, types, stats)
- pokemon_embeddings.npy - Computed CLIP embeddings for all 1025 Pokemon (~11MB)
- pokemon_artwork/ - Original artwork (475x475 RGBA PNG format, 1025 images)
- pokemon_artwork_rgb/ - RGB-converted images with white background

- index.html - Main web interface with Pokemon browser and search
- app.js - Core JavaScript logic for browsing, filtering, and displaying Pokemon
- similarity.js - Visual similarity search implementation using CLIP embeddings
- style.css - Styling for the web interface

- serve.py - Simple HTTP server to host the web interface
- convert_to_rgb.py - Utility script to convert RGBA images to RGB format
- generate_embeddings_daft_fixed.py - Daft-based CLIP embedding generator with thread-safe model loading

- requirements.txt - Python dependencies (transformers, daft, numpy, Pillow, etc.)
- LICENSE - Project license
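To make the similarity search concrete, here is a minimal sketch of ranking by embedding similarity. It assumes cosine similarity, a common choice for CLIP embeddings; the actual logic lives in similarity.js and may differ. The function names and the three-dimensional toy vectors are made up for illustration and are not the project's data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query_name, embeddings, top_k=3):
    """Return the top_k (name, score) pairs closest to query_name."""
    query = embeddings[query_name]
    scored = [
        (name, cosine_similarity(query, vector))
        for name, vector in embeddings.items()
        if name != query_name  # do not return the query itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors (hypothetical); real CLIP embeddings have far more dimensions.
toy_embeddings = {
    "pikachu": [0.9, 0.1, 0.0],
    "raichu": [0.8, 0.2, 0.1],
    "squirtle": [0.0, 0.2, 0.9],
}

ranked = most_similar("pikachu", toy_embeddings, top_k=2)
print([name for name, _ in ranked])  # ['raichu', 'squirtle']
```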
This project demonstrates generating CLIP embeddings for Pokemon images using Daft, a distributed dataframe library. Along the way I encountered and solved several concurrency problems that show how Daft handles user-defined functions (UDFs) at scale.
I wanted to generate CLIP embeddings for 1025 Pokemon images using Daft's UDF feature. The initial implementation failed with mysterious errors when processing more than ~100 images.
- First Discovery (commit 3582467): Created a test script to reproduce the issue. Found that Daft worked with 100 images but failed at 1025 with a "Too many open files" error.
- Finding the Threshold (commit c115b42): Through binary search, discovered that the exact failure threshold was 128 images. Above this, Daft failed with ImportError in worker processes.
- First Fix Attempt (commit 3f3c908): Moving imports to module level fixed the ImportError but revealed a new threshold at 406/407 images with "meta tensor" errors.
- Deeper Investigation (commit 1da7f8d): Implemented lazy loading and CPU-only inference, which pushed the threshold to 430 images but still failed on larger datasets.
- Root Cause Discovery: Through detailed debugging, I discovered that Daft creates multiple UDF instances when processing larger datasets. These instances were trying to load the CLIP model simultaneously, causing race conditions in the transformers library.
- Final Solution (commit 8e7411f): Implemented thread-safe model loading with a global cache, successfully processing all 1025 images.
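The threshold hunt in the second step can be sketched as a bisection over dataset size. This is an illustration, not the test script from the commits: `find_failure_threshold` and `fake_run` are hypothetical names, and `fake_run` stands in for actually launching the embedding script on n images. It also assumes failures are monotonic, so that once a size fails, every larger size fails too.

```python
from typing import Callable


def find_failure_threshold(succeeds: Callable[[int], bool], low: int, high: int) -> int:
    """Return the smallest n in (low, high] for which succeeds(n) is False.

    Requires succeeds(low) to be True and succeeds(high) to be False.
    """
    if not succeeds(low) or succeeds(high):
        raise ValueError("need a passing size at low and a failing size at high")
    while high - low > 1:
        mid = (low + high) // 2
        if succeeds(mid):
            low = mid   # largest size known to pass
        else:
            high = mid  # smallest size known to fail
    return high


def fake_run(n: int) -> bool:
    """Stand-in for a real run; pretends everything above 128 images fails."""
    return n <= 128


first_failure = find_failure_threshold(fake_run, 100, 1025)
print(first_failure)  # 129, so the last passing size is 128
```

Each probe halves the search interval, so bracketing the threshold between 100 and 1025 images takes about ten runs instead of hundreds.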
The issue was that when Daft scales up processing, it creates multiple UDF instances that can run in parallel threads within the same process. If these instances try to initialize heavy ML models simultaneously, it causes race conditions.
My solution uses a thread-safe global model cache that ensures only one model is loaded per process, regardless of how many UDF instances Daft creates.
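A minimal sketch of such a cache, using double-checked locking, is shown below. It is not the code from generate_embeddings_daft_fixed.py: `get_model`, `_load_model`, and `load_count` are hypothetical names, and `_load_model` is a cheap stand-in for the expensive CLIP load so the example runs without any ML dependencies.

```python
import threading

_model = None                   # process-wide cache shared by every caller
_model_lock = threading.Lock()  # guards the one-time load
load_count = 0                  # how many times the loader actually ran


def _load_model():
    """Hypothetical stand-in for an expensive model load."""
    global load_count
    load_count += 1  # only ever executed while holding _model_lock
    return object()


def get_model():
    """Return the shared model, loading it at most once per process."""
    global _model
    if _model is None:          # first check: skip the lock once loaded
        with _model_lock:
            if _model is None:  # second check: another thread may have won
                _model = _load_model()
    return _model


# Simulate several UDF instances asking for the model at the same time.
results = []
threads = [
    threading.Thread(target=lambda: results.append(get_model()))
    for _ in range(8)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(load_count)  # 1
```

The outer check keeps the common path lock-free after the first load, while the inner check under the lock ensures that threads which raced past the outer check do not each load their own copy.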
- Daft's Execution Model: Below ~128 images, Daft runs in single-threaded mode. Above that, it spawns multiple UDF instances for parallelism.
- Model Loading Race Conditions: ML frameworks like transformers aren't always thread-safe during model initialization.
- Thread-Safe Patterns: The double-checked locking pattern is crucial for efficient thread-safe lazy initialization.
# Generate embeddings for all Pokemon
python generate_embeddings_daft_fixed.py 1025
# Test with different dataset sizes
python generate_embeddings_daft_fixed.py 100 # Single-threaded execution
python generate_embeddings_daft_fixed.py 500 # Multi-threaded execution

- Images: PokeAPI/sprites
- Data: lgreski/pokemonData (from PokemonDB.net)