Assignment: Data Science Intern
Assignment 1: Sentiment Analysis Pipeline
Objective
You will implement a sentiment analysis pipeline on a well-known public dataset. The process
covers data collection, data cleaning, exploratory data analysis, model training, and serving the
model via a Flask API.
Why IMDB?
o It’s a standard sentiment analysis dataset, with ~25k labeled movie reviews each for
training and testing (50k in total). Labels are positive or negative.
o Easily available via:
   Kaggle: IMDB Dataset of 50K Movie Reviews
   Hugging Face Datasets: load_dataset("imdb") from the datasets library
Important: If you prefer another labeled sentiment dataset (e.g., Yelp Reviews Polarity), that’s
fine, but the IMDB dataset is the baseline recommendation.
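For example, one way to pull the dataset with the Hugging Face datasets library:

    from datasets import load_dataset

    # Downloads and caches the IMDB dataset (25k train / 25k test reviews).
    imdb = load_dataset("imdb")

    print(imdb["train"][0]["text"][:200])  # first 200 characters of a review
    print(imdb["train"][0]["label"])       # 0 = negative, 1 = positive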
1. Data Collection
o Download the dataset (e.g., the Kaggle CSV or the Hugging Face datasets version) and
store the raw reviews in your database (see the Database Schema deliverable).
2. Data Cleaning
o Ensure there are no obvious errors or duplicates.
o For text cleanup, you can consider:
Lowercasing
Removing HTML tags (some IMDB reviews contain <br /> etc.)
Removing punctuation (optional)
o Keep the cleaned version stored in memory or in a new column, whichever you
prefer.
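For illustration, a minimal cleaning function along these lines (a sketch, not a required
implementation):

    import re
    import string

    def clean_text(text: str) -> str:
        """Lowercase, strip HTML tags, and drop punctuation."""
        text = text.lower()
        text = re.sub(r"<[^>]+>", " ", text)  # remove HTML tags such as <br />
        text = text.translate(str.maketrans("", "", string.punctuation))
        return re.sub(r"\s+", " ", text).strip()  # collapse extra whitespace

    print(clean_text("Great movie!<br /><br />Loved it."))  # -> great movie loved it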
3. Exploratory Data Analysis (EDA)
o Show basic stats:
Number of reviews per sentiment (distribution)
Average review length for positive vs. negative
o (Optional) Some simple plots or word clouds can be included for illustration.
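With the data in a pandas DataFrame, these stats are one-liners. A quick sketch using the
Hugging Face train split:

    from datasets import load_dataset

    df = load_dataset("imdb", split="train").to_pandas()
    df["sentiment"] = df["label"].map({0: "negative", 1: "positive"})

    print(df["sentiment"].value_counts())               # reviews per sentiment
    df["n_words"] = df["text"].str.split().str.len()    # review length in words
    print(df.groupby("sentiment")["n_words"].mean())    # average length per class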
4. Model Training
1. Model Type
o Baseline: A simple approach like Logistic Regression or Naive Bayes on TF-IDF
vectors.
o (Optional) Try a transformer-based model like DistilBERT or BERT if you’re
comfortable with Hugging Face Transformers.
2. Train/Validation Split
o If the dataset is already split into train/test, use that.
o You can create an additional validation split from the training set (e.g., 80%/20%
within the training set).
3. Training
o Fit the model on the training data.
o Monitor basic metrics (accuracy, F1, etc.) on the validation set.
4. Evaluation
o Evaluate on the test set.
o Report metrics (accuracy, precision, recall, F1-score).
Tip: Keep the model as simple or advanced as you like, but logistic regression on TF-IDF is
typically enough for a baseline.
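For reference, a minimal baseline in this spirit (assumes texts and labels are Python lists
built from the training split; the names are illustrative):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline

    # texts: list[str] of cleaned reviews; labels: list[int], 0 = negative, 1 = positive.
    X_train, X_val, y_train, y_val = train_test_split(
        texts, labels, test_size=0.2, random_state=42, stratify=labels
    )

    model = Pipeline([
        ("tfidf", TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    model.fit(X_train, y_train)

    # Accuracy, precision, recall, and F1 on the validation split.
    print(classification_report(y_val, model.predict(X_val)))

The same classification_report call on the held-out test split yields the final numbers to
report, and joblib.dump(model, "model.pkl") is one way to persist the fitted pipeline for the
Flask app.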
5. Flask API
o Create a simple Flask app (app.py or main.py).
o Include an endpoint, e.g., POST /predict:
Input: JSON with a field review_text (the new text to classify).
Output: JSON with a field sentiment_prediction (e.g., "positive" /
"negative").
o Model Loading:
   Ensure your trained model (e.g., a pickle file or Hugging Face model weights) is
   loaded once when the app starts, not on every request.
   The endpoint should then parse the incoming JSON, apply the same preprocessing
   used in training, run the model, and return the predicted sentiment as JSON.
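A minimal sketch of such an app, assuming the pipeline above was saved to model.pkl (an
illustrative filename):

    import joblib
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    model = joblib.load("model.pkl")  # loaded once at startup, not per request

    @app.route("/predict", methods=["POST"])
    def predict():
        data = request.get_json(force=True)
        text = data.get("review_text", "")
        pred = model.predict([text])[0]
        sentiment = "positive" if pred == 1 else "negative"
        return jsonify({"sentiment_prediction": sentiment})

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=5000)

A quick smoke test once the server is up:

    import requests

    resp = requests.post(
        "http://localhost:5000/predict",
        json={"review_text": "A wonderful film"},
    )
    print(resp.json())  # e.g., {"sentiment_prediction": "positive"}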
6. (Optional) Deployment
o Cloud: Deploy to any free-tier service (Heroku, Render, Railway, etc.) or a small EC2
instance on AWS.
o Provide instructions or documentation if you do this step.
Deliverables
1. Code Repository
o A well-structured repository (GitHub, GitLab, etc.) containing:
Data ingestion/DB setup script or notebook (e.g., data_setup.py).
Model training script or notebook (e.g., train_model.py).
Flask app (e.g., app.py).
Requirements file (requirements.txt) listing all Python dependencies.
2. Database Schema
o Simple instructions or a .sql file that shows table creation if using
MySQL/PostgreSQL.
o If using SQLite, mention your database file name (e.g., imdb_reviews.db) and
how it’s created; a sketch follows after this list.
3. README
o Project Setup: Steps to install dependencies (e.g., pip install -r
requirements.txt) and set up the database.
o Data Acquisition: How you downloaded or loaded the dataset.
o Run Instructions:
How to run the training script.
How to start the Flask server.
How to test the endpoint with example commands or requests.
o Model Info: A summary of the chosen model approach and key results (e.g., final
accuracy on test set).
4. (Optional) Additional Assets
o If you generate plots or a short EDA report, include them in the repo (e.g., a
.ipynb or .pdf).
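If you go the SQLite route, a minimal setup sketch (table and column names are illustrative,
not prescribed):

    import sqlite3

    # Creates imdb_reviews.db in the working directory on first run.
    conn = sqlite3.connect("imdb_reviews.db")
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS reviews (
            id        INTEGER PRIMARY KEY AUTOINCREMENT,
            review    TEXT NOT NULL,
            sentiment TEXT NOT NULL  -- 'positive' or 'negative'
        )
        """
    )
    conn.commit()
    conn.close()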
Evaluation Criteria
1. Completeness: Did you store data in a DB, train a sentiment model, and serve it via
Flask?
2. Correctness: Does the Flask endpoint work? Does the model predict sentiment
accurately on test data?
3. Code Quality & Organization: Is the code clean, documented, and logically separated
into files/modules?
4. Documentation: Is there a clear README with setup and usage instructions?
Assignment 2: RAG Chatbot
Overview
You will implement a simple Retrieval-Augmented Generation (RAG) chatbot that uses a vector database
for semantic search and stores chat history in a local MySQL database. The chatbot will be served via a
Flask API.
Task Breakdown
1. Data Preparation
o Corpus: Pick or create a small text corpus (e.g., a set of documentation pages, articles,
or Wikipedia paragraphs on a particular topic). You can:
Scrape a small set of web pages (ensure the data is not too large—just enough
to test retrieval).
Split larger documents into smaller chunks (e.g., ~200-300 words each).
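A simple word-window splitter for the chunking step (the chunk size is a tunable
assumption):

    def chunk_text(text: str, chunk_size: int = 250) -> list[str]:
        """Split a document into chunks of roughly chunk_size words."""
        words = text.split()
        return [
            " ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)
        ]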
2. Vector Database & Embeddings
o Vector Database: You can choose any free and local-friendly vector DB (e.g., Faiss,
Chroma, or Milvus).
o Embeddings: Embed each chunk of text and store the resulting vectors in the vector
DB.
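For example, with sentence-transformers and Faiss (the embedding model named here is just
one small, CPU-friendly choice):

    import faiss
    from sentence_transformers import SentenceTransformer

    # chunks: list[str] produced by the chunking step above.
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = encoder.encode(chunks, normalize_embeddings=True)

    # Inner product over normalized vectors is cosine similarity.
    index = faiss.IndexFlatIP(embeddings.shape[1])
    index.add(embeddings)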
3. Retrieval-Augmented Generation (RAG)
o Retrieval: Implement a function that, given a user query, embeds it and returns the
most similar chunks from the vector DB.
o RAG Approach: Pass the retrieved chunks (together with the query) into your small
generation or extraction function to form the final answer.
o Implementation Details: You can run a local model (small transformer, or even a
rule-based approach) if you do not want to rely on large model inference.
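Continuing the sketch above, retrieval plus a deliberately naive extractive "generation" step
(returning the best chunk verbatim stands in for a real generator):

    def retrieve(query: str, k: int = 3) -> list[str]:
        """Return the k chunks most similar to the query."""
        q_emb = encoder.encode([query], normalize_embeddings=True)
        _, idx = index.search(q_emb, k)
        return [chunks[i] for i in idx[0]]

    def answer(query: str) -> str:
        """Extractive baseline: echo the single best chunk."""
        return retrieve(query)[0]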
4. Flask API
o Endpoints:
POST /chat: Accepts a JSON payload with the user’s query. Returns:
Generated answer.
Possibly the top retrieved chunks for debugging (optional).
GET /history: Returns the chat history from the MySQL DB.
o Chat History:
   For each user query and system answer, store them as separate rows in MySQL
   with fields such as:
      id (auto-increment)
      timestamp
      role (user or system)
      the message text
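One possible shape for the API, assuming the answer() helper above and
mysql-connector-python (connection details are placeholders):

    from datetime import datetime

    import mysql.connector
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    db = mysql.connector.connect(
        host="localhost", user="rag", password="rag", database="rag_chat"
    )

    def log_message(role: str, text: str) -> None:
        cur = db.cursor()
        cur.execute(
            "INSERT INTO chat_history (timestamp, role, message) VALUES (%s, %s, %s)",
            (datetime.now(), role, text),
        )
        db.commit()

    @app.route("/chat", methods=["POST"])
    def chat():
        query = request.get_json(force=True)["query"]
        reply = answer(query)  # retrieval + generation from the sketch above
        log_message("user", query)
        log_message("system", reply)
        return jsonify({"answer": reply})

    @app.route("/history", methods=["GET"])
    def history():
        cur = db.cursor(dictionary=True)
        cur.execute("SELECT id, timestamp, role, message FROM chat_history ORDER BY id")
        return jsonify(cur.fetchall())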
5. Database
o MySQL: Use a local MySQL database for the chat history table (see the Database
Schema deliverable).
o Optionally, you can also store user IDs if you want to support multiple users (not
mandatory).
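One possible schema, created from Python so the setup is scriptable (names mirror the API
sketch above and are illustrative):

    import mysql.connector

    db = mysql.connector.connect(host="localhost", user="rag", password="rag")
    cur = db.cursor()
    cur.execute("CREATE DATABASE IF NOT EXISTS rag_chat")
    cur.execute(
        """
        CREATE TABLE IF NOT EXISTS rag_chat.chat_history (
            id        INT AUTO_INCREMENT PRIMARY KEY,
            timestamp DATETIME NOT NULL,
            role      VARCHAR(16) NOT NULL,  -- 'user' or 'system'
            message   TEXT NOT NULL
        )
        """
    )
    db.close()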
6. Testing
o Unit Tests:
Test the Flask endpoints (/chat, /history) to ensure the entire pipeline works.
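A starting point with Flask's built-in test client (assumes the app object from the API
sketch lives in app.py, a hypothetical module name):

    from app import app

    def test_chat_and_history():
        client = app.test_client()

        resp = client.post("/chat", json={"query": "What is this corpus about?"})
        assert resp.status_code == 200
        assert "answer" in resp.get_json()

        resp = client.get("/history")
        assert resp.status_code == 200
        assert isinstance(resp.get_json(), list)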
Deliverables
1. Code Repository:
o All source code for the chatbot pipeline, including the vector DB setup, embedding code,
retrieval code, and Flask API.
o A requirements.txt or Pipfile.
2. Database Schema:
o The SQL schema or migrations for MySQL (tables for storing chat history).
3. Demo / Documentation:
o A README explaining how to install, run, and test the chatbot, ideally with example
requests for /chat and /history.
Submission Guidelines
1. Repo Structure:
o Assignment1
requirements.txt
etc.
o Assignment2
data_preprocessing.py
embed_store.py
requirements.txt
etc.
o README.md
2. Instructions: Provide clear instructions on how someone else can clone your repository, install
dependencies, set up the database, and run each assignment’s solution.
3. Time Expectation:
Evaluation Criteria
1. Correctness & Completeness: Does the code run end-to-end without major issues? Are all
parts of the assignment addressed?
2. Project Structure & Code Quality: Is the repository organized logically? Is the code
readable, well-documented, and maintainable?
3. Solution Design: Are the chosen methods (data cleaning, feature engineering, model
selection, retrieval, etc.) appropriate and justified?
4. Documentation: Is there a clear README explaining how to install, run, and test the
solution?