Assignment data science intern

The document outlines two assignments focused on building machine learning applications: an End-to-End Sentiment Analysis Pipeline using the IMDB dataset and a Retrieval-Augmented Generation (RAG) Chatbot utilizing a vector database. Each assignment includes detailed steps for data collection, model training, API development with Flask, and database setup, along with deliverables and evaluation criteria. The assignments are designed to be completed in 2-3 days and emphasize code quality, completeness, and documentation.

Uploaded by

sushantgaurav80
Copyright: © All Rights Reserved

Assignment 1: End-to-End Sentiment Analysis Pipeline

Objective

You will implement a sentiment analysis pipeline on a well-known public dataset. The process
includes:

1. Downloading the dataset (already annotated with sentiment).


2. Storing the data in a database.
3. Cleaning and exploring the data.
4. Training and evaluating a classification model.
5. Serving the trained model via a Flask API (with an endpoint that predicts sentiment for
new text).

Dataset: IMDB Movie Reviews

 Why IMDB?
o It’s a standard sentiment analysis dataset, with ~25k labeled training reviews and ~25k labeled test reviews. Labels are positive or negative.
o Easily available via:
 Kaggle: IMDB Dataset of 50K Movie Reviews
 Hugging Face Datasets: load_dataset("imdb") from the datasets library

Important: If you prefer another labeled sentiment dataset (e.g., Yelp Reviews Polarity), that’s
fine, but the IMDB dataset is the baseline recommendation.

Steps & Requirements

1. Data Collection

1. Obtain the data


o Download from Kaggle (CSV file) OR load via Hugging Face Datasets.
o Confirm you have ~50k rows of labeled reviews (25k train + 25k test).
2. Database Setup
o Choose any relational database (e.g., MySQL, PostgreSQL, or SQLite).
o Create a table (e.g., imdb_reviews) with columns:
 id (primary key)
 review_text (text)
 sentiment (text or integer, e.g., “positive” / “negative”)
o Insert all data into this table.
Note: If you use SQLite, you won’t need a separate DB server. This is often the simplest option.
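The database step above can be sketched with SQLite in a few lines. This is only a minimal illustration: it uses an in-memory database and two hard-coded rows so it runs on its own; in practice you would connect to a file such as imdb_reviews.db and insert the downloaded dataset.

```python
import sqlite3

# In-memory DB for the sketch; use sqlite3.connect("imdb_reviews.db")
# for a file-backed database as described in the deliverables.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE IF NOT EXISTS imdb_reviews (
           id INTEGER PRIMARY KEY AUTOINCREMENT,
           review_text TEXT NOT NULL,
           sentiment TEXT NOT NULL
       )"""
)

# Placeholder rows; in practice `rows` would come from the dataset.
rows = [("A great film!", "positive"), ("Dull and overlong.", "negative")]
conn.executemany(
    "INSERT INTO imdb_reviews (review_text, sentiment) VALUES (?, ?)", rows
)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM imdb_reviews").fetchone()[0]
```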

2. Data Cleaning & Exploration

1. Data Cleaning
o Ensure there are no obvious errors or duplicates.
o For text cleanup, you can consider:
 Lowercasing
 Removing HTML tags (some IMDB reviews contain <br /> etc.)
 Removing punctuation (optional)
o Keep the cleaned version stored in memory or in a new column, whichever you
prefer.
2. Exploratory Data Analysis (EDA)
o Show basic stats:
 Number of reviews per sentiment (distribution)
 Average review length for positive vs. negative
o (Optional) Some simple plots or word clouds can be included for illustration.
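The cleaning and EDA steps above can be sketched as follows. The regexes and the two toy reviews are illustrative choices, not part of the assignment spec; punctuation removal is marked optional, as noted above.

```python
import re
from statistics import mean

def clean_review(text: str) -> str:
    """Lowercase, strip HTML tags like <br />, and drop punctuation."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)       # remove HTML tags
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # remove punctuation (optional)
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace

# Toy EDA over a couple of (review, sentiment) pairs.
data = [("Loved it!<br />A must-see.", "positive"),
        ("Terrible... just terrible.", "negative")]
cleaned = [(clean_review(t), s) for t, s in data]

# Reviews per sentiment, and average review length per sentiment.
counts = {}
for _, s in cleaned:
    counts[s] = counts.get(s, 0) + 1
avg_len = {s: mean(len(t.split()) for t, lbl in cleaned if lbl == s)
           for s in counts}
```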

3. Model Training

1. Model Type
o Baseline: A simple approach like Logistic Regression or Naive Bayes on TF-IDF vectors.
o (Optional) Try a transformer-based model like DistilBERT or BERT if you’re
comfortable with Hugging Face Transformers.
2. Train/Validation Split
o If the dataset is already split into train/test, use that.
o You can create an additional validation split from the training set (e.g., 80%/20%
within the training set).
3. Training
o Fit the model on the training data.
o Monitor basic metrics (accuracy, F1, etc.) on the validation set.
4. Evaluation
o Evaluate on the test set.
o Report metrics (accuracy, precision, recall, F1-score).

Tip: Keep the model as simple or advanced as you like, but logistic regression on TF-IDF is
typically enough for a baseline.
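The TF-IDF + logistic regression baseline fits in a short scikit-learn pipeline. The four training sentences below are stand-in data so the sketch runs on its own; on the real dataset you would fit on the training split and report accuracy/precision/recall/F1 on the test split.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-in for the IMDB training split.
train_texts = ["great movie loved it", "wonderful acting great plot",
               "terrible film hated it", "awful boring waste of time"]
train_labels = ["positive", "positive", "negative", "negative"]

# TF-IDF features followed by a logistic regression classifier.
clf = Pipeline([("tfidf", TfidfVectorizer()),
                ("lr", LogisticRegression())])
clf.fit(train_texts, train_labels)

preds = clf.predict(["loved the great acting", "boring awful film"])
```

On real data, swap in the cleaned reviews from the database and use sklearn.metrics.classification_report for the required metrics.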

4. Model Serving with Flask

1. Flask API
o Create a simple Flask app (app.py or main.py).
o Include an endpoint, e.g., POST /predict:
 Input: JSON with a field review_text (the new text to classify).
 Output: JSON with a field sentiment_prediction (e.g., "positive" / "negative").
2. Model Loading
o Ensure your trained model (e.g., a pickle file or Hugging Face model weights) is
loaded once when the app starts.
o The Flask endpoint should do the following:

1. Receive text input.


2. Apply the same preprocessing steps used during training.
3. Run the text through the trained model.
4. Return the predicted sentiment.
3. Testing Locally
o Show how to send a test request (using curl, Postman, or Python’s requests) to
verify the endpoint works.
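A minimal version of this Flask app might look like the sketch below. A toy rule-based predictor stands in for the trained model so the example is self-contained; in the real app you would unpickle the model once at startup (e.g., model = pickle.load(open("model.pkl", "rb"))) and apply the same preprocessing as in training.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Stand-in for the trained model; load your pickle / HF weights here
# once at startup instead.
def predict_sentiment(text: str) -> str:
    return "positive" if "good" in text.lower() else "negative"

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(force=True)
    text = payload.get("review_text", "")
    return jsonify({"sentiment_prediction": predict_sentiment(text)})
```

With the server running (flask run or python app.py), a curl test could look like: curl -X POST http://localhost:5000/predict -H "Content-Type: application/json" -d '{"review_text": "A good film"}'.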

5. (Optional) Deployment

 Cloud: Deploy to any free-tier service (Heroku, Render, Railway, etc.) or a small EC2
instance on AWS.
 Provide instructions or documentation if you do this step.

Deliverables

1. Code Repository
o A well-structured repository (GitHub, GitLab, etc.) containing:
 Data ingestion/DB setup script or notebook (e.g., data_setup.py).
 Model training script or notebook (e.g., train_model.py).
 Flask app (e.g., app.py).
 Requirements file (requirements.txt) listing all Python dependencies.
2. Database Schema
o Simple instructions or a .sql file that shows table creation if using
MySQL/PostgreSQL.
o If using SQLite, mention your database file name (e.g., imdb_reviews.db) and
how it’s created.
3. README
o Project Setup: Steps to install dependencies (e.g., pip install -r
requirements.txt) and set up the database.
o Data Acquisition: How you downloaded or loaded the dataset.
o Run Instructions:
 How to run the training script.
 How to start the Flask server.
 How to test the endpoint with example commands or requests.
o Model Info: A summary of the chosen model approach and key results (e.g., final
accuracy on test set).
4. (Optional) Additional Assets
o If you generate plots or a short EDA report, include them in the repo (e.g., a
.ipynb or .pdf).

Time Expectation & Scope

 This assignment should be 2-3 days of work at a normal pace.


 Keep the solution focused on these steps—no need to explore other major NLP tasks.

Evaluation Criteria

1. Completeness: Did you store data in a DB, train a sentiment model, and serve it via
Flask?
2. Correctness: Does the Flask endpoint work? Does the model predict sentiment
accurately on test data?
3. Code Quality & Organization: Is the code clean, documented, and logically separated
into files/modules?
4. Documentation: Is there a clear README with setup and usage instructions?

Assignment 2: RAG (Retrieval-Augmented Generation) Chatbot

Overview

You will implement a simple Retrieval-Augmented Generation (RAG) chatbot that uses a vector database
for semantic search and stores chat history in a local MySQL database. The chatbot will be served via a
Flask API.

Task Breakdown

1. Data Preparation

o Corpus: Pick or create a small text corpus (e.g., a set of documentation pages, articles,
or Wikipedia paragraphs on a particular topic). You can:

 Provide your own text files.

 Scrape a small set of web pages (ensure the data is not too large—just enough
to test retrieval).

o Chunk & Preprocess:

 Split larger documents into smaller chunks (e.g., ~200-300 words each).

 Clean and normalize text (remove extra whitespaces, etc.).
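The chunking and normalization steps above can be sketched as a small helper. The 250-word default is one point in the suggested ~200-300 word range.

```python
import re

def chunk_text(text: str, max_words: int = 250) -> list[str]:
    """Normalize whitespace, then split into chunks of at most max_words."""
    words = re.sub(r"\s+", " ", text).strip().split(" ")
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]
```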


2. Embedding & Vector Store

o Vector Database: You can choose any free and local-friendly vector DB (e.g., Faiss,
Chroma, or Milvus).

o Embeddings:

 Use any sentence embedding model (e.g., sentence-transformers from Hugging


Face).

 Embed each chunk of text and store the resulting vectors in the vector DB.

o Retrieval:

 Implement a function to query the vector DB given a user query (in embeddings
space).

 Return the top-k most relevant chunks.
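The retrieval step above amounts to a nearest-neighbor search in embedding space. The sketch below uses plain NumPy cosine similarity as a stand-in for Faiss/Chroma/Milvus, and hand-made 2-D vectors as a stand-in for real embeddings (which would come from a sentence-transformers model via model.encode(...)).

```python
import numpy as np

def top_k_chunks(query_vec, chunk_vecs, chunks, k=2):
    """Return the k chunks whose vectors are most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    m = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = m @ q                      # cosine similarity per chunk
    idx = np.argsort(scores)[::-1][:k]  # indices of the k best scores
    return [chunks[i] for i in idx]

# Toy 2-D "embeddings"; real ones come from a sentence-embedding model.
chunks = ["dogs are loyal", "cats are aloof", "stocks went up"]
vecs = np.array([[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]])
result = top_k_chunks(np.array([1.0, 0.0]), vecs, chunks, k=2)
```

With Faiss, the same logic is an IndexFlatIP over normalized vectors; the function signature here is purely illustrative.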

3. Generation (Answer Construction)

o RAG Approach:

 Take the user’s query.

 Retrieve relevant chunks from the vector DB.

 Use the retrieved chunks as context to generate an answer.

 Model: you can use any LLM.

o Implementation Details:

 You can run a local model (small transformer, or even a rule-based approach) if
you do not want to rely on large model inference.

 For each query, your pipeline should be:

1. Convert query to embedding.

2. Retrieve top-k relevant chunks.

3. Concatenate these chunks.

4. Pass them (with the query) into your small generation or extraction
function to form the final answer.
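Step 3 and 4 of the pipeline above reduce to assembling a prompt from the retrieved chunks and the query before passing it to whatever generation model you choose. The prompt wording below is one possible template, not a required format.

```python
def build_prompt(query: str, retrieved_chunks: list[str]) -> str:
    """Concatenate retrieved chunks into a context block ahead of the query."""
    context = "\n\n".join(retrieved_chunks)
    return (f"Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {query}\nAnswer:")
```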

4. Flask API

o Endpoints:

 POST /chat: Accepts a JSON payload with the user’s query. Returns:

 Generated answer.
 Possibly the top retrieved chunks for debugging (optional).

 GET /history: Returns the chat history from the MySQL DB.

o Chat History:

 For each user query and system answer, store them as separate rows in MySQL
with fields such as:

 id (auto-increment)

 timestamp

 role (user or system)

 content (the text of the user query or system answer)

 You can store them after each response is generated.

5. Database

o MySQL:

 Set up a local MySQL database.

 Create a table(s) to hold chat messages.

 Optionally, you can also store user IDs if you want to support multiple users (not
mandatory).
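The chat-history table described above can be sketched as follows. SQLite stands in for MySQL here so the example is self-contained; with mysql-connector-python the SQL is nearly identical (AUTO_INCREMENT instead of AUTOINCREMENT, a TIMESTAMP column type, etc.).

```python
import sqlite3
from datetime import datetime, timezone

# In-memory SQLite as a stand-in for the local MySQL database.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE chat_messages (
           id INTEGER PRIMARY KEY AUTOINCREMENT,
           timestamp TEXT NOT NULL,
           role TEXT NOT NULL,      -- 'user' or 'system'
           content TEXT NOT NULL
       )"""
)

def log_message(role: str, content: str) -> None:
    """Store one user query or system answer as a row."""
    conn.execute(
        "INSERT INTO chat_messages (timestamp, role, content) VALUES (?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), role, content),
    )
    conn.commit()

log_message("user", "What do the docs say about X?")
log_message("system", "The docs say ...")
history = conn.execute(
    "SELECT role, content FROM chat_messages ORDER BY id").fetchall()
```

The GET /history endpoint can simply return this query result as JSON.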

6. Testing

o Unit Tests:

 Test embedding and retrieval with a few sample queries.

 Test the Flask endpoints (/chat, /history) to ensure the entire pipeline works.

Deliverables

1. Code Repository:

o All source code for the chatbot pipeline, including the vector DB setup, embedding code,
retrieval code, and Flask API.

o A requirements.txt or Pipfile.

2. Database Schema:

o The SQL schema or migrations for MySQL (tables for storing chat history).

3. Demo / Documentation:

o A short README describing:

 How to install and run the system locally.


 How to set up MySQL and create the required tables.

 How to test the /chat and /history endpoints.

 Any environment variables needed (e.g., DB credentials).

4. (Optional) Extra Credit:

o Dockerize the application to ensure consistent environment setup.

o Provide instructions for a minimal cloud deployment if you want.

General Submission Instructions

1. Repo Structure:

o Assignment1

 data_ingestion.py (or notebook)

 model_training.py (or notebook)

 app.py (Flask API)

 requirements.txt

 etc.

o Assignment2

 data_preprocessing.py

 embed_store.py

 rag_chatbot.py (or app.py for Flask)

 requirements.txt

 etc.

o README.md

2. Instructions: Provide clear instructions on how someone else can clone your repository, install
dependencies, set up the database, and run each assignment’s solution.

3. Time Expectation:

o Each assignment is designed to be completed in about 1-3 days of focused work (depending on your familiarity with the tools).

Evaluation Criteria
 Correctness & Completeness: Does the code run end-to-end without major issues? Are all parts
of the assignment addressed?

 Project Structure & Code Quality: Is the repository organized logically? Is the code readable,
well-documented, and maintainable?

 Solution Design: Are the chosen methods (data cleaning, feature engineering, model selection,
retrieval, etc.) appropriate and justified?

 Documentation: Is there a clear README explaining how to install, run, and test the solution?

 Bonus/Extras: (Optional) Dockerization, cloud deployment, or additional tests will be considered a plus.
