Assignment: Data Science Intern
Assignment 1: Sentiment Analysis Pipeline
Objective
You will implement a sentiment analysis pipeline on a well-known public dataset. The process
covers data collection, data cleaning, exploratory data analysis, model training, and serving the
model via a Flask API.
Why IMDB?
o It’s a standard sentiment analysis dataset, with ~25k labeled movie reviews each for
training and testing (50k in total). Labels are positive or negative.
o Easily available via:
   Kaggle: IMDB Dataset of 50K Movie Reviews
   Hugging Face Datasets: load_dataset("imdb") from the datasets library
Important: If you prefer another labeled sentiment dataset (e.g., Yelp Reviews Polarity), that’s
fine, but the IMDB dataset is the baseline recommendation.
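For example, one way to pull the dataset with the Hugging Face datasets library:

    from datasets import load_dataset

    # Downloads and caches the IMDB dataset (25k train / 25k test reviews).
    imdb = load_dataset("imdb")

    print(imdb["train"][0]["text"][:200])  # first 200 characters of a review
    print(imdb["train"][0]["label"])       # 0 = negative, 1 = positive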
1. Data Collection
o Download the dataset (e.g., the Kaggle CSV or the Hugging Face datasets version) and
store the raw reviews in your database (see the Database Schema deliverable).
2. Data Cleaning
o Ensure there are no obvious errors or duplicates.
o For text cleanup, you can consider:
Lowercasing
Removing HTML tags (some IMDB reviews contain <br /> etc.)
Removing punctuation (optional)
o Keep the cleaned version stored in memory or in a new column, whichever you
prefer.
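For illustration, a minimal cleaning function along these lines (a sketch, not a required
implementation):

    import re
    import string

    def clean_text(text: str) -> str:
        """Lowercase, strip HTML tags, and drop punctuation."""
        text = text.lower()
        text = re.sub(r"<[^>]+>", " ", text)  # remove HTML tags such as <br />
        text = text.translate(str.maketrans("", "", string.punctuation))
        return re.sub(r"\s+", " ", text).strip()  # collapse extra whitespace

    print(clean_text("Great movie!<br /><br />Loved it."))  # -> great movie loved it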
3. Exploratory Data Analysis (EDA)
o Show basic stats:
Number of reviews per sentiment (distribution)
Average review length for positive vs. negative
o (Optional) Some simple plots or word clouds can be included for illustration.
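With the data in a pandas DataFrame, these stats are one-liners. A quick sketch using the
Hugging Face train split:

    from datasets import load_dataset

    df = load_dataset("imdb", split="train").to_pandas()
    df["sentiment"] = df["label"].map({0: "negative", 1: "positive"})

    print(df["sentiment"].value_counts())               # reviews per sentiment
    df["n_words"] = df["text"].str.split().str.len()    # review length in words
    print(df.groupby("sentiment")["n_words"].mean())    # average length per class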
4. Model Training
1. Model Type
o Baseline: A simple approach like Logistic Regression or Naive Bayes on TF-IDF
vectors.
o (Optional) Try a transformer-based model like DistilBERT or BERT if you’re
comfortable with Hugging Face Transformers.
2. Train/Validation Split
o If the dataset is already split into train/test, use that.
o You can create an additional validation split from the training set (e.g., 80%/20%
within the training set).
3. Training
o Fit the model on the training data.
o Monitor basic metrics (accuracy, F1, etc.) on the validation set.
4. Evaluation
o Evaluate on the test set.
o Report metrics (accuracy, precision, recall, F1-score).
Tip: Keep the model as simple or advanced as you like, but logistic regression on TF-IDF is
typically enough for a baseline.
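For reference, a minimal baseline in this spirit (assumes texts and labels are Python lists
built from the training split; the names are illustrative):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline

    # texts: list[str] of cleaned reviews; labels: list[int], 0 = negative, 1 = positive.
    X_train, X_val, y_train, y_val = train_test_split(
        texts, labels, test_size=0.2, random_state=42, stratify=labels
    )

    model = Pipeline([
        ("tfidf", TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    model.fit(X_train, y_train)

    # Accuracy, precision, recall, and F1 on the validation split.
    print(classification_report(y_val, model.predict(X_val)))

The same classification_report call on the held-out test split yields the final numbers to
report, and joblib.dump(model, "model.pkl") is one way to persist the fitted pipeline for the
Flask app.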
5. Flask API
o Create a simple Flask app (app.py or main.py).
o Include an endpoint, e.g., POST /predict:
Input: JSON with a field review_text (the new text to classify).
Output: JSON with a field sentiment_prediction (e.g., "positive" /
"negative").
o Model Loading:
   Ensure your trained model (e.g., a pickle file or Hugging Face model weights) is
   loaded once when the app starts, not on every request.
   The endpoint should then parse the incoming JSON, apply the same preprocessing
   used in training, run the model, and return the predicted sentiment as JSON.
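A minimal sketch of such an app, assuming the pipeline above was saved to model.pkl (an
illustrative filename):

    import joblib
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    model = joblib.load("model.pkl")  # loaded once at startup, not per request

    @app.route("/predict", methods=["POST"])
    def predict():
        data = request.get_json(force=True)
        text = data.get("review_text", "")
        pred = model.predict([text])[0]
        sentiment = "positive" if pred == 1 else "negative"
        return jsonify({"sentiment_prediction": sentiment})

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=5000)

A quick smoke test once the server is up:

    import requests

    resp = requests.post(
        "http://localhost:5000/predict",
        json={"review_text": "A wonderful film"},
    )
    print(resp.json())  # e.g., {"sentiment_prediction": "positive"}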
6. (Optional) Deployment
o Cloud: Deploy to any free-tier service (Heroku, Render, Railway, etc.) or a small EC2
instance on AWS.
o Provide instructions or documentation if you do this step.
Deliverables
1. Code Repository
o A well-structured repository (GitHub, GitLab, etc.) containing:
Data ingestion/DB setup script or notebook (e.g., data_setup.py).
Model training script or notebook (e.g., train_model.py).
Flask app (e.g., app.py).
Requirements file (requirements.txt) listing all Python dependencies.
2. Database Schema
o Simple instructions or a .sql file that shows table creation if using
MySQL/PostgreSQL.
o If using SQLite, mention your database file name (e.g., imdb_reviews.db) and
how it’s created; a sketch follows after this list.
3. README
o Project Setup: Steps to install dependencies (e.g., pip install -r
requirements.txt) and set up the database.
o Data Acquisition: How you downloaded or loaded the dataset.
o Run Instructions:
How to run the training script.
How to start the Flask server.
How to test the endpoint with example commands or requests.
o Model Info: A summary of the chosen model approach and key results (e.g., final
accuracy on test set).
4. (Optional) Additional Assets
o If you generate plots or a short EDA report, include them in the repo (e.g., a
.ipynb or .pdf).
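If you go the SQLite route, a minimal setup sketch (table and column names are illustrative,
not prescribed):

    import sqlite3

    # Creates imdb_reviews.db in the working directory on first run.
    conn = sqlite3.connect("imdb_reviews.db")
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS reviews (
            id        INTEGER PRIMARY KEY AUTOINCREMENT,
            review    TEXT NOT NULL,
            sentiment TEXT NOT NULL  -- 'positive' or 'negative'
        )
        """
    )
    conn.commit()
    conn.close()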
Evaluation Criteria
1. Completeness: Did you store data in a DB, train a sentiment model, and serve it via
Flask?
2. Correctness: Does the Flask endpoint work? Does the model predict sentiment
accurately on test data?
3. Code Quality & Organization: Is the code clean, documented, and logically separated
into files/modules?
4. Documentation: Is there a clear README with setup and usage instructions?
Assignment 2: RAG Chatbot
Overview
You will implement a simple Retrieval-Augmented Generation (RAG) chatbot that uses a vector database
for semantic search and stores chat history in a local MySQL database. The chatbot will be served via a
Flask API.
Task Breakdown
1. Data Preparation
o Corpus: Pick or create a small text corpus (e.g., a set of documentation pages, articles,
or Wikipedia paragraphs on a particular topic). You can:
Scrape a small set of web pages (ensure the data is not too large—just enough
to test retrieval).
Split larger documents into smaller chunks (e.g., ~200-300 words each).
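A simple word-window splitter for the chunking step (the chunk size is a tunable
assumption):

    def chunk_text(text: str, chunk_size: int = 250) -> list[str]:
        """Split a document into chunks of roughly chunk_size words."""
        words = text.split()
        return [
            " ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)
        ]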
2. Vector Database & Embeddings
o Vector Database: You can choose any free and local-friendly vector DB (e.g., Faiss,
Chroma, or Milvus).
o Embeddings: Embed each chunk of text and store the resulting vectors in the vector
DB.
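For example, with sentence-transformers and Faiss (the embedding model named here is just
one small, CPU-friendly choice):

    import faiss
    from sentence_transformers import SentenceTransformer

    # chunks: list[str] produced by the chunking step above.
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = encoder.encode(chunks, normalize_embeddings=True)

    # Inner product over normalized vectors is cosine similarity.
    index = faiss.IndexFlatIP(embeddings.shape[1])
    index.add(embeddings)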
3. Retrieval-Augmented Generation (RAG)
o Retrieval: Implement a function that, given a user query, embeds it and returns the
most similar chunks from the vector DB.
o RAG Approach: Pass the retrieved chunks (together with the query) into your small
generation or extraction function to form the final answer.
o Implementation Details: You can run a local model (small transformer, or even a
rule-based approach) if you do not want to rely on large model inference.
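Continuing the sketch above, retrieval plus a deliberately naive extractive "generation" step
(returning the best chunk verbatim stands in for a real generator):

    def retrieve(query: str, k: int = 3) -> list[str]:
        """Return the k chunks most similar to the query."""
        q_emb = encoder.encode([query], normalize_embeddings=True)
        _, idx = index.search(q_emb, k)
        return [chunks[i] for i in idx[0]]

    def answer(query: str) -> str:
        """Extractive baseline: echo the single best chunk."""
        return retrieve(query)[0]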
4. Flask API
o Endpoints:
POST /chat: Accepts a JSON payload with the user’s query. Returns:
Generated answer.
Possibly the top retrieved chunks for debugging (optional).
GET /history: Returns the chat history from the MySQL DB.
o Chat History:
   For each user query and system answer, store them as separate rows in MySQL
   with fields such as:
      id (auto-increment)
      timestamp
      role (user or system)
      the message text
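One possible shape for the API, assuming the answer() helper above and
mysql-connector-python (connection details are placeholders):

    from datetime import datetime

    import mysql.connector
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    db = mysql.connector.connect(
        host="localhost", user="rag", password="rag", database="rag_chat"
    )

    def log_message(role: str, text: str) -> None:
        cur = db.cursor()
        cur.execute(
            "INSERT INTO chat_history (timestamp, role, message) VALUES (%s, %s, %s)",
            (datetime.now(), role, text),
        )
        db.commit()

    @app.route("/chat", methods=["POST"])
    def chat():
        query = request.get_json(force=True)["query"]
        reply = answer(query)  # retrieval + generation from the sketch above
        log_message("user", query)
        log_message("system", reply)
        return jsonify({"answer": reply})

    @app.route("/history", methods=["GET"])
    def history():
        cur = db.cursor(dictionary=True)
        cur.execute("SELECT id, timestamp, role, message FROM chat_history ORDER BY id")
        return jsonify(cur.fetchall())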
5. Database
o MySQL: Use a local MySQL database for the chat history table (see the Database
Schema deliverable).
o Optionally, you can also store user IDs if you want to support multiple users (not
mandatory).
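One possible schema, created from Python so the setup is scriptable (names mirror the API
sketch above and are illustrative):

    import mysql.connector

    db = mysql.connector.connect(host="localhost", user="rag", password="rag")
    cur = db.cursor()
    cur.execute("CREATE DATABASE IF NOT EXISTS rag_chat")
    cur.execute(
        """
        CREATE TABLE IF NOT EXISTS rag_chat.chat_history (
            id        INT AUTO_INCREMENT PRIMARY KEY,
            timestamp DATETIME NOT NULL,
            role      VARCHAR(16) NOT NULL,  -- 'user' or 'system'
            message   TEXT NOT NULL
        )
        """
    )
    db.close()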
6. Testing
o Unit Tests:
Test the Flask endpoints (/chat, /history) to ensure the entire pipeline works.
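A starting point with Flask's built-in test client (assumes the app object from the API
sketch lives in app.py, a hypothetical module name):

    from app import app

    def test_chat_and_history():
        client = app.test_client()

        resp = client.post("/chat", json={"query": "What is this corpus about?"})
        assert resp.status_code == 200
        assert "answer" in resp.get_json()

        resp = client.get("/history")
        assert resp.status_code == 200
        assert isinstance(resp.get_json(), list)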
Deliverables
1. Code Repository:
o All source code for the chatbot pipeline, including the vector DB setup, embedding code,
retrieval code, and Flask API.
o A requirements.txt or Pipfile.
2. Database Schema:
o The SQL schema or migrations for MySQL (tables for storing chat history).
3. Demo / Documentation:
o A README explaining how to install, run, and test the chatbot, ideally with example
requests for /chat and /history.
Submission Guidelines
1. Repo Structure:
o Assignment1
requirements.txt
etc.
o Assignment2
data_preprocessing.py
embed_store.py
requirements.txt
etc.
o README.md
2. Instructions: Provide clear instructions on how someone else can clone your repository, install
dependencies, set up the database, and run each assignment’s solution.
3. Time Expectation:
Evaluation Criteria
1. Correctness & Completeness: Does the code run end-to-end without major issues? Are all
parts of the assignment addressed?
2. Project Structure & Code Quality: Is the repository organized logically? Is the code
readable, well-documented, and maintainable?
3. Solution Design: Are the chosen methods (data cleaning, feature engineering, model
selection, retrieval, etc.) appropriate and justified?
4. Documentation: Is there a clear README explaining how to install, run, and test the
solution?