
Data Science Assignment

Introduction:

We’re excited to have you participate in this interview assignment! The purpose of this task is to
assess your skills in Data Science, MLOps, and Generative AI. You will have the option to work
on any two of the problem statements outlined below. Each task is designed to test different aspects
of your technical abilities, from data manipulation and machine learning to deploying AI models.

Timelines:

• 3-5 days after receiving the assignment.

Instructions:

• Choose any two tasks from the assignment to complete.

• Submit your solution using one of the following methods:

o GitHub Repository: Provide a link to the repository containing your code.

o Zip Folder: Submit a zip file that contains your project code.

• You may use Google Colab or any other open-source notebook environment if needed.

Ensure your submission includes:

• A Jupyter Notebook that explains your process and walks through the solution; the
notebook should also contain the cell outputs.

• Any additional Python scripts/files if required for deployment or additional functionalities.

Good luck, and we look forward to reviewing your solutions!


Problem Statement 1: Data Science Task

You will be working with a dataset related to a company's customer churn. The goal is to predict
whether a customer is likely to churn.

Dataset:

Open-source dataset from Kaggle: https://www.kaggle.com/datasets/blastchar/telco-customer-churn/data

Objective:

Build a Machine Learning / Deep Learning Model to predict customer churn.

Suggested Timelines:

• 2 days

Tasks:

1. Data Preprocessing:
• Perform EDA (Exploratory Data Analysis) to understand the dataset and handle
missing values, outliers, and feature transformations.
• Encode categorical variables and scale numerical features if necessary.
2. Feature Engineering:
• Create new features that may improve model performance. For example, can you
create a feature that indicates customer tenure length in months?
3. Model Training and Selection:
• Train at least two to three different models (including at least one deep learning
model) and compare their performance.
• Optimize hyperparameters using cross-validation or a grid search method.
4. Model Evaluation:
• Evaluate your models using metrics such as accuracy, precision, recall, F1-score,
and AUC-ROC.
• Create a confusion matrix for all models and discuss the results.
5. Model Interpretation:
• Explain the most important features that contribute to the predictions.
• Use techniques like SHAP values or feature importance plots to explain the model’s
decisions.
6. Bonus:
• Suggest potential business actions the company could take based on your analysis
to reduce customer churn.
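The preprocessing, training, and evaluation steps above can be sketched as follows. This is a minimal illustration with scikit-learn on a tiny synthetic stand-in for the Telco CSV (the column names are assumptions based on that dataset), not a full solution — a real submission would load the Kaggle file, handle all its features, and add a deep learning model.

```python
# Minimal sketch of tasks 1-4, assuming scikit-learn and pandas.
# The columns below are assumptions modeled on the Telco dataset.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny synthetic stand-in for the Kaggle CSV.
df = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 8, 22, 10, 28, 62, 13, 16, 58],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 70.70, 99.65,
                       89.10, 104.80, 56.15, 49.95, 18.95, 100.35],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "One year",
                 "Month-to-month", "Month-to-month", "Month-to-month",
                 "Month-to-month", "One year", "Month-to-month",
                 "Two year", "One year"],
    "Churn": [1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0],
})
X, y = df.drop(columns="Churn"), df["Churn"]

# Task 1: scale numeric features, one-hot encode categoricals.
pre = ColumnTransformer([
    ("num", StandardScaler(), ["tenure", "MonthlyCharges"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
])
model = Pipeline([("pre", pre), ("clf", LogisticRegression(max_iter=1000))])

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Task 3: hyperparameter search via cross-validated grid search.
grid = GridSearchCV(model, {"clf__C": [0.1, 1.0, 10.0]}, cv=3, scoring="roc_auc")
grid.fit(X_tr, y_tr)

# Task 4: evaluate with AUC-ROC plus precision/recall/F1.
proba = grid.predict_proba(X_te)[:, 1]
print("AUC-ROC:", roc_auc_score(y_te, proba))
print(classification_report(y_te, grid.predict(X_te), zero_division=0))
```

For model interpretation (task 5), the fitted pipeline's coefficients or a SHAP explainer can be applied to the transformed features in the same way.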
Deliverables:

• A Jupyter Notebook with your EDA, model training, evaluation, and model interpretation
steps.
• A report summarizing your findings and recommendations.

Problem Statement 2: MLOps Task

The second part involves setting up a basic MLOps pipeline to ensure that your machine learning
model can be continuously improved and deployed into production.

Scenario:

You have been asked to deploy the churn prediction model as a REST API for your company's
marketing team. They should be able to send customer data and get a prediction of whether the
customer is likely to churn.

Suggested Timelines:

• 1 day

Tasks:

1. Model Packaging:

• Package the trained machine learning model (from Part 1) using a Python-based
framework like Flask, FastAPI, or Django.

• Ensure the model can be served as a REST API that accepts customer data and
returns a prediction.

2. Containerization:

• Containerize the API using Docker, ensuring that the application can run in a
reproducible environment.
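An illustrative Dockerfile for the API might look like the following; the file names (`app.py` holding the FastAPI app, `requirements.txt` pinning dependencies) and the port are assumptions to adapt to your project layout.

```dockerfile
# Minimal illustrative Dockerfile; file names and port are assumptions.
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
```

Build and run with `docker build -t churn-api .` followed by `docker run -p 8000:8000 churn-api`.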

3. CI/CD Pipeline (Bonus):

• Set up a CI/CD pipeline using tools like GitHub Actions or Jenkins to automatically
test and deploy the model when changes are made.

• Integrate a testing framework to validate the API before deployment.


4. Model Monitoring:

• Suggest how you would monitor the model’s performance in production. For
example, tracking model drift, API latency, and request logs.

5. Scaling (Bonus):

• Discuss how you would scale this API in a production environment. Include tools
like Kubernetes or auto-scaling with cloud platforms (e.g., AWS, GCP).

Deliverables:

• A GitHub repository or zip file of the project containing:

o The REST API code for model deployment.

o A Dockerfile for containerizing the API.

o Optional CI/CD setup files (if part of the bonus tasks).

o Instructions to run the API locally.

Problem Statement 3: Generative AI

In this task, you will use the “Enron Email Dataset” to build a system that can either summarize
long email threads or generate responses to common emails. The goal is to explore the capabilities
of a generative language model to handle everyday email tasks.

Dataset:

Open-source dataset from Kaggle: https://www.kaggle.com/datasets/wcukierski/enron-email-dataset

Scenario:

Your company wants to automate part of its email workflow by using AI to either summarize long
email threads or generate responses for common email types.

Suggested Timelines:

• 2 days
Objective:

• Create a pipeline using a pre-trained language model to perform one of the following tasks:

1. Summarize long email threads.

2. Generate automated responses to common email types.

Tasks:

1. Dataset Exploration & Preprocessing:

• Download the Enron Email Dataset from Kaggle.

• Pick a subset of emails that are either part of long threads (3+ replies) or cover common
topics (e.g., meeting requests, project updates).

• Clean the email text by:

o Removing unnecessary parts (e.g., signatures, metadata).

o Simplifying the content for model input.
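The cleaning step can be sketched with only the standard library: the Enron CSV's "message" column holds raw RFC 2822 email text, so Python's `email` module can split headers from the body. The signature and quoting heuristics below are assumptions that will need tuning on the real data.

```python
# Rough email-cleaning sketch; the cutoff heuristics are assumptions.
import email
import re

def clean_email(raw: str) -> str:
    msg = email.message_from_string(raw)
    body = msg.get_payload()  # Enron messages are plain text, not multipart
    lines = []
    for line in body.splitlines():
        if line.strip().startswith(">"):  # drop quoted replies
            continue
        if re.match(r"^-+\s*Original Message\s*-+", line, re.I):
            break  # drop forwarded tail
        if re.match(r"^(--|__|Best regards|Thanks,|Regards,)", line.strip(), re.I):
            break  # crude signature cutoff
        lines.append(line)
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()

raw = """Message-ID: <123.JavaMail@thyme>
From: alice@enron.com
To: bob@enron.com
Subject: Meeting tomorrow

Can we move the 10am sync to 2pm?

> Sure, what time works?

Thanks,
Alice
"""
print(clean_email(raw))  # only the new, unquoted content survives
```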

2. Email Summarization Task:

Goal: Summarize long email threads into concise, actionable summaries.

• Steps:

o Use a pre-trained, open-source model for text summarization.

o Input email threads into the model and generate a summary.

o Evaluate the summary by checking if it captures the key points (e.g., decisions,
action items).
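These steps might be sketched as below. The chunking helper is plain Python; `summarize_thread` assumes the Hugging Face `transformers` library is installed, and the checkpoint name is just one common open choice, not a requirement. Long threads are summarized chunk by chunk, then the partial summaries are summarized again.

```python
# Summarization sketch; the model checkpoint and chunk size are assumptions.
def chunk_text(text: str, max_words: int = 400) -> list[str]:
    """Split a long thread into word-bounded chunks the model can handle."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def summarize_thread(thread: str) -> str:
    """Two-pass summary of a cleaned thread; assumes `transformers` + torch."""
    from transformers import pipeline  # pip install transformers torch
    summarizer = pipeline("summarization",
                          model="sshleifer/distilbart-cnn-12-6")
    partial = [summarizer(c, max_length=80, min_length=15)[0]["summary_text"]
               for c in chunk_text(thread)]
    # A second pass over the concatenated partials yields the final digest.
    return summarizer(" ".join(partial),
                      max_length=100, min_length=20)[0]["summary_text"]
```

Evaluation can then compare each summary against the thread's decisions and action items, either manually or with a metric such as ROUGE.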

3. Response Generation Task:

Goal: Automatically generate responses to common email types.

• Steps:

o Select a set of common email topics (e.g., meeting requests, status updates).

o Use a pre-trained model to generate an automated response based on the email
content.

o Evaluate the responses by checking if they are relevant and appropriate for the
context.

4. Model Evaluation:
• Summarization Task:

o Assess the quality of the summaries (using metrics or manual review). Does the
summary capture the main points? Is it concise and accurate?

• Response Generation Task:

o Check if the responses are coherent, contextually appropriate, and relevant to
the email.

5. Bonus (Optional): Simple API Deployment:

• Deploy the summarization or response generation system as a simple Flask or FastAPI
app.

o The app should allow users to input an email thread and receive a summary or
an automated response.

Deliverables:

• A Jupyter Notebook or Python script with:

o Data preprocessing steps.

o Implementation of the email summarization and response generation task.

o Manual evaluation of the results (or use of advanced metrics).

• A GitHub repository containing:

o The code for email summarization and response generation.

o (Optional) API code if the bonus task is completed.
