Assignment_Data_Science
Introduction:
We’re excited to have you participate in this interview assignment! The purpose of this task is to
assess your skills in Data Science, MLOps, and Generative AI. You will have the option to work
on any two of the problem statements outlined below. Each task is designed to test different aspects
of your technical abilities, from data manipulation and machine learning to deploying AI models.
Timelines:
Instructions:
• Zip Folder: Submit a zip file that contains your project code.
• You may use Google Colab or any other open-source notebook environment (if needed).
• A Jupyter Notebook that explains your process and walks through the solution. The notebook should also contain the outputs.
You will be working with a dataset related to a company's customer churn. The goal is to predict
whether a customer is likely to churn.
Dataset:
Objective:
Suggested Timelines:
• 2 days
Tasks:
1. Data Preprocessing:
• Perform EDA (Exploratory Data Analysis) to understand the dataset, handle missing
values and outliers, and apply feature transformations (a sketch follows after this task).
• Encode categorical variables and scale numerical features if necessary.
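For reference, a minimal preprocessing sketch is shown below. The file name (churn.csv), the "Churn" target column, and the imputation/encoding choices are illustrative assumptions, not a prescribed approach.

```python
# Minimal preprocessing sketch; file name and column names are illustrative.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("churn.csv")                      # assumed export of the churn dataset

# Quick EDA: shape, missing values, summary stats (helps spot outliers)
print(df.shape)
print(df.isna().sum())
print(df.describe())

# Assume a Yes/No target column named "Churn"; adjust to the real schema
y = (df.pop("Churn") == "Yes").astype(int)

# Impute missing values: median for numeric columns, mode for categoricals
num_cols = df.select_dtypes(include="number").columns
cat_cols = df.select_dtypes(exclude="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())
df[cat_cols] = df[cat_cols].fillna(df[cat_cols].mode().iloc[0])

# One-hot encode categoricals and scale numeric features
# (in a full solution, fit the scaler on the training split only to avoid leakage)
X = pd.get_dummies(df, columns=list(cat_cols), drop_first=True)
X[num_cols] = StandardScaler().fit_transform(X[num_cols])
```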
2. Feature Engineering:
• Create new features that may improve model performance. For example, can you
create a feature that captures customer tenure in months? (A sketch follows below.)
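One possible tenure feature, assuming the raw data includes a customer signup date; the signup_date and total_charges columns are hypothetical and should be mapped to the real schema.

```python
# Illustrative engineered features; signup_date and total_charges are assumed columns.
import pandas as pd

df["signup_date"] = pd.to_datetime(df["signup_date"])
snapshot = df["signup_date"].max()                 # or a fixed reference date

# Tenure in whole months between signup and the snapshot date
df["tenure_months"] = (
    (snapshot.year - df["signup_date"].dt.year) * 12
    + (snapshot.month - df["signup_date"].dt.month)
)

# Example derived feature: average spend per month of tenure
df["avg_monthly_charge"] = df["total_charges"] / df["tenure_months"].clip(lower=1)
```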
3. Model Training and Selection:
• Train at least 2-3 different models (including at least one deep learning model) and
compare their performance (see the sketch after this task).
• Optimize hyperparameters using cross-validation or a grid search method.
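A comparison sketch using cross-validated grid search; X and y are the preprocessed features and labels from the earlier sketch, and MLPClassifier stands in here for the deep learning model (a Keras or PyTorch network could be swapped in).

```python
# Sketch comparing a few candidate models with cross-validated grid search.
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier   # stand-in for a deep learning model

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

candidates = {
    "logreg": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    "rf": (RandomForestClassifier(random_state=42),
           {"n_estimators": [200, 400], "max_depth": [None, 10]}),
    "mlp": (MLPClassifier(max_iter=500, random_state=42),
            {"hidden_layer_sizes": [(64,), (64, 32)]}),
}

best = {}
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, cv=5, scoring="roc_auc", n_jobs=-1)
    search.fit(X_train, y_train)
    best[name] = search.best_estimator_
    print(name, search.best_params_, round(search.best_score_, 3))
```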
4. Model Evaluation:
• Evaluate your models using metrics such as accuracy, precision, recall, F1-score,
and AUC-ROC.
• Create a confusion matrix for each model and discuss the results (see the sketch below).
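An evaluation sketch over the tuned models; the `best` dict and the held-out test split come from the training sketch above.

```python
# Standard classification metrics plus a confusion matrix per model.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

for name, model in best.items():
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    print(f"--- {name} ---")
    print("accuracy :", accuracy_score(y_test, y_pred))
    print("precision:", precision_score(y_test, y_pred))
    print("recall   :", recall_score(y_test, y_pred))
    print("F1       :", f1_score(y_test, y_pred))
    print("AUC-ROC  :", roc_auc_score(y_test, y_prob))
    print("confusion matrix:\n", confusion_matrix(y_test, y_pred))
```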
5. Model Interpretation:
• Explain the most important features that contribute to the predictions.
• Use techniques like SHAP values or feature importance plots to explain the model’s
decisions.
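A SHAP sketch for the tree-based model from the training step; note that the shape of the returned SHAP values varies across shap versions, so binary-classification output may need per-class indexing (e.g. shap_values[1]).

```python
# SHAP summary for the random forest from the training sketch above.
import shap

explainer = shap.TreeExplainer(best["rf"])
shap_values = explainer.shap_values(X_test)

# Global view of which features drive churn predictions
shap.summary_plot(shap_values, X_test)
```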
6. Bonus:
• Suggest potential business actions the company could take based on your analysis
to reduce customer churn.
Deliverables:
• A Jupyter Notebook with your EDA, model training, evaluation, and model interpretation
steps.
• A report summarizing your findings and recommendations.
The second part involves setting up a basic MLOps pipeline to ensure that your machine learning
model can be continuously improved and deployed into production.
Scenario:
You have been asked to deploy the churn prediction model as a REST API for your company's
marketing team. They should be able to send customer data and get a prediction of whether the
customer is likely to churn.
Suggested Timelines:
• 1 day
Tasks:
1. Model Packaging:
• Package the trained machine learning model (from Part 1) using a Python-based
framework like Flask, FastAPI, or Django.
• Ensure the model can be served as a REST API that accepts customer data and
returns a prediction.
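A minimal FastAPI serving sketch; the artifact name (churn_model.joblib), the request fields, and the 0.5 decision threshold are assumptions to be adapted to the Part 1 model and its preprocessing.

```python
# Minimal FastAPI sketch; field names are illustrative and must match the real feature set.
import joblib
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("churn_model.joblib")      # assumed artifact from Part 1

class Customer(BaseModel):
    tenure_months: int
    monthly_charges: float
    contract_type: str

@app.post("/predict")
def predict(customer: Customer):
    features = pd.DataFrame([customer.dict()])
    # NOTE: the same preprocessing applied at training time must be applied here.
    prob = float(model.predict_proba(features)[0, 1])
    return {"churn_probability": prob, "churn": prob >= 0.5}
```

Run locally with, for example, `uvicorn main:app --reload` if the file is named main.py.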
2. Containerization:
• Containerize the API using Docker, ensuring that the application can run in a
reproducible environment.
3. CI/CD Pipeline:
• Set up a CI/CD pipeline using tools like GitHub Actions or Jenkins to automatically
test and deploy the model when changes are made.
4. Monitoring:
• Suggest how you would monitor the model’s performance in production, for example
by tracking model drift, API latency, and request logs (a sketch follows below).
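For the latency and request-log part, one illustrative option is FastAPI middleware around the prediction endpoint (this extends the `app` object from the packaging sketch); drift monitoring would additionally log incoming feature distributions for comparison against the training data.

```python
# Illustrative latency/request logging via FastAPI middleware.
import logging
import time
from fastapi import Request

logger = logging.getLogger("churn_api")

@app.middleware("http")
async def log_requests(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    latency_ms = (time.perf_counter() - start) * 1000
    logger.info("path=%s status=%s latency_ms=%.1f",
                request.url.path, response.status_code, latency_ms)
    return response
```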
5. Scaling (Bonus):
• Discuss how you would scale this API in a production environment. Include tools
like Kubernetes or auto-scaling with cloud platforms (e.g., AWS, GCP).
Deliverables:
In this task, you will use the “Enron Email Dataset” to build a system that can either summarize
long email threads or generate responses to common emails. The goal is to explore the capabilities
of a generative language model to handle everyday email tasks.
Dataset:
Scenario:
Your company wants to automate part of its email workflow by using AI to either summarize long
email threads or generate responses for common email types.
Suggested Timelines:
• 2 days
Objective:
• Create a pipeline using a pre-trained language model to perform one of the following tasks: summarizing long email threads, or generating responses to common emails.
Tasks:
1. Data Selection:
• Pick a subset of emails that are either part of long threads (3+ replies) or cover common
topics (e.g., meeting requests, project updates); a selection sketch follows below.
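A rough selection sketch, assuming the Kaggle export of the Enron corpus (emails.csv with a raw `message` column); the thread-depth and topic heuristics are illustrative only.

```python
# Rough subset selection; heuristics and file layout are assumptions.
import email
import pandas as pd

emails = pd.read_csv("emails.csv")

def parse(raw):
    msg = email.message_from_string(raw)
    return msg["Subject"] or "", msg.get_payload()

emails[["subject", "body"]] = emails["message"].apply(lambda m: pd.Series(parse(m)))

# Proxy for "long threads": bodies with several quoted replies embedded
emails["reply_depth"] = emails["body"].astype(str).str.count("-----Original Message-----")
long_threads = emails[emails["reply_depth"] >= 3]

# Proxy for "common topics": simple keyword match on the subject line
common = emails[emails["subject"].str.contains("meeting|update|schedule",
                                                case=False, na=False)]
```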
2. Email Thread Summarization:
• Steps:
o Generate summaries of the selected threads using a pre-trained language model.
o Evaluate the summary by checking if it captures the key points (e.g., decisions,
action items).
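A summarization sketch with a pre-trained checkpoint; facebook/bart-large-cnn is one common choice, and any comparable summarization model would work. The `long_threads` frame is from the selection sketch above.

```python
# Summarization sketch using a pre-trained model from the Hugging Face hub.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def summarize_thread(thread_text: str) -> str:
    # Rough truncation to fit the model's context window; long threads may need chunking.
    return summarizer(thread_text[:3000], max_length=150, min_length=40,
                      do_sample=False)[0]["summary_text"]

print(summarize_thread(long_threads.iloc[0]["body"]))
```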
3. Email Response Generation:
• Steps:
o Select a set of common email topics (e.g., meeting requests, status updates).
o Generate candidate responses using a pre-trained language model.
o Evaluate the responses by checking if they are relevant and appropriate for the
context.
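A response-generation sketch with an instruction-tuned model (google/flan-t5-base is used here only as an example; a larger hosted LLM could sit behind the same function). The `common` frame is from the selection sketch above.

```python
# Response-generation sketch with an instruction-tuned seq2seq model.
from transformers import pipeline

generator = pipeline("text2text-generation", model="google/flan-t5-base")

def draft_reply(email_text: str) -> str:
    prompt = ("Write a short, polite reply to the following email:\n\n"
              f"{email_text[:2000]}\n\nReply:")
    return generator(prompt, max_new_tokens=120)[0]["generated_text"]

print(draft_reply(common.iloc[0]["body"]))
```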
4. Model Evaluation:
• Summarization Task:
o Assess the quality of the summaries (using metrics or manually). Does the
summary capture the main points? Is it concise and accurate? (A ROUGE sketch follows below.)
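For an automatic check of summary quality, ROUGE against a small set of hand-written reference summaries is one option; the reference string below is a placeholder, and manual spot checks remain important.

```python
# ROUGE scoring of a generated summary against a human-written reference.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)

# 'reference' would be a human-written summary; here it is a placeholder string.
reference = "Team agreed to move the launch to Friday; Bob to update the schedule."
candidate = summarize_thread(long_threads.iloc[0]["body"])
print(scorer.score(reference, candidate))
```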
5. Bonus (Demo App):
• Build a simple app that allows users to input an email thread and receive a summary or
an automated response.
Deliverables: