T15-AWSAnalyticsAndAI-ProblemStatement-Mocktest
Challenge Overview:
This challenge comprises two distinct parts: Analytics and Machine Learning. You
will work with CSV datasets stored in an S3 bucket named "employee-data" with
a unique prefix followed by a random number. Within this bucket, there are two
essential folders:
1. Inputfile: This folder contains the employee_data.csv dataset, which you will
utilize for the Analytics part. Your objective is to leverage AWS Glue for data
cleaning and transformations, subsequently storing the results in Redshift tables.
2. redshift_cleaned_data: This folder contains the employee_cleaned_data.csv
dataset, designated for the Machine Learning tasks.
Note: Ignore any CSV files present in the local folder; use only the CSV files
from the S3 bucket itself for the tasks outlined in this challenge.
Note: You may complete either part first; the two parts are independent.
Creating a VPC Endpoint for Amazon S3, AWS Glue, and Amazon
Redshift Integration:
To facilitate seamless data transfer and integration between Amazon S3, AWS
Glue, and Amazon Redshift, you need to create a VPC endpoint. Follow the
instructions below to set up the VPC endpoint correctly.
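If you prefer to script this step, a minimal boto3 sketch for creating an S3 gateway endpoint is shown below. The VPC ID, route table ID, and the region inside the service name are placeholders you must replace with your own values; the console workflow described in the challenge remains the intended path.

    import boto3

    # Minimal sketch: create an S3 gateway endpoint for the VPC.
    # "vpc-xxxxxxxx" and "rtb-xxxxxxxx" are placeholders; adjust the region
    # in ServiceName to match your environment.
    ec2 = boto3.client("ec2")
    response = ec2.create_vpc_endpoint(
        VpcEndpointType="Gateway",
        VpcId="vpc-xxxxxxxx",
        ServiceName="com.amazonaws.us-east-1.s3",
        RouteTableIds=["rtb-xxxxxxxx"],
    )
    print(response["VpcEndpoint"]["VpcEndpointId"])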
You are tasked with creating crawlers in AWS Glue to connect to data stores,
determine the schema for your data, and create metadata tables in your data
catalog. Follow the instructions below to create the necessary crawlers for your
datasets.
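Crawlers can be created from the Glue console as described in the steps that follow; for reference, an equivalent boto3 sketch is shown here. The crawler name, IAM role ARN, catalog database name, and S3 path are illustrative placeholders, not values prescribed by the challenge.

    import boto3

    # Minimal sketch: create and start a Glue crawler over the source folder.
    # The role ARN, database name, and bucket path are placeholders.
    glue = boto3.client("glue")
    glue.create_crawler(
        Name="employee-source-crawler",
        Role="arn:aws:iam::123456789012:role/GlueServiceRole",
        DatabaseName="employee_db",
        Targets={"S3Targets": [{"Path": "s3://employee-data/Inputfile/"}]},
    )
    glue.start_crawler(Name="employee-source-crawler")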
Create Crawler for Source Data (S3 Bucket):
Step 1: Create Crawler
NOTE: Create the Redshift output tables before creating the crawlers for the
Redshift tables; refer to the tasks for the required output schema.
Note1: After completing all the tasks, check the output in the Redshift tables
to ensure the data loaded successfully.
Note2: Upon completion of the Analytics part, the cleaned data will be stored in a
Redshift table. Assume the cleaned data in the Redshift table has been transferred
to the S3 bucket. You can access this cleaned data in the S3 bucket "employee-data"
within the "redshift_cleaned_data" folder. This preprocessed data is ready for use
in the subsequent Machine Learning tasks.
Cloud-Driven HR Insights with Machine Learning
Cloud Driven Tasks:
Instructions:
Begin by importing the AWS SDK for Python (boto3) and other necessary
libraries.
Use the AWS SDK to establish a session and obtain the execution role
for your AWS account.
Use the bucket name: employee-data
Use the SageMaker library to set the execution role used to access the data
from the S3 bucket.
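A minimal sketch of the setup described above, assuming the code runs inside a SageMaker notebook where get_execution_role() can resolve the notebook's role:

    import boto3
    import sagemaker
    from sagemaker import get_execution_role

    # Establish a session and obtain the execution role for this account.
    session = boto3.Session()
    sagemaker_session = sagemaker.Session()
    role = get_execution_role()

    # Bucket name as given in the challenge; adjust it if your bucket carries
    # the random number mentioned in the overview.
    bucket_name = "employee-data"
    print(role, bucket_name)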
Hints:
You might need to look up how to use boto3.client('s3') to interact with the
S3 service.
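A sketch of reading the cleaned dataset with boto3.client('s3') and pandas; the object key below is an assumption based on the folder and file names given in the overview:

    import boto3
    import pandas as pd
    from io import BytesIO

    s3 = boto3.client("s3")
    bucket_name = "employee-data"  # adjust to your bucket's actual name

    # Assumed key: the ML dataset sits in the redshift_cleaned_data folder.
    key = "redshift_cleaned_data/employee_cleaned_data.csv"
    obj = s3.get_object(Bucket=bucket_name, Key=key)
    df = pd.read_csv(BytesIO(obj["Body"].read()))
    print(df.head())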
Verify and print the number of duplicate records, if any exist, and the shape
of the DataFrame after cleaning to ensure the dataset is ready for analysis.
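A short sketch of the duplicate check, assuming the dataset has already been loaded into a pandas DataFrame named df:

    # Count duplicate rows, drop them, and confirm the resulting shape.
    duplicate_count = df.duplicated().sum()
    print(f"Duplicate records: {duplicate_count}")

    df = df.drop_duplicates()
    print(f"Shape after cleaning: {df.shape}")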
Create a pie chart to visualize the distribution of values in the
'gender' column.
Use seaborn to generate a count plot for the 'education' column,
categorizing data by gender to analyze demographic
distributions.
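One possible way to produce both plots with matplotlib and seaborn, assuming df holds the cleaned data and the column names match those given above:

    import matplotlib.pyplot as plt
    import seaborn as sns

    # Pie chart of the 'gender' distribution.
    df["gender"].value_counts().plot.pie(autopct="%1.1f%%")
    plt.ylabel("")
    plt.title("Gender distribution")
    plt.show()

    # Count plot of 'education', split by gender.
    sns.countplot(data=df, x="education", hue="gender")
    plt.title("Education by gender")
    plt.xticks(rotation=45)
    plt.show()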
Hints:
Sample Graphs:
Split the data into independent and dependent variables and store them in, for
example, `X` and `y`.
Transform the columns of `X` using scikit-learn's ColumnTransformer and
preprocess them with the OneHotEncoder and passthrough options.
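A minimal sketch of these two steps; the target column name "left_company" is a hypothetical placeholder for whatever the dataset's actual dependent variable is:

    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder

    # Hypothetical target column; substitute the dataset's actual label column.
    target_col = "left_company"
    X = df.drop(columns=[target_col])
    y = df[target_col]

    # One-hot encode the categorical columns, pass the rest through unchanged.
    # (Use sparse=False instead of sparse_output on scikit-learn < 1.2.)
    categorical_cols = X.select_dtypes(include="object").columns.tolist()
    ct = ColumnTransformer(
        transformers=[
            ("onehot",
             OneHotEncoder(handle_unknown="ignore", sparse_output=False),
             categorical_cols),
        ],
        remainder="passthrough",
    )
    X_transformed = ct.fit_transform(X)
    print(X_transformed.shape)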
Create new features that could be predictive of outcomes like turnover
(e.g., employee engagement scores based on multiple factors). Use
scikit-learn's feature selection functions to pick the best features using a
regression-based scoring technique with a `k` value of 5.
Fit and transform the data in `X` using the feature selector, and store the
new data in a variable such as `X_selected`.
Verify the selected columns by checking the shape of the data.
Sample output:
- List of column names
- Ex: ['column1', 'column2', ...]
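Continuing from the preprocessing sketch above, one way to select the five best features with SelectKBest; f_regression is an assumed scoring function (swap in f_classif if the target is categorical):

    from sklearn.feature_selection import SelectKBest, f_regression

    # Keep the 5 features that score best against the target.
    selector = SelectKBest(score_func=f_regression, k=5)
    X_selected = selector.fit_transform(X_transformed, y)
    print(X_selected.shape)  # second dimension should be 5

    # Recover the names of the selected (transformed) columns.
    mask = selector.get_support()
    selected_columns = [name for name, keep
                        in zip(ct.get_feature_names_out(), mask) if keep]
    print(selected_columns)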
Split your data into training and testing sets using train_test_split from
scikit-learn with a test size of 20% and a random state of `0`. Use the
feature-engineered data, i.e., `X_selected`.
Feature-scale the `X_train` and `X_test` data using scikit-learn's StandardScaler.
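A sketch of the split and scaling steps, assuming `X_selected` and `y` from the feature-selection step above:

    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    X_train, X_test, y_train, y_test = train_test_split(
        X_selected, y, test_size=0.2, random_state=0
    )

    # Fit the scaler on the training split only, then apply it to both splits.
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)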
Build a LogisticRegression model using scikit-learn with a random state of
`0`.
Get the predicted value and store it in a variable such as `y_pred`.
Evaluate the model using metrics such as accuracy, precision, recall, and
F1-score.
Plot a heatmap of the confusion matrix of actual versus predicted values for
more insight.
Sample output:
Accuracy – floating point value greater than 65.0
Precision – floating point value greater than 55.0; not less than 50.0
F1-Score – floating point value greater than 50.0
Recall Score – floating point value greater than 45.0
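A sketch of the modelling and evaluation steps, assuming the scaled X_train/X_test and y_train/y_test splits from above and a binary target:

    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import (accuracy_score, precision_score,
                                 recall_score, f1_score, confusion_matrix)

    # Build and fit the classifier, then predict on the test split.
    model = LogisticRegression(random_state=0)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    print("Accuracy :", accuracy_score(y_test, y_pred) * 100)
    print("Precision:", precision_score(y_test, y_pred) * 100)
    print("Recall   :", recall_score(y_test, y_pred) * 100)
    print("F1-Score :", f1_score(y_test, y_pred) * 100)

    # Heatmap of the confusion matrix: actual vs. predicted labels.
    sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt="d",
                cmap="Blues")
    plt.xlabel("Predicted")
    plt.ylabel("Actual")
    plt.show()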
Hints:
Sample message:
Successfully pushed data to S3: model.pkl
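One way to produce a message like the one above is to serialize the trained model with joblib into a temporary file and push it to S3; the object key "model.pkl" is taken from the sample message, and the bucket name should match yours:

    import tempfile
    import boto3
    import joblib

    s3 = boto3.client("s3")
    bucket_name = "employee-data"  # adjust to your bucket's actual name
    model_key = "model.pkl"        # key shown in the sample message

    # Dump the trained model to a temporary file, then upload its bytes.
    with tempfile.TemporaryFile() as fp:
        joblib.dump(model, fp)
        fp.seek(0)
        s3.put_object(Bucket=bucket_name, Key=model_key, Body=fp.read())

    print(f"Successfully pushed data to S3: {model_key}")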
Using an approach similar to the one for pushing the pickle file to the S3
bucket, download the pickle file with tempfile and joblib and load it into a
variable, for example `model`.
Run predictions with the newly loaded `model` on the testing data and store
them in a variable for calculating the metrics.
Execute a prediction using the test dataset and compute the accuracy
of the model using standard metrics.
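A sketch of the download-and-reuse step, assuming the same bucket and key as above and the X_test/y_test splits created earlier:

    import tempfile
    import boto3
    import joblib
    from sklearn.metrics import accuracy_score

    s3 = boto3.client("s3")
    bucket_name = "employee-data"  # adjust to your bucket's actual name
    model_key = "model.pkl"

    # Download the serialized model into a temporary file and load it back.
    with tempfile.TemporaryFile() as fp:
        s3.download_fileobj(bucket_name, model_key, fp)
        fp.seek(0)
        model = joblib.load(fp)

    # Run predictions with the reloaded model and compute its accuracy.
    y_pred_loaded = model.predict(X_test)
    print("Accuracy of reloaded model:",
          accuracy_score(y_test, y_pred_loaded) * 100)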
Hints: