Lab 3 ML
base_path = '/home/ec2-user/SageMaker/data/'
file_path = 'spambase.data'
Lab 3 explanation
The provided code snippet is a set of Python commands that accomplishes the
following tasks:
1. Update Boto3 and SageMaker Libraries:
The %pip install --upgrade boto3 sagemaker command is used to
upgrade the boto3 and sagemaker Python libraries to their latest
versions. These libraries are commonly used in AWS (Amazon Web
Services) for tasks like interacting with AWS services and working with
SageMaker, a service for building, training, and deploying machine
learning models on AWS.
2. Reset Python Environment:
The %reset -f command is used to reset the Python environment, clearing
all variables and namespaces.
3. Import Necessary Libraries:
The code imports necessary Python libraries, such as subprocess and
os, for interacting with the system and running shell commands.
4. Check File Existence and Download if Needed:
This part of the code checks whether a file named "spambase.data" exists
in a specified path (base_path). The file is expected to be located in the
SageMaker data directory.
If the file doesn't exist, it creates the directory specified by base_path
using the mkdir command.
It then uses the aws s3 cp command to download files recursively from an
S3 bucket ('s3://aws-tc-largeobjects/ILT-TF-200-MLDWTS/lab3/') and
stores them in the base_path.
The purpose of this section is to ensure that the required data file is
available locally for further processing.
5. Print Confirmation Message:
If the file already exists in the specified path, it prints a message indicating
that the file has already been downloaded.
In summary, this code snippet is setting up the Python environment for working with
AWS SageMaker and checking whether a specific data file ("spambase.data") exists
locally. If the file is not found, it downloads it from an S3 bucket. This is a common setup
when preparing data for machine learning experiments in SageMaker, ensuring that the
necessary data is available in the SageMaker environment.
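A minimal sketch of that setup cell is shown below. It assumes subprocess is used to run the shell commands; the lab notebook may use shell magics instead, and the bucket prefix and paths are taken from the description above.
%pip install --upgrade boto3 sagemaker
%reset -f

import os
import subprocess

base_path = '/home/ec2-user/SageMaker/data/'
file_path = 'spambase.data'

# Download the dataset from S3 only if it is not already present locally
if not os.path.isfile(base_path + file_path):
    subprocess.run(['mkdir', '-p', base_path])
    subprocess.run(['aws', 's3', 'cp', '--recursive',
                    's3://aws-tc-largeobjects/ILT-TF-200-MLDWTS/lab3/', base_path])
else:
    print('File already downloaded')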
import csv
import numpy as np
data = []
f = open('/home/ec2-user/SageMaker/data/spambase.data')
reader = csv.reader(f)
next(reader, None)
for row in reader:
    data.append(row)
f.close()
The provided code snippet is a Python script that reads data from a CSV (Comma-
Separated Values) file named "spambase.data" located at the specified path and stores
it in a list called "data." Here's a step-by-step explanation of what each part of the code
does:
1. Import Necessary Libraries:
The code begins by importing two Python libraries:
csv: This library provides functionality for reading and writing CSV
files.
numpy (imported as np): NumPy is a popular library for numerical
computations in Python. It's often used for working with arrays and
matrices.
2. Initialize an Empty List for Data:
An empty list named "data" is created. This list will be used to store the
data read from the CSV file.
3. Open and Read the CSV File:
The code opens the CSV file located at
/home/ec2-user/SageMaker/data/spambase.data using the open
function. It assigns the file object to the variable f.
A CSV reader is created by passing the file object f to csv.reader. This
reader will allow us to iterate through the rows of the CSV file.
The next(reader, None) command is used to skip the header row of the
CSV file, assuming that the first row contains column headers. This
ensures that the data starts from the second row.
4. Iterate Through CSV Rows and Store Data:
The code then enters a loop that iterates through each row in the CSV file
using for row in reader:.
For each row, it appends the row as a list to the "data" list. This effectively
collects all the rows of data from the CSV file into the "data" list.
5. Close the CSV File:
After all the data has been read and stored, the code closes the CSV file
using f.close() to release system resources associated with the file.
Once this code has been executed, the "data" list will contain the data from the CSV file,
with each row of data represented as a list of values. You can then proceed to analyze
or process this data as needed, potentially using libraries like NumPy for further
computations or machine learning tasks.
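For comparison, the same loading step could be written with pandas in a single call. This is only an illustrative sketch, not the lab's code, and, like the loop above, it treats the first line of the file as a header row and skips it.
import pandas as pd

# Read everything after the first line, mirroring the next(reader, None) skip above
raw_df = pd.read_csv('/home/ec2-user/SageMaker/data/spambase.data', header=0)
data = raw_df.values.tolist()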
import pandas as pd
df = pd.DataFrame(data=np.array(data).astype(np.float), columns=["word_freq_make",
"word_freq_address",
The provided code snippet is using the pandas library in Python to create a DataFrame
from the data that was previously read from the CSV file and stored in the "data" list.
This DataFrame is used to structure and organize the data in a tabular format. Let's
break down what this code is doing:
1. Import Necessary Library:
The code imports the pandas library and assigns it the alias pd. pandas
is a powerful library for data manipulation and analysis in Python.
2. Create a DataFrame:
The code creates a DataFrame by calling the pd.DataFrame()
constructor.
3. Data for DataFrame Initialization:
The data for initializing the DataFrame is provided within the constructor:
data=np.array(data).astype(np.float): This part converts the
"data" list, which contains rows of data as lists of strings, into a
NumPy array of floating-point numbers (np.float). This conversion
ensures that the data is in a numerical format suitable for
DataFrame creation.
4. Column Names:
The columns parameter is used to specify the names of the DataFrame
columns. In this case, column names are provided as a list of strings:
["word_freq_make", "word_freq_address", ...]: The list contains
column names that correspond to the data attributes in the
DataFrame.
5. DataFrame Structure:
The resulting DataFrame df will have the same number of columns as
there are column names provided in the "columns" parameter. Each
column will contain the data from the "data" list, and the column names
will match the names specified in the "columns" list.
6. DataFrame Usage:
Once the DataFrame is created, you can use various pandas methods
and operations to perform data analysis, exploration, and manipulation on
the structured data.
Overall, this code snippet is transforming the raw data from the CSV file into a
structured DataFrame, making it easier to work with and analyze the data using pandas
capabilities. The provided column names correspond to the attributes of the data,
allowing for more meaningful and organized data handling.
df.head()
The df.head() command is used to display the first few rows of a DataFrame in Python,
specifically the DataFrame named df that you created earlier using pandas. When you
execute df.head(), it returns a preview of the DataFrame with its top rows. By default, it
shows the first 5 rows, but you can specify a different number of rows to display by
passing an argument to the head() method.
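For example:
df.head(10)  # Show the first 10 rows instead of the default 5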
df.describe()
The df.describe() command generates summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for each numeric column of the DataFrame, which is useful for a quick check of the feature distributions before training.
from sklearn.model_selection import train_test_split

# Get the feature values until the target column (not included)
X = df.values[:, :-1].astype(np.float32)
# Get the target labels from the last column
y = df.values[:, -1].astype(np.float32)

# Get 80% of the data for training; the remaining 20% will be for validation and test
train_features, test_features, train_labels, test_labels = train_test_split(X, y, test_size=0.2)

# Split the remaining 20% of data as 10% test and 10% validation
test_features, val_features, test_labels, val_labels = train_test_split(test_features, test_labels, test_size=0.5)
import sagemaker
The provided code is configuring and creating an Amazon SageMaker Linear
Learner model for binary classification. Let's break down what each part of the code
does:
1. Import SageMaker Library:
The code begins by importing the sagemaker library. Amazon SageMaker
is a cloud service that allows you to build, train, and deploy machine
learning models, and this library provides the necessary tools and
functions to work with SageMaker services in Python.
2. Create a Linear Learner Estimator:
The main part of the code involves creating a Linear Learner estimator
object. An estimator in SageMaker represents a machine learning model
that can be trained and deployed.
The sagemaker.LinearLearner class is used to create this estimator.
3. Configuration Parameters:
Several configuration parameters are passed to the
sagemaker.LinearLearner constructor to configure the estimator:
role: This parameter specifies the IAM (Identity and Access
Management) role that SageMaker should assume when
performing operations. It is obtained using
sagemaker.get_execution_role(), which retrieves the SageMaker
execution role associated with your environment.
instance_count: This parameter specifies the number of Amazon
EC2 instances to use for training the model. In this case, it's set to
1 instance.
instance_type: This parameter specifies the type of Amazon EC2
instance to use for training. Here, it's set to 'ml.m4.xlarge', which
represents a specific type of compute instance optimized for
machine learning workloads.
predictor_type: This parameter indicates that the Linear Learner
will be used as a binary classifier, making predictions for binary
classification tasks (e.g., spam detection, fraud detection).
4. Estimator Object:
The binary_estimator variable now holds the Linear Learner estimator
object, fully configured and ready to be trained.
In summary, this code sets up a SageMaker Linear Learner estimator for binary
classification. It defines various configuration parameters, such as the instance type,
instance count, and role, which are necessary for training and deploying the model on
Amazon SageMaker. Once configured, you can proceed to train the model using data
and deploy it for making predictions on new data.
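Based on the parameters described above, the estimator creation likely looks like the following sketch; the exact cell in the lab may differ in minor details.
import sagemaker

binary_estimator = sagemaker.LinearLearner(
    role=sagemaker.get_execution_role(),  # IAM role SageMaker assumes for the job
    instance_count=1,                     # number of EC2 instances used for training
    instance_type='ml.m4.xlarge',         # EC2 instance type for the training job
    predictor_type='binary_classifier'    # train a binary classification model
)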
The provided code snippet is preparing the data for training a machine learning model
using the Amazon SageMaker Linear Learner estimator. It involves converting the
training, validation, and testing datasets into a specific data format called "RecordSet"
that is compatible with SageMaker. Let's break down what each line of code does:
1. train_records Line:
train_records = binary_estimator.record_set(train_features,
train_labels, channel='train')
This line creates a "RecordSet" from the training data.
train_features and train_labels are the feature data (input) and target
labels (output) for training, respectively.
channel='train' specifies that this RecordSet is associated with the
training channel. In SageMaker, a "channel" is a named data source that
you can use to organize and manage data for different purposes, such as
training or validation.
2. val_records Line:
val_records = binary_estimator.record_set(val_features, val_labels,
channel='validation')
Similar to the previous line, this line creates a RecordSet, but for the
validation data.
val_features and val_labels represent the feature data and target labels
for validation.
channel='validation' specifies that this RecordSet is associated with the
validation channel.
3. test_records Line:
test_records = binary_estimator.record_set(test_features,
test_labels, channel='test')
This line creates a RecordSet for the test data.
test_features and test_labels contain the feature data and target labels
for testing.
channel='test' indicates that this RecordSet is associated with the test
channel.
In summary, these lines of code prepare the data in a format that SageMaker can work
with during model training. The data is organized into RecordSets, each associated with
a specific data channel (train, validation, or test). This separation helps SageMaker
manage and use the data appropriately during the training process, ensuring that the
model is trained on the training data, validated on the validation data, and tested on the
test data.
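Collected into a single cell, the three calls described above are:
# Wrap each data split in a RecordSet tagged with its channel name
train_records = binary_estimator.record_set(train_features, train_labels, channel='train')
val_records = binary_estimator.record_set(val_features, val_labels, channel='validation')
test_records = binary_estimator.record_set(test_features, test_labels, channel='test')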
The provided code is instructing Amazon SageMaker to fit (train) a machine learning
model using the binary_estimator on the prepared data, which includes training,
validation, and testing datasets in RecordSet format. Let's break down what this code
does:
1. binary_estimator.fit([...]) Line:
This line invokes the fit method on the binary_estimator object. The fit
method is used to train a machine learning model using SageMaker.
Inside the fit method, a list [train_records, val_records, test_records] is
passed as an argument. This list contains the RecordSets for the training,
validation, and testing data.
Here's what happens when you call binary_estimator.fit([...]):
SageMaker uses the Linear Learner estimator (binary_estimator) to train a
binary classification model.
During training, the model learns patterns and relationships in the training data to
make predictions on binary classification tasks.
The training process typically involves multiple iterations or epochs, where the
model adjusts its internal parameters to minimize a loss function and improve its
predictive accuracy.
The validation data (val_records) is used periodically during training to evaluate
the model's performance on data it hasn't seen during training. This helps in
monitoring the model's generalization and preventing overfitting.
The testing data (test_records) is not used during training but is kept separate
for evaluating the model's performance after training is complete. It provides an
unbiased assessment of how well the model will perform on new, unseen data.
In summary, the binary_estimator.fit([...]) line is the key step for training a binary
classification model using SageMaker. It uses the provided RecordSets for training,
validation, and testing to create and fine-tune the model, ensuring that it can make
accurate binary classification predictions.
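The training call itself is a single line:
# Launch the SageMaker training job on the three RecordSet channels
binary_estimator.fit([train_records, val_records, test_records])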
sagemaker.analytics.TrainingJobAnalytics(
    binary_estimator._current_job_name,
    metric_names=['test:binary_classification_accuracy',
                  'test:precision',
                  'test:recall']
).dataframe()
The provided code is used to obtain training job analytics data from an Amazon
SageMaker training job. It fetches specific metrics related to the performance of the
machine learning model trained using the binary_estimator. Let's break down what this
code does:
1. sagemaker.analytics.TrainingJobAnalytics(...) Line:
This line of code creates an instance of the TrainingJobAnalytics class
from SageMaker's analytics module. This class is used to retrieve
analytics data from a SageMaker training job.
The TrainingJobAnalytics constructor takes two main arguments:
binary_estimator._current_job_name: This represents the name
of the current SageMaker training job associated with the
binary_estimator. It's used to specify which training job's analytics
data you want to retrieve.
metric_names: This is a list of specific metrics that you want to
retrieve from the training job's analytics data. In this case, the code
is interested in three metrics: binary classification accuracy,
precision, and recall.
2. .dataframe() Method:
After creating the TrainingJobAnalytics instance, the .dataframe()
method is called on it.
This method retrieves the analytics data from the specified training job and
returns it as a Pandas DataFrame. The DataFrame will contain the
requested metrics along with other relevant information.
The purpose of this code is to fetch performance metrics from a SageMaker training job.
These metrics are essential for evaluating how well the machine learning model trained
by binary_estimator performs on the test data. The specific metrics chosen, such as
accuracy, precision, and recall, provide insights into the model's predictive performance,
especially in binary classification tasks.
Once you execute this code, you will have access to the training job analytics data in a
DataFrame format, allowing you to further analyze and visualize the model's
performance metrics.
binary_predictor = binary_estimator.deploy(initial_instance_count=1,
                                           instance_type='ml.m4.xlarge')
The provided code is used to deploy the trained Amazon SageMaker Linear Learner
model as an endpoint for making predictions on new data. Let's break down what each
part of the code does:
1. binary_predictor Variable Assignment:
The code assigns the result of the deployment operation to the
binary_predictor variable. This variable will be used to interact with the
deployed model for making predictions.
2. binary_estimator.deploy(...) Line:
This line invokes the deploy method on the binary_estimator object. The
deploy method is used to create an endpoint for serving the trained
machine learning model.
3. Arguments for Deployment:
The deploy method takes two important arguments:
initial_instance_count: This specifies the number of Amazon EC2
instances to launch for hosting the endpoint. In this case, it's set to
1, indicating that a single instance will be used initially.
instance_type: This specifies the type of Amazon EC2 instance to
use for hosting the model endpoint. Here, it's set to 'ml.m4.xlarge',
which represents a specific type of compute instance optimized for
machine learning workloads.
4. Deployment Process:
When you execute this line of code, SageMaker will perform the following
actions:
Create a new SageMaker endpoint using the trained Linear Learner
model.
Allocate the specified resources (e.g., EC2 instances) to host the
endpoint.
Load the model into the deployed endpoint, making it ready to
receive inference requests.
5. binary_predictor as the Endpoint Interface:
After deployment, the binary_predictor variable can be used as an
interface to the deployed endpoint. You can use it to send new data to the
endpoint and receive predictions from the trained model.
In summary, this code deploys the trained Linear Learner model as a SageMaker
endpoint, allowing you to use it for real-time predictions on new data. The
binary_predictor variable represents the deployed model, and you can interact with it
to obtain predictions from the model hosted on SageMaker.
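As a small illustration of the last point, the deployed predictor can be called directly on a NumPy array. This sketch assumes the Record response format returned by the Linear Learner predictor in the SageMaker Python SDK:
# Send the first five test rows to the endpoint and read back the predicted labels
results = binary_predictor.predict(test_features[:5])
predicted_labels = [r.label['predicted_label'].float32_tensor.values[0] for r in results]
print(predicted_labels)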
# Calculate precision
precision = (preds[preds == 1] == labels[preds == 1]).sum() / len(preds[preds == 1])
print(f'Precision: {precision}')
# Calculate recall
recall = (preds[preds == 1] == labels[preds == 1]).sum() / len(labels[labels == 1])
print(f'Recall: {recall}')
The provided code snippet defines a function called predict_batches that is used for
making predictions using a SageMaker endpoint and then calculating and visualizing
evaluation metrics. Let's break down what this code does step by step:
1. Import Libraries:
The code imports the matplotlib.pyplot and seaborn libraries for data
visualization.
2. Define predict_batches Function:
This function takes the following three arguments:
predictor: The SageMaker predictor (endpoint) that will be used to
make predictions.
features: The input features for which predictions will be made.
labels: The true labels corresponding to the input features.
3. Predict in Batches:
The function splits the features into batches of size 25 using
np.array_split.
It then iterates through these batches and uses the predictor.predict()
method to obtain predictions for each batch.
The predictions are collected and processed to extract the predicted
labels.
4. Calculate Evaluation Metrics:
The following evaluation metrics are calculated:
Accuracy: It measures the overall correctness of the predictions by
comparing them to the true labels.
Precision: It measures the proportion of true positive predictions
out of all positive predictions.
Recall: It measures the proportion of true positive predictions out of
all actual positive instances in the dataset.
These metrics are printed to the console.
5. Create and Display a Confusion Matrix:
A confusion matrix is generated using pd.crosstab to show the
distribution of predicted and true labels.
The sns.heatmap function from Seaborn is used to create a heatmap
visualization of the confusion matrix.
The heatmap provides a visual representation of the model's performance
in terms of true positives, true negatives, false positives, and false
negatives.
In summary, this code defines a function that takes a SageMaker predictor, input
features, and true labels, and then it uses the predictor to make predictions, calculates
evaluation metrics (accuracy, precision, recall), and visualizes the confusion matrix. This
can be helpful for assessing the performance of a binary classification model deployed
on SageMaker.
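The full function is not reproduced above, so the following is a sketch of what predict_batches likely looks like, reconstructed from the description: batches of about 25 rows, the precision and recall formulas shown earlier, and a Seaborn heatmap of the confusion matrix. The response parsing assumes the Linear Learner predictor's Record format.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

def predict_batches(predictor, features, labels):
    """Predict in batches, print accuracy/precision/recall, and plot a confusion matrix."""
    # Split the features into batches of roughly 25 rows each
    n_batches = int(np.ceil(len(features) / 25))
    preds = []
    for batch in np.array_split(features, n_batches):
        results = predictor.predict(batch)
        preds += [r.label['predicted_label'].float32_tensor.values[0] for r in results]
    preds = np.array(preds)

    # Accuracy: fraction of predictions that match the true labels
    accuracy = (preds == labels).sum() / len(labels)
    print(f'Accuracy: {accuracy}')

    # Precision: true positives out of all positive predictions
    precision = (preds[preds == 1] == labels[preds == 1]).sum() / len(preds[preds == 1])
    print(f'Precision: {precision}')

    # Recall: true positives out of all actual positive instances
    recall = (preds[preds == 1] == labels[preds == 1]).sum() / len(labels[labels == 1])
    print(f'Recall: {recall}')

    # Confusion matrix of true labels vs. predicted labels, shown as a heatmap
    confusion_matrix = pd.crosstab(index=labels, columns=preds,
                                   rownames=['True'], colnames=['Predicted'])
    sns.heatmap(confusion_matrix, annot=True, fmt='d', cmap='Blues')
    plt.show()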
The predict_batches function is being used to make predictions on the training data
using the binary_predictor, which is the SageMaker endpoint where the trained model
is deployed. Let's recap what happens when you call
predict_batches(binary_predictor, train_features, train_labels):
1. Predictions on Training Data:
The function uses the binary_predictor (the SageMaker endpoint) to
make predictions on the train_features, which are the input features of
the training data.
These predictions are generated in batches to efficiently process the data.
2. Evaluation Metrics Calculation:
After obtaining the predictions, the function calculates three evaluation
metrics:
Accuracy: This metric measures how accurately the model's
predictions match the true labels in the training data. It gives you an
overall sense of the model's performance.
Precision: Precision measures the proportion of true positive
predictions out of all positive predictions. It helps assess the
model's ability to correctly identify positive cases.
Recall: Recall measures the proportion of true positive predictions
out of all actual positive instances in the training data. It helps
assess the model's ability to capture all positive cases.
3. Confusion Matrix Visualization:
A confusion matrix is generated based on the true labels and predicted
labels. This matrix provides a detailed breakdown of true positives, true
negatives, false positives, and false negatives.
The confusion matrix is visualized as a heatmap using the Seaborn library.
It allows you to see how well the model performs in different prediction
categories.
4. Printing Metrics:
The calculated metrics (accuracy, precision, and recall) are printed to the
console.
By calling predict_batches(binary_predictor, train_features, train_labels), you are
essentially assessing how well the deployed machine learning model performs on the
training data. This can provide insights into the model's initial performance and its ability
to learn patterns from the training dataset.
The predict_batches function is being used to make predictions on the test data using
the binary_predictor, which is the SageMaker endpoint where the trained model is
deployed. This allows you to assess the model's performance on data it hasn't seen
during training. Here's what happens when you call predict_batches(binary_predictor,
test_features, test_labels):
1. Predictions on Test Data:
The function uses the binary_predictor (the SageMaker endpoint) to
make predictions on the test_features, which are the input features of the
test data.
These predictions are generated in batches to efficiently process the data.
2. Evaluation Metrics Calculation:
After obtaining the predictions, the function calculates three evaluation
metrics:
Accuracy: This metric measures how accurately the model's
predictions match the true labels in the test data. It gives you an
overall sense of the model's performance on unseen data.
Precision: Precision measures the proportion of true positive
predictions out of all positive predictions. It helps assess the
model's ability to correctly identify positive cases in the test data.
Recall: Recall measures the proportion of true positive predictions
out of all actual positive instances in the test data. It helps assess
the model's ability to capture all positive cases in the test data.
3. Confusion Matrix Visualization:
A confusion matrix is generated based on the true labels and predicted
labels for the test data. This matrix provides a detailed breakdown of true
positives, true negatives, false positives, and false negatives.
The confusion matrix is visualized as a heatmap using the Seaborn library.
This visualization allows you to see how well the model performs in
different prediction categories for the test data.
4. Printing Metrics:
The calculated metrics (accuracy, precision, and recall) are printed to the
console, providing a quantitative assessment of the model's performance
on the test data.
By calling predict_batches(binary_predictor, test_features, test_labels), you are
evaluating how well the deployed machine learning model generalizes to new, unseen
data. This assessment helps you understand how the model performs in a real-world
scenario and whether it maintains its predictive accuracy on data it hasn't encountered
during training.