Week 1 Explore The Use Case and Analyze The Dataset

The document outlines the copyright notice for slides distributed under a Creative Commons License by DeepLearning.AI for educational purposes. It discusses practical data science in the cloud, including data ingestion, exploration, and machine learning workflows using AWS tools. Additionally, it covers popular machine learning tasks, sentiment analysis of product reviews, and data visualization techniques.

Uploaded by

Raish S

Copyright Notice

These slides are distributed under the Creative Commons License.

DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.

For the rest of the details of the license, see https://creativecommons.org/licenses/by-sa/2.0/legalcode


Practical Data Science
Explore the Use Case and Analyze the Dataset

Practical Data Science in the Cloud
Introduction

AI, ML, DL, data science…?

[Diagram: nested circles, with Deep Learning inside Machine Learning inside Artificial Intelligence. Data Science is drawn alongside them and draws on:]
● Domain knowledge
● Mathematics
● Statistics
● Visualization
● Programming

Practical Data Science?

Practical data science

Massive data sets → Extract → Knowledge + Insight

… in the Cloud?

Practical data science in the cloud

● Store & process any amount of data
● Large data science and ML toolbox
● Scale up
● Scale out
● Elastic infrastructure
● Local notebook / prototype: limited by existing hardware

Data science and ML toolbox

Machine Learning Workflow

Ingest & Analyze
● Tasks: data exploration, bias detection
● Services: Amazon S3 & Amazon Athena, AWS Glue, Amazon SageMaker Data Wrangler & Clarify

Prepare & Transform
● Tasks: feature engineering, feature store
● Services: Amazon SageMaker Data Wrangler, Amazon SageMaker Processing Jobs, Amazon SageMaker Feature Store

Train & Tune
● Tasks: automated ML, model train and tune
● Services: Amazon SageMaker Autopilot, Amazon SageMaker Training & Debugger, Amazon SageMaker Hyperparameter Tuning

Deploy & Manage
● Tasks: model deployment, automated pipelines
● Services: Amazon SageMaker Endpoints, Amazon SageMaker Batch Transform, Amazon SageMaker Pipelines

Use Case and Dataset
Introduction

Popular ML tasks and learning paradigms

● Classification & Regression: Supervised
● Clustering: Unsupervised
● Image Processing: Computer Vision
● Text Analysis: NLP / NLU

Multi-class classification for sentiment analysis of product reviews

“I simply love it!”

“It's ok.”

“It arrived damaged. Going to return.”

Working with product reviews data

Review Text                            Sentiment
(input feature for model training)     (label for model training)

I simply love it!                      1 (positive)
It's ok.                               0 (neutral)
It arrived damaged, going to return    -1 (negative)

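As a minimal sketch of the table above, the three example reviews can be held as (text, label) pairs and split into model inputs and labels. The names `SENTIMENT_NAMES`, `reviews`, and `class_index` are illustrative and not from the slides. Remapping the labels -1/0/1 to indices 0/1/2 is an assumption, included because many classifiers expect zero-based class indices.

```python
# Sentiment labels as shown on the slide.
SENTIMENT_NAMES = {1: "positive", 0: "neutral", -1: "negative"}

# The three example reviews from the slide, paired with their labels.
reviews = [
    ("I simply love it!", 1),
    ("It's ok.", 0),
    ("It arrived damaged, going to return", -1),
]

# Input feature (review text) and label (sentiment) for model training.
features = [text for text, _ in reviews]
labels = [label for _, label in reviews]

# Assumption: map the labels -1, 0, 1 to zero-based class indices 0, 1, 2.
class_index = {label: i for i, label in enumerate(sorted(SENTIMENT_NAMES))}
encoded_labels = [class_index[label] for label in labels]

for text, label in reviews:
    print(f"{SENTIMENT_NAMES[label]:>8}: {text}")
```
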
Data Ingestion & Exploration

Ingest data into data lakes

● Centralized and secure repository
● Store, discover and share data at any scale
○ structured relational data
○ semi-structured data
○ unstructured data
○ streaming data
● Governance

Data lakes on Amazon S3

● Amazon Simple Storage Service (Amazon S3)
● Object storage
● Durable, available, exabyte scale
● Secure, compliant, auditable

[Diagram: Data Warehousing, Analytics, and Machine Learning on top of Amazon S3]

AWS Data Wrangler

● Open source Python library
● Connects pandas DataFrames and AWS data services
● Load/unload data from
○ data lakes
○ data warehouses
○ databases

    !pip install awswrangler

    import awswrangler as wr
    import pandas as pd

    # Retrieving the data directly from Amazon S3
    df = wr.s3.read_csv(
        path='s3://bucket/prefix/')

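The `path` argument above is an S3 URI: the part after `s3://` names the bucket, and the rest is the object key prefix. The sketch below splits such a path using only the standard library. The helper name `split_s3_path` is hypothetical and is not part of AWS Data Wrangler.

```python
from urllib.parse import urlparse


def split_s3_path(path):
    """Split an S3 URI such as 's3://bucket/prefix/' into (bucket, key_prefix)."""
    parsed = urlparse(path)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 path: {path!r}")
    # The bucket is the URI authority; the key prefix is the path without its leading slash.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, prefix = split_s3_path("s3://bucket/prefix/")
print(bucket, prefix)
```
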
Register data with AWS Glue Data Catalog

● Creates reference to data ("S3-to-table" mapping)
● Just metadata / schema stored in tables
● No data is moved
● AWS Glue Crawlers can be set up to
○ infer data schema automatically
○ update data catalog

AWS Glue Data Catalog
Name              reviews
Database          dsoaws_deep_learning
Classification    csv
Location          s3://<bucket>/<prefix>

Register data with AWS Glue Data Catalog

    import awswrangler as wr

    # Create a database in the AWS Glue Data Catalog
    wr.catalog.create_database(
        name=...)

    # Create CSV table (metadata only) in the AWS Glue Data Catalog
    wr.catalog.create_csv_table(
        table=...,
        columns_types=...,
        ...)

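The slide elides the arguments. A hedged sketch of a filled-in call might look like the following, using the database and table names from the catalog entry shown earlier. The column names are taken from the queries elsewhere in these slides, but their Athena types and the full table schema are assumptions. The wrapper function `register_reviews_table` is hypothetical, and the call requires AWS credentials and the `awswrangler` package.

```python
# Column names appear in the slides' SQL queries; the types are assumptions,
# and the real table likely has more columns.
REVIEW_COLUMN_TYPES = {
    "product_category": "string",
    "review_body": "string",
    "sentiment": "int",
}


def register_reviews_table(path, database="dsoaws_deep_learning", table="reviews"):
    """Register CSV files under `path` as a table in the AWS Glue Data Catalog.

    Only metadata is written; the data itself stays in Amazon S3.
    """
    # Imported lazily so this sketch can be loaded without awswrangler installed.
    import awswrangler as wr

    wr.catalog.create_database(name=database)
    wr.catalog.create_csv_table(
        database=database,
        table=table,
        path=path,
        columns_types=REVIEW_COLUMN_TYPES,
    )


# Example call (not executed here), with the placeholder location from the slide:
# register_reviews_table("s3://<bucket>/<prefix>/")
```
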
Query data with Amazon Athena

● Query data in S3
● Using SQL
● No infrastructure to set up
● Schema lookup in AWS Glue Data Catalog
● No data to load

Python:

    import awswrangler as wr

    # Create Amazon Athena S3 bucket
    wr.athena.create_athena_bucket()

    # Execute SQL query on Amazon Athena
    df = wr.athena.read_sql_query(
        sql=...,
        database=...)

SQL:

    'SELECT product_category FROM reviews'

Query data with Amazon Athena

● Complex analytical queries
● Gigabytes > Terabytes > Petabytes
● Scales automatically
● Runs queries in parallel
● Based on Presto
● No infrastructure setup / no data movement required

Data Visualization

Popular Python data analysis & visualization tools

    pip install pandas
    pip install numpy
    pip install matplotlib
    pip install seaborn

How many reviews are in each sentiment class?

SQL query:

    SELECT sentiment, COUNT(*) AS count_sentiment
    FROM dsoaws_deep_learning.reviews
    GROUP BY sentiment
    ORDER BY sentiment DESC, count_sentiment

Python visualization code:

    import matplotlib.pyplot as plt

    chart = df.plot.bar(
        x="sentiment",
        y="count_sentiment")
    plt.xlabel("sentiment")
    plt.show()

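To see what the sentiment-count query returns without an AWS account, the sketch below runs the same SQL against an in-memory SQLite table. SQLite stands in for Amazon Athena here, the table is named `reviews` without the `dsoaws_deep_learning` database prefix, and the fourth review row is invented toy data so that one class has more than one entry.

```python
import sqlite3

# Toy rows: the three reviews from the slides plus one invented positive review.
rows = [
    ("I simply love it!", 1),
    ("It's ok.", 0),
    ("It arrived damaged, going to return", -1),
    ("Works great", 1),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (review_body TEXT, sentiment INTEGER)")
conn.executemany("INSERT INTO reviews VALUES (?, ?)", rows)

# Same query as on the slide, minus the database prefix.
counts = conn.execute(
    """
    SELECT sentiment, COUNT(*) AS count_sentiment
    FROM reviews
    GROUP BY sentiment
    ORDER BY sentiment DESC, count_sentiment
    """
).fetchall()
conn.close()

print(counts)  # [(1, 2), (0, 1), (-1, 1)]
```

Each result row is a (sentiment, count_sentiment) pair, which is the shape the bar chart code expects in its `x` and `y` columns.
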
What is the distribution of review lengths? (number of words)

SQL query:

    SELECT CARDINALITY(SPLIT(review_body, ' ')) as num_words
    FROM dsoaws_deep_learning.reviews

Python visualization code:

    summary = df["num_words"].describe(
        percentiles=[0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 1.00])

    df["num_words"].plot.hist(
        xticks=[0, 16, 32, 64, 128, 256],
        bins=100,
        range=[0, 256]).axvline(x=summary["100%"], c="red")

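The word-count idea can be tried locally with a small standard-library sketch. It is an approximation of the query above and not the course's code: `count_words` is a hypothetical helper that splits on a single space, and `statistics.quantiles` with `method="inclusive"` is used for the 10% to 90% cut points, which should line up with the linear interpolation pandas uses by default.

```python
import statistics


def count_words(review_body):
    """Count space-separated tokens, like taking the length of a split on ' '."""
    # Note: consecutive spaces produce empty strings, which are counted as well.
    return len(review_body.split(" "))


# The three example reviews from the slides.
reviews = [
    "I simply love it!",
    "It's ok.",
    "It arrived damaged, going to return",
]

num_words = [count_words(r) for r in reviews]

# Nine cut points: the 10%, 20%, ..., 90% percentiles.
deciles = statistics.quantiles(num_words, n=10, method="inclusive")

print("num_words:", num_words)
print("median (50%):", deciles[4])
print("max (100%):", max(num_words))
```
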
What is the distribution of review lengths?
(number of words)
mean 52.51
std 31.38
min 1.00
10% 10.00
20% 22.00
30% 32.00
40% 41.00
50% 51.00
60% 61.00
70% 73.00
80% 88.00
90% 97.00
100% 115.00
