1st Class-Introduction and Python Package (1)


Introduction to Data Science and Python Package for DS

Syamil Fakhruddin H.A
University of Indonesia, Geophysics
Digital Talent Scholarship, Artificial Intelligence
PT Zegen Laraka Utama, AI Developer
IYKRA Data Fellowship
What will we learn in this Session? (Objective)
1. Introduction to Data Science
2. Python Packages for Data Science
3. Exploratory Data Analysis
4. Data Preparation
Outline
Introduction to Data Science

Numpy

Pandas

Matplotlib

Exploratory Data Analysis
Data Preparation
Background of Data Science

1. Industrial Revolution 4.0 (Big Data, IoT, Tech, etc.)
2. How do we use it and bring insight to business value?
3. DJ Patil, a computer scientist and mathematician, is widely credited (together with Jeff Hammerbacher) with coining the term "data scientist"
What’s Data Science?

A science that combines three things: programming, mathematics and statistics, and business.
Overkill!!!
Cross-Industry Standard Process for Data Mining, CRISP-DM (Framework)
Python Packages for Data Science
Basic Numpy
Our Topics
● What and Why Numpy
● What’s Array
● Build Array
● Some Operation with Array
● 2D Array
● Numpy for Statistics
● Hands-On
What’s Numpy
Numpy is short for Numerical Python, an open source library built around multidimensional array objects.
In short: Numpy is a Python library for creating and manipulating multidimensional arrays.

Focus on
1. Build Array
2. Do some operations

Source: https://numpy.org/devdocs/user/absolute_beginners.html
Why Numpy?
1. NumPy arrays are faster and more compact than Python lists for numerical work.
2. NumPy uses less memory.

Numpy > List

So what’s Array?
Array is…
An array is a grid of values of the same type. It holds the raw data, is indexed by position, and supports multidimensional data.

ndarray
An N-dimensional array (ndarray) is simply an array with any number of dimensions.
Build Array
1. np.array()
2. np.zeros()
3. np.ones()
4. np.empty()
5. np.arange()
6. np.linspace()
7. Specifying your data type
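
A minimal sketch of the array-building functions listed on this slide. The input values are arbitrary examples chosen for illustration.

```python
import numpy as np

from_list = np.array([1, 2, 3])          # build from a Python list
zeros = np.zeros(3)                      # three 0.0 values (float64 by default)
ones = np.ones((2, 2))                   # 2x2 array filled with 1.0
empty = np.empty(2)                      # uninitialized: contents are arbitrary
stepped = np.arange(0, 10, 2)            # start, stop (exclusive), step
spaced = np.linspace(0, 1, num=5)        # 5 evenly spaced values, endpoints included
typed = np.ones(3, dtype=np.int64)       # specify the data type explicitly

print(stepped)
print(spaced)
print(typed.dtype)
```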
Adding and Sorting Elements
1. np.sort()

2. np.concatenate()
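
A short sketch of both functions, using small example arrays that are not from the slides.

```python
import numpy as np

arr = np.array([2, 1, 5, 3, 7, 4, 6, 8])
sorted_arr = np.sort(arr)                 # returns a sorted copy; arr is unchanged

a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])
joined = np.concatenate((a, b))           # join two 1-D arrays end to end

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6]])
stacked = np.concatenate((x, y), axis=0)  # append y as a new row of x

print(sorted_arr)
print(joined)
print(stacked)
```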
How do you know the shape and size of an array?

ndim

size

shape
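
The three attributes can be read directly from any array. The 3x2x4 shape below is an arbitrary example.

```python
import numpy as np

arr = np.zeros((3, 2, 4))

print(arr.ndim)    # number of dimensions (axes)
print(arr.size)    # total number of elements
print(arr.shape)   # number of elements along each axis
```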
Indexing and Slicing
Indexing and Slicing (with Condition)
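
A brief sketch of positional indexing, slicing, and selection with a condition (boolean masks). The arrays are example values only.

```python
import numpy as np

data = np.array([1, 2, 3])
second = data[1]             # single element by position
first_two = data[0:2]        # slice: positions 0 and 1
last_two = data[-2:]         # slice counted from the end

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
below_five = a[a < 5]                    # values that satisfy one condition
in_range = a[(a > 2) & (a < 11)]         # combine conditions with & (and) or | (or)
evens = a[a % 2 == 0]                    # any boolean expression works as a mask

print(below_five)
print(in_range)
print(evens)
```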
Create Array from Existing Data
1. Slicing
2. vstack() and hstack()
3. hsplit()
4. copy(): changing a slice (a view) changes the original array; changing a copy does not
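
The following sketch walks through stacking, splitting, and the difference between a view and a copy. All array values are illustrative.

```python
import numpy as np

a1 = np.array([[1, 1], [2, 2]])
a2 = np.array([[3, 3], [4, 4]])
vertical = np.vstack((a1, a2))      # stack rows: shape (4, 2)
horizontal = np.hstack((a1, a2))    # stack columns: shape (2, 4)

x = np.arange(1, 13).reshape(2, 6)
parts = np.hsplit(x, 3)             # split into 3 equal column blocks

base = np.array([1, 2, 3, 4])
view = base[:2]                     # a slice is a view onto the same data
view[0] = 99                        # this also changes base[0]

dup = base.copy()                   # a copy owns its own data
dup[1] = -1                         # base[1] is not affected

print(base)
print(dup)
```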
Basic Array Operations
Addition, subtraction, multiplication, division

More useful array operations
min(), max(), sum()
Multiplication with scalar
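
A minimal sketch of these operations. Arithmetic between arrays of the same shape is applied element by element. The numbers are arbitrary.

```python
import numpy as np

data = np.array([1, 2])
ones = np.ones(2, dtype=int)

added = data + ones          # element-wise addition
subtracted = data - ones     # element-wise subtraction
multiplied = data * data     # element-wise multiplication
divided = data / data        # element-wise division (result is float)

smallest = data.min()
largest = data.max()
total = data.sum()

scaled = data * 1.6          # the scalar is applied to every element

print(added, subtracted, multiplied, divided)
print(smallest, largest, total)
print(scaled)
```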


2D Array (Matrices)
● Build Array
● Indexing and Slicing
● Min, Max, Sum
● Operation
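
The same ideas carry over to 2D arrays, with an extra `axis` argument for aggregations. This is a sketch on an example 3x2 matrix.

```python
import numpy as np

m = np.array([[1, 2], [5, 3], [4, 6]])   # build from a list of rows

element = m[0, 1]            # row 0, column 1
rows = m[1:3]                # rows 1 and 2
first_col = m[:, 0]          # every row, column 0

overall_max = m.max()        # aggregate over the whole matrix
col_max = m.max(axis=0)      # maximum of each column
row_sum = m.sum(axis=1)      # sum of each row

plus_ones = m + np.ones((3, 2), dtype=int)   # same-shape element-wise addition
plus_row = m + np.array([10, 20])            # a row is broadcast to every row

print(col_max, row_sum)
print(plus_row)
```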
Implementation
Let’s go to Notebook
Pandas and Data Preparation
Pandas
Pandas is a software library for the Python programming language, used for data manipulation and analysis.

Pandas DataFrame

Two-dimensional, size-mutable, potentially heterogeneous tabular data.

Data structure also contains labeled axes (rows and columns). Arithmetic
operations align on both row and column labels. Can be thought of as a
dict-like container for Series objects. The primary pandas data structure.
Build DataFrame

(Figure: a DataFrame with its columns, index, and rows labeled.)
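
A small sketch of building a DataFrame from a dictionary. The names, ages, and index labels are made up for illustration.

```python
import pandas as pd

# Each dictionary key becomes a column; the index labels the rows.
df = pd.DataFrame(
    {"name": ["Ani", "Budi", "Citra"], "age": [23, 31, 27]},
    index=["a", "b", "c"],
)

print(df.columns.tolist())   # column labels
print(df.index.tolist())     # row labels
print(df.loc["b"])           # one row, selected by its index label
```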
Import Dataset to DataFrame
(Figure labels: DataFrame name; location of the file; syntax for reading the file)

Some Common Syntax after Loading a DataFrame
DataFrame also offers a number of statistical functions, such as:
● abs(): absolute values
● mean(): mean values. It also offers median() and mode()
● min(): minimum value. It also offers max()
● count() and std() (standard deviation)

There are also key attributes of a DataFrame, such as:
● shape: shows the dimensionality of the DataFrame
● size: number of items
● ndim: number of axes

Describe
If you want a quick summary of your DataFrame, including the count, mean, standard deviation, minimum, maximum, and a number of percentiles for each column, use the describe method:
df.describe()
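
A quick sketch of these attributes and functions on a tiny example DataFrame. The column names and values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {"height": [150.0, 160.0, 170.0, 180.0], "weight": [50.0, 60.0, 70.0, 80.0]}
)

print(df.shape, df.size, df.ndim)    # key attributes

mean_height = df["height"].mean()
median_height = df["height"].median()
min_weight = df["weight"].min()
max_weight = df["weight"].max()

summary = df.describe()              # count, mean, std, min, percentiles, max
print(summary)
```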
Data Manipulation with Pandas

1. Selection
2. Addition
3. Deletion
4. Rename
5. Sorting
Selection: bracket indexing, loc, and iloc
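
The five manipulation steps can be sketched on one small DataFrame. The data and column names are invented for this example.

```python
import pandas as pd

people = pd.DataFrame(
    {
        "name": ["Ani", "Budi", "Citra"],
        "age": [23, 31, 27],
        "city": ["Jakarta", "Bandung", "Depok"],
    }
)

# 1. Selection
ages = people["age"]                                      # bracket: one column
first_row = people.iloc[0]                                # iloc: by integer position
budi_city = people.loc[1, "city"]                         # loc: by row and column label
over_25 = people.loc[people["age"] > 25, ["name", "age"]] # loc with a condition

# 2. Addition: assign to a new column name
people["age_next_year"] = people["age"] + 1

# 3. Deletion
people = people.drop(columns=["city"])

# 4. Rename
people = people.rename(columns={"name": "full_name"})

# 5. Sorting
people = people.sort_values("age", ascending=False)

print(people)
```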
Let’s go to Notebook
Data Cleansing
About Cleaning and Preprocessing Dataset
Cleaning your data should be the first step in your Data Science (DS) or Machine Learning (ML) workflow. Without clean data, you will have a much harder time seeing what actually matters in your exploration. According to CrowdFlower, data scientists spend 60% of their time organizing and cleansing data!
Why are cleaning and preprocessing data important?
Reasons:
1. A cleaned dataset is easier to visualize and analyze
2. Data interpretation is valid
3. If the data is not cleaned, some functions will raise errors
4. Many data scientists can improve the accuracy of their models just by cleaning and processing data
Common Problems in Data Cleansing
1. Duplicate Dataset
2. Missing Data
3. Outliers
4. Data Type
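
As a sketch of the first and last problems in this list, the example below removes duplicate rows and converts text columns to numeric types. The table and its "n/a" marker are hypothetical.

```python
import pandas as pd

raw = pd.DataFrame(
    {"id": ["1", "2", "2", "3"], "price": ["10.5", "20.0", "20.0", "n/a"]}
)

# Duplicates: flag repeated rows, then drop them.
dup_mask = raw.duplicated()
deduped = raw.drop_duplicates()

# Data type: both columns were read as text; convert them to numbers.
# errors="coerce" turns unparseable entries such as "n/a" into NaN.
deduped = deduped.assign(
    id=deduped["id"].astype(int),
    price=pd.to_numeric(deduped["price"], errors="coerce"),
)

print(deduped)
print(deduped.dtypes)
```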

Missing Data
1. Identify missing values
2. Analyze the number or proportion of missing values
3. Appropriately delete or impute missing values
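
The steps for handling missing data might look like this in pandas. The DataFrame is invented, and the 50% threshold for dropping a column is an arbitrary assumption, not a rule from the slides.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "age": [23, None, 27, 31],
        "city": ["Jakarta", "Bandung", None, None],
    }
)

# Identify and count missing values per column.
missing_count = df.isna().sum()

# Proportion of missing values per column.
missing_share = df.isna().mean()

# Delete: drop every row that has at least one missing value ...
complete_rows = df.dropna()

# ... or drop columns that are mostly missing (hypothetical 50% threshold).
mostly_missing = missing_share[missing_share >= 0.5].index
without_sparse_cols = df.drop(columns=mostly_missing)

print(missing_count)
print(missing_share)
```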

Other Methods for Imputing Missing Values:
1. Median (used for skewed distributions)
2. Mode (used for categorical data)
3. Mean (used for normally distributed data)
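
A sketch of the three imputation choices with `fillna`. Each Series is a made-up example of the kind of data the method suits.

```python
import pandas as pd

income = pd.Series([3.0, 4.0, 5.0, None, 100.0])         # skewed by one large value
city = pd.Series(["Jakarta", "Bandung", "Jakarta", None]) # categorical
height = pd.Series([160.0, 170.0, None, 180.0])           # roughly symmetric

income_filled = income.fillna(income.median())   # median resists the extreme value
city_filled = city.fillna(city.mode()[0])        # most frequent category
height_filled = height.fillna(height.mean())     # mean of the observed values

print(income_filled.tolist())
print(city_filled.tolist())
print(height_filled.tolist())
```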

An outlier is a data point that lies an abnormal distance from other values in the data.
Basic Outlier Formula:
1. IQR = Q3 - Q1
2. Lower Bound = Q1 - 1.5 x IQR
3. Upper Bound = Q3 + 1.5 x IQR
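
The formula above translates directly into pandas. The values are an invented example, and `quantile` uses pandas' default linear interpolation.

```python
import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 11, 95])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Anything outside [lower_bound, upper_bound] is flagged as an outlier.
outliers = values[(values < lower_bound) | (values > upper_bound)]

print(lower_bound, upper_bound)
print(outliers.tolist())
```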

The box plot is a useful graphical display for describing the behavior of the data in the middle as well as at the ends of the distributions.
Let’s go to Notebook
Data Wrangling
Combining Data (Join/Merge)
1. Natural join
2. Full outer join
3. Left outer join
4. Right outer join
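
A sketch of the four joins with `pd.merge`. The two tables are made up. Mapping "natural join" to a default `pd.merge` call (an inner join on the shared column names) is an interpretation, since pandas has no join type with that name.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "name": ["Ani", "Budi", "Citra"]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [80, 90, 70]})

# 1. Natural join: with no arguments, merge does an inner join on shared columns.
natural = pd.merge(left, right)

# 2. Full outer join: keep keys from both tables.
full_outer = pd.merge(left, right, on="id", how="outer")

# 3. Left outer join: keep every row of the left table.
left_outer = pd.merge(left, right, on="id", how="left")

# 4. Right outer join: keep every row of the right table.
right_outer = pd.merge(left, right, on="id", how="right")

print(full_outer)
```

Rows without a match on the other side get NaN in the columns that come from that side.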
Combining Data (concat)
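
Unlike a join, `pd.concat` simply glues tables together along an axis. A minimal sketch with invented tables:

```python
import pandas as pd

jan = pd.DataFrame({"id": [1, 2], "sales": [10, 20]})
feb = pd.DataFrame({"id": [3], "sales": [30]})
extra = pd.DataFrame({"region": ["west", "east"]})

# Stack rows; ignore_index=True renumbers the rows 0..n-1.
by_rows = pd.concat([jan, feb], ignore_index=True)

# Place tables side by side; rows are aligned on the index.
by_cols = pd.concat([jan, extra], axis=1)

print(by_rows)
print(by_cols)
```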
Data Visualization in Python
What is Data Visualization?

Data visualization is the discipline of trying to understand data by placing it in a visual context so that patterns, trends and correlations that might not otherwise be detected can be exposed.

There are several popular plotting libraries:


● Matplotlib: low level, provides lots of freedom
● Pandas Visualization: easy to use interface, built on Matplotlib
● Seaborn: high-level interface, great default styles
● Plotly: can create interactive plots
Plotting with Pandas

Pandas DataFrame offers a range of graphical plotting options.
We can draw line plots, box plots, area plots, scatter plots, stacked charts, bar charts, histograms, etc.

● df.plot.scatter() # plots a scatter chart
● df.plot.line() # plots a line chart
● df.boxplot() # plots a box plot
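
A sketch of the three calls on a small example DataFrame. The data, the column names, and the choice to draw all three on one row of subplots are assumptions for illustration. The "Agg" backend is selected only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, 4, 5, 4, 6]})

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

df.plot.scatter(x="x", y="y", ax=axes[0])   # scatter needs explicit x and y columns
df.plot.line(ax=axes[1])                    # one line per column, index on the x-axis
df.boxplot(ax=axes[2])                      # one box per numeric column

fig.tight_layout()
```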
Matplotlib

What is Matplotlib?
Matplotlib is a 2-D plotting library for visualizing data as figures.
Matplotlib emulates MATLAB-like graphs and visualizations.

So why don’t we use MATLAB instead?
MATLAB is not free, is difficult to scale, and is tedious as a programming language.

So Matplotlib is used in Python because it is a robust, free, and easy-to-use library for data visualization.
Install Matplotlib
The Matplotlib Object Hierarchy

A Figure object is the outermost container for a matplotlib graphic, which can contain multiple Axes objects. One source of confusion is the name: an Axes actually translates into what we think of as an individual plot or graph (rather than the plural of “axis,” as we might expect).

You can think of the Figure object as a box-like container holding one or more Axes (actual plots). Below the Axes in the hierarchy are smaller objects such as tick marks, individual lines, legends, and text boxes. Almost every “element” of a chart is its own manipulable Python object, all the way down to the ticks and labels.
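
The hierarchy can be explored with a short sketch: one Figure, two Axes, and a Line2D object reached through its Axes. The plotted values and titles are arbitrary, and the "Agg" backend is used only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

# One Figure (the container) holding two Axes (the individual plots).
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8, 3))

axes[0].plot([1, 2, 3], [1, 4, 9], label="squares")
axes[0].set_title("Line")
axes[0].legend()

axes[1].bar(["a", "b"], [3, 5])
axes[1].set_title("Bar")

# Objects below the Axes are ordinary Python objects too.
line = axes[0].get_lines()[0]     # the Line2D drawn by plot()
line.set_linewidth(2)
tick_labels = axes[0].get_xticklabels()

print(type(fig).__name__, type(axes[0]).__name__, type(line).__name__)
```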
Types of Visualization

● Histogram
● Multiple Histogram
● Pie Chart
● Time Series by Line Plot
● Box Plot
● Twin Axis
● Bar Plot
● Scatter Plot
And many more: https://matplotlib.org/gallery/index.html
When to use: We should use a histogram when we need the count of a variable in a plot.

e.g. the number of a particular game sold in a store.

In the figure above, we can see the histogram of Grand Canyon visitors over the years.
When to use: When we need to compare the distributions of two variables.

We can see that Grand Canyon has comparatively more visitors than Bryce Canyon.
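
A sketch of a single histogram and of two overlaid histograms. The data is randomly generated and only stands in for the park visitor figures on the slides, which are not reproduced here. The "Agg" backend is used only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in data (NOT real visitor numbers).
rng = np.random.default_rng(0)
park_a = rng.normal(loc=4.0, scale=0.5, size=200)
park_b = rng.normal(loc=2.0, scale=0.5, size=200)

fig, (ax_one, ax_two) = plt.subplots(1, 2, figsize=(10, 3))

# Single histogram: hist() returns the count in each bin and the bin edges.
counts, bin_edges, _ = ax_one.hist(park_a, bins=10)
ax_one.set_title("One variable")

# Multiple histogram: draw both on the same Axes, semi-transparent.
ax_two.hist(park_a, bins=10, alpha=0.5, label="park A")
ax_two.hist(park_b, bins=10, alpha=0.5, label="park B")
ax_two.set_title("Two variables")
ax_two.legend()
```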
Let’s go to Notebook
Exploratory Data Analysis
What is EDA?

Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns, spot anomalies, and check assumptions with the help of statistical summaries and graphical representations.
3 Parts of EDA

1. Cleansing
Checking for problems with the collected data, such as missing data or
measurement error, data types of columns, etc

2. Defining questions
Identifying the relationship between the variables that are particularly
interesting or unexpected

3. Visualizations
Using effective visualizations to communicate the result
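
The first two parts might start with checks like these. The table and the question "does score rise with hours?" are invented for illustration. Plotting is left out to keep the sketch short.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "hours": [1, 2, 3, 4, 5],
        "score": [52.0, 58.0, None, 71.0, 80.0],
        "group": ["a", "b", "a", "b", "a"],
    }
)

# 1. Cleansing checks: column types and missing values.
dtypes = df.dtypes
missing = df.isna().sum()

# 2. Defining questions: how are the variables related?
corr = df[["hours", "score"]].corr()            # pairwise correlation
by_group = df.groupby("group")["score"].mean()  # compare groups (NaN is skipped)

print(dtypes)
print(missing)
print(corr)
print(by_group)
```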
Let’s go to Notebook
Data Preprocessing
Encode Data
Some Approaches
1. Find and Replace
2. Label Encoding
3. One Hot Encoding
4. With Scikit-Learn
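
The first three approaches can be sketched with pandas alone. The columns and the size ordering are assumptions made for this example. scikit-learn's `LabelEncoder` and `OneHotEncoder` cover the same ground and are not shown here.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "size": ["small", "large", "medium", "large"],
        "color": ["red", "green", "blue", "green"],
    }
)

# 1. Find and replace: map each category to a number you choose.
#    (Series.replace with the same dictionary is another common way.)
size_order = {"small": 1, "medium": 2, "large": 3}
size_codes = df["size"].map(size_order)

# 2. Label encoding: pandas assigns integer codes to the sorted categories.
color_labels = df["color"].astype("category").cat.codes

# 3. One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["color"], prefix="color")

print(size_codes.tolist())
print(color_labels.tolist())
print(one_hot)
```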
Scaling
Use MinMaxScaler
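
Min-max scaling rescales each column to the range 0 to 1 with (x - min) / (max - min). The sketch below applies that formula by hand in NumPy so it has no extra dependency. scikit-learn's `MinMaxScaler` with its default settings applies the same per-column formula. The input matrix is an arbitrary example.

```python
import numpy as np

# Two features on very different scales (example values).
x = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 800.0]])

col_min = x.min(axis=0)   # minimum of each column
col_max = x.max(axis=0)   # maximum of each column

# Each column now runs from 0 (its minimum) to 1 (its maximum).
scaled = (x - col_min) / (col_max - col_min)

print(scaled)
```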
Let’s go to Notebook
Homework(?)
1. Carry out Business Understanding and Data Understanding on the HR data that has been provided.
Source: https://www.kaggle.com/rhuebner/human-resources-data-set

2. Perform data cleansing, data exploration, and data preparation.


Sources

Pandas Documentation: https://pandas.pydata.org/pandas-docs/stable/pandas.pdf
Numpy Documentation: https://numpy.org/doc/stable/numpy-ref.pdf
Matplotlib Documentation: https://matplotlib.org/contents.html
Seaborn Documentation: https://seaborn.pydata.org/
Encode: https://pbpython.com/categorical-encoding.html
Scaling:
https://towardsdatascience.com/scale-standardize-or-normalize-with-scikit-learn-6ccc7d176a02
https://machinelearningmastery.com/standardscaler-and-minmaxscaler-transforms-in-python/
