Introduction to Data Science and Python Package for DS
Syamil Fakhruddin H.A
University of Indonesia
Geophysics
Digital Talent Scholarship
Artificial Intelligence
PT Zegen Laraka Utama
AI Developer
IYKRA
Data Fellowship
What will we learn in this
Session (Objective)
1 Introduction to Data Science
2 Python Packages for Data Science
3 Exploratory Data Analysis
4 Data Preparation
Outline
Introduction to Data Science
Numpy
Pandas
Matplotlib
Exploratory Data Analysis
Data Preparation
Background of Data Science
1. Industrial Revolution 4.0 (Big Data, IoT, Tech, etc.)
2. How to use it and bring insight to business value?
3. DJ Patil, a computer scientist and mathematician, is widely credited with co-coining the term "data scientist"
What’s Data Science?
A science that combines 3 things, namely programming, mathematics and statistics, and business
Overkill!!!
Cross-industry standard process for data
mining (Framework)
Python Packages for Data
Science
Basic
Numpy
Our Topics
1. What and Why Numpy
2. What’s Array
3. Build Array
4. Some Operation with Array
5. 2D Array
6. Numpy for Statistics
7. Hands-On
What’s Numpy
Numpy is short for Numerical Python, an open source library
containing multidimensional array objects.
In short: Numpy is a Python library for creating and manipulating
multidimensional arrays.
Focus on
1. Build Array
2. Do some operations
Source: https://numpy.org/devdocs/user/absolute_beginners.html
Why Numpy?
1. NumPy arrays are faster and more compact than Python lists for numerical work.
2. NumPy uses less memory to store the same data.
Numpy > List
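The two points above can be checked with a small sketch. The array size and variable names are chosen here only for illustration, and the exact byte counts depend on the Python build.

```python
import sys
import numpy as np

n = 100_000
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# Element-wise work: a list needs an explicit loop, an array does not.
doubled_list = [x * 2 for x in py_list]
doubled_array = np_array * 2

# Memory: the array stores raw 8-byte integers in one contiguous buffer.
array_bytes = np_array.nbytes
# The list stores references to separate Python int objects.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)

print("array bytes:", array_bytes)
print("list bytes (approx.):", list_bytes)
```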
So what’s Array?
Array is…
An array is a grid of values that holds the raw data together with information on how to locate
and interpret each element. Elements are accessed by index, and arrays can be multidimensional.
ndarray
An N-dimensional array is simply an array with any number of dimensions.
Build Array
1. np.array()
2. np.zeros()
3. np.ones()
4. np.empty()
5. np.arange()
6. np.linspace()
7. Specifying your data type
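A minimal sketch of the array constructors listed above. The input values are arbitrary examples.

```python
import numpy as np

a = np.array([1, 2, 3])          # from a Python list
z = np.zeros(3)                  # filled with 0.0
o = np.ones(3)                   # filled with 1.0
e = np.empty(3)                  # uninitialized: contents are arbitrary
r = np.arange(0, 10, 2)          # start, stop (exclusive), step
l = np.linspace(0, 1, num=5)     # 5 evenly spaced values, both ends included
i = np.ones(3, dtype=np.int64)   # specifying the data type explicitly

print(a, z, o, r, l, i)
```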
Adding and Sorting Elements
1. np.sort()
2. np.concatenate()
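A short sketch of both functions, using small example arrays.

```python
import numpy as np

arr = np.array([2, 1, 5, 3, 7, 4, 6, 8])
sorted_arr = np.sort(arr)            # returns a sorted copy; arr is unchanged

a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])
joined = np.concatenate((a, b))      # join two 1-D arrays end to end

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6]])
stacked = np.concatenate((x, y), axis=0)  # append y as a new row of x

print(sorted_arr, joined, stacked, sep="\n")
```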
How do you know the shape and size of an
array?
ndim
size
shape
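The three attributes can be read directly from any array. The 3-D example array below is only an illustration.

```python
import numpy as np

arr = np.array([[[0, 1, 2, 3], [4, 5, 6, 7]],
                [[0, 1, 2, 3], [4, 5, 6, 7]],
                [[0, 1, 2, 3], [4, 5, 6, 7]]])

n_dims = arr.ndim     # number of axes
n_items = arr.size    # total number of elements
dims = arr.shape      # number of elements along each axis

print(n_dims, n_items, dims)
```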
Indexing and Slicing
Indexing and Slicing (with Condition)
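A minimal sketch of positional indexing, slicing, and selection with a boolean condition. The data values are arbitrary.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8])

first = data[0]          # single element by index
last_two = data[-2:]     # slice from the end
middle = data[2:5]       # index 2 up to (not including) 5

# Conditions produce a boolean mask that selects matching elements.
small = data[data < 5]
big_even = data[(data > 2) & (data % 2 == 0)]

print(first, last_two, middle, small, big_even)
```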
Create Array from Existing Data
1. Slicing
2. vstack() and hstack()
3. hsplit()
4. copy()
Change: editing a slice (a view) also changes the original array
Not Change: editing a copy() leaves the original array unchanged
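The sketch below walks through stacking, splitting, and the difference between a slice and a copy, using small example arrays.

```python
import numpy as np

a1 = np.array([[1, 1], [2, 2]])
a2 = np.array([[3, 3], [4, 4]])
v = np.vstack((a1, a2))      # stack vertically: shape (4, 2)
h = np.hstack((a1, a2))      # stack horizontally: shape (2, 4)

x = np.arange(1, 13).reshape(2, 6)
parts = np.hsplit(x, 3)      # three arrays of shape (2, 2)

original = np.array([1, 2, 3, 4])
view = original[:2]          # a slice shares memory with the original
view[0] = 99                 # so this also changes original[0]

dup = original.copy()        # an independent copy
dup[1] = -1                  # original[1] stays 2

print(v.shape, h.shape, len(parts), original, dup)
```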
Basic Array Operations
Addition, subtraction, multiplication, division
More useful array operations
min(), max(), sum()
Multiplication with scalar
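All of these operations work element by element. A short sketch with example values:

```python
import numpy as np

data = np.array([1, 2])
ones = np.ones(2, dtype=int)

added = data + ones        # element-wise addition
subtracted = data - ones   # element-wise subtraction
multiplied = data * data   # element-wise multiplication
divided = data / data      # element-wise division (float result)
scaled = data * 1.6        # the scalar is applied to every element

a = np.array([1, 2, 3, 4])
smallest, largest, total = a.min(), a.max(), a.sum()

print(added, subtracted, multiplied, divided, scaled, smallest, largest, total)
```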
2D Array (Matrices)
1. Build Array
2. Indexing and Slicing
3. Min, Max, Sum
4. Operation
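The same ideas carry over to two dimensions. The matrix below is an arbitrary example; `axis=0` works down the rows (one result per column) and `axis=1` works across the columns (one result per row).

```python
import numpy as np

m = np.array([[1, 2], [3, 4], [5, 6]])   # build a 3x2 matrix

element = m[1, 0]        # row 1, column 0
column = m[0:2, 1]       # rows 0-1 of column 1

overall_max = m.max()
col_min = m.min(axis=0)  # minimum of each column
row_sum = m.sum(axis=1)  # sum of each row

doubled = m * 2                      # scalar operation
shifted = m + np.array([10, 20])     # the row is broadcast to every row of m

print(element, column, overall_max, col_min, row_sum)
print(doubled, shifted, sep="\n")
```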
Implementation
Let’s go to Notebook
Pandas and
Data
Preparation
Pandas
Pandas is a software library written for the Python programming
language for data manipulation and analysis
Pandas DataFrame
Two-dimensional, size-mutable, potentially heterogeneous tabular data.
Data structure also contains labeled axes (rows and columns). Arithmetic
operations align on both row and column labels. Can be thought of as a
dict-like container for Series objects. The primary pandas data structure.
Build DataFrame
Columns
Index
Rows
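A minimal sketch of building a DataFrame from a dictionary. The column names, index labels, and values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Ani", "Budi", "Citra"], "age": [23, 31, 27]},  # columns
    index=["a", "b", "c"],                                     # row labels
)

columns = list(df.columns)   # column labels
index = list(df.index)       # row labels
row_b = df.loc["b"]          # one row, returned as a Series

print(df)
```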
Import Dataset To DataFrame
DataFrame name
Syntax for reading the file
File location
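A sketch of the pattern, assuming a CSV source. To keep the example self-contained it reads from an in-memory string; in practice the argument to `pd.read_csv` is usually a file path, and the path mentioned in the comment is hypothetical.

```python
import io
import pandas as pd

# Stand-in for a file on disk; the contents are made up.
csv_text = "name,age\nAni,23\nBudi,31\n"

# DataFrame name = reading syntax(file location)
df = pd.read_csv(io.StringIO(csv_text))
# With a real file this would look like: df = pd.read_csv("data/people.csv")

print(df.head())
```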
Some Common Syntax after Loading DataFrame
DataFrame also offers a number of statistic functions such as:
● abs(): absolute values
● mean(): mean values. It also offers median(), mode()
● min(): minimum value. It also offers max()
● count(), std(): count of values, standard deviation
There are also key attributes of a DataFrame such as:
● shape: shows dimensionality of the DataFrame
● size: number of items
● ndim: number of axes
Describe
If you want to see a quick summary of your DataFrame, with the count, mean, standard deviation,
minimum, maximum and a number of percentiles for each of its numeric columns,
then use the describe method:
df.describe()
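The functions and attributes above can be tried on a small made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"x": [-1, 2, 3, 4], "y": [10.0, 20.0, 30.0, 40.0]})

absolute = df.abs()        # absolute values of every cell
means = df.mean()          # mean of each column
medians = df.median()
minimums, maximums = df.min(), df.max()
counts, stds = df.count(), df.std()

shape, size, ndim = df.shape, df.size, df.ndim

summary = df.describe()    # count, mean, std, min, percentiles, max per column
print(summary)
```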
Data Manipulation with Pandas
1. Selection
2. Addition
3. Deletion
4. Rename
5. Sorting
Selection with Bracket, Loc and Iloc
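A sketch of the three selection styles on a made-up DataFrame: brackets select columns or filter rows, `loc` selects by label, and `iloc` selects by integer position.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Ani", "Budi", "Citra"],
     "age": [23, 31, 27],
     "city": ["Jakarta", "Bandung", "Depok"]},
    index=["a", "b", "c"],
)

ages = df["age"]                 # bracket: one column (Series)
subset = df[["name", "city"]]    # bracket: several columns (DataFrame)
older = df[df["age"] > 25]       # bracket with a condition: filter rows

cell = df.loc["c", "city"]       # loc: row label, column label
first_row = df.iloc[0]           # iloc: row position
block = df.iloc[0:2, 0:2]        # iloc: position slices

print(older)
```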
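The remaining manipulation steps (addition, deletion, rename, sorting) can be sketched as follows. The data and the new column name are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ani", "Budi", "Citra"], "age": [23, 31, 27]})

# Addition: create a new column from an existing one.
df["age_next_year"] = df["age"] + 1

# Deletion: drop a column, then drop a row by its index label.
df = df.drop(columns=["age_next_year"])
df = df.drop(index=0)

# Rename: map old column names to new ones.
df = df.rename(columns={"name": "full_name"})

# Sorting: order rows by a column.
df = df.sort_values("age", ascending=False)

print(df)
```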
Let’s go to Notebook
Data Cleansing
About Cleaning and
Preprocessing Dataset
Cleaning your data should be the first step in your Data Science
(DS) or Machine Learning (ML) workflow. Without clean data you will
have a much harder time seeing the important parts
in your exploration. According to CrowdFlower, data scientists
spend 60% of their time organizing and cleansing data!
Why are data cleaning and preprocessing important?
Reasons:
1. A cleaned dataset is easier to visualize and analyze
2. Interpretations drawn from the data are more likely to be valid
3. If the data is not cleaned, some functions will raise errors
4. Cleaning and preprocessing the data can by itself improve the accuracy of models
Common Problem in Data Cleansing
1. Duplicate Dataset
2. Missing Data
3. Outliers
4. Data Type
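Two of the problems above, duplicate rows and wrong data types, can be sketched on a made-up DataFrame in which numbers were loaded as strings.

```python
import pandas as pd

df = pd.DataFrame({"id": ["1", "2", "2", "3"],
                   "score": ["10", "20", "20", "30"]})

# Duplicate Dataset: count fully duplicated rows, then drop them.
n_duplicates = int(df.duplicated().sum())
deduped = df.drop_duplicates().reset_index(drop=True)

# Data Type: convert string columns to numeric types.
deduped["id"] = deduped["id"].astype(int)
deduped["score"] = pd.to_numeric(deduped["score"])

print(n_duplicates)
print(deduped.dtypes)
```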
1. Identify missing values
2. Analyze the number or proportion of missing values
3. Appropriately delete or impute missing values
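A minimal sketch of identifying, counting, and deleting missing values, using a made-up DataFrame. Whether to delete or impute depends on how much data is missing.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [23, np.nan, 27, 31],
                   "city": ["Jakarta", None, None, "Depok"]})

# Identify: True where a value is missing.
mask = df.isna()

# Analyze: number and proportion of missing values per column.
missing_count = mask.sum()
missing_ratio = mask.mean()

# Delete: drop every row that has at least one missing value.
dropped = df.dropna()

print(missing_count, missing_ratio, dropped, sep="\n")
```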
Other Methods for Imputing Missing Values:
1. Median (used for skewed distributions)
2. Mode (used for categorical data)
3. Mean (used for normally distributed data)
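The three imputation choices can be sketched with `fillna`. The columns and values are made up: `income` is skewed by one large value, `height` is roughly symmetric, and `city` is categorical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [3.0, 4.0, np.nan, 5.0, 100.0],
    "height": [160.0, 170.0, 180.0, np.nan, 170.0],
    "city": ["Jakarta", "Depok", None, "Jakarta", "Jakarta"],
})

# Median for a skewed numeric column.
df["income"] = df["income"].fillna(df["income"].median())

# Mean for a roughly normally distributed numeric column.
df["height"] = df["height"].fillna(df["height"].mean())

# Mode (most frequent value) for a categorical column.
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```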
An outlier is a data point that lies an abnormal distance from other values
in the data.
Basic Outlier Formula :
1. Lower Bound = Q1 - 1.5 x IQR
2. Upper Bound = Q3 + 1.5 x IQR
3. IQR = Q3 - Q1
The box plot is a useful graphical
display for describing the behavior of the
data in the middle as well as at the ends
of the distributions.
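The bounds above translate directly into code. This sketch uses a made-up series with one obviously extreme value and pandas' default (linear) quantile interpolation.

```python
import pandas as pd

values = pd.Series([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1                      # IQR = Q3 - Q1

lower_bound = q1 - 1.5 * iqr       # Lower Bound = Q1 - 1.5 x IQR
upper_bound = q3 + 1.5 * iqr       # Upper Bound = Q3 + 1.5 x IQR

# Points outside the bounds are flagged as outliers.
outliers = values[(values < lower_bound) | (values > upper_bound)]

print(lower_bound, upper_bound, outliers.tolist())
```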
Let’s go to Notebook
Data Wrangling
Combining Data (Join/Merge)
1. Natural join
2. Full outer join
3. Left outer join
4. Right outer join
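The four joins can be sketched with `pd.merge`, assuming the natural join corresponds to pandas' `how="inner"` on the shared column. The tables and the `key` column are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "right_val": [20, 30, 40]})

inner = pd.merge(left, right, on="key", how="inner")     # keys in both tables
outer = pd.merge(left, right, on="key", how="outer")     # keys in either table
left_join = pd.merge(left, right, on="key", how="left")  # all keys from left
right_join = pd.merge(left, right, on="key", how="right")  # all keys from right

print(inner, outer, left_join, right_join, sep="\n\n")
```

Rows without a match on the other side get NaN in the columns that come from that side.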
Combining Data (concat)
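Unlike a join, concatenation simply places DataFrames next to each other, either as extra rows or as extra columns. A short sketch with made-up tables:

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2], "val": ["a", "b"]})
df2 = pd.DataFrame({"id": [3, 4], "val": ["c", "d"]})

# Stack rows; ignore_index=True renumbers the result 0..n-1.
rows = pd.concat([df1, df2], ignore_index=True)

# Place the tables side by side, aligned on the row index.
cols = pd.concat([df1, df2], axis=1)

print(rows, cols, sep="\n\n")
```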
Data
Visualization
in Python
What is Data Visualization
Data visualization is the discipline of trying to understand data by placing it
in a visual context so that patterns, trends and correlations that might not
otherwise be detected can be exposed.
There are several popular plotting libraries:
● Matplotlib: low level, provides lots of freedom
● Pandas Visualization: easy to use interface, built on Matplotlib
● Seaborn: high-level interface, great default styles
● Plotly: can create interactive plots
Plotting with Pandas
Pandas DataFrame offers a range of graphical plotting options.
We can draw box plots, area plots, scatter plots, stacked charts, bar
charts, histograms, etc.
● df.plot.scatter() #plots a scatter chart
● df.plot.line() # plots a line chart
● df.boxplot() # plots a box plot
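A sketch of the three calls above on a made-up DataFrame. The non-interactive "Agg" backend is selected only so the example runs without a display; in a notebook that line is not needed.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 3, 6]})

ax_scatter = df.plot.scatter(x="x", y="y")   # scatter chart
ax_line = df.plot.line(x="x", y="y")         # line chart

fig, ax_box = plt.subplots()                 # separate figure for the box plot
df.boxplot(column="y", ax=ax_box)            # box plot of one column

n_lines = len(ax_line.get_lines())
plt.close("all")
```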
Matplotlib
What is Matplotlib?
Matplotlib is a 2-D plotting library that helps in visualizing figures.
Matplotlib emulates Matlab like graphs and visualizations.
So why don’t we use Matlab instead?
Matlab is not free, is difficult to scale, and is tedious as a programming language.
So Matplotlib is used in Python because it is a robust, free and easy-to-use library for
data visualization.
Install Matplotlib
The Matplotlib Object Hierarchy
A Figure object is the outermost container for a
matplotlib graphic, which can contain multiple Axes
objects. One source of confusion is the name: an Axes
actually translates into what we think of as an
individual plot or graph (rather than the plural of “axis,” as we might expect).
You can think of the Figure object as a box-like
container holding one or more Axes (actual plots).
Below the Axes in the hierarchy are smaller objects
such as tick marks, individual lines, legends, and text
boxes. Almost every “element” of a chart is its own
manipulable Python object, all the way down to the
ticks and labels.
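The hierarchy can be inspected directly. This sketch creates one Figure holding two Axes and then reaches into the smaller objects below them. The plotted values are arbitrary, and the "Agg" backend is chosen only so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available
import matplotlib.pyplot as plt

# One Figure (outer container) holding two Axes (individual plots).
fig, axes = plt.subplots(nrows=1, ncols=2)

axes[0].plot([1, 2, 3], [1, 4, 9])
axes[0].set_title("Left plot")
axes[1].plot([1, 2, 3], [9, 4, 1])
axes[1].set_title("Right plot")

n_axes = len(fig.axes)                  # the Figure knows its Axes
parent_is_fig = axes[0].figure is fig   # each Axes knows its Figure

# Below the Axes: individual lines and tick labels are objects too.
line = axes[0].get_lines()[0]
line.set_linewidth(3)
tick_labels = axes[0].get_xticklabels()

line_width = line.get_linewidth()
plt.close(fig)
```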
Types of Visualization
● Histogram
● Multiple Histogram
● Pie Chart
● Time Series by Line Plot
● Box Plot
● Twin Axis
● Bar Plot
● Scatter Plot
And many more: https://matplotlib.org/gallery/index.html
When to use: We should use a histogram when we need the frequency count of a
variable in a plot, e.g. the number of copies of particular games sold in a store.
From the plot above we can see the histogram for Grand Canyon visitors over the years.
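A minimal histogram sketch. The visitor numbers are made up for illustration and are not the data from the slide; the "Agg" backend is used only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available
import matplotlib.pyplot as plt

# Hypothetical yearly visitor counts, in millions.
visitors = [4.4, 4.5, 4.6, 4.8, 5.5, 6.0, 6.3, 6.4, 5.9, 4.7]

fig, ax = plt.subplots()
# hist() groups the values into bins and draws one bar per bin.
counts, bin_edges, patches = ax.hist(visitors, bins=5)
ax.set_xlabel("Visitors per year (millions)")
ax.set_ylabel("Number of years")

total_counted = int(counts.sum())
plt.close(fig)
```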
When to use: When we need to compare the distributions of two variables.
We can see that Grand Canyon has comparatively more visitors
than Bryce Canyon.
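A sketch of overlaying two histograms on shared bins so the distributions can be compared. Both data sets are made up ("Park A" and "Park B" are placeholders), and the "Agg" backend is used only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical yearly visitor counts, in millions.
park_a = [4.4, 4.5, 4.8, 5.5, 6.0, 6.3]
park_b = [1.0, 1.2, 1.4, 1.7, 2.4, 2.6]

bins = np.arange(0, 8)  # shared bin edges 0, 1, ..., 7

fig, ax = plt.subplots()
counts_a, _, _ = ax.hist(park_a, bins=bins, alpha=0.5, label="Park A")
counts_b, _, _ = ax.hist(park_b, bins=bins, alpha=0.5, label="Park B")
ax.set_xlabel("Visitors per year (millions)")
ax.legend()

plt.close(fig)
```

Sharing the bin edges keeps the bars of the two series directly comparable, and `alpha` makes overlapping bars visible.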
Let’s go to Notebook
Exploratory
Data
Analysis
What is EDA?
Exploratory Data Analysis refers to the critical process of performing
initial investigations on data so as to discover patterns, spot
anomalies, and check assumptions with the help of summary
statistics and graphical representations.
3 Parts of EDA
1. Cleansing
Checking for problems with the collected data, such as missing data or
measurement error, data types of columns, etc
2. Defining questions
Identifying the relationship between the variables that are particularly
interesting or unexpected
3. Visualizations
Using effective visualizations to communicate the result
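The first two parts can be sketched with a few pandas calls. The DataFrame is made up, and the question asked here (how `hours` relates to `score`) is only an example of a relationship worth checking before visualizing it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5],
    "score": [52.0, 55.0, np.nan, 65.0, 70.0],
    "group": ["A", "A", "B", "B", "A"],
})

# 1. Cleansing checks: column data types and missing values.
dtypes = df.dtypes
missing = df.isna().sum()

# 2. Defining questions: how are the variables related?
corr = df[["hours", "score"]].corr()           # pairwise correlation
group_mean = df.groupby("group")["score"].mean()  # score by group

print(dtypes, missing, corr, group_mean, sep="\n\n")
```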
Let’s go to Notebook
Data
Preprocessing
Encode Data
Some Approaches
1. Find and Replace
2. Label Encoding
3. One Hot Encoding
4. With Scikit-Learn
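The first three approaches can be sketched with pandas alone; the scikit-learn variant is not shown here. The columns, categories, and the `size_map` dictionary are made up, and `map` is used as one way to do the find-and-replace step.

```python
import pandas as pd

df = pd.DataFrame({"size": ["small", "large", "medium", "small"],
                   "color": ["red", "blue", "red", "green"]})

# 1. Find and Replace: map each category to a number chosen by hand.
size_map = {"small": 0, "medium": 1, "large": 2}
df["size_code"] = df["size"].map(size_map)

# 2. Label Encoding: let pandas assign one integer per category
#    (categories are ordered alphabetically: blue=0, green=1, red=2).
df["color_code"] = df["color"].astype("category").cat.codes

# 3. One Hot Encoding: one indicator column per category.
one_hot = pd.get_dummies(df["color"], prefix="color")

print(df, one_hot, sep="\n\n")
```

One-hot encoding avoids implying an order between categories, at the cost of one extra column per category.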
Scaling
Use MinMaxScaler
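A minimal sketch of scikit-learn's `MinMaxScaler` with its default range of 0 to 1, assuming scikit-learn is installed. The feature matrix is made up, and the manual formula is included only to show what the scaler computes per column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two features on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

scaler = MinMaxScaler()             # default feature_range=(0, 1)
X_scaled = scaler.fit_transform(X)  # learn min/max per column, then scale

# Equivalent formula: (x - min) / (max - min), column by column.
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_scaled)
```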
Let’s go to Notebook
Homework(?)
1. Carry out Business Understanding and Data Understanding for the HR data provided.
Source: https://www.kaggle.com/rhuebner/human-resources-data-set
2. Perform data cleansing and data exploration, through to data preparation.
Sources
Pandas Documentation: https://pandas.pydata.org/pandas-docs/stable/pandas.pdf
Numpy Documentation: https://numpy.org/doc/stable/numpy-ref.pdf
Matplotlib Documentation: https://matplotlib.org/contents.html
Seaborn Documentation: https://seaborn.pydata.org/
Encode: https://pbpython.com/categorical-encoding.html
Scaling:
https://towardsdatascience.com/scale-standardize-or-normalize-with-scikit-learn-6ccc7d176a02
https://machinelearningmastery.com/standardscaler-and-minmaxscaler-transforms-in-python/