1st Class-Introduction and Python Package (1)
Data Science
and Python
Package for DS
Syamil Fakhruddin H.A
University of Indonesia
Geophysics
IYKRA
Data Fellowship
What will we learn in this
Session (Objective)
1 Introduction to Data Science
4 Data Preparation
Outline
Introduction to Data Science
Numpy
Pandas
Matplotlib
Exploratory Data Analysis
Data Preparation
Background of Data Science
Numpy
What and Why
What’s Array
Build Array
2D Array
Some Operation with Array
Focus on
1. Build Array
2. Do some operations
Source: https://numpy.org/devdocs/user/absolute_beginners.html
Why Numpy?
1. NumPy arrays are faster and more powerful than Python lists.
2. NumPy uses less memory.
3.
ndarray
An N-dimensional array is simply an array with any number of dimensions.
Build Array
1. np.array()
3. np.ones()
6. np.linspace()
4. np.empty()
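A minimal sketch of the four constructors listed above. The input values are made up for illustration.

```python
import numpy as np

# Build an array from an existing Python list.
a = np.array([1, 2, 3])

# Array of a given length filled with 1.0.
ones = np.ones(3)

# Uninitialized array: only its shape is predictable, its contents are arbitrary.
empty = np.empty(3)

# Five evenly spaced values from 0 to 10, both ends included.
grid = np.linspace(0, 10, num=5)

print(a)            # [1 2 3]
print(ones)         # [1. 1. 1.]
print(empty.shape)  # (3,)
print(grid)
```

np.empty() skips initialization, so it is meant for arrays that will be filled in completely afterwards.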
Adding and Sorting Elements
1. np.sort()
2. np.concatenate()
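As a small illustration of the two functions above (the array values are arbitrary examples):

```python
import numpy as np

arr = np.array([2, 1, 5, 3, 7, 4, 6, 8])

# np.sort returns a sorted copy; arr itself is left unchanged.
sorted_arr = np.sort(arr)

a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])

# Join two 1-D arrays end to end.
joined = np.concatenate((a, b))

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6]])

# For 2-D arrays, axis=0 appends y as a new row below x.
stacked = np.concatenate((x, y), axis=0)

print(sorted_arr)  # [1 2 3 4 5 6 7 8]
print(joined)      # [1 2 3 4 5 6 7 8]
print(stacked.shape)  # (3, 2)
```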
How do you know the shape and size of an
array?
ndim
size
shape
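The three attributes above can be checked on a small example array (the values are illustrative):

```python
import numpy as np

arr = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8],
                [9, 10, 11, 12]])

print(arr.ndim)   # 2  -> number of dimensions (axes)
print(arr.size)   # 12 -> total number of elements
print(arr.shape)  # (3, 4) -> elements along each dimension (rows, columns)
```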
Indexing and Slicing
Indexing and Slicing (with Condition)
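A short sketch of indexing, slicing, and selecting with a condition. The arrays are made-up examples.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6])

first = data[0]        # single element by position
middle = data[1:4]     # slice: positions 1, 2, 3 (the end is excluded)
last_two = data[-2:]   # negative positions count from the end

grid = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8],
                 [9, 10, 11, 12]])

cell = grid[1, 2]                         # row 1, column 2
small = grid[grid < 5]                    # boolean mask keeps values below 5
between = grid[(grid > 2) & (grid < 11)]  # combine conditions with & and |

print(first, middle, last_two)
print(cell, small, between)
```

Each condition needs its own parentheses when conditions are combined, because & binds more tightly than the comparison operators.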
Create Array from Existing Data
1. Slicing
Change
Not Change
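The sketch below assumes that "Change" and "Not Change" refer to whether editing the new array also edits the original: a plain slice is a view onto the same data, while .copy() produces an independent array.

```python
import numpy as np

original = np.array([1, 2, 3, 4, 5])

# A slice is a view: it shares memory with the original array.
view = original[1:4]
view[0] = 99
print(original)  # [ 1 99  3  4  5] -> the original changed

# .copy() makes an independent array.
independent = original[1:4].copy()
independent[0] = -1
print(original)  # [ 1 99  3  4  5] -> the original did not change
```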
Basic Array Operations
Subtraction, multiplication, division
Adding
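A minimal example of the element-wise operations named above, using two small example arrays:

```python
import numpy as np

data = np.array([1, 2])
ones = np.ones(2, dtype=int)

added = data + ones        # element-wise addition
subtracted = data - ones   # element-wise subtraction
multiplied = data * data   # element-wise multiplication
divided = data / data      # element-wise division, always gives floats

# A scalar is applied to every element (broadcasting).
scaled = data * 1.6

print(added, subtracted, multiplied, divided, scaled)
```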
More useful array operations
min(), max(), sum()
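These aggregations work on the whole array by default, or along one axis when axis is given. A small sketch with example values:

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4],
              [5, 6]])

print(a.min())  # 1  -> smallest value in the whole array
print(a.max())  # 6  -> largest value in the whole array
print(a.sum())  # 21 -> sum of all elements

col_sums = a.sum(axis=0)  # collapse the rows: one sum per column
row_max = a.max(axis=1)   # collapse the columns: one maximum per row

print(col_sums)  # [ 9 12]
print(row_max)   # [2 4 6]
```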
Pandas DataFrame
Data structure also contains labeled axes (rows and columns). Arithmetic
operations align on both row and column labels. Can be thought of as a
dict-like container for Series objects. The primary pandas data structure.
Build DataFrame
Columns
Index
Rows
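One common way to build a DataFrame is from a dictionary of columns. The names, ages, and index labels below are hypothetical values chosen only for illustration.

```python
import pandas as pd

# Keys become column names; index supplies the row labels.
df = pd.DataFrame(
    {"name": ["Ani", "Budi", "Citra"], "age": [23, 31, 27]},
    index=["a", "b", "c"],
)

print(df.columns.tolist())  # ['name', 'age']
print(df.index.tolist())    # ['a', 'b', 'c']
print(df.shape)             # (3, 2) -> 3 rows, 2 columns
```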
Import Dataset To DataFrame
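A sketch of loading a CSV with pd.read_csv. To keep the example self-contained, the CSV content is an in-memory string with made-up rows; in practice you would normally pass a file path instead.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "name,age\nAni,23\nBudi,31\n"

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())  # first rows, a quick check that the import worked
print(df.shape)   # (2, 2)
```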
1. Selection
2. Addition
3. Deletion
4. Rename
5. Sorting
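The five operations above can be sketched on a small DataFrame. The column names and values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ani", "Budi", "Citra"], "age": [23, 31, 27]})

# 1. Selection: pick one column by name.
ages = df["age"]

# 2. Addition: assign to a new column name.
df["age_next_year"] = df["age"] + 1

# 3. Deletion: drop returns a new DataFrame without the column.
df = df.drop(columns=["age_next_year"])

# 4. Rename: map old column names to new ones.
df = df.rename(columns={"name": "first_name"})

# 5. Sorting: order the rows by a column, largest first.
df_sorted = df.sort_values("age", ascending=False)

print(df_sorted)
```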
Bracket
Loc and Iloc
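A short comparison of the three selection styles named above. The data and the index labels "x", "y", "z" are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Ani", "Budi", "Citra"], "age": [23, 31, 27]},
    index=["x", "y", "z"],
)

# Bracket: select a column by name, or filter rows with a boolean condition.
age_column = df["age"]
older = df[df["age"] > 25]

# loc: select by label (row label, column label).
by_label = df.loc["y", "age"]

# iloc: select by integer position (row position, column position).
by_position = df.iloc[1, 1]

# Label slices with loc include the end label; iloc slices exclude the end.
loc_rows = df.loc["x":"y"]
iloc_rows = df.iloc[0:1]

print(by_label, by_position)
print(len(loc_rows), len(iloc_rows))
```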
Let’s go to Notebook
Data Cleansing
About Cleaning and
Preprocessing Dataset
Cleaning your data should be the first step in your Data Science
(DS) or Machine Learning (ML) workflow. Without clean data, you will
have a much harder time seeing the parts that actually matter
in your exploration. According to CrowdFlower, data scientists
spend 60% of their time organizing and cleansing data.
Why are data cleaning and preprocessing important?
Reasons:
1. A cleaned dataset is easier to visualize and analyze
2. Interpretations drawn from clean data are more likely to be valid
3. Some functions raise errors when the data has not been cleaned
4. Many data scientists can improve the accuracy of their models just by
cleaning and preprocessing the data
Common Problem in Data Cleansing
1. Duplicate Dataset
2. Missing Data
3. Outliers
4. Data Type
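As an illustration of the first and last items in the list, the sketch below finds and removes duplicate rows and then fixes a column stored with the wrong type. The DataFrame and its column names are hypothetical.

```python
import pandas as pd

# The ids were read as text, and one row appears twice.
df = pd.DataFrame(
    {"id": ["1", "2", "2", "3"], "city": ["Jakarta", "Depok", "Depok", "Bogor"]}
)

# duplicated() flags every row that repeats an earlier row.
n_duplicates = int(df.duplicated().sum())

# Keep the first occurrence of each row and renumber the index.
df = df.drop_duplicates().reset_index(drop=True)

# Convert the id column from text to integers.
df["id"] = df["id"].astype(int)

print(n_duplicates)  # 1
print(df.dtypes)
```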
Handling missing values:
1. Identify the missing values
2. Analyze the number or proportion of missing values
3. Appropriately delete or impute the missing values
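The missing-value workflow above (identify, measure, then delete or impute) could look like the following in pandas. The data is made up, and the choice of the median and of an "Unknown" label for imputation is an assumption for this example, not a general recommendation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"age": [23, np.nan, 27, 31], "city": ["Jakarta", None, None, "Bogor"]}
)

# Identify and count the missing values per column.
missing_count = df.isna().sum()

# Proportion of missing values per column.
missing_share = df.isna().mean()

# Option A: delete every row that has at least one missing value.
dropped = df.dropna()

# Option B: impute, here with the median for numbers and a placeholder for text.
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())
imputed["city"] = imputed["city"].fillna("Unknown")

print(missing_count)
print(missing_share)
print(len(dropped))
```

Whether to delete or impute usually depends on how large the missing proportion is and on why the values are missing.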
An outlier is a data point that lies an abnormal distance from other values
in the data.
Basic Outlier Formula (Q1 and Q3 are the first and third quartiles):
1. IQR = Q3 - Q1
2. Lower Bound = Q1 - 1.5 × IQR
3. Upper Bound = Q3 + 1.5 × IQR
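The bounds above can be computed directly with pandas. This is a minimal sketch on made-up values; it uses pandas' default quantile interpolation, which is one of several possible quartile conventions.

```python
import pandas as pd

values = pd.Series([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])

q1 = values.quantile(0.25)  # first quartile
q3 = values.quantile(0.75)  # third quartile
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Anything outside [lower_bound, upper_bound] is flagged as an outlier.
outliers = values[(values < lower_bound) | (values > upper_bound)]

print(lower_bound, upper_bound)
print(outliers.tolist())  # [102]
```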
What is Matplotlib?
Matplotlib is a plotting library for Python, used mainly for 2-D figures.
Its interface is modeled on MATLAB-style graphs and visualizations.
It is widely used for data visualization in Python because it is robust,
free, and easy to get started with.
Install Matplotlib
The Matplotlib Object Hierarchy
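A small sketch of how the objects nest: a Figure contains one or more Axes, and each Axes holds its own axis objects, title, and plotted lines. The Agg backend is selected only so that the example runs without a display; the plotted numbers are arbitrary.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, no window needed
import matplotlib.pyplot as plt

# Figure is the outer container; Axes is a single plot inside it.
fig, ax = plt.subplots()

# Plotting adds a Line2D object to the Axes.
(line,) = ax.plot([1, 2, 3], [2, 4, 1], label="demo")

# Titles, axis labels, and legends all belong to the Axes.
ax.set_title("Demo")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()

print(fig.axes[0] is ax)    # True: the Figure knows its Axes
print(ax.lines[0] is line)  # True: the Axes knows its lines
print(type(ax.xaxis).__name__)
```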
● Histogram
● Multiple Histogram
● Pie Chart
● Time Series by Line Plot
● Box Plot
● Twin Axis
● Bar Plot
● Scatter Plot
And many more: https://matplotlib.org/gallery/index.html
When to use: Use a histogram when
you need to see how often the values
of a variable occur, shown as the
count of observations in each bin.
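A minimal histogram sketch. The ages and the bin edges are made-up values, and the Agg backend is used only so the code runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, no window needed
import matplotlib.pyplot as plt
import numpy as np

ages = np.array([22, 25, 27, 31, 34, 35, 38, 41, 45, 52])

fig, ax = plt.subplots()

# hist returns the count per bin and the bin edges it used.
counts, edges, _ = ax.hist(ages, bins=[20, 30, 40, 50, 60])

ax.set_xlabel("age")
ax.set_ylabel("count")

print(counts)  # [3. 4. 2. 1.]
```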
1. Cleansing
Checking for problems with the collected data, such as missing data or
measurement error, data types of columns, etc
2. Defining questions
Identifying the relationship between the variables that are particularly
interesting or unexpected
3. Visualizations
Using effective visualizations to communicate the result
Let’s go to Notebook
Data Preprocessing
Encode Data
Some Approaches
1. Find and Replace
2. Label Encoding
3. One Hot Encoding
4. With Scikit-Learn
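The first three approaches can be sketched with pandas alone. The DataFrame, its column names, and the size ordering are hypothetical. The fourth approach is not shown here; scikit-learn provides LabelEncoder and OneHotEncoder in sklearn.preprocessing for the same purpose.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "size": ["small", "large", "medium", "small"],
        "color": ["red", "blue", "red", "green"],
    }
)

# 1. Find and Replace: map each category to a number you choose yourself.
size_map = {"small": 0, "medium": 1, "large": 2}
df["size_encoded"] = df["size"].map(size_map)

# 2. Label Encoding: let pandas assign one integer code per category
#    (codes follow the sorted category order: blue=0, green=1, red=2).
df["color_code"] = df["color"].astype("category").cat.codes

# 3. One Hot Encoding: one indicator column per category.
one_hot = pd.get_dummies(df["color"], prefix="color")

print(df)
print(one_hot)
```

Find and Replace suits ordered categories such as sizes, while One Hot Encoding avoids implying an order between categories such as colors.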
Scaling
Use MinMaxScaler
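To show what min-max scaling does, the sketch below applies the formula (x - min) / (max - min) column by column with NumPy. This is the calculation that scikit-learn's MinMaxScaler performs with its default range of 0 to 1. The input matrix is made up.

```python
import numpy as np

# Two features on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

col_min = X.min(axis=0)  # minimum of each column
col_max = X.max(axis=0)  # maximum of each column

# Rescale each column so its minimum becomes 0 and its maximum becomes 1.
X_scaled = (X - col_min) / (col_max - col_min)

print(X_scaled)
```

With scikit-learn installed, MinMaxScaler().fit_transform(X) from sklearn.preprocessing should give the same result for this input.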
Let’s go to Notebook
Homework(?)
1. Carry out Business Understanding and Data Understanding on the HR data that has been
provided.
Source : https://www.kaggle.com/rhuebner/human-resources-data-set
Pandas Documentation : https://pandas.pydata.org/pandas-docs/stable/pandas.pdf
Numpy Documentation : https://numpy.org/doc/stable/numpy-ref.pdf
Matplotlib Documentation : https://matplotlib.org/contents.html
Seaborn Documentation : https://seaborn.pydata.org/
Encode : https://pbpython.com/categorical-encoding.html
Scaling :
https://towardsdatascience.com/scale-standardize-or-normalize-with-scikit-learn-6ccc7d176a02
https://machinelearningmastery.com/standardscaler-and-minmaxscaler-transforms-in-python/