1st Class-Introduction and Python Package

Uploaded by

Dyna Fransisca

Introduction to Data Science and Python Package for DS
Syamil Fakhruddin H.A

University of Indonesia
Geophysics

Digital Talent Scholarship

Artificial Intelligence

PT Zegen Laraka Utama

AI Developer

IYKRA
Data Fellowship
What will we learn in this
Session (Objective)
1 Introduction to Data Science

2 Python Packages for Data Science

3 Exploratory Data Analysis

4 Data Preparation
Outline
Introduction to Data Science

Numpy

Pandas

Matplotlib

Exploratory Data Analysis

Data Preparation
Background of Data Science

Industrial Revolution 4.0 (Big Data, IoT, Tech, etc.)

How do we use it and bring insight to business value?

DJ Patil, a computer scientist and mathematician, is widely credited with coining the job title "data scientist".
What’s Data Science?

A field that combines three things: programming, mathematics and statistics, and business knowledge.
Overkill!!!
Cross-Industry Standard Process for Data Mining (CRISP-DM framework)
Python Packages for Data
Science
Basic
Numpy
Our Topics

What and Why Numpy
What’s Array
Build Array
Some Operation with Array
2D Array
Numpy for Statistics
Hands-On
What’s Numpy
NumPy is short for Numerical Python, an open source library built around multidimensional array objects.
In short: NumPy is a Python library for creating and manipulating multidimensional arrays.

Focus on
1. Build Array
2. Do some operations

Source: https://numpy.org/devdocs/user/absolute_beginners.html
Why Numpy?
1. NumPy arrays are faster and more compact than Python lists for numerical work.
2. NumPy uses less memory to store the same data.

Numpy > List
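
A minimal sketch of the two claims above. The array size and variable names are illustrative choices, not taken from the slides, and the memory figure for the list is only a rough estimate.

```python
import sys
import numpy as np

n = 1_000
py_list = list(range(n))
np_arr = np.arange(n)

# Element-wise arithmetic: a list needs an explicit loop,
# while the array applies the operation to every element at once.
doubled_list = [x * 2 for x in py_list]
doubled_arr = np_arr * 2

# Rough memory comparison: the list stores pointers to separate int objects,
# the array stores the raw values in one contiguous buffer.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)
arr_bytes = np_arr.nbytes

print(doubled_arr[:5])
print(list_bytes, arr_bytes)
```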

So what’s Array?
Array is…
An array is a grid of values that carries information about the raw data, how to locate an element, and how to interpret it.
Its elements are indexed, can be accessed by position, and can span multiple dimensions.

ndarray
An N-dimensional array is simply an array with any number of dimensions.
Build Array
1. np.array()
2. np.zeros()
3. np.ones()
4. np.empty()
5. np.arange()
6. np.linspace()
7. Specifying your data type
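
The constructors listed above can be sketched as follows. The values and shapes are arbitrary examples.

```python
import numpy as np

a = np.array([1, 2, 3])          # from a Python list
z = np.zeros(3)                  # filled with 0.0
o = np.ones((2, 3))              # 2 rows x 3 columns filled with 1.0
e = np.empty(2)                  # uninitialized: contents are arbitrary
r = np.arange(0, 10, 2)          # start, stop (exclusive), step
l = np.linspace(0, 1, num=5)     # 5 evenly spaced values, endpoints included
i = np.ones(3, dtype=np.int64)   # explicit data type instead of the default float

print(r)   # [0 2 4 6 8]
print(l)   # [0.   0.25 0.5  0.75 1.  ]
```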
Adding and Sorting Elements
1. np.sort()

2. np.concatenate()
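
A short sketch of both functions, with made-up values:

```python
import numpy as np

arr = np.array([2, 1, 5, 3, 7, 4, 6, 8])
sorted_arr = np.sort(arr)            # returns a sorted copy; arr itself is unchanged

a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])
joined = np.concatenate((a, b))      # 1-D arrays are joined end to end

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6]])
stacked = np.concatenate((x, y), axis=0)   # append y as a new row

print(sorted_arr)   # [1 2 3 4 5 6 7 8]
print(stacked.shape)   # (3, 2)
```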
How do you know the shape and size of an
array?

ndim

size

shape
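
For example, on a three-dimensional array (the shape here is an arbitrary choice):

```python
import numpy as np

arr = np.zeros((3, 2, 4))

print(arr.ndim)    # 3  -> number of axes
print(arr.size)    # 24 -> total number of elements (3 * 2 * 4)
print(arr.shape)   # (3, 2, 4) -> length along each axis
```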
Indexing and Slicing
Indexing and Slicing (with Condition)
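
A sketch of plain indexing and slicing, followed by selection with a condition (boolean mask). The arrays are illustrative.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6])
first = data[0]          # 1
middle = data[1:3]       # [2 3]  (stop index is exclusive)
last_two = data[-2:]     # [5 6]

m = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
small = m[m < 5]                  # [1 2 3 4]
between = m[(m > 2) & (m < 11)]   # combine conditions with & and |
evens = m[m % 2 == 0]             # [ 2  4  6  8 10 12]
```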
Create Array from Existing Data
1. Slicing
2. vstack() and hstack()
3. hsplit()
4. copy()

Change

Not Change
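
A sketch of the four approaches. It assumes that "Change" and "Not Change" on the slide refer to the difference between a slice (a view that shares memory with the original) and copy(); the values are made up.

```python
import numpy as np

a = np.array([[1, 1], [2, 2]])
b = np.array([[3, 3], [4, 4]])

v = np.vstack((a, b))            # stack vertically   -> shape (4, 2)
h = np.hstack((a, b))            # stack horizontally -> shape (2, 4)
left, right = np.hsplit(h, 2)    # split h back into two equal column blocks

base = np.arange(6)              # [0 1 2 3 4 5]
view = base[:3]                  # slicing returns a view
view[0] = 99                     # ...so this also changes base[0]

safe = base.copy()               # copy() allocates new memory
safe[1] = -1                     # ...so base[1] is not changed
```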
Basic Array Operations
Addition, subtraction, multiplication, division
More useful array operations
min(), max(), sum()

Multiplication with scalar
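
These operations work element by element. A small sketch with arbitrary values:

```python
import numpy as np

data = np.array([1, 2])
ones = np.ones(2, dtype=int)

added = data + ones        # [2 3]
subtracted = data - ones   # [0 1]
multiplied = data * data   # [1 4]
divided = data / data      # [1. 1.]

scaled = data * 1.6        # the scalar is applied to every element

a = np.array([1, 2, 3, 4])
total, smallest, largest = a.sum(), a.min(), a.max()   # 10, 1, 4
```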

2D Array (Matrices)
Build

Indexing and Slicing

2D Array (Matrices)
Build Array
2D Array (Matrices)
Min, Max, Sum
2D Array (Matrices)
Operation
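
The 2D topics above (build, index and slice, min/max/sum, operations) can be sketched on one small matrix. The numbers are illustrative.

```python
import numpy as np

m = np.array([[1, 2], [3, 4], [5, 6]])   # 3 rows x 2 columns

element = m[0, 1]        # row 0, column 1 -> 2
rows = m[1:3]            # rows 1 and 2
first_col = m[:, 0]      # every row, column 0 -> [1 3 5]

overall_max = m.max()         # 6
col_max = m.max(axis=0)       # max of each column -> [5 6]
row_sum = m.sum(axis=1)       # sum of each row    -> [ 3  7 11]

doubled = m * 2                        # scalar operation
shifted = m + np.array([10, 20])       # the 1-D array is broadcast across rows
```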
Implementation
Let’s go to Notebook
Pandas and
Data
Preparation
Pandas
Pandas is a software library written for the Python programming
language for data manipulation and analysis

Pandas DataFrame

Two-dimensional, size-mutable, potentially heterogeneous tabular data.

Data structure also contains labeled axes (rows and columns). Arithmetic
operations align on both row and column labels. Can be thought of as a
dict-like container for Series objects. The primary pandas data structure.
Build DataFrame

Columns
Index

Rows
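
A minimal sketch of building a DataFrame from a dictionary. The column names, index labels, and values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Ana", "Budi", "Citra"], "age": [23, 31, 27]},   # columns
    index=["a", "b", "c"],                                      # row labels (index)
)

print(df.columns.tolist())   # ['name', 'age']
print(df.index.tolist())     # ['a', 'b', 'c']
print(df.shape)              # (3, 2) -> 3 rows, 2 columns
```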
Import Dataset To DataFrame

DataFrame name    Location of the file

Syntax for reading
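
A sketch of the reading syntax with pd.read_csv(). To keep the example self-contained, the CSV content is an in-memory string with made-up values; with a real dataset you would pass the file location instead.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "name,age\nAna,23\nBudi,31\n"

# df is the DataFrame name; the argument is where the data lives.
# With a real file this would look like: df = pd.read_csv("path/to/file.csv")
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())
```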

Some Common Syntax after Loading DataFrame

DataFrame also offers a number of statistical functions, such as:
● abs(): absolute values
● mean(): mean values. It also offers median() and mode()
● min(): minimum value. It also offers max()
● count() and std(): number of values and standard deviation

There are also key attributes of a DataFrame, such as:
● shape: shows the dimensionality of the DataFrame
● size: number of items
● ndim: number of axes

Describe
If you want a quick summary of your DataFrame, with the count, mean, standard deviation, minimum, maximum, and a number of percentiles for each column, use the describe method:
df.describe()
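
A sketch of these attributes and functions on a small DataFrame with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({
    "height": [150.0, 160.0, 170.0, 180.0],
    "weight": [50.0, 60.0, 65.0, 80.0],
})

print(df.shape)   # (4, 2)
print(df.size)    # 8
print(df.ndim)    # 2

col_means = df.mean()                   # mean of every column
median_height = df["height"].median()   # 165.0

summary = df.describe()   # count, mean, std, min, 25%, 50%, 75%, max per column
print(summary)
```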
Data Manipulation with Pandas

1. Selection
2. Addition
3. Deletion
4. Rename
5. Sorting
Selection
Bracket
loc and iloc
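
The five manipulation steps can be sketched as follows. The DataFrame contents and column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ana", "Budi", "Citra"],
    "age": [23, 31, 27],
    "city": ["Depok", "Bandung", "Medan"],
})

# 1. Selection
ages = df["age"]                                      # bracket: one column
first_row = df.iloc[0]                                # iloc: by integer position
older = df.loc[df["age"] > 25, ["name", "age"]]       # loc: by label or boolean mask

# 2. Addition
df["age_next_year"] = df["age"] + 1

# 3. Deletion
df = df.drop(columns=["city"])

# 4. Rename
df = df.rename(columns={"name": "full_name"})

# 5. Sorting
df_sorted = df.sort_values("age", ascending=False)
```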
Let’s go to Notebook
Data Cleansing
About Cleaning and
Preprocessing Dataset
Cleaning your data should be the first step in your Data Science
(DS) or Machine Learning (ML) workflow. Without clean data, you will
have a much harder time seeing the important parts of your
exploration. According to CrowdFlower, data scientists
spend 60% of their time organizing and cleansing data!
Why are data cleaning and preprocessing important?
Reasons:
1. A cleaned dataset is easier to visualize and analyze
2. Interpretation of the data is valid
3. Some functions raise errors when the data has not been cleaned
4. Data scientists can often improve model accuracy just by cleaning and preprocessing the data
Common Problem in Data Cleansing
1. Duplicate Dataset
2. Missing Data
3. Outliers
4. Data Type
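
As a sketch of the first and last items (duplicates and data types), using a made-up DataFrame whose numeric columns were loaded as text:

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["1", "2", "2", "3"],
    "score": ["10", "20", "20", "30"],
})

# 1. Duplicates: count fully identical rows, then drop them.
n_dupes = df.duplicated().sum()
df = df.drop_duplicates().reset_index(drop=True)

# 4. Data type: convert text columns to numbers.
df["score"] = df["score"].astype(int)
df["id"] = pd.to_numeric(df["id"])

print(n_dupes)   # 1
print(df.dtypes)
```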
Handling missing values:
1. Identify missing values
2. Analyze the number or proportion of missing values
3. Appropriately delete or impute missing values
Other Methods for Imputing Missing Values:
1. Median (used for skewed distributions)
2. Mode (used for categorical data)
3. Mean (used for normally distributed data)
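
A sketch of counting missing values and then imputing with the three methods above. The columns and values are made up, with one missing value per column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [3.0, 4.0, np.nan, 5.0, 100.0],              # skewed by one large value
    "city": ["Depok", None, "Depok", "Bandung", "Depok"],  # categorical
    "height": [160.0, np.nan, 170.0, 165.0, 165.0],        # roughly symmetric
})

missing_count = df.isna().sum()    # number of missing values per column
missing_share = df.isna().mean()   # proportion of missing values per column

df["income"] = df["income"].fillna(df["income"].median())   # median -> 4.5
df["city"] = df["city"].fillna(df["city"].mode()[0])        # mode   -> "Depok"
df["height"] = df["height"].fillna(df["height"].mean())     # mean   -> 165.0
```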
An outlier is a data point that lies an abnormal distance from other values
in the data.
Basic Outlier Formula :
1. Lower Bound = Q1 - 1.5 x IQR
2. Upper Bound = Q3 + 1.5 x IQR
3. IQR = Q3 - Q1
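
The bounds above can be computed with pandas quantiles. This is a sketch on a made-up series that contains one obvious outlier.

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 14, 11, 95])

q1 = s.quantile(0.25)
q3 = s.quantile(0.75)
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Values outside [lower_bound, upper_bound] are flagged as outliers.
outliers = s[(s < lower_bound) | (s > upper_bound)]
print(outliers.tolist())   # [95]
```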

The box plot is a useful graphical display for describing the behavior of the data in the middle as well as at the ends of the distribution.
Let’s go to Notebook
Data Wrangling
Combining Data (Join/Merge)
1. Natural join
2. Full outer join
3. Left outer join
4. Right outer join
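
In pandas these joins map onto the how argument of pd.merge(). The sketch below assumes that "natural join" here means an inner join on the shared column; the two tables and the column name key are hypothetical.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

inner = pd.merge(left, right, on="key", how="inner")     # only keys in both: b, c
outer = pd.merge(left, right, on="key", how="outer")     # all keys: a, b, c, d
left_j = pd.merge(left, right, on="key", how="left")     # all keys from left: a, b, c
right_j = pd.merge(left, right, on="key", how="right")   # all keys from right: b, c, d

# Rows without a match on the other side get NaN in the missing columns.
print(outer)
```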
Combining Data (concat)
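
A sketch of pd.concat() along both axes, using two small made-up tables:

```python
import pandas as pd

a = pd.DataFrame({"id": [1, 2], "v": [10, 20]})
b = pd.DataFrame({"id": [3, 4], "v": [30, 40]})

# Stack rows (same columns); ignore_index renumbers the rows 0..n-1.
rows = pd.concat([a, b], ignore_index=True)

# Place tables side by side (axis=1); rows are aligned on the index.
cols = pd.concat([a, b.rename(columns={"id": "id2", "v": "v2"})], axis=1)

print(rows.shape)   # (4, 2)
print(cols.shape)   # (2, 4)
```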
Data
Visualization
in Python
What is Data Visualization

Data visualization is the discipline of trying to understand data by placing it in a visual context so that patterns, trends and correlations that might not otherwise be detected can be exposed.

There are several popular plotting libraries:

● Matplotlib: low level, provides lots of freedom
● Pandas Visualization: easy to use interface, built on Matplotlib
● Seaborn: high-level interface, great default styles
● Plotly: can create interactive plots
Plotting with Pandas

A pandas DataFrame offers a range of graphical plotting options.
We can draw line plots, box plots, area plots, scatter plots, stacked charts, bar charts, histograms, etc.

● df.plot.scatter() #plots a scatter chart

● df.plot.line() # plots a line chart
● df.boxplot() # plots a box plot
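
A sketch of the three calls above on a made-up DataFrame. The Agg backend is an assumption made so the example runs without a display; in a notebook it is not needed.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, 4, 5, 4, 6]})

ax_scatter = df.plot.scatter(x="x", y="y")   # scatter chart
ax_line = df.plot.line(x="x", y="y")         # line chart

fig, ax = plt.subplots()                     # separate figure for the box plot
ax_box = df.boxplot(column="y", ax=ax)       # box plot of one column
```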
Matplotlib

What is Matplotlib?
Matplotlib is a 2-D plotting library for creating figures and visualizing data.
Matplotlib emulates MATLAB-like graphs and visualizations.

So why don’t we use MATLAB instead?

MATLAB is not free, is difficult to scale, and is tedious as a programming language.

So Matplotlib is used in Python because it is a robust, free, and easy-to-use library for data visualization.
Install Matplotlib
The Matplotlib Object Hierarchy

A Figure object is the outermost container for a matplotlib graphic, which can contain multiple Axes objects. One source of confusion is the name: an Axes actually translates into what we think of as an individual plot or graph (rather than the plural of “axis,” as we might expect).

You can think of the Figure object as a box-like container holding one or more Axes (actual plots). Below the Axes in the hierarchy are smaller objects such as tick marks, individual lines, legends, and text boxes. Almost every “element” of a chart is its own manipulable Python object, all the way down to the ticks and labels.
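
The hierarchy can be explored directly in code. This is a minimal sketch with arbitrary data; the Agg backend is an assumption made so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt

# One Figure (outer container) holding two Axes (individual plots).
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

ax1.plot([0, 1, 2], [0, 1, 4], label="squares")
ax1.set_title("Left plot")
ax1.legend()

ax2.bar(["a", "b"], [3, 5])
ax2.set_title("Right plot")

# Smaller objects below an Axes are ordinary Python objects too.
line = ax1.get_lines()[0]
line.set_linewidth(2)
first_tick = ax1.xaxis.get_major_ticks()[0]
first_tick.label1.set_color("red")

print(len(fig.axes))   # 2
```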
Types of Visualization

● Histogram
● Multiple Histogram
● Pie Chart
● Time Series by Line Plot
● Box Plot
● Twin Axis
● Bar Plot
● Scatter Plot
And many more: https://matplotlib.org/gallery/index.html
When to use: use a histogram when we need to plot the counts of a variable.

E.g., the number of copies of a particular game sold in a store.

The histogram above shows Grand Canyon visitors over the years.
When to use: when we need to compare the distributions of two variables.

We can see that Grand Canyon has comparatively more visitors than Bryce Canyon.
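
A sketch of a single histogram and a multiple histogram with Matplotlib. The two samples are randomly generated stand-ins, not the park visitor data shown on the slides, and the Agg backend is an assumption made so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt
import numpy as np

# Synthetic samples for illustration only.
rng = np.random.default_rng(0)
sample_a = rng.normal(loc=4.5, scale=0.5, size=200)
sample_b = rng.normal(loc=2.0, scale=0.4, size=200)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# Single histogram: counts of one variable per bin.
counts, bin_edges, _ = ax1.hist(sample_a, bins=10)
ax1.set_title("Histogram")

# Multiple histogram: two distributions on the same axes.
ax2.hist(sample_a, bins=10, alpha=0.6, label="sample A")
ax2.hist(sample_b, bins=10, alpha=0.6, label="sample B")
ax2.set_title("Multiple histogram")
ax2.legend()
```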
Let’s go to Notebook
Exploratory
Data
Analysis
What is EDA?

Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns, spot anomalies, and check assumptions with the help of summary statistics and graphical representations.
3 Parts of EDA

1. Cleansing
Checking for problems with the collected data, such as missing data or
measurement error, data types of columns, etc

2. Defining questions
Identifying the relationship between the variables that are particularly
interesting or unexpected

3. Visualizations
Using effective visualizations to communicate the result
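
The first two parts can be sketched with a few pandas calls. The DataFrame below is a made-up example, and the specific checks are one possible choice rather than a fixed recipe.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [23, 31, np.nan, 45],
    "salary": [3.0, 4.5, 5.0, 7.5],
    "dept": ["IT", "HR", "IT", "IT"],
})

# 1. Cleansing checks: column data types and missing values.
dtypes = df.dtypes
missing = df.isna().sum()

# 2. Defining questions: how are the variables related?
corr = df[["age", "salary"]].corr()             # pairwise correlation
by_dept = df.groupby("dept")["salary"].mean()   # average salary per department

print(missing)
print(by_dept)
```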
Let’s go to Notebook
Data
Preprocessing
Encode Data
Some Approaches
1. Find and Replace
2. Label Encoding
3. One Hot Encoding
4. With Scikit-Learn
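
The first three approaches can be sketched with pandas alone. The columns and the size mapping are hypothetical. Series.replace() with the same dictionary is an alternative to map() for find and replace.

```python
import pandas as pd

df = pd.DataFrame({
    "size": ["S", "M", "L", "M"],
    "color": ["red", "blue", "red", "green"],
})

# 1. Find and replace: map each category to a number chosen by hand.
size_map = {"S": 1, "M": 2, "L": 3}
df["size_num"] = df["size"].map(size_map)

# 2. Label encoding: one integer code per category (categories sorted alphabetically).
df["color_code"] = df["color"].astype("category").cat.codes

# 3. One-hot encoding: one indicator column per category.
dummies = pd.get_dummies(df["color"], prefix="color")

# 4. scikit-learn provides LabelEncoder and OneHotEncoder in sklearn.preprocessing
#    for the same purpose (not used here to keep the sketch pandas-only).
print(dummies.columns.tolist())
```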
Scaling
Use MinMaxScaler
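
With its default settings, MinMaxScaler rescales each feature to the range [0, 1] using (x - min) / (max - min). The sketch below computes that formula by hand with NumPy on made-up data, so it runs without scikit-learn; the scikit-learn call is shown in a comment.

```python
import numpy as np

X = np.array([
    [1.0, 200.0],
    [2.0, 400.0],
    [3.0, 800.0],
])

# Per-column minimum and maximum.
col_min = X.min(axis=0)
col_max = X.max(axis=0)

# Min-max scaling: every column ends up between 0 and 1.
X_scaled = (X - col_min) / (col_max - col_min)

# Equivalent with scikit-learn (default feature_range=(0, 1)):
#   from sklearn.preprocessing import MinMaxScaler
#   X_scaled = MinMaxScaler().fit_transform(X)
print(X_scaled)
```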
Let’s go to Notebook
Homework(?)
1. Carry out Business Understanding and Data Understanding on the HR data that has been provided.
Source: https://www.kaggle.com/rhuebner/human-resources-data-set

2. Perform data cleansing, data exploration, and data preparation.

Sources

Pandas Documentation: https://pandas.pydata.org/pandas-docs/stable/pandas.pdf
Numpy Documentation: https://numpy.org/doc/stable/numpy-ref.pdf
Matplotlib Documentation: https://matplotlib.org/contents.html
Seaborn Documentation: https://seaborn.pydata.org/
Encode: https://pbpython.com/categorical-encoding.html
Scaling:
https://towardsdatascience.com/scale-standardize-or-normalize-with-scikit-learn-6ccc7d176a02
https://machinelearningmastery.com/standardscaler-and-minmaxscaler-transforms-in-python/
