Data Analysis Using Python Day_1 to Day_4

DATA ANALYSIS USING PYTHON
Mohamed Essam

AGENDA

1. Introduction to Data Analysis with Python


2. The data analysis workflow
3. Introduction to Python for Data Analysis
4. Setting up your development environment (Jupyter Notebook)
5. Python basics (variables, data types, operators)
6. Importing and exporting data (CSV, Excel)
1. INTRODUCTION TO DATA ANALYSIS
WITH PYTHON
What is Data Analysis?
• Data is raw information, and analysis of data is the systematic process of interpreting and transforming that
data into meaningful insights.

Why is Data Analysis important?

• Business Intelligence
• Performance Evaluation
• Problem Solving
• Risk Management
• Optimizing Processes
• Informed Decision-Making

Types of Data Analysis

Descriptive Analysis: looks at data and analyzes past events for insight into how to approach future events. It examines past performance by mining historical data to understand the causes of success or failure in the past.

Diagnostic Analysis: works hand in hand with descriptive analysis. Where descriptive analysis finds out what happened in the past, diagnostic analysis finds out why it happened, what measures were taken at the time, or how frequently it has happened.

Predictive Analysis: uses the information obtained from descriptive and diagnostic analysis to predict future data, i.e., what is likely to happen. Predicting future data does not make us fortune-tellers; by looking at past trends and behavioral patterns we forecast what might happen.

Prescriptive Analysis: an advanced extension of predictive analysis. Once you predict something, you usually end up with many possible options, and it can be unclear which option will actually work; prescriptive analysis helps decide among them.

Statistical Analysis: a statistical approach or technique for analyzing data sets in order to summarize their important and main characteristics, generally with the help of visual aids.

Top Data Analysis Tools:
• SAS
• Microsoft Excel
• R
• Python
• Tableau Public
• RapidMiner
• KNIME

Applications of Data Analysis:

• Business Intelligence
• Predictive Maintenance in Healthcare
• Manufacturing Optimization
• Fraud Detection and Security
• Financial Forecasting
• Marketing and Customer Insights
The data analysis workflow

Define
Objectives and
Questions

Interpretation
and Data Collection
Communication

The Process of Data


Analysis

Statistical Data Cleaning


Analysis or and
Modeling Preprocessing

Exploratory
Data
Analysis(EDA)

2. INTRODUCTION TO PYTHON FOR DATA ANALYSIS

• What is Python?
• Python is a high-level, general-purpose, and very popular programming language. Python (currently Python 3) is used in web development, machine learning applications, and other cutting-edge areas of the software industry. It is used by almost all the tech giants, such as Google, Amazon, Facebook, Instagram, Dropbox, and Uber.

• It’s coding time…..


DATA ANALYSIS USING PYTHON
Mohamed Essam

AGENDA

1. Working with Data Structures in Python


2. Lists, tuples, dictionaries
3. DataFrames and Series in pandas
4. Data Cleaning and Manipulation
5. Handling missing values
6. Data wrangling techniques (filtering, sorting, grouping)
WORKING WITH DATA STRUCTURES IN
PYTHON

Specifically, you'll learn about:

• Types of Data Structures: Lists, Tuples, Sets, Dictionaries, Compound Data Structures
• Operators: Membership, Identity
• Built-In Functions and Methods

What Are Data Structures?

• Data structures are containers or collections of data that organize and group data types together in different ways. You can think of data structures as file folders that hold organized files of data inside them.

Containers that hold some values


Lists! [ ]
A list is one of the most common and basic data structures in Python.

You can create a list with square brackets, and lists can contain any mix of the data types you have seen so far.
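A small illustrative example (the values are made up):

```python
# A list is created with square brackets and may mix data types.
months = ["January", "February", "March"]
mixed = [1, "two", 3.0, True]

print(len(months))   # 3
print(months[0])     # January
print(mixed[-1])     # True (negative indices count from the end)
```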

Slice and Dice with Lists

We can pull more than one value from a list at a time by using slicing. When slicing, it is important to remember that the lower index is inclusive and the upper index is exclusive. For example:
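The slide's code is not reproduced in this document, so the following is an illustrative reconstruction of slicing with an inclusive lower bound and an exclusive upper bound:

```python
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]

q1 = months[0:3]     # indices 0, 1, 2: the upper index 3 is excluded
print(q1)            # ['Jan', 'Feb', 'Mar']

print(months[3:])    # ['Apr', 'May', 'Jun'] (an omitted upper bound runs to the end)
print(months[:2])    # ['Jan', 'Feb']
```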

Mutability and Order

Mutability refers to whether or not we can change an object once it has been created. If an
object can be changed, it is called mutable. However, if an object cannot be changed after
it has been created, then the object is considered immutable.

Examples - Lists are mutable, and strings are immutable.


Quiz Question
sentence1 = "I wish to register a complaint."
sentence2 = ["I", "wish", "to", "register", "a", "complaint", "."]

sentence2[6] = "!"                # works: lists are mutable

sentence2[0] = "Our Majesty"      # works

sentence1[30] = "!"               # fails with TypeError: strings are immutable

sentence2[0:2] = ["We", "want"]   # works: slice assignment on a list

Tuples! ( )
A tuple is another useful container. It's a data type for immutable ordered sequences of elements.
They are often used to store related pieces of information.
Consider this example involving latitude and longitude:
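The original latitude/longitude figure is not reproduced here; a plausible sketch (the coordinates are illustrative):

```python
# A tuple stores related, immutable values, such as a coordinate pair.
location = (30.0444, 31.2357)    # (latitude, longitude), illustrative values
latitude, longitude = location   # tuple unpacking

print(latitude)      # 30.0444
print(longitude)     # 31.2357
# location[0] = 0    # would raise TypeError: tuples are immutable
```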

Sets { }
A set is a data type for mutable, unordered collections of unique elements. One
application of a set is to quickly remove duplicates from a list.
Sets support the in operator the same as lists do. You can add elements to a set
using the add method and remove elements using the pop method, similar to lists.
However, when you pop an element from a set, an arbitrary element is removed.
Remember that sets, unlike lists, are unordered, so there is no "last element".
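A minimal sketch of these set operations (the values are illustrative):

```python
fruit = set(["apple", "banana", "apple", "cherry"])
print(len(fruit))         # 3: the duplicate "apple" is removed
print("apple" in fruit)   # True: membership test with `in`

fruit.add("mango")        # add an element
fruit.pop()               # removes an arbitrary element; sets have no "last element"
print(len(fruit))         # 3
```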

Dictionaries {“key”: value}


A dictionary is a mutable data type that stores mappings of unique keys to values.
Here's a dictionary that stores elements and their atomic numbers.
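The slide's dictionary is not reproduced in this document; a sketch with a few illustrative elements:

```python
# Mapping element names to atomic numbers (a few illustrative entries)
elements = {"hydrogen": 1, "helium": 2, "carbon": 6}

print(elements["helium"])         # 2: look up a value by key
elements["lithium"] = 3           # dictionaries are mutable: add a new mapping
print("carbon" in elements)       # True: membership tests the keys
print(elements.get("oxygen", 0))  # 0: .get returns a default for a missing key
```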

Dictionaries are mutable, but their keys must be of an immutable type, like strings, integers,
or tuples. It is not even necessary for every key in a dictionary to have the same type!
For example, the following dictionary is perfectly valid:

random_dict = {"abc": 1, 5: "hello"}


Data Structures recap: lists, sets, and dictionaries are mutable; tuples and strings are immutable. Of all of these, only dictionary keys must be immutable.

DATAFRAMES AND SERIES IN PANDAS

Pandas is a powerful, open-source Python library used for data manipulation and analysis. It consists of data structures and functions to perform efficient operations on data.

Getting Started with Pandas

Open cmd or your IDE's terminal and run:

pip install pandas

(In a Jupyter notebook, prefix the command with an exclamation mark: !pip install pandas.)

Importing Pandas
import pandas as pd

Data Structures in Pandas Library

Pandas generally provide two data structures for manipulating data. They are:

• Series
• DataFrame
Pandas Series

A Pandas Series is a one-dimensional labeled array capable of holding data of any type
(integer, string, float, Python objects, etc.). The axis labels are collectively called indexes.
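A minimal Series sketch (the labels and values are illustrative):

```python
import pandas as pd

# A one-dimensional labeled array; the labels form the index.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

print(s["b"])      # 20: access by label
print(s.iloc[0])   # 10: access by integer position
print(s.mean())    # 20.0
```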

Pandas DataFrame

A Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns); data is aligned in a tabular fashion in rows and columns. A DataFrame consists of three principal components: the data, the rows, and the columns.
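A minimal DataFrame sketch (the column names and values are illustrative):

```python
import pandas as pd

# Rows and columns with labeled axes (illustrative data)
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "age": [25, 32, 29],
})

print(df.shape)          # (3, 2): three rows, two columns
print(list(df.columns))  # ['name', 'age']
print(df["age"].max())   # 32
```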
DATA WRANGLING IN PYTHON

Data Wrangling is the process of gathering, collecting, and transforming raw data into another format
for better understanding, decision-making, accessing, and analysis in less time.
Data wrangling is also known as data munging.

Data exploration: the data is studied, analyzed, and understood by visualizing representations of it.
Dealing with missing values: large datasets often contain missing (NaN) values; these need to be handled by replacing them with the mean, mode, or most frequent value of the column, or simply by dropping the rows that contain them.
Reshaping data: the data is manipulated according to requirements; new data can be added or pre-existing data modified.
Filtering data: sometimes datasets contain unwanted rows or columns that need to be removed or filtered out.
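These wrangling steps can be sketched in pandas as follows (the toy sales data is made up):

```python
import numpy as np
import pandas as pd

# Toy sales data with one missing value
df = pd.DataFrame({
    "city":  ["Cairo", "Giza", "Cairo", "Giza"],
    "sales": [100.0, np.nan, 150.0, 90.0],
})

df["sales"] = df["sales"].fillna(df["sales"].mean())   # handle missing values
high = df[df["sales"] > 100]                           # filtering
ordered = df.sort_values("sales", ascending=False)     # sorting
per_city = df.groupby("city")["sales"].sum()           # grouping

print(per_city["Cairo"])   # 250.0
```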

DATA ANALYSIS USING PYTHON
Mohamed Essam
AGENDA

1. Working with Numerical Data in Python


2. NumPy for numerical computations
3. Mathematical functions and operations
4. Working with time series data

WORKING WITH NUMERICAL DATA IN PYTHON

Importance of Numerical Data


• Fundamental for data analysis, scientific computing, and machine learning.
Data Types:
• Integers: Whole numbers, e.g., 5, 10.
• Floats: Decimal numbers, e.g., 5.0, 10.5.
• Complex Numbers: Numbers with real and imaginary parts, e.g., 3 + 4j.
• Arrays: Collections of numbers, e.g., lists or arrays.

Features of NumPy
NumPy has various features that make it preferable to plain lists.

Why Use NumPy?

•Performance: NumPy arrays are more efficient than Python lists due to their fixed size and contiguous memory
allocation.
•Functionality: NumPy provides a wide array of mathematical functions and tools for array manipulation.
•Integration: Works well with other scientific libraries like SciPy, Matplotlib, and Pandas.
NUMPY FOR NUMERICAL COMPUTATIONS

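A minimal sketch of what these properties buy you in practice (the values are illustrative):

```python
import numpy as np

prices = np.array([10, 20, 30])   # homogeneous, contiguous array

doubled = prices * 2              # element-wise operation, no explicit loop
print(doubled.tolist())           # [20, 40, 60]
print(prices.mean())              # 20.0
print(prices.dtype)               # a single fixed dtype (platform-dependent int size)
```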

PYTHON LISTS VS NUMPY ARRAYS

What is a NumPy array?

NumPy is the fundamental package for scientific computing in Python. NumPy arrays facilitate advanced mathematical
and other types of operations on large amounts of data. Typically, such operations are executed more efficiently and
with less code than is possible using Python's built-in sequences. NumPy is not another programming language but a
Python extension module. It provides fast and efficient operations on arrays of homogeneous data.

Some important points about NumPy arrays:

• We can create an N-dimensional array in Python using numpy.array().
• The array is homogeneous by default, which means the data inside an array must all be of the same datatype. (Note: you can also create a structured array in Python.)
• Element-wise operations are possible.
• NumPy arrays have various functions, methods, and attributes to ease the task of matrix computation.
• Elements of an array are stored contiguously in memory. For example, all rows of a two-dimensional array must have the same number of columns, and a three-dimensional array must have the same number of rows and columns in each 2-D slice.
What is a Python list?
A Python list is a collection that is ordered and changeable. In Python, lists are written with square brackets.

Some important points about Python lists:

• A list can be homogeneous or heterogeneous.
• Element-wise operations are not possible on a list.
• A Python list is 1-dimensional by default. We can create an N-dimensional list by nesting, but even then it is a 1-D list storing other 1-D lists.
• Elements of a list need not be contiguous in memory.
• The examples below demonstrate how NumPy arrays compare with Python lists in memory consumption, execution time, and supported operations.
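A sketch of the memory and speed comparison the text describes (the exact numbers depend on your platform and Python build):

```python
import sys
import timeit

import numpy as np

n = 100_000
lst = list(range(n))
arr = np.arange(n)

# Memory: a list stores per-element Python objects; an array is one contiguous buffer.
list_bytes = sys.getsizeof(lst) + sum(sys.getsizeof(x) for x in lst)
array_bytes = arr.nbytes
print(array_bytes < list_bytes)   # True

# Speed: vectorized arithmetic vs. an interpreted loop (the array typically wins).
t_list = timeit.timeit(lambda: [x * 2 for x in lst], number=10)
t_array = timeit.timeit(lambda: arr * 2, number=10)
print(t_array < t_list)
```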

COMPARISON BETWEEN NUMPY ARRAYS AND PYTHON LISTS
Python Lists
1. Element Overhead: Lists in Python store additional information about each element, such as its type and
reference count. This overhead can be significant when dealing with a large number of elements.
2. Datatype: Lists can hold different data types, but this can decrease memory efficiency and slow numerical
operations.
3. Memory Fragmentation: Lists may not store elements in contiguous memory locations, causing memory
fragmentation and inefficiency.
4. Performance: Lists are not optimized for numerical computations and may have slower mathematical operations
due to Python’s interpretation overhead. They are generally used as general-purpose data structures.
5. Functionality: Lists can store any data type, but lack specialized NumPy functions for numerical operations.
Numpy Arrays
1. Homogeneous Data: NumPy arrays store elements of the same data type, making them more compact and
memory-efficient than lists.
2. Fixed Data Type: NumPy arrays have a fixed data type, reducing memory overhead by eliminating the need to
store type information for each element.
3. Contiguous Memory: NumPy arrays store elements in adjacent memory locations, reducing fragmentation and
allowing for efficient access.
4. Array Metadata: NumPy arrays have extra metadata like shape, strides, and data type. However, this overhead
is usually smaller than the per-element overhead in lists.
5. Performance: NumPy arrays are optimized for numerical computations, with efficient element-wise operations
and mathematical functions. These operations are implemented in C, resulting in faster performance than
equivalent operations on lists.

WORKING WITH TIME SERIES DATA

Definition: Time series data is a sequence of data points collected or recorded at time-ordered intervals.
Examples: Stock prices, weather data, sales figures.

Characteristics of Time Series Data


Components:
• Trend: Long-term movement in the data.
• Seasonality: Regular pattern repeating at specific intervals.
• Cyclic: Long-term cycles not fixed to a calendar.
• Irregular: Random or unpredictable components.

Time Series Analysis Objectives


• Forecasting: Predict future data points.
• Understanding: Identify underlying patterns and relationships.
• Anomaly Detection: Identify abnormal patterns or outliers.

Data Preparation
• Data Collection: Sources and methods.
• Cleaning: Handling missing values, outliers, and noise.
• Transformation: Aggregation, normalization, and differencing.
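These preparation steps can be sketched with pandas (the daily sales values are made up):

```python
import numpy as np
import pandas as pd

# Illustrative daily sales over one week, with one missing observation
idx = pd.date_range("2024-01-01", periods=7, freq="D")
sales = pd.Series([10.0, 12.0, np.nan, 11.0, 15.0, 14.0, 13.0], index=idx)

cleaned = sales.fillna(sales.mean())    # cleaning: fill the missing value
weekly = cleaned.resample("7D").sum()   # transformation: aggregation
diffed = cleaned.diff()                 # transformation: differencing

print(weekly.iloc[0])   # 87.5
```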
BASIC DATETIME OPERATIONS IN PYTHON
Python has a built-in module named datetime for dealing with dates and times in numerous ways. This section covers basic datetime operations in Python.

The datetime module provides six main classes:

• datetime.date
• datetime.time
• datetime.datetime
• datetime.tzinfo
• datetime.timedelta
• datetime.timezone
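A minimal sketch using a few of these classes (the dates are arbitrary):

```python
from datetime import date, datetime, timedelta, timezone

today = date(2024, 1, 15)
moment = datetime(2024, 1, 15, 9, 30)

next_week = today + timedelta(days=7)      # timedelta: date arithmetic
print(next_week.isoformat())               # 2024-01-22
print(moment.strftime("%Y-%m-%d %H:%M"))   # 2024-01-15 09:30

aware = datetime.now(timezone.utc)         # timezone-aware current time
print(aware.tzinfo)                        # UTC
```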

Types of Time Series Data


Time series data can be broadly classified into two sections:

1. Continuous Time Series Data: Continuous time series data involves measurements or observations that are
recorded at regular intervals, forming a seamless and uninterrupted sequence. This type of data is characterized by
a continuous range of possible values and is commonly encountered in various domains, including:

Temperature Data: Continuous recordings of temperature at consistent intervals (e.g., hourly or daily
measurements).
Stock Market Data: Continuous tracking of stock prices or values throughout trading hours.
Sensor Data: Continuous measurements from sensors capturing variables like pressure, humidity, or air quality.

2. Discrete Time Series Data: Discrete time series data, on the other hand, consists of measurements or
observations that are limited to specific values or categories. Unlike continuous data, discrete data does not have a
continuous range of possible values but instead comprises distinct and separate data points. Common examples
include:

Count Data: Tracking the number of occurrences or events within a specific time period.
Categorical Data: Classifying data into distinct categories or classes (e.g., customer segments, product types).
Binary Data: Recording data with only two possible outcomes or states.
WHAT IS A TREND IN TIME SERIES?
A trend is a pattern in data that shows the movement of a series to relatively higher or lower values over a long
period of time. In other words, a trend is observed when there is an increasing or decreasing slope in the time
series. A trend usually lasts for some time and then disappears; it does not repeat. For example, a new song
trends for a while and then disappears, with hardly any chance of trending again.

A trend can be:

• Uptrend: the time series shows a general upward pattern.
• Downtrend: the time series shows a general downward pattern.
• Horizontal or stationary trend: no consistent upward or downward pattern is observed.

DATA ANALYSIS USING PYTHON
Mohamed Essam
AGENDA

1. Exploratory Data Analysis (EDA) with pandas


2. Descriptive statistics
3. Data visualization (histograms, scatter plots, boxplots)
4. Understanding data distribution

WHAT IS EXPLORATORY DATA ANALYSIS?

Exploratory data analysis is one of the basic and essential steps of a data science project. A data scientist typically spends a large share of project time (often cited as around 70%) on EDA.

Key aspects of EDA include:

• Distribution of Data: Examining the distribution of data points to understand their range, central
tendencies (mean, median), and dispersion (variance, standard deviation).
• Graphical Representations: Utilizing charts such as histograms, box plots, scatter plots, and bar charts to
visualize relationships within the data and distributions of variables.
• Outlier Detection: Identifying unusual values that deviate from other data points. Outliers can influence
statistical analyses and might indicate data entry errors or unique cases.
• Correlation Analysis: Checking the relationships between variables to understand how they might affect
each other. This includes computing correlation coefficients and creating correlation matrices.
• Handling Missing Values: Detecting and deciding how to address missing data points, whether by
imputation or removal, depending on their impact and the amount of missing data.
• Summary Statistics: Calculating key statistics that provide insight into data trends and nuances.
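Several of these aspects can be sketched with pandas (the toy height/weight data is made up):

```python
import numpy as np
import pandas as pd

# Toy dataset with one missing value
df = pd.DataFrame({
    "height": [1.62, 1.75, 1.80, 1.68, np.nan],
    "weight": [58, 72, 80, 63, 70],
})

print(df.describe())                     # summary statistics per column
print(df.isna().sum().sum())             # 1: count of missing values
print(df["height"].corr(df["weight"]))   # correlation between the two variables
```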
TYPES OF EXPLORATORY DATA ANALYSIS

1. Univariate Analysis
Univariate analysis focuses on a single variable to understand its internal structure. It is primarily concerned with
describing the data and finding patterns in a single feature. It involves summarizing and visualizing one variable at
a time to understand its distribution, central tendency, spread, and other relevant statistics. Common
techniques include:

• Histograms: Used to visualize the distribution of a variable.


• Box plots: Useful for detecting outliers and understanding the spread and skewness of the data.
• Bar charts: Employed for categorical data to show the frequency of each category.
• Summary statistics: Calculations like mean, median, mode, variance, and standard deviation that describe
the central tendency and dispersion of the data.

2. Bivariate Analysis
Bivariate analysis explores the relationship between two variables. It helps find associations, correlations,
and dependencies between pairs of variables. Some key techniques used in bivariate analysis:

• Scatter Plots: These are one of the most common tools used in bivariate analysis. A scatter plot helps
visualize the relationship between two continuous variables.
• Correlation Coefficient: This statistical measure (often Pearson’s correlation coefficient for linear
relationships) quantifies the degree to which two variables are related.
• Cross-tabulation: Also known as contingency tables, cross-tabulation is used to analyze the relationship
between two categorical variables. It shows the frequency distribution of categories of one variable in rows
and the other in columns, which helps in understanding the relationship between the two variables.
• Line Graphs: In the context of time series data, line graphs can be used to compare two variables over time.
This helps in identifying trends, cycles, or patterns that emerge in the interaction of the variables over the
specified period.
• Covariance: Covariance is a measure used to determine how much two random variables change together.
However, it is sensitive to the scale of the variables, so it’s often supplemented by the correlation coefficient
for a more standardized assessment of the relationship.
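Cross-tabulation, for example, can be sketched with pandas.crosstab (the segment/churn data is made up):

```python
import pandas as pd

# Toy categorical data: customer segment vs. churn outcome
df = pd.DataFrame({
    "segment": ["A", "A", "B", "B", "A"],
    "churned": ["yes", "no", "yes", "yes", "no"],
})

table = pd.crosstab(df["segment"], df["churned"])   # contingency table
print(table)
print(table.loc["B", "yes"])   # 2: both "B" customers churned
```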
3. Multivariate Analysis
Multivariate analysis examines the relationships among more than two variables in the dataset. It aims to
understand how variables interact with one another, which is crucial for most statistical modeling techniques.
Techniques include:

• Pair plots: Visualize relationships across several variables simultaneously to capture a comprehensive view of
potential interactions.
• Principal Component Analysis (PCA): A dimensionality reduction technique used to reduce the
dimensionality of large datasets, while preserving as much variance as possible.
DESCRIPTIVE STATISTICS

Measures of Central Tendency: focus on the average or middle values of a data set, whereas measures of variability
focus on the dispersion of the data. Both use graphs, tables, and general discussion to help people understand the
meaning of the analyzed data. Examples: mean, median, mode.

Measures of spread: describe how dispersed the distribution is for a set of data. Examples: range, variance,
standard deviation.

Distribution: refers to how often each data point occurs; it is a graphical representation of data collected
from a sample or population.

MEAN, MEDIAN AND MODE

Mean, median, and mode are different measures of centre in a numerical data set. They each try to
summarize a dataset with a single number to represent a "typical" data point from the dataset.
Mean: The "average" number; found by adding all data points and dividing by the number of data points.
“Affected By Outliers”
Example: The mean of 4, 1, and 7 is (4+1+7)/3 = 12/3 = 4.
Median: The middle number; found by ordering all data points and picking out the one in the middle (or if
there are two middle numbers, taking the mean of those two numbers).
“More reliable to substitute missing data points”
Example: The median of 4, 1, and 7 is 4 because when the numbers are put in order the number 4 is in the
middle.
Mode: The most frequent number—that is, the number that occurs the highest number of times.
Example: The mode of { 4, 2, 4, 3, 2, 2} is 2 because it occurs three times, which is more than any other
number.
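These three measures can be reproduced with Python's standard statistics module, using the worked examples above:

```python
import statistics

print(statistics.mean([4, 1, 7]))           # 4: (4 + 1 + 7) / 3
print(statistics.median([4, 1, 7]))         # 4: the middle value once sorted
print(statistics.mode([4, 2, 4, 3, 2, 2]))  # 2: occurs three times
```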
Asymmetry
A distribution is asymmetric (skewed) when its values are not spread evenly around the center, so one tail is longer than the other.

Variability
Variability describes how far apart data points lie from each other and from the center of a distribution.

Variance
Variance is a measure of how far a set of data is dispersed from its mean (average) value. It is denoted σ².

Properties of Variance
It is always non-negative, since each term in the variance sum is squared, so the result is either positive or zero.
Variance always has squared units. For example, the variance of a set of weights measured in kilograms is given in
kg². Since the variance is squared, we cannot compare it directly with the mean or with the data themselves.
Standard Deviation
Standard deviation measures the deviation of data from its mean (average) position. The degree of dispersion is
computed from the deviations of the individual data points. It is denoted by the symbol σ.

Properties of Standard Deviation

It is the square root of the mean of the squared deviations of all values in a data set, and is also called the
root-mean-square deviation.
The smallest possible value of the standard deviation is 0, since it cannot be negative. When the data values of a
group are similar, the standard deviation is very low or close to zero; when the values vary widely, the standard
deviation is high.
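A minimal sketch with the standard statistics module (the weights are illustrative); note that pvariance/pstdev compute the population versions, while variance/stdev compute the sample versions:

```python
import statistics

weights_kg = [60, 62, 65, 70, 73]   # illustrative weights, mean = 66

var = statistics.pvariance(weights_kg)   # population variance, in kg²
sd = statistics.pstdev(weights_kg)       # population standard deviation, in kg

print(var)           # 23.6
print(round(sd, 3))  # 4.858 (the square root of the variance)
```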

Range and Interquartile Range

The interquartile range (IQR) is a measure of where the "middle fifty" of a data set lies.
Where the range measures where the beginning and end of a data set are, the
interquartile range measures where the bulk of the values lie. That is why it is
preferred over many other measures of spread when reporting things like
school performance or SAT scores.

The interquartile range formula is the first quartile subtracted from the third quartile:

IQR = Q3 – Q1.
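The formula can be sketched with statistics.quantiles. Note that different quartile conventions yield slightly different cut points; Python's default is the "exclusive" method:

```python
import statistics

data = [3, 5, 7, 8, 9, 11, 15, 16, 20, 21]

# n=4 returns the three quartile cut points Q1, Q2, Q3
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(q1, q3)   # 6.5 17.0
print(iqr)      # 10.5
```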

1st Quartile and 3rd Quartile
Quartiles segment any distribution that is ordered from low to high into four equal parts. The interquartile
range (IQR) contains the middle half of your data set: the values between the first and third quartiles.

Quartile Calculations (see figure)
Box Plot
Box plots divide the data into sections that each contain approximately 25% of the data in the set.

Box plots are useful as they provide a visual summary of the data, enabling researchers to quickly identify
median values, the dispersion of the data set, and signs of outliers.

When reviewing a box plot, an outlier is defined as a data point that is located outside the whiskers of the
box plot.

Normal Distribution
The normal distribution, also known as the Gaussian distribution or "bell curve", is the most common frequency
distribution. This distribution is symmetrical, with most values falling towards the centre and long tails to the
left and right. It is a continuous distribution, with no gaps between values.

Later we can use this distribution to identify outliers.
Skewed Distribution
One important fact about skewed distributions is that, unlike a bell curve, the mode, median, and mean are not the
same value. The long tail pulls the mean and median in the direction of the tail. There is a very easy way to read
the different average values from a histogram diagram. If you rely on average values to make quick predictions,
pay attention to which average you use!

How to plot different types of data

Numerical data: dot plots, stem-and-leaf graphs, histograms, box plots, ogive graphs, and scatter plots.

Categorical data: frequency tables, pie charts, and bar charts.

Many plots are used to visualize a categorical variable together with a numerical one, such as scatter plots,
box plots, and line plots.

(Figures: example plots for numerical data; example plots for categorical data.)
