Data Analysis Using Python Day_1 to Day_4
DATA ANALYSIS USING PYTHON
Mohamed Essam
AGENDA
• Business Intelligence
• Performance Evaluation
• Problem Solving
• Risk Management
• Optimizing Processes
• Informed Decision-Making
Descriptive Analysis: Descriptive analysis looks at data and analyzes past events for insight into how to approach future events. It looks at past performance and understands it by mining historical data to find the causes of success or failure in the past.
Diagnostic Analysis: Diagnostic analysis works hand in hand with descriptive analysis. Where descriptive analysis finds out what happened in the past, diagnostic analysis finds out why it happened, what measures were taken at the time, or how frequently it has happened.
Predictive Analysis: Using the information gained from descriptive and diagnostic analysis, we can predict future data. Predictive analysis finds out what is likely to happen in the future. This does not mean we have become fortune-tellers; by looking at past trends and behavioral patterns, we forecast what might happen.
Prescriptive Analysis: This is an advanced form of predictive analysis. Once you predict something, you usually have many possible courses of action, and it can be unclear which option will actually work; prescriptive analysis recommends the best one.
Statistical Analysis: Statistical analysis is an approach or technique for analyzing data sets in order to summarize their important and main characteristics, generally using visual aids. This approach can be used to gather knowledge about many aspects of the data.
Top Data Analysis Tools
• SAS
• Microsoft Excel
• R
• Python
• Tableau Public
• RapidMiner
• KNIME
Applications of Data Analysis
• Business Intelligence
• Predictive Maintenance in Healthcare
• Manufacturing Optimization
• Fraud Detection and Security
• Financial Forecasting
• Marketing and Customer Insights
The data analysis workflow
• Define Objectives and Questions
• Data Collection
• Exploratory Data Analysis (EDA)
• Interpretation and Communication
• What is Python?
• Python is a high-level, general-purpose, and very popular programming language. Python (the latest version is Python 3) is used in web development, machine learning applications, and other cutting-edge areas of the software industry. Python is used by almost all tech giants such as Google, Amazon, Facebook, Instagram, Dropbox, Uber, etc.
AGENDA
• Types of Data Structures: Lists, Tuples, Sets, Dictionaries, Compound Data Structures
• Operators: Membership, Identity
• Built-In Functions and Methods
• Data structures are containers or collections of data that organize and group data types together in different ways. You can think of data structures as file folders that have organized files of data inside them.
You saw here that you can create a list with square brackets.
Lists can contain any mix and match of the data types you have seen so far.
You saw that we can pull more than one value from a list at a time by using slicing.
When using slicing, it is important to remember that the lower index is inclusive
and the upper index is exclusive, as the sketch below illustrates.
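A minimal sketch, using a hypothetical list of months (the values are assumptions, not from the slides):

# a list created with square brackets; it can mix data types
months = ["January", "February", "March", "April", "May", "June"]
mixed = ["apple", 3, 4.5, True]

# slicing: the lower index (1) is inclusive, the upper index (4) is exclusive
print(months[1:4])           # ['February', 'March', 'April']
print(mixed[0], mixed[-1])   # apple True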
Mutability refers to whether or not we can change an object once it has been created. If an
object can be changed, it is called mutable. However, if an object cannot be changed after
it has been created, then the object is considered immutable.
sentence2 = ["The", "weather", "is", "nice", "today", ".", "?"]   # assumed setup (not shown on the slide): a list of words, which is mutable
sentence1 = "The weather is nice today."                          # assumed setup: a string, which is immutable
sentence2[6] = "!"   # works: list elements can be reassigned
sentence1[30] = "!"  # raises TypeError: strings do not support item assignment
Tuples! ( )
A tuple is another useful container. It's a data type for immutable ordered sequences of elements.
They are often used to store related pieces of information.
Consider this example involving latitude and longitude:
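A minimal sketch; the coordinates are made-up values, since the slide's original example is not shown here:

# a tuple holding related pieces of information (made-up coordinates)
location = (13.4125, 103.8667)   # (latitude, longitude)
print("Latitude:", location[0], "Longitude:", location[1])

# tuples support unpacking into separate variables
latitude, longitude = location

# tuples are immutable, so location[0] = 0.0 would raise a TypeError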
Sets { }
A set is a data type for mutable unordered collections of unique elements. One
application of a set is to quickly remove duplicates from a list.
Sets support the in operator the same as lists do. You can add elements to sets
using the add method and remove elements using the pop method, similar to lists.
However, when you pop an element from a set, an arbitrary element is removed.
Remember that sets, unlike lists, are unordered, so there is no "last element".
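A short sketch with assumed values, showing deduplication and the in, add, and pop operations:

numbers = [1, 2, 6, 3, 1, 1, 6]      # assumed example list with duplicates
unique_numbers = set(numbers)         # {1, 2, 3, 6}: duplicates removed
print(2 in unique_numbers)            # True

unique_numbers.add(8)                 # add an element
removed = unique_numbers.pop()        # removes an arbitrary element
print(removed, unique_numbers)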
Dictionaries are mutable, but their keys must be an immutable type, like strings, integers,
or tuples. It's not even necessary for every key in a dictionary to have the same type!
For example, a dictionary like the one sketched below is perfectly valid:
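The key and value choices here are assumptions for illustration, not the slide's original example:

# one dictionary with keys of different immutable types
random_dict = {
    "abc": 1,              # string key
    5: "hello",            # integer key
    (35.0, 139.7): "xyz",  # tuple key
}
print(random_dict[5])              # hello
print(random_dict[(35.0, 139.7)])  # xyz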
Pandas is a powerful, open-source Python library. The Pandas library is used for
data manipulation and analysis. Pandas consists of data structures and functions
for performing efficient operations on data.
Importing Pandas
import pandas as pd
Pandas generally provides two data structures for manipulating data. They are:
• Series
• DataFrame
Pandas Series
A Pandas Series is a one-dimensional labeled array capable of holding data of any type
(integer, string, float, Python objects, etc.). The axis labels are collectively called indexes.
Pandas DataFrame
A Pandas DataFrame is a two-dimensional labeled data structure whose columns can hold different data types, similar to a table or spreadsheet.
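A minimal sketch, using made-up values, that builds a Series and a DataFrame:

import pandas as pd

# a one-dimensional labeled array (made-up data)
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s["b"])   # 20

# a two-dimensional labeled table (made-up data)
df = pd.DataFrame({
    "name": ["Ali", "Mona", "Omar"],
    "age": [25, 31, 19],
})
print(df.head())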
Data Wrangling is the process of gathering, collecting, and transforming Raw data into another format
for better understanding, decision-making, accessing, and analysis in less time.
Data Wrangling is also known as Data Munging.
Data exploration: In this process, the data is studied, analyzed, and understood by
visualizing representations of data.
Dealing with missing values: Most large datasets contain missing (NaN) values. They need to be handled by replacing them with the mean, the mode, or the most frequent value of the column, or simply by dropping the rows that contain a NaN value.
Reshaping data: In this process, data is manipulated according to the requirements, where new
data can be added or pre-existing data can be modified.
Filtering data: Sometimes datasets contain unwanted rows or columns that need to be removed or filtered out. A short pandas sketch of these wrangling steps follows.
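The DataFrame and column names below are hypothetical, used only to illustrate the steps:

import numpy as np
import pandas as pd

# hypothetical raw data with a missing value
df = pd.DataFrame({
    "city": ["Cairo", "Giza", "Luxor", "Aswan"],
    "sales": [250, np.nan, 180, 90],
})

df["sales"] = df["sales"].fillna(df["sales"].mean())  # replace NaN with the column mean
df = df[df["sales"] > 100]                            # filter out unwanted rows
df = df.drop(columns=["city"])                        # drop an unwanted column
print(df)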
DATA ANALYSIS USING PYTHON
Mohamed Essam
AGENDA
NUMPY FOR NUMERICAL COMPUTATIONS
Features of NumPy
NumPy has various features that make it more popular than plain Python lists (a short sketch appears after this list).
•Performance: NumPy arrays are more efficient than Python lists due to their fixed size and contiguous memory
allocation.
•Functionality: NumPy provides a wide array of mathematical functions and tools for array manipulation.
•Integration: Works well with other scientific libraries like SciPy, Matplotlib, and Pandas.
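A minimal sketch with assumed values, showing array creation, vectorized math, and built-in functions:

import numpy as np

prices = np.array([10.0, 20.0, 30.0, 40.0])   # assumed example data

# vectorized operations apply to every element at once
print(prices * 0.9)                            # [ 9. 18. 27. 36.]

# built-in mathematical functions
print(prices.mean(), prices.std(), prices.max())

# multi-dimensional arrays and reshaping
matrix = np.arange(6).reshape(2, 3)
print(matrix)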
Definition: Time series data is a sequence of data points collected or recorded at time-ordered intervals.
Examples: Stock prices, weather data, sales figures.
Data Preparation
• Data Collection: Sources and methods.
• Cleaning: Handling missing values, outliers, and noise.
• Transformation: Aggregation, normalization, and differencing.
BASIC DATETIME OPERATIONS IN PYTHON
Python has a built-in module named datetime for dealing with dates and times in numerous ways. Here we look at basic datetime operations in Python.
The datetime module provides six main classes, listed below (a short sketch of basic operations follows the list):
• datetime.date
• datetime.time
• datetime.datetime
• datetime.tzinfo
• datetime.timedelta
• datetime.timezone
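The specific dates below are made up for illustration:

from datetime import date, datetime, timedelta, timezone

today = date.today()                       # current local date
now = datetime.now(timezone.utc)           # current date and time in UTC

# build a specific datetime and format it as a string
meeting = datetime(2024, 5, 17, 14, 30)
print(meeting.strftime("%Y-%m-%d %H:%M"))  # 2024-05-17 14:30

# date arithmetic with timedelta
next_week = today + timedelta(days=7)
print(next_week - today)                   # 7 days, 0:00:00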
1. Continuous Time Series Data: Continuous time series data involves measurements or observations that are
recorded at regular intervals, forming a seamless and uninterrupted sequence. This type of data is characterized by
a continuous range of possible values and is commonly encountered in various domains, including:
Temperature Data: Continuous recordings of temperature at consistent intervals (e.g., hourly or daily
measurements).
Stock Market Data: Continuous tracking of stock prices or values throughout trading hours.
Sensor Data: Continuous measurements from sensors capturing variables like pressure, humidity, or air quality.
2. Discrete Time Series Data: Discrete time series data, on the other hand, consists of measurements or
observations that are limited to specific values or categories. Unlike continuous data, discrete data does not have a
continuous range of possible values but instead comprises distinct and separate data points. Common examples
include:
Count Data: Tracking the number of occurrences or events within a specific time period.
Categorical Data: Classifying data into distinct categories or classes (e.g., customer segments, product types).
Binary Data: Recording data with only two possible outcomes or states.
WHAT IS A TREND IN TIME SERIES?
A trend is a pattern in data that shows the movement of a series to relatively higher or lower values over a long
period of time. In other words, a trend is observed when there is an increasing or decreasing slope in the time
series. A trend usually lasts for some time and then disappears; it does not repeat. For example, a new song comes
out, trends for a while, and then disappears, with hardly any chance of trending again.
A trend could be:
• Uptrend: if the time series shows a general upward pattern, it is an uptrend.
• Downtrend: if the time series shows a general downward pattern, it is a downtrend.
• Horizontal or stationary trend: if no clear upward or downward pattern is observed, the trend is called horizontal or stationary.
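A minimal sketch, on synthetic data, that exposes a trend with a rolling mean in pandas:

import numpy as np
import pandas as pd

# synthetic daily series with an upward trend plus noise (made-up data)
dates = pd.date_range("2023-01-01", periods=120, freq="D")
values = np.arange(120) * 0.5 + np.random.normal(0, 5, 120)
series = pd.Series(values, index=dates)

# a 30-day rolling mean smooths the noise and makes the trend visible
trend = series.rolling(window=30).mean()
print(trend.dropna().head())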
DATA ANALYSIS USING PYTHON
Mohamed Essam
AGENDA
Exploratory data analysis (EDA) is one of the basic and essential steps of a data science project. A data scientist spends almost 70% of their time on the EDA of the dataset.
• Distribution of Data: Examining the distribution of data points to understand their range, central
tendencies (mean, median), and dispersion (variance, standard deviation).
• Graphical Representations: Utilizing charts such as histograms, box plots, scatter plots, and bar charts to
visualize relationships within the data and distributions of variables.
• Outlier Detection: Identifying unusual values that deviate from other data points. Outliers can influence
statistical analyses and might indicate data entry errors or unique cases.
• Correlation Analysis: Checking the relationships between variables to understand how they might affect
each other. This includes computing correlation coefficients and creating correlation matrices.
• Handling Missing Values: Detecting and deciding how to address missing data points, whether by
imputation or removal, depending on their impact and the amount of missing data.
• Summary Statistics: Calculating key statistics that provide insight into data trends and nuances.
TYPES OF EXPLORATORY DATA ANALYSIS
1. Univariate Analysis
Univariate analysis focuses on a single variable to understand its internal structure. It is primarily concerned with
describing the data and finding patterns that exist in a single feature. This type of analysis looks at individual
variables in the data set, summarizing and visualizing one variable at a time to understand its distribution, central
tendency, spread, and other relevant statistics. Common techniques include histograms, box plots, and summary
statistics, as sketched below.
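A short sketch of univariate EDA on a hypothetical numeric column named "age":

import pandas as pd
import matplotlib.pyplot as plt

# hypothetical data; in practice this would come from a file, e.g. pd.read_csv(...)
df = pd.DataFrame({"age": [22, 25, 25, 31, 34, 38, 41, 45, 52, 67]})

print(df["age"].describe())                     # count, mean, std, min, quartiles, max
print(df["age"].median(), df["age"].mode()[0])  # central tendency

df["age"].hist(bins=5)                          # distribution of a single variable
plt.xlabel("age")
plt.show()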
2. Bivariate Analysis
Bivariate analysis involves exploring the relationship between two variables. It helps find associations, correlations,
and dependencies between pairs of variables, and it is a crucial form of exploratory data analysis. Some key
techniques used in bivariate analysis are listed below (a short sketch follows the list):
• Scatter Plots: These are one of the most common tools used in bivariate analysis. A scatter plot helps
visualize the relationship between two continuous variables.
• Correlation Coefficient: This statistical measure (often Pearson’s correlation coefficient for linear
relationships) quantifies the degree to which two variables are related.
• Cross-tabulation: Also known as contingency tables, cross-tabulation is used to analyze the relationship
between two categorical variables. It shows the frequency distribution of categories of one variable in rows
and the other in columns, which helps in understanding the relationship between the two variables.
• Line Graphs: In the context of time series data, line graphs can be used to compare two variables over time.
This helps in identifying trends, cycles, or patterns that emerge in the interaction of the variables over the
specified period.
• Covariance: Covariance is a measure used to determine how much two random variables change together.
However, it is sensitive to the scale of the variables, so it’s often supplemented by the correlation coefficient
for a more standardized assessment of the relationship.
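A minimal sketch, with made-up data, showing a scatter plot, the Pearson correlation coefficient, and the covariance of two numeric variables:

import pandas as pd
import matplotlib.pyplot as plt

# made-up data for two continuous variables
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6, 7, 8],
    "exam_score":    [52, 55, 61, 64, 70, 74, 79, 85],
})

df.plot.scatter(x="hours_studied", y="exam_score")
plt.show()

print(df["hours_studied"].corr(df["exam_score"]))  # Pearson correlation coefficient
print(df["hours_studied"].cov(df["exam_score"]))   # covariance (scale-dependent)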
3. Multivariate Analysis
Multivariate analysis examines the relationships between two or more variables in the dataset. It aims to
understand how variables interact with one another, which is crucial for most statistical modeling techniques.
Techniques include:
• Pair plots: Visualize relationships across several variables simultaneously to capture a comprehensive view of potential interactions.
• Principal Component Analysis (PCA): A dimensionality reduction technique used to reduce the dimensionality of large datasets while preserving as much variance as possible (see the sketch below).
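A short sketch of PCA on made-up numeric data, assuming scikit-learn is available:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# made-up data: 100 samples with 5 numeric features, one correlated with another
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 3] = 2 * X[:, 0] + rng.normal(scale=0.1, size=100)

# standardize, then keep the first two principal components
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # share of variance kept by each component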
DESCRIPTIVE STATISTICS
Measures of Central Tendency: Measures of central tendency focus on the average or middle values of data sets, whereas measures of variability focus on the dispersion of the data. Both kinds of measures use graphs, tables, and general discussion to help people understand the meaning of the analyzed data. Examples: mean, median, mode.
Measures of spread: These aid in analyzing how dispersed the distribution is for a set of data. Examples: range, variance, standard deviation.
Distribution: Refers to the number of times each data point occurs; it is a graphical representation of data collected from a sample or population.
MEAN, MEDIAN AND MODE
Mean, median, and mode are different measures of centre in a numerical data set. They each try to
summarize a dataset with a single number to represent a "typical" data point from the dataset.
Mean: The "average" number; found by adding all data points and dividing by the number of data points.
“Affected By Outliers”
Example: The mean of 4, 1, and 7 is (4+1+7)/3 = 12/3 = 4.
Median: The middle number; found by ordering all data points and picking out the one in the middle (or if
there are two middle numbers, taking the mean of those two numbers).
“More reliable to substitute missing data points”
Example: The median of 4, 1, and 7 is 4 because when the numbers are put in order the number 4 is in the
middle.
Mode: The most frequent number—that is, the number that occurs the highest number of times.
Example: The mode of { 4, 2, 4, 3, 2, 2} is 2 because it occurs three times, which is more than any other
number.
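A quick sketch computing all three with Python's statistics module, reusing the example numbers above:

import statistics

print(statistics.mean([4, 1, 7]))           # 4
print(statistics.median([4, 1, 7]))         # 4
print(statistics.mode([4, 2, 4, 3, 2, 2]))  # 2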
Asymmetry
A distribution is asymmetric when its values are not distributed evenly (mirrored) around its center; skewness is the measure of this asymmetry.
Variability
It describes how far apart data points lie from each other and from the center of a distribution.
Variance
Variance is a measure of how far a set of data is dispersed from its mean or average value. It is denoted by 'σ²'.
Properties of Variance
It is always non-negative since each term in the variance sum is squared and
therefore the result is either positive or zero.
Variance always has squared units. For example, the variance of a set of weights
estimated in kilograms will be given in kg squared. Since the population variance is
squared, we cannot compare it directly with the mean or the data themselves.
Standard Deviation
Standard deviation measures the deviation of data from its mean or average position. The degree of dispersion is computed by estimating the deviation of each data point from the mean. It is denoted by the symbol 'σ'.
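A minimal sketch computing variance and standard deviation with NumPy on made-up data:

import numpy as np

weights_kg = np.array([60, 72, 68, 80, 75])  # made-up data

print(weights_kg.var())        # population variance (σ²), in squared kilograms
print(weights_kg.std())        # population standard deviation (σ), in kilograms
print(weights_kg.var(ddof=1))  # sample variance, if the data is only a sample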
1st quartile and 3rd quartile
Quartiles segment any distribution that's ordered from low to high into four equal parts. The interquartile
range (IQR) contains the second and third quartiles, or the middle half of your data set. The interquartile
range formula is the first quartile subtracted from the third quartile: IQR = Q3 − Q1.
Quartile Calculations
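A short sketch, on made-up data, computing the quartiles and the IQR with NumPy:

import numpy as np

data = np.array([3, 5, 7, 8, 12, 13, 14, 18, 21])  # made-up data

q1 = np.percentile(data, 25)  # first quartile
q2 = np.percentile(data, 50)  # second quartile (the median)
q3 = np.percentile(data, 75)  # third quartile

print(q1, q2, q3, q3 - q1)    # the last value is the IQR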
Box Plot
Box plots divide the data into sections that each contain approximately 25% of the data in that set.
Box plots are useful as they provide a visual summary of the data, enabling researchers to quickly identify
the median, the dispersion of the data set, and signs of outliers.
When reviewing a box plot, an outlier is defined as a data point that is located outside the whiskers of the
box plot.
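A minimal sketch drawing a box plot with matplotlib on made-up data; the point beyond the whiskers shows up as an outlier:

import matplotlib.pyplot as plt

data = [7, 8, 9, 10, 10, 11, 12, 13, 14, 30]  # made-up data; 30 is an outlier

plt.boxplot(data)  # box = middle 50% of the data, whiskers = typical range
plt.title("Box plot with one outlier")
plt.show()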
Skewed Distribution
One important fact about skewed distributions is that, unlike a bell curve, the mode, median, and mean are not
the same value. The long tail skews the mean and median in the direction of the tail. There is a very easy way to
calculate the different average values using a histogram diagram. If you rely on average values to make quick
predictions, pay attention to which average you use!
There are many plots used to visualize a categorical variable together with a numerical one, such as the scatter plot, box plot, and line plot.
Plots for numerical data