Lesson 04 Data Analytics Overview
Data, by itself, is just an information source. Unless you understand it, you will not be able to use it effectively.
Why Data Analytics?
When the transaction details are presented as a line chart, the deposit and withdrawal patterns become apparent. The chart helps you view and analyze general trends and discrepancies.
[Figure: line chart of transaction details, with the overall pattern and a discrepancy marked]
Introduction to Data Analytics
Question or Business Problem
• Ability to ask appropriate questions and know the business
• Domain knowledge
• Passion for data
• Analytical approach
Data Acquisition
• Beautiful Soup for web scraping
• CSV or other file knowledge
• NumPy
• Pandas
• Database
Data Wrangling
• CSV or other file knowledge
• NumPy
• Pandas
• Database
• SciPy
Data Exploration
• NumPy
• SciPy
• Pandas
• Matplotlib
Conclusion or Predictions
• Scikit-Learn, the main machine learning library
• CSV or other file knowledge
• NumPy
• Pandas
• Database
• SciPy
Communication or Data Visualization
• Pandas
• Database
• Matplotlib
• PPT
• CSV or other file knowledge
Business Problem
The process of analytics begins with questions or business problems of the stakeholders.
[Figure labels: Sales, Inventory]
Collect data from various sources for analysis to answer the question raised in step 1.
Twitter, Facebook, LinkedIn, and other social media and information sites provide streaming APIs.
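As a minimal sketch of collecting data from more than one source, the example below reads records from CSV text and from a database table, then combines them. It uses only the Python standard library so that it runs anywhere; the CSV content, the `transactions` table, and its columns are made up for illustration, and an in-memory SQLite database stands in for a real RDBMS. In a Pandas-based workflow, `pandas.read_csv` and `pandas.read_sql` would typically play the same role.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content; in practice this would come from a file on disk.
csv_text = "date,amount\n2024-01-05,250.0\n2024-01-09,-80.0\n"
csv_rows = [
    {"date": row["date"], "amount": float(row["amount"])}
    for row in csv.DictReader(io.StringIO(csv_text))
]

# In-memory SQLite database standing in for an RDBMS (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [("2024-01-12", 40.0), ("2024-01-20", -15.5)],
)
db_rows = [
    {"date": date, "amount": amount}
    for date, amount in conn.execute(
        "SELECT date, amount FROM transactions ORDER BY date"
    )
]
conn.close()

# One combined list of records, ready for the wrangling phase.
records = csv_rows + db_rows
print(len(records), "records acquired")
```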
Data wrangling is the most important phase of the data analytics process.
This phase includes data cleansing, data manipulation, data aggregation, data split, and reshaping of data.
Data wrangling is the most challenging phase and takes up to 70% of a data scientist's time.
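The wrangling steps named above (cleansing, manipulation, aggregation, and reshaping) can be sketched with Pandas as follows. The table, its column names, and the decision to treat a missing sales figure as 0 are assumptions made purely for illustration; real data needs its own cleaning rules.

```python
import pandas as pd

# Hypothetical raw data with typical problems: stray whitespace,
# inconsistent case, numbers stored as text, and missing values.
raw = pd.DataFrame({
    "region": ["North", "north ", "South", "South", None],
    "month": ["Jan", "Feb", "Jan", "Feb", "Jan"],
    "sales": ["100", "120", "90", None, "50"],
})

# Cleansing: drop rows with no region, normalise text, convert types.
clean = raw.dropna(subset=["region"]).copy()
clean["region"] = clean["region"].str.strip().str.title()
clean["sales"] = pd.to_numeric(clean["sales"])

# Manipulation: treat a missing sales figure as 0 (an assumption).
clean["sales"] = clean["sales"].fillna(0)

# Aggregation: total sales per region.
totals = clean.groupby("region")["sales"].sum()

# Reshaping: one row per region, one column per month.
wide = clean.pivot(index="region", columns="month", values="sales")

print(totals)
print(wide)
```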
Data Exploration: Model Selection
Model selection is based on the overall data analysis process to draw conclusions and make accurate predictions.
Model selection
• Is based on the overall data analysis process
• Should be accurate to avoid iterations
• Depends on pattern identification and algorithms
• Depends on hypothesis building and testing
• Leads to building mathematical and statistical functions
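As a minimal sketch of choosing between candidate models, the example below fits two hypothetical candidates (a constant and a straight line fitted by least squares) to made-up training data and keeps the one with the lower mean squared error on held-out points. The data, the function names, and the choice of error measure are assumptions for illustration; the lesson does not prescribe a specific selection procedure.

```python
from statistics import fmean

def fit_constant(xs, ys):
    """Candidate 1: always predict the mean of the training targets."""
    average = fmean(ys)
    return lambda x: average

def fit_line(xs, ys):
    """Candidate 2: straight line fitted by ordinary least squares."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def mean_squared_error(model, xs, ys):
    return fmean([(model(x) - y) ** 2 for x, y in zip(xs, ys)])

# Made-up data that roughly follows y = 2x + 1.
train_x = [1, 2, 3, 4, 5, 6]
train_y = [3.1, 4.9, 7.2, 9.0, 10.8, 13.1]
test_x = [7, 8]
test_y = [15.0, 17.1]

candidates = {"constant": fit_constant, "line": fit_line}
scores = {
    name: mean_squared_error(fit(train_x, train_y), test_x, test_y)
    for name, fit in candidates.items()
}
best = min(scores, key=scores.get)
print(scores, "-> selected:", best)
```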
Exploratory Data Analysis (EDA)
• The EDA approach studies the data to recommend suitable models that best fit the data.
• The focus is on the data and its structure, outliers, and models suggested by the data.
• EDA techniques make minimal or no assumptions. They present all the underlying data without any data loss.
• Quantitative: Provides numeric outputs for the input data.
• Graphical: Uses statistical functions for graphical output.
EDA: Quantitative Technique
The EDA quantitative technique has two goals: measuring the central tendency and measuring the spread of the data.
Measurement of Spread
Variance: approximately the mean of the squares of the deviations from the mean.
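Both goals can be illustrated with Python's standard `statistics` module. The values below are made up. The sketch also shows why the variance is only "approximately" the mean of the squared deviations: the sample variance divides by n - 1 rather than n, while the population variance is exactly that mean.

```python
from statistics import mean, median, pvariance, variance

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up data

# Central tendency
center_mean = mean(values)
center_median = median(values)

# Spread
squared_deviations = [(v - center_mean) ** 2 for v in values]
population_variance = pvariance(values)   # sum of squares / n
sample_variance = variance(values)        # sum of squares / (n - 1)

print("mean:", center_mean, "median:", center_median)
print("population variance:", population_variance)
print("sample variance:", round(sample_variance, 3))
```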
EDA: Graphical Technique
Histograms and scatter plots are two popular graphical techniques to depict data.
A histogram summarizes the distribution of a univariate dataset.
[Figure: histogram with Miles Per Gallon on the x-axis and Frequency on the y-axis]
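A histogram is a count of how many values fall into each bin. The sketch below computes those counts for a made-up list of miles-per-gallon values with an assumed bin width of 5 and prints a text version of the chart. In practice a plotting library such as Matplotlib (`matplotlib.pyplot.hist`) would draw the bars.

```python
from collections import Counter

# Hypothetical miles-per-gallon readings.
mpg = [12.5, 14.0, 16.3, 17.8, 18.1, 19.9, 21.4, 22.0, 23.7, 26.2]
bin_width = 5  # assumed bin width

# Map each value to the lower edge of its bin and count per bin.
counts = Counter(int(v // bin_width) * bin_width for v in mpg)

for start in sorted(counts):
    print(f"{start:>2}-{start + bin_width:<2} {'#' * counts[start]}")
```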
This step involves reaching a conclusion and making predictions based on the data analysis.
This phase:
A hypothesis is used to establish the relationship between dependent and independent variables.
Draw two samples, S1 and S2, from the population and calculate the difference between their means, μ1 and μ2.
Assessing whether the difference between the two means is significant is called hypothesis testing.
Hypothesis Testing
Alternative Hypothesis
• Proposed model outcome is accurate and matches the data
• There is a difference between the means of S1 and S2
Null Hypothesis
• Opposite to the alternative hypothesis
• There is no difference between the means of S1 and S2
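One way to test the null hypothesis that S1 and S2 have the same mean is a permutation test, sketched below with the standard library only. This is an illustrative choice, not the method the lesson prescribes (a t-test is another common option). The two samples, the function name `permutation_test`, and the number of resamples are assumptions.

```python
import random
from statistics import fmean

def permutation_test(s1, s2, n_resamples=5000, seed=0):
    """Return (observed mean difference, two-sided p-value).

    Under the null hypothesis the group labels are interchangeable, so we
    shuffle the pooled values and count how often a random split gives a
    difference at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = fmean(s1) - fmean(s2)
    pooled = list(s1) + list(s2)
    n1 = len(s1)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = fmean(pooled[:n1]) - fmean(pooled[n1:])
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_resamples

# Made-up samples S1 and S2.
sample_1 = [20.1, 21.3, 19.8, 22.0, 20.7, 21.1]
sample_2 = [25.2, 26.1, 24.8, 25.9, 26.4, 25.0]

observed, p_value = permutation_test(sample_1, sample_2)
print("difference in means:", round(observed, 3))
print("p-value:", p_value)
```

A small p-value is evidence against the null hypothesis of no difference between the means; a large one means the data are compatible with it.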
Hypothesis Testing Process
Choose the training and test datasets and evaluate them against the null and alternative hypotheses.
The training dataset is usually 60% to 80% of the full dataset, and the test dataset is the remaining 20% to 40%.
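The split described above can be sketched in a few lines. The helper name `split_train_test`, the 70/30 ratio, and the fixed seed are assumptions for illustration; scikit-learn users would normally reach for `sklearn.model_selection.train_test_split` instead.

```python
import random

def split_train_test(rows, train_fraction=0.7, seed=42):
    """Shuffle the rows and split them into training and test sets."""
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed for repeatability
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

dataset = list(range(100))  # stand-in for 100 records
train, test = split_train_test(dataset, train_fraction=0.7)
print(len(train), "training rows,", len(test), "test rows")
```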
Data Visualization
Communication
The last step of data analysis is communication, where the analyzed data is formally presented to stakeholders.
Plotting is a data visualization technique used to represent underlying data through graphics.
Features of plotting:
Types of Plot
Time Series: Data is measured in time blocks such as date, month, year, and time (hours, minutes, and seconds).
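Before a time series is plotted, timestamped values are usually grouped into the time block of interest. The sketch below sums made-up transaction amounts per month using the standard library. The data and the choice of month as the block are assumptions; the resulting pairs are what a line chart would then plot.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical timestamped transactions (timestamp, amount).
readings = [
    ("2024-01-05 09:30:00", 250.0),
    ("2024-01-20 14:10:00", -80.0),
    ("2024-02-03 11:00:00", 40.0),
    ("2024-02-28 16:45:00", 60.0),
]

monthly_totals = defaultdict(float)
for stamp, amount in readings:
    moment = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    monthly_totals[(moment.year, moment.month)] += amount

# Sorted (year, month) -> total pairs, ready to plot as a line chart.
series = sorted(monthly_totals.items())
print(series)
```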
Data acquisition is the process of collecting data from various sources, such as RDBMS and NoSQL databases and web server logs, and also scraping the web or using web APIs.
Knowledge Check 2
What is an exploratory data analysis technique?
Select all that apply.
Most EDA techniques are graphical, with a few quantitative ones, and they suggest models that best fit the data.
They use almost all of the data and make minimal or no assumptions.
Knowledge Check 3
Which plotting technique is used for continuous data?
Select all that apply.
a. Regression plot
b. Line chart
c. Histogram
d. Heat map
Knowledge Check 4
Which Python library is the main machine learning library?
a. Pandas
b. Matplotlib
c. Scikit-learn
d. NumPy
Knowledge Check 5
Which of the following includes data transformation, merging, aggregation, group by operation, and reshaping?
a. Data acquisition
b. Data visualization
c. Data wrangling
d. Machine learning
Data wrangling includes data transformation, merging, aggregation, group by operation, and reshaping.
Key Takeaways