Reviewer
Agenda
• Basic Statistical Descriptions of Data
• Measuring the Central Tendency of Data
• Symmetric vs. Skewed Distribution
• Measuring the Dispersion of Data
Basic Statistical Descriptions of Data
• Statistical descriptions also allow us to better understand our data.
• We can analyze aspects of the data, such as central tendency, variation, and spread.
Measuring the Central Tendency of Data
• A measure of central tendency is a single value that identifies a central position within the data set.
• It helps with finding the average of a dataset.
• The three most common measures of central tendency are:
• Mode
• Median
• Mean
Mean
• The sum of all values divided by the total number of values.
• Can also be thought of as a weighted arithmetic mean in which every value has the same weight.
• Trimmed mean: a variation where you remove extreme values from the data set before averaging.
Median
• This is the middle number in an ordered data set.
• It is the middle value if there is an odd number of values.
• It is the average of the middle two values otherwise.
• For grouped data, it is estimated by interpolation.
Mode
• This is the most frequent value in the data set.
• The empirical formula is based on the values of the mean and median: mean − mode ≈ 3 × (mean − median).
Symmetric Distribution
• A symmetric distribution occurs when the values of variables appear at regular frequencies.
• The two sides of the distribution are a mirror image of each other.
• The normal distribution is the best-known example of a symmetric distribution.
• The characteristics of a symmetric distribution are:
• Mode, Median, and Mean are the same and are together in the center of the curve
• There can only be one Mode
• Most of the data is clustered around the center, with extreme values on the sides
Skewed Distribution
• A skewed distribution occurs when one tail of the distribution is longer than the other.
• Skewness is the tendency for the values to be more frequent around the high or low ends of the x-axis.
• A left-skewed distribution has a long left tail.
• It is also called a negatively-skewed distribution.
• A right-skewed distribution has a long right tail.
• It is also called a positively-skewed distribution.
• The characteristics of a skewed distribution are:
• Asymmetrical shape of the curve
• Mean and Median have different values and do not lie together at the center of the curve
• There can be more than one mode
• The distribution of the data tends towards the high or low end of the data set
Measuring the Dispersion of Data
• Dispersion is the state of getting dispersed or spread.
• Statistical dispersion means the extent to which numerical data is likely to vary about an average value.
• These are the different statistical methods that we can use to analyze the dispersion of data:
• Boxplot Analysis
• Histogram
• Quantile Plot
• Quantile-Quantile (Q-Q) Plot
Web Scraping
• Web scraping has two parts, the crawler and the scraper.
• The crawler is an algorithm that browses the web to look for particular data.
• The scraper is a tool that extracts data from a website.
Types of Web Scrapers
• These are the main types of web scrapers:
• Self-built Web Scrapers
• Pre-built Web Scrapers
• Browser Extension Web Scrapers
Self-built Web Scrapers
• These are scrapers which are built from the ground up by a programmer.
• Their features depend on what the developer adds.
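As an illustration of the crawler and scraper parts, and of what a very small self-built web scraper might look like, here is a minimal sketch using only the Python standard library. The names PAGES, PageScraper, scrape, and crawl are hypothetical and invented for this example. The pages are kept in memory so that nothing is downloaded; a real scraper would fetch the HTML over the network instead.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "website" so the sketch runs without network access.
PAGES = {
    "/": '<h1>Home</h1><a href="/item/1">First item</a><a href="/item/2">Second item</a>',
    "/item/1": '<h1>First item</h1><a href="/">Back home</a>',
    "/item/2": '<h1>Second item</h1><a href="/item/1">See also</a>',
}


class PageScraper(HTMLParser):
    """Scraper: pulls the <h1> title and all link targets out of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""      # text found inside <h1>
        self.links = []      # href values of every <a> tag
        self._in_h1 = False  # True while the parser is inside an <h1> tag

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if self._in_h1:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False


def scrape(html):
    """Extract (title, links) from the HTML of a single page."""
    parser = PageScraper()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.links


def crawl(start):
    """Crawler: visit pages breadth-first, following links the scraper finds."""
    seen = {start}
    queue = deque([start])
    titles = {}
    while queue:
        url = queue.popleft()
        title, links = scrape(PAGES[url])
        titles[url] = title
        for link in links:
            # Only follow links that exist in our toy site and are not yet visited.
            if link in PAGES and link not in seen:
                seen.add(link)
                queue.append(link)
    return titles


titles = crawl("/")
print(titles)
```

In this sketch the crawler decides which pages to visit, while the scraper only extracts data from the page it is given, which mirrors the two parts described above.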
• If your hardware is not able to meet the needs of the scraper, the application might slow down or fail.
web scraping.
• It has various libraries for collecting data, cleaning data,
collecting data from the web.
• Scrapy has three main features:
• This will install the Scrapy library onto your system so that it can be utilized by Python.
Beautiful Soup
[Diagram: Data Objects and Attributes; visible labels: Nominal, Symmetric, Asymmetric, Qualitative]