Lab 3 Report

This report explains essential statistical concepts for data analysis in data science, including measures of central tendency, variability, and visualization techniques. It emphasizes the importance of using robust measures like the median and z-scores to handle outliers and improve data interpretation. The combination of statistical methods and visual tools enhances understanding of complex datasets and supports effective decision-making.

TITLE: Understanding Statistical Analysis in Data Science

Introduction
This report breaks down basic and advanced statistical concepts to help understand data analysis
in data science. It covers measures of central tendency, variability, and relative standing, as well
as visualization techniques and how to deal with outliers. Using examples and Python tools, it
explains the strengths and weaknesses of different statistical methods when analyzing real-world
datasets.

Theory
• Statistics:

Statistics is about estimating values that represent an entire population by analyzing a
sample. Here are the key parts:

• Measures of Central Tendency: These identify the middle or typical value in a dataset.

a. Mean: The average of all values. It’s easy to calculate but very sensitive to
outliers.
b. Median: The middle value when data is sorted. It’s largely unaffected by outliers,
making it more reliable for skewed data.
c. Mode: The most common value. This is useful for categorical data and for datasets with
repeated values.
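
As a minimal sketch of the three measures above, the snippet below uses Python’s standard statistics module. The sample values are hypothetical and are not taken from the lab data.

```python
import statistics

# Hypothetical sample (not from the lab data); 7 appears twice.
values = [4, 7, 7, 9, 12, 15, 16]

mean_value = statistics.mean(values)      # sum of the values divided by their count
median_value = statistics.median(values)  # middle value of the sorted data
mode_value = statistics.mode(values)      # most frequently occurring value

print(f"mean={mean_value}, median={median_value}, mode={mode_value}")
```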

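• Measures of Variation: These show how spread out the data is.

a. Range: The difference between the largest and smallest values.
b. Interquartile Range (IQR): The difference between the third and first quartiles
(Q3 - Q1). It measures the spread of the middle 50% of the data, so extreme values have
little effect on it.
c. Standard Deviation & Variance: Show how far data points typically lie from the
mean. The variance is the average squared deviation from the mean, and the standard
deviation is its square root.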
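
The measures of variation above can be sketched with the standard statistics module as follows. The data is hypothetical, and the "inclusive" quartile method is only one of several conventions, so other tools may report slightly different quartiles for the same data.

```python
import statistics

# Hypothetical sample, chosen only for illustration.
values = [2, 4, 4, 5, 7, 9, 10, 12]

# Range: largest value minus smallest value.
data_range = max(values) - min(values)

# quantiles(n=4) returns the three cut points Q1, Q2 (the median), and Q3.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Sample variance divides by n - 1; the standard deviation is its square root.
variance = statistics.variance(values)
std_dev = statistics.stdev(values)

print(f"range={data_range}, IQR={iqr}, variance={variance:.2f}, std dev={std_dev:.2f}")
```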

• Relative Standing: Helps understand where a specific value fits in the dataset.

a. Z-Scores: Measure how far a value is from the mean in units of standard
deviations: z = (x - mean) / standard deviation.
b. Percentiles and Ranks: Show the percentage of values that fall at or below a given value.
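
A short sketch of both ideas follows. The helper names z_scores and percentile_rank are hypothetical, the data is made up, and the percentile rank uses the "at or below" definition, which is one of several in common use.

```python
import statistics

def z_scores(values):
    """Return (x - mean) / sample standard deviation for each x."""
    mean_value = statistics.mean(values)
    std_dev = statistics.stdev(values)
    return [(x - mean_value) / std_dev for x in values]

def percentile_rank(values, x):
    """Percentage of values less than or equal to x."""
    return 100 * sum(v <= x for v in values) / len(values)

values = [10, 12, 14, 16, 18]  # hypothetical data
scores = z_scores(values)

# Values below the mean get negative z-scores, values above it positive ones.
for value, z in zip(values, scores):
    print(f"{value}: z = {z:+.2f}, percentile rank = {percentile_rank(values, value):.0f}")
```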

Why Visualization Matters


Graphs like bar charts, histograms, and line graphs make it easier to spot patterns and outliers in
data. Visuals help present numerical findings in a way that’s easier to interpret.

Applications and Problems

1. How Outliers Affect Central Tendency

• Mean: Adding an extreme value, such as 110, pulls the mean sharply upward and makes it
less representative of the dataset.
• Median: Outliers like 31 have little to no effect on the median, so it remains a reliable
measure of central tendency.
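
The contrast can be illustrated with a small sketch. Only the outlier value 110 comes from the text above; the baseline values are hypothetical, since the lab’s dataset is not reproduced in this report.

```python
import statistics

# Hypothetical baseline values; only the outlier 110 is taken from the report.
baseline = [22, 25, 27, 28, 30]
with_outlier = baseline + [110]

mean_shift = statistics.mean(with_outlier) - statistics.mean(baseline)
median_shift = statistics.median(with_outlier) - statistics.median(baseline)

# The mean moves by many units, while the median barely changes.
print(f"mean shift: {mean_shift:.2f}, median shift: {median_shift:.2f}")
```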

2. Understanding Data Spread

• A dataset of Facebook friends showed a mean of 789 and a standard deviation of 425.
This standard deviation is more than half the mean, which indicates wide variation in the
number of friends among users. Visualizing this data helped highlight the differences.

3. Using Z-Scores

• Z-scores standardized the dataset, showing how many standard deviations each value was
from the mean.
• Negative scores represented below-average values, while positive scores highlighted
above-average ones.

Results and Insights


Variation Metrics

1. Range: Shows the total spread but doesn’t provide details about distribution.
2. Standard Deviation: Gives a clearer picture of data spread when combined with the mean.
3. Coefficient of Variation (CV): The standard deviation divided by the mean. Because it is
unitless, it allows comparisons between datasets with different units or scales.
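
As a worked example of the CV, the sketch below applies it to the Facebook friends figures reported earlier (mean 789, standard deviation 425). The function name coefficient_of_variation is a hypothetical helper.

```python
def coefficient_of_variation(std_dev, mean_value):
    """CV = standard deviation / mean, a unitless ratio."""
    return std_dev / mean_value

# Figures from the Facebook friends dataset described in this report.
cv_friends = coefficient_of_variation(425, 789)

print(f"CV = {cv_friends:.1%}")  # prints "CV = 53.9%"
```

A CV above 50% means the standard deviation is more than half the mean, which is consistent with the wide variation noted for that dataset.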

What Visuals Reveal


Graphs pinpointed patterns and outliers in the data. For example, bar charts highlighted clusters
of data points within one standard deviation of the mean and made outliers stand out clearly.

Relative Standing

Z-scores helped identify how individual data points compared to the rest of the dataset, making it
easier to detect unusual values.

Discussion
Combining measures of central tendency and variation with visualization techniques gives a full
understanding of a dataset. While the mean and standard deviation are useful, they’re affected by
outliers. In such cases the median offers a more reliable measure of center, and z-scores help
flag the unusual values, although z-scores are themselves computed from the mean and standard
deviation. Visual tools add another layer of clarity by making trends and anomalies easy to see.

Conclusion
This report highlights the importance of using a mix of statistical techniques and visualizations
for effective data analysis. By pairing a robust measure like the median with z-scores that flag
unusual values, analysts can make more accurate interpretations. Visualization tools enhance these
findings, making complex data easier to understand. Together, these methods form a strong
foundation for more advanced data science tasks and decision-making.
