Lab 3 Report
Introduction
This report reviews basic and more advanced statistical concepts used for data analysis
in data science. It covers measures of central tendency, variability, and relative standing, as well
as visualization techniques and the handling of outliers. Using examples and Python tools, it
explains the strengths and weaknesses of different statistical methods when analyzing real-world
datasets.
Theory
Statistics:
Measures of Central Tendency: These identify the middle or typical value in a dataset.
a. Mean: The average of all values. It’s easy to calculate but very sensitive to
outliers.
b. Median: The middle value when data is sorted. It is far less affected by outliers
than the mean, making it more reliable for skewed data.
c. Mode: The most common value. This is useful for datasets with repeated values.
Measures of Variation: These show how spread out the data is.
Relative Standing: Helps understand where a specific value fits in the dataset.
a. Z-Scores: Measure how far a value is from the mean in terms of standard
deviations.
b. Percentiles and ranks also fall under this category.
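The three measures of central tendency listed above can be computed with Python's standard statistics module. The following is a minimal sketch; the sample values are hypothetical and chosen only for illustration, not taken from the lab data.

```python
import statistics

# Hypothetical sample values, chosen only for illustration (not from the lab data).
values = [12, 15, 15, 18, 20, 22, 25]

mean_value = statistics.mean(values)      # arithmetic average of all values
median_value = statistics.median(values)  # middle value of the sorted data
mode_value = statistics.mode(values)      # most frequently occurring value

print(f"mean={mean_value:.2f}, median={median_value}, mode={mode_value}")
```

Here the mode exists because 15 appears twice; in a dataset with no repeated values the mode carries little information.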
Mean: Adding an extreme value, such as 110, can pull the mean much higher and make it
less representative of the typical value in the dataset.
Median: Outliers like 31 have little to no effect on the median, so it remains a reliable
measure of central tendency.
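The difference in sensitivity between the mean and the median can be checked with a small experiment. In the sketch below, only the extreme value 110 comes from the text above; the baseline values are hypothetical stand-ins, since the original dataset is not reproduced here.

```python
import statistics

# Hypothetical baseline values; only the extreme value 110 is taken from the text.
baseline = [20, 22, 23, 25, 26, 27, 28]
with_outlier = baseline + [110]

def summarize(data):
    """Return the (mean, median) pair for a list of numbers."""
    return statistics.mean(data), statistics.median(data)

mean_before, median_before = summarize(baseline)
mean_after, median_after = summarize(with_outlier)

print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"median: {median_before} -> {median_after}")
```

With these assumed values, the single extreme point moves the mean by more than ten units, while the median shifts by only half a unit.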
A dataset of Facebook friends showed a mean of 789 and a standard deviation of 425.
This standard deviation, more than half the size of the mean, indicated a wide variation
in the number of friends among users. Visualizing this data helped highlight the differences.
3. Using Z-Scores
Z-scores standardized the dataset, showing how far each value was from the mean.
Negative scores represented below-average values, while positive scores highlighted
above-average ones.
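A z-score is computed as z = (x - mean) / standard deviation. The sketch below applies this formula using the mean (789) and standard deviation (425) reported for the Facebook friends dataset. The individual friend counts and the function name z_score are hypothetical, picked so that the results are easy to verify by hand.

```python
# Summary statistics reported above for the Facebook friends dataset.
MEAN_FRIENDS = 789
STD_FRIENDS = 425

def z_score(value, mean=MEAN_FRIENDS, std=STD_FRIENDS):
    """Return how many standard deviations `value` lies from the mean."""
    return (value - mean) / std

# Hypothetical friend counts, not taken from the dataset.
for friends in (364, 789, 1639):
    print(friends, z_score(friends))
```

A count of 364 lies one standard deviation below the mean (z = -1), 789 sits exactly at the mean (z = 0), and 1639 lies two standard deviations above it (z = 2).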
1. Range: Shows the total spread but doesn’t provide details about distribution.
2. Standard Deviation: Gives a clearer picture of data spread when combined with the mean.
3. Coefficient of Variation (CV): Allows comparisons between datasets with different units
or scales.
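The three measures of variation above can be sketched in a few lines of Python. The values list is hypothetical, and the choice of the population standard deviation (statistics.pstdev) rather than the sample version is an assumption, since the report does not say which one was used. The last lines compute the CV implied by the Facebook friends figures reported earlier (mean 789, standard deviation 425).

```python
import statistics

def spread_summary(data):
    """Return (range, standard deviation, coefficient of variation)."""
    data_range = max(data) - min(data)
    # Population standard deviation (assumption); statistics.stdev gives the sample version.
    std = statistics.pstdev(data)
    cv = std / statistics.mean(data)  # unitless ratio; undefined when the mean is 0
    return data_range, std, cv

# Hypothetical values chosen so the arithmetic is easy to check by hand.
values = [2, 4, 4, 4, 5, 5, 7, 9]
data_range, std, cv = spread_summary(values)
print(f"range={data_range}, std={std:.2f}, cv={cv:.1%}")

# CV of the Facebook friends data from the reported mean and standard deviation.
facebook_cv = 425 / 789
print(f"Facebook friends CV = {facebook_cv:.1%}")
```

Because the CV divides the standard deviation by the mean, it has no units, which is what makes comparisons across differently scaled datasets possible. For the Facebook friends figures it comes to roughly 54%.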
Relative Standing
Z-scores helped identify how individual data points compared to the rest of the dataset, making it
easier to detect unusual values.
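One common way to turn this idea into a check is to flag values whose absolute z-score exceeds a cutoff. The sketch below assumes a cutoff of 2 and uses the population standard deviation. Both are illustrative choices rather than the settings used in the lab, and the function name flag_unusual and the sample data are hypothetical.

```python
import statistics

def flag_unusual(data, threshold=2.0):
    """Return the values whose |z-score| exceeds `threshold` (assumed cutoff)."""
    mean = statistics.mean(data)
    std = statistics.pstdev(data)  # population standard deviation (assumption)
    if std == 0:
        return []  # all values identical, so nothing can stand out
    return [x for x in data if abs((x - mean) / std) > threshold]

# Hypothetical sample with one clearly unusual value.
sample = [10, 12, 11, 13, 12, 11, 10, 12, 11, 30]
print(flag_unusual(sample))
```

Note that an extreme value also inflates the mean and standard deviation used to compute the z-scores, so in very small samples an outlier can partly mask itself.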
Discussion
Combining measures of central tendency and variation with visualization techniques gives a fuller
understanding of a dataset. While the mean and standard deviation are useful, both are sensitive to
outliers. The median offers a more reliable measure of center in such cases, and z-scores, although
computed from the mean and standard deviation, help flag the unusual values themselves. Visual
tools add another layer of clarity by making trends and anomalies easy to see.
Conclusion
This report highlights the importance of using a mix of statistical techniques and visualizations
for effective data analysis. By using robust measures like the median and by flagging outliers with
z-scores, analysts can make more accurate interpretations. Visualization tools enhance these
findings, making complex data easier to understand. Together, these methods form a strong
foundation for more advanced data science tasks and decision-making.