stat python
Inferential statistics, on the other hand, use data from a sample to make predictions or generalizations
about a larger population.
Types of Data
Numeric (Quantitative):
Categorical (Qualitative):
5. Data Visualization
Measures of Center
The shape of the data in the histogram describes how the CO2 emissions are spread across countries.
Left-skewed: The data has a long tail on the left side (most values are high, with a few unusually low ones).
Right-skewed: The data has a long tail on the right side (most values are low, with a few unusually high ones).
To calculate the mean and median of CO2 emissions, you would use the .agg() method. .agg() allows
you to apply multiple aggregation functions (like mean and median) to a column at once.
Given the skew, the median is usually the better summary because, unlike the mean, it isn't pulled
toward extreme values. So the median is the preferred measure of central tendency in this case.
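As a minimal sketch of the .agg() call described above: the DataFrame, its column names, and its values are made up for illustration (they are not the emissions data the notes refer to), with one extreme value added to create right skew.

```python
import pandas as pd

# Hypothetical, made-up values (not real emissions data).
# The single large value (48.0) gives the column a long right tail.
emissions = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E", "F"],
    "co2_emission": [1.0, 2.0, 2.0, 3.0, 4.0, 48.0],
})

# .agg() applies several aggregation functions to the column in one call.
summary = emissions["co2_emission"].agg(["mean", "median"])
print(summary)

# The mean (10.0) is pulled up by the extreme value; the median (2.5) is not.
```

With right-skewed data like this, the mean lands above most of the observations, while the median stays near the bulk of the values.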
Standard deviation measures how much data points deviate from the mean. It tells you whether the values in a
dataset are closely packed around the average or widely spread out.
A low standard deviation means most values are close to the mean (less variation).
A high standard deviation means values are spread out (more variation).
Standard deviation is simply the square root of the variance, which makes it more practical for
real-world interpretation.
Variance → Measures the spread of data but in squared units (harder to interpret).
Standard Deviation → Square root of variance, showing spread in the same units as the data
(easier to understand).
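The relationship between the two can be checked on a small example. This is only a sketch on made-up numbers. It uses ddof=0 (population formulas) so the arithmetic is easy to follow by hand; pandas defaults to ddof=1 (sample formulas).

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen so the results are round numbers.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Variance: average squared distance from the mean (in squared units).
variance = values.var(ddof=0)

# Standard deviation: same spread, expressed in the original units.
std_dev = values.std(ddof=0)

print(variance)  # 4.0
print(std_dev)   # 2.0

# The standard deviation is the square root of the variance.
print(np.isclose(std_dev, np.sqrt(variance)))
```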
Outliers can have big effects on statistics like mean, as well as statistics that rely on the mean, such as
variance and standard deviation. Interquartile range, or IQR, is another way of measuring spread that’s
less influenced by outliers. IQR is also often used to find outliers. If a value is less than
Q1 - 1.5 × IQR or greater than Q3 + 1.5 × IQR, it's considered an outlier. In fact, this is how the
lengths of the whiskers in a matplotlib box plot are calculated.
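The outlier rule above can be sketched as follows. The values are made up for illustration, and the variable names (lower, upper, outliers) are just labels chosen here.

```python
import pandas as pd

# Hypothetical values with one obviously extreme observation.
values = pd.Series([1, 2, 2, 3, 3, 4, 4, 5, 30], dtype=float)

# First and third quartiles, and the interquartile range.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Cutoffs from the 1.5 × IQR rule.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only the values that fall outside the cutoffs.
outliers = values[(values < lower) | (values > upper)]
print(outliers)
```

Here Q1 is 2 and Q3 is 4, so the cutoffs are -1 and 7, and only the value 30 is flagged.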
Sampling
Simple Definition:
Sampling is the process of selecting a small group (sample) from a larger group (population) to analyze,
instead of looking at every single item in the population. The goal is to use the sample to make
estimates or conclusions about the entire population.
1. Calculating the average (mean) of song durations for the whole dataset (population).
2. Taking a random sample of songs and calculating the average (mean) of the sample.
3. Comparing the two averages to see how well the sample represents the whole dataset.
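# Assumes spotify_population is a pandas DataFrame that has already been loaded.
# Draw a simple random sample of 1000 rows.
spotify_sample = spotify_population.sample(n=1000)
print(spotify_sample)
# Mean song duration in the full dataset (population) and in the sample.
mean_dur_pop = spotify_population["duration_minutes"].mean()
mean_dur_samp = spotify_sample["duration_minutes"].mean()
# The two means should be close if the sample represents the population well.
print(mean_dur_pop)
print(mean_dur_samp)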
Convenience sampling selects data in the easiest way, often leading to biased samples that don’t
represent the population.
Findings: The sample had higher acousticness values than the population, meaning it is not
representative.
Conclusion: The findings are not generalizable because the sample is biased. Random sampling
would be a better approach for accurate insights.
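To illustrate the contrast, here is a hedged sketch on synthetic data, not the dataset from the notes. It assumes, purely for illustration, that the population happens to be stored with the most acoustic tracks first, so that taking the first rows with .head() acts as a convenience sample.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical population: 10,000 tracks with acousticness between 0 and 1,
# stored in descending order of acousticness (an assumption for this sketch).
population = pd.DataFrame({"acousticness": rng.uniform(0, 1, 10_000)})
population = population.sort_values(
    "acousticness", ascending=False
).reset_index(drop=True)

# Convenience sample: just take the first 1000 rows.
convenience_sample = population.head(1000)

# Simple random sample: every row has the same chance of being picked.
random_sample = population.sample(n=1000, random_state=42)

mean_pop = population["acousticness"].mean()
mean_conv = convenience_sample["acousticness"].mean()
mean_rand = random_sample["acousticness"].mean()

print(mean_pop, mean_conv, mean_rand)
```

Under this setup the convenience sample's mean is far above the population mean, while the random sample's mean stays close to it.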