stat python

The document outlines the differences between descriptive and inferential statistics, detailing types of data and methods for data visualization. It explains measures of central tendency, standard deviation, and the importance of sampling, emphasizing that random sampling is crucial for accurate representation of a population. Additionally, it discusses the impact of outliers on statistical measures and the use of interquartile range (IQR) to identify them.

Uploaded by

youssef mahmoud

Descriptive statistics summarize and describe data that has already been collected, focusing on specific details like averages, percentages, and patterns.

Inferential statistics, on the other hand, use data from a sample to make predictions or generalizations about a larger population.

Types of Data

 Numeric (Quantitative):

o Continuous: Can take any value (e.g., stock price).

o Discrete: Whole numbers (e.g., cups of coffee per day).

 Categorical (Qualitative):

o Nominal: Unordered categories (e.g., eye color).

o Ordinal: Ordered categories (e.g., survey responses).

Data Visualization

 Numeric Data: Scatter plots, histograms.

 Categorical Data: Bar charts, grouped aggregations.

Measures of Center

 Mean (Average): Sum of values / number of values.

 Median: Middle value when sorted.

 Mode: Most frequent value.

 Choosing the Right Measure:

The shape of the data in the histogram describes how the CO2 emissions are spread across countries.

 No skew: The data is roughly symmetric around its center.

 Left-skewed: The data has a long tail on the left side; most values sit at the high end, with a few unusually low values.

 Right-skewed: The data has a long tail on the right side; most values sit at the low end, with a few unusually high values.

To calculate the mean and median of CO2 emissions, you can use the pandas .agg() method, which applies several aggregation functions (such as mean and median) to a column in one call.

When the data is skewed, the median is usually the better summary because, unlike the mean, it is not pulled toward extreme values. So, the median is the best measure of central tendency for this case.
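
As a minimal sketch of the point above, the following uses a small made-up table (the country labels, the column name co2_emission, and all values are hypothetical, not taken from the original dataset). One very large value creates a right skew, and .agg() shows how far the mean is pulled away from the median.

```python
import pandas as pd

# Hypothetical, made-up emissions values (not from the source dataset);
# the single large value creates a long right tail.
emissions = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E", "F"],
    "co2_emission": [1.0, 2.0, 2.0, 3.0, 4.0, 48.0],
})

# .agg() accepts a list of aggregation names and returns one result per name.
center = emissions["co2_emission"].agg(["mean", "median"])
print(center)

# The mode is the most frequent value; .mode() returns a Series because ties are possible.
modes = emissions["co2_emission"].mode().tolist()
print(modes)
```

Here the mean works out to 10.0 while the median is 2.5, so the median sits much closer to where most of the values actually are.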

Standard deviation measures how much data points deviate from the mean. It tells you whether the values in a dataset are closely packed around the average or widely spread out.

 A low standard deviation means most values are close to the mean (less variation).

 A high standard deviation means values are spread out (more variation).

Standard deviation is the square root of the variance, which makes it more practical for real-world interpretation.

Short Difference Between Variance & Standard Deviation

 Variance → Measures the spread of data but in squared units (harder to interpret).

 Standard Deviation → Square root of variance, showing spread in the same units as the data
(easier to understand).

👉 Because it is in the data's own units, standard deviation is the more practical measure for business decisions.
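
To make the relationship concrete, here is a small sketch using Python's standard statistics module on two made-up lists (the values are hypothetical). It uses the population formulas (pvariance, pstdev); note that pandas .std() uses the sample formula (ddof=1) by default, so its numbers would differ slightly.

```python
import statistics

# Hypothetical measurements (illustrative values only); both lists have mean 5.
tight = [4.0, 5.0, 5.0, 6.0]
spread = [1.0, 3.0, 7.0, 9.0]

results = {}
for name, values in [("tight", tight), ("spread", spread)]:
    variance = statistics.pvariance(values)  # population variance, in squared units
    std_dev = statistics.pstdev(values)      # square root of the variance, in the data's units
    results[name] = (variance, std_dev)
    print(name, variance, std_dev)
```

Both lists share the same mean, but the second has a much larger variance and standard deviation because its values lie farther from that mean.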

Finding outliers using IQR

Outliers can have big effects on statistics like the mean, as well as statistics that rely on the mean, such as variance and standard deviation. The interquartile range, or IQR, is another way of measuring spread that is less influenced by outliers: it is the difference between the third and first quartiles, $\text{IQR} = Q_3 - Q_1$. The IQR is also often used to find outliers. If a value is less than $Q_1 - 1.5 \times \text{IQR}$ or greater than $Q_3 + 1.5 \times \text{IQR}$, it is considered an outlier. In fact, this is how the lengths of the whiskers in a matplotlib box plot are calculated by default.

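
The IQR rule can be sketched in a few lines of pandas. The values below are made up for illustration, and the variable names are assumptions rather than anything from the original exercise.

```python
import pandas as pd

# Hypothetical values (illustrative only); 40.0 sits far from the rest.
values = pd.Series([2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 40.0])

# First and third quartiles (pandas interpolates linearly by default).
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Cutoffs for the 1.5 x IQR rule.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only the values that fall outside the cutoffs.
outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())
```

With these numbers the quartiles are 3.75 and 7.25, so the cutoffs are -1.5 and 12.5, and only 40.0 is flagged.
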
Sampling

Simple Definition:

Sampling is the process of selecting a small group (sample) from a larger group (population) to analyze,
instead of looking at every single item in the population. The goal is to use the sample to make
estimates or conclusions about the entire population.

In this exercise, you are:

1. Calculating the average (mean) of song durations for the whole dataset (population).

2. Taking a random sample of songs and calculating the average (mean) of the sample.

3. Comparing the two averages to see how well the sample represents the whole dataset.

# Assumes spotify_population is a pandas DataFrame that is already loaded

# Sample 1000 rows from spotify_population
spotify_sample = spotify_population.sample(n=1000)

# Print the sample
print(spotify_sample)

# Calculate the mean duration in mins from spotify_population
mean_dur_pop = spotify_population["duration_minutes"].mean()

# Calculate the mean duration in mins from spotify_sample
mean_dur_samp = spotify_sample["duration_minutes"].mean()

# Print the means
print(mean_dur_pop)
print(mean_dur_samp)

 Convenience sampling selects data in the easiest way, often leading to biased samples that don’t
represent the population.

 You compared acousticness distributions of:

1. General population (spotify_population).

2. A sample of 1107 songs (spotify_mysterious_sample).

 Findings: The sample had higher acousticness values than the population, meaning it is not
representative.

 Conclusion: The findings are not generalizable because the sample is biased. Random sampling
would be a better approach for accurate insights.
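
The contrast between convenience and random sampling can be illustrated with a small synthetic example. Everything here is an assumption for demonstration: the population is generated rather than loaded from the Spotify data, and taking the last 100 rows stands in for "selecting data in the easiest way".

```python
import pandas as pd

# Hypothetical population: 1000 songs whose acousticness rises steadily with row order,
# standing in for a dataset that happens to be sorted.
population = pd.DataFrame({"acousticness": [i / 999 for i in range(1000)]})

# Convenience sample: simply take the last 100 rows (easy, but not random).
convenience_sample = population.tail(100)

# Simple random sample: every row has the same chance of being chosen.
# random_state only makes the draw repeatable.
random_sample = population.sample(n=100, random_state=42)

pop_mean = population["acousticness"].mean()
conv_mean = convenience_sample["acousticness"].mean()
rand_mean = random_sample["acousticness"].mean()

print(pop_mean)
print(conv_mean)
print(rand_mean)
```

The convenience sample's mean lands far above the population mean, while the random sample's mean stays close to it, which mirrors the finding described above.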
