EDA Unit IV

This document covers descriptive statistics, including distribution functions, measures of central tendency, dispersion, kurtosis, and various analysis techniques such as univariate, bivariate, multivariate, and time series analysis. It includes sample experiments for studying different distribution techniques, performing data cleaning, and calculating statistical measures using Python. Key concepts are explained with examples and code snippets for practical implementation.

Uploaded by

aimlbtech7

UNIT-IV

Descriptive Statistics: Distribution function, Measures of central tendency, Measures of dispersion, Types of kurtosis, Calculating
percentiles, Quartiles, Grouping Datasets, Correlation, Understanding univariate, bivariate, multivariate analysis, Time Series Analysis
Sample Experiments:

1. Study the following Distribution Techniques on a sample data
1) Uniform Distribution 2) Normal Distribution 3) Gamma Distribution 4) Exponential Distribution 5) Poisson Distribution
6) Binomial Distribution
2. Perform Data Cleaning on a sample dataset.
3. Compute measure of Central Tendency on a sample dataset
a) Mean b) Median c) Mode
4. Explore Measures of Dispersion on a sample dataset
a) Variance b) Standard Deviation c) Skewness d) Kurtosis
5. a) Calculating percentiles on sample dataset
b) Calculate Inter Quartile Range (IQR) and Visualize using Box Plots
6. Perform the following analysis on automobile dataset.
a) Bivariate analysis b) Multivariate analysis
7. Perform Time Series Analysis on Open Power systems dataset

Descriptive statistics is a field of statistics that involves summarizing and describing the characteristics of a dataset. This unit
covers essential concepts like distribution functions, central tendency, dispersion, kurtosis, percentiles, and various analysis
techniques, including time series analysis.

Key Concepts and Techniques

1. Distribution Function

o A distribution function (or cumulative distribution function) gives the probability that a random variable takes on
a value less than or equal to a specified value. It’s fundamental for understanding how values are spread across
a dataset.

o Syntax/Example:

 CDF for Normal Distribution: F(x) = P(X ≤ x)

 The CDF of a normal distribution can be calculated using libraries like scipy.stats.norm.cdf(x,
loc=mean, scale=stddev) in Python.
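As a minimal sketch of the CDF call mentioned above, the snippet below evaluates F(x) = P(X ≤ x) for a standard normal distribution. The parameter values (mean 0, standard deviation 1) are assumptions chosen only for illustration.

```python
from scipy.stats import norm

mean, stddev = 0.0, 1.0  # assumed parameters, for illustration only

# P(X <= mean): half of the probability mass lies at or below the mean
p_below_mean = norm.cdf(mean, loc=mean, scale=stddev)

# P(mean - stddev < X <= mean + stddev): difference of two CDF values
p_within_one_sd = (norm.cdf(mean + stddev, loc=mean, scale=stddev)
                   - norm.cdf(mean - stddev, loc=mean, scale=stddev))

print(p_below_mean)
print(round(p_within_one_sd, 4))
```

Subtracting two CDF values gives the probability of an interval, which is how the familiar "about 68% within one standard deviation" figure arises.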

2. Measures of Central Tendency

o These are metrics that describe the center of a dataset.

 Mean: The average of all data points.

 Syntax/Example: mean = sum(data) / len(data)

 Median: The middle value when data is ordered.

 Syntax/Example: median = sorted(data)[n // 2], where n = len(data) (for an odd number of data points; for an even n, average the two middle values).

 Mode: The value that appears most frequently.

 Syntax/Example: mode = statistics.mode(data) in Python.
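The three measures can be sketched in plain Python as follows. The list `data` is a hypothetical sample with an even number of points, chosen so that the median requires averaging the two middle values.

```python
import statistics

data = [7, 3, 9, 3, 5, 8]  # hypothetical sample (even number of points)

# Mean: sum of the values divided by their count
mean = sum(data) / len(data)

# Median: sort first, then take the middle value (or average the two middle values)
ordered = sorted(data)
n = len(ordered)
if n % 2 == 1:
    median = ordered[n // 2]
else:
    median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

# Mode: the most frequent value
mode = statistics.mode(data)

print(mean, median, mode)
```

The hand-written median agrees with `statistics.median(data)`, which handles both the odd and even cases.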

3. Measures of Dispersion

o These metrics describe the spread of data points in a dataset.

 Variance: Measures the average squared deviation from the mean.

 Syntax/Example: variance = sum((x - mean) ** 2 for x in data) / len(data)

 Standard Deviation: The square root of the variance. It expresses the typical distance of data points from the mean, in the same units as the data.

 Syntax/Example: std_dev = math.sqrt(variance)

 Skewness: Measures the asymmetry of the data distribution.

 Kurtosis: Measures the "tailedness" of the data distribution.
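The variance formula above divides by len(data), which gives the population variance. Many libraries (pandas, for example) default to the sample variance, which divides by n - 1. The sketch below contrasts the two on a hypothetical list.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample

mean = sum(data) / len(data)
squared_deviations = sum((x - mean) ** 2 for x in data)

population_variance = squared_deviations / len(data)    # divide by n
sample_variance = squared_deviations / (len(data) - 1)  # divide by n - 1

population_std = math.sqrt(population_variance)

print(population_variance, sample_variance, population_std)
```

The results match `statistics.pvariance(data)` and `statistics.variance(data)` respectively, so it is worth checking which convention a given function uses before comparing numbers.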

4. Types of Kurtosis

o Leptokurtic: Distribution has heavy tails and a sharp peak (excess kurtosis greater than 0).

o Platykurtic: Distribution has light tails and a flat peak (excess kurtosis less than 0).

o Mesokurtic: Distribution has tails and a peak similar to a normal distribution (excess kurtosis close to 0).
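One way to see the three types side by side is to look at the theoretical excess kurtosis of a few well-known distributions. The choice of the Laplace and uniform distributions as the leptokurtic and platykurtic examples is an assumption made for this sketch.

```python
from scipy.stats import laplace, norm, uniform

# Theoretical excess kurtosis (normal distribution = 0) of three distributions
excess_kurtosis = {
    "laplace (leptokurtic)": float(laplace.stats(moments="k")),
    "normal (mesokurtic)": float(norm.stats(moments="k")),
    "uniform (platykurtic)": float(uniform.stats(moments="k")),
}

for name, value in excess_kurtosis.items():
    print(f"{name}: {value:.2f}")
```

A positive value indicates heavier tails than the normal distribution, a negative value lighter tails.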

5. Calculating Percentiles and Quartiles

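o Percentiles: The 99 cut points that divide an ordered dataset into 100 equal parts. The p-th percentile is the value at or below which about p% of the observations fall.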

o Quartiles: Divides data into four equal parts. The difference between the first and third quartiles is called the
Interquartile Range (IQR).

 Syntax/Example: Q1, Q3 = np.percentile(data, [25, 75]) in Python.
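A short sketch of quartiles and the IQR in use. The array is hypothetical and includes one extreme value; the 1.5 × IQR fences are the common box-plot convention for flagging outliers, not something specific to this unit.

```python
import numpy as np

data = np.array([1, 3, 5, 7, 9, 11, 13, 15, 100])  # hypothetical data with one extreme value

q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Common convention: points beyond 1.5 * IQR from the quartiles are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(q1, q2, q3, iqr)
print(outliers)
```

Because quartiles depend on rank rather than magnitude, the extreme value barely affects the IQR, which is why the IQR is considered a robust measure of spread.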

6. Grouping Datasets

o Grouping involves aggregating data based on a certain criterion to summarize it better, like summing or
averaging data by groups.

 Syntax/Example: df.groupby('column_name').agg('mean') in Python.
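A minimal grouping sketch. The DataFrame and its column names (`city`, `sales`) are hypothetical.

```python
import pandas as pd

# Hypothetical data: sales records tagged by city
df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B"],
    "sales": [10, 20, 5, 15, 25],
})

# Several aggregations per group in a single call
summary = df.groupby("city")["sales"].agg(["mean", "sum", "count"])

print(summary)
```

Passing a list to `agg` produces one column per aggregation, which is convenient when a group's average needs to be read alongside its size.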

7. Correlation

o A statistical measure that expresses the extent to which two variables are linearly related.

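 Syntax/Example: correlation = np.corrcoef(data1, data2)[0, 1] in Python. np.corrcoef returns a 2×2 correlation matrix; the off-diagonal entry is the Pearson correlation between the two variables.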
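The following sketch shows what np.corrcoef returns for two variables. The arrays `hours` and `score` are hypothetical.

```python
import numpy as np

hours = np.array([1, 2, 3, 4, 5])       # hypothetical study hours
score = np.array([52, 55, 61, 64, 70])  # hypothetical test scores

matrix = np.corrcoef(hours, score)  # 2x2 matrix; the diagonal entries are 1
r = matrix[0, 1]                    # Pearson correlation between the two variables

print(matrix.shape)
print(round(r, 3))
```

A value of r close to +1 indicates a strong positive linear relationship; correlation alone does not establish that one variable causes the other.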

8. Univariate, Bivariate, and Multivariate Analysis

o Univariate Analysis: Analyzes a single variable. Common tools include frequency distribution, histograms, and
summary statistics.

o Bivariate Analysis: Analyzes two variables, often using scatter plots or correlation coefficients.

o Multivariate Analysis: Analyzes more than two variables simultaneously, commonly used in regression
analysis or principal component analysis (PCA).

9. Time Series Analysis

o Time series analysis is used to analyze data points collected or recorded at specific time intervals.

 Syntax/Example: plt.plot(time, data) for visualizing trends over time.
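Beyond plotting, a common first step in time series analysis is resampling to a coarser interval. The sketch below averages a hypothetical daily series into weekly values; the dates and numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical daily series covering two full weeks (2024-01-01 is a Monday)
dates = pd.date_range("2024-01-01", periods=14, freq="D")
series = pd.Series(range(1, 15), index=dates)

# Weekly mean; with "W", each bin ends on a Sunday
weekly_mean = series.resample("W").mean()

print(weekly_mean)
```

Resampling requires a datetime index, which is why converting the time column and setting it as the index usually comes first.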

13. Study the following Distribution Techniques on a sample data:

a)Uniform Distribution
Uniform distribution is a type of distribution in which every outcome has an equal
chance of occurring.

 Sample Code:

import numpy as np

import matplotlib.pyplot as plt

data_uniform = np.random.uniform(low=0, high=10, size=1000)

plt.hist(data_uniform, bins=30, density=True, alpha=0.6, color='g')

plt.title('Uniform Distribution')

plt.show()

 Output Example: A roughly flat histogram. With density=True, every bar sits near 1/(10 - 0) = 0.1 across the range 0 to 10.

b) Normal Distribution
Normal distribution follows the classic bell curve.

 Sample Code:
data_normal = np.random.normal(loc=0, scale=1, size=1000)

plt.hist(data_normal, bins=30, density=True, alpha=0.6, color='b')

plt.title('Normal Distribution')

plt.show()

 Output Example: A bell-shaped curve with the mean around 0 and standard deviation of 1.

c) Gamma Distribution
The Gamma distribution is a two-parameter family of continuous probability distributions.

 Sample Code:

data_gamma = np.random.gamma(shape=2, scale=2, size=1000)

plt.hist(data_gamma, bins=30, density=True, alpha=0.6, color='r')

plt.title('Gamma Distribution')

plt.show()

 Output Example: A right-skewed histogram that starts at 0, rises to a peak near (shape - 1) × scale = 2, and then decays with a long right tail.

d) Exponential Distribution
Exponential distribution is often used to model time between events in a Poisson process.

 Sample Code:

data_exponential = np.random.exponential(scale=1, size=1000)

plt.hist(data_exponential, bins=30, density=True, alpha=0.6, color='y')

plt.title('Exponential Distribution')

plt.show()

 Output Example: A histogram that is highest near 0 and decays steadily, with a long right tail.

e) Poisson Distribution
The Poisson distribution is used to describe the number of events occurring in a fixed interval of time or space.

 Sample Code:

data_poisson = np.random.poisson(lam=3, size=1000)

plt.hist(data_poisson, bins=np.arange(data_poisson.max() + 2) - 0.5, density=True, alpha=0.6, color='m') # one bar per integer count

plt.title('Poisson Distribution')

plt.show()

 Output Example: A distribution where the frequency of occurrences is concentrated around a certain value (e.g., 3).

f) Binomial Distribution
The binomial distribution models the number of successes in a fixed number of independent Bernoulli trials.

 Sample Code:

data_binomial = np.random.binomial(n=10, p=0.5, size=1000)

plt.hist(data_binomial, bins=np.arange(12) - 0.5, density=True, alpha=0.6, color='c') # one bar per outcome 0..10

plt.title('Binomial Distribution')

plt.show()

 Output Example: A distribution where outcomes are centered around the expected number of successes.
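As a numeric complement to the histograms, the sketch below draws a larger sample from each of the six distributions and compares the sample mean with the theoretical mean. The seed and sample size are arbitrary choices, and `np.random.default_rng` is used here in place of the legacy `np.random.*` functions shown above.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility
size = 100_000

# Each entry: (random draws, theoretical mean for the same parameters as above)
samples = {
    "uniform": (rng.uniform(0, 10, size), 5.0),                   # (low + high) / 2
    "normal": (rng.normal(0, 1, size), 0.0),                      # loc
    "gamma": (rng.gamma(shape=2, scale=2, size=size), 4.0),       # shape * scale
    "exponential": (rng.exponential(scale=1, size=size), 1.0),    # scale
    "poisson": (rng.poisson(lam=3, size=size), 3.0),              # lam
    "binomial": (rng.binomial(n=10, p=0.5, size=size), 5.0),      # n * p
}

for name, (draws, expected_mean) in samples.items():
    print(f"{name:12s} sample mean = {draws.mean():.3f}   theoretical mean = {expected_mean}")
```

With this many draws, each sample mean should land close to its theoretical value, which is a quick sanity check that the parameters mean what one expects.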

14. Perform Data Cleaning on a sample dataset.

 Steps:

1. Remove duplicates

2. Handle missing values (either impute or drop rows/columns)

3. Correct data types

4. Standardize text (e.g., lowercase)

 Sample Code:

import pandas as pd

# Sample data with missing values
data = {'Name': ['Alice', 'Bob', 'Charlie', None, 'Eve'],
        'Age': [25, None, 30, 22, 29],
        'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Los Angeles']}
df = pd.DataFrame(data)

# Remove duplicate rows (this sample has none, so all five rows remain)
df = df.drop_duplicates()

# Handle missing values
# Impute with mean for 'Age' and mode for 'Name'
df['Age'] = df['Age'].fillna(df['Age'].mean()) # mean of 25, 30, 22, 29 is 26.5
df['Name'] = df['Name'].fillna(df['Name'].mode()[0]) # all names are unique, so mode()[0] is the first in sorted order: 'Alice'

print(df)

 Output Example:

      Name   Age         City
0    Alice  25.0     New York
1      Bob  26.5  Los Angeles
2  Charlie  30.0      Chicago
3    Alice  22.0     New York
4      Eve  29.0  Los Angeles
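Steps 3 and 4 from the list above (correcting data types and standardizing text) can be sketched as follows. The DataFrame `raw` is a separate hypothetical example, not the sample dataset used above.

```python
import pandas as pd

# Hypothetical raw data: inconsistent text and numbers stored as strings
raw = pd.DataFrame({
    "Name": [" alice", "BOB ", "Charlie"],
    "Age": ["25", "thirty", "22"],  # one value cannot be parsed as a number
})

clean = raw.copy()

# Standardize text: trim surrounding whitespace and lowercase
clean["Name"] = clean["Name"].str.strip().str.lower()

# Correct data types: unparseable entries become NaN instead of raising an error
clean["Age"] = pd.to_numeric(clean["Age"], errors="coerce")

print(clean)
print(clean.dtypes)
```

Using errors="coerce" turns bad entries into missing values, which can then be handled with the same imputation or dropping strategies as any other missing value.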

15. Compute Measure of Central Tendency on a sample dataset

a) Mean

mean = df['Age'].mean()

print(f"Mean: {mean}")

output:

Mean: 26.5

b) Median

median = df['Age'].median()

print(f"Median: {median}")

output:

Median: 26.5

c) Mode

mode = df['Age'].mode()[0] # every age occurs once, so mode() returns all values sorted; [0] is the smallest

print(f"Mode: {mode}")

output:

Mode: 22.0

16. Explore Measures of Dispersion on a sample dataset

a) Variance

variance = df['Age'].var() # pandas returns the sample variance (divides by n - 1)

print(f"Variance: {variance:.2f}")

output:

Variance: 10.25

b) Standard Deviation

std_dev = df['Age'].std() # square root of the sample variance

print(f"Standard Deviation: {std_dev:.2f}")

output:

Standard Deviation: 3.20

c) Skewness

from scipy.stats import skew

skewness = skew(df['Age']) # negative value: the longer tail is on the left

print(f"Skewness: {skewness:.2f}")

output:

Skewness: -0.31

d) Kurtosis

from scipy.stats import kurtosis

kurt = kurtosis(df['Age']) # excess kurtosis (normal distribution = 0)

print(f"Kurtosis: {kurt:.2f}")

output:

Kurtosis: -1.20

17. Calculate Percentiles and IQR

a) Percentiles

percentiles = np.percentile(df['Age'], [25, 50, 75])

print(f"25th, 50th, 75th Percentiles: {percentiles}")

output:-

25th, 50th, 75th Percentiles: [25. 26.5 29. ]

b) IQR and Box Plot

IQR = np.percentile(df['Age'], 75) - np.percentile(df['Age'], 25)

print(f"IQR: {IQR}")

# Box Plot

plt.boxplot(df['Age'])

plt.title('Box Plot of Age')

plt.show()

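 Output Example:

IQR: 4.0

The box spans Q1 = 25 to Q3 = 29 with the median line at 26.5. The whiskers reach the minimum (22) and maximum (30), since no point lies more than 1.5 × IQR beyond the quartiles.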

18. Perform the following analysis on an automobile dataset

a) Bivariate Analysis (e.g., Age vs Price)

import seaborn as sns

import matplotlib.pyplot as plt

import pandas as pd

# Creating a sample dataframe (if df_automobile is not defined)

data = {'Age': [1, 2, 3, 4, 5], 'Price': [20000, 18000, 15000, 12000, 10000]}

df_automobile = pd.DataFrame(data)

# Scatter plot

sns.scatterplot(x='Age', y='Price', data=df_automobile)

plt.title('Bivariate Analysis of Age vs Price')

plt.show()
b) Multivariate Analysis (e.g., Age, Price, Engine Size)

import seaborn as sns

import matplotlib.pyplot as plt

import pandas as pd

# Creating a sample dataframe (if df_automobile is not defined)

data = {'Age': [1, 2, 3, 4, 5],

'Price': [20000, 18000, 15000, 12000, 10000],

'EngineSize': [1.2, 1.6, 2.0, 2.5, 3.0]}

df_automobile = pd.DataFrame(data)

# Pairplot

g = sns.pairplot(df_automobile[['Age', 'Price', 'EngineSize']])

# Set title correctly

g.fig.suptitle('Multivariate Analysis of Age, Price, and Engine Size', y=1.02)

plt.show()
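A correlation matrix is a compact numeric companion to the pair plot. The sketch below rebuilds the same sample automobile DataFrame and computes pairwise Pearson correlations.

```python
import pandas as pd

# Same sample data as in the pair plot example
df_automobile = pd.DataFrame({
    "Age": [1, 2, 3, 4, 5],
    "Price": [20000, 18000, 15000, 12000, 10000],
    "EngineSize": [1.2, 1.6, 2.0, 2.5, 3.0],
})

# Pairwise Pearson correlations between all numeric columns
corr_matrix = df_automobile.corr()

print(corr_matrix.round(3))
```

In this toy data, Price falls as Age rises (strong negative correlation) while EngineSize rises with Age (strong positive correlation). With only five made-up rows, this illustrates the mechanics and says nothing about real cars.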

19. Perform Time Series Analysis on Open Power Systems dataset

 Steps:

1. Convert time column to datetime type.

2. Set datetime column as index.

3. Plot time series data.

Sample Code:

import pandas as pd

import matplotlib.pyplot as plt


# Sample DataFrame (if df_power is not already defined)

data = {
    'Date': ['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04', '2024-01-05'],
    'Power': [100, 150, 120, 180, 130]
}

df_power = pd.DataFrame(data) # Define the DataFrame

# Convert 'Date' column to datetime

df_power['Date'] = pd.to_datetime(df_power['Date'])

# Set 'Date' as index

df_power.set_index('Date', inplace=True)

# Plot power consumption over time

df_power['Power'].plot(figsize=(10, 5), marker='o', linestyle='-')

# Set title and labels

plt.title('Power Consumption Over Time')

plt.xlabel('Date')

plt.ylabel('Power')

plt.grid(True) # Improves readability

plt.show()
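A rolling mean is a simple way to smooth short-term fluctuations and expose a trend. The sketch below applies a 3-day window to the same sample data; the window length is an arbitrary choice for illustration.

```python
import pandas as pd

# Same sample data as above, with 'Date' parsed and set as the index
df_power = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03",
                            "2024-01-04", "2024-01-05"]),
    "Power": [100, 150, 120, 180, 130],
}).set_index("Date")

# 3-day rolling mean; the first two rows are NaN because the window is incomplete
df_power["Rolling3"] = df_power["Power"].rolling(window=3).mean()

print(df_power)
```

On a real dataset such as Open Power Systems, longer windows (for example 7 or 365 days) are typically more informative than the 3-day window used here.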

Sample Experiments

1. Study the following Distribution Techniques on a sample data:

o Uniform Distribution: Every outcome is equally likely.

o Normal Distribution: Data is symmetrically distributed around the mean.

o Gamma Distribution: A generalization of the exponential distribution.

o Exponential Distribution: Describes time between events in a Poisson process.

o Poisson Distribution: Models the number of events in a fixed interval of time.

o Binomial Distribution: Models the number of successes in a fixed number of trials.

2. Perform Data Cleaning on a sample dataset:

o Removing missing values, outliers, and correcting inconsistencies in the dataset.

3. Compute Measures of Central Tendency on a sample dataset:

o Calculate mean, median, and mode using Python libraries like numpy or statistics.

4. Explore Measures of Dispersion on a sample dataset:

o Compute variance, standard deviation, skewness, and kurtosis.


5. Calculate Percentiles and Inter Quartile Range (IQR) on sample dataset:

o Use numpy.percentile() to calculate percentiles and IQR.

6. Perform Bivariate and Multivariate Analysis on an automobile dataset:

o Use scatter plots for bivariate analysis and techniques like regression for multivariate analysis.

7. Perform Time Series Analysis on Open Power systems dataset:

o Analyze the time series data for trends, seasonality, and forecasting.

14. Perform Data Cleaning on a sample dataset.

 Steps:

1. Remove duplicates

2. Handle missing values (either impute or drop rows/columns)

3. Correct data types

4. Standardize text (e.g., lowercase)

 Sample Code:

import pandas as pd

# Sample data with missing values
data = {
    'Name': ['Alice', 'Bob', 'Charlie', None, 'Eve'],
    'Age': [25, None, 30, 22, 29],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Los Angeles']
}

# Create DataFrame
df = pd.DataFrame(data)

# Remove duplicates properly by reassigning
df = df.drop_duplicates().copy() # Ensure we're working on a proper copy

# Handle missing values
df['Age'] = df['Age'].fillna(df['Age'].mean()) # Use assignment instead of inplace=True
df['Name'] = df['Name'].fillna(df['Name'].mode()[0]) # All names are unique, so mode()[0] is 'Alice'

# Print cleaned data
print(df)

 Output Example:

      Name   Age         City
0    Alice  25.0     New York
1      Bob  26.5  Los Angeles
2  Charlie  30.0      Chicago
3    Alice  22.0     New York
4      Eve  29.0  Los Angeles

15. Compute Measure of Central Tendency on a sample dataset


a) Mean

mean = df['Age'].mean()

print(f"Mean: {mean}")

b) Median

median = df['Age'].median()

print(f"Median: {median}")

c) Mode

mode = df['Age'].mode()[0] # every age occurs once, so [0] picks the smallest value

print(f"Mode: {mode}")

 Output Example:

Mean: 26.5

Median: 26.5

Mode: 22.0

16. Explore Measures of Dispersion on a sample dataset

a) Variance

variance = df['Age'].var()

print(f"Variance: {variance}")

b) Standard Deviation

std_dev = df['Age'].std()

print(f"Standard Deviation: {std_dev}")

c) Skewness

from scipy.stats import skew

skewness = skew(df['Age'])

print(f"Skewness: {skewness}")

d) Kurtosis

from scipy.stats import kurtosis

kurt = kurtosis(df['Age'])

print(f"Kurtosis: {kurt}")

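 Output Example (values rounded to two decimal places; pandas var() and std() use the sample formulas that divide by n - 1):

Variance: 10.25

Standard Deviation: 3.20

Skewness: -0.31

Kurtosis: -1.20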

17. Calculate Percentiles and IQR

a) Percentiles

percentiles = np.percentile(df['Age'], [25, 50, 75])

print(f"25th, 50th, 75th Percentiles: {percentiles}")

b) IQR and Box Plot

IQR = np.percentile(df['Age'], 75) - np.percentile(df['Age'], 25)

print(f"IQR: {IQR}")

# Box Plot

plt.boxplot(df['Age'])
plt.title('Box Plot of Age')

plt.show()

 Output Example:

25th, 50th, 75th Percentiles: [25.  26.5 29. ]

IQR: 4.0

18. Perform the following analysis on an automobile dataset

a) Bivariate Analysis (e.g., Age vs Price)

import seaborn as sns

# Assume df_automobile has 'Age' and 'Price' columns

sns.scatterplot(x='Age', y='Price', data=df_automobile)

plt.title('Bivariate Analysis of Age vs Price')

plt.show()

b) Multivariate Analysis (e.g., Age, Price, Engine Size)

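g = sns.pairplot(df_automobile[['Age', 'Price', 'EngineSize']])

# pairplot creates its own grid of axes, so set a figure-level title instead of plt.title
g.fig.suptitle('Multivariate Analysis of Age, Price, and Engine Size', y=1.02)

plt.show()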

19. Perform Time Series Analysis on Open Power Systems dataset

 Steps:

1. Convert time column to datetime type.

2. Set datetime column as index.

3. Plot time series data.

Sample Code:

import pandas as pd

import matplotlib.pyplot as plt

# Example: Create a sample df_power DataFrame

data = {
    'Date': ['2025-01-01', '2025-01-02', '2025-01-03', '2025-01-04'],
    'Power': [150, 160, 145, 170]
}

df_power = pd.DataFrame(data)

# Convert 'Date' column to datetime

df_power['Date'] = pd.to_datetime(df_power['Date'])

# Set 'Date' as the index

df_power.set_index('Date', inplace=True)

# Plot the 'Power' column

df_power['Power'].plot(figsize=(10, 5))

# Add title and display plot

plt.title('Power Consumption Over Time')

plt.show()
 Output Example: A time series plot showing power consumption over time.
