
Statistics For Data Science

The document discusses statistical analysis and predictive modeling in data science, highlighting various methods such as descriptive statistics, inferential statistics, and regression analysis. It explains the importance of these techniques in decision-making and forecasting, providing examples of their application using Python. The document concludes with a case study demonstrating how to perform statistical analysis and build a predictive model to estimate sales based on advertising spending.

Uploaded by

Deepa Ravindran

https://www.geeksforgeeks.org/statistics-for-data-science/

https://www.coursera.org/learn/statistics-for-data-science-python

Statistical Analysis or Modeling in Data Analysis

Statistical analysis or modeling involves using mathematical techniques to extract meaningful insights
from data. This can include identifying patterns, relationships, and trends, or making predictions
about future outcomes. It plays a crucial role in decision-making, research, and business intelligence.

1. Statistical Analysis

Statistical analysis involves applying various statistical tests and methods to interpret data, check for
significance, and validate hypotheses.

Types of Statistical Analysis:

✅ Descriptive Statistics: Summarizes data using measures such as mean, median, mode, variance,
and standard deviation. Example: Finding the average sales per month.

✅ Inferential Statistics: Draws conclusions about a population based on a sample using hypothesis
testing and confidence intervals. Example: A/B testing in marketing to compare two advertisement
strategies.

✅ Correlation & Regression Analysis: Determines relationships between variables.

• Correlation: Measures the strength and direction of the linear relationship between two
variables. Example: Relationship between temperature and ice cream sales.

• Regression: Predicts the dependent variable based on one or more independent variables.
Example: Predicting house prices based on size, location, and number of rooms.

✅ Time Series Analysis: Analyzes data points collected over time to identify trends, seasonality, and
cycles. Example: Stock market price predictions.

✅ ANOVA (Analysis of Variance): Compares means of multiple groups to determine if differences are
statistically significant. Example: Comparing customer satisfaction scores across different store
locations.

✅ Chi-Square Test: Checks the association between categorical variables. Example: Analyzing
whether gender is associated with product preference.
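
The tests listed above can be sketched with NumPy and SciPy. This is a minimal illustration, not part of the original case study: all numbers are synthetic or hypothetical, and the variable names (monthly_sales, variant_a, store_1, and so on) are assumptions chosen to mirror the examples in the list.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Descriptive statistics: hypothetical monthly sales figures
monthly_sales = np.array([120, 135, 128, 150, 142, 138, 160, 155, 149, 170, 165, 158])
mean_sales = monthly_sales.mean()
median_sales = np.median(monthly_sales)
std_sales = monthly_sales.std(ddof=1)  # sample standard deviation

# Inferential statistics: A/B test on two synthetic ad variants (Welch's t-test)
variant_a = rng.normal(loc=10.0, scale=2.0, size=200)
variant_b = rng.normal(loc=10.6, scale=2.0, size=200)
t_stat, t_p = stats.ttest_ind(variant_a, variant_b, equal_var=False)

# Correlation: synthetic temperature vs. ice cream sales (linear trend plus noise)
temperature = rng.uniform(15, 35, size=100)
ice_cream = 5.0 * temperature + rng.normal(0, 10, size=100)
r, r_p = stats.pearsonr(temperature, ice_cream)

# ANOVA: synthetic satisfaction scores at three store locations
store_1 = rng.normal(7.0, 1.0, size=50)
store_2 = rng.normal(7.2, 1.0, size=50)
store_3 = rng.normal(6.5, 1.0, size=50)
f_stat, f_p = stats.f_oneway(store_1, store_2, store_3)

# Chi-square test of independence on a hypothetical 2x2 table
# rows: two customer groups, columns: preference for product A or B
observed = np.array([[30, 20], [25, 25]])
chi2, chi_p, dof, expected = stats.chi2_contingency(observed)

print(f"mean={mean_sales:.1f}, median={median_sales:.1f}, std={std_sales:.1f}")
print(f"A/B t-test p-value: {t_p:.4f}")
print(f"Pearson r: {r:.2f} (p={r_p:.4g})")
print(f"ANOVA p-value: {f_p:.4f}")
print(f"Chi-square p-value: {chi_p:.4f} (dof={dof})")
```

In each case a small p-value (commonly below 0.05) is read as evidence against the null hypothesis of no difference or no association. It does not measure how large or how important the effect is.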

2. Predictive Modeling

Predictive modeling involves using statistical and machine learning algorithms to forecast future
trends based on historical data.

Common Predictive Models:

📌 Linear Regression: Predicts a continuous value based on independent variables. Example:
Predicting sales based on marketing spend.

📌 Logistic Regression: Used for binary classification (Yes/No, 0/1). Example: Predicting whether a
customer will churn.

📌 Decision Trees & Random Forest: Tree-based models that classify or predict outcomes. Example:
Predicting loan approval based on credit history.

📌 Time Series Forecasting: ARIMA, Exponential Smoothing, and LSTMs are used for future trend
forecasting. Example: Predicting next quarter’s revenue.

📌 Clustering (K-Means, DBSCAN): Unsupervised methods that group data points based on similarity
rather than predicting a labeled outcome. Example: Customer segmentation for targeted marketing.

📌 Neural Networks & Deep Learning: Advanced models used for complex pattern recognition.
Example: Image classification or fraud detection.
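
As one illustration of the models above, here is a minimal sketch of logistic regression for the churn example. The data is synthetic, and the features (tenure, complaints) and the probabilities behind them are assumptions made up for this sketch, not taken from any real dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500

# Hypothetical customer features: months of tenure and number of complaints
tenure = rng.uniform(1, 60, size=n)
complaints = rng.poisson(1.5, size=n)

# Synthetic ground truth: longer tenure lowers churn risk, complaints raise it
logit = 1.0 - 0.08 * tenure + 0.6 * complaints
churn_prob = 1.0 / (1.0 + np.exp(-logit))
churned = rng.binomial(1, churn_prob)

X_demo = np.column_stack([tenure, complaints])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, churned, test_size=0.2, random_state=42, stratify=churned
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr, y_tr)

accuracy = accuracy_score(y_te, clf.predict(X_te))
churn_proba = clf.predict_proba(X_te)[:, 1]  # predicted probability of churn

print(f"Test accuracy: {accuracy:.2f}")
print(f"Coefficients (tenure, complaints): {clf.coef_[0]}")
```

The model outputs a probability per customer, so the Yes/No decision depends on a threshold (0.5 by default in predict). That threshold can be adjusted when missing a churner costs more than a false alarm.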

3. Choosing the Right Approach

• Use statistical analysis when testing hypotheses, analyzing distributions, or determining
relationships.

• Use predictive modeling when the goal is to forecast trends, classify outcomes, or optimize
business strategies.

Let's go through an example where we perform statistical analysis and predictive modeling using
Python.

Problem Statement:

We have a dataset containing information about a company's advertising budget for TV, Radio, and
Newspaper ads, and we want to:

1. Perform statistical analysis to check correlations.

2. Build a predictive model to predict sales based on advertising spending.

Step 1: Import Libraries and Load Data

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load dataset (Example dataset)
url = "https://raw.githubusercontent.com/selva86/datasets/master/Advertising.csv"
data = pd.read_csv(url)

# Display first 5 rows
print(data.head())
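
The later steps select columns named TV, Radio, Newspaper, and Sales. Copies of this dataset differ in capitalization, and some include an extra index column, so it may help to normalize the column names right after loading. The helper below is a hypothetical sketch (normalize_columns is not a pandas function), shown on a small hand-made frame rather than the downloaded file.

```python
import pandas as pd

EXPECTED_COLUMNS = ["TV", "Radio", "Newspaper", "Sales"]

def normalize_columns(df, expected=EXPECTED_COLUMNS):
    """Rename columns to the expected spelling, matching case-insensitively."""
    lookup = {name.lower(): name for name in expected}
    renamed = df.rename(columns=lambda c: lookup.get(str(c).strip().lower(), c))
    missing = [name for name in expected if name not in renamed.columns]
    if missing:
        raise KeyError(f"Missing expected columns: {missing}")
    # Selecting only the expected columns also drops any leftover index column
    return renamed[expected]

# Hand-made sample imitating a lowercase variant with an extra index column
sample = pd.DataFrame({
    "Unnamed: 0": [1, 2],
    "TV": [230.1, 44.5],
    "radio": [37.8, 39.3],
    "newspaper": [69.2, 45.1],
    "sales": [22.1, 10.4],
})
clean = normalize_columns(sample)
print(clean.columns.tolist())
```

Applied to the tutorial, this would be data = normalize_columns(data) immediately after pd.read_csv(url).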

Step 2: Perform Statistical Analysis

# Summary statistics
print(data.describe())

# Check correlation between features
# numeric_only=True skips any non-numeric columns instead of raising an error
plt.figure(figsize=(8,5))
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Matrix")
plt.show()

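🔹 Insights from Correlation Matrix:

• TV has the strongest positive correlation with Sales.

• Radio is also positively correlated with Sales, but less strongly than TV.

• Newspaper has the weakest correlation with Sales.

Correlation measures linear association only; on its own it does not show that spending on a channel causes sales.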
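
A heatmap shows every pairwise correlation, but for feature selection the column for the target is usually what matters. The sketch below ranks features by their correlation with Sales and attaches a p-value to each. It uses a synthetic stand-in frame (demo) whose coefficients are invented for illustration, so its numbers are not those of the real Advertising data.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)
n = 200

# Synthetic stand-in for the advertising data (assumed structure, invented effects)
demo = pd.DataFrame({
    "TV": rng.uniform(0, 300, size=n),
    "Radio": rng.uniform(0, 50, size=n),
    "Newspaper": rng.uniform(0, 100, size=n),
})
demo["Sales"] = 3.0 + 0.05 * demo["TV"] + 0.1 * demo["Radio"] + rng.normal(0, 1.5, size=n)

# Correlation of each feature with the target, strongest first
corr_with_sales = (
    demo.corr(numeric_only=True)["Sales"].drop("Sales").sort_values(ascending=False)
)

# p-value for each correlation (null hypothesis: no linear association)
p_values = {}
for col in ["TV", "Radio", "Newspaper"]:
    _, p = stats.pearsonr(demo[col], demo["Sales"])
    p_values[col] = p

print(corr_with_sales)
print(p_values)
```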

Step 3: Build a Predictive Model (Linear Regression)

# Define independent (X) and dependent (y) variables
X = data[['TV', 'Radio', 'Newspaper']]
y = data['Sales']

# Split data into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the Linear Regression Model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Evaluate Model Performance
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.2f}")
print(f"R² Score: {r2:.2f}")
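
To make the two metrics concrete, the sketch below computes them by hand and checks the results against scikit-learn. MSE is the mean of the squared residuals, and R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The five actual and predicted values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted sales values
y_true = np.array([22.1, 10.4, 9.3, 18.5, 12.9])
y_hat = np.array([20.5, 12.3, 10.1, 17.6, 13.2])

residuals = y_true - y_hat

# MSE: average squared residual; RMSE puts it back in the units of Sales
mse_manual = np.mean(residuals ** 2)
rmse_manual = np.sqrt(mse_manual)

# R²: share of the variance in y_true that the predictions account for
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(f"MSE:  {mse_manual:.4f} (sklearn: {mean_squared_error(y_true, y_hat):.4f})")
print(f"RMSE: {rmse_manual:.4f}")
print(f"R²:   {r2_manual:.4f} (sklearn: {r2_score(y_true, y_hat):.4f})")
```

RMSE is often easier to report than MSE because it is in the same units as the target.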

Step 4: Interpret the Model

# Print Coefficients
coefficients = pd.DataFrame({'Feature': X.columns, 'Coefficient': model.coef_})
print(coefficients)

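🔹 Insights from Model Coefficients:

• Each coefficient is the expected change in Sales for a one-unit increase in that channel's spend, with the other two channels held fixed.

• Compare the printed values rather than assuming a ranking: the feature with the strongest correlation (TV) does not necessarily have the largest coefficient. In the widely used version of this Advertising dataset, Radio has the larger per-unit coefficient, while TV's is smaller but still positive.

• Newspaper has the smallest coefficient, close to zero, which aligns with its weak correlation with Sales.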
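
Raw coefficients answer "how much does Sales change per unit of spend", which depends on each feature's units and range. To compare the relative influence of features, one common approach is to standardize them before fitting. The sketch below shows both views on synthetic data: the frame and its true coefficients are invented for illustration and were chosen so that the two rankings differ.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
n = 300

# Synthetic features with different ranges (assumed, not the real dataset)
features = pd.DataFrame({
    "TV": rng.uniform(0, 300, size=n),
    "Radio": rng.uniform(0, 50, size=n),
    "Newspaper": rng.uniform(0, 100, size=n),
})
# Invented relationship: Radio has the larger per-unit effect, TV the wider range
target = 3.0 + 0.05 * features["TV"] + 0.2 * features["Radio"] + rng.normal(0, 1.0, size=n)

# Fit on raw features: coefficients are per unit of spend
raw_model = LinearRegression().fit(features, target)

# Fit on standardized features: coefficients are per standard deviation of spend
scaled = StandardScaler().fit_transform(features)
std_model = LinearRegression().fit(scaled, target)

comparison = pd.DataFrame({
    "Feature": features.columns,
    "Raw coefficient": raw_model.coef_,
    "Standardized coefficient": std_model.coef_,
})
print(comparison)
```

In this synthetic setup Radio has the larger raw coefficient, while TV has the larger standardized one because its spend varies over a much wider range. Which view is appropriate depends on whether the question is about return per unit of spend or about which feature explains more of the variation in Sales.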

Conclusion

✔ Statistical Analysis (correlation matrix) helped identify important features.

✔ Predictive Modeling (Linear Regression) created a model to estimate future sales based on ad
spending.

✔ Model Evaluation (R² Score) shows how well the model explains variability in sales.
