Tejaswi
Submitted in partial fulfilment of the requirement for the award of the certificate
course conducted in the 2nd Semester by Edupinnacle
BY
TEJASWI A
P03LZ23M015163
Under the Guidance of
Mr BHARATH RAJANNA
EDUPINNACLE
2024-25
Project 1
“Predictive Modeling for Online Sales Revenue: Insights and
Forecasting”
Introduction
This project focuses on analyzing an online sales dataset to gain valuable insights
into factors driving total revenue in e-commerce. The dataset consists of 240
observations and includes key variables such as transaction ID, date, product
category, product name, units sold, unit price, total revenue, region, and
payment method. By examining these variables, we aim to uncover trends and
patterns in online sales, specifically identifying how units sold, unit price, and
product category influence the overall revenue. The goal is to leverage data-
driven insights to optimize sales strategies, pricing models, and inventory
management.
The second part of the analysis focuses on understanding the regional
performance of online sales. With regions like North America, Europe, and Asia
in the dataset, we will explore how different regions contribute to total sales
revenue and identify any regional preferences for specific product categories.
Using techniques such as Chi-Square tests and visualizations, we analyze how
product categories are distributed across regions and how regional demand
influences the sales strategy. These insights can help businesses fine-tune their
approach to different markets, ensuring that they offer the right products in the
right regions.
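The Chi-Square test of independence mentioned above can be sketched in R as follows. This is a minimal illustration on simulated data, not the project's actual code: the data frame `sales`, its column names, and the "Clothing" category are assumptions made for the example.

```r
# Illustrative sketch only: simulated data standing in for the 240-row sales dataset.
set.seed(42)
regions    <- c("North America", "Europe", "Asia")
categories <- c("Electronics", "Books", "Clothing")  # "Clothing" is a made-up category
sales <- data.frame(
  region   = sample(regions, 240, replace = TRUE),
  category = sample(categories, 240, replace = TRUE)
)

# Contingency table: rows are regions, columns are product categories
tab <- table(sales$region, sales$category)

# Row proportions: share of each category within a region
# (the quantity a stacked bar chart of categories by region displays)
prop <- prop.table(tab, margin = 1)

# Chi-Square test of independence between region and product category
test <- chisq.test(tab)

print(round(prop, 2))
print(test)
```

A p-value below 0.05 would suggest that the mix of product categories differs by region; with the randomly generated data used here, no real association is built in.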
Finally, the project incorporates a predictive modeling approach to forecast future total
revenue based on historical data. By utilizing linear regression, we will model the
relationship between key predictors such as units sold, unit price, product category, and
region to predict total revenue. The accuracy of the model will be assessed through various
visualizations, such as actual vs predicted revenue plots. This predictive model aims to
assist businesses in planning future sales, managing inventory, and creating targeted
marketing campaigns to maximize revenue.
Analysis and Interpretation
Output after Cleaning Data
Code:
Output:
Interpretation:
The Units Sold by Product Category analysis compares the sales volume across
different product categories using a boxplot. This visualization helps identify
which categories consistently perform well, the variability in units sold, and any
outliers. It is a crucial tool for understanding consumer preferences, inventory
management, and prioritizing high-demand categories for promotions or stock
replenishment.
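As a rough illustration of the numbers behind such a boxplot, the sketch below computes the five-number summary and outliers per category with base R. The data frame `sales`, the column names `category` and `units_sold`, and the simulated values are assumptions for the example, not the project's data.

```r
# Illustrative sketch only: simulated units sold for three hypothetical categories.
set.seed(1)
sales <- data.frame(
  category   = rep(c("Electronics", "Books", "Clothing"), each = 80),
  units_sold = c(rpois(80, 3), rpois(80, 5), rpois(80, 4))
)

# plot = FALSE returns the statistics without drawing;
# drop it to draw the actual boxplot.
bp <- boxplot(units_sold ~ category, data = sales, plot = FALSE)

# Rows of bp$stats: lower whisker, Q1, median, Q3, upper whisker
medians <- setNames(bp$stats[3, ], bp$names)
iqr     <- setNames(bp$stats[4, ] - bp$stats[2, ], bp$names)

print(medians)  # typical sales volume per category
print(iqr)      # variability per category
print(bp$out)   # values flagged as outliers
```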
Code:
Output:
Interpretation:
The Regional Revenue Performance analysis evaluates how total revenue varies
across different regions using a violin plot. This visualization highlights both the
median revenue and the distribution of revenue for each region, offering insights
into geographic strengths and disparities. Understanding regional performance
enables businesses to tailor marketing, logistics, and product availability strategies
to maximize profitability in high-performing regions and improve outcomes in
underperforming ones.
Code:
Output:
Interpretation:
Median Revenue: Regions with higher median revenue are top contributors
to overall sales and may warrant focused investments.
Revenue Distribution: Wider distributions indicate diverse sales patterns, with
both high and low-revenue transactions occurring in a region.
Outliers: High-revenue outliers might signal key opportunities, such as premium customer
segments or bulk purchases.
Statistical Data
This graph helps visualize the distribution of total revenue and compares it to
the expected mean of $300.
Code:
Output:
Interpretation:
If $300 lies outside the 95% confidence interval (CI) for the mean, the average total
revenue is statistically different from $300.
A low p-value (< 0.05) suggests that the average revenue is not equal to
$300 and that further actions (e.g., reviewing sales strategies) might be
needed.
A high p-value (≥ 0.05) means the data give no evidence that the average revenue differs
from the expected value ($300), implying no immediate need for strategic
changes.
The t-statistic measures the size of the difference relative to the variation in
the sample, with larger absolute values indicating stronger evidence of a
difference from the hypothesized mean.
This test is helpful for comparing observed data against a target or benchmark,
providing insights on whether the current revenue levels align with
expectations.
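The one-sample t-test described above can be sketched as follows. This is a minimal example on simulated revenue values; the vector name `total_revenue` and the simulated distribution are assumptions, and only the benchmark of $300 comes from the report.

```r
# Illustrative sketch only: simulated revenue values, not the project's data.
set.seed(7)
total_revenue <- rlnorm(240, meanlog = 5.5, sdlog = 0.6)

# One-sample, two-sided t-test of H0: mean total revenue = 300
res <- t.test(total_revenue, mu = 300)

t_stat <- unname(res$statistic)  # size of the difference in standard-error units
p_val  <- res$p.value
ci     <- res$conf.int           # 95% confidence interval for the mean

# The 95% CI excludes 300 exactly when the two-sided p-value is below 0.05
ci_excludes_300 <- ci[1] > 300 || ci[2] < 300

print(res)
```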
Code:
Output:
Interpretation:
1. Null Hypothesis (H0): The mean total revenue is the same across all regions.
2. Alternative Hypothesis (H1): At least one region has a significantly
different mean total revenue from others.
3. p-value Interpretation:
o If the p-value < 0.05, reject the null hypothesis. This means there
are significant differences in the mean total revenue across regions.
o If the p-value ≥ 0.05, fail to reject the null hypothesis. This
suggests there are no significant differences in the mean total
revenue between regions.
o F-Statistic: The F-statistic tests the ratio of variance between group
means to the variance within the groups. A larger value indicates
that the group means are more different from each other compared
to the within-group variability.
4. Post-Hoc Analysis (if p-value < 0.05): If the ANOVA results show
significance, further tests like Tukey’s HSD can be conducted to identify
which specific regions differ from each other.
5. Revenue Differences by Region: If the p-value is significant, it indicates
that the total revenue differs by region. This could imply regional
preferences, varying economic conditions, or different marketing efforts
impacting revenue.
6. Regional Strategies: Regions with significantly higher average revenue may
be targeted for future investments or more focused marketing.
7. Business Decisions: Understanding revenue variations helps optimize
inventory and distribution strategies, ensuring the right products are
available in the right regions at the right time.
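The ANOVA and post-hoc steps listed above can be sketched in R as follows. The data are simulated, and the data frame `sales`, its column names, and the group means are assumptions chosen only to make the example run.

```r
# Illustrative sketch only: simulated revenue for three regions.
set.seed(11)
sales <- data.frame(
  region        = factor(rep(c("Asia", "Europe", "North America"), each = 80)),
  total_revenue = c(rnorm(80, 250, 60), rnorm(80, 300, 60), rnorm(80, 320, 60))
)

# One-way ANOVA: H0 is that mean total revenue is equal across regions
fit <- aov(total_revenue ~ region, data = sales)
tab <- summary(fit)[[1]]

f_stat <- tab[["F value"]][1]  # between-group vs within-group variance
p_val  <- tab[["Pr(>F)"]][1]

# Tukey's HSD gives pairwise region comparisons;
# it is normally interpreted only when the ANOVA is significant.
tukey <- TukeyHSD(fit)

print(tab)
if (p_val < 0.05) print(tukey)
```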
A stacked bar chart shows the distribution of product categories within each region.
Code:
Output:
Interpretation:
Proportional Distribution:
o The stacked bar chart visually represents the proportional
distribution of product categories within each region.
Regional Preferences:
o If certain product categories dominate specific regions (e.g.,
Electronics in North America, Books in Europe), this indicates
Predictive model
Code:
Output:
Interpretation:
Project 2
"Exploring Student Performance: An Analytical Approach with R"
Introduction
Understanding the factors that influence student performance is essential for improving
educational outcomes and addressing disparities. This study focuses on analysing student
performance data using R, with a specific emphasis on the relationships between key
variables such as math, reading, and writing scores, gender, lunch type, and test preparation
courses. By identifying trends and patterns, we aim to gain insights into how various factors
contribute to academic success.
The dataset includes variables that capture student performance across three core subjects—
math, reading, and writing—alongside demographic and contextual factors like gender, lunch
type, and test preparation status. Through statistical analyses, data visualizations, and
predictive modeling, we explore the distribution of scores, gender-based performance
differences, and the impact of external factors like lunch type and test preparation courses on
overall academic achievement. These analyses will provide a foundation for understanding
how to support student learning effectively.
This study employs a combination of exploratory data analysis (EDA) and statistical
modeling to uncover critical insights. For instance, we investigate whether completing a test
preparation course significantly boosts total scores and if gender or lunch type influences
performance. Additionally, a linear regression model is developed to predict average scores
based on individual subject scores. The findings from this study aim to assist educators,
policymakers, and stakeholders in designing strategies that promote equitable and effective
education for all students.
After data preprocessing:
Code:
Output:
Interpretation:
This histogram shows how math scores are distributed among the students. It helps identify
the range, peaks, and potential outliers. If scores cluster around the middle, it suggests most
students have average math proficiency.
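A histogram of this kind can be sketched with base R as shown below. The vector `math_score`, the sample size of 1000, and the simulated distribution are assumptions for the example only.

```r
# Illustrative sketch only: simulated math scores on a 0-100 scale.
set.seed(3)
math_score <- pmin(100, pmax(0, round(rnorm(1000, mean = 66, sd = 15))))

# Bin the scores in steps of 10; plot = FALSE returns the counts
# without drawing, drop it to draw the histogram.
h <- hist(math_score, breaks = seq(0, 100, by = 10), plot = FALSE)

# Midpoint of the most populated bin
peak_bin <- h$mids[which.max(h$counts)]

print(h$counts)
print(peak_bin)
print(range(math_score))  # observed range of scores
```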
Generic analysis: Average Score by Gender
Objective: Compare the average scores of male and female students to evaluate gender-based
performance differences.
Code:
Output:
Interpretation:
This boxplot compares average scores for male and female students. If one gender has a
consistently higher median or smaller variability, it might indicate performance differences
based on gender.
Code:
Output:
Interpretation:
This boxplot highlights whether students with standard or reduced lunch perform better. If
there's a clear difference in medians, it suggests a correlation between lunch type and
academic performance.
1. Statistical analysis
Code:
Output:
This histogram will show how the average scores are distributed, highlighting peaks and
possible skewness.
Interpretation:
The plot reveals whether the average scores are centred around a specific range or spread out.
A roughly symmetric, bell-shaped distribution suggests that most students score near the mean,
while skewness indicates a tail of unusually high or low performers.
Code:
Output:
This boxplot compares the distributions of average scores for males and females.
Interpretation:
If there’s a difference in median scores, it could suggest performance disparity by gender.
Smaller interquartile ranges imply consistent performance.
Code:
Output:
This boxplot compares total scores based on whether students completed the test preparation
course.
Interpretation:
A higher median score for students who completed the course would suggest its effectiveness.
Significant overlaps might imply minimal impact.
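One way to put a number on this comparison is sketched below: group medians plus a Welch two-sample t-test. The t-test is an addition for illustration and is not necessarily the check used in the project. The data frame `students`, its column names, the group sizes, and the simulated scores are all assumptions.

```r
# Illustrative sketch only: simulated total scores for two hypothetical groups.
set.seed(5)
students <- data.frame(
  test_prep   = rep(c("completed", "none"), times = c(350, 650)),
  total_score = c(rnorm(350, 215, 40), rnorm(650, 200, 40))
)

# Median total score per group (the centre line of each box)
medians <- tapply(students$total_score, students$test_prep, median)

# Welch two-sample t-test of H0: equal mean total score in both groups
welch <- t.test(total_score ~ test_prep, data = students)

print(medians)
print(welch)
```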
Code:
Output:
Interpretation:
Model: The regression explains the relationship between individual subject scores and
the overall average.
Graph: Points close to the red line indicate accurate predictions. Deviations suggest
where the model struggles.
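A minimal sketch of such a model is given below, assuming the average score is the arithmetic mean of the three subject scores (an assumption; the report does not state how the average is computed). The data frame `scores` and its column names are hypothetical. Under this assumption the regression recovers coefficients of 1/3 for each subject and fits exactly, so every point would fall on the red line.

```r
# Illustrative sketch only: simulated subject scores.
set.seed(9)
n <- 200
scores <- data.frame(
  math    = round(runif(n, 30, 100)),
  reading = round(runif(n, 30, 100)),
  writing = round(runif(n, 30, 100))
)
# Assumption: the average is the plain mean of the three subjects
scores$average <- (scores$math + scores$reading + scores$writing) / 3

# Linear regression of the average on the individual subject scores
fit <- lm(average ~ math + reading + writing, data = scores)

print(coef(fit))                 # slopes of about 1/3, intercept of about 0
max_abs_resid <- max(abs(resid(fit)))
print(max_abs_resid)             # largest prediction error

# Actual vs predicted plot with a red reference line:
# plot(scores$average, fitted(fit)); abline(0, 1, col = "red")
```

Because the outcome is a deterministic function of the predictors, a perfect fit here says nothing about predictive skill on new information; it only confirms how the average was constructed.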
Project 3
"Predictive Insights into Investment Behavior: A Data-Driven
Approach"
Introduction
In today’s dynamic financial landscape, understanding individuals' investment
preferences is crucial for guiding investment strategies. Investment behaviors are
influenced by a variety of factors, such as age, income, profession, and the sources
individuals rely on for information. The Investment Survey dataset offers insights
into these preferences, capturing demographic and behavioral information from a
sample of 100 participants. The key objective of this analysis is to predict the mode
of investment based on several predictor variables, such as age, income,
professional status, and motivational factors. By leveraging statistical modeling
techniques, this study aims to shed light on how various factors influence
individuals' investment choices.
To explore these relationships, logistic regression is employed, a powerful tool for
predicting categorical outcomes. Logistic regression is particularly suitable for
situations where the dependent variable, in this case, the mode of investment, is
categorical. By analyzing the predictors, we aim to predict how individuals are
likely to choose between different investment options like mutual funds, stocks, or
banking products such as fixed deposits. Additionally, visualizing the predicted
probabilities using plots enhances the interpretability of the model, offering a clear
view of the model's prediction accuracy and distribution.
The results of this analysis are expected to provide valuable insights into the
underlying factors driving investment decisions. By understanding these patterns,
businesses and financial institutions can tailor their marketing and investment
products to better meet the needs of different segments of investors. Furthermore,
this predictive model can help policymakers and financial advisors predict trends in
investment behavior, ensuring that they can offer targeted advice and strategies to
individuals based on their unique
Analysis & Interpretation:
Stage 1: data preprocessing
After data preprocessing
Generic analysis
Generic analysis 1: Distribution of Investment Modes
Objective: Explore how many respondents prefer each mode of investment.
Code:
Output:
Interpretation:
This bar chart presents the count of individuals preferring different investment modes. It
helps identify the most and least popular investment methods among the respondents.
Statistical Data
Code:
Output:
Interpretation:
Objective of the Model:
o The linear regression model was built to predict annual income
conclude that age significantly impacts annual income.
Predicted vs Actual:
o By plotting the predicted values of annual income versus the actual
values, we can visually assess how well the model is performing. A
tight alignment of the predicted vs actual points indicates a good
model fit.
Significance of Model:
o If the model’s overall F-statistic is significant (p-value < 0.05), it
means that at least one of the predictors has a meaningful
relationship with annual income.
Visualization:
o A scatter plot with a regression line (geom_smooth(method = "lm"))
will show the relationship between age and annual income. If the
slope is positive, this indicates that as age increases, so does income,
holding other variables constant.
o The color differentiation (e.g., by gender) allows us to observe any
gender-based trends in income across age groups.
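The checks described in this list can be sketched in base R as follows. The data frame `survey`, its column names, and the simulated coefficients are assumptions for illustration; only the sample size of 100 and the choice of predictors follow the report.

```r
# Illustrative sketch only: simulated survey of 100 respondents.
set.seed(21)
n <- 100
survey <- data.frame(
  age                  = sample(21:60, n, replace = TRUE),
  gender               = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  working_professional = rbinom(n, 1, 0.6)
)
# Made-up relationship so the example has something to detect
survey$annual_income <- 200000 + 15000 * survey$age +
  100000 * survey$working_professional + rnorm(n, 0, 80000)

fit <- lm(annual_income ~ age + working_professional, data = survey)
s   <- summary(fit)

# Overall F-test: is at least one predictor related to income?
f         <- s$fstatistic
overall_p <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

# Slope and p-value for age, holding employment status constant
age_slope <- unname(coef(fit)["age"])
age_p     <- coef(s)["age", "Pr(>|t|)"]

print(c(overall_p = overall_p, age_slope = age_slope, age_p = age_p))

# Predicted vs actual plot: points near the diagonal indicate a good fit
# plot(survey$annual_income, fitted(fit)); abline(0, 1, col = "red")
```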
Code:
Output:
Interpretation:
Logistic Regression can be applied to a binary outcome, such as predicting whether an
individual is a "working professional" (working professional = 1 or 0) based on other
predictor variables like age, gender, and annual income.
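A binary logistic regression of this form can be sketched with `glm()` as shown below. The data are simulated; the data frame `survey`, the column names, and the income scale (thousands, in the hypothetical column `annual_income_k`) are assumptions for the example.

```r
# Illustrative sketch only: simulated survey of 100 respondents.
set.seed(13)
n <- 100
survey <- data.frame(
  age             = sample(21:60, n, replace = TRUE),
  gender          = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  annual_income_k = round(runif(n, 200, 1500))  # income in thousands (assumed scale)
)
# Made-up relationship: older respondents are more likely to be working professionals
survey$working_professional <- rbinom(n, 1, plogis(-4 + 0.1 * survey$age))

# Logistic regression for the binary outcome (1 = working professional)
fit <- glm(working_professional ~ age + gender + annual_income_k,
           data = survey, family = binomial)

# Predicted probabilities and a 0.5 cut-off classification
prob       <- predict(fit, type = "response")
pred_class <- as.integer(prob > 0.5)
accuracy   <- mean(pred_class == survey$working_professional)

print(summary(fit)$coefficients)
print(accuracy)
```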
Code:
Output:
Interpretation:
A Multinomial Logistic Regression (MLR) model is an extension of binary logistic regression
used when the outcome variable has more than two categories (i.e., the response variable is
nominal). In this analysis, we can model a response variable like mode of investment (which
has multiple categories like "Banking - RD, FD," "Stocks - Intraday, long term," "Mutual
Funds," etc.) based on predictor variables such as age, gender, annual income, etc.
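A multinomial model of this kind could be fitted with `multinom()` from the `nnet` package, as in the hedged sketch below. The project's actual code may use a different package. The data frame `survey`, its column names, the shortened category labels, and the simulated values are assumptions for illustration.

```r
# Illustrative sketch only: simulated survey of 100 respondents.
library(nnet)  # provides multinom()

set.seed(17)
n <- 100
survey <- data.frame(
  age             = sample(21:60, n, replace = TRUE),
  gender          = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  annual_income_k = round(runif(n, 200, 1500)),  # income in thousands (assumed scale)
  # Shortened, hypothetical labels for the investment modes
  mode_of_investment = factor(sample(c("Banking", "Stocks", "Mutual Funds"),
                                     n, replace = TRUE))
)

# Multinomial logistic regression; the first factor level is the reference category
fit <- multinom(mode_of_investment ~ age + gender + annual_income_k,
                data = survey, trace = FALSE)

# One predicted probability per respondent and category, plus the most likely class
probs <- predict(fit, type = "probs")
pred  <- predict(fit, type = "class")

print(head(round(probs, 3)))
print(table(predicted = pred, actual = survey$mode_of_investment))
```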
Predictive model
professional status. This analysis assumes a linear
relationship between these variables and annual income. From the model, we
observe how age and employment status contribute to variations in income levels.
A scatter plot of predicted versus actual income demonstrates the model's
performance, where points near the diagonal red line indicate accurate predictions.
Insights drawn from this analysis can help identify trends in income disparities
based on demographic and occupational factors, which are crucial for understanding
financial behavior.
Code:
Output:
Interpretation:
PROJECT 4
Stage – 2 Analysis and interpretation:
1. Generic Analysis
Code:
Output:
Interpretation: The plot reveals whether the bank predominantly issues debit or credit
cards, helping understand customer preferences or institutional focus.
Objective: Analyze credit limit differences between debit and credit cards.
Code:
Output:
Interpretation: Higher limits on credit cards align with their intended usage for credit
purchases.
Code:
Output:
Code:
Output:
Interpretation: This plot compares the number of cards with chips versus those without.
It highlights the adoption of chip technology, which is important for security and transaction
efficiency.
Code:
Output:
Interpretation: This visualization shows when clients last changed their PINs. It can
indicate how often clients update their security measures, which is vital for risk
management.
Statistical analysis:
Correlation Between CVV and Credit Limit
Code:
Output:
Interpretation:
No significant correlation is expected between CVV and credit limits. Any patterns observed
could indicate systemic issues in card or limit assignment processes.
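A correlation check of this kind can be sketched with `cor.test()`. The data frame `cards`, the column names `cvv` and `credit_limit`, and the simulated values are assumptions for the example, generated independently of each other.

```r
# Illustrative sketch only: simulated card data with no built-in relationship.
set.seed(23)
cards <- data.frame(
  cvv          = sample(0:999, 500, replace = TRUE),
  credit_limit = round(runif(500, 0, 50000))
)

# Pearson correlation with a test of H0: correlation = 0
ct <- cor.test(cards$cvv, cards$credit_limit)

r     <- unname(ct$estimate)  # correlation coefficient, expected to be near 0
p_val <- ct$p.value

print(ct)
```

A coefficient near zero with a p-value above 0.05 would be consistent with the expectation stated above that CVV values carry no information about credit limits.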
Stage – 3 PREDICTIVE MODEL
Objective:
Code:
Output:
Interpretation:
1. Regression Line: The red line shows the predicted relationship between
num_cards_issued and credit limit.
2. Confidence Interval: The shaded area represents the confidence interval of the
regression line.
3. Insights: If the slope of the regression line is positive, clients with more cards issued
tend to have higher credit limits. A flat slope would indicate no linear relationship, and a
negative slope an inverse one.
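The regression line and its confidence band can be sketched in base R as follows. The data frame `cards`, the column name `credit_limit`, and the simulated relationship are assumptions; only `num_cards_issued` is named in the report.

```r
# Illustrative sketch only: simulated card data with a made-up positive slope.
set.seed(29)
cards <- data.frame(num_cards_issued = sample(1:3, 300, replace = TRUE))
cards$credit_limit <- 5000 + 2000 * cards$num_cards_issued + rnorm(300, 0, 3000)

# Simple linear regression of credit limit on number of cards issued
fit   <- lm(credit_limit ~ num_cards_issued, data = cards)
slope <- unname(coef(fit)["num_cards_issued"])

# Fitted line with its 95% confidence band (the shaded area in the plot)
grid <- data.frame(num_cards_issued = 1:3)
band <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)

print(slope)
print(cbind(grid, band))  # columns: fit, lwr, upr
```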