
“A STUDY ON EXPLORING DATA PATTERN:

AN R PROGRAMMING APPROACH TO ANALYSIS THE DATA”

Submitted in partial fulfilment of the requirement for the award of the certificate
course conducted in 2nd Semester by Edu pinnacle

MASTER OF BUSINESS ADMINISTRATION


OF
BANGALORE UNIVERSITY

BY
TEJASWI A
P03LZ23M015163
Under the Guidance of
Mr BHARATH RAJANNA
EDUPINNACLE
2024-25

Project 1
“Predictive Modeling for Online Sales Revenue: Insights and
Forecasting”

Introduction
This project focuses on analyzing an online sales dataset to gain valuable insights
into factors driving total revenue in e-commerce. The dataset consists of 240
observations and includes key variables such as transaction ID, date, product
category, product name, units sold, unit price, total revenue, region, and
payment method. By examining these variables, we aim to uncover trends and
patterns in online sales, specifically identifying how units sold, unit price, and
product category influence the overall revenue. The goal is to leverage data-
driven insights to optimize sales strategies, pricing models, and inventory
management.
The second part of the analysis focuses on understanding the regional
performance of online sales. With regions like North America, Europe, and Asia
in the dataset, we will explore how different regions contribute to total sales
revenue and identify any regional preferences for specific product categories.
Using techniques such as Chi-Square tests and visualizations, we analyze how
product categories are distributed across regions and how regional demand
influences the sales strategy. These insights can help businesses fine-tune their
approach to different markets, ensuring that they offer the right products in the
right regions.

Finally, the project incorporates a predictive modeling approach to forecast future total
revenue based on historical data. By utilizing linear regression, we will model the
relationship between key predictors such as units sold, unit price, product category, and
region to predict total revenue. The accuracy of the model will be assessed through various
visualizations, such as actual vs predicted revenue plots. This predictive model aims to
assist businesses in planning future sales, managing inventory, and creating targeted
marketing campaigns to maximize revenue.

Analysis and Interpretation

Stage 1: Data Preprocessing

Output after Cleaning Data

Stage 2: Analysis and Interpretation


Analysis:1 Distribution of Total Revenue

The Distribution of Total Revenue provides
insights into the sales performance across all transactions. By plotting a histogram, we can
observe the frequency of transactions at different revenue levels, identify the central
tendency, and detect the presence of outliers or unusual revenue patterns. This analysis helps
businesses understand the most common revenue brackets and supports targeted strategies to
optimize low-performing or leverage high-performing transactions.

Code:

Output:

Interpretation:

1. This histogram will help determine if revenue is skewed, with many
transactions having low or high values.
2. Insights from this can guide decisions like focusing on higher-value
sales or addressing low-value sales strategies.
3. Central Tendency: Most transactions fall within a specific revenue
range, reflecting common sales trends.
4. Skewness: The distribution may reveal skewness, indicating
whether sales are dominated by small or large revenues.
5. Outliers: Transactions with very high revenues might represent
high-ticket items or bulk purchases, requiring special focus.
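The histogram and the checks listed above could be sketched in R as follows. This is only an illustration: the data frame name sales, its column names, and the simulated values are assumptions, not the project's dataset or its original code.

```r
# Simulated stand-in for the online sales data (240 rows, hypothetical values).
set.seed(42)
sales <- data.frame(
  units_sold = sample(1:10, 240, replace = TRUE),
  unit_price = round(runif(240, min = 10, max = 500), 2)
)
sales$total_revenue <- sales$units_sold * sales$unit_price

# Histogram of revenue per transaction; breaks = 20 only suggests the bin count.
h <- hist(sales$total_revenue, breaks = 20, col = "steelblue",
          main = "Distribution of Total Revenue", xlab = "Total revenue")

# Central tendency: a mean above the median hints at right skew.
rev_mean <- mean(sales$total_revenue)
rev_median <- median(sales$total_revenue)

# High-revenue outliers by the 1.5 * IQR rule.
upper_fence <- unname(quantile(sales$total_revenue, 0.75)) +
  1.5 * IQR(sales$total_revenue)
outliers <- sales$total_revenue[sales$total_revenue > upper_fence]

cat("mean:", rev_mean, "median:", rev_median,
    "high outliers:", length(outliers), "\n")
```

Comparing the mean with the median gives a quick numeric check on the skewness that the histogram shows visually.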

Analysis:2 Units Sold by Product Category

The Units Sold by Product Category analysis compares the sales volume across
different product categories using a boxplot. This visualization helps identify
which categories consistently perform well, the variability in units sold, and any
outliers. It is a crucial tool for understanding consumer preferences, inventory
management, and prioritizing high-demand categories for promotions or stock
replenishment.

Code:

Output:

Interpretation:

1. This boxplot visualizes variations in sales volume for each category.
2. A higher median indicates a category that performs well consistently,
whereas a wider box or longer whiskers indicate more variable sales.
3. Median Sales: Categories with higher median units sold are consistent
performers and likely contribute significantly to overall sales.
4. Spread of Units Sold: A wider spread indicates variability, suggesting some
items sell much better than others within the same category.
5. Top Performers: Categories with high upper quartiles may reflect customer
favourites or seasonal trends.
6. Low Performers: Categories with lower sales may need better marketing
strategies or product adjustments.
7. Outliers: Outliers can indicate occasional bulk sales or highly popular
items in a category.
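As a minimal sketch of this comparison, assuming a data frame sales with columns product_category and units_sold (the category "Clothing" and all values below are simulated for illustration), the boxplot and the per-category median and spread could be obtained like this:

```r
# Simulated data; category names and counts are hypothetical.
set.seed(1)
sales <- data.frame(
  product_category = factor(sample(c("Electronics", "Books", "Clothing"),
                                   240, replace = TRUE)),
  units_sold = rpois(240, lambda = 4) + 1
)

# One box per category; the returned object holds the plotted statistics.
bp <- boxplot(units_sold ~ product_category, data = sales,
              main = "Units Sold by Product Category",
              xlab = "Product category", ylab = "Units sold")

# Rows of bp$stats: lower whisker, lower hinge, median, upper hinge, upper whisker.
category_median <- tapply(sales$units_sold, sales$product_category, median)
category_iqr <- tapply(sales$units_sold, sales$product_category, IQR)

print(category_median)  # consistency of performance
print(category_iqr)     # variability within each category
print(bp$out)           # points drawn as outliers
```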

Analysis:3 Regional Revenue Performance

The Regional Revenue Performance analysis evaluates how total revenue varies
across different regions using a violin plot. This visualization highlights both the
median revenue and the distribution of revenue for each region, offering insights
into geographic strengths and disparities. Understanding regional performance
enables businesses to tailor marketing, logistics, and product availability strategies
to maximize profitability in high-performing regions and improve outcomes in
underperforming ones.

Code:

Output:

Interpretation:

• Median Revenue: Regions with higher median revenue are top contributors
to overall sales and may warrant focused investments.
• Revenue Distribution: Wider distributions indicate diverse sales patterns, with
both high and low-revenue transactions occurring in a region.
• Strong Markets: Regions with consistently high revenue values represent
strong markets for targeted expansion or retention efforts.
• Underperforming Regions: Regions with low median and narrow distributions
may need strategic interventions to boost sales.
• Outliers: High-revenue outliers might signal key opportunities, such as
premium customer segments or bulk purchases.
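The regional comparison can be illustrated with the sketch below. It assumes a data frame sales with columns region and total_revenue (values are simulated), and it draws the violin plot with ggplot2's geom_violin only if that package is installed; the numeric summaries use base R.

```r
# Simulated revenue by region; values are hypothetical.
set.seed(7)
sales <- data.frame(
  region = sample(c("North America", "Europe", "Asia"), 240, replace = TRUE),
  total_revenue = round(rlnorm(240, meanlog = 5.5, sdlog = 0.6), 2)
)

# Median and spread per region, the two quantities the violin plot highlights.
region_median <- tapply(sales$total_revenue, sales$region, median)
region_iqr <- tapply(sales$total_revenue, sales$region, IQR)
print(round(cbind(median = region_median, iqr = region_iqr), 2))

# Violin plot with an embedded boxplot (skipped if ggplot2 is unavailable).
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(sales, ggplot2::aes(x = region, y = total_revenue)) +
    ggplot2::geom_violin(fill = "lightblue") +
    ggplot2::geom_boxplot(width = 0.1) +
    ggplot2::labs(title = "Regional Revenue Performance",
                  x = "Region", y = "Total revenue")
  print(p)
}
```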

Statistical Data

Data-1: One-Sample t-Test (on total revenue)

This graph helps visualize the distribution of total revenue and compares it to
the expected mean of $300.

Code:

Output:

Interpretation:

• Null Hypothesis (H0): The average total revenue is equal to $300.
• Alternative Hypothesis (H1): The average total revenue is different from $300.
• p-value Interpretation:
o If the p-value is less than 0.05, we reject the null hypothesis,
indicating that the average total revenue is significantly different
from $300.
o If the p-value is greater than or equal to 0.05, we fail to reject the
null hypothesis, implying that the data show no significant difference
between the average total revenue and $300.
• Confidence Interval (CI):
o The CI provides a range of plausible values for the true population
mean of total revenue at a stated confidence level (for example, 95%).
o If $300 is outside the CI, it suggests that the average total revenue is
statistically different from $300.
• A low p-value (< 0.05) suggests that the average revenue is not equal to
$300 and that further actions (e.g., reviewing sales strategies) might be
needed.
• A high p-value (≥ 0.05) indicates that the data are consistent with
the expected value ($300), implying no immediate need for strategic
changes.
• The t-statistic indicates the size of the difference relative to the variation in
the sample, with higher absolute values suggesting a more significant
difference from the hypothesized mean.

This test is helpful for comparing observed data against a target or benchmark,
providing insights on whether the current revenue levels align with
expectations.
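A one-sample t-test against the $300 benchmark could be run as in the sketch below. The revenue vector is simulated for illustration; only the benchmark value of 300 comes from the text above.

```r
# Simulated revenue values (hypothetical); the benchmark of 300 is from the text.
set.seed(123)
total_revenue <- round(rnorm(240, mean = 320, sd = 150), 2)

# H0: mean revenue = 300, H1: mean revenue != 300 (two-sided by default).
tt <- t.test(total_revenue, mu = 300)
print(tt)

# Decision at the 5% level.
reject_h0 <- tt$p.value < 0.05

# Equivalent check: is 300 outside the 95% confidence interval?
outside_ci <- 300 < tt$conf.int[1] || 300 > tt$conf.int[2]

cat("t =", tt$statistic, " p =", tt$p.value,
    " reject H0:", reject_h0, " 300 outside CI:", outside_ci, "\n")
```

The two decision rules always agree for a two-sided test at matching levels, which is why the p-value and the confidence interval can be read interchangeably here.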

Data-2: ANOVA (Analysis of Variance) (on total revenue by region)

A boxplot shows the distribution of total revenue across different regions.

Code:

Output:

Interpretation:

1. Null Hypothesis (H0): The mean total revenue is the same across all regions.
2. Alternative Hypothesis (H1): At least one region has a significantly
different mean total revenue from others.
3. p-value Interpretation:
o If the p-value < 0.05, reject the null hypothesis. This means there
are significant differences in the mean total revenue across regions.
o If the p-value ≥ 0.05, fail to reject the null hypothesis. This
suggests there are no significant differences in the mean total
revenue between regions.
o F-Statistic: The F-statistic tests the ratio of variance between group
means to the variance within the groups. A larger value indicates
that the group means are more different from each other compared
to the within-group variability.
4. Post-Hoc Analysis (if p-value < 0.05): If the ANOVA results show
significance, further tests like Tukey’s HSD can be conducted to identify
which specific regions differ from each other.
5. Revenue Differences by Region: If the p-value is significant, it indicates
that the total revenue differs by region. This could imply regional
preferences, varying economic conditions, or different marketing efforts
impacting revenue.
6. Regional Strategies: Regions with significantly higher average revenue may
be targeted for future investments or more focused marketing.
7. Business Decisions: Understanding revenue variations helps optimize
inventory and distribution strategies, ensuring the right products are
available in the right regions at the right time.
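The ANOVA and the conditional post-hoc step described above might look like this in R. The data frame sales and its values are simulated assumptions; aov and TukeyHSD are standard functions from R's stats package.

```r
# Simulated revenue by region (hypothetical values).
set.seed(2024)
sales <- data.frame(
  region = factor(sample(c("North America", "Europe", "Asia"),
                         240, replace = TRUE)),
  total_revenue = rnorm(240, mean = 300, sd = 100)
)

# Boxplot of revenue by region, then the one-way ANOVA.
boxplot(total_revenue ~ region, data = sales,
        main = "Total Revenue by Region", ylab = "Total revenue")
fit <- aov(total_revenue ~ region, data = sales)
anova_table <- summary(fit)[[1]]
print(anova_table)

# Extract the F-statistic and its p-value for the region term.
f_value <- anova_table[["F value"]][1]
p_value <- anova_table[["Pr(>F)"]][1]

# Run Tukey's HSD only when the overall test is significant.
if (p_value < 0.05) {
  print(TukeyHSD(fit))
} else {
  cat("No significant difference between regions at the 5% level\n")
}
```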

Data-3: Chi-Square Test (on product category and region)

A stacked bar chart shows the distribution of product categories within each region.

Code:

Output:

Interpretation:

• Proportional Distribution:
o The stacked bar chart visually represents the proportional
distribution of product categories within each region.
• Regional Preferences:
o If certain product categories dominate specific regions (e.g.,
Electronics in North America, Books in Europe), this indicates
regional preferences. For example, a larger segment of the bar for
Electronics in North America suggests a strong market for tech
products in that region.
• Balanced Distribution vs. Skewed Distribution:
o If the distribution is balanced (i.e., each product category has a
similar proportion in each region), it suggests uniform demand
across regions. A skewed distribution (with one or two categories
heavily dominating in certain regions) suggests regional
specialization.
• Insights for Marketing & Inventory:
o Regions where certain product categories have a higher proportion
should be targeted with region-specific marketing strategies and
inventory adjustments.
• Significance of Association:
o If the Chi-Square test yields a significant p-value (< 0.05), this
indicates that product category and region are not independent. In
other words, the mix of product categories sold differs by region.
If the p-value is not significant, the data provide no evidence of
such dependence, which is consistent with product sales being
spread similarly across regions.
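A sketch of the Chi-Square test of independence and the matching stacked bar chart is given below. The data frame sales, the category "Clothing", and the simulated values are assumptions made for illustration.

```r
# Simulated category and region labels (hypothetical).
set.seed(99)
sales <- data.frame(
  region = sample(c("North America", "Europe", "Asia"), 240, replace = TRUE),
  product_category = sample(c("Electronics", "Books", "Clothing"),
                            240, replace = TRUE)
)

# Contingency table: rows are categories, columns are regions.
counts <- table(sales$product_category, sales$region)
print(counts)

# H0: product category and region are independent.
chi <- chisq.test(counts)
print(chi)

# Stacked bars of category shares within each region (columns sum to 1).
shares <- prop.table(counts, margin = 2)
barplot(shares, legend.text = TRUE,
        main = "Product Category Share by Region", ylab = "Proportion")
```

The expected counts in chi$expected are worth inspecting before trusting the p-value, since the approximation is usually considered unreliable when many of them fall below 5.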

Predictive model

Linear Regression Model

Code:

Output:

Interpretation:

• Overall Model Performance:
o The linear regression model aims to predict total revenue based on
units sold, unit price, region, and product category.
o R-squared value: The proportion of variance in the total revenue
explained by the model. A higher value indicates a better fit.
o Significant predictors: The p-values for each coefficient will
indicate whether the predictors (e.g., units sold, unit price, region,
product category) are statistically significant. If the p-value is <
0.05, the predictor is significant in explaining total revenue.
• Coefficient Interpretation:
o Units Sold: A positive coefficient suggests that as the number of
units sold increases, total revenue increases, which is expected. The
magnitude of the coefficient tells you how much revenue changes
per unit sold.
o Unit Price: A positive coefficient for unit price means that as the
price per unit increases, total revenue also increases, reflecting the
direct impact of pricing on revenue.
o Region: The coefficients for different regions show how much
revenue is expected in comparison to the reference region. A
positive or negative coefficient indicates regions with higher or
lower revenues than the reference region, respectively.
o Product Category: Similar to region, the coefficients for product
categories indicate how each category performs in comparison to
the reference category.
• Accuracy of Predictions:
o The actual vs predicted plot visually shows how closely the
predicted values align with the actual total revenue.
o Points near the regression line: If the points are close to the line, it
indicates that the model's predictions are accurate for those data
points.
o Outliers: Points that fall far from the line represent instances where
the model struggles to predict total revenue accurately. These could
be outliers or areas where the model needs improvement.
• Model Residuals:
o Residual analysis: Checking residuals (the difference between actual
and predicted values) can help identify patterns that the model
hasn’t captured, suggesting areas for improvement.
o If residuals are randomly scattered around zero, the model’s
assumptions are likely met, and it's a good fit. If not, it may indicate
the need for model adjustments (e.g., non-linear relationships,
interactions, or outlier handling).
• Next Steps for Improvement:
o Refining the Model: If the model shows significant room for
improvement, consider adding interaction terms, transformations
(e.g., log), or using more advanced models like decision trees or
random forests.
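The model, the actual vs predicted plot, and the residual check described above can be sketched as follows. The data are simulated, and the rule total_revenue = units_sold * unit_price is an assumption of this example rather than something stated about the project's dataset.

```r
# Simulated sales data (hypothetical values and category names).
set.seed(10)
n <- 240
sales <- data.frame(
  units_sold = sample(1:10, n, replace = TRUE),
  unit_price = round(runif(n, min = 10, max = 500), 2),
  region = factor(sample(c("North America", "Europe", "Asia"), n, replace = TRUE)),
  product_category = factor(sample(c("Electronics", "Books", "Clothing"),
                                   n, replace = TRUE))
)
sales$total_revenue <- sales$units_sold * sales$unit_price  # assumed rule

# Linear model with the four predictors named in the text.
model <- lm(total_revenue ~ units_sold + unit_price + region + product_category,
            data = sales)
model_summary <- summary(model)
print(model_summary$coefficients)   # estimates, t values, p-values
r_squared <- model_summary$r.squared

# Actual vs predicted: points on the red 45-degree line are exact predictions.
predicted <- fitted(model)
plot(sales$total_revenue, predicted,
     xlab = "Actual revenue", ylab = "Predicted revenue")
abline(0, 1, col = "red")

# Residuals vs fitted: a curved pattern would suggest a missing non-linear term.
res <- residuals(model)
plot(predicted, res, xlab = "Predicted revenue", ylab = "Residual")
abline(h = 0, lty = 2)
```

Region and product category enter as factors, so R reports each level's coefficient relative to the first level, which serves as the reference category.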

Project 2
"Exploring Student Performance: An Analytical Approach with R"

Introduction
Understanding the factors that influence student performance is essential for improving
educational outcomes and addressing disparities. This study focuses on analysing student
performance data using R, with a specific emphasis on the relationships between key
variables such as math, reading, and writing scores, gender, lunch type, and test preparation
courses. By identifying trends and patterns, we aim to gain insights into how various factors
contribute to academic success.
The dataset includes variables that capture student performance across three core subjects—
math, reading, and writing—alongside demographic and contextual factors like gender, lunch
type, and test preparation status. Through statistical analyses, data visualizations, and
predictive modeling, we explore the distribution of scores, gender-based performance
differences, and the impact of external factors like lunch type and test preparation courses on
overall academic achievement. These analyses will provide a foundation for understanding
how to support student learning effectively.

This study employs a combination of exploratory data analysis (EDA) and statistical
modeling to uncover critical insights. For instance, we investigate whether completing a test
preparation course significantly boosts total scores and if gender or lunch type influences
performance. Additionally, a linear regression model is developed to predict average scores
based on individual subject scores. The findings from this study aim to assist educators,
policymakers, and stakeholders in designing strategies that promote equitable and effective
education for all students.

Analysis & Interpretation:


Stage 1: data preprocessing

After data preprocessing:

Stage 2: Analysis and interpretation


Generic analysis

Generic analysis 1: Distribution of Math Scores

Objective: Explore the distribution of math scores to understand its spread and central
tendencies.

Code:

Output:

Interpretation:
This histogram shows how math scores are distributed among the students. It helps identify
the range, peaks, and potential outliers. If scores cluster around the middle, it suggests most
students have average math proficiency.

Generic analysis: Average Score by Gender
Objective: Compare the average scores of male and female students to evaluate gender-based
performance differences.

Code:

Output:

Interpretation:
This boxplot compares average scores for male and female students. If one gender has a
consistently higher median or smaller variability, it might indicate performance differences
based on gender.

Generic analysis: Effect of Lunch on Total Scores

Objective: Examine whether lunch type (standard/reduced) affects total scores.

Code:

Output:

Interpretation:
This boxplot highlights whether students with standard or reduced lunch perform better. If
there's a clear difference in medians, it suggests an association between lunch type and
academic performance.

1. Statistical analysis

Distribution of Average Scores


Objective:
To examine the spread and central tendency of average scores.

Code:

Output:

This histogram will show how the average scores are distributed, highlighting peaks and
possible skewness.

Interpretation:
The plot reveals whether the average scores are centred around a specific range or spread out.
A roughly symmetric distribution suggests that most students score near the mean, while
skewness indicates a tail of unusually low or high performers.

Gender vs Average Score


Objective:
To compare the average scores based on gender.

Code:

Output:

This boxplot compares the distributions of average scores for males and females.

Interpretation:
If there’s a difference in median scores, it could suggest performance disparity by gender.
Smaller interquartile ranges imply consistent performance.

Test Preparation Course vs Total Scores


Objective:
To analyse if completing a test preparation course impacts total score.

Code:

Output:

This boxplot compares total scores based on whether students completed the test preparation
course.

Interpretation:
A higher median score for students who completed the course would suggest its effectiveness.
Significant overlaps might imply minimal impact.
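A possible R sketch of this comparison follows. The data frame students, its column names, the labels "completed" and "none", and the scores are all simulated assumptions. The Welch t-test at the end goes one step beyond the boxplot described above and is included only as one way to judge whether the difference in total scores is significant.

```r
# Simulated student scores (hypothetical); clip keeps scores within 0-100.
set.seed(5)
n <- 200
clip <- function(x) pmin(100, pmax(0, round(x)))
students <- data.frame(
  test_prep = factor(sample(c("completed", "none"), n, replace = TRUE)),
  math_score = clip(rnorm(n, 66, 15)),
  reading_score = clip(rnorm(n, 69, 14)),
  writing_score = clip(rnorm(n, 68, 15))
)
students$total_score <- students$math_score + students$reading_score +
  students$writing_score

# Boxplot of total score by test preparation status.
boxplot(total_score ~ test_prep, data = students,
        main = "Total Score by Test Preparation Course",
        xlab = "Test preparation", ylab = "Total score")

# Group medians, the quantity read off the boxplot.
group_median <- tapply(students$total_score, students$test_prep, median)
print(group_median)

# Welch two-sample t-test on the group means (an addition to the boxplot).
welch <- t.test(total_score ~ test_prep, data = students)
print(welch)
```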

Linear Regression Model: Average Score vs Math, Reading, and Writing Scores

Objective:
To predict the average score based on individual subject scores (math, reading, and writing).

Code:

Output:

Interpretation:
• Model: The regression explains the relationship between individual subject scores and
the overall average.
• Graph: Points close to the red line indicate accurate predictions. Deviations suggest
where the model struggles.
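The sketch below illustrates this model on simulated scores. It assumes that the average score is computed as the arithmetic mean of the three subject scores; the text does not state how the average was derived. Under that assumption the regression is exact: each coefficient is 1/3, the intercept is 0, and every point falls on the red line.

```r
# Simulated subject scores (hypothetical values and column names).
set.seed(11)
n <- 200
students <- data.frame(
  math_score = sample(30:100, n, replace = TRUE),
  reading_score = sample(30:100, n, replace = TRUE),
  writing_score = sample(30:100, n, replace = TRUE)
)
# Assumption: the average is the arithmetic mean of the three scores.
students$average_score <- rowMeans(
  students[, c("math_score", "reading_score", "writing_score")]
)

# Regress the average on the three subject scores.
model <- lm(average_score ~ math_score + reading_score + writing_score,
            data = students)
print(coef(model))

# Predicted vs actual with the red reference line mentioned in the text.
predicted <- fitted(model)
plot(students$average_score, predicted,
     xlab = "Actual average score", ylab = "Predicted average score")
abline(0, 1, col = "red")
```

If the fitted coefficients differ noticeably from 1/3 on real data, the average was probably computed differently (for example, rounded or weighted), and the deviations in the plot would show where.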

Project 3
"Predictive Insights into Investment Behavior: A Data-Driven
Approach"

Introduction
In today’s dynamic financial landscape, understanding individuals' investment
preferences is crucial for guiding investment strategies. Investment behaviors are
influenced by a variety of factors, such as age, income, profession, and the sources
individuals rely on for information. The Investment Survey dataset offers insights
into these preferences, capturing demographic and behavioral information from a
sample of 100 participants. The key objective of this analysis is to predict the mode
of investment based on several predictor variables, such as age, income,
professional status, and motivational factors. By leveraging statistical modeling
techniques, this study aims to shed light on how various factors influence
individuals' investment choices.
To explore these relationships, logistic regression is employed, a powerful tool for
predicting categorical outcomes. Logistic regression is particularly suitable for
situations where the dependent variable, in this case, the mode of investment, is
categorical. By analyzing the predictors, we aim to predict how individuals are
likely to choose between different investment options like mutual funds, stocks, or
banking products such as fixed deposits. Additionally, visualizing the predicted
probabilities using plots enhances the interpretability of the model, offering a clear
view of the model's prediction accuracy and distribution.
The results of this analysis are expected to provide valuable insights into the
underlying factors driving investment decisions. By understanding these patterns,
businesses and financial institutions can tailor their marketing and investment
products to better meet the needs of different segments of investors. Furthermore,
this predictive model can help policymakers and financial advisors predict trends in
investment behavior, ensuring that they can offer targeted advice and strategies to
individuals based on their unique
Analysis & Interpretation:
Stage 1: data preprocessing

After data preprocessing

Stage 2: Analysis and interpretation

Generic analysis
Generic analysis 1: Distribution of Investment Modes
Objective: Explore how respondents are distributed across investment modes to identify the
most and least popular options.

Code:

Output:

Interpretation:
This bar chart presents the count of individuals preferring different investment modes. It
helps identify the most and least popular investment methods among the respondents.

Statistical Data

Data-1: Linear Regression Model

We’ll create a linear regression model to predict annual income based on
predictors such as age, working professional status, and gender. We will then
visualize the regression line with respect to age.

Code:

Output:

Interpretation:

Objective of the Model:
o The linear regression model was built to predict annual income
(dependent variable) based on age, working professional status
(binary: 1 for working, 0 for non-working), and gender (categorical:
Male or Female).
Model Formula:
o The formula used in the model is:
o annual income = β0 + β1(age) + β2(working professional) + β3(gender)
• Where β0 is the intercept, and β1, β2, and β3 are the
coefficients for the predictor variables.
Key Insights from the Coefficients:
o Age: The coefficient for age indicates whether income increases or
decreases with age. A positive value suggests that, as age increases,
annual income tends to increase.
o Working Professional: The working professional variable is binary
(0 or 1). The coefficient tells you how much more or less
income is expected for individuals who are employed versus those
who are not.
o Gender: The coefficient for gender (if coded as a factor) represents
the expected difference in income between males and females. If the
coefficient for gender is significant, it implies gender-based
disparities in income within this dataset.
Model Assumptions Check:
o Linearity: The relationship between predictors (age, working
professional, and gender) and annual income should be linear.
o Homoscedasticity: The variance of the residuals should remain
constant across the values of age, working professional, and gender.
o Independence: The observations should be independent of one
another, which can be checked using residual plots.
o Normality of Errors: The residuals should follow a normal
distribution. This can be verified with a histogram or Q-Q plot of
residuals.
Model Evaluation:
o R-squared (R²): This value explains how well the predictors in the
model explain the variation in annual income. A high R² indicates
that the model fits the data well, while a low R² suggests that there
are other variables not captured in the model that may explain
income.
o p-values: The p-values for the coefficients help determine if each
predictor has a statistically significant effect on annual income. A
small p-value (typically < 0.05) suggests a significant effect.
• For example, if the p-value for age is less than 0.05, we
conclude that age significantly impacts annual income.

Predicted vs Actual:
o By plotting the predicted values of annual income versus the actual
values, we can visually assess how well the model is performing. A
tight alignment of the predicted vs actual points indicates a good
model fit.
Significance of Model:
o If the model’s overall F-statistic is significant (p-value < 0.05), it
means that at least one of the predictors has a meaningful
relationship with annual income.
Visualization:
o A scatter plot with a regression line (geom_smooth(method = "lm"))
will show the relationship between age and annual income. If the
slope is positive, this indicates that income tends to rise with age.
Because this line is fitted to age alone, it shows the overall trend
rather than the effect of age holding other variables constant.
o The color differentiation (e.g., by gender) allows us to observe any
gender-based trends in income across age groups.
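A minimal sketch of the model formula and the age plot is shown below, using base R graphics in place of ggplot2. The data frame survey, its column names, and all values are simulated assumptions for a sample of 100 respondents.

```r
# Simulated survey of 100 respondents (hypothetical values).
set.seed(21)
n <- 100
survey <- data.frame(
  age = sample(21:60, n, replace = TRUE),
  working_professional = rbinom(n, size = 1, prob = 0.7),
  gender = factor(sample(c("Female", "Male"), n, replace = TRUE))
)
survey$annual_income <- 200000 + 8000 * survey$age +
  150000 * survey$working_professional + rnorm(n, mean = 0, sd = 50000)

# annual income = b0 + b1(age) + b2(working professional) + b3(gender)
model <- lm(annual_income ~ age + working_professional + gender, data = survey)
print(summary(model)$coefficients)
r_squared <- summary(model)$r.squared

# Scatter plot of income against age, coloured by gender.
point_col <- ifelse(survey$gender == "Male", "blue", "darkorange")
plot(survey$age, survey$annual_income, col = point_col,
     xlab = "Age", ylab = "Annual income")

# Red line: simple regression on age alone (the overall trend).
abline(lm(annual_income ~ age, data = survey), col = "red")
```

The coefficient named genderMale is the expected income difference for males relative to females, because "Female" is the first factor level and therefore the reference.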

Data-2: Logistic Regression Model

Code:

Output:

Interpretation:
Logistic Regression can be applied to a binary outcome, such as predicting whether an
individual is a "working professional" (working professional = 1 or 0) based on other
predictor variables like age, gender, and annual income.
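A binary logistic regression of this kind could be sketched as follows. The data frame survey, its columns, and the simulated values are assumptions; income is expressed in lakhs here purely to keep the numbers on a convenient scale.

```r
# Simulated survey data (hypothetical values).
set.seed(31)
n <- 100
survey <- data.frame(
  age = sample(21:60, n, replace = TRUE),
  gender = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  income_lakhs = round(runif(n, min = 2, max = 20), 1)
)
# Outcome generated from an assumed logistic relationship.
true_prob <- plogis(-3 + 0.06 * survey$age + 0.1 * survey$income_lakhs)
survey$working_professional <- rbinom(n, size = 1, prob = true_prob)

# Logistic regression: log-odds of being a working professional.
fit <- glm(working_professional ~ age + gender + income_lakhs,
           data = survey, family = binomial)
print(summary(fit)$coefficients)

# Odds ratios are easier to read than log-odds coefficients.
odds_ratios <- exp(coef(fit))

# Predicted probabilities and accuracy at a 0.5 cut-off.
probs <- predict(fit, type = "response")
accuracy <- mean((probs > 0.5) == (survey$working_professional == 1))
cat("in-sample accuracy:", accuracy, "\n")
```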

Data-3: Multinomial Logistic Regression Model

Code:

Output:

Interpretation:
A Multinomial Logistic Regression (MLR) model is an extension of binary logistic regression
used when the outcome variable has more than two categories (i.e., the response variable is
nominal). In this analysis, we can model a response variable like mode of investment (which
has multiple categories like "Banking - RD, FD," "Stocks - Intraday, long term," "Mutual
Funds," etc.) based on predictor variables such as age, gender, annual income, etc.
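One way to fit such a model in R is multinom from the nnet package, as sketched below. The choice of nnet is an assumption of this example, since the text does not name the function used. The data are simulated, with the three investment modes taken from the categories listed above.

```r
# Simulated survey data (hypothetical values).
set.seed(41)
n <- 100
survey <- data.frame(
  age = sample(21:60, n, replace = TRUE),
  gender = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  income_lakhs = round(runif(n, min = 2, max = 20), 1),
  mode = factor(sample(c("Banking - RD, FD", "Mutual Funds",
                         "Stocks - Intraday, long term"), n, replace = TRUE))
)

# Fit only if nnet is installed (it ships with most R distributions).
if (requireNamespace("nnet", quietly = TRUE)) {
  fit <- nnet::multinom(mode ~ age + gender + income_lakhs,
                        data = survey, trace = FALSE)
  # One row of coefficients per non-reference investment mode.
  print(coef(fit))

  # Predicted probability of each mode for every respondent (rows sum to 1).
  probs <- predict(fit, type = "probs")
  predicted_mode <- predict(fit, type = "class")
  print(table(actual = survey$mode, predicted = predicted_mode))
}
```

The first factor level ("Banking - RD, FD") acts as the reference category, so each coefficient describes the change in log-odds of choosing another mode over that one.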

Predictive model

Linear Regression Model


The linear regression model predicts annual income based on age and working
professional status. This analysis assumes a linear relationship between these
variables and annual income. From the model, we
observe how age and employment status contribute to variations in income levels.
A scatter plot of predicted versus actual income demonstrates the model's
performance, where points near the diagonal red line indicate accurate predictions.
Insights drawn from this analysis can help identify trends in income disparities
based on demographic and occupational factors, which are crucial for understanding
financial behavior.

Code:

Output:

Interpretation:

• Dependent Variable: The model predicts Annual_income as the outcome.
• Independent Variables: It uses Age and Working_professional to explain
variations in income.
• Linear Relationship Assumption: Assumes that income changes linearly with
age and working status.
• Model Coefficients: Show how much the annual income changes for each unit
increase in age and for being a working professional.
• Goodness of Fit: Use the R² value to assess how well the model
explains the variability in income.
• Graph Insights: The scatter plot highlights discrepancies between predicted
and actual values; closer alignment to the diagonal indicates better accuracy.
• Practical Use: Identifies demographic groups likely to have higher or lower
incomes, useful for policy-making or market segmentation.

PROJECT 4

"Predicting Financial Trends with Transactional Data: A Machine Learning Approach"

Analysis & Interpretation

Stage – 1 Data preprocessing

After data preprocessing:

Stage – 2 Analysis and interpretation:

1. Generic Analysis

Generic analysis : Distribution of Card Types

Objective: To analyse the proportion of debit and credit cards.

Code:

Output:

Interpretation: The plot reveals whether the bank predominantly issues debit or credit
cards, helping understand customer preferences or institutional focus.

Generic analysis : Credit Limit by Card Type

Objective: Analyse credit limit differences between debit and credit cards.

Code:

Output:

Interpretation: Higher limits on credit cards align with their intended usage for credit
purchases.

Generic analysis : Account Opening Trends Over Time

Objective: Study customer acquisition trends by analysing account opening years.

Code:

Output:

Interpretation: Peaks may coincide with promotional campaigns or economic growth
periods.

2. Statistical Analysis

Cards with Chip vs. Without Chip

Code:
Output:

Interpretation: This plot compares the number of cards with chips versus those without.
It highlights the adoption of chip technology, which is important for security and transaction
efficiency.

Statistical Analysis : Year PIN Last Changed

Code:
Output:

Interpretation: This visualization shows when clients last changed their PINs. It can
indicate how often clients update their security measures, which is vital for risk
management.

Statistical Analysis : Correlation Between CVV and Credit Limit

Objective: To explore whether CVV assignment correlates with credit limits.

Code:

Output:

Interpretation:
No significant correlation is expected between CVV and credit limits. Any patterns observed
could indicate systemic issues in card or limit assignment processes.
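A correlation check of this kind could be sketched as below. The data frame cards, its column names, and the values are simulated, with CVV and credit limit generated independently to mirror the expectation stated above.

```r
# Simulated card data (hypothetical); CVV and limit are generated independently.
set.seed(51)
n <- 500
cards <- data.frame(
  cvv = sample(100:999, n, replace = TRUE),
  credit_limit = round(runif(n, min = 1000, max = 50000))
)

# Pearson correlation with a test of H0: true correlation = 0.
ct <- cor.test(cards$cvv, cards$credit_limit)
print(ct)

r <- unname(ct$estimate)
cat("correlation:", r, " p-value:", ct$p.value, "\n")

# Scatter plot: an unstructured cloud is what independence looks like.
plot(cards$cvv, cards$credit_limit,
     xlab = "CVV", ylab = "Credit limit",
     main = "CVV vs Credit Limit")
```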

Stage – 3 Predictive Model

Linear Regression: Credit Limit vs. Number of Cards Issued

Objective:

The primary objective of predictive modeling is to leverage historical data to identify
patterns and relationships among variables, enabling accurate forecasts of future outcomes.
For instance, in the context of the provided dataset, predictive models can help anticipate
customer credit limits, detect potential fraud, or predict card renewal likelihood,
empowering businesses to make informed decisions, improve operational efficiency, and
enhance customer satisfaction.

Code:

Output:

Interpretation:

1. Regression Line: The red line shows the predicted relationship between
num_cards_issued and credit limit.

2. Confidence Interval: The shaded area represents the confidence interval of the
regression line.

3. Insights: If the slope of the regression line is positive, customers with more cards issued
tend to have higher credit limits. A flat slope would indicate no linear relationship, and a
negative slope an inverse one.
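The regression line and its confidence interval can be reproduced in outline with base R, as in the sketch below. The data frame cards and its values are simulated assumptions; num_cards_issued follows the variable name used above. The interval is drawn here as dashed lines instead of a shaded band.

```r
# Simulated card data (hypothetical values).
set.seed(8)
n <- 300
cards <- data.frame(num_cards_issued = sample(1:3, n, replace = TRUE))
cards$credit_limit <- 5000 + 1500 * cards$num_cards_issued +
  rnorm(n, mean = 0, sd = 3000)

# Simple linear regression of credit limit on number of cards issued.
model <- lm(credit_limit ~ num_cards_issued, data = cards)
slope <- unname(coef(model)["num_cards_issued"])
cat("estimated slope:", slope, "\n")

# Fitted line and 95% confidence interval over a grid of x values.
grid <- data.frame(num_cards_issued = seq(1, 3, by = 0.1))
band <- predict(model, newdata = grid, interval = "confidence", level = 0.95)

plot(cards$num_cards_issued, cards$credit_limit,
     xlab = "Number of cards issued", ylab = "Credit limit")
lines(grid$num_cards_issued, band[, "fit"], col = "red", lwd = 2)  # regression line
lines(grid$num_cards_issued, band[, "lwr"], lty = 2)               # lower bound
lines(grid$num_cards_issued, band[, "upr"], lty = 2)               # upper bound
```

The interval describes uncertainty about the mean credit limit at each x value, not the range in which individual customers' limits fall.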
