Data Acquisition Final Report
Kitchener, Ontario
Canada
A Report On
Submitted by:
Group 6
Sr.No. Name
1. Urvika Patel
2. Yash Thakker
3. Revathy
4. Adagboyi
TABLE OF CONTENTS
Content
1 Introduction
1.1 Problem statement
1.2 Summary
1.3 Data set
2 Descriptive Statistics
2.1 Descriptive Statistics Analysis
2.2 Excel Output
2.3 Insights
3 Data Visualization
4 Statistical Inference
5 Regression Analysis
6 Trend and Seasonality Check
7 Output Excel spreadsheet
8 Recommendation
9 Conclusion
RACI Chart
Appendix
1. Introduction
1.1 Problem Statement:
Owning stocks in various companies can enhance the value of an investment portfolio,
helping investors sustain savings, protect funds against inflation and taxes, and
optimize income from their investments.
Investors are particularly interested in Microsoft stock. However, they are aware that
the stock market does not go up every year. Returns typically fall below expectations
in most years. Some drops can feel quite brutal, and this level of volatility is not
for everyone.
Therefore, they want to understand the relationship between the closing price of
Microsoft stock and the passage of time.
We carry out at least four data analyses to help them identify stocks with strong
growth potential.
1.2 Summary
This report presents a comprehensive analysis of historical Microsoft stock prices and
a prediction of future prices. The analysis covers descriptive statistics, data
visualization, linear regression, time series analysis, predictive data mining, and
evaluation methods. Each section provides insights into a different aspect of the data.
1.3 Dataset
[Figure: the first 5 rows of the dataset]
2. Descriptive Statistics
2.1 Descriptive Statistics Analysis
The descriptive statistics numerically summarize the central tendency, variability,
and distribution of each column in the dataset. They describe the stock's price and
trading activity over the period covered by the data.
Skewness: 0.8265555200891408
Kurtosis: -0.47511021911585827
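The summary figures above can be reproduced along the following lines. This is a minimal sketch, assuming pandas is used (as in the Appendix imports); the price values are made up for illustration and are not the Microsoft data. In pandas, `kurt()` returns excess kurtosis (0 for a normal distribution), which would be consistent with the negative value reported here, although the report does not state which definition was used.

```python
import pandas as pd

# Illustrative closing prices only; NOT the Microsoft data used in the report.
close = pd.Series([10.0, 12.0, 11.0, 15.0, 30.0, 14.0, 13.0, 12.0], name="Close")

summary = close.describe()   # count, mean, std, min, quartiles, max
skewness = close.skew()      # bias-corrected sample skewness
kurtosis = close.kurt()      # excess kurtosis: 0 for a normal distribution

print(summary)
print(f"Skewness: {skewness:.4f}")
print(f"Kurtosis: {kurtosis:.4f}")
```

The single large value (30.0) pulls the right tail out, so the skewness of this toy series is positive, the same sign as the value reported for 'Close'.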
2.3 Insights
Close, Open, High, and Low:
The mean values of 'Open', 'High', 'Low', and 'Close' indicate the average prices over
the period. The average closing price ('Close'), for instance, is roughly $107.42.
Standard Deviation:
These columns' standard deviations show how variable or spread out the
corresponding prices are. The standard deviation for "Close" is roughly $56.70.
Volume level:
Mean Volume:
The typical trading activity is shown by the mean volume, which is roughly
30,198,630.
Volume Range:
The minimum and maximum volume values indicate the range of trading activity over the
period.
Skewness (0.8266):
A positive skewness means that the 'Close' price distribution is skewed to the right,
with a longer right tail. Most closing prices sit at the lower end of the range, while
periods of comparatively high prices stretch the tail upward.
Kurtosis (-0.4751):
Compared to a normal distribution, a negative kurtosis means that the distribution of
'Close' prices is less peaked and has thinner tails. Extreme prices, both high and
low, therefore occur less frequently than a normal distribution would suggest.
3 Data Visualization - Closing Prices Over Time
The figure shows the 'Close' price as a line plot over time. The date (2015-2021) is
shown on the x-axis, while the closing price is shown on the y-axis. This visualization
illustrates the historical fluctuations in the stock's value.
There is an upward, roughly linear trend in the graph.
4 Statistical Inference
Null hypothesis (H0): The average 'Close' price is equal to the mean.
T-statistic: 0.0
P-value: 1.0
The T-statistic measures the difference between the average 'Close' price and the
hypothesized mean, relative to the standard error. The T-statistic in this instance is
exactly 0.0, which is what happens when the hypothesized value is the sample mean
itself: the difference is zero by construction.
Assuming that the null hypothesis is true, the p-value is the probability of finding a
T-statistic at least as extreme as the one observed in the data. The p-value in this
instance is 1.0, the highest possible value, so the null hypothesis is not rejected.
Interpretation
There is no evidence of a significant difference between the mean and the average
'Close' price, as indicated by the T-statistic of 0.0 and the p-value of 1.0.
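The result above can be illustrated with a short sketch of the one-sample T-statistic, written with the standard library only so the formula stays visible. The sample values and the alternative reference value of 90.0 are hypothetical, chosen purely for illustration. The sketch shows that testing a sample against its own mean always gives a T-statistic of 0, while testing against any other value does not.

```python
import math
import statistics


def one_sample_t(sample, mu0):
    """T-statistic for H0: the population mean equals mu0."""
    n = len(sample)
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean
    return (mean - mu0) / std_err


# Illustrative values only; not the report's dataset.
closes = [48.0, 55.0, 72.0, 101.0, 130.0, 193.0]

# Hypothesized value equal to the sample mean: the numerator is zero.
t_self = one_sample_t(closes, statistics.fmean(closes))

# Hypothetical external reference value: the statistic is no longer zero.
t_other = one_sample_t(closes, 90.0)

print(f"t against own mean: {t_self}")
print(f"t against 90.0:     {t_other:.3f}")
```

With a T-statistic of exactly 0, the two-sided p-value is 1.0 regardless of the sample, which matches the figures reported in this section.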
5 Regression Model
Scatter Chart
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.936173661
R Square             0.876421124
Adjusted R Square    0.87633923
Standard Error       19.93960906
Observations         1511

ANOVA
              df      SS             MS             F             Significance F
Regression    1       4254917.219    4254917.219    10701.8248    0
Residual      1509    599960.3064    397.5880095
Total         1510    4854877.526

             Coefficients    Standard Error    t Stat          P-value    Lower 95%       Upper 95%       Lower 95.0%     Upper 95.0%
Intercept    -60655.42397    587.3667455       -103.2666974    0          -61807.56575    -59503.28218    -61807.56575    -59503.28218
Year         30.11425122     0.291100634       103.4496244     0          29.54324647     30.68525598     29.54324647     30.68525598
Interpretation
Multiple R: This is the correlation coefficient between the observed and predicted
values. In this case, it is approximately 0.936, indicating a strong positive
correlation.
R Square: This is the proportion of the variance in the closing price that the model
explains. In this case, it is approximately 0.876 (88%), which is quite high and
suggests a good fit.
Coefficients:
Year: The coefficient for the independent variable (Year) represents the change in
the dependent variable (Close price of the stock) for a one-unit change in the
independent variable. Here it is about 30.11, so the fitted closing price rises by
roughly $30.11 per year.
P-value: The probability of observing a coefficient at least this far from zero if the
true coefficient were zero. A value near 0, as here, indicates that the coefficient is
significantly different from zero.
95% Confidence Interval: The range within which we are 95% confident the true
coefficient lies.
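The quantities interpreted above can be computed directly from a least-squares fit. The following is a minimal sketch, assuming NumPy; the data are synthetic (a known line plus a small alternating wiggle) and are not the report's dataset. It shows how the slope, intercept, R Square, and Multiple R relate to each other in a simple linear regression.

```python
import numpy as np

# Synthetic data for illustration: a known line plus a small alternating wiggle.
year = np.arange(2015, 2022, dtype=float)
wiggle = np.array([1, -1, 1, -1, 1, -1, 1], dtype=float)
close = 30.0 * (year - 2015) + 40.0 + wiggle

# Least-squares line: close ~ intercept + slope * year
slope, intercept = np.polyfit(year, close, deg=1)
predicted = intercept + slope * year

# R Square: share of the variance in 'close' explained by the line.
ss_res = np.sum((close - predicted) ** 2)
ss_tot = np.sum((close - close.mean()) ** 2)
r_square = 1.0 - ss_res / ss_tot

# In simple linear regression, Multiple R is the square root of R Square,
# which equals the absolute correlation between year and close.
multiple_r = np.sqrt(r_square)

print(f"slope={slope:.4f}, R Square={r_square:.4f}, Multiple R={multiple_r:.4f}")
```

Excel's regression tool also reports standard errors, t statistics, and confidence intervals; those are omitted here to keep the sketch short.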
The fitted line follows the formula for a simple linear regression:
Y = β0 + β1 × X
Here:
β0 is the intercept,
β1 is the slope (the coefficient for Year),
X is the year, and Y is the predicted closing price.
For the year 2022:
Closing Price = −60655.42397 + 30.11425122 × Year
= −60655.42397 + 30.11425122 × 2022
= 235.59
This equation represents the line that best fits the relationship between the "Year"
variable and the closing price of the stock based on the regression analysis.
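The arithmetic for the 2022 prediction can be checked with a few lines of Python. The coefficients are copied from the regression output; the function name `predict_close` is only an illustrative choice.

```python
# Coefficients taken from the regression summary output in this section.
INTERCEPT = -60655.42397
YEAR_COEFFICIENT = 30.11425122


def predict_close(year: float) -> float:
    """Closing price predicted by the fitted line for a given year."""
    return INTERCEPT + YEAR_COEFFICIENT * year


prediction_2022 = predict_close(2022)
print(f"Predicted closing price for 2022: {prediction_2022:.2f}")  # 235.59
```

Because the line is fitted to 2015-2021 data, values for later years are extrapolations and should be read with that in mind.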
A linear trend and seasonality are observed in the above graphs. However, let's delve
into the model evaluation scores.
Insights
Mean Trend and Seasonality: The central tendency of Microsoft's stock prices can be
seen in the computed mean trend of 102.22. This gives a basic picture of the overall
movement and acts as a starting point for further analysis. The seasonal effect is
minimal, so the seasonality factor is ignored.
Residual Analysis: The computed mean residual of -0.1671 shows that, on average, the
actual data points sit slightly below the predicted trend. Although small, this
discrepancy is worth noting since it points to forces influencing stock prices that
the trend does not capture.
Seasonal Variation: The standard deviations show the variability in trend,
seasonality, and residuals. The seasonal variation is quantified at 2.56, which
indicates that the seasonal effect changes over time rather than staying static and
adds some complexity to the observed patterns.
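The report does not state which decomposition method or seasonal period produced these figures. As one possible approach, the sketch below implements a classical additive decomposition (centered moving-average trend, averaged seasonal effects, residual) with NumPy. The function name `decompose_additive`, the period of 4, and the synthetic series are all assumptions made for illustration.

```python
import numpy as np


def decompose_additive(values, period):
    """Classical additive decomposition: values = trend + seasonal + residual."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    # Centered moving average; an even period needs half weights at both ends.
    if period % 2 == 0:
        weights = np.concatenate(([0.5], np.ones(period - 1), [0.5])) / period
    else:
        weights = np.ones(period) / period
    half = len(weights) // 2
    trend = np.full(n, np.nan)  # undefined at the edges of the series
    trend[half:n - half] = np.convolve(values, weights, mode="valid")
    # Average the detrended values at each position within the period.
    detrended = values - trend
    effects = np.array([np.nanmean(detrended[i::period]) for i in range(period)])
    effects -= effects.mean()  # seasonal effects sum to zero over one period
    seasonal = np.tile(effects, n // period + 1)[:n]
    residual = values - trend - seasonal
    return trend, seasonal, residual


# Synthetic series: linear trend plus a repeating pattern (illustration only).
t = np.arange(40)
pattern = np.array([2.0, -1.0, 0.0, -1.0])
series = 10.0 + 0.5 * t + np.tile(pattern, 10)

trend, seasonal, residual = decompose_additive(series, period=4)
print("mean trend:    ", np.nanmean(trend))
print("mean residual: ", np.nanmean(residual))
print("seasonal std:  ", seasonal.std())
```

The three printed values correspond to the kind of summary quoted above (mean trend, mean residual, seasonal variation), although the report's own numbers may come from a different tool or period.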
The average closing stock price for each period is calculated for the time series
analysis.
[Figure: Average Closing Price For Various Periods. x-axis: Period (0 to 8); y-axis:
Avg Closing Price (0 to 250). Data labels, in order: 47.72, 55.25, 71.98, 101.03,
130.38, 193.026]
The above graph shows that the average closing prices across periods follow a
curvilinear trend.
R Square (Coefficient of Determination): 0.91 - This means that the linear regression
model accounts for roughly 91% of the variation in the average closing prices.
ANOVA F-value: 40.69 - The regression model as a whole appears to be significant,
based on the low Significance F (p-value) of 0.0031.
Coefficients:
Period: 28.03, Intercept: 1.80
Year 7 Calculation:
Closing Price = Intercept + Period coefficient (from the output) × Prediction period
The real value (232.02) and the computed closing price (197.41) are not the same. The
regression model's linear structure is the cause of this discrepancy: the regression
assumes that the period and the closing price have a constant linear relationship,
whereas the observed averages rise faster in later periods.
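The period regression can be sketched with NumPy from the six data labels shown on the chart. This assumes the labels correspond to periods 1 to 6 in order, which the report does not state explicitly. Under that assumption the fit reproduces the reported Period coefficient, intercept, and R Square to two decimals; the period-7 prediction comes out near 198, slightly different from the 197.41 quoted above, which may reflect rounding in the inputs.

```python
import numpy as np

# Assumption: the chart's data labels map to periods 1..6 in order.
period = np.arange(1, 7, dtype=float)
avg_close = np.array([47.72, 55.25, 71.98, 101.03, 130.38, 193.026])

# Least-squares line: avg_close ~ intercept + slope * period
slope, intercept = np.polyfit(period, avg_close, deg=1)
fitted = intercept + slope * period

ss_res = np.sum((avg_close - fitted) ** 2)
ss_tot = np.sum((avg_close - avg_close.mean()) ** 2)
r_square = 1.0 - ss_res / ss_tot

# Extrapolate one period beyond the data.
prediction_7 = intercept + slope * 7

print(f"Period coefficient: {slope:.2f}")
print(f"Intercept:          {intercept:.2f}")
print(f"R Square:           {r_square:.2f}")
print(f"Period 7 estimate:  {prediction_7:.2f}")
```

The gap between the linear estimate and the real value of 232.02 is consistent with the curvilinear shape of the averages: a straight line underestimates the most recent, fastest-growing periods.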
7 Recommendation
Use more complex models such as SARIMAX, which takes into consideration both
seasonality and exogenous factors. Exogenous factors often play a key role in
stock prices. FbProphet is also a good model for analyzing this dataset.
Use rolling windows for adaptive moving averages, with packages such as TensorFlow
or scikit-learn.
The above two methods have not been adopted, as they go beyond the scope of the
project. However, they might produce better results.
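As a starting point for the rolling-window idea, the sketch below computes a simple moving average with pandas. The prices and the window length of 3 are illustrative assumptions; a real analysis would choose the window from the data frequency.

```python
import pandas as pd

# Illustrative closing prices only; not the report's dataset.
close = pd.Series([10.0, 12.0, 11.0, 15.0, 14.0, 18.0], name="Close")

# Mean of the current value and the two before it; the first two
# positions have no full window and are left as NaN.
rolling_mean = close.rolling(window=3).mean()

print(rolling_mean)
```

A moving average like this smooths short-term fluctuations and follows changes in level more closely than a single straight line fitted to the whole period.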
8 Conclusion
By examining the subtleties in the data and applying more sophisticated modeling
methods, this study can offer investors more useful information. Making wise
investment decisions will require continuous improvement and adaptability to
changing market conditions.
8 RACI chart

Activity                 Urvikaben Jayeshbhai Patel   Yash Pankajkumar Thakker   Revathy Prabhakaran   Adagboyi Ugbabe
Descriptive statistic    R                            A                          C                     I
Data Visualization       I                            A                          R                     C
Statistical Inference    A, C                         I                          R                     A
Model evaluation         A                            R                          C                     I
Predictive data mining   C                            R                          A                     I
Linear Regression        A                            I                          C                     R

Responsibility (R): The person who is responsible for executing the activity.
Accountability (A): The person who is ultimately accountable for the success of
the activity.
Consulted (C): People who provide input and feedback but are not directly
responsible.
Informed (I): People who need to be kept informed about the progress of the
activity.
Appendix
2. Python Codes
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats
from sklearn.metrics import mean_squared_error, mean_absolute_error
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from docx import Document
from docx.shared import Inches
import io
import base64
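# Step 3: Data Visualization
# Line plot of 'Close' prices over time
# (df is assumed to be the DataFrame loaded in an earlier step, with 'Date' and 'Close' columns)
plt.figure(figsize=(10, 6))
plt.plot(df['Date'], df['Close'], marker='o', linestyle='-')
plt.title('Closing Prices Over Time')
plt.xlabel('Date')
plt.ylabel('Close Price')
plt.grid(True)
plt.tight_layout()
plt.show()  # render the figure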