Seminar Report (T9247)


A Project Based Seminar Report

On

“Using Machine Learning and Data Mining to Find Patterns
in Data and Predict the Trends and Volatility of Stocks”

Submitted to the

Savitribai Phule Pune University


In partial fulfillment for the award of the Degree of
Bachelor of Engineering
in

Information Technology
by

Komal Bharat Kolambe


(71928083E / T9247 / Division: 2)

Under the guidance of


Prof. Neha Chankhore

Department of Information Technology


STES’s, Smt. Kashibai Navale College of Engineering,
Vadgaon (BK),
Pune, 411 041.
2019-2020
(SEM-II)
CERTIFICATE
This is to certify that the project based seminar report entitled “Using Machine
Learning and Data Mining to Find Patterns in Data and Predict the Trends and
Volatility of Stocks” being submitted by Komal Bharat Kolambe
(71928083E/T9247/Division: 2) is a record of bonafide work carried out by him/her under
the supervision and guidance of Prof. Neha Chankhore in partial fulfillment of the
requirement for the TE (Information Technology) 2015 course of Savitribai Phule
Pune University, Pune in the academic year 2019-2020.

Date: /03/ 2020

Place: Pune

Prof. Neha Chankhore                              Prof. R. H. Borhade

Guide                                             Head of the Department, IT

Dr. A. V. Deshpande
Principal, SKNCOE, Pune

This Project Based Seminar report has been examined by us as per the
Savitribai Phule Pune University, Pune requirements at Smt. Kashibai Navale College of
Engineering, Pune-41 on ________________

Internal Examiner External Examiner


ACKNOWLEDGEMENT
I would like to thank, first of all, our project guide Prof. Neha Chankhore,
for always supporting us, encouraging us and our ideas, and believing in us.
I would also like to thank the Head of the IT Department, Prof. R. H. Borhade.
I would like to thank my project teammates Pranjali and Rahul. I will
always be grateful for your support and your contribution to this project.

Komal B. Kolambe
(Student's Name & Signature)

ABSTRACT
Data mining is the process of analyzing large amounts of data in order to extract patterns
and useful information. In the last few years, data mining has been widely recognized as
a powerful and versatile data analysis tool in a variety of fields: information technology
first of all, but also clinical medicine, sociology, and physics. This report provides a
high-level overview of the most prominent tasks and methods that form the basis of data
mining, and also touches on some recent yet promising interdisciplinary aspects of the
field. KDD (Knowledge Discovery in Databases) is the process of finding useful
information in large data sets.
Stock price prediction is an important topic in finance and economics which has spurred
the interest of researchers over the years to develop better predictive models. The
autoregressive integrated moving average (ARIMA) models have been explored in the
literature for time series prediction. This report presents the extensive process of building
a stock price predictive model using the ARIMA model. Published stock data obtained
from the New York Stock Exchange (NYSE) and the Nigeria Stock Exchange (NSE) are
used with the stock price predictive model developed. The results obtained reveal that the
ARIMA model has a strong potential for short-term prediction and can compete well with
existing techniques for stock price prediction.

Contents

Acknowledgement                                                         I
Abstract                                                               II

1. INTRODUCTION TO PROJECT TOPIC                                        8
   1.1 Introduction to Project                                          8
   1.2 Motivation behind project topic                                  8
   1.3 Aim and Objective(s) of the work                                 8
   1.4 Introduction to Seminar Topic                                    9

2. LITERATURE SURVEY OF SEMINAR TITLE/TOPIC                            10
   2.1 Introduction to Project
   2.2 Motivation behind project topic
   2.3 Aim and Objective(s) of the work
   2.4 Introduction to Seminar Topic

3. SEMINAR RELATED OTHER CHAPTERS                                      15
4. CONCLUSION                                                          25

LIST OF FIGURES

Sr. No.   Figure Name                                                              Page No.
3.1       The Process of Knowledge Discovery in Databases
3.2       Data Mining Taxonomy
5.1       Graphical representation of the Nokia stock closing price index
5.2       The correlogram of Nokia stock price index
5.3       Graphical representation of the Nokia stock price index after differencing
5.4       The correlogram of Nokia stock price index after first differencing
5.5       ADF unit root test for DCLOSE of Nokia stock index
5.6       ARIMA (2, 1, 0) estimation output with DCLOSE of Nokia index
5.7       Correlogram of residuals of the Nokia stock index
5.8       Graph of Actual Stock Price vs Predicted values of Nokia Stock Index
LIST OF TABLES

Sr. No.   Table Name                                                               Page No.
1         Statistical results of different ARIMA parameters for Nokia stock index
2         Sample of empirical results of ARIMA (2, 1, 0) of Nokia stock index
CHAPTER 1
1. INTRODUCTION TO FINDING PATTERNS IN DATA AND PREDICTING THE
POSSIBILITY OF MANIPULATION IN THE STOCK EXCHANGE / SHARE MARKET

1.1 Introduction to Project

The stock market is the backbone of fast-developing economies such as India. Most of the
capital raised by companies across the country is made possible only through shares offered to
the public, so the nation's growth is closely tied to the performance of its stock market.
Almost all developing countries depend on their stock markets for further strengthening of their
economies. Even so, in developing economies fewer than 10% of people engage in stock market
investment, fearing its volatile nature. Many people feel that buying and selling shares is an act
of gambling, which is a mistaken idea: the majority of financial analysts agree that the stock
market is the main place where investors have been getting consistent inflation-beating returns
for many years. Considering this lack of knowledge and awareness among the public, stock
market prediction techniques play a critical role in bringing more people into the market as well
as in retaining existing investors. At the same time, prediction techniques must not be treated
like astrology or gambling; the applied techniques must yield consistent, accurate results with a
certain degree of precision over time in order to change the mindset of passive investors. From
the literature, stock market prediction techniques can be grouped into four types: 1) the technical
analysis approach, 2) the fundamental analysis approach, 3) time series prediction, and 4) machine
learning algorithmic techniques. Technical analysis generates predictions based on the historical
price values of selected stocks. The fundamental analysis approach finds the true value of a stock,
compares it with the current trading level, and recommends buying stocks that trade below their
true worth. In time series prediction, linear forecasting models are developed and well-known
patterns are followed; the aim is to find an expression that can generate the data.
Role of data mining in the stock market: Many researchers attempt to predict stock prices by
applying statistical and charting approaches. However, those techniques lag behind because of
human-biased decisions on the stock market driven by the day-to-day mentality of human
behavior. By applying data mining in a suitable manner, hidden patterns can be uncovered that
were not accessible to conventional approaches. In addition, with business intelligence, future
price prediction with increased accuracy levels is possible using data mining techniques. The
enormous amount of data generated by stock markets has compelled researchers to apply data
mining to make investment decisions. The following challenges of the stock market can be
effectively addressed by mining techniques.
1.4 Introduction to Seminar Topic
Knowledge Discovery in Databases (KDD) is an automatic, exploratory analysis and
modeling of large data repositories. KDD is the organized process of identifying valid,
novel, useful, and understandable patterns from large and complex data sets. Data Mining
(DM) is the core of the KDD process, involving the inferring of algorithms that explore the
data, develop the model and discover previously unknown patterns. The model is used for
understanding phenomena from the data, analysis and prediction. The accessibility and
abundance of data today makes knowledge discovery and Data Mining a matter of
considerable importance and necessity. Given the recent growth of the field, it is not
surprising that a wide variety of methods is now available to the researchers and
practitioners. No one method is superior to others for all cases.
Forecasting will continue to be an interesting area of research, making researchers in the
domain constantly eager to improve existing predictive models. The reason is that
institutions and individuals are empowered to make investment decisions and to plan and
develop effective strategies for their daily and future endeavours. Stock price prediction is
regarded as one of the most difficult tasks to accomplish in financial forecasting due to the
complex nature of the stock market [1, 2, 3].
The desire of many investors is to lay hold of any forecasting strategy that could guarantee
easy profit and minimize investment risk in the stock market. This remains a motivating
factor for researchers to evolve and develop new predictive models. In the past years several
models and techniques have been developed for stock price prediction. Among them are
artificial neural network (ANN) models, which are very popular because of their ability to
learn patterns from data and infer solutions from unseen data. A few related works that
engaged ANN models for stock price prediction are [5, 6, 7]. In recent times, hybrid
approaches have also been engaged to improve stock price predictive models by exploiting
the unique strengths of each of them [2]. ANNs come from the artificial intelligence
perspective, while ARIMA models come from the statistical perspective. It is generally
reported in the literature that forecasting can be done from two perspectives: statistical and
artificial intelligence techniques. ARIMA models are known to be robust and efficient in
financial time series forecasting, especially for short-term prediction, even compared with
the most popular ANN techniques [8, 9, 10]. They have been extensively used in the fields of
economics and finance. Other statistical models are the regression method, exponential
smoothing, and generalized autoregressive conditional heteroskedasticity (GARCH). A few
related works that engaged the ARIMA model for forecasting include [11, 12, 13, 14, 15, 16].
In this report the process of building ARIMA models for short-term stock price prediction is
presented. The results obtained from real data demonstrate the potential strength of ARIMA
models to give investors short-term forecasts that could aid the investment decision-making
process.

KDD Process
The knowledge discovery process is iterative and interactive, consisting
of nine steps. The process is iterative at each step, meaning that moving
back to previous steps may be required. The process has many "artistic"
aspects in the sense that one cannot present a single formula or make a
complete taxonomy of the right choices for each step and application
type. It is therefore necessary to understand the process and the different
needs and possibilities at each step. A taxonomy appropriate for the Data
Mining methods is presented in the next section.

Figure 3.1. The Process of Knowledge Discovery in Databases.

The process starts with determining the KDD goals, and "ends" with the
implementation of the discovered knowledge. At that point the loop is
closed and the active Data Mining part begins (which is beyond the scope
of this report and the process defined here). Subsequently, changes would
have to be made in the application domain (for example, offering
different features to mobile phone users in order to reduce churning).
This closes the loop, the effects are then measured on the new data
repositories, and the KDD process is launched again.

1. Developing an understanding of the application domain. This is the
initial preparatory step. It prepares the scene for understanding what
should be done with the many decisions ahead (about transformation,
algorithms, representation, etc.). The people in charge of a KDD project
need to understand and define the goals of the end-user and the
environment in which the knowledge discovery process will take place
(including relevant prior knowledge). As the KDD process proceeds,
there may even be a revision of this step. Having understood the KDD
goals, the preprocessing of the data starts, as defined in the next three
steps (note that some of the methods here are similar to Data Mining
algorithms, but are used in a preprocessing context):
2. Selecting and creating a data set on which discovery will be performed. Having
defined the goals, the data that will be used for the knowledge discovery
should be determined. This includes finding out what data is available,
obtaining additional necessary data, and then integrating all the data for
the knowledge discovery into one data set, including the attributes that
will be considered for the process. This step is important because Data
Mining learns and discovers from the available data; it is the evidence
base for constructing the models. If some important attributes are
missing, the entire study may fail. From this respect, the more attributes
considered the better. On the other hand, collecting, organizing and
operating complex data repositories is expensive, and there is a tradeoff
with the opportunity to best understand the phenomena. This tradeoff
represents an aspect where the interactive and iterative nature of the
KDD takes place: it starts with the best available data set and is later
expanded, observing the effect in terms of knowledge discovery and
modeling.
3. Preprocessing and cleansing. In this step data reliability is improved.
It includes data cleaning, for example handling missing values and
removal of outliers. There are many methods, ranging from doing
nothing to making this the major part (in terms of time consumed) of a
KDD project in certain cases. It may involve complex statistical methods,
or using a Data Mining algorithm at this stage. For example, if one
suspects that a certain attribute is of insufficient reliability or has many
missing values, then this attribute could become the target of a supervised
Data Mining algorithm: a prediction model is developed for the attribute,
and the missing values can then be predicted. The extent to which one
pays attention to this level depends on many factors; in any case,
studying these aspects is important and often revealing in itself with
regard to enterprise information systems. A minimal sketch of such
cleansing is shown below.
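
As an illustration of this step (not part of the original report; the file prices.csv and its
columns are hypothetical), a small pandas sketch could impute missing values and clip outliers:

    import pandas as pd

    # Hypothetical daily price table with possible gaps and bad ticks.
    df = pd.read_csv("prices.csv", parse_dates=["Date"], index_col="Date")

    # Fill missing closing prices with the column median (one of many possible policies).
    df["Close"] = df["Close"].fillna(df["Close"].median())

    # Treat values more than 3 standard deviations from the mean as outliers and clip them.
    mean, std = df["Close"].mean(), df["Close"].std()
    df["Close"] = df["Close"].clip(lower=mean - 3 * std, upper=mean + 3 * std)

    print(df["Close"].describe())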

4. Data transformation. In this step the generation of better data for the
data mining is prepared and developed. Methods here include dimension
reduction (such as feature selection and extraction, and record sampling)
and attribute transformation (such as discretization of numerical
attributes and functional transformation). This step can be crucial for the
success of the entire KDD project, and it is usually very project-specific.
For example, in medical examinations the quotient of attributes may
often be the most important factor, and not each attribute by itself. In
marketing, we may need to consider effects beyond our control as well as
efforts and temporal issues (such as studying the effect of advertising
accumulation). However, even if we do not use the right transformation
at the beginning, we may obtain a surprising effect that hints to us about
the transformation needed (in the next iteration). Thus the KDD process
reflects upon itself and leads to an understanding of the transformation
needed. Having completed the above four steps, the following four steps
are related to the Data Mining part, where the focus is on the algorithmic
aspects employed for each project:
5. Choosing the appropriate Data Mining task. We are now ready to
decide which kind of Data Mining to use, for example classification,
regression, or clustering. This mostly depends on the KDD goals and also
on the previous steps. There are two major goals in Data Mining:
prediction and description. Prediction is often referred to as supervised
Data Mining, while descriptive Data Mining includes the unsupervised
and visualization aspects of Data Mining. Most data mining techniques
are based on inductive learning, where a model is constructed explicitly
or implicitly by generalizing from a sufficient number of training
examples. The underlying assumption of the inductive approach is that
the trained model is applicable to future cases. The approach also takes
into account the scope of meta-learning for the particular set of available
data.
6. Choosing the Data Mining algorithm. Having the strategy, we now
decide on the tactics. This step includes selecting the specific method to
be used for searching patterns (including multiple inducers). For example,
in considering precision versus understandability, the former is better
with neural networks, while the latter is better with decision trees. For
each strategy of meta-learning there are several possibilities of how it can
be accomplished. Meta-learning focuses on explaining what causes a
Data Mining algorithm to be successful or not in a particular problem;
thus this approach attempts to understand the conditions under which a
Data Mining algorithm is most appropriate. Each algorithm has
parameters and tactics of learning (such as ten-fold cross-validation or
another division for training and testing).
7. Employing the Data Mining algorithm. Finally the implementation of the
Data Mining algorithm is reached. In this step we may need to employ the
algorithm several times until a satisfactory result is obtained, for example
by tuning the algorithm's control parameters, such as the minimum
number of instances in a single leaf of a decision tree. A small sketch of
such tuning is given below.
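
As an illustrative sketch (synthetic data, not part of the original report), scikit-learn can tune
the minimum-instances-per-leaf parameter of a decision tree by cross-validated grid search:

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(1)
    X = rng.normal(size=(300, 4))
    y = (X[:, 0] * X[:, 1] > 0).astype(int)

    # Try several values of the control parameter and keep the best by cross-validation.
    search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                          param_grid={"min_samples_leaf": [1, 5, 10, 25]},
                          cv=5)
    search.fit(X, y)
    print("best min_samples_leaf:", search.best_params_["min_samples_leaf"])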
8. Evaluation. In this step we evaluate and interpret the mined patterns (rules,
reliability, etc.) with respect to the goals defined in the first step. Here we
consider the preprocessing steps with respect to their effect on the Data
Mining algorithm results (for example, adding features in step 4 and
repeating from there). This step focuses on the comprehensibility and
usefulness of the induced model. In this step the discovered knowledge is
also documented for further use. The last step is the usage of, and overall
feedback on, the patterns and discovery results obtained by the Data Mining:
9. Using the discovered knowledge. We are now ready to incorporate the
knowledge into another system for further action. The knowledge
becomes active in the sense that we may make changes to the system and
measure the effects. Actually, the success of this step determines the
effectiveness of the entire KDD process. There are many challenges in
this step, such as losing the "laboratory conditions" under which we have
operated. For instance, the knowledge was discovered from a certain
static snapshot (usually a sample) of the data, but now the data becomes
dynamic. Data structures may change (certain attributes become
unavailable), and the data domain may be modified (for example, an
attribute may have a value that was not assumed before).

Data Mining Methods


There are many methods of Data Mining used for different purposes and
goals. A taxonomy is called for to help in understanding the variety of
methods, their interrelation and grouping. It is useful to distinguish two main
types of Data Mining: verification-oriented (the system verifies the user's
hypothesis) and discovery-oriented (the system finds new rules and patterns
autonomously). Figure 3.2 presents this taxonomy. Discovery methods are
those that automatically identify patterns in the data. The discovery method
branch consists of prediction methods versus description methods.
Descriptive methods are oriented to data interpretation, which focuses on
understanding (by visualization, for example) the way the underlying data
relates to its parts. Prediction-oriented methods aim to build a behavioral
model that takes in new and unseen samples and is able to predict the values
of one or more variables related to the sample. They also generate patterns
which form the discovered knowledge in a way that is understandable and
easy to operate upon. Some prediction-oriented methods can also help
provide understanding of the data. Most of the discovery-oriented Data
Mining techniques (quantitative in particular) are based on inductive
learning, where a model is constructed, explicitly or implicitly, by
generalizing from a sufficient number of training examples. The underlying
assumption of the inductive approach is that the trained model is applicable
to future unseen examples.

Figure 3.2. Data Mining Taxonomy.

Verification methods, on the other hand, deal with the evaluation of a
hypothesis proposed by an external source (such as an expert). These
methods include the most common methods of traditional statistics, like
the goodness-of-fit test, tests of hypotheses (e.g., the t-test of means), and
analysis of variance (ANOVA). These methods are less associated with
Data Mining than their discovery-oriented counterparts, because most
Data Mining problems are concerned with discovering a hypothesis (out
of a very large set of hypotheses) rather than testing a known one. Much
of the focus of traditional statistical methods is on model estimation
rather than on one of the main objectives of Data Mining: model
identification and construction, which is evidence-based (though overlap
occurs).
Another common terminology, used by the machine learning community,
refers to the prediction methods as supervised learning, as opposed to
unsupervised learning. Unsupervised learning refers to modeling the
distribution of instances in a typical, high-dimensional input space, and
mostly to techniques that group instances without a prespecified,
dependent attribute. Thus the term "unsupervised learning" covers only a
portion of the description methods presented in Figure 3.2; for instance, it
covers clustering methods but not visualization methods. Supervised
methods are methods that attempt to discover the relationship between
input attributes (sometimes called independent variables) and a target
attribute (sometimes referred to as a dependent variable). The relationship
discovered is represented in a structure referred to as a model. Usually
models describe and explain phenomena which are hidden in the data set
and can be used for predicting the value of the target attribute whenever
the values of the input attributes are known. The supervised methods can
be implemented in a variety of domains such as marketing, finance and
manufacturing. It is useful to distinguish two main supervised models:
classification models and regression models. The latter map the input
space into a real-valued domain; for instance, a regressor can predict the
demand for a certain product given its characteristics. On the other hand,
classifiers map the input space into predefined classes. For example,
classifiers can be used to classify mortgage consumers as good (fully
repay the mortgage on time) and bad (delayed repayment), or into as
many target classes as needed. There are many alternatives to represent
classifiers; typical models include support vector machines, decision
trees, probabilistic summaries, and algebraic functions. A brief sketch
contrasting the two model types is given below.
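
As an illustrative sketch (not from the original report; the data below are synthetic), the
distinction can be seen with scikit-learn decision trees:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))                      # three input attributes

    # Classification: map inputs to predefined classes (e.g. good/bad borrower).
    y_class = (X[:, 0] + X[:, 1] > 0).astype(int)
    clf = DecisionTreeClassifier(max_depth=3).fit(X, y_class)
    print("predicted classes:", clf.predict(X[:5]))

    # Regression: map inputs to a real-valued domain (e.g. demand for a product).
    y_reg = 2.0 * X[:, 0] - X[:, 2] + rng.normal(scale=0.1, size=200)
    reg = DecisionTreeRegressor(max_depth=3).fit(X, y_reg)
    print("predicted values:", reg.predict(X[:5]).round(2))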

5. KDD & DM Research Opportunities and Challenges
Empirical comparison of the performance of different approaches and their variants in a
wide range of application domains has shown that each performs best in some, but not all,
domains. This phenomenon is known as the selective superiority problem, which means, in
our case, that no induction algorithm can be the best in all domains. The reason is that each
algorithm contains an explicit or implicit bias that leads it to prefer certain generalizations
over others, and it will be successful only as long as this bias matches the characteristics of
the application domain. Results have demonstrated the existence and correctness of this "no
free lunch theorem": if one inducer is better than another in some domains, then there are
necessarily other domains in which this relationship is reversed. This implies, for KDD, that
for a given problem a certain approach can yield more knowledge from the same data than
other approaches. In many application domains, the generalization error (on the overall
domain, not just the part spanned by the given data set) of even the best methods is far above
that on the training set, and the question of whether it can be improved, and if so how, is an
open and important one. Part of the answer to this question is to determine the minimum
error achievable by any classifier in the application domain (known as the optimal Bayes
error). If existing classifiers do not reach this level, new approaches are needed. Although
this problem has received considerable attention, no generally reliable method has so far
been demonstrated. This is one of the challenges of DM research: not only to solve it, but
even to quantify and understand it better. Heuristic methods can then be compared
rigorously and not just against one another. A subset of this generalized study is the question
of which inducer to use for a given problem. To be even more specific, the performance
measure needs to be defined appropriately for each problem. Although there are some
commonly accepted measures, they are not sufficient. For example, if the analyst is looking
for accuracy only, one solution is to try each inducer in turn and, by estimating the
generalization error, to choose the one that appears to perform best. Another approach,
known as multi-strategy learning, attempts to combine two or more different paradigms in a
single algorithm. The dilemma of which method to choose becomes even greater if other
factors, such as comprehensibility, are taken into consideration. For instance, for a specific
domain, neural networks may outperform decision trees in accuracy, yet from the
comprehensibility aspect decision trees are considered superior. In other words, even if the
researcher knows that the neural network is more accurate, the question of which methods to
use (or whether to combine methods for their different strengths) still exists. Induction is one
of the central problems in many disciplines such as machine learning, pattern recognition,
and statistics. However, the feature that distinguishes Data Mining from traditional methods
is its scalability to very large sets of varied types of data. Scalability means working in an
environment with a high number of records, high dimensionality, and a high number of
classes or heterogeneousness. Nevertheless, trying to discover knowledge in real-world,
huge databases raises time and memory problems.

As huge databases have become the norm in many fields (including astronomy, molecular
biology, finance, marketing, health care, and many others), the use of Data Mining to
discover patterns in them has become potentially very beneficial for the enterprise. Many
companies are staking a large part of their future on these "Data Mining" applications, and
turn to the research community for solutions to the fundamental problems they encounter.
While a large amount of available data used to be the dream of any data analyst, nowadays
the synonym for "very large" has become "terabyte" or even "petabyte", a barely imaginable
volume of information. Information-intensive organizations (like telecom companies and
financial institutions) are expected to accumulate several terabytes of raw data every one to
two years. High dimensionality of the input (that is, the number of attributes) increases the
size of the search space in an exponential manner (known as the "curse of dimensionality"),
and thus increases the chance that the inducer will find spurious classifiers that are not valid
in general. There are several approaches for dealing with a high number of records,
including: sampling methods, aggregation, massively parallel processing, and efficient
storage methods.

5. KDD & DM Trends


The field is still in its early stages in the sense that further basic methods are being developed.
The art grows, but so do the understanding and the automation of the nine steps and their
interrelation. For this to happen we need a better characterization of the KDD problem
spectrum and definition. The terms KDD and DM are not well defined in terms of what
methods they contain, what types of problems are best solved by these methods, and what
results to expect. How do KDD/DM compare to statistics, machine learning, operations
research, and so on? Are they a subset or a superset of the above fields? An extension or
adaptation of them? Or a separate field by themselves? Beyond the methods, which are the
most promising fields of application and what vision does KDD/DM bring to these fields?
Certainly we already see the great results and achievements of KDD/DM, but we cannot yet
assess their results with respect to the potential of this field. With these fundamental questions
still to be studied, we see several trends for future research and implementation, including:

Active DM – closing the loop, as in control theory, where changes to the system are made
according to the KDD results and the full cycle starts again. Stability and controllability, which
will be significantly different in these kinds of systems, need to be well defined.

Full taxonomy – for all nine steps of the KDD process. We have shown a taxonomy for the DM
methods, but a taxonomy is needed for each of the nine steps. Such a taxonomy will contain
methods appropriate for each step (even the first), and for the entire process as well.

Meta-algorithms – algorithms that examine the characteristics of the data in order to determine
the best methods and parameters (including decompositions).

Benefit analysis – to understand the effect of the potential KDD/DM results on the enterprise.

Problem characteristics – analysis of the problem itself for its suitability to the KDD process.

Expanding the database for Data Mining inference to also include data from images, voice,
video, audio, etc. This will require adapting and developing new methods (for example, for
comparing images using clustering and compression analysis).

Distributed Data Mining – the ability to seamlessly and effectively employ Data Mining
methods on databases that are located in various sites.

Expanding the knowledge base for the KDD process, including not only data but also
extraction from well-established facts to principles (for example, extracting from a machine its
principle, and thereby being able to apply it in other situations).

Expanding Data Mining reasoning to include creative solutions, not just the ones that appear in
the data, but being able to combine solutions and generate another approach.

ARIMA Model:

Box and Jenkins introduced the ARIMA model in 1970. It is also referred to as the
Box-Jenkins methodology, composed of a set of activities for identifying, estimating and
diagnosing ARIMA models with time series data. The model is among the most prominent
methods in financial forecasting [1, 12, 9]. ARIMA models have shown efficient capability to
generate short-term forecasts and have consistently outperformed complex structural models in
short-term prediction [17]. In an ARIMA model, the future value of a variable is a linear
combination of past values and past errors, expressed as follows:

Y_t = φ_1·Y_(t-1) + φ_2·Y_(t-2) + … + φ_p·Y_(t-p) + ε_t − θ_1·ε_(t-1) − θ_2·ε_(t-2) − … − θ_q·ε_(t-q)

where Y_t is the actual value and ε_t is the random error at time t, φ_i and θ_j are the
coefficients, and p and q are integers that are often referred to as the autoregressive and moving
average orders, respectively. The steps in building an ARIMA predictive model consist of model
identification, parameter estimation and diagnostic checking.
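
As a toy numeric illustration of this formula (illustrative values only, not estimates from the
study's data; an AR order of 2 and MA order of 1 are assumed), in Python:

    import numpy as np

    # Assumed coefficients and history for a one-step computation of the formula above.
    phi = np.array([0.6, -0.2])        # φ_1, φ_2  (p = 2)
    theta = np.array([0.3])            # θ_1       (q = 1)
    past_y = np.array([101.5, 100.8])  # Y_(t-1), Y_(t-2)
    past_e = np.array([0.4])           # ε_(t-1)
    eps_t = 0.1                        # current random error ε_t

    # Y_t = φ_1·Y_(t-1) + φ_2·Y_(t-2) + ε_t − θ_1·ε_(t-1)
    y_t = phi @ past_y + eps_t - theta @ past_e
    print(round(float(y_t), 3))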

Methodology:

The method used in this study to develop the ARIMA model for stock price forecasting is
explained in detail in the subsections below. The tool used for implementation is Eviews
software version 5. Stock data used in this research work are historical daily stock prices
obtained from two countries' stock exchanges. The data are composed of four elements, namely:
open price, low price, high price and close price. In this research the closing price is chosen to
represent the price of the index to be predicted, because it reflects all the activities of the index
in a trading day. To determine the best ARIMA model among the several experiments performed,
the following criteria are used for each stock index:
• Relatively small BIC (Bayesian or Schwarz Information Criterion)
• Relatively small standard error of regression (S.E. of regression)
• Relatively high adjusted R²
• Q-statistics and correlogram showing that there is no significant pattern left in the
autocorrelation functions (ACFs) and partial autocorrelation functions (PACFs) of the
residuals, meaning the residuals of the selected model are white noise.
The subsections below describe the process of ARIMA model development.

A. ARIMA (p, d, q) Model for Nokia Stock Index
The Nokia stock data used in this study cover the period from 25th April 1995 to 25th February
2011, a total of 3990 observations. Figure 5.1 depicts the original pattern of the series to give a
general overview of whether the time series is stationary or not. From the graph below, the time
series has a random walk pattern.

Figure 5.1: Graphical representation of the Nokia stock closing price index

Figure 5.2 is the correlogram of the Nokia time series. From the graph, the ACF dies down
extremely slowly, which simply means that the time series is nonstationary. If the series is
not stationary, it is converted to a stationary series by differencing. After the first difference,
the series "DCLOSE" of the Nokia stock index becomes stationary, as shown in Figure 5.3 and
Figure 5.4 of the line graph and correlogram respectively.

Figure 5.2: The correlogram of Nokia stock price index

Figure 5.3: Graphical representation of the Nokia stock price index after differencing

Figure 5.4: The correlogram of Nokia stock price index after first differencing

In Figure 5.5 the model checking was done with the Augmented Dickey-Fuller (ADF) unit root
test on "DCLOSE" of the Nokia stock index. The result confirms that the series becomes
stationary after the first difference of the series.

Figure 5.5: ADF unit root test for DCLOSE of Nokia stock index.
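
The study performs this check in Eviews. As an illustrative sketch only (the file name
nokia.csv and its column names are assumptions, not part of the study), the same first
difference and ADF unit-root test can be reproduced in Python with statsmodels:

    import pandas as pd
    from statsmodels.tsa.stattools import adfuller

    # Load daily closing prices (hypothetical file and column names).
    close = pd.read_csv("nokia.csv", parse_dates=["Date"], index_col="Date")["Close"]

    # First difference, analogous to the DCLOSE series in the study.
    dclose = close.diff().dropna()

    # ADF unit-root test; a small p-value suggests the differenced series is stationary.
    adf_stat, p_value, *_ = adfuller(dclose, autolag="AIC")
    print(f"ADF statistic: {adf_stat:.4f}, p-value: {p_value:.4f}")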

Table I shows the different parameters of the autoregressive (p) and moving average (q) terms
among the several ARIMA models experimented upon. ARIMA (2, 1, 0) is considered the best
for the Nokia stock index, as shown in Figure 5.6. The model returned the smallest Bayesian
(Schwarz) information criterion of 5.3927 and the relatively smallest standard error of regression
of 3.5808, as shown in Figure 5.6.

Figure 5.6: ARIMA (2, 1, 0) estimation output with DCLOSE of Nokia index.
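
The model comparison itself was done in Eviews; a hedged Python sketch of the same idea
(reloading the hypothetical closing-price series from the previous sketch and selecting the
(p, 1, q) fit with the lowest BIC) could be:

    import pandas as pd
    from statsmodels.tsa.arima.model import ARIMA

    close = pd.read_csv("nokia.csv", parse_dates=["Date"], index_col="Date")["Close"]

    # Fit a small grid of ARIMA(p, 1, q) models and keep the one with the lowest BIC,
    # mirroring the selection criteria used in the study.
    best_fit, best_order = None, None
    for p in range(3):
        for q in range(3):
            fit = ARIMA(close, order=(p, 1, q)).fit()
            if best_fit is None or fit.bic < best_fit.bic:
                best_fit, best_order = fit, (p, 1, q)

    print("selected order:", best_order, "BIC:", round(best_fit.bic, 4))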

Figure 5.7 shows the residuals of the series. If the model is good, the residuals (differences
between actual and predicted values) of the model are a series of random errors. Since there are
no significant spikes in the ACFs and PACFs, the residuals of the selected ARIMA model are
white noise and no other significant patterns are left in the time series. Therefore there is no need
to consider any further AR(p) and MA(q) terms.

Figure 5.7: Correlogram of residuals of the Nokia stock index
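
A self-contained Python sketch of this residual diagnostic (again with a hypothetical data file;
the study itself used the correlogram view in Eviews) could use the Ljung-Box Q-test:

    import pandas as pd
    from statsmodels.tsa.arima.model import ARIMA
    from statsmodels.stats.diagnostic import acorr_ljungbox

    close = pd.read_csv("nokia.csv", parse_dates=["Date"], index_col="Date")["Close"]
    fit = ARIMA(close, order=(2, 1, 0)).fit()

    # Large p-values mean no significant autocorrelation remains in the residuals,
    # i.e. the residuals look like white noise.
    lb = acorr_ljungbox(fit.resid, lags=[10, 20], return_df=True)
    print(lb[["lb_stat", "lb_pvalue"]])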

In forecasting form, the best model selected, ARIMA (2, 1, 0), can be expressed as follows:

Y_t = Y_(t-1) + φ_1·(Y_(t-1) − Y_(t-2)) + φ_2·(Y_(t-2) − Y_(t-3)) + ε_t

where ε_t is the forecast error (i.e., the difference between the actual value of the series and the
forecast value) and the estimated coefficients φ_1 and φ_2 are those reported in Figure 5.6.

TABLE I: STATISTICAL RESULTS OF DIFFERENT ARIMA PARAMETERS FOR
NOKIA STOCK INDEX

The bold row represents the best ARIMA model among the several experiments.

Result of ARIMA Model for Nokia Stock Price Prediction

Table II presents the predicted values of ARIMA (2, 1, 0), considered the best model for the
Nokia stock index. Figure 5.8 gives a graphical illustration of the level of accuracy of the
predicted price against the actual stock price, showing the performance of the selected ARIMA
model. From the graph, it is obvious that the performance is satisfactory.

TABLE II: SAMPLE OF EMPIRICAL RESULTS OF ARIMA (2, 1, 0) OF NOKIA STOCK INDEX.

Figure 5.8: Graph of Actual Stock Price vs Predicted values of Nokia Stock Index

CONCLUSION
This report presented KDD (Knowledge Discovery in Databases) for finding patterns in
data; KDD comprises the steps that can be used, for example, for detecting fraud in the
stock market. It also presented the extensive process of building an ARIMA model for
stock price prediction. The experimental results obtained with the best ARIMA model
demonstrated the potential of ARIMA models to predict stock prices satisfactorily on a
short-term basis. This could guide investors in the stock market in making profitable
investment decisions. With the results obtained, ARIMA models can compete reasonably
well with emerging forecasting techniques in short-term prediction.

REFERENCES
1. Ayodele A. Adebiyi and Aderemi O. Adewumi, School of Mathematics, Statistics &
Computer Science, University of KwaZulu-Natal, Durban, South Africa. Email: {adebiyi,
adewunmi}@ukzn.ac.za.

2. Oded Maimon, Department of Industrial Engineering, Tel-Aviv University.
[email protected]

3. Padhraic Smyth, "Knowledge Discovery and Data Mining: Towards a Unifying Framework,"
Information and Computer Science, University of California, Irvine, CA 92717-3425, USA.

4. KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge
Discovery & Data Mining, Vol. 2.

5. 2015 19th International Conference on System Theory, Control and Computing (ICSTCC),
Cheile Gradistei, Romania, 14-16 Oct. 2015. Publisher: IEEE. Added to IEEE Xplore: 09
November 2015. INSPEC Accession Number: 15586652.

6. L. Y. Wei, "A hybrid model based on ANFIS and adaptive expectation genetic algorithm
to forecast TAIEX," Economic Modelling, vol. 33, pp. 893-899, 2013.
