Seminar Report (T9247)
On
Submitted to the
Information Technology
by
Place: Pune
Dr. A. V. Deshpande
Principal, SKNCOE, Pune
This Project Based Seminar report has been examined by us as per the
Savitribai Phule Pune University, Pune requirements at Smt. Kashibai Navale College of
Engineering, Pune-41 on ________________
Komal B. Kolambe
(Student's Name & Signature)
Abstract
Data mining is the process of analyzing large amounts of data in order to extract patterns and useful information. In the last few years, data mining has been widely recognized as a powerful and versatile data analysis tool in a variety of fields: information technology first of all, but also clinical medicine, sociology and physics. In this report we provide a high-level overview of the most prominent tasks and methods that form the basis of data mining. The report also focuses on some of the most recent yet promising interdisciplinary aspects of data mining. KDD (Knowledge Discovery in Databases) is the process of finding useful information in large data sets.

Stock price forecasting is an important subject in finance and economics which has spurred the interest of researchers over the years to develop better predictive models. The autoregressive integrated moving average (ARIMA) models have been explored in the literature for time series prediction. This report presents an extensive process of building a stock price predictive model using the ARIMA model. Published stock data obtained from the New York Stock Exchange (NYSE) and the Nigeria Stock Exchange (NSE) are used with the stock price predictive model developed. The results obtained reveal that the ARIMA model has a strong potential for short-term prediction and can compete well with existing techniques for stock price prediction.
Contents
Acknowledgement
Abstract
LIST OF FIGURES
Sr. No.  Figure Name
3.1  The Process of Knowledge Discovery in Databases
5.5  ADF unit root test for DCLOSE of Nokia stock index
LIST OF TABLES
Sr. No.  Table Name
CHAPTER 1
INTRODUCTION TO Finding Patterns in Data and Predicting the Possibility of Manipulation in the Stock Exchange / Share Market
The stock market is the backbone of fast-growing economies such as India. A major part of capital infusion for companies across the country was made possible only through shares offered to the public, so the country's growth is closely tied to the performance of its stock market. Practically all developing countries depend on their stock market for further strengthening of their economy. Yet in developing economies fewer than 10% of people engage in stock market investment, fearing its volatile nature. Many people feel that buying and selling of shares is an act of gambling, which is a wrong notion. A majority of financial analysts agree that the stock market is the main place where investors have obtained consistent inflation-beating returns for many years. Considering this lack of knowledge and awareness among people, stock market prediction techniques play a critical role in bringing more people into the market as well as retaining existing investors. Moreover, prediction techniques must not be treated like astrology or gambling: the applied techniques must consistently yield accurate results with a certain degree of precision in order to change the mindset of passive investors. From the literature, stock market prediction techniques can be grouped into four types: 1) technical analysis approaches, 2) fundamental analysis approaches, 3) time series prediction and 4) machine learning algorithmic techniques. Technical analysis generates forecasts based on the historical price values of selected stocks. The fundamental analysis approach estimates the true value of a stock, compares it with the current trading level and recommends buying stock that is traded below its true worth. In time series prediction, linear forecasting models are developed and well-known patterns are followed; the aim is to find an expression that can generate the data.

Role of data mining in the stock market: Many researchers attempt to predict stock prices by applying statistical and charting approaches. However, those methods lag behind because decisions in the stock market are heavily biased by the day-to-day mentality of human behavior. By applying data mining in a sensible manner, hidden patterns can be uncovered that were not accessible to conventional approaches. Moreover, by applying business intelligence, future price prediction with increased accuracy becomes possible with data mining techniques. The huge amount of data generated by stock markets has compelled researchers to apply data mining to support investment decisions. The following challenges of the stock market can be effectively addressed by mining techniques.
1.4 Introduction to Seminar Topic
Knowledge Discovery in Databases (KDD) is an automatic, exploratory analysis and
modeling of large data repositories. KDD is the organized process of identifying valid,
novel, useful, and understandable patterns from large and complex data sets. Data Mining
(DM) is the core of the KDD process, involving the inferring of algorithms that explore the
data, develop the model and discover previously unknown patterns. The model is used for
understanding phenomena from the data, analysis and prediction. The accessibility and
abundance of data today makes knowledge discovery and Data Mining a matter of
considerable importance and necessity. Given the recent growth of the field, it is not surprising that a wide variety of methods is now available to researchers and practitioners. No one method is superior to the others in all cases.
Forecasting will continue to be an interesting area of research, making researchers in the domain constantly eager to improve existing predictive models. The reason is that institutions and individuals are thereby empowered to make investment decisions and to plan and develop effective strategies for their daily and future endeavors. Stock price prediction is regarded as one of the most difficult tasks to accomplish in financial forecasting due to the complex nature of the stock market [1, 2, 3].

The desire of many investors is to lay hold of any forecasting strategy that could guarantee easy profit and minimize investment risk in the stock market. This remains a motivating factor for researchers to evolve and develop new predictive models. Over the past years several models and techniques have been developed for stock price prediction. Among them are artificial neural network (ANN) models, which are very popular due to their ability to learn patterns from data and infer solutions from unknown data. A few related works that have applied ANN models to stock price prediction are [5, 6, 7]. More recently, hybrid approaches have also been employed to improve stock price predictive models by exploiting the unique strengths of each of them [2]. ANNs come from the artificial intelligence perspective, while ARIMA models come from the statistical modeling perspective. In general, it is reported in the literature that forecasting can be approached from two perspectives: statistical and artificial intelligence techniques. ARIMA models are known to be robust and efficient in financial time series forecasting, especially for short-term prediction, outperforming even the most popular ANN techniques [8, 9, 10]. They have been extensively used in the fields of economics and finance. Other statistical models are the regression method, exponential smoothing, and generalized autoregressive conditional heteroskedasticity (GARCH). A few related works that have applied the ARIMA model for forecasting include [11, 12, 13, 14, 15, 16]. In this report the process of building ARIMA models for short-term stock price prediction is presented. The results obtained from real data demonstrate the potential strength of ARIMA models to give investors short-term forecasts that can support the investment decision-making process.
KDD Process
The knowledge discovery process is iterative and interactive, consisting of nine steps. The process is iterative at each step, meaning that moving back to previous steps may be required. The process has many "artistic" aspects in the sense that one cannot present a single formula or a complete taxonomy for the right choices at each step and application type. It is therefore necessary to understand the process and the different needs and possibilities at each step. A taxonomy appropriate for the Data Mining methods is presented in a later section.

The process starts with determining the KDD goals and "ends" with the implementation of the discovered knowledge. At that point the loop is closed and the Active Data Mining part starts (which is beyond the scope of this report and the process defined here). Subsequently, changes would have to be made in the application domain (for example, offering different features to mobile phone users in order to reduce churn). This closes the loop, the effects are then measured on the new data repositories, and the KDD process is launched again.
1. Developing an understanding of the application domain. This is the initial preparatory step, in which the goals of the end-user are defined together with the environment in which the knowledge discovery process will take place (including relevant prior knowledge). As the KDD process proceeds, there may even be a revision of this step. Having understood the KDD goals, the preprocessing of the data starts, defined in the next three steps (note that some of the methods here are similar to Data Mining algorithms, but are used in the preprocessing context):
2. Selecting and creating a data set on which discovery will be performed. Having defined the goals, the data that will be used for the knowledge discovery should be determined. This includes finding out what data is available, obtaining additional necessary data, and then integrating all the data for the knowledge discovery into one data set, including the attributes that will be considered for the process. This process is very important because Data Mining learns and discovers from the available data. This is the evidence base for constructing the models. If some important attributes are missing, then the entire study may fail. In this respect, the more attributes that are considered, the better. On the other hand, collecting, organizing and operating complex data repositories is expensive, and there is a tradeoff against the opportunity to best understand the phenomena. This tradeoff represents an aspect where the interactive and iterative nature of the KDD process takes place: it starts with the best available data set and later expands, observing the effect in terms of knowledge discovery and modeling.
3. Preprocessing and cleansing. In this stage, data reliability is enhanced. It includes data clearing, such as handling missing values and removal of noise or outliers. There are many methods, ranging from doing nothing to becoming the major part (in terms of time consumed) of a KDD project. It may involve complex statistical methods, or using a Data Mining algorithm in this context. For example, if one suspects that a certain attribute is of insufficient reliability or has many missing values, then this attribute could become the target of a supervised data mining algorithm: a prediction model for this attribute is developed, and the missing values can then be predicted. The extent to which one pays attention to this level depends on many factors. In any case, studying these aspects is important and is often revealing in itself with regard to enterprise information systems.
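As a concrete illustration of the supervised-imputation idea above, the following minimal Python sketch trains a model on the records where the attribute is present and predicts it where it is missing. The file name, column names and the choice of RandomForestRegressor are illustrative assumptions, not part of the original report.

```python
# Hedged sketch: treating an unreliable attribute as the target of a
# supervised algorithm and predicting its missing values (step 3).
# "customers.csv" and the column names are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("customers.csv")
target = "monthly_income"                                   # attribute with many missing values
predictors = ["age", "tenure_months", "num_transactions"]   # assumed to be complete

known = df[df[target].notna()]
missing = df[df[target].isna()]

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(known[predictors], known[target])

# Fill the gaps with model predictions instead of dropping the records.
df.loc[df[target].isna(), target] = model.predict(missing[predictors])
```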
4. Data transformation. In this stage, the generation of better data for the data mining is prepared and developed. Methods here include dimension reduction (such as feature selection and extraction, and record sampling) and attribute transformation (such as discretization of numerical attributes and functional transformation). This step can be crucial for the success of the entire KDD project, and it is usually very project-specific. For example, in medical examinations, the quotient of attributes may often be the most important factor, and not each one by itself. In marketing, we may need to consider effects beyond our control as well as efforts and temporal issues (for example, studying the effect of advertising accumulation). However, even if we do not use the right transformation at the start, we may obtain a surprising effect that hints at the transformation needed (in the next iteration). Thus the KDD process reflects upon itself and leads to an understanding of the transformation required.
Having completed the above four steps, the following four steps are related to the Data Mining part, where the focus is on the algorithmic aspects employed for each step:
5. Choosing the appropriate Data Mining task. We are now ready to decide which kind of Data Mining to use, for instance classification, regression, or clustering. This mostly depends on the KDD goals, and also on the previous steps. There are two major goals in Data Mining: prediction and description. Prediction is often referred to as supervised Data Mining, while descriptive Data Mining includes the unsupervised and visualization aspects of Data Mining. Most data mining techniques are based on inductive learning, where a model is constructed explicitly or implicitly by generalizing from a sufficient number of training examples. The underlying assumption of the inductive approach is that the trained model is applicable to future cases. The strategy also takes into account the level of meta-learning for the particular set of available data.
6. Choosing the Data Mining algorithm. Having the strategy, we now decide on the tactics. This stage includes selecting the specific method to be used for searching patterns (including multiple inducers). For example, in considering precision versus understandability, the former is better with neural networks, while the latter is better with decision trees. For each strategy of meta-learning there are several possibilities of how it can be accomplished. Meta-learning focuses on explaining what causes a Data Mining algorithm to be successful or not for a particular problem. Thus, this approach attempts to understand the conditions under which a Data Mining algorithm is most appropriate. Each algorithm has parameters and tactics of learning (such as ten-fold cross-validation or another division for training and testing).
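The accuracy-versus-understandability trade-off mentioned above can be explored empirically; the sketch below compares a decision tree and a small neural network with ten-fold cross-validation on a standard scikit-learn toy data set, purely as an illustration (the data set and model settings are assumptions, not from the report).

```python
# Hedged sketch: comparing two candidate inducers with ten-fold
# cross-validation, as discussed in step 6. Toy data set for illustration.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

candidates = [
    ("decision tree (more understandable)", DecisionTreeClassifier(random_state=0)),
    ("neural network (often more accurate)", MLPClassifier(max_iter=2000, random_state=0)),
]

for name, model in candidates:
    scores = cross_val_score(model, X, y, cv=10)   # ten-fold cross-validation
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```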
7. Employing the Data Mining algorithm. Finally the implementation of the Data Mining algorithm is reached. In this step we may need to employ the algorithm several times until a satisfying result is obtained, for instance by tuning the algorithm's control parameters, such as the minimum number of instances in a single leaf of a decision tree.
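The parameter tuning described here is often automated; the following sketch searches over the minimum-instances-per-leaf parameter of a decision tree with a cross-validated grid search (the data set and parameter grid are illustrative assumptions).

```python
# Hedged sketch of step 7: rerunning the algorithm while tuning a control
# parameter, here the minimum number of instances in a leaf of a decision tree.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"min_samples_leaf": [1, 5, 10, 25, 50]},
                      cv=10)
search.fit(X, y)

print("best min_samples_leaf:", search.best_params_["min_samples_leaf"])
print("cross-validated accuracy:", round(search.best_score_, 3))
```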
8. Evaluation. In this step we evaluate and interpret the mined patterns (rules, reliability, etc.) with respect to the goals defined in the first step. Here we consider the preprocessing steps in terms of their effect on the Data Mining algorithm results (for example, adding features in Step 4 and repeating from there). This step focuses on the comprehensibility and usefulness of the induced model. In this step the discovered knowledge is also documented for further use. The last step is the usage of, and overall feedback on, the patterns and discovery results obtained by the Data Mining:
9. Using the discovered knowledge. We are now ready to incorporate the knowledge into another system for further action. The knowledge becomes active in the sense that we may make changes to the system and measure the effects. In fact, the success of this step determines the effectiveness of the entire KDD process. There are many challenges in this step, such as losing the "laboratory conditions" under which we have operated. For instance, the knowledge was discovered from a certain static snapshot (usually a sample) of the data, but now the data becomes dynamic. Data structures may change (certain attributes become unavailable) and the data domain may be modified (for example, an attribute may have a value that was not assumed before).
5. KDD & DM Research Opportunities and Challenges
Empirical comparison of the performance of different approaches and their variants in a wide range of application domains has shown that each performs best in some, but not all, domains. This phenomenon is known as the selective superiority problem, which means, in our case, that no induction algorithm can be the best in all domains. The reason is that each algorithm contains an explicit or implicit bias that leads it to prefer certain generalizations over others, and it will be successful only as long as this bias matches the characteristics of the application domain. Results have demonstrated the existence and correctness of this "no free lunch theorem": if one inducer is better than another in some domains, then there are necessarily other domains in which this relationship is reversed. This implies in KDD that for a given problem a certain approach can yield more knowledge from the same data than other approaches.

In many application domains, the generalization error (on the whole domain, not just the part spanned by the given data set) of even the best methods is far above that on the training set, and the question of whether it can be improved, and if so how, is an open and important one. Part of the answer to this question is to determine the minimum error achievable by any classifier in the application domain (known as the optimal Bayes error). If existing classifiers do not reach this level, new approaches are required. Although this problem has received considerable attention, no generally reliable method has so far been demonstrated. This is one of the challenges of DM research – not only to solve it, but even to measure and understand it better. Heuristic methods could then be compared absolutely and not just against one another.

A subset of this generalized study is the question of which inducer to use for a given problem. To be even more specific, the performance measure needs to be defined appropriately for each problem. Although there are some commonly accepted measures, that is not enough. For example, if the analyst is looking for accuracy only, one solution is to try each inducer in turn and, by estimating the generalization error, to choose the one that appears to perform best. Another approach, known as multi-strategy learning, attempts to combine two or more different paradigms in a single algorithm. The dilemma of which method to choose becomes even greater if other factors, such as comprehensibility, are taken into consideration. For example, for a specific domain, neural networks may outperform decision trees in accuracy; however, from the comprehensibility aspect, decision trees are considered superior. In other words, even if the researcher knows that the neural network is more accurate, the dilemma of which method to use still exists (or whether to combine methods for their different strengths).

Induction is one of the central problems in many disciplines such as machine learning, pattern recognition, and statistics. However, the feature that distinguishes Data Mining from traditional methods is its scalability to very large sets of varied types of data. Scalability means working in an environment with a high number of records, high dimensionality, and a high number of classes or heterogeneity. Nevertheless, trying to discover knowledge in real-world, large databases introduces time and memory problems.
As large databases have become the norm in many fields (including astronomy, molecular biology, finance, marketing, health care, and many others), the use of Data Mining to discover patterns in them has become potentially very beneficial for the enterprise. Many companies are staking a large part of their future on such "Data Mining" applications, and turn to the research community for answers to the fundamental problems they encounter. While a large amount of available data used to be the dream of any data analyst, nowadays the synonym for "very large" has become "terabyte" or even "petabyte", a hardly imaginable volume of information. Information-intensive organizations (like telecom companies and financial institutions) are expected to accumulate several terabytes of raw data every one to two years. High dimensionality of the input (that is, the number of attributes) increases the size of the search space exponentially (known as the "Curse of Dimensionality"), and consequently increases the chance that the inducer will find spurious classifiers that are in general not valid. There are several approaches for dealing with a high number of records, including sampling methods, aggregation, massively parallel processing, and efficient storage methods.
Active DM – closing the loop, as in control theory, where changes to the system are made according to the KDD results and the full cycle starts again. Stability and controllability, which will be significantly different in these kinds of systems, need to be well defined.

Full taxonomy – for all nine steps of the KDD process. We have indicated a taxonomy for the DM methods, but a taxonomy is needed for each of the nine steps. Such a taxonomy will contain methods appropriate for each step (even the first), and for the whole process as well.

Meta-algorithms – algorithms that examine the characteristics of the data in order to determine the best methods and parameters (including decompositions).

Benefit analysis – to understand the effect of the potential KDD\DM results on the enterprise.

Problem characteristics – analysis of the problem itself for its suitability to the KDD process.

Expanding the database for Data Mining inference to include also data from images, voice, video, audio, etc. This will require adapting and developing new methods (for example, for comparing images using clustering and compression analysis).

Distributed Data Mining – the ability to seamlessly and effectively employ Data Mining methods on databases that are located in various sites.

Expanding the knowledge base for the KDD process, to include not only data but also extraction from known facts to principles (for example, extracting from a machine its operating principle, and thus being able to apply it in other situations).
Extending Data Mining reasoning to incorporate creative solutions – not just the ones that appear in the data, but being able to combine solutions and generate a new approach.
ARIMA Model:
Box and Jenkins introduced the ARIMA model in 1970. It is also referred to as the Box-Jenkins methodology, composed of a set of activities for identifying, estimating and diagnosing ARIMA models with time series data. The model is one of the most prominent methods in financial forecasting [1, 12, 9]. ARIMA models have shown an efficient capability to generate short-term forecasts, and have constantly outperformed complex structural models in short-term prediction [17]. In an ARIMA model, the future value of a variable is a linear combination of past values and past errors, expressed as follows:

Y_t = φ_0 + φ_1 Y_{t-1} + φ_2 Y_{t-2} + … + φ_p Y_{t-p} + ε_t − θ_1 ε_{t-1} − θ_2 ε_{t-2} − … − θ_q ε_{t-q}

where Y_t is the actual value and ε_t is the random error at time t, φ_i and θ_j are the coefficients, and p and q are integers that are often referred to as the autoregressive and moving average orders, respectively. The steps in building an ARIMA predictive model consist of model identification, parameter estimation and diagnostic checking.
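The report's experiments were carried out in Eviews; as a rough, non-authoritative equivalent, the sketch below fits an ARIMA(p, d, q) model to a daily closing-price series with Python's statsmodels. The CSV file name and column are hypothetical.

```python
# Hedged sketch (not the Eviews workflow of the report): fitting an
# ARIMA(p, d, q) model to a daily closing-price series with statsmodels.
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

close = pd.read_csv("nokia_daily.csv", index_col="Date",
                    parse_dates=True)["Close"]        # hypothetical data file

model = ARIMA(close, order=(2, 1, 0))   # p=2 AR terms, d=1 difference, q=0 MA terms
result = model.fit()

print(result.summary())                 # coefficients, standard errors, BIC
print(result.forecast(steps=5))         # short-term forecast for the next 5 days
```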
Methodology:
The method used in this study to develop ARIMA model for stock price forecasting is
explained in detail in subsections below. The tool used for implementation is Eviews
software version 5. Stock data used in this research work are historical daily stock prices
obtained from two countries stock exchanged. The data composed of four elements, namely:
open price, low price, high price and close price respectively. In this research the closing
price is chosen to represent the price of the index to be predicted. Closing price is chosen
because it reflects all the activities of the index in a trading day. To determine the best
ARIMA model among several experiments performed, the following criteria are used in this
study for each stock index. • Relatively small of BIC (Bayesian or Schwarz Information
Criterion) • Relatively small standard error of regression (S.E. of regression) • Relatively
high of adjusted R2 • Q-statistics and correlogram show that there is no significant pattern
left in the autocorrelation functions (ACFs) and partial autocorrelation functions (PACFs) of
the residuals, it means the residual of the selected model are white noise. The subsections
below described the processes of ARIMA model-development. A. ARIMA (p, d, q) Model
for Nokia Stock Index Nokia stock data used in this study covers the period from 25th April,
1995 to 25th February, 2011 having a total number of 3990 observations. Figure 1 depicts
the original pattern of the series to have general overview whether the time series is
stationary or not. From the graph below the time series have random walk pattern.
Figure 5.1: Graphical representation of the Nokia stock closing price index
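For readers who want to reproduce this identification step outside Eviews, a minimal Python sketch (with a hypothetical data file) that plots the closing-price series together with its correlogram might look as follows.

```python
# Hedged sketch of the identification step: plot the closing-price series and
# its correlogram to judge stationarity. "nokia_daily.csv" is hypothetical.
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

close = pd.read_csv("nokia_daily.csv", index_col="Date",
                    parse_dates=True)["Close"]

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 6))
close.plot(ax=ax1, title="Nokia closing price (level)")  # random-walk-like level
plot_acf(close, lags=40, ax=ax2)                         # ACF dying down very slowly
plt.tight_layout()
plt.show()
```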
Figure 5.2 is the correlogram of the Nokia time series. From the graph, the ACF dies down extremely slowly, which simply means that the time series is non-stationary. If the series is not stationary, it is converted to a stationary series by differencing. After the first difference, the series "DCLOSE" of the Nokia stock index becomes stationary, as shown in the line graph and correlogram of Figure 5.3 and Figure 5.4 respectively.
Figure 5.3: Graphical representation of the Nokia stock price index after differencing
Figure 5.4: The correlogram of Nokia stock price index after first differencing
In Figure 5.5 the model checking was done with the Augmented Dickey-Fuller (ADF) unit root test on "DCLOSE" of the Nokia stock index. The result confirms that the series becomes stationary after the first difference of the series.
Figure 5.5: ADF unit root test for DCLOSE of Nokia stock index.
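A non-authoritative way to run the same ADF unit root test in Python (instead of Eviews) is sketched below; the data file is again hypothetical and the numbers will differ from Figure 5.5.

```python
# Hedged sketch of the ADF unit root test on the level series and on its
# first difference ("DCLOSE"). The data file is hypothetical.
import pandas as pd
from statsmodels.tsa.stattools import adfuller

close = pd.read_csv("nokia_daily.csv", index_col="Date",
                    parse_dates=True)["Close"]
dclose = close.diff().dropna()                 # first difference, "DCLOSE"

for name, series in [("CLOSE", close), ("DCLOSE", dclose)]:
    stat, pvalue, *_ = adfuller(series)
    print(f"{name}: ADF statistic = {stat:.3f}, p-value = {pvalue:.4f}")
# A small p-value for DCLOSE indicates the differenced series is stationary.
```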
Table 1 shows the different parameters of the autoregressive (p) and moving average (q) terms among the several ARIMA models experimented upon. ARIMA (2, 1, 0) is considered the best for the Nokia stock index, as shown in Figure 5.6. The model returned the smallest Bayesian (Schwarz) information criterion of 5.3927 and the relatively smallest standard error of regression of 3.5808, as shown in Figure 5.6.
Figure 5.6: ARIMA (2, 1, 0) estimation output with DCLOSE of Nokia index.
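The model selection behind Table 1 and Figure 5.6 can be approximated by looping over candidate orders and keeping the fit with the smallest BIC; the sketch below does this in statsmodels under the same hypothetical data assumption (its numerical results will not match the Eviews output exactly).

```python
# Hedged sketch of the model-selection loop: fit several candidate
# ARIMA(p, 1, q) models and keep the one with the smallest BIC.
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

close = pd.read_csv("nokia_daily.csv", index_col="Date",
                    parse_dates=True)["Close"]

best_order, best_bic = None, float("inf")
for p in range(4):
    for q in range(4):
        result = ARIMA(close, order=(p, 1, q)).fit()
        if result.bic < best_bic:
            best_order, best_bic = (p, 1, q), result.bic

print("best order by BIC:", best_order, "with BIC =", round(best_bic, 4))
```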
Figure 5.7 shows the residuals of the series. If the model is good, the residuals (the differences between actual and predicted values) of the model are a series of random errors. Since there are no significant spikes in the ACFs and PACFs, the residuals of the selected ARIMA model are white noise; no other significant patterns are left in the time series. Therefore, there is no need to consider any further AR(p) or MA(q) terms.
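The white-noise check on the residuals can also be carried out numerically, for example with a Ljung-Box test; the following sketch does so for the ARIMA(2, 1, 0) fit under the same hypothetical data assumption.

```python
# Hedged sketch of the diagnostic check: if the ARIMA(2, 1, 0) residuals are
# white noise, the Ljung-Box statistics should be insignificant.
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.stats.diagnostic import acorr_ljungbox

close = pd.read_csv("nokia_daily.csv", index_col="Date",
                    parse_dates=True)["Close"]
residuals = ARIMA(close, order=(2, 1, 0)).fit().resid

print(acorr_ljungbox(residuals, lags=[10, 20]))   # large p-values => white noise
```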
The bold row represents the best ARIMA model among the several experiments.
Figure 5.8: Graph of Actual Stock Price vs Predicted values of Nokia Stock Index
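A plot like Figure 5.8 can be approximated by holding out the last few observations, forecasting them with the fitted model and overlaying actual and predicted values; the sketch below illustrates this with a 30-day hold-out under the same hypothetical data assumption.

```python
# Hedged sketch behind an actual-vs-predicted plot: hold out the last 30
# observations, forecast them with ARIMA(2, 1, 0) and plot both series.
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA

close = pd.read_csv("nokia_daily.csv", index_col="Date",
                    parse_dates=True)["Close"]
train, test = close[:-30], close[-30:]

forecast = ARIMA(train, order=(2, 1, 0)).fit().forecast(steps=30)
forecast.index = test.index             # align the forecast with the hold-out dates

ax = test.plot(label="actual", figsize=(8, 4))
forecast.plot(ax=ax, style="--", label="predicted")
ax.legend()
ax.set_title("Actual vs predicted Nokia closing price (30-day hold-out)")
plt.show()
```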
CONCLUSION
This report presents KDD (Knowledge Discovery in Databases) for finding patterns in databases; KDD consists of steps that can be used, for example, for detecting fraud in the stock market. It also presents the extensive process of building an ARIMA model for stock price prediction. The experimental results obtained with the best ARIMA model demonstrate the potential of ARIMA models to predict stock prices satisfactorily on a short-term basis. This could guide investors in the stock market to make profitable investment decisions. With the results obtained, ARIMA models can compete reasonably well with emerging forecasting techniques in short-term prediction.
REFERENCES
1. Ayodele A. Adebiyi and Aderemi O. Adewumi, School of Mathematics, Statistics & Computer Science, University of KwaZulu-Natal, Durban, South Africa. Email: {adebiyi, adewunmi}@ukzn.ac.za
2. Padhraic Smyth, "Knowledge Discovery and Data Mining: Towards a Unifying Framework", Information and Computer Science, University of California, Irvine, CA 92717-3425, USA. Email: smyth@ics.uci.edu
3. KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Vol. 2.
4. 2015 19th International Conference on System Theory, Control and Computing (ICSTCC), Cheile Gradistei, Romania, 14-16 Oct. 2015. Publisher: IEEE. INSPEC Accession Number: 15586652.
5. L. Y. Wei, "A hybrid model based on ANFIS and adaptive expectation genetic algorithm to forecast TAIEX", Economic Modelling, vol. 33, pp. 893-899, 2013.