Minor Project Report
A
Project Work
Submitted as a Minor Project in partial fulfillment for the award of the degree of
Bachelor of Engineering in Computer Science & Engineering.
Submitted to
Submitted By:
Amardeep Singh Rathaur (0105CS191018)
Adarsh Kumar Singh (0105CS191009)
Devraj Singh (0105CS191035)
JULY-DEC 2021
Oriental Institute of Science & Technology, Bhopal
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
CERTIFICATE
This is to certify that the students of B.Tech in Computer Science & Engineering have done their work entitled “Crypto Price Predictor & Visualiser” as MINOR PROJECT-I in partial fulfillment of the B.Tech degree from RGPV, Bhopal (M.P.).

Guide                                                        Head
Department of Computer Science & Engineering                 Department of Computer Science & Engineering
ABSTRACT
The purpose of this study is to find out with what accuracy the direction of the price of
crypto currency can be predicted using machine learning methods. This is basically a time
series prediction problem. While much research exists surrounding the use of different
machine learning techniques for time series prediction, research in this area relating
specifically to crypto currency is lacking. In addition, crypto currency as a currency is in a transient stage and as
a result is considerably more volatile than other currencies such as the USD. Interestingly,
it is the top performing currency four out of the last five years. Thus, its prediction offers
great potential and this provides motivation for research in the area. As evidenced by an
analysis of the existing literature, running machine learning algorithms on a GPU as
opposed to a CPU can offer significant performance improvements. This is explored by
benchmarking the training of the RNN and LSTM network using both the GPU and CPU.
This provides a solution to the sub research topic.
Finally, in analysing the chosen dependent variables, each variable's importance is assessed
using a random forest algorithm. In addition, the ability to predict the direction of the price
of an asset such as crypto currency offers the opportunity for profit to be made by trading
the asset.
ACKNOWLEDGEMENT
I take the opportunity to express my cordial gratitude and deep sense of
indebtedness to my guide for the valuable guidance and inspiration throughout
the project duration. I feel thankful to him for his innovative ideas, which led
to the successful submission of this minor project work. I feel proud and fortunate to
work under such an outstanding mentor in the field of Crypto Price Predictor
& Visualiser. He has always welcomed my problems and helped us clear our
doubts. I will always be grateful to him for providing me moral support and
sufficient time.
I owe sincere thanks to Director OIST, for providing us with moral support and
necessary help during my project work in the Department.
At the same time, I would like to thank HOD CSE and all other faculty
members and all non-teaching staff of the Department of Computer Science &
Engineering for their valuable co-operation.
I would also like to thank my Institution, faculty members and staff, without whom
this project would have been a distant reality. I also extend my heartfelt thanks
to our family and well-wishers.
Amardeep Singh Rathaur (0105CS191018)
Adarsh Kumar Singh (0105CS191009)
Devraj Singh (0105CS191035)
ABSTRACT
ACKNOWLEDGEMENT
LIST OF FIGURES
CHAPTERS
Chapter 1  Introduction
Chapter 2  Literature Survey
Chapter 3  Requirement Analysis
    3.1 Functional Requirements
Chapter 4  Design
Chapter 5  Implementation
    5.1 Dataset
    5.3 Classification
Chapter 6  Testing
    1. Unit Testing
    2. Integration Testing
Chapter 7  Outputs
Chapter 8  References
LIST OF FIGURES
2.1 Data Mining
2.2 Stages in Data Mining
5.1 Classification
5.2 NNAR
6.1 Testing Process
CHAPTER 1
INTRODUCTION
Time series prediction is not a new phenomenon. Prediction of most financial markets
such as the stock market has been researched at large scale. Crypto currency presents an
interesting parallel to this as it is a time series prediction problem in a market still in its
nascent stage. As a result, there is high volatility in the market and this provides an
opportunity in terms of prediction. In addition, crypto currency is the leading
cryptocurrency in the world with adoption growing consistently over time. Due to the
open nature of crypto currency it also poses another difficulty as opposed to traditional
financial markets. It operates on a decentralised, peer-to-peer and trustless system in
which all transactions are posted to an open ledger called the Blockchain. This type of
transparency is not seen in other financial markets. Traditional time series prediction
methods such as Holt-Winters exponential smoothing models rely on linear assumptions
and require data that can be broken down into trend, seasonal and noise to be effective.
This type of methodology is more suitable for a task such as predicting sales where
seasonal effects are present. Due to the lack of seasonality in the crypto currency market
and its high volatility, these methods are not very effective for this task. Given the
complexity of the task, deep learning makes for an interesting technological solution
based on its performance in similar areas. Tasks such as natural language processing,
which are also sequential in nature, have shown promising results. This type of task
uses data of a sequential nature and as a result is similar to a price prediction task. The
recurrent neural network (RNN) and the long short term memory (LSTM) flavour of
artificial neural networks are favoured over the traditional multilayer perceptron (MLP)
due to the temporal nature of the more advanced algorithms.
The aim of this research is to ascertain with what accuracy the price of crypto
currency can be predicted using machine learning. Section one addresses the project
specification, which includes the research question, sub research questions, and the
purpose of the study.
Out of approximately 653 papers published on crypto currency, only 7 relate to
machine learning for prediction. As a result, literature relating to other financial time
series prediction using deep learning is also assessed as these tasks can be considered
analogous.
The price data is sourced from the crypto currency Price index. The task is achieved with
varying degrees of success through the implementation of a Bayesian optimized
recurrent neural network (RNN) and Long Short-Term Memory (LSTM) network.
The purpose of this study is to find out with what accuracy the direction of the price of
crypto currency can be predicted using machine learning methods. This is basically a
time series prediction problem. While much research exists surrounding the use of
different machine learning techniques for time series prediction, research in this area
relating specifically to crypto currency is lacking. In addition, crypto currency as a
currency is in a transient stage and as a result is considerably more volatile than other
currencies such as the USD. Interestingly, it is the top performing currency four out of
the last five years. Thus, its prediction offers great potential and this provides
motivation for research in the area. As evidenced by an analysis of the existing
literature, running machine learning algorithms on a GPU as opposed to a CPU can
offer significant performance improvements. This is explored by benchmarking the
training of the RNN and LSTM network using both the GPU and CPU. This provides a
solution to the sub research topic.
By external factors we are referring to agents which indirectly influence the price of crypto
currency (exchange closures, competing cryptocurrencies, speculative markets, the widely
held belief that over 80% of the crypto currency in circulation is concentrated in a
limited number of investors, etc.). In any case, we shall compare our results to other models
built for cryptocurrency prediction. Let us not forget that in the first month of 2018 there
were models which predicted that crypto currency would surpass 100,000.00 USD
per coin by the end of the year, while we are barely reaching the 7,000.00 USD value
just 2 months before the end of the year.
The main feature of this system is to propose a general and effective approach to
predict the crypto currency price using data mining techniques. The main goal of the
proposed system is to analyze and study the hidden patterns and relationships between
the data present in the crypto currency dataset. The solution to the crypto currency
analysis problem can provide extremely useful information to prevent investors from
losing the money which is being invested in crypto currency. Most of the existing work
solves these problems separately with different models, so dealing with them together becomes
more important. The analysis and prediction play an important role in the problem
definition.
The constant increase in crypto currency usage has become an extremely serious
problem, with the development of technology and hi-tech tools having a significantly
greater impact on the crypto currency price. The large amounts of information also
poses a challenge to analyze such data and identify similarities or relations between the
data. Also there is a challenge of inconsistency that can occur in the data due to
incompleteness in the dataset. Therefore, there is an urging need of proper techniques
to analyze large volumes of data to get some useful results out of it. So the main aim of
this project is to propose a general and effective approach to predict the crypto currency
price using data mining techniques.
1. DATA GATHERING
The first step in this project or in any data mining project is the collection of data to be
studied or examined to find the hidden relationships between the data members. The
important concern while choosing a dataset is that the data which we are gathering
should be relevant to the problem statement and it must be large enough so that the
inference derived from the data is useful to extract some important patterns between
the data such that they can be used to predict the future events or can be studied for
further analysis. The process of gathering and creating a collection of data
results in what we call a dataset. The dataset contains a large volume of data that can
be analyzed to get some knowledge out of it. This is an important step in the
process because choosing the inappropriate dataset can lead us to incorrect results.
2. DATA PREPROCESSING
The primary data collected from the internet resources remains in the raw form of
statements, digits and qualitative terms. The raw data contains errors, omissions and
inconsistencies. It requires corrections after careful scrutiny of the collected records. The
following steps are involved in the processing of primary data: a huge volume of raw
data needs to be grouped according to similar details of the individual records.
Data Preprocessing is a technique that is used to convert the raw data into a clean data
set. In other words, whenever the data is gathered from different sources it is collected
in raw format which is not feasible for the analysis.
Therefore, certain steps are executed to convert the data into a small clean data set. This
technique is performed before the execution of Iterative Analysis. The set of steps is
known as data preprocessing.
Data Cleaning
Data Integration
Data Transformation
Data Reduction
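
As a rough illustration of the first, second and fourth of these steps, the sketch below uses pandas; the file names and column names (including n_transactions) are assumptions made for illustration only, not the project's actual schema.

import pandas as pd

# Load two hypothetical raw sources: daily prices and daily blockchain statistics.
prices = pd.read_csv("bitcoin_prices.csv", parse_dates=["date"])
chain = pd.read_csv("blockchain_stats.csv", parse_dates=["date"])

# Data cleaning: drop duplicate rows and fill missing numeric values with column means.
prices = prices.drop_duplicates().fillna(prices.mean(numeric_only=True))

# Data integration: merge the two sources on the shared date column.
merged = prices.merge(chain, on="date", how="inner")

# Data reduction: keep only the columns relevant to the prediction task.
reduced = merged[["date", "open", "high", "low", "close", "volume", "n_transactions"]]
print(reduced.head())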
1.5.3 CLASSIFICATION
This technique is used to divide various data into different classes. This process is also
similar to clustering. It segments data records into various segments which are known as
classes. Unlike clustering, here we have prior knowledge of the different classes. Ex: Outlook
email has an algorithm to categorize an email as legitimate or spam.
LITERATURE SURVEY
1. DATA MINING
Literature survey is the most vital step in the software development process. Before
developing the tool it is necessary to determine the time factor, economy and company strength.
Once these things are satisfied, the next step is to determine which operating system
and language can be used for developing the tool. Once the programmers begin building
the tool, they need a lot of external support. This support is obtained
from senior programmers, from books or from websites. Before building the system, the
above considerations are taken into account for developing the proposed system.
We first have to analyze data mining in outline:
Data mining contains techniques for analysis which involve various domains. For instance,
some of the domains involved in data mining are Statistics, Machine Learning and
Database systems. Data mining is also referred to as “Knowledge Discovery in
Databases (KDD)”.
The real task of data mining systems is the semi-automatic or automatic analysis of large
volumes of data to extract previously unknown relationships such as groups of data
members (clustering analysis), unusual records (outlier or anomaly detection), and
dependencies. Normally, this includes database techniques like spatial indices.
Data mining may identify multiple groups in the data, which can be put to further use for
accurate predictions by a decision support system.
3. Data Modeling: In this step the relationships and patterns that were hidden in the data
are examined and extracted from the datasets. The data can be modeled based on the
technique that is being used. Some of the different techniques that can be used for
modeling data are: clustering, classification, association and decision trees.
4. Deploying Models: Once the relationships and patterns present in the data are discovered
we need to put that knowledge to use. These patterns can be used to predict events in the
future and also they can be used for further analysis. The discovered patterns can be used as inputs for
machine learning and predictive analysis for the datasets.
1.Classification: This technique is used to divide various data into different classes. This
process is also similar to clustering. It segments data records into various segments
which are known as classes. Unlike clustering, here we have prior knowledge of the different
classes. Ex: Outlook email has an algorithm to categorize an email as legitimate
or spam.
2.Association: This technique is used to discover hidden patterns in the data and also
for identifying interesting relations between the variables in a database. Ex: It is used in
retail industry.
3.Prediction: This technique is used for particular purposes. It is used to extract
relationships between independent and dependent variables in the dataset. Ex: We use
this technique to predict profit obtained from sales for the future.
4.Clustering: A cluster is referred to as a group of data objects. The data objects that are
similar in properties are kept in the same cluster. In other words we can tell that
clustering is a process of discovering groups or clusters.
Here we do not have prior knowledge of the clusters. Ex: It can be used in consumer
profiling.
5.Sequential Patterns: This is an essential aspect of data mining techniques; its main aim
is to discover similar patterns in the dataset. Ex: E-commerce websites suggestions are
based on what we have bought previously.
6.Decision Trees: This technique plays a vital role in data mining because it is easier to
understand for the users. The decision tree begins with a root which is a simple
question. As the question can have multiple answers, we get the nodes of the decision tree, and
the questions at one node might lead to another set of questions. Thus, nodes
keep getting added to the decision tree. At last, we are able to make a final decision.
Apart from these techniques there are certain other techniques which allow us to
remove noisy data and also clean the dataset. This helps us to get accurate analysis and
prediction results.
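
As a small, hedged illustration of the classification and prediction techniques above, the sketch below trains a random forest (the algorithm the abstract mentions for assessing variable importance) to predict the direction of the next day's price; the file and column names are assumptions, not the report's exact setup.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical daily OHLCV data; the label is 1 when the next day's close is higher.
df = pd.read_csv("bitcoin_daily.csv")
df["direction"] = (df["close"].shift(-1) > df["close"]).astype(int)
df = df.iloc[:-1]   # the last row has no next-day label

features = ["open", "high", "low", "close", "volume"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["direction"], test_size=0.2, shuffle=False)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))

# Feature importances show how much each variable contributes to the classification.
print(dict(zip(features, clf.feature_importances_)))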
In marketing, it can be used for predicting profits and also can be used for
creating targeted advertisements for various customers.
In the retail sector, it is used for improving the consumer experience and also for
increasing profits.
Tax governing organizations use it to detect fraud in
transactions.
1. Disadvantages:
Storing large amounts of data that contain a lot of information about the crypto currency
price poses a challenge for researchers and investors.
Sometimes the data is entered manually and humans can make mistakes, so
there are chances of incorrect data being entered in the dataset which can lead
to inaccurate results while analyzing the data.
In such a large dataset, there is always a chance of some fields containing missing
values; these missing values can make the data noisy, and thus we must take
appropriate measures to remove inconsistency from the dataset.
The Investors and the researchers do not have adequate techniques to analyze
and study the data to get some inference out of it and use this inference to
efficiently predict the price of the crypto currency.
3. PROPOSED SYSTEM
The proposed system implements a machine learning algorithm to build a model that predicts the
price of bitcoin based on a historical dataset available in an online database. In the
proposed model, this is done using LSTM (Long Short-Term Memory), which is one
type of RNN (Recurrent Neural Network). The tool used for the project is
Anaconda Navigator. The procedure to be followed for the proposed system is given as
follows (a small illustrative sketch of the data-collection step appears after the list):
First, collect the dataset using the REST API to obtain the historical bitcoin
prices from the online database.
Arrange the data into a data frame according to the problem definition, so as
to get the analysis correct and produce results that are sufficient to meet the goals
of the system.
Then the rows of the dataset which are outdated for analysis/prediction are dropped, and,
in order to feed only relevant data to the model, the extra columns are
removed and the result is stored in a CSV file.
Then we build the model for the dataset using the LSTM (RNN) algorithm to
predict the value of bitcoin on a daily basis.
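
A small illustrative sketch of the data-collection step, assuming a hypothetical REST endpoint that returns daily OHLCV records as JSON (substitute the actual provider, e.g. the Quandl API described in Chapter 5):

import pandas as pd
import requests

# Hypothetical endpoint; the real project would point this at its chosen provider.
API_URL = "https://example.com/api/v1/btc/daily"

resp = requests.get(API_URL, params={"start": "2012-01-01", "end": "2018-03-31"})
records = resp.json()   # assumed: list of {"date", "open", "high", "low", "close", "volume"}

df = pd.DataFrame(records)
df["date"] = pd.to_datetime(df["date"])
df = df.sort_values("date").reset_index(drop=True)

# Keep only the columns needed for modelling and persist the cleaned frame to CSV.
df = df[["date", "open", "high", "low", "close", "volume"]]
df.to_csv("bitcoin_daily.csv", index=False)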
1. JUPYTER NOTEBOOK
The Jupyter Notebook App is a server-client application that permits editing and
running notebook documents via a web browser. The Jupyter Notebook App
can be executed on a local desktop requiring no internet access (as described in this
report) or can be installed on a remote server and accessed through the web.
In addition to displaying/editing/running notebook documents, the Jupyter Notebook App
has a "Dashboard" (Notebook Dashboard), a "control panel" showing local files
and allowing users to open notebook documents or shut down their kernels.
A notebook kernel is a "computational engine" that executes the code contained in a
notebook document. The ipython kernel, referenced in this guide, executes Python code.
Kernels for many other languages exist (official kernels). When you open a notebook
document, the associated kernel is automatically launched. When the notebook is
executed (either cell-by-cell or with the menu Cell -> Run All), the kernel performs the
computation and produces the results. Depending on the type of computation, the
kernel may consume significant CPU and RAM.
Note that the RAM is not released until the kernel is shut down. The Notebook
Dashboard is the part which is shown first when you launch the Jupyter Notebook App.
The Notebook Dashboard is mainly used to open notebook documents and to manage
the running kernels (view and shutdown).
The Notebook Dashboard has other features, such as a file manager, in particular
for navigating folders and renaming/deleting files.
People are highly visual creatures: we understand things better when we see
them visualised. However, the step of presenting analyses, results or
insights can be a bottleneck: you might not know where to start, or you
may already have a particular format in mind, but then questions like
"Is this the right way to visualise the insights that I want to convey to my
audience?" will certainly have crossed your mind.
When you are working with the Python plotting library Matplotlib, the first step to
answering the above questions is to build up knowledge on topics like the
anatomy of a Matplotlib plot: what is a subplot? What are the Axes? What exactly is a
figure?
Plot creation can raise questions about which module you need to
import (pylab or pyplot?), how exactly you should go about initialising the figure and
the Axes of your plot, how to use Matplotlib in Jupyter notebooks, and so on.
Saving and showing your plots: show the plot, save one or more figures to, for
example, PDF files, clear the axes, clear the figure or close the plot, and so
on.
Finally, you will briefly cover two ways in which you can customise Matplotlib: with
style sheets and the rc settings.
Once all is set for you to start plotting your data, it is time to
explore some plotting routines. You will often come across functions like plot() and
scatter(), which either draw points with lines or markers connecting them, or draw
unconnected points, which are scaled or coloured. In any case, as you have already seen in
the example of the first section, you should not forget to pass the data that you want
these functions to use!
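
A short illustrative snippet of the plot() and scatter() routines discussed above, using made-up data (not the project's figures):

import matplotlib.pyplot as plt
import numpy as np

# Hypothetical daily closing prices, for demonstration only.
days = np.arange(30)
close = 7000 + 50 * np.sin(days / 3.0) + np.random.randn(30) * 20

fig, ax = plt.subplots(figsize=(8, 4))                   # one figure containing a single Axes
ax.plot(days, close, label="close (line)")               # points connected by a line
ax.scatter(days, close, s=15, label="close (points)")    # unconnected, scaled markers
ax.set_xlabel("day")
ax.set_ylabel("price (USD)")
ax.legend(loc="best")
fig.savefig("close_price.pdf")                           # save the figure to a PDF file
plt.show()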
2.4.3 NUMPY
NumPy is, much like SciPy, Scikit-Learn, Pandas, and so on, one of the packages
that you cannot miss when you are learning data science, mainly because
this library provides an array data structure that holds several advantages
over Python lists, such as being more compact, offering faster access when reading
and writing items, and being more convenient and more efficient.
NumPy arrays are somewhat similar to Python lists, yet at the same time
very different. For those of you who are new to the subject, let us
clear up what it exactly is and what it is useful for.
As the name gives away, a NumPy array is the central data structure of the numpy
library. The library's name is short for "Numeric Python" or "Numerical
Python".
In other words, NumPy is the core Python library for scientific
computing in Python. It contains a collection of tools and methods that can
be used to solve numerical models of problems in science and engineering on a computer. One
of these tools is a high-performance multidimensional array object that is an excellent
data structure for efficient computation of arrays and matrices.
A few exercises have been included below so that you can
already practise how it is done before you start on your own!
To make a NumPy array, you can simply use the np.array() function. All you need to do is
pass a list to it, and optionally, you can also specify the data type of
the data. If you want to know more about the possible
data types that you can pick, consider looking into DataCamp's
NumPy cheat sheet.
Remember that, in order to work with the np.array() function, you need to make sure that the
numpy library is present in your environment. The NumPy library follows an import
convention: when you import this library, you have to make sure that you import it as np. By
doing this, you ensure that other Pythonistas understand your code more
easily.
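
A minimal example of the np.array() usage described above (the values are purely illustrative):

import numpy as np

# A one-dimensional array created from a list, with an explicit data type.
prices = np.array([7012.5, 7050.0, 6998.3], dtype=np.float32)

# A two-dimensional array (a small matrix of open/close pairs).
ohlc = np.array([[7000.0, 7012.5],
                 [7012.5, 7050.0]])

print(prices.dtype, prices.shape)   # float32 (3,)
print(ohlc.shape)                   # (2, 2)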
Python with Pandas is used in a wide range of fields, including academic and commercial
domains such as finance, economics, statistics and analytics. In this
tutorial, we will get familiar with the various features of Python Pandas
and how to use them in practice.
This tutorial has been prepared for those who seek to learn
the basics and the various functions of Pandas. It will be especially valuable
for people working with data cleansing and analysis. After
completing this tutorial, you will find yourself at a moderate level of expertise
from where you can take yourself to higher levels of skill.
The Pandas library uses most of the functionality of NumPy. It is recommended that
you go through a tutorial on NumPy before proceeding with this
tutorial.
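
A brief Pandas snippet illustrating the DataFrame features mentioned above (the values are hypothetical):

import pandas as pd

# Build a small DataFrame of daily prices from a dictionary.
df = pd.DataFrame({
    "date": pd.to_datetime(["2018-01-01", "2018-01-02", "2018-01-03"]),
    "open": [13400.0, 14700.0, 15200.0],
    "close": [14700.0, 15200.0, 15000.0],
})

print(df.head())                    # first rows of the frame
print(df.describe())                # summary statistics of the numeric columns
print(df["close"].pct_change())     # day-over-day percentage change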
Often in analytical work you will end up with large numbers of half-finished
notebooks explaining proof-of-concept ideas, most of which will not lead anywhere at first.
Some of these explorations may, months or even years later,
present a foundation to build from for another problem.
Python is fairly simple, so it is easy to learn, since it uses a
distinctive syntax that focuses on readability. Developers can read and
interpret Python code much more easily than other languages. In turn, this reduces the
cost of program maintenance and development, since it enables teams to work
collaboratively without large language and experience barriers.
REQUIREMENT ANALYSIS
1. FUNCTIONAL REQUIREMENTS
The functions of software systems are defined in functional requirements and the
behavior of the system is evaluated when presented with specific inputs or conditions
which may include calculations, data manipulation and processing and other specific
functionality.
Our system should be able to read the crypto currency data and preprocess it.
It should be able to analyze the crime data.
It should be able to group data based on hidden patterns.
It should be able to assign a label based on its data groups.
It should be able to split data into train set and test set.
It should be able to train model using train set.
It must validate trained model using test set.
It should be able to classify the crypto currency data.
2. NON-FUNCTIONAL REQUIREMENTS
Non-functional requirements describe how a system should behave and establish constraints
on its functionality. This type of requirement is also known as the system's quality attributes.
Attributes such as performance, security, usability and compatibility are not features of
the system; they are required characteristics. They are emergent properties that
arise from the whole arrangement, and hence we cannot write a particular line of
code to implement them. Any attributes required by the customer are described in the
specification. We must include only those requirements that are appropriate for our
project.
1. ACCESSIBILITY:
Accessibility is a general term used to describe how far a product, device, service,
or environment is usable by as many people as possible.
In our project, people who have enrolled with the cloud can access the cloud to
store and retrieve their information with the help of a secret key sent to their
email ids. The UI is simple, efficient and easy to use.
2. MAINTAINABILITY:
In software engineering, maintainability is the ease with which a software product can be
modified in order to:
• Correct defects
New functionalities can be included in the project based on the client requirements simply by
adding the appropriate files to the existing project using the ASP.net and C# programming
languages. Since the programming is very straightforward, it is easier
to find and correct the defects and to make the improvements in the
project.
3. SCALABILITY:
The system is capable of handling an increase in total throughput under an increased
load when resources (typically hardware) are added.
The system can operate normally under conditions such as low bandwidth and a
large number of users.
4. PORTABILITY:
DESIGN
4.1 DESIGN GOALS
The goal of this project is to predict the highest and closing price of crypto currency on
a given day based on the crypto currency data of several preceding quarters. It is
technically challenging to predict the accurate price, mainly due to lack of seasonality
and the highly volatile nature of the cryptocurrency market. This is primarily a statistical
prediction problem. Artificial neural network (ANN) models of the time series are used to
perform the prediction task, mainly due to the ability of ANNs to deal with non-
linearities in the data such as the lack of seasonality. These two models are trained and
tested on crypto currency data starting from 2012 till the first quarter of 2018. In
order to make the one day ahead prediction of highest and closing price of crypto
currency, features such as open price, high price, low price, close price and volume of
currency (USD) are taken into consideration. To predict the highest and closing price
on a day of quarter, both the neural network models are trained with data over the
past eight quarters and tested over the next quarter. The document explains the
data preparation steps followed by the neural network models and their functionality.
Quantitative measures such as MSE (mean squared error) and NMSE (normalized mean
squared error) are used to evaluate the models. The predicted high and closing prices from
these two neural networks are presented in tabular format. At the end, the report discusses possible
improvements that can be made to increase the scope of the experiment. The
constant increase in crypto currency usage has become an extremely serious problem,
with the development of technology and hi-tech tools having a significantly greater
impact on the crypto currency price. The large amount of information also poses a
challenge when analyzing such data and identifying similarities or relations between the data.
There is also a challenge of inconsistency that can occur in the data due to
incompleteness in the dataset. Therefore, there is an urgent need for proper
techniques to analyze large volumes of data to get some useful results out of it. So the
main aim of this project is to propose a general and effective approach to predict the
crypto currency price using data mining techniques.
1. INPUT/OUTPUT PRIVACY
No sensitive information from the large data sets is taken. The data taken are
of use to society, as they help in solving important problems.
2. EFFICIENCY
The local computations done by the programmer help the system that is
developed to be more efficient than comparable systems. Efficiency is very important
when it comes to large systems, as it plays an important role.
Crypto currency dataset, which consists of the prices recorded from day to day
over several years.
Forecast engine.
Crypto currency database.
IMPLEMENTATION
5.1 DATASET
Several crypto currency data sets are available online to download for free. Most of
them provide the data related to the price of crypto currency on a minute-to-minute basis.
However, the main goal of the project is to create a one-day-ahead prediction of the highest
and closing price of crypto currency. So, we will need data such as the highest and
closing price of crypto currency for each day over a period of several years. The Quandl
API provides the crypto currency price data set, ranging from September 2011 –
2018 (present). This API gives access to crypto currency exchanges and daily crypto
currency values. It permits users to customise the query while using the
interface to download the historical crypto currency prices. The data is available in three
different formats, i.e. JSON, XML and CSV. Data is downloaded in the
.csv format. The size of the data is around 200KB. It has a total of 2381 data records (each
record corresponds to a day) consisting of the crypto currency open, high, low, closing price
and volume of crypto currency (USD) starting from Sept 2011 – 2018 (present).
However due to inconsistencies in the data from September 2011 to December 2011,
this data has been
discarded and data records starting from January 2012 – March 2018 are taken into
consideration for this project. So, after the data is cleaned, the final data set has a total
of 2271 data records. The total data records are divided into three (3) sets, namely: Y12-
13 – 2012 and 2013 data, Y14-15 – 2014 and 2015 data, Y16-17 – 2016 and 2017 data.
Y12-13 has eight quarters and the neural networks (TDNN, RNN) will be trained on this
data and tested on the first quarter of 2014. Similarly, Y14-15 has eight quarters and
the neural networks will be trained on this data and tested on the first quarter of 2016.
In the same way, Y16-17 has eight quarters and neural networks are trained on this
data and tested on the first quarter of 2018.
To predict the highest and closing price of crypto currency one day ahead, in each of
the sub data sets, columns high and close are shifted up by one (1) unit. In the three
sub data sets, it should be noted that the testing data is from 1st January to 18th March
and it is predicted on 19th March (of years 2014, 2016, 2018) for three sets
respectively. The data set has limited features and in the current project almost all
these features are considered valuable for the prediction task. To be clear, for
predicting the highest and closing price of crypto currency one step ahead, features
such as open, high, low, closing price and volume of crypto currency (USD) are used.
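
The one-unit shift of the target columns can be sketched with pandas; the file and column names here are assumptions used only for illustration:

import pandas as pd

df = pd.read_csv("bitcoin_daily.csv")   # hypothetical file with Open/High/Low/Close/Volume columns

# Shift High and Close up by one row so each day's features are paired
# with the next day's high and closing price (the prediction targets).
df["High_next"] = df["High"].shift(-1)
df["Close_next"] = df["Close"].shift(-1)

# The last row has no next-day target, so it is dropped.
df = df.dropna(subset=["High_next", "Close_next"])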
Therefore, certain steps are executed to convert the data into a small, clean data set.
This technique is performed before the execution of the iterative analysis. The set of
steps is known as data preprocessing. The process comprises:
Data Gathering
Data Cleaning
Data Normalization
Data Gathering:
Daily data of four channels are considered since 2013. First, the crypto currency price
history, which is extracted from Coin market cap through its open API. Secondly, data
from Blockchain is gathered, in particular we choose the average block size, the
number of
user addresses, the number of transactions, and the miners' revenue. We found it
counter-intuitive to include some Blockchain data, given the incessant scaling problem; on the
other hand, the number of accounts is, by definition, related to the price movements,
since an increase in the number of accounts either means more transactions occurring
(presumably for exchanging with different parties and not just transferring crypto
currency to another address), or it is a sign of more users joining the network.
All in all, these make for 12 features. The Pearson correlation between the attributes is
shown in Figure 2. Clearly, some attributes are not strongly correlated; for example, the
financial indices are correlated with each other, but not with any of the crypto
currency-related attributes. Also, we see how Google Trends are related to crypto
currency transactions.
Data Cleaning:
From exchange data we consider relevant only the Volume, Close, Open, High prices
and
Market capitalization. For all data sets if NaN values are found to be existent, they are
replaced with the mean of the respective attribute. After this, all datasets are merged
into one, along the time dimension. Judging from crypto currency price movements
during the period from 2013 until 2014, we considered it best to discard data points
before 2014; hence the data which will be passed to the network lies from 2014 until
September 2018.
Data Normalization:
Deciding on the method for normalizing a time series, especially a financial one, is never
easy. What is more, as a rule of thumb a neural network should not be fed data that takes
relatively large values, or data that is heterogeneous (referring to time series that have
different scales, like the exchange price versus Google Trends). Doing so can trigger large
gradient updates that will prevent the network from converging. To make learning
easier for the network, the data should take small values and be roughly homogeneous in range.
The column Resolution is dropped because it has no significance in helping to predict the
target variable.
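
A minimal normalization sketch along these lines, assuming the cleaned, merged feature frame is already available (scikit-learn's MinMaxScaler is used here as one reasonable choice; the listing later in this chapter applies a scaler in a similar way):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("merged_features.csv")   # hypothetical cleaned feature file
feature_cols = [c for c in df.columns if c != "date"]

# Rescale every feature independently to the [0, 1] range so that attributes with
# very different scales (exchange price vs. Google Trends) become comparable.
scaler = MinMaxScaler()
df[feature_cols] = scaler.fit_transform(df[feature_cols])

# Keep the fitted scaler: predictions must be mapped back with scaler.inverse_transform().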
5.3 CLASSIFICATION
This technique is used to divide various data into different classes. This process is also
similar to clustering. It segments data records into various segments which are known as
classes. Unlike clustering, here we have prior knowledge of the different classes. Ex: Outlook
email has an algorithm to categorize an email as legitimate or spam.
Decision Trees
Boosted Trees
Random Forest
Neural Networks
Normally a time series is a sequence of numbers along time. LSTM for sequence prediction
acts as a supervised algorithm, unlike its autoencoder version. As such, the overall dataset
should be split into inputs and outputs. Moreover, LSTM compares well with classic
statistical linear models, since it can more easily handle multiple-input forecasting problems. In
our approach, the LSTM will use previous data to predict the closing price 30 days ahead.
First, we need to decide on how many previous days one forecast will have access
to. We refer to this number as the window size. We have opted for 35 days in the case of
monthly prediction, and 65 days in that of two-month prediction; therefore the input data
set will be a tensor comprising matrices of dimension 35x12 or 65x12 respectively, such
that we have 12 features, and 35 rows in each window. So the first window will consist of rows 0
to 34 (Python is zero indexed), the second of rows 1 to 35, and so on. Another reason
for choosing this window length is that a small window leaves out patterns which may
appear in a longer sequence. The output data takes into account not only the window size
but also the prediction range, which in our case is 30 days. The output dataset starts from
row 35 up until the end, and is made of chunks of length 30. The prediction range also
determines the output size of the LSTM network.
Overall the idea is simple: we separate the data into chunks of 35 and push
these small windows of data into a numpy array. Each window is a 35x12 matrix, so all
windows together create the tensor. Furthermore, in an LSTM the input layer is, by design,
specified through the input shape argument on the first hidden layer, which reflects these
three dimensions of the input (samples, time steps, features).
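
A sketch of the windowing described above, assuming a feature matrix with 12 columns has already been prepared (numpy only; dimensions follow the 35-day window and 30-day horizon in the text):

import numpy as np

window = 35       # days each forecast can look back on
horizon = 30      # days ahead to predict

def make_windows(features, close_idx=3):
    # features: array of shape (n_days, 12); close_idx marks the closing-price column.
    X, y = [], []
    for start in range(len(features) - window - horizon + 1):
        X.append(features[start:start + window])                                 # one 35x12 matrix
        y.append(features[start + window:start + window + horizon, close_idx])   # the next 30 closes
    return np.array(X), np.array(y)

# X has shape (n_windows, 35, 12); y has shape (n_windows, 30).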
LSTM internals
A chief feature of feedforward networks is that they do not retain any memory.
So each input is processed independently, with no state being saved between inputs.
Given that we are dealing with time series, where information from previous crypto
currency prices is needed, we should maintain some information to predict the future. An
architecture providing this is the Recurrent neural network (RNN) which along with the
output has a self-directing loop. So the window we provide as input gets processed in a
sequence rather than in a single step. However, when the time step (size of window) is
large (which is often the case) the gradient gets too small/large, which leads to the
phenomenon known as vanishing/exploding gradient respectively [Chollet2017]. This
problem occurs while the optimizer backpropagates, and makes the algorithm keep running
while the weights hardly change at all. RNN variations mitigate the problem,
namely LSTM and GRU.
We used the Sequential API, rather than the functional one. The overall architecture is
as follows:
•1 LSTM Layer: The LSTM layer is the inner one, and all the gates mentioned at the
very beginning are already implemented by Keras, with a default recurrent activation of
hard sigmoid [Keras2015]. The LSTM parameters are the number of neurons and the input
shape as discussed above.
•1 Dropout Layer: Typically this is used before the Dense layer. In Keras, a dropout layer
can be added after any hidden layer; in our case it is placed after the LSTM.
• 1 Dense Layer: This is the regular fully connected layer.
• 1 Activation Layer: Because we are solving a regression problem, the last layer should
give the linear combination of the activations of the previous layer with the weight
vectors. Therefore, this activation is a linear one. Alternatively, it could be passed as a
parameter to the previous Dense layer.
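
A minimal sketch of this four-layer stack in Keras, under the assumptions above (35-step windows of 12 features, a 30-day output, and an illustrative choice of 50 units and 0.2 dropout; these hyperparameters are not stated in the report):

from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense, Activation

model = Sequential()
model.add(LSTM(50, input_shape=(35, 12)))   # window of 35 time steps, 12 features each
model.add(Dropout(0.2))                     # dropout placed after the LSTM layer
model.add(Dense(30))                        # one output per day of the 30-day horizon
model.add(Activation("linear"))             # linear activation for the regression output

model.compile(loss="mean_squared_error", optimizer="adam")
model.summary()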
# End-to-end listing. Steps not shown in the report (the scaler definition, the train/test
# split sizes, the create_dataset helper, loading of the minute-level data1 frame, and model
# fitting) are filled in below as assumptions so that the script runs end to end.
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import Dense, LSTM

# Load the price file and keep only the BTC/USD rows.
data = pd.read_csv("crypto currency.csv")
data.head()
data['rp_key'].value_counts()
df = data.loc[data['rp_key'] == 'btc_us']
df = df.reset_index(drop=True)
df['datetime'] = pd.to_datetime(df['datetime_id'])
df = df.loc[df['datetime'] > pd.to_datetime('2017-06-28 00:00:00')]
df.head()

# Exploratory feature engineering on a minute-level dataset (data1) with a Timestamp
# column; the loading step below is assumed.
data1 = pd.read_csv("bitcoin_minute.csv")
data1["Timestamp"] = pd.to_datetime(data1["Timestamp"])
data1["month"] = data1["Timestamp"].dt.month
data1["year"] = data1["Timestamp"].dt.year
data1["hour"] = data1["Timestamp"].dt.hour
data1["minute"] = data1["Timestamp"].dt.minute
data1["seconds"] = data1["Timestamp"].dt.second
data1.head()
data1 = data1.rename(columns={'Volume_(BTC)': 'VolumeBTC',   # assumed entry; only the next two appear in the report
                              'Volume_(Currency)': 'VolumeCurrency',
                              'Weighted_Price': 'WeightedPrice'})
data1['Open'].plot()
plt.show()
data1["Log_Normalization"] = data1["Open"] / len(data1["Open"])
data1["Log_Normalization"].head()
threshold = sum(data1.VolumeBTC) / len(data1.VolumeBTC)              # mean BTC volume as a threshold
data1["VolumeLevel"] = np.where(data1["VolumeBTC"] > threshold, "high", "low")   # assumed labelling
data1.loc[:, ["VolumeLevel", "VolumeBTC"]].head()

# Prepare the univariate series of last prices for the LSTM.
df = df[['last']]
dataset = df.values.astype('float32')
scaler = MinMaxScaler(feature_range=(0, 1))
dataset = scaler.fit_transform(dataset)

train_size = int(len(dataset) * 0.8)
test_size = len(dataset) - train_size
train, test = dataset[:train_size, :], dataset[train_size:, :]
print(len(train), len(test))

def create_dataset(dataset, look_back=1):
    # Build supervised pairs: look_back consecutive values -> the following value.
    dataX, dataY = [], []
    for i in range(len(dataset) - look_back - 1):
        a = dataset[i:(i + look_back), 0]
        dataX.append(a)
        dataY.append(dataset[i + look_back, 0])
    return np.array(dataX), np.array(dataY)

look_back = 10
trainX, trainY = create_dataset(train, look_back=look_back)
testX, testY = create_dataset(test, look_back=look_back)

# Reshape the inputs to the [samples, time steps, features] form expected by the LSTM.
trainX = np.reshape(trainX, (trainX.shape[0], trainX.shape[1], 1))
testX = np.reshape(testX, (testX.shape[0], testX.shape[1], 1))

model = Sequential()
model.add(LSTM(50, input_shape=(look_back, 1)))   # recurrent layer, as described in the architecture above
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(trainX, trainY, epochs=20, batch_size=32, verbose=1)

trainPredict = model.predict(trainX)
testPredict = model.predict(testX)

# Map the scaled predictions back to USD prices.
trainPredict = scaler.inverse_transform(trainPredict)
trainY = scaler.inverse_transform([trainY])
testPredict = scaler.inverse_transform(testPredict)
testY = scaler.inverse_transform([testY])

# Overlay the training and testing predictions on the actual series.
trainPredictPlot = np.empty_like(dataset)
trainPredictPlot[:, :] = np.nan
trainPredictPlot[look_back:len(trainPredict) + look_back, :] = trainPredict
testPredictPlot = np.empty_like(dataset)
testPredictPlot[:, :] = np.nan
testPredictPlot[len(trainPredict) + (look_back * 2) + 1:len(dataset) - 1, :] = testPredict

plt.plot(df['last'].values, label='Actual')   # .values aligns the x-axis with the prediction arrays
plt.plot(trainPredictPlot, label='Training')
plt.plot(testPredictPlot, label='Testing')
plt.legend(loc='best')
plt.show()
1. UNIT TESTING
Unit testing involves the design of test cases that validate that the internal program
logic is functioning properly, and that program inputs produce valid outputs. All
decision branches and internal code flow should be validated. It is the testing of
individual software units of the application; it is done after the completion of an
individual unit, before integration. This is a structural form of testing that relies on
knowledge of the unit's construction and is invasive. Unit tests perform basic tests at
component level and test a specific business process, application, and/or system configuration.
Unit tests ensure that each unique path of a business process performs
exactly to the documented specifications and contains clearly defined inputs
and expected results.
2. INTEGRATION TESTING
Integration tests are designed to test integrated software components to determine
whether they actually run as one program. Testing is event driven and is
more concerned with the basic outcome of screens or fields.
Integration tests demonstrate that, although the components were individually
satisfactory, as shown by successful unit testing, the combination of components is
correct and consistent. Integration testing is specifically aimed at exposing the problems
that arise from the combination of components.
An engineering validation test (EVT) is performed on the first engineering prototypes, to ensure
that the basic unit performs to design goals and specifications. It is important in identifying
design problems, and solving them as early in the design cycle as
possible is the key to keeping projects on schedule and within budget.
All too often, product design and performance problems are not detected until late in
the product development cycle, when the product is ready to be shipped. The old
adage holds true: it costs a penny to make a change in engineering, a dime in
production and a dollar after a product is in the field.
Validation is a quality assurance process of establishing evidence that provides a high
degree of assurance that a product, service, or system accomplishes its intended
requirements. This often involves acceptance of fitness for purpose with
end users and other product stakeholders.
The Mean Squared Error (MSE) is perhaps the simplest and most
common loss function, often taught in introductory Machine Learning
courses. To calculate the MSE, you take the difference between your
model’s predictions and the ground truth, square it, and average it out
across the whole dataset.
To calculate the MAE, you take the difference between your model’s
predictions and the ground truth, apply the absolute value to that
difference, and then average it out across the whole dataset.
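
As a simple illustration of both computations (the arrays below are made-up values, not the project's results):

import numpy as np

predictions = np.array([57100.0, 57600.0, 56900.0])
ground_truth = np.array([57200.0, 57500.0, 57050.0])

errors = predictions - ground_truth
mse = np.mean(errors ** 2)       # square the differences, then average
mae = np.mean(np.abs(errors))    # absolute differences, then average

print(mse, mae)   # approximately 14166.67 and 116.67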
Statistical analysis of the data indicates that the predicted price has
a mean value of 57,173.258 USD, a maximum value of 64,358.805
USD, and a minimum value of 50,775.013 USD, whereas the actual
price has a mean value of 57,249.388 USD, a maximum value of
64,380.999 USD, and a minimum value of 50,941.0 USD. The mean
difference between the mean values of the actual and the
predicted prices is 76.13 USD.
CONCLUSION
This was a very good exposure to learn a lot of new concepts. Crypto currency
prediction is a very crucial topic to deal with, and building a system suitable
for it was a challenging task. This project was an approach to use
different models such as LSTM, SVM and Random Forest and compare their
errors based on their predictions on the given dataset.
From the comparison table on page 54 we can see that the MAE (mean absolute
error) and MSE are the lowest for LSTM, while both SVM and Random Forest have
higher error rates. We also passed different datasets and compared the error
in each case, and we found that the error of the LSTM model was the lowest as compared
to the other models. The LSTM model was the most suitable for predicting the price
with the least error.
REFERENCES
https://ieeexplore.ieee.org/document/7395797
https://www.geeksforgeeks.org/unified-modeling-language-uml-sequence-diagrams/
https://www.geeksforgeeks.org/designing-use-cases-for-a-project/
https://dl.acm.org/citation.cfm?id=170072
https://www.edureka.co/blog/apriori-algorithm/
https://content.iospress.com/articles/intelligent-data-analysis/ida1-1-02
https://link.springer.com/book/10.1007%2F978-3-319-10247-4
https://www.geeksforgeeks.org/apriori-algorithm/