Bitcoin Price Prediction using Machine Learning in Python
Last Updated : 23 Jul, 2025
Machine learning proves immensely helpful in many industries by automating tasks that previously required human labor. One such application of ML is predicting whether a particular trade will be profitable or not.
In this article, we will learn how to use ML to predict a signal that indicates whether buying a particular asset will be helpful or not.
Let's start by importing some libraries which will be used for various purposes which will be explained later in this article.
Importing Libraries
Python libraries make it very easy for us to handle the data and perform typical and complex tasks with a single line of code.
- Pandas - This library helps to load the data frame in a 2D array format and has multiple functions to perform analysis tasks in one go.
- Numpy - Numpy arrays are very fast and can perform large computations in a very short time.
- Matplotlib/Seaborn - This library is used to draw visualizations.
- Sklearn - This module contains multiple libraries having pre-implemented functions to perform tasks from data preprocessing to model development and evaluation.
- XGBoost - This contains the eXtreme Gradient Boosting machine learning algorithm, one of the algorithms that helps us achieve high accuracy on predictions.
Python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn import metrics
import warnings
warnings.filterwarnings('ignore')
Importing Dataset
The dataset we will use here to perform the analysis and build a predictive model is Bitcoin price data. We will use OHLC ('Open', 'High', 'Low', 'Close') data for the Bitcoin price from 17th July 2014 to 29th December 2022, roughly eight and a half years of data.
Dataset: click here.
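If the CSV is not at hand, a similar OHLC series can be fetched with the yfinance package. This is an optional alternative, not part of the article's pipeline; the 'BTC-USD' ticker and the date range below are assumptions:
Python
# Alternative data source (assumption): fetch BTC-USD OHLC data with yfinance
import yfinance as yf

btc = yf.download('BTC-USD', start='2014-07-17', end='2022-12-30')
btc = btc.reset_index()  # move the DatetimeIndex into a 'Date' column
btc.to_csv('bitcoin.csv', index=False)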
Python
df = pd.read_csv('bitcoin.csv')
df.head()
Output:
First five rows of the data
Python
df.shape
Output:
(2713, 7)
From this, we got to know that there are 2713 rows of data available and for each row, we have 7 different features or columns.
Python
df.info()
Output:
Data description
Python
df.describe()
Output:
Descriptive statistical measures of the data
Exploratory Data Analysis
EDA is an approach to analyzing data using visual techniques. It is used to discover trends and patterns, or to check assumptions, with the help of statistical summaries and graphical representations.
While performing EDA on the Bitcoin price data, we will analyze how the price of the cryptocurrency has moved over time and how the end of each quarter affects it.
Python
plt.figure(figsize=(15, 5))
plt.plot(df['Close'])
plt.title('Bitcoin Close price.', fontsize=15)
plt.ylabel('Price in dollars.')
plt.show()
Output:
Variation in the price of cryptocurrency
The plot of the closing prices shows that Bitcoin has followed a broadly upward trend over the period.
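Since the price spans several orders of magnitude, re-plotting the same series on a logarithmic y-axis (a small addition for readability, not part of the original walkthrough) makes the early, low-price years easier to see:
Python
# Same close-price series on a logarithmic y-axis
plt.figure(figsize=(15, 5))
plt.plot(df['Close'])
plt.yscale('log')  # log scale keeps the early years visible next to the 2021 peak
plt.title('Bitcoin Close price (log scale)', fontsize=15)
plt.ylabel('Price in dollars.')
plt.show()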
Python
df[df['Close'] == df['Adj Close']].shape, df.shape
Output:
((2713, 7), (2713, 7))
From here we can conclude that all rows have the same values in the 'Close' and 'Adj Close' columns. Keeping redundant data in the dataset is not going to help, so we'll drop the 'Adj Close' column before further analysis.
Python
df = df.drop(['Adj Close'], axis=1)
Now let's draw the distribution plots for the continuous features in the dataset. But before moving further, let's check for null values, if any are present in the data frame.
Python
df.isnull().sum()
Output:
Sum of null values column wise
This implies that there are no null values in the data set provided.
Python
features = ['Open', 'High', 'Low', 'Close']

plt.subplots(figsize=(20, 10))
for i, col in enumerate(features):
    plt.subplot(2, 2, i + 1)
    # distplot is deprecated in recent seaborn; histplot(kde=True) is the modern equivalent
    sn.histplot(df[col], kde=True)
plt.show()
Output:
Distribution plot of the OHLC data
Python
plt.subplots(figsize=(20, 10))
for i, col in enumerate(features):
    plt.subplot(2, 2, i + 1)
    sn.boxplot(df[col], orient='h')
plt.show()
Output:
Boxplot of the OHLC data
There are many outliers in the data, which means that the price of Bitcoin has varied hugely over short periods of time. Let's check this with the help of a barplot.
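To put a number on the outlier observation right away, we can count the values that fall outside the 1.5×IQR whiskers the boxplots draw. This is a quick sketch using the common 1.5×IQR convention, which the article itself does not prescribe:
Python
# Count values outside the 1.5*IQR whiskers for each OHLC column
for col in features:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = ((df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)).sum()
    print(f'{col}: {outliers} outliers')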
Feature Engineering
Feature Engineering helps to derive some valuable features from the existing ones. These extra features sometimes help in increasing the performance of the model significantly and certainly help to gain deeper insights into the data.
Python
splitted = df['Date'].str.split('-', expand=True)
df['year'] = splitted[0].astype('int')
df['month'] = splitted[1].astype('int')
df['day'] = splitted[2].astype('int')
# Convert the 'Date' column to datetime objects
df['Date'] = pd.to_datetime(df['Date'])
df.head()
Output:
First five rows of the data
Now we have three more columns, namely 'day', 'month' and 'year', all derived from the 'Date' column that was initially provided in the data.
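Since 'Date' has been converted to datetime objects, the same three columns could equivalently be derived with pandas' dt accessor instead of string splitting; the sketch below is an alternative shown for reference:
Python
# Equivalent extraction using the datetime accessor
df['year'] = df['Date'].dt.year
df['month'] = df['Date'].dt.month
df['day'] = df['Date'].dt.day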
Python
# numeric_only=True skips the datetime 'Date' column when averaging
data_grouped = df.groupby('year').mean(numeric_only=True)

plt.subplots(figsize=(20, 10))
for i, col in enumerate(['Open', 'High', 'Low', 'Close']):
    plt.subplot(2, 2, i + 1)
    data_grouped[col].plot.bar()
plt.show()
Output:
Barplot of the mean price of the bitcoin year wise
Here we can observe why there are so many outliers in the data: the price of Bitcoin exploded in the year 2021.
Python
df['is_quarter_end'] = np.where(df['month'] % 3 == 0, 1, 0)
df.head()
Output:
First five rows of the data
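Earlier we said we would look at how quarter ends affect prices. A quick group-wise mean (a small extra check, not in the original code) shows whether quarter-end months behave differently from the rest:
Python
# Average closing price in quarter-end months vs. other months
print(df.groupby('is_quarter_end')['Close'].mean())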
Python
df['open-close'] = df['Open'] - df['Close']
df['low-high'] = df['Low'] - df['High']
df['target'] = np.where(df['Close'].shift(-1) > df['Close'], 1, 0)
Above we have added some more columns that will help in training our model. We added the 'target' feature, a signal indicating whether to buy or not; this is what we will train our model to predict. But before proceeding, let's check whether the target is balanced or not using a pie chart.
Python
counts = df['target'].value_counts()
# use the counts' own index as labels so they match the slice order
plt.pie(counts.values, labels=counts.index, autopct='%1.1f%%')
plt.show()
Output:
Pie chart for data distribution across two labels
When we add features to our dataset, we have to ensure that there are no highly correlated features, as they do not help in the learning process of the algorithm.
Python
plt.figure(figsize=(10, 10))
# numeric_only=True excludes the datetime 'Date' column from the correlation matrix
sn.heatmap(df.corr(numeric_only=True) > 0.9, annot=True, cbar=False)
plt.show()
Output:
Heatmap to find the highly correlated features
From the above heatmap, we can see a high correlation between the OHLC columns, which is expected. The added features are not highly correlated with each other or with the previously provided features, so we are good to go and build our model.
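If a list is preferred over a heatmap, the highly correlated pairs can also be extracted programmatically; this sketch reuses the same 0.9 threshold:
Python
# Print feature pairs whose absolute correlation exceeds 0.9
corr = df.corr(numeric_only=True)
for i, a in enumerate(corr.columns):
    for b in corr.columns[i + 1:]:
        if abs(corr.loc[a, b]) > 0.9:
            print(f'{a} <-> {b}: {corr.loc[a, b]:.2f}')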
Python
features = df[['open-close', 'low-high', 'is_quarter_end']]
target = df['target']

# Scaling the features
scaler = StandardScaler()
features = scaler.fit_transform(features)

# Split the data into training and validation sets;
# test_size=0.3 means 30% of the data is used for validation and 70% for training
X_train, X_valid, Y_train, Y_valid = train_test_split(
    features, target, test_size=0.3, random_state=42)
After selecting the features to train the model on, we should normalize the data because normalized data leads to stable and fast training of the model. After that, the whole data is split into two parts in a 70/30 ratio so that we can evaluate the performance of our model on unseen data.
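One caveat worth noting: a random split lets the model train on days that come after some of the validation days, which leaks future information in a time-series setting. A chronological split, sketched below as an alternative (not what this article uses), keeps the validation set strictly in the future:
Python
# Alternative: chronological 70/30 split instead of a random one
split = int(len(features) * 0.7)
X_train, X_valid = features[:split], features[split:]
Y_train, Y_valid = target.iloc[:split], target.iloc[split:]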
Model Development and Evaluation
Now it is time to train some state-of-the-art machine learning models (Logistic Regression, Support Vector Machine, XGBClassifier); based on their performance on the training and validation data, we will choose which ML model serves the purpose at hand better.
For the evaluation metric, we will use ROC-AUC. Instead of predicting hard labels (0 or 1), we would like the model to predict soft probabilities, continuous values between 0 and 1, and ROC-AUC is the standard way to measure the quality of such probabilistic predictions.
Python
models = [LogisticRegression(), SVC(kernel='poly', probability=True), XGBClassifier()]

for i in range(3):
    models[i].fit(X_train, Y_train)
    print(f'{models[i]} : ')
    print('Training Accuracy : ', metrics.roc_auc_score(
        Y_train, models[i].predict_proba(X_train)[:, 1]))
    print('Validation Accuracy : ', metrics.roc_auc_score(
        Y_valid, models[i].predict_proba(X_valid)[:, 1]))
    print()
Output:
Models
Among the three models we have trained, XGBClassifier has the highest training performance, but it is prone to overfitting as the gap between its training and validation accuracy is too high. In the case of Logistic Regression, this is not the case.
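To visualize this comparison rather than only print the scores, scikit-learn's RocCurveDisplay can overlay the validation ROC curves of the three trained models (a small addition for illustration):
Python
from sklearn.metrics import RocCurveDisplay

# Overlay the validation ROC curves of the three trained models
ax = plt.gca()
for model in models:
    RocCurveDisplay.from_estimator(model, X_valid, Y_valid, ax=ax)
plt.show()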
Now let's plot a confusion matrix for the validation data.
Python
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(models[0], X_valid, Y_valid, cmap='Blues')
plt.show()
Output:
Confusion matrix for the validation data
We can see that our model is working fine.
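For a fuller picture than the confusion matrix alone, a classification report adds per-class precision, recall and F1 (an extra check, not part of the original walkthrough):
Python
from sklearn.metrics import classification_report

# Per-class precision, recall and F1 for Logistic Regression on the validation set
print(classification_report(Y_valid, models[0].predict(X_valid)))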
You can download source code from here.