Machine Learning Lab

Mini Project

1. a. Housing Price Prediction with Sklearn Linear Regression

Aim:
To predict the price of houses using linear regression

Procedure:

1. Import libraries
In [1]:
import numpy as np   # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import os

# List every file available under the Kaggle input directory
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
/kaggle/input/housing-prices-dataset/Housing.csv
2. Reading data
In [2]:
data_frame = pd.read_csv('/kaggle/input/housing-prices-dataset/Housing.csv')
data_frame
Out[2]:

        price  area  bedrooms  bathrooms  stories mainroad guestroom basement hotwaterheating airconditioning  parking prefarea furnishingstatus
0    13300000  7420         4          2        3      yes        no       no              no             yes        2      yes        furnished
1    12250000  8960         4          4        4      yes        no       no              no             yes        3       no        furnished
2    12250000  9960         3          2        2      yes        no      yes              no              no        2      yes   semi-furnished
3    12215000  7500         4          2        2      yes        no      yes              no             yes        3      yes        furnished
4    11410000  7420         4          1        2      yes       yes      yes              no             yes        2       no        furnished
..        ...   ...       ...        ...      ...      ...       ...      ...             ...             ...      ...      ...              ...
540   1820000  3000         2          1        1      yes        no      yes              no              no        2       no      unfurnished
541   1767150  2400         3          1        1       no        no       no              no              no        0       no   semi-furnished
542   1750000  3620         2          1        1      yes        no       no              no              no        0       no      unfurnished
543   1750000  2910         3          1        1       no        no       no              no              no        0       no        furnished
544   1750000  3850         3          1        2      yes        no       no              no              no        0       no      unfurnished

545 rows × 13 columns


3. Cleaning the data
Convert the string ("object") columns to integer codes
In [3]:
data_frame['mainroad'] = data_frame['mainroad'].astype('category')
data_frame['mainroad'] = data_frame['mainroad'].cat.codes

data_frame['guestroom'] = data_frame['guestroom'].astype('category')
data_frame['guestroom'] = data_frame['guestroom'].cat.codes

data_frame['basement'] = data_frame['basement'].astype('category')
data_frame['basement'] = data_frame['basement'].cat.codes

data_frame['hotwaterheating'] = data_frame['hotwaterheating'].astype('category')
data_frame['hotwaterheating'] = data_frame['hotwaterheating'].cat.codes

data_frame['airconditioning'] = data_frame['airconditioning'].astype('category')
data_frame['airconditioning'] = data_frame['airconditioning'].cat.codes

data_frame['prefarea'] = data_frame['prefarea'].astype('category')
data_frame['prefarea'] = data_frame['prefarea'].cat.codes

code_mapping_furniture = {'unfurnished':0, 'semi-furnished':1, 'furnished':2}


data_frame['furnishingstatus'] = data_frame['furnishingstatus'].astype('category')
data_frame['furnishingstatus'] = data_frame['furnishingstatus'].map(code_mapping_furniture)

## We map furnishingstatus to ordered numbers because furnishing level is ordinal: the more furnished the house, the more expensive it tends to be, and preserving that order helps the model.

data_frame
Out[3]:

        price  area  bedrooms  bathrooms  stories  mainroad  guestroom  basement  hotwaterheating  airconditioning  parking  prefarea  furnishingstatus
0    13300000  7420         4          2        3         1          0         0                0                1        2         1                 2
1    12250000  8960         4          4        4         1          0         0                0                1        3         0                 2
2    12250000  9960         3          2        2         1          0         1                0                0        2         1                 1
3    12215000  7500         4          2        2         1          0         1                0                1        3         1                 2
4    11410000  7420         4          1        2         1          1         1                0                1        2         0                 2
..        ...   ...       ...        ...      ...       ...        ...       ...              ...              ...      ...       ...               ...
540   1820000  3000         2          1        1         1          0         1                0                0        2         0                 0
541   1767150  2400         3          1        1         0          0         0                0                0        0         0                 1
542   1750000  3620         2          1        1         1          0         0                0                0        0         0                 0
543   1750000  2910         3          1        1         0          0         0                0                0        0         0                 2
544   1750000  3850         3          1        2         1          0         0                0                0        0         0                 0

545 rows × 13 columns
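
As an aside, the six identical yes/no conversions above can be collapsed into a loop; a minimal sketch assuming the same data_frame and column names:

binary_cols = ['mainroad', 'guestroom', 'basement',
               'hotwaterheating', 'airconditioning', 'prefarea']
for col in binary_cols:
    # .cat.codes assigns codes alphabetically: 'no' -> 0, 'yes' -> 1
    data_frame[col] = data_frame[col].astype('category').cat.codes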


Check if the data frame contains Null values
In [4]:
data_frame.isnull().sum()
Out[4]:
price 0
area 0
bedrooms 0
bathrooms 0
stories 0
mainroad 0
guestroom 0
basement 0
hotwaterheating 0
airconditioning 0
parking 0
prefarea 0
furnishingstatus 0
dtype: int64
Define the target value and independent variables
In [5]:
x = data_frame.drop(columns = 'price')
y = data_frame['price']
y_max = max(data_frame['price'])
print(y_max)
13300000
Delete outliers
In [6]:
# import libraries
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import IsolationForest
/opt/conda/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.23.5)
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
In [7]:
# Create an instance of Isolation Forest
outlier_detector = IsolationForest(contamination=0.07) # Adjust the contamination parameter as needed

# Fit the detector on your data


outlier_detector.fit(x)

# Predict outliers (anomalies)


outliers = outlier_detector.predict(x)

x_clean = x[outliers == 1]
y_clean = y[outliers == 1]

#x = x_clean
#y = y_clean

## In this version, I DON'T USE x_clean and y_clean, but I have noted the results with different contamination values at the end of the file (I ran the notebook several times with different contamination values beforehand)
/opt/conda/lib/python3.10/site-packages/sklearn/base.py:439: UserWarning: X does not have valid feature names, but IsolationForest was fitted with feature names
  warnings.warn(
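The commented-out x_clean/y_clean path can also be evaluated systematically. A hedged sketch (not part of the original notebook) that would reproduce the contamination comparison reported in section 7:

from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

for c in [0.03, 0.05, 0.07]:
    # Flag inliers (+1) at each contamination level and refit on them only
    mask = IsolationForest(contamination=c, random_state=0).fit_predict(x) == 1
    xtr, xte, ytr, yte = train_test_split(x[mask], y[mask], test_size=0.3, random_state=0)
    m = LinearRegression().fit(xtr, ytr)
    print(f"contamination={c}: test R^2 = {r2_score(yte, m.predict(xte)):.2f}")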
4. Split data
In [8]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 0)
print("Done")
Done
5. Training the model
In [9]:
model = LinearRegression()
model.fit(x_train, y_train)
print("Done")
Done
In [10]:
c = model.intercept_
m = model.coef_
print(c)
print(m)

## y = m1x1 + m2x2 + ... + c


-353311.8336093277
[2.48857876e+02 1.34994406e+05 9.50583380e+05 4.18321569e+05
4.66890751e+05 3.68497644e+05 3.59364424e+05 1.24665331e+06
8.97037026e+05 2.23301809e+05 6.96754525e+05 2.30222653e+05]
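To read the coefficients more easily, one option (a small sketch assuming x from above) is to pair each coefficient with its feature name:

# Map each coefficient back to the column it multiplies
coef_table = pd.Series(model.coef_, index=x.columns)
print(coef_table.sort_values(ascending=False))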
In [11]:
y_pred_train = model.predict(x_train)
print("Done")
Done

In [12]:
import matplotlib.pyplot as plt
plt.scatter(y_train, y_pred_train)
plt.xlabel("Actual result")
plt.ylabel("Predicted result")
x_point = np.array([0,14000000])
y_point = np.array([0,14000000])
# max value of y is around 13 million
plt.plot(x_point, y_point, c = 'r')
print("RESULT WITH TRAINED DATA")
print("Number of data train: ", len(x_train))
plt.show()

## The red line (y = x) is drawn to make the plot easier to read: the closer a dot lies to the line, the more accurate the model's prediction for that house.
RESULT WITH TRAINED DATA
Number of data train: 381

In [13]:
from sklearn.metrics import r2_score
r2_score_without_test = r2_score(y_train, y_pred_train)
print(r2_score_without_test)
0.6575703217254214
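For reference, r2_score computes R² = 1 − SS_res/SS_tot; the same value can be reproduced by hand with NumPy:

ss_res = np.sum((y_train - y_pred_train) ** 2)    # residual sum of squares
ss_tot = np.sum((y_train - y_train.mean()) ** 2)  # total sum of squares
print(1 - ss_res / ss_tot)                        # ≈ 0.6576, matching r2_score above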
6. Test the model on the test data
In [14]:
y_pred_test = model.predict(x_test)
print("Done")
Done
In [15]:
import matplotlib.pyplot as plt
plt.scatter(y_test, y_pred_test)
plt.xlabel("Actual result")
plt.ylabel("Predicted result")
x_point = np.array([0,14000000])
y_point = np.array([0,14000000])
plt.plot(x_point, y_point, c = 'r')
print("RESULT WITH TESTED DATA")
print("Number of data test: ", len(x_test))
plt.show()

RESULT WITH TESTED DATA


Number of data test: 164

In [16]:
from sklearn.metrics import r2_score
r2_score_with_test = r2_score(y_test, y_pred_test)
print(r2_score_with_test)
0.723501522320035
7. Result
Results with different contamination values:

Contamination   0%     3%     5%     7%
Without test    0.67   0.66   0.68   0.62
With test       0.72   0.64   0.60   0.64

1. b. Housing Prices Multiple Regression

import numpy as np   # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will
# list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets
# preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved
# outside of the current session
/kaggle/input/housing-prices-dataset/Housing.csv
In [2]:
# Import necessary libraries
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt

df = pd.read_csv(r'/kaggle/input/housing-prices-dataset/Housing.csv')

# List of variables to map


varlist = ['mainroad', 'guestroom', 'basement', 'hotwaterheating', 'airconditioning', 'prefarea']

# Defining the map function


def binary_map(x):
    return x.map({'yes': 1, 'no': 0})

df['newFurnish'] = LabelEncoder().fit_transform(df['furnishingstatus'])

# Applying the function to the housing list


df[varlist] = df[varlist].apply(binary_map)
df.drop(['furnishingstatus'], axis = 1, inplace = True)
df.head(5)
Out[2]:

      price  area  bedrooms  bathrooms  stories  mainroad  guestroom  basement  hotwaterheating  airconditioning  parking  prefarea  newFurnish
0  13300000  7420         4          2        3         1          0         0                0                1        2         1           0
1  12250000  8960         4          4        4         1          0         0                0                1        3         0           0
2  12250000  9960         3          2        2         1          0         1                0                0        2         1           1
3  12215000  7500         4          2        2         1          0         1                0                1        3         1           0
4  11410000  7420         4          1        2         1          1         1                0                1        2         0           0
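Note that LabelEncoder assigns integer codes alphabetically, so here 'furnished' → 0, 'semi-furnished' → 1, 'unfurnished' → 2 (visible in the newFurnish column above); that reverses the ordinal scale used in part 1.a. If the ordered encoding is preferred, a hedged alternative is an explicit map applied before dropping the original column:

# Ordinal encoding: unfurnished < semi-furnished < furnished
df['newFurnish'] = df['furnishingstatus'].map(
    {'unfurnished': 0, 'semi-furnished': 1, 'furnished': 2})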

In [3]:
# Drop missing and invalid values
df = df.dropna()

# Separate the independent and dependent variables


X = df.drop(['price'], axis=1)
y = df['price']

# Split the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
In [4]:
# Create a linear regression model
model = LinearRegression()

# Fit the model to the training data


model.fit(X_train, y_train)

# Predict the prices on the test data


y_pred = model.predict(X_test)
In [5]:
# Calculate the mean squared error
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
Mean Squared Error: 986041803890.0269
In [6]:
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)
Out[6]:
0.6578047592637595
In [7]:
# Calculate the root mean squared error
rmse = mean_squared_error(y_test, y_pred, squared=False)

# Print the root mean squared error


print("Root Mean Squared Error:", rmse)
Root Mean Squared Error: 992996.3765744701
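Newer scikit-learn releases deprecate the squared argument of mean_squared_error, so as a hedged, version-independent alternative the RMSE can simply be taken as the square root of the MSE:

rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # same value as squared=False
print("Root Mean Squared Error:", rmse)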
In [8]:
plt.scatter(y_test, y_pred)
plt.xlabel('actual')
plt.ylabel('predicted')
plt.show()

2. Data Project - Stock Market Analysis

Time series data is a series of data points indexed in time order. Time series data is everywhere, so manipulating it is important for any data analyst or data scientist.
We will discover and explore data from the stock market, particularly some technology stocks (Apple, Amazon, Google, and Microsoft). We will learn how to use yfinance to get stock information, and visualize different aspects of it using Seaborn and Matplotlib. We will look at a few ways of analyzing the risk of a stock, based on its previous performance history. We will also be predicting future stock prices through a Long Short-Term Memory (LSTM) method!
We'll be answering the following questions along the way:
1.) What was the change in price of the stock over time?
2.) What was the daily return of the stock on average?
3.) What was the moving average of the various stocks?
4.) What was the correlation between different stocks?
5.) How much value do we put at risk by investing in a particular stock?
6.) How can we attempt to predict future stock behavior? (Predicting the closing stock price of APPLE Inc. using LSTM)

Getting the Data


The first step is to get the data and load it into memory. We will get our stock data from the Yahoo Finance website. Yahoo Finance is a rich resource of financial market data and tools to find compelling investments. To get the data from Yahoo Finance, we will be using the yfinance library, which offers a threaded and Pythonic way to download market data from Yahoo. Check this article to learn more about yfinance: Reliably download historical market data with Python.
1. What was the change in price of the stock over time?
In this section we'll go over how to handle requesting stock information with pandas, and how to analyze basic
attributes of a stock.
In [2]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
plt.style.use("fivethirtyeight")
%matplotlib inline

# For reading stock data from yahoo
from pandas_datareader.data import DataReader
import yfinance as yf
from pandas_datareader import data as pdr

yf.pdr_override()

# For time stamps
from datetime import datetime

# The tech stocks we'll use for this analysis
tech_list = ['AAPL', 'GOOG', 'MSFT', 'AMZN']

# Set up End and Start times for data grab (the last full year)
end = datetime.now()
start = datetime(end.year - 1, end.month, end.day)

# Download one year of data for each ticker into a global DataFrame
for stock in tech_list:
    globals()[stock] = yf.download(stock, start, end)

company_list = [AAPL, GOOG, MSFT, AMZN]
company_name = ["APPLE", "GOOGLE", "MICROSOFT", "AMAZON"]

# Tag each frame with its company name, then stack them into one DataFrame
for company, com_name in zip(company_list, company_name):
    company["company_name"] = com_name

df = pd.concat(company_list, axis=0)
df.tail(10)
[*********************100%***********************] 1 of 1 completed
[*********************100%***********************] 1 of 1 completed
[*********************100%***********************] 1 of 1 completed
[*********************100%***********************] 1 of 1 completed
Out[2]:

                                 Open        High         Low       Close   Adj Close    Volume company_name
Date
2023-01-17 00:00:00-05:00   98.680000   98.889999   95.730003   96.050003   96.050003  72755000       AMAZON
2023-01-18 00:00:00-05:00   97.250000   99.320000   95.379997   95.459999   95.459999  79570400       AMAZON
2023-01-19 00:00:00-05:00   94.739998   95.440002   92.860001   93.680000   93.680000  69002700       AMAZON
2023-01-20 00:00:00-05:00   93.860001   97.349998   93.199997   97.250000   97.250000  67307100       AMAZON
2023-01-23 00:00:00-05:00   97.559998   97.779999   95.860001   97.519997   97.519997  76501100       AMAZON
2023-01-24 00:00:00-05:00   96.930000   98.089996   96.000000   96.320000   96.320000  66929500       AMAZON
2023-01-25 00:00:00-05:00   92.559998   97.239998   91.519997   97.180000   97.180000  94261600       AMAZON
2023-01-26 00:00:00-05:00   98.239998   99.489998   96.919998   99.220001   99.220001  68523600       AMAZON
2023-01-27 00:00:00-05:00   99.529999  103.489998   99.529999  102.239998  102.239998  87678100       AMAZON
2023-01-30 00:00:00-05:00  101.089996  101.739998   99.010002  100.550003  100.550003  70566100       AMAZON
Reviewing the content of our data, we can see that the data is numeric and the date is the index of the data.
Notice also that weekends are missing from the records.
Quick note: Using globals() is a sloppy way of setting the DataFrame names, but it's simple. Now that we have our data, let's perform some basic data analysis and check it.
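A tidier pattern (a hedged sketch, not part of the original notebook) keeps the frames in a dict keyed by ticker instead of generating global names:

# One DataFrame per ticker, accessed by key rather than by a generated global
data = {ticker: yf.download(ticker, start, end) for ticker in tech_list}
aapl = data['AAPL']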
Descriptive Statistics about the Data
.describe() generates descriptive statistics. Descriptive statistics include those that summarize the central
tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values.

Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will
vary depending on what is provided. Refer to the notes below for more detail.
In [3]:
# Summary Stats
AAPL.describe()
Out[3]:

             Open        High         Low       Close   Adj Close        Volume
count  251.000000  251.000000  251.000000  251.000000  251.000000  2.510000e+02
mean   152.117251  154.227052  150.098406  152.240797  151.861737  8.545738e+07
std     13.239204   13.124055   13.268053   13.255593   13.057870  2.257398e+07
min    126.010002  127.769997  124.169998  125.019997  125.019997  3.519590e+07
25%    142.110001  143.854996  139.949997  142.464996  142.190201  7.027710e+07
50%    150.089996  151.990005  148.199997  150.649994  150.400497  8.100050e+07
75%    163.434998  165.835007  160.879997  163.629997  163.200417  9.374540e+07
max    178.550003  179.610001  176.699997  178.960007  178.154037  1.826020e+08
We have only 251 records in one year because weekends are not included in the data.
Information About the Data
.info() method prints information about a DataFrame including the index dtype and columns, non-null values,
and memory usage.
In [4]:
# General info
AAPL.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 251 entries, 2022-01-31 00:00:00-05:00 to 2023-01-30 00:00:00-05:00
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Open 251 non-null float64
1 High 251 non-null float64
2 Low 251 non-null float64
3 Close 251 non-null float64
4 Adj Close 251 non-null float64
5 Volume 251 non-null int64
6 company_name 251 non-null object
dtypes: float64(5), int64(1), object(1)
memory usage: 23.8+ KB
Closing Price
The closing price is the last price at which the stock is traded during the regular trading day. A stock’s closing
price is the standard benchmark used by investors to track its performance over time.
In [5]:
# Let's see a historical view of the closing price
plt.figure(figsize=(15, 10))
plt.subplots_adjust(top=1.25, bottom=1.2)

for i, company in enumerate(company_list, 1):
    plt.subplot(2, 2, i)
    company['Adj Close'].plot()
    plt.ylabel('Adj Close')
    plt.xlabel(None)
    plt.title(f"Closing Price of {tech_list[i - 1]}")

plt.tight_layout()

Volume of Sales
Volume is the amount of an asset or security that changes hands over some period of time, often over the course
of a day. For instance, the stock trading volume would refer to the number of shares of security traded between
its daily open and close. Trading volume, and changes to volume over the course of time, are important inputs
for technical traders.
In [6]:
# Now let's plot the total volume of stock being traded each day
plt.figure(figsize=(15, 10))
plt.subplots_adjust(top=1.25, bottom=1.2)

for i, company in enumerate(company_list, 1):
    plt.subplot(2, 2, i)
    company['Volume'].plot()
    plt.ylabel('Volume')
    plt.xlabel(None)
    plt.title(f"Sales Volume for {tech_list[i - 1]}")

plt.tight_layout()

Now that we've seen the visualizations for the closing price and the volume traded each day, let's go ahead and calculate the moving average for the stock.

2. What was the moving average of the various stocks?


The moving average (MA) is a simple technical analysis tool that smooths out price data by creating a
constantly updated average price. The average is taken over a specific period of time, like 10 days, 20 minutes,
30 weeks, or any time period the trader chooses.
In [7]:
ma_day = [10, 20, 50]

# Add a rolling-mean column per window size to each company's DataFrame
for ma in ma_day:
    for company in company_list:
        column_name = f"MA for {ma} days"
        company[column_name] = company['Adj Close'].rolling(ma).mean()

fig, axes = plt.subplots(nrows=2, ncols=2)
fig.set_figheight(10)
fig.set_figwidth(15)

AAPL[['Adj Close', 'MA for 10 days', 'MA for 20 days', 'MA for 50 days']].plot(ax=axes[0,0])
axes[0,0].set_title('APPLE')

GOOG[['Adj Close', 'MA for 10 days', 'MA for 20 days', 'MA for 50 days']].plot(ax=axes[0,1])
axes[0,1].set_title('GOOGLE')

MSFT[['Adj Close', 'MA for 10 days', 'MA for 20 days', 'MA for 50 days']].plot(ax=axes[1,0])
axes[1,0].set_title('MICROSOFT')

AMZN[['Adj Close', 'MA for 10 days', 'MA for 20 days', 'MA for 50 days']].plot(ax=axes[1,1])
axes[1,1].set_title('AMAZON')

fig.tight_layout()

We can see in the graph that the best windows for the moving average are 10 and 20 days, because they still capture the trends in the data without too much noise.
