CHAPTER 19
SIMPLE LINEAR REGRESSION

In this chapter and subsequent chapters, we are going to discuss Machine Learning models which are useful to analyze data and provide predictions about new data. 'Model' is a term that represents an algorithm or logic. The main purpose of a model is to understand the given data. It is something like the brain of a human being, which analyzes the data received from the sense organs like the eyes, ears, nose, tongue and skin. The light reflected from objects enters the eyes and then the brain. The neural network in the brain transmits this data to a particular center (or point) in the brain, where the light signal is understood and interpreted to decide what object the eyes have seen. Depending on the shapes and patterns already stored in the brain, it interprets that object as a cat, a car, a human being, etc.

A Machine Learning model does the same thing. When data is given to the model, it uses some mathematical formula and fits the data into that formula. If the data fits the formula in the best possible manner, then the model will understand the relationship between the pieces of data according to the formula. When new data is encountered, the model will apply the same formula to the new data and make predictions about it.

Various Machine Learning models were created by Computer Scientists and Data Scientists to explain various relationships between pieces of data. When certain data is given to us, it is up to us to select the correct Machine Learning model to apply to the data. When our model is not correct, the results will not be accurate.

The word 'regression' means a measure of the relation between variables or pieces of data. A regression model such as Linear Regression or Ridge Regression tries to understand the relationship between different pieces of data. There are 2 objectives of any regression model.
They are:

1. To establish a relationship between two variables: There are two types of relationship that can exist between variables. When a variable increases, if another variable also increases, it is called a positive relationship. When a variable increases, if the other one decreases, it is called a negative relationship. For example, when income increases, expenditure can also increase; this is a positive relationship. When the temperature increases, the humidity in the climate will decrease; this is a negative relationship.

2. To predict new observations: Once the regression model understands the relationship between the variables, it can predict new results. For example, when the sales data of the past 1 year is given to a regression model, it can predict the sales of the next quarter.

Variables

A variable is nothing but data. We know that a data frame contains data in the form of several rows and columns. Here, the columns are called variables. We can classify the variables into 2 types. They are:

1. Dependent variable: This is the variable whose value is to be forecast or predicted. Its value is dependent on the values of other variables, called 'independent variables'. Dependent variables are also called 'response variables' or 'target variables'. In mathematical equations, they are generally represented by the letter 'y'.

2. Independent variable: This is the variable which is useful to calculate the value of another variable. Independent variables do not depend on each other. While this may not always be true in practice, they are so called since they are treated as not having dependency on any other variable. Independent variables are also called 'features' or 'regressors'. In mathematical equations, they are represented by the letter 'x'.

Linear Regression

Linear regression is a Machine Learning model that depends on the linear relationship between a dependent variable and one or more independent variables. Let us understand the phrase 'linear relationship'.
We can say two variables are in a linear relationship when their values can be represented using a straight line. That means, the data points (values of the variables) lie on a straight line.

When there is only one independent variable, it is called 'Simple Linear regression'. When there is more than one independent variable, the model is called 'Multiple Linear regression'. Linear regression is also called 'Least Squares regression', a term we will understand later in this chapter. In this chapter, we will focus on Simple Linear regression, where only one independent variable is considered.

The Linear Equation

The equation of a straight line is:

y = b + mx

Here, m is the slope and b is a constant. This equation is useful to find the y value depending on the x value. Here, y is called the dependent variable and x is known as the independent variable. In statistics, we write the same equation as:

y = B0 + B1x

Here, the slope is B1. The constant value B0 is called the intercept; it indicates the distance on the y axis at which the line crosses it.

Let us take a linear equation: y = 4 + 2x. Compare this with y = B0 + B1x. Here, B0 = 4 and B1 = 2. By substituting an x value into this equation, we can find the value of y. So, x is called the independent variable and y is called the dependent variable, since the y value depends on the x value.

If x = 0, then y = 4 + 2(0) = 4.
If x = 2, then y = 4 + 2(2) = 8.
If x = 4, then y = 4 + 2(4) = 12.
If x = 6, then y = 4 + 2(6) = 16.

In this manner, when x is increased in steps of 2, the y values increase in steps of 4. These values are shown in Figure 19.1. From the figure, we can calculate the slope B1 and the intercept B0, as:

Slope B1 = deviation in y / deviation in x = dy / dx = 4 / 2 = 2
Intercept B0 = distance on the y axis where the line crosses = 4

Figure 19.1: Understanding the Linear Equation (the line y = B0 + B1x, i.e. y = 4 + 2x, plotted for x = 0 to 8)

Why do we call y = 4 + 2x a linear equation?
Because when the x values and y values are drawn in the form of a graph, as in Figure 19.1, they will show a straight line. That means the relationship between the independent variable (x) and the dependent variable (y) is linear. When such a relation exists in the data, we can apply Linear regression to analyze the data.

The r Squared Value

After reading the data from the dataset, we can plot the data points in the form of a graph. In reality, the data points may not lie exactly on the straight line. There will be deviations of the data points from the line. This is called the error 'E'. Linear regression should consider this error, and hence the formula will be:

y = B0 + B1x + E

Let us discuss this error term 'E' now. Suppose the original y values are: y = (1, 2, 3, 4, 5). But due to the deviations of the data points from the straight line, the observed y values are deviated. These deviated y values are: y1 = (0.8, 2.5, 3, 4.8, 4.4).

That means we should get 1, but we got 0.8 as the y value. This difference is called the error (E1 = y - y1). We should square these errors. If we do not square them, the positive and negative values may cancel out while finding the total. Hence, squaring is needed.

Similarly, we have to calculate the differences of the y values from their mean. These are the deviations from the mean value (E2 = y - Mean). We have to square this value (E2) also.

Now, r squared value = 1 - (Sum of E1^2 / Sum of E2^2)

The above formula can be used to find the value of r squared. Please observe the following table to understand how to calculate the r squared value.

Table 19.1: Calculating the r squared value

y        y1     E1 = y - y1   E1^2          E2 = y - Mean   E2^2
1        0.8    0.2           0.04          -2              4
2        2.5    -0.5          0.25          -1              1
3        3      0             0             0               0
4        4.8    -0.8          0.64          1               1
5        4.4    0.6           0.36          2               4
Mean = 3                      Sum1 = 1.29                   Sum2 = 10

Using the above table data, the formula will be:

r squared = 1 - (Sum1 / Sum2) = 1 - (1.29 / 10) = 0.871

This r squared value is also called the 'Coefficient of determination'. The r squared value obtained here is 0.871.
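Table 19.1 can be reproduced directly in Python. The following sketch computes Sum1, Sum2 and the r squared value from the y and y1 lists given above (the variable names are ours):

```python
y  = [1, 2, 3, 4, 5]            # original y values
y1 = [0.8, 2.5, 3, 4.8, 4.4]    # deviated y values from the line
mean_y = sum(y) / len(y)        # Mean = 3

sum1 = sum((a - p) ** 2 for a, p in zip(y, y1))   # sum of E1^2 -> 1.29
sum2 = sum((a - mean_y) ** 2 for a in y)          # sum of E2^2 -> 10
r_squared = 1 - sum1 / sum2                       # -> 0.871
print(r_squared)
```

This matches the value in Table 19.1 and the r2_score() output shown below.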
In percentage, it is 0.871 x 100 = 87.1. This indicates an 87% accuracy level for the model. That means the Linear regression model in this example can explain 87% of the data successfully, whereas the remaining 13% cannot be explained by the model. Hence, there is a chance of error at the 13% level. The accuracy level of the model is 87%.

The r squared value will be in the range of 0 to 1. If the r squared value is closer to 1, then the actual values and the predicted values (on the regression line) will be very close. It represents high accuracy of the model. When the r squared value is nearer to 0, they are much apart, so the predictions may not be correct.

An r squared value of 1 happens when there are no deviations between y and y1. That means every E1 value is 0. Then E1^2 will also be 0, and the sum of squares of errors (Sum1) will be 0. This indicates 100% accuracy for the model. So, the point is this: in Linear regression models, the sum of squares of errors should have the least value, and that represents high accuracy. This is the reason the 'Simple Linear Regression model' is also called the 'Least Squares Regression model'.

The following Python code explains how to calculate the r squared value for the above example. 'sklearn' is a package from scikit-learn.org that contains many machine learning related modules. In sklearn, we have a module by the name metrics. This module contains a function r2_score(). By calling this function and passing the original data and the predicted data, we can find the r squared value as shown below:

from sklearn.metrics import r2_score
y = [1, 2, 3, 4, 5]
y1 = [0.8, 2.5, 3, 4.8, 4.4]
R_square = r2_score(y, y1)
print('Coefficient of Determination', R_square)

Output:
Coefficient of Determination 0.871

Practical Use of Simple Linear Regression

We are given prices of houses based on their area in New York city. That means, our data will have the area of the house in square foot and its price. We should understand whether there is any linear relationship between the area and the price of the houses. Then, we apply the Simple Linear Regression model to analyze the data.
Finally, we have to find the price of a new house when its area is given.

The dataset: homeprices.csv

This is a simple dataset that contains only 5 rows with 2 columns. The columns are the area of the house in square foot and the price of the house. The price of the house is mentioned in dollars. This dataset is available in kaggle.com. The data is shown in Figure 19.2.

Figure 19.2: Home prices in New York city (prices: 550000, 565000, 610000, 680000, 725000 dollars)

In this dataset, if there exists a linear relationship between area and price, then that relationship can be represented by the equation of the straight line: y = mx + b. In this equation, the dependent variable is 'y', whose value depends on 'x'. In our dataset, we are supposed to find the price of a house depending on its area. So, what we have to find, i.e. the price, becomes the dependent variable or response variable, and what it depends on, i.e. the area, becomes the independent variable or feature. So, the equation becomes:

price = m * area + b

We know that 'm' represents the slope of the line and 'b' is a constant value that represents the intercept on the y axis. These 'm' and 'b' values are calculated internally by the Simple Linear regression model.

Let us draw a scatter plot to see the relationship between the area and the price. This can be done using the scatterplot() function of the seaborn module, as:

sns.scatterplot(data=df, x='area', y='price')

Figure 19.3: Data points are aligned in a straight line (area from 2600 to 4000 sqft on the x axis, price on the y axis)

By observing the data points (dots) in the scatter plot, we can understand that they can be connected more or less using a straight line. Hence, we can apply the Simple Linear Regression machine learning model on this data.

In Python, machine learning models (or logic) are implemented in the form of various classes in the 'sklearn' package by scikit-learn.org. The name 'scikit' represents Scientific Kit.
If we want to use a particular machine learning model, first we should know its class name and then create an object of that class. For example, to implement the Linear Regression model, we have to first create an object of the 'LinearRegression' class found in the linear_model module of the sklearn package. Observe the following code:

from sklearn.linear_model import LinearRegression
reg = LinearRegression()  # create object to LinearRegression class

Now the model is available to us in the form of the 'reg' object. We can call any methods of the LinearRegression class using this object 'reg'. The next step is to train the machine learning model by calling the fit() method on the data, in the form of:

reg.fit(x, y)

Here, 'x' indicates the independent variable data, which should be passed as a 2D array, and 'y' indicates the dependent variable, i.e. df.price. Now the question is: how to convert the df.area column data into a 2D array? There are two ways.

The first way is to first convert the df.area column into an array by using the values attribute, as:

arr = df.area.values

Then convert the 1D array into a 2D array by using the reshape() method of numpy arrays, as:

arr = df.area.values.reshape(-1, 1)  # gives 2D array

What is reshape(-1, 1) doing? It is converting the 1D array into a 2D array by reshaping the array in such a way that the resultant array contains only 1 column. For example, let us take an array of shape (2, 4). When we reshape it with (-1, 1), the array will get reshaped in such a way that the resultant array has only 1 column, and this is only possible by taking 8 rows (i.e. 2 x 4 = 8). Hence, the resultant array will have the shape (8, 1).

Finally, we can use the fit() method, as:

reg.fit(df.area.values.reshape(-1, 1), df.price)

The second way of converting the df.area column into a 2D array is to take the 'area' column name in two pairs of square braces, as:

df[['area']]  # area column in 2D array format

In this case, we can write the fit() method as:

reg.fit(df[['area']], df.price)

With this, the model has been trained with the data. That means, the model could fit the data in the form of a straight line.
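The behaviour of reshape(-1, 1) described above can be checked quickly with numpy alone (a small sketch; the sample values are ours):

```python
import numpy as np

a = np.array([2600, 3000, 3200, 3600, 4000])  # 1D array, shape (5,)
b = a.reshape(-1, 1)                          # 2D array, shape (5, 1)

# The (2, 4) example from the text: reshaping with (-1, 1) gives (8, 1)
c = np.zeros((2, 4)).reshape(-1, 1)
print(a.shape, b.shape, c.shape)  # (5,) (5, 1) (8, 1)
```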
The trained model has now understood that there is a linear relationship in the given data, and it is ready to be tested with new data. Let us now predict the price of a house having a 3300 square foot area; this area is not there in the dataset. We have to call the predict() method of the model and pass the new area value as a 2D array, as:

reg.predict([[3300]])

Output:
array([628715.75341516])

Observe the output. It shows only 1 element in the form of a 1D array. It says that the price of a 3300 sqft house will be around 628715 dollars. Please remember, the output of machine learning models will be a 1D array.

Let us now understand the inner details of the Simple Linear Regression model. The following is the equation used by the model:

price = m * area + b

To calculate the price of a 3300 sqft house, the above formula is used by our model. But how is it using the above formula? To use the above formula, it should know the slope of the line (m) and the intercept (b). The following attributes of the LinearRegression object give these values:

reg.coef_       # the slope m: 135.78767123
reg.intercept_  # the intercept b: 180616.43835616432

Let us substitute these values to find out the price:

price = m * area + b = 135.78767123 * 3300 + 180616.43835616432

The price value is 628715.7534151643. It means the price of a 3300 sqft house will be around 628715 dollars. The same output has already been given by the predict() method. Hence, we can confirm that the Simple Linear Regression model is using the straight line equation to predict the new values.

Similarly, if we want to predict the price of a house having an area of 5000 sqft, we can use the predict() method, as:

reg.predict([[5000]])

Output:
array([859554.79452055])

That means, the predicted price is around 859554 dollars.
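As a cross-check, the slope, intercept and prediction above can be reproduced by fitting the model on the five rows of the dataset. The price values are the ones listed in Figure 19.2; the area values below are an assumption read from the scatter plot in Figure 19.3 (they reproduce the printed coefficients):

```python
from sklearn.linear_model import LinearRegression

area  = [[2600], [3000], [3200], [3600], [4000]]   # assumed areas (sqft)
price = [550000, 565000, 610000, 680000, 725000]   # prices from Figure 19.2

reg = LinearRegression().fit(area, price)
m, b = reg.coef_[0], reg.intercept_
print(round(m, 8), round(b, 2))          # 135.78767123 180616.44
print(round(reg.predict([[3300]])[0], 2))  # 628715.75
```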
To find the accuracy level of the model, we have to calculate the r squared value using the r2_score() function. To this function, we have to pass the original prices and the prices predicted by our model. Observe the following code:

from sklearn.metrics import r2_score
y_original = df.price
y_predicted = reg.predict(df[['area']])
R_square = r2_score(y_original, y_predicted)
print('r squared value', R_square)

Output:
r squared value 0.9584307138199486

This shows 95.8% accuracy for our model, which is pretty good.

Now, it is time to see how the data is fit by the model in the form of a straight line (or regression line). We can draw the scatter plot once again, this time with a line, using the lmplot() function, as:

sns.lmplot(data=df, x='area', y='price')

lmplot() will draw the straight line that best fits the data points, as shown in Figure 19.4.

Figure 19.4: Data points with regression line

Program 1: Predicting the price of a house in New York depending on its area. We are given the areas and prices of houses. Using the Simple Linear regression model on the house area data, find out the price of a house whose area is 3300 sqft.

# predicting the house prices depending on area
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression

# load the data into dataframe
df = pd.read_csv("E:/test/homeprices.csv")
df

# plot a scatter plot
sns.scatterplot(data=df, x='area', y='price')

# once we see the scatter plot, we can understand that the
# distribution is linear and can use Linear regression model.
reg = LinearRegression()
reg.fit(df[['area']], df.price)  # fitting means training

# predict the price of 3300 sft house
reg.predict([[3300]])  # 628715.7534151643

# find the coefficient. this is slope m
reg.coef_

# find the intercept. this is b
reg.intercept_

# if we substitute m and b values in y = mx+b,
# we get the predicted value above.
y = 135.78767123 * 3300 + 180616.43835616432
y  # displays 628715.7534151643

# next predict the price of 5000 sft house
reg.predict([[5000]])  # 859554.79452055

# find accuracy level of the model by finding r squared value
# gives 95.8% accuracy
from sklearn.metrics import r2_score
y_original = df.price
y_predicted = reg.predict(df[['area']])
R_square = r2_score(y_original, y_predicted)
print('r squared value', R_square)

# display the scatter plot with a regression line
sns.lmplot(data=df, x='area', y='price')

Let us execute the above program line by line in Spyder IDE, as shown in Figure 19.5.

Figure 19.5: Executing the program in Spyder

Simple Linear Regression with Train and Test Data

Now, please observe the following statements:

reg = LinearRegression()
reg.fit(df[['area']], df.price)  # train the model

Here, we passed all the rows of the dataset to the Simple Linear Regression model, and the model is trained on the entire data. But in many cases, this may not be desirable. Machine Learning models expect us to divide the data rows into 2 parts: some rows are used for training the model and the other rows are used for testing. Generally, 70% of the rows are used for the purpose of training the model and the remaining 30% are used for testing the model. Alternately, we can use 80% of the rows for training and the remaining 20% for testing.

First, let us see how to split the data into train and test data. The sklearn package has a model_selection module that contains a function by the name train_test_split(). This function is used to split the data rows into 4 parts: x_train, x_test, y_train, and y_test.

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

The train_test_split() function takes the x (independent variable) and y (dependent variable) parts of the data and splits the data into 4 parts, which we represent as x_train, x_test, y_train, and y_test. Please observe the attribute test_size=0.3; this indicates that the test size is 30%. That means it randomly selects 30% of the rows in the dataset and uses them for testing. The remaining 70% of the rows are used as training data. The rows are selected randomly depending on the seed given in random_state. Suppose random_state=0; then it takes the seed 0 and creates some random numbers, like: 3, 1, 5, 6, 9. That means the 3rd, 1st, 5th, 6th and 9th rows are selected for testing. When the same seed is given, it will always generate the same row numbers for selecting the test data, and hence the output of the program will be the same on running it several times. If the seed value is changed, then it will select a different set of rows. If we do not use the random_state attribute, then it will generate an integer number randomly and, based on that, select the rows for testing. These rows may change every time the program is run.

We can also use the train_test_split() function by mentioning the train_size attribute instead of the test_size attribute.

Once the model is trained, we should verify its accuracy by comparing the original test data y_test with the predicted data y_pred.
Comparing y_test with y_pred can be done using the r2_score() function:

from sklearn.metrics import r2_score
r2_score(y_test, y_pred)

The overall flow can be easily understood from Figure 19.6.

Figure 19.6: Splitting and using the data with a Machine Learning Model (the dataset is split into train and test data; the model is trained on the train data, and its output on the test data is compared with y_test)

To understand the relationship between the x and y variables in the train data, we can draw a scatter plot. This scatter plot represents how the original data is distributed along the x and y axes:

plt.scatter(x_train, y_train)

The regression line is the line that is used by the model to fit the above data. This regression line is given by y = mx + c. When we pass train data to the model, it predicts the output according to this formula. Hence, the regression line should be drawn between the train data and the predicted data:

plt.plot(x_train, reg.predict(x_train))

In the above statement, x_train represents the train data and reg.predict(x_train) represents the predicted data for all the rows in x_train.

When we display the scatter plot along with the line plot using the previous two statements, it looks like the plot shown in Figure 19.7.

Figure 19.7: Scatter plot with regression line (Experience vs Salary: years of experience on the x axis, salary from 40000 to 120000 on the y axis)

Now, let us solve a task related to how to decide the salary of an employee depending on his experience. We are given Salary_Data.csv, which contains the experience of employees and their salaries. Split the dataset into train data and test data. Using the train data, train the model, and then predict the salary of an employee having 11 years of experience.

The dataset: Salary_Data.csv

This dataset contains data about the experience of an employee and his salary. There are 30 rows and 2 columns. The column names are: YearsExperience and Salary. Since we have to predict the Salary depending on YearsExperience, we say that 'Salary' is the dependent variable and 'YearsExperience' becomes the independent variable. See Figure 19.8.

Figure 19.8: Employee experience and salary dataset

Since there is only 1 independent variable, we can use the Simple Linear Regression model. First, let us check whether the data fits into this model or not. This can be done by drawing a scatter plot between the x and y variables, as:

plt.scatter(x, y)

Figure 19.9: The data points are showing a linear relationship (salary from 40000 to 120000 on the y axis)

This indicates a linear relationship between the x and y variables. When the x value is increasing, the y value is also increasing. Hence, this is a positive relationship. So, we can use the Simple Linear Regression model to analyze the data.

Program 2: Train the computer using the Simple Linear Regression model on experience and salary data. Predict the salary of an employee having 11 years of experience.

# Simple Linear Regression with train and test data
import pandas as pd
import matplotlib.pyplot as plt

# load the dataset from the computer into dataset object
dataset = pd.read_csv("E:/test/Salary_Data.csv")

# retrieve only 0th column and take it as x
x = dataset.iloc[:, :-1].values
x

# retrieve only 1st column and take it as y
y = dataset.iloc[:, 1].values
y

# draw scatter plot to verify Simple Linear Regression
# model can be used. Scatter plot shows dots as straight line
plt.scatter(x, y)

# take 70% of data for training and 30% for testing
# random_state indicates the random seed used in selecting test rows
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

# train the computer with Simple Linear Regression model
from sklearn.linear_model import LinearRegression

# create LinearRegression class object
reg = LinearRegression()

# train the model by passing train data to fit() method
reg.fit(x_train, y_train)

# test the model by passing test data and obtain predicted data
y_pred = reg.predict(x_test)
y_pred

# find the r squared value by comparing test data
# (expected data) and predicted data. accuracy is 97.4%
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)  # 0.9740993407213511
This xlsx file contains a sheet by the name Sect” and hence while reading data from this file, we can use read_excel() functions of yenies module, as: a : "dataset = pd-read_excel("€:/test/canada_per_capita_incone Jelsx" "sheet1") this dataset has 47 rows and 2 columns. The first few columns of this dataset are shown inFigure 19.10. Income 1970 __3399.299037 1971 3768.297935 1972 4251.175484 1973 4804.463248 1974 5576,514583 1975 5998.144346 1976 7062.131392 1977 7100.12617 1978, 7247.967035 1979, 7602.912681 1980 8355.96812 1981, _-9434.390652 1982 9619.438377 1983 Sheett | Figure 19.10: Data related to Income per person in Canada Scanned with CamScanner Chapter 19 To cheek if the data fits into Simple Linear Regression Model, we can qi raw 4 \ as: pit.scatter(x, y, color='red') Output: 40000 35000 30000 25000 20000 15000 1970 1980 1990 2000 2010 Figure 19.11: The data points are nearly linear The data points are more or less showing a straight line. Hence, we can use Simpl l=" Regression Model on this data. We can also find the accuracy of the model, by comparing y_test and y_pred 8° 12_score() function, as: [r2iscore(y-tést; y=pred)= = eee roe Another way to measure the accuracy of Linear Regression Model is by cal method on the model object (reg), a: aoe | reg-score(x test, y test) E é we! E = = = Sena st. ased on XS ay The score() method takes x_test and calculates y_pred values aii Hence compares y_test and y_pred values to provide the score of the mo score with r2_score() function and score() method. ¥ a” 2020 conesion Program 3: Find out the per capita income of Canada during the Ye" the per cap: aa a a Scanned with CamScanner Simple Linear Regression 4 data from "sheet." 
of Excel file ' wet = pd.read_excel ("E:/test/canada_per_capita_income.x1sx", "Sheet i ' praset.iTocL!» OF2].values # retrieve st column as 20 array fe * aataset Tots» -l].values # retrieve last column as 10 array , F di jstribution of data is linear or not neck the dis pei Kegeater(&s Y» colors!red) jie 708 of data for training and 30% for testing eon state indicates the random seed used in splitting the data fron sklearn.model_selection import train_test_split ctrain, xtest, y-train, y_test = train_test_split(x, y, test_size=0.3, “dom state = 5) rain the computer using Simple Linear Regression model ge sklearn. linear_model. import LiaearRegression rag = LinearRegression() reg.fit@train, y_train) #nake prediction based on test data pred = reg. predict (x_test) y.pred # find the r squared value by comparing test data and predicted data fros sklearn.metrics import r2_score rscore(y_test, y_pred) # 0.8433026110551844 # another way to know the score of a linear regression model feg.score(x_test, y_test) # 0.8433026110551844 # predict the per capita income during the years 2020 and 2021 # Output: array([41819. 49650873, 42681. 02869595]) # this means 41819$ in 2020 and 42681$ in 2021 years. *eg-predict({[2020], [2021]]) a Scatter plot and regression Tine mm et below block and run at once He gutterGxtrain, y_train, color="red’), 4, de eestrain, reg.predict(x_train) » color= 'blue') nit, ‘TtleC"PER CAPITA INCOME OF CANADA") “Mabel ("Year") Plt, : me qatel Per Capita Income’) 0 We ran thi ; Tun this program in Spyder, it looks like the following. le foung / ies pn per capita income during the years 2020 and 2021 as: edict (L(2020))"{202111) 4 gel bh Scanned with CamScanner Chapter 19 i ements in 2D array an , we are passing the years as el Fe STapeTAUATOTASSOUTS 42681.02869595]) ita i Id be 41819 doy, is i at the per capita ineome would be. liars ip Sones ieoa orem the accuracy of the model is 84% oniy"* ! 
Output:
array([41819.49650873, 42681.02869595])

Please observe that we are passing the years as elements in a 2D array, and we get the output as a 1D array. It says that the per capita income would be 41819 dollars in 2020 and 42681 dollars in 2021. However, please remember that the accuracy of the model is 84% only.

Figure 19.12: Running the program in Spyder

Points to Remember

- Linear regression is a Machine Learning model that depends on the linear relationship between variables.
- The input data is represented by independent variables, and the target data is the dependent variable.
- When a variable value increases, if another variable value also increases, then it is called a positive relationship. When a variable value increases, if another variable value decreases, then it is called a negative relationship.
- The Simple Linear regression model uses the formula: y = B0 + B1x + E. Here, B0 is the intercept, B1 represents the slope of the line, and E is the error term.
- The accuracy of a regression model is measured using the 'r squared value'.
- The train_test_split() function is used to split the dataset into train and test data. It takes either train_size or test_size, based on which it divides the dataset.
inthe form of: y= mena nds 4 EB se Here, y is called the independent variable or target variable. x1, x2, x3, ... are called dependent variables or features. m1, m2, m3, ... are called quotients associated with the drpendent variables. b is called intercept. ‘That means it can be shown in the The relationship between y and x1 should be linear. fen y and x2 should be linear. Also, f a ae straight line. Similarly, the relationship betwes *telationship between y and x3 should be linear. ‘Susual, when Multiple Linear Regression model is applied on data, there will be certain ations between the predicted values and original values which can be calculated ie SQuared value, The r squared value should be betvit O and 1. Ifitis nearer to ee the model is not performing well. If it is nearer to 1, then the model is doing well ‘curacy level is high. lets atthe PPI Multiple Linear Regression model on the house Patt ata and then understand how to use the model on the data. .3 data, We will first look ~~ Scanned with CamScanner Chapter 20 The dataset: homeprices.csv This dataset represents home prices in Monroe Township, New Jersey, ugg is a sample dataset that contains only 6 rows and 4 columns, The column, elite, the house in square foot, the number of bed rooms, age of the house in aaa thet of the house in dollars. This is shown in Figure 20. rea____bedrooms age price. -~& 2600 = Bt _ 550000 | 3000 4 15 _ 565000 Ee mt 3200 18 610000 © 5) 36008. _ 595000 6] 4000 5g ___760000 7/4400 Ss 8 795000 8 | lt Figure 20.1: Homeprices dataset We are supposed to calculate the home ‘price’ depending on the ‘area’, ‘bedrooms’ ani’? columns, So, the Multiple Linear Regression model uses the following formula: Y= mix + m2x2 4 m3x3 4b = Bee _ Price = mitarea + m2*bedrooms + mage +b Z el = Looms E , fi Please observe that this dataset has a missing value in ‘bedrooms’ column. Son! let us clean the data or make the data ready for the model. For this purpost "i, 7 2 ate vive? 
either delete the row that contains the missing value or substitute an appropriate value in that place. To find out the missing values in the data frame (df), we can use:

df.isnull().sum()

Output:
area        0
bedrooms    1
age         0
price       0
dtype: int64

This output clearly tells us that there is 1 missing value in the 'bedrooms' column. We want to calculate the median of all the other values in that column and then substitute that value in the place of the missing value. So, first let us find the median value for the 'bedrooms' column, as:

df.bedrooms.median()

We get a float number as a result of executing the above statement. Convert that into an int using the floor() function of the math module, as:

import math
med = math.floor(df.bedrooms.median())
med

Alternately, we can use the int() function, as:

med = int(df.bedrooms.median())
med

Now, let's fill the median value into the missing place of the 'bedrooms' column, as:

df.bedrooms = df.bedrooms.fillna(med)
df

Output:
   area  bedrooms  age   price
0  2600       3.0   20  550000
1  3000       4.0   15  565000
2  3200       4.0   18  610000
3  3600       3.0   30  595000
4  4000       5.0    8  760000
5  4400       5.0    8  795000

Once the data is alright, we can check whether we can apply the Linear Regression model on this data or not. This can be done by checking the relationships between 'area' and 'price', between 'bedrooms' and 'price', and between 'age' and 'price'. To view these relationships, we can use an lmplot that displays the scattered data points along with the regression (relationship) line. This can be drawn using the lmplot() function of the seaborn module. Now, let's use lmplot() in 3 ways:

To find the relation between the constructed area and price, draw the lmplot as:

import seaborn as sns
sns.lmplot(x='area', y='price', data=df)

Output:

Figure 20.2: Relationship between area and price of house

The output shows a positive relationship. That means, if the constructed area of the house increases, the price of the house will also increase.
To find the relationship between the number of bedrooms and price, draw the lmplot as:

sns.lmplot(x='bedrooms', y='price', data=df)

Output:

Figure 20.3: Relationship between bedrooms and price of house

The output shows that a positive relationship exists between the number of bedrooms and the price. That means, if the number of bedrooms is increased, the price of the house will also increase.

To find the relation between the age of the house and price, draw the lmplot as:

sns.lmplot(x='age', y='price', data=df)

Output:

Figure 20.4: Relationship between age and price of house

This output shows that there is a negative relationship between the age and the price of the house. The price of the house will decrease as the age of the house increases, since a higher age represents an older construction.

Since there is a linear (i.e. straight line) relationship between the multiple independent variables (i.e. area, bedrooms and age) and the dependent variable (price), we can apply the Multiple Linear Regression model on the data.

Let us now create the model and train it on the data using the fit() method. We should remember that while passing x (the independent variables), we have to pass them in the form of a 2D array, and y (the dependent variable) should be given as a 1D array.

reg.fit(df[['area', 'bedrooms', 'age']], df['price'])

To see the coefficient values used by the model, we can display the coef_ variable, as:

reg.coef_

Output:
array([   142.895644  , -48591.66405516,  -8529.30115951])

To see the intercept, display the intercept_ variable, as:

reg.intercept_

Output:
485561.89282339806

Once the model has been trained, we can predict the house price for given area, bedrooms and age values. This can be done using the predict() method, to which we have to pass the area, bedrooms and age values in the form of a 2D array.

reg.predict([[3000, 3, 40]])

Irrespective of the independent variable values, the output will be displayed in the form of a 1D array. See the output below:

Output:
array([427301.78627387])

That means, the price of the house in New Jersey with 3000 square foot constructed area, 3 bedrooms and 40 years of age would be 427301 dollars.

In this manner, we can use the model on new data to provide predictions or forecasts.

Program 1: We are given home prices in Monroe Township, NJ (USA). We should predict the prices for the following homes:
a. 3000 sqft area, 3 bed rooms, 40 years old
b. 2500 sqft area, 4 bed rooms, 5 years old

# multiple linear regression model - predicting house prices
import pandas as pd

# load the dataset into dataframe
df = pd.read_csv("e:/test/homeprices.csv")
df

# find out any missing values in the dataset
# bedrooms has 1 missing value
df.isnull().sum()

# find the median of the bedrooms column
import math
med = math.floor(df.bedrooms.median())
med

# fill the missing data (NaN cells) with this median value
df.bedrooms = df.bedrooms.fillna(med)
df

# represent the relations between independent and dependent variables
# area, bedrooms and age are independent vars and price is the dependent var
import seaborn as sns
sns.lmplot(x='area', y='price', data=df)
sns.lmplot(x='bedrooms', y='price', data=df)
sns.lmplot(x='age', y='price', data=df)

# create linear regression model with multiple variables
# take the independent vars first and take the dependent var next
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(df[['area', 'bedrooms', 'age']], df['price'])  # fitting means training

# print coefficients, i.e. m1, m2, m3 values
reg.coef_  # 142.895644, -48591.66405516, -8529.30115951

# intercept
reg.intercept_  # 485561.89282339806

# predict the price of a 3000 sqft area, 3 bed rooms, 40 years old house
reg.predict([[3000, 3, 40]])  # 427301

# predict the price of a 2500 sqft area, 4 bed rooms, 5 years old house
reg.predict([[2500, 4, 5]])  # 605787

We are going to attempt another task which is similar to the previous task but with a bigger dataset.
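Before moving to the new dataset, the result of Program 1 can be sanity-checked: predict() simply evaluates the learned formula m1*x1 + m2*x2 + m3*x3 + b with the coef_ and intercept_ values. The sketch below hard-codes the six rows from Figure 20.1 inline (so it runs without homeprices.csv) and compares predict() against the formula written out by hand; the numeric result is a property of this tiny dataset, not a general value.

```python
# Cross-check predict() against the explicit regression formula.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# the homeprices data from Figure 20.1, with the missing
# bedroom value already filled with the median (4)
df = pd.DataFrame({
    'area':     [2600, 3000, 3200, 3600, 4000, 4400],
    'bedrooms': [3, 4, 4, 3, 5, 5],
    'age':      [20, 15, 18, 30, 8, 8],
    'price':    [550000, 565000, 610000, 595000, 760000, 795000],
})

reg = LinearRegression()
reg.fit(df[['area', 'bedrooms', 'age']].values, df['price'].values)

# predict() evaluates m1*x1 + m2*x2 + m3*x3 + b internally,
# so both lines below must produce the same number
pred = reg.predict([[3000, 3, 40]])[0]
manual = np.dot(reg.coef_, [3000, 3, 40]) + reg.intercept_

print(pred, manual)   # both give the same value, about 427301.79
```

This confirms that the model is nothing more than the linear equation with the fitted coefficients plugged in.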
We are given a dataset that contains house prices in California state depending on the constructed area, the number of bedrooms and the number of bathrooms. Unlike before, we have to divide the dataset into train data and test data and then find out the price of a house depending on the area, bedrooms and bathrooms.

The given dataset: cal1-03homes.xls

It is an Excel workbook file that contains one sheet where the data is available in several rows and 7 columns. The columns include the Price of the house, the square foot area, the bed rooms, the bath rooms, the garage and the zip code of the place where the house is located. We are not going to take all these columns. We take only 3 columns, i.e. SqFt, BedRooms and Baths, to calculate the Price of the house, since the other factors like Garage and Zip code are not important. From this discussion, we can understand that the independent variables are 'SqFt', 'BedRooms' and 'Baths' and the dependent variable is 'Price'. So, the Multiple Linear Regression model internally uses the following formula:

y = m1x1 + m2x2 + m3x3 + b
Price = m1*SqFt + m2*BedRooms + m3*Baths + b

First, we want to draw 3 plots:
- an lmplot to find out whether any linear relationship exists between the SqFt column and Price.
- boxplots representing the relation between BedRooms and Price, and then between Baths and Price. These boxplots are useful to find out whether any outliers are present in the data.

First, let us draw the lmplot that contains the datapoints along with the regression line. This can be done using the lmplot() function of seaborn, as:

sns.lmplot(x='SqFt', y='Price', data=df)

Output:

Figure 20.5: lmplot between area and price

This lmplot represents that there is a linear relationship between the area of the house and the price, and it is a positive relationship. That means, if the area of the house increases, the price also increases.

Let us now draw the box plots using the seaborn boxplot() function, as:

sns.boxplot(x='BedRooms', y='Price', data=df)
sns.boxplot(x='Baths', y='Price', data=df)

The boxplots show some outlier datapoints in the 'Price' column. The outlier points are nothing but abnormal values in the data. Let us find such outlier datapoints using the IQR (Inter Quartile Range) method, as:

# calculate iqr
q3 = df['Price'].quantile(0.75)
q1 = df['Price'].quantile(0.25)
iqr = q3 - q1

# calculate upper and lower limits from iqr.
# any value above ul or below ll will become an outlier.
ul = q3 + (1.5 * iqr)
ll = q1 - (1.5 * iqr)

# Price should not be more than ul or less than ll.
# if it is so, then it becomes an outlier.
upper = np.where(df['Price'] >= ul)
lower = np.where(df['Price'] <= ll)

Once we find out the outliers, we can delete the rows with outlier values using the drop() method of the data frame, as:

df.drop(upper[0], inplace=True)
df.drop(lower[0], inplace=True)

Let us now decide the independent variables (x) and the dependent variable (y) in our dataset. The columns 'SqFt', 'BedRooms' and 'Baths' are independent variables, which are the 2nd, 3rd and 4th columns. The column 'Price' is the 1st column, and it should be taken as the dependent variable.

# retrieve only the 2nd, 3rd and 4th columns and take them as x
x = df.iloc[:, 2:5].values

# retrieve the 1st column and take it as y
y = df.iloc[:, 1].values

In the above statements, the .values property converts the values into array format. The x value will be given in 2D array format and the y value will be given in 1D array format. These formats are required when we want to supply this data to the Machine Learning model.

Split the data into train and test data using the train_test_split() function, as:

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=10)

The x_train and y_train data should be used to train the model. x_test and y_test are to be used for testing the accuracy of the model.

Multiple Linear Regression is also nothing but the Linear Regression model only. Hence, we create the model by creating an object of the LinearRegression class, as:

reg = LinearRegression()

Train the model by passing the train data to the fit() method:

reg.fit(x_train, y_train)

Once the training of the model is completed, we can find the accuracy level of the model by calculating the r squared value, as:

y_pred = reg.predict(x_test)
r2_score(y_test, y_pred)

Here, we are comparing the test rows (y_test) against the predicted rows (y_pred). These are predicted by the model based on the x_test rows passed to the predict() method. Of course, we can also use the score() method to find the score, as:

reg.score(x_test, y_test)

The above score() method calculates the y_pred values based on the x_test values and compares them with y_test to decide the score of the model. Finally, let us predict the house prices. Let us predict the price of a house having 780 square foot constructed area, with 3 bed rooms and 1 bath room.

reg.predict([[780, 3, 1]])

Then, we want to predict the prices of 2 houses with the following specifications: 1500 sqft area, 3 bed rooms and 2 bath rooms; and 2000 sqft area, 4 bed rooms and 4 bath rooms. Here too, we can use the predict() method, as:

reg.predict([[1500, 3, 2], [2000, 4, 4]])

Remember, the input should be in the form of a 2D array and the output will be given in the form of a 1D array by any Machine Learning model. For example, the above statement will produce the following output, which is a 1D array:

array([128866.13085266, 205294.90136746])

The first element 128866.13085266 represents the price of the first house and the second element 205294.90136746 represents the price of the second house.

Program 2: Create a Multiple Linear Regression Model for the house prices in California state. Split the data into train and test data while giving it to the model, and predict the prices of houses with the following specifications:
1. Predict the house price with 780 sqft area, 3 bed rooms and 1 bath room.
2. Predict the house prices for two houses: one with 1500 sqft, 3 bed rooms and 2 bath rooms, and another one with 2000 sqft, 4 bed rooms and 4 bath rooms.
# multiple linear regression - predicting house prices
import pandas as pd

# load the dataset into dataframe
df = pd.read_excel("e:/test/cal1-03homes.xls", "Sheet")
df

# find out any missing values in the dataset
# there are no missing values in any column
df.isnull().sum()

# find out outliers by drawing box plots - there are outliers in Price
import seaborn as sns
sns.lmplot(x='SqFt', y='Price', data=df)
sns.boxplot(x='BedRooms', y='Price', data=df)
sns.boxplot(x='Baths', y='Price', data=df)

# delete the rows with outliers using the iqr method
# calculate q3 (third quartile)
q3 = df['Price'].quantile(0.75)
q3

# calculate q1 (first quartile)
q1 = df['Price'].quantile(0.25)
q1

# find the iqr value. this gives 80000
iqr = q3 - q1
iqr

# calculate upper and lower limits from iqr
# any value above ul or below ll will become an outlier
ul = q3 + (1.5 * iqr)
ll = q1 - (1.5 * iqr)
print(ul, ll)  # 304900.0 -15100.0

# upper bound
import numpy as np
upper = np.where(df['Price'] >= ul)

# lower bound
lower = np.where(df['Price'] <= ll)

# delete the rows above the upper and below the lower values
df.drop(upper[0], inplace=True)
df.drop(lower[0], inplace=True)

# retrieve only the 2nd, 3rd and 4th columns and take them as x
x = df.iloc[:, 2:5].values

# retrieve the 1st column and take it as y
y = df.iloc[:, 1].values

# take 80% of the data for training and 20% for testing
# random_state indicates the random seed used in selecting test rows
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=10)

# train the computer with the Multiple Linear Regression model
from sklearn.linear_model import LinearRegression

# create a LinearRegression class object
reg = LinearRegression()

# train the model by passing the train data to the fit() method
reg.fit(x_train, y_train)

# test the model by passing the test data and obtain the predicted data
y_pred = reg.predict(x_test)

# find the r squared value by comparing the test data
# (expected data) and the predicted data.
# the accuracy comes to about 83%
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)  # 0.829686049412441

# another way to find the score
reg.score(x_test, y_test)  # 0.829686049412441

# predict the price of a house with 780 sqft, 3 bedrooms and 1 bathroom
# this gives 56120 dollars
print(reg.predict([[780, 3, 1]]))  # 56120.32684253

# predict the prices of houses with 1500 sqft, 3 bedrooms and 2 bathrooms
# and 2000 sqft, 4 bedrooms and 4 bathrooms
# this gives 128866 dollars and 205294 dollars
print(reg.predict([[1500, 3, 2], [2000, 4, 4]]))
# [128866.13085266 205294.90136746]

Output:

Execute the program line by line and observe the output at the bottom right of the Spyder IDE, as shown in Figure 20.7.

Figure 20.7: Executing the program in Spyder IDE

Points to Remember

Q There are 2 types of Linear Regression: Simple Linear Regression and Multiple Linear Regression.
Q In Simple Linear Regression, the target value is predicted based on only one independent variable or feature.
Q In Multiple Linear Regression, the target value is dependent on multiple independent variables.
Q The multiple linear regression model uses the formula: y = m1x1 + m2x2 + m3x3 + ... + b. Here, x1, x2, x3, ... are called independent variables or features. m1, m2, m3, ... are called coefficients. b is called the intercept.
Q In linear regression models, the score or accuracy can be known using the score() method or the r2_score() function.
Q The r squared value measures how closely the predictions fit the datapoints and lies between 0 and 1. If it is nearer to 0, then the model is not performing well. If it is nearer to 1, then the model is doing well and the accuracy level is high.
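To make the last point concrete, the r squared value can also be computed by hand: r2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the actual values from their mean. The sketch below uses made-up actual and predicted values (not from any dataset in this chapter) and confirms that the manual formula matches sklearn's r2_score().

```python
# Compute r squared by hand and compare with sklearn's r2_score().
import numpy as np
from sklearn.metrics import r2_score

# made-up actual (y_test) and predicted (y_pred) values
y_test = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([ 90.0, 160.0, 190.0, 265.0])

ss_res = np.sum((y_test - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_test - y_test.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_test, y_pred))  # both are about 0.958
```

Note that SS_tot is the error of a baseline model that always predicts the mean, so r squared tells how much better the regression is than that baseline.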
