IMAGE PROCESSING
&
BIG DATA ANALYTICS
For
Semester II
Submitted By:
Mr. ABDUL RAHIM KARIM KHAN
Msc.IT (Sem II)
IMAGE PROCESSING
Certificate of Approval
INDEX
Sr. No.  Name of the Practical  Page No.  Date  Teacher's Signature
1  Basics  13  26-03-2023
A1  Program to calculate the number of samples required for an image.  26-03-2023
B1  Program to study the effects of reducing the spatial resolution of a digital image.  26-03-2023
C1  Program to study the effects of varying the number of intensity levels in a digital image.  26-03-2023
D1  Program to perform image averaging (image addition) for noise reduction.  01-04-2023
E1  Program to compare images using subtraction for enhancing the difference between images.  01-04-2023
2  Image Enhancement  18  03-03-2023
A2  Basic Intensity Transformation functions  03-03-2023
    i. Program to perform image negation
    ii. Program to perform threshold on an image
    iii. Program to perform log transformation  04-03-2023
    iv. Power-law transformations
    v. Piecewise linear transformations
       a. Contrast stretching
       b. Gray-level slicing with and without background
       c. Bit-plane slicing
PRACTICAL 0
Aim: Install Scilab and the Image Processing Toolbox in Scilab.
Steps:
1. Download Scilab 6.1.1 from https://www.scilab.org/download/scilab-6.1.1
2. Run the downloaded installer and click Next.
3. Click Next again.
4. Click Install.
PRACTICAL 1
AIM-: Basics
1(A) Program to calculate number of samples required for an image.
CODE:-
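The journal's Scilab code for this program appears only as a screenshot. As an illustration, an equivalent Python sketch (Pillow is assumed available; the file name and bit depth are assumptions):

from PIL import Image

img = Image.open("lena.png").convert("L")   # assumed 8-bit grayscale test image
M, N = img.size                             # width and height in pixels
k = 8                                       # bits used per pixel (assumed)
samples = M * N                             # one sample per pixel
bits = samples * k                          # storage needed for the whole image
print("Number of samples (M x N):", samples)
print("Storage required:", bits, "bits =", bits // 8, "bytes")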
OUTPUT:-
1(B) Program to study the effects of reducing the spatial resolution of a digital image.
CODE:-
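Since the original code is again a screenshot, a minimal NumPy sketch of reducing spatial resolution by subsampling (the factor is an assumption):

import numpy as np
from PIL import Image

img = np.array(Image.open("lena.png").convert("L"))
factor = 4                                   # keep every 4th pixel in each direction
low_res = img[::factor, ::factor]            # reduced spatial resolution
# replicate pixels so the blocky effect is visible at roughly the original size
restored = np.kron(low_res, np.ones((factor, factor), dtype=img.dtype))
Image.fromarray(low_res).save("low_res.png")
Image.fromarray(restored).save("restored.png")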
OUTPUT:-
1(C) Program to study the effects of varying the number of intensity levels in a digital
image.
CODE:-
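An illustrative Python sketch of varying the number of intensity levels by quantisation (the original Scilab code is not reproduced in the journal):

import numpy as np
from PIL import Image

img = np.array(Image.open("lena.png").convert("L"))
for k in (8, 4, 2, 1):                       # 256, 16, 4 and 2 intensity levels
    levels = 2 ** k
    step = 256 // levels
    quantised = (img // step) * step         # map each pixel onto the reduced gray scale
    Image.fromarray(quantised.astype(np.uint8)).save("levels_%d.png" % levels)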
OUTPUT:-
1(D) Program to perform image averaging (image addition) for noise reduction.
CODE:-
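The averaging operation can be sketched in Python as follows (the noise parameters and the number of noisy copies are assumptions):

import numpy as np
from PIL import Image

clean = np.array(Image.open("lena.png").convert("L"), dtype=np.float64)
K = 20                                                    # number of noisy observations to average
acc = np.zeros_like(clean)
for _ in range(K):
    acc += clean + np.random.normal(0, 25, clean.shape)   # add Gaussian noise and accumulate
averaged = np.clip(acc / K, 0, 255)                       # averaging K images reduces the noise
Image.fromarray(averaged.astype(np.uint8)).save("averaged.png")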
OUTPUT:-
1(E) Program to compare images using subtraction for enhancing the difference between images.
CODE:-
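A hedged NumPy sketch of image subtraction for difference enhancement (both inputs are assumed to be grayscale images of the same size):

import numpy as np
from PIL import Image

a = np.array(Image.open("image1.png").convert("L"), dtype=np.int16)
b = np.array(Image.open("image2.png").convert("L"), dtype=np.int16)
diff = np.abs(a - b)                                      # absolute difference between the images
if diff.max() > 0:
    diff = diff * (255.0 / diff.max())                    # stretch so small differences become visible
Image.fromarray(diff.astype(np.uint8)).save("difference.png")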
OUTPUT:-
PRACTICAL 2
AIM:- IMAGE ENHANCEMENT
A. Basic Intensity Transformation functions
i. Program to perform Image negation.
CODE:-
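For illustration, image negation (s = (L - 1) - r) reduces to a single NumPy expression; the file name is an assumption:

import numpy as np
from PIL import Image

img = np.array(Image.open("lena.png").convert("L"))
negative = 255 - img                          # s = (L - 1) - r with L = 256 gray levels
Image.fromarray(negative.astype(np.uint8)).save("negative.png")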
OUTPUT:-
ii. Program to perform threshold on an image.
OUTPUT:-
iii. Program to perform Log transformation.
OUTPUT:-
iv. Power-law transformations.
OUTPUT:-
v. Piecewise linear transformations:
a. Contrast Stretching.
OUTPUT:-
b. Gray-level slicing with and without background.
OUTPUT:-
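The code for these transformations is likewise present only as screenshots; as an illustration, the point transformations listed above can be sketched together in Python (the threshold, constants and gray-level range are assumptions):

import numpy as np
from PIL import Image

img = np.array(Image.open("lena.png").convert("L"), dtype=np.float64)
r = img / 255.0                                      # normalised intensities in [0, 1]

threshold = np.where(img > 128, 255, 0)              # ii. thresholding at T = 128
log_tr    = 255 * np.log1p(img) / np.log(256.0)      # iii. log transform s = c * log(1 + r)
gamma     = 255 * (r ** 0.4)                         # iv. power-law transform with gamma = 0.4
# v.a contrast stretching: map [min, max] linearly onto [0, 255]
stretched = 255 * (img - img.min()) / (img.max() - img.min())
# v.b gray-level slicing: highlight the range [100, 150], keep the background unchanged
sliced = np.where((img >= 100) & (img <= 150), 255, img)

for result, name in ((threshold, "threshold"), (log_tr, "log"), (gamma, "gamma"),
                     (stretched, "stretched"), (sliced, "sliced")):
    Image.fromarray(np.clip(result, 0, 255).astype(np.uint8)).save(name + ".png")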
c. Bit-plane slicing
CODE:-
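For reference, bit-plane slicing can be sketched with simple bit operations in Python (file name assumed):

import numpy as np
from PIL import Image

img = np.array(Image.open("lena.png").convert("L"))
for bit in range(8):                              # plane 0 = least significant, plane 7 = most significant
    plane = (img >> bit) & 1                      # extract one bit from every pixel
    Image.fromarray((plane * 255).astype(np.uint8)).save("bitplane_%d.png" % bit)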
OUTPUT:-
B.
1. Program to plot the histogram of an image and categorise it.
CODE:-
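An illustrative Python/matplotlib sketch of plotting the gray-level histogram; whether the image is dark, bright, low-contrast or high-contrast can then be judged from where the histogram mass lies:

import numpy as np
from PIL import Image
import matplotlib.pyplot as plt

img = np.array(Image.open("lena.png").convert("L"))
plt.hist(img.ravel(), bins=256, range=(0, 255))   # frequency of each gray level
plt.title("Gray-level histogram")
plt.xlabel("Intensity")
plt.ylabel("Number of pixels")
plt.show()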
OUTPUT:-
D. Write a program to apply smoothing and sharpening filters on grayscale and color
images.
a) Low Pass
CODE:-
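A sketch of a 3x3 averaging (low-pass) filter using SciPy, which is assumed to be available:

import numpy as np
from PIL import Image
from scipy.ndimage import uniform_filter

img = np.array(Image.open("lena.png").convert("L"), dtype=np.float64)
smoothed = uniform_filter(img, size=3)            # 3x3 averaging mask smooths the image
Image.fromarray(np.clip(smoothed, 0, 255).astype(np.uint8)).save("lowpass.png")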
OUTPUT:-
b) High Pass
CODE:-
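A corresponding sketch of high-pass (Laplacian) filtering and sharpening, again with SciPy:

import numpy as np
from PIL import Image
from scipy.ndimage import convolve

img = np.array(Image.open("lena.png").convert("L"), dtype=np.float64)
laplacian = np.array([[ 0, -1,  0],
                      [-1,  4, -1],
                      [ 0, -1,  0]], dtype=np.float64)    # high-pass (Laplacian) mask
edges = convolve(img, laplacian)                          # responds to rapid intensity changes
sharpened = np.clip(img + edges, 0, 255)                  # add the detail back to sharpen
Image.fromarray(sharpened.astype(np.uint8)).save("highpass_sharpened.png")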
OUTPUT:-
PRACTICAL 3
AIM:- Filtering in Frequency Domain
A. Program to apply Discrete Fourier Transform on an image.
CODE:-
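An equivalent NumPy sketch of the 2-D DFT and its centred magnitude spectrum (shown only as an illustration of the operation):

import numpy as np
from PIL import Image
import matplotlib.pyplot as plt

img = np.array(Image.open("lena.png").convert("L"), dtype=np.float64)
F = np.fft.fft2(img)                              # 2-D discrete Fourier transform
F_shifted = np.fft.fftshift(F)                    # move the zero frequency to the centre
spectrum = 20 * np.log10(np.abs(F_shifted) + 1)   # log scale so the spectrum is visible
plt.imshow(spectrum, cmap="gray")
plt.title("Magnitude spectrum")
plt.show()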
OUTPUT:-
B. Program to apply Low pass and High pass filters in frequency domain.
i. Ideal Low Pass Filter
CODE:-
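A hedged NumPy sketch of an ideal low-pass filter in the frequency domain (the cut-off radius D0 is an assumption); replacing H with 1 - H gives the corresponding ideal high-pass filter:

import numpy as np
from PIL import Image

img = np.array(Image.open("lena.png").convert("L"), dtype=np.float64)
M, N = img.shape
F = np.fft.fftshift(np.fft.fft2(img))                     # centred spectrum

D0 = 30                                                   # cut-off radius (assumed)
u, v = np.meshgrid(np.arange(N) - N // 2, np.arange(M) - M // 2)
H = (np.sqrt(u**2 + v**2) <= D0).astype(np.float64)       # 1 inside the cut-off circle, 0 outside

filtered = np.fft.ifft2(np.fft.ifftshift(F * H)).real     # back to the spatial domain
Image.fromarray(np.clip(filtered, 0, 255).astype(np.uint8)).save("ideal_lpf.png")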
OUTPUT:-
(Further output screenshots for the low-pass and high-pass frequency-domain filters follow in the journal.)
PRACTICAL 4
AIM:- Image Denoising
i. Program to denoise using spatial mean and median filtering.
SPATIAL MEAN
CODE:-
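Both the mean and the median denoising filters used in this practical can be sketched with SciPy (the noisy input file name is an assumption):

import numpy as np
from PIL import Image
from scipy.ndimage import uniform_filter, median_filter

noisy = np.array(Image.open("noisy.png").convert("L"), dtype=np.float64)
mean_denoised = uniform_filter(noisy, size=3)     # spatial mean: average of each 3x3 neighbourhood
median_denoised = median_filter(noisy, size=3)    # median: better at removing salt-and-pepper noise
Image.fromarray(np.clip(mean_denoised, 0, 255).astype(np.uint8)).save("mean_denoised.png")
Image.fromarray(np.clip(median_denoised, 0, 255).astype(np.uint8)).save("median_denoised.png")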
OUTPUT:-
MEDIAN(I)
CODE:-
OUTPUT:-
MEDIAN(II)
CODE:-
OUTPUT:-
PRACTICAL 5
AIM-: Color Image Processing
5(A)-: Program to read a color image and segment into RGB planes, histogram of color
images.
CODE:-
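As an illustration, splitting a colour image into its RGB planes and plotting per-channel histograms in Python (file name assumed):

import numpy as np
from PIL import Image
import matplotlib.pyplot as plt

img = np.array(Image.open("peppers.png").convert("RGB"))
for idx, name in enumerate(("red", "green", "blue")):
    plane = img[:, :, idx]                                     # one colour plane
    Image.fromarray(plane).save(name + "_plane.png")
    plt.hist(plane.ravel(), bins=256, range=(0, 255), alpha=0.5, label=name)
plt.legend()
plt.title("Per-channel histogram of the colour image")
plt.show()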
OUTPUT:-
5(B)-: Program for converting from one color model to another model.
CODE:-
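A sketch of converting between colour models (RGB to HSV and back) using matplotlib's colour helpers, assumed available:

import numpy as np
from PIL import Image
from matplotlib.colors import rgb_to_hsv, hsv_to_rgb

rgb = np.array(Image.open("peppers.png").convert("RGB"), dtype=np.float64) / 255.0
hsv = rgb_to_hsv(rgb)                            # RGB -> HSV, all channels in [0, 1]
back = hsv_to_rgb(hsv)                           # HSV -> RGB round trip
Image.fromarray((hsv * 255).astype(np.uint8)).save("hsv.png")
Image.fromarray((back * 255).astype(np.uint8)).save("rgb_back.png")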
OUTPUT:-
PRACTICAL 6
OUTPUT:-
PRACTICAL 7
AIM:- Morphological Image Processing.
A. Program to apply erosion, dilation, opening, closing.
CODE:-
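The four morphological operations can be sketched with SciPy's ndimage module on a binarised image (the threshold and the 3x3 structuring element are assumptions):

import numpy as np
from PIL import Image
from scipy import ndimage

img = np.array(Image.open("shapes.png").convert("L")) > 127   # binarise the input
se = np.ones((3, 3), dtype=bool)                              # structuring element

eroded  = ndimage.binary_erosion(img, structure=se)
dilated = ndimage.binary_dilation(img, structure=se)
opened  = ndimage.binary_opening(img, structure=se)           # erosion followed by dilation
closed  = ndimage.binary_closing(img, structure=se)           # dilation followed by erosion

for result, name in ((eroded, "eroded"), (dilated, "dilated"),
                     (opened, "opened"), (closed, "closed")):
    Image.fromarray(result.astype(np.uint8) * 255).save(name + ".png")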
OUTPUT:-
PRACTICAL 8
AIM:- Image Segmentation.
i. Program for Edge detection using:
a. Sobel, Prewitt and Canny.
CODE:-
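Sobel, Prewitt and Canny edge detection can be sketched in Python with SciPy and scikit-image (both assumed installed; the Canny sigma is an assumption):

import numpy as np
from PIL import Image
from scipy import ndimage
from skimage import feature

img = np.array(Image.open("lena.png").convert("L"), dtype=np.float64)

sobel   = np.hypot(ndimage.sobel(img, axis=0),   ndimage.sobel(img, axis=1))    # gradient magnitude
prewitt = np.hypot(ndimage.prewitt(img, axis=0), ndimage.prewitt(img, axis=1))
canny   = feature.canny(img / 255.0, sigma=1.0)                                 # boolean edge map

Image.fromarray((sobel * 255 / sobel.max()).astype(np.uint8)).save("sobel.png")
Image.fromarray((prewitt * 255 / prewitt.max()).astype(np.uint8)).save("prewitt.png")
Image.fromarray(canny.astype(np.uint8) * 255).save("canny.png")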
OUTPUT:-
BIG DATA ANALYTICS
Certificate of Approval
INDEX
Sr. No.  Name of the Practical  Page No.  Date  Teacher's Signature
1  Install, configure and run Hadoop and HDFS and explore HDFS.  01  04-02-2023
2  Implement word count / frequency programs using MapReduce.  10  11-02-2023
3  Implement a MapReduce program that processes a weather dataset.  12  18-02-2023
4  Implement an application that stores big data in HBase / MongoDB and manipulate it using R / Python.  14  25-02-2023
5  Implement the program in practical 4 using Pig.  16  04-03-2023
6  Configure Hive and implement the application in Hive.  22  11-03-2023
7  Write a program to illustrate the working of JAQL.  24  18-03-2023
8  Implement the following:  29  01-04-2023
8A  Implement Decision tree classification techniques.  29  01-04-2023
8B  Implement SVM classification techniques.  31  01-04-2023
9  Solve the following:  33  08-04-2023
9A  REGRESSION MODEL: Import data from web storage. Name the dataset and now do Logistic Regression to find out the relation between the variables that affect the admission of a student to an institute based on his or her GRE score, GPA obtained and rank of the student. Also check whether the model is a fit or not. require(foreign), require(MASS).  33  08-04-2023
9B  MULTIPLE REGRESSION MODEL: Apply multiple regression if the data have a continuous independent variable. Apply it on the above dataset.  35  08-04-2023
Practical No. 1
Aim: Install, configure and run Hadoop and HDFS and explore HDFS.
1. Prerequisites
First, we need to make sure that the following prerequisites are installed: a Java JDK (Hadoop expects JAVA_HOME to be set) and an archive tool such as 7zip for extracting the downloaded packages.
The first step is to download Hadoop binaries from the official website. The binary package
size is about 342 MB.
After finishing the file download, we should unpack the package using 7zip in two steps.
First, we should extract the hadoop-3.2.1.tar.gz library, and then, we should unpack the
extracted tar file:
The tar file extraction may take some minutes to finish. In the end, you may see some
warnings about symbolic link creation. Just ignore these warnings since they are not related to
Windows.
After unpacking the package, and since we are installing Hadoop 3.2.1, we should download the matching Windows native IO libraries (winutils) for this version and copy them into the Hadoop bin directory.
After installing Hadoop and its prerequisites, we should configure the environment variables
to define Hadoop and Java default paths.
To edit environment variables, go to Control Panel > System and Security > System (or
right-click > Properties on the My Computer icon) and click on the “Advanced system settings”
link.
There are four configuration files we should edit to configure the Hadoop cluster:
1. %HADOOP_HOME%\etc\hadoop\hdfs-site.xml
2. %HADOOP_HOME%\etc\hadoop\core-site.xml
3. %HADOOP_HOME%\etc\hadoop\mapred-site.xml
4. %HADOOP_HOME%\etc\hadoop\yarn-site.xml
4.1. HDFS site configuration
As we know, Hadoop is built using a master-slave paradigm. Before altering the HDFS
configuration file, we should create a directory to store all master node (name node) data
and another one to store data (data node). In this example, we created the following
directories:
E:\hadoop-env\hadoop-3.2.1\data\dfs\namenode
E:\hadoop-env\hadoop-3.2.1\data\dfs\datanode
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>C:\hadoop-3.2.1\data\namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>C:\hadoop-3.2.1\data\datanode</value>
</property>
</configuration>
Note that we have set the replication factor to 1 since we are creating a single node cluster.
Now, we should configure the name node URL by adding the following XML code into the <configuration></configuration> element within “core-site.xml”:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
Now, we should add the following XML code into the <configuration></configuration>
element within “mapred-site.xml”:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
4.4. Yarn site configuration
Now, we should add the following XML code into the <configuration></configuration> element within “yarn-site.xml”:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.auxservices.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
After finishing the configuration, let’s try to format the name node using the following command:
hdfs namenode -format
Then, we can start the Hadoop services using the following command:
start-all.cmd
To make sure that all services started successfully, we can run the following command:
jps
It should display the following services:
Practical No. 2
Aim: Implement word count / frequency programs using MapReduce.
Theory: MapReduce is a software framework for processing large data sets in a distributed
fashion over several machines. The core idea behind MapReduce is mapping your data set into
a collection of <key, value> pairs, and then reducing over all pairs with the same key. A
MapReduce job usually splits the input data-set into independent chunks which are processed by
the map tasks in a completely parallel manner. The framework sorts the outputs of the maps,
which are then input to the reduce tasks. Typically both the input and the output of the job are
stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-
executes the failed tasks. The MapReduce framework consists of a single master JobTracker and
one slave TaskTracker per cluster-node. The master is responsible for scheduling the jobs'
component tasks on the slaves, monitoring them and re-executing the failed tasks. The slaves
execute the tasks as directed by the master.
Minimally, applications specify the input/output locations and supply map and reduce functions
via implementations of appropriate interfaces and/or abstract-classes. These, and other job
parameters, comprise the job configuration. The Hadoop job client then submits the job
(jar/executable etc.) and configuration to the JobTracker which then assumes the responsibility of
distributing the software/configuration to the slaves, scheduling tasks and monitoring them,
providing status and diagnostic information to the job-client.
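The word-count program itself appears in the journal only as screenshots. As an illustration, a Hadoop Streaming job could use a Python mapper and reducer along the following lines (the script names mapper.py and reducer.py are assumptions):

# mapper.py - emit <word, 1> for every word read from standard input
import sys
for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")

# reducer.py - sum the counts of each word (Hadoop sorts the mapper output by key)
import sys
current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        count += int(value)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(count))
        current_word, count = word, int(value)
if current_word is not None:
    print(current_word + "\t" + str(count))

Such scripts would typically be submitted with the Hadoop streaming jar (the paths below are assumptions), e.g. hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /input/bda.txt -output /output, which is consistent with the /output/part-r-00000 file read in the steps below.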
Step 1: Start Hadoop
ssh localhost
/usr/local/hadoop/sbin/start-all.sh
Step 4: Create a text file with some words on the local file system (try to include repeated words).
sudo nano bda.txt
(Press Ctrl+S and then Ctrl+X)
hdfs dfs -head /output/part-r-00000
Output:
Practical No. 3
Aim: Implement a MapReduce program that processes a weather dataset.
Theory: The same MapReduce framework described in Practical No. 2 is used here to process the weather dataset and report the coldest days.
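The weather program is likewise shown only as screenshots. A hedged sketch of a Hadoop Streaming mapper that emits <date, daily minimum temperature> pairs is given below; the column positions and the -9999 missing-value marker are assumptions about the CRND0103 file layout and should be adjusted to the actual dataset. The coldest days described in the output can then be obtained by sorting the mapper (or reducer) output by temperature.

# weather_mapper.py - emit <date, minimum temperature> for each daily record (hypothetical)
import sys
for line in sys.stdin:
    fields = line.split()
    if len(fields) < 7:
        continue                      # skip malformed lines
    date = fields[1]                  # assumed date column, e.g. 20200101
    try:
        t_min = float(fields[6])      # assumed daily-minimum-temperature column
    except ValueError:
        continue
    if t_min == -9999.0:              # assumed missing-value marker
        continue
    print(date + "\t" + str(t_min))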
Step 1: Start Hadoop
ssh localhost
/usr/local/hadoop/sbin/start-all.sh
(Change your directory to the folder where you have downloaded the dataset and the jar file)
Step 5: The next time you run this, you need to remove the existing output files before starting the
execution.
hadoop dfs -rm -f /user/hduser/CRND0103-2017-AK_Fairbanks_11_NE.txt
hadoop dfs -rm -r /user/hduser/Weather-Output
Output:
In the above image, you can see the top 10 results showing the cold days. The second column is a
day in yyyymmdd format. For example, 20200101 means
year = 2020
month = 01
date = 01
Practical No. 4
Aim: Implement an application that stores big data in Hbase / MongoDB and
manipulate it using R / Python.
Theory: HBase is a column-oriented non-relational database management system that runs on
top of Hadoop Distributed File System (HDFS). HBase provides a fault-tolerant way of storing
sparse data sets, which are common in many big data use cases. It is well suited for real-time data
processing or random read/write access to large volumes of data.
Unlike relational database systems, HBase does not support a structured query language like
SQL; in fact, HBase isn’t a relational data store at all. HBase applications are written in Java™
much like a typical Apache MapReduce application. HBase does support writing applications in
Apache Avro, REST and Thrift.
An HBase system is designed to scale linearly. It comprises a set of standard tables with rows and
columns, much like a traditional database. Each table must have an element defined as a primary
key, and all access attempts to HBase tables must use this primary key.
Avro, as a component, supports a rich set of primitive data types including: numeric, binary data
and strings; and a number of complex types including arrays, maps, enumerations and records. A
sort order can also be defined for the data.
HBase relies on ZooKeeper for high-performance coordination. ZooKeeper is built into HBase,
but if you’re running a production cluster, it’s suggested that you have a dedicated ZooKeeper
cluster that’s integrated with your HBase cluster.
HBase works well with Hive, a query engine for batch processing of big data, to enable fault-
tolerant big data applications.
Step 1: Start Hadoop
ssh localhost
/usr/local/hadoop/sbin/start-all.sh
su hduser
./stop-hbase.sh
Step 9: (make sure that you start the thrift server first and then the HBase, and while closing
you stop HBase first and then thrift server)
./hbase-daemon.sh start thrift
./start-hbase.sh
python3
import happybase as hb                      # Python client for HBase's Thrift interface
conn = hb.Connection('127.0.0.1', 9090)     # connect to the Thrift server started above
conn.table('test').row('row1')              # read rows back from the 'test' table (see the put() sketch after these steps)
conn.table('test').row('row2')
conn.table('test').row('row3')
exit()
./stop-hbase.sh
./hbase-daemon.sh stop thrift
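The rows read back in step 9 must already exist in the 'test' table; they were presumably created earlier from the HBase shell. As an illustration only, they could also be written from Python with happybase's put() — the column-family name 'cf' is an assumption:

import happybase as hb

conn = hb.Connection('127.0.0.1', 9090)          # Thrift server started in the steps above
table = conn.table('test')
table.put(b'row1', {b'cf:a': b'value1'})         # put(row_key, {b'family:qualifier': value})
table.put(b'row2', {b'cf:b': b'value2'})
table.put(b'row3', {b'cf:c': b'value3'})
print(table.row(b'row1'))                        # e.g. {b'cf:a': b'value1'}
conn.close()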
Practical No. 5
Aim: Implement the program in practical 4 using Pig.
Theory: Pig is a high-level scripting language that is used with Apache Hadoop. Pig enables data
workers to write complex data transformations without knowing Java. Pig's simple SQL-like
scripting language is called Pig Latin, and appeals to developers already familiar with scripting
languages and SQL. Apache Pig is a platform for analyzing large data sets that consists of a high-
level language for expressing data analysis programs, coupled with infrastructure for evaluating
these programs. The salient property of Pig programs is that their structure is amenable to
substantial parallelization, which in turn enables them to handle very large data sets.
At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of
Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the
Hadoop subproject).
Step 1:
su hduser
cd /usr/local
sudo wget https://downloads.apache.org/pig/pig-0.17.0/pig-0.17.0.tar.gz
sudo tar -xvzf pig-0.17.0.tar.gz
sudo mv pig-0.17.0 pig
cd /home/hduser
First we need to move customers.txt to HDFS. For that, start Hadoop (check via jps if you want):
hdfs dfs -put ./customers.txt /user/hduser/
Start Pig using the pig command:
customers = LOAD 'hdfs://localhost:54310/user/hduser/customers.txt' USING PigStorage(',');
dump customers;
quit;
Output:
Customers.txt file
Practical No. 6
Aim: Configure Hive and implement the application in Hive.
Theory: Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It
resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
Initially Hive was developed by Facebook, later the Apache Software Foundation took it up and
developed it further as an open source under the name Apache Hive. It is used by different
companies. For example, Amazon uses it in Amazon Elastic MapReduce.
Steps for performing this practical:
Step 8: Hive is installed now, but you need to first create some directories in HDFS for
Hive to store its data.
hdfs dfs -mkdir /tmp
hdfs dfs -chmod g+w /tmp
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -chmod g+w /user/hive/warehouse
sudo chmod 777 /usr/local/hive
Step 11:
CREATE TABLE IF NOT EXISTS employee (eid int, name String, salary String, designation String)
COMMENT 'Employee details'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH "/home/hduser/sample.txt" INTO TABLE employee;
select * from employee;
Then stop hadoop and close.
Output:
Practical No. 7
Aim: Write a program to illustrate the working of JAQL.
Theory:
DIMENSION PROPERTIES
Name    Type    Required  Description
dim     string  Yes       The dimension name
level   string  No        States the level of data in dim
filter  object  No        Defines the element's filter
AGGREGATIONS:
Name    Type    Required  Description
filter  object  No        Defines the element's filter
REQUIREMENTS:
Software requirements: Sisense
Hardware requirements: Minimum 8GB RAM.
Dataset: Sample Healthcare
URL: http://localhost:8081/app/jaqleditor
Data: http://localhost:8081/app/data/
WORKING:
Query 1: To find the names of all Doctors
Code:
{
"datasource": "Sample Healthcare",
"metadata": [
{
"dim": "[Doctors.Name]"
}
]
}
Explanation: In Data source field the name of the dataset has to be mentioned, in our case,
the data set is “Sample Healthcare”.
Metadata is the extra information about the data. In dim section we enter the column
name followed by the property name.
The above JAQL query is just like the below SQL query:
Select name from Sample Healthcare where entity name = ”Doctors”
On Execute, the following results were displayed: the system extracted the names of all the Doctors
in the Sample Healthcare dataset.
This can be cross-checked from the path C:\Program Files\Sisense\Samples\Sources\Sample Healthcare\Doctors.
Output:
Query 2: To find the count of specialty offered by the doctors using aggregation “agg”.
Code:
{
"datasource": "Sample Healthcare",
"metadata": [
{
"dim": "[Doctors.Specialty]",
"agg": "count"
}
]
}
Explanation: This query returns the count of the specialties present in the dataset. The output
returned 6. When cross-checked with the dataset, 6 specialties are indeed present, namely
Pediatrics, Oncology, Cardiology, Surgeon, Emergency Room and Neurology.
Output:
In the output, notice “Kimberley” with “Oncology” above it. When checked in the Sample Healthcare
dataset, the same specialty is listed.
Output:
Practical No. 8
PART-A.
Aim: Implement Decision tree classification techniques.
Code:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics # Import scikit-learn metrics module for accuracy calculation
# load dataset
pima = pd.read_csv("diabetes.csv")
col_names = pima.columns
# features and target (selection assumed; the journal's exact code is only a screenshot)
feature_cols = [c for c in col_names if c != 'label']
X = pima[feature_cols]
y = pima['label']
# 70% training and 30% test split, then train the classifier and evaluate it
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
clf = DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
Output:
Accuracy: 0.658008658008658
from sklearn.tree import export_graphviz
from io import StringIO  # sklearn.externals.six was removed in newer scikit-learn releases
from IPython.display import Image
import pydotplus
dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,
filled=True, rounded=True,
special_characters=True,feature_names = feature_cols,class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('diabetes1.png')
Image(graph.create_png())
Output:
PART-B.
Aim: Implement SVM classification techniques.
Theory: Support Vector Machine (SVM) is a supervised machine learning algorithm which
can be used for both classification or regression challenges. However, it is mostly used in
classification problems. In the SVM algorithm, we plot each data item as a point in n-
dimensional space (where n is number of features you have) with the value of each feature
being the value of a particular coordinate. Then, we perform classification by finding the
hyper- plane that differentiates the two classes very well. Support Vectors are simply the co-
ordinates of individual observation. The SVM classifier is a frontier which best segregates the
two classes (hyper-plane/ line).
Code:
import pandas as pd
from sklearn.svm import SVC # Import SVM Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation
# load dataset
pima = pd.read_csv("diabetes.csv")
col_names = pima.columns
pima.head()
Output:
glucose bp skin insulin bmi pedigree age label
1 85 66 29 0 26.6 0.351 31 0
3 89 66 23 94 28.1 0.167 21 0
# features and target (selection assumed; the journal's exact code is only a screenshot)
X = pima.drop(columns=['label'])
y = pima['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3) # 70% training and 30% test
clf = SVC(kernel="linear")    # linear-kernel SVM classifier
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
Output:
Accuracy: 0.7316017316017316
Practical No. 9
PART-A.
Aim: REGRESSION MODEL. Import data from web storage. Name the dataset and now do Logistic
Regression to find out the relation between the variables affecting the admission of a student to an
institute based on his or her GRE score, GPA obtained and rank of the student. Also check whether
the model is a fit or not. require(foreign), require(MASS).
Code:
import pandas as pd
from sklearn.linear_model import LogisticRegression # Import LogisticRegression
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation
gre = pd.read_csv("binary.csv")
col_names = gre.columns
gre.head()
Output:
   admit  gre   gpa  rank
0      0  380  3.61     3
1      1  660  3.67     3
2      1  800  4.00     1
3      1  640  3.19     4
4      0  520  2.93     4
X = gre[['gre', 'gpa', 'rank']]   # predictor variables (column selection assumed from the aim)
y = gre['admit']                  # whether the student was admitted
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) # 70% training and 30% test
clf = LogisticRegression()
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
Output:
Accuracy: 0.7583333333333333
PART-B.
Aim: MULTIPLE REGRESSION MODEL. Apply multiple regression if the data have a continuous independent variable. Apply it on the above dataset.
Theory: Multiple linear regression (MLR), also known simply as multiple regression, is a
statistical technique that uses several explanatory variables to predict the outcome of a
response variable. The goal of multiple linear regression (MLR) is to model the linear
relationship between the explanatory (independent) variables and response (dependent)
variable. In essence, multiple regression is the extension of ordinary least-squares (OLS)
regression because it involves more than one explanatory variable. Simple linear regression is
a function that allows an analyst or statistician to make predictions about one variable based
on the information that is known about another variable. Linear regression can only be used
when one has two continuous variables—an independent variable and a dependent variable.
The independent variable is the parameter that is used to calculate the dependent variable or
outcome. A multiple regression model extends to several explanatory variables.
Code:
import pandas
df = pandas.read_csv("cars.csv")
X = df[['Weight', 'Volume']]
y = df['CO2']
from sklearn import linear_model
regr = linear_model.LinearRegression()
regr.fit(X, y)
#predict the CO2 emission of a car where the weight is 2300kg, and the volume is 1300cm3:
predictedCO2 = regr.predict([[2300, 1300]])
import pandas
from sklearn import linear_model
df = pandas.read_csv("cars.csv")
X = df[['Weight', 'Volume']]
y = df['CO2']
regr = linear_model.LinearRegression()
regr.fit(X, y)
print(regr.coef_)
Output:
[0.00755095 0.00780526]
import pandas
from sklearn import linear_model
df = pandas.read_csv("cars.csv")
X = df[['Weight', 'Volume']]
y = df['CO2']
regr = linear_model.LinearRegression()
regr.fit(X, y)
predictedCO2 = regr.predict([[3300, 1300]])
print(predictedCO2)
Output:
[114.75968007]