Data Analytics Lab
B. TECH.
(III YEAR – SIXTH SEM)
LIST OF PROGRAMS
Regression is a statistical method used in finance, investing, and other disciplines that
attempts to determine the strength and character of the relationship between one dependent
variable (usually denoted by Y) and a series of other variables (known as independent
variables).
Problem Statement: Given the above data, build a machine learning model that can predict home prices
based on square-foot area.
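The model above can be sketched with scikit-learn's LinearRegression. The area/price values below are made up for illustration; the lab's actual dataset is not reproduced here.

```python
# Minimal single-variable linear regression sketch (hypothetical data).
from sklearn.linear_model import LinearRegression

areas = [[1000], [1500], [2000], [2500], [3000]]   # square feet
prices = [200000, 290000, 410000, 500000, 610000]  # home prices

model = LinearRegression()
model.fit(areas, prices)                 # learn the best-fit line
print(model.predict([[1800]]))           # predicted price for 1800 sq. ft.
```

The fitted line's slope and intercept are available as `model.coef_` and `model.intercept_`.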
2. Linear Regression with Multiple Variables:-
Now we have to predict the price of a house based on several parameters, such as area, number
of bedrooms, and age.
The objective is to predict the price of a house based on all the above-
mentioned parameters:-
Simple linear regression is a regression model that figures out the relationship
between one independent variable and one dependent variable using a straight line.
SOME APPLICATIONS WHERE WE CAN USE SIMPLE LINEAR
REGRESSION
Decision Tree is one of the most powerful and popular algorithms. The decision-tree
algorithm falls under the category of supervised learning algorithms. It works for both
continuous and categorical output variables.
In this example our objective is to predict the salary of an employee based on various
parameters.
A Decision Tree is a kind of supervised machine learning algorithm that has a root node
and leaf nodes. Every node represents a feature, and the links between the nodes show the
decision. Every leaf represents a result.
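A sketch of the salary-prediction task with scikit-learn's DecisionTreeClassifier. The dataset and its encoding are assumptions: company, job, and degree are shown already label-encoded as integers, and the target is whether the salary exceeds a threshold.

```python
# Decision tree sketch on a tiny hypothetical salary dataset.
from sklearn import tree

# columns: company, job, degree (all label-encoded integers)
X = [[0, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0], [1, 1, 1], [2, 1, 0]]
y = [0, 0, 1, 0, 1, 1]  # 1 = salary above 100k, 0 = otherwise

clf = tree.DecisionTreeClassifier()
clf.fit(X, y)                       # the tree splits on one feature per node
print(clf.predict([[2, 1, 1]]))     # predict for an unseen employee
```

In practice the raw text columns would first be converted with a LabelEncoder before fitting.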
Marketing: Businesses can use decision trees to enhance the accuracy of their promotional
campaigns by observing the performance of their competitors’ products and services.
Decision trees can help in audience segmentation and support businesses in producing
better-targeted advertisements that have higher conversion rates.
Retention of Customers: Companies use decision trees for customer retention through
analyzing their behaviors and releasing new offers or products to suit those behaviors. By
using decision tree models, companies can figure out the satisfaction levels of their
customers as well.
Diagnosis of Diseases and Ailments: Decision trees can help physicians and medical
professionals in identifying patients that are at a higher risk of developing serious (or
preventable) conditions such as diabetes or dementia. The ability of decision trees to
narrow down possibilities according to specific variables is quite helpful in such cases.
Experiment No. 4
Random Forest is a popular machine learning algorithm that belongs to the supervised
learning technique. It can be used for both Classification and Regression problems in ML. It
is based on the concept of ensemble learning, which is a process of combining multiple
classifiers to solve a complex problem and to improve the performance of the model.
In this example we are classifying handwritten digits using the random forest algorithm.
Random Forest is a classifier that contains a number of decision trees on various subsets
of the given dataset and takes the average to improve the predictive accuracy of that
dataset. Instead of relying on one decision tree, the random forest takes the prediction
from each tree and, based on the majority vote of those predictions, produces the final
output.
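The digit-classification experiment can be sketched with scikit-learn's built-in digits dataset. The split size and number of trees below are illustrative choices, not values from the lab.

```python
# Random forest sketch on scikit-learn's handwritten digits dataset.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

digits = load_digits()  # 8x8 grayscale images flattened to 64 features
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```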
Applications of Random Forest
The random forest algorithm is mostly used in the following areas:
Banking: The banking sector mostly uses this algorithm to identify loan
risk.
Medicine: With the help of this algorithm, disease trends and risks of the disease
can be identified.
Land Use: We can identify the areas of similar land use by this algorithm.
Marketing: Marketing trends can be identified using this algorithm.
Experiment No. 5
Title: Implementation of Two sampled T-test and paired two sampled T-test in excel
T-tests are hypothesis tests that assess the means of one or two groups. Hypothesis tests
use sample data to infer properties of entire populations. To be able to use a t-test, you need
to obtain a random sample from your target populations. Depending on the t-test and how you
configure it, the test can determine whether two group means differ, whether paired
measurements differ, or whether a sample mean differs from a hypothesized value.
STEPS
The Analysis ToolPak must be installed on your copy of Excel to perform t-tests. To
determine whether you have this add-in installed, click Data in Excel’s menu across the
top and look for Data Analysis in the Analyze section. If you don’t see Data Analysis, you
need to install it.
To install Excel’s Analysis ToolPak, click the File tab on the top-left and then
click Options on the bottom-left. Then, click Add-Ins. On the Manage drop-down list,
choose Excel Add-ins, and click Go. On the popup that appears, check Analysis
ToolPak and click OK. After you enable it, click Data Analysis in the Data menu to display the
analyses you can perform. Among other options, the popup presents three types of t-test,
which we’ll cover next.
Two-Sample t-Tests in Excel
Two-sample t-tests compare the means of precisely two groups—no more and no less!
Typically, you perform this test to determine whether two population means are different. For
example, do students who learn using Method A have a different mean score than those who
learn using Method B? This form of the test uses independent samples. In other words, the
observations in one group are unrelated to the observations in the other group.
If the p-value is less than your significance level (e.g., 0.05), you can reject the null
hypothesis. The difference between the two means is statistically significant. Your sample
provides strong enough evidence to conclude that the two population means are different.
Let’s assume that the variances are equal and use the Assuming Equal Variances version. If
we had chosen the unequal variances form of the test, the steps and interpretation are the
same—only the calculations change.
The output indicates that the mean for Method A is 71.50362 and for Method B it is 84.74241.
Looking in the Variances row, we can see that they are not exactly equal, but they are close
enough to assume equal variances. The p-value is the most important statistic. The output also
reports the t Stat (i.e., the t-value), df (degrees of freedom), and the t Critical values.
If the p-value is less than your significance level, the difference between means is statistically
significant. Excel provides p-values for both one-tailed and two-tailed t-tests.
For our results, we’ll use P(T<=t) two-tail, which is the p-value for the two-tailed form of the
t-test. Because our p-value (0.000336) is less than the standard significance level of 0.05, we
can reject the null hypothesis. Our sample data support the hypothesis that the population
means are different. Specifically, Method B’s mean is greater than Method A’s mean.
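The same two-sample test can be run outside Excel. Below is a sketch with SciPy's `ttest_ind`, using made-up score lists for Method A and Method B rather than the lab's actual data.

```python
# Two-sample t-test assuming equal variances (hypothetical scores).
from scipy import stats

method_a = [62, 70, 68, 75, 71, 66, 73, 69]
method_b = [80, 85, 78, 90, 83, 88, 84, 86]

# equal_var=True matches Excel's "Assuming Equal Variances" test
t_stat, p_value = stats.ttest_ind(method_a, method_b, equal_var=True)
print("t =", t_stat, "p =", p_value)
```

Passing `equal_var=False` instead would give Welch's test, Excel's "Assuming Unequal Variances" form.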
Paired t-Tests in Excel
Paired t-tests assess paired observations, which are often two measurements on the same
person or item.
To perform a paired t-test in Excel, arrange your data into two columns so that each row
represents one person or item, as shown below. Note that the analysis does not use the
subject’s ID number.
The output indicates that the mean for the Pre-test is 97.06223 and for the Post-test it is 107.8346.
For our results, we’ll use P(T<=t) two-tail, which is the p-value for the two-tailed form of the
t-test. Because our p-value (0.002221) is less than the standard significance level of 0.05, we
can reject the null hypothesis. Our sample data support the hypothesis that the population
means are different. Specifically, the Post-test mean is greater than the Pre-test mean.
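The paired version uses SciPy's `ttest_rel`, which pairs the values row by row just as Excel does. The pre-test/post-test scores below are hypothetical.

```python
# Paired t-test: two measurements on the same five subjects (made-up data).
from scipy import stats

pre  = [90, 95, 98, 100, 102]   # pre-test scores
post = [100, 104, 106, 109, 112]  # post-test scores for the same subjects

t_stat, p_value = stats.ttest_rel(pre, post)
print("t =", t_stat, "p =", p_value)
```

As in the Excel output, only the paired differences matter; a subject ID column would be ignored.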
T-test applications
The T-test is used to compare the mean of two samples, dependent or independent.
It can also be used to determine whether a sample mean is different from an
assumed mean.
The t-test also has an application in determining the confidence interval for a sample mean.
Experiment No. 6
Analysis of Variance (ANOVA) is a statistical method used to compare the means (or
averages) of different groups by analyzing their variances. A range of scenarios use it to
determine if there is any difference between the means of different groups.
1. Go to the Data tab and click on the Data Analysis sub-tab. If you can’t find the sub-tab,
install the Analysis ToolPak as described in Experiment No. 5.
2. From the Data Analysis popup, choose Anova: Two-Factor With Replication.
3. Click on the Input Range box and highlight the dataset you want to use.
4. Decide whether you want to view the output in the same spreadsheet or another spreadsheet.
5. Click OK.
In our ANOVA table above, we analysed the sum of squares and the other values of the ANOVA.
Applications of Anova
Organizations use ANOVA to make decisions about which alternative to choose among many
possible options. For example, ANOVA can help to:
Compare the yield of two different wheat varieties under three different
fertilizer brands.
Compare the effectiveness of various social media advertisements on the sales of
a particular product.
Compare the effectiveness of different lubricants in different types of vehicles.
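The first comparison above can be sketched with SciPy's one-way ANOVA. The yields under three fertilizer brands are made-up numbers for illustration.

```python
# One-way ANOVA: do mean wheat yields differ across three fertilizer brands?
from scipy import stats

brand_1 = [20, 22, 19, 24, 25]  # hypothetical yields (quintals/acre)
brand_2 = [28, 30, 27, 26, 29]
brand_3 = [18, 20, 22, 19, 21]

f_stat, p_value = stats.f_oneway(brand_1, brand_2, brand_3)
print("F =", f_stat, "p =", p_value)
```

A p-value below the significance level (e.g., 0.05) indicates that at least one brand's mean yield differs from the others.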
Experiment No. 7
The Apache Hadoop software library is a framework that allows for the distributed
processing of large data sets across clusters of computers using simple programming
models. It is designed to scale up from single servers to thousands of machines, each
offering local computation and storage. Rather than rely on hardware to deliver high-
availability, the library itself is designed to detect and handle failures at the application
layer, so delivering a highly-available service on top of a cluster of computers, each of
which may be prone to failures.
Install Java
Download Hadoop
– https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
– extract to C:\Hadoop
1. Set the JAVA_HOME environment variable and add %JAVA_HOME%\bin to Path
2. Set the HADOOP_HOME environment variable and add %HADOOP_HOME%\bin to Path
Configurations: -
Edit file C:\Hadoop-3.3.0\etc\hadoop\core-site.xml, paste the xml code below and save this file.
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
======================================================
Edit file C:\Hadoop-3.3.0\etc\hadoop\mapred-site.xml, paste the xml code below and save this file.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
======================================================
Create folder “data” under “C:\Hadoop-3.3.0”
Create folder “datanode” under “C:\Hadoop-3.3.0\data”
Create folder “namenode” under “C:\Hadoop-3.3.0\data”
======================================================
Edit file C:\Hadoop-3.3.0\etc\hadoop\hdfs-site.xml,
paste the xml code below and save this file.
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/hadoop-3.3.0/data/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/hadoop-3.3.0/data/datanode</value>
</property>
</configuration>
======================================================
Edit file C:\Hadoop-3.3.0\etc\hadoop\yarn-site.xml, paste the xml code below and save this file.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
======================================================
Testing: -
Format the namenode once with "hdfs namenode -format", then run start-dfs.cmd and
start-yarn.cmd from C:\Hadoop-3.3.0\sbin. The following daemons should start:
– Hadoop Namenode
– Hadoop Datanode
– YARN Resource Manager
– YARN Node Manager
Open: http://localhost:8088
======================================================
Applications of Hadoop
Apache Hadoop is an open source framework that is used to efficiently store and process
large datasets ranging in size from gigabytes to petabytes of data. Instead of using one large
computer to store and process the data, Hadoop allows clustering multiple computers to
analyze massive datasets in parallel more quickly.
1. Version
Hadoop HDFS version Command Usage: hadoop version
The Hadoop fs shell command version prints the Hadoop version.
2. mkdir
Hadoop HDFS mkdir Command Usage: hadoop fs -mkdir <path>
The mkdir command creates a new directory in HDFS, e.g. hadoop fs -mkdir /newDataFlair.
3. ls
Hadoop HDFS ls Command Usage: hadoop fs -ls <path>
Here in the below example, we are using the ls command to enlist the files and directories
present in HDFS.
4. put
Hadoop HDFS put Command Usage: hadoop fs -put <localsrc> <dest>
The put command copies a file from the local file system to HDFS.
5. copyFromLocal
Hadoop HDFS copyFromLocal Command Usage: hadoop fs -copyFromLocal <localsrc> <dest>
Here in the below example, we are trying to copy the ‘test1’ file present in the local file
system to the newDataFlair directory of Hadoop.
6. get
Hadoop HDFS get Command Usage: hadoop fs -get <src> <localdest>
The get command copies a file from HDFS to the local file system.
7. copyToLocal
Hadoop HDFS copyToLocal Command Usage: hadoop fs -copyToLocal <src> <localdest>
The copyToLocal command also copies a file from HDFS to the local file system.
8. cat
Hadoop HDFS cat Command Usage: hadoop fs -cat <path>
Here in this example, we are using the cat command to display the content of the ‘sample’
file present in the newDataFlair directory of HDFS.
Hadoop HDFS cat Command Description:
The cat command reads the file in HDFS and displays the content of the file on the console or
stdout.
9. mv
Hadoop HDFS mv Command Usage: hadoop fs -mv <src> <dest>
The HDFS mv command moves the files or directories from the source to a destination
within HDFS.
10. cp
Hadoop HDFS cp Command Usage: hadoop fs -cp <src> <dest>
The cp command copies a file from one directory to another directory within the HDFS.
Experiment No. 9
In the MapReduce word count example, we find out the frequency of each word. Here, the role
of the Mapper is to emit each word as a key with a count of 1, and the role of the Reducer is
to aggregate the counts for each common key. So, everything is represented in the form of
key-value pairs.
package wordcount2;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;

public class Wordcount2 {
    // Mapper: emits (word, 1) for every word of the input line
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable one = new IntWritable(1);
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) context.write(new Text(itr.nextToken()), one);
        }
    }
    // Reducer: sums the counts emitted for each word
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            context.write(key, new IntWritable(sum));
        }
    }
    public static void main(String[] args) throws Exception {
        // args[0] = HDFS input path, args[1] = HDFS output path
        Job job = Job.getInstance(new Configuration(), "wordcount");
        job.setJarByClass(Wordcount2.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
Input File
Output File
Experiment No. 10
package tempex;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class Tempex {
}
Input
Output:-
Applications/Advantages of MapReduce:
The reduce job takes the output from a map as input and combines those data tuples into a
smaller set of tuples. As the sequence of the name MapReduce implies, the reduce job is
always performed after the map job.
MapReduce programming offers several benefits to help you gain valuable insights from
your big data:
Scalability. Businesses can process petabytes of data stored in the Hadoop Distributed
File System (HDFS).
Flexibility. Hadoop enables easier access to multiple sources of data and
multiple types of data.
Speed. With parallel processing and minimal data movement, Hadoop offers
fast processing of massive amounts of data.
Simplicity. Developers can write code in a choice of languages, including Java, C++ and
Python.