
LABORATORY MANUAL

DATA ANALYTICS LAB

B. TECH.
(III YEAR – SIXTH SEM)
LIST OF PROGRAMS

Subject: Data Analytics Lab Code: CS-60

S.No. Name of Experiment

1. Implement and analyze linear regression in Python (single variable & multivariable)
2. Implement and analyze logistic regression in Python
3. Implement and analyze the decision tree algorithm in Python
4. Implement and analyze the random forest algorithm in Python
5. Implementation of the two-sample t-test and paired two-sample t-test in Excel
6. Implementation of one-way and two-way ANOVA in Excel
7. Write steps for installing Hadoop on Windows 10
8. Working with Hadoop commands
9. Implementation of a word count example using MapReduce
10. Implementation of a MapReduce program to count the unique number of times a song is played, based on userid and trackid
Experiment No. 1

Title: Implementation of linear regression in Python (single variable & multivariable)

Regression is a statistical method used in finance, investing, and other disciplines that
attempts to determine the strength and character of the relationship between one dependent
variable (usually denoted by Y) and a series of other variables (known as independent
variables).

1. Linear Regression Single Variable

The table below lists current home prices in ABC township based on square-foot area.

Problem Statement: Given the above data, build a machine learning model that can predict home prices based on square-foot area.
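Below is a minimal sketch of such a model using scikit-learn. The file name homeprices.csv, its columns (area, price), and the 3300 sq. ft. query are hypothetical:

# Fit a single-variable linear regression: price = m * area + b
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv('homeprices.csv')        # hypothetical columns: area, price
model = LinearRegression()
model.fit(df[['area']], df['price'])      # X must be 2-D; y is 1-D

print(model.predict([[3300]]))            # predicted price of a 3300 sq. ft. home
print(model.coef_, model.intercept_)      # learned slope m and intercept b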
2. Linear Regression Multivariable

Now the objective is to predict the price of a house based on several parameters: its area, the number of bedrooms, and its age.
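A minimal sketch of the multivariable case, again with scikit-learn; the file and column names are hypothetical:

# Fit a multivariable linear regression on area, bedrooms and age
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv('homeprices.csv')        # hypothetical columns: area, bedrooms, age, price
model = LinearRegression()
model.fit(df[['area', 'bedrooms', 'age']], df['price'])

# Predicted price of a 3000 sq. ft., 3-bedroom, 40-year-old house
print(model.predict([[3000, 3, 40]]))

Each coefficient in model.coef_ tells you how much the predicted price changes per unit change in the corresponding feature, holding the others fixed.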
Simple linear regression is a regression model that estimates the relationship between one independent variable and one dependent variable using a straight line.

SOME APPLICATIONS WHERE WE CAN USE SIMPLE LINEAR REGRESSION

1. Marks scored by students based on the number of hours studied (ideally): here the marks scored in exams are the dependent variable and the number of hours studied is the independent variable.
2. Predicting crop yields based on the amount of rainfall: yield is the dependent variable while the measure of precipitation is the independent variable.
3. Predicting the salary of a person based on years of experience: experience becomes the independent variable while salary is the dependent variable.
Experiment No. 2

Title: Implementation of logistic regression in Python


Logistic regression estimates the probability of an event occurring, such as voted or
didn't vote, based on a given dataset of independent variables. Since the outcome is a
probability, the dependent variable is bounded between 0 and 1.
In the example given below, we predict whether or not a customer will buy insurance, based on their age.
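A minimal sketch of this experiment with scikit-learn; the file insurance_data.csv and its columns (age, bought_insurance) are hypothetical:

# Predict insurance purchase (0/1) from age with logistic regression
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv('insurance_data.csv')    # hypothetical columns: age, bought_insurance
X_train, X_test, y_train, y_test = train_test_split(
    df[['age']], df['bought_insurance'], test_size=0.2, random_state=1)

model = LogisticRegression()
model.fit(X_train, y_train)

print(model.score(X_test, y_test))        # classification accuracy on the test set
print(model.predict_proba(X_test))        # [P(no), P(yes)] for each test customer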
Applications: Logistic regression is used in various fields, including machine learning,
most medical fields, and social sciences. For example, the Trauma and Injury Severity
Score (TRISS), which is widely used to predict mortality in injured patients, was originally
developed using logistic regression. Many other medical scales used to assess the severity of a patient's condition have also been developed using logistic regression. Logistic regression is mainly used for binary classification, though it can also be used for multiclass classification problems.
Experiment No. 3

Title: Implementation of the decision tree algorithm in Python

Decision Tree is one of the most powerful and popular algorithms. The decision-tree algorithm falls under the category of supervised learning algorithms. It works for both continuous and categorical output variables.

In this example our objective is to predict the salary of an employee based on various
parameters.
A Decision Tree is a kind of supervised machine learning algorithm that has a root node
and leaf nodes. Every node represents a feature, and the links between the nodes show the
decision. Every leaf represents a result.
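A minimal sketch of the salary experiment; the file salaries.csv, its columns, and the assumption that salary is recorded as a yes/no "more than 100k" label are all hypothetical:

# Predict whether salary exceeds 100k from company, job and degree
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv('salaries.csv')   # hypothetical columns: company, job, degree, salary_more_than_100k
for col in ['company', 'job', 'degree']:
    df[col] = LabelEncoder().fit_transform(df[col])   # encode category labels as integers

model = DecisionTreeClassifier()
model.fit(df[['company', 'job', 'degree']], df['salary_more_than_100k'])

print(model.predict([[2, 1, 0]]))   # one employee, given as encoded feature values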

Decision Trees Applications

Marketing: Businesses can use decision trees to enhance the accuracy of their promotional
campaigns by observing the performance of their competitors’ products and services.
Decision trees can help in audience segmentation and support businesses in producing
better-targeted advertisements that have higher conversion rates.

Retention of Customers: Companies use decision trees for customer retention through
analyzing their behaviors and releasing new offers or products to suit those behaviors. By
using decision tree models, companies can figure out the satisfaction levels of their
customers as well.

Diagnosis of Diseases and Ailments: Decision trees can help physicians and medical
professionals in identifying patients that are at a higher risk of developing serious (or
preventable) conditions such as diabetes or dementia. The ability of decision trees to
narrow down possibilities according to specific variables is quite helpful in such cases.
Experiment No. 4

Title: Implementation of the random forest algorithm in Python

Random Forest is a popular machine learning algorithm that belongs to the supervised
learning technique. It can be used for both Classification and Regression problems in ML. It
is based on the concept of ensemble learning, which is a process of combining multiple
classifiers to solve a complex problem and to improve the performance of the model.

In this example we are classifying handwritten digits using the random forest algorithm.
Random Forest is a classifier that contains a number of decision trees on various subsets
of the given dataset and takes the average to improve the predictive accuracy of that
dataset. Instead of relying on one decision tree, the random forest takes the prediction
from each tree and, based on the majority vote of those predictions, produces the final
output.
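A minimal sketch using scikit-learn's built-in handwritten-digits dataset, so no external file is needed:

# Classify 8x8 handwritten digit images with a random forest
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

digits = load_digits()                    # 1,797 labelled 8x8 digit images
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=1)

model = RandomForestClassifier(n_estimators=100)   # an ensemble of 100 trees
model.fit(X_train, y_train)
print(model.score(X_test, y_test))        # accuracy on unseen digits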
Applications of Random Forest
The random forest algorithm is mostly used in the following areas:

 Banking: Banking sector mostly uses this algorithm for the identification of loan
risk.
 Medicine: With the help of this algorithm, disease trends and risks of the disease
can be identified.
 Land Use: We can identify the areas of similar land use by this algorithm.
 Marketing: Marketing trends can be identified using this algorithm.
Experiment No. 5

Title: Implementation of the two-sample t-test and paired two-sample t-test in Excel

T-tests are hypothesis tests that assess the means of one or two groups. Hypothesis tests
use sample data to infer properties of entire populations. To be able to use a t-test, you need
to obtain a random sample from your target populations. Depending on the t-test and how you
configure it, the test can determine whether:

o Two group means are different.
o Paired means are different.
o One mean is different from a target value.

STEPS

1. Install the Analysis ToolPak in Excel

The Analysis ToolPak add-in must be installed in your copy of Excel to perform t-tests. To
determine whether you have it installed, click Data in Excel’s menu across the
top and look for Data Analysis in the Analyze section. If you don’t see Data Analysis, you
need to install it.

To install the Analysis ToolPak, click the File tab on the top-left and then
click Options on the bottom-left. Then, click Add-Ins. On the Manage drop-down list,
choose Excel Add-ins, and click Go. On the popup that appears, check Analysis
ToolPak and click OK. After you enable it, click Data Analysis in the Data menu to display the
analyses you can perform. Among other options, the popup presents three types of t-test,
which we’ll cover next.
Two-Sample t-Tests in Excel

Two-sample t-tests compare the means of precisely two groups—no more and no less!
Typically, you perform this test to determine whether two population means are different. For
example, do students who learn using Method A have a different mean score than those who
learn using Method B? This form of the test uses independent samples; in other words, the
observations in one group do not depend on the observations in the other group.

If the p-value is less than your significance level (e.g., 0.05), you can reject the null
hypothesis. The difference between the two means is statistically significant. Your sample
provides strong enough evidence to conclude that the two population means are different.

Let’s assume that the variances are equal and use the Assuming Equal Variances version. If
we had chosen the unequal variances form of the test, the steps and interpretation are the
same—only the calculations change.

1. In Excel, click Data Analysis on the Data tab.


2. From the Data Analysis popup, choose t-Test: Two-Sample Assuming Equal
Variances.
3. Under Input, select the ranges for both Variable 1 and Variable 2.
4. In Hypothesized Mean Difference, you’ll typically enter zero. This value is the
null hypothesis value, which represents no effect. In this case, a mean
difference of zero represents no difference between the two methods, which is
no effect.
5. Check the Labels checkbox if you have meaningful variable names in row 1.
This option makes the output easier to interpret. Ensure that you include the
label row in step #3.
6. Excel uses a default Alpha value of 0.05, which is usually a good value.
Alpha is the significance level. Change this value only when you have a
specific reason for doing so.
7. Click OK.
For the example data, your popup should look like the image below:

Interpreting the Two-Sample t-Test Results

The output indicates that the mean for Method A is 71.50362 and for Method B it is 84.74241.
Looking in the Variances row, we can see that they are not exactly equal, but they are close
enough to assume equal variances. The p-value is the most important statistic; the other
statistics reported include the t Stat (i.e., the t-value), df (degrees of freedom), and the
t Critical values.

If the p-value is less than your significance level, the difference between means is statistically
significant. Excel provides p-values for both one-tailed and two-tailed t-tests.

For our results, we’ll use P(T<=t) two-tail, which is the p-value for the two-tailed form of the
t-test. Because our p-value (0.000336) is less than the standard significance level of 0.05, we
can reject the null hypothesis. Our sample data support the hypothesis that the population
means are different. Specifically, Method B’s mean is greater than Method A’s mean.
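The same test can be cross-checked in Python with SciPy; the two score lists below are hypothetical stand-ins for the Method A and Method B columns:

# Two-sample t-test assuming equal variances (SciPy's default)
from scipy import stats

method_a = [68, 74, 70, 76, 69, 72]       # hypothetical scores
method_b = [83, 88, 81, 90, 84, 86]

t_stat, p_value = stats.ttest_ind(method_a, method_b)
print(t_stat, p_value)                    # reject the null hypothesis if p_value < 0.05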
Paired t-Tests in Excel

Paired t-tests assess paired observations, which are often two measurements on the same
person or item.

Step-by-Step Instructions for Running the Paired t-Test in Excel

To perform a paired t-test in Excel, arrange your data into two columns so that each row
represents one person or item, as shown below. Note that the analysis does not use the
subject’s ID number.

1. In Excel, click Data Analysis on the Data tab.


2. From the Data Analysis popup, choose t-Test: Paired Two Sample for Means.
3. Under Input, select the ranges for both Variable 1 and Variable 2.
4. In Hypothesized Mean Difference, you’ll typically enter zero. This value is the null
hypothesis value, which represents no effect. In this case, a mean difference of zero
represents no difference between the two methods, which is no effect.
5. Check the Labels checkbox if you have meaningful variable labels in row 1. This option
helps make the output easier to interpret. Ensure that you include the label row in step #3.
6. Excel uses a default Alpha value of 0.05, which is usually a good value. Alpha is the
significance level. Change this value only when you have a specific reason for doing so.
7. Click OK.
For the example data, your popup should look like the image below:

Interpreting Excel’s Paired t-Test Results

The output indicates that the mean for the Pre-test is 97.06223 and for the Post-test it is 107.8346.

For our results, we’ll use P(T<=t) two-tail, which is the p-value for the two-tailed form of the
t-test. Because our p-value (0.002221) is less than the standard significance level of 0.05, we
can reject the null hypothesis. Our sample data support the hypothesis that the population
means are different. Specifically, the Post-test mean is greater than the Pre-test mean.
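As with the two-sample test, a Python cross-check is possible; the pre/post score lists are hypothetical and must be paired row by row:

# Paired t-test: each position in pre and post refers to the same subject
from scipy import stats

pre = [95, 98, 92, 104, 96, 99]           # hypothetical pre-test scores
post = [105, 110, 101, 112, 108, 109]     # hypothetical post-test scores

t_stat, p_value = stats.ttest_rel(pre, post)
print(t_stat, p_value)                    # reject the null hypothesis if p_value < 0.05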

T-test applications

 The t-test is used to compare the means of two samples, dependent or independent.
 It can also be used to determine whether a sample mean is different from an assumed mean.
 The t-test also has an application in determining the confidence interval for a sample mean.
Experiment No. 6

Title: Implementation of one-way and two-way ANOVA in Excel

Analysis of Variance (ANOVA) is a statistical method that compares the means (or averages) of
different groups by analyzing their variances. It is used in a range of scenarios to determine
whether there is any difference between the means of different groups.

Creating an ANOVA table with Excel

To perform a one-way ANOVA, follow these steps.

 Import your data set in any preferred Excel format.

 Go to the Data tab and click on the Data Analysis sub-tab. If you can’t find it, install the
Analysis ToolPak as described in Experiment 5.

 Select ANOVA: Single Factor and click OK.

 Click on the input range and highlight the dataset you want to use.
 You can decide if you want to view it in the same spreadsheet or another spreadsheet.

In our ANOVA table above, we analysed the sum of squares and the other ANOVA values. With
this, we can solve a one-way ANOVA using Microsoft Excel.
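The same one-way ANOVA can be cross-checked in Python; the three group samples below are hypothetical:

# One-way ANOVA: do the three group means differ?
from scipy import stats

group1 = [25, 30, 28, 36, 29]             # hypothetical observations per group
group2 = [45, 55, 29, 56, 40]
group3 = [30, 29, 33, 37, 27]

f_stat, p_value = stats.f_oneway(group1, group2, group3)
print(f_stat, p_value)                    # reject equal means if p_value < 0.05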


Two-Way ANOVA

In Excel, do the following steps:

1. Click Data Analysis on the Data tab.

2. From the Data Analysis popup, choose Anova: Two-Factor With Replication.

3. Under Input, select the ranges for all columns of data.


4. In Rows per sample, enter 20. This represents the number of observations per group.
5. Excel uses a default Alpha value of 0.05, which is usually a good value. Alpha is the significance level.
Change this value only when you have a specific reason for doing so.

6. Click OK.
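A sketch of a two-way ANOVA in Python using statsmodels; the file crop_data.csv and its columns (fertilizer, variety, yield_) are hypothetical:

# Two-way ANOVA with replication: two factors plus their interaction
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.read_csv('crop_data.csv')         # hypothetical columns: fertilizer, variety, yield_
model = ols('yield_ ~ C(fertilizer) * C(variety)', data=df).fit()
print(sm.stats.anova_lm(model, typ=2))    # F and p-values for each factor and the interaction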
Applications of ANOVA
Organizations use ANOVA to make decisions about which alternative to choose among many
possible options. For example, ANOVA can help to:

 Compare the yield of two different wheat varieties under three different
fertilizer brands.
 Compare the effectiveness of various social media advertisements on the sales of
a particular product.
 Compare the effectiveness of different lubricants in different types of vehicles.
Experiment No. 7

Title: Write steps for installing Hadoop on Windows 10

The Apache Hadoop software library is a framework that allows for the distributed
processing of large data sets across clusters of computers using simple programming
models. It is designed to scale up from single servers to thousands of machines, each
offering local computation and storage. Rather than rely on hardware to deliver high-
availability, the library itself is designed to detect and handle failures at the application
layer, so delivering a highly-available service on top of a cluster of computers, each of
which may be prone to failures.

Install Java

– Java JDK link to download:
https://www.oracle.com/java/technologies/javase-jdk8-downloads.html
– Extract and install Java in C:\Java
– Open cmd and type: javac -version (to verify the installation)

Download Hadoop

– https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz

– extract to C:\Hadoop
1. Set the JAVA_HOME environment variable.
2. Set the HADOOP_HOME environment variable.
Configurations: -

Edit the file C:/Hadoop-3.3.0/etc/hadoop/core-site.xml, paste the following XML code into it, and save:

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
======================================================

Rename “mapred-site.xml.template” to “mapred-site.xml”, edit the file C:/Hadoop-3.3.0/etc/hadoop/mapred-site.xml, paste the following XML code into it, and save:

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
======================================================
Create folder “data” under “C:\Hadoop-3.3.0”
Create folder “datanode” under “C:\Hadoop-3.3.0\data”
Create folder “namenode” under “C:\Hadoop-3.3.0\data”

======================================================
Edit the file C:\Hadoop-3.3.0/etc/hadoop/hdfs-site.xml, paste the following XML code into it, and save:

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/hadoop-3.3.0/data/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/hadoop-3.3.0/data/datanode</value>
</property>
</configuration>
======================================================

Edit the file C:/Hadoop-3.3.0/etc/hadoop/yarn-site.xml, paste the following XML code into it, and save:

<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.auxservices.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
======================================================

Edit the file C:/Hadoop-3.3.0/etc/hadoop/hadoop-env.cmd by replacing the line
“set JAVA_HOME=%JAVA_HOME%” with “set JAVA_HOME=C:\Java”.

======================================================
Testing:

– Open cmd and change directory to C:\Hadoop-3.3.0\sbin
– Type start-all.cmd
– Alternatively, start the namenode and datanode with: start-dfs.cmd
– Then start YARN with: start-yarn.cmd

Make sure these apps are running

– Hadoop Namenode
– Hadoop datanode
– YARN Resource Manager
– YARN Node Manager

Open: http://localhost:8088
======================================================

Hadoop installed Successfully…………

======================================================

Applications of Hadoop

Hadoop was developed by Doug Cutting and Michael J. Cafarella. It is managed by the Apache
Software Foundation and licensed under the Apache License 2.0. It is beneficial for big
businesses because it is based on cheap servers, requiring less cost to store and process
the data. Hadoop helps make better business decisions by providing a history of data and
various company records, so by using this technology a company can improve its business.
Hadoop does lots of processing over the data collected from the company to deduce results
which can help in making future decisions.
Experiment No. 8

Title: Working with Hadoop commands

Apache Hadoop is an open source framework that is used to efficiently store and process
large datasets ranging in size from gigabytes to petabytes of data. Instead of using one large
computer to store and process the data, Hadoop allows clustering multiple computers to
analyze massive datasets in parallel more quickly.

1. version

Hadoop HDFS version Command Usage:
hadoop version

Hadoop HDFS version Command Description:
The version command prints the Hadoop version.

2. mkdir

Hadoop HDFS mkdir Command Usage:
hadoop fs -mkdir <path>

Hadoop HDFS mkdir Command Description:
This command creates the directory in HDFS if it does not already exist.

3. ls

Hadoop HDFS ls Command Usage:
hadoop fs -ls <path>

Hadoop HDFS ls Command Description:
The ls command lists the files and directories present in HDFS at the given path.

4. put

Hadoop HDFS put Command Usage:
hadoop fs -put <localsrc> <dest>

Here in this example, we are copying localfile1 of the local file system to the Hadoop
filesystem.

Hadoop HDFS put Command Description:
The Hadoop fs shell command put is similar to copyFromLocal; it copies files or
directories from the local filesystem to the destination in the Hadoop filesystem.

5. copyFromLocal

Hadoop HDFS copyFromLocal Command Usage:
hadoop fs -copyFromLocal <localsrc> <hdfs-dest>

Hadoop HDFS copyFromLocal Command Example:
Here in this example, we are copying the ‘test1’ file present in the local file
system to the newDataFlair directory of Hadoop.

6. get

Hadoop HDFS get Command Usage:
hadoop fs -get <src> <localdst>

In this example, we are copying the ‘file’ of the Hadoop filesystem to the local file
system.

Hadoop HDFS get Command Description:
The Hadoop fs shell command get copies the file or directory from the Hadoop file system to
the local file system.

7. copyToLocal

Hadoop HDFS copyToLocal Command Usage:
hadoop fs -copyToLocal <hdfs-src> <localdst>

Here in this example, we are copying the ‘sample’ file present in the newDataFlair
directory of HDFS to the local file system.

Hadoop HDFS copyToLocal Command Description:
The copyToLocal command copies the file from HDFS to the local file system.

8. cat

Hadoop HDFS cat Command Usage:
hadoop fs -cat <path>

Here in this example, we are using the cat command to display the content of the ‘sample’
file present in the newDataFlair directory of HDFS.

Hadoop HDFS cat Command Description:
The cat command reads the file in HDFS and displays its content on the console (stdout).

9. mv

Hadoop HDFS mv Command Usage:
hadoop fs -mv <src> <dest>

Hadoop HDFS mv Command Description:
The HDFS mv command moves the files or directories from the source to a destination
within HDFS.

10. cp

Hadoop HDFS cp Command Usage:
hadoop fs -cp <src> <dest>

Hadoop HDFS cp Command Description:
The cp command copies a file from one directory to another directory within the HDFS.
Experiment No. 9

Title: Implementation of a word count example using MapReduce

In the MapReduce word count example, we find the frequency of each word. Here, the role
of the Mapper is to emit a (word, 1) pair for every word it reads, and the role of the
Reducer is to aggregate the values that share a common key. So, everything is represented
in the form of key-value pairs.
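For example, if the input file contains the single line “car bus car train car”, the Mapper emits (car, 1), (bus, 1), (car, 1), (train, 1), (car, 1), and the Reducer then aggregates these pairs into (car, 3), (bus, 1), (train, 1).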

package wordcount2;

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class Wordcount2 {

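// Mapper: splits each input line into tokens and emits (word, 1) for every token.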
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {


private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, Context context) throws IOException,
InterruptedException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
context.write(word, one);
}
}
}

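// Reducer: sums the counts received for each word and writes (word, total).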
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

public void reduce(Text key, Iterable<IntWritable> values, Context context)


throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
}

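// Driver: configures the MapReduce job and submits it to the cluster.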
public static void main(String[] args) throws Exception {


Configuration conf = new Configuration();
Job job = new Job(conf, "Wordcount2");
job.setJarByClass(Wordcount2.class);
job.setJobName("WordCounter");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);

job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);

FileInputFormat.addInputPath(job, new Path(args[0]));


FileOutputFormat.setOutputPath(job, new Path(args[1]));

job.waitForCompletion(true);
}
}
Input File

Output File
Experiment No. 10

Title: Implementation of a MapReduce program to count the unique number of times a song is played, based on userid and trackid

package tempex;
import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class Tempex {

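// Mapper: parses lines of the form userId|trackId|... and emits (trackId, userId).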
public static class UniqueListenersMapper extends Mapper<Object, Text, IntWritable, IntWritable> {
IntWritable trackId = new IntWritable();
IntWritable userId = new IntWritable();

public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {

String[] parts = value.toString().split("[|]");
trackId.set(Integer.parseInt(parts[1]));
userId.set(Integer.parseInt(parts[0]));
context.write(trackId, userId);
}
}

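// Reducer: collects the userIds for each track into a set (dropping duplicates)
// and emits (trackId, number of unique listeners).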
public static class UniqueListenersReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
public void reduce(IntWritable trackId, Iterable<IntWritable> userIds, Context context)
throws IOException, InterruptedException {
Set<Integer> userIdSet = new HashSet<Integer>();
for (IntWritable userId : userIds) {
userIdSet.add(userId.get());
}
IntWritable size = new IntWritable(userIdSet.size());
context.write(trackId, size);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
if (args.length != 2) {
System.err.println("Usage: uniquelisteners < in > < out >");
System.exit(2);
}
Job job = new Job(conf, "Unique listeners per track");
job.setJarByClass(Tempex.class);
job.setMapperClass(UniqueListenersMapper.class);
job.setReducerClass(UniqueListenersReducer.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);

}
}

Input

Output:-
Applications/Advantages of MapReduce:

MapReduce is a programming paradigm that enables massive scalability across hundreds or
thousands of servers in a Hadoop cluster. As the processing component, MapReduce is the
heart of Apache Hadoop. The term "MapReduce" refers to two separate and distinct tasks that
Hadoop programs perform. The first is the map job, which takes a set of data and converts it
into another set of data, where individual elements are broken down into tuples (key/value
pairs).

The reduce job takes the output from a map as input and combines those data tuples into a
smaller set of tuples. As the sequence of the name MapReduce implies, the reduce job is
always performed after the map job.

MapReduce programming offers several benefits to help you gain valuable insights from
your big data:

 Scalability. Businesses can process petabytes of data stored in the Hadoop Distributed
File System (HDFS).
 Flexibility. Hadoop enables easier access to multiple sources of data and
multiple types of data.
 Speed. With parallel processing and minimal data movement, Hadoop offers
fast processing of massive amounts of data.
 Simplicity. Developers can write code in a choice of languages, including Java, C++ and
Python.
