
Roll No. 57 Exam Seat No.

VIVEKANAND EDUCATION SOCIETY’S
INSTITUTE OF TECHNOLOGY

Hashu Advani Memorial Complex, Collector’s Colony, R. C. Marg, Chembur, Mumbai – 400074. Contact No. 02261532532

CERTIFICATE

Certified that Miss Heetika Mahesh Vaity of SY MCA / Division B
has satisfactorily completed the course of necessary experiments in Big Data
Analytics and Visualization Lab under my supervision in the Institute of
Technology in the academic year 2024-2025.

Principal Head of Department

Lab In-charge Subject Teacher


V.E.S. Institute of Technology, Collector’s Colony, Chembur, Mumbai, Maharashtra 400074
Department of M.C.A

INDEX

S. No.  Contents  (Date of Preparation / Date of Submission)  Faculty Marks  Sign

1. Study of HDFS architecture and basic commands. (21/08/24 / 21/08/24)
2. To study functionality of MapReduce and implement following programs using MapReduce. (22/08/24 / 29/08/24)
3. To study MongoDB architecture and implement CRUD (Create, Read, Update, Delete) operations. (29/08/24 / 29/08/24)
4. Implementation of Hive: Creation of Database and Table, Hive Partition, Hive Built-In Functions and Operators, Hive View and Index. (29/08/24 / 31/08/24)
5. To create a Pig Data Model, Read and Store Data and Perform Pig Operations. (05/09/24 / 06/09/24)
6. To run Spark commands and functions: 1. Downloading a Data Set and Processing it in Spark; 2. Word Count in Apache Spark. (12/09/24 / 13/09/24)
7. To install and configure Power BI for educational usage. (30/09/24 / 05/10/24)
8. To learn various Data Preprocessing activities in Power BI. (30/09/24 / 07/10/24)
9. To learn handling of tables and queries in Power BI. (30/09/24 / 14/10/24)
10. To learn Data Visualization and dashboard creation in Power BI. (30/09/24 / 14/10/24)
11. Assignment 1 (09/10/24 / 09/10/24)
12. Assignment 2 (14/09/24 / 19/10/24)

Final Grade Instructor Signature


Name of Student : Heetika Mahesh Vaity

Roll No : 57 LAB Assignment Number : 1

Title of LAB Assignment : Study of HDFS architecture and basic commands.

DOP : 21/08/2024 DOS: 21/08/2024

CO MAPPED : CO1
PO MAPPED : PO1, PO2, PO3, PO4, PO5, PO7, PSO1, PSO2
FACULTY MARKS :    SIGNATURE :

AIM: STUDY OF HDFS ARCHITECTURE AND BASIC COMMANDS

THEORY:

What is Hadoop?

Hadoop is an open-source framework designed to store and process large datasets in a


distributed computing environment. It was developed by the Apache Software Foundation
and is a key technology in the field of Big Data. Hadoop allows for the distributed
processing of large data sets across clusters of computers using simple programming
models. It is designed to scale up from a single server to thousands of machines, each
offering local computation and storage.

Core Components of Hadoop

Hadoop consists of several key components, which can be grouped into two main
categories: the storage component (HDFS) and the processing component (MapReduce).
Over time, the Hadoop ecosystem has expanded to include additional components for
resource management, data querying, and more.


1. HDFS (Hadoop Distributed File System):


○ Storage Layer: HDFS is the primary storage system used by Hadoop. It stores
large datasets by breaking them into smaller blocks and distributing them
across multiple nodes in the cluster.
○ Fault Tolerance: Data in HDFS is replicated across different nodes to ensure
fault tolerance. By default, each block of data is replicated three times.
2. MapReduce:
○ Processing Layer: MapReduce is a programming model and processing
engine that allows for the distributed processing of large data sets in a
Hadoop cluster.
○ Map and Reduce Phases: The Map phase processes input data and generates
key-value pairs, while the Reduce phase aggregates and processes these
key-value pairs to produce the final output.
3. YARN (Yet Another Resource Negotiator):
○ Resource Management: YARN is responsible for managing and scheduling
resources across the Hadoop cluster. It allows multiple applications to run on
Hadoop, each with its own resource requirements.
○ Application Management: YARN divides tasks into containers, which are
allocated to different nodes in the cluster, managing the execution of these
tasks.
4. Hadoop Common:
○ Libraries and Utilities: Hadoop Common provides the necessary libraries and
utilities needed by other Hadoop components to function properly. It
includes configuration files, scripts, and Java libraries.

Additional Ecosystem Components

● Hive: SQL-like interface for querying data in HDFS.


● Pig: Scripting platform for data transformations.
● HBase: NoSQL database for real-time access to large datasets.
● Sqoop: Tool for transferring data between Hadoop and relational databases.
● Flume: Collects and moves log data into HDFS.
● Zookeeper: Coordination service for managing distributed systems.
● Oozie: Workflow scheduler for managing Hadoop jobs.


HDFS Architecture

HDFS is an open-source component of the Apache Software Foundation that manages data storage in Hadoop. Its key features are scalability, availability, and replication. NameNodes, Secondary NameNodes, DataNodes, Checkpoint Nodes, Backup Nodes, and blocks together make up the HDFS architecture. HDFS is fault-tolerant because data is replicated: files are split into blocks and distributed across the cluster machines by the NameNode and DataNodes. Note that HDFS and Apache HBase play different roles: HBase is a non-relational database that runs on top of Hadoop, while HDFS itself is a distributed, non-relational data store.

HDFS is composed of master-slave architecture, which includes the following elements:

1. NameNode (Master Node):


○ Manages metadata, including block locations on DataNodes.
○ Controls file access and monitors all DataNodes.
○ Stores records in FsImage (a snapshot of the file system) and EditLogs (logs
of changes).
○ EditLogs are written to disk after every write operation and are replicated
across all nodes for reliability.
○ Ensures that all DataNodes are active and that data is replicated for fault
tolerance.


2. DataNodes (Slave Nodes):


○ Store actual data in blocks, using ext3 or ext4 file systems.
○ Handle read and write requests from clients.
○ Operate independently; if a DataNode fails, another takes over without
impacting the cluster.
○ DataNodes communicate with each other to ensure data consistency and
block replication.

3. Secondary NameNode:
○ Activated when NameNode needs to perform a checkpoint to manage disk
space.
○ Merges FsImage and EditLogs periodically to create a new, consistent
FsImage.
○ Stores transaction logs in a single location for easier access and replication
across the cluster.
○ Helps in recovering from a failed FsImage and ensures data can be backed up
and restored.

4. Checkpoint Node:
○ Creates checkpoints by merging FsImage and EditLogs at regular intervals.
○ Provides a consistent image of the file system to the NameNode.
○ Ensures that the directory structure remains consistent with the NameNode.

5. Backup Node:
○ Ensures high availability of data by maintaining a backup of the active
NameNode’s data.
○ Can be promoted to active in case the NameNode fails.
○ Works with replica sets of data for recovery, rather than relying on individual
nodes.

6. Blocks:
○ Data is split into blocks, 128 MB by default in current Hadoop versions (64 MB in older releases).
○ Blocks are replicated across multiple DataNodes to ensure fault tolerance.
○ HDFS scales by adding more DataNodes, which automatically manage larger
blocks of data.
○ Block replication ensures that even if one DataNode fails, data can still be
recovered from other nodes.


Replication Management in HDFS Architecture

HDFS is able to survive machine crashes and recover from data corruption. It operates on the principle of replication, so in the event of a failure it can continue operating as long as replicas are available. Data is duplicated and stored on different machines in the HDFS cluster; by default, a replica of every block is stored on three DataNodes. The NameNode maintains these copies across DataNodes: it keeps track of which blocks are under- or over-replicated and adds or deletes replicas accordingly. The replication factor of individual files can also be adjusted from the shell, as shown below.
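For instance, the replication factor of an existing file can be changed from the shell (a hedged example; the path is illustrative):

hadoop fs -setrep -w 2 /user/cloudera/input/sample.txt

The -w flag makes the command wait until the re-replication has actually completed.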

Write Operation

To write a file, the client first asks the NameNode where the blocks should go, and the NameNode replies with a list of DataNodes for each block. The client splits the file into blocks and streams the first block to the first DataNode in the list; that DataNode forwards the data to the second, which forwards it to the third, forming a replication pipeline. The process continues until all of the chosen DataNodes have received the block, after which acknowledgements travel back up the pipeline to the client. Once the last block has been written and acknowledged, the client notifies the NameNode that the job is complete, and the NameNode records the final block locations in its metadata so the file can later be reconstructed. Splitting a file into blocks in this way lets the NameNode spread storage across the cluster and also improves fault tolerance and availability.

Read Operation

To read a file, the client first contacts the NameNode, which holds only metadata and returns the list of DataNodes that store each block of the file. The client then reads the blocks directly from the nearest available DataNodes, in parallel where possible. If a DataNode is unavailable or a block is corrupt, the client simply reads that block from another replica. Finally, the client reassembles the blocks into the original file and hands the data to the application.


Advantages of HDFS Architecture

1. It is a highly scalable data storage system, which makes it ideal for data-intensive
applications such as Hadoop workloads and streaming analytics. It is also relatively
easy to set up, which helps less technical users get started.
2. It is easy to implement, yet very robust, and it offers a great deal of flexibility. It is a
fast and reliable file system.
3. This makes Hadoop a good fit for a wide range of data applications, the most
common being analytics: large amounts of data can be processed quickly and then
analyzed to find trends or make recommendations.
4. The cluster can be scaled horizontally simply by adding more DataNodes, so if many
clients need to store data on HDFS, capacity grows with the cluster. Vertical scaling
(using larger individual nodes) is also possible; either way, a bigger cluster can serve
more clients.
5. Storage can be built from a centralized set of servers, from a cluster of commodity
personal computers, or a combination of both, so expensive specialized hardware is
not required.
6. Automatic block replication protects against hardware and software failures, and
logging and monitoring the cluster for anomalies helps detect and respond to such
failures early.

Disadvantages of HDFS Architecture

1. A backup and security strategy is still required. The cost of downtime can be
extremely high, so the cluster must be kept running smoothly, and without a data
backup plan the organisation’s data is at risk.
2. Data concentrated in a single location or cluster remains vulnerable to attacks and
disasters. To protect it, data should also be backed up to a remote location so that it
can be restored quickly if the primary site is lost.
3. Data often has to be copied out of HDFS, either manually or through a data
migration process, before it can be accessed and analyzed in a local environment,
which adds operational overhead.


CODE:

HADOOP COMMANDS

1) Hadoop Version:

Description:

The hadoop version command prints the Hadoop version.

Command:

hadoop version

2) Make Directory:

Description:

This command creates the directory in HDFS if it does not already exist.

Note: If the directory already exists in HDFS, then we will get an error message that file
already exists.

Command:

hadoop fs -mkdir /path/directory_name


3) Listing Directories:

Description:

Using the ls command, we can check for the directories in HDFS.

Command:

hadoop fs -ls /

4) copyFromLocal or put:

Description:

The Hadoop fs shell command put is similar to copyFromLocal; both copy files or directories from the local filesystem to a destination in the Hadoop file system.

Command:

hadoop fs -put ~/localfilepath /fileofdestination


5) count:

Description:

The Hadoop fs shell command count counts the number of files, directories, and bytes
under the paths that match the specified file pattern.

Options:

-q – shows quotas (a quota is the hard limit on the number of names and the amount of space used for individual directories)
-u – it limits output to show quotas and usage only
-h – shows sizes in a human-readable format
-v – shows header line

Command:

hadoop fs -count /
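The options listed above can be combined in one invocation (a hedged example; the path is illustrative):

hadoop fs -count -q -h -v /user

This prints the quotas, remaining quotas, and directory/file counts and sizes for /user in human-readable form, with a header line.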

6) cat

Description:

The cat command reads the file in HDFS and displays the content of the file on console or
stdout.

Command:

hadoop fs -cat /path


7) touchz

Description:

The touchz command creates a zero-byte file in HDFS. The directory part of the path is the directory in which the file will be created, and filename is the name of the new file.

Command:

hadoop fs -touchz /directory/file

8) stat:

Description:

The Hadoop fs shell command stat prints the statistics about the file or directory in the
specified format.

Formats:

%b – file size in bytes

%g – group name of owner

%n – file name

%o – block size

%r – replication

%u – user name of owner

%y – modification date

If the format is not specified then %y is used by default.


Command:

hadoop fs -stat %format /path
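Several format specifiers can be combined in a single quoted format string (a hedged example; the path is illustrative):

hadoop fs -stat "%n %b %r %y" /user/cloudera/input/sample.txt

This prints the file name, size in bytes, replication factor, and modification date on one line.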

9) checksum

Description:

The Hadoop fs shell command checksum returns the checksum information of a file.

Command:

hadoop fs -checksum /path


10) usage

Description:

The Hadoop fs shell command usage returns the help for an individual command.

Command:

hadoop fs -usage [command]

11) help

Description:

The Hadoop fs shell command help shows help for all the commands or the specified
command.

Command:

hadoop fs -help [command]


13) cp

Description:

The cp command copies a file from one directory to another directory within the HDFS.

Command:

hadoop fs -cp <src> <des>

14) mv

Description:

The HDFS mv command moves the files or directories from the source to a destination
within HDFS.

Command:

hadoop fs -mv <src> <des>


15) copyToLocal or get

Description:

This command copies files or directories from HDFS to the local filesystem.

Command:

hadoop fs -copyToLocal /hdfs/path /local/path


16) Remove Recursive (rmr):

Description:

This command recursively deletes a directory and its contents from HDFS. The -rm
command with the -r flag can also be used for this purpose.

Command:
hadoop fs -rm -r /path/directory_name

or

hadoop fs -rmr /path/directory_name

17) Disk Usage (du):

Description:

This command displays the disk usage of files and directories in HDFS.

Command:
hadoop fs -du /path


18) Disk Usage Summary (dus):

Description:

This command provides a summary of disk usage for files and directories in HDFS, typically
aggregating the sizes.

Command:
hadoop fs -dus /path

19) Move From Local:

Description:

This command moves files or directories from the local filesystem to HDFS. It is similar to
copyFromLocal, but the source file is deleted after the move.

Command:
hadoop fs -moveFromLocal /local/path /hdfs/path


CONCLUSION:

In conclusion, this practical exercise on HDFS architecture and basic commands has
provided a foundational understanding of how Hadoop's distributed file system manages
and interacts with large-scale data. By mastering commands such as mkdir, put, get, and du,
users can effectively create, manage, and monitor files within HDFS. This knowledge is
essential for leveraging HDFS’s capabilities in real-world big data environments, ensuring
efficient data storage and retrieval.


Name of Student : Heetika Mahesh Vaity

Roll No : 57 LAB Assignment Number : 2

Title of LAB Assignment : To study functionality of MapReduce and implement following programs using MapReduce -

1. Write a program in Map Reduce for WordCount operation.

2. Write a program in Map Reduce for Matrix Multiplication

DOP : 22/08/2024 DOS: 29/08/2024

CO MAPPED : CO2
PO MAPPED : PO1, PO2, PO3, PO4, PO5, PO7, PSO1, PSO2
FACULTY MARKS :    SIGNATURE :


AIM: TO STUDY FUNCTIONALITY OF MAPREDUCE AND IMPLEMENT


FOLLOWING PROGRAMS USING MAPREDUCE -

1. WRITE A PROGRAM IN MAP REDUCE FOR WORDCOUNT OPERATION.

2. WRITE A PROGRAM IN MAP REDUCE FOR MATRIX MULTIPLICATION

THEORY:

What is MapReduce?

MapReduce is a Java-based, distributed execution framework within the Apache Hadoop ecosystem. It takes away the complexity of distributed programming by exposing two processing steps that developers implement: 1) Map and 2) Reduce. In the Map step, data is split between parallel processing tasks and transformation logic is applied to each chunk of data. Once that completes, the Reduce phase takes over and aggregates the data produced by the Map step. In general, MapReduce uses the Hadoop Distributed File System (HDFS) for both input and output, though some technologies built on top of Hadoop, such as Sqoop, allow access to relational systems.

How does MapReduce work?

A MapReduce system is usually composed of three steps (even though it's generalized as
the combination of Map and Reduce operations/functions). The MapReduce operations are:
● Map: The input data is first split into smaller blocks. The Hadoop framework then
decides how many mappers to use, based on the size of the data to be processed
and the memory block available on each mapper server. Each block is then assigned
to a mapper for processing. Each ‘worker’ node applies the map function to the
local data, and writes the output to temporary storage. The primary (master) node
ensures that only a single copy of the redundant input data is processed.
● Shuffle, combine and partition: worker nodes redistribute data based on the
output keys (produced by the map function), so that all data belonging to one key
ends up on the same worker node. As an optional step, a combiner (essentially a
local reducer) can run on each mapper server to shrink that mapper's output even
further, reducing the data footprint and making shuffling and sorting cheaper.
Partitioning (not optional) decides how the data is presented to the reducers and
assigns each key to a particular reducer.


● Reduce: A reducer cannot start while any mapper is still in progress. Worker nodes
process each group of <key, value> pairs in parallel and produce <key, value> pairs
as output. All map output values that share the same key are sent to a single
reducer, which aggregates the values for that key. Unlike the map function, which is
mandatory for filtering and sorting the initial data, the reduce function is optional.
A small worked example follows.
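As a small worked example (illustrative input, not from the lab): for the input line "hello world hello", the map phase emits (hello, 1), (world, 1), (hello, 1); the shuffle phase groups these into (hello, [1, 1]) and (world, [1]); and the reduce phase sums each list to give (hello, 2) and (world, 1).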

Architecture

1. Master Node (Job Tracker):

● Role: Manages and coordinates the MapReduce jobs. It assigns map and reduce
tasks to worker nodes and monitors their progress.
● Responsibilities: Includes scheduling tasks, handling failures, and managing
resource allocation.

2. Worker Nodes (Task Trackers):

● Role: Execute the map and reduce tasks as assigned by the master node.
● Responsibilities: Each worker node processes data locally to minimize data
transfer overhead, performs the map and reduce operations, and reports progress
back to the master node.

3. Input Splits:

● Definition: The input data is divided into smaller chunks called splits. Each split is
processed independently by a map task.
● Purpose: Helps in parallel processing and efficient utilization of resources.

4. Data Locality:

● Concept: To optimize performance, the MapReduce framework attempts to process


data on the node where it is stored, reducing network overhead.

MapReduce Workflow


1. Data Input: Input data is read and split into smaller chunks.
2. Map Tasks: Each map task processes its chunk and emits intermediate key-value
pairs.
3. Shuffle and Sort: The intermediate data is shuffled and sorted by key.
4. Reduce Tasks: Each reduce task processes the sorted data and generates the final
output.
5. Output Storage: The final results are written to an output file or database.

Advantages

● Scalability: Easily scales to handle large datasets by distributing tasks across many
nodes.
● Fault Tolerance: Automatically recovers from node failures by reassigning tasks.
● Simplicity: Provides a simple abstraction for parallel processing without requiring
detailed knowledge of the underlying infrastructure.

Applications

● Data Analysis: Used for large-scale data analysis, such as log processing, web
indexing, and data mining.
● Search Engines: Helps in indexing large volumes of web data.
● Machine Learning: Facilitates distributed training of machine learning models.

Challenges

● Data Transfer Overhead: Shuffling and sorting can be network-intensive.


● Performance Tuning: Optimizing job execution, including task placement and
resource allocation, can be complex.
● Debugging: Debugging MapReduce jobs can be challenging due to the distributed
nature of the processing.

1. WRITE A PROGRAM IN MAP REDUCE FOR WORDCOUNT OPERATION.


—>

CODE/ STEPS:

● Create a Java Project and add respective JAR files

● Add External JARs


Navigate to the Libraries tab and click on “Add External JARs…”.
Add JAR files by clicking on File System: navigate to usr/lib/hadoop and select all JAR files. Also navigate to the client folder under the hadoop folder and add those JAR files.
Once all the JAR files have been added, click on Finish.


● Create a new package and your Class


● Copy the code from https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Source_Code and paste it in your class file.

WordCountHeetika.java

package heetika_57;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountHeetika {

    // Mapper: tokenizes each input line and emits (word, 1) for every token
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer (also used as the combiner): sums the counts emitted for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: configures and submits the job; args[0] = HDFS input path, args[1] = HDFS output path
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountHeetika.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

● Export the JAR File


● Open the Terminal and type following Commands


● Add text to your input file

● Run the following Command


OUTPUT:


2. WRITE A PROGRAM IN MAP REDUCE FOR MATRIX MULTIPLICATION

—>

This MapReduce program performs matrix multiplication in two stages: transforming and
multiplying matrices. It involves custom input/output formats and multiple MapReduce
jobs. This process demonstrates the use of MapReduce for complex data transformations
and aggregations.

1. Custom Writable Classes:

○ Element: Represents an individual matrix element with a tag indicating the matrix
(M or N), index, and value.

○ Pair: Represents a matrix coordinate, used as a key for the final output.

2. Job 1: Matrix Transformation

○ Mappers (MatrixMapperM & MatrixMapperN): Read matrix data, create Element


objects, and emit key-value pairs where the key is the row or column index and the
value is the matrix element.

○ Reducer (ReducerMN): Receives elements from both matrices, performs


multiplication for matching indices, and outputs intermediate results.

3. Job 2: Final Aggregation

○ Mapper (MapMN): Reads intermediate data and outputs key-value pairs for final
aggregation.

○ Reducer (ReduceMN): Aggregates values for each key, producing the final matrix
multiplication result.

Execution Flow:

1. Job 1: Transforms matrix data into a format suitable for multiplication and stores
intermediate results.

2. Job 2: Aggregates intermediate results to produce the final matrix multiplication output.
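As a small worked example (using entries from the sample matrices defined later in this write-up): for key k = 0, the first reducer receives M(0,0) = 1.0 and N(0,0) = 5.0 and emits ((0,0), 5.0); for key k = 1 it receives M(0,1) = 2.0 and N(1,0) = 5.0 and emits ((0,0), 10.0). Job 2 then sums these contributions to give C(0,0) = 15.0.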


The diagram below shows an example of Matrix Multiplication using MapReduce:


CODE/STEPS:

● Open Eclipse and create a new Java project. Name your project and click Next
here.

● Add External JARs

Navigate to the Libraries tab and click on “Add External JARs…”.
Add JAR files by clicking on File System: navigate to usr/lib/hadoop and select all JAR files. Also navigate to the client folder under the hadoop folder and add those JAR files.
Once all the JAR files have been added, click on Finish.


● In the java project, create a new java file. Copy and paste the code provided:

MatrixMultiplicationHeetika.java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.ReflectionUtils;

class Element implements Writable {


int tag;
int index;
double value;

Element() {
tag = 0;
index = 0;
value = 0.0;
}

Element(int tag, int index, double value) {


this.tag = tag;
this.index = index;
this.value = value;
}

@Override
public void readFields(DataInput input) throws IOException {


tag = input.readInt();
index = input.readInt();
value = input.readDouble();
}

@Override
public void write(DataOutput output) throws IOException {
output.writeInt(tag);
output.writeInt(index);
output.writeDouble(value);
}
}

class Pair implements WritableComparable<Pair> {


int i;
int j;

Pair() {
i = 0;
j = 0;
}

Pair(int i, int j) {
this.i = i;
this.j = j;
}

@Override
public void readFields(DataInput input) throws IOException {
i = input.readInt();
j = input.readInt();
}

@Override
public void write(DataOutput output) throws IOException {
output.writeInt(i);
output.writeInt(j);
}

@Override
public int compareTo(Pair compare) {


if (i > compare.i) {
return 1;
} else if (i < compare.i) {
return -1;
} else {
if (j > compare.j) {
return 1;
} else if (j < compare.j) {
return -1;
}
}
return 0;
}

@Override
public String toString() {
return i + " " + j + " ";
}
}

public class MatrixMultiplicationHeetika {

public static class MatrixMapperM extends Mapper<Object,


Text, IntWritable, Element> {
@Override
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String readLine = value.toString();
String[] tokens = readLine.split(",");
int index = Integer.parseInt(tokens[0]);
double elementVal = Double.parseDouble(tokens[2]);
Element e = new Element(0, index, elementVal);
IntWritable keyval = new
IntWritable(Integer.parseInt(tokens[1]));
context.write(keyval, e);
}
}

public static class MatrixMapperN extends Mapper<Object,


Text, IntWritable, Element> {
@Override


public void map(Object key, Text value, Context context)


throws IOException, InterruptedException {
String readLine = value.toString();
String[] tokens = readLine.split(",");
int index = Integer.parseInt(tokens[1]);
double elementVal = Double.parseDouble(tokens[2]);
Element e = new Element(1, index, elementVal);
IntWritable keyval = new
IntWritable(Integer.parseInt(tokens[0]));
context.write(keyval, e);
}
}

public static class ReducerMN extends Reducer<IntWritable,


Element, Pair, DoubleWritable> {
@Override
public void reduce(IntWritable key, Iterable<Element>
values, Context context) throws IOException,
InterruptedException {
ArrayList<Element> M = new ArrayList<>();
ArrayList<Element> N = new ArrayList<>();
Configuration conf = context.getConfiguration();
for (Element element : values) {
Element temp =
ReflectionUtils.newInstance(Element.class, conf);
ReflectionUtils.copy(conf, element, temp);
if (temp.tag == 0) {
M.add(temp);
} else if (temp.tag == 1) {
N.add(temp);
}
}
for (int i = 0; i < M.size(); i++) {
for (int j = 0; j < N.size(); j++) {
Pair p = new Pair(M.get(i).index,
N.get(j).index);
double mul = M.get(i).value *
N.get(j).value;
context.write(p, new DoubleWritable(mul));
}
}


}
}

public static class MapMN extends Mapper<Object, Text, Pair,


DoubleWritable> {
@Override
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String readLine = value.toString();
String[] pairValue = readLine.split(" ");
Pair p = new Pair(Integer.parseInt(pairValue[0]),
Integer.parseInt(pairValue[1]));
DoubleWritable val = new
DoubleWritable(Double.parseDouble(pairValue[2]));
context.write(p, val);
}
}

public static class ReduceMN extends Reducer<Pair,


DoubleWritable, Pair, DoubleWritable> {
@Override
public void reduce(Pair key, Iterable<DoubleWritable>
values, Context context) throws IOException,
InterruptedException {
double sum = 0.0;
for (DoubleWritable value : values) {
sum += value.get();
}
context.write(key, new DoubleWritable(sum));
}
}

public static void main(String[] args) throws Exception {


Path MPath = new Path("/Heetika/input/M");
Path NPath = new Path("/Heetika/input/N");
Path intermediatePath = new Path("/Heetika/interim");
Path outputPath = new Path("/Heetika/output");

Job job1 = Job.getInstance();


job1.setJobName("Map Intermediate");
job1.setJarByClass(MatrixMultiplicationHeetika.class);


MultipleInputs.addInputPath(job1, MPath,
TextInputFormat.class, MatrixMapperM.class);
MultipleInputs.addInputPath(job1, NPath,
TextInputFormat.class, MatrixMapperN.class);
job1.setReducerClass(ReducerMN.class);
job1.setMapOutputKeyClass(IntWritable.class);
job1.setMapOutputValueClass(Element.class);
job1.setOutputKeyClass(Pair.class);
job1.setOutputValueClass(DoubleWritable.class);
job1.setOutputFormatClass(TextOutputFormat.class);
FileOutputFormat.setOutputPath(job1, intermediatePath);
job1.waitForCompletion(true);

Job job2 = Job.getInstance();


job2.setJobName("Map Final Output");
job2.setJarByClass(MatrixMultiplicationHeetika.class);
job2.setMapperClass(MapMN.class);
job2.setReducerClass(ReduceMN.class);
job2.setOutputKeyClass(Pair.class);
job2.setOutputValueClass(DoubleWritable.class);
job2.setInputFormatClass(TextInputFormat.class);
job2.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job2, intermediatePath);
FileOutputFormat.setOutputPath(job2, outputPath);
job2.waitForCompletion(true);
}
}

● Create Directories in HDFS:


Matrix data used for implementation. Create the following files on the Desktop (each line is row,column,value):

Matrix M can be interpreted as:
0,0,1.0
0,1,2.0
1,0,3.0
1,1,4.0

Matrix N can be interpreted as:
0,0,5.0
0,1,7.0
1,0,5.0
1,0,7.0

● Upload Matrix Files to HDFS:


● Create a Manifest File

Create a file named Manifest.txt in the src folder. The content of Manifest.txt should specify
the entry point of your application. For example:

Main-class: MatrixMultiplicationHeetika

● Navigate to the src folder

cd /home/cloudera/workspace/MatrixMultiplication
cd src

● To compile your MatrixMultiplicationHeetika.java file and generate the necessary classes, use the following command:

javac MatrixMultiplicationHeetika.java -cp $(hadoop classpath)

➔ -cp $(hadoop classpath): Sets the classpath to include all necessary Hadoop
libraries.

➔ -d /path/to/output/directory: Specifies the directory where the compiled classes


will be stored.

● To package your compiled classes and the manifest file into a JAR file, use the following command:

jar cfm MatrixMultiplicationHeetika.jar Manifest.txt *.class


● Now that you have your JAR file, you can run it on Hadoop to execute your
MapReduce tasks. Execute the Hadoop Jobs (Both should run successfully):

hadoop jar MatrixMultiplicationHeetika.jar MatrixMultiplicationHeetika


OUTPUT:

Output of the intermediate job:

Final Output:

Output can be interpreted as

CONCLUSION:
MapReduce simplifies large-scale data processing by breaking down complex tasks into
smaller, parallelizable units. Its design emphasizes scalability, fault tolerance, and efficiency,
making it suitable for handling vast amounts of data across distributed computing
environments. The combination of the map and reduce phases, along with a robust
architecture for resource management and fault tolerance, makes MapReduce a powerful
tool for data-intensive applications.

Name of Student : Heetika Mahesh Vaity

Roll No : 57 LAB Assignment Number : 3

Title of LAB Assignment : To study MongoDB architecture and implement CRUD (Create, Read, Update, Delete) operations along with the following -
1. Installation
2. Query the Sample Database using MongoDB querying commands

● Insert Document
● Query Document
● Indexing

DOP : 29/08/2024 DOS: 29/08/2024

CO MAPPED : CO3
PO MAPPED : PO1, PO2, PO3, PO4, PO5, PO7, PSO1, PSO2
FACULTY MARKS :    SIGNATURE :

AIM: TO STUDY MONGODB ARCHITECTURE AND IMPLEMENT CRUD


(CREATE, READ , UPDATE, DELETE ) OPERATIONS ALONG WITH
FOLLOWING -
1. INSTALLATION
2. QUERY THE SAMPLE DATABASE USING MONGODB QUERYING
COMMANDS

● INSERT DOCUMENT
● QUERY DOCUMENT
● INDEXING

THEORY:

MongoDB Overview

MongoDB is a NoSQL, document-oriented database that provides high performance, high


availability, and easy scalability. Unlike traditional relational databases that store data in
tables, MongoDB stores data in flexible, JSON-like documents, which allows for the storage
of complex data types and hierarchical relationships. MongoDB is widely used for
applications that require big data, real-time analytics, and fast, iterative development.

Key Features of MongoDB

1. Document-Oriented Storage: MongoDB stores data in documents, which are


analogous to records or rows in relational databases. These documents are stored in
a format called BSON (Binary JSON), which supports embedded documents and
arrays. This model aligns more closely with the objects in your application code,
making data easier to work with.
2. Flexible Schema: MongoDB has a flexible schema design, meaning documents within
the same collection do not have to have the same structure. This allows for more
flexibility in how you model data, as you can change the schema as your application
evolves without requiring costly database migrations.
3. Scalability: MongoDB supports horizontal scaling through a process called sharding.
Sharding distributes data across multiple servers (shards), allowing the database to
handle large datasets and high throughput operations by scaling out rather than
scaling up.


4. High Availability: MongoDB ensures high availability through a feature called


replication. In a MongoDB replica set, data is replicated across multiple servers. If
one server goes down, another can take over with minimal downtime, providing
fault tolerance.
5. Rich Query Language: MongoDB offers a powerful query language that supports
CRUD operations (Create, Read, Update, Delete), as well as aggregation, indexing,
and geospatial queries.
6. Indexing: MongoDB supports various types of indexing, including single field,
compound, multi-key, text, and geospatial indexes, which improve query
performance.

MongoDB Architecture

MongoDB's architecture is designed to manage large-scale data while providing flexibility,


scalability, and high availability. The core components of MongoDB's architecture include:

1. Documents: The basic unit of data in MongoDB is the document, which is stored in a
binary JSON format (BSON). A document is a set of key-value pairs, where the values
can include arrays and other documents, enabling complex data structures.
2. Collections: Documents are grouped into collections, which are analogous to tables
in relational databases. Collections do not enforce a schema, so documents within a
collection can have different fields and structures.


3. Replica Sets: A replica set is a group of MongoDB servers that maintain the same
data set. A replica set includes:
○ Primary: The primary server receives all write operations. It replicates the
data to secondary servers.
○ Secondaries: Secondary servers replicate data from the primary. They can
serve read operations, depending on the configuration, and can be promoted
to primary if the current primary fails.
○ Arbiter: An arbiter is a member of the replica set that does not hold data but
participates in elections to determine the primary.
4. Sharding: Sharding is the process of distributing data across multiple machines. In
MongoDB, sharding is achieved by:
○ Shard: A shard is a subset of the data, distributed across servers.
○ Config Servers: Config servers store the metadata and configuration settings
for the cluster.
○ Query Router (mongos): The query router routes client requests to the
appropriate shard based on the shard key, which determines how data is
distributed across shards.
5. MongoDB Deployment: MongoDB can be deployed in various configurations:
○ Single Server: Suitable for development and testing.
○ Replica Set: Provides high availability and redundancy.
○ Sharded Cluster: Provides scalability by distributing data across multiple
shards.

How These Components Work Together:

● Data Storage: Data is stored in databases, which contain collections. Collections


store documents, where each document is a set of key-value pairs.
● Scalability and High Availability: Sharding ensures horizontal scalability by
distributing data across multiple shards, while replica sets ensure data redundancy
and high availability.
● Query Processing: Client queries are processed by query routers, which interact
with configuration servers to determine the data location and direct the query to the
appropriate shard.
● Configuration Management: Configuration servers maintain the cluster's state,
ensuring that data is correctly mapped and available across the distributed system.


CRUD operations

CRUD operations in MongoDB refer to the four basic operations you can perform on the
data stored in a MongoDB database: Create, Read, Update, and Delete. Here’s how you can
perform these operations using the MongoDB shell or a MongoDB driver.

1. Create (Insert)
● Insert a Single Document:
db.collectionName.insertOne({
name: "John Doe",
age: 30,
address: "123 Main St"
});

● Insert Multiple Documents:

db.collectionName.insertMany([

{ name: "Alice", age: 25, address: "456 Maple Ave" },


{ name: "Bob", age: 28, address: "789 Oak St" }
]);

2. Read (Query)
● Find All Documents in a Collection:
db.collectionName.find();

● Find Documents with a Query Filter:


db.collectionName.find({ age: 30 });

● Find a Single Document:


db.collectionName.findOne({ name: "John Doe" });

● Projection (Select Specific Fields):


db.collectionName.find({ age: 30 }, { name: 1, _id: 0 });
This query returns only the name field, excluding the _id field.


3. Update
● Update a Single Document:
db.collectionName.updateOne(
{ name: "John Doe" }, // Filter
{ $set: { age: 31 } } // Update operation
);

● Update Multiple Documents:


db.collectionName.updateMany(
{ age: { $gt: 25 } }, // Filter
{ $set: { status: "active" } } // Update operation
);

● Replace a Document:
db.collectionName.replaceOne(
{ name: "John Doe" }, // Filter
{ name: "John Doe", age: 31, address: "123 Main St" } // New document
);

4. Delete
● Delete a Single Document:
db.collectionName.deleteOne({ name: "John Doe" });

● Delete Multiple Documents:


db.collectionName.deleteMany({ age: { $lt: 30 } });


Steps to Install MongoDB on Windows using MSI

1. Visit the MongoDB Download Center and select the latest version of MongoDB
Community Server.

2. Choose the Windows operating system and the MSI package.


3. Run the Installer: Double-click the downloaded .msi file.

4. Accept the license


5. Follow the installation prompts. During the installation, select the option to install
MongoDB as a Service (recommended).

6. Configure the MongoDB service to start automatically and start installation.


7. Installation complete


8. Run MongoDB: Open a command prompt and type mongod to start the MongoDB
server.
Open another command prompt and type mongosh to connect to the server.


CODE & OUTPUT:

1. Show Existing databases


> show dbs

2. Sample database creation


> use 57heetika_student_data

3. Create collection
MongoDB collections are created automatically (implicitly) when you insert data.
Explicit creation:
> db.createCollection("students");

4. Insert Document
> db.students.insertMany([ { "name" : "Heetika", "age" : 20,
"major" : "IT", "gpa" : 9.8 }, { "name" : "Tanish", "age" : 22,
"major" : "CS", "gpa" : 8.0 }, {"name" : "John", "major":
"Commerce", "gpa" : 8.5}, {"name": "Pragati", "major": "CS",
"gpa": 9.5}]);


> db.students.insertOne({"name": "Aisha", "age": 27, "major":


"Mathematics", "gpa": 7.8});

5. Query Document
>db.students.find()

> db.students.findOne({name: 'Heetika'})


> db.students.find({gpa: {$gt: 8.0}})

6. Delete Document
> db.students.remove({"name" : "John"})

> db.students.find().pretty()


7. Update Document
> db.students.update({'name': 'Tanish'}, {$set: {'major' :
"Chemistry"}})

8. Indexing document
>db.students.createIndex({"gpa":1})

This creates an index on the gpa field. { gpa: 1 }: The 1 indicates that the index will be
created in ascending order. If you wanted a descending order, you would use -1.

> db.students.createIndex({ gpa: 1, major: 1 });

This creates a compound index on multiple fields
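To confirm that a query actually uses one of these indexes, the query plan can be inspected (a hedged example against the students collection used above):

> db.students.find({gpa: {$gt: 8.0}}).explain("executionStats")

In the output, an IXSCAN stage indicates that the gpa index was used instead of a full collection scan.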

> db.students.getIndexes()


> db.students.dropIndexes()

CONCLUSION:

In this practical, we successfully performed various CRUD operations using MongoDB. We explicitly
created a collection and inserted documents into it. We also practiced querying documents with
specific criteria and enhanced query performance by creating indexes. These operations
demonstrated the flexibility and efficiency of MongoDB as a NoSQL database, highlighting its
suitability for handling large datasets with complex, unstructured data.

Name of Student : Heetika Mahesh Vaity

Roll No : 57 LAB Assignment Number : 4

Title of LAB Assignment : Implementation of Hive: Creation of Database and Table, Hive Partition, Hive Built In Function and Operators, Hive View and Index.

DOP : 29/08/2024 DOS: 31/08/2024

CO MAPPED : CO4
PO MAPPED : PO1, PO2, PO3, PO4, PO5, PO7, PSO1, PSO2
FACULTY MARKS :    SIGNATURE :

AIM: IMPLEMENTATION OF HIVE: CREATION OF DATABASE AND TABLE,


HIVE PARTITION, HIVE BUILT IN FUNCTION AND OPERATORS, HIVE VIEW
AND INDEX.

THEORY:

Introduction to Hive

Apache Hive is a data warehousing and SQL-like query language tool that is built on top of
Apache Hadoop. It facilitates reading, writing, and managing large datasets stored in
distributed storage using SQL. Hive abstracts the complexity of Hadoop's underlying system
by providing a user-friendly interface that allows users to write queries in HiveQL, which is
similar to SQL. This makes it accessible to users who are familiar with traditional relational
databases.

Key Features of Hive

1. SQL-Like Language (HiveQL):


○ HiveQL is a query language similar to SQL, making it easy for users with SQL
knowledge to interact with Hadoop. It supports a wide range of SQL
operations, including SELECT, INSERT, JOIN, and more.
2. Schema on Read:
○ Hive applies schemas to data at the time of reading, not when the data is
stored. This means data can be ingested into Hadoop without being
pre-processed or structured, which is beneficial for handling diverse data
types.
3. Data Warehousing:
○ Hive is designed for data warehousing tasks such as data summarization,
querying, and analysis. It supports the storage of large datasets in a
distributed environment and enables fast retrieval of data through its query
optimization techniques.
4. Scalability:
○ Hive can handle and process vast amounts of data across multiple nodes in a
Hadoop cluster. It is designed to scale with the underlying Hadoop
infrastructure, making it suitable for big data analytics.
5. Extensibility:
○ Hive can be extended with user-defined functions (UDFs) to perform custom
operations on data. Users can also define their own file formats and
serialization/deserialization methods.


6. Integration with Hadoop Ecosystem:


○ Hive integrates seamlessly with other components of the Hadoop ecosystem,
such as HDFS (Hadoop Distributed File System) for storage, and YARN (Yet
Another Resource Negotiator) or MapReduce for processing.

Hive Architecture

Hive's architecture consists of several key components:

● Metastore:
○ The metastore stores metadata about the databases, tables, partitions,
columns, and data types. This metadata is critical for Hive’s operation and
allows it to understand how to interpret and query the data.
● Driver:
○ The driver manages the lifecycle of a HiveQL statement, including parsing,
compiling, optimizing, and executing queries. It also interacts with the query
compiler and execution engine.
● Compiler:
○ The compiler translates HiveQL statements into a directed acyclic graph
(DAG) of MapReduce jobs, Tez tasks, or Spark jobs, depending on the
execution engine being used.
● Execution Engine:
○ The execution engine executes the tasks generated by the compiler and
interacts with Hadoop's underlying data processing engines, such as
MapReduce, Tez, or Spark.


● CLI (Command Line Interface):


○ Hive provides a CLI for users to interact with the system, run HiveQL queries,
and manage Hive objects like databases and tables.

Use Cases for Hive

● Batch Processing:
○ Hive is well-suited for batch processing jobs where large datasets need to be
processed in bulk. It is not designed for low-latency querying, making it more
suitable for reporting and analysis tasks.
● Data ETL (Extract, Transform, Load):
○ Hive can be used for data transformation and loading tasks. It can process
raw data from various sources, transform it into a desired format, and store it
in tables for analysis.
● Log Analysis:
○ Hive is commonly used to analyze large volumes of log data, such as web
server logs or application logs, to extract meaningful insights and patterns.
● Data Warehousing:
○ Hive is ideal for building data warehouses in big data environments, enabling
businesses to store and analyze large amounts of structured and
semi-structured data.

Limitations of Hive

● High Latency:
○ Hive is designed for batch processing and is not optimized for real-time
queries. Queries can take a long time to execute, especially on large datasets.
● Complexity of Joins:
○ Joining large tables in Hive can be resource-intensive and time-consuming, as
it relies on MapReduce or other distributed computing frameworks.
● Limited Transactional Support:
○ While Hive does support ACID transactions, it is limited in comparison to
traditional RDBMS, making it less suitable for applications requiring complex
transactions.


Hive Databases:

Hive supports the concept of databases, which are logical containers for organizing tables,
views, and other database objects. Creating a database helps segregate data logically and
avoid name collisions between objects.
Syntax:
CREATE DATABASE database_name;

Example: CREATE DATABASE sales_data;


You can switch between databases using:
USE database_name;

Hive Tables:

Tables in Hive are similar to tables in a relational database and can be created using the
CREATE TABLE statement. Each table is associated with a directory in HDFS, and the data
is stored in files within that directory.
Syntax:
CREATE TABLE table_name (
column1 datatype,
column2 datatype,
...
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY 'delimiter'
STORED AS file_format;

Example: CREATE TABLE employee (id INT, name STRING, age INT)
STORED AS TEXTFILE;

Tables in Hive can be either Managed Tables (where Hive manages the lifecycle of the table
and data) or External Tables (where the data is stored externally and Hive only manages
the metadata).
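As a hedged illustration of the external-table variant (the table name and HDFS location are assumed example values):

CREATE EXTERNAL TABLE employee_ext (id INT, name STRING, age INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/cloudera/employee_data';

Dropping an external table removes only the metadata; the files under the LOCATION path are left untouched.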


Hive Partition:

Partitioning is a technique in Hive used to divide a table into smaller parts based on the
value of one or more columns. Each partition corresponds to a unique value or a
combination of values, and Hive creates separate directories for each partition in HDFS.
Partitioning improves query performance by allowing queries to scan only the relevant
partitions instead of the entire table.
Syntax:
CREATE TABLE table_name (
column1 datatype,
column2 datatype,
...
)
PARTITIONED BY (partition_column datatype)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY 'delimiter'
STORED AS file_format;

Example: CREATE TABLE employee_partitioned (id INT, name STRING,


age INT) PARTITIONED BY (department STRING) STORED AS TEXTFILE;

Adding Partitions:
ALTER TABLE table_name ADD PARTITION (partition_column='value')
LOCATION 'hdfs_path';

Advantages of Partitioning:

● Reduces the amount of data scanned during queries.


● Improves query performance, especially for large datasets.
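As a sketch of how a partition is typically populated and then queried (assuming the employee_partitioned table defined above; the values are illustrative):

INSERT INTO TABLE employee_partitioned PARTITION (department='IT')
VALUES (1, 'Asha', 25);

SELECT * FROM employee_partitioned WHERE department = 'IT';

Because the WHERE clause matches the partition column, Hive scans only the department=IT directory rather than the whole table.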


Hive Built-In Functions:

Hive provides a variety of built-in functions that are essential for data processing and
transformation. These functions are categorized into:

● Aggregate Functions:
○ Examples: MAX(),COUNT(),SUM(),AVG()
○ Usage: SELECT COUNT(*) FROM employee;
● String Functions:
○ Examples: UPPER(), LOWER(), CONCAT(), SUBSTRING()
○ Usage: SELECT UPPER(name) FROM employee;
● Mathematical Functions:
○ Examples: ROUND(), CEIL(), FLOOR(), ABS()
○ Usage: SELECT ROUND(salary, 2) FROM employee;
● Date Functions:
○ Examples: CURRENT_DATE(), YEAR(), MONTH()
○ Usage: SELECT YEAR(hire_date) FROM employee;
● Conditional Functions:
○ Examples: IF(), CASE...WHEN, COALESCE()
○ Usage: SELECT IF(age > 30, 'Senior', 'Junior') FROM
employee;

Hive Operators:

● Arithmetic Operators: Used for basic arithmetic calculations.


○ Examples: +, -, *, /
○ Usage: SELECT salary * 0.1 AS bonus FROM employee;
● Comparison Operators: Used to compare values.
○ Examples: =, !=, >, <, >=, <=
○ Usage: SELECT * FROM employee WHERE age > 30;
● Logical Operators: Used to combine multiple conditions.
○ Examples: AND, OR, NOT
○ Usage: SELECT * FROM employee WHERE age > 30 AND
department = 'HR';


Hive Views:

A view in Hive is a logical, virtual table that is derived from a query. Unlike tables, views do
not store data themselves; instead, they store the SQL query used to generate the data.
Views are useful for simplifying complex queries, enforcing security (by limiting access to
specific columns or rows), and abstracting underlying table structures.
Syntax:
CREATE VIEW view_name AS SELECT columns FROM table_name WHERE
conditions;

Example: CREATE VIEW senior_employees AS SELECT id, name,


department FROM employee WHERE age > 30;

Querying a View:
SELECT * FROM view_name;

Hive Index:

Indexing in Hive is used to improve the speed of query operations on large datasets.
Indexes are created on specific columns of a table to allow faster data retrieval.
Hive supports several types of indexes, including Compact Indexes and Bitmap Indexes.
However, indexing in Hive is less commonly used compared to traditional databases due to
its performance overhead, and it's often considered based on specific use cases.
Syntax:
CREATE INDEX index_name
ON TABLE table_name (column_name)
AS 'index_type'
WITH DEFERRED REBUILD;

Example: CREATE INDEX idx_employee_department ON TABLE employee (department) AS 'COMPACT';

Show Index:
SHOW INDEX ON table_name;

Dropping an Index:
DROP INDEX index_name ON table_name;

7
Heetika Vaity BDAV Lab SYMCA/B-57

CODE AND OUTPUT:

● Create employee.csv file

● Start zookeeper-server and hive-server2

8
Heetika Vaity BDAV Lab SYMCA/B-57

● Create database heetika57

● Show databases

● Create table employee57

● Show tables

9
Heetika Vaity BDAV Lab SYMCA/B-57

● Use the LOAD DATA command to load data from the local file system into the
Hive table.
○ LOCAL: Indicates that the file is located on the local filesystem of the machine where
Hive is running.
○ INPATH 'employee.csv': Specifies the path to the file to be loaded. Since it’s a local
file, the path should be relative to the machine where the Hive CLI or Beeline is being
executed.
○ INTO TABLE employee57: Specifies the Hive table into which the data should be
loaded.

● Retrieve all the data from the table using SELECT *

● Describe table

10
Heetika Vaity BDAV Lab SYMCA/B-57

● DESCRIBE FORMATTED: This command is used to retrieve detailed metadata about the specified table, including information about columns, partitions, file formats, storage details, and more.

11
Heetika Vaity BDAV Lab SYMCA/B-57

Hive Aggregation Functions:


● Count()-

● Max() -

12
Heetika Vaity BDAV Lab SYMCA/B-57

● sum() -

● limit

● Creating new table by inserting data from another table employee57

13
Heetika Vaity BDAV Lab SYMCA/B-57

● Renaming table using ALTER command

● Create a new table by referring existing table using LIKE

● Retrieving values by id in descending order using ORDER BY

14
Heetika Vaity BDAV Lab SYMCA/B-57

● SORT by name in descending order

● Hive Built-In Functions - UPPER, LOWER

● Operators

15
Heetika Vaity BDAV Lab SYMCA/B-57

HCATALOG:

HCatalog is a table and storage management layer that sits on top of Apache Hive and
provides a unified interface for accessing data stored in various formats in Hadoop. It is
designed to facilitate interoperability between different data processing tools in the
Hadoop ecosystem, such as Pig, MapReduce, and Hive, by providing a consistent view of
data stored in HDFS (Hadoop Distributed File System).
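
The operations below were issued through the HCatalog command-line tool. As a representative (illustrative) example of the kind of command used, DDL can be passed to hcat with the -e option:

hcat -e "CREATE TABLE hcat_demo (id INT, name STRING) STORED AS RCFILE;"
hcat -e "SHOW TABLES;"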

● Create Table

● Show tables

● Describe table

● Drop table

16
Heetika Vaity BDAV Lab SYMCA/B-57

● Create external table

JOIN OPERATION:

● Create tables ‘sales’ and ‘product’

17
Heetika Vaity BDAV Lab SYMCA/B-57

● Insert values in to ‘sales’ table

18
Heetika Vaity BDAV Lab SYMCA/B-57

● Retrieve values from ‘sales’ table

● Insert values in to ‘product’ table

Retrieve values from ‘product’ table

19
Heetika Vaity BDAV Lab SYMCA/B-57

● INNER JOIN

20
Heetika Vaity BDAV Lab SYMCA/B-57

● LEFT OUTER JOIN

21
Heetika Vaity BDAV Lab SYMCA/B-57

● RIGHT OUTER JOIN

22
Heetika Vaity BDAV Lab SYMCA/B-57

Hive Partition:
● Creating a table partitioned by department values

23
Heetika Vaity BDAV Lab SYMCA/B-57

24
Heetika Vaity BDAV Lab SYMCA/B-57

Hive View:
● Creating and Querying a View

Hive Index:
● Create index

● Drop index

CONCLUSION:

Hive simplifies the process of managing and querying large datasets in a Hadoop
ecosystem. Its SQL-like language, scalability, and integration with Hadoop make it a
powerful tool for big data analytics. However, its high latency and limitations in complex
queries mean it is best suited for batch processing and data warehousing tasks rather than
real-time analytics.

25
Name of Student : Heetika Mahesh Vaity

Roll No : 57 LAB Assignment Number : 5

Title of LAB Assignment : To create a Pig Data Model, Read and Store
Data and Perform following Pig Operations,
1. Pig Latin Basic
2. Pig Data Types,
3. Download the data
4. Create your Script
5. Save and Execute the Script
6. Pig Operations : Diagnostic Operators, Grouping and Joining, Combining &
Splitting, Filtering, Sorting

DOP : 05/09/2024 DOS: 06/09/2024

CO MAPPED : CO4
PO MAPPED : PO1, PO2, PO3, PO4, PO5, PO7, PSO1, PSO2
FACULTY SIGNATURE :
MARKS :
Heetika Vaity BDAV Lab SYMCA/B-57

AIM: TO CREATE A PIG DATA MODEL, READ AND STORE DATA AND
PERFORM FOLLOWING PIG OPERATIONS:
1. PIG LATIN BASIC
2. PIG DATA TYPES,
3. DOWNLOAD THE DATA
4. CREATE YOUR SCRIPT
5. SAVE AND EXECUTE THE SCRIPT
6. PIG OPERATIONS : DIAGNOSTIC OPERATORS, GROUPING AND JOINING,
COMBINING & SPLITTING, FILTERING, SORTING

THEORY:

Introduction to Apache Pig

Apache Pig is a high-level platform for processing and analyzing large datasets. It provides
an abstraction over the complexities of MapReduce, allowing users to write data
transformation scripts using Pig Latin, a high-level language that makes data processing
simpler and more intuitive. Pig is often used in scenarios where large-scale data analysis is
required, such as in ETL (Extract, Transform, Load) processes, data preparation for
machine learning, and log data analysis.

Pig Data Model

The Pig Data Model represents the structure of data in Apache Pig. It consists of the
following elements:

1. Atom: The simplest data type in Pig, representing a single value, such as a number or
a string. Examples include int, long, float, double, chararray, and bytearray.
2. Tuple: A record that consists of a sequence of fields. Each field can be of any data
type, including another tuple or a bag.
3. Bag: An unordered collection of tuples. Bags can contain duplicate tuples and are
used to represent relations in Pig.
4. Map: A set of key-value pairs where keys are chararray and values can be any data
type. Maps are useful for semi-structured data like JSON.
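
As an illustrative sketch (the file name and field values below are made up), a single input line combining these types, such as

Asha;(Mumbai,400074);{(BDAV,85),(ML,78)};[age#23]

can be described in Pig with a schema that mixes atoms, a tuple, a bag, and a map:

grunt> students = LOAD 'students.txt' USING PigStorage(';') AS (name:chararray, address:tuple(city:chararray, pincode:chararray), results:bag{t:tuple(sub:chararray, marks:int)}, info:map[]);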

1
Heetika Vaity BDAV Lab SYMCA/B-57

Pig Latin Basics

Pig Latin is the language used to write scripts for data processing in Apache Pig. It supports
a wide range of operations, including loading data, filtering, grouping, joining, and more. Pig
Latin scripts are made up of a series of statements that describe a sequence of
transformations to be applied to the data.

Pig Data Types


Pig supports both scalar and complex data types:
● Scalar Types: Basic data types like int, long, float, double, chararray, and bytearray.
● Complex Types: Structures like Tuple, Bag, and Map, which allow for more complex
data representations.

Loading and Storing Data

Data in Pig is typically loaded from external storage, such as HDFS, using the LOAD
statement. The LOAD statement allows you to specify the file path, the storage format, and
the schema of the data. After processing, the transformed data can be stored back into
external storage using the STORE statement.

Example:
data = LOAD 'data.csv' USING PigStorage(',') AS (id:int, name:chararray, age:int);

Pig Operations

1. Diagnostic Operators:
○ DESCRIBE: Displays the schema of a relation.
○ DUMP: Outputs the content of a relation to the console.
2. Grouping and Joining:
○ Grouping: Groups records based on a common field.
○ Joining: Combines records from two or more relations based on a common
key.
3. Combining & Splitting:
○ Union: Combines the contents of two or more relations into one.
○ Split: Divides a relation into multiple parts based on specified conditions.
4. Filtering:
○ Filters data based on a condition, allowing only the records that meet the
condition to pass through.
5. Sorting:
○ Sorts the records in a relation based on one or more fields.
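
Steps 4 and 5 of the aim (create, save, and execute a script) simply collect statements like these in a .pig file instead of typing them at the grunt prompt. A minimal sketch under that assumption (file and relation names are illustrative), saved as student_report.pig and run with pig -x local student_report.pig:

students = LOAD 'sample_student_data.txt' USING PigStorage(',') AS (id:chararray, fname:chararray, lname:chararray, age:int, phone:chararray, city:chararray);
adults = FILTER students BY age > 22;
by_city = GROUP adults BY city;
counts = FOREACH by_city GENERATE group AS city, COUNT(adults) AS total;
sorted = ORDER counts BY total DESC;
STORE sorted INTO 'city_counts' USING PigStorage(',');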

2
Heetika Vaity BDAV Lab SYMCA/B-57

CODE/OUTPUT:

● Create file named “data_model” with following contents:

● Go to terminal and type


> pig -x local

3
Heetika Vaity BDAV Lab SYMCA/B-57

grunt> DataModels = LOAD 'data_model' USING PigStorage(';') AS (
name: chararray,
address: tuple(city: chararray, pincode: chararray),
result: bag{info: tuple(sub: chararray, marks: int)},
m: map[int]
);

grunt> dump DataModels;

4
Heetika Vaity BDAV Lab SYMCA/B-57

● Create file named “sample_student_data.txt” with following contents:

grunt> student_info = LOAD '/home/cloudera/Desktop/sample_student_data.txt'
USING PigStorage(',') AS (
id: chararray,
fname: chararray,
lname: chararray,
age: int,
phone: chararray,
city: chararray
);

grunt> dump student_info

5
Heetika Vaity BDAV Lab SYMCA/B-57

grunt> Store student_info into 'HeetikaSampleOutput' using PigStorage('|');

6
Heetika Vaity BDAV Lab SYMCA/B-57

7
Heetika Vaity BDAV Lab SYMCA/B-57

grunt> describe student_info

grunt> filterstudent= filter student_info BY age>22;


grunt> dump filterstudent;

8
Heetika Vaity BDAV Lab SYMCA/B-57

grunt> explain filterstudent;

9
Heetika Vaity BDAV Lab SYMCA/B-57

grunt> foreachstudent= foreach filterstudent generate id,fname,age,city


grunt> dump foreachstudent

10
Heetika Vaity BDAV Lab SYMCA/B-57

grunt> groupstudent= GROUP student_info by city;


grunt> dump groupstudent;

grunt> Illustrate foreachstudent;

11
Heetika Vaity BDAV Lab SYMCA/B-57

grunt> describe groupstudent

grunt> groupstudent2= group student_info by (city,age);


grunt> dump groupstudent2

grunt> groupstudent3 = group student_info all;


grunt> dump groupstudent3;

12
Heetika Vaity BDAV Lab SYMCA/B-57

● Create file named “interns_data.txt” with following contents:

grunt> InternData = LOAD '/home/cloudera/Desktop/interns_data.txt' USING
PigStorage(',') AS (id:chararray, fname:chararray, age:int, city:chararray);
grunt> dump InternData;

13
Heetika Vaity BDAV Lab SYMCA/B-57

grunt> CG= COGROUP student_info by age, InternData by age;


grunt> dump CG;

grunt> student_infoB = LOAD '/home/cloudera/Desktop/sample_student_data.txt'
USING PigStorage(',') AS (
id: chararray,
fname: chararray,
lname: chararray,
age: int,
phone: chararray,
city: chararray
);
grunt> dump student_infoB;
grunt> SelfJoin= Join student_info by age, student_infoB by age;
grunt> dump SelfJoin;

14
Heetika Vaity BDAV Lab SYMCA/B-57

15
Heetika Vaity BDAV Lab SYMCA/B-57

● INNER JOIN
grunt> InnerJoin= JOIN student_info by city, InternData by city;
grunt> dump InnerJoin;

● LEFT JOIN
grunt> LeftJoin= JOIN student_info by city LEFT, InternData by city;
grunt> dump LeftJoin;

16
Heetika Vaity BDAV Lab SYMCA/B-57

● RIGHT JOIN
grunt> RightJoin= JOIN student_info by city RIGHT OUTER, InternData by city;
grunt> dump RightJoin;

● FULL JOIN
grunt> FullJoin= JOIN student_info by city FULL OUTER, InternData by city;
grunt> dump FullJoin;

17
Heetika Vaity BDAV Lab SYMCA/B-57

● ORDER BY
grunt> B5= ORDER student_info by id desc;
grunt> dump B5;

grunt> B= order student_info by id asc;


grunt> dump B;

18
Heetika Vaity BDAV Lab SYMCA/B-57

● LIMIT
grunt> C = LIMIT B 3;
grunt> dump C;

● UNION
grunt> C = LIMIT B 3;
grunt> C1 = LIMIT B 5;
grunt> UnionData= union C, C1;
grunt> dump UnionData;

19
Heetika Vaity BDAV Lab SYMCA/B-57

grunt> split student_info into younger_students if age<23, older_students if age>=23;


grunt> dump student_info;

grunt> dump younger_students;

grunt> dump older_students;

20
Heetika Vaity BDAV Lab SYMCA/B-57

● FILTER
grunt> filter_city = FILTER student_info by city == 'Mumbai';
grunt> dump filter_city;

grunt> all_city = FOREACH student_info Generate city;


grunt> dump all_city;

21
Heetika Vaity BDAV Lab SYMCA/B-57

● DISTINCT
grunt> distinct_cities= Distinct all_city;
grunt> dump distinct_cities;

CONCLUSION:

Apache Pig provides a powerful yet simple way to process large datasets. By using Pig Latin,
users can perform complex data transformations without needing to write complex code.
The Pig Data Model allows for flexible and scalable data handling, making it suitable for a
wide range of data processing tasks. Understanding and applying the various operations
available in Pig enables efficient data analysis and processing in a distributed environment.

22
Name of Student : Heetika Mahesh Vaity

Roll No : 57 LAB Assignment Number : 6

Title of LAB Assignment : To run spark commands and functions:

1. Downloading Data Set and Processing it in Spark
2. Word Count in Apache Spark.
3. To study and implement a PageRank algorithm using PySpark.

DOP : 12/09/2024 DOS: 13/09/2024

CO MAPPED : CO5
PO MAPPED : PO1, PO2, PO3, PO4, PO5, PO7, PSO1, PSO2
FACULTY SIGNATURE :
MARKS :
Heetika Vaity BDAV Lab SYMCA/B-57

AIM: TO RUN SPARK COMMANDS AND FUNCTIONS:


1. DOWNLOADING DATA SET AND PROCESSING IT IN SPARK
2. WORD COUNT IN APACHE SPARK.

THEORY:
Apache Spark is an open-source, distributed computing system designed for large-scale
data processing. Developed at UC Berkeley’s AMPLab, Spark extends the MapReduce model
to provide faster and more comprehensive data processing capabilities. Its in-memory
computation and high-level APIs make it ideal for big data applications, offering a unified
engine that supports a wide range of data analytics tasks.

Key Features of Apache Spark:

1. Speed:
○ Spark can process data up to 100 times faster than Hadoop MapReduce,
primarily due to its in-memory processing capability.
○ It reduces the need for repeated disk read/writes by keeping intermediate
data in memory whenever possible.
2. Ease of Use:
○ Spark provides high-level APIs in several programming languages, including
Java, Python, Scala, and R.
○ The DataFrame API offers the ease of SQL-like querying, while RDDs
(Resilient Distributed Datasets) enable functional programming constructs.
3. Unified Engine:
○ Spark offers a unified platform that supports various big data processing
tasks such as batch processing, stream processing, machine learning, and
graph processing.
○ Its core components include Spark SQL, Spark Streaming, MLlib (for machine
learning), and GraphX (for graph processing).
4. Fault Tolerance:
○ Spark achieves fault tolerance through lineage, meaning that lost data can be
recomputed based on its transformation history.
○ Spark’s RDDs maintain the transformations applied to data, so even if nodes
fail, the system can rebuild the lost partitions automatically.

Components of Apache Spark:

1. Spark Core:
○ The Spark Core is responsible for basic data processing, including memory
management, task scheduling, fault recovery, and interacting with storage
systems like HDFS, S3, and others.
○ RDDs form the core abstraction, enabling fault-tolerant, distributed data
storage.

1
Heetika Vaity BDAV Lab SYMCA/B-57

2. Spark SQL:
○ Spark SQL allows users to run SQL queries and interact with structured data
through DataFrames.
○ It also supports a variety of data sources such as Hive, Parquet, ORC, and
JSON.
3. Spark Streaming:
○ Spark Streaming enables real-time data processing and analytics by breaking
incoming data into mini-batches and processing them using Spark’s core API.
4. MLlib:
○ MLlib is Spark’s library for scalable machine learning algorithms, including
classification, regression, clustering, and collaborative filtering.
5. GraphX:
○ GraphX is the component that allows for the processing and analysis of
large-scale graph data, offering tools for graph-parallel computations.

Spark’s Programming Abstractions:

1. Resilient Distributed Datasets (RDDs):


○ RDDs are fault-tolerant collections of objects that are processed in parallel.
They are immutable and distributed across the cluster.
○ Transformations (e.g., map, filter) are lazily evaluated and create new RDDs,
while actions (e.g., collect, count) trigger the execution of the transformation
pipeline.
2. DataFrames and Datasets:
○ DataFrames are distributed collections of data organized into named
columns, offering SQL-like functionality.
○ Datasets provide both the type safety of RDDs and the optimization benefits
of DataFrames through Spark’s Catalyst optimizer.
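
A minimal sketch of the lazy transformation/action distinction described for RDDs above, as it would look in the Scala spark-shell (the values are purely illustrative):

scala> val nums = sc.parallelize(List(1, 2, 3, 4))
scala> val doubled = nums.map(_ * 2)   // transformation: only recorded, nothing executes yet
scala> doubled.count()                 // action: triggers the job and returns 4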

Spark’s Execution Model:

1. Driver and Executors:


○ The Driver program runs the main function and schedules tasks on a cluster
of worker nodes where Executors are deployed to perform actual
computation.
○ The Driver communicates with the cluster manager to request resources and
launch executors across the cluster.
2. Job and Task Execution:
○ Spark breaks down jobs into smaller tasks that are executed across worker
nodes in parallel. Each stage of the job depends on specific transformations
applied to the data.
○ The DAG (Directed Acyclic Graph) Scheduler ensures that tasks are executed
in the correct order, optimizing task parallelism and data locality.

2
Heetika Vaity BDAV Lab SYMCA/B-57

Use Cases of Apache Spark:

1. Data Processing: Spark is used for ETL tasks, large-scale data processing, and
real-time data analytics in industries like finance, healthcare, and e-commerce.
2. Machine Learning: Spark’s MLlib library is used for training machine learning
models at scale.
3. Graph Processing: GraphX enables large-scale graph analytics for social networks
and other graph-based data.

CODE/OUTPUT:

Open the Cloudera terminal, change the directory to the Desktop, create a text file, and put it into HDFS.
● [cloudera@quickstart ~]$ gedit heetika.txt

● [cloudera@quickstart ~]$ hdfs dfs -put heetika.txt


● [cloudera@quickstart ~]$ hdfs dfs -ls

Open spark shell

3
Heetika Vaity BDAV Lab SYMCA/B-57

● [cloudera@quickstart ~]$ spark-shell

● scala> sc.appName

Read the file:


● scala> val heetikaData = sc.textFile("heetika.txt")

Count number of lines:

4
Heetika Vaity BDAV Lab SYMCA/B-57

● scala> heetikaData.count()

Print first 2 lines:


● scala> for(line<- heetikaData.take(2)) println(line)

Change to uppercase:
● scala> val newHeetikaData = heetikaData.map(line => line.toUpperCase)

Print first 2 lines:


● scala> for(line <- newHeetikaData.take(2)) println(line)

Filter file:

5
Heetika Vaity BDAV Lab SYMCA/B-57

● scala> val filt_HeetikaData = heetikaData.filter(line => line.endsWith("."))

Print first 2 lines:


● scala> for(line <- filt_HeetikaData.take(2)) println(line)

List of numbers:
● scala> val heetDataNum = sc.parallelize(List(10,20,30))

Print the list:


● scala> heetDataNum.collect()

● scala> val heetMapFunc = heetDataNum.map(x=>x+10)

6
Heetika Vaity BDAV Lab SYMCA/B-57

● scala> heetMapFunc.collect()

Sequence of names:
● scala> val name = Seq("Heetika", "Vaity")

Change to lower case:


● scala> val result = name.map(_.toLowerCase)

● scala> name.flatMap(_.toLowerCase)

7
Heetika Vaity BDAV Lab SYMCA/B-57

Create another file on the desktop:


● [cloudera@quickstart ~]$ gedit heetika1.txt

● [cloudera@quickstart ~]$ hdfs dfs -put heetika1.txt


● [cloudera@quickstart ~]$ hdfs dfs -ls

Read the file:


● scala> val heetData = sc.textFile("heetika1.txt")

Print the data:

8
Heetika Vaity BDAV Lab SYMCA/B-57

● scala> heetData.collect()

Counts the number of lines:


● scala> heetData.count()

Split where there is space:


● scala> val split_heetData = heetData.flatMap(line=>line.split(" "))

Check the splitting:


● scala> split_heetData.collect()

● scala>val map_heetData = split_heetData.map(word=>(word,1))

9
Heetika Vaity BDAV Lab SYMCA/B-57

● scala> map_heetData.collect()

● scala> val reduce_heetData = map_heetData.reduceByKey(_+_)

● scala> reduce_heetData.collect()

To study and implement a Page Rank algorithm using PySpark.

10
Heetika Vaity BDAV Lab SYMCA/B-57

Kaggle dataset link: https://www.kaggle.com/pappukrjha/google-web-graph


I have renamed the file to 57B_web-Google.txt

Move the data into hadoop file system


● [cloudera@quickstart ~]$ hdfs dfs -put 57B_web-Google.txt

Start the PySpark terminal:


● [cloudera@quickstart ~]$ pyspark

Note: Write the code, press Ctrl+S, and then press Enter. Also use proper indentation.

11
Heetika Vaity BDAV Lab SYMCA/B-57

Writing compute contrib function

>>> def computeContribs(neighbors, rank):
...     for neighbor in neighbors: yield (neighbor, rank / len(neighbors))

Create a RDD named links with following command

12
Heetika Vaity BDAV Lab SYMCA/B-57

>>> links = sc.textFile('57B_web-Google.txt')\
...     .map(lambda line: line.split())\
...     .map(lambda pages: (pages[0], pages[1]))\
...     .distinct()\
...     .groupByKey()\
...     .persist()
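
Before the loop below can join ranks with links, ranks needs an initial value; this step is not visible in the captured output, so the line here is an assumed but typical initialization that starts every page with a rank of 1.0:

>>> ranks = links.map(lambda (page, neighbors): (page, 1.0))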

Create a loop in order to calculate contribs and ranks


>>> for x in xrange(10):
...     contribs = links\
...         .join(ranks)\
...         .flatMap(lambda (page, (neighbors, rank)): computeContribs(neighbors, rank))
...     ranks = contribs\
...         .reduceByKey(lambda v1, v2: v1 + v2)\
...         .map(lambda (page, contrib): (page, contrib * 0.85 + 0.15))

Code to collect all ranks (note: this command may take few minutes to run completely)

13
Heetika Vaity BDAV Lab SYMCA/B-57

>>> for rank in ranks.collect(): print rank

>>> ranks.take(5)

14
Heetika Vaity BDAV Lab SYMCA/B-57

>>> ranks.saveAsTextFile('page_ranks_output_57B')

15
Heetika Vaity BDAV Lab SYMCA/B-57

Note: press Ctrl Z to quit pyspark terminal

To check the saved file:

16
Heetika Vaity BDAV Lab SYMCA/B-57

● [cloudera@quickstart ~]$ hdfs dfs -ls

Content inside this saved file:


● [cloudera@quickstart ~]$ hdfs dfs -ls page_ranks_output_57B

● [cloudera@quickstart ~]$ hdfs dfs -cat page_ranks_output_57B/part-00000

17
Heetika Vaity BDAV Lab SYMCA/B-57

CONCLUSION:

Apache Spark is a versatile and powerful platform that simplifies big data processing by
providing a unified engine with in-memory computation capabilities. Its ability to handle
diverse workloads (batch, streaming, machine learning, and graph processing) on a single
platform makes it a go-to solution for large-scale data analytics.

18
Name of Student : Heetika Mahesh Vaity

Roll No : 57 LAB Assignment Number : 7

Title of LAB Assignment : To install and configure PowerBI for educational usage.

DOP : 30/09/2024 DOS: 05/10/2024

CO MAPPED : CO6
PO MAPPED : PO1, PO2, PO3, PO4, PO5, PO6, PO7, PO8, PSO1, PSO2
FACULTY SIGNATURE :
MARKS :
Heetika Vaity BDAV Lab SYMCA/B-57

AIM: TO INSTALL AND CONFIGURE POWER BI FOR EDUCATIONAL USAGE.

THEORY:

Description:
● Power BI is a powerful business analytics tool developed by Microsoft that allows
users to visualize data, create reports, and share insights across their organizations.
● It integrates with various data sources and provides features for creating interactive
dashboards, custom reports, and analytics.
● Power BI is a business intelligence and data visualization tool developed by
Microsoft that allows users to connect to various data sources, transform and model
the data, and create interactive reports and dashboards. It helps organizations make
informed decisions by providing insights from their data in a visually appealing and
easy-to-understand format.

Components of Power BI:


1. Power BI Desktop
● A Windows application used to create reports and data models.
● Allows users to pull data from various sources, clean it, and generate interactive
reports.
2. Power BI Service
● A cloud-based service where users can publish and share Power BI reports.
● It also allows for collaboration, sharing reports, and creating dashboards for
real-time analytics.
3. Power BI Mobile
● Mobile apps for iOS, Android, and Windows that allow users to view reports and
dashboards on the go.
4. Data Sources
● Power BI connects to hundreds of data sources, including SQL Server, Azure, Excel,
Google Analytics, Salesforce, and more.
● It supports real-time data streaming from connected data sources for live reporting.
5. Power Query
● A data connection and transformation tool within Power BI.
● Allows users to clean, transform, and load data from different sources in a
user-friendly interface.

6. Power BI DAX (Data Analysis Expressions)


● A formula language used for data modeling in Power BI.

1
Heetika Vaity BDAV Lab SYMCA/B-57

● It allows users to create custom calculations and measures within their reports.
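
As a small illustrative sketch of DAX (the table and column names here are assumptions, not taken from a specific dataset), a base measure and a measure built on top of it might look like:

Total Sales = SUM ( Sales[Total] )
Profit Margin % = DIVIDE ( [Total Sales] - SUM ( Sales[COGS] ), [Total Sales] ) * 100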
7. Power BI Dashboard
● A single-page visual interface that consolidates visualizations from different reports
into one view.
● Dashboards offer high-level insights with links to deeper reports for detailed
analysis.
8. Power BI Embedded
● An API service that allows developers to integrate Power BI reports and dashboards
into custom applications.
9. Row-Level Security (RLS)
● A feature that allows restricting data access for users based on roles, ensuring that
sensitive information is protected.
10. Sharing and Collaboration
● Power BI enables sharing reports within an organization or externally through
Power BI service and publishing to the web.
● It also integrates with Microsoft Teams for collaboration.

How Power BI is Used:


1. Data Connection: Power BI connects to multiple data sources such as Excel, SQL
Server, Google Analytics, Azure, APIs, cloud services, and more. Users can import or
connect live to data.
2. Data Preparation: Using Power Query, users can clean and transform
data—removing unnecessary columns, filtering rows, creating calculated columns,
and more.
3. Data Modeling: Users can create relationships between different data sets, build
hierarchies, and define custom measures using DAX (Data Analysis Expressions) for
deeper insights.
4. Visualization: Power BI provides various data visualization options like charts,
graphs, maps, KPIs, and tables. Users can drag and drop data fields into visual
components to generate interactive reports.
5. Reports and Dashboards: After creating visualizations, users can build reports with
multiple pages of data insights or compile the key insights into a single-page
dashboard for quick viewing.
6. Sharing and Collaboration: Once reports are created, they can be published to the
Power BI Service, where others in the organization can view, interact with, or
collaborate on the reports. Reports can also be embedded in websites or shared via
mobile apps.

2
Heetika Vaity BDAV Lab SYMCA/B-57

7. Real-Time Analytics: Power BI supports real-time data streaming, allowing dashboards and reports to be updated live with data from sources like sensors, social media, and APIs.

Use Cases:
● Business Analytics: Visualize sales trends, monitor key performance indicators
(KPIs), and track business growth.
● Finance: Generate reports on financial performance, budgeting, and forecasting.
● Marketing: Analyze campaign effectiveness, customer engagement, and website
traffic data.
● Supply Chain: Track inventory, optimize logistics, and monitor supplier
performance.
● Human Resources: Visualize employee data, analyze retention, and monitor
productivity metrics.

Why Power BI?


● Ease of Use: Its drag-and-drop interface makes creating complex reports and
dashboards easy, even for non-technical users.
● Integration: Seamlessly integrates with other Microsoft products like Excel, Azure,
and SharePoint.
● Cost-Effective: Power BI Desktop is free, and the cloud service offers affordable
subscription options.
● Customization: Highly customizable dashboards and reports to meet the specific
needs of organizations.

3
Heetika Vaity BDAV Lab SYMCA/B-57

Steps for Installation and Configuration:

1. Download Power BI Desktop

Power BI Desktop is a free application that allows users to create reports and dashboards.
Steps:
● Visit the official Power BI download page.

4
Heetika Vaity BDAV Lab SYMCA/B-57

● Click on the “Download free” button under Power BI Desktop.

2. Install the Power BI Desktop Application

After downloading Power BI Desktop:


● Run the installer and follow the instructions.

5
Heetika Vaity BDAV Lab SYMCA/B-57

3. Configure Power BI Desktop

After installation, you'll need to configure Power BI Desktop to connect to data sources and
set up workspaces for educational purposes.
Steps:
● Launch Power BI Desktop.

CONCLUSION:
Power BI transforms raw data into actionable insights through visually appealing reports
and dashboards, allowing organizations to make better data-driven decisions.

6
Name of Student : Heetika Mahesh Vaity

Roll No : 57 LAB Assignment Number : 8

Title of LAB Assignment : To learn various Data Preprocessing activities in Power BI.

DOP : 30/09/2024 DOS: 07/10/2024

CO MAPPED : CO6
PO MAPPED : PO1, PO2, PO3, PO4, PO5, PO6, PO7, PO8, PSO1, PSO2
FACULTY SIGNATURE :
MARKS :
Heetika Vaity BDAV Lab SYMCA/B-57

AIM: TO LEARN VARIOUS DATA PREPROCESSING ACTIVITIES IN POWER BI.

THEORY:

Power BI

Power BI is a business analytics tool developed by Microsoft that allows users to visualize
data, share insights, and create interactive reports and dashboards. It is widely used by
organizations for data-driven decision-making, enabling users to turn raw data into
meaningful insights. Power BI offers various services and components that cater to
different levels of users—from data analysts and business professionals to developers.

Key Features of Power BI

1. Data Connectivity:
○ Power BI can connect to a wide range of data sources, including Excel, SQL
Server, Azure, Oracle, web APIs, and many more. This allows users to
aggregate and analyze data from different systems in one place.
2. Interactive Visualizations:
○ Power BI offers a variety of built-in visualizations (such as charts, graphs,
tables, and maps) to help users create interactive and dynamic reports. Users
can also use custom visuals from Power BI’s marketplace.
○ Visuals are highly customizable, allowing users to change colors, labels, axes,
and more to match their analytical needs.
3. Power Query Editor:
○ This is where users can perform data cleaning and transformation tasks. It
allows for tasks such as removing duplicates, renaming columns, splitting
columns, and combining data from multiple sources. Power Query makes
preprocessing data user-friendly with a simple, step-by-step process.
4. Data Analysis Expressions (DAX):
○ DAX is a formula language used in Power BI for creating custom calculations,
measures, and columns. It is similar to Excel formulas but is more powerful
and optimized for relational data.
○ With DAX, users can perform complex calculations, such as calculating
growth rates, creating running totals, or filtering data for specific time
periods.
5. Power BI Desktop:

1
Heetika Vaity BDAV Lab SYMCA/B-57

○ Power BI Desktop is a free application for PCs that enables users to connect
to data, transform it, and build reports with visuals. It is the primary tool for
designing and creating reports before they are published and shared.
6. Power BI Service (Cloud):
○ Power BI Service is a cloud-based platform where users can publish, share,
and collaborate on reports and dashboards. Users can access their data and
reports from anywhere via the web or mobile devices.
○ With the Power BI Service, users can set up automatic data refreshes,
ensuring that their reports always show up-to-date information.
7. Power BI Mobile:
○ Power BI has mobile apps available for iOS, Android, and Windows devices.
This allows users to access their reports and dashboards on the go, ensuring
they can monitor key metrics from anywhere.
8. Power BI Pro:
○ Power BI Pro is a paid subscription that adds collaboration features to Power
BI. It allows users to share reports and dashboards with others, collaborate in
workspaces, and distribute content across the organization.
9. Power BI Premium:
○ This subscription provides additional capabilities, including larger storage,
higher data refresh rates, and the ability to share reports with users who
don’t have a Power BI Pro license. It also includes features like AI
capabilities and Paginated Reports for highly detailed report layouts.
10. Real-Time Analytics:
○ Power BI allows for real-time data updates, where data streams from sources
like IoT devices or live databases can be used to provide up-to-the-minute
insights on dashboards.
11. Security and Data Governance:
○ Power BI includes security features such as row-level security (RLS), which
restricts access to specific data based on user roles. This ensures that
sensitive information is only visible to authorized users.
○ Integration with Azure Active Directory (AAD) allows organizations to
control access and apply security measures.
12. Natural Language Querying (Q&A):
○ Power BI offers a Q&A feature where users can ask questions in plain
language (e.g., "What were the total sales last year?"), and Power BI will
generate a relevant visual or answer.
13. AI and Machine Learning:

2
Heetika Vaity BDAV Lab SYMCA/B-57

○ Power BI integrates AI features such as text analytics, sentiment analysis, and Azure machine learning models to enhance data insights. It also allows users to create custom AI models directly within Power BI.

Use Cases of Power BI

● Financial Reporting: Create detailed financial reports with automated data updates, allowing finance teams to monitor metrics like revenue, expenses, and profits in real time.
● Sales and Marketing Analytics: Track key performance indicators (KPIs), such as
sales growth, lead generation, and customer engagement, to optimize marketing
campaigns.
● Operations Monitoring: Power BI can track and report operational data from
manufacturing processes, supply chain activities, and logistics in real time.
● Human Resources: HR departments can analyze employee data, track recruitment
metrics, and monitor diversity and inclusion initiatives using Power BI dashboards.

Data Preprocessing Activities in Power BI

Data Preprocessing in Power BI involves a series of activities that transform raw data into a
clean, organized, and usable format, ready for analysis. It is a critical step to ensure data
quality and integrity, which improves the accuracy and reliability of insights generated by
reports or dashboards. Power BI provides several built-in tools and features, primarily in
Power Query Editor, that enable users to perform various preprocessing tasks.

Key Data Preprocessing Activities in Power BI:

1. Data Importing: Power BI supports data import from multiple sources such as Excel,
SQL databases, web services, CSV files, and more. During the import phase, it's
important to connect to the correct data source and understand the structure of the
data.
2. Data Cleaning:
○ Removing Duplicates: Identifying and eliminating duplicate rows to prevent
redundant records.
○ Handling Missing Values: Replacing missing values with a default value,
mean, or interpolating them.
○ Removing Unwanted Columns: Dropping unnecessary columns that do not
contribute to the analysis.
3. Data Transformation:

3
Heetika Vaity BDAV Lab SYMCA/B-57

○ Data Type Conversion: Ensuring that all columns have the appropriate data
types (e.g., dates, numbers, text) for correct processing.
○ Splitting and Merging Columns: Splitting columns into multiple parts (e.g.,
full name into first and last name) or merging columns into one (e.g.,
combining city and state into a single location column).
○ Filtering Data: Removing unnecessary rows based on conditions to reduce
noise and focus on relevant data.
4. Data Aggregation:
○ Summarizing data (e.g., total sales by month, average transaction value) to
create meaningful metrics.
○ Grouping data by categories (e.g., region, product category) to identify trends
and patterns.
5. Data Normalization:
○ Standardizing data values to ensure consistency, especially when dealing
with data from multiple sources (e.g., date formats, currency symbols).
6. Creating Calculated Columns and Measures:
○ Power BI allows users to create custom calculations using DAX (Data Analysis
Expressions), which can generate new columns or measures based on
existing data (e.g., profit margin, sales growth rate).
7. Data Merging and Joining:
○ Combining multiple datasets through joins (inner, outer, left, right) or merges
to create a single, comprehensive data table for analysis.
8. Data Validation:
○ Verifying the accuracy and consistency of the data by checking for outliers,
invalid data entries, or inconsistencies in formats and categories.

4
Heetika Vaity BDAV Lab SYMCA/B-57

DATASET:

Name: Supermarket Sales

Link: https://www.kaggle.com/datasets/aungpyaeap/supermarket-sales

Context: The growth of supermarkets in highly populated cities is increasing, and market competition is high. The dataset contains the historical sales of a supermarket company, recorded across 3 branches over a 3-month period. Predictive data analytics methods are easy to apply to this dataset.

Attribute information

● Invoice id: Computer generated sales slip invoice identification number


● Branch: Branch of supercenter (3 branches are available identified by A, B and C).
● City: Location of supercenters
● Customer type: Type of customers, recorded by Members for customers using
member card and Normal for without member card.
● Gender: Gender type of customer
● Product line: General item categorization groups - Electronic accessories, Fashion
accessories, Food and beverages, Health and beauty, Home and lifestyle, Sports and
travel
● Unit price: Price of each product in $
● Quantity: Number of products purchased by customer
● Tax: 5% tax fee for customer buying
● Total: Total price including tax
● Date: Date of purchase (Record available from January 2019 to March 2019)
● Time: Purchase time (10am to 9pm)
● Payment: Payment used by customer for purchase (3 methods are available – Cash,
Credit card and Ewallet)
● COGS: Cost of goods sold
● Gross margin percentage: Gross margin percentage
● Gross income: Gross income
● Rating: Customer stratification rating on their overall shopping experience (On a
scale of 1 to 10)

5
Heetika Vaity BDAV Lab SYMCA/B-57

CODE/STEPS:

1. Importing Data
a. Data Sources: Power BI allows you to import data from various sources like
Excel, SQL, Web, etc.
b. Steps:
i. Open Power BI Desktop.
ii. Go to the Home tab.

iii. Click Get Data and select your data source.

6
Heetika Vaity BDAV Lab SYMCA/B-57

iv. After connecting, load the data into Power BI.

7
Heetika Vaity BDAV Lab SYMCA/B-57

2. Data Preprocessing Activities


Data preprocessing involves cleaning, transforming, and preparing data for analysis.
Below are key steps for preprocessing data.
a. Basic Transformations
Steps:
i. In the Home tab, click Transform Data to open the Power Query Editor.

ii. Use options like:

8
Heetika Vaity BDAV Lab SYMCA/B-57

● Remove Unnecessary Rows/columns: Remove empty or duplicate rows.

Removed Invoice-ID
Before:

After:

● Change Data Types: Change the data types (e.g., Text, Number, Date).

9
Heetika Vaity BDAV Lab SYMCA/B-57

We can change the datatype of the specific columns if we want. Here we change unit price
to fixed decimal number.

b. Dealing with Text Tools

10
Heetika Vaity BDAV Lab SYMCA/B-57

Steps:
i. Select a column with text data.
ii. Use Transform Tab options such as:

● Format Column:
This option allows you to apply transformations to text data, such as converting text to
uppercase, lowercase, or capitalizing each word.
Go to Transform > Format

Before:

After:

11
Heetika Vaity BDAV Lab SYMCA/B-57

● Replace Values: Replace specific values (e.g., replacing "null" with blank).
Right-click your column. —> Choose Replace Values
Before:

After:

c. Dealing with Unwanted Columns and Null Values

12
Heetika Vaity BDAV Lab SYMCA/B-57

i. Removing Unwanted Columns:


In the Power Query Editor, select the columns you don’t need→ Right-click and choose
Remove Columns.
Before:

After:

ii. Handling Null Values:


Use the Replace Values option to replace null values with meaningful data, or filter them
out using Remove Rows > Remove Empty Rows.

d. Dealing with Numerical Tools

13
Heetika Vaity BDAV Lab SYMCA/B-57

Steps:
i. Select a numeric column in the Power Query Editor.
ii. Use Transform Tab options:
1. Statistics: Calculate statistics like mean, median, standard
deviation, etc.
Before:

After:

2. Rounding: Round numeric values to a specified number of decimal places.

14
Heetika Vaity BDAV Lab SYMCA/B-57

● Go to Transform > Round > Round Up or Round Down.


Before:

After:

15
Heetika Vaity BDAV Lab SYMCA/B-57

e. Dealing with Date and Time


Steps:
i. Select a date/time column.
ii. Use Date and Time Tools under the Transform Tab:
1. Extract: Extract specific parts like Year, Month, Day, Hour,
Minute, etc.

Go to the Add Column tab:


● Extract Month: Use Date > Month to extract the month.
Before:

After:

16
Heetika Vaity BDAV Lab SYMCA/B-57

f. Adding Conditional Columns


Steps:
i. In the Add Column tab, select Conditional Column.
ii. Define conditions using if-else logic (e.g., if a sales figure is above a
certain amount, label it as "High", else "Low").
Go to the Add Column tab.
Select Conditional Column.

Example:
● If Total > 5000, then “High”
● If Total > 1000, then “Medium”
● Else, label it as “Low”.

Before:

17
Heetika Vaity BDAV Lab SYMCA/B-57

After:

Renamed Total to Sales..
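
Behind this dialog, Power Query records an equivalent step in the M language. A sketch of roughly what it generates for the conditions above (the previous step name #"Changed Type" and the new column name are illustrative and may differ in your query):

= Table.AddColumn(#"Changed Type", "Sales Level", each if [Total] > 5000 then "High" else if [Total] > 1000 then "Medium" else "Low")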

3. Analyzing with Charts

18
Heetika Vaity BDAV Lab SYMCA/B-57

Once the data is preprocessed, you can begin creating visualizations.


Steps:
a. Go to the Report view in Power BI.
b. Select the chart type (bar, line, pie, etc.) from the Visualizations Pane.
c. Drag and drop fields to the Values, Axis, and Legend areas.
d. Customize the chart by modifying axes, adding titles, adjusting colors, and more.
Bar Chart: Show total Sales by Month.

Line Chart: Visualize the trend of Sales over Order Date.

Pie Chart: Show the distribution of Sales by Product categories.

19
Heetika Vaity BDAV Lab SYMCA/B-57

4. Creating Dashboards
A dashboard is a collection of charts and visuals that give insights at a glance.
Steps:
a. Combine different visualizations (charts, tables, maps) into one page.
b. Arrange them in a layout that tells a clear story.
c. Use Slicers or Filters to allow dynamic changes to the dashboard.

Dashboard

5. Telling Stories with Power BI

20
Heetika Vaity BDAV Lab SYMCA/B-57

Sales Performance Overview: The sales have shown a steady trend with significant
contributions from the Fashion Accessories and Food and Beverages categories. Payment
preferences are diverse, but Cash leads slightly over E-wallet and Credit card.
Interestingly, there is a relatively balanced split between Members and Non-Members,
with Non-Members contributing marginally higher to total sales. Sales spikes are noticeable
on certain days of the month, suggesting possible promotional activities or market demand
shifts.

Chart Insights:

● Bar Chart (Sum of Sales by Month Name):


○ January leads in total sales, followed by March and February, indicating that
Q1 might be a strong period for sales.
○ The monthly distribution suggests some consistency in sales across the three
months, though January outperforms the others slightly.
● Pie Chart (Sum of Sales by Product Line):
○ The chart reveals a fairly even distribution across product lines, with Fashion
Accessories and Food and Beverages leading.
○ Other categories such as Sports and Travel and Health and Beauty also
contribute meaningfully, indicating broad-based product demand.
● Line Chart (Sum of Sales by Day):
○ This line chart reveals daily sales fluctuations, with several peaks around the
middle of the month. This might correspond to specific events or promotions
driving higher sales on certain days.
○ Noticing patterns in sales behavior by day can help optimize stock and
promotional strategies.
● Bar Chart (Sum of Quantity by Product Line):
○ Health and Beauty and Electronic Accessories show higher quantities sold,
suggesting these categories might offer competitive pricing or frequently
replenished products.
○ The comparison highlights which product lines see high volume sales,
helping with stock management and forecasting.
● Donut Chart (Sum of Sales by Customer Type):
○ A nearly equal split between Members and Non-Members, with
Non-Members contributing slightly more, indicates that the business serves
both one-time customers and loyal members well.
○ It could suggest an opportunity to introduce loyalty programs to drive more
member sales.
● Pie Chart (Sum of Sales by Payment):

21
Heetika Vaity BDAV Lab SYMCA/B-57

○ Cash is the most preferred payment method, followed by E-wallet and Credit card.
○ Understanding customer payment preferences can aid in optimizing
checkout processes and offering relevant payment options.

Key Insights:

➢ The dominance of Fashion Accessories and Food and Beverages highlights strong
customer demand for these categories, making them focal points for future sales
strategies.
➢ The daily sales fluctuations suggest that certain days may benefit from targeted
promotions to smooth out sales dips, while peak days may reveal opportunities for
deeper engagement.
➢ Non-Members slightly outperform Members, indicating potential to boost
member-driven sales through exclusive offers or loyalty programs.

6. Final Steps: Publish and Share


Steps:
a. Click Publish from Power BI Desktop to publish the report to Power BI
Service.
b. Share the report and dashboard with others through the Power BI Service.

CONCLUSION:

22
Heetika Vaity BDAV Lab SYMCA/B-57

Data preprocessing in Power BI is a fundamental step that ensures the data is in a clean,
organized, and ready-to-use format for effective analysis. Proper preprocessing
activities—such as cleaning, transforming, aggregating, and validating data—lead to higher
data quality and more accurate insights. Power BI's Power Query Editor provides a robust
environment for these tasks, allowing users to perform complex data transformations
without writing code. This functionality ensures that even non-technical users can
preprocess data efficiently, enabling them to generate meaningful reports and dashboards
that drive data-driven decision-making.

23
Name of Student : Heetika Mahesh Vaity

Roll No : 57 LAB Assignment Number : 9

Title of LAB Assignment : To learn handling of tables and Queries in Power BI

DOP : 30/09/2024 DOS: 14/10/2024

CO MAPPED : CO6
PO MAPPED : PO1, PO2, PO3, PO4, PO5, PO6, PO7, PO8, PSO1, PSO2
FACULTY SIGNATURE :
MARKS :
Heetika Vaity BDAV Lab SYMCA/B-57

AIM: TO LEARN HANDLING OF TABLES AND QUERIES IN POWER BI

THEORY:

Power BI is a powerful business intelligence tool that enables users to extract, transform,
and analyze data from various sources. Learning how to handle tables and queries is crucial
for making data-driven decisions, and it involves a few key concepts and operations.

1. Tables in Power BI

● Definition: A table in Power BI is a structured set of data consisting of rows and columns, where each row represents a record (e.g., a sales transaction) and each column represents an attribute or field (e.g., Invoice ID, Customer Type, Total).
● Sources of Tables: Tables can come from different data sources such as Excel,
databases (e.g., SQL Server), web data, and APIs. In Power BI, these tables are loaded
into the Data Model for analysis.

Operations on Tables:

● Loading Tables: Power BI can import data from different sources. You use the Get
Data feature to connect to data sources and load tables.
● Relationships between Tables: When working with multiple tables, relationships
define how data is related. This allows Power BI to perform calculations and
aggregations across multiple tables. For example, linking a Customer ID from a
Sales table to a Customers table lets you analyze sales based on customer
attributes like gender or city.
● Table Formats and Data Types: Tables in Power BI consist of columns, each having
a specific data type such as Text, Number, or Date. Correct formatting ensures
accurate calculations and visualizations.

2. Power Query Editor

● Purpose: Power Query Editor is used to clean, transform, and reshape data before
loading it into Power BI. It allows users to write queries without knowing SQL or
other programming languages.
● Key Functions:
○ Transformations: These include renaming columns, filtering data, changing
data types, and removing duplicate rows.
○ Combining Data: Power Query enables merging or appending data from
different tables or queries.

1
Heetika Vaity BDAV Lab SYMCA/B-57

○ Data Cleaning: Users can remove unwanted columns, filter unnecessary rows, split columns, or change text cases to clean data.
● No Impact on Source Data: Power Query performs transformations on data within
Power BI without altering the original source files.

Common Query Operations:

● Renaming Columns: Helps in giving meaningful names to columns.


● Filtering Data: Allows you to remove rows that do not meet specific criteria, such as
filtering out incomplete records or unwanted categories.
● Changing Data Types: Correct data types (e.g., text, number, date) must be applied
to columns for proper analysis.
● Grouping Data: Data can be grouped to calculate summary statistics (e.g., sum of
sales by branch).

3. Merging and Appending Queries

● Merging Queries: This operation is similar to a SQL JOIN. It combines data from
two tables based on a common column (such as Invoice ID or Customer ID).
This is useful when you want to combine related information stored in different
tables. For example, you might merge a Sales table with a Products table to
analyze product-level sales.
● Appending Queries: This operation stacks tables vertically, increasing the number
of rows. It’s used when you have data split across multiple tables with the same
structure. For example, if sales data is stored in separate tables for different
branches, you can append them to create a single dataset for analysis.

4. Working with Columns

● Adding Custom Columns: Power BI allows users to add calculated columns using
expressions in DAX (Data Analysis Expressions). For instance, you can create a new
column for the total sales amount by multiplying Quantity by Unit Price.
● Removing and Reordering Columns: Unnecessary columns can be removed to
streamline the data model and improve performance. Reordering helps in
organizing columns logically.
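
For example, the total line amount described above could be sketched as a DAX calculated column like the one below (the table and column names are assumed from the classic models order-details data and may differ in your copy):

Total Amount = OrderDetails[quantityOrdered] * OrderDetails[priceEach]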

5. Advanced Query Features

2
Heetika Vaity BDAV Lab SYMCA/B-57

● Conditional Columns: These allow users to create new columns based on conditions. For example, you might create a High Value column that labels sales transactions as high if the total value exceeds a certain threshold.
● Applied Steps: Power Query keeps track of each transformation step in the form of
"applied steps." Users can edit, reorder, or remove steps to change how data is
processed.

6. Performance Considerations

● Efficient Queries: Using filters and transformations smartly reduces the amount of
data being loaded, which can improve the performance of reports and dashboards.
For example, instead of loading entire datasets, apply filters to load only relevant
data (e.g., filtering for data from a specific year or branch).
● Query Folding: Power BI attempts to push as many transformations as possible
back to the data source (especially in relational databases), ensuring that operations
are performed at the source level, which enhances performance.

7. Data Loading and Refresh

● After transforming data in Power Query, it is loaded into the Power BI Data Model.
Data can be refreshed to reflect updates in the source data without redoing the
transformations.
● You can set scheduled refreshes in Power BI Service to automatically update the data
at defined intervals.

Dataset Link:

https://github.com/Ayushi0214/Datasets/blob/main/classic_models_dataset.zip

CODE/ STEPS:

3
Heetika Vaity BDAV Lab SYMCA/B-57

Load Dataset:

Browse your folder path:

4
Heetika Vaity BDAV Lab SYMCA/B-57

Now add each table as a new query by right clicking on binary

5
Heetika Vaity BDAV Lab SYMCA/B-57

Put the table name here and do the same for meaning datasets:

1. Merge Queries and Append Queries

6
Heetika Vaity BDAV Lab SYMCA/B-57

Merge Queries: Merging queries is equivalent to performing SQL joins. It combines columns from two tables based on a shared key or column. This is useful when you need information from both tables in a single table for analysis, like merging customer details with transaction data based on Customer Type.

➢ Merging the Products and Order Details tables based on the common Product Code

➢ In the Home Tab, click on “Merge Queries”

➢ In the dialog box, select the Product table and the matching columns, then select the type of join (here: Inner Join).
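
The same merge can also be expressed directly in M. A sketch of the kind of step Power Query generates for an inner join on the product code (the table, key, and new column names are illustrative):

= Table.NestedJoin(OrderDetails, {"productCode"}, Products, {"productCode"}, "Products", JoinKind.Inner)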

7
Heetika Vaity BDAV Lab SYMCA/B-57

➢ Extracting only “Buy Price” from Products Table

➢ Calculating margin from Buy Price and quantity by creating custom column

8
Heetika Vaity BDAV Lab SYMCA/B-57

➢ Calculating Profit

Append Queries: This operation is similar to a SQL UNION. It stacks rows from two or
more tables on top of each other. Append is used when the structure of both tables is the
same, but the data might represent different periods or entities (e.g., sales data from
different branches or time periods).

➢ Load new Dataset “World Happiness Report” to perform Append .

9
Heetika Vaity BDAV Lab SYMCA/B-57

➢ In Home Tab , click on Append Queries

➢ Select Table to append and Click OK.

10
Heetika Vaity BDAV Lab SYMCA/B-57

2. Column Formats

Column formatting refers to defining how data is displayed and processed. Different types
of data require different formats to ensure proper analysis:

● Text Format: Used for categorical data like Invoice ID, Customer Type, or Branch.
Text data is treated as labels and is not suitable for calculations.
● Numeric Format: Used for values that require mathematical operations. Examples
include Quantity, Unit Price, Total, Gross Income, etc.
● Date Format: For time-related data like Date and Time, which allows the creation of
time-based analyses like trends and comparisons.
● Proper formatting ensures correct data aggregation and calculations in reports and
visualizations.

➢ Changing the CreditLimit format from Whole Number to Fixed Decimal Number in the Customers table

11
Heetika Vaity BDAV Lab SYMCA/B-57

➢ Changing US Date type to Indian Date Format

12
Heetika Vaity BDAV Lab SYMCA/B-57

➢ Result:

3. Creating a Table

Creating a table in Power BI can be done by importing data or manually entering data.
Manually created tables are often used for reference purposes, like a lookup table
containing Product Line or Branch details. This helps in organizing and categorizing data,
making it easier to perform analyses like product-line-wise or branch-wise sales
performance.

➢ In the Home Tab, click on “Enter Data”

13
Heetika Vaity BDAV Lab SYMCA/B-57

➢ Add columns and rows

4. Pivoting and Unpivoting Data

Pivoting: Pivoting transforms data from rows into columns, allowing you to summarize and
aggregate the data in a different structure. For example, pivoting sales data by Branch
might transform rows of Branch A, B, and C into separate columns, making it easier to
compare sales performance across branches.

Unpivoting: Unpivoting is the opposite of pivoting. It converts columns into rows, which
can help when analyzing multiple variables that were originally separate. For example,
unpivoting Sales by year (where each year is a separate column) would make all sales
records appear as rows, making time-based analysis easier.

➢ Suppose we have data like this and we are asked to convert this table to vertical
table:

14
Heetika Vaity BDAV Lab SYMCA/B-57

➢ Go to Transform > Unpivot other columns

➢ Now click on column and click on “Pivot Column” in Transform Tab

15
Heetika Vaity BDAV Lab SYMCA/B-57

➢ Select value in drop down

5. Data Model and Importance of Data Modeling

Data Model: A data model in Power BI defines how different tables relate to each other.
This is essential for combining data from multiple tables into meaningful reports. The data
model is created by establishing relationships between tables based on shared columns
(called keys).

Importance of Data Modeling: A well-structured data model ensures accurate results, improves query performance, and simplifies data analysis. It enables slicing and dicing data in multiple ways (e.g., filtering sales by product category or customer type).

➢ Go to Model View

16
Heetika Vaity BDAV Lab SYMCA/B-57

➢ Since the Orders table and Customers table have a common column
“CustomerNumber”. We can drag this column from Orders and put it in Customers
table and it will create a relationship

17
Heetika Vaity BDAV Lab SYMCA/B-57

6. Managing Data Relationships

Relationships: Power BI enables you to create and manage relationships between tables.
Relationships define how data in different tables connects, allowing Power BI to perform
calculations and aggregations across them. Common relationship types are:

● One-to-Many: One record in a table (e.g., a Product ID in a product table) is related to multiple records in another table (e.g., multiple sales transactions involving that product).
● Many-to-One: The inverse of one-to-many.
● Many-to-Many: Allows for multiple matching records on both sides of the
relationship (e.g., multiple branches and multiple cities in a geographical analysis).

➢ This is a One to Many relationship between Customers and Orders table. As one
customer can have multiple orders

18
Heetika Vaity BDAV Lab SYMCA/B-57

7. Cardinality and Cross-Filter Direction

● Cardinality: Refers to the nature of the relationship between two tables. Power BI
supports three types of cardinality:
○ One-to-Many: One value in a column corresponds to many values in another
column. This is the most common relationship type.
○ Many-to-One: Reverse of one-to-many, where multiple records in one table
are linked to a single record in another.
○ Many-to-Many: Both tables can have multiple matching records, commonly
used when there is no single unique key.
● Cross-Filter Direction: This determines how filters applied to one table affect the
related table. There are two types:
○ Single Direction: Filters flow in one direction only. This is useful when one
table is dependent on another (e.g., filtering products by product category).

○ Both Directions: Filters flow in both directions, which means a change in one table can filter data in both related tables. This is useful for complex models where both tables need to influence each other.


CONCLUSION:

Handling tables and queries in Power BI involves various tasks like loading tables from data
sources, transforming data using Power Query, combining and merging tables, and
managing relationships between them. These operations are crucial for building an
optimized and accurate data model, which serves as the foundation for generating insights
and building reports.

Name of Student : Heetika Mahesh Vaity

Roll No : 57 LAB Assignment Number : 10

Title of LAB Assignment : To learn Data Visualization and dashboard


creation in Power BI

DOP : 30/09/2024 DOS: 14/10/2024

CO MAPPED : PO MAPPED: FACULTY MARKS:

CO6 PO1, PO2, PO3, SIGNATURE:


PO4, PO5, PO6,
PO7, PO8,
PSO1 , PSO2

AIM: TO LEARN DATA VISUALIZATION AND DASHBOARD CREATION IN


POWER BI

THEORY:

1. Understanding Data Visualization

Data visualization is the graphical representation of data to make it easier for users to
understand trends, patterns, and insights. Effective visualization simplifies complex data
sets into charts, graphs, and maps, allowing for better decision-making. Power BI is a
powerful tool for creating such visualizations with a user-friendly interface and numerous
features that enable users to analyze data interactively.

2. Power BI Overview

Power BI is a business analytics tool developed by Microsoft for creating visual reports,
dashboards, and data models. It allows users to connect to various data sources, transform
raw data, and visualize it in a meaningful way.

Key components of Power BI include:

● Power BI Desktop: The primary environment for designing reports and dashboards.
● Power BI Service: The cloud-based platform where reports and dashboards can be
shared and accessed.
● Power BI Mobile: The mobile app for viewing reports on the go.

3. Data Preparation and Transformation

Before visualizing data, it must be clean and well-organized. Power BI offers tools for:

● Data Importation: Power BI can connect to multiple data sources like Excel, SQL
databases, Azure, APIs, and more.
● Data Transformation (Power Query): Users can clean, filter, reshape, and
aggregate data using Power Query, which is built into Power BI. Operations include
removing duplicates, merging tables, changing data types, and more.

4. Basic Visualization Techniques

Power BI provides several visual elements to choose from, such as:


● Bar and Column Charts: Used to compare values across categories.


● Line Charts: Display trends over time.
● Pie and Donut Charts: Show parts of a whole.
● Scatter Plots: Represent correlations between variables.
● Maps (Choropleth and Filled Maps): Geographical data representation.
● Tables and Matrices: Present data in a structured format.

Each chart has its own use cases based on the nature of the data and the insights you want
to derive.

5. Creating Dashboards

Dashboards are a collection of visualizations combined into a single view to provide insights at a glance. In Power BI, you can pin visualizations from different reports onto a dashboard. Key steps include:

● Report Design: Start by creating multiple visualizations in Power BI Desktop.


● Interactivity: Power BI allows users to filter and interact with reports, such as
applying slicers or drill-downs. These features make the reports dynamic and allow
for better data exploration.
● Pinning to Dashboard: Once the report is built, specific visuals can be pinned to
dashboards in the Power BI service. Dashboards give a high-level view and allow
users to monitor key metrics in real-time.

6. Key Features for Dashboard Creation

● Slicers and Filters: Allow users to filter data directly on the dashboard, improving
interactivity.
● Drill-through and Drill-down: These enable users to explore data at different
levels of granularity.
● Bookmarks and Selections: Help in navigating between different views within a
report.
● Custom Visuals: Power BI allows the import of third-party visuals to meet specific
needs.

7. DAX (Data Analysis Expressions)

DAX is the formula language in Power BI used to perform calculations and create custom
measures. Understanding DAX is crucial for advanced data modeling and creating
meaningful KPIs. It enables operations like:


● Creating calculated columns and measures.


● Aggregating data (SUM, AVERAGE, COUNT).
● Time intelligence (e.g., year-over-year growth, quarter-to-date).

8. Power BI Best Practices for Visualization

To create effective dashboards and visualizations:

● Focus on Simplicity: Avoid clutter and unnecessary visuals; focus on key metrics.
● Use Color Wisely: Use consistent and meaningful colors for categories and trends.
Avoid overuse of color.
● Ensure Readability: Use appropriate font sizes, legends, and labels for readability.
● Context and Annotations: Provide context for your data with titles, annotations, or
tooltips.
● Interactivity and Exploration: Allow users to explore data interactively via slicers,
filters, and tooltips.

9. Sharing and Collaboration

Once the dashboard is built, you can share it with others via the Power BI service. Power BI
also supports real-time collaboration, where users can comment on dashboards, create
alerts, and schedule data refreshes.

10. Power BI Use Cases

Some common use cases for dashboards and visualizations in Power BI include:

● Sales and Financial Analysis: Visualizing sales trends, profit margins, and
forecasting.
● Customer Segmentation: Understanding customer behavior and segmenting
markets.
● Supply Chain Management: Tracking inventory levels, logistics, and supplier
performance.
● HR Analytics: Visualizing employee performance, attrition rates, and recruitment
data.


DATASET USED:

“STUDENT PERFORMANCE FACTORS”

Link: https://www.kaggle.com/datasets/lainguyn123/student-performance-factors

Description:

This dataset provides a comprehensive overview of various factors affecting student performance in
exams. It includes information on study habits, attendance, parental involvement, and other aspects
influencing academic success.

● Hours_Studied: Number of hours spent studying per week.


● Attendance: Percentage of classes attended.
● Parental_Involvement: Level of parental involvement in the student's education
(Low, Medium, High).
● Access_to_Resources: Availability of educational resources (Low, Medium, High).
● Extracurricular_Activities: Participation in extracurricular activities (Yes, No).
● Sleep_Hours: Average number of hours of sleep per night.
● Previous_Scores: Scores from previous exams.
● Motivation_Level: Student's level of motivation (Low, Medium, High).
● Internet_Access: Availability of internet access (Yes, No).
● Tutoring_Sessions: Number of tutoring sessions attended per month.
● Family_Income: Family income level (Low, Medium, High).
● Teacher_Quality: Quality of the teachers (Low, Medium, High).
● School_Type: Type of school attended (Public, Private).
● Peer_Influence: Influence of peers on academic performance (Positive, Neutral,
Negative).
● Physical_Activity: Average number of hours of physical activity per week.
● Learning_Disabilities : Presence of learning disabilities (Yes, No).
● Parental_Education_Level: Highest education level of parents (High School,
College, Postgraduate).
● Distance_from_Home: Distance from home to school (Near, Moderate, Far).
● Gender: Gender of the student (Male, Female).
● Exam_Score: Final exam score.


STEPS:

DATA PREPROCESSING:

1. Data Cleaning
● Load your Dataset

● Click on “Transform Data”


● Remove Duplicates: Ensure there are no duplicate rows in your dataset. In Power
Query, go to Home → Remove Duplicates.

● Handle Missing Values: Check for missing values and decide on an approach:
○ Replace missing numeric values with a mean/median.
○ For categorical fields (e.g., Parental_Involvement), replace missing
values with the mode or a default value (e.g., "Unknown").
○ In Power Query, select Transform → Replace Values to handle missing values.

Replace blank values with “Unknown”


2. Data Transformation

● Convert Data Types: Ensure each field has the correct data type:
○ Numeric fields (e.g., Hours_Studied, Exam_Score) should be set to
Whole Number or Decimal Number.
○ Categorical fields (e.g., Parental_Involvement, Motivation_Level)
should be set as Text or Categorical.


● Create Calculated Columns:


○ If necessary, derive new columns. For example, you might create a
Performance_Level column based on the Exam_Score (e.g., "High",
"Medium", "Low").

Click on Close and Apply


1. Introduction to Visuals

● Open Power BI Desktop and load your dataset.

● From the "Visualizations" pane, select a visual (e.g., Bar Chart, Pie Chart).
● Drag dataset fields into the "Values" and "Axis" sections.


2. Visualization Charts

● Select your chart type from the "Visualizations" pane (e.g., bar, line, pie).
● Drag relevant fields into the chart.

Bar Charts: Show relationships between Parental_Involvement and Exam_Score.

Line Charts: Plot Hours_Studied against Exam_Score over time (if applicable).

Pie Charts: Represent the proportion of students with High, Medium, and Low
Parental_Involvement.


3. Filtering Options

● Use the "Filters" pane on the right.


● Drag fields (e.g., Gender, School_Type) into the filter and apply them.

Filtering by Gender = Female and School_Type = Private


4. Exploring Matrix Visuals

● Add a "Matrix" visual from the "Visualizations" pane.


● Drag fields like Exam_Score to "Values", and Gender and Parental_Education_Level to "Rows" or "Columns."

5. Filtering Data with Slicers

● Add a "Slicer" visual.


● Drag a field like Gender, Attendance, or Parental_Involvement into the
slicer.


6. Number Cards and Text Cards

● Select a "Card" visual.


● Drag the field (e.g., Exam_Score) into the card for key metrics.

7. KPI Visuals

● Add a "KPI" visual.


● Set up the target as the average of Previous Scores or another aggregated measure:

Average of Previous Scores: You can calculate the average of all the previous scores as a
reference point.

Target_Score = AVERAGE('StudentPerformanceFactors'[Previous_Scores])

Then use this measure in the Target field of your KPI.

Select Average of Exam_Score from the dropdown in the "Value" field.

Use the following DAX formula to calculate the average Exam_Score:

Avg_Exam_Score = AVERAGE('StudentPerformanceFactors'[Exam_Score])


Steps to Configure the KPI:

1. Value: Select Average of Exam_Score from the dropdown in the "Value" field.
2. Trend Axis: Select Previous_Scores, which works well for showing trends based on historical data.
3. Target: Create a calculated measure for the target (e.g., Target_Score above) and use it here.

8. Visualizing Data with Maps

● Add a "Map" visual.


● Drag a geographical field (if available) or Distance_from_Home into the
"Location" section.


9. TreeMap

● Add a "TreeMap" visual.


● Drag fields like Motivation_Level into "Group" and Avg_Exam_Score into
"Values."

10. Tool Tips in Power BI


● Select a visual.
● In the "Fields" pane, drag additional data fields (e.g., Peer_Influence) to the
"Tooltips" area.

11. Modifying Colors in Charts and Visuals

● Select a visual.
● In the "Format" pane, under "Data colors," modify the color for each value category.


12. Bookmarks and Buttons

● Create a bookmark by going to "View" → "Bookmarks" → "Add."

● Add buttons via "Insert" → "Buttons" to navigate between bookmarks.


13. AI Visuals

● Add the "Key Influencers" visual.


● Drag Exam_Score into the "Analyze" section and influencing factors (e.g.,
Hours_Studied, Motivation_Level) into "Explain By."

14. Designing for Phone vs Desktop Report Viewers

● In Power BI Desktop, go to "View" → "Phone Layout."


● Adjust your visuals for mobile view by dragging and resizing elements.


15. Publishing Reports to Power BI Services

● In Power BI Desktop, click "Publish" in the toolbar.


● Sign into Power BI, select your workspace, and publish the report.


Dashboard:

CONCLUSION:

In this practical, we focused on data visualization techniques using Power BI to analyze the
student performance dataset. Various visuals, including KPI, were used to track Exam_Score
trends and compare them against targets like the average of previous scores. The use of
charts, slicers, and filters provided insights into student performance patterns, making it
easier to interpret data and identify areas for improvement.

Name of Student : Heetika Mahesh Vaity

Roll No : 57 Assignment Number : 1

Title of LAB Assignment :

DOP : 09/10/2024 DOS: 09/10/2024

CO MAPPED : PO MAPPED: FACULTY MARKS:

CO2 PO1, PO2, PO3, SIGNATURE:


PO4, PO5,
PSO1, PSO2

ASSIGNMENT 1

Q1. Explain cluster analysis.



Cluster analysis is the procedure of statistically grouping a set of objects (data points) into a number of clusters according to their similarity. The fundamental principle is to partition the data so that the objects within each group are more similar to one another than to those of other groups. Here are some of the key points related to cluster analysis:

1. Objective

● Exploratory: It helps users understand the structure and distribution of the data.
● Classification: It helps identify patterns and relationships in the data that are not immediately obvious.
● Targeting: It is often used in marketing to segment customers based on behavior, preferences, or demographics.

2. Types of Cluster Analysis

● Hierarchical Clustering: Creates a hierarchy of clusters, typically represented in a dendrogram. It can be agglomerative (bottom-up) or divisive (top-down).
● Partitioning Methods: Such as K-means, where the data is divided into a
predetermined number of clusters (K). Each cluster is defined by its centroid, and
data points are assigned to the nearest centroid.
● Density-Based Clustering: Identifies clusters based on the density of data points in
a region (e.g., DBSCAN). It can find arbitrarily shaped clusters and is robust to noise.
● Model-Based Clustering: Assumes that the data is generated by a mixture of
underlying probability distributions (e.g., Gaussian Mixture Models).

3. Key Steps in Cluster Analysis

1. Data Preparation: Cleaning and preprocessing data to handle missing values and
normalize features.
2. Feature Selection: Choosing the relevant features for clustering.
3. Choosing a Clustering Algorithm: Selecting the appropriate clustering technique
based on the data characteristics.
4. Determining the Number of Clusters: Methods like the elbow method or
silhouette analysis can help find the optimal number of clusters.

5. Cluster Validation: Assessing the quality and validity of the clusters formed, often
using internal metrics (like cohesion and separation) or external validation against
known labels.
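
As an illustration of the "Determining the Number of Clusters" step above, here is a minimal elbow-method sketch (scikit-learn is assumed to be available, and the data is synthetic):

# Elbow method: inspect within-cluster variance (inertia) for several values of K.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)  # synthetic data

for k in range(1, 8):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
    print(k, round(inertia, 1))
# The K where inertia stops dropping sharply (the "elbow") is a reasonable choice.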

4. Applications

● Market Segmentation: Identifying customer segments for targeted marketing.


● Social Network Analysis: Understanding communities within networks.
● Anomaly Detection: Identifying outliers in data, which can indicate fraud or errors.
● Image Segmentation: Grouping pixels in images to identify objects.

5. Challenges

● Choosing the Right Algorithm: Different algorithms may yield different results.
● Scalability: Some methods can struggle with large datasets.
● Interpreting Results: Clusters may not always be easily interpretable or
meaningful.

Cluster analysis is a powerful tool for discovering patterns and structures in data, making it
widely used in various fields such as marketing, biology, finance, and social sciences.

Q2. Explain the different similarity measures.


Similarity measures are used in clustering and other data analysis techniques to quantify
how similar two data points are. The choice of similarity measure can significantly affect
the clustering results. Here are some common similarity measures, along with their
characteristics and applications:

1. Euclidean Distance

● Formula: d(x, y) = √( Σ_i (x_i − y_i)² )

● Description: Measures the straight-line distance between two points in Euclidean space. It's suitable for continuous variables.
● Use Cases: Commonly used in K-means clustering and hierarchical clustering.

2. Manhattan Distance (L1 Norm)



● Formula: d(x, y) = Σ_i |x_i − y_i|

● Description: Measures the distance between two points along the axes (like a grid).
It sums the absolute differences of their coordinates.
● Use Cases: Useful in scenarios where the data is arranged in a grid-like structure or
when you want to reduce the impact of outliers.

3. Minkowski Distance

● Formula: d(x, y) = ( Σ_i |x_i − y_i|^m )^(1/m)

● Description: A generalization of both Euclidean and Manhattan distances. The parameter m defines the distance type: m = 1 gives Manhattan distance and m = 2 gives Euclidean distance.

● Use Cases: Flexible and can be tailored to specific needs by adjusting m.

4. Cosine Similarity

● Formula: sim(x, y) = (x · y) / (‖x‖ ‖y‖) = Σ_i x_i y_i / ( √(Σ_i x_i²) √(Σ_i y_i²) )

● Description: Measures the cosine of the angle between two non-zero vectors in an
inner product space. It ranges from -1 to 1.
● Use Cases: Particularly useful in text mining and natural language processing, where
documents can be represented as vectors.

5. Pearson Correlation Coefficient

● Formula: r = Σ_i (x_i − x̄)(y_i − ȳ) / ( √(Σ_i (x_i − x̄)²) √(Σ_i (y_i − ȳ)²) )

● Description: Measures the linear correlation between two variables, ranging from
-1 (perfectly negatively correlated) to 1 (perfectly positively correlated).

● Use Cases: Often used in clustering when the goal is to identify relationships
between features.
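
The measures above can be computed in a few lines; here is a minimal NumPy sketch (illustrative only, with two small example vectors):

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))           # straight-line distance
manhattan = np.sum(np.abs(x - y))                   # sum of absolute differences
minkowski = np.sum(np.abs(x - y) ** 3) ** (1 / 3)   # Minkowski with m = 3
cosine = x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))   # angle-based similarity
pearson = np.corrcoef(x, y)[0, 1]                   # linear correlation

print(euclidean, manhattan, minkowski, cosine, pearson)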

Q3. Discuss the K-means algorithm.


The K-means algorithm is a popular clustering technique used to partition a dataset into K
distinct clusters based on feature similarity. It is widely used for its simplicity and efficiency
in handling large datasets. The objective of K-means is to minimize the variance within each
cluster while maximizing the variance between different clusters. This is typically done by
minimizing the sum of squared distances between each data point and the centroid of its
assigned cluster.

Steps of the K-means Algorithm

1. Initialization:
○ Choose the number of clusters K.
○ Randomly initialize K centroids, which can be selected from the dataset or randomly generated within the feature space.
2. Assignment Step:
○ For each data point in the dataset, calculate the distance between the point
and each centroid (using Euclidean distance or another similarity measure).
○ Assign each data point to the cluster associated with the nearest centroid.
3. Update Step:
○ Recalculate the centroids of each cluster. The new centroid is the mean of all data points assigned to that cluster: C_k = (1 / N_k) · Σ_{x_i ∈ Cluster_k} x_i, where N_k is the number of points in cluster k.
4. Convergence Check:
○ Repeat the assignment and update steps until the centroids no longer change
significantly, or until a maximum number of iterations is reached. This
indicates that the algorithm has converged.

Example of K-means Algorithm

1. Suppose we have the following data points:


○ (1, 2), (1, 4), (1, 0), (4, 2), (4, 4), (4, 0).
2. If we set K = 2 and initialize the centroids randomly (e.g., (1, 2) and (4, 4)):
○ First Assignment Step: Assign points to the nearest centroid.

○ First Update Step: Calculate new centroids.


○ Repeat: Continue until centroids stabilize.
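
A minimal NumPy sketch of the algorithm, using the six points and the initial centroids from the example above (illustrative only, not an optimized implementation):

import numpy as np

# The six data points and the initial centroids from the example (K = 2).
points = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]], dtype=float)
centroids = np.array([[1, 2], [4, 4]], dtype=float)

for _ in range(10):  # a few iterations are enough for this tiny example
    # Assignment step: index of the nearest centroid for every point
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)

    # Update step: each centroid becomes the mean of the points assigned to it
    new_centroids = np.array([points[labels == k].mean(axis=0)
                              for k in range(len(centroids))])

    if np.allclose(new_centroids, centroids):  # convergence check
        break
    centroids = new_centroids

print(labels)     # expected: [0 0 0 1 1 1]
print(centroids)  # expected: [[1. 2.] [4. 2.]]

With this initialization the assignments stop changing after a single update, leaving the centroids at (1, 2) and (4, 2).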

Advantages of K-means Algorithm

● Simplicity: Easy to implement and understand.


● Efficiency: Works well with large datasets; time complexity is approximately O(n · K · t), where n is the number of data points, K is the number of clusters, and t is the number of iterations.
● Scalability: Can handle large datasets efficiently compared to other clustering
algorithms.

Disadvantages of K-means Algorithm

● Choosing K: The number of clusters K must be specified in advance, which may not always be known.
● Sensitivity to Initialization: The final results can depend on the initial choice of
centroids. Different initializations may lead to different clustering results.
● Assumption of Spherical Clusters: K-means assumes clusters are spherical and
equally sized, which may not always be the case in real-world data.
● Outliers: The algorithm is sensitive to outliers, as they can disproportionately
influence the position of centroids.

Applications of K-means Algorithm

● Market Segmentation: Identifying distinct customer segments based on purchasing behavior.
● Image Compression: Reducing the number of colors in an image by clustering pixel
colors.
● Document Clustering: Grouping similar documents in natural language processing.
● Anomaly Detection: Identifying unusual data points by their distance from
centroids.
Name of Student : Heetika Mahesh Vaity

Roll No : 57 Assignment Number : 2

Title of LAB Assignment : PageRank Algorithm in Big Data

DOP : 14/10/2024 DOS: 19/10/2024

CO MAPPED : PO MAPPED: FACULTY MARKS:

CO2 PO1, PO2, PO3, SIGNATURE:


PO4, PO5,
PSO1, PSO2

ASSIGNMENT 2

Q1. What do you mean by Big Graphs?

Big Graphs refer to graph data structures that are large in scale, often containing millions
or billions of nodes (vertices) and edges. These graphs typically represent complex
networks, such as social networks, web pages, biological networks, or transportation
systems, where the relationships between entities are vast and interconnected.

● Graph Representation: A graph consists of nodes (vertices) and edges (connections between nodes). In big graphs, the sheer number of nodes and edges makes the analysis and processing computationally challenging.
● Examples of Big Graphs:
○ Social Networks: Facebook's social graph, where nodes represent users and
edges represent friendships or interactions.
○ Web Graphs: The structure of the World Wide Web, where each web page is
a node and hyperlinks between them are edges.
○ Transportation Networks: Roads, train tracks, or flight routes, where nodes
represent cities or stations, and edges represent routes or paths.
● Challenges: Handling and analyzing big graphs require specialized algorithms and
infrastructure due to their size and complexity. Processing large graphs requires
efficient storage techniques, parallel algorithms, and distributed computing systems
to manage the data.

Q2. What is PageRank?


PageRank is an algorithm developed by Larry Page and Sergey Brin (founders of Google)
to rank web pages in search engine results. It evaluates the importance of each webpage
based on the number and quality of links pointing to it. Pages that are linked to by many
other important pages are considered more important themselves.

● Concept: PageRank is based on the idea that if a webpage is linked to by many other
pages, especially by important pages, it is likely to be more relevant and
authoritative. Links from higher-quality pages carry more weight.

● Web as a Graph: The web is represented as a graph where web pages are nodes and
hyperlinks between them are edges. PageRank uses the structure of this graph to
determine the ranking of pages.
● Intuition: PageRank can be thought of as a way of measuring the likelihood that a
random web surfer, who randomly follows links, will land on a particular page. Pages
that have more incoming links are more likely to be visited.

Q3. Steps of the PageRank Algorithm


The PageRank algorithm follows a simple iterative process to assign a rank (or score) to
each page based on the link structure of the web. Here are the key steps:

Step 1: Initialize the PageRank Value

● Assign an initial PageRank value to every page in the network. Typically, every page
is given an equal value at the start. If there are N pages, each page could start with a
rank of 1/N.

Step 2: Calculate the Contribution of Each Page

● For each page, its PageRank is distributed evenly across the pages it links to.
● Formula: PR(A) = (1 − d) / N + d · Σ_{B ∈ M(A)} PR(B) / L(B), where:

○ PR(A) is the PageRank of page A.


○ M(A) is the set of pages that link to page A.
○ L(B) is the number of outbound links on page B.
○ d is a damping factor, typically set to 0.85.
○ N is the total number of pages.
● The damping factor, d, represents the probability that a user continues following
links (usually set to 0.85, meaning 85% of the time the user follows links, and 15%
of the time they randomly jump to a new page).

Step 3: Update the PageRank



● For each iteration, update the PageRank of each page based on the contributions
from the pages that link to it (using the formula above). The process continues
iteratively until the PageRank values converge, meaning they stop changing
significantly between iterations.

Step 4: Handle Dead Ends (Dangling Nodes)

● Some pages (dead ends or dangling nodes) may have no outbound links. These are
handled by redistributing their PageRank equally among all pages in the graph, or in
some implementations, they're simply ignored.

Step 5: Handle Random Surfers

● The random surfer model is introduced through the damping factor d. In each
iteration, a portion (1 - d) of the total PageRank is evenly distributed among all
pages to simulate the chance that a random user may jump to any page at random
instead of following a link.

Step 6: Repeat the Process

● Repeat the process of distributing PageRank and updating values over multiple
iterations until the algorithm converges. Convergence means that the PageRank
values between iterations change very little, indicating that the system has reached a
steady state.

Step 7: Rank the Pages

● Once the algorithm converges, each page will have a final PageRank value, which
represents its importance relative to other pages. Pages with higher PageRank
values are considered more important and will appear higher in search engine
rankings.
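
A minimal Python sketch of the iteration described in these steps (the three-page link structure is an assumption made up for the example; it contains no dangling nodes, so Step 4 is not needed here):

# Tiny illustrative PageRank. The link structure below is an assumed example.
links = {
    "A": ["B", "C"],   # page A links to B and C
    "B": ["C"],        # page B links to C
    "C": ["A"],        # page C links back to A
}

d = 0.85                           # damping factor
pages = list(links)
N = len(pages)
pr = {p: 1.0 / N for p in pages}   # Step 1: equal initial PageRank

for _ in range(100):               # Steps 2-6: iterate until convergence
    new_pr = {}
    for page in pages:
        # Contribution PR(B) / L(B) from every page B that links to this page
        incoming = sum(pr[b] / len(links[b]) for b in pages if page in links[b])
        new_pr[page] = (1 - d) / N + d * incoming
    pr, old = new_pr, pr
    if all(abs(pr[p] - old[p]) < 1e-8 for p in pages):
        break

print(sorted(pr.items(), key=lambda kv: -kv[1]))   # Step 7: rank the pages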
