Big Data (Assignment)

This document discusses big data analytics and the key characteristics of big data known as the 3Vs - volume, velocity, and variety. It defines big data as data that is too large to process using traditional database systems due to its size (volume) or type (variety). The speed at which data is generated (velocity) is also a defining factor. The document provides examples of structured, unstructured, and semi-structured data and explains how missing data can impact analytics if not properly addressed. Common techniques for handling missing values like deletion, imputation, and data wrangling are summarized. Finally, it discusses exporting large datasets to cloud storage like Amazon S3 for safety, simultaneous access, and real-time analysis.

Big data analytics refers to the strategy of analyzing large volumes of data, or big data. This big data is gathered from a wide variety of sources, including social networks, videos, digital images, sensors, and sales transaction records.
Big data:
Data that exceeds the storage capacity and processing power of traditional systems is considered big data. Data that is unstructured, time-sensitive, or simply very large cannot be processed by relational database engines.

The V's of Big Data:


● Volume
● Velocity
● Variety
Volume: The amount of data generated by different organizations. As more devices become internet-enabled, more people use data-collecting devices, and the volume of data is increasing at a staggering rate. In fact, an often-quoted estimate is that 90% of the data in the world today was created in the last two years.
Velocity:
The speed at which vast amounts of data are being generated, collected, and analyzed. Every day the number of emails, Twitter messages, photos, video clips, etc. increases at lightning speed around the world. Data is growing every second of every day. Not only must it be analyzed, but the speed of transmission and access to the data must also remain near-instantaneous to allow for real-time access to websites, credit card verification, and instant messaging. Big data technology now allows us to analyze the data while it is being generated, without ever putting it into databases.
Variety:
Variety is one of the most interesting developments in technology as more and more information is digitized. Traditional data types (structured data) include things on a bank statement such as date, amount, and time; these fit neatly in a relational database. Data can be classified as:
● Structured
● Unstructured
● Semi-structured

Structured:

Any data that can be stored, accessed, and processed in a fixed format is termed 'structured' data.
An 'Employee' table in a database is an example of structured data:

Employee_ID   Employee_Name     Gender   Department   Salary_In_Rs
2365          Rajesh Kulkarni   Male     Finance      650000
3398          Pratibha Joshi    Female   Admin        650000
7465          Shushil Roy       Male     Admin        500000

Unstructured:
Any data whose form or structure is unknown is classified as unstructured data. In addition to its sheer size, unstructured data poses multiple challenges when it comes to processing it to derive value. Typical examples of unstructured data are plain text files, images, videos, etc. Nowadays organizations have a wealth of data available to them, but unfortunately they often do not know how to derive value from it, since the data is in its raw, unstructured form.

Semi-structured:
Semi-structured data can contain both forms of data. Semi-structured data appears structured, but it is not defined with, for example, a table definition in a relational DBMS. A typical example of semi-structured data is data represented in an XML file.
Example of semi-structured data: personal data stored in an XML file:
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
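
Such records can also be parsed programmatically. Below is a minimal sketch using Python's standard xml.etree.ElementTree module; the <recs> root element is added here only so the snippet forms a well-formed XML document.

    # A minimal sketch: parsing the semi-structured XML records shown above.
    # The <recs> root element is an addition for illustration only.
    import xml.etree.ElementTree as ET

    xml_data = """
    <recs>
      <rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
      <rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
      <rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
    </recs>
    """

    root = ET.fromstring(xml_data)
    for rec in root.findall("rec"):
        name = rec.find("name").text
        sex = rec.find("sex").text
        age = int(rec.find("age").text)
        print(name, sex, age)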

Fig: The increasing expansion of the 3Vs.

Missing data

Missing data in the training data set can reduce the power/fit of a model or can lead to a biased model, because we have not analyzed the behavior of and relationships with other variables correctly. It can lead to wrong prediction or classification.

Consider, for example, a small survey of cricket playing by gender that contains missing values. If the missing values are not treated, the inference from the data set may be that the chances of playing cricket are higher for males than for females. After treating the missing values (imputing them based on gender), the same data may show that females have a higher chance of playing cricket than males.

Why does my data have missing values?

We have looked at the importance of treating missing values in a dataset. Now, let's identify the reasons for the occurrence of these missing values. They may occur at two stages:

1. Data extraction: It is possible that there are problems with the extraction process. In such cases, we should double-check the data with the data guardians. Hashing procedures can also be used to make sure the extraction is correct. Errors at the data extraction stage are typically easy to find and can be corrected easily as well.
2. Data collection: These errors occur at the time of data collection and are harder to correct. They can be categorized into four types:
o Missing completely at random: The probability of a value being missing is the same for all observations. For example, respondents decide whether to declare their earnings by tossing a fair coin: if heads, the respondent declares his or her earnings, otherwise not. Here every observation has an equal chance of having a missing value.
o Missing at random: The variable is missing at random, but the missing ratio varies for different values/levels of other input variables. For example, when collecting data for age, females may have a higher rate of missing values than males.
o Missing that depends on unobserved predictors: The missing values are not random and are related to an unobserved input variable. For example, in a medical study, if a particular diagnostic test causes discomfort, there is a higher chance of dropping out of the study. This missingness is not random unless we have included "discomfort" as an input variable for all patients.
o Missing that depends on the missing value itself: The probability of a missing value is directly correlated with the missing value itself. For example, people with higher or lower income are more likely to not report their earnings.

What are the methods to treat missing values?

1. Deletion: It is of two types: listwise deletion and pairwise deletion.
o In listwise deletion, we delete observations in which any of the variables is missing. Simplicity is one of the major advantages of this method, but it reduces the power of the model because it reduces the sample size.
o In pairwise deletion, we perform each analysis with all cases in which the variables of interest are present. The advantage of this method is that it keeps as many cases as possible available for analysis; a disadvantage is that it uses a different sample size for different variables.
o Deletion methods are appropriate when the missing data are "missing completely at random"; otherwise, non-random missing values can bias the model output.
2. Mean/Mode/Median imputation: Imputation is a method of filling in the missing values with estimated ones. The objective is to use known relationships that can be identified in the valid values of the data set to help estimate the missing values. Mean/mode/median imputation is one of the most frequently used methods. It consists of replacing the missing data for a given attribute with the mean or median (for a quantitative attribute) or the mode (for a qualitative attribute) of all known values of that variable. It can be of two types (see the pandas sketch after this list):
o Generalized imputation: We calculate the mean or median of all non-missing values of the variable and replace each missing value with it. In the "Manpower" example referred to above, we take the average of all non-missing values of "Manpower" (28.33) and replace the missing values with it.
o Similar case imputation: We calculate the average separately for gender "Male" (29.75) and "Female" (25) using the non-missing values, and then replace each missing value based on gender: for "Male" we replace missing Manpower values with 29.75 and for "Female" with 25.
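
To make the deletion and imputation methods concrete, here is a minimal pandas sketch. The small table below is a made-up stand-in for the "Manpower" example referred to above; the column names and values are assumptions, not the original data.

    # A minimal sketch with pandas, using a small hypothetical table similar
    # to the "Manpower" example above (column names and values are made up).
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "Gender":   ["Male", "Male", "Female", "Female", "Male", "Female"],
        "Manpower": [30.0, 29.5, 25.0, np.nan, 30.0, np.nan],
        "Sales":    [120, 100, np.nan, 90, 110, 95],
    })

    # Listwise deletion: drop every row that has any missing value.
    listwise = df.dropna()

    # Pairwise-style analysis: each statistic uses all rows available for that column.
    pairwise_means = df.mean(numeric_only=True)   # NaNs are skipped column by column

    # Generalized imputation: fill missing Manpower with the overall column mean.
    generalized = df["Manpower"].fillna(df["Manpower"].mean())

    # Similar case imputation: fill missing Manpower with the mean of the same gender.
    similar_case = df.groupby("Gender")["Manpower"].transform(lambda s: s.fillna(s.mean()))

    print(listwise, pairwise_means, generalized, similar_case, sep="\n\n")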

Exporting all the data to a cloud service such as Amazon Web Services S3

We usually export our data to the cloud for purposes such as safety, access by multiple users, and real-time simultaneous analysis.
Various vendors provide cloud storage services; here we discuss Amazon S3.
An Amazon S3 export transfers individual objects from Amazon S3 buckets to your device, creating one file for each object. You can export from more than one bucket, and you can specify which files to export using manifest file options.
Export Job Process
1. You create an export manifest file that specifies how to load data onto your device, including an encryption PIN code or password and details such as the name of the bucket that contains the data to export. For more information, see The Export Manifest File. If you are going to mail us multiple storage devices, you must create a manifest file for each storage device.
2. You initiate an export job by sending a CreateJob request that includes the manifest file. You must submit a separate job request for each device. Your job expires after 30 days. If you do not send a device, there is no charge.
You can send a CreateJob request using the AWS Import/Export Tool, the AWS Command Line Interface (CLI), the AWS SDK for Java, or the AWS REST API. The easiest method is the AWS Import/Export Tool. For details, see:
● Sending a CreateJob Request Using the AWS Import/Export Web Service Tool
● Sending a CreateJob Request Using the AWS SDK for Java
● Sending a CreateJob Request Using the REST API
3. AWS Import/Export sends a response that includes a job ID, a signature value, and information on how to print your pre-paid shipping label. The response also saves a SIGNATURE file to your computer. You will need this information in subsequent steps.

4. You copy the SIGNATURE file to the root directory of your storage device. You can use the file AWS sent or copy the signature value from the response into a new text file named SIGNATURE. The file name must be SIGNATURE and it must be in the device's root directory. Each device you send must include the unique SIGNATURE file for that device and that job ID.
AWS Import/Export validates the SIGNATURE file on your storage device before starting the data load. If the SIGNATURE file is missing or invalid (if, for instance, it is associated with a different job request), AWS Import/Export will not perform the data load and we will return your storage device.
5. Generate, print, and attach the pre-paid shipping label to the exterior of your package. See Shipping Your Storage Device for information on how to get your pre-paid shipping label.
6. You ship the device and cables to AWS through UPS. Make sure to include your job ID on the shipping label and on the device you are shipping; otherwise, your job might be delayed. Your job expires after 30 days. If we receive your package after your job expires, we will return your device, and you will only be charged for the shipping fees, if any. You must submit a separate job request for each device.
7. AWS Import/Export validates the signature on the root drive of your storage device. If the signature doesn't match the signature from the CreateJob response, AWS Import/Export can't load your data.
Once your storage device arrives at AWS, your data transfer typically begins by the end of the next business day. The timeline for exporting your data depends on a number of factors, including the availability of an export station, the amount of data to export, and the data transfer rate of your device.
8. AWS reformats your device and encrypts your data using the PIN code or password you provided in your manifest.
9. We repack your storage device and ship it to the return shipping address listed in your manifest file. We do not ship to post office boxes.
10. You use your PIN code or TrueCrypt password to decrypt your device. For more information, see Encrypting Your Data.

Screenshots for uploading data, creating buckets, and granting permissions:

Create an account and sign in to the console. Enter your credentials to sign in, and don't forget to delete the account if you are not using it regularly.

After signing in, we can see buttons at the top left to create a bucket.

In the Amazon S3 console, give a unique name for the bucket.

After the bucket is created, we can also create a folder and upload files from our local machine. We can upload any number of files into S3.

We can check the Properties tab for details such as size, owner name, bucket name, etc.

We can select the Permissions tab and grant permissions to any user of our choice.
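
The same console steps (creating a bucket, uploading a file, checking properties, and granting permissions) can also be scripted. The following is a sketch using the boto3 Python SDK; it assumes AWS credentials are already configured, and the bucket name, region, file names, and grantee ID are placeholder values.

    # A sketch of the console steps above using the boto3 Python SDK.
    # Assumes AWS credentials are configured (e.g. via `aws configure`);
    # bucket name, region, file names and grantee ID are placeholders.
    import boto3

    s3 = boto3.client("s3", region_name="ap-south-1")

    # Create a bucket (bucket names must be globally unique).
    s3.create_bucket(
        Bucket="my-unique-bucket-name-12345",
        CreateBucketConfiguration={"LocationConstraint": "ap-south-1"},
    )

    # Upload a local file into a "folder" (key prefix) in the bucket.
    s3.upload_file("local_data.csv", "my-unique-bucket-name-12345", "exports/local_data.csv")

    # Inspect object properties (size, last modified time, etc.).
    response = s3.head_object(Bucket="my-unique-bucket-name-12345", Key="exports/local_data.csv")
    print(response["ContentLength"], response["LastModified"])

    # Grant read permission on the object to another AWS account
    # (the canonical user ID below is a placeholder).
    s3.put_object_acl(
        Bucket="my-unique-bucket-name-12345",
        Key="exports/local_data.csv",
        GrantRead='id="CANONICAL_USER_ID_OF_OTHER_ACCOUNT"',
    )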

PART(B)
Workplace Safety
Basic Workplace Safety Guidelines:

● Fire Safety: Employees should be aware of all emergency exits of the office building, including fire escape routes, and also of the locations of fire extinguishers and alarms.

● Falls and Slips: To avoid falls and slips, all things must be arranged properly. Any spilt liquid, food or other items such as paints must be cleaned immediately to avoid accidents. Make sure there is proper lighting and that all damaged equipment, stairways and light fixtures are repaired immediately.
● First Aid: Employees should know the location of first-aid kits in the office. First-aid kits should be kept in places that can be reached quickly. These kits should contain all the important items for first aid, for example, everything required to deal with common problems such as cuts, burns, headaches, muscle cramps, etc.
● Security: Employees should make sure that they keep their personal things in a safe place.
● Electrical Safety: Employees must be given basic knowledge of how to use electrical equipment and of common problems. Employees must also be given instructions about electrical safety, such as keeping water and food items away from electrical equipment. Electrical staff and engineers should carry out routine inspections of all wiring to make sure there are no damaged or broken wires.

Apache Hadoop is a Java-based free software framework that can effectively store large amounts of data in a cluster. The framework runs in parallel on a cluster and allows us to process data across all nodes.

Hadoop - HDFS Overview

The Hadoop Distributed File System (HDFS) is a Java-based distributed file system that allows you to store large amounts of data; it splits big data into blocks and distributes them across many nodes in a cluster. So, if you install Hadoop, you get HDFS as the underlying storage system for storing data in a distributed environment.

HDFS Architecture

Fig: HDFS follows the master-slave architecture and it has the following elements.

Name Node

The Name Node is the master node in the HDFS architecture of Apache Hadoop; it maintains and manages the blocks present on the Data Nodes (slave nodes).

The system running the Name Node acts as the master server and performs the following tasks:

● It records the metadata of all the files stored in the cluster, e.g. the location of the stored blocks, the size of the files, permissions, hierarchy, etc. There are two files associated with the metadata:

FsImage: contains the complete state of the file system namespace since the start of the Name Node.

EditLogs: contains all the recent modifications made to the file system with respect to the most recent FsImage.

● It also executes file system operations such as renaming, closing, and opening files and directories.

Datanode:

DataNodes are the slave nodes in HDFS. Unlike the Name Node, a Data Node runs on commodity hardware, that is, an inexpensive system that does not need to be of high quality or high availability. A Data Node is a block server that stores the data in the local file system.

Functions of Data Node:

● These are slave daemons or processes that run on each slave machine.
● The actual data is stored on the Data Nodes.
● The Data Nodes serve the low-level read and write requests from the file system's clients.
● They send heartbeats to the Name Node periodically to report that the Data Node is alive; by default, this frequency is set to 3 seconds.

Secondary Name Node:

Apart from these two daemons, there is a third daemon or process called the Secondary Name Node. The Secondary Name Node works concurrently with the primary Name Node as a helper daemon. Despite its name, the Secondary Name Node is not a backup for the Name Node.

Functions of Secondary Name Node:

The Secondary Name Node constantly reads the file system state and metadata from the RAM of the Name Node and writes it to the hard disk or the file system, periodically merging the EditLogs with the FsImage (checkpointing).

Blocks:
As we now know, the data in HDFS is scattered across the Data Nodes as blocks. A block is nothing but the smallest contiguous location on your hard drive where data is stored. In general, in any file system, you store data as a collection of blocks. Similarly, HDFS stores each file as blocks, which are scattered throughout the Apache Hadoop cluster. The default size of each block is 128 MB in Apache Hadoop 2.x (64 MB in Apache Hadoop 1.x), which you can configure as per your requirement.
Replication Management:
HDFS provides a reliable way to store huge data in a distributed environment as data blocks. The blocks are also replicated to provide fault tolerance. The default replication factor is 3, which is again configurable, so each block is stored on three different Data Nodes (considering the default replication factor).
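
As a quick back-of-the-envelope illustration of how block size and the replication factor interact, consider the following sketch (the file size is an arbitrary example):

    # A back-of-the-envelope sketch of how a file is split into HDFS blocks
    # and how much raw cluster storage the replicas consume.
    import math

    block_size_mb = 128        # default block size in Apache Hadoop 2.x
    replication_factor = 3     # default replication factor
    file_size_mb = 500         # arbitrary example file

    num_blocks = math.ceil(file_size_mb / block_size_mb)   # 4 blocks (3 x 128 MB + 1 x 116 MB)
    raw_storage_mb = file_size_mb * replication_factor     # 1500 MB of raw storage across the cluster

    print(f"{num_blocks} blocks, about {raw_storage_mb} MB of raw storage with replication")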

Goals of HDFS

● Fault detection and recovery: Since HDFS includes a large number of commodity hardware components, failure of components is frequent. Therefore HDFS should have mechanisms for quick and automatic fault detection and recovery.
● Huge datasets: HDFS should scale to hundreds of nodes per cluster to manage applications having huge datasets.

● Hardware at data: A requested task can be done efficiently when the computation takes place near the data. Especially where huge datasets are involved, this reduces the network traffic and increases the throughput.

Hadoop - Map Reduce

Map Reduce is a framework using which we can write applications to process huge
amounts of data, in parallel, on large clusters of commodity hardware in a reliable manner.

InputSplit – the logical representation of data. The data to be processed by an individual Mapper is represented by an InputSplit.

RecordReader – it communicates with the InputSplit and converts the split into records in the form of key-value pairs suitable for reading by the mapper. By default, RecordReader uses TextInputFormat to convert data into key-value pairs. The RecordReader communicates with the InputSplit until the reading of the file is completed.

There are two types of tasks:

1. Map tasks (splits and mapping)

2. Reduce tasks (shuffling and reducing)

Job Tracker: acts like a master (responsible for the complete execution of a submitted job) and assigns the work to the Task Trackers.

Multiple Task Trackers: act like slaves, each of them performing part of the job.

Shuffling in MapReduce: The process of transferring data from the mappers to the reducers is known as shuffling, i.e. the process by which the system performs the sort and transfers the map output to the reducer as input. The MapReduce shuffle phase is therefore necessary for the reducers.

Sorting in MapReduce: The keys generated by the mapper are automatically sorted by the MapReduce framework, i.e. before the reducer starts, all intermediate key-value pairs generated by the mapper are sorted by key and not by value. The values passed to each reducer are not sorted; they can be in any order.

Example for Map Reduce flow
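
The following is a minimal pure-Python simulation of the word-count flow, showing the map, shuffle/sort, and reduce phases; it only illustrates the phases and is not actual Hadoop code.

    # A pure-Python simulation of the MapReduce word-count flow: map emits
    # (key, value) pairs, shuffle/sort groups them by key, reduce aggregates.
    from collections import defaultdict

    lines = ["deer bear river", "car car river", "deer car bear"]

    # Map phase: each input record emits (word, 1) pairs.
    mapped = [(word, 1) for line in lines for word in line.split()]

    # Shuffle & sort phase: group the values by key, with keys sorted.
    groups = defaultdict(list)
    for key, value in sorted(mapped):
        groups[key].append(value)

    # Reduce phase: sum the values for each key.
    reduced = {key: sum(values) for key, values in groups.items()}

    print(reduced)   # {'bear': 2, 'car': 3, 'deer': 2, 'river': 2}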

Apache Spark – Introduction

Industries are using Hadoop extensively to analyze their data sets. The reason is that the Hadoop framework is based on a simple programming model (Map Reduce) and it enables a computing solution that is scalable, flexible, fault-tolerant and cost effective. Here, the main concern is to maintain speed in processing large datasets, in terms of the waiting time between queries and the waiting time to run the program.

Contrary to a common belief, Spark is not a modified version of Hadoop and is not really dependent on Hadoop, because it has its own cluster management. Hadoop is just one of the ways to implement Spark.

Apache Spark
Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It is based on Hadoop Map Reduce and extends the Map Reduce model to use it efficiently for more types of computation, including interactive queries and stream processing. The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.

Evolution of Apache Spark

Spark is one of Hadoop's sub-projects, developed in 2009 in UC Berkeley's AMPLab by Matei Zaharia. It was open-sourced in 2010 under a BSD license. It was donated to the Apache Software Foundation in 2013, and Apache Spark became a top-level Apache project in February 2014.

Features of Apache Spark

Apache Spark has the following features.

● Speed − Spark helps run an application on a Hadoop cluster up to 100 times faster in memory, and 10 times faster when running on disk. This is possible by reducing the number of read/write operations to disk; Spark stores the intermediate processing data in memory.
● Supports multiple languages − Spark provides built-in APIs in Java, Scala, and Python, so you can write applications in different languages. Spark also comes with more than 80 high-level operators for interactive querying.

● Advanced analytics − Spark not only supports 'Map' and 'Reduce'; it also supports SQL queries, streaming data, machine learning (ML), and graph algorithms.

Apache Spark Core


● Spark Core is the underlying general execution engine of the Spark platform upon which all other functionality is built. It provides in-memory computing and the ability to reference datasets in external storage systems.
Spark SQL

Spark SQL is a component on top of Spark Core that introduces a new data abstraction called SchemaRDD, which provides support for structured and semi-structured data.
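
As a small illustration, the sketch below uses PySpark to register a table and run a SQL query over it. In later Spark releases the SchemaRDD abstraction evolved into the DataFrame API; the table contents here are made-up example values.

    # A small PySpark sketch of querying structured data through Spark SQL.
    # (The table contents are made-up example values.)
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark-sql-sketch").getOrCreate()

    employees = spark.createDataFrame(
        [(2365, "Rajesh Kulkarni", "Finance", 650000),
         (3398, "Pratibha Joshi", "Admin", 650000),
         (7465, "Shushil Roy", "Admin", 500000)],
        ["employee_id", "name", "department", "salary"],
    )

    employees.createOrReplaceTempView("employees")
    spark.sql("SELECT department, AVG(salary) AS avg_salary "
              "FROM employees GROUP BY department").show()

    spark.stop()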

Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform
streaming analytics. It ingests data in mini-batches and performs RDD (Resilient
Distributed Datasets) transformations on those mini-batches of data.

MLlib (Machine Learning Library)


MLlib is a distributed machine learning framework on top of Spark, made possible by the distributed memory-based Spark architecture. According to benchmarks done by the MLlib developers against the Alternating Least Squares (ALS) implementations, Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface).

GraphX
GraphX is a distributed graph-processing framework on top of Spark. It provides an API for expressing graph computation that can model user-defined graphs using the Pregel abstraction API. It also provides an optimized runtime for this abstraction.

Running Spark Application

Apache Spark Driver

The main() method of the program runs in the driver. The driver is the process that runs the user code, which creates RDDs, performs transformations and actions, and creates the SparkContext. When the Spark shell is launched, this means that we have created a driver program. On the termination of the driver, the application is finished.

The driver program splits the Spark application into tasks and schedules them to run on the executors. The task scheduler resides in the driver and distributes tasks among the workers.

The two main roles of the driver are:

● Converting the user program into tasks.
● Scheduling tasks on the executors.

Apache SparkContext

SparkContext is the heart of a Spark application. It establishes a connection to the Spark execution environment. It is used to create Spark RDDs, access Spark services, and run jobs. SparkContext is a client of the Spark execution environment and acts as the master of the Spark application.

The main tasks of the SparkContext are:

● Getting the current status of the Spark application
● Cancelling a job
● Running a job
● Accessing RDDs

Apache Spark Shell

The Spark shell is a Spark application written in Scala. It offers a command-line environment with auto-completion. It helps us to get familiar with the features of Spark, which in turn helps in developing our own standalone Spark applications.

Apache Spark Executors

The individual tasks in a given Spark job run in the Spark executors. Executors are launched once at the beginning of a Spark application and then run for the entire lifetime of the application. Even if a Spark executor fails, the Spark application can continue with ease.

There are two main roles of the executors:

● Running the tasks that make up the application and returning the results to the driver.
● Providing in-memory storage for RDDs that are cached by the user.

There are two types of Apache Spark RDD operations:

● Transformations
● Actions
A transformation is a function that produces a new RDD from existing RDDs, whereas an action is performed when we want to work with the actual dataset.

A Spark transformation is a function that produces a new RDD from existing RDDs. It takes RDDs as input and produces one or more RDDs as output. Each time we apply a transformation, a new RDD is created; the input RDDs cannot be changed, since RDDs are immutable in nature.
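
A short PySpark sketch contrasting the two kinds of operations (the numbers and lambdas are arbitrary examples):

    # Transformations are lazy: they only build new RDDs.
    # Actions trigger computation and return values to the driver.
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "rdd-operations-sketch")

    numbers = sc.parallelize([1, 2, 3, 4, 5, 6])

    # Transformations: produce new RDDs; nothing is computed yet.
    evens = numbers.filter(lambda x: x % 2 == 0)
    squares = evens.map(lambda x: x * x)

    # Actions: trigger execution and return results to the driver.
    print(squares.collect())                    # [4, 16, 36]
    print(squares.reduce(lambda a, b: a + b))   # 56
    print(numbers.count())                      # 6

    sc.stop()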
