Big Data Analytics (University of Mumbai)


Big Data Analysis Question Bank Solution

Q1. Hadoop core components

Hadoop is an open-source framework, written in Java, that uses large clusters of commodity hardware to store and process very large datasets. It is built around the MapReduce programming model introduced by Google. Many large companies, e.g. Facebook, Yahoo, Netflix and eBay, use Hadoop to deal with big data. The Hadoop architecture consists of four main components:
 MapReduce
 HDFS (Hadoop Distributed File System)
 YARN (Yet Another Resource Negotiator)
 Common Utilities (Hadoop Common)

Let’s understand the role of each of these components in detail.


1. MapReduce
MapReduce is a programming model that runs on top of the YARN framework. Its main purpose is to perform distributed, parallel processing across a Hadoop cluster, which is what makes Hadoop fast: when you are dealing with big data, serial processing is no longer practical. MapReduce consists of two tasks that run in separate phases: the Map phase first, followed by the Reduce phase.


The input is first provided to the Map() function, its output is then used as the input to the Reduce() function, and the result of Reduce() is the final output. Let’s look at what Map() and Reduce() actually do.

Because we are dealing with big data, the input is a large set of data blocks. The Map() function breaks these data blocks into tuples, i.e. key-value pairs. These key-value pairs are then sent as input to Reduce(). The Reduce() function groups the tuples by key, performs operations such as sorting, summation or other aggregations on each group, and sends the result to the final output.

The exact processing done in the Reducer depends on the business requirement, but in every job Map() runs first and Reduce() runs on its output.
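As a minimal illustration of this map-then-reduce flow, here is a plain-Python sketch (not actual Hadoop API code) of the classic word count job, where the map function emits (word, 1) pairs and the reduce function sums the counts for each key:

from collections import defaultdict

def map_fn(line):
    # Map phase: split the input into key-value pairs of the form (word, 1).
    for word in line.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Reduce phase: aggregate all values that share the same key.
    return (key, sum(values))

def run_job(lines):
    # Shuffle/sort step: group the intermediate pairs by key before reducing.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return [reduce_fn(k, v) for k, v in groups.items()]

print(run_job(["big data big cluster", "data data"]))
# [('big', 2), ('data', 3), ('cluster', 1)]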

2. HDFS

HDFS (Hadoop Distributed File System) is the storage layer of Hadoop. It is designed to run on commodity (inexpensive) hardware and follows a distributed file system design. HDFS prefers to store data in a small number of large blocks rather than in many small blocks.
HDFS provides fault tolerance and high availability to the storage layer and to the other nodes in the Hadoop cluster. The data storage nodes in HDFS are:
 NameNode (Master)
 DataNode (Slave)
NameNode: The NameNode works as the master of a Hadoop cluster and guides the DataNodes (slaves). It stores the metadata, i.e. the data about the data. Metadata includes the transaction logs that track user activity in the cluster, as well as file names, sizes and the location information (block numbers, block IDs) of the DataNodes, which the NameNode uses to find the closest DataNode for faster communication. The NameNode instructs the DataNodes to perform operations such as create, delete and replicate.
DataNode: DataNodes work as slaves and are mainly used for storing the data in a Hadoop cluster; the number of DataNodes can range from one to 500 or more. The more DataNodes there are, the more data the cluster can store, so DataNodes should have high storage capacity to hold a large number of file blocks.


3. YARN (Yet Another Resource Negotiator)

YARN is the framework on which MapReduce runs. YARN performs two functions: job scheduling and resource management. The job scheduler divides a big task into smaller jobs so that each job can be assigned to different slaves in the Hadoop cluster and processing can be maximized. It also keeps track of which jobs are important, which have higher priority, the dependencies between jobs, and other information such as job timing. The resource manager manages all the resources that are made available for running the Hadoop cluster.

4. Hadoop Common (Common Utilities)

Hadoop Common is the set of Java libraries and files that all the other components of a Hadoop cluster need. These utilities are used by HDFS, YARN, and MapReduce to run the cluster. Hadoop assumes that hardware failures in a cluster are common, so they must be handled automatically in software by the Hadoop framework.
Q2. Hadoop ecosystem


The Hadoop Ecosystem is a platform designed to solve big data problems through a suite of tools and services. It includes Apache projects and various commercial tools that collectively provide data ingestion, storage, analysis, and maintenance. The core components of Hadoop are HDFS, YARN, MapReduce, and Hadoop Common, which are supplemented by the other tools in the ecosystem.

Core Components:

1. HDFS (Hadoop Distributed File System): The primary storage system of Hadoop, distributing and
storing large datasets across multiple nodes. It consists of Name Nodes (metadata) and Data Nodes
(actual data).
2. YARN (Yet Another Resource Negotiator): Manages and allocates resources across the Hadoop
clusters. It includes Resource Manager, Node Manager, and Application Manager, ensuring efficient
resource utilization.
3. MapReduce: A programming model for processing large datasets in parallel across a Hadoop cluster. It
involves two main functions:
o Map(): Sorts and filters data, producing key-value pairs.
o Reduce(): Aggregates the results from Map() into a summarized output.

Supplementary Tools:

1. Apache Spark: An in-memory data processing framework for real-time data processing, faster than traditional
batch processing.
2. PIG: A platform developed by Yahoo for analyzing large datasets with a SQL-like language (Pig Latin). It abstracts
the complexities of MapReduce.
3. HIVE: A data warehousing tool for querying and managing large datasets using HQL (Hive Query Language),
similar to SQL.
4. HBase: A NoSQL database designed for quick retrieval and storage of large datasets, inspired by Google’s
BigTable.
5. Mahout: Provides machine learning capabilities with libraries for clustering, classification, and collaborative
filtering.

Additional Components:

 Solr, Lucene: Tools for searching and indexing data. Lucene is Java-based and supports spell checking.
 Zookeeper: Manages coordination, synchronization, and communication between Hadoop components.
 Oozie: A job scheduler for organizing and executing Hadoop jobs either sequentially (Oozie Workflow) or in
response to external triggers (Oozie Coordinator).

In summary, the Hadoop Ecosystem revolves around the efficient management and processing of large datasets,
with each component playing a specific role in the data lifecycle.


Q3. Matrix multiplication using MapReduce

Map task: the mapper converts the input matrix entries into key-value pairs. To compute P = M × N, where M is an i × j matrix and N is a j × k matrix, the mapper emits for every element m(i, j) of M the pairs ((i, k), (M, j, m(i, j))) for each column k of N, and for every element n(j, k) of N the pairs ((i, k), (N, j, n(j, k))) for each row i of M. The key (i, k) identifies one cell of the result matrix.

Shuffle/combine: all pairs with the same key (i, k) are grouped together, so each reducer receives everything it needs to compute one cell of the output.

Reduce task: for each key (i, k), the reducer matches the M-values and N-values on their common index j, multiplies each matching pair m(i, j) × n(j, k), and sums the products to obtain the element p(i, k) of the result matrix P.
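A compact plain-Python sketch of this algorithm (the matrices and their sizes below are illustrative, not taken from the original worked example):

from collections import defaultdict

# M is an I x J matrix, N is a J x K matrix, given as dense nested lists.
M = [[1, 2],
     [3, 4]]
N = [[5, 6],
     [7, 8]]
I, J, K = len(M), len(N), len(N[0])

def map_phase():
    # Emit ((i, k), ('M', j, m_ij)) and ((i, k), ('N', j, n_jk)) pairs.
    for i in range(I):
        for j in range(J):
            for k in range(K):
                yield (i, k), ('M', j, M[i][j])
    for j in range(J):
        for k in range(K):
            for i in range(I):
                yield (i, k), ('N', j, N[j][k])

def reduce_phase(groups):
    # For each result cell (i, k), join the M and N values on j and sum the products.
    P = [[0] * K for _ in range(I)]
    for (i, k), values in groups.items():
        m_vals = {j: v for tag, j, v in values if tag == 'M'}
        n_vals = {j: v for tag, j, v in values if tag == 'N'}
        P[i][k] = sum(m_vals[j] * n_vals[j] for j in m_vals if j in n_vals)
    return P

groups = defaultdict(list)
for key, value in map_phase():
    groups[key].append(value)
print(reduce_phase(groups))   # [[19, 22], [43, 50]]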

Q4. Relational algebra operations using MapReduce

Selection (σ):
To compute σC(R), each mapper reads tuples of R and checks the selection condition C (for example, B < 2). For every tuple t that satisfies the condition, the mapper emits the pair (t, t); tuples that fail the condition are discarded. The reducer is the identity: for each key t it simply emits (t, t), so the output relation contains exactly the tuples that satisfy the condition.
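A plain-Python sketch of selection with the condition B < 2 on a relation R(A, B); the sample tuples are illustrative:

def map_select(t):
    # Emit (t, t) only if the tuple satisfies the condition B < 2.
    a, b = t
    if b < 2:
        yield (t, t)

def reduce_select(key, values):
    # Identity reduce: pass the selected tuple through unchanged.
    yield key

R = [(1, 2), (3, 1), (2, 1), (4, 5)]
selected = []
for t in R:
    for key, _ in map_select(t):
        selected.extend(reduce_select(key, [key]))
print(selected)   # [(3, 1), (2, 1)]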

Projection (π):
To compute πS(R), where S is the subset of attributes to be kept, each mapper takes a tuple t, constructs t′ by keeping only the attributes in S, and emits the pair (t′, t′). Because different tuples of R can produce the same t′, the reducer receives (t′, [t′, t′, ...]) and emits (t′, t′) only once, which eliminates the duplicates.
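A plain-Python sketch of projection onto the first two attributes of a relation R(A, B, C); the attribute choice and sample data are illustrative:

from collections import defaultdict

def map_project(t, keep=(0, 1)):
    # Keep only the chosen attributes and emit (t', t').
    t_prime = tuple(t[i] for i in keep)
    yield (t_prime, t_prime)

def reduce_project(key, values):
    # Emit each projected tuple once, eliminating duplicates.
    yield key

R = [(1, 2, 3), (1, 2, 4), (2, 3, 3)]
groups = defaultdict(list)
for t in R:
    for k, v in map_project(t):
        groups[k].append(v)
print([out for k, v in groups.items() for out in reduce_project(k, v)])
# [(1, 2), (2, 3)]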

Union (∪):
To compute R ∪ S, mappers read tuples from both relations and emit (t, t) for every tuple t, whether it comes from R or from S. Each key t reaches the reducer with either one or two values; the reducer emits (t, t) exactly once, so duplicates across the two relations are removed.
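A plain-Python sketch of union over two relations with the same schema; the sample tuples are illustrative:

from collections import defaultdict

def map_union(t):
    # Emit (t, t) for every tuple, regardless of which relation it came from.
    yield (t, t)

def reduce_union(key, values):
    # Emit the tuple once, even if it appeared in both relations.
    yield key

R = [(1, 2), (2, 3)]
S = [(2, 3), (4, 1)]
groups = defaultdict(list)
for t in R + S:
    for k, v in map_union(t):
        groups[k].append(v)
print([out for k, v in groups.items() for out in reduce_union(k, v)])
# [(1, 2), (2, 3), (4, 1)]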

Intersection (∩):
To compute R ∩ S, i.e. the tuples common to both relations, the mappers again emit (t, t) for every tuple of R and of S. At the reducer, a key t has two values only if it appeared in both relations; in that case the reducer emits (t, t), otherwise it emits nothing.
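A plain-Python sketch of intersection (the sample tuples are illustrative; R and S are assumed duplicate-free, as in set semantics):

from collections import defaultdict

def map_intersect(t):
    # Emit (t, t) for every tuple of either relation.
    yield (t, t)

def reduce_intersect(key, values):
    # Emit the tuple only if it was seen in both relations (two values).
    if len(values) == 2:
        yield key

R = [(1, 2), (2, 3)]
S = [(2, 3), (4, 1)]
groups = defaultdict(list)
for t in R + S:
    for k, v in map_intersect(t):
        groups[k].append(v)
print([out for k, v in groups.items() for out in reduce_intersect(k, v)])
# [(2, 3)]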

Difference (−):
To compute R − S, the mapper emits (t, R) for every tuple t of R and (t, S) for every tuple t of S, so the value records which relation the tuple came from. The reducer emits (t, t) only when the list of values for t contains R but not S; tuples that also appear in S are eliminated.
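A plain-Python sketch of the difference R − S; the relation tags and sample tuples are illustrative:

from collections import defaultdict

def map_diff(t, relation):
    # Tag every tuple with the name of the relation it came from.
    yield (t, relation)

def reduce_diff(key, values):
    # Keep the tuple only if it occurs in R and not in S.
    if 'R' in values and 'S' not in values:
        yield key

R = [(1, 2), (2, 3), (3, 1)]
S = [(2, 3), (4, 5)]
groups = defaultdict(list)
for t in R:
    for k, v in map_diff(t, 'R'):
        groups[k].append(v)
for t in S:
    for k, v in map_diff(t, 'S'):
        groups[k].append(v)
print([out for k, v in groups.items() for out in reduce_diff(k, v)])
# [(1, 2), (3, 1)]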

Natural join (⋈):
To join R(A, B) with S(B, C) on the common attribute B, the mapper emits, for every tuple (a, b) of R, the pair (b, (R, a)), and for every tuple (b, c) of S, the pair (b, (S, c)). For each value of the join key b, the reducer receives all the A-values coming from R and all the C-values coming from S; it forms every combination of them and emits the joined tuples (a, b, c). If a value of b occurs in only one of the two relations, nothing is emitted because those tuples cannot be joined.
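A plain-Python sketch of the natural join R(A, B) ⋈ S(B, C); the sample tuples are illustrative:

from collections import defaultdict

def map_join_R(t):
    # Emit the join attribute B as the key, tagged with the source relation.
    a, b = t
    yield (b, ('R', a))

def map_join_S(t):
    b, c = t
    yield (b, ('S', c))

def reduce_join(key, values):
    # Combine every A-value from R with every C-value from S for this value of B.
    a_vals = [v for tag, v in values if tag == 'R']
    c_vals = [v for tag, v in values if tag == 'S']
    for a in a_vals:
        for c in c_vals:
            yield (a, key, c)

R = [(1, 2), (3, 2), (4, 5)]   # R(A, B)
S = [(2, 7), (5, 8)]           # S(B, C)
groups = defaultdict(list)
for t in R:
    for k, v in map_join_R(t):
        groups[k].append(v)
for t in S:
    for k, v in map_join_S(t):
        groups[k].append(v)
print([out for k, v in groups.items() for out in reduce_join(k, v)])
# [(1, 2, 7), (3, 2, 7), (4, 5, 8)]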

Grouping and aggregation (γ):
Given a relation R(A, B), where A is the grouping attribute and an aggregate function θ (such as SUM, MAX or AVERAGE) is to be applied to B, the mapper emits (a, b) for every tuple (a, b) of R. The shuffle phase collects all the B-values that share the same A-value, and the reducer applies θ to each group and emits (a, θ(b1, b2, ...)) as the aggregated result for that group.
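A plain-Python sketch of grouping with a SUM aggregate on R(A, B); the sample tuples are illustrative:

from collections import defaultdict

def map_group(t):
    # The grouping attribute A becomes the key; the aggregated attribute B the value.
    a, b = t
    yield (a, b)

def reduce_group(key, values):
    # Apply the aggregate function (SUM here) to each group of B-values.
    yield (key, sum(values))

R = [(1, 3), (2, 4), (1, 5), (3, 2)]   # group by A, SUM(B)
groups = defaultdict(list)
for t in R:
    for k, v in map_group(t):
        groups[k].append(v)
print([out for k, v in groups.items() for out in reduce_group(k, v)])
# [(1, 8), (2, 4), (3, 2)]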

Q5. Difference between traditional data and big data

Traditional Data | Big Data
Generated within the enterprise. | Generated largely outside the enterprise.
Volume ranges from gigabytes to terabytes. | Volume ranges from petabytes to zettabytes or exabytes.
Deals with structured data only. | Deals with structured, semi-structured and unstructured data.
Generated per hour or per day. | Generated far more frequently, often per second.
Source is centralized and managed in centralized form. | Source is distributed and managed in distributed form.
Data integration is very easy. | Data integration is very difficult.
Normal system configuration is capable of processing it. | High (distributed) system configuration is required to process it.
The size of the data is very small. | The size is far larger than traditional data.
Traditional database tools are sufficient for any database operation. | Special database tools are required to perform schema-based operations.
Normal functions can manipulate the data. | Special functions are needed to manipulate the data.
Data model is strict schema-based and static. | Data model is flat schema-based and dynamic.
Data is stable with known inter-relationships. | Data is not stable and relationships are unknown.
Data is in manageable volume. | Data is in huge volume, which becomes unmanageable with traditional tools.
Easy to manage and manipulate the data. | Difficult to manage and manipulate the data.
Sources include ERP transaction data, CRM transaction data, financial data, organizational data, web transaction data, etc. | Sources include social media, device data, sensor data, video, images, audio, etc.

Q6. Difference between RDBMS and NoSQL

Relational Database (RDBMS) | NoSQL
Used to handle data arriving at low velocity. | Used to handle data arriving at high velocity.
Gives only read scalability. | Gives both read and write scalability.
Manages structured data. | Manages all types of data.
Data arrives from one or few locations. | Data arrives from many locations.
Supports complex transactions. | Supports simple transactions.
Has a single point of failure. | Has no single point of failure.
Handles data in smaller volumes. | Handles data in high volumes.
Transactions are written in one location. | Transactions are written in many locations.
Supports ACID properties. | Does not support ACID properties.
Difficult to make changes to the database once it is defined. | Enables easy and frequent changes to the database.
A schema is mandatory to store the data. | Schema design is not required.
Deployed (scaled) vertically. | Deployed (scaled) horizontally.

Q7. NoSQL Business Drivers

The business drivers for NoSQL databases are centered around the need to manage and process large, fast-
changing, and complex datasets that traditional RDBMSs struggle to handle. Here’s a brief explanation of each
driver with an example:

1. Volume:
o Explanation: As data volume grows exponentially, traditional RDBMSs can't scale effectively
to handle such large datasets. NoSQL databases, with their ability to scale horizontally across
multiple servers, are designed to manage big data efficiently.
o Example: A social media platform like Facebook needs to store and query petabytes of user
data, including posts, comments, and media. NoSQL solutions like Cassandra allow them to
handle this massive volume across distributed servers.
2. Velocity:
o Explanation: The speed at which data is generated and needs to be processed is critical for real-
time applications. Traditional databases may struggle with rapid data insertions and queries,
especially under heavy loads.
o Example: An online retail site like Amazon needs to process millions of transactions and
customer interactions in real-time. A NoSQL database like MongoDB can handle high-velocity
data with minimal delays, ensuring quick responses.
3. Variability:
o Explanation: Data in modern applications often comes in various forms, such as structured,
semi-structured, and unstructured. Traditional databases require rigid schemas, making it
difficult to handle diverse data types.
o Example: A news aggregator like Google News needs to ingest and organize data from different
sources, including articles, videos, and social media feeds. A NoSQL database like Couchbase
allows them to store and manage this diverse data without predefined schemas.
4. Agility:
o Explanation: The ability to quickly adapt to changes in application requirements is crucial for
modern businesses. Traditional RDBMSs require complex schema changes and object-relational
mappings, slowing down development.
o Example: A startup developing a new mobile app can use a NoSQL database like Firebase to
rapidly iterate and evolve their data models as the app's features change, enabling faster time to
market.

These drivers make NoSQL databases attractive for businesses needing flexibility, scalability, and speed in
handling large and diverse datasets.

Q8. NoSQL Data Architecture Patterns

An architecture pattern is a logical way of categorizing how data will be stored in a database. NoSQL is a class of databases that supports operations on big data and stores it in flexible formats; it is widely used because of its flexibility and the wide variety of services it supports.
Architecture Patterns of NoSQL:
Data is stored in NoSQL in one of the following four data architecture patterns.


1. Key-Value Store Database


2. Column Store Database
3. Document Database
4. Graph Database
These are explained as following below.
1. Key-Value Store Database:
This model is one of the most basic models of NoSQL databases. As the name suggests, the data
is stored in form of Key-Value Pairs. The key is usually a sequence of strings, integers or
characters but can also be a more advanced data type. The value is typically linked or co-related
to the key. The key-value pair storage databases generally store data as a hash table where each
key is unique. The value can be of any type (JSON, BLOB(Binary Large Object), strings, etc).
This type of pattern is usually used in shopping websites or e-commerce applications.
Advantages:
 Can handle large amounts of data and heavy load,
 Easy retrieval of data by keys.
Limitations:
 Complex queries that span multiple key-value pairs can be slow.
 Many-to-many relationships are difficult to model.
Examples:
 DynamoDB
 Berkeley DB

2. Column Store Database:

Rather than storing data in relational tuples, the data is stored in individual cells that are grouped into columns. Column-oriented databases operate on columns and store large amounts of column data together. The format and titles of the columns can differ from one row to another, and every column is treated separately; related columns are grouped into column families, which play a role similar to tables in traditional databases.
Basically, the column is the unit of storage in this type of database.
Advantages:
 Data is readily available
 Queries like SUM, AVERAGE, COUNT can be easily performed on columns.
Examples:
 HBase


 Bigtable by Google
 Cassandra

3. Document Database:
A document database fetches and stores data as key-value pairs, but here the values are called documents. A document is a complex data structure: it can be text, an array, a string, JSON, XML or a similar format, and nested documents are common. This model is very effective because most data created today is semi-structured or unstructured, often in JSON form, as illustrated in the sketch below.
Advantages:
 This type of format is very useful and apt for semi-structured data.
 Storage retrieval and managing of documents is easy.
Limitations:
 Handling multiple documents is challenging
 Aggregation operations may not work accurately.
Examples:
 MongoDB
 CouchDB


Figure – Document Store Model in form of JSON documents
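As a small sketch (the record and field names are made up for illustration), the same data can be held as an opaque value in a key-value store or as a nested, queryable document in a document store:

import json

# Key-value style: the value is an opaque blob that only the application can interpret.
kv_store = {
    "user:101": json.dumps({"name": "Asha", "orders": [{"id": 1, "total": 250}]}),
}

# Document style: the value is a structured, possibly nested document that the
# database itself can index and query (for example on orders.total).
doc_store = {
    "user:101": {
        "name": "Asha",
        "orders": [{"id": 1, "total": 250}],
    },
}

print(json.loads(kv_store["user:101"])["name"])      # Asha
print(doc_store["user:101"]["orders"][0]["total"])   # 250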


4. Graph Databases:
Clearly, this architecture pattern deals with the storage and management of data as graphs. Graphs are structures that depict connections between two or more objects in the data. The objects or entities are called nodes and are joined by relationships called edges; each edge has a unique identifier, and each node serves as a point of contact in the graph. This pattern is commonly used in social networks, where there are a large number of entities and each entity has one or more characteristics connected by edges. Whereas the tables of a relational database are only loosely connected, the relationships in a graph are explicit and central to the model.
Advantages:
 Fastest traversal because of connections.
 Spatial data can be easily handled.
Limitations:
 Wrong connections may lead to infinite loops.
Examples:
 Neo4J
 FlockDB( Used by Twitter)

Figure – Graph model format of NoSQL Databases


Q9. Types of Big Data

Big Data refers to the vast volume of data generated every second by various digital processes, systems, and
devices. This data is so large and complex that traditional data processing tools and techniques are insufficient
to manage it. The concept of Big Data encompasses not just the amount of data but also the speed at which it is
generated (velocity), the variety of data types, and the complexities in managing and analyzing this data to
extract valuable insights.

Types of Big Data

1. Structured Data:
o Definition: Structured data is highly organized and easily searchable, typically stored in
databases in a tabular format (rows and columns). It is data that follows a predefined schema,
making it easy to enter, query, and analyze using tools like SQL.
o Examples:
 Customer information in a CRM system.
 Financial records such as transactions and invoices.
o Characteristics:
 Fixed schema.
 Easily manageable and queryable.
 Limited flexibility for changes.
2. Semi-Structured Data:
o Definition: Semi-structured data does not conform to a fixed schema like structured data but still
has some organizational properties, such as tags or markers that separate data elements. It is
often stored in formats like JSON, XML, or NoSQL databases.
o Examples:
 JSON or XML files.
 Metadata from digital media files.
 Data from social media feeds.
o Characteristics:
 More flexible than structured data.
 Easier to adapt to changing requirements.
 Often used to store data that doesn't fit neatly into tables.
3. Unstructured Data:
o Definition: Unstructured data is raw, unorganized data that doesn't fit into traditional databases
or structured formats. It lacks a predefined data model and is usually more challenging to
analyze.
o Examples:
 Text documents, emails, and PDFs.
 Images, videos, and audio files.
 Social media posts, blogs, and comments.
o Characteristics:
 No fixed schema or structure.
 Requires more advanced tools and techniques for analysis.
 Can contain valuable insights but is more challenging to process.


Q10.Characteristics of Big data

Big Data is defined by several key characteristics that help to distinguish it from traditional data processing and
storage systems. These characteristics are often referred to as the "6 Vs" of Big Data:

1. Volume:
o Description: Volume refers to the enormous amount of data generated and collected. The size of
data is a critical factor in determining whether it qualifies as Big Data. With the exponential
growth of data, managing and analyzing such large volumes becomes challenging.
o Example: In 2016, global mobile traffic was estimated at 6.2 Exabytes per month. By 2020, this
figure had grown to nearly 40,000 Exabytes.
2. Velocity:
o Description: Velocity refers to the speed at which data is generated, collected, and processed.
Big Data often involves real-time or near-real-time data flows from various sources such as
social media, sensors, and mobile devices. The rapid influx of data requires quick processing to
derive actionable insights.
o Example: Google handles more than 3.5 billion searches per day, and Facebook's user base
grows by approximately 22% annually.
3. Variety:
o Description: Variety refers to the different types of data that are generated and processed. Big
Data includes structured, semi-structured, and unstructured data from a wide range of sources,
both internal and external to an organization. Managing this diverse data is crucial for gaining
comprehensive insights.
o Types of Data:
 Structured Data: Organized data, such as relational databases.
 Semi-Structured Data: Data that does not conform to a strict schema, such as JSON or
XML files.
 Unstructured Data: Data that lacks a defined structure, such as text, images, and videos.
o Example: Log files are an example of semi-structured data, while social media posts and
multimedia files represent unstructured data.
4. Veracity:
o Description: Veracity refers to the quality, accuracy, and trustworthiness of data. Big Data often
comes with inconsistencies, noise, and errors, making it difficult to ensure data quality. High
veracity is essential for making reliable decisions based on data.
o Example: A large volume of data may lead to confusion, while insufficient data might result in
incomplete or misleading information.
5. Value:
o Description: Value is about the usefulness of data. Data in itself has no inherent value unless it
is processed and analyzed to generate meaningful insights that can drive business decisions.
Extracting value from Big Data is one of the primary goals of Big Data initiatives.
o Example: A company may collect vast amounts of customer data, but its value lies in how
effectively the company can use this data to enhance customer experience and drive sales.
6. Variability:
o Description: Variability refers to the changing nature or meaning of data over time. The
structure and interpretation of data can vary significantly, adding complexity to Big Data
management and analysis. This variability can affect the consistency and accuracy of data-driven
insights.
o Example: Imagine eating the same brand of ice cream daily, but the taste changes each time.
This inconsistency is akin to the variability in Big Data.

These characteristics collectively define Big Data and highlight the challenges and opportunities associated with
managing and analyzing such vast and complex datasets.
