BIG Data Analytics

1. The document discusses the classification, characteristics, and definition of big data. Digital data can be structured, semi-structured, or unstructured. 2. Characteristics of big data include the five V's: volume, variety, velocity, veracity, and value. Big data has a very large volume from many sources, exists in various structured and unstructured forms, is created and processed at high speeds, requires validation of its reliability, and is most useful when it has value. 3. Big data is defined as large and complex datasets that cannot be processed by traditional data processing applications. It has evolved from the era of mainframes and structured data to include vast amounts of data from many sources in

Uploaded by

Pawan

Big data Analytics using R B.com (VI-Sem)

UNIT I

INTRODUCTION TO BIGDATA

Data, classification of digital data (structured, unstructured, semi-structured data),
characteristics of data, evolution of big data, definition and challenges of big data,
what is big data and why to use big data?, business intelligence vs big data.

1) What is data? Explain the classification of digital data.

Irrespective of the size of the enterprise whether it is big or small, data continues to
be a precious and irreplaceable asset.
Data:
Data is present in homogeneous sources as well as in heterogeneous sources. The
need of the hour is to understand, manage, process, and take the data for analysis to draw
valuable insights. Digital data can be structured, semi-structured or unstructured data.
Data generates information and from information we can draw valuable insight.
The digital data can be broadly classified into 3 types.
They are
I. structured,
II. semi-structured, and
III. unstructured data.

Fig. Classification of digital data: Structured Data, Semi-structured Data, Unstructured Data

1. Structured data:
• When data follows a pre-defined schema/structure, we say it is structured data.
• This is data which is in an organized form (e.g., in rows and columns) and can be easily
used by a computer program.
• Relationships exist between entities of data, such as classes and their objects.
• About 10% of an organization's data is in this format.
• Data stored in databases is an example of structured data.

2. Semi-structured data:
• Semi-structured data is also referred to as a self-describing structure.
• This is data which does not conform to a data model but has some structure.
• However, it is not in a form which can be used easily by a computer program.
• About 10% of an organization's data is in this format; for example, HTML, XML, JSON,
email data, etc.
3. Unstructured data:
• This is data which does not conform to a data model or is not in a form which can
be used easily by a computer program.
• About 80% of an organization's data is in this format; for example, images, videos,
audio, chat rooms, memos, PowerPoint presentations, body of an email, etc.
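
The three classes above can be illustrated with a minimal Python sketch. The sample records, field names and variable names are invented for illustration and are not from the source; only the standard library is used.

```python
import csv
import io
import json

# Structured: every row follows the same pre-defined schema (columns).
structured_text = "customer_id,name,city\n101,Asha,Pune\n102,Ravi,Delhi\n"
rows = list(csv.DictReader(io.StringIO(structured_text)))

# Semi-structured: self-describing keys, but records need not share the same fields.
semi_structured_text = (
    '[{"name": "Asha", "email": "asha@example.com"},'
    ' {"name": "Ravi", "phones": ["111", "222"]}]'
)
records = json.loads(semi_structured_text)

# Unstructured: free text with no data model; until further processing is applied,
# a program can only treat it as raw characters.
unstructured_text = "Hi team, the delivery to Pune was late again. Please check."

print(rows[0]["city"])                                      # value looked up by column name
print(sorted(set().union(*(r.keys() for r in records))))    # fields vary from record to record
print(len(unstructured_text.split()))                       # only crude measures such as word count
```

The structured rows can be queried by column name straight away, the JSON records have to be inspected to find out which fields each one carries, and the free text offers nothing beyond raw characters without further processing.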

2) What are the characteristics of data?

Data has three key characteristics:
1. Composition
2. Condition
3. Context

1. Composition: The composition of data deals with the structure of data, that is, the sources
of data, the granularity, the types, and the nature of data as to whether it is static or
real-time streaming.
2. Condition: The condition of data deals with the state of data, that is, "Can one use this
data as is for analysis?" or "Does it require cleaning for further enhancement and
enrichment?"
3. Context: The context of data deals with "Where has this data been generated?", "Why was
this data generated?", "How sensitive is this data?", "What are the events associated with
this data?" and so on.

• Small data (data as it existed prior to the big data revolution) is about certainty. It is
about known data sources; it is about no major changes to the composition or context of data.

Fig. Characteristics of data: Composition, Condition, Context (Big Data and Analytics)

• Big data is about complexity.
• Complexity in terms of multiple and unknown datasets, in terms of exploding
volume, in terms of the speed at which the data is being generated and the speed at which
it needs to be processed, and in terms of the variety of data (internal or external,
behavioural or social) that is being generated.

3) What are the characteristics of big data?

Big data consists of a large amount of data that cannot be processed by traditional data storage
or processing units. Many multinational companies use it to process data and run the
business of many organizations. The data flow would exceed 150 exabytes per day before
replication.

The characteristics of big data are explained by the five V's.

5 V's of Big Data

1. Volume
2. Veracity
3. Variety
4. Value
5. Velocity

1. Volume

The name Big Data itself is related to an enormous size. Big Data is a vast ‘volume’ of data
generated from many sources daily, such as business processes, machines, social media
platforms, networks, human interactions, and many more.

Facebook, for example, handles approximately a billion messages, records about 4.5 billion
clicks of the "Like" button, and receives more than 350 million new posts each day. Big data
technologies can handle such large amounts of data.

2. Variety

Big data can be structured, unstructured, or semi-structured, and it is collected from
different sources. In the past, data was collected only from databases and spreadsheets, but
these days data comes in an array of forms, such as PDFs, emails, audio, social media posts,
photos, videos, etc.

The data is categorized as below:

a. Structured data: Structured data has a defined schema with all the required columns and
is in tabular form. It is stored in a relational database management system. OLTP
(Online Transaction Processing) systems are built to work with structured data, which
is stored in relations, i.e., tables.
b. Semi-structured data: In semi-structured data, the schema is not strictly defined,
e.g., JSON, XML, CSV, TSV, and email.
c. Unstructured data: All unstructured files, such as log files, audio files, and image files,
are included in unstructured data. Some organizations have a great deal of data available
but do not know how to derive value from it because the data is raw.
d. Quasi-structured data: This is textual data with inconsistent formats that can be
formatted with effort, time, and the help of some tools.

Example: Web server logs, i.e., log files created and maintained by a server that
contain a list of activities.
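
As a rough illustration of how quasi-structured data is formatted with a tool, the Python sketch below extracts fields from one web server log line using a regular expression. The log line is invented, its layout assumes the widely used Common Log Format, and the names LOG_PATTERN and parse_log_line are hypothetical.

```python
import re

# Hypothetical log line, assumed to follow the Common Log Format.
LOG_LINE = '192.168.1.10 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'

LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_log_line(line):
    """Return the fields of one log line as a dict, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

entry = parse_log_line(LOG_LINE)
print(entry)
```

Lines that do not fit the pattern return None, which reflects the inconsistent formats that make this kind of data quasi-structured rather than structured.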

3. Veracity

Veracity refers to how reliable the data is. There are many ways to filter or translate the data.
Veracity is about being able to handle and manage data efficiently. Big data is also
essential in business development.

For example, Facebook posts with hashtags.

4. Value

Value is an essential characteristic of big data. What matters is not merely the data that we
process or store; it is the valuable and reliable data that we store, process, and analyze.

5. Velocity

Velocity plays an important role compared to the other characteristics. Velocity is the speed
at which data is created in real time. It covers the speed of incoming data sets, the rate of
change, and bursts of activity. A primary aspect of big data is to provide the data that is in
demand rapidly.

Big data velocity deals with the speed at which data flows from sources like application logs,
business processes, networks, social media sites, sensors, mobile devices, etc.
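
The ideas of arrival speed and activity bursts can be sketched in a few lines of Python. The arrival times below are invented, and the function names are hypothetical; this is only one simple way to measure a rate.

```python
from collections import Counter

# Hypothetical arrival times (in seconds) of events from a source such as an application log.
arrival_times = [0.1, 0.4, 0.9, 1.2, 2.0, 2.1, 2.2, 2.3, 2.8, 3.5]

def events_per_second(timestamps):
    """Count how many events arrived in each whole-second bucket."""
    return Counter(int(t) for t in timestamps)

def average_rate(timestamps):
    """Average arrival rate over the observed span, in events per second."""
    span = max(timestamps) - min(timestamps)
    return len(timestamps) / span if span > 0 else float(len(timestamps))

per_second = events_per_second(arrival_times)
busiest_second, burst_size = per_second.most_common(1)[0]
print(dict(per_second))
print(busiest_second, burst_size)
print(round(average_rate(arrival_times), 2))
```

The per-second counts show the burst (the bucket with the most events), while the average rate summarizes the overall speed of the incoming data.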

4) What is big data? Explain the evolution of big data.

Big Data:
Big data is a large set of structured, semi-structured, and unstructured data. It is data that
arrives at a much higher volume, at a much faster rate, in a wider variety of file formats, and
from a wider variety of sources, than that of structured data alone.

The term ‘big data’ has been around since the late 1990s, when it was coined by NASA
researchers in the paper "Application-Controlled Demand Paging for Out-of-Core Visualization".

They used the term to describe the challenge of processing and visualizing vast amounts of
data from supercomputers.

Three primary components are still in use today to describe big data:

Volume (size of data),
Velocity (speed at which data grows), and
Variety (number of data types and sources from which the data comes).
The History and Evolution of Big Data:

• The 1970s and before was the era of mainframes. The data was essentially primitive and
structured.
• Relational databases evolved in the 1980s and 1990s. The era was of data-intensive
applications.
• The World Wide Web (WWW) and the Internet of Things (IoT) have led to an
onslaught of structured, unstructured, and multimedia data.
Refer to Table 1.1.

Table 1.1 The evolution of big data (Big Data and Analytics)

S.No.  Period           Technology
1      1940s to 1989    Data Warehousing and Personal Desktop Computers
2      1989 to 1999     Emergence of the World Wide Web
3      2000s to 2010s   Controlling Data Volume, Social Media and Cloud Computing
4      2010s to now     Optimization Techniques, Mobile Devices and IoT

1940s to 1989 – Data Warehousing and Personal Desktop Computers

• The origins of electronic storage can be traced back to the development of the world’s
first programmable computer, the Electronic Numerical Integrator and
Computer (ENIAC).
• Then, in 1954, Bell Labs completed TRADIC, one of the first transistorized computers.
• The first personal desktop computer to feature a Graphical User Interface (GUI) was
Lisa, released by Apple Computer in 1983.
• Throughout the 1980s, companies like Apple, Microsoft, and IBM released a
wide range of personal desktop computers.

1989 to 1999 – Emergence of the World Wide Web

• Between 1989 and 1993, British computer scientist Sir Tim Berners-Lee created
the fundamental technologies required to power what we now know as the World
Wide Web.
• These web technologies were HyperText Markup Language (HTML), Uniform
Resource Identifier (URI), and Hypertext Transfer Protocol (HTTP).
• As more devices gained access to the internet, there was a massive explosion in the
amount of information that people could access and share at any one time.

2000s to 2010s – Controlling Data Volume, Social Media and Cloud Computing

• During the early 2000s, companies such as Amazon, eBay, and Google helped
generate large amounts of web traffic, as well as a combination of structured and
unstructured data.
• Amazon also launched a beta version of AWS (Amazon Web Services) in 2002,
which opened the Amazon.com platform to all developers. By 2004, over 100
applications had been built for it.
• AWS then relaunched in 2006, offering a wide range of cloud infrastructure services,
including Simple Storage Service (S3) and Elastic Compute Cloud (EC2).
• The public launch of AWS attracted a wide range of customers, such as Dropbox,
Netflix, and Reddit, who were eager to become cloud-enabled and would all
partner with AWS before 2010.

2010s to now – Optimization Techniques, Mobile Devices and IoT

In the 2010s, the biggest challenge facing big data was the advent of mobile devices and the
IoT (Internet of Things).

The rise of mobile devices and IoT devices also led to new types of data being collected,
organized, and analyzed. Some examples include:

• Sensor data (data collected by internet-enabled sensors to provide valuable, real-time
insight into the inner workings of a piece of machinery)
• Social data (publicly available social media data from platforms like Facebook and
Twitter)
• Transactional data (data from online web stores including receipts, storage records,
and repeat purchases)
• Health-related data (heart rate monitors, patient records, medical history)
The Future of Big Data Solutions

A key development in big data technology is AI (Artificial Intelligence) and automation, both
of which are streamlining the process of database management and big data analysis, making it
easier to convert raw data into meaningful insights that make sense to key decision makers.

Another massive hurdle for big data is ethical concerns.

5) What is the Definition of Big Data? Explain.

• Big data is high-volume, high-velocity and high-variety information assets that demand
cost-effective, innovative forms of information processing for enhanced insight and decision
making.
• Big data refers to datasets whose size is typically beyond the storage capacity of, and
too complex for, traditional database software tools.
• Big data is anything beyond the human and technical infrastructure needed to support
storage, processing and analysis.
• It is data that is big in volume, velocity and variety. Refer to Figure 1.3.

Figure 1.3 Data: Big in volume, variety, and Velocity (Big Data and Analytics)

Variety: Data can be structured, semi-structured or unstructured. Data stored
in a database is an example of structured data. HTML data, XML data, email data and
CSV files are examples of semi-structured data. PowerPoint presentations, images,
videos, research, white papers, the body of an email, etc. are examples of unstructured data.
Velocity: Velocity essentially refers to the speed at which data is being created in real time.
We have moved from simple desktop applications like payroll applications to real-time
processing applications.

Volume: Volume can be in terabytes, petabytes or zettabytes.

Gartner Glossary: Big data is high-volume, high-velocity and/or high-variety information
assets that demand cost-effective, innovative forms of information processing that enable
enhanced insight and decision making.
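
To make the volume units mentioned above concrete, here is a small Python sketch that converts between them. It assumes decimal (SI) units, where each step is a factor of 1000; the helper names to_bytes and humanize are hypothetical.

```python
# Decimal (SI) storage units; each step up the list is a factor of 1000.
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def to_bytes(value, unit):
    """Convert a value expressed in the given unit to bytes."""
    return value * 1000 ** UNITS.index(unit)

def humanize(num_bytes):
    """Express a byte count in the largest unit that keeps the value at 1 or more."""
    value, index = float(num_bytes), 0
    while value >= 1000 and index < len(UNITS) - 1:
        value /= 1000
        index += 1
    return f"{value:g} {UNITS[index]}"

print(to_bytes(1, "PB") // to_bytes(1, "TB"))   # number of terabytes in one petabyte
print(humanize(to_bytes(2500, "TB")))           # the same volume in a larger unit
```

Under this assumption a petabyte is 1000 terabytes and a zettabyte is 1000 exabytes; binary units (factors of 1024) would give slightly different figures.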

For the sake of easy comprehension, we will look at the definition in three parts.

Part I of the definition: "Big data is high-volume, high-velocity, and high-variety
information assets" talks about voluminous data (humongous data) that may have great
variety (a good mix of structured, semi-structured, and unstructured data) and will require a
good speed/pace for storage, preparation, processing and analysis.

Part II of the definition: "cost effective, innovative forms of information processing" talks
about embracing new techniques and technologies to capture (ingest), store, process, persist,
integrate and visualize the high volume, high-velocity, and high-variety data.

Part III of the definition: "enhanced insight and decision making" talks about deriving
deeper, richer and meaningful insights and then using these insights to make faster and better
decisions to gain business value and thus a competitive edge.
Data —> Information —> Actionable intelligence —> Better decisions —>Enhanced
business value

Figure 1.4 Definition of big data – Gartner (Big Data and Analytics)

6) What are the Challenges of Big Data?

The definition "Big data is high-volume, high-velocity, and high-variety information assets"
talks about voluminous data (humongous data) that may have great variety (a good mix of
structured, semi-structured, and unstructured data) and will require a good speed/pace for
storage, preparation, processing and analysis.

Following are a few challenges with big data:

Figure 1.5 Challenges with big data (Big Data and Analytics)

Data volume: Data today is growing at an exponential rate, and this high tide of data will
continue to rise. The key questions are:
“Will all this data be useful for analysis?”,
“Do we work with all this data or a subset of it?”,
“How will we separate the knowledge from the noise?” etc.

Storage: Cloud computing is the answer to managing infrastructure for big data as far as
cost-efficiency, elasticity and easy upgrading / downgrading is concerned. This further
complicates the decision to host big data solutions outside the enterprise.

Data retention: How long should one retain this data? Some data may be required for long-term
decisions, but some data may quickly become irrelevant and obsolete.
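
A retention rule of this kind can be sketched as a simple filter in Python. The categories, retention periods and records below are assumed values chosen only for illustration; they are not from the source.

```python
from datetime import date, timedelta

# Hypothetical retention periods (in days) per category of data; assumed values.
RETENTION_DAYS = {"clickstream": 30, "transactions": 365 * 7}

def is_expired(category, created_on, today):
    """Return True if a record of this category is older than its retention period."""
    return today - created_on > timedelta(days=RETENTION_DAYS[category])

today = date(2024, 1, 1)
records = [
    ("clickstream", date(2023, 10, 1)),
    ("clickstream", date(2023, 12, 20)),
    ("transactions", date(2020, 5, 5)),
]
to_keep = [record for record in records if not is_expired(record[0], record[1], today)]
print(to_keep)
```

Short-lived data such as clickstream records drops out quickly, while data needed for long-term decisions is kept for years.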

Skilled professionals: In order to develop, manage and run those applications that generate
insights, organizations need professionals who possess a high-level proficiency in data
sciences.

Other challenges: Other challenges of big data are with respect to capture, storage, search,
analysis, transfer and security of big data.

Visualization: Big data refers to datasets whose size is typically beyond the storage capacity
of traditional database software tools. There is no explicit definition of how big a data set
should be for it to be considered big data. Data visualization (computer graphics) is becoming
popular as a separate discipline, but there are very few data visualization experts.

7) Explain the comparison of Business Intelligence vs Big Data.

Business Intelligence vs Big Data Comparison
The comparison is given below:

Comparison of objectives: Business Intelligence vs Big Data

1. Purpose
Business Intelligence: The purpose of Business Intelligence is to help the business make
better decisions.
Big Data: The main purpose of Big Data is to capture, process, and analyze data, both
structured and unstructured, to improve customer outcomes.

2. Ecosystem / Components
Business Intelligence: Operational systems, ERP databases, Data Warehouse, Dashboard, etc.
Big Data: Hadoop, Spark, R Server, Hive, HDFS, etc.

3. Tools
Business Intelligence: Tableau, Online analytical processing (OLAP), Data Warehousing,
Microsoft Power BI, Google Analytics, etc.
Big Data: Hadoop, Spark, Hive, Cloudera, etc.

4. Characteristics / Properties
Business Intelligence: The six features of Business Intelligence are location intelligence,
executive dashboards, “what if” analysis, interactive reports, metadata layer, and ranking
reports.
Big Data: Big data can be described by characteristics such as Volume, Variety, Velocity,
Veracity and Value.

5. Benefits
Business Intelligence: Helps in making better business decisions; faster and more accurate
reporting and analysis; increased revenues.
Big Data: Better decision making; fraud detection; storage, mining, and analysis of data;
cost savings.

6. Applied Fields
Business Intelligence: Social media, Healthcare, Gaming Industry, Food Industry, etc.
Big Data: The banking sector, Entertainment, Social media, Healthcare, Retail and
wholesale, etc.

Both BI and big data help to analyse data to gain insights and to view the relevant data.
Business intelligence and big data need to be synchronized and used together. They are not
the same thing, but they share many common goals. Many of the distinctions between
business intelligence and big data tend to be arbitrary.

Short answer questions

1) What is big data? Why use big data?

"Big data is high-volume, high-velocity, and high-variety information assets" talks about
voluminous data (humongous data) that may have great variety (a good mix of structured,
semi-structured, and unstructured data) and will require a good speed/pace for storage,
preparation, processing and analysis.
Some of the benefits of using big data for businesses are:

• Cost savings: Big data tools like Apache Hadoop, Spark, etc. bring cost-saving
benefits to businesses when they have to store large amounts of data.
• Time savings: Real-time in-memory analytics helps companies to collect data from
various sources and process it faster.
• Market understanding: Big data helps companies to understand the market
conditions, customer preferences, trends and opportunities.
• Customer acquisition and retention: Big data helps companies to refine their
marketing campaigns and techniques, provide targeted promotional information, and
improve customer loyalty programs.
• Innovation and product development: Big data helps companies to discover new
sources of revenue, solve problems, and create new products and services based on
customer feedback and demand.
• Competitive advantage: Big data helps companies to gain an edge over their
competitors by leveraging data-driven insights and strategies.

Big data is important for companies because it can help them achieve growth, efficiency,
profitability and customer satisfaction.

2) What is data? Explain structured data.

Big data:
Big data is high-volume, high-velocity and high-variety information assets that demand
cost-effective, innovative forms of information processing for enhanced insight and decision
making.
Data:
Data is present in homogeneous sources as well as in heterogeneous sources. The
need of the hour is to understand, manage, process, and take the data for analysis to draw
valuable insights. Digital data can be structured, semi-structured or unstructured data.
Classification of Digital Data:
On the basis of the data received from the sources, big data covers
• Structured Data
• Semi-Structured Data
• Unstructured Data
In a real-world scenario, typically, the unstructured data is larger in volume than the
structured and semi-structured data. Approximately 70% to 80% of data is in unstructured
form.

Structured Data:
• It is organized data in a predefined format.
• It is stored in tabular form.
• The number of rows/records/tuples in a relation is called the cardinality of the
relation.
• The number of columns in a relation is called the degree of the relation.
• It is the data that resides in fixed fields within a record or file.
• It is formatted data that has entities and their attributes mapped.
• It is used to query and report against predetermined data types.
Sources of Structured Data:
• Relational databases.
• RDBMS products such as Oracle (Oracle Corporation), DB2 (IBM), Microsoft SQL Server
(Microsoft), Greenplum (EMC), Teradata (Teradata), MySQL (open source), PostgreSQL
(advanced open source), etc.
• Flat files in the form of records (like comma-separated values (.CSV) and tab-separated
files).
• Multidimensional databases (mainly used in data warehouse technology).
• Legacy databases.
Sample of Structured Data:

Customer ID   Name    Product ID   City        State
12365         Smith   241          Graz        Styria
23658         Jack    365          Wolfsberg   Carinthia
32456         Kady    421          Enns        Upper Austria
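
The cardinality and degree defined earlier can be checked against this sample with a short Python sketch. Representing the relation as a list of tuples is an illustrative choice, not something the source prescribes.

```python
# The sample relation from the table above: a list of column names plus one tuple per row.
columns = ["Customer ID", "Name", "Product ID", "City", "State"]
rows = [
    (12365, "Smith", 241, "Graz", "Styria"),
    (23658, "Jack", 365, "Wolfsberg", "Carinthia"),
    (32456, "Kady", 421, "Enns", "Upper Austria"),
]

cardinality = len(rows)    # number of rows/records/tuples in the relation
degree = len(columns)      # number of columns (attributes) in the relation

# In a well-formed relation every tuple has exactly `degree` values.
assert all(len(row) == degree for row in rows)
print(cardinality, degree)
```

For this sample the cardinality is 3 (three customers) and the degree is 5 (five columns).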

3) What is data? Explain semi-structured data.

Big data:
Big data is high-volume, high-velocity and high-variety information assets that demand
cost-effective, innovative forms of information processing for enhanced insight and decision
making.
Data:
Data is present in homogeneous sources as well as in heterogeneous sources. The
need of the hour is to understand, manage, process, and take the data for analysis to draw
valuable insights. Digital data can be structured, semi-structured or unstructured data.
Classification of Digital Data:
On the basis of the data received from the sources, big data covers
• Structured Data
• Semi-Structured Data
• Unstructured Data
In a real-world scenario, typically, the unstructured data is larger in volume than the
structured and semi-structured data. Approximately 70% to 80% of data is in unstructured
form.
Semi-Structured Data:
• Semi-structured data is also known as a schema-less or self-describing structure.
• It does not have a fixed data model.
• It refers to a form of structured data that contains tags or mark-up elements in order to
separate elements and generate hierarchies of records and fields in the given data.
• Such data does not follow the proper structure of data models as in relational
databases.
Sources of Semi-Structured Data:
• File systems such as web data in the form of cookies.
• Data exchange formats such as JavaScript Object Notation (JSON) data.
• XML stands for eXtensible Markup Language. It was popularized by web
services developed using the Simple Object Access Protocol (SOAP).
• JSON stands for JavaScript Object Notation. It is used to transmit data between a
server and a web application. JSON was popularized by web services developed
using REpresentational State Transfer (REST).
• MongoDB is an open source, distributed, document-oriented NoSQL database.
• Couchbase, originally known as Membase, is an open source, distributed NoSQL
store that stores data natively in JSON format.
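
How tags and keys make such data self-describing can be seen in a small Python sketch that reads the same record from XML and from JSON. The record and its field names are invented for illustration; only the standard library is used.

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical record expressed in XML (tags) and in JSON (keys).
xml_text = (
    "<customer><name>Smith</name>"
    "<orders><order>241</order><order>365</order></orders></customer>"
)
json_text = '{"customer": {"name": "Smith", "orders": [241, 365]}}'

# XML: the tags mark the elements and their hierarchy.
root = ET.fromstring(xml_text)
xml_name = root.findtext("name")
xml_orders = [int(order.text) for order in root.iter("order")]

# JSON: the keys and nesting carry the same information.
data = json.loads(json_text)
json_name = data["customer"]["name"]
json_orders = data["customer"]["orders"]

print(xml_name, xml_orders)
print(json_name, json_orders)
```

In both formats the structure travels with the data itself, so no separate schema is needed to read the record.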
4) What is data? Explain unstructured data.
Big data:
Big data is high-volume, high-velocity and high-variety information assets that demand
cost-effective, innovative forms of information processing for enhanced insight and decision
making.
Data:
Data is present in homogeneous sources as well as in heterogeneous sources. The
need of the hour is to understand, manage, process, and take the data for analysis to draw
valuable insights. Digital data can be structured, semi-structured or unstructured data.
Classification of Digital Data:
On the basis of the data received from the sources, big data covers
• Structured Data
• Semi-Structured Data
• Unstructured Data
In a real-world scenario, typically, the unstructured data is larger in volume than the
structured and semi-structured data. Approximately 70% to 80% of data is in unstructured
form.

Unstructured Data:
• Unstructured data does not have any logical structure or pre-defined data model.
• It contains metadata (additional information related to the data).
• It contains inconsistent data, such as data obtained from files, social media websites,
satellites, etc.
• It contains data in different formats such as emails, text, audio, video, or images.
Sources of Unstructured Data:
• Social media: data obtained from social networking platforms, including YouTube,
Facebook, Twitter, LinkedIn, and Flickr.
• Mobile data: data such as text messages and location information.
Note: About 80 percent of enterprise data consists of unstructured data.
