BIG Data Analytics
UNIT I
INTRODUCTION TO BIG DATA
1) What is data? Explain the classification of digital data.
Irrespective of the size of the enterprise whether it is big or small, data continues to
be a precious and irreplaceable asset.
Data:
Data is present in homogeneous sources as well as in heterogeneous sources. The
need of the hour is to understand, manage, process, and take the data for analysis to draw
valuable insights. Digital data can be structured, semi-structured or unstructured data.
Data generates information and from information we can draw valuable insight.
The digital data can be broadly classified into 3 types.
They are
I. structured,
II. semi-structured, and
III. unstructured data.
1. Structured data:
When data follows a pre-defined schema/structure, we say it is structured data.
This is data which is in an organized form (e.g., in rows and columns) and can be easily
used by a computer program.
Relationships exist between entities of data, such as classes and their objects.
About 10% of an organization's data is in this format.
Big data Analytics using R B.com (VI- Sem)
2. Semi-structured data:
Semi-structured data is also referred to as self-describing structure.
This is data which does not conform to a data model but has some structure.
However, it is not in a form which can be used easily by a computer program. About
10% of an organization's data is in this format,
for example,
HTML,
XML,
JSON,
email data etc.
3. Unstructured data:
This is the data which does not conform to a data model or is not in a form which can
be used easily by a computer program.
About 80% of an organization's data is in this format,
for example
images,
videos,
audio,
chat rooms,
memos,
PowerPoint presentations,
body of an email, etc.
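The three classes above can be contrasted with a minimal Python sketch. The customer record, its field names, and the sample text are hypothetical and serve only as an illustration of how a program can (or cannot) address each form of data.

```python
import csv
import io
import json

# Hypothetical customer record, used only for illustration.

# Structured: fixed columns, every row follows the same schema.
structured = "id,name,city\n1,Asha,Hyderabad\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: no fixed schema, but keys (tags) describe each value.
semi_structured = '{"id": 1, "name": "Asha", "phones": ["040-1111", "040-2222"]}'
record = json.loads(semi_structured)

# Unstructured: free text; a program has no field names to rely on.
unstructured = "Asha called from Hyderabad and asked about her last order."

print(rows[0]["city"])              # field access by column name
print(record["phones"][1])          # nested field access by key
print("Hyderabad" in unstructured)  # only a text search is possible
```

The structured and semi-structured forms can be queried by name, while the unstructured text would need further processing (for example text mining) before fields could be extracted.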
Data has three key characteristics:
1. Composition: The composition of data deals with the structure of data, that is, the sources
of data, the granularity, the types, and the nature of data as to whether it is static or real-time
streaming.
2. Condition: The condition of data deals with the state of data, that is, "Can one use this
data as is for analysis?" or "Does it require cleaning for further enhancement and
enrichment?"
3. Context: The context of data deals with "Where has this data been generated?", "Why was
this data generated?", "How sensitive is this data?",
"What are the events associated with this data?" and so on.
Small data (data as it existed prior to the big data revolution) is about certainty. It is
about known data sources;
it is about no major changes to the composition or context of data.
Big Data is data so large in amount that it cannot be processed by traditional data storage
or processing systems. It is used by many multinational companies to process data and
run the business of many organizations. By some estimates, the data flow would exceed
150 exabytes per day before replication.
The characteristics of Big Data are described by five V's:
1. Volume
2. Variety
3. Veracity
4. Value
5. Velocity
1.Volume
The name Big Data itself is related to an enormous size. Big Data is a vast 'volume' of data
generated daily from many sources, such as business processes, machines, social media
platforms, networks, human interactions, and many more.
For example, each day Facebook generates approximately a billion messages, records about
4.5 billion clicks of the "Like" button, and receives more than 350 million new posts. Big data
technologies can handle such large amounts of data.
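Data volumes are usually expressed in units such as terabytes, petabytes, or zettabytes. The following is a minimal Python sketch, assuming decimal (SI) prefixes where each unit is 1,000 times the previous one; the function names `to_bytes` and `humanize` are illustrative and not part of any library.

```python
# Decimal (SI) prefixes: each step is a factor of 1,000.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]


def to_bytes(value, unit):
    """Convert a value expressed in the given unit to bytes."""
    return value * 1000 ** UNITS.index(unit)


def humanize(num_bytes):
    """Express a byte count in the largest unit that keeps the value >= 1."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000


print(to_bytes(1, "TB"))         # bytes in one terabyte
print(humanize(5 * 10 ** 15))    # a byte count shown in a readable unit
```

Some tools use binary prefixes (factors of 1,024) instead, so the exact figures depend on the convention in use.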
2.Variety
Big Data can be structured, unstructured, or semi-structured, and it is collected from
different sources. In the past, data was collected only from databases and spreadsheets, but these
days data comes in an array of forms, such as PDFs, emails, audio, social media posts, photos,
videos, etc.
a. Structured data: Structured data follows a defined schema, with all the required columns. It is in a
tabular form and is stored in a relational database management system. OLTP (Online
Transaction Processing) systems are built to work with structured data, which is stored
in relations, i.e., tables.
b. Semi-structured: In semi-structured data, the schema is not strictly defined,
e.g., JSON, XML, CSV, TSV, and email.
c. Unstructured data: All unstructured files, such as log files, audio files, and image files,
are included in unstructured data. Some organizations have a lot of data available
but do not know how to derive value from it since the data is raw.
d. Quasi-structured data: Textual data with inconsistent data
formats that can be formatted with effort, time, and some tools.
Example: Web server logs, i.e., log files created and maintained by a server that
contain a list of activities.
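The web server log example can be sketched in Python as follows. The log line uses the widely used Common Log Format with made-up values, and the regular expression is an illustrative assumption rather than a complete log grammar; real logs often need extra handling for malformed lines.

```python
import re

# One line in Common Log Format (sample values are made up).
line = '192.168.1.10 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'

# Illustrative pattern: named groups give the raw text a usable structure.
pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

match = pattern.match(line)
# Lines that do not fit the pattern yield an empty dictionary.
entry = match.groupdict() if match else {}
print(entry)
```

This is the "effort and tools" step mentioned above: once each line is turned into named fields, the log can be loaded into a table and analyzed like structured data.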
3.Veracity
Veracity refers to how reliable the data is. It covers the many ways of filtering or translating
the data so that it can be handled and managed efficiently. Big Data is also
essential in business development.
4.Value
Value is an essential characteristic of big data. What matters is not simply the data that we
process or store, but the valuable and reliable data that we store, process, and analyze.
5.Velocity
Velocity plays an important role compared to the others. Velocity refers to the speed at which
data is created in real time. It covers the speed of incoming data sets, the rate of
change, and activity bursts. A primary aspect of Big Data is to provide data rapidly
on demand.
Big data velocity deals with the speed at which data flows from sources such as application logs,
business processes, networks, social media sites, sensors, mobile devices, etc.
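To make velocity concrete, the daily Facebook figures quoted in the Volume section can be turned into per-second rates. This is a rough sketch that assumes events are spread evenly over the day, which real traffic is not (activity arrives in bursts).

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

# Daily totals as quoted in the Volume section (approximate figures).
daily_events = {
    "messages": 1_000_000_000,
    "likes": 4_500_000_000,
    "posts": 350_000_000,
}

# Average rate, assuming an even spread across the day.
per_second = {name: count / SECONDS_PER_DAY for name, count in daily_events.items()}

for name, rate in per_second.items():
    print(f"{name}: about {rate:,.0f} events per second")
```

Even as averages, these rates are in the thousands to tens of thousands of events per second, which is why streaming and real-time processing tools are needed.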
Big Data:
Big data is a large collection of structured, semi-structured, and unstructured data. It is data that
arrives at a much higher volume, at a much faster rate, in a wider variety of file formats, and
from a wider variety of sources than structured data alone.
The term 'big data' has been around since the late 1990s, when it was used by
NASA researchers in the paper "Application-Controlled Demand Paging for Out-of-Core Visualization".
They used the term to describe the challenge of processing and visualizing vast amounts of
data from supercomputers.
Table 1.1 The evolution of big data (Big Data and Analytics)
Between 1989 and 1993, British computer scientist Sir Tim Berners-Lee would create
the fundamental technologies required to power what we now know as the World
Wide Web.
These web technologies were HyperText Markup Language (HTML), Uniform
Resource Identifier (URI), and Hypertext Transfer Protocol (HTTP).
As more devices gained access to the internet, this led to a massive explosion in the
amount of information that people could access and share at any one time.
2000s to 2010s – Controlling Data Volume, Social Media and Cloud Computing
During the early 2000s, companies such as Amazon, eBay, and Google helped
generate large amounts of web traffic, as well as a combination of structured and
unstructured data.
Amazon also launched a beta version of AWS (Amazon Web Services) in 2002,
which opened the Amazon.com platform to all developers. By 2004, over 100
applications were built for it.
AWS then relaunched in 2006, offering a wide range of cloud infrastructure services,
including Simple Storage Service (S3) and Elastic Compute Cloud (EC2).
The public launch of AWS attracted a wide range of customers, such as Dropbox,
Netflix, and Reddit, who were eager to become cloud-enabled and had all
partnered with AWS before 2010.
The rise of mobile devices and IoT devices also led to new types of data being collected,
organized, and analyzed. Some examples include:
big data technology is AI (Artificial Intelligence) and automation, both of which are
streamlining the process of database management and big data analysis, making it easier to
convert raw data into meaningful insights that make sense to key decision makers.
• Big data is high-velocity and high-variety information assets that demand cost effective,
innovative forms of information processing for enhanced insight and decision making.
• Big data refers to datasets whose size is typically beyond the storage capacity of, and which are
too complex for, traditional database software tools.
• Big data is anything beyond the human & technical infrastructure needed to support
storage, processing and analysis.
• It is data that is big in volume, velocity and variety. Refer to Figure 1.3.
Figure 1.3 Data: Big in volume, variety, and Velocity (Big Data and Analytics)
Variety: Data can be structured data, semi-structured data and unstructured data. Data stored
in a database is an example of structured data. HTML data, XML data, email data, and
CSV files are examples of semi-structured data. PowerPoint presentations, images,
videos, research documents, white papers, the body of an email, etc. are examples of unstructured data.
Velocity: Velocity essentially refers to the speed at which data is being created in real time.
We have moved from simple desktop applications like payroll applications to real-time
processing applications.
Volume: Volume can be in terabytes, petabytes, or zettabytes.
Gartner Glossary: Big data is high-volume, high-velocity and/or high-variety information assets
that demand cost-effective, innovative forms of information processing that enable enhanced
insight and decision making.
For the sake of easy comprehension, we will look at the definition in three parts.
Part I of the definition: "Big data is high-volume, high-velocity, and high-variety information
assets" talks about voluminous data (humongous data) that may have great variety (a good mix
of structured, semi-structured, and unstructured data) and will require a good speed/pace for
storage, preparation, processing and analysis.
Part II of the definition: "cost effective, innovative forms of information processing" talks
about embracing new techniques and technologies to capture (ingest), store, process, persist,
integrate and visualize the high-volume, high-velocity, and high-variety data.
Part III of the definition: "enhanced insight and decision making" talks about deriving
deeper, richer and meaningful insights and then using these insights to make faster and better
decisions to gain business value and thus a competitive edge.
Data -> Information -> Actionable intelligence -> Better decisions -> Enhanced
business value
Figure 1.4 Definition of big data – Gartner (Big Data and Analytics)
Figure 1.5 Challenges with big data (Big Data and Analytics)
Data volume: Data today is growing at an exponential rate, and this high tide of data will
continue to rise. The key questions are –
“will all this data be useful for analysis?”,
“Do we work with all this data or subset of it?”,
“How will we separate the knowledge from the noise?” etc.
Storage: Cloud computing is the answer to managing infrastructure for big data as far as
cost-efficiency, elasticity and easy upgrading / downgrading is concerned. This further
complicates the decision to host big data solutions outside the enterprise.
Data retention: How long should one retain this data? Some data may be required for long-term
decisions, but some data may quickly become irrelevant and obsolete.
Skilled professionals: In order to develop, manage and run those applications that generate
insights, organizations need professionals who possess a high-level proficiency in data
sciences.
Other challenges: Other challenges of big data are with respect to capture, storage, search,
analysis, transfer and security of big data.
Visualization: Big data refers to datasets whose size is typically beyond the storage capacity
of traditional database software tools. There is no explicit definition of how big a data set
should be for it to be considered big data. Data visualization (computer graphics) is becoming
popular as a separate discipline, but there are very few data visualization experts.
Business Intelligence: Tableau, Online analytical processing (OLAP), Data Warehousing, Microsoft Power BI, Google Analytics, etc.
Big Data: Hadoop, Spark, Hive, Cloudera, etc.
4. Characteristics/Properties
Business Intelligence: Below are the six features of Business Intelligence: location intelligence, executive dashboards, "what if" analysis, interactive reports, metadata layer, and ranking reports.
Big Data: Big data can be described by some characteristics such as Volume, Variety, Velocity, Veracity and Value.
5. Benefits
Business Intelligence: Below is the list of benefits of Business Intelligence
Big Data: Below is the list of benefits of Big Data
Both BI and Big Data help to analyse data to gain insights and to view the relevant
data.
Business intelligence and Big Data need to be synchronized and used together. They
are not the same thing, but they share many common goals. A lot of the
Some of the benefits of using big data for businesses are:
Cost savings: Big data tools like Apache Hadoop, Spark, etc. bring cost-saving
benefits to businesses when they have to store large amounts of data.
Time savings: Real-time in-memory analytics helps companies to collect data from
various sources and process it faster.
Market understanding: Big data helps companies to understand the market
conditions, customer preferences, trends and opportunities.
Customer acquisition and retention: Big data helps companies to refine their
marketing campaigns and techniques, provide targeted promotional information, and
improve customer loyalty programs.
Innovation and product development: Big data helps companies to discover new
sources of revenue, solve problems, and create new products and services based on
customer feedback and demand.
Competitive advantage: Big data helps companies to gain an edge over their
competitors by leveraging data-driven insights and strategies.
Big data is important for companies because it can help them achieve growth, efficiency,
profitability and customer satisfaction.
Structured Data:
It is organized data in a predefined format.
It is stored in tabular form.
The number of rows/records/tuples in a relation is called the cardinality of the
relation.
The number of columns in a relation is called the degree of the relation.
It is the data that resides in fixed fields within a record or file.
It is formatted data that has entities and their attributes mapped.
It is used to query and report against predetermined data types.
Sources of Structured Data:
Relational Databases.
RDBMS products such as Oracle (Oracle Corporation), DB2 (IBM), SQL Server (Microsoft),
Greenplum (EMC), Teradata (Teradata), MySQL (open source), PostgreSQL (advanced
open source), etc.
Flat files in the form of records (like comma-separated values (.CSV) and tab-separated
files).
Multidimensional databases (mainly used in data warehouse technology).
Legacy databases.
Sample of Structured Data:
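As an illustration, a small relation can be sketched in Python. The EMPLOYEE relation, its column names, and its values are hypothetical and are not taken from the original notes; the sketch shows how cardinality and degree are counted and why a fixed schema makes querying easy.

```python
# A hypothetical EMPLOYEE relation: every row has the same named columns.
columns = ("emp_id", "name", "department", "salary")
rows = [
    (101, "Ravi", "Sales", 42000),
    (102, "Meena", "Finance", 51000),
    (103, "Kiran", "Sales", 39000),
]

cardinality = len(rows)   # number of rows (tuples) in the relation
degree = len(columns)     # number of columns (attributes) in the relation

# Because the schema is fixed, a query can address columns by name.
dept_idx = columns.index("department")
salary_idx = columns.index("salary")
sales_total = sum(r[salary_idx] for r in rows if r[dept_idx] == "Sales")

print(cardinality, degree, sales_total)
```

In an RDBMS the same question would be a one-line SQL query; the point here is only that each value sits in a fixed, predictable field.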
Semi-Structured Data:
Semi-structured data is also known as schema-less or self-describing structure.
It does not conform to a formal data model.
It refers to a form of structured data that contains tags or mark-up elements in order to
separate elements and generate hierarchies of records and fields in the given data.
Such data does not follow the strict structure of data models as in relational
databases.
Sources of Semi-Structured Data:
File systems, such as web data in the form of cookies.
Data exchange formats such as JavaScript Object Notation (JSON) data.
XML stands for eXtensible Markup Language. It was popularized by web
services developed using the Simple Object Access Protocol (SOAP).
JSON stands for JavaScript Object Notation. It is used to transmit data between a
server and a web application. JSON was popularized by web services developed
using REpresentational State Transfer (REST).
MongoDB is an open-source, distributed, NoSQL document-oriented database.
Couchbase (originally known as Membase) is an open-source, distributed NoSQL database.
Both store data natively as JSON or JSON-like documents.
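The "self-describing" nature of JSON and XML can be seen in a short Python sketch that uses only the standard library. The student record and its tag names are hypothetical; the same information is expressed once as JSON and once as XML.

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical record in two semi-structured formats.
json_text = '{"student": {"name": "Anil", "courses": ["BDA", "R"]}}'
xml_text = "<student><name>Anil</name><course>BDA</course><course>R</course></student>"

# JSON: keys describe the values, and nesting gives the hierarchy.
student = json.loads(json_text)["student"]
json_courses = student["courses"]

# XML: tags describe the values, and child elements give the hierarchy.
root = ET.fromstring(xml_text)
xml_name = root.findtext("name")
xml_courses = [c.text for c in root.findall("course")]

print(student["name"], json_courses)
print(xml_name, xml_courses)
```

Neither document declares a schema in advance, yet a program can still navigate it, because the keys and tags travel with the data.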
4) What is data? Explain unstructured data.
Big data:
Big data is high-velocity and high-variety information assets that demand cost
effective, innovative forms of information processing for enhanced insight and decision
making.
Data:
Data is present in homogeneous sources as well as in heterogeneous sources. The
need of the hour is to understand, manage, process, and take the data for analysis to draw
valuable insights. Digital data can be structured, semi-structured or unstructured data.
Classification of Digital Data:
On the basis of the data received from the sources, big data covers
Structured Data
Semi-Structured Data
Unstructured Data
In a real-world scenario, typically, the unstructured data is larger in volume than the
structured and semi-structured data. Approximately 70% to 80% of data is in unstructured
form.
Unstructured Data:
Unstructured data does not have any logical structure or pre-defined data model.
It contains metadata (additional information related to the data).
It contains inconsistent data, such as data obtained from files, social media websites,
satellites, etc.
It contains data in different formats such as emails, text, audio, video, or images.
Sources of Unstructured Data:
Social Media: data obtained from social networking platforms, including YouTube,
Facebook, Twitter, LinkedIn, and Flickr.
Mobile Data: data such as text messages and location information.
Note: About 80% of enterprise data consists of unstructured data.