Basics of Big Data Notes
UNIT-1
Types of Digital Data: Classification of Digital Data.
Introduction to Big Data: Characteristics of Data, Evolution of Big Data, Definition of Big
Data, Challenges with Big Data, What is Big Data?
Big Data Analytics: Where do we Begin?, What is Big Data Analytics?, What Big Data
Analytics isn’t?, Classification of Analytics, Terminologies Used in Big Data Environments.
The Big Data Technology Landscape: NoSQL
Digital data is data stored using specific machine-language systems
that can be interpreted by various technologies.
Eg. Audio, video or text information
2. Semi-Structured Data:
This is data which does not conform to a data model but has some structure.
Metadata for this data is available but is not sufficient.
Sources: XML, JSON, E-mail
Characteristics:
- inconsistent structure.
- self describing (label/value pairs)
- schema information is blended with data values
- data objects may have different attributes not known beforehand
Challenges:
Storage cost: Storing data with their schemas increases cost
RDBMS: Semi-structured data cannot be stored in existing RDBMS as
data cannot be mapped into tables directly
Irregular and partial structure: Some data elements may have extra
information while others none at all
Implicit structure: In many cases the structure is implicit.
Interpreting relationships and correlations is very difficult
Flat files: Semi-structured data is usually stored in flat files, which are
difficult to index and search
Heterogeneous sources: Data comes from varied sources which is
difficult to tag and search.
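The characteristics and the RDBMS challenge listed above can be seen in a small sketch. The records below are hypothetical (names and fields are made up for illustration); the sketch only assumes Python's standard json module. Each object is self-describing, the attributes differ between objects, and forcing them into one table leaves gaps.

```python
import json

# Hypothetical records: each object carries its own field labels,
# and the set of fields differs from record to record.
raw = """
[
  {"name": "Asha", "email": "asha@example.com"},
  {"name": "Ravi", "phone": "555-0100", "tags": ["vip"]},
  {"name": "Mei"}
]
"""

records = json.loads(raw)

# No schema is known up front, so the attributes are discovered by scanning the data.
all_fields = sorted({field for record in records for field in record})

# Mapping onto a fixed table layout leaves None wherever a record lacks a field.
rows = [[record.get(field) for field in all_fields] for record in records]

print(all_fields)
for row in rows:
    print(row)
```

The third row comes out almost entirely empty, which is the "irregular and partial structure" problem in miniature.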
3. Unstructured Data:
This is the data which does not conform to a data model or is not in a
form which can be used easily by a computer program.
About 80–90% data of an organization is in this format.
Sources: memos, chat-rooms, PowerPoint presentations, images,
videos, letters, researches, white papers, body of an email, etc.
Characteristics:
Does not conform to any data model
Can’t be stored in the form of rows and columns
Not in any particular format or sequence
Not easily usable by the program
Doesn’t follow any rule or semantics
Challenges:
Storage space: The sheer volume of unstructured data and its
unprecedented growth make it difficult to store. Audios, videos,
images, etc. occupy a huge amount of storage space
Scalability: Scalability becomes an issue with increase in
unstructured data
Retrieve information: Retrieving and recovering unstructured data are
cumbersome
Security: Ensuring security is difficult due to varied sources of data
(e.g. e-mail, web pages)
Update/delete: Updating, deleting, etc. are not easy due to the
unstructured form
Indexing and Searching: Indexing becomes difficult with increase in
data.
Searching is difficult for non-text data
Interpretation: Unstructured data is not easily interpreted by
conventional search algorithms
Tags: As the data grows it is not possible to put tags manually
Indexing: Designing algorithms to understand the meaning
of the document and then tag or index them accordingly is difficult.
There are different sources of data like doc, pdf, YouTube, a chat
conversation on internet messenger, a customer feedback form on an online
retail website, CCTV coverage and weather forecast.
2. Variety: Variety deals with the wide range of data types and
sources of data. Data may be structured, semi-structured or unstructured.
Structured data: From traditional transaction processing systems and
RDBMS, etc.
Semi-structured data: For example Hypertext Markup Language (HTML),
eXtensible Markup Language (XML).
Unstructured data: For example unstructured text documents, audios,
videos, emails, photos, PDFs , social media, etc.
3. Velocity: It refers to the speed of data processing. We have
moved from the days of batch processing to real-time processing.
Q) Explain evolution of Big Data. What are the challenges of Big Data?
Evolution:
Data Generation and storage | Data Utilization | Data Driven
Primitive and structured | Complex and Relational | Complex and unstructured
Main frames: Basic data storage | Relational databases: Data intensive applications | Structured data, Unstructured data, Multimedia data
Q) Define Big Data Analytics. What are the various types of analytics?
Data from legacy systems, ERP, CRM and third party applications. | Big Data | A blend of big data and data from legacy systems, ERP, CRM and third party applications.
Small and structured data sources. Data stored in enterprise data warehouses or data marts. | Big data is being taken up seriously. Data is mainly unstructured, arriving at a higher pace. This fast flow of big volume data had to be stored and processed rapidly, often on massively parallel servers running Hadoop. | A blend of big data and traditional analytics to yield insights and offerings with speed and impact.
Data was internally sourced. | Data was often externally sourced. | Data is being both internally and externally sourced.
Relational databases | Database applications, Hadoop clusters, SQL to Hadoop environments etc. | In-memory analytics, in-database processing, agile analytical methods, machine learning techniques etc.
Big Data Analytics coexists with both RDBMS and Data Warehouse, leveraging
the power of each to yield business value.
Big Data Analytics isn’t:
Only about volume
Just about technology
Meant to replace RDBMS
Meant to replace data warehouse
Only used by huge online companies like Google or Amazon
A “one-size-fits-all” traditional RDBMS built on shared disk and
memory.
Q) Explain different Big Data Analytics Approaches.
CAP Theorem: The CAP theorem is also called Brewer’s theorem. It
states that in a distributed computing environment, it is impossible to
provide all three of the following guarantees at once. At best you can
have two of the three, and one must be sacrificed.
1. Consistency
2. Availability
3. Partition tolerance
1. Consistency implies that every read fetches the last write. Consistency
means that all nodes see the same data at the same time. If there are
multiple replicas and there is an update being processed, all users see
the update go live at the same time even if they are reading from
different replicas.
2. Availability implies that reads and writes always succeed. Availability
is a guarantee that every request receives a response about whether it
was successful or failed.
3. Partition tolerance implies that the system will continue to function
when network partition occurs. It means that the system continues to
operate despite arbitrary message loss or failure of part of the system.
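The trade-off can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not how any real database works: the Replica and TinyCluster classes, the two-replica setup and the "CP"/"AP" mode flag are all hypothetical names invented for this example. During a partition the "CP" cluster rejects writes to stay consistent, while the "AP" cluster accepts them and lets the replicas diverge.

```python
class Replica:
    """One copy of a single value, tagged with a version number."""

    def __init__(self):
        self.value = None
        self.version = 0


class TinyCluster:
    """Two replicas; mode 'CP' favours consistency, 'AP' favours availability."""

    def __init__(self, mode):
        self.mode = mode
        self.replicas = [Replica(), Replica()]
        self.partitioned = False  # True means the replicas cannot reach each other

    def write(self, node, value):
        if self.partitioned and self.mode == "CP":
            # Refuse the write rather than let the replicas diverge.
            raise RuntimeError("unavailable: cannot reach the other replica")
        target = self.replicas[node]
        target.value = value
        target.version += 1
        if not self.partitioned:
            # Network is healthy: copy the update to the other replica.
            other = self.replicas[1 - node]
            other.value, other.version = target.value, target.version

    def read(self, node):
        return self.replicas[node].value


cp = TinyCluster("CP")
cp.write(0, "v1")
cp.partitioned = True
try:
    cp.write(0, "v2")
    cp_accepted = True
except RuntimeError:
    cp_accepted = False

ap = TinyCluster("AP")
ap.write(0, "v1")
ap.partitioned = True
ap.write(0, "v2")

print(cp_accepted, cp.read(0), cp.read(1))  # False v1 v1
print(ap.read(0), ap.read(1))               # v2 v1
```

Both clusters tolerate the partition in the sense that they keep running; what differs is whether they give up availability (CP) or consistency (AP) while it lasts.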
Q) What is BASE?
This industry also relies heavily on Big Data for risk analytics,
including anti-money laundering, demand enterprise risk
management, "Know Your Customer," and fraud mitigation.
The Securities and Exchange Commission (SEC) is using Big Data to
monitor financial market activity. It is currently using network
analytics and natural language processors to catch illegal trading
activity in the financial markets.
3. Healthcare Sector:
Some hospitals, like Beth Israel, are using data collected from a cell
phone app, from millions of patients, to allow doctors to use evidence-
based medicine as opposed to administering several medical/lab tests
to all patients who go to the hospital.
Free public health data and Google Maps have been used by the
University of Florida to create visual data that allows for faster
identification and efficient analysis of healthcare information, used in
tracking the spread of chronic disease.
4. Education:
The University of Tasmania, an Australian university,
has deployed a Learning and Management System that tracks, among
other things, when a student logs onto the system, how much time is
spent on different pages in the system, as well as the overall progress
of a student over time.
On a governmental level, the Office of Educational Technology in the
U.S. Department of Education is using Big Data to develop analytics
that help course-correct students who are going astray while
using online Big Data certification courses. Click patterns are also
being used to detect boredom.
5. Government:
In public services, Big Data has an extensive range of applications,
including energy exploration, financial market analysis, fraud
detection, health-related research, and environmental protection.
The Food and Drug Administration (FDA) is using Big Data to detect
and study patterns of food-related illnesses and diseases.
6. Insurance Industry:
Big data has been used in the industry to provide customer insights for
transparent and simpler products, by analyzing and predicting customer
behavior through data derived from social media, GPS-enabled devices, and
CCTV footage. Big Data also allows insurance companies to improve customer
retention.
7. Transportation Industry:
Some applications of Big Data by governments, private organizations, and
individuals include:
Government use of Big Data: traffic control, route planning,
intelligent transport systems, congestion management (by predicting
traffic conditions)
Private-sector use of Big Data in transport: revenue management,
technological enhancements, logistics and for competitive advantage
(by consolidating shipments and optimizing freight movement)
NoSQL stands for Not Only SQL. NoSQL databases are non-relational, open source,
distributed databases.
Features of NoSQL:
1. NoSQL databases are non-relational: They do not adhere to the relational
data model. Instead they are key-value, document-oriented,
column-oriented or graph-based databases.
2. Distributed: The data is distributed across several nodes in a cluster
made up of low-cost commodity hardware.
3. No Support for ACID properties: They do not offer support for ACID
properties of transactions. On the contrary, they adhere to the CAP
theorem.
4. No fixed table schema: NoSQL databases are becoming increasingly
popular owing to the flexibility of their schema. They do
not mandate that the data strictly adhere to any schema structure at
the time of storage.
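Features 2 and 4 can be sketched together. The example below is an illustration under assumptions, not the placement scheme of any particular product: the node names, the records and the node_for helper are hypothetical, and simple hash-modulo placement is used only because it is the shortest way to show data being spread over nodes.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical node names


def node_for(key: str) -> str:
    """Pick a node from a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]


# Records need not share a schema; each is stored whole under its key.
records = {
    "user:1": {"name": "Asha", "email": "asha@example.com"},
    "user:2": {"name": "Ravi", "tags": ["vip"]},
    "user:3": {"name": "Mei", "phone": "555-0100"},
}

# Each node holds only the keys that hash to it.
cluster = {node: {} for node in NODES}
for key, record in records.items():
    cluster[node_for(key)][key] = record


def get(key):
    """Look the key up on the only node that can hold it."""
    return cluster[node_for(key)].get(key)


print(get("user:2"))
```

Because the hash is stable, any client can work out which node owns a key without a central lookup table.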
Need of NoSQL:
Types of NoSQL databases: They are broadly divided into Key-Value (or big hash
table) and Schema-less.
3. Column: Each storage block has data from only one column. A query
fetches only the column families of the columns it requires. All
columns in a column family are stored together on the disk, so
multiple rows can be retrieved in one read operation (data locality).
Eg. Cassandra, HBase etc.
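The difference between row and column layouts can be shown with plain Python lists. This is a simplified sketch with a made-up table; real column stores add column families, compression and on-disk formats that are not modelled here.

```python
# Hypothetical table of three rows.
rows = [
    {"id": 1, "city": "Pune", "temp_c": 31},
    {"id": 2, "city": "Delhi", "temp_c": 38},
    {"id": 3, "city": "Kochi", "temp_c": 29},
]

# Column-oriented layout: one contiguous list ("storage block") per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# A query that needs only temp_c reads a single block.
avg_temp = sum(columns["temp_c"]) / len(columns["temp_c"])

# The row layout has to visit every full row to pull out the same field.
avg_temp_rows = sum(row["temp_c"] for row in rows) / len(rows)

print(columns["temp_c"])
print(avg_temp == avg_temp_rows)
```

Both layouts give the same answer; the column layout simply touches less data when a query needs only a few columns.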
4. Graph: Graph databases are also called network databases. A graph stores
data in nodes.
Data model:
- (Property Graph) nodes and edges
- Nodes may have properties (including ID)
- Edges may have labels or roles
- Key-value pairs on both
Eg. Neo4j, HyperGraphDB, InfiniteGraph etc.
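The property graph model described above can be sketched with ordinary dictionaries and tuples. The node names, edge labels and the neighbours helper are hypothetical; this is not the API of Neo4j or any other graph database.

```python
# Nodes keyed by ID, each with a dictionary of properties (key-value pairs).
nodes = {
    1: {"name": "Asha"},
    2: {"name": "Ravi"},
    3: {"name": "Big Data Notes", "kind": "document"},
}

# Edges as (source ID, label, target ID, properties).
edges = [
    (1, "KNOWS", 2, {"since": 2019}),
    (1, "WROTE", 3, {}),
    (2, "READ", 3, {"rating": 5}),
]


def neighbours(node_id, label):
    """IDs reached from node_id by following outgoing edges with the given label."""
    return [dst for src, lbl, dst, _ in edges if src == node_id and lbl == label]


print([nodes[n]["name"] for n in neighbours(1, "KNOWS")])  # ['Ravi']
print([nodes[n]["name"] for n in neighbours(2, "READ")])   # ['Big Data Notes']
```

Queries are expressed as walks along labelled edges rather than as joins between tables.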
Sample Graph database:
Fig. Sample Graph Database
Advantages:
Big Data Capability
No Single Point of Failure
Easy Replication
It provides fast performance and horizontal scalability.
Can handle structured, semi-structured, and unstructured data with
equal effect
NoSQL databases don't need a dedicated high-performance server
It can serve as the primary data source for online applications.
Excels at distributed database and multi-data centre operations
Eliminates the need for a specific caching layer to store data
Offers a flexible schema design which can easily be altered without
downtime or service disruption
Disadvantages:
Limited query capabilities
RDBMS databases and tools are comparatively mature
It does not offer any traditional database capabilities, like consistency
when multiple transactions are performed simultaneously.
As the volume of data increases, it becomes difficult to maintain
unique key values
Doesn't work as well with relational data
Most options are open source, so they are less popular with enterprises.
No support for join and group-by operations.
SQL | NoSQL
Relational database | Non-relational, distributed database
Relational model | Model-less approach
Pre-defined schema | Dynamic schema for unstructured data
Table based databases | Document-based or graph-based or wide column store or key-value pairs databases
Vertically scalable (by increasing system resources) | Horizontally scalable (by creating a cluster of commodity machines)
Uses SQL | Uses UnQL (Unstructured Query Language)
Not preferred for large datasets | Largely preferred for large datasets
Not a best fit for hierarchical data | Best fit for hierarchical storage as it follows the key-value pair way of storing data, similar to JSON (JavaScript Object Notation)
Emphasis on ACID properties | Follows Brewer’s CAP theorem
Excellent support from vendors | Relies heavily on community support
Supports complex querying and data keeping needs | Does not have good support for complex querying
Can be configured for strong consistency | Few support strong consistency (e.g., MongoDB), few others can be configured for eventual consistency (e.g., Cassandra)
Examples: Oracle, DB2, MySQL, MS SQL, PostgreSQL, etc. | Examples: MongoDB, HBase, Cassandra, Redis, Neo4j, CouchDB, Couchbase, Riak, etc.
Key-Value: Shopping carts, Web user data analysis. Eg. Amazon, LinkedIn
Document based: Real-time Analysis, Logging, Document archive management
Column-oriented: Analyze huge web user actions, Sensor feeds. Eg. Facebook, Twitter, eBay, Netflix
Graph-based: Network modeling, Recommendation. Eg. Walmart (upsell, cross-sell)
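The shopping-cart use case listed for key-value stores can be sketched as follows. An in-memory dictionary stands in for the database, and the put, get and add_to_cart helpers and the "cart:<user>" key format are hypothetical names chosen for this example only.

```python
import json

store = {}  # stands in for the key-value database


def put(key, value):
    # The store treats the value as an opaque blob; only the key is looked up.
    store[key] = json.dumps(value)


def get(key, default=None):
    blob = store.get(key)
    return default if blob is None else json.loads(blob)


def add_to_cart(user_id, item, qty=1):
    """Read the whole cart by key, change it, and write it back."""
    key = f"cart:{user_id}"
    cart = get(key, {})
    cart[item] = cart.get(item, 0) + qty
    put(key, cart)


add_to_cart("u42", "notebook")
add_to_cart("u42", "pen", 3)
add_to_cart("u42", "notebook")
print(get("cart:u42"))  # {'notebook': 2, 'pen': 3}
```

Every operation is a lookup by a single key, which is why this workload fits a key-value store and needs no joins.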
NewSQL supports the relational data model and uses SQL as its primary
interface.
NewSQL Characteristics:
SQL interface for application interaction
ACID support for transactions
An architecture that provides higher per-node performance vis-à-vis a
traditional RDBMS solution
Scale out, shared nothing architecture
Non-locking concurrency control mechanism so that real time reads
will not conflict with writes.
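One common way to let reads proceed without locking out writes is multi-version concurrency control (MVCC). The notes do not say which mechanism a given NewSQL system uses, so the sketch below is an illustrative assumption; the VersionedStore class and its method names are hypothetical.

```python
class VersionedStore:
    """Keeps every committed version of each key instead of overwriting in place."""

    def __init__(self):
        self.versions = {}  # key -> list of (commit_ts, value), oldest first
        self.clock = 0      # timestamp of the latest commit

    def write(self, key, value):
        self.clock += 1
        self.versions.setdefault(key, []).append((self.clock, value))

    def snapshot(self):
        """A reader remembers the commit timestamp at which it started."""
        return self.clock

    def read(self, key, snapshot_ts):
        """Return the newest version committed at or before snapshot_ts."""
        visible = [v for ts, v in self.versions.get(key, []) if ts <= snapshot_ts]
        return visible[-1] if visible else None


db = VersionedStore()
db.write("balance", 100)
reader = db.snapshot()                    # a long-running read starts here
db.write("balance", 250)                  # a later write; no lock is taken
print(db.read("balance", reader))         # 100
print(db.read("balance", db.snapshot()))  # 250
```

The earlier reader keeps seeing the value that was current when it started, so the write never has to wait for it and the read never sees a half-applied change.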