Big Data Unit 1 Notes
TOPICS
UNIT - I Introduction to Big Data: Big Data and its Importance – Four V’s of
Big Data – Drivers for Big Data – Introduction to Big Data Analytics – Big Data
Analytics applications.
Introduction to Big Data:
The New York Stock Exchange is an example of Big Data: it generates about one
terabyte of new trade data per day.
Social Media
Statistics show that 500+ terabytes of new data get ingested into the databases
of the social media site Facebook every day. This data is mainly generated through
photo and video uploads, message exchanges, posting comments, etc.
A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time.
With many thousands of flights per day, data generation reaches many petabytes.
Types of data:
1. Unstructured data: This is data which does not conform to a data model or
is not in a form which can be used easily by a computer program.
For example: text documents, images, audio and video files.
2. Semi-structured data:
This is data which does not conform to a data model but has some
structure. However, it is not in a form which can be used easily by a computer
program.
Examples: e-mails, XML, markup languages, HTML.
3. Structured data: This is data which is in an organized form, i.e., in the form of
rows and columns, and can be easily used by computer programs. Relationships
exist between the entities of the data, such as classes and their objects. Data stored in
databases is an example of structured data.
(Figure: structured vs. unstructured data.)
Structured data:
Data is said to be structured when it conforms to a schema/structure.
Structured data is generally tabular data that is represented by columns and rows in a
database.
Databases that hold tables in this form are called relational databases.
The mathematical term "relation" refers to a set of data organized as a table.
In structured data, every row in a table has the same set of columns.
SQL (Structured Query Language) is the programming language used to manage and query structured data.
Sources of structured data:
Structured data is generated by RDBMS products such as Oracle, DB2, Microsoft
SQL Server, Teradata, and MySQL, and by OLTP systems.
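To make this concrete, here is a minimal sketch of structured data using Python's built-in sqlite3 module (the employees table and its columns are made up purely for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")  # a temporary, in-memory relational database
cur = conn.cursor()

# Structured data: a fixed schema, so every row has the same set of columns.
cur.execute("CREATE TABLE employees (id INTEGER, name TEXT, salary REAL)")
cur.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                [(1, "Asha", 50000.0), (2, "Ravi", 62000.0)])

# SQL queries work precisely because the schema is known in advance.
for row in cur.execute("SELECT name, salary FROM employees WHERE salary > 55000"):
    print(row)  # prints ('Ravi', 62000.0)

conn.close()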
Semi-structured data:
Semi-structured data is also referred to as self-describing data:
it uses tags to segregate the semantic elements.
Sources of semi-structured data:
The sources of semi-structured data include:
XML - Extensible Markup Language
JSON - JavaScript Object Notation
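To make this concrete, here is a minimal sketch using Python's built-in json module (the record itself is made up purely for illustration). Notice how the field names act as self-describing tags, even though no fixed schema is enforced:

import json

record = '{"name": "Asha", "email": "asha@example.com", "tags": ["bigdata", "unit1"]}'

data = json.loads(record)   # parse the JSON text into a Python dictionary
print(data["name"])         # prints: Asha
print(json.dumps(data, indent=2))  # pretty-print the record back out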
An example of HTML is as follows:
<html>
<head>
<title>Place your title here</title>
</head>
<body bgcolor="FFFFFF">
</body>
</html>
Characteristics of data:
● Composition: deals with the structure of the data, its sources, and its granularity.
● Condition: deals with the state of the data, i.e., whether it can be used as is or requires cleansing.
● Context: deals with where the data was generated, why, and how sensitive it is.
Evolution of Big Data:
The 1970s and before was the era of mainframes, when data was
essentially primitive and structured. Relational databases
evolved in the 1980s and 1990s; this was the era of data-intensive
applications. The World Wide Web (WWW) and the Internet of
Things (IoT) have since led to an onslaught of structured,
unstructured, and multimedia data. Refer Table 1.1.
Importance of Big Data:
The importance of Big Data does not revolve around how much data a company
has, but around how the company utilizes the collected data. Every company uses
data in its own way; the more efficiently a company uses its data, the more
potential it has to grow. A company can take data from any source and, by
analyzing its big data pools effectively, get answers that enable:
Cost Savings:
o Some Big Data tools, such as Hadoop, can bring cost advantages to a business
when large amounts of data are to be stored.
o These tools also help in identifying more efficient ways of doing business.
Time Reductions:
o The high speed of tools like Hadoop and in-memory analytics makes it easy to
identify new sources of data, which helps businesses analyze data immediately.
o This helps in making quick decisions based on the learnings.
Understand the market conditions:
o By analyzing big data, we can get a better understanding of current market
conditions.
o For example, by analyzing customers' purchasing behaviour, a company can
find out which products sell the most and produce products according to this
trend, getting ahead of its competitors.
Control online reputation:
o Big data tools can do sentiment analysis.
o Therefore, you can get feedback about who is saying what about your
company.
o If you want to monitor and improve the online presence of your business,
big data tools can help with all of this.
Using Big Data Analytics to Boost Customer Acquisition and Retention:
o The customer is the most important asset any business depends on.
o No business can claim success without first establishing a solid
customer base.
o If a business is slow to learn what customers are looking for, it is very
likely to end up delivering poor-quality products.
o The use of big data allows businesses to observe various customer-related
patterns and trends.
Using Big Data Analytics to Solve Advertisers' Problems and Offer
Marketing Insights:
o Big data analytics can help change all business operations, such as
matching customer expectations and changing the company's product line.
o It also helps ensure that marketing campaigns are powerful.
Four V's of Big Data:
1. Volume:
The name 'Big Data' itself is related to an enormous size; volume
refers to the huge amount of data.
The size of data plays a very crucial role in determining its value:
whether particular data can actually be considered Big Data is
dependent upon its volume.
Hence, 'Volume' is a characteristic that must be considered while
dealing with Big Data.
Example: In the year 2016, the estimated global mobile traffic was
6.2 exabytes (6.2 billion GB) per month; by the year 2020, almost
40,000 exabytes of data were expected.
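A quick arithmetic check of the figure quoted above (using the convention 1 exabyte = 10**9 gigabytes):

# 1 exabyte = 10**18 bytes and 1 gigabyte = 10**9 bytes,
# so 1 exabyte = 10**9 (one billion) gigabytes.
exabytes_per_month = 6.2
gigabytes_per_month = exabytes_per_month * 10**9
print(gigabytes_per_month)  # 6.2e9, i.e. 6.2 billion GB per month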
2. Velocity:
Velocity refers to the high speed at which data is generated, collected, and
processed. With Big Data, data flows in continuously and rapidly from sources
such as machines, networks, social media, and mobile devices.
Drivers for Big Data:
A first key contributing factor to the affordability of Big Data has been the
plummeting of storage costs.
A second key contributing factor has been the development of open source Big Data
software frameworks.
The most popular software framework (nowadays considered the standard for Big Data) is
Apache Hadoop, for distributed storage and processing.
Because these software frameworks are freely available as open source, it has become
increasingly inexpensive to start Big Data projects in organizations.
This means that organizations that want to process massive quantities of data (and thus have
large storage and processing requirements) do not have to invest in large quantities of IT
infrastructure.
Instead, they can license the storage and processing capacity they need and pay only for the
amounts they actually use. As a result, most Big Data solutions leverage the possibilities of
cloud computing to deliver their solutions to enterprises.
A further driver is the professionalization of data science: knowledge and education
about data science have grown greatly, and more information becomes available every
day. While statistics and data analysis previously remained mostly an academic field,
data science is quickly becoming a popular subject among students and the working population.
The Internet of Things is another driver: it is increasingly gaining popularity as
consumer goods providers start including 'smart' sensors in household appliances.
Whereas the average household in 2010 had around 10 devices that connected to the
internet, this number was expected to rise to 50 per household by 2020.
Examples of these devices include thermostats, smoke detectors, televisions, audio systems and
even smart refrigerators.
Sources of Big Data:
● Medical information, such as diagnostic imaging
● Photos and video footage uploaded to the World Wide Web
● Video surveillance, such as the thousands of video cameras across a city
● Mobile devices, which provide geospatial location data of the users
● Metadata about text messages, phone calls, and application usage on smartphones
● Smart devices, which provide sensor-based collection of information from smart
electric grids, smart buildings, and other infrastructure
● Nontraditional IT devices, including RFID readers, GPS navigation systems, and
seismic processing
These are just some of the many sources from which data can be generated.
Activities (and challenges) associated with Big Data include:
● Capturing data
● Curation
● Storage
● Searching
● Sharing
● Transfer
● Analysis
● Presentation
Data is growing at an exponential rate, and most of it has been generated in the
last 2-3 years.
Organizations and data collectors are realizing that the data they can gather
from individuals contains intrinsic value and, as a result, a new economy is
emerging.
As this new digital economy continues to evolve, the market sees the
introduction of data vendors and data cleaners that use crowdsourcing to test the
outcomes of machine learning techniques.
Other vendors offer added value by repackaging open source tools in a simpler
way and bringing the tools to market. Vendors such as Cloudera, Hortonworks,
and Pivotal have provided this value-add for the open source framework
Hadoop.
The emerging Big Data ecosystem involves four main groups: data devices, data
collectors, data aggregators, and data users/buyers.
Data devices and the "Sensornet" gather data from multiple locations and
continuously generate new data about this data. For each gigabyte of new data
created, an additional petabyte of data is created about that data.
For example, consider someone playing an online video game through a PC,
game console, or smartphone. In this case, the video game provider captures
data about the skill and levels attained by the player. Intelligent systems monitor
and log how and when the user plays the game. As a consequence, the game
provider can fine-tune the difficulty of the game, suggest other related games
that would most likely interest the user, and offer additional equipment and
enhancements for the character based on the user’s age, gender, and interests.
This information may get stored locally or uploaded to the game provider’s
cloud to analyze the gaming habits and opportunities for upsell and cross-sell,
and identify archetypical profiles of specific kinds of users.
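As a rough illustration of the kind of telemetry described above, the sketch below serializes one gameplay event as a JSON record; the event fields and the log_event helper are hypothetical, not an actual game provider's API:

import json, time

def log_event(player_id, level, skill_rating):
    # In a real pipeline this record would be uploaded to the provider's
    # cloud for analysis; here we simply print the serialized event.
    event = {"player_id": player_id,
             "level": level,
             "skill_rating": skill_rating,
             "timestamp": time.time()}
    print(json.dumps(event))

log_event("player-42", level=7, skill_rating=0.83)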
Retail shopping loyalty cards record not just the amount an individual spends,
but the locations of stores that person visits, the kinds of products purchased,
the stores where goods are purchased most often, and the combinations of
products purchased together. Collecting this data provides insights into
shopping and travel habits and the likelihood of successful advertisement
targeting for certain types of retail promotions.
Data collectors include entities that collect data from devices and users, for
example:
● Data results from a cable TV provider tracking the shows a person watches,
which TV channels someone will and will not pay to watch on demand, and
the prices someone is willing to pay for premium TV content.
● Retail stores tracking the path a customer takes through their store while
pushing a shopping cart with an RFID chip, so they can gauge which products
get the most foot traffic, using geospatial data collected from the RFID chips.
Data aggregators make sense of the data collected from the various entities
from the “SensorNet” or the “Internet of
Things.” These organizations compile data from the devices and usage patterns
collected by government agencies, retail stores, and websites. In turn, they can
choose to transform and package the data as products to sell to list brokers, who
may want to generate marketing lists of people who may be good targets for
specific ad campaigns.
Data users and buyers directly benefit from the data collected and aggregated by
others within the data value chain.
Retail banks, acting as a data buyer, may want to know which customers have
the highest likelihood to apply for a second mortgage or a home equity line of
credit. To provide input for this analysis, retail banks may purchase data from a
data aggregator. This kind of data may include demographic information about
people living in specific locations; people who appear to have a specific level of
debt, yet still have solid credit scores (or other characteristics such as paying
bills on time and having savings accounts) that can be used to infer
creditworthiness; and those who are searching the web for information about paying
off debts or doing home remodeling projects.
Obtaining data from these various sources and aggregators will enable a more
targeted marketing campaign, which would have been more challenging before
Big Data due to the lack of information or high-performing technologies.
Introduction to Big Data Analytics:
Big Data Analytics is the field that analyzes and extracts information from the big
data involved in a business or in the data world, so that proper conclusions can be
drawn.
These conclusions can be used to predict the future or to forecast the business.
This data is so complex that it cannot be dealt with using traditional methods
of analysis.
With such a massive amount of data being collected, it only makes sense for
companies to use this data to understand their customers and
their behavior better. This is the reason why the popularity of Data Science
has grown manifold over the last few years.
Big Data Platform:
A big data platform is a type of IT solution that combines the features and
capabilities of several big data applications and utilities within a single
solution; this is then used for managing as well as analyzing Big Data.
It focuses on providing its users with efficient analytics tools for massive
datasets.
The users of such platforms can custom-build applications according to their
use case, such as calculating customer loyalty (an e-commerce use case), and
so on.
Goal: The main goal of a Big Data Platform is to achieve: Scalability,
Availability, Performance, and Security.
Example: Some of the most commonly used Big Data platforms are Hadoop,
Cloudera, Hortonworks, and Pivotal.
Traditional Data vs Big Data:
● Volume: Traditional data volume ranges from gigabytes to terabytes; Big Data
volume ranges from petabytes to zettabytes or exabytes.
● Generation rate: Traditional data is generated per hour or per day or more; Big
Data is generated far more frequently, often every second.
● Data model: Traditional data has a strict, static, schema-based data model; Big
Data has a flat, dynamic schema.
● Relationships: Traditional data is stable, with known inter-relationships; Big
Data is unstable, with unknown relationships.
● Sources: Traditional data sources include ERP and other enterprise transaction
systems; Big Data sources include social media, sensors, and other device data.
Hadoop Architecture:
At its core, Hadoop has two major layers, namely:
● the Processing/Computation layer (MapReduce), and
● the Storage layer (Hadoop Distributed File System, HDFS).
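To make the MapReduce layer concrete, below is a minimal word-count sketch written in the Hadoop Streaming style, where the mapper and reducer are plain Python scripts that read from standard input and write to standard output (an assumption for illustration; a real job would be launched on a cluster via the hadoop-streaming jar with these scripts supplied as mapper and reducer):

# mapper.py - emit a (word, 1) pair for every word seen
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py - Hadoop sorts mapper output by key, so equal words arrive together
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word == current_word:
        count += int(value)
    else:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")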
Advantages of Hadoop:
The Hadoop framework allows the user to quickly write and test distributed
systems. It is efficient, and it automatically distributes the data and work across
the machines, in turn utilizing the underlying parallelism of the CPU cores.
Hadoop does not rely on hardware to provide fault tolerance and high
availability (FTHA); rather, the Hadoop library itself has been designed to detect
and handle failures at the application layer.
Servers can be added to or removed from the cluster dynamically, and Hadoop
continues to operate without interruption.
Another big advantage of Hadoop is that, apart from being open source, it is
compatible with all platforms since it is Java based.