UNIT 1: Big Data Introduction
• In ancient days, people used to travel from one village to another on horse-drawn carts. As time passed, villages became towns and people spread out, so the distance to travel from one town to another also increased, and travelling between towns along with luggage became a problem. Out of the blue, one smart fellow suggested: let us groom and feed the horse more, to solve this problem. When I look at this solution, it is not that bad, but do you think a horse can become an elephant? I don't think so. Another smart fellow said: instead of one horse pulling the cart, let us have four horses pull the same cart. What do you think of this solution? I think it is a fantastic one. Now people can travel large distances in less time and even carry more luggage.
• The same concept applies to Big Data. Until recently, we were okay with storing data on our servers, because the volume of data was fairly limited and the time needed to process it was acceptable. But in today's technological world, data is growing too fast and people rely on it constantly. At the speed at which data is growing, it is becoming impossible to store it all on any single server.
What is Big Data?
• Big Data is a term used for a collection of data sets that are large and
complex, which is difficult to store and process using available
database management tools or traditional data processing applications.
The challenge includes capturing, curating, storing, searching, sharing,
transferring, analyzing and visualizing this data.
Big Data Characteristics
• VOLUME
• VELOCITY
• VARIETY
• VERACITY
• VALUE
VOLUME
• Volume refers to the ‘amount of data’, which is growing day by day at
a very fast pace. The size of data generated by humans, machines and
their interactions on social media itself is massive.
VELOCITY
• Velocity is defined as the pace at which different sources generate data every day. This flow of data is massive and continuous. Facebook, for example, reports 1.03 billion daily active users on mobile, an increase of 22% year-over-year.
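As a quick sanity check on the figure above, the previous year's user count implied by 22% year-over-year growth can be computed directly:

```python
# Back-of-envelope check on the velocity figure above: if 1.03 billion daily
# active users is 22% higher than a year earlier, last year's figure was:
current = 1.03e9
growth = 0.22
previous = current / (1 + growth)
print(f"{previous / 1e9:.2f} billion")  # 0.84 billion
```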
VARIETY
• As there are many sources contributing to Big Data, the types of data they generate differ. Data can be structured, semi-structured or unstructured.
VERACITY
• Veracity refers to the uncertainty of data: because data arrives from so many different sources, it is often messy, inconsistent and incomplete, so its accuracy and trustworthiness must be assessed before it is used.
VALUE
• Value refers to the ability to turn all this data into something useful for the business: data is of no use unless we can extract insights from it.
Types of Data
• Structured
• Semi-Structured
• Unstructured
Structured
• Data that can be stored and processed in a fixed format is called Structured Data. Data stored in a relational database management system (RDBMS) is one example of structured data. Structured data is easy to process because it has a fixed schema. Structured Query Language (SQL) is typically used to manage this kind of data.
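The fixed-schema idea can be sketched with Python's built-in sqlite3 module standing in for a full RDBMS (the table and rows are hypothetical):

```python
import sqlite3

# Structured data sketch: a fixed schema enforced by an RDBMS, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                 [(1, "Alice", "Sales"), (2, "Bob", "IT"), (3, "Carol", "IT")])

# Every row conforms to the same schema, so SQL queries are straightforward.
rows = conn.execute(
    "SELECT name FROM employees WHERE dept = 'IT' ORDER BY name").fetchall()
print(rows)  # [('Bob',), ('Carol',)]
```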
Semi-Structured
• Data that does not reside in a rigid relational schema, but still contains tags or markers to separate and label its elements, is called Semi-Structured Data. XML and JSON files are common examples.
Unstructured
• Data with no pre-defined format or model is called Unstructured Data. Free text, images, audio and video are examples. It forms the bulk of the data generated today and is the hardest to store and process.
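The semi-structured category above can be illustrated with JSON, where records share labels but not a fixed schema (the records are hypothetical):

```python
import json

# Semi-structured data sketch: each JSON record may carry different fields,
# unlike a row in a relational table.
raw = '''
[{"name": "Alice", "email": "alice@example.com"},
 {"name": "Bob", "phones": ["555-0100", "555-0199"]}]
'''
records = json.loads(raw)

# Fields must be accessed defensively, since no schema guarantees they exist.
for person in records:
    contact = person.get("email") or person.get("phones", ["unknown"])[0]
    print(person["name"], "->", contact)
# Alice -> alice@example.com
# Bob -> 555-0100
```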
Big Data Challenges
• The challenges in Big Data are the real implementation hurdles. They require immediate attention, because if they are not handled, the technology may fail, which can lead to unpleasant results. Big Data challenges include storing and analyzing extremely large and fast-growing data.
• Data Quality – The problem here is the 4th V, i.e. Veracity. The data is often messy, inconsistent and incomplete. Dirty data is estimated to cost companies in the United States $600 billion every year.
• Discovery – Finding insights in Big Data is like finding a needle in a haystack. Analyzing petabytes of data with extremely powerful algorithms to find patterns and insights is very difficult.
• Storage – The more data an organization has, the more complex the problems of managing it can
become. The question that arises here is “Where to store it?”. We need a storage system which
can easily scale up or down on-demand.
• Analytics – In the case of Big Data, most of the time we are unaware of the kind of data we are
dealing with, so analyzing that data is even more difficult.
• Security – Since the data is huge in size, keeping it secure is another challenge. It includes user
authentication, restricting access based on a user, recording data access histories, proper use of
data encryption etc.
• Lack of Talent – There are a lot of Big Data projects in major organizations, but assembling a sophisticated team of developers, data scientists and analysts who also have sufficient domain knowledge is still a challenge.
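The Data Quality challenge above (messy, inconsistent, incomplete records) can be sketched as a simple validation pass in Python; the records and rules here are hypothetical, for illustration only:

```python
# Data-quality sketch for the Veracity problem: detect incomplete or
# inconsistent records before they pollute downstream analysis.
records = [
    {"id": 1, "age": 34,   "country": "US"},
    {"id": 2, "age": None, "country": "US"},    # incomplete: missing age
    {"id": 3, "age": -5,   "country": "US"},    # inconsistent: impossible age
    {"id": 4, "age": 28,   "country": "USA"},   # messy: non-standard country code
]

def is_clean(rec):
    # Keep only records that pass every validation rule.
    return (rec["age"] is not None
            and 0 <= rec["age"] <= 120
            and rec["country"] in {"US", "IN", "UK"})

clean = [r for r in records if is_clean(r)]
print(len(clean), "of", len(records), "records are clean")  # 1 of 4 records are clean
```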
Some other Big Data challenges are:
Sharing and Accessing Data:
•Perhaps the most frequent challenge in Big Data efforts is the inaccessibility of data sets from external sources.
•Sharing data can cause substantial challenges, including the need for inter- and intra-institutional legal agreements.
•Accessing data from public repositories leads to multiple difficulties.
•Data needs to be available in an accurate, complete and timely manner, because decisions made from a company's information system are only as reliable as the data behind them.
Privacy and Security:
•This is another very important challenge of Big Data, with sensitive conceptual, technical and legal dimensions.
•Most organizations are unable to maintain regular checks due to the large amounts of data generated. Ideally, security checks and monitoring should be performed in real time, as this is most beneficial.
Analytical Challenges:
•There are some huge analytical challenges in Big Data, which raise questions such as: how to deal with a problem when the data volume gets too large? Two recurring concerns are:
• Fault tolerance
• Scalability
Big Data Technologies
• Big Data Technology can be defined as software utilities designed to analyse, process and extract information from extremely large and complex data sets that traditional data-processing software could never deal with.
• Data Storage
• Data Mining
• Data Analytics
• Data Visualization
Data Storage
Hadoop Framework
• The Hadoop framework was designed to store and process data in a distributed data-processing environment on commodity hardware, using a simple programming model. It can store and analyse data spread across different machines at high speed and low cost.
• Developed by: Apache Software Foundation; released on 10 December 2011
• Written in: Java
• Current stable version: Hadoop 3.1.1
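The "simple programming model" Hadoop popularized is MapReduce. Below is a local, single-machine Python sketch of the idea; real Hadoop distributes the map and reduce phases across a cluster:

```python
from collections import defaultdict

# A minimal local sketch of the MapReduce model (illustrative only).

def map_phase(line):
    # Map: emit (word, 1) pairs for every word in a line of input.
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(pairs):
    # Reduce: sum the counts for each distinct word
    # (in real Hadoop, a shuffle step groups keys between the phases).
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

lines = ["big data is big", "data keeps growing"]
pairs = [pair for line in lines for pair in map_phase(line)]
word_counts = reduce_phase(pairs)
print(word_counts)  # {'big': 2, 'data': 2, 'is': 1, 'keeps': 1, 'growing': 1}
```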
MongoDB
• The NoSQL Document Databases like MongoDB, offer a direct alternative to the rigid schema
used in Relational Databases. This allows MongoDB to offer Flexibility while handling a wide
variety of Datatypes at large volumes and across Distributed Architectures.
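The schema flexibility described above can be sketched in plain Python; the documents below are hypothetical, and a real deployment would use the pymongo driver against a MongoDB server rather than a list of dicts:

```python
# Documents with different shapes coexisting in one "collection" -- the
# flexibility that document stores like MongoDB offer.
collection = [
    {"name": "Alice", "email": "alice@example.com"},                  # minimal doc
    {"name": "Bob", "tags": ["admin", "ops"], "logins": 42},          # extra fields
    {"name": "Carol", "address": {"city": "Pune", "zip": "411001"}},  # nested doc
]

def find(collection, **criteria):
    # A tiny analogue of a document store's find(): match on equal fields.
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in criteria.items())]

print(find(collection, name="Bob")[0]["logins"])  # 42
```

No migration was needed to give Bob extra fields or Carol a nested address; that is the trade-off document stores make against the rigid schema of an RDBMS.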
BlockChain
• BlockChain can be used for achieving the following in a Business Network Environment:
• Shared Ledger: Here we can append the Distributed System of records across a Business
network.
• Smart Contract: Business terms are embedded in the transaction Database and Executed
with transactions.
• Privacy: Ensuring appropriate Visibility, Transactions are Secure, Authenticated and
Verifiable
• Consensus: All parties in a Business network agree to network verified transactions.
• First implemented by: Bitcoin
• Written in: JavaScript, C++, Python
• Current stable version: Blockchain 4.0
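The shared-ledger idea above can be sketched as a minimal hash-chained list in Python. This is illustrative only; real blockchains add networking, consensus among parties and smart-contract execution:

```python
import hashlib
import json

# Each block stores the hash of the previous block, so tampering with any
# earlier record breaks the chain -- the core of the shared-ledger idea.

def block_hash(block):
    # Hash the block's contents deterministically.
    payload = json.dumps(block, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def append_block(chain, transaction):
    prev = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"prev_hash": prev, "transaction": transaction})

def verify(chain):
    # The chain is valid only if every stored prev_hash still matches.
    return all(chain[i]["prev_hash"] == block_hash(chain[i - 1])
               for i in range(1, len(chain)))

chain = []
append_block(chain, "Alice pays Bob 10")
append_block(chain, "Bob pays Carol 4")
print(verify(chain))                             # True
chain[0]["transaction"] = "Alice pays Bob 1000"  # tamper with history
print(verify(chain))                             # False
```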
Data Visualization
Tableau
• Tableau is a powerful and fast-growing data visualization tool used in the Business Intelligence industry. Data analysis is very fast with Tableau, and the visualizations it creates take the form of dashboards and worksheets.