Big Data Introduction

The document discusses big data, including what it is, common sources of big data like social media and IoT devices, and technologies used to analyze big data like Hadoop and MapReduce. It covers challenges of big data and compares operational and analytical systems for working with big data.

Hadoop - Big Data Overview

What is Big Data?


Big data is a collection of large datasets that cannot be processed using traditional
computing techniques. It is not a single technique or tool; rather, it has become a complete
subject that involves various tools, techniques, and frameworks.

What Comes Under Big Data?


Big data involves the data produced by different devices and applications. Given below are
some of the fields that come under the umbrella of Big Data.

• Black Box Data − The black box is a component of helicopters, airplanes, and jets. It
captures the voices of the flight crew, recordings from microphones and earphones, and
the performance information of the aircraft.
• Social Media Data − Social media such as Facebook and Twitter hold information
and the views posted by millions of people across the globe.
• Stock Exchange Data − Stock exchange data holds information about the ‘buy’
and ‘sell’ decisions that customers make on shares of different companies.
• Power Grid Data − Power grid data holds information about the power consumed by a
particular node with respect to a base station.
• Transport Data − Transport data includes the model, capacity, distance, and availability
of a vehicle.
• Search Engine Data − Search engines retrieve large amounts of data from different
databases.

Thus, Big Data includes huge volume, high velocity, and an extensible variety of data. The
data in it is of three types.

• Structured data − Relational data.
• Semi Structured data − XML data.
• Unstructured data − Word, PDF, Text, Media Logs.
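
The three data types above can be illustrated with a minimal Python sketch. The table name, XML
tags, and log text are hypothetical sample data invented for this illustration; they do not
come from any real system. The point is only that structured data has a fixed schema,
semi-structured data carries its own optional tags, and unstructured data has to be searched
as plain text.

```python
import sqlite3
import xml.etree.ElementTree as ET


def read_structured():
    """Structured: relational rows with a fixed schema (in-memory SQLite)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trades (symbol TEXT, side TEXT, qty INTEGER)")
    conn.executemany(
        "INSERT INTO trades VALUES (?, ?, ?)",
        [("ACME", "buy", 100), ("ACME", "sell", 40)],  # hypothetical rows
    )
    rows = conn.execute(
        "SELECT side, SUM(qty) FROM trades GROUP BY side ORDER BY side"
    ).fetchall()
    conn.close()
    return rows


def read_semi_structured():
    """Semi-structured: XML elements whose attributes may be missing."""
    doc = "<vehicles><vehicle model='bus' capacity='40'/><vehicle model='van'/></vehicles>"
    root = ET.fromstring(doc)
    # A missing attribute comes back as None instead of breaking a schema.
    return [(v.get("model"), v.get("capacity")) for v in root.iter("vehicle")]


def read_unstructured():
    """Unstructured: free text with no schema, so we fall back to text search."""
    log = "2024-01-01 engine temp high\n2024-01-01 crew: request descent\n"
    return [line for line in log.splitlines() if "temp" in line]
```

Calling the three functions shows how the access method changes with the data type: a SQL
query, a tree traversal, and a substring match.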

Benefits of Big Data


• Using the information kept in social networks such as Facebook, marketing
agencies are learning about the response to their campaigns, promotions, and other
advertising mediums.
• Using information from social media, such as the preferences and product perception of
their consumers, product companies and retail organizations are planning their
production.
• Using data on the previous medical history of patients, hospitals are
providing better and quicker service.

Big Data Technologies

Big data technologies are important in providing more accurate analysis, which may lead to
more concrete decision-making and, in turn, to greater operational efficiency, cost
reductions, and reduced risk for the business.

To harness the power of big data, you need an infrastructure that can manage and
process huge volumes of structured and unstructured data in real time and can protect data
privacy and security.

There are various technologies in the market from different vendors including Amazon,
IBM, Microsoft, etc., to handle big data. While looking into the technologies that handle big
data, we examine the following two classes of technology −

Operational Big Data


This class includes systems like MongoDB that provide operational capabilities for real-time,
interactive workloads where data is primarily captured and stored.

NoSQL Big Data systems are designed to take advantage of new cloud computing
architectures that have emerged over the past decade to allow massive computations to be
run inexpensively and efficiently. This makes operational big data workloads much easier
to manage, cheaper, and faster to implement.

Some NoSQL systems can provide insights into patterns and trends based on real-time data
with minimal coding and without the need for data scientists and additional infrastructure.

Analytical Big Data


This class includes systems like Massively Parallel Processing (MPP) database systems and
MapReduce that provide analytical capabilities for retrospective and complex analysis that
may touch most or all of the data.

MapReduce provides a new method of analyzing data that is complementary to the
capabilities provided by SQL, and a system based on MapReduce can be scaled up from
single servers to thousands of high-end and low-end machines.

These two classes of technology are complementary and frequently deployed together.
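
To make the MapReduce idea concrete, here is a minimal single-process sketch of the map,
shuffle, and reduce phases, using word counting as the example. It is not Hadoop's API: the
function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only
for illustration, and a real framework would distribute each phase across many machines.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(records):
    """Run the three phases in sequence over a list of text records."""
    mapped = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Because each record is mapped independently and each key is reduced independently, both phases
can be split across machines, which is what lets such a system scale out.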

Operational vs. Analytical Systems


                  Operational          Analytical

Latency           1 ms - 100 ms        1 min - 100 min
Concurrency       1000 - 100,000       1 - 10
Access Pattern    Writes and Reads     Reads
Queries           Selective            Unselective
Data Scope        Operational          Retrospective
End User          Customer             Data Scientist
Technology        NoSQL                MapReduce, MPP Database
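
The difference between selective and unselective queries in the table above can be sketched in
a few lines of Python. The orders list and both function names are hypothetical and exist only
for this illustration; a real operational store or MPP database would do the same thing at far
larger scale.

```python
# Hypothetical sample records standing in for a much larger dataset.
orders = [
    {"id": 1, "customer": "alice", "amount": 30.0},
    {"id": 2, "customer": "bob", "amount": 12.5},
    {"id": 3, "customer": "alice", "amount": 7.5},
]

# Operational style: an index by key, so each request touches one record.
orders_by_id = {order["id"]: order for order in orders}


def get_order(order_id):
    """Selective query: look up a single record by its key."""
    return orders_by_id.get(order_id)


def total_by_customer(records):
    """Unselective query: scan every record to build an aggregate."""
    totals = {}
    for record in records:
        customer = record["customer"]
        totals[customer] = totals.get(customer, 0.0) + record["amount"]
    return totals
```

The lookup does the same small amount of work no matter how many orders exist, while the
aggregate grows with the dataset, which is consistent with the latency and concurrency rows in
the table.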

Big Data Challenges


The major challenges associated with big data are as follows −

• Capturing data
• Curation
• Storage
• Searching
• Sharing
• Transfer
• Analysis
• Presentation

Top 5 sources of big data

MEDIA AS A BIG DATA SOURCE


Media is the most popular source of big data, as it provides valuable insights on consumer
preferences and changing trends. Since it is self-broadcast and crosses all physical and
demographic barriers, it is the fastest way for businesses to get an in-depth overview of their
target audience, draw patterns and conclusions, and enhance their decision-making. Media includes
social media and interactive platforms, like Google, Facebook, Twitter, YouTube, and Instagram, as
well as generic media like images, videos, audio, and podcasts that provide quantitative and
qualitative insights on every aspect of user interaction.

CLOUD AS A BIG DATA SOURCE


Today, companies have moved ahead of traditional data sources by shifting their data to the cloud.
Cloud storage accommodates structured and unstructured data and provides businesses with real-time
information and on-demand insights. The main attributes of cloud computing are its flexibility
and scalability. As big data can be stored and sourced on public or private clouds, via networks and
servers, the cloud makes for an efficient and economical data source.


THE WEB AS A BIG DATA SOURCE

The public web constitutes big data that is widespread and easily accessible. Data on the Web or
‘Internet’ is commonly available to individuals and companies alike. Moreover, web services such as
Wikipedia provide free and quick informational insights to everyone. The enormity of the Web
ensures its diverse usability and is especially beneficial to start-ups and SMEs, as they don’t
have to wait to develop their own big data infrastructure and repositories before they can leverage
big data.

IOT AS A BIG DATA SOURCE


Machine-generated content, or data created from IoT, constitutes a valuable source of big data. This
data is usually generated by sensors that are connected to electronic devices. The sourcing
capacity depends on the ability of the sensors to provide accurate information in real time. IoT is
now gaining momentum and includes big data generated not only from computers and smartphones,
but also potentially from every device that can emit data. With IoT, data can now be sourced from
medical devices, vehicular processes, video games, meters, cameras, household appliances, and the
like.

DATABASES AS A BIG DATA SOURCE


Businesses today prefer to use an amalgamation of traditional and modern databases to acquire
relevant big data. This integration paves the way for a hybrid data model and requires low
investment and IT infrastructure costs. Furthermore, these databases are deployed for several
business intelligence purposes as well. These databases can then provide for the extraction of
insights that are used to drive business profits. Popular databases include a variety of data sources,
such as MS Access, DB2, Oracle, SQL, and Amazon Simple, among others.

Extracting and analyzing data from extensive big data sources is a complex process and can be
frustrating and time-consuming. These complications can be resolved if organizations address all
the necessary considerations of big data, take into account relevant data sources, and deploy them
in a manner that is well tuned to their organizational goals.
