
SCSB1231 – DATA AND INFORMATION SCIENCE

UNIT 5 – PRINCIPLES OF BIG DATA

Introduction to Big Data - Challenges of processing Big Data (Volume, Velocity and
Variety perspective) - Use Cases.

What is Big Data?

Big data is a collection of large datasets that cannot be processed using traditional
computing techniques. It is not a single technique or tool; rather, it has become a complete
subject that involves various tools, techniques, and frameworks.

What Comes Under Big Data?

Big data involves the data produced by different devices and applications. Given below are
some of the fields that come under the umbrella of Big Data.
• Black Box Data − It is a component of helicopters, airplanes, jets, etc. It captures
the voices of the flight crew, recordings from microphones and earphones, and the
performance information of the aircraft.
• Social Media Data − Social media such as Facebook and Twitter hold information
and the views posted by millions of people across the globe.
• Stock Exchange Data − The stock exchange data holds information about the ‘buy’
and ‘sell’ decisions made by customers on shares of different companies.
• Power Grid Data − The power grid data holds information about the power consumed
by a particular node with respect to a base station.
• Transport Data − Transport data includes model, capacity, distance and availability of
a vehicle.
• Search Engine Data − Search engines retrieve lots of data from different databases.
Thus, Big Data includes huge volume, high velocity, and extensible variety of data. The data
in it will be of three types.
• Structured data − Relational data.
• Semi Structured data − XML data.
• Unstructured data − Word, PDF, Text, Media Logs.

Benefits of Big Data

• Using the information kept in social networks like Facebook, marketing
agencies are learning about the response to their campaigns, promotions, and other
advertising media.
• Using information from social media, such as the preferences and product perception of
their consumers, product companies and retail organizations are planning their
production.
• Using data regarding the previous medical history of patients, hospitals are
providing better and quicker service.

Big Data Technologies

Big data technologies are important in providing more accurate analysis, which may lead to
more concrete decision-making resulting in greater operational efficiencies, cost reductions,
and reduced risks for the business.
To harness the power of big data, you would require an infrastructure that can manage and
process huge volumes of structured and unstructured data in real-time and can protect data
privacy and security.
There are various technologies in the market from different vendors including Amazon, IBM,
Microsoft, etc., to handle big data. While looking into the technologies that handle big data,
we examine the following two classes of technology −

Operational Big Data

These include systems like MongoDB that provide operational capabilities for real-time,
interactive workloads where data is primarily captured and stored.
NoSQL Big Data systems are designed to take advantage of new cloud computing
architectures that have emerged over the past decade to allow massive computations to be
run inexpensively and efficiently. This makes operational big data workloads much easier to
manage, cheaper, and faster to implement.
Some NoSQL systems can provide insights into patterns and trends based on real-time data
with minimal coding and without the need for data scientists and additional infrastructure.
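
As a rough illustration, the sketch below shows what a simple operational, interactive workload
against MongoDB might look like using the pymongo driver. The database, collection, and field
names are invented, and a locally running MongoDB instance is assumed; this is not part of the
text above, just a hedged example.

```python
# Minimal sketch of an operational (real-time, interactive) workload on MongoDB.
# Assumes a local MongoDB instance and the pymongo driver; the "shop"/"orders"
# names and the document fields are illustrative only.
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]

# Write path: capture an event as it happens.
orders.insert_one({
    "customer_id": 42,
    "status": "placed",
    "total": 59.90,
    "created_at": datetime.now(timezone.utc),
})

# Read path: a selective, low-latency lookup typical of operational systems.
recent = orders.find({"customer_id": 42}).sort("created_at", -1).limit(10)
for doc in recent:
    print(doc["status"], doc["total"])
```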

Analytical Big Data

These include systems like Massively Parallel Processing (MPP) database systems and
MapReduce that provide analytical capabilities for retrospective and complex analysis that
may touch most or all of the data.
MapReduce provides a method of analysing data that is complementary to the
capabilities provided by SQL, and a system based on MapReduce can be scaled up from
single servers to thousands of high- and low-end machines.
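
The following toy sketch (plain Python, on a single machine) illustrates the map, shuffle, and
reduce phases behind this idea; real MapReduce frameworks run the same three phases in parallel
across a cluster. The input lines are invented for illustration.

```python
# A toy, single-machine illustration of the MapReduce idea: a map step emits
# (key, value) pairs, a shuffle step groups them by key, and a reduce step
# aggregates each group. Real frameworks distribute these phases across
# thousands of machines.
from collections import defaultdict

def map_phase(line):
    # Emit (word, 1) for every word in a line of input.
    for word in line.lower().split():
        yield word, 1

def reduce_phase(word, counts):
    # Aggregate all values emitted for one key.
    return word, sum(counts)

lines = ["big data is big", "data in motion and data at rest"]

# Shuffle: group intermediate pairs by key.
groups = defaultdict(list)
for line in lines:
    for word, count in map_phase(line):
        groups[word].append(count)

results = [reduce_phase(word, counts) for word, counts in groups.items()]
print(sorted(results))  # [('and', 1), ('at', 1), ('big', 2), ('data', 3), ...]
```
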
These two classes of technology are complementary and frequently deployed together.

Operational vs. Analytical Systems

Terms            Operational           Analytical
Latency          1 ms - 100 ms         1 min - 100 min
Concurrency      1,000 - 100,000       1 - 10
Access Pattern   Writes and Reads      Reads
Queries          Selective             Unselective
Data Scope       Operational           Retrospective
End User         Customer              Data Scientist
Technology       NoSQL                 MapReduce, MPP Database

What are the Characteristics of Big Data?


Three characteristics define Big Data: volume, variety, and velocity.

Together, these characteristics define “Big Data”. They have created the need for a new class
of capabilities to augment the way things are done today to provide a better line of sight and
control over our existing knowledge domains and the ability to act on them.

1. The Volume of Data

The sheer volume of data being stored today is exploding. In the year 2000, 800,000 petabytes
(PB) of data were stored in the world; this number is expected to reach 35 zettabytes (ZB) by
2020. Of course, a lot of the data that’s being created today isn’t analyzed at all, and that’s
another problem that needs to be considered. Twitter alone generates more than 7 terabytes
(TB) of data every day, Facebook 10 TB, and some enterprises generate terabytes of data
every hour of every day of the year. It’s no longer unheard of for individual enterprises to
have storage clusters holding petabytes of data.

When you stop and think about it, it’s little wonder we’re drowning in data. We store
everything: environmental data, financial data, medical data, surveillance data, and the list
goes on and on. For example, taking your smartphone out of your holster generates an event;
when your commuter train’s door opens for boarding, that’s an event; check-in for a plane,
badge into work, buy a song on iTunes, change the TV channel, take an electronic toll route—
every one of these actions generates data.

Okay, you get the point: There’s more data than ever before and all you have to do is look at
the terabyte penetration rate for personal home computers as the telltale sign. We used to
keep a list of all the data warehouses we knew that surpassed a terabyte almost a decade
ago—suffice to say, things have changed when it comes to volume.

As implied by the term “Big Data,” organizations are facing massive volumes of data.
Organizations that don’t know how to manage this data are overwhelmed by it. But the
opportunity exists, with the right technology platform, to analyze almost all of the data (or at
least more of it by identifying the data that’s useful to you) to gain a better understanding of
your business, your customers, and the marketplace. And this leads to the current conundrum
facing today’s businesses across all industries.

As the amount of data available to the enterprise is on the rise, the percent of data it can
process, understand, and analyze is on the decline, thereby creating the blind zone.

What’s in that blind zone?


You don’t know: it might be something great or maybe nothing at all, but the “don’t know” is
the problem (or the opportunity, depending on how you look at it). The conversation about
data volumes has changed from terabytes to petabytes with an inevitable shift to zettabytes,
and all this data can’t be stored in your traditional systems.

2. The Variety of Data


The volume associated with the Big Data phenomenon brings along a new challenge for data
centers trying to deal with it: its variety.

With the explosion of sensors and smart devices, as well as social collaboration technologies,
data in an enterprise has become complex, because it includes not only traditional relational
data, but also raw, semi-structured, and unstructured data from web pages, weblog files
(including click-stream data), search indexes, social media forums, e-mail, documents, sensor
data from active and passive systems, and so on.

What’s more, traditional systems can struggle to store and perform the required analytics to gain
understanding from the contents of these logs because much of the information being generated
doesn’t lend itself to traditional database technologies. In my experience, although some
companies are moving down the path, by and large, most are just beginning to understand the
opportunities of Big Data.

Quite simply, variety represents all types of data—a fundamental shift in analysis
requirements from traditional structured data to include raw, semi-structured, and
unstructured data as part of the decision-making and insight process. Traditional analytic
platforms can’t handle variety. However, an organization’s success will rely on its ability to
draw insights from the various kinds of data available to it, which includes both traditional
and non-traditional.

When we look back at our database careers, sometimes it’s humbling to see that we spent
more of our time on just 20 percent of the data: the relational kind that’s neatly formatted and
fits ever so nicely into our strict schemas. But the truth of the matter is that 80 percent of the
world’s data (and more and more of this data is responsible for setting new velocity and
volume records) is unstructured, or semi-structured at best. If you look at a Twitter feed,
you’ll see structure in its JSON format—but the actual text is not structured, and
understanding that can be rewarding.

Video and picture images aren’t easily or efficiently stored in a relational database; certain
event information can change dynamically (such as weather patterns), which isn’t well suited
to strict schemas; and so on. To capitalize on the Big Data opportunity, enterprises must be
able to analyze all types of data, both relational and non-relational: text, sensor data, audio,
video, transactional, and more.

3. The Velocity of Data

Just as the sheer volume and variety of data we collect and store have changed, so, too, has
the velocity at which it is generated and needs to be handled. A conventional understanding
of velocity typically considers how quickly data arrives and is stored, and its associated
rates of retrieval. Managing all of that quickly is good, but the volumes of data that
we are looking at are largely a consequence of how quickly the data arrives. To
accommodate velocity, a new way of thinking about a problem must start at the inception
point of the data. Rather than confining the idea of velocity to the growth rates associated
with your data repositories, we suggest you apply this definition to data in motion: The speed
at which the data is flowing. After all, we’re in agreement that today’s enterprises are dealing
with petabytes of data instead of terabytes, and the increase in RFID sensors and other
information streams has led to a constant flow of data at a pace that has made it impossible
for traditional systems to handle. Sometimes, getting an edge over your competition can mean
identifying a trend, problem, or opportunity only seconds, or even microseconds, before
someone else.

In addition, more and more of the data being produced today has a very short shelf-life, so
organizations must be able to analyze this data in near real-time if they hope to find insights
in this data. In traditional processing, you can think of running queries against relatively static
data: for example, the query “Show me all people living in the ABC flood zone” would result
in a single result set to be used as a warning list of an incoming weather pattern. With streams
computing, you can execute a process similar to a continuous query that identifies people
who are currently “in the ABC flood zones,” but you get continuously updated results because
location information from GPS data is refreshed in real-time.
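
A minimal, made-up sketch of that continuous-query idea is shown below: a standing filter is
applied to each simulated GPS update as it arrives, rather than one query being run against
stored data. The zone boundary, coordinates, and event format are all invented for illustration.

```python
# Toy sketch of a "continuous query" over data in motion: every GPS update is
# checked against the standing filter as it arrives. The stream is capped with
# islice so the example terminates.
import itertools
import random

ABC_FLOOD_ZONE = {"lat": (12.90, 13.10), "lon": (80.10, 80.30)}

def gps_stream():
    # Simulated endless stream of (person_id, lat, lon) location updates.
    while True:
        yield (random.randint(1, 5),
               random.uniform(12.80, 13.20),
               random.uniform(80.00, 80.40))

def in_zone(lat, lon, zone=ABC_FLOOD_ZONE):
    return (zone["lat"][0] <= lat <= zone["lat"][1]
            and zone["lon"][0] <= lon <= zone["lon"][1])

# The "query" never finishes; its result set is continuously updated.
for person, lat, lon in itertools.islice(gps_stream(), 100):
    if in_zone(lat, lon):
        print(f"ALERT: person {person} is currently in the ABC flood zone")
```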

Dealing effectively with Big Data requires that you perform analytics against the volume and
variety of data while it is still in motion, not just after it is at rest. Consider examples from
tracking neonatal health to financial markets; in every case, they require handling the volume
and variety of data in new ways.

Applications in the real world


Big Data helps corporations make better and faster decisions, because they have more
information available to solve problems, and more data to test their hypotheses on.

Customer experience is a major field that has been revolutionized with the advent of Big
Data. Companies are collecting more data about their customers and their preferences than
ever. This data is being leveraged in a positive way, by giving personalized recommendations
and offers to customers, who are more than happy to allow companies to collect this data in
return for the personalized services. The recommendations you get on Netflix, or
Amazon/Flipkart are a gift of Big Data!

Machine Learning is another field that has benefited greatly from the increasing popularity
of Big Data. More data means larger datasets to train our ML models on, and a model trained
on more data (generally) performs better. Also, with the help of Machine
Learning, we are now able to automate tasks that were earlier done manually, all thanks
to Big Data.

Demand forecasting has become more accurate with more and more data being collected
about customer purchases. This helps companies build forecasting models that predict
future demand, so they can scale production accordingly. It helps companies, especially
those in manufacturing businesses, reduce the cost of storing unsold inventory in
warehouses.
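
As a very rough illustration, the sketch below forecasts next month's demand with a simple
moving average over invented sales figures; real forecasting models are of course far richer,
and the safety-stock rule here is just a placeholder.

```python
# Minimal sketch of demand forecasting from purchase history: a 3-month moving
# average is about the simplest possible model. All numbers are invented.
monthly_units_sold = [120, 135, 150, 160, 155, 170, 180]

window = 3
forecast_next_month = sum(monthly_units_sold[-window:]) / window
print(f"Forecast for next month: {forecast_next_month:.0f} units")

# A simple production plan: build to forecast plus a small safety stock,
# so less unsold inventory sits in the warehouse.
safety_stock = 0.10 * forecast_next_month
print(f"Suggested production: {forecast_next_month + safety_stock:.0f} units")
```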

Big data also has extensive use in applications such as product development and fraud
detection.

How to store and process Big Data?

The volume and velocity of Big Data can be huge, which makes it almost impossible to store
it in traditional data warehouses. Although some sensitive information can be stored on
company premises, for most of the data, companies have to opt for cloud storage or Hadoop.

Cloud storage allows businesses to store their data on the internet with the help of a cloud
service provider (like Amazon Web Services, Microsoft Azure, or Google Cloud Platform)
who takes the responsibility of managing and storing the data. The data can be accessed easily
and quickly with an API.
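
For example, with Amazon Web Services the data might be written and read through the S3 API, as
in the hedged sketch below. The bucket name, object key, local file, and configured credentials
are assumptions for illustration, not part of the text above.

```python
# Sketch of storing and retrieving data through a cloud provider's API, using
# Amazon S3 via the boto3 SDK. Assumes AWS credentials are configured and the
# local file exists; bucket and key names are placeholders.
import boto3

s3 = boto3.client("s3")
bucket, key = "my-company-datalake", "raw/events/2024-01-01.json"

# Upload a local file to cloud storage.
s3.upload_file("events-2024-01-01.json", bucket, key)

# Read it back on demand.
obj = s3.get_object(Bucket=bucket, Key=key)
print(obj["Body"].read()[:200])
```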

Hadoop does much the same thing by giving you the ability to store and process large amounts
of data at once. Hadoop is a free, open-source software framework that allows users to store
and process large datasets across clusters of computers.
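
Hadoop jobs are often written through higher-level engines; the sketch below uses PySpark,
which commonly runs on a Hadoop cluster (YARN and HDFS), to scan log files in parallel. The
HDFS path is a placeholder and a configured cluster is assumed.

```python
# Sketch of processing a large dataset across a cluster with PySpark, which is
# often deployed on top of Hadoop (YARN + HDFS). The path is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("log-scan").getOrCreate()

# Each partition of the files is processed on a different worker in parallel.
lines = spark.sparkContext.textFile("hdfs:///data/weblogs/*.log")
errors = lines.filter(lambda line: "ERROR" in line)

print("total lines:", lines.count())
print("error lines:", errors.count())

spark.stop()
```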

Challenges

1. Data growth

Managing datasets having terabytes of information can be a big challenge for companies. As
datasets grow in size, storing them not only becomes a challenge but also becomes an
expensive affair for companies.

To overcome this, companies are now starting to pay attention to data compression and de-
duplication. Data compression reduces the number of bits that the data needs, resulting in a
reduction in space being consumed. Data de-duplication is the process of making sure
duplicate and unwanted data does not reside in our database.
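
The sketch below illustrates both ideas on toy records: hash-based de-duplication drops content
that has already been stored, and gzip compression reduces the bytes the remaining records need.
The records themselves are invented.

```python
# Small sketch of compression and de-duplication: keep only unseen records
# (detected via a content hash), then gzip-compress what remains.
import gzip
import hashlib

records = [
    "user=1;event=click;page=/home",
    "user=2;event=view;page=/cart",
    "user=1;event=click;page=/home",   # exact duplicate
]

seen = set()
unique = []
for rec in records:
    digest = hashlib.sha256(rec.encode()).hexdigest()
    if digest not in seen:             # de-duplication: skip already-stored content
        seen.add(digest)
        unique.append(rec)

raw = "\n".join(unique).encode()
compressed = gzip.compress(raw)        # compression: fewer bytes on disk
print(len(records), "records ->", len(unique), "unique")
print(len(raw), "bytes ->", len(compressed), "bytes compressed")
```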

2. Data security

Data security is often prioritized quite low in the Big Data workflow, which can backfire at
times. With such a large amount of data being collected, security challenges are bound to
come up sooner or later.

Mining of sensitive information, fake data generation, and lack of cryptographic protection
(encryption) are some of the challenges businesses face when trying to adopt Big Data
techniques.
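
As one hedged example of adding cryptographic protection, the sketch below encrypts a sensitive
record with the third-party Python cryptography package (Fernet symmetric encryption). Key
management, the hard part in practice, is deliberately left out, and the record is invented.

```python
# Illustration of encrypting sensitive data before it is stored or transmitted,
# using the "cryptography" package's Fernet symmetric encryption.
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # in production this would live in a key vault
fernet = Fernet(key)

sensitive = b"name=Jane Doe;card=4111-1111-1111-1111"
token = fernet.encrypt(sensitive)    # store/transmit only the ciphertext
print(token[:40], b"...")

restored = fernet.decrypt(token)     # only holders of the key can read it back
assert restored == sensitive
```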

Companies need to understand the importance of data security and need to prioritize it. To
help them, there are now professional Big Data consultants who help businesses move
from traditional data storage and analysis methods to Big Data.

3. Data integration

Data is coming in from a lot of different sources (social media applications, emails, customer
verification documents, survey forms, etc.). It often becomes a very big operational challenge
for companies to combine and reconcile all of this data.

There are several Big Data solution vendors that offer ETL (Extract, Transform, Load) and
data integration solutions to companies that are trying to overcome data integration problems.
There are also several APIs that have already been built to tackle issues related to data
integration.
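
A toy ETL sketch with pandas is shown below: two small "sources" are extracted, transformed to
share a common key, and loaded into one integrated table. The column names and data are
invented, and a real pipeline would write the result to a warehouse or data lake.

```python
# Toy ETL sketch: extract from two sources, transform to a common shape, load
# into one integrated table (here just an in-memory merge).
import pandas as pd

# Extract: data arriving from different sources in different shapes.
crm = pd.DataFrame({"cust_id": [1, 2], "email": ["a@x.com", "b@y.com"]})
survey = pd.DataFrame({"customer": [2, 3], "score": [4, 5]})

# Transform: reconcile keys/column names so the sources line up.
survey = survey.rename(columns={"customer": "cust_id"})

# Load: one combined table for downstream analysis.
integrated = crm.merge(survey, on="cust_id", how="outer")
print(integrated)
```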

The future of Big Data

The volume of data being produced every day is continuously increasing, with increasing
digitization. More and more businesses are starting to shift from traditional data storage and
analysis methods to cloud solutions. Companies are starting to realize the importance of data.
All of these imply one thing: the future of Big Data looks promising! It will change the way
businesses operate, and decisions are made.

Use Cases
The following are six powerful big data use cases and their impacts on various industries. They
showcase how structured and unstructured content processing, NoSQL databases, predictive
analytics, machine learning, and advanced search relevance ranking techniques have made search
and big data analytics a strategic part of the modern business's vision.

Use case #1: Log analytics

Log data is a foundational element of many businesses' big data applications. Log
management and analysis tools have been around since long before big data. But with the
exponential growth of business activities and transactions, log data can become a huge
headache to store, process, and present in an efficient, cost-effective manner.

Many commercial and open-source log analytics tools provide the ability to collect,
process, and analyse massive log data without having to dump the data into relational
databases and retrieve it through SQL queries. The synergy between log search capabilities
and big data analytics has enabled organizations to discover insights for more agile
operations. Big data log analytics applications are now widely used for various business goals,
from IT system security and network performance, to market trends and e-commerce
personalization.
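
As a minimal illustration, the sketch below parses invented web-server-style log lines with a
regular expression and aggregates response codes in memory, without any relational database or
SQL. The log format and lines are made up for the example.

```python
# Minimal log-analytics sketch: parse raw log lines with a regular expression
# and count HTTP status codes, all outside of a relational database.
import re
from collections import Counter

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<req>[^"]*)" (?P<status>\d{3})'
)

raw_logs = [
    '10.0.0.1 - - [01/Jan/2024:10:00:01 +0000] "GET /index.html HTTP/1.1" 200',
    '10.0.0.2 - - [01/Jan/2024:10:00:02 +0000] "GET /missing HTTP/1.1" 404',
    '10.0.0.1 - - [01/Jan/2024:10:00:03 +0000] "POST /login HTTP/1.1" 500',
]

status_counts = Counter()
for line in raw_logs:
    match = LOG_PATTERN.match(line)
    if match:
        status_counts[match.group("status")] += 1

print(status_counts)   # e.g. Counter({'200': 1, '404': 1, '500': 1})
```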

Use case #2: E-commerce personalization

Remember when you were leisurely (or frantically) browsing online shopping sites to find that
perfect gift for a friend or family member (or yourself)? How often do you type in the search
box, click on the navigation bar, expand product descriptions, or add a product to your cart?
For an e-commerce company, every one of these actions can become the key to
optimizing the entire shopping experience. And thus, the daunting tasks of collecting,
processing, and analyzing shoppers’ behaviour and transaction data open up enormous
opportunities for big data in e-commerce.
A powerful search and big data analytics platform allows e-commerce companies to (1) clean
and enrich product data for a better search experience on both desktops and mobile devices;
and (2) use predictive analytics and machine learning to predict user preferences through log
data, then personalize products in a most-likely-to-buy order that maximizes conversion.
There has also been a new movement towards real-time e-commerce personalization enabled
by big data's massive processing power.

Use case #3: Recommendation engines

If you’ve been on online media streaming platforms, you may have noticed those
“recommended for you” videos, movies, or music. Doesn’t it feel great to have a selection
personalized only for you? It’s easy. It’s time-saving. Overall, a satisfying user experience,
right? Have you also noticed that the more videos and movies you watched, the better those
recommendations became? As the media and entertainment space is filled by strong
competitors, the ability to deliver the top user experience will be the winning factor.

Big data, with its scalability and power to process massive amounts of both structured (e.g.
video titles users search for, music genre they prefer) and unstructured data (e.g. user
viewing/listening patterns), can enable companies to analyze billions of clicks and viewing
data from you and other users like you for the best recommendations. Over time, through
machine learning and predictive analytics, the recommendations become better tailored to the
user’s taste.
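
The toy sketch below captures the collaborative-filtering intuition behind such
recommendations: find the user whose viewing history is most similar (by cosine similarity) and
suggest titles they watched that you have not. The users, titles, and viewing counts are
invented; production systems use far larger data and richer models.

```python
# Tiny collaborative-filtering sketch: recommend titles watched by the most
# similar user (cosine similarity over viewing counts).
import math

# Rows: users; columns: how many times each of four titles was watched.
history = {
    "you":   [5, 0, 3, 0],
    "userA": [4, 1, 3, 0],
    "userB": [0, 5, 0, 4],
}
titles = ["Title1", "Title2", "Title3", "Title4"]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

me = history["you"]
best = max((u for u in history if u != "you"), key=lambda u: cosine(me, history[u]))

# Recommend what the most similar user watched but you have not.
recs = [t for t, mine, theirs in zip(titles, me, history[best]) if mine == 0 and theirs > 0]
print("most similar user:", best, "->", recs)
```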

Use case #4: Automated candidate placement in recruiting

Recruiters often feel they don’t have the (right) tools in the race to place candidates as quickly
as possible in a competitive environment. As matching resume keywords with job descriptions
no longer provides the desired results, new approaches to using big data for recruiting have
allowed recruiters to speed up and automate the placement process like never before.

A big data recruitment platform can mine from internal databases and provide a 360-degree
view of a candidate, such as education, experience, skill sets, job titles, certifications,
geography, and anything else recruiters can think of, then compare them to the company’s
past hiring experience, salaries, previously successful candidates, etc. to identify the “best
match.” These platforms can even go beyond matching to anticipate recruiting needs and
suggest candidates before positions are posted, allowing recruiters to be more proactive and
gain a competitive edge.

Use case #5: Insurance fraud detection

Organizations that handle large numbers of financial transactions continue searching for more
innovative, effective approaches to fight fraud. Medical insurance agencies are no exception,
as fraud can cost the industry up to $5 billion annually. In the traditional fraud detection
model, fraud investigators need to work with BI analysts to run complex SQL queries on bill
and claim data, then wait weeks or months to get the results back. This process sometimes
causes lengthy delays in legal fraud cases and, thus, huge losses for the business.

With big data technologies, billions of billings and claim records can be processed and pulled
into a search engine, so that investigators can analyze individual records by performing
intuitive searches on a graphical interface. Predictive analytics and machine learning
capabilities enable a big data platform for fraud detection to provide automatic red flag alerts
as soon as it recognizes a pattern that matches a previously known fraud scheme.
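
A much-simplified sketch of the red-flag idea is shown below; it uses hand-written rules on
invented claim records, whereas a real platform would learn such patterns from historical fraud
cases with machine learning and predictive analytics.

```python
# Simplified sketch of automatic red-flag alerts: each incoming claim is checked
# against patterns resembling previously known fraud schemes. Rules, thresholds,
# and claim records are invented for illustration.
claims = [
    {"claim_id": 101, "provider": "P1", "amount": 900,   "procedures": 2},
    {"claim_id": 102, "provider": "P2", "amount": 48000, "procedures": 14},
    {"claim_id": 103, "provider": "P2", "amount": 47500, "procedures": 15},
]

def red_flags(claim):
    flags = []
    if claim["amount"] > 20000:
        flags.append("unusually large billed amount")
    if claim["procedures"] > 10:
        flags.append("improbable number of procedures in one claim")
    return flags

for claim in claims:
    flags = red_flags(claim)
    if flags:
        print(f"ALERT claim {claim['claim_id']} from {claim['provider']}: {', '.join(flags)}")
```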

Use case #6: Relevancy and retention boost for online publishing

For research publishing companies, giving their online subscribers the right content they
want is critical to building authority, expanding the subscriber base, and boosting the bottom line. In
addition to investing in great SEO effort to make the publishing site searchable, strategizing
how well the content can be served once users are on the website is a primary factor
impacting conversion and repeat business.

With the rise of personalization, big data brings a new paradigm for processing and analyzing
both content data (authors, titles, topics) and user data (document downloads, preferences).
First, a powerful search engine helps clean and enrich research documents’ metadata to
ensure users find the most relevant content and explore related content easily. Then, through
machine learning and predictive analytics, the publisher will be able to serve content in a
particular order in which the user’s favourite content appears in the top results. How do
they know for sure? Because they can repeatedly test and score the search engine’s
performance offline to predict search accuracy and abandonment rates before putting the
engine into production on the live website.

As social, cloud, and information have become the driving forces of the modern business, we
expect to see more and more innovative use cases that leverage search and big data analytics to
make sense and make use of the vast amount of data. Like the cloud, big data is here to stay
and continue to enrich the business technology ecosystem in the coming years.
