
BZ GROW MORE INSTITUTE OF MSC(CA&IT) SEM-8

Unit : 1 INTRODUCTION TO BIG DATA

Introduction – Distributed File System


What Is a Distributed File System?
A distributed file system (DFS) is a file system that spans multiple
file servers or locations, such as servers situated in different
physical places. Files are accessible just as if they were stored
locally, from any device and from anywhere on the network. A DFS makes
it convenient to share information and files among users on a network
in a controlled and authorized way.

Why Is a Distributed File System Important?


The main reason enterprises choose a DFS is to provide access to the
same data from multiple locations. For example, you might have a
team distributed all over the world, but they have to be able to access
the same files to collaborate. Or in today’s increasingly hybrid cloud
world, whenever you need access to the same data from the data
center, to the edge, to the cloud, you would want to use a DFS.

A DFS is critical in situations where you need:

• Transparent local access — Data to be accessed as if it's local to
the user for high performance.
• Location independence — No need for users to know where file
data physically resides.
• Scale-out capabilities — The ability to scale out massively by
adding more machines. DFS systems can scale to exceedingly
large clusters with thousands of servers.
• Fault tolerance — A need for your system to continue operating
properly even if some of its servers or disks fail. A fault-tolerant
DFS is able to handle such failures by spreading data across
multiple machines.
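The fault-tolerance idea above can be sketched in a few lines. This is a toy model, not a real DFS API: the class name, methods and node names are all hypothetical, and it only illustrates how replicating each file across several nodes lets reads survive server failures.

```python
# Toy sketch (hypothetical API): replication lets reads succeed
# even after some storage nodes fail.
import random

class TinyDFS:
    def __init__(self, nodes, replication=3):
        self.nodes = set(nodes)       # currently healthy storage nodes
        self.replication = replication
        self.placement = {}           # filename -> {node: replica data}

    def write(self, name, data):
        # place `replication` copies on distinct nodes
        targets = random.sample(sorted(self.nodes), self.replication)
        self.placement[name] = {n: data for n in targets}

    def fail(self, node):
        self.nodes.discard(node)      # simulate a server or disk failure

    def read(self, name):
        # any surviving replica can serve the read (location independence)
        for node, data in self.placement[name].items():
            if node in self.nodes:
                return data
        raise IOError("all replicas lost")

dfs = TinyDFS(["n1", "n2", "n3", "n4"], replication=3)
dfs.write("report.txt", b"quarterly numbers")
dfs.fail("n1")
dfs.fail("n2")
# 3 replicas on 4 distinct nodes: losing any 2 nodes still leaves a copy
print(dfs.read("report.txt"))
```

With 3 replicas spread over 4 distinct nodes, any 2 node failures still leave at least one live copy, which is exactly the pigeonhole argument behind replication factors in real systems such as HDFS.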

ASSI.PRO UPEKSHA CHAUDHRI 1



What Are the Benefits of a DFS?


The benefits of a DFS follow from the characteristics above:
transparent, high-performance access to files as if they were stored
locally; location independence, so users need not know where file data
physically resides; massive scale-out by adding more machines; and
fault tolerance through spreading data across multiple servers, so the
system keeps operating even when individual servers or disks fail.
Together these make it convenient to share information and files among
authorized users on a network in a controlled way.

What Are the Different Types of Distributed File Systems?


These are the most common DFS implementations:

• Windows Distributed File System


• Network File System (NFS)
• Server Message Block (SMB)
• Google File System (GFS)
• Lustre
• Hadoop Distributed File System (HDFS)
• GlusterFS
• Ceph
• MapR File System

Big Data and its importance:-


What is Big Data

Data that is very large in size is called Big Data. Normally we work
on data of megabytes (Word documents, Excel sheets) or at most
gigabytes (movies, code), but data at the petabyte scale, i.e. 10^15
bytes, is called Big Data. It is often stated that almost 90% of
today's data has been generated in the past three years.
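To make the jump from everyday file sizes to the petabyte scale concrete, here is a quick back-of-envelope check using the decimal units from the paragraph above (1 PB = 10^15 bytes):

```python
# Scale check: how many ordinary files fit in one petabyte?
MB = 10**6    # ~ a Word document
GB = 10**9    # ~ a movie
PB = 10**15   # the Big Data scale mentioned above

print(PB // MB)   # → 1000000000 (a billion 1 MB documents per PB)
print(PB // GB)   # → 1000000   (a million 1 GB movies per PB)
```

A single petabyte therefore holds on the order of a billion typical office documents, which is why Big Data cannot be handled with the desktop tools used for MB- and GB-scale files.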

Importance of Big data:-


The importance of Big Data doesn't revolve around the amount of data a
company has; it lies in how the company utilizes the data it gathers.
Every company uses its collected data in its own way, and the more
effectively a company uses its data, the more rapidly it grows.
Companies in the present market need to collect and analyze data
because:
1. Cost Savings
Big Data tools like Apache Hadoop and Spark bring cost-saving
benefits to businesses that have to store large amounts of data.
These tools also help organizations identify more effective ways of
doing business.
2. Time Saving
Real-time, in-memory analytics helps companies collect data from
various sources. Tools like Hadoop let them analyze data immediately,
supporting quick decisions based on what they learn.
3. Understand the Market Conditions
Big Data analysis helps businesses get a better understanding of
market conditions. For example, analyzing customer purchasing
behavior lets a company identify its best-selling products and
produce them accordingly. This helps companies get ahead of their
competitors.

4. Social Media Listening
Companies can perform sentiment analysis using Big Data tools, which
give them feedback about their company: who is saying what about it.
Companies can also use Big Data tools to improve their online
presence.
5. Boost Customer Acquisition and Retention
Customers are a vital asset on which any business depends. No
business can succeed without building a robust customer base, and
even with a solid customer base, companies can't ignore the
competition in the market. Big Data analytics helps businesses
identify customer-related trends and patterns; customer behavior
analysis leads to a more profitable business.
6. Solve Advertisers' Problems and Offer Marketing Insights
Big Data analytics shapes all business operations. It enables
companies to meet customer expectations, informs changes to the
company's product line, and underpins powerful marketing campaigns.
7. The Driver of Innovation and Product Development
Big Data enables companies to innovate and redevelop their products.
Real-Time Benefits of Big Data:-
Big Data analytics has spread its roots into every field. As a
result, Big Data is used across a wide range of industries including
finance and banking, healthcare, education, government, retail and
manufacturing.
Many companies, such as Amazon, Netflix, Spotify, LinkedIn and
Swiggy, use big data analytics. The banking sector makes extensive
use of Big Data analytics, and the education sector uses data
analytics to enhance students' performance as well as to make
teaching easier for instructors.
Summary
We can conclude that Big Data helps companies make informed decisions
and understand their customers' desires.
Big Data technologies help us spot inefficiencies and opportunities
in our company, and play a major role in shaping an organization's
growth.

Four Vs:-

Big data requires strong data handling processes in data-intensive


systems. Today, with the incredible growth of data collection into
systems of diverse kinds and sizes around the world, we need to
understand big data basics for review, audit and security purposes.
The characteristics of big data that force new architectures are as
follows:

• Velocity (i.e., rate of flow)
• Volume (i.e., the size of the dataset)
• Variety (i.e., data from multiple repositories, domains or types)
• Veracity (i.e., provenance of the data and its management)

These 4 characteristics are known colloquially as the Vs of big data.
The 4 Vs are used in the following ways:

• Velocity describes the speed at which data are processed. The
data usually arrive in batches or are streamed continuously. As
with certain other nonrelational databases, distributed
programming frameworks were not developed with security and
privacy in mind. Malfunctioning computing nodes might leak
confidential data. Partial infrastructure attacks could
compromise a significantly large fraction of the system due to
high levels of connectivity and dependency.
• Volume describes how much data are coming in. This typically
ranges from gigabytes to exabytes and beyond. As a result, the
volume of big data has necessitated storage in multitiered
storage media. The movement of data between tiers has led to
a requirement of cataloging threat models and a surveying of
novel techniques. This requirement is the threat model for
network-based, distributed, auto-tier systems. A positive of
having large volumes of data is that analytics can be performed
to help detect security breach events. This is an instance where
big data technologies can help to fortify security.
• Variety describes the organization of the data including
whether the data are structured, semi-structured or
unstructured. Retargeting traditional relational database
security to non-relational databases has been a challenge.
These systems were not designed with security and privacy in
mind, and these functions are usually relegated to middleware.
Traditional encryption technology also hinders the organization
of data based on semantics.
An emerging phenomenon introduced by big data variety is the
ability to infer identity from anonymized datasets by correlating
with apparently innocuous public databases. Sensitive data are
shared after sufficient removal of apparently unique identifiers
and indirectly identifying information by the processes of
anonymization and aggregation.
• Veracity includes provenance and curation. Provenance is
based upon the pedigree of the data, the metadata and the
context of the data when collected. This is important for both
data quality and for protecting security and maintaining privacy
policies. Big data frequently moves across individual boundaries
to groups and communities of interest and across state,
national and international boundaries. An additional area of the
pedigree is the potential chain of custody and collection
authority of the data. Curation is an integral concept that binds
veracity and provenance to principles of governance and data
quality assurance. Curation, for example, may improve raw
data by fixing errors, filling in gaps, modeling, calibrating values
and ordering data collection. Furthermore, there is a central
and broadly recognized privacy principle incorporated in many
privacy frameworks (e.g., the Organisation for Economic Co-
operation and Development [OECD] principles, the EU General
Data Protection Regulation [GDPR], and the Federal Trade
Commission's [FTC] fair information practices) that data subjects
must be able
to view and correct information collected about them in a
database.
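The re-identification risk described under Variety can be demonstrated in a few lines. The datasets and names below are entirely made up for illustration: an "anonymized" table with names removed is linked to a hypothetical public record on quasi-identifiers (ZIP code and birth year), recovering identities.

```python
# Sketch of re-identification by linkage (all data hypothetical):
# names were removed from the medical table, but the quasi-identifier
# pair (zip, birth_year) still matches a public record uniquely.
anonymized = [
    {"zip": "38001", "birth_year": 1975, "diagnosis": "flu"},
    {"zip": "38002", "birth_year": 1988, "diagnosis": "asthma"},
]
public_roll = [  # an apparently innocuous public database
    {"name": "A. Smith", "zip": "38001", "birth_year": 1975},
    {"name": "B. Jones", "zip": "38002", "birth_year": 1988},
]

# Build a lookup from quasi-identifiers to names, then join
index = {(p["zip"], p["birth_year"]): p["name"] for p in public_roll}
for row in anonymized:
    key = (row["zip"], row["birth_year"])
    if key in index:  # a unique combination leaks the identity
        print(index[key], "->", row["diagnosis"])
```

This is why removing direct identifiers alone is not sufficient anonymization: any combination of attributes that is unique in a public dataset can act as a key.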

Drivers for Big data:-


In the realm of big data, drivers can be understood as the forces or
factors that push organizations to adopt big data technologies and
practices. Here are some key drivers for big data adoption:

1. Data Explosion: The exponential growth of data generated from
various sources such as social media, sensors, mobile devices, and
IoT (Internet of Things) devices necessitates the use of big data
technologies to manage, process, and analyze this vast amount of
information.

2. Competitive Advantage: Companies are increasingly leveraging big
data analytics to gain insights into market trends, customer
behavior, and competitive intelligence. Those who can effectively
harness big data gain a significant advantage over competitors.

3. Cost Reduction: Big data technologies enable more efficient
storage, processing, and analysis of data compared to traditional
methods. This can lead to cost savings in infrastructure, storage,
and operations.

4. Improved Decision Making: Big data analytics empowers
organizations to make data-driven decisions by providing deeper
insights and predictive analytics based on large volumes of
structured and unstructured data.

5. Personalization and Customer Experience: Big data enables
businesses to analyze customer data to personalize products,
services, and marketing efforts, ultimately enhancing the overall
customer experience and increasing customer satisfaction and
loyalty.

6. Operational Efficiency: By analyzing operational data,
organizations can identify inefficiencies, optimize processes, and
improve overall operational performance.

7. Regulatory Compliance: Compliance requirements such as GDPR,
HIPAA, and others necessitate the proper management and protection
of data. Big data technologies can help organizations ensure
compliance by implementing robust data governance and security
measures.

8. Innovation: Big data analytics fosters innovation by enabling
organizations to uncover new insights, develop new products and
services, and explore new business opportunities based on
data-driven discoveries.

9. Real-time Insights: With big data technologies, organizations
can analyze data in real time or near real time, allowing them to
respond quickly to changing market conditions, emerging trends, and
customer needs.

10. Digital Transformation: Big data is often a key component of
digital transformation initiatives aimed at modernizing business
processes, improving agility, and staying competitive in the digital
age.

These drivers collectively push organizations across various
industries to invest in big data technologies and capabilities to
unlock the value hidden within their data assets.
Big data analytics:-
What is big data analytics?
Big data analytics is the often complex process of examining big
data to uncover information -- such as hidden patterns, correlations,
market trends and customer preferences -- that can help
organizations make informed business decisions.

On a broad scale, data analytics technologies and techniques give


organizations a way to analyze data sets and gather new
information. Business intelligence (BI) queries answer basic
questions about business operations and performance.

Big data analytics is a form of advanced analytics, which involves
complex applications with elements such as predictive models,
statistical algorithms and what-if analysis powered by analytics
systems.

An example of big data analytics can be found in the healthcare
industry, where millions of patient records, medical claims, clinical
results, care management records and other data must be collected,
aggregated, processed and analyzed. Big data analytics is used for
accounting, decision-making, predictive analytics and many other
purposes. This data varies greatly in type, quality and accessibility,
presenting significant challenges but also offering tremendous
benefits.

Why is big data analytics important?


Organizations can use big data analytics systems and software to
make data-driven decisions that can improve their business-related
outcomes. The benefits can include more effective marketing, new
revenue opportunities, customer personalization and improved
operational efficiency. With an effective strategy, these benefits can
provide competitive advantages over competitors.

How does big data analytics work?


Data analysts, data scientists, predictive modelers, statisticians and
other analytics professionals collect, process, clean and analyze
growing volumes of structured transaction data, as well as other
forms of data not used by conventional BI and analytics programs.

The following is an overview of the four steps of the big data


analytics process:

1. Data professionals collect data from a variety of different
sources. Often, it's a mix of semistructured and unstructured
data. While each organization uses different data streams,
some common sources include the following:


o Internet clickstream data.
o Web server logs.
o Cloud applications.
o Mobile applications.
o Social media content.
o Text from customer emails and survey responses.
o Mobile phone records.
o Machine data captured by sensors connected to
the internet of things.
2. Data is prepared and processed. After data is collected and
stored in a data warehouse or data lake, data professionals
must organize, configure and partition the data properly for
analytical queries. Thorough data preparation and
processing results in higher performance from analytical
queries. Sometimes this processing is batch processing, with
large data sets analyzed over time after being received; other
times it takes the form of stream processing, where small
data sets are analyzed in near real time, which can increase
the speed of analysis.
3. Data is cleansed to improve its quality. Data professionals
scrub the data using scripting tools or data quality software.
They look for any errors or inconsistencies, such as
duplications or formatting mistakes, and organize and tidy
the data.
4. The collected, processed and cleaned data is analyzed using
analytics software. This includes tools for the following:

o Data mining, which sifts through data sets in search of
patterns and relationships.
o Predictive analytics, which builds models to forecast
customer behavior and other future actions, scenarios and
trends.
o Machine learning, which taps various algorithms to
analyze large data sets.
o Deep learning, which is a more advanced offshoot of
machine learning.
o Text mining and statistical analysis software.
o Artificial intelligence.
o Mainstream BI software.
o Data visualization tools.
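The four steps above (collect, prepare, cleanse, analyze) can be sketched end-to-end on a toy clickstream. The event records are invented for illustration, and the standard library stands in for the warehouses, data-quality software and analytics tools a real pipeline would use:

```python
# The four-step analytics process in miniature (stdlib only).
from collections import Counter

# 1. Collect: semistructured events from mixed sources
raw = [
    {"user": "u1", "page": "/home"},
    {"user": "u1", "page": "/home"},       # duplicate event
    {"user": "u2", "page": "/Pricing "},   # formatting mistake
    {"user": "u3", "page": "/home"},
    {"user": None, "page": "/home"},       # unusable record
]

# 2. Prepare and 3. Cleanse: normalize fields, drop bad rows, dedupe
seen, clean = set(), []
for e in raw:
    if e["user"] is None:
        continue                           # drop records with errors
    e = {"user": e["user"], "page": e["page"].strip().lower()}
    key = (e["user"], e["page"])
    if key not in seen:                    # remove duplications
        seen.add(key)
        clean.append(e)

# 4. Analyze: a simple descriptive query (page popularity)
views = Counter(e["page"] for e in clean)
print(views.most_common())  # → [('/home', 2), ('/pricing', 1)]
```

In practice step 2 would also partition the data in a warehouse or lake, and step 4 would hand the cleaned data to the mining, machine learning and BI tools listed above.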
Types of big data analytics
There are several different types of big data analytics, each with its
own application within the enterprise.

• Descriptive analytics. This is the simplest form of analytics,
where data is analyzed for general assessment and
summarization. For example, in sales reporting, an
organization can analyze the effectiveness of its marketing
from such data.
• Diagnostic analytics. This refers to analytics that determine
why a problem occurred. For example, this could include
gathering and studying competitor pricing data to determine
when a product's sales fell off because the competitor
undercut it with a price drop.
• Predictive analytics. This refers to analysis that predicts
what comes next. For example, this could include monitoring
the performance of machines in a factory and comparing
that data to historical data to determine when a machine is
likely to break down or require maintenance or replacement.


• Prescriptive analytics. This form of analysis follows
diagnostics and predictions. After an issue has been
identified, it provides a recommendation of what can be
done about it. For example, this could include addressing
inconsistencies in the supply chain that are causing pricing
problems by identifying suppliers whose performance is
unreliable and suggesting their replacement.
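Descriptive and predictive analytics can be contrasted in a few lines. The temperature readings below are invented, standing in for the factory-machine monitoring example above: first summarize what happened, then naively extrapolate the trend one step ahead.

```python
# Descriptive vs. predictive analytics, in miniature.
from statistics import mean

temps = [70.0, 72.0, 74.0, 76.0]   # hypothetical hourly machine readings

# Descriptive: what happened?
print(mean(temps))                  # → 73.0 (average temperature so far)

# Predictive (naive): what comes next?
# Assume the average step between readings continues.
steps = [b - a for a, b in zip(temps, temps[1:])]
forecast = temps[-1] + mean(steps)
print(forecast)                     # → 78.0 (next expected reading)
```

A real predictive model would compare such forecasts against historical failure data to decide when maintenance is due; prescriptive analytics would then go one step further and recommend the action to take.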
Key big data analytics technologies and tools
Many different types of tools and technologies are used to support
big data analytics processes, including the following:

• Hadoop is an open source framework for storing and
processing big data sets. Hadoop can handle large amounts
of structured and unstructured data.
• Predictive analytics hardware and software process large
amounts of complex data and use machine learning and
statistical algorithms to make predictions about future event
outcomes. Organizations use predictive analytics tools for
fraud detection, marketing, risk assessment and operations.
• Stream analytics tools are used to filter, aggregate and
analyze big data that might be stored in different formats or
platforms.
• Distributed storage replicates data, generally on a
nonrelational database. This can serve as a measure against
independent node failures, lost or corrupted big data, or to
provide low-latency access.
• NoSQL databases are nonrelational data management
systems that are useful when working with large sets of
distributed data. NoSQL databases don't require a fixed
schema, which makes them ideal for raw and unstructured
data.


• A data lake is a large storage repository that holds
native-format raw data until it's needed. Data lakes use a flat
architecture.
• A data warehouse is a repository that stores large amounts
of data collected by different sources. Data
warehouses typically store data using predefined schemas.
• Knowledge discovery and big data mining tools help
businesses mine large amounts of structured and
unstructured big data.
• In-memory data fabric distributes large amounts of data
across system memory resources. This helps provide low
latency for data access and processing.
• Data virtualization enables data access without technical
restrictions.
• Data integration software enables big data to be streamlined
across different platforms, including Apache Hadoop,
MongoDB and Amazon EMR.
• Data quality software cleanses and enriches large data sets.
• Data preprocessing software prepares data for further
analysis. Data is formatted and unstructured data is
cleansed.
• Apache Spark is an open source cluster computing
framework used for batch and stream data processing.
• Microsoft Power BI and Tableau end-to-end analytics
platforms bring big data analytics to the desktop and back
out to dashboards, with full suites of tools for analysis and
reporting.

Big data analytics applications often include data from both internal
systems and external sources, such as weather data or demographic
data on consumers compiled by third-party information services
providers. In addition, streaming analytics applications are becoming
more common in big data environments as users look to
perform real-time analytics on data fed into Hadoop systems through
stream processing engines, such as Spark, Flink and Storm.

Early big data systems were mostly deployed on premises,


particularly in large organizations that collected, organized and
analyzed massive amounts of data. But cloud platform vendors, such
as Amazon Web Services (AWS), Google and Microsoft, have made it
easier to set up and manage Hadoop clusters in the cloud. The same
goes for Hadoop suppliers such as Cloudera, which support the
distribution of the big data framework on AWS, Google
and Microsoft Azure clouds. Users can spin up clusters in the cloud,
run them for as long as they need and then take them offline with
usage-based pricing that doesn't require ongoing software licenses.

Big data has become increasingly beneficial in supply chain analytics.


Big supply chain analytics uses big data and quantitative methods to
enhance decision-making processes across the supply chain.
Specifically, big supply chain analytics expands data sets for
increased analysis that goes beyond the traditional internal data
found on enterprise resource planning and supply chain
management systems. Also, big supply chain analytics implements
highly effective statistical methods on new and existing data sources.

Big data analytics uses and examples


The following are some examples of how big data analytics can be
used to help organizations:

• Customer acquisition and retention. Consumer data can
help the marketing efforts of companies, which can act on
trends to increase customer satisfaction. For example,
personalization engines for Amazon, Netflix and Spotify can
provide improved customer experiences and create customer
loyalty.
• Targeted ads. Personalization data from sources such as past
purchases, interaction patterns and product page viewing
histories can help generate compelling targeted ad
campaigns for users on the individual level and on a larger
scale.
• Product development. Big data analytics can provide
insights to inform organizations about product viability,
development decisions, progress measurement and steer
improvements in the direction of what best fits customer
needs.
• Price optimization. Retailers can opt for pricing models that
use and model data from a variety of data sources to
maximize revenues.
• Supply chain and channel analytics. Predictive analytical
models can help with preemptive replenishment, business-
to-business supplier networks, inventory management, route
optimizations and the notification of potential delays to
deliveries.
• Risk management. Big data analytics can identify new risks
from data patterns for effective risk management strategies.
• Improved decision-making. Insights business users extract
from relevant data can help organizations make quicker and
better decisions.
Big data analytics benefits
The benefits of using big data analytics include the following:

• Real-time intelligence. Organizations can quickly analyze
large amounts of real-time data from different sources, in
many different formats and types.


• Better-informed decisions. Effective strategizing can benefit
and improve the supply chain, operations and other areas of
strategic decision-making.
• Cost savings. This can result from new business process
efficiencies and optimizations.
• Better customer engagement. A better understanding of
customer needs, behavior and sentiment can lead to better
marketing insights and provide information for product
development.
• Optimize risk management strategies. Big data analytics
improve risk management strategies by enabling
organizations to address threats in real time.
Big data analytics challenges
Despite the wide-reaching benefits that come with using big data
analytics, its use also comes with the following challenges:

• Data accessibility. With larger amounts of data, storage and
processing become more complicated. Big data should be
stored and maintained properly to ensure it can be used by
less experienced data scientists and analysts.
• Data quality maintenance. With high volumes of data
coming in from a variety of sources and in different
formats, data quality management for big data requires
significant time, effort and resources to properly maintain it.
• Data security. The complexity of big data systems presents
unique security challenges. Properly addressing security
concerns within such a complicated big data ecosystem can
be a complex undertaking.
• Choosing the right tools. Selecting from the vast array of big
data analytics tools and platforms available on the market
can be confusing, so organizations must know how to pick
the best tool that aligns with users' needs and infrastructure.
• Talent shortages. With a potential lack of internal analytics
skills and the high cost of hiring experienced data scientists
and engineers, some organizations are finding it hard to fill
the gaps.
History and growth of big data analytics
The term big data was first used to refer to increasing data volumes
in the mid-1990s. In 2001, Doug Laney, then an analyst at
consultancy Meta Group Inc., expanded the definition of big data.
This expansion described the increase of the following:

• Volume of data being stored and used by organizations.
• Variety of data being generated by organizations.
• Velocity, or speed, at which that data was being created and
updated.

Those three factors became known as the 3V's of big data. Gartner
popularized this concept in 2005 after acquiring Meta Group and
hiring Laney. Over time, the 3V's became the 5V's by
adding value and veracity and sometimes a sixth V for variability.

Another significant development in the history of big data was the


launch of the Hadoop distributed processing framework. Hadoop
was launched in 2006 as an Apache open source project. This planted
the seeds for a clustered platform built on top of commodity
hardware that could run big data applications. The Hadoop
framework of software tools is widely used for managing big data.

By 2011, big data analytics began to take a firm hold in organizations


and the public eye, along with Hadoop and various related big data
technologies.


Initially, as the Hadoop ecosystem took shape and started to mature,


big data applications were primarily used by large internet and e-
commerce companies such as Yahoo, Google and Facebook, as well
as analytics and marketing services providers.

More recently, a broader variety of users have embraced big data


analytics as a key technology driving digital transformation. Users
include retailers, financial services firms, insurers, healthcare
organizations, manufacturers, energy companies and other
enterprises.

High-quality decision-making using data analysis can help contribute
to a high-performance organization.

Big data application:-

The term Big Data refers to large amounts of complex and unprocessed
data. Nowadays companies use Big Data to make business more informed
and to support business decisions, by enabling data scientists,
analytical modelers and other professionals to analyse large volumes
of transactional data. Big data is the valuable and powerful fuel that
drives the large IT industries of the 21st century, and it is a
spreading technology used in every business sector. In this section,
we will discuss applications of Big Data.


Travel and Tourism

The travel and tourism industry is a major user of Big Data. It
enables companies to forecast the travel facilities required at
multiple locations, improve business through dynamic pricing, and
much more.


Financial and banking sector

The financial and banking sectors use big data technology
extensively. Big data analytics helps banks understand customer
behaviour on the basis of investment patterns, shopping trends,
motivation to invest, and inputs obtained from personal or financial
backgrounds.


Healthcare

Big data has started making a massive difference in the healthcare
sector. With the help of predictive analytics, medical professionals
and healthcare personnel can provide personalized care to individual
patients.

Telecommunication and media


Telecommunications and the multimedia sector are major users of Big
Data. Zettabytes of data are generated every day, and handling such
large-scale data requires big data technologies.

Government and Military

The government and military also use big data technology at high
rates. Consider the volume of records the government keeps; in the
military, a single fighter plane may need to process petabytes of
data.

Government agencies use Big Data to run their operations: managing
utilities, dealing with traffic jams, and tackling crime such as
hacking and online fraud.

Aadhar Card: The government has a record of 1.21 billion citizens. This vast data is analyzed and stored to find things such as the number of youth in the country. Schemes are then built to target the maximum population. Big data of this size cannot be stored in a traditional database, so it is stored and analyzed using Big Data Analytics tools.


E-commerce

E-commerce is also an application of Big Data. Maintaining relationships with customers is essential for the e-commerce industry. E-commerce websites apply many marketing ideas to retain customers, manage transactions, and implement better, more innovative strategies to improve business with Big Data.

o Amazon: Amazon is a tremendous e-commerce website dealing with lots of traffic daily. When there is a pre-announced sale on Amazon, traffic increases rapidly and may crash the website. To handle this type of traffic and data, Amazon uses Big Data, which helps in organizing and analyzing the data for future use.


Social Media

Social media is the largest data generator. Statistics show that around 500+ terabytes of fresh data are generated on social media daily, particularly on Facebook. The data mainly contains videos, photos, message exchanges, etc. A single activity on a social media site generates a large amount of stored data that gets processed when required. Because the stored data runs to terabytes (TB), processing it takes a lot of time; Big Data is the solution to this problem.

Algorithms using map reduce:-

The MapReduce algorithm contains two important tasks, namely Map and Reduce.

• The map task is done by means of the Mapper class.

• The reduce task is done by means of the Reducer class.

The Mapper class takes the input, tokenizes it, maps it, and sorts it. The output of the Mapper class is used as input by the Reducer class, which in turn searches for matching pairs and reduces them.
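The Map → shuffle/sort → Reduce pipeline described above can be simulated in plain Python. This is a minimal sketch for illustration only, not actual Hadoop API code, and the input lines are made up:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Tokenize the input line and emit (word, 1) pairs.
    for word in line.split():
        yield (word, 1)

def reducer(key, values):
    # Reduce all matching pairs for one key to a single (word, count) pair.
    return (key, sum(values))

lines = ["big data is big", "data is data"]

# Map phase: apply the mapper to every input line.
intermediate = [pair for line in lines for pair in mapper(line)]

# Shuffle and sort: order the intermediate pairs by key, as Hadoop does,
# then hand each key with all of its values to the reducer.
intermediate.sort(key=itemgetter(0))
result = [reducer(k, (v for _, v in g))
          for k, g in groupby(intermediate, key=itemgetter(0))]
print(result)  # [('big', 2), ('data', 3), ('is', 2)]
```

Here the sort-plus-`groupby` step plays the role of Hadoop's shuffle-and-sort phase: each reducer call receives one key together with all of that key's values.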


MapReduce implements various mathematical algorithms to divide a task into small parts and assign them to multiple systems. In technical terms, the MapReduce algorithm helps send the Map and Reduce tasks to the appropriate servers in a cluster.

These mathematical algorithms may include the following −

• Sorting
• Searching
• Indexing
• TF-IDF

Sorting

Sorting is one of the basic MapReduce algorithms used to process and analyze data. MapReduce implements a sorting algorithm to automatically sort the output key-value pairs from the mapper by their keys.

• Sorting methods are implemented in the mapper class itself.

• In the Shuffle and Sort phase, after tokenizing the values in the mapper class, the Context class (a user-defined class) collects the matching valued keys as a collection.

• To collect similar key-value pairs (intermediate keys), the Mapper class takes the help of the RawComparator class to sort the key-value pairs.

• The set of intermediate key-value pairs for a given Reducer is automatically sorted by Hadoop to form key-values (K2, {V2, V2, …}) before they are presented to the Reducer.
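As a sketch of that last point, the (K2, {V2, V2, …}) grouping that Hadoop performs before the Reduce phase can be imitated in a few lines of plain Python (the employee names and salaries below are invented for illustration):

```python
from collections import defaultdict

# Intermediate (key, value) pairs as emitted by several mappers.
pairs = [("kiran", 45000), ("gopal", 50000), ("satish", 26000), ("gopal", 50000)]

# Hadoop sorts the pairs by key and groups the values, producing
# (K2, {V2, V2, ...}) for each key before it reaches the Reducer.
grouped = defaultdict(list)
for k, v in sorted(pairs):
    grouped[k].append(v)

for key, values in grouped.items():
    print(key, values)
# gopal [50000, 50000]
# kiran [45000]
# satish [26000]
```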


Searching

Searching plays an important role in the MapReduce algorithm. It helps in the combiner phase (optional) and in the Reducer phase. Let us try to understand how searching works with the help of an example.

Example

The following example shows how MapReduce employs the searching algorithm to find out the details of the employee who draws the highest salary in a given employee dataset.

• Let us assume we have employee data in four different files − A, B, C, and D. Let us also assume there are duplicate employee records in all four files because the employee data was imported repeatedly from all the database tables. See the following illustration.

• The Map phase processes each input file and provides the
employee data in key-value pairs (<k, v> : <emp name, salary>).
See the following illustration.

• The combiner phase (searching technique) will accept the input from the Map phase as a key-value pair with employee name and salary. Using the searching technique, the combiner will check all the employee salaries to find the highest-salaried employee in each file. See the following snippet.
<k: employee name, v: salary>

Max = salary of the first employee // treated as the max salary so far

if (v(next employee).salary > Max) {
    Max = v(salary);
}
else {
    continue checking;
}

The expected result is as follows −

<satish, 26000> <gopal, 50000> <kiran, 45000> <manisha, 45000>

• Reducer phase − From each file, you will find the highest-salaried employee. To avoid redundancy, check all the <k, v> pairs and eliminate duplicate entries, if any. The same algorithm is used between the four <k, v> pairs coming from the four input files. The final output should be as follows −
<gopal, 50000>
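The combiner-then-reducer search can be sketched end to end in plain Python (this is not Hadoop code; the per-file records are invented to match the figures above):

```python
# Hypothetical per-file employee records, matching the example's salaries.
files = {
    "A": [("satish", 26000), ("gopal", 50000)],
    "B": [("kiran", 45000), ("satish", 26000)],
    "C": [("manisha", 45000), ("gopal", 50000)],
    "D": [("gopal", 50000), ("kiran", 45000)],
}

def combiner(records):
    # Per-file search: keep only the highest-salaried employee.
    return max(records, key=lambda kv: kv[1])

# Combiner phase: one (name, max salary) pair per input file.
per_file = [combiner(records) for records in files.values()]

# Reducer phase: search the per-file maxima for the overall maximum.
name, salary = max(per_file, key=lambda kv: kv[1])
print(name, salary)  # gopal 50000
```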
Indexing

Normally, indexing is used to point to a particular piece of data and its address. MapReduce performs batch indexing on the input files for a particular Mapper.

The indexing technique normally used in MapReduce is known as an inverted index. Search engines like Google and Bing use the inverted indexing technique. Let us try to understand how indexing works with the help of a simple example.

Example

The following text is the input for inverted indexing. Here T[0], T[1], and T[2] are the file names, and their contents are given in double quotes.

T[0] = "it is what it is"


T[1] = "what is it"
T[2] = "it is a banana"

After applying the Indexing algorithm, we get the following output −

"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}

Here "a": {2} implies the term "a" appears in the T[2] file. Similarly,
"is": {0, 1, 2} implies the term "is" appears in the files T[0], T[1], and
T[2].
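The inverted index above can be reproduced with a short, illustrative Python sketch (plain Python, not a MapReduce job):

```python
docs = {
    0: "it is what it is",
    1: "what is it",
    2: "it is a banana",
}

# Build the inverted index: term -> set of document ids containing it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

for term in sorted(index):
    print(f'"{term}": {sorted(index[term])}')
# "a": [2]
# "banana": [2]
# "is": [0, 1, 2]
# "it": [0, 1, 2]
# "what": [0, 1]
```

In a real MapReduce job, the mapper would emit (term, doc_id) pairs and the reducer would collect each term's document ids; the dictionary above stands in for that shuffle-and-reduce step.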

TF-IDF

TF-IDF is a text processing algorithm whose name is short for Term Frequency − Inverse Document Frequency. It is one of the common web analysis algorithms. Here, the term 'frequency' refers to the number of times a term appears in a document.

Term Frequency (TF)

It measures how frequently a particular term occurs in a document. It is calculated as the number of times a word appears in a document divided by the total number of words in that document.

TF(the) = (Number of times the term ‘the’ appears in a document) / (Total number of terms in the document)


Inverse Document Frequency (IDF)

It measures the importance of a term. It is calculated from the number of documents in the text database divided by the number of documents in which a specific term appears.

While computing TF, all the terms are considered equally important. That means TF counts the term frequency even for common words like “is”, “a”, “what”, etc. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following −

IDF(the) = log(Total number of documents / Number of documents with the term ‘the’ in it), where the logarithm in the example below is taken to base 10.

The algorithm is explained below with the help of a small example.

Example

Consider a document containing 1000 words, wherein the


word hive appears 50 times. The TF for hive is then (50 / 1000) =
0.05.

Now, assume we have 10 million documents and the


word hive appears in 1000 of these. Then, the IDF is calculated as
log(10,000,000 / 1,000) = 4.

The TF-IDF weight is the product of these quantities − 0.05 × 4 = 0.20.
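The same arithmetic can be written as a minimal Python sketch (using the base-10 logarithm, matching the worked example above):

```python
import math

total_words_in_doc = 1000       # words in the document
occurrences_of_hive = 50        # times the word "hive" appears in it
total_docs = 10_000_000         # documents in the collection
docs_containing_hive = 1000     # documents that contain "hive"

# Term frequency: occurrences divided by document length.
tf = occurrences_of_hive / total_words_in_doc          # 50 / 1000 = 0.05

# Inverse document frequency, base-10 log as in the example.
idf = math.log10(total_docs / docs_containing_hive)    # log10(10000) = 4.0

# TF-IDF weight is the product of the two.
tf_idf = tf * idf                                      # 0.05 * 4.0 = 0.2
print(tf, idf, tf_idf)
```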
