UNIT 2 BDA Notes
Big Data: Data Evolution - Terminologies - Definitions -Merits and Challenges - Big Data Components-
Characteristics - Big Data Processing Frameworks - Big Data Applications – Tools for Big data Analytics.
INTRODUCTION
What is Data?
Data is defined as individual facts, such as numbers, words, measurements, observations or just descriptions of
things.
For example, data might include individual prices, weights, addresses, ages, names, temperatures, dates, or
distances.
1. Quantitative data is provided in numerical form, like the weight, volume, or cost of an item.
2. Qualitative data is descriptive, but non-numerical, like the name, sex, or eye colour of a person.
Characteristics of Data
The following are key characteristics of data:
Accuracy: Data should be sufficiently accurate for the intended use and should be captured only once, although it may have multiple uses.
Reliability: Data should reflect stable and consistent data collection processes across collection points and over time.
Digital data is the electronic representation of information in a format or language that machines can read and
understand. In more technical terms, digital data is a binary format of information that is converted into a
machine-readable digital format. The power of digital data is that any analog inputs, from very simple text
documents to genome sequencing results, can be represented with the binary system.
Structured Data:
Structured data refers to any data that resides in a fixed field within a record or file.
Having a particular Data Model.
Meaningful data.
Data arranged in rows and columns.
Structured data has the advantage of being easily entered, stored, queried and analysed.
E.g.: Relational Data Base, Spread sheets.
Structured data is often managed using Structured Query Language (SQL).
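To make this concrete, here is a minimal sketch of working with structured data through SQL, using Python's built-in sqlite3 module (the customers table and its columns are made up for illustration):

import sqlite3

# Create an in-memory relational database with a fixed schema (hypothetical "customers" table)
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")

# Structured data: every record fits the same fixed fields
cur.executemany(
    "INSERT INTO customers (name, age) VALUES (?, ?)",
    [("Asha", 29), ("Ravi", 41), ("Meena", 35)],
)
conn.commit()

# Because the schema is fixed, the data is easy to query and analyse
for row in cur.execute("SELECT name, age FROM customers WHERE age > 30 ORDER BY age"):
    print(row)

conn.close()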
Limitations of structured data: Expensive; Data quality.
Unstructured Data:
Unstructured data cannot be readily classified and fitted into a neat box.
Also called unclassified data, which does not conform to any data model.
Business rules are not applied.
Indexing is not required.
E.g.: photos and graphic images, videos, streaming instrument data, webpages, PDF files,
PowerPoint presentations, emails, blog entries, wikis and word processing documents.
Sources of Unstructured Data:
Web pages
Images (JPEG, GIF, PNG, etc.)
Videos
Memos
Reports
Word documents and PowerPoint presentations
Surveys
Disadvantages of Unstructured Data:
It is difficult to store and manage unstructured data due to the lack of schema and structure.
Indexing the data is difficult and error-prone due to the unclear structure and the absence of pre-defined attributes, due to which search results are not very accurate.
Ensuring security of the data is a difficult task.
Semi-Structured Data:
Self-describing data.
Metadata (data about data).
Also called quasi-structured data: data in between structured and unstructured data.
It is a type of structured data that does not follow a formal data model.
Data which does not have a rigid structure.
E.g.: E-mails, word processing software.
XML and other markup languages are often used to manage semi-structured data.
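As a quick illustration, here is a minimal Python sketch of handling semi-structured data; the JSON and XML snippets are made-up examples:

import json
import xml.etree.ElementTree as ET

# A JSON document: self-describing, but records need not share a rigid schema
json_text = '{"name": "Asha", "email": "asha@example.com", "tags": ["vip", "newsletter"]}'
record = json.loads(json_text)
print(record["name"], record.get("phone", "no phone on file"))  # missing fields are tolerated

# An XML fragment: the markup tags carry the metadata that describes the data
xml_text = "<email><to>ravi@example.com</to><subject>Report</subject></email>"
root = ET.fromstring(xml_text)
print(root.find("to").text, "-", root.find("subject").text)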
Sources of semi-structured Data:
E-mails
XML and other markup languages
Binary executables
TCP/IP packets
Zipped files
Integration of data from different sources
Web pages
Advantages of Semi-structured Data:
BIG DATA
Big Data is a collection of data that is huge in volume, yet growing exponentially with time. It is data of such large size and complexity that none of the traditional data management tools can store or process it efficiently. In short, big data is data, but of huge size.
Example:
Every day, 500+ terabytes of fresh data are ingested into Facebook's systems. This information is mostly gathered through photo and video uploads, message exchanges, and the posting of comments, among other things. In 30 minutes of flying time, a single jet engine may create 10+ gigabytes of data. With thousands of flights every day, the amount of data generated can amount to several petabytes. Every day, the New York Stock Exchange creates around a terabyte of new trading data.
Characteristics of Big Data
Volume: The name Big Data itself is related to an enormous size. Big Data is a vast 'volume' of data generated daily from many sources, such as business processes, machines, social media platforms, networks, human interactions, and many more.
Variety: Big Data can be structured, unstructured, or semi-structured, collected from different sources. In the past, data was collected only from databases and spreadsheets, but these days data comes in many forms, such as PDFs, emails, audio, social media posts, photos, videos, etc.
Veracity: Veracity means how reliable the data is. It involves ways to filter or translate the data, and the ability to handle and manage data efficiently. Big Data is also essential in business development.
Value: Value is an essential characteristic of big data. It is not just the data that we process or store that matters; it is the valuable and reliable data that we store, process, and also analyze.
Velocity: Velocity plays an important role compared to the others. Velocity refers to the speed at which data is created in real time. It covers the speed of incoming data sets, their rate of change, and bursts of activity. The primary aspect of Big Data is to provide demanding data rapidly. Big data velocity deals with the speed at which data flows from sources like application logs, business processes, networks, social media sites, sensors, mobile devices, etc.
Advantages of Big Data
• Access to social data from search engines and sites like Facebook and Twitter is enabling organizations to fine-tune their business strategies.
• Improved customer service: Traditional customer feedback systems are getting replaced by new systems designed with Big Data technologies.
In these new systems, Big Data and natural language processing technologies are being used to read and evaluate consumer responses.
• Cost Savings
Big data helps in providing business intelligence that can reduce costs and improve the efficiency of operations.
Processes like quality assurance and testing can involve many complications, particularly in industries like biopharmaceuticals and nanotechnologies.
• Time Reductions
Companies may collect data from a variety of sources using real-time in-memory analytics. Tools like Hadoop
enable businesses to evaluate data quickly, allowing them to make swift decisions based on their findings.
• Understand the Market Conditions
Businesses can benefit from big data analysis by gaining a better grasp of market conditions. Analysing client purchase behaviour, for example, enables businesses to discover the most popular items and develop them appropriately. This allows businesses to stay ahead of the competition.
• Social Media Listening
Companies can perform sentiment analysis using Big Data tools. These enable them to get feedback about their company, that is, who is saying what about the company. Companies can use Big Data tools to improve their online presence.
• Boost Customer Acquisition and Retention
Customers are a crucial asset that every company relies on. Without a strong consumer base, no company can be successful. However, even with a strong consumer base, businesses cannot ignore market rivalry. It will be difficult for businesses to succeed if they do not understand what their consumers desire; it will result in a loss of customers, which will have a negative impact on business growth. Businesses may use big data analytics to detect customer-related trends and patterns in customer behavior.
• Solve Advertisers' Problems and Offer Marketing Insights
Big data analytics is the key to a successful business; it shapes all company activities. It allows businesses to meet client expectations, aids in the modification of a company's product range, and guarantees that marketing initiatives are effective.
Challenges of Big Data
1. Volume: Big data involves massive amounts of data, which can be difficult to store, process, and analyze using traditional methods.
2. Velocity: Big data is often generated at a high speed, making it difficult to keep up with the flow of data and
extract valuable insights in real time.
3. Variety: Big data comes in a wide variety of formats, including structured, unstructured, and semi-structured
data. This diversity makes it challenging to integrate and analyze data from different sources.
4. Veracity: The accuracy and quality of big data can be a challenge, as data may be incomplete, inconsistent, or
biased.
5. Complexity: Big data often involves complex relationships and patterns, making it difficult to extract meaningful insights using traditional data analysis techniques.
6. Security: Protecting big data from unauthorized access, breaches, and other security threats is a major concern.
7. Privacy: Handling big data while ensuring the privacy of individuals and organizations is a critical challenge.
8. Scalability: Big data systems must be able to scale to handle increasing volumes of data and growing
analytical demands.
9. Cost: Storing, processing, and analyzing big data can be expensive, requiring significant investments in
hardware, software, and personnel.
10. Talent: There is a shortage of skilled professionals with the expertise to handle big data effectively.
Big Data Components
Most big data architectures include some or all of the following components:
Data Sources: These are the origins of the data, which can be various sources such as IoT devices, social media
platforms, databases, or real-time streams.
Real-time message ingestion: This component is responsible for capturing and processing data in real-time, as it
is generated.
Data Storage: This component stores the collected data, often using distributed file systems or NoSQL databases
designed to handle large datasets.
Batch Processing: This component processes large batches of data in a scheduled manner, allowing for efficient
computations on historical data.
Stream Processing: This component continuously processes data as it flows in, enabling real-time analytics and
insights.
Machine Learning: This component applies machine learning algorithms to the data to discover patterns, make
predictions, and automate tasks.
Analytical Data Store: This component stores processed and analyzed data for further analysis and reporting.
Analytics and Reporting: This component generates insights, reports, and visualizations based on the analyzed
data.
Orchestration: This component coordinates the overall workflow, managing the scheduling, execution, and
monitoring of various processes.
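To make the batch versus stream distinction above concrete, here is a minimal, framework-free Python sketch; the event data is invented for illustration:

from collections import Counter

events = ["login", "click", "click", "purchase", "click", "login"]

# Batch processing: operate on the full, stored dataset at once (e.g. a nightly job)
def batch_count(all_events):
    return Counter(all_events)

print("batch result:", batch_count(events))

# Stream processing: update results incrementally as each event arrives
def stream_count(event_stream):
    running = Counter()
    for event in event_stream:      # in a real system this would be an unbounded stream
        running[event] += 1
        yield dict(running)         # an up-to-date snapshot after every event

for snapshot in stream_count(events):
    print("stream snapshot:", snapshot)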
Big Data Processing Frameworks
1. Hadoop:
Hadoop is a distributed computing framework designed to process large datasets across clusters of
computers.
It consists of two main components:
Hadoop Distributed File System (HDFS): A scalable distributed file system for storing large amounts
of data.
MapReduce: A programming model for processing large datasets in parallel.
Hadoop is widely used for batch processing tasks, such as data warehousing, ETL (Extract, Transform,
Load), and data mining.
2. Apache Spark:
Apache Spark is a fast and general-purpose cluster computing framework.
It is designed to be 100 times faster than Hadoop MapReduce for certain types of workloads.
Spark supports a variety of data processing tasks, including batch processing, real-time streaming,
machine learning, and graph processing.
It offers a variety of APIs for different programming languages, such as Scala, Java, Python, and R.
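As an illustration, here is a minimal word-count sketch using Spark's Python API (PySpark); it assumes PySpark is installed, runs in local mode for testing, and the input path is a placeholder:

from pyspark.sql import SparkSession

# Start (or reuse) a Spark session; "local[*]" uses all local cores for testing
spark = SparkSession.builder.master("local[*]").appName("WordCount").getOrCreate()

# Placeholder input path; on a cluster this would typically be an HDFS location
lines = spark.sparkContext.textFile("input.txt")

counts = (
    lines.flatMap(lambda line: line.split())   # split each line into words
         .map(lambda word: (word, 1))          # emit (word, 1) pairs
         .reduceByKey(lambda a, b: a + b)      # sum the counts for each word
)

for word, count in counts.collect():
    print(word, count)

spark.stop()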
3. Apache Flink:
Apache Flink is a distributed streaming engine designed for processing real-time data streams.
It is highly scalable and can handle large volumes of data with low latency.
Flink supports a variety of streaming operations, such as filtering, joining, aggregating, and windowing.
It can also be used for batch processing tasks.
4. Apache Storm:
Apache Storm is a distributed and fault-tolerant stream processing engine.
It is designed to process real-time data streams at high speeds.
Storm is highly scalable and can handle large volumes of data.
It is often used for applications such as real-time analytics, online fraud detection, and social media
monitoring.
5. Apache Kafka:
Apache Kafka is a distributed streaming platform designed for building real-time data pipelines.
It is highly scalable and can handle large volumes of data with low latency.
Kafka is often used to collect, store, and process real-time data streams from various sources.
It can be integrated with other big data processing frameworks, such as Spark and Flink.
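As a sketch of how such a pipeline looks from Python, here is a minimal producer and consumer; it assumes the third-party kafka-python package, a broker running at localhost:9092, and a hypothetical "clickstream" topic:

import json
from kafka import KafkaProducer, KafkaConsumer

# Producer: publish events to the "clickstream" topic as JSON
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("clickstream", {"user": "u123", "page": "/home"})
producer.flush()

# Consumer: read the stream back, starting from the earliest available message
consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:            # this loop blocks and keeps reading new messages
    print(message.topic, message.value)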
Big Data Applications
All of the data must be recorded and processed, which takes a lot of expertise, resources, and time. Data may be creatively and meaningfully used to provide business benefits. There are three sorts of business applications, each with varying degrees of revolutionary potential, as shown in the figure.
Monitoring and Tracking Applications
These are the first and most fundamental Big Data applications. In practically all industries, they aid in increasing corporate efficiency. The following are a few examples of specialized applications:
Public Health Monitoring
The US government is encouraging all healthcare stakeholders to establish a national platform for interoperability and data sharing standards. This would enable secondary use of health data, which would advance Big Data
analytics and personalized holistic precision medicine. This would be a broad-based platform like Google flu
trends.
Consumer Sentiment Monitoring
Social media has become more powerful than advertising. Many good companies have moved a bulk of their advertising budgets from traditional media into social media. They have set up Big Data listening platforms, where
social media data streams (including tweets, and Facebook posts and blog posts) are filtered and analysed for
certain keywords or sentiments, by certain demographics and regions. Actionable information from this analysis
is delivered to marketing professionals for appropriate action, especially when the product is new to the market.
Asset Tracking
The US department of defense is encouraging the industry to devise a tiny RFID chip that could prevent the
counterfeiting of electronic parts that end up in avionics or circuit boards of other devices. Airplanes are one of
the heaviest users of sensors which track every aspect of the performance of every part of the plane. The data can
be displayed on the dashboard as well as stored for later detailed analysis. Working with communicating devices,
these sensors can produce a torrent of data. Theft by shoppers and employees is a major source of loss of revenue
for retailers. All valuable items in the store can be assigned RFID tags, and the gates of the store can be equipped
with RF readers. This can help secure the products, and reduce leakage (theft) from the store.
Supply Chain Monitoring
All containers on ships communicate their status and location using RFID tags. Thus retailers and their suppliers
can gain real-time visibility to the inventory throughout the global supply chain. Retailers can know exactly
where the items are in the warehouse, and so can bring them into the store at the right time. This is particularly
relevant for seasonal items that must be sold on time, or else they will be sold at a discount. With item-level RFID tags, retailers also gain full visibility of each item and can serve their customers better.
Preventive Machine Maintenance
All machines, including cars and computers, do tend to fail sometimes. This is because one or more of their components may cease to function. As a preventive measure, precious equipment could be equipped with sensors.
The continuous stream of data from the sensors could be monitored and analyzed to forecast the status of key
components, and thus, monitor the overall machine’s health. Preventive maintenance can, thus, reduce the cost of
downtime.
Analysis Applications
These are the next generation of big data apps. They have the ability to improve corporate effectiveness and have
transformational potential. Big Data may be organized and analyzed to reveal trends and insights that can be
utilized to improve business.
Predictive Policing
Winning political elections
Personal Health
Predictive Policing
The notion of predictive policing was created by the Los Angeles Police Department. The LAPD collaborated
with UC Berkeley academics to examine its massive database of 13 million crimes spanning 80 years and forecast
the likelihood of particular sorts of crimes occurring at specific times and in specific areas. They pinpointed crime hotspots of certain categories, at specific times and in specific areas, where crimes had happened and were likely to occur in the future. Based on a basic insight derived from the metaphor of earthquakes and their aftershocks, crime patterns were statistically modeled. The model said that once a crime occurred in a location, it represented a certain disturbance in harmony, and would thus lead to a greater likelihood of a similar crime occurring in the local vicinity soon. The model showed, for each police beat, the specific neighborhood blocks and specific time slots where crime was likely to occur. By aligning the police car patrol schedule in accordance with the model's predictions, the LAPD could reduce crime by 12 percent to 26 percent for different categories of crime. Recently, the San Francisco Police Department released its own crime data for over 2 years, so data analysts could model that data and help prevent future crimes.
Winning Political Elections
The US president, Barack Obama, was the first major political candidate to use big data in a significant way, in the 2008 elections. He is called the first big data president. His campaign gathered data about millions of people, including supporters.
supporters. They invented the mechanism to obtain small campaign contributions from millions of supporters.
They created personal profiles of millions of supporters and what they had done and could do for the campaign.
Data was used to determine undecided voters who could be converted to their side. They provided phone numbers
of these undecided voters to the volunteers. The results of the calls were recorded in real time using interactive
web applications. Obama himself used his Twitter account to communicate his message directly with his millions of followers. After the elections, Obama converted his list of tens of millions of supporters into an advocacy machine that would provide grassroots support for the president's initiatives. Since then, almost all campaigns use big data. Senator Bernie Sanders used the same big data playbook to build an effective national political machine powered entirely by small donors. Election analyst Nate Silver created sophisticated predictive models using inputs from many political polls and surveys to successfully predict the winners of US elections. Nate was, however, unsuccessful in predicting Donald Trump's rise and ultimate victory, and that shows the limits of big data.
Personal health
Medical knowledge and technology are growing by leaps and bounds. IBM's Watson system is a big data analytics
engine that ingests and digests all the medical information in the world, and then applies it intelligently to an
individual situation. Watson can provide a detailed and accurate medical diagnosis using current symptoms,
patient history, medical history and environmental trends, and other parameters. Similar products might be
offered as an APP to licensed doctors, and even individuals, to improve productivity and accuracy in health care.
New Product Development
These are completely new notions that did not exist previously. These applications have the ability to disrupt
whole sectors and provide organizations with new revenue streams.
An auto insurance company can use the GPS data from cars to calculate the risk of accidents based on travel
patterns. The automobile companies can use the car sensor data to track the performance of a car. Safer drivers
can be rewarded and the errant drivers can be penalized.
A retailer or a third-party advertiser can target customers with specific promotions and coupons based on location data obtained through the Global Positioning System (GPS), the time of day, and the presence of stores nearby, mapping these to the consumer preference data available from social media databases. Advertisements and offers can be delivered through mobile apps, SMS and email. These are examples of mobile app services.
Recommendation service
Ecommerce has been a fast-growing industry in the last couple of decades. A variety of products are sold and shared over the internet. Web users' browsing and purchase history on ecommerce sites is utilized to learn about their preferences and needs, and to advertise relevant product and pricing offers in real time. Amazon uses a personalized recommendation engine to suggest additional products to consumers based on affinities between various products. Netflix also uses a recommendation engine to suggest entertainment options to its users. Big data is valuable across all industries.
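As a toy illustration of the affinity idea behind such recommendation engines (a minimal sketch with made-up purchase data, not Amazon's or Netflix's actual algorithm):

from collections import defaultdict
from itertools import combinations

# Made-up purchase baskets: each list is one customer's order
baskets = [
    ["laptop", "mouse", "laptop bag"],
    ["laptop", "mouse"],
    ["phone", "phone case", "charger"],
    ["laptop", "laptop bag"],
]

# Count how often each pair of products is bought together (their "affinity")
co_counts = defaultdict(int)
for basket in baskets:
    for a, b in combinations(sorted(set(basket)), 2):
        co_counts[(a, b)] += 1

def recommend(product, top_n=3):
    """Return the products most frequently bought together with `product`."""
    scores = defaultdict(int)
    for (a, b), count in co_counts.items():
        if a == product:
            scores[b] += count
        elif b == product:
            scores[a] += count
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("laptop"))   # e.g. ['laptop bag', 'mouse']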
There are three major types of data sources of big data: people-to-people communications, people-to-machine communications, and machine-to-machine communications. Each type has many sources of data. There are three types of applications: the monitoring type, the analysis type and new product development. They have an impact on efficiency, effectiveness and even disruption of industries.
Tools for Big Data Analytics
There are a number of tools used in Big Data. The most popular tools are:
Apache Hadoop
A large data framework is the Apache Hadoop software library. It enables massive data sets to be processed
across clusters of computers in a distributed manner. It's one of the most powerful big data technologies, with the
ability to grow from a single server to thousands of computers.
Features
HPCC
HPCC is a big data tool developed by LexisNexis Risk Solution. It delivers on a single platform, a single
architecture and a single programming language for data processing.
Features
It is one of the highly efficient big data tools that accomplish big data tasks with far less code.
It is one of the big data processing tools which offers high redundancy and availability.
It can be used for complex data processing on a Thor cluster. Its graphical IDE simplifies development, testing and debugging. It automatically optimizes code for parallel processing.
Provides enhanced scalability and performance. ECL code compiles into optimized C++, and it can also be extended using C++ libraries.
Apache STORM
Storm is a free, open-source big data computation system. It is one of the best big data tools, offering a distributed, real-time, fault-tolerant processing system with real-time computation capabilities.
Features
It is one of the best tools from the big data tools list, benchmarked at processing one million 100-byte messages per second per node.
It has big data technologies and tools that use parallel calculations running across a cluster of machines.
It will automatically restart in case a node dies; the worker will be restarted on another node. Storm guarantees that each unit of data will be processed at least once or exactly once.
Once deployed, Storm is surely the easiest tool for big data analysis.
Qubole
Qubole Data is an autonomous Big Data management platform. It is an open-source big data tool which is self-managed and self-optimizing, and allows the data team to focus on business outcomes.
Features
Apache Cassandra
The Apache Cassandra database is widely used today to provide an effective management of large amounts of
data.
Features
Support for replicating across multiple data centers by providing lower latency for users
Data is automatically replicated to multiple nodes for fault-tolerance
It is one of the best big data tools, most suitable for applications that can't afford to lose data, even when an entire data center is down.
Support contracts and services for Cassandra are available from third parties.
Statwing
Statwing is an easy-to-use statistical tool. It was built by and for big data analysts. Its modern interface chooses
statistical tests automatically.
Features
It is a big data software that can explore any data in seconds. Statwing helps to clean data, explore relationships, and create charts in minutes.
It allows creating histograms, scatterplots, heatmaps, and bar charts that export to Excel or PowerPoint. It also translates results into plain English, so that analysts unfamiliar with statistical analysis can understand them.
CouchDB
CouchDB stores data in JSON documents that can be accessed from the web or queried using JavaScript. It offers distributed scaling with fault-tolerant storage. It makes data accessible by defining the Couch Replication Protocol.
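As a small sketch of how documents are stored and fetched over CouchDB's HTTP interface from Python (this assumes a local CouchDB instance; the credentials, database name and document are made up):

import json
import requests

# Hypothetical local CouchDB instance and admin credentials
BASE = "http://localhost:5984"
AUTH = ("admin", "password")

# Create a database, then store a JSON document in it
requests.put(f"{BASE}/demo_db", auth=AUTH)
doc = {"type": "sensor_reading", "device": "engine-42", "temp_c": 88.5}
resp = requests.put(f"{BASE}/demo_db/reading-001", auth=AUTH, data=json.dumps(doc))
print(resp.json())   # e.g. {'ok': True, 'id': 'reading-001', 'rev': '1-...'}

# Fetch the document back; CouchDB returns it as JSON
fetched = requests.get(f"{BASE}/demo_db/reading-001", auth=AUTH).json()
print(fetched["device"], fetched["temp_c"])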
Features
Pentaho
Pentaho provides big data tools to extract, prepare and blend data. It offers visualizations and analytics that
change the way to run any business. This Big data tool allows turning big data into big insights.
Features:
Data access and integration for effective data visualization. It is a big data software that empowers users to architect big data at the source and stream it for accurate analytics.
Seamlessly switch or combine data processing with in-cluster execution to get maximum processing.
Allows checking data with easy access to analytics, including charts, visualizations, and reporting.
Supports a wide spectrum of big data sources by offering unique capabilities.
Apache Flink
Apache Flink is one of the best open-source data analytics tools for stream processing big data. It supports distributed, high-performing, always-available, and accurate data streaming applications.
Features:
Provides results that are accurate, even for out-of-order or late-arriving data
It is stateful and fault-tolerant and can recover from failures.
It is a big data analytics software which can perform at a large scale, running on thousands of nodes
Has good throughput and latency characteristics
This big data tool supports stream processing and windowing with event time semantics. It supports flexible windowing based on time, count, or sessions, as well as data-driven windows.
It supports a wide range of connectors to third-party systems for data sources and sinks
Cloudera
Cloudera is the fastest, easiest and most secure modern big data platform. It allows anyone to get any data across any environment within a single, scalable platform.
Features:
Open Refine
OpenRefine is a powerful big data tool. It is a big data analytics software that helps to work with messy data,
cleaning it and transforming it from one format into another. It also allows extending it with web services and
external data.
Features:
The OpenRefine tool helps you explore large data sets with ease. It can be used to link and extend your dataset with various web services, and to import data in various formats.
Explore datasets in a matter of seconds
Apply basic and advanced cell transformations
Allows to deal with cells that contain multiple values
Create instantaneous links between datasets. Use named-entity extraction on text fields to automatically
identify topics. Perform advanced data operations with the help of Refine Expression Language
RapidMiner
RapidMiner is one of the best open-source data analytics tools. It is used for data prep, machine learning, and
model deployment. It offers a suite of products to build new data mining processes and set up predictive analysis.
Features
DataCleaner
DataCleaner is a data quality analysis application and a solution platform. It has a strong data profiling engine. It is extensible and thereby adds data cleansing, transformations, matching, and merging.
Feature:
Kaggle
Kaggle is the world's largest big data community. It helps organizations and researchers to post their data &
statistics. It is the best place to analyze data seamlessly.
Features:
Apache Hive
Hive is an open-source big data software tool. It allows programmers to analyze large data sets on Hadoop. It helps with querying and managing large datasets quickly.
Features:
It Supports SQL like query language for interaction and Data modeling
It compiles queries into two main kinds of tasks: map and reduce.
It allows defining these tasks using Java or Python
Hive designed for managing and querying only structured data
Hive's SQL-inspired language separates the user from the complexity of Map Reduce programming
It offers Java Database Connectivity (JDBC) interface
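As a small sketch of issuing a HiveQL query from Python (this assumes the third-party PyHive package and a HiveServer2 instance reachable at localhost:10000; the sales table is hypothetical):

from pyhive import hive

# Connect to a HiveServer2 instance (hypothetical host, port, user and table)
conn = hive.Connection(host="localhost", port=10000, username="analyst")
cursor = conn.cursor()

# HiveQL looks like SQL; Hive compiles it into map and reduce tasks behind the scenes
cursor.execute(
    "SELECT product, COUNT(*) AS orders "
    "FROM sales "
    "GROUP BY product "
    "ORDER BY orders DESC "
    "LIMIT 10"
)

for product, orders in cursor.fetchall():
    print(product, orders)

conn.close()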
Hadoop Architecture
The Hadoop core is divided into two fundamental layers:
MapReduce engine
HDFS
The MapReduce engine is the computation engine running on top of HDFS as its data storage manager.
HDFS: HDFS is a distributed file system inspired by GFS that organizes files and stores their data on a
distributed computing system.
HDFS Architecture: HDFS has a master/slave architecture containing a single Name Node as the master
and a number of Data Nodes as workers (slaves).
The Hadoop architecture is a package of the file system, MapReduce engine and the HDFS (Hadoop
Distributed File System). The MapReduce engine can be MapReduce/MR1 or YARN/MR2.
A Hadoop cluster consists of a single master and multiple slave nodes. The master node includes Job
Tracker, Task Tracker, NameNode, and DataNode whereas the slave node includes DataNode and
TaskTracker.
Data Node
Data blocks are used to store data. It is the responsibility of the DataNode to serve read and write requests from the file system's clients.
It performs block creation, deletion, and replication upon instruction from the NameNode.
Job Tracker
The role of the Job Tracker is to accept MapReduce jobs from clients and process the data by using the NameNode.
In response, NameNode provides metadata to Job Tracker.
Task Tracker
It works as a slave node for Job Tracker.
It receives tasks and code from the Job Tracker and applies that code to the file. This process can also be called a Mapper.
MapReduce Layer
The MapReduce comes into existence when the client application submits the MapReduce job to Job Tracker. In
response, the Job Tracker sends the request to the appropriate Task Trackers. Sometimes, the TaskTracker fails or times out. In such a case, that part of the job is rescheduled.
Advantages of Hadoop
Fast: In HDFS, the data is distributed over the cluster and mapped, which helps in faster retrieval. Even the tools to process the data are often on the same servers, thus reducing the processing time. Hadoop is able to process terabytes of data in minutes and petabytes in hours.
Scalable: Hadoop cluster can be extended by just adding nodes in the cluster.
Cost Effective: Hadoop is open source and uses commodity hardware to store data, so it is really cost-effective compared to a traditional relational database management system.
Resilient to failure: HDFS has the property of replicating data over the network, so if one node is down or some other network failure happens, then Hadoop takes the other copy of the data and uses it. Normally, data is replicated thrice, but the replication factor is configurable.
Hadoop MapReduce
Hadoop MapReduce is the processing unit of Hadoop. In the MapReduce approach, the processing is
done at the slave nodes, and the final result is sent to the master node.
Instead of moving the data to the code, the code used to process the data is sent to the nodes where the data resides. This code is usually very small in comparison to the data itself; you only need to send a few kilobytes' worth of code to perform a heavy-duty process on the data.
The input dataset is first split into chunks of data. In this example, the input has three lines of text with three separate entities - “bus car train,” “ship ship train,” “bus ship car.” The dataset is then split into three chunks, based on these entities, and processed in parallel.
In the map phase, the data is assigned a key and a value of 1. In this case, we have one bus, one car, one
ship, and one train.
These key-value pairs are then shuffled and sorted together based on their keys. At the reduce phase, the
aggregation takes place, and the final output is obtained.
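A minimal pure-Python sketch of the same word-count flow (map, shuffle and sort, reduce), using the three lines from the example above; a real MapReduce job would run these phases in parallel across the cluster:

from collections import defaultdict

lines = ["bus car train", "ship ship train", "bus ship car"]

# Map phase: emit a (word, 1) pair for every word in every chunk
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle and sort phase: group the pairs by key (the word)
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: aggregate the values for each key
result = {word: sum(counts) for word, counts in grouped.items()}
print(result)   # {'bus': 2, 'car': 2, 'train': 2, 'ship': 3}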
Hadoop YARN
Hadoop YARN stands for Yet Another Resource Negotiator. It is the resource management unit of
Hadoop and is available as a component of Hadoop version 2.
Hadoop YARN acts like an OS to Hadoop. It is a resource management layer built on top of HDFS.
It is responsible for managing cluster resources to make sure you don't overload one machine.
It performs job scheduling to make sure that the jobs are scheduled in the right place
Suppose a client machine wants to run a query or fetch some code for data analysis. This job request goes to the Resource Manager (Hadoop YARN), which is responsible for resource allocation and management.
In the node section, each of the nodes has its node managers. These node managers manage the nodes and
monitor the resource usage in the node. The containers contain a collection of physical resources, which
could be RAM, CPU, or hard drives. Whenever a job request comes in, the app master requests the
container from the node manager. Once the node manager gets the resource, it goes back to the Resource
Manager.
Fill up Q & A
1. The transition from traditional data processing to Big Data is often referred to as data ________.
2. ________ refers to the rapid growth and accumulation of data generated from various sources
3. Data that is generated at high speed and in large volumes is known as ________ data.
4. The term ________ refers to the variety of data types generated, including structured,
semistructured, and unstructured data.
5. Big Data is commonly defined by the ________ , which are Volume, Velocity, and Variety.
6. The ability to analyze and visualize data in real-time is referred to as ________ analytics.
7. One major merit of Big Data is its ability to uncover ________ patterns and trends.
8. A significant challenge of Big Data is the issue of ________, which deals with the quality and accuracy
of data.
9. A key component of Big Data architecture is the ________, which is responsible for data storage.
10. _________frameworks are used for processing and analyzing large datasets.
11. Big Data is often characterized by its high ________, which refers to the sheer amount of data
generated.
12. The term ________ in Big Data refers to the speed at which data is generated and processed.
13. One of the most popular frameworks for Big Data processing is ________.
14. ________ is a programming model that allows for processing large data sets across distributed
clusters.
15. Big Data is widely used in the ________ industry for predictive analytics and customer insights.
16. In healthcare, Big Data helps in ________ by analyzing patient data for better treatment outcomes
17. A widely-used tool for Big Data storage and processing is ________.
18. ________ is a data visualization tool commonly used to present Big Data insights.
19. The term ________ refers to the ethical considerations and implications of using Big Data.
20. ________ learning techniques are often employed to extract insights from Big Data.
21. In the context of Big Data, ETL stands for Extract, Transform, and ________.
22. A database designed to handle unstructured data is known as a ________ database.
23. The ________ is a set of techniques used to analyze and interpret data for decision-making.
24. Cloud computing plays a crucial role in providing ________ for Big Data solutions.
25. The ability to process data in real-time is known as ________ processing.
26. Data ________ involves ensuring data privacy and security in Big Data environments.
27. ________ analysis is a method used to analyze complex data sets to identify patterns.
28. The ________ refers to the integration of Big Data with traditional data sources for comprehensive
analysis.
29. ________ platforms enable organizations to manage and analyze data across various environments.
30. The term ________ refers to the tools and methods used to manage and analyze Big Data.
31. A major challenge in Big Data is the need for ________ in data storage solutions.
32. ________ is a programming language frequently used in Big Data analytics for data manipulation and
analysis.
Answer
1. evolution
2. Big Data
3. streaming
4. variety
5. three Vs
6. real-time
7. hidden
8. data quality
9. data lake
10. processing
11. volume
12. velocity
13. Apache Hadoop
14. MapReduce
15. retail
16. personalized medicine
17. Hadoop
18. Tableau
19. data ethics
20. Machine
21. Load
22. NoSQL
23. data analytics
24. scalability
25. streaming
26. governance
27. Predictive
28. integration
29. Data management
30. Big Data
31. efficiency
32. R