
Introduction to Advanced Computer Application UNIT II

UNIT –II BIG DATA

Big Data: Data Evolution - Terminologies - Definitions - Merits and Challenges - Big Data Components - Characteristics - Big Data Processing Frameworks - Big Data Applications - Tools for Big Data Analytics.

INTRODUCTION

What is Data?

Data is defined as individual facts, such as numbers, words, measurements, observations or just descriptions of
things.

For example, data might include individual prices, weights, addresses, ages, names, temperatures, dates, or
distances.

There are two main types of data:

1. Quantitative data is provided in numerical form, like the weight, volume, or cost of an item.

2. Qualitative data is descriptive, but non-numerical, like the name, sex, or eye colour of a person.

Characteristics of Data

The following are six key characteristics of data, discussed below:

Accuracy: Data should be sufficiently accurate for the intended use and should be captured only once, although it may have multiple uses.

Validity: Data should be recorded and used in compliance with relevant requirements, including the correct application of any rules or definitions.

Completeness: Data requirements should be clearly specified based on the information needs of the organization, and data collection processes matched to these requirements.

Reliability: Data should reflect stable and consistent data collection processes across collection points and over time.

Relevance: Data captured should be relevant to the purposes for which it is to be used.

Timeliness: Data should be captured as quickly as possible after the event or activity and must be available for the intended use within a reasonable time period.

Types of Digital Data

Digital data is the electronic representation of information in a format or language that machines can read and understand. In more technical terms, digital data is a binary format of information that is converted into a machine-readable digital format. The power of digital data is that any analog input, from very simple text documents to genome sequencing results, can be represented with the binary system.

Structured Data:

 Structured data refers to any data that resides in a fixed field within a record or file.
 It follows a particular data model.
 It is meaningful data.
 Data is arranged in rows and columns.
 Structured data has the advantage of being easily entered, stored, queried and analysed.
 E.g.: relational databases, spreadsheets.
 Structured data is often managed using Structured Query Language (SQL), as in the sketch below.

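To make the idea of structured data concrete, here is a small, hypothetical sketch in Python using the standard sqlite3 module; the table and values are invented for illustration.

```python
import sqlite3

# Hypothetical example: a small relational table of items, illustrating
# rows, columns and a fixed schema (structured data).
conn = sqlite3.connect(":memory:")       # in-memory database for the demo
cur = conn.cursor()
cur.execute("CREATE TABLE items (name TEXT, weight_kg REAL, price REAL)")
cur.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [("bolt", 0.05, 0.10), ("gearbox", 12.4, 250.0), ("sensor", 0.2, 15.5)],
)
conn.commit()

# Structured data is easily queried and analysed with SQL.
for row in cur.execute("SELECT name, price FROM items WHERE price > 1 ORDER BY price"):
    print(row)
```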
Sources of Structured Data:


 SQL Databases
 Spreadsheets such as Excel
 OLTP Systems
 Online forms
 Sensors such as GPS or RFID tags
 Network and Web server logs
 Medical devices
Advantages of Structured Data:

 Easy to understand and use


 Consistency
 Efficient storage and retrieval
 Enhanced data security
 Clear data lineage
Disadvantages of Structured Data:
 Inflexibility
 Limited complexity
 Limited context

 Expensive
 Data quality
Unstructured Data:

 Unstructured data cannot be readily classified and fitted into a neat box.
 Also called unclassified data, it does not conform to any data model.
 Business rules are not applied.
 Indexing is not required.
 E.g.: photos and graphic images, videos, streaming instrument data, web pages, PDF files, PowerPoint presentations, emails, blog entries, wikis and word processing documents.
Sources of Unstructured Data:

 Web pages
 Images (JPEG, GIF, PNG, etc.)
 Videos
 Memos
 Reports
 Word documents and PowerPoint presentations
 Surveys
Advantages of Unstructured Data:

 It supports data which lacks a proper format or sequence.
 The data is not constrained by a fixed schema.
 It is very flexible due to the absence of a schema.
 Data is portable.
 It is very scalable.
 It can deal easily with the heterogeneity of sources.
 These types of data have a variety of business intelligence and analytics applications.
Disadvantages of Unstructured data:

 It is difficult to store and manage unstructured data due to the lack of schema and structure.
 Indexing the data is difficult and error-prone due to the unclear structure and the absence of pre-defined attributes, so search results are not very accurate.
 Ensuring the security of the data is a difficult task.

Semi structured Data:

 Self-describing data.
 Carries metadata (data about data).
 Lies in between structured and unstructured data.
 It has some structure, but does not strictly follow a data model.
 Data which does not have a rigid structure.
 E.g.: e-mails, word processing software.
 XML and other markup languages are often used to manage semi-structured data (see the JSON sketch below).
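To illustrate the "self-describing" nature of semi-structured data, here is a small Python sketch using JSON records; the records themselves are made up for the example.

```python
import json

# Hypothetical example: a JSON record is self-describing -- field names
# (the "metadata") travel with the values, but records need not share
# a rigid, identical schema.
record_a = '{"name": "Asha", "email": "asha@example.com", "tags": ["vip"]}'
record_b = '{"name": "Ravi", "phone": "+91-98xxxxxx"}'   # different fields, still valid

for raw in (record_a, record_b):
    doc = json.loads(raw)               # parse the semi-structured text
    print(doc.get("name"), "->", doc.get("email", "no email on file"))
```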
Sources of semi-structured Data:

 E-mails
 XML and other markup languages
 Binary executables
 TCP/IP packets
 Zipped files
 Integration of data from different sources
 Web pages
Advantages of Semi-structured Data:

 The data is not constrained by a fixed schema.
 Flexible, i.e., the schema can be easily changed.
 Data is portable.
 It is possible to view structured data as semi-structured data.
 It supports users who cannot express their needs in SQL.
 It can deal easily with the heterogeneity of sources.
Disadvantages of Semi-structured data

o The lack of a fixed, rigid schema makes storage of the data difficult.
o Interpreting the relationship between data is difficult as there is no separation of the schema and the data.
o Queries are less efficient as compared to structured data.
o Complexity
o Lack of standardization
o Reduced performance
o Limited tooling
o Data security

BIG DATA
Big Data is a collection of data that is huge in volume, yet growing exponentially with time. It is data of such large size and complexity that no traditional data management tool can store or process it efficiently. In short, big data is simply data, but of enormous size.

Example:

Every day, 500+ terabytes of fresh data are absorbed into Facebook's systems. This information is mostly gathered through photo and video uploads, message exchanges, and the posting of comments, among other things. In 30 minutes of flying time, a single jet engine may create 10+ gigabytes of data; with thousands of flights every day, the amount of data generated can amount to several petabytes. Every day, the New York Stock Exchange creates around a terabyte of new trading data.

BIG DATA CHARACTERISTICS

Volume: The name Big Data itself is related to an enormous size. Big Data is a vast ‘volume’ of data generated
from many sources daily, such as business processes, machines, social media platforms, networks, human
interactions, and many more.

Variety: Big Data can be structured, unstructured, or semi-structured, collected from different sources. In the past, data was collected only from databases and spreadsheets, but these days data comes in an array of forms: PDFs, emails, audio, social media posts, photos, videos, etc.

Veracity: Veracity means how reliable the data is. There are many ways to filter or translate the data, and veracity is about being able to handle and manage such data efficiently. Reliable data is essential for using Big Data in business development.

Value: Value is an essential characteristic of big data. It is not the sheer amount of data that we process or store that matters; it is the valuable and reliable data that we store, process, and analyze.

Velocity: Velocity plays an important role compared to the other characteristics. Velocity is the speed at which data is created in real time. It covers the speed of incoming data streams, the rate of change, and bursts of activity.

The primary aspect of Big Data is to make such fast-arriving data available rapidly. Big data velocity deals with the speed at which data flows from sources such as application logs, business processes, networks, social media sites, sensors, mobile devices, etc.

Benefits of Big Data Processing

Ability to process Big Data brings in multiple benefits, such as-

1. Businesses can utilize outside intelligence while making decisions.

2. Access to social data from search engines and sites like Facebook and Twitter is enabling organizations to fine-tune their business strategies.

3. Improved customer service: traditional customer feedback systems are being replaced by new systems designed with Big Data technologies.

4. In these new systems, Big Data and natural language processing technologies are used to read and evaluate consumer responses.

5. Early identification of risk to the product/services, if any

6. Better operational efficiency

Why is Big Data Important?

• Cost Savings

Big data helps in providing business intelligence that can reduce costs and improve the efficiency of operations. Processes like quality assurance and testing, which can involve many complications particularly in industries like biopharmaceuticals and nanotechnologies, benefit from this kind of intelligence.

• Time Reductions

Companies may collect data from a variety of sources using real-time in-memory analytics. Tools like Hadoop
enable businesses to evaluate data quickly, allowing them to make swift decisions based on their findings.

• Understand the market conditions

Businesses can benefit from big data analysis by gaining a better grasp of market conditions. Analysing client
purchase behaviour, for example, enables businesses to discover the most popular items and develop them
appropriately. This allows businesses to stay ahead of the competition.

• Social Media Listening

Companies can perform sentiment analysis using Big Data tools. These enable them to get feedback about their company, that is, who is saying what about the company. Companies can use Big Data tools to improve their online presence.

• Using Big Data Analytics to Boost Customer Acquisition and Retention.

Customers are a crucial asset that every company relies on. Without a strong consumer base, no company can be successful. However, even with a strong consumer base, businesses cannot ignore market rivalry. It will be difficult for businesses to succeed if they do not understand what their consumers desire; the result is a loss of customers, which has a negative impact on business growth. Businesses may use big data analytics to detect customer-related trends and patterns and to analyze customer behavior.

 Using Big Data Analytics to Solve Advertisers' Problems and Offer Marketing Insights

Marketing insight is the key to a successful business, and all company activities are shaped by big data analytics. It allows businesses to meet client expectations. Big data analytics aids in the modification of a company's product range and guarantees that marketing initiatives are effective.

CHALLENGES OF BIG DATA

1. Volume: Big data involves massive amounts of data, which can be difficult to store, process, and analyze using
traditional methods.

2. Velocity: Big data is often generated at a high speed, making it difficult to keep up with the flow of data and
extract valuable insights in real time.

3. Variety: Big data comes in a wide variety of formats, including structured, unstructured, and semi-structured
data. This diversity makes it challenging to integrate and analyze data from different sources.

4. Veracity: The accuracy and quality of big data can be a challenge, as data may be incomplete, inconsistent, or
biased.

5. Complexity: Big data often involves complex relationships and patterns, making it difficult to extract
meaningful insights using traditional data analysis techniques

6. Security: Protecting big data from unauthorized access, breaches, and other security threats is a major concern.

7. Privacy: Handling big data while ensuring the privacy of individuals and organizations is a critical challenge.

8. Scalability: Big data systems must be able to scale to handle increasing volumes of data and growing
analytical demands.

9. Cost: Storing, processing, and analyzing big data can be expensive, requiring significant investments in
hardware, software, and personnel.

10. Talent: There is a shortage of skilled professionals with the expertise to handle big data effectively.

COMPONENTS OF A BIG DATA ARCHITECTURE

Most big data architectures include some or all of the following components:

Data Sources: These are the origins of the data, which can be various sources such as IoT devices, social media
platforms, databases, or real-time streams.

Real-time message ingestion: This component is responsible for capturing and processing data in real-time, as it
is generated.

Data Storage: This component stores the collected data, often using distributed file systems or NoSQL databases
designed to handle large datasets.

Batch Processing: This component processes large batches of data in a scheduled manner, allowing for efficient
computations on historical data.

Stream Processing: This component continuously processes data as it flows in, enabling real-time analytics and
insights.

Machine Learning: This component applies machine learning algorithms to the data to discover patterns, make
predictions, and automate tasks.

Analytical Data Store: This component stores processed and analyzed data for further analysis and reporting.

Analytics and Reporting: This component generates insights, reports, and visualizations based on the analyzed
data.

Orchestration: This component coordinates the overall workflow, managing the scheduling, execution, and
monitoring of various processes.

BIG DATA PROCESSING FRAMEWORK

1. Hadoop:
 Hadoop is a distributed computing framework designed to process large datasets across clusters of
computers.
 It consists of two main components:
 Hadoop Distributed File System (HDFS): A scalable distributed file system for storing large amounts
of data.
 MapReduce: A programming model for processing large datasets in parallel.
 Hadoop is widely used for batch processing tasks, such as data warehousing, ETL (Extract, Transform,
Load), and data mining.
2. Apache Spark:
 Apache Spark is a fast and general-purpose cluster computing framework.
 For certain types of workloads, it can be up to 100 times faster than Hadoop MapReduce.
 Spark supports a variety of data processing tasks, including batch processing, real-time streaming, machine learning, and graph processing.
 It offers APIs for several programming languages, such as Scala, Java, Python, and R (a minimal PySpark sketch appears after this framework list).
3. Apache Flink:
 Apache Flink is a distributed streaming engine designed for processing real-time data streams.
 It is highly scalable and can handle large volumes of data with low latency.
 Flink supports a variety of streaming operations, such as filtering, joining, aggregating, and windowing.
 It can also be used for batch processing tasks.
4. Apache Storm:
 Apache Storm is a distributed and fault-tolerant stream processing engine.
 It is designed to process real-time data streams at high speeds.
 Storm is highly scalable and can handle large volumes of data.

 It is often used for applications such as real-time analytics, online fraud detection, and social media
monitoring.
5. Apache Kafka:
 Apache Kafka is a distributed streaming platform designed for building real-time data pipelines.
 It is highly scalable and can handle large volumes of data with low latency.
 Kafka is often used to collect, store, and process real-time data streams from various sources.
 It can be integrated with other big data processing frameworks, such as Spark and Flink.
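
To make the batch versus streaming distinction above concrete, two minimal Python sketches follow. The first is a classic PySpark word count (batch processing); the second publishes events with the kafka-python client (real-time ingestion). Host names, file paths, topic names, and data are illustrative assumptions, not part of these notes.

```python
from pyspark.sql import SparkSession

# Minimal PySpark word count (batch processing). Assumes a working Spark
# installation; the input path is a placeholder.
spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()

counts = (spark.sparkContext
          .textFile("hdfs:///data/input.txt")          # read the dataset
          .flatMap(lambda line: line.split())           # map: split into words
          .map(lambda word: (word, 1))                  # map: emit (word, 1)
          .reduceByKey(lambda a, b: a + b))             # reduce: sum the counts

for word, count in counts.take(10):
    print(word, count)

spark.stop()
```

A matching producer sketch for Kafka:

```python
import json
import time

from kafka import KafkaProducer

# Minimal kafka-python producer (real-time ingestion). Assumes a broker is
# running at localhost:9092 and that the topic "sensor-readings" exists or
# auto-creation is enabled; both are assumptions made for this sketch.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for i in range(5):
    event = {"sensor_id": 42, "reading": 20.0 + i, "ts": time.time()}
    producer.send("sensor-readings", value=event)       # publish one event

producer.flush()    # make sure all buffered events reach the broker
```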

APPLICATIONS OF BIG DATA

All of the data must be recorded and processed, which takes a lot of expertise, resources, and time. Data may be used creatively and meaningfully to provide business benefits. There are three sorts of business applications, each with varying degrees of transformative potential: monitoring and tracking applications, analysis and insight applications, and new product development.

Monitoring and tracking applications

These are the first and most fundamental Big Data applications. In practically all industries, they aid in increasing
corporate efficiency. The following are a few examples of specialized applications:

Public health monitoring

The US government is encouraging all healthcare stakeholders to establish a national platform for interoperability and data-sharing standards. This would enable secondary use of health data, which would advance big data analytics and personalized, holistic precision medicine. This would be a broad-based platform like Google Flu Trends.

Consumer Sentiment Monitoring

Social media has become more powerful than advertising. Many companies have moved the bulk of their advertising budgets from traditional media into social media. They have set up Big Data listening platforms, where social media data streams (including tweets, Facebook posts, and blog posts) are filtered and analysed for certain keywords or sentiments, by certain demographics and regions. Actionable information from this analysis is delivered to marketing professionals for appropriate action, especially when the product is new to the market.

Asset Tracking

The US Department of Defense is encouraging the industry to devise a tiny RFID chip that could prevent the counterfeiting of electronic parts that end up in avionics or circuit boards for other devices. Airplanes are among the heaviest users of sensors, which track every aspect of the performance of every part of the plane. The data can be displayed on the dashboard as well as stored for later detailed analysis. Working with communicating devices, these sensors can produce a torrent of data. Theft by shoppers and employees is a major source of loss of revenue for retailers. All valuable items in the store can be assigned RFID tags, and the gates of the store can be equipped with RF readers. This can help secure the products and reduce leakage (theft) from the store.

Supply chain monitoring

All containers on ships communicate their status and location using RFID tags. Thus retailers and their suppliers can gain real-time visibility into the inventory throughout the global supply chain. Retailers can know exactly where the items are in the warehouse, and so can bring them into the store at the right time. This is particularly relevant for seasonal items that must be sold on time, or else they will be sold at a discount. With item-level RFID tags, retailers also gain full visibility of each item and can serve their customers better.

Preventive machine maintenance

All machines, including cars and computers, do tend to fail sometimes. This is because one or more of their components may cease to function. As a preventive measure, precious equipment can be fitted with sensors. The continuous stream of data from the sensors can be monitored and analyzed to forecast the status of key components and thus monitor the overall machine's health. Preventive maintenance can, thus, reduce the cost of downtime.

Analysis and Insight Applications

These are the next generation of big data apps. They have the ability to improve corporate effectiveness and have
transformational potential. Big Data may be organized and analyzed to reveal trends and insights that can be
utilized to improve business.

 Predictive Policing
 Winning political elections
 Personal Health

Predictive Policing

The notion of predictive policing was created by the Los Angeles Police Department. The LAPD collaborated with UC Berkeley academics to examine its massive database of 13 million crimes spanning 80 years and forecast the likelihood of particular sorts of crimes occurring at specific times and in specific areas. They pinpointed crime hotspots of certain categories, at specific times and in specific areas, where crimes had happened and were likely to occur in the future. After a basic insight derived from a metaphor of earthquakes and their aftershocks, crime patterns were statistically simulated. The model said that once a crime occurred in a location, it represented a certain disturbance in harmony and would thus lead to a greater likelihood of a similar crime occurring in the local vicinity soon. The model showed, for each police beat, the specific neighborhood blocks and specific time slots where crime was likely to occur. By aligning the police cars' patrol schedules with the model's predictions, the LAPD could reduce crime by 12 percent to 26 percent for different categories of crime. Recently, the San Francisco Police Department released its own crime data for over 2 years, so data analysts could model that data and help prevent future crimes.

Winning political elections

The US president, Barack Obama, was the first major political candidate to use big data in a significant way, in the 2008 elections. He is often called the first big data president. His campaign gathered data about millions of people, including supporters. They invented the mechanism to obtain small campaign contributions from millions of supporters. They created personal profiles of millions of supporters and what they had done and could do for the campaign. Data was used to determine undecided voters who could be converted to their side. They provided the phone numbers of these undecided voters to the volunteers. The results of the calls were recorded in real time using interactive web applications. Obama himself used his Twitter account to communicate his message directly with his millions of followers. After the elections, Obama converted his list of tens of millions of supporters into an advocacy machine that would provide grassroots support for the president's initiatives. Since then, almost all campaigns use big data. Senator Bernie Sanders used the same big data playbook to build an effective national political machine powered entirely by small donors. Election analyst Nate Silver created sophisticated predictive models using inputs from many political polls and surveys to successfully predict the winner of the US elections. Nate was, however, unsuccessful in predicting Donald Trump's rise and ultimate victory, and that shows the limits of big data.

Personal health

Medical knowledge and technology are growing by leaps and bounds. IBM's Watson system is a big data analytics engine that ingests and digests all the medical information in the world and then applies it intelligently to an individual situation. Watson can provide a detailed and accurate medical diagnosis using current symptoms, patient history, medical history, environmental trends, and other parameters. Similar products might be offered as an app to licensed doctors, and even individuals, to improve productivity and accuracy in health care.

New Product Development

These are completely new notions that did not exist previously. These applications have the ability to disrupt
whole sectors and provide organizations with new revenue streams.

 Flexible Auto Insurance


 Location based retail promotion
 Recommendation service

Flexible Auto Insurance

An auto insurance company can use the GPS data from cars to calculate the risk of accidents based on travel
patterns. The automobile companies can use the car sensor data to track the performance of a car. Safer drivers
can be rewarded and the errant drivers can be penalized.

Location based retail promotion

A retailer or a third-party advertiser can target customers with specific promotions and coupons based on location data obtained through the Global Positioning System (GPS), the time of day, and the presence of stores nearby, and by mapping it to the consumer preference data available from social media databases. Advertisements and offers can be delivered through mobile apps, SMS, and email; these are examples of mobile applications of big data.

Recommendation service

E-commerce has been a fast-growing industry in the last couple of decades. A variety of products are sold and shared over the internet. Web users' browsing and purchase history on e-commerce sites is utilized to learn about their preferences and needs, and to advertise relevant product and pricing offers in real time. Amazon uses a personalized recommendation engine to suggest additional products to consumers based on the affinities of various products (a toy affinity sketch is given below). Netflix also uses a recommendation engine to suggest entertainment options to its users. Big data is valuable across all industries.
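
As a purely illustrative toy (not Amazon's or Netflix's actual engine), the following Python sketch builds item-to-item affinities by counting co-purchases and recommends the items most often bought together with a given product; the order data is invented.

```python
from collections import defaultdict
from itertools import combinations

# Toy sketch: count how often two products appear in the same order,
# then recommend the items most often co-purchased with a given product.
orders = [
    {"laptop", "mouse", "laptop bag"},
    {"laptop", "mouse"},
    {"phone", "phone case", "screen guard"},
    {"laptop", "laptop bag"},
]

co_counts = defaultdict(int)
for basket in orders:
    for a, b in combinations(sorted(basket), 2):
        co_counts[(a, b)] += 1
        co_counts[(b, a)] += 1

def recommend(item, k=2):
    """Return the k items most frequently co-purchased with `item`."""
    scores = {b: n for (a, b), n in co_counts.items() if a == item}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("laptop"))   # e.g. ['laptop bag', 'mouse']
```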

There are three major types of data sources for big data: people-to-people communication, people-to-machine communication, and machine-to-machine communication. Each type has many sources of data. There are also three types of applications: the monitoring type, the analysis type, and new product development. They have an impact on efficiency, effectiveness, and even the disruption of industries.

TOOLS USED IN BIG DATA

There are a number of tools used in Big Data. The most popular tools are:

Apache Hadoop

The Apache Hadoop software library is a Big Data framework. It enables massive data sets to be processed across clusters of computers in a distributed manner. It is one of the most powerful big data technologies, with the ability to grow from a single server to thousands of computers.

Features

 Authentication is improved when utilizing an HTTP proxy server.
 Specification of the Hadoop Compatible File System effort. Extended attributes for POSIX-style file systems are supported.
 It offers a robust ecosystem of big data technologies and tools that is well suited to meet the analytical needs of developers.
 It brings flexibility in data processing and allows for faster data processing.

HPCC

HPCC is a big data tool developed by LexisNexis Risk Solutions. It delivers a single platform, a single architecture, and a single programming language for data processing.

Features

 It is one of the highly efficient big data tools that accomplishes big data tasks with far less code.
 It is one of the big data processing tools which offers high redundancy and availability.
 It can be used for complex data processing on a Thor cluster.
 Its graphical IDE simplifies development, testing and debugging. It automatically optimizes code for parallel processing.
 Provides enhanced scalability and performance. ECL code compiles into optimized C++, and it can also be extended using C++ libraries.

Apache STORM

Storm is a free, open-source big data computation system. It is one of the best big data tools, offering a distributed, real-time, fault-tolerant processing system with real-time computation capabilities.

Features

 It is one of the best tools from the big data tools list, benchmarked at processing one million 100-byte messages per second per node.
 It uses parallel calculations that run across a cluster of machines.
 It will automatically restart in case a node dies; the worker will be restarted on another node. Storm guarantees that each unit of data will be processed at least once or exactly once.
 Once deployed, Storm is surely one of the easiest tools for Big Data analysis.

Qubole

Qubole Data Service is an autonomous Big Data management platform. It is a big data open-source tool that is self-managed and self-optimizing, and it allows the data team to focus on business outcomes.

Features

 Single Platform for every use case


 It is open-source big data software with engines optimized for the Cloud.
 Comprehensive Security, Governance, and Compliance
 Provides actionable Alerts, Insights, and Recommendations to optimize reliability, performance, and
costs.
 Automatically enacts policies to avoid performing repetitive manual actions

Apache Cassandra

The Apache Cassandra database is widely used today to provide an effective management of large amounts of
data.

Features

 Support for replicating across multiple data centers, providing lower latency for users
 Data is automatically replicated to multiple nodes for fault tolerance
 It is one of the best big data tools and is most suitable for applications that cannot afford to lose data, even when an entire data center is down
 Support contracts and services for Cassandra are available from third parties

Statwing

Statwing is an easy-to-use statistical tool. It was built by and for big data analysts. Its modern interface chooses
statistical tests automatically.

Features

 It is big data software that can explore any data in seconds. Statwing helps to clean data, explore relationships, and create charts in minutes.
 It allows creating histograms, scatterplots, heatmaps, and bar charts that export to Excel or PowerPoint. It also translates results into plain English, so analysts unfamiliar with statistical analysis can still understand them.

CouchDB

CouchDB stores data in JSON documents that can be accessed via the web or queried using JavaScript. It offers distributed scaling with fault-tolerant storage. It allows accessing and replicating data by defining the Couch Replication Protocol (a minimal HTTP sketch follows the feature list below).

Features

 CouchDB is a single-node database that works like any other database


 It is one of the big data processing tools that allows running a single logical database server on any
number of servers.
 It makes use of the ubiquitous HTTP protocol and JSON data format. Easy replication of a database
across multiple server instances. Easy interface for document insertion, updates, retrieval and deletion
 JSON-based document format can be translatable across different languages
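
Because CouchDB uses the ubiquitous HTTP protocol and JSON format, it can be driven from almost any language. Below is a minimal Python sketch using the third-party requests library; the host, credentials, database name, and document contents are assumptions made for illustration.

```python
import requests

# Minimal sketch of talking to a local CouchDB instance over its HTTP/JSON
# API. Assumes CouchDB is running at localhost:5984 with admin/password
# credentials -- adjust to your own setup.
BASE = "http://localhost:5984"
AUTH = ("admin", "password")        # hypothetical credentials

requests.put(f"{BASE}/demo_db", auth=AUTH)                      # create a database
requests.put(                                                    # insert a JSON document
    f"{BASE}/demo_db/order-1001",
    json={"customer": "Asha", "items": ["laptop", "mouse"], "total": 265.5},
    auth=AUTH,
)

doc = requests.get(f"{BASE}/demo_db/order-1001", auth=AUTH).json()  # read it back
print(doc["customer"], doc["total"])
```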

Pentaho

Pentaho provides big data tools to extract, prepare and blend data. It offers visualizations and analytics that
change the way to run any business. This Big data tool allows turning big data into big insights.

Features:

 Data access and integration for effective data visualization. It is a big data software that empowers users
to architect big data at the source and stream them for accurate analytics.
 Seamlessly switch or combine data processing with in-cluster execution to get maximum processing.
Allow checking data with easy access to analytics, including charts, visualizations, and reporting
 Supports wide spectrum of big data sources by offering unique capabilities

Apache Flink

Apache Flink is one of the best open-source data analytics tools for stream processing of big data. It is a distributed, high-performing, always-available, and accurate framework for data streaming applications.

Features:

 Provides results that are accurate, even for out-of-order or late-arriving data
 It is stateful and fault-tolerant and can recover from failures.
 It is a big data analytics software which can perform at a large scale, running on thousands of nodes
 Has good throughput and latency characteristics
 This big data tool supports stream processing and windowing with event-time semantics. It supports flexible windowing based on time, count, or sessions, as well as data-driven windows.
 It supports a wide range of connectors to third-party systems for data sources and sinks

Cloudera

Cloudera is one of the fastest, easiest, and most secure modern big data platforms. It allows anyone to get any data across any environment within a single, scalable platform.

Features:

 High-performance big data analytics software


 It offers provision for multi-cloud
 Deploy and manage Cloudera Enterprise across AWS, Microsoft Azure and Google Cloud Platform. Spin up and terminate clusters, and only pay for what is needed, when it is needed.
 Developing and training data models
 Reporting, exploring, and self-servicing business intelligence
 Delivering real-time insights for monitoring and detection
 Conducting accurate model scoring and serving

Open Refine

OpenRefine is a powerful big data tool. It is a big data analytics software that helps to work with messy data,
cleaning it and transforming it from one format into another. It also allows extending it with web services and
external data.

Features:

 The OpenRefine tool helps you explore large data sets with ease. It can be used to link and extend your dataset with various web services, and to import data in various formats.
 Explore datasets in a matter of seconds
 Apply basic and advanced cell transformations
 Allows to deal with cells that contain multiple values
 Create instantaneous links between datasets. Use named-entity extraction on text fields to automatically
identify topics. Perform advanced data operations with the help of Refine Expression Language

RapidMiner

RapidMiner is one of the best open-source data analytics tools. It is used for data prep, machine learning, and
model deployment. It offers a suite of products to build new data mining processes and setup predictive analysis.

Features

 Allow multiple data management methods


 GUI or batch processing
 Integrates with in-house databases
 Interactive, shareable dashboards
 Big Data predictive analytics
 Remote analysis processing
 Data filtering, merging, joining and aggregating
 Build, train and validate predictive models
 Store streaming data to numerous databases
 Reports and triggered notifications

Data cleaner

Data Cleaner is a data quality analysis application and a solution platform. It has a strong data profiling engine. It is extensible and thereby adds data cleansing, transformations, matching, and merging.

Feature:

 Interactive and explorative data profiling


 Fuzzy duplicate record detection.
 Data transformation and standardization
 Data validation and reporting
 Use of reference data to cleanse data
 Master the data ingestion pipeline in the Hadoop data lake. Ensure that rules about the data are correct before users spend their time on processing. Find the outliers and other devilish details to either exclude or fix the incorrect data.

Kaggle

Kaggle is the world's largest big data community. It helps organizations and researchers to post their data &
statistics. It is the best place to analyze data seamlessly.

Features:

 The best place to discover and seamlessly analyze open data


 Search box to find open datasets.
 Contribute to the open data movement and connect with other data enthusiasts

Apache Hive

Hive is an open-source big data software tool. It allows programmers to analyze large data sets on Hadoop and helps with querying and managing large datasets very quickly (a small query sketch follows the feature list below).

Features:

 It supports an SQL-like query language (HiveQL) for interaction and data modeling.
 It compiles queries into two main kinds of tasks: map and reduce.
 It allows defining these tasks using Java or Python.
 Hive is designed for managing and querying structured data only.
 Hive's SQL-inspired language shields the user from the complexity of MapReduce programming.
 It offers a Java Database Connectivity (JDBC) interface.
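
As a small, hedged illustration of an SQL-like (HiveQL-style) query driven from Python, the sketch below uses Spark's Hive support rather than the Hive command line; the table, columns, and data are invented for the example and assume a Spark installation configured with a Hive metastore.

```python
from pyspark.sql import SparkSession

# Minimal sketch of issuing HiveQL-style queries from Python via Spark's
# Hive support. Table and column names are made up for illustration.
spark = (SparkSession.builder
         .appName("hive-demo")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("""
    CREATE TABLE IF NOT EXISTS sales (item STRING, region STRING, amount DOUBLE)
""")

# A familiar SQL-like aggregation; Hive/Spark turns this into distributed jobs.
spark.sql("""
    SELECT region, SUM(amount) AS total_sales
    FROM sales
    GROUP BY region
    ORDER BY total_sales DESC
""").show()
```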

INTRODUCTION TO HADOOP FRAMEWORK



Hadoop is an Apache open-source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models.
 Hadoop is designed to scale up from single server to thousands of machines, each offering local
computation and storage.
 Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel on
different CPU nodes.
Hadoop = MapReduce + HDFS
Hadoop Architecture
❖ Hadoop has a Master Slave Architecture for both Storage & Processing
❖ Hadoop framework includes following four modules:
❖ Hadoop Common: These are Java libraries and provide file system and OS level abstractions and contains the
necessary Java files and scripts required to start Hadoop.
❖ Hadoop YARN: This is a framework for job scheduling and cluster resource management.
❖ Hadoop Distributed File System (HDFS): A distributed file system that provides high- throughput access to
application data.
❖ Hadoop MapReduce: This is a system for parallel processing of large data sets.

 The Hadoop core is divided into two fundamental layers:
 MapReduce engine
 HDFS
 The MapReduce engine is the computation engine running on top of HDFS as its data storage manager.
 HDFS: HDFS is a distributed file system inspired by GFS that organizes files and stores their data on a
distributed computing system.
 HDFS Architecture: HDFS has a master/slave architecture containing a single Name Node as the master
and a number of Data Nodes as workers (slaves).
 The Hadoop architecture is a package of the file system, MapReduce engine and the HDFS (Hadoop
Distributed File System). The MapReduce engine can be MapReduce/MR1 or YARN/MR2.
 A Hadoop cluster consists of a single master and multiple slave nodes. The master node includes Job
Tracker, Task Tracker, NameNode, and DataNode whereas the slave node includes DataNode and
TaskTracker.

Hadoop Distributed File System


 The Hadoop Distributed File System (HDFS) is the distributed file system of Hadoop. It has a master/slave architecture, consisting of a single NameNode that performs the role of master and multiple DataNodes that perform the role of slaves.
 Both the NameNode and the DataNodes are capable of running on commodity machines. HDFS is developed in the Java language, so any machine that supports Java can easily run the NameNode and DataNode software.
NameNode
 It is a single master server that exists in the HDFS cluster.
 As it is a single node, it may become a single point of failure.
 It manages the file system namespace by executing operations such as opening, renaming and closing files.
 It simplifies the architecture of the system.
DataNode
 The HDFS cluster contains multiple DataNodes.
 Each DataNode contains multiple data blocks.

 These data blocks are used to store data.
 It is the responsibility of the DataNode to serve read and write requests from the file system's clients.
 It performs block creation, deletion, and replication upon instruction from the NameNode.
Job Tracker
 The role of the Job Tracker is to accept MapReduce jobs from clients and process the data by using the NameNode.
 In response, the NameNode provides metadata to the Job Tracker.
Task Tracker
 It works as a slave node for Job Tracker.
 It receives tasks and code from the Job Tracker and applies that code to the file. This process can also be called a Mapper.
MapReduce Layer
MapReduce comes into play when the client application submits a MapReduce job to the Job Tracker. In response, the Job Tracker sends the request to the appropriate Task Trackers. Sometimes a Task Tracker fails or times out; in such a case, that part of the job is rescheduled.

Advantages of Hadoop
 Fast: In HDFS the data is distributed over the cluster and mapped, which helps in faster retrieval. Even the tools that process the data are often on the same servers, thus reducing the processing time. Hadoop is able to process terabytes of data in minutes and petabytes in hours.
 Scalable: A Hadoop cluster can be extended by just adding nodes to the cluster.
 Cost effective: Hadoop is open source and uses commodity hardware to store data, so it is really cost-effective compared to a traditional relational database management system.
 Resilient to failure: HDFS can replicate data over the network, so if one node is down or some other network failure happens, Hadoop takes the other copy of the data and uses it. Normally, data is replicated three times, but the replication factor is configurable (a toy sketch of block splitting and replication is given below).
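
The following toy Python sketch illustrates the two HDFS ideas mentioned above, splitting a file into blocks and replicating each block on several DataNodes. It is a single-machine simulation of the concept, not HDFS's real placement policy.

```python
import itertools

# Toy illustration only: split a "file" into fixed-size blocks and copy each
# block onto several DataNodes. Sizes are tiny for the demo (HDFS defaults
# to 128 MB blocks) and the placement is simple round-robin.
BLOCK_SIZE = 16          # bytes
REPLICATION = 3          # default HDFS replication factor
datanodes = ["dn1", "dn2", "dn3", "dn4"]

def split_into_blocks(data, size):
    return [data[i:i + size] for i in range(0, len(data), size)]

def place_blocks(blocks, nodes, replication):
    """Round-robin placement: each block is copied onto `replication` nodes."""
    placement = {}
    ring = itertools.cycle(nodes)
    for idx, _ in enumerate(blocks):
        placement[idx] = [next(ring) for _ in range(replication)]
    return placement

file_data = b"big data is simply data, but of enormous size"
blocks = split_into_blocks(file_data, BLOCK_SIZE)
for block_id, nodes in place_blocks(blocks, datanodes, REPLICATION).items():
    print(f"block {block_id}: {blocks[block_id]!r} -> stored on {nodes}")
```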

Hadoop MapReduce
 Hadoop MapReduce is the processing unit of Hadoop. In the MapReduce approach, the processing is
done at the slave nodes, and the final result is sent to the master node.
 The code that processes the data is sent to where the data resides, rather than moving the data to the code. This code is usually very small in comparison to the data itself; you only need to send a few kilobytes worth of code to perform a heavy-duty process on computers.


 The input dataset is first split into chunks of data. In this example, the input has three lines of text with
three separate entities - “bus car train,” “ship ship train,” “bus ship car.” The dataset is then split into
three chunks, based on these entities, and processed parallelly.
 In the map phase, each word is assigned as a key with a value of 1; for example, (bus, 1), (car, 1), (ship, 1), and (train, 1).
 These key-value pairs are then shuffled and sorted together based on their keys. In the reduce phase, the aggregation takes place and the final output is obtained (a plain-Python simulation of these phases is sketched below).
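
The runnable Python sketch below simulates the same map, shuffle/sort, and reduce phases on the three example lines above. On a real cluster, Hadoop distributes these phases across nodes; this is only a single-machine illustration.

```python
from itertools import groupby

# Word count over the example splits, following the phases described above.
lines = ["bus car train", "ship ship train", "bus ship car"]   # input splits

# Map phase: emit (word, 1) for every word in every split.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle and sort phase: group the pairs by key (the word).
mapped.sort(key=lambda kv: kv[0])

# Reduce phase: sum the values for each key to get the final counts.
counts = {word: sum(v for _, v in group)
          for word, group in groupby(mapped, key=lambda kv: kv[0])}
print(counts)   # {'bus': 2, 'car': 2, 'ship': 3, 'train': 2}
```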

Hadoop YARN
Hadoop YARN stands for Yet Another Resource Negotiator. It is the resource management unit of
Hadoop and is available as a component of Hadoop version 2.
Hadoop YARN acts like an OS for Hadoop. It is a resource management layer built on top of HDFS.
It is responsible for managing cluster resources to make sure you don't overload one machine.
It performs job scheduling to make sure that the jobs are scheduled in the right place

Suppose a client machine wants to run a query or fetch some code for data analysis. This job request goes to the resource manager (Hadoop YARN), which is responsible for resource allocation and management. In the node section, each of the nodes has its own node manager. These node managers manage the nodes and monitor resource usage on the node. The containers hold a collection of physical resources, which could be RAM, CPU, or hard drives. Whenever a job request comes in, the application master requests a container from the node manager. Once the node manager gets the resources, it reports back to the Resource Manager.

Fill up Q & A
1. The transition from traditional data processing to Big Data is often referred to as data ________.
2. ________ refers to the rapid growth and accumulation of data generated from various sources
3. Data that is generated at high speed and in large volumes is known as ________ data.
4. The term ________ refers to the variety of data types generated, including structured,
semistructured, and unstructured data.
5. Big Data is commonly defined by the ________ , which are Volume, Velocity, and Variety.
6. The ability to analyze and visualize data in real-time is referred to as ________ analytics.
7. One major merit of Big Data is its ability to uncover ________ patterns and trends.
8. A significant challenge of Big Data is the issue of ________, which deals with the quality and accuracy
of data.
9. A key component of Big Data architecture is the ________, which is responsible for data storage.
10. _________frameworks are used for processing and analyzing large datasets.
11. Big Data is often characterized by its high ________, which refers to the sheer amount of data
generated.
12. The term ________ in Big Data refers to the speed at which data is generated and processed.
13. One of the most popular frameworks for Big Data processing is ________.
14. ________ is a programming model that allows for processing large data sets across distributed
clusters.
15. Big Data is widely used in the ________ industry for predictive analytics and customer insights.
16. In healthcare, Big Data helps in ________ by analyzing patient data for better treatment outcomes
17. A widely-used tool for Big Data storage and processing is ________.
18. ________ is a data visualization tool commonly used to present Big Data insights.
19. The term ________ refers to the ethical considerations and implications of using Big Data.
20. ________ learning techniques are often employed to extract insights from Big Data.
21. In the context of Big Data, ETL stands for Extract, Transform, and ________.
22. A database designed to handle unstructured data is known as a ________ database.
23. The ________ is a set of techniques used to analyze and interpret data for decision-making.
24. Cloud computing plays a crucial role in providing ________ for Big Data solutions.
25. The ability to process data in real-time is known as ________ processing.
26. Data ________ involves ensuring data privacy and security in Big Data environments.
27. ________ analysis is a method used to analyze complex data sets to identify patterns.
28. The ________ refers to the integration of Big Data with traditional data sources for comprehensive
analysis.
29. ________ platforms enable organizations to manage and analyze data across various environments.
30. The term ________ refers to the tools and methods used to manage and analyze Big Data.
31. A major challenge in Big Data is the need for ________ in data storage solutions.
32. ________ is a programming language frequently used in Big Data analytics for data manipulation and
analysis.

Answer
1. evolution
2. Big Data
3. streaming
4. variety
5. three Vs
6. real-time
7. hidden
8. data quality
9. data lake
10. processing
11. volume
12. velocity
13. Apache Hadoop
14. MapReduce
15. retail
16. personalized medicine
17. Hadoop
18. Tableau
19. data ethics
20. Machine
21. Load
22. NoSQL
23. data analytics
24. scalability
25. streaming
26. governance
27. Predictive
28. integration
29. Data management
30. Big Data
31. efficiency
32. R

