
BIG DATA ANALYTICS

LESSON 1:

Introduction to big data

why big data

Data is defined as individual facts, such as numbers, words,


measurements, observations or just descriptions of things. For
example, data might include individual prices, weights,
addresses, ages, names, temperatures, dates, or distances. There
are two main types of data:

1. Quantitative data is provided in numerical form, like the


weight, volume, or cost of an item.

2. Qualitative data is descriptive, but non-numerical, like the


name, sex, or eye colour of a person.

Characteristics of Data

The following are six key characteristics of data:

1. Accuracy

2. Validity

3. Reliability

4. Timeliness

5. Relevance

6. Completeness

Types of Digital Data

Digital data is the electronic representation of information in a


format or language that machines can read and understand.

In more technical terms, Digital data is a binary format of


information that's converted into a machine-readable digital
format.

Types of Digital Data:

 Structured

 Unstructured

 Semi-structured

Structured Data:

Structured data refers to any data that resides in a fixed field within a record or file.

 It follows a particular data model.

 It is meaningful data.

 Data is arranged in rows and columns.

E.g.: relational databases, spreadsheets, SQL

Unstructured Data:

Unstructured data cannot readily be classified and fitted into a neat box.

 Also called unclassified data.

 Does not conform to any data model.

 Business rules are not applied.

 Indexing is not required.

E.g.: photos and graphic images, videos, streaming instrument data, webpages, PDF files, PowerPoint presentations, emails, blog entries, wikis and word processing documents.

Sources of Unstructured Data: web pages, images (JPEG, GIF, PNG, etc.), videos, memos, reports, Word documents and PowerPoint presentations, surveys.

Sources of Semi-structured Data: e-mails, XML and other markup languages, binary executables, TCP/IP packets, zipped files, integration of data from different sources, web pages.

Big Data

Big Data is a collection of data that is huge in volume, yet growing exponentially with time.

1. It is data of such large size and complexity that none of the traditional data management tools can store or process it efficiently.
2. Big data is still data, but of huge size.

WHY BIG DATA

1. Big Data initiatives were rated as "extremely important" by 93% of companies. Leveraging a Big Data analytics solution helps organizations unlock strategic value and take full advantage of their assets.

2. It helps organizations understand where, when and why their customers buy.

3. Protect the company's client base with improved loyalty programs.

4. Seize cross-selling and upselling opportunities.

5. Provide targeted promotional information.

6. Optimize workforce planning and operations.

7. Improve inefficiencies in the company's supply chain.

8. Predict market trends, predict future needs, and make companies more innovative and competitive.

9. It helps companies discover new sources of revenue.

 Companies are using Big Data to know what their customers want, who their best customers are, and why people choose different products. The more a company knows about its customers, the more competitive it becomes.
 We can use it with Machine Learning to create market strategies based on predictions about customers. Leveraging big data makes companies customer-centric.
 Companies can use historical and real-time data to assess evolving consumer preferences. This enables businesses to improve and update their marketing strategies, which makes them more responsive to customer needs.

Companies in the present market need to collect and analyze big data because of:

1. Cost savings: Big Data tools like Apache Hadoop, Spark, etc. bring cost-saving benefits to businesses when they have to store large amounts of data. These tools help organizations identify more effective ways of doing business.
2. Time savings

3. Understanding the market condition

4. Social media listening

convergence of key trends

 A Convergence of Key Trends: According to Steve Lucas, Global Executive Vice President and General Manager, SAP Database & Technology at SAP, the difference between "Old Big Data" and "New Big Data" is accessibility. While it's true that the amount of data in the world keeps growing, the real change has been in the ways that we access that data and use it to create value.
 Today, you have technologies like Hadoop, for
example, that make it functionally practical to access a
tremendous amount of data, and then extract value
from it.
 The availability of lower-cost hardware makes it easier
and more feasible to retrieve and process information,
quickly and at lower costs than ever before. So it’s the
convergence of several trends—more data and less
expensive, faster hardware—that’s driving this
transformation.
 Today, we’ve got raw speed at an affordable price.
Next is the ability to do that real-time analysis on very
complex sets of data and models, And finally, we now
have the ability to find solutions for very complex
problems in real time.
 The two scenarios described by Lucas aren’t fantasies.
Yesterday, the cost of real-time data analysis was
prohibitive. Today, real-time analytics have become
affordable.
 As a result, market-leading companies are already
using Big Data Analytics to improve sales revenue,

increase profits, and do a better job of serving


customers.
 The industry has an evolving definition around Big
Data that is currently defined by three dimensions:
1. Volume
2. Variety
3. Velocity
 Although many people define Big Data by volume, definitions based on volume can be troublesome. Some people define volume by the number of occurrences (in database terminology, the rows in a table; in analytics terminology, the number of observations).
 Some people define volume by the number of interesting pieces of information for each occurrence (in database terminology, the columns in a table; in analytics terminology, the features or dimensions). Some people define volume by the combination of depth and width.


Types of Big Data Now that we are on track with what is big data,
let’s have a look at the types of big data:

a) Structured data: By structured data, we mean data that can be processed, stored, and retrieved in a fixed format. It refers to highly organized information that can be readily and seamlessly stored in and accessed from a database by simple search engine algorithms. For instance, the employee table in a company database will be structured: the employee details, their job positions, their salaries, etc., will be present in an organized manner.

b) Unstructured data refers to data that lacks any specific form or structure whatsoever. This makes it very difficult and time-consuming to process and analyze. Email is an example of unstructured data. Structured and unstructured are two important types of big data.


c) Semi-structured data is the third type of big data. Semi-structured data contains both of the formats mentioned above, that is, structured and unstructured data. To be precise, it refers to data that has not been classified under a particular repository (database), yet contains vital information or tags that segregate individual elements within the data. Thus we come to the end of the types of data.

Characteristics of Big Data

Big Data consists of large amounts of data that cannot be processed by traditional data storage or processing units. It is used by many multinational companies to process the data and business of many organizations. The data flow can exceed 150 exabytes per day before replication.

There are five V's of Big Data that explain these characteristics.

5 V's of Big Data

o Volume
o Veracity
o Variety
o Value
o Velocity

Volume

The name Big Data itself is related to enormous size. Big Data is a vast volume of data generated daily from many sources, such as business processes, machines, social media platforms, networks, human interactions, and many more.

For example, Facebook generates approximately a billion messages, records around 4.5 billion "Like" button clicks, and receives more than 350 million new posts each day. Big data technologies can handle such large amounts of data.

Variety

Big Data can be structured, unstructured, or semi-structured, collected from different sources. In the past, data was collected only from databases and spreadsheets, but these days data arrives in many forms: PDFs, emails, audio, social media posts, photos, videos, etc.

The data is categorized as below:

1. Structured data: Structured data has a defined schema with all the required columns and is in tabular form. Structured data is stored in a relational database management system.
2. Semi-structured data: In semi-structured data, the schema is not rigidly defined, e.g., JSON, XML, CSV, TSV, and email. OLTP (Online Transaction Processing) systems are built to work with semi-structured data. It is stored in relations, i.e., tables.
3. Unstructured data: All the unstructured files, log files, audio files, and image files are included in unstructured data. Some organizations have a lot of data available, but they do not know how to derive value from it since the data is raw.
4. Quasi-structured data: Textual data with inconsistent formats that can be formatted with effort, time, and the help of some tools.

Example: Web server logs, i.e., a log file created and maintained by a server that contains a list of activities.

Veracity

Veracity refers to how reliable the data is. There are many ways to filter or translate the data. Veracity is about being able to handle and manage data efficiently, which is also essential in business development.

For example, Facebook posts with hashtags.

Value

Value is an essential characteristic of big data. It is not just any data that we process or store, but valuable and reliable data that we store, process, and analyze.

Velocity

Velocity plays an important role compared to the others. Velocity refers to the speed at which data is created in real time. It covers the speed of incoming data sets, the rate of change, and activity bursts. A primary aspect of Big Data is to provide demanded data rapidly.

Big data velocity deals with the speed at which data flows from sources like application logs, business processes, networks, social media sites, sensors, mobile devices, etc.

industry examples of big data



 web analytics
 big data and marketing
 fraud and big data
 risk and big data
 credit risk management
 big data and algorithmic trading
 big data and healthcare
 big data in medicine
 advertising

web analytics

Web analytics involves analyzing traffic and user behavior for businesses. Its key aspects include:

1. Traffic and user flow: understanding how users navigate a website and achieve specific goals like conversions.

2. Goal acquisition: setting clear objectives to determine what a business wants to achieve.

3. Potential keywords: identifying key search terms that can attract more visitors to the website.

4. Identifying improvement segments: recognizing areas where performance can be enhanced.

5. Key Performance Indicators (KPIs): metrics that measure success; they vary based on the type of business and its strategy.

6. Data insights using Google Analytics (a small example follows this section):

- Micro-level analysis: focuses on individual user actions, like how many times a page is printed or how often job applications are submitted.

- Macro-level analysis: looks at broader business goals impacting large groups, such as total conversions among specific demographics.

In summary, both detailed and broad data analysis are needed to optimize business performance.
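As a concrete illustration of the micro- and macro-level analysis described above, here is a minimal sketch using pandas on a made-up event log; the column names, values, and the choice of pandas are assumptions for illustration, not a Google Analytics export format.

import pandas as pd

# Made-up web event log: one row per user action (illustrative data only).
events = pd.DataFrame({
    "user_id":   [1, 1, 2, 3, 3, 4],
    "age_group": ["18-24", "18-24", "25-34", "25-34", "25-34", "18-24"],
    "action":    ["view", "apply_job", "view", "view", "apply_job", "view"],
})

# Micro level: how often each individual action occurs.
print(events["action"].value_counts())

# Macro level: conversion rate (share of users who submitted an application) per demographic.
per_user = events.groupby("user_id").agg(
    age_group=("age_group", "first"),
    converted=("action", lambda a: (a == "apply_job").any()),
)
print(per_user.groupby("age_group")["converted"].mean())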
Importance of Web Analytics: web analytics is needed to assess the success rate of a website and its associated business, and to use web data for improvement.

big data and marketing

 In marketing, big data refers to the ever-increasing volume, velocity, variety, variability and complexity of information. For marketing organizations, big data is the fundamental consequence of the new marketing landscape, born from the digital world we now live in.
 Many marketers may feel like data has always been
big – and in some ways, it has. But think about the
customer data businesses collected 20 years ago –
point of sale transaction data, responses to direct mail
campaigns, coupon redemption, etc. Then think about
the customer data collected today – online purchase
data, click-through rates, browsing behavior, social
media interactions, mobile device usage, geolocation
data, etc. Comparatively speaking, there’s no
comparison. And to borrow an old phrase, "You ain’t
seen nothin' yet.
By combining big data with an integrated marketing
management strategy, marketing organizations can make a
substantial impact in these key areas:

 Customer engagement. Big data can deliver insight


into not just who your customers are, but where they
are, what they want, how they want to be contacted
and when.
 Customer retention and loyalty. Big data can help
you discover what influences customer loyalty and
what keeps them coming back again and again.
 Marketing optimization/performance. With big data,
you can determine the optimal marketing spend
across multiple channels, as well as continuously
optimize marketing programs through testing,
measurement and analysis.

Three types of big data that are a big deal for marketing
Customer: The big data category most familiar to marketing
may include behavioral, attitudinal and transactional metrics
from such sources as marketing campaigns, points of sale,
websites, customer surveys, social media, online
communities and loyalty programs.

Operational: This big data category typically includes


objective metrics that measure the quality of marketing
processes relating to marketing operations, resource
allocation, asset management, budgetary controls, etc.

Financial: Typically housed in an organization’s financial


systems, this big data category may include sales, revenue,
profits and other objective data types that measure the
financial health of the organization

fraud and big data


 Fraudulent activities, including e-commerce
scams, insurance fraud, cybersecurity threats,
and financing fraud, pose significant risks to
both individuals and companies across various
industries such as retail, insurance, banking,
and healthcare.
 To combat these risks, businesses increasingly
adopt advanced fraud prevention technologies
and robust risk management strategies that
depend on Big Data. For instance, predictive
analytics models, alternative data sources, and
advanced machine learning techniques
empower decision-makers to develop innovative

approaches and methodologies to proactively


prevent fraud.
 These technologies analyze large volumes of
data to identify patterns and anomalies in
transactions that indicate fraudulent behavior,
allowing businesses to take proper action.
 In response to current challenges, companies
are shifting to advanced data analytics
techniques for fraud prevention technologies
and risk management strategies that use Big
Data. Techniques like predictive analytics,
alternative data, and machine learning are
helping create new ways to prevent fraud.
Applications of big data analytics in fraud
detection

Here are a few applications of big data analytics in


fraud detection:
 Real-time fraud monitoring: One of the main
benefits of using big data in fraud detection is
the ability to perform real-time analytics and
monitoring. Traditional methods of detecting
fraud often depend on past data analysis, which
may not be fast enough to stop advanced
fraudsters.
 Pattern recognition: Integrating machine
learning algorithms with big data analytics
boosts insurance fraud detection analytics and
prevention. These algorithms learn from
historical data, identifying patterns and trends
linked to fraudulent activities.
 Anomaly detection: Big data allows advanced behavioral analytics, which involves analyzing user behavior patterns to identify anomalies. By establishing a baseline of normal user behavior, organizations can quickly detect deviations that may indicate fraud (a small sketch follows this list).
 Predictive modeling and risk assessment:
Predictive models can assist organizations in

predicting fraud scenarios and identifying


suspicious activities. These models can include
variables such as transaction volume, velocity,
or customer behavior patterns to evaluate the
likelihood of fraud.
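Below is a minimal sketch of the baseline-and-deviation idea mentioned in the anomaly-detection point above: each new transaction is scored by how many standard deviations it sits from the customer's historical mean. The sample amounts and the threshold of 3 are illustrative assumptions, not part of any particular product.

import statistics

# Illustrative historical transaction amounts for one customer (assumed data).
history = [42.0, 55.5, 38.2, 61.0, 47.3, 52.8, 44.1, 58.9]

def is_anomalous(amount, history, z_threshold=3.0):
    # Flag a transaction whose z-score against the customer's baseline exceeds the threshold.
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return False
    return abs(amount - mean) / stdev > z_threshold

print(is_anomalous(49.90, history))   # False: within the normal range
print(is_anomalous(950.00, history))  # True: far outside the baseline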

Examples of big data analytics for fraud detection and prevention:

 PayPal: uses machine learning to analyze billions of transactions.

 Mastercard: uses data mining to identify fraud patterns across millions of merchants and cardholders.

 HSBC: integrates and analyzes data from customer profiles, data records, transaction records, and external databases.

 American Express: uses natural language generation and geospatial analysis to analyze customers' spending habits.

risk and big data

credit risk management

Types of Big Data in Credit Risk Management

Structured Data

1. Transaction data: real-time data on the transaction activity of customers provides an indication of their income patterns and spending habits, and therefore an indication of when stress might occur.
2. Credit bureau reports: these reports capture important indicators from a customer's transactions with other financial institutions and can help indicate the customer's creditworthiness and historical stress patterns.
3. Financial statements: these statements can provide an indication of income patterns.
4. Loan application data: application data can help the bank assess creditworthiness, e.g., through FICO/Vantage scores.

Unstructured Data

1. Social media and online behavior: social media


information, browser history, and other online activities,
such as buying behavior on the internet, can provide
knowledge on a borrower's lifestyle, behavior pattern,
and potential risk to his/her finances.
2. Call center transcripts: call center transcripts, emails
or chat logs can reveal a customer’s buying patterns,
or if he/she is under stress.

Alternate Data
Non-traditional data sources could include mobile phone
usage, utility bill payments, and geolocation data. These
data sources are mainly helpful for individuals with little or
no traditional credit history.

Big Data Tools and Techniques used in Credit Risk


Management
Data Collection and Storage

1. Data storage: distributed file systems like Hadoop's HDFS are used for storing huge datasets.
2. Data lakes: data lakes (like Amazon S3, Azure Data Lake, etc.) allow financial institutions to store both structured and unstructured data, enabling fast, real-time data analysis.

Data Preprocessing

1. ETL (Extract, Transform, Load): data needs to be collected from multiple sources, for example transaction data from the institution's internal data warehouse.

Machine Learning Models

1. Classification models: algorithms like logistic regression and ensemble methods (random forests, gradient boosting) are used to classify borrowers into risk categories (a small sketch follows this list).
2. Clustering Techniques: unsupervised learning
techniques, such as k-means clustering, help in
grouping borrowers with similar risk profiles based on
behavioral and transactional data.
3. Deep Learning: neural networks can be useful for
more complex analyses, such as detecting hidden
patterns in borrower behavior that are indicative of
potential defaults especially on unstructured data.
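As a small sketch of the classification idea above, the following uses scikit-learn's logistic regression on a tiny, made-up borrower dataset; the feature names, values, and the choice of library are assumptions for illustration, not a prescribed credit-scoring implementation.

from sklearn.linear_model import LogisticRegression
import numpy as np

# Made-up borrower features: [monthly_income_k, debt_to_income_ratio, late_payments].
X = np.array([
    [5.0, 0.20, 0],
    [3.2, 0.55, 3],
    [7.5, 0.10, 0],
    [2.8, 0.70, 5],
    [4.1, 0.35, 1],
    [6.3, 0.25, 0],
])
# Labels: 1 = defaulted, 0 = repaid (illustrative only).
y = np.array([0, 1, 0, 1, 0, 0])

# Fit a simple risk classifier.
model = LogisticRegression()
model.fit(X, y)

# Score a new applicant: the probability of default drives the risk category.
applicant = np.array([[3.0, 0.60, 2]])
prob_default = model.predict_proba(applicant)[0, 1]
print(f"Estimated probability of default: {prob_default:.2f}")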

Real-Time Analytics

1. Stream processing: technologies such as Apache Kafka and Apache Flink enable real-time processing of data, allowing risk assessments to be monitored continuously.
2. Dynamic credit scoring: real-time data can be used to update a borrower's credit score instead of relying on periodic updates, providing a more up-to-date risk profile (a small sketch follows this list).
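The sketch below illustrates the dynamic credit scoring idea with a plain-Python event loop standing in for a Kafka or Flink pipeline: each incoming transaction event immediately adjusts a running score. The event fields, score adjustments, and starting score are invented for illustration.

# A toy event stream standing in for a Kafka/Flink source (assumed event shape).
events = [
    {"customer": "C1", "type": "payment_on_time"},
    {"customer": "C1", "type": "overdraft"},
    {"customer": "C1", "type": "payment_on_time"},
    {"customer": "C1", "type": "missed_payment"},
]

# Illustrative score adjustments per event type.
ADJUSTMENTS = {"payment_on_time": +5, "overdraft": -10, "missed_payment": -25}

def update_scores(event_stream, starting_score=650):
    # Update a running credit score as events arrive, instead of recomputing periodically.
    scores = {}
    for event in event_stream:
        cust = event["customer"]
        scores.setdefault(cust, starting_score)
        scores[cust] += ADJUSTMENTS.get(event["type"], 0)
        print(f"{cust}: {event['type']} -> score {scores[cust]}")
    return scores

update_scores(events)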

Big Data applications in Credit Risk Management

1. Improved credit scoring models: institutions can use data from multiple sources, both structured and unstructured, resulting in more accurate credit scoring models. This helps an institution better assess a borrower's ability and willingness to pay.
2. Improved Early Warning System: financial
institutions can create early warning systems by using
big data that can send alerts when the system
identifies a change in borrower’s risk.
3. Fraud Detection: analytics, using big data, can help in
fraud detection by identifying abnormal behaviors, for
ex. unusual spending or inconsistency of customer
data. These types of outliers can be identified much
better by machine learning models than a conventional
rules-based system.

4. Risk-Based Pricing: with a more realistic and detailed


understanding of borrower risk profile, financial
institutions can embark on risk-based pricing policies
whereby interest rates can be offered reflecting
individual risk profiles. This would ensure maximization
of profit while allowing the mitigation of risk.

big data and algorithmic trading

What is Algorithmic Trading?

The application of computer and communication techniques has stimulated the rise of algorithmic trading. Algorithmic trading is the use of computer programs for entering trading orders, in which the programs decide on almost every aspect of the order, including the timing, price, and quantity.

Role of Big Data in Algorithmic Trading


1. Technical Analysis: Technical analysis is the study of prices and price behavior, using charts as the primary tool (a small sketch follows these points).

2. Real Time Analysis : The automated process


enables computer to execute financial trades at
speeds and frequencies that a human trader cannot.

3. Machine Learning : With Machine Learning,


algorithms are constantly fed data and actually get
smarter over time by learning from past mistakes,
logically deducing new conclusions based on past
results and creating new techniques that make
sense based on thousands of unique factors.
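As a concrete illustration of the technical analysis point above, here is a minimal sketch of a moving-average crossover signal computed over a price series; the prices, window lengths, and the trading rule itself are illustrative assumptions, not a recommended strategy.

# Illustrative daily closing prices (assumed data).
prices = [100, 101, 103, 102, 105, 107, 110, 108, 112, 115, 113, 117]

def moving_average(series, window):
    # Simple moving average over the trailing `window` points.
    return sum(series[-window:]) / window

def crossover_signal(series, short_window=3, long_window=6):
    # Emit BUY when the short-term average rises above the long-term average, SELL when below.
    if len(series) < long_window:
        return "HOLD"
    short_ma = moving_average(series, short_window)
    long_ma = moving_average(series, long_window)
    if short_ma > long_ma:
        return "BUY"
    if short_ma < long_ma:
        return "SELL"
    return "HOLD"

# Walk the series as if prices were streaming in, printing the signal at each step.
for day in range(6, len(prices) + 1):
    print(day, crossover_signal(prices[:day]))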

big data and healthcare



In health care, big data is generated by various sources and


analyzed to guide decision-making, improve patient
outcomes, and decrease health care costs, among other
things. Some of the most common sources of big data in
health care include electronic health records (EHR),
electronic medical records (EMRs), personal health records
(PHRs), and data produced by widespread digital
health tools like wearable medical devices and health apps
on mobile devices.

Big data applications in health care

Professionals in health care use big data for a wide range of


purposes – from developing insights in biomedical research
to providing patients with personalized medicine. Here are
just some of the ways that big data is used in health care
today:
 Employing predictive analytics to create machine learning
models that can predict the likelihood a patient might
develop a particular disease.
 Providing real-time alerts to medical staff by continuously
monitoring patient conditions within a facility.
 Enhancing security surrounding the processing of sensitive
medical data, such as insurance claims and medical
records.

Benefits of big data in health care

Big data has the potential to improve health care for the
better. Here are some of the most common benefits of using
big data in health care:
 Better patient care: More patient data means an
opportunity to understand the patient experience better and
improve the care they receive.
 Improved research: Big data gives medical researchers
unprecedented access to a large volume of data and
methods of collecting data. In turn, this data can drive
important medical breakthroughs that save lives.
 Smarter treatment plans: Analyzing the treatment plans
that helped patients (and those that didn’t) can help

researchers create even better treatment plans for future


patients.
 Reduced health care costs for patients and health
providers: Health care can cost a lot. Big data offers the
possibility of reducing the cost of obtaining and providing
health care by identifying appropriate treatment plans,
allocating resources intelligently, and identifying potential
health issues before they occur.

big data in advertising

Big data is becoming a fundamental tool in


marketing. Data constantly informs marketing
teams of customer behaviors and industry trends,
and is used to optimize future efforts, create
innovative campaigns and build lasting
relationships with customers.

Customer Engagement and Retention

Big data regarding customers provides marketers


details about user demographics, locations, and
interests, which can be used to personalize the

product experience and increase customer loyalty


over time.

Marketing Optimization and Performance

Big data solutions can help organize data and


pinpoint which marketing campaigns, strategies or
social channels are getting the most traction. This
lets marketers allocate marketing resources and
reduce costs for projects that aren’t yielding as
much revenue or meeting desired audience goals.

Competitor Tracking and Operation Adjustment

Big data can also compare prices and marketing


trends among competitors to see what consumers
prefer. Based on average industry standards,
marketers can then adjust product
prices, logistics and other operations to appeal to
customers and remain competitive.

big data technologies



introduction to Hadoop

 Hadoop is an open-source software framework that is used for storing and processing large amounts of data in a distributed computing environment. It is designed to handle big data and is based on the MapReduce programming model, which allows for the parallel processing of large datasets (a small word-count sketch follows this list).
 Hadoop is an open source software programming
framework for storing a large amount of data and
performing the computation. Its framework is based on
Java programming with some native code in C and
shell scripts.

Hadoop has two main components:

 HDFS (Hadoop Distributed File System): This is the


storage component of Hadoop, which allows for the
storage of large amounts of data across multiple
machines. It is designed to work with commodity
hardware, which makes it cost-effective.

 YARN (Yet Another Resource Negotiator): This is the


resource management component of Hadoop, which
manages the allocation of resources (such as CPU and
memory) for processing the data stored in HDFS.
 Hadoop also includes several additional modules that
provide additional functionality, such as Hive (a SQL-like
query language), Pig (a high-level platform for creating
MapReduce programs), and HBase (a non-relational,
distributed database).
 Hadoop is commonly used in big data scenarios such as
data warehousing, business intelligence, and machine
learning. It’s also used for data processing, data
analysis, and data mining. It enables the distributed
processing of large data sets across clusters of
computers using a simple programming model.

Features of Hadoop:
 1. It is fault tolerant.
 2. It is highly available.
 3. Its programming model is easy.
 4. It has huge, flexible storage.
 5. It is low cost.

Hadoop Distributed File System

Hadoop has a distributed file system known as HDFS. HDFS splits files into blocks and distributes them across various nodes in large clusters. In case of a node failure, the system keeps operating, and data transfer between the nodes is facilitated by HDFS.
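Below is a minimal sketch of the block-splitting and replication idea, not HDFS's actual placement algorithm: a file is divided into 128 MB blocks and each block is assigned to three data nodes in round-robin fashion. The node names and the replication factor of 3 are illustrative defaults.

import math

BLOCK_SIZE_MB = 128
REPLICATION = 3
DATANODES = ["node1", "node2", "node3", "node4", "node5"]

def place_blocks(file_size_mb):
    # Split a file into 128 MB blocks and assign each block to REPLICATION data nodes.
    num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    placement = {}
    for b in range(num_blocks):
        # Round-robin choice of nodes; real HDFS also considers racks and free space.
        placement[f"block_{b}"] = [DATANODES[(b + r) % len(DATANODES)] for r in range(REPLICATION)]
    return placement

# A 400 MB file becomes 4 blocks, each stored on 3 different nodes.
for block, nodes in place_blocks(400).items():
    print(block, nodes)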

Some common frameworks of Hadoop

1. Hive: it uses HiveQL for data structuring and for writing complicated MapReduce jobs over HDFS.
2. Drill: it consists of user-defined functions and is used for data exploration.
3. Storm: it allows real-time processing and streaming of data.
4. Spark: it contains a Machine Learning Library (MLlib) for providing enhanced machine learning and is widely used for data processing. It also supports Java, Python, and Scala.
5. Pig: it has Pig Latin, a SQL-like language, and performs data transformation of unstructured data.
6. Tez: it reduces the complexities of Hive and Pig and helps their code run faster.

open source technologies


1. Apache Cassandra: It is one of the No-SQL databases
which is highly scalable and has high availability. In this,
we can replicate data across multiple data centers.
Replication across multiple data centers is supported. In
Cassandra, fault tolerance is one of the big factors in which
failed nodes can be easily replaced without any downtime.
2. Apache Hadoop: Hadoop is one of the most widely
used big data technology that is used to handle large-scale
data, large file systems by using Hadoop file system which
is called HDFS, and parallel processing like features using
the MapReduce framework of Hadoop. Hadoop is a
scalable system that helps to provide a scalable solution
capable of handling large capacities and capabilities. For
example: If you see real use cases like NextBio is using
Hadoop MapReduce and HBase to process multi-terabyte
data sets off the human genome.
3. Apache Hive: It is used for data summarization and ad
hoc querying which means for querying and analyzing Big
Data easily. It is built on top of Hadoop for providing data
summarization, ad-hoc queries, and the analysis of large
datasets using SQL-like language called HiveQL. It is not a
relational database and not a language for real-time
queries. It has many features like: designed for OLAP, SQL
type language called HiveQL, fast, scalable, and
extensible.
4. Apache Flume: It is a distributed and reliable system
that is used to collect, aggregate, and move large amounts
of log data from many data sources toward a centralized
data store.
5. Apache Spark: the main objective of Spark is to speed up the Hadoop computational process; it was introduced by the Apache Software Foundation. Apache Spark can work independently because it has its own cluster management, and it is not an updated or modified version of Hadoop. Spark can be combined with Hadoop in two ways: for storage and for processing. Since Spark has its own cluster management for computation, it typically uses Hadoop only for storage. Spark includes interactive queries and stream processing, and in-memory cluster computing is one of its key features.
6. Apache Kafka: It is a distributed publish-subscribe
messaging system and more specifically you can say it has
a robust queue that allows you to handle a high volume of
data, and you can pass the messages from one point to
another as you can say from one sender to receiver. You
can perform message computation in both offline and
online modes, it is suitable for both. To prevent data loss
Kafka messages are replicated within the cluster. For real-
time streaming data analysis, it integrates Apache Storm
and Spark and is built on top of the ZooKeeper
synchronization service.
7. MongoDB: it is cross-platform and works on concepts like collections and documents. It has document-oriented storage, meaning data is stored in JSON-like form. Any attribute can be indexed. It has features like high availability, replication, rich queries, auto-sharding, and fast in-place updates (a small sketch follows this list).
8. Elasticsearch: it is a real-time distributed, open-source full-text search and analytics engine. It is highly scalable and can handle structured and unstructured data up to petabytes. It can be used as a replacement for document-based stores such as MongoDB and RavenDB. To improve search performance, it uses denormalization. A real use case is as an enterprise search engine; big organizations such as Wikipedia and GitHub use it.
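As a small illustration of the document model described in the MongoDB entry above, the sketch below inserts a JSON-like document, creates an index on an attribute, and queries it with the pymongo client; the connection string, database, collection, and field names are assumptions for the example, and a running MongoDB instance is required.

from pymongo import MongoClient

# Connect to a local MongoDB instance (assumed to be running on the default port).
client = MongoClient("mongodb://localhost:27017")
collection = client["shop"]["orders"]

# Documents are stored in JSON-like (BSON) form; fields can vary per document.
collection.insert_one({"customer": "alice", "items": ["laptop", "mouse"], "total": 1200.50})

# Any attribute can be indexed to speed up queries on it.
collection.create_index("customer")

# Rich queries: find orders above a threshold for one customer.
for order in collection.find({"customer": "alice", "total": {"$gt": 1000}}):
    print(order)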

cloud and big data

Cloud computing offers an effective solution for dealing with big data sets. Organizations can store, efficiently manage, and analyze their big data by leveraging the scalability of on-demand cloud resources such as storage capacity.

mobile business intelligence

The ability to access analytics and data on mobile devices or tablets rather than desktop computers is referred to as mobile business intelligence. Business metric dashboards and key performance indicators (KPIs) are displayed more clearly.

With the rising use of mobile devices, the technologies we all use in our daily lives, including in business, have grown as well. Many businesses have benefited from mobile business intelligence; the following outlines its main advantages.

Advantages of mobile BI

1. Simple access: Mobile BI is not restricted


to a single mobile device or a certain
place. You can view your data at any time
and from any location. Having real-time
visibility into a firm improves production
and the daily efficiency of the business.
Obtaining a company's perspective with a
single click simplifies the process.

2. Competitive advantage: Many firms are


seeking better and more responsive
methods to do business in order to stay
ahead of the competition. Easy access to
real-time data improves company
opportunities and raises sales and capital.
This also aids in making the necessary
decisions as market conditions change.

3. Simple decision-making: As previously


stated, mobile BI provides access to real-
time data at any time and from any
location. During its demand, Mobile BI
offers the information. This assists
consumers in obtaining what they require
at the time. As a result, decisions are
made quickly.

4.Increase Productivity : By extending BI to


mobile, the organization's teams can
access critical company data when they
need it. Obtaining all of the corporate data
with a single click frees up a significant
amount of time to focus on the smooth
and efficient operation of the firm.
Increased productivity results in a smooth
and quick-running firm.

Crowdsourcing analytics

Crowdsourcing is a sourcing model in which an individual or an organization gets support from a large, open, and rapidly evolving group of people in the form of ideas, micro-tasks, finances, etc. Crowdsourcing typically involves the use of the internet to attract a large group of people to divide tasks or to achieve a target. The term was coined in 2005 by Jeff Howe and Mark Robinson. Crowdsourcing can help different types of organizations get new ideas and solutions, deeper consumer engagement, optimization of tasks, and several other benefits.

 Crowdsourcing is a method where individuals or organizations gather support from a large group of people for ideas, small tasks, or funding. It usually takes place online, allowing many participants to collaborate on tasks or reach specific goals.
 The term was first introduced in 2005 by Jeff Howe and Mark Robinson. Crowdsourcing can benefit organizations by providing new ideas, enhancing consumer engagement, improving task efficiency, and more.

Crowdsourcing is used across many areas:

 1. Enterprise: businesses leverage crowdsourcing for ideas, feedback, and solutions from a diverse group of people.
 2. IT: technology development and software improvements often benefit from the input and contributions of a wide audience.
 3. Marketing: companies use crowdsourcing to generate creative content, gather consumer opinions, and identify trends.
 4. Education: crowdsourcing enhances learning by involving students and educators in collaborative projects and research.
 5. Finance: it helps in gathering data, assessing financial trends, and creating new investment models.
 6. Science and health: researchers and healthcare professionals use crowdsourcing for data collection, citizen science projects, and public health initiatives.

 In summary, crowdsourcing is versatile and can be utilized effectively across various fields to harness collective intelligence.
 Examples: Doritos, Lay's, Starbucks, Airbnb.

inter and trans firewall analytics.


Inter-firewall analytics
 Focus: Analyzes traffic flows between different firewalls
within a network.
 Methodology: Utilizes data collected from multiple
firewalls to identify anomalies and potential breaches.
 Benefits: Provides a comprehensive view of network
traffic flow and helps identify lateral movement across
different security zones.
 Limitations: Requires deployment of multiple firewalls
within the network and efficient data exchange
mechanisms between them.
Trans-firewall analytics
 Focus: Analyzes encrypted traffic that traverses firewalls,
which traditional security solutions may not be able to
decrypt and inspect.
 Methodology: Uses deep packet inspection (DPI) and
other advanced techniques to analyze the content of
encrypted traffic without compromising its security.
 Benefits: Provides insight into previously hidden threats
within encrypted traffic and helps detect sophisticated
attacks.
 Limitations: Requires specialized hardware and software
solutions for DPI, and raises concerns regarding potential
data privacy violations.
Difference between inter- and trans-firewall analytics:

 Focus: inter-firewall analytics looks at network traffic flow between firewalls; trans-firewall analytics looks at the content of encrypted traffic.
 Methodology: inter-firewall analytics analyzes data from multiple firewalls; trans-firewall analytics uses DPI and other techniques to analyze encrypted traffic.
 Benefits: inter-firewall analytics gives a comprehensive view of network traffic and identifies lateral movement; trans-firewall analytics detects threats within encrypted traffic.
 Limitations: inter-firewall analytics requires multiple firewalls and efficient data exchange; trans-firewall analytics requires specialized hardware and software and raises privacy concerns.

LESSON 2:

Introduction to NoSQL

BDA UNIT-2

LESSON 3:

Data formats, analyzing data with Hadoop



Requirement of Hadoop Framework


 Hadoop is an Apache open source framework written in java
that allows distributed processing of large datasets across
clusters of computers using simple programming models.
 The Hadoop framework application works in an environment
that provides distributed storage and computation across
clusters of computers.
 Hadoop is designed to scale up from single server to
thousands of machines, each offering local computation and
storage.
 Design Principles: Fault Tolerance, Scalability, Data Locality

Components of Hadoop

There are three core components of Hadoop as mentioned earlier.


They are HDFS, MapReduce, and YARN.
These together form the Hadoop framework architecture.

1.HDFS (Hadoop Distributed File System):

 It is a data storage system


Since the data sets are huge, it uses a distributed system to store
this data.
 It is stored in blocks where each block is 128 MB.
 It consists of NameNode and DataNode. There can only be
one NameNode but multiple DataNodes.
 ARCHITECTURE: HDFS follows a master/worker design with a single NameNode and multiple DataNodes (diagram not reproduced here).

2. MapReduce: It is the data processing component of Hadoop. It processes large datasets in parallel by dividing the work into map tasks and reduce tasks that run across the cluster.

3. YARN (Yet Another Resource Negotiator): It is the resource management unit of the Hadoop framework. Data stored in HDFS can be processed with the help of YARN using data processing engines such as interactive processing, and it can support any sort of data analysis. Features: it acts as an operating system for the data stored on HDFS, and it helps schedule tasks to avoid overloading any system.

fault tolerance with data replication

Types of Fault Tolerance in Distributed Systems

1. Hardware fault tolerance: hardware fault tolerance involves keeping a backup plan for hardware devices such as memory, hard disks, CPUs, and other hardware peripheral devices.
2. Software fault tolerance: software fault tolerance is a type of fault tolerance where dedicated software is used to detect invalid output, runtime errors, and programming errors.
3. System fault tolerance: system fault tolerance is a type of fault tolerance that covers the whole system. It has the advantage that it not only stores checkpoints but also the memory block and program checkpoints, and it detects errors in applications automatically.
Fault Tolerance Strategies
Fault tolerance strategies are essential for ensuring that
distributed systems continue to operate smoothly even when
components fail. Here are the key strategies commonly used:
 Redundancy and Replication
o Data replication: data is duplicated across multiple nodes or locations to ensure availability and durability.
o Component redundancy: critical system components are duplicated so that if one component fails, others can take over. This includes redundant servers, network paths, or services.
 Failover Mechanisms
o Active-Passive Failover: One component (active)
handles the workload while another component
(passive) remains on standby. If the active component
fails, the passive component takes over.
o Active-Active Failover: Multiple components actively
handle workloads and share the load. If one component
fails, others continue to handle the workload.
 Error Detection Techniques
o Heartbeat mechanisms: regular signals (heartbeats) are sent between components to detect failures. If a component stops sending heartbeats, it is considered failed (a small sketch follows this list).
o Checkpointing: periodic saving of the system's state so that if a failure occurs, the system can be restored to the last saved state.
 Error Recovery Methods
o Rollback Recovery: The system reverts to a previous
state after detecting an error, using saved checkpoints
or logs.
o Forward Recovery: The system attempts to correct or
compensate for the failure to continue operating. This
may involve reprocessing or reconstructing data.
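Below is a minimal sketch of the heartbeat mechanism from the error detection techniques above: a monitor marks a component as failed once its last heartbeat is older than a timeout. The timeout value and component names are illustrative assumptions.

import time

HEARTBEAT_TIMEOUT = 5.0  # seconds without a heartbeat before a node is considered failed

# Last heartbeat timestamp received from each component (assumed example state).
last_heartbeat = {
    "worker-1": time.time(),         # just sent a heartbeat
    "worker-2": time.time() - 12.0,  # silent for 12 seconds
}

def detect_failures(heartbeats, timeout=HEARTBEAT_TIMEOUT):
    # Return the components whose heartbeats are older than the timeout.
    now = time.time()
    return [node for node, ts in heartbeats.items() if now - ts > timeout]

print(detect_failures(last_heartbeat))  # ['worker-2'] -> trigger failover or recovery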
Data Compression
Data compression, in simple terms, refers to the process of reducing the
size (number of bits) of a file or data stream. The primary goal of
compression is to store or transmit data in a more efficient manner,
ultimately saving storage space, reducing bandwidth requirements, and
improving overall system performance.

There are two types of compression techniques: lossless compression


and lossy compression.

Lossless Compression

In lossless compression, the compressed data can be fully recovered to


its original form without any loss of information.
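As a small illustration of lossless compression, the sketch below compresses a repetitive byte string with Python's zlib module and recovers it exactly; the sample text is made up, and actual Hadoop jobs would typically rely on codecs such as gzip or Snappy configured in the framework.

import zlib

# Repetitive data compresses well (illustrative sample).
original = b"big data big data big data " * 100

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), "->", len(compressed), "bytes")  # noticeably smaller
print(restored == original)  # True: lossless compression recovers every bit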

Lossy Compression

Lossy compression, on the other hand, sacrifices some degree of data


accuracy for more significant compression ratios. It achieves higher
compression by discarding less essential or perceptually insignificant
data.

Serialization in Hadoop

Serialization is the process of converting in-memory data structures or objects into a format suitable for storage or transmission. In Apache Hadoop, serialization is essential for efficiently transferring data between the Map and Reduce tasks and for persisting data in HDFS.

Hadoop provides various serialization frameworks such as Apache Avro and Apache Thrift, among others.
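To illustrate the general idea (this is not Hadoop's own Writable or Avro machinery), the following sketch serializes an in-memory record to bytes with Python's json module and restores it; the record fields are invented for the example.

import json

# An in-memory record, e.g. an intermediate result produced by a map task (illustrative).
record = {"word": "hadoop", "count": 42}

# Serialize: convert the object to bytes suitable for storage or network transfer.
payload = json.dumps(record).encode("utf-8")

# Deserialize: reconstruct the in-memory object on the other side.
restored = json.loads(payload.decode("utf-8"))
print(payload, restored == record)  # True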

LESSON 4

Introduction to Hive

 Hive is a data warehouse infrastructure tool to process structured


data in Hadoop.

 It resides on top of Hadoop to summarize Big Data, and makes


querying and analyzing easy. Initially Hive was developed by
Facebook, later the Apache Software Foundation took it up and
developed it further as an open source under the name Apache
Hive.

 It is used by different companies. For example, Amazon uses it in


Amazon Elastic MapReduce

 Features of Hive

 It stores schema in a database and processed data into HDFS.

 It is designed for OLAP.

 It provides SQL type language for querying called HiveQL or


HQL.

 It is familiar, fast, scalable, and extensible.

Relationship between Hive Clients and Hive Services

Hive Client

Hive allows writing applications in various languages, including


Java, Python, and C++. It supports different types of clients such
as:-

o Thrift Server - It is a cross-language service provider


platform that serves the request from all those programming
languages that supports Thrift.
o JDBC Driver - It is used to establish a connection between
hive and Java applications. The JDBC Driver is present in
the class org.apache.hadoop.hive.jdbc.HiveDriver.
o ODBC Driver - It allows the applications that support the
ODBC protocol to connect to Hive.

Hive Services

The following are the services provided by Hive:-



o Hive CLI - The Hive CLI (Command Line Interface) is a shell


where we can execute Hive queries and commands.
o Hive Web User Interface - The Hive Web UI is just an
alternative of Hive CLI. It provides a web-based GUI for
executing Hive queries and commands.

o Hive Server - It is referred to as Apache Thrift Server. It


accepts the request from different clients and provides it to
Hive Driver.
o Hive Driver - It receives queries from different sources like
web UI, CLI, Thrift, and JDBC/ODBC driver. It transfers the
queries to the compiler.
o Hive MetaStore - It is a central repository that stores all the structural information of the various tables and partitions in the warehouse. It also includes metadata about columns and their types, the serializers and deserializers used to read and write data, and the corresponding HDFS files where the data is stored.
o Hive Compiler - The purpose of the compiler is to parse the
query and perform semantic analysis on the different query
blocks and expressions. It converts HiveQL statements into
MapReduce jobs.
o Hive Execution Engine - Optimizer generates the logical plan
in the form of DAG of map-reduce tasks and HDFS tasks. In
the end, the execution engine executes the incoming tasks
in the order of their dependencies.

 File Formats in Hive

 File format specifies how records are encoded in files.

 Record format implies how a stream of bytes for a given record is encoded. The default file format is TEXTFILE, where each record is a line in the file.

 Hive uses different control characters as delimiters in text files: ^A (octal 001), ^B (octal 002), ^C (octal 003), \n.

 The term "field" is used when overriding the default delimiter, e.g. FIELDS TERMINATED BY '\001'.

 Hive supports text files such as CSV and TSV.

 A TextFile can contain JSON or XML documents.

Hive Commands

Apache Hive provides various types of commands for interacting with


data stored in Hadoop Distributed File System (HDFS). These
commands are classified based on their purpose, such as data definition,
data manipulation, query execution, and administrative tasks.

1. Data Definition Language (DDL) Commands

DDL commands are used to define and manage the structure of tables
and databases in Hive.

Examples:

 Create Database:
CREATE DATABASE my_database;

 Create Table:
CREATE TABLE employees (
id INT,
name STRING,
salary FLOAT
);

 Drop Table:
DROP TABLE employees;

 Alter Table (Add column):

ALTER TABLE employees ADD COLUMNS (department


STRING);

2. Data Manipulation Language (DML) Commands

DML commands are used to query, insert, or modify data in Hive tables.

Examples:

 Insert Data:
o Insert data directly:

INSERT INTO employees VALUES (1, 'John


Doe', 50000.0);

o Insert data from another table:


INSERT INTO employees SELECT * FROM
temp_employees;

 Load Data:
LOAD DATA INPATH
'/path/to/employees_data.csv' INTO TABLE
employees;

 Export Table:
EXPORT TABLE employees TO
'/backup/employees/';

3. Query Language Commands

These commands allow querying and analyzing data using HiveQL.

Examples:

 Select Statement:
SELECT name, salary FROM employees WHERE
salary > 30000;

 Group By:
SELECT department, AVG(salary) FROM employees
GROUP BY department;

 Order By:
SELECT name, salary FROM employees ORDER BY
salary DESC;

 Join:

SELECT a.name, b.department


FROM employees a
JOIN departments b ON a.department_id = b.id;

4. Data Control Language (DCL) Commands

DCL commands manage access permissions.

Examples:

 Grant Permissions:
GRANT SELECT ON TABLE employees TO USER
'john';

 Revoke Permissions:
REVOKE SELECT ON TABLE employees FROM USER
'john';

5. Utility Commands

Utility commands manage metadata, configurations, and other


administrative tasks.

Examples:

 Show Databases:
SHOW DATABASES;

 Show Tables:
SHOW TABLES IN my_database;

 Describe Table:
DESCRIBE FORMATTED employees;

 Explain Query:
EXPLAIN SELECT name FROM employees WHERE
salary > 50000;

6. Transaction Commands

Hive supports ACID transactions for insert, update, and delete


operations.

Examples:

 Enable Transactions:
SET hive.support.concurrency = true;
SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition.mode =
nonstrict;

 Insert Data:
INSERT INTO employees PARTITION
(department='HR') VALUES (2, 'Jane Doe',
60000.0);

 Update Data:
UPDATE employees SET salary = 70000 WHERE id
= 1;

 Delete Data:
DELETE FROM employees WHERE salary < 30000;

Summary of Command Types:

 DDL (Data Definition): define and manage schema, e.g. CREATE TABLE, DROP TABLE.
 DML (Data Manipulation): query and modify data, e.g. INSERT, LOAD, EXPORT.
 Query Language: query and analyze data, e.g. SELECT, JOIN, GROUP BY.
 DCL (Data Control): manage permissions, e.g. GRANT, REVOKE.
 Utility: administrative tasks, e.g. SHOW, DESCRIBE, EXPLAIN.
 Transaction Commands: support ACID operations, e.g. INSERT, UPDATE, DELETE.

Hive supports several types of joins to combine data from multiple


tables. These are similar to SQL joins and are optimized for distributed
processing in a Hadoop environment.

1. Inner Join

Returns rows where there is a match in both tables.

Syntax:
SELECT a.column1, b.column2
FROM table1 a
JOIN table2 b
ON a.common_column = b.common_column;
Example:
SELECT e.name, d.department_name
FROM employees e
JOIN departments d
ON e.department_id = d.id;

2. Left Outer Join

Returns all rows from the left table and the matching rows from the right
table. Rows in the left table without a match will have NULL in the
columns from the right table.

Syntax:
SELECT a.column1, b.column2
FROM table1 a
LEFT OUTER JOIN table2 b
ON a.common_column = b.common_column;
Example:
SELECT e.name, d.department_name
FROM employees e
LEFT OUTER JOIN departments d
ON e.department_id = d.id;

3. Right Outer Join

Returns all rows from the right table and the matching rows from the left
table. Rows in the right table without a match will have NULL in the
columns from the left table.

Syntax:
SELECT a.column1, b.column2
FROM table1 a
RIGHT OUTER JOIN table2 b
ON a.common_column = b.common_column;
Example:
SELECT e.name, d.department_name
FROM employees e
RIGHT OUTER JOIN departments d
ON e.department_id = d.id;

4. Full Outer Join

Returns rows where there is a match in either table. Non-matching rows


in both tables will have NULL in the columns from the other table.

Syntax:
SELECT a.column1, b.column2
FROM table1 a
FULL OUTER JOIN table2 b
ON a.common_column = b.common_column;
Example:
SELECT e.name, d.department_name
FROM employees e
FULL OUTER JOIN departments d

ON e.department_id = d.id;

Window functions

Types of Window Functions


Hive supports several types of window functions:

 Aggregate Functions: These are used to perform calculations like SUM, AVG, MIN, MAX, COUNT over a range of rows.
 Ranking Functions: These functions allow you to
rank rows within a partition with options like RANK,
DENSE_RANK, ROW_NUMBER.
 Analytic Functions: They perform complex
analytics such as LEAD, LAG which access data from
a subsequent or preceding row without using a self-
join.

Window functions in Hive are used to perform operations over a group of rows (a window or partition) and return a single value for each row. They are helpful for tasks like ranking, cumulative sums, running totals, and moving averages.

Example Table: sales_data

emp_id sale_amount dept_id


1 500 101
2 300 101
3 700 102
4 200 102
5 600 103

1. ROW_NUMBER()

Assigns a unique sequential number to each row within a window.

Syntax:
SELECT emp_id, sale_amount, dept_id,
ROW_NUMBER() OVER (PARTITION BY dept_id
ORDER BY sale_amount DESC) AS rank
FROM sales_data;

Result:

emp_id sale_amount dept_id rank


1 500 101 1
2 300 101 2
3 700 102 1
4 200 102 2
5 600 103 1

2. RANK()

Assigns a rank to each row within a window. Ties are given the same
rank, skipping subsequent ranks.

Syntax:

SELECT emp_id, sale_amount, dept_id,
       RANK() OVER (PARTITION BY dept_id ORDER BY sale_amount DESC) AS rank
FROM sales_data;

Result: (Same as ROW_NUMBER here because no ties exist)



emp_id sale_amount dept_id rank


1 500 101 1
2 300 101 2
3 700 102 1
4 200 102 2
5 600 103 1

3. DENSE_RANK()

Similar to RANK(), but does not skip ranks after ties.

Syntax:
SELECT emp_id, sale_amount, dept_id,
DENSE_RANK() OVER (PARTITION BY dept_id
ORDER BY sale_amount DESC) AS rank
FROM sales_data;

Result:

emp_id sale_amount dept_id rank


1 500 101 1
2 300 101 2
3 700 102 1
4 200 102 2
5 600 103 1

4. NTILE(n)

Divides rows into n buckets and assigns a bucket number to each row.

Syntax:

SELECT emp_id, sale_amount, dept_id,
       NTILE(2) OVER (PARTITION BY dept_id ORDER BY sale_amount DESC) AS bucket
FROM sales_data;

Result:

emp_id sale_amount dept_id bucket


1 500 101 1
2 300 101 2
3 700 102 1
4 200 102 2
5 600 103 1

5. LEAD()

Accesses data from the next row in the same result set.

Syntax:
SELECT emp_id, sale_amount, dept_id,
LEAD(sale_amount) OVER (PARTITION BY
dept_id ORDER BY sale_amount) AS next_sale
FROM sales_data;

Result:

emp_id sale_amount dept_id next_sale


2 300 101 500
1 500 101 NULL
4 200 102 700
3 700 102 NULL
5 600 103 NULL

6. LAG()

Accesses data from the previous row in the same result set.

Syntax:
SELECT emp_id, sale_amount, dept_id,
LAG(sale_amount) OVER (PARTITION BY
dept_id ORDER BY sale_amount) AS prev_sale
FROM sales_data;

Result:

emp_id sale_amount dept_id prev_sale


2 300 101 NULL
1 500 101 300
4 200 102 NULL
3 700 102 200
5 600 103 NULL

7. SUM() with OVER()

Calculates a cumulative sum within a window.

Syntax:
SELECT emp_id, sale_amount, dept_id,
SUM(sale_amount) OVER (PARTITION BY
dept_id ORDER BY sale_amount) AS cumulative_sum
FROM sales_data;

Result:

emp_id sale_amount dept_id cumulative_sum


2 300 101 300
1 500 101 800
4 200 102 200
3 700 102 900
5 600 103 600

These commands demonstrate how to use window functions for ranking, cumulative calculations, and accessing rows within partitions in Hive.

Optimization

Optimization in Apache Hive, a data warehouse infrastructure built on top of Hadoop, is critical for processing large datasets efficiently in Big Data. Optimizing Hive queries ensures better performance, reduces execution time, and minimizes resource usage. Here are key strategies for optimization in Hive:

1. Partitioning:

Partitioning splits data into smaller, manageable chunks based on column values. It reduces the data scanned during query execution.

 Example: Partition a sales dataset by region or date.

CREATE TABLE sales (
    id INT,
    amount DOUBLE,
    sale_date STRING
)
PARTITIONED BY (region STRING);
-- Note: the partition column (region) is declared only in PARTITIONED BY,
-- not in the main column list; the date column is named sale_date because
-- DATE is a reserved word in Hive.

 Benefit: Queries only scan relevant partitions instead of the entire dataset.

2. Bucketing:

Bucketing divides data into fixed-size buckets based on hash functions of a column, improving performance in joins and sampling.

 Example: Bucket a table by user_id.

CREATE TABLE users (
    id INT,
    name STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS;

 Benefit: Efficient joins and aggregations on bucketed columns.

3. Indexing:

 Create indexes on frequently queried columns to improve performance. (Note: Hive's CREATE INDEX was removed in Hive 3.0; on newer versions, prefer columnar formats such as ORC/Parquet with built-in min/max statistics, or materialized views.)
CREATE INDEX idx_column
ON TABLE my_table (column_name)
AS 'COMPACT'
WITH DEFERRED REBUILD;

4. Optimize Join Strategies:

a. Map Join:

 Convert large joins into map-side joins when one table is small.
set hive.auto.convert.join=true;

b. Broadcast Join:

 Broadcast smaller tables to all nodes for distributed joins.


set hive.mapjoin.smalltable.filesize=25000000;  -- small-table size threshold (bytes)

LESSON 5

APACHE SPARK

What is Spark?

 Apache Spark is an open-source cluster computing framework.

 Its primary purpose is to handle real-time generated data.

 Spark was built on top of Hadoop MapReduce.

 It is optimized to run in memory, whereas alternatives such as Hadoop's MapReduce write data to and from disk, so Spark processes data much faster.

Features of Apache Spark

 Fast - It provides high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.

 Easy to Use - Applications can be written in Java, Scala, Python, R, and SQL, and more than 80 high-level operators are provided.

 Generality - It provides a collection of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming.

 Lightweight - It is a light, unified analytics engine for large-scale data processing.

 Runs Everywhere - It can run on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud.

There are five main components of Apache Spark:


 Apache Spark Core: It is responsible for functions like
scheduling, input and output operations, task dispatching, etc.
 Spark SQL: This is used to gather information about
structured data and how the data is processed.
 Spark Streaming: This component enables the processing of
live data streams.
 Machine Learning Library: The goal of this component is
scalability and to make machine learning more accessible.
 GraphX: This has a set of APIs that are used for facilitating
graph analytics tasks.

Advantages over Hadoop

Processing Speed & Performance
  Hadoop: The MapReduce model reads and writes data from disk, which slows down processing.
  Spark: Reduces the number of read/write cycles to disk and keeps intermediate data in memory, so processing is much faster.

Usage
  Hadoop: Designed to handle batch processing efficiently.
  Spark: Designed to handle real-time data efficiently.

Latency
  Hadoop: A high-latency computing framework with no interactive mode.
  Spark: A low-latency computing framework that can process data interactively.

Data
  Hadoop: With MapReduce, a developer can only process data in batch mode.
  Spark: Can process real-time data from real-time sources such as Twitter and Facebook.

Cost
  Hadoop: The cheaper option in terms of cost.
  Spark: Requires a lot of RAM to run in memory, which increases the cluster size and hence the cost.

Algorithm Used
  Hadoop: The PageRank algorithm is used.
  Spark: The graph computation library GraphX is used.

Fault Tolerance
  Hadoop: Highly fault-tolerant; fault tolerance is achieved by replicating blocks of data, so if a node goes down the data can be found on another node.
  Spark: Fault tolerance is achieved by storing the chain of transformations; if data is lost, the chain of transformations can be recomputed on the original data.

Security
  Hadoop: Supports LDAP, ACLs, SLAs, etc., and is therefore extremely secure.
  Spark: Not secure on its own; it relies on integration with Hadoop to achieve the necessary security level.

Machine Learning
  Hadoop: Data fragments can be too large and create bottlenecks, so it is slower than Spark.
  Spark: Much faster, as it uses MLlib for computations and processes data in memory.

Scalability
  Hadoop: Easily scaled by adding nodes and disks for storage; it supports tens of thousands of nodes.
  Spark: Harder to scale because it relies on RAM for computation; it supports thousands of nodes in a cluster.

Language Support
  Hadoop: Uses Java or Python for MapReduce applications.
  Spark: Uses Java, R, Scala, Python, or Spark SQL for its APIs.

User-friendliness
  Hadoop: More difficult to use.
  Spark: More user-friendly.

Resource Management
  Hadoop: YARN is the most common option for resource management.
  Spark: Has built-in tools for resource management.

LAZY EVALUATION

 Lazy evaluation in Spark means that execution does not start until an action is triggered; it comes into play when Spark transformations are applied.

 Transformations are lazy in nature: when we call an operation on an RDD, it does not execute immediately.

 Spark keeps a record of the operations that have been called (through the DAG). We can think of a Spark RDD as the data that we build up through transformations.

 Because transformations are lazy, we can execute the recorded operations at any time by calling an action on the data. Hence, with lazy evaluation, data is not loaded until it is necessary (see the sketch below).
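
A minimal Scala sketch of lazy evaluation (this is illustrative, not from the original notes; it assumes a local Spark installation and a hypothetical input file data.txt):

import org.apache.spark.{SparkConf, SparkContext}

object LazyEvalDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LazyEvalDemo").setMaster("local[*]"))

    // Transformations only: nothing is read or computed yet.
    val lines  = sc.textFile("data.txt")            // hypothetical input file
    val errors = lines.filter(_.contains("ERROR"))  // still lazy
    val codes  = errors.map(_.split(" ")(0))        // still lazy

    // The action below triggers execution of the whole recorded chain.
    println("error lines: " + codes.count())

    sc.stop()
  }
}
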
In memory processing

Key Concepts of In-memory Processing in Apache Spark

1. Resilient Distributed Datasets (RDDs):


o RDDs are the fundamental data structures in Spark.
o They support in-memory storage for faster computation and
allow fault tolerance through lineage information.
2. Data Caching and Persistence:
o Spark provides the ability to cache datasets in memory using
the cache() or persist() methods.

o Cached data is reused across multiple operations, avoiding repeated computation or I/O operations.
val rdd = sparkContext.textFile("data.txt")
val cachedRdd = rdd.cache()

3. Distributed Memory:
o Spark divides datasets into partitions and stores them in the
memory of worker nodes across the cluster.
4. Transformation and Action Model:
o Transformations (like map, filter, flatMap) are lazy;
they don't execute until an action (like collect, count,
save) is performed.
o Intermediate results from transformations can be stored in
memory for subsequent operations.
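
A small sketch of explicit caching/persistence (illustrative; it assumes an existing SparkContext sc and a hypothetical file events.txt):

import org.apache.spark.storage.StorageLevel

val events = sc.textFile("events.txt").filter(_.nonEmpty)

// Keep the filtered RDD in memory, spilling to disk if it does not fit.
events.persist(StorageLevel.MEMORY_AND_DISK)

// The first action materializes and caches the partitions...
println(events.count())
// ...later actions reuse the cached partitions instead of re-reading the file.
println(events.map(_.length).reduce((a, b) => math.max(a, b)))

// Release the cached partitions when no longer needed.
events.unpersist()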

DAG

 A DAG (Directed Acyclic Graph) in Apache Spark is a set of vertices and edges, where the vertices represent RDDs and the edges represent the operations to be applied to those RDDs.

 In a Spark DAG, every edge is directed from earlier to later in the sequence. When an action is called, the DAG is submitted to the DAG Scheduler, which splits the graph into stages of tasks.

 A DAG is a finite directed graph with no directed cycles.

 There are finitely many vertices and edges, where each edge is directed from one vertex to another and every edge goes from earlier to later in the sequence. It is a strict generalization of the MapReduce model.

 DAG execution can do better global optimization than systems like MapReduce; the benefit of the DAG becomes clearer in more complex jobs.

 The Spark DAG visualization allows the user to dive into any stage and expand its details; in the stage view, all RDDs belonging to that stage are shown. The scheduler splits the RDD graph into stages based on the transformations applied. Each stage is comprised of tasks, one per partition of the RDD, which perform the same computation in parallel. The graph here refers to navigation, and directed and acyclic refers to how it is traversed (see the lineage sketch below).
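
One simple way to look at the DAG (lineage) Spark records is toDebugString on an RDD; the sketch below is illustrative and assumes an existing SparkContext sc and a hypothetical file logs.txt:

// Build a chain of transformations; no job runs yet.
val counts = sc.textFile("logs.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Prints the chain of RDDs and the stage boundary introduced by the
// shuffle in reduceByKey, without triggering any computation.
println(counts.toDebugString)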

SPARK CONTEXT
 SparkContext is the entry point to the Apache Spark
cluster, and it is used to configure and coordinate the
execution of Spark jobs.
 It is typically created once per Spark application and
is the starting point for interacting with Spark.

You can have only one active SparkContext instance in a single JVM.

Attempting to create multiple SparkContext instances in the same JVM will result in an error.

SPARK SESSION

SparkSession was introduced in Spark 2.x. In Apache Spark, the SparkSession encapsulates and manages the SparkContext internally (see the sketch below).
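
A minimal sketch of creating a SparkSession (Spark 2.x and later) and reaching the SparkContext it manages; the application name and master shown here are illustrative:

import org.apache.spark.sql.SparkSession

// Build (or reuse) the single SparkSession for this application.
val spark = SparkSession.builder()
  .appName("ExampleApp")   // illustrative name
  .master("local[*]")      // illustrative: run locally using all cores
  .getOrCreate()

// The SparkContext is created and managed internally by the session.
val sc = spark.sparkContext
println("Spark version: " + spark.version + ", app id: " + sc.applicationId)

spark.stop()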

RDD

 What is RDD?

 The RDD (Resilient Distributed Dataset) is Spark's core abstraction. It is a collection of elements, partitioned across the nodes of the cluster, so that we can execute various parallel operations on it.

 There are two ways to create RDDs (see the sketch below):
o Parallelizing an existing collection in the driver program
o Referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
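
A short sketch of both ways of creating an RDD (illustrative; it assumes an existing SparkContext sc, and the HDFS path is hypothetical):

// 1) Parallelize an existing collection in the driver program.
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))
println(numbers.sum())    // action: 15.0

// 2) Reference a dataset in external storage (path is hypothetical).
val lines = sc.textFile("hdfs:///data/input.txt")
println(lines.count())    // action: number of lines in the file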

 As transformations (operators) are applied, Spark builds up an operator graph. When the user runs an action (like collect), the graph is submitted to the DAG Scheduler.

 The DAG scheduler divides the operator graph into (map and reduce) stages. A stage is comprised of tasks based on partitions of the input data. The DAG scheduler pipelines operators together to optimize the graph; for example, many map operators can be scheduled into a single stage.

 This optimization is key to Spark's performance. The final result of the DAG scheduler is a set of stages. The stages are passed on to the Task Scheduler, which launches tasks via the cluster manager (Spark Standalone/YARN/Mesos). The task scheduler does not know about dependencies among stages.

RDD Operations

 The RDD provides two types of operations:
o Transformation
o Action

Transformation

In Spark, the role of a transformation is to create a new dataset from an existing one. Transformations are considered lazy, as they are only computed when an action requires a result to be returned to the driver program.

Let's see some of the frequently used RDD Transformations.

 map(func) - Returns a new distributed dataset formed by passing each element of the source through a function func.

 filter(func) - Returns a new dataset formed by selecting those elements of the source on which func returns true.

 flatMap(func) - Similar to map, but each input item can be mapped to zero or more output items, so func should return a sequence rather than a single item.

Action

 In Spark, the role of an action is to return a value to the driver program after running a computation on the dataset. Some of the frequently used RDD actions are listed below (a combined sketch follows this list).

 reduce(func) - Aggregates the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.

 collect() - Returns all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.

 count() - Returns the number of elements in the dataset.

 first() - Returns the first element of the dataset.

 take(n) - Returns an array with the first n elements of the dataset.
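
The following sketch combines the transformations and actions listed above (illustrative; it assumes an existing SparkContext sc):

val words = sc.parallelize(Seq("spark", "hive", "hadoop", "spark", "hdfs"))

// Transformations (lazy)
val upper   = words.map(_.toUpperCase)        // map
val withS   = upper.filter(_.contains("S"))   // filter
val letters = withS.flatMap(_.split(""))      // flatMap: one string -> many 1-char strings

// Actions (trigger execution)
println(letters.count())                      // number of elements
println(withS.collect().mkString(", "))       // all elements brought to the driver
println(upper.first())                        // first element
println(upper.take(2).mkString(", "))         // first two elements
println(words.map(_.length).reduce(_ + _))    // total number of characters via reduce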

DataFrames and Converting RDDs to DataFrames

 A DataFrame is a collection of data organized into named columns, similar to a relational database table.

 It provides a structured format to represent data, facilitating easy manipulation and analysis.

Why Convert RDD to DataFrame or Dataset?


Converting RDD to DataFrame or Dataset allows for
easier manipulation and querying due to the structured
nature of data frames and the type-safety of Datasets. By
converting RDD to DataFrame, data becomes organized
into named columns, promoting optimized storage.
Similarly, converting to a Dataset allows for functional
programming with a type-safe API. Utilizing the
appropriate format, based on the use case, streamlines
data processing and analysis.

Methods to Convert RDD to DataFrame

There are three main approaches:

1. Using a Case Class (Scala-specific)

A case class defines the schema of the data. The RDD is converted to a DataFrame by mapping the data into the case class and invoking the toDF() function.

2. Using Schema Definition (Cross-language):

 You explicitly define a schema using a structured schema


definition (e.g., StructType).
 This approach is flexible and can be used in both Python and
Scala.
 The schema includes field names, data types, and nullability
constraints.

3. Using Implicit Conversions (with toDF):

 This is a simpler and faster method for RDDs containing tuples or Rows.
 You provide column names during conversion.
 This approach works well for datasets with a known structure but without requiring predefined schemas. (All three approaches are illustrated in the sketch below.)
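
The following Scala sketch illustrates the three approaches (written in spark-shell style; it assumes an existing SparkSession spark, and the Person schema and sample data are illustrative):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import spark.implicits._

case class Person(id: Int, name: String)   // illustrative schema

val rdd = spark.sparkContext.parallelize(Seq((1, "Alice"), (2, "Bob")))

// 1) Case class + toDF()
val dfFromCaseClass = rdd.map { case (id, name) => Person(id, name) }.toDF()

// 2) Explicit schema + createDataFrame on an RDD[Row]
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)))
val dfFromSchema = spark.createDataFrame(
  rdd.map { case (id, name) => Row(id, name) }, schema)

// 3) toDF with column names directly on an RDD of tuples
val dfFromTuples = rdd.toDF("id", "name")

dfFromSchema.printSchema()
dfFromTuples.show()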

CATALYST OPTIMIZER

The Catalyst Optimizer is a key component of Apache Spark's SQL engine. It is responsible for optimizing the execution of queries written using SQL, DataFrames, or Datasets. The optimizer uses advanced techniques to improve query performance by transforming the logical and physical execution plans (see the explain() sketch after the feature list).

Key Features of Catalyst Optimizer



1. Rule-Based and Cost-Based Optimization:


o Rule-Based: Applies predefined transformation rules to
simplify and optimize the query.
o Cost-Based: Evaluates multiple execution strategies and
selects the most efficient one based on estimated costs.
2. Logical and Physical Plan Optimization:
o Converts a high-level logical query plan into an optimized
physical plan for execution.
3. Extensibility:
o Designed to be easily extended for custom optimization rules,
making it suitable for evolving use cases.
4. Pluggable:
o Allows developers to add custom rules and integrate with
other systems.

Working with Dates and Timestamps,

Date and Timestamp Types in Spark

1. DateType: Represents dates (e.g., YYYY-MM-DD).


2. TimestampType: Represents timestamps with time and date (e.g.,
YYYY-MM-DD HH:mm:ss).

Common Operations on Dates and Timestamps

1. Creating Date/Timestamp Columns

 From String: Use to_date or to_timestamp to convert strings to date or timestamp types.
 From Existing Data: Apply transformations or extract date and time components.

2. Extracting Components

Spark provides functions to extract specific components like year, month, day, hour, minute, and second:

 year(), month(), dayofmonth()


 hour(), minute(), second()
 date_format() for custom formatting.

3. Date Arithmetic

 Adding/Subtracting Intervals: Use date_add, date_sub, and add_months.
 Date Difference: Use datediff for the difference between two
dates.
 Timestamp Difference: Use unix_timestamp or
timestampdiff.

4. Formatting Dates and Timestamps

 Format or parse dates using date_format and to_date.

5. Filtering/Comparing

 Filter rows based on date or timestamp values using conditional expressions.

Functions for Date and Timestamp in Spark SQL



Function            | Description                                | Example
current_date()      | Returns the current system date.           | current_date() -> 2024-11-17
current_timestamp() | Returns the current system timestamp.      | current_timestamp() -> 2024-11-17 12:00:00
to_date()           | Converts a string or timestamp to a date.  | to_date('2024-11-17 12:00:00') -> 2024-11-17
to_timestamp()      | Converts a string to a timestamp.          | to_timestamp('2024-11-17 12:00:00')
date_format()       | Formats a date/timestamp using a pattern.  | date_format(current_date(), 'dd-MM-yyyy')
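
A combined sketch of the date/timestamp operations described above (illustrative; it assumes an existing SparkSession spark, and the sample data is made up):

import org.apache.spark.sql.functions._
import spark.implicits._

val orders = Seq(("2024-11-17 12:00:00", "2024-11-20")).toDF("created", "due")

val result = orders
  .withColumn("created_ts", to_timestamp($"created"))                 // string -> timestamp
  .withColumn("created_dt", to_date($"created"))                      // string -> date
  .withColumn("due_dt",     to_date($"due"))
  .withColumn("year",       year($"created_dt"))                      // extract a component
  .withColumn("next_week",  date_add($"created_dt", 7))               // date arithmetic
  .withColumn("days_left",  datediff($"due_dt", $"created_dt"))       // difference in days
  .withColumn("pretty",     date_format($"created_dt", "dd-MM-yyyy")) // formatting

result.show(truncate = false)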

Working with Nulls in Data,

Working with null values is a critical part of data processing in Apache Spark, as nulls can significantly impact the results of computations and queries. Spark provides several mechanisms to handle nulls effectively in both DataFrames and SQL queries.

Understanding Nulls in Spark



 Nulls in Spark DataFrames: Represent missing or undefined values in columns.
 Impact on Computations: Nulls propagate in expressions, meaning operations involving nulls typically result in null (e.g., NULL + 1 = NULL).

Common Operations for Null Handling

1. Identifying Nulls

 Use isNull or isNotNull to filter or identify rows with null values.
 SQL: Use IS NULL or IS NOT NULL in conditions.

2. Replacing Nulls

 Replace nulls with default values for specific columns using fillna.
 Replace nulls conditionally using coalesce or case expressions.

3. Removing Nulls

 Drop rows with nulls using dropna. You can specify:


o Thresholds for allowable null values in rows.
o Specific columns to evaluate for nulls.

4. Handling Nulls in Aggregations

 By default, nulls are ignored in aggregations like count, sum, avg.
 To include nulls explicitly, use count(expr) or conditional logic (see the sketch below).
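
A short sketch of these null-handling operations (illustrative; it assumes an existing SparkSession spark, and note that the Scala API exposes fillna/dropna as na.fill and na.drop):

import org.apache.spark.sql.functions._
import spark.implicits._

val people = Seq(
  ("Alice", Some(5000)),
  ("Bob",   None),
  (null.asInstanceOf[String], Some(3000))
).toDF("name", "salary")

// 1. Identify nulls
people.filter($"salary".isNull).show()

// 2. Replace nulls: fixed defaults per column, or coalesce to a fallback value
val filled    = people.na.fill(Map("name" -> "unknown", "salary" -> 0))
val coalesced = people.withColumn("salary_or_zero", coalesce($"salary", lit(0)))

// 3. Remove rows that have a null in the salary column
val dropped = people.na.drop(Seq("salary"))

// 4. Aggregations skip nulls by default: count rows vs. non-null salaries
people.agg(count(lit(1)).as("rows"), count($"salary").as("non_null_salaries")).show()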

LESSON 6

Working with Complex Types
