
BIG DATA ANALYTICS

LESSON 1:

Introduction to big data

why big data

Data is defined as individual facts, such as numbers, words,


measurements, observations or just descriptions of things. For
example, data might include individual prices, weights,
addresses, ages, names, temperatures, dates, or distances. There
are two main types of data:

1. Quantitative data is provided in numerical form, like the


weight, volume, or cost of an item.

2. Qualitative data is descriptive, but non-numerical, like the


name, sex, or eye colour of a person.

Characteristics of Data

The following are six key characteristics of data:

1. Accuracy

2. Validity

3. Reliability

4. Timeliness

5. Relevance

6. Completeness

Types of Digital Data

Digital data is the electronic representation of information in a


format or language that machines can read and understand.

In more technical terms, Digital data is a binary format of


information that's converted into a machine-readable digital
format.

Types of Digital Data:

 Structured

 Unstructured

 Semi-structured

Structured Data:

Structured data refers to any data that resides in a fixed field within a record or file.

 It follows a particular data model.

 It is meaningful data.

 Data is arranged in rows and columns.

E.g.: relational databases, spreadsheets, SQL

Unstructured Data:

Unstructured data cannot readily be classified and fitted into a neat box.

 Also called unclassified data.

 Does not conform to any data model.

 Business rules are not applied.

 Indexing is not required.

E.g.: photos and graphic images, videos, streaming instrument data, webpages, PDF files, PowerPoint presentations, emails, blog entries, wikis and word processing documents.

Sources of Unstructured Data: web pages, images (JPEG, GIF, PNG, etc.), videos, memos, reports, Word documents and PowerPoint presentations, surveys.

Sources of Semi-structured Data: e-mails, XML and other markup languages, binary executables, TCP/IP packets, zipped files, integration of data from different sources, web pages.

Big Data

Big Data is a collection of data that is huge in volume, yet growing exponentially with time.

1. It is data of such large size and complexity that none of the traditional data management tools can store or process it efficiently.
2. Big data is still data, but of huge size.

WHY BIG DATA

1. Big Data initiatives were rated as "extremely important" by 93% of companies. Leveraging a Big Data analytics solution helps organizations unlock strategic value and take full advantage of their assets.

2. It helps organizations understand where, when and why their customers buy.

3. Protect the company's client base with improved loyalty programs.

4. Seize cross-selling and upselling opportunities.

5. Provide targeted promotional information.

6. Optimize workforce planning and operations.

7. Improve inefficiencies in the company's supply chain.

8. Predict market trends, predict future needs, and make companies more innovative and competitive.

9. It helps companies discover new sources of revenue.

 Companies are using Big Data to know what their customers want, who their best customers are, and why people choose different products. The more a company knows about its customers, the more competitive it becomes.
 We can use it with Machine Learning to create market strategies based on predictions about customers. Leveraging big data makes companies customer-centric.
 Companies can use historical and real-time data to assess evolving consumer preferences. This enables businesses to improve and update their marketing strategies, which makes them more responsive to customer needs.

Companies in the present market need to collect and analyze big data because of:

1. Cost savings: Big Data tools like Apache Hadoop, Spark, etc. bring cost-saving benefits to businesses when they have to store large amounts of data. These tools help organizations identify more effective ways of doing business.
2. Time savings

3. Understanding the market condition

4. Social media listening

convergence of key trends

 A Convergence of Key Trends: According to Steve Lucas, Global Executive Vice President and General Manager, SAP Database & Technology at SAP, the difference between "Old Big Data" and "New Big Data" is accessibility. While it's true that the amount of data in the world keeps growing, the real change has been in the ways that we access that data and use it to create value.
 Today, you have technologies like Hadoop, for
example, that make it functionally practical to access a
tremendous amount of data, and then extract value
from it.
 The availability of lower-cost hardware makes it easier
and more feasible to retrieve and process information,
quickly and at lower costs than ever before. So it’s the
convergence of several trends—more data and less
expensive, faster hardware—that’s driving this
transformation.
 Today, we’ve got raw speed at an affordable price.
Next is the ability to do that real-time analysis on very
complex sets of data and models, And finally, we now
have the ability to find solutions for very complex
problems in real time.
 The two scenarios described by Lucas aren’t fantasies.
Yesterday, the cost of real-time data analysis was
prohibitive. Today, real-time analytics have become
affordable.
 As a result, market-leading companies are already
using Big Data Analytics to improve sales revenue,

increase profits, and do a better job of serving


customers.
 The industry has an evolving definition around Big
Data that is currently defined by three dimensions:
1. Volume
2. Variety
3. Velocity
 Although many people define Big Data by volume, definitions based on volume can be troublesome. Some people define volume by the number of occurrences (in database terminology, the rows in a table; in analytics terminology, the number of observations).
 Some people define volume by the number of interesting pieces of information for each occurrence (in database terminology, the columns in a table; in analytics terminology, the features or dimensions). Some people define volume by the combination of depth and width.


Types of Big Data Now that we are on track with what is big data,
let’s have a look at the types of big data:

a) Structured data: By structured data, we mean data that can be processed, stored, and retrieved in a fixed format. It refers to highly organized information that can be readily and seamlessly stored in and accessed from a database by simple search engine algorithms. For instance, the employee table in a company database will be structured: the employee details, their job positions, their salaries, etc., will be present in an organized manner.

b) Unstructured data refers to data that lacks any specific form or structure whatsoever. This makes it very difficult and time-consuming to process and analyze. Email is an example of unstructured data. Structured and unstructured are two important types of big data.


c) Semi-structured data is the third type of big data. Semi-structured data contains both of the formats mentioned above, that is, structured and unstructured data. To be precise, it refers to data that has not been classified under a particular repository (database), yet contains vital information or tags that segregate individual elements within the data. Thus we come to the end of the types of data.

Characteristics of Big Data

Big Data consists of large amounts of data that cannot be processed by traditional data storage or processing units. It is used by many multinational companies to process the data and business of many organizations. The data flow can exceed 150 exabytes per day before replication.

There are five V's of Big Data that explain these characteristics.

5 V's of Big Data

o Volume
o Veracity
o Variety
o Value
o Velocity

Volume

The name Big Data itself is related to enormous size. Big Data is a vast volume of data generated daily from many sources, such as business processes, machines, social media platforms, networks, human interactions, and many more.

For example, Facebook generates approximately a billion messages, records around 4.5 billion "Like" button clicks, and receives more than 350 million new posts each day. Big data technologies can handle such large amounts of data.

Variety

Big Data can be structured, unstructured, or semi-structured, collected from different sources. In the past, data was collected only from databases and spreadsheets, but these days data arrives in many forms: PDFs, emails, audio, social media posts, photos, videos, etc.

The data is categorized as below:

1. Structured data: Structured data has a defined schema with all the required columns and is in tabular form. Structured data is stored in a relational database management system.
2. Semi-structured data: In semi-structured data, the schema is not rigidly defined, e.g., JSON, XML, CSV, TSV, and email. OLTP (Online Transaction Processing) systems are built to work with semi-structured data. It is stored in relations, i.e., tables.
3. Unstructured data: All the unstructured files, log files, audio files, and image files are included in unstructured data. Some organizations have a lot of data available, but they do not know how to derive value from it since the data is raw.
4. Quasi-structured data: Textual data with inconsistent formats that can be formatted with effort, time, and the help of some tools.

Example: Web server logs, i.e., a log file created and maintained by a server that contains a list of activities.

Veracity

Veracity refers to how reliable the data is. There are many ways to filter or translate the data. Veracity is about being able to handle and manage data efficiently, which is also essential in business development.

For example, Facebook posts with hashtags.

Value

Value is an essential characteristic of big data. It is not just any data that we process or store, but valuable and reliable data that we store, process, and analyze.

Velocity

Velocity plays an important role compared to the others. Velocity refers to the speed at which data is created in real time. It covers the speed of incoming data sets, the rate of change, and activity bursts. A primary aspect of Big Data is to provide demanded data rapidly.

Big data velocity deals with the speed at which data flows from sources like application logs, business processes, networks, social media sites, sensors, mobile devices, etc.

industry examples of big data



 web analytics
 big data and marketing
 fraud and big data
 risk and big data
 credit risk management
 big data and algorithmic trading
 big data and healthcare
 big data in medicine
 advertising

web analytics

Web analytics involves analyzing traffic and user behavior for businesses. Its key aspects include:

1. Traffic and user flow: understanding how users navigate a website and achieve specific goals like conversions.

2. Goal acquisition: setting clear objectives to determine what a business wants to achieve.

3. Potential keywords: identifying key search terms that can attract more visitors to the website.

4. Identifying improvement segments: recognizing areas where performance can be enhanced.

5. Key Performance Indicators (KPIs): metrics that measure success; they vary based on the type of business and its strategy.

6. Data insights using Google Analytics (a small example follows this section):

- Micro-level analysis: focuses on individual user actions, like how many times a page is printed or how often job applications are submitted.

- Macro-level analysis: looks at broader business goals impacting large groups, such as total conversions among specific demographics.

In summary, both detailed and broad data analysis are needed to optimize business performance.
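As a concrete illustration of the micro- and macro-level analysis described above, here is a minimal sketch using pandas on a made-up event log; the column names, values, and the choice of pandas are assumptions for illustration, not a Google Analytics export format.

import pandas as pd

# Made-up web event log: one row per user action (illustrative data only).
events = pd.DataFrame({
    "user_id":   [1, 1, 2, 3, 3, 4],
    "age_group": ["18-24", "18-24", "25-34", "25-34", "25-34", "18-24"],
    "action":    ["view", "apply_job", "view", "view", "apply_job", "view"],
})

# Micro level: how often each individual action occurs.
print(events["action"].value_counts())

# Macro level: conversion rate (share of users who submitted an application) per demographic.
per_user = events.groupby("user_id").agg(
    age_group=("age_group", "first"),
    converted=("action", lambda a: (a == "apply_job").any()),
)
print(per_user.groupby("age_group")["converted"].mean())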
Importance of Web Analytics: web analytics is needed to assess the success rate of a website and its associated business, and to use web data for improvement.

big data and marketing

 In marketing, big data refers to the ever-increasing volume, velocity, variety, variability and complexity of information. For marketing organizations, big data is the fundamental consequence of the new marketing landscape, born from the digital world we now live in.
 Many marketers may feel like data has always been
big – and in some ways, it has. But think about the
customer data businesses collected 20 years ago –
point of sale transaction data, responses to direct mail
campaigns, coupon redemption, etc. Then think about
the customer data collected today – online purchase
data, click-through rates, browsing behavior, social
media interactions, mobile device usage, geolocation
data, etc. Comparatively speaking, there’s no
comparison. And to borrow an old phrase, "You ain’t
seen nothin' yet.
By combining big data with an integrated marketing
management strategy, marketing organizations can make a
substantial impact in these key areas:

 Customer engagement. Big data can deliver insight


into not just who your customers are, but where they
are, what they want, how they want to be contacted
and when.
 Customer retention and loyalty. Big data can help
you discover what influences customer loyalty and
what keeps them coming back again and again.
 Marketing optimization/performance. With big data,
you can determine the optimal marketing spend
across multiple channels, as well as continuously
optimize marketing programs through testing,
measurement and analysis.

Three types of big data that are a big deal for marketing
Customer: The big data category most familiar to marketing
may include behavioral, attitudinal and transactional metrics
from such sources as marketing campaigns, points of sale,
websites, customer surveys, social media, online
communities and loyalty programs.

Operational: This big data category typically includes


objective metrics that measure the quality of marketing
processes relating to marketing operations, resource
allocation, asset management, budgetary controls, etc.

Financial: Typically housed in an organization’s financial


systems, this big data category may include sales, revenue,
profits and other objective data types that measure the
financial health of the organization

fraud and big data


 Fraudulent activities, including e-commerce
scams, insurance fraud, cybersecurity threats,
and financing fraud, pose significant risks to
both individuals and companies across various
industries such as retail, insurance, banking,
and healthcare.
 To combat these risks, businesses increasingly
adopt advanced fraud prevention technologies
and robust risk management strategies that
depend on Big Data. For instance, predictive
analytics models, alternative data sources, and
advanced machine learning techniques
empower decision-makers to develop innovative

approaches and methodologies to proactively


prevent fraud.
 These technologies analyze large volumes of
data to identify patterns and anomalies in
transactions that indicate fraudulent behavior,
allowing businesses to take proper action.
 In response to current challenges, companies
are shifting to advanced data analytics
techniques for fraud prevention technologies
and risk management strategies that use Big
Data. Techniques like predictive analytics,
alternative data, and machine learning are
helping create new ways to prevent fraud.
Applications of big data analytics in fraud
detection

Here are a few applications of big data analytics in


fraud detection:
 Real-time fraud monitoring: One of the main
benefits of using big data in fraud detection is
the ability to perform real-time analytics and
monitoring. Traditional methods of detecting
fraud often depend on past data analysis, which
may not be fast enough to stop advanced
fraudsters.
 Pattern recognition: Integrating machine
learning algorithms with big data analytics
boosts insurance fraud detection analytics and
prevention. These algorithms learn from
historical data, identifying patterns and trends
linked to fraudulent activities.
 Anomaly detection: Big data allows advanced behavioral analytics, which involves analyzing user behavior patterns to identify anomalies. By establishing a baseline of normal user behavior, organizations can quickly detect deviations that may indicate fraud (a small sketch follows this list).
 Predictive modeling and risk assessment:
Predictive models can assist organizations in

predicting fraud scenarios and identifying


suspicious activities. These models can include
variables such as transaction volume, velocity,
or customer behavior patterns to evaluate the
likelihood of fraud.
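Below is a minimal sketch of the baseline-and-deviation idea mentioned in the anomaly-detection point above: each new transaction is scored by how many standard deviations it sits from the customer's historical mean. The sample amounts and the threshold of 3 are illustrative assumptions, not part of any particular product.

import statistics

# Illustrative historical transaction amounts for one customer (assumed data).
history = [42.0, 55.5, 38.2, 61.0, 47.3, 52.8, 44.1, 58.9]

def is_anomalous(amount, history, z_threshold=3.0):
    # Flag a transaction whose z-score against the customer's baseline exceeds the threshold.
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return False
    return abs(amount - mean) / stdev > z_threshold

print(is_anomalous(49.90, history))   # False: within the normal range
print(is_anomalous(950.00, history))  # True: far outside the baseline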

Examples of big data analytics for fraud detection and prevention:

 PayPal: uses machine learning to analyze billions of transactions.

 Mastercard: uses data mining to identify fraud patterns across millions of merchants and cardholders.

 HSBC: integrates and analyzes data from customer profiles, data records, transaction records, and external databases.

 American Express: uses natural language generation and geospatial analysis to analyze customers' spending habits.

risk and big data

credit risk management

Types of Big Data in Credit Risk Management

Structured Data

1. Transaction data: real-time data on the transaction activity of customers provides an indication of their income patterns and spending habits, and therefore an indication of when stress might occur.
2. Credit bureau reports: these reports capture important indicators from a customer's transactions with other financial institutions and can help indicate the customer's creditworthiness and historical stress patterns.
3. Financial statements: these statements can provide an indication of income patterns.
4. Loan application data: application data can help the bank assess creditworthiness, e.g., through FICO/Vantage scores.

Unstructured Data

1. Social media and online behavior: social media


information, browser history, and other online activities,
such as buying behavior on the internet, can provide
knowledge on a borrower's lifestyle, behavior pattern,
and potential risk to his/her finances.
2. Call center transcripts: call center transcripts, emails
or chat logs can reveal a customer’s buying patterns,
or if he/she is under stress.

Alternate Data
Non-traditional data sources could include mobile phone
usage, utility bill payments, and geolocation data. These
data sources are mainly helpful for individuals with little or
no traditional credit history.

Big Data Tools and Techniques used in Credit Risk


Management
Data Collection and Storage

1. Data storage: distributed file systems like Hadoop's HDFS are used for storing huge datasets.
2. Data lakes: data lakes (like Amazon S3, Azure Data Lake, etc.) allow financial institutions to store both structured and unstructured data, enabling fast, real-time data analysis.

Data Preprocessing

1. ETL (Extract, Transform, Load): data needs to be collected from multiple sources, for example transaction data from the institution's internal data warehouse.

Machine Learning Models

1. Classification models: algorithms like logistic regression and ensemble methods (random forests, gradient boosting) are used to classify borrowers into risk categories (a small sketch follows this list).
2. Clustering Techniques: unsupervised learning
techniques, such as k-means clustering, help in
grouping borrowers with similar risk profiles based on
behavioral and transactional data.
3. Deep Learning: neural networks can be useful for
more complex analyses, such as detecting hidden
patterns in borrower behavior that are indicative of
potential defaults especially on unstructured data.
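As a small sketch of the classification idea above, the following uses scikit-learn's logistic regression on a tiny, made-up borrower dataset; the feature names, values, and the choice of library are assumptions for illustration, not a prescribed credit-scoring implementation.

from sklearn.linear_model import LogisticRegression
import numpy as np

# Made-up borrower features: [monthly_income_k, debt_to_income_ratio, late_payments].
X = np.array([
    [5.0, 0.20, 0],
    [3.2, 0.55, 3],
    [7.5, 0.10, 0],
    [2.8, 0.70, 5],
    [4.1, 0.35, 1],
    [6.3, 0.25, 0],
])
# Labels: 1 = defaulted, 0 = repaid (illustrative only).
y = np.array([0, 1, 0, 1, 0, 0])

# Fit a simple risk classifier.
model = LogisticRegression()
model.fit(X, y)

# Score a new applicant: the probability of default drives the risk category.
applicant = np.array([[3.0, 0.60, 2]])
prob_default = model.predict_proba(applicant)[0, 1]
print(f"Estimated probability of default: {prob_default:.2f}")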

Real-Time Analytics

1. Stream processing: technologies such as Apache Kafka and Apache Flink enable real-time processing of data, allowing risk assessments to be monitored continuously.
2. Dynamic credit scoring: real-time data can be used to update a borrower's credit score instead of relying on periodic updates, providing a more up-to-date risk profile (a small sketch follows this list).
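The sketch below illustrates the dynamic credit scoring idea with a plain-Python event loop standing in for a Kafka or Flink pipeline: each incoming transaction event immediately adjusts a running score. The event fields, score adjustments, and starting score are invented for illustration.

# A toy event stream standing in for a Kafka/Flink source (assumed event shape).
events = [
    {"customer": "C1", "type": "payment_on_time"},
    {"customer": "C1", "type": "overdraft"},
    {"customer": "C1", "type": "payment_on_time"},
    {"customer": "C1", "type": "missed_payment"},
]

# Illustrative score adjustments per event type.
ADJUSTMENTS = {"payment_on_time": +5, "overdraft": -10, "missed_payment": -25}

def update_scores(event_stream, starting_score=650):
    # Update a running credit score as events arrive, instead of recomputing periodically.
    scores = {}
    for event in event_stream:
        cust = event["customer"]
        scores.setdefault(cust, starting_score)
        scores[cust] += ADJUSTMENTS.get(event["type"], 0)
        print(f"{cust}: {event['type']} -> score {scores[cust]}")
    return scores

update_scores(events)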

Big Data applications in Credit Risk Management

1. Improved credit scoring models: institutions can use data from multiple sources, both structured and unstructured, resulting in more accurate credit scoring models. This helps an institution better assess a borrower's ability and willingness to pay.
2. Improved Early Warning System: financial
institutions can create early warning systems by using
big data that can send alerts when the system
identifies a change in borrower’s risk.
3. Fraud Detection: analytics, using big data, can help in
fraud detection by identifying abnormal behaviors, for
ex. unusual spending or inconsistency of customer
data. These types of outliers can be identified much
better by machine learning models than a conventional
rules-based system.

4. Risk-Based Pricing: with a more realistic and detailed


understanding of borrower risk profile, financial
institutions can embark on risk-based pricing policies
whereby interest rates can be offered reflecting
individual risk profiles. This would ensure maximization
of profit while allowing the mitigation of risk.

big data and algorithmic trading

What is Algorithmic Trading?

The application of computer and communication techniques has stimulated the rise of algorithmic trading. Algorithmic trading is the use of computer programs for entering trading orders, in which the programs decide on almost every aspect of the order, including the timing, price, and quantity.

Role of Big Data in Algorithmic Trading


1. Technical Analysis: Technical analysis is the study of prices and price behavior, using charts as the primary tool (a small sketch follows these points).

2. Real Time Analysis : The automated process


enables computer to execute financial trades at
speeds and frequencies that a human trader cannot.

3. Machine Learning : With Machine Learning,


algorithms are constantly fed data and actually get
smarter over time by learning from past mistakes,
logically deducing new conclusions based on past
results and creating new techniques that make
sense based on thousands of unique factors.
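As a concrete illustration of the technical analysis point above, here is a minimal sketch of a moving-average crossover signal computed over a price series; the prices, window lengths, and the trading rule itself are illustrative assumptions, not a recommended strategy.

# Illustrative daily closing prices (assumed data).
prices = [100, 101, 103, 102, 105, 107, 110, 108, 112, 115, 113, 117]

def moving_average(series, window):
    # Simple moving average over the trailing `window` points.
    return sum(series[-window:]) / window

def crossover_signal(series, short_window=3, long_window=6):
    # Emit BUY when the short-term average rises above the long-term average, SELL when below.
    if len(series) < long_window:
        return "HOLD"
    short_ma = moving_average(series, short_window)
    long_ma = moving_average(series, long_window)
    if short_ma > long_ma:
        return "BUY"
    if short_ma < long_ma:
        return "SELL"
    return "HOLD"

# Walk the series as if prices were streaming in, printing the signal at each step.
for day in range(6, len(prices) + 1):
    print(day, crossover_signal(prices[:day]))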

big data and healthcare



In health care, big data is generated by various sources and


analyzed to guide decision-making, improve patient
outcomes, and decrease health care costs, among other
things. Some of the most common sources of big data in
health care include electronic health records (EHR),
electronic medical records (EMRs), personal health records
(PHRs), and data produced by widespread digital
health tools like wearable medical devices and health apps
on mobile devices.

Big data applications in health care

Professionals in health care use big data for a wide range of


purposes – from developing insights in biomedical research
to providing patients with personalized medicine. Here are
just some of the ways that big data is used in health care
today:
 Employing predictive analytics to create machine learning
models that can predict the likelihood a patient might
develop a particular disease.
 Providing real-time alerts to medical staff by continuously
monitoring patient conditions within a facility.
 Enhancing security surrounding the processing of sensitive
medical data, such as insurance claims and medical
records.

Benefits of big data in health care

Big data has the potential to improve health care for the
better. Here are some of the most common benefits of using
big data in health care:
 Better patient care: More patient data means an
opportunity to understand the patient experience better and
improve the care they receive.
 Improved research: Big data gives medical researchers
unprecedented access to a large volume of data and
methods of collecting data. In turn, this data can drive
important medical breakthroughs that save lives.
 Smarter treatment plans: Analyzing the treatment plans
that helped patients (and those that didn’t) can help

researchers create even better treatment plans for future


patients.
 Reduced health care costs for patients and health
providers: Health care can cost a lot. Big data offers the
possibility of reducing the cost of obtaining and providing
health care by identifying appropriate treatment plans,
allocating resources intelligently, and identifying potential
health issues before they occur.

big data in advertising

Big data is becoming a fundamental tool in


marketing. Data constantly informs marketing
teams of customer behaviors and industry trends,
and is used to optimize future efforts, create
innovative campaigns and build lasting
relationships with customers.

Customer Engagement and Retention

Big data regarding customers provides marketers


details about user demographics, locations, and
interests, which can be used to personalize the

product experience and increase customer loyalty


over time.

Marketing Optimization and Performance

Big data solutions can help organize data and


pinpoint which marketing campaigns, strategies or
social channels are getting the most traction. This
lets marketers allocate marketing resources and
reduce costs for projects that aren’t yielding as
much revenue or meeting desired audience goals.

Competitor Tracking and Operation Adjustment

Big data can also compare prices and marketing


trends among competitors to see what consumers
prefer. Based on average industry standards,
marketers can then adjust product
prices, logistics and other operations to appeal to
customers and remain competitive.

big data technologies



introduction to Hadoop

 Hadoop is an open-source software framework that is used for storing and processing large amounts of data in a distributed computing environment. It is designed to handle big data and is based on the MapReduce programming model, which allows for the parallel processing of large datasets (a small word-count sketch follows this list).
 Hadoop is an open source software programming
framework for storing a large amount of data and
performing the computation. Its framework is based on
Java programming with some native code in C and
shell scripts.

Hadoop has two main components:

 HDFS (Hadoop Distributed File System): This is the


storage component of Hadoop, which allows for the
storage of large amounts of data across multiple
machines. It is designed to work with commodity
hardware, which makes it cost-effective.

 YARN (Yet Another Resource Negotiator): This is the


resource management component of Hadoop, which
manages the allocation of resources (such as CPU and
memory) for processing the data stored in HDFS.
 Hadoop also includes several additional modules that
provide additional functionality, such as Hive (a SQL-like
query language), Pig (a high-level platform for creating
MapReduce programs), and HBase (a non-relational,
distributed database).
 Hadoop is commonly used in big data scenarios such as
data warehousing, business intelligence, and machine
learning. It’s also used for data processing, data
analysis, and data mining. It enables the distributed
processing of large data sets across clusters of
computers using a simple programming model.

Features of Hadoop:
 1. It is fault tolerant.
 2. It is highly available.
 3. Its programming model is easy.
 4. It has huge, flexible storage.
 5. It is low cost.

Hadoop Distributed File System

Hadoop has a distributed file system known as HDFS. HDFS splits files into blocks and distributes them across various nodes in large clusters. In case of a node failure, the system keeps operating, and data transfer between the nodes is facilitated by HDFS.
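Below is a minimal sketch of the block-splitting and replication idea, not HDFS's actual placement algorithm: a file is divided into 128 MB blocks and each block is assigned to three data nodes in round-robin fashion. The node names and the replication factor of 3 are illustrative defaults.

import math

BLOCK_SIZE_MB = 128
REPLICATION = 3
DATANODES = ["node1", "node2", "node3", "node4", "node5"]

def place_blocks(file_size_mb):
    # Split a file into 128 MB blocks and assign each block to REPLICATION data nodes.
    num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    placement = {}
    for b in range(num_blocks):
        # Round-robin choice of nodes; real HDFS also considers racks and free space.
        placement[f"block_{b}"] = [DATANODES[(b + r) % len(DATANODES)] for r in range(REPLICATION)]
    return placement

# A 400 MB file becomes 4 blocks, each stored on 3 different nodes.
for block, nodes in place_blocks(400).items():
    print(block, nodes)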

Some common frameworks of Hadoop

1. Hive: it uses HiveQL for data structuring and for writing complicated MapReduce jobs over HDFS.
2. Drill: it consists of user-defined functions and is used for data exploration.
3. Storm: it allows real-time processing and streaming of data.
4. Spark: it contains a Machine Learning Library (MLlib) for providing enhanced machine learning and is widely used for data processing. It also supports Java, Python, and Scala.
5. Pig: it has Pig Latin, a SQL-like language, and performs data transformation of unstructured data.
6. Tez: it reduces the complexities of Hive and Pig and helps their code run faster.

open source technologies


1. Apache Cassandra: It is one of the No-SQL databases
which is highly scalable and has high availability. In this,
we can replicate data across multiple data centers.
Replication across multiple data centers is supported. In
Cassandra, fault tolerance is one of the big factors in which
failed nodes can be easily replaced without any downtime.
2. Apache Hadoop: Hadoop is one of the most widely
used big data technology that is used to handle large-scale
data, large file systems by using Hadoop file system which
is called HDFS, and parallel processing like features using
the MapReduce framework of Hadoop. Hadoop is a
scalable system that helps to provide a scalable solution
capable of handling large capacities and capabilities. For
example: If you see real use cases like NextBio is using
Hadoop MapReduce and HBase to process multi-terabyte
data sets off the human genome.
3. Apache Hive: It is used for data summarization and ad
hoc querying which means for querying and analyzing Big
Data easily. It is built on top of Hadoop for providing data
summarization, ad-hoc queries, and the analysis of large
datasets using SQL-like language called HiveQL. It is not a
relational database and not a language for real-time
queries. It has many features like: designed for OLAP, SQL
type language called HiveQL, fast, scalable, and
extensible.
4. Apache Flume: It is a distributed and reliable system
that is used to collect, aggregate, and move large amounts
of log data from many data sources toward a centralized
data store.
5. Apache Spark: the main objective of Spark is to speed up the Hadoop computational process; it was introduced by the Apache Software Foundation. Apache Spark can work independently because it has its own cluster management, and it is not an updated or modified version of Hadoop. Spark can be combined with Hadoop in two ways: for storage and for processing. Since Spark has its own cluster management for computation, it typically uses Hadoop only for storage. Spark includes interactive queries and stream processing, and in-memory cluster computing is one of its key features.
6. Apache Kafka: It is a distributed publish-subscribe
messaging system and more specifically you can say it has
a robust queue that allows you to handle a high volume of
data, and you can pass the messages from one point to
another as you can say from one sender to receiver. You
can perform message computation in both offline and
online modes, it is suitable for both. To prevent data loss
Kafka messages are replicated within the cluster. For real-
time streaming data analysis, it integrates Apache Storm
and Spark and is built on top of the ZooKeeper
synchronization service.
7. MongoDB: it is cross-platform and works on concepts like collections and documents. It has document-oriented storage, meaning data is stored in JSON-like form. Any attribute can be indexed. It has features like high availability, replication, rich queries, auto-sharding, and fast in-place updates (a small sketch follows this list).
8. Elasticsearch: it is a real-time distributed, open-source full-text search and analytics engine. It is highly scalable and can handle structured and unstructured data up to petabytes. It can be used as a replacement for document-based stores such as MongoDB and RavenDB. To improve search performance, it uses denormalization. A real use case is as an enterprise search engine; big organizations such as Wikipedia and GitHub use it.
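As a small illustration of the document model described in the MongoDB entry above, the sketch below inserts a JSON-like document, creates an index on an attribute, and queries it with the pymongo client; the connection string, database, collection, and field names are assumptions for the example, and a running MongoDB instance is required.

from pymongo import MongoClient

# Connect to a local MongoDB instance (assumed to be running on the default port).
client = MongoClient("mongodb://localhost:27017")
collection = client["shop"]["orders"]

# Documents are stored in JSON-like (BSON) form; fields can vary per document.
collection.insert_one({"customer": "alice", "items": ["laptop", "mouse"], "total": 1200.50})

# Any attribute can be indexed to speed up queries on it.
collection.create_index("customer")

# Rich queries: find orders above a threshold for one customer.
for order in collection.find({"customer": "alice", "total": {"$gt": 1000}}):
    print(order)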

cloud and big data

Cloud computing offers an effective solution for dealing with big data sets. Organizations can store, efficiently manage, and analyze their big data by leveraging the scalability of on-demand cloud resources such as storage capacity.

mobile business intelligence

The ability to access analytics and data on mobile devices or tablets rather than desktop computers is referred to as mobile business intelligence. Business metric dashboards and key performance indicators (KPIs) are displayed more clearly.

With the rising use of mobile devices, the technologies we all use in our daily lives, including in business, have grown as well. Many businesses have benefited from mobile business intelligence; the following outlines its main advantages.

Advantages of mobile BI

1. Simple access: Mobile BI is not restricted


to a single mobile device or a certain
place. You can view your data at any time
and from any location. Having real-time
visibility into a firm improves production
and the daily efficiency of the business.
Obtaining a company's perspective with a
single click simplifies the process.

2. Competitive advantage: Many firms are


seeking better and more responsive
methods to do business in order to stay
ahead of the competition. Easy access to
real-time data improves company
opportunities and raises sales and capital.
This also aids in making the necessary
decisions as market conditions change.

3. Simple decision-making: As previously


stated, mobile BI provides access to real-
time data at any time and from any
location. During its demand, Mobile BI
offers the information. This assists
consumers in obtaining what they require
at the time. As a result, decisions are
made quickly.

4.Increase Productivity : By extending BI to


mobile, the organization's teams can
access critical company data when they
need it. Obtaining all of the corporate data
with a single click frees up a significant
amount of time to focus on the smooth
and efficient operation of the firm.
Increased productivity results in a smooth
and quick-running firm.

Crowdsourcing analytics

Crowdsourcing is a sourcing model in which an individual or an organization gets support from a large, open, and rapidly evolving group of people in the form of ideas, micro-tasks, finances, etc. Crowdsourcing typically involves the use of the internet to attract a large group of people to divide tasks or to achieve a target. The term was coined in 2005 by Jeff Howe and Mark Robinson. Crowdsourcing can help different types of organizations get new ideas and solutions, deeper consumer engagement, optimization of tasks, and several other benefits.

 Crowdsourcing is a method where individuals or organizations gather support from a large group of people for ideas, small tasks, or funding. It usually takes place online, allowing many participants to collaborate on tasks or reach specific goals.
 The term was first introduced in 2005 by Jeff Howe and Mark Robinson. Crowdsourcing can benefit organizations by providing new ideas, enhancing consumer engagement, improving task efficiency, and more.

Crowdsourcing is used across many areas:

 1. Enterprise: businesses leverage crowdsourcing for ideas, feedback, and solutions from a diverse group of people.
 2. IT: technology development and software improvements often benefit from the input and contributions of a wide audience.
 3. Marketing: companies use crowdsourcing to generate creative content, gather consumer opinions, and identify trends.
 4. Education: crowdsourcing enhances learning by involving students and educators in collaborative projects and research.
 5. Finance: it helps in gathering data, assessing financial trends, and creating new investment models.
 6. Science and health: researchers and healthcare professionals use crowdsourcing for data collection, citizen science projects, and public health initiatives.

 In summary, crowdsourcing is versatile and can be utilized effectively across various fields to harness collective intelligence.
 Examples: Doritos, Lay's, Starbucks, Airbnb.

inter and trans firewall analytics.


Inter-firewall analytics
 Focus: Analyzes traffic flows between different firewalls
within a network.
 Methodology: Utilizes data collected from multiple
firewalls to identify anomalies and potential breaches.
 Benefits: Provides a comprehensive view of network
traffic flow and helps identify lateral movement across
different security zones.
 Limitations: Requires deployment of multiple firewalls
within the network and efficient data exchange
mechanisms between them.
Trans-firewall analytics
 Focus: Analyzes encrypted traffic that traverses firewalls,
which traditional security solutions may not be able to
decrypt and inspect.
 Methodology: Uses deep packet inspection (DPI) and
other advanced techniques to analyze the content of
encrypted traffic without compromising its security.
 Benefits: Provides insight into previously hidden threats
within encrypted traffic and helps detect sophisticated
attacks.
 Limitations: Requires specialized hardware and software
solutions for DPI, and raises concerns regarding potential
data privacy violations.
Difference between inter- and trans-firewall analytics:

 Focus: inter-firewall analytics looks at network traffic flow between firewalls; trans-firewall analytics looks at the content of encrypted traffic.
 Methodology: inter-firewall analytics analyzes data from multiple firewalls; trans-firewall analytics uses DPI and other techniques to analyze encrypted traffic.
 Benefits: inter-firewall analytics gives a comprehensive view of network traffic and identifies lateral movement; trans-firewall analytics detects threats within encrypted traffic.
 Limitations: inter-firewall analytics requires multiple firewalls and efficient data exchange; trans-firewall analytics requires specialized hardware and software and raises privacy concerns.

LESSON 2:

Introduction to NoSQL

BDA UNIT-2

LESSON 3:

Data formats, analyzing data with Hadoop



Requirement of Hadoop Framework


 Hadoop is an Apache open source framework written in java
that allows distributed processing of large datasets across
clusters of computers using simple programming models.
 The Hadoop framework application works in an environment
that provides distributed storage and computation across
clusters of computers.
 Hadoop is designed to scale up from single server to
thousands of machines, each offering local computation and
storage.
 Design Principles: Fault Tolerance, Scalability, Data Locality

Components of Hadoop

There are three core components of Hadoop as mentioned earlier.


They are HDFS, MapReduce, and YARN.
These together form the Hadoop framework architecture.

1.HDFS (Hadoop Distributed File System):

 It is a data storage system


Since the data sets are huge, it uses a distributed system to store
this data.
 It is stored in blocks where each block is 128 MB.
 It consists of NameNode and DataNode. There can only be
one NameNode but multiple DataNodes.
 ARCHITECTURE: HDFS follows a master/worker design with a single NameNode and multiple DataNodes (diagram not reproduced here).

2. MapReduce: It is the data processing component of Hadoop. It processes large datasets in parallel by dividing the work into map tasks and reduce tasks that run across the cluster.

3. YARN (Yet Another Resource Negotiator): It is the resource management unit of the Hadoop framework. Data stored in HDFS can be processed with the help of YARN using data processing engines such as interactive processing, and it can support any sort of data analysis. Features: it acts as an operating system for the data stored on HDFS, and it helps schedule tasks to avoid overloading any system.

fault tolerance with data replication

Types of Fault Tolerance in Distributed Systems

1. Hardware fault tolerance: hardware fault tolerance involves keeping a backup plan for hardware devices such as memory, hard disks, CPUs, and other hardware peripheral devices.
2. Software fault tolerance: software fault tolerance is a type of fault tolerance where dedicated software is used to detect invalid output, runtime errors, and programming errors.
3. System fault tolerance: system fault tolerance is a type of fault tolerance that covers the whole system. It has the advantage that it not only stores checkpoints but also the memory block and program checkpoints, and it detects errors in applications automatically.
Fault Tolerance Strategies
Fault tolerance strategies are essential for ensuring that
distributed systems continue to operate smoothly even when
components fail. Here are the key strategies commonly used:
 Redundancy and Replication
o Data replication: data is duplicated across multiple nodes or locations to ensure availability and durability.
o Component redundancy: critical system components are duplicated so that if one component fails, others can take over. This includes redundant servers, network paths, or services.
 Failover Mechanisms
o Active-Passive Failover: One component (active)
handles the workload while another component
(passive) remains on standby. If the active component
fails, the passive component takes over.
o Active-Active Failover: Multiple components actively
handle workloads and share the load. If one component
fails, others continue to handle the workload.
 Error Detection Techniques
o Heartbeat mechanisms: regular signals (heartbeats) are sent between components to detect failures. If a component stops sending heartbeats, it is considered failed (a small sketch follows this list).
o Checkpointing: periodic saving of the system's state so that if a failure occurs, the system can be restored to the last saved state.
 Error Recovery Methods
o Rollback Recovery: The system reverts to a previous
state after detecting an error, using saved checkpoints
or logs.
o Forward Recovery: The system attempts to correct or
compensate for the failure to continue operating. This
may involve reprocessing or reconstructing data.
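Below is a minimal sketch of the heartbeat mechanism from the error detection techniques above: a monitor marks a component as failed once its last heartbeat is older than a timeout. The timeout value and component names are illustrative assumptions.

import time

HEARTBEAT_TIMEOUT = 5.0  # seconds without a heartbeat before a node is considered failed

# Last heartbeat timestamp received from each component (assumed example state).
last_heartbeat = {
    "worker-1": time.time(),         # just sent a heartbeat
    "worker-2": time.time() - 12.0,  # silent for 12 seconds
}

def detect_failures(heartbeats, timeout=HEARTBEAT_TIMEOUT):
    # Return the components whose heartbeats are older than the timeout.
    now = time.time()
    return [node for node, ts in heartbeats.items() if now - ts > timeout]

print(detect_failures(last_heartbeat))  # ['worker-2'] -> trigger failover or recovery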
Data Compression
Data compression, in simple terms, refers to the process of reducing the
size (number of bits) of a file or data stream. The primary goal of
compression is to store or transmit data in a more efficient manner,
ultimately saving storage space, reducing bandwidth requirements, and
improving overall system performance.

There are two types of compression techniques: lossless compression


and lossy compression.

Lossless Compression

In lossless compression, the compressed data can be fully recovered to


its original form without any loss of information.
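As a small illustration of lossless compression, the sketch below compresses a repetitive byte string with Python's zlib module and recovers it exactly; the sample text is made up, and actual Hadoop jobs would typically rely on codecs such as gzip or Snappy configured in the framework.

import zlib

# Repetitive data compresses well (illustrative sample).
original = b"big data big data big data " * 100

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), "->", len(compressed), "bytes")  # noticeably smaller
print(restored == original)  # True: lossless compression recovers every bit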

Lossy Compression

Lossy compression, on the other hand, sacrifices some degree of data


accuracy for more significant compression ratios. It achieves higher
compression by discarding less essential or perceptually insignificant
data.

Serialization in Hadoop

Serialization is the process of converting in-memory data structures or objects into a format suitable for storage or transmission. In Apache Hadoop, serialization is essential for efficiently transferring data between the Map and Reduce tasks and for persisting data in HDFS.

Hadoop provides various serialization frameworks such as Apache Avro and Apache Thrift, among others.
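To illustrate the general idea (this is not Hadoop's own Writable or Avro machinery), the following sketch serializes an in-memory record to bytes with Python's json module and restores it; the record fields are invented for the example.

import json

# An in-memory record, e.g. an intermediate result produced by a map task (illustrative).
record = {"word": "hadoop", "count": 42}

# Serialize: convert the object to bytes suitable for storage or network transfer.
payload = json.dumps(record).encode("utf-8")

# Deserialize: reconstruct the in-memory object on the other side.
restored = json.loads(payload.decode("utf-8"))
print(payload, restored == record)  # True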

LESSON 4

Introduction to Hive

 Hive is a data warehouse infrastructure tool to process structured


data in Hadoop.

 It resides on top of Hadoop to summarize Big Data, and makes


querying and analyzing easy. Initially Hive was developed by
Facebook, later the Apache Software Foundation took it up and
developed it further as an open source under the name Apache
Hive.

 It is used by different companies. For example, Amazon uses it in


Amazon Elastic MapReduce

 Features of Hive

 It stores schema in a database and processed data into HDFS.

 It is designed for OLAP.

 It provides SQL type language for querying called HiveQL or


HQL.

 It is familiar, fast, scalable, and extensible.

Relationship between Hive Clients and Hive Services

Hive Client

Hive allows writing applications in various languages, including


Java, Python, and C++. It supports different types of clients such
as:-

o Thrift Server - It is a cross-language service provider


platform that serves the request from all those programming
languages that supports Thrift.
o JDBC Driver - It is used to establish a connection between
hive and Java applications. The JDBC Driver is present in
the class org.apache.hadoop.hive.jdbc.HiveDriver.
o ODBC Driver - It allows the applications that support the
ODBC protocol to connect to Hive.

Hive Services

The following are the services provided by Hive:-



o Hive CLI - The Hive CLI (Command Line Interface) is a shell


where we can execute Hive queries and commands.
o Hive Web User Interface - The Hive Web UI is just an
alternative of Hive CLI. It provides a web-based GUI for
executing Hive queries and commands.

o Hive Server - It is referred to as Apache Thrift Server. It


accepts the request from different clients and provides it to
Hive Driver.
o Hive Driver - It receives queries from different sources like
web UI, CLI, Thrift, and JDBC/ODBC driver. It transfers the
queries to the compiler.
o Hive MetaStore - It is a central repository that stores all the structural information of the various tables and partitions in the warehouse. It also includes metadata about columns and their types, the serializers and deserializers used to read and write data, and the corresponding HDFS files where the data is stored.
o Hive Compiler - The purpose of the compiler is to parse the
query and perform semantic analysis on the different query
blocks and expressions. It converts HiveQL statements into
MapReduce jobs.
o Hive Execution Engine - Optimizer generates the logical plan
in the form of DAG of map-reduce tasks and HDFS tasks. In
the end, the execution engine executes the incoming tasks
in the order of their dependencies.

 File Formats in Hive

 File format specifies how records are encoded in files.

 Record format implies how a stream of bytes for a given record is encoded. The default file format is TEXTFILE, where each record is a line in the file.

 Hive uses different control characters as delimiters in text files: ^A (octal 001), ^B (octal 002), ^C (octal 003), \n.

 The term "field" is used when overriding the default delimiter, e.g. FIELDS TERMINATED BY '\001'.

 Hive supports text files such as CSV and TSV.

 A TextFile can contain JSON or XML documents.

Hive Commands

Apache Hive provides various types of commands for interacting with


data stored in Hadoop Distributed File System (HDFS). These
commands are classified based on their purpose, such as data definition,
data manipulation, query execution, and administrative tasks.

1. Data Definition Language (DDL) Commands

DDL commands are used to define and manage the structure of tables
and databases in Hive.

Examples:

 Create Database:
CREATE DATABASE my_database;

 Create Table:
CREATE TABLE employees (
id INT,
name STRING,
salary FLOAT
);

 Drop Table:
DROP TABLE employees;

 Alter Table (Add column):

ALTER TABLE employees ADD COLUMNS (department


STRING);

2. Data Manipulation Language (DML) Commands

DML commands are used to query, insert, or modify data in Hive tables.

Examples:

 Insert Data:
o Insert data directly:

INSERT INTO employees VALUES (1, 'John


Doe', 50000.0);

o Insert data from another table:


INSERT INTO employees SELECT * FROM
temp_employees;

 Load Data:
LOAD DATA INPATH
'/path/to/employees_data.csv' INTO TABLE
employees;

 Export Table:
EXPORT TABLE employees TO
'/backup/employees/';

3. Query Language Commands

These commands allow querying and analyzing data using HiveQL.

Examples:

 Select Statement:
SELECT name, salary FROM employees WHERE
salary > 30000;

 Group By:
SELECT department, AVG(salary) FROM employees
GROUP BY department;

 Order By:
SELECT name, salary FROM employees ORDER BY
salary DESC;

 Join:

SELECT a.name, b.department


FROM employees a
JOIN departments b ON a.department_id = b.id;

4. Data Control Language (DCL) Commands

DCL commands manage access permissions.

Examples:

 Grant Permissions:
GRANT SELECT ON TABLE employees TO USER
'john';

 Revoke Permissions:
REVOKE SELECT ON TABLE employees FROM USER
'john';

5. Utility Commands

Utility commands manage metadata, configurations, and other


administrative tasks.

Examples:

 Show Databases:
SHOW DATABASES;

 Show Tables:
SHOW TABLES IN my_database;

 Describe Table:
DESCRIBE FORMATTED employees;

 Explain Query:
EXPLAIN SELECT name FROM employees WHERE
salary > 50000;

6. Transaction Commands

Hive supports ACID transactions for insert, update, and delete


operations.

Examples:

 Enable Transactions:
SET hive.support.concurrency = true;
SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition.mode =
nonstrict;

 Insert Data:
INSERT INTO employees PARTITION
(department='HR') VALUES (2, 'Jane Doe',
60000.0);

 Update Data:
UPDATE employees SET salary = 70000 WHERE id
= 1;

 Delete Data:
DELETE FROM employees WHERE salary < 30000;

Summary of Command Types:

 DDL (Data Definition): define and manage schema, e.g. CREATE TABLE, DROP TABLE.
 DML (Data Manipulation): query and modify data, e.g. INSERT, LOAD, EXPORT.
 Query Language: query and analyze data, e.g. SELECT, JOIN, GROUP BY.
 DCL (Data Control): manage permissions, e.g. GRANT, REVOKE.
 Utility: administrative tasks, e.g. SHOW, DESCRIBE, EXPLAIN.
 Transaction Commands: support ACID operations, e.g. INSERT, UPDATE, DELETE.

Hive supports several types of joins to combine data from multiple


tables. These are similar to SQL joins and are optimized for distributed
processing in a Hadoop environment.

1. Inner Join

Returns rows where there is a match in both tables.

Syntax:
SELECT a.column1, b.column2
FROM table1 a
JOIN table2 b
ON a.common_column = b.common_column;
Example:
SELECT e.name, d.department_name
FROM employees e
JOIN departments d
ON e.department_id = d.id;

2. Left Outer Join

Returns all rows from the left table and the matching rows from the right
table. Rows in the left table without a match will have NULL in the
columns from the right table.

Syntax:
SELECT a.column1, b.column2
FROM table1 a
LEFT OUTER JOIN table2 b
ON a.common_column = b.common_column;
Example:
SELECT e.name, d.department_name
FROM employees e
LEFT OUTER JOIN departments d
ON e.department_id = d.id;

3. Right Outer Join

Returns all rows from the right table and the matching rows from the left
table. Rows in the right table without a match will have NULL in the
columns from the left table.

Syntax:
SELECT a.column1, b.column2
FROM table1 a
RIGHT OUTER JOIN table2 b
ON a.common_column = b.common_column;
Example:
SELECT e.name, d.department_name
FROM employees e
RIGHT OUTER JOIN departments d
ON e.department_id = d.id;

4. Full Outer Join

Returns rows where there is a match in either table. Non-matching rows


in both tables will have NULL in the columns from the other table.

Syntax:
SELECT a.column1, b.column2
FROM table1 a
FULL OUTER JOIN table2 b
ON a.common_column = b.common_column;
Example:
SELECT e.name, d.department_name
FROM employees e
FULL OUTER JOIN departments d

ON e.department_id = d.id;

Window functions

Types of Window Functions


Hive supports several types of window functions:

 Aggregate Functions: These are used to perform calculations like SUM, AVG, MIN, MAX, COUNT over a range of rows.
 Ranking Functions: These functions allow you to
rank rows within a partition with options like RANK,
DENSE_RANK, ROW_NUMBER.
 Analytic Functions: They perform complex
analytics such as LEAD, LAG which access data from
a subsequent or preceding row without using a self-
join.

Window functions in Hive are used to perform operations over a group of rows (a window or partition) and return a single value for each row. They are helpful for tasks like ranking, cumulative sums, running totals, and moving averages.

Example Table: sales_data

emp_id sale_amount dept_id


1 500 101
2 300 101
3 700 102
4 200 102
5 600 103

1. ROW_NUMBER()

Assigns a unique sequential number to each row within a window.

Syntax:
SELECT emp_id, sale_amount, dept_id,
ROW_NUMBER() OVER (PARTITION BY dept_id
ORDER BY sale_amount DESC) AS rank
FROM sales_data;

Result:

emp_id sale_amount dept_id rank


1 500 101 1
2 300 101 2
3 700 102 1
4 200 102 2
5 600 103 1

2. RANK()

Assigns a rank to each row within a window. Ties are given the same
rank, skipping subsequent ranks.

Syntax:

SELECT emp_id, sale_amount, dept_id,
       RANK() OVER (PARTITION BY dept_id ORDER BY sale_amount DESC) AS rank
FROM sales_data;

Result: (Same as ROW_NUMBER here because no ties exist)



emp_id sale_amount dept_id rank


1 500 101 1
2 300 101 2
3 700 102 1
4 200 102 2
5 600 103 1

3. DENSE_RANK()

Similar to RANK(), but does not skip ranks after ties.

Syntax:
SELECT emp_id, sale_amount, dept_id,
DENSE_RANK() OVER (PARTITION BY dept_id
ORDER BY sale_amount DESC) AS rank
FROM sales_data;

Result:

emp_id sale_amount dept_id rank


1 500 101 1
2 300 101 2
3 700 102 1
4 200 102 2
5 600 103 1

4. NTILE(n)

Divides rows into n buckets and assigns a bucket number to each row.

Syntax:

SELECT emp_id, sale_amount, dept_id,
       NTILE(2) OVER (PARTITION BY dept_id ORDER BY sale_amount DESC) AS bucket
FROM sales_data;

Result:

emp_id sale_amount dept_id bucket


1 500 101 1
2 300 101 2
3 700 102 1
4 200 102 2
5 600 103 1

5. LEAD()

Accesses data from the next row in the same result set.

Syntax:
SELECT emp_id, sale_amount, dept_id,
LEAD(sale_amount) OVER (PARTITION BY
dept_id ORDER BY sale_amount) AS next_sale
FROM sales_data;

Result:

emp_id sale_amount dept_id next_sale


2 300 101 500
1 500 101 NULL
4 200 102 700
3 700 102 NULL
5 600 103 NULL

6. LAG()

Accesses data from the previous row in the same result set.

Syntax:
SELECT emp_id, sale_amount, dept_id,
LAG(sale_amount) OVER (PARTITION BY
dept_id ORDER BY sale_amount) AS prev_sale
FROM sales_data;

Result:

emp_id sale_amount dept_id prev_sale


2 300 101 NULL
1 500 101 300
4 200 102 NULL
3 700 102 200
5 600 103 NULL

7. SUM() with OVER()

Calculates a cumulative sum within a window.

Syntax:
SELECT emp_id, sale_amount, dept_id,
SUM(sale_amount) OVER (PARTITION BY
dept_id ORDER BY sale_amount) AS cumulative_sum
FROM sales_data;

Result:

emp_id sale_amount dept_id cumulative_sum


2 300 101 300
1 500 101 800
4 200 102 200
3 700 102 900
5 600 103 600

These commands demonstrate how to use window functions for ranking, cumulative calculations, and accessing rows within partitions in Hive.

Optimization

Optimization in Apache Hive, a data warehouse infrastructure built on top of Hadoop, is critical for processing large datasets efficiently in Big Data. Optimizing Hive queries ensures better performance, reduces execution time, and minimizes resource usage. Here are key strategies for optimization in Hive:

1. Partitioning:

Partitioning splits data into smaller, manageable chunks based on column values. It reduces the data scanned during query execution.

 Example: Partition a sales dataset by region or date.

CREATE TABLE sales (
    id INT,
    amount DOUBLE,
    sale_date STRING
)
PARTITIONED BY (region STRING);
-- Note: the partition column (region) is declared only in PARTITIONED BY,
-- not in the main column list; the date column is named sale_date because
-- DATE is a reserved word in Hive.

 Benefit: Queries only scan relevant partitions instead of the entire dataset.

2. Bucketing:

Bucketing divides data into fixed-size buckets based on hash functions of a column, improving performance in joins and sampling.

 Example: Bucket a table by user_id.

CREATE TABLE users (
    id INT,
    name STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS;

 Benefit: Efficient joins and aggregations on bucketed columns.

3. Indexing:

 Create indexes on frequently queried columns to improve performance. (Note: Hive's CREATE INDEX was removed in Hive 3.0; on newer versions, prefer columnar formats such as ORC/Parquet with built-in min/max statistics, or materialized views.)
CREATE INDEX idx_column
ON TABLE my_table (column_name)
AS 'COMPACT'
WITH DEFERRED REBUILD;

4. Optimize Join Strategies:

a. Map Join:

 Convert large joins into map-side joins when one table is small.
set hive.auto.convert.join=true;

b. Broadcast Join:

 Broadcast smaller tables to all nodes for distributed joins.


set hive.mapjoin.smalltable.filesize=25000000;  -- small-table size threshold (bytes)

LESSON 5

APACHE SPARK

What is Spark?

 Apache Spark is an open-source cluster computing framework.

 Its primary purpose is to handle real-time generated data.

 Spark was built on top of Hadoop MapReduce.

 It is optimized to run in memory, whereas alternatives such as Hadoop's MapReduce write data to and from disk, so Spark processes data much faster.

Features of Apache Spark

 Fast - It provides high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.

 Easy to Use - Applications can be written in Java, Scala, Python, R, and SQL, and more than 80 high-level operators are provided.

 Generality - It provides a collection of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming.

 Lightweight - It is a light, unified analytics engine for large-scale data processing.

 Runs Everywhere - It can run on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud.

There are five main components of Apache Spark:


 Apache Spark Core: It is responsible for functions like
scheduling, input and output operations, task dispatching, etc.
 Spark SQL: This is used to gather information about
structured data and how the data is processed.
 Spark Streaming: This component enables the processing of
live data streams.
 Machine Learning Library: The goal of this component is
scalability and to make machine learning more accessible.
 GraphX: This has a set of APIs that are used for facilitating
graph analytics tasks.

Advantages over Hadoop

Processing Speed & Performance
  Hadoop: The MapReduce model reads and writes data from disk, which slows down processing.
  Spark: Reduces the number of read/write cycles to disk and keeps intermediate data in memory, so processing is much faster.

Usage
  Hadoop: Designed to handle batch processing efficiently.
  Spark: Designed to handle real-time data efficiently.

Latency
  Hadoop: A high-latency computing framework with no interactive mode.
  Spark: A low-latency computing framework that can process data interactively.

Data
  Hadoop: With MapReduce, a developer can only process data in batch mode.
  Spark: Can process real-time data from real-time sources such as Twitter and Facebook.

Cost
  Hadoop: The cheaper option in terms of cost.
  Spark: Requires a lot of RAM to run in memory, which increases the cluster size and hence the cost.

Algorithm Used
  Hadoop: The PageRank algorithm is used.
  Spark: The graph computation library GraphX is used.

Fault Tolerance
  Hadoop: Highly fault-tolerant; fault tolerance is achieved by replicating blocks of data, so if a node goes down the data can be found on another node.
  Spark: Fault tolerance is achieved by storing the chain of transformations; if data is lost, the chain of transformations can be recomputed on the original data.

Security
  Hadoop: Supports LDAP, ACLs, SLAs, etc., and is therefore extremely secure.
  Spark: Not secure on its own; it relies on integration with Hadoop to achieve the necessary security level.

Machine Learning
  Hadoop: Data fragments can be too large and create bottlenecks, so it is slower than Spark.
  Spark: Much faster, as it uses MLlib for computations and processes data in memory.

Scalability
  Hadoop: Easily scaled by adding nodes and disks for storage; it supports tens of thousands of nodes.
  Spark: Harder to scale because it relies on RAM for computation; it supports thousands of nodes in a cluster.

Language Support
  Hadoop: Uses Java or Python for MapReduce applications.
  Spark: Uses Java, R, Scala, Python, or Spark SQL for its APIs.

User-friendliness
  Hadoop: More difficult to use.
  Spark: More user-friendly.

Resource Management
  Hadoop: YARN is the most common option for resource management.
  Spark: Has built-in tools for resource management.

LAZY EVALUATION

 Lazy evaluation in Spark means that execution does not start until an action is triggered; it comes into play when Spark transformations are applied.

 Transformations are lazy in nature: when we call an operation on an RDD, it does not execute immediately.

 Spark keeps a record of the operations that have been called (through the DAG). We can think of a Spark RDD as the data that we build up through transformations.

 Because transformations are lazy, we can execute the recorded operations at any time by calling an action on the data. Hence, with lazy evaluation, data is not loaded until it is necessary (see the sketch below).
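
A minimal Scala sketch of lazy evaluation (this is illustrative, not from the original notes; it assumes a local Spark installation and a hypothetical input file data.txt):

import org.apache.spark.{SparkConf, SparkContext}

object LazyEvalDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LazyEvalDemo").setMaster("local[*]"))

    // Transformations only: nothing is read or computed yet.
    val lines  = sc.textFile("data.txt")            // hypothetical input file
    val errors = lines.filter(_.contains("ERROR"))  // still lazy
    val codes  = errors.map(_.split(" ")(0))        // still lazy

    // The action below triggers execution of the whole recorded chain.
    println("error lines: " + codes.count())

    sc.stop()
  }
}
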
In memory processing

Key Concepts of In-memory Processing in Apache Spark

1. Resilient Distributed Datasets (RDDs):


o RDDs are the fundamental data structures in Spark.
o They support in-memory storage for faster computation and
allow fault tolerance through lineage information.
2. Data Caching and Persistence:
o Spark provides the ability to cache datasets in memory using
the cache() or persist() methods.

o Cached data is reused across multiple operations, avoiding repeated computation or I/O operations.
val rdd = sparkContext.textFile("data.txt")
val cachedRdd = rdd.cache()

3. Distributed Memory:
o Spark divides datasets into partitions and stores them in the
memory of worker nodes across the cluster.
4. Transformation and Action Model:
o Transformations (like map, filter, flatMap) are lazy;
they don't execute until an action (like collect, count,
save) is performed.
o Intermediate results from transformations can be stored in
memory for subsequent operations.
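
A small sketch of explicit caching/persistence (illustrative; it assumes an existing SparkContext sc and a hypothetical file events.txt):

import org.apache.spark.storage.StorageLevel

val events = sc.textFile("events.txt").filter(_.nonEmpty)

// Keep the filtered RDD in memory, spilling to disk if it does not fit.
events.persist(StorageLevel.MEMORY_AND_DISK)

// The first action materializes and caches the partitions...
println(events.count())
// ...later actions reuse the cached partitions instead of re-reading the file.
println(events.map(_.length).reduce((a, b) => math.max(a, b)))

// Release the cached partitions when no longer needed.
events.unpersist()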

DAG

 A DAG (Directed Acyclic Graph) in Apache Spark is a set of vertices and edges, where the vertices represent RDDs and the edges represent the operations to be applied to those RDDs.

 In a Spark DAG, every edge is directed from earlier to later in the sequence. When an action is called, the DAG is submitted to the DAG Scheduler, which splits the graph into stages of tasks.

 A DAG is a finite directed graph with no directed cycles.

 There are finitely many vertices and edges, where each edge is directed from one vertex to another and every edge goes from earlier to later in the sequence. It is a strict generalization of the MapReduce model.

 DAG execution can do better global optimization than systems like MapReduce; the benefit of the DAG becomes clearer in more complex jobs.

 The Spark DAG visualization allows the user to dive into any stage and expand its details; in the stage view, all RDDs belonging to that stage are shown. The scheduler splits the RDD graph into stages based on the transformations applied. Each stage is comprised of tasks, one per partition of the RDD, which perform the same computation in parallel. The graph here refers to navigation, and directed and acyclic refers to how it is traversed (see the lineage sketch below).
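
One simple way to look at the DAG (lineage) Spark records is toDebugString on an RDD; the sketch below is illustrative and assumes an existing SparkContext sc and a hypothetical file logs.txt:

// Build a chain of transformations; no job runs yet.
val counts = sc.textFile("logs.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Prints the chain of RDDs and the stage boundary introduced by the
// shuffle in reduceByKey, without triggering any computation.
println(counts.toDebugString)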

SPARK CONTEXT
 SparkContext is the entry point to the Apache Spark
cluster, and it is used to configure and coordinate the
execution of Spark jobs.
 It is typically created once per Spark application and
is the starting point for interacting with Spark.

You can have only one active SparkContext instance in a single JVM.

Attempting to create multiple SparkContext instances in the same JVM will result in an error.

SPARK SESSION

SparkSession was introduced in Spark 2.x. In Apache Spark, the SparkSession encapsulates and manages the SparkContext internally (see the sketch below).
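
A minimal sketch of creating a SparkSession (Spark 2.x and later) and reaching the SparkContext it manages; the application name and master shown here are illustrative:

import org.apache.spark.sql.SparkSession

// Build (or reuse) the single SparkSession for this application.
val spark = SparkSession.builder()
  .appName("ExampleApp")   // illustrative name
  .master("local[*]")      // illustrative: run locally using all cores
  .getOrCreate()

// The SparkContext is created and managed internally by the session.
val sc = spark.sparkContext
println("Spark version: " + spark.version + ", app id: " + sc.applicationId)

spark.stop()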

RDD

 What is RDD?

 The RDD (Resilient Distributed Dataset) is Spark's core abstraction. It is a collection of elements, partitioned across the nodes of the cluster, so that we can execute various parallel operations on it.

 There are two ways to create RDDs (see the sketch below):
o Parallelizing an existing collection in the driver program
o Referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
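
A short sketch of both ways of creating an RDD (illustrative; it assumes an existing SparkContext sc, and the HDFS path is hypothetical):

// 1) Parallelize an existing collection in the driver program.
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))
println(numbers.sum())    // action: 15.0

// 2) Reference a dataset in external storage (path is hypothetical).
val lines = sc.textFile("hdfs:///data/input.txt")
println(lines.count())    // action: number of lines in the file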

 As transformations (operators) are applied, Spark builds up an operator graph. When the user runs an action (like collect), the graph is submitted to the DAG Scheduler.

 The DAG scheduler divides the operator graph into (map and reduce) stages. A stage is comprised of tasks based on partitions of the input data. The DAG scheduler pipelines operators together to optimize the graph; for example, many map operators can be scheduled into a single stage.

 This optimization is key to Spark's performance. The final result of the DAG scheduler is a set of stages. The stages are passed on to the Task Scheduler, which launches tasks via the cluster manager (Spark Standalone/YARN/Mesos). The task scheduler does not know about dependencies among stages.

RDD Operations

 The RDD provides two types of operations:
o Transformation
o Action

Transformation

In Spark, the role of a transformation is to create a new dataset from an existing one. Transformations are considered lazy, as they are only computed when an action requires a result to be returned to the driver program.

Let's see some of the frequently used RDD Transformations.

 map(func) - Returns a new distributed dataset formed by passing each element of the source through a function func.

 filter(func) - Returns a new dataset formed by selecting those elements of the source on which func returns true.

 flatMap(func) - Similar to map, but each input item can be mapped to zero or more output items, so func should return a sequence rather than a single item.

Action

 In Spark, the role of an action is to return a value to the driver program after running a computation on the dataset. Some of the frequently used RDD actions are listed below (a combined sketch follows this list).

 reduce(func) - Aggregates the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.

 collect() - Returns all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.

 count() - Returns the number of elements in the dataset.

 first() - Returns the first element of the dataset.

 take(n) - Returns an array with the first n elements of the dataset.
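
The following sketch combines the transformations and actions listed above (illustrative; it assumes an existing SparkContext sc):

val words = sc.parallelize(Seq("spark", "hive", "hadoop", "spark", "hdfs"))

// Transformations (lazy)
val upper   = words.map(_.toUpperCase)        // map
val withS   = upper.filter(_.contains("S"))   // filter
val letters = withS.flatMap(_.split(""))      // flatMap: one string -> many 1-char strings

// Actions (trigger execution)
println(letters.count())                      // number of elements
println(withS.collect().mkString(", "))       // all elements brought to the driver
println(upper.first())                        // first element
println(upper.take(2).mkString(", "))         // first two elements
println(words.map(_.length).reduce(_ + _))    // total number of characters via reduce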

DataFrames and Converting RDDs to DataFrames

 A DataFrame is a collection of data organized into named columns, similar to a relational database table.

 It provides a structured format to represent data, facilitating easy manipulation and analysis.

Why Convert RDD to DataFrame or Dataset?


Converting RDD to DataFrame or Dataset allows for
easier manipulation and querying due to the structured
nature of data frames and the type-safety of Datasets. By
converting RDD to DataFrame, data becomes organized
into named columns, promoting optimized storage.
Similarly, converting to a Dataset allows for functional
programming with a type-safe API. Utilizing the
appropriate format, based on the use case, streamlines
data processing and analysis.

Methods to Convert RDD to DataFrame

There are three main approaches:

1. Using a Case Class (Scala-specific)

A case class defines the schema of the data. The RDD is converted to a DataFrame by mapping the data into the case class and invoking the toDF() function.

2. Using Schema Definition (Cross-language):

 You explicitly define a schema using a structured schema


definition (e.g., StructType).
 This approach is flexible and can be used in both Python and
Scala.
 The schema includes field names, data types, and nullability
constraints.

3. Using Implicit Conversions (with toDF):

 This is a simpler and faster method for RDDs containing tuples or Rows.
 You provide column names during conversion.
 This approach works well for datasets with a known structure but without requiring predefined schemas. (All three approaches are illustrated in the sketch below.)
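
The following Scala sketch illustrates the three approaches (written in spark-shell style; it assumes an existing SparkSession spark, and the Person schema and sample data are illustrative):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import spark.implicits._

case class Person(id: Int, name: String)   // illustrative schema

val rdd = spark.sparkContext.parallelize(Seq((1, "Alice"), (2, "Bob")))

// 1) Case class + toDF()
val dfFromCaseClass = rdd.map { case (id, name) => Person(id, name) }.toDF()

// 2) Explicit schema + createDataFrame on an RDD[Row]
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)))
val dfFromSchema = spark.createDataFrame(
  rdd.map { case (id, name) => Row(id, name) }, schema)

// 3) toDF with column names directly on an RDD of tuples
val dfFromTuples = rdd.toDF("id", "name")

dfFromSchema.printSchema()
dfFromTuples.show()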

CATALYST OPTIMIZER

The Catalyst Optimizer is a key component of Apache Spark's SQL engine. It is responsible for optimizing the execution of queries written using SQL, DataFrames, or Datasets. The optimizer uses advanced techniques to improve query performance by transforming the logical and physical execution plans (see the explain() sketch after the feature list).

Key Features of Catalyst Optimizer



1. Rule-Based and Cost-Based Optimization:


o Rule-Based: Applies predefined transformation rules to
simplify and optimize the query.
o Cost-Based: Evaluates multiple execution strategies and
selects the most efficient one based on estimated costs.
2. Logical and Physical Plan Optimization:
o Converts a high-level logical query plan into an optimized
physical plan for execution.
3. Extensibility:
o Designed to be easily extended for custom optimization rules,
making it suitable for evolving use cases.
4. Pluggable:
o Allows developers to add custom rules and integrate with
other systems.

Working with Dates and Timestamps,

Date and Timestamp Types in Spark

1. DateType: Represents dates (e.g., YYYY-MM-DD).


2. TimestampType: Represents timestamps with time and date (e.g.,
YYYY-MM-DD HH:mm:ss).

Common Operations on Dates and Timestamps

1. Creating Date/Timestamp Columns

 From String: Use to_date or to_timestamp to convert strings to date or timestamp types.
 From Existing Data: Apply transformations or extract date and time components.

2. Extracting Components

Spark provides functions to extract specific components like year, month, day, hour, minute, and second:

 year(), month(), dayofmonth()


 hour(), minute(), second()
 date_format() for custom formatting.

3. Date Arithmetic

 Adding/Subtracting Intervals: Use date_add, date_sub, and add_months.
 Date Difference: Use datediff for the difference between two
dates.
 Timestamp Difference: Use unix_timestamp or
timestampdiff.

4. Formatting Dates and Timestamps

 Format or parse dates using date_format and to_date.

5. Filtering/Comparing

 Filter rows based on date or timestamp values using conditional expressions.

Functions for Date and Timestamp in Spark SQL



Function            | Description                                | Example
current_date()      | Returns the current system date.           | current_date() -> 2024-11-17
current_timestamp() | Returns the current system timestamp.      | current_timestamp() -> 2024-11-17 12:00:00
to_date()           | Converts a string or timestamp to a date.  | to_date('2024-11-17 12:00:00') -> 2024-11-17
to_timestamp()      | Converts a string to a timestamp.          | to_timestamp('2024-11-17 12:00:00')
date_format()       | Formats a date/timestamp using a pattern.  | date_format(current_date(), 'dd-MM-yyyy')
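
A combined sketch of the date/timestamp operations described above (illustrative; it assumes an existing SparkSession spark, and the sample data is made up):

import org.apache.spark.sql.functions._
import spark.implicits._

val orders = Seq(("2024-11-17 12:00:00", "2024-11-20")).toDF("created", "due")

val result = orders
  .withColumn("created_ts", to_timestamp($"created"))                 // string -> timestamp
  .withColumn("created_dt", to_date($"created"))                      // string -> date
  .withColumn("due_dt",     to_date($"due"))
  .withColumn("year",       year($"created_dt"))                      // extract a component
  .withColumn("next_week",  date_add($"created_dt", 7))               // date arithmetic
  .withColumn("days_left",  datediff($"due_dt", $"created_dt"))       // difference in days
  .withColumn("pretty",     date_format($"created_dt", "dd-MM-yyyy")) // formatting

result.show(truncate = false)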

Working with Nulls in Data,

Working with null values is a critical part of data processing in Apache Spark, as nulls can significantly impact the results of computations and queries. Spark provides several mechanisms to handle nulls effectively in both DataFrames and SQL queries.

Understanding Nulls in Spark



 Nulls in Spark DataFrames: Represent missing or undefined values in columns.
 Impact on Computations: Nulls propagate in expressions, meaning operations involving nulls typically result in null (e.g., NULL + 1 = NULL).

Common Operations for Null Handling

1. Identifying Nulls

 Use isNull or isNotNull to filter or identify rows with null values.
 SQL: Use IS NULL or IS NOT NULL in conditions.

2. Replacing Nulls

 Replace nulls with default values for specific columns using fillna.
 Replace nulls conditionally using coalesce or case expressions.

3. Removing Nulls

 Drop rows with nulls using dropna. You can specify:


o Thresholds for allowable null values in rows.
o Specific columns to evaluate for nulls.

4. Handling Nulls in Aggregations

 By default, nulls are ignored in aggregations like count, sum, avg.
 To include nulls explicitly, use count(expr) or conditional logic (see the sketch below).
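
A short sketch of these null-handling operations (illustrative; it assumes an existing SparkSession spark, and note that the Scala API exposes fillna/dropna as na.fill and na.drop):

import org.apache.spark.sql.functions._
import spark.implicits._

val people = Seq(
  ("Alice", Some(5000)),
  ("Bob",   None),
  (null.asInstanceOf[String], Some(3000))
).toDF("name", "salary")

// 1. Identify nulls
people.filter($"salary".isNull).show()

// 2. Replace nulls: fixed defaults per column, or coalesce to a fallback value
val filled    = people.na.fill(Map("name" -> "unknown", "salary" -> 0))
val coalesced = people.withColumn("salary_or_zero", coalesce($"salary", lit(0)))

// 3. Remove rows that have a null in the salary column
val dropped = people.na.drop(Seq("salary"))

// 4. Aggregations skip nulls by default: count rows vs. non-null salaries
people.agg(count(lit(1)).as("rows"), count($"salary").as("non_null_salaries")).show()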

LESSON 6

Working with Complex Types
