BigData QB

a) Advantages and Disadvantages of the Apriori Algorithm

The Apriori algorithm is a foundational technique used in data mining for discovering
association rules. It is particularly effective in market basket analysis.
Advantages:
1. Simplicity and Intuitiveness: The algorithm is easy to understand and implement, making it
accessible to users with varying levels of expertise in data analysis.
2. Efficient Candidate Pruning: By applying the principle that all subsets of a frequent itemset
must also be frequent, the algorithm effectively reduces the search space, leading to faster
computations.
3. Generates Comprehensive Rules: The Apriori algorithm can generate a wide range of
association rules from the identified frequent itemsets, providing valuable insights into
relationships between items.
Disadvantages:
1. High Computational Cost: The algorithm can become computationally expensive as the
dataset size increases, due to the exponential growth of candidate itemsets.
2. Sensitivity to Thresholds: The results are highly dependent on the minimum support and
confidence thresholds set by the user, which can lead to either too few or too many rules
being generated.
3. Redundancy Issues: The algorithm may produce redundant or similar rules, complicating the
process of deriving meaningful insights from the results.
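As an illustration, the sketch below mines association rules with the Apriori implementation in R's arules package, using the Groceries transaction dataset that ships with the package; the support and confidence thresholds are arbitrary choices for demonstration.

# Minimal Apriori sketch, assuming the arules package is installed
library(arules)
data("Groceries")                      # example transaction data bundled with arules

# Mine rules above user-chosen minimum support and confidence thresholds
rules <- apriori(Groceries,
                 parameter = list(support = 0.01, confidence = 0.5))

# Show a few of the strongest rules, ranked by lift
inspect(sort(rules, by = "lift")[1:5])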
b) Different Types of Data Analytics
Data analytics can be classified into four main types, each serving distinct purposes:
1. Descriptive Analytics: This type summarizes historical data to answer questions about what
has happened in the past. It often involves reporting and visualization tools that track
performance metrics over time.
2. Diagnostic Analytics: This analysis seeks to understand why certain events occurred by
exploring data relationships and patterns. Techniques such as data mining and correlation
analysis are commonly used.
3. Predictive Analytics: This type uses historical data to forecast future outcomes. By
employing statistical models and machine learning algorithms, predictive analytics can
answer questions like "What is likely to happen next?"
4. Prescriptive Analytics: The most advanced form, prescriptive analytics suggests actions
based on insights derived from previous analyses. It combines data analysis with decision-
making processes to recommend optimal strategies for achieving desired outcomes.
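For a concrete (hypothetical) flavor of the difference between descriptive and predictive analytics, the R sketch below summarizes made-up monthly revenue figures and then fits a simple trend model to forecast the next month.

# Descriptive vs. predictive analytics on made-up monthly revenue figures
sales <- data.frame(month = 1:12,
                    revenue = c(10, 12, 13, 15, 14, 16, 18, 17, 19, 21, 22, 24))

summary(sales$revenue)                          # descriptive: what happened

fit <- lm(revenue ~ month, data = sales)        # predictive: fit a trend model
predict(fit, newdata = data.frame(month = 13))  # forecast month 13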
c) Probability in Detail
Probability is a fundamental concept in statistics that quantifies uncertainty regarding events
or outcomes. It is essential in various fields, including data science, finance, healthcare, and
machine learning.
1. Definition: Probability measures the likelihood that an event will occur, expressed as a
number between 0 (impossible event) and 1 (certain event). For example, flipping a fair coin
has a probability of 0.5 for landing heads or tails.
2. Types of Probability:
● Theoretical Probability: Based on mathematical reasoning (e.g., the probability of rolling
a three on a fair die is 1/6).
● Empirical Probability: Based on observed data (e.g., if 30 out of 100 students pass an
exam, the empirical probability of passing is 30/100 = 0.3).
● Subjective Probability: Based on personal judgment or experience rather than
objective calculation (e.g., estimating the chance of rain based on weather patterns).
3. Applications in Data Science:
● In machine learning, probability helps model uncertainties in predictions.
● In statistical inference, it aids in hypothesis testing and confidence interval estimation.
● In risk assessment, it enables organizations to evaluate potential losses or gains
associated with decisions.
4. Key Concepts:
● Random Variables: Variables whose outcomes are determined by chance.
● Probability Distributions: Functions that describe how probabilities are distributed
across different outcomes (e.g., normal distribution).
● Bayes' Theorem: A fundamental theorem used to update probabilities based on new
evidence.
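The short R sketch below works through the empirical probability example above and a Bayes' theorem update; the test characteristics and prevalence are hypothetical numbers chosen only for illustration.

# Empirical probability: 30 of 100 students pass
p_pass <- 30 / 100                           # 0.3

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a    <- 0.02                               # hypothetical prior P(A), e.g. prevalence
p_b_a  <- 0.95                               # P(B|A), e.g. test sensitivity
p_b_na <- 0.10                               # P(B|not A), e.g. false-positive rate
p_b    <- p_b_a * p_a + p_b_na * (1 - p_a)   # total probability of the evidence B
p_a_b  <- p_b_a * p_a / p_b                  # updated (posterior) probability
p_a_b                                        # roughly 0.16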
d) Advantages and Disadvantages of Machine Learning
Machine learning (ML) has transformed how organizations analyze data and make decisions
through automation and pattern recognition.
Advantages:
1. Automation of Complex Tasks: ML algorithms can automate repetitive tasks such as data
entry or image recognition, significantly increasing efficiency while reducing human error.
2. Ability to Learn from Data: Unlike traditional programming approaches where rules are
explicitly coded, ML models learn from historical data and improve their performance over
time as they are exposed to more information.
3. Enhanced Predictive Capabilities: ML excels at identifying patterns within large datasets that
may not be apparent through manual analysis, allowing businesses to make informed
predictions about future trends or behaviors.
Disadvantages:
1. Data Quality Dependency: The effectiveness of ML models heavily relies on the quality and
quantity of training data; poor quality data can lead to inaccurate predictions or biased
outcomes.
2. Complexity in Model Development: Designing and implementing effective ML models often
requires specialized knowledge in statistics, programming languages (like Python or R), and
domain expertise.
3. Interpretability Challenges: Many ML algorithms operate as "black boxes," meaning their
internal workings are not easily interpretable by humans, which can hinder trust in automated
decisions made by these models.
e) Process of Data Analysis
The process of data analysis involves several systematic steps that guide analysts from raw
data collection to actionable insights:
1. Define Objectives: Clearly articulate the goals and questions that the analysis aims to address
to ensure alignment with desired outcomes.
2. Data Collection: Gather relevant data from various sources such as databases, surveys, or
web scraping tools; ensuring relevance is crucial for meaningful analysis.
3. Data Cleaning: Prepare the dataset by identifying and rectifying errors such as missing values
or duplicates; this enhances the reliability of results derived from analysis.
4. Exploratory Data Analysis (EDA): Conduct initial investigations using statistical methods
and visualizations to uncover patterns, trends, or anomalies within the dataset; EDA helps
inform further analysis steps by revealing insights about data distributions.
5. Model Selection/Development: Choose appropriate analytical methods or models based on
objectives defined earlier; this may involve selecting between regression models,
classification algorithms, or clustering techniques depending on the nature of the problem
being solved.
6. Analysis Execution: Apply selected models or techniques to analyze the cleaned dataset
thoroughly; this often involves coding in programming languages like R or Python using
libraries tailored for specific analyses (e.g., scikit-learn for machine learning).
7. Interpret Results: Draw conclusions from the analysis by interpreting findings in context with
initial objectives; this may involve comparing results against benchmarks or industry
standards to assess significance.
8. Communicate Findings: Present results through reports or visualizations tailored for
stakeholders who may not have technical backgrounds; effective communication ensures that
insights lead to informed decision-making within organizations.
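A condensed sketch of several of these steps on R's built-in airquality dataset (cleaning, EDA, modeling, interpretation); the choice of dataset and model is illustrative only.

# Data cleaning: drop rows with missing values
data("airquality")
aq <- na.omit(airquality)

# Exploratory data analysis: summary statistics and a simple scatter plot
summary(aq)
plot(aq$Temp, aq$Ozone, xlab = "Temperature (F)", ylab = "Ozone (ppb)")

# Model selection and execution: linear regression of ozone on temperature
model <- lm(Ozone ~ Temp, data = aq)

# Interpreting results: coefficients, p-values, and fit statistics
summary(model)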
f) Probability Distribution Modeling
Probability distribution modeling involves defining how probabilities are allocated across
different possible outcomes for random variables within a dataset:
1. Normal Distribution: Characterized by its bell-shaped curve where most observations cluster
around the mean; it is defined by two parameters: mean (μ) and standard deviation (σ).
Many natural phenomena follow this distribution (e.g., heights).
2. Binomial Distribution: Represents scenarios with two possible outcomes (success/failure)
across a fixed number of trials (n). It is defined by two parameters: number of trials (n) and
probability of success (p). For example, flipping a coin multiple times falls under this
distribution framework.
3. Poisson Distribution: Models count-based events occurring within a fixed interval when
these events happen independently at a constant rate (λ). An example includes counting the
number of emails received per hour at a help desk.
4. Exponential Distribution: Often used to model time until an event occurs (e.g., time until
failure for mechanical devices). It is characterized by its rate parameter (λ), which defines
how quickly events occur over time.
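The one-line calls below evaluate each of these distributions with R's built-in density and cumulative-probability functions; the parameter values are illustrative.

dnorm(0, mean = 0, sd = 1)         # normal density at the mean
dbinom(3, size = 10, prob = 0.5)   # P(exactly 3 successes in 10 trials)
dpois(2, lambda = 4)               # P(2 events in an interval with rate 4)
pexp(1, rate = 0.5)                # P(waiting time <= 1 at rate 0.5)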
g) Four Data Types in R
In R programming language, understanding different data types is essential for effective data
manipulation and analysis:
1. Numeric Type: Represents numbers including both integers and decimals (e.g., 5, 3.14).
Numeric types are used extensively in calculations across various statistical analyses.
2. Character Type: Consists of text strings enclosed in quotes (e.g., "Hello", "Data Science").
Character types are commonly used for categorical variables or labels within datasets.
3. Logical Type: Contains boolean values indicating TRUE/FALSE conditions
(e.g., TRUE, FALSE). Logical types are useful for filtering datasets based on specific criteria
during analyses.
4. Factor Type: Used for categorical variables with fixed unique values; factors enhance
statistical modeling by treating categorical variables appropriately during analyses (e.g.,
gender categories like "Male" or "Female").
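A few lines of R illustrating the four types, with class() confirming how each value is stored:

x_num <- 3.14                                    # numeric
x_chr <- "Data Science"                          # character
x_lgl <- TRUE                                    # logical
x_fct <- factor(c("Male", "Female", "Female"))   # factor with fixed levels

class(x_num); class(x_chr); class(x_lgl); class(x_fct)
levels(x_fct)                                    # the categories stored by the factor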
h) Applications of Big Data
Big Data technologies have transformed various industries by enabling organizations to
harness vast amounts of information for strategic advantage:
1. Healthcare Analytics: Big Data allows healthcare providers to analyze patient records
comprehensively, improving patient care through predictive analytics that identify potential
health risks before they escalate into serious issues.
2. Retail Optimization: Retailers utilize Big Data analytics to understand customer behavior
better by analyzing purchasing patterns across different demographics; this insight helps
tailor marketing strategies effectively while optimizing inventory management processes.
3. Financial Services Risk Management: Financial institutions leverage Big Data analytics for
fraud detection through real-time transaction monitoring systems that identify suspicious
activities based on historical transaction behaviors.
4. Smart Cities Development: Urban planners use Big Data collected from sensors embedded
throughout cities to enhance resource management—optimizing traffic flow patterns while
improving public services like waste management based on real-time needs assessments.
i) Correlation and Its Types
Correlation measures the strength and direction of the relationship between two variables;
understanding the types of correlation helps analysts interpret relationships effectively (a short
example follows this list):
1. Positive Correlation: Occurs when both variables increase together; for instance, as
temperature rises during the summer months, ice cream sales typically increase, indicating a
positive relationship between temperature and sales figures.
2. Negative Correlation: Exists when one variable increases while the other decreases; an
example would be an increase in unemployment rates correlating with lower consumer
spending, demonstrating a negative relationship between these economic indicators.
3. Zero Correlation: Indicates that no relationship exists between two variables; for example,
there may be no correlation between shoe size and intelligence scores, suggesting the two
attributes are independent, with no discernible pattern linking them statistically.
4. Perfect Correlation (Positive/Negative): Signifies an exact linear relationship in which changes
in one variable perfectly predict changes in the other; this rarely occurs outside controlled
experimental settings but serves as an idealized reference point in statistical analyses.
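A short example with made-up values, using R's cor() to compute the Pearson correlation coefficient for a positive and a negative relationship:

# Positive correlation: warmer days, more ice cream sold (made-up data)
temperature <- c(20, 22, 25, 27, 30, 33)
ice_cream   <- c(110, 125, 140, 160, 180, 200)
cor(temperature, ice_cream)     # close to +1

# Negative correlation: higher unemployment, lower spending (made-up data)
unemployment <- c(4, 5, 6, 7, 8, 9)
spending     <- c(95, 90, 84, 80, 73, 70)
cor(unemployment, spending)     # close to -1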
j) Statistical Inference with Suitable Diagram
Statistical inference involves drawing conclusions about populations based on sample data
collected from those populations; it encompasses techniques such as hypothesis testing and
confidence intervals (a short sketch follows this list):
1. Hypothesis Testing: A method used to determine whether there is enough evidence in sample
data to support a particular hypothesis about population parameters; the process typically
involves formulating a null hypothesis (H0) versus an alternative hypothesis (Ha), then
comparing a calculated test statistic against critical values derived from sampling distributions.
2. Confidence Intervals (CIs): Provide ranges within which population parameters are likely to
fall based on sample statistics; CIs convey estimation accuracy while accounting for sampling
variability, and are commonly expressed at levels such as 95%, indicating a high degree of
confidence in the parameter estimates derived from the sampled observations.
3. Diagram: A common diagram used in statistical inference is the sampling distribution curve,
which illustrates how sample means distribute around the population mean as sample size
increases; larger samples yield tighter distributions around the true population parameter
because they carry less sampling variability than smaller samples.
4. Overall, significance testing frameworks provide rigorous methodologies that enable
researchers to validate claims about population characteristics while minimizing the risk of
erroneous conclusions drawn from observation alone, without robust inferential backing.
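A minimal sketch of hypothesis testing and a confidence interval in R, on a simulated sample (the population parameters are invented for the example):

set.seed(42)
sample_data <- rnorm(50, mean = 102, sd = 10)   # simulated sample

# H0: population mean = 100  vs.  Ha: population mean != 100
test <- t.test(sample_data, mu = 100)
test$p.value     # strength of evidence against H0
test$conf.int    # 95% confidence interval for the population mean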
k) Explanation of Machine Learning
Machine learning refers to a subset of artificial intelligence that focuses on developing
algorithms capable of learning from data and making predictions without explicit
programming instructions (a short sketch follows this list):
1. Definition and Purpose: Machine learning enables computer systems to identify patterns
within datasets, allowing them to adaptively improve their performance over time through
experience rather than relying solely on pre-defined rules.
2. Types:
● Supervised Learning: Involves training algorithms on labeled datasets where input-output
pairs exist; common examples include regression and classification tasks.
● Unsupervised Learning: Deals with unlabeled datasets and aims to uncover hidden structures;
clustering techniques work well here.
● Reinforcement Learning: Focuses on optimizing the actions taken by agents interacting with
an environment, with feedback in the form of rewards and penalties guiding subsequent
decisions.
3. Applications:
● Image recognition systems use convolutional neural networks to classify objects in images.
● Natural language processing applications use recurrent neural networks to understand
human language in context.
4. Overall, machine learning is a powerful toolset that enables organizations to harness vast
amounts of information effectively and drive innovation across sectors ranging from healthcare
and finance to entertainment.
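As a brief sketch of the supervised/unsupervised distinction, the R code below fits a regression on the built-in iris dataset and then clusters the same data without using the labels:

# Supervised learning: regression with known input-output pairs
fit <- lm(Petal.Length ~ Sepal.Length, data = iris)
predict(fit, newdata = data.frame(Sepal.Length = 5.5))

# Unsupervised learning: k-means clustering on the numeric columns only
set.seed(1)
clusters <- kmeans(iris[, 1:4], centers = 3)
table(clusters$cluster, iris$Species)   # compare found clusters to the true labels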
l) Difference Between Structured and Unstructured Data
Data can be categorized into two primary types, structured and unstructured (an illustrative
sketch follows this list):
1. Structured Data:
● Well-organized format, typically stored in databases or spreadsheets.
● Easily searchable via query languages like SQL.
● Examples include customer records, transaction logs, and sensor readings.
2. Unstructured Data:
● Lacks a predefined structure, making it more challenging to analyze.
● Comprises various formats such as text documents, images, videos, and social media posts.
● Examples include emails, social media content, and multimedia files.
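An illustrative contrast in R: structured data sits in a fixed schema and supports query-style filtering, while unstructured text exposes no fields to query without further parsing (the records and email text are made up):

# Structured: fixed columns, easy to filter like a database table
customers <- data.frame(id = 1:3,
                        name = c("Asha", "Ben", "Chen"),
                        amount = c(250, 120, 340))
subset(customers, amount > 200)

# Unstructured: free text with no predefined fields
email <- "Hi team, the Q3 report is attached. Let's review it on Friday."
nchar(email)   # only low-level operations are possible without extra parsing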
m) 5V’s of Big Data
The "5V's" describe the key characteristics that define big data:
1. Volume: Refers to the sheer amount of information generated daily; organizations must
manage vast quantities of diverse datasets effectively.
2. Velocity: Describes the speed at which new data is generated and processed; real-time
analytics is critical for many applications today.
3. Variety: Encompasses the different formats and types of information collected, ranging from
structured to unstructured sources; this requires flexible storage and processing solutions that
can accommodate the diversity.
4. Veracity: Pertains to the quality and reliability of incoming datasets; ensuring that accurate
and trustworthy insights are derived is crucial for decision-making processes.
5. Value: Represents the ultimate goal of extracting meaningful insights and actionable
intelligence from the vast pools of information available.
