
1] Explain six stages/phases of Data Analytics life cycle.

Data Analytics Lifecycle :


The Data Analytics Lifecycle is designed for Big Data problems and data science projects. The cycle is iterative, reflecting how real projects actually run. To address the distinct requirements of performing analysis on Big Data, a step-by-step methodology is needed to organize the activities and tasks involved in acquiring, processing, analyzing, and repurposing data.

• Phase 1: Discovery –

• The data science team learns and investigates the problem.

• Develop context and understanding.

• The team learns which data sources are needed and available for the project.

• The team formulates the initial hypothesis that can be later tested with data.

• Phase 2: Data Preparation -

• Steps to explore, preprocess, and condition data before modeling and analysis.

• It requires the presence of an analytic sandbox; the team performs extract, load, and transform (ELT) operations to get data into the sandbox.

• Data preparation tasks are likely to be performed multiple times and not in a predefined order.

• Several tools commonly used for this phase are - Hadoop, Alpine Miner, OpenRefine, etc.

• Phase 3: Model Planning -

• The team explores data to learn about relationships between variables and subsequently, selects key variables and the most suitable models.

• In this phase, the data science team develops data sets for training, testing, and production purposes.

• Team builds and executes models based on the work done in the model planning phase.

• Several tools commonly used for this phase are - MATLAB and STATISTICA.

• Phase 4: Model Building -

• Team develops datasets for testing, training, and production purposes.


• The team also considers whether its existing tools will suffice for running the models or whether it needs a more robust environment for executing the models.

• Free or open-source tools - R and PL/R, Octave, WEKA.

• Commercial tools - MATLAB and STATISTICA.

• Phase 5: Communication Results -

• After executing the model, the team needs to compare the outcomes of modeling to the criteria established for success and failure.

• The team considers how best to articulate findings and outcomes to various team members and stakeholders, taking into account caveats and assumptions.

• The team should identify key findings, quantify the business value, and develop a narrative to summarize and convey findings to stakeholders.

• Phase 6: Operationalize -

• The team communicates the benefits of the project more broadly and sets up a pilot project to deploy the work in a controlled way before broadening it to a full enterprise of users.

• This approach enables the team to learn about the performance and related constraints of the model in a production environment on a small scale and make adjustments before full deployment.

• The team delivers final reports, briefings, and code.

• Free or open source tools - Octave, WEKA, SQL, MADlib.


2]Describe and illustrate the typical analytical architecture with a neat block diagram.

Analytics architecture refers to the overall design and structure of an analytical system or environment, which includes the hardware, software, data,
and processes used to collect, store, analyze, and visualize data. It encompasses various technologies, tools, and processes that support the end-to-
end analytics workflow.

Key components of Analytics Architecture-

Analytics architecture refers to the infrastructure and systems that are used to support the collection, storage, and analysis of data. There are several
key components that are typically included in an analytics architecture:

1. Data collection: This refers to the process of gathering data from various sources, such as sensors, devices, social media, websites, and more.

2. Transformation: Once the data is collected, it should be cleaned and transformed before it is stored.

3. Data storage: This refers to the systems and technologies used to store and manage data, such as databases, data lakes, and data warehouses.

4. Analytics: This refers to the tools and techniques used to analyze and interpret data, such as statistical analysis, machine learning, and
visualization.

Data Sources
All big data solutions start with one or more data sources. The Big Data Architecture accommodates various data sources and efficiently manages a wide
range of data types. Some common data sources in big data architecture include transactional databases, logs, machine-generated data, social media and
web data, streaming data, external data sources, cloud-based data, NOSQL databases, data warehouses, file systems, APIs, and web services.

These are only a few instances; in reality, the data environment is broad and constantly changing, with new sources and technologies developing over
time. The primary challenge in big data architecture is successfully integrating, processing, and analyzing data from various sources in order to gain
relevant insights and drive decision-making.

Data Storage

Data storage is the system for storing and managing large amounts of data in big data architecture. Big data includes handling large amounts of
structured, semi-structured, and unstructured data; traditional relational databases often prove inadequate due to scalability and performance
limitations.

Distributed file stores, capable of storing large volumes of files in various formats, typically store data for batch processing operations. People often refer
to this type of store as a data lake. You can use Azure Data Lake Storage or blob containers in Azure Storage for this purpose.

Big data architecture is specifically designed to manage the ingestion, processing, and analysis of data that is too large or complex for conventional relational databases to store, process, and manage. The solution is to organize the technology into a big data architecture that can manage and process such data.

3] What is big data analysis? Explain characteristics of big data.

Big data analysis is the process of examining very large and varied data sets to uncover hidden patterns, correlations, and other useful insights. Big data is commonly described by the following characteristics (the 'V's):

1. Volume:

• The name 'Big Data' itself refers to a size that is enormous.

• Volume refers to the sheer amount of data.

• To determine the value of data, the size of the data plays a very crucial role. If the volume of data is very large, it is actually considered 'Big Data'. Whether particular data can be considered Big Data or not therefore depends on its volume.

• Hence, while dealing with Big Data, it is necessary to consider the characteristic 'Volume'.

• Example: In the year 2016, the estimated global mobile traffic was 6.2 exabytes (6.2 billion GB) per month, and by 2020 the world was expected to hold almost 40,000 exabytes of data.
2. Velocity:

• Velocity refers to the high speed of accumulation of data.

• In Big Data, data flows in at high velocity from sources like machines, networks, social media, mobile phones, etc.

• There is a massive and continuous flow of data. Velocity determines the potential of the data, i.e., how fast the data is generated and processed to meet demands.

• Sampling data can help in dealing with issues arising from velocity.

• Example: More than 3.5 billion searches are made on Google per day, and Facebook users are increasing by approximately 22% year over year.

3. Variety:

• It refers to the nature of data: structured, semi-structured, and unstructured.

• It also refers to heterogeneous sources.

• Variety is basically the arrival of data from new sources that are both inside and outside of an enterprise. It can be structured, semi-structured
and unstructured.

o Structured data: This is organized data with a defined length and format.

o Semi-structured data: This is partially organized data that does not conform to a formal data structure. Log files are an example of this type of data.

o Unstructured data: This refers to unorganized data that doesn't fit neatly into the traditional row-and-column structure of a relational database. Text, pictures, videos, etc. are examples of unstructured data, which can't be stored in rows and columns.

4. Veracity:

• It refers to inconsistency and uncertainty in data; the data that is available can sometimes be messy, and its quality and accuracy are difficult to control.

• Big Data is also variable because of the multitude of data dimensions resulting from multiple disparate data types and sources.

• Example: Data in bulk can create confusion, whereas too little data may convey only half or incomplete information.

5. Value:

• After taking the four V's into account, there comes one more V, which stands for Value. Bulk data with no value is of no good to a company unless it is turned into something useful.

• Data in itself is of no use or importance; it needs to be converted into something valuable in order to extract information. Hence, Value can be considered the most important of the six V's.

6. Variability:

• Variability refers to how fast and to what extent the structure and meaning of your data change.

• How often does the meaning or shape of your data change?

• Example: it is as if you eat the same ice cream daily, but the taste keeps changing.

4] Which are the key roles for the new big data ecosystem? Explain in brief.

The key roles in the big data ecosystem involve various professionals who contribute to managing,
analyzing, and utilizing large datasets. These roles include data scientists, data engineers, data
analysts, business analysts, and statisticians. These roles are essential for transforming raw data into
actionable insights, driving business decisions, and gaining a competitive advantage.

• Data Scientists:

Design and build machine learning models and algorithms to analyze large datasets, extract insights,
and predict future outcomes.

• Data Engineers:

Focus on building and maintaining the infrastructure for storing, processing, and delivering big data,
ensuring data quality and accessibility.

• Data Analysts:

Analyze data using various tools and techniques to identify trends, patterns, and insights that can be
used to inform business decisions.

• Business Analysts:

Translate business needs into data requirements, identify areas for improvement, and develop data-driven solutions to address business challenges.

• Statisticians:

Play a crucial role in data ethics, validating the accuracy and reliability of data, and ensuring that
data-driven decisions are based on sound statistical principles.

• Other Key Roles:

MIS Reporting Executives, Java developers, Oracle DBAs, and Teradata Business Analysts also play significant roles in the big data ecosystem.

Unit2

1] Explain Apriori Algorithm. How rules are generated and visualized in Apriori Algorithm.

Apriori Algorithm is a foundational method in data mining used for discovering frequent itemsets
and generating association rules. Its significance lies in its ability to identify relationships
between items in large datasets which is particularly valuable in market basket analysis.

For example, if a grocery store finds that customers who buy bread often also buy butter, it can use
this information to optimise product placement or marketing strategies.

How the Apriori Algorithm Works?

The Apriori Algorithm operates through a systematic process that involves several key steps:

1. Identifying Frequent Itemsets: The algorithm begins by scanning the dataset to identify individual items (1-itemsets) and their frequencies. It then establishes a minimum support threshold, which determines whether an itemset is considered frequent.

2. Creating Candidate Itemsets: Once frequent 1-itemsets (single items) are identified, the algorithm generates candidate 2-itemsets by combining frequent items. This process continues iteratively, forming larger candidate itemsets (k-itemsets) until no more frequent itemsets can be found.

3. Removing Infrequent Itemsets: The algorithm employs a pruning technique based on the Apriori Property, which states that if an itemset is infrequent, all its supersets must also be infrequent. This significantly reduces the number of combinations that need to be evaluated.

4. Generating Association Rules: After identifying frequent itemsets, the algorithm generates association rules that illustrate how items relate to one another, using metrics like support, confidence, and lift to evaluate the strength of these relationships. These rules are typically visualized as a scatter plot of support versus confidence, or as a network graph linking antecedent and consequent items.
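
As an illustration of these steps, here is a minimal sketch using the mlxtend library (an assumption; any Apriori implementation with support and confidence thresholds would work). The transactions are invented.

```python
# Apriori sketch with mlxtend; the grocery transactions are made up.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "eggs"],
    ["bread", "butter", "eggs"],
    ["butter", "milk"],
]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Steps 1-3: find frequent itemsets above a minimum support threshold
frequent = apriori(onehot, min_support=0.4, use_colnames=True)

# Step 4: generate association rules and score them by confidence and lift
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```

The resulting rules table can then be visualized, for example as a scatter plot of support versus confidence with matplotlib, to spot the strongest rules at a glance.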

2] What is clustering? How k-means algorithm is used for clustering. Give stepwise explanation.

Clustering is a process of grouping similar data points together, forming distinct clusters. K-means is a
specific algorithm used for clustering that iteratively assigns data points to clusters based on their
proximity to cluster centroids. Here's a stepwise explanation of how it works:

1. Define the number of clusters (k):

• The first step is to decide how many clusters you want to create. This is represented by
'k'. A higher 'k' means more clusters, while a lower 'k' means fewer clusters.

2. Randomly initialize cluster centroids:

• Choose 'k' data points randomly from your dataset to act as initial centroids (centers of the clusters).

• These centroids will be updated iteratively.

3. Assign data points to the nearest cluster:

• For each data point, calculate its distance to each of the 'k' centroids.
• Assign each data point to the cluster whose centroid is the closest (based on the distance
metric, usually Euclidean distance).

• This creates initial clusters.

4. Recompute cluster centroids:

• For each cluster, calculate the new centroid by finding the mean (average) of all the data
points in that cluster.

• This updates the positions of the cluster centers.

5. Repeat steps 3 and 4 iteratively:

• Reassign data points to the nearest centroids (Step 3).

• Recompute cluster centroids (Step 4).

• Continue this process until the cluster assignments or centroid positions no longer
change significantly, or until a maximum number of iterations is reached.

6. Final clustering:

• Once the algorithm converges (no more changes), you have your final clusters.

• Each data point belongs to one of the 'k' clusters based on its proximity to its respective
centroid
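
A minimal sketch of these steps using scikit-learn's KMeans is shown below; the synthetic two-dimensional data is invented purely for illustration.

```python
# K-means sketch: scikit-learn handles initialization, assignment,
# centroid updates, and iteration until convergence internally.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Three hypothetical groups of 2-D points
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

# Step 1: choose k; Steps 2-6 run inside fit_predict until the
# assignments stop changing or max_iter is reached.
kmeans = KMeans(n_clusters=3, n_init=10, max_iter=300, random_state=0)
labels = kmeans.fit_predict(X)

print("Cluster centroids:\n", kmeans.cluster_centers_)
print("First ten labels:", labels[:10])
```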

3] What are the different statistical methods for evolution in Data Analytics?

Statistics is a mathematical discipline that deals with the collection and analysis of data. Its steps include data collection, analysis of data, interpretation, and the organization or summarization of data. Statistics is a form of applied mathematics that produces a set of findings from the obtained data.

This mathematical analysis makes the dataset applicable to real life. Statistics is widely used in fields such as psychology, geology, and weather forecasting. The data is collected in either quantitative or qualitative form.

Statistical methods play a crucial role in understanding the evolution of data in various fields. Some
of these methods include time series analysis, regression analysis, cluster analysis, and hypothesis
testing. These techniques help in identifying patterns, predicting future outcomes, and
understanding the relationships between variables over time.

Here's a more detailed look at some key statistical methods for analyzing evolving data:

1. Time Series Analysis:

• What it is: Analyzing data points collected over time to identify trends, patterns, and cycles.

• Examples: Forecasting sales, weather patterns, stock market movements, or population growth.

• Techniques: ARIMA models, moving averages, exponential smoothing.

2. Regression Analysis:

• What it is: Examining the relationship between a dependent variable and one or more
independent variables.

• Examples: Predicting housing prices based on location, size, and amenities; or understanding the impact of advertising on sales.

• Techniques: Linear regression, multiple regression, logistic regression.

3. Cluster Analysis:

• What it is: Grouping data points into clusters based on their similarity or dissimilarity.

• Examples: Customer segmentation, identifying disease clusters, or grouping genes with similar expression patterns.

• Techniques: K-means clustering, hierarchical clustering.

4. Hypothesis Testing:

• What it is: Testing a hypothesis about a population based on a sample of data.

• Examples: Determining if a new drug is effective, or if there's a significant difference between two groups.

• Techniques: T-tests, Chi-square tests, ANOVA.

5. Advanced Techniques:

• Machine Learning: Using algorithms to learn from data and make predictions or
decisions.

• Data Mining: Discovering patterns and insights in large datasets.

• Big Data Analytics: Analyzing and processing vast amounts of data using specialized tools
and techniques

4] Differentiate between Student's t-test and Welch's t-test.

Student's t-test assumes that the variances of the two populations being compared are equal, while
Welch's t-test relaxes this assumption and is suitable for situations where the variances are unequal.

Here's a more detailed comparison:

Student's t-test (Independent Samples t-test or Two-Sample t-test)

• Assumption: Equal population variances (homogeneity of variance).

• When to use: When you suspect the populations being compared have similar variances, or
when you're unsure and want a conservative test.

• Degrees of freedom: Calculated using the pooled variance estimate, which assumes equal variances.

• Formula: Uses a pooled standard deviation to calculate the t-statistic.

Welch's t-test

• Assumption: Unequal population variances (heterogeneity of variance).

• When to use: When you have reason to believe the populations have different variances, or when you can't assume equal variances.

• Degrees of freedom: Adjusted to account for the unequal variances, often resulting in a non-integer value.

• Formula: Uses separate standard deviation estimates for each group to calculate the t-statistic.
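
A minimal sketch with scipy.stats is shown below: the same function performs Student's t-test with equal_var=True and Welch's t-test with equal_var=False. The two samples are invented.

```python
# Comparing the two tests on the same (synthetic) samples.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=50, scale=5, size=30)   # smaller variance
group_b = rng.normal(loc=53, scale=12, size=40)  # larger variance

# Student's t-test: assumes equal population variances (pooled estimate)
t_student, p_student = stats.ttest_ind(group_a, group_b, equal_var=True)

# Welch's t-test: separate variance estimates, adjusted degrees of freedom
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"Student's t = {t_student:.3f}, p = {p_student:.4f}")
print(f"Welch's   t = {t_welch:.3f}, p = {p_welch:.4f}")
```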

Unit3

1] What is Regression? *Explain Linear Regression with example. OR Explain any one type of
Regression in detail.

Regression analysis in big data analytics (BDA) is a statistical method used to model the
relationship between a dependent variable and one or more independent variables. Linear
regression, a common type of regression, specifically examines the linear relationship between these
variables, aiming to predict the dependent variable's value based on the independent variable(s).

Linear Regression Explained:

Linear regression assumes a linear relationship between the variables, meaning the dependent
variable changes at a constant rate as the independent variable changes. This relationship is
represented by a straight line. The goal is to find the "best fit" line that minimizes the difference
between the observed data points and the predicted values on the line.

Example:

Imagine you're a social researcher interested in the relationship between income and happiness. You
survey 500 people, collect their income data (ranging from $15k to $75k), and ask them to rank their
happiness on a scale from 1 to 10.

• Dependent Variable: Happiness (the variable you're trying to predict).

• Independent Variable: Income (the variable you're using to predict happiness).

You can use linear regression to see if there's a linear relationship between income and
happiness. The equation would take the form: Happiness = a + b * Income.

• a represents the y-intercept (the predicted happiness when income is zero).

• b represents the slope (how much happiness changes for every $1 increase in income).

By analyzing the data and fitting the regression line, you can estimate how much happiness changes
as income changes. If you find a positive linear relationship, it would suggest that as income
increases, happiness tends to increase as well.
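
To make the example concrete, here is a minimal sketch in Python using scikit-learn; the synthetic income and happiness values are invented for illustration and are not from an actual survey.

```python
# Fitting Happiness = a + b * Income on invented survey data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
income = rng.uniform(15_000, 75_000, size=500).reshape(-1, 1)  # predictor
# Assume a weak positive trend plus noise (purely hypothetical)
happiness = 2.0 + 8e-5 * income.ravel() + rng.normal(0, 1, size=500)

model = LinearRegression().fit(income, happiness)
print("Intercept a:", model.intercept_)   # predicted happiness at zero income
print("Slope b:", model.coef_[0])         # change in happiness per $1 of income
print("Prediction at $40k:", model.predict([[40_000]])[0])
```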

In BDA, this type of analysis can be used for:

• Predicting sales based on advertising spend:

Linear regression can model the relationship between advertising budget and sales revenue, helping
businesses forecast future sales.

• Predicting house prices based on size:

It can be used to determine how house size impacts price, aiding in property valuation.

• Understanding customer behavior:

Linear regression can analyze the relationship between various factors (e.g., demographics, purchase
history) and customer behavior, enabling businesses to target marketing efforts

2]What is classification? What are the two fundamental methods of classification.

Classification methods are machine learning algorithms that enable the prediction of a discrete
outcome variable based on the value of one or multiple predictor variables.

Data Mining: Data mining, in general terms, means mining or digging deep into data in its different forms to find patterns and to gain knowledge from those patterns. In the process of data mining, large data sets are first sorted, then patterns are identified and relationships are established to perform data analysis and solve problems.

Classification is a task in data mining that involves assigning a class label to each instance in a dataset
based on its features. The goal of classification is to build a model that accurately predicts the class
labels of new instances based on their features.

There are two main types of classification: binary classification and multi-class classification. Binary
classification involves classifying instances into two classes, such as "spam" or "not spam", while
multi-class classification involves classifying instances into more than two classes.

The process of building a classification model typically involves the following steps:

Data Collection:
The first step in building a classification model is data collection. In this step, the data relevant to the
problem at hand is collected. The data should be representative of the problem and should contain
all the necessary attributes and labels needed for classification. The data can be collected from
various sources, such as surveys, questionnaires, websites, and databases.

Data Preprocessing:
The second step in building a classification model is data preprocessing. The collected data needs to
be preprocessed to ensure its quality. This involves handling missing values, dealing with outliers,
and transforming the data into a format suitable for analysis. Data preprocessing also involves
converting the data into numerical form, as most classification algorithms require numerical input.

Handling Missing Values: Missing values in the dataset can be handled by replacing them with the
mean, median, or mode of the corresponding feature or by removing the entire record.

Dealing with Outliers: Outliers in the dataset can be detected using various statistical techniques
such as z-score analysis, boxplots, and scatterplots. Outliers can be removed from the dataset or
replaced with the mean, median, or mode of the corresponding feature.

Data Transformation: Data transformation involves scaling or normalizing the data to bring it into
a common scale. This is done to ensure that all features have the same level of importance in the
analysis.
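
To tie the steps together, here is a minimal binary-classification sketch with scikit-learn; the bundled breast-cancer dataset simply stands in for data you would collect yourself.

```python
# Collection -> preprocessing (scaling) -> model building -> evaluation.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Data collection: a bundled two-class dataset stands in for collected data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# Data preprocessing: scale features to a common range
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Model building and evaluation
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```
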
Unit4

1] Why use autocorrelation instead of auto covariance when examining stationary time series.

Autocorrelation measures the degree of similarity between a given time series and a lagged version of that time series over successive time periods. It is similar to calculating the correlation between two different variables, except that in autocorrelation we calculate the correlation between two versions, Xt and Xt-k, of the same time series.

Autocorrelation is a fundamental concept in time series analysis: it is a statistical measure that assesses the degree of correlation between the values of a variable at different points in time.

When analyzing stationary time series in big data analytics (BDA), autocorrelation is preferred
over autocovariance because it's standardized, making comparisons across different time series
easier. Autocorrelation, calculated by dividing autocovariance by the variance, ranges from -1 to 1,
providing a standardized and readily interpretable measure of linear dependence. Autocovariance,
on the other hand, is not standardized and its magnitude depends on the scale of the data, making it
less suitable for comparing different time series.

Here's a more detailed explanation:

• Standardization:

Autocorrelation standardizes the measure of dependence, allowing for direct comparison of linear
relationships between time series, regardless of their individual variances.

• Interpretability:

Autocorrelation values range from -1 to 1, clearly indicating the strength and direction of the linear
relationship, making it easier to interpret than raw autocovariance values.

• Comparability:

Different time series may have different variances. Autocorrelation provides a standardized measure
that allows for easy comparisons across these series, while autocovariance does not.

In essence, autocorrelation's standardization and ease of interpretation make it the preferred tool for
examining the linear relationships within stationary time series when comparing and analyzing
different time series in BDA
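
A minimal sketch of the relationship (autocorrelation is the autocovariance divided by the variance) is shown below; the series is synthetic, and statsmodels' acf is used only as a cross-check.

```python
# Manual autocovariance/autocorrelation at lag 1 vs. statsmodels' acf.
import numpy as np
from statsmodels.tsa.stattools import acf

rng = np.random.default_rng(7)
x = np.diff(rng.normal(size=200).cumsum())   # differenced random walk: stationary

def autocovariance(series, lag):
    """Sample autocovariance at a given lag (denominator n)."""
    n, mean = len(series), series.mean()
    return np.sum((series[:n - lag] - mean) * (series[lag:] - mean)) / n

gamma0 = autocovariance(x, 0)                 # this is the variance
rho1_manual = autocovariance(x, 1) / gamma0
rho1_statsmodels = acf(x, nlags=1, fft=False)[1]

print("lag-1 autocovariance:", autocovariance(x, 1))   # scale-dependent
print("lag-1 autocorrelation:", rho1_manual)           # always in [-1, 1]
print("statsmodels acf lag 1:", rho1_statsmodels)
```
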
2] What methods can be used for sentiment analysis?

Sentiment analysis is a popular task in natural language processing. The goal of sentiment analysis is to classify the text based on the mood or mentality expressed in it, which can be positive, negative, or neutral.

What is Sentiment Analysis?

Sentiment analysis is the process of classifying whether a block of text is positive, negative, or neutral. The goal of sentiment mining is to analyze people's opinions in a way that can help businesses grow. It focuses not only on polarity (positive, negative, and neutral) but also on emotions (happy, sad, angry, etc.). It uses various Natural Language Processing approaches such as rule-based, automatic, and hybrid methods.
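
As one concrete rule-based method, the sketch below uses NLTK's VADER analyzer (assumed to be installed; the lexicon download is needed only once), applied to invented reviews.

```python
# Rule-based sentiment scoring with VADER; the compound score in [-1, 1]
# is thresholded into positive / negative / neutral labels.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

reviews = [
    "The product works great and the support team was very helpful!",
    "Terrible experience, the device stopped working after two days.",
    "It arrived on time.",
]

for text in reviews:
    scores = sia.polarity_scores(text)       # neg / neu / pos / compound
    label = ("positive" if scores["compound"] > 0.05
             else "negative" if scores["compound"] < -0.05
             else "neutral")
    print(f"{label:8s} {scores['compound']:+.2f}  {text}")
```

Automatic (machine learning) methods instead train a classifier on labeled text, and hybrid methods combine both approaches.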

Let's consider a scenario: if we want to analyze whether a product is satisfying customer requirements, or whether there is a need for this product in the market, we can use sentiment analysis to monitor that product's reviews. Sentiment analysis is also efficient to use when there is a large set of unstructured data that we want to classify by automatically tagging it. Net Promoter Score (NPS) surveys are used extensively to gauge how a customer perceives a product or service. Sentiment analysis also gained popularity due to its ability to process large volumes of NPS responses and obtain consistent results quickly.

• Customer Feedback Analysis: Businesses can analyze customer reviews, comments, and feedback to understand the sentiment behind them, helping identify areas for improvement and address customer concerns, ultimately enhancing customer satisfaction.

• Brand Reputation Management: Sentiment analysis allows businesses to monitor their brand reputation in real time. By tracking mentions and sentiments on social media, review platforms, and other online channels, companies can respond promptly to both positive and negative sentiments, mitigating potential damage to their brand.

• Product Development and Innovation: Understanding customer sentiment helps identify features and aspects of products or services that are well-received or need improvement. This information is invaluable for product development and innovation, enabling companies to align their offerings with customer preferences.

• Competitor Analysis: Sentiment analysis can be used to compare the sentiment around a company's products or services with those of competitors. Businesses can identify their strengths and weaknesses relative to competitors, allowing for strategic decision-making.

• Marketing Campaign Effectiveness: Businesses can evaluate the success of their marketing campaigns by analyzing the sentiment of online discussions and social media mentions. Positive sentiment indicates that the campaign is resonating with the target audience, while negative sentiment may signal the need for adjustments.

3] Write short notes on Time series analysis.

Time series analysis and forecasting are crucial for predicting future trends and behaviors based on historical data. It helps businesses make informed decisions, optimize resources, and mitigate risks by anticipating market demand, sales fluctuations, stock prices, and more. Additionally, it aids in planning, budgeting, and strategizing across various domains such as finance, economics, healthcare, climate science, and resource management, driving efficiency and competitiveness.

A time series is a sequence of data points collected, recorded, or measured at successive, evenly-
spaced time intervals.

Each data point represents observations or measurements taken over time, such as stock prices,
temperature readings, or sales figures. Time series data is commonly represented graphically
with time on the horizontal axis and the variable of interest on the vertical axis, allowing analysts
to identify trends, patterns, and changes over time.

Time series data is often represented graphically as a line plot, with time depicted on the
horizontal x-axis and the variable's values displayed on the vertical y-axis. This graphical
representation facilitates the visualization of trends, patterns, and fluctuations in the variable
over time, aiding in the analysis and interpretation of the data.
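
As a small illustration of such a line plot, the following sketch uses pandas and matplotlib on a synthetic daily sales series (the data is invented); the rolling mean highlights the underlying trend.

```python
# Plotting a time series with a 14-day rolling mean to expose the trend.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
dates = pd.date_range("2024-01-01", periods=180, freq="D")
sales = pd.Series(100 + 0.3 * np.arange(180) + rng.normal(0, 8, 180),
                  index=dates, name="daily_sales")

ax = sales.plot(alpha=0.5, label="observed")              # time on x-axis
sales.rolling(window=14).mean().plot(ax=ax, label="14-day rolling mean")
ax.set_xlabel("Date")
ax.set_ylabel("Sales")
ax.legend()
plt.show()
```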

Importance of Time Series Analysis

1. Predict Future Trends: Time series analysis enables the prediction of future trends, allowing
businesses to anticipate market demand, stock prices, and other key variables, facilitating
proactive decision-making.

2. Detect Patterns and Anomalies: By examining sequential data points, time series analysis
helps detect recurring patterns and anomalies, providing insights into underlying behaviors
and potential outliers.

3. Risk Mitigation: By spotting potential risks, businesses can develop strategies to mitigate
them, enhancing overall risk management.

4. Strategic Planning: Time series insights inform long-term strategic planning, guiding decision-
making across finance, healthcare, and other sectors.

5. Competitive Edge: Time series analysis enables businesses to optimize resource allocation
effectively, whether it's inventory, workforce, or financial assets. By staying ahead of market
trends, responding to changes, and making data-driven decisions, businesses gain a
competitive edge.

4] Explain the ARIMA model with the autocorrelation function in Time series analysis.
ARIMA models are widely used in time series analysis for forecasting, and they rely heavily on
autocorrelation functions (ACF) to identify patterns and parameters. An ARIMA model combines
autoregressive (AR) and moving average (MA) components, which are determined by the
correlation of the time series data with its lagged versions. The "integrated" (I) component
accounts for non-stationarity in the data, ensuring that the model can be applied to time series
with trends or seasonality.

Here's a more detailed explanation:

1. Autocorrelation Function (ACF):

• The ACF measures the correlation of a time series with its past values at different lags (time
differences). It's a plot that shows the correlation coefficient between the time series and its
lagged versions.

• Purpose of ACF: ACF plots help in identifying the order of the AR and MA components in an
ARIMA model. By observing the decay pattern of the ACF, we can determine how many past
values (lags) are significant in predicting future values. For example, if the ACF decays quickly,
it might indicate a low-order AR process, while a slow decay might suggest a higher-order AR
process.

• Interpreting ACF:

o A strong correlation at lag 1 indicates a strong relationship between the current value and the previous value.

o A decaying correlation with increasing lags suggests that the correlation weakens over time.

o Significant correlations at specific lags can indicate the order of the AR and MA terms in the model.

2. ARIMA Model Components:

• Autoregressive (AR):

The AR component models the relationship between the current value and its past values. The
parameter "p" determines the order of the AR model (number of lagged observations). For
example, AR(1) means the current value is related to the previous value.

• Integrated (I):

The I component handles non-stationarity in the data. Stationary data has constant statistical
properties (mean, variance, autocorrelation) over time. The "d" parameter represents the order
of differencing required to make the data stationary. For example, if d=1, the data is differenced
once (taking the difference between consecutive values).

• Moving Average (MA):

The MA component models the relationship between the current value and past forecast errors
(residuals). The parameter "q" determines the order of the MA model (number of lagged errors).
For example, MA(1) means the current value is related to the previous forecast error.
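
A minimal sketch with statsmodels is shown below; the series is synthetic, and the ARIMA order (1, 1, 1) is an assumption chosen only to illustrate the workflow of inspecting the ACF and then fitting the model.

```python
# Inspect the ACF of the differenced series, then fit an ARIMA model.
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf

rng = np.random.default_rng(5)
# A random-walk-like series: differencing once (d=1) makes it stationary
y = np.cumsum(rng.normal(loc=0.2, scale=1.0, size=300))

# ACF of the differenced series helps guide the choice of p and q
plot_acf(np.diff(y), lags=20)
plt.show()

# Fit ARIMA with assumed orders p=1, d=1, q=1 and forecast 10 steps ahead
model = ARIMA(y, order=(1, 1, 1)).fit()
print(model.summary())
print("Forecast:", model.forecast(steps=10))
```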

5] Describe Term frequency and inverse document frequency (TFIDF).


TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic used to evaluate
how important a word is to a document in a collection of documents. It's a method used in
natural language processing and information retrieval to identify the most relevant words within
a text. TF-IDF combines term frequency (TF) and inverse document frequency (IDF) to determine
a word's significance.

Term Frequency (TF): This measures how often a word appears in a document. It reflects the
word's prevalence within a specific text.

Inverse Document Frequency (IDF): This evaluates the importance of a word across a collection of
documents. It assigns higher weight to words that appear less frequently across the entire
corpus, indicating they are more specific to a particular document.

Calculating TF-IDF: The TF-IDF value is calculated by multiplying the TF and IDF scores.
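
As a small illustration, the sketch below uses scikit-learn's TfidfVectorizer on three invented documents; note that scikit-learn applies a smoothed IDF and L2 normalization rather than the plain TF times IDF product described above.

```python
# TF-IDF weights for the terms of the first document in a tiny corpus.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "big data analytics on large data sets",
    "time series analytics for sales data",
    "graph databases store nodes and edges",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)        # rows = documents, cols = terms

# Show each term with its TF-IDF weight in the first document
terms = vectorizer.get_feature_names_out()
for idx in tfidf[0].nonzero()[1]:
    print(f"{terms[idx]:12s} {tfidf[0, idx]:.3f}")
```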

Why TF-IDF?

• Relevance: TF-IDF helps determine the relevance of a word to a document within a collection.

• Search Engine Optimization: It helps search engines identify the most relevant documents
for a given query by prioritizing words that are both frequent in a document and rare across
the entire corpus.

• Document Similarity: TF-IDF can be used to find similar documents based on their word
content.

Limitations:

• Synonyms: TF-IDF methods struggle to identify synonyms, as they focus on the frequency of individual words.

• Semantic Understanding: They lack semantic understanding and can't grasp the nuanced meaning of language.

In BDA (Big Data Analytics): TF-IDF is a valuable tool in BDA for analyzing and understanding large
text datasets, such as social media posts, news articles, and customer reviews. It helps extract
meaningful insights by identifying the most important words or keywords within a document
collection

Unit5

1] Explain a general overview of Big Data high-performance architecture along with HDFS in detail.

The Hadoop Distributed File System (HDFS) is a key component of the Apache Hadoop ecosystem, designed to store and manage large volumes of data across multiple machines in a distributed manner. It provides high-throughput access to data, making it suitable for applications that deal with large datasets, such as big data analytics, machine learning, and data warehousing. The following describes the architecture of HDFS, explains its key components and mechanisms, and highlights the advantages it offers over traditional file systems. HDFS is a scalable and fault-tolerant storage solution designed for large datasets. It consists of a NameNode (which manages metadata), DataNodes (which store data blocks), and a client interface. Key advantages include scalability, fault tolerance, high throughput, cost-effectiveness, and data locality, making it ideal for big data applications.

HDFS Architecture

HDFS is designed to be highly scalable, reliable, and efficient, enabling the storage and processing
of massive datasets. Its architecture consists of several key components:

1. NameNode

2. DataNode

3. Secondary NameNode

4. HDFS Client

5. Block Structure

NameNode

The NameNode is the master server that manages the filesystem namespace and controls access
to files by clients. It performs operations such as opening, closing, and renaming files and
directories. Additionally, the NameNode maps file blocks to DataNodes, maintaining the
metadata and the overall structure of the file system. This metadata is stored in memory for fast
access and persisted on disk for reliability.

Key Responsibilities:

• Maintaining the filesystem tree and metadata.

• Managing the mapping of file blocks to DataNodes.

• Ensuring data integrity and coordinating replication of data blocks.

DataNode

DataNodes are the worker nodes in HDFS, responsible for storing and retrieving actual data
blocks as instructed by the NameNode. Each DataNode manages the storage attached to it and
periodically reports the list of blocks it stores to the NameNode.

Key Responsibilities:

• Storing data blocks and serving read/write requests from clients.

• Performing block creation, deletion, and replication upon instruction from the NameNode.

• Periodically sending block reports and heartbeats to the NameNode to confirm its status.

Secondary NameNode

The Secondary NameNode acts as a helper to the primary NameNode, primarily responsible for
merging the EditLogs with the current filesystem image (FsImage) to reduce the potential load on
the NameNode. It creates checkpoints of the namespace to ensure that the filesystem metadata
is up-to-date and can be recovered in case of a NameNode failure.
Key Responsibilities:

• Merging EditLogs with FsImage to create a new checkpoint.

• Helping to manage the NameNode's namespace metadata.

HDFS Client

The HDFS client is the interface through which users and applications interact with the HDFS. It
allows for file creation, deletion, reading, and writing operations. The client communicates with
the NameNode to determine which DataNodes hold the blocks of a file and interacts directly
with the DataNodes for actual data read/write operations.

Key Responsibilities:

• Facilitating interaction between the user/application and HDFS.

• Communicating with the NameNode for metadata and with DataNodes for data access.
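
As an illustration of how a client interacts with HDFS, here is a minimal sketch using the third-party Python hdfs package (HdfsCLI) over WebHDFS; the NameNode URL, user, and paths are assumptions and will differ per cluster.

```python
# The client asks the NameNode for metadata; block reads and writes then
# go to the DataNodes that hold the replicas (handled by the library).
from hdfs import InsecureClient

client = InsecureClient("http://namenode-host:9870", user="hadoop")

# Write a small file into HDFS and list the target directory
client.write("/user/hadoop/example.txt", data=b"hello hdfs\n", overwrite=True)
print(client.list("/user/hadoop"))

# Read the file back
with client.read("/user/hadoop/example.txt") as reader:
    print(reader.read().decode("utf-8"))
```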

Block Structure

HDFS stores files by dividing them into large blocks, typically 128MB or 256MB in size. Each block
is stored independently across multiple DataNodes, allowing for parallel processing and fault
tolerance. The NameNode keeps track of the block locations and their replicas.

Key Features:

• Large block size reduces the overhead of managing a large number of blocks.

• Blocks are replicated across multiple DataNodes to ensure data availability and fault
tolerance.

HDFS Advantages

HDFS offers several advantages that make it a preferred choice for managing large datasets in
distributed computing environments:

Scalability

HDFS is highly scalable, allowing for the storage and processing of petabytes of data across
thousands of machines. It is designed to handle an increasing number of nodes and storage
without significant performance degradation.

Key Aspects:

• Linear scalability allows the addition of new nodes without reconfiguring the entire system.

• Supports horizontal scaling by adding more DataNodes.

Fault Tolerance

HDFS ensures high availability and fault tolerance through data replication. Each block of data is
replicated across multiple DataNodes, ensuring that data remains accessible even if some nodes
fail.

Key Features:

• Automatic block replication ensures data redundancy.


• Configurable replication factor allows administrators to balance storage efficiency and fault
tolerance.

Cost-Effective

HDFS is designed to run on commodity hardware, significantly reducing the cost of setting up and
maintaining a large-scale storage infrastructure. Its open-source nature further reduces the total
cost of ownership.

Key Features:

• Utilizes inexpensive hardware, reducing capital expenditure.

• Open-source software eliminates licensing costs.

Data Locality

HDFS takes advantage of data locality by moving computation closer to where the data is stored.
This minimizes data transfer over the network, reducing latency and improving overall system
performance.

Key Features:

• Data-aware scheduling ensures that tasks are assigned to nodes where the data resides.

• Reduces network congestion and improves processing speed.

HDFS Use Cases

HDFS is widely used in various industries and applications that require large-scale data processing:

• Big Data Analytics: HDFS is a core component of Hadoop-based big data platforms, enabling
the storage and analysis of massive datasets for insights and decision-making.

• Data Warehousing: Enterprises use HDFS to store and manage large volumes of historical
data for reporting and business intelligence.

• Machine Learning: HDFS provides a robust storage layer for machine learning frameworks,
facilitating the training of models on large datasets.
• Log Processing: HDFS is used to store and process log data from web servers, applications,
and devices, enabling real-time monitoring and analysis.

• Content Management: Media companies use HDFS to store and distribute large multimedia
files, ensuring high availability and efficient access.
2] Describe the MapReduce programming model.

In the Hadoop framework, MapReduce is the programming model. MapReduce uses a map-and-reduce strategy for the analysis of data. Today there is a huge amount of data available, and processing this extensive data is a critical task. The MapReduce programming model offers a solution for processing extensive data while maintaining both speed and efficiency. Understanding this programming model, its components, and its execution workflow in the Hadoop framework helps in gaining valuable insights.

What is MapReduce?

MapReduce is a parallel, distributed programming model in the Hadoop framework that can be used to access the extensive data stored in the Hadoop Distributed File System (HDFS). Hadoop is capable of running MapReduce programs written in various languages such as Java, Ruby, and Python. A key benefit is that MapReduce programs are inherently parallel, which makes very large-scale data analysis easier.

When the MapReduce programs run in parallel, it speeds up the process. The process of running
MapReduce programs is explained below.

• Dividing the input into fixed-size chunks: Initially, the work is divided into pieces. When file sizes vary, dividing the work into equal-sized pieces per file is not straightforward, because some processes will finish much earlier than others while some may take a very long time to complete. A better approach, although it requires more work, is to split the input into fixed-size chunks and assign each chunk to a process.

• Combining the results: Combining results from independent processes is a crucial task in MapReduce programming, because it may often need additional processing such as aggregating and finalizing the results.

Key components of MapReduce

There are two key components in MapReduce. MapReduce consists of two primary phases: the map phase and the reduce phase. Each phase takes key-value pairs as its input and output, and each has its own function: the map function and the reduce function.

• Mapper: The mapper is the first phase of MapReduce. It is responsible for processing each input record, supplied as key-value pairs generated by the InputSplit and RecordReader; the key-value pairs it emits can be completely different from the input pairs. The map output is the collection of all these intermediate key-value pairs.

• Reducer: The reducer phase is the second phase of MapReduce. It is responsible for processing the output of the mapper. Once it has processed the mapper's output, the reducer generates a new set of output that can be stored in HDFS as the final output data.

MapReduce and HDFS are the two major components of Hadoop that make it so powerful and efficient to use. MapReduce is a programming model used for efficient parallel processing over large data sets in a distributed manner. The data is first split and then combined to produce the final result. Libraries for MapReduce have been written in many programming languages, with various optimizations. The purpose of MapReduce in Hadoop is to map each job into smaller tasks and then reduce them to equivalent results, lowering the overhead on the cluster network and the processing power required. The MapReduce task is mainly divided into two phases: the Map phase and the Reduce phase.

MapReduce Architecture:

Components of MapReduce Architecture:

1. Client: The MapReduce client is the one who brings the Job to the MapReduce for
processing. There can be multiple clients available that continuously send jobs for processing
to the Hadoop MapReduce Manager.

2. Job: The MapReduce Job is the actual work that the client wanted to do which is comprised
of so many smaller tasks that the client wants to process or execute.

3. Hadoop MapReduce Master: It divides the particular job into subsequent job-parts.

4. Job-Parts: The task or sub-jobs that are obtained after dividing the main job. The result of all
the job-parts combined to produce the final output.

5. Input Data: The data set that is fed to the MapReduce for processing.

6. Output Data: The final result is obtained after the processing.

In MapReduce, we have a client. The client submits a job of a particular size to the Hadoop MapReduce Master. The MapReduce master then divides this job into further equivalent job-parts. These job-parts are made available for the Map and Reduce tasks. The Map and Reduce tasks contain the program written for the use case that the particular company is solving; the developer writes the logic to fulfill the industry's requirement. The input data is fed to the Map task, and the Map generates intermediate key-value pairs as its output. The output of the Map, i.e., these key-value pairs, is then fed to the Reducer, and the final output is stored on HDFS. There can be any number of Map and Reduce tasks made available for processing the data, as required. The Map and Reduce algorithms are written in an optimized way so that time and space complexity are kept to a minimum.

Let's discuss the MapReduce phases to get a better understanding of its architecture:

The MapReduce task is mainly divided into two phases, the Map phase and the Reduce phase.

1. Map: As the name suggests, its main use is to map the input data into key-value pairs. The input to the map may itself be a key-value pair, where the key can be an id or address of some kind and the value is the actual data it keeps. The Map() function is executed on each of these input key-value pairs and generates intermediate key-value pairs that serve as input for the Reducer, i.e., the Reduce() function.

2. Reduce: The intermediate key-value pairs that serve as input for the Reducer are shuffled, sorted, and sent to the Reduce() function. The Reducer aggregates or groups the data based on its key-value pairs, as per the reducer algorithm written by the developer.
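
To make the two phases concrete, here is a minimal single-process Python sketch of the classic word-count pattern; it only mimics the map, shuffle/sort, and reduce steps that Hadoop would actually run in parallel across the cluster.

```python
# Word count in the MapReduce style, simulated in one process.
from collections import defaultdict

lines = [
    "big data needs big storage",
    "map reduce processes big data",
]

# Map phase: emit intermediate (key, value) pairs, here (word, 1)
intermediate = []
for line in lines:
    for word in line.split():
        intermediate.append((word, 1))

# Shuffle and sort: group all values belonging to the same key
grouped = defaultdict(list)
for key, value in sorted(intermediate):
    grouped[key].append(value)

# Reduce phase: aggregate the grouped values per key
result = {key: sum(values) for key, values in grouped.items()}
print(result)   # e.g. {'big': 3, 'data': 2, ...}
```
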
Unit6

1] What is graph analytics? What are the features of a graph analytics platform?

Graph analytics analyzes data represented as a network of interconnected nodes and edges,
revealing relationships and patterns not easily found using traditional methods. A graph analytics
platform is a software solution that facilitates this analysis, offering features for visualizing, querying,
and extracting insights from graph data.

What is Graph Analytics?

• Data Representation:

Graph analytics treats data as a graph, where nodes represent entities and edges represent
relationships between them.

• Relationship Discovery:

It helps uncover complex connections and dependencies within the data, revealing insights that
might be hidden in traditional data models.

• Analytic Tools:

Graph analytics employs various algorithms to analyze graph structures, including centrality analysis, community detection, and link prediction (a small code sketch of these appears after this list).

• Use Cases:

It finds applications in areas like fraud detection, supply chain optimization, social network analysis,
and more.
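
As a small illustration of these analytic tools, the sketch below uses the networkx library (one possible toolkit) on a tiny hypothetical graph.

```python
# Centrality, shortest-path, and community analysis with networkx.
import networkx as nx
from networkx.algorithms import community

# Hypothetical graph: nodes are accounts, edges are transactions between them
G = nx.Graph()
G.add_edges_from([
    ("A", "B"), ("B", "C"), ("C", "A"),              # one tightly knit group
    ("C", "D"), ("D", "E"), ("E", "F"), ("F", "D"),  # a second group
])

# Centrality analysis: which nodes are most connected?
print("Degree centrality:", nx.degree_centrality(G))

# Path (link) analysis between two nodes
print("Shortest path A -> F:", nx.shortest_path(G, "A", "F"))

# Community detection via greedy modularity maximization
groups = community.greedy_modularity_communities(G)
print("Communities:", [sorted(g) for g in groups])
```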

Features of a Graph Analytics Platform:

• Visualization:

Graph platforms often provide tools to visualize graph structures, allowing users to explore
relationships visually.

• Querying:

They offer query languages (like Cypher) to search for specific patterns, relationships, and nodes
within the graph.

• Algorithm Integration:

Graph platforms often include a library of graph algorithms for various analytical tasks, such as
shortest path analysis, community detection, and centrality analysis.

• Data Integration:

They may provide capabilities to ingest data from various sources and transform it into graph format.

• Scalability:

Graph platforms are designed to handle large and complex graph datasets, offering scalability for
performance and storage.

• Collaboration:
Some platforms support collaborative workflows, allowing multiple users to work on the same graph
data.

• Integration with Other Tools:

Graph analytics platforms may integrate with machine learning tools, allowing for feature
engineering and model building based on graph data
2] What is NoSQL? Explain key-value store in NoSQL.

NoSQL, or "Not Only SQL," is a database management system (DBMS) designed to handle large
volumes of unstructured and semi-structured data. Unlike traditional relational databases that use
tables and pre-defined schemas, NoSQL databases provide flexible data models and
support horizontal scalability, making them ideal for modern applications that require real-time data
processing.

Why Use NoSQL?

Unlike relational databases, which use Structured Query Language (SQL), NoSQL databases don't have a universal query language. Instead, each type of NoSQL database typically has its own query language. Traditional relational databases follow ACID (Atomicity, Consistency, Isolation, Durability) principles, ensuring strong consistency and structured relationships between data.

However, as applications evolved to handle big data, real-time analytics, and distributed
environments, NoSQL emerged as a solution with:

• Scalability – Can scale horizontally by adding more nodes instead of upgrading a single
machine.

• Flexibility – Supports unstructured or semi-structured data without a rigid schema.

• High Performance – Optimized for fast read/write operations with large datasets.

• Distributed Architecture – Designed for high availability and partition tolerance in


distributed systems.

NoSQL refers to a non-SQL or non-relational database whose main purpose is to provide a mechanism for the storage and retrieval of data. Many NoSQL databases store information as JSON documents instead of columns and rows. While relational databases use rows and columns for storing and retrieving data, document-style NoSQL databases use JSON documents instead, which is why NoSQL is also described as non-relational.

A NoSQL database offers simplicity of design, simpler horizontal scaling, and fine control over availability. The data structures used in NoSQL databases are different from those used in relational databases, and these structures make some operations faster in NoSQL.

• Relationships present in NoSQL are less complex as compared to relational database systems.

• Actions performed in NoSQL are fast as compared to other databases.

• Implementation is less costly than for other databases.

• Programming with it is easier and more flexible.

• A high level of scalability is provided by NoSQL.
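
Regarding the key-value store asked about above: it is the simplest NoSQL model, where each record is a unique key mapped to a value and lookups are done only by key (Redis, Amazon DynamoDB, and Riak are well-known examples). Below is a minimal sketch using the redis-py client; the host, port, and key names are assumptions chosen only for illustration.

```python
# Basic key-value operations against a local Redis server (assumed running).
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Write and read simple key-value pairs
r.set("user:1001:name", "Om")
r.set("user:1001:cart_items", 3)

print(r.get("user:1001:name"))        # -> "Om"
r.incr("user:1001:cart_items")        # atomic increment
print(r.get("user:1001:cart_items"))  # -> "4"

# Keys can expire, which suits caching and session storage
r.set("session:abc123", "active", ex=3600)   # expires after one hour
```

Because the value is opaque to the database and access is always by key, key-value stores scale horizontally very easily, which is why they are common for caches, session stores, and shopping carts.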


3] Explain in brief Graph database

A graph database (GDB) is a database that uses graph structures for storing data. It uses nodes,
edges, and properties instead of tables or documents to represent and store data. The edges
represent relationships between the nodes. This helps in retrieving data more easily and, in many
cases, with one operation. Graph databases are commonly referred to as NoSQL. Ex: Neo4j, Amazon
Neptune, ArangoDB etc.

Representation:

The graph database is based on graph theory. The data is stored in the nodes of the graph, and the relationships between the data are represented by the edges between the nodes.

(Figure: graph representation of data)

When do we need Graph Database?

1. It solves many-to-many relationship problems

If we have friends of friends and similar structures, these are many-to-many relationships. A graph database is used when the equivalent query in a relational database would be very complex.

2. When relationships between data elements are more important

For example, a profile has some specific information in it, but the major selling point is the relationship between different profiles, that is, how you are connected within a network. In the same way, if there are user data elements inside a graph database, there could be many of them, but the relationships among them are the deciding factor for all the data elements stored inside the graph database.

3. Low latency with large-scale data

When you add lots of relationships in a relational database, the data sets become huge, and querying them becomes more complex and slower than usual. A graph database, however, is specifically designed for this purpose, and one can query relationships with ease.

Why do graph databases matter? Because graphs are good at handling relationships, some databases store data in the form of a graph.

Example: We have a social network in which five friends are all connected. These friends are Anay, Bhagya, Chaitanya, Dilip, and Erica. A graph database that stores their personal information may look something like this:
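
The original illustration is not reproduced in this copy; as a hedged sketch, the five friends could be stored as Person nodes joined by FRIEND_OF relationships, for example through Neo4j's Python driver (the connection URI, credentials, and property names below are placeholders).

```python
# Creating Person nodes and FRIEND_OF relationships for the five friends.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

friends = ["Anay", "Bhagya", "Chaitanya", "Dilip", "Erica"]

with driver.session() as session:
    # Each friend becomes a node carrying personal information as properties
    for name in friends:
        session.run("MERGE (:Person {name: $name})", name=name)
    # "All connected": create a FRIEND_OF edge between every pair
    for i, a in enumerate(friends):
        for b in friends[i + 1:]:
            session.run(
                "MATCH (x:Person {name: $a}), (y:Person {name: $b}) "
                "MERGE (x)-[:FRIEND_OF]->(y)", a=a, b=b)

driver.close()
```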
