1. k-Nearest Neighbours (kNN) Algorithm: kNN is a simple, non-parametric, and lazy
learning algorithm used for classification and regression. For classification, it assigns
the class of the majority of its k-nearest neighbors. For regression, it averages the
values of its k-nearest neighbors. The key steps involve calculating the distance (e.g.,
Euclidean) between the query point and the points in the training set, selecting the k
closest points, and then performing a majority vote (for classification) or averaging
(for regression).
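
A minimal R sketch of this procedure (illustrative, assuming the built-in iris data and the
'class' package, one common kNN implementation):

library(class)
set.seed(1)
train_idx <- sample(nrow(iris), 100)                 # 100 rows for training, the rest for testing
pred <- knn(train = iris[train_idx, 1:4],
            test  = iris[-train_idx, 1:4],
            cl    = iris$Species[train_idx], k = 5)  # majority vote among the 5 nearest points
table(pred, iris$Species[-train_idx])                # predicted vs. actual species
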
2. Working of the k-means Clustering Algorithm: k-means clustering is an
unsupervised learning algorithm used to partition data into k distinct clusters. It works
as follows:
o Initialize k centroids randomly.
o Assign each data point to the nearest centroid.
o Update centroids by calculating the mean of all points assigned to each
centroid.
o Repeat the assignment and update steps until the centroids no longer change or
change minimally.
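
A minimal sketch of these steps with base R's kmeans(), using the numeric columns of iris and
an illustrative choice of k = 3:

set.seed(42)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)   # nstart repeats the random initialisation
km$centers                                            # final centroid coordinates
table(km$cluster, iris$Species)                       # cluster assignments vs. known labels
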
3. Comparison of kNN and k-means Clustering:
o Purpose: kNN is used for supervised learning tasks (classification/regression),
while k-means is used for unsupervised clustering.
o Mechanism: kNN assigns labels based on nearest neighbors, k-means
partitions data into clusters.
o Output: kNN provides a classification or regression output; k-means provides
cluster assignments.
4. Difference between Logistic Regression and Linear Regression:
o Objective: Linear regression predicts continuous values, logistic regression
predicts probabilities for classification tasks.
o Output: Linear regression outputs a scalar value; logistic regression outputs
probabilities between 0 and 1.
o Function: Linear regression uses a linear function, logistic regression uses a
sigmoid function to model the probability.
5. Logistic Regression and its Applications: Logistic regression models the probability
of a binary outcome. It uses a logistic function (sigmoid) to map predicted values to
probabilities. Applications include binary classification tasks like spam detection,
disease diagnosis, and credit scoring.
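
A small R sketch with glm(), using the built-in mtcars data and its binary transmission
indicator am as an illustrative stand-in for a spam/disease/default label:

fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)   # sigmoid (logit) link
summary(fit)
head(predict(fit, type = "response"))                        # predicted probabilities in (0, 1)
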
6. Multivariate Linear Regression: It involves multiple independent variables to
predict a dependent variable. It extends simple linear regression, which uses only one
independent variable. The model form is y = β0 + β1x1 + β2x2 + ... + βnxn + ε.
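
In R, the same model is fitted by listing several predictors in one lm() formula (the built-in
mtcars data is used purely for illustration):

fit <- lm(mpg ~ wt + hp + disp, data = mtcars)   # mpg = b0 + b1*wt + b2*hp + b3*disp + error
summary(fit)                                     # the estimated coefficients are the beta terms
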
7. Stochastic Gradient Descent (SGD) and its Advantages over Batch Gradient
Descent:
o SGD: Updates model parameters using a single training example per iteration.
o Advantages: Faster convergence, less memory usage, and can escape local
minima due to its noisy updates. Batch Gradient Descent uses the entire
dataset per iteration, making it slower and memory-intensive.
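
An illustrative from-scratch sketch of SGD for simple linear regression on synthetic data (not
a library routine; one randomly chosen example drives each parameter update):

set.seed(1)
x <- runif(1000); y <- 3 + 2 * x + rnorm(1000, sd = 0.1)   # true intercept 3, slope 2
b0 <- 0; b1 <- 0; lr <- 0.05
for (step in 1:5000) {
  i   <- sample(1000, 1)            # a single training example per iteration
  err <- (b0 + b1 * x[i]) - y[i]    # prediction error for that example
  b0  <- b0 - lr * err              # gradient step for the intercept
  b1  <- b1 - lr * err * x[i]       # gradient step for the slope
}
c(b0, b1)                           # should end up close to 3 and 2
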
8. Central Limit Theorem (CLT): CLT states that the sampling distribution of the
sample mean approaches a normal distribution as the sample size grows, regardless of
the population's distribution.
o Population vs. Sample: Population includes all members of a group, while a
sample is a subset of the population.
9. Histogram in Descriptive Statistics: A histogram is a graphical representation of
data distribution. It shows the frequency of data points within specified ranges (bins).
It helps in understanding the distribution, central tendency, and variability of data.
10. Confusion Matrix and its Use in Classification: A confusion matrix is a table used
to evaluate the performance of a classification model. It shows true positives, true
negatives, false positives, and false negatives.
o Type I Error (False Positive): Predicting the positive class when the true class is negative.
o Type II Error (False Negative): Predicting the negative class when the true class is positive.
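
A quick R illustration with table(), using made-up labels:

actual    <- factor(c("pos", "pos", "neg", "neg", "pos", "neg"), levels = c("pos", "neg"))
predicted <- factor(c("pos", "neg", "neg", "pos", "pos", "neg"), levels = c("pos", "neg"))
table(Predicted = predicted, Actual = actual)
# Off-diagonal counts are the errors: "pos" predicted for an actual "neg" is a false positive
# (Type I); "neg" predicted for an actual "pos" is a false negative (Type II).
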
11. Basic Operations on Vectors or Matrices:
o Addition, subtraction, scalar multiplication.
o Dot product, cross product (for vectors).
o Matrix multiplication, transpose, inverse (for matrices).
12. Descriptive Statistics and its Key Components: Descriptive statistics summarize
data characteristics. Key components:
o Measures of central tendency: Mean, median, mode.
o Measures of variability: Range, variance, standard deviation.
o Measures of shape: Skewness, kurtosis.
13. Dimensionality Reduction and Techniques: Dimensionality reduction simplifies
data by reducing the number of features. Techniques include:
o PCA (Principal Component Analysis): Projects data onto lower-dimensional
space.
o t-SNE: Non-linear technique for visualization.
o LDA (Linear Discriminant Analysis): Finds a linear combination of features
that characterizes or separates classes.
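
A short PCA illustration with base R's prcomp() (iris as an example dataset; t-SNE and LDA
live in add-on packages such as 'Rtsne' and 'MASS'):

pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
summary(pca)          # proportion of variance explained by each principal component
head(pca$x[, 1:2])    # the data projected onto the first two components
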
14. Difference between Population and Sample:
o Population: Entire group of interest.
o Sample: Subset of the population used for analysis.
15. Definitions:
o Mean: Average of a set of values.
o Median: Middle value in a sorted list.
o Mode: Most frequently occurring value.
o Standard Deviation: Measure of data dispersion around the mean.
o Normal Distribution: Bell-shaped distribution, symmetric around the mean.
16. Assessing the Performance of a Regression Model:
o R-squared: Proportion of variance explained by the model.
o Adjusted R-squared: Adjusted for the number of predictors.
o MSE (Mean Squared Error): Average of squared differences between actual
and predicted values.
o RMSE (Root Mean Squared Error): Square root of MSE.
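
These metrics can be computed by hand for any fitted lm() model (mtcars used as an example):

fit    <- lm(mpg ~ wt, data = mtcars)
actual <- mtcars$mpg
pred   <- fitted(fit)
mse  <- mean((actual - pred)^2)                                      # Mean Squared Error
rmse <- sqrt(mse)                                                    # Root Mean Squared Error
r2   <- 1 - sum((actual - pred)^2) / sum((actual - mean(actual))^2)  # R-squared
c(MSE = mse, RMSE = rmse, R2 = r2)                                   # R2 matches summary(fit)$r.squared
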
17. Dimensional Reduction Techniques and their Importance:
o Techniques: PCA, LDA, t-SNE, Autoencoders.
o Importance: Reduces complexity, mitigates overfitting, and improves model
performance and interpretability.
18. Machine Learning and its Uses: Machine learning involves algorithms that learn
from data to make predictions or decisions. Uses include image recognition, natural
language processing, recommendation systems, and predictive analytics.
19. Importance of Data Visualization and Potential Problems: Visualization helps in
understanding, interpreting, and communicating data insights. Problems include
misinterpretation, misleading graphs, and oversimplification.
20. Supervised, Unsupervised, and Semi-supervised Learning:
o Supervised: Trains on labeled data.
o Unsupervised: Finds patterns in unlabeled data.
o Semi-supervised: Uses a mix of labeled and unlabeled data.
21. Data Cleaning and its Significance: Data cleaning involves removing errors and
inconsistencies from data to improve its quality. It is crucial for accurate analysis and
model performance.
22. Purpose of Storytelling in Data Science Communication: Storytelling helps in
effectively communicating data insights to non-technical stakeholders, making data-
driven decisions more accessible and understandable.
23. Importance of Data Dashboards in Decision-Making: Data dashboards provide
real-time, interactive visualizations that help stakeholders monitor key metrics,
identify trends, and make informed decisions quickly.
24. Importance of Data Ethics in Data Science Projects: Data ethics ensures
responsible use of data, protecting privacy, preventing bias, and maintaining trust.
Ethical considerations are crucial for compliance, reputation, and fairness.


1. What is Data Engineering? Discuss its scope.

Data engineering involves designing, constructing, and maintaining systems and architectures
that collect, store, process, and analyze large-scale data. The scope includes:

 Data Collection: Integrating various data sources (e.g., databases, APIs).


 Data Storage: Designing data warehouses, lakes, and databases.
 Data Processing: ETL (Extract, Transform, Load) processes, data cleaning, and
preparation.
 Data Analysis Support: Providing clean and structured data for analysts and
scientists.
 Big Data Technologies: Working with Hadoop, Spark, and cloud-based data
platforms.

2. What is R and what are its main characteristics?

R is a programming language and environment used for statistical computing and graphics.
Its main characteristics include:

 Statistical Analysis: Extensive statistical tests and methods.


 Visualization: Advanced graphical capabilities for data visualization.
 Packages: Comprehensive collection of packages for various tasks.
 Data Manipulation: Powerful tools for data manipulation and cleaning.
 Open Source: Free and open-source, with a strong community support.

3. Write down different data types in R.

 Numeric: For numbers.


 Integer: For integer values.
 Character: For strings.
 Logical: For TRUE/FALSE values.
 Complex: For complex numbers.
 Factor: For categorical data.
 Date and Time: For date and time values.

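Short examples of each type (illustrative values):

num <- 3.14                        # numeric
int <- 5L                          # integer (the L suffix forces integer storage)
chr <- "data science"              # character
lgl <- TRUE                        # logical
cpx <- 2 + 3i                      # complex
fct <- factor(c("low", "high"))    # factor (categorical)
dt  <- as.Date("2024-01-15")       # date
class(fct); class(dt)              # check the types
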
4. What is the difference between the str() and summary() functions in R?

 str(): Displays the structure of an R object, showing its type, length, and a preview of
its contents.
 summary(): Provides a summary of an R object, giving statistical summaries (e.g.,
mean, median) for numerical data and frequency counts for factors.
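
For example, on the built-in iris data frame:

str(iris)       # 'data.frame': 150 obs. of 5 variables, with each column's type and first values
summary(iris)   # min/quartiles/mean for the numeric columns, level counts for Species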

5. Define Null hypothesis.

The null hypothesis (H0) is a statement that there is no effect or no difference, and it is the
hypothesis that researchers typically try to disprove or reject. It serves as the default or
starting assumption in hypothesis testing.
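
As a small illustration on simulated data, a one-sample t-test in R where H0 states that the
true mean is 0:

set.seed(1)
x <- rnorm(30, mean = 0.5)   # sample whose true mean is actually 0.5
t.test(x, mu = 0)            # a small p-value is evidence against H0: mean = 0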

6. Write code for the following:

a. Assigning a value to a variable in R.

my_variable <- 10

b. Concatenate strings in R

str1 <- "Hello"
str2 <- "World"
result <- paste(str1, str2)

7. What is overfitting and underfitting, and how can they be prevented?

 Overfitting: The model learns the training data too well, capturing noise along with
the underlying pattern. It performs well on training data but poorly on new data.
o Prevention: Use cross-validation, regularization techniques (L1, L2), and
simpler models.
 Underfitting: The model is too simple to capture the underlying pattern of the data,
performing poorly on both training and new data.
o Prevention: Increase model complexity, add more features, and reduce
regularization.
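
A hedged sketch of one prevention technique, ridge (L2) regularization chosen by
cross-validation, assuming the 'glmnet' package is installed (mtcars used for illustration):

library(glmnet)
x <- as.matrix(mtcars[, c("wt", "hp", "disp", "qsec")])
y <- mtcars$mpg
cvfit <- cv.glmnet(x, y, alpha = 0)    # alpha = 0 gives ridge (L2); alpha = 1 gives lasso (L1)
coef(cvfit, s = "lambda.min")          # coefficients at the cross-validated penalty strength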

8. Define Eigenvalues and Eigenvectors.

 Eigenvalues: Scalars that indicate how much the eigenvector is stretched or shrunk
during a linear transformation.
 Eigenvectors: Non-zero vectors that only change in scale (not direction) during a
linear transformation; they define the directions that the transformation preserves.
9. Define Central tendency.

Central tendency refers to the measure that represents the center or typical value of a dataset.
It includes mean, median, and mode.

10. Define

a. Mean

The average of a set of values.

b. Median

The middle value in a sorted list of numbers.

c. Mode

The most frequently occurring value in a dataset.

d. Standard deviation

A measure of the amount of variation or dispersion in a set of values.

e. Normal distribution

A symmetric, bell-shaped distribution where most of the data points cluster around the mean.
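
A small base-R illustration of these measures on a made-up vector (base R has no built-in
function for the statistical mode, so it is computed via a frequency table):

x <- c(2, 4, 4, 4, 5, 5, 7, 9)
mean(x)                        # 5
median(x)                      # 4.5
names(which.max(table(x)))     # "4", the most frequent value
sd(x)                          # standard deviation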

11. What is a histogram, and how is it used in descriptive statistics?

A histogram is a graphical representation that organizes a group of data points into specified
ranges (bins). It shows the frequency distribution of a dataset and helps in visualizing the
shape, spread, and central tendency of the data.

12. Why is it necessary to visualize data, and what problems can arise with it?

Necessity:

 Simplifies complex data


 Reveals patterns and trends
 Aids in decision-making
 Communicates insights effectively

Problems:

 Can mislead if not done correctly


 May oversimplify data
 Might introduce biases through improper scaling or visualization techniques

13. Why is data ethics important in data science projects?


Data ethics ensures responsible use of data, protecting privacy, preventing misuse, and
maintaining trust. Ethical considerations help avoid biases, discrimination, and ensure
compliance with laws and regulations.

14. Describe the stochastic gradient descent and its advantages over batch
gradient descent.

 SGD: Iteratively updates model parameters using one or a few training examples at a
time.
 Advantages: Faster convergence, lower memory usage, and can escape local minima
due to its noisy updates compared to Batch Gradient Descent which processes the
entire dataset per iteration.

15. What are the basic operations you can perform on vectors or matrices?

 Addition and Subtraction: Element-wise operations.


 Scalar Multiplication: Multiplying each element by a scalar.
 Dot Product: Sum of the products of corresponding elements.
 Cross Product (for vectors): Produces a vector perpendicular to the two input
vectors.
 Matrix Multiplication: Product of two matrices.
 Transpose: Flipping a matrix over its diagonal.
 Inverse: Matrix that, when multiplied with the original matrix, yields the identity
matrix.
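
A short base-R sketch of these operations (example values are illustrative; the 3-D cross
product is not in base R and is provided by packages such as 'pracma'):

v1 <- c(1, 2, 3); v2 <- c(4, 5, 6)
v1 + v2                      # element-wise addition
v1 - v2                      # element-wise subtraction
3 * v1                       # scalar multiplication
sum(v1 * v2)                 # dot product

A <- matrix(1:4, nrow = 2)
B <- matrix(c(2, 0, 1, 3), nrow = 2)
A %*% B                      # matrix multiplication
t(A)                         # transpose
solve(B)                     # inverse (B must be square and non-singular)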

16. Describe the importance of data dashboards in decision-making.

Data dashboards provide a centralized view of key metrics and KPIs, allowing real-time
monitoring and quick identification of trends and issues. They enhance data-driven decision-
making by providing accessible, visual insights to stakeholders.


1. Different Types of Charts and Plots in R with Syntax

a. Histogram

hist(data$variable, main="Histogram", xlab="X-axis Label", ylab="Y-axis
Label", col="color")

b. Bar Plot

barplot(height, main="Bar Plot", xlab="X-axis Label", ylab="Y-axis Label",
col="color", names.arg=names)

c. Box Plot
boxplot(data$variable ~ data$group, main="Box Plot", xlab="X-axis Label",
ylab="Y-axis Label", col="color")

d. Scatter Plot

plot(data$x, data$y, main="Scatter Plot", xlab="X-axis Label", ylab="Y-axis
Label", col="color")

e. Line Plot

plot(data$x, data$y, type="l", main="Line Plot", xlab="X-axis Label",
ylab="Y-axis Label", col="color")

f. Pie Chart

pie(values, labels=labels, main="Pie Chart", col=colors)

2. Performing Linear Regression in R

Code:

# Load the data
data <- read.csv("data.csv")

# Fit linear model
model <- lm(y ~ x, data=data)

# Display the summary
summary(model)

Advantages:

 Simplicity: Easy to understand and interpret.


 Speed: Computationally efficient.
 Linearity: Works well for linear relationships.

Disadvantages:

 Linearity: Assumes a linear relationship between variables.


 Outliers: Sensitive to outliers.
 Multicollinearity: Assumes no multicollinearity.

3. Explain KNN Algorithm


K-Nearest Neighbors (KNN) is a simple, non-parametric algorithm used for classification and
regression. It works by finding the 'k' closest data points in the training set to a new point and
assigning the most common label (for classification) or averaging the labels (for regression).

Advantages:

 Simple: Easy to implement and understand.


 Non-parametric: Makes no assumptions about the data distribution.
 Flexible: Can be used for classification and regression.

Disadvantages:

 Computational Cost: High for large datasets.


 Storage Cost: Requires storing the entire training dataset.
 Sensitivity to Noise: Sensitive to irrelevant or redundant features.

4. Define Eigenvalues and Eigenvectors, Find Eigenvalues for Matrix A

Eigenvalues are scalars that measure the factor by which the corresponding eigenvector is
scaled during a linear transformation. Eigenvectors are non-zero vectors that change only in
scale during the transformation.

Matrix A:

    A = [  5   8  16 ]
        [  4   1   8 ]
        [ -4  -4  11 ]

Finding Eigenvalues:

A <- matrix(c(5, 8, 16, 4, 1, 8, -4, -4, 11), nrow=3, byrow=TRUE)
eigen(A)$values

5. Show that the Following Matrices are Linearly Independent

To test whether the vectors [-2, 4], [7, -2], [3, -6] are linearly independent, place them as
the columns of a matrix and check its rank: the vectors are independent if and only if the
rank equals the number of vectors. (The determinant test only applies to square matrices.)

Matrix B:

    B = [ -2   7   3 ]
        [  4  -2  -6 ]

Since B has only two rows, its rank is at most 2, so three vectors in R^2 can never be
linearly independent; indeed [3, -6] = -1.5 * [-2, 4].

B <- matrix(c(-2, 7, 3, 4, -2, -6), nrow=2, byrow=TRUE)
qr(B)$rank   # rank 2 < 3 vectors, so the set is linearly dependent

6. Explain Dimensionality Reduction and Some of its Techniques


Dimensionality reduction involves reducing the number of random variables under
consideration by obtaining a set of principal variables.

Techniques:

 Principal Component Analysis (PCA): Transforms data into a set of orthogonal
components.
 Linear Discriminant Analysis (LDA): Finds the linear combination of features that
best separate classes.
 t-Distributed Stochastic Neighbor Embedding (t-SNE): Visualizes high-
dimensional data by reducing it to 2 or 3 dimensions.

Advantages:

 Reduces Overfitting: By reducing the number of features.


 Improves Performance: Speeds up training and reduces complexity.
 Simplifies Models: Makes models easier to interpret.

Disadvantages:

 Information Loss: Some data variance might be lost.


 Complexity: Some techniques can be computationally expensive.

7. What is the Confusion Matrix, and How is it Used in Classification? Define
and Explain Type I and Type II Errors

A confusion matrix is a table used to evaluate the performance of a classification model by
comparing the predicted and actual values.

Confusion Matrix:

                   Predicted Positive     Predicted Negative
Actual Positive    True Positive (TP)     False Negative (FN)
Actual Negative    False Positive (FP)    True Negative (TN)

 Type I Error (False Positive): Incorrectly rejecting the null hypothesis (FP).
 Type II Error (False Negative): Failing to reject the null hypothesis when it is false
(FN).

8. What is Overfitting and Underfitting, and How Can They Be Prevented

 Overfitting: Model learns the training data too well, capturing noise along with the
underlying pattern.
o Prevention: Use cross-validation, regularization techniques (L1, L2), and
simpler models.
 Underfitting: Model is too simple to capture the underlying pattern of the data.
o Prevention: Increase model complexity, add more features, and reduce
regularization.
9. What is the Central Limit Theorem? Explain the Difference Between
Population and Sample

 Central Limit Theorem (CLT): States that the distribution of the sample mean
approximates a normal distribution as the sample size becomes large, regardless of the
population's distribution.

Difference Between Population and Sample:

 Population: The entire group being studied.


 Sample: A subset of the population used to represent the population.

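A small simulation sketch of the CLT in R, drawing samples from a skewed (exponential)
population:

set.seed(1)
sample_means <- replicate(10000, mean(rexp(30, rate = 1)))   # means of 10,000 samples of size 30
hist(sample_means, breaks = 50,
     main = "Sampling distribution of the mean (n = 30)", xlab = "Sample mean")
# The histogram is approximately normal even though the population itself is skewed.
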
10. Differences Between:

a. Supervised and Unsupervised Machine Learning

 Supervised Learning: Uses labeled data to train the model.


 Unsupervised Learning: Uses unlabeled data to find structure in the data.

b. Logistic Regression and Linear Regression

 Logistic Regression: Used for classification problems.


 Linear Regression: Used for regression problems.

c. Exploratory Factor Analysis and Confirmatory Factor Analysis

 Exploratory Factor Analysis (EFA): Used to discover the underlying structure of a
large set of variables.
 Confirmatory Factor Analysis (CFA): Used to test the hypothesis that the
relationships between observed variables and their underlying latent constructs exist.

d. Population and Sample

 Population: Entire group being studied.


 Sample: A subset of the population used to represent the population.

11. Compare and Contrast KNN and K-means Clustering

 KNN (K-Nearest Neighbors):


o Type: Supervised learning.
o Purpose: Classification and regression.
o Method: Finds the 'k' nearest data points and assigns the most common label
(classification) or averages the labels (regression).
 K-means Clustering:
o Type: Unsupervised learning.
o Purpose: Clustering data into 'k' clusters.
o Method: Assigns data points to clusters such that the sum of the squared
distances between the data points and the cluster centroid is minimized.
12. Define Descriptive Statistics and List its Key Components

Descriptive statistics summarizes or describes the main features of a dataset quantitatively.

Key Components:

 Measures of Central Tendency: Mean, median, and mode.


 Measures of Variability: Range, variance, and standard deviation.
 Measures of Shape: Skewness and kurtosis.
 Graphs and Charts: Histograms, bar charts, and box plots.

13. Explain Steps of Data Pre-processing. Explain Different Python Libraries
Used in Data Preprocessing

Steps of Data Pre-processing:

1. Data Cleaning: Handling missing values, removing duplicates.


2. Data Transformation: Scaling, normalization, and encoding categorical variables.
3. Data Reduction: Dimensionality reduction and feature selection.
4. Data Splitting: Dividing the data into training and testing sets.

Python Libraries:

 Pandas: Data manipulation and analysis.


 NumPy: Numerical computations.
 Scikit-learn: Machine learning and preprocessing.
 Matplotlib/Seaborn: Data visualization.

14. Define Machine Learning and Write Down Different Techniques of
Machine Learning

Machine learning is a subset of artificial intelligence that involves the use of algorithms and
statistical models to enable computers to learn from and make predictions based on data.

Techniques:

 Supervised Learning: Classification, regression.


 Unsupervised Learning: Clustering, association.
 Semi-Supervised Learning: Combination of labeled and unlabeled data.
 Reinforcement Learning: Learning by interacting with an environment.

15.
