Data engineering involves designing, constructing, and maintaining systems and architectures
that collect, store, process, and analyze large-scale data. The scope includes:
R is a programming language and environment used for statistical computing and graphics.
Its main characteristics include:
str(): Displays the structure of an R object, showing its type, length, and a preview of
its contents.
summary(): Provides a summary of an R object, giving statistical summaries (e.g.,
mean, median) for numerical data and frequency counts for factors.
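As a minimal sketch of the two functions above, the following applies them to a small made-up data frame. The data frame df and its columns score and group are assumptions for illustration only.

```r
# Hypothetical example data: one numeric column and one factor column
df <- data.frame(
  score = c(72, 85, 90, 66, 85),
  group = factor(c("a", "b", "a", "b", "a"))
)

# str() shows the type of each column and a preview of its values
str(df)

# summary() gives quartiles and the mean for numeric columns,
# and frequency counts for factor columns
summary(df)

# The same function works on a single column
score_summary <- summary(df$score)
group_counts <- summary(df$group)
```

Calling summary() on the numeric column returns the minimum, quartiles, mean, and maximum, while on the factor column it returns the count of each level.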
The null hypothesis (H0) is a statement that there is no effect or no difference, and it is the
hypothesis that researchers typically try to disprove or reject. It serves as the default or
starting assumption in hypothesis testing.
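To make this concrete, here is a small sketch of testing a null hypothesis in R with a two-sample t-test. The simulated groups (control, treated), their means, and the 0.05 significance level are assumptions chosen for illustration, not values from the notes.

```r
set.seed(123)

# Simulated data: H0 says the two groups have the same mean
control <- rnorm(50, mean = 10, sd = 1)
treated <- rnorm(50, mean = 12, sd = 1)

# Welch two-sample t-test of H0: mean(treated) == mean(control)
result <- t.test(treated, control)
print(result$p.value)

# Reject H0 when the p-value falls below the chosen significance level
alpha <- 0.05
reject_h0 <- result$p.value < alpha
print(reject_h0)
```

A small p-value means data like these would be unlikely if H0 were true, which is the usual basis for rejecting it.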
my_variable <- 10
b. Concatenate strings in R
str1 <- "Hello"
str2 <- "World"
result <- paste(str1, str2)
Overfitting: The model learns the training data too well, capturing noise along with
the underlying pattern. It performs well on training data but poorly on new data.
o Prevention: Use cross-validation, regularization techniques (L1, L2), and
simpler models.
Underfitting: The model is too simple to capture the underlying pattern of the data,
performing poorly on both training and new data.
o Prevention: Increase model complexity, add more features, and reduce
regularization.
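The difference between the two can be sketched by fitting polynomials of increasing degree to noisy data and comparing training and test error. The simulated sine data, the degrees 1, 3, and 15, and the helper mse() are all assumptions made for this illustration.

```r
set.seed(1)

# Simulated data: a sine curve plus noise (an assumption for illustration)
make_data <- function(n) {
  x <- runif(n, 0, 1)
  data.frame(x = x, y = sin(2 * pi * x) + rnorm(n, sd = 0.3))
}
train <- make_data(40)
test <- make_data(200)

# Mean squared error of a fitted model on a given data set
mse <- function(model, data) {
  mean((data$y - predict(model, newdata = data))^2)
}

# Degree 1 is likely too simple, degree 15 is likely too flexible
degrees <- c(1, 3, 15)
results <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  c(degree = d, train_mse = mse(fit, train), test_mse = mse(fit, test))
}))
print(results)
```

Training error can only go down as the degree grows. An underfit model typically shows high error on both sets, while an overfit model shows low training error but noticeably higher test error.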
Eigenvalues: Scalars that indicate how much the corresponding eigenvector is stretched
or shrunk during a linear transformation.
Eigenvectors: Non-zero vectors v that a linear transformation A only scales, so that
Av = λv for the eigenvalue λ. They stay on the same line through the origin (a negative
eigenvalue flips them to the opposite direction).
9. Define Central tendency.
Central tendency refers to the measure that represents the center or typical value of a dataset.
It includes mean, median, and mode.
10. Define
a. Mean
b. Median
c. Mode
d. Standard deviation
e. Normal distribution
A symmetric, bell-shaped distribution where most of the data points cluster around the mean.
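The measures listed above can be computed in R as in the sketch below. The sample vector x is made up for illustration, and stat_mode() is a hypothetical helper, since base R has no built-in function for the statistical mode (the built-in mode() reports an object's storage type instead).

```r
# Made-up sample data
x <- c(2, 4, 4, 5, 7, 9, 4)

x_mean <- mean(x)      # arithmetic average
x_median <- median(x)  # middle value of the sorted data
x_sd <- sd(x)          # sample standard deviation (divides by n - 1)

# Hypothetical helper: most frequent value(s) in a vector
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[as.vector(counts == max(counts))])
}
x_mode <- stat_mode(x)

# Normal distribution: share of values within one standard deviation of the mean
within_one_sd <- pnorm(1) - pnorm(-1)  # about 0.683

print(c(mean = x_mean, median = x_median, mode = x_mode, sd = x_sd))
```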
A histogram is a graphical representation that organizes a group of data points into specified
ranges (bins). It shows the frequency distribution of a dataset and helps in visualizing the
shape, spread, and central tendency of the data.
12. Why is it necessary to visualize data, and what problems can arise with it?
Necessity:
Problems:
14. Describe the stochastic gradient descent and its advantages over batch
gradient descent.
SGD: Iteratively updates model parameters using the gradient computed from one training
example (or a small mini-batch) at a time.
Advantages: Each update is much cheaper than in Batch Gradient Descent, which processes
the entire dataset per iteration, so SGD often makes faster initial progress on large
datasets and uses less memory. Its noisy updates can also help it escape shallow local
minima.
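A minimal sketch of both methods for simple linear regression is shown below. The simulated data, the learning rates, the epoch counts, and the function names batch_gd() and sgd() are assumptions chosen for illustration, not a reference implementation.

```r
set.seed(42)

# Simulated data from y = 2 + 3x + noise (an assumption for illustration)
n <- 200
x <- runif(n)
y <- 2 + 3 * x + rnorm(n, sd = 0.1)
X <- cbind(1, x)  # design matrix with an intercept column

# Batch gradient descent: every update uses the gradient over all n rows
batch_gd <- function(X, y, lr = 0.5, epochs = 500) {
  w <- c(0, 0)
  for (e in seq_len(epochs)) {
    grad <- drop(t(X) %*% (X %*% w - y)) / nrow(X)
    w <- w - lr * grad
  }
  w
}

# Stochastic gradient descent: every update uses a single row,
# visiting the rows in a fresh random order each epoch
sgd <- function(X, y, lr = 0.05, epochs = 50) {
  w <- c(0, 0)
  for (e in seq_len(epochs)) {
    for (i in sample(nrow(X))) {
      xi <- X[i, ]
      grad <- xi * (sum(xi * w) - y[i])
      w <- w - lr * grad
    }
  }
  w
}

w_batch <- batch_gd(X, y)
w_sgd <- sgd(X, y)
print(rbind(batch = w_batch, sgd = w_sgd))
```

Both should land near the true coefficients (2, 3). The SGD estimate keeps a little noise because each step sees only one example, which is the trade-off for its cheap updates.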
15. What are the basic operations you can perform on vectors or matrices?
Data dashboards provide a centralized view of key metrics and KPIs, allowing real-time
monitoring and quick identification of trends and issues. They enhance data-driven decision-
making by providing accessible, visual insights to stakeholders.
a. Histogram
# col must be a valid R color name, e.g. "steelblue"
hist(data$variable, main="Histogram", xlab="X-axis Label",
     ylab="Y-axis Label", col="steelblue")
b. Bar Plot
# height: numeric vector of bar heights; names: one label per bar
barplot(height, main="Bar Plot", xlab="X-axis Label", ylab="Y-axis Label",
        col="steelblue", names.arg=names)
c. Box Plot
# One box per level of data$group
boxplot(data$variable ~ data$group, main="Box Plot", xlab="X-axis Label",
        ylab="Y-axis Label", col="steelblue")
d. Scatter Plot
plot(data$x, data$y, main="Scatter Plot", xlab="X-axis Label",
     ylab="Y-axis Label", col="steelblue")
e. Line Plot
# type="l" draws lines instead of points
plot(data$x, data$y, type="l", main="Line Plot", xlab="X-axis Label",
     ylab="Y-axis Label", col="steelblue")
f. Pie Chart
pie(values, labels=labels, main="Pie Chart", col=colors)
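The templates above use placeholder names. As a runnable sketch, the same six plots can be drawn from the mtcars and pressure datasets that ship with R. The choice of datasets, columns, and colors here is an assumption for illustration.

```r
# a. Histogram of miles per gallon
h <- hist(mtcars$mpg, main = "Histogram", xlab = "Miles per gallon",
          ylab = "Frequency", col = "steelblue")

# b. Bar plot of the number of cars per cylinder count
cyl_counts <- table(mtcars$cyl)
barplot(cyl_counts, main = "Bar Plot", xlab = "Cylinders",
        ylab = "Number of cars", col = "steelblue")

# c. Box plot of mpg grouped by cylinder count
boxplot(mpg ~ cyl, data = mtcars, main = "Box Plot", xlab = "Cylinders",
        ylab = "Miles per gallon", col = "steelblue")

# d. Scatter plot of weight against mpg
plot(mtcars$wt, mtcars$mpg, main = "Scatter Plot", xlab = "Weight",
     ylab = "Miles per gallon", col = "steelblue")

# e. Line plot of vapor pressure against temperature
plot(pressure$temperature, pressure$pressure, type = "l", main = "Line Plot",
     xlab = "Temperature", ylab = "Pressure", col = "steelblue")

# f. Pie chart of cars per cylinder count
pie(cyl_counts, labels = names(cyl_counts), main = "Pie Chart",
    col = c("steelblue", "orange", "darkgreen"))
```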
Code:
# Load the data
data <- read.csv("data.csv")
Advantages:
Disadvantages:
Advantages:
Disadvantages:
Eigenvalues are scalars that measure the factor by which the corresponding eigenvector is
scaled during a linear transformation. Eigenvectors are non-zero vectors that change only in
scale during the transformation.
Finding Eigenvalues:
A <- matrix(c(5, 4, -4, 8, 1, -4, 16, 8, 11), nrow=3, byrow=TRUE)
eigen(A)$values
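As a quick check of the definition, the sketch below verifies that each eigenpair returned by eigen() satisfies Av = λv for the same matrix. The variable names e, V, lambda, and residual are assumptions for illustration. For a non-symmetric matrix such as this one, eigen() may return complex values, which the check handles.

```r
A <- matrix(c(5, 4, -4, 8, 1, -4, 16, 8, 11), nrow = 3, byrow = TRUE)

e <- eigen(A)
lambda <- e$values   # eigenvalues (possibly complex)
V <- e$vectors       # matching eigenvectors, one per column

# A %*% V should equal V with each column scaled by its eigenvalue
residual <- A %*% V - V %*% diag(lambda)
max_error <- max(Mod(residual))
print(max_error)     # should be very close to zero
```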
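The vectors [-2, 4], [7, -2], [3, -6] cannot be linearly independent: they are three vectors
in a two-dimensional space, and in fact [3, -6] = -1.5 * [-2, 4]. The determinant test
applies only to a square matrix, so it cannot be used on the 2 x 3 matrix formed by these
vectors. Instead, compare the rank of the matrix with the number of vectors: the vectors are
linearly independent if and only if the rank equals the number of vectors.
B <- matrix(c(-2, 7, 3, 4, -2, -6), nrow=2, byrow=TRUE)  # each column is one vector
qr(B)$rank  # at most 2, less than the 3 vectors, so they are linearly dependent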
Techniques:
Advantages:
Disadvantages:
Confusion Matrix:
Type I Error (False Positive): Rejecting the null hypothesis when it is actually true (FP).
Type II Error (False Negative): Failing to reject the null hypothesis when it is actually
false (FN).
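A confusion matrix can be built in R with table(), as in this sketch. The actual and predicted label vectors are made up for illustration.

```r
# Made-up labels for ten observations
lv <- c("negative", "positive")
actual <- factor(c("positive", "negative", "positive", "positive", "negative",
                   "negative", "positive", "negative", "positive", "negative"),
                 levels = lv)
predicted <- factor(c("positive", "negative", "negative", "positive", "negative",
                      "positive", "positive", "negative", "positive", "negative"),
                    levels = lv)

# Rows are predictions, columns are the true labels
cm <- table(Predicted = predicted, Actual = actual)
print(cm)

tp <- cm["positive", "positive"]  # true positives
tn <- cm["negative", "negative"]  # true negatives
fp <- cm["positive", "negative"]  # false positives (Type I errors)
fn <- cm["negative", "positive"]  # false negatives (Type II errors)

accuracy <- (tp + tn) / sum(cm)
print(accuracy)
```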
Overfitting: Model learns the training data too well, capturing noise along with the
underlying pattern.
o Prevention: Use cross-validation, regularization techniques (L1, L2), and
simpler models.
Underfitting: Model is too simple to capture the underlying pattern of the data.
o Prevention: Increase model complexity, add more features, and reduce
regularization.
9. What is the Central Limit Theorem? Explain the Difference Between
Population and Sample
Central Limit Theorem (CLT): States that the distribution of the sample mean of
independent observations approaches a normal distribution as the sample size becomes
large, regardless of the shape of the population's distribution, provided the population
has finite variance.
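The theorem can be illustrated with a short simulation. The sketch below draws many samples from a strongly skewed population (an exponential distribution with rate 1, chosen here as an assumption) and looks at the distribution of their means. The sample size n and the number of repetitions reps are also illustrative choices.

```r
set.seed(7)

n <- 50       # size of each sample
reps <- 5000  # number of samples drawn from the population

# Each replicate draws one sample and records its mean
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

# The population has mean 1 and standard deviation 1, so the sample means
# should centre near 1 with a spread near 1 / sqrt(n)
mean_of_means <- mean(sample_means)
sd_of_means <- sd(sample_means)
print(c(mean = mean_of_means, sd = sd_of_means, expected_sd = 1 / sqrt(n)))

# The histogram should look roughly bell-shaped despite the skewed population
hist(sample_means, breaks = 40, main = "Distribution of sample means",
     xlab = "Sample mean", col = "steelblue")
```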
Key Components:
Python Libraries:
Machine learning is a subset of artificial intelligence that involves the use of algorithms and
statistical models to enable computers to learn from and make predictions based on data.
Techniques:
15.