
MACHINE LEARNING CT 1

---------------------------------------------------------- 2 MARKS ----------------------------------------------------------


3. State the Central Limit Theorem.
--The Central Limit Theorem (CLT) states that when independent random variables are
added together, their sum tends toward a normal distribution regardless of the original
distribution of the variables themselves. In other words, as the sample size increases, the
distribution of the sample means approaches a normal distribution, regardless of the shape
of the original population distribution.
For example, consider rolling a fair six-sided die. The distribution of outcomes from rolling a
single die follows a uniform distribution. However, if you roll multiple dice and sum the
results, as the number of dice rolled increases, the distribution of the sums will tend to
resemble a normal distribution according to the Central Limit Theorem.
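As a quick illustration, here is a small Python sketch (standard library only) that simulates this dice example; the numbers of dice and trials are arbitrary choices:

```python
import random
from statistics import mean, stdev

def dice_sum_samples(n_dice, n_trials=10_000):
    """Sum n_dice fair six-sided dice, repeated n_trials times."""
    return [sum(random.randint(1, 6) for _ in range(n_dice))
            for _ in range(n_trials)]

# A single die is uniform, but sums of many dice look increasingly
# bell-shaped, as the CLT predicts.
for n in (1, 2, 10, 50):
    sums = dice_sum_samples(n)
    print(f"{n:>2} dice: mean={mean(sums):6.2f}, stdev={stdev(sums):5.2f}")
```

Plotting a histogram of `sums` for each n makes the convergence toward a bell curve visible.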

4. List the steps to estimate the mean using the CLT.


---- Briefly, the steps to estimate the mean using the Central Limit Theorem (CLT) are (a
worked sketch follows the list):
 **Random Sampling**: Collect a random sample from the population of interest.
 **Calculate Sample Mean**: Compute the mean of the sample.
 **Repeat Sampling**: Repeat steps 1 and 2 numerous times to generate a
distribution of sample means.
 **Apply CLT**: As the sample size increases, the distribution of sample means will
approach a normal distribution, regardless of the underlying distribution of the
population.
 **Estimate Population Mean**: Use the properties of the normal distribution to
estimate the population mean, typically by finding the mean of the sample means
and considering its standard error.
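This sketch mirrors the steps on a deliberately skewed population; the exponential population with true mean 5, and the sample sizes, are assumptions made purely for illustration:

```python
import random
from statistics import mean, stdev

random.seed(0)
# Hypothetical skewed population: exponential with true mean 5.
population = [random.expovariate(1 / 5) for _ in range(100_000)]

# Steps 1-3: draw many random samples and record each sample mean.
sample_size, n_repeats = 50, 2_000
sample_means = [mean(random.sample(population, sample_size))
                for _ in range(n_repeats)]

# Steps 4-5: by the CLT, the sample means are approximately normal;
# their mean estimates the population mean, and their spread is the
# standard error of the estimate.
print("estimated mean:", round(mean(sample_means), 3))
print("standard error:", round(stdev(sample_means), 3))
```

The estimated mean should land close to 5 even though the population itself is far from normal.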

5. Define mean, median, and mode for the following set.


--- Mean, median, and mode are defined for a given set of data as follows:

- **Mean**: The mean of a set of data is the sum of all the values in the set divided by the
total number of values. Mathematically, it is calculated as:
Mean = (Sum of all values) / (Number of values)

- **Median**: The median of a set of data is the middle value when the values are arranged
in ascending or descending order. If there is an odd number of values, the median is the
middle one. If there is an even number of values, the median is the average of the two
middle values.
- **Mode**: The mode of a set of data is the value that appears most frequently. It's
possible for a set of data to have no mode (if all values occur with the same frequency), one
mode (if one value occurs more frequently than any other), or multiple modes (if two or
more values occur with the same highest frequency).

For example, consider the set of data: {4, 6, 2, 7, 4, 3, 9, 4}

- Mean = (4 + 6 + 2 + 7 + 4 + 3 + 9 + 4) / 8 = 39 / 8 = 4.875
- To find the median, we first arrange the values in ascending order: {2, 3, 4, 4, 4, 6, 7, 9}.
Since there is an even number of values, the median is the average of the two middle
values: (4 + 4) / 2 = 4.
- The mode is 4, as it appears most frequently (three times) in the set, as checked in the
code below.
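A quick check with Python's standard library statistics module:

```python
from statistics import mean, median, mode

data = [4, 6, 2, 7, 4, 3, 9, 4]
print(mean(data))    # 4.875
print(median(data))  # 4.0 (average of the two middle values, 4 and 4)
print(mode(data))    # 4   (the most frequent value)
```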

6. Examine the mode using the empirical relation, given that the mean and median are 10.5
and 9.6.
---- Karl Pearson's empirical relationship connects the mean, median, and mode of a
moderately skewed distribution:
Mode ≈ 3 × Median − 2 × Mean
Substituting the given values:
Mode ≈ 3(9.6) − 2(10.5) = 28.8 − 21.0 = 7.8
This empirical relationship should not be confused with the empirical rule (the 68-95-99.7
rule), which describes how data in a normal distribution spread around the mean
(approximately 68% of the data within one standard deviation, 95% within two, and 99.7%
within three) and says nothing about the mode.
Comparing the mean and median also supports some general observations about the shape
of the distribution:

 If the mean and median are approximately equal, it suggests that the distribution is
symmetric. In such cases, the mode is likely to be close to the mean and median.
 If the mean is greater than the median, it indicates that the distribution is positively
skewed, meaning that there are some relatively large values pulling the mean to the
right. In such cases, the mode is likely to be less than the mean and median.
 If the mean is less than the median, it indicates that the distribution is negatively
skewed, meaning that there are some relatively small values pulling the mean to the
left. In such cases, the mode is likely to be greater than the mean and median.
 Given the mean of 10.5 and the median of 9.6, the mean is greater than the median,
suggesting a positively skewed distribution. In such a case, the mode is expected to be
less than both the mean and the median.
 The empirical estimate of 7.8 obtained above is consistent with this, as it is less than
both 10.5 and 9.6. Without additional information about the dataset, the exact value of
the mode cannot be determined; the relation gives only an approximation. (A one-line
helper for this calculation is sketched below.)
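A minimal Python helper for Karl Pearson's relation; the function name is our own choice for illustration:

```python
def mode_estimate(mean_val, median_val):
    """Karl Pearson's empirical relation: Mode ~ 3*Median - 2*Mean."""
    return 3 * median_val - 2 * mean_val

# Given mean = 10.5 and median = 9.6:
print(round(mode_estimate(10.5, 9.6), 1))  # 7.8
```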
7. Observe the mean, variance, and standard deviation for the following data:
−2, 4, 5, 6, 8, 17
-- To find the mean, variance, and standard deviation for the given data {−2, 4, 5, 6, 8, 17}:

1. **Mean (Average)**:
The mean is calculated by summing up all the numbers in the dataset and dividing by the
total number of values.

Mean = (-2 + 4 + 5 + 6 + 8 + 17) / 6 = 38 / 6 = 6.33 (approximately)

2. **Variance**:
Variance measures the dispersion or spread of the data points from the mean. It's
calculated by taking the average of the squared differences between each data point and
the mean.

Variance = [(−2−6.33)² + (4−6.33)² + (5−6.33)² + (6−6.33)² + (8−6.33)² + (17−6.33)²] / 6


≈ [(-8.33)² + (-2.33)² + (-1.33)² + (-0.33)² + (1.67)² + (10.67)²] / 6
≈ [69.44 + 5.44 + 1.78 + 0.11 + 2.78 + 113.78] / 6
≈ 193.33 / 6
≈ 32.22

3. **Standard Deviation**:
Standard deviation is the square root of the variance. It measures the average deviation of
data points from the mean.

Standard Deviation = √Variance ≈ √32.22 ≈ 5.68 (approximately)

So, for the given dataset {−2, 4, 5, 6, 8, 17}, the mean is approximately 6.33, the variance is
approximately 32.22, and the standard deviation is approximately 5.68 (verified in the
sketch below).
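Using Python's statistics module, where pvariance and pstdev divide by n, matching the population formulas used above:

```python
from statistics import mean, pvariance, pstdev

data = [-2, 4, 5, 6, 8, 17]
print(round(mean(data), 2))       # 6.33
print(round(pvariance(data), 2))  # 32.22 (population variance, divides by n)
print(round(pstdev(data), 2))     # 5.68
```

If the data were a sample rather than the whole population, variance and stdev (which divide by n − 1) would be used instead.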
8. The average score of boys in
9. The average score of the school in the examination is 71.8. Find the ratio of the number
of boys to the number of girls who appeared in the examination.
10. Explain the confusion matrix with an example.
11. Discuss the methods of handling missing or corrupted data in a data set.
-
12. Explain overfitting and write steps to avoid it.
-
13. Explain the training data set versus the test data set.
 Training data set and test data set are both integral components of the machine
learning workflow, used for developing and evaluating models. Here's a brief
explanation of each:

 **Training Data Set**: This is the portion of the data used to train the machine
learning model. It typically consists of a set of input-output pairs where the input is
the features or attributes of the data, and the output is the target variable or label.
During the training process, the model learns from this data by adjusting its internal
parameters to minimize a predefined loss or error function.

 **Test Data Set**: Once the model is trained, it needs to be evaluated to assess its
performance on unseen data. The test data set is used for this purpose. It contains
examples that were not seen by the model during training. By applying the trained
model to the test data set, its generalization ability can be assessed, providing
insights into how well it will perform on new, unseen data.

 In summary, the training data set is used to teach the model, while the test data set
is used to assess its performance and generalization capabilities. This separation
helps ensure that the model is not overfitting to the training data and can make
accurate predictions on new, unseen data, as the sketch below illustrates.
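A minimal split-and-evaluate example, assuming scikit-learn is available; the iris dataset, the 80/20 split, and logistic regression are arbitrary illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Hold out 20% of the data as an unseen test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# The model learns only from the training set.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Accuracy on data the model has never seen estimates generalization.
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy: ", model.score(X_test, y_test))
```

A large gap between training and test accuracy is a common symptom of overfitting.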

14. Describe the three stages of building a model in machine learning.


 The three stages in building a machine learning model are:

 **Data Preprocessing**: This stage involves cleaning and preparing the data for
training. It includes tasks such as handling missing values, scaling features, encoding
categorical variables, and splitting the data into training and test sets.

 **Model Training**: In this stage, a machine learning algorithm is selected and
trained on the preprocessed training data. The model learns patterns and
relationships within the data to make predictions or classifications. Training involves
optimizing the model's parameters to minimize a predefined loss function.

 **Model Evaluation and Tuning**: Once the model is trained, it needs to be
evaluated to assess its performance on unseen data. This stage involves using the
test data set to measure metrics such as accuracy, precision, recall, or F1-score.
Based on the evaluation results, the model may be fine-tuned by adjusting
hyperparameters or trying different algorithms to improve its performance.
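A compact sketch of the three stages, assuming scikit-learn; the dataset and algorithm are arbitrary illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Stage 1: data preprocessing -- split the data and scale the features.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_train)  # fit the scaler on training data only
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Stage 2: model training -- fit the chosen algorithm to the training set.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Stage 3: evaluation and tuning -- measure precision, recall, and
# F1-score on the held-out test set before adjusting hyperparameters.
print(classification_report(y_test, model.predict(X_test)))
```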
15. Define semi-supervised machine learning.
Semi-supervised machine learning is a type of learning where a model is trained on a
combination of labeled data (data with known outcomes) and unlabeled data (data without
known outcomes).
 **Combines labeled and unlabeled data**: Utilizes both data with known outcomes
(labeled) and data without known outcomes (unlabeled).
 **Leverages unlabeled data**: Capitalizes on the abundance of unlabeled data to
improve model performance.
 **Reduces reliance on labeled data**: Allows for more efficient use of labeled data,
as labeled data can be scarce or expensive to obtain.
 **Common techniques**: Techniques like self-training, co-training, and pseudo-
labeling are commonly used in semi-supervised learning to leverage both labeled
and unlabeled data effectively (a self-training sketch follows this list).
 **Applications**: Semi-supervised learning is useful in scenarios where obtaining
labeled data is difficult or costly, such as in certain domains like healthcare, where
labeled data may be limited.
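A minimal self-training sketch using scikit-learn's SelfTrainingClassifier; hiding 70% of the iris labels and the 0.9 confidence threshold are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = load_iris(return_X_y=True)

# Simulate scarce labels: hide 70% of them (unlabeled points are marked -1).
rng = np.random.default_rng(0)
y_partial = y.copy()
y_partial[rng.random(len(y)) < 0.7] = -1

# Self-training: the base model pseudo-labels its own confident
# predictions on the unlabeled points and retrains on the enlarged set.
model = SelfTrainingClassifier(LogisticRegression(max_iter=1000),
                               threshold=0.9)
model.fit(X, y_partial)
print("accuracy against all true labels:", round(model.score(X, y), 3))
```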
16. Criticize the statement: "accuracy is not a good measure for the classification process."
 The statement "accuracy is not a good measure for the classification process" is valid in
certain contexts, for the reasons summarized below:
 **Class Imbalance**: Accuracy can be misleading when dealing with imbalanced
datasets, where one class is significantly more prevalent than others. In such cases, a
classifier can achieve high accuracy by simply predicting the majority class, while
performing poorly on minority classes.
 **Misclassification Costs**: Accuracy treats all misclassifications equally, which may
not be suitable for applications where different types of errors have different costs.
For instance, in medical diagnosis, a false negative (missing a disease) can be more
critical than a false positive (incorrectly diagnosing a healthy individual).
 **Incomplete Evaluation**: Accuracy provides a single, overall measure of
performance, but it doesn't provide insights into a model's behavior across different
classes or under varying conditions. It may overlook specific weaknesses or biases in
the model.
 **Inadequate for Probabilistic Models**: For probabilistic classifiers, accuracy
doesn't consider the confidence of predictions. A model might be accurate overall
but still provide uncertain or unreliable predictions for certain instances.
 **Context Dependency**: The appropriateness of accuracy as a measure depends
on the specific problem and its requirements. In some cases, other metrics like
precision, recall, F1-score, or area under the ROC curve (AUC-ROC) may provide
more informative evaluations tailored to the specific needs of the task.
In short, accuracy alone does not give the full picture of a model's performance. In
imbalanced datasets where one class is much more prevalent than the others, a highly
accurate model may simply be predicting the majority class every time while completely
missing the minority class. Metrics such as precision, recall, F1-score, or AUC-ROC take the
true positive, false positive, true negative, and false negative rates into account, providing a
more nuanced view of performance across classes and error types, as the example below
demonstrates.
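A toy example, assuming scikit-learn's metrics are available; the 95/5 class split is made up for illustration:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Toy imbalanced labels: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a "model" that always predicts the majority class

print("accuracy: ", accuracy_score(y_true, y_pred))                   # 0.95
print("precision:", precision_score(y_true, y_pred, zero_division=0)) # 0.0
print("recall:   ", recall_score(y_true, y_pred, zero_division=0))    # 0.0
print("f1:       ", f1_score(y_true, y_pred, zero_division=0))        # 0.0
```

Accuracy looks excellent at 95%, yet the classifier never detects a single positive case.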
17. Demonstrate the use of classification versus regression.
 Classification and regression are two fundamental types of supervised learning tasks,
each suited to different kinds of problems. Here's a brief comparison with an
example:
 **Classification**: Classification is used when the target variable is categorical,
meaning it belongs to a discrete set of classes or categories. The goal is to predict the
class label of new instances based on the input features. Examples include predicting
whether an email is spam or not, classifying images of animals into different species,
or identifying the sentiment of a text (positive, negative, or neutral).
 Example: Suppose you have a dataset containing information about various fruits
such as their color, size, and texture, and you want to predict whether a fruit is an
apple, orange, or banana based on these features. This is a classification problem
because the target variable (fruit type) consists of distinct categories.
 **Regression**: Regression, on the other hand, is used when the target variable is
continuous, meaning it can take any numerical value within a range. The objective is
to predict a numerical value based on input features. Examples include predicting
house prices based on features like area, number of bedrooms, and location,
forecasting stock prices, or estimating the temperature based on weather
conditions.
 Example: Consider a dataset containing information about houses, such as their size,
number of bedrooms, and distance from the city center, where the goal is to predict
the selling price of a house. This is a regression problem because the target variable
(house price) is continuous and can take any numerical value within a range. (A small
sketch contrasting the two tasks follows.)
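This sketch contrasts the two tasks, assuming scikit-learn; the built-in iris (classification) and diabetes (regression) datasets stand in for the fruit and house examples above:

```python
from sklearn.datasets import load_iris, load_diabetes
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification: the target is one of a few discrete classes.
Xc, yc = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(Xc, yc)
print("predicted class:", clf.predict(Xc[:1]))  # a category label

# Regression: the target is a continuous numerical value.
Xr, yr = load_diabetes(return_X_y=True)
reg = DecisionTreeRegressor(random_state=0).fit(Xr, yr)
print("predicted value:", reg.predict(Xr[:1]))  # a real number
```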
