Machine Learning
- **Mean**: The mean of a set of data is the sum of all the values in the set divided by the
total number of values. Mathematically, it is calculated as:
Mean = (Sum of all values) / (Number of values)
- **Median**: The median of a set of data is the middle value when the values are arranged
in ascending or descending order. If there is an odd number of values, the median is the
middle one. If there is an even number of values, the median is the average of the two
middle values.
- **Mode**: The mode of a set of data is the value that appears most frequently. It's
possible for a set of data to have no mode (if all values occur with the same frequency), one
mode (if one value occurs more frequently than any other), or multiple modes (if two or
more values occur with the same highest frequency).
- Mean = (4 + 6 + 2 + 7 + 4 + 3 + 9 + 4) / 8 = 39 / 8 = 4.875
- To find the median, we first arrange the values in ascending order: {2, 3, 4, 4, 4, 6, 7, 9}.
The set has an even number of values (8), so the median is the average of the two middle
values (the 4th and 5th): (4 + 4) / 2 = 4.
- The mode is 4, as it appears most frequently in the set.
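The three results above can be checked with a short script. This is a minimal sketch using Python's standard `statistics` module; the variable names are chosen here for illustration only.

```python
import statistics

# Data set taken from the worked example above.
values = [4, 6, 2, 7, 4, 3, 9, 4]

mean = statistics.mean(values)        # sum of the values divided by their count
median = statistics.median(values)    # averages the two middle values when the count is even
modes = statistics.multimode(values)  # every value tied for the highest frequency

print(f"mean={mean}, median={median}, modes={modes}")
```

`multimode` returns a list, so it also covers data sets with more than one mode.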
The empirical rule, also known as the 68-95-99.7 rule, primarily deals with the spread of
data around the mean in a normal distribution. It doesn't directly provide insights into the
mode of a distribution.
However, given the mean and median of a dataset, we can make some general
observations:
If the mean and median are approximately equal, it suggests that the distribution is
symmetric. In such cases, the mode is likely to be close to the mean and median.
If the mean is greater than the median, it indicates that the distribution is positively
skewed, meaning that there are some relatively large values pulling the mean to the
right. In such cases, the mode is likely to be less than the mean and median.
If the mean is less than the median, it indicates that the distribution is negatively
skewed, meaning that there are some relatively small values pulling the mean to the
left. In such cases, the mode is likely to be greater than the mean and median.
Given the mean of 10.5 and the median of 9.6, we can observe that the mean is
greater than the median, suggesting a positively skewed distribution. In such a case,
the mode is likely to be less than both the mean and the median.
Therefore, based on the provided information and the empirical relationship, we can
infer that the mode is likely to be less than 9.6, and hence also less than 10.5. However,
without additional information about the shape of the distribution or the specific dataset,
we cannot determine the exact value of the mode.
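The "empirical relationship" between the three measures is usually taken to be Karl Pearson's approximation, Mode ≈ 3 × Median - 2 × Mean. Assuming that is the relation intended here, the sketch below applies it to the given mean and median. The function name is hypothetical, and the result is only a rough estimate that is reasonable for moderately skewed, single-peaked distributions, not an exact value.

```python
def pearson_mode_estimate(mean: float, median: float) -> float:
    """Rough mode estimate from Pearson's empirical relation: 3 * median - 2 * mean."""
    return 3 * median - 2 * mean

# Values given in the question.
estimate = pearson_mode_estimate(mean=10.5, median=9.6)

# 3 * 9.6 - 2 * 10.5 = 28.8 - 21.0, i.e. about 7.8
print(f"estimated mode is about {estimate:.1f}")
```

The estimate of about 7.8 is below both the median and the mean, which agrees with the positive skew described above.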
7. Find the mean, variance, and standard deviation for the following data:
-2, 4, 5, 6, 8, 17
To find the mean, variance, and standard deviation for the given data {-2, 4, 5, 6, 8, 17}:
1. **Mean (Average)**:
The mean is calculated by summing up all the numbers in the dataset and dividing by the
total number of values.
2. **Variance**:
Variance measures the dispersion or spread of the data points from the mean. It's
calculated by taking the average of the squared differences between each data point and
the mean.
3. **Standard Deviation**:
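Standard deviation is the square root of the variance. It measures how far data points
typically lie from the mean, in the same units as the data.
So, for the given dataset {-2, 4, 5, 6, 8, 17}, the mean is 38 / 6 ≈ 6.33. The squared
differences from the mean sum to approximately 193.33, so the variance is 193.33 / 6 ≈ 32.22
and the standard deviation is √32.22 ≈ 5.68. These are population values (dividing by n = 6).
Dividing by n - 1 = 5 instead gives a sample variance of approximately 38.67 and a sample
standard deviation of approximately 6.22.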
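The calculation can be reproduced with Python's standard `statistics` module. This is a minimal sketch; it computes both the population form (divide by n), which matches the definition of variance given above, and the sample form (divide by n - 1) for comparison.

```python
import statistics

data = [-2, 4, 5, 6, 8, 17]

mean = statistics.mean(data)             # 38 / 6

# Population statistics: average of the squared differences from the mean.
pop_var = statistics.pvariance(data)
pop_std = statistics.pstdev(data)

# Sample statistics: divide by n - 1 instead of n.
sample_var = statistics.variance(data)
sample_std = statistics.stdev(data)

print(f"mean={mean:.2f}")
print(f"population variance={pop_var:.2f}, population std={pop_std:.2f}")
print(f"sample variance={sample_var:.2f}, sample std={sample_std:.2f}")
```

Which form to report depends on whether the six values are treated as the whole population or as a sample drawn from a larger one.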
8. The average score of boys in
9. The average score of the school in the examination is 71.8. Find the ratio of the
number of boys to the number of girls who appeared in the examination.
10. Explain the confusion matrix with an example.
11. Discuss the methods of handling missing or corrupted data in a data set.
12. Explain overfitting and write the steps to avoid it.
13. Explain the training data set versus the test data set.
Training data set and test data set are both integral components of the machine
learning workflow, used for developing and evaluating models. Here's a brief
explanation of each:
**Training Data Set**: This is the portion of the data used to train the machine
learning model. It typically consists of a set of input-output pairs where the input is
the features or attributes of the data, and the output is the target variable or label.
During the training process, the model learns from this data by adjusting its internal
parameters to minimize a predefined loss or error function.
**Test Data Set**: Once the model is trained, it needs to be evaluated to assess its
performance on unseen data. The test data set is used for this purpose. It contains
examples that were not seen by the model during training. By applying the trained
model to the test data set, its generalization ability can be assessed, providing
insights into how well it will perform on new, unseen data.
In summary, the training data set is used to teach the model, while the test data set
is used to assess its performance and generalization capabilities. Keeping the two sets
separate makes overfitting visible: a model that scores well on the training data but
poorly on the test data has memorized the training examples rather than learned to make
accurate predictions on new, unseen data.
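The separation described above can be sketched in a few lines of plain Python. This is an illustrative example only: the function name, the 80/20 ratio, the fixed seed, and the toy data are all assumptions, not part of the original answer.

```python
import random

def train_test_split(samples, test_fraction=0.2, seed=0):
    """Shuffle a copy of `samples` and split it into (train, test) lists."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(samples)       # copy, so the caller's data is not reordered
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical input-output pairs (feature x, label y), for illustration only.
dataset = [(x, 2 * x + 1) for x in range(10)]

train, test = train_test_split(dataset, test_fraction=0.2)
print(len(train), len(test))
```

Shuffling before splitting avoids a test set that contains only the last rows of an ordered file. The model is fitted on `train` only, and `test` is held back until evaluation.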
**Data Preprocessing**: This stage involves cleaning and preparing the data for
training. It includes tasks such as handling missing values, scaling features, encoding
categorical variables, and splitting the data into training and test sets.
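The preprocessing tasks listed above can be illustrated on a tiny table. This is a minimal sketch in plain Python under stated assumptions: the records are made up, mean imputation and min-max scaling are just one common choice for each task, and splitting into training and test sets is left out for brevity.

```python
# Hypothetical records (age, city); None marks a missing value.
rows = [(25.0, "Pune"), (None, "Delhi"), (35.0, "Pune"), (45.0, "Mumbai")]

# 1. Handle missing values: replace a missing age with the mean of the observed ages.
observed = [age for age, _ in rows if age is not None]
mean_age = sum(observed) / len(observed)
ages = [age if age is not None else mean_age for age, _ in rows]

# 2. Scale features: min-max scaling maps the ages into the range [0, 1].
lo, hi = min(ages), max(ages)
scaled_ages = [(a - lo) / (hi - lo) for a in ages]

# 3. Encode categorical variables: one-hot encode the city column.
categories = sorted({city for _, city in rows})
one_hot = [[1 if city == c else 0 for c in categories] for _, city in rows]

# Combine into one numeric feature vector per record.
features = [[s] + flags for s, flags in zip(scaled_ages, one_hot)]
for row in features:
    print(row)
```

In practice the imputation mean and the scaling range are computed from the training set only and then reused on the test set, so that no information from the test data leaks into training.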