ML 2022
subsets. This can be used to identify the features in a dataset that are most relevant to the target
variable. Information gain is calculated as follows:
IG(S, A) = Entropy(S) - Σi (|Si| / |S|) * Entropy(Si)
Where:
Si is the i-th subset of S produced by splitting on feature A, and |Si| / |S| is the fraction of instances that fall into that subset
Entropy is a measure of the impurity of a set of data. A set of data is considered to be impure if it
contains a mixture of different classes. Entropy is calculated as follows:
Entropy(S) = - Σi pi * log2(pi)
Where:
pi is the proportion of instances in S that belong to class i
b) Gini impurity and entropy are both measures of impurity, but they are calculated differently. Gini
impurity is calculated as follows:
Gini(S) = 1 - Σi pi^2
Where:
pi is the proportion of instances in S that belong to class i
Gini impurity is a measure of the probability of misclassifying a randomly chosen instance from a set of
data if it were labeled according to the set's class distribution. A random instance drawn from a set with
high Gini impurity is more likely to be misclassified. Entropy is a measure of the average amount of
information needed to identify the class of a random instance from a set of data. A set of data with high
entropy is more difficult to classify.
In general, information gain and Gini impurity are used to select the best feature to split a set of data
into smaller subsets. The feature with the highest information gain (equivalently, the largest reduction in
Gini impurity) is typically chosen. However, there is no guarantee that this feature will
produce the best classification model. In practice, it is often necessary to try different features and
evaluate the performance of the resulting models.
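To make these formulas concrete, here is a minimal Python sketch of entropy, Gini impurity, and information gain; the labels and the split shown are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    # Entropy(S) = -sum(pi * log2(pi)) over the class proportions pi
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    # Gini(S) = 1 - sum(pi^2) over the class proportions pi
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, subsets):
    # IG = Entropy(parent) - weighted average entropy of the subsets
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in subsets)

# Hypothetical example: 10 labels split into two subsets by some feature
parent = ["yes"] * 6 + ["no"] * 4
subsets = [["yes"] * 5 + ["no"], ["yes"] + ["no"] * 3]
print(entropy(parent))                    # ≈ 0.971
print(gini(parent))                       # 0.48
print(information_gain(parent, subsets))  # > 0: the split reduces impurity
```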
1. Image Recognition
Image recognition is the ability of a computer to identify objects in an image or video. This technology is
used in a wide variety of applications, including:
Facial recognition: Facial recognition is used to identify people in photos and videos. It is used for
security purposes, such as unlocking smartphones and accessing secure buildings.
Object detection: Object detection is used to identify and locate objects in images and videos. It is used
for tasks such as self-driving cars, robotic vision, and medical imaging.
Image classification: Image classification is used to classify images into a predefined set of categories. It
is used for tasks such as product categorization, spam filtering, and medical diagnosis.
2. Natural Language Processing
Natural language processing (NLP) is the ability of a computer to understand and process human
language. This technology is used in a wide variety of applications, including:
Machine translation: Machine translation is used to translate text from one language to another. It is
used for tasks such as translating websites, documents, and conversations.
Speech recognition: Speech recognition is used to convert spoken language into text. It is used for tasks
such as voice assistants, dictation software, and transcription services.
Natural language generation: Natural language generation is used to generate human-like text. It is used
for tasks such as writing news articles, generating marketing copy, and creating chatbots.
These are just two examples of the many applications of machine learning. Machine learning is a rapidly
growing field with a wide range of potential applications. It is likely to have a profound impact on our
lives in the years to come.
i. This task is a **classification problem**. In classification problems, the goal is to predict a discrete
class label, such as whether a student is likely to succeed in A levels or not. In this case, the school wants
to classify students into two groups: those who are likely to succeed in A levels and those who are not.
ii. The main steps involved in developing a learning algorithm for prefect selection are listed below, with a
short code sketch after the list:
1. **Data collection and preprocessing:** The school should collect data about learners' academic
grades from Forms 1 to 4, as well as any other relevant data, such as attendance records and
extracurricular activities. The data should then be preprocessed to clean it and prepare it for analysis.
2. **Feature engineering:** The school should identify the features that are most relevant to student
success in A levels. These features may include academic grades, attendance records, standardized test
scores, and extracurricular activities.
3. **Model selection and training:** The school should select a classification algorithm that is
appropriate for the task. The algorithm should then be trained on the preprocessed data.
4. **Model evaluation:** The school should evaluate the performance of the trained model on a held-
out test set. This will help to ensure that the model generalizes well to new data.
5. **Deployment:** The school should deploy the trained model to make predictions about which
students are likely to succeed in A levels.
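As a minimal sketch of steps 1 to 4 (step 5 would wrap the trained model in whatever system the school uses), assuming a hypothetical CSV file `learners.csv` with made-up column names:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# 1. Data collection and preprocessing (drop rows with missing values)
df = pd.read_csv("learners.csv").dropna()

# 2. Feature engineering: hypothetical columns for grades and attendance
X = df[["form1_avg", "form2_avg", "form3_avg", "form4_avg", "attendance"]]
y = df["succeeded_a_level"]  # hypothetical binary label

# 3. Model selection and training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 4. Model evaluation on the held-out test set
print(classification_report(y_test, model.predict(X_test)))
```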
iii. Bias in the data can affect the output of the learning algorithm in several ways. For example, if the
data is not representative of the population of students who apply to the school, the model may make
biased predictions. Additionally, if the data is inaccurate or incomplete, the model may make inaccurate
predictions.
Here are some specific examples of how bias in the data could affect the output of the learning
algorithm in this case:
* If the data is collected from a sample of students that is not representative of the entire student
population, the model may make biased predictions. For example, if the data is collected from a sample
of students that is mostly from wealthy families, the model may predict that students from wealthy
families are more likely to succeed in A levels than students from less wealthy families.
* If the data is inaccurate or incomplete, the model may make inaccurate predictions. For example, if the
data includes incorrect grades or attendance records, the model may make inaccurate predictions about
which students are likely to succeed in A levels.
The school can take steps to mitigate bias in the data by collecting data from a representative sample of
students and by carefully checking the data for accuracy and completeness.
a) Data cleaning
Data cleaning, also known as data cleansing or data scrubbing, is the process of identifying and
correcting or removing corrupt, incomplete, inaccurate, or irrelevant data within a dataset. It involves
detecting errors or inconsistencies in data and then changing, updating, or removing data to ensure its
accuracy, consistency, and usability. Data cleaning is an essential step in the data preparation process, as
it improves the quality of data and makes it more reliable for analysis and modeling.
Training a machine learning model on dirty data can cause several problems:
Inaccurate results: If the data contains errors or inconsistencies, the model will not learn the true
patterns in the data, leading to inaccurate predictions or classifications.
Overfitting: Dirty data can cause a model to overfit, meaning it memorizes the training data too well and
fails to generalize to new data.
Biased results: If the data is biased, the model will also be biased, leading to unfair or discriminatory
predictions.
Data cleaning can help mitigate these problems by removing errors, inconsistencies, and irrelevant data
from the dataset. This can improve the accuracy and reliability of the model and make it more
generalizable to new data.
Here are some specific reasons why data cleaning is important for model training:
Improves model accuracy: By removing errors and inconsistencies from the data, data cleaning can help
ensure that the model is learning from accurate information. This can lead to more accurate predictions
and classifications.
Reduces overfitting: Overfitting occurs when a model learns the training data too well and fails to
generalize to new data. Data cleaning can help reduce overfitting by removing noise and irrelevant data
from the training set.
Mitigates bias: Bias can occur when the training data is not representative of the population that the
model will be used on. Data cleaning can help mitigate bias by identifying and removing biased data
points.
Improves model interpretability: Data cleaning can help make models more interpretable by removing
noise and irrelevant data. This can make it easier to understand how the model is making its predictions.
In conclusion, data cleaning is an essential step in the machine learning process. By ensuring that the
data is accurate, consistent, and relevant, data cleaning can improve the accuracy, reliability, and
generalizability of machine learning models.
import pandas as pd
data = {"name": ["Alice", "Bob", "Bob", "Carol"], "grade": [75, 62, 62, None]}  # hypothetical sample with a duplicate row and a missing value
df = pd.DataFrame(data)
print(df.to_string())
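Building on the small hypothetical DataFrame above, a minimal pandas cleaning sketch might remove the duplicate row and handle the missing grade:

```python
# Drop exact duplicate rows, then remove rows with missing values
df_clean = df.drop_duplicates().dropna()
print(df_clean.to_string())

# Alternative: impute the missing grade with the column mean instead of dropping it
df_imputed = df.drop_duplicates().fillna({"grade": df["grade"].mean()})
print(df_imputed.to_string())
```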
c) Skewness is a measure of the asymmetry of a probability distribution. A
distribution is said to be skewed if one side of the distribution has a
longer tail than the other side. The skewness of a normal distribution is
0. A positive skewness indicates that the distribution is skewed to the
right, while a negative skewness indicates that the distribution is skewed
to the left.
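As a minimal sketch, pandas' built-in skew() estimates sample skewness; the two series below are hypothetical:

```python
import pandas as pd

symmetric = pd.Series([2, 3, 4, 5, 6, 7, 8])          # evenly spread around 5
right_skewed = pd.Series([1, 1, 2, 2, 3, 4, 10, 25])  # long tail to the right

print(symmetric.skew())     # 0.0 for a symmetric sample
print(right_skewed.skew())  # positive: skewed to the right
```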
[Diagram: a curve f(x) plotted against x, with a dot marking the algorithm's current position and an arrow showing the descent step toward the minimum]
In this diagram, the curve represents the function that we are trying to minimize. The dot represents the
current position of the algorithm. The arrow shows the direction of the update: the gradient of the
function is computed at the current position, and the algorithm takes a small step in the direction
opposite to the gradient (downhill), which moves it closer to the minimum of the function.
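As a minimal sketch of this procedure, assuming the simple function f(x) = x^2 for illustration:

```python
# Gradient descent on f(x) = x**2, whose minimum is at x = 0
def grad_f(x):
    return 2 * x  # derivative of x**2

x = 5.0               # starting position (the "dot" in the diagram)
learning_rate = 0.1   # step size
for _ in range(100):
    x -= learning_rate * grad_f(x)  # step opposite the gradient (downhill)

print(x)  # very close to 0
```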
a) Applying Logistic Regression and Evaluation
Logistic regression is a statistical model that can be used to predict the probability of a
binary outcome. In this case, the binary outcome is whether or not a message is
cyberbullying. To apply logistic regression to this task, you would first need to represent
the messages as features. This could be done by using a technique called bag-of-
words, which converts each message into a vector of counts of the words in the
message.
Once you have represented the messages as features, you could train a logistic
regression model on the labeled data. The model would learn the weights of the
features, which represent the importance of each feature in predicting whether or not a
message is cyberbullying.
After the model has been trained, you could evaluate its performance on a held-out test
set. The test set should be a sample of the labeled data that the model has not seen
before. To evaluate the model, you could calculate the accuracy, precision, and recall.
Accuracy is the proportion of messages that the model correctly classifies. Precision is
the proportion of messages that the model classifies as bullying that are actually
bullying. Recall is the proportion of messages that are bullying that the model classifies
as bullying.
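As a minimal scikit-learn sketch of this workflow (the messages and labels below are hypothetical placeholders for the labeled dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

messages = ["you are great", "nobody likes you loser", "see you at practice",
            "you are worthless", "thanks for the help", "everyone hates you",
            "good luck in the exam", "go away no one wants you here"]
labels = [0, 1, 0, 1, 0, 1, 0, 1]  # 1 = cyberbullying, 0 = OK

X = CountVectorizer().fit_transform(messages)  # bag-of-words count features
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, stratify=labels, random_state=0)

model = LogisticRegression().fit(X_train, y_train)  # learns one weight per word
pred = model.predict(X_test)
print(accuracy_score(y_test, pred))
print(precision_score(y_test, pred, zero_division=0))
print(recall_score(y_test, pred, zero_division=0))
```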
Precision and recall are more useful measures for tasks like cyberbullying detection,
where the cost of false positives and false negatives can be high. A false positive is a
message that the model classifies as bullying that is not actually bullying. A false
negative is a message that is bullying that the model classifies as "OK".
In the case of cyberbullying detection, it is more important to have a high recall than a
high precision. This is because false positives are less costly than false negatives. A
false positive will simply result in a message being flagged for review, while a false
negative could lead to the victim of cyberbullying not receiving the help they need.
A confusion matrix is a table that summarizes the performance of a classification model.
It shows the number of correct and incorrect predictions made by the model for each
class.
Here is a diagram of a confusion matrix:
                     Predicted positive     Predicted negative
Actual positive      True positive (TP)     False negative (FN)
Actual negative      False positive (FP)    True negative (TN)
True positives (TP) are the number of cases that the model correctly classified as
positive.
False positives (FP) are the number of cases that the model incorrectly classified
as positive.
False negatives (FN) are the number of cases that the model incorrectly
classified as negative.
True negatives (TN) are the number of cases that the model correctly classified
as negative.
The accuracy of the model can be calculated using the following formula:
accuracy = (TP + TN) / (TP + TN + FP + FN)
The precision of the model can be calculated using the following formula:
precision = TP / (TP + FP)
The recall of the model can be calculated using the following formula:
recall = TP / (TP + FN)
The F1-score is a measure of the balance between precision and recall. It is calculated
using the following formula:
F1 = 2 * (precision * recall) / (precision + recall)
A confusion matrix is a useful tool for evaluating the performance of a classification
model. It can help to identify areas where the model is making mistakes and can be
used to improve the model's performance.
Here is an example of a confusion matrix for a binary classification problem:
                     Predicted positive     Predicted negative
Actual positive      120                    20
Actual negative      30                     50
In this example, the model correctly classified 120 positive cases and 50 negative
cases. It incorrectly classified 30 negative cases as positive and 20 positive cases as
negative. The accuracy of the model is (120 + 50) / 220 ≈ 77.3%, the precision is
120 / 150 = 80%, and the recall is 120 / 140 ≈ 85.7%. The F1-score is ≈ 82.8%.
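These figures can be checked directly from the formulas above:

```python
# Recomputing the metrics from the example confusion matrix above
TP, FN, FP, TN = 120, 20, 30, 50

accuracy = (TP + TN) / (TP + TN + FP + FN)          # 170 / 220 ≈ 0.773
precision = TP / (TP + FP)                           # 120 / 150 = 0.800
recall = TP / (TP + FN)                              # 120 / 140 ≈ 0.857
f1 = 2 * precision * recall / (precision + recall)   # ≈ 0.828

print(f"accuracy={accuracy:.1%}, precision={precision:.1%}, "
      f"recall={recall:.1%}, F1={f1:.1%}")
```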