ML


1. Data Collection:
First, real-world start-up data is obtained from diverse sources. This dataset
includes features such as funding rounds, investor details, market segment, and
growth metrics.
2. Pre-processing:
First we removed columns with missing values to maintain data integrity. Then we
applied label encoding to transform categorical variables into numerical
equivalents.
3. Feature Selection:
First, we utilized XGBoost for feature selection. Then we retained the top features
with importance scores exceeding 0.1.
4. Hybrid Model Construction:
We constructed hybrid models using combinations of machine learning algorithms:
- Logistic Regression and K-Nearest Neighbours (KNN): accuracy 0.9353
- Random Forest and Naive Bayes: accuracy 0.9458
5. Ensemble Algorithm Development:
First, we developed an ensemble algorithm incorporating Gradient Boosting, Random
Forest, and SVM. We eventually achieved an accuracy of 0.9622 with the ensemble
approach.
6. Cross-Validation:
First, we applied the k-fold cross-validation technique. Then we partitioned the
dataset into k subsets and iteratively trained and tested the models. We obtained a
cross-validation accuracy of 96% with the ensemble algorithm.
7. Summary:
We followed a comprehensive experimental setup involving pre-processing, feature
selection, hybrid model construction, ensemble algorithm development, and
cross-validation. We also ensured the reliability and stability of the predictive
models across diverse datasets and scenarios, and we leveraged insights from
real-world start-up data to enhance predictive capabilities and inform
decision-making processes.
To perform feature selection, we combine a RandomForestClassifier, an
ExtraTreesClassifier, and a GradientBoostingClassifier, as sketched below.
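
As a rough illustration of this feature-selection step, the sketch below averages the importance scores of the three tree-based classifiers and keeps features whose mean importance exceeds the 0.1 threshold mentioned in step 3; the same pattern applies if XGBoost is used instead. The feature DataFrame X and label vector y are placeholders for the start-up data.

```python
# Sketch: combine importance scores from three tree-based classifiers and keep
# features whose mean importance exceeds a threshold (0.1, as in the notes).
import numpy as np
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)

def select_features(X, y, threshold=0.1):
    models = [
        RandomForestClassifier(n_estimators=200, random_state=42),
        ExtraTreesClassifier(n_estimators=200, random_state=42),
        GradientBoostingClassifier(random_state=42),
    ]
    importances = []
    for model in models:
        model.fit(X, y)
        importances.append(model.feature_importances_)
    mean_importance = np.mean(importances, axis=0)
    selected_cols = list(X.columns[mean_importance > threshold])
    return selected_cols, mean_importance

# Usage (X = start-up feature DataFrame, y = success label):
# selected_cols, scores = select_features(X, y, threshold=0.1)
# X_reduced = X[selected_cols]
```
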
----------------------------------------------------------------

Data Set Information: The data was received from the UCI Machine Learning Repository.
The information about the dataset is given below (UCI Machine Learning Repository,
2013). The data set contains 416 liver patient records and 167 non-liver patient
records collected from the North East of Andhra Pradesh, India. The "Dataset" column
is a class label used to divide the records into liver patient (liver disease) or
not (no disease). This data set contains 441 male patient records and 142 female
patient records.

Attribute Information:
• Age of the patient
• Gender of the patient
• Total Bilirubin
• Direct Bilirubin
• Alkaline Phosphatase
• Alamine Aminotransferase
• Aspartate Aminotransferase
• Total Proteins
• Albumin
• Albumin and Globulin Ratio
• Class: field used to split the data into two sets (patient with liver disease, or
no disease)

Before loading the dataset, we should import all the required libraries such as
pandas, numpy, seaborn, the tokenizer, and the label encoder, both to implement the
deep-learning models and to perform the data pre-processing steps. Here, we have
downloaded the dataset from the UCI repository and saved it as
indian_liver_patient.csv, which is then loaded and read as a data frame named data.

5.2 Data Pre-processing & Visualization
While creating our project, the dataset we imported from the repository was not
clean and formatted. Before employing the deep learning models on the data, it is
necessary to clean and format it; hence data pre-processing is required. It is
essentially the process of preparing the raw data and making it ready for the deep
learning model. The following graphs show the number of liver and non-liver disease
cases along with the number of males and females in the dataset.
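
A minimal sketch of the loading and encoding step just described, assuming the file is saved as indian_liver_patient.csv as stated above; the Gender column name follows the attribute list and may differ slightly in the actual CSV headers.

```python
# Sketch: load the liver dataset and label-encode the Gender column.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = pd.read_csv("indian_liver_patient.csv")     # file downloaded from the UCI repository
print(data.shape)

le = LabelEncoder()
data["Gender"] = le.fit_transform(data["Gender"])  # e.g. Female -> 0, Male -> 1
```
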

Observations
By using the command data.describe(), we can note some observations about the
dataset:
• Gender is a non-numerical variable; all other columns are numeric.
• There are 10 features and 1 output, which is the Dataset column.
• The Albumin and Globulin ratio column has four missing values.
• The values of Alkaline_Phosphatase, Alamine_Aminotransferase, and
Aspartate_Aminotransferase, which are int, should be converted to float values for
better accuracy.
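
These observations can be reproduced with a few pandas calls; the column names below follow the attribute list above and are assumptions about the exact CSV headers.

```python
# Sketch: summary statistics, missing-value counts, and int -> float conversion.
print(data.describe())
print(data.isnull().sum())          # shows the 4 missing A/G ratio values

int_cols = ["Alkaline_Phosphatase", "Alamine_Aminotransferase",
            "Aspartate_Aminotransferase"]
data[int_cols] = data[int_cols].astype(float)
```
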

Filling of Missing Values
This is the process of identifying the missing values and filling them with the
mean. For our dataset, the Albumin and Globulin ratio had four missing values, which
are replaced with the mean of that column, which is 0.947. After imputation, the A/G
ratio column has no more null values, as shown in the second figure.
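
A minimal sketch of the mean imputation described above; the column name is assumed from the attribute list.

```python
# Sketch: fill the missing Albumin/Globulin ratio values with the column mean.
col = "Albumin_and_Globulin_Ratio"
data[col] = data[col].fillna(data[col].mean())
print(data[col].isnull().sum())     # 0 after imputation
```
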

Identifying Duplicate Values
Duplicate values were identified; from the observations we can see around 13
duplicate records. However, for a medical dataset duplicate values can legitimately
exist, so we are not dropping any of them.

Resampling
Because of the imbalance in the dataset, where liver-disease patients form the
majority and non-liver-disease patients the minority, we applied SMOTE (Synthetic
Minority Over-sampling Technique), which synthesizes new samples for the minority
class. This helps in obtaining better accuracy when the machine learning models are
applied to the dataset in the Weka tool. We also applied PCA to achieve better
results, and finally made combinations of SMOTE and PCA to compare the accuracy
among various ML algorithms.
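
The notes describe applying SMOTE and PCA in the Weka tool; a Python equivalent using imbalanced-learn and scikit-learn might look like the sketch below, where the 95% retained-variance setting for PCA is an illustrative assumption.

```python
# Sketch: oversample the minority class with SMOTE, then optionally apply PCA.
from imblearn.over_sampling import SMOTE
from sklearn.decomposition import PCA

X = data.drop(columns=["Dataset"])   # "Dataset" is the class label column
y = data["Dataset"]

X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)

pca = PCA(n_components=0.95)         # keep 95% of the variance (illustrative choice)
X_pca = pca.fit_transform(X_res)
```
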

Feature Selection
Feature selection is the process of figuring out which inputs are the best for the
model and checking whether certain inputs can be eliminated. Considering the
dataset, we can see a very high linear relationship between Total and Direct
Bilirubin, and based on this relationship Direct Bilirubin could be dropped.
However, as per medical analysis Direct Bilirubin constitutes almost 10% of the
Total Bilirubin, and this 10% may prove crucial in obtaining higher accuracy for the
model; thus none of the features are removed.
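
The strong Total/Direct Bilirubin relationship mentioned above can be checked with a quick correlation; the column names are assumptions about the CSV headers.

```python
# Sketch: inspect the linear relationship between Total and Direct Bilirubin.
print(data[["Total_Bilirubin", "Direct_Bilirubin"]].corr())
# A coefficient close to 1 confirms the strong linear relationship noted above.
```
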

Train-Test Split
We use the train-test split technique to evaluate the performance of a deep-learning
algorithm. The procedure involves taking a dataset and dividing it into two subsets.
It is a fast and easy procedure to perform, and its results allow us to compare the
performance of deep learning algorithms for our predictive modeling problem. For the
liver disease prediction model, we have used 80% of the data for training and 20%
for testing.
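
A minimal sketch of the 80/20 split, using the resampled data from the SMOTE sketch above.

```python
# Sketch: 80/20 train-test split for the liver disease prediction model.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_res, y_res, test_size=0.2, random_state=42, stratify=y_res)
```
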

Result Analysis
We have used three hybrid algorithms: CNN+LSTM (99.02%), CNN+GRU (98.38%), and
CNN+RNN (99.48%), achieving accuracy as high as 99.48% using filters like upscaling
and PCA. We also ran various conventional algorithms and obtained the following
accuracies: Naive Bayes 76%, Random Forest 80.26%, Logistic Regression 72%, SVM
76.93%, KNN 76.67%. We used PCA and SMOTE to increase the number of cases in the
dataset in a balanced way, which gave us better accuracies.
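
The exact layer configuration of the CNN+LSTM hybrid is not given in these notes; the Keras sketch below shows one plausible way to combine Conv1D and LSTM layers on the tabular liver features by treating each record as a short one-dimensional sequence. All layer sizes and hyperparameters are assumptions.

```python
# Sketch: a minimal CNN+LSTM hybrid in Keras (layer sizes are assumptions, not
# the notes' actual architecture). Tabular features are reshaped into a sequence.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, LSTM, Dense

n_features = X_train.shape[1]
X_train_seq = np.asarray(X_train).reshape(-1, n_features, 1)
X_test_seq = np.asarray(X_test).reshape(-1, n_features, 1)

model = Sequential([
    Conv1D(32, kernel_size=3, activation="relu", input_shape=(n_features, 1)),
    MaxPooling1D(pool_size=2),
    LSTM(32),
    Dense(16, activation="relu"),
    Dense(1, activation="sigmoid"),   # binary output: liver disease vs no disease
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# The raw Dataset labels are 1/2, so map them to 1/0 before fitting, e.g.:
# model.fit(X_train_seq, (np.asarray(y_train) == 1).astype(int), epochs=50,
#           validation_split=0.1)
```
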
------------------------------------------------------------------------
Convolutional Neural Network (CNN):
CNN is a type of neural network commonly used for image recognition and
classification tasks.
It consists of multiple layers, including convolutional layers, pooling layers, and
fully connected layers.
Convolutional layers apply filters to input images to extract features such as
edges, shapes, and textures.
Pooling layers reduce the spatial dimensions of the feature maps by down-sampling,
helping to capture the most important features while reducing computational
complexity.
CNNs are highly effective for tasks involving spatial data, such as images, due to
their ability to capture spatial hierarchies of features.

Recurrent Neural Network (RNN):


RNN is a type of neural network designed to handle sequential data by maintaining
internal memory.
Unlike feedforward neural networks, which process each input independently, RNNs
process sequences of inputs by feeding the current input and the previous hidden
state into the network.
RNNs are commonly used for tasks such as natural language processing (NLP), time
series analysis, and speech recognition.
However, traditional RNNs suffer from the vanishing gradient problem, which limits
their ability to capture long-range dependencies in sequences.

Long Short-Term Memory (LSTM):


LSTM is a variant of RNN designed to address the vanishing gradient problem and
capture long-term dependencies in sequences.
It introduces specialized memory cells with three gates: input gate, forget gate,
and output gate.
The input gate controls the flow of information into the memory cell, the forget
gate controls the retention of information in the memory cell, and the output gate
controls the flow of information out of the memory cell.
By selectively updating and forgetting information over time, LSTM networks can
effectively capture long-term dependencies in sequential data.

Gated Recurrent Unit (GRU):


GRU is another variant of RNN, similar to LSTM, designed to address the vanishing
gradient problem and improve training efficiency.
It simplifies the architecture of LSTM by combining the forget and input gates into
a single update gate.
GRU has two gates: update gate and reset gate, which control the flow of
information into the memory cell and the flow of information from the previous
hidden state, respectively.
Despite its simpler architecture compared to LSTM, GRU has been shown to achieve
comparable performance in many tasks while requiring fewer parameters and
computations.

Vanishing Gradient Problem


The vanishing gradient problem happens in neural networks when the "feedback"
(gradients) that the network receives during training becomes very small as it is
propagated back through many layers. This weak feedback makes it hard for the
network to learn from its mistakes and improve its performance, especially in deep
networks with many layers.
-------------------------------------------------------------------------
Start-up:
rows and columns: 500 rows and 100 columns
train-test split: 80-20
accuracy (Logistic Regression, K-Nearest Neighbors, Random Forest, Naive Bayes,
Gradient Boosting, and Support Vector Machine): 96%
class balance: successful start-ups 64.6%, unsuccessful ones 35.4%
feature selection: columns with XGBoost importance scores higher than 0.1 were
selected

Liver:
rows and columns: 600 rows and 15 columns
train-test split: 80-20
accuracy: CNN combined with LSTM (99.02%)
liver patients and non-liver patients: 416 liver and 167 non-liver
--------------------------------------------------------------------------------
LR and KNN Hybrid Model
Logistic Regression (LR): A linear model used for binary classification that
predicts the probability of an outcome.
K-Nearest Neighbors (KNN): A non-parametric algorithm that classifies data points
based on the majority class among its nearest neighbors.
Hybrid Model: Combines the strengths of both LR and KNN. For instance, LR can be
used to create a linear boundary, and KNN can refine predictions in areas where LR
is less accurate by considering the local data distribution.
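
One simple way to realise this LR+KNN idea is sketched below: Logistic Regression makes the default prediction, and KNN overrides it wherever LR's predicted probability is not confident. The 0.7 confidence threshold and the variable names (X_train, y_train, X_test) are assumptions for illustration.

```python
# Sketch: hybrid LR+KNN — use LR by default, defer to KNN where LR is uncertain.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

lr_pred = lr.predict(X_test)
knn_pred = knn.predict(X_test)
uncertain = lr.predict_proba(X_test).max(axis=1) < 0.7   # threshold is an assumption

hybrid_pred = np.where(uncertain, knn_pred, lr_pred)
```
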

2. Gradient Boosting, Random Forest, and SVM Ensemble Model


Gradient Boosting: An iterative technique that builds models sequentially, each
correcting the errors of the previous one. It's known for its high accuracy.
Random Forest (RF): An ensemble method that builds multiple decision trees and
merges their results for more robust predictions.
Support Vector Machine (SVM): A powerful classifier that finds the optimal
hyperplane separating different classes.
Ensemble Model: Combining Gradient Boosting, RF, and SVM can leverage the strengths
of each—Gradient Boosting's accuracy, RF's robustness, and SVM's ability to handle
high-dimensional data—leading to better overall performance.
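
A minimal scikit-learn sketch of such an ensemble using soft voting; probability=True on the SVC is needed so it can contribute probability estimates.

```python
# Sketch: voting ensemble of Gradient Boosting, Random Forest, and SVM.
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.svm import SVC

ensemble = VotingClassifier(
    estimators=[
        ("gb", GradientBoostingClassifier(random_state=42)),
        ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
        ("svm", SVC(probability=True, random_state=42)),
    ],
    voting="soft",        # average predicted probabilities across the three models
)
ensemble.fit(X_train, y_train)
print(ensemble.score(X_test, y_test))
```
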

3. Ensemble vs. Hybrid


Ensemble Model: Combines multiple models of the same or different types to improve
performance, usually by voting, averaging, or stacking (e.g., Random Forest,
Gradient Boosting).
Hybrid Model: Integrates different models or algorithms in a way that leverages
their complementary strengths (e.g., combining a linear model with a non-linear
model like LR and KNN).
Key Difference: Ensemble models focus on combining several models to achieve better
results, often of the same type, while hybrid models combine different types of
models or techniques to tackle specific weaknesses.

4. Cross-Validation
Purpose: A technique used to assess the generalizability of a model by dividing the
data into subsets, training the model on some subsets, and validating it on the
others.
Process: Helps in identifying how the model performs on unseen data, reducing the
risk of overfitting.

5. K-Fold Cross-Validation
Procedure: The dataset is divided into k equal-sized folds. The model is trained on
k-1 folds and tested on the remaining fold. This process is repeated k times, with
each fold used exactly once as a test set.
Benefits: Provides a more reliable estimate of model performance by averaging the
results over all folds, thus giving a better sense of the model's ability to
generalize.
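
In scikit-learn, k-fold cross-validation of an estimator (such as the ensemble sketched earlier) reduces to a single call; k=5 here is an illustrative choice.

```python
# Sketch: k-fold cross-validation and the averaged accuracy across folds.
from sklearn.model_selection import cross_val_score

scores = cross_val_score(ensemble, X, y, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```
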
-------------------------------------------------------------------------------------
PCA vs. SMOTE

Principal Component Analysis (PCA):


Purpose: A dimensionality reduction technique that transforms high-dimensional data
into a lower-dimensional space by finding the principal components (directions of
maximum variance).
Use Case: Used when you want to reduce the number of features in a dataset while
retaining as much variance (information) as possible, which can help in improving
model performance and reducing overfitting.

Synthetic Minority Over-sampling Technique (SMOTE):


Purpose: An oversampling technique used to balance imbalanced datasets by
generating synthetic samples for the minority class.
Use Case: Applied when dealing with imbalanced data, where one class has
significantly fewer samples than others. It helps in improving the model's ability
to correctly classify the minority class.
Key Difference: PCA is used for reducing the dimensionality of data, whereas SMOTE
is used for balancing the class distribution in a dataset.

3. CNN, LSTM, GRU, RNN: Why They Are Used for Textual Data

Recurrent Neural Networks (RNN):


Purpose: Designed to handle sequential data by maintaining a memory of previous
inputs, making them suitable for processing text, where word order matters.
Challenge: Traditional RNNs suffer from issues like vanishing gradients, making
them less effective for long sequences.

Long Short-Term Memory (LSTM):


Purpose: A type of RNN designed to overcome the limitations of traditional RNNs by
using gating mechanisms to better capture long-term dependencies in sequences.
Use Case: Frequently used in text processing tasks like language modeling,
translation, and sentiment analysis because they can remember long-term
dependencies in the data.

Gated Recurrent Unit (GRU):


Purpose: A variant of LSTM that simplifies the architecture by combining the forget
and input gates into a single update gate, making it faster to train.
Use Case: Often used when a simpler and faster model is needed, while still
capturing sequential dependencies.

Convolutional Neural Networks (CNN):


Purpose: Originally designed for image processing, CNNs can also be applied to text
by using convolutions to capture local patterns (e.g., n-grams) in the text.
Use Case: Effective for tasks like text classification where local features are
important.

4. Why Use Deep Learning Instead of Logistic Regression and Other ML Models?

Handling Complex Data:


Deep Learning: Excels at capturing complex patterns in large, high-dimensional
datasets, such as images, speech, or text, where traditional ML models like LR
might struggle.
Logistic Regression (LR) and Traditional ML Models: These are often limited to
simpler, linear relationships and may not perform well with complex data or large
feature spaces.

Feature Engineering:
Deep Learning: Automatically learns features from raw data, reducing the need for
extensive manual feature engineering.
Traditional ML Models: Often require significant feature engineering to perform
well, which can be time-consuming and requires domain expertise.

Scalability:
Deep Learning: Scales well with large datasets and can improve performance as more
data is added.
Traditional ML Models: May not scale as effectively with large datasets and might
plateau in performance as more data is added.

Application Areas:
Deep Learning: Preferred for tasks requiring high accuracy and dealing with
unstructured data (e.g., image recognition, natural language processing).
Traditional ML Models: Suitable for structured data with simpler relationships,
where interpretability and speed are priorities.
-------------------------------------------------------------------------------------
label encoding over one hot encoding:
It reduces the dimensionality of the data compared to one-hot encoding, which can
be beneficial in certain cases such as when dealing with high-dimensional data or
when the number of categories is large.
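
A small comparison sketch; the market_segment column is a made-up example used only to show the difference in output dimensionality.

```python
# Sketch: label encoding keeps one column, one-hot encoding adds one per category.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"market_segment": ["fintech", "health", "fintech", "edtech"]})

df["segment_label"] = LabelEncoder().fit_transform(df["market_segment"])  # 1 column
one_hot = pd.get_dummies(df["market_segment"], prefix="segment")          # 3 columns
print(df, one_hot, sep="\n")
```
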

smote oversamples minority class to match majority class

cross validation:
It involves partitioning the dataset into subsets, training the model on some of
these subsets, and evaluating it on the remaining subset(s). This process is
repeated multiple times, with different partitions of the data, and the performance
metrics are averaged over all the runs to provide a more reliable estimate of the
model's performance

Multiple classifiers are combined together so that sensitivity to the
hyperparameters of any single model is reduced.

The initial feature selection technique relies on feature importance scores obtained
from multiple classifiers (Random Forest, Gradient Boosting, and Extra Trees), which
consider the relevance of features for prediction across different models.
On the other hand, RFE with logistic regression specifically evaluates the
importance of features in the context of logistic regression.
Both approaches are combined.
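
A sketch of one way the two selections could be combined; only one tree model is shown for the importance-based side, and the intersection rule (rather than a union) is an assumption.

```python
# Sketch: combine importance-based selection with RFE + logistic regression.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Importance-based side (one classifier shown; the notes use RF, GB, and Extra Trees).
rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)
importance_selected = set(X.columns[rf.feature_importances_ > 0.1])

# RFE side: rank features in the context of logistic regression.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10).fit(X, y)
rfe_selected = set(X.columns[rfe.support_])

combined = importance_selected & rfe_selected   # a union (|) is a looser alternative
```
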

The final_estimator parameter is used to specify the estimator that will be trained
on the predictions of the base estimators. The final estimator in a stacking
ensemble is used to learn how to effectively combine the predictions of multiple
base estimators, potentially improving predictive performance, regularization, and
generalization of the stacked model. It plays a crucial role in the stacking process
by determining the final output of the ensemble.
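
In scikit-learn this corresponds to the final_estimator argument of StackingClassifier, sketched below with an assumed choice of base models.

```python
# Sketch: final_estimator learns how to combine the base estimators' predictions.
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=42)),
                ("svm", SVC(probability=True, random_state=42))],
    final_estimator=LogisticRegression(max_iter=1000),  # trained on base predictions
    cv=5,
)
stack.fit(X_train, y_train)
```
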

Stratified shuffle split (a cross-validation technique): the key idea is to preserve
the class distribution in both the training and testing sets, which helps the model
learn and generalize better, especially for rare classes.
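
A minimal sketch of StratifiedShuffleSplit in scikit-learn; the number of splits and the test size are illustrative.

```python
# Sketch: stratified splits preserve the class proportions in train and test sets.
from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
for train_idx, test_idx in sss.split(X, y):
    X_tr, X_te = X.iloc[train_idx], X.iloc[test_idx]
    y_tr, y_te = y.iloc[train_idx], y.iloc[test_idx]
```
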

Hard Voting:

In hard voting, each base model (classifier) in the ensemble contributes a "vote"
for a class label, and the majority class label is selected as the final
prediction.
For example, if you have three base models and two of them predict class A while
one predicts class B, hard voting would select class A as the final prediction
since it has the majority of votes.
Hard voting is typically used for classifiers that output discrete class labels.
Soft Voting:

In soft voting, instead of counting votes, the ensemble takes into account the
confidence or probability scores assigned to each class label by each base model.
The final prediction is determined by averaging the predicted probabilities across
all base models and selecting the class with the highest average probability.
Soft voting takes into account the confidence of each model's prediction, allowing
more confident models to have a greater influence on the final decision.
It is suitable for classifiers that output probability scores for each class, such
as logistic regression or support vector machines with probability outputs.
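
In scikit-learn the two strategies differ only in the voting argument of VotingClassifier, as in the sketch below; the choice of base models is illustrative.

```python
# Sketch: hard vs soft voting (soft voting needs probability estimates).
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

base = [("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier()),
        ("svm", SVC(probability=True))]

hard_vote = VotingClassifier(estimators=base, voting="hard")  # majority of class labels
soft_vote = VotingClassifier(estimators=base, voting="soft")  # average of probabilities
```
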

SVM refers to the general concept and algorithm of Support Vector Machines, while
SVC specifically refers to the implementation of SVM for classification tasks in
libraries like scikit-learn. SVM can be used for both classification and regression,
while SVC specifically addresses the classification aspect of SVM.
-------------------------------------------------------------------------------------
Here’s a simple explanation of these machine learning algorithms:

1. **Logistic Regression (LR):**


- **Purpose:** Predicts a binary outcome (yes/no, true/false) based on input
features.
- **How it works:** It calculates the probability of an event happening and
places it into one of two categories. It’s like drawing a line to separate two
groups.

2. **K-Nearest Neighbors (KNN):**


- **Purpose:** Classifies data points based on the majority class of their
nearest neighbors.
- **How it works:** Imagine you have a new point and want to classify it. KNN
looks at the ‘k’ closest points around it and assigns the most common class among
them. It’s like asking your closest friends for advice and going with the majority
opinion.

3. **Naive Bayes (NB):**


- **Purpose:** Classifies data based on applying Bayes' theorem with a "naive"
assumption that features are independent.
- **How it works:** It estimates the probability of each possible outcome and
chooses the one with the highest probability. It’s like guessing the weather by
assuming each factor (like clouds or temperature) affects it independently.

4. **Gradient Boosting (GB):**


- **Purpose:** Builds a strong model by combining many weak models, usually
decision trees.
- **How it works:** It starts with a simple model and gradually improves it by
focusing on the mistakes made by previous models. Imagine trying to learn a new
skill by focusing on what you got wrong each time and getting better with each
attempt.

5. **Random Forest (RF):**


- **Purpose:** Creates multiple decision trees and merges them to get a more
accurate and stable prediction.
- **How it works:** It builds many decision trees and combines their results.
It’s like asking multiple experts for their opinions and then choosing the majority
vote.

6. **Support Vector Machine (SVM):**


- **Purpose:** Classifies data by finding the best boundary (or margin) that
separates different classes.
- **How it works:** SVM finds the line or hyperplane that best separates
different classes while keeping the margin between them as wide as possible. It’s
like drawing a line on a map that best separates two territories.
-------------------------------------------------------------------------------------
Here’s a simple explanation of these deep learning architectures:

1. **Convolutional Neural Network (CNN):**


- **Purpose:** Primarily used for analyzing visual data, like images or videos.
- **How it works:** CNNs use special layers called convolutional layers that act
like filters to detect patterns, such as edges, textures, or shapes, in the data.
Imagine looking at a picture and noticing different features like lines, colors,
and objects, which help you recognize what's in the image.

2. **Recurrent Neural Network (RNN):**


- **Purpose:** Designed for sequential data, like time series or text, where the
order of data matters.
- **How it works:** RNNs have loops that allow information to be passed from one
step to the next, making them good at remembering previous inputs. It’s like
reading a sentence where each word helps you understand the next one, so you need
to remember the context.

3. **Long Short-Term Memory (LSTM):**


- **Purpose:** A type of RNN specifically designed to better handle long-term
dependencies in sequences.
- **How it works:** LSTMs have a special memory cell that can keep information
for long periods and gates that control the flow of information in and out of the
cell. It’s like having a notepad to jot down important things as you go along, so
you don’t forget key details later.

4. **Gated Recurrent Unit (GRU):**


- **Purpose:** Similar to LSTM but with a simpler architecture, also used for
sequential data.
- **How it works:** GRUs combine the memory cell and gates into a simpler
structure, making them faster and easier to train than LSTMs while still being
effective at remembering context over time. Think of it as a streamlined version of
LSTM that still remembers important information but with fewer steps involved.
