ML
This dataset
includes features such as funding rounds, investor details, market segment, and
growth metrics.
2. Pre-processing:
First we removed columns with missing values to maintain data integrity. Then we
applied label encoding to transform categorical variables into numerical
equivalents.
3. Feature Selection:
We used XGBoost for feature selection and retained the features with importance
scores exceeding 0.1.
4. Hybrid Model Construction:
We constructed hybrid models using combinations of machine learning algorithms:
• Logistic Regression and K-Nearest Neighbours (KNN): accuracy 0.9353
• Random Forest and Naive Bayes: accuracy 0.9458
5. Ensemble Algorithm Development:
We developed an ensemble algorithm incorporating Gradient Boosting, Random Forest,
and SVM. The ensemble approach achieved an accuracy of 0.9622.
6. Cross-Validation:
We applied k-fold cross-validation: the dataset was partitioned into k subsets, and
the models were iteratively trained and tested on them. The ensemble algorithm
obtained a cross-validation accuracy of 96%.
7. Summary:
The experimental setup comprised pre-processing, feature selection, hybrid model
construction, ensemble algorithm development, and cross-validation. Cross-validation
was used to check the reliability and stability of the predictive models, and
insights from real-world start-up data were used to enhance predictive capabilities
and inform decision-making processes.
Combine RandomForestClassifier, ExtraTreesClassifier, and GradientBoostingClassifier
to do feature selection.
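One way to combine the three tree ensembles for feature selection is to average their importance scores and keep the features above a cut-off. The sketch below assumes scikit-learn, uses synthetic data in place of the real table, and reuses the 0.1 threshold quoted for the XGBoost step; averaging the importances is an illustrative choice, not necessarily the method used in the project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)

# Synthetic stand-in for the real table (hypothetical data, 8 features).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

models = [
    RandomForestClassifier(n_estimators=100, random_state=0),
    ExtraTreesClassifier(n_estimators=100, random_state=0),
    GradientBoostingClassifier(random_state=0),
]

# Each fitted model exposes feature_importances_, normalised to sum to 1.
importances = np.array([m.fit(X, y).feature_importances_ for m in models])
mean_importance = importances.mean(axis=0)

THRESHOLD = 0.1  # cut-off quoted in the notes for the XGBoost step
selected = np.flatnonzero(mean_importance > THRESHOLD)
X_selected = X[:, selected]

print("mean importance:", mean_importance.round(3))
print("selected feature indices:", selected)
```

A fixed threshold interacts with the number of features: with many columns the normalised importances shrink, so 0.1 may need to be lowered.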
----------------------------------------------------------------
Data Set Information: The data was obtained from the UCI Machine Learning Repository
(UCI Machine Learning Repository, 2013). The data set contains 416 liver patient
records and 167 non-liver patient records collected from the North East of Andhra
Pradesh, India. The "Dataset" column is a class label that divides the records into
liver patients (liver disease) and non-liver patients (no disease). The data set
contains 441 male patient records and 142 female patient records.
Observations: Using the command data.describe(), we can note some observations
about the dataset:
• Gender is a non-numerical variable; all other variables are numeric.
• There are 10 features and 1 output, the "Dataset" column.
• The Albumin and Globulin ratio has four missing values.
• The values of Alkaline_Phosphatase, Alamine_Aminotransferase, and
  Aspartate_Aminotransferase, which are integers, should be converted to float
  values for better accuracy.
Resampling: The dataset is imbalanced, with a majority of liver disease patients and
a minority of non-liver disease patients. SMOTE (Synthetic Minority Over-sampling
Technique) addresses this by synthesizing new minority-class samples, interpolating
between existing minority samples and their nearest neighbours. This helps in
obtaining better accuracy when machine learning models are applied to the dataset in
the Weka tool. We also applied PCA to achieve better results, and finally combined
SMOTE and PCA to compare accuracy across various ML algorithms.
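The interpolation idea behind SMOTE can be sketched in a few lines. This is a simplified illustration in Python with NumPy and scikit-learn, not the Weka filter used in the notes and not the imbalanced-learn implementation. The function name `smote_like_oversample` and the toy data are hypothetical; only the class counts (416 and 167) come from the notes.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    # k + 1 because each point is normally returned as its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neighbour_idx = nn.kneighbors(X_min)

    base = rng.integers(0, len(X_min), size=n_new)  # rows to start from
    picked = neighbour_idx[base, rng.integers(1, k + 1, size=n_new)]
    gap = rng.random((n_new, 1))  # position along the connecting segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Hypothetical toy data with the class counts quoted in the notes.
rng = np.random.default_rng(42)
X_majority = rng.normal(loc=0.0, size=(416, 4))
X_minority = rng.normal(loc=2.0, size=(167, 4))

X_synthetic = smote_like_oversample(X_minority, n_new=416 - 167)
X_minority_balanced = np.vstack([X_minority, X_synthetic])
print(X_majority.shape, X_minority_balanced.shape)
```

Oversampling is normally applied to the training split only, so that synthetic rows never leak into the test set.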
Feature Selection: Feature selection is the process of determining which inputs are
the best for the model and checking whether certain inputs can be eliminated. In
this dataset there is a very strong linear relationship between Total and Direct
Bilirubin, so Direct Bilirubin could be dropped. However, as per medical analysis,
Direct Bilirubin constitutes almost 10% of the Total Bilirubin, and this 10% may
prove crucial in obtaining higher accuracy for the model. Thus, none of the features
are removed.
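A linear relationship between two columns is usually quantified with the Pearson correlation coefficient. The sketch below uses made-up values as stand-ins for the two bilirubin columns (the real data is not reproduced here), so the printed coefficient illustrates the check rather than the actual dataset.

```python
import numpy as np

# Hypothetical values standing in for the Total and Direct Bilirubin columns.
rng = np.random.default_rng(0)
total_bilirubin = rng.gamma(shape=2.0, scale=1.5, size=583)
direct_bilirubin = 0.1 * total_bilirubin + rng.normal(scale=0.02, size=583)

# Pearson r: values near +1 or -1 indicate a strong linear relationship.
r = np.corrcoef(total_bilirubin, direct_bilirubin)[0, 1]
print(f"Pearson r = {r:.3f}")
```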
Train-Test Split: The train-test split is a technique for evaluating the performance
of a learning algorithm, here a deep-learning model. The procedure divides a dataset
into two subsets, one for training and one for testing. It is fast and easy to
perform, and its results allow us to compare the performance of deep learning
algorithms on our predictive modeling problem. For the liver disease prediction
model, we used 80% of the data for training and 20% for testing.
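An 80/20 split of this kind could look as follows with scikit-learn. The feature matrix is random placeholder data with the class counts quoted earlier. The `stratify=y` argument is an added assumption (the notes do not say whether the split was stratified); it keeps the class ratio similar in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix with the class counts quoted in the notes:
# 416 liver-patient rows (label 1) and 167 non-liver rows (label 0).
rng = np.random.default_rng(0)
X = rng.normal(size=(583, 10))
y = np.array([1] * 416 + [0] * 167)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
print("share of liver patients:", y_train.mean().round(3), y_test.mean().round(3))
```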
liver
rows and columns: 600 rows and 15 columns
train test split: 80-20
accuracy: CNN combined with LSTM (99.02%)
liver patients and non liver patients: 416 liver and 167 non liver
--------------------------------------------------------------------------------
LR and KNN Hybrid Model
Logistic Regression (LR): A linear model used for binary classification that
predicts the probability of an outcome.
K-Nearest Neighbors (KNN): A non-parametric algorithm that classifies data points
based on the majority class among its nearest neighbors.
Hybrid Model: Combines the strengths of both LR and KNN. For instance, LR can be
used to create a linear boundary, and KNN can refine predictions in areas where LR
is less accurate by considering the local data distribution.
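One possible reading of this hybrid, sketched under assumptions: keep the LR prediction where LR is confident, and defer to KNN elsewhere. The 0.8 confidence cut-off, the toy two-moons data, and the rule itself are illustrative choices; the notes do not specify how the two models were combined.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=600, noise=0.25, random_state=0)  # toy data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

lr = LogisticRegression().fit(X_train, y_train)
knn = KNeighborsClassifier(n_neighbors=7).fit(X_train, y_train)

CONFIDENCE = 0.8  # hypothetical cut-off for "LR is sure enough"
lr_proba = lr.predict_proba(X_test)
lr_confident = lr_proba.max(axis=1) >= CONFIDENCE

# Keep LR's answer where it is confident; defer to KNN elsewhere.
hybrid_pred = np.where(lr_confident, lr.predict(X_test), knn.predict(X_test))

print("LR accuracy:    ", (lr.predict(X_test) == y_test).mean())
print("hybrid accuracy:", (hybrid_pred == y_test).mean())
```

Other combinations are equally plausible, for example feeding both models into a voting or stacking ensemble.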
4. Cross-Validation
Purpose: Assesses how well a model generalizes to unseen data, which helps detect
overfitting.
Process: The data is divided into subsets; the model is trained on some subsets and
validated on the others.
5. K-Fold Cross-Validation
Procedure: The dataset is divided into k equal-sized folds. The model is trained on
k-1 folds and tested on the remaining fold. This process is repeated k times, with
each fold used exactly once as a test set.
Benefits: Provides a more reliable estimate of model performance by averaging the
results over all folds, thus giving a better sense of the model's ability to
generalize.
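The procedure maps directly onto scikit-learn's `KFold` and `cross_val_score`. The sketch below uses synthetic data, k = 5, and a Random Forest as a placeholder model; none of these settings are taken from the notes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

k = 5
cv = KFold(n_splits=k, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)

# One accuracy score per fold; each fold serves as the test set exactly once.
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print("per-fold accuracy:", scores.round(3))
print("mean accuracy:", scores.mean().round(3))
```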
--------------------------------------------------------------------------------
PCA vs. SMOTE
3. CNN, LSTM, GRU, RNN: Why They Are Used for Textual Data
4. Why Use Deep Learning Instead of Logistic Regression and Other ML Models?
Feature Engineering:
Deep Learning: Automatically learns features from raw data, reducing the need for
extensive manual feature engineering.
Traditional ML Models: Often require significant feature engineering to perform
well, which can be time-consuming and requires domain expertise.
Scalability:
Deep Learning: Scales well with large datasets and can improve performance as more
data is added.
Traditional ML Models: May not scale as effectively with large datasets and might
plateau in performance as more data is added.
Application Areas:
Deep Learning: Preferred for tasks requiring high accuracy and dealing with
unstructured data (e.g., image recognition, natural language processing).
Traditional ML Models: Suitable for structured data with simpler relationships,
where interpretability and speed are priorities.
--------------------------------------------------------------------------------
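Label encoding over one-hot encoding:
Label encoding maps each category to a single integer, so it keeps one column per
variable and reduces the dimensionality of the data compared to one-hot encoding.
This can be beneficial when dealing with high-dimensional data or when the number of
categories is large. The trade-off is that the integers imply an ordering that may
not exist in the data.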
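The difference in dimensionality is easy to see on a small example. The category values below are hypothetical. scikit-learn's `LabelEncoder` is used here for brevity, although its documentation recommends it for target labels and `OrdinalEncoder` for input features.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical categorical column with four distinct market segments.
segments = np.array(["fintech", "health", "retail", "fintech", "edtech"])

label_encoded = LabelEncoder().fit_transform(segments)
one_hot = OneHotEncoder().fit_transform(segments.reshape(-1, 1)).toarray()

print(label_encoded.shape)  # a single integer column
print(one_hot.shape)        # one column per distinct category
```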
Cross-validation:
It involves partitioning the dataset into subsets, training the model on some of
these subsets, and evaluating it on the remaining subset(s). This process is
repeated multiple times with different partitions of the data, and the performance
metrics are averaged over all the runs to provide a more reliable estimate of the
model's performance.
The final_estimator parameter specifies the estimator that is trained on the
predictions of the base estimators.
In a stacking ensemble, the final estimator learns how to combine the predictions of
the base estimators effectively, potentially improving the predictive performance,
regularization, and generalization of the stacked model. It determines the final
output of the ensemble.
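In scikit-learn, `final_estimator` is a parameter of `StackingClassifier`. The sketch below stacks Gradient Boosting, Random Forest, and an SVM, the three algorithms named for the ensemble earlier in these notes. The Logistic Regression final estimator, the synthetic data, and the use of stacking at all are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

base_estimators = [
    ("gb", GradientBoostingClassifier(random_state=0)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("svm", SVC(random_state=0)),
]

# The final estimator is fitted on out-of-fold predictions of the base models.
stack = StackingClassifier(
    estimators=base_estimators,
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_train, y_train)
stack_accuracy = stack.score(X_test, y_test)
print("stacking accuracy:", round(stack_accuracy, 3))
```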
Stratified shuffle split (a cross-validation technique): The key idea is to preserve
the class distribution in both the training and testing sets, which helps the model
learn and generalize better, especially for rare classes.
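A small check of how stratified shuffle splitting preserves class shares, using scikit-learn's `StratifiedShuffleSplit` on hypothetical labels with a 90/10 imbalance:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)

sss = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
test_class1_counts = []
for train_idx, test_idx in sss.split(X, y):
    # Count rare-class samples that land in each 20-sample test split.
    test_class1_counts.append(int(y[test_idx].sum()))

print(test_class1_counts)
```

Each test split should hold about 10% rare-class samples, matching the full data set.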
Hard Voting:
In hard voting, each base model (classifier) in the ensemble contributes a "vote"
for a class label, and the majority class label is selected as the final
prediction.
For example, if you have three base models and two of them predict class A while
one predicts class B, hard voting would select class A as the final prediction
since it has the majority of votes.
Hard voting is typically used for classifiers that output discrete class labels.
Soft Voting:
In soft voting, instead of counting votes, the ensemble takes into account the
confidence or probability scores assigned to each class label by each base model.
The final prediction is determined by averaging the predicted probabilities across
all base models and selecting the class with the highest average probability.
Soft voting takes into account the confidence of each model's prediction, allowing
more confident models to have a greater influence on the final decision.
It is suitable for classifiers that output probability scores for each class, such
as logistic regression or support vector machines with probability outputs.
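Both voting modes are available through scikit-learn's `VotingClassifier`. The sketch below compares them on synthetic data; the choice of base models and all settings are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)


def make_estimators():
    # probability=True lets SVC expose predict_proba, which soft voting needs.
    return [
        ("gb", GradientBoostingClassifier(random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("svc", SVC(probability=True, random_state=0)),
    ]


hard = VotingClassifier(estimators=make_estimators(), voting="hard")
soft = VotingClassifier(estimators=make_estimators(), voting="soft")

hard_accuracy = hard.fit(X_train, y_train).score(X_test, y_test)
soft_accuracy = soft.fit(X_train, y_train).score(X_test, y_test)
print("hard voting accuracy:", round(hard_accuracy, 3))
print("soft voting accuracy:", round(soft_accuracy, 3))
```

Which mode scores higher depends on the data and on how well calibrated the base models' probabilities are.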
SVM refers to the general concept and algorithm of Support Vector Machines, whereas
SVC refers specifically to the implementation of SVM for classification tasks in
libraries like scikit-learn. SVM can be used for both classification and regression
(scikit-learn provides SVR for the latter), while SVC addresses only the
classification aspect of SVM.
--------------------------------------------------------------------------------
Here’s a simple explanation of these machine learning algorithms: