Sarumathi Intern18
INTERNSHIP REPORT
ON
“INNOVATION ON PYTHON, MACHINE LEARNING AND AI”
Submitted in the partial fulfilment for the award of degree (21****)
BACHELOR OF ENGINEERING
IN
Submitted by
2023-2024
CERTIFICATE
CERTIFICATION:
ABSTRACT
The problem is to predict whether a user of a free subscription plan will convert to a
paid subscriber or not using machine learning with the Random Forest algorithm. The data
available for analysis includes various user attributes such as demographics, usage patterns,
and activities on the platform. The objective is to build a model that can accurately predict the
conversion of a free user to a paid subscriber.
The prediction model will help the company in improving their conversion rate by
identifying potential users who are likely to convert to paid subscriptions. The model can also
be used to identify factors that are important in predicting conversion and can help the company
to focus on those factors to improve its user engagement and retention strategies.
ACKNOWLEDGEMENT
We express our sincere thanks to our Principal, for providing us adequate facilities to
undertake this Internship.
We would like to thank our Head of Dept, Computer Science, for providing us an
opportunity to carry out the Internship and for his valuable guidance and support.
We express our deep and profound gratitude to our guide, Guide name,
Assistant/Associate Prof, for her keen interest and encouragement at every step in completing
the Internship.
We would like to thank all the faculty members of our department for the support
extended during the course of Internship.
We would like to thank the non-teaching members of our dept, for helping us during the
Internship.
Last but not the least, we would like to thank our parents and friends without whose
constant help, the completion of the Internship would not have been possible.
DECLARATION
I, Sarumathi Sree.S, 3rd year student of Computer Science, SEA College Of Engineering &
Technology, declare that the Internship has been successfully completed, in
“ENTREPRENEURSHIP” under Indoskill platform conducted by Aqmenz Automation
Private Limited Technology. This report is submitted in partial fulfilment of the requirements
for the award of the Bachelor's Degree in Computer Science Engineering, during the academic year
2023-2024.
Date: 07/11/2023
Place: Bangalore
USN: 1SP21CS093
TABLE OF CONTENTS:
1. INTRODUCTION
2. COMPANY PROFILE
3. TOOLS EXPOSED
6. METHODOLOGY
7. CODING
8. TESTING
9. CONCLUSION
10. REFERENCES
1. INTRODUCTION
The global pandemic has drastically changed the way students learn worldwide, and online
learning has become widespread. Students from around the world have suddenly shifted from
classroom learning to online learning.
The goal of this study is to evaluate the effectiveness of the Random Forest algorithm
in predicting the conversion of free plan subscribers to paid subscribers. The results of this
study have implications for subscription-based businesses, as they can use the findings to target
their marketing efforts and improve the conversion rate of free plan subscribers to paid
subscribers. the students and employees to meet the mandatory necessities of future human
resources and skill demands.
We are in the 4th industrial revolution. The technological revolution is disruptive like
never before, hence continuous awareness of the need for up-gradation is essential.
Aqmenz Automation Pvt. Ltd. is working to help and enhance the potential of students and
employees, so that future human resources will be beneficial, purposeful and profitable
to the nation.
1.5 Objectives
• AAPL believes in the Skill India mission and vision; hence our utmost priority is to add
skills to the young generation and make them profitable and productive for the nation.
• We aim to provide Industrial Automation Training Skill module kits to institutions,
universities and college lab facilities at the lowest possible price for the benefit of
technical students.
• Identifying young entrepreneurs and motivating and training them to establish start-ups that
create employment as well as prosperity for the nation.
• Consultation, sourcing and supplying highly skilled manpower to industry for better
efficiency and productivity.
• Providing low-cost and precise industrial automation solutions.
• Very eager to fetch solution for most complex industrial problems in a mode
We have undertaken many industrial projects. Our major clients are BIAL (Bangalore
International Airport Limited), GE (General Electric) and Amics technologies.
We take on all types of automation projects for companies using PLCs, SCADA and embedded
systems. We provide robots and robotic solutions to small and medium scale companies.
Ongoing projects
• Automation related projects
• CNC Machines
• Open-source Custom Robots
• Garment Industry slider Project
The problem is to predict whether a user of a free subscription plan will convert to a
paid subscriber or not using machine learning with the Random Forest algorithm. The data
available for analysis includes various user attributes such as demographics, usage patterns,
and activities on the platform. The objective is to build a model that can accurately predict
the conversion of a free user to a paid subscriber. The prediction model will help the company
in improving their conversion rate by identifying potential users who are likely to convert to
paid subscriptions. The model can also be used to identify factors that are important in
predicting conversion and can help the company to focus on those factors to improve its user
engagement and retention strategies.
3. SYSTEM ANALYSIS
The existing system for predicting whether a Free Plan user would convert to a paid
subscriber or not using machine learning techniques involves various approaches. One of the
most common approaches is the use of logistic regression, where the data is modelled using
a logistic function to predict the probability of conversion.
Support Vector Machines (SVMs) are also commonly used in predicting customer
conversion, where the algorithm tries to find a hyperplane that separates the data into two
classes. SVMs have high accuracy and can handle high-dimensional data, but they may not
be suitable for large datasets due to their high computational complexity.
Random forest is another popular approach that overcomes the limitations of decision
trees by using an ensemble of trees, where each tree is built on a random subset of the data
and a random subset of the features. This approach reduces overfitting and provides better
predictions.
Overall, the existing system for predicting customer conversion using machine learning
techniques involves a variety of approaches, and the choice of the algorithm depends on the
specific characteristics of the data and the problem at hand.
* Accuracy
* No faster mode
* Computational Complexity
For this case study, we have chosen the Real-Time data set from the 365 Data Science
learning platform. There are 11 datasets in the CSV format. The datasets are heavily
imbalanced. There are many attributes in the datasets, and we need to identify which datasets
and features contribute to better accuracy.
4.2 DATA PROFILE
Data Types:
• Minutes watched: Numeric (continuous).
• Number of days engaged with the platform: Numeric (continuous).
• Engaging with the quiz: Boolean (categorical).
Data Range:
• Minutes watched: 0 to infinity.
Data Distribution:
• Minutes watched: The distribution is likely to be skewed to the right, with a few users
watching a lot of minutes and most users watching less.
• Number of days engaged with the platform: The distribution is also likely to be skewed
to the right, with some users engaging with the platform for many days and most users
engaging for fewer days.
• Engaging with the quiz, engaged with the exam, engaged with Q/A hub, and purchasing:
These are categorical variables, so the distribution will be in the form of frequency
counts of True (1) and False (0) values.
Data Quality:
• Missing Values: There are some null values in the dataset, which were filled with mean
values.
• Outliers: There are a few outliers in the minutes watched and the number of days
engaged with platform attributes, as some users may have much higher values than
others.
• Imbalance: The dataset is highly imbalanced, and we used imbalanced learning or
resampling methods to balance the data.
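The two data-quality steps above can be sketched as follows. This is only a minimal illustration, assuming pandas and scikit-learn; the column names `minutes_watched` and `purchased` and the toy values are hypothetical and are not taken from the actual dataset, and random oversampling is just one possible resampling method.

```python
import pandas as pd
from sklearn.utils import resample

# Toy data (assumed): 'purchased' is the imbalanced target (6 zeros, 2 ones).
df = pd.DataFrame({
    "minutes_watched": [5.0, None, 12.0, 3.0, 40.0, 8.0, 60.0, None],
    "purchased":       [0,   0,    0,    0,   0,    0,   1,    1],
})

# Fill missing values with the column mean.
df["minutes_watched"] = df["minutes_watched"].fillna(df["minutes_watched"].mean())

# Randomly oversample the minority class until both classes have equal size.
majority = df[df["purchased"] == 0]
minority = df[df["purchased"] == 1]
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=0
)
balanced = pd.concat([majority, minority_upsampled], ignore_index=True)
print(balanced["purchased"].value_counts())
```

In practice, resampling is usually applied to the training portion only, so that duplicated rows do not leak into the data used for evaluation.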
Data Relationships:
• Minutes watched and the number of days engaged with the platform are likely to be
positively correlated, as users who watch more minutes are likely to engage with the
platform for more days.
• Engaging with quizzes, engaging with exams, and engaging with the Q/A hub are likely
to be correlated with each other, as users who engage with one are more likely to engage
with the others as well.
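Relationships of this kind can be checked with a correlation matrix. The sketch below assumes pandas and NumPy and uses synthetic data in which minutes watched is generated to grow with days engaged; the column names are hypothetical, so it shows only how the check is done, not what the real data contains.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
days = rng.integers(1, 30, size=200)
# Synthetic assumption: minutes grow with engaged days, plus noise.
minutes = days * 10 + rng.normal(0, 20, size=200)

df = pd.DataFrame({"days_engaged": days, "minutes_watched": minutes})
corr = df.corr()  # Pearson correlation by default
print(corr)
```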
Step 1: Merging all Datasets with respect to Student ID, using Outer merge.
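A minimal sketch of this merge step, assuming pandas. The two small tables and the column name `student_id` are hypothetical stand-ins for the CSV files; the real column names may differ.

```python
import pandas as pd

# Hypothetical stand-ins for two of the CSV files.
minutes = pd.DataFrame({"student_id": [1, 2, 3], "minutes_watched": [12.5, 0.0, 48.0]})
quizzes = pd.DataFrame({"student_id": [2, 3, 4], "quiz_engaged": [1, 1, 0]})

# An outer merge keeps every student that appears in either table;
# values missing from one side become NaN.
merged = minutes.merge(quizzes, on="student_id", how="outer")
print(merged)
```

Because an outer merge introduces NaN for students missing from one table, it is followed naturally by a missing-value handling step.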
6. METHODOLOGY
Data collection: The first step is to collect relevant data that can be used to build the model.
The data should include information about the free plan subscribers such as their demographics,
usage patterns, and any other relevant features.
Data pre-processing: The collected data must be pre-processed to handle missing values, deal
with outliers, and convert categorical variables into numerical ones. The pre-processing step is
critical for ensuring the data quality and the model’s accuracy.
Feature selection: The next step is to select the features that will be used to build the model.
This step involves identifying the most important features that significantly impact the target
variable (i.e., whether a free plan subscriber would convert to a paid subscriber).
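One common way to rank features is the impurity-based importance reported by a fitted Random Forest. The sketch below assumes scikit-learn and uses two synthetic features (one that drives the target and one that is pure noise); the feature names are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
informative = rng.normal(size=n)   # drives the target
noise = rng.normal(size=n)         # unrelated to the target
X = np.column_stack([informative, noise])
y = (informative > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ sums to 1; larger values mean the feature was more
# useful for splitting across the trees.
importances = dict(zip(["informative", "noise"], model.feature_importances_))
print(importances)
```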
Splitting the data: The data must be split into two parts: training data and testing data. The
training data will be used to build the model, while the testing data will be used to evaluate its
performance.
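A minimal sketch of the split, assuming scikit-learn. The 80/20 ratio and the use of stratification are assumptions made here for illustration; stratifying keeps the share of converters the same in both parts, which matters when the classes are imbalanced.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)  # toy target with 10% positives

# stratify=y preserves the class ratio in both the training and testing data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(y_train.mean(), y_test.mean())
```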
Model training: The next step is to train the Random Forest model using the training data. The
model will use the selected features to learn the relationship between the features and the target
variable.
Model evaluation: The trained model must be evaluated using the testing data. This step will
provide insights into the model's performance and allow for any necessary adjustments. The
evaluation metrics used to assess the performance of the model may include accuracy, precision
etc.
Model deployment: If the model's performance is satisfactory, it can be deployed in a real-world
setting to make predictions on new data. The model can be used to predict whether a new free
plan subscriber would convert to a paid subscriber.
There are several data models in machine learning that can be used to predict whether
a user of a free subscription plan will convert to a paid subscriber or not.
Logistic Regression: Logistic regression is a commonly used classification algorithm that can
be used to predict the probability of a user converting to a paid subscriber. It works by
modelling the relationship between the independent variables, such as minutes watched, the
number of days engaged with the platform, engagement with the quiz, engagement with the exam,
and engagement with the Q/A hub, and the dependent variable, which is the probability of a
user converting (purchased).
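As a small illustration of predicting a conversion probability, the sketch below fits scikit-learn's LogisticRegression on a single synthetic feature. The data-generating rule (conversion becomes more likely as minutes watched grows) is an assumption made only for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
minutes = rng.exponential(scale=30, size=400)
# Synthetic assumption: conversion becomes more likely as minutes grow.
p = 1 / (1 + np.exp(-(minutes - 30) / 10))
purchased = rng.binomial(1, p)

clf = LogisticRegression(max_iter=1000).fit(minutes.reshape(-1, 1), purchased)

# Column 1 of predict_proba is the estimated probability of class 1 (converted).
probs = clf.predict_proba(np.array([[5.0], [90.0]]))[:, 1]
print(probs)
```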
Decision Trees: Decision trees are another classification algorithm that can be used to predict
whether a user will convert or not. Decision trees work by recursively splitting the data based
on the most significant features until a certain threshold is reached. These splits are based on
features such as minutes watched, number of days engaged with the platform, engagement
with the quiz, engagement with the exam, and engagement with the Q/A hub, with purchase as
the target.
Random Forests: Random forests are an extension of decision trees and work by aggregating
multiple decision trees to improve the accuracy of the prediction. Each decision tree in the
random forest is trained on a random subset of the data and a random subset of the features.
Support Vector Machines (SVMs): SVMs are a powerful classification algorithm that works by
finding the best hyperplane that separates the data into different classes. SVMs can be used to
predict whether a user will convert or not based on their behavior on the platform.
Data preparation: Once the data has been acquired, it needs to be cleaned and preprocessed
before it can be used to train a machine learning model. This might involve removing missing
values, handling outliers, normalizing the data, and converting categorical variables into
numerical ones.
Feature extraction: After the data has been cleaned and preprocessed, relevant features need to
be extracted from the data. Feature extraction involves selecting the most relevant variables
that can help the machine learning algorithm learn patterns in the data. This could involve
techniques such as principal component analysis (PCA), feature scaling, or feature selection.
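A minimal sketch of feature scaling followed by PCA, assuming scikit-learn. The random data and the choice of two components are assumptions for illustration; scaling comes first because PCA is sensitive to the scale of each feature.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Five synthetic features on very different scales.
X = rng.normal(size=(100, 5)) * [1, 10, 100, 1, 1]

X_scaled = StandardScaler().fit_transform(X)   # zero mean, unit variance per feature
pca = PCA(n_components=2).fit(X_scaled)
X_reduced = pca.transform(X_scaled)

print(X_reduced.shape, pca.explained_variance_ratio_)
```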
Split the dataset: Once the data has been preprocessed and features have been extracted, the
dataset needs to be split into training and testing sets. The training set is used to train the
machine learning model, while the testing set is used to evaluate the performance of the model.
Training model: With the data split into training and testing sets, the next step is to train the
machine learning model. This could involve using algorithms such as logistic regression,
decision trees, and support vector machines. Random forest is one of the ensemble methods
that can be used to improve the accuracy of the model.
Evaluation of model: After training the machine learning model, it is important to evaluate its
performance on the testing set. This could involve using metrics such as accuracy, precision,
recall and F1 score. It is important to ensure that the model is not overfitting the training data
and is able to generalize to new data.
Data visualization: Data visualization is an important step in any machine learning project. It
involves using visual tools such as scatter plots, histograms, and heatmaps to explore the data
and identify patterns or relationships. Visualization can help in feature selection and
determining the most important features.
Building front-end interface: Finally, after building the machine learning model and evaluating
its performance, it is important to build a user interface that allows end-users to interact with
the model. This could involve building a web application or a mobile app that provides real-
time predictions based on user input.
6.3 MODEL SELECTION AND EVALUATION
6.3.1 Model selection
We selected Random Forest for this project. Random Forest is a powerful machine learning
algorithm that is often used for classification, regression, and feature selection. It
is an ensemble learning method that combines multiple decision trees to improve the
accuracy and robustness of the prediction. Each decision tree is trained on a random subset of
the data and a random subset of the features, and the final prediction is based on the majority
vote of the individual trees. Random Forest is a popular choice because it is
robust to noise and overfitting, and it can handle both categorical and continuous variables.
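The majority-vote idea can be sketched by hand with individual decision trees. This is an illustrative sketch under assumptions (synthetic data, 25 trees, bootstrap samples of the rows, a random subset of features considered at each split), not the code used in this project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):  # odd number of trees, so a binary vote cannot tie
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample of the rows
    tree = DecisionTreeClassifier(
        max_features="sqrt",  # random subset of features tried at each split
        random_state=int(rng.integers(1_000_000)),
    )
    trees.append(tree.fit(X[idx], y[idx]))

# Each tree votes 0 or 1; the ensemble predicts the class with the most votes.
votes = np.stack([t.predict(X) for t in trees])  # shape (n_trees, n_samples)
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)
train_accuracy = (forest_pred == y).mean()
print("training accuracy:", train_accuracy)
```

Note that scikit-learn's RandomForestClassifier combines its trees by averaging their predicted class probabilities rather than by counting hard votes, so its output can differ slightly from this hand-written version.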
Figure. Random Forest Architecture
However, accuracy can be misleading in situations where the classes are imbalanced,
meaning one class has significantly more examples than the other. In our case, if the majority
of users do not convert to a paid subscription, accuracy may not be the best metric to evaluate
our model.
So, we used evaluation metrics suited to binary classification problems like this one,
including precision, recall, and F1 score.
Precision measures the proportion of correctly predicted positive instances (i.e., users
who converted to a paid subscription) out of all the instances predicted as positive. This metric
is useful when the cost of false positives is high, meaning we want to minimize the number of
false positives. We obtained a precision of 92.22%.
Recall measures the proportion of correctly predicted positive instances out of all the
actual positive instances. This metric is useful when the cost of false negatives is high, meaning
we want to minimize the number of false negatives. We obtained a recall of 86.35%.
F1 score is the harmonic mean of precision and recall and provides a balance between
the two metrics. It is a useful metric when we want to balance both false positives and false
negatives. We obtained an F1 score of 89%.
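The point about imbalanced classes can be illustrated with a small sketch, assuming scikit-learn. The labels are made up: 95 non-converters and 5 converters, with a trivial "model" that always predicts the majority class.

```python
from sklearn.metrics import accuracy_score, recall_score

# Toy labels (assumed): 95 non-converters (0) and 5 converters (1).
y_true = [0] * 95 + [1] * 5
# A trivial "model" that always predicts the majority class.
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)  # share of correct predictions
recall = recall_score(y_true, y_pred)      # share of actual converters found
print(accuracy, recall)
```

Here accuracy is 0.95 even though recall is 0.0, since not a single converter is identified. This is why precision, recall, and F1 score are reported alongside accuracy.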
6.4 Result
The confusion matrix is a performance evaluation tool that helps to measure the
accuracy of a classification model. It is a table that is used to evaluate the performance of a
machine-learning algorithm. The confusion matrix is made up of four values: true positive (TP),
true negative (TN), false positive (FP), and false negative (FN). These values help us to
understand how well the model is performing and where it is making errors.
In the above confusion matrix, we have two classes - subscribed and not subscribed.
True negatives are the number of cases where the model correctly predicted that the customer
will not subscribe, and the actual value is also not subscribed. False positives are the cases
where the model predicted that the customer would subscribe, but the actual value is not
subscribed. False negatives are the cases where the model predicted that the customer will not
subscribe, but the actual value is subscribed. True positives are the cases where the model
correctly predicted that the customer would subscribe, and the actual value is also subscribed.
The confusion matrix shows that the model has 6233 true negatives, which means that
the model correctly predicted that 6233 customers will not subscribe, and the actual value is
also not subscribed. The model has 475 false positives, which means that the model predicted
that 475 customers will subscribe but the actual value is not subscribed. The model has 891
false negatives, which means that the model predicted that 891 customers will not subscribe,
but the actual value is subscribed. Finally, the model has 5638 true positives, which means that
the model correctly predicted that 5638 customers will subscribe, and the actual value is also
subscribed.
In this confusion matrix, the labels are as follows:
True Positive (TP): The model predicted that the user would subscribe to a paid plan,
and the user actually did subscribe (5638 cases).
True Negative (TN): The model predicted that the user would not subscribe to a paid
plan, and the user actually did not subscribe (6233 cases).
False Positive (FP): The model predicted that the user would subscribe to a paid plan,
but the user actually did not subscribe (475 cases).
False Negative (FN): The model predicted that the user would not subscribe to a paid
plan, but the user actually did subscribe (891 cases).
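As a worked check, the evaluation metrics can be recomputed directly from these four counts using their standard definitions. The variable names are chosen here for illustration.

```python
# Confusion-matrix counts reported above.
tp, tn, fp, fn = 5638, 6233, 475, 891

precision = tp / (tp + fp)                           # TP / (TP + FP)
recall = tp / (tp + fn)                              # TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
accuracy = (tp + tn) / (tp + tn + fp + fn)           # correct / all predictions

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} accuracy={accuracy:.4f}")
```

With these counts, precision comes out at about 92.2%, recall at about 86.4%, F1 score at about 89.2%, and accuracy at about 89.7%.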
Figure. Accuracy
The accuracy obtained by each algorithm was Naive Bayes - 86.67%, SVM - 88.17%,
Logistic Regression - 89%, KNN - 93.47%, Decision Tree - 89.33%, and Random Forest -
89.63%. Among these algorithms, KNN performed the best with an accuracy of 93.47%,
closely followed by Random Forest with an accuracy of 89.63%.
These results indicate that machine learning algorithms can be used effectively to predict
whether a free plan user converts to a paid subscriber or not. They also suggest that KNN and
Random Forest are the most effective algorithms for this task.
However, we chose Random Forest because it can handle larger datasets and is computationally
faster than the other models, while being more accurate than all of them except KNN.
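A comparison of this kind can be sketched with cross-validation, assuming scikit-learn. The synthetic data, the default hyperparameters, and the 5-fold setting are all assumptions, so the scores printed by this sketch will not match the figures reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic data as a placeholder for the merged platform dataset.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)

models = {
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean 5-fold cross-validated accuracy per model.
scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```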
7. CODING
8. TESTING
Testing is an essential step in the predictive modeling process, and it plays a crucial
role in evaluating the performance of a Random Forest model when predicting whether a user
of a free subscription plan will convert to a paid subscriber or not, using the given dataset
attributes.
To perform testing, we usually split the dataset into two parts: a training set and a testing
set. The training set is used to train the Random Forest model, and the testing set is used to
evaluate the model's performance. In this case, the independent variables are minutes watched,
number of days engaged with the platform, engaging with quiz, engaged with exam, and
engaged with Q/A hub, and the dependent variable is Purchase.
After splitting the dataset, we train the Random Forest model on the training set and use
it to make predictions on the testing set. We can then evaluate the performance of the model
using various evaluation metrics. We measured accuracy, precision, recall, and F1-score.
For instance, accuracy measures the percentage of correctly classified samples out of
all samples in the testing set, and we obtained 89%. Precision measures the percentage of true
positive predictions out of all positive predictions, and we obtained 92.22%. Recall measures the
percentage of true positive predictions out of all actual positive samples, and we obtained
86.35%. The F1-score is the harmonic mean of precision and recall, and we obtained 89%.
By analyzing the results of these evaluation metrics, we determined that the model
achieved about 89% accuracy in predicting whether a user of a free subscription plan will
convert to a paid subscriber or not.
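The testing procedure described in this section can be sketched end to end as follows, assuming scikit-learn. The synthetic data stands in for the real attributes (minutes watched, days engaged, and so on), and the 75/25 split and 100 trees are assumptions, so the printed numbers are not the results of this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic placeholder data with a binary target.
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Train on the training set, then predict on the held-out testing set.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# For binary labels {0, 1}, ravel() yields the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
test_accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tn, fp, fn, tp, test_accuracy)
print(classification_report(y_test, y_pred, digits=4))
```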
9. CONCLUSION
10. REFERENCES
https://userpilot.com/blog/churn-prediction/
https://neptune.ai/blog/how-to-implement-customer-churn-prediction
https://analyticsindiamag.com/customer-event-prediction-in-onlinesubscription-products/
https://www.chargebee.com/blog/subscription-revenue-forecasting/