Intern Report
PLACE: Namakkal
DATE:
INTRODUCTION
Major Milestones
Skills have become the global currency of the 21st century. In a world where
competition for jobs, pay increases, and academic success continues to intensify,
certifications offer hope because they are a credible, third-party assessment of one's
skill and knowledge in a given subject. Key benefits that students gain from
certification include validation of knowledge, increased marketability,
increased earning power, enhanced academic performance, improved reputation,
enhanced credibility, increased confidence, and respect from peers. Through Knowledge
Solution India's certification, students have shown improved academic performance:
the average grade point of certified college students rose from 6.9 to 7.8, graduation
rates of certified college students rose from 78.4% to 94.5%, and dropout rates were
reduced to between 0.2% and 1.0%.
DAY 06 TO DAY 10
Data set
This section briefly describes the data used for the research. Data from multiple
sources was used in this project: the bulk of the data was extracted from the public
website Yocket (Yocket.com), while data on college rankings, fees, and enrolment was
obtained from The Mentors Circle, a leading educational consultancy firm in India.
Data from both sources was integrated into a staging data-set. For predicting a
student's chance of being shortlisted by universities, the final data-set was divided
into multiple data-sets, each representing a particular university. For predicting the
list of universities suitable for a student based on their profile, the staging
data-set was filtered to retain only records of students who had successfully secured
admission. The table below shows the different features of the data-sets.
Algorithms
Existing System
Bibodi et al. (n.d.) used multiple machine learning models to create a system that
helps students shortlist universities suitable for them; a second model was created
to help colleges decide on the enrolment of a student. The Naive Bayes algorithm was
used to predict the likelihood of success of an application, and multiple
classification algorithms, including Decision Tree, Random Forest, Naive Bayes, and
SVM, were compared and evaluated on accuracy to select the best candidates for the
college. The GRADE system was developed by Waters and Miikkulainen (2013) to support
the graduate admission process in the University of Texas at Austin Department of
Computer Science. The main objective of the project was to develop a system that
helps the university's admission committee make better and faster decisions. Logistic
regression and SVM were used to create the model; both performed equally well, and
the final system was built with logistic regression due to its simplicity. The time
required by the admission committee to review applications was reduced by 74%, but
human intervention was still required to make the final decision on the status of an
application. Nandeshwar et al. (2014) created a similar model to predict a student's
enrolment in a university based on factors such as SAT score, GPA, residency, and
race. The model was built with the multiple logistic regression algorithm and was
able to achieve an accuracy of only 67%.
DAY 21 TO DAY 25
MODULES DESCRIPTION
Exploratory Data Analysis: Performed initial investigations on the data to discover
patterns, spot anomalies, test hypotheses, and check assumptions with the help of
summary statistics and graphical representations.
Data Visualization: Using data visualization, I summarized the data with graphs,
pictures, and maps so that the human mind has an easier time processing and
understanding it. Data visualization plays a significant role in representing both
small and large data sets, but it is especially useful for large data sets, where it
is impossible to see all of the data, let alone process and understand it manually.
Training and Testing: In this project, the data-set is split into two subsets. The
first subset, known as the training data, is the portion of the actual data-set fed
into the machine learning model to discover and learn patterns; in this way, it
trains the model. The other subset, the testing data, is held back to evaluate the
trained model.
Train and Evaluate Linear Regression: Simple linear regression is an approach for
predicting a quantitative response using a single feature (or "predictor" or "input
variable"). It takes the following form: y = β0 + β1x
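As a concrete sketch of this step, the two coefficients of a simple linear regression can be estimated with the closed-form least-squares formulas. The data below is invented purely for illustration and is not from the project's data-set:

```python
import numpy as np

# Invented illustrative data: one feature x and a continuous target y.
x = np.array([300.0, 310.0, 320.0, 330.0, 340.0])
y = np.array([0.55, 0.62, 0.71, 0.78, 0.86])

# Closed-form least-squares estimates for y = b0 + b1*x.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

def predict(x_new):
    """Predict y for a new x using the fitted line."""
    return b0 + b1 * x_new
```

Because the toy data lies almost exactly on a line, the fitted line reproduces the observed values closely.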
A limitation of this system is that it relied only on the GRE, TOEFL, and
undergraduate scores of the student, and did not take into consideration other
important factors such as the SOP and LOR. The existing system also ignored research
work in the related field as a factor. The model achieved only 67% accuracy.
DAY 26 TO DAY 30
Linear Regression
Linear regression is a type of supervised machine learning algorithm that
computes the linear relationship between a dependent variable and one or more
independent features. When there is a single independent feature, it is known as
univariate (simple) linear regression; with more than one feature, it is known as
multivariate (multiple) linear regression.
Simple linear regression is the simplest form, involving only one independent
variable and one dependent variable. Its equation is:
Y = β0 + β1X
where:
Y is the dependent variable
X is the independent variable
β0 is the intercept
β1 is the slope
Multiple linear regression involves more than one independent variable. Its
equation is:
Y = β0 + β1X1 + β2X2 + … + βpXp
where:
Y is the dependent variable
X1, X2, …, Xp are the independent variables
β0 is the intercept
β1, β2, …, βp are the slopes
The goal of the algorithm is to find the best-fit line equation that can predict the
values based on the independent variables.
In regression, a set of records with X and Y values is available, and these values
are used to learn a function; if you then want to predict Y for an unseen X, this
learned function can be used. Since regression must find the value of Y, a function
is required that predicts a continuous Y given X as the independent features.
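To make the multiple-regression case concrete, here is a minimal sketch (not the project's actual code) that fits Y = β0 + β1X1 + … + βpXp by least squares with NumPy. The feature values are invented for illustration:

```python
import numpy as np

# Invented data: three features per row and a continuous target y.
X = np.array([[330.0, 115.0, 9.1],
              [320.0, 110.0, 8.6],
              [310.0, 104.0, 8.0],
              [305.0, 102.0, 7.8],
              [300.0, 100.0, 7.5],
              [315.0, 108.0, 8.3]])
y = np.array([0.92, 0.84, 0.72, 0.66, 0.60, 0.77])

# Prepend a column of ones so beta[0] acts as the intercept b0.
X1 = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Predicted values from the fitted hyperplane.
y_hat = X1 @ beta
```

Because the intercept column is included, the least-squares fit can never do worse than simply predicting the mean of y.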
DAY 35 TO DAY 39
Independence: The observations in the dataset are independent of each other. This
means that the value of the dependent variable for one observation does not depend on
the value of the dependent variable for another observation. If the observations are not
independent, then linear regression will not be an accurate model.
Normality: The residuals should be normally distributed. This means that the residuals
should follow a bell-shaped curve. If the residuals are not normally distributed, then
linear regression will not be an accurate model.
DAY 40 TO DAY 44
For multiple linear regression, all four of the assumptions from simple linear
regression apply. In addition, below are a few more:
4. Overfitting: Overfitting occurs when the model fits the training data too
closely, capturing noise or random fluctuations that do not represent the true
underlying relationship between variables. This can lead to poor generalization
performance on new, unseen data.
Acunetix is an automated web application security testing tool that audits your web
applications by checking for vulnerabilities such as SQL injection, cross-site
scripting, and other exploitable issues. In general, Acunetix scans any website or
web application that is accessible via a web browser and uses the HTTP/HTTPS protocol.
DAY 45 TO DAY 49
Multicollinearity
Multicollinearity is a statistical phenomenon that occurs when two or more independent
variables in a multiple regression model are highly correlated, making it difficult to assess
the individual effects of each variable on the dependent variable.
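A common first check for multicollinearity is the pairwise correlation matrix of the features. The sketch below uses synthetic data in which one feature is constructed to be nearly a multiple of another, so the collinear pair stands out:

```python
import numpy as np

# Synthetic features: x2 is almost an exact multiple of x1, so that pair
# is collinear by construction; x3 is generated independently.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2.0 * x1 + rng.normal(scale=0.01, size=200)
x3 = rng.normal(size=200)

# Pairwise correlation matrix; entries near +/-1 flag collinear pairs.
corr = np.corrcoef(np.stack([x1, x2, x3]))
```

In practice a variance inflation factor (VIF) analysis is often used as a follow-up, since correlation matrices only detect pairwise collinearity.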
Mean Squared Error (MSE) is defined as:
MSE = (1/n) Σᵢ (Vᵢ − Yᵢ)²
Here,
n is the number of data points.
Vi is the actual or observed value for the ith data point.
Yi is the predicted value for the ith data point.
MSE is a way to quantify the accuracy of a model's predictions. MSE is sensitive to
outliers, as large errors contribute significantly to the overall score.
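The definition above translates directly into code; this small helper is purely illustrative:

```python
import numpy as np

def mse(actual, predicted):
    """Mean squared error: the average of the squared differences
    between observed values and model predictions."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean((actual - predicted) ** 2)
```

For example, mse([3, 5, 7], [2, 5, 9]) averages the squared errors 1, 0, and 4, giving 5/3. The squaring is what makes the metric sensitive to outliers: a single large error dominates the average.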
DAY 50 TO DAY 54
Product of Vector
In vector multiplication there are basically two kinds of products: scalar and
vector. The dot product is a kind of multiplication that results in a scalar
quantity, while the cross product results in a vector quantity. Vector products are
used to define other derived vector quantities: the equations for torque, angular
velocity, and angular acceleration all involve operations that produce vectors from
vectors, and these operations are usually vector products.
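Both products can be illustrated in a couple of lines with NumPy, using the unit vectors along the x- and y-axes:

```python
import numpy as np

# Unit vectors along the x- and y-axes.
a = np.array([1.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 0.0])

dot = np.dot(a, b)       # scalar product: 0 for perpendicular vectors
cross = np.cross(a, b)   # vector product: points along the z-axis
```

The dot product of perpendicular vectors is zero, and the cross product of the x and y unit vectors is the z unit vector, matching the right-hand rule.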
Position Vector
The position vector is used to denote the position of the particle on the Cartesian plane
with respect to the origin as a reference.
Velocity
The average velocity is the ratio of total displacement over total time.
Gradient Descent is a cornerstone of model optimization. At its core, it is a
numerical optimization algorithm that aims to find the optimal parameters, the
weights and biases, of a neural network by minimizing a defined cost function.
Gradient Descent (GD) is widely used in machine learning and deep learning to
minimize the cost function of a model during training. It works by iteratively
adjusting the weights or parameters of the model in the direction of the negative
gradient of the cost function until the minimum of the cost function is reached.
The learning happens during backpropagation while training a neural-network-based
model: an optimizer updates the weights and biases based on the cost function, which
evaluates the difference between the actual and predicted outputs.
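The update rule described above can be sketched for the simplest case, a one-feature linear model trained on toy data generated from y = 2x + 1; the learning rate and iteration count are arbitrary illustrative choices:

```python
import numpy as np

# Toy data generated from y = 2x + 1, so the target parameters are known.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

w, b = 0.0, 0.0   # weight and bias, both initialized to zero
lr = 0.05         # learning rate
n = len(x)

for _ in range(2000):
    y_hat = w * x + b
    # Gradients of the MSE cost (1/n) * sum((y_hat - y)^2) w.r.t. w and b.
    dw = (2.0 / n) * np.sum((y_hat - y) * x)
    db = (2.0 / n) * np.sum(y_hat - y)
    # Step in the direction of the negative gradient.
    w -= lr * dw
    b -= lr * db
```

After enough iterations the parameters converge to the values that generated the data, w ≈ 2 and b ≈ 1.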
DAY 55 TO DAY 58
To obtain an unbiased estimate, one divides the sum of the squared residuals by the
number of degrees of freedom rather than by the total number of data points; this
figure is referred to as the Residual Standard Error (RSE).
RMSE is not as good a metric as R-squared. Root Mean Squared Error can fluctuate
when the units of the variables vary, since its value depends on the variables'
units (it is not a normalized measure).
Residual Sum of Squares (RSS): The sum of the squared residuals over all
data points is known as the residual sum of squares, or RSS. It measures the
difference between the observed output and what was predicted.
Total Sum of Squares (TSS): The sum of the squared deviations of the data points
from the mean of the response variable is known as the total sum of squares, or TSS.
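RSS, TSS, and the R-squared built from them can be computed in a few lines; the observed values and predictions below are invented for illustration:

```python
import numpy as np

# Invented observed values and model predictions.
y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.8, 5.1, 7.2, 8.9])

rss = np.sum((y - y_hat) ** 2)       # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)    # total sum of squares
r_squared = 1.0 - rss / tss          # fraction of variance explained
```

Because R-squared is the ratio 1 − RSS/TSS, it is unitless, which is why it is easier to compare across models than RMSE.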
CODINGS:
class LinearRegression:
    def __init__(self):
        # Fitted parameters (weight and bias) are stored here.
        self.parameters = {}
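The listing above ends after the constructor. Below is a self-contained sketch of how such a class might continue; the fit and predict methods and their default arguments are assumptions, using the gradient-descent update described earlier rather than the report's original (unshown) code:

```python
import numpy as np

class LinearRegression:
    def __init__(self):
        # Fitted parameters (weight and bias) are stored here.
        self.parameters = {}

    def fit(self, x, y, lr=0.05, epochs=2000):
        """Fit a one-feature linear model by gradient descent on the MSE cost."""
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)
        w, b, n = 0.0, 0.0, len(x)
        for _ in range(epochs):
            y_hat = w * x + b
            w -= lr * (2.0 / n) * np.sum((y_hat - y) * x)
            b -= lr * (2.0 / n) * np.sum(y_hat - y)
        self.parameters = {"w": w, "b": b}
        return self

    def predict(self, x):
        """Predict targets for new inputs using the fitted parameters."""
        p = self.parameters
        return p["w"] * np.asarray(x, dtype=float) + p["b"]
```

On toy data generated from a known line, the learned parameters converge to the generating slope and intercept.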