
A Mini Project or Internship Assessment Report

on

Data Science
at
Internshala
In partial fulfillment of the requirements
for the degree of

BACHELOR OF TECHNOLOGY

in

COMPUTER SCIENCE AND ENGINEERING


Submitted by
Sarthak Bohra (1904500100053)
Submitted to
Mr. Pradeep Kumar
Assistant Professor
Ms. Neha Sharma
Assistant Professor

Department of Computer Science and Engineering


Shri Ram Murti Smarak College of Engineering Technology & Research,
Bareilly
Dr. A.P.J. Abdul Kalam Technical University, Lucknow
August, 2022
CERTIFICATE

ACKNOWLEDGEMENT

The internship opportunity I had with Internshala Trainings was a great chance for learning and professional development. I therefore consider myself very lucky to have been given the opportunity to be a part of it. I am also grateful for the chance to meet so many wonderful people and professionals who guided me through this internship period.

I use this opportunity to express my deepest gratitude and special thanks to Sarvesh Agarwal, Chief Executive Officer & Founder of Internshala, for giving me the opportunity to undergo training within the organization.

I express my deepest thanks to Ms. Kopal Seth, Senior Instructional Designer, Internshala Trainings, for taking part in useful decisions, giving necessary advice and guidance, and arranging all facilities to make my work easier. I choose this moment to acknowledge her contribution gratefully.

It is my radiant sentiment to place on record my best regards and deepest sense of gratitude to Mr. Harsh Deep Singh, Product Manager-II and Senior Educator, Internshala Trainings, for his careful and precious guidance, which was extremely valuable for my study both theoretically and practically.

I perceive this opportunity as a big milestone in my career development. I will strive to use the gained skills and knowledge in the best possible way, and I will continue to work on their improvement in order to attain my desired career objectives. I hope to continue cooperation with all of you in the future.

Sincerely,

Signature…………………………………

Roll No ………………………………….

Name ……………………………………

Date ……………………………………..

TABLE OF CONTENTS
CERTIFICATE ii
ACKNOWLEDGEMENT iii
LIST OF TABLES iv
LIST OF FIGURES vi
CHAPTER 1 Introduction to Company 1
1.1 About the Company 1
CHAPTER 2 Data Science 2
2.1 About Data Science 2
2.1.1 What is Data Science? 2
2.1.2 Predictive Modeling 2
2.1.3 Machine Learning 2
2.1.4 Forecasting 3
2.1.5 Application of Data Science 3
2.2 Python for Data Science 5
2.2.1 Introduction to Python 5
2.2.2 Variables and Data Types 5
2.2.3 Data Types 5
2.2.4 Conditional Statements 7
2.2.5 Looping Constructs 8
2.2.6 Understanding Standard Libraries in Python 9
CHAPTER 3 Understanding the Statistics for Data Science 10
3.1 Statistics 10
3.1.1 Introduction to Statistics 10
3.1.2 Measure of central tendency 11
3.1.3 Understanding the spread of Data 11
3.1.4 Data Distribution 11
3.1.5 Introduction to Probability 14
3.1.6 Probabilities of Discrete and Continuous Variables 15
3.1.7 Central Limit Theorem and Normal Distribution 15
3.1.8 Introduction to Inferential Statistics 17
3.1.9 Understanding the Confidence Interval 18
3.1.10 Hypothesis Testing 19
3.1.11 Chi-Squared Test 20

CHAPTER 4 Predictive Modeling and Basics of Machine Learning 21

4.1 Predictive Modeling 21


4.1.1 Introduction to Predictive Modeling 21
4.1.2 Understanding the types of Predictive Models 21
4.1.3 Stages of Predictive Models 21
4.1.4 Hypothesis Generation 21
4.1.5 Data Extraction 22
4.1.6 Data Exploration 22
4.1.7 Reading the data in Python 24
4.2 Machine Learning 25
4.2.1 Linear Regression 25
4.2.2 Logistic Regression 26
4.2.3 Decision Tree 27
4.2.4 Decision Tree Splitting 28
4.2.5 K-Means Clustering algorithm 29
CHAPTER 5 Book Recommender System 29

5.1 Project Background 29


5.2 Project Methodology 29
5.3 Result 29
REFERENCES 30

LIST OF FIGURES

Figure 1.1: About Internshala 1

Figure 3.1: Types of Statistics 6

Figure 3.2: Boxplot 12

Figure 3.3: Histogram 14

Figure 3.4: Gaussian Distribution 14

Figure 3.5: Probability Density Function 15

Figure 3.6: Probability Density Graph 16

Figure 3.7: Confidence Interval 17

Figure 4.1: Linear Regression 22

Figure 4.2: Logistic Regression 22
CHAPTER 1

INTRODUCTION TO COMPANY

1.1 About the Company

Internshala is an internship and online training platform based in Gurgaon, India. Founded by Sarvesh Agrawal, an IIT Madras alumnus, in 2011, the website helps students find internships and learn new skills through trainings with different organizations in India. These skills may be app development, web development, or any programming language like C, C++, Java, Python, or JavaScript.

This training is not restricted to a tech point of view only but also covers content writing, digital marketing courses, etc. Apart from this, Internshala provides a platform where you can work upon your skills in real-world settings, for example on live projects.

Internshala is on a mission to equip students with relevant skills & practical exposure to help them get the best possible start to their careers. Imagine a world full of freedom and possibilities. A world where you can discover your passion and turn it into your career. A world where you graduate fully assured, confident, and prepared to stake a claim on your place in the world.

Figure 1.1: About Internshala

Internshala - Business Model and How it Works?

Internshala operates through its website and mobile app. A student has to register, create a decent profile, and apply for various internships listed on the portal. Internshala is a free platform when it comes to applying or searching for internships but earns revenue through two mediums:

 Internshala charges some amount for its online training programs. These programs vary based on duration, category, and finances.
 Internshala also charges fees from third parties such as advertisers who want to post ads, posters, or email blasts, etc. on its website.
CHAPTER 2

Data Science

2.1 About Data Science

2.1.1 What is Data Science?

Data science is the study of data, just as the biological sciences are the study of biology and the physical sciences are the study of physical phenomena. Data is real, data has real properties, and we need to study them if we are going to work with them. Data science is a process, not an event: it is the process of using data to understand many different things, to understand the world. Suppose you have a model or proposed explanation of a problem, and you try to validate that proposed explanation or model with your data. Data science is the skill of unfolding the insights and trends that are hiding behind data. It is when you translate data into a story, and use that storytelling to generate insight. With these insights, you can make strategic choices for a company or an institution. We can also define data science as a field about processes and systems to extract data of various forms and from various resources, whether the data is unstructured or structured.

2.1.2 Predictive Modeling

Predictive modelling is a form of artificial intelligence that uses data mining and probability
to forecast or estimate more granular, specific outcomes. For example, predictive modeling
could help identify customers who are likely to purchase our new One AI software over the
next 90 days.

2.1.3 Machine Learning

Machine learning is a branch of artificial intelligence (AI) where computers learn to act and adapt to new data without being explicitly programmed to do so. The computer is able to act independently of human interaction.

2.1.4 Forecasting

Forecasting is the process of predicting or estimating future events based on past and present data, most commonly by analysis of trends. "Guessing" doesn't cut it. A forecast, unlike a prediction, must have logic to it. It must be defendable. This logic is what differentiates it from the magic 8-ball's lucky guess. After all, even a broken watch is right twice a day.

2.1.5 Application of Data Science

Data science and big data are making an undeniable impact on businesses, changing day-to-day operations, financial analytics, and especially interactions with customers. It's clear that businesses can gain enormous value from the insights data science can provide. But sometimes it's hard to see exactly how. So let's look at some examples. In this era of big data, almost everyone generates masses of data every day, often without being aware of it. This digital trace reveals the patterns of our online lives. If you have ever searched for or bought a product on a site like Amazon, you'll notice that it starts making recommendations related to your search. This type of system, known as a recommendation engine, is a common application of data science.
Key application areas are as follows:

1. In Search Engines
2. In Transport
3. In Finance
4. In E-Commerce
5. In Health Care
 Detecting tumors
 Drug discovery
 Medical image analysis
 Virtual medical chatbots
 Genetics and genomics
6. Image Recognition
7. Targeted Recommendations
8. Airline Route Planning
9. Data Science in Gaming
10. Autocomplete

2.2 Python for Data Science

2.2.1 Introduction to Python

Python is a high-level, general-purpose and very popular programming language. The Python programming language (latest Python 3) is being used in web development and Machine Learning applications, along with all cutting-edge technology in the software industry. Python is very well suited for beginners, and also for experienced programmers coming from other programming languages like C++ and Java.

2.2.2 Variables and Data Types

Variables:

a. Python Variables Naming Rules:

There are certain rules for what you can name a variable (called an identifier).

 Python variables can only begin with a letter (A-Z/a-z) or an underscore (_).
 Python is case-sensitive, and so are Python identifiers.
b. Assigning and Reassigning Python Variables:
 To assign a value to Python variables, you don’t need to declare its type.
 You name it according to rules and type the value after the equal sign (=).
 You can’t put the identifier on the right-hand side of the equal sign.
 Neither can you assign Python variables to a keyword.
c. Multiple Assignment:
 You can assign values to multiple Python variables in one statement.
 You can assign the same value to multiple Python variables.
2.2.3 Data Types:

Following are the different data types:


A. Python Numbers:
a. int
b. float
c. long (Python 2 only; in Python 3, int itself handles arbitrarily large integers)

B. Strings:
a. Spanning a String Across Lines
b. Displaying Part of a String
C. Python Lists:
a. Slicing a List
b. Length of a List
D. Python Tuples:
a. Accessing and Slicing a Tuple
b. A tuple is Immutable
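Each of these data types can be demonstrated in a few lines (the sample values are my own):

```python
# Numbers: int and float (Python 3 has no separate long type)
n = 42
pi = 3.14

# Strings: spanning a string across lines and displaying part of it
s = ("Data "
     "Science")          # adjacent string literals join across lines
part = s[0:4]            # slicing gives 'Data'

# Lists: slicing and length
langs = ["C", "C++", "Java", "Python"]
first_two = langs[:2]    # ['C', 'C++']
length = len(langs)      # 4

# Tuples: accessing, slicing, and immutability
t = (1, 2, 3)
head = t[0]              # 1
try:
    t[0] = 99            # a tuple is immutable, so this raises TypeError
except TypeError:
    immutable = True
```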

2.2.4 Conditional Statements

a. if statements
b. if-else statements
c. elif statements
d. Nested if-else statements
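The four kinds of conditional statements listed above can be combined in one small function (the grading thresholds are illustrative):

```python
def grade(score):
    # if / elif / else, with a nested if-else in the final branch
    if score >= 90:
        result = "A"
    elif score >= 75:
        result = "B"
    else:
        if score >= 50:
            result = "C"
        else:
            result = "F"
    return result

print(grade(95))  # A
print(grade(60))  # C
```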

2.2.5 Looping Constructs

Loops:

a. While loop
b. for loop
c. nested loop
Functions

a. Built-in Functions
b. User-Defined Functions
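A short sketch covering each construct in the lists above (while loop, for loop, nested loop, and built-in versus user-defined functions; the data is illustrative):

```python
# while loop: sum the integers 0..4
i, total = 0, 0
while i < 5:
    total += i
    i += 1

# for loop with a nested loop: build a small multiplication table
table = []
for r in range(1, 4):
    row = []
    for c in range(1, 4):
        row.append(r * c)
    table.append(row)

# user-defined function
def square(n):
    return n * n

squares = [square(n) for n in range(4)]
biggest = max(squares)   # max() is a built-in function
```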

2.2.6 Understanding Standard Libraries in Python

A. Pandas: When it comes to data manipulation and analysis, nothing beats Pandas. It is the
most popular Python library, period. Pandas is written in the Python language especially
for manipulation and analysis tasks.

Pandas provides features like:

 Dataset joining and merging
 Data structure column deletion and insertion
 Data filtration
 Reshaping datasets
 DataFrame objects to manipulate data, and much more!
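A minimal sketch of merging, filtering, and column insertion/deletion, assuming pandas is installed (the DataFrames and column names are my own examples):

```python
import pandas as pd

# Two small DataFrames to demonstrate dataset merging
students = pd.DataFrame({"id": [1, 2, 3], "name": ["Asha", "Ravi", "Meena"]})
scores = pd.DataFrame({"id": [1, 2, 3], "score": [88, 92, 79]})

merged = pd.merge(students, scores, on="id")   # join on the shared 'id' column

# Data filtration: keep only the rows with a score above 80
top = merged[merged["score"] > 80]

# Column insertion and deletion
merged["passed"] = merged["score"] >= 80
merged = merged.drop(columns=["passed"])
```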
B. NumPy: NumPy, like Pandas, is an incredibly popular Python library. NumPy brings in functions to support large multi-dimensional arrays and matrices. It also brings in high-level mathematical functions to work with these arrays and matrices. NumPy is an open-source library and has multiple contributors.
NumPy provides features like:
 Integrating
 Broadcasting
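Broadcasting, mentioned above, can be shown in a few lines, assuming NumPy is installed (the arrays are illustrative):

```python
import numpy as np

# A 3x3 matrix and a 1-D array
m = np.arange(9).reshape(3, 3)   # rows [0 1 2], [3 4 5], [6 7 8]
v = np.array([10, 20, 30])

# Broadcasting: v is "stretched" across each row of m without an explicit loop
result = m + v

# High-level mathematical functions operate on whole arrays at once
col_means = result.mean(axis=0)
```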
C. Matplotlib: Matplotlib is the most popular data visualization library in Python. It allows
us to generate and build plots of all kinds.
Matplotlib provides features like:
 A semantic way to generate complex subplot grids
 Colored labels in legends
 Ticks and labels
 3D plots

CHAPTER 3

Understanding the Statistics for Data Science

3.1 Statistics

3.1.1 Introduction to Statistics

Statistics simply means numerical data, and is a field of math that generally deals with the collection, tabulation, and interpretation of numerical data. It is actually a form of mathematical analysis that uses different quantitative models to produce a set of experimental data or studies of real life. It is an area of applied mathematics concerned with data collection, analysis, interpretation, and presentation. Statistics deals with how data can be used to solve complex problems. Some people consider statistics to be a distinct mathematical science rather than a branch of mathematics. Statistics makes work easy and simple and provides a clear and clean picture of the work you do on a regular basis.
Basic terminology of Statistics:
 Population:
It is a collection or set of individuals, objects, or events whose properties are to be analyzed.
 Sample:
It is a subset of the population.

Types of statistics:

Fig.3.1 Types of Statistics

3.1.2 Measures of Central Tendency

 Mean:
It is a measure of the average of all values in a sample set.
 Median:
It is a measure of the central value of a sample set. The data set is ordered from lowest to highest value and the exact middle value is taken.
 Mode:
It is the value that occurs most frequently in a sample set. The value repeated most of the time in the data set is the mode.
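All three measures can be computed with Python's standard library (the sample data is my own):

```python
import statistics

data = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(data)      # (2 + 3 + 3 + 5 + 7 + 10) / 6 = 5
median = statistics.median(data)  # middle of the ordered data: (3 + 5) / 2 = 4
mode = statistics.mode(data)      # most frequent value: 3
```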

3.1.3 Understanding the Spread of Data

Measure of variability is also known as measure of dispersion and is used to describe the variability in a sample or population. In statistics, there are three common measures of variability, as shown below:

A. Range:
It is a measure of how spread apart the values in a data set are.
Range = Maximum value - Minimum value
B. Variance:
It simply describes how much a random variable differs from its expected value, and it is computed as the average of the squared deviations from the mean.
C. Standard Deviation:
It is the square root of the variance, expressed in the same units as the data.
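These measures of spread are also available in the standard library (the sample data is chosen so the results are round numbers):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # mean is 5

range_ = max(data) - min(data)          # 9 - 2 = 7
variance = statistics.pvariance(data)   # population variance: 32 / 8 = 4
stdev = statistics.pstdev(data)         # square root of the variance: 2
```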

3.1.4 Data Distribution

Terms related to Exploration of Data Distribution:

 Boxplot
 Frequency Table
 Histogram
 Density Plot

Boxplot: It is based on the percentiles of the data. The top and bottom of the box are the 75th and 25th percentiles of the data. The extended lines, known as whiskers, include the range of the rest of the data.

Fig 3.2 Boxplot (Population in Millions)

Frequency Table: It is a tool to distribute the data into equally spaced ranges, segments and
tells us how many values fall in each segment.

Histogram: It is a way of visualizing data distribution through frequency table with bins on
the x-axis and data count on the y-axis.

Fig 3.3 Histogram

Density plot: It is related to the histogram, as it shows the data values distributed as a continuous line. It is a smoothed version of the histogram, often superimposed over the histogram itself.

3.1.5 Introduction to Probability

Probability: It refers to the extent of occurrence of events. When an event occurs, like throwing a ball or picking a card from a deck, there must be some probability associated with that event.

In terms of mathematics, probability refers to the ratio of wanted outcomes to the total
number of possible outcomes. There are three approaches to the theory of probability,
namely:

1. Empirical Approach
2. Classical Approach
3. Axiomatic Approach.

Basic Terminologies:

Random Event: If an experiment is repeated several times under similar conditions and it does not produce the same outcome every time, but the outcome in a trial is one of several possible outcomes, then such an experiment is called a random event or a probabilistic event.

Elementary Event: The elementary event refers to the outcome of each random event performed. Whenever a random event is performed, each associated outcome is known as an elementary event.

Sample Space: Sample space refers to the set of all possible outcomes of a random event. For example, when a coin is tossed, the possible outcomes are head and tail.
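The ratio definition of probability above can be computed directly (the die example is my own):

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die
sample_space = {1, 2, 3, 4, 5, 6}

# Event: rolling an even number
event = {outcome for outcome in sample_space if outcome % 2 == 0}

# Probability = wanted outcomes / total possible outcomes
p_even = Fraction(len(event), len(sample_space))   # 3/6 = 1/2
```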

3.1.6. Probabilities of Discrete and Continuous Variables

A random variable is basically a function which maps from the sample space to the set of real numbers. The purpose is to get an idea about the result of a particular situation where we are given the probabilities of different outcomes.

Discrete Random Variable: A random variable X is said to be discrete if it takes on a finite (or countable) number of values. The probability function associated with it is the PMF (probability mass function).

Continuous Random Variable: A random variable X is said to be continuous if it takes on an infinite number of values. The probability function associated with it is the PDF (probability density function).

3.1.7. Central Limit Theorem and Normal Distribution


Whenever a random experiment is replicated, the random variable that equals the average (or total) result over the replicates tends to have a normal distribution as the number of replicates becomes large. This is one of the cornerstones of probability theory and statistics, because of the role it plays in the Central Limit Theorem, and because many real-world phenomena involve random quantities that are approximately normal.
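The theorem can be illustrated with a small simulation using only the standard library: each die roll is far from normal, but the averages of many rolls cluster tightly around the true mean of 3.5 (the sample sizes are my own choices):

```python
import random
import statistics

random.seed(0)  # fixed seed so the simulation is reproducible

# 2000 replicates, each the average of 100 rolls of a fair die
sample_means = [
    statistics.mean(random.randint(1, 6) for _ in range(100))
    for _ in range(2000)
]

# The distribution of sample means is approximately normal, centered at 3.5
overall = statistics.mean(sample_means)
```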

Fig 3.4 Gaussian Distribution

Probability Density Function:

The probability density function of the general normal distribution is given as:

f(x) = (1 / (σ√(2π))) e^(-(x - μ)² / (2σ²))

and the probability that X lies between a and b is P(a < X < b) = ∫ from a to b of f(x) dx. In the above formula, σ is the standard deviation and μ is the mean. It is easy to get overwhelmed by the formula while trying to understand everything in one glance, but we can break it down into smaller pieces to get an intuition for what is going on. The z-score, z = (x - μ)/σ, is a measure of how many standard deviations away a data point is from the mean; the exponent in the formula is just -z²/2. The figure given below shows this rule.

Fig 3.5 Probability Density Function

The effects of μ and σ on the distribution are shown below. Here μ is used to reposition the center of the distribution and consequently move the graph left or right, and σ is used to flatten or inflate the curve.

Fig 3.6 Probability Density Graph

3.1.8 Introduction to Inferential Statistics

Inferential statistics makes inferences and predictions about a population based on a sample of data taken from that population. It generalizes a large dataset and applies probabilities to draw a conclusion. It is used for explaining the meaning of descriptive statistics: to analyze, interpret results, and draw conclusions. Inferential statistics is mainly related to and associated with hypothesis testing, whose main target is to reject or fail to reject a null hypothesis.

Hypothesis testing is a type of inferential procedure that takes help of sample data to evaluate and assess the credibility of a hypothesis about a population. Inferential statistics are generally used to determine how strong a relationship is within the sample.

Types of inferential statistics:

Various types of inferential statistics are widely used nowadays and are easy to interpret. These are given below:

 One-sample test of difference / one-sample hypothesis test
 Confidence interval
 Contingency tables and chi-square statistic
 T-test or ANOVA

3.1.9 Understanding the Confidence Interval and margin of error

In simple terms, a confidence interval is a range of values that is likely to contain the true population parameter. The selection of a confidence level for an interval determines the probability that the confidence interval will contain the true parameter value. This range of values is generally used to deal with population-based data, extracting specific, valuable information with a certain amount of confidence, hence the term 'Confidence Interval'. The margin of error is the half-width of this interval.
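A 95% confidence interval for a mean can be sketched with the standard library (the sample values are my own, and the z critical value 1.96 assumes a large-sample/normal approximation):

```python
import math
import statistics

sample = [12.1, 11.8, 12.4, 12.0, 11.9, 12.2, 12.3, 11.7]

n = len(sample)
mean = statistics.mean(sample)
sem = statistics.stdev(sample) / math.sqrt(n)   # standard error of the mean

z = 1.96                        # critical value for a 95% confidence level
margin_of_error = z * sem
interval = (mean - margin_of_error, mean + margin_of_error)
```

For small samples, a t critical value from the t-distribution would be more appropriate than 1.96.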

Fig 3.7 Confidence Interval

3.1.10 Hypothesis Testing

A hypothesis is a statement about the given problem. Hypothesis testing is a statistical method that is used to make a statistical decision using experimental data. It is basically an assumption that we make about a population parameter.

Errors in hypothesis testing:
 Type I error: rejecting the null hypothesis although it was true. A Type I error is denoted by alpha.
 Type II error: accepting (failing to reject) the null hypothesis although it is false. A Type II error is denoted by beta.
3.1.11 Chi-Squared Test

The chi-square test is used for categorical features in a dataset. We calculate chi-square between each feature and the target and select the desired number of features with the best chi-square scores. It determines whether the association between two categorical variables of the sample would reflect their real association in the population. The chi-square score is given by:

χ² = Σ (Observed - Expected)² / Expected
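The statistic itself is a simple sum, which can be computed by hand (the observed/expected counts below are made-up illustrative values; judging significance would additionally require the chi-square distribution and degrees of freedom):

```python
# Chi-square statistic: sum of (observed - expected)^2 / expected
observed = [18, 22, 20, 40]
expected = [25, 25, 25, 25]

chi_square = sum(
    (o - e) ** 2 / e for o, e in zip(observed, expected)
)
# (49 + 9 + 25 + 225) / 25 = 12.32
```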

CHAPTER 4

Predictive Modeling and Basics of Machine Learning

4.1 Predictive Modeling

4.1.1. Introduction to Predictive Modeling

Predictive analytics involves certain manipulations on data from existing data sets with the goal of identifying new trends and patterns. These trends and patterns are then used to predict future outcomes and trends. By performing predictive analysis, we can predict future trends and performance. It is also known as prognostic analysis; the word prognostic means prediction. Predictive analytics uses data, statistical algorithms, and machine learning techniques to identify the probability of future outcomes based on historical data.

4.1.2. Understanding the types of Predictive Models

Supervised learning:
Supervised learning, as the name indicates, involves the presence of a supervisor as a teacher. Basically, supervised learning is learning in which we teach or train the machine using data which is well labeled, meaning some data is already tagged with the correct answer. After that, the machine is provided with a new set of examples so that the supervised learning algorithm analyses the training data and produces a correct outcome from the labeled data.
Unsupervised learning:
Unsupervised learning is the training of a machine using information that is neither classified nor labeled, allowing the algorithm to act on that information without guidance. Here the task of the machine is to group unsorted information according to similarities, patterns, and differences without any prior training on the data.

4.1.3. Stages of Predictive Models

Steps to Perform Predictive Analysis:

The following basic steps should be performed:

 Define Problem Statement


 Data Collection
 Data Cleaning
 Data Analysis
 Build Predictive Model
 Validation
 Deployment
 Model Monitoring

4.1.4 Hypothesis Generation

A hypothesis is a function that best describes the target in supervised machine learning. The hypothesis that an algorithm comes up with depends upon the data and also upon the restrictions and bias that we have imposed on the data. To better understand the hypothesis space, consider a coordinate plot that shows the distribution of some data.

4.1.5. Data Extraction


In general terms, "mining" is the process of extracting some valuable material from the earth, e.g., coal mining, diamond mining, etc. In the context of computer science, "data mining" refers to the extraction of useful information from a bulk of data or data warehouses. One can see that the term itself is a little confusing. In the case of coal or diamond mining, the result of the extraction process is coal or diamond. But in the case of data mining, the result of the extraction process is not data; instead, it is the patterns and knowledge that we gain at the end of the process.

Data mining as a whole process comprises three main phases:
1. Data Pre-processing - data cleaning, integration, selection
2. Data Extraction - occurrence of the actual data mining
3. Data Evaluation and Presentation - analyzing and presenting results

4.1.6. Data Exploration

Steps of Data Exploration and Preparation

Remember that the quality of your inputs decides the quality of your output. So, once you have your business hypothesis ready, it makes sense to spend a lot of time and effort here. By my personal estimate, data exploration, cleaning, and preparation can take up to 70% of the total project time. Below are the steps involved to understand, clean, and prepare your data for building your predictive model:
1. Variable Identification
2. Univariate Analysis
3. Bi-variate Analysis
4. Missing values treatment
5. Outlier treatment
6. Variable transformation
7. Variable creation
Finally, we will need to iterate over steps 4 – 7 multiple times before we come up
with our refined model.

4.1.7. Reading the data into Python

Python provides inbuilt functions for creating, writing, and reading files. There are two types of files that can be handled in Python: normal text files and binary files (written in binary language, 0s and 1s).
• Text files: In this type of file, each line of text is terminated with a special character called EOL (End of Line), which is the newline character ('\n') in Python by default.
• Binary files: In this type of file, there is no terminator for a line, and the data is stored after converting it into machine-understandable binary language.
Access modes govern the type of operations possible on the opened file. They refer to how the file will be used once it is opened. Different access modes for reading a file are:
 Read Only ('r'): Opens a text file for reading. The handle is positioned at the beginning of the file. If the file does not exist, an I/O error is raised. This is also the default mode in which a file is opened.
 Read and Write ('r+'): Opens the file for reading and writing. The handle is positioned at the beginning of the file. Raises an I/O error if the file does not exist.
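Both access modes can be demonstrated in a short snippet (the file names are my own; the snippet writes a throwaway file first so there is something to read):

```python
# Create a small text file so the reading modes have something to open
with open("sample.txt", "w") as f:
    f.write("first line\nsecond line\n")

# Read Only ('r'): the handle starts at the beginning of the file
with open("sample.txt", "r") as f:
    lines = f.readlines()

# Read and Write ('r+') raises an error if the file does not exist
try:
    open("missing_file.txt", "r+")
    missing = False
except FileNotFoundError:
    missing = True
```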

4.2. Machine Learning

4.2.1 Linear Regression

Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a statistical method that is used for predictive analysis. Linear regression makes predictions for continuous/real or numeric variables such as sales, salary, age, product price, etc. The linear regression algorithm shows a linear relationship between a dependent (y) variable and one or more independent (x) variables.

Fig 4.1 Linear Regression

Types of Linear Regression

Linear regression can be further divided into two types

 Simple Linear Regression: a single independent variable is used to predict the value of a numerical dependent variable.

 Multiple Linear Regression: more than one independent variable is used to predict the value of a numerical dependent variable.
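Simple linear regression has a closed-form least-squares solution, which can be sketched without any libraries (the sample points are my own and lie exactly on the line y = 2x + 1):

```python
# Ordinary least squares for simple linear regression:
#   slope = cov(x, y) / var(x),  intercept = mean(y) - slope * mean(x)
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```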

4.2.2. Logistic Regression

Logistic regression is one of the most popular Machine Learning algorithms, and comes under the Supervised Learning technique. It is used for predicting a categorical dependent variable using a given set of independent variables, and is used for solving classification problems.
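The core idea can be sketched in a few lines: logistic regression passes a weighted sum of the features through the sigmoid function, which maps any real number into (0, 1) and is read as a class probability (the weights and threshold below are illustrative, not a fitted model):

```python
import math

# The sigmoid function squashes any real number into the interval (0, 1)
def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# A toy decision rule: predict class 1 when the probability reaches 0.5
def predict(weights, bias, features):
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if sigmoid(z) >= 0.5 else 0
```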

Fig 4.2 Logistic Regression

4.2.3. Decision Tree

A tree has many analogies in real life, and it turns out that it has influenced a wide area of machine learning, covering both classification and regression. In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. As the name goes, it uses a tree-like model of decisions. Though a commonly used tool in data mining for deriving a strategy to reach a particular goal, it is also widely used in machine learning, which will be the main focus of this chapter.

A decision tree is drawn upside down, with its root at the top. Bold text represents a
condition (internal node), based on which the tree splits into branches (edges). The end of a
branch that does not split any further is the decision (leaf); in this example, whether the
passenger died or survived.
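A tiny classifier in the spirit of that example can be sketched with scikit-learn; the passenger-style features and labels below are invented for illustration.

```python
from sklearn.tree import DecisionTreeClassifier

# Features: [age, is_male]; label: 1 = survived, 0 = died (toy data only).
X = [[4, 0], [8, 1], [30, 1], [45, 1], [28, 0], [60, 1]]
y = [1, 1, 0, 0, 1, 0]

# A shallow tree keeps the learned conditions easy to read.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
pred = tree.predict([[5, 1], [50, 1]])  # a young and an older passenger
```

In this toy data a single split on age already separates the classes, so the fitted tree's internal node corresponds to an age threshold.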

4.2.4. Decision Tree Splitting

A decision tree makes decisions by splitting nodes into sub-nodes. This process is performed
repeatedly during training until only homogeneous nodes are left, and it is a key reason why a
decision tree can perform so well. Node splitting, or simply splitting, is the process of
dividing a node into multiple sub-nodes to create relatively pure nodes. There are multiple
ways of doing this, which can be broadly divided into two categories based on the type of
target variable:

1. Continuous Target Variable

 Reduction in Variance

2. Categorical Target Variable

 Gini Impurity
 Information Gain
 Chi-Square

Methods of Decision Tree Splitting:

1. Reduction in Variance: Reduction in Variance is a method for splitting a node used
when the target variable is continuous, i.e., in regression problems. It is so called because it
uses variance as the measure for deciding the feature on which a node is split into child nodes.

2. Information Gain: What if we have a categorical target variable? Reduction in Variance
will not quite cut it. The answer is Information Gain, which is used for splitting the nodes
when the target variable is categorical. It works on the concept of entropy and is given by:

Information Gain = Entropy(parent) − Weighted Average of Entropy(children)
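The two quantities can be computed directly from class labels; a minimal sketch using only the standard library:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5       # maximally impure: entropy = 1.0
split = [["yes"] * 5, ["no"] * 5]       # a perfect split: pure children
gain = information_gain(parent, split)  # 1.0 - 0.0 = 1.0
```

A split that produces pure children earns the maximum possible gain, which is exactly what the tree-growing procedure searches for at each node.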

4.2.5. K-Means Clustering Algorithm

The k-means algorithm starts from an initial condition in which the number of clusters is
fixed by assigning k initial centroids, or means. The distance between each sample and each
centroid is then computed, and the sample is assigned to the cluster whose centroid is closest.
This approach is often described as minimizing the inertia of the clusters, defined as the sum
of squared distances between each sample and its nearest centroid:

inertia = Σᵢ min over μⱼ of ‖xᵢ − μⱼ‖²

The process is iterative: once all the samples have been processed, a new set of k centroids is
computed and all the distances are recomputed. The algorithm stops when the desired
tolerance is reached, that is, when the centroids become stable and the inertia is therefore
minimized.
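The loop described above can be sketched in a few lines of NumPy; this is an illustrative bare-bones implementation, not a production one, and the one-dimensional data is invented to form two obvious clusters.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initial condition: pick k samples as the initial centroids (means).
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Compute the distance of every sample to every centroid and
        # assign each sample to the cluster with the minimum distance.
        d = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute the centroids; stop when they become stable.
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    inertia = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, inertia

# Two obvious blobs: points near 0 and points near 10.
X = np.array([[0.0], [0.5], [1.0], [10.0], [10.5], [11.0]])
labels, centroids, inertia = kmeans(X, k=2)
```

On this data the loop converges in a few iterations to centroids at 0.5 and 10.5, with the three low points in one cluster and the three high points in the other.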

CHAPTER 5

Book Recommender System

5.1. Project Background

The booming technology of the modern world has given rise to an enormous number of book
websites. This makes it difficult for buyers to choose the best books to read, even though
books play a vital role in many people's lives, and new kinds of books come into existence
on a day-to-day basis. To address this, recommendation systems have been introduced, in
which suggestions for various books are provided based on an analysis of the buyer's
interests. A book recommendation system is an intelligent algorithm that reduces this
overhead for people, benefiting both the seller and the consumer and creating a win-win
situation. From e-commerce sites to network security, many domains rely on recommender
systems to increase their revenue. Content filtering, association rule mining, and
collaborative filtering are the main decision-making techniques employed in recommendation
systems; they help buyers with strong recommendations, since with so many books available,
buyers sometimes cannot find the item they are searching for. Book recommendation systems
are widely implemented on top of search engines and large data sets.

5.2. Project Methodology


The online book recommendation system involves various techniques for providing effective
suggestions to buyers. Association rule mining, collaborative filtering, and content filtering
are the three most widely employed methods.
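Of the three, collaborative filtering can be sketched in a few lines; the rating matrix, book titles, and the user-based cosine-similarity approach below are illustrative assumptions, not the exact method used in the project.

```python
import numpy as np

books = ["Book A", "Book B", "Book C", "Book D"]
# Rows = users, columns = books; 0 means "not yet rated" (invented data).
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 0, 1],
    [1, 0, 5, 4],
])

def recommend(user, ratings, books):
    """Recommend the unrated book scored highest by similar users."""
    sims = ratings @ ratings[user].astype(float)   # dot products with each user
    norms = np.linalg.norm(ratings, axis=1) * np.linalg.norm(ratings[user])
    sims = sims / norms                            # cosine similarity
    sims[user] = 0.0                               # exclude the user themselves
    scores = sims @ ratings                        # similarity-weighted ratings
    scores[ratings[user] > 0] = -1.0               # consider only unrated books
    return books[int(scores.argmax())]

pick = recommend(0, ratings, books)
```

User 0 has not rated Book C, and the most similar user's neighbourhood rates it highly, so Book C is the recommendation.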

5.3. Result

REFERENCES
1. https://www.ibm.com/cloud/learn/data-science-introduction

2. https://en.wikipedia.org/wiki/Data_science

3. https://www.geeksforgeeks.org/what-is-data-science/

4. https://data36.com/what-is-data-science/

5. https://ioe.iitm.ac.in/program/data-science/
