
P. A. COLLEGE OF ENGINEERING AND TECHNOLOGY


(An Autonomous Institution)
Accredited with ‘A’ Grade by NAAC

POLLACHI - 642 002

Department of Electronics and Communication Engineering


(Accredited by NBA)

2022 – 2023 / Even Semester

IN – PLANT / INTERNSHIP TRAINING REPORT

APPIN TECHNOLOGY, COIMBATORE

TRAINING ON DATA SCIENCE

07.06.2023 to 23.06.2023

Submitted by

HARIPRIYA.R (721721104030)
Vision of the Institute
To progress to become a center of excellence in Engineering and
Technology through creative and innovative practices in teaching-learning
and promoting research and development to produce globally competitive
and employable professionals who are psychologically strong and
emotionally balanced with social perception and professional ethics.

Mission of the Institute


To offer academic programmes in the emerging areas of Engineering and
Technology, provide training and research facilities and opportunities to
promote student and faculty research in collaboration with Industry and
Government for sustainable growth.
Department Vision
To enrich the students' technical knowledge and practical skills in the field
of Electronics and Communication Engineering and to nurture highly
emulous communication engineers with the power to facilitate the society.

Department Mission
To provide quality education and promote research in the field of Electronics
and Communication Engineering and thereby rendering continuous service
to the society by imbibing leadership skills and moral values in the students.

Program Educational Objectives (PEOs)


PEO 1: To nourish the students with fundamentals of engineering and
technology by excelling in the field of Electronics and
Communication to envisage the emerging industrial needs and
professional competence.

PEO 2: To impart skill-based training programs to design, analyze and create innovative solutions for technical challenges.

PEO 3: To instill strong zeal and elegant personality by imbibing ethical principles and modeling the prosocial behavior to inculcate values among the future generation.
P. A. COLLEGE OF ENGINEERING AND TECHNOLOGY
(An Autonomous Institution)

INDEX

S. No. Contents Page No.

1. About the Industry / Organization

2. About the Training Program

3. Conclusion & Feedback

4. Photocopy of Certificates
INTRODUCTION:
Industrial training is very useful for acquiring knowledge from the various departments of a company; it also develops management skills and an understanding of how a firm is organized. The training creates job opportunities and gives each of us an advantage when seeking employment. The training is therefore useful academically and has also helped personally in various ways. For many companies, the process of recruiting and hiring is a drain on resources. One solution is to appeal to tomorrow's staff members when they are looking for internships, so that all the company has to do is choose the best of the bunch when it comes time to hire.

OBJECTIVE OF THE TRAINING


• To know about the nature of the organization
• To study the various departments and functions of the organization
• To gain practical working knowledge in the organization
• To know about the official pattern and production activities of the organization
• Speaking of additional manpower, setting up an internship program allows a company to take advantage of short-term support.

NEED AND SCOPE OF TRAINING:

SCOPE OF THE TRAINING:
The industrial training creates good relationships between the common public, employees, higher officers, managers, etc., and this knowledge is utilized for the development of academic knowledge. The ongoing globalization process is replete with threats from competitors, and one should be aware of this as well.

LIMITATION OF THE TRAINING:
The training period covered only ten days, so we could not gather complete knowledge about the organization. Even so, the training is useful academically and has also helped personally in various ways.
HISTORY OF THE COMPANY
Appin Technology is a division of M/s Ether services. We are a company with a
hardworking team, working on existing challenges with our customers. Appin
Technology started in the year 2013 with the goal of providing Software
development Services in Web Design, Web Development, Mobile App
Development, etc.
Vision.
Ether Services played an important role in providing quality services and
early-stage implementation of software development to Tier II and Tier III
cities of Tamil Nadu.
Appin Technology provides solutions to many start-ups. Ether Services has
developed products in Food Delivery Software, Grocery Delivery Software,
Airbnb Clone Software, and Car Rental Software.

Customers across the world, who would like to start a business in these areas,
will start with the minimum viable product where we act as Technology
partners.
We provide a full range of support for our customers on their journey to success. Our commitment to the work reflects our team's sense of responsibility, which leads to success. We follow a result-driven approach to measure the effectiveness of our projects, which helps us learn from our mistakes.

COMPANY PROFILE:

NAME OF THE COMPANY    : APPIN TECHNOLOGIES
REGISTERED OFFICE      : Gandhipuram, Coimbatore
TELEPHONE NO           : 9894723437
TURNOVER               : 50 LAKHS
NATURE OF BUSINESS     : Software development
YEAR OF ESTABLISHMENT  : 2013
FOUNDER                : Mohan Natarajan
TOTAL NO OF EMPLOYEES  : 120 PEOPLE

About the Training Program
DATA SCIENCE
Weekly Overview of In-Plant / Internship Training Activities

1st Week

DATE         DAY         Name of the Topic/Module Completed
07.06.2023   Wednesday   Introduction about Data Science, AI, ML; NumPy introduction; installing NumPy
08.06.2023   Thursday    ndarray, data types, array attributes, indexing and slicing
09.06.2023   Friday      Mathematical functions, arithmetic operations, sort, search and counting functions
10.06.2023   Saturday    Introduction about Pandas
12.06.2023   Monday      Pandas – Series, DataFrames, basic functionality, reindexing, iteration, sorting
13.06.2023   Tuesday     Pandas – working with text data

2nd Week

DATE         DAY         Name of the Topic/Module Completed
14.06.2023   Wednesday   Aggregations, group by, merging and joining
15.06.2023   Thursday    Introduction about data analytics; how to prepare data
16.06.2023   Friday      Reading data from CSV, JSON, XLSX and databases
17.06.2023   Saturday    Linear regression – simple linear, multiple linear
19.06.2023   Monday      Polynomial regression, classification
20.06.2023   Tuesday     Random forest, support vector machine (ML)
21.06.2023   Wednesday   K-nearest neighbour, K-means
22.06.2023   Thursday    Decision tree, confusion matrix
23.06.2023   Friday      NLP (Natural Language Processing)

Detailed Description about Every Module in the Training Activity
Day 1 -Introduction about Data Science , AI , ML

Data science is the broad scientific study that focuses on making sense of data. Think of, say, recommendation systems used to provide personalized suggestions to customers based on their search history. If one customer searches for a rod and a lure and another looks for a fishing line in addition to those products, there is a decent chance that the first customer will also be interested in purchasing a fishing line. Data science is a broad field that envelops all the activities and technologies that help build such systems, particularly those we discuss below.

Artificial intelligence is a complex topic. But for the sake of simplicity, let's say that any real-life data product can be called AI. Let's stay with our fishing-inspired example. You want to buy a certain model of fishing rod, but you only have a picture of it and don't know the brand name. An AI system is a software product that can examine your image and provide suggestions as to the product name and shops where you can buy it. To build an AI product you need to use data mining, machine learning, and sometimes deep learning.

Machine learning aims at training machines on historical data so that they can process new inputs based on learned patterns without explicit programming, meaning without manually written-out instructions for a system to perform an action. If it weren't for machine learning, the recommendation engines we mentioned above would be out of reach, as it is difficult for a human to process millions of search queries, likes, and reviews to discover which customers commonly buy rods with lures and which purchase a fishing line on top of that.
Day 2 – NumPy introduction, NdArray
NumPy introduction:

NumPy is a Python library used for working with arrays. It also has functions for working in the domains of linear algebra, Fourier transform, and matrices. NumPy was created in 2005 by Travis Oliphant. It is an open-source project and you can use it freely. NumPy stands for Numerical Python. In Python we have lists that serve the purpose of arrays, but they are slow to process. NumPy aims to provide an array object that is up to 50x faster than traditional Python lists. The array object in NumPy is called ndarray, and it provides a lot of supporting functions that make working with ndarray very easy. Arrays are very frequently used in data science, where speed and resources are very important.

Creating a Numpy Array


Arrays in NumPy can be created in multiple ways, with various numbers of ranks defining the size of the array. Arrays can also be created with the use of various data types such as lists, tuples, etc. The type of the resultant array is deduced from the type of the elements in the sequences.
Note: Type of array can be explicitly defined while creating the array.
# Python program for creation of arrays
import numpy as np

# Creating a rank 1 array
arr = np.array([1, 2, 3])
print("Array with Rank 1: \n", arr)

# Creating a rank 2 array
arr = np.array([[1, 2, 3], [4, 5, 6]])
print("Array with Rank 2: \n", arr)

# Creating an array from a tuple
arr = np.array((1, 3, 2))
print("\nArray created using passed tuple:\n", arr)
Output:

Array with Rank 1:
 [1 2 3]
Array with Rank 2:
 [[1 2 3]
 [4 5 6]]

Array created using passed tuple:
 [1 3 2]

Day 3 – Mathematical Functions, Sort, Search
Mathematical Functions:

There are some important math operations that can be performed on a pandas Series to simplify data analysis using Python and save a lot of time, for example:

s.sum(), s.std(), s.min() or s.max(), s.idxmin() or s.idxmax(), s.median()
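
As a minimal sketch (the Series values below are assumed purely for illustration), these operations can be applied like this:

import pandas as pd

# Example Series (values assumed for illustration)
s = pd.Series([4, 8, 15, 16, 23, 42])

print(s.sum())                  # total of all values
print(s.std())                  # standard deviation
print(s.min(), s.max())         # smallest and largest values
print(s.idxmin(), s.idxmax())   # index labels of the smallest and largest values
print(s.median())               # middle value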

Sorting: Sorting refers to arranging data in a particular format. Sorting algorithm specifies
the way to arrange data in a particular order. Most common orders are in numerical or
lexicographical order. In Numpy, we can perform various sorting operations using the various
functions that are provided in the library like sort, lexsort, argsort etc.

numpy.sort() : This function returns a sorted copy of an array.


# importing libraries
import numpy as np

# sort along the first axis
a = np.array([[12, 15], [10, 1]])
arr1 = np.sort(a, axis=0)
print("Along first axis : \n", arr1)

# sort along the last axis
a = np.array([[10, 15], [12, 1]])
arr2 = np.sort(a, axis=-1)
print("\nAlong last axis : \n", arr2)

# sort the flattened array
a = np.array([[12, 15], [10, 1]])
arr1 = np.sort(a, axis=None)
print("\nAlong none axis : \n", arr1)

Output:

Along first axis :
 [[10  1]
 [12 15]]

Along last axis :
 [[10 15]
 [ 1 12]]

Along none axis :
 [ 1 10 12 15]

Searching:
Searching is an operation or a technique that helps find the place of a given element or value in a list. Any search is said to be successful or unsuccessful depending upon whether the element being searched for is found or not. In NumPy, we can perform various searching operations using functions provided in the library such as argmax, argmin, nanargmax, etc.
numpy.argmax(): This function returns the indices of the maximum element of the array along a particular axis.

# Python program illustrating
# the working of argmax()

import numpy as geek

# Working on a 2D array
array = geek.arange(12).reshape(3, 4)
print("INPUT ARRAY : \n", array)

# No axis mentioned, so works on the entire array
print("\nMax element : ", geek.argmax(array))

# returning indices of the max element
# along each axis
print("\nIndices of Max element : ", geek.argmax(array, axis=0))
print("\nIndices of Max element : ", geek.argmax(array, axis=1))

Output:

INPUT ARRAY :
 [[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]

Max element :  11

Indices of Max element :  [2 2 2 2]

Indices of Max element :  [3 3 3]

Day 4 – Introduction about pandas
Pandas:
Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

The name "Pandas" has a reference to both "Panel Data", and "Python Data
Analysis" and was created by Wes McKinney in 2008.

Pandas is an open-source library that is made mainly for working with relational or labeled data both easily and intuitively. It provides various data structures and operations for manipulating numerical data and time series. This library is built on top of the NumPy library. Pandas is fast and it offers high performance and productivity for users.

The first step of working in pandas is to ensure that it is installed in the Python environment. If not, we need to install it using the pip command. Type the cmd command in the search box and locate the folder using the cd command where the python-pip file has been installed. After locating it, type the command:
pip install pandas
After pandas has been installed into the system, you need to import the library. This module is generally imported as:
import pandas as pd
Here, pd is referred to as an alias for Pandas. However, it is not necessary to import the library using the alias; it just helps in writing less code every time a method or property is called.
Pandas generally provides two data structures for manipulating data. They are:
 Series
 DataFrame

Day 5 – Pandas: Series, DataFrames, Basic Functionality
Series: Pandas Series is a one-dimensional labeled array capable of
holding data of any type (integer, string, float, python objects, etc.).
The axis labels are collectively called indexes. Pandas Series is nothing
but a column in an excel sheet. Labels need not be unique but must
be a hashable type. The object supports both integer and label-based
indexing and provides a host of methods for performing operations
involving the index.

Creating a Series: In the real world, a Pandas Series will be created by loading the datasets from existing storage; the storage can be a SQL database, a CSV file, or an Excel file. A Pandas Series can be created from lists, dictionaries, from a scalar value, etc.

Example:

import pandas as pd
import numpy as np

# Creating an empty Series
ser = pd.Series()
print(ser)

# simple array
data = np.array(['g', 'e', 'e', 'k', 's'])

ser = pd.Series(data)
print(ser)
Output:
Series([], dtype: float64)
0 g
1 e
2 e
3 k
4 s
dtype: object

Pandas DataFrame is a two-dimensional size-mutable, potentially
heterogeneous tabular data structure with labeled axes (rows and
columns). A Data frame is a two-dimensional data structure, i.e., data
is aligned in a tabular fashion in rows and columns. Pandas DataFrame
consists of three principal components, the data, rows, and columns.

Creating a DataFrame:

In the real world, a Pandas DataFrame will be created by loading the datasets from existing storage; the storage can be a SQL database, a CSV file, or an Excel file. A Pandas DataFrame can be created from lists, dictionaries, a list of dictionaries, etc.
Example:

import pandas as pd
# Calling DataFrame
constructor df =
pd.DataFrame()
print(df)
# list of strings
lst = ['Geeks', 'For', 'Geeks', 'is',
'portal', 'for', 'Geeks']
# Calling DataFrame constructor on
list df = pd.DataFrame(lst)
print(df)
Output:
Empty DataFrame
Columns: []
Index: []
        0
0   Geeks
1     For
2   Geeks
3      is
4  portal
5     for
6   Geeks

Day 6 – Working with Text Data
Text data types
There are two ways to store text data in pandas:

1. object-dtype NumPy array.
2. StringDtype extension type.

We recommend using StringDtype to store text data.

Prior to pandas 1.0, object dtype was the only option. This was unfortunate for many reasons:

1. You can accidentally store a mixture of strings and non-strings in an object dtype
array. It’s better to have a dedicated dtype.
2. object dtype breaks dtype-specific operations like DataFrame.select_dtypes(). There
isn’t a clear way to select just text while excluding non-text but still object-dtype
columns.
3. When reading code, the contents of an object dtype array is less clear than 'string'.
Currently, the performance of object dtype arrays of strings and arrays.StringArray are
about the same. We expect future enhancements to significantly increase the performance and
lower the memory overhead of StringArray.

For backwards-compatibility, object dtype remains the default type we infer a list of strings to:

>>>
In [1]: pd.Series(["a", "b", "c"])
Out[1]:
0 a
1 b
2 c
dtype: object
To explicitly request string dtype, specify the dtype

>>>
In [2]: pd.Series(["a", "b", "c"], dtype="string")
Out[2]:
0 a
1 b
2 c
dtype: string

In [3]: pd.Series(["a", "b", "c"], dtype=pd.StringDtype())


Out[3]:
0 a
1 b
2 c
dtype: string

Day 7 – Aggregations, Group By
Aggregation in data mining is the process of finding, collecting, and
presenting the data in a summarized format to perform statistical analysis of
business schemes or analysis of human patterns. When numerous data is
collected from various datasets, it’s crucial to gather accurate data to
provide significant results. Data aggregation can help in taking prudent
decisions in marketing, finance, pricing the product, etc. Aggregated data
groups are replaced using statistical summaries. Aggregated data being
present in the data warehouse can help one solve rational problems which in
turn can reduce the time strain in solving queries from data sets.
In aggregation process, interval arithmetic operations of fuzzy numbers are used.
The presented FFTA approach can be applied to evaluate the failure probabilities of
those system components, statistical data of which are either inadequate or
unavailable.

The groupby is one of the most frequently used Pandas functions in data analysis. It is used for grouping the data points (i.e. rows) based on the distinct values in a given column or columns. We can then calculate aggregated values for the generated groups.
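
As a minimal sketch of this idea (the column names and values below are assumed only for illustration), grouping and aggregating with pandas might look like this:

import pandas as pd

# Small example table (values assumed for illustration)
df = pd.DataFrame({
    'genre':  ['Comedy', 'Comedy', 'Drama', 'Drama'],
    'rating': [4.7, 6.3, 7.2, 6.1]
})

# Group the rows by the distinct values in 'genre' and
# compute an aggregated value (the mean rating) for each group
print(df.groupby('genre')['rating'].mean())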

Clustering algorithms, which also group data points, fall into a few main categories. Most well-known algorithms can be divided into three categories:

 Partitional clustering
 Hierarchical clustering
 Density-based clustering

Day 8 – Introduction to Data Analytics
Most companies are collecting loads of data all the time—but, in its
raw form, this data doesn’t really mean anything. This is where
data analytics comes in. Data analytics is the process of analyzing
raw data in order to draw out meaningful, actionable insights, which
are then used to inform and drive smart business decisions.

Day 9 – Read Data from CSV, JSON, XLSX and Database

Python Data File Formats – Python CSV


Python CSV data is fundamental to data science. A Comma-Separated-Values file uses commas to separate values. You can look at it as a delimited text file that holds tabular data as plain text. One problem may arise when the data it holds contains a comma or a line break; in that case we can use other delimiters like a tab stop. This Python data file format proves useful in exchanging data and in moving tabular data between programs. The extension for a CSV file is .csv.

Here's a Python CSV file we will use for our demo:

id,title,timing,genre,rating
1,Dog with a Blog,17:30-18:00,Comedy,4.7
2,Liv and Maddie,18:00-18:30,Comedy,6.3
3,Girl Meets World,18:30-19:00,Comedy,7.2
4,KC Undercover,19:00-19:30,Comedy,6.1
5,Austin and Ally,19:30-20:00,Comedy,6

We saved this as schedule.csv on our Desktop. Remember to save as All files (*.*). When we open this file, it opens in Microsoft Excel by default on Windows.

Python Data File Formats – Python JSON
JSON stands for JavaScript Object Notation and is an open-standard file format. While it holds attribute-value pairs and array data types, it uses human-readable text for this. This Python data file format is language-independent and we can use it in asynchronous browser-server communication. The extension for a Python JSON file is .json.

Here's the JSON file in Python we will use for the demo:

{
  "ID": ["1","2","3","4","5"],
  "Title": ["Dog with a Blog","Liv and Maddie","Girl Meets World","KC Undercover","Austin and Ally"],
  "Timing": ["17:30-18:00","18:00-18:30","18:30-19:00","19:00-19:30","19:30-20:00"],
  "Genre": ["Comedy","Comedy","Comedy","Comedy","Comedy"],
  "Rating": ["4.7","6.3","7.2","6.1","6"]
}

We save this as schedule.json on the Desktop.
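
A minimal sketch of reading these files with pandas (schedule.csv and schedule.json are the demo files above; the Excel file name is assumed here, and read_excel may need the openpyxl package installed):

import pandas as pd

# Read the CSV demo file into a DataFrame
csv_df = pd.read_csv('schedule.csv')
print(csv_df.head())

# Read the JSON demo file into a DataFrame
json_df = pd.read_json('schedule.json')
print(json_df.head())

# Reading an Excel workbook works in the same spirit (file name assumed)
xlsx_df = pd.read_excel('schedule.xlsx', sheet_name=0)
print(xlsx_df.head())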
Python Data File Formats – Python XLS
The extension for an Excel spreadsheet is .xlsx. This proves useful
for data science; we create a workbook with two sheets in Microsoft
Excel.
Sheet 1-

DAY 10-Linear Regression:
Linear regression analysis is used to predict the value of a variable based
on the value of another variable. The variable you want to predict is called
the dependent variable. The variable you are using to predict the other
variable's value is called the independent variable.

Least Square Method


Linear regression uses the least square method.

The concept is to draw a line through all the plotted data points. The line is
positioned in a way that it minimizes the distance to all of the data points.

The distance is called "residuals" or "errors".

In a plot of the data, the red dashed lines represent the distances from the data points to the drawn mathematical function.

Types of Linear Regression


o Simple Linear Regression:
  If a single independent variable is used to predict the value of a numerical dependent variable, then such a Linear Regression algorithm is called Simple Linear Regression.
o Multiple Linear Regression:
  If more than one independent variable is used to predict the value of a numerical dependent variable, then such a Linear Regression algorithm is called Multiple Linear Regression.
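
A minimal sketch of simple linear regression with scikit-learn's least-squares implementation (the data points below are assumed purely for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: one independent variable x, one dependent variable y
x = np.array([[1], [2], [3], [4], [5]])    # shape (n_samples, n_features)
y = np.array([2.1, 4.2, 5.9, 8.1, 9.8])

# Fit the line that minimizes the squared residuals (least square method)
model = LinearRegression()
model.fit(x, y)

print("slope:", model.coef_[0])
print("intercept:", model.intercept_)
print("prediction for x = 6:", model.predict([[6]])[0])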

DAY 11 - Polynomial Regression, Classification
Polynomial regression is a kind of linear regression in which the relationship between the dependent and independent variables Y and X is modeled as an nth-degree polynomial. This is done to look for the best way of drawing a line through the data points. Equation of the Polynomial Regression Model:
Any linear equation is a polynomial regression that has a degree of 1. The very common and usual equation used to define the regression is
y = mx + b
In this equation, m is the slope and b is the y-intercept. One can easily write this as f(x) = c0 + c1 x, where c1 is the slope and c0 is the y-intercept.
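
A minimal sketch of fitting a degree-2 polynomial with NumPy (the data points are assumed only for illustration):

import numpy as np

# Illustrative data roughly following a quadratic trend
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.2, 5.1, 10.3, 17.2, 26.1])

# Fit a polynomial of degree 2: y ~ c2*x**2 + c1*x + c0
coeffs = np.polyfit(x, y, deg=2)
poly = np.poly1d(coeffs)

print("coefficients (highest degree first):", coeffs)
print("prediction for x = 6:", poly(6))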
Classification:
The Classification algorithm is a Supervised Learning technique that is used to identify the category of new observations on the basis of training data. In Classification, a program learns from the given dataset or observations and then classifies new observations into a number of classes or groups, such as Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can be called targets/labels or categories.
In a classification algorithm, a discrete output function y is mapped to the input variable x: y = f(x), where y is the categorical output.
Binary Classifier:
If the classification problem has only two possible outcomes, then it is called a Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
Multi-class Classifier:
If a classification problem has more than two outcomes, then it is called a Multi-class Classifier.
Example: Classification of types of crops, classification of types of music.
Types of ML Classification Algorithms:
Classification algorithms can be mainly divided into two categories:
o Linear Models
   o Logistic Regression
   o Support Vector Machines
o Non-linear Models
   o K-Nearest Neighbours
   o Kernel SVM
   o Naïve Bayes
   o Decision Tree Classification
   o Random Forest Classification

DAY 12 - Random Forest, Support Vector Machine Algorithm

Random Forest Algorithm


Random Forest is a popular machine learning algorithm that belongs to the supervised
learning technique. It can be used for both Classification and Regression problems in ML. It is
based on the concept of ensemble learning, which is a process of combining multiple
classifiers to solve a complex problem and to improve the performance of the model.
As the name suggests, "Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority votes of predictions, predicts the final output.
The greater number of trees in the forest leads to higher accuracy and prevents the
problem of overfitting.
The below diagram explains the working of the Random Forest algorithm:

Support Vector Machine Algorithm


Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, which is used
for Classification as well as Regression problems. However, primarily, it is used for Classification
problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-
dimensional space into classes so that we can easily put the new data point in the correct category in the
future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine. Consider the below diagram in which there are two different categories that are classified using a decision boundary or hyperplane.
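
A minimal sketch of training both classifiers with scikit-learn (the Iris dataset is used here purely as an assumed illustration, not as the dataset from the training):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Illustrative dataset and train/test split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Random Forest: an ensemble of decision trees voting on the final class
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print("Random Forest accuracy:", rf.score(X_test, y_test))

# SVM: finds the hyperplane that best separates the classes
svm = SVC(kernel='rbf')
svm.fit(X_train, y_train)
print("SVM accuracy:", svm.score(X_test, y_test))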

DAY 13- k-nearest neighbour, k-means
K-Nearest Neighbor(KNN) Algorithm for
Machine Learning
o K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on
Supervised Learning technique.
o K-NN algorithm assumes the similarity between the new case/data and available cases
and put the new case into the category that is most similar to the available categories.
o K-NN algorithm stores all the available data and classifies a new data point based on the
similarity.
This means that when new data appears, it can be easily classified into a well-suited category by using the K-NN algorithm.
Fitting the K-NN classifier to the training data:
Now we will fit the K-NN classifier to the training data. To do this we will import the KNeighborsClassifier class of the sklearn.neighbors library. After importing the class, we will create the classifier object of the class. The parameters of this class will be:

o n_neighbors: To define the required neighbors of the algorithm. Usually, it takes 5.
o metric='minkowski': This is the default parameter and it decides the distance between the points.
o p=2: It is equivalent to the standard Euclidean metric.

# Fitting the K-NN classifier to the training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(x_train, y_train)

Output:
Out[10]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

o Predicting the Test Result: To predict the test set result, we will create a y_pred vector as we did in Logistic Regression. Below is the code for it:

# Predicting the test set result
y_pred = classifier.predict(x_test)

K means algorithm
K-Means Clustering is an Unsupervised Learning algorithm, which groups the
unlabeled dataset into different clusters. Here K defines the number of pre-
defined clusters that need to be created in the process, as if K=2, there will be
two clusters, and for K=3, there will be three clusters, and so on.

It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group that has similar properties.

Step-1: Data pre-processing step

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Mall_Customers_data.csv')
x = dataset.iloc[:, [3, 4]].values

Step-2: Finding the optimal number of clusters using the elbow method

# finding the optimal number of clusters using the elbow method
from sklearn.cluster import KMeans
wcss_list = []  # Initializing the list for the values of WCSS

# Using a for loop for iterations from 1 to 10
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)
mtp.plot(range(1, 11), wcss_list)
mtp.title('The Elbow Method Graph')
mtp.xlabel('Number of clusters (k)')
mtp.ylabel('wcss_list')
mtp.show()
Step-3: Training the K-means algorithm on the training dataset

# training the K-means model on the dataset
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_predict = kmeans.fit_predict(x)

Step-4: Visualizing the clusters

# visualizing the clusters
mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s=100, c='blue', label='Cluster 1')     # first cluster
mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s=100, c='green', label='Cluster 2')    # second cluster
mtp.scatter(x[y_predict == 2, 0], x[y_predict == 2, 1], s=100, c='red', label='Cluster 3')      # third cluster
mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s=100, c='cyan', label='Cluster 4')     # fourth cluster
mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s=100, c='magenta', label='Cluster 5')  # fifth cluster
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label='Centroid')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income (k$)')
mtp.ylabel('Spending Score (1-100)')
mtp.legend()
mtp.show()

DAY 14 - Decision Tree, Confusion Matrix

Decision Tree Classification Algorithm


o Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems. It is a tree-structured classifier, where internal nodes
represent the features of a dataset, branches represent the decision
rules and each leaf node represents the outcome.
o It is a graphical representation for getting all the possible
solutions to a problem/decision based on given conditions.

1. Information Gain:
   Information Gain = Entropy(S) - [(Weighted Avg) * Entropy(each feature)]
2. Gini Index:
   Gini Index = 1 - Σj (Pj)²

Advantages of the Decision Tree


o It is simple to understand as it follows the same process which a human follows while making any decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.

Disadvantages of the Decision Tree


o The decision tree contains lots of layers, which makes it complex.
o It may have an overfitting issue, which can be resolved using the Random Forest
algorithm.
o For more class labels, the computational complexity of the decision tree may increase.
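
A minimal sketch of training a decision tree classifier with scikit-learn (the Iris dataset is again used only as an assumed illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# criterion='entropy' splits on information gain; 'gini' (the default) uses the Gini index
tree = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=42)
tree.fit(X_train, y_train)
print("Decision tree accuracy:", tree.score(X_test, y_test))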

Confusion Matrix:
The confusion matrix is a matrix used to determine the performance of the
classification models for a given set of test data. It can only be determined if the true
values for test data are known. The matrix itself can be easily understood, but the
related terminologies may be confusing. Since it shows the errors in the model performance in the form of a matrix, it is also known as an error matrix.
Calculations using Confusion Matrix:

Other important terms used in Confusion Matrix:


o Null Error rate: It defines how often our model would be incorrect if it
always predicted the majority class. As per the accuracy paradox, it is
said that "the best classifier has a higher error rate than the null error
rate."
o ROC Curve: The ROC is a graph displaying a classifier's performance for all possible thresholds. The graph is plotted between the true positive rate (on the Y-axis) and the false positive rate (on the X-axis).
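
A minimal sketch of computing a confusion matrix with scikit-learn (the true labels and predictions below are assumed for illustration):

from sklearn.metrics import confusion_matrix, accuracy_score

# Illustrative true labels and model predictions
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# Rows correspond to actual classes, columns to predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)
print("Accuracy:", accuracy_score(y_true, y_pred))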

DAY 15- NLP(Natural Language Processing)
Natural language processing (NLP) refers to the branch of computer science—
and more specifically, the branch of artificial intelligence or AI—concerned with
giving computers the ability to understand text and spoken words in much the
same way human beings can.
Components of NLP
There are two components of NLP, as given below.

Natural Language Understanding (NLU)
Understanding involves the following tasks:

 Mapping the given input in natural language into useful representations.
 Analyzing different aspects of the language.
Natural Language Generation (NLG)
It is the process of producing meaningful phrases and sentences in the form of
natural language from some internal representation.

It involves:
 Text planning − It includes retrieving the relevant content from the knowledge base.
 Sentence planning − It includes choosing the required words, forming meaningful phrases, and setting the tone of the sentence.
 Text Realization − It is mapping the sentence plan into sentence structure.
NLU is harder than NLG.
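
As a minimal sketch of the first NLU step of mapping raw text into a useful representation, tokenization with the NLTK library can be used (this is an assumed example; it requires the nltk package and its punkt tokenizer data):

import nltk
nltk.download('punkt')  # one-time download of the tokenizer data
from nltk.tokenize import word_tokenize

text = "Natural language processing gives computers the ability to understand text."
tokens = word_tokenize(text)   # split the sentence into word tokens
print(tokens)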

Conclusion & Feedback
My internship in the field of data science has been an
incredible experience that has deepened my
understanding and passion for this rapidly evolving field. I
gained practical exposure and applied theoretical
knowledge to real-world projects, strengthening my skills
in statistical analysis, machine learning algorithms, and
data visualization. Working on challenging projects allowed
me to extract meaningful insights using advanced
analytics techniques. I also learned about the ethical
considerations in data science and developed strong
communication and collaboration skills. This internship has
solidified my career aspirations in data science and
provided a foundation for future endeavors. I am grateful
to all who supported and mentored me throughout this
journey.
