
Internship Report

Submitted by

NAME: R. Jayalakshmi    REG NO: 312521204015

In partial fulfillment for the award of the degree of


BACHELOR OF ENGINEERING
IN
COMPUTER SCIENCE AND
ENGINEERING

T.J. INSTITUTE OF TECHNOLOGY
KARAPAKKAM
CHENNAI - 600097

ANNA UNIVERSITY
CHENNAI - 600025

T.J. INSTITUTE OF TECHNOLOGY
RAJIV GANDHI SALAI, OMR, KARAPAKKAM, CHENNAI - 600097

ANNA UNIVERSITY, CHENNAI - 600025

BONAFIDE CERTIFICATE

Certified that this internship report on Data Science by Intern Certify is the bonafide work of R. Jayalakshmi, who carried out the internship under my supervision.

SIGNATURE SIGNATURE

MS. D. EVANGELINE NESA PRIYA                MR. MOHAMMAD ARSATH

HEAD OF THE DEPARTMENT                      CLASS IN-CHARGE


Acknowledgement

The success and final outcome of learning machine learning required a great deal of guidance and assistance from many people, and I am extremely privileged to have received this support throughout my course and projects. All that I have done was possible only because of such supervision and assistance, and I would not forget to thank them.

(Signature of Student)

Date:
INDEX

1. About Training
2. Objectives
3. Data Science
4. My Learnings
5. Reason for Choosing Data Science
6. Learning Outcome
7. Scope in Data Science
8. Conclusion
1. ABOUT TRAINING
• NAME OF TRAINING: DATA SCIENCE
• HOSTING INSTITUTION: INTERN CERTIFY
• DATES: From 9th July 2024 to 23rd August 2024

2. OBJECTIVES

To explore, sort, and analyse large volumes of data from various sources in order to draw conclusions that optimize business processes and support decision making.

Examples include machine maintenance (predictive maintenance), and, in the fields of marketing and sales, sales forecasting based on weather.

3. DATA SCIENCE

Data Science is a multi-disciplinary subject that uses mathematics, statistics, and computer science to study and evaluate data. The key objective of Data Science is to extract valuable information for use in strategic decision making, product development, trend analysis, and forecasting.

Data Science concepts and processes are mostly derived from data engineering, statistics, programming, social engineering, data warehousing, machine learning, and natural language processing. The key techniques in use are data mining, big data analysis, data extraction, and data retrieval.

Data science is the field of study that combines domain expertise, programming skills, and knowledge of mathematics and statistics to extract meaningful insights from data. Data science practitioners apply machine learning algorithms to numbers, text, images, video, audio, and more to produce artificial intelligence (AI) systems that perform tasks which ordinarily require human intelligence. In turn, these systems generate insights which analysts and business users can translate into tangible business value.

DATA SCIENCE PROCESS:

1. The first step of this process is setting a research goal. The main purpose here is making sure all the stakeholders understand the what, how, and why of the project.

2. The second phase is data retrieval. You want to have data available for analysis, so this step includes finding suitable data and getting access to it from the data owner. The result is data in its raw form, which probably needs polishing and transformation before it becomes usable.

3. Now that you have the raw data, it’s time to prepare it. This
includes transforming the data from a raw form into data that’s
directly usable in your models. To achieve this, you’ll detect and
correct different kinds of errors in the data, combine data from
different data sources, and transform it. If you have successfully
completed this step, you can progress to data visualization and
modeling.

4. The fourth step is data exploration. The goal of this step is to gain a deep understanding of the data. You'll look for patterns, correlations, and deviations based on visual and descriptive techniques. The insights you gain from this phase will enable you to start modeling.

5. The fifth step is model building (often referred to as "data modeling"). It is now that you attempt to gain the insights or make the predictions stated in your project charter. Now is the time to bring out the heavy guns, but remember that research has taught us that often (but not always) a combination of simple models tends to outperform one complicated model. If you've done this phase right, you're almost done.

6. The last step of the data science process is presenting your results and automating the analysis, if needed. One goal of a project is to change a process and/or enable better decisions. You may still need to convince the business that your findings will indeed change the business process as expected. This is where you can shine in your influencer role. The importance of this step is more apparent in projects on a strategic and tactical level. Certain projects require you to perform the business process over and over again, so automating the project will save time.

4. MY LEARNINGS
1) INTRODUCTION TO DATA SCIENCE

• Overview & Terminologies in Data Science
• Applications of Data Science
➢ Anomaly detection (fraud, disease, etc.)

➢ Automation and decision-making (credit worthiness, etc.)
➢ Classifications (classifying emails as “important” or “junk”)
➢ Forecasting (sales, revenue, etc.)
➢ Pattern detection (weather patterns, financial market
patterns, etc.)
➢ Recognition (facial, voice, text, etc.)
➢ Recommendations (based on learned preferences,
recommendation engines can refer you to movies,
restaurants and books you may like)

2) PYTHON FOR DATA SCIENCE

Introduction to Python, Understanding Operators, Variables and Data Types, Conditional Statements, Looping Constructs, Functions, Data Structures, Lists, Dictionaries, Understanding Standard Libraries in Python, Reading a CSV File in Python, Data Frames and Basic Operations with Data Frames, Indexing a Data Frame.
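
To make these basics concrete, here is a minimal sketch of the pandas operations listed above, assuming a hypothetical file "students.csv" with columns "Name" and "Marks":

import pandas as pd

df = pd.read_csv("students.csv")      # read a CSV file into a DataFrame
print(df.head())                      # first five rows
print(df["Marks"].mean())             # basic operation on a column
print(df.iloc[0])                     # index by position: the first row
print(df.loc[df["Marks"] > 50])       # index by condition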

3) UNDERSTANDING THE STATISTICS FOR DATA SCIENCE

Introduction to Statistics, Measures of Central Tendency, Understanding the Spread of Data, Data Distribution, Introduction to Probability, Probabilities of Discrete and Continuous Variables, Normal Distribution, Introduction to Inferential Statistics, Understanding the Confidence Interval and Margin of Error, Hypothesis Testing, Various Tests, Correlation.

4) PREDICTIVE MODELING AND BASICS OF MACHINE LEARNING

Introduction to Predictive Modeling, Types and Stages of Predictive Models, Hypothesis Generation, Data Extraction and Exploration, Variable Identification, Univariate Analysis for Continuous and Categorical Variables, Bivariate Analysis, Treating Missing Values and Outliers, Transforming Variables, Basics of Model Building, Linear and Logistic Regression, Decision Trees, K-means Algorithms in Python.

Summary of Procedure of Analyzing Data:

Data science generally has a five-stage life cycle that consists of:
• Capture: data entry, signal reception, data extraction
• Maintain: Data cleansing, data staging, data processing.
• Process: Data mining, clustering/classification, data
modelling
• Communicate: Data reporting, data visualization
• Analyse: Predictive analysis, regression
Introduction to Data Science

Data Science

The field of deriving insights from data using scientific techniques is called data science.

Applications

Amazon Go – No checkout lines.

Computer Vision – The advancement in recognizing an image by a computer involves processing large sets of image data from multiple objects of the same category. For example, face recognition.

Spectrum of Business Analysis


Reporting / Management Information System

Tracks what is happening in the organization.

Detective Analysis

Asking questions based on the data we are seeing, such as: why did something happen?

Dashboard / Business Intelligence

The utopia of reporting: every action of the business is reflected on the screen in front of you.

Predictive Modelling

Using past data to predict what will happen, at a granular level.

Big Data

The stage where the complexity of handling data goes beyond traditional systems. It can be caused by the volume, variety, or velocity of data, and requires specific tools to analyse data at such a scale.

Applications of Data Science

• Recommendation System

Example – On Amazon, recommendations differ from user to user according to their past searches.

• Social Media

1. Recommendation Engine
2. Ad placement
3. Sentiment Analysis

• Deciding the right credit limit for credit card customers.

• Suggesting the right products on e-commerce sites

1. Recommendation System
2. Past Data Searched
3. Discount Price Optimization

• How do Google and other search engines know which results are more relevant to our search query?

1. Apply ML and Data Science
2. Fraud Detection
3. Ad Placement
4. Personalized Search Results

Python Introduction

Python is an interpreted, high-level, general-purpose programming language. It has efficient high-level data structures and a simple but effective approach to object-oriented programming. Python’s elegant syntax and dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid application development in many areas on most platforms.

Python for Data Science:

Why Python?

1. Python is an open-source language.
2. Its syntax is as simple as English.
3. It has a very large and collaborative developer community.
4. It offers extensive packages.

• UNDERSTANDING OPERATORS:

Operators are symbolic representations of mathematical tasks.

• VARIABLES AND DATATYPES:

Variables are names bound to objects. Data types in Python include int (integer), float, bool (Boolean), and str (string).

• CONDITIONAL STATEMENTS:

If-else statements (single condition)
If-elif-else statements (multiple conditions)

• LOOPING CONSTRUCTS:

For loop

• FUNCTIONS:

Functions are reusable pieces of code, created to solve specific problems. There are two types: built-in functions and user-defined functions. Once defined, a function can be called any number of times, as sketched below.
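
A minimal sketch of a user-defined function being defined once and reused (the function name and values are illustrative):

def square(x):
    # user-defined function: returns the square of x
    return x * x

print(square(4))       # 16
print(square(2.5))     # 6.25
print(len([1, 2, 3]))  # len() is a built-in function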


• DATA STRUCTURES:

Two types of data structures:

LISTS: A list is an ordered data structure with elements separated by commas and enclosed within square brackets.

DICTIONARY: A dictionary is an unordered data structure with elements separated by commas and stored as key: value pairs, enclosed within curly braces {}.
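
A short illustration of both structures (the values are hypothetical):

marks = [78, 92, 85]          # list: ordered, indexed from 0
marks.append(60)              # lists are mutable
print(marks[0])               # 78

student = {"name": "Asha", "marks": 92}    # dictionary: key-value pairs
print(student["marks"])                    # look up a value by its key
student["grade"] = "A"                     # add a new key-value pair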

Statistics
Descriptive Statistics

Mode

The mode is the number which occurs most frequently in the data series. It is robust, and is not generally affected much by the addition of a couple of new values.
Code

import pandas as pd

data = pd.read_csv("Mode.csv")        # read data from a CSV file
data.head()                           # show the first five rows
mode_data = data['Subject'].mode()    # mode of the Subject column
print(mode_data)

Mean

import pandas as pd

data = pd.read_csv("mean.csv")              # read data from a CSV file
data.head()                                 # show the first five rows
mean_data = data['Overallmarks'].mean()     # mean of the Overallmarks column
print(mean_data)

Median

The median is the middle value of the sorted data set.

import pandas as pd

data = pd.read_csv("data.csv")                  # read data from a CSV file
data.head()                                     # show the first five rows
median_data = data['Overallmarks'].median()     # median of the Overallmarks column
print(median_data)
Types of variables

• Continuous – takes continuous numeric values. Eg - marks
• Categorical – has discrete values. Eg - gender
• Ordinal – ordered categorical variable. Eg - teacher feedback
• Nominal – unordered categorical variable. Eg - gender


Outliers

Any value which falls outside the range of the data is termed an outlier. Eg - 9700 instead of 97.

Reasons for Outliers
• Typos – during collection. Eg - adding an extra zero by mistake.
• Measurement Error – outliers in the data due to a faulty measurement instrument.
• Intentional Error – errors which are induced intentionally. Eg - claiming a smaller amount of alcohol consumed than actual.
• Legit Outlier – values which are not actually errors but are in the data for legitimate reasons. Eg - a CEO's salary might genuinely be high compared to other employees.

Interquartile Range (IQR)

The IQR is the difference between the third quartile and the first quartile. It is robust to outliers.

Histograms

Histograms depict the underlying frequency of a set of discrete or continuous data that are measured on an interval scale.

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

histogram = pd.read_csv("histogram.csv")
plt.hist(x='Overall Marks', data=histogram)
plt.show()

Inferential Statistics

Inferential statistics allows us to make inferences about the population from sample data.

Hypothesis Testing

Hypothesis testing is a kind of statistical inference that involves asking a question, collecting data, and then examining what the data tells us about how to proceed. The hypothesis to be tested is called the null hypothesis and is given the symbol H0. We test the null hypothesis against an alternative hypothesis, which is given the symbol Ha.

Decision made                  Null Hypothesis is True    Null Hypothesis is False
Reject Null Hypothesis         Type I Error               Correct Decision
Don't Reject Null Hypothesis   Correct Decision           Type II Error

T Tests

A t-test is used when we have only a sample, not the population statistics. The sample standard deviation is used to estimate the population standard deviation. The t-test is more prone to error, because we only have samples.
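
A minimal one-sample t-test sketch using scipy (the library choice and the sample values are assumptions, not from the training material). H0: the population mean of marks is 70.

from scipy import stats

sample = [72, 68, 75, 71, 69, 74, 70, 73]
t_stat, p_value = stats.ttest_1samp(sample, popmean=70)
print(t_stat, p_value)
# If p_value < 0.05, reject H0 at the 5% significance level.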

Z Score

The distance, in terms of number of standard deviations, of an observed value from the mean is the standard score, or z-score.

+Z – the value is above the mean.
-Z – the value is below the mean.

The distribution, once converted to z-scores, always has the same shape as the original distribution.
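
A short sketch of the definition, z = (x - mean) / standard deviation, using numpy (an assumed tool choice; the values are illustrative):

import numpy as np

x = np.array([60, 70, 80, 90, 100])
z = (x - x.mean()) / x.std()
print(z)   # positive z-scores are above the mean, negative below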

Chi Squared Test

Used to test relationships between categorical variables.
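
A minimal chi-squared test of independence between two categorical variables, using scipy (an assumed library; the contingency table below is hypothetical):

from scipy.stats import chi2_contingency

# Hypothetical contingency table: two categories crossed with two outcomes.
table = [[20, 30],
         [25, 25]]
chi2, p_value, dof, expected = chi2_contingency(table)
print(chi2, p_value)
# A small p_value suggests the two variables are not independent.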

Correlation

Correlation determines the relationship between two variables. It is denoted by r, and its value ranges from -1 to +1; a value of 0 means no linear relationship.
Syntax
import pandas as pd

data = pd.read_csv("data.csv")
data.corr()    # pairwise correlation of numeric columns

Predictive Modelling

Making use of past data and attributes, we predict the future.
Eg - from horror movies watched in the past, predict unwatched horror movies a user might like.

Predicting stock price movement:
1. Analysing past stock prices.
2. Analysing similar stocks.
3. The future stock price is the required output.

Types

1. Supervised Learning

Supervised learning is a type of algorithm that uses a known dataset (called the training dataset) to make predictions. The training dataset includes input data and response values.
• Regression – the response takes continuous values. Eg - marks.
• Classification – the response takes discrete class values. Eg - cancer prediction is either 0 or 1.

2. Unsupervised Learning

Unsupervised learning is the training of a machine using information that is neither classified nor labelled. Here the task of the machine is to group unsorted information according to similarities, patterns, and differences, without any prior training on the data.

• Clustering: A clustering problem is where you want to discover the inherent groupings in the data, such as grouping customers by purchasing behaviour.

• Association: An association rule learning problem is where you want to discover rules that describe large portions of your data, such as people that buy X also tend to buy Y.

Stages of Predictive Modelling

1. Problem definition
2. Hypothesis Generation
3. Data Extraction/Collection
4. Data Exploration and Transformation
5. Predictive Modelling
6. Model Development/Implementation
Problem Definition

Identify the right problem statement and, ideally, formulate the problem mathematically.
Eg - from past horror movies watched, predict future unwatched horror movies.

Hypothesis Generation

List down all possible variables which might influence the problem objective. These variables should be free from personal bias and preference.
The quality of the model is directly proportional to the quality of the hypotheses.

Data Extraction/Collection

Collect data from different sources and combine it for exploration and model building.
While looking at the data, we might come across new hypotheses.

Data Exploration and Transformation

Data extraction is a process that involves retrieval of data from various sources for further data processing or data storage.
Steps of Data Extraction
• Reading the data
Eg - from a CSV file
• Variable identification
• Univariate Analysis
• Bivariate Analysis
• Missing value treatment
• Outlier treatment
• Variable Transformation
Variable Identification
It is the process of identifying whether a variable is:
1. An independent or dependent variable
2. A continuous or categorical variable
Why do we perform variable identification?
1. Techniques like supervised learning require identification of the dependent variable.
2. Different data processing techniques apply to categorical and continuous data.
Categorical variables are stored as object; continuous variables are stored as int or float.
Univariate Analysis
1. Explore one variable at a time.
2. Summarize the variable.
3. Make sense of that summary to discover insights, anomalies, etc.
Bivariate Analysis
• When two variables are studied together for their empirical relationship.
• When you want to see whether the two variables are associated with each
other.
• It helps in prediction and detecting anomalies.

Missing Value Treatment


Reasons for missing values
1. Non-response – Eg - when you collect data on people's income and many choose not to answer.
2. Error in data collection – Eg - faulty data.
3. Error in data reading.

Types

1. MCAR (Missing Completely At Random): the missing values have no relation either to the variable in which they occur or to the other variables in the dataset.
2. MAR (Missing At Random): the missing values have no relation to the variable in which they occur, but are related to other variables in the dataset.
3. MNAR (Missing Not At Random): the missing values are related to the variable in which they occur.
Identifying
Syntax:
1. describe()
2. isnull()
The output of isnull() is True or False.
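
A short sketch of both checks, assuming a hypothetical "data.csv":

import pandas as pd

data = pd.read_csv("data.csv")
print(data.describe())        # per-column counts reveal missing entries
print(data.isnull().sum())    # isnull() gives True/False; sum() counts them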

Different methods to deal with missing values

1. Imputation
Continuous – impute with the mean, median, or a regression model.
Categorical – impute with the mode or a classification model.
2. Deletion
Row-wise or column-wise deletion. But it leads to loss of data.
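
Minimal sketches of both treatments in pandas, assuming a continuous column "Overallmarks" and a categorical column "Subject" (both hypothetical):

import pandas as pd

data = pd.read_csv("data.csv")

# Imputation
data["Overallmarks"] = data["Overallmarks"].fillna(data["Overallmarks"].mean())
data["Subject"] = data["Subject"].fillna(data["Subject"].mode()[0])

# Deletion (loses data)
rows_dropped = data.dropna()         # drop rows with any missing value
cols_dropped = data.dropna(axis=1)   # drop columns with missing values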
Outlier Treatment
Reasons of Outliers
1. Data entry Errors
2. Measurement Errors
3. Processing Errors
4. Change in underlying population
Types of Outlier

Univariate

Analysing only one variable for outliers.
Eg – in a box plot of height and weight, weight alone will be analysed for outliers.

Bivariate

Analysing both variables together for outliers.
Eg – in a scatter plot of height and weight, both will be analysed.

Identifying Outlier

Graphical Method

• Box Plot

• Scatter Plot
Formula Method

Using the box plot rule, an observation is an outlier if it is
< Q1 - 1.5 * IQR or > Q3 + 1.5 * IQR
where IQR = Q3 - Q1,
Q3 = value of the 3rd quartile,
Q1 = value of the 1st quartile.
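
Applying this rule to a pandas column (the column name "Overallmarks" and the file are assumptions):

import pandas as pd

data = pd.read_csv("data.csv")
q1 = data["Overallmarks"].quantile(0.25)
q3 = data["Overallmarks"].quantile(0.75)
iqr = q3 - q1

lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data["Overallmarks"] < lower) | (data["Overallmarks"] > upper)]
print(outliers)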
Treating Outliers

1. Deleting observations
2. Transforming and binning values
3. Imputing outliers like missing values
4. Treating them as a separate group

Variable Transformation

It is the process by which:
1. We replace a variable with some function of that variable. Eg – replacing a variable x with its log.
2. We change the distribution or relationship of a variable with others.

It is used to:
1. Change the scale of a variable
2. Transform non-linear relationships into linear relationships
3. Create symmetric distributions from skewed distributions

Common methods of variable transformation are logarithm, square root, cube root, binning, etc., sketched below.
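
A sketch of these common transformations with numpy and pandas (the column name and bin edges are illustrative assumptions):

import numpy as np
import pandas as pd

data = pd.read_csv("data.csv")
data["log_marks"] = np.log(data["Overallmarks"])     # logarithm (values must be positive)
data["sqrt_marks"] = np.sqrt(data["Overallmarks"])   # square root
data["cbrt_marks"] = np.cbrt(data["Overallmarks"])   # cube root
data["marks_bin"] = pd.cut(data["Overallmarks"],     # binning into labelled ranges
                           bins=[0, 35, 60, 100],
                           labels=["low", "medium", "high"])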

Model Building

Model building is the process of creating a mathematical model for estimating or predicting the future based on past data.

Eg - a retailer wants to know the default behaviour of its credit card customers. They want to predict the probability of default for each customer in the next three months.
• The probability of default would lie between 0 and 1.
• Assume every customer has a 10% default rate: the probability of default for each customer in the next 3 months = 0.1.
The model moves this probability towards one of the extremes based on attributes from past information.
A customer with a volatile income is more likely to default (closer to 1).
A customer with a healthy credit history for the last few years has a low chance of default (closer to 0).
Steps in Model Building
1. Algorithm Selection
2. Training Model
3. Prediction / Scoring
Algorithm Selection

Eg - predict whether the customer will buy a product or not.

Algorithms

• Logistic Regression
• Decision Tree
• Random Forest
Training Model
It is the process of learning the relationship/correlation between the independent and dependent variables. The known dependent variable of the training data set is used to fit the model.

Dataset

• Train
Past data (known dependent variable).
Used to train the model.
• Test
Future data (unknown dependent variable).
Used for scoring.

Prediction / Scoring
It is the process of estimating/predicting the dependent variable of the test data set by applying the model's rules. We apply what was learned during training to the test data set for prediction/estimation, as sketched below.
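
A minimal end-to-end sketch of the split / train / score workflow with scikit-learn (an assumed library choice; the data is synthetic):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression()
model.fit(X_train, y_train)            # training: learn the relationship
predictions = model.predict(X_test)    # scoring: predict on unseen data
print(model.score(X_test, y_test))     # accuracy on the held-out test set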

Algorithms of Machine Learning

Linear Regression

Linear regression is a statistical approach for modelling relationship between


a dependent variable with a given set of independent variables. It is
assumed that the wo variables are linearly related. Hence, we try to find a
linear function. That predicts
the response value(y) as accurately as possible as a function of the feature
or independent variable(x).
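
A minimal linear regression sketch with scikit-learn (an assumed library choice; the points are illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1], [2], [3], [4], [5]])    # independent variable
y = np.array([2.1, 4.0, 6.2, 8.1, 9.9])    # dependent variable

model = LinearRegression().fit(x, y)
print(model.coef_, model.intercept_)       # learned slope and intercept
print(model.predict([[6]]))                # predicted response for x = 6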

Logistic Regression

Logistic regression is a statistical model that in its basic form uses a logistic
function to model a binary dependent variable, although many more
complex extensions exist.
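
A logistic regression sketch for a binary outcome with scikit-learn (an assumed choice; the data is illustrative):

import numpy as np
from sklearn.linear_model import LogisticRegression

hours_studied = np.array([[1], [2], [3], [4], [5], [6]])
passed = np.array([0, 0, 0, 1, 1, 1])          # binary dependent variable

model = LogisticRegression().fit(hours_studied, passed)
print(model.predict([[3.5]]))                  # predicted class (0 or 1)
print(model.predict_proba([[3.5]]))            # probability of each class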
K-Means Clustering (Unsupervised Learning)

K-means clustering is a type of unsupervised learning, which is used when you have unlabelled data (i.e., data without defined categories or groups). The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of the K groups based on the features that are provided. Data points are clustered based on feature similarity.
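
A k-means sketch with scikit-learn (an assumed choice): grouping unlabelled 2-D points into K = 2 clusters by feature similarity.

import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 2], [1, 4], [1, 0],
                   [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)            # cluster assignment for each point
print(kmeans.cluster_centers_)   # the learned group centres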
5. REASON FOR CHOOSING DATA SCIENCE

Data Science has become a revolutionary technology that everyone seems to talk about. Hailed as the 'sexiest job of the 21st century', Data Science is a buzzword, with very few people knowing the technology in its true sense. While many people wish to become Data Scientists, it is essential to weigh the pros and cons of data science and present a real picture. In this section, we discuss these points in detail and provide the necessary insights about Data Science.

Advantages: -

1. It’s in Demand
2. Abundance of Positions
3. A Highly Paid Career
4. Data Science is Versatile

Disadvantages: -

1. Mastering Data Science is nearly impossible
2. A large amount of domain knowledge is required
3. Arbitrary data may yield unexpected results
4. The problem of data privacy
6. LEARNING OUTCOME

After completing the training, I am able to:
• Develop relevant programming abilities.
• Demonstrate proficiency with statistical analysis of data.
• Build and assess data-based models.
• Execute statistical analyses with professional statistical software.
• Demonstrate skill in data management.
• Apply data science concepts and methods to solve problems in real-world contexts, and communicate these solutions effectively.
7. SCOPE IN DATA SCIENCE FIELD
A few factors that point to data science's future, demonstrating compelling reasons why it is crucial to today's business needs, are listed below:

• Companies' inability to handle data

Data is regularly collected by businesses and companies through transactions and website interactions. Many companies face a common challenge: analysing and categorising the data that is collected and stored. A data scientist becomes the saviour in a situation of mayhem like this. Companies can progress a lot with proper and efficient handling of data, which results in productivity.

• Revised data privacy regulations

Countries of the European Union witnessed the passing of the General Data Protection Regulation (GDPR) in May 2018, and a similar data protection regulation takes effect in California in 2020. This creates co-dependency between companies and data scientists, arising from the need to store data adequately and responsibly. In today's times, people are generally more cautious and alert about sharing data with businesses and giving up a certain amount of control to them, as there is rising awareness about data breaches and their malefic consequences. Companies can no longer afford to be careless and irresponsible about their data. The GDPR will ensure some amount of data privacy in the coming future.

• Data science is constantly evolving

Career areas that carry no growth potential run the risk of stagnating; fields need to constantly evolve and undergo change for opportunities to arise and flourish. Data science is a broad career path that is undergoing developments and thus promises abundant opportunities in the future. Data science job roles are likely to get more specific, which in turn will lead to specializations in the field. People inclined towards this stream can pursue what suits them best through these specializations.

• An astonishing rise in data growth

Data is generated by everyone on a daily basis, with and without our notice, and our daily interaction with data will only keep increasing as time passes. In addition, the amount of data existing in the world will increase at lightning speed. As data production rises, the demand for data scientists will be crucial to help enterprises use and manage it well.

• Virtual Reality will be friendlier

In today's world we are witnessing how Artificial Intelligence is spreading across the globe, and how much companies rely on it. Big data, with its current innovations, will flourish further with advanced concepts like deep learning and neural networks. Currently, machine learning is being introduced and implemented in almost every application. Virtual Reality (VR) and Augmented Reality (AR) are undergoing monumental modifications too. In addition, human-machine interaction, as well as dependency, is likely to improve and increase drastically.

• Blockchain updating with Data science

Blockchain is the most popular technology dealing with cryptocurrencies like Bitcoin. Data security will live true to its function in this aspect, as detailed transactions will be secured and recorded. If big data flourishes, then IoT will witness growth too and gain popularity. Edge computing will be responsible for dealing with data issues and addressing them.

8. CONCLUSION
If a machine can successfully pretend to be human to a knowledgeable observer, then you certainly should consider it intelligent. AI systems are now in routine use in various fields such as economics, medicine, engineering, and the military, as well as being built into many common home computer software applications, traditional strategy games, etc.

AI is an exciting and rewarding discipline. AI is the branch of computer science concerned with the automation of intelligent behaviour. A revised definition of AI is: AI is the study of the mechanisms underlying intelligent behaviour through the construction and evaluation of artifacts that attempt to enact those mechanisms. So it is concluded that AI works as an artificial human brain with an unbelievable artificial thinking power.
CERTIFICATE
