Internship Report: T.J. Institute of Technology
Submitted by
NAME: R. Jayalakshmi    REG NO: 312521204015
T.J. INSTITUTE OF TECHNOLOGY
KARAPAKKAM
CHENNAI - 600097
ANNA UNIVERSITY
CHENNAI - 600025
T.J. INSTITUTE OF TECHNOLOGY
RAJIV GANDHI SALAI, OMR, KARAPAKKAM, CHENNAI - 600097
ANNA UNIVERSITY, CHENNAI - 600025
BONAFIDE CERTIFICATE
Certified that this internship on Data Science, hosted by Intern Certify, is the bonafide work of R. Jayalakshmi, who carried out the internship under my supervision.
SIGNATURE SIGNATURE
(Signature of Student)
Date:
INDEX
Sr. No.   Title
1         About Training
2         Objectives
3         Data Science
4         Final Project
5         Reason for choosing Data Science
6         Learning Outcome
7         Scope in Data Science
8         Conclusion
1. ABOUT TRAINING
• NAME OF TRAINING: DATA SCIENCE
• HOSTING INSTITUTION: INTERN CERTIFY
• DATES: From 9th July 2024 to 23rd August 2024
2. OBJECTIVES
To explore, sort, and analyse large volumes of data from various sources, to take advantage of them, and to reach conclusions that optimize business processes and support decision making.
3. DATA SCIENCE
Data Science is a multi-disciplinary subject that uses mathematics, statistics, and computer science to study and evaluate data. Its key objective is to extract valuable information for use in strategic decision making, product development, trend analysis, and forecasting.
Data Preparation
Now that you have the raw data, it is time to prepare it. This includes transforming the data from a raw form into data that is directly usable in your models. To achieve this, you detect and correct different kinds of errors in the data, combine data from different sources, and transform it. Once this step is completed successfully, you can progress to data visualization and modelling.
4. MY LEARNINGS
1) INTRODUCTION TO DATA SCIENCE
Data science generally has a five-stage life cycle that consists of:
• Capture: data entry, signal reception, data extraction
• Maintain: data cleansing, data staging, data processing
• Process: data mining, clustering/classification, data modelling
• Communicate: data reporting, data visualization
• Analyse: predictive analysis, regression
Introduction to Data Science
Data Science
The field of deriving insights from data using scientific techniques is called data science.
Applications
Detective Analysis
Predictive Modelling
Using past data to predict what will happen, at a granular level.
Big Data
The stage where the complexity of handling data goes beyond traditional systems. It can be caused by the volume, variety, or velocity of the data, and analysing data at such scale requires specific tools.
• Recommendation System
• Social Media
1. Recommendation Engine
2. Ad placement
3. Sentiment Analysis
• How do Google and other search engines know which results are most relevant for our search query?
Python Introduction
• UNDERSTANDING OPERATORS:
Variables are names bound to objects. The data types in Python include int (integer), float, Boolean, and string.
• CONDITIONAL STATEMENTS:
• LOOPING CONSTRUCTS:
For loop
• FUNCTIONS:
Functions are reusable pieces of code, created to solve a specific problem. There are two types: built-in functions and user-defined functions.
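Below is a minimal sketch tying these constructs together; the marks values and the grade boundaries are made up purely for illustration.
# Variables and operators
marks = [72, 85, 90, 64, 58]      # hypothetical marks, for illustration only
total = sum(marks)                # built-in function
average = total / len(marks)      # arithmetic operators

# User-defined function with conditional statements
def grade(score):
    if score >= 80:
        return "A"
    elif score >= 60:
        return "B"
    else:
        return "C"

# Looping construct: for loop
for m in marks:
    print(m, grade(m))

print("Average:", average)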
Statistics
Descriptive Statistics
Mode
import pandas as pd
data = pd.read_csv("mode.csv")            # hypothetical file name; reads data from a CSV file
mode_data = data["Overallmarks"].mode()   # mode of the Overallmarks column
print(mode_data)
Mean
import pandas as pd
data = pd.read_csv("mean.csv")            # reads data from the CSV file
data.head()                               # shows the first five rows
mean_data = data["Overallmarks"].mean()   # mean of the Overallmarks column
print(mean_data)
Median
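A minimal sketch in the same style as the mean example above; the file name median.csv is an assumption for illustration.
import pandas as pd
data = pd.read_csv("median.csv")              # hypothetical file name
median_data = data["Overallmarks"].median()   # median of the Overallmarks column
print(median_data)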
Outliers
Any value that falls outside the range of the data is termed an outlier, e.g. 9700 instead of 97.
Reasons for Outliers
• Typos - introduced during collection, e.g. adding an extra zero by mistake.
• Measurement Error - outliers in the data due to a faulty measurement instrument.
• Intentional Error - errors which are induced intentionally, e.g. reporting a smaller amount of alcohol consumed than the actual amount.
• Legitimate Outlier - values which are not errors but are present in the data for legitimate reasons, e.g. a CEO's salary may genuinely be high compared to other employees.
Interquartile Range (IQR)
The IQR is the difference between the third quartile and the first quartile. It is robust to outliers.
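A minimal sketch of computing the IQR with pandas, reusing the hypothetical Overallmarks column from the examples above.
import pandas as pd
data = pd.read_csv("mean.csv")               # hypothetical file, as above
q1 = data["Overallmarks"].quantile(0.25)     # first quartile
q3 = data["Overallmarks"].quantile(0.75)     # third quartile
iqr = q3 - q1                                # interquartile range
print(iqr)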
Histograms
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
histogram = pd.read_csv("histogram.csv")        # reads data from the CSV file
plt.hist(x='Overall Marks', data=histogram)     # histogram of the Overall Marks column
plt.show()
Inferential Statistics
Inferential statistics allows us to make inferences about the population from sample data; a short sketch of the tests listed below follows the list.
Hypothesis Testing
T Tests
Z Score
Correlation
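A minimal sketch of these ideas with numpy and scipy; the sample values and the assumed population mean of 60 are made up for illustration.
import numpy as np
from scipy import stats

sample = np.array([72, 85, 90, 64, 58, 77, 69])   # hypothetical sample of marks

# One-sample t test: is the sample mean different from an assumed population mean of 60?
t_stat, p_value = stats.ttest_1samp(sample, popmean=60)
print(t_stat, p_value)

# Z score: how many standard deviations each value is from the sample mean
print(stats.zscore(sample))

# Correlation between two hypothetical variables
hours_studied = np.array([2, 4, 5, 1, 0.5, 3, 2.5])
corr, corr_p = stats.pearsonr(hours_studied, sample)
print(corr, corr_p)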
Predictive Modelling
Making use of past data and its attributes, we predict future outcomes.
Eg-
Types
1. Supervised Learning
2. Unsupervised Learning
Stages of Predictive Modelling
1. Problem definition
2. Hypothesis Generation
3. Data Extraction/Collection
4. Data Exploration and Transformation
5. Predictive Modelling
6. Model Development/Implementation
Problem Definition
Hypothesis Generation
List all possible variables that might influence the problem objective. These variables should be free from personal bias and preferences.
The quality of the model is directly proportional to the quality of the hypotheses.
Data Extraction/Collection
Collect data from different sources and combine it for exploration and model building.
While looking at the data we might come across new hypotheses.
Missing Value Treatment
Types
1. Imputation
Continuous - impute with the help of the mean, median, or a regression model.
Categorical - impute with the mode or a classification model.
2. Deletion
Row-wise or column-wise deletion, but it leads to loss of data (a short sketch of both approaches follows the list).
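A minimal sketch of both treatments with pandas, on a small made-up data frame; the column names are hypothetical.
import pandas as pd
import numpy as np

# Small made-up frame with missing values
df = pd.DataFrame({
    "Overallmarks": [72, np.nan, 90, 64, np.nan],   # continuous column
    "Grade": ["B", "A", np.nan, "B", "C"],          # categorical column
})

# Deletion: drop rows (or columns) containing missing values; simple but loses data
dropped = df.dropna()

# Imputation: continuous column with the mean (or median), categorical column with the mode
imputed = df.copy()
imputed["Overallmarks"] = imputed["Overallmarks"].fillna(imputed["Overallmarks"].mean())
imputed["Grade"] = imputed["Grade"].fillna(imputed["Grade"].mode()[0])

print(dropped)
print(imputed)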
Outlier Treatment
Reasons for Outliers
1. Data entry Errors
2. Measurement Errors
3. Processing Errors
4. Change in underlying population
Types of Outliers
Univariate
Bivariate
Identifying Outliers
Graphical Method
• Box Plot
• Scatter Plot
Formula Method
Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are flagged as outliers (see the sketch after this list).
Treating Outliers
1. Deleting observations
2. Transforming and binning values
3. Imputing outliers like missing values
4. Treating them as a separate group
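A minimal sketch of the formula method and one simple treatment (capping at the boundaries), assuming the hypothetical Overallmarks data used above.
import pandas as pd

data = pd.read_csv("mean.csv")                    # hypothetical file, as in the examples above
marks = data["Overallmarks"]

# Formula method: flag values outside Q1 - 1.5*IQR and Q3 + 1.5*IQR
q1, q3 = marks.quantile(0.25), marks.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = marks[(marks < lower) | (marks > upper)]
print(outliers)

# One possible treatment: cap (winsorize) the outliers at the boundary values
data["Overallmarks"] = marks.clip(lower=lower, upper=upper)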
Variable Transformation
Used to –
Model Building
Algorithms
• Logistic Regression
• Decision Tree
• Random Forest
Training Model
It is the process of learning the relationship/correlation between the independent and dependent variables. We use the dependent variable of the train data set to train the model for prediction/estimation.
Dataset
• Train
Past data (known dependent variable).
Used to train model.
• Test
Future data (unknown dependent variable)
Used to score.
Prediction / Scoring
It is the process of estimating/predicting the dependent variable of the test data set by applying the model's rules.
We apply what was learned during training to the test data set for prediction/estimation; a minimal sketch of this workflow follows.
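A minimal sketch of the train/score workflow with scikit-learn, using a decision tree from the algorithm list above; the file names, feature columns, and the Passed target are assumptions for illustration.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

train = pd.read_csv("train.csv")        # past data with a known dependent variable
test = pd.read_csv("test.csv")          # future data without the dependent variable

features = ["Age", "Hours_Studied"]     # hypothetical independent variables
X_train, y_train = train[features], train["Passed"]   # "Passed" is a hypothetical target

# Training: learn the relationship between independent and dependent variables
model = DecisionTreeClassifier(max_depth=3)
model.fit(X_train, y_train)

# Scoring: apply the learned rules to the test set to predict the dependent variable
predictions = model.predict(test[features])
print(predictions[:5])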
Linear Regression
Logistic Regression
Logistic regression is a statistical model that in its basic form uses a logistic
function to model a binary dependent variable, although many more
complex extensions exist.
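A minimal sketch of fitting a logistic regression to a binary outcome with scikit-learn; the hours-studied data is made up purely for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: hours studied vs. fail (0) / pass (1)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# The logistic function maps the linear score to a probability between 0 and 1
print(model.predict_proba([[4.5]]))   # probabilities of fail/pass for 4.5 hours
print(model.predict([[4.5]]))         # predicted class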
K-Means Clustering (Unsupervised learning)
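A minimal sketch of K-Means with scikit-learn on made-up two-dimensional points; the choice of two clusters is an assumption.
import numpy as np
from sklearn.cluster import KMeans

# Made-up points forming two loose groups
points = np.array([[1, 2], [1, 4], [2, 3],
                   [8, 8], [9, 10], [10, 9]])

# K-Means is unsupervised: no dependent variable, only the points themselves
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)
print(labels)                     # cluster assignment for each point
print(kmeans.cluster_centers_)    # coordinates of the two cluster centres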
Advantages of Data Science: -
1. It’s in Demand
2. Abundance of Positions
3. A Highly Paid Career
4. Data Science is Versatile
Disadvantages: -
8. CONCLUSION
If a machine could successfully pretend to be human to a knowledgeable observer, then it should certainly be considered intelligent. AI systems are now in routine use in various fields such as economics, medicine, engineering, and the military, as well as being built into many common home computer software applications, traditional strategy games, and more.