INTERNSHIP REPORT
Submitted by
Ajay Rapeti A21126551049
BONAFIDE CERTIFICATE
This is to certify that this Internship Report on “Data Science” from Netzwerk Academy
is the bonafide work of R. Ajay (A21126551049) of 3/4 CSD, who carried out the
Internship under my supervision.
Class Teacher
Mrs. G SURYAKALA ESWARI
Assistant Professor
Department of CSE(AI,ML&DS)
Reviewers
Mrs. G SURYAKALA ESWARI
Assistant Professor
Department of CSE(AI,ML&DS)
Dr. K. S. Deepthi
Head of The Department
Department of CSE(AI,ML&DS)
ANITS
ACKNOWLEDGEMENT
An endeavor over a long period can be successful only with the advice and support of many
well-wishers. We take this opportunity to express our gratitude and appreciation to all of them.
We owe our tribute to Dr. K. S. Deepthi, Head of the Department, Computer Science &
Engineering (AI&ML, DS), ANITS, for the valuable support and guidance extended during the
period of the Internship.
We wish to express our sincere thanks and gratitude to our Course Curators, Akash Kulkarani
and Pavithra of Netzwerk Academy, who helped in stimulating discussions and guided us
throughout the Course. We express our warm and sincere thanks for the encouragement,
untiring guidance, and confidence they have shown in us.
We also thank all the staff members of the Computer Science & Engineering (AI&ML,DS)
department for their valuable advice, and the supporting staff for providing resources
as and when required.
Table of Contents:
1. About the Data Science Course Using Python
2. Course Modules
Mathematics
Python
Machine Learning
Deep Learning
Tableau
SQL
3. Plagiarism Report
Course Description:
The Comprehensive Data Science Course is designed to equip students with the
essential knowledge and skills required to excel in the field of data science. This course
covers a wide range of topics, including mathematics, programming, machine learning, deep
learning, data visualization, SQL, and computer vision. Whether you're a beginner or an
experienced data professional looking to advance your skills, this course provides a strong
foundation for a successful career in data science.
Course Duration: 6 months (can be adjusted based on the depth of coverage and
student pace)
Course Outline:
Prerequisites:
- No prior experience is required, but a strong desire to learn and a basic
understanding of mathematics and programming concepts will be beneficial.
Target Audience:
- Students aspiring to become data scientists
- Working professionals seeking to transition into the field of data science
- Data professionals looking to enhance their skillset
In today's data-driven world, the field of Data Science has emerged as a critical
player in various sectors, from business and healthcare to government and academia.
Data Science is the interdisciplinary practice of extracting valuable insights and
knowledge from data, and its importance cannot be overstated. Here, we explore the
significance of Data Science in a one-page overview.
1. Informed Decision-Making:
Data Science enables organizations to make informed decisions based on data
analysis. Whether it's optimizing supply chains, predicting customer behavior, or
improving healthcare outcomes, data-driven decisions lead to increased efficiency,
reduced risks, and better outcomes.
2. Healthcare Advancements:
Data Science plays a pivotal role in healthcare by analyzing patient data to improve
diagnostics, treatment plans, and disease prevention. Predictive analytics can aid in
early disease detection and personalized medicine, ultimately saving lives.
3. Scientific Research:
In academia and research, Data Science empowers scientists to analyze vast
datasets, accelerating discoveries in fields such as genomics, astronomy, and climate
science. It aids in pattern recognition, simulations, and hypothesis testing.
4. Predictive Maintenance:
In manufacturing and logistics, Data Science helps predict when machines or
vehicles require maintenance. This prevents costly breakdowns, reduces downtime,
and increases operational efficiency.
5. Environmental Sustainability:
Data Science is employed to monitor and manage environmental resources. It aids in
climate modeling, natural resource conservation, and sustainable agriculture
practices.
In conclusion, Data Science has become an indispensable tool across diverse sectors,
transforming the way businesses operate, healthcare is delivered, research is
conducted, and policies are formulated. The ability to extract actionable insights
from data is a defining factor in the success and progress of modern society, making
Data Science a field of paramount importance. As technology and data continue to
evolve, the role of Data Science will only become more central to our daily lives and
the global economy.
Importance in the workplace:
Data Science plays a pivotal role in the modern workplace, serving as a catalyst for
informed decision-making and enhanced efficiency. Its significance lies in its ability to
extract actionable insights from vast amounts of data, enabling organizations to make
data-driven decisions. Businesses leverage data science to optimize operations, forecast
trends, and identify opportunities for growth. It empowers marketing teams to target the
right audience, assists healthcare professionals in diagnosing diseases, and aids financial
institutions in fraud detection. Moreover, data science drives automation, streamlining
processes, and reducing human error. In a data-rich world, its importance cannot be
overstated, as it allows companies to remain competitive, adapt to changing market
dynamics, and stay ahead of the curve. With data science, businesses can gain a deeper
understanding of their customers, enhance products and services, and ultimately achieve
sustainable success.
Course Modules
Introduction to Mathematics:
Mathematics plays a crucial role in data science, as it provides the foundation for
various data analysis, modeling, and machine learning techniques. Here are some key
mathematical topics commonly used in data science:
1. Statistics:
- Descriptive Statistics: Measures of central tendency (mean, median, mode),
measures of spread (variance, standard deviation), and percentiles.
- Inferential Statistics: Hypothesis testing, confidence intervals, and p-values.
- Probability: Probability distributions (e.g., normal, binomial, Poisson) and
probability theory.
2. Linear Algebra:
- Vectors and Matrices: Understanding and manipulation of vectors and matrices,
which are fundamental for data transformations and modeling.
- Matrix Operations: Matrix multiplication, transposition, and determinant
calculations.
3. Calculus:
- Differentiation: Used in optimization algorithms (e.g., gradient descent) and in
finding derivatives for functions.
- Integration: Used in cumulative distribution functions and probability density
functions.
4. Optimization:
- Gradient Descent: A common optimization algorithm used in training machine
learning models to find the best model parameters.
- Convex Optimization: Techniques for finding the minimum of convex functions,
often used in machine learning.
5. Linear Regression:
- Linear regression is a statistical method that involves fitting a linear equation to a set
of data points. It requires a solid understanding of regression analysis, least squares, and
residuals.
6. Information Theory:
- Concepts like entropy and mutual information are relevant for feature selection and
measuring the information gain in machine learning.
7. Graph Theory:
- Graph theory is used in network analysis and various data science applications, such
as social network analysis and recommendation systems.
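A short NumPy sketch of a few of the topics above (descriptive statistics, matrix
operations, and a single gradient-descent step); the numbers are illustrative only and the
example assumes NumPy is installed.

# Descriptive statistics, matrix operations, and one gradient-descent step (illustrative)
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(x.mean(), np.median(x), x.std())        # central tendency and spread

A = np.array([[1.0, 2.0], [3.0, 4.0]])
v = np.array([1.0, 1.0])
print(A @ v, A.T, np.linalg.det(A))           # matrix-vector product, transpose, determinant

# One gradient-descent step on f(w) = (w - 3)^2, whose derivative is 2*(w - 3)
w, lr = 0.0, 0.1
w -= lr * 2 * (w - 3)
print(w)                                      # moves toward the minimum at w = 3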
Introduction to Python:
Python is a versatile and widely used programming language in the field of data science
due to its rich ecosystem of libraries and tools. Here are some of the essential topics and
libraries in Python used in data science:
Web Scraping:
- Libraries like Beautiful Soup and Scrapy are used for web scraping, allowing data
scientists to extract data from websites for analysis.
These Python topics and libraries form the foundation for data science work,
allowing data scientists to manipulate data, build models, visualize results, and
gain insights from various data sources. Depending on the specific data science
task, other specialized libraries and tools may also be employed.
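As a small illustration of the web-scraping point above, the sketch below parses an HTML
table with Beautiful Soup; the HTML snippet is a stand-in for a page that would normally be
fetched with a library such as requests, and the example assumes the beautifulsoup4 package
is installed.

# Parse a small HTML table with Beautiful Soup (the HTML string is illustrative)
from bs4 import BeautifulSoup

html = """
<html><body>
  <table id="prices">
    <tr><td>Dataset A</td><td>120</td></tr>
    <tr><td>Dataset B</td><td>95</td></tr>
  </table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
rows = [[cell.get_text() for cell in row.find_all("td")]
        for row in soup.select("#prices tr")]
print(rows)   # [['Dataset A', '120'], ['Dataset B', '95']]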
Introduction to Machine Learning:
1. Supervised Learning:
Regression: Involves predicting a continuous target variable based on input
features.
Classification: Focuses on categorizing data points into predefined classes or
categories.
2. Unsupervised Learning:
Clustering: Groups data points into clusters based on similarity or patterns, with
applications in customer segmentation, anomaly detection, and more.
Dimensionality Reduction: Techniques like Principal Component Analysis
(PCA) and t-SNE help reduce the number of features while preserving essential
information.
3. Time Series Analysis: Specifically for data that varies with time, often used in
financial forecasting, demand prediction, and climate modeling.
4. Deep Learning: Involves neural networks with multiple hidden layers, suitable
for complex tasks like image and speech recognition, natural language
understanding, and more.
5. Feature Engineering: The process of selecting, transforming, and creating new
features to improve model performance.
These machine learning topics are essential in the data science toolkit and are used
to extract insights, build predictive models, and make data-driven decisions across
various domains and industries. The choice of techniques depends on the specific
problem and dataset being analyzed.
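A brief scikit-learn sketch of the supervised and unsupervised ideas above, using the
built-in iris dataset; the choice of logistic regression and PCA is illustrative, not
prescribed by the course, and the example assumes scikit-learn is installed.

# Supervised classification and unsupervised dimensionality reduction (illustrative)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Supervised learning: classification
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))

# Unsupervised learning: reduce the four features to two principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print("explained variance:", pca.explained_variance_ratio_)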
Introduction to Deep Learning:
Deep learning is a subset of machine learning that focuses on neural networks with
many layers, also known as deep neural networks. Deep learning techniques have
gained significant prominence in the field of data science due to their ability to
automatically learn patterns and representations from large and complex datasets. Some
of the key topics in deep learning that are commonly used in data science include:
1. Neural Networks:
- Neural networks are the foundational building blocks of deep learning. They consist
of interconnected nodes, or artificial neurons, organized into layers. These networks can
be used for tasks such as classification, regression, and feature extraction.
2. Transfer Learning:
- Transfer learning involves using pre-trained deep learning models as a starting point
for new tasks. By fine-tuning existing models, data scientists can save time and
resources and achieve better results, especially when working with limited datasets.
3. Autoencoders:
- Autoencoders are neural networks used for unsupervised learning and dimensionality
reduction. They can be employed for data compression, feature learning, and anomaly
detection.
4. Generative Adversarial Networks (GANs):
- GANs consist of two neural networks, a generator and a discriminator, which work
in opposition. They are used for generating data that is similar to a given dataset,
making them useful for tasks like image generation, data augmentation, and creating
realistic synthetic data.
These topics in deep learning play a crucial role in data science, enabling data scientists
to tackle a wide range of complex data analysis, prediction, and modeling tasks across
various domains. Understanding and applying these techniques is essential for
harnessing the power of deep learning in data science applications.
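A minimal sketch of a small feed-forward neural network of the kind described above,
assuming TensorFlow/Keras is available; the layer sizes and the synthetic data are
illustrative only.

# A tiny multi-layer neural network trained on synthetic data (illustrative)
import numpy as np
from tensorflow import keras

X = np.random.rand(1000, 20)                  # 1000 samples, 20 features
y = (X.sum(axis=1) > 10).astype(int)          # synthetic binary labels

model = keras.Sequential([
    keras.layers.Input(shape=(20,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))        # [loss, accuracy]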
Introduction to Tableau:
Tableau is a powerful data visualization and business intelligence tool used in the field
of data science to help professionals analyze and communicate data-driven insights
effectively. Some of the key topics and features in Tableau that are used in data science
include:
1. Data Connection:
- Connecting to various data sources, such as databases, spreadsheets, and cloud
services.
- Importing and blending data from multiple sources for analysis.
2. Data Preparation:
- Cleaning and transforming data to make it suitable for analysis.
- Handling missing values, data shaping, and structuring datasets.
3. Data Visualization:
- Creating interactive and visually appealing data visualizations.
- Building various types of charts and graphs, including bar charts, line charts, scatter
plots, and heatmaps.
4. Dashboard Creation:
- Designing interactive dashboards that consolidate multiple visualizations on a single
canvas.
- Adding filters, parameters, and actions to make dashboards user-friendly.
5. Calculated Fields:
- Creating calculated fields using Tableau's formula language for custom calculations
and aggregations.
- Implementing advanced calculations for complex data analysis.
6. Time-Series Analysis:
- Analyzing time-series data with features like date hierarchies, trend lines, and
forecasting.
- Creating timelines and trend visualizations.
7. Storytelling:
- Creating data stories to convey a narrative through a sequence of dashboards and
visualizations.
- Guiding the audience through insights and findings.
Tableau's versatility and user-friendly interface make it a valuable tool in the data
science workflow, enabling professionals to explore, analyze, and communicate data
insights efficiently. It is particularly useful for creating data visualizations that can help
stakeholders understand complex data and make data-driven decisions.
Introduction to SQL:
SQL (Structured Query Language) is a critical tool in data science for managing and
manipulating data stored in relational databases. Data scientists often use SQL to
extract, transform, and analyze data. Here are some key SQL topics commonly used in
data science:
1. Data Retrieval:
- SELECT Statement: The fundamental SQL command used to retrieve data from a
database table.
- FROM Clause: Specifies the table or tables from which you want to retrieve data.
- WHERE Clause: Filters rows based on specific conditions, allowing you to extract
subsets of data.
2. Data Filtering:
- WHERE Clause: Used to filter rows based on specific conditions, such as date
ranges, categories, or numerical values.
- LIKE Operator: Enables pattern matching for text data.
- IN Operator: Allows you to filter data based on a list of values.
- BETWEEN Operator: Filters data within a specific numeric or date range.
3. Data Aggregation:
- GROUP BY Clause: Groups data based on one or more columns, often used with
aggregate functions.
- Aggregate Functions (e.g., SUM, AVG, COUNT, MAX, MIN): Perform calculations
on grouped data, providing summaries or statistics.
- HAVING Clause: Filters grouped data based on aggregate function results.
4. Joining Tables:
- JOIN Clause: Combines data from multiple tables based on related columns.
- INNER JOIN, LEFT JOIN, RIGHT JOIN, FULL JOIN: Different types of joins used
to merge data with varying inclusion rules.
5. Subqueries:
- Subquery: A query nested within another query, often used to retrieve data from one
query's result in the context of another.
6. Window Functions:
- OVER Clause: Used with aggregate functions to perform calculations within specific
windows of data, allowing for advanced data analysis.
7. Data Modification:
- INSERT, UPDATE, DELETE Statements: Used to insert new records, update
existing records, or delete records from a table.
Data scientists use these SQL topics to extract and manipulate data from databases,
create meaningful datasets, and perform various data transformations necessary for
analysis and modeling. It's an essential skill for any data scientist working with
relational databases.
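The sketch below exercises several of the SQL topics above (SELECT/WHERE, GROUP BY with
aggregate functions, HAVING, and an INNER JOIN) through Python's built-in sqlite3 module;
the drivers and rides tables and their values are invented purely for illustration.

# Run a few representative SQL statements against an in-memory SQLite database
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE drivers (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE rides (id INTEGER, driver_id INTEGER, fare REAL, city TEXT)")
cur.executemany("INSERT INTO drivers VALUES (?, ?)", [(1, "Asha"), (2, "Ravi")])
cur.executemany("INSERT INTO rides VALUES (?, ?, ?, ?)",
                [(1, 1, 9.5, "Vizag"), (2, 1, 12.0, "Vizag"), (3, 2, 7.0, "Hyderabad")])

# Data retrieval with filtering (SELECT / FROM / WHERE)
cur.execute("SELECT fare FROM rides WHERE city = 'Vizag'")
print(cur.fetchall())

# Aggregation with GROUP BY and HAVING, combined with an INNER JOIN
cur.execute("""
    SELECT d.name, COUNT(*) AS trips, AVG(r.fare) AS avg_fare
    FROM rides r
    INNER JOIN drivers d ON d.id = r.driver_id
    GROUP BY d.name
    HAVING COUNT(*) >= 1
""")
print(cur.fetchall())
conn.close()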
PROJECT:
Title: Taxi Fare Prediction Data Science and Machine Learning Project
Project Overview:
The Taxi Fare Prediction project is a data science and machine learning endeavor aimed
at developing a predictive model for estimating taxi fares in urban areas. This project is
essential for optimizing pricing strategies and providing fare estimates to passengers,
offering transparency and convenience.
Project Steps:
1. Data Collection:
- Gather historical taxi ride data, including information such as pick-up and drop-off
locations, trip duration, distance, time of day, and, most importantly, the actual fare
charged.
2. Data Preprocessing:
- Data Cleaning: Handle missing values, outliers, and data inconsistencies.
- Feature Engineering: Create relevant features such as time of day, day of the week,
and geographical features like the distance between pick-up and drop-off locations (a
sketch of this distance feature appears after this list).
- Data Transformation: Convert categorical data into numerical formats and scale
numerical features as needed.
3. Model Selection:
- Choose an appropriate machine learning model for regression, as this is a prediction
task.
- Experiment with various models such as linear regression, decision trees, random
forests, gradient boosting, or neural networks.
4. Model Training:
- Train the selected machine learning model on the training dataset.
- Optimize hyperparameters to improve model performance.
5. Model Evaluation:
- Assess the model's performance on the validation dataset using metrics like Mean
Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error
(RMSE).
6. Model Testing:
- Evaluate the final model's performance on the test dataset to ensure it generalizes
well to new data.
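As referenced in the Data Preprocessing step, a minimal sketch of the pick-up/drop-off
distance feature; the haversine formula itself is standard, but the helper below is
illustrative and not necessarily the exact code used in the project.

# Illustrative helper for a pickup-to-dropoff distance feature (haversine distance)
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 6371 * 2 * np.arcsin(np.sqrt(a))

# Example usage on the dataset used later in the notebook:
# data['distance_km'] = haversine_km(data['latitude_of_pickup'],
#                                    data['longitude_of_pickup'],
#                                    data['latitude_of_dropoff'],
#                                    data['longitude_of_dropoff'])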
Project Goals:
- Develop an accurate and reliable machine learning model for taxi fare prediction.
- Improve passenger experience by providing transparent fare estimates.
- Assist taxi service providers in optimizing pricing strategies and revenue management.
This Taxi Fare Prediction Data Science and Machine Learning Project has significant
implications for the transportation industry, providing a valuable tool for both taxi
service providers and passengers. It enhances pricing transparency, improves user
experience, and contributes to data-driven decision-making in the taxi business.
taxifare
November 4, 2023
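The first cells of the notebook (imports and data loading) were not captured in this
export; the cells that follow assume roughly the setup below, where the file name
taxifare.csv is a placeholder.

# Assumed setup for the cells below (cells [1]-[2] were not captured in the export);
# 'taxifare.csv' is a placeholder file name.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor

data = pd.read_csv('taxifare.csv')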
[3]: data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 unique_id 50000 non-null object
1 amount 50000 non-null float64
2 date_time_of_pickup 50000 non-null object
3 longitude_of_pickup 50000 non-null float64
4 latitude_of_pickup 50000 non-null float64
5 longitude_of_dropoff 50000 non-null float64
6 latitude_of_dropoff 50000 non-null float64
7 no_of_passenger 50000 non-null int64
dtypes: float64(5), int64(1), object(2)
memory usage: 3.1+ MB
[4]: data.describe().T
(output truncated in this export; the visible rows and columns are)
                       count       mean        std         min        25%
latitude_of_pickup   50000.0  39.933759   6.224857  -74.006893  40.734880
longitude_of_dropoff 50000.0 -72.504616  10.407570  -84.654241 -73.991152
latitude_of_dropoff  50000.0  39.926251   6.014737  -74.006377  40.734372
no_of_passenger      50000.0   1.667840   1.289195    0.000000   1.000000
[5]: data.isna().sum()
[5]: unique_id 0
amount 0
date_time_of_pickup 0
longitude_of_pickup 0
latitude_of_pickup 0
longitude_of_dropoff 0
latitude_of_dropoff 0
no_of_passenger 0
dtype: int64
[6]: # Convert the pickup timestamp to a proper datetime type
data['date_time_of_pickup']=pd.to_datetime(data['date_time_of_pickup'])
[7]: # Extract date/time features from the pickup timestamp
data['year']=data['date_time_of_pickup'].dt.year
data['month']=data['date_time_of_pickup'].dt.month
data['date']=data['date_time_of_pickup'].dt.day
data['hour']=data['date_time_of_pickup'].dt.hour
data['day']=data['date_time_of_pickup'].dt.dayofweek
data['min']=data['date_time_of_pickup'].dt.minute
[8]: # Drop the identifier and the raw timestamp, which are no longer needed
data.drop(['unique_id','date_time_of_pickup'],axis=1,inplace=True)
[9]: data
(DataFrame preview truncated in this export; the last visible rows are)
49997  6.9  -74.002698  40.739428  -73.998108  …
49998  4.5  -73.946062  40.777567  -73.953450  …
49999 10.9  -73.932603  40.763805  -73.932603  …
[10]: # Split the data into features (X) and target variable (y)
X = data.drop('amount', axis=1)
y = data['amount']
[12]: plt.scatter(data['no_of_passenger'],data['amount'])
plt.xlabel("no_of_passenger")
plt.ylabel("amount")
plt.show()
[13]: plt.scatter(data['amount'],data['no_of_passenger'])
plt.ylabel("no_of_passenger")
plt.xlabel("amount")
plt.show()
[14]: plt.scatter(data['day'],data['no_of_passenger'])
plt.ylabel("no_of_passenger")
plt.xlabel("day")
plt.show()
[15]: plt.scatter(data['year'],data['amount'])
plt.ylabel("amount")
plt.xlabel("year")
plt.show()
[56]: # Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
[57]: LinearRegression()
# Train the model
model8.fit(X_train,y_train)
print(model8.score(X_test,y_test)*100)   # score (%) on the test set
model8.score(X_train,y_train)*100        # score (%) on the training set
2.06623944734462
[59]: 1.6400191665462027
(The code of cells [60]-[62], which fit and scored further regression models, was not
captured in this export; only their printed test scores and returned training scores
remain: 75.19978360686885 / 96.76705114116024, 69.28470842864147 / 75.5375121470838,
and -9.082248439246676 / -9.377654882461295.)
[63]: model=DecisionTreeRegressor()
model.fit(X_train,y_train)
print(model.score(X_test,y_test)*100)
model.score(X_train,y_train)*100
55.07387698313491
[63]: 100.0
[64]: model=AdaBoostRegressor()
model.fit(X_train,y_train)
print(model.score(X_test,y_test)*100)
model.score(X_train,y_train)*100
-10.976289738995849
[64]: -7.4549865583041
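The cell that trains the random forest regressor referred to in the conclusion is not
reproduced in this export; a minimal sketch of how it would be fitted and scored,
mirroring the cells above (the hyperparameter values are assumptions):

# Assumed reconstruction of the missing random forest cell; n_estimators and
# random_state are illustrative choices, not the values used in the original notebook.
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(model.score(X_test, y_test)*100)   # score (%) on the test set
model.score(X_train, y_train)*100        # score (%) on the training set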
Conclusion: The random forest regressor has the highest accuracy compared to the other
models, so the random forest regressor can be used for predicting the fare.
COURSE COMPLETION CERTIFICATE