Internship Report Data Science
Data Science is: Popular
Lots of Data => Lots of Analysis => Lots of Jobs
Data is: Big!
Lots of Data => Lots of Analysis => Lots of Jobs
Data Science is: making sense of Data
Lots of Data => Lots of Analysis => Lots of Jobs
Data Science is: multidisciplinary
● Statisticians
● Mathematicians
● Computer Scientists in
○ Data mining
○ Artificial Intelligence & Machine Learning
○ Systems Development and Integration
○ Database development
○ Analytics
● Domain Experts
○ Medical experts
○ Geneticists
○ Finance, Business, Economy experts
○ etc.
[Workflow diagram: Start → Plan (What is the question? What type of data is needed?) → Data Acquisition (scripts) → Clean Data (data reformatting, data quality & imputing) → Analysis → Deployment and optimization]
Data Acquisition Stage
● Data may not exist
● Sources of data may be public or private
● Not all sources of data may be suitable for processing
● Data are often incomplete and dirty
● Data consolidation and cleanup are essential
○ Pieces of data may be in different sources
○ Formats may not match/may be incompatible
○ Unstructured data may need to be accounted for
Data Acquisition Stage -- Example
Example: Online customer experience may require collecting lots of data such as
● clicks
● conversions
● add-to-cart rate
● dwell time
● average order value
● foot traffic
● bounce rate
● exits and time to purchase
Data Acquisition: Type and Source of Data
● Time spent on a page, browsing and/or search history
○ Website Logs
● User and Inventory Data
○ Transaction databases
● Social Engagement
○ Social Networks (Yelp, Twitter, ...)
● Customer Support
○ Call Logs, Emails
● Gas prices, competitors, news, stock prices, etc.
○ RSS Feeds, News Sites, Wikipedia, ...
● Training Data?
○ CrowdFlower, Mechanical Turk
Data Acquisition: Storage and Access
● Where the data resides
○ Cloud or computing clusters
● Storage System (see the access sketch below)
○ SQL, NoSQL, file system
○ SQL: MySQL, Oracle, Microsoft SQL Server, ...
○ NoSQL: MongoDB, Cassandra, Couchbase, HBase, Hive, ...
○ Text Indexing: Solr, Elasticsearch, ...
● Data Processing Frameworks:
○ Hadoop, Spark, Storm, etc.
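As a toy illustration of the storage-and-access step, here is a minimal Python sketch that pulls a table from a file-based SQL store into a pandas DataFrame. The shop.db file and the orders table are hypothetical names used only for illustration; a NoSQL store such as MongoDB would instead be accessed through its own client library (e.g., pymongo).

```python
import sqlite3
import pandas as pd

# Hypothetical file-based SQL store with an "orders" table.
con = sqlite3.connect("shop.db")
df = pd.read_sql("SELECT * FROM orders", con)  # table -> DataFrame
con.close()
print(df.head())
```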
Data Acquisition: Data Integration
Data integration involves combining data residing in different sources and providing users with a unified view of these data. (Wikipedia)
[Diagram: multiple sources (Data Source 1 ... Data Source 4) merged into a single unified view]
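A minimal sketch of the same idea in pandas, joining two hypothetical sources on a shared customer_id key to produce one unified view:

```python
import pandas as pd

# Two hypothetical sources sharing a "customer_id" key.
crm = pd.DataFrame({"customer_id": [1, 2, 3],
                    "name": ["Ada", "Ben", "Cleo"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3],
                       "amount": [25.0, 40.0, 13.5]})

# Join the sources into one unified view of each customer.
unified = crm.merge(orders, on="customer_id", how="left")
print(unified)
```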
Data Cleaning
● Data are often incomplete or incorrect.
○ Typos: e.g., text data in numeric fields
○ Missing values: some fields may not be collected for some of the examples
○ Impossible data combinations: e.g., gender = MALE, pregnant = TRUE
○ Out-of-range values: e.g., age = 1000
● Garbage In, Garbage Out
● Tools: scripting, visualization (see the sketch below)
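A minimal cleaning sketch in pandas, using hypothetical data that exhibits each of the problems listed above:

```python
import pandas as pd

# Hypothetical raw data exhibiting typical quality problems.
df = pd.DataFrame({"age": ["34", "1000", "n/a", "28"],
                   "gender": ["MALE", "FEMALE", "MALE", "MALE"],
                   "pregnant": [True, True, False, False]})

# Typos: text in a numeric field becomes NaN instead of raising an error.
df["age"] = pd.to_numeric(df["age"], errors="coerce")

# Out-of-range values: treat impossible ages as missing.
df.loc[(df["age"] < 0) | (df["age"] > 120), "age"] = float("nan")

# Impossible combinations: flag rows such as gender=MALE, pregnant=TRUE.
print(df[(df["gender"] == "MALE") & df["pregnant"]])

# Missing values: count them per column before deciding how to impute.
print(df.isna().sum())
```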
Analysis - Data Preparation
● Univariate Analysis: Analyze/explore variables one by one
● Bivariate Analysis: Explore relationship between variables
● Coverage, missing values: treating unknown values
● Outliers: detect and treat values that are distant from other observations
● Feature Engineering: variable transformations and creation of new, better variables from raw features
Analysis - Exploratory Analysis
Univariate Analysis: Analyze/explore variables one by one
Analysis - Exploratory Analysis
Summary statistics for “Temperature”:
Min. 1st Qu. Median Mean 3rd Qu. Max. Std Dev.
-7.29 45.90 60.71 59.36 73.88 102.00 18.68
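Statistics like these come straight from standard summary routines; a minimal pandas sketch, assuming a DataFrame df with a Temperature column (the values here are made up):

```python
import pandas as pd

df = pd.DataFrame({"Temperature": [59.1, 60.7, 45.9, 73.9, 102.0, -7.3]})

# count, mean, std, min, quartiles, and max in one call
print(df["Temperature"].describe())
```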
Analysis - Exploratory Analysis
Bivariate Analysis: Explore relationship between variables
- Categorical vs. categorical variables -> crosstab table; visualize with stacked bar charts
- Continuous vs. categorical variables -> visualize with boxplots or histograms for each level (category)
(A sketch of both follows.)
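A minimal bivariate sketch with hypothetical data: a crosstab for two categorical variables, and a boxplot per category for a continuous vs. categorical pair.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data: two categorical variables and one continuous one.
df = pd.DataFrame({"store": ["A", "A", "B", "B", "B"],
                   "holiday": [True, False, False, True, False],
                   "sales": [120.0, 80.0, 95.0, 150.0, 90.0]})

# Categorical vs. categorical: a crosstab table.
print(pd.crosstab(df["store"], df["holiday"]))

# Continuous vs. categorical: one boxplot of sales per store level.
df.boxplot(column="sales", by="store")
plt.show()
```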
Analysis - Correlation vs Causation
Correlation ⇏ causation!
To prove causation, a controlled experiment (e.g., a randomized A/B test) is needed.
Analysis - Feature Engineering
Transform variables: create new features from existing raw features, e.g., discretize or bin them.
Create new categorical variables when a variable has too many levels, levels that rarely occur, or one level that almost always occurs. (A sketch follows.)
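A minimal feature engineering sketch with hypothetical data, showing binning of a continuous variable and collapsing rarely occurring levels of a categorical one:

```python
import pandas as pd

# Hypothetical raw features.
df = pd.DataFrame({"age": [22, 37, 55, 41, 68],
                   "city": ["NYC", "NYC", "NYC", "Oslo", "Lima"]})

# Discretize/bin a continuous variable into a new categorical feature.
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 50, 120],
                         labels=["young", "middle", "senior"])

# Collapse rarely occurring levels into a single "other" level.
counts = df["city"].value_counts()
rare = counts[counts < 2].index
df["city_grouped"] = df["city"].where(~df["city"].isin(rare), "other")
print(df)
```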
Analysis - Missing Values
Missing values are unknown values of a feature.
They matter because they may lead to biased models or incorrect estimations and conclusions.
Some ML algorithms accept missing values: for example, some tree-based models treat missing values as a separate branch, while many other algorithms require a complete dataset. Therefore, we can either drop the incomplete rows or impute the missing values, as sketched below.
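A minimal sketch of the two standard treatments, on a hypothetical feature:

```python
import pandas as pd

# Hypothetical feature with missing entries.
df = pd.DataFrame({"income": [42000.0, None, 58000.0, None, 39000.0]})

# Option 1: drop incomplete rows (loses data).
dropped = df.dropna()

# Option 2: impute with a summary statistic such as the median.
df["income"] = df["income"].fillna(df["income"].median())
print(df)
```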
Analysis - Outliers
Outliers are values distant from other observations, e.g., values more than ~three standard deviations away from the mean, values beyond the bottom and top 5th percentiles, or values more than 1.5 × IQR outside the quartiles.
Visualization methods like boxplots, histograms, and scatterplots help detect them.
Some algorithms, like regression, are sensitive to outliers, which can cause high error variance and bias in the estimated values.
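A minimal detection sketch on hypothetical data, applying the three-standard-deviation and 1.5 × IQR rules mentioned above:

```python
import pandas as pd

# Hypothetical observations with one suspicious value.
s = pd.Series([12.1, 11.8, 12.5, 12.0, 11.9, 12.3, 12.2, 55.0])

# Rule 1: more than ~3 standard deviations from the mean.
# Note: a single extreme point inflates the std, so on tiny samples
# this rule may miss the very outlier it is looking for.
z_outliers = s[(s - s.mean()).abs() > 3 * s.std()]

# Rule 2: more than 1.5 * IQR beyond the quartiles (the boxplot rule).
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

print(z_outliers)
print(iqr_outliers)
```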
Predictive data modeling
Prediction: that is the end goal of many data science adventures!
Machine learning
● Machine Learning is the study of algorithms that improve their performance at some task with example data or past experience
○ The foundations of many ML algorithms lie in statistics and optimization theory
○ Role of computer science: efficient algorithms to
■ Solve the optimization problem
■ Represent and evaluate data models for inference
● A wide variety of off-the-shelf algorithms is available today. Just pick a library and go! (Is it really that easy?)
○ Short answer: no. Long answer: model selection and tuning require deeper understanding.
Machine learning - basics
Machine learning systems are made up of 3 major parts, which are: model representation, model evaluation, and optimization.
Model selection and generalization
● Learning is an ill-posed problem; data is not sufficient to find a unique solution
● There is a trade-off between three factors:
○ Model complexity
○ Training set size
○ Generalization error (expected error on new data)
● Overfitting and underfitting problems (see the sketch after this list)
Ref: http://www.inf.ed.ac.uk/teaching/courses/iaml/slides/eval-2x2.pdf
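A minimal sketch of the complexity trade-off on synthetic data: scikit-learn fits polynomials of increasing degree, and comparing train vs. test error exposes underfitting at low degrees and overfitting at high ones. All data here are made up.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, size=60)  # noisy target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):  # underfit, about right, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(degree,
          mean_squared_error(y_tr, model.predict(X_tr)),   # train error
          mean_squared_error(y_te, model.predict(X_te)))   # test error
```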
Generalization error and cross-validation
● Measuring the generalization error is a major challenge in data mining and machine learning
● To estimate generalization error, we need data unseen during training. We could split the data as
○ Training set (50%)
○ Validation set (25%) (optional, for selecting ML algorithm parameters)
○ Test (publication) set (25%)
● How to avoid selection bias: k-fold cross-validation (see the sketch below)
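A minimal k-fold sketch with scikit-learn on a built-in dataset; each observation serves as test data exactly once, which avoids the selection bias of a single lucky or unlucky split.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: 5 train/test splits, one score per fold.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())
```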
Deep Learning
● Neural networks (NNs) have been around for decades, but they just weren't “deep” enough. NNs with several hidden layers are called deep neural networks (DNNs).
● Unlike many ML approaches, deep learning attempts to model high-level abstractions in the data.
● Deep learning is best suited when the input space is locally structured (spatial or temporal), as opposed to arbitrary input features. (A toy sketch follows.)
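As a toy illustration of “several hidden layers”, here is a minimal scikit-learn sketch; serious deep learning work would normally use a dedicated framework (e.g., TensorFlow or PyTorch) and far more data.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Three hidden layers -> a (small) deep neural network.
dnn = MLPClassifier(hidden_layer_sizes=(64, 64, 64), max_iter=500,
                    random_state=0)
dnn.fit(X_tr, y_tr)
print(dnn.score(X_te, y_te))  # accuracy on held-out data
```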
Deployment, maintenance and optimization
● Deployed solutions might include:
○ A trained data model (model + parameters)
○ Routines for inputting and prediction
○ (Optional) Routines for model improvement (through feedback, the deployed system can improve itself)
○ (Optional) Routines for training
● Once the model has been deployed in production, it is time for regular maintenance and operations.
● The optimization phase could be triggered by failing performance, by the need to add new data sources and retrain the model, or even by deploying improved versions of the model based on better algorithms. (A persistence sketch follows.)
Ref: http://www.datasciencecentral.com/m/blogpost?id=6448529%3ABlogPost%3A234092
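A minimal sketch of “a trained model plus a prediction routine”, using joblib to persist a scikit-learn model; the file name model.joblib and the feature layout are hypothetical.

```python
from joblib import dump, load
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Training side: fit and persist the model (model + parameters).
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=0).fit(X, y)
dump(model, "model.joblib")

# Serving side: load the artifact and expose a prediction routine.
def predict(features):
    """Input routine for one observation (4 iris measurements)."""
    return load("model.joblib").predict([features])[0]

print(predict([5.1, 3.5, 1.4, 0.2]))
```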
Recap - Software Toolbox of Data Scientists:
● Database
○ SQL
○ NoSQL languages for target databases
● Programming Languages and Libraries
○ Python (due to the availability of libraries for data management): scikit-learn, PyML, pandas
○ R
○ General programming languages such as Java for gluing different systems together
○ C/C++: mlpack, dlib
Domain knowledge may make or break a system: if you do not realize a type of data is essential, the results will not be very useful.
WHAT IS CLOUD COMPUTING?
Cloud computing refers to the use of hosted services, such as data storage, servers, databases, networking, and software, over the internet. The data is stored on physical servers that are maintained by a cloud service provider. In cloud computing, computer system resources, especially data storage and computing power, are available on demand, without direct management by the user.
Instead of storing files on a storage device or hard drive, a user can save them in the cloud, making it possible to access the files from anywhere, as long as they have access to the web. The services hosted in the cloud can be broadly divided into infrastructure-as-a-service (IaaS), platform-as-a-service (PaaS), and software-as-a-service (SaaS). Based on the deployment model, the cloud can also be classified as public, private, and hybrid cloud.
Further, the cloud can be divided into two different layers, namely front-end and back-end. The layer with which users interact is called the front-end layer. This layer enables a user to access the data that has been stored in the cloud through cloud computing software.
The layer made up of software and hardware, i.e., the computers, servers, central servers, and databases, is the back-end layer. This layer is the primary component of the cloud and is entirely responsible for storing information securely. To ensure seamless connectivity between devices linked via cloud computing, the central servers use a software called middleware that acts as a bridge between the database and applications.
TYPES OF CLOUD COMPUTING
Cloud computing can be classified either by deployment model or by the type of service. Based on the deployment model, the cloud can be classified as public, private, or hybrid. Based on the service offered, it can be classified as infrastructure-as-a-service (IaaS), platform-as-a-service (PaaS), or software-as-a-service (SaaS).
PRIVATE CLOUD
In a private cloud, the computing services are offered over a private IT network for the dedicated use of a single organization. Also termed internal, enterprise, or corporate cloud, a private cloud is usually managed via internal resources and is not accessible to anyone outside the organization. Private cloud computing provides all the benefits of a public cloud, such as self-service, scalability, and elasticity, along with additional control, security, and customization.
Private clouds provide a higher level of security through company firewalls and internal hosting to ensure that an organization’s sensitive data is not accessible to third-party providers. The drawback of the private cloud, however, is that the organization becomes responsible for all the management and maintenance of the data centers, which can prove to be quite resource-intensive.
PUBLIC CLOUD
Public cloud refers to computing services offered by third-party providers over the internet. Unlike the private cloud, the services on a public cloud are available to anyone who wants to use or purchase them. These services could be free or sold on demand, where users only have to pay per usage for the CPU cycles, storage, or bandwidth they consume.
Public clouds can help businesses save on purchasing, managing, and maintaining on-premises infrastructure, since the cloud service provider is responsible for managing the system. They also offer scalable RAM and flexible bandwidth, making it easier for businesses to scale their storage needs.
HYBRID CLOUD
Hybrid cloud uses a combination of public and private cloud features. This “best of both worlds” cloud model allows a shift of workloads between private and public clouds as the computing and cost requirements change. When the demand for computing and processing fluctuates, the hybrid cloud allows businesses to scale their on-premises infrastructure up to the public cloud to handle the overflow while ensuring that no third-party data centers have access to their data.
In a hybrid cloud model, companies only pay for the resources they use temporarily instead of purchasing and maintaining resources that may not be used for an extended period. In short, a hybrid cloud offers the benefits of a public cloud without its security risks.
WHAT IS A DATA WAREHOUSE?
A Data Warehouse (DW) is a relational database that is designed for query and analysis rather than transaction processing. It includes historical data derived from transaction data from single and multiple sources.
A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on providing support for decision-makers for data modeling and analysis.
A Data Warehouse is a group of data specific to the entire organization, not only to a particular group of users.
It is not used for daily operations and transaction processing, but for making decisions.
A Data Warehouse can be viewed as a data system with the following attributes:
o It is a database designed for investigative tasks, using data from various applications.
o It supports a relatively small number of clients with relatively long interactions.
o It includes current and historical data to provide a historical perspective of information.
o Its usage is read-intensive.
o It contains a few large tables.
"A Data Warehouse is a subject-oriented, integrated, and time-variant store of information in support of management's decisions."
What is Data Mining
Data Mining is the computer-assisted process of extracting knowledge from large amounts of data.
In other words, data mining derives its name from the analogy of Data + Mining: just as mining is done in the ground to find valuable ore, data mining is done to find valuable information in a dataset.
Data Mining tools predict customer habits, patterns, and future trends, allowing businesses to increase company revenues and make proactive decisions.
The user interface may be any website. A product is searched for in the database, data warehouse, World Wide Web, and other repositories (bottom part of Figure 1). This means that the data searched for will be fetched from all over the net. The data will then be cleansed with the help of a parser to remove noise, errors, and unwanted data. Then the selective data will be integrated, and all the data will be fetched by the Data Warehouse Server. With the help of the knowledge base and pattern evaluation, the result will be presented to the interface.