INTERNSHIP REPORT-DATA SCIENCE

Data Science is: Popular
Lots of Data => Lots of Analysis => Lots of Jobs

Universities: starting new multidisciplinary programs

Industry: a cottage industry is evolving for online courses and training

Goal of this Talk:

● Hear it from people who do it and what they do


● Use it for further learning and specialization

Data is: Big!
Lots of Data => Lots of Analysis => Lots of Jobs

● 2.5 quintillion (10^18) bytes of data are generated every day!


● Everything around you collects/generates data
● Social media sites
● Business transactions
● Location-based data
● Sensors
● Digital photos, videos
● Consumer behaviour (online and store transactions)
● More data is publicly available
● Database technology is advancing
● Cloud based & mobile applications are widespread
Source: IBM http://www-01.ibm.com/software/data/bigdata/
If I have data, I will know :)
Everyone wants better predictability, forecasting, customer satisfaction, market
differentiation, prevention, great user experience, ...

● How can I price a particular product?


● What can we recommend online customers buy after they purchase X, Y, or Z?
● How can we discover market segments and group customers into them?
● What will customers buy in the upcoming holiday season? (what to stock?)
● What is the price point for customer retention for subscriptions?

Data Science is: making sense of Data
Lots of Data => Lots of Analysis => Lots of Jobs

● Multidisciplinary study of data collections for analysis, prediction, learning, and prevention.
● Utilized in a wide variety of industries.
● Involves both structured and unstructured data sources.

Data Science is: multidisciplinary
● Statisticians
● Mathematicians
● Computer Scientists in
○ Data mining
○ Artificial Intelligence & Machine Learning
○ Systems Development and Integration
○ Database development
○ Analytics
● Domain Experts
○ Medical experts
○ Geneticists
○ Finance, Business, Economy experts
○ etc.
[Figure: The data science process pipeline. Plan (What is the question? What type of data is needed?) -> Data Acquisition -> Data Quality Analysis (clean data: reformatting and imputing; scripts) -> Data Analysis (explore the data, feature selection, feature engineering; scripts) -> Modeling (model selection, optimization; scripts) -> Deployment and Optimization (deployment, results evaluation, maintenance).]
[Figure: the data science process pipeline repeated, leading into the Data Acquisition stage.]
Data Acquisition Stage
● As soon as the data scientist has identified the problem she is trying to solve, she must assess:
● What type of data is available
● What might be required and currently is not collected
● Is it available from other units of the company?
● Does she need to crawl/buy data from third parties?
● How much data is needed? (Data volume)
● How to access the data?
● Is the data private?
● Is it legally OK to use the data?

Data Acquisition Stage
● Data may not exist
● Sources of data may be public or private
● Not all sources of data may be suitable for processing
● Data are often incomplete and dirty
● Data consolidation and cleanup are essential
○ Pieces of data may be in different sources
○ Formats may not match/may be incompatible
○ Unstructured data may need to be accounted for

Data Acquisition Stage -- Example
Example: Online customer experience may require collecting lots of data such as

● clicks
● conversions
● add-to-cart rate
● dwell time
● average order value
● foot traffic
● bounce rate
● exits and time to purchase

Data Acquisition: Type and Source of Data
● Time spent on a page, browsing and/or
search history
○ Website Logs
● User and Inventory Data
○ Transaction databases
● Social Engagement
○ Social Networks (Yelp, Twitter,...)
● Customer Support
○ Call Logs, Emails
● Gas prices, competitors, news, stock
prices, etc.
○ RSS Feeds, News Sites, Wikipedia,...
● Training Data?
○ CrowdFlower, Mechanical Turk

Data Acquisition: Storage and Access
● Where the data resides
○ Cloud or Computing Clusters
● Storage System
○ SQL, NoSQL, File System
○ SQL: MySQL, Oracle, MS SQL Server, ...
○ NoSQL: MongoDB, Cassandra,
Couchbase, HBase, Hive, ...
○ Text Indexing: Solr, ElasticSearch,...
● Data Processing Frameworks:
○ Hadoop, Spark, Storm etc...

Data Acquisition: Data Integration
Data integration involves combining data residing in different sources and providing users with a unified view of these data. (Wikipedia)

● Schema Mapping
● Record Matching
● Data Cleaning

[Figure: Data Sources 1-4 flowing through ETL into a Data Warehouse.]

Data Cleaning
● Data are often incomplete or incorrect.
○ Typos: e.g., text data in numeric fields
○ Missing values: some fields may not be collected for some of the examples
○ Impossible data combinations: e.g., gender = MALE, pregnant = TRUE
○ Out-of-range values: e.g., age = 1000
● Garbage In Garbage Out
● Scripting, Visualization

Figure ref: https://thedailyomnivore.net/2015/12/02/
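To make these checks concrete, here is a minimal pandas sketch; the table and column names are hypothetical, not from the slides.

```python
import pandas as pd

# Hypothetical customer table containing the error types listed above
df = pd.DataFrame({
    "age": [34, 1000, 28, None],             # out-of-range value, missing value
    "gender": ["FEMALE", "MALE", "MALE", "FEMALE"],
    "pregnant": [True, True, False, False],  # row 1: impossible combination
    "income": ["52000", "48k", "61000", "57000"],  # typo: text in a numeric field
})

print(df[(df["age"] < 0) | (df["age"] > 120)])        # out-of-range ages
print(df[df["age"].isna()])                           # missing values
print(df[(df["gender"] == "MALE") & df["pregnant"]])  # impossible combinations
print(df[pd.to_numeric(df["income"], errors="coerce").isna()])  # non-numeric text
```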


[Figure: the data science process pipeline repeated, leading into the Data Analysis stage.]
Analysis - Data Preparation
● Univariate Analysis: Analyze/explore variables one by one
● Bivariate Analysis: Explore relationship between variables
● Coverage, missing values: treating unknown values
● Outliers: detect and treat values that are distant from other observations
● Feature Engineering: variable transformations and creation of new, better
variables from raw features

Commonly used tools:


● SQL
● R: plyr, reshape, ggplot2, data.table
● Python: NumPy, Pandas, SciPy, matplotlib

Analysis - Exploratory Analysis
Univariate Analysis: Analyze/explore variables one by one

- Continuous variable: explore central tendency and spread of the values


- Summary statistics
- mean, median, min, max
- IQR, standard deviation, variance, quartile
- Visualize Histograms, Boxplots

Analysis - Exploratory Analysis
Summary statistics for “Temperature”:
Min. 1st Qu. Median Mean 3rd Qu. Max. Std Dev.
-7.29 45.90 60.71 59.36 73.88 102.00 18.68

Walmart Store Sales Forecasting Data, Kaggle
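A minimal pandas sketch of how such summary statistics can be computed, assuming the Kaggle data has been downloaded locally as train.csv with a Temperature column:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("train.csv")  # assumed local copy of the Walmart dataset

print(df["Temperature"].describe())  # count, mean, std, min, quartiles, max
iqr = df["Temperature"].quantile(0.75) - df["Temperature"].quantile(0.25)
print("IQR:", iqr, "Variance:", df["Temperature"].var())

# Visualize the distribution with a histogram and a boxplot
fig, (ax1, ax2) = plt.subplots(1, 2)
df["Temperature"].plot.hist(bins=30, ax=ax1, title="Histogram")
df["Temperature"].plot.box(ax=ax2, title="Boxplot")
plt.show()
```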


Analysis - Exploratory Analysis
Univariate Analysis: Analyze/explore variables one by one

- Categorical Variable: frequency tables


- Count and count %
- Visualize Bar charts

Analysis - Exploratory Analysis
Bivariate Analysis: Explore relationship between variables

- Continuous to continuous variables: correlation measures the strength and
direction of a linear relationship
- Visualize scatterplots -> the relationship may not be linear
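As an illustrative pandas sketch, with column names assumed from the same Walmart-style data:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("train.csv")  # assumed columns: Temperature, Weekly_Sales

# Pearson correlation: strength/direction of a *linear* relationship, in [-1, 1]
print(df["Temperature"].corr(df["Weekly_Sales"]))

# A scatterplot can reveal structure that a single correlation number hides
df.plot.scatter(x="Temperature", y="Weekly_Sales")
plt.show()
```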

Analysis - Exploratory Analysis
Bivariate Analysis: Explore relationship between variables
- Categorical to categorical variables -> crosstab table
- Visualize Stacked bar charts
- Continuous to categorical variables ->
- Visualize boxplots or histograms for each level (category)
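A short pandas sketch of both cases, with hypothetical file and column names:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("customers.csv")  # hypothetical data

# Categorical vs. categorical: crosstab, shown as a stacked bar chart
pd.crosstab(df["segment"], df["churned"]).plot.bar(stacked=True)

# Continuous vs. categorical: one boxplot per level of the category
df.boxplot(column="order_value", by="segment")
plt.show()
```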

Analysis - Correlation vs Causation
Correlation ⇏ causation!

Analysis - Correlation vs Causation
Correlation ⇏ causation!

To prove causation:

● Randomized controlled experiments


● Hypothesis testing, A/B testing
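As a sketch of the idea, on simulated data rather than anything from the slides: a two-sample t-test compares the two groups of an A/B experiment.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated A/B test: some metric for the control (A) and the variant (B)
group_a = rng.normal(loc=10.0, scale=2.0, size=500)
group_b = rng.normal(loc=10.4, scale=2.0, size=500)

# Null hypothesis: the two groups have the same mean
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(t_stat, p_value)  # a small p-value is evidence the variant changed the metric
```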

Analysis - Feature Engineering
Transform variables: create new features from existing raw features (discretize, bin)

Create new categorical variables when a variable has too many levels, levels that rarely
occur, or one level that almost always occurs

Handle extremely skewed data and outliers

Imputation: filling in missing data
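A brief pandas sketch of these transformations; the column names and thresholds are hypothetical:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical data

# Discretize/bin a continuous feature into a new categorical one
df["age_band"] = pd.cut(df["age"], bins=[0, 25, 45, 65, 120],
                        labels=["young", "adult", "middle-aged", "senior"])

# Collapse rarely occurring levels of a categorical variable into "other"
counts = df["country"].value_counts()
df["country_grouped"] = df["country"].replace(counts[counts < 50].index, "other")

# Tame an extremely skewed variable with a log transform
df["log_income"] = np.log1p(df["income"])
```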

Analysis - Missing Values
Missing values are unknown values of a feature.

Important as they may lead to biased models or incorrect estimations and conclusions.

Some ML algorithms accept missing values: for example, some tree-based models treat missing
values as a separate branch, while many other algorithms require a complete dataset. Therefore,
we can

● omit: remove missing values and use available data


● impute: replace missing values with estimates, such as the mean/median/mode of the
existing data, the most similar data points (KNN), or more complex algorithms like Random
Forest
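A minimal scikit-learn sketch of the omit and impute options on a toy table:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

X = pd.DataFrame({"age": [25, np.nan, 40, 31],
                  "income": [50000, 62000, np.nan, 58000]})

# Omit: drop rows with any missing value
print(X.dropna())

# Impute with a simple statistic (here the median; mean/mode work the same way)
print(SimpleImputer(strategy="median").fit_transform(X))

# Impute from the most similar rows (k-nearest neighbours)
print(KNNImputer(n_neighbors=2).fit_transform(X))
```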

Analysis - Outliers
Outliers are values distant from other observations: for example, values more than about three
standard deviations from the mean, values in the top or bottom 5 percentiles, or values more
than 1.5 × IQR beyond the quartiles.
Visualization methods like boxplots, histograms, and scatterplots help.
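A pandas sketch of the three rules, assuming a continuous column such as the earlier Temperature example:

```python
import pandas as pd

s = pd.read_csv("train.csv")["Temperature"]  # any continuous column works

# Rule 1: more than ~3 standard deviations from the mean
z_out = s[(s - s.mean()).abs() > 3 * s.std()]

# Rule 2: beyond the top/bottom 5th percentiles
pct_out = s[(s < s.quantile(0.05)) | (s > s.quantile(0.95))]

# Rule 3: more than 1.5 * IQR outside the quartiles (the boxplot rule)
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_out = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
```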

Analysis - Outliers
Some algorithms, like regression, are sensitive to outliers, which can cause high error variance
and bias in the estimated values.

Outliers can be deleted, capped, transformed, or imputed like missing values.

[Figure: the data science process pipeline repeated, leading into the Modeling stage.]
Predictive data modeling
Prediction is the end goal of many data science adventures! Data on consumer behaviour is collected:

● to predict future consumer behaviour and to take action accordingly

Examples:

● Recommendation systems (Netflix, Pandora, Amazon, etc.)


● Online user behaviour is used to predict best targeted ads
● Customer purchase histories are used to determine how to price, stock, market,
and display future products.

Machine learning
● Machine Learning is the study of algorithms that improve their performance at some
task with example data or past experience
○ The foundations of many ML algorithms lie in statistics and optimization theory
○ Role of Computer science: Efficient algorithms to
■ Solve the optimization problem
■ Represent and evaluate data models for inference

● A wide variety of off-the-shelf algorithms is available today. Just pick a library and
go! (is it really that easy?)
○ Short answer: no. Long answer: model selection and tuning require a deeper understanding.

Machine learning - basics
Machine learning systems are made up of 3
major parts, which are:

● Model: the system that makes predictions.
● Parameters: the signals or factors used by the model to form its decisions.
● Learner: the system that adjusts the parameters, and in turn the model, by looking at differences between predictions and actual outcomes.
Ref: http://marketingland.com/how-machine-learning-works-150366
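To make the three parts tangible, here is a toy sketch (not the referenced article's code): the model is a line, its parameters are w and b, and the learner is plain gradient descent comparing predictions with actual outcomes.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 3 * x + 2 + rng.normal(0, 1, 200)  # ground truth: roughly y = 3x + 2

w, b = 0.0, 0.0  # parameters: the signals the model uses
lr = 0.01        # learner setting: step size

for _ in range(5000):        # learner: adjust parameters from prediction errors
    y_hat = w * x + b        # model: makes predictions
    error = y_hat - y        # difference between predictions and actual outcomes
    w -= lr * (error * x).mean()  # gradient step for w (mean squared error)
    b -= lr * error.mean()        # gradient step for b

print(w, b)  # should end up near 3 and 2
```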
Machine learning application examples
● Association Analysis
○ Basket analysis: Find the probability that somebody who
buys X also buys Y
● Supervised Learning
○ Classification: Spam filter, language prediction,
customer/visit type prediction
○ Regression: Pricing
○ Recommendation
● Unsupervised Learning
○ Given a database of customer data, automatically
discover market segments and group customers into
different market segments
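A compact scikit-learn sketch contrasting the two settings on synthetic data (illustrative only):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Supervised: learn from labelled examples (e.g., spam vs. not spam)
clf = LogisticRegression().fit(X, y)
print(clf.predict(X[:5]))

# Unsupervised: discover groups without labels (e.g., market segments)
segments = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(segments[:10])
```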

Model selection and generalization
● Learning is an ill-posed problem; data is
not sufficient to find a unique solution
● There is a trade-off between three
factors:
○ Model complexity
○ Training set size
○ Generalization error (expected error on
new data)
● Overfitting and underfitting problems

Ref: http://www.inf.ed.ac.uk/teaching/courses/iaml/slides/eval-2x2.pdf
Generalization error and cross-validation
● Measuring the generalization error is a major
challenge in data mining and machine learning
● To estimate generalization error, we need data
unseen during training. We could split the data
as
○ Training set (50%)
○ Validation set (25%) (optional, for selecting ML
algorithm parameters)
○ Test (publication) set (25%)
● How to avoid selection bias: k-fold cross-validation

Figure ref: https://www.quora.com/I-train-my-system-based-on-the-10-fold-cross-validation-framework-Now-it-gives-me-10-different-models-Which-model-to-select-as-a-representative
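A scikit-learn sketch of the split plus 10-fold cross-validation described above, on synthetic data for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# Hold out a test (publication) set never seen during training
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# 10-fold cross-validation on the training portion avoids selection bias:
# every example takes a turn in the held-out fold
scores = cross_val_score(LogisticRegression(), X_tr, y_tr, cv=10)
print(scores.mean(), scores.std())

# Final single estimate of generalization error on the untouched test set
print(LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te))
```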

Deep Learning
● Neural networks (NNs) have been around for decades, but they just weren't "deep" enough. NNs with
several hidden layers are called deep neural networks (DNNs).
● Unlike many other ML approaches, deep learning attempts to model high-level abstractions in data.
● Deep learning is best suited when the input space is locally structured (spatial or temporal) rather than
made of arbitrary input features.
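As a small illustration of "several hidden layers" (scikit-learn's MLPClassifier, not a full deep-learning framework), trained on digit images, a locally structured spatial input:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# 8x8 digit images: locally structured (spatial) input
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A network with two hidden layers -- "deep" relative to a single hidden layer
dnn = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
dnn.fit(X_tr, y_tr)
print(dnn.score(X_te, y_te))
```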

[Figure: the data science process pipeline repeated, leading into the Deployment and Optimization stage.]
Deployment, maintenance and optimization
● Deployed solutions might include:
○ A trained data model (model + parameters)
○ Routines for inputting and prediction
○ (Optional) Routines for model improvement (through feedback, the deployed system can improve itself)
○ (Optional) Routines for training
● Once the model has been deployed in production, it is time for regular
maintenance and operations.

● The optimization phase may be triggered by degrading performance, by the need to add new
data sources and retrain the model, or by the opportunity to deploy improved versions of the
model based on better algorithms.
Ref: http://www.datasciencecentral.com/m/blogpost?id=6448529%3ABlogPost%3A234092
Recap - Software Toolbox of Data Scientists:
● Database
○ SQL
○ NoSQL languages for target databases
● Programming Languages and Libraries
○ Python (due to the availability of libraries for data management): scikit-learn, PyML, pandas
○ R
○ General programming languages such as Java for gluing different systems
○ C/C++: mlpack, dlib

● Tools: Orange, Weka, Matlab

● Vendor Specific Platforms for data analytics


(such as Adobe Marketing Cloud, etc.)
● Hive
● Spark
Conclusion: It takes a team
Must haves:

- Programming and Scripting skills


- Statistics and data analysis skills
- Machine learning skills

Necessary but not sufficient:

- Database management skills


- Distributed computing skills

Domain knowledge may make or break a system: if you do not realize that a type of
data is essential, the results will not be very useful.

WHAT IS CLOUD COMPUTING?

Cloud computing refers to the use of hosted services, such as data storage, servers, databases, networking, and software, over the internet. The data is stored on physical servers maintained by a cloud service provider. In cloud computing, computer system resources, especially data storage and computing power, are available on demand, without direct management by the user.

Instead of storing files on a storage device or hard drive, a user can save them in the cloud, making it possible to access the files from anywhere, as long as they have access to the web. The services hosted in the cloud can be broadly divided into infrastructure-as-a-service (IaaS), platform-as-a-service (PaaS), and software-as-a-service (SaaS). Based on the deployment model, cloud can also be classified as public, private, and hybrid.

Further, cloud can be divided into two layers: the front-end and the back-end. The layer with which users interact is called the front-end layer. This layer enables a user to access data that has been stored in the cloud through cloud computing software.

The layer made up of software and hardware, i.e., the computers, servers, central servers, and databases, is the back-end layer. This layer is the primary component of the cloud and is entirely responsible for storing information securely. To ensure seamless connectivity between devices linked via cloud computing, the central servers use software called middleware, which acts as a bridge between the database and applications.
TYPES OF CLOUD COMPUTING

Cloud computing can be classified based on either the deployment model or the type of service. Based on the deployment model, cloud can be classified as public, private, or hybrid. Based on the service the cloud model offers, it can be classified as infrastructure-as-a-service (IaaS), platform-as-a-service (PaaS), or software-as-a-service (SaaS).

PRIVATE CLOUD

In a private cloud, computing services are offered over a private IT network for the dedicated use of a single organization. Also termed an internal, enterprise, or corporate cloud, a private cloud is usually managed via internal resources and is not accessible to anyone outside the organization. Private cloud computing provides all the benefits of a public cloud, such as self-service, scalability, and elasticity, along with additional control, security, and customization.

Private clouds provide a higher level of security through company firewalls and internal hosting, ensuring that an organization's sensitive data is not accessible to third-party providers. The drawback of private cloud, however, is that the organization becomes responsible for all the management and maintenance of the data centers, which can prove to be quite resource-intensive.

PUBLIC CLOUD

Public cloud refers to computing services offered by third-party providers over the internet. Unlike private cloud, the services on public cloud are available to anyone who wants to use or purchase them. These services may be free or sold on demand, where users pay only for the CPU cycles, storage, or bandwidth they consume.

Public clouds can help businesses save on purchasing, managing, and maintaining on-premises infrastructure, since the cloud service provider is responsible for managing the system. They also offer scalable RAM and flexible bandwidth, making it easier for businesses to scale their storage needs.

HYBRID CLOUD

Hybrid cloud uses a combination of public and private cloud features. This "best of both worlds" model allows workloads to shift between private and public clouds as computing and cost requirements change. When demand for computing and processing fluctuates, hybrid cloud allows businesses to scale their on-premises infrastructure up to the public cloud to handle the overflow, while ensuring that no third-party data centers have access to their data.

In a hybrid cloud model, companies pay only for the resources they use temporarily, instead of purchasing and maintaining resources that may not be used for an extended period. In short, a hybrid cloud offers the benefits of a public cloud without its security risks.

WHAT IS A DATA WAREHOUSE?

A Data Warehouse (DW) is a relational database designed for query and analysis rather than for transaction processing. It includes historical data derived from transaction data from one or more sources.

A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on supporting decision-makers in data modeling and analysis.

A Data Warehouse is a collection of data specific to the entire organization, not only to a particular group of users.

It is not used for daily operations and transaction processing, but for making decisions.

A Data Warehouse can be viewed as a data system with the following attributes:

o It is a database designed for investigative tasks, using data from various applications.
o It supports a relatively small number of clients with relatively long interactions.
o It includes current and historical data to provide a historical perspective of information.
o Its usage is read-intensive.
o It contains a few large tables.

"Data Warehouse is a subject-oriented, integrated, and time-variant store of information in support of management's decisions."

What is Data Mining?
Data Mining is the computer-assisted process of extracting knowledge from large amounts of data.

In other words, data mining derives its name from "data" + "mining": just as mining is done in the ground to find valuable ore, data mining is done to find valuable information in a dataset.

Data mining tools predict customer habits, patterns, and future trends, allowing businesses to increase revenues and make proactive decisions.

How Data Mining Works

Fig. 1 – Data Mining Architecture

The user interface may be any website. A product is searched for in the database, data warehouse, World Wide Web, and other repositories (bottom part of Figure 1). This means that the searched data can be fetched from across the net.

The data is then cleansed with the help of a parser to remove noise, errors, and unwanted data. The selected data is then integrated and fetched by the data warehouse server. With the help of the knowledge base and pattern evaluation, the result is delivered to the interface.

