Internship Report Data Science
Data Science is: Popular
Lots of Data => Lots of Analysis => Lots of Jobs
Data is: Big!
Lots of Data => Lots of Analysis => Lots of Jobs
Data Science is: making sense of Data
Lots of Data => Lots of Analysis => Lots of Jobs
Data Science is: multidisciplinary
● Statisticians
● Mathematicians
● Computer Scientists in
○ Data mining
○ Artificial Intelligence & Machine Learning
○ Systems Development and Integration
○ Database development
○ Analytics
● Domain Experts
○ Medical experts
○ Geneticists
○ Finance, Business, Economy experts
○ etc.
[Workflow diagram: Start → Plan (What is the question? What type of data is needed?) → Data Acquisition (scripts) → Clean Data (data reformatting, data quality & imputing) → Analysis → Deployment and optimization]
Data Acquisition Stage
● Data may not exist
● Sources of data may be public or private
● Not all sources of data may be suitable for processing
● Data are often incomplete and dirty
● Data consolidation and cleanup are essential
○ Pieces of data may be in different sources
○ Formats may not match/may be incompatible
○ Unstructured data may need to be accounted for
Data Acquisition Stage -- Example
Example: Online customer experience may require collecting lots of data such as
● clicks
● conversions
● add-to-cart rate
● dwell time
● average order value
● foot traffic
● bounce rate
● exits and time to purchase
Data Acquisition: Type and Source of Data
● Time spent on a page, browsing and/or search history
○ Website Logs
● User and Inventory Data
○ Transaction databases
● Social Engagement
○ Social Networks (Yelp, Twitter, ...)
● Customer Support
○ Call Logs, Emails
● Gas prices, competitors, news, stock prices, etc.
○ RSS Feeds, News Sites, Wikipedia, ...
● Training Data?
○ CrowdFlower, Mechanical Turk
Data Acquisition: Storage and Access
● Where the data resides
○ Cloud or computing clusters
● Storage System (see the access sketch below)
○ SQL, NoSQL, file system
○ SQL: MySQL, Oracle, Microsoft SQL Server, ...
○ NoSQL: MongoDB, Cassandra, Couchbase, HBase, Hive, ...
○ Text Indexing: Solr, Elasticsearch, ...
● Data Processing Frameworks:
○ Hadoop, Spark, Storm, etc.
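As a toy illustration of the storage-and-access step, here is a minimal Python sketch that pulls a table from a file-based SQL store into a pandas DataFrame. The shop.db file and the orders table are hypothetical names used only for illustration; a NoSQL store such as MongoDB would instead be accessed through its own client library (e.g., pymongo).

```python
import sqlite3
import pandas as pd

# Hypothetical file-based SQL store with an "orders" table.
con = sqlite3.connect("shop.db")
df = pd.read_sql("SELECT * FROM orders", con)  # table -> DataFrame
con.close()
print(df.head())
```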
Data Acquisition: Data Integration
Data integration involves combining data residing in different sources and providing users with a unified view of these data. (Wikipedia)
[Diagram: multiple sources (Data Source 1 ... Data Source 4) merged into a single unified view]
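A minimal sketch of the same idea in pandas, joining two hypothetical sources on a shared customer_id key to produce one unified view:

```python
import pandas as pd

# Two hypothetical sources sharing a "customer_id" key.
crm = pd.DataFrame({"customer_id": [1, 2, 3],
                    "name": ["Ada", "Ben", "Cleo"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3],
                       "amount": [25.0, 40.0, 13.5]})

# Join the sources into one unified view of each customer.
unified = crm.merge(orders, on="customer_id", how="left")
print(unified)
```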
Data Cleaning
● Data are often incomplete or incorrect.
○ Typos: e.g., text data in numeric fields
○ Missing values: some fields may not be collected for some of the examples
○ Impossible data combinations: e.g., gender = MALE, pregnant = TRUE
○ Out-of-range values: e.g., age = 1000
● Garbage In, Garbage Out
● Tools: scripting, visualization (see the sketch below)
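A minimal cleaning sketch in pandas, using hypothetical data that exhibits each of the problems listed above:

```python
import pandas as pd

# Hypothetical raw data exhibiting typical quality problems.
df = pd.DataFrame({"age": ["34", "1000", "n/a", "28"],
                   "gender": ["MALE", "FEMALE", "MALE", "MALE"],
                   "pregnant": [True, True, False, False]})

# Typos: text in a numeric field becomes NaN instead of raising an error.
df["age"] = pd.to_numeric(df["age"], errors="coerce")

# Out-of-range values: treat impossible ages as missing.
df.loc[(df["age"] < 0) | (df["age"] > 120), "age"] = float("nan")

# Impossible combinations: flag rows such as gender=MALE, pregnant=TRUE.
print(df[(df["gender"] == "MALE") & df["pregnant"]])

# Missing values: count them per column before deciding how to impute.
print(df.isna().sum())
```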
Analysis - Data Preparation
● Univariate Analysis: Analyze/explore variables one by one
● Bivariate Analysis: Explore relationship between variables
● Coverage, missing values: treating unknown values
● Outliers: detect and treat values that are distant from other observations
● Feature Engineering: variable transformations and creation of new, better variables from raw features
Analysis - Exploratory Analysis
Univariate Analysis: Analyze/explore variables one by one
Analysis - Exploratory Analysis
Summary statistics for “Temperature”:
Min. 1st Qu. Median Mean 3rd Qu. Max. Std Dev.
-7.29 45.90 60.71 59.36 73.88 102.00 18.68
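Statistics like these come straight from standard summary routines; a minimal pandas sketch, assuming a DataFrame df with a Temperature column (the values here are made up):

```python
import pandas as pd

df = pd.DataFrame({"Temperature": [59.1, 60.7, 45.9, 73.9, 102.0, -7.3]})

# count, mean, std, min, quartiles, and max in one call
print(df["Temperature"].describe())
```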
Analysis - Exploratory Analysis
Bivariate Analysis: Explore relationship between variables
- Categorical vs. categorical variables -> crosstab table; visualize with stacked bar charts
- Continuous vs. categorical variables -> visualize with boxplots or histograms for each level (category)
(A sketch of both follows.)
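A minimal bivariate sketch with hypothetical data: a crosstab for two categorical variables, and a boxplot per category for a continuous vs. categorical pair.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data: two categorical variables and one continuous one.
df = pd.DataFrame({"store": ["A", "A", "B", "B", "B"],
                   "holiday": [True, False, False, True, False],
                   "sales": [120.0, 80.0, 95.0, 150.0, 90.0]})

# Categorical vs. categorical: a crosstab table.
print(pd.crosstab(df["store"], df["holiday"]))

# Continuous vs. categorical: one boxplot of sales per store level.
df.boxplot(column="sales", by="store")
plt.show()
```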
Analysis - Correlation vs Causation
Correlation ⇏ causation!
To prove causation, a controlled experiment (e.g., a randomized A/B test) is needed.
Analysis - Feature Engineering
Transform variables: create new features from existing raw features, e.g., discretize or bin them.
Create new categorical variables when a variable has too many levels, levels that rarely occur, or one level that almost always occurs. (A sketch follows.)
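A minimal feature engineering sketch with hypothetical data, showing binning of a continuous variable and collapsing rarely occurring levels of a categorical one:

```python
import pandas as pd

# Hypothetical raw features.
df = pd.DataFrame({"age": [22, 37, 55, 41, 68],
                   "city": ["NYC", "NYC", "NYC", "Oslo", "Lima"]})

# Discretize/bin a continuous variable into a new categorical feature.
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 50, 120],
                         labels=["young", "middle", "senior"])

# Collapse rarely occurring levels into a single "other" level.
counts = df["city"].value_counts()
rare = counts[counts < 2].index
df["city_grouped"] = df["city"].where(~df["city"].isin(rare), "other")
print(df)
```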
Analysis - Missing Values
Missing values are unknown values of a feature.
They matter because they may lead to biased models or incorrect estimations and conclusions.
Some ML algorithms accept missing values: for example, some tree-based models treat missing values as a separate branch, while many other algorithms require a complete dataset. Therefore, we can either drop the incomplete rows or impute the missing values, as sketched below.
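A minimal sketch of the two standard treatments, on a hypothetical feature:

```python
import pandas as pd

# Hypothetical feature with missing entries.
df = pd.DataFrame({"income": [42000.0, None, 58000.0, None, 39000.0]})

# Option 1: drop incomplete rows (loses data).
dropped = df.dropna()

# Option 2: impute with a summary statistic such as the median.
df["income"] = df["income"].fillna(df["income"].median())
print(df)
```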
Analysis - Outliers
Outliers are values distant from other observations, e.g., values more than ~three standard deviations away from the mean, values beyond the bottom and top 5th percentiles, or values more than 1.5 × IQR outside the quartiles.
Visualization methods like boxplots, histograms, and scatterplots help detect them.
Some algorithms, like regression, are sensitive to outliers, which can cause high error variance and bias in the estimated values.
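A minimal detection sketch on hypothetical data, applying the three-standard-deviation and 1.5 × IQR rules mentioned above:

```python
import pandas as pd

# Hypothetical observations with one suspicious value.
s = pd.Series([12.1, 11.8, 12.5, 12.0, 11.9, 12.3, 12.2, 55.0])

# Rule 1: more than ~3 standard deviations from the mean.
# Note: a single extreme point inflates the std, so on tiny samples
# this rule may miss the very outlier it is looking for.
z_outliers = s[(s - s.mean()).abs() > 3 * s.std()]

# Rule 2: more than 1.5 * IQR beyond the quartiles (the boxplot rule).
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

print(z_outliers)
print(iqr_outliers)
```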
Predictive data modeling
Prediction: that is the end goal of many data science adventures!
Machine learning
● Machine Learning is the study of algorithms that improve their performance at some task with example data or past experience
○ The foundations of many ML algorithms lie in statistics and optimization theory
○ Role of computer science: efficient algorithms to
■ Solve the optimization problem
■ Represent and evaluate data models for inference
● A wide variety of off-the-shelf algorithms is available today. Just pick a library and go! (Is it really that easy?)
○ Short answer: no. Long answer: model selection and tuning require deeper understanding.
Machine learning - basics
Machine learning systems are made up of 3 major parts, which are: model representation, model evaluation, and optimization.
Model selection and generalization
● Learning is an ill-posed problem; data is not sufficient to find a unique solution
● There is a trade-off between three factors:
○ Model complexity
○ Training set size
○ Generalization error (expected error on new data)
● Overfitting and underfitting problems (see the sketch after this list)
Ref: http://www.inf.ed.ac.uk/teaching/courses/iaml/slides/eval-2x2.pdf
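A minimal sketch of the complexity trade-off on synthetic data: scikit-learn fits polynomials of increasing degree, and comparing train vs. test error exposes underfitting at low degrees and overfitting at high ones. All data here are made up.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, size=60)  # noisy target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):  # underfit, about right, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(degree,
          mean_squared_error(y_tr, model.predict(X_tr)),   # train error
          mean_squared_error(y_te, model.predict(X_te)))   # test error
```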
Generalization error and cross-validation
● Measuring the generalization error is a major challenge in data mining and machine learning
● To estimate generalization error, we need data unseen during training. We could split the data as
○ Training set (50%)
○ Validation set (25%) (optional, for selecting ML algorithm parameters)
○ Test (publication) set (25%)
● How to avoid selection bias: k-fold cross-validation (see the sketch below)
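A minimal k-fold sketch with scikit-learn on a built-in dataset; each observation serves as test data exactly once, which avoids the selection bias of a single lucky or unlucky split.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: 5 train/test splits, one score per fold.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())
```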
Deep Learning
● Neural networks (NNs) have been around for decades, but they just weren't “deep” enough. NNs with several hidden layers are called deep neural networks (DNNs).
● Unlike many ML approaches, deep learning attempts to model high-level abstractions in the data.
● Deep learning is best suited when the input space is locally structured (spatial or temporal), as opposed to arbitrary input features. (A toy sketch follows.)
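As a toy illustration of “several hidden layers”, here is a minimal scikit-learn sketch; serious deep learning work would normally use a dedicated framework (e.g., TensorFlow or PyTorch) and far more data.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Three hidden layers -> a (small) deep neural network.
dnn = MLPClassifier(hidden_layer_sizes=(64, 64, 64), max_iter=500,
                    random_state=0)
dnn.fit(X_tr, y_tr)
print(dnn.score(X_te, y_te))  # accuracy on held-out data
```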
Deployment, maintenance and optimization
● Deployed solutions might include:
○ A trained data model (model + parameters)
○ Routines for inputting and prediction
○ (Optional) Routines for model improvement (through feedback, the deployed system can improve itself)
○ (Optional) Routines for training
● Once the model has been deployed in production, it is time for regular maintenance and operations.
● The optimization phase could be triggered by failing performance, by the need to add new data sources and retrain the model, or even by deploying improved versions of the model based on better algorithms. (A persistence sketch follows.)
Ref: http://www.datasciencecentral.com/m/blogpost?id=6448529%3ABlogPost%3A234092
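A minimal sketch of “a trained model plus a prediction routine”, using joblib to persist a scikit-learn model; the file name model.joblib and the feature layout are hypothetical.

```python
from joblib import dump, load
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Training side: fit and persist the model (model + parameters).
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=0).fit(X, y)
dump(model, "model.joblib")

# Serving side: load the artifact and expose a prediction routine.
def predict(features):
    """Input routine for one observation (4 iris measurements)."""
    return load("model.joblib").predict([features])[0]

print(predict([5.1, 3.5, 1.4, 0.2]))
```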
Recap - Software Toolbox of Data Scientists:
● Database
○ SQL
○ NoSQL languages for target databases
● Programming Languages and Libraries
○ Python (due to the availability of libraries for data management): scikit-learn, PyML, pandas
○ R
○ General programming languages such as Java for gluing different systems together
○ C/C++: mlpack, dlib
Domain knowledge may make or break a system: if you do not realize a type of data is essential, the results will not be very useful.
WHAT IS CLOUD COMPUTING?
Cloud computing refers to the use of hosted services, such as data storage, servers, databases, networking, and software, over the internet. The data is stored on physical servers that are maintained by a cloud service provider. In cloud computing, computer system resources, especially data storage and computing power, are available on demand, without direct management by the user.
Instead of storing files on a storage device or hard drive, a user can save them in the cloud, making it possible to access the files from anywhere, as long as they have access to the web. The services hosted in the cloud can be broadly divided into infrastructure-as-a-service (IaaS), platform-as-a-service (PaaS), and software-as-a-service (SaaS). Based on the deployment model, the cloud can also be classified as public, private, and hybrid cloud.
Further, the cloud can be divided into two different layers, namely front-end and back-end. The layer with which users interact is called the front-end layer. This layer enables a user to access the data that has been stored in the cloud through cloud computing software.
The layer made up of software and hardware, i.e., the computers, servers, central servers, and databases, is the back-end layer. This layer is the primary component of the cloud and is entirely responsible for storing information securely. To ensure seamless connectivity between devices linked via cloud computing, the central servers use a software called middleware that acts as a bridge between the database and applications.
TYPES OF CLOUD COMPUTING
Cloud computing can be classified either by deployment model or by the type of service. Based on the deployment model, the cloud can be classified as public, private, or hybrid. Based on the service offered, it can be classified as infrastructure-as-a-service (IaaS), platform-as-a-service (PaaS), or software-as-a-service (SaaS).
PRIVATE CLOUD
In a private cloud, the computing services are offered over a private IT network for the dedicated use of a single organization. Also termed internal, enterprise, or corporate cloud, a private cloud is usually managed via internal resources and is not accessible to anyone outside the organization. Private cloud computing provides all the benefits of a public cloud, such as self-service, scalability, and elasticity, along with additional control, security, and customization.
Private clouds provide a higher level of security through company firewalls and internal hosting to ensure that an organization’s sensitive data is not accessible to third-party providers. The drawback of the private cloud, however, is that the organization becomes responsible for all the management and maintenance of the data centers, which can prove to be quite resource-intensive.
PUBLIC CLOUD
Public cloud refers to computing services offered by third-party providers over the internet. Unlike the private cloud, the services on a public cloud are available to anyone who wants to use or purchase them. These services could be free or sold on demand, where users only have to pay per usage for the CPU cycles, storage, or bandwidth they consume.
Public clouds can help businesses save on purchasing, managing, and maintaining on-premises infrastructure, since the cloud service provider is responsible for managing the system. They also offer scalable RAM and flexible bandwidth, making it easier for businesses to scale their storage needs.
HYBRID CLOUD
Hybrid cloud uses a combination of public and private cloud features. This “best of both worlds” cloud model allows a shift of workloads between private and public clouds as the computing and cost requirements change. When the demand for computing and processing fluctuates, the hybrid cloud allows businesses to scale their on-premises infrastructure up to the public cloud to handle the overflow while ensuring that no third-party data centers have access to their data.
In a hybrid cloud model, companies only pay for the resources they use temporarily instead of purchasing and maintaining resources that may not be used for an extended period. In short, a hybrid cloud offers the benefits of a public cloud without its security risks.
WHAT IS A DATA WAREHOUSE?
A Data Warehouse (DW) is a relational database that is designed for query and analysis rather than transaction processing. It includes historical data derived from transaction data from single and multiple sources.
A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on providing support for decision-makers for data modeling and analysis.
A Data Warehouse is a group of data specific to the entire organization, not only to a particular group of users.
It is not used for daily operations and transaction processing, but for making decisions.
A Data Warehouse can be viewed as a data system with the following attributes:
o It is a database designed for investigative tasks, using data from various applications.
o It supports a relatively small number of clients with relatively long interactions.
o It includes current and historical data to provide a historical perspective of information.
o Its usage is read-intensive.
o It contains a few large tables.
"A Data Warehouse is a subject-oriented, integrated, and time-variant store of information in support of management's decisions."
What is Data Mining
Data Mining is the computer-assisted process of extracting knowledge from large amounts of data.
In other words, data mining derives its name from the analogy of Data + Mining: just as mining is done in the ground to find valuable ore, data mining is done to find valuable information in a dataset.
Data Mining tools predict customer habits, patterns, and future trends, allowing businesses to increase company revenues and make proactive decisions.
The user interface may be any website. A product is searched for in the database, data warehouse, World Wide Web, and other repositories (bottom part of Figure 1). This means that the data searched for will be fetched from all over the net. The data will then be cleansed with the help of a parser to remove noise, errors, and unwanted data. Then the selective data will be integrated, and all the data will be fetched by the Data Warehouse Server. With the help of the knowledge base and pattern evaluation, the result will be presented to the interface.