The document discusses data science classification and algorithms. It describes supervised and unsupervised learning models, classification and regression techniques, and clustering. Common data science algorithms are also presented for tasks like classification, regression, anomaly detection, time series forecasting, clustering, and recommendation engines. Finally, it outlines the typical data science process and some common process frameworks.
Data Science Classification Etc
Data Science Classification
Data Science and Visualization
22AD2202 Data Science Classification
[20AD2202] Data Science Classification
Data science techniques can be classified as supervised or unsupervised learning models.
Supervised techniques predict the value of the output variables based on a set of input variables.
To do this, a model is developed from a training dataset where the values of input and output are previously known.
The output variable that is being predicted is also called a class label or target variable.
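As a rough illustration of the supervised idea above, the sketch below learns from a small training dataset with known inputs and outputs and then predicts the class label of a new data point. The nearest-centroid rule, the function names, and the toy height/weight data are all assumptions chosen for brevity; they are not from the slides.

```python
# Minimal sketch of supervised learning: learn from labeled examples, then predict.
# The nearest-centroid rule and the toy data are illustrative assumptions.

def train_centroids(inputs, labels):
    """Average the input vectors of each class label seen in the training data."""
    sums, counts = {}, {}
    for x, y in zip(inputs, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict(model, x):
    """Return the class label whose centroid is closest to x."""
    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(model, key=lambda label: squared_distance(model[label], x))


# Training dataset: input variables (height in cm, weight in kg) with known labels.
train_inputs = [[150, 50], [155, 55], [180, 85], [185, 90]]
train_labels = ["small", "small", "large", "large"]

model = train_centroids(train_inputs, train_labels)
prediction = predict(model, [178, 80])  # a new, unlabeled data point
print(prediction)
```

The model here is just one averaged point per class; any real project would pick the learner based on the data, as the later slides discuss.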
Data Science Classification
Unsupervised or undirected data science uncovers hidden patterns in unlabeled data.
In unsupervised data science, there are no output variables to predict.
The objective of this class of data science techniques is to find patterns in data based on the relationships between the data points themselves.
An application can employ both supervised and unsupervised learners.
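To make the unsupervised case concrete, here is a small sketch that groups unlabeled points using only the distances between them. It uses k-means as one possible technique; seeding the centroids with the first k points and the toy coordinates are assumptions made so the example stays deterministic.

```python
# Minimal k-means sketch: no output variable, only relationships between points.
# Seeding with the first k points is an assumption to keep the run deterministic.

def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nearest_centroid(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)),
               key=lambda c: squared_distance(point, centroids[c]))


def kmeans(points, k, max_iter=100):
    centroids = [list(p) for p in points[:k]]
    assignments = None
    for _ in range(max_iter):
        new_assignments = [nearest_centroid(p, centroids) for p in points]
        if new_assignments == assignments:  # limiting condition: nothing changed
            break
        assignments = new_assignments
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centroids


points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
clusters, centers = kmeans(points, k=2)
print(clusters)
```

No labels are supplied anywhere; the two groups emerge from the data alone.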
Data Science Classification
Classification and regression techniques predict a target variable based on input variables. The prediction is based on a generalized model built from a previously known dataset.
In classification tasks, the output variable is categorical; in regression tasks, it is numeric.
Deep learning is a more sophisticated form of artificial neural network that is increasingly used for classification and regression problems.
Clustering is the process of identifying the natural groupings in a dataset.
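For the regression case, where the target is numeric, a generalized model can be as simple as a straight line fitted to known data. The sketch below uses ordinary least squares with one input variable; the function name and the toy numbers are assumptions for illustration.

```python
# Minimal sketch of regression: fit a line to known data, then predict a number.
# Ordinary least squares with a single input variable; the data is made up.

def fit_line(xs, ys):
    """Return (slope, intercept) minimizing the squared prediction error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Previously known dataset: input values with their numeric outputs.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
predicted = slope * 5 + intercept  # numeric prediction for an unseen input
print(slope, intercept, predicted)
```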
Data Science Algorithms
In data science, an algorithm is the blueprint for how a particular data problem is solved.
Many learning algorithms are recursive: a set of steps is repeated many times until a limiting condition is met.
Some algorithms also take a random variable as an input and are aptly called randomized algorithms.
A classification task can be solved using many different learning algorithms, such as decision trees, artificial neural networks, k-NN, and even some regression algorithms.
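As one example of the algorithms named above, k-NN can be sketched in a few lines: a new point takes the majority label of its k closest training points. The function name `knn_predict`, the squared Euclidean distance, and the toy data are assumptions for illustration.

```python
# Minimal k-nearest-neighbors sketch for a classification task.
# Squared Euclidean distance and the toy data are illustrative assumptions.
from collections import Counter


def knn_predict(train, query, k=3):
    """train is a list of (features, label) pairs; returns the majority label
    among the k training points closest to query."""
    by_distance = sorted(
        train,
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], query)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


train = [
    ([1, 1], "A"), ([1, 2], "A"), ([2, 1], "A"),
    ([6, 6], "B"), ([7, 6], "B"), ([6, 7], "B"),
]

print(knn_predict(train, [2, 2]))
print(knn_predict(train, [6, 5]))
```

A decision tree or neural network could be swapped in for the same task, which is the point the slide makes.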
Data Science Algorithms
The choice of algorithm depends on the type of dataset, the objective, the structure of the data, the presence of outliers, the available computational power, the number of records, the number of attributes, and so on.
It is up to the data science practitioner to decide which algorithm(s) to use by evaluating the performance of multiple algorithms.
Hundreds of algorithms have been developed in the last few decades to solve data science problems.
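Evaluating the performance of multiple algorithms can be sketched as scoring each candidate on the same held-out records and keeping the best. The two candidates below (a majority-class baseline and a 1-nearest-neighbor rule), the accuracy metric, and the data are assumptions for illustration; a real comparison would use more data and usually more than one metric.

```python
# Minimal sketch of choosing between algorithms by held-out accuracy.
# Both candidate classifiers and the data are illustrative assumptions.
from collections import Counter


def majority_classifier(train):
    """Baseline: always predict the most frequent training label."""
    most_common_label = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda x: most_common_label


def nearest_neighbor_classifier(train):
    """Predict the label of the single closest training point."""
    def predict(x):
        nearest = min(
            train,
            key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], x)),
        )
        return nearest[1]
    return predict


def accuracy(classifier, test):
    correct = sum(1 for x, y in test if classifier(x) == y)
    return correct / len(test)


train = [([1.0], "low"), ([2.0], "low"), ([3.0], "low"),
         ([8.0], "high"), ([9.0], "high")]
test = [([1.5], "low"), ([2.5], "low"), ([8.5], "high"), ([9.5], "high")]

candidates = {
    "majority baseline": majority_classifier(train),
    "1-nearest neighbor": nearest_neighbor_classifier(train),
}
scores = {name: accuracy(clf, test) for name, clf in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```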
Data Science Algorithms
Data science algorithms can be implemented by custom-developed computer programs in almost any computer language, but this is a time-consuming task.
To keep the focus on data and algorithms, practitioners can leverage data science tools or statistical programming tools such as R, RapidMiner, Python, and SAS Enterprise Miner, which implement these algorithms with ease.
These tools offer a library of algorithms as functions, which can be called from programming code or configured through graphical user interfaces.
Data Science Algorithms
| Task | Description | Algorithms | Examples |
| --- | --- | --- | --- |
| Classification | Predict if a data point belongs to one of the predefined classes. The prediction will be based on learning from a known dataset | Decision trees, neural networks, Bayesian models, induction rules, k-nearest neighbors | Assigning voters into known buckets by political parties, e.g., soccer moms. Bucketing new customers into one of the known customer groups |
| Regression | Predict the numeric target label of a data point. The prediction will be based on learning from a known dataset | Linear regression, logistic regression | Predicting the unemployment rate for the next year. Estimating insurance premium |
| Anomaly detection | Predict if a data point is an outlier compared to other data points in the dataset | Distance-based, density-based, LOF | Detecting fraudulent credit card transactions and network intrusion |
| Time series forecasting | Predict the value of the target variable for a future timeframe based on historical values | Exponential smoothing, ARIMA, regression | Sales forecasting, production forecasting, virtually any growth phenomenon that needs to be extrapolated |
| Clustering | Identify natural clusters within the dataset based on inherent properties within the dataset | k-Means, density-based clustering (e.g., DBSCAN) | Finding customer segments in a company based on transaction, web, and customer call data |
| Association analysis | Identify relationships within an item set based on transaction data | FP-growth algorithm, Apriori algorithm | Finding cross-selling opportunities for a retailer based on transaction purchase history |
| Recommendation engines | Predict the preference of an item for a user | Collaborative filtering, content-based filtering, hybrid recommenders | Finding the top recommended movies for a user |
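To illustrate one row of the task list, the sketch below applies simple exponential smoothing to a short sales history and uses the final smoothed level as the forecast for the next period. The function name, the smoothing factor of 0.5, and the sales figures are assumptions; the tools named earlier provide fuller implementations.

```python
# Minimal sketch of time series forecasting with simple exponential smoothing.
# The smoothing factor and the sales figures are illustrative assumptions.

def exponential_smoothing_forecast(series, alpha):
    """Return the one-step-ahead forecast: the smoothed level after the last
    observation. alpha near 1 weights recent values more heavily."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


monthly_sales = [100, 110, 120, 130]
forecast = exponential_smoothing_forecast(monthly_sales, alpha=0.5)
print(forecast)
```

This simple form lags behind a steady trend, which is why trend-aware variants and models such as ARIMA are often preferred for growing series.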
Data Science Process
The methodical discovery of useful relationships and patterns in data is enabled by a set of iterative activities collectively known as the data science process.
The standard data science process involves:
(1) Understanding the problem
(2) Preparing the data samples
(3) Developing the model
(4) Applying the model on a dataset to see how the model may work in the real world
(5) Deploying and maintaining the models
Data Science Process Frameworks
CRISP-DM: Cross Industry Standard Process for Data Mining
SEMMA: Sample, Explore, Modify, Model, and Assess
Cross Industry Standard Process for Data Mining (CRISP-DM)
Data Science Process
Prior Knowledge
Prior knowledge refers to information that is already known about a subject.
A data science problem does not emerge in isolation; it always develops on top of existing subject matter and contextual information that is already known.
The prior knowledge step in the data science process helps to define what problem is being solved, how it fits in the business context, and what data is needed to solve the problem.
Objective
The data science process starts with a need for analysis, a question, or a business objective.
This is possibly the most important step in the data science process. Without a well-defined statement of the problem, it is impossible to come up with the right dataset and pick the right data science algorithm.
Because the process is iterative, it is common to go back to previous steps and revise the assumptions, approach, and tactics.
Subject Area
The process of data science uncovers hidden patterns in the dataset by exposing relationships between attributes.
The problem is that it uncovers a lot of patterns, and false or spurious signals are a major problem in the data science process.
It is up to the practitioner to sift through the exposed patterns and accept the ones that are valid and relevant to the objective question.
Hence, it is essential to know the subject matter, the context, and the business process generating the data.
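The risk of spurious signals can be demonstrated with pure noise: when many unrelated attributes are compared against a target, some will look correlated by chance. The sketch below is an assumption-laden toy (500 random attributes, 20 records, a fixed seed), not an analysis from the slides.

```python
# Minimal sketch of spurious patterns: random attributes that "correlate" by chance.
# Attribute count, record count, and the seed are illustrative assumptions.
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable
n_records, n_attributes = 20, 500

# The target and every attribute are independent noise: no real relationship exists.
target = [rng.gauss(0, 1) for _ in range(n_records)]
attributes = [[rng.gauss(0, 1) for _ in range(n_records)]
              for _ in range(n_attributes)]

correlations = [abs(pearson(column, target)) for column in attributes]
strongest = max(correlations)
print(f"strongest chance correlation: {strongest:.2f}")
```

Subject matter knowledge is what lets a practitioner reject a pattern like this even when the number looks convincing.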
Data
Understanding how the data is collected, stored, transformed, reported, and used is essential to the data science process.
This step surveys all the data available to answer the business question and identifies any new data that needs to be sourced.
There is a range of factors to consider: the quality, quantity, and availability of the data, gaps in the data, and whether a lack of data compels the practitioner to change the business question.
The objective of this step is to come up with a dataset to answer the business question through the data science process.
It is critical to recognize that an inferred model is only as good as the data used to create it.
A dataset (example set) is a collection of data with a defined structure, sometimes called a "data frame".
A data point (record, object, or example) is a single instance in the dataset. Each instance has the same structure as the dataset.
An attribute (feature, input, dimension, variable, or predictor) is a single property of the dataset. Attributes can be numeric, categorical, date-time, text, or Boolean data types.
A label (class label, output, prediction, target, or response) is the special attribute to be predicted based on all the input attributes.
Identifiers are special attributes that are used for locating or providing context to individual records.
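The terminology above maps onto a simple data structure. In this sketch a dataset is a list of records with identical keys, one key is set aside as the identifier and one as the label, and the rest are input attributes. The column names (`customer_id`, `age`, `plan`, `churned`) and the values are hypothetical.

```python
# Minimal sketch of dataset terminology using plain Python structures.
# All column names and values are hypothetical.

# Dataset: a collection of data points that share one defined structure.
dataset = [
    {"customer_id": "C001", "age": 34, "plan": "basic", "churned": False},
    {"customer_id": "C002", "age": 51, "plan": "premium", "churned": True},
    {"customer_id": "C003", "age": 27, "plan": "basic", "churned": False},
]

IDENTIFIER = "customer_id"  # locates a record; not used for prediction
LABEL = "churned"           # the special attribute to be predicted

# Every data point should have the same structure as the dataset.
same_structure = all(record.keys() == dataset[0].keys() for record in dataset)

# Input attributes are everything except the identifier and the label.
attributes = [name for name in dataset[0] if name not in (IDENTIFIER, LABEL)]
inputs = [[record[name] for name in attributes] for record in dataset]
labels = [record[LABEL] for record in dataset]

print(attributes, inputs, labels)
```

Here `age` is a numeric attribute, `plan` is categorical, and the label is Boolean, matching the data types listed above.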