Unit-1 Data Science
• Data wrangling is the process of converting data from its raw form to a tidy form
ready for analysis. Data wrangling is an important step in data preprocessing and
includes several processes like data importing, data cleaning, data structuring,
string processing, HTML parsing, handling dates and times, handling missing
data, and text mining.
• The process of data wrangling is a critical step for any data scientist. Very rarely is
data easily accessible in a data science project for analysis. It is more likely for the
data to be in a file, a database, or extracted from documents such as web pages,
tweets, or PDFs. Knowing how to wrangle and clean data will enable you to
derive critical insights from your data that would otherwise be hidden.
Figure 1: Data wrangling process.
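• As a minimal illustration of several of these steps (importing, string cleaning, date parsing, and handling missing data), here is a pandas sketch; the file name raw_sales.csv and the column names are hypothetical:

import pandas as pd

# Import: read the raw file (file and column names are hypothetical).
df = pd.read_csv("raw_sales.csv")

# Clean/structure: standardize column names and tidy string values.
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
df["customer_name"] = df["customer_name"].str.strip().str.title()

# Handle dates and times: parse a date column into datetime objects.
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

# Handle missing data: drop rows with no order date, flag missing amounts.
df = df.dropna(subset=["order_date"])
df["amount_missing"] = df["amount"].isna()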
3. Data Visualization
• Data Visualization is one of the most important branches of data science. It is one
of the main tools used to analyze and study relationships between different
variables. Data visualization (e.g., scatter plots, line graphs, bar plots, histograms,
qqplots, smooth densities, boxplots, pair plots, heat maps, etc.) can be used for
descriptive analytics. Data visualization is also used in machine learning for data
preprocessing and analysis, feature selection, model building, model testing, and
model evaluation. When preparing a data visualization, keep in mind that data
visualization is more of an art than a science; producing a good visualization usually
means combining several pieces of code into a polished end result.
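• A minimal matplotlib sketch of two common descriptive plots (a scatter plot and a histogram), using synthetic data purely for illustration:

import matplotlib.pyplot as plt
import numpy as np

# Synthetic data for demonstration only.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(x, y, alpha=0.6)             # scatter plot: relationship between x and y
axes[0].set(xlabel="x", ylabel="y", title="Scatter plot")
axes[1].hist(x, bins=20, edgecolor="black")  # histogram: distribution of x
axes[1].set(xlabel="x", ylabel="count", title="Histogram")
plt.tight_layout()
plt.show()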
4. Outliers
• An outlier is a data point that is very different from the rest of
the dataset. Outliers are often just bad data, e.g., due to a
malfunctioning sensor, contaminated experiments, or human
error in recording data. Sometimes, outliers could indicate
something real such as a malfunction in a system. Outliers are
very common and are expected in large datasets. One common
way to detect outliers in a dataset is by using a box plot. Figure
3 shows a simple regression model for a dataset containing lots
of outliers. Outliers can significantly degrade the predictive
power of a machine learning model. A common way to deal
with outliers is to simply omit the data points. However,
removing real data outliers can be too optimistic, leading to
non-realistic models. Advanced methods for dealing with
outliers include the RANSAC method.
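• A minimal sketch of both ideas on synthetic data: flagging outliers with the 1.5×IQR rule that underlies a box plot, and fitting a robust regression with scikit-learn's RANSACRegressor:

import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor

# Synthetic data with a few injected extreme outliers.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X.ravel() + rng.normal(scale=1.0, size=100)
y[:5] += 40                                    # inject outliers

# IQR rule: flag points far outside the interquartile range.
q1, q3 = np.percentile(y, [25, 75])
iqr = q3 - q1
outlier_mask = (y < q1 - 1.5 * iqr) | (y > q3 + 1.5 * iqr)

# RANSAC: fit a regression that ignores gross outliers.
ransac = RANSACRegressor(LinearRegression())
ransac.fit(X, y)
print("Detected outliers:", outlier_mask.sum())
print("RANSAC slope:", ransac.estimator_.coef_[0])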
5. Data Imputation
• Most datasets contain missing values. The easiest way to deal with missing data is
simply to throw away the data point. However, the removal of samples or
dropping of entire feature columns is often not feasible because we might lose
too much valuable data. In this case, we can use different interpolation techniques
to estimate the missing values from the other training samples in our dataset. One
of the most common interpolation techniques is mean imputation, where we
simply replace the missing value with the mean value of the entire feature column.
Other options for imputing missing values are median or most frequent (mode),
where the latter replaces the missing values with the most frequent values.
Whatever imputation method you employ in your model, you have to keep in
mind that imputation is only an approximation, and hence can produce an error in
the final model. If the data supplied was already preprocessed, you would have to
find out how missing values were handled.
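• A minimal scikit-learn sketch of mean, median, and most-frequent imputation using SimpleImputer on a small toy feature matrix:

import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix with missing values encoded as np.nan.
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan],
              [4.0, 5.0]])

# Mean imputation: replace each missing value with its column mean.
mean_imputer = SimpleImputer(strategy="mean")
X_mean = mean_imputer.fit_transform(X)

# Alternatives: median or most frequent (mode) imputation.
median_imputer = SimpleImputer(strategy="median")
mode_imputer = SimpleImputer(strategy="most_frequent")
print(X_mean)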
6. Data Scaling
• Scaling your features will help improve the quality and predictive power of your model. For
example, suppose you would like to build a model to predict a target
variable creditworthiness based on predictor variables such as income and credit score. Because
credit scores range from 0 to 850 while annual income could range from $25,000 to $500,000,
without scaling your features, the model will be biased towards the income feature. Because income
values are orders of magnitude larger than credit scores, the income term dominates the model, and
the predictions of creditworthiness end up being driven almost entirely by the income feature.
• In order to bring features to the same scale, we could decide to use either normalization or
standardization of features. Most often, we assume data is normally distributed and default towards
standardization, but that is not always the case. It is important that before deciding whether to use
either standardization or normalization, you first take a look at how your features are statistically
distributed. If the feature tends to be uniformly distributed, then we may use normalization
(MinMaxScaler). If the feature is approximately Gaussian, then we can use standardization
(StandardScaler). Again, note that whether you employ normalization or standardization, these are
also approximative methods and are bound to contribute to the overall error of the model.
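• A minimal scikit-learn sketch comparing normalization (MinMaxScaler) and standardization (StandardScaler) on a toy credit-score/income matrix; the numbers are illustrative only:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features: credit score (0-850) and annual income (dollars).
X = np.array([[700, 55_000],
              [620, 150_000],
              [810, 95_000],
              [540, 30_000]], dtype=float)

# Normalization: rescale each feature to the [0, 1] range.
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization: center each feature at 0 with unit variance.
X_std = StandardScaler().fit_transform(X)
print(X_minmax)
print(X_std)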
7. Principal Component Analysis (PCA)
• Large datasets with hundreds or thousands of features often lead to redundancy especially
when features are correlated with each other. Training a model on a high-dimensional
dataset having too many features can sometimes lead to overfitting (the model captures
both real and random effects). In addition, an overly complex model having too many
features can be hard to interpret. One way to solve the problem of redundancy is via
feature selection and dimensionality reduction techniques such as PCA. Principal
Component Analysis (PCA) is a statistical method that is used for feature extraction. PCA
is used for high-dimensional and correlated data. The basic idea of PCA is to transform
the original space of features into the space of the principal components. A PCA
transformation achieves the following:
a) It reduces the number of features used in the final model by focusing only on the
components accounting for the majority of the variance in the dataset.
b) It removes the correlation between features.
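• A minimal scikit-learn sketch of PCA on the Iris dataset, keeping the two components that account for most of the variance:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Standardize first: PCA is sensitive to feature scales.
X_scaled = StandardScaler().fit_transform(X)

# Keep the two components that explain most of the variance.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print("Explained variance ratio:", pca.explained_variance_ratio_)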
8. Linear Discriminant Analysis (LDA)
• PCA and LDA are two data preprocessing linear transformation techniques that
are often used for dimensionality reduction to select relevant features that can be
used in the final machine learning algorithm. PCA is an unsupervised algorithm
that is used for feature extraction in high-dimensional and correlated data. PCA
achieves dimensionality reduction by transforming features into orthogonal
component axes of maximum variance in a dataset. The goal of LDA is to find the
feature subspace that optimizes class separability while reducing dimensionality (see
figure below). Hence, LDA is a supervised algorithm.
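• A minimal scikit-learn sketch of LDA on the Iris dataset; unlike PCA, the class labels y are passed to fit_transform, which is what makes LDA supervised:

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# LDA uses the class labels y to find the component axes
# that best separate the classes.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)
print(X_lda.shape)   # (150, 2)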
9. Data Partitioning
• In machine learning, the dataset is often partitioned into training and testing sets.
The model is trained on the training dataset and then tested on the testing dataset.
The testing dataset thus acts as the unseen dataset, which can be used to estimate a
generalization error (the error expected when the model is applied to a real-world
dataset after the model has been deployed). In scikit-learn, the train_test_split
function can be used to split the dataset as follows:
• from sklearn.model_selection import train_test_split
• X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)
• Here, X is the features matrix, and y is the target variable. In this case, the testing
dataset is set to 30%.
10. Supervised Learning
• These are machine learning algorithms that perform learning by studying the relationship between the
feature variables and the known target variable. Supervised learning has two subcategories:
a) Continuous Target Variables
• Algorithms for predicting continuous target variables include Linear Regression, KNeighbors regression
(KNR), and Support Vector Regression (SVR).
b) Discrete Target Variables
• Algorithms for predicting discrete target variables include the following (a minimal classifier sketch follows the list):
• Perceptron classifier
• Logistic Regression classifier
• Support Vector Machines (SVM)
• Decision tree classifier
• K-nearest neighbors classifier
• Naive Bayes classifier
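• A minimal scikit-learn sketch of one of these classifiers (Logistic Regression) trained on labeled data and evaluated on a held-out test set; the dataset choice is illustrative:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit a logistic regression classifier on the labeled training data.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))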
11. Unsupervised Learning
• In unsupervised learning, we are dealing with unlabeled data or data of unknown
structure. Using unsupervised learning techniques, we are able to explore the structure
of our data to extract meaningful information without the guidance of a known outcome
variable or reward function. K-means clustering is an example of an unsupervised
learning algorithm.
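• A minimal scikit-learn sketch of K-means clustering; note that only the features X are used, never a target variable:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Unlabeled data: only X is passed to the clustering algorithm.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print("Cluster centers:\n", kmeans.cluster_centers_)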
12. Reinforcement Learning
• In reinforcement learning, the goal is to develop a system (agent) that improves its
performance based on interactions with the environment. Since the information about
the current state of the environment typically also includes a so-called reward signal, we
can think of reinforcement learning as a field related to supervised learning. However,
in reinforcement learning, this feedback is not the correct ground truth label or value but
a measure of how good the action was, as judged by a reward function. Through the
interaction with the environment, an agent can then use reinforcement learning to learn
a series of actions that maximize this reward.
13. Model Parameters and Hyperparameters
• In a machine learning model, there are two types of parameters:
a) Model Parameters: These are the parameters in the model that must be determined
using the training data set. These are the fitted parameters. For example, suppose we
have a model such as house price = a + b*(age) + c*(size), to estimate the cost of
houses based on the age of the house and its size (square foot), then a, b, and c will
be our model or fitted parameters.
b) Hyperparameters: These are adjustable parameters that must be tuned to obtain a
model with optimal performance. An example of a hyperparameter is shown here:
• KNeighborsClassifier(n_neighbors = 5, p = 2, metric = 'minkowski')
• It is important that during training, the hyperparameters be tuned to obtain the
model with the best performance (with the best-fitted parameters).
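• A minimal scikit-learn sketch of hyperparameter tuning with a grid search over the KNeighborsClassifier hyperparameters shown above; the grid values are illustrative:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hyperparameters (n_neighbors, p) are chosen by the grid search;
# the fitted model parameters are then learned from the training data.
param_grid = {"n_neighbors": [3, 5, 7, 9], "p": [1, 2]}
search = GridSearchCV(KNeighborsClassifier(metric="minkowski"), param_grid, cv=5)
search.fit(X, y)
print("Best hyperparameters:", search.best_params_)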
14. Cross-validation
• Cross-validation is a method of evaluating a machine learning model’s
performance across random samples of the dataset. This assures that any biases in
the dataset are captured. Cross-validation can help us to obtain reliable estimates
of the model’s generalization error, that is, how well the model performs on
unseen data.
• In k-fold cross-validation, the dataset is randomly partitioned into k folds. In each round,
the model is trained on k-1 folds and evaluated on the remaining fold, and the process is
repeated k times so that every fold serves once as the test set. The average training and
testing scores are then calculated by averaging over the k folds.
The k-fold cross-validation procedure can be sketched as follows:
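A minimal scikit-learn sketch using cross_val_score with k = 5 folds; the dataset and model choices are illustrative:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Split the data into k=5 folds; train on k-1 folds and test on the
# remaining fold, repeating k times, then average the scores.
scores = cross_val_score(model, X, y, cv=5)
print("Fold scores:", scores)
print("Mean CV score:", scores.mean())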
15. Bias-variance Tradeoff
• In statistics and machine learning, the bias-variance tradeoff is the property of a
set of predictive models whereby models with a lower bias in parameter
estimation have a higher variance of the parameter estimates across samples and
vice versa. The bias-variance dilemma or problem is the conflict in trying to
simultaneously minimize these two sources of error that prevent supervised
learning algorithms from generalizing beyond their training set:
• The bias is an error from erroneous assumptions in the learning algorithm. High
bias (overly simple) can cause an algorithm to miss the relevant relations between
features and target outputs (underfitting).
• The variance is an error from sensitivity to small fluctuations in the training set.
High variance (overly complex) can cause an algorithm to model the random noise
in the training data rather than the intended outputs (overfitting).
• It is important to find the right balance between model simplicity and complexity.
16. Evaluation Metrics
• In machine learning (predictive analytics), there are several metrics that can be
used for model evaluation. For example, a supervised learning (continuous target)
model can be evaluated using metrics such as the R2 score, mean square error
(MSE), or mean absolute error (MAE). Furthermore, a supervised learning
(discrete target) model, also referred to as a classification model, can be evaluated
using metrics such as accuracy, precision, recall, f1 score, and the area under ROC
curve (AUC).
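• A minimal scikit-learn sketch computing these metrics on small hand-made prediction vectors; the numbers are illustrative only:

from sklearn.metrics import (accuracy_score, f1_score, mean_absolute_error,
                             mean_squared_error, precision_score, r2_score,
                             recall_score, roc_auc_score)

# Regression metrics (continuous target).
y_true_reg = [3.0, 5.0, 2.5, 7.0]
y_pred_reg = [2.8, 5.4, 2.0, 6.5]
print("R2 :", r2_score(y_true_reg, y_pred_reg))
print("MSE:", mean_squared_error(y_true_reg, y_pred_reg))
print("MAE:", mean_absolute_error(y_true_reg, y_pred_reg))

# Classification metrics (discrete target).
y_true_clf = [0, 1, 1, 0, 1, 0]
y_pred_clf = [0, 1, 0, 0, 1, 1]
y_score    = [0.2, 0.9, 0.4, 0.1, 0.8, 0.6]   # predicted probabilities
print("Accuracy :", accuracy_score(y_true_clf, y_pred_clf))
print("Precision:", precision_score(y_true_clf, y_pred_clf))
print("Recall   :", recall_score(y_true_clf, y_pred_clf))
print("F1 score :", f1_score(y_true_clf, y_pred_clf))
print("AUC      :", roc_auc_score(y_true_clf, y_score))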
17. Uncertainty Quantification
• It is important to build machine learning models that will yield unbiased estimates
of uncertainties in calculated outcomes. Due to the inherent randomness in the
dataset and model, evaluation parameters such as the R2 score are random
variables, and thus it is important to estimate the degree of uncertainty in the
model.
18. Math Concepts
a) Basic Calculus: Most machine learning models are built with a dataset having several features or
predictors. Hence, familiarity with multivariable calculus is extremely important for building a machine
learning model. Here are the topics you need to be familiar with:
Functions of several variables; Derivatives and gradients; Step function, Sigmoid function, Logit function,
ReLU (Rectified Linear Unit) function; Cost function; Plotting of functions; Minimum and Maximum values
of a function
b) Basic Linear Algebra: Linear algebra is the most important math skill in machine learning. A data set is
represented as a matrix. Linear algebra is used in data preprocessing, data transformation, dimensionality
reduction, and model evaluation. Here are the topics you need to be familiar with:
Vectors; Norm of a vector; Matrices; Transpose of a matrix; The inverse of a matrix; The determinant of a
matrix; Trace of a Matrix; Dot product; Eigenvalues; Eigenvectors
c) Optimization Methods: Most machine learning algorithms perform predictive modeling by minimizing
an objective function, thereby learning the weights that are then applied to new (test) data in order to obtain
the predicted labels. Here are the topics you need to be familiar with:
Cost function/Objective function; Likelihood function; Error function; Gradient Descent Algorithm and its
variants (e.g., Stochastic Gradient Descent Algorithm)
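A minimal NumPy sketch of gradient descent minimizing a mean-squared-error cost for a one-feature linear model; the data is synthetic and the learning rate is illustrative:

import numpy as np

# Minimize the MSE cost of the model y_hat = w * x + b with batch gradient descent.
rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 100)
y = 4 * x + 1 + rng.normal(scale=0.1, size=100)

w, b, lr = 0.0, 0.0, 0.5
for _ in range(1000):
    y_hat = w * x + b
    grad_w = 2 * np.mean((y_hat - y) * x)   # dCost/dw
    grad_b = 2 * np.mean(y_hat - y)         # dCost/db
    w -= lr * grad_w
    b -= lr * grad_b

print("Learned w, b:", w, b)   # should approach 4 and 1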
19. Statistics and Probability Concepts
• Statistics and Probability are used for visualization of features, data preprocessing,
feature transformation, data imputation, dimensionality reduction, feature engineering,
model evaluation, etc. Here are the topics you need to be familiar with:
• Mean, Median, Mode, Standard deviation/variance, Correlation coefficient and the
covariance matrix, Probability distributions (Binomial, Poisson, Normal), p-value,
Bayes Theorem (Precision, Recall, Positive Predictive Value, Negative Predictive Value,
Confusion Matrix, ROC Curve), Central Limit Theorem, R2 score, Mean Square Error
(MSE), A/B Testing, Monte Carlo Simulation
20. Productivity Tools
A typical data analysis project may involve several parts, each including several data files
and different scripts with code. Keeping all these organized can be challenging.
Productivity tools help you to keep projects organized and to maintain a record of your
completed projects. Some essential productivity tools for practicing data scientists include
tools such as Unix/Linux, git and GitHub, RStudio, and Jupyter Notebook.
Top Data Science Tools
Here is a list of 14 of the best data science tools that data scientists commonly use.
1. SAS
SAS is one of those data science tools that are specifically designed for statistical operations. It is closed-
source proprietary software used by large organizations to analyze data, and it relies on the Base SAS
programming language for statistical modeling.
• It is widely used by professionals and companies working on reliable commercial software. SAS offers
numerous statistical libraries and tools that you as a Data Scientist can use for modeling and organizing
your data.
• While SAS is highly reliable and has strong support from the company, it is highly expensive and is only
used by larger industries. Also, SAS pales in comparison with some of the more modern tools which are
open-source.
• Furthermore, there are several libraries and packages in SAS that are not available in the base pack and
can require an expensive upgrade.
2. Apache Spark
Apache Spark, or simply Spark, is a powerful analytics engine and one of the most widely used Data Science
tools. Spark is designed to handle both batch processing and stream processing.
• It comes with many APIs that let Data Scientists repeatedly access data for Machine Learning, SQL
storage, and more. It is an improvement over Hadoop MapReduce and can run some workloads up to 100
times faster.
• Spark has many Machine Learning APIs that can help Data Scientists make powerful predictions with
the given data.
• Spark does better than other Big Data platforms in its ability to handle streaming data. This means that
Spark can process real-time data, whereas many other analytical tools process only historical data in
batches.
• Spark offers APIs that are programmable in Python, Java, and R, but its most powerful combination is
with the Scala programming language, which runs on the Java Virtual Machine and is cross-platform in
nature.
• Spark is highly efficient at cluster management and in-memory computation, which gives it an edge over
plain Hadoop MapReduce, whose core HDFS layer is primarily used for storage. It is this cluster
management system that allows Spark to process applications at high speed.
3. BigML
• BigML is another widely used Data Science tool. It provides a fully interactive, cloud-based
GUI environment that you can use to run Machine Learning algorithms. BigML
provides standardized, cloud-based software for industry requirements.
• Through it, companies can apply Machine Learning algorithms across various parts of the
business. For example, a company can use this one platform for sales forecasting, risk analytics,
and product innovation.
• BigML specializes in predictive modeling. It offers a wide variety of Machine Learning
techniques like clustering, classification, time-series forecasting, etc.
• BigML provides an easy-to-use web interface and REST APIs, and you can create a free account
or a premium account based on your data needs. It allows interactive visualization of data and
lets you export visual charts to your mobile or IoT devices.
• Furthermore, BigML comes with various automation methods that can help you automate
hyperparameter tuning and even automate the workflow of reusable scripts.
4. D3.js
• JavaScript is mainly used as a client-side scripting language. D3.js, a JavaScript library, allows
you to make interactive visualizations in your web browser. With the D3.js APIs, you can
use several functions to create dynamic visualization and analysis of data in the browser.
• Another powerful feature of D3.js is animated transitions. D3.js makes documents
dynamic by allowing updates on the client side and using changes in the data to update the
visualizations in the browser.
• You can combine this with CSS to create striking, animated visualizations that help
you implement customized graphs on web pages.
• Overall, it can be a very useful tool for Data Scientists working on IoT-based applications
that require client-side interaction for visualization and data processing.
5. MATLAB
• MATLAB is a multi-paradigm numerical computing environment for processing mathematical
information. It is a closed-source software that facilitates matrix functions, algorithmic
implementation and statistical modeling of data. MATLAB is most widely used in several
scientific disciplines.
• In Data Science, MATLAB is used for simulating neural networks and fuzzy logic. Using the
MATLAB graphics library, you can create powerful visualizations. MATLAB is also used in
image and signal processing.
• This makes it a very versatile tool for Data Scientists as they can tackle all the problems, from
data cleaning and analysis to more advanced Deep Learning algorithms.
• Furthermore, MATLAB’s easy integration with enterprise applications and embedded systems
makes it an ideal Data Science tool.
• It also helps in automating various tasks ranging from the extraction of data to re-use of scripts
for decision making. However, it suffers from the limitation of being a closed-source proprietary
software.
6. Excel
• Probably the most widely used Data Analysis tool. Microsoft developed Excel mostly for
spreadsheet calculations and today, it is widely used for data processing, visualization, and
complex calculations.
• Excel is a powerful analytical tool for Data Science. While it has been the traditional tool for
data analysis, Excel still packs a punch.
• Excel comes with various formulae, tables, filters, slicers, etc. You can also create your own
custom functions and formulae using Excel. While Excel is not suited to handling huge amounts
of data, it is still an ideal choice for creating powerful data visualizations and spreadsheets.
• You can also connect SQL with Excel and can use it to manipulate and analyze data. A lot of
Data Scientists use Excel for data cleaning as it provides an interactable GUI environment to
pre-process information easily.
• With the release of the Analysis ToolPak for Microsoft Excel, it is now much easier to perform complex
analyses. However, it still pales in comparison with much more advanced Data Science tools
like SAS. Overall, on a small, non-enterprise level, Excel is an ideal tool for data analysis.
7. ggplot2
• ggplot2 is an advanced data visualization package for the R programming language. The
developers created this tool to replace R's native graphics package, and it uses powerful
commands to create polished visualizations.
• It is one of the most widely used libraries for creating visualizations from
analyzed data. ggplot2 is part of the tidyverse, a collection of R packages designed for Data Science.
• One way in which ggplot2 is much better than most other data visualization tools is aesthetics.
With ggplot2, Data Scientists can create customized visualizations in order to engage in
enhanced storytelling.
• Using ggplot2, you can annotate your data in visualizations, add text labels to data points, and
boost the interactivity of your graphs. You can also create various styles of maps such as
choropleths, cartograms, hexbins, etc. It is one of the most widely used visualization tools in data science.
8. Tableau
• Tableau is a Data Visualization software that is packed with powerful graphics to make
interactive visualizations. It is focused on industries working in the field of business
intelligence.
• The most important aspect of Tableau is its ability to interface with databases, spreadsheets,
OLAP (Online Analytical Processing) cubes, etc. Along with these features, Tableau has the
ability to visualize geographical data and to plot longitudes and latitudes on maps.
• Along with visualizations, you can also use its analytics tool to analyze data. Tableau comes
with an active community and you can share your findings on the online platform. While
Tableau is enterprise software, it comes with a free version called Tableau Public.
9. Jupyter
• Project Jupyter is an open-source tool based on IPython that helps developers build open-source
software and experience interactive computing. Jupyter supports multiple languages like
Julia, Python, and R.
• It is a web-application tool used for writing live code, visualizations, and presentations. Jupyter
is a widely popular tool that is designed to address the requirements of Data Science.
• It is an interactive environment through which Data Scientists can perform all of their
responsibilities. It is also a powerful tool for storytelling, as various presentation features are
built into it.
• Using Jupyter Notebooks, one can perform data cleaning, statistical computation, and visualization,
and create predictive machine learning models. It is 100% open-source and is, therefore, free of
cost.
• There is an online Jupyter-style environment called Colaboratory (Google Colab) which runs in the
cloud and stores data in Google Drive.
10. Matplotlib
• Matplotlib is a plotting and visualization library developed for Python. It is the most popular
tool for generating graphs with the analyzed data. It is mainly used for plotting complex graphs
using simple lines of code. Using this, one can generate bar plots, histograms, scatterplots etc.
• Matplotlib has several essential modules. One of the most widely used is pyplot, which
offers a MATLAB-like interface. Pyplot is also an open-source alternative to MATLAB’s
graphics modules.
• Matplotlib is a preferred tool for data visualization and is used by many Data Scientists over other
contemporary tools.
• As a matter of fact, NASA used Matplotlib to illustrate data visualizations during the landing
of the Phoenix spacecraft. It is also an ideal tool for beginners learning data visualization with
Python.
11. NLTK
• Natural Language Processing has emerged as one of the most popular fields in Data Science. It deals
with the development of statistical models that help computers understand human language.
• These statistical models are part of Machine Learning and through several of its algorithms, are
able to assist computers in understanding natural language. Python language comes with a
collection of libraries called Natural Language Toolkit (NLTK) developed for this particular
purpose only.
• NLTK is widely used for various language processing techniques like tokenization, stemming,
tagging, parsing, and machine learning. It ships with over 100 corpora, which are collections of
data for building machine learning models.
• It has a variety of applications such as Part-of-Speech Tagging, Word Segmentation, Machine
Translation, Text-to-Speech, Speech Recognition, etc.
12. Scikit-learn
• Scikit-learn is a Python library used for implementing Machine Learning
algorithms. It is a simple, easy-to-use tool that is widely used for analysis and data
science.
• It supports a variety of features in Machine Learning such as data preprocessing, classification,
regression, clustering, dimensionality reduction, etc.
• Scikit-learn makes it easy to use complex machine learning algorithms. It is therefore well suited to
situations that require rapid prototyping and is also an ideal platform for research
requiring basic Machine Learning. It makes use of several underlying Python libraries such as
SciPy, NumPy, Matplotlib, etc.
13. TensorFlow
• TensorFlow has become a standard tool for Machine Learning. It is widely used for advanced
machine learning algorithms like Deep Learning. Developers named TensorFlow after Tensors
which are multidimensional arrays.
• It is an open-source and ever-evolving toolkit known for its performance and high
computational abilities. TensorFlow can run on both CPUs and GPUs and, more recently, on
more powerful TPU platforms.
• This gives it an unprecedented edge in terms of the processing power of advanced machine
learning algorithms.
• Due to its high processing ability, Tensorflow has a variety of applications such as speech
recognition, image classification, drug discovery, image and language generation, etc. For Data
Scientists specializing in Machine Learning, Tensorflow is a must-know tool.
14. Weka
• Weka or Waikato Environment for Knowledge Analysis is a machine learning software written
in Java. It is a collection of various Machine Learning algorithms for data mining. Weka consists
of various machine learning tools like classification, clustering, regression, visualization and
data preparation.
• It is an open-source GUI software package that allows easier implementation of machine learning
algorithms through an interactive platform.
Types of Data
Introduction – Importance of Data
“Data is the new oil.” Today data is everywhere, in every field. Whether you are a data scientist,
marketer, businessman, data analyst, researcher, or in any other profession, you need
to work and experiment with raw or structured data. This data is so important that it
becomes essential to handle and store it properly, without any error. While working with
data, it is important to know the types of data in order to process them and get the right results.
There are two types of data, Qualitative and Quantitative, which are further classified
into four types: nominal, ordinal, discrete, and continuous.
Business now runs on data: most companies use data to derive insights, create and
launch campaigns, design strategies, launch products and services, or try out different things.
According to one report, at least 2.5 quintillion bytes of data are produced per day today.
1. Data Science
Data Science is a field that combines programming skills and knowledge of mathematics and
statistics to derive insights from data. In short: Data Scientists work with large amounts of data,
which are systematically analyzed to provide meaningful information that can be used for
decision making and problem solving. A Data Scientist has a high level of technical skills and
knowledge, usually with expertise in programming languages such as R and Python. They help
organizations collect, compile, interpret, format, model, predict, and manipulate all types of
data in a wide variety of ways.
2. Algorithm
Algorithms are repeatable sets of instructions, usually expressed mathematically, that humans
or machines can use to process given data. Typically, algorithms are constructed by feeding
them data and adjusting variables until the desired result is achieved. Thanks to breakthrough
developments in Artificial Intelligence, machines now typically perform this tuning themselves,
as they can do it much faster than a human.
3. Data Analytics
Data Analytics involves answering questions generated for better business decision-making.
Existing information is used to determine usable data. Data analysis is an ongoing process in
which data is collected and analyzed continuously. An essential component of ensuring data
integrity is the accurate evaluation of research results.
4. Data mining
Data mining is the process of sorting large data sets to identify patterns and relationships that
can help solve business problems. Data mining techniques and tools can be used to predict
future trends and make more informed business decisions. Data mining is a component of Data
Analysis and one of the core disciplines of Data Science.
The data mining process can be divided into these four main stages:
• Data sources: identify and assemble relevant data for an analytics application. The data may
be located in different source systems that contain a mix of structured and unstructured data.
• Data exploration: this stage includes a set of steps to get the data ready to be mined. It
covers data exploration, profiling, and pre-processing, followed by data
cleansing work to fix errors and other data quality issues.
• Mining/modeling: implement one or more algorithms to do the mining/modeling. In Machine
Learning applications, the algorithms typically must be trained on sample data sets.
• Deployment and communication: deploy the models and communicate the findings to business
executives and users, often through Data Visualization and the use of data storytelling.
5. Big Data
The term "Big Data" has emerged as an ever-increasing amount of data has become available.
Today's data differs from that of the past not only in the amount but also in the speed at which
it is available. It is data with such large size and complexity that none of the traditional data
management tools can store it or process it efficiently.
Big data benefits:
• Big Data can produce more complete answers, because you have more information
• More precisely defined answers through confirmation of multiple data sources
7. Machine Learning
Machine Learning is a technique that allows a computer to learn from data without using a
complex set of different rules. It is a subset of AI in which algorithms learn from historical data
to predict outcomes and uncover patterns. It's also the process that drives many of the services
we use today - recommendation systems like those from Netflix, YouTube, and Spotify; search
engines like Google; social media feeds like Facebook and Twitter; voice assistants like Siri
and Alexa, etc. With each click or other activity, you give a machine learning system material to
process further into information, which it can use to make a highly educated decision about what
to show you next.
8. Deep learning
Deep Learning is a Machine Learning technique inspired by the neural network of our brain. It
gives machines the ability to find even the smallest patterns in a data set with many layers of
computational nodes working together to search through data and deliver a final result in the
form of a prediction.
9. NLP
Natural language processing (NLP) is an intersection between the fields of Computer Science,
linguistics, and Artificial Intelligence. It helps computers communicate with people in their
language and perform other language-related tasks. NLP enables computers to read text, listen
to speech, interpret it, and determine which parts are important. The goal is to create
the widest possible communication between humans and computers via speech. This should
enable both machines and applications to be controlled and operated by natural language.
10. Python
Python is one of the most popular programming languages today; it is best known as
a versatile language that is very useful for analyzing data. The language's creators
focused on making it easy to learn and user-friendly, so it is also a very
common first programming language. Furthermore, the easily understandable syntax
of Python allows for quick, compact, and readable implementation of scripts or programs
compared with other programming languages.
For many reasons, Python is among the fastest-growing programming languages globally: its ease of
learning, the recent explosion of the Data Science field, and the rise of Machine Learning. Python also
supports Object-Oriented and Functional Programming styles, which facilitate building
automated tasks and deployable systems. There are plenty of Python scientific packages for
Data Visualization, Machine Learning, Natural Language Processing, and more.
11. R
R is an open-source implementation of the statistical programming language S which was
developed at Bell Labs in the 1970s. Most of its underlying source code has been written in C
and Fortran. R allows its users to manipulate R objects from these languages as well (including
C++) for computationally intensive tasks. It is essentially a highly extensible and flexible
environment for performing statistical computations and data analysis.
R is the language of choice for statistical analysis, which is a very important feature in Data
Science. R’s popularity comes from the fact that most statistical methods developed in research
environments lead to the production of ready-to-use, freely available R packages. R’s popularity
has led Microsoft to develop Microsoft R Open (an enhanced R distribution) and Oracle to
develop Oracle R Enterprise. From our partner companies, we have learned that along with
Python, R remains the language of choice for Data Scientists in the insurance and
pharmaceutical sectors.
12. SQL
SQL (Structured Query Language) is the language used to query and manipulate data in RDBMSs
(Relational Database Management Systems) and is, for this reason, very relevant in the field
of Data Science. RDBMSs use columns and rows to store data in a structured format and are
a potent tool for storing massive amounts of information. Some common database management
systems that use SQL are Sybase, Oracle, Microsoft SQL Server, Access, etc.
15. API
APIs provide users with a set of functions used to interact with the features of a specific service
or application. Facebook, for example, provides developers of software applications with
access to Facebook features through its API. By hooking into the Facebook API, developers
can allow users of their applications to log in using Facebook, or they can access personal
information stored in their databases.
Prerequisites for Data Science
Here are some of the technical concepts you should know about before starting to learn what
data science is.
1. Machine Learning
Machine learning is the backbone of data science. Data Scientists need to have a solid grasp of
ML in addition to basic knowledge of statistics.
2. Modeling
Mathematical models enable you to make quick calculations and predictions based on what
you already know about the data. Modeling is also a part of Machine Learning and involves
identifying which algorithm is the most suitable to solve a given problem and how to train these
models.
3. Statistics
Statistics is at the core of data science. A solid grasp of statistics can help you extract more
intelligence and obtain more meaningful results.
4. Programming
Some level of programming is required to execute a successful data science project. The most
common programming languages are Python and R. Python is especially popular because it’s
easy to learn, and it supports multiple libraries for data science and ML.
5. Databases
A capable data scientist needs to understand how databases work, how to manage them, and
how to extract data from them.
What Does a Data Scientist Do?
A data scientist analyzes business data to extract meaningful insights. In other words, a data
scientist solves business problems through a series of steps, including:
• Before tackling the data collection and analysis, the data scientist determines the problem by
asking the right questions and gaining understanding.
• The data scientist then determines the correct set of variables and data sets, and
gathers structured and unstructured data from many disparate sources—enterprise data, public
data, etc.
• Once the data is collected, the data scientist processes the raw data and converts it into a format
suitable for analysis. This involves cleaning and validating the data to guarantee uniformity,
completeness, and accuracy.
• After the data has been rendered into a usable form, it’s fed into the analytic system—an ML
algorithm or a statistical model. This is where the data scientist analyzes the data and identifies
patterns and trends.
• When the data has been completely rendered, the data scientist interprets the data to find
opportunities and solutions.
• The data scientist finishes the task by preparing the results and insights to share with the
appropriate stakeholders and communicating the results.
Being aware of some common machine learning algorithms is also helpful for
understanding data science clearly.
Why Become a Data Scientist?
According to Glassdoor and Forbes, demand for data scientists will increase by 28 percent by
2026, which speaks of the profession’s durability and longevity, so if you want a secure career,
data science offers you that chance.
Furthermore, the profession of data scientist came in second place in the Best Jobs in America
for 2021 survey, with an average base salary of USD 127,500.
So, if you’re looking for an exciting career that offers stability and generous compensation,
then look no further!
Where Do You Fit in Data Science?
Data science offers you the opportunity to focus on and specialize in one aspect of the field.
Here’s a sample of different ways you can fit into this exciting, fast-growing field.
Data Scientist
Job role: Determine what the problem is, what questions need answers, and where to find the
data. Also, they mine, clean, and present the relevant data.
Skills needed: Programming skills (SAS, R, Python), storytelling and data visualization,
statistical and mathematical skills, knowledge of Hadoop, SQL, and Machine Learning.
Data Analyst
Job role: Analysts bridge the gap between the data scientists and the business analysts,
organizing and analyzing data to answer the questions the organization poses. They take the
technical analyses and turn them into qualitative action items.
Skills needed: Statistical and mathematical skills, programming skills (SAS, R, Python), plus
experience in data wrangling and data visualization.
Data Engineer
Job role: Data engineers focus on developing, deploying, managing, and optimizing the
organization’s data infrastructure and data pipelines. Engineers support data scientists by
helping to transfer and transform data for queries.
Skills needed: NoSQL databases (e.g., MongoDB, Cassandra DB), programming languages
such as Java and Scala, and frameworks (Apache Hadoop).
Data Science Applications
Data science has found its applications in almost every industry.
Here are 10 applications that build upon the concepts of Data
Science, exploring various domains such as the following:
1. Fraud and Risk Detection
2. Healthcare
3. Internet Search
4. Targeted Advertising
5. Website Recommendations
6. Advanced Image Recognition
7. Speech Recognition
8. Airline Route Planning
9. Gaming
10. Augmented Reality
1. Fraud and Risk Detection
The earliest applications of data science were in finance. Companies were fed up with bad debts
and losses every year. However, they had a lot of data that was collected during the
initial paperwork while sanctioning loans, so they decided to bring in data scientists to
rescue them from losses.
Over the years, banking companies learned to divide and conquer data via customer profiling,
past expenditures, and other essential variables to analyze the probabilities of risk and default.
Moreover, it also helped them to push their banking products based on customers’ purchasing
power.
2. Healthcare
The healthcare sector, especially, receives great benefits from data science applications.
i. Medical Image Analysis
Procedures such as detecting tumors, artery stenosis, and organ delineation employ various
methods and frameworks like MapReduce to find optimal parameters for tasks like
lung texture classification. This work applies machine learning methods, support vector machines
(SVM), content-based medical image indexing, and wavelet analysis for solid texture
classification.
ii. Genetics & Genomics
Data Science applications also enable an advanced level of treatment personalization through
research in genetics and genomics. The goal is to understand the impact of the DNA on our
health and find individual biological connections between genetics, diseases, and drug
response. Data science techniques allow integration of different kinds of data with genomic
data in the disease research, which provides a deeper understanding of genetic issues in
reactions to particular drugs and diseases. As soon as we acquire reliable personal genome data,
we will achieve a deeper understanding of the human DNA. The advanced genetic risk
prediction will be a major step towards more individual care.
iii. Drug Development
The drug discovery process is highly complicated and involves many disciplines. The greatest
ideas are often constrained by billions of dollars of testing and huge expenditures of time and money.
On average, it takes twelve years to make an official submission.
Data science applications and machine learning algorithms simplify and shorten this process,
adding a perspective to each step from the initial screening of drug compounds to the prediction
of the success rate based on the biological factors. Such algorithms can forecast how the
compound will act in the body using advanced mathematical modeling and simulations instead
of the “lab experiments”. The idea behind computational drug discovery is to create
computer simulations of a biologically relevant network, simplifying the prediction of
future outcomes with high accuracy.
iv. Virtual assistance for patients and customer support
Optimization of the clinical process builds upon the concept that for many cases it is not
actually necessary for patients to visit doctors in person. A mobile application can give a more
effective solution by bringing the doctor to the patient instead.
AI-powered mobile apps, usually chatbots, can provide basic healthcare support. You
simply describe your symptoms, or ask questions, and then receive key information about your
medical condition derived from a wide network linking symptoms to causes. Apps can remind
you to take your medicine on time, and if necessary, assign an appointment with a doctor.
This approach promotes a healthy lifestyle by encouraging patients to make healthy decisions,
saves their time waiting in line for an appointment, and allows doctors to focus on more critical
cases.
3. Internet Search
Now, this is probably the first thing that strikes your mind when you think of Data Science
Applications.
When we speak of search, we think ‘Google’. Right? But there are many other search engines,
like Yahoo, Bing, Ask, AOL, and so on. All these search engines (including Google) make use
of data science algorithms to deliver the best results for a search query in a fraction of a
second. Consider the fact that Google processes more than 20 petabytes of data every day.
4. Targeted Advertising
If you thought Search would be the biggest of all data science applications, here is a
challenger: the entire digital marketing spectrum. From the display banners on various
websites to the digital billboards at airports, almost all of them are placed using data
science algorithms.
This is the reason why digital ads have been able to achieve a much higher CTR (Click-Through Rate)
than traditional advertisements. They can be targeted based on a user’s past behavior.
This is the reason why you might see ads for data science training programs while someone else
sees an ad for apparel in the same place at the same time.
5. Website Recommendations
Aren’t we all used to the suggestions about similar products on Amazon? They not only help
you find relevant products from the billions of products available but also add a lot to
the user experience.
A lot of companies have fervently used recommendation engines to promote their products in accordance
with users’ interests and the relevance of information. Internet giants like Amazon, Twitter, Google Play,
Netflix, LinkedIn, IMDb, and many more use this system to improve the user experience. The
recommendations are made based on a user's previous search results.
6. Advanced Image Recognition
You upload a photo with friends on Facebook and you start getting suggestions to tag your
friends. This automatic tag suggestion feature uses a face recognition algorithm.
In their latest update, Facebook outlined the additional progress they’ve made in this area,
making specific note of their advances in image recognition accuracy and capacity.
“We’ve witnessed massive advances in image classification (what is in the image?) as well as
object detection (where are the objects?), but this is just the beginning of understanding the
most relevant visual content of any image or video. Recently we’ve been designing techniques
that identify and segment each and every object in an image, a key capability that will enable
entirely new applications.”
In addition, Google provides you with the option to search for images by uploading them. It
uses image recognition and provides related search results.
7. Speech Recognition
Some of the best examples of speech recognition products are Google Voice, Siri, Cortana, etc.
Using the speech-recognition feature, even if you aren’t in a position to type a message, your
life wouldn’t stop. Simply speak out the message and it will be converted to text. However, at
times, you may find that speech recognition doesn’t perform accurately.
8. Airline Route Planning
The airline industry across the world is known to bear heavy losses. Except for a few airline service
providers, companies are struggling to maintain their occupancy ratios and operating profits.
The sharp rise in air-fuel prices and the need to offer heavy discounts to customers have further
made the situation worse. It wasn’t long before airline companies started using data science
to identify strategic areas of improvement. Now, using data science, airline companies can:
• Predict flight delays
• Decide which class of airplanes to buy
• Decide whether to fly directly to the destination or take a halt in between (for example, a flight can
take a direct route from New Delhi to New York, or it can choose to halt in another country on the way)
• Effectively drive customer loyalty programs
Southwest Airlines and Alaska Airlines are among the top companies that have embraced data
science to change the way they work.
9. Gaming
Games are now designed using machine learning algorithms that improve/upgrade themselves
as the player moves up to a higher level. In motion gaming, too, your opponent (the computer)
analyzes your previous moves and shapes its game accordingly. EA Sports, Zynga, Sony,
Nintendo, and Activision Blizzard have taken the gaming experience to the next level using data
science.
10. Augmented Reality
This is the last of the data science applications, and it seems among the most exciting for the future:
augmented reality.
Data Science and Augmented/Virtual Reality do have a relationship, considering that a headset contains
computing power, algorithms, and data to provide you with the best viewing experience.
A very small step towards this is the high-trending game Pokemon GO: the ability to walk
around and look at Pokemon on walls, streets, and things that aren’t really there. The creators
of this game used the data from Ingress, the previous app from the same company, to choose the
locations of the Pokemon and gyms.