
DATA SCIENCE

What is Data Science?


• Data science is the study of data. It involves developing methods of recording, storing,
and analyzing data to effectively extract useful information. The goal of data science is to
gain insights and knowledge from any type of data — both structured and unstructured.
• Data science is related to computer science but is a separate field. Computer science involves creating programs and algorithms to record and process data, while data science covers any type of data analysis, which may or may not use computers. Data science is more closely related to the field of statistics, which includes the collection, organization, analysis, and presentation of data.
• Because of the large amounts of data modern companies and organizations maintain, data
science has become an integral part of IT. For example, a company that has petabytes of
user data may use data science to develop effective ways to store, manage, and analyze
the data. The company may use the scientific method to run tests and extract results that
can provide meaningful insights about their users.
• Early usage
• In 1962, John Tukey described a field he called "data analysis", which resembles modern data science. In
1985, in a lecture given to the Chinese Academy of Sciences in Beijing, C.F. Jeff Wu used the term Data
Science for the first time as an alternative name for statistics. Later, attendees at a 1992 statistics symposium
at the University of Montpellier II acknowledged the emergence of a new discipline focused on data of
various origins and forms, combining established concepts and principles of statistics and data analysis with
computing.
• The term "data science" has been traced back to 1974, when Peter Naur proposed it as an alternative name
for computer science.In 1996, the International Federation of Classification Societies became the first
conference to specifically feature data science as a topic.However, the definition was still in flux. After the
1985 lecture in the Chinese Academy of Sciences in Beijing, in 1997 C.F. Jeff Wu again suggested that
statistics should be renamed data science. He reasoned that a new name would help statistics shed inaccurate
stereotypes, such as being synonymous with accounting, or limited to describing data.In 1998, Hayashi
Chikio argued for data science as a new, interdisciplinary concept, with three aspects: data design, collection,
and analysis.
• During the 1990s, popular terms for the process of finding patterns in datasets (which were increasingly
large) included "knowledge discovery" and "data mining".
• Modern usage
• The modern conception of data science as an independent discipline is sometimes attributed to William S. Cleveland. In a 2001 paper, he advocated an expansion of statistics beyond theory into technical areas; because this would significantly change the field, it warranted a new name. "Data science" became more widely used in the next few years: in 2002, the Committee on Data for Science and Technology launched the Data Science Journal. In 2003, Columbia University launched The Journal of Data Science. In 2014, the American Statistical Association's Section on Statistical Learning and Data Mining changed its name to the Section on Statistical Learning and Data Science, reflecting the ascendant popularity of data science.
• The professional title of "data scientist" has been attributed to DJ Patil and Jeff Hammerbacher in
2008. Though it was used by the National Science Board in their 2005 report, "Long-Lived Digital Data
Collections: Enabling Research and Education in the 21st Century," it referred broadly to any key role in
managing a digital data collection.
• There is still no consensus on the definition of data science, and it is considered by some to be a buzzword. Big data is a related marketing term. Data scientists are responsible for breaking down big data into usable information and creating software and algorithms that help companies and organizations determine optimal operations.
• Why Data Science?
• Here are some significant advantages of using data analytics technology:
• Data is the oil of today's world. With the right tools, technologies, and algorithms, we can convert data into a distinct business advantage
• Data science can help you detect fraud using advanced machine learning algorithms
• It helps you prevent significant monetary losses
• It allows you to build intelligent capabilities into machines
• You can perform sentiment analysis to gauge customer brand loyalty
• It enables you to make better and faster decisions
• It helps you recommend the right product to the right customer to enhance your business
• Data Science Process
1. Discovery:
• The discovery step involves acquiring data from all identified internal and external sources that help you answer the business question.
• The data can be:
• Logs from webservers
• Data gathered from social media
• Census datasets
• Data streamed from online sources using APIs
2. Preparation:
• Data can have many inconsistencies, such as missing values, blank columns, and incorrect data formats, which need to be cleaned. You need to process, explore, and condition the data before modelling. The cleaner your data, the better your predictions.
3. Model Planning:
• In this stage, you need to determine the methods and techniques for drawing relationships between input variables. Model planning is performed using different statistical formulas and visualization tools. SQL Analysis Services, R, and SAS/ACCESS are some of the tools used for this purpose.
4. Model Building:
• In this step, the actual model building process starts. Here, the data scientist splits the dataset into training and testing sets. Techniques like association, classification, and clustering are applied to the training dataset. The model, once prepared, is tested against the testing dataset.
5. Operationalize:
• In this stage, you deliver the final baselined model along with reports, code, and technical documents. The model is deployed into a real-time production environment after thorough testing.
6. Communicate Results
• In this stage, the key findings are communicated to all stakeholders. This helps you decide if the
project results are a success or a failure based on the inputs from the model.
• Introduction to Data Science Lifecycle
• Data Science Lifecycle revolves around using machine learning and other analytical
methods to produce insights and predictions from data to achieve a business objective.
The entire process involves several steps like data cleaning, preparation, modelling,
model evaluation, etc. It is a long process and may take several months to complete. So,
it is essential to have a general structure to follow for every problem at hand. The
globally acknowledged structure in solving any analytical problem is the Cross-
Industry Standard Process for Data Mining or CRISP-DM framework.
• The lifecycle of Data Science
1. Business Understanding
• The entire cycle revolves around the business goal. What will you solve if you do not have a precise problem? It is essential to understand the business objective clearly, because that will be the final goal of the analysis. Only after properly understanding the business objective can we set a specific analysis goal that is in sync with it. You need to know if the client wants to reduce credit loss, or predict the price of a commodity, and so on.
2. Data Understanding
• After business understanding, the next step is data understanding. This involves the collection of all the available data. You should work closely with the business team to know what data is present, what data could be used for this business problem, and other relevant information. This step involves describing the data, its structure, its relevance, and its data types, and exploring the data using graphical plots. In short, extract whatever information you can about the data simply by exploring it.
3. Data Preparation
• Next comes the data preparation stage. This includes selecting the relevant data; integrating the data by merging data sets; cleaning the data; treating missing values by removing or imputing them; treating erroneous data by removing it; and checking for outliers using box plots and handling them. It also involves constructing new data, deriving new features from existing ones, formatting the data into the desired structure, and removing unwanted columns and features. Data preparation is the most time-consuming yet arguably the most important step in the entire life cycle. Your model will only be as good as your data.
4. Exploratory Data Analysis
• This step involves getting some idea about the solution and the factors affecting it before building the actual model. The distribution of data within different feature variables is explored graphically using bar graphs; relations between different features are captured through graphical representations like scatter plots and heat maps. Many other data visualization techniques are used extensively to explore every feature individually and in combination with other features.
5. Data Modeling
• Data modelling is the heart of data analysis. A model takes the prepared data as input and provides the desired output. This step includes choosing the appropriate type of model, depending on whether the problem is a classification, regression, or clustering problem. After choosing the model family, we need to choose which algorithms within that family to implement, and implement them carefully. We need to tune the hyperparameters of each model to achieve the desired performance, and make sure there is a correct balance between performance and generalizability: we do not want the model to memorize the training data and then perform poorly on new data.
6. Model Evaluation
• Here the model is evaluated to check whether it is ready to be deployed. The model is tested on unseen data and evaluated on a carefully chosen set of evaluation metrics. We also need to make sure that the model conforms to reality. If we do not obtain a satisfactory result in the evaluation, we must iterate the entire modelling process until the desired level of the metrics is achieved. Any data science solution or machine learning model, just like a human, should evolve: it should be able to improve itself with new data and adapt to a new evaluation metric. We can build multiple models for a given phenomenon, but many of them may be imperfect. Model evaluation helps us choose and build the best possible model.
7. Model Deployment
• The model, after a rigorous evaluation, is finally deployed in the desired format
and channel. This is the final step in the data science life cycle. Each step in the
data science life cycle explained above should be worked upon carefully. If any
step is executed improperly, it will affect the next step, and the entire effort goes to
waste. For example, if data is not collected properly, you’ll lose information, and
you will not be building a perfect model. If data is not cleaned properly, the model
will not work. If the model is not evaluated properly, it will fail in the real world.
From business understanding to model deployment, each step should be given
proper attention, time and effort.
Core Concepts in Data Science
1. Dataset
• Just as the name implies, data science is a branch of science that applies the
scientific method to data with the goal of studying the relationships between
different features and drawing out meaningful conclusions based on these
relationships. Data is, therefore, the key component in data science. A dataset is a
particular instance of data that is used for analysis or model building at any given
time. A dataset comes in different flavors such as numerical data, categorical data,
text data, image data, voice data, and video data. A dataset could be static (not
changing) or dynamic (changes with time, for example, stock prices). Moreover, a
dataset could depend on space as well. For example, temperature data in the
United States would differ significantly from temperature data in Africa. For
beginning data science projects, the most popular type of dataset is a dataset
containing numerical data that is typically stored in a comma-separated values
(CSV) file format.
2. Data Wrangling

• Data wrangling is the process of converting data from its raw form to a tidy form
ready for analysis. Data wrangling is an important step in data preprocessing and
includes several processes like data importing, data cleaning, data structuring,
string processing, HTML parsing, handling dates and times, handling missing
data, and text mining.
• The process of data wrangling is a critical step for any data scientist. Very rarely is
data easily accessible in a data science project for analysis. It is more likely for the
data to be in a file, a database, or extracted from documents such as web pages,
tweets, or PDFs. Knowing how to wrangle and clean data will enable you to
derive critical insights from your data that would otherwise be hidden.
Figure 1: Data wrangling process.
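As a minimal sketch of these steps (assuming pandas and a hypothetical raw CSV file; the file and column names are illustrative only), the wrangling process might look like this:

import pandas as pd

# Import: read a hypothetical raw CSV file (the file name is illustrative)
df = pd.read_csv("raw_survey_data.csv")

# Structuring: keep only the columns needed for the analysis
df = df[["user_id", "signup_date", "age", "country"]]

# Handling dates and times: parse strings into datetime objects
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")

# String processing: normalize a free-text field
df["country"] = df["country"].str.strip().str.title()

# Handling missing data: drop rows without an id, impute missing ages with the median
df = df.dropna(subset=["user_id"])
df["age"] = df["age"].fillna(df["age"].median())

print(df.info())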
3. Data Visualization
• Data Visualization is one of the most important branches of data science. It is one
of the main tools used to analyze and study relationships between different
variables. Data visualization (e.g., scatter plots, line graphs, bar plots, histograms,
qqplots, smooth densities, boxplots, pair plots, heat maps, etc.) can be used for
descriptive analytics. Data visualization is also used in machine learning for data
preprocessing and analysis, feature selection, model building, model testing, and
model evaluation. When preparing a data visualization, keep in mind that data
visualization is more of an Art than Science. To produce a good visualization, you
need to put several pieces of code together for an excellent end result.
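As a small illustration (a sketch using matplotlib and synthetic data, not a prescribed workflow), two of the plot types listed above can be produced as follows:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: the distribution of a single variable
ax1.hist(x, bins=20)
ax1.set_title("Histogram of x")

# Scatter plot: the relationship between two variables
ax2.scatter(x, y, alpha=0.6)
ax2.set_title("y vs. x")

plt.tight_layout()
plt.show()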
4. Outliers
• An outlier is a data point that is very different from the rest of the dataset. Outliers are often just bad data, e.g., due to a malfunctioning sensor, contaminated experiments, or human error in recording data. Sometimes, however, outliers indicate something real, such as a malfunction in a system. Outliers are very common and are to be expected in large datasets. One common way to detect outliers in a dataset is by using a box plot. A simple regression model fitted to a dataset containing many outliers illustrates how outliers can significantly degrade the predictive power of a machine learning model. A common way to deal with outliers is simply to omit the affected data points. However, removing genuine outliers can be overly optimistic and lead to unrealistic models. Advanced methods for dealing with outliers include the RANSAC method.
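One simple way to flag outliers is the interquartile-range (IQR) rule that underlies the box plot. The sketch below uses synthetic data with two injected outliers (not the dataset referred to in the text):

import numpy as np

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(50, 5, size=100), [120, -30]])  # two injected outliers

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # the box plot "whisker" fences

outliers = data[(data < lower) | (data > upper)]
print("IQR fences:", lower, upper)
print("Detected outliers:", outliers)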
5. Data Imputation
• Most datasets contain missing values. The easiest way to deal with missing data is simply to throw away the affected data points. However, removing samples or dropping entire feature columns is often not feasible because we might lose too much valuable data. In that case, we can use different interpolation techniques to estimate the missing values from the other training samples in our dataset. One of the most common techniques is mean imputation, where we simply replace the missing value with the mean value of the entire feature column. Other options for imputing missing values are the median or the most frequent value (mode), where the latter replaces the missing values with the most frequent value in the column. Whatever imputation method you employ, keep in mind that imputation is only an approximation and hence can introduce error into the final model. If the data supplied was already preprocessed, you will have to find out how missing values were handled.
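As a minimal sketch, mean imputation as described above can be done with scikit-learn's SimpleImputer (the small array is made up for illustration):

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[25.0, 50000.0],
              [np.nan, 62000.0],
              [31.0, np.nan],
              [40.0, 58000.0]])

# Replace missing values with the column mean; the strategy could also be
# "median" or "most_frequent", as discussed above
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)
print(X_imputed)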
6. Data Scaling
• Scaling your features will help improve the quality and predictive power of your model. For example, suppose you would like to build a model to predict a target variable, creditworthiness, based on predictor variables such as income and credit score. Because credit scores range from 0 to 850 while annual income could range from $25,000 to $500,000, the income feature will dominate the model if the features are not scaled: its much larger numerical range swamps the contribution of the credit score, so the model effectively predicts creditworthiness from income alone.
• In order to bring features to the same scale, we can use either normalization or standardization. Most often, we assume data is normally distributed and default to standardization, but that is not always appropriate. Before deciding whether to use standardization or normalization, first look at how your features are statistically distributed. If a feature tends to be uniformly distributed, then normalization (MinMaxScaler) may be used. If a feature is approximately Gaussian, then standardization (StandardScaler) can be used. Note that whether you employ normalization or standardization, these are approximate methods and are bound to contribute to the overall error of the model.
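A minimal sketch of both options using scikit-learn, with the two hypothetical features from the example above (the numbers are made up for illustration):

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Columns: credit score (0-850) and annual income in dollars
X = np.array([[700, 45000],
              [620, 120000],
              [810, 300000],
              [560, 30000]], dtype=float)

# Standardization: zero mean, unit variance (for roughly Gaussian features)
X_std = StandardScaler().fit_transform(X)

# Normalization: rescale each feature to the [0, 1] range (for roughly uniform features)
X_minmax = MinMaxScaler().fit_transform(X)

print(X_std)
print(X_minmax)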
7. Principal Component Analysis (PCA)
• Large datasets with hundreds or thousands of features often lead to redundancy especially
when features are correlated with each other. Training a model on a high-dimensional
dataset having too many features can sometimes lead to overfitting (the model captures
both real and random effects). In addition, an overly complex model having too many
features can be hard to interpret. One way to solve the problem of redundancy is via
feature selection and dimensionality reduction techniques such as PCA. Principal
Component Analysis (PCA) is a statistical method that is used for feature extraction. PCA
is used for high-dimensional and correlated data. The basic idea of PCA is to transform the original feature space into the space of the principal components. A PCA transformation achieves the following (see the sketch after this list):
a) It reduces the number of features used in the final model by focusing only on the components accounting for the majority of the variance in the dataset.
b) It removes the correlation between features.
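A minimal sketch using scikit-learn (the Iris dataset and the 95% variance threshold are illustrative choices, not part of the original text):

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Features should be on the same scale before applying PCA
X_scaled = StandardScaler().fit_transform(X)

# Keep enough principal components to explain about 95% of the variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)

print("Original shape:", X_scaled.shape)
print("Reduced shape:", X_pca.shape)
print("Explained variance ratios:", pca.explained_variance_ratio_)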
8. Linear Discriminant Analysis (LDA)
• PCA and LDA are two linear transformation techniques for data preprocessing that are often used for dimensionality reduction, to select relevant features that can be used in the final machine learning algorithm. PCA is an unsupervised algorithm used for feature extraction in high-dimensional and correlated data; it achieves dimensionality reduction by transforming features onto orthogonal component axes of maximum variance in the dataset. The goal of LDA is to find the feature subspace that optimizes class separability while reducing dimensionality, which makes LDA a supervised algorithm (a sketch contrasting the two follows).
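In scikit-learn (the Iris dataset is just a convenient example), note that LDA requires the class labels y, while PCA does not:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# PCA: unsupervised, ignores the class labels
X_pca = PCA(n_components=2).fit_transform(X)

# LDA: supervised, uses the labels to maximize class separability
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(X_pca.shape, X_lda.shape)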
9. Data Partitioning
• In machine learning, the dataset is often partitioned into training and testing sets.
The model is trained on the training dataset and then tested on the testing dataset.
The testing dataset thus acts as the unseen dataset, which can be used to estimate a
generalization error (the error expected when the model is applied to a real-world
dataset after the model has been deployed). In scikit-learn, the train_test_split function can be used to split the dataset as follows:
• from sklearn.model_selection import train_test_split
• X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
• Here, X is the feature matrix and y is the target variable. In this case, the testing dataset is set to 30% of the data.
10. Supervised Learning
• These are machine learning algorithms that perform learning by studying the relationship between the
feature variables and the known target variable. Supervised learning has two subcategories:
a) Continuous Target Variables
• Algorithms for predicting continuous target variables include Linear Regression, KNeighbors regression
(KNR), and Support Vector Regression (SVR).
b) Discrete Target Variables
• Algorithms for predicting discrete target variables include the following (a short classification sketch follows this list):
• Perceptron classifier
• Logistic Regression classifier
• Support Vector Machines (SVM)
• Decision tree classifier
• K-nearest classifier
• Naive Bayes classifier
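As a minimal end-to-end sketch of supervised learning with a discrete target (using one of the classifiers listed above on a toy dataset; the choice of Logistic Regression and the Iris data are illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit a Logistic Regression classifier on the labeled training data
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

print("Test accuracy:", clf.score(X_test, y_test))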
11. Unsupervised Learning
• In unsupervised learning, we are dealing with unlabeled data or data of unknown
structure. Using unsupervised learning techniques, we are able to explore the structure
of our data to extract meaningful information without the guidance of a known outcome
variable or reward function. K-means clustering is an example of an unsupervised
learning algorithm.
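A minimal k-means sketch in scikit-learn (the synthetic data and the choice of three clusters are assumptions for illustration; in practice the number of clusters is something you would tune):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Unlabeled data: only X is used, never any target variable
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print("Cluster centers:\n", kmeans.cluster_centers_)
print("First ten cluster assignments:", labels[:10])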
12. Reinforcement Learning
• In reinforcement learning, the goal is to develop a system (agent) that improves its
performance based on interactions with the environment. Since the information about
the current state of the environment typically also includes a so-called reward signal, we
can think of reinforcement learning as a field related to supervised learning. However,
in reinforcement learning, this feedback is not the correct ground-truth label or value but a measure of how good the action was, as given by a reward function. Through interaction with the environment, an agent can then use reinforcement learning to learn a series of actions that maximizes this reward.
13. Model Parameters and Hyperparameters
• In a machine learning model, there are two types of parameters:
a) Model Parameters: These are the parameters in the model that must be determined
using the training data set. These are the fitted parameters. For example, suppose we
have a model such as house price = a + b*(age) + c*(size), to estimate the cost of
houses based on the age of the house and its size (square foot), then a, b, and c will
be our model or fitted parameters.
b) Hyperparameters: These are adjustable parameters that must be tuned to obtain a
model with optimal performance. An example of a hyperparameter is shown here:
• KNeighborsClassifier(n_neighbors = 5, p = 2, metric = 'minkowski')
• It is important that during training, the hyperparameters are tuned to obtain the model with the best performance (that is, with the best-fitted parameters); a grid-search sketch follows.
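One common way to tune hyperparameters such as n_neighbors is a grid search combined with cross-validation; a minimal sketch (the parameter grid and dataset are illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Candidate hyperparameter values to try
param_grid = {"n_neighbors": [3, 5, 7, 9], "p": [1, 2]}

search = GridSearchCV(KNeighborsClassifier(metric="minkowski"), param_grid, cv=5)
search.fit(X, y)

print("Best hyperparameters:", search.best_params_)
print("Best cross-validated score:", search.best_score_)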
14. Cross-validation
• Cross-validation is a method of evaluating a machine learning model's performance across different random samples of the dataset. This helps ensure that the evaluation is not biased by any single split of the data. Cross-validation can give us reliable estimates of the model's generalization error, that is, how well the model performs on unseen data.
• In k-fold cross-validation, the dataset is randomly partitioned into k folds. In each round, one fold is held out as the testing set, and the model is trained on the remaining k-1 folds and evaluated on the held-out fold. The process is repeated k times so that every fold serves as the testing set exactly once, and the average training and testing scores are then calculated over the k folds.
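A minimal runnable sketch of this procedure, using scikit-learn's KFold and a simple classifier (both are assumed tool choices for illustration):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in kf.split(X):
    # Train on k-1 folds, evaluate on the held-out fold
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

print("Fold scores:", np.round(scores, 3))
print("Mean cross-validated score:", np.mean(scores))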
15. Bias-variance Tradeoff
• In statistics and machine learning, the bias-variance tradeoff is the property of a
set of predictive models whereby models with a lower bias in parameter
estimation have a higher variance of the parameter estimates across samples and
vice versa. The bias-variance dilemma or problem is the conflict in trying to
simultaneously minimize these two sources of error that prevent supervised
learning algorithms from generalizing beyond their training set:
• The bias is an error from erroneous assumptions in the learning algorithm. High
bias (overly simple) can cause an algorithm to miss the relevant relations between
features and target outputs (underfitting).
• The variance is an error from sensitivity to small fluctuations in the training set.
High variance (overly complex) can cause an algorithm to model the random noise
in the training data rather than the intended outputs (overfitting).
• It is important to find the right balance between model simplicity and complexity.
16. Evaluation Metrics
• In machine learning (predictive analytics), there are several metrics that can be
used for model evaluation. For example, a supervised learning (continuous target)
model can be evaluated using metrics such as the R2 score, mean square error
(MSE), or mean absolute error (MAE). Furthermore, a supervised learning
(discrete target) model, also referred to as a classification model, can be evaluated
using metrics such as accuracy, precision, recall, f1 score, and the area under ROC
curve (AUC).
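A minimal sketch computing several of these classification metrics with scikit-learn (the dataset and model are illustrative choices):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=5000)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 score:", f1_score(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, y_prob))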
17. Uncertainty Quantification

• It is important to build machine learning models that will yield unbiased estimates
of uncertainties in calculated outcomes. Due to the inherent randomness in the
dataset and model, evaluation parameters such as the R2 score are random
variables, and thus it is important to estimate the degree of uncertainty in the
model.
18. Math Concepts
a) Basic Calculus: Most machine learning models are built with a dataset having several features or
predictors. Hence, familiarity with multivariable calculus is extremely important for building a machine
learning model. Here are the topics you need to be familiar with:
Functions of several variables; Derivatives and gradients; Step function, Sigmoid function, Logit function,
ReLU (Rectified Linear Unit) function; Cost function; Plotting of functions; Minimum and Maximum values
of a function
b) Basic Linear Algebra: Linear algebra is the most important math skill in machine learning. A data set is
represented as a matrix. Linear algebra is used in data preprocessing, data transformation, dimensionality
reduction, and model evaluation. Here are the topics you need to be familiar with:
Vectors; Norm of a vector; Matrices; Transpose of a matrix; The inverse of a matrix; The determinant of a
matrix; Trace of a Matrix; Dot product; Eigenvalues; Eigenvectors
c) Optimization Methods: Most machine learning algorithms perform predictive modeling by minimizing
an objective function, thereby learning the weights that must be applied to the testing data in order to obtain
the predicted labels. Here are the topics you need to be familiar with:
Cost function/Objective function; Likelihood function; Error function; Gradient Descent Algorithm and its
variants (e.g., Stochastic Gradient Descent Algorithm)
19. Statistics and Probability Concepts
• Statistics and Probability are used for visualization of features, data preprocessing,
feature transformation, data imputation, dimensionality reduction, feature engineering,
model evaluation, etc. Here are the topics you need to be familiar with:
• Mean, Median, Mode, Standard deviation/variance, Correlation coefficient and the
covariance matrix, Probability distributions (Binomial, Poisson, Normal), p-value,
Bayes Theorem (Precision, Recall, Positive Predictive Value, Negative Predictive Value,
Confusion Matrix, ROC Curve), Central Limit Theorem, R² score, Mean Square Error
(MSE), A/B Testing, Monte Carlo Simulation
20. Productivity Tools
A typical data analysis project may involve several parts, each including several data files
and different scripts with code. Keeping all these organized can be challenging.
Productivity tools help you to keep projects organized and to maintain a record of your
completed projects. Some essential productivity tools for practicing data scientists include
tools such as Unix/Linux, git and GitHub, RStudio, and Jupyter Notebook.
Top Data Science Tools
Here is a list of 14 popular data science tools that data scientists commonly use.
1. SAS
It is one of those data science tools that are specifically designed for statistical operations. SAS is closed-source, proprietary software used by large organizations to analyze data. SAS uses the Base SAS programming language for statistical modeling.
• It is widely used by professionals and companies working on reliable commercial software. SAS offers numerous statistical libraries and tools that you as a data scientist can use for modeling and organizing your data.
• While SAS is highly reliable and has strong support from the company, it is very expensive and is mostly used by larger organizations. SAS also pales in comparison with some of the more modern open-source tools.
• Furthermore, several libraries and packages in SAS are not available in the base pack and can require an expensive upgrade.
2. Apache Spark
Apache Spark, or simply Spark, is a powerful analytics engine and one of the most used data science tools. Spark is specifically designed to handle batch processing and stream processing.
• It comes with many APIs that make it easy for data scientists to access data repeatedly for machine learning, SQL storage, and so on. It is an improvement over Hadoop and can run up to 100 times faster than MapReduce.
• Spark has many machine learning APIs that can help data scientists make powerful predictions from the given data.
• Spark does better than other big data platforms in its ability to handle streaming data. This means that Spark can process real-time data, unlike analytical tools that only process historical data in batches.
• Spark offers APIs that are programmable in Python, Java, and R, but the most powerful combination is Spark with Scala, a cross-platform programming language that runs on the Java Virtual Machine.
• Spark is highly efficient at cluster management, which makes it much better than Hadoop when the latter is used only for storage. It is this cluster management system that allows Spark to process applications at high speed.
3. BigML
• BigML is another widely used data science tool. It provides a fully interactive, cloud-based GUI environment for running machine learning algorithms. BigML provides standardized, cloud-based software for industry requirements.
• Through it, companies can use machine learning algorithms across various parts of the business. For example, a company can use this one platform for sales forecasting, risk analytics, and product innovation.
• BigML specializes in predictive modeling. It uses a wide variety of machine learning techniques such as clustering, classification, and time-series forecasting.
• BigML provides an easy-to-use web interface and REST APIs, and you can create a free account or a premium account based on your data needs. It allows interactive visualization of data and lets you export visual charts to your mobile or IoT devices.
• Furthermore, BigML comes with various automation methods that can help you automate hyperparameter tuning and even automate the workflow of reusable scripts.
4. D3.js
• JavaScript is mainly used as a client-side scripting language. D3.js, a JavaScript library, allows you to make interactive visualizations in your web browser. With the several APIs of D3.js, you can use many functions to create dynamic visualization and analysis of data in your browser.
• Another powerful feature of D3.js is animated transitions. D3.js makes documents dynamic by allowing updates on the client side and actively using changes in the data to update the visualizations in the browser.
• You can combine it with CSS to create striking, animated visualizations that help you implement customized graphs on web pages.
• Overall, it can be a very useful tool for data scientists working on IoT-based applications that require client-side interaction for visualization and data processing.
5. MATLAB
• MATLAB is a multi-paradigm numerical computing environment for processing mathematical information. It is closed-source software that facilitates matrix operations, algorithm implementation, and statistical modeling of data. MATLAB is widely used in several scientific disciplines.
• In data science, MATLAB is used for simulating neural networks and fuzzy logic. Using the MATLAB graphics library, you can create powerful visualizations. MATLAB is also used in image and signal processing.
• This makes it a very versatile tool for data scientists, who can use it to tackle problems ranging from data cleaning and analysis to more advanced deep learning algorithms.
• Furthermore, MATLAB's easy integration with enterprise applications and embedded systems makes it an appealing data science tool.
• It also helps automate various tasks, from data extraction to the reuse of scripts for decision making. However, it has the limitation of being closed-source, proprietary software.
6. Excel
• Excel is probably the most widely used data analysis tool. Microsoft developed Excel mostly for spreadsheet calculations, and today it is widely used for data processing, visualization, and complex calculations.
• Excel is a powerful analytical tool for data science. While it has long been the traditional tool for data analysis, it still packs a punch.
• Excel comes with various formulas, tables, filters, slicers, and so on. You can also create your own custom functions and formulas. While Excel is not suited to handling huge amounts of data, it is still a good choice for creating powerful data visualizations and spreadsheets.
• You can also connect SQL with Excel and use it to manipulate and analyze data. Many data scientists use Excel for data cleaning, as it provides an interactive GUI environment for preprocessing information easily.
• With the Analysis ToolPak add-in for Microsoft Excel, it is now much easier to perform complex analyses. However, it still pales in comparison with far more advanced data science tools such as SAS. Overall, at a small, non-enterprise scale, Excel is a good tool for data analysis.
7. ggplot2
• ggplot2 is an advanced data visualization package for the R programming language. The developers created this tool to replace R's native graphics package, and it uses powerful commands to create striking visualizations.
• It is the most widely used library that data scientists use for creating visualizations from analyzed data. ggplot2 is part of the tidyverse, a collection of R packages designed for data science.
• One way in which ggplot2 is better than many other data visualization tools is aesthetics. With ggplot2, data scientists can create customized visualizations for more engaging storytelling.
• Using ggplot2, you can annotate your data in visualizations, add text labels to data points, and boost the interactivity of your graphs. You can also create various styles of maps such as choropleths, cartograms, and hexbins. It is one of the most used data science tools.
8. Tableau
• Tableau is data visualization software packed with powerful graphics for making interactive visualizations. It is focused on industries working in the field of business intelligence.
• The most important aspect of Tableau is its ability to interface with databases, spreadsheets, OLAP (Online Analytical Processing) cubes, and more. Along with these features, Tableau can visualize geographical data and plot longitudes and latitudes on maps.
• Along with visualizations, you can also use its analytics tools to analyze data. Tableau has an active community, and you can share your findings on its online platform. While Tableau is enterprise software, it comes with a free version called Tableau Public.
9. Jupyter
• Project Jupyter is an open-source tool, based on IPython, that helps developers build open-source software and experience interactive computing. Jupyter supports multiple languages such as Julia, Python, and R.
• It is a web-application tool used for writing live code, visualizations, and presentations. Jupyter is a widely popular tool designed to address the requirements of data science.
• It is an interactive environment through which data scientists can perform all of their responsibilities. It is also a powerful tool for storytelling, as various presentation features are built in.
• Using Jupyter Notebooks, one can perform data cleaning, statistical computation, and visualization, and create predictive machine learning models. It is 100% open-source and therefore free of cost.
• There is an online Jupyter-style environment called Google Colaboratory, which runs in the cloud and stores data in Google Drive.
10. Matplotlib
• Matplotlib is a plotting and visualization library developed for Python. It is the most popular tool for generating graphs from analyzed data. It is mainly used for plotting complex graphs with simple lines of code. Using it, one can generate bar plots, histograms, scatter plots, and so on.
• Matplotlib has several essential modules. One of the most widely used is pyplot, which offers a MATLAB-like interface. Pyplot is also an open-source alternative to MATLAB's graphics modules.
• Matplotlib is a preferred tool for data visualization and is used by data scientists over other contemporary tools.
• In fact, NASA used Matplotlib to illustrate data visualizations during the landing of the Phoenix spacecraft. It is also an ideal tool for beginners learning data visualization with Python.
11. NLTK
• Natural language processing (NLP) has emerged as one of the most popular fields in data science. It deals with the development of statistical models that help computers understand human language.
• These statistical models are part of machine learning and, through several of its algorithms, help computers understand natural language. The Python language comes with a collection of libraries called the Natural Language Toolkit (NLTK) developed for this purpose.
• NLTK is widely used for various language processing techniques such as tokenization, stemming, tagging, parsing, and machine learning. It includes over 100 corpora, which are collections of text data for building machine learning models.
• It has a variety of applications such as part-of-speech tagging, word segmentation, machine translation, and text-to-speech and speech recognition; a small tokenization and tagging sketch follows.
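The sketch below performs tokenization and part-of-speech tagging (the required NLTK data packages are downloaded first; the example sentence is arbitrary, and the exact package names may vary across NLTK versions):

import nltk

# One-time downloads of the tokenizer and tagger models
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

text = "Data science combines statistics and programming to extract insights."

tokens = nltk.word_tokenize(text)  # tokenization
tags = nltk.pos_tag(tokens)        # part-of-speech tagging

print(tokens)
print(tags)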
12. Scikit-learn
• Scikit-learn is a Python library used for implementing machine learning algorithms. It is a simple, easy-to-use tool that is widely used for analysis and data science.
• It supports a variety of machine learning tasks such as data preprocessing, classification, regression, clustering, and dimensionality reduction.
• Scikit-learn makes it easy to use complex machine learning algorithms. It is therefore well suited to situations that require rapid prototyping and is also an ideal platform for research requiring basic machine learning. It makes use of several underlying Python libraries such as SciPy, NumPy, and Matplotlib.
13. TensorFlow
• TensorFlow has become a standard tool for Machine Learning. It is widely used for advanced
machine learning algorithms like Deep Learning. Developers named TensorFlow after Tensors
which are multidimensional arrays.
• It is an open-source, ever-evolving toolkit known for its performance and high computational abilities. TensorFlow can run on both CPUs and GPUs, and more recently on even more powerful TPU platforms.
• This gives it an unprecedented edge in terms of the processing power of advanced machine
learning algorithms.
• Due to its high processing ability, Tensorflow has a variety of applications such as speech
recognition, image classification, drug discovery, image and language generation, etc. For Data
Scientists specializing in Machine Learning, Tensorflow is a must-know tool.
14. Weka
• Weka or Waikato Environment for Knowledge Analysis is a machine learning software written
in Java. It is a collection of various Machine Learning algorithms for data mining. Weka consists
of various machine learning tools like classification, clustering, regression, visualization and
data preparation.
• It is open-source GUI software that allows easier implementation of machine learning algorithms through an interactive platform.
Types of Data
Introduction – Importance of Data
“Data is the new oil.” Today data is everywhere, in every field. Whether you are a data scientist, marketer, businessperson, data analyst, or researcher, or are in any other profession, you need to work and experiment with raw or structured data. This data is so important that it must be handled and stored properly, without error. When working with data, it is important to know the types of data in order to process them and get the right results. There are two types of data, qualitative and quantitative, which are further classified into four types: nominal, ordinal, discrete, and continuous.
Business now runs on data: most companies use data to derive insights, create and launch campaigns, design strategies, launch products and services, or try out new things. According to one report, at least 2.5 quintillion bytes of data are produced per day.

Qualitative or Categorical Data


Qualitative or Categorical Data is data that can’t be measured or counted in the form of
numbers. These types of data are sorted by category, not by number. That’s why it is also
known as Categorical Data. These data consist of audio, images, symbols, or text. The gender
of a person, i.e., male, female, or others, is qualitative data.
Qualitative data tells us about people's perceptions. This data helps market researchers understand customers' tastes and then design their ideas and strategies accordingly.
Other examples of qualitative data are:
• What language do you speak?
• Favourite holiday destination
• Opinion on something (agree, disagree, or neutral)
• Colours
Qualitative data is further classified into two types:
Nominal Data
Nominal data is used to label variables without any order or quantitative value. The colour of hair can be considered nominal data, as one colour cannot be compared with another. The name “nominal” comes from the Latin word “nomen,” which means “name.” With nominal data, we cannot perform any numerical operations or impose any order on the values. These data have no meaningful order; their values are simply distributed across distinct categories.
Examples of Nominal Data :
• Colour of hair (Blonde, red, Brown, Black, etc.)
• Marital status (Single, Widowed, Married)
• Nationality (Indian, German, American)
• Gender (Male, Female, Others)
• Eye Color (Black, Brown, etc.)
Ordinal Data
Ordinal data has a natural ordering, in which values are placed in some kind of order by their position on a scale. Such data is used for observations like customer satisfaction or happiness, but we cannot perform arithmetic on it.
Ordinal data is qualitative data whose values have some kind of relative position. It can be considered “in between” qualitative and quantitative data: it shows sequence, but the values cannot be used for arithmetic-based statistical analysis. Compared to nominal data, ordinal data has an ordering that nominal data lacks.
Examples of Ordinal Data :
• When companies ask for feedback, experience, or satisfaction on a scale of 1 to 10
• Letter grades in the exam (A, B, C, D, etc.)
• Ranking of people in a competition (First, Second, Third, etc.)
• Economic Status (High, Medium, and Low)
• Education Level (Higher, Secondary, Primary)
• Difference between Nominal and Ordinal Data
• Nominal data cannot be quantified and has no intrinsic ordering; ordinal data has a sequential order given by position on a scale.
• Nominal data is qualitative (categorical) data; ordinal data is said to be “in between” qualitative and quantitative data.
• Nominal data provides no quantitative value and supports no arithmetic operations; ordinal data provides a sequence and can be assigned numbers, but arithmetic on those numbers is still not meaningful.
• Nominal data cannot be used to compare one item with another; ordinal data can compare items by ranking or ordering.
• Examples of nominal data: eye colour, housing style, gender, hair colour, religion, marital status, ethnicity. Examples of ordinal data: economic status, customer satisfaction, education level, letter grades.
Quantitative Data
Quantitative data can be expressed in numerical values, which makes it countable and suitable for statistical analysis. This kind of data is also known as numerical data. It answers questions like “how much,” “how many,” and “how often.” For example, the price of a phone, a computer’s RAM, and the height or weight of a person all fall under quantitative data.
Quantitative data can be used for statistical manipulation and can be represented on a wide variety of graphs and charts such as bar graphs, histograms, scatter plots, box plots, pie charts, and line graphs.
Examples of Quantitative Data :
• Height or weight of a person or object
• Room Temperature
• Scores and Marks (Ex: 59, 80, 60, etc.)
• Time
Quantitative data is further classified into two types:
Discrete Data
The term discrete means distinct or separate. Discrete data contains values that are integers or whole numbers; the total number of students in a class is an example. These values cannot be broken into decimals or fractions.
Discrete data is countable and takes finite values, and it cannot be subdivided. It is represented mainly by bar graphs, number lines, or frequency tables.
Examples of Discrete Data:
• Total numbers of students present in a class
• Cost of a cell phone
• Numbers of employees in a company
• The total number of players who participated in a competition
• Days in a week
Continuous Data
Continuous data can take fractional values. Examples include the version of an Android phone, the height of a person, the length of an object, and so on. Continuous data represents information that can be divided into ever smaller levels; a continuous variable can take any value within a range.
The key difference between discrete and continuous data is that discrete data contains integers or whole numbers, while continuous data can store fractional numbers, used to record quantities such as temperature, height, width, time, and speed.
Examples of Continuous Data:
• Height of a person
• Speed of a vehicle
• “Time-taken” to finish the work
• Wi-Fi Frequency
• Market share price
Difference between Discrete and Continuous Data
• Discrete data is countable and finite; the values are whole numbers or integers. Continuous data is measurable; the values are fractions or decimals.
• Discrete data is represented mainly by bar graphs, while continuous data is represented by histograms.
• Discrete values cannot be divided into smaller subdivisions, whereas continuous values can be divided into smaller pieces.
• Discrete data has gaps between values, while continuous data forms a continuous sequence.
• Examples of discrete data: total students in a class, number of days in a week, shoe size. Examples of continuous data: room temperature, the weight of a person, the length of an object.

The 15 most popular Data Science terms (Terminology)

1. Data Science
Data Science is a field that combines programming skills and knowledge of mathematics and
statistics to derive insights from data. In short: Data Scientists work with large amounts of data,
which are systematically analyzed to provide meaningful information that can be used for
decision making and problem solving. A Data Scientist has a high level of technical skills and
knowledge, usually with expertise in programming languages such as R and Python. They help
organizations collect, compile, interpret, format, model, predict, and manipulate all types of
data in a wide variety of ways.

2. Algorithm
Algorithms are repeatable sets of instructions, usually expressed mathematically, that humans or machines can use to process given data. Typically, algorithms are constructed by feeding them data and adjusting variables until the desired result is achieved. Thanks to breakthrough developments in artificial intelligence, machines now typically perform this adjustment themselves, as they can do it much faster than a human.

3. Data Analytics
Data analytics involves answering the questions generated for better business decision-making. Existing information is used to extract actionable data. Data analysis is an ongoing process in which data is collected and analyzed continuously. An essential component of ensuring data integrity is the accurate evaluation of the analysis results.
4. Data mining
Data mining is the process of sorting large data sets to identify patterns and relationships that
can help solve business problems. Data mining techniques and tools can be used to predict
future trends and make more informed business decisions. Data mining is a component of Data
Analysis and one of the core disciplines of Data Science.
The data mining process can be divided into these four main stages:

In the data sources stage, relevant data for an analytics application is identified and assembled. The data may be located in different source systems that contain a mix of structured and unstructured data.

The data exploration stage includes a set of steps to get the data ready to be mined. It
summarizes the steps of data exploration, profiling, and pre-processing, followed by data
cleansing work to fix errors and other data quality issues.
Now it is time to implement one or more algorithms to do the mining/modeling. In Machine
Learning applications, the algorithms typically must be trained on sample data sets.

The final stage covers deployment: putting the models into production and communicating the findings to business executives and users, often through data visualization and data storytelling.

5. Big Data
The term "Big Data" has emerged as an ever-increasing amount of data has become available.
Today's data differs from that of the past not only in the amount but also in the speed at which
it is available. It is data with such large size and complexity that none of the traditional data
management tools can store it or process it efficiently.
Big data benefits:
• Big data can produce more complete answers, because you have more information
• Answers can be defined more precisely by confirming them across multiple data sources

6. Artificial intelligence (AI)


The term is frequently applied to the project of developing systems endowed with the intellectual processes characteristic of humans, coming close to an imitation of human intelligence. John
McCarthy also offers the following definition: "It is the science and engineering of making
intelligent machines, especially intelligent computer programs. It is related to the similar task
of using computers to understand human intelligence, but AI does not have to confine itself to
methods that are biologically observable."

7. Machine Learning
Machine Learning is a technique that allows a computer to learn from data without being given
a complex set of explicit rules. It is a subset of AI in which algorithms learn from historical data
to predict outcomes and uncover patterns. It is also the process that drives many of the services
we use today - recommendation systems like those from Netflix, YouTube, and Spotify; search
engines like Google; social media feeds like Facebook and Twitter; voice assistants like Siri
and Alexa, etc. With each click or other activity, you give a machine learning system material
to process into information, which it can use to make a highly educated decision about what to
show you next.
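As a minimal illustration (toy data, not a real recommendation system), the sketch below shows the core idea: a scikit-learn model learns from historical examples and then predicts an outcome for a new, unseen case.

from sklearn.linear_model import LogisticRegression

# Historical data: [hours watched, clicks on similar titles] -> did the user finish the show?
X_history = [[0.5, 1], [1.0, 2], [4.0, 8], [5.0, 9], [0.2, 0], [6.0, 12]]
y_history = [0, 0, 1, 1, 0, 1]

model = LogisticRegression()
model.fit(X_history, y_history)      # learn patterns from past behaviour

print(model.predict([[3.5, 7]]))     # predict the outcome for a new, unseen user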
8. Deep learning
Deep Learning is a Machine Learning technique inspired by the neural networks of the human
brain. It gives machines the ability to find even the smallest patterns in a data set, using many
layers of computational nodes that work together to search through the data and deliver a final
result in the form of a prediction.
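A small sketch of the "many layers of computational nodes" idea, using scikit-learn's multi-layer perceptron on a synthetic two-moons data set; the layer sizes and parameters here are illustrative choices, not prescribed by the notes.

from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

# A synthetic data set whose two classes cannot be separated by a straight line.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Two hidden layers of computational nodes search for the pattern together.
net = MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=2000, random_state=0)
net.fit(X, y)

print(f"training accuracy: {net.score(X, y):.2f}")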

9. NLP
Natural language processing (NLP) is an intersection of the fields of Computer Science,
linguistics, and Artificial Intelligence. It helps computers communicate with people in their
own language and perform other language-related tasks. NLP enables computers to read text,
listen to and interpret speech, and determine which parts are important. The goal is to enable
the widest possible communication between humans and computers via natural language, so
that machines and applications can be controlled and operated by it.
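The sketch below (standard-library Python only; the sentence and stopword list are invented) illustrates two of the basic steps mentioned above, reading text and deciding which parts are important, via crude tokenization and word-frequency counting. Real systems would use libraries such as NLTK or spaCy.

import re
from collections import Counter

text = ("Natural language processing helps computers read text, "
        "interpret speech, and decide which parts of the language are important.")

tokens = re.findall(r"[a-z']+", text.lower())           # crude tokenizer
stopwords = {"and", "the", "of", "are", "which", "helps"}
keywords = [t for t in tokens if t not in stopwords]

print(Counter(keywords).most_common(3))                 # most frequent content words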

10. Python
Python is one of the most popular programming languages today; it is best known as a versatile
language, which makes it very useful for analyzing data. The language's creators focused on
making it easy to learn and user-friendly, so it is also a very common first programming
language. Furthermore, Python's easily understandable syntax allows for quick, compact, and
readable implementation of scripts or programs, in comparison with other programming
languages.

Python is among the fastest-growing programming languages globally, for many reasons: its
ease of learning, the recent explosion of the Data Science field, and the rise of Machine
Learning. Python also supports Object-Oriented and Functional programming styles, which
facilitate building automated tasks and deployable systems. There are plenty of Python
scientific packages for Data Visualization, Machine Learning, Natural Language Processing,
and more.
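As a small, invented illustration of that compact and readable syntax, the following few lines compute simple summary statistics over a list of measurements.

measurements = [12.5, 14.0, 11.8, 15.2, 13.1]           # invented sensor readings

mean = sum(measurements) / len(measurements)
above_mean = [m for m in measurements if m > mean]      # a list comprehension

print(f"mean = {mean:.2f}, values above the mean: {above_mean}")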

11. R
R is an open-source implementation of the statistical programming language S which was
developed at Bell Labs in the 1970s. Most of its underlying source code has been written in C
and Fortran. R also allows its users to manipulate R objects from these languages (as well as
C++) for computationally intensive tasks. It is essentially a highly extensible and flexible
environment for performing statistical computations and data analysis.
R is the language of choice for statistical analysis, which is a very important feature in Data
Science. R’s popularity comes from the fact that most statistical methods developed in research
environments lead to the production of ready-to-use freely available R packages. R’s popularity
has led Microsoft to develop Microsoft R Open (the Enhanced R Distribution) and Oracle to
develop Oracle R Enterprise. From our partner companies, we have learned that along with
Python, R remains the language of choice for Data Scientists in the insurance and
pharmaceutical sectors.

12. SQL
SQL (Structured Query Language) is the language used to query and manipulate data in
RDBMSs (Relational Database Management Systems) and is, for this reason, very relevant in
the field of Data Science. RDBMSs store data in rows and columns within a structured format
and are a potent tool for storing massive amounts of information. Some common database
management systems that use SQL are Sybase, Oracle, Microsoft SQL Server, Access, etc.
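A minimal sketch of querying rows and columns with SQL, using Python's built-in sqlite3 module and an in-memory database; the table and data are invented for illustration.

import sqlite3

conn = sqlite3.connect(":memory:")       # throwaway in-memory database
cur = conn.cursor()

cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, spend REAL)")
cur.executemany("INSERT INTO customers (name, spend) VALUES (?, ?)",
                [("Asha", 120.0), ("Ben", 340.5), ("Chen", 89.9)])

# A typical analytical query: which customers spent more than 100?
cur.execute("SELECT name, spend FROM customers WHERE spend > ? ORDER BY spend DESC", (100,))
print(cur.fetchall())                    # [('Ben', 340.5), ('Asha', 120.0)]

conn.close()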

13. NumPy & Pandas


NumPy is the fundamental package for Scientific Computing with Python, adding support for
large, multi-dimensional arrays, along with an extensive library of high-level mathematical
functions. Pandas is a library built on top of NumPy for data manipulation and analysis. The
library provides data structures and a rich set of operations for manipulating numerical tables
and time series.
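A minimal sketch of the two libraries (invented temperature data; NumPy and pandas assumed installed): a NumPy array with an element-wise mathematical operation, and a pandas DataFrame built on top of it.

import numpy as np
import pandas as pd

temps_c = np.array([21.5, 23.0, 19.8, 25.1])    # a multi-element numerical array
temps_f = temps_c * 9 / 5 + 32                  # element-wise math, no explicit loop

df = pd.DataFrame({"celsius": temps_c, "fahrenheit": temps_f})
print(df.describe())                            # quick statistical summary of the table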

14. Web Scraping


Web scraping pulls data from the source code of a website. This requires a script that identifies
the information a user wants and transfers it to a new file. Usually, software that simulates
human browsing on the Internet is used for this purpose to collect specific information from
various websites. Web scraping is also referred to as web data extraction, screen scraping or
web harvesting.
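A minimal sketch of pulling data out of a page's source code with the requests and BeautifulSoup libraries. The URL is a placeholder, and in practice you should check a site's terms of use and robots.txt before scraping it.

import requests
from bs4 import BeautifulSoup

url = "https://example.com"                     # placeholder page, not a real target
html = requests.get(url, timeout=10).text       # download the page's source code

soup = BeautifulSoup(html, "html.parser")
headlines = [h.get_text(strip=True) for h in soup.find_all("h1")]

print(headlines)                                # the text of every <h1> on the page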

15. API
APIs (Application Programming Interfaces) provide users with a set of functions used to
interact with the features of a specific service or application. Facebook, for example, provides
developers of software applications with access to Facebook features through its API. By
hooking into the Facebook API, developers can allow users of their applications to log in with
Facebook, or access personal information stored in Facebook's databases.
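A minimal sketch of calling an API over HTTP with the requests library. The endpoint and token below are hypothetical placeholders, not a real service such as the Facebook API.

import requests

BASE_URL = "https://api.example.com/v1"         # hypothetical service
headers = {"Authorization": "Bearer YOUR_ACCESS_TOKEN"}   # placeholder credential

response = requests.get(f"{BASE_URL}/users/me", headers=headers, timeout=10)
response.raise_for_status()                     # fail loudly on HTTP errors

profile = response.json()                       # most web APIs return JSON
print(profile)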
Prerequisites for Data Science
Here are some of the technical concepts you should know about before you start learning data
science.
1. Machine Learning
Machine learning is the backbone of data science. Data Scientists need to have a solid grasp of
ML in addition to basic knowledge of statistics.
2. Modeling
Mathematical models enable you to make quick calculations and predictions based on what
you already know about the data. Modeling is also a part of Machine Learning and involves
identifying which algorithm is the most suitable to solve a given problem and how to train these
models.
3. Statistics
Statistics is at the core of data science. A solid grasp of statistics can help you extract more
intelligence and obtain more meaningful results.
4. Programming
Some level of programming is required to execute a successful data science project. The most
common programming languages are Python and R. Python is especially popular because it's
easy to learn, and it supports multiple libraries for data science and ML.
5. Databases
A capable data scientist needs to understand how databases work, how to manage them, and
how to extract data from them.
What Does a Data Scientist Do?
A data scientist analyzes business data to extract meaningful insights. In other words, a data
scientist solves business problems through a series of steps, including:
• Before tackling data collection and analysis, the data scientist frames the problem by asking
the right questions and gaining understanding.
• The data scientist then determines the correct set of variables and data sets, gathering
structured and unstructured data from many disparate sources (enterprise data, public data, etc.).
• Once the data is collected, the data scientist processes the raw data and converts it into a
format suitable for analysis. This involves cleaning and validating the data to guarantee
uniformity, completeness, and accuracy.
• After the data has been rendered into a usable form, it is fed into the analytic system (an ML
algorithm or a statistical model). This is where the data scientist analyzes the data and identifies
patterns and trends.
• Once the analysis is complete, the data scientist interprets the results to find opportunities
and solutions.
• Finally, the data scientist prepares the results and insights to share with the appropriate
stakeholders and communicates them.
A compressed code sketch of these steps is shown below.
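The sketch below is a compressed, invented example in Python (pandas and scikit-learn assumed): collect a small data set, clean it, feed it to a model, and turn the result into a statement for stakeholders.

import pandas as pd
from sklearn.linear_model import LinearRegression

# Collect: a tiny invented data set of ad spend vs. monthly sales.
df = pd.DataFrame({"ad_spend": [1.0, 2.0, 3.0, None, 5.0],
                   "sales":    [20.5, 41.0, 61.8, 79.9, 102.3]})

# Process and clean: handle the missing value so the data is uniform and complete.
df["ad_spend"] = df["ad_spend"].fillna(df["ad_spend"].mean())

# Analyze: fit a statistical model to identify the trend.
model = LinearRegression().fit(df[["ad_spend"]], df["sales"])

# Interpret and communicate: turn the coefficient into a statement for stakeholders.
print(f"Each extra unit of ad spend is associated with about {model.coef_[0]:.1f} more sales.")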
It also helps to be familiar with some common machine learning algorithms, which are
beneficial for understanding data science clearly.
Why Become a Data Scientist?
According to Glassdoor and Forbes, demand for data scientists will increase by 28 percent by
2026, which speaks to the profession's durability and longevity. So if you want a secure career,
data science offers you that chance.
Furthermore, the profession of data scientist came in second place in the Best Jobs in America
for 2021 survey, with an average base salary of USD 127,500.
So, if you’re looking for an exciting career that offers stability and generous compensation,
then look no further!
Where Do You Fit in Data Science?
Data science offers you the opportunity to focus on and specialize in one aspect of the field.
Here’s a sample of different ways you can fit into this exciting, fast-growing field.
Data Scientist
Job role: Determine what the problem is, what questions need answers, and where to find the
data. Also, they mine, clean, and present the relevant data.
Skills needed: Programming skills (SAS, R, Python), storytelling and data visualization,
statistical and mathematical skills, knowledge of Hadoop, SQL, and Machine Learning.
Data Analyst
Job role: Analysts bridge the gap between the data scientists and the business analysts,
organizing and analyzing data to answer the questions the organization poses. They take the
technical analyses and turn them into qualitative action items.
Skills needed: Statistical and mathematical skills, programming skills (SAS, R, Python), plus
experience in data wrangling and data visualization.

Data Engineer
Job role: Data engineers focus on developing, deploying, managing, and optimizing the
organization’s data infrastructure and data pipelines. Engineers support data scientists by
helping to transfer and transform data for queries.
Skills needed: NoSQL databases (e.g., MongoDB, Cassandra DB), programming languages
such as Java and Scala, and frameworks (Apache Hadoop).
Data Science Applications
Data science has found its applications in almost every industry.
Through this blog, we bring to you 10 applications that build upon the concepts of Data
Science, exploring various domains such as the following:
1. Fraud and Risk Detection
2. Healthcare
3. Internet Search
4. Targeted Advertising
5. Website Recommendations
6. Advanced Image Recognition
7. Speech Recognition
8. Airline Route Planning
9. Gaming
10. Augmented Reality
1. Fraud and Risk Detection
The earliest applications of data science were in finance. Companies were fed up with the bad
debts and losses they suffered every year. However, they had a lot of data that used to get
collected during the initial paperwork while sanctioning loans, so they decided to bring in data
scientists to rescue them from those losses.
Over the years, banking companies learned to divide and conquer data via customer profiling,
past expenditures, and other essential variables to analyze the probabilities of risk and default.
Moreover, it also helped them push their banking products based on customers' purchasing
power.
2. Healthcare
The healthcare sector, in particular, receives great benefits from data science applications.
i. Medical Image Analysis
Procedures such as tumor detection, artery stenosis detection, and organ delineation employ a
variety of methods and frameworks, such as MapReduce, to find optimal parameters for tasks
like lung texture classification. They apply machine learning methods such as support vector
machines (SVMs), content-based medical image indexing, and wavelet analysis for solid
texture classification.
ii. Genetics & Genomics
Data Science applications also enable an advanced level of treatment personalization through
research in genetics and genomics. The goal is to understand the impact of the DNA on our
health and find individual biological connections between genetics, diseases, and drug
response. Data science techniques allow the integration of different kinds of data with genomic
data in disease research, providing a deeper understanding of how genetic factors influence
reactions to particular drugs and diseases. As soon as we acquire reliable personal genome data,
we will achieve a deeper understanding of the human DNA. The advanced genetic risk
prediction will be a major step towards more individual care.
iii. Drug Development
The drug discovery process is highly complicated and involves many disciplines. The greatest
ideas are often constrained by billions of tests and enormous expenditures of money and time.
On average, it takes twelve years to make an official submission.
Data science applications and machine learning algorithms simplify and shorten this process,
adding a perspective to each step, from the initial screening of drug compounds to the prediction
of the success rate based on biological factors. Such algorithms can forecast how a compound
will act in the body using advanced mathematical modeling and simulations instead of lab
experiments. The idea behind computational drug discovery is to build computer simulations
of biologically relevant networks, simplifying the prediction of future outcomes with high
accuracy.
iv. Virtual assistance for patients and customer support
Optimization of the clinical process builds upon the concept that for many cases it is not
actually necessary for patients to visit doctors in person. A mobile application can give a more
effective solution by bringing the doctor to the patient instead.
AI-powered mobile apps, usually chatbots, can provide basic healthcare support. You simply
describe your symptoms or ask questions, and you receive key information about your medical
condition, derived from a wide network linking symptoms to causes. Apps can remind you to
take your medicine on time and, if necessary, schedule an appointment with a doctor.
This approach promotes a healthy lifestyle by encouraging patients to make healthy decisions,
saves their time waiting in line for an appointment, and allows doctors to focus on more critical
cases.
3. Internet Search
Now, this is probably the first thing that strikes your mind when you think Data Science
Applications.
When we speak of search, we think 'Google', right? But there are many other search engines,
like Yahoo, Bing, Ask, AOL, and so on. All of these search engines (including Google) make
use of data science algorithms to deliver the best results for a searched query in a fraction of a
second. Consider the fact that Google processes more than 20 petabytes of data every day.
4. Targeted Advertising
If you thought Search would have been the biggest of all data science applications, here is a
challenger – the entire digital marketing spectrum. Starting from the display banners on various
websites to the digital billboards at the airports – almost all of them are decided by using data
science algorithms.
This is the reason why digital ads have been able to achieve a much higher CTR (Click-Through
Rate) than traditional advertisements: they can be targeted based on a user's past behavior.
It is also why you might see ads for Data Science training programs while I see an ad for
apparel in the same place at the same time.
5. Website Recommendations
Aren't we all used to the suggestions about similar products on Amazon? They not only help
you find relevant products among the billions on offer but also add a lot to the user experience.
Many companies have fervently used recommendation engines to promote their products in
accordance with users' interests and the relevance of the information. Internet giants like
Amazon, Twitter, Google Play, Netflix, LinkedIn, IMDb, and many more use this system to
improve the user experience. The recommendations are made based on a user's previous search
results.
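A toy sketch (invented ratings matrix) of one idea behind such engines: recommend the unseen item whose rating pattern is most similar, by cosine similarity, to an item the user already liked. Production recommenders are far more sophisticated.

import numpy as np

items = ["laptop", "mouse", "keyboard", "blender"]
# Rows are items, columns are users; each cell is an invented rating (0 = not rated).
ratings = np.array([
    [5, 4, 0, 1],   # laptop
    [4, 5, 1, 0],   # mouse
    [5, 4, 1, 1],   # keyboard
    [0, 1, 5, 4],   # blender
], dtype=float)

def cosine(a, b):
    # Cosine similarity: closer to 1.0 means the two rating patterns point the same way.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

liked = 0   # the user liked the laptop
candidates = [i for i in range(len(items)) if i != liked]
scores = [cosine(ratings[liked], ratings[i]) for i in candidates]

print(items[candidates[int(np.argmax(scores))]])   # the most similar item to recommend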
6. Advanced Image Recognition
You upload a photo with friends on Facebook and you start getting suggestions to tag your
friends. This automatic tag suggestion feature uses a face recognition algorithm.
In their latest update, Facebook has outlined the additional progress they’ve made in this area,
making specific note of their advances in image recognition accuracy and capacity.
“We’ve witnessed massive advances in image classification (what is in the image?) as well as
object detection (where are the objects?), but this is just the beginning of understanding the
most relevant visual content of any image or video. Recently we’ve been designing techniques
that identify and segment each and every object in an image, a key capability that will enable
entirely new applications.”
In addition, Google provides you with the option to search for images by uploading them. It
uses image recognition and provides related search results.
7. Speech Recognition
Some of the best examples of speech recognition products are Google Voice, Siri, Cortana, etc.
Using the speech recognition feature, even if you aren't in a position to type a message, your
life doesn't stop: simply speak the message and it will be converted to text. However, at times
you will notice that speech recognition doesn't perform accurately.
8. Airline Route Planning
The airline industry across the world is known to bear heavy losses. Except for a few airline
service providers, companies are struggling to maintain their occupancy ratios and operating
profits. The steep rise in air-fuel prices and the need to offer heavy discounts to customers have
made the situation worse. It wasn't long before airline companies started using data science to
identify strategic areas of improvement. Now, using data science, airline companies can:
• Predict flight delays
• Decide which class of airplanes to buy
• Decide whether to fly directly to the destination or take a halt in between (for example, a
flight can take a direct route from New Delhi to New York, or it can choose to halt in another
country on the way)
• Effectively drive customer loyalty programs
Southwest Airlines and Alaska Airlines are among the top companies that have embraced data
science to bring changes to their way of working.
9. Gaming
Games are now designed using machine learning algorithms that improve and upgrade
themselves as the player moves up to higher levels. In motion gaming, too, your opponent (the
computer) analyzes your previous moves and shapes its game accordingly. EA Sports, Zynga,
Sony, Nintendo, and Activision Blizzard have taken the gaming experience to the next level
using data science.
10. Augmented Reality
The final data science application, and the one that seems most exciting for the future, is
augmented reality.
Data science and virtual reality do have a relationship, considering that a VR headset combines
computing knowledge, algorithms, and data to provide you with the best viewing experience.
A small step towards this is the high-trending game Pokemon GO: the ability to walk around
and see Pokemon on walls, streets, and other things that aren't really there. The creators of this
game used data from Ingress, the previous app from the same company, to choose the locations
of the Pokemon and gyms.
