What Is Data Science? A Beginner's Guide To Data Science
ZaranTech
As the world entered the era of big data, the need to store that data grew with
it. Until around 2010, storage was the main challenge and concern for
enterprises, and the focus was on building frameworks and solutions to store
data. Now that Hadoop and other frameworks have largely solved the storage
problem, the focus has shifted to processing this data. Data Science is
the secret sauce here. Many of the ideas you see in Hollywood sci-fi movies
can be turned into reality with Data Science. Data Science is the future of
Artificial Intelligence. It is therefore important to understand what Data
Science is and how it can add value to your business.
What if you could understand the precise requirements of your customers
from existing data such as their past browsing history, purchase history, age
and income? No doubt you had all this data earlier too, but with today's
volume and variety of data you can train models more effectively and
recommend products to your customers with more precision. Wouldn't that be
amazing, since it would bring more business to your organization?
Let's take a different scenario to understand the role of Data Science in
decision making. What if your car had the intelligence to drive you home?
Self-driving cars collect live data from sensors, including radars, cameras and
lasers, to create a map of their surroundings. Based on this data, the car
decides when to speed up, when to slow down, when to overtake and where to
turn, using advanced machine learning algorithms.
Let's see how Data Science can be used in predictive analytics, taking weather
forecasting as an example. Data from ships, aircraft, radars and satellites can
be collected and analyzed to build models. These models not only forecast the
weather but also help predict natural calamities, so that you can take
appropriate measures beforehand and save many precious lives.
Let's have a look at the infographic below to see all the domains where Data
Science is making its mark.
Now that you have understood the need for Data Science, let's understand what
it is. Data Science is a blend of various tools, algorithms, and machine
learning principles with the goal of discovering hidden patterns in raw data.
How is this different from what statisticians have been doing for years?
As you can see from the above image, a Data Analyst usually explains what is
going on by processing the history of the data. A Data Scientist, on the other
hand, not only does exploratory analysis to discover insights from the data,
but also uses various advanced machine learning algorithms to predict the
occurrence of a particular event in the future. A Data Scientist will look at
the data from many angles, sometimes angles not known earlier.
So, Data Science is primarily used to make decisions and predictions by means
of predictive causal analytics, prescriptive analytics (predictive plus
decision science) and machine learning.
I am sure you have heard of Business Intelligence (BI) too. Data Science is
often confused with BI. I will state some concise and clear contrasts
between the two, which will help you get a better understanding. Let's
have a look.
That covers what Data Science is. Now let's understand the lifecycle of
Data Science.
You can use R for data cleaning, transformation, and visualization. This will
help you to spot the outliers and establish a relationship between the
variables. Once you have cleaned and prepared the data, it’s time to do
exploratory analytics on it. Let’s see how you can achieve that.
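
To make the idea of spotting outliers concrete, here is a minimal sketch. The article names R for this step; the sketch uses plain Python only as a stand-in. The helper name `iqr_outliers`, the 1.5 multiplier and the glucose readings are all assumptions made for illustration, not part of the original case study.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical glucose readings; 400 is a deliberately extreme value.
readings = [85, 90, 95, 100, 105, 110, 115, 120, 400]
outliers = iqr_outliers(readings)
print(outliers)
```

The interquartile-range rule is only one common convention. Whether a flagged value is a real error or a genuine extreme still needs a human decision.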
Phase 3—Model planning: Here, you will determine the methods and
techniques to draw the relationships between variables. These relationships will
set the base for the algorithms which you will implement in the next phase. You
will apply Exploratory Data Analysis (EDA) using various statistical formulas
and visualization tools.
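
One statistical formula often used at this stage to gauge the relationship between two variables is the Pearson correlation coefficient. The sketch below is a plain Python illustration. The function name `pearson` and the BMI and glucose values are hypothetical and chosen only to show the calculation.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical measurements for five patients.
bmi = [22.0, 25.0, 28.0, 31.0, 34.0]
glucose = [88, 97, 105, 118, 130]

r = pearson(bmi, glucose)
print(f"correlation between bmi and glucose: {r:.3f}")
```

A value close to 1 or -1 suggests a strong linear relationship, while a value near 0 suggests none. It says nothing about cause and effect.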
Now that you have insights into the nature of your data and have decided
which algorithms to use, the next stage is to apply them and build a model.
Phase 4—Model building: In this phase, you will develop datasets for training
and testing purposes. You will consider whether your existing tools will suffice
for running the models or whether you will need a more robust environment (such
as fast, parallel processing). You will analyze various learning techniques such
as classification, association and clustering to build the model.
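
Developing datasets for training and testing usually means holding back part of the data so the model can be checked on records it has not seen. Here is a minimal sketch in plain Python. The function name `train_test_split`, the 25% test share, the fixed seed and the sample rows are assumptions for illustration.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    rng = random.Random(seed)      # fixed seed keeps the split repeatable
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical (glucose, label) rows.
rows = [(85, "neg"), (89, "neg"), (110, "neg"), (120, "neg"),
        (137, "pos"), (148, "pos"), (160, "pos"), (183, "pos")]

train, test = train_test_split(rows)
print(len(train), "training rows,", len(test), "test rows")
```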
Now, I will take a case study to walk you through the various phases described
above.
Step 1:
First, we will collect the data based on the medical history of the patient as
discussed in Phase 1. You can refer to the sample data below.
Step 2:
Once we have the data, we need to clean and prepare it for analysis.
This data has a lot of inconsistencies, such as missing values, blank columns,
abrupt values and incorrect data formats, which need to be cleaned.
Here, we have organized the data into a single table under different
attributes, making it look more structured.
Let's have a look at the sample data below.
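
The cleaning described in this step can be sketched in a few lines of plain Python. Everything here is hypothetical: the records, the column names, the valid ranges and the choice to fill gaps with the column median are assumptions for illustration and do not reproduce the article's actual sample data or method.

```python
import statistics

def to_number(text):
    """Parse a cell; blank or non-numeric text counts as missing (None)."""
    try:
        return float(text)
    except ValueError:
        return None

def clean(records, valid_ranges):
    # Drop blank columns: keep only those with at least one non-empty cell.
    columns = [c for c in records[0] if any(r[c].strip() for r in records)]
    parsed = [{c: to_number(r[c]) for c in columns} for r in records]
    # Treat abrupt values outside the allowed range as missing.
    for row in parsed:
        for c, (low, high) in valid_ranges.items():
            if c in row and row[c] is not None and not low <= row[c] <= high:
                row[c] = None
    # Fill each missing cell with the median of the known values in its column.
    for c in columns:
        known = [row[c] for row in parsed if row[c] is not None]
        if not known:
            continue
        fill = statistics.median(known)
        for row in parsed:
            if row[c] is None:
                row[c] = fill
    return parsed

# Hypothetical raw records showing each kind of inconsistency.
raw = [
    {"npreg": "6",   "glucose": "148", "bp": "72",   "age": "50", "notes": ""},
    {"npreg": "one", "glucose": "85",  "bp": "",     "age": "31", "notes": ""},
    {"npreg": "8",   "glucose": "183", "bp": "6600", "age": "32", "notes": ""},
    {"npreg": "1",   "glucose": "89",  "bp": "66",   "age": "21", "notes": ""},
]
cleaned = clean(raw, valid_ranges={"bp": (0, 300)})
for row in cleaned:
    print(row)
```

Median filling is just one simple option. Depending on the data, dropping the row or asking the source for the correct value may be the better choice.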
Step 3:
First, we will load the data into the analytical sandbox and apply various
statistical functions on it. For example, R has functions such as describe
(provided by add-on packages such as Hmisc), which gives us the number of
missing values and unique values. We can also use the summary function, which
gives us statistical information such as the mean, median, range, min and max
values.
Then, we use visualization techniques such as histograms, line graphs and box
plots to get a fair idea of the distribution of the data.
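
For readers who do not use R, the same kind of summary can be approximated in plain Python. This is only a rough sketch of what such functions report, not a reimplementation of R's describe or summary. The function name `describe` is reused here for illustration, and the glucose column is hypothetical.

```python
import statistics

def describe(values):
    """Summarize a column in which None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "unique": len(set(present)),
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
    }

# Hypothetical glucose column with one missing reading.
glucose = [148, 85, None, 89, 137, 85]
stats = describe(glucose)
for name, value in stats.items():
    print(f"{name:>7}: {value}")
```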
Step 4:
Now, based on insights derived from the previous step, the best fit for this kind
of problem is the decision tree. Let's see how.
Here, the most important parameter is the glucose level, so it is our root
node. The current node and its value then determine the next important
parameter to consider. This goes on until we get a result of pos or neg:
pos means the tendency of having diabetes is positive, and neg means it is
negative.
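
One common way a decision tree decides which parameter is "most important" is to pick the split that leaves the purest groups, measured for example by Gini impurity. The sketch below assumes that criterion, which the article itself does not specify. The data, the feature names and the resulting threshold are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 'pos'/'neg' labels (0.0 means pure)."""
    if not labels:
        return 0.0
    p = labels.count("pos") / len(labels)
    return 2 * p * (1 - p)

def best_split(rows, feature):
    """Return (weighted_gini, threshold) of the best 'value <= threshold' split."""
    values = sorted({r[feature] for r in rows})
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [r["label"] for r in rows if r[feature] <= threshold]
        right = [r["label"] for r in rows if r[feature] > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if best is None or score < best[0]:
            best = (score, threshold)
    return best

# Hypothetical patient rows.
rows = [
    {"glucose": 85,  "age": 31, "label": "neg"},
    {"glucose": 89,  "age": 21, "label": "neg"},
    {"glucose": 110, "age": 45, "label": "neg"},
    {"glucose": 120, "age": 50, "label": "neg"},
    {"glucose": 137, "age": 33, "label": "pos"},
    {"glucose": 148, "age": 50, "label": "pos"},
    {"glucose": 160, "age": 29, "label": "pos"},
    {"glucose": 183, "age": 32, "label": "pos"},
]

# The root node is the feature whose best split has the lowest impurity.
root = min(["glucose", "age"], key=lambda f: best_split(rows, f)[0])
score, threshold = best_split(rows, root)
print(f"root node: {root} <= {threshold} (weighted Gini {score:.2f})")
```

A full tree would repeat the same search inside each branch until every leaf is labelled pos or neg.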
Step 5:
In this phase, we will run a small pilot project to check whether our results
are appropriate. We will also look for performance constraints, if any. If the
results are not accurate, we need to replan and rebuild the model.
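
A pilot check like this usually comes down to comparing predictions with known outcomes. The sketch below is a plain Python illustration under stated assumptions: the one-rule `predict` function stands in for a trained model, and the 128.5 threshold, the pilot samples and the 0.8 accuracy target are all hypothetical.

```python
def predict(glucose, threshold=128.5):
    """One-node stand-in for the decision tree: pos above the threshold."""
    return "pos" if glucose > threshold else "neg"

def evaluate(samples):
    """Count true/false positives and negatives and compute accuracy."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for glucose, actual in samples:
        if predict(glucose) == "pos":
            counts["tp" if actual == "pos" else "fp"] += 1
        else:
            counts["tn" if actual == "neg" else "fn"] += 1
    counts["accuracy"] = (counts["tp"] + counts["tn"]) / len(samples)
    return counts

# Hypothetical pilot samples with known outcomes.
pilot = [(95, "neg"), (140, "pos"), (150, "pos"),
         (125, "pos"), (130, "neg"), (100, "neg")]

report = evaluate(pilot)
REQUIRED_ACCURACY = 0.8   # hypothetical target agreed with stakeholders
needs_rebuild = report["accuracy"] < REQUIRED_ACCURACY
print(report, "rebuild:", needs_rebuild)
```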
Step 6:
Once we have executed the project successfully, we will share the output for
full deployment.
As you can see in the above image, you need to acquire various hard skills and
soft skills. You need to be good at statistics and mathematics to analyze and
visualize data. Needless to say, Machine Learning forms the heart of Data
Science, so you need to be good at it too. You also need a solid
understanding of the domain you are working in to understand the business
problems clearly. Your task does not end here. You should be capable of
implementing various algorithms, which requires good coding skills. Finally,
once you have made certain key decisions, it is important to communicate them
to the stakeholders. So, good communication will definitely add brownie points
to your skills.
In the end, it won't be wrong to say that the future belongs to Data
Scientists. It is predicted that by the end of the year 2018, there will be a
need for around one million Data Scientists. More and more data will provide
opportunities to drive key business decisions. It is soon going to change the
way we look at the world deluged with data around us. Therefore, a Data
Scientist should be highly skilled and motivated to solve the most complex
problems.
ZaranTech
ZaranTech is a US-based global IT training and consulting
company that provides focused individual and corporate e-learning
programs. Our senior trainers have more than eight years
of experience in the fast-paced world of Information Technology.