
UNIT – I: INTRODUCTION TO DATA SCIENCE

Overview of Data Science and Applications, common terminologies of Data Science, various roles within Data Science, essential stages of the Data Science life cycle, components of Data Science, and organizational challenges while building Data Science projects.

Introduction to Data Science with Python

As per various surveys, the data scientist role is becoming the most in-demand job of the 21st century due to the increasing demand for data science. Some have even called it "the hottest job title of the 21st century". Every organization is looking for candidates with knowledge of data science. The average salary range for a data scientist is approximately $95,000 to $165,000 per annum, and according to various studies, about 11.5 million jobs will be created by the year 2026.

Data science is the field that comprises everything related to cleaning, preparing, and analyzing unstructured, semi-structured, and structured data. It uses a combination of statistics, mathematics, programming, problem-solving, and data capture to extract insights and information from data.

Is Python necessary for Data Science?

Python is easy to learn and is the most widely used programming language in the world. Simplicity and versatility are its key features. R is also available for data science, but because of Python's simplicity and versatility, Python is the recommended language for data science.

Benefit of using Python for Data Science

Python is a popular language for data science because it is easy to learn, has a large and active community, offers powerful libraries for data analysis and visualization, and has excellent machine learning libraries. This community has created many useful libraries, including NumPy, Pandas, Matplotlib, and SciPy, which are widely used in data science. NumPy is a library for numerical computation, Pandas is a library for data manipulation and analysis, and Matplotlib is a library for data visualization (a short sketch using all three follows the list below).
In terms of application areas, data scientists prefer Python for the following:
1. Data Analysis
2. Data Visualizations
3. Machine Learning
4. Deep Learning
5. Image processing
6. Computer Vision
7. Natural Language Processing (NLP)
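
As a quick, minimal sketch of these libraries working together (the column names and values here are invented for illustration):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# NumPy: numerical computation on arrays
hours = np.array([2, 4, 6, 8, 10])
scores = hours * 8 + np.random.normal(0, 4, size=hours.size)

# Pandas: tabular data manipulation and analysis
df = pd.DataFrame({"study_hours": hours, "exam_score": scores})
print(df.describe())

# Matplotlib: data visualization
plt.scatter(df["study_hours"], df["exam_score"])
plt.xlabel("Study hours")
plt.ylabel("Exam score")
plt.title("Hours studied vs. exam score")
plt.show()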

Integrated Development Environments (IDEs)


Here are some that you can explore:
• Eclipse with PyDev Plugin
• Python Tools for Visual Studio (for Windows users)
• PyCharm
• Spyder
• Jupyter
• Komodo IDE
Overview of Data Science and Applications

What is Data Science (DS)?


Data science is the deep study of massive amounts of data, which involves extracting meaningful insights from raw, structured, and unstructured data using the scientific method, different technologies, and algorithms. It is a multidisciplinary field that uses tools and techniques to manipulate data so that you can find something new and meaningful. Data science is a blend of various fields such as probability, statistics, programming, analysis, and cloud computing, which are used to extract value from the data provided.

Data science is an interconnected field that involves the use of statistical and computational methods to extract insightful information and knowledge from data. It is simply the application of specific principles and analytic techniques to extract information from data for use in planning, strategy, decision making, and so on.
In short, we can say that data science is all about extracting meaningful insights from data.

Example:
Suppose we want to travel from station A to station B by car. We need to make some decisions, such as which route will get us to the destination fastest, which route will have no traffic jams, and which route will be the most cost-effective. All these decision factors act as input data, and we arrive at an appropriate answer from them; this analysis of data is called data analysis, which is a part of data science.
Need for Data Science:

In today's world, data has become so vast that approximately 2.5 quintillion bytes of data are generated every day, which has led to a data explosion. The amount of digital data in existence is growing at a rapid rate, doubling every two years, and changing the way we live. Researchers estimated that by 2020, 1.7 MB of data would be created every single second for every person on earth. Every company requires data to work, grow, and improve its business. This means we need the technical tools, algorithms, and models to clean, process, and understand the available data in its different forms for decision-making purposes.
Handling such a huge amount of data is a challenging task for every organization. To handle, process, and analyze it, we require complex, powerful, and efficient algorithms and technology, and that technology is data science. The following are some of the main reasons for using data science:
o With the help of data science, we can convert massive amounts of raw and unstructured data into meaningful insights.
o Data science is being adopted by various companies, whether big brands or startups. Google, Amazon, Netflix, etc., which handle huge amounts of data, use data science algorithms for a better customer experience.
o Data science is being used to automate transportation, such as creating self-driving cars, which are the future of transportation.
o Data science can help with predictions of many kinds, such as survey outcomes, elections, flight ticket confirmation, etc.

Applications of Data Science


Data science is used in every domain.
• Healthcare: In the healthcare sector, data science provides many benefits. It is being used for tumor detection, drug discovery, medical image analysis, virtual medical bots, etc. Healthcare industries use data science to build instruments that detect and help cure disease.
• Image recognition and speech recognition: Data science is widely used for image and speech recognition. A popular application is identifying patterns in images and finding objects within them. When you upload an image on Facebook and start getting suggestions to tag your friends, that automatic tagging suggestion uses an image recognition algorithm, which is part of data science. When you say something to "Ok Google", Siri, Cortana, etc., and the device responds to your voice, that is made possible by speech recognition algorithms.
• Internet search: Data science algorithms are used in products such as internet search engines to deliver the best results for search queries in less time. Google deals with more than 20 petabytes of data per day; one reason Google is such a successful search engine is that it uses data science. When we want to search for something on the internet, we use search engines such as Google, Yahoo, Bing, Ask, etc. All of these use data science to improve the search experience, returning results in a fraction of a second.
• Advertising: Data science algorithms are used in digital marketing, which includes banners on various websites, billboards, posts, etc. Data science helps find the right user to show a particular banner or advertisement to.
• Logistics: Logistics companies must ensure faster delivery of your order, so they use data science to find the best route to deliver it.
• Gaming world: In the gaming world, the use of machine learning algorithms is increasing day by day. EA Sports, Sony, and Nintendo widely use data science to enhance the user experience.
• Transport: Transport industries are also using data science to create self-driving cars. With self-driving cars, it will be easier to reduce the number of road accidents.
• Recommendation systems: Companies such as Amazon, Netflix, Google Play, etc. use data science to generate recommendations and create a better user experience. For example, when you search for something on Amazon and start getting suggestions for similar products, that is data science at work.
• Risk detection: Finance industries have always faced fraud and the risk of losses, but with the help of data science these risks can be reduced. Most finance companies are looking for data scientists to avoid risk and losses while increasing customer satisfaction.
Common terminologies of Data Science
Analytics, data science, Machine Learning, Predictive Modelling, Forecasting, Big Data,
Business Intelligence, MIS / Reporting, etc.
Data Science = Data + Science
The field of bringing out insights from data using scientific techniques is called Data Science.
Spectrum of Business Analytics

[Figure: the spectrum of business analytics, ordered by the value each technique adds to an organization]
Key Terms and differences
Forecasting Vs. Predictive Modeling Vs. Machine Learning

Forecasting is a process of predicting or estimating the future based on past and present data.
Examples:
• How many passengers can we expect in a given flight?
• How many customer calls can we expect in the next hour?

Predictive modeling is used to make more granular predictions, such as "Who are the customers likely to buy a product next month?", and then act accordingly.

Example: Identifying right customer and taking actions accordingly

Machine Learning: A method of teaching machines to learn and improve their predictions or behavior, on their own, based on data.
Examples:
• Create an algorithm which can power Google Search.
• Amazon recommendation system.
Various roles within Data Science

Roles of a Data Scientist

These are the major roles in data science. As the Venn diagram of these roles shows, a person has the opportunity to become a data engineer, data scientist, data analyst, etc., which means there are enormous choices and opportunities.
Career Opportunities in Data Science
• Data Scientist: The data scientist develops econometric and statistical models for various problems such as projection, classification, clustering, and pattern analysis.
• Data Architect: The data architect plays an important role in developing innovative strategies to understand the business's consumer trends and management, as well as ways to solve business problems, for instance, the optimization of product fulfilment and overall profit.
• Data Analyst: The data analyst supports the construction of the foundation for futuristic, planned, and ongoing data analytics projects.
• Machine Learning Engineer: Machine learning engineers build data funnels and deliver solutions for complex software.
• Data Engineer: Data engineers process real-time or stored data and create and maintain data pipelines that form an interconnected ecosystem within a company.

Types of Data Science Jobs


If you learn data science, you get the opportunity to pursue a variety of exciting job roles in this domain. The main job roles are given below:

1. Data Analyst:
A data analyst is an individual who mines huge amounts of data, models the data, and looks for patterns, relationships, trends, and so on. At the end of the day, they produce visualizations and reports to support decision making and problem solving.

Skill required: To become a data analyst, you need a good background in mathematics, business intelligence, and data mining, plus a basic knowledge of statistics. You should also be familiar with computer languages and tools such as MATLAB, Python, SQL, Hive, Pig, Excel, SAS, R, JS, and Spark.

2. Machine Learning Expert:


A machine learning expert works with the various machine learning algorithms used in data science, such as regression, clustering, classification, decision trees, random forests, etc.

Skill required: Computer programming languages such as Python, C++, R, Java, and Hadoop. You should also have an understanding of various algorithms, problem-solving and analytical skills, and probability and statistics.

3. Data Engineer:
A data engineer works with massive amounts of data and is responsible for building and maintaining the data architecture of a data science project. Data engineers also create the data set processes used in modeling, mining, acquisition, and verification.
Skill required: A data engineer must have in-depth knowledge of SQL, MongoDB, Cassandra, HBase, Apache Spark, Hive, and MapReduce, along with knowledge of languages such as Python, C/C++, Java, and Perl.
4. Data Scientist:
A data scientist is a professional who works with an enormous amount of data to come up with
compelling business insights through the deployment of various tools, techniques, methodologies,
algorithms, etc.
Skill required: To become a data scientist, one should have technical language skills such as R, SAS, SQL, Python, Hadoop platforms, Hive, Pig, Apache Spark, and MATLAB; good knowledge of semi-structured formats such as JSON, XML, and HTML; and knowledge of how to work with unstructured data. Data scientists must also have an understanding of statistics, mathematics, and visualization, as well as communication skills.

Who is a data scientist? "A data scientist can predict the future with the necessary data."
Data scientists use programming tools such as Python, R, SAS, Java, Perl, and C/C++ to extract knowledge from prepared data. To extract this information, they employ various fit-for-purpose models based on machine learning algorithms, statistics, and mathematical methods.
Data scientists are the experts who can use various statistical tools and machine learning algorithms
to understand and analyze the data.

A data scientist is a professional responsible for collecting, analyzing, and interpreting extremely
large amounts of data. The data science role is entirely different from several traditional technical
roles, including mathematician, scientist, statistician, and computer professional. This job requires
the use of advanced analytics technologies, including machine learning and predictive modeling.
A data scientist requires large amounts of data to develop hypotheses, make inferences, and analyze
customer and market trends. Basic responsibilities include gathering and analyzing data, using
various types of analytics and reporting tools to detect patterns, trends, and relationships in data sets.
In business, data scientists typically work in teams to mine for information that can be used to predict
customer behavior and identify new revenue opportunities. In many organizations, data scientists
are also responsible for setting best practices for collecting data, using analysis tools, and interpreting
data.
The demand for data science skills has grown significantly over the years, as companies look to
glean useful information from big data, the voluminous amounts of structured, unstructured, and
semi-structured data that a large enterprise or IoT produces and collects.

To get a clearer understanding of DS, it is necessary to know the life cycle of DS.


Data Science Lifecycle

The life cycle of data science is explained in the diagram below.

The main phases of the data science life cycle are given below:

1. Discovery: The first phase is discovery, which involves asking the right questions. When you start any data science project, you need to determine the basic requirements, priorities, project budget, specifications, etc. In this phase, we determine all the requirements of the project, such as the number of people, technology, time, data, and the end goal, and then we can frame the business problem at a first hypothesis level so that we do not get stuck in the middle of the project.

2. Data preparation: Data preparation is also known as data munging. In this phase, we need to get the required data model and data sets to perform data analytics for the whole project. We need to perform ETL (Extract, Transform, Load) on the available data to keep it ready for the next stage. In this phase, we need to perform the following tasks:
o Data cleaning
o Data reduction
o Data integration
o Data transformation
After performing all the above tasks, we can easily use this data for our further processes, as the sketch below illustrates.
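
A minimal pandas sketch of these four tasks, using an invented customer table (the column names and values are assumptions for illustration):

import pandas as pd

# Hypothetical raw data with typical problems: duplicates and missing values
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": [34, None, None, 29, 41],
    "country": ["IN", "US", "US", None, "UK"],
    "spend": [120.0, 80.5, 80.5, 60.0, None],
})

# Data cleaning: drop duplicate rows and fill missing values
clean = raw.drop_duplicates().copy()
clean["age"] = clean["age"].fillna(clean["age"].median())
clean["spend"] = clean["spend"].fillna(0.0)

# Data reduction: keep only the columns the analysis needs
reduced = clean[["customer_id", "age", "spend"]].copy()

# Data transformation: scale spend to the [0, 1] range
reduced["spend_scaled"] = reduced["spend"] / reduced["spend"].max()
print(reduced)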

3. Model Planning: In this phase, we determine the methods and techniques for establishing the relationships between input variables. We apply exploratory data analysis (EDA), using various statistical formulas and visualization tools, to understand the relations between variables and to see what the data can tell us. There are various tools to perform model planning; a few are open platforms whereas others are paid.
4. Model building: In this phase, the process of model building starts. We create data sets for training and testing purposes, and we apply techniques such as association, classification, and clustering to build the model. A set of algorithms is applied to the prepared data to interpret patterns and predict future trends. With the help of the existing data sets, we train the model and make predictions. Some common model building tools are SAS Enterprise Miner, WEKA, SPSS Modeler, and MATLAB.

5. Communicate results: In this penultimate phase, we interpret the results obtained and compare them with the requirements defined in the initial stages. If they match, we are on the right track and the goal is about 90% achieved. We then communicate the findings and final results to the business team.

6. Operationalize: In this final phase, we deliver the final reports of the project, along with briefings, code, and technical documents. This phase gives a clear overview of the complete project's performance and other components on a small scale before full deployment. If the goal is achieved, the model is put into operation in real life and tested; if the desired output is achieved, the goal is achieved one hundred percent. It is also necessary to maintain the data model for future reference.

Essential Stages of Data Science Life Cycle

The stages in a data science process are:

1. Data Storage
2. Exploratory Data Analysis
3. Data Modeling
4. Data Visualization
A. Data Storage
A typical successful company can generate millions if not billions of data records every single day. All this data has to be stored somewhere secure and easy to access, where a data scientist can acquire the necessary data whenever they want. This place is called a database. A database needs to be regularly monitored to ensure the smooth flow and storage of data.
SQL is a very handy tool for working with this data. SQL stands for 'Structured Query Language' and is a programming language mainly used to handle relational databases. It is used for database creation and deletion, fetching rows, modifying rows, etc.
What can SQL do? With SQL you can:
1. Execute queries against a database
2. Perform CRUD operations:
CREATE → new databases, new tables in a database, records in a database
READ → retrieve data from a database
UPDATE → update records in a database
DELETE → delete records from a database
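
A minimal sketch of these CRUD operations, using Python's built-in sqlite3 module and a hypothetical employees table:

import sqlite3

conn = sqlite3.connect(":memory:")  # an in-memory database, just for the example
cur = conn.cursor()

# CREATE: a new table and a record
cur.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, role TEXT)")
cur.execute("INSERT INTO employees (name, role) VALUES (?, ?)", ("Asha", "Analyst"))

# READ: retrieve data from the table
print(cur.execute("SELECT * FROM employees").fetchall())

# UPDATE: modify an existing record
cur.execute("UPDATE employees SET role = ? WHERE name = ?", ("Data Scientist", "Asha"))

# DELETE: remove a record
cur.execute("DELETE FROM employees WHERE name = ?", ("Asha",))

conn.commit()
conn.close()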

B. Exploratory Data Analysis


EDA is an approach to analyzing datasets to summarize their main characteristics. In the long run, it really helps a data scientist become familiar with the data they are working on. Gathering initial insights helps you understand the problem better and ask the necessary questions, which you can then answer with data science.
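
A minimal first-look EDA sketch in pandas (the file name is a placeholder for whatever dataset you are exploring):

import pandas as pd

df = pd.read_csv("employees.csv")  # hypothetical file name

print(df.shape)           # number of rows and columns
df.info()                 # column types and non-null counts
print(df.describe())      # summary statistics for numeric columns
print(df.isnull().sum())  # missing values per column
print(df.head())          # a first look at the records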
C. Data Modeling
The next step is to create a predictive equation, commonly referred to as a model, which predicts what 'might happen' or 'might come' in the future using the previous data. This is usually done with machine learning, one of the advanced topics of data science. Anyone can do machine learning at a basic level, but if you want to become a robust and badass data scientist in the future, I recommend you first learn the math behind machine learning. The topics usually required are linear algebra, multivariate calculus, and statistics.
D. Data Visualization
Python provides really nice packages for making basic as well as advanced visualizations. (A package is a third-party bundle of scripts that provides much additional functionality; the existence of these packages is one of the reasons Python is so widely used these days.) Visualizations are really helpful in conveying the story the data reveals. The two most popular packages for data visualization are Matplotlib and Seaborn. Visualizations include pie charts, bar graphs, histograms, etc., which provide a pictorial representation of data that helps both non-technical and technical people understand important insights.
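
As a minimal sketch, here is how the two packages are typically used together (the "tips" dataset is a small example dataset bundled with Seaborn):

import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")  # Seaborn's example dataset

# Histogram of the bill amounts
sns.histplot(tips["total_bill"], bins=20)
plt.title("Distribution of total bill")
plt.show()

# Bar graph of the average tip by day
sns.barplot(data=tips, x="day", y="tip")
plt.title("Average tip by day")
plt.show()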

Overview of Lifecycle of Data Science Project


Understanding the life cycle of a data science project through a problem
The life cycle of a project begins with understanding the problem statement. The problem statement (source) for this project is: build a web-based application with a machine learning model at the backend that predicts the burnout rate of company employees based on various work-life factors such as working hours, work-from-home availability, mental fatigue score, and the like.

Case Study: Happy and healthy employees are undoubtedly more efficient at work and help the company thrive. However, the scenario in most companies changed with the pandemic. Since the shift to work from home, over 69% of employees have been showing burnout symptoms (survey source: Monster). The burnout rate is indeed alarming. Many companies are taking steps to ensure their employees stay mentally healthy. As a solution, we'll build a web app that companies can use to monitor the burnout of their employees. The employees themselves can also use it to keep a check on their own burnout rate (there is little time to assess mental health in a fast work life).

Gathering Relevant Data


There are many libraries in Python for scraping data, such as Beautiful Soup and Selenium. There are also web scraping tools such as ParseHub, Scrapy, and Octoparse that make this less time-consuming. Web scraping is a crucial part of a data science project because the life cycle depends on the quality and relevance of the data.
In this project, the dataset has been taken from Kaggle (https://www.kaggle.com/blurredmachine/are-your-employees-burning-out). Have a look at the data before reading further.
Dataset
The following are the data attributes and their description –
• Employee ID: The unique ID allocated by the company to each employee.
• Date of Joining: The date when the employee had joined the company.
• Gender: The gender of the employee.
• Company Type: The type of company where the employee is working in (Service/Product).
• WFH Setup Available: Whether a work-from-home facility is available to the employee (Yes/No).
• Designation: The designation of the employee in his/her organization. In range – [0.0, 5.0], 0.0 is the
lowest designation and 5.0 is the highest.
• Resource Allocation: The number of resources allocated to the employee for work, interpreted as the number of working hours. In the range [1.0, 10.0] (higher means more resources).
• Mental Fatigue Score: How mentally tired the employee is during working hours, in the range [0.0, 10.0], where 0.0 means no fatigue and 10.0 means complete fatigue.
• Burn Rate: The target value for each employee, giving the rate of burnout during working hours in the range [0.0, 1.0], where a higher value means more burnout.
A few important notes about the data:
1. The difference between stress and burnout is that burnout is a different state of mind. Under stress, you still manage to cope with the pressures, but once burnout takes hold, you are out of gas and have given up all hope of surmounting your deterrents.
2. When you are suffering from burnout, you feel more than just mentally fatigued.

Data Preparation and EDA


After collecting the data, data preparation comes into play. It involves cleaning and organizing the data, which is known to take up more than 80% of a data scientist's work. Real-world data is raw and can be full of duplicates, missing values, and wrong information, hence the data needs to be cleaned.
Once the data has been organized, we extract the information enfolded in the data and summarize its
main characteristics through exploratory data analysis. EDA is an important stage for a well-defined
data science project. It is performed before the statistical or machine learning modeling phase.

Feature Engineering
The process of Feature Engineering involves the extraction of features from the raw data based on its
insights. Moreover, feature engineering also involves feature selection, feature transformation, and
feature construction to prepare input data that best fits the machine learning algorithms. Feature
Engineering influences the results of the model directly and hence it’s a crucial part of data science.
Let's use Date of Joining to create a new feature, Days Spent, that captures how many days the employee has worked at the company to date (a sketch follows below). The Burn Rate of an employee who has worked for years will perhaps be much higher than that of a newly joined employee.
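
A minimal pandas sketch of deriving Days Spent (the CSV file name is a placeholder, and using the latest joining date in the data as the reference date is an assumption made for illustration):

import pandas as pd

df = pd.read_csv("train.csv")  # hypothetical name for the downloaded dataset

# Parse the joining date and derive the new feature
df["Date of Joining"] = pd.to_datetime(df["Date of Joining"])
reference = df["Date of Joining"].max()  # treat the latest joining date as "today"
df["Days Spent"] = (reference - df["Date of Joining"]).dt.days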

Model Building and Evaluation


As described in the model-building phase of the lifecycle, in this stage we create datasets for training and testing, apply suitable techniques to build the model, and evaluate how well it interprets patterns and predicts the target.

Let's compare the performance of different ensemble techniques on the data: 1. XGBoost, 2. AdaBoost, 3. Random Forest. All three algorithms give pretty good results; a sketch of such a comparison follows.
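
A minimal scikit-learn sketch of this comparison, with synthetic data standing in for the prepared burnout features (GradientBoostingRegressor stands in here for XGBoost, which needs the separate xgboost package):

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import (RandomForestRegressor, AdaBoostRegressor,
                              GradientBoostingRegressor)
from sklearn.metrics import r2_score

# Synthetic stand-in for the prepared feature matrix X and Burn Rate target y
X, y = make_regression(n_samples=500, n_features=6, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

models = {
    "RandomForest": RandomForestRegressor(random_state=42),
    "AdaBoost": AdaBoostRegressor(random_state=42),
    "GradientBoosting": GradientBoostingRegressor(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "R^2:", r2_score(y_test, model.predict(X_test)))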
Now our aim is to build a web app that takes input information from the user and returns a prediction of the Burn Rate. To build it, we'll use Flask, a Python web framework for building applications. After building the app, we'll deploy it and push the project to GitHub.
Note: other platforms where we can deploy ML models include Amazon AWS, Microsoft Azure, Google Cloud, etc.
The input information that we would collect from the user will be the features on which our Burn Rate
predictive model has been trained –
• Designation of the user's role in the company, in the range [0 – 5]: 5 is the highest designation and 0 is the lowest.
• The number of working hours.
• Mental Fatigue Score in the range [0 – 10]: how fatigued/tired the user usually feels during working hours.
• Gender: Male/Female
• Type of company: Service/Product
• Work from home: Yes/No
Burnout Rate Prediction web app
Link of the web app: https://burnout-rate-prediction-api.herokuapp.com
It can be accessed from anywhere and be used to keep a check on the mental health of the employees.
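
A minimal Flask sketch of the prediction endpoint (the model file name, feature encoding, and field names are assumptions for illustration, not the exact code of the deployed app):

import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)

with open("burnout_model.pkl", "rb") as f:  # hypothetical pickled model from training
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    data = request.get_json()
    features = [[
        data["designation"],      # 0-5
        data["working_hours"],    # resource allocation, 1-10
        data["mental_fatigue"],   # 0-10
        data["gender"],           # encoded 0/1
        data["company_type"],     # encoded 0/1
        data["wfh_available"],    # encoded 0/1
    ]]
    burn_rate = model.predict(features)[0]
    return jsonify({"burn_rate": float(burn_rate)})

if __name__ == "__main__":
    app.run(debug=True)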

5 Basic Components of Data Science

Data science consists of many algorithms, theories, components, etc. Five basic components of data science are discussed here.
1. Data
Data is a collection of factual information based on numbers, words, observations, and measurements, which can be utilized for calculation, discussion, and reasoning. The raw dataset is the basic foundation of data science, and it may be of different kinds: structured data (tabular), unstructured data (pictures, recordings, messages, PDF documents, and so on), and semi-structured data.
2. Big Data
Big data refers to enormously large data sets, characterized by the various V's: volume, variety, velocity, etc. Facebook is an example.

Data is often compared to crude petroleum, a valuable raw material: just as refined oil is separated from unrefined petroleum, by applying data science, scientists can extract various kinds of information from raw data.

The diverse tools utilized by data scientists to process big data include Hadoop, Spark, R, Java, Pig, and many more.

3. Machine Learning
Machine learning is the part of data science that enables systems to process datasets autonomously, without human interference, by utilizing various algorithms that work on massive volumes of data generated and extracted from numerous sources.
It makes predictions, analyzes patterns, and gives recommendations. Machine learning is frequently used in fraud detection and client retention.
Facebook is a good example of a machine learning implementation: its algorithms gather the behavioral information of every user on the platform and recommend appropriate articles, multimedia files, and much more according to each user's interests.
Machine learning is also a part of artificial intelligence, in which the requisite information is obtained using various algorithms and techniques, such as supervised and unsupervised machine learning algorithms. A machine learning professional must have basic knowledge of statistics and probability, data evaluation, and the technical skills of programming languages.

Three types of Machine Learning


a) Supervised Machine Learning: A labeled dataset is used in supervised machine learning. Here, you have input variables (X) and output variables (Y), and you apply an appropriate algorithm to learn the mapping function from input to output:
Y = f(X)
Supervised machine learning can be categorized into the following (a sketch follows this list):
Classification, where the output variable is a category, like black or white, plus or minus.
Naïve Bayes, Support Vector Machines, and Decision Trees are the most popular supervised machine learning algorithms for classification.
Regression, where the output variable is a real value, like weight or dollars. Linear regression is used for regression problems.
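
A minimal scikit-learn sketch of supervised classification, learning Y = f(X) from the labeled iris dataset with a decision tree:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Labeled data: X holds the inputs, y the known output categories
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Learn the mapping from input to output
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))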
b) Unsupervised Machine Learning: In this type of machine learning, unlabeled datasets are used. Here, you have only input variables (X) and no output variables; therefore, an algorithm is used to discover the inherent grouping in the input data.
Unsupervised machine learning can be categorized into the following (a sketch follows this list):
Clustering, where you find the inherent groupings, like grouping clients by purchasing behavior.
K-means clustering, hierarchical clustering, and density-based spatial clustering are the most popular clustering algorithms.
Association, where you find rules that describe large portions of your data.
The Apriori algorithm is used for market basket analysis.
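
A minimal scikit-learn sketch of clustering, discovering groupings in unlabeled data with k-means:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Unlabeled data: only input variables X, no output labels
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Discover the inherent grouping
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)

print(labels[:10])          # cluster assignments of the first ten points
print(km.cluster_centers_)  # the discovered group centres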
c) Reinforcement Learning: Reinforcement learning is different from supervised learning; it is about taking an appropriate action in a particular situation to maximize the reward.
In supervised learning there are input as well as output variables, so the model is trained with the correct responses; in the absence of such a training dataset, a reinforcement agent learns from its own experience and performs the given job efficiently.
In reinforcement learning, the input is an initial state, and there can be many possible outputs because there is a range of solutions to a specific problem; the optimal solution is chosen based on the maximum reward.

4. Statistics and Probability


Data is manipulated to extract information out of it. The mathematical foundation of data science is statistics and probability: without a reasonable knowledge of statistics and probability, there is a high chance of misinterpreting the data and reaching incorrect conclusions. This is why statistics and probability play an essential role in data science.

5. Programming languages (Python, R)


Generally, data organization and analysis are done through computer programming; in data science, the two most prominent programming languages are Python and R.
Python
Python is a high-level programming language that provides a large standard library. It is the most popular language, and most data scientists love it. It is extensible and offers free data analysis libraries. The best features of Python are its dynamic typing, functional, object-oriented, and procedural styles, and automatic memory management.
R
R is a very popular programming language among data scientists and can be used on Windows, UNIX platforms, and the Mac operating system. The best feature of the R language is data visualization, which would be tougher in Python, but R is less beginner-friendly than Python. The language is used for social analysis of post data: Twitter uses it for data visualization and semantic clustering, and Google uses it to evaluate advertising effectiveness and make economic predictions.
Java
Java is an object-oriented programming language that provides a large number of tools and libraries. It is simple, portable, secure, platform independent, object oriented, and multi-threaded, which is why it is suitable for data science and machine learning. Java 8 with lambdas, and Scala, provide better support for data science.
NoSQL
Typically, SQL is used to handle structured data from a relational database management system through programming, but sometimes you need to handle unstructured data with no specific schema, for which you need NoSQL. NoSQL ensures improved performance when storing huge amounts of data (a sketch follows below).
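
A minimal sketch of schema-less storage using MongoDB through the pymongo package (this assumes a MongoDB server running locally; the database and field names are invented):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["company"]

# Documents need no fixed schema: each record can carry different fields
db.employees.insert_one({"name": "Asha", "role": "Data Scientist",
                         "skills": ["Python", "SQL"]})
db.employees.insert_one({"name": "Ravi", "projects": 3})

print(db.employees.find_one({"name": "Asha"}))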

Data Science Components

The main components of Data Science are given below:


1. Statistics: Statistics is one of the most important components of data science. Statistics is a way to collect and analyze numerical data in large amounts and find meaningful insights in it.
2. Domain expertise: Domain expertise binds data science together. It means specialized knowledge or skills in a particular area, and in data science there are various areas for which we need domain experts.
3. Data engineering: Data engineering is the part of data science that involves acquiring, storing, retrieving, and transforming data. It also involves adding metadata (data about data) to the data.
4. Visualization: Data visualization means representing data in a visual context so that people can easily understand its significance. It makes huge amounts of data easy to take in at a glance.
5. Advanced computing: Advanced computing does the heavy lifting of data science. It involves designing, writing, debugging, and maintaining the source code of computer programs.
6. Mathematics: Mathematics is a critical part of data science. It involves the study of quantity, structure, space, and change. For a data scientist, good knowledge of mathematics is essential.
7. Machine learning: Machine learning is the backbone of data science. It is all about training a machine so that it can act like a human brain. In data science, we use various machine learning algorithms to solve problems.
Tools for Data Science
Following are some tools required for data science:
o Data Analysis tools: R, Python, Statistics, SAS, Jupyter, R Studio, MATLAB, Excel,
RapidMiner.
o Data Warehousing: ETL, SQL, Hadoop, Informatica/Talend, AWS Redshift
o Data Visualization tools: R, Jupyter, Tableau, Cognos.
o Machine learning tools: Spark, Mahout, Azure ML studio.

Machine learning in Data Science


To become a data scientist, one should also be aware of machine learning and its algorithms, as various machine learning algorithms are broadly used in data science. The following are some machine learning algorithms used in data science:
o Regression
o Decision tree
o Clustering
o Principal component analysis
o Support vector machines
o Naive Bayes
o Artificial neural network
o Apriori
Organizational challenges while building Data Science projects

5 Challenges in a Data-Driven Project and How to Overcome Them

1. Data Quality
The process of discovering data is a crucial and fundamental task in a data-driven project. Approaches to data quality can be defined based on certain requirements, such as user-centred and other organizational frameworks.

How to Avoid
Methods such as data profiling and data exploration help analysts investigate the quality of datasets as well as the implications of their use. The data quality cycle must be followed in order to establish best practices for improving and ensuring high data quality.

2. Data Integration
In general, the method of combining data from various sources and storing it together to get a unified view is known as data integration. An organization with inconsistent data is likely to have data integration issues.

How to Avoid
There are several data integration platforms, such as Talend, Adeptia, Actian, and QlikView, which can be used to solve complex data integration issues.
These tools provide data integration features such as automating and orchestrating transformations, building extensible frameworks, and automating query performance optimization. At a smaller scale, even a simple join gives a unified view, as sketched below.
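
A minimal pandas sketch of data integration (the two source tables here are invented for illustration):

import pandas as pd

# Hypothetical data from two different sources
crm = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Asha", "Ravi", "Meera"]})
billing = pd.DataFrame({"customer_id": [1, 2, 4], "amount": [120.0, 80.5, 60.0]})

# Combine into a unified view; an outer join keeps customers from either source
unified = crm.merge(billing, on="customer_id", how="outer")
print(unified)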

3. Dirty Data
Data that contains inaccurate information is called dirty data. Removing all dirty data from a dataset is virtually impossible, so, depending on the severity of the errors, strategies for working with dirty data need to be implemented.

There are basically six types of dirty data, as listed below:

• Inaccurate data: The data can be technically correct but inaccurate for the organisation.
• Incorrect data: Incorrect data occurs when field values are created outside of the valid range of values.
• Duplicate data: Duplicate data may occur for reasons such as repeated submissions, improper data joining, etc.
• Inconsistent data: Data redundancy is one of the main causes of inconsistent data.
• Incomplete data: Data with missing values.
• Business rule violations: This type of data violates the business rules of an organization.

How to Avoid
This challenge can be overcome when organizations hire data management experts to cleanse, validate, replace, or delete raw and unstructured data. There are also data cleansing (data scrubbing) tools, such as TIBCO Clarity, available on the market to clean dirty data. A small sketch of basic cleaning follows.
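
A minimal pandas sketch handling three of the dirty-data types above, on an invented table (the valid age range is an assumption for the example):

import pandas as pd

# Hypothetical dirty data: a duplicate row, an out-of-range value, a missing value
df = pd.DataFrame({
    "employee_id": [101, 102, 102, 103],
    "age": [29, 34, 34, -5],        # -5 is incorrect data (outside the valid range)
    "salary": [50000, None, None, 45000],
})

df = df.drop_duplicates()                                  # duplicate data
df = df[df["age"].between(18, 70)]                         # incorrect data
df["salary"] = df["salary"].fillna(df["salary"].median())  # incomplete data
print(df)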
4. Data Uncertainty

Reasons for data uncertainty range from measurement errors to processing errors. Known and unknown errors, as well as uncertainties, should be expected when using real-world data. There are five common types of uncertainty, as listed below:
• Measurement Precision: Approximation leads to uncertainty.
• Predictions: It can be projections of future events, which may or may not happen.
• Inconsistency: Inconsistency between experts in a field or across datasets is an indication of
uncertainty.
• Incompleteness: Incompleteness in datasets including missing data or data known to be
erroneous also causes uncertainty.
• Credibility: The credibility of the data, or of its source, is another type of uncertainty.

How to Avoid

There are powerful uncertainty quantification and analytics software tools, such as SmartUQ and UQLab, which are used to reduce the time, expense, and uncertainty associated with simulating, testing, and analyzing complex systems.

5. Data Transformation

Raw data from various sources most often does not work well together, and thus it needs to be cleaned and normalized. Data transformation is the method of converting data from one format to another in order to gain meaningful insights from it. Data transformation is often carried out as ETL (Extract, Transform, Load), which converts a raw data source into a validated, clean form for gaining useful insights. Although the whole of the data can be transformed into a usable form, some things can still go wrong in an ETL project, such as an increase in data velocity or the time cost of fixing broken data connections.

How to Avoid

There are various ETL tools, such as KETL and Jedox, which can be used to extract data and store it in the proper format for analysis. At its core, an ETL step can be as simple as the sketch below.
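
A minimal pandas sketch of one ETL step (the file and column names, and the conversion rate, are invented for illustration):

import pandas as pd

# Extract: read raw data from a source
raw = pd.read_csv("orders_raw.csv")  # hypothetical file name

# Transform: clean the data and convert it to the target format
raw["order_date"] = pd.to_datetime(raw["order_date"], errors="coerce")
raw = raw.dropna(subset=["order_date"])
raw["amount_usd"] = raw["amount_inr"] / 83.0  # example currency conversion rate

# Load: store the validated, clean form for analysis
raw.to_csv("orders_clean.csv", index=False)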

Bottom Line

With emerging technologies, data-driven projects have become fundamental to an organisation's path to success. Data is a valuable asset that comes to an organisation in various shapes and sizes. The road to a successful data-driven project is to overcome these challenges as far as possible, and there are numerous tools available on the market to extract valuable patterns from unstructured data.
