UNIT - I Intro To DS
Overview of Data Science and Applications, Common terminologies of Data Science, Various roles within
Data Science, Essential Stages of Data Science Life Cycle, Components of Data Science,
Organizational challenges while building Data Science projects.
As per various surveys, the data scientist role is becoming one of the most in-demand jobs of the 21st
century owing to the growing demand for data science. Some people have even called it "the hottest job
title of the 21st century". Every organization is looking for candidates with knowledge of data science.
The average salary for a data scientist is approximately $95,000 to $165,000 per annum, and as per
different studies, about 11.5 million data science jobs will be created by the year 2026.
Data science is the field that comprises everything related to cleaning, preparing, and analyzing
unstructured, semi-structured, and structured data. This field of science uses a combination of statistics,
mathematics, programming, problem-solving, and data capture to extract insights and information
from data.
Python is easy to learn and one of the most widely used programming languages in the world. Simplicity
and versatility are its key features. R is also available for data science, but because of Python's
simplicity and versatility, Python is the recommended language for data science.
Python is a popular language for data science because it is easy to learn, has a large and active
community, offers powerful libraries for data analysis and visualization, and has excellent machine-
learning libraries. This community has created many useful libraries, including NumPy, Pandas,
Matplotlib, and SciPy, which are widely used in data science. NumPy is a library for numerical
computation, Pandas is a library for data manipulation and analysis, and Matplotlib is a library for
data visualization.
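For illustration, a minimal sketch (with made-up monthly sales figures) of how these three libraries are typically used together:

# Minimal sketch with made-up sales figures, assuming NumPy, pandas and Matplotlib are installed.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# NumPy: numerical computation on an array of monthly sales
sales = np.array([120, 135, 160, 150, 175, 190])
print("Mean sales:", sales.mean())

# pandas: tabular data manipulation and analysis
df = pd.DataFrame({"month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun"], "sales": sales})
print(df.describe())

# Matplotlib: data visualization
plt.plot(df["month"], df["sales"], marker="o")
plt.title("Monthly sales")
plt.xlabel("Month")
plt.ylabel("Sales")
plt.show()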
In terms of application areas, Data scientists prefer Python for the following modules:
1. Data Analysis
2. Data Visualizations
3. Machine Learning
4. Deep Learning
5. Image processing
6. Computer Vision
7. Natural Language Processing (NLP)
Data science is an interdisciplinary field that involves the use of statistical and computational methods
to extract insightful information and knowledge from data. Data science is simply the application of
specific principles and analytic techniques to extract information from data for use in planning,
strategy, decision making, etc.
In short, we can say that data science is all about using data to answer questions and support decisions.
Example:
Suppose we want to travel from station A to station B by car. We need to take some decisions, such as
which route will get us to the destination fastest, which route will have no traffic jam, and which will
be cost-effective. All these decision factors act as input data, and from analyzing them we get an
appropriate answer; this analysis of data is called data analysis, which is a part of data science.
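As a toy illustration of this kind of decision analysis, the sketch below uses invented travel times, traffic levels, and costs for three hypothetical routes and simply picks the route with the lowest score:

# Toy decision analysis with invented numbers: pick the best route from A to B.
routes = {
    "Route 1": {"time_min": 45, "traffic": 0.8, "cost": 120},
    "Route 2": {"time_min": 60, "traffic": 0.2, "cost": 90},
    "Route 3": {"time_min": 50, "traffic": 0.5, "cost": 100},
}

def score(r):
    # Lower is better: weight travel time, expected delay from traffic, and cost.
    return r["time_min"] * (1 + r["traffic"]) + 0.1 * r["cost"]

best = min(routes, key=lambda name: score(routes[name]))
print("Best route:", best)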
Need for Data Science:
In today's world, data has become so vast that approximately 2.5 quintillion bytes of data are generated
every day, leading to a data explosion. The amount of digital data that exists is growing at a rapid rate,
doubling every two years, and changing the way we live. It was estimated that by 2020 about 1.7 MB of
data would be created every single second for every person on earth. Every company requires data to work,
grow, and improve its business. This means we need the technical tools, algorithms, and models to clean,
process, and understand the available data in its different forms for decision-making purposes.
Handling such a huge amount of data is a challenging task for every organization. To handle, process,
and analyze it, we require complex, powerful, and efficient algorithms and technology, and that
technology is data science. Following are some main reasons for using data science technology:
o With the help of data science technology, we can convert massive amounts of raw and unstructured
data into meaningful insights.
o Data science technology is being adopted by various companies, whether big brands or startups.
Google, Amazon, Netflix, etc., which handle huge amounts of data, use data science algorithms to
improve the customer experience.
o Data science is helping to automate transportation, for example by enabling self-driving cars, which
are the future of transportation.
o Data science can help in different kinds of predictions, such as surveys, elections, flight ticket
confirmation, etc.
Forecasting is the process of predicting or estimating the future based on past and present data (a
minimal sketch follows the examples below).
Examples:
• How many passengers can we expect on a given flight?
• How many customer calls can we expect in the next hour?
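A minimal forecasting sketch for the second example, using invented hourly call counts and a simple moving average as the estimate for the next hour:

# Naive forecast of next-hour call volume using a 3-hour moving average (invented data).
import numpy as np

calls_per_hour = np.array([42, 55, 51, 60, 58, 65, 70, 68])
window = 3
forecast_next_hour = calls_per_hour[-window:].mean()
print("Forecast for next hour:", round(forecast_next_hour))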
Predictive Modeling is used to make predictions at a more granular level, such as "Who are the customers
who are likely to buy a product next month?", so that we can act accordingly.
Machine Learning: It is a method of teaching machines to learn and improve their predictions or
behaviour on their own, based on data (see the sketch after the examples below).
Examples:
• Create an algorithm which can power Google Search.
• Amazon recommendation system.
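As a toy illustration in the spirit of the Amazon example, the sketch below computes item-to-item cosine similarity on an invented user-item purchase matrix and recommends the most similar item:

# Toy item-to-item recommendation with cosine similarity (invented purchase matrix).
import numpy as np

# Rows = users, columns = items; 1 means the user bought the item.
purchases = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
    [1, 1, 1, 0],
])

items = purchases.T                                   # one row per item
norms = np.linalg.norm(items, axis=1, keepdims=True)
similarity = (items @ items.T) / (norms @ norms.T)    # cosine similarity between items

target_item = 0
np.fill_diagonal(similarity, 0)                       # ignore self-similarity
recommended = similarity[target_item].argmax()
print(f"Users who bought item {target_item} may also like item {recommended}")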
Various roles within Data Science
These are the major roles in Data Science, and there is considerable overlap between them. A person has
the opportunity to become a Data Engineer, a Data Scientist, a Data Analyst, and so on, which means
there are enormous choices and opportunities.
Career Opportunities in Data Science
• Data Scientist: The data scientist develops models, such as econometric and statistical models, for
problems like projection, classification, clustering, and pattern analysis.
• Data Architect: The data architect plays an important role in developing innovative strategies to
understand the business's consumer trends and management, as well as ways to solve business
problems, for instance optimizing product fulfilment and overall profit.
• Data Analyst: The data analyst supports the construction of the foundation for futuristic, planned,
and ongoing data analytics projects.
• Machine Learning Engineer: They build data funnels and deliver solutions for complex software.
• Data Engineer: Data engineers process real-time or stored data and create and maintain data
pipelines that form an interconnected ecosystem within a company.
1. Data Analyst:
A data analyst is an individual who mines huge amounts of data, models the data, and looks for
patterns, relationships, trends, and so on. At the end of the day, he or she produces visualizations and
reports for analyzing the data to support decision-making and problem-solving.
Skill required: To become a data analyst, you must have a good background in mathematics, business
intelligence, data mining, and basic knowledge of statistics. You should also be familiar with computer
languages and tools such as MATLAB, Python, SQL, Hive, Pig, Excel, SAS, R, JS, Spark, etc.
2. Machine Learning Engineer:
Skill required: Computer programming languages such as Python, C++, R, Java, and Hadoop. You
should also have an understanding of various algorithms, analytical problem-solving skills, probability,
and statistics.
3. Data Engineer:
A data engineer works with massive amounts of data and is responsible for building and maintaining
the data architecture of a data science project. A data engineer also works on creating the data set
processes used in modeling, mining, acquisition, and verification.
Skill required: A data engineer must have in-depth knowledge of SQL, MongoDB, Cassandra, HBase,
Apache Spark, Hive, and MapReduce, along with programming knowledge of Python, C/C++, Java, Perl, etc.
4. Data Scientist:
A data scientist is a professional who works with an enormous amount of data to come up with
compelling business insights through the deployment of various tools, techniques, methodologies,
algorithms, etc.
Skill required: To become a data scientist, one should have technical language skills such as R, SAS,
SQL, Python, Hadoop platforms, Hive, Pig, Apache Spark, and MATLAB; good knowledge of
semi-structured formats such as JSON, XML, and HTML; and knowledge of how to work with
unstructured data. Data scientists must also have an understanding of statistics, mathematics,
visualization, and communication skills.
Who is a Data Scientist? “A data scientist can predict the future with the necessary data.”
Data scientists use programming tools such as Python, R, SAS, Java, Perl, and C/C++ to extract
knowledge from prepared data. To extract this information, they employ various fit-to-purpose models
based on machine learning algorithms, statistics, and mathematical methods.
Data scientists are the experts who can use various statistical tools and machine learning algorithms
to understand and analyze the data.
A data scientist is a professional responsible for collecting, analyzing, and interpreting extremely
large amounts of data. The data science role is entirely different from several traditional technical
roles, including mathematician, scientist, statistician, and computer professional. This job requires
the use of advanced analytics technologies, including machine learning and predictive modeling.
A data scientist requires large amounts of data to develop hypotheses, make inferences, and analyze
customer and market trends. Basic responsibilities include gathering and analyzing data, using
various types of analytics and reporting tools to detect patterns, trends, and relationships in data sets.
In business, data scientists typically work in teams to mine for information that can be used to predict
customer behavior and identify new revenue opportunities. In many organizations, data scientists
are also responsible for setting best practices for collecting data, using analysis tools, and interpreting
data.
The demand for data science skills has grown significantly over the years, as companies look to
glean useful information from big data, the voluminous amounts of structured, unstructured, and
semi-structured data that a large enterprise or IoT produces and collects.
Essential Stages of Data Science Life Cycle
The main phases of the data science life cycle are given below:
1. Discovery: The first phase is discovery, which involves asking the right questions. When you start
any data science project, you need to determine the basic requirements, priorities, project budget,
specifications, etc. In this phase, we need to determine all the requirements of the project, such as the
number of people, technology, time, data, and the end goal, and then we can frame the business problem
at a first hypothesis level so that we do not get stuck in the middle of the project.
2. Data preparation: Data preparation is also known as data munging. In this phase, we need to get the
required data model and data sets to perform data analytics for the whole project. We need to perform
ETL (Extract, Transform, Load) on the available data to keep it ready for the next stage. In this phase,
we need to perform the following tasks:
o Data cleaning
o Data Reduction
o Data integration
o Data transformation
After performing all the above tasks, we can easily use this data for our further processes.
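A minimal data preparation sketch in pandas (invented data), touching cleaning, integration, and transformation:

# Minimal data preparation sketch (invented data): clean, integrate, transform.
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 2, 2, 3, None],
                       "amount": [250, 400, 400, None, 150]})
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "region": ["North", "South", "East"]})

# Data cleaning: drop duplicates and rows with missing keys, fill missing amounts
orders = orders.drop_duplicates().dropna(subset=["customer_id"])
orders["customer_id"] = orders["customer_id"].astype(int)
orders["amount"] = orders["amount"].fillna(orders["amount"].median())

# Data integration: combine the two sources on a common key
data = orders.merge(customers, on="customer_id", how="left")

# Data transformation: scale the amount column to the range [0, 1]
data["amount_scaled"] = (data["amount"] - data["amount"].min()) / (data["amount"].max() - data["amount"].min())
print(data)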
3. Model Planning: In this phase, we need to determine the various methods and techniques for
establishing the relations between the input variables. We apply exploratory data analysis (EDA) using
various statistical formulae and visualization tools to understand the relations between variables and to
see what the data can tell us. There are various tools available for model planning, of which a few are
open platforms whereas others are paid; any of them can be used to perform model planning.
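A minimal EDA sketch (invented data) of the kind used during model planning to inspect the relationships between variables:

# Minimal exploratory data analysis sketch (invented data).
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"hours_worked": [6, 8, 9, 10, 12, 7, 11],
                   "fatigue":      [2, 4, 5, 6, 9, 3, 8]})

print(df.describe())              # summary statistics per variable
print(df.corr())                  # correlation between input variables

df.plot.scatter(x="hours_worked", y="fatigue")  # visual check of the relationship
plt.show()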
4. Model Building: In this phase, the process of model building starts. We create data sets for training
and testing purposes and apply techniques such as association, classification, and clustering to build the
model. A set of algorithms is applied to the previously prepared data to interpret the patterns and predict
future trends. With the help of the existing data sets, we train the machine and validate the model. Some
common model-building tools are SAS Enterprise Miner, WEKA, SPSS Modeler, and MATLAB.
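A minimal model-building sketch using scikit-learn (assumed to be installed), showing the train/test split and a simple classifier on invented data:

# Minimal model-building sketch: train/test split and a decision-tree classifier (invented data).
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X = [[6, 2], [8, 4], [9, 5], [10, 6], [12, 9], [7, 3], [11, 8], [5, 1]]  # features
y = [0, 0, 0, 1, 1, 0, 1, 0]                                             # labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)                          # train on the training split
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))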
5. Communicate Results: In this penultimate phase, we compare the results obtained with the
requirements set in the initial stages. If the results match, we are on the right track and the goal is about
90% achieved. We then communicate the findings and the final result to the business team.
6. Operationalize: In this final phase, we deliver the final reports of the project, along with briefings,
code, and technical documents. This phase gives a clear overview of the complete project performance
and other components on a small scale before full deployment. If the goal is achieved, the model is
operationalized in real life and tested; if the desired output is then achieved, the goal is achieved
completely. It is also necessary to maintain the data model for future reference.
Case Study: Happy and healthy employees are undoubtedly more efficient at work. They help the
company to thrive. However, the scenario in most companies has changed with the pandemic. Since
the shift to work from home, over 69% of employees have been showing burnout symptoms (survey
source: Monster). The burnout rate is indeed alarming. Many companies are taking steps to ensure
their employees stay mentally healthy. As a solution to this, we’ll build a web app that can be used by
companies to monitor the burnout of their employees. Also, the employees themselves can use it to
keep a check on their burnout rate (no time to assess the mental health in the fast work-life).
Feature Engineering
The process of feature engineering involves extracting features from the raw data based on domain
insights. Feature engineering also involves feature selection, feature transformation, and feature
construction to prepare the input data that best fits the machine learning algorithms. Feature
engineering directly influences the results of the model and hence is a crucial part of data science.
Let's use Date of Joining to create a new feature, Days Spent, that contains information about how
many days the employee has worked in the company to date. The Burn Rate for an employee who has
worked for years will perhaps be much higher than for a newly joined employee.
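A minimal pandas sketch of how such a feature could be derived, with invented joining dates and assuming the column is literally named "Date of Joining" and a fixed reference date:

# Derive a "Days Spent" feature from "Date of Joining" (invented data; names assumed).
import pandas as pd

df = pd.DataFrame({"Employee": ["A", "B", "C"],
                   "Date of Joining": ["2008-09-30", "2008-11-30", "2008-03-10"]})

df["Date of Joining"] = pd.to_datetime(df["Date of Joining"])
reference_date = pd.Timestamp("2008-12-31")                  # assumed cut-off date
df["Days Spent"] = (reference_date - df["Date of Joining"]).dt.days
print(df)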
Let's find the performance of different ensemble techniques on the data: 1. XGBoost, 2. AdaBoost,
3. RandomForest. All three algorithms give pretty good results.
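A hedged sketch of such a comparison, assuming scikit-learn and xgboost are installed; synthetic regression data stands in here for the prepared burnout features and Burn Rate targets:

# Compare three ensemble regressors (synthetic data stands in for the burnout dataset).
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

X, y = make_regression(n_samples=300, n_features=6, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {"XGBoost": XGBRegressor(n_estimators=200, random_state=42),
          "AdaBoost": AdaBoostRegressor(n_estimators=200, random_state=42),
          "RandomForest": RandomForestRegressor(n_estimators=200, random_state=42)}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "R^2 on the test set:", round(r2_score(y_test, model.predict(X_test)), 3))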
Now our aim is to build a web app that takes input information from the user and gives a prediction of
the Burn Rate for that user. To build it, we'll use Flask, a lightweight Python web framework that allows
us to build web applications. After building the app, we'll deploy it and push the project to GitHub (a
minimal Flask sketch follows the feature list below).
Note: Other platforms where we can deploy ML models include Amazon AWS, Microsoft Azure,
Google Cloud, etc.
The input information that we collect from the user consists of the features on which our Burn Rate
predictive model has been trained:
• Designation of the user's role in the company, in the range [0-5]: 5 is the highest designation and 0 is
the lowest.
• The number of working hours
• Mental Fatigue Score in the range [0-10]: how fatigued/tired the user usually feels during working
hours.
• Gender: Male/Female
• Type of the company: Service/Product
• Do you work from home: Yes/No
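A minimal Flask sketch of such a prediction endpoint (all names here are illustrative; a trained model is assumed to have been saved as model.pkl, and the six inputs above are assumed to be already numerically encoded):

# Minimal Flask prediction endpoint (illustrative names; model.pkl assumed to exist).
import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)
with open("model.pkl", "rb") as f:      # trained Burn Rate model saved earlier (assumed)
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    data = request.get_json()
    # Inputs assumed already encoded as numbers (e.g. gender/company type/WFH as 0 or 1).
    features = [[data["designation"], data["working_hours"], data["mental_fatigue"],
                 data["gender"], data["company_type"], data["wfh"]]]
    burn_rate = model.predict(features)[0]
    return jsonify({"burn_rate": float(burn_rate)})

if __name__ == "__main__":
    app.run(debug=True)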
Burnout Rate Prediction web app
Link of the web app: https://burnout-rate-prediction-api.herokuapp.com
It can be accessed from anywhere and be used to keep a check on the mental health of the employees.
Components of Data Science
Data science consists of many algorithms, theories, components, etc. Some basic components of data
science are discussed here.
1. Data
Data is a collection of factual information based on numbers, words, observations, and measurements,
which can be utilized for calculation, discussion, and reasoning. The raw dataset is the basic foundation
of data science, and it may be of different kinds, such as Structured Data (tabular data), Unstructured
Data (pictures, recordings, messages, PDF documents, and so forth), and Semi-Structured Data.
2. Big Data
Big Data refers to enormously large data sets. It is characterized by various V's such as volume, variety,
velocity, etc. Facebook's data is a typical example.
Data is often compared with crude oil, which is a valuable raw material: just as refineries extract refined
oil from crude petroleum, by applying data science, scientists can extract various kinds of information
from raw data.
The various tools used by data scientists to process big data include Hadoop, Spark, R, Java, Pig, and
many more.
3. Machine Learning
Machine learning is the part of data science that enables systems to process data sets autonomously,
without human interference, by utilizing various algorithms to work on the massive volumes of data
generated and extracted from numerous sources.
It makes predictions, analyses patterns, and gives recommendations. Machine learning is frequently
used in fraud detection and customer retention.
A social media platform such as Facebook is a good example of machine learning in practice, where
algorithms gather behavioural information about every user and recommend appropriate articles,
multimedia files, and much more according to their preferences.
Machine learning is also a part of artificial intelligence, where the requisite information is obtained by
utilizing various algorithms and techniques, such as supervised and unsupervised machine learning
algorithms. A machine learning professional must have basic knowledge of statistics and probability,
data evaluation, and technical skills in programming languages.
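A supervised classifier was sketched in the model-building phase above; for contrast, here is a minimal unsupervised sketch that groups users by behaviour with k-means clustering (scikit-learn assumed installed, data invented):

# Minimal unsupervised learning sketch: group users by behaviour with k-means (invented data).
from sklearn.cluster import KMeans

# Each row: [posts per week, minutes spent per day] for one user
behaviour = [[2, 15], [3, 20], [25, 180], [30, 200], [1, 10], [28, 190]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(behaviour)
print("Cluster labels per user:", kmeans.labels_)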
Organizational Challenges while Building Data Science Projects
1. Data Quality
The process of discovering data is a crucial and fundamental task in a data-driven project. Approaches
to data quality can be defined based on certain requirements, such as user-centred and other
organizational frameworks.
How to Avoid
Methods such as data profiling and data exploration help analysts investigate the quality of data sets as
well as the implications of their use. The data quality cycle must be followed in order to establish best
practices for improving and ensuring high data quality.
2. Data Integration
In general, the method of combining data from various sources and storing it together to obtain a unified
view is known as data integration. An organization with inconsistent data is likely to face data
integration issues.
How To Avoid
There are several data integration platforms, such as Talend, Adeptia, Actian, QlikView, etc., which can
be used to solve complex data integration issues.
These tools provide data integration features such as automating and orchestrating transformations,
building extensible frameworks, automating query performance optimization, etc.
3. Dirty Data
Data which contains inaccurate information can be said as dirty data. To remove the dirty data from a
dataset is virtually impossible. Depending on the severity of the errors, strategies to work with dirty
data needs to be implemented.
There are basically six types of dirty data, they are mentioned below
• Inaccurate Data: In this case, the data can be technically correct but inaccurate for the
organisation.
• Incorrect Data: Incorrect data occurs when field values are created outside of the valid range of
values.
• Duplicate Data: Duplicate data may occur due to reasons such as repeated submissions, improper
data joining, etc.
• Inconsistent Data: Data redundancy is one of the main causes of inconsistent data.
• Incomplete Data: This is due to the data with missing values.
• Business Rule Violation: This type of data violates the business rule in an organization.
How To Avoid
This challenge can be overcome when organizations hire data management experts to cleanse, validate,
replace, or delete raw and unstructured data. There are also data cleansing (data scrubbing) tools, such
as TIBCO Clarity, available in the market to clean dirty data.
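A minimal pandas sketch (invented data) of the kind of cleansing such tools automate: removing duplicates, flagging out-of-range values, and filling in missing entries:

# Minimal dirty-data cleansing sketch (invented data).
import pandas as pd

df = pd.DataFrame({"employee_id": [101, 102, 102, 103, 104],
                   "age": [29, 34, 34, -5, None],              # -5 is outside the valid range
                   "department": ["HR", "IT", "IT", "Sales", None]})

df = df.drop_duplicates()                                      # duplicate data
df.loc[(df["age"] < 0) | (df["age"] > 100), "age"] = None      # incorrect data -> treat as missing
df["age"] = df["age"].fillna(df["age"].median())               # incomplete data
df["department"] = df["department"].fillna("Unknown")
print(df)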
4. Data Uncertainty
Reasons for data uncertainty range from measurement errors to processing errors and beyond. Known
and unknown errors, as well as uncertainties, should be expected when using real-world data. There
are five common types of uncertainty; they are mentioned below:
• Measurement Precision: Approximation leads to uncertainty.
• Predictions: It can be projections of future events, which may or may not happen.
• Inconsistency: Inconsistency between experts in a field or across datasets is an indication of
uncertainty.
• Incompleteness: Incompleteness in datasets including missing data or data known to be
erroneous also causes uncertainty.
• Credibility: The credibility of data, or of the source of data, is another type of uncertainty.
How To Avoid
There are powerful uncertainty quantification and analytics software tools, such as SmartUQ, UQlab,
etc., which are used to reduce the time, expense, and uncertainty associated with simulating, testing,
and analyzing complex systems.
5. Data Transformation
Raw data from various sources most often do not work well together and thus need to be cleaned and
normalized. Data transformation can be described as the method of converting data from one format to
another in order to gain meaningful insights from it. Data transformation is often carried out as ETL
(Extract, Transform, Load), which helps convert raw data sources into a validated and clean form for
gaining useful insights. Although the whole data set can be transformed into a usable form, some things
can still go wrong in an ETL project, such as an increase in data velocity, the time cost of fixing broken
data connections, etc.
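A minimal extract-transform-load sketch in Python (the file, column, and table names are illustrative), converting a raw CSV into a cleaned, consistently formatted table stored for analysis:

# Minimal ETL sketch: extract from CSV, transform, load into SQLite (names illustrative).
import sqlite3
import pandas as pd

raw = pd.read_csv("raw_sales.csv")                          # Extract (file assumed to exist)

raw.columns = [c.strip().lower() for c in raw.columns]      # Transform: tidy column names
raw["sale_date"] = pd.to_datetime(raw["sale_date"])         # unify the date format (column assumed)
raw["amount_usd"] = raw["amount_usd"].fillna(0).round(2)    # handle missing values

with sqlite3.connect("analytics.db") as conn:               # Load into a local database
    raw.to_sql("sales_clean", conn, if_exists="replace", index=False)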
How To Avoid
There are various ETL tools, such as Ketl, Jedox, etc., which can be used to extract data and store it in
the proper format for the purpose of analysis.
Bottom Line
With emerging technologies, data-driven projects have become fundamental to the success of an
organisation. Data is a valuable asset for an organisation and comes in various sizes. The road to a
successful data-driven project is to overcome these challenges as far as possible. There are numerous
tools available in the market nowadays to extract valuable patterns from unstructured data.