
Unit 1

Introduction to Data Science


Data Science is a combination of multiple disciplines that uses statistics, data analysis, and machine
learning to analyze data and to extract knowledge and insights from it.
In a world where organizations deal with petabytes and exabytes of data, the era of Big Data emerged, and with it the problem of storing it all. Storage was a great challenge and concern for industry until around 2010, when frameworks like Hadoop largely solved it; the focus then shifted to the processing of data, and this is where Data Science plays a big role. Many of the scenarios in the fancy sci-fi movies you love to watch can be turned into reality by Data Science. Its growth has accelerated in multiple ways, so one should prepare for the future by learning what it is and how we can add value with it.
What is Data Science?
Data Science is about data gathering, analysis, and decision-making.
Data Science is about finding patterns in data through analysis and making future predictions.
By using Data Science, companies are able to make:
Better decisions (should we choose A or B?)
Predictive analyses (what will happen next?)
Pattern discoveries (finding patterns, or perhaps hidden information, in the data)

Where is Data Science Needed?


Data Science is used in many industries in the world today, e.g. banking, consultancy, healthcare, and
manufacturing.

Examples of where Data Science is needed:


For route planning: To discover the best routes to ship
To foresee delays for flight/ship/train etc. (through predictive analysis)
To create promotional offers
To find the best suited time to deliver goods
To forecast next year's revenue for a company
To analyze the health benefits of training
To predict who will win elections
Data Science can be applied in nearly every part of a business where data is available. Examples are:
Consumer goods
Stock markets
Industry
Politics
Logistic companies
E-commerce
How Does a Data Scientist Work?
A Data Scientist requires expertise in several backgrounds:
Machine Learning
Statistics
Programming (Python or R)
Mathematics
Databases
A Data Scientist must find patterns within the data. Before the patterns can be found, the data must be organized in a standard format.
Here is how a Data Scientist works:
Ask the right questions - To understand the business problem.
Explore and collect data - From database, web logs, customer feedback, etc.
Extract the data - Transform the data to a standardized format.
Clean the data - Remove erroneous values from the data.
Find and replace missing values - Check for missing values and replace them with a suitable value
(e.g. an average value).
Normalize data - Scale the values to a practical range (e.g. 140 cm is smaller than 1.8 m, yet the number 140 is larger than 1.8 - so scaling is important).
Analyze data, find patterns and make future predictions.
Represent the result - Present the result with useful insights in a way the "company" can understand.
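The cleaning, missing-value, and normalization steps above can be sketched with pandas; the column names and values below are hypothetical:

```python
import pandas as pd

# Hypothetical raw data: duplicate rows and a missing height
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Ana", "Cy"],
    "height_cm": [140.0, None, 140.0, 180.0],
})

# Clean the data: drop exact duplicate rows
df = df.drop_duplicates()

# Find and replace missing values with a suitable value (here, the average)
df["height_cm"] = df["height_cm"].fillna(df["height_cm"].mean())

# Normalize: scale heights to the 0-1 range so values become comparable
lo, hi = df["height_cm"].min(), df["height_cm"].max()
df["height_scaled"] = (df["height_cm"] - lo) / (hi - lo)

print(df)
```

Min-max scaling is only one option; standardization (subtracting the mean and dividing by the standard deviation) is an equally common choice.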

How Does Data Science Work?


Data science is not a one-step process that you can learn in a short time and then call yourself a Data Scientist. It passes through many stages, and every element is important. One should always follow the proper steps to climb the ladder; every step has its value and counts toward your model. Buckle up and get ready to learn about those steps.
1. Problem Statement:
No work starts without motivation, and data science is no exception. It is really important to formulate your problem statement clearly and precisely, because your whole model and its workings depend on it. Many scientists consider this the most important step of Data Science. So make sure you know what your problem statement is and how well it can add value to a business or any other organization.
2. Data Collection:
After defining the problem statement, the next obvious step is to search for the data your model might require. Do thorough research and find everything you need. Data can be structured or unstructured, and it may come in various forms such as videos, spreadsheets, coded forms, etc. You must collect data from all these kinds of sources.
3. Data Cleaning:
Now that you have formulated your motive and collected your data, the next step is cleaning. Data cleaning is the removal of missing, redundant, unnecessary, and duplicate data from your collection. There are various tools to do this with the help of programming in either R or Python; which one you choose is up to you, and scientists differ in their opinions. For statistical work R is often preferred, as it offers more than 12,000 packages, while Python is used because it is fast, easily accessible, and can perform the same tasks as R with the help of various packages.
4. Data Analysis and Exploration:
It’s one of the prime tasks in data science, and the time to bring your inner Holmes out. It’s about analyzing the structure of the data, finding hidden patterns in it, studying behaviors, visualizing the effect of one variable on another, and then drawing conclusions. We can explore the data through graphs built with the libraries of any programming language: in R, ggplot2 is one of the most famous packages, while Python has Matplotlib.
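Before (or alongside) plotting, the same exploration can start numerically; a minimal pandas sketch on a hypothetical advertising dataset:

```python
import pandas as pd

# Hypothetical dataset: advertising spend vs. sales
df = pd.DataFrame({
    "ad_spend": [10, 20, 30, 40, 50],
    "sales":    [15, 24, 38, 41, 56],
})

# Summary statistics: the quickest first look at the data
print(df.describe())

# Correlation: how strongly one variable moves with the other
r = df["ad_spend"].corr(df["sales"])
print(f"correlation: {r:.3f}")

# The same relationship is usually drawn as a scatter plot, e.g. with
# Matplotlib: plt.scatter(df["ad_spend"], df["sales"]); plt.show()
```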
5. Data Modelling:
Once you have finished the study you formed through data visualization, you must start building a model that can yield good predictions in the future. Here you must choose an algorithm that best fits your problem; there are different kinds, from regression to classification, SVMs (Support Vector Machines), clustering, etc. Your model can be based on a machine learning algorithm: you train it with training data and then evaluate it with test data. The simplest method is a train/test split, where you divide the data into two parts, one for training and one for testing; k-fold cross-validation goes further by splitting the data into k parts and letting each part take a turn as the test set.
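A minimal scikit-learn sketch of this training step, on synthetic (hypothetical) data: a plain train/test split, followed by k-fold cross-validation, where the data is split into k parts and each part takes one turn as the test set:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, train_test_split

# Hypothetical data: 20 samples, 2 features, binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Simple holdout: one train set, one test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))

# 5-fold cross-validation: every sample is used for testing exactly once
scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True,
                                 random_state=0).split(X):
    m = LogisticRegression().fit(X[train_idx], y[train_idx])
    scores.append(m.score(X[test_idx], y[test_idx]))
print("mean cv accuracy:", sum(scores) / len(scores))
```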
6. Optimization and Deployment:
You followed each and every step and built a model that you feel is the best fit. But how can you decide how well your model is performing? This is where optimization comes in: you test your model and measure how well it performs by checking its accuracy. In short, you check the efficiency of the data model and try to optimize it for more accurate predictions. Deployment deals with launching your model and letting people out there benefit from it. You can also gather feedback from organizations and people to learn their needs and then work further on your model.
Advice for new data science students
 Curiosity: if you are not curious, you will not know what to do with the data.
 Judgment: without some preconceived notions about things, you would not know where to begin.
 Argumentativeness: if you can argue and plead a case, you can at least start somewhere, learn from the data, and then modify your assumptions.
 Start by gaining a solid understanding of the basics of programming, statistics, and linear algebra.
 Learn the tools of the trade such as Python, R, and SQL. Familiarize yourself with the most popular
libraries and frameworks like numpy, pandas, and scikit-learn.
 Practice, practice, practice. Participate in online coding challenges and hackathons to improve your
skills and gain experience.
 Learn the basics of machine learning and familiarize yourself with the most popular algorithms.
 Read research papers and stay up-to-date with the latest developments in the field.
 Learn how to communicate your findings effectively. Being able to present your work in a clear
and compelling way is just as important as the technical skills you possess.
 Build a portfolio of projects that showcase your skills and experience.
 Network with other data scientists and professionals in the field. Attend meetups and conferences,
and connect with people on LinkedIn.
 Be curious, and don’t be afraid to ask questions.
 Finally, don’t be discouraged if you encounter challenges or roadblocks along the way. Learning to
become a data scientist is a journey, and it takes time, effort, and dedication to succeed.
Advantages of data science:
 Improved decision-making: Data science can help organizations make better decisions by providing
insights and predictions based on data analysis.
 Cost-effective: With the right tools and techniques, data science can help organizations reduce costs
by identifying areas of inefficiency and optimizing processes.
 Innovation: Data science can be used to identify new opportunities for innovation and to develop
new products and services.
 Competitive advantage: Organizations that use data science effectively can gain a competitive
advantage by making better decisions, improving efficiency, and identifying new opportunities.
 Personalization: Data science can help organizations personalize their products or services to better
meet the needs of individual customers.
Disadvantages of data science:
 Data quality: The accuracy and quality of the data used in data science can have a significant
impact on the results obtained.
 Privacy concerns: The collection and use of data can raise privacy concerns, particularly if the data
is personal or sensitive.
 Complexity: Data science can be a complex and technical field that requires specialized skills and
expertise.
 Bias: Data science algorithms can be biased if the data used to train them is biased, which can lead
to inaccurate results.
 Interpretation: Interpreting data science results can be challenging, particularly for non-technical
stakeholders who may not understand the underlying assumptions and methods used.
Data Science Process
Last Updated: 25 Sep, 2023



If you are in a technical domain or a student with a technical background, then you have certainly heard about Data Science from some source. It is one of the booming fields in today’s tech market, and it will keep growing as the world becomes more and more digital day by day. Data certainly holds the capacity to create a new future. In this article, we will learn about Data Science and the process it involves.
What is Data Science?
Data can prove very fruitful if we know how to manipulate it to extract hidden patterns. The logic behind the data, or the process behind this manipulation, is what is known as Data Science. From formulating the problem statement and collecting data to extracting the required results, this is the Data Science process, and the professional who ensures the whole process goes smoothly is known as the Data Scientist. There are other job roles in this domain as well, like:
1. Data Engineers
2. Data Analysts
3. Data Architect
4. Machine Learning Engineer
5. Deep Learning Engineer
Data Science Process Life Cycle
There are some steps that are necessary for any of the tasks that are being done in the field of data
science to derive any fruitful results from the data at hand.
 Data Collection – After formulating the problem statement, the main task is to collect data that can help us in our analysis and manipulation. Sometimes data is collected by performing a survey, and at other times by web scraping.
 Data Cleaning – Most of the real-world data is not structured and requires cleaning and conversion
into structured data before it can be used for any analysis or modeling.
 Exploratory Data Analysis – This is the step in which we try to find the hidden patterns in the data at hand. We also analyze the different factors that affect the target variable and the extent to which they do so, how the independent features are related to each other, and what can be done to achieve the desired results. This also gives us a direction in which to work when we start the modeling process.
 Model Building – Different types of machine learning algorithms as well as techniques have been
developed which can easily identify complex patterns in the data which will be a very tedious task
to be done by a human.
 Model Deployment – After a model is developed and performs well on the holdout or real-world dataset, we deploy it and monitor its performance. This is the main part, where we apply what we have learned from the data to real-world applications and use cases.
Data Science Process Life Cycle

Components of Data Science Process


Data Science is a very vast field, and to get the best out of the data at hand, one has to apply multiple methodologies and use different tools to make sure the integrity of the data remains intact throughout the process, keeping data privacy in mind. Machine Learning and data analysis are the parts where we focus on the results that can be extracted from the data. Data engineering, on the other hand, is the part whose main task is to ensure that the data is managed properly and that proper data pipelines are created for smooth data flow. The main components of Data Science are:
 Data Analysis – There are times when there is no need to apply advanced deep learning or other complex methods to derive patterns from the data. For this reason, before moving on to modeling, we first perform an exploratory data analysis to get a basic idea of the data and the patterns available in it; this gives us a direction to work in if we later want to apply complex analysis methods.
 Statistics – It is a natural phenomenon that many real-life datasets follow a normal distribution, and when we know that a particular dataset follows a known distribution, most of its properties can be analyzed at once. Descriptive statistics, along with the correlation and covariance between two features of the dataset, help us better understand how one factor relates to another.
 Data Engineering – When we deal with a large amount of data, we have to make sure that the data is kept safe from online threats and that it remains easy to retrieve and modify. Data Engineers play a crucial role in ensuring that the data is used efficiently.
 Advanced Computing
o Machine Learning – Machine Learning has opened new horizons, helping us build advanced applications and methodologies so that machines become more efficient, provide a personalized experience to each individual, and perform in a snap tasks that used to require heavy human labor and time.
o Deep Learning – This is also a part of Artificial Intelligence and Machine Learning, but a bit more advanced than machine learning itself. High computing power and huge corpora of data have led to the emergence of this field in data science.
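The Statistics component above can be illustrated with a short NumPy sketch (all data here is simulated): for normal data roughly 68% of values fall within one standard deviation of the mean, and correlation quantifies how two features move together:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated sample from a normal distribution
sample = rng.normal(loc=100, scale=15, size=100_000)
mean, std = sample.mean(), sample.std()
within_one_std = np.mean(np.abs(sample - mean) < std)
print(f"mean={mean:.1f} std={std:.1f} within 1 sd: {within_one_std:.1%}")

# Correlation between two hypothetical related features
heights = rng.normal(170, 10, size=1000)
weights = 0.9 * heights + rng.normal(0, 5, size=1000)
corr_hw = np.corrcoef(heights, weights)[0, 1]
print(f"height-weight correlation: {corr_hw:.2f}")
```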
Knowledge and Skills for Data Science Professionals
As a Data Scientist, you’ll be responsible for jobs that span three domains of skills.
1. Statistical/mathematical reasoning
2. Business communication/leadership
3. Programming
1. Statistics: Wikipedia defines it as the study of the collection, analysis, interpretation, presentation,
and organization of data. Therefore, it shouldn’t be a surprise that data scientists need to know
statistics.
2. Programming Language R/Python: Python and R are two of the most widely used languages by Data Scientists. The primary reason is the number of packages available for numeric and scientific computing.
3. Data Extraction, Transformation, and Loading: Suppose we have multiple data sources like a MySQL DB, MongoDB, and Google Analytics. You have to extract data from such sources, then transform it into a proper format or structure for querying and analysis. Finally, you load the data into a Data Warehouse, where you will analyze it. So for people from an ETL (Extract, Transform, and Load) background, Data Science can be a good career option.
Steps for Data Science Processes:
Step 1: Defining research goals and creating a project charter
 Spend time understanding the goals and context of your research. Continue asking questions and devising examples until you grasp the exact business expectations, identify how your project fits in the bigger picture, appreciate how your research is going to change the business, and understand how the business will use your results.
Create a project charter
A project charter requires teamwork, and your input covers at least the following:
1. A clear research goal
2. The project mission and context
3. How you’re going to perform your analysis
4. What resources you expect to use
5. Proof that it’s an achievable project, or proof of concepts
6. Deliverables and a measure of success
7. A timeline
Step 2: Retrieving Data
Start with data stored within the company
 Finding data even within your own company can sometimes be a challenge.
 This data can be stored in official data repositories such as databases, data marts, data warehouses,
and data lakes maintained by a team of IT professionals.
 Getting access to the data may take time and involve company policies.
Step 3: Cleansing, integrating, and transforming data
Cleaning:
 Data cleansing is a subprocess of the data science process that focuses on removing errors in your
data so your data becomes a true and consistent representation of the processes it originates from.
 The first type is the interpretation error, such as incorrect use of terminologies, like saying that a
person’s age is greater than 300 years.
 The second type of error points to inconsistencies between data sources or against your company’s
standardized values. An example of this class of errors is putting “Female” in one table and “F” in
another when they represent the same thing: that the person is female.
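Both error types above can be handled with a few lines of pandas; the column names and values are hypothetical:

```python
import pandas as pd

# Hypothetical records containing both kinds of errors
df = pd.DataFrame({
    "age":    [34, 2840, 57],          # interpretation error: age > 300
    "gender": ["Female", "F", "male"], # inconsistent codes across sources
})

# Remove physically impossible ages
df = df[df["age"].between(0, 120)]

# Map every variant onto one standardized value
df["gender"] = df["gender"].str.upper().str[0].map({"F": "Female", "M": "Male"})

print(df)
```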
Integrating:
 Combining Data from different Data Sources.
 Your data comes from several different places, and in this sub step we focus on integrating these
different sources.
 You can perform two operations to combine information from different data sets. The first
operation is joining and the second operation is appending or stacking.
Joining Tables:
 Joining tables allows you to combine the information of one observation found in one table with the
information that you find in another table.
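In pandas, joining is done with `merge`; a small sketch with two hypothetical tables sharing a key:

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "name": ["Ana", "Ben", "Cy"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3],
                       "amount": [25.0, 40.0, 10.0]})

# Join: combine the columns of observations that match on the key
joined = customers.merge(orders, on="customer_id", how="inner")
print(joined)
```

An inner join keeps only matching keys; `how="left"` would also keep Ben, who has no orders.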
Appending Tables:
 Appending or stacking tables is effectively adding observations from one table to another table.
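Appending is `concat` in pandas; a sketch with two hypothetical monthly extracts that share the same columns:

```python
import pandas as pd

jan = pd.DataFrame({"order_id": [1, 2], "amount": [25.0, 40.0]})
feb = pd.DataFrame({"order_id": [3], "amount": [10.0]})

# Append: stack the observations of one table under the other
stacked = pd.concat([jan, feb], ignore_index=True)
print(stacked)
```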
Transforming Data
 Certain models require their data to be in a certain shape.
Reducing the Number of Variables
 Sometimes you have too many variables and need to reduce the number because they don’t add
new information to the model.
 Having too many variables in your model makes the model difficult to handle, and certain
techniques don’t perform well when you overload them with too many input variables.
 Dummy variables can only take two values: true (1) or false (0). They’re used to indicate the presence or absence of a categorical effect that may explain the observation.
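In pandas, dummy variables are produced by `get_dummies`; a sketch on a hypothetical categorical column:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Paris", "Tokyo", "Paris"]})

# One dummy column per category: 1 marks presence, 0 absence
dummies = pd.get_dummies(df["city"], dtype=int)
print(dummies)
```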
Step 4: Exploratory Data Analysis
 During exploratory data analysis you take a deep dive into the data.
 Information becomes much easier to grasp when shown in a picture, therefore you mainly use
graphical techniques to gain an understanding of your data and the interactions between variables.
 Bar plot, line plot, scatter plot, multiple plots, Pareto diagram, link-and-brush diagram, histogram, box-and-whisker plot.
Step 5: Build the Models
 Building the models is the next step, with the goal of making better predictions, classifying objects, or gaining an understanding of the system being modeled.
Step 6: Presenting findings and building applications on top of them
 The last stage of the data science process is where your soft skills will be most useful, and yes,
they’re extremely important.
 Present your results to the stakeholders and industrialize your analysis process for repetitive reuse and integration with other tools.
Benefits and uses of data science and big data
 Governmental organizations are also aware of data’s value. A data scientist in a governmental
organization gets to work on diverse projects such as detecting fraud and other criminal activity or
optimizing project funding.
 Nongovernmental organizations (NGOs) are also no strangers to using data. They use it to raise
money and defend their causes. The World Wildlife Fund (WWF), for instance, employs data
scientists to increase the effectiveness of their fundraising efforts.
 Universities use data science in their research, but also to enhance the study experience of their students. Example: MOOCs (Massive Open Online Courses).
Tools for Data Science Process
As time has passed, the tools for the different tasks in Data Science have evolved to a great extent. Software such as MATLAB and Power BI, and programming languages like Python and R, provide many utility features that help us complete even the most complex tasks quickly and efficiently. Some of the tools that are very popular in this domain are shown in the image below.

Tools for Data Science Process

Usage of Data Science Process


The Data Science Process is a systematic approach to solving data-related problems and consists of the
following steps:
1. Problem Definition: Clearly defining the problem and identifying the goal of the analysis.
2. Data Collection: Gathering and acquiring data from various sources, including data cleaning and
preparation.
3. Data Exploration: Exploring the data to gain insights and identify trends, patterns, and
relationships.
4. Data Modeling: Building mathematical models and algorithms to solve problems and make
predictions.
5. Evaluation: Evaluating the model’s performance and accuracy using appropriate metrics.
6. Deployment: Deploying the model in a production environment to make predictions or automate
decision-making processes.
7. Monitoring and Maintenance: Monitoring the model’s performance over time and making
updates as needed to improve accuracy.
Issues of Data Science Process
1. Data Quality and Availability: Data quality can affect the accuracy of the models developed and
therefore, it is important to ensure that the data is accurate, complete, and consistent. Data
availability can also be an issue, as the data required for analysis may not be readily available or
accessible.
2. Bias in Data and Algorithms: Bias can exist in data due to sampling techniques, measurement
errors, or imbalanced datasets, which can affect the accuracy of models. Algorithms can also
perpetuate existing societal biases, leading to unfair or discriminatory outcomes.
3. Model Overfitting and Underfitting: Overfitting occurs when a model is too complex and fits the
training data too well, but fails to generalize to new data. On the other hand, underfitting occurs
when a model is too simple and is not able to capture the underlying relationships in the data.
4. Model Interpretability: Complex models can be difficult to interpret and understand, making it challenging to explain the model’s decisions and predictions. This can be an issue when it comes to making business decisions or gaining stakeholder buy-in.
5. Privacy and Ethical Considerations: Data science often involves the collection and analysis of
sensitive personal information, leading to privacy and ethical concerns. It is important to consider
privacy implications and ensure that data is used in a responsible and ethical manner.
6. Technical Challenges: Technical challenges can arise during the data science process such as data
storage and processing, algorithm selection, and computational scalability.
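Overfitting (issue 3 above) can be demonstrated in a few lines with scikit-learn on hypothetical noisy data: an unconstrained decision tree memorizes the training set perfectly yet scores lower on held-out data, while a depth-limited tree is less flexible and usually generalizes better:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data where 30% of the labels are random noise
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = ((X[:, 0] > 0) ^ (rng.random(200) < 0.3)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# An unconstrained tree memorizes the training noise (overfitting)
deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print("deep tree    train:", deep.score(X_tr, y_tr),
      "test:", round(deep.score(X_te, y_te), 2))

# A depth-limited tree is less flexible and usually generalizes better
shallow = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_tr, y_tr)
print("shallow tree train:", round(shallow.score(X_tr, y_tr), 2),
      "test:", round(shallow.score(X_te, y_te), 2))
```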
