
UNIT 1 DATA SCIENCE

Introduction to Core Concepts & Technologies

What is Data Science?


Data science is an interdisciplinary field that involves extracting insights and
knowledge from structured and unstructured data using scientific methods,
processes, algorithms, and systems. It combines elements of statistics,
mathematics, computer science, and domain knowledge to analyse and interpret
complex data sets.

The main goal of data science is to uncover patterns, extract meaningful information,
and generate actionable insights from data to aid in decision-making, solve
problems, and drive innovation. Data scientists utilize a range of tools, techniques,
and programming languages to collect, clean, analyze, and visualize data. They
apply statistical and machine learning models to make predictions, build data-driven
models, and uncover hidden patterns within the data.

Basic Terminology Involved


1. Data Science: The field that combines scientific methods, algorithms, and
systems to extract knowledge and insights from structured and unstructured
data.

2. Machine Learning: A branch of artificial intelligence that focuses on the
development of algorithms and models that enable computers to learn from
and make predictions or decisions based on data without being explicitly
programmed.

3. Artificial Intelligence (AI): The theory and development of computer systems
capable of performing tasks that would typically require human intelligence,
such as speech recognition, decision-making, or visual perception.

4. Predictive Analytics: The practice of using historical data and statistical
techniques to make predictions or forecasts about future events or outcomes.

5. Big Data: Large and complex datasets that cannot be easily managed,
processed, or analyzed using traditional data processing techniques. Big data
often involves high-volume, high-velocity, and high-variety data.

6. Data Mining: The process of discovering patterns, relationships, or insights
from large datasets using various statistical and machine learning techniques.

ACET

7. Data Visualization: The representation of data and information in graphical or
visual forms, such as charts, graphs, and maps, to facilitate understanding,
exploration, and communication of patterns and trends in the data.

8. Natural Language Processing (NLP): The branch of artificial intelligence and
linguistics that focuses on enabling computers to understand, interpret, and
generate human language, both written and spoken.

9. Feature Engineering: The process of selecting, transforming, and creating
input variables (features) from raw data to improve the performance and
accuracy of machine learning models.

10. Deep Learning: A subfield of machine learning that utilizes artificial neural
networks with multiple layers to learn and extract hierarchical representations
of data, often used for tasks such as image and speech recognition.

11. Regression Analysis: A statistical modelling technique used to investigate
the relationship between a dependent variable and one or more independent
variables, with the goal of predicting or estimating the value of the dependent
variable.

12. Classification: A machine learning task that involves assigning predefined
categories or labels to instances based on their characteristics or features.

13. Clustering: A data exploration technique that involves grouping similar data
points or objects together based on their inherent similarities or patterns.

14. Cross-Validation: A technique used to evaluate and validate the performance
of machine learning models by partitioning the available data into subsets for
training and testing.

15. Overfitting: A situation in machine learning where a model becomes too
specialized or closely fits the training data, resulting in poor performance
when applied to new, unseen data.
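Several of the terms above (classification, cross-validation, model evaluation) come together in a short sketch. This is a minimal illustration, assuming scikit-learn is installed; the choice of the Iris dataset and logistic regression is arbitrary, not something the notes prescribe:

```python
# Illustrative sketch: 5-fold cross-validation of a classifier with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)          # 150 samples, 4 features, 3 classes
model = LogisticRegression(max_iter=1000)  # a simple classification model

# Partition the data into 5 folds: train on 4 folds, test on the held-out fold,
# and repeat until every fold has served as the test set exactly once.
scores = cross_val_score(model, X, y, cv=5)
print(len(scores))             # 5 accuracy values, one per fold
print(round(scores.mean(), 2)) # average accuracy across folds
```

A large gap between training accuracy and cross-validated accuracy is one practical symptom of overfitting.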

These are just a few examples of data science terminology. The field of data science
is vast and continuously evolving, so there are many more terms and concepts to
explore.


Data Science Process Life Cycle


The data science process, also known as the data science lifecycle or data science
methodology, is a systematic approach to solving problems and extracting insights
from data. It typically consists of several iterative steps that data scientists follow to
address a specific business question or problem using data. While there can be
variations in the specific steps and their order, here is a general overview of the data
science process:

1. Problem Definition: The first step is to clearly define the problem or question
you want to address. This involves understanding the business context,
identifying the objectives, and formulating a well-defined problem statement.

2. Data Acquisition: In this step, you gather the relevant data necessary to
solve the problem. This can involve obtaining data from various sources such
as databases, APIs, web scraping, or other data collection methods.

3. Data Cleaning and Preprocessing: Raw data often contains errors, missing
values, outliers, or inconsistencies. In this step, you clean and preprocess the
data to ensure its quality and suitability for analysis. This may include tasks
like handling missing values, removing duplicates, standardizing formats, and
transforming variables.

4. Exploratory Data Analysis (EDA): EDA involves exploring and analyzing the
data to gain insights and understand its characteristics. This can include
summarizing the data, visualizing distributions, identifying patterns, and
detecting relationships between variables. EDA helps to identify potential
features for modelling and understand any limitations or biases in the data.

5. Feature Engineering: Feature engineering is the process of creating new
features or transforming existing ones to improve the performance of machine
learning models. This step may involve techniques like feature selection,
dimensionality reduction, encoding categorical variables, creating interaction
terms, or generating time-based features.

6. Model Selection and Training: In this step, you select an appropriate
modelling technique based on the problem and the data at hand. This can
involve choosing from a variety of algorithms such as regression,
classification, clustering, or deep learning. You then train the selected model
on your prepared dataset.

7. Model Evaluation: Once the model is trained, you evaluate its performance
using appropriate evaluation metrics. This helps you assess how well the
model is performing and whether it meets the desired objectives. You may
need to fine-tune the model parameters or try different algorithms to improve
its performance.

8. Model Deployment: After the model has been evaluated and meets the
desired criteria, it can be deployed into a production environment. This step
involves integrating the model into an application, setting up data pipelines,
and ensuring the model can handle new data in real-time.

9. Model Monitoring and Maintenance: Once the model is deployed, it needs
to be monitored to ensure it continues to perform well over time. This involves
tracking its predictions, monitoring data quality, retraining the model
periodically with new data, and making necessary updates or improvements
as required.

It's important to note that the data science process is often iterative, with feedback
and insights gained at each step influencing decisions made in previous steps. This
iterative nature allows for continuous improvement and refinement of the analysis.
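The steps above can be sketched end to end in a few lines of Python. Everything here is a made-up minimal example (the inline housing-style dataset, the column names, and the choice of linear regression are all illustrative assumptions), not a prescribed implementation:

```python
# Minimal end-to-end sketch of the lifecycle: acquire -> clean -> explore ->
# engineer features -> train -> evaluate. Dataset and columns are invented.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Steps 1-2: problem definition + data acquisition (here, an inline toy dataset)
df = pd.DataFrame({
    "rooms": [2, 3, 3, 4, 4, 5, 2, 3, 4, 5],
    "area":  [60, 75, 80, 100, 110, 130, 55, 78, 105, 140],
    "price": [120, 150, 160, 200, 215, 260, 110, 155, 210, 280],
})

# Step 3: cleaning -- drop duplicates and rows with missing values
df = df.drop_duplicates().dropna()

# Step 4: EDA -- summary statistics
print(df.describe())

# Step 5: feature engineering -- e.g. a derived "area per room" feature
df["area_per_room"] = df["area"] / df["rooms"]

# Step 6: model selection and training
X = df[["rooms", "area", "area_per_room"]]
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# Step 7: model evaluation on held-out data
mse = mean_squared_error(y_test, model.predict(X_test))
print("test MSE:", mse)
```

Deployment and monitoring (steps 8 and 9) would then wrap this trained model in a service and track its predictions over time.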

DATA SCIENCE TOOLKIT


The field of data science utilizes a wide range of tools and technologies to analyze,
manipulate, and derive insights from data. Here are some commonly used tools in
the data science toolkit:

1. Programming Languages:

 Python: Python is one of the most popular programming languages for
data science due to its extensive libraries and frameworks such as
NumPy, pandas, scikit-learn, TensorFlow, and PyTorch.


 R: R is another widely used language for statistical computing and
graphics. It provides a rich ecosystem of packages for data
manipulation, visualization, and modelling.

2. Integrated Development Environments (IDEs):

 Jupyter Notebook: Jupyter Notebook is a web-based interactive
development environment that allows data scientists to create and
share documents containing code, visualizations, and narrative text.

 PyCharm: PyCharm is a popular Python IDE that provides advanced
code editing, debugging, and profiling capabilities for data science
projects.

 RStudio: RStudio is an integrated development environment
specifically designed for R, providing a powerful editor, debugging
tools, and package management.

3. Data Manipulation and Analysis:

 pandas: pandas is a Python library widely used for data manipulation
and analysis. It provides data structures and functions for cleaning,
transforming, and analyzing structured data.

 SQL: Structured Query Language (SQL) is essential for working with
relational databases. It allows data scientists to extract, transform, and
analyze data using SQL queries.

 dplyr: dplyr is an R package that offers a set of functions for data
manipulation and transformation, enabling easy filtering, summarizing,
and joining of data.

4. Data Visualization:

 Matplotlib: Matplotlib is a popular Python library for creating static,
animated, and interactive visualizations. It provides a wide range of
plotting functions and customization options.

 Seaborn: Seaborn is a Python library built on top of Matplotlib that
offers additional statistical visualizations and enhanced aesthetics.

 ggplot2: ggplot2 is an R package based on the grammar of graphics. It
provides a flexible system for creating visually appealing and highly
customizable plots.

5. Machine Learning and Data Modelling:

 scikit-learn: scikit-learn is a comprehensive machine learning library for
Python. It includes a wide range of algorithms for classification,
regression, clustering, and dimensionality reduction, along with tools
for model evaluation and selection.

 caret: caret is an R package that provides a unified interface for
training and evaluating machine learning models. It offers a wide range
of algorithms, pre-processing techniques, and tools for model tuning.

 TensorFlow and Keras: TensorFlow is a popular open-source library for
machine learning and deep learning, while Keras is a high-level neural
networks API that runs on top of TensorFlow. They provide tools for
building and training neural networks.

6. Big Data Processing:

 Apache Spark: Apache Spark is a distributed computing framework
that enables processing and analysis of large-scale data. It offers high-
performance data manipulation, machine learning, and graph
processing capabilities.

 Hadoop: Hadoop is an open-source framework that allows distributed
processing of large datasets across clusters of computers. It provides a
scalable and fault-tolerant environment for big data processing.

7. Data Version Control and Collaboration:

 Git: Git is a widely used version control system that helps manage
code and track changes. It allows collaboration among team members,
facilitates code sharing, and helps maintain project integrity.

8. Cloud Platforms:

 Amazon Web Services (AWS): AWS provides a suite of cloud
services that can be leveraged for data storage, processing, and
analysis, including services like Amazon S3, Amazon Redshift, and
Amazon EMR.

 Google Cloud Platform (GCP): GCP offers a range of cloud-based
tools and services for data storage, processing, and machine learning,
such as Google BigQuery, Google Cloud Storage, and Google AI
Platform.

 Microsoft Azure: Microsoft Azure provides a comprehensive set of
cloud-based tools and services for data storage, analytics, and
machine learning, including Azure Data Lake, Azure Databricks, and
Azure Machine Learning.


Types of Data

Data is broadly classified into four major categories:

Qualitative or Categorical Data

Qualitative or categorical data is data that cannot be measured or counted in the
form of numbers. This type of data is sorted by category rather than by number,
which is why it is also known as categorical data. It can consist of audio, images,
symbols, or text. Examples include:

 The language you speak

 Favorite holiday destination

 Opinion on something (agree, disagree, or neutral)

 Colours

Qualitative data is further classified into two types:

Nominal Data

Nominal Data is used to label variables without any order or quantitative value. The
color of hair can be considered nominal data, as one color can’t be compared with
another color.


Examples of Nominal Data:

 Colour of hair (Blonde, Red, Brown, Black, etc.)
 Marital status (Single, Widowed, Married)
 Nationality (Indian, German, American)
 Gender (Male, Female, Others)
 Eye colour (Black, Brown, etc.)

Ordinal Data

Ordinal data has a natural ordering, in which values are arranged in some kind of
order by their position on a scale. Such data is used for observations like customer
satisfaction, happiness, etc., but arithmetic operations cannot be performed on it.

Examples of Ordinal Data:

 Feedback, experience, or satisfaction ratings that companies ask for on a
scale of 1 to 10
 Letter grades in an exam (A, B, C, D, etc.)
 Ranking of people in a competition (First, Second, Third, etc.)
 Economic status (High, Medium, and Low)
 Education level (Higher, Secondary, Primary)

Quantitative Data

Quantitative data can be expressed in numerical values, making it countable and
suitable for statistical data analysis. This kind of data is also known as numerical
data. It answers questions like “how much,” “how many,” and “how often.”

Quantitative data is further classified into two types:

Discrete Data

The term discrete means distinct or separate. Discrete data contains values that are
integers or whole numbers; the total number of students in a class is an example.
Such data cannot be broken into decimal or fractional values.

Discrete data is countable and has finite values; it cannot be subdivided. It is mainly
represented by a bar graph, number line, or frequency table.

Examples of Discrete Data:

 Total number of students present in a class
 Cost of a cell phone
 Number of employees in a company
 Total number of players who participated in a competition
 Days in a week

Continuous Data

Continuous data is in the form of fractional numbers. Examples include the version
of an Android phone, the height of a person, and the length of an object. Continuous
data represents information that can be divided into smaller levels, and a continuous
variable can take any value within a range.

Examples of Continuous Data:

 Height of a person
 Speed of a vehicle
 “Time-taken” to finish the work
 Wi-Fi Frequency
 Market share price
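The four categories above can be made concrete with pandas, where each maps naturally onto a column type. The column names and values here are invented examples:

```python
# Sketch: representing nominal, ordinal, discrete, and continuous data in pandas.
import pandas as pd

df = pd.DataFrame({
    # Nominal: labels with no order
    "eye_colour": pd.Categorical(["Brown", "Black", "Brown"]),
    # Ordinal: labels with a natural order (Primary < Secondary < Higher)
    "education": pd.Categorical(
        ["Primary", "Secondary", "Higher"],
        categories=["Primary", "Secondary", "Higher"],
        ordered=True),
    # Discrete: whole-number counts
    "students": [30, 42, 28],
    # Continuous: any value within a range
    "height_m": [1.62, 1.75, 1.58],
})

print(df.dtypes)
# Ordered categories support comparison; nominal ones do not.
print(df["education"].min())  # "Primary" is the lowest ordered category
```

Note that arithmetic is meaningful on the discrete and continuous columns, while the categorical columns only support labelling (nominal) or ordering (ordinal), matching the definitions above.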

Data science Applications

Data science has a wide range of applications across various industries and
domains. Here are some common and prominent applications of data science:

1. Business Intelligence: Data science is extensively used in business
intelligence to analyze past performance, identify trends, and make data-
driven decisions. It helps businesses optimize their operations, improve
efficiency, and enhance overall performance.

2. Predictive Analytics: Data science is used to build predictive models that
can forecast future outcomes based on historical data. This is applicable in
various fields like sales forecasting, demand prediction, customer behavior
analysis, and risk assessment.

3. Machine Learning: Data science forms the foundation of machine learning
algorithms, enabling systems to learn from data and make decisions or
predictions without explicit programming. Applications include
recommendation systems, image recognition, natural language processing,
and more.

4. Healthcare Analytics: Data science plays a crucial role in healthcare by
analysing patient data to improve diagnoses, optimize treatment plans, predict
disease outbreaks, and discover potential drug interactions.


5. Financial Analysis: In finance, data science is used for fraud detection, credit
risk assessment, algorithmic trading, portfolio optimization, and customer
segmentation for personalized financial services.

6. Marketing Analytics: Data science helps marketers analyze customer
behaviour, preferences, and engagement patterns to create targeted
advertising campaigns, optimize marketing strategies, and improve customer
retention.

7. Social Media Analysis: Data science is used to analyze social media data,
understand user sentiment, track trends, and identify influencers, which helps
businesses with their marketing and reputation management.

8. Supply Chain Optimization: Data science is applied to optimize supply chain
operations, improve inventory management, and streamline logistics to
reduce costs and enhance efficiency.

9. Environmental Monitoring: Data science is used to analyze environmental
data from sensors and satellites to monitor climate change, air and water
quality, and predict natural disasters.

10. Recommendation Systems: Data science powers recommendation engines
used by e-commerce platforms, streaming services, and other applications to
suggest relevant products, movies, or content to users based on their
preferences.

11. Image and Speech Recognition: Data science techniques are employed to
develop image recognition systems used in self-driving cars, security
surveillance, and medical imaging, as well as speech recognition applications
like virtual assistants.

12. Sports Analytics: In sports, data science is used for player performance
analysis, game strategy optimization, injury prediction, and fan engagement.

