
UNIT 1 DATA SCIENCE

Introduction to Core Concepts & Technologies

What is Data Science?


Data science is an interdisciplinary field that involves extracting insights and
knowledge from structured and unstructured data using scientific methods,
processes, algorithms, and systems. It combines elements of statistics,
mathematics, computer science, and domain knowledge to analyse and interpret
complex data sets.

The main goal of data science is to uncover patterns, extract meaningful information,
and generate actionable insights from data to aid in decision-making, solve
problems, and drive innovation. Data scientists utilize a range of tools, techniques,
and programming languages to collect, clean, analyze, and visualize data. They
apply statistical and machine learning models to make predictions, build data-driven
models, and uncover hidden patterns within the data.

Basic Terminology Involved


1. Data Science: The field that combines scientific methods, algorithms, and
systems to extract knowledge and insights from structured and unstructured
data.

2. Machine Learning: A branch of artificial intelligence that focuses on the
development of algorithms and models that enable computers to learn from
and make predictions or decisions based on data without being explicitly
programmed.

3. Artificial Intelligence (AI): The theory and development of computer systems
capable of performing tasks that would typically require human intelligence,
such as speech recognition, decision-making, or visual perception.

4. Predictive Analytics: The practice of using historical data and statistical
techniques to make predictions or forecasts about future events or outcomes.

5. Big Data: Large and complex datasets that cannot be easily managed,
processed, or analyzed using traditional data processing techniques. Big data
often involves high-volume, high-velocity, and high-variety data.

6. Data Mining: The process of discovering patterns, relationships, or insights
from large datasets using various statistical and machine learning techniques.

ACET

7. Data Visualization: The representation of data and information in graphical or
visual forms, such as charts, graphs, and maps, to facilitate understanding,
exploration, and communication of patterns and trends in the data.

8. Natural Language Processing (NLP): The branch of artificial intelligence and
linguistics that focuses on enabling computers to understand, interpret, and
generate human language, both written and spoken.

9. Feature Engineering: The process of selecting, transforming, and creating
input variables (features) from raw data to improve the performance and
accuracy of machine learning models.

10. Deep Learning: A subfield of machine learning that utilizes artificial neural
networks with multiple layers to learn and extract hierarchical representations
of data, often used for tasks such as image and speech recognition.

11. Regression Analysis: A statistical modelling technique used to investigate
the relationship between a dependent variable and one or more independent
variables, with the goal of predicting or estimating the value of the dependent
variable.

12. Classification: A machine learning task that involves assigning predefined
categories or labels to instances based on their characteristics or features.

13. Clustering: A data exploration technique that involves grouping similar data
points or objects together based on their inherent similarities or patterns.

14. Cross-Validation: A technique used to evaluate and validate the performance
of machine learning models by partitioning the available data into subsets for
training and testing.

15. Overfitting: A situation in machine learning where a model becomes too
specialized or closely fits the training data, resulting in poor performance
when applied to new, unseen data.
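Several of the terms above (classification, cross-validation, model evaluation) come together in a short sketch. This is a minimal illustration, assuming scikit-learn is installed; the choice of the Iris dataset and logistic regression is arbitrary, not something the notes prescribe:

```python
# Illustrative sketch: 5-fold cross-validation of a classifier with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)          # 150 samples, 4 features, 3 classes
model = LogisticRegression(max_iter=1000)  # a simple classification model

# Partition the data into 5 folds: train on 4 folds, test on the held-out fold,
# and repeat until every fold has served as the test set exactly once.
scores = cross_val_score(model, X, y, cv=5)
print(len(scores))             # 5 accuracy values, one per fold
print(round(scores.mean(), 2)) # average accuracy across folds
```

A large gap between training accuracy and cross-validated accuracy is one practical symptom of overfitting.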

These are just a few examples of data science terminology. The field of data science
is vast and continuously evolving, so there are many more terms and concepts to
explore.


Data Science Process Life Cycle


The data science process, also known as the data science lifecycle or data science
methodology, is a systematic approach to solving problems and extracting insights
from data. It typically consists of several iterative steps that data scientists follow to
address a specific business question or problem using data. While there can be
variations in the specific steps and their order, here is a general overview of the data
science process:

1. Problem Definition: The first step is to clearly define the problem or question
you want to address. This involves understanding the business context,
identifying the objectives, and formulating a well-defined problem statement.

2. Data Acquisition: In this step, you gather the relevant data necessary to
solve the problem. This can involve obtaining data from various sources such
as databases, APIs, web scraping, or other data collection methods.

3. Data Cleaning and Preprocessing: Raw data often contains errors, missing
values, outliers, or inconsistencies. In this step, you clean and preprocess the
data to ensure its quality and suitability for analysis. This may include tasks
like handling missing values, removing duplicates, standardizing formats, and
transforming variables.

4. Exploratory Data Analysis (EDA): EDA involves exploring and analyzing the
data to gain insights and understand its characteristics. This can include
summarizing the data, visualizing distributions, identifying patterns, and
detecting relationships between variables. EDA helps to identify potential
features for modelling and understand any limitations or biases in the data.

5. Feature Engineering: Feature engineering is the process of creating new
features or transforming existing ones to improve the performance of machine
learning models. This step may involve techniques like feature selection,
dimensionality reduction, encoding categorical variables, creating interaction
terms, or generating time-based features.

6. Model Selection and Training: In this step, you select an appropriate
modelling technique based on the problem and the data at hand. This can
involve choosing from a variety of algorithms such as regression,
classification, clustering, or deep learning. You then train the selected model
on your prepared dataset.

7. Model Evaluation: Once the model is trained, you evaluate its performance
using appropriate evaluation metrics. This helps you assess how well the
model is performing and whether it meets the desired objectives. You may
need to fine-tune the model parameters or try different algorithms to improve
its performance.

8. Model Deployment: After the model has been evaluated and meets the
desired criteria, it can be deployed into a production environment. This step
involves integrating the model into an application, setting up data pipelines,
and ensuring the model can handle new data in real-time.

9. Model Monitoring and Maintenance: Once the model is deployed, it needs
to be monitored to ensure it continues to perform well over time. This involves
tracking its predictions, monitoring data quality, retraining the model
periodically with new data, and making necessary updates or improvements
as required.

It's important to note that the data science process is often iterative, with feedback
and insights gained at each step influencing decisions made in previous steps. This
iterative nature allows for continuous improvement and refinement of the analysis.
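The steps above can be sketched end to end in a few lines of Python. Everything here is a made-up minimal example (the inline housing-style dataset, the column names, and the choice of linear regression are all illustrative assumptions), not a prescribed implementation:

```python
# Minimal end-to-end sketch of the lifecycle: acquire -> clean -> explore ->
# engineer features -> train -> evaluate. Dataset and columns are invented.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Steps 1-2: problem definition + data acquisition (here, an inline toy dataset)
df = pd.DataFrame({
    "rooms": [2, 3, 3, 4, 4, 5, 2, 3, 4, 5],
    "area":  [60, 75, 80, 100, 110, 130, 55, 78, 105, 140],
    "price": [120, 150, 160, 200, 215, 260, 110, 155, 210, 280],
})

# Step 3: cleaning -- drop duplicates and rows with missing values
df = df.drop_duplicates().dropna()

# Step 4: EDA -- summary statistics
print(df.describe())

# Step 5: feature engineering -- e.g. a derived "area per room" feature
df["area_per_room"] = df["area"] / df["rooms"]

# Step 6: model selection and training
X = df[["rooms", "area", "area_per_room"]]
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# Step 7: model evaluation on held-out data
mse = mean_squared_error(y_test, model.predict(X_test))
print("test MSE:", mse)
```

Deployment and monitoring (steps 8 and 9) would then wrap this trained model in a service and track its predictions over time.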

DATA SCIENCE TOOLKIT


The field of data science utilizes a wide range of tools and technologies to analyze,
manipulate, and derive insights from data. Here are some commonly used tools in
the data science toolkit:

1. Programming Languages:

 Python: Python is one of the most popular programming languages for
data science due to its extensive libraries and frameworks such as
NumPy, pandas, scikit-learn, TensorFlow, and PyTorch.


 R: R is another widely used language for statistical computing and
graphics. It provides a rich ecosystem of packages for data
manipulation, visualization, and modelling.

2. Integrated Development Environments (IDEs):

 Jupyter Notebook: Jupyter Notebook is a web-based interactive
development environment that allows data scientists to create and
share documents containing code, visualizations, and narrative text.

 PyCharm: PyCharm is a popular Python IDE that provides advanced
code editing, debugging, and profiling capabilities for data science
projects.

 RStudio: RStudio is an integrated development environment
specifically designed for R, providing a powerful editor, debugging
tools, and package management.

3. Data Manipulation and Analysis:

 pandas: pandas is a Python library widely used for data manipulation
and analysis. It provides data structures and functions for cleaning,
transforming, and analyzing structured data.

 SQL: Structured Query Language (SQL) is essential for working with
relational databases. It allows data scientists to extract, transform, and
analyze data using SQL queries.

 dplyr: dplyr is an R package that offers a set of functions for data
manipulation and transformation, enabling easy filtering, summarizing,
and joining of data.

4. Data Visualization:

 Matplotlib: Matplotlib is a popular Python library for creating static,
animated, and interactive visualizations. It provides a wide range of
plotting functions and customization options.

 Seaborn: Seaborn is a Python library built on top of Matplotlib that
offers additional statistical visualizations and enhanced aesthetics.

 ggplot2: ggplot2 is an R package based on the grammar of graphics. It
provides a flexible system for creating visually appealing and highly
customizable plots.

5. Machine Learning and Data Modelling:

 scikit-learn: scikit-learn is a comprehensive machine learning library for
Python. It includes a wide range of algorithms for classification,
regression, clustering, and dimensionality reduction, along with tools
for model evaluation and selection.

 caret: caret is an R package that provides a unified interface for
training and evaluating machine learning models. It offers a wide range
of algorithms, pre-processing techniques, and tools for model tuning.

 TensorFlow and Keras: TensorFlow is a popular open-source library for
machine learning and deep learning, while Keras is a high-level neural
networks API that runs on top of TensorFlow. They provide tools for
building and training neural networks.

6. Big Data Processing:

 Apache Spark: Apache Spark is a distributed computing framework
that enables processing and analysis of large-scale data. It offers high-
performance data manipulation, machine learning, and graph
processing capabilities.

 Hadoop: Hadoop is an open-source framework that allows distributed
processing of large datasets across clusters of computers. It provides a
scalable and fault-tolerant environment for big data processing.

7. Data Version Control and Collaboration:

 Git: Git is a widely used version control system that helps manage
code and track changes. It allows collaboration among team members,
facilitates code sharing, and helps maintain project integrity.

8. Cloud Platforms:

 Amazon Web Services (AWS): AWS provides a suite of cloud
services that can be leveraged for data storage, processing, and
analysis, including services like Amazon S3, Amazon Redshift, and
Amazon EMR.

 Google Cloud Platform (GCP): GCP offers a range of cloud-based
tools and services for data storage, processing, and machine learning,
such as Google BigQuery, Google Cloud Storage, and Google AI
Platform.

 Microsoft Azure: Microsoft Azure provides a comprehensive set of
cloud-based tools and services for data storage, analytics, and
machine learning, including Azure Data Lake, Azure Databricks, and
Azure Machine Learning.


Types of Data

Data is broadly classified into four major categories:

Qualitative or Categorical Data

Qualitative or categorical data is data that cannot be measured or counted in the
form of numbers. This type of data is sorted by category rather than by number,
which is why it is also known as categorical data. It can consist of audio, images,
symbols, or text. Examples include:

 The language you speak

 Favorite holiday destination

 Opinion on something (agree, disagree, or neutral)

 Colours

Qualitative data is further classified into two types:

Nominal Data

Nominal Data is used to label variables without any order or quantitative value. The
color of hair can be considered nominal data, as one color can’t be compared with
another color.


Examples of Nominal Data:

 Colour of hair (Blonde, Red, Brown, Black, etc.)
 Marital status (Single, Widowed, Married)
 Nationality (Indian, German, American)
 Gender (Male, Female, Others)
 Eye colour (Black, Brown, etc.)

Ordinal Data

Ordinal data has a natural ordering, in which values are arranged in some kind of
order by their position on a scale. Such data is used for observations like customer
satisfaction, happiness, etc., but arithmetic operations cannot be performed on it.

Examples of Ordinal Data:

 Feedback, experience, or satisfaction ratings that companies ask for on a
scale of 1 to 10
 Letter grades in an exam (A, B, C, D, etc.)
 Ranking of people in a competition (First, Second, Third, etc.)
 Economic status (High, Medium, and Low)
 Education level (Higher, Secondary, Primary)

Quantitative Data

Quantitative data can be expressed in numerical values, making it countable and
suitable for statistical data analysis. This kind of data is also known as numerical
data. It answers questions like “how much,” “how many,” and “how often.”

Quantitative data is further classified into two types:

Discrete Data

The term discrete means distinct or separate. Discrete data contains values that are
integers or whole numbers; the total number of students in a class is an example.
Such data cannot be broken into decimal or fractional values.

Discrete data is countable and has finite values; it cannot be subdivided. It is mainly
represented by a bar graph, number line, or frequency table.

Examples of Discrete Data:

 Total number of students present in a class
 Cost of a cell phone
 Number of employees in a company
 Total number of players who participated in a competition
 Days in a week

Continuous Data

Continuous data is in the form of fractional numbers. Examples include the version
of an Android phone, the height of a person, and the length of an object. Continuous
data represents information that can be divided into smaller levels, and a continuous
variable can take any value within a range.

Examples of Continuous Data:

 Height of a person
 Speed of a vehicle
 “Time-taken” to finish the work
 Wi-Fi Frequency
 Market share price
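The four categories above can be made concrete with pandas, where each maps naturally onto a column type. The column names and values here are invented examples:

```python
# Sketch: representing nominal, ordinal, discrete, and continuous data in pandas.
import pandas as pd

df = pd.DataFrame({
    # Nominal: labels with no order
    "eye_colour": pd.Categorical(["Brown", "Black", "Brown"]),
    # Ordinal: labels with a natural order (Primary < Secondary < Higher)
    "education": pd.Categorical(
        ["Primary", "Secondary", "Higher"],
        categories=["Primary", "Secondary", "Higher"],
        ordered=True),
    # Discrete: whole-number counts
    "students": [30, 42, 28],
    # Continuous: any value within a range
    "height_m": [1.62, 1.75, 1.58],
})

print(df.dtypes)
# Ordered categories support comparison; nominal ones do not.
print(df["education"].min())  # "Primary" is the lowest ordered category
```

Note that arithmetic is meaningful on the discrete and continuous columns, while the categorical columns only support labelling (nominal) or ordering (ordinal), matching the definitions above.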

Data science Applications

Data science has a wide range of applications across various industries and
domains. Here are some common and prominent applications of data science:

1. Business Intelligence: Data science is extensively used in business
intelligence to analyze past performance, identify trends, and make data-
driven decisions. It helps businesses optimize their operations, improve
efficiency, and enhance overall performance.

2. Predictive Analytics: Data science is used to build predictive models that
can forecast future outcomes based on historical data. This is applicable in
various fields like sales forecasting, demand prediction, customer behavior
analysis, and risk assessment.

3. Machine Learning: Data science forms the foundation of machine learning
algorithms, enabling systems to learn from data and make decisions or
predictions without explicit programming. Applications include
recommendation systems, image recognition, natural language processing,
and more.

4. Healthcare Analytics: Data science plays a crucial role in healthcare by
analysing patient data to improve diagnoses, optimize treatment plans, predict
disease outbreaks, and discover potential drug interactions.


5. Financial Analysis: In finance, data science is used for fraud detection, credit
risk assessment, algorithmic trading, portfolio optimization, and customer
segmentation for personalized financial services.

6. Marketing Analytics: Data science helps marketers analyze customer
behaviour, preferences, and engagement patterns to create targeted
advertising campaigns, optimize marketing strategies, and improve customer
retention.

7. Social Media Analysis: Data science is used to analyze social media data,
understand user sentiment, track trends, and identify influencers, which helps
businesses with their marketing and reputation management.

8. Supply Chain Optimization: Data science is applied to optimize supply chain
operations, improve inventory management, and streamline logistics to
reduce costs and enhance efficiency.

9. Environmental Monitoring: Data science is used to analyze environmental
data from sensors and satellites to monitor climate change, air and water
quality, and predict natural disasters.

10. Recommendation Systems: Data science powers recommendation engines
used by e-commerce platforms, streaming services, and other applications to
suggest relevant products, movies, or content to users based on their
preferences.

11. Image and Speech Recognition: Data science techniques are employed to
develop image recognition systems used in self-driving cars, security
surveillance, and medical imaging, as well as speech recognition applications
like virtual assistants.

12. Sports Analytics: In sports, data science is used for player performance
analysis, game strategy optimization, injury prediction, and fan engagement.

