Unit-1 - Introduction to Data Science

The document provides an introduction to data science, covering key concepts such as data types, the importance of data, and the data science lifecycle. It outlines the components of a data science course, the various analyses used in data science, and the significance of data in decision-making. Additionally, it discusses the stages involved in analyzing data, including data collection, preparation, exploration, and feature engineering.

Contents

1 Introduction to Data Science 1


1.1 What is Data? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Why is data important? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Categories of Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Structured vs Unstructured Data . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.2 Qualitative vs Quantitative Data . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 What is Information? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 What is Data Science? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.5 Components of Data Science Course . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.6 Why do we Analyze Data? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.7 What is Data Science used for? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.8 How Do We Analyze Data or Data Science Lifecycle . . . . . . . . . . . . . . . . . . . 9
1.9 Data Science Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.10 Data Science Jobs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.11 Sample Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Chapter 1

Introduction to Data Science

1.1 What is Data?


• Data is a collection of facts, information, and statistics and this can be in various forms such as numbers,
text, sound, images, or any other format.
• According to the Oxford Dictionary, “data is distinct pieces of information, usually formatted in a special way”.
• Data can be measured, collected, reported, and analyzed, whereupon it is often visualized using graphs,
images, or other analysis tools.
• Raw data (“unprocessed data”) may be a collection of numbers or characters before it has been “cleaned” and
corrected by researchers. It must be corrected to remove outliers as well as instrument or data-entry
errors.
• Data processing commonly occurs in stages, and therefore the “processed data” from one stage could also
be considered the “raw data” of subsequent stages.
• Data can be generated by:
– Humans
– Machines
– Human-machine combinations

1.1.1 Why is data important?


• Data helps in making better decisions.
• Data helps in solving problems by finding the reason for underperformance.
• Data helps one to evaluate performance.
• Data helps one to improve processes.
• Data helps one understand consumers and the market.

1.2 Categories of Data
1.2.1 Structured vs Unstructured Data
• Structured Data: This type of data is organized into a specific format, making it easy to search, analyze,
and process. Structured data is found in relational databases and includes information like numbers, dates,
and categories.
• Unstructured Data: Unstructured data does not conform to a specific structure or format. It includes
text documents, images, videos, and other data that cannot be easily organized or analyzed without
additional processing.
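The difference can be seen in a short sketch (plain Python, with hypothetical customer and review data): structured rows support direct queries, while unstructured text needs extra processing first.

```python
# Structured data: rows with a fixed schema, easy to query directly.
customers = [
    {"id": 1, "name": "Asha", "age": 29, "city": "Pune"},
    {"id": 2, "name": "Ravi", "age": 34, "city": "Delhi"},
]
names_in_pune = [c["name"] for c in customers if c["city"] == "Pune"]

# Unstructured data: free text with no schema; extracting the same
# information needs extra processing (here, a crude keyword search).
review = "Asha from Pune wrote: the delivery was quick and the product works."
mentions_pune = "Pune" in review

print(names_in_pune)   # ['Asha']
print(mentions_pune)   # True
```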

1.2.2 Qualitative vs Quantitative Data


• Qualitative Data: This data is descriptive. For example – She is beautiful, He is tall, etc.

– Nominal Data – Nominal data is a basic data type that categorizes data by labeling or naming values
such as gender, hair color, or types of animal. It does not have any hierarchy.
– Ordinal Data – Ordinal data involves classifying data based on rank, such as social status in
categories like ‘wealthy’, ‘middle income’, or ‘poor’. However, there are no set intervals between
these categories.

• Quantitative Data: This is numerical information. For example – A horse has four legs.

– Discrete Data: It has a particular fixed value and can be counted.

– Continuous Data: It is not fixed but has a range of values and can be measured.
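A small sketch (plain Python, hypothetical values) illustrating these sub-types: arithmetic is meaningful on quantitative data, while qualitative data supports only labeling and, for ordinal data, ranking.

```python
from statistics import mean

# Quantitative data: numeric, so arithmetic is meaningful.
legs_per_animal = [4, 4, 2, 6]          # discrete: counted, fixed values
heights_cm = [162.5, 171.0, 158.2]      # continuous: measured on a range
print(round(mean(heights_cm), 1))       # 163.9

# Qualitative (categorical) data: labels, not numbers.
hair_color = ["black", "brown", "black"]                     # nominal: no order
income_rank = {"poor": 0, "middle income": 1, "wealthy": 2}  # ordinal: ranked
statuses = ["middle income", "poor", "wealthy"]
# Ordinal categories can be sorted by rank, but the gaps between ranks
# are not fixed intervals, so averaging them is not meaningful.
print(sorted(statuses, key=income_rank.get))
# ['poor', 'middle income', 'wealthy']
```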

1.3 What is Information?
• Information is data that has been processed, organized, or structured in a way that makes it meaningful,
valuable, and useful.
• Information is data that has been given context, relevance, and purpose.
• Information gives knowledge, understanding, and insights that can be used for decision-making, problem-
solving, communication, and various other purposes.

1.4 What is Data Science?
• The term “data science” combines two key elements: “data” and “science”.
• Data: It refers to the raw information that is collected, stored, and processed.
• In today’s digital age, enormous amounts of data are generated from various sources such as sensors, social
media, transactions, and more. This data can come in structured formats (e.g., databases) or unstructured
formats (e.g., text, images, videos).
• Science: It refers to the systematic study and investigation of phenomena using scientific methods and
principles.
• Science involves forming hypotheses, conducting experiments, analyzing data, and drawing conclusions
based on evidence.

1.5 Components of Data Science Course
• Data science is a multidisciplinary approach that combines math and statistics, programming, analytics,
and machine learning with domain expertise to derive value and uncover actionable insights hidden in an
organization’s vast and complex datasets.

Figure 1.1: Components of Data Science

• Foundational Concepts: Introduction to basic concepts in data science, including data types, data manipu-
lation, data cleaning, and exploratory data analysis.
• Programming Languages: Programming languages commonly used in data science, such as Python or R.
Students learn how to write code to analyze and manipulate data, create visualizations, and build machine
learning models.
• Mathematics: Linear Algebra and Calculus
• Statistical Methods: Coverage of statistical techniques and methods used in data analysis, hypothesis
testing, regression analysis, and probability theory.
• Machine Learning: Introduction to machine learning algorithms, including supervised learning, unsu-
pervised learning, and deep learning. Students learn how to apply machine learning techniques to solve
real-world problems and make predictions from data.
• Data Visualization: Instruction in data visualization techniques and tools for effectively communicating
insights from data. Students learn how to create plots, charts, and interactive visualizations to explore and
present data.
• Domain Expertise: Domain expertise binds the other components of data science together. It means
specialized knowledge or skill in a particular area. In data science, there are various areas for which we
need domain experts.

1.6 Why do we Analyze Data?
• When we put these two elements together, “Data + Science” refers to the scientific study of data to extract
meaningful insights.
• These insights can be used to guide decision making and strategic planning.
• Essentially, data science is about using scientific methods to unlock the potential of data, uncover patterns,
make predictions, and drive informed decision-making across various domains and industries.

1.7 What is Data Science used for?


• Descriptive Analysis or Descriptive Statistics (Hindsight):
– Descriptive analysis examines data to gain insights into what happened or what is happening in the
data environment.
– It is characterized by data visualizations such as pie charts, bar charts, line graphs, tables, or generated
narratives.
– For example, a flight booking service may record data like the number of tickets booked each day.
Descriptive analysis will reveal booking spikes, booking slumps, and high-performing months for
this service.
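The booking example above can be sketched in a few lines (plain Python, hypothetical daily ticket counts): descriptive statistics alone are enough to surface spikes and slumps.

```python
from statistics import mean

# Hypothetical daily ticket counts for a flight booking service.
daily_bookings = [120, 135, 128, 410, 390, 140, 131, 125, 402, 118]

avg = mean(daily_bookings)
# Flag days well above the average as booking spikes.
spikes = [b for b in daily_bookings if b > 1.5 * avg]

print(f"average/day: {avg:.1f}")   # average/day: 209.9
print(f"spike days:  {spikes}")    # spike days:  [410, 390, 402]
```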

• Diagnostic Analysis (Insight):


– Diagnostic analysis is a deep-dive or detailed data examination to understand why something hap-
pened.
– It is characterized by techniques such as drill-down, data discovery, data mining, and correlations.
– Multiple data operations and transformations may be performed on a given data set to discover unique
patterns in each of these techniques.
– For example, the flight service might drill down on a particularly high-performing month to better
understand the booking spike. This may lead to the discovery that many customers visit a particular
city to attend a monthly sporting event.

• Predictive Analysis (Foresight/Decision):
– Predictive analysis uses historical data to make accurate forecasts about data patterns that may occur
in the future.
– It is characterized by techniques such as machine learning, forecasting, pattern matching, and
predictive modeling. In each of these techniques, computers are trained to reverse engineer causality
connections in the data.
– For example, the flight service team might use data science to predict flight booking patterns for the
coming year at the start of each year. The computer program or algorithm may look at past data and
predict booking spikes for certain destinations in May. Having anticipated their customer’s future
travel requirements, the company could start targeted advertising for those cities from February.

• Prescriptive analysis (Context/Action):


– Prescriptive analytics takes predictive data to the next level. It not only predicts what is likely to
happen but also suggests an optimum response to that outcome.
– It can analyze the potential implications of different choices and recommend the best course of action.
It uses graph analysis, simulation, complex event processing, neural networks, and recommendation
engines from machine learning.
– Back to the flight booking example, prescriptive analysis could look at historical marketing campaigns
to maximize the advantage of the upcoming booking spike. A data scientist could project booking
outcomes for different levels of marketing spend on various marketing channels. These data forecasts
would give the flight booking company greater confidence in their marketing decisions.

1.8 How Do We Analyze Data or Data Science Lifecycle
The data science project lifecycle involves various roles, tools, and processes, which enable us to discover
knowledge and meaningful information from raw data and glean actionable insights.

Typically, a data science project undergoes several stages. Here’s a breakdown of the key aspects involved:

1. Business Understanding or Define Goals and Questions:


• Before you can even start on a data science project, it is critical that you understand the problem you
are trying to solve.
• Are you trying to plan seasonal line-ups, understand customer behavior, or make forecasts?
• Clearly defined goals, together with analysis techniques that are practical for them, are the key
factors in keeping the project aligned with those goals.
• We typically use data science to answer five types of questions:
– How much or how many? (regression)
– Which category? (classification)
– Which group? (clustering)
– Is this weird? (anomaly detection)
– Which option should be taken? (recommendation)
• In this stage, you should also identify the central objectives of your project by determining the
variables that need to be predicted.
– If it’s a regression, it could be something like a sales forecast.
– If it’s a clustering, it could be a customer profile.

2. Data Mining or Data Acquisition/Ingestion
• The lifecycle begins with data collection: gathering both raw structured and unstructured data from all
relevant sources using a variety of methods.
• These methods can include manual entry, web scraping, and real-time streaming data from systems
and devices.
• Data sources can include structured data, such as customer data, along with unstructured data like log
files, video, audio, pictures, the Internet of Things (IoT), social media, and more.
• At this stage, some of the questions worth considering are:
– What data do I need for my project?
– Where does it live?
– How can I obtain it?
– What is the most efficient way to store and access all of it?
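As a minimal ingestion sketch (hypothetical in-memory CSV; real pipelines would read from files, databases, or APIs), Python's standard `csv` module parses structured records:

```python
import csv
import io

# Hypothetical booking data, as it might arrive from an export or API.
raw = "date,tickets\n2024-05-01,410\n2024-05-02,395\n"

# DictReader maps each row to column names from the header line.
rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[0])    # {'date': '2024-05-01', 'tickets': '410'}
print(len(rows))  # 2
```

Note that all values arrive as strings; converting them to proper types is part of the next stage, data preparation.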
3. Data Preparation or Data scrubbing or Data Cleaning
• The data cleaning process allows you to correct inconsistencies, errors, and missing values, which
helps to produce a clear picture based on high-quality information.
• This process (also referred to as "data janitor work") can often take 50 to 80 percent of a data scientist’s time.
• The reason why this is such a time consuming process is simply because there are so many possible
scenarios that could necessitate cleaning. For instance,
– The data could also have inconsistencies within the same column, meaning that some rows
could be labelled 0 or 1, and others could be labelled no or yes.
– The data types could also be inconsistent - some of the 0s might be integers, whereas some of them
could be strings.
– If we’re dealing with a categorical data type with multiple categories, some of the categories
could be misspelled or have different cases, such as having categories for both male and Male.
– One of the steps that is often forgotten in this stage, causing a lot of problems later on, is the
presence of missing data. Missing data can throw a lot of errors in the model creation and
training.
• This is just a subset of examples where you can see inconsistencies, and it’s important to catch and
fix them in this stage.
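The inconsistencies listed above can be fixed in a few lines; this is a minimal sketch in plain Python, with a hypothetical three-row dataset (real projects would typically use a library such as pandas).

```python
rows = [
    {"subscribed": "yes", "gender": "Male",   "age": "34"},
    {"subscribed": "1",   "gender": "male",   "age": None},
    {"subscribed": "no",  "gender": "FEMALE", "age": "29"},
]

TRUTHY = {"yes", "1", "true"}

cleaned = []
for row in rows:
    cleaned.append({
        # unify 0/1 and yes/no labels into a single boolean
        "subscribed": row["subscribed"].lower() in TRUTHY,
        # normalize inconsistent casing ("male" vs "Male")
        "gender": row["gender"].lower(),
        # coerce string numbers to int; keep None to flag missing data
        "age": int(row["age"]) if row["age"] is not None else None,
    })

missing_age = sum(1 for r in cleaned if r["age"] is None)
print(cleaned[0])   # {'subscribed': True, 'gender': 'male', 'age': 34}
print(missing_age)  # 1
```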
4. Data Exploration or Exploratory data analytics (EDA) or Statistical Analysis/Methods
• Now that you’ve got a sparkling clean set of data, you’re ready to finally get started in your analysis.
Here, data scientists conduct an exploratory data analysis to examine biases, patterns, ranges, and
distributions of values within the data.
• We will apply Exploratory Data Analysis (EDA) by using various statistical formulas and visualization
tools to understand the relations between variables and to see what the data can tell us.
• Using all of this information, you start to form hypotheses about your data and the problem you are
tackling.
– If you were predicting student scores for example, you could try visualizing the relationship
between scores and sleep.
– If you were predicting real estate prices, you could perhaps plot the prices as a heat map on a
spatial plot to see if you can catch any trends.
• This exploration drives hypothesis generation for A/B testing. It also allows analysts
to determine the data’s relevance for use within modeling efforts for predictive analytics, machine
learning, and/or deep learning. Depending on a model’s accuracy, organizations can come to rely
on these insights for business decision-making, allowing them to apply them at greater scale.

5. Feature Engineering
• We typically perform two types of tasks in feature engineering - Feature Selection and Feature
Construction.
• Feature selection is the process of cutting down the features that add more noise than information.
This is typically done to avoid the curse of dimensionality, which refers to the increased complexity
that arises from high-dimensional spaces (i.e. way too many features).
• Feature construction involves creating new features from the ones that you already have. For
example, if you have a feature for age, but your model only cares about if a person is an adult or
minor, you could threshold it at 18, and assign different categories to instances above and below that
threshold.
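Both tasks can be sketched in plain Python, using the age example above plus a hypothetical near-constant feature for the selection step.

```python
from statistics import pvariance

# Feature construction: turn a numeric age feature into an
# adult/minor category by thresholding at 18.
ages = [12, 25, 17, 18, 40]
age_group = ["adult" if a >= 18 else "minor" for a in ages]
print(age_group)   # ['minor', 'adult', 'minor', 'adult', 'adult']

# Feature selection (a simple variance filter): a feature with (near)
# zero variance is constant and adds noise rather than information.
features = {
    "age": [12, 25, 17, 18, 40],
    "country_code": [91, 91, 91, 91, 91],   # constant column
}
selected = [name for name, col in features.items() if pvariance(col) > 0]
print(selected)    # ['age']
```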
6. Model-building
• In this phase, model building starts, and machine learning finally comes into
your data science project. We will create datasets for training and testing purposes.
• Based on the questions you asked in the business understanding stage, this is where you decide which
model to pick for your problem. This is never an easy decision, and there is no single right answer.
• The model (or models) that you end up training will be dependent on the size, type and quality of
your data, how much time and computational resources you are willing to invest, and the type of
output you intend to derive.
• There are a couple of different cheat sheets available online which have a flowchart that helps you
decide the right algorithm based on the type of classification or regression problem you are trying to
solve.
https://learn.microsoft.com/en-us/azure/machine-learning/algorithm-cheat-sheet?view=azureml-api-1
https://blogs.sas.com/content/subconsciousmusings/2020/12/09/machine-learning-algorithm-use
• Once you’ve trained your model, it is critical that you evaluate its success. A process called k-fold
cross validation is commonly used to measure the accuracy of a model. It involves separating the
dataset into k equally sized groups of instances, training on all the groups except one, and repeating
the process with a different group left out each time. This allows every part of the data to be used
for both training and evaluation, rather than relying on a single train-test split.
• For classification models, we often test accuracy using PCC (percent correct classification), along
with a confusion matrix which breaks down the errors into false positives and false negatives. Plots
such as ROC curves, which plot the true positive rate against the false positive rate, are also
used to benchmark the success of a model. For a regression model, the common metrics include the
coefficient of determination (which gives information about the goodness of fit of a model), mean
squared error (MSE), and mean absolute error (MAE).
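The k-fold idea can be sketched by hand (in practice one would typically use scikit-learn's `KFold` and metric functions); the split function and the regression metrics below use hypothetical toy values.

```python
# A by-hand sketch of k-fold cross-validation index splitting.
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for i in range(k):
        test = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train, test

# Each instance appears in exactly one test fold across the k rounds.
for train, test in k_fold_splits(6, 3):
    print(train, test)
# [2, 3, 4, 5] [0, 1]
# [0, 1, 4, 5] [2, 3]
# [0, 1, 2, 3] [4, 5]

# The regression metrics mentioned above, on toy true/predicted values:
y_true = [3.0, 5.0, 2.0]
y_pred = [2.5, 5.0, 4.0]
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(mse, 3), round(mae, 3))   # 1.417 0.833
```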
7. Data Visualization
• Charts, graphs, and dashboards are the tools of data visualization; they make it easy to identify
patterns, trends, and findings that would be unclear in raw numbers.
• Once you’ve derived the intended insights from your model, you have to represent them in a way that
the different key stakeholders in the project can understand.

1.9 Data Science Applications
Data Science has a wide array of applications across various industries, significantly impacting the way businesses
operate and how services are delivered. Here are some key applications of Data Science:

• Image recognition:
– When you upload a photo and it suggests tagging your friends, or when your phone unlocks by
recognizing your face, this is made possible by image recognition algorithms.
• Speech recognition:
– When you say something to "Ok Google, Siri, Cortana", etc., and these devices respond to your
voice commands, this is made possible by speech recognition algorithms.
• Recommendation systems: Many companies, such as Amazon, Netflix, and Google Play, use data
science to create a better user experience with personalized recommendations. For example, when you
search for something on Amazon, you then start getting suggestions for similar products.
• Healthcare:
– Predictive Analytics: Predicting disease outbreaks, patient readmissions, and individual health risks.
– Medical Imaging: Enhancing image recognition to diagnose conditions from X-rays, MRIs, and CT
scans.
– Personalized Medicine: Tailoring treatment plans based on genetic information and patient history.
• Finance:
– Risk Management: Identifying and mitigating financial risks through predictive modeling.
– Fraud Detection: Analyzing transactions to detect fraudulent activities.
– Algorithmic Trading: Using data-driven algorithms to execute high-frequency trading strategies.

• Marketing:
– Customer Segmentation: Grouping customers based on purchasing behavior and preferences for
targeted marketing.
– Sentiment Analysis: Analyzing customer feedback and social media interactions to gauge public
sentiment.
– Predictive Analytics: Forecasting sales trends and customer lifetime value.
• Retail:
– Inventory Management: Optimizing stock levels based on demand forecasting.
– Recommendation Systems: Providing personalized product recommendations to customers.
– Price Optimization: Adjusting prices dynamically based on market trends and consumer behavior.
• Transportation:
– Route Optimization: Enhancing logistics by determining the most efficient routes.
– Predictive Maintenance: Forecasting equipment failures to schedule timely maintenance.
– Autonomous Vehicles: Developing self-driving cars using machine learning algorithms.
• Education:
– Personalized Learning: Creating customized learning experiences based on student performance
and preferences.
– Academic Analytics: Analyzing data to improve student retention and graduation rates.
– Curriculum Development: Using data to develop and refine educational programs.
• Entertainment:
– Content Recommendation: Suggesting movies, shows, and music based on user preferences.
– Audience Analytics: Understanding audience behavior to improve content delivery.
– Production Analytics: Optimizing production schedules and budgets through data analysis.
• Manufacturing:
– Quality Control: Using data to monitor and improve product quality.
– Supply Chain Optimization: Streamlining supply chain processes through predictive analytics.
– Process Automation: Implementing automated systems for efficient production workflows.
• Energy:
– Smart Grids: Enhancing the efficiency and reliability of energy distribution.
– Predictive Maintenance: Forecasting and preventing equipment failures in power plants.
– Energy Consumption Analytics: Analyzing patterns to optimize energy usage and reduce costs.
• Government:
– Public Safety: Analyzing crime data to improve law enforcement strategies.
– Urban Planning: Using data to plan and develop smarter cities.
– Policy Making: Leveraging data to make informed decisions and create effective policies.

1.10 Data Science Jobs
• Data Scientist:
– Responsibilities: Analyzing large datasets, developing machine learning models, interpreting results,
and providing insights to inform business decisions.
– Skills: Proficiency in programming languages like Python or R, expertise in statistics and machine
learning algorithms, data visualization skills, and domain knowledge in the relevant industry.
• Machine Learning Engineer:
– Responsibilities: Building and deploying machine learning models at scale, optimizing model
performance, and integrating them into production systems.
– Skills: Proficiency in programming languages like Python or Java, experience with machine learning
frameworks like TensorFlow or PyTorch, knowledge of cloud platforms like AWS or Azure, and
software engineering skills for developing scalable solutions.
• Data Analyst:
– Responsibilities: Collecting, cleaning, and analyzing data to identify trends, patterns, and insights.
Often involves creating reports and dashboards to communicate findings to stakeholders.
– Skills: Strong proficiency in SQL for data querying, experience with data visualization tools like
Tableau or Power BI, basic statistical knowledge, and familiarity with Excel or Google Sheets.
• Business Intelligence (BI) Analyst:
– Responsibilities: Gathering requirements from business stakeholders, designing and developing BI
reports and dashboards, and providing data-driven insights to support strategic decision-making.
– Skills: Proficiency in BI tools like Tableau, Power BI, or Looker, strong SQL skills for data querying,
understanding of data visualization principles, and ability to translate business needs into technical
solutions.
• Data Engineer:
– Responsibilities: Designing and building data pipelines to collect, transform, and store large volumes
of data. Ensuring data quality, reliability, and scalability.
– Skills: Expertise in database systems like SQL and NoSQL, proficiency in programming languages
like Python or Java, experience with big data technologies like Hadoop or Spark, and knowledge of
data warehousing concepts.
• Data Architect:
– Responsibilities: Designing the overall structure of data systems, including databases, data lakes,
and data warehouses. Defining data models, schemas, and data governance policies.
– Skills: Deep understanding of database technologies and architectures, experience with data
modeling tools like ERWin or Visio, knowledge of data integration techniques, and familiarity
with data security and compliance regulations.
• Other Data-Driven Fields:
– Marketing Analyst: Marketing analysts harness data to understand how customers behave, evaluate
campaigns, and strategically improve marketing models.
– Financial Analyst: They utilize information to measure financial risk and returns, provide advice for
investment purposes and financial decision-making.
– Quantitative Analyst: By applying complex financial mathematics models and analytics, they
conduct qualitative and quantitative analyses of financial risks and devise trading strategies.
– Data Security Analyst: Their job is to secure sensitive data from unauthorized access, data breaches,
and other cybersecurity threats.

1.11 Sample Questions
1. What is data science?
2. Is unstructured data organized in a predefined manner?
3. Which of the following is a key component of data science?
(a) Data mining
(b) Data entry
(c) Data deletion
(d) Data loss
4. What is the primary purpose of analyzing data in data science?
5. Briefly discuss three applications of data science in real-world scenarios.
6. Outline the data science lifecycle, briefly explaining each stage.
7. Compare and contrast structured and unstructured data, providing examples of each.
8. Examine the role of data science in business decision-making processes.
9. Classify different types of data science jobs and briefly describe their primary responsibilities.
10. Analyze the advantages and disadvantages of different sampling methods in data collection.
11. Discuss how domain knowledge influences different stages of a data science project, including:
(a) Problem formulation
(b) Data collection and preprocessing
(c) Feature selection and engineering
(d) Model selection and interpretation
(e) Communication of results
12. Analyze how the different components of the data science lifecycle work together to provide a comprehensive
understanding of the project.
13. Distinguish between the roles of a Data Scientist, Data Analyst, and Data Engineer in a typical data science
project.
14. Analyze the impact of data science on healthcare, providing specific examples of applications and benefits.
15. Evaluate the significance of the data science lifecycle in real-world applications. Choose two stages of the
lifecycle and explain their importance in ensuring the success of a data science project.
16. Assess the potential impact of data science on job markets over the next decade. Consider both the creation
of new roles and the transformation of existing ones. How might this impact influence educational and
career choices for current students?
17. Create a proposal for a data science project that could benefit your university. Outline the problem, data
needed, and potential outcomes.

