Data Science Unit 1
1.1 Introduction
1.2 Terminology
1.3 Data science process
1.4 Data science toolkit
1.5 Types of Data
1.6 Example Applications
1.7 Mathematical Foundations for Data Science
INTRODUCTION TO CORE CONCEPTS AND TECHNOLOGY
What is Data Science?
● Data science brings together ideas from data analysis, machine learning, and related strategies to
understand and dissect real-world phenomena with data.
● It is an extension of data analysis fields such as data mining, statistics, and predictive analytics.
● It is a broad field that draws on methods and concepts from other disciplines, including
information science, statistics, mathematics, and computer science.
● Techniques used in data science include machine learning, visualization, pattern recognition,
probability modeling, data engineering, signal processing, and more.
Data science is important because it enables organizations to make informed decisions based on data and
evidence, rather than intuition and guesswork. Here are some of the key reasons why it is important:
● Improved decision-making: DS provides organizations with a systematic way to analyze data and
make informed decisions. Decisions grounded in data, rather than intuition, are more likely to
drive growth and success.
● Increased efficiency: DS can be used to optimize operations, reduce waste, and improve
efficiency. By using data to identify inefficiencies and areas for improvement, organizations can
streamline their operations and increase their competitiveness.
● Better customer experiences: DS can be used to analyze customer data, such as purchase history
and behavior, to personalize marketing campaigns and improve customer experiences. This can
help organizations increase customer satisfaction and loyalty.
● Innovation: DS can be used to identify new opportunities for growth and innovation. By
analyzing data, organizations can identify new market trends, customer needs, and areas for
innovation, and develop new products and services to meet these needs.
● Improved risk management: DS can be used to analyze risk and uncertainty, such as fraud and
cyber threats, to improve risk management and minimize potential losses.
● A better understanding of complex systems: DS can be used to analyze complex systems, such as
the human body, the global economy, or the environment, to gain a deeper understanding of these
systems and identify ways to improve them.
● Improved decision-making in critical domains: In critical domains such as healthcare, criminal
justice, and finance, data science can help in decision-making by analyzing relevant data to
produce evidence-based outcomes, reducing bias, and improving fairness.
Types of Analysis
Analysis of data is a vital part of running a successful business. When data is used effectively, it leads to
better understanding of a business’s previous performance and better decision-making for its future
activities. There are many ways that data can be utilized, at all levels of a company’s operations.
There are four types of data analysis that are in use across all industries. While we separate these into
categories, they are all linked together and build upon each other. As you begin moving from the simplest
type of analytics to more complex, the degree of difficulty and resources required increases. At the same
time, the level of added insight and value also increases.
● Descriptive Analysis
● Diagnostic Analysis
● Predictive Analysis
● Prescriptive Analysis
Below, we will introduce each type and give examples of how they are utilized in business.
Descriptive Analysis
The first type of data analysis is descriptive analysis. It is at the foundation of all data insight. It is the
simplest and most common use of data in business today. Descriptive analysis answers the “what
happened” by summarizing past data, usually in the form of dashboards.
The biggest use of descriptive analysis in business is to track Key Performance Indicators (KPIs). KPIs
describe how a business is performing based on chosen benchmarks.
Business applications of descriptive analysis include:
● KPI dashboards
● Monthly revenue reports
● Sales leads overview
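As a rough illustration of descriptive analysis, the sketch below summarizes past data into a monthly revenue report with pandas; the dataset and its column names (order_date, revenue) are hypothetical.

```python
import pandas as pd

# Hypothetical sales records; in practice these would come from a database or CSV export.
sales = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-11", "2024-02-25"]),
    "revenue": [1200.0, 950.0, 1430.0, 1100.0],
})

# "What happened": summarize past data into a monthly revenue report.
monthly_revenue = sales.groupby(sales["order_date"].dt.to_period("M"))["revenue"].sum()
print(monthly_revenue)
```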
Diagnostic Analysis
Diagnostic analysis takes the insights found from descriptive analytics and drills down to find the causes
of those outcomes. Organizations make use of this type of analytics as it creates more connections
between data and identifies patterns of behavior.
A critical aspect of diagnostic analysis is creating detailed information. When a new problem arises, you
may already have collected data pertaining to the issue. Having that data at your disposal ends the need
to repeat work and makes related problems easier to connect.
Typical business applications of diagnostic analysis involve drilling down into the outcomes surfaced by
descriptive reports, such as a KPI dashboard, to explain why they occurred.
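In the same spirit as the descriptive sketch above, a diagnostic drill-down often amounts to slicing the same data along extra dimensions to locate a cause; here is a minimal pandas sketch with made-up columns (month, region, revenue).

```python
import pandas as pd

# Hypothetical records behind a KPI that dropped; all names and values are made up.
orders = pd.DataFrame({
    "month":   ["Jan", "Jan", "Feb", "Feb", "Feb"],
    "region":  ["North", "South", "North", "South", "South"],
    "revenue": [1000, 1200, 400, 1150, 1180],
})

# Descriptive view says February revenue fell; drill down by region to ask "why".
by_region = orders.pivot_table(index="region", columns="month",
                               values="revenue", aggfunc="sum")
print(by_region)   # the North region accounts for the drop
```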
Predictive Analysis
This type of analysis is another step up from the descriptive and diagnostic analyses. Predictive analysis
uses the data we have summarized to make logical predictions of the outcomes of events. This analysis
relies on statistical modeling, which requires added technology and manpower to forecast. It is also
important to understand that a forecast is only an estimate: the accuracy of predictions relies on
high-quality, detailed data.
While descriptive and diagnostic analysis are common practices in business, predictive analysis is where
many organizations begin to show signs of difficulty. Some companies do not have the manpower to
implement predictive analysis everywhere they would like. Others are not yet willing to invest in analysis
teams across every department, or are not prepared to educate their current teams.
Business applications of predictive analysis include:
● Risk assessment
● Sales forecasting
● Using customer segmentation to determine which leads have the best chance of converting
● Predictive analytics in customer success teams
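As a minimal sketch of the idea (not a production forecasting system), the snippet below fits a simple linear regression with scikit-learn on hypothetical monthly sales and extrapolates one month ahead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical history: month index vs. units sold.
months = np.array([[1], [2], [3], [4], [5], [6]])
units  = np.array([110, 125, 139, 151, 167, 180])

model = LinearRegression().fit(months, units)

# Forecast the next month; remember this is only an estimate, and its
# accuracy depends on the quality and detail of the underlying data.
print(model.predict(np.array([[7]])))
```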
Prescriptive Analysis
The final type of data analysis is the most sought after, but few organizations are truly equipped to
perform it. Prescriptive analysis is the frontier of data analysis, combining the insight from all previous
analyses to determine the course of action to take in a current problem or decision.
Prescriptive analysis utilizes state-of-the-art technology and data practices. It is a huge organizational
commitment, and companies must be sure that they are ready and willing to put forth the effort and
resources.
Artificial Intelligence (AI) is a perfect example of prescriptive analytics. AI systems consume a large
amount of data to continuously learn and use this information to make informed decisions. Well-designed
AI systems are capable of communicating these decisions and even putting those decisions into action.
With artificial intelligence, business processes can be performed and optimized daily without human
intervention.
Currently, most of the big data-driven companies (Apple, Facebook, Netflix, etc.) are utilizing
prescriptive analytics and AI to improve decision-making. For other organizations, the jump to predictive
and prescriptive analytics can seem insurmountable. As technology continues to improve and more
professionals are educated in data, we will see more companies entering the data-driven realm.
Definitions
Data Lake:
A data lake is a centralized repository that allows organizations to store vast amounts of raw and
unstructured data, such as text, images, videos, and more. Unlike traditional databases, data lakes do not
enforce a specific structure on the data, making them highly flexible and suitable for big data storage and
analysis.
Data Warehouse:
A data warehouse is a structured and organized database designed for reporting and data analysis. It
consolidates data from various sources, transforms it into a consistent format, and stores it for querying
and reporting. Data warehouses are optimized for complex queries and are essential for business
intelligence and decision support.
Data Mart:
A data mart is a subset of a data warehouse. It contains a specific, often domain-focused, portion of data
that is tailored to the needs of a particular group or department within an organization. Data marts are
created to enhance data accessibility and relevance for a particular business area.
Chinese Walls:
In the context of information security and access control, Chinese walls refer to a mechanism or set of
rules that prevent the exchange of information between different departments or groups within an
organization. The purpose is to maintain confidentiality and prevent conflicts of interest, especially in
financial and legal settings.
Life Cycle of Data Science
The data science life cycle is a systematic approach to solving complex problems using data. It consists of
several key stages, including data collection, data storage, data processing, data description, data
modeling, data presentation, and automation. Here's an overview of each stage in the data science life
cycle:
1. Data Collection:
- In this initial stage, data scientists gather relevant data from various sources. This can include
structured data from databases, unstructured data from text, images, and other sources, and even data from
sensors and IoT devices.
- Data collection methods may include web scraping, surveys, data imports, or connecting to APIs to
retrieve data.
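For illustration, here is a minimal API-based collection sketch in Python using the third-party requests library; the endpoint URL is hypothetical, and a real API may require authentication and pagination.

```python
import requests

# Hypothetical endpoint; replace with a real API you are authorized to use.
URL = "https://api.example.com/v1/measurements"

response = requests.get(URL, params={"limit": 100}, timeout=10)
response.raise_for_status()     # fail loudly on HTTP errors
records = response.json()       # assumes the API returns a JSON list
print(f"collected {len(records)} records")
```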
2. Data Storage:
- Once the data is collected, it needs to be stored in a suitable environment. This typically involves a
data repository or database.
- Proper data storage ensures data accessibility, security, and scalability. Common database systems
include SQL databases, NoSQL databases, and data lakes.
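As a small-scale illustration, the sketch below stores collected readings in SQLite via Python's standard library; the table and column names are made up, and production systems would typically use a server database or a data lake instead.

```python
import sqlite3

# A small on-disk SQL store; names here are hypothetical.
conn = sqlite3.connect("measurements.db")
conn.execute("CREATE TABLE IF NOT EXISTS readings (sensor TEXT, value REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)",
                 [("s1", 20.5), ("s2", 21.3)])
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM readings").fetchone())
conn.close()
```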
3. Data Processing:
- Data often requires cleaning, transformation, and preprocessing to make it suitable for analysis. This
stage involves data cleaning, data normalization, handling missing values, and feature engineering.
- Data processing also includes exploratory data analysis (EDA), where data scientists gain insights and
understand the data's characteristics.
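A minimal pandas sketch of this stage, on a hypothetical table with a missing value: it fills the gap with the median and normalizes a feature.

```python
import pandas as pd

# Hypothetical raw data with a missing value and an unscaled feature.
raw = pd.DataFrame({"age": [25, None, 40, 31],
                    "income": [30000, 52000, 61000, 45000]})

clean = raw.copy()
clean["age"] = clean["age"].fillna(clean["age"].median())   # handle missing values
clean["income_norm"] = (clean["income"] - clean["income"].mean()) / clean["income"].std()  # normalization
print(clean)
```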
4. Data Description:
- In this stage, data scientists describe the data using statistical and visual methods. This involves
summarizing key statistics, creating visualizations, and identifying patterns or anomalies in the data.
- Descriptive statistics and data visualization techniques are commonly used to convey insights.
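For instance, a quick description pass might combine summary statistics with a simple plot; the sample below is hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

# A hypothetical sample of heights.
df = pd.DataFrame({"height_cm": [158, 172, 165, 180, 169, 175, 162]})

print(df.describe())   # count, mean, std, quartiles, min/max

df["height_cm"].plot(kind="hist", bins=5, title="Height distribution")
plt.xlabel("height (cm)")
plt.show()
```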
5. Data Modeling:
- Data modeling is the heart of the data science process. It involves building predictive and analytical
models to solve specific problems.
- Machine learning and statistical modeling techniques are applied to the prepared data to make
predictions, classifications, or to gain deeper insights.
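As one small example of this stage, the sketch below trains a decision-tree classifier with scikit-learn on the bundled iris dataset and reports held-out accuracy; a real project would involve its own data and model selection.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```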
6. Data Presentation:
- The results and insights obtained from data modeling need to be communicated effectively. Data
scientists use various visualization and reporting tools to present their findings.
- Data visualization, dashboards, and reports are created to make the results accessible and
understandable to stakeholders.
7. Automation:
- In many cases, data science solutions are not one-time endeavors but ongoing processes. To make
data-driven decision-making sustainable, automation is often implemented.
- This can involve setting up automated data pipelines, model retraining, and real-time data monitoring.
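A toy sketch of such automation: the pipeline stages below are stand-in stubs, and the commented loop hints at scheduling, which in practice is usually handled by cron or a workflow orchestrator.

```python
import time

def collect_data():
    return [1, 2, 3]                      # stand-in for the real collection step

def process(data):
    return [x * 2 for x in data]          # stand-in for cleaning/transformation

def retrain(data):
    return sum(data) / len(data)          # stand-in for model fitting

def publish(result):
    print("report:", result)              # stand-in for dashboards or alerts

def run_pipeline():
    publish(retrain(process(collect_data())))

if __name__ == "__main__":
    run_pipeline()
    # A crude daily schedule; real pipelines normally use cron or an orchestrator:
    # while True:
    #     run_pipeline()
    #     time.sleep(24 * 60 * 60)
```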
The data science life cycle is often iterative, as insights gained at any stage may necessitate revisiting
previous stages to refine the process. It's a continuous cycle of collecting, storing, processing, describing,
modeling, presenting, and automating data to solve real-world problems and make data-driven decisions.
Data Science Toolkit
A data science toolkit typically consists of a combination of software, programming languages, libraries,
and tools that data scientists use to collect, analyze, and visualize data, build machine learning models,
and perform various data-related tasks. Here is a list of some essential components of a data science
toolkit:
1. Programming Languages:
- Python: Widely used for data analysis, machine learning, and data visualization. Libraries like NumPy,
Pandas, Matplotlib, and Scikit-Learn are popular for data science tasks.
- R: Another popular language for data analysis and statistics, known for its extensive set of statistical
packages.
6. Machine Learning Libraries:
- Scikit-Learn: A Python library with a wide range of machine learning algorithms and tools for model
evaluation and selection.
- TensorFlow and PyTorch: Deep learning libraries for building neural networks.
7. Version Control:
- Git: A widely used version control system for tracking changes in code and collaborating with others.
8. Data Collection Tools:
- Libraries like Beautiful Soup and Scrapy for web scraping, and APIs for collecting data.
The specific tools and technologies you use in your data science toolkit can vary based on your projects,
preferences, and the nature of your data analysis tasks. Data scientists often adapt and expand their
toolkits as they gain experience and work on different projects.
Types of Data
Data is a collection of raw, unorganized facts that need to be processed. After the data is processed, we
can decide whether it can be used to prove or disprove a hypothesis.
Different kinds of data call for different treatment, so it is essential to know what type of data you are
working with. Once you understand the kind of data, you will be able to interpret and analyze it
effectively.
There are two main data types: numerical and categorical or, in other words, quantitative and qualitative.
a) Numerical data
Numerical, or quantitative, data is a type of data that represents numbers rather than natural language
descriptions, so it can only be collected in a numeric form.
Quantitative data supports arithmetic operations (addition, subtraction, multiplication, and division);
typical examples are measurements of a person's weight and height.
It is also divided into two subsets: discrete data and continuous data:
Discrete data:
The main feature of this data type is that it is countable, meaning that it can take only certain values,
such as 1, 2, 3, and so on; a discrete dataset can be either finite or infinite.
Examples of this type of data are age in whole years, the number of children you want to have (a
non-negative integer, because you can't have 1.5 or −2 kids), and the number of sugar cubes in a jar. All
of these examples are finite: they can be counted from beginning to end. But if you tried to count all the
sugar cubes in the world, the counting could never practically be completed; such data is treated as
countably infinite.
Continuous data:
Continuous data is a type of data with uncountable elements. It is represented as a set of intervals on a
number line. Just like discrete data, continuous data can also be either finite or infinite.
Examples of continuous data are the measure of weight, height, area, distance, time, etc. This type of data
can be further divided into interval data and ratio data.
Interval data:
Interval data is measured along a scale, in which each point is placed at an equal distance, or interval,
from one another.
Ratio data:
Ratio data is almost the same as the previous type, but the main difference is that it has a true zero
point. For instance, temperature measured in Kelvin has such a zero: 0 K is equal to −273.15 degrees
Celsius, or −459.67 degrees Fahrenheit.
b) Categorical data
Categorical, or qualitative, data is information divided into groups or categories using labels or names. In
such a dataset, each item is placed in a single category depending on its qualities. All categories are
mutually exclusive.
Numbers in this type of data carry no mathematical meaning, i.e., no arithmetic operations can be
performed on them.
A good example of categorical data arises when you fill out a job application form. You may be asked to
specify your level of education; for instance, you choose MSc from the available options because you fall
into that particular category.
Categorical data is further divided into nominal data and ordinal data.
Nominal data:
Nominal data, also known as naming data, is descriptive and has a function of labeling or naming
variables. Elements of this type of data do not have any order, or numerical value, and cannot be
measured. Nominal data is usually collected via questionnaires or surveys. E.g.: Person's name, eye color,
clothes brand.
Ordinal data:
This type of data represents elements that are ordered, ranked, or used on a rating scale. Generally
speaking, these are categories with an implied order. Though ordinal data can be counted and ranked,
like nominal data it cannot be measured.
Examples of ordinal data include customer satisfaction rating, Likert scale, and income level.
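These distinctions map directly onto tools: for example, pandas lets you mark a variable as unordered (nominal) or ordered (ordinal). A small sketch with made-up values:

```python
import pandas as pd

# Nominal: labels with no order (e.g. eye color).
eyes = pd.Categorical(["brown", "blue", "green"], ordered=False)

# Ordinal: labels with an implied order (e.g. a satisfaction rating).
rating = pd.Categorical(["low", "high", "medium"],
                        categories=["low", "medium", "high"], ordered=True)

print(rating.min(), "<", rating.max())   # order-aware operations work on ordinal data
# eyes.min() would raise a TypeError: nominal categories have no order
```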
Example Applications
Data science can be applied in a wide range of industries and fields, including:
● Healthcare: DS is used to analyze patient data and improve disease diagnosis, treatment, and
patient outcomes.
● Finance: DS is used to analyze financial data, such as stock prices and market trends, to make
informed investment decisions.
● Retail: DS is used to analyze customer data, such as purchase history and behavior, to personalize
marketing campaigns and improve customer experience.
● Transportation: DS is used to optimize routes, reduce fuel consumption, and improve overall
efficiency in the transportation industry.
● Manufacturing: DS is used to optimize production processes, reduce waste, and improve product
quality in the manufacturing industry.
● Energy: DS is used to optimize energy consumption, improve energy efficiency, and develop new
renewable energy sources.
● Marketing: DS is used to analyze customer data, such as demographics and behavior, to inform
targeted marketing campaigns and improve customer acquisition and retention.
● Sports: DS is used to analyze player performance, injury rates, and team strategy to inform
coaching decisions and improve player performance.
● Education: DS is used to analyze student performance data, teacher effectiveness, and school
programs to improve educational outcomes and student achievement.
● Government: DS is used to analyze various types of data, such as crime statistics and economic
indicators, to inform public policy and improve government services.
Mathematical Foundations for Data Science
1. Linear Algebra
Matrices and Vectors: Fundamental building blocks for data representation.
Matrix Operations: Addition, multiplication, and inversion.
Eigenvalues and Eigenvectors: Key for understanding data transformations.
Singular Value Decomposition (SVD): Used in dimensionality reduction techniques like PCA.
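For a concrete taste, NumPy computes eigenvalues and the SVD directly; the matrix below is an arbitrary example.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])             # an arbitrary symmetric matrix

eigvals, eigvecs = np.linalg.eig(A)    # eigenvalues are 3 and 1 here
U, S, Vt = np.linalg.svd(A)            # A = U @ np.diag(S) @ Vt

print(eigvals)
print(S)   # for this symmetric positive-definite A, singular values match the eigenvalues
```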
2. Probability Theory
Random Variables: Discrete and continuous variables.
Probability Distributions: Normal, binomial, Poisson distributions.
Bayes’ Theorem: Foundation for Bayesian inference.
Expectation and Variance: Measures of central tendency and dispersion.
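A short worked example of Bayes' theorem in plain Python, using made-up numbers for a hypothetical diagnostic test:

```python
# Bayes' theorem: P(D | +) = P(+ | D) * P(D) / P(+)
# Hypothetical test: 1% prevalence, 99% sensitivity, 5% false-positive rate.
p_d = 0.01          # P(disease)
p_pos_d = 0.99      # P(positive | disease)
p_pos_nd = 0.05     # P(positive | no disease)

p_pos = p_pos_d * p_d + p_pos_nd * (1 - p_d)   # total probability of a positive test
p_d_pos = p_pos_d * p_d / p_pos

print(round(p_d_pos, 3))   # ~0.167: a positive result is far from certain
```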
3. Statistics
Descriptive Statistics: Mean, median, mode, standard deviation.
Inferential Statistics: Hypothesis testing, confidence intervals.
Regression Analysis: Linear and logistic regression for predictive modeling.
Correlation and Causation: Understanding relationships between variables.
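As a small illustration combining descriptive and inferential statistics, the sketch below runs a two-sample t-test with SciPy on hypothetical measurements.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements from two groups.
a = np.array([12.1, 11.8, 12.4, 12.0, 11.9])
b = np.array([12.8, 13.1, 12.9, 13.4, 12.7])

print("mean:", a.mean(), "std:", a.std(ddof=1))   # descriptive statistics

t, p = stats.ttest_ind(a, b)                      # inferential: two-sample t-test
print(f"t = {t:.2f}, p = {p:.4f}")                # a small p-value suggests the means differ
```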
4. Calculus
Differentiation: Understanding rates of change, used in optimization.
Integration: Area under curves, useful in probability and statistics.
Multivariable Calculus: Partial derivatives, gradients, used in machine learning algorithms.
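Both ideas can be approximated numerically in a few lines; the sketch below differentiates and integrates f(x) = x^2 with NumPy.

```python
import numpy as np

f = lambda x: x**2

# Differentiation via central finite differences: f'(3) should be close to 6.
h = 1e-6
print((f(3 + h) - f(3 - h)) / (2 * h))

# Integration via the trapezoidal rule: the area under x^2 on [0, 1] is 1/3.
x = np.linspace(0.0, 1.0, 1001)
y = f(x)
print(np.sum((y[:-1] + y[1:]) / 2 * np.diff(x)))
```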
5. Optimization
Gradient Descent: Iterative method for finding local minima.
Convex Optimization: Ensures global minima, used in many machine learning algorithms.
Constrained Optimization: Techniques like Lagrange multipliers for problems with constraints.
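A minimal gradient-descent sketch on a one-dimensional convex function, chosen so the true minimum (x = 3) is known:

```python
# Gradient descent on f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
def grad(x):
    return 2 * (x - 3)

x = 0.0                    # starting point
lr = 0.1                   # learning rate
for _ in range(100):
    x -= lr * grad(x)      # step against the gradient

print(round(x, 4))         # converges to the minimum at x = 3
```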
6. Discrete Mathematics
Graph Theory: Nodes and edges, used in network analysis.
Combinatorics: Counting methods, useful in probability.
Boolean Algebra: Logical operations, foundational for computer science.
7. Numerical Methods
Root Finding Algorithms: Newton-Raphson method.
Numerical Integration: Trapezoidal and Simpson’s rule.
Optimization Algorithms: Simplex method, used in linear programming.
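For instance, a bare-bones Newton-Raphson iteration for the square root of 2:

```python
# Newton-Raphson for the root of f(x) = x^2 - 2 (i.e. sqrt(2)).
def f(x):
    return x**2 - 2

def df(x):
    return 2 * x

x = 1.0                        # initial guess
for _ in range(6):
    x -= f(x) / df(x)          # x_{n+1} = x_n - f(x_n) / f'(x_n)

print(x)                       # ≈ 1.41421356
```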
Applications in Data Science
Machine Learning: Algorithms such as SVMs and neural networks rely heavily on these mathematical
foundations.
Data Analysis: Techniques like clustering, dimensionality reduction.
Natural Language Processing (NLP): Understanding and processing human language data.
Computer Vision: Image processing and analysis.