Unit 1 Notes
Unit – I: Introduction
Introduction to Data Science – Evolution of Data Science – Data Science Roles – Stages
in a Data Science Project – Applications of Data Science in various fields – Data Security
Issues.
1. Introduction to Data Science
Data Science is a multidisciplinary field that combines statistics, computer science, and domain-specific knowledge to extract meaningful insights from structured and unstructured data. It involves several processes, including data collection, cleaning, analysis, visualization, and interpretation, and it encompasses various techniques from statistics, machine learning, data analysis, and related fields.
Importance: Data Science is crucial in today's data-driven world, enabling organizations to make
informed decisions, optimize operations, and identify new opportunities. It is applied in numerous
domains, impacting everything from healthcare to finance.
Real-Time Example: Think of a recommendation system like the one used by Netflix. Based on
your viewing history and preferences, it suggests movies and TV shows you might like. This is a
classic example of data science at work, using past data to predict future preferences.
Key skills in data science include:
Data Analysis
Machine Learning
Data Visualization
Programming (Python, R, SQL)
Critical Thinking
1. Machine Learning – Machine learning is the backbone of data science. A data scientist needs a solid grounding in statistics in order to design and apply learning algorithms.
2. Modeling – Modeling is closely tied to machine learning: you need to be good at identifying which algorithms are best suited to the given problem, which model can be used, and how to train that model.
3. Statistics and Probability – Statistics and probability form the core foundation of data science. A data scientist must be proficient in statistics and probability theory in order to formulate problems and produce the required results.
4. Programming Skills – Data science projects require basic programming skills and some fundamental knowledge of databases. The most common programming languages are Python, R, MATLAB, and Octave. Python in particular has become the most popular language in data science because it is easy to learn and supports a wide range of libraries.
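As a small illustration of the Python libraries mentioned above, the sketch below loads a table with pandas and computes a few basic statistics. The file name customers.csv and the age column are hypothetical stand-ins used only for this example.

# Minimal sketch of Python's data-science libraries in action.
# "customers.csv" and the "age" column are hypothetical example names.
import pandas as pd

df = pd.read_csv("customers.csv")           # load tabular data into a DataFrame
print(df.describe())                        # quick statistical summary of every numeric column
print(df["age"].mean(), df["age"].std())    # basic statistics on a single column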
After the data has been prepared, the next step is the modeling process, which runs your data through the chosen model; through this process you arrive at the final results.
2. Evolution of Data Science
Data science has evolved significantly over the past few decades, transforming from basic data analysis into a sophisticated, multidisciplinary field that encompasses statistics, machine learning, data mining, and big data technologies. Here is a detailed overview of the evolution of data science, illustrated with examples:
Statistics and Mathematics: The foundations of data science lie in statistics and mathematics. In
the 1960s and 1970s, data analysis primarily involved statistical methods for hypothesis testing
and data collection.
Business Intelligence (BI): Tools like SAS and IBM’s DB2 enabled businesses to perform
complex queries and generate reports, leading to the rise of BI.
Example: Walmart’s implementation of data warehousing and BI tools to analyze sales data and
optimize inventory management.
Data Warehousing: Technologies like data warehousing emerged, allowing for the storage of
large volumes of historical data for analysis. ETL (Extract, Transform, Load) processes became
standard for integrating data from various sources.
OLAP: Online Analytical Processing allowed for multidimensional analysis of data, enabling
more dynamic and flexible data exploration. (i) Relational OLAP (ROLAP) (ii)
Multidimensional OLAP (MOLAP) (iii) Hybrid OLAP (HOLAP)
Example: Amazon’s use of data warehousing and OLAP to analyze customer purchasing behavior
and recommend products.
Big Data Technologies: The explosion of data generated by the web and social platforms drove the adoption of distributed frameworks such as Hadoop, which made it possible to store and process data at a scale traditional databases could not handle.
Example: Facebook’s use of Hadoop to process petabytes of data generated by user interactions and to deliver personalized content.
Machine Learning: Advances in machine learning algorithms and the availability of large
datasets enabled more sophisticated predictive models and AI applications. Tools like TensorFlow
and PyTorch became popular.
Example: Google’s use of machine learning for search engine algorithms, speech recognition, and
image classification.
Interdisciplinary Field: Data science emerged as a distinct field, combining aspects of statistics,
computer science, and domain-specific knowledge. Universities started offering dedicated data
science programs.
Data-Driven Decision Making: Companies began to embed data science into their decision-
making processes, leveraging insights for competitive advantage.
Example: Netflix’s use of data science for content recommendation, personalization, and
optimizing content production based on viewer data.
Current Trends and Future Directions
Deep Learning and AI: The focus is shifting towards deep learning and AI, with applications in
natural language processing (NLP), computer vision, and autonomous systems.
Edge Computing: With IoT devices generating vast amounts of data, edge computing is becoming
important for real-time data processing closer to the data source.
Ethics and Privacy: As data science impacts more aspects of life, issues related to data privacy,
security, and ethical AI are becoming more prominent.
Example: Autonomous vehicles like Tesla using deep learning for real-time decision making and
navigation.
The evolution of data science has been driven by technological advancements, the
increasing availability of data, and the growing importance of data-driven decision-making
across various sectors.
As the field continues to evolve, it will likely incorporate more advanced AI techniques,
address ethical concerns, and integrate seamlessly with emerging technologies like IoT and
edge computing.
3. Data Science Roles
Data science involves multiple roles, each contributing uniquely to the data science project. Key roles include:
Data Scientist
Data Analyst
Data Engineer
Machine Learning Engineer
Data Architect
Business Intelligence Analyst
Data Visualization Specialist
Data Scientist
Retrieves data from the data warehouse and performs exploratory data analysis (EDA) to identify key features related to customer churn. Develops a predictive model using machine learning algorithms to identify customers at risk of churning.
Responsibilities:
Data Collection and Cleaning: Gathering data from various sources and ensuring it is
clean and ready for analysis.
Exploratory Data Analysis (EDA): Understanding data distributions and relationships
through statistical analysis and visualization.
Model Building: Developing predictive and prescriptive models using machine learning
algorithms.
Model Evaluation: Assessing model performance using metrics such as accuracy,
precision, recall, and F1-score.
Communication: Presenting insights and findings to stakeholders through reports and
visualizations.
Skills: Strong foundation in statistics and machine learning, programming in Python or R, and the ability to communicate findings through data visualization.
Example:
Credit Scoring Model: In the finance industry, predictive models are used to assess
the creditworthiness of individuals applying for loans or credit cards. The model uses
historical data such as past credit history, income, employment status, and other factors
to predict the likelihood that an individual will default on a loan.
Disease Prediction: A predictive model can be used to predict the likelihood of patients
developing certain diseases based on their medical history, lifestyle factors, and genetic
information.
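To make the model-building and evaluation responsibilities above concrete, the following is a minimal, illustrative sketch of a churn-style classifier using scikit-learn. The synthetic data generated by make_classification stands in for real customer records, so the output is not meaningful in itself.

# Minimal churn-style classification sketch; synthetic data stands in for customer records.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Synthetic stand-in for customer features (tenure, usage, complaints, ...).
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)                                   # train on "historical" customers
print(classification_report(y_test, model.predict(X_test)))   # precision, recall, F1-score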
Data Analyst
Analyzes historical data to provide insights into customer behavior and trends. Works with Business Intelligence Analysts to create dashboards that visualize churn rates and the effectiveness of retention strategies.
Responsibilities:
Reporting: Querying and aggregating data to produce regular reports and dashboards.
Trend Analysis: Identifying patterns and trends that support business decisions.
Data Quality Checks: Verifying that the data used for reporting is accurate and consistent.
Skills: Proficiency in SQL, Excel, data visualization tools (Tableau, Power BI), and basic
statistical knowledge.
Data Engineer
Builds and maintains the data pipelines that move customer data from source systems into the data warehouse used by the Data Scientist and Data Analyst.
Responsibilities:
Data Pipeline Development: Designing, building, and maintaining scalable data pipelines
that automate the flow of data.
Data Warehousing: Implementing and managing data storage solutions to ensure data is
accessible, secure, and reliable.
Data Integration: Combining data from different sources and ensuring consistency.
Performance Optimization: Ensuring efficient data processing by optimizing storage and
query performance.
Skills: Knowledge of ETL processes, database systems (SQL, NoSQL), big data technologies
(Hadoop, Spark), and programming (Python, Java, Scala).
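The ETL idea described above can be sketched in miniature with pandas and SQLite. The file sales_raw.csv, its columns, and the warehouse.db database are hypothetical; a production pipeline would typically use dedicated tools such as Spark or an orchestration framework.

# Minimal ETL sketch: extract from a CSV file, transform, load into a SQLite table.
# File, column, and table names are hypothetical examples.
import sqlite3
import pandas as pd

# Extract: read raw data from a source file.
raw = pd.read_csv("sales_raw.csv")

# Transform: clean and standardize before loading.
raw = raw.dropna(subset=["order_id"])        # drop rows missing a key field
raw["amount"] = raw["amount"].astype(float)  # enforce a consistent numeric type

# Load: write the cleaned data into a warehouse-style table.
with sqlite3.connect("warehouse.db") as conn:
    raw.to_sql("sales", conn, if_exists="replace", index=False)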
Machine Learning Engineer
Takes the model developed by the Data Scientist and optimizes it for deployment. Ensures that the model can process new data in real time and provide predictions to customer service representatives.
Responsibilities:
Model Deployment: Packaging trained models and integrating them into production systems.
Optimization: Tuning models and code so that predictions are served efficiently and in real time.
Monitoring and Maintenance: Tracking model performance in production and retraining or updating models as needed.
Data Architect
Designs the architecture to ensure that data flows seamlessly from collection to storage, analysis, and prediction. Implements data governance policies to ensure data security and compliance with regulations.
Responsibilities:
Data Architecture Design: Developing the overall structure of data systems to ensure they
meet organizational needs.
Data Modeling: Creating data models that define how data is stored, accessed, and used.
Data Governance: Establishing policies and procedures to ensure data quality and
security.
Scalability and Performance: Ensuring data systems are scalable and perform efficiently.
Skills: Expertise in database design, data modeling, data warehousing, and knowledge of data
governance and security practices.
Business Intelligence Analyst
Uses the model’s predictions and historical data analysis to create reports and dashboards. BI analysts design and develop BI solutions to help organizations make data-driven decisions. They create dashboards, reports, and data visualization tools.
Responsibilities:
Reporting and Dashboards: Building dashboards and reports that track key business metrics.
Data Visualization: Presenting data in clear, interactive visual formats for stakeholders.
Requirements Gathering: Translating business questions into data and reporting requirements.
Skills: Proficiency in BI tools (Tableau, Power BI), SQL, data modeling, and understanding of
business processes and requirements.
Data Visualization Specialist
Skills: Proficiency in data visualization tools (Tableau, D3.js), graphic design principles, and an understanding of data storytelling.
In a data science project, different roles are organized hierarchically based on their responsibilities
and areas of expertise. This hierarchy helps in defining clear lines of communication,
accountability, and decision-making processes.
Hierarchy:
+-----------------------------+
| Chief Data Officer |
| / Head of Data Science |
+-----------------------------+
|
+------------------------------------+
| Data Science Manager / Lead DS |
+------------------------------------+
|
+------------------------+-------------------------+
| Data Architect | Senior Data Scientist |
+------------------------+-------------------------+
| |
+------------------+ +--------------------------------+
| Data Engineer | | Data Scientist |
+------------------+ +--------------------------------+
| |
+--------------------------------+ +---------------------+
| Machine Learning Engineer | | Data Analyst |
+--------------------------------+ +---------------------+
|
+-----------------------------+
| Business Intelligence |
| Analyst |
+-----------------------------+
|
+-----------------------------+
| Junior Data Scientist / |
| Data Analyst |
+-----------------------------+
4. Stages in a Data Science Project
Problem Definition: The foremost stage of the data science life cycle involves defining the business problem and then articulating how data science can help address it. This means understanding the problem and defining the objective, which may include predicting customer churn, estimating product demand, or optimizing marketing efforts.
Data Collection: Once the problem is clearly defined, data collection becomes a critical aspect of the data science life cycle. This stage entails gathering raw data from various sources such as databases, spreadsheets, web scraping, or APIs, along with possible external influences such as seasonal trends and economic indicators.
While collecting data, it is crucial to preserve the original, raw form of the data for transparency and reproducibility.
Data Cleaning and Preprocessing: Handling missing values, outliers, and ensuring data quality.
The data preparation phase plays a critical role in transforming raw data into a clean and usable
format. This crucial step ensures that the data is reliable, accurate, and ready for analysis, setting
the stage for meaningful insights to be extracted.
During the data preparation phase, data scientists employ a range of techniques to address the
various challenges posed by the raw data. One common task involves handling missing values,
which are data points that are absent or incomplete. Missing values can significantly impact the
accuracy of analyses, as they introduce uncertainty and potentially bias the results.
Data scientists use strategies such as imputation, where missing values are estimated or replaced
using statistical methods, to ensure that the data remains robust and representative.
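As a brief illustration of imputation, the sketch below fills missing numeric values with the column median or mean using pandas. The column names and values are invented for the example, and the right strategy always depends on the data and the problem.

# Minimal sketch of handling missing values by imputation (hypothetical columns).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [52000, np.nan, 61000, 48000, np.nan],
    "age":    [34, 29, np.nan, 45, 52],
})

df["income"] = df["income"].fillna(df["income"].median())  # numeric column: median imputation
df["age"] = df["age"].fillna(df["age"].mean())             # numeric column: mean imputation
print(df)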
Exploratory Data Analysis (EDA): Summarizing the main characteristics of the data. One
essential task during this stage is feature selection. Data scientists carefully choose the relevant
features or variables from the dataset that are most informative and influential for the analysis. By
selecting the right set of features, they can simplify the modeling process, enhance the
interpretability of results, and reduce the risk of overfitting.
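Feature selection can be sketched with scikit-learn as shown below; this example keeps the five highest-scoring features of a synthetic dataset, which is only one of several possible selection strategies.

# Minimal feature-selection sketch: keep the k most informative features.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

selector = SelectKBest(score_func=f_classif, k=5)  # rank features by ANOVA F-score
X_reduced = selector.fit_transform(X, y)           # keep only the top 5 features
print(selector.get_support(indices=True))          # indices of the selected features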
Model Building: Developing predictive models using algorithms. In this stage of the data science
lifecycle, data scientists use statistical and machine learning techniques to analyze the prepared
data. In model selection, the types of input and output variables play an important role. The team has to decide whether to use a single model or a series of models, depending on the type of analysis being performed. After selecting the model, an appropriate analytical tool must be chosen to fit it.
In the model building phase, the selected analytical technique is applied to a set of training data.
This process is known as “training the model”.
A separate set of data, known as the testing data, is then used to evaluate how well the model
performs. This is sometimes known as the pilot test. By applying these techniques, they extract
meaningful information, make accurate predictions, and gain a deeper understanding of the
underlying insights within the data.
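The split between the training data and a separate testing set (the "pilot test") can be sketched as follows. The decision tree and the synthetic data are illustrative choices, not a prescribed method.

# Minimal sketch of training a model on training data and evaluating it on held-out test data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=800, n_features=8, random_state=1)

# Split: one set to train the model, a separate set for the pilot test.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

model = DecisionTreeClassifier(max_depth=4, random_state=1)
model.fit(X_train, y_train)                                  # "training the model"
print("pilot-test accuracy:", model.score(X_test, y_test))   # evaluation on unseen data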
Model Evaluation: Assessing the performance of models using metrics. Once the data models
have undergone training and predictions have been generated, the subsequent step in the data
science lifecycle is to evaluate the results. Data scientists meticulously assess the performance of
their models and validate the accuracy of the predictions against the ground truth or known
outcomes. This evaluation process plays a crucial role in determining the effectiveness of the
models and gaining valuable insights into the analyzed data.
During the evaluation stage, data scientists employ various techniques to analyze and interpret the
results. Statistical analysis is a fundamental approach used to assess the performance metrics of
the models. These metrics can include accuracy, precision, recall, F1 score, or other domain-
specific measures depending on the nature of the problem.
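The metrics named above can be computed directly with scikit-learn, as in this small sketch; the true and predicted labels are invented purely to show the function calls.

# Minimal sketch of common evaluation metrics on hypothetical true vs. predicted labels.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # known outcomes ("ground truth")
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # model predictions

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))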
Model Deployment: Once a model has been validated, it is integrated into the organization’s systems and workflows. By integrating the models, the organization can automate decision-making processes, optimize resource allocation, or improve operational efficiency based on the insights gained from the data analysis.
5. Applications of Data Science in Various Fields
Healthcare:
Disease prediction and diagnosis.
Medical image analysis.
Drug discovery.
Patient monitoring and personalized treatment.
Marketing:
Targeted advertising.
Customer segmentation and profiling.
Campaign effectiveness analysis.
Sentiment analysis.
Churn prediction.
E-commerce/Retail:
Product recommendation systems.
Demand forecasting and inventory management.
Price optimization.
Government:
Policy making.
Resource allocation.
Transportation and Logistics:
Route optimization.
Predictive maintenance of vehicles.
Traffic management.
Autonomous vehicles.
Supply chain logistics.
Energy:
Predictive maintenance.
Quality control.
Supply chain optimization.
Process optimization.
Demand forecasting.
6. Data Security Issues
Importance: Protecting data from unauthorized access and ensuring privacy are critical in the
digital age.
Common Challenges:
Data breaches
Insider threats
Data corruption
Data Breaches
Definition: A data breach is an incident in which sensitive, protected, or confidential data is accessed, disclosed, or stolen without authorization.
Examples: Hackers stealing customer data, internal employees leaking sensitive information.
Causes: Weak or stolen credentials, phishing attacks, unpatched software vulnerabilities, and misconfigured systems or databases.
Impact / Consequences:
Financial Loss: Costs associated with investigation, remediation, legal fees, and
regulatory fines.
Reputational Damage: Loss of trust from customers, partners, and stakeholders.
Legal Implications: Violations of data protection laws can result in hefty fines and legal
actions.
Prevention: Strong encryption, strict access controls, regular security audits and patching, and employee security-awareness training.
Insider Threats
Definition: Insider threats refer to risks posed by individuals within the organization, such as
employees, contractors, or business associates, who misuse their access to sensitive information.
Types:
Malicious Insider: Individuals with intent to harm the organization, often motivated by
financial gain, revenge, or corporate espionage.
Negligent Insider: Employees who accidentally cause data breaches due to lack of
awareness or carelessness.
Compromised Insider: Legitimate accounts or systems that are compromised and used
by external attackers.
Consequences: Loss or leakage of sensitive data, financial and reputational damage, and disruption of business operations.
Prevention: Least-privilege access controls, monitoring and auditing of user activity, and regular security-awareness training.
Data Corruption
Definition: Data corruption refers to errors in computer data that occur during writing, reading,
storage, transmission, or processing, leading to the data being incorrect, incomplete, or unusable.
Causes: Hardware failures, software bugs, power outages, malware, and errors during data transmission.
Consequences: Loss of data, application failures, and unreliable or misleading analysis results.
Prevention:
Regular Backups: Frequent backups to ensure data can be restored in case of corruption.
Error Detection and Correction: Implementing error-checking and correction
algorithms.
Robust Hardware and Software: Using reliable hardware and up-to-date software to
minimize risks.
Security Measures: Protecting systems from malware and unauthorized access to reduce
the risk of corruption.
1. Data Encryption
o Description: Protecting data by converting it into a coded format that can only be
read by authorized individuals.
o Impact: Ensures data remains secure during transmission and storage, protecting it from unauthorized access.
o Examples: Encrypting data in transit using SSL/TLS, encrypting data at rest using AES (a brief code sketch follows this list).
2. Access Control
o Description: Implementing measures to restrict access to data based on user roles
and permissions.
o Impact: Prevents unauthorized access to sensitive data, ensuring that only
authorized personnel can access specific data.
3. Data Masking
o Description: Obscuring specific data within a database to protect it from
unauthorized access.
o Impact: Allows sensitive data to be used for testing or analysis without exposing
the actual data.
o Examples: Masking credit card numbers, social security numbers.
4. Anonymization and de-identification
o Description: Removing or altering personal identifiers from data sets so that
individuals cannot be readily identified.
o Impact: Reduces the risk of re-identification of individuals in case of a data breach.
o Examples: Removing names and addresses from health records before analysis.
5. Data Integrity
o Description: Ensuring the accuracy and consistency of data over its lifecycle.
o Impact: Prevents unauthorized modifications and ensures that data remains
trustworthy and reliable.
o Examples: Using checksums and hashing to verify data integrity, implementing
version control.
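As a rough illustration of three of the measures above (data masking, integrity checking with a hash, and encryption of data at rest), the sketch below uses Python's built-in hashlib module and the third-party cryptography package, whose Fernet recipe uses AES internally. The card number and record are invented sample values.

# Minimal sketches of data masking, integrity hashing, and symmetric encryption.
# Requires the third-party "cryptography" package; sample values are hypothetical.
import hashlib
from cryptography.fernet import Fernet

# Data masking: expose only the last four digits of a card number.
card = "4111111111111111"
masked = "*" * (len(card) - 4) + card[-4:]
print(masked)

# Data integrity: a hash acts as a checksum; any change to the record changes the digest.
record = b"customer_id=42,balance=1000.00"
print(hashlib.sha256(record).hexdigest())

# Data encryption: symmetric encryption of sensitive data at rest.
key = Fernet.generate_key()
token = Fernet(key).encrypt(b"sensitive value")
print(Fernet(key).decrypt(token))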
Data Analysis vs. Data Analytics
Data Analysis involves the meticulous process of defining, cleaning, investigating, and
transforming data into meaningful results. It's used to analyze data and extract valuable insights,
performing tasks such as predictive, descriptive, exploratory, and inferential analysis. Tools like
RapidMiner, KNIME, and Tableau Public are commonly used for this purpose.
On the other hand, Data Analytics focuses on data collection and its inspection to make data-driven
decisions. It is employed in businesses to find market trends, customer preferences, and hidden patterns using tools like Python, SAS, and Apache Spark.