DATA SCIENCE
FUNDAMENTALS
FA23-BST-
SPRING2025
Lecture 1
Dr. Asma Arshad
Associate Prof. PHED
TODAY’S AGENDA
Introduction – Basic Terminologies
Data Collection Ways
• Generated
• Collected
• Retrieved
Synthetic Data
Artificial Intelligence
• Why artificial?
DIKW
BASIC TERMINOLOGIES
DATA TYPES
• Numerical
• Non-Numerical
• Hybrid
BASIC TERMINOLOGIES FOR DATA SCIENCE
NAVIGATING THE FUTURE AI & ML
DATA CAN BE
• Generated: Simulation
• Collected: Primary, Secondary
• Retrieved: Data structures, Algorithms, Similarity measures
1. GENERATED DATA
Generated data is artificially created rather than collected from real-world
sources. It is often used in simulations, testing, or predictive modeling.
🔹 How it's generated: Through simulations, synthetic data generation, or
computational models.
🔹 Examples:
o Monte Carlo Simulation: Used in risk analysis and financial forecasting.
o Synthetic Data in Machine Learning: AI-generated images (e.g., using GANs), synthetic customer transactions for fraud detection.
o Game Development: Simulated player behavior for AI testing.
o Physics & Weather Forecasting: Climate models simulate future temperature trends.
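A Monte Carlo simulation produces an answer from many random trials. The sketch below is the textbook toy case of estimating pi, not the financial or risk models mentioned above; the function name `estimate_pi` and the fixed seed are assumptions made for illustration.

```python
import random


def estimate_pi(num_samples: int, seed: int = 0) -> float:
    """Estimate pi by sampling random points in the unit square."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        # Count the points that land inside the quarter circle of radius 1.
        if x * x + y * y <= 1.0:
            inside += 1
    # (area of quarter circle) / (area of square) = pi / 4
    return 4.0 * inside / num_samples


if __name__ == "__main__":
    print(estimate_pi(100_000))
```

Every point here is generated data: none of it was observed in the real world, yet the aggregate still approximates a real quantity. More samples generally give a closer estimate.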
2. COLLECTED DATA
Collected data comes from real-world observations, experiments, or existing
databases. It is divided into Primary and Secondary data.
🔸 Primary Data (Collected firsthand)
Data that is collected directly for a specific research purpose.
Examples:
o Surveys: A company conducts a survey to understand customer satisfaction.
o Experiments: A scientist records lab results from a chemical reaction.
o Sensor Data: IoT devices collecting temperature data.
o Field Research: Biologists tracking animal migration patterns.
🔸 Secondary Data (Collected from existing sources)
Data that has already been collected by someone else and is reused.
Examples:
o Census Data: Governments use past census data for policy planning.
o Stock Market Data: Investors analyze past market trends from financial
reports.
o Medical Records: A researcher uses past patient records for disease
prediction.
o Wikipedia & Public Datasets: AI models trained on pre-existing datasets.
3. RETRIEVED DATA
Retrieved data involves extracting useful information using data structures, algorithms, and
similarity measures. This category is essential for fields like big data, information retrieval,
and AI applications.
🔸 Data Structures
How data is stored efficiently for quick retrieval.
Examples:
o Databases (SQL, NoSQL): Storing customer records in MySQL.
o Hash Tables: Fast lookup for a dictionary app.
o Graphs: Social networks (Facebook friend connections).
o Trees: Search engines use tree structures for indexing web pages.
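Two of the structures above can be tried directly in Python. This is a minimal sketch, assuming a tiny made-up dictionary and friend network; the names `lookup`, `are_friends`, and all of the sample entries are hypothetical.

```python
# A Python dict is a hash table: the key is hashed to locate its value
# directly, without scanning every entry.
dictionary = {
    "data": "facts or observations recorded for analysis",
    "algorithm": "a step-by-step procedure for solving a problem",
    "graph": "a set of nodes connected by edges",
}


def lookup(word: str) -> str:
    """Return the definition of a word, or a fallback message."""
    return dictionary.get(word.lower(), "word not found")


# A graph stored as an adjacency list: each user maps to a set of friends.
friends = {
    "ali": {"sara", "omar"},
    "sara": {"ali"},
    "omar": {"ali"},
}


def are_friends(a: str, b: str) -> bool:
    """Check whether b appears in a's friend set."""
    return b in friends.get(a, set())
```

For example, `lookup("Data")` returns the stored definition and `are_friends("sara", "omar")` returns `False`, because those two users are connected only through "ali".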
🔸 Algorithms
How data is retrieved, sorted, and analyzed.
Examples:
o Search Algorithms: Google uses ranking algorithms such as PageRank to order web pages by importance.
o Sorting Algorithms: E-commerce sites sort products by price.
o Machine Learning Models: Netflix recommends shows based on user data.
o Pattern Recognition: AI detects spam emails based on past spam patterns.
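Sorting and searching are the simplest of these to try out. The sketch below assumes a small made-up product list; `sort_by_price` and `linear_search` are hypothetical helper names, and a real e-commerce site would normally do this inside its database rather than in application code.

```python
products = [
    {"name": "keyboard", "price": 45.0},
    {"name": "mouse", "price": 20.0},
    {"name": "monitor", "price": 180.0},
]


def sort_by_price(items, descending=False):
    """Return a new list ordered by price; the input list is not modified."""
    return sorted(items, key=lambda item: item["price"], reverse=descending)


def linear_search(items, name):
    """Scan the list from the start and return the first match, or None."""
    for item in items:
        if item["name"] == name:
            return item
    return None


cheapest_first = [p["name"] for p in sort_by_price(products)]
```

Here `cheapest_first` holds the product names from lowest to highest price, and `linear_search(products, "mouse")` returns the matching record.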
🔸 Similarity Measures
How we compare and retrieve similar data points.
Examples:
o Euclidean Distance: Measuring similarity in face recognition systems.
o Cosine Similarity: Finding similar documents in text analysis.
o Jaccard Similarity: Detecting plagiarism between two texts.
o KNN (k-Nearest Neighbors): A classifier built on a similarity measure, e.g., labeling an email as spam or not based on the most similar past messages.
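The measures listed above can each be written in a few lines. This is a minimal sketch using only the standard library; the function names, the zero-vector and empty-set conventions, and the toy "ham"/"spam" points are assumptions for illustration, not part of the lecture.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_a * norm_b)


def jaccard_similarity(set_a, set_b):
    """Size of the intersection divided by size of the union."""
    set_a, set_b = set(set_a), set(set_b)
    union = set_a | set_b
    if not union:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(set_a & set_b) / len(union)


def knn_classify(query, labeled_points, k=3):
    """Majority label among the k points closest to the query."""
    nearest = sorted(
        labeled_points, key=lambda p: euclidean_distance(query, p[0])
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy 2-D feature vectors with made-up labels.
points = [
    ((1, 1), "ham"), ((1, 2), "ham"),
    ((8, 8), "spam"), ((9, 8), "spam"), ((8, 9), "spam"),
]
```

Note that Euclidean distance is smaller for more similar items, while cosine and Jaccard similarity are larger. KNN is not itself a similarity measure: it uses one (Euclidean distance here) to decide which neighbors vote.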
SYNTHETIC DATA
In Data Science & AI (Synthetic Data)
o Artificially generated data that mimics real-world data but does not come from actual observations.
o Used when real data is unavailable, expensive, or sensitive (e.g., medical data).
o Example: AI-generated customer transactions to train fraud detection models.
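To make the fraud-detection example concrete, here is a rough sketch of a synthetic transaction generator. Everything in it is an assumption chosen for illustration: the field names, the 5% fraud rate, and the idea that fraudulent amounts tend to be larger. Real synthetic-data tools model the original data far more carefully.

```python
import random


def generate_transactions(n, fraud_rate=0.05, seed=42):
    """Create n made-up transactions, each with a fraud label."""
    rng = random.Random(seed)  # fixed seed so the data set is reproducible
    rows = []
    for i in range(n):
        is_fraud = rng.random() < fraud_rate
        # Illustrative assumption only: fraud draws from a larger mean amount.
        mean_amount = 900.0 if is_fraud else 60.0
        amount = round(rng.expovariate(1.0 / mean_amount), 2)
        rows.append(
            {"transaction_id": i, "amount": amount, "is_fraud": is_fraud}
        )
    return rows


transactions = generate_transactions(1000)
```

No real customer appears in `transactions`, which is the main reason synthetic data is attractive when the real records are sensitive.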
Category | Definition | Examples
Generated | Simulated or artificially created data | Monte Carlo simulations, AI-generated images, climate models
Collected (Primary) | Data collected firsthand | Surveys, experiments, IoT sensor data
Collected (Secondary) | Pre-existing data used for analysis | Census data, stock market history, public datasets
Retrieved (Data Structures) | Organizing data efficiently | Databases, graphs, trees, hash tables
Retrieved (Algorithms) | Extracting and processing data | Search engines, ML recommendations, sorting
Retrieved (Similarity Measures) | Comparing and finding related data | Face recognition, plagiarism detection, KNN classification
AI
• What is AI (Artificial Intelligence)?
Artificial Intelligence (AI) refers to the simulation of human
intelligence in machines that can perform tasks typically requiring
human thinking, such as learning, reasoning, problem-solving,
perception, and decision-making.
AI enables machines to:
✅ Learn from data (Machine Learning)
✅ Recognize patterns (Face recognition, speech processing)
✅ Make decisions (Self-driving cars, chatbots)
✅ Understand language (ChatGPT, Google Assistant)
WHY IS IT CALLED "ARTIFICIAL"?
• The term "artificial" means man-made, not natural.
• We call it Artificial Intelligence because:
• 🔹 It is not real human intelligence but simulated using
algorithms and computers.
• 🔹 Machines do not think like humans but process
information based on predefined rules and learning models.
• 💡 Example:
• A human can learn from experience and make decisions naturally.
• AI can learn from data and make predictions but only within the
limits of the algorithms.
TYPES OF AI
1️⃣ Narrow AI (Weak AI) – Designed for specific tasks
🔸 Example: Google Search, Siri, Spam filters
2️⃣ General AI (Strong AI) – Hypothetical AI that can think like humans
🔸 Example: AI that understands emotions, reasons, and makes decisions like a person (not yet achieved).
3️⃣ Super AI – AI that surpasses human intelligence (theoretical)
🔸 Example: AI that can innovate and improve itself without human help.
BASIC TERMS FOR AI
Term | Meaning
AI | Machines that simulate human intelligence
Artificial | Man-made, not naturally occurring
Narrow AI | AI for specific tasks (e.g., Google Search, chatbots)
General AI | AI with human-like intelligence (not yet achieved)
Super AI | AI surpassing human intelligence (theoretical)
DIKW EXPANSION
UNLOCKING NEW HORIZONS
DIKW
Data → (Processed) → Information → (Validation) → Knowledge → (Thinking) → Wisdom
DIKW
INNOVATIVE
SOLUTIONS
From the MovieLens dataset documentation, this file has 5 columns:
Column Name | Description
user_id | Unique identifier for each user
age | Age of the user
gender | Gender (M = Male, F = Female)
occupation | User's occupation
zip_code | User's zip code
What do we want?
📊 Suggested Data Analysis
Now that we understand the dataset, let's perform some
key analyses.
Basic Summary Statistics
🔹 Insights:
The average user age, min/max age, gender distribution, and
most common occupations.
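These summary statistics can be computed with the Python standard library alone. The sketch below runs on a few made-up rows rather than the real file; the pipe delimiter, the absence of a header row, and the helper names `load_users` and `summarize` are assumptions, so adjust them to match the file you actually download.

```python
import csv
import io
import statistics
from collections import Counter

COLUMNS = ["user_id", "age", "gender", "occupation", "zip_code"]

# Made-up sample rows for illustration only; not taken from the dataset.
SAMPLE = """1|25|M|student|00001
2|40|F|engineer|00002
3|31|F|student|00003
4|19|M|artist|00004
"""


def load_users(text_stream, delimiter="|"):
    """Read delimiter-separated user rows into a list of dicts."""
    reader = csv.DictReader(text_stream, fieldnames=COLUMNS, delimiter=delimiter)
    users = []
    for row in reader:
        row["age"] = int(row["age"])  # ages arrive as text, so convert them
        users.append(row)
    return users


def summarize(users):
    """Compute the basic demographic summary described above."""
    ages = [u["age"] for u in users]
    return {
        "mean_age": statistics.mean(ages),
        "min_age": min(ages),
        "max_age": max(ages),
        "gender_counts": Counter(u["gender"] for u in users),
        "top_occupations": Counter(u["occupation"] for u in users).most_common(3),
    }


summary = summarize(load_users(io.StringIO(SAMPLE)))
```

To run it on a real file, pass an open file object to `load_users` in place of the `io.StringIO` sample.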
What do we want?
💡 Conclusion
By performing these analyses, you can get
demographic insights into the MovieLens
dataset.
NEXT LECTURE INITIATIVES
1. Bring laptops to get started with Python.
2. Technology integration: explore sharing files via WhatsApp.
3. Collaborative working: foster learning by proceeding with real datasets.
NEXT CLASS AGENDA
Introduction – Basic Terminologies
What is Data Science?
• Big Data and Data Science Hype
• Getting Past the Hype
Why Now?
Datafication
Data Science Jobs
What is a Data Scientist?
• In academia
• In industry
THANK YOU
ANY QUESTIONS?
[email protected]