Week 1 - Introduction
(CIS3-535)
2022 - 2023
Introduction - 1
Goals of the Course
Understand what a "Distributed Information System" is
– e.g. Web Search Engines, Online Social Networks, etc.
Know which key tasks are relevant for DIS
– e.g. retrieval, mining, recommending, information extraction,
data integration etc.
Master common techniques used to solve these problems
– e.g. vector space model, graph mining, word embeddings etc.
Focus of the Course
Master important Models and Algorithms
for representing and processing information:
Data Science
References
Parts of the course are based on the following textbooks
– Ricardo Baeza-Yates, Berthier Ribeiro-Neto, Modern Information Retrieval (ACM
Press Series), Addison Wesley, 1999.
– Jiawei Han, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
– Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction
to Information Retrieval, Cambridge University Press. 2008.
– J Leskovec, A Rajaraman, JD Ullman, Mining of Massive Datasets, 2014.
Free books
mmds.org http://nlp.stanford.edu/IR-book/
Chapter 0
During the first couple of lectures, my friends and I
all thought the lectures were a bit abstract
- DIS student 2021
Overview
1. What is an Information System?
2. Data Modelling
2.1 Models
2.2 Data and Models
2.3 Representation
3. Managing data and information
3.1 Data Management
3.2 Information Management
3.3 Distributed Information Management
4. About the lecture
1. WHAT IS AN INFORMATION SYSTEM?
Q: What is an information system?
• How would you define what an information system is?
• Think about what you believe are information systems
• What is not an information system? Are there any computer
systems that are not information systems?
• What is the difference between computer science and
information systems?
Information Systems – are everywhere
A day in a student’s life
– IS academia: course registration, grades, …
– Moodle: course information, slides, …
– Bank account: payments, savings, ….
– Library system: literature
– Search engine: where to find food, …
– Facebook, TikTok: connecting to friends, news, …
– Google maps: finding your way
– Campus map: finding lecture hall
– Email: exchanging messages
The Business Perspective
Jobs related to information systems
• Project Managers
• Chief Information Officers (CIO)
• Technical Writers
• System Analysts
• Requirements Analyst
Jobs related to computer science
• Computer Programmer
• Java Developer
• Database Administrator
• Software Engineer
• Network Engineer
The Data Perspective
Information systems: computer systems handling and interpreting large amounts of data
- Transaction data
- Documents
- Maps
- Social networks
- Sensor data
Not information systems: computer systems performing lots of computation
- Computational science and simulation
- Computer games
- Computer algebra (Matlab etc)
Ask Wikipedia
[Figure: Wikipedia entries from 2018 and 2021]
A Systems Perspective
Information Processing
Performs action
(decision)
Environment
This model applies to all types of systems: organisms, species, humans, organizations, societies, etc.
Information Processing: example
•••−−−•••
coast guard
Data-Information-Knowledge
Knowledge:
If SOS then “send ship”
Information: SOS
Data: • • • − − − • • •
An information system automates the process of receiving, interpreting and acting upon signals
A lot of signals = Big Data
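The data, information and knowledge steps of the SOS example can be sketched in a few lines of Python. This is only an illustration: the two-letter Morse table, the rule table and the function names `interpret` and `act` are assumptions, and the signal is assumed to arrive with spaces between letters.

```python
# Hypothetical sketch: names and the Morse subset are illustrative assumptions.
MORSE = {"...": "S", "---": "O"}  # tiny subset of the Morse alphabet

RULES = {"SOS": "send ship"}      # knowledge: if SOS then "send ship"


def interpret(signal: str) -> str:
    """Data -> information: decode space-separated Morse letters."""
    return "".join(MORSE[symbol] for symbol in signal.split())


def act(information: str) -> str:
    """Information -> action: apply the matching rule, if any."""
    return RULES.get(information, "ignore")


data = "... --- ..."
information = interpret(data)
action = act(information)
print(information, "->", action)  # prints: SOS -> send ship
```

The same raw signal stays meaningless without the decoding table, and the decoded message triggers no action without the rule.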
What is an Information System?
Types and Sources of Data
Sources:
– IS academia: course registration, grades, …
– Moodle: course information, slides, …
– Bank account: payments, savings, …
– Library system: literature
– Search engine: where to find food, …
– Facebook, TikTok: connecting to friends, news, …
– Google maps: finding your way
– Campus map: finding lecture hall
Types of data:
– Text
– Structured database
– Knowledge bases
– Social networks
– Sensors
– Images and videos
Tasks
Example systems:
– IS academia: course registration, grades, …
– Moodle: course information, slides, …
– Bank account: payments, savings, …
– Library system: literature
– Search engine: where to find food, …
– Facebook, TikTok: connecting to friends, news, …
– Google maps: finding your way
– Campus map: finding lecture hall
Tasks:
– Searching information
– Event notification
– Prediction
– Classification
– Recommendation
– Summarization
– Question answering
– …
Q: What does “understanding” mean?
• What does it mean to understand/interpret data?
• What exactly is information?
• How is information different from data?
• How can we technically characterize the difference between
information and data?
2. DATA MODELING
2.1 Models
Human communication
Model M
Physical phenomena
What is an information system?
An information system is a computing system that manages
a representation of a model of its real-world environment
within a computer system for a given purpose
Model M
Examples of Models
Constants: document identifiers, text (sequence of characters)
Function:
Constants: coordinate values, temperature values
Axiom:
Social organization
How do we know that a model is good?
The model should represent some aspect of the
real world
Representation
Information System
implements model M
Reality R
We have to evaluate whether the model represents reality properly
Evaluation is hard!
Model M
Information Retrieval: compare to human evaluation (human knowledge)
Science: fit to empirical data (physical phenomena)
Business Analysis: verify with users (social organization)
Q: What is data?
• What is the connection between data and models?
• What actually is data?
• How do information systems deal with data to support models?
2.2 Data and Models
Example: a mathematical model for trajectories:
• Domains: time T, space S
• Trajectories are a set of functions
• One trajectory is a partial function tr : T → S
Representation of Functions
Functions can be represented
1. by a specification or algorithm, or
2. by enumerating the values they can take
• The enumerated values are called data
Representing a single trajectory
• as algorithm:
• as data, a set of samples:
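A minimal sketch of the two representations of one trajectory, assuming a made-up uniform motion on the time interval [0, 10]. The names `tr_algorithm`, `tr_data` and `tr_lookup` and all numbers are illustrative assumptions, not taken from the slides.

```python
from typing import Optional, Tuple

Point = Tuple[float, float]


def tr_algorithm(t: float) -> Optional[Point]:
    """Trajectory given by a specification: uniform motion along a line.

    Partial function: undefined (None) outside the interval [0, 10].
    """
    if not 0 <= t <= 10:
        return None
    return (2.0 * t, 1.0 + t)


# The same trajectory as data: enumerated (time -> position) samples.
tr_data = {t: tr_algorithm(float(t)) for t in range(0, 11, 5)}


def tr_lookup(t: float) -> Optional[Point]:
    """Trajectory given by data: defined only at the sampled times."""
    return tr_data.get(t)


print(tr_data)
```

The algorithm answers for every time in its domain, while the data answers only for the three sampled times, so `tr_lookup(3)` returns `None`.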
Functions in Information Systems
Functions can be implemented as algorithms
– functions, queries, views etc.
Data Structures Model M
Object-Oriented Models
Allow encapsulating abstract data types (ADTs) and mimicking the
(signature of the) mathematical structure
Data Models
A data model is a language (or framework) used to
specify data structures. It consists of
1. Data Definition Language (DDL)
2. Data Manipulation Language (DML)
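As one possible illustration, SQL provides both parts. The sketch below uses Python's built-in `sqlite3` module; the table name `sample` and its columns are hypothetical choices for storing GPS samples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# DDL: define the data structure (a hypothetical table of GPS samples).
conn.execute("CREATE TABLE sample (x INTEGER, y INTEGER, t TEXT)")

# DML: insert and query data that conforms to that structure.
conn.executemany(
    "INSERT INTO sample (x, y, t) VALUES (?, ?, ?)",
    [(12, 13, "5:00"), (12, 14, "5:01"), (13, 15, "5:02")],
)
rows = conn.execute(
    "SELECT x, y FROM sample WHERE t >= '5:01' ORDER BY t"
).fetchall()
print(rows)  # [(12, 14), (13, 15)]
conn.close()
```

The `CREATE TABLE` statement belongs to the DDL, while `INSERT` and `SELECT` belong to the DML.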
Attention
The term data model is used in different ways!
• Sometimes it refers to frameworks for specifying
conceptual, logical and physical models
• Sometimes a specific schema is called a data model
Physical Implementation of Data Structures
Requires a binary representation of the data structure
• Different implementations have different performance characteristics
Example: Map associative array to an array structure
k1 v1 k2 v2 ….
000 010 110 001 …
implement the functions of the ADT
Alternative implementations of associative arrays
• hash tables, binary search trees, tries, …
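The array mapping can be sketched as follows. The class name `ArrayMap` and its methods are assumptions chosen for illustration; keys sit at even positions and their values directly after them.

```python
class ArrayMap:
    """Associative array stored as one flat list [k1, v1, k2, v2, ...]."""

    def __init__(self):
        self._cells = []

    def put(self, key, value):
        # Linear scan over the key positions (even indices).
        for i in range(0, len(self._cells), 2):
            if self._cells[i] == key:
                self._cells[i + 1] = value
                return
        self._cells.extend([key, value])

    def get(self, key, default=None):
        for i in range(0, len(self._cells), 2):
            if self._cells[i] == key:
                return self._cells[i + 1]
        return default


m = ArrayMap()
m.put("k1", "v1")
m.put("k2", "v2")
m.put("k1", "v1-updated")
print(m.get("k1"), m.get("k2"), m.get("k3"))  # v1-updated v2 None
```

Both operations scan the whole array in the worst case, whereas a hash table such as Python's built-in `dict` offers the same interface with constant expected lookup time. The interface is identical, only the performance characteristics differ.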
Relationship between Model and Data Model
The same function can be represented using
different data structures
What you think: 𝑡𝑟 : 𝑇 → 𝑆
What happens:
Different physical implementations of the data types
Three levels of modeling
Conceptual modelling
• mathematical model that a user thinks in
Logical modelling
• mathematical model that a computer can process
Physical modelling
• binary representation of logical model
Three Notions of Data
Conceptual level (data as in science)
• Values a variable (function) can take
Refined View of an Information System
Conceptual model
Semantic Layer: Information System (application/domain specific)
represented by
010000110
Q: What does “representation” mean?
• We have been using repeatedly the notion of representation, but
what (mathematical) notion does representation refer to?
• How do we know that a representation is a representation?
• Are there different kinds of representation?
2.3 Representation
A representation is a very general relationship that
expresses similarities or equivalences between
mathematical structures.
• Maps the domain of one structure into the
other
• Maps functions and relations of one structure
into the other
• Preserves some properties
Example: Homomorphism
Consider two mathematical structures with the same
functions
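As a hedged illustration of the standard notion: a mapping h from a structure A to a structure B is a homomorphism if applying a function in A and then mapping gives the same result as mapping first and then applying the corresponding function in B. The names A, B and h, and the choice of the integers with addition mapped to the integers modulo 3, are assumptions made for this sketch.

```python
from itertools import product


def h(x: int) -> int:
    """Candidate mapping from the integers to {0, 1, 2}."""
    return x % 3


def add_a(a: int, b: int) -> int:
    """Addition in structure A (the integers)."""
    return a + b


def add_b(a: int, b: int) -> int:
    """Addition in structure B (integers modulo 3)."""
    return (a + b) % 3


def preserves_addition(domain) -> bool:
    """Check h(a + b) == h(a) (+) h(b) on a finite sample of A."""
    pairs = product(domain, repeat=2)
    return all(h(add_a(a, b)) == add_b(h(a), h(b)) for a, b in pairs)


print(preserves_addition(range(-10, 11)))  # True
```

The check only samples a finite part of A, so it can refute the homomorphism property but not prove it in general.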
Example: Representing a Discrete Model
Mathematical model:
graph
neighbor function:
Representation as abstract data type
of type list(tuple(int,int))
def neighbor(i, G):
… code to retrieve all neighbors …
R is bijective, isomorphic!
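One way the elided body of `neighbor` could look, assuming that the graph is an edge list of integer pairs and that edges are undirected. Both assumptions, as well as the example graph, are not stated in the slides.

```python
from typing import List, Tuple

Graph = List[Tuple[int, int]]  # edge list, as in list(tuple(int,int))


def neighbor(i: int, G: Graph) -> List[int]:
    """Return the nodes adjacent to i, treating edges as undirected."""
    result = []
    for a, b in G:
        if a == i and b not in result:
            result.append(b)
        elif b == i and a not in result:
            result.append(a)
    return result


G = [(1, 2), (2, 3), (1, 3), (3, 4)]
print(neighbor(3, G))  # [2, 1, 4]
```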
Binary Representation
Data structure: list(int)
Example: 1,8,3, …
Binary representation:
if
R is bijective, isomorphic!
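A possible binary encoding of a list of integers, sketched with Python's `struct` module. The fixed width of 4 bytes per value and the big-endian byte order are assumptions; the mapping is invertible only for integers that fit into 32 bits.

```python
import struct
from typing import List


def encode(values: List[int]) -> bytes:
    """Pack each int as a 4-byte big-endian signed integer."""
    return struct.pack(f">{len(values)}i", *values)


def decode(blob: bytes) -> List[int]:
    """Inverse of encode: unpack 4-byte big-endian signed integers."""
    return list(struct.unpack(f">{len(blob) // 4}i", blob))


data = [1, 8, 3]
blob = encode(data)
print(blob.hex())  # 000000010000000800000003
assert decode(blob) == data
```

Because `decode` undoes `encode` exactly, no information is lost in either direction.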
Data Exchange Format
Data structure: list(int)
Example: 1,8,3, …
JSON representation:
‘[1, 8, 3,…]’
R is bijective, isomorphic!!
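The round trip through JSON can be tried with Python's built-in `json` module. This is a small sketch for a list of integers only.

```python
import json

data = [1, 8, 3]
text = json.dumps(data)      # data structure -> JSON text
restored = json.loads(text)  # JSON text -> data structure
print(text)                  # [1, 8, 3]
```

For lists of integers, `restored == data` holds, which is the invertibility that the slide refers to.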
Example: Representing a Continuous Model
Mathematical model: domains T, S
Trajectory: tr : T → S
Representation of time domain in Python:
Almost a homomorphism!
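The word "almost" can be made concrete with Python floats, assuming that real-valued times are represented as `float`. Addition of real numbers is preserved only approximately.

```python
import math

# Real numbers 0.1 and 0.2 are represented by the nearest binary floats.
a, b = 0.1, 0.2

# Would be True if the representation preserved addition exactly.
exact_sum_is_preserved = (a + b == 0.3)
# True: addition is preserved up to a small relative error.
approximately_preserved = math.isclose(a + b, 0.3, rel_tol=1e-9)

print(exact_sum_is_preserved, approximately_preserved)  # False True
```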
Representations are not always accurate
The representations need not always be
homomorphisms
- They preserve some relations approximately
- Measures for the quality of approximation are
needed
Example
Representing an empirical mathematical model by
another mathematical model
• Assume we have only partial data for a trajectory (samples)
• We interpolate the missing values, e.g., linear interpolation
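Linear interpolation between samples can be sketched as below. The function name `tr_int`, the sample format `(t, x, y)` and the sample values are assumptions; times are assumed to be strictly increasing.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

Sample = Tuple[float, float, float]  # (t, x, y), sorted by t


def tr_int(samples: List[Sample], t: float) -> Optional[Tuple[float, float]]:
    """Linearly interpolate a position at time t from sorted samples."""
    times = [s[0] for s in samples]
    if not samples or t < times[0] or t > times[-1]:
        return None  # outside the sampled interval: still undefined
    j = bisect_right(times, t)
    if j == len(samples):
        return samples[-1][1:]  # t equals the last sample time
    t0, x0, y0 = samples[j - 1]
    t1, x1, y1 = samples[j]
    w = (t - t0) / (t1 - t0)
    return (x0 + w * (x1 - x0), y0 + w * (y1 - y0))


samples = [(0.0, 12.0, 13.0), (1.0, 12.0, 14.0), (2.0, 13.0, 15.0)]
print(tr_int(samples, 1.5))  # (12.5, 14.5)
```

The interpolated function agrees with the measured trajectory at the sample times. Between them it is an approximation whose error becomes visible only when a new sample arrives.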
Illustration
[Figure: samples tr(t1), tr(t2), tr(t3), tr(t4), the interpolated trajectory trint, and the error revealed by a new sample for tr]
Representing the Real World
A model represents a reality, thus there exists a
(hypothetical) representation relationship
Representation
Information System
implements model M
Q: What is needed to manage data?
• Identify problems and techniques that are used to manage
data
• What kind of systems are being used to perform those
tasks?
3.1 Data Management
The collection of data represented in a data model
D is called a database DB.
Data Management
Efficient management of large amounts of data
– efficient storage and indexing
– efficient search and aggregation
Ensuring persistence and consistency of data under
updates and failures
– Persistence = data stored independent of lifetime of programs
– Consistency = data correct independent of type of failures
Examples of DBMS
Programming environment
• e.g. Python
Relational data management
• e.g. SQL, Pandas
Distributed data processing
• e.g. Map-Reduce, Spark
Text processing
• e.g. Elasticsearch, Lucene
Example: Relational Schema
Data Definition Language
Example: DataFrames
Q: What is information management?
• We have seen what data management tasks are. What, then, are
information management tasks?
• Which tasks does an information management system have to
support?
3.2 Information Management
Depending on the “degree of abstraction”, data is
frequently classified as
Knowledge: knowledge graphs, rules
Structured Data: relational data, XML and RDF data
Unstructured Data: measurement data, text, images and video
Levels of Abstraction - Characteristics
Unstructured data
• Data captured from measurements and human input
• Fixed data structure (e.g., time series)
Structured data
• Data structure defined using a data model
• Captures relationships in the data
Knowledge
• Schema can evolve dynamically as knowledge expands
• Captures decision rules, objectives and intentions
The classification is not accurate!
Example
Unstructured data: a GPS trace
[Figure: Bob’s and Alice’s GPS trace]
Information Management
Tasks
Search
Filtering
Classification
Prediction
Recommendation
Summarization
Integration
Question answering
[Figure: (conceptual) model M and data]
Example: Count the number of stops of a trip.
Example: Given GPS traces, find places and road segments.
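Counting the stops of a trip can be sketched as follows, under the assumption that a stop is a maximal run of at least `min_samples` consecutive samples at the same position. This definition, the function name `count_stops` and the trace values are illustrative choices.

```python
from itertools import groupby
from typing import List, Tuple

Sample = Tuple[int, int, str]  # (x, y, t)


def count_stops(trace: List[Sample], min_samples: int = 2) -> int:
    """Count maximal runs of >= min_samples samples at one position."""
    runs = groupby(trace, key=lambda s: (s[0], s[1]))
    return sum(1 for _, run in runs if len(list(run)) >= min_samples)


trace = [
    (12, 13, "5:00"), (12, 13, "5:01"),  # standing still: first stop
    (12, 14, "5:02"),                    # moving
    (13, 15, "5:03"), (13, 15, "5:04"),  # standing still: second stop
]
print(count_stops(trace))  # 2
```

Real GPS traces are noisy, so a practical version would compare positions within a distance threshold rather than for equality.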
Example: Trajectories
Original model: a trajectory is a partial function tr : T → S
New Models
• Interpolation: (same domains)
• Extracting trips:
New Models generate New Data
Model (Information System) Data (Database System)
Obtain data from measurement:
x   y   t
12  13  5:00
12  14  5:01
13  15  5:02
Derive new model from data
Interpretation of Models
Conceptual modeling: analyze the real world and specify a model
model M
Output: data visualization, control
Example: control an autonomous vehicle
data
Input: users, sensors
Information Management
Conceptual modeling
Model = functions, function values and constraints
Information System implements model M
Evaluation
Model Usage: Output, Control
Model Building: Input, Monitoring
Data = represented in a computing system
Utility
Users need information systems to make decisions
Value depends on
• Importance of decision
• Quality of decision
Semantic Layer (application/domain specific): Information Management, Interpretation
Physical Layer (storage, networks, sensors): Operating System, Database
Q: What does distribution mean?
• What does the notion of distribution in information systems
designate?
• What are the implications of distribution?
• What makes distributed information systems different from
non-distributed information systems?
3.3 Distributed Information Management
Distribution can occur at different levels
- Network
- Data
- Models
- Control
Centralized Information System
Centralized Information System on Computer Network
[Figure: four applications connected through a communication network to one information system]
Physical Distribution
Use of distributed physical resources: locality of access,
scalability, parallelism in the execution
Distributed Data Management
[Figure: an application and three information systems connected by a communication network]
Logical Distribution
Use of different data models: semantic heterogeneity
– Independently developed information systems
– Different models for related concepts
Data integration
[Figure: an application and three information systems connected by a communication network]
Autonomy – Distribution of Control
Independent users have to collaborate, coordinate,
negotiate, to perform information management tasks
Multi-agent systems
[Figure: an application and three information systems connected by a communication network]
Key Issues in Distributed Data Management
Where to store data in the network?
– Partitioning of data
– Replication and caching
– Considering typical access patterns and data distributions
How to access data in the network?
– Push vs. pull access (query vs. filtering)
– Indexing of data in the network
– Distribution of queries and filters
– Considering the communication model
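Partitioning and replication can be illustrated with a simple hash-based placement. The scheme below (hash the key, store it on the resulting node and on its successors) is only one possible design, and the function names `home_node` and `replica_nodes` are assumptions.

```python
import hashlib
from typing import List


def home_node(key: str, num_nodes: int) -> int:
    """Hash partitioning: map a key to one node, stable across runs."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def replica_nodes(key: str, num_nodes: int, replicas: int = 2) -> List[int]:
    """Replication: the home node plus its successors (ring order)."""
    first = home_node(key, num_nodes)
    count = min(replicas, num_nodes)
    return [(first + k) % num_nodes for k in range(count)]


for key in ["alice", "bob", "carol"]:
    print(key, replica_nodes(key, num_nodes=4))
```

Any node can compute where a key lives without consulting a central index, which is one way to approach the question of how to access data in the network. A plain modulo scheme moves most keys when the number of nodes changes, which is a known limitation of this sketch.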
Key Issues in Data Integration
Information System M2
MyDB
My Interpretation I
information demand
Mapping: Three Approaches
2. Mapping
– Direct mapping
FIBO
3. Ontologies
– Mediated mapping
More Problems?
Syntactic heterogeneity
– The same conceptual model can be represented
using different logical data models
Information System M1 (DB1), Information System M2 (DB2)
[Figure: information management cycle of conceptual modeling, model building, model usage, evaluation, monitoring and control]
Syntactic interoperability
Key Issues in Autonomy
The User's Problem
Myself / Others
Trust? Privacy? Quality?
Evaluating Trust
Reputation-based trust: if users behaved honestly
in previous interactions, they will do so in the
future
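A very simple reputation score is the fraction of honest past interactions. The functions below are a sketch under that assumption; the prior of 0.5 for users without history and the trust threshold are arbitrary illustrative values.

```python
from typing import List


def reputation(history: List[bool], prior: float = 0.5) -> float:
    """Fraction of past interactions that were honest; prior if none."""
    if not history:
        return prior
    return sum(history) / len(history)


def trust(history: List[bool], threshold: float = 0.5) -> bool:
    """Decide to trust a user whose reputation reaches the threshold."""
    return reputation(history) >= threshold


print(reputation([True, False]))                       # 0.5
print(reputation([True, True, False, False, False]))   # 0.4
```

Such a score treats all past interactions equally. Weighting recent behaviour more strongly is a common refinement that is not shown here.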
Protecting Privacy
Example: location privacy – obfuscation methods
– Perturbation: (3,7)
– Adding dummy regions: (3,5), (1,4), (6,3)
– Reducing precision: (2,5), (3,4), (3,5), (3,6), (4,5)
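The three obfuscation methods can be sketched on grid cells. The true location (3, 5) is an assumption inferred from the example coordinates, and the function names and the shift range are illustrative.

```python
import random
from typing import List, Tuple

Cell = Tuple[int, int]


def perturb(loc: Cell, max_shift: int, rng: random.Random) -> Cell:
    """Perturbation: report a randomly shifted cell."""
    return (loc[0] + rng.randint(-max_shift, max_shift),
            loc[1] + rng.randint(-max_shift, max_shift))


def add_dummies(loc: Cell, dummies: List[Cell], rng: random.Random) -> List[Cell]:
    """Dummy regions: hide the true cell among decoys, in random order."""
    reported = [loc] + dummies
    rng.shuffle(reported)
    return reported


def reduce_precision(loc: Cell) -> List[Cell]:
    """Reduced precision: report the cell with its four neighbours."""
    x, y = loc
    return sorted([(x, y), (x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)])


rng = random.Random(0)
true_location = (3, 5)  # assumed true cell; the slides do not state it
print(perturb(true_location, 2, rng))
print(add_dummies(true_location, [(1, 4), (6, 3)], rng))
print(reduce_precision(true_location))
```

For the assumed location, `reduce_precision` yields exactly the five cells listed for the third method. Each method trades accuracy of the reported location for privacy in a different way.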
[Figure: grid with columns 1 to 9 and rows 1 to 7 marking the example locations]
Refined View of a Distributed Information System
Pragmatic Layer (user/community specific): Autonomy, Social Network, Decision
Semantic Layer (application/domain specific): Heterogeneity, Semantic Network, Interpretation
Syntactic Layer (application/domain independent): Distribution, Distributed Database, Interaction
Physical Layer (storage, networks, sensors): Internet, Sensors, Clouds, Smartphones