
Week 1 - Introduction

The course on Distributed Information Systems (CIS3-535) aims to provide an understanding of distributed information systems, key tasks relevant to them, and common techniques for solving related problems. It covers important models and algorithms for data representation and processing, alongside practical tools for data science. The course is designed to complement other related courses and does not require pre-existing knowledge, though familiarity with databases and machine learning is beneficial.

Distributed Information Systems

(CIS3-535)

Prof. Dr. Mourad Elloumi

2022 - 2023
Introduction - 1
Goals of the Course
Understand what a "Distributed Information System" is
– e.g. Web search engines, online social networks, etc.
Know the key tasks relevant to DIS
– e.g. retrieval, mining, recommending, information extraction,
data integration, etc.
Master common techniques used to solve these problems
– e.g. vector space model, graph mining, word embeddings, etc.

Pre-existing knowledge not required


Knowledge in databases and machine learning helpful

Introduction - 2
Focus of the Course
Master important models and algorithms
for representing and processing information:
Data Science
Conceptual foundations for the practical use of
tools and platforms for Data Science
• Complementary to Applied Data Analysis
by Bob West
Introduction - 3
Other Related Courses
In synergy with
• Applied Data Analysis
Complementary to
• Introduction to database systems
• Database systems
Some overlaps possible with
• Introduction to machine learning
• Machine learning
• Introduction to natural language processing
• Internet analytics
• …

Introduction - 4
References
Parts of the course are based on the following textbooks
– Ricardo Baeza-Yates, Berthier Ribeiro-Neto, Modern Information Retrieval (ACM Press Series), Addison Wesley, 1999.
– Jiawei Han, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
– Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.
– J. Leskovec, A. Rajaraman, J. D. Ullman, Mining of Massive Datasets, 2014.

Further references to the literature will be given during the lecture

Introduction - 5
Free books

mmds.org
http://nlp.stanford.edu/IR-book/

Introduction - 6
Chapter 0

Distributed Information Systems:


An Overview

Introduction - 7
During the first couple of lectures, my friends and I
all thought the lectures were a bit abstract
- DIS student 2021

The investigation of the meaning of words is the
beginning of education
- Antisthenes, 445 – 365 B.C.

Introduction - 8
Overview
1. What is an Information System?
2. Data Modelling
2.1 Models
2.2 Data and Models
2.3 Representation
3. Managing data and information
3.1 Data Management
3.2 Information Management
3.3 Distributed Information Management
4. About the lecture

Introduction - 9
1. WHAT IS AN INFORMATION SYSTEM?

Introduction - 10
Q: What is an information system?
• How would you define what an information system is?
• Think about what you believe are information systems
• What is not an information system? Are there any computer
systems that are not information systems?
• What is the difference between computer science and
information systems?

• Discuss with your neighbor to try to agree on the concept

Introduction - 11
Information Systems – are everywhere
A day in a student’s life
– IS academia: course registration, grades, …
– Moodle: course information, slides, …
– Bank account: payments, savings, ….
– Library system: literature
– Search engine: where to find food, …
– Facebook, TikTok: connecting to friends, news, …
– Google maps: finding your way
– Campus map: finding lecture hall
– Email: exchanging messages
Introduction - 12
The Business Perspective
Jobs related to information systems
• Project Managers
• Chief Information Officers (CIO)
• Technical Writers
• System Analysts
• Requirements Analyst

Jobs related to computer science
• Computer Programmer
• Java Developer
• Database Administrator
• Software Engineer
• Network Engineer

Introduction - 13
The Data Perspective
Information systems:
Computer systems handling and interpreting large amounts of data
- Transaction data
- Documents
- Maps
- Social networks
- Sensor data

Not information systems:
Computer systems performing lots of computation
- Computational science and simulation
- Computer games
- Computer algebra (Matlab etc.)

Introduction - 14
Ask Wikipedia

2018 2021

Introduction - 15
A Systems Perspective
Information Processing

Performs action
(decision)

Environment

Receives signal (data)

This model applies to all types of systems: organisms, species, humans, organizations, societies, etc.

Introduction - 16
Information Processing: example

Send ship for rescue

•••−−−•••

coast guard
Introduction - 17
Data-Information-Knowledge

Send ship for rescue

Knowledge:
If SOS then “send ship”
Information: SOS

Data: • • • − − − • • •

•••−−−•••

Data = signal received


Information = understanding of signal
Knowledge = decision taking
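
The three levels can be traced in a few lines of Python. This is only a sketch: the lookup tables, the function names `interpret` and `decide`, and the ASCII encoding of the signal (groups of `.` and `-` separated by spaces) are assumptions made for illustration; only the SOS rule comes from the slide.

```python
# Hypothetical lookup tables; only the SOS example comes from the slides.
MORSE_TO_LETTER = {"...": "S", "---": "O"}
RULES = {"SOS": "send ship"}  # knowledge: if SOS then "send ship"


def interpret(signal: str) -> str:
    """Turn raw data (dots and dashes) into information (letters)."""
    return "".join(MORSE_TO_LETTER[group] for group in signal.split())


def decide(information: str) -> str:
    """Apply knowledge (a rule) to information to choose an action."""
    return RULES.get(information, "no action")


data = "... --- ..."           # data: the signal received
information = interpret(data)  # information: the signal understood as "SOS"
action = decide(information)   # knowledge applied: the decision taken
print(data, "->", information, "->", action)
```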
Introduction - 18
Information Systems

Send ship for rescue

Knowledge:
If SOS then “send ship”
Information: SOS

Data: • • • − − − • • •

•••−−−•••

An information system automates the process of receiving, interpreting and acting upon signals
A lot of signals = Big Data

Introduction - 19
What is an Information System?

An information system is a computing system interacting with
the environment with the capability of
1. managing data,
2. inferring information by understanding this data,
3. for performing a given task

Introduction - 20
Types and Sources of Data
IS academia: course registration, grades, …
Moodle: course information, slides, …
Bank account: payments, savings, …
Library system: literature
Search engine: where to find food, …
Facebook, TikTok: connecting to friends, news, …
Google maps: finding your way
Campus map: finding lecture hall

Types of data:
– Text
– Structured database
– Knowledge bases
– Social networks
– Sensors
– Images and videos
Introduction - 21
Tasks
IS academia: course registration, grades, …
Moodle: course information, slides, …
Bank account: payments, savings, …
Library system: literature
Search engine: where to find food, …
Facebook, TikTok: connecting to friends, news, …
Google maps: finding your way
Campus map: finding lecture hall

Tasks:
– Searching information
– Event notification
– Prediction
– Classification
– Recommendation
– Summarization
– Question answering

Introduction - 22
Q: What does “understanding” mean?
• What does it mean to understand/interpret data?
• What exactly is information?
• How is information different from data?
• How can we technically characterize the difference between
information and data?

• Discuss with your neighbor to try to agree on some answer

Introduction - 23
2. DATA MODELING

Introduction - 24
2.1 Models

Understanding the real world is associated
with the availability of a model of it
Figure: a model M of human communication, physical phenomena and social organization
Introduction - 25
Models = Mathematical Models
A mathematical model is a representation of a system
using mathematical concepts and language
A mathematical model consists of a set of
– Elements (or constants/identifiers)
– Functions (or relations)
– Axioms (or constraints)
The set of elements and
functions must be consistent
with the axioms
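
As a rough illustration of elements, functions and axioms, the sketch below encodes a tiny social-organization model and checks one axiom, "each person belongs to at least one unit", which is the axiom the slides give for that example. The concrete names and the helper `satisfies_axiom` are made up.

```python
# Elements (constants): names of people and units (hypothetical values).
people = {"alice", "bob"}
units = {"lab", "admin"}

# Function/relation: which person belongs to which unit.
belongs_to = {("alice", "lab"), ("bob", "lab"), ("bob", "admin")}


def satisfies_axiom(people, units, belongs_to):
    """Axiom: each person belongs to at least one (known) unit."""
    return all(
        any(p == person and u in units for p, u in belongs_to)
        for person in people
    )


print(satisfies_axiom(people, units, belongs_to))
```

Adding a person without any membership makes the elements and relation inconsistent with the axiom, so the check fails.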

Introduction - 26
What is an information system?
An information system is a computing system that manages
a representation of a model of its real-world environment
within a computer system for a given purpose

Model M

Introduction - 27
Examples of Models
Human knowledge
– Constants: document identifiers, text (sequence of characters)
– Function:
– Axiom:
Physical phenomena
– Constants: coordinate values, temperature values
– Function:
– Axiom:
Social organization
– Constants: names of people and units
– Function:
– Axiom: each person belongs to at least one unit
Introduction - 28
How do we know that a model is good?
The model should represent some aspect of the
real world

Representation
Information System
implements model M

Reality R
We have to evaluate whether the model represents reality properly
Introduction - 29
Evaluation is hard!
– Human knowledge (Information Retrieval): compare to human evaluation
– Physical phenomena (Science): fit to empirical data
– Social organization (Business Analysis): verify with users
Introduction - 30
Q: What is data?
• What is the connection between data and models?
• What actually is data?
• How do information systems deal with data to support models?

• Discuss with your neighbor to try to agree on some answer

Introduction - 31
2.2 Data and Models
Example: a mathematical model for trajectories:
• Domains: time T, space S
• trajectories is a set of functions from T to S
• one trajectory is a partial function tr : T → S

Introduction - 32
Representation of Functions
Functions can be represented
1. by a specification or algorithm or
2. by enumerating values it can take
• The enumerated values are called data
Representing a single trajectory
• as algorithm: )
• as data, a set of samples:

Introduction - 33
Functions in Information Systems
Functions can be implemented as algorithms
– functions, queries, views etc.

Information systems often rely on representation
of functions as data
– many aspects of the world are not algorithmically
defined, e.g., birthdate of a person or outcome of an
experiment, but are rather observations
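
The contrast between the two representations can be sketched as follows. The constant-velocity rule and the sample values are invented for illustration; the names `position_by_rule` and `position_samples` are hypothetical.

```python
# A function given by an algorithm: values are computed on demand.
def position_by_rule(t: float) -> tuple[float, float]:
    """Hypothetical trajectory: constant velocity along the x-axis."""
    return (2.0 * t, 0.0)


# A function given as data: enumerated (observed) values.
position_samples = {0.0: (0.0, 0.0), 1.0: (2.0, 0.0), 2.0: (4.0, 0.0)}

# Both representations agree at the sampled times...
for t, observed in position_samples.items():
    assert position_by_rule(t) == observed

# ...but the data representation is partial: unsampled times are undefined.
print(position_samples.get(1.5))  # None
print(position_by_rule(1.5))      # (3.0, 0.0)
```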

Introduction - 34
Data Structures
How is a model represented in a
computer system?
A mathematical model M is represented using a
data structure D.
A data structure D is a discrete mathematical
structure, together with its operations, that can be
processed by a computer
Introduction - 35
Abstract Data Types (ADT)
Are mathematical models of data structures
Example: associative array A

• are other ADTs


• Operations:
A.put(key, value), A.get(key),
A.delete(key)
• Constraint: every key occurs only once
Can represent a function
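
A minimal sketch of this ADT in Python, assuming a built-in `dict` as the backing store (the class name `AssociativeArray` is hypothetical). It exposes exactly the three operations named above and relies on `dict` to keep every key unique.

```python
class AssociativeArray:
    """Sketch of the ADT: put, get, delete; every key occurs only once."""

    def __init__(self):
        self._items = {}  # a Python dict already enforces unique keys

    def put(self, key, value):
        self._items[key] = value  # overwriting keeps the key unique

    def get(self, key):
        return self._items[key]  # raises KeyError if the key is absent

    def delete(self, key):
        del self._items[key]


# Representing a function f with f(1) = "a" and f(2) = "b" (made-up values).
A = AssociativeArray()
A.put(1, "a")
A.put(2, "b")
A.put(1, "c")  # redefines f(1); the key 1 still occurs only once
print(A.get(1))
```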
Introduction - 36
Example
Mapping the domains to the data structures
time in seconds from some reference time

longitude, latitude in degrees


Mapping the functions to the data structures

Representing the trajectory model in Python


set(dict(float, tuple(float, float)))
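
The type expression above is schematic notation rather than executable Python. One possible way to write it with type hints is sketched below; the alias names and the sample coordinates are assumptions. Note one detail the notation hides: a Python `set` cannot contain `dict` objects because they are not hashable, so the sketch falls back to a list.

```python
# Domains mapped to data structures, as on the slide:
Time = float                     # seconds from some reference time
Position = tuple[float, float]   # (longitude, latitude) in degrees

# One trajectory: a partial function from time to position.
Trajectory = dict[Time, Position]

# The set of trajectories. A Python set cannot hold dicts (they are not
# hashable), so this sketch uses a list instead; that is an assumption.
trajectories: list[Trajectory] = [
    {0.0: (6.5668, 46.5191), 60.0: (6.5700, 46.5200)},  # made-up samples
    {0.0: (6.6323, 46.5197)},
]

print(trajectories[0][60.0])
```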
Introduction - 37
Is this mapping trivial?
Not exactly. If the meaning is not precisely
understood bad things may happen.

Introduction - 38
Object-Oriented Models
Allow encapsulating ADTs and mimicking the
(signature of the) mathematical structure

Introduction - 39
Data Models
A data model is a language (or framework) used to
specify data structures. It consists of
1. Data Definition Language (DDL)
2. Data Manipulation Language (DML)

The specification of a data model using a DDL is


called a (database) schema S.

Introduction - 40
Attention
The term data model is used in different ways!
• Sometimes it refers to frameworks for specifying
conceptual, logical and physical models
• Sometimes a specific schema is called data model

Introduction - 41
Physical Implementation of Data Structures
Requires a binary representation of the data structure
• Different implementations have different performance characteristics
Example: Map associative array to an array structure
k1 v1 k2 v2 ….
000 010 110 001 …
implement the functions of the ADT
Alternative implementations of associative arrays
• hash tables, binary search trees, tries, …
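
To make the array mapping concrete, here is a sketch of an associative array stored as one flat list `[k1, v1, k2, v2, ...]`. The class name and the linear-scan lookup are illustrative choices, not taken from the slides.

```python
class FlatAssociativeArray:
    """Associative array stored as one flat list: [k1, v1, k2, v2, ...]."""

    def __init__(self):
        self._cells = []

    def _index_of(self, key):
        # Linear scan over the key positions (even indices): O(n) lookup.
        for i in range(0, len(self._cells), 2):
            if self._cells[i] == key:
                return i
        return -1

    def put(self, key, value):
        i = self._index_of(key)
        if i >= 0:
            self._cells[i + 1] = value        # key exists: overwrite value
        else:
            self._cells.extend([key, value])  # new key: append the pair

    def get(self, key):
        i = self._index_of(key)
        if i < 0:
            raise KeyError(key)
        return self._cells[i + 1]

    def delete(self, key):
        i = self._index_of(key)
        if i < 0:
            raise KeyError(key)
        del self._cells[i:i + 2]
```

The interface is the same as for a hash table, but every lookup scans the list, which is the kind of performance difference between implementations that the slide points to.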

Introduction - 42
Relationship among Model and Data Model
The same function can be represented using
different data structures
– What you think: tr : T → S
– What you program: Dictionary, List, Relation
– What happens: different physical implementations of the data types
Introduction - 43
Three levels of modeling
Conceptual modelling
• mathematical model that a user thinks in

Logical modelling
• mathematical model that a computer can process

Physical modelling
• binary representation of logical model
Introduction - 44
Three Notions of Data
Conceptual level (data as in science)
• Values a variable (function) can take

Logical level (structured data as in programming)


• Value (instance) of an abstract data type

Physical level (data as in information theory)


• Binary representation
Introduction - 45
Representation of Conceptual Models
Conceptual models of the real world are represented in
information systems as data structures
- The elements are represented as structured data
- The functions are represented as programs operating
on this data or represented as structured data
- The structured data is represented in binary format

Introduction - 46
Refined View of an Information System

Semantic Layer (Application/Domain specific): Information System, Conceptual model
represented by
Syntactic Layer (Application/Domain independent): Database System, Logical model
represented by
Physical Layer: Operating System, Storage, Physical model (010000110)

Introduction - 47
Q: What does “representation” mean?
• We have been using repeatedly the notion of representation, but
what (mathematical) notion does representation refer to?
• How do we know that a representation is a representation?
• Are there different kinds of representation?

• Discuss with your neighbor to try to agree on some answer

Introduction - 48
2.3 Representation
A representation is a very general relationship that
expresses similarities or equivalences between
mathematical structures.
• Maps the domain of one structure into the
other
• Maps functions and relations of one structure
into the other
• Preserves some properties
Introduction - 49
Example: Homomorphism
Let A and B be two mathematical structures with the same
functions

A homomorphism H : A → B has to satisfy for every function f:
H(f(a1, …, an)) = f(H(a1), …, H(an))

This is one possible preserved property!

If H is in addition bijective, then H is an isomorphism
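
The homomorphism condition, H applied to the result of a function equals the function applied to the images under H, can be tested on a finite sample. The sketch below uses a standard textbook case that is not from the slides: reduction modulo N maps integer addition onto addition modulo N. The helper `is_homomorphism` is hypothetical and only handles binary functions.

```python
def is_homomorphism(H, f_A, f_B, domain):
    """Check H(f_A(x, y)) == f_B(H(x), H(y)) for all x, y in a finite sample."""
    return all(
        H(f_A(x, y)) == f_B(H(x), H(y))
        for x in domain
        for y in domain
    )


N = 5
add = lambda x, y: x + y              # addition on the integers
add_mod = lambda x, y: (x + y) % N    # addition modulo N
H = lambda x: x % N                   # candidate map: integers -> residues

sample = range(-10, 11)
print(is_homomorphism(H, add, add_mod, sample))  # the property holds
print(is_homomorphism(abs, add, add, sample))    # abs does not preserve addition
```

The map `x % N` is not bijective, so it is a homomorphism but not an isomorphism.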

Introduction - 50
Example: Representing a Discrete Model
Mathematical model:
graph
neighbor function:
Representation as abstract data type

of type list(tuple(int,int))
def neighbor(i, G):
… code to retrieve all neighbors …

R is bijective, isomorphic!
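
One way the elided body of `neighbor` could look, assuming the graph is stored as a list of edges and that edges are undirected. Both assumptions go beyond what the slide states, and the example graph is made up.

```python
# Graph as a list of edges, i.e. list(tuple(int, int)); made-up example.
G = [(1, 2), (1, 3), (2, 3), (3, 4)]


def neighbor(i, G):
    """Return all nodes adjacent to i, treating edges as undirected (an assumption)."""
    result = set()
    for a, b in G:
        if a == i:
            result.add(b)
        elif b == i:
            result.add(a)
    return result


print(neighbor(3, G))
```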
Introduction - 51
Binary Representation
Data structure: list(int)
Example: 1,8,3, …

Binary representation:
if

Example: 001 100 010 …

R is bijective, isomorphic!
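
A round trip through a binary encoding shows what bijectivity means in practice. The sketch assumes a fixed-width format, 4-byte big-endian unsigned integers via the standard `struct` module, which is wider than the short bit groups shown on the slide and is bijective only for integers that fit in 32 bits.

```python
import struct

values = [1, 8, 3]

# Encode each integer as a 4-byte big-endian unsigned value (an assumed format).
encoded = struct.pack(f">{len(values)}I", *values)

# Decode: the inverse mapping recovers exactly the original list.
decoded = list(struct.unpack(f">{len(encoded) // 4}I", encoded))

print(encoded.hex())
print(decoded == values)
```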
Introduction - 52
Data Exchange Format
Data structure: list(int)
Example: 1,8,3, …

JSON representation:
‘[1, 8, 3,…]’

R is bijective, isomorphic!!
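
The same round-trip check works for the JSON exchange format, using Python's standard `json` module. For a list of integers the mapping is lossless in both directions; this does not hold for every Python structure (a tuple, for instance, comes back as a list).

```python
import json

values = [1, 8, 3]

text = json.dumps(values)      # data structure -> JSON text
restored = json.loads(text)    # JSON text -> data structure

print(text)
print(restored == values)
```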

Introduction - 53
Example: Representing a Continuous Model
Mathematical model: domains ,
Trajectory:
Representation of time domain in Python:

Almost a homomorphism!
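
Assuming the continuous time domain is mapped to Python's `float` (the slide's own mapping did not survive extraction), the "almost" can be observed directly: addition on the reals is associative, while floating-point addition is only approximately so.

```python
import math

# Addition on the reals is associative; float addition is only approximately so.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c
right = a + (b + c)

print(left == right)              # False: the property is not preserved exactly
print(math.isclose(left, right))  # True: it is preserved approximately
```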

Introduction - 54
Representations are not always accurate
The representations need not always be
homomorphisms
- They preserve some relations approximately
- Measures for the quality of approximation are
needed

Introduction - 55
Example
Representing an empirical mathematical model, by
another mathematical model
• Assume we have only partial data for a trajectory (samples)
• We interpolate the missing values, e.g., linear interpolation

• If new data arrives, we can estimate the error


• The error measure characterizes the quality of the new
representation
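
The steps above can be sketched for a one-dimensional trajectory. The sample values, the new observation, and the function name `interpolate` are made up; the error measure used here is simply the absolute difference between the interpolated and the observed value.

```python
def interpolate(samples, t):
    """Linearly interpolate a 1-D trajectory given as sorted (time, value) samples."""
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if t0 <= t <= t1:
            w = (t - t0) / (t1 - t0)
            return v0 + w * (v1 - v0)
    raise ValueError("t lies outside the sampled time range")


# Made-up samples of tr, with distinct and increasing times.
samples = [(0.0, 0.0), (10.0, 5.0), (30.0, 15.0)]

# A new sample for tr arrives at time 20.0.
new_t, new_value = 20.0, 12.0

estimate = interpolate(samples, new_t)
error = abs(estimate - new_value)
print(estimate, error)
```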

Introduction - 56
Illustration

Figure: trajectory samples tr(t1) to tr(t4), the interpolation tr_int, and the error between tr_int and a new sample for tr
Introduction - 57
Representing the Real World
A model represents a reality, thus there exists a
(hypothetical) representation relationship
Representation
Information System
implements model M

Evaluating the model to characterize the quality


of representation
Introduction - 58
3. MANAGING DATA AND INFORMATION

Introduction - 59
Q: What is needed to manage data?
• Identify problems and techniques that are used to manage
data
• What kind of systems are being used to perform those
tasks?

• Discuss with your neighbor to try to agree on some answer

Introduction - 60
3.1 Data Management
The collection of data represented in a data model
D is called database DB.

A computer system that is designed to (generically)
manage databases is called a database
management system DBMS.

Introduction - 61
Data Management
Efficient management of large amounts of data
– efficient storage and indexing
– efficient search and aggregation
Ensuring persistence and consistency of data under
updates and failures
– Persistence = data stored independent of lifetime of programs
– Consistency = data correct independent of type of failures

Introduction - 62
Examples of DBMS
Programming environment
• e.g. Python
Relational data management
• e.g. SQL, Pandas
Distributed data processing
• e.g. Map-Reduce, Spark
Text processing
• e.g. Elasticsearch, Lucene
Introduction - 63
Example: Relational Schema
Data Definition Language

CREATE TABLE Trajectory
  (tid integer, date datetime, x float, y float,
   primary key (tid, date))

Data Manipulation Language

SELECT date, x, y FROM Trajectory WHERE tid = 123
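
The schema and query can be tried out with Python's built-in `sqlite3` module. This is a sketch under assumptions: SQLite is used only because it ships with Python, the database is in-memory, and the inserted rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    """
    CREATE TABLE Trajectory
      (tid integer, date datetime, x float, y float,
       primary key (tid, date))
    """
)

# Made-up rows for illustration.
rows = [
    (123, "2022-09-20 05:00", 12.0, 13.0),
    (123, "2022-09-20 05:01", 12.0, 14.0),
    (456, "2022-09-20 05:00", 40.0, 41.0),
]
conn.executemany("INSERT INTO Trajectory VALUES (?, ?, ?, ?)", rows)

result = conn.execute(
    "SELECT date, x, y FROM Trajectory WHERE tid = ? ORDER BY date", (123,)
).fetchall()
print(result)

# The primary key rejects a second position for the same (tid, date).
try:
    conn.execute("INSERT INTO Trajectory VALUES (123, '2022-09-20 05:00', 0, 0)")
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```

The primary key on (tid, date) is what makes each trajectory a function of time: one trajectory cannot have two positions at the same instant.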

Introduction - 64
Example: DataFrames

Introduction - 65
Q: What is information management?
• We have seen what data management tasks are. What, then,
are information management tasks?
• Which tasks does an information management system have to
support?

• Discuss with your neighbor to try to agree on some answer

Introduction - 66
3.2 Information Management
Depending on the “degree of abstraction”, data is
frequently classified as
– Knowledge: knowledge graphs, rules
– Structured data: relational data, XML and RDF data
– Unstructured data: measurement data, text, images and video
Introduction - 67
Levels of Abstraction - Characteristics
Unstructured data
• Data captured from measurements and human input
• Fixed data structure (e.g., time series)
Structured data
• Data structure defined using a data model
• Captures relationships in the data
Knowledge
• Schema can evolve dynamically as knowledge expands
• Captures decision rules, objectives and intentions
The classification is not accurate!
Introduction - 68
Example
Unstructured data: a GPS trace
– Bob’s and Alice’s GPS trace
Structured data: road segments, places
– Places that Bob and Alice visit
Knowledge: concepts and inferences
– Bob and Alice are frequently together in Ouchy, thus: Bob loves Alice
Introduction - 69
Model Building
Creating “higher level abstractions” from
“lower level data”
• Using statistical and machine learning
methods, rule-based approaches, …
• Typically on large datasets
• Also called: data mining,
data science, data analytics, etc.
Create new models that represent the real
world in a different way
Figure: segmentation, clustering, classification, knowledge base

Introduction - 70
Information Management
Tasks
– Search
– Filtering
– Classification
– Prediction
– Recommendation
– Summarization
– Integration
– Question answering

Model Usage: given a model, perform the task
Example: count the number of stops of a trip.

Model Building: given data, derive a model that supports the task
Example: given GPS traces, find places and road segments

Introduction - 71
Example: Trajectories
Original model: trajectory is a partial function
New Models
• Interpolation: (same domains)
• Extracting trips:

• The trips can be obtained by analyzing the velocity of position data

• Task: Counting the number of stops


Methods: Data mining, Machine Learning, Rules, Statistics
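
A rule-based sketch of the stop-counting task: compute the speed on each segment of the position data and count the maximal runs of segments below a threshold. The threshold value, the trace, and the function name `count_stops` are assumptions; the slides also mention data mining and machine learning as alternative methods.

```python
import math


def count_stops(samples, speed_threshold=0.5):
    """Count maximal runs of consecutive segments whose speed is below the threshold.

    samples: list of (t, x, y) sorted by strictly increasing t.
    """
    stops = 0
    stopped = False
    for (t0, x0, y0), (t1, x1, y1) in zip(samples, samples[1:]):
        speed = math.hypot(x1 - x0, y1 - y0) / (t1 - t0)
        if speed < speed_threshold and not stopped:
            stops += 1  # a new stop begins
        stopped = speed < speed_threshold
    return stops


# Made-up trace: move, stand still, move, stand still.
trace = [(0, 0, 0), (1, 5, 0), (2, 5, 0), (3, 5, 0), (4, 9, 0), (5, 9, 0)]
print(count_stops(trace))
```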

Introduction - 72
New Models generate New Data
Model (Information System) and Data (Database System)

Obtain data from measurement:
x   y   t
12  13  5:00
12  14  5:01
13  15  5:02

Derive new model from data; data corresponding to the model:
place  x   y   start  end
p111   12  13  2:00   2:10
p112   13  15  2:25   2:30

Derive new model from data; data corresponding to the model:
Subject  Relation  Object
home     ISA       Place
bus      ISA       Vehicle
Bob      ISA       Human

Introduction - 73
Interpretation of Models
Conceptual modeling: analyze the real
world and specify a model
Example: define concepts temperature,
location, measurement

Evaluation: given a model, evaluate it
against reality
Example: compare predicted stops to
real stops
Introduction - 74
Interacting with the Real World

Building systems where the data
interacts with the real world
Output: data visualization, control
Example: control an autonomous vehicle
Input: users, sensors

Introduction - 75
Information Management

Model = functions, function values and constraints
Data = represented in a computing system
Information System implements model M
Tasks: conceptual modeling, evaluation, model usage, model building, output/control, input/monitoring

Introduction - 76
Utility
Users need information systems to make decisions

Utility of information is linked to the value achieved

Value depends on
• Importance of decision
• Quality of decision

Quality of decision depends on quality and
understandability of information!

Using information systems for decision making is
associated with the notion of knowledge
management.
Introduction - 77
Refined View of an Information System
Pragmatic Layer (User/Community specific): Knowledge Management
Decision
Semantic Layer (Application/Domain specific): Information Management
Interpretation
Syntactic Layer (Application/Domain independent): Database Management
Measurement
Physical Layer (Storage, Networks, Sensors): Operating System, Database

Introduction - 78
Q: What does distribution mean?
• What does the notion of distribution in information systems
designate?
• What are the implications of distribution?
• What makes distributed information systems different from
non-distributed information systems?

• Discuss with your neighbor to try to agree on some answer

Introduction - 79
3.3 Distributed Information Management
Distribution can occur at different levels
- Network
- Data
- Models
- Control

Introduction - 80
Centralized Information System
Centralized information system on a computer network
Figure: several applications connected through a communication network to a single information system

Introduction - 81
Physical Distribution
Use of distributed physical resources: locality of access,
scalability, parallelism in the execution
Distributed Data Management
Figure: an application and several information systems connected through a communication network

Introduction - 82
Logical Distribution
Use of different data models: semantic heterogeneity
– Independently developed information systems
– Different models for related concepts
Data integration
Figure: an application and several information systems connected through a communication network

Introduction - 83
Autonomy – Distribution of Control
Independent users have to collaborate, coordinate,
negotiate, to perform information management tasks
Multi-agent systems
Figure: an application and several information systems connected through a communication network

Introduction - 84
Key Issues in Distributed Data Management
Where to store data in the network?
– Partitioning of data
– Replication and caching
– Considering typical access patterns and data distributions
How to access data in the network?
– Push vs. pull access (query vs. filtering)
– Indexing of data in the network
– Distribution of queries and filters
– Considering the communication model
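
Partitioning and replication can be sketched with a simple hash-based placement rule. Everything here is an illustrative assumption: the key format, the use of CRC32 as the hash, the fixed number of nodes, and the ring-order replica placement (which requires `replicas` to be at most `num_nodes`).

```python
import zlib


def home_node(key: str, num_nodes: int) -> int:
    """Hash partitioning: map a key to one node deterministically."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def replica_nodes(key: str, num_nodes: int, replicas: int = 2) -> list[int]:
    """Replication: store the key on its home node and the next nodes in ring order."""
    first = home_node(key, num_nodes)
    return [(first + i) % num_nodes for i in range(replicas)]


# Hypothetical trajectory identifiers spread over 4 nodes.
for key in ["tid:123", "tid:456", "tid:789"]:
    print(key, replica_nodes(key, num_nodes=4))
```

Because placement depends only on the key, any node can compute where a record lives without consulting a central index. A scheme like this ignores access patterns and data distributions, which the list above names as further considerations.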

Introduction - 85
Key Issues in Data Integration

More data! … More models!? … More useful information?

Figure: information systems M1 and M2 (information supply) interpret databases DB1 and DB2 through interpretations I1 and I2; my information system MyM (information demand) interprets MyDB through my interpretation I. Some relationship R exists; is there a corresponding R'?

R' is a relationship among models!
(represent one model in terms of another)
Introduction - 86
The Problem
Semantic heterogeneity
– The same real world aspect can be represented
differently
– Requires agreement on the meaning of shared data
– Relating different models (and thus different
representations and their interpretations) often
requires human intervention
– Human attention is a scarce resource!

Introduction - 87
Mapping: Three Approaches

1. Standardization
– Mapping through standards (e.g. EDIFACT)
2. Mapping
– Direct mapping
3. Ontologies
– Mediated mapping (e.g. FIBO)
Introduction - 88
More Problems?
Syntactic heterogeneity
– The same conceptual model can be represented
using different logical data models
Figure: information systems M1 and M2 run on Database System 1 and Database System 2 with databases DB1 and DB2. The data models might be different, e.g. one uses relational, the other uses XML.
Introduction - 89
Information Management Tasks
Semantic interoperability

Conceptual modeling
Information System
implements model M

Evaluation
Model usage Model building

Monitoring

data
Control

Syntactic interoperability
Introduction - 90
Key Issues in Autonomy
The User's Problem
Myself (my information) and others (other information): Trust? Privacy? Quality?

Revealing quality information increases trust,
but lowers privacy
Introduction - 91
Evaluating Quality of Information
Recommendations (e.g. Google PageRank)
Figure: link graph with pages A and B and transition weights on the edges
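
As a rough sketch of how link-based recommendation scores can be computed, the following runs the usual PageRank power iteration on a small graph. The three-page graph is made up and is not the graph from the slide; the damping factor 0.85 and the iteration count are conventional defaults chosen here, and the code assumes every page has at least one outgoing link and every link target is itself a key of `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration for PageRank on a dict: node -> list of outgoing links.

    Assumes every node has at least one outgoing link (no dangling nodes).
    """
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in links.items():
            share = damping * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Made-up three-page web, not the graph from the slide.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
print({node: round(value, 3) for node, value in ranks.items()})
```

A page ranks highly when it is linked from pages that themselves rank highly; here C collects links from both A and B and ends up with the largest score.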

Introduction - 92
Evaluating Trust
Reputation-based trust: if users behaved honestly
in previous interactions, they will do so in the
future

Introduction - 93
Protecting Privacy
Example: location privacy – obfuscation methods
– Perturbation: (3,7)
– Adding dummy regions: (3,5), (1,4), (6,3)
– Reducing precision: (2,5), (3,4), (3,5), (3,6), (4,5)
Figure: grid with columns 1 to 9 and rows 1 to 7 showing the reported locations
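
The three obfuscation methods can be sketched on a grid of cells. The true location (3, 5) is an assumption inferred from the coordinates listed above, and the function names, the shift range, and the 9 by 7 grid size are illustrative choices.

```python
import random

true_location = (3, 5)  # assumed true cell; the slides do not state it explicitly


def perturb(location, max_shift=2, rng=random):
    """Perturbation: report a randomly shifted cell instead of the true one."""
    x, y = location
    return (x + rng.randint(-max_shift, max_shift),
            y + rng.randint(-max_shift, max_shift))


def add_dummies(location, count=2, grid=(9, 7), rng=random):
    """Dummy regions: hide the true cell among randomly chosen other cells."""
    cells = {location}
    while len(cells) < count + 1:
        cells.add((rng.randint(1, grid[0]), rng.randint(1, grid[1])))
    return sorted(cells)


def reduce_precision(location):
    """Reduced precision: report the cell together with its four neighbours."""
    x, y = location
    return sorted([(x, y), (x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)])


print(reduce_precision(true_location))
```

With the assumed true cell, `reduce_precision` returns the same five cells listed in the "Reducing precision" bullet.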

Introduction - 94
Refined View of a Distributed Information
System
Pragmatic Layer (User/Community specific): Social Network
Autonomy, Decision
Semantic Layer (Application/Domain specific): Semantic Network
Heterogeneity, Interpretation
Syntactic Layer (Application/Domain independent): Distributed Database
Distribution, Interaction
Physical Layer (Storage, Networks, Sensors): Internet
Sensors, Clouds, Smartphones

Introduction - 95
