Summary of Principles of Distributed Database Systems, Chapter 12: Web Data Management
The indexer module constructs indexes on the downloaded pages; two common ones are text indexes and link indexes. A text index maps each word to the URLs of the pages in which it occurs, while a link index describes the link structure of the web, providing information on the in-links and out-links of pages.
The ranking module sorts the large number of results so that the most relevant ones are presented for the user's search. This problem has drawn increased interest because of the special characteristics of the web, such as short queries executed over vast amounts of data.
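The formula referred to below is the basic (undamped) PageRank; in its standard form it can be written as follows, where B(P_i) is the set of pages that link to P_i (its backlinks) and F(P_j) is the set of forward links of P_j:

\[ PR(P_i) = \sum_{P_j \in B(P_i)} \frac{PR(P_j)}{|F(P_j)|} \]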
Recall also that this formula calculates the rank of a page from its backlinks, but normalizes the contribution of each backlinking page Pj by the number of forward links that Pj has. The idea is that it is more important to be pointed at by pages that link to other pages conservatively than by pages that link indiscriminately, so the "contribution" of a link from such a page is spread over all the pages it points to.
Which page the crawler visits next after crawling a page is a crucial issue. The crawler maintains a queue of URLs, which can simply be ordered by discovery order; alternative strategies include breadth-first ordering, random ordering, or ordering by metrics such as backlink counts or PageRank. Using PageRank requires a slight revision of the formula, modeling a random surfer who, with probability d, follows one of the links on the current page (each chosen with equal probability) and, with probability 1−d, jumps to a random page. The formula for PageRank is then revised as follows:
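\[ PR(P_i) = (1 - d) + d \sum_{P_j \in B(P_i)} \frac{PR(P_j)}{|F(P_j)|} \]

Here d is the damping factor, B(P_i) is the set of pages that link to P_i, and |F(P_j)| is the number of forward links of P_j (the standard damped form, consistent with the d = 0.85 computation in Example 12.1 below).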
The ordering of the URLs according to this formula allows the importance of a page to be incorporated into
the order in which the corresponding page is visited. In some formulations, the first term is normalized with
respect to the total number of pages in the web.
Example 12.1 Consider the web graph in Fig. 12.3 where each web page Pi is a vertex and there is a directed edge from Pi to Pj if Pi has a link to Pj. Assuming the commonly accepted value of d = 0.85, the PageRank of P2 is PR(P2) = 0.15 + 0.85(PR(P1)/2 + PR(P3)/3). This is a recursive formula that is evaluated by initially assigning each page an equal PageRank value (in this case 1/6, since there are 6 pages) and iterating to compute each PR(Pi) until a fixpoint is reached (i.e., the values no longer change).
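A minimal Python sketch of this fixpoint iteration, run over a small hypothetical link graph (not the graph of Fig. 12.3, which is not reproduced here); the damping factor and convergence threshold are illustrative choices:

```python
# Iterative (damped) PageRank over a hypothetical link graph.
# out_links maps each page to the pages it links to.
out_links = {
    "P1": ["P2", "P3"],
    "P2": ["P3"],
    "P3": ["P1", "P2", "P4"],
    "P4": ["P1"],
}

def pagerank(out_links, d=0.85, eps=1e-8):
    pages = list(out_links)
    pr = {p: 1.0 / len(pages) for p in pages}   # equal initial values
    # Precompute backlinks: which pages point to each page.
    backlinks = {p: [q for q in pages if p in out_links[q]] for p in pages}
    while True:
        new_pr = {
            p: (1 - d) + d * sum(pr[q] / len(out_links[q]) for q in backlinks[p])
            for p in pages
        }
        if max(abs(new_pr[p] - pr[p]) for p in pages) < eps:   # fixpoint reached
            return new_pr
        pr = new_pr

print(pagerank(out_links))
```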
Crawling is a continuous activity that involves revisiting web pages to update the information. Incremental
crawlers are designed to ensure fresh information by selectively revisiting pages based on their change
frequency or by sampling a few pages. Change frequency-based approaches use an estimate of a page's
change frequency to determine its revisit frequency. Sampling-based approaches focus on websites rather
than individual pages, sampling a small number of pages from a site to estimate the change in the site.
Focused crawlers are used to search for pages related to a specific topic, ranking pages based on their relevance to that topic. Learning techniques, such as the naïve Bayes classifier and reinforcement learning, are used to identify the topic of a given page. Crawling can also be parallelized by running multiple crawlers concurrently, but coordination schemes must minimize the overhead of parallelization. One method is to use a central coordinator to dynamically assign each crawler a set of pages to download; another is to logically partition the web so that each crawler knows its partition without central coordination.
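A minimal sketch of the second (coordination-free) scheme, assuming a hypothetical URL and a hash-based assignment of hosts to crawlers:

```python
import hashlib
from urllib.parse import urlparse

def crawler_for(url, num_crawlers):
    """Assign a URL to a crawler by hashing its host, so every crawler
    can compute the assignment locally, without a central coordinator."""
    host = urlparse(url).netloc
    h = int(hashlib.md5(host.encode()).hexdigest(), 16)
    return h % num_crawlers

print(crawler_for("https://example.org/page.html", 4))
```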
12.2.2 Indexing
In order to efficiently search the crawled pages and the gathered information, a number of indexes are built, as shown in Fig. 12.2. The two most important ones are the structure (or link) index and the text (or content) index.
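A minimal sketch of the two index types over a couple of hypothetical crawled pages (URLs and text are made up):

```python
from collections import defaultdict

# crawled page -> (text, out-links); a tiny hypothetical crawl result
pages = {
    "http://a.example/": ("web data management", ["http://b.example/"]),
    "http://b.example/": ("distributed database systems", ["http://a.example/"]),
}

text_index = defaultdict(set)   # word -> URLs of pages containing it
link_index = defaultdict(set)   # URL  -> URLs that link to it (in-links)

for url, (text, out_links) in pages.items():
    for word in text.split():
        text_index[word].add(url)
    for target in out_links:
        link_index[target].add(url)

print(text_index["data"])               # pages where "data" occurs
print(link_index["http://a.example/"])  # in-links of a page
```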
Web data is handled by two generations of query languages. First-generation languages, such as WebSQL, W3QL, and WebLog, model the web as an interconnected collection of atomic objects; their queries can search the link structure but cannot exploit the structure of the documents themselves. Second-generation languages model the web as a linked collection of structured objects, allowing queries to exploit document structure as well. WebSQL, an early first-generation query language, combines searching and browsing by directly addressing web data captured as web documents, including links to other pages or objects. As before, this structure can be represented as a graph, but WebSQL captures the information about web objects in two virtual relations:
The document relation stores information about a web document: its URL, title, text, type, length, and modification date. The URL is the key and is the only attribute that cannot be null. The link relation stores information about links: the base URL where the link originates, the referenced URL, and the label of the link.
WebSQL defines a query language that consists of SQL plus path expressions. The path expressions are more powerful than their counterparts in Lorel; in particular, they identify different types of links:
(a) interior link that exists within the same document (#>)
(b) local link that is between documents on the same server (->)
(c) global link that refers to a document on another server (=>)
(d) null path (=)
These link types form the alphabet of the path expressions. Using them and the usual constructors of regular expressions, different paths can be specified, as in Example 12.7.
Example 12.7 The following are examples of possible path expressions that can be specified in WebSQL.
(a) -> | =>: a path of length one, either local or global
(b) ->*: local path of any length
(c) =>->*: as above, but in other servers
(d) (-> |=>)*: the reachable portion of the web
In addition to path expressions that can appear in queries, WebSQL allows scoping within the FROM clause in the following way:
FROM Relation SUCH THAT domain-condition
where domain-condition can be a path expression, a text search using MENTIONS, or a specification that an attribute (in the SELECT clause) is equal to a web object. Of course, following each relation specification there can be a variable ranging over the relation; this is standard SQL. The following example queries (adapted with minor modifications) demonstrate the features of WebSQL. Example 12.8 Following are some examples of WebSQL queries:
(a) The first example we consider simply searches for all documents about “hypertext” and demonstrates the use of
MENTIONS to scope the query.
(b) The second example demonstrates two scoping methods as well as a search for links. The query is to find all links
to applets from documents about “Java.”
(c) The third example demonstrates the use of different link types. It searches for documents that have the string
“database” in their title that are reachable from the ACM Digital Library home page through paths of length two or
less containing only local links.
(d) The final example demonstrates the combination of content and structure specifications in a query. It finds all
documents mentioning “Computer Science” and all documents that are linked to them through paths of length two or
less containing only local links.
WebSQL can query web data based on links and textual content, but it cannot query documents based on their structure, because its data model treats the web as a collection of atomic objects. Second-generation languages like WebOQL address this limitation by modeling the web as a graph of structured objects, combining features of semistructured data approaches with those of first-generation web query models. WebOQL's main data structure is the hypertree, an ordered edge-labeled tree with two types of edges: internal and external. An internal edge represents the internal structure of a web document, while an external edge represents a reference (hyperlink) among objects. Each edge is labeled with a record of attributes; external edges cannot have descendants.
The query uses a variable x to range over the simple trees of dbDocuments and, for a given value of x, iterates over the simple trees of a single subtree. If the author value matches "Ozsu" (using the string matching operator ∼), it constructs a tree from the title attribute of the record and the URL attribute value of the subtree. In this way, web query languages adopt a more powerful data model than the semistructured approaches, allowing them to exploit different edge semantics and construct new structures.
12.4 Question Answering Systems
Question answering (QA) systems are an unusual approach to accessing web data from a database perspective. These systems accept natural language questions, which they analyze to determine the specific query being posed. They have typically been used within IR systems, where the aim is to find the answer to a posed query within a well-defined corpus of documents. They let users specify complex queries in natural language and ask questions without full knowledge of how the data is organized.
Sophisticated natural language processing (NLP) techniques are applied to these questions to understand the specific query. The systems then search the corpus of documents and return explicit answers rather than links to relevant documents. This does not mean they return exact answers as traditional DBMSs do; rather, they may return a ranked list of explicit responses to the query.
Open-domain QA systems use the web as the corpus. Web data sources are accessed through wrappers developed for them in order to obtain answers to questions. There are various systems with different objectives and functionalities, such as Mulder, WebQA, Start, and Tritus.
Some of these systems rely on preprocessing, an offline process that extracts and enhances the rules the system uses. Preprocessing analyzes documents extracted from the web, or returned as answers to user questions, to determine the most effective query structures; the resulting transformation rules are stored for later use at runtime. Tritus, for example, uses a learning-based approach, with a collection of frequently asked questions and their correct answers as the training dataset. In a three-stage process, the system attempts to guess the structure of the answer by analyzing the question and searching for the answer: the first stage extracts the question phrase, while the second analyzes question–answer pairs and generates candidate transforms for each phrase.
Question analysis is the process of understanding the natural language question posed by the user. It involves predicting the type of the answer and categorizing the question so that it can be translated and the answer extracted. Different systems take different approaches depending on the sophistication of the natural language processing (NLP) techniques they employ. For example, Mulder proceeds in three phases: question parsing, question classification, and query generation; in the query generation phase it uses four methods: verb conversion, query expansion, noun phrase formation, and transformation.
Once the question is analyzed and queries are generated, the next step is to generate candidate answers. The queries generated during question analysis are used to perform keyword searches for relevant documents. Some systems use general-purpose search engines, while others also consider additional data sources such as the CIA World Factbook or weather data sources like the Weather Network or Weather Underground. The choice of appropriate search engine(s) and data source(s) is crucial for obtaining good results.
Question answering systems are more flexible than other web querying approaches, since they offer users this flexibility without requiring knowledge of how web data is organized. However, they are constrained by the idiosyncrasies of natural language and the difficulties of natural language processing.
The responses to the queries are normalized into "records," from which the candidate answers need to be extracted. Various text processing techniques can be used to match the query keywords against the returned records, and the results then need to be ranked using information retrieval techniques. Different systems employ different notions of what the appropriate answer is, such as a ranked list of direct answers or a ranked list of the portions of the records that contain the query keywords. In summary, question answering is a complex process that requires a combination of NLP techniques, data analysis, and a careful selection of appropriate search engines and data sources.
12.5.2 Metasearching
Metasearching is another approach for querying the hidden web. Given a user query, a metasearcher performs the
following tasks:
1. Database selection: selecting the database(s) that are most relevant to the user's query. This requires collecting some information about each database. This information is known as a content summary, which is statistical information, usually including the document frequencies of the words that appear in the database.
2. Query translation: translating the query to a suitable form for each database (e.g., by filling certain fields in the
database’s search interface).
3. Result merging: collecting the results from the various databases, merging them (and most probably, ordering
them), and returning them to the user.
We discuss the important phases of metasearching in more detail below.
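A minimal sketch of content-summary-based database selection, using made-up summaries and a deliberately simple scoring heuristic rather than any specific published algorithm:

```python
# Hypothetical content summaries: per database, the document frequency of
# each word and the total number of documents.
summaries = {
    "db1": {"size": 10_000, "df": {"database": 2_500, "web": 400}},
    "db2": {"size": 50_000, "df": {"database": 1_000, "web": 30_000}},
}

def score(db, query_words):
    """Rank a database by the fraction of its documents that contain each
    query word (a simple selection heuristic for illustration only)."""
    s = summaries[db]
    return sum(s["df"].get(w, 0) / s["size"] for w in query_words)

query = ["web", "database"]
ranked = sorted(summaries, key=lambda db: score(db, query), reverse=True)
print(ranked)   # databases in decreasing order of estimated relevance
```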
The semantic web and Linked Open Data (LOD) rest on the following principles:
• All web resources (data) are locally identified by their URIs, which serve as names;
• These names are accessible via HTTP;
• Information about web resources/entities is encoded as RDF (Resource Description Framework) triples; in other words, RDF is the semantic web data model (discussed below);
• Connections among datasets are established by data links, and publishers of datasets should establish these links so that more data is discoverable.
The semantic web/LOD datasets form a graph whose vertices represent web resources and whose edges represent relationships among them. As of 2018, LOD consisted of 1,234 datasets with 16,136 links among them. The semantic web builds on several technologies, including XML for encoding structured web documents, RDF as the data model, RDF Schema and ontologies for capturing relationships among web data, and logic-based declarative rule languages for defining application rules. The technologies in the lower layers are the minimum requirement; a schema over the data then provides the necessary primitives.
12.6.2.1 XML
HTML is the primary encoding for web documents, consisting of HTML elements encapsulated by tags. In the semantic web context, XML, proposed by the World Wide Web Consortium (W3C), is the preferred representation for encoding and exchanging web documents, since it enables the discovery and integration of structured data. XML tags (markups) divide the data into elements in order to give it semantics. Elements can be nested but cannot overlap, which represents hierarchical relationships, so an XML document can be represented as a tree with a root element that contains zero or more nested subelements, which recursively contain subelements. Each element has zero or more attributes with atomic values assigned to them, as well as an optional value. The textual representation of the tree defines a total order, called the document order, on all elements, corresponding to the order in which the first character of each element occurs in the document.
For example, the root element in Fig. 12.4 is bib, which has three child elements: two book and one article. The first
book element has an attribute year with atomic value "1999", and also contains subelements. An element can contain a
value, such as "Principles of Distributed Database Systems" for the element title.
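A minimal sketch using Python's standard xml.etree.ElementTree module; the document fragment below is made up in the spirit of Fig. 12.4, not the figure itself:

```python
import xml.etree.ElementTree as ET

# A made-up fragment in the spirit of Fig. 12.4 (not the actual figure).
doc = """<bib>
  <book year="1999">
    <title>Principles of Distributed Database Systems</title>
    <author>Ozsu</author>
  </book>
  <article year="2018">
    <title>Web Data Management</title>
  </article>
</bib>"""

root = ET.fromstring(doc)
print(root.tag)                      # root element: bib
for child in root:                   # children in document order
    print(child.tag, child.attrib)   # element name and its attributes
print(root.find("book/title").text)  # value of a nested element
```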
A standard XML document definition can also contain ID–IDREF attributes, which define references between elements in the same document or in another document. However, the simpler tree representation is the one commonly used, and it is defined more precisely in what follows.
An XML document tree is defined as an ordered tree whose nodes are XML document nodes or atomic values. A schema can be defined for XML documents, while still allowing variations across individual documents. XML schemas can be defined using the Document Type Definition (DTD) or XMLSchema. A simpler schema definition exploits the graph structure of XML documents: an XML schema graph is defined as a 5-tuple that includes, among other components, an alphabet of XML document node types, a set of edges between node types, and the domain of the text content of items of a given type σ.
Using the XML data model and instances, query languages can be defined. Expressions in XML query languages take an instance of XML data as input and produce an instance of XML data as output. Two query languages proposed by the W3C are XPath and XQuery. Path expressions are present in both and are the most natural way to query hierarchical XML data. XQuery is complicated, however: it is hard for users to formulate and difficult for systems to optimize. JSON has replaced XML and XQuery for many applications, although the XML representation remains important for the semantic web.
12.6.2.2 RDF
RDF is a data model that builds on XML and forms a fundamental building block of the semantic web. It was originally proposed by the W3C as a component of the semantic web, but its use has since expanded. Examples include Yago and DBpedia, which automatically extract facts from Wikipedia and store them in RDF format to support structured queries, and biologists who encode their experiments and results in RDF to communicate among themselves. RDF data collections include Bio2RDF and Uniprot RDF.
RDF models each "fact" as a triple (subject, property (or predicate), object), denoted ⟨s, p, o⟩. Entities are denoted by a URI (Uniform Resource Identifier) that refers to a named resource in the environment being modeled, while blank nodes refer to anonymous resources that do not have a name.
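A minimal sketch of the triple representation with made-up URIs (not the dataset of the chapter's examples); "_:b1" stands for a blank node:

```python
# A few hypothetical RDF triples (subject, property, object).
triples = {
    ("http://ex.org/film/1", "http://ex.org/title", "A Clockwork Orange"),
    ("http://ex.org/film/1", "http://ex.org/director", "http://ex.org/person/kubrick"),
    ("_:b1", "http://ex.org/knows", "http://ex.org/person/kubrick"),
}

# All properties and objects of one subject:
for s, p, o in triples:
    if s == "http://ex.org/film/1":
        print(p, o)
```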
RDF Schema (RDFS) is the next layer in the semantic web technology stack, which allows for the annotation of RDF
data with semantic metadata. This annotation primarily enables reasoning over the RDF data (entailment) and impacts
data organization in some cases. RDFS also allows the definition of classes and class hierarchies, with built-in class
definitions like rdfs:Class and rdfs:subClassOf. A special property, rdf:type, is used to specify that an individual
resource is an element of the class.
SPARQL query types are based on the shape of the query graph, with three types: linear, star-shaped, and snowflake-
shaped. RDF data management systems can be categorized into five groups: direct relational mapping, relational
schema with extensive indexing, denormalizing triples into clustered properties, column-store organization, and
exploiting native graph pattern matching semantics. Direct relational mapping systems exploit the natural tabular structure of RDF triples to create a single table with three columns (Subject, Property, Object) over which SPARQL queries are executed. The aim is to exploit well-developed relational storage, query processing, and optimization techniques in executing SPARQL queries; systems such as Sesame (with its SQL92-based SAIL) and Oracle follow this approach. The full translation of SPARQL 1.0 to SQL remains an open question.
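A minimal sketch of the direct relational mapping using Python's built-in sqlite3 module, with made-up triples and a hypothetical two-pattern query answered by a self-join:

```python
import sqlite3

# Direct relational mapping: all triples in a single three-column table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Triples (Subject TEXT, Property TEXT, Object TEXT)")
conn.executemany(
    "INSERT INTO Triples VALUES (?, ?, ?)",
    [
        ("film1", "title", "A Clockwork Orange"),
        ("film1", "director", "kubrick"),
        ("film2", "title", "2001: A Space Odyssey"),
        ("film2", "director", "kubrick"),
    ],
)

# SPARQL-like pattern pair  ?f director kubrick . ?f title ?t  becomes a self-join.
rows = conn.execute(
    """SELECT t2.Object
       FROM Triples t1 JOIN Triples t2 ON t1.Subject = t2.Subject
       WHERE t1.Property = 'director' AND t1.Object = 'kubrick'
         AND t2.Property = 'title'"""
).fetchall()
print(rows)
```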
Single Table Extensive Indexing
Native storage systems like Hexastore and RDF-3X offer an alternative to direct relational mapping by allowing
extensive indexing of the triple table. These systems maintain a single table but create indexes for all possible
permutations of subject, property, and object. These indexes are sorted lexicographically by the first, second, and third
columns, and stored in the leaf pages of a clustered B+-tree. This organization allows SPARQL queries to be
efficiently processed regardless of variable location, and it eliminates some self-joins by turning them into range
queries over the particular index. Fast merge-join can be used when joins are required. However, disadvantages
include space usage and the overhead of updating multiple indexes if data is dynamic.
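A minimal sketch of the exhaustive-permutation idea, using sorted Python lists in place of clustered B+-trees and made-up triples:

```python
from itertools import permutations
from bisect import bisect_left, bisect_right

triples = [
    ("film1", "director", "kubrick"),
    ("film1", "title", "A Clockwork Orange"),
    ("film2", "director", "kubrick"),
]

# Build one sorted index per permutation of (S, P, O): SPO, SOP, PSO, POS, OSP, OPS.
indexes = {}
for perm in permutations((0, 1, 2)):
    key = "".join("SPO"[i] for i in perm)
    indexes[key] = sorted(tuple(t[i] for i in perm) for t in triples)

def lookup(order, prefix):
    """Range scan: all entries of one index whose leading columns equal prefix."""
    idx = indexes[order]
    lo = bisect_left(idx, prefix)
    hi = bisect_right(idx, prefix + ("\uffff",) * (3 - len(prefix)))
    return idx[lo:hi]

# All (property, object, subject) entries for property "director", via the POS index.
print(lookup("POS", ("director",)))
```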
Property Tables
The property tables approach exploits the regularity found in RDF datasets to store "related" properties in the same table. Jena and IBM's DB2RDF follow this strategy, mapping the resulting tables to a relational system and converting queries to SQL for execution. Jena defines two types of property tables: a clustered property table, which groups properties that tend to occur in the same subjects, and a property class table, which clusters subjects of the same type (class) into one table. The primary key of a property table is the subject, while for a multivalued property the key is the compound key (subject, property). The mapping of the single triple table to property tables is a database design problem handled by a database administrator.
Example 12.17 The example dataset in Example 12.14 may be organized to create one table that includes the
properties of subjects that are films, one table for properties of directors, one table for properties of actors, one table
for properties of books and so on.
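A minimal sketch of clustering properties into one wide table per subject type, using made-up triples; real systems treat this as a schema design problem, as noted above:

```python
from collections import defaultdict

triples = [
    ("film1", "type", "Film"), ("film1", "title", "A Clockwork Orange"),
    ("film1", "director", "kubrick"),
    ("dir1", "type", "Director"), ("dir1", "name", "Stanley Kubrick"),
]

# One wide table (dict keyed by subject) per subject type, one column per property.
subject_type = {s: o for s, p, o in triples if p == "type"}
tables = defaultdict(lambda: defaultdict(dict))
for s, p, o in triples:
    if p != "type":
        tables[subject_type[s]][s][p] = o

print(dict(tables["Film"]))   # {'film1': {'title': ..., 'director': 'kubrick'}}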
IBM DB2RDF uses a more dynamic table organization, called the direct primary hash (DPH), which is organized by subject. Each row holds a subject together with k property columns, and a given column can hold different properties in different rows. If a subject has more than k properties, the extra properties spill onto a second row. For multivalued properties, a separate direct secondary hash (DSH) table is maintained. This approach simplifies star queries, which require fewer joins, but it can lead to a significant number of null values and requires special care for multivalued properties. Furthermore, while star queries are handled efficiently, the organization may not suit other query types; clustering "similar" properties is nontrivial, and poor design decisions can exacerbate the null value problem.
Binary Tables
The binary tables approach is a column-oriented database schema organization that defines one two-column table for each property, containing the subject and the object. This results in a set of tables, each ordered by subject, which reduces I/O and tuple length and compresses well. It avoids null values and the need to cluster similar properties, and it naturally supports multivalued properties. Subject–subject joins can be implemented as efficient merge-joins. However, queries require more join operations, some of which may be subject–object joins; insertions have a higher overhead because multiple tables are touched; and the proliferation of tables may negatively impact the scalability of the approach. For example, the binary table representation of the example dataset would create one table for each of its unique properties, resulting in 18 tables.
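A minimal sketch of the binary (per-property) layout over made-up triples, with each table kept sorted by subject so that subject–subject joins can be merge-joins:

```python
from collections import defaultdict

triples = [
    ("film1", "title", "A Clockwork Orange"),
    ("film1", "director", "kubrick"),
    ("film2", "director", "kubrick"),
]

# One two-column (subject, object) table per property, sorted by subject.
binary = defaultdict(list)
for s, p, o in triples:
    binary[p].append((s, o))
binary = {p: sorted(rows) for p, rows in binary.items()}

print(binary["director"])   # [('film1', 'kubrick'), ('film2', 'kubrick')]
```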
Graph-Based Processing
Graph-based RDF processing methods maintain the graph structure of the RDF data using adjacency lists, convert SPARQL queries into query graphs, and evaluate queries by subgraph matching (using graph homomorphism) against the RDF graph. Systems such as gStore and chameleon-db follow this approach.
This approach maintains the original graph representation of the RDF data and enforces its intended semantics. Its disadvantage is the cost of subgraph matching, since graph homomorphism is NP-complete, which raises scalability issues for large RDF graphs. The gStore system uses an adjacency list representation of the graph and encodes each entity and class vertex into a fixed-length bit string. This information is exploited during graph matching, yielding a data signature graph G∗ in which each vertex corresponds to a class or entity vertex of the RDF graph G. An incoming SPARQL query is likewise represented as a query graph Q, which is encoded into a query signature graph Q∗.
Matching Q∗ over G∗ uses a filter-and-evaluate strategy to reduce the search space: a false-positive pruning step first produces a list of candidate subgraphs (CL), which are then validated against the adjacency lists to obtain the result set (RS). Two issues need to be addressed: the encoding must guarantee that RS ⊆ CL, and an efficient subgraph matching algorithm is needed. gStore uses an index structure called the VS∗-tree, a summary graph of G∗, to process queries efficiently with a pruning strategy.
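A minimal sketch of the signature-based filtering idea (a simplification of gStore's encoding, with made-up vertices and a hash-to-bit encoding chosen purely for illustration):

```python
def signature(properties, bits=16):
    """Encode the set of properties adjacent to a vertex as a fixed-length
    bit string, hashing each property to one bit (a simplification of
    gStore's vertex encoding)."""
    sig = 0
    for p in properties:
        sig |= 1 << (hash(p) % bits)
    return sig

# Hypothetical data vertices and the properties on their adjacent edges.
data = {
    "film1": {"title", "director", "year"},
    "actor1": {"name", "bornIn"},
}
data_sigs = {v: signature(props) for v, props in data.items()}

# A query vertex that must have "director" and "title" edges.
q_sig = signature({"director", "title"})

# Filter step: a data vertex is a candidate only if its signature covers the
# query signature (no false negatives; false positives are removed later by
# exact subgraph matching over the adjacency lists).
candidates = [v for v, s in data_sigs.items() if s & q_sig == q_sig]
print(candidates)
```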
Source Accuracy
Consider a data item D. Let Dom(D) be the domain of D, consisting of one true value and n false values. Let SD be the set of sources that provide a value for D, and let SD(v) ⊆ SD be the set of sources that provide the value v for D. Let Ψ(D) denote the observation of which value each source S ∈ SD provides for D. The probability Pr(v) can be computed as follows:
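\[
A'(S) = \ln\frac{n\,A(S)}{1 - A(S)},
\qquad
Pr(v) = \frac{\exp\big(\sum_{S \in S_D(v)} A'(S)\big)}{\sum_{v' \in Dom(D)} \exp\big(\sum_{S \in S_D(v')} A'(S)\big)}
\]

where A(S) is the accuracy of source S, n is the number of false values in Dom(D), and A'(S) acts as the vote count of S (one standard accuracy-based formulation that assumes independent sources).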
The true value of the data item D is then taken to be the value v with the highest probability Pr(v) in Dom(D). The computation of Pr(v) depends on the accuracy A(S) of the sources, and the accuracy of a source in turn depends on the probabilities of the values it provides. An algorithm can therefore start with the same accuracy for every source and the same probability for every value and iteratively recompute both until convergence.
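A rough Python sketch of this iteration over made-up observations; the vote-count formula and the assumed number of false values follow the formulation sketched above, not necessarily the chapter's exact algorithm:

```python
import math

# Hypothetical observations: data item -> {source: value it provides}.
observations = {
    "capital_of_X": {"s1": "A", "s2": "A", "s3": "B"},
    "capital_of_Y": {"s1": "C", "s2": "D", "s3": "C"},
}
N_FALSE = 10   # assumed number of false values per item (an assumption)

def iterate(observations, rounds=20):
    sources = {s for obs in observations.values() for s in obs}
    acc = {s: 0.8 for s in sources}            # same initial accuracy for all
    for _ in range(rounds):
        probs = {}
        for item, obs in observations.items():
            # Vote count of each candidate value, summed over its sources.
            votes = {}
            for s, v in obs.items():
                a = min(max(acc[s], 0.01), 0.99)      # keep the log well-defined
                votes[v] = votes.get(v, 0.0) + math.log(N_FALSE * a / (1 - a))
            z = sum(math.exp(c) for c in votes.values())
            probs[item] = {v: math.exp(c) / z for v, c in votes.items()}
        # Re-estimate each source's accuracy as the average probability of
        # the values it provides.
        for s in sources:
            ps = [probs[item][obs[s]] for item, obs in observations.items() if s in obs]
            acc[s] = sum(ps) / len(ps)
    return probs, acc

probs, acc = iterate(observations)
print(probs)
print(acc)
```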
Source Dependency
Source dependency is a crucial aspect, since sources often copy from one another, creating dependencies. Copy detection between sources rests on two intuitions. First, because a data item has only one true value but many possible false values, two sources sharing the same true value does not necessarily imply dependency, whereas sharing the same false value is a rare event that suggests copying. Second, a random subset of the values provided by an independent source typically has an accuracy similar to that of its full set of values, whereas this may not hold for a copier. A Bayesian model can be developed to compute the probability of copying between two sources, and the computation of the vote count of a value is adjusted to account for such source dependencies.
Source Freshness
When true values evolve over time, data fusion aims to find all the correct values and their valid periods in the history. In this dynamic setting, data errors arise because sources provide wrong values, fail to update their data, or do not update it in time. Source quality can then be evaluated using three metrics: the coverage of a source, its exactness, and its freshness. Bayesian analysis can be used to determine the time and value of each transition of a data item.
Machine learning and probabilistic models have also been used for data fusion and for modeling data source quality. SLiMFast is a framework that expresses data fusion as a statistical learning problem over discriminative probabilistic models. It provides quality guarantees for the fused results and can incorporate available domain knowledge in the fusion process. SLiMFast takes as input source observations, labeled ground truth, and domain knowledge about the sources, and compiles this information into a probabilistic graphical model for holistic learning and inference. Depending on the amount of ground truth data, SLiMFast decides which algorithm to use for learning the parameters of the graphical model. The learned model is then used to infer both object values and source accuracies.