Semantic Web Analysis
A description of our musical taste is something that we might list on our
homepage, but it is not something that we would like to keep typing in again when
accessing different music-related services on the internet.
These questions are arbitrary in their specificity, but they illustrate a general problem in
accessing the vast amounts of information on the Web. Namely, in all these cases we deal
with a knowledge gap: what the computer understands and is able to work with is much more
limited than the knowledge of the user. In most cases the knowledge gap is due to
the lack of some kind of background knowledge that only the human possesses. This
background knowledge is often completely missing from the context of the Web page, and
thus our computers do not even stand a fair chance when working on the basis of the web page
alone.
The Semantic Web has been actively promoted by the World Wide Web Consortium (W3C).
Knowledge representation and reasoning in the context of the Semantic Web involve
structuring data in a way that allows machines to understand and process information
meaningfully. The Semantic Web aims to create a universal framework that links data
across various sources using standardized formats like RDF (Resource Description
Framework) and OWL (Web Ontology Language). These standards enable the
representation of complex relationships between data, facilitating automated reasoning
and inferencing.
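As a minimal sketch of how such triples look in practice (using the Python rdflib library; the people and URIs below are invented for illustration):

    # Build a tiny RDF graph of (subject, predicate, object) triples with rdflib.
    from rdflib import Graph, Namespace, Literal
    from rdflib.namespace import FOAF, RDF

    EX = Namespace("http://example.org/")   # hypothetical namespace
    g = Graph()

    g.add((EX.alice, RDF.type, FOAF.Person))
    g.add((EX.alice, FOAF.name, Literal("Alice")))
    g.add((EX.alice, FOAF.knows, EX.bob))
    g.add((EX.bob, RDF.type, FOAF.Person))
    g.add((EX.bob, FOAF.name, Literal("Bob")))

    # Serialize in Turtle syntax to inspect the triples.
    print(g.serialize(format="turtle"))

Each g.add() call corresponds to one RDF statement; OWL would add richer class and property definitions on top of such a graph.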
Natural Language Processing (NLP) plays a crucial role in the Semantic Web by
enabling the extraction, interpretation, and generation of meaningful information from
human language. NLP techniques are used to convert unstructured text into structured
data that can be understood and processed by machines. This involves tasks such as
entity recognition, relationship extraction, sentiment analysis, and automatic
summarization. By integrating NLP with the Semantic Web, it becomes possible to
enhance search engines, improve data retrieval, and enable more intuitive human-
computer interactions, ultimately making the web more accessible and useful for users.
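As a sketch of the kind of NLP step involved (assuming the spaCy library and its small English model are installed; the example sentence is invented), named entity recognition turns free text into candidate entities for structured statements:

    # Named entity recognition with spaCy (pip install spacy;
    # python -m spacy download en_core_web_sm).
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Tim Berners-Lee founded the World Wide Web Consortium in 1994.")

    # Each recognized entity carries a label such as PERSON, ORG or DATE.
    for ent in doc.ents:
        print(ent.text, ent.label_)

Entities and relations extracted in this way can then be mapped to RDF resources and properties.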
Information retrieval in the context of the Semantic Web leverages structured data and
ontologies to improve the precision and relevance of search results. Unlike traditional
search engines that rely on keyword matching, Semantic Web technologies use
metadata and linked data principles to understand the context and relationships between
concepts. This allows for more sophisticated query processing, enabling the retrieval of
information based on the meaning and semantics of the data rather than just text
matches.
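For example (a sketch with rdflib and invented data), a SPARQL query retrieves resources through their relationships rather than through keyword matches:

    # Query an RDF graph with SPARQL: find the names of everyone Alice knows.
    from rdflib import Graph, Namespace, Literal
    from rdflib.namespace import FOAF

    EX = Namespace("http://example.org/")   # hypothetical namespace
    g = Graph()
    g.add((EX.alice, FOAF.knows, EX.bob))
    g.add((EX.bob, FOAF.name, Literal("Bob")))

    query = """
        PREFIX foaf: <http://xmlns.com/foaf/0.1/>
        PREFIX ex:   <http://example.org/>
        SELECT ?name WHERE {
            ex:alice foaf:knows ?person .
            ?person  foaf:name  ?name .
        }
    """
    for row in g.query(query):
        print(row.name)   # -> Bob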
Semantic Web Layered Cake
The Semantic Web is often described as a layered cake, illustrating its architecture and the
various technologies that build upon one another to enable meaningful data interchange and
automated reasoning on the web. Here’s a brief overview of the layers in the Semantic Web
stack:
1. Unicode and URI (Uniform Resource Identifier):
o Unicode ensures that all characters are represented in a standardized form.
o URI provides a unique identifier for resources on the web.
2. XML (eXtensible Markup Language):
o A flexible text format used to encode documents in a machine-readable form.
3. Namespaces:
o Allow the use of XML elements and attributes from different vocabularies.
4. RDF (Resource Description Framework):
o A framework for representing information about resources in the web, using
triples (subject, predicate, object).
5. RDFS (RDF Schema):
o A semantic extension of RDF that provides mechanisms for describing groups
of related resources and the relationships between them.
6. OWL (Web Ontology Language):
o A language for defining and instantiating web ontologies, providing richer
integration and interoperability of data among diverse applications.
7. SPARQL (SPARQL Protocol and RDF Query Language):
o A query language and protocol used to query RDF datasets.
Against this general backdrop there was a look at the share of Semantic Web related
terms and formats, in particular the terms RDF and OWL, and the number of ontologies.
Blogs
The first wave of socialization on the Web was due to the appearance of blogs, wikis and
other forms of web-based communication and collaboration. Blogs and wikis attracted mass
popularity from around 2003. They lowered the barrier for adding content to the Web: editing
blogs and wikis no longer required any knowledge of HTML. Blogs and wikis allowed
individuals and groups to claim their personal space on the Web and fill it with content with
relative ease. Even more importantly, although weblogs were at first assessed as a purely
personal form of publishing (similar to diaries), nowadays the blogosphere is widely recognized
as a densely interconnected social network through which news, ideas and influences travel
rapidly as bloggers reference and reflect on each other's postings.
Social networks
The first online social networks, also referred to as social networking services, entered the
field at the same time as blogging and wikis started to take off. The earliest of these sites
attracted over five million registered users and were soon followed by offerings from Google
and Microsoft. These sites allow users to post a profile with basic information, to invite others
to register and to link to the profiles of their friends. The systems also make it possible to
visualize and browse the resulting network in order to discover friends in common, friends
thought to be lost, or potential new friendships based on shared interests. The idea of
network-based exchange rests on the sociological observation that social interaction creates
similarity and, vice versa, similarity creates interaction: friends are likely to have acquired or
to develop similar interests.
User profiles
Explicit user profiles make it possible for these systems to introduce rating mechanisms whereby
either the users or their contributions are ranked according to usefulness or trustworthiness.
Ratings are explicit forms of social capital that regulate exchanges in online communities, much
as reputation moderates exchanges in the real world. In terms of implementation, the newer
web sites rely on new ways of applying pre-existing technologies. Asynchronous JavaScript
and XML, or AJAX, which drives many of the latest websites, is merely a mix of technologies
that have been supported by browsers for years.
Each graph can be associated with its characteristic (adjacency) matrix M = (m_ij), an n x n
matrix where n = |V| and m_ij = 1 if there is an edge between vertices v_i and v_j, and 0
otherwise. A component is a maximal connected subgraph. Two vertices are in the same (strong)
component if and only if there exists a (directed) path between them.
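A small sketch of both notions with the Python networkx library (the graph is invented):

    # Characteristic (adjacency) matrix and connected components.
    import networkx as nx

    G = nx.Graph()
    G.add_edges_from([(1, 2), (2, 3), (4, 5)])

    M = nx.to_numpy_array(G)        # n x n matrix, n = |V| (requires numpy)
    print(M)                        # m_ij = 1 if {v_i, v_j} is an edge

    # Two maximal connected subgraphs: {1, 2, 3} and {4, 5}.
    print(list(nx.connected_components(G)))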
The American psychologist Stanley Milgram conducted a famous experiment about the structure
of social networks. Milgram calculated the average of the length of the chains and concluded
that the experiment showed that on average Americans are no more than six steps apart from
each other. While this is also the source of the expression six degrees of separation, the
actual number is rather dubious.
Formally, what Milgram estimated is the size of the average shortest path of the network, which
is also called the characteristic path length. The shortest path between two vertices v_s and v_t
is a path that begins at the vertex v_s and ends in the vertex v_t and contains the least possible
number of vertices. The shortest path between two vertices is also called a geodesic. The longest
geodesic in the graph is called the diameter of the graph: this is the maximum number of steps
that is required between any two nodes. The average shortest path is the average of the length
of the geodesics between all pairs of vertices in the graph.
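These quantities are straightforward to compute on small graphs, e.g. with networkx (invented example):

    # Geodesics, characteristic path length and diameter.
    import networkx as nx

    G = nx.path_graph(5)                       # vertices 0-1-2-3-4 in a line

    print(nx.shortest_path(G, 0, 4))           # a geodesic: [0, 1, 2, 3, 4]
    print(nx.average_shortest_path_length(G))  # characteristic path length
    print(nx.diameter(G))                      # longest geodesic: 4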
A practical impact of Milgram's findings is that they suggest possible models for social
networks. One candidate is the two-dimensional lattice model shown in the following figure:
Clustering for a single vertex can be measured by the actual number of edges
between the neighbors of a vertex divided by the possible number of edges between the
neighbors. When we take the average over all vertices we get the measure known as the
clustering coefficient. The clustering coefficient of a tree is zero, which is easy to see if we
consider that there are no triangles of edges (triads) in the graph.
For a given node, the local clustering coefficient measures the proportion of connections between
the nodes in its neighbourhood that actually exist. The formula for the local clustering
coefficient of node i in an undirected graph is C_i = 2 * E_i / (k_i * (k_i - 1)), where k_i is the
number of neighbours of node i and E_i is the number of edges that exist between those neighbours.
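A short sketch with networkx (invented graphs) contrasting a triangle-rich graph with a tree:

    # Local clustering coefficient and graph-level clustering coefficient.
    import networkx as nx

    G = nx.complete_graph(4)          # every pair of neighbours is connected
    T = nx.balanced_tree(2, 3)        # a tree contains no triangles

    print(nx.clustering(G, 0))        # local coefficient of vertex 0: 1.0
    print(nx.average_clustering(G))   # clustering coefficient of G: 1.0
    print(nx.average_clustering(T))   # clustering coefficient of the tree: 0.0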
The alpha and beta models of graphs are frameworks used to analyze and understand the
structure and dynamics of networks. The alpha model focuses on the concept of connectivity
and clustering within a graph, emphasizing how nodes are interconnected and the extent to
which they form tightly-knit communities or clusters. This model is useful for examining local
structures and the strength of connections among neighbors, often employing metrics like the
clustering coefficient to measure how nodes are grouped. In contrast, the beta model
emphasizes the overall topology of the network, considering global properties such as degree
distribution, centrality, and path lengths between nodes.
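The beta model is commonly identified with the Watts-Strogatz small-world construction, in which a regular ring lattice is randomly rewired with probability beta; a sketch with networkx (parameter values chosen only for illustration):

    # Watts-Strogatz "beta model": ring lattice with random rewiring.
    import networkx as nx

    n, k, beta = 1000, 10, 0.1    # nodes, neighbours per node, rewiring probability
    G = nx.connected_watts_strogatz_graph(n, k, beta)

    # Small-world signature: high clustering combined with short path lengths.
    print(nx.average_clustering(G))
    print(nx.average_shortest_path_length(G))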
The study of cliques (subsets of vertices in which every pair is connected by an edge) has
important implications in various fields, including sociology, biology, and computer science, as
it helps to identify communities, analyze network robustness, and understand collaborative
structures. Furthermore, the maximum clique problem, which involves finding the largest clique
within a graph, is a well-known computational challenge with applications in clustering and
optimization.
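For example (networkx, invented graph), the maximal cliques can be enumerated directly, and the largest of them is a maximum clique for the graph at hand:

    # Enumerating maximal cliques.
    import networkx as nx

    G = nx.Graph([(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (3, 5), (4, 6)])

    cliques = list(nx.find_cliques(G))   # maximal cliques
    print(cliques)                       # e.g. [[1, 2, 3], [3, 4, 5], [4, 6]]
    print(max(cliques, key=len))         # a maximum clique (NP-hard in general)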
The image that emerges is one of dense clusters or social groups sparsely connected to each
other by a few ties. For example, this is the image that appears if we investigate the co-authorship
networks of a scientific community. Bounded by limitations of space and resources, scientists
mostly co-operate with colleagues from the same institute. Occasional exchanges and projects
with researchers from abroad, however, create the kind of shortcut ties that Watts explicitly
incorporated within his model. These shortcuts make it possible for scientists to reach each
other in a relatively short number of steps.
A k-plex is a type of graph structure used in network theory that allows for a relaxed form of
connectivity compared to a clique. In a k-plex, every node is connected to at least n−k other
nodes, where n is the total number of nodes in the k-plex and k is a given parameter that defines
the maximum number of nodes that can be excluded from the neighbourhood of any node.
Equivalently, a k-plex is a subset S of the vertices of a graph such that every vertex in the subset
is connected to at least |S| - k other vertices in the same subset, where |S| is the total number of
vertices in the k-plex.
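A direct way to check this definition (a small sketch with networkx; the graph and subset are invented):

    # Test whether a subset S of nodes forms a k-plex.
    import networkx as nx

    def is_k_plex(G, S, k):
        """Every node in S must be adjacent to at least |S| - k other nodes of S."""
        H = G.subgraph(S)
        return all(H.degree(v) >= len(S) - k for v in S)

    G = nx.Graph([(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)])
    print(is_k_plex(G, {1, 2, 3, 4}, 2))   # True: every node reaches at least 2 others
    print(is_k_plex(G, {1, 2, 3, 4}, 1))   # False: a 1-plex is a clique, edge (1, 4) is missing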
The lambda-set analysis method is a technique used in network theory and social network
analysis to identify and evaluate cohesive subsets within a network based on the
relationships among nodes. A lambda set is a subset of nodes that are more robustly connected
to each other than to the rest of the network: the edge connectivity between any two members of
the set is higher than the edge connectivity between a member and any non-member. The process
begins with constructing a network graph where nodes represent entities and edges denote
relationships. Researchers identify lambda sets by applying this connectivity criterion, followed
by an analysis of the internal structure and dynamics of these sets.
Clustering a graph into subgroups allows us to visualize the connectivity at a group level. A Core-
Periphery (C/P) structure is one where nodes can be divided into two distinct subgroups: nodes
in the core are densely connected with each other and with the nodes on the periphery, while
peripheral nodes are not connected with each other, only to nodes in the core. The result of the
optimization is a classification of the nodes as core or periphery and a measure of the error of
the solution.
Affiliation networks are a type of bipartite graph that represent relationships between two
distinct sets of entities, typically individuals and their affiliations, such as organizations,
institutions, or events. In these networks, one set of nodes corresponds to individuals, while the
other set represents affiliations, with edges connecting individuals to the affiliations they
belong to or participate in. This structure allows for the analysis of collaboration patterns, social
dynamics, and the influence of affiliations on individual behavior and outcomes. For example,
in academic settings, affiliation networks can be used to study co-authorship patterns, where
researchers are connected to the institutions they work for, revealing insights into collaboration
trends, the impact of institutional networks on research productivity, and the dissemination of
knowledge. By examining these networks, researchers can identify key players, assess
community structures, and understand the flow of information within and across various
affiliations.
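A sketch of an affiliation network and its one-mode projection, using networkx's bipartite helpers (the people and events are invented):

    # Affiliation (two-mode) network: people linked to the events they attend.
    import networkx as nx
    from networkx.algorithms import bipartite

    B = nx.Graph()
    people = ["ann", "bob", "carol"]
    events = ["workshop", "conference"]
    B.add_nodes_from(people, bipartite=0)
    B.add_nodes_from(events, bipartite=1)
    B.add_edges_from([("ann", "workshop"), ("bob", "workshop"),
                      ("bob", "conference"), ("carol", "conference")])

    # One-mode projection: people are linked if they share at least one affiliation.
    P = bipartite.projected_graph(B, people)
    print(list(P.edges()))   # [('ann', 'bob'), ('bob', 'carol')]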
Social capital refers to the networks, relationships, and norms that facilitate cooperation and
collaboration among individuals and groups within a society. It encompasses the resources
individuals can access through their social connections, such as trust, reciprocity, and shared
values, which enhance social cohesion and collective action. Social capital can be classified
into three types: bonding social capital, which occurs within close-knit groups and strengthens
ties among similar individuals; bridging social capital, which connects diverse groups and
fosters broader networks; and linking social capital, which involves connections between
individuals and institutions, enabling access to resources and opportunities.
The structural dimension of social capital refers to patterns of relationships or positions that
provide benefits in terms of accessing large, important parts of the network.
Degree centrality equals the graph theoretic measure of degree, i.e. the number of (incoming,
outgoing or all) links of a node.
Closeness centrality is obtained by calculating the average (geodesic) distance of a
node to all other nodes in the network. In larger networks it makes sense to constrain the size
of the neighborhood in which to measure closeness centrality. It makes little sense, for example,
to talk about the most central node on the level of a society. The resulting measure is called
local closeness centrality.
Two other measures of power and influence through networks are broker positions and weak
ties.
Betweenness is defined as the proportion of geodesics (shortest paths) between all pairs of
nodes that pass through a given actor.
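All three measures are available in networkx; a sketch on an invented graph:

    # Degree, closeness and betweenness centrality.
    import networkx as nx

    G = nx.Graph([("a", "b"), ("a", "c"), ("a", "d"), ("d", "e"), ("e", "f")])

    print(nx.degree_centrality(G))        # normalized number of links per node
    print(nx.closeness_centrality(G))     # inverse of the average geodesic distance
    print(nx.betweenness_centrality(G))   # share of geodesics passing through a node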
A structural hole occurs in the space that exists between closely clustered communities.
Blogging
Content analysis has also been the most commonly used tool in the computer-aided
analysis of blogs (web logs), primarily with the intention of trend analysis for the
purposes of marketing. While blogs are often considered as a form of personal publishing,
bloggers themselves know that blogs are much more than that: modern blogging tools allow
users to easily comment on and react to the comments of other bloggers, resulting in webs of
communication among bloggers.
These discussion networks also lead to the establishment of dynamic communities,
which often manifest themselves through syndicated blogs (aggregated blogs that
collect posts from a set of authors blogging on similar topics), blog rolls (lists of
discussion partners on a personal blog) and even result in real world meetings such as
the Blog Walk series of meetings.
The 2004 US election campaign represented a turning point in blog research, as it was
the first major electoral contest where blogs were exploited as a method of
building networks among individual activists and supporters. Blog analysis
suddenly shed its image as relevant only to marketers interested in understanding
the product choices of young demographics; following this campaign there has been an
explosion in research on the capacity of web logs for creating and maintaining stable,
long-distance social networks of different kinds.
Online community spaces and social networking services such as MySpace and
LiveJournal cater to socialization even more directly than blogs, with features such as social
networking (maintaining lists of friends, joining groups), messaging and photo
sharing. As they are typically used by a much younger demographic, they offer an
excellent opportunity for studying changes in youth culture.
RSS (Really Simple Syndication) feeds and blogs are complementary technologies that
enhance content distribution and consumption on the internet. A blog serves as a
platform where individuals or organizations publish articles, updates, and multimedia
content on specific topics. RSS feeds, on the other hand, are a standardized format for
delivering regularly updated information from blogs and other websites directly to
subscribers. When a blog publishes new content, an RSS feed automatically updates,
allowing users to receive notifications about the latest posts without needing to visit the
site manually. This streamlining of information delivery enables readers to efficiently
keep up with multiple blogs and websites in one place, using RSS readers or
aggregators. By combining blogs with RSS feeds, content creators can broaden their
reach and engagement, while users benefit from a curated and convenient way to access
fresh content from their favorite sources.
Extracting the networks of individuals or institutions requires web mining, as names are
typically embedded in the natural text of web pages. Web mining is the application of text
mining to the content of web pages. The techniques employed here are statistical methods,
possibly combined with an analysis of the contents of web pages.
Using the search engine Altavista the system collected page counts for the individual
names as well as the number of pages where the names co-occurred.
Tie strength was calculated by dividing the number of co-occurrences by the number of pages
that mention either of the two names. Also known as the Jaccard-coefficient, this is basically
the ratio of the sizes of two sets: the intersection of the sets of pages (the co-occurrences)
and their union.
The resulting value of tie strength is a number between zero (no co-occurrences) and
one (no separate mentioning, only co-occurrences). If this number has exceeded a
certain fixed threshold it was taken as evidence for the existence of a tie.
However, the Jaccard-coefficient ignores the number of pages that can be found for the given
individuals or combination of individuals: it is a relative measure of co-occurrence and does
not take into account the absolute sizes of the sets. In case the absolute sizes are very low we
can easily get spurious results. A further disadvantage of the Jaccard-coefficient is that it
penalizes ties between an individual whose name often occurs on the Web and less popular
individuals.
For this reason we use an asymmetric variant of the coefficient. In particular, we divide the
number of pages where both names occur by the number of pages for one of the individuals
alone, and take it as evidence of a directed tie if this ratio reaches a certain threshold.
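A sketch of both measures, with hypothetical page counts standing in for live search engine hit counts:

    # Tie strength from page counts (invented numbers instead of search engine hits).
    def jaccard(count_a, count_b, count_ab):
        """Symmetric tie strength: size of the intersection over size of the union."""
        return count_ab / (count_a + count_b - count_ab)

    def directed_strength(count_a, count_ab):
        """Asymmetric variant: co-occurrences relative to one person's own page count."""
        return count_ab / count_a

    pages_alice, pages_bob, pages_both = 12000, 300, 120   # invented counts

    print(jaccard(pages_alice, pages_bob, pages_both))      # ~0.01: penalizes the popular name
    print(directed_strength(pages_bob, pages_both))         # 0.4: strong tie from Bob's side
    print(directed_strength(pages_alice, pages_both))       # 0.01: weak in the other direction

    THRESHOLD = 0.2                                          # illustrative threshold
    print(directed_strength(pages_bob, pages_both) >= THRESHOLD)   # True -> directed tie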
There have been several approaches to deal with name ambiguity. Instead of a single
name they assume to have a list of names related to each other. They disambiguate the
appearances by clustering the combined results returned by the search engine for the
individual names. The clustering can be based on various networks between the
returned webpages, e.g. based on hyperlinks between the pages, common links or
similarity in content.
The idea is that such key phrases can be added to the search query to reduce the set of
results to those related to the given target individual. We consider an ordered list of
pages for the first person and a set of pages for the second (the relevant set), as shown
in the Figure.
We ask the search engine for the top N pages for both persons, but in the case of the
second person the order is irrelevant: a page either belongs to the relevant set or it does not.
The average precision method is more sophisticated in that it takes into account the order
in which the search engine returns documents for a person: it assumes that names of other
persons that occur closer to the top of the list represent more important contacts than
names that occur in pages at the bottom of the list.
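A sketch of the computation (invented page identifiers; the ordered list stands for the results returned for the first person and the set for the relevant pages of the second person):

    # Average precision of an ordered result list against a relevant set.
    def average_precision(ranked_pages, relevant_set):
        hits, precision_sum = 0, 0.0
        for position, page in enumerate(ranked_pages, start=1):
            if page in relevant_set:
                hits += 1
                precision_sum += hits / position   # precision at this rank
        return precision_sum / len(relevant_set) if relevant_set else 0.0

    ranked = ["p1", "p2", "p3", "p4", "p5"]    # ordered results for the first person
    relevant = {"p1", "p4"}                    # pages relevant to the second person
    print(average_precision(ranked, relevant)) # (1/1 + 2/4) / 2 = 0.75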
The strength of the association between a person and an interest is determined by taking the
number of pages where the name of the interest and the name of the person co-occur, divided
by the total number of pages about the person.
We assign the expertise to an individual if this value is at least one standard deviation higher
than the mean of the values obtained for the same concept across all individuals.
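A sketch of this thresholding step with invented association strengths:

    # Assign an expertise label if a person's strength for the concept is at least
    # one standard deviation above the mean over all persons.
    from statistics import mean, stdev

    strengths = {"ann": 0.02, "bob": 0.15, "carol": 0.04, "dave": 0.03}   # invented
    threshold = mean(strengths.values()) + stdev(strengths.values())

    experts = [name for name, s in strengths.items() if s >= threshold]
    print(experts)   # ['bob']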
The biggest technical challenge in social network mining is the disambiguation of person names.
Person names exhibit the same problems of polysemy and synonymy that we have seen in
the general case of web search, for example queries for researchers who commonly use different
variations of their name (e.g. Jim Hendler vs. James Hendler).
Polysemy is the association of one word with two or more distinct meanings; a polyseme
is a word or phrase with multiple meanings. In contrast, a one-to-one match between a
word and a meaning is called monosemy. According to some estimates, more than 40%
of English words have more than one meaning. Synonymy, by contrast, refers to the semantic
quality or sense relation that exists between words with closely related meanings.