
–– Telephone calling patterns
–– State transition diagrams

Two types of nodes are of particular interest in directed graphs. All the
edges connected to a source node are outgoing edges. Since there are no incoming
edges, no path exists from any other node in the graph to any of the source
nodes. When all the edges on a node are incoming edges, the node is called a
sink node. The existence of source nodes and sink nodes is an important difference
between directed graphs and their undirected cousins.
An important property of directed graphs is whether the graph contains any
paths that start and end at the same vertex. Such a path is called a cycle, implying
that the path could repeat itself endlessly: ABCABCABC and so on. If a
directed graph contains at least one cycle, it is called cyclic. Cycles in a graph of
flight segments, for instance, might be the path of a single airplane. In a call
graph, members of a cycle call each other; these are good candidates for a
“friends and family” style promotion, where the whole group gets a discount,
or for marketing conference call services.


Detecting Cycles in a Graph
There is a simple algorithm to detect whether a directed graph has any cycles.
This algorithm starts with the observation that if a directed graph has no sink
vertices, and it has at least one edge, then any path can be extended arbitrarily.
Without any sink vertices, the terminating node of a path is always connected
to another node, so the path can be extended by appending that node.
Similarly, if the graph has no source nodes, then we can always prepend a node to
the beginning of the path. Once the path contains more nodes than there are
nodes in the graph, we know that the path must visit at least one node twice.
Call this node X. The portion of the path between the first X and the second X
in the path is a cycle, so the graph is cyclic.
Now consider the case when a graph has one or more source nodes and one
or more sink nodes. It is pretty obvious that source nodes and sink nodes
cannot be part of a cycle. Removing the source and sink nodes from the graph,
along with all their edges, does not affect whether the graph is cyclic. If the
resulting graph has no sink nodes or no source nodes, then it contains a cycle,
as just shown. The process of removing sink nodes, source nodes, and their
edges is repeated until one of the following occurs:
–– No more edges or no more nodes are left. In this case, the graph has no
cycles.
–– Some edges remain but there are no source or sink nodes. In this case,
the graph is cyclic.
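
The pruning procedure translates almost directly into code. Here is a minimal
sketch in Python; the function name and the representation of the graph as a
set of (tail, head) edge pairs are illustrative assumptions, not part of the
original description.

def is_cyclic(edges):
    # Detect a cycle by repeatedly pruning source and sink nodes.
    # `edges` is a collection of (tail, head) pairs of a directed graph.
    edges = set(edges)
    while edges:
        tails = {t for t, _ in edges}    # nodes with outgoing edges
        heads = {h for _, h in edges}    # nodes with incoming edges
        sources = tails - heads          # nodes with no incoming edges
        sinks = heads - tails            # nodes with no outgoing edges
        if not sources and not sinks:
            return True                  # edges remain but nothing to prune: cyclic
        # Remove source and sink nodes by dropping every edge they touch.
        edges = {(t, h) for t, h in edges
                 if t not in sources and h not in sinks}
    return False                         # no edges left: acyclic

For example, is_cyclic({("A", "B"), ("B", "C"), ("C", "A")}) returns True,
while is_cyclic({("A", "B"), ("B", "C")}) returns False.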
If no cycles exist, then the graph is called an acyclic graph. These graphs are
useful for describing dependencies or one-way relationships between things.
For instance, different products often belong to nested hierarchies that can be
represented by acyclic graphs. The decision trees described in Chapter 6 are
another example.
In an acyclic graph, any two nodes have a well-defined precedence
relationship with each other. If node A precedes node B in some path that contains both
A and B, then A will precede B in all paths containing both A and B (otherwise
there would be a cycle). In this case, we say that A is a predecessor of B and that
B is a successor of A. If no paths contain both A and B, then A and B are disjoint.
This strict ordering can be an important property of the nodes and is sometimes
useful for data mining purposes.
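
One way to exploit this property is a topological sort, which lists the nodes
of an acyclic graph so that every predecessor appears before all of its
successors. A brief sketch, using the same illustrative edge-pair
representation as above:

from collections import defaultdict, deque

def topological_order(edges):
    # List the nodes so every predecessor comes before its successors.
    successors = defaultdict(list)
    in_degree = defaultdict(int)
    for tail, head in edges:
        successors[tail].append(head)
        in_degree[head] += 1
        in_degree[tail] += 0             # make sure every node is registered
    # Start from the source nodes, which have no incoming edges.
    queue = deque(node for node, d in in_degree.items() if d == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in successors[node]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:      # all predecessors already listed
                queue.append(nxt)
    return order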


A Familiar Application of Link Analysis
Most readers of this book have probably used the Google search engine. Its
phenomenal popularity stems from its ability to help people find reasonably
good material on pretty much any subject. This feat is accomplished through
link analysis.
The World Wide Web is a huge directed graph. The nodes are Web pages and
the edges are the hyperlinks between them. Special programs called spiders or
web crawlers are continually traversing these links to update maps of the huge
directed graph that is the web. Some of these spiders simply index the content
of Web pages for use by purely text-based search engines. Others record the
Web’s global structure as a directed graph that can be used for analysis.
Once upon a time, search engines analyzed only the nodes of this graph.
Text from a query was compared with text from the Web pages using
techniques similar to those described in Chapter 8. Google’s approach (which has
now been adopted by other search engines) is to make use of the information
encoded in the edges of the graph as well as the information found in the nodes.


The Kleinberg Algorithm
Some Web sites or magazine articles are more interesting than others even if
they are devoted to the same topic. This simple idea is easy to grasp but hard
to explain to a computer. So when a search is performed on a topic that many
people write about, it is hard to find the most interesting or authoritative
documents in the huge collection that satisfy the search criteria.
Professor Jon Kleinberg of Cornell University came up with one widely
adopted technique for addressing this problem. His approach takes advantage
of the insight that in creating a link from one site to another, a human being is
making a judgment about the value of the site being linked to. Each link to
another site is effectively a recommendation of that site. Cumulatively, the
independent judgments of many Web site designers who all decide to provide
links to the same target are conferring authority on that target. Furthermore,
the reliability of the sites making the link can be judged according to the
authoritativeness of the sites they link to. The recommendations of a site with
many other good recommendations can be given more weight in determining
the authority of another.
In Kleinberg’s terminology, a page that links to many authorities is a hub; a
page that is linked to by many hubs is an authority. These ideas are illustrated
in Figure 10.7. The two concepts can be used together to tell the difference
between authority and mere popularity. At first glance, it might seem that a
good method for finding authoritative Web sites would be to rank them by the
number of unrelated sites linking to them. The problem with this technique is
that any time the topic is mentioned, even in passing, by a popular site (one
with many inbound links), it will be ranked higher than a site that is much
more authoritative on the particular subject though less popular in general.
The solution is to rank pages, not by the total number of links pointing
to them, but by the number of subject-related hubs that point to them.
Google.com uses a modified and enhanced version of the basic Kleinberg
algorithm described here.
A search based on link analysis begins with an ordinary text-based search.
This initial search provides a pool of pages (often a couple hundred) with
which to start the process. It is quite likely that the set of documents returned
by such a search does not include the documents that a human reader would
judge to be the most authoritative sources on the topic. That is because the
most authoritative sources on a topic are not necessarily the ones that use the
words in the search string most frequently. Kleinberg uses the example of
a search on the keyword “Harvard.” Most people would agree that www.
harvard.edu is one of the most authoritative sites on this topic, but in a purely
content-based analysis, it does not stand out among the more than a million
Web pages containing the word “Harvard,” so it is quite likely that a text-based
search will not return the university’s own Web site among its top results. It is
very likely, however, that at least a few of the documents returned will contain

a link to Harvard’s home page or, failing that, some page that points to
one of the pages in the pool will also point to www.harvard.edu.
An essential feature of Kleinberg’s algorithm is that it does not simply take
the pages returned by the initial text-based search and attempt to rank them; it
uses them to construct the much larger pool of documents that point to or are
pointed to by any of the documents in the root set. This larger pool contains
much more global structure: structure that can be mined to determine which
documents are considered to be most authoritative by the wide community of
people who created the documents in the pool.


The Details: Finding Hubs and Authorities
Kleinberg’s algorithm for identifying authoritative sources has three phases:
1. Creating the root set
2. Identifying the candidates
3. Ranking hubs and authorities
In the first phase, a root set of pages is formed using a text-based search
engine to find pages containing the search string. In the second phase, this root
set is expanded to include documents that point to or are pointed to by
documents in the root set. This expanded set contains the candidates. In the third
phase, which is iterative, the candidates are ranked according to their strength
as hubs (documents that have links to many authoritative documents) and
authorities (pages that have links from many authoritative hubs).

Creating the Root Set
The root set of documents is generated using a content-based search. As a first
step, stop words (common words such as “a,” “an,” “the,” and so on) are
removed from the original search string supplied. Then, depending on the
particular content-based search strategy employed, the remaining search terms
may undergo stemming. Stemming reduces words to their root form by removing
plural forms and other endings due to verb conjugation, noun declension,
and so on. Then, the Web index is searched for documents containing the
terms in the search string. There are many variations on the details of how
matches are evaluated, which is one reason why performing the same search
on two text-based search engines yields different results. In any case, some
combination of the number of matching terms, the rarity of the terms matched,
and the number of times the search terms are mentioned in a document is used
to give the indexed documents a score that determines their rank in relation to
the query. The top n documents are used to establish the root set. A typical
value for n is 200.
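
As a concrete illustration, root-set construction might look like the sketch
below. The stop-word list is abbreviated, and search_index stands in for
whatever content-based search engine is actually available; both are
assumptions made for the example.

STOP_WORDS = {"a", "an", "and", "in", "of", "on", "or", "the"}

def build_root_set(query, search_index, n=200):
    # Drop stop words; a stemming step could also be applied here.
    terms = [t for t in query.lower().split() if t not in STOP_WORDS]
    # search_index is assumed to return (score, url) pairs for the terms.
    scored = search_index(terms)
    scored.sort(reverse=True)            # highest-scoring documents first
    return [url for score, url in scored[:n]]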


Identifying the Candidates
In the second phase, the root set is expanded to create the set of candidates. The
candidate set includes all pages that any page in the root set links to along with
a subset of the pages that link to any page in the root set. Locating pages that
link to a particular target page is simple if the global structure of the Web is
available as a directed graph. The same task can also be accomplished with an
index-based text search using the URL of the target page as the search string.
The reason for using only a subset of the pages that link to each page in the
root set is to guard against the possibility of an extremely popular site in the
root set bringing in an unmanageable number of pages. There is also a
parameter d that limits the number of pages that may be brought into the candidate
set by any single member of the root set.
If more than d documents link to a particular document in the root set, then
an arbitrary subset of d documents is brought into the candidate set. A typical
value for d is 50. The candidate set typically ends up containing 1,000 to 5,000
documents.
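
In code, the expansion step might be sketched as follows, where links_from
and links_to are assumed helper functions returning a page's outgoing and
incoming neighbors, whether from a stored Web graph or an index-based search.

import random

def build_candidate_set(root_set, links_from, links_to, d=50):
    candidates = set(root_set)
    for page in root_set:
        candidates.update(links_from(page))        # every page it links to
        in_links = list(links_to(page))            # pages linking to it
        if len(in_links) > d:
            in_links = random.sample(in_links, d)  # arbitrary subset of d pages
        candidates.update(in_links)
    return candidates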
This basic algorithm can be refined in various ways. One possible refinement,
for instance, is to filter out any links from within the same domain,
many of which are likely to be purely navigational. Another refinement is to
allow a document in the root set to bring in at most m pages from the same site.
This is to avoid being fooled by “collusion” between all the pages of a site to,
for example, advertise the site of the Web site designer with a “this site
designed by” link on every page.

Ranking Hubs and Authorities
The final phase is to divide the candidate pages into hubs and authorities and
rank them according to their strength in those roles. This process also has the
effect of grouping together pages that refer to the same meaning of a search
term with multiple meanings: for instance, Madonna the rock star versus the
Madonna and Child in art history, or Jaguar the car versus jaguar the big cat. It
also differentiates between authorities on the topic of interest and sites that are
simply popular in general. Authoritative pages on the correct topic are not
only linked to by many pages, they tend to be linked to by the same pages. It is
these hub pages that tie together the authorities and distinguish them from
unrelated but popular pages. Figure 10.7 illustrates the difference between
hubs, authorities, and unrelated popular pages.
Hubs and authorities have a mutually reinforcing relationship. A strong hub
is one that links to many strong authorities; a strong authority is one that is
linked to by many strong hubs. The algorithm therefore proceeds iteratively,
first adjusting the strength rating of the authorities based on the strengths of
the hubs that link to them and then adjusting the strengths of the hubs based
on the strength of the authorities to which they link.




Figure 10.7 Google uses link analysis to distinguish hubs, authorities, and popular pages.


For each page, there is a value A that measures its strength as an authority
and a value H that measures its strength as a hub. Both of these values are
initialized to 1 for all pages. Then, the A value for each page is updated by
adding up the H values of all the pages that link to it. The A values are then
normalized so that the sum of their squares is equal to 1. Then the H values
are updated in a similar manner: the H value for each page is set to the sum
of the A values of the pages it links to, and the new H values are normalized
so that the sum of their squares is equal to 1. This process is repeated until
an equilibrium set of A and H values is reached. The pages that end up with
the highest H values are the strongest hubs; those with the highest A values
are the strongest authorities.
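
The iteration can be sketched as follows, assuming the candidate graph is
given as a dictionary mapping each page to the set of pages it links to; a
fixed iteration count stands in for a proper convergence test.

import math

def rank_hubs_and_authorities(out_links, iterations=50):
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    A = {p: 1.0 for p in pages}                    # authority scores
    H = {p: 1.0 for p in pages}                    # hub scores
    for _ in range(iterations):
        # Authority pass: sum the hub scores of the pages linking in.
        A = {p: 0.0 for p in pages}
        for source, targets in out_links.items():
            for target in targets:
                A[target] += H[source]
        norm = math.sqrt(sum(v * v for v in A.values())) or 1.0
        A = {p: v / norm for p, v in A.items()}
        # Hub pass: sum the authority scores of the pages linked to.
        H = {p: sum(A[t] for t in out_links.get(p, ())) for p in pages}
        norm = math.sqrt(sum(v * v for v in H.values())) or 1.0
        H = {p: v / norm for p, v in H.items()}
    return A, H

Normalizing by the sum of squares after each pass keeps the two sets of
scores bounded, so repeated passes settle into the equilibrium described
above.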
The authorities returned by this application of link analysis tend to be
strong examples of one particular possible meaning of the search string. A
search on a contentious topic such as “gay marriage” or “Taiwan independence”
yields strong authorities on both sides because the global structure of
the Web includes tightly connected subgraphs representing documents
maintained by like-minded authors.


Hubs and Authorities in Practice
The strongest case for the advantage of adding link analysis to text-based
searching comes from the marketplace. Google, a search engine developed at
Stanford by Sergey Brin and Lawrence Page using an approach very similar to
Kleinberg’s, was the first of the major search engines to make use of link
analysis to find hubs and authorities. It quickly surpassed long-entrenched
search services such as AltaVista and Yahoo! The reason was qualitatively
better searches.
The authors noticed that something was special about Google back in April
