
of 2001 when we studied the web logs from our company's site,
www.data-miners.com. At that time, industry surveys gave Google and AltaVista
approximately equal 10 percent shares of the market for web searches, and yet
Google accounted for 30 percent of the referrals to our site while AltaVista
accounted for only 3 percent. Apparently, Google was better able to recognize
our site as an authority on data mining consulting because it was less confused
by the large number of sites that use the phrase "data mining" even though
they actually have little to do with the topic.


Case Study: Who Is Using Fax Machines from Home?
Graphs appear in data from other industries as well. Mobile, local, and long-
distance telephone service providers have records of every telephone call that
their customers make and receive. This data contains a wealth of information
about the behavior of their customers: when they place calls, who calls them,
whether they benefit from their calling plan, to name a few. As this case study
shows, link analysis can be used to analyze the records of local telephone calls
to identify which residential customers have a high probability of having fax
machines in their home.


Why Finding Fax Machines Is Useful
What is the use of knowing who owns a fax machine? How can a telephone
provider act on this information? In this case, the provider had developed a
package of services for residential work-at-home customers. Targeting such
customers for marketing purposes was a revolutionary concept at the company.
In the tightly regulated local phone market of not so long ago, local service
providers lost revenue from work-at-home customers, because these
customers could have been paying higher business rates instead of lower
residential rates. Far from targeting such customers for marketing campaigns,
the local telephone providers would deny such customers residential rates,
punishing them for behaving like a small business. For this company, developing
and selling work-at-home packages represented a new foray into customer
service. One question remained. Which customers should be targeted for the
new package?


There are many approaches to defining the target set of customers. The company
could effectively use neighborhood demographics, household surveys,
estimates of computer ownership by zip code, and similar data. Although this
data improves the definition of a market segment, it is still far from identifying
individual customers with particular needs. A team, including one of the
authors, suggested that the ability to find residential fax machine usage would
improve this marketing effort, since fax machines are often (but not always)
used for business purposes. Knowing who uses a fax machine would help target
the work-at-home package to a very well-defined market segment, and
this segment should have a better response rate than a segment defined by less
precise segmentation techniques based on statistical properties.
Customers with fax machines offer other opportunities as well. Customers
who are sending and receiving faxes should have at least two lines; if they
have only one, there is an opportunity to sell them a second line. To provide
better customer service, the customers who use faxes on a line with call waiting
should know how to turn off call waiting to avoid annoying interruptions
on fax transmissions. There are other possibilities as well: perhaps owners of
fax machines would prefer receiving their monthly bills by fax instead of by
mail, saving both postage and printing costs. In short, being able to identify
who is sending or receiving faxes from home is valuable information that provides
opportunities for increasing revenues, reducing costs, and increasing
customer satisfaction.


The Data as a Graph
The raw data used for this analysis was composed of selected fields from the
call detail data fed into the billing system to generate monthly bills. Each
record contains 80 bytes of data, with information such as:
- The 10-digit telephone number that originated the call: three digits for
  the area code, three digits for the exchange, and four digits for the line
- The 10-digit telephone number of the line where the call terminated
- The 10-digit telephone number of the line being billed for the call
- The date and time of the call
- The duration of the call
- The day of the week when the call was placed
- Whether the call was placed at a pay phone
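To make the record contents concrete, here is a minimal sketch of how one call detail record might be represented in Python. The class name, field names, and the sample area code are our own illustrative assumptions; the actual 80-byte layout used by the billing system is not described here, so no byte offsets are implied.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

DAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday")


@dataclass(frozen=True)
class CallRecord:
    """One call detail record; all field names are hypothetical."""
    originating: str     # 10-digit number that placed the call
    terminating: str     # 10-digit number where the call terminated
    billed: str          # 10-digit number billed for the call
    start: datetime      # date and time of the call
    duration: timedelta  # length of the call
    pay_phone: bool      # whether the call was placed at a pay phone

    @property
    def day_of_week(self) -> str:
        # Derived from the start time here for brevity; the source record
        # carries the day of the week as a field of its own.
        return DAYS[self.start.weekday()]


def split_number(number: str) -> tuple[str, str, str]:
    """Split a 10-digit number into (area code, exchange, line)."""
    if len(number) != 10 or not number.isdigit():
        raise ValueError(f"expected 10 digits, got {number!r}")
    return number[:3], number[3:6], number[6:]


# Made-up example values; the 212 area code is purely illustrative.
record = CallRecord("2123533658", "2123505166", "2123533658",
                    datetime(2001, 1, 1, 9, 30), timedelta(seconds=41), False)
area_code, exchange, line = split_number(record.originating)
```

Keeping the numbers as strings rather than integers preserves leading zeros and makes the area code, exchange, and line easy to slice apart.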


In the graph in Figure 10.8, the data has been narrowed to just three fields:
duration, originating number, and terminating number. The telephone numbers
are the nodes of the graph, and the calls themselves are the edges, weighted by
the duration of the calls. A sample of telephone calls is shown in Table 10.1.



[Figure 10.8 graphic: a call graph in which the seven telephone numbers are
nodes and the five calls of Table 10.1 are edges labeled with their durations.]

Figure 10.8 Five calls link together seven telephone numbers.




Table 10.1 Five Telephone Calls

ID   ORIGINATING NUMBER   TERMINATING NUMBER   DURATION
1    353-3658             350-5166             00:00:41
2    353-3068             350-5166             00:00:23
3    353-4271             353-3068             00:00:01
4    353-3108             555-1212             00:00:42
5    353-3108             350-6595             00:01:22
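The five calls in Table 10.1 can be turned into a weighted, directed graph with a few lines of code. The following is a minimal sketch, not the code used in the project; the function names and the adjacency-list representation are our own assumptions.

```python
from collections import defaultdict

# The five calls of Table 10.1: (originating, terminating, duration HH:MM:SS).
CALLS = [
    ("353-3658", "350-5166", "00:00:41"),
    ("353-3068", "350-5166", "00:00:23"),
    ("353-4271", "353-3068", "00:00:01"),
    ("353-3108", "555-1212", "00:00:42"),
    ("353-3108", "350-6595", "00:01:22"),
]


def to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS duration into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def build_call_graph(calls):
    """Return (nodes, edges); edges maps each caller to (callee, seconds) pairs."""
    nodes = set()
    edges = defaultdict(list)
    for origin, destination, duration in calls:
        nodes.update((origin, destination))
        edges[origin].append((destination, to_seconds(duration)))
    return nodes, edges


nodes, edges = build_call_graph(CALLS)
print(len(nodes), sum(len(out) for out in edges.values()))  # prints "7 5"
```

Each call becomes one edge, so a pair of numbers with several calls between them would have several parallel edges; summing the durations into a single edge weight is an alternative design.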




The Approach
Finding fax machines is based on a simple observation: Fax machines tend to
call other fax machines. A set of known fax numbers can be expanded based on
the calls made to or received from the known numbers. If an unclassified telephone
number calls known fax numbers and doesn't hang up quickly, then there
is evidence that it can be classified as a fax number. This simple characterization
is good for guidance, but it is an oversimplification. There are actually several
types of expected fax machine usage for residential customers:
- Dedicated fax. Some fax machines are on dedicated lines, and the line is
  used only for fax communication.
- Shared. Some fax machines share their line with voice calls.
- Data. Some fax machines are on lines dedicated to data use, either via
  fax or via computer modem.


TIP Characterizing expected behavior is a good way to start any directed data
mining problem. The better the problem is understood, the better the results
are likely to be.


The presumption that fax machines call other fax machines is generally true
for machines on dedicated lines, although wrong numbers provide exceptions
even to this rule. To distinguish shared lines from dedicated or data lines, we
assumed that any number that calls information (411 or 555-1212, the directory
assistance services) is used for voice communications, and is therefore a
voice line or a shared fax line. For instance, call #4 in the example data is
a call to 555-1212, signifying that the calling number is likely to be a shared line
or just a voice line. When a shared line calls another number, there is no way
to know if the call is voice or data. We cannot identify fax machines based on
calls to and from such a node in the call graph. On the other hand, these shared
lines do represent a marketing opportunity to sell additional lines.
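The directory assistance rule is easy to express in code. The sketch below, with names of our own choosing, flags every number in the sample data that placed a call to information.

```python
# Numbers treated as directory assistance in the text.
DIRECTORY_ASSISTANCE = {"411", "555-1212"}

# (originating, terminating) pairs for the five calls of Table 10.1.
CALLS = [
    ("353-3658", "350-5166"),
    ("353-3068", "350-5166"),
    ("353-4271", "353-3068"),
    ("353-3108", "555-1212"),
    ("353-3108", "350-6595"),
]


def voice_or_shared_lines(calls):
    """Return the numbers that placed at least one call to information."""
    return {origin for origin, destination in calls
            if destination in DIRECTORY_ASSISTANCE}


print(voice_or_shared_lines(CALLS))  # prints {'353-3108'}
```

Only 353-3108 is flagged, because of call #4. Its other call, to 350-6595, therefore says nothing about whether 350-6595 is a fax machine.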
The process used to find fax machines consisted of the following steps:
1. Start with a set of known fax machines (gathered from the Yellow Pages).
2. Determine all the numbers that make or receive calls to or from any
   number in this set where the call's duration was longer than 10 seconds.
   These numbers are candidates.
   - If the candidate number has called 411, 555-1212, or a number identified
     as a shared fax number, then it is included in the set of shared
     voice/fax numbers.
   - Otherwise, it is included in the set of known fax machines.
3. Repeat Steps 1 and 2 until no more numbers are identified.
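The steps above can be sketched as a short iterative procedure. This is an illustration in Python under stated assumptions, not the project's special-purpose code: the function name is ours, the seed numbers in the example are hypothetical (nothing in Table 10.1 says which numbers are faxes), and candidates within a round are processed in sorted order, a detail the steps leave open.

```python
INFORMATION = {"411", "555-1212"}
MIN_SECONDS = 10  # only calls longer than this count as evidence


def find_fax_numbers(calls, seeds):
    """Expand a seed set of fax numbers; returns (fax, shared).

    calls is an iterable of (origin, destination, seconds) tuples.
    """
    calls = list(calls)
    dialed = {}  # origin -> set of numbers it has called
    for origin, destination, _ in calls:
        dialed.setdefault(origin, set()).add(destination)

    fax, shared = set(seeds), set()
    changed = True
    while changed:  # Step 3: repeat until no more numbers are identified
        changed = False
        # Step 2: numbers on a long-enough call with a known fax machine.
        candidates = set()
        for origin, destination, seconds in calls:
            if seconds <= MIN_SECONDS:
                continue  # very short calls are assumed to be wrong numbers
            if origin in fax:
                candidates.add(destination)
            if destination in fax:
                candidates.add(origin)
        candidates -= fax | shared | INFORMATION
        for number in sorted(candidates):
            if dialed.get(number, set()) & (INFORMATION | shared):
                shared.add(number)  # has called information or a shared line
            else:
                fax.add(number)
            changed = True
    return fax, shared


# Table 10.1 with durations in seconds; the two seeds are hypothetical.
CALLS = [
    ("353-3658", "350-5166", 41),
    ("353-3068", "350-5166", 23),
    ("353-4271", "353-3068", 1),
    ("353-3108", "555-1212", 42),
    ("353-3108", "350-6595", 82),
]
fax, shared = find_fax_numbers(CALLS, seeds={"350-5166", "350-6595"})
```

With these assumed seeds, 353-3658 and 353-3068 join the fax set, 353-3108 lands in the shared set because of its call to 555-1212, and 353-4271 stays unclassified because its only call lasted one second. Numbers in the shared set are never used to find further candidates, since calls to and from a shared line carry no evidence about fax usage.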
One of the challenges was identifying wrong numbers. In particular, incoming
calls to a fax machine may sometimes represent a wrong number and give
no information about the originating number (actually, if it is a wrong number
then it is probably a voice line). We made the assumption that such incoming
wrong numbers would last a very short time, as is the case with call #3. In a
larger-scale analysis of fax machines, it would be useful to eliminate other
anomalies, such as outgoing wrong numbers and modem/fax usage.


The process starts with an initial set of fax numbers. Since this was a demonstration
project, several fax numbers were gathered manually from the Yellow
Pages based on the annotation "fax" by the number. For a larger-scale project,
all fax numbers could be retrieved from the database used to generate the
Yellow Pages. These numbers are only the beginning, the seeds, of the list of fax
machine telephone numbers. Although it is common for businesses to advertise
their fax numbers, this is not so common for fax machines at home.


Some Results
The sample of telephone records consisted of 3,011,819 telephone calls made
over one month by 19,674 households. In the world of telephony, this is a very
small sample of data, but it was sufficient to demonstrate the power of link
analysis. The analysis was performed using special-purpose C++ code that
stored the call detail and allowed us to expand a list of fax machines efficiently.
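A quick calculation puts the sample size in perspective. The sketch below assumes that each call is stored as one 80-byte record, as described earlier; the variable names are ours.

```python
calls = 3_011_819
households = 19_674
record_bytes = 80  # assumed: one 80-byte record per call

calls_per_household = calls / households          # about 153 calls in the month
raw_megabytes = calls * record_bytes / 1_000_000  # about 241 MB of call detail
print(round(calls_per_household), round(raw_megabytes))  # prints "153 241"
```

That works out to roughly five calls per household per day.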
Finding the fax machines is an example of a graph-coloring algorithm. This type
of algorithm walks through the graph and labels nodes with different "colors." In
this case, the colors are “fax,” “shared,” “voice,” and “unknown” instead of red,
green, yellow, and blue. Initially, all the nodes are “unknown” except for the few
labeled “fax” from the starting set. As the algorithm proceeds, more and more
nodes with the “unknown” label are given more informative labels.
Figure 10.9 shows a call graph with 15 numbers and 19 calls. The weights on
the edges are the duration of each call in seconds. Nothing is really known
about the specific numbers.


[Figure 10.9 graphic: a call graph of 15 numbers and 19 calls, with edges
labeled by call duration in seconds; one node is Information (411).]



