nents involved. The rest of this paper is organized as follows:
Section 2 describes the problem setting. Section 3 presents
a brief summary of related work. Section 4 describes our al-
gorithms; namely, user clustering using Minhash and PLSI,
and item-item covisitation based recommendations. Sec-
tion 5 describes how such a system can be implemented.
Section 6 reports the results of comparative analysis with
other collaborative filtering algorithms and quality evalua-
tions on live traffic. We finish with some conclusions and
open problems in Section 7.
2. PROBLEM SETTING
Google News is a computer-generated news site that ag-
gregates news articles from more than 4,500 news sources
worldwide, groups similar stories together and displays them
according to each reader’s personalized interests. Numerous
editions by country and language are available. The home
page for Google News shows “Top stories” on the top left
hand corner, followed by category sections such as World,
U.S. , Business,, etc. Each section contains the top three
headlines from that category. To the left of the “Top Sto-
ries” is a navigation bar that links each of these categories to
a page full of stories from that category. This is the format
of the home page for non signed-in users
1
.Furthermore,
if you sign-in using your Google account and opt-in to the
“Search History” feature that is provided by various Google
product websites, you enjoy two additional features:
(a) Google will record your search queries and clicks on news
stories and make them accessible to you online. This allows
you to easily browse stories you have read in the past.
(b) Just below the “Top Stories” section you will see a
section labeled “Recommended for youremailaddress”along
with three stories that are recommended to you based on
your past click history.
The goal of our project is to present recommendations
to signed-in users based on their click history and the click
history of the community. In our setting, a user’s click on
an article is treated as a positive vote for the article. This
sets our problem further apart from settings like Netflix,
MovieLens etc., where users are asked to rate movies on a
1-5scale. Thetwodifferencesare:
1. Treating clicks as a positive vote is more noisy than ac-
cepting explicit 1-5 star ratings or treating a purchase as a
positive vote, as can be done in a setting like amazon.com.
While different mechanisms can be adopted to track the au-
thenticity of a user’s vote, given that the focus of this paper
is on collaborative filtering and not on how user votes are
collected, for the purposes of this paper we will assume that
clicks indeed represent user interest.
2
2. While clicks can be used to capture positive user interest,
they don’t say anything about a user’s negative interest.
This is in contrast to Netflix, eachmovie etc. where users
give a rating on a scale of 1-5.
1
The format for the web-site is subject to change.
2
While in general a click on a news article by a user does not
necessarily mean that she likes the article, we believe that
this is less likely in the case of Google News where there are
clean (non-spammy) snippets for each story that the user
gets to see before clicking. Infact, our concern is that often
the snippets are so good in quality that the user may not
click on the news story even if she is interested in it; she
gets to know all that she wants from the snippet
2.1 Scale of our operations
The Google News website is one of the most popular news
websites in the world receiving millions of page views and
clicks from millions of users. There is a large variance in the
click history size of the users, with numbers being anywhere
from zero to hundreds, even thousands, for certain users.
The number of news stories
3
that we observe over a period
of one month is of the order of several million. Moreover,
as mentioned earlier, the set of news stories undergoes a
constant churn with new stories added every minute and
old ones getting dropped.
2.2 The problem statement
With the preceding overview, the problem for our rec-
ommender system can be stated as follows: Presented with
the click history for N users (U = {u
1
,u
2
,...,u
N
} )overM
items (S = {s
1
,s
2
,...,s
M
}), and given a specific user u with
click history set C
u
consisting of stories {s
i
1
,...,s
i
|C
u
|
},
recommend K stories to the user that she might be inter-
ested in reading. Every time a signed-in user accesses the
home-page, we solve this problem and populate the “Recom-
mended” stories section. Similarly, when the user clicks on
the Recommended section link in the navigation bar to the
left of the “Top Stories” on home-page, we present her with
a page full of recommended stories, solving the above stated
problem for a different value of K. Additionally we require
that the system should be able to incorporate user feedback
(clicks) instantly, thus providing instant gratification.
2.3 Strict timing requirements
The Google News website strives to maintain a strict re-
sponse time requirement for any page views. In particular,
home-page view and full-page view for any of the category
sections are typically generated within a second. Taking into
account the time spent in the News Frontend webserver for
generating the news story clusters by accessing the various
indexes that store the content, time spent in generating the
HTML content that is returned in response to the HTTP
request, it leaves a few hundred milliseconds for the recom-
mendation engine to generate recommendations.
Having described the problem setting and the underlying
challenges, we will give a brief summary of related work on
recommender systems before describing our algorithms.
3. RELATED WORK
Recommender systems can be broadly categorized into
two types: Content based and Collaborative filtering.
In content based systems the similarity of items, defined
in terms of their content, to other items that have been
rated highly by the user is used to recommend new items.
However, in the case of domains such as news, a user’s in-
terest in an article cannot always be characterized by the
terms/topics present in a document. In addition, our aim
was to build a system that could be potentially applied to
other domains (e.g. images, music, videos), where it is hard
to analyse the underlying content, and hence we developed
a content-agnostic system. For the particular application of
3
As mentioned earlier, the website clusters news articles
from different news sites (e.g. BBC, CNN, ABC news etc.)
that are about the same story and presents an aggregated
view to the users. For the purpose of our discussion, when
we refer to a news story it means a cluster of news articles
about the same story as identified by Google News.
WWW 2007 / Track: Industrial Practice and Experience May 8-12, 2007. Banff, Alberta, Canada
272