Usage Meets Links Analysis: Towards Improving Intranet and Site Specific Search via Usage Statistics
Investigates ranking signals incorporating usage information with link analysis. A number of existing and newly proposed algorithms are compared.
Thesis in pdf format. A presentation is also available.
Infrastructure
Various search engine components are implemented:
- UMirror: A relatively simple but functional crawler with regular expression based matching.
- USearch: Modular, flexible, efficient intranet search engine. USearch indexed and served cs.umn.edu site searches for more than a year. In the order of 100k documents were indexed and served.
- Log processing modules to extract usage information to be used for some of the ranking algorithms.
- All ranking algorithms are implemented and incorporated to USearch.
- A special version of USearch allowing explicit selection of relevant documents is launched for evaluations.
Algorithms
- UPR (Usage Aware PageRank). A pagerank improvement that incorporates link traversal and page landing statistics into PageRank. (Oztekin et al).
- UHITS (Usage Augmented HITS). HITS algorithm augmented via usage statistics. (Oztekin).
- Plain page visitation statistics directly applied as a quality measure and a slight modification. A similar algorithm is used in early DirectHit search engine.
- PageRank. Link analysis algorithm known to be used by early Google (Brin and Page).
- HITS, Hyperlink-Induced Topic Search, (Kleinberg).
Methods are compared using a query set as well as by their global properties (e.g. stability, ability to provide a reliable global ordering).
Evaluation
Global comparisons
UPR and UHITS parameters are sampled. Algorithms (and parameters) are compared via:
- How much distinguishing power they offer (e.g. percentage of pages that do not get a score or that have the same score).
- How stable they are (do they converge/diverge after a number of iterations?).
- Correlations to every other algorithm/parameters using various methods.
Query dependent evaluations
Algorithms are compared via the following evaluation sets.
- Arbitrated set: A small set of raters rate selected queries. Each query rated thoroughly by at least 2 raters.
- Public evaluation set: Grad students and faculty are invited to participate. Larger number of queries compared to the first set, but potentially more noisy.
- Homepage search: Random set of grad and faculty names are obtained via the finger utility. Relevant document for each person is deemed to be the homepage of the user. Free of rater bias (no raters), but a very specific type of search.
For each set, a set of relevant documents are identified. Positions of relevant documents using each algorihtm are noted and compared. Algorithm that ranks the relevant documents in earlier positions is deemed superior.
Results
One of the suggested algorithms, UPR (Usage Aware PageRank), outperformed all other algorithms in practically all datasets. Results suggest that usage information may play an important role in improving ranking quality especially for site specific and intranet search domains where spam is typically not an issue.
Modifications that emphasize group behavior over individuals' behavior are proposed and compared. Those can be used to increase spam resistance and reduce other undesirable effects. Results show that such filters did not affect the scores of most popular and high quality documents, but significantly penalized perturbations resulting from a single or few sources.
|