Novel approaches to crawling important pages early

Md Hijbul Alam, JongWoo W. Ha, Sang-Geun Lee

Research output: Contribution to journal, Article

12 Citations (Scopus)

Abstract

Web crawlers are essential to many Web applications, such as Web search engines, Web archives, and Web directories, which maintain Web pages in their local repositories. In this paper, we study the problem of crawl scheduling that biases crawl ordering toward important pages. We propose a set of crawling algorithms for effective and efficient crawl ordering by prioritizing important pages with the well-known PageRank as the importance metric. In order to score URLs, the proposed algorithms utilize various features, including partial link structure, inter-host links, page titles, and topic relevance. We conduct a large-scale experiment using publicly available data sets to examine the effect of each feature on crawl ordering and evaluate the performance of many algorithms. The experimental results verify the efficacy of our schemes. In particular, compared with the representative RankMass crawler, the FPR-title-host algorithm reduces computational overhead by a factor as great as three in running time while improving effectiveness by 5% in cumulative PageRank.
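
As a rough illustration of what biasing crawl ordering toward important pages can look like, the Python sketch below implements a generic best-first crawl frontier over a toy in-memory link graph. It is not the paper's FPR-title-host algorithm, nor RankMass. The function names `crawl_order` and `cumulative_importance`, the rule that splits a page's score evenly among its out-links, and the example graph are all assumptions made for illustration. The second function mirrors the idea of tracking cumulative importance over the crawl, using made-up importance values rather than real PageRank scores.

```python
import heapq
from collections import defaultdict


def crawl_order(graph, seeds, budget):
    """Best-first crawl over a toy link graph (illustrative only).

    graph: dict mapping URL -> list of out-link URLs (stands in for fetching).
    A URL's priority is the score passed to it by already-crawled pages,
    a simple stand-in for an importance estimate from partial link structure.
    """
    score = defaultdict(float)
    for s in seeds:
        score[s] = 1.0
    # heapq is a min-heap, so scores are negated; stale entries are skipped lazily.
    frontier = [(-score[s], s) for s in seeds]
    heapq.heapify(frontier)
    crawled, order = set(), []
    while frontier and len(order) < budget:
        neg, url = heapq.heappop(frontier)
        if url in crawled or -neg < score[url]:
            continue  # already crawled, or an outdated lower-score entry
        crawled.add(url)
        order.append(url)
        links = graph.get(url, [])
        for v in links:
            if v not in crawled:
                # Hypothetical rule: split the page's score evenly over its out-links.
                score[v] += score[url] / len(links)
                heapq.heappush(frontier, (-score[v], v))
    return order


def cumulative_importance(order, importance):
    """Running sum of importance values in crawl order."""
    total, curve = 0.0, []
    for url in order:
        total += importance.get(url, 0.0)
        curve.append(total)
    return curve


if __name__ == "__main__":
    toy_graph = {"a": ["b", "c", "d"], "b": ["d"], "c": ["d"], "d": []}
    order = crawl_order(toy_graph, seeds=["a"], budget=4)
    print(order)
    print(cumulative_importance(order, {"a": 0.1, "b": 0.2, "c": 0.2, "d": 0.5}))
```

On the toy graph, page `d` is crawled before `c` because it has accumulated score from two already-crawled pages (`a` and `b`), which is the kind of reordering a priority-driven frontier is meant to produce.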

Original language: English
Pages (from-to): 707-734
Number of pages: 28
Journal: Knowledge and Information Systems
Volume: 33
Issue number: 3
DOI: 10.1007/s10115-012-0535-4
Publication status: Published - 2012 Sep 10

Fingerprint

World Wide Web
Websites
Search engines
Scheduling
Experiments

Keywords

  • Crawl ordering
  • Fractional PageRank
  • PageRank
  • Web crawler

ASJC Scopus subject areas

  • Artificial Intelligence
  • Software
  • Information Systems
  • Hardware and Architecture
  • Human-Computer Interaction

Cite this

Novel approaches to crawling important pages early. / Alam, Md Hijbul; Ha, JongWoo W.; Lee, Sang-Geun.

In: Knowledge and Information Systems, Vol. 33, No. 3, 10.09.2012, p. 707-734.

Alam, Md Hijbul ; Ha, JongWoo W. ; Lee, Sang-Geun. / Novel approaches to crawling important pages early. In: Knowledge and Information Systems. 2012 ; Vol. 33, No. 3. pp. 707-734.
@article{0d15b6532a5f46d2bd70c4760e599188,
title = "Novel approaches to crawling important pages early",
abstract = "Web crawlers are essential to many Web applications, such as Web search engines, Web archives, and Web directories, which maintain Web pages in their local repositories. In this paper, we study the problem of crawl scheduling that biases crawl ordering toward important pages. We propose a set of crawling algorithms for effective and efficient crawl ordering by prioritizing important pages with the well-known PageRank as the importance metric. In order to score URLs, the proposed algorithms utilize various features, including partial link structure, inter-host links, page titles, and topic relevance. We conduct a large-scale experiment using publicly available data sets to examine the effect of each feature on crawl ordering and evaluate the performance of many algorithms. The experimental results verify the efficacy of our schemes. In particular, compared with the representative RankMass crawler, the FPR-title-host algorithm reduces computational overhead by a factor as great as three in running time while improving effectiveness by 5{\%} in cumulative PageRank.",
keywords = "Crawl ordering, Fractional PageRank, PageRank, Web crawler",
author = "Alam, {Md Hijbul} and Ha, {JongWoo W.} and Sang-Geun Lee",
year = "2012",
month = "9",
day = "10",
doi = "10.1007/s10115-012-0535-4",
language = "English",
volume = "33",
pages = "707--734",
journal = "Knowledge and Information Systems",
issn = "0219-1377",
publisher = "Springer London",
number = "3",

}

TY - JOUR

T1 - Novel approaches to crawling important pages early

AU - Alam, Md Hijbul

AU - Ha, JongWoo W.

AU - Lee, Sang-Geun

PY - 2012/9/10

Y1 - 2012/9/10

AB - Web crawlers are essential to many Web applications, such as Web search engines, Web archives, and Web directories, which maintain Web pages in their local repositories. In this paper, we study the problem of crawl scheduling that biases crawl ordering toward important pages. We propose a set of crawling algorithms for effective and efficient crawl ordering by prioritizing important pages with the well-known PageRank as the importance metric. In order to score URLs, the proposed algorithms utilize various features, including partial link structure, inter-host links, page titles, and topic relevance. We conduct a large-scale experiment using publicly available data sets to examine the effect of each feature on crawl ordering and evaluate the performance of many algorithms. The experimental results verify the efficacy of our schemes. In particular, compared with the representative RankMass crawler, the FPR-title-host algorithm reduces computational overhead by a factor as great as three in running time while improving effectiveness by 5% in cumulative PageRank.

KW - Crawl ordering

KW - Fractional PageRank

KW - PageRank

KW - Web crawler

UR - http://www.scopus.com/inward/record.url?scp=84869092092&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84869092092&partnerID=8YFLogxK

U2 - 10.1007/s10115-012-0535-4

DO - 10.1007/s10115-012-0535-4

M3 - Article

AN - SCOPUS:84869092092

VL - 33

SP - 707

EP - 734

JO - Knowledge and Information Systems

JF - Knowledge and Information Systems

SN - 0219-1377

IS - 3

ER -