Classification of web robots: An empirical study based on over one billion requests

Junsup Lee, Sungdeok Cha, Dongkun Lee, Hyungkyu Lee

Research output: Contribution to journal, Article

26 Citations (Scopus)

Abstract

Many studies on the detection and classification of web robots have focused mostly on text crawlers, and their empirical experiments used relatively small data sets collected at universities. In this paper, we analyzed more than one billion requests made to www.microsoft.com over 24 h. The web logs were anonymized to eliminate potential privacy concerns while preserving essential characteristics (e.g., frequency and queries). We developed effective characterization metrics, based on workload characteristics and resource types, for detecting and classifying various web robots, including text crawlers, link checkers, and icon crawlers. As expected, web robot behavior was clearly different from that of typical interactive users, and different types of web robots also exhibited different characteristics. However, comparison of web robots of the same type, text crawlers in particular, revealed different characteristics, thereby enabling characterization with a reasonably high confidence level. We divided the feature metrics into five groups, and the effectiveness of each group in classification is shown in a polar diagram, in decreasing order of effectiveness in the clockwise direction. One can use the findings to infer the likely identity of unknown web robots, and organizations can develop appropriate measures to deal with them. Our analysis is based on recent web log data collected at one of the best-known sites, which offers a truly global service.

Original language: English
Pages (from-to): 795-802
Number of pages: 8
Journal: Computers and Security
Volume: 28
Issue number: 8
DOI: 10.1016/j.cose.2009.05.004
Publication status: Published - 2009 Nov 1

Fingerprint

Software agents
robot
workload
privacy
Group
confidence
university
experiment
resources
Experiments

Keywords

  • Web robot characterization
  • Web robot classification
  • Web robot detection
  • Web security
  • Web usage mining

ASJC Scopus subject areas

  • Computer Science (all)
  • Law

Cite this

Classification of web robots : An empirical study based on over one billion requests. / Lee, Junsup; Cha, Sungdeok; Lee, Dongkun; Lee, Hyungkyu.

In: Computers and Security, Vol. 28, No. 8, 01.11.2009, p. 795-802.

Lee, Junsup ; Cha, Sungdeok ; Lee, Dongkun ; Lee, Hyungkyu. / Classification of web robots : An empirical study based on over one billion requests. In: Computers and Security. 2009 ; Vol. 28, No. 8. pp. 795-802.
@article{2202ec488bb14f23a26dd8dbb6d4eb1c,
title = "Classification of web robots: An empirical study based on over one billion requests",
keywords = "Web robot characterization, Web robot classification, Web robot detection, Web security, Web usage mining",
author = "Junsup Lee and Sungdeok Cha and Dongkun Lee and Hyungkyu Lee",
year = "2009",
month = "11",
day = "1",
doi = "10.1016/j.cose.2009.05.004",
language = "English",
volume = "28",
pages = "795--802",
journal = "Computers and Security",
issn = "0167-4048",
publisher = "Elsevier Limited",
number = "8",

}

TY - JOUR

T1 - Classification of web robots

T2 - An empirical study based on over one billion requests

AU - Lee, Junsup

AU - Cha, Sungdeok

AU - Lee, Dongkun

AU - Lee, Hyungkyu

PY - 2009/11/1

Y1 - 2009/11/1

KW - Web robot characterization

KW - Web robot classification

KW - Web robot detection

KW - Web security

KW - Web usage mining

UR - http://www.scopus.com/inward/record.url?scp=71849091131&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=71849091131&partnerID=8YFLogxK

U2 - 10.1016/j.cose.2009.05.004

DO - 10.1016/j.cose.2009.05.004

M3 - Article

AN - SCOPUS:71849091131

VL - 28

SP - 795

EP - 802

JO - Computers and Security

JF - Computers and Security

SN - 0167-4048

IS - 8

ER -