A comparison of techniques for sampling web pages

Baykan E, Henzinger M, Keller SF, de Castelberg S, Kinzler M. 2009. A comparison of techniques for sampling web pages. 26th International Symposium on Theoretical Aspects of Computer Science. STACS: Symposium on Theoretical Aspects of Computer Science, LIPIcs, vol. 3, 13–30.

Conference Paper | Published | English

Scopus indexed
Author
Baykan, Eda; Henzinger, Monika (ISTA); Keller, Stefan F.; de Castelberg, Sebastian; Kinzler, Markus
Series Title
LIPIcs
Abstract
As the World Wide Web grows rapidly, it is becoming increasingly challenging to gather representative information about it. Instead of crawling the web exhaustively, one has to resort to other techniques, such as sampling, to determine the properties of the web. A uniform random sample of the web would be useful for determining the percentage of web pages in a specific language, on a given topic, or in a given top-level domain. Unfortunately, no approach has been shown to sample web pages in an unbiased way. Three promising web sampling algorithms are based on random walks. Each has been evaluated individually, but because those evaluations used different data sets, their results cannot be compared. In this paper we compare the algorithms directly: we performed three random walks on the web under the same conditions and analyzed their outcomes in detail. We discuss the strengths and weaknesses of each algorithm and propose improvements based on the experimental results.
Date Published
2009-02-01
Proceedings Title
26th International Symposium on Theoretical Aspects of Computer Science
Publisher
Schloss Dagstuhl - Leibniz-Zentrum für Informatik
Volume
3
Page
13-30
Conference
STACS: Symposium on Theoretical Aspects of Computer Science
Conference Location
Freiburg, Germany
Conference Date
2009-02-26 – 2009-02-28

Cite this

Baykan E, Henzinger M, Keller SF, de Castelberg S, Kinzler M. A comparison of techniques for sampling web pages. In: 26th International Symposium on Theoretical Aspects of Computer Science. Vol 3. Schloss Dagstuhl - Leibniz-Zentrum für Informatik; 2009:13-30. doi:10.4230/LIPICS.STACS.2009.1809
Baykan, E., Henzinger, M., Keller, S. F., de Castelberg, S., & Kinzler, M. (2009). A comparison of techniques for sampling web pages. In 26th International Symposium on Theoretical Aspects of Computer Science (Vol. 3, pp. 13–30). Freiburg, Germany: Schloss Dagstuhl - Leibniz-Zentrum für Informatik. https://doi.org/10.4230/LIPICS.STACS.2009.1809
Baykan, Eda, Monika Henzinger, Stefan F. Keller, Sebastian de Castelberg, and Markus Kinzler. “A Comparison of Techniques for Sampling Web Pages.” In 26th International Symposium on Theoretical Aspects of Computer Science, 3:13–30. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2009. https://doi.org/10.4230/LIPICS.STACS.2009.1809.
E. Baykan, M. Henzinger, S. F. Keller, S. de Castelberg, and M. Kinzler, “A comparison of techniques for sampling web pages,” in 26th International Symposium on Theoretical Aspects of Computer Science, Freiburg, Germany, 2009, vol. 3, pp. 13–30.
Baykan, Eda, et al. “A Comparison of Techniques for Sampling Web Pages.” 26th International Symposium on Theoretical Aspects of Computer Science, vol. 3, Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2009, pp. 13–30, doi:10.4230/LIPICS.STACS.2009.1809.
All files available under the following license(s):
Copyright Statement:
This Item is protected by copyright and/or related rights. [...]

Link(s) to Main File(s)
Access Level
OA Open Access

Sources

arXiv 0902.1604
