A comparison of techniques for sampling web pages
Baykan E, Henzinger M, Keller SF, de Castelberg S, Kinzler M. 2009. A comparison of techniques for sampling web pages. 26th International Symposium on Theoretical Aspects of Computer Science. STACS: Symposium on Theoretical Aspects of Computer Science, LIPIcs, vol. 3, 13–30.
DOI: https://doi.org/10.4230/LIPIcs.STACS.2009.1809 (published version)
Conference Paper | Published | English
Scopus indexed
Author
Baykan, Eda;
Henzinger, Monika (ISTA);
Keller, Stefan F.;
de Castelberg, Sebastian;
Kinzler, Markus
Series Title
LIPIcs
Abstract
As the World Wide Web is growing rapidly, it is getting increasingly challenging to gather representative information about it. Instead of crawling the web exhaustively one has to resort to other techniques like sampling to determine the properties of the web. A uniform random sample of the web would be useful to determine the percentage of web pages in a specific language, on a topic or in a top level domain. Unfortunately, no approach has been shown to sample the web pages in an unbiased way. Three promising web sampling algorithms are based on random walks. They each have been evaluated individually, but making a comparison on different data sets is not possible. We directly compare these algorithms in this paper. We performed three random walks on the web under the same conditions and analyzed their outcomes in detail. We discuss the strengths and the weaknesses of each algorithm and propose improvements based on experimental results.
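To make the idea of sampling by random walk concrete, here is a minimal Python sketch. The abstract does not name the three algorithms it compares, so this is not one of them and not the authors' method. It is a generic Metropolis-Hastings walk on a small, hypothetical undirected graph (`TOY_GRAPH`, invented for illustration), showing how a walk can be corrected so that high-degree pages are not over-sampled. The function names and the step count are assumptions. The real web is directed and cannot be enumerated, which is part of why unbiased sampling is hard.

```python
import random
from collections import Counter

# Hypothetical toy graph (undirected adjacency lists); not data from the paper.
TOY_GRAPH = {
    "a": ["b", "c", "d"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["a", "c", "e"],
    "e": ["d"],
}


def metropolis_hastings_walk(graph, start, steps, rng):
    """Return the nodes visited by a degree-corrected random walk.

    A plain random walk on an undirected graph visits nodes in proportion
    to their degree. Accepting a proposed move with probability
    min(1, deg(current) / deg(candidate)) removes that bias, so the walk's
    stationary distribution is uniform over nodes.
    """
    current = start
    visited = []
    for _ in range(steps):
        candidate = rng.choice(graph[current])
        accept = min(1.0, len(graph[current]) / len(graph[candidate]))
        if rng.random() < accept:
            current = candidate  # move; otherwise stay where we are
        visited.append(current)
    return visited


def estimate_fraction(samples, predicate):
    """Estimate the share of nodes satisfying `predicate` from walk samples."""
    return sum(1 for s in samples if predicate(s)) / len(samples)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    samples = metropolis_hastings_walk(TOY_GRAPH, "a", 100_000, rng)
    print(Counter(samples))
    # Share of "pages" with some property, here: the node name is a vowel.
    print(estimate_fraction(samples, lambda node: node in {"a", "e"}))
```

On this toy graph every node should be visited roughly equally often, so a property held by two of the five nodes should be estimated at close to 40 percent. The same estimator applied to a biased walk would drift toward whatever the well-connected nodes have in common.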
Publishing Year
2009
Date Published
2009-02-01
Proceedings Title
26th International Symposium on Theoretical Aspects of Computer Science
Publisher
Schloss Dagstuhl - Leibniz-Zentrum für Informatik
Volume
3
Page
13-30
Conference
STACS: Symposium on Theoretical Aspects of Computer Science
Conference Location
Freiburg, Germany
Conference Date
2009-02-26 – 2009-02-28
Cite this
Baykan E, Henzinger M, Keller SF, de Castelberg S, Kinzler M. A comparison of techniques for sampling web pages. In: 26th International Symposium on Theoretical Aspects of Computer Science. Vol 3. Schloss Dagstuhl - Leibniz-Zentrum für Informatik; 2009:13-30. doi:10.4230/LIPICS.STACS.2009.1809
Baykan, E., Henzinger, M., Keller, S. F., de Castelberg, S., & Kinzler, M. (2009). A comparison of techniques for sampling web pages. In 26th International Symposium on Theoretical Aspects of Computer Science (Vol. 3, pp. 13–30). Freiburg, Germany: Schloss Dagstuhl - Leibniz-Zentrum für Informatik. https://doi.org/10.4230/LIPICS.STACS.2009.1809
Baykan, Eda, Monika Henzinger, Stefan F. Keller, Sebastian de Castelberg, and Markus Kinzler. “A Comparison of Techniques for Sampling Web Pages.” In 26th International Symposium on Theoretical Aspects of Computer Science, 3:13–30. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2009. https://doi.org/10.4230/LIPICS.STACS.2009.1809.
E. Baykan, M. Henzinger, S. F. Keller, S. de Castelberg, and M. Kinzler, “A comparison of techniques for sampling web pages,” in 26th International Symposium on Theoretical Aspects of Computer Science, Freiburg, Germany, 2009, vol. 3, pp. 13–30.
Baykan, Eda, et al. “A Comparison of Techniques for Sampling Web Pages.” 26th International Symposium on Theoretical Aspects of Computer Science, vol. 3, Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2009, pp. 13–30, doi:10.4230/LIPICS.STACS.2009.1809.
All files available under the following license(s):
Copyright Statement:
This Item is protected by copyright and/or related rights. [...]
Access Level
Open Access
Sources
arXiv 0902.1604