Creating a virtual library with HPSearch and Mops
Gerd Hoff
and Martin Mundhenk
The fast dissemination of new research results on the world-wide web
is a new challenge for search engines. In many research areas,
scientists make their newest results electronically
available on their web site, long before the results appear in conference proceedings or in journals.
Whereas a decade ago the state of the art in a research area could be determined
by reading conference proceedings and journals in the local library,
nowadays it is also necessary to find the newest related electronic publications on the web,
that is, to maintain a virtual library of not-yet-printed literature.
Traditional search engines are of little help for this task.
For example, they do not index PostScript documents, the electronic format of many preprints appearing on the web.
The few existing searchable indices for PostScript documents
either cover fields that are too large (all of computer science, for example)
to be really helpful, or they depend on some submission procedure which
delays the appearance of the documents on the web.
We present a new approach for constructing a virtual library of scientific papers
which is specialized in a relatively small research area
and makes it possible to find the latest documents.
-
In the first step, we want to find the places on the web
where we expect interesting documents to appear.
Unlike other approaches, we do not search
for web pages which contain certain keywords, but
we search for web pages which are created by scientists
who are active in the research area under consideration.
For personal virtual bookshelves,
this information can, for example, be edited by hand.
For a larger virtual library, we prefer an automated approach
and obtain the scientists' names from
computer science bibliographies on the web,
namely from Michael Ley's DBLP server
(http://dblp.uni-trier.de/).
This makes it possible to find the names of scientists who published at
certain specialized conferences or in specialized journals,
and therefore the names found can be seen as ``certified.''
Using these names, our HPSearch system (http://pranger.uni-trier.de/hp/)
searches for the scientists' Home Pages.
Locating these Home Pages is a difficult task
because there are no fixed rules for how such pages are constructed.
We determine about 500 characteristics
that control the search for the Home Pages.
Maintaining that information is a further primary task of HPSearch.
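The first step can be illustrated with a minimal Python sketch. It is not the HPSearch implementation: the record layout, the three example characteristics, and their weights are hypothetical stand-ins for the roughly 500 characteristics mentioned above, and the candidate pages are assumed to have been fetched already.

```python
from collections import namedtuple

# Hypothetical bibliography record; a real source such as DBLP has its own format.
Record = namedtuple("Record", ["authors", "venue"])


def certified_names(records, venues):
    """Collect the names of authors who published at one of the given venues."""
    names = set()
    for rec in records:
        if rec.venue in venues:
            names.update(rec.authors)
    return names


def _surname(name):
    """Assume the last whitespace-separated token is the surname."""
    return name.split()[-1].lower()


# Hypothetical characteristics: (description, predicate, weight).
CHARACTERISTICS = [
    ("surname in URL",
     lambda name, url, text: _surname(name) in url.lower(), 2.0),
    ("full name in page text",
     lambda name, url, text: name.lower() in text.lower(), 3.0),
    ("page mentions publications",
     lambda name, url, text: "publications" in text.lower(), 1.0),
]


def score_page(name, url, text):
    """Sum the weights of all characteristics that hold for this page."""
    return sum(w for _, pred, w in CHARACTERISTICS if pred(name, url, text))


def best_home_page(name, candidates):
    """Return the URL of the best-scoring candidate, or None.

    candidates is a list of (url, page_text) pairs.
    """
    scored = [(score_page(name, url, text), url) for url, text in candidates]
    if not scored:
        return None
    score, url = max(scored)
    return url if score > 0 else None
```

Keeping the characteristics in a plain list makes them easy to add, remove, or reweight, which matters if maintaining them is an ongoing task.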
-
In the second step,
a virtual library is created from the
scientific papers found in the
vicinity of the scientists' Home Pages.
This is performed by our search engine
Mops (http://mops.uni-trier.de/).
It creates an index of these papers
and makes it accessible on a web server.
Whereas the search index is administered on the Mops server,
the scientific papers from which it is extracted remain on
the servers of their owners.
In this way, a virtual and distributed library is generated.
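The separation between a central index and remotely stored papers can be sketched as a small inverted index that records only URLs. This is an illustrative assumption about the data structure, not a description of how Mops is built; the class name and the tokenization rule are hypothetical, and text extraction from the papers is assumed to have happened beforehand.

```python
import re
from collections import defaultdict


class VirtualLibraryIndex:
    """Inverted index mapping terms to the URLs of papers that contain them.

    Only the index is stored here; the papers stay on their owners' servers.
    """

    def __init__(self):
        self._postings = defaultdict(set)

    @staticmethod
    def _terms(text):
        # Simplistic tokenization: lowercase runs of letters and digits.
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, url, text):
        """Index the extracted text of one paper under its URL."""
        for term in self._terms(text):
            self._postings[term].add(url)

    def search(self, query):
        """Return the URLs of papers containing every term of the query."""
        terms = self._terms(query)
        if not terms:
            return set()
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result
```

A search returns links rather than documents, so a reader always retrieves the current version of a paper from its original location.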
In this project, we developed and implemented
HPSearch (http://pranger.uni-trier.de/hp/)
and Mops (http://mops.uni-trier.de/).
We tested our approach by creating two example indices.
One index covers complexity theory, and the other covers
BDDs
(binary decision diagrams, a data structure for VLSI design
and verification). Both indices are well used in their respective research communities.
The whole software runs on standard PCs.
We conclude that such focused crawling is very effective for building high-quality
virtual libraries on ordinary desktop hardware.
A more detailed description of the system can be found at
http://www.informatik.uni-trier.de/~mundhenk/virt-lib/.