Title: Personalized Search
File Size: 2.6 MB
Total Pages: 107
Table of Contents
List of Figures
List of Tables
List of Acronyms
	Problem Formulation
	Digital Libraries
		Invenio Digital Library
	Information Retrieval
		Inverted Index
		Retrieval and Ranking Methods
	Recommendation Systems
		Collaborative Filtering
		Content-Based Filtering
		Trust-enhanced recommendation systems
		Benefits of Recommendation Systems
		User Relevance Feedback (URF)
	Evaluation Methods
		Prediction Accuracy
		Multivariate Testing
Search on CERN Document Server
		CDS - CERN Document Server
	State of the Art
		CDS Search Engine
		Related Work
	Analysis of needs
		User Satisfaction Survey
		User studies
		Interviewing Librarians at CERN
		Lessons learned
	Proposed Prototype for CDS
Obelix Design
		List of requirements from the CDS team
		List of requirements to make Obelix generic
	Integrated vs Standalone
	Collecting data
	Recommendation Algorithm
	Obelix Architecture
		Building blocks
		Key/Value Store
		Obelix REST-API
		High-performance setup
		Programming language
		Packaging and hosting
	Integration with an IR system
	Offline experiments
		Available procedures
		Evaluating the current Invenio Search Engine
		Evaluating prediction accuracy of Obelix
	Online experiments
		Evaluating Click Position
		Evaluating effectiveness search time
	Future Work
Invenio modules
Page 1

Personalized Search

Fredrik Nygård Carlsen

Submission date: August 2015

Supervisor: Trond Aalberg, IDI

Norwegian University of Science and Technology
Department of Computer Science

Page 53


to solve this issue by default is to show the results grouped by collections,
which means that for every search the results are shown in groups of ten
(the default). For instance, if a user searches for Higgs, the user may get
hits from both Articles and Theses, which means that 20 results are shown,
the top ten for each collection. It would be interesting to take the access
frequency of an item into account in the ranking.
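
The grouped display described above can be sketched in a few lines of Python. This is only an illustration, not the CDS implementation: the function name `group_by_collection`, the dictionary layout of a hit, and the sample titles are all hypothetical, and the hits are assumed to arrive already ranked.

```python
from collections import defaultdict


def group_by_collection(hits, per_collection=10):
    """Group ranked hits by collection, keeping the first `per_collection` of each."""
    groups = defaultdict(list)
    for hit in hits:  # assumption: hits are already in ranked order
        bucket = groups[hit["collection"]]
        if len(bucket) < per_collection:
            bucket.append(hit["title"])
    return dict(groups)


# Hypothetical result set for the query "Higgs": 15 articles and 12 theses.
hits = (
    [{"collection": "Articles", "title": f"Higgs article {i}"} for i in range(15)]
    + [{"collection": "Theses", "title": f"Higgs thesis {i}"} for i in range(12)]
)

grouped = group_by_collection(hits)
shown = sum(len(titles) for titles in grouped.values())
print(shown)  # 20 results: the top ten of each of the two collections
```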

3.4 Proposed Prototype for CDS

This prototype is based on the aforementioned lessons learned, and on one
observation from the d-rank project in particular. The d-rank project found
that the best search results came from a combination of latest-first
ordering and download history (popularity) [44].

An interesting piece of feedback from the librarians at CERN was that users
often search for items that their colleagues recommend or that are popular
in their field of study. This led to the idea of introducing collaborative
filtering by forming communities (clusters of similar users). The idea of
forming communities is based on the assumption that similar users have a
common interest in items (e.g. users working at LHC are more interested in
LHC items than users working in IT, whereas users working in IT are more
interested in items describing some computer science topic). This assumption
is not always true, and for that reason it is important to be able to disable
the personalized search easily when searching.

The concept of communities already exists at CERN. The users of CDS
are all part of the CERN community, and the users of INSPIRE HEP are
all part of the High Energy Particle Physics community. However, if the
communities are defined by group memberships, this limits the use cases for
users who are interested in items produced by another group. By using the
existing user groups, the search would not be personalized but rather
"groupalized".

By building communities based on user interactions and relating two users
who have used the same item, it is possible to achieve a local optimum for
popular downloads (item x is the most popular item in the community of user
y). A side effect of this is that users are placed in a community based not
on where they work but on how they use CDS. Their recommendations are
therefore not necessarily based on where they work but on what they search
for on CDS, making the search experience truly personalized.
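
As a rough illustration of relating users through shared items, the sketch below builds an undirected user-item graph from usage pairs and measures hop distances with a breadth-first search. It is not the Obelix implementation: the function names, the tuple-based node labels, and the sample users are hypothetical, and counting every user-item edge as one hop is an assumption (under it, two users who share an item are two hops apart).

```python
from collections import defaultdict, deque


def build_graph(usage):
    """Build an undirected user-item graph from (user, item) usage pairs."""
    graph = defaultdict(set)
    for user, item in usage:
        graph[("user", user)].add(("item", item))
        graph[("item", item)].add(("user", user))
    return graph


def distances_from(graph, start_user):
    """Breadth-first search: hop count from start_user to every reachable node."""
    start = ("user", start_user)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


# Hypothetical usage log: alice and bob share item1, bob and carol share item3.
usage = [
    ("alice", "item1"), ("alice", "item2"),
    ("bob", "item1"), ("bob", "item3"),
    ("carol", "item3"), ("carol", "item5"),
]

graph = build_graph(usage)
dist = distances_from(graph, "alice")
print(dist[("user", "bob")])    # 2: bob used an item that alice also used
print(dist[("user", "carol")])  # 4: carol is connected to alice only through bob
```

In this sketch the community of a user is simply everything reachable within a chosen number of hops, so it follows usage rather than organizational groups.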

Page 54


The relationships between users and items may be illustrated as in figure
3.3. The colors indicate how far away an item is from the USER. Item 1 and
Item 2 are used by the USER directly and are marked in green. Item 3 and
Item 4 are used by users who have also used Item 1 and Item 2, indicated
by a strong blue color. In comparison, Item 5 and Item 6 are marked in a
slightly weaker blue, indicating that Item 3 and Item 4 are recommended
more strongly than Item 5 and Item 6. When developing the prototype, a
temporary formula was developed to illustrate the scoring:

$$\text{score} = \frac{n}{\text{user}_d + 1} \qquad (3.1)$$

Where n is the number of users who have used the item, and user_d is the
sum of the distances from the USER to all the users who have used the item.
For example, user E has a distance of 2, while user F has a distance of 4. The
effect of this difference is that user E has more influence than user F. What
follows is that Item 4 will have a higher score than Item 5, because the two
users who have used Item 4 are closer to USER than the users who have used
Item 5.
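
The effect of distance on the score can be checked with a small sketch. It assumes the temporary formula reads score = n / (user_d + 1), which is a reading reconstructed from the definitions of n and user_d above; the function name `score` and the distance lists are hypothetical and are not taken from figure 3.3.

```python
def score(user_distances):
    """Temporary score: n / (user_d + 1).

    n is the number of users who have used the item and user_d is the sum
    of their distances from the USER (assumed reading of the formula).
    """
    n = len(user_distances)
    user_d = sum(user_distances)
    return n / (user_d + 1)


# Hypothetical items, each used by two users.
near_item = score([2, 2])  # both users are at distance 2 from the USER
far_item = score([4, 4])   # both users are at distance 4 from the USER

print(near_item)  # 0.4, i.e. 2 / (4 + 1)
print(far_item)   # 2 / (8 + 1), roughly 0.22
```

With the same number of users, the item whose users are closer to the USER receives the higher score, which matches the ordering of Item 4 and Item 5 described above.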

As table 3.4 shows, formula 3.1 reproduces the weighting of items as
described. The formula works for all items in the example except Item 9,
which should receive a higher score than Item 10. However, the final formula
and the parameters defining the score based on depth and the weight of each
usage are evaluated in the prediction accuracy experiment in section 5.2.4.

Page 106


– WebMessage permits communication between (possibly anonymous) end
users via web message boards, makes it possible to invite readers to join
groups, etc.

– WebSearch module handles user requests to search for certain words
or phrases in the database. Two types of searching can be performed: a
word search or a phrase search. The system allows for complex boolean
queries, regular expression searching, or a combined search of metadata,
references and full-text files in one go. Users can also browse the
existing index terms. If no direct match is found for the user-typed
query pattern, the system proposes alternative matches as search
guidance. The search indexes were designed to provide fast response
times for middle-sized data collections of up to 10^6 records.
The metadata corpus is organized into metadata collections that are
directly accessible through the browse function, similarly to the popular
concept of Web Directories. Orthogonal views on the document corpus
are enabled in the search interface via a concept of virtual collections:
for example, a document may be classified both according to its type
(e.g. preprint, book) and according to its Dewey decimal classification
number. Such a flexible organization of views allows for the creation of
easy navigation schemata for the end users.

– WebSession is a session and user management module that makes it
possible to differentiate between users. It is useful for personalization
of the interface and for services like personal baskets and alerts.

– WebStat is a configurable system for gathering statistics
about the health of the server, the usage of the system, as well as about
some particular system features.

– WebStyle is a library of design-related modules that defines the look
and feel of Invenio pages.

– WebSubmit is a comprehensive submission system allowing authorized
individuals (authors, secretaries and repository maintenance staff) to
submit individual documents into the system. The submission system
provides a flow-control mechanism that ensures data approval
by authorized units. In total there are several different exploitable

Page 107


submission schemas available, including automated full-text
document conversion from various textual and image formats. This
module also provides information extraction functionality, focusing
on bibliographic entities such as references, authors, keywords or other
implicit metadata.
