Towards Scalable Personalization

THÈSE NO 8299 (2018)

ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE

Presented on 23 February 2018
at the School of Computer and Communication Sciences
Distributed Programming Laboratory
Doctoral Program in Computer and Communication Sciences

for the degree of Docteur ès Sciences

accepted on the proposal of the jury:

Switzerland
2018


4.2. CIP: Consumed Item Packs

Figure 4.15 – Topology and data structures for CIP-U and CIP-I (arrows denote the RDD dependencies).

collection of objects partitioned across a set of machines, which can be rebuilt if a partition is lost. In a Spark program, data is first read into an RDD; this RDD can then be turned into other RDDs through transformation operations such as map and filter, and results can be returned to the driver through actions such as collect. Spark also provides shared variables, namely broadcast variables and accumulators, for accessing or updating shared data across the worker nodes.
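
As a concrete illustration, the following minimal PySpark sketch exercises these primitives; the input path, record format, and item weights are illustrative assumptions and not part of the thesis code.

from pyspark import SparkContext

sc = SparkContext(appName="rdd-basics")

# Read raw consumption events into an RDD (assumed format: "user item timestamp").
events = sc.textFile("hdfs:///data/consumption_events.txt")

# Transformations such as map and filter are lazy: they only define new RDDs.
parsed = events.map(lambda line: line.split())
recent = parsed.filter(lambda rec: int(rec[2]) >= 1514764800)

# A broadcast variable ships read-only shared data to every executor once.
item_weights = sc.broadcast({"item42": 0.7, "item7": 1.3})
weighted = recent.map(lambda rec: (rec[0], item_weights.value.get(rec[1], 1.0)))

# collect() is an action: it triggers the computation and returns the result to the driver.
print(weighted.collect()[:5])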

B. Tailored Data Structures for CIPs

We now briefly describe the RDDs leveraged in the memory-based approaches (CIP-U and CIP-I), as shown in Figure 4.15 (the arrows between RDDs denote the sequential dependencies introduced by transformation operations), as well as those in the model-based approach (DEEPCIP), as shown in Figure 4.16.

RDDs for CIP-U. For CIP-U, we store the collected information in three primary RDDs. USERSRDD stores the user profiles. USERSIMRDD stores the hammock pairs between all pairs of users; the pairwise user similarities are computed using a transformation operation over this RDD. USERTOPKRDD stores the K most similar users for each user. During each update step in CIP-U, i.e., after Q consumption events, the new events are stored in a DELTAPROFILES RDD, which is broadcast to all executors using Spark's broadcast abstraction. The hammock pairs between users are then updated (in USERSIMRDD) and subsequently transformed into pairwise user similarities using Equation 4.13. Finally, CIP-U updates the top-K neighbors (USERTOPKRDD) based on the updated similarities.
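
A simplified sketch of one such CIP-U update step is shown below; the hammock-pair update and the similarity formula are placeholder stand-ins for the actual update rule and Equation 4.13, and all variable names are assumptions for illustration.

# Sketch of one CIP-U update step (illustrative, not the thesis implementation).
# usersimRDD:     ((user_a, user_b), hammock_pair_count)
# delta_profiles: {user_id: items consumed since the last update}, built after Q events

delta_bc = sc.broadcast(delta_profiles)   # ship the new events to every executor

def update_hammocks(entry):
    # Placeholder: count newly co-consumed items as additional hammock pairs.
    (user_a, user_b), count = entry
    new_a = set(delta_bc.value.get(user_a, []))
    new_b = set(delta_bc.value.get(user_b, []))
    return ((user_a, user_b), count + len(new_a & new_b))

usersimRDD = usersimRDD.map(update_hammocks)

# Stand-in for Equation 4.13: map hammock counts to a similarity in [0, 1).
similarities = usersimRDD.map(lambda e: (e[0], e[1] / (1.0 + e[1])))

# Rebuild the per-user top-K neighbor lists.
K = 10
usertopkRDD = (similarities
               .flatMap(lambda e: [(e[0][0], (e[0][1], e[1])),
                                   (e[0][1], (e[0][0], e[1]))])
               .groupByKey()
               .mapValues(lambda ns: sorted(ns, key=lambda t: -t[1])[:K]))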

RDDs for CIP-I. For CIP-I, we store the collected information in two primary RDDs. ITEMSIMRDD stores the score values between items; the pairwise item similarities are computed using a transformation operation over this RDD. ITEMTOPKRDD stores the K most similar items for each item, based on the updated similarities.

Figure 4.16 – Topology and data structures for DEEPCIP.

During each update step in CIP-I, the item scores in ITEMSIMRDD are updated to incorporate the received CIP using Algorithm 4, and the pairwise item similarities are then revised using Equation 4.14. CIP-I recomputes the top-K similar items and updates ITEMTOPKRDD at regular intervals.
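
A corresponding sketch of the CIP-I score update is given below; the co-occurrence increment is a simplified stand-in for Algorithm 4 and Equation 4.14, and the names are again illustrative.

# Sketch of one CIP-I update step (illustrative, not the thesis implementation).
# itemsimRDD: ((item_a, item_b), score)
# cip:        the newly received consumed item pack, e.g. ["i3", "i7", "i9"]

cip_bc = sc.broadcast(set(cip))

def update_score(entry):
    # Placeholder for Algorithm 4: reward item pairs that co-occur in the CIP.
    (item_a, item_b), score = entry
    if item_a in cip_bc.value and item_b in cip_bc.value:
        score += 1.0
    return ((item_a, item_b), score)

itemsimRDD = itemsimRDD.map(update_score)

# At regular intervals, ITEMTOPKRDD is rebuilt from the revised similarities,
# analogously to the per-user top-K computation sketched for CIP-U above.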

RDDs for DEEPCIP. We implement DEEPCIP using the DeepDist deep learning framework [50], which accelerates model training by providing asynchronous stochastic gradient descent (DOWNPOUR-SGD) for data stored on Spark.

DEEPCIP implements a standard master-workers parameter server model [48]. On the master node, the CIPSRDD stores the recent CIPs aggregated from the user transaction logs, preserving the consumption order. DEEPCIP trains on this RDD using DOWNPOUR-SGD. The skip-gram model is stored on the master node; the worker nodes fetch the model before processing each partition and send their gradient updates back to the master node. The master node performs the stochastic gradient descent step (Equation 2.8 in §2.5) asynchronously, using the updates sent by the worker nodes. Finally, DEEPCIP predicts the most similar items for a given user based on her most recent CIP.
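
The sketch below follows DeepDist's gradient/descent callback interface to illustrate this master-workers loop; the word2vec-based skip-gram model, the weight attribute names, and the input path are assumptions for illustration rather than the exact DEEPCIP code.

from pyspark import SparkContext
from deepdist import DeepDist
from gensim.models.word2vec import Word2Vec

sc = SparkContext(appName="deepcip-sketch")

# Each element of cipsRDD is one consumed item pack: an ordered list of item
# ids aggregated from the user transaction logs.
cipsRDD = sc.textFile("hdfs:///data/cips.txt").map(lambda line: line.split())

def gradient(model, cips):
    # Runs on the workers: train the fetched skip-gram model on one partition
    # of CIPs and return the resulting weight deltas.
    syn0, syn1 = model.syn0.copy(), model.syn1.copy()
    model.train(cips)
    return {"syn0": model.syn0 - syn0, "syn1": model.syn1 - syn1}

def descent(model, update):
    # Runs on the master: fold a worker's (possibly stale) update into the
    # shared model asynchronously, Downpour-SGD style.
    model.syn0 += update["syn0"]
    model.syn1 += update["syn1"]

with DeepDist(Word2Vec(cipsRDD.collect())) as dd:
    dd.train(cipsRDD, gradient, descent)
    # Recommend items close to a user's most recent CIP in the embedding space.
    print(dd.model.most_similar(positive=["i3", "i7", "i9"]))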

4.2.5 Evaluation

In this section, we report on the evaluation of the CIP-based algorithms, using real-world datasets.

Platform. For our experiments, we use two deployment modes of the Spark large-scale processing framework [172].

Standalone deployment. We launch a Spark standalone cluster on a high-performance server (Dell PowerEdge R930) with four Intel(R) Xeon(R) E7-4830 v3 processors (12 cores each, 30 MB cache, hyper-threading enabled) and 512 GB of RAM. We use this cluster to evaluate the effect of the number of RDD partitions on scalability. For the standalone deployment, we use 19 executors
