
Data Mining: Concepts and Techniques, Third Edition

Instructor Support

Sample Exam and Homework Questions

Jiawei Han, Micheline Kamber, Jian Pei

The University of Illinois at Urbana-Champaign

Simon Fraser University

Version September 25, 2011

© Morgan Kaufmann, 2011

For instructors’ reference only.

Do not copy! Do not distribute!

CHAPTER 1. SAMPLE EXAM QUESTIONS FOR COURSE I

set to the formed clusters based on closeness. Do this m times in a
cross-validation fashion, and return the clustering that is tightest. Repeat
this for different k values and determine the best k.

Chapter 2

Sample Exam Questions for Course II

Enclosed are some sample midterm exam questions from the advanced-level data
mining course offered in Computer Science at UIUC: “UIUC CS 512: Data Mining:
Principles and Algorithms”. Since the course is more research oriented, in
many offerings there is only one midterm exam, but in some semesters there
are two. Each midterm exam lasted 90 minutes and was closed book, but
students were allowed to bring one sheet of notes that they had prepared
themselves. Due to limited time, we may not provide answers to all of these
questions. Where answers were written down in previous semesters, we
provide them.

Please note that many of the themes discussed in CS512 are beyond the
scope covered in the 3rd edition of this textbook.

2.1 Sample Exam Question Set: 2.1

2.1.1 Midterm Exam

1. [24] Stream data mining.

(a) [7] Briefly explain why lossy counting can be used to find approximate
counts of frequent single items in data streams.

• Divide the stream into buckets of size 1/ϵ, where ϵ is the error
threshold.

• Use the synopsis structure to accumulate the item counts.
• At each bucket boundary, decrease each item count by 1. If a
count ≤ 0, remove the entry from the synopsis structure.

• Given support threshold σ and stream length N, report all the items
whose count ≥ (σ − ϵ)N.
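
The steps listed in this answer can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the course's reference solution: the function name `lossy_count`, its parameters, and the use of a plain dictionary as the synopsis structure are all hypothetical choices. The approximation works because a stream of length N crosses at most ϵN bucket boundaries, so each stored count underestimates the true count by at most ϵN, and any item with true frequency at least σN still passes the (σ − ϵ)N test.

```python
import math


def lossy_count(stream, epsilon, sigma):
    """Approximate frequent-item counting over a stream (illustrative sketch).

    `epsilon` is the error threshold and `sigma` the support threshold.
    The names and the dict-based synopsis are assumptions of this sketch.
    """
    bucket_width = math.ceil(1 / epsilon)  # bucket size 1/epsilon
    counts = {}                            # the synopsis structure
    n = 0                                  # stream length seen so far (N)

    for item in stream:
        n += 1
        counts[item] = counts.get(item, 0) + 1

        # At a bucket boundary, decrease every count by 1 and drop
        # the entries whose count falls to 0 or below.
        if n % bucket_width == 0:
            for key in list(counts):
                counts[key] -= 1
                if counts[key] <= 0:
                    del counts[key]

    # Report the items whose stored count is at least (sigma - epsilon) * N.
    threshold = (sigma - epsilon) * n
    return {item: c for item, c in counts.items() if c >= threshold}
```

As a usage example, a stream of 100 items built by repeating the pattern `aaaaaabbbc` ten times, with `epsilon=0.1` and `sigma=0.2`, keeps `a` and `b` in the synopsis and drops `c`, whose count is decremented back to zero at every bucket boundary.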


CHAPTER 3. PH.D. QUALIFICATION EXAM QUESTIONS

(b) Can your method efficiently discover rather long sequential patterns (e.g.,
over 100 in length)? What are the performance bottlenecks? (1 point)

Part 3

Outline a method that mines long sequential patterns efficiently. (3 points)

Problem 2
Part 1

(a) What are the major differences between a homogeneous information
network and a heterogeneous information network? (1 point)

(b) Explain why a clustering algorithm that works well in homogeneous
networks may not work well in heterogeneous information networks. (1.5
points)

(c) Outline a clustering algorithm that works well in heterogeneous informa-
tion networks. (1.5 points)

Part 2

(a) Similarly, a classification algorithm that works well in homogeneous
networks may not work well in heterogeneous information networks. Explain
why, using an example. (1.5 points)

(b) Outline a classification algorithm that works well in heterogeneous infor-
mation networks. (1.5 points)

Part 3

(a) In a heterogeneous information network, an important mining task is to
discover certain hidden relationships between nodes in the network (such
the challenges for mining such hidden relationships? (1 point)

(b) Outline a method that may discover such hidden relationships effectively.
(2 points)

Problem 3
Part 1

(a) A cyber-physical network is a network that links physical sensors and
information entities together to form an integrated network. Suppose your
cyber-physical network is used for patient care. Describe what should be
mined in such a cyber-physical network. (1 point)


(b) A major challenge of cyber-physical networks is that sensor data may not
be reliable and may contain errors. Suppose sensors are not costly, but
decisions based on the obtained data could be critical. Outline one
method that may determine which pieces of data are more trustworthy
based on the noisy data obtained in such a network. (2 points)

Part 2

(a) Suppose your cyber-physical network is used for patient care. Design a
data cube that may summarize such cyber-physical network data in a
multidimensional space. (1.5 points)

(b) If new data is streamed into the system incrementally and dynamically,
outline a stream cube design and an aggregation method that may handle
such streaming data effectively. (2 points)

Part 3

(a) Since a cyber-physical network may contain sensitive data about people
and their movements, privacy and security could become a major concern
in mining such data. Take a patient-care cyber-physical network as an
example. Discuss what the privacy concerns could be in mining
cyber-physical network data. (1.5 points)

(b) Outline one method that may preserve privacy in mining cyber-physical
network data. (2 points)