Workshop: Week 3
================

Question 1: User model for prec@k
=================================

In the lecture, we mentioned that one way of deriving effectiveness metrics is by modelling user behaviour and the expected utility arising from a ranking, given this behaviour. We also provided a user model for average precision, and an expected utility expression arising from this model.

Describe a user model for precision at depth $k$. For the sake of this exercise, the model should be one in which the user ultimately only looks at a single document. (There are other possible models.) Given this model, express precision at depth $k$ using expected utility (that is, as an expectation E[] over some random variable).

>> UPDATE March 20th

The user model should specify:

1. A set of user actions, leading up to a stopping point, which can be expressed as a random sequence (i.e. we can place probabilities upon individual decisions in the sequence and their outcomes).

2. The satisfaction or utility the user gains at the completion of any given sequence of actions. The satisfaction should be based upon what the user sees (relevant or irrelevant documents). Consider not just the benefit the user receives, but the effort expended to achieve this benefit.

These should be specified so that the metric being modelled expresses the average satisfaction were the sequence of actions to be randomly reperformed an infinite number of times. (This is [an interpretation of] what an expectation is.)

[Note that, as a student has pointed out, the expectation expression given for MAP in the lecture is not really a valid probability expression. But p@k is much easier to create a valid expression for.]

So, based upon the hint in the instructions, we can start building our model for p@k by saying:

"The user picks a random document. The document is picked from SOMETHING SOMETHING set of documents. The probability of selecting a given document in this set is SOMETHING SOMETHING.
The user views this document, and then stops. The user receives SOMETHING satisfaction if the document is SOMETHING, and SOMETHING satisfaction if the document is SOMETHING."

If you can't see the E[] expression that emerges from this, don't worry (after all, I got the one in the lecture wrong!)

>> END UPDATE

Question 2: Designing a metric on a user model
==============================================

Consider the following user model: the user browses down the ranking until they find the first relevant document. Their satisfaction is the inverse of the number of documents they have to look at: if the first document is relevant, they are 100% satisfied; if the second, they are 50% satisfied; if the third, they are (100/3)% satisfied; and so forth. If they reach depth $k$ without seeing a relevant document, they are 0% satisfied.

Write a mathematical function that describes this metric (similar to equation [3] for p@k in the lecture notes). (This metric is known as Reciprocal Rank.)

Question 3: Agglomerative clustering
====================================

Implement agglomerative clustering by the "Cluster comparison by most similar" algorithm described in the lecture. Do this on top of document vectors with length-normalized TF*IDF scores (you'll probably want to reuse code and/or indexes from either Workshop 1 or Workshop 2 for this).

>> UPDATE March 20th

Some hints for implementation:

1. First, a correction to the algorithm in the lecture notes. It incorrectly implies that you should sometimes change the ordering of the nearest-neighbour pairs when you update them in the nearest-neighbour record, P. This is wrong, and will break the algorithm (why this is so should become clearer with the following hints).
Instead, the full rule should look like this:

   For < c_a, c_b > in P:
       If c_a = c_1 or c_a = c_2:
           Change c_a to c_j
       Else If c_b = c_1 or c_b = c_2:
           Change c_b to c_j

In short, change every occurrence of either c_1 or c_2 in P to c_j (the newly created cluster), but _leave them in the same order_.

2. Now, it is crucial when you're recording the "nearest neighbours" that you don't create any loops in your nearest-neighbour links; otherwise, you'll end up with a disconnected graph, and the agglomeration algorithm will fail to fully agglomerate all the documents into the hierarchical cluster. To avoid this, create a canonical ordering between your nodes, and when creating a nearest-neighbour link < d_1, d_2 >, enforce that d_1 must be earlier in your canonical ordering than d_2. If you simply number the documents "1, 2, 3, ..., N", this means that for any document $i$, you should consider only documents $j > i$ as the target for the nearest-neighbour link. For this reason, it would be better to speak of the "nearest subsequent neighbour", and to regard the links as directed.

3. If it helps to visualize this, think of it as creating just the upper (or right) triangle of the document--document similarity matrix. (We don't actually need to create this matrix, though you can if you find that easier.)

4. And that's it! If you do this, the algorithm will just work. (I'm still disputing this point with a student, but I'm pretty sure I'm right.)

5. There are other things you can do to make the coding easier. For instance, you can place the nearest-neighbour pairs in a heap structure (there's a standard Python heap implementation called heapq), and let it take care of finding the next smallest item for you. (I owe this suggestion to Andrew Bennett.)

>> END UPDATE

Run your cluster code on the LYRL dataset we've been using.
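Putting the hints above together, here is one minimal sketch in Python of the agglomeration loop. It assumes document vectors are plain dicts mapping terms to length-normalized TF*IDF weights (so cosine similarity is just a dot product), and it takes "most similar" to mean single-link cluster similarity. Rather than rewriting P in place, it uses heapq with lazy deletion of stale pairs, which is one equivalent way to realize hints 1 and 5; all names here (single_link_agglomerate, vectors) are illustrative, not from the lecture notes.

```python
import heapq

def single_link_agglomerate(vectors):
    """Agglomerative clustering, comparing clusters by their most
    similar members (single link).  `vectors` is a list of sparse
    term -> weight dicts, assumed length-normalized.  Returns the
    merge history as (left, right, new_id) triples; ids 0..n-1 are
    the input documents, ids >= n are merged clusters."""
    def dot(u, v):
        if len(u) > len(v):
            u, v = v, u
        return sum(w * v.get(t, 0.0) for t, w in u.items())

    n = len(vectors)
    active = set(range(n))
    sim = {}                        # (a, b) with a < b -> similarity
    heap = []                       # max-heap via negated similarities
    for i in range(n):              # step 1: O(n^2) pairwise similarities
        for j in range(i + 1, n):   # only "subsequent" neighbours, j > i
            s = dot(vectors[i], vectors[j])
            sim[(i, j)] = s
            heapq.heappush(heap, (-s, i, j))

    merges = []
    next_id = n
    while len(active) > 1:
        neg, a, b = heapq.heappop(heap)
        if a not in active or b not in active:
            continue                # stale pair: a member was already merged
        active.discard(a)
        active.discard(b)
        c = next_id
        next_id += 1
        # single link: sim(c, x) = max(sim(a, x), sim(b, x))
        for x in active:
            s = max(sim[tuple(sorted((a, x)))], sim[tuple(sorted((b, x)))])
            sim[(x, c)] = s         # x < c always, preserving the ordering
            heapq.heappush(heap, (-s, x, c))
        active.add(c)
        merges.append((a, b, c))
    return merges
```

Note that each newly created cluster gets an id larger than every existing id, so the canonical ordering of hint 2 is preserved automatically: every stored pair stays in (smaller, larger) order, exactly the upper triangle of hint 3.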
To begin with (and for testing your code while you're writing it), only take the first 100 documents from the dataset (remember, step 1 of the algorithm alone is O(n^2)!). Then iteratively double the dataset size (to 200, then 400, then 800, etc.). Stop when the total running time exceeds 5 minutes. Record the running time for each doubling. Does your implementation seem to be displaying O(n^2) behaviour?

Question 4: Tree balance
========================

The hierarchical cluster created in Question 3 is in fact a binary tree of clusters. Is this binary tree necessarily balanced? Why or why not?

Calculate the depth (that is, the length of the longest branch) of the binary tree created from the largest set of documents you were able to process in 5 minutes in Question 3.

Question 5: Speeding up computation
===================================

The "Cluster comparison by most similar" algorithm implemented in Question 3 is O(n^2) because of line one of the algorithm (calculating the pairwise distance between every pair of documents). In fact, we don't need all the pairwise distances; we just need to know the nearest neighbour for each document (to create ${\cal P}$ in line 2 of the algorithm).

How might we speed this up if we already had an inverted index built? (Don't consider the cost of building the inverted index!) What would the complexity of Steps 1 and 2 of the algorithm be with an inverted index built, _if_ we assume that each document has $t$ terms, and each term occurs in a fraction $d$ of the documents in the collection? Critique these last two assumptions. Under what assumptions about the values of $t$ and $d$ is using the inverted index going to give lower complexity than simply comparing each document pairwise?

If we were willing to allow approximations to nearest neighbours, how might we speed up the computation with the inverted index further?

(NOTE: you don't need to actually implement this, though feel free to if you're keen.)
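To make the idea concrete (implementation remains optional), here is a hedged sketch of finding each document's nearest subsequent neighbour through an inverted index: for each document, dot products are accumulated only against documents that share at least one term with it, rather than against all N documents. The representation (sparse term -> length-normalized TF*IDF weight dicts) and the function name are assumptions for illustration, not part of the question.

```python
def nearest_subsequent_neighbours(vectors):
    """Find each document's most similar *subsequent* document (j > i)
    via an inverted index.  `vectors` is a list of term -> weight dicts,
    assumed length-normalized so the dot product is cosine similarity.
    Returns a dict mapping i -> (similarity, j)."""
    # Build the inverted index: term -> list of (doc_id, weight) postings.
    # (The question says not to count this cost.)
    index = {}
    for doc_id, vec in enumerate(vectors):
        for term, w in vec.items():
            index.setdefault(term, []).append((doc_id, w))

    nearest = {}
    for i, vec in enumerate(vectors):
        scores = {}                          # accumulated dot products
        for term, w in vec.items():          # t terms per document...
            for j, wj in index[term]:        # ...each in d*N postings
                if j > i:                    # "subsequent" neighbours only
                    scores[j] = scores.get(j, 0.0) + w * wj
        if scores:
            j = max(scores, key=scores.get)
            nearest[i] = (scores[j], j)
    return nearest
```

Under the question's assumptions, the inner work per document is roughly t * d * N posting entries rather than a full O(N) pass over every other document's t terms, which is where the potential saving (and the dependence on t and d) comes from.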
Question 6: Representing clusters
=================================

You've implemented a clustering algorithm in Question 3; but how should we represent the results? We said in the lecture that a common representation is by the most frequent (or most heavily-weighted) terms. These can be taken from the mean pseudo-document of the cluster. A simple way to define the mean pseudo-document is simply by averaging the document vectors of all the documents in the cluster.

Implement this document-vector averaging. The input should be a list of document vectors; the output, the average document vector (the "mean document"). Then, represent a cluster by its 10 highest-weighted terms (by averaged TF*IDF).

We said in Question 4 that the hierarchical cluster forms a binary tree. All the nodes at a certain level form a complete partitioning of the tree. (So do all the nodes at different levels, provided no node is a child of another node, and we don't leave a tree path that doesn't end in some selected node.) Print out the cluster representations for all the nodes at the zeroth level (this is just a single cluster holding the entire corpus); then at the first, second, third, fourth, and fifth levels (for partitionings into 2, 4, 8, 16, and 32 clusters).

Do the cluster representations make sense to you? Does the interpretability of the clusters increase or decrease with depth?
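The averaging step can be sketched as follows, assuming (as elsewhere in these hints) that document vectors are sparse term -> TF*IDF weight dicts, with absent terms treated as weight zero; the function names are illustrative only.

```python
def mean_document(vectors):
    """Average a list of sparse document vectors (term -> weight dicts)
    into a single mean pseudo-document.  Terms missing from a document
    contribute zero, but the divisor is always the number of documents."""
    total = {}
    for vec in vectors:
        for term, w in vec.items():
            total[term] = total.get(term, 0.0) + w
    n = len(vectors)
    return {term: w / n for term, w in total.items()}

def top_terms(mean_vec, k=10):
    """The k highest-weighted terms of a mean vector: one simple
    cluster representation."""
    return sorted(mean_vec, key=mean_vec.get, reverse=True)[:k]
```

A cluster's representation is then just `top_terms(mean_document(cluster_docs))`, applied to each node selected from the tree.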