Workshop: Week 3
================

Question 1: User model for prec@k
=================================

In the lecture, we mentioned that one way of deriving effectiveness metrics is by modelling user behaviour and the expected utility arising from a ranking, given this behaviour. We also provided a user model for average precision, and an expected utility expression arising from this model.

Describe a user model for precision at depth $k$. For the sake of this exercise, the model should be one in which the user ultimately only looks at a single document. (There are other possible models.) Given this model, express precision at depth $k$ using expected utility (that is, as an expectation E[] over some random variable).

>> UPDATE March 20th

The user model should specify:

1. A set of user actions, leading up to a stopping point, which can be expressed as a random sequence (i.e. we can place probabilities upon individual decisions in the sequence and their outcomes).

2. The satisfaction or utility the user gains at the completion of any given sequence of actions. The satisfaction should be based upon what the user sees (relevant or irrelevant documents). Consider not just the benefit the user receives, but the effort expended to achieve this benefit.

These should be specified so that the metric being modelled expresses the average satisfaction were the sequence of actions to be randomly reperformed an infinite number of times. (This is [an interpretation of] what an expectation is.)

[Note that, as a student has pointed out, the expectation expression given for MAP in the lecture is not really a valid probability expression. But p@k is much easier to create a valid expression for.]

So, based upon the hint in the instructions, we can start building our model for p@k by saying:

"The user picks a random document. The document is picked from SOMETHING SOMETHING set of documents. The probability of selecting a given document in this set is SOMETHING SOMETHING.
The user views this document, and then stops. The user receives SOMETHING satisfaction if the document is SOMETHING, and SOMETHING satisfaction if the document is SOMETHING."

If you can't see the E[] expression that emerges from this, don't worry (after all, I got the one in the lecture wrong!)

>> END UPDATE

Question 2: Designing a metric on a user model
==============================================

Consider the following user model: the user browses down the ranking until they find the first relevant document. Their satisfaction is the inverse of the number of documents they have to look at: if the first document is relevant, they are 100% satisfied; if the second, they are 50% satisfied; if the third, they are (100/3)% satisfied; and so forth. If they reach depth $k$ without seeing a relevant document, they are 0% satisfied.

Write a mathematical function that describes this metric (similar to equation [3] for p@k in the lecture notes). (This metric is known as Reciprocal Rank.)

Question 3: Agglomerative clustering
====================================

Implement agglomerative clustering by the "Cluster comparison by most similar" algorithm described in the lecture. Do this on top of document vectors with length-normalized TF*IDF scores (you'll probably want to reuse code and/or indexes from either Workshop 1 or Workshop 2 for this).

>> UPDATE March 20th

Some hints for implementation:

1. First, a correction to the algorithm in the lecture notes. It incorrectly implies that you should sometimes change the ordering of the nearest-neighbour pairs when you update them in the nearest-neighbour record, P. This is wrong, and will break the algorithm (why this is so should become clearer with the following hints).
Instead, the full rule should look like this:

   For < c_a, c_b > in P:
       If c_a = c_1 or c_a = c_2:
           Change c_a to c_j
       Else If c_b = c_1 or c_b = c_2:
           Change c_b to c_j

In short, change every occurrence of either c_1 or c_2 in P to c_j (the newly created cluster), but _leave them in the same order_.

2. Now, it is crucial when you're recording the "nearest neighbours" that you don't create any loops in your nearest-neighbour links; otherwise, you'll end up with a disconnected graph, and the agglomeration algorithm will fail to fully agglomerate all the documents into the hierarchical cluster. To avoid this, create a canonical ordering between your nodes, and when creating a nearest-neighbour link < d_1, d_2 >, enforce that d_1 must be earlier in your canonical ordering than d_2. If you simply number the documents "1, 2, 3, ..., N", this means that for any document $i$, you should consider only documents $j > i$ as the target for the nearest-neighbour link. For this reason, it would be better to speak of the "nearest subsequent neighbour", and to regard the links as directed.

3. If it helps to visualize this, think of it as creating just the upper (or right) triangle of the document--document similarity matrix. (We don't actually need to create this matrix, though you can if you find that easier.)

4. And that's it! If you do this, the algorithm will just work. (I'm still disputing this point with a student, but I'm pretty sure I'm right.)

5. There are other things you can do to make the coding easier. For instance, you can place the nearest-neighbour pairs in a heap structure (there's a standard Python heap implementation called heapq), and let it take care of finding the next smallest item for you. (I owe this suggestion to Andrew Bennett.)

>> END UPDATE

Run your cluster code on the LYRL dataset we've been using.
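Putting the hints above together, here is one minimal sketch in Python of the agglomeration loop. It assumes document vectors are plain dicts mapping terms to length-normalized TF*IDF weights (so cosine similarity is just a dot product), and it takes "most similar" to mean single-link cluster similarity. Rather than rewriting P in place, it uses heapq with lazy deletion of stale pairs, which is one equivalent way to realize hints 1 and 5; all names here (single_link_agglomerate, vectors) are illustrative, not from the lecture notes.

```python
import heapq

def single_link_agglomerate(vectors):
    """Agglomerative clustering, comparing clusters by their most
    similar members (single link).  `vectors` is a list of sparse
    term -> weight dicts, assumed length-normalized.  Returns the
    merge history as (left, right, new_id) triples; ids 0..n-1 are
    the input documents, ids >= n are merged clusters."""
    def dot(u, v):
        if len(u) > len(v):
            u, v = v, u
        return sum(w * v.get(t, 0.0) for t, w in u.items())

    n = len(vectors)
    active = set(range(n))
    sim = {}                        # (a, b) with a < b -> similarity
    heap = []                       # max-heap via negated similarities
    for i in range(n):              # step 1: O(n^2) pairwise similarities
        for j in range(i + 1, n):   # only "subsequent" neighbours, j > i
            s = dot(vectors[i], vectors[j])
            sim[(i, j)] = s
            heapq.heappush(heap, (-s, i, j))

    merges = []
    next_id = n
    while len(active) > 1:
        neg, a, b = heapq.heappop(heap)
        if a not in active or b not in active:
            continue                # stale pair: a member was already merged
        active.discard(a)
        active.discard(b)
        c = next_id
        next_id += 1
        # single link: sim(c, x) = max(sim(a, x), sim(b, x))
        for x in active:
            s = max(sim[tuple(sorted((a, x)))], sim[tuple(sorted((b, x)))])
            sim[(x, c)] = s         # x < c always, preserving the ordering
            heapq.heappush(heap, (-s, x, c))
        active.add(c)
        merges.append((a, b, c))
    return merges
```

Note that each newly created cluster gets an id larger than every existing id, so the canonical ordering of hint 2 is preserved automatically: every stored pair stays in (smaller, larger) order, exactly the upper triangle of hint 3.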
To begin with (and for testing your code while you're writing it), only take the first 100 documents from the dataset (remember, step 1 of the algorithm alone is O(n^2)!). Then iteratively double the dataset size (to 200, then 400, then 800, etc.). Stop when the total running time exceeds 5 minutes. Record the running time for each doubling. Does your implementation seem to be displaying O(n^2) behaviour?

Question 4: Tree balance
========================

The hierarchical cluster created in Question 3 is in fact a binary tree of clusters. Is this binary tree necessarily balanced? Why or why not?

Calculate the depth (that is, the length of the longest branch) of the binary tree created from the largest set of documents you were able to process in 5 minutes in Question 3.

Question 5: Speeding up computation
===================================

The "Cluster comparison by most similar" algorithm implemented in Question 3 is O(n^2) because of line one of the algorithm (calculating the pairwise distance between every pair of documents). In fact, we don't need all the pairwise distances; we just need to know the nearest neighbour for each document (to create ${\cal P}$ in line 2 of the algorithm).

How might we speed this up if we already had an inverted index built? (Don't consider the cost of building the inverted index!) What would the complexity of Steps 1 and 2 of the algorithm be with an inverted index built, _if_ we assume that each document has $t$ terms, and each term occurs in a fraction $d$ of the documents in the collection? Critique these last two assumptions. Under what assumptions about the values of $t$ and $d$ is using the inverted index going to give lower complexity than simply comparing each document pairwise?

If we were willing to allow approximations to nearest neighbours, how might we speed up the computation with the inverted index further?

(NOTE: you don't need to actually implement this, though feel free to if you're keen.)
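To make the idea concrete (implementation remains optional), here is a hedged sketch of finding each document's nearest subsequent neighbour through an inverted index: for each document, dot products are accumulated only against documents that share at least one term with it, rather than against all N documents. The representation (sparse term -> length-normalized TF*IDF weight dicts) and the function name are assumptions for illustration, not part of the question.

```python
def nearest_subsequent_neighbours(vectors):
    """Find each document's most similar *subsequent* document (j > i)
    via an inverted index.  `vectors` is a list of term -> weight dicts,
    assumed length-normalized so the dot product is cosine similarity.
    Returns a dict mapping i -> (similarity, j)."""
    # Build the inverted index: term -> list of (doc_id, weight) postings.
    # (The question says not to count this cost.)
    index = {}
    for doc_id, vec in enumerate(vectors):
        for term, w in vec.items():
            index.setdefault(term, []).append((doc_id, w))

    nearest = {}
    for i, vec in enumerate(vectors):
        scores = {}                          # accumulated dot products
        for term, w in vec.items():          # t terms per document...
            for j, wj in index[term]:        # ...each in d*N postings
                if j > i:                    # "subsequent" neighbours only
                    scores[j] = scores.get(j, 0.0) + w * wj
        if scores:
            j = max(scores, key=scores.get)
            nearest[i] = (scores[j], j)
    return nearest
```

Under the question's assumptions, the inner work per document is roughly t * d * N posting entries rather than a full O(N) pass over every other document's t terms, which is where the potential saving (and the dependence on t and d) comes from.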
Question 6: Representing clusters
=================================

You've implemented a clustering algorithm in Question 3; but how should we represent the results? We said in the lecture that a common representation is by the most frequent (or most heavily-weighted) terms. These can be taken from the mean pseudo-document of the cluster. A simple way to define the mean pseudo-document is simply by averaging the document vectors of all the documents in the cluster.

Implement this document-vector averaging. The input should be a list of document vectors; the output, the average document vector (the "mean document"). Then, represent a cluster by its 10 highest-weighted terms (by averaged TF*IDF).

We said in Question 4 that the hierarchical cluster forms a binary tree. All the nodes at a certain level form a complete partitioning of the tree. (So do all the nodes at different levels, provided no node is a child of another node, and we don't leave a tree path that doesn't end in some selected node.) Print out the cluster representations for all the nodes at the zeroth level (this is just a single cluster holding the entire corpus); then at the first, second, third, fourth, and fifth levels (for partitionings into 2, 4, 8, 16, and 32 clusters).

Do the cluster representations make sense to you? Does the interpretability of the clusters increase or decrease with depth?
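The averaging step can be sketched as follows, assuming (as elsewhere in these hints) that document vectors are sparse term -> TF*IDF weight dicts, with absent terms treated as weight zero; the function names are illustrative only.

```python
def mean_document(vectors):
    """Average a list of sparse document vectors (term -> weight dicts)
    into a single mean pseudo-document.  Terms missing from a document
    contribute zero, but the divisor is always the number of documents."""
    total = {}
    for vec in vectors:
        for term, w in vec.items():
            total[term] = total.get(term, 0.0) + w
    n = len(vectors)
    return {term: w / n for term, w in total.items()}

def top_terms(mean_vec, k=10):
    """The k highest-weighted terms of a mean vector: one simple
    cluster representation."""
    return sorted(mean_vec, key=mean_vec.get, reverse=True)[:k]
```

A cluster's representation is then just `top_terms(mean_document(cluster_docs))`, applied to each node selected from the tree.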