Also look at last week's lecture notes.
  1. What is the difference between a "bag of words" and a "set of words"? Why is this difference significant for IR?
  2. Does the vector space model differentiate between the query "waste" and the query "waste waste"? Do the major search engines (Google, Yahoo, MSN, Ask)?
  3. The time complexity of the vector space model appears to be awful. Why is it not that bad?
  4. Give examples of situations in which the term independence assumption of probabilistic models is certainly violated.
  5. Describe at least two assumptions underlying probabilistic models. What are the problems with these assumptions?