CS380 - Information Retrieval and Search Engines

Room 232 Park Science

Overview

Lectures

September 5
Lecture Notes
September 12
Lecture Notes
September 19
Lecture Notes
September 26
Lecture Notes
October 3
No lecture, student presentations. But do see the Webalizer analysis of the cs.brynmawr.edu website.
October 10
Remaining student presentation and Lecture Notes
October 17
No Class, fall break
October 24
Lecture Notes on the Google way. Large portions of this lecture were abstracted from a paper about Google by Page and Brin. Some of the specifics about PageRank were taken from another paper by Page and Brin et al.
October 30
Note that we are meeting from 5pm to 7pm on Monday. This immediately follows the CS department pumpkin carving event. So: come early, carve a pumpkin, drink some cider, eat a cookie, then go to class.
Other Link analysis systems. Lecture notes
Nov 7
Lecture notes on clustering
Nov 14
Continuing with lecture notes from class 9 on clustering
Notes on Zamir and Etzioni's paper
Nov 21
Second round of student presentations (see list below)
Nov 28
Remainder of student presentations and collaborative filtering
Dec 5
Relevance Feedback Lecture Notes
Dec 12
Summary and Conclusions Lecture Notes
Other exams: 1, 2, 3, 4

Assignments

Due September 12
Assignment 1
Linux shell script to make a jar file executable here (Saturday, Sept 9: there was a bug in this script; it is fixed now. Sorry.)
Due September 19
Assignment 2 How changeable are Google's rankings? Web log analysis
General comments about the web log analysis.
Webalizer analysis of the access_log file. There are pages for months prior to October 2005. To get to them, follow the naming pattern of the months for which there are links from the main page.
Due October 13. Revised: now due October 24
Assignment 3 Finding registrar-controlled domains.
Due November 7
Assignment 4 Being Evil.
Due December 12
Assignment 5 Forward and Inverted Indexing
At end of finals period
Assignment 6 In Lieu of Final Exam

Talking Points & Readings

For September 12
Read Scientific American, Feb 2005, "Seeking better web searches"
Read Chapter 7 of C. J. van RIJSBERGEN's book. This will be very slow going. Read only through the "Swets model". Of the rest, read only about the SMART measure.
Talking points for September 12
For September 19
Read Chapter 2 of C. J. van RIJSBERGEN's book. Read up to, but not including or after "Probabilistic Indexing".
Read about TF-IDF weighting on Wikipedia.
Talking points for September 19.
For September 26
Read Chapter 6 of C. J. van RIJSBERGEN's book. Read up to, but not including, "Selecting the best dependence trees"
updated 9/21/06 Read Self-organization of the web and identification of communities, GW Flake, S Lawrence, CL Giles, F Coetzee
updated 9/21/06 Talking points for September 26.
For October 3
extended Sept 25: Student presentations. For each of the papers listed below, one person will be the presenter. This presentation will be part of the assignments grade. A second person will be the "quibbler". The quality of the quibbling will go toward the classroom participation grade. Paper selection will occur in class on September 26. The presenter should try to describe the algorithm (if one exists) and critique the paper for what it does, or does not do. For instance, would you use this algorithm? Why? Also, presenters can ask me questions; I will do my best to give timely answers. Presenters might also find that quibblers are at least a good sounding board, since they will have read and understood the paper. (The table below was missing for IE users between Sep 25 and Oct 1. I had a mistake in the HTML that IE barfed on.)
Title/link | Presenter | Quibbler
(link fixed Sept 25) When Will Information Retrieval Be “Good Enough”? | Leslie | Lindsay
(link fixed Sept 25) Using Bigrams in Text Categorization | Lindsay | Lauren
An Evaluation of Naive Bayesian Anti-Spam Filtering | Julia | Anne
Web Spam, Propaganda and Trust | Jacob | Rachel
Detecting Spam Web Pages through Content Analysis | Rachel | Jacob
On Knowing a Word | Lauren | Julia
Combining Multiple Evidence from Different Properties of Weighting Schemes | no one | no one
Corpus-Based Statistical Sense Resolution | Anne | Leslie
For October 30
Read Kleinberg's seminal paper "Authoritative Sources in a Hyperlinked Environment". (Warning: this paper is long and I do not expect total understanding.) In the second half of class we will discuss Latent Semantic Indexing. This paper is not as good as I had hoped. Here is the authoritative paper on LSI, but it is much harder to read than the one on the earlier link.
Talking points for class
For November 7
Read sections 1-5.2 of Data Clustering: A Review. You can read more, and we will discuss other parts of this paper in class. You should be sure to understand all of sections 1-4.
Talking points for class. You may need to go beyond the paper linked above for some of these points.
For November 14
Read Zamir and Etzioni's paper on web document Clustering
Talking points for class.
For November 28
Read Herlocker et al.'s review paper on collaborative filtering. This paper is rather long. I am most interested in the first 10 pages or so, where the discussion is of the general topic area of collaborative filtering, although the whole paper is good. I will also talk some about recommendations in: TiVo, eachmovie, and libra
For Dec 5
Read Schapire et al.'s paper on relevance feedback. Pay particular attention to the section on Rocchio's algorithm, although we will discuss almost everything. If you get confused, Allan's paper might help.
Talking points for class.

Papers for November 21

Possible Books for final book report

Geoffrey Towell

Student Sites