Assignment 2
This assignment has been changed as of Friday Sept 15 at 6:30 AM. See bold below for changes
Rate of change of the web
This assignment will build on the first assignment to create a program that will monitor Google for changes. In this way we will, as a class, get some estimate of the rate of change of the web.
- Do
- SKIP THIS Revise your access code from week 1 so that it can retrieve the top 200 URLs for each of 30 queries. Choose queries whose results you think will change. (This will use 600 of the 1000 queries per day allowed by Google. You will use more in coming weeks.)
- SKIP THIS Set up your program so that it runs every 24 hours for at least a month.
- SKIP THIS Compare the results of the run on day N to the run on day N-1. Record the number of positions at which the results differ. (For example, if a URL was in position 50 on one day and in 51 the next, that is a difference.)
- SKIP THIS Your program must be able to be stopped and restarted.
- SKIP THIS If you work in a group, each person must create their own set of queries.
- Consider the file bubo:/home/gtowell/access_log (do not copy it, it is 800M). This file is the log of all accesses to the cs.brynmawr.edu server. Look at this file as evidence of search engines at work. Analyse it. Some of the questions you might look at are:
- How long has it been collecting info?
- How many different search engines can you find evidence of?
- How often would you estimate that each search engine crawls the web?
You will probably find unix utilities like grep and wc very helpful. Also, you might look into what the file "robots.txt" does.
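If you would rather script part of the analysis than chain grep and wc, the questions above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a required approach: it assumes the log uses the Apache "combined" format (check a few lines of the real file first), the crawler test is only a name-based heuristic, and the sample lines, the `summarize` function, and the `CRAWLER_HINTS` list are invented for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# ASSUMPTION: lines follow the Apache "combined" log format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" \d{3} \S+ "[^"]*" "([^"]*)"'
)

# ASSUMPTION: rough heuristic; many (not all) crawlers use names like these.
CRAWLER_HINTS = ("bot", "crawler", "spider", "slurp")


def summarize(lines):
    """Return the time span, crawler user agents, and robots.txt hit count."""
    first = last = None
    crawlers = Counter()
    robots_hits = 0
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        _host, stamp, request, agent = match.groups()
        try:
            when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
        except ValueError:
            continue  # skip lines with an unparseable timestamp
        # Track earliest and latest timestamps without assuming sorted input.
        if first is None or when < first:
            first = when
        if last is None or when > last:
            last = when
        # Count requests for /robots.txt, which well-behaved crawlers fetch.
        parts = request.split()
        if len(parts) >= 2 and parts[1] == "/robots.txt":
            robots_hits += 1
        # Tally user agents that look like crawlers.
        if any(hint in agent.lower() for hint in CRAWLER_HINTS):
            crawlers[agent] += 1
    return {"first": first, "last": last,
            "crawlers": crawlers, "robots_hits": robots_hits}


# Invented sample lines for illustration only; they are not from access_log.
SAMPLE = [
    '192.0.2.1 - - [10/Sep/2006:06:25:01 -0400] "GET /robots.txt HTTP/1.0" '
    '200 68 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.0.2.2 - - [11/Sep/2006:09:00:00 -0400] "GET /index.html HTTP/1.1" '
    '200 1024 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
    '192.0.2.3 - - [12/Sep/2006:23:59:59 -0400] "GET /courses/ HTTP/1.0" '
    '200 512 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"',
]

if __name__ == "__main__":
    report = summarize(SAMPLE)
    print(report["first"], report["last"], report["robots_hits"])
    for agent, hits in report["crawlers"].most_common():
        print(hits, agent)
```

To run this on the real log, pass an open file handle instead of `SAMPLE` (for example `with open("/home/gtowell/access_log", errors="replace") as log: summarize(log)`), so the file is read one line at a time rather than loaded into memory.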
- Submit electronically by start of class Sept 19:
- SKIP THIS (attachment) a jar file containing your code (it need not be executable) (60 points)
- SKIP THIS (attachment) a list of the 30 queries you are using. (5 points)
- (attachment) Write up a report detailing (in 5 pages or fewer) your investigation of the access_log. Be sure to make up some of your own questions. Graphs might be helpful. Do not kill yourself analysing this file. The report should include a short statement about what the robots.txt file does. (70 points)
- SKIP THIS (body) A statement that your program is running. (A later assignment will ask you for your results, so you need to have your program running) (5 points)
- (body) A list of the people you worked with.
- An analysis of why I suddenly changed this assignment. To do this analysis, all you need is last week's program. Write up your analysis in 1 page or less. (10 points)
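As background for the robots.txt statement in your report, you may find it useful to see how a crawler might consult such a file before fetching a page. The following is a small sketch using Python's standard `urllib.robotparser` module. The robots.txt content, the `BadBot` agent name, the `build_parser` helper, and the example.org URLs are all invented for illustration; they are not taken from the cs.brynmawr.edu server.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for illustration.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""


def build_parser(text):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(EXAMPLE_ROBOTS_TXT)

if __name__ == "__main__":
    # Ask whether a given user agent is permitted to fetch a given URL.
    for agent, url in [
        ("Googlebot", "http://example.org/index.html"),
        ("Googlebot", "http://example.org/private/notes.html"),
        ("BadBot", "http://example.org/index.html"),
    ]:
        print(agent, url, parser.can_fetch(agent, url))
```

Note that robots.txt is only a request to crawlers; whether a particular crawler honors it is something the access_log may help you check.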