Forward and Inverse Indexing
In this assignment you will build forward and inverse indexes for a set of documents. Realistically, you should run this program off of your crawler from the last assignment, However, that would make it extermely difficult for me to evaluate. Hence, I have collected a set of about 1000 text documents in bubo:/home/gtowell/a5docs. You will find two sets of documents in this file: z* and tz*. z* is the original html, tx is the text of these documents after running them through a html 2 text program. Unless you are working in a group you will only need the tz* files. Your program should read and index these documents.
Your program, when started should read all the files in my directory then present users with two questions: first "direction" and then "terms/docids". The only possible responses to the first question are "forward" and "inverse". For the second question, the responses will be either a set of docids (in the case of forward) or a set of tokens (in the case of inverse). On the second question, if more than one term is supplied, then the terms should be treated as conjunctions. So, in the forward case if two document ids are given, then you should return a list of all tokens that appear in both documents. Similarly, in the inverse case, if 3 tokens are supplied, you should list the ids of all documents that countain all 3 tokens.
Some additional constraints:
- On startup, you submitted program should read all of the files it needs from bubo:/home/gtowell/a5docs
- Your indices should be case insensitive.
- The document ids are the names of the files in my directory.
- Watch out for punctuation. The sentence "I do." should be indexed using the tokens "i" and "do".
- No stemming
- No stop word removal
- index all tokens, That is, you shoud index numbers as well as words. Watch out for punctuation tricks -- "3.1415" should be indexed as "3.1415" not "3" and "1415".
- A command line interface is fine, but so is a GUI.
- You do not have to slavishly follow my sample below, but you should be similar.
- Please do not use any web resources. You may use only the standard packages in the programming language of your choice.
- I will test on a bubo-machine, so you may / should hard code the directory location into your program.
- If you work in group, you must:
- have a GUI interface
- be able to handle case sensitive and case insensitive queries. (If the query is all lower case, then case insensitive, otherwise cases sensitive)
- Order results to give preference to text in bold or h1 html tags
- I encourage you to use a subset of my documents for debugging and development.
Example interaction (with simulated answers):
unix> startProgram
(F)orward or (I)nverse: F
Docs Ids: tz1 tz2 tz3 tz4
Tokens in common: the and of geoff towell is great ir rocks
(F)orward or (I)nverse: I
Tokens: geoff towell is the best
Doc Ids: NO DOCUMENTS CONTAINED ALL OF THE TOKENS
- Submit electronically by December 12:
-
- (200 points) An executable version of your program for me to test
- (20 points) A trace showing an example forward and inverse query
- (100 points) A written report discussing at least the following topics:
- what you learned about indexing by doing this assignment
- What problems you would expect to encounter as you scale this program up to 24 billion documents and how you might solve those problems
Note the relative vulate of the program and the report. The easiest place to loose credit on this assignment will be in the report since the program is essentially self checking.