Finding "Registrar-Controlled" domains

last modified: Sept 21 10:55pm

Registrar Controlled Domains

A "top-level domain" on the internet is defined as the part of a domain name that is to the right of the last ".". For almost all top-level domains, an entity with the appropriate standing may ask the registrar of the top-level domain to assign a string (a name within that domain) to be controlled by an individual. That individual can then associate the resulting name in that domain with an IP address.

For example, several years ago I asked the registrar of the ".us" domain to assign the name "towell4" to me, and I told the registrar to associate the name "towell4.us" with the IP address 66.199.227.58. (There was a financial consideration involved in this request.) At the time I wanted to get towell.us, but someone already owned that. I also would have preferred towell.nj.us since I was living in NJ, but I could not find the controlling registrar.
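To make the definition of a top-level domain concrete, here is a minimal Python sketch that returns the part of a hostname to the right of the last ".". The function name top_level_domain is made up for illustration, and stripping a trailing dot and lowercasing are my own assumptions about sensible normalization.

```python
def top_level_domain(hostname):
    """Return the part of hostname to the right of the last '.'."""
    # Assumption: ignore case and a trailing dot (the fully qualified form).
    cleaned = hostname.strip().rstrip(".").lower()
    # rsplit with maxsplit=1 splits only at the last "."; a name with no
    # dot at all is returned unchanged.
    return cleaned.rsplit(".", 1)[-1]


print(top_level_domain("towell4.us"))    # us
print(top_level_domain("www.ford.com"))  # com
```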

The reason I used the word "entity" rather than "person" is that some domains are not supposed to be associated with people. For instance, the "com" domain is supposed to be for commercial enterprises, "gov" is for governments, etc.

So, all top-level domains are registrar controlled. There are additional domains that are registrar controlled. For instance, "co.uk" (which appears in the sample output below) is registrar controlled.

There are no registrar-controlled domains that are more than 2 tokens long.

I am hopeful that this definition is sufficient. Feel free to ask if you need more information. I have made up the term "registrar-controlled domains" so do not try to look it up elsewhere.

Task

The task for this assignment is to write a web crawler that tries to find as many registrar controlled domains as it can.

The crawler must obey the "robot exclusion protocol". You need not fully implement the protocol, but you must implement enough to be sure you do not crawl where you are not wanted.
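One possible way to honor the robot exclusion protocol in Python is the standard library's urllib.robotparser module. The sketch below parses an in-memory robots.txt so that it runs without network access. The agent name "cs-crawler", the sample rules, and the helper names build_rules and may_fetch are all hypothetical. In a live crawler you would more likely call set_url() and read() on the parser to download each site's own robots.txt.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "cs-crawler"  # hypothetical agent name; pick your own


def build_rules(robots_txt):
    """Parse the text of a robots.txt file into a rule object."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(rules, url):
    """Return True if the parsed rules allow USER_AGENT to fetch url."""
    return rules.can_fetch(USER_AGENT, url)


# Made-up robots.txt content, used only for this demonstration.
SAMPLE = """\
User-agent: *
Disallow: /private/
"""

rules = build_rules(SAMPLE)
print(may_fetch(rules, "http://example.com/index.html"))
print(may_fetch(rules, "http://example.com/private/data.html"))
```

The crawler would call may_fetch before every request and simply skip any URL for which it returns False.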

When your program finds a registrar-controlled domain, it should immediately print a statement that it has found one, along with the date, time, and number of the registrar-controlled domain. Hence, the output of your program should look like:

1. sept 20 19:53 com
2. sept 20 19:54 net
3. sept 20 19:54 jp
4. sept 20 19:54 co.uk
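A minimal Python sketch of producing lines in the format shown above follows. The names format_found and FoundLog are invented for this example, and every month abbreviation other than "sept" is a guess, since the sample only shows that one. The year in the demonstration is arbitrary because the output does not print it.

```python
from datetime import datetime

# Only "sept" comes from the sample output; the other spellings are assumed.
MONTHS = ["jan", "feb", "mar", "apr", "may", "june",
          "july", "aug", "sept", "oct", "nov", "dec"]


def format_found(number, domain, when):
    """Build one output line such as '1. sept 20 19:53 com'."""
    month = MONTHS[when.month - 1]
    return f"{number}. {month} {when.day} {when:%H:%M} {domain}"


class FoundLog:
    """Numbers each newly found domain and prints it immediately."""

    def __init__(self):
        self.seen = set()

    def report(self, domain, when=None):
        # Report each domain at most once so that the numbering stays honest.
        if domain in self.seen:
            return None
        self.seen.add(domain)
        line = format_found(len(self.seen), domain, when or datetime.now())
        print(line, flush=True)  # flush so the line appears right away
        return line


log = FoundLog()
log.report("com", datetime(2000, 9, 20, 19, 53))  # prints: 1. sept 20 19:53 com
log.report("com", datetime(2000, 9, 20, 19, 54))  # duplicate, prints nothing
```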

Suggested approach

Start by writing a crawler. For sites to begin your crawl, consider using the results of one or more searches by your Google SOAP program from the first assignment. You might find that a multi-threaded crawler is so much faster that it is worth the effort to write, but that is not required.
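As a starting point, a single-threaded crawler can be sketched as a queue of URLs plus a link extractor. The Python below is only an illustration under several assumptions: the names LinkParser, extract_links, and crawl are made up, the fetch argument is a placeholder for whatever download function you write (it should return the page's HTML, or None on failure), and the demonstration uses a tiny in-memory "web" instead of real requests. It does not check robots.txt; a real crawler must do that before each fetch.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects absolute http(s) links from the <a href=...> tags of a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                absolute = urljoin(self.base_url, value)
                # Skip mailto:, javascript:, and other non-web schemes.
                if urlparse(absolute).scheme in ("http", "https"):
                    self.links.append(absolute)


def extract_links(base_url, html):
    parser = LinkParser(base_url)
    parser.feed(html)
    return parser.links


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl; returns the hostnames seen, in first-seen order."""
    frontier = deque(seeds)
    visited = set()
    hosts = {}  # a dict used as an insertion-ordered set
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        host = urlparse(url).hostname
        if host:
            hosts[host] = None
        html = fetch(url)
        if html is None:  # download failed or page unknown
            continue
        for link in extract_links(url, html):
            if link not in visited:
                frontier.append(link)
    return list(hosts)


# A made-up two-page web, so the sketch runs without network access.
PAGES = {
    "http://www.ford.com/": '<a href="http://www.coke.com/">c</a> <a href="/about">a</a>',
    "http://www.coke.com/": '<a href="mailto:someone@example.com">m</a>',
}
print(crawl(["http://www.ford.com/"], PAGES.get))
```

Passing fetch in as an argument keeps the crawl loop separate from the networking code, which also makes it easier to move to a multi-threaded design later if you decide it is worth the effort.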

Once your crawler is working, consider the problem of recognizing "registrar-controlled" domains. One way that will work is to look for the conserved part of domain names. For instance, once you have seen "www.ford.com" and "www.coke.com", you can conclude that "com" must be a registrar-controlled domain. The problem with this approach is that it requires you to have seen at least two registered domains to identify a registrar-controlled domain. There may be more efficient approaches.
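The conserved-part idea can be sketched in Python as follows. This is one possible reading of the approach, not the required one. For each hostname it looks at the last one and last two tokens (since no registrar-controlled domain is longer than 2 tokens) and remembers which distinct names appear immediately to the left of each suffix. A suffix is reported once two distinct names have been seen. The class name SuffixTracker, the threshold constant, and the decision to ignore the label "www" and numeric IP addresses are all my assumptions. The hostnames "www.alpha.co.uk" and "www.beta.co.uk" are invented for the demonstration.

```python
from collections import defaultdict

MAX_TOKENS = 2            # no registrar-controlled domain is longer than this
MIN_DISTINCT = 2          # distinct names required before reporting a suffix
IGNORED_LABELS = {"www"}  # assumption: "www" is a host label, not a registered name


class SuffixTracker:
    def __init__(self):
        self.names_by_suffix = defaultdict(set)
        self.reported = set()

    def observe(self, hostname):
        """Record one hostname; return suffixes that newly reach the threshold."""
        labels = hostname.lower().rstrip(".").split(".")
        if labels[-1].isdigit():  # looks like an IP address, not a name
            return []
        newly_found = []
        for size in range(1, MAX_TOKENS + 1):
            if len(labels) <= size:
                break  # nothing to the left of a suffix this long
            name = labels[-size - 1]  # the label just left of the suffix
            if name in IGNORED_LABELS:
                continue
            suffix = ".".join(labels[-size:])
            self.names_by_suffix[suffix].add(name)
            if (suffix not in self.reported
                    and len(self.names_by_suffix[suffix]) >= MIN_DISTINCT):
                self.reported.add(suffix)
                newly_found.append(suffix)
        return newly_found


tracker = SuffixTracker()
print(tracker.observe("www.ford.com"))     # []
print(tracker.observe("www.coke.com"))     # ['com']
print(tracker.observe("www.alpha.co.uk"))  # []
print(tracker.observe("www.beta.co.uk"))   # ['co.uk']
```

Be aware that this heuristic can produce false positives for two-token suffixes: a site with two differently named subdomains (say, "mail" and "shop" under the same registered name) would look like a registrar-controlled domain. Raising MIN_DISTINCT or extending IGNORED_LABELS reduces that risk at the cost of slower discovery.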

Grading

This assignment is worth triple the first assignment (or 330 points). Here is how the points will be assigned:
250 points
The program works when I run it. If you use Java, you must give me an executable jar file. If you use another language, consider wrapping your program in a shell script so that I get something like an executable jar. You can separately give me instructions for what arguments to use. By "works", I mean that within 10 minutes of my starting the program it has identified at least 10 registrar-controlled domains.
20 points
Competition 1. The person/team identifying the most registrar-controlled domains will receive 20 points. Others will receive fewer, according to the percentage difference from the best submission. (Mistakenly identified items will be penalized by reducing the count of correct items.)
10 points
Competition 2. I will run each of the programs for 10 minutes and count the number of correctly identified domains. Points and scoring will be as in Competition 1.
50 points
Your crawler will be visiting lots of web pages to collect this data. Write a brief report summarizing the voyages of your crawler (e.g., interesting places it went, boring sites it got stuck in, and problems your crawler encountered, such as tarpits). This report need not be long, but humor is appreciated.

Extra Credit

If you did something extremely clever, tell me about it.

Hand in

  1. The program along with instructions on how I can run it.
  2. A list -- in a text document -- of the registrar-controlled domains your program found.
  3. The writeup of your experience.
  4. A list of the people with whom you worked

Due Date

October 13, before midnight.

Final comment: I may make minor modifications to this text. The modifications will not change the core of the assignment. However, they may try to clarify points on which I receive questions (or simply fix typos). Look at the last modified date at the top of this page to see if I have changed anything.