CS 330: Algorithms: Design & Practice

Homework1: Sorting large data files
Due: In class on Monday, January 30, 2012

Presentation By:

For each of the programs provided in this exercise (bitsortgen.c, qsortints.c, sortints.c, and the system sort), carefully study them first and make sure you understand them well. This will also help you learn (or re-learn) the languages and programming techniques used. You can download/copy the programs from ~dblank/cs330/Homework1/...

1. Use the bitsortgen.c program (See files in ~dblank/cs330/Homework1) to generate datasets of sizes 100000, 300000, 500000, 1000000, 2000000, 3000000, 4000000, 5000000 non repeating positive integers in the range [0, 5000000). Store these numbers in individual data files: say data100k, data300k, ..., data5M.

2. Sort the data generated above using:

Sort the data sets generated in (1) using the four schemes (A), (B), (C), (D). Time the runs, record the times in a table, and plot the times on a graph.

 

Dataset Size
A - system
B - qsort
C - sortints
D - bitsort
100K
       
300k
       
500k
       
1M
       
2M
       
3M
       
4M
       
5M
       

You may need to run the program several times to get an average time for each size. Record the average times only for each size and then plot them.

Extra Credit

Rewrite bitsortgen and sortints in Python and provide timing data as above. How do these compare to the results of the C programs?

What to hand in

1. The times recorded in 1. and a plot of the times. What can you say about the time complexity of the number generating algorithm?

2. Plot on a separate graph a comparison of all the methods: (a), (b), and (c). What can you say about the time complexity of each of these methods?

Write a short report discussing the above two sets of results. Your report should be formatted on standard paper with 1-inch margins all around and 12pt font size. Unless you modified any of the given programs, or wrote new ones you do not need to hand in any program code. Make sure that your name appears on the front page of the report. Also, ensure that everything you hand in is stapled together (no lose sheets will be accepted).

Notes:

To compile C programs do:

> gcc file.c -o file

To run the resulting executable:

> ./file

To compile a C++ program do:

> g++ file.cpp -o file

You should only record the user cpu times. To get the CPU time used by a program you can use the GNU time command as follows:

/usr/bin/time <some unix command>

By default this prints out a number of statistics (see the manual pages for more information). In order to just get the CPU time you should first set the value of an environment variable TIME, as follows:

export TIME="CPU %U"

Then use the time command as described above. The time reported is in seconds.

Notice that the files will get rather large. So you may not want to store all the files all at the same time. Work with each data set in batches, removing older data sets before moving on to the next one.

Presentation:

The designated person(s) will give a presentation on this exercise in class on Monday, January 30. Check the presentation schedule to ensure that you know when you will be presenting. Details of the presentation will be provided in class. Your instructor can provide further guidance in preparing your presentation.

Your presentation should include a description of the three sorting techniques used. Also, it should include a description of the algorithm for generating unique random numbers. Your presentation should discuss the performance of all the methods.


Back to CS330 Materials