CS 330: Algorithms: Design & Practice

Homework 2: Sorting large data files
Due: In class on Thursday, February 11, 2016

Presentation By: Clare & Tu

For each of the programs provided in this exercise (bitsortgen.c, qsortints.c, sortints.cpp, bitsort.c, and the system sort), carefully study them first and make sure you understand them well. This will also help you learn (or re-learn) the languages and programming techniques used.

1. Use the bitsortgen.c program to generate datasets of sizes 100k, 300k, 500k, 1 million, 2million, 3, 4, 5, and 10 million non repeating positive integers in the range [0, 100,000,000). Store these numbers in individual data files: say data100k, data300k, ..., data5M.

Time each run, record it in the table below and plot the times on a graph.

2. Sort the data generated above using (a) The system provided sort program (sort -n), (b) A program that uses the quicksort algorithm (see file qsortints.c), (c) A program that uses the C++ <set> library module (see file sortints.cpp). For all these runs, make sure the data is output to a file.

Sort the data sets generated in (1) using the three schemes (a), (b), and (c). Time the runs and record the times in the table below.

 

Dataset Size
bitsortgen
sort(a)
sort(b)
sort(c)
100K
       
300k
       
500k
       
1M
       
2M
       
3M
       
4M
       
5M
       
10M
       

You may need to run the program several times to get an average time for each size. Record the average times only for each size and then plot them.

Extra Credit

Rewrite bitsortgen and sortints in Python and/or Java and provide timing data as above. How do these compare to the results of the C programs?

What to hand in

1. The times recorded in 1. and a plot of the times. What can you say about the time complexity of the number generating algorithm?

2. Plot on a separate graph a comparison of all the methods: (a), (b), and (c). What can you say about the time complexity of each of these methods?

Write a short report discussing the above two sets of results. Your report should be formatted on standard paper with 1-inch margins all around and 12pt font size. Unless you modified any of the given programs, or wrote new ones you do not need to hand in any program code. Make sure that your name appears on the front page of the report. Also, ensure that everything you hand in is stapled together (no lose sheets will be accepted).

Notes:

To compile C programs do:

> gcc file.c -o file

To run the resulting executable:

> ./file

To compile a C++ program do:

> g++ file.cpp -o file

You should only record the user cpu times. To get the CPU time used by a program you can use the GNU time command as follows:

/usr/bin/time <some unix command>

By default this prints out a number of statistics (see the manual pages for more information). In order to just get the CPU time you should first set the value of an environment variable TIME, as follows:

export TIME="CPU %U"

Then use the time command as described above. The time reported is in seconds.

Notice that the files will get rather large. So you may not want to store all the files all at the same time. Work with each data set in batches, removing older data sets before moving on to the next one.

Presentation:

The designated person(s) will give a presentation on this exercise in class on Thursday, February 11. Check the presentation schedule to ensure that you know when you will be presenting. Details of the presentation will be provided in class. Your instructor can provide further guidance in preparing your presentation.

Your presentation should include a description of the three sorting techniques used. Also, it should include a description of the algorithm for generating unique random numbers. Your presentation should discuss the performance of all the methods.


Back to CS330 Materials