CS520
Spring 2012
Programming Assignment 4
Due Tuesday April 3


The goal of this assignment is to use Posix threads to efficiently process a set of English text files.

The program will accept a sequence of names of text files as its only command-line arguments. The goal of the program is to find the twenty most frequently used words across all files, where each word must appear at least once in each file. If there are ties for position twenty on the list of most frequently used words, then report all the words that tie. That is, you may report more than twenty words for your result, because of ties for the last position on the list. Note that you might report fewer than twenty words if there are fewer than twenty words that are common to all files. Also note that words less than six characters in length should be ignored. Also, since the input files are in English, you may assume that no word will be longer that fifty characters, since I believe the longest word in the English language is only 45 letters long. Words longer than fifty characters can be ignored, since they are probably not real words. Therefore, the analysis should actually report the twenty most frequently used words, where each word must appear at least once in each file, and each word must be at least six characters long and no more than fifty characters long.

If the user does not specify at least one file to be processed, then terminate the program with an appropriate message.

A word starts with a letter (either uppercase or lowercase) and continues until a non-letter (or EOF) is encountered. Non-words in the file should simply be ignored. Once a word is identified, convert all uppercase letters to lowercase before you process it.

Therefore, "elephant's" will be two words, "elephant" and "s", and since "s" is less than six characters long, it will be ignored. Likewise, "double-precision" will be two words, "double" and "precision".

The output can be unsorted, but it should be given to stdout and should consist of one word per line. You should not print anything else to stdout. Please be sure to turnoff debugging output before submitting your program for grading.

Step one is to implement a data structure that can efficiently store a set of words, with each word being associated with a count of how many times it has been seen in a block of text. A hash table is recommended.

Step two is to write a function that, given a block of text (defined by a starting address and a length), will use the data structure from step one to track how many times each word occurs in the block.

Step three is to write a function that, given a set of instances of the data structure, will compute the "intersection" of all the instances. That is, produce an instance of the data structure that will contain only the words that occur at least once in each file. For the words in the intersection, sum each word's count from each instance into a global count.

Step four is to write a serial program that will process the text files in sequence using the functions written in steps two and three to find the twenty most frequently used words, where each word occurs at least once in each file.

This program should read the files using the following standard functions:

Use of these functions is illustrated by a simple example program, count.c, available in ~cs520/public/prog4. For more details about these functions, consult their man pages (man open, man 2 read, man close).

You may also want to consider using this function:

lseek allows you, for instance, to "backup" the next location to be read from in an open file, in order to re-read a section of the file. See man lseek for more information.

Step five is to write a multi-threaded program to solve the problem in each thread is assigned the task of reading and processing one particular file. And one thread uses the function of step three to analyze the data structures produced by the threads and to report the list of longest words appearing in all files.

The solution produced by step five is likely to suffer from load imbalance. Step six is to alter the approach of step five to have multiple threads working on the same file. This requires that the threads coordinate to share the processing of a single file.

After finishing the six steps then implement the fastest version of the program that you can determine. The program must use only open, read, lseek and close to do I/O. Of course, the program must produce the correct results.

We will have a contest to determine the fastest student submission. (I reserve the right to enter the contest too, but I may not.) I will buy the two students with the fastest solutions lunch or dinner at Libby's (maximum value $20). In addition, all other submissions that are within 10% of the running time of the fastest submission will earn ten bonus points on the assignment.

I will determine the fastest submission by averaging the wall clock times for a series of benchmark runs. What I will use for the benchmark runs will not be released in advance. But you can expect runs using a varying number of files (including only one file) and files of varying lengths.

The benchmark runs will be performed on agate.cs.unh.edu. I will try to do all the runs when the machine is lightly loaded. I will compile the programs using gcc with -O (i.e. modest optimization).

Sample files that might be used to benchmark programs are available in ~cs520/public/prog4/files. You can assume that all input files will be less than 2GB in length. Also, you can assume at most 100 files will be given to one particular run of your program. If an invalid filename is given as an argument, terminate the program with an appropriate error message.

If other run-time errors occur, you may also terminate your program with an error. But, note, if malloc fails, this will be considered to be your problem. You need to be able to process the files given the system's limits on memory usage.

Your code for this assignment should be placed in two files: prog4.c and fast4.c. prog4.c should contain your code for the highest step you completed in steps one to six given above. fast4.c should contain your code for your fastest solution to the program. Note that these two programs might be the same, or they might be different because, for instance, you find that a solution using the approach of a higher step is not as fast as some other approach.

You should attack this assignment in the order of the steps given above. Points will be awarded for this assignment in the following manner:

In addition, as discussed above, 10 bonus points will be awarded to all students whose fast4.c is within 10% of the fastest submission.

Your program will be graded primarily by testing it for correct functionality. In addition, however, you may lose points if your program is not properly structured and documented. Decompose sub-problems appropriately into functions and do incremental testing. Leave your debugging output in your code, but disabled, when you do your final assignment submission.

There are two labs associated with this program: Lab 7 on March 23 and Lab 8 on March 30.

Your programs will be graded using agate.cs.unh.edu so be sure to test in that environment.

You should only submit prog4.c and fast4.c. Be sure you disable any debugging output before you submit your assignment.

Please turn-off any debugging code before you submit your program.

Your programs should be submitted for grading from agate.cs.unh.edu. To turn in this assignment, type:
~cs520/bin/submit prog4 prog4.c fast4.c

Please submit only these two files. Do not turn in any other files!

Submissions can be checked by typing:
~cs520/bin/scheck prog4

This assignment is due Tuesday April 3. The standard late policy concerning late submissions will be in effect. See the course overview webpage.

Remember: as always you are expected to do your own work on this assignment.


Last modified on January 21, 2012.

Comments and questions should be directed to hatcher@unh.edu