Homework 5: Finding Similar Sounding Names
Due: In class on Tuesday, April 05, 2016
Presentation By: Clare & Sarah
The idea of forming signatures to process records/lines of data is quite common in many computing applications. You saw how Bentley used this idea to discover anagram classes in a dictionary of words. In this exercise we will use this idea to explore another application: finding similar sounding names and words. The Soundex algorithm has been put to use to identify similar sounding, yet differently spelled names or words. The Soundex algorithm has several flaws. In 1990 Lawrence Phillips designed a better phonetic encoding scheme, called Metaphone, that seems to be much better at encoding not just names but any english words based on phonetic encoding rules.
Both, Soundex and Metaphone, algorithms are described using several cases or rules that are used in the encoding. Similar to the problems discussed in Bentley's Chapter on Data Structures Algorithms, the programs to implement these algorithms can be written much better by using encoding schemes for the rules or cases of the algorithm. Thus these two algorithms embody two important programming patterns: using some kind of signature encoding, and the use of properties of the algorithm itself to encode its rules.
1) Study the Python implementations provided for Soundex and Metaphone algorithms carefully and make sure you understand them. Here is the implementation: soundex.py
Next, we will use these algorithms to find all similar sounding names in a database of names (see file FemaleNames.txt for a set of 10000 names). Here is what you have to do:
2) Use the Soundex algorithm (Python function: soundex) to create (signature, name) pairs of every name in the given data file. Pipe the outout of this to Bentley's anagram finding program to find all name phonetic anagram classes, retaining only those classes that have two or more names in them).
3) Repeat (2) above but this time use the Metaphone algorithm (Python function: metaphone).
Answer the following questions for each algorithm:
What to hand in
1. A short formal report describing your findings. Use the outputs of your programs on the questions above to frame your narrative and conclusions.
3. In the Appendix of the report include: A printout of any programs that you wrote that were not already provided, if any. Explain how the programs were used in the exercise.
Extra Credit:
Would it be possible to use the programs above to process a dictionary of English words to find all homonyms (or homophones)? Give a printout of 25 most interesting homonyms found (you decide what is interesting).
Presentation:
Mary & Hanna will give presentations on this exercise in class on Tuesday, March 01. The presentation should include a description of the problem and the overall solution. It should also include all the other things you had to do in the implementation process.