Introduction

In choosing a dataset and language to implement this examination of collision resolution tradeoffs, I chose things which I thought would be unusual and fun.

For a dataset, I chose CDs from my collection. Fifty CDs were chosen at random from my collection and entered the title, artist, year it was published, genre, recording label and the number of songs on the album.

The alpha-numeric key is the first and last 2 characters of the title with the first and last 2 characters of the artist concatenated to the end. The alpha-numeric key is converted from character values to integer values, and summed. The length of the title and artist fields is then added to the result to give the numeric key for the album. While it's possible for this to return a non-unique key, for the given set of data, it is unique, and so will serve our purpose. This algorithm changed several times in the process of doing this project, see the summary for more information on this.

In examining the language I wanted to choose something that would provide a challenge, yet still have the functionality to handle the work. I chose Perl due to my desire to deliver the presentation of this project via the web. This document is generated by my program. If you happen to be reading this as a hard-copy, please at least glance at the URL that generates it: http://www.slac.com/morgenes/class/csc541/prog1/prog1.cgi

For convenience, I implemented my secondary storage in memory. That is, all my probes are done in memory, not to a logical disk. Unfortunately, Perl's random file access is less than perfect and would require alot of copying the whole file to implement it that way.

The following is the initial contents of the sequential data file (supplemented with the # in the file and computed key):