Honours theses — time to start on the data

IBM key punch and verifier, model 029. Photo: en.wikipedia.org

If you are an honours research student in Australia then you are probably nearing the end of your data collection. What comes next?

Although starting on the analyses is going to be high on your agenda, I would encourage you not to rush into it. First, ensure that the records of your data are accurate. If you have been copying data from paper records into a database then it would be surprising if you have made no errors in the data entry.

One well-tried method of checking for errors in data entry is to enter the data twice and then to compare the two sets. No one seems to do this anymore, at least not in universities, but in the days when punched Hollerith cards were used for data storage, instead of the now ubiquitous magnetic discs and flash memory sticks, double entry was considered almost mandatory. The process was referred to as “key punch” and “verification” in keeping with the terminology used by IBM. Originally, one machine, the IBM Model 026, was used for punching the holes in the Hollerith cards, and another machine, the Model 056, was used for verification. Although I have occasionally used both models, I used the IBM Model 029 much more. It could be used both for punching and for verification and was even able to print directly onto the cards, obviating the need to read one’s data by interpreting the holes!

But back to the matter of how a modern honours student should verify their data. If a single mistake in your data entry could lead to catastrophe, then I would suggest that double entry still has its place. You could enter your data into two spreadsheets, or even into two plain text files, and then use some simple software (such as the Unix diff command) to check for differences between the two entered sets.
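For those who prefer to stay inside one environment, the same comparison can be sketched in a few lines of Python. This is only an illustrative stand-in for `diff`; the function name and the sample records are hypothetical.

```python
def find_discrepancies(lines_a, lines_b):
    """Compare two independently entered copies of a data set,
    returning (line_number, version_a, version_b) for each mismatch."""
    diffs = [(i, x, y)
             for i, (x, y) in enumerate(zip(lines_a, lines_b), start=1)
             if x != y]
    # A length difference means one typist skipped or duplicated a record.
    if len(lines_a) != len(lines_b):
        diffs.append(("line counts differ", len(lines_a), len(lines_b)))
    return diffs

# Hypothetical example: two typists entered the same three records.
first = ["12.3,45", "22.0,48", "19.7,51"]
second = ["12.3,45", "22.0,84", "19.7,51"]
print(find_discrepancies(first, second))  # row 2 was mistyped
```

Every reported discrepancy must then be resolved by going back to the original paper record, not by guessing which of the two entries looks more plausible.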

The next step is to ensure that there are no impossible values in your data set, by which I mean checking for values that cannot represent real measurements. For example, if you have been measuring the temperature of liquid water in drains, then values below zero or over 100 °C indicate that you have made an error. Similarly, if you are a psychologist measuring marks on a visual analogue scale that is 12 centimetres long, then negative values or values exceeding 12 will be erroneous. One of the easiest ways of finding these sorts of errors is to have your data analysis package print the minimum and maximum of each of the variables in your data set.
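A minimal sketch of such a check, using the water-temperature example: the limits, variable name, and sample values below are illustrative assumptions, not part of any particular package.

```python
def range_check(values, low, high, name):
    """Print the minimum and maximum of a variable, and return
    any values that fall outside its physically possible range."""
    print(f"{name}: min={min(values)}, max={max(values)}")
    return [v for v in values if not (low <= v <= high)]

# Hypothetical drain-water temperatures in °C; liquid water must
# lie between 0 and 100.
temps = [18.2, 21.5, -3.0, 104.7, 19.9]
impossible = range_check(temps, 0.0, 100.0, "temperature")
print(impossible)  # [-3.0, 104.7]
```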

In your hunt for impossible values, look next for numbers that indicate spurious accuracy. For example, if you are measuring water temperature with a standard alcohol thermometer, then your measurements should have no more than one decimal place. Numbers like 22.048 will indicate that something has gone wrong. Sometimes you will discover that you have just hit an extra key by mistake; more often, you will find that the digits 4 and 8 actually belonged to the next column of data and that you forgot to type a delimiter between the first valid datum (22.0) and the next datum (possibly 48).
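This check is easiest to run on the raw entries as text, before they are converted to numbers. The one-decimal limit below is an assumption appropriate to a standard alcohol thermometer; adjust it to your instrument.

```python
def too_precise(raw_values, max_decimals=1):
    """Return entries recorded with more decimal places than the
    measuring instrument could possibly deliver."""
    suspect = []
    for text in raw_values:
        if "." in text and len(text.split(".")[1]) > max_decimals:
            suspect.append(text)
    return suspect

# Hypothetical raw thermometer readings as typed in.
readings = ["21.5", "22.048", "19.0", "18.25"]
print(too_precise(readings))  # ['22.048', '18.25']
```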

It is impossible to be exhaustive in describing what sorts of things to check in your data because so much depends on the specific context of the data collection. What I can say is that you should think carefully about the kinds of numbers that are impossible, and ensure that you do not have any.

Related to impossible data is the problem of illogical data. Illogical data are those where two items are jointly nonsensical even if they would make sense on their own. For example, a person can have a 1989 birthday, and a person can be 50 years old, but if the year is 2009, then it is not possible to be 50 and to have a 1989 birthday. Similarly, being born in 1989 and being enrolled at primary school in 2009 are unlikely co-occurrences. Again, it is impossible to be exhaustive but if you can discover mutual dependencies in your data set (such as year of birth and age), then you can cross-tabulate the two variables to discover whether you have illogical data.
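The birthday-and-age example above can be sketched as a cross-tabulation followed by a consistency test. The survey year, the tolerance of one year for a birthday not yet reached, and the sample records are all illustrative assumptions.

```python
from collections import Counter

SURVEY_YEAR = 2009  # assumed year of data collection

def illogical(birth_year, age, survey_year=SURVEY_YEAR):
    """A reported age must equal the survey year minus the birth
    year, give or take one year for a birthday not yet reached."""
    return not (survey_year - birth_year - 1 <= age <= survey_year - birth_year)

# Hypothetical (year of birth, reported age) pairs.
records = [(1989, 20), (1989, 50), (1959, 50), (1989, 19)]

# Cross-tabulate the two variables, then flag inconsistent cells.
crosstab = Counter(records)
suspect = [pair for pair in crosstab if illogical(*pair)]
print(suspect)  # [(1989, 50)]
```

Each flagged cell of the cross-tabulation points to one or more records that need to be traced back to the original source documents.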

Data checking is vital. Do it carefully and thoroughly and you will be more confident of being able to rely on the results of your later analyses. Do it badly and you might embarrass yourself by claiming to have discovered the cause of global climate change when what you actually had was a fly-speck on your thermometer scale — and possibly a mote in your own eye.

Contributors: Mark R. Diamond