You are what you measure: Enhancing compiler error messages effectively

Compiler Error Messages (CEMs) play an essential role for programming students, who often have little experience to draw upon, leaving CEMs as their primary guidance on error correction. Further, CEMs provide immediate feedback, with implications discussed in this post. In the absence of an instructor, the compiler and its messages are the only source of feedback on what the student is doing correctly and incorrectly. There is another issue at hand, however: CEMs are frequently inadequate, present a barrier to progress, and are often a source of discouragement.

At SIGCSE 2016 I presented a paper showing that enhancing compiler error messages can be effective, referred to here as Becker (2016). I also led a more in-depth study with a more focused comparison approach that was recently published in Computer Science Education (see my publications page for details on both). In 2014, Denny, Luxton-Reilly and Carpenter published a study providing evidence that enhancing CEMs was not effective, generating a bit of discussion on Mark Guzdial’s Blog. Although these papers reached opposing conclusions, there are a ton of variables involved in studies like these, and two things in particular are really important. These might sound really obvious, but bear with me. These two things are:

  1. What is measured
  2. How these things are measured

Another important factor is the language used – the English terminology, not the programming language. That will come up soon enough.

In Becker (2016) I measured four things (a sketch of how the first three could be computed follows this list):

  1. number of errors per compiler error message
  2. number of errors per student
  3. number of errors per student per compiler error message
  4. number of repeated errors per compiler error message
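
The first three of these are simple counts over a log of error events. Below is a minimal sketch of how such counts could be computed; the ErrorEvent record, the field names, and the sample data are illustrative assumptions of mine, not the actual instrumentation used in the study.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A sketch only: the event shape and sample data are assumptions,
// not the actual data collected in Becker (2016).
public class ErrorMetrics {

    /** One student-committed error: who committed it, and the CEM it generated. */
    record ErrorEvent(String studentId, String cem) {}

    public static void main(String[] args) {
        List<ErrorEvent> log = List.of(
                new ErrorEvent("s1", "';' expected"),
                new ErrorEvent("s1", "cannot find symbol"),
                new ErrorEvent("s2", "';' expected"));

        Map<String, Integer> errorsPerCem = new HashMap<>();           // metric 1
        Map<String, Integer> errorsPerStudent = new HashMap<>();       // metric 2
        Map<String, Integer> errorsPerStudentPerCem = new HashMap<>(); // metric 3

        for (ErrorEvent e : log) {
            errorsPerCem.merge(e.cem(), 1, Integer::sum);
            errorsPerStudent.merge(e.studentId(), 1, Integer::sum);
            // metric 3 is keyed on the (student, CEM) pair
            errorsPerStudentPerCem.merge(e.studentId() + "|" + e.cem(), 1, Integer::sum);
        }

        System.out.println(errorsPerCem);           // {';' expected=2, cannot find symbol=1} (order may vary)
        System.out.println(errorsPerStudent);       // {s1=2, s2=1} (order may vary)
        System.out.println(errorsPerStudentPerCem); // one count per (student, CEM) pair
    }
}
```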

Denny et al. measured three things:

  1. number of consecutive non-compiling submissions
  2. total number of non-compiling submissions
  3. number of attempts needed to resolve three errors: Cannot resolve identifier, type mismatch, missing semicolon

Getting back to my fairly obvious point that what is measured (and how) is of critical importance, let me dig into my four metrics for some of the not-so-obvious stuff. For starters, all four of my metrics involve student errors. Additionally, although I was measuring errors, for three of my metrics I was measuring some flavor of errors per CEM. This is important, and the wording is intentional. As I was investigating the effect of enhancing CEMs, the ‘per CEM’ part is by design. However, it is also required for another reason: there is often not a one-to-one mapping of student-committed errors to CEMs in Java, so I don’t know (from looking at the CEM) exactly what error caused that CEM. I could look at the source code to see, but the point is that from a CEM point of view, all I can know is how many times that CEM occurred – in other words, how many (student-committed) errors (of any type/kind/etc.) generated that CEM. See work by Altadmri & Brown (2015) and my MA thesis for more on this lack of a one-to-one mapping of errors to CEMs in Java. This makes things tricky, as the snippet below illustrates.
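
Consider this contrived snippet (my own, not from either study): three quite different student mistakes, each of which javac reports with the very same CEM, cannot find symbol. Looking only at the CEM, there is no way to tell which mistake occurred. The offending lines are commented out so the file compiles as given; uncommenting any one of them triggers the message.

```java
public class OneCemManyErrors {
    public static void main(String[] args) {
        int count = 0;

        // Mistake 1: a misspelled identifier
        // cont = count + 1;                     // error: cannot find symbol

        // Mistake 2: a missing import (java.util.Scanner)
        // Scanner in = new Scanner(System.in);  // error: cannot find symbol

        // Mistake 3: a loop variable used outside its scope
        for (int i = 0; i < 3; i++) { count += i; }
        // System.out.println(i);                // error: cannot find symbol

        System.out.println(count);
    }
}
```

Finally, each metric warrants some discussion on its own: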

  1. The number of errors per CEM was measured for all errors encountered during the study (generating 74 CEMs in total) and for errors generating the top 15 CEMs, representing 86.3% of all errors. Results indicated that enhancing CEMs reduced both.
  2. The number of errors per student was not significantly reduced when considering all 74 CEMs, but it was for errors generating the top 15 CEMs.
  3. The number of errors per student per CEM was significantly reduced for 9 of the top 15 CEMs, though only 8 of those 9 actually had enhanced CEMs. The odd one out was .class expected. Sometime I’ll write more on this – it’s a really interesting case.
  4. The number of repeated errors per CEM depends on the definition of a repeated error. I defined a repeated error similarly to Matt Jadud – two successive compilations that generate the same CEM on the same line of code (see the sketch after this list). Again, this was for the top 15 CEMs.
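
Under that definition, counting repeated errors is a single pass over each student’s sequence of compilation events. Here is a minimal sketch under my own assumptions about the data shape (the Compilation record and the sample session are illustrative, not from the study):

```java
import java.util.Arrays;
import java.util.List;

// A sketch of the Jadud-style repeated-error definition used for metric 4:
// two successive compilations generating the same CEM on the same line.
public class RepeatedErrors {

    /** One compilation event; cem == null means the compilation succeeded. */
    record Compilation(String cem, int line) {}

    static int countRepeatedErrors(List<Compilation> session) {
        int repeats = 0;
        for (int i = 1; i < session.size(); i++) {
            Compilation prev = session.get(i - 1);
            Compilation curr = session.get(i);
            if (curr.cem() != null
                    && curr.cem().equals(prev.cem())
                    && curr.line() == prev.line()) {
                repeats++; // same CEM, same line, in back-to-back compilations
            }
        }
        return repeats;
    }

    public static void main(String[] args) {
        List<Compilation> session = Arrays.asList(
                new Compilation("';' expected", 12),
                new Compilation("';' expected", 12),       // a repeated error
                new Compilation("cannot find symbol", 12), // different CEM: not a repeat
                new Compilation(null, -1));                // successful compilation
        System.out.println(countRepeatedErrors(session));  // prints 1
    }
}
```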

If we now look at the metrics of Denny et al., the first two involve student submissions, which may have contained errors, but errors are not being measured directly (well, we know that the compiling submissions don’t have any errors, and that the non-compiling submissions do, but that’s about it). Only the third involves errors directly, and at that, only three particular types. What was really measured here was the average number of compiles it takes a student to resolve each type of error: a submission is said to have a syntax error of a particular type when the error is first reported in response to compilation, and the error is said to have been resolved when the syntax error is no longer reported to students in the feedback for that submission.
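
As I read that definition, the metric can be sketched as follows. The data shapes and names are my own assumptions, as is one counting choice (I count the submissions on which the error is still reported), so this is an illustration of the idea rather than Denny et al.’s actual implementation.

```java
import java.util.List;
import java.util.Set;

// A sketch of the "attempts needed to resolve" idea: an error type appears in
// a submission's feedback, and is resolved once it is no longer reported.
public class AttemptsToResolve {

    /** Each submission is summarized by the set of CEM types it reported. */
    static int attemptsToResolve(List<Set<String>> submissions, String cemType) {
        int attempts = 0;
        boolean seen = false;
        for (Set<String> feedback : submissions) {
            if (feedback.contains(cemType)) {
                seen = true;
                attempts++;          // still unresolved on this submission
            } else if (seen) {
                break;               // no longer reported: resolved
            }
        }
        return attempts;             // 0 if the error type never appeared
    }

    public static void main(String[] args) {
        List<Set<String>> submissions = List.of(
                Set.of("';' expected"),
                Set.of("';' expected", "incompatible types"),
                Set.of("incompatible types"),  // the semicolon error is resolved here
                Set.<String>of());             // compiles cleanly
        System.out.println(attemptsToResolve(submissions, "';' expected")); // prints 2
    }
}
```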

So, comparing the results of these two studies: if this post were to reach a conclusion of its own, the best we can do is compare the following result from Denny et al.:

  • D1. Enhancing compiler error messages does not reduce the number of attempts needed to resolve three errors (really, CEMs): Cannot resolve identifier, type mismatch, missing semicolon.

and the following from Becker (2016):

  • B1. Enhancing compiler error messages does reduce the number of errors that generate the CEMs: <identifier> expected, incompatible types, ; expected, and many other CEMs.
  • B2. Enhancing compiler error messages does reduce the number of errors per student that generate the CEMs: <identifier> expected, incompatible types, and many other CEMs.*
  • B3. Enhancing compiler error messages does reduce the number of repeated errors generating the CEMs: <identifier> expected, incompatible types, and many other CEMs.*

These are the only four results (across both papers) that measure the same thing – student errors. Further, we can only specifically compare the results involving the three CEMs that Denny et al. investigated. Becker (2016) investigated 74, including these three.

* The number of errors (per student, and repeated) generating the CEM ; expected was not reduced in these cases.

So, despite the differing general conclusions (Denny et al. indicate that enhanced CEMs are not effective, while Becker (2016) indicates that enhanced CEMs can be effective), if we synthesize the comparable results from each paper, we end up with what the two studies (sometimes) agree on, which is ; expected:

  • D1. Enhancing compiler error messages does not reduce the number of attempts needed to resolve missing semicolon (Denny et al.).
  • B2. Enhancing compiler error messages does not reduce the number of errors per student that generate the CEM ; expected (Becker 2016).
  • B3. Enhancing compiler error messages does not reduce the number of repeated errors per student that generate the CEM ; expected (Becker 2016).

I find this to be particularly unsurprising, as ; expected is one of the most common CEMs (in my study the third most common, representing ~10% of all errors), and the message itself is one of the most straightforward of all Java CEMs. However, Becker (2016) had one result (B1) showing that the number of errors generating ; expected CEMs was reduced. So for this CEM, maybe the jury is still out.
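
For concreteness, the canonical trigger is a one-character omission like the following contrived example of mine (which deliberately does not compile):

```java
public class MissingSemicolon {
    public static void main(String[] args) {
        int x = 5   // javac reports: error: ';' expected
        System.out.println(x);
    }
}
```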

It may seem that the two studies didn’t agree on much, which technically is true. However, I hope that any readers who have persevered this long can appreciate the nuances of what is measured (and how) in these types of studies, particularly when comparing them. It is very challenging because the nuances really matter. Further, they can really complicate the language used: if you make the language easy, you miss important details and become ambiguous; if you incorporate those details, readability suffers.

Finally, I think that this post demonstrates the need for studies that attempt to repeat the results of others, particularly in an area where results are contested. Comparing two different studies poses several other problems (apart from what is measured and how), and I won’t go into them here as most are well known and well discussed, but I do think that the difficulties arising from the use of different language are often overlooked.

Either way, I believe that the results in Becker (2016) and the recent Computer Science Education article are robust. These studies provide many results that indicate that enhanced CEMs can be effective.