The enhancing compiler error messages saga continues

I was sorry to have missed Raymond Pettit, John Homer and Roger Gee presenting the latest installment of what is becoming the enhanced compiler error messages saga at SIGCSE earlier this month. Their paper “Do Enhanced Compiler Error Messages Help Students?: Results Inconclusive” [1] was two-pronged. It contrasted the work of Denny et al. [2] (which provided evidence that compiler error enhancement does not make a difference to students) and my SIGCSE 2016 paper [3] (which provided evidence that it does). It also provided fresh evidence that supports the work of Denny et al. I must say that I like Pettit et al.’s subtitle: “Results Inconclusive”. I don’t think that this is the final chapter in the saga. We need to do a lot more work on this.

Early studies on this often didn’t include much quantifiable data. More recent studies haven’t really been measuring the same things – and they have been measuring these things in different ways. In other words, the metrics and the methodologies differ. It’s great to see work like that of Pettit et al. that is more comparable to previous work such as that of Denny et al.

One of the biggest differences between my editor, Decaf, and Pettit et al.’s tool, Athene, is that Decaf was used by students for all of their programming – practicing, working on assignments, programming for fun, even programming in despair. For most of my students it was the only compiler they used – so they made a lot of errors, and all of them were logged. Unlike the students in the Denny et al. study, my students did not receive skeleton code – they were writing programs, often from scratch. On the other hand, students often used Athene after developing their code on their own local (unmonitored) compilers. Thus, many errors generated by the students in the Pettit et al. study were not captured, and the code submitted to Athene was often already fairly refined. Pettit et al. even have evidence from some of their students that at times the code submitted to Athene contained only those errors that the students absolutely could not rectify without help.

As outlined in this post, Denny et al. and I were working towards the same goal but measuring different things. This may not be super apparent at first read, but under the hood comparing studies like these is often a little more complicated than it first looks. Of course these differences have big implications when trying to compare results. I’m afraid that the same is true comparing my work with Pettit et al. – we are trying to answer the same question, but measuring different things (in different ways) in order to do so.

Specifically, Pettit et al. measured:

  1. the number of non-compiling submissions; as did Denny et al., but unlike me
  2. the number of successive non-compiling submissions that produced the same error message; Denny et al. measured the number of consecutive non-compiling submissions regardless of why the submission didn’t compile, and I measured the number of consecutive errors generating the same error message, on the same line of the same file (a rough sketch of these differing definitions follows this list)
  3. the number of submission attempts (in an effort to measure student progress)
  4. time between submissions; neither Denny et al. nor I measured time-based metrics
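To make the differing flavors of metric 2 concrete, below is a minimal sketch of how each might be computed from a log of compilation/submission events. This is emphatically not code from any of the three studies – the Event record, its field names, and the method names are hypothetical – and the logic is only a rough reading of the definitions above.

import java.util.List;

// Hypothetical sketch only -- not code from Pettit et al., Denny et al., or my studies.
// An Event is one logged compilation or submission; the fields are illustrative.
public class RepeatedErrorMetrics {

    record Event(boolean compiled, String cem, String file, int line) {}

    // Roughly Pettit et al.'s metric 2: successive non-compiling
    // submissions that produced the same error message, wherever it occurred.
    static int successiveSameCem(List<Event> events) {
        int count = 0;
        for (int i = 1; i < events.size(); i++) {
            Event prev = events.get(i - 1), cur = events.get(i);
            if (!prev.compiled() && !cur.compiled() && cur.cem().equals(prev.cem())) {
                count++;
            }
        }
        return count;
    }

    // Roughly my metric: consecutive errors generating the same CEM
    // on the same line of the same file.
    static int successiveSameCemSameLocation(List<Event> events) {
        int count = 0;
        for (int i = 1; i < events.size(); i++) {
            Event prev = events.get(i - 1), cur = events.get(i);
            if (!prev.compiled() && !cur.compiled()
                    && cur.cem().equals(prev.cem())
                    && cur.file().equals(prev.file())
                    && cur.line() == prev.line()) {
                count++;
            }
        }
        return count;
    }
}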

I also did a fairly detailed comparison between my work and Denny et al. in [4] (page 10). In that study we directly compared some effects of enhanced and non-enhanced error messages:

In this study we directly distinguish between two sets of compiler error messages (CEMs), the 30 that are enhanced by Decaf and those that are not. We then explore if the control and intervention groups respond differently when they are presented with these. For CEMs enhanced by Decaf the control and intervention groups experience different output. The intervention group, using Decaf in enhanced mode, see the enhanced and raw javac CEMs. The control group, using Decaf in pass-through mode, only see the raw javac CEMs. Thus for CEMs not enhanced by Decaf, both groups see the same raw CEMs. This provides us with an important subgroup within the intervention group, namely when the intervention group experiences errors generating CEMs not enhanced by Decaf. We hypothesized that there would be no significant difference between the control and intervention groups when looking at these cases for which both groups receive the same raw CEMs. On the other hand, if enhancing CEMs has an effect on student behavior, we would see a significant difference between the two groups when looking at errors generating the 30 enhanced CEMs (due to the intervention group receiving enhanced CEMs and the control group receiving raw CEMs).
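For readers who have not used Decaf, the following sketch may help illustrate the two modes described above. It is not Decaf’s actual implementation – the class name, the method, and the two example enhanced wordings are all invented for illustration – but it captures the behavior: in enhanced mode a matched CEM is shown together with the raw javac CEM, while pass-through mode (and any CEM that is not one of the 30 enhanced ones) yields only the raw javac CEM.

import java.util.Map;

// Hypothetical sketch of the two modes described above -- not Decaf's actual
// implementation; the enhanced wordings below are invented for illustration.
public class CemPresenter {

    enum Mode { ENHANCED, PASS_THROUGH }

    // In reality Decaf enhances 30 CEMs; two invented examples stand in here.
    private static final Map<String, String> ENHANCED_WORDINGS = Map.of(
        "';' expected", "It looks like a semicolon ';' may be missing at the end of this line.",
        "cannot find symbol", "Java does not recognize this name. Check its spelling and that it has been declared."
    );

    static String present(String rawCem, Mode mode) {
        String enhanced = ENHANCED_WORDINGS.get(rawCem);
        if (mode == Mode.ENHANCED && enhanced != null) {
            // Intervention group: the enhanced wording plus the raw javac CEM.
            return enhanced + System.lineSeparator() + "javac: " + rawCem;
        }
        // Control group (pass-through mode), or a CEM that is not enhanced:
        // only the raw javac CEM is shown.
        return "javac: " + rawCem;
    }
}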

As mentioned, the metrics used by Pettit et al. and Denny et al. are more similar to each other than to mine. Pettit et al. and Denny et al. both used metrics based on submissions (that is, programs) submitted by students, or on the number of submission attempts. This certainly makes comparing their studies more straightforward. However, it is possible that these metrics are too ‘far from the data’ to be significantly influenced by enhanced error messages. It is possible that metrics based simply on the programming errors committed by students, and on the error messages those errors generate, are more ‘basic’ and more sensitive.

Another consideration when measuring submissions is that just because a submission compiles does not mean that it is correct or does what was intended. It is possible that some students continue to edit (and possibly generate errors) after their first compiling version, or after they submit an assignment. These errors should also be analyzed. I think that in order to measure whether enhancing error messages makes a difference to students we should focus on all programming activity. I’m afraid that otherwise, the results may say more about the tool (that enhances error messages) and the way that tool was used by students, than about the effects of enhanced error messages themselves. I am sure that in some of my research this is also true – after all, my students were using a tool too, and that tool has its own workings which must generate effects. Isolating the effects of the tool from the effects of the messages is challenging.

I am very glad to see more work in this area. I think it is important, and I don’t think it is even close to being settled. I have to say I really feel that the community is working together to do this. It’s great! In addition there may be more to do than determine if enhanced compiler errors make a difference to students. We have overwhelming evidence that syntax poses barriers to students. We have a good amount of evidence that students think that enhancing compiler error messages makes a positive difference. Some researchers think it should too. If enhancing compiler error messages doesn’t make a difference, we need to find out why, and we need to explain the contradiction this would pose. On the other hand, if enhancing compiler error messages does make a difference we need to figure out how to do it best, which would also be a significant challenge.

I hope to present some new evidence on this soon. I haven’t analyzed the data yet, and I don’t know which way this study is going to go. The idea for this study came from holding my previous results up to the light and looking at them from quite a different angle. I feel that one of the biggest weaknesses in my previous work was that the control and treatment groups were separated by a year – so that is what I eliminated. The new control and treatment groups were taking the same class, on the same day – separated only by lunch break. Fortuitously, due to a large intake, CP1 was split into two groups for the study semester, but both groups were taught by the same lecturer in exactly the same way – sometimes things just work out!

I will be at ITiCSE 2017 and SIGCSE 2018 (and 2019 for that matter – I am happy to be serving a two-year term as workshop co-chair). I hope to attend some other conferences as well but haven’t committed yet. I look forward to continuing the discussion on the saga of enhancing compiler error messages with anyone who cares to listen! In the meantime, here are a few more posts where I discuss enhancing compiler error messages – comments are welcome…

[1] Raymond S. Pettit, John Homer, and Roger Gee. 2017. Do Enhanced Compiler Error Messages Help Students?: Results Inconclusive. In Proceedings of the 2017 ACM SIGCSE Technical Symposium on Computer Science Education (SIGCSE ’17). ACM, New York, NY, USA, 465-470. DOI: https://doi.org/10.1145/3017680.3017768

[2] Paul Denny, Andrew Luxton-Reilly, and Dave Carpenter. 2014. Enhancing Syntax Error Messages Appears Ineffectual. In Proceedings of the 2014 Conference on Innovation & Technology in Computer Science Education (ITiCSE ’14). ACM, New York, NY, USA, 273-278. DOI: http://dx.doi.org/10.1145/2591708.2591748

[3] Brett A. Becker. 2016. An Effective Approach to Enhancing Compiler Error Messages. In Proceedings of the 47th ACM Technical Symposium on Computing Science Education (SIGCSE ’16). ACM, New York, NY, USA, 126-131. DOI: https://doi.org/10.1145/2839509.2844584

full-text available to all at www.brettbecker.com/publications

[4] Brett A. Becker, Graham Glanville, Ricardo Iwashima, Claire McDonnell, Kyle Goslin, and Catherine Mooney. 2016. Effective Compiler Error Message Enhancement for Novice Programming Students. Computer Science Education 26(2-3), 148-175. DOI: http://dx.doi.org/10.1080/08993408.2016.1225464

full-text available to all at www.brettbecker.com/publications

You are what you measure: Enhancing compiler error messages effectively

Compiler Error Messages (CEMs) play a particularly important role for programming students as they often have little experience to draw upon, leaving CEMs as their primary guidance on error correction. Further, they provide immediate feedback, with implications discussed in this post. In the absence of an instructor, the compiler and its messages are the only source of feedback on what the student is doing correctly, and incorrectly. There is another issue at hand, however – CEMs are frequently inadequate, present a barrier to progress, and are often a source of discouragement.

At SIGCSE 2016 I presented a paper which showed that enhancing compiler error messages can be effective, referred to here as Becker (2016). I also led a more in-depth study with a more focused comparison approach that was recently published in Computer Science Education (see my publications page for details on both). In 2014 Denny, Luxton-Reilly and Carpenter published a study providing evidence that enhancing CEMs was not effective, generating a bit of discussion on Mark Guzdial’s Blog. Although these papers came up with opposing conclusions, there are a ton of variables involved in studies like this, and two things in particular are really important. These might sound really obvious, but bear with me. These two things are:

  1. What is measured
  2. How these things are measured

Another important factor is the language used – as in the English terminology, not the programming language. That will come up soon enough.

In Becker (2016) I measured four things:

  1. number of errors per compiler error message
  2. number of errors per student
  3. number of errors per student per compiler error message
  4. number of repeated errors per compiler error message

Denny et al. measured three things:

  1. number of consecutive non-compiling submissions
  2. total number of non-compiling submissions
  3. number of attempts needed to resolve three errors: Cannot resolve identifier, type mismatch, missing semicolon

Getting back to my fairly obvious point that what is measured (and how) is of critical importance, let me dig into my four metrics for some of the not-so-obvious stuff. For starters, all four of my metrics involve student errors. Additionally, although I was measuring errors, for three of my metrics I was measuring some flavor of errors per CEM. This is important, and the wording is intentional. As I was investigating the effect of enhancing CEMs, the ‘per CEM’ part is by design. However, it is also required for another reason – there is often not a one-to-one mapping of student-committed errors to CEMs in Java – so I don’t know (from looking at the CEM) exactly what error caused that CEM. I could look at the source code to see, but the point is that from a CEM point of view, all I can know is how many times that CEM occurred – in other words, how many (student-committed) errors (of any type/kind/etc.) generated that CEM. See work by Altadmri & Brown (2015) and my MA thesis for more on this lack of a one-to-one mapping of errors to CEMs in Java; a small illustration also follows the list below. This makes things tricky. Finally, each metric warrants some discussion on its own:

  1. The number of errors per CEM was measured for all errors encountered during the study (generating 74 CEMs in total) and for errors generating the top 15 CEMs, representing 86.3% of all errors. Results indicated that enhancing CEMs reduced both.
  2. The number of errors per student was not significantly reduced when taking all 74 CEMs into account, but it was for errors generating the top 15 CEMs.
  3. The number of errors per student per CEM was significantly reduced for 9 of the top 15 CEMs (of which only 8 had enhanced CEMs). The odd-one-out was .class expected. Sometime I’ll write more on this – it’s a really interesting case.
  4. The number of repeated errors per CEM is dependent on the definition of a repeated error. I defined a repeated error similarly to Matt Jadud – two successive compilations that generate the same CEM on the same line of code. Also, this was for the top 15 CEMs.
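Here is the small illustration of the many-to-one mapping promised above. The two snippets below are mine (the file and class names are hypothetical), but the behavior is standard javac: two quite different student mistakes – an actual missing semicolon, and a missing opening brace after a method header (the same situation as the cascading example further below) – both generate the very same ';' expected CEM. From the CEM alone, the two underlying errors are indistinguishable.

// Missing.java -- the semicolon really is missing:
public class Missing {
    public static void main(String[] args) {
        int i = 1                               // javac: error: ';' expected
    }
}

// Brace.java -- no semicolon is missing at all; the opening brace '{'
// after the method header is what has been left out:
public class Brace {
    public static void main(String[] args)      // javac: error: ';' expected
        int i = 1;
    }
}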

If we now look at the metrics of Denny et al., the first two involve student submissions, which may have contained errors, but errors are not being measured directly (well, we know that the compiling submissions don’t have any errors, and that the non-compiling submissions do, but that’s about it). Only the third involves errors directly, and at that, only three particular types. What was really measured here was the average number of compiles that it takes a student to resolve each type of error, where a submission is said to have a syntax error of a particular type when the error is first reported in response to compilation, and the error is said to have been resolved when the syntax error is no longer reported to students in the feedback for that submission.

So, comparing the results of these two studies, if this post were trying to reach a conclusion of its own, the best we could do would be to compare the following result from Denny et al.:

  • D1. Enhancing compiler error messages does not reduce the number of attempts needed to resolve three errors (really, CEMs): Cannot resolve identifier, type mismatch, missing semicolon.

and the following from Becker (2016):

  • B1. Enhancing compiler error messages does reduce the number of errors that generate the CEMs: expected, incompatible types, ; expected, and many other CEMs.
  • B2. Enhancing compiler error messages does reduce the number of errors per student that generate the CEMs: expected, incompatible types, and many other CEMs*
  • B3. Enhancing compiler error messages does reduce the number of repeated errors generating the CEMs: expected, incompatible types, and many other CEMs.*

These are the only four results (across both papers) that measure the same thing – student errors. Further, we can only specifically compare the results involving the three CEMs that Denny et al. investigated. Becker (2016) investigated 74, including these three.

* The number of errors (per student, and repeated) generating the CEM ; expected was not reduced in these cases.

So, despite the differing general conclusions (Denny et al. indicate that enhanced CEMs are not effective, while Becker (2016) indicates that enhanced CEMs can be effective), if we synthesize the results the two papers have in common, we end up with what the two studies agree on (sometimes), which concerns ; expected:

  • D1. Enhancing compiler error messages does not reduce the number of attempts needed to resolve missing semicolon (Denny et al.).
  • B2. Enhancing compiler error messages does not reduce the number of errors per student that generate the CEM ; expected (Becker 2016).
  • B3. Enhancing compiler error messages does not reduce the number of repeated errors per student that generate the CEM ; expected (Becker 2016).

I find this to be particularly unsurprising as ; expected is one of the most common CEMs (in my study the third most common, representing ~10% of all errors) and the actual CEM itself is one of the most straightforward of all Java CEMs. However, Becker (2016) had one result (B1) which showed that the number of errors generating ; expected CEMs was reduced. So for this CEM, maybe the jury is still out.

It may seem that the two studies didn’t agree on much, which technically is true. However, I hope that any readers who have persevered this long can appreciate the nuances of what is measured (and how) in these types of study, particularly when comparing studies. It is very challenging because the nuances really matter. Further, they can really complicate the language used. If you try to make the language simple, you miss important details and become ambiguous. Incorporating those details into the language affects readability.

Finally, I think that this post demonstrates the need for studies that attempt to repeat the results of others, particularly in an area where results are contested. Comparing two different studies poses several other problems (apart from what is measured and how), and I won’t go into them here as most are well known and well discussed, but I do think that the difficulties that arise from the use of different language are often overlooked.

Either way, I believe that the results in Becker (2016) and the recent Computer Science Education article are robust. These studies provide many results that do indicate that enhanced CEMs can be effective.

Misleading, cascading Java error messages

I have been working on enhancing Java error messages for a while now, and I have stared at a lot of them. Today I came across one that I don’t think I’ve consciously seen before, and it’s quite a doozy if you are a novice programmer. Below is the code, with a missing bracket on line 2:

public class Hello {
       public static void main(String[] args)  //missing {
              double i;
              i = 1.0;
              System.out.println(i);
       }
}

The standard Java output in this case is:

C:\Users\bbecker\Desktop\Junk\Hello.java:2: error: ';' expected
       public static void main(String[] args)
                                             ^

C:\Users\bbecker\Desktop\Junk\Hello.java:4: error: <identifier> expected
              i = 1.0;
               ^

C:\Users\bbecker\Desktop\Junk\Hello.java:5: error: <identifier> expected
              System.out.println(i);
                                ^

C:\Users\bbecker\Desktop\Junk\Hello.java:5: error: <identifier> expected
              System.out.println(i);
                                  ^

C:\Users\bbecker\Desktop\Junk\Hello.java:7: error: class, interface, or enum expected
}
^

5 errors

Process Terminated ... there were problems.

Amazing. This is telling the student that there were 5 errors (not one), and none of the five reported errors are even close to telling the student that there is a missing bracket on line 2. If the missing bracket is supplied, all five “errors” are resolved.
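For reference, here is the corrected code – the only change is the restored opening bracket at the end of line 2 – which compiles with no errors at all:

public class Hello {
       public static void main(String[] args) {  // bracket restored
              double i;
              i = 1.0;
              System.out.println(i);
       }
}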

During my MA in Higher Education I developed an editor that enhances some Java error messages, and I have recently published some of this work at SIGCSE (see brettbecker.com/publications). I hope to do some more work on this front soon, and in addition I would like to look more deeply at what effects cascading error messages have on novices. I can imagine that if I had no programming experience, was learning Java, and came across the above, I would probably be pretty discouraged.

The enhanced error message that my editor would provide for the above code, reported side-by-side with the raw Java error output above, is:

Looks like a problem on line number 2.

Class Hello has 1 fewer opening brackets '{' than closing brackets '}'.
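For anyone curious how a message like this can be produced, a check as simple as counting brackets gets you most of the way there. The sketch below is not how my editor actually does it – the class name and output wording are hypothetical, and it naively counts every '{' and '}' in the file, so brackets inside string literals, character literals, or comments would throw it off – but it illustrates the idea.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of a bracket-balance check -- not the editor's actual
// implementation. Pinpointing the offending line takes more work than this.
public class BraceBalance {
    public static void main(String[] args) throws IOException {
        String source = Files.readString(Path.of(args[0]));
        long opening = source.chars().filter(c -> c == '{').count();
        long closing = source.chars().filter(c -> c == '}').count();
        if (opening < closing) {
            System.out.printf("This class has %d fewer opening brackets '{' than closing brackets '}'.%n",
                    closing - opening);
        } else if (opening > closing) {
            System.out.printf("This class has %d more opening brackets '{' than closing brackets '}'.%n",
                    opening - closing);
        }
    }
}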