A spelling corrector in Scala

Inigo Surguy


Scala is a modern object-functional language that runs on the Java Virtual Machine. Unlike other JVM languages such as Jython, JRuby and JavaScript, it's statically typed, but it uses type inference to reduce the amount of type declaration boilerplate required. I've been interested in it for some time (and in Nice, a similar language, before that) but until now hadn't got round to doing anything in it. Two things inspired me to start. First, the release of the Scala plugin for IntelliJ IDEA, which has the potential to make the Scala environment as powerful as the Java environment. Second, I saw a reference to Peter Norvig's code for spelling correction in Python, and to various Haskell conversions of it. This is a similar conversion of Norvig's code into Scala.

The code

import Console._
import scala.collection.mutable.HashMap
import java.io.File
import org.apache.commons.io.FileUtils

package net.surguy.scala {
object Main {
    // A HashMap that computes a value for missing keys, like Python's defaultdict
    class DefaultDict[K, V](defaultFn: (K) => V) extends HashMap[K, V] {
        override def default(key: K): V = defaultFn(key)
    }

    // Lower-case the text and split it on anything that isn't a letter
    def words(text: String): Array[String] = text.toLowerCase().split("[^a-z]")

    // Count how often each word occurs; unseen words default to a count of 1.
    // The return type is scala.collection.Map so that a mutable map can be returned.
    def train(features: Array[String]): scala.collection.Map[String, Int] = {
        val model = new DefaultDict[String, Int](k => 1)
        for (f <- features if f != "")
            model(f) = model(f) + 1
        model
    }

    val NWORDS = train(words(FileUtils.readFileToString(new File("big.txt"))))
    val alphabet = "abcdefghijklmnopqrstuvwxyz"

    // Every string that is one deletion, transposition, alteration or insertion away
    def edits1(word: String): Set[String] = {
        val n = word.length() - 1
        val deletions = for (i <- (0 to n).toList)
            yield word.substring(0, i) + word.substring(i + 1)
        val transpositions = for (i <- (0 to n - 1).toList)
            yield word.substring(0, i) + word(i + 1) + word(i) + word.substring(i + 2)
        // 0 to n (not n - 1), so that the last character can be altered as well
        val alterations = for (i <- (0 to n).toList; c <- alphabet)
            yield word.substring(0, i) + c + word.substring(i + 1)
        val insertions = for (i <- (0 to n + 1).toList; c <- alphabet)
            yield word.substring(0, i) + c + word.substring(i)
        Set[String]() ++ deletions ++ transpositions ++ alterations ++ insertions
    }

    // Known words that are two edits away
    def known_edits2(word: String): Set[String] = {
        val distance1 = for (e1 <- edits1(word)) yield e1
        for (d <- distance1; e2 <- edits1(d); if NWORDS.contains(e2)) yield e2
    }

    def known(words: Set[String]): Set[String] = words.filter(s => NWORDS.contains(s))

    // Prefer the word itself, then known words one edit away, then two edits away
    def correct(word: String): String = {
        val w = known(Set(word))
        val w1 = known(edits1(word))
        val w2 = known_edits2(word)
        val candidates = if (w.isEmpty) if (w1.isEmpty) w2 else w1 else w
        // Pick the candidate that occurs most often in the training text
        candidates.foldLeft("")((a, b) => if (NWORDS(a) > NWORDS(b)) a else b)
    }
}
}



I've written very little Scala, and this is pretty much a straight translation of the Python code. I suspect it's not too far from an idiomatic Scala implementation, though - list comprehensions (for comprehensions, in Scala's terms) are as standard in Scala as they are in Python. Unlike Python, Scala doesn't come with a defaultdict class, but it's trivial to write one. The code is almost as readable as the Python, although it suffers a little from the lack of Python string slicing, and the list comprehension syntax seems slightly clunkier. The type declarations imposed no pain at all. Interoperability with the Java libraries was also very useful. Overall, I'm very happy with the Scala language - but a good programming environment is about more than just the language.
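On the DefaultDict point, subclassing HashMap is not the only option. Where the standard library offers `withDefaultValue` on mutable maps (an assumption about the Scala version in use, and not something the code above relies on), `train` could be sketched without a custom class. The object name `DefaultCounts` is made up for illustration.

```scala
import scala.collection.mutable

object DefaultCounts {
  // Assumption: mutable.Map supports withDefaultValue in the library version used.
  // Unseen words read as 1, so the first occurrence of a word stores a count of 2.
  def train(features: Seq[String]): mutable.Map[String, Int] = {
    val model = mutable.Map.empty[String, Int].withDefaultValue(1)
    for (f <- features if f.nonEmpty)
      model(f) = model(f) + 1
    model
  }
}
```

Reading a missing key returns the default but does not store it, so `contains` still only reports words that were actually seen in training.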

The IDEA Scala plugin is good, but not yet good enough. It was very useful to get immediate feedback when I used incorrect syntax, without having to compile, and it was very easy to browse the library source code, but it doesn't yet have anything like the convenience of using IDEA for Java. My hope is that Dave Griffith, author of some fantastic IDEA plugins, has time to improve it still further. My productivity at the moment in Scala is vastly less than in Java, and although this is partly due to unfamiliarity, it's also partly due to IDE support (typing in method names rather than Ctrl-Enter? reading Javadoc web pages rather than Ctrl-Q? no refactoring? there are a hundred small things that my IDE does that add up to a huge difference).

My main problem in development turned out to be a Scala bug - filtering a HashSet was broken. I spent some time trying to work out what I'd got wrong with the list comprehension syntax, before realizing that for once it actually wasn't my mistake. This shook my faith in the quality of the Scala test process. It's fixed in the latest Scala release.

The performance is far worse than the Python implementation, which surprises me. Norvig reports 15 words/second, whereas I was getting slightly less than 2 words/second. From the discussion on the Haskell implementation, I suspect I might be able to gain some time by replacing the Sets mostly with Lists, but I had expected that the naive Scala implementation would still be faster than the Python implementation without any additional work. I haven't profiled (nor am I sure whether a Java profiler will give meaningful reports for Scala). I'd be grateful for any suggestions. Update: see below.
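For anyone wanting to reproduce a words/second figure, a rough timing harness might look like the sketch below. It is not the measurement used for the numbers above: `Throughput` and `wordsPerSecond` are hypothetical names, and the corrector is passed in as a plain function so the harness stands alone. A single cold run on the JVM includes class loading and JIT warm-up, so the result should be treated as approximate.

```scala
object Throughput {
  // Hypothetical helper: runs `correct` over every test word once and
  // returns the number of words processed per second of wall-clock time.
  def wordsPerSecond(testWords: Seq[String], correct: String => String): Double = {
    val start = System.nanoTime()
    testWords.foreach(w => correct(w))
    val elapsedSeconds = (System.nanoTime() - start) / 1e9
    testWords.size / elapsedSeconds
  }
}
```

With the code above on the classpath, this could be called as `Throughput.wordsPerSecond(testWords, Main.correct)`.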

So, I'll probably keep using Scala on and off, and will develop more in it, but I'm unlikely to attempt any large projects in it until there's a more powerful IDE. Hopefully that will be very soon, because I do really like the language.

Update: David Pollak has written a much faster and neater Scala implementation. He's also provided this patch to my code:

        def known(words: List[String] ):List[String] = words.filter( s => NWORDS.contains(s) )
        def correct(word: String): String = {
            (known(List(word)) match {
              case List(word) => List(word)
              case Nil =>
                known(edits1(word).toList) match {
                  case Nil => known_edits2(word)
                  case s => s
                }
            }).foldLeft("")((a,b) => if (NWORDS(a) > NWORDS(b)) a else b)
        }

When substituted into the code above, this removes the biggest inefficiency of my version - the unnecessary call to known_edits2 when an answer has already been found. This speeds it up by about a factor of 10. In Norvig's Python version, this is done very neatly with "or" - see Pollak's version for an equally elegant Scala implementation of "or".
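To give a feel for how a Python-style "or" can be expressed, here is a minimal sketch using a by-name parameter, so that the right-hand side is only evaluated when the left-hand side is empty. This is not taken from Pollak's code; the names `OrElseNonEmpty` and `OrOps` are invented here, and the sketch assumes a Scala version that supports implicit classes.

```scala
object OrElseNonEmpty {
  // Hypothetical helper, not Pollak's implementation.
  // `alternative` is passed by name, so it is evaluated only if `self` is empty.
  implicit class OrOps[A](val self: Set[A]) {
    def or(alternative: => Set[A]): Set[A] =
      if (self.isEmpty) alternative else self
  }

  // With this in scope, the candidate selection in `correct` could be written as:
  //   known(Set(word)) or known(edits1(word)) or known_edits2(word)
  // and known_edits2 would only run when the first two sets are empty.
}
```

Because the chain is evaluated left to right, the expensive two-edit search is skipped as soon as an earlier set has a match, which is the same saving the patch achieves with pattern matching.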
