25 October 2011

Reducing boilerplate in Scala

I've often heard claims from Scala advocates that coding in Scala rather than Java saves time, reduces the number of lines of code, removes boilerplate and makes the code clearer.
They say it makes the code easier to read.


So, as a long term project, I'm going to be translating one of our reasonably sized (~50k lines) projects from Java to Scala. I'll post something when I find something interesting.

I'm doing this to answer the following questions:
  • Can I mix and match Scala and Java? This is one of the selling points of Scala: you can use Java technology and libraries easily from Scala. My project uses (old versions of) Hibernate, Spring and Spring MVC. We'll see.

  • When I've finished, will I have fewer lines of code? Again, one of the selling points of Scala.

  • When I've finished, will my code be understandable? One of the points of contention of Scala is its perceived complexity. The old adage says: 'You can write Fortran in any language', but will we end up with a codebase which is unavoidably incomprehensible? Inherent complexity is one thing, accidental complexity is another, but will we have designed-in complexity?

  • Is the tooling up to the job? I mean in particular Eclipse, Maven and some of the other Eclipse plugins. The Scala-IDE has improved a lot recently, but is it robust enough?

One thing I want to avoid is refactoring that could be done in Java. I want to compare well written Java code with well written Scala.
I'm going to translate class by class where possible, and then improve the code, make it more idiomatic.

I am reliably informed that the best place to start is my testing code, my JUnit tests. Testing code isn't delivered to the customer, so you can try something without affecting production code.

First things first. The old project was developed using Eclipse Galileo. This is no longer an option if I'm going to use Scala, because the plugin doesn't work with it.
I'll need to upgrade to Helios.
This is essentially pain-free (except for some Maven issues, which I'll deal with later).

The project contains some SOAP services (developed using Apache Axis2). We test using a stub class and we have one test per service.
When we translate the tests directly from Java to Scala, we don't usually gain very much. For example, we have a Java method such as:
private Calendar getDate(String dateString) throws Exception {
    Calendar calendar = Calendar.getInstance();
    calendar.setTime(new SimpleDateFormat("dd.MM.yyyy").parse(dateString));
    return calendar;
}
we end up with the following Scala method:
private def getDate(dateString: String) = {
    val calendar = Calendar.getInstance();
    calendar.setTime(new SimpleDateFormat("dd.MM.yyyy").parse(dateString));
    calendar
}
So the only things we've gained are the lack of a return type (which is inferred to be Calendar) and the lack of throws Exception. Scala does not have checked exceptions, so the throws clause isn't needed.
Some methods, however, condense down a lot.
private Set getErrorCode(ErrorTo[] errors) {
    Set set = new TreeSet();

    for (ErrorTo error: errors) {
        set.add(error.getCode());
    }

    return set;
}
In Scala, this becomes:
private def getErrorCode(errors: Array[ErrorTo]) =
    new TreeSet(errors.map(_.getCode).toSet)
There is actually quite a lot to see here. Scala is much more expressive when dealing with collections. The map() method applies a function to every entry in a collection, in this case an Array, and returns a new collection of the results. We're applying getCode to each entry in the array; _ is a placeholder for the current element. Because map() on an Array returns another Array, we're converting from an Array[ErrorTo] to an Array[String].
We then convert this to a Set (a Scala Set) and use it to populate a java.util.TreeSet, because we wish to maintain interoperability with Java. For the minute.

We're using implicit conversions to convert between Scala and Java collections. In Scala, we can define an implicit conversion between two types, so that if an expression has one type where the other is expected, the compiler inserts the conversion for us.
Here, toSet returns a Scala Set, but java.util.TreeSet doesn't have a constructor which accepts a Scala Set, so it has to be converted to a java.util.Set. The conversions are brought into scope with this import:
import scala.collection.JavaConversions._
These implicit conversions can be a performance problem sometimes, because you're potentially converting between objects multiple times, but we don't care about them here,
because this is testing code :-).

Why is Scala so much more concise than Java here? One reason is the type inference. In the Java method, we mention Set three times, in Scala only once.
That, the map() function and the lack of a return statement in Scala reduce a 7 line Java method down to a single line. It can be on a single line, so it goes on a single line. Because we can.

Next, we'll look at how we can use static methods and how to inherit them.

01 October 2011

Using git svn with a large repository

I've started using the git svn bridge for one of our projects, but I had a couple of problems with the initial clone of the repository, due to the size of some files (over 100 MB) and to the Subversion server dropping the connection.
So, I started using the standard git svn clone:
$ git svn clone https://svn.farwell.co.uk/svn/project --stdlayout
Initialized empty Git repository in c:/code/project/.git/
r1 = 339bd134b2d482cf9038c16fa75f93255ebfbc1a (refs/remotes/trunk)
W: +empty_dir: trunk/blah1
W: +empty_dir: trunk/blah2
W: +empty_dir: trunk/blah3
W: +empty_dir: trunk/blah4
....
The --stdlayout option means that git expects the trunk to be called trunk, tags to be called tags and branches to be called branches.
Note also that you need to specify the URL without the trunk at the end. This ran for a while, and then fell over because the svn server dropped the connection on me: there is a timeout on the server.
RA layer request failed: REPORT request failed on '/svn/project/!svn/vcc/default': REPORT of '/svn/project/!svn/vcc/default': Could not read chunk delimiter: Secure connection truncated (https://svn.farwell.co.uk) at C:\Program Files (x86)\Git/libexec/git-core/git-svn line 5114
We need to load in batches. git svn fetch has a -r option to allow you to specify the range of revisions to fetch. We've got some large files, so we'll do 10 at a time.
I started again:
$ git svn clone https://svn.farwell.co.uk/svn/project \
--stdlayout -r1:2
which fetched the first two revisions, but we still have to fetch the rest, about 1000 revisions. I used a quick Perl script.
my $count = 1;

while ($count <= 1000) {
    # executes git svn fetch -r1:11 etc.
    my $cmd = "git svn fetch -r$count:" . ($count + 10);
    print "$cmd\n";
    system($cmd);
    $count += 10;
}
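For comparison, the same batching can be sketched as a small shell script. This is only a sketch, not what I actually ran: the function names revision_ranges and fetch_in_batches are made up for illustration, and unlike the Perl loop it uses non-overlapping ranges and stops at the first failed fetch, so that a run can be restarted from the range that failed.

```shell
#!/bin/sh
# Hypothetical helper: print one "start:end" revision range per line,
# covering first..last in non-overlapping batches of the given size.
revision_ranges() {
    first=$1
    last=$2
    size=$3
    start=$first
    while [ "$start" -le "$last" ]; do
        end=$((start + size - 1))
        # Clamp the final batch to the last revision.
        if [ "$end" -gt "$last" ]; then
            end=$last
        fi
        echo "$start:$end"
        start=$((end + 1))
    done
}

# Hypothetical helper: fetch each batch in turn, returning non-zero
# as soon as one git svn fetch fails.
fetch_in_batches() {
    for range in $(revision_ranges "$1" "$2" "$3"); do
        echo "git svn fetch -r$range"
        git svn fetch "-r$range" || return 1
    done
}
```

Since the initial clone already fetched r1:2, calling fetch_in_batches 3 1000 10 from inside the new repository would carry on from revision 3.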
But then we hit another problem, and this time it's more serious: git ran out of memory and crashed. Our big files are to blame again. This is the error message:
Out of memory during "large" request for 268439552 bytes, total sbrk() is 140652544 bytes at /usr/lib/perl5/site_perl/Git.pm line 898,  line 3.
git svn uses Perl to download and process the files, but it slurps each file into memory in one go. So for our large files, it runs out of memory.

After a bit of searching on the internet, I found a solution on GitHub for our problem: Git.pm: Use stream-like writing in cat_blob().
This is a fairly simple patch, which doesn't seem to have made it into a release yet, so I applied it manually to C:\Program Files (x86)\Git\lib\perl5\site_perl\Git.pm.
@@ -896,22 +896,26 @@ sub cat_blob {
     }
     my $size = $1;
-
-    my $blob;
     my $bytesRead = 0;

     while (1) {
+        my $blob;
         my $bytesLeft = $size - $bytesRead;
         last unless $bytesLeft;

         my $bytesToRead = $bytesLeft < 1024 ? $bytesLeft : 1024;
-        my $read = read($in, $blob, $bytesToRead, $bytesRead);
+        my $read = read($in, $blob, $bytesToRead);
         unless (defined($read)) {
             $self->_close_cat_blob();
             throw Error::Simple("in pipe went bad");
         }

         $bytesRead += $read;
+
+        unless (print $fh $blob) {
+            $self->_close_cat_blob();
+            throw Error::Simple("couldn't write to passed in filehandle");
+        }
     }

     # Skip past the trailing newline.

@@ -926,11 +930,6 @@ sub cat_blob {
         throw Error::Simple("didn't find newline after blob");
     }

-    unless (print $fh $blob) {
-        $self->_close_cat_blob();
-        throw Error::Simple("couldn't write to passed in filehandle");
-    }
-
     return $size;
 }
I restarted the process from the beginning and voilà, it got to the end. All of the revisions had been fetched, so all that was left to do was a
$ git svn rebase
to merge the changes into the tree and have a working git repo.

If I had wanted to migrate from svn to GitHub, rather than continue to use git svn, I'd have done exactly the same thing, but added --no-metadata to the clone command.
And obviously you then don't need to do a git svn rebase, just a rebase.