Berlin Buzzwords 2015 – Day 2



It’s All Fun And Games Until…: A Tale of Repetitive Stress Injury (Eric Evans)

Basically, watch out for yourself. If it hurts, you are doing something wrong. It’s good to be reminded about that from time to time. So, watch out for yourself:



A complete Tweet index on Apache Lucene (Michael Busch)

Michael Busch has given this talk in one version or the other for a couple of years now. Unfortunately it got more shallow now, not so many technical details about how they optimized Lucene for Twitter. Numbers are great, they have two billion queries per day and about 500 million tweets per day. One thing that he didn’t mention in his earlier talks is that they actually figured that Earlybird does scale to Twitter level requests due to the Earthquake in Japan when they had to emergency shutdown the caching layer in front (which was Ruby on Rails and did not scale that well). Nowadays they have all tweets in a pretty vanilla Lucene with some additions (going to be open sourced soon) and use a Mesos cluster in case they have to reindex all the data.

And BTW: Tweet IDs encode the timestamp the tweet was send at.


Automating Cassandra Repairs (Radovan Zvoncek)

First time speaker Radovan did a pretty good job. Apparently there are several ways to get to the “consistency” of the “eventual consistency” in Cassandra, which are Read Repair, Hinted Handoff and full blown anit-entropy repairs. The latter ones apparently can lead to a lot of problems if done improperly, so Spotify build something to manage that: Reaper.  The problems are usually due to disk IO limits, network saturation or just plain full disks. Spotify Reaper orchestrates anti-entropy repairs to make them reliable.

I’m still somewhat confused that one aparently has to spend a lot of time repairing Casandra clusters. I always thought that was what Cassandra was doing.


Diving into Elasticsearch Discovery (Shikhar Bhushan)

For all the people who forgot, like me, how ES clustering works this was a good reminder. Plus I learned that discovery is pluggable, so you can write your own plugin to provide the clustering part for ES. He apparently did and wrote Eskka, an Akka based clustering approach. Writing your own apparently isn’t that much fun because APIs change all the time. Just in case you forgot, Zen is the default way ES clusters.


Change Data Capture: The Magic Wand We Forgot (Martin Kleppmann)

We all know the problem Martin was describing: same data in different form, like in your database, in your cache, in your search engine. He went back to the “Change Data Capture” principle, which basically says “save once, distribute everywhere”. So in order to realize that he wrote a PostgreSQL plugin “Bottled Water” which gets the changes from Postgres and posts them to a Kafka topic. Yay for the best project name this year in the category: will never find that on google.

His implementation and idea is solid, the problem is that it is a Kafka topic per table, so you actually loose the transaction when reading from Kafka. Otherwise it is transaction save, messages are only sent when the transaction in Postgres commits. He uses Avro on the wire and transforms the Postgres DDL Schema to an Avro schema.

If you want to get your transaction back you would need a stream processor (Storm/Spark) downstream to reassemble your transactions. Might be a good idea if you already have a Postgres DB or rely on some special properties of a centralized Datastore, otherwise it is OK if your microservices write directly to Kafka.

Has someone actually coined the word “nanoservices” yet for designs that basically do just one thing? Like take the request, write it to a queue (Kafka) and all other processing taking place by consumers down the queue that do just one thing as well.


Designing Concurrent Distributed Sequence Numbers for Elasticsearch (Boaz Leskes)

Elasticsearch is rewriting the way they do distributed indexing based on the Raft Consensus Algorithm. Sounds great, they are mitigating a lot of problems they do have right now.


Apache Lucene 5 – New Features and Improvements for Apache Solr and Elasticsearch (Uwe Schindler)

Apparently, Lucene 4 broke a lot of indexes due to it’s build in backward compatibility to Lucene <=3. With two big companies actually relying on Lucene, that kind of amazes me.

Lucene 5 gets rid of all this legacy stuff and drops support for older indexes. Plus it adds a lot of data safety features when it comes to on-disk indices like checksums and sequence numbers. So, Solr and Elasticsearch should finally be production ready …. ;).

JDK seems to keep breaking Lucene (remember that the initial JDK 7 release broke Lucene?), apparently one should not use G1 GC with Lucene (es? Solr?).

And Lucene 5 uses a lot of the “new” JDK 7 APIs for IO to finally get the index safely to disk.



Real-Time Monitoring of Distributed Systems (Tobias Kuhn)

Less distributed, more of Real-Time monitoring. Apparently they build their own system for analyzing their loggs for anomaly detection, punnily named Anna Molly, which was open sourced now.

They made pretty clear that thresholds are not enough if you have a highly dynamic system that can change on multiple dimensions any time. Seasonality of your date makes it even harder to define useful thresholds. There are a couple of algorithms which can be used for anomaly detection, namely Tukey’s outlier detection and seasonal trend decomposition. And T-digest comes to the rescue of course.

For monitoring they actually use a cascade of statsd and carbon.


To sum up bbuzz 2015:





Berlin Buzzwords 2015 – Day 1


Analytics in the age of the Internet of Things (Ludwine Probst)

Basically a talk about analyzing a demo dataset from sports activity via Spark. Not so much new stuff in there, but beautifully manually illustrated slides.

Real time analytics with Apache Cassandra and Apache Spark (Christopher Batey)


Good speaker, awesome talk. Some takeaways:

  • One should read the dynamo paper
  • You can (mis)use the datacenter awareness of Cassandra for isolating workloads if you run spark on top of it.
  • 500ms is the lowest usefull microbatch length

Application performance management with open source tools (Tudor Golubenco, Monica Sarbu)


Packetbeat, which apparently just joined Elastic, is a TCP Layer application monitoring solution. So what they basically do is understand your protocol (HTTP, Redis, Postgres) and give you metrics directly from the traffic (how long did my HTTP request take?). Sound really interesting, as it doesn’t need any integration. Will be integrated into the ELK-Stack and get some more data providers. I really like the idea.

Practical t-digest Applications (Ted Dunning)


t-digest is an algorithm to get realtime quantiles/percentiles out of your data. That comes in handy if you want to have the data always at your fingertips and/or want to identify outliers. It is blazingly fast and needs constant memory, so you actually want to have it wherever you have numbers. Of course there is a Java-Library and an integration into Elasticsearch. Awesome speaker as well.

The Do’s and Don’ts of Elasticsearch Scalability and Performance (Patrick Peschlow)


Basically a long reminder to RTFM. Know what data you need, now the pitfalls, disable features you don’t need and make sure that your cluster setup fits your requirements.

Detecting Events on the Web with Java, Kafka and ZooKeeper (James Stanier)


Good speaker, I think they build quite interesting stuff at Brandwatch. It was not clear to me till the end why the built all that stuff themselves, but I think they co-evolved with Storm/Spark and just made their existing software cluster aware rather than rewriting the stuff.

Reminded me that there is Apache Curator, a set of high level abstractions for Zookeeper services (

Analyzing and Searching Streams of Social Media at Scale using Spark, Kafka and Elasticsearch (Markus Lorch)

IBM is using Spark. Basically they got a pretty standard setup to get a lot of data from Twitter and enrich/augment that with some of their proprietary tech (mood detection etc.). Nothing special here.

Predictive Insights for IT Operations (Omer Trajman)

Actually a pretty good speaker, but I didn’t really get the whole point. He basically explained that you should use the same big data techniques for analyzing the data that comes out of your operations measurements (and btw, he has a company specializing on that). But it wasn’t a sales talk. So basically, yes, analyze all the data.


Programming Concurrency on the JVM

The fun thing about Venkat Subramaniam’s latest book is the way he jumps to and fro between a couple of JVM based languages: Java, Scala and Clojure. He shows interesting ways to do given tasks in different languages and introduces a couple of interesting frameworks I at least have not heard of before. The most interesting frameworks he shows are Akka, which is an Actor Based Concurrency Framework, written in Scala but which comes with an API that is just as well usabale from Java. Furthermore, I knew the concepts of Software Transactional Memory but did not know that there are working implementations for the JVM. One of them beeing Multiverse.


I really liked the book, because I really like the idea of using a framwork implemented in Scala through its Java API in Groovy. Sometimes it gets quite tiresome to have many of the same examples just shown in different languages, but it is always good to practice Scala or Clojure reading skills. If you feel safe in Java concurrency, i.e. you now Concurrency in Practice by heart, I warmly recommend this book.

Secrets of the JS Console

For some reason, up to until three weeks ago, I didn’t know that there is more to an API in the JS console than “console.log”. They API has been defined by Firebug first and is now more or less considered standard in all major browsers. Here is a short list of usefull commands:

$$('.foo') // gives you a shortcut to querySelectorAll

$0 // gives you the result of the last executed call

$1 // gives you access to the currently selected DOM Element in the Inspect view

clear() // clears the console

time('foo') // start Timing for foo

timeEnd('foo') // end Timing for foo

inspect( domElement  ) // shows the given dom element in Inspect view

The documentation can be found either in the Firebug Wiki or at Chromes Devtools Docs.

Google Merchant Module for ROME

Google Merchant is the way a company can feed their products into Google Shopping and ROME is a Java Library to create and parse RSS and ATOM feeds. Google Merchant allows RSS as one of the feed formats you can use to feed them their data. In order to get additional information into ROME created feeds you have to implement 4 classes, so it is not really easy to get your data there.

I created a small Library to make RSS feeds that conform to Google Merchants requirements:

        final SyndFeed feed = new SyndFeedImpl();
        feed.getModules().add( new GoogleMerchantModuleImpl() );

        final SyndEntry entry = new SyndEntryImpl();

        entry.setTitle( "Title" );

        final GoogleMerchantModule merchantData = new GoogleMerchantModuleImpl();
        merchantData.setImageLink( "SOME IMAGE URL" );

        entry.getModules().add( merchantData );

It’s under MIT License up in Github, you can direct download the Version 0.1 here.

Code with style: Readability is everything

So, by now everybody got it: Code is there to be read by people, to be analyzed by people and to be understood by people. The fact that you can put it through a compiler and run it is a nice sideffect, but nothing to focus on. Besides writing readable tests of course.

But when software is growing and many different hands touch the same spots, it somehow gets dirty. So even when you have usually quite high coding standards, it still can happen that I stumble upon something like this:

User user = userService.createNewUser( email, password,
                                       false, true, null, 5 );

I like to be able to read a line of code like a line of text. Which means that I at least want to be able to get the “words” without the surrounding context.

So what would you be able to tell about the above snipplet? I would say: it apparently creates a new user, using some service and it requires an email and a password, which might be stored somewhere. And the point that makes me cry: apparently some magic flags, a nullable parameter and a magic number.

So lets see how we can clean this up with some new patterns. I usually make up my own names, if someone has a rather more common name for them feel free to comment, I will  be happy to replace them.

Enum as Flag Pattern (aka do not use boolean literals) 

enum FailIfExists { YES, NO };
enum NewPasswordChecks { YES, NO }

User user = userService.createNewUser( email, password,
                                       null, 5 );

So, what can you tell about the method call now. The problem of methods that react to flags aside, the readability is better. You at least can deduct now that there might be no error if the user already exists and that it uses the “new” password checks. It is even easier to refactor to use the third password validation algorithm should it ever be changed again.

When setting up a new project you might try to disallow all boolean literals in certain high level classes, I don’t know if it works that well with third party libraries. This might be a cool Checkstyle rule to try.

No nullable Parameters Pattern (aka keep your implementation details internal)

So what is my next problem? Well, the given null parameter value. I believe that the ability to cope with null values is an implementation detail that belongs in the class that defines a method. So the UserService interface should provide us with an overridden version createNewUser that does not have the fifth parameter. Then the implementation could hide the fact that this parameter is really nullable. And it avoids the clutter of methods that have n optional object parameters.

If you use Findbugs in combination with the great JSR 305 annotations, which by the way can be enforced using a Checkstyle plugin, you might try to disallow using the Nullable Annotation in public methods. Maybe even for protected methods. In any case, you should never have to use a null parameter while calling a visible method of another class.

No literals outside constant definitions (aka give names to your magic numbers)

The last thing is a classic, but there are still a couple of people who do not use constants instead of literals. I think the general rule is that you should never use a numeric literal inside a method, but always a class declared constant. Furthermore this might even be extended to be valid for String literals and as told in the first point to boolean literals.

So let’s revisit my short (and bad) example taking the above mentioned points into consideration:

enum FailIfExists { YES, NO };
enum NewPasswordChecks { YES, NO }

private static final int INITIAL_NUMBER_OF_INVITES = 5;

User user = userService.createNewUser( email, password,
                                       INITIAL_NUMBER_OF_INVITES );

Given that you might see this lines of code during a review, where you can’t browse the complete source of your project, one might now be able to better understand what this method does: it creates a new user with the given email and passwords, succeeds even when the user already exists, uses some mythical new password check and gives him an initial number invites of five. This can be guessed by just reading code, without a single line of JavaDoc and even without once visiting the declaring class or interface.

So in future, when writing code, try to consider if you would understand your method calls without ever having seen the implementation or even the declaration of the called method.

Quickhacks: Visualizing contributions with git

Contributions for Git

So I had a little fun with git history visualization. Gource is there to show the progress of a repo over time, but I always wanted to get an actual snapshot of the repo at a given point in time. I wrote a little program, that basically does a “git ls-files” and then a “git blame -e” for all files that seem to  be source files. Then it resolves the Email-Addresses using gravatar and scales the image relative to the amount of lines the contributor is currently blamed for. The above image shows the visualisation for Git itself. It still needs some polishing, the algorithms for packing and sizing are all a bit off and not really correct, but mostly it creates nice imags. It is up on github if you want to play with it.

Contributions by Author for Prototype (the js library)
Contributions for pdf.js