Berlin Buzzwords 2015 – Day 1


Analytics in the age of the Internet of Things (Ludwine Probst)

Basically a talk about analyzing a demo dataset of sports activity with Spark. Not much new stuff in there, but beautifully hand-illustrated slides.

Real time analytics with Apache Cassandra and Apache Spark (Christopher Batey)


Good speaker, awesome talk. Some takeaways:

  • One should read the Dynamo paper
  • You can (mis)use the datacenter awareness of Cassandra for isolating workloads if you run Spark on top of it.
  • 500ms is the lowest useful microbatch length

Application performance management with open source tools (Tudor Golubenco, Monica Sarbu)


Packetbeat, which apparently just joined Elastic, is an application monitoring solution that works at the TCP layer. So what they basically do is understand your protocol (HTTP, Redis, Postgres) and give you metrics directly from the traffic (how long did my HTTP request take?). Sounds really interesting, as it doesn’t need any integration. It will be integrated into the ELK stack and get some more data providers. I really like the idea.

Practical t-digest Applications (Ted Dunning)


t-digest is an algorithm to get realtime quantiles/percentiles out of your data. That comes in handy if you want to have the data always at your fingertips and/or want to identify outliers. It is blazingly fast and needs constant memory, so you actually want to have it wherever you have numbers. Of course there is a Java library and an integration into Elasticsearch. Awesome speaker as well.
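
To make the idea a bit more concrete, here is a minimal sketch of a merging-style t-digest in Python. This is not code from the talk or from the Java library: the class name `TinyDigest`, the `compression` and `buffer_size` parameters and the simple linear interpolation in `quantile` are my own assumptions, and a real implementation handles many more edge cases. It only illustrates the core trick: keep a bounded number of centroids that are tiny at the tails and larger in the middle.

```python
import math
import random


class TinyDigest:
    """Simplified merging t-digest (illustrative only, names are hypothetical)."""

    def __init__(self, compression=100, buffer_size=500):
        self.compression = compression   # higher = more centroids, more accuracy
        self.buffer_size = buffer_size   # raw values held before each merge pass
        self.centroids = []              # sorted list of (mean, weight)
        self.buffer = []
        self.total = 0                   # weight held in self.centroids
        self.min = math.inf
        self.max = -math.inf

    def add(self, x):
        self.min = min(self.min, x)
        self.max = max(self.max, x)
        self.buffer.append(x)
        if len(self.buffer) >= self.buffer_size:
            self._compress()

    def _q_limit(self, q0):
        # Scale function k(q) = d / (2*pi) * asin(2q - 1); a centroid that
        # starts at quantile q0 may grow until k has advanced by 1.
        d = self.compression
        k = d / (2 * math.pi) * math.asin(2 * q0 - 1) + 1
        if k >= d / 4:                   # past the top of the scale
            return 1.0
        return (math.sin(2 * math.pi * k / d) + 1) / 2

    def _compress(self):
        # Merge old centroids and buffered points in one sorted pass.
        items = sorted(self.centroids + [(x, 1) for x in self.buffer])
        self.buffer = []
        total = sum(w for _, w in items)
        merged = []
        cur_mean, cur_w = items[0]
        done = 0                         # weight already emitted
        q_limit = self._q_limit(0.0)
        for mean, w in items[1:]:
            if (done + cur_w + w) / total <= q_limit:
                # Absorb into the current centroid (weighted mean update).
                cur_mean += (mean - cur_mean) * w / (cur_w + w)
                cur_w += w
            else:
                merged.append((cur_mean, cur_w))
                done += cur_w
                q_limit = self._q_limit(done / total)
                cur_mean, cur_w = mean, w
        merged.append((cur_mean, cur_w))
        self.centroids = merged
        self.total = total

    def quantile(self, q):
        if not 0.0 <= q <= 1.0:
            raise ValueError("q must be in [0, 1]")
        if self.buffer:
            self._compress()
        if not self.centroids:
            raise ValueError("digest is empty")
        target = q * self.total
        cum = 0
        prev_center, prev_mean = 0.0, self.min
        for mean, w in self.centroids:
            center = cum + w / 2         # rank at the middle of this centroid
            if target <= center:
                frac = (target - prev_center) / (center - prev_center)
                return prev_mean + frac * (mean - prev_mean)
            prev_center, prev_mean = center, mean
            cum += w
        # Past the last centroid's center: interpolate towards the maximum.
        frac = (target - prev_center) / (self.total - prev_center)
        return prev_mean + frac * (self.max - prev_mean)


if __name__ == "__main__":
    rng = random.Random(42)
    digest = TinyDigest()
    for _ in range(100_000):
        digest.add(rng.gauss(0.0, 1.0))
    print("median:", digest.quantile(0.5))
    print("p99:", digest.quantile(0.99))
    print("centroids kept:", len(digest.centroids))
```

The memory footprint is governed by `compression` rather than by the number of values seen, which is what makes it attractive to attach to pretty much any metric.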

The Do’s and Don’ts of Elasticsearch Scalability and Performance (Patrick Peschlow)


Basically a long reminder to RTFM. Know what data you need, know the pitfalls, disable features you don’t need and make sure that your cluster setup fits your requirements.

Detecting Events on the Web with Java, Kafka and ZooKeeper (James Stanier)


Good speaker, I think they build quite interesting stuff at Brandwatch. It was not clear to me till the end why they built all that stuff themselves, but I think they co-evolved with Storm/Spark and just made their existing software cluster-aware rather than rewriting the stuff.

Reminded me that there is Apache Curator, a set of high-level abstractions for ZooKeeper services.

Analyzing and Searching Streams of Social Media at Scale using Spark, Kafka and Elasticsearch (Markus Lorch)

IBM is using Spark. Basically they have a pretty standard setup to get a lot of data from Twitter and enrich/augment that with some of their proprietary tech (mood detection etc.). Nothing special here.

Predictive Insights for IT Operations (Omer Trajman)

Actually a pretty good speaker, but I didn’t really get the whole point. He basically explained that you should use the same big data techniques for analyzing the data that comes out of your operations measurements (and btw, he has a company specializing in that). But it wasn’t a sales talk. So basically, yes, analyze all the data.