Serverless Big Data

It’s been a long time since I’ve blogged here, but I figured a post is overdue. I’ve been busy in the Azure world and am really excited to see how messaging has started to shape the modern cloud landscape. Creating and shipping Azure Event Grid has perhaps been the highlight of my career thus far, certainly at Microsoft. I’ll write a long-overdue blog about that shortly.

Seeing Azure Event Hubs grow to 2 trillion requests per day in the time I’ve been with it has also been a great experience. In this time, not only has messaging become core to the cloud in general and Serverless in particular, but new and exciting patterns are starting to emerge. One of the most exciting things this month has been seeing the traction our quietly released Apache Kafka endpoint for Event Hubs is gaining and the new directions it is driving. The Kafka endpoint feature is available in some regions of Azure today, and you can give it a try by following the quick start. It allows you to use Kafka producers and consumers to read from and write to an Event Hub.
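Connecting an existing Kafka client to an Event Hubs namespace comes down to pointing it at port 9093 and authenticating over SASL PLAIN, with the namespace connection string as the password. Here is a minimal sketch of that configuration in Python; the namespace, hub name, and connection string are placeholders, and the commented-out producer lines assume the kafka-python package:

```python
# Placeholders - substitute your own namespace and connection string.
namespace = "mynamespace"
connection_string = "Endpoint=sb://mynamespace.servicebus.windows.net/;..."

# Event Hubs exposes its Kafka endpoint on port 9093 over SASL_SSL.
# The SASL username is the literal string "$ConnectionString" and the
# password is the namespace connection string itself.
kafka_config = {
    "bootstrap_servers": f"{namespace}.servicebus.windows.net:9093",
    "security_protocol": "SASL_SSL",
    "sasl_mechanism": "PLAIN",
    "sasl_plain_username": "$ConnectionString",
    "sasl_plain_password": connection_string,
}

# With kafka-python installed, this config can be passed straight through:
#   from kafka import KafkaProducer
#   producer = KafkaProducer(**kafka_config)
#   producer.send("my-event-hub", b"hello from a Kafka producer")
#   producer.flush()
print(kafka_config["bootstrap_servers"])
```

The point is that nothing on the client side is Event Hubs specific: any Kafka producer or consumer that can speak SASL_SSL can be repointed this way.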

By combining this with the Azure Functions Event Hubs binding, you can create a Serverless Kafka processor in just a few minutes. The blog Processing 100,000 Events Per Second on Azure Functions shows how easily this Serverless processing scales. This means you can start to create truly Serverless Big Data solutions on Azure today. It is an exciting time, and we will see more development in this space as users of these platforms drive innovation on what we and other cloud providers are starting to offer.
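As a sketch of the consuming side, the Event Hubs trigger for a function is declared in that function’s function.json; the hub name and connection setting name below are placeholders for your own values:

```json
{
  "bindings": [
    {
      "type": "eventHubTrigger",
      "direction": "in",
      "name": "events",
      "eventHubName": "my-event-hub",
      "connection": "EventHubConnectionAppSetting",
      "consumerGroup": "$Default",
      "cardinality": "many"
    }
  ]
}
```

With `cardinality` set to `many`, the runtime delivers events to your function in batches, which is what makes the high-throughput numbers in that blog post achievable.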

Give this stuff a try and tell me what you think!


4 Reasons the “Smart” Grid is Dumb

Disclaimer – These are my opinions and mine alone. They do not represent the views of my employer or any organization I am a part of.

I work heavily with “Smart” technologies: in the energy and utilities sector, in the manufacturing sector, and in telemetry covering retail and a few other areas. Over the last few years my work in Smart Grid has been fairly extensive. If you don’t know already, “Smart Grid” basically means advanced telemetry built into every segment of the energy grid (strictly speaking, smart meters and the smart grid are different things, but to most people the terms are interchangeable). In my time implementing and consulting in this area, I’ve come to see that there are a few really dumb things about the smart grid.

1) Lack of true standards at almost every level – Technology standards are what make the world interoperable. Ever send a text message, use the Internet, or make a call from your mobile? That’s because standards allow devices and equipment from many vendors to work together – in two of those examples the standards come from the GSM Association. It is this interoperability that provides long-term viability for the overall market: for the vendors, for the providers, and for the users. There are very few standards in the Smart Grid arena, and there is almost no equipment interoperability. The really bad part is that Smart Meters aren’t that different from the rest of the Internet of Things (IoT) and should be sharing standards with the rest of the IoT ecosystem.

2) No cloud first implementation strategy – None of the major vendors in the area are pursuing a cloud first strategy. From a technology standpoint, most of this twenty-first century infrastructure is being solved with late twentieth century architecture. There is a lot of expensive on-premises technology that would feel right at home in the late 1990s. Cloud is important for valid reasons on both ends of the utility spectrum: small and large. Small utilities require a cost-effective solution to implement this technology and realize the benefits. They cannot afford expensive highly available platforms, and their small load factors don’t require them, yet the industry at large only offers them expensive on-premises solutions that are overkill for most. Large utilities face another problem that a cloud first strategy would solve: scale. A large utility is going to have millions of meters, each providing telemetry at intervals as short as 15 minutes. This is going to create a lot of data. Let’s look at an example:
5 million meters x 96 readings per day (i.e. 4×24) = 480,000,000 readings

This is just meters! Telemetry on the distribution side could be even larger, as those readings are likely to be more frequent. The result: some seriously Big Data (another blog on that shortly). The load from the meters alone breaks down to roughly 5,555 readings per second on average, 24 hours a day, 7 days a week. Although that number is not that big, these events are likely to arrive in huge bursts. The software and platforms being selected to handle this load are not up to the task on either the messaging (delivery) or data (processing / storage) side of the challenge. Many vendors and their relational / legacy data platforms think this will scale just fine – throw more hardware at it. It also lets them sell more licenses and hardware. Unfortunately, it just won’t work.
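The back-of-the-envelope numbers above are easy to check:

```python
meters = 5_000_000
readings_per_meter_per_day = 4 * 24            # one reading every 15 minutes
readings_per_day = meters * readings_per_meter_per_day
avg_per_second = readings_per_day / (24 * 60 * 60)

print(readings_per_day)       # 480,000,000 readings per day
print(int(avg_per_second))    # ~5,555 sustained average, before bursts
```

And that sustained average is the floor: a platform sized for 5,555 events per second falls over the first time a reporting window causes a burst many times that rate.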

3) Lack of publish subscribe architectures – Building on issue 2, there is the very serious technical aspect of architecture to be addressed. To be sure, we’re early in this Smart Grid game, but most of the solutions so far are trying to use web services at best, and sometimes just batch processing, to handle this data. This is a true travesty that I think may be the result of some insular group think. Even when web services are used, they often don’t incorporate the WS-* standards and almost always rely on polling, which also doesn’t scale. The environment that ultimately develops ends up being an archipelago of services and data that do not build broad-scale extensibility into their design. Most of these architectures end up causing load and scale problems, so the vendors and users fall back on batch processing. This greatly diminishes the value of Smart data, as it arrives with a delay that stops it from being used in real time processing scenarios – which promise to provide the greatest innovation in the arena. Ultimately, Smart systems need true publish subscribe capabilities built into their core to provide scale and extensibility. This is the only way to facilitate the development and addition of new components and capabilities without reengineering an expensive and possibly brittle implementation. But what sort of features and capabilities would require this architecture? Glad you asked! Perhaps things like real time analytics for failure prediction, demand shifts, and weather patterns. Like the Internet, it is not so much what we have already thought of that will make Smart systems successful, but what we will think of once a solid platform is in place. Publish subscribe is the key to extending these platforms and unlocking their true value in the future – ideally with standardized protocols that create an open ecosystem.
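To make the architectural point concrete, here is a toy in-process sketch of the publish subscribe idea: producers publish to a topic without knowing who consumes, and new capabilities are added by subscribing rather than by reengineering the pipeline. All of the names here are illustrative; in a real Smart Grid deployment this role would be played by a broker speaking a standard wire protocol, not in-memory Python.

```python
from collections import defaultdict

class Broker:
    """Toy topic-based publish-subscribe broker (in-process only)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        # New capabilities attach here without touching the publishers.
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        # The publisher has no knowledge of who, or how many, consume.
        for handler in self._subscribers[topic]:
            handler(message)

broker = Broker()
alerts = []

# Two independent consumers of the same meter readings:
broker.subscribe("meter.readings",
                 lambda r: alerts.append(r) if r["kwh"] > 10 else None)
broker.subscribe("meter.readings", lambda r: None)  # e.g. archival, analytics

broker.publish("meter.readings", {"meter": "A1", "kwh": 12.5})
print(alerts)  # the high-usage reading reached the alerting subscriber
```

Adding a third consumer – say, a failure-prediction model – is one `subscribe` call; in a polling or batch architecture it would mean another scheduled job and another load on the source system.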

4) Heavy vendor lock-in – This last point is really a culmination of all the others. Vendors produce their own parts of this Smart ecosystem with little thought for the larger environment and with a desire to protect revenue over a relatively short horizon. This manifests itself in single-vendor meter networks, closed platforms, and limited extensibility. I know we’re all in business to make money, but if the ecosystem isn’t healthy and providing choice and competition, then this money will be short-lived for “Smart”, as much of the value will be difficult to realize and innovation will be slowed. It is still early in this technology’s life, so I think this will change as the industry matures and vendors realize they can all have slices of a bigger pie if they embrace interoperability.

The good news is that there is hope. We are very early in the creation of the Smart ecosystem, and some participants are starting to take notice, much like mobile operators did in the past. Standards like AMQP are providing wire-level interoperability for a publish subscribe architecture that is vendor agnostic and free to use. Some utilities are starting to demand support for robust open protocols. I have particularly seen this in European utilities, where I believe there may be more historical precedent for interoperability. Some members of this community are starting to look beyond the Utilities sector for inspiration and advice from other industries that have faced these exact challenges in the past, like telecommunications, financial services, and banking. All of these things bode well and, if embraced, will stop the Smart Grid from being so dumb. It will be interesting to see.

Apache Storm on Windows

In a February release, the Apache Storm community added Windows platform support in Storm 0.9.1.

I for one have been very excited to see this. The Hortonworks distribution of Hadoop (HDP) is the only one that runs on both Windows and Linux, and this gives a lot more choice to traditional enterprise clients. I’ve been working with HDP for about a year and a half now and really like the experience – both on Linux and on Windows.

Storm is a very exciting development in real time data processing on a Hadoop cluster. It is useful for running models that you’ve created through more traditional batch processing and MapReduce within Hadoop. Storm uses a simple spout and bolt topology for processing tuples of information at scale and in real time. More information can be found on the Storm site.
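Storm itself is a JVM system (topologies are typically written in Java, with other languages supported via its multi-lang protocol), but the spout/bolt model is easy to illustrate. This toy Python sketch mimics the flow: a spout emits a stream of tuples, and a bolt processes each tuple as it arrives rather than waiting for a batch. The names and classes here are illustrative, not Storm’s actual API.

```python
def sensor_spout():
    """A spout is a source of tuples; here, a fixed stream of readings."""
    for reading in [("sensor-1", 3), ("sensor-2", 7), ("sensor-1", 5)]:
        yield reading

class CountBolt:
    """A bolt consumes tuples one at a time and maintains running results."""

    def __init__(self):
        self.totals = {}

    def execute(self, tup):
        sensor, value = tup
        self.totals[sensor] = self.totals.get(sensor, 0) + value

# "Topology": wire the spout's stream into the bolt.
bolt = CountBolt()
for tup in sensor_spout():
    bolt.execute(tup)

print(bolt.totals)  # {'sensor-1': 8, 'sensor-2': 7}
```

In real Storm, the framework parallelizes spouts and bolts across the cluster and handles tuple acking and replay; the per-tuple `execute` shape, though, is the same idea.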

I am now wondering whether this technology will make its way into the Windows Azure HDInsight service. I certainly don’t have any inside information on this, but I’d be interested to see it.

Upcoming Speaking Engagements

This is a busy month for me. I will be at both the GPU Technology Conference and Hadoop Summit Europe. Both events are in the same week, with my dates on March 19th and 21st respectively, which will make for some fun travel. Both promise to be amazing conferences with a lot of knowledge sharing, and I am honored to be a part of each.

Being from the Microsoft camp as I am, both my sessions will approach these technologies from a Microsoft standpoint. In the case of GTC, this will be using .NET to write CUDA (GPU) applications. For Hadoop Summit, it will be using Hadoop within the Microsoft ecosystem (which, if you have not noticed, is a very large ecosystem).

I’m very excited for both of these events and eagerly looking forward to them and the discussions and learning that accompany both.