Hadoop on Windows: HDInsight – Getting Started

Hadoop has been all the rage for the last year or so, and anyone who does not know that Microsoft is very serious about Hadoop has clearly not been paying attention. HDInsight is what Microsoft is calling its suite of 100% Apache Hadoop compatible software. Microsoft refers to it as part of its “end-to-end roadmap for Big Data”, and they’re not kidding: it’s integral.

A few things may jump out from this as odd or funny. One would be: “What is Microsoft doing in the open source world?” If this is a surprise to you, then you really have been living under a rock. Microsoft is working very closely with Hortonworks and contributing heavily to Hadoop. It has also been contributing to the Linux kernel since 2009.

Like them or not, you have to give Microsoft credit for making technology easier to work with, and their work with Hadoop has been much the same. Impressively, and to their credit, they have chosen to stay 100% Apache Hadoop compatible, which will help the burgeoning ecosystem. HDInsight brings the power of Hadoop to the Windows user base and beyond.

There are two ways to get started with HDInsight. The first is to download the preview, which is a simple web installer for HDInsight (it even works on Windows 7). The preview is meant for single-node installations at this time, but will soon be expanded to multi-node. The multi-node functionality is already there, but Microsoft is still working on the security parts. The install itself is pretty easy: it’s point and click (I can’t say I’ve had that experience with most of the software in the ecosystem yet).

Once it is up and running, there are even sample applications, including a sample data generator, already in place (in C:\Hadoop\GettingStarted).

Playing with Hadoop has never been easier. With two commands at the command line (yes, the command line is used extensively, though there is a nice UI as well) you can run Hadoop jobs. Check out http://gettingstarted.hadooponazure.com/gettingStarted.html for more details.
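
To give a feel for what a two-command run can look like, here is a minimal sketch in Python that builds and (optionally) executes a pair of standard Hadoop commands: `hadoop fs -put` to copy a file into HDFS, and `hadoop jar` to run a job. I am assuming this pair for illustration; the actual commands in the getting-started samples may differ. The file name, HDFS paths, and examples jar name below are all hypothetical placeholders.

```python
import subprocess
from typing import List

# Hypothetical values for illustration only; substitute whatever your install uses.
LOCAL_INPUT = r"C:\Hadoop\GettingStarted\sample.txt"  # assumed file name
HDFS_INPUT = "/example/input/sample.txt"              # assumed HDFS path
HDFS_OUTPUT = "/example/output"                       # assumed HDFS path
EXAMPLES_JAR = "hadoop-examples.jar"                  # assumed jar name


def build_commands(local_input: str, hdfs_input: str, hdfs_output: str,
                   examples_jar: str, hadoop_exe: str = "hadoop") -> List[List[str]]:
    """Return the two command lines: upload the input, then run the job."""
    # Command 1: copy the local file into HDFS.
    upload = [hadoop_exe, "fs", "-put", local_input, hdfs_input]
    # Command 2: run the word count program from the examples jar.
    run_job = [hadoop_exe, "jar", examples_jar, "wordcount", hdfs_input, hdfs_output]
    return [upload, run_job]


def run_all(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # On Windows the launcher is a .cmd script, so hadoop_exe may
            # need to be "hadoop.cmd" for this call to find it.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(build_commands(LOCAL_INPUT, HDFS_INPUT, HDFS_OUTPUT, EXAMPLES_JAR))
```

By default this only prints the two commands, so it is safe to try on a machine without Hadoop; pass `dry_run=False` to `run_all` on a box where the `hadoop` launcher is on the PATH.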

The second way is Hadoop on Azure, which is a beta you need to sign up for at this time. This is really cool because you can provision a Hadoop cluster, use it, and then release it when you’re done. That makes it perfect for playing around with (although the on-premises install is so easy that any workstation, or better yet an Azure VM, can be used for that too).

Stay tuned for more posts on getting started with HDInsight and Hadoop on Windows.