Big data processing with Hadoop

how-to
Feb 21, 2011 · 2 mins

Data storage has become cheap. Consequently, we’re storing tons of it:

  • in less than 10 years since launching its image search feature, Google has indexed over 10 billion images
  • thirty-five hours of content are uploaded to YouTube every minute
  • Twitter is said to handle, on average, 55 million tweets per day
  • in early 2010, Twitter’s search feature was logging 600 million queries daily

Tools designed to facilitate data processing have grown in lockstep with this explosive growth of data; one such tool is Apache's Hadoop. Hadoop is essentially a framework for analyzing huge datasets, which don't necessarily need to live in a traditional datastore. It implements the MapReduce programming model, abstracting away the mechanics of distributed data analysis and making it accessible to everyday developers. Hadoop scales out to myriad nodes and handles all of the coordination related to sorting and shuffling data. Yahoo! and countless other organizations have found it an efficient mechanism for analyzing mountains of bits and bytes. Hadoop is also fairly easy to get working on a single node; all you need is some data to analyze and familiarity with Java, including generics.
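To make the MapReduce model concrete, here is a minimal in-process sketch of the classic word-count job in plain Java. This is not the actual Hadoop API; it simply mimics the map, shuffle, and reduce phases (and the role generics play in typing the key/value pairs) in a single JVM, under the assumption that a toy example is enough to convey the idea:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the MapReduce word-count pattern.
// Not the real Hadoop API: the map -> shuffle -> reduce phases
// are simulated in-process to show the shape of the computation.
public class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every word in every input line.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(Map.entry(word, 1));
                }
            }
        }
        return pairs;
    }

    // Shuffle + reduce phase: group pairs by key and sum the counts.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = List.of("big data big tools", "data everywhere");
        Map<String, Integer> counts = reduce(map(input));
        System.out.println(counts.get("big"));  // 2
        System.out.println(counts.get("data")); // 2
    }
}
```

In real Hadoop, the map and reduce steps run as distributed tasks across many nodes, and the framework performs the shuffle (grouping by key) between them; the structure of the computation, however, is exactly this.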

In the IBM developerWorks article “Big data analysis with Hadoop MapReduce,” you’ll get started with Hadoop’s MapReduce programming model and learn how to use it to analyze data for both big and small business information needs. You’ll find that analyzing data with Hadoop is easy and efficient!


When Andrew Glover isn't listening to “Funkytown” or “Le Freak” he enjoys speaking on the No Fluff Just Stuff Tour. He also writes articles for multiple online publications including IBM's developerWorks and O'Reilly’s ONJava and ONLamp portals. Andrew is also the co-author of Java Testing Patterns, which was published by Wiley in September 2004; Addison-Wesley’s Continuous Integration; and Manning’s Groovy in Action.
