All of the interesting technological, artistic or just plain fun subjects I'd investigate if I had an infinite number of lifetimes. In other words, a dumping ground...

Thursday 15 November 2007

MapReduce, open source style

http://developer.yahoo.com/blogs/hadoop/

Hadoop and Distributed Computing at Yahoo!

http://developer.yahoo.net/blog/archives/2007/07/yahoo-hadoop.html

Open Source Distributed Computing: Yahoo's Hadoop Support

July 25, 2007

For the last several years, every company involved in building large web-scale systems has faced some of the same fundamental challenges. While nearly everyone agrees that the "divide-and-conquer using lots of cheap hardware" approach to breaking down large problems is the only way to scale, doing so is not easy.

The underlying infrastructure has always been a challenge. You have to buy, power, install, and manage a lot of servers. Even if you use somebody else's commodity hardware, you still have to develop the software that'll do the divide-and-conquer work to keep them all busy.

It's hard work. And it needs to be commoditized, just like the hardware has been...

We too have been dealing with this at Yahoo. Analyzing petabytes of data takes a lot of CPU power and storage. And given the way our needs (and the web as a whole) have been growing, there will likely be dozens of similarly demanding applications before long.

To build the necessary software infrastructure, we could have gone off to develop our own technology, treating it as a competitive advantage, and charged ahead. But we've taken a slightly different approach. Realizing that a growing number of companies and organizations are likely to need similar capabilities, we got behind the work of Doug Cutting (creator of the open source Nutch and Lucene projects) and asked him to join Yahoo to help deploy and continue working on the [then new] open source Hadoop project.

What started here as a 20-node cluster in March of 2006 was up to nearly 200 nodes a month later and has continued to grow as it eats terabytes and terabytes of data. Not long after that, our code contributions back to Hadoop really started to ramp up as well.

http://lucene.apache.org/hadoop/

Hadoop is a software platform that lets one easily write and run applications that process vast amounts of data.

Here's what makes Hadoop especially useful:

  • Scalable: Hadoop can reliably store and process petabytes.
  • Economical: It distributes the data and processing across clusters of commonly available computers. These clusters can number into the thousands of nodes.
  • Efficient: By distributing the data, Hadoop can process it in parallel on the nodes where the data is located, so very little data has to move across the network. This makes processing extremely fast.
  • Reliable: Hadoop automatically maintains multiple copies of data and automatically redeploys computing tasks when failures occur (see the sketch just after this list).
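
To make the "multiple copies" point concrete, here is a minimal sketch (mine, not from the Hadoop docs) that writes a file through the HDFS client API and raises its replication factor. The class name, the path, and the choice of three replicas are placeholders.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch: write a small file into HDFS and ask for three
// replicas, so the data survives the loss of individual nodes.
public class HdfsReplicationDemo {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();   // picks up the cluster's site config
    FileSystem fs = FileSystem.get(conf);       // the cluster's default filesystem (HDFS)

    Path file = new Path("/demo/hello.txt");    // placeholder path
    FSDataOutputStream out = fs.create(file);
    out.writeUTF("hello, hadoop");
    out.close();

    // Ask for 3 copies of each block; the namenode schedules the extra
    // block copies onto other datanodes.
    fs.setReplication(file, (short) 3);

    short replicas = fs.getFileStatus(file).getReplication();
    System.out.println(file + " is stored with " + replicas + " replicas");
  }
}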

Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS) (see the figure below). MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located.
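
To show what those "small blocks of work" look like in practice, here is a minimal word-count job written against the Java MapReduce API of that era. It's a sketch, not Yahoo's code: the class names and the choice of the classic word-count example are mine. The map step splits each line into words, and the reduce step sums the counts for each word.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

// Hypothetical word-count job, the "hello world" of MapReduce.
public class WordCount {

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        output.collect(word, one);   // emit (word, 1) for every word seen
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();  // add up the 1s emitted by the mappers
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);   // combine locally before the shuffle
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    // Input and output live in HDFS; paths are taken from the command line.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}

Because the framework runs the map tasks on the nodes that already hold the input blocks, the only data that crosses the network is the much smaller set of intermediate (word, count) pairs headed to the reducers.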

Hadoop has been demonstrated on clusters with 2000 nodes. The current design target is 10,000 node clusters.

Hadoop is a Lucene sub-project that contains the distributed computing platform that was formerly a part of Nutch.

For more information about Hadoop, please see the Hadoop wiki.

[Figure: Hadoop architecture (MapReduce over HDFS)]


Running Hadoop MapReduce on Amazon EC2 and Amazon S3

http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873&categoryID=112
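
The article above covers running a MapReduce cluster on EC2 against data kept in S3. As a rough sketch (not taken from the article), this is how a job like the word count above could be pointed at S3. The property names fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey, the bucket name, and the paths are assumptions based on the Hadoop S3 filesystem documentation of the time; check the article for the exact setup.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

// Hypothetical sketch: point a MapReduce job at data stored in S3
// while the cluster itself runs on EC2 nodes.
public class S3WordCountConfig {
  public static JobConf configure(JobConf conf) {
    // AWS credentials for Hadoop's S3 filesystem (assumed property names).
    conf.set("fs.s3.awsAccessKeyId", "YOUR_ACCESS_KEY_ID");
    conf.set("fs.s3.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY");

    // Read input from an S3 bucket and write results back to S3;
    // "my-bucket" and the paths are placeholders.
    FileInputFormat.setInputPaths(conf, new Path("s3://my-bucket/input"));
    FileOutputFormat.setOutputPath(conf, new Path("s3://my-bucket/output"));
    return conf;
  }
}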
