The Elephant was a Trojan Horse: On the Death of Map-Reduce at Google

Map-Reduce is on its way out. However, we should measure its importance not by the number of bytes it crunches, but by the fundamental shift in data processing architectures it helped popularise.

Reference: The Elephant was a Trojan Horse: On the Death of Map-Reduce at Google


MADlib: Big Data Machine Learning in SQL for Data Scientists

The MADlib approach is to leverage the efforts of commercial practice, academic research, and open-source development to build a product that addresses the analytic challenges of modern business.

Key MADlib architecture principles are:

  • Operate on the data locally, in the database. Do not move it between multiple runtime environments unnecessarily.
  • Utilize best-of-breed database engines, but separate the machine learning logic from database-specific implementation details.
  • Leverage MPP shared-nothing technology, such as the Pivotal Greenplum Database, to provide parallelism and scalability.
  • Maintain an open implementation with active ties to ongoing academic research.
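
The first principle, computing where the data lives, can be illustrated with a minimal sketch. It does not use MADlib or its API. As an assumption made for the sake of a self-contained example, Python's built-in sqlite3 module stands in for the database engine, and the table name `points`, its columns `x` and `y`, and both function names are hypothetical. The database computes the sufficient statistics for a simple linear regression with plain SQL aggregates, so only five numbers leave the engine instead of every row.

```python
import sqlite3


def fit_line_in_database(conn, table="points"):
    """Fit y = slope * x + intercept from SQL aggregates only.

    The table name is a trusted, hypothetical identifier used for
    illustration; it is interpolated into the query text directly.
    """
    # The database scans the rows; the client receives five aggregates.
    n, sx, sy, sxx, sxy = conn.execute(
        f"SELECT COUNT(*), SUM(x), SUM(y), SUM(x * x), SUM(x * y) FROM {table}"
    ).fetchone()
    if n < 2:
        raise ValueError("need at least two rows to fit a line")
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("all x values are identical")
    # Closed-form ordinary least squares for a single predictor.
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return slope, intercept


def make_demo_connection():
    """Build an in-memory table holding points on the line y = 2x + 1."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE points (x REAL, y REAL)")
    conn.executemany(
        "INSERT INTO points VALUES (?, ?)",
        [(0, 1), (1, 3), (2, 5), (3, 7)],
    )
    return conn


if __name__ == "__main__":
    slope, intercept = fit_line_in_database(make_demo_connection())
    print(slope, intercept)
```

On a shared-nothing MPP system the same aggregates could in principle be computed per segment and then combined, which is why sums of this kind parallelize well. That remark describes the general technique rather than any specific MADlib implementation.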

MADlib functionality includes:

  • Classification
  • Regression
  • Clustering
  • Topic Modeling: like clustering, attempts to identify groups of documents that are similar to each other, but is specialized for the text domain, where it also tries to identify the main themes of those documents.
  • Association Rule Mining, also called market basket analysis or frequent itemset mining
  • Descriptive statistics
  • Validation
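
To make the association rule mining entry above more concrete, here is a small sketch of the two measures such analyses usually report, support and confidence. It is a brute-force count over item pairs only, not a full frequent-itemset algorithm such as Apriori, and it is not MADlib code. The function name `pair_rules`, its thresholds, and the sample baskets are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.5, min_confidence=0.6):
    """Return (lhs, rhs, support, confidence) rules over item pairs.

    support    = fraction of baskets containing both items
    confidence = baskets containing both / baskets containing lhs
    """
    n = len(baskets)
    if n == 0:
        return []
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # ignore duplicates within a basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Each frequent pair can yield a rule in either direction.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


if __name__ == "__main__":
    demo_baskets = [
        ["bread", "milk"],
        ["bread", "butter"],
        ["bread", "milk", "butter"],
        ["milk"],
    ]
    for rule in pair_rules(demo_baskets):
        print(rule)
```

In the sample data every basket with butter also contains bread, so the rule butter → bread has confidence 1.0, while the reverse rule is weaker because bread also appears without butter.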

The MADlib software project began in 2010 as a collaboration between researchers at UC Berkeley and engineers and data scientists at EMC/Greenplum (later Pivotal), and today it also includes researchers from Stanford and the University of Florida.

Reference: MADlib