MADlib: Big Data Machine Learning in SQL for Data Scientists

The MADlib approach is to combine the efforts of commercial practice, academic research, and open-source development to build a product that addresses the analytic challenges of modern business.

Key MADlib architecture principles are:

  • Operate on the data locally, in the database. Do not move it between multiple runtime environments unnecessarily.
  • Use best-of-breed database engines, but separate the machine learning logic from database-specific implementation details.
  • Leverage MPP shared-nothing technology, such as the Pivotal Greenplum Database, to provide parallelism and scalability.
  • Keep the implementation open, maintaining active ties to ongoing academic research.
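
The first principle can be illustrated with a minimal sketch. It does not use MADlib itself: it assumes Python's built-in sqlite3 module as a stand-in database, and the table name `points` and the helper `fit_line_in_database` are hypothetical. The idea is that the database computes a handful of aggregates and only those numbers leave it, never the raw rows.

```python
import sqlite3


def fit_line_in_database(conn, table="points"):
    """Fit y = slope * x + intercept using only SQL aggregates.

    Hypothetical helper for illustration; not part of MADlib.
    """
    # The database scans the rows and returns five numbers.
    # The raw rows are never pulled into the Python process.
    n, sx, sy, sxx, sxy = conn.execute(
        f"SELECT COUNT(*), SUM(x), SUM(y), SUM(x * x), SUM(x * y) FROM {table}"
    ).fetchone()
    if not n:
        raise ValueError("table is empty")
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("x has no variance")
    # Closed-form least squares for a single predictor.
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Stand-in data: points that lie exactly on y = 2x + 1.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO points VALUES (?, ?)",
    [(float(x), 2.0 * x + 1.0) for x in range(10)],
)
slope, intercept = fit_line_in_database(conn)
print(slope, intercept)
```

Because each aggregate is a plain sum, a shared-nothing system could in principle compute partial sums on every node and add them together, which is why this style of computation fits the MPP principle as well.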

MADlib functionality includes:

  • Classification
  • Regression
  • Clustering
  • Topic Modeling: like clustering, it groups documents that are similar to each other, but it is specialized for text and also tries to identify the main themes of those documents.
  • Association Rule Mining, also called market basket analysis or frequent itemset mining
  • Descriptive statistics
  • Validation
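
To make the association rule mining item concrete, here is a small sketch of the two measures usually reported for a rule such as "bread implies butter": support and confidence. This is plain Python rather than MADlib code, and the baskets and the helper `rule_metrics` are made up for illustration.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.

    Hypothetical helper for illustration; not part of MADlib.
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n_total = len(baskets)
    # Baskets containing every item of the antecedent.
    n_ante = sum(1 for b in baskets if antecedent <= set(b))
    # Baskets containing every item of both sides of the rule.
    n_both = sum(1 for b in baskets if (antecedent | consequent) <= set(b))
    support = n_both / n_total if n_total else 0.0
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence


# Made-up market baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
]
support, confidence = rule_metrics(baskets, {"bread"}, {"butter"})
print(support, confidence)
```

Support is the fraction of all baskets that contain both sides of the rule; confidence is the fraction of baskets containing the antecedent that also contain the consequent.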

The MADlib software project began in 2010 as a collaboration between researchers at UC Berkeley and engineers and data scientists at EMC/Greenplum (later Pivotal). Today it also includes researchers from Stanford and the University of Florida.

Reference: MADlib
