MADlib: Big Data Machine Learning in SQL for Data Scientists

MADlib's approach is to combine the efforts of commercial practice, academic research, and open-source development into a product that addresses the analytic challenges of modern business.

Key MADlib architecture principles are:

Operate on the data locally, in the database. Do not move it between multiple runtime environments unnecessarily.
Use best-of-breed database engines, but separate the machine learning logic from database-specific implementation details.
Leverage massively parallel processing (MPP) shared-nothing technology, such as the Pivotal Greenplum Database, to provide parallelism and scalability.
Maintain an open implementation with active ties to ongoing academic research.
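
The first principle above can be illustrated with a minimal sketch. It is not MADlib's API: SQLite (from the Python standard library) stands in for a database such as Greenplum, and the `houses` table and the `fit_line_in_database` helper are hypothetical names chosen for this example. The point is only that the database computes the aggregates, so a handful of numbers leave it rather than every row.

```python
import sqlite3


def fit_line_in_database(conn, table, x_col, y_col):
    """Fit y = intercept + slope * x using only SQL aggregates.

    Only five summary statistics are returned by the database, never the rows.
    Table and column names are interpolated directly into the query, so they
    must come from trusted code, not from user input.
    """
    query = f"""
        SELECT COUNT(*), SUM({x_col}), SUM({y_col}),
               SUM({x_col} * {x_col}), SUM({x_col} * {y_col})
        FROM {table}
    """
    n, sx, sy, sxx, sxy = conn.execute(query).fetchone()
    # Closed-form ordinary least squares for a single predictor.
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Hypothetical toy data that lies exactly on price = 2 * size + 3.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE houses (size REAL, price REAL)")
conn.executemany(
    "INSERT INTO houses VALUES (?, ?)",
    [(1.0, 5.0), (2.0, 7.0), (3.0, 9.0), (4.0, 11.0)],
)
slope, intercept = fit_line_in_database(conn, "houses", "size", "price")
print(slope, intercept)
```

In a shared-nothing MPP database, sums like these can be computed on each segment independently and then combined, which is why aggregate-style formulations tend to parallelize well.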

MADlib functionality includes:

Classification
Regression
Clustering
Topic modeling: like clustering, it groups documents that are similar to each other, but it is specialized for text and also tries to identify the main themes of those documents.
Association rule mining (also called market basket analysis or frequent itemset mining)
Descriptive statistics
Validation

The MADlib software project began in 2010 as a collaboration between researchers at UC Berkeley and engineers and data scientists at EMC/Greenplum (later Pivotal). Today it also includes researchers from Stanford and the University of Florida.

Reference: MADlib
