MADlib's approach is to draw on commercial practice, academic research, and open-source development to build a product that addresses the analytic challenges of modern business.
Key MADlib architecture principles are:
- Operate on the data locally, in the database. Do not move it between multiple runtime environments unnecessarily.
- Use best-of-breed database engines, but separate the machine learning logic from database-specific implementation details.
- Leverage MPP shared-nothing technology, such as the Pivotal Greenplum Database, to provide parallelism and scalability.
- Maintain an open implementation with active ties to ongoing academic research.
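The first principle, computing where the data lives, can be illustrated with a minimal Python sketch. It is not MADlib code: SQLite from the Python standard library stands in for a real database engine, and the `sales` table, the `amount` column, and the `summarize_in_database` helper are hypothetical names chosen for illustration. The point is only that the aggregation runs inside the engine and a single summary row crosses into the client.

```python
import sqlite3


def summarize_in_database(conn, table, column):
    """Compute count, mean, min, and max inside the database engine.

    Hypothetical helper: only one summary row is returned to the client,
    rather than every row of the table.
    """
    # Identifiers cannot be bound as query parameters, so they are quoted here.
    # In this sketch they are assumed to come from trusted code, not user input.
    query = (
        f'SELECT COUNT("{column}"), AVG("{column}"), '
        f'MIN("{column}"), MAX("{column}") FROM "{table}"'
    )
    count, mean, lowest, highest = conn.execute(query).fetchone()
    return {"count": count, "mean": mean, "min": lowest, "max": highest}


# SQLite is an assumption standing in for an MPP database such as Greenplum.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?)", [(10.0,), (20.0,), (30.0,)])

summary = summarize_in_database(conn, "sales", "amount")
```

On a large table the alternative, fetching every row and aggregating in the client, would move the whole column across a process boundary before any work is done, which is the cost this principle is meant to avoid.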
MADlib functionality includes:
- Topic modeling: like clustering, it attempts to identify groups of documents that are similar to each other, but it is specialized for text, where it also tries to identify the main themes of those documents.
- Association Rule Mining, also called market basket analysis or frequent itemset mining
- Descriptive statistics
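To make the association rule mining entry concrete, here is a minimal Python sketch of the two measures that market basket analysis is usually built on: the support of an itemset and the confidence of a rule. It is not MADlib's implementation or API. The function names, the restriction to single-item rules, and the sample baskets are all assumptions made for illustration.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def pair_rules(transactions, min_support, min_confidence):
    """Return (antecedent, consequent, support, confidence) tuples.

    Illustrative only: considers rules with one item on each side and
    checks every pair by brute force, with no candidate pruning.
    """
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        pair_support = support({a, b}, transactions)
        if pair_support < min_support:
            continue  # the pair is not frequent enough to yield a rule
        for lhs, rhs in ((a, b), (b, a)):
            # confidence(lhs -> rhs) = support(lhs and rhs) / support(lhs)
            confidence = pair_support / support({lhs}, transactions)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, pair_support, confidence))
    return rules


# Hypothetical sample data: each set is one customer's basket.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
rules = pair_rules(baskets, min_support=0.5, min_confidence=0.6)
```

With these baskets, the pair butter and milk appears in only one of four transactions, so it falls below the support threshold and produces no rule, while every basket containing butter also contains bread, which gives the rule from butter to bread a confidence of 1.0.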
The MADlib software project began in 2010 as a collaboration between researchers at UC Berkeley and engineers and data scientists at EMC/Greenplum (later Pivotal). Today it also includes researchers from Stanford and the University of Florida.