I co-authored two posts on the Databricks blog about large-scale machine learning with Apache Spark:
- The first post discusses distributed decision tree construction in Spark and profiles its scaling performance across cluster sizes and datasets.
- The second post introduces tree-based ensembles (Random Forests and Boosting), which are top performers for both classification and regression tasks. It highlights scaling performance across cluster sizes, training dataset sizes, model sizes (number of trees in the ensemble), and tree depths; a minimal training sketch follows this list.
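For readers who want a concrete starting point, here is a minimal sketch of training a Random Forest with MLlib's RDD-based tree API, which the ensembles described in these posts build on. The dataset path, parameter values, and evaluation step are illustrative assumptions, not taken from the posts themselves.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.util.MLUtils

object RandomForestSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rf-sketch"))

    // Load a LIBSVM-format dataset as an RDD[LabeledPoint];
    // the path here is a placeholder.
    val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
    val Array(training, test) = data.randomSplit(Array(0.7, 0.3), seed = 42L)

    // Train a Random Forest classifier. The parameter values are arbitrary,
    // but numTrees and maxDepth correspond to the model-size and tree-depth
    // axes profiled in the second post.
    val model = RandomForest.trainClassifier(
      training,
      numClasses = 2,
      categoricalFeaturesInfo = Map.empty[Int, Int],
      numTrees = 100,                  // model size: #trees in the ensemble
      featureSubsetStrategy = "auto",  // let MLlib choose features per node
      impurity = "gini",
      maxDepth = 5,                    // tree depth
      maxBins = 32,
      seed = 42)

    // Evaluate: fraction of test points misclassified.
    val testErr = test
      .map(p => (model.predict(p.features), p.label))
      .filter { case (pred, label) => pred != label }
      .count().toDouble / test.count()
    println(s"Test error = $testErr")

    sc.stop()
  }
}
```

A single-tree baseline, as in the first post, can be trained the same way via `DecisionTree.trainClassifier`, which takes analogous parameters minus the ensemble-specific `numTrees` and `featureSubsetStrategy`.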