Curiosity

Musings on observations.

Decision Trees, Random Forests and Boosting in Spark

I co-authored two posts on the Databricks blog about large-scale machine learning with Apache Spark:

  • The first post discusses distributed decision tree construction in Spark and profiles scaling performance across cluster sizes and datasets.
  • The second post introduces tree-based ensembles (Random Forests and Boosting), which are top performers for both classification and regression tasks. It highlights scaling performance across cluster sizes, training dataset sizes, model sizes (number of trees in the ensemble) and tree depths.
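
To give a rough feel for what "distributed" tree construction can mean, here is a minimal plain-Python sketch of choosing a single split from data that lives in several partitions. It is an illustration under my own assumptions, not Spark's implementation: the function names (`local_stats`, `merge`, `best_split`), the toy partitions and the candidate thresholds are all hypothetical. Each partition computes label counts per candidate threshold, the counts are merged, and the split with the lowest weighted Gini impurity is chosen.

```python
from collections import Counter

def local_stats(rows, thresholds):
    """Per-partition pass: count labels on each side of every candidate threshold."""
    stats = {t: (Counter(), Counter()) for t in thresholds}
    for x, y in rows:
        for t in thresholds:
            left, right = stats[t]
            (left if x <= t else right)[y] += 1
    return stats

def merge(a, b):
    """Combine the statistics of two partitions (associative, so it fits a reduce)."""
    return {t: (a[t][0] + b[t][0], a[t][1] + b[t][1]) for t in a}

def gini(counts):
    """Gini impurity of a label-count table; 0.0 for an empty side."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(stats):
    """Pick the threshold with the lowest weighted Gini impurity."""
    def weighted(t):
        left, right = stats[t]
        nl, nr = sum(left.values()), sum(right.values())
        return (nl * gini(left) + nr * gini(right)) / (nl + nr)
    return min(stats, key=weighted)

# Hypothetical toy data: two partitions of (feature, label) rows.
partitions = [
    [(1.0, 0), (2.0, 0), (6.0, 1)],
    [(3.0, 0), (7.0, 1), (8.0, 1)],
]
thresholds = [2.5, 4.5, 6.5]  # assumed candidate split points

merged = local_stats(partitions[0], thresholds)
for part in partitions[1:]:
    merged = merge(merged, local_stats(part, thresholds))

split = best_split(merged)
print(split)
```

The point of the design is that only small count tables travel between partitions, never the rows themselves, so the merge step stays cheap as the dataset grows.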
