Musings on observations.

Decision Trees, Random Forests and Boosting in Spark

I co-authored two blog posts on the Databricks blog on large-scale machine learning with Apache Spark:

  • The first post discusses distributed decision tree construction in Spark and profiles the scaling performance for various cluster sizes and datasets.
  • The second post introduces tree-based ensembles (Random Forests and Boosting), which are top performers for both classification and regression tasks. It highlights the scaling performance for various cluster sizes, training dataset sizes, model sizes (number of trees in the ensemble) and tree depths.

Mocking Python With Kung Fu Panda

I frequently use the Mock library for unit testing in Python. I needed a quick reference for my favorite functionality and couldn’t find one, so I decided to make a lighthearted attempt at writing one while watching Kung Fu Panda on the telly. I hope others find it useful.

Before I start, here is the list of my favorites:

  1. Mock classes
  2. Mock class methods
  3. Mock instances
  4. Mock instance methods
  5. Configurable return values
  6. Restricted API for mock objects
  7. MagicMock
  8. Mock multiple classes
  9. Verifying calls
  10. Sentinels
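
To give a flavor of a few of these items (configurable return values, restricted API, MagicMock, verifying calls and sentinels), here is a minimal sketch. It assumes Python 3, where the Mock library ships in the standard library as `unittest.mock`. The `Panda` class and its `kick` method are hypothetical names made up for this illustration.

```python
from unittest.mock import MagicMock, Mock, patch, sentinel


class Panda:
    """Hypothetical class, used only for illustration."""

    def __init__(self, name):
        self.name = name

    def kick(self, target):
        return f"{self.name} kicks {target}"


# Mock an instance method and configure its return value.
po = Panda("Po")
po.kick = Mock(return_value="skadoosh")
result = po.kick("Tai Lung")

# Verifying calls: the mock records how it was called.
po.kick.assert_called_once_with("Tai Lung")

# Restricted API: with spec=Panda, attributes the real class lacks are rejected.
restricted = Mock(spec=Panda)
try:
    restricted.fly()
    spec_rejected = False
except AttributeError:
    spec_rejected = True

# MagicMock supports magic methods such as __len__ out of the box.
scroll = MagicMock()
scroll.__len__.return_value = 0
scroll_length = len(scroll)

# Mock a class method for the duration of a block; a sentinel is a unique
# named object that is convenient for identity checks.
with patch.object(Panda, "kick", return_value=sentinel.dragon_scroll):
    patched_result = Panda("Shifu").kick("anything")
```

Once the `with` block exits, `Panda.kick` is restored to the real method, so patches do not leak between tests.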

Pandas and Python: Top 10

I recently discovered the high-performance Pandas library written in Python while performing data munging in a machine learning project. Using simple examples, I want to highlight my favorite (and sometimes hard to find) features.

Apart from serving as a quick reference, I hope this post will help new users quickly start extracting value from Pandas. For a good overview of Pandas and its advanced features, I highly recommend Wes McKinney’s book Python for Data Analysis and the documentation on the website.

Here is my top 10 list:

  1. Indexing
  2. Renaming
  3. Handling missing values
  4. map(), apply(), applymap()
  5. groupby()
  6. New Columns = f(Existing Columns)
  7. Basic stats
  8. Merge, join
  9. Plots
  10. Scikit-learn conversion
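
As a small taste of several items on this list (indexing, renaming, missing values, map(), groupby(), derived columns and merge), here is a minimal sketch on a toy DataFrame. The column names and values are invented for illustration, and the only assumption is that `pandas` is installed.

```python
import pandas as pd

# Toy data, invented for this example.
df = pd.DataFrame(
    {
        "name": ["a", "b", "c", "d"],
        "grp": ["x", "x", "y", "y"],
        "val": [1.0, None, 3.0, 5.0],
    }
)

# Renaming: give a column a clearer name.
df = df.rename(columns={"val": "value"})

# Handling missing values: replace NaN with 0.0.
df["value"] = df["value"].fillna(0.0)

# New Columns = f(Existing Columns).
df["double"] = df["value"] * 2

# map(): apply a function element-wise to a Series.
df["grp_upper"] = df["grp"].map(str.upper)

# groupby(): aggregate values per group.
totals = df.groupby("grp")["value"].sum()

# Indexing: label-based lookup after setting an index.
indexed = df.set_index("name")
c_value = indexed.loc["c", "value"]

# Merge: attach a lookup table on a shared key.
labels = pd.DataFrame({"grp": ["x", "y"], "label": ["first", "second"]})
merged = df.merge(labels, on="grp", how="left")
```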


I was looking for a new blogging platform designed for engineers. Think LaTeX for blogging – simple to use, beautiful output, version control. Enter Octopress (a blogging framework for hackers), which lets me combine Markdown and GitHub Pages to produce not-too-shabby webpages with an easy-to-use workflow.