I co-authored two blog posts on the Databricks blog on large-scale machine learning with Apache Spark:
- The first post discusses distributed decision tree construction in Spark and profiles the scaling performance for various cluster sizes and datasets.
- The second post introduces tree-based ensembles (Random Forests and Boosting) that are top performers for both classification and regression tasks. It highlights the scaling performance for various cluster sizes, training dataset sizes, model sizes (number of trees in the ensemble) and tree depths.
I am speaking at Spark Summit 2014 on Scalable Distributed Decision Trees in Spark MLlib. You can grab a friend-of-the-speaker discount here. Also, feel free to leave feedback in the comments.
I frequently use the Mock library for unit testing in Python. I needed a quick reference for my favorite functionality and couldn’t find one. I decided to make a lighthearted attempt at writing one while watching Kung Fu Panda on the telly. I hope others find it useful.
Before I start, here is the list of my favorites:
I recently discovered the high-performance Pandas library written in Python while performing data munging in a machine learning project. Using simple examples, I want to highlight my favorite (and sometimes hard to find) features.
Apart from serving as a quick reference, I hope this post will help new users quickly start extracting value from Pandas. For a good overview of Pandas and its advanced features, I highly recommend Wes McKinney’s Python for Data Analysis book and the documentation on the website.
Here is my top 10 list:
I was looking for a new blogging platform designed for engineers. Think LaTeX for blogging: simple to use, beautiful output, version control. Enter Octopress (a blogging framework for hackers), which allows me to use the combination of Markdown and GitHub Pages to produce not-too-shabby webpages with an easy-to-use workflow.