Max Kuhn

Max Kuhn, Director - Pfizer

Bio: I am a Ph.D. statistician with experience in a few different domains: pharmaceutical non-clinical statistics, molecular diagnostic R&D, assay development, manufacturing support/Six Sigma, and clinical statistics (in order of my interests). I have worked both as an individual contributor and as a director of moderately sized groups (currently 2 people, 12 previously). I prefer problems where creativity is key to the solution. For example, predictive modeling (i.e., machine learning) is an area where traditional statistics may be limiting; as long as you can prove that a solution performs well, any idea is on the table. Complex problems interest me. A friend once remarked that I thrive on making order out of chaos and, to some extent, this is true. One of the most overlooked skills in these situations is the ability to remain focused on the core objective. This is especially true when we have access to large amounts of data.

Specialties: Predictive modeling/machine learning/pattern recognition | Computational biology and chemistry | High dimensional biology | The design and analysis of experiments

Tidy Resampling Redux with Agricultural Economics Data

(No statistical graphs in this one. This is what my dog Artemis looks like when she wants my attention during work hours.) Mindy L. Mallory (@ace_prof) wrote a blog post, “Machine Learning and Econometrics: Model Selection and Assessment Statistical Learning Style,” where she has a great description of the variance-bias tradeoff, resampling, and model complexity using […]

While you wait for that to finish, can I interest you in parallel processing?

caret has been able to utilize parallel processing for some time (before it was on CRAN in October 2007) using slightly different versions of the package. Around September of 2011, caret started using the foreach package to “harmonize” the parallel processing technologies, thanks to a super smart guy named Steve Weston. I’ve done a few benchmarks to quantify the benefits […]

Intro to Caret, Model Training and Tuning

Contents: Model Training and Parameter Tuning; An Example; Basic Parameter Tuning; Notes on Reproducibility; Customizing the Tuning Process; Pre-Processing Options; Alternate Tuning Grids; Plotting the Resampling Profile; The trainControl Function; Alternate Performance Metrics; Choosing the Final Model; Extracting Predictions and Class Probabilities; Exploring and Comparing Resampling Distributions; Within-Model; Between-Models; Fitting Models Without Parameter Tuning.

5.1 Model Training and […]

Intro to Caret: Data Splitting

Contents: Simple Splitting Based on the Outcome; Splitting Based on the Predictors; Data Splitting for Time Series; Data Splitting with Important Groups.

4.1 Simple Splitting Based on the Outcome

The function createDataPartition can be used to create balanced splits of the data. If the y argument to this function is a factor, the random sampling occurs within each class and […]

Do Resampling Estimates Have Low Correlation to the Truth?

The Answer May Shock You. One criticism that is often leveled against using resampling methods (such as cross-validation) to measure model performance is that there is no correlation between the CV results and the true error rate. Let’s look at this with some simulated data. While this assertion is often correct, there are a few […]

Intro to Caret: Pre-Processing

Editor’s note: This is the third of a series of posts on the caret package.

Contents: Creating Dummy Variables; Zero- and Near Zero-Variance Predictors; Identifying Correlated Predictors; Linear Dependencies; The preProcess Function; Centering and Scaling; Imputation; Transforming Predictors; Putting It All Together; Class Distance Calculations.

caret includes several functions to pre-process the predictor data. It assumes that […]

Intro to caret: Visualizations

Editor’s note: This is the second of a series of posts on the caret package. The featurePlot function is a wrapper for different lattice plots to visualize the data. For example, the following figures show the default plot for continuous outcomes generated using the featurePlot function. For classification data sets, the iris data are used for illustration. […]

The caret Package

Editor’s note: This is the first of a long series of posts on the caret package.

Introduction

The caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process for creating predictive models. The package contains tools for: data splitting, pre-processing, feature selection, model tuning using resampling […]

Optimizing with Nonlinear Programming

Rafael Ladeira asked on GitHub: I was wondering why it doesn’t implement some others algorithms for search for optimal tuning parameters. What would be the caveats of using a genetic algorithm, for instance, instead of grid or random search? Do you think using some of those powerful optimization algorithms for tuning parameters is a […]