
my first step into sabermetrics

In a true display of my extreme geekiness, I have written a baseball simulator to run through all 9! (9 factorial = 362,880) permutations of the lineup in order to determine which one is optimal. There was a bit of talk on some of the sabermetric sites about lineup optimization a few weeks ago, and someone suggested that the best way would be to write a simulator, so I did. It's been a pretty good opportunity to brush up on my programming skills.
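To give a sense of what "running through all 9! permutations" involves, here is a minimal sketch of one way to step through every batting order in C. It is not taken from the actual simulator; the names next_lineup and count_lineups are made up for illustration, and the lexicographic next-permutation approach is just one common choice.

```c
#include <assert.h>

/* Swap two batting-order slots. */
void swap_slots(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* Advance order[0..n-1] to the next permutation in lexicographic order.
   Returns 1 on success, 0 if order was already the last permutation. */
int next_lineup(int *order, int n)
{
    int i = n - 2;
    while (i >= 0 && order[i] >= order[i + 1]) i--;   /* find rightmost ascent */
    if (i < 0) return 0;                              /* fully descending: done */
    int j = n - 1;
    while (order[j] <= order[i]) j--;                 /* smallest larger element */
    swap_slots(&order[i], &order[j]);
    for (int lo = i + 1, hi = n - 1; lo < hi; lo++, hi--)
        swap_slots(&order[lo], &order[hi]);           /* reverse the tail */
    return 1;
}

/* Visit every lineup of n players (hypothetical driver) and count them. */
long count_lineups(int n)
{
    int order[16];
    assert(n >= 1 && n <= 16);
    for (int i = 0; i < n; i++) order[i] = i;         /* start from 0,1,...,n-1 */
    long count = 0;
    do {
        count++;   /* a real run would simulate many games for this order here */
    } while (next_lineup(order, n));
    return count;
}
```

Calling count_lineups(9) walks through all 362,880 orders; the loop body is where the games for each lineup would be simulated.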

It's basically a Monte Carlo simulator that generates a random number and compares it against a player's statistics to determine whether the result of that particular at-bat should be a walk, single, double, etc. This simulator doesn't take into account sacrifice flies, sacrifice bunts, steals, or double plays, so it consistently underestimates the run totals, but it should provide a reasonable first-order approximation. I've written in a hook for a possible double-play routine, and it ought to be possible to take sacrifices and speed into account as well, but that would probably require a significant rewrite and consequently increase the run time. I've spent the last few days tweaking the code to get it to run as fast as possible, but for any sort of statistical validity it still takes about 2 1/2 days to run through all the permutations (at about 150,000 to 250,000 games per lineup; taking 200,000 as a midpoint, 362,880 lineups works out to about 72 billion simulated games). Currently I'm testing it out against the 2005 Astros, using the aggregated stats for the pitchers in one slot.
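The "compare a random number against a player's statistics" step might look something like the sketch below. This is an illustration under stated assumptions, not the simulator's actual code: the Batter struct, the PA_* outcome names, and the function classify_plate_appearance are all hypothetical, hit-by-pitch is lumped in with walks, and anything that isn't a walk or a hit is treated as an out. The draw u is passed in as a parameter so any uniform generator on [0, 1) can supply it.

```c
#include <assert.h>

/* Possible results of one plate appearance (hypothetical names). */
typedef enum { PA_OUT, PA_WALK, PA_SINGLE, PA_DOUBLE, PA_TRIPLE, PA_HOMER } PaOutcome;

/* Season counting stats for one batter (hypothetical layout). */
typedef struct {
    int pa, walks, hbp, hits, doubles, triples, homers;
} Batter;

/* Map a uniform draw u in [0, 1) to an outcome by walking the batter's
   cumulative outcome counts. Assumption: HBP is treated like a walk. */
PaOutcome classify_plate_appearance(const Batter *b, double u)
{
    assert(b->pa > 0 && u >= 0.0 && u < 1.0);
    int singles = b->hits - b->doubles - b->triples - b->homers;
    double x = u * b->pa;              /* scale the draw to plate appearances */
    double edge = b->walks + b->hbp;   /* running upper edge of each bucket */
    if (x < edge) return PA_WALK;
    edge += singles;     if (x < edge) return PA_SINGLE;
    edge += b->doubles;  if (x < edge) return PA_DOUBLE;
    edge += b->triples;  if (x < edge) return PA_TRIPLE;
    edge += b->homers;   if (x < edge) return PA_HOMER;
    return PA_OUT;                     /* everything else is an out */
}
```

Each outcome gets a slice of [0, 1) proportional to how often it happened, so over many draws the simulated batter reproduces his own walk and hit rates.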

The current version of the code is here; I compile it with the following command on Mac OS X running on a G4-based machine:

gcc -o simulator_gsl simulator_gsl_switch.c -lgsl -mdynamic-no-pic -fast -mcpu=7450 -g

It requires the GNU Scientific Library (hence the -lgsl option on the command line) because it uses the GSL random number generator library; switching to the Mersenne Twister random number generator helped shave a few percent off the overall run time. I originally wrote the scoring function using if/else statements, but have rewritten the subroutine using switch/case conditionals, which is also slightly faster. I'm trying to see if there are any other places I can eke out a little better performance, but I think the real project will be to rewrite the code so that it can run in parallel on multiple CPUs. Parallel processing should provide nearly linear acceleration (this problem is, as they say, "embarrassingly parallel"), and I have a few older computers around that could contribute CPU time...
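For readers curious what a switch/case scoring routine can look like, here is a small sketch. It is not the routine from the simulator: the function apply_outcome, the EV_* codes, the bitmask representation of the bases, and the advancement rules are all assumptions made for illustration (every runner advances exactly as many bases as the batter on a hit, and a walk only moves runners who are forced).

```c
#include <assert.h>

/* Outcome codes (hypothetical). */
enum { EV_OUT, EV_WALK, EV_SINGLE, EV_DOUBLE, EV_TRIPLE, EV_HOMER };

/* Update the base state for one outcome and return the runs scored.
   *bases is a bitmask: bit 0 = first, bit 1 = second, bit 2 = third. */
int apply_outcome(unsigned *bases, int outcome)
{
    int n;                              /* bases gained by the batter */
    assert(*bases <= 7u);
    switch (outcome) {
    case EV_WALK:                       /* only forced runners move up */
        if (!(*bases & 1u))      *bases |= 1u;
        else if (!(*bases & 2u)) *bases |= 2u;
        else if (!(*bases & 4u)) *bases |= 4u;
        else return 1;                  /* bases loaded: a run is forced in */
        return 0;
    case EV_SINGLE: n = 1; break;
    case EV_DOUBLE: n = 2; break;
    case EV_TRIPLE: n = 3; break;
    case EV_HOMER:  n = 4; break;
    default:        return 0;           /* an out: nobody moves in this sketch */
    }
    /* Shift every runner n bases and place the batter n bases along. */
    unsigned shifted = (*bases << n) | (1u << (n - 1));
    int runs = 0;
    for (unsigned home = shifted >> 3; home; home >>= 1)
        runs += (int)(home & 1u);       /* bits past third have scored */
    *bases = shifted & 7u;
    return runs;
}
```

Packing the bases into three bits keeps each case down to a shift and a mask, which suits an inner loop that runs billions of times.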

Anyway, once I'm done running the program against the 2005 Astros lineup, I'll post the results here.

Update: For those who want to download and compile the program to run it themselves, feel free. It will ask you for a tab-delimited text file with the player statistics in it. It should have the extension .txt, and each player should be given their own line in the file with stats in this order: name, plate appearances, walks, hits, doubles, triples, homers, strikeouts, GIDP, hit by pitch.
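As a rough guide to the expected layout, the sketch below reads one line in that field order using sscanf. The PlayerLine struct and parse_player_line function are hypothetical names, not part of the simulator, and the 63-character name limit is an arbitrary choice for this example.

```c
#include <assert.h>
#include <stdio.h>

/* One row of the stats file, in the column order listed above. */
typedef struct {
    char name[64];
    int pa, walks, hits, doubles, triples, homers, strikeouts, gidp, hbp;
} PlayerLine;

/* Parse a single tab-delimited line. Returns 1 if all ten fields were
   read, 0 otherwise. */
int parse_player_line(const char *line, PlayerLine *p)
{
    assert(line != NULL && p != NULL);
    /* %63[^\t] reads the name up to the first tab (spaces are allowed). */
    int n = sscanf(line, "%63[^\t]\t%d\t%d\t%d\t%d\t%d\t%d\t%d\t%d\t%d",
                   p->name, &p->pa, &p->walks, &p->hits, &p->doubles,
                   &p->triples, &p->homers, &p->strikeouts, &p->gidp, &p->hbp);
    return n == 10;
}
```

A line for a made-up player would look like "Player A", then 100, 10, 30, 5, 1, 4, 20, 2 and 3, each separated by a single tab.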


My math:
computer geek + baseball geek = ubergeek


NERD! (...now, back to my job as a videogame programmer...)


No arguments here. It's pretty geeky of me. Even geekier is that my next step is to try to parallelize the code so that I can distribute the run across several computers networked into a grid. My very own computing cluster!
