Don’t feel bad about your “busted” NCAA basketball bracket: it was better than a “Wild Assed Guess”

March Madness is over. You might be feeling bad about your bracket, so I thought I would give you a little ego boost. Your bracket probably isn’t as bad as you think.

The past two years I have done very well in my bracket pool. Also, I like to brag. I was bragging on Facebook about how well I have been doing picking brackets the past two years (my exact words were, “Data driven/statistical decision making pays off again!”) and my college genetics professor asked if my picks were actually better than a “Wild Assed Guess”. He has a point. I might have just been lucky. So, I wanted to investigate this in more detail (so I could brag some more, of course).

Last year the bracket pool was set up with an “upset bonus”, which I exploited statistically and ruined everyone’s fun (if someone wants to partner with me, we could make a lot of money in Vegas betting in bracket pools with upset bonuses). This year the bracket pool was set up with the “standard” format: first-round games were worth 1 point each, second-round games 2 points, and so on, doubling each round until picking the champion was worth 32 points. There were 20 people in our bracket pool and I finished in 2nd with 111 points. To pick my bracket I just used the most popular bracket picks supplied by Yahoo Sports. My rationale was that the collective reasoning of the masses was probably better than whatever I would choose. I could have also chosen to go with the rankings established by the NCAA committee that sets up the tournament. That would have earned 112 points, which would still have placed 2nd. Anyway, you might think it is a stretch to call my approach “data driven” and “statistical”, but it really was. My choice was based on millions of other people’s choices, and I was unbiased in the decision making. Yahoo just did the analysis for me.

Was this method better than a “Wild Assed Guess”? When picking winners of games, I am assuming that a “Wild Assed Guess” is equivalent to a coin flip, 50/50. I used this assumption to simulate 1 million brackets based on “Wild Assed Guesses” and determined how many points each bracket would have earned in this year’s tournament. The distribution of points earned by the brackets looks like this:

The hump on the right is caused by brackets which randomly picked the correct champion. In the graph below, brackets which didn’t pick the champion are in the left panel, and those which did are in the right panel. The scales are the same.
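The coin-flip simulation described above can be sketched in Perl roughly as follows. This is not the code used for the post: the team numbering (0 to 63 in bracket order), the names random_bracket and score_bracket, and the randomly generated stand-in for the real tournament results are all assumptions made for illustration. The scoring assumes points double each round (1, 2, 4, 8, 16, 32).

```perl
use strict;
use warnings;

# Fill out a bracket by coin flip. Teams are numbered 0..63 in bracket
# order, so neighbours in the list meet in the next game.
# Returns an array ref of rounds; each round is an array ref of winners.
sub random_bracket {
    my @field = (0 .. 63);
    my @rounds;
    while (@field > 1) {
        my @winners;
        for (my $i = 0; $i < @field; $i += 2) {
            # 50/50 pick between the two teams in this game
            push @winners, $field[ $i + int(rand(2)) ];
        }
        push @rounds, [@winners];
        @field = @winners;
    }
    return \@rounds;
}

# Score picks against results: 1 point per correct first-round pick,
# doubling each round up to 32 for the champion (assumed scheme).
sub score_bracket {
    my ($picks, $actual) = @_;
    my $score = 0;
    for my $r (0 .. $#$actual) {
        for my $g (0 .. $#{ $actual->[$r] }) {
            $score += 2**$r if $picks->[$r][$g] == $actual->[$r][$g];
        }
    }
    return $score;
}

# Hypothetical stand-in for the real tournament results.
my $actual = random_bracket();

# A small run; the post used 1 million brackets.
my @scores = map { score_bracket(random_bracket(), $actual) } 1 .. 10_000;
my ($max) = sort { $b <=> $a } @scores;
printf "Best of %d coin-flip brackets: %d points\n", scalar(@scores), $max;
```

Under this scheme every round is worth 32 points in total, so a perfect bracket scores 192. To reproduce the post's numbers, the stand-in results would be replaced with the real winners of each game.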

My bracket scored better than all but 283 of the 1 million brackets based on “Wild Assed Guesses”. I ran the same experiment a second time and my bracket scored better than all but 303 of the 1 million brackets. This implies that there is roughly a 0.03% chance that a bracket based on a “Wild Assed Guess” would perform as well as mine. In science we summarize this concept as a p-value, expressed as a probability where a 100% chance is 1 and a 0% chance is 0. It is a commonly accepted practice to use a p-value cutoff of 0.05. If a p-value is less than 0.05 (p<0.05) we conclude that an observation this extreme would be rare enough under the null hypothesis of no difference that we can reject it. Many scientists are sticklers for the whole p<0.05 thing. I actually think being a stickler for “p<0.05” is pretty dumb because the p-value itself contains the information about the certainty/uncertainty we are looking for. For example, p=0.051 isn’t much different from p=0.049. Here is an example of how we would use a p-value: I conclude that the method I used to pick my bracket is better than a “Wild Assed Guess” (p=0.0003).
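The p-value arithmetic above is just a ratio of counts. A minimal Perl sketch, assuming the 283 and 303 brackets quoted in the post are the ones that did at least as well as the observed score; the function name empirical_p_value is made up for illustration:

```perl
use strict;
use warnings;

# Empirical p-value: the share of simulated scores that are at least
# as high as the observed score.
sub empirical_p_value {
    my ($observed, @simulated) = @_;
    my $as_good = grep { $_ >= $observed } @simulated;
    return $as_good / @simulated;
}

# Counts quoted in the post, out of 1,000,000 random brackets per run.
printf "run 1: p = %.6f\n", 283 / 1_000_000;
printf "run 2: p = %.6f\n", 303 / 1_000_000;
```

Both runs round to p = 0.0003, and the gap between them gives a feel for how much the estimate moves from one simulation to the next.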

Now for the cheerful, ‘non-braggy’ part. Your bracket was probably better than a “Wild Assed Guess”. If you abide by the whole p<0.05 thing, your bracket only needs to have scored 50 points for you to conclude that your method was better than a “Wild Assed Guess”. In my bracket pool only 2 people had scores less than 50. Even then, their brackets were better than 94 and 92 percent of brackets based on “Wild Assed Guesses”.

One last thought: the NCAA committee which chooses the tournament seedings deserves a lot of credit. It is obvious that their seedings are better than a “Wild Assed Guess” (p=0.000236). Without the seeding information I doubt that the collective consciousness of Yahoo bracketologists would have performed so well.

Things I find interesting:

1. Picking the champ has a HUGE effect on whether you win your bracket pool.

2. The NCAA tournament committee does a really good job picking the seedings.

3. Most people do better than a “Wild Assed Guess” when they pick their brackets.

4. The guy who won my bracket pool is a badass. However, I am not sure that we are statistically different.

5. Next year I should try to beat the NCAA seedings.

Acknowledgements: I would like to thank my undergraduate genetics professor for inspiring this blog post. He will remain unnamed unless he requests otherwise.

Texas Hold’em Monte Carlo simulation: Winning hands

I previously told you about the odds of getting various hands in Texas Hold’em. In my example I compared odds estimated using Monte Carlo simulations to the exact odds calculated using mathematics. I promised you that this approach would come in handy once calculating exact odds becomes unrealistic. Here is my first example. I simulated 110 million rounds of poker: 10 million rounds with 2 players, 10 million with 3 players, and so on up to 12 players. Here are the results:

Frequency with which a given hand wins the round:

The odds that a given hand is a winning hand:

Texas Hold’em Monte Carlo simulation: Odds of various hands

I recently posted that I wrote a simple program which draws random cards and simulates dealing poker hands. More recently I wrote a tool that evaluates the hands and finds the best 5-card hand that can be made. In the coming months I will be adding more functions to the program, and as I do I will update you on different probabilities that I determine using Monte Carlo simulations. Some of the earlier results can also be worked out exactly using mathematical approaches. However, as you will come to see, many of the simulations to come will answer questions that are nearly impossible to answer using analytical approaches.

Using mathematics you can calculate the exact odds of any hand in 7-card poker. Here is a link describing these calculations: Exact hand odds
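As one worked example of an exact calculation, four of a kind is the easiest hand to count: pick the rank that appears four times (13 ways), then any 3 of the remaining 48 cards. The Perl sketch below is only an illustration and is not taken from the linked page; the helper name choose is an assumption.

```perl
use strict;
use warnings;

# Binomial coefficient C(n, k), built up multiplicatively so every
# intermediate value stays an exact integer.
sub choose {
    my ($n, $k) = @_;
    return 0 if $k < 0 || $k > $n;
    my $result = 1;
    for my $i (1 .. $k) {
        $result = $result * ($n - $k + $i) / $i;
    }
    return $result;
}

my $total_hands = choose(52, 7);       # all 7-card combinations
my $quads       = 13 * choose(48, 3);  # rank of the four, then any 3 other cards

printf "Four of a kind: %d / %d = %.5f\n",
    $quads, $total_hands, $quads / $total_hands;
```

This gives 224,848 hands out of 133,784,560, or about 0.00168. The other hand types need more careful counting because their definitions overlap (a flush can also be a straight, for example).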

Here are the odds I estimated using 200 million simulations:

True odds are shown in parentheses for comparison.

Straight Flush: 0.00031 (0.00031)
Four of a kind: 0.00168 (0.00168)
Full house: 0.02596 (0.02596)
Flush: 0.03026 (0.03025)
Straight: 0.04619 (0.04619)
Trips: 0.04831 (0.04830)
Two pair: 0.23496 (0.23496)
Pair: 0.43822 (0.43823)
High card: 0.17410 (0.17412)

Things I find interesting:

1. You are more likely to get two pair than to play a high card.

2. There is a pretty big drop off in percentages between two pair (~1/4) and trips (~1/20).

3. The odds of getting trips or better are slightly worse than 1 in 6.5 hands.

4. The odds of getting a royal straight flush (RSF) are ~1/32,000. So, if someone tells you that they once had an RSF, they may be telling the truth if they play A LOT of poker, but otherwise they are probably bullshitting you. Also, if you see one at your weekly poker game, odds are that someone is cheating.
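The “trips or better” figure in point 3 is the sum of the six strongest rows of the table. A short Perl sketch of that arithmetic, using the simulated odds exactly as listed above (the variable names are illustrative):

```perl
use strict;
use warnings;

# Simulated odds copied from the table above.
my %odds = (
    'Straight flush' => 0.00031,
    'Four of a kind' => 0.00168,
    'Full house'     => 0.02596,
    'Flush'          => 0.03026,
    'Straight'       => 0.04619,
    'Trips'          => 0.04831,
);

my $trips_or_better = 0;
$trips_or_better += $_ for values %odds;

printf "Trips or better: %.5f (about 1 in %.2f hands)\n",
    $trips_or_better, 1 / $trips_or_better;
```

The sum comes out near 0.153, which is a little under 1 in 6.5 (about 0.154).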

Review: Logitech ultrathin keyboard cover for iPad

I am the proud owner of a 4th generation iPad. I had been thinking that I wanted to buy a laptop/netbook/tablet for months, so of course Megan bought the iPad for me for Christmas. She is a great gift giver. The iPad is an incredible product. And so is the keyboard I bought for it, which I plan on telling you about below.

At the end of my first year of graduate school I decided it would help me maintain a healthy life/work balance not to buy another laptop when my old one crapped out on me. I ended up buying a desktop, and my plan for not taking work with me 24/7 worked great. However, after many hours holding down desk chairs I am ready to be mobile again.

The iPad is an amazing media consumption device. However, even with the split-screen keyboard, which I like, the iPad is inconvenient to use for serious editing and programming because there simply is not enough screen real estate. The on-screen keyboard is a nice thought but I need a real keyboard to get things done. For a few weeks I carried around the Apple keyboard for my iMac, but eventually it became inconvenient because I was constantly switching the keyboard from the iPad to the iMac and vice versa. I read several keyboard reviews and watched some YouTube videos, and decided to buy the Logitech ultrathin keyboard cover. It cost ~$80, which isn’t cheap but is sort of par for the course for a quality tablet keyboard. I have been using it for 5 days now so here are my thoughts so far:

1. What an innovative design! When not in use the keyboard acts as a screen cover similar to the iPad Smart Cover sold by Apple, which goes for ~$40 by the way. Like Apple’s cover, it uses magnets to attach to the side of the iPad and to control sleep/wake. Apple really gets the credit for that innovative design, but Logitech gets the credit for paying the license fees or whatever they had to do to integrate it into this keyboard! Switching between “cover” mode and “keyboard” mode is surprisingly intuitive. Just set the iPad down with the cover facing down, then pick up the iPad and unsnap it from the magnetic hinge. Rotate it and put it in the magnetic groove, which holds the iPad at an angle like a laptop screen. In portrait mode the iPad is not held in place by magnets; adding them would be an improvement. On the flip side, I haven’t needed to use it in portrait mode much at all. Overall the design of this keyboard gets an A+.

2. Holy functionality! With a real keyboard my iPad is now a surprisingly functional work device. I am writing this blog post using the keyboard right now. I have big hands. I can even palm a basketball off the bounce. My hands feel a little cramped typing, but I think that a person who doesn’t have Neanderthal hands like me would probably find it relatively comfortable to use. The keyboard size is constrained by the size of the iPad, but really it isn’t so bad. I especially like the function keys. There is a home screen button on the keyboard which is especially handy. Overall, the keyboard gets an A because although it is a little cramped it is MUCH, MUCH better than the iPad’s on-screen keyboard.

3. Durability? I have only had the keyboard for 6 days so the jury is still out on its durability. The outer shell certainly seems sturdy. It appears to be made of the same material as the back of the iPad (some type of aluminum?), which looks and feels great, but which I worry may be susceptible to scratching. Earlier today I found a small scratch on the back of the keyboard. I am not sure when or how it got there, which is a little worrisome. If scratches become a trend I will post an amendment below to update you. I read a criticism that the keys feel kind of cheap. I actually like them. They are not as nice as the pieces of finger magic that grace the Apple keyboard, but they are not terrible either. They give good feedback and are light to push, but don’t feel flimsy or cheap. Durability grade: B. Small scratch already, but not bad enough for a failing grade.

So far I am very happy with this product, which is what motivated me to write this blog post.
Overall grade: A

Texas hold’em poker hand simulator

I recently mentioned that I plan on simulating all of the data that I need to investigate Texas Hold’em. Here is how:

First, I created a really small program called a subroutine which randomly draws a single card from a deck of cards. Once a card is drawn it cannot be re-drawn until the deck is refreshed. This is the program’s core.

Second, I wrote a very simple script which asks the program user how many hands of poker to deal and how many players. It then calls the subroutine to draw the cards for the number of players, the flop, the turn, and the river. It does this for the number of hands requested to be dealt.

Right now the program writes the output to the command line, but it is pretty simple to change this so that it stores the hands in a text file. The program is written in Perl and it is reasonably fast. In Texas Hold’em there are many different combinations of hands that are possible (a number that is close to infinite for practical purposes), so it is important that the program is fast enough to simulate a large number of hands quickly. On my 3-year-old iMac, the program takes about 15 seconds to simulate 10,000 hands of poker with 12 players in each hand, 35 seconds to simulate 100,000 hands with 4 players, and just under 5 minutes to simulate 1,000,000 hands of heads-up poker. I think this will be fast enough for me to simulate as many hands of poker as I will likely need.

Here is a simulated hand:

###ROUND 1000000###
Player 1: AH AD
Player 2: 2C 5S
FLOP: 2D 3D KS
TURN: 9S
RIVER: QD

Here is the code for the subroutine:

# Deck positions already dealt. Empty this hash (%pickedcards = ();)
# to refresh the deck before the next round.
our %pickedcards;

sub drawacard {
    my @cards = qw(
        2H 3H 4H 5H 6H 7H 8H 9H 10H JH QH KH AH
        2D 3D 4D 5D 6D 7D 8D 9D 10D JD QD KD AD
        2S 3S 4S 5S 6S 7S 8S 9S 10S JS QS KS AS
        2C 3C 4C 5C 6C 7C 8C 9C 10C JC QC KC AC
    );

    # Keep picking a random position until we hit one not yet dealt.
    my $randomcard;
    do {
        $randomcard = int(rand(52));
    } until (! $pickedcards{$randomcard}++ );

    # A single element, so use the scalar form rather than a slice.
    return $cards[$randomcard];
}
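For comparison, here is a hedged sketch of how the dealing script described earlier might look. It is not the original script: it swaps the redraw loop for a single shuffle using List::Util (a core module), and the names deal_round and print_round, along with the hash layout, are invented for this example.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Deal one round of Texas Hold'em for $num_players players.
# Returns a hash ref with the hole cards and the community cards.
sub deal_round {
    my ($num_players) = @_;
    die "too many players for one deck\n" if 2 * $num_players + 5 > 52;

    my @ranks = (2 .. 10, qw(J Q K A));
    my @deck  = shuffle(map { my $suit = $_; map { $_ . $suit } @ranks } qw(H D S C));

    my %round;
    # Two hole cards per player, taken off the top of the shuffled deck.
    push @{ $round{players} }, [ splice(@deck, 0, 2) ] for 1 .. $num_players;
    $round{flop}  = [ splice(@deck, 0, 3) ];
    $round{turn}  = [ splice(@deck, 0, 1) ];
    $round{river} = [ splice(@deck, 0, 1) ];
    return \%round;
}

# Print a round in the same layout as the sample hand.
sub print_round {
    my ($number, $round) = @_;
    print "###ROUND $number###\n";
    my $seat = 1;
    printf "Player %d: %s\n", $seat++, join(' ', @$_) for @{ $round->{players} };
    print "FLOP: @{ $round->{flop} }\n";
    print "TURN: @{ $round->{turn} }\n";
    print "RIVER: @{ $round->{river} }\n";
}

print_round(1, deal_round(2));
```

Shuffling once per round avoids the repeated redraws that the rejection loop makes as more of the deck is used up, at the cost of rebuilding the deck every round. Whether that is actually faster for a given number of players would need to be timed.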

So the question is, what do we do now? What have you always wanted to know when you were playing Texas Hold’em?

Data sources

I recently posted that I intend to start writing posts in which I apply statistics in situations that are more interesting to a general audience. More specifically, I intend to blog about Texas Hold’em and American football. I can simulate all of the data that I will need to investigate Texas Hold’em. However, to investigate American football I will need data from real games. I have been searching around for data sources, but I hadn’t had much luck until today. I casually read several blogs related to genetics/genomics, statistics, and programming/computer science. Today I was reading this post on simplystatistics.org, which is a hodgepodge of links loosely related to statistics, and discovered this post on advancedNFLstats.com which provides an accumulation of play-by-play data for the past 11 seasons. This is exactly what I have been looking for. Kudos to Brian Burke for compiling this data.

Also, kudos to Sean J Taylor. I liked this post. The comments on R, Matlab, and SAS were funny. I am curious: does using Perl put you in the Python and JVM crowd?

Two quick New Year’s resolutions

I have had this website for a couple of years. It has undergone multiple revisions. I have written blog posts and then taken them down. I have used it to experiment with ftp, ssh, e-commerce software, and different content management systems. I have played around with it a lot, but I have never given it an identity or really used it for web publishing. It is finally time to give my website an identity that sticks.

Start writing blog posts about genetic/bioinformatic software. My PhD has focused on analyzing genetic data with a heavy focus on copy number variation. From using the command line to statistical interpretation of data, most of what I do on a day-to-day basis is self-taught. I have a lot to say that falls under the realm of personal opinion or general usefulness, but doesn’t necessarily need to be published in peer-reviewed manuscripts.

Write blog posts about applying statistics to things more interesting to a general audience. This is one I have been interested in doing for a while. I thought about making it the primary focus of the blog, but I think I would struggle to produce content because it is too far removed from what I do on a daily basis. However, I have had some cool experiences using statistics outside of genetics so I will highlight those as they come along.

Copy number variation in Mexican Americans

I was recently first author on a study which identified 2,937 copy number variable regions in individuals of Mexican American descent. Among our results was the discovery of three related women carrying a ~1.4 Mb deletion containing the gene HNF1B, which led to the diagnosis for two of these women of maturity onset diabetes of the young 5 (MODY5). We are currently using these CNVRs to investigate the effect of copy number variation on heritable variation in gene expression and disease-relevant quantitative traits.

Here is a link to the publisher’s website:
http://www.nature.com/ejhg/journal/vaop/ncurrent/full/ejhg2012188a.html
