[personal profile] interstice
I half-heartedly competed in this http://kaggle.com/chess?viewtype=results, a competition to predict chess match outcomes from observed past matches. The penalty was a variant of RMSE (i.e. L2) against aggregated win (1) / loss (0) outcomes. This penalty rewards conservatism: in the absence of strong contrary information, the best guesses sit near the baseline win rate (~0.55-0.60 for White). An absolute-deviation (L1) loss would encourage "bolder" guesses, and an entropy-type loss would have necessitated predicting outcomes as opposed to probabilities. I am not convinced that RMSE is the best penalty, but that's not in my bailiwick.
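To make the conservatism point concrete, here is a minimal sketch (none of this is from our competition code; the 0.55 White win rate and the restriction to a single constant guess are purely illustrative). Under squared error the best constant prediction is the base rate itself, while under absolute error it snaps to the boldest possible answer:

# Minimal illustration: L2 pulls a constant prediction toward the mean,
# L1 toward the median (here 1, since White wins more than half the time).
import numpy as np

rng = np.random.default_rng(0)
outcomes = (rng.random(10_000) < 0.55).astype(float)  # 1 = White wins

candidates = np.linspace(0.0, 1.0, 101)  # constant predictions to try
l2_loss = [np.mean((outcomes - p) ** 2) for p in candidates]
l1_loss = [np.mean(np.abs(outcomes - p)) for p in candidates]

print("best constant under L2:", candidates[np.argmin(l2_loss)])  # ~0.55, the mean
print("best constant under L1:", candidates[np.argmin(l1_loss)])  # 1.0, the median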

My team was "acceptablelies" and we came in 73rd out of 258, which is OK given that we basically gave up (too much other work to do) about a third of the way into the competition. If we'd tweaked our premise to include what was in the literature (e.g. some kind of dynamic model describing a player's evolution over time), I think we could have reached ~30th place. But that isn't the point.

What is remarkable is that our model included no temporal terms at all and still came out competitive with the currently standard Glicko rating system: Glicko came in 72nd, beating our RMSE by only 0.0003. For all intents and purposes, the two models are indistinguishable in predictive performance. Of course, they may each perform better on distinct subsets of the data, in which case even a naïve mixture would be fruitful.

Glicko uses a semi-elaborate (but smooth!) dynamic response model in which each player has a rating that is allowed to vary smoothly over their observed performance. By contrast, our system used only a discrete ranking (top tier, 2nd tier, ... down to 6th tier), fit by a simple energy-minimization heuristic. The ranking was then used to generate a two-way linear factor model whose output was passed through a squashing function (0-1 clipping; oddly enough, a logistic made things worse). An orthogonal L2 penalization term (i.e. an independent Gaussian prior) was used to prevent overfitting.
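Our actual code isn't reproduced here, but the following is a much-simplified sketch of that kind of model, with hypothetical names throughout: the tier assignments are taken as given (the energy-minimization step is omitted), the linear part is fit by ordinary ridge regression, and the 0-1 clip is applied only at prediction time, which sidesteps fitting through the nonconvex squashing step:

# Sketch of a two-way tier factor model with 0-1 clipping and a ridge
# (independent Gaussian prior) penalty. Tier assignments are assumed given.
import numpy as np

N_TIERS = 6

def design_matrix(white_tier, black_tier):
    """One-hot White-tier and Black-tier effects plus an intercept."""
    n = len(white_tier)
    X = np.zeros((n, 1 + 2 * N_TIERS))
    X[:, 0] = 1.0                                    # baseline White win rate
    X[np.arange(n), 1 + white_tier] = 1.0            # White's tier effect
    X[np.arange(n), 1 + N_TIERS + black_tier] = 1.0  # opponent's tier effect
    return X

def fit_ridge(X, y, lam=1.0):
    """Ridge regression; the L2 penalty is an independent Gaussian prior."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

def predict(w, white_tier, black_tier):
    raw = design_matrix(white_tier, black_tier) @ w
    return np.clip(raw, 0.0, 1.0)  # the squashing step: 0-1 clipping, not a logistic

Fitting is then just w = fit_ridge(design_matrix(wt, bt), outcomes) over the observed games, followed by predict on the test pairings.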

Despite this simple-minded approach (i.e. we had only ~62 parameters**), we still essentially matched Glicko (*). Thus, from a purely predictive standpoint, there is perhaps evidence that these two hypotheses are in a "dead heat": "players gradually change over time and are mostly comparable" versus "players don't change much, and can further be ordered into a few mostly identical groups".

Given that we match in predictive performance, there is something to be said for our relatively parsimonious ~40-parameter model as opposed to modeling a continuous trajectory. Although we would have done better by including a "temporal twist", I also think that smooth/continuous methods could benefit from a discretization/factor approach. In fact the two are probably quite complementary in the real world (although the combination may seem a bit hodge-podge at times). I think this is a good lesson to have learned, and I would be surprised if the winning teams did not use such a combination.

*: I'm fairly convinced that if we'd used a more principled approach in selecting the rankings (factors), we'd have gotten maybe 55th-ish.

**: OK, there are a lot more parameters than that, since each player has a 6-way membership; in this sense the model has O(n) parameters. However, considered this way, Glicko has O(n*t) parameters, since each player gets their own trajectory across time.

Date: 2010-11-19 04:10 pm (UTC)
From: [identity profile] gustavolacerda.livejournal.com
If I were the one asking you to predict these things, I would want your best subjective probability... especially if other agents might come to different conclusions.


<< modeling the loss (which is minus the loglikelihood) >>

I'm not aware of this loss function. But surely you can define other loss functions, no?


What is curious to me is that the proper scoring rule is not unique.

Date: 2010-11-19 06:19 pm (UTC)
From: [identity profile] random-walker.livejournal.com
I think you are aware of that loss function; after all, an MLE minimizes the (-loglikelihood) loss. Anyway, my point in bringing it up is perhaps a bit involved; we can talk about it IRL.

Since the proper scoring rule is not unique, perhaps it suggests that a subjective probability does not encapsulate all of one's (un)certainty.
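(For concreteness, a quick numerical illustration of that non-uniqueness, assuming nothing beyond the textbook definitions: with true probability q, both the Brier score and the log score are minimized in expectation by reporting p = q, yet they are different functions.)

# Two different proper scoring rules, both minimized at the true probability.
import numpy as np

q = 0.7                          # true probability of the event
p = np.linspace(0.01, 0.99, 99)  # reported probabilities

brier = q * (1 - p) ** 2 + (1 - q) * p ** 2
log_score = -(q * np.log(p) + (1 - q) * np.log(1 - p))

print("Brier minimized at p =", p[np.argmin(brier)])          # 0.70
print("log score minimized at p =", p[np.argmin(log_score)])  # 0.70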

I agree, it seems that one would want more than just classification when doing a meta-analysis, but how much more...? I don't like subjective probability, but on the other hand it seems to be useful. Then again there are methods like boosting which do meta-analysis without confidence/subjective probability. On the fourth hand, boosting seems very brittle to noise (http://www.phillong.info/publications/LS10_potential.pdf).

It is interesting.
