The new Bakery rating system will continue to be based on the 1-5
star ratings from Bakery 1.0. Currently, an article's score is the
simple arithmetic mean of the ratings users have given it.
The problem with this is that the arithmetic mean (average) is not a
robust statistic: it is easily skewed by outlying data points,
especially when there are only a few ratings. Example: an article has
the following five ratings (1,1,1,1,5). We can clearly see that there
is one definite outlier, the rating of 5 (perhaps given to the article
by its author from a different account, or by one of his friends,
etc.), which bumps the average up from 1 to 1.8. While this is a
contrived example, I'm sure you get the idea.
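
To make the arithmetic concrete, here is a small Python sketch of the example above. It uses only the standard library; the variable names are mine and are not taken from the Bakery code.

```python
from statistics import mean, median

# The five ratings from the example: four 1-star votes and one 5-star outlier.
ratings = [1, 1, 1, 1, 5]

# The arithmetic mean is pulled upward by the single outlier.
print(mean(ratings))        # 1.8

# Dropping the outlier shows what the other four voters agreed on.
print(mean(ratings[:-1]))   # 1

# The median is shown only for contrast: it ignores the outlier entirely.
print(median(ratings))      # 1
```

One vote out of five is enough to move the score by 0.8 stars, which is the sensitivity the proposal is trying to get away from.
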
My proposed method will forgo the simple arithmetic mean in favor of a
more statistically sound method, which relies on what are called
[confidence intervals](http://en.wikipedia.org/wiki/Confidence_interval).
Simply put, the article ratings may be considered approximately
normally distributed (i.e. they can be modeled using a
[Gaussian distribution](http://en.wikipedia.org/wiki/Normal_distribution)).
From this model we may calculate the mean & the variance of the
sample, and using those values, a confidence interval for the article
score may be constructed. This confidence interval gives an upper and
a lower bound for the predicted population mean based on the limited
sample given. We can then take the lower bound & use it as the article
rating; the fewer or the more scattered the ratings, the wider the
interval and the lower that bound will be.
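
The idea can be sketched in a few lines of Python. This is only an illustration under assumptions of my own, not the Bakery implementation: the function name `lower_bound`, the 95% confidence level, and the use of the plain normal (z) interval are all choices made for the example. With very few ratings a Student's t interval would be wider, and nothing here clamps the result to the 1-5 scale.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def lower_bound(ratings, confidence=0.95):
    """Lower end of a normal-approximation confidence interval for the mean.

    Hypothetical helper for illustration; returns None when there are
    fewer than two ratings, since the sample variance is undefined then.
    """
    n = len(ratings)
    if n < 2:
        return None
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Standard error of the mean, using the sample standard deviation.
    std_err = stdev(ratings) / sqrt(n)
    return mean(ratings) - z * std_err

# Article A: only three ratings, but a high average.
article_a = [5, 5, 4]
# Article B: thirty ratings with a slightly lower average.
article_b = [5] * 15 + [4] * 15

print(mean(article_a), lower_bound(article_a))  # mean ~4.67, bound ~4.01
print(mean(article_b), lower_bound(article_b))  # mean 4.5,  bound ~4.32
```

Article A has the higher plain average, but article B ends up with the higher score, because thirty ratings pin down the population mean far more tightly than three do.
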
Depending on the tests & model I construct, I may add in a logarithmic
corrective factor to account for how long an article has been posted.
This might be useful, since over time an article can be edited or
corrected to improve the content, which (hopefully) improves the
ratings that people give. However, this might be a bit of overkill.
I'm not sure yet.
The algorithm & implementation details will be made available in the
bakery source, of course. I'll make an effort to have it well
documented.
**N.B.** In statistics, there is a difference between a 'sample' and a
'population'. A sample is, in our case, the ratings that an article
has been given. The corresponding 'population' would be the
(theoretical) ratings that the article would have if every single
person who viewed that Bakery article had voted, instead of only a
subset of them. Most of statistics is concerned with estimating some
quantity for the population, given only a small sample.
