Our daily Stockfish: 250413 (TCEC)

By Thorsten Czub Date 2013-04-29 17:13

yes and no. on the one hand, in my tournament houdini has to play against weaker programs. but it also has to play against programs it would never meet in a rating list. and these programs, although weaker, can differentiate between h2 and h3.
if YOU e.g. test only against the few important opponent programs (and of course you cannot test against everybody),
you may not find out about these differences.

the less variety you have in your pool of test opponents, the less accurate your results are. the weakest possible test would be (IMO) to play h2 versus h3 and derive a performance rating only from these incest matches of version against version+1.

if you now confront the best h3 engines with OTHER engines you never played against, the performance is different.

so each tournament (due to the variety of opponents) gives a completely different performance.

if you test only against komodo, stockfish and rybka, and maybe h2, you get an incest rating too, because the stockfish team is also testing only against komodo, houdini3 and maybe rybka.

with all this incest testing, the ratings drift further and further from reality IMO.

By Ingo Althöfer Date 2013-04-29 17:29

Hi Thorsten,

[quote="Thorsten Czub"]
... with all this incest testing, the ratings drift further and further from reality IMO.
[/quote]

a very valid point. Incest has often been a problem in computer
chess, for instance between 2000 and 2004, when selective search
was growing sky-high without control. Then "Bodendecker" (ground cover)
Fruit came along and convinced with its simple, clear-cut search trees.

Given an ensemble
E_1, E_2, ... E_m of chess programs where the strength relations are
E_1 > E_2 > ... > E_m
and for instance Score(E_1 vs E_m) = 75 %
it is really an interesting question which percentage of the sparring games of a
(strong) newcomer should be made against each of the E_i. Of course E_1
should get most of the games, but by which margin?

Ingo.
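
To put the 75 % figure above into rating terms, here is a minimal sketch that assumes the usual logistic Elo model (the thread does not name a rating model, so this is an assumption, and the function names are made up for illustration). Under that model a 75 % score corresponds to a gap of roughly 191 Elo between E_1 and E_m.

```python
import math


def elo_diff_from_score(score: float) -> float:
    """Rating gap implied by an expected score, assuming the logistic Elo model."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return 400.0 * math.log10(score / (1.0 - score))


def expected_score(elo_diff: float) -> float:
    """Inverse mapping: expected score for a given rating gap."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


# Ingo's example: Score(E_1 vs E_m) = 75 %
gap = elo_diff_from_score(0.75)
print(f"Elo gap between E_1 and E_m: {gap:.1f}")
```

This only converts a score into a rating gap. It does not answer the open question of how to split a newcomer's games across the E_i.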

By Robert Houdart Date 2013-04-29 20:22

Thorsten, your tournament currently only demonstrates the limited accuracy of playing a relatively low number of games against mostly weak opponents.

I suggest that you make a simple test. You take any weak engine that performed better against Houdini 3 than against Houdini 2 in your tournament table (for example Ruffian 1.0.5), and instead of playing 8 games you now play 1,000 games against both Houdini 2 and 3.

You claim that "these programs, although weaker, can differentiate between h2 and h3".
I claim that you're just seeing the effect of playing too few games, and that with a larger number of games you will see that Houdini 3 performs better against these engines than Houdini 2.

Cheers,
Robert
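
The gap between 8 and 1,000 games can be illustrated with a rough sketch. It assumes independent games with fixed win/draw/loss probabilities and uses a normal approximation, which is crude for only 8 games. The 90 % score and the function name are hypothetical, chosen for illustration, and are not taken from the tournament table.

```python
import math


def score_margin(games: int, score: float, draw_rate: float = 0.0,
                 z: float = 1.96) -> float:
    """Approximate 95% margin of error on a match score fraction.

    Assumes independent games with fixed win/draw/loss probabilities,
    which is a simplification of real engine matches.
    """
    win = score - draw_rate / 2.0
    loss = 1.0 - score - draw_rate / 2.0
    if games <= 0 or win < 0.0 or loss < 0.0:
        raise ValueError("inconsistent games, score, or draw_rate")
    # Per-game variance of the result (win = 1, draw = 0.5, loss = 0).
    variance = win + 0.25 * draw_rate - score ** 2
    return z * math.sqrt(variance / games)


# Hypothetical 90 % score with no draws, for illustration only.
margin_8 = score_margin(8, 0.90)
margin_1000 = score_margin(1000, 0.90)
print(f"8 games:    +/- {margin_8:.3f}")
print(f"1000 games: +/- {margin_1000:.3f}")
```

Under these assumptions the margin is about 21 percentage points for 8 games and about 2 for 1,000 games, so a small difference between two versions could easily be hidden in the short match.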

By Thorsten Czub Date 2013-04-29 20:47

weak is relative, robert: from houdini's point of view all engines are weaker.
but i am not really interested in houdini in this tournament. i have strong engines in it,
middle-class engines and weak engines.

it would make no sense to let houdini play against only the top engines.
the results would be misleading because the top engines are very similar.
the more similar the engines are, the more senseless the testing is.

By Ingo Althöfer Date 2013-05-02 06:31

Hello Thorsten,

I have thought about Robert Houdart's suggestion for a long time.
Could you please, as he suggested, play 1,000 games each
with H-2 and H-3 against Ruffian 1.0.5 and report the results
here?

[quote="Robert Houdart"]
... I suggest that you make a simple test. You take any weak engine that performed better
against Houdini 3 than against Houdini 2 in your tournament table (for example Ruffian 1.0.5),
and instead of playing 8 games you now play 1,000 games against both Houdini 2 and 3.
[/quote]

Regards, Ingo.