A week ago I made a thread asking whether ICML 2026’s review policy might have affected review outcomes, especially whether Policy A papers might have been judged more harshly than Policy B papers.

Original thread:
Poll:

The goal was not to prove causality. It was simply to collect a rough community snapshot and see whether there are any visible trends in:

reported average scores,
reported reviewer confidence,
whether scores felt harsher than expected,
and whether reviews felt especially polished.

Now, before rebuttal scores, I wanted to share the current results from the survey.

Important disclaimer

These results are still not conclusive. This is a self-selected community poll, not an official dataset, and there are many possible sources of bias. So please read this as descriptive, preliminary data, not as proof that one policy caused better or worse outcomes. Still, with 100 responses after one week, I think the data are now interesting enough to at least discuss.

Sample size

100 total submissions
99 submissions with a valid average score
91 submissions with a valid average confidence

By policy:

Policy A: 59 responses
Policy B: 41 responses

Summary table

Policy	Responses	Mean Score	Score SD	Mean Confidence	Confidence Responses
Policy A	59	3.26	0.50	3.53	55
Policy B	41	3.43	0.63	3.35	36
Total	100	3.33*	0.56*	3.46**	91

* based on 99 valid average score entries
** based on 91 valid confidence entries
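As a quick sanity check, the "Total" row can be approximately reproduced from the two per-policy rows with a count-weighted mean. This is only a sketch: it assumes the score means can be weighted by the response counts (59 and 41), even though only 99 scores were valid and the post does not say which policy the missing one belongs to.

```python
# Per-policy (count, mean) pairs taken from the summary table.
# Assumption: score means are weighted by response counts (59 and 41);
# the post does not say which policy had the one invalid score entry.
score_rows = {"Policy A": (59, 3.26), "Policy B": (41, 3.43)}
# Confidence means are weighted by the "Confidence Responses" column.
confidence_rows = {"Policy A": (55, 3.53), "Policy B": (36, 3.35)}


def weighted_mean(rows):
    """Return the count-weighted mean of (count, mean) pairs."""
    total_n = sum(n for n, _ in rows.values())
    return sum(n * m for n, m in rows.values()) / total_n


overall_score = weighted_mean(score_rows)
overall_confidence = weighted_mean(confidence_rows)

print(f"pooled mean score:      {overall_score:.2f}")       # 3.33
print(f"pooled mean confidence: {overall_confidence:.2f}")  # 3.46
```

Both values match the reported totals after rounding, so the table is at least internally consistent.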

Plot 1: score distribution by policy

Distribution of scores by chosen policy

First patterns I see:

1) Policy B currently has a somewhat higher reported mean score

At the moment, the average reported score is higher for Policy B (3.43) than for Policy A (3.26). This is not conclusive evidence that Policy B was advantaged in a causal sense. But the difference is visible enough that it seems worth discussing.
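To put a rough size on that 0.17 gap, here is a minimal sketch of Welch's t statistic and Cohen's d computed from the summary statistics alone. The assumptions are mine, not the survey's: the reported SDs are treated as sample standard deviations, the group sizes are taken as 59 and 41 (one of them is off by one, since only 99 scores were valid), and the p-value uses a normal approximation instead of the exact t distribution.

```python
import math

# Summary statistics from the table in the post.
# Assumptions: SDs are sample standard deviations; group sizes are the
# response counts, although one score entry (unknown group) was invalid.
n_a, mean_a, sd_a = 59, 3.26, 0.50  # Policy A
n_b, mean_b, sd_b = 41, 3.43, 0.63  # Policy B

# Per-group variance of the mean.
var_term_a = sd_a**2 / n_a
var_term_b = sd_b**2 / n_b

# Welch's t statistic (does not assume equal variances).
diff = mean_b - mean_a
se = math.sqrt(var_term_a + var_term_b)
t_stat = diff / se

# Welch-Satterthwaite approximation for the degrees of freedom.
df = (var_term_a + var_term_b) ** 2 / (
    var_term_a**2 / (n_a - 1) + var_term_b**2 / (n_b - 1)
)

# Cohen's d using the pooled standard deviation.
pooled_sd = math.sqrt(
    ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2)
)
cohens_d = diff / pooled_sd

# Two-sided p-value from a normal approximation (reasonable for df > 30).
p_approx = math.erfc(abs(t_stat) / math.sqrt(2))

print(f"t = {t_stat:.2f}, df = {df:.0f}, d = {cohens_d:.2f}, p ~ {p_approx:.2f}")
```

Under these assumptions the statistic comes out around t = 1.4 with d near 0.3, which would be a small effect that does not clear the usual 0.05 threshold. That fits the cautious reading above: visible, but not conclusive.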

2) Policy A currently has higher reported reviewer confidence

Interestingly, the confidence pattern goes in the opposite direction: the average reported reviewer confidence is higher for Policy A (3.53) than for Policy B (3.35). To me, this inverse relationship between scores and confidence is one of the more interesting patterns in the current data. One possible interpretation is that reviewers who rely on external reasoning (in this case an LLM) are less confident in their opinion because maybe they did not fully spend time reading the paper. At the same time, they may be more skeptical that their review is valid.

3) Both groups lean toward “harsher than expected”, but this is stronger for Policy A

Policy	Harsher than expected	About as expected	More lenient than expected
Policy A	67.8%	28.8%	3.4%
Policy B	58.5%	29.3%	12.2%

So both groups lean toward the feeling that scores were harsher than expected, but this is more pronounced for Policy A in the current sample. This, however, could also be attributed to the lower mean scores of Policy A, which subjectively makes the Policy A respondents feel unfairly treated.
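The percentages in the harshness table can be turned back into counts and checked with a chi-square test of independence. This is a sketch under one assumption that is not stated in the post: that all 59 Policy A and 41 Policy B respondents answered this question. The reported percentages round cleanly to whole counts with those denominators, which is at least consistent with it.

```python
import math

# Reported percentages per policy, in the order:
# harsher than expected, about as expected, more lenient than expected.
group_sizes = {"Policy A": 59, "Policy B": 41}
percentages = {
    "Policy A": [67.8, 28.8, 3.4],
    "Policy B": [58.5, 29.3, 12.2],
}

# Assumption: every respondent answered, so the denominators are 59 and 41.
counts = {
    policy: [round(p / 100 * group_sizes[policy]) for p in pcts]
    for policy, pcts in percentages.items()
}

rows = list(counts.values())
row_totals = [sum(r) for r in rows]
col_totals = [sum(c) for c in zip(*rows)]
grand_total = sum(row_totals)

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected.
chi2 = 0.0
for i, row in enumerate(rows):
    for j, observed in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (observed - expected) ** 2 / expected

# A 2x3 table has (2-1)*(3-1) = 2 degrees of freedom, and for df = 2 the
# chi-square survival function has the closed form exp(-x / 2).
p_value = math.exp(-chi2 / 2)

print(counts)
print(f"chi2 = {chi2:.2f}, p = {p_value:.2f}")
```

With the reconstructed counts (40/17/2 and 24/12/5) the statistic is about 3.0 with p around 0.22. Note that the "more lenient" cells have expected counts below 5, so the chi-square approximation is rough here and the result should be read as indicative only.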

Plot 3: perceived harshness by policy

Distribution of harshness by policy.

4) “Especially polished” reviews are reported much more often for Policy B

Policy	No	Somewhat	Yes
Policy A	37.3%	49.2%	13.6%
Policy B	31.7%	36.6%	31.7%

The biggest difference here is the “Yes” category: in the current sample, respondents under Policy B are more likely to describe the reviews as especially polished. Of course, this does not prove LLM use, and I don’t want to overstate that point. But it is still a pattern that seems relevant to the original debate.
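For the “Yes” category specifically, a simple way to express the uncertainty is a confidence interval for the difference in proportions. The sketch below uses a plain Wald interval, which is known to be imprecise at sample sizes this small. The counts (8 of 59 and 13 of 41) are reconstructed from the reported percentages, assuming every respondent answered this question.

```python
import math

# Counts reconstructed from the reported percentages.
# Assumption: all respondents answered, so 13.6% of 59 is 8 and
# 31.7% of 41 is 13.
yes_a, n_a = 8, 59    # Policy A, "Yes"
yes_b, n_b = 13, 41   # Policy B, "Yes"

p_a = yes_a / n_a
p_b = yes_b / n_b
diff = p_b - p_a

# Standard error of the difference between two independent proportions.
se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)

# 95% Wald interval (z = 1.96).
z = 1.96
ci_low = diff - z * se
ci_high = diff + z * se

print(f"difference = {diff:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```

Under these assumptions the difference is about 18 percentage points, with an interval running from just above zero to roughly 35 points. The lower bound sits so close to zero that a few additional responses could move it either way.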

My current interpretation

My current reading is:

there is some tendency toward higher reported scores under Policy B,
there is some tendency toward higher reported reviewer confidence under Policy A,
and there is a noticeable difference in how often reviews are described as especially polished, with that being reported more often for Policy B.

At the same time, I do not think these data justify a strong conclusion like:

“Policy B clearly had an unfair advantage”, or
“LLMs caused score inflation”.

But they do justify an open debate.

There are too many confounders, however:

the survey is self-selected,
people who care about this issue or feel affected by it are more likely to respond,
and different subfields / paper strengths / reviewer pools could all matter.

I would love opinions on these early results

Also, if you have not filled in the survey yet, please do. And please share it, especially with people under both policies, so the sample can become larger, more informative, and more representative. If enough additional responses come in, I can post a follow-up after rebuttal as well.

Motivation

I openly admit that my motivations for doing this survey were A) I initially felt possibly treated unfairly and wanted to know the reality; and B) I really love data analysis of any kind, and debates. After a week, I mainly do it for motivation B.

submitted by /u/Available_Net_6429