1d ago

ArXiv paper 'From Entropy to Epiplexity' prompts X discussion on why information theory lacks explanatory power and testable predictions for AI systems

Replies cite noise and unreliable estimators as practical ML barriers.

Original post

It's very disappointing that information theory cannot explain AI at all.

7:45 PM · May 21, 2026

@ziv_ravid Information theory might not explain why deep learning works, but it can help in using the models. For example, IT suggests the objective function to use and gives a clear algorithm for using your model to compress data.

Ravid Shwartz Ziv @ziv_ravid

My two cents on why information theory doesn't quite work in the real world: as someone who's been arguing with people about IT and its connection to ML since 2016(!), I mostly agree with Alex. (At least) two problems show up when you try to use information theory in ML. First, the distance between theory and reality. Nothing is clean, everything is noisy, full of engineering tricks, and you have to estimate everything. We run experiments with information estimators whose quality we don't really know. But the bigger problem (and I hope Shannon will forgive me): information theory isn't really about learning. In the classical setting you're in an idealized world where everything is given, with no optimization and no learning. There have been attempts to change this (Stefano Ermon's usable information, for one), and many works took inspiration from IT, but actually applying its tools has had limited success. So, next time that people ask you "Does Information theory 'explain AI you can say, "No, and it doesn't supposed..."

8:28 PM · May 22, 2026 · 9.2K Views
2:02 AM · May 23, 2026 · 843 Views

@DimitrisPapail @jiaxinwen22 I have the same thoughts about thermodynamics and fluid mechanics.

Dimitris Papailiopoulos @DimitrisPapail

Information theory was not built to explain algorithmic phenomena. It's a beautiful framework for arguing about the limits of information: what can be communicated, stored, retrieved, compressed etc. Most attempts I've seen at forcing IT onto AI feel like trying to make coffee with a katana. Magnificent instrument but wrong job :)

3:06 AM · May 22, 2026 · 9.3K Views
7:58 AM · May 22, 2026 · 419 Views

@jiaxinwen22 We discuss the tension between information theory and modern AI phenomena here: https://arxiv.org/abs/2601.03220. The good news is that we can shed light on these phenomena by understanding the role of computation and structural information.

Jiaxin Wen @jiaxinwen22

It's very disappointing that information theory cannot explain AI at all.

2:45 AM · May 22, 2026 · 69.2K Views
12:38 AM · May 23, 2026 · 959 Views

@jiaxinwen22 @yidingjiang hi, yes, for sure!

Jiaxin Wen @jiaxinwen22

@andrewgwils hi andrew, I'm running some experiments to understand how useful MDL/epiplexity are in practical LM training scenarios. both @yidingjiang and I think these experiments are pretty funny and valuable. lmk if you are interested in giving feedback or proposing new experiments!

12:46 AM · May 23, 2026 · 612 Views
12:47 AM · May 23, 2026 · 202 Views

It’s interesting that people are just waking up to the gap between conventional information theory and AI. Wait until you hear about pseudorandom numbers.

11:20 PM · May 22, 2026 · 2.3K Views

Just like the no free lunch theorems, the data processing inequality (DPI) is irrelevant to the practice of machine learning, and that’s okay. It doesn’t mean we can’t develop info theory to be descriptive of practice.

Andrew Gordon Wilson @andrewgwils

Seeing a lot recently about whether info theory explains modern AI. This is *exactly* what our epiplexity paper is about. It shows how to resolve paradoxes around DPI, synthetic data, and emergence, by considering computation and structural info: https://arxiv.org/abs/2601.03220

9:07 PM · May 22, 2026 · 20.7K Views
11:16 PM · May 22, 2026 · 6.5K Views
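
For reference, the data processing inequality (DPI) mentioned above can be written in its standard textbook form. This is a sketch of the classical statement only, not of anything specific to the epiplexity paper; the symbols $X$, $Y$, $Z$, and $f$ are notation introduced here for illustration.

```latex
% Data processing inequality: if X -> Y -> Z forms a Markov chain, then
\[
  I(X;Z) \;\le\; I(X;Y).
\]
% Special case: Z = f(Y) for a deterministic function f, so
\[
  I\bigl(X; f(Y)\bigr) \;\le\; I(X;Y),
  \qquad
  H\bigl(f(X)\bigr) \;\le\; H(X) \quad \text{(discrete } X\text{)}.
\]
```

Read literally, a deterministic transformation of the data can never add Shannon information, which is the tension with practice that the thread keeps returning to.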

@DimitrisPapail But I also think that we can develop notions in information theory like epiplexity that are both descriptive of modern practice and prescriptive. So yes, information theory can be developed to explain AI, and already has been to some extent with epiplexity.

Dimitris Papailiopoulos @DimitrisPapail

@andrewgwils thoughts on Alex's post? I think he makes some good points on the utility of IT as a form of language, rather than a scientific theory

9:30 PM · May 22, 2026 · 3K Views
9:36 PM · May 22, 2026 · 693 Views

@DimitrisPapail I think this can quickly get into an uninteresting debate about semantics. Shannon info and AIT come up nearly empty handed on many questions around the value of deterministic transformations, ordering, emergence, and so on. 1/n

Dimitris Papailiopoulos @DimitrisPapail

@andrewgwils as Alex put it "learning Greek does not explain Plato, you can just understand Plato better if you learn Greek."

9:39 PM · May 22, 2026 · 224 Views
9:43 PM · May 22, 2026 · 217 Views

@DimitrisPapail If systems like alpha zero aren’t learning “information”, what are they learning? But that doesn’t mean these classical notions can’t be descriptive elsewhere. We can use Solomonoff priors to get tight generalization bounds, even to describe scaling laws. 2/n

Andrew Gordon Wilson @andrewgwils

@DimitrisPapail I think this can quickly get into an uninteresting debate about semantics. Shannon info and AIT come up nearly empty handed on many questions around the value of deterministic transformations, ordering, emergence, and so on. 1/n

9:43 PM · May 22, 2026 · 217 Views
9:44 PM · May 22, 2026 · 214 Views

@DimitrisPapail We can also adapt notions in information theory to describe what Shannon info and Kolmogorov complexity do not, like epiplexity. 3/n

Andrew Gordon Wilson @andrewgwils

@DimitrisPapail If systems like alpha zero aren’t learning “information”, what are they learning? But that doesn’t mean these classical notions can’t be descriptive elsewhere. We can use Solomonoff priors to get tight generalization bounds, even to describe scaling laws. 2/n

9:44 PM · May 22, 2026 · 214 Views
9:47 PM · May 22, 2026 · 235 Views

@DimitrisPapail Yes we can debate about whether info theory is a language, etc, but I think that’s mostly beside the point. What we see in practice seems to contradict identities like the DPI, but we can gain clarity by then understanding the role of computation in learnable information. n/n

Andrew Gordon Wilson @andrewgwils

@DimitrisPapail We can also adapt notions in information theory to describe what Shannon info and Kolmogorov complexity do not, like epiplexity. 3/n

9:47 PM · May 22, 2026 · 235 Views
9:49 PM · May 22, 2026 · 229 Views

@DimitrisPapail Yes I think we largely agree. But I do think trad IT does have some explanatory power, just not around certain phenomena. We can build tight generalization bounds using standard AIT: https://arxiv.org/abs/2504.15208

Dimitris Papailiopoulos @DimitrisPapail

@andrewgwils I think we agree on substance. In the case of epiplexity the clarity comes from computation, not IT. If trad IT had explanatory power, DPI would have something useful to say, and in the case we're both thinking about it doesn't and as we've seen can be very misleading.

9:58 PM · May 22, 2026 · 111 Views
10:06 PM · May 22, 2026 · 128 Views

@DimitrisPapail Re: Val, yeah imo gen bounds are mostly useful if they give some interpretable insight into why the model generalizes. You can get this through bounds based on the solomonoff prior. I guess also useful if you care about a formal guarantee, but that’s less important in my view.

Dimitris Papailiopoulos @DimitrisPapail

I have not read this paper in detail, but I do remember feeling hopeful when I saw it. Perhaps I should spend more time on it. What I remember back when I was young and thinking for long hours about gen bounds, is nothing beats parameter count union bounds, especially when you account for sparsity (or soft sparsity). Which was still far from the real gen gap, and resulted in me giving up on them. Also i never quite understood one thing (and apologies if you fully address this in the work): isn't computing the val error always cheaper than most gen bounds people like, apart from param count ones?

10:11 PM · May 22, 2026 · 74 Views
10:16 PM · May 22, 2026 · 74 Views

@DimitrisPapail Yeah good bounds these days actually get tighter with more params. I recommend reading this paper which discusses the point at some length: https://arxiv.org/abs/2503.02113

Dimitris Papailiopoulos @DimitrisPapail

thank you for engaging with this, it's very helpful. now let me push on gen bounds a bit. Parameter count bounds say fewer params => better gen. We know that's false, e.g., overparameterization not only isn't avoided, it often helps. And most bounds share the same flavor: gen error < (1/f(n)) · Quantity, declared "good" when Quantity is small for whatever you happen to be training. But small Quantity is never shown to be necessary, only sufficient on paper. The perhaps only exception is stability, but that's circular, i.e., stability is basically exactly equal to generalization error. You can argue for lower bounds instead, but those get invalidated in practice too. So let me make it a bit more concrete (and sorry if i sound annoying, this comes from an honest place of ignorance): has there ever been a case where a generalization bound predicted a useful model attribute, and specifically one that was then tried because the bound suggested it, and worked? Where the bound was the actual source of the intervention, not a post-hoc explanation of something we already did?

10:44 PM · May 22, 2026 · 118 Views
10:50 PM · May 22, 2026 · 100 Views

@jiaxinwen22 Hold my beer !

Jiaxin Wen @jiaxinwen22

It's very disappointing that information theory cannot explain AI at all.

2:45 AM · May 22, 2026 · 69.2K Views
6:08 PM · May 22, 2026 · 2.5K Views

@jiaxinwen22

Ok I wrote a whole article. Thanks for your question, I think it sharpened my thinking.

Alex Dimakis @AlexGDimakis

http://x.com/i/article/2057891027055525888

7:01 PM · May 22, 2026 · 37.9K Views
7:02 PM · May 22, 2026 · 4K Views

My point is that there are two separate things: 1. mathematical theories (like information theory, probability theory, or linear algebra) and 2. scientific theories (like relativity or quantum mechanics).

The question ‘does information theory explain AI’ is a syntax error. It’s like asking if linear algebra explains electron interference.

The correct statement is that quantum mechanics makes falsifiable predictions about observable physical quantities and hence explains electron interference patterns. And quantum mechanics uses linear algebra as a mathematical framework.

Scientific theories can explain physical phenomena and use mathematical theories.

The mathematics of information theory is useful for AI: knowing what cross entropy is, how it differs from entropy, what compression is, etc., helps in understanding AI.

But there is no good scientific theory that explains AI. People are working on it, and often use information theory as one of the tools.
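
To make the cross entropy versus entropy distinction above concrete, here is a minimal Python sketch. The two distributions `p` and `q` are made-up values chosen for illustration, not anything taken from the thread or the linked article.

```python
import math

def entropy(p):
    """Shannon entropy H(p) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross entropy H(p, q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical "true" data distribution and a hypothetical model of it.
p = [0.5, 0.25, 0.25]
q = [1 / 3, 1 / 3, 1 / 3]

h = entropy(p)             # best achievable average code length for p
ce = cross_entropy(p, q)   # average code length when coding p with the model q
kl = ce - h                # KL(p || q): extra bits paid for using the wrong model

print(f"H(p) = {h:.3f} bits, H(p, q) = {ce:.3f} bits, KL = {kl:.3f} bits")
```

The gap between the two numbers is the KL divergence, which is why minimizing cross entropy during training is the same as minimizing the number of extra bits the model wastes when used as a compressor.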

Alex Dimakis @AlexGDimakis

@jiaxinwen22 Ok I wrote a whole article. Thanks for your question, I think it sharpened my thinking.

7:02 PM · May 22, 2026 · 4K Views
6:59 AM · May 23, 2026 · 28 Views

@DimitrisPapail Indeed the fact that Shannon had the bravery to write ‘The Bandwagon’ paper shows how he was targeting serious deep thinking vs hype, even when the hype was supporting his intellectual creation.

Dimitris Papailiopoulos @DimitrisPapail

Information theory is a mathematical theory and a language, not a scientific theory that explains phenomena. Slapping mutual information on all of AI’s mysteries won’t help explain them.

7:30 PM · May 22, 2026 · 14.7K Views
2:44 AM · May 23, 2026 · 496 Views

@andrewgwils ah interesting post ! I was thinking of what it means for a mathematical theory to 'explain' a phenomenon as @DimitrisPapail posted also. Does Epiplexity make falsifiable predictions about observable quantities?

Andrew Gordon Wilson @andrewgwils

Seeing a lot recently about whether info theory explains modern AI. This is *exactly* what our epiplexity paper is about. It shows how to resolve paradoxes around DPI, synthetic data, and emergence, by considering computation and structural info: https://arxiv.org/abs/2601.03220

9:07 PM · May 22, 2026 · 20.7K Views
9:34 PM · May 22, 2026 · 1.4K Views

I think you made some great points and I’m going to read about epiplexity for sure. I am trying to make a different distinction between scientific theories (that make falsifiable predictions about observable quantities) and mathematical theories (like information theory and probability theory).

The statement ‘probability theory does not explain AI’ is a syntax error to me.

Andrew Gordon Wilson @andrewgwils

@DimitrisPapail Yes I think we largely agree. But I do think trad IT does have some explanatory power, just not around certain phenomena. We can build tight generalization bounds using standard AIT: https://arxiv.org/abs/2504.15208

10:06 PM · May 22, 2026 · 128 Views
7:10 AM · May 23, 2026 · 39 Views

Information theory is a mathematical theory and a language, not a scientific theory that explains phenomena. Slapping mutual information on all of AI’s mysteries won’t help explain them.

Alex Dimakis @AlexGDimakis

http://x.com/i/article/2057891027055525888

7:01 PM · May 22, 2026 · 37.9K Views
7:30 PM · May 22, 2026 · 14.7K Views

@andrewgwils thoughts on Alex's post? I think he makes some good points on the utility of IT as a form of language, rather than a scientific theory

Alex Dimakis @AlexGDimakis

http://x.com/i/article/2057891027055525888

7:01 PM · May 22, 2026 · 37.9K Views
9:30 PM · May 22, 2026 · 3K Views

@andrewgwils But I think a distinction here is that IT did not explain anything, you did, by using definitions borrowed from IT. no?

Andrew Gordon Wilson @andrewgwils

@DimitrisPapail But I also think that we can develop notions in information theory like epiplexity that are both descriptive of modern practice and prescriptive. So yes, information theory can be developed to explain AI, and already has been to some extent with epiplexity.

9:36 PM · May 22, 2026 · 693 Views
9:38 PM · May 22, 2026 · 255 Views

@andrewgwils as Alex put it "learning Greek does not explain Plato, you can just understand Plato better if you learn Greek."

Dimitris Papailiopoulos @DimitrisPapail

@andrewgwils But I think a distinction here is that IT did not explain anything, you did, by using definitions borrowed from IT. no?

9:38 PM · May 22, 2026 · 255 Views
9:39 PM · May 22, 2026 · 224 Views

@andrewgwils I think we agree on substance. In the case of epiplexity the clarity comes from computation, not IT. If trad IT had explanatory power, DPI would have something useful to say, and in the case we're both thinking about it doesn't and as we've seen can be very misleading.

Andrew Gordon Wilson @andrewgwils

@DimitrisPapail Yes we can debate about whether info theory is a language, etc, but I think that’s mostly beside the point. What we see in practice seems to contradict identities like the DPI, but we can gain clarity by then understanding the role of computation in learnable information. n/n

9:49 PM · May 22, 2026 · 229 Views
9:58 PM · May 22, 2026 · 111 Views

I have not read this paper in detail, but I do remember feeling hopeful when I saw it. Perhaps I should spend more time on it. What I remember back when I was young and thinking for long hours about gen bounds, is nothing beats parameter count union bounds, especially when you account for sparsity (or soft sparsity). Which was still far from the real gen gap, and resulted in me giving up on them. Also i never quite understood one thing (and apologies if you fully address this in the work): isn't computing the val error always cheaper than most gen bounds people like, apart from param count ones?

Andrew Gordon Wilson @andrewgwils

@DimitrisPapail Yes I think we largely agree. But I do think trad IT does have some explanatory power, just not around certain phenomena. We can build tight generalization bounds using standard AIT: https://arxiv.org/abs/2504.15208

10:06 PM · May 22, 2026 · 128 Views
10:11 PM · May 22, 2026 · 74 Views

thank you for engaging with this, it's very helpful.

now let me push on gen bounds a bit. Parameter count bounds say fewer params => better gen. We know that's false, e.g., overparameterization not only isn't avoided, it often helps. And most bounds share the same flavor: gen error < (1/f(n)) · Quantity, declared "good" when Quantity is small for whatever you happen to be training. But small Quantity is never shown to be necessary, only sufficient on paper. The perhaps only exception is stability, but that's circular, i.e., stability is basically exactly equal to generalization error.

You can argue for lower bounds instead, but those get invalidated in practice too.

So let me make it a bit more concrete (and sorry if i sound annoying, this comes from an honest place of ignorance): has there ever been a case where a generalization bound predicted a useful model attribute, and specifically one that was then tried because the bound suggested it, and worked? Where the bound was the actual source of the intervention, not a post-hoc explanation of something we already did?

Andrew Gordon Wilson @andrewgwils

@DimitrisPapail Re: Val, yeah imo gen bounds are mostly useful if they give some interpretable insight into why the model generalizes. You can get this through bounds based on the solomonoff prior. I guess also useful if you care about a formal guarantee, but that’s less important in my view.

10:16 PM · May 22, 2026 · 74 Views
10:44 PM · May 22, 2026 · 118 Views
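
As one concrete instance of the "gen error < (1/f(n)) · Quantity" shape discussed above, here is the textbook Occam-style bound for a countable hypothesis class. It is offered only as an illustration of the general form, not as the bound used in the linked papers. The notation is an assumption introduced here: $R(h)$ is the true risk, $\hat{R}(h)$ the empirical risk on $n$ i.i.d. samples, the loss takes values in $[0,1]$, and $\ell(h)$ is the length in bits of a prefix-free description of $h$ fixed before seeing the data.

```latex
% With probability at least 1 - \delta over the sample, simultaneously for all h:
\[
  R(h) \;\le\; \hat{R}(h)
  \;+\; \sqrt{\frac{\ell(h)\,\ln 2 \;+\; \ln(1/\delta)}{2n}} .
\]
% This follows from Hoeffding's inequality plus a union bound weighted by the
% prior P(h) = 2^{-\ell(h)}, which sums to at most 1 by the Kraft inequality.
```

Here $f(n) = \sqrt{2n}$ and "Quantity" is $\sqrt{\ell(h)\ln 2 + \ln(1/\delta)}$: the bound is small whenever the trained model is compressible, but it says nothing about whether compressibility is necessary for generalization, which is the objection raised in the post above.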

Information theory was not built to explain algorithmic phenomena. It's a beautiful framework for arguing about the limits of information: what can be communicated, stored, retrieved, compressed etc. Most attempts I've seen at forcing IT onto AI feel like trying to make coffee with a katana. Magnificent instrument but wrong job :)

Jiaxin Wen @jiaxinwen22

It's very disappointing that information theory cannot explain AI at all.

2:45 AM · May 22, 2026 · 69.2K Views
3:06 AM · May 22, 2026 · 9.3K Views

me after reading the epiplexity paper

Jiaxin Wen @jiaxinwen22

It's very disappointing that information theory cannot explain AI at all.

2:45 AM · May 22, 2026 · 69.2K Views
3:46 AM · May 22, 2026 · 2.8K Views

don't be disappointed if information theory doesn't explain AI

Jiaxin Wen @jiaxinwen22

It's very disappointing that information theory cannot explain AI at all.

2:45 AM · May 22, 2026 · 69.2K Views
7:05 PM · May 22, 2026 · 24.3K Views

My two cents on why information theory doesn't quite work in the real world: as someone who's been arguing with people about IT and its connection to ML since 2016(!), I mostly agree with Alex. (At least) two problems show up when you try to use information theory in ML. First, the distance between theory and reality. Nothing is clean, everything is noisy, full of engineering tricks, and you have to estimate everything. We run experiments with information estimators whose quality we don't really know. But the bigger problem (and I hope Shannon will forgive me): information theory isn't really about learning. In the classical setting you're in an idealized world where everything is given, with no optimization and no learning. There have been attempts to change this (Stefano Ermon's usable information, for one), and many works took inspiration from IT, but actually applying its tools has had limited success. So, next time that people ask you "Does Information theory 'explain AI you can say, "No, and it doesn't supposed..."

Alex Dimakis @AlexGDimakis

http://x.com/i/article/2057891027055525888

7:01 PM · May 22, 2026 · 37.9K Views
8:28 PM · May 22, 2026 · 9.2K Views

@ChrSzegedy I'm a big fan of info theory compression, but what is the algorithm?

Christian Szegedy @ChrSzegedy

@ziv_ravid Information theory might not explain why deep learning works, but it can help in using the models. For example, IT suggests the objective function to use and gives a clear algorithm for using your model to compress data.

2:02 AM · May 23, 2026 · 843 Views
3:11 AM · May 23, 2026 · 214 Views
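
On the "what is the algorithm?" question: the thread does not say which algorithm was meant, but the usual answer is arithmetic coding driven by the model's next-symbol probabilities, which spends about -log2 p(symbol | prefix) bits per symbol. The Python sketch below computes only that ideal code length and is not an actual encoder. The adaptive Laplace-smoothed unigram model is a toy stand-in for a learned model, and the function name and inputs are hypothetical.

```python
import math
from collections import Counter

def ideal_code_length_bits(text, alphabet):
    """Total of -log2 p(symbol | prefix) under an adaptive, Laplace-smoothed
    unigram model. An arithmetic coder driven by the same probabilities would
    produce an output within a couple of bits of this total."""
    counts = Counter()
    total_bits = 0.0
    for i, ch in enumerate(text):
        # Predict the next symbol from counts seen so far (add-one smoothing).
        p = (counts[ch] + 1) / (i + len(alphabet))
        total_bits += -math.log2(p)
        # The decoder can make the same update, so no model needs to be sent.
        counts[ch] += 1
    return total_bits

text = "aaaaaaab"      # toy input
alphabet = "ab"        # toy alphabet
bits = ideal_code_length_bits(text, alphabet)
print(f"{bits:.2f} bits vs {len(text)} bits at 1 bit per symbol")
```

Swapping the toy unigram predictor for a trained language model gives the model-as-compressor setup: the total code length is exactly the model's summed cross-entropy loss on the sequence, in bits.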

@DimitrisPapail @jiaxinwen22 Beside Stefano Ermon work...

Dimitris Papailiopoulos @DimitrisPapail

Information theory was not built to explain algorithmic phenomena. It's a beautiful framework for arguing about the limits of information: what can be communicated, stored, retrieved, compressed etc. Most attempts I've seen at forcing IT onto AI feel like trying to make coffee with a katana. Magnificent instrument but wrong job :)

3:06 AM · May 22, 2026 · 9.3K Views
3:43 AM · May 22, 2026 · 623 Views

@jiaxinwen22 What field can?

Jiaxin Wen @jiaxinwen22

It's very disappointing that information theory cannot explain AI at all.

2:45 AM · May 22, 2026 · 69.2K Views
4:38 AM · May 22, 2026 · 558 Views

@andrewgwils The no free lunch theorems would have been more relevant to the practice of ML if people correctly and precisely read what they actually say.

Andrew Gordon Wilson @andrewgwils

Just like the no free lunch theorems, the data processing inequality (DPI) is irrelevant to the practice of machine learning, and that’s okay. It doesn’t mean we can’t develop info theory to be descriptive of practice.

11:16 PM · May 22, 2026 · 6.5K Views
1:30 AM · May 23, 2026 · 271 Views

@DimitrisPapail ai folks somehow are still frequently using description length or proposing new variants

Dimitris Papailiopoulos @DimitrisPapail

Information theory was not built to explain algorithmic phenomena. It's a beautiful framework for arguing about the limits of information: what can be communicated, stored, retrieved, compressed etc. Most attempts I've seen at forcing IT onto AI feel like trying to make coffee with a katana. Magnificent instrument but wrong job :)

3:06 AM · May 22, 2026 · 9.3K Views
3:18 AM · May 22, 2026 · 4K Views

@DimitrisPapail I am always impressed by their papers but feel disappointed when using those metrics to explain AI

Dimitris Papailiopoulos @DimitrisPapail

Information theory was not built to explain algorithmic phenomena. It's a beautiful framework for arguing about the limits of information: what can be communicated, stored, retrieved, compressed etc. Most attempts I've seen at forcing IT onto AI feel like trying to make coffee with a katana. Magnificent instrument but wrong job :)

3:06 AM · May 22, 2026 · 9.3K Views
3:19 AM · May 22, 2026 · 724 Views

@jiaxinwen22 1) you must be using a very strong mechanistic instance-level notion of "explain" - and nothing currently meets that bar: not interp, not learning theory, not representation theory

Jiaxin Wen @jiaxinwen22

It's very disappointing that information theory cannot explain AI at all.

2:45 AM · May 22, 2026 · 69.2K Views
8:24 AM · May 22, 2026 · 685 Views

@jiaxinwen22 2) it gives bounds which are very likely true. You can say they're too loose to be useful but that's a different claim!

gavin leech (Non-Reasoning) @gleech

@jiaxinwen22 1) you must be using a very strong mechanistic instance-level notion of "explain" - and nothing currently meets that bar: not interp, not learning theory, not representation theory

8:24 AM · May 22, 2026 · 685 Views
8:25 AM · May 22, 2026 · 171 Views

@jiaxinwen22 3) I don't take people misapplying information-bottleneck and data-processing inequality to be a strike against the theory

gavin leech (Non-Reasoning) @gleech

@jiaxinwen22 2) it gives bounds which are very likely true. You can say they're too loose to be useful but that's a different claim!

8:25 AM · May 22, 2026 · 171 Views
8:26 AM · May 22, 2026 · 84 Views

gavin leech (Non-Reasoning) @gleech

@jiaxinwen22 3) I don't take people misapplying information-bottleneck and data-processing inequality to be a strike against the theory

8:26 AM · May 22, 2026 · 84 Views
8:28 AM · May 22, 2026 · 31 Views