ArXiv paper 'From Entropy to Epiplexity' prompts X discussion on why information theory lacks explanatory power and testable predictions for AI systems
Replies cite noise and unreliable estimators as practical ML barriers.
@ziv_ravid Information theory might not explain why deep learning works, but it can help in using the models. For example, IT suggests the objective function to use and gives a clear algorithm for using your model to compress data.
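One way to read the "clear algorithm" mentioned above: a predictive model assigns each symbol a probability, and an entropy coder such as an arithmetic coder can spend about -log2(p) bits on it. The sketch below only totals those ideal code lengths and does not implement the coder. The adaptive Laplace-smoothed unigram model and the name `adaptive_code_length_bits` are hypothetical stand-ins for "your model", not anything taken from the thread.

```python
import math
from collections import Counter


def adaptive_code_length_bits(text: str, alphabet: str) -> float:
    """Ideal code length in bits of `text` under an adaptive unigram model.

    Assumption: the model is a Laplace-smoothed symbol counter. A decoder can
    rebuild the same counts symbol by symbol, so no model needs to be sent.
    """
    counts = Counter({ch: 1 for ch in alphabet})  # one pseudo-count per symbol
    total = len(alphabet)
    bits = 0.0
    for ch in text:
        p = counts[ch] / total   # predictive probability before seeing ch
        bits += -math.log2(p)    # Shannon code length for this symbol
        counts[ch] += 1          # update the model after coding the symbol
        total += 1
    return bits


alphabet = "ab"
skewed = "a" * 28 + "b" * 4      # mostly one symbol: predictable for this model
balanced = "ab" * 16             # equal symbol counts: no gain for a unigram model

skewed_bits = adaptive_code_length_bits(skewed, alphabet)
balanced_bits = adaptive_code_length_bits(balanced, alphabet)
print(f"skewed: {skewed_bits:.2f} bits, balanced: {balanced_bits:.2f} bits")
```

The balanced string is perfectly regular, yet a unigram model cannot see the alternation, so it gets no compression. A stronger model would, which is the sense in which the training objective (log loss) and the compressed size are the same quantity.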
My two cents on why information theory doesn't quite work in the real world: as someone who's been arguing with people about IT and its connection to ML since 2016(!), I mostly agree with Alex. (At least) two problems show up when you try to use information theory in ML. First, the distance between theory and reality. Nothing is clean, everything is noisy, full of engineering tricks, and you have to estimate everything. We run experiments with information estimators whose quality we don't really know. But the bigger problem (and I hope Shannon will forgive me): information theory isn't really about learning. In the classical setting you're in an idealized world where everything is given, with no optimization and no learning. There have been attempts to change this (Stefano Ermon's usable information, for one), and many works took inspiration from IT, but actually applying its tools has had limited success. So, next time people ask you "Does information theory 'explain' AI?" you can say, "No, and it isn't supposed..."
@DimitrisPapail @jiaxinwen22 I have the same thoughts about thermodynamics and fluid mechanics.
Information theory was not built to explain algorithmic phenomena. It's a beautiful framework for arguing about the limits of information: what can be communicated, stored, retrieved, compressed etc. Most attempts I've seen at forcing IT onto AI feel like trying to make coffee with a katana. Magnificent instrument but wrong job :)
@jiaxinwen22 We discuss the tension between information theory and modern AI phenomena here: https://arxiv.org/abs/2601.03220. The good news is that we can shed light on these phenomena by understanding the role of computation and structural information.
It's very disappointing that information theory cannot explain AI at all.
@jiaxinwen22 @yidingjiang hi, yes, for sure!
@andrewgwils hi andrew, I'm running some experiments to understand how useful MDL/epiplexity are in practical LM training scenarios. both @yidingjiang and I think these experiments are pretty funny and valuable. lmk if you are interested in giving feedback or proposing new experiments!
It’s interesting that people are just waking up to the gap between conventional information theory and AI. Wait until you hear about pseudorandom numbers.
Just like the no free lunch theorems, the data processing inequality (DPI) is irrelevant to the practice of machine learning, and that’s okay. It doesn’t mean we can’t develop info theory to be descriptive of practice.
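For readers who want the classical statement in front of them: the DPI says that if Z is computed from Y alone, then I(X;Z) <= I(X;Y). The sketch below checks this on a toy discrete channel. The joint distribution and the helper names are assumptions made for illustration. It shows only the textbook inequality and says nothing about computationally bounded learners, which is where the thread locates the gap.

```python
import math
from collections import defaultdict


def mutual_information_bits(joint):
    """I(A;B) in bits for a joint pmf given as {(a, b): probability}."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return sum(
        p * math.log2(p / (pa[a] * pb[b]))
        for (a, b), p in joint.items()
        if p > 0
    )


def push_forward(joint, f):
    """Joint pmf of (X, f(Y)) after applying a deterministic map f to Y."""
    out = defaultdict(float)
    for (x, y), p in joint.items():
        out[(x, f(y))] += p
    return dict(out)


# Toy channel (assumed): X uniform on {0,1,2,3}; Y = X w.p. 0.9, else (X+1) % 4.
joint_xy = {}
for x in range(4):
    joint_xy[(x, x)] = 0.25 * 0.9
    joint_xy[(x, (x + 1) % 4)] = 0.25 * 0.1

i_xy = mutual_information_bits(joint_xy)
# Z = Y mod 2 is a deterministic transformation of Y.
i_xz = mutual_information_bits(push_forward(joint_xy, lambda y: y % 2))
print(f"I(X;Y) = {i_xy:.3f} bits, I(X;Z) = {i_xz:.3f} bits")
```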
Seeing a lot recently about whether info theory explains modern AI. This is *exactly* what our epiplexity paper is about. It shows how to resolve paradoxes around DPI, synthetic data, and emergence, by considering computation and structural info: https://arxiv.org/abs/2601.03220
@DimitrisPapail But I also think that we can develop notions in information theory like epiplexity that are both descriptive of modern practice and prescriptive. So yes, information theory can be developed to explain AI, and already has been to some extent with epiplexity.
@andrewgwils thoughts on Alex's post? I think he makes some good points on the utility of IT as a form of language, rather than a scientific theory
@DimitrisPapail I think this can quickly get into an uninteresting debate about semantics. Shannon info and AIT come up nearly empty handed on many questions around the value of deterministic transformations, ordering, emergence, and so on. 1/n
@andrewgwils as Alex put it "learning Greek does not explain Plato, you can just understand Plato better if you learn Greek."
@DimitrisPapail If systems like AlphaZero aren’t learning “information”, what are they learning? But that doesn’t mean these classical notions can’t be descriptive elsewhere. We can use Solomonoff priors to get tight generalization bounds, even to describe scaling laws. 2/n
@DimitrisPapail We can also adapt notions in information theory to describe what Shannon info and Kolmogorov complexity do not, like epiplexity. 3/n
@DimitrisPapail Yes we can debate about whether info theory is a language, etc, but I think that’s mostly beside the point. What we see in practice seems to contradict identities like the DPI, but we can gain clarity by then understanding the role of computation in learnable information. n/n
@DimitrisPapail Yes I think we largely agree. But I do think trad IT does have some explanatory power, just not around certain phenomena. We can build tight generalization bounds using standard AIT: https://arxiv.org/abs/2504.15208
@andrewgwils I think we agree on substance. In the case of epiplexity the clarity comes from computation, not IT. If trad IT had explanatory power, DPI would have something useful to say. In the case we're both thinking about it doesn't, and as we've seen it can be very misleading.
@DimitrisPapail Re: Val, yeah imo gen bounds are mostly useful if they give some interpretable insight into why the model generalizes. You can get this through bounds based on the Solomonoff prior. I guess also useful if you care about a formal guarantee, but that’s less important in my view.
I have not read this paper in detail, but I do remember feeling hopeful when I saw it. Perhaps I should spend more time on it. What I remember from back when I was young and thinking for long hours about gen bounds is that nothing beats parameter count union bounds, especially when you account for sparsity (or soft sparsity). These were still far from the real gen gap, which resulted in me giving up on them. Also, I never quite understood one thing (and apologies if you fully address this in the work): isn't computing the val error always cheaper than most gen bounds people like, apart from param count ones?
@DimitrisPapail Yeah good bounds these days actually get tighter with more params. I recommend reading this paper which discusses the point at some length: https://arxiv.org/abs/2503.02113
thank you for engaging with this, it's very helpful. now let me push on gen bounds a bit. Parameter count bounds say fewer params => better gen. We know that's false, e.g. overparameterization not only isn't avoided, it often helps. And most bounds share the same flavor: gen error < (1/f(n)) · Quantity, declared "good" when Quantity is small for whatever you happen to be training. But small Quantity is never shown to be necessary, only sufficient on paper. Perhaps the only exception is stability, but that's circular, i.e., stability is basically exactly equal to generalization error. You can argue for lower bounds instead, but those get invalidated in practice too. So let me make it a bit more concrete (and sorry if i sound annoying, this comes from an honest place of ignorance): has there ever been a case where a generalization bound predicted a useful model attribute, and specifically one that was then tried because the bound suggested it, and worked? Where the bound was the actual source of the intervention, not a post-hoc explanation of something we already did?
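As one concrete instance of the shape "gen error < (1/f(n)) · Quantity", here is the textbook Occam bound for a countable hypothesis class. It is offered as an illustration of the form under discussion, not as the specific bound in any of the linked papers. Assumed notation: $n$ i.i.d. samples, a loss bounded in $[0,1]$, true risk $R(h)$, empirical risk $\hat{R}(h)$, and $\ell(h)$ the length in bits of $h$ under some prefix-free code fixed before seeing the data.

```latex
% With probability at least 1 - \delta over the sample, simultaneously for all h:
\[
  R(h) - \hat{R}(h)
  \;\le\;
  \sqrt{\frac{\ell(h)\,\ln 2 + \ln(1/\delta)}{2n}}
\]
% Here f(n) = \sqrt{2n} and Quantity = \sqrt{\ell(h)\ln 2 + \ln(1/\delta)}.
% Proof sketch: Hoeffding's inequality for each h at confidence \delta \cdot 2^{-\ell(h)},
% then a union bound, using Kraft's inequality \sum_h 2^{-\ell(h)} \le 1.
```

The bound is sufficient rather than necessary, which is the objection raised above: a small $\ell(h)$ guarantees a small gap, but a large $\ell(h)$ does not imply a large one.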
@jiaxinwen22 Hold my beer!
@jiaxinwen22 Ok I wrote a whole article. Thanks for your question, I think it sharpened my thinking.
http://x.com/i/article/2057891027055525888
My point is that there are two separate things: 1. mathematical theories (like information theory, probability theory, or linear algebra) and 2. scientific theories (like relativity or quantum mechanics).
The question ‘does information theory explain AI’ is a syntax error. It’s like asking if linear algebra explains electron interference.
The correct statement is that quantum mechanics makes falsifiable predictions about observable physical quantities and hence explains electron interference patterns. And quantum mechanics uses linear algebra as a mathematical framework.
Scientific theories can explain physical phenomena and use mathematical theories.
The mathematics of information theory is useful for AI: knowing what cross entropy is, how it differs from entropy, what compression is, etc., helps in understanding AI.
But there is no good scientific theory that explains AI. People are working on it, and often use information theory as one of the tools.
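To make the entropy versus cross-entropy distinction mentioned above concrete, here is a minimal sketch. The two distributions are made up for illustration: `p` plays the role of the data distribution and `q` the model's predictions. Entropy H(p) is the best achievable average code length, cross-entropy H(p, q) is what you pay when coding with `q` instead, and the difference is the KL divergence.

```python
import math


def entropy_bits(p):
    """Shannon entropy H(p) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy_bits(p, q):
    """Cross-entropy H(p, q) in bits: average code length using q when data follow p."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.25, 0.25]        # assumed "true" distribution
q = [1 / 3, 1 / 3, 1 / 3]    # assumed model distribution

h = entropy_bits(p)
ce = cross_entropy_bits(p, q)
kl = ce - h                  # KL(p || q): extra bits paid for using the wrong model
print(f"H(p) = {h:.3f}, H(p, q) = {ce:.3f}, KL = {kl:.3f}")
```

Minimizing cross-entropy over `q` drives the KL term toward zero. H(p) itself is a property of the data and is not changed by training.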
@DimitrisPapail Indeed the fact that Shannon had the bravery to write ‘The Bandwagon’ paper shows how he was targeting serious deep thinking vs hype, even when the hype was supporting his intellectual creation.
Information theory is a mathematical theory and a language, not a scientific theory that explains phenomena. Slapping mutual information on all of AI’s mysteries won’t help explain them.
@andrewgwils ah interesting post! I was thinking of what it means for a mathematical theory to 'explain' a phenomenon as @DimitrisPapail posted also. Does Epiplexity make falsifiable predictions about observable quantities?
I think you made some great points and I’m going to read about epiplexity for sure. I am trying to make a different distinction between scientific theories (that make falsifiable predictions about observable quantities) and mathematical theories (like information theory and probability theory).
The statement ‘probability theory does not explain AI’ is a syntax error to me.
@andrewgwils But I think a distinction here is that IT did not explain anything, you did, by using definitions borrowed from IT. no?
me after reading the epiplexity paper
don't be disappointed if information theory doesn't explain AI

@ChrSzegedy I'm a big fan of info theory compression, but what is the algorithm?
@DimitrisPapail @jiaxinwen22 Beside Stefano Ermon work...
@jiaxinwen22 What field can?
@andrewgwils The no free lunch theorems would have been more relevant to the practice of ML if people correctly and precisely read what they actually say.
@DimitrisPapail ai folks somehow are still frequently using description length or proposing new variants
@DimitrisPapail I am always impressed by their papers but feel disappointed when using those metrics to explain AI
@jiaxinwen22 1) you must be using a very strong mechanistic instance-level notion of "explain" - and nothing currently meets that bar: not interp, not learning theory, not representation theory
@jiaxinwen22 2) it gives bounds which are very likely true. You can say they're too loose to be useful but that's a different claim!
@jiaxinwen22 3) I don't take people misapplying information-bottleneck and data-processing inequality to be a strike against the theory
@jiaxinwen22 4) I'm keeping the faith