9 Comments
Pradyumna's avatar

Great way of explaining this.

Naina Chaturvedi's avatar

++ Good post. Also, start here: How to Build Tech, Crash Courses, 100+ Most Asked ML System Design Case Studies, and LLM System Design

How to Build Tech 

https://open.substack.com/pub/howtobuildtech/p/how-to-build-tech-10-how-to-actually?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/howtobuildtech/p/how-to-build-tech-06-how-to-actually?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/howtobuildtech/p/how-to-build-tech-05-how-to-actually?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/howtobuildtech/p/how-to-build-tech-04-how-to-actually?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/howtobuildtech/p/how-to-build-tech-03-how-to-actually?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/howtobuildtech/p/how-to-build-tech-01-the-heart-of?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/howtobuildtech/p/how-to-build-tech-02-how-to-actually?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

Crash Courses

https://open.substack.com/pub/crashcourses/p/crash-course-04-hands-on-crash-course?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/crashcourses/p/crash-course-03-hands-on-crash-course?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/crashcourses/p/crash-course-02-a-complete-crash?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/crashcourses/p/crash-course-01-a-complete-crash?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

LLM System Design

https://open.substack.com/pub/naina0405/p/very-important-llm-system-design-499?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/naina0405/p/very-important-llm-system-design-63c?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/naina0405/p/very-important-llm-system-design-bdd?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/naina0405/p/very-important-llm-system-design-661?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/naina0405/p/very-important-llm-system-design-83b?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/naina0405/p/very-important-llm-system-design-799?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/naina0405/p/very-important-llm-system-design-612?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/naina0405/p/very-important-llm-system-design-7e6?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/naina0405/p/very-important-llm-system-design-67d?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/naina0405/p/most-important-llm-system-design-b31?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://naina0405.substack.com/p/launching-llm-system-design-large?r=14q3sp

https://naina0405.substack.com/p/launching-llm-system-design-2-large?r=14q3sp

https://open.substack.com/pub/naina0405/p/llm-system-design-3-large-language?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/naina0405/p/important-llm-system-design-4-heart?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

suman suhag's avatar

I discovered AI in early 1982, when I chanced on Doug Hofstadter’s Pulitzer-winning first book (still his best), “Gödel, Escher, Bach: An Eternal Golden Braid”, at a book fair in New Delhi, India. It changed my life. I’ve spent the better part of the past four decades doing research in ML and AI, watching the field emerge from the shadows, where it was studied by a small number of enthusiasts like me, to today, where it seems to rule the tech industry as a trillion dollar technology.

Then, as now, the biggest problem with AI is making accurate predictions about where it is headed. Herbert Simon, one of the founders of the field, made some wild predictions back in the 1960s of where the field would be in 10 years. He was off by about 50 years. But, as predictions go, that’s not all that bad.

Most sci fi flicks of that era predicted flying cars by now — recall the beginning scene of Blade Runner — and other than in Chitty Chitty Bang Bang and some James Bond movies, we don’t seem remotely close to getting flying cars. Compared to such predictions, AI has done rather well. Thanks to a whole host of related inventions, from the smartphone to the internet and cloud computing, the reach of AI is more pervasive than ever. Where will AI be in the next 50 years?

This gets to my answer to the question. 40 years ago, I found machine learning the most fascinating field I could possibly study. I don’t think that any longer. The strengths and weaknesses of machine learning have become apparent in the ensuing decades. It’s best to explain this by an analogy, and I love analogies (as does Doug Hofstadter, as his most recent book “Surfaces and Essences” is subtitled “Analogy as the fuel and fire of thinking”).

Imagine you are fascinated, as our ancients were, by the possibility of human-powered flight. Every culture known to me has humans soaring in the air like birds in its mythology. In Greek mythology, Daedalus built wings of feathers and wax to help him and his son, Icarus, escape from imprisonment. Sadly, Icarus, not heeding his father’s warning, flew too close to the sun and fell to his death. Indian mythology is full of stories of flying machines.

We now have flying machines that whisk us across continents at close to the speed of sound. But, we need huge airports, mile-long runways, jet fuel, seat belts, and all the paraphernalia of modern air travel (don’t get me started on TSA background checks). Where’s our dream of human-powered flight, soaring like birds? Gone into mythology, where it shall remain.

ML is in a similar state. Many of us 40 years ago dreamed our machines would learn like us, like children: curious about the world, becoming fluent in many languages, helping us in our old age, and becoming our intellectual companions. Alas, that’s largely a pipe dream.

Modern ML, like modern air travel, is a completely different enterprise. It needs huge labeled datasets (now in the petabytes). It’s notoriously brittle: one-pixel attacks have shown how vulnerable deep learning, our best ML technology, is to tiny changes in its input. If you cater to its every whim, it can be successful, but it is no match for human learning, just as a modern 747 is no match for the common garden sparrows that flit about my backyard.

So, I am curious whether AI will ever reach a state where it leads to truly intelligent machines that can soar in the sky as birds do, without all the trappings of modern airliners, or whether it will forever be consigned to the same fate, needing mile-long runways, jet fuel, seat belts and TSA background checks.

So, like the great MLK, I too have a dream. I dream of the day when machine learners will become like human learners: like children, eternally curious about the world, and not dependent on terabyte-sized labeled datasets or on careful human tweaking of parameters and architectures. Is this a pipe dream? Will we get to this promised land? Or, as with modern air travel, is it our fate to be relegated to intrusive TSA checks when we feel like soaring like the birds?

suman suhag's avatar

Provocative question! Having spent at least 25 years studying RL, ever since my first real job at IBM Research, where I explored the use of methods like Q-learning from 1990–93 to teach robots new tasks, I’ve watched the field go through its various phases. In the early 1990s, when I got involved, it was restricted to a small handful of aficionados. In 1995, I organized the first National Science Foundation workshop on RL, to which about 50–60 senior researchers were invited.

Gradually, through the early part of the 2000s, the field gained popularity, but never seemed to become a mainstream research topic within ML. Then, wham! DeepMind did its thing with the combination of deep learning and RL, applied to the visually appealing domain of Atari video games, and (deep) RL’s popularity went through the roof. Now, it seems all the rage, and certainly, many employers are hiring (in the Bay Area, it’s a skill sought after by some of the labs doing autonomous driving). Google reportedly paid on the order of half a billion dollars for DeepMind (supposedly!), on the basis of their deep RL Atari demo. So, this looked like a real turning point, and RL came to life!

So, getting back to the question, is RL a “dead end”? In answering this provocative question, one has to clarify one’s point of view. From the standpoint of the work going on at DeepMind and other places on using deep RL to play games like Go or Chess, or to train a self-driving car given an accurate simulator of the world, RL is poised to become well established technology. RL sessions at major AI and ML conferences are very well attended, and RL submissions are definitely increasing. In all these dimensions, RL is very much not at a “dead end”; in fact, its popularity is only increasing.

But, but... you knew there was a but coming!

When you impose on RL the goal of “online learning in real time from the real world”, and not doing millions of simulation steps where agents can be killed thousands of times with no penalty, I fear RL is very much at a dead end. It is not clear to me that any extension of the au courant deep RL methods is going to lead to successes in the real world, in terms of a physical agent that can learn in real time with a small number of examples.

That is, if your goal is to build a model of how humans learn complex skills, such as driving, then RL to me is a very poor explanation of how such skills are acquired. One has only to look at the comparative results reported in the AAAI 2017 Spring Symposium paper by Tsividis et al., comparing randomly recruited humans on Amazon Mechanical Turk with the best deep RL programs at Atari video games, to see where deep RL simply flounders. Humans learn Atari video games, like Frostbite, about 1000x faster than the fastest deep RL methods.

A typical human learned Frostbite in 1 minute with a few hundred examples at most. DQN or other deep RL programs take days with millions of examples. It’s not even close; in terms of learning speed, it’s like another galaxy. So, looking at this paper, I’d have to say I don’t see any way to close such a large gap with any incremental tweaking of deep RL methods, such as those reported annually in ICML or NIPS papers (of which I review a bunch each year, hoping against hope to see a new idea emerge, only to be disappointed!).

So, what’s to be done to “rescue RL”? I’m not sure there’s really a solution out there. I for one have stopped believing that we learn complex skills like driving by something that resembles “pure RL” (that is, from rewards alone). Humans learn to drive because they in fact “know” how to drive before they even try to drive once. They’ve seen their parents, friends, lovers, Uber drivers, etc. drive many many times, and they’ve seen driving behavior in movies for thousands of hours. So, when they finally get behind the wheel, they instinctively “know” what driving means, but of course, they have never actually controlled a physical car before. So, there is that all important “last mile” of actual driving that needs to be learned.

But, since the driving program is largely already in place, built in by many thousands of hours of observation, not to mention active instruction by a driving teacher or an anxious parent, what needs to be “learned” are a few control parameters that tell the human brain how much to turn the wheel, or press the brake, and more importantly, where to look on the road etc. This is of course not trivial, which is why humans take a few weeks to get comfortable behind a wheel. But, if you look at real hours of practice, humans learn to drive in a few hundred hours; for those paying for driving instruction, this is expensive since you are charged by the hour.

Also, all important to remember is that when you impose the condition of learning in the real world, there can be “no cheating”! That is, unlike the ridiculous 2D world of Atari video games, like Enduro, where one is given a 2D highly simplified visual world, and actions are limited to a few discrete choices, humans must drive in the full 3D real world and have the huge task of controlling both legs, both hands, neck, body, etc, many hundreds of continuous degrees of freedom, as well as have to cope with an immense sensory space of stereo vision, and binaural hearing as well.

The only way humans ever learn to drive in a few hundred hours is the simple fact that we already almost know driving, and we have obviously a fully working vision system, so we can read signs, recognize cars and pedestrians, and our hearing system also recognizes sirens, alerts, horns etc. So, if you look at the immensity of the whole driving task, I would claim more than 95% of the driving knowledge is already known, and the small remaining part has to be acquired from practice. This is the only explanation for how humans learn such a complex skill as driving in a few hundred hours. There is NO magic here.

So, in that sense, pure (deep) RL seems like a dead end. The pure (deep) RL problem formulation really does not hold much interest for me any more. What is needed in its place is a more complex model of how learning happens by combining observation, transfer learning, and many other types of behavior cloning from observed demonstration to the learner, and finally being able to take this knowledge, and then improve it with some actual trial and error RL.

One can generalize this to other modes of learning as well. The late Richard Feynman, who was arguably the most influential physicist after the Second World War, taught a classic introductory course at Caltech, which led to probably the best selling college textbook of all time, the Feynman Lectures on Physics (still being sold almost 60 years later, in the nth edition). When he looked at how students handled his problem sets, Feynman was ultimately disappointed. He realized that even the extremely bright students at Caltech could not “learn” physics simply by sitting in his class and absorbing his lectures. So, he ended his preface to the textbook with a disappointing conclusion, quoting Gibbon (which I had long ago memorized):

“The power of instruction is seldom of much efficacy, except in those happy dispositions where it is almost superfluous”.

I realized the wisdom of this saying after spending two decades or more teaching machine learning to graduate students at several institutions. It seems almost paradoxical, but what Gibbon is saying, and what Feynman and I both discovered, is that learning from teaching only works when the learner “almost already knows” the subject.

But this is precisely what the various theoretical formulations of ML predict must be the case: there is no “free lunch” in terms of being able to learn. DeepMind’s DQN network takes millions and millions of steps to learn an apparently trivial task (to humans) like Frostbite, because initially DQN knows nothing. Humans, in contrast, learn Frostbite in < 1 minute because they have spent many many hours building the background needed to learn Frostbite so quickly (e.g., vision, hand-eye coordination, general game playing strategies).

Unfortunately, the prevailing currents in the field, at venues like “NeurIPS” (NIPS) and ICML and AAAI conferences, tend to “glorify” knowledge-free learning, so you end up with hundreds, if not thousands, of (deep) RL papers, where agents take millions of time steps to learn apparently simple tasks. To me, this approach is ultimately a “dead end”, if your goal is to develop a computational model of how humans learn.
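To make “knowledge-free learning” concrete, here is a minimal sketch of the tabular Q-learning update (the method mentioned at the top of this comment), not DQN itself. The table starts at all zeros, so the learner begins knowing nothing and every value has to be earned from reward. The two-state toy problem and the values of `alpha` and `gamma` are illustrative assumptions, not anything taken from the comment.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new value."""
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Hypothetical toy problem: two actions, reward 1 for going "right" from state 0.
ACTIONS = ("left", "right")
q_table = defaultdict(float)  # all zeros: the learner starts with no knowledge

first = q_update(q_table, 0, "right", 1.0, 1, ACTIONS)
second = q_update(q_table, 0, "right", 1.0, 1, ACTIONS)
print(first, second)  # 0.5 0.75
```

Even in this tiny case the estimate only creeps toward its final value over repeated visits, which is the kind of slow, from-scratch accumulation the comment contrasts with human learners who arrive with most of the knowledge already in place.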

suman suhag's avatar

The best way, in my view, to understand a field is to understand the reason why the field exists in the first place. Why do we need a field like machine learning? In short, what problems does it solve and why?

Let’s start with an analogy, something you do practically every morning: you wake up and get ready to go to work. What problems do you need to solve? For one, you need to put on some clothes to protect your body from the weather and your feet against the rough surfaces you might encounter. You need to perhaps cover your head with a hat or a scarf and protect your eyes with sunglasses against the harsh rays of the sun. These are the problems we need to solve in getting dressed.

Algorithms are like clothes and shoes, hats and scarves and sunglasses, continuing the analogy from above. You could wear sneakers, dress shoes, or high heels. You could wear a T-shirt, a dress shirt, a full length skirt and so on. Clothes and shoes are ways to solve the problem of dressing up for work. Which clothes you wear and what shoes you put on may vary, depending on the occasion and the weather. Similarly, which machine learning algorithm you use may depend on the problem, the data, the distribution of instances etc. The lesson from the fashion industry is quite apt and worth remembering. Problems never change (you always need something to cover your feet), but algorithms change often (new styles of clothes and shoes get created every week or month). Don’t waste time learning fashionable solutions when they will become like yesterday’s newspaper. Problems last, algorithms don’t!

There’s a tendency, unfortunately, of recommending universal solutions to machine learning these days (e.g., learn TensorFlow and code up every algorithm as stochastic gradient descent using a deep neural net). To me, this makes just about as much sense as wrapping yourself up in your bedsheets to go to work. Sure, it covers most parts of your body, and probably could do the job, but it’s a one size fits all approach that neither shows any style or taste, nor any understanding of the machine learning (or dressing) problem.

The machine learning community has spent over four decades trying to understand how to pose the problem of machine learning. Start by understanding a few of these formulations, and resist the temptation to view every machine learning problem through a simplified lens (like supervised learning, one of dozens of ways of posing ML problems). The major categories include unsupervised learning, the most important, followed by reinforcement learning (learning by trial and error, the most widely prevalent in children after unsupervised learning), and finally supervised learning (which occurs rather late, because it requires labels and language, which young children mostly lack in early years). Transfer learning is growing in importance as labeled data is expensive and hard to collect for every new problem. There’s lifelong learning, and online learning, and so on. One of the deepest and most interesting areas of machine learning is the theory of probably approximately correct (or PAC) learning. This is a fascinating area, which looks at the problem of how we can give guarantees that a machine learning algorithm will work reliably or will produce a sufficiently accurate answer. Whether you understand PAC learning or not tells me whether you are an ML scientist or an ML engineer.
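As a small illustration of the kind of guarantee PAC learning gives, here is a sketch of the textbook sample-size bound for the simplest setting: a finite hypothesis class H and a learner that outputs a hypothesis consistent with the training data. In that setting, m ≥ (1/ε)(ln|H| + ln(1/δ)) examples suffice for error at most ε with probability at least 1 − δ. The function name and the example numbers are assumptions made for illustration; the comment itself does not commit to any particular setting.

```python
import math


def pac_sample_bound(hypothesis_count: int, epsilon: float, delta: float) -> int:
    """Examples sufficient for a consistent learner over a finite hypothesis class."""
    if hypothesis_count < 1:
        raise ValueError("hypothesis_count must be at least 1")
    if not (0.0 < epsilon < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    return math.ceil((math.log(hypothesis_count) + math.log(1.0 / delta)) / epsilon)


# Illustrative numbers: 1024 hypotheses, 10% error, 95% confidence.
needed = pac_sample_bound(2 ** 10, epsilon=0.1, delta=0.05)
print(needed)  # 100
```

Note how the bound grows only logarithmically in the number of hypotheses but linearly in 1/ε, so demanding higher accuracy costs far more data than enlarging the hypothesis class.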

The most basic formulation of machine learning, and the one that gets short shrift in many popular expositions, is learning a “representation”. What does this even mean? Take the number “three”. I could write it using three strokes III, or as 11, or as 3. These correspond to the unary, binary, and decimal representations. The last of these was invented in India more than 2000 years ago. Remarkably, the Greeks, for all their wisdom, never discovered the use of 0 (zero), and never invented decimal numbers. Claude Shannon, the famed inventor of information theory, popularized binary representations for computers in a famous MS thesis at MIT in 1937.
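The three ways of writing “three” can be sketched in a few lines. The helper names are made up for illustration; the point is only that the same quantity has several encodings, each recoverable from the others.

```python
def unary(n: int) -> str:
    """One stroke per unit."""
    return "I" * n


def binary(n: int) -> str:
    """Base-2 digits, most significant first."""
    return format(n, "b")


def decimal(n: int) -> str:
    """Base-10 digits, the everyday positional form."""
    return str(n)


three = 3
representations = {
    "unary": unary(three),
    "binary": binary(three),
    "decimal": decimal(three),
}
print(representations)  # {'unary': 'III', 'binary': '11', 'decimal': '3'}
```

Which encoding is convenient depends on the task: counting is trivial in unary, while arithmetic on large numbers is only practical in a positional form.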

What does it mean for a computer to “learn” a representation? Take a selfie and imagine writing a program to identify you (or your spouse or your pet) from the image. The phone uses one representation for the image (usually something like JPEG, which is built on the discrete cosine transform, a close relative of the Fourier basis). It turns out this basis is a terrible representation for machine learning. There are many better representations, and new ones get invented all the time. A representation is like the material that makes up your dress. There’s cotton and polyester and wool and nylon. Each of these has its strengths and weaknesses. Similarly, different representations of input data have their pros and cons. Resist the temptation to view one representation as superior to all the others.

Humans spend most of their day solving sequential tasks (driving, eating, typing, walking, etc.). All of these require making a sequence of decisions, and learning such tasks involves reinforcement learning. Without RL, we would not get very far. Sadly, most general ML textbooks ignore this most basic and important area, to their discredit. Fortunately, there are excellent specialized books that cover this area.

Let me end with two famous maxims from the legendary physicist Richard Feynman about learning a topic. First, he said: “What I cannot create, I do not understand”. What he meant is that unless you can recreate an idea or an algorithm yourself, you probably haven’t understood it well enough. Second, he said: “Know how to solve every problem that has already been solved”. This second maxim is to make sure you understand what has been done previously. For most of us, these are hard principles to follow, but to the extent you can follow them, you will find your way to complete mastery over any field, including machine learning. Good luck!

Naina Chaturvedi's avatar

++ Good post. Also, start here: Stock Market, AI Research, Crash Courses, 100+ Most Asked ML System Design Case Studies, and LLM System Design

AI/ML/LLM Engineer Interview 

https://open.substack.com/pub/naina0405/p/launching-500k-job-meet-quantum-your?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

How to Build Tech 

https://open.substack.com/pub/howtobuildtech/p/how-to-build-tech-04-how-to-actually?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/howtobuildtech/p/how-to-build-tech-03-how-to-actually?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/howtobuildtech/p/how-to-build-tech-01-the-heart-of?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/howtobuildtech/p/how-to-build-tech-02-how-to-actually?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

Crash Courses

https://open.substack.com/pub/crashcourses/p/crash-course-03-hands-on-crash-course?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/crashcourses/p/crash-course-02-a-complete-crash?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/crashcourses/p/crash-course-01-a-complete-crash?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

LLM System Design

https://open.substack.com/pub/naina0405/p/very-important-llm-system-design-7e6?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/naina0405/p/very-important-llm-system-design-67d?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/naina0405/p/most-important-llm-system-design-b31?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://naina0405.substack.com/p/launching-llm-system-design-large?r=14q3sp

https://naina0405.substack.com/p/launching-llm-system-design-2-large?r=14q3sp

https://open.substack.com/pub/naina0405/p/llm-system-design-3-large-language?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/naina0405/p/important-llm-system-design-4-heart?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

System Design

https://open.substack.com/pub/naina0405/p/bookmark-most-asked-ml-system-design-611?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/naina0405/p/system-design-tech-case-study-pulse-862?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/naina0405/p/system-design-tech-case-study-pulse-b3c?r=14q3sp&utm_campaign=post&utm_medium=web

https://open.substack.com/pub/naina0405/p/system-design-tech-case-study-pulse-135?r=14q3sp&utm_campaign=post&utm_medium=web

https://open.substack.com/pub/naina0405/p/system-design-tech-case-study-pulse-007?r=14q3sp&utm_campaign=post&utm_medium=web

Stock Market 

https://open.substack.com/pub/stockmarketanalysis04/p/important-stock-market-post-04-which?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/stockmarketanalysis04/p/important-stock-market-analysis-which?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/stockmarketanalysis04/p/important-stock-market-post-02-understand?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/stockmarketanalysis04/p/important-stock-market-post-03-this?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/stockmarketanalysis04/p/important-stock-market-post-06-i?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

AI/ML Research

https://open.substack.com/pub/airesearch04/p/ai-research-2-kimi-k2-thinking-a?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/airesearch04/p/ai-research-1-the-transformer-revolution?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

suman suhag's avatar

Hidden Markov Models can be used to generate a language, that is, to list elements from a family of strings. For example, if you have an HMM that models a set of sequences, you can generate members of this family by listing sequences that fall into the group of sequences being modelled.
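A minimal sketch of what “generating members of the family” looks like in practice: pick a hidden state, emit a symbol from it, move to the next state, and repeat. The two hidden states, the alphabet {a, b}, and all probabilities below are hypothetical values chosen for illustration; a real model would have them estimated from data.

```python
import random


def sample_hmm(start, trans, emit, length, rng):
    """Generate one observation sequence of the given length from an HMM."""

    def draw(dist):
        # dist maps each outcome to its probability
        outcomes = list(dist)
        weights = [dist[o] for o in outcomes]
        return rng.choices(outcomes, weights=weights, k=1)[0]

    state = draw(start)
    symbols = []
    for _ in range(length):
        symbols.append(draw(emit[state]))  # emit a symbol from the hidden state
        state = draw(trans[state])         # then move to the next hidden state
    return "".join(symbols)


# Hypothetical two-state model over the alphabet {a, b}.
START = {"A": 0.6, "B": 0.4}
TRANS = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
EMIT = {"A": {"a": 0.9, "b": 0.1}, "B": {"a": 0.2, "b": 0.8}}

rng = random.Random(0)
family = [sample_hmm(START, TRANS, EMIT, 8, rng) for _ in range(3)]
print(family)
```

Only the emitted symbols are returned; the state path that produced them is discarded, which is exactly the sense in which the states are hidden.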

Neural networks take an input from a high-dimensional space and map it to a lower-dimensional space (how a network maps its input depends on its training, its topology and other factors). For example, you might take a 64-bit image of a digit (say, 8x8 black-and-white pixels) and map it to a true/false value that says whether the digit is 1 or 0.

Whilst both methods are able to (or can at least try to) discriminate whether an item is a member of a class or not, a neural network used as a classifier in this way does not generate a language as described above.

There are alternatives to Hidden Markov Models available. For example, you might be able to use a more general Bayesian Network, a different topology or a Stochastic Context-Free Grammar (SCFG) if you believe that the problem lies in the HMM's lack of power to model your problem, that is, if you need an algorithm that is able to discriminate between more complex hypotheses and/or describe the behaviour of data that is much more complex.

What is hidden and what is observed: The thing that is hidden in a hidden Markov model is the same as the thing that is hidden in a discrete mixture model, so for clarity, forget about the hidden state's dynamics and stick with a finite mixture model as an example. The 'state' in this model is the identity of the component that caused each observation. In this class of model such causes are never observed, so 'hidden cause' is translated statistically into the claim that the observed data have marginal dependencies which are removed when the source component is known. And the source components are estimated to be whatever makes this statistical relationship true. The thing that is hidden in a feedforward multilayer neural network with sigmoid middle units is the states of those units, not the outputs, which are the target of inference. When the output of the network is a classification, i.e., a probability distribution over possible output categories, these hidden units' values define a space within which categories are separable. The trick in learning such a model is to make a hidden space (by adjusting the mapping out of the input units) within which the problem is linear. Consequently, non-linear decision boundaries are possible from the system as a whole.
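The idea of a hidden space in which the problem becomes linear can be sketched with XOR, which no single linear threshold on the raw inputs can solve. The weights below are set by hand purely for illustration (an assumption; a trained network would learn its own), so that one sigmoid unit behaves roughly like OR and the other roughly like AND.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hidden_units(x1, x2):
    """Map the two inputs into a two-dimensional hidden space."""
    h_or = sigmoid(20 * (x1 + x2) - 10)   # close to 1 if at least one input is 1
    h_and = sigmoid(20 * (x1 + x2) - 30)  # close to 1 only if both inputs are 1
    return h_or, h_and


def xor_net(x1, x2):
    """Classify with a single linear threshold applied in the hidden space."""
    h_or, h_and = hidden_units(x1, x2)
    return (h_or - h_and) > 0.5


for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))
```

The final decision is a straight line in the (h_or, h_and) plane, yet the boundary it induces on the original inputs is non-linear.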

Generative versus discriminative: The mixture model (and HMM) is a model of the data generating process, sometimes called a likelihood or 'forward model'. When coupled with some assumptions about the prior probabilities of each state you can infer a distribution over possible values of the hidden state using Bayes' theorem (a generative approach). Note that, while called a 'prior', both the prior and the parameters in the likelihood are usually learned from data. In contrast to the mixture model (and HMM), the neural network learns a posterior distribution over the output categories directly (a discriminative approach). This is possible because the output values were observed during estimation. And since they were observed, it is not necessary to construct a posterior distribution from a prior and a specific model for the likelihood such as a mixture. The posterior is learnt directly from data, which is more efficient and less model dependent.
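The generative route can be sketched for a one-dimensional mixture: multiply each component's prior by its likelihood for the observation, then normalise, which is Bayes' theorem applied to the hidden component. The Gaussian components and all numbers are assumptions chosen for illustration; the comment does not specify a likelihood family.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def posterior(x, priors, means, stds):
    """P(component | x) for each component, via prior times likelihood."""
    joint = [p * gaussian_pdf(x, m, s) for p, m, s in zip(priors, means, stds)]
    total = sum(joint)  # the marginal likelihood of x under the mixture
    return [j / total for j in joint]


# Hypothetical two-component mixture with equal priors.
PRIORS, MEANS, STDS = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
print(posterior(0.0, PRIORS, MEANS, STDS))  # halfway between the means
print(posterior(2.0, PRIORS, MEANS, STDS))  # favours the second component
```

A discriminative model would skip the prior and likelihood entirely and fit a function from x to these posterior probabilities directly, which is only possible when the component labels were observed during training.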

Mix and match: To make things more confusing, these approaches can be mixed together, e.g. when the mixture model (or HMM) state is sometimes actually observed. When that is true, and in some other circumstances not relevant here, it is possible to train discriminatively in an otherwise generative model. Similarly, it is possible to replace the mixture model mapping of an HMM with a more flexible forward model, e.g., a neural network.

Internation Burke Institute's avatar

Always a strong signal for serious ML work.

Dr Teodora Szasz's avatar

I post every day on data science, AI/ML and the career skills that will help you reach the next level in your career:

https://teodoracoach.substack.com/