Transcription
GPT-4 can improve itself by reflecting on its mistakes and learning from them. Even
if the world does pause AI development, GPT-4 will keep getting smarter. Drawing
upon the stunning Reflexion paper, and three other papers released only in the
last 72 hours, I will show you not only how GPT-4 is breaking its own records but
also how it's helping AI researchers to develop better models. I will also cover
the groundbreaking HuggingGPT model, which, like a centralized brain, can draw upon
thousands of other AI models to combine tasks like text-to-image, text-to-video,
and question answering. The Reflexion paper, and the follow-up Substack post that
caught global attention, were released only a week ago, and yes, I did read both. But
I also reached out to the lead author, Noah Shinn, and discussed their significance
at length. Others picked up on the results, with the legendary Andrej Karpathy, of
Tesla and OpenAI fame, saying that this metacognition strategy revealed that we
haven't yet seen the max capacity of GPT-4. So what exactly was found? Here is
the headline result, which I'm going to explain and demonstrate in a
moment. Look how they used GPT-4 itself to beat past GPT-4 standards using this
Reflexion technique. This isn't any random challenge: this is HumanEval, a coding
benchmark designed by some of the most senior AI researchers just two years ago. The
designers included Ilya Sutskever, of OpenAI fame, and Dario Amodei, who went on to
found Anthropic. These are realistic, handwritten programming tasks that assess
language comprehension, reasoning, algorithms, and mathematics. So how exactly did
GPT-4 improve itself and beat its own record? Because remember, in the distant past
of two weeks ago, in the GPT-4 technical report, it scored 67%, not 88%. Well, here is an
example from page 9 of the Reflexion paper. As you can read in the caption, this
was a HotpotQA trial, designed specifically so that models needed to find
multiple documents and analyze the data in each of them to come up with the
correct answer. Notice how initially, on the left, the model made a mistake,
and then, at the bottom, the model reflected on how it had gone wrong in a
self-contained loop. It then came up with a better strategy and got it right. The
authors put it like this: "we hypothesize that LLMs [large language models] possess
an emergent property of self-reflection", meaning that earlier models couldn't do
this, or couldn't do it as well. It's a bit like GPT models are learning how to
learn. In case you think it was a model blindly trying again and again until it
succeeded: no, it wasn't. This was another challenge, called ALFWorld, and look
at the difference between success without reflection and success with
reflection. I discussed this, of course, with the lead author, and the goal was to
distinguish genuine self-improvement from simple probabilistic success
over time. If you're wondering about ALFWorld, by the way, it's about interactively
aligning text and embodied worlds. For example, in a simulated environment the
model had the task of putting a pan on the dining table, and it had to understand
and act on that prompt. So, as you can see, this ability to reflect doesn't just
help with coding; it helps with a variety of tasks.
At this point, I want to quickly mention something, because I know there will be a
couple of well-versed insiders who say: didn't GPT-4 actually get 82% on HumanEval
in the Sparks of AGI paper? I did a video on that paper too, of course, and asked
the author of Reflexion about this point. There are a few possibilities, such as
prompting changes, or the Sparks authors having access to the raw GPT-4 model. But
either way, it is the relative performance gain that matters: whichever baseline
you start with, GPT-4 can improve on it with Reflexion. And the 88% figure is not
a cap; the author has observed results in the last few hours as high as 91%. But before I go on, I
can't resist showing you the examples I found through my own experimentation and
also shared with the author. Take this prompt that I gave GPT-4: write a poem in
which every word begins with E. As you can see, it did a good job, but it didn't
fully get it right; look at the word "ascent", for example. Without mentioning
anything specific, I then wrote: did the poem meet the assignment? Not even a
particularly leading question, because of course it could have just said yes. GPT-4
then said: apologies, it appears the poem I provided did not meet the assignment
requirements; not every word begins with the letter E. Here is a revised poem with
every word beginning with the letter E. Remember, I didn't help it at all, and look
at the result: every word begins with E. How far can we take this? For the next
example, I chose mathematics, and asked: write me a five-question multiple-choice
quiz to test my knowledge of probability, with correct answers and explanations at
the bottom; there should only be one correct answer per question. It comes up with
a decent quiz, but notice a problem in question three, for example: the
probability of drawing an ace or a king is indeed 8 out of 52, but that simplifies
down to 2 out of 13. So two of the answers are correct, and I explicitly asked for
it not to do this in the prompt. So, can the model self-reflect with mathematics?
Kind of; almost. Look what happens. First, I give a vague response, saying: did the
quiz meet the assignment? GPT-4 fumbles this and says yes, the quiz did meet the
assignment. Hmm. So I tried: did the quiz meet all of the requirements? And GPT-4
says yes. So I did have to help it a bit, and said: did the quiz meet the requirement
that there should only be one correct answer per question? That was just enough to
get GPT-4 to self-reflect properly, and it corrected the mistake. I must say, it
didn't self-correct perfectly: notice it identified C and D as being correct and
equivalent, when it was B and D. But despite making that mistake, it was able to
correct the quiz. In case you're wondering, the original ChatGPT, or GPT-3.5, can't
self-reflect nearly as well. I went back to the poem example, and not only was the
generated poem full of words that didn't begin with E, the self-reflection was also
lacking: I said, did the poem meet the assignment? And it said yes, the poem meets
the assignment. As the lead author Noah Shinn put it: with GPT-4, we are shifting the
accuracy bottleneck from correct syntactic and semantic generation to correct
syntactic and semantic test generation. In other words, if a model knows how to
test its outputs accurately, that might be enough, even if its initial generations
don't work; it just needs to be smart enough to know where it went wrong.
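That idea, that testing an output can be easier than generating it, is easy to see in the poem experiment. Here is a minimal, runnable sketch (my own illustration, not from the paper) of the kind of check that catches the "ascent" mistake; the critique it produces is exactly what you would feed back to the model for its next attempt.

```python
import string

def offending_words(poem):
    # Return the words that break the "every word begins with E" rule.
    stripped = poem.translate(str.maketrans("", "", string.punctuation))
    return [w for w in stripped.split() if not w.lower().startswith("e")]

# The critique that would be fed back to the model on its next attempt:
critique = offending_words("Eagles embrace every ascent eagerly")
# critique == ["ascent"], mirroring the slip GPT-4 made in my test
```

Of course, in the experiments above GPT-4 played the role of the checker itself; a hard-coded verifier like this only works when the constraint is mechanical.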
Others are discovering similar breakthroughs. This paper, from just three days ago,
comes up with its own self-improvement technique: they get GPT-4 to frame its
dialogue as a discussion between two agent types, a researcher and a decider, a bit
like a split personality, one identifying crucial problem components and the other
deciding how to integrate that information. Here is an example, with GPT-4's initial
medical care plan being insufficient in crucial respects. The model then talks to
itself, as a researcher and as a decider, and then, lo and behold, it comes up with a
better final care plan. The points in bold were added by GPT-4 to its initial care
plan after discussions with itself. And the results are incredible: physicians
chose the final summary produced by this DERA dialogue over the initial
GPT-4-generated summary 90 to 10. That's the dark red versus the pink; I'm
colorblind, but even I can see there's a pretty big difference. The authors also
introduced hallucinations at different levels, low, medium, and high, and they
wanted to see whether this dialogue model would reduce those hallucinations. These
are different medical gradings, and you can see that pretty much every time, it did
improve things, quite dramatically.
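The researcher/decider pattern can be sketched as a simple alternating dialogue. This is a structural illustration under my own naming, not the paper's implementation: both "agents" would really be the same LLM given different role prompts, and `stub_llm` below is a stand-in so the loop actually runs.

```python
RESEARCHER_PROMPT = "Identify crucial problem components the draft misses."
DECIDER_PROMPT = "Decide how to integrate the researcher's points."

def dera_dialogue(draft, ask_llm, rounds=2):
    # Alternate researcher critiques with decider revisions.
    for _ in range(rounds):
        issues = ask_llm(RESEARCHER_PROMPT, draft)
        draft = ask_llm(DECIDER_PROMPT, draft + "\n" + issues)
    return draft

# Stub LLM so the control flow runs end to end; a real setup would
# send both role prompts to the same chat model.
def stub_llm(role_prompt, content):
    if role_prompt is DECIDER_PROMPT:
        return content + " [+point]"
    return "a missing point"

final = dera_dialogue("initial care plan", stub_llm)
```

The points in bold in the paper's figure correspond to what the decider folds into the draft on each round here.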
Then there was this paper, also released less than 72 hours ago. They also get a
model to recursively criticize and improve its own output, and find that this
process of reflection outperforms chain-of-thought prompting. They tested their
model on MiniWoB++, which is a challenging suite of web-browser-based tasks for
computer control, ranging from simple button clicking to complex form filling. Here
it is deleting files, clicking on like buttons, and switching between tabs. A bit
like my earlier experiments, they gave it a math problem and said: review your
previous answer and find problems with your answer. This was a slightly more
leading prompt, but it worked. They then said: based on the problems you found,
improve your answer. And then the model got it right. Even if you take nothing else
from this video, just deploying this technique will massively improve your outputs
from GPT-4. But we can go much further, which is what the rest of the video is
about. Before I move on, though, I found it very interesting that the authors say
this technique can be viewed as using the LLM's output to write to an external
memory, which is later retrieved to choose an action.
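Those two critique prompts can be wrapped into a reusable helper. A minimal sketch, assuming a hypothetical `chat` function that takes a message history and returns the model's reply text; the prompt strings are the ones from the paper's math example.

```python
CRITIQUE = "Review your previous answer and find problems with your answer."
IMPROVE = "Based on the problems you found, improve your answer."

def rci(question, chat):
    # `chat` is any callable: message history in, reply text out.
    history = [{"role": "user", "content": question}]
    answer = chat(history)
    history += [{"role": "assistant", "content": answer},
                {"role": "user", "content": CRITIQUE}]
    critique = chat(history)
    history += [{"role": "assistant", "content": critique},
                {"role": "user", "content": IMPROVE}]
    return chat(history)
```

Plugging in a real chat-model API call for `chat` gives you the critique-then-improve recipe in two extra turns, with no fine-tuning.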
Going back to Karpathy: remember, this critique-retry metacognition strategy isn't
the only way that GPT-4 will beat its own records. The use of tools, as he says,
will also be critical. Less than 72 hours ago, this paper was released, and arguably
it is as significant as the Reflexion paper. It's called HuggingGPT, and as the
authors put it, it achieves impressive results in language, vision, speech, and
other challenging tasks, which "paves a new way towards AGI". Essentially, what the
paper did is use language as an interface to connect numerous AI models for solving
complicated AI tasks. It's a little bit like a brain deciding which muscle to use
to complete an action. Take this example. The prompt was: can you describe what
this picture depicts, and count how many objects are in the picture? The model,
which was actually ChatGPT, not even GPT-4, used two different tools to execute the
task: one model to describe the image, and one model to count the objects within
it. And if you didn't think that was impressive, what about six different models?
The task was this:
please generate an image where a girl is reading a book, and her pose is the same
as the boy in the image given; then please describe the new image with your voice.
The central language model, or brain, which was ChatGPT, had to delegate
appropriately. All of these models, by the way, are freely available on Hugging
Face. The first model was used to analyze the pose of the boy; the next was to
transpose that into an image; then generate an image; detect an object in that
image; break that down into text; and then turn that text into speech. It did all
of this, and notice how the girl is in the same pose as the boy, same head position
and arm position. And then, as a cherry on top, the model read out loud what it had
accomplished.
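That delegation pattern can be sketched as a planner chaining specialist tools. Everything below is illustrative: the tool names and the fixed plan are invented stand-ins, whereas HuggingGPT has the language model itself produce the plan and select real Hugging Face models for each stage.

```python
# Registry of specialist "models" (dummies standing in for real ones).
TOOLS = {
    "pose-detection": lambda x: f"pose({x})",
    "pose-to-image":  lambda x: f"image<{x}>",
    "image-to-text":  lambda x: f"caption of {x}",
    "text-to-speech": lambda x: f"audio[{x}]",
}

def run_plan(plan, user_input):
    # Feed each stage's output into the next stage's input,
    # like the girl/boy pose example above.
    result = user_input
    for stage in plan:
        result = TOOLS[stage](result)
    return result

out = run_plan(
    ["pose-detection", "pose-to-image", "image-to-text", "text-to-speech"],
    "boy.jpg",
)
```

The controller's real job, which the stub hides, is translating a free-form user request into that ordered plan and choosing which registered model best fits each stage.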
This example actually comes from another paper, released four days ago, called
TaskMatrix. Remember how the original Toolformer paper used only five APIs? This
paper proposes that we could soon use millions. In this example, the model is
calling different APIs to answer questions about the image, caption the image, and
do outpainting from the image, extending it from a simple single flower to this 4K
image. Going back to HuggingGPT, we can see how it deciphers these inscrutable
invoices and reads them out loud, and can even perform text-to-video, with an
astronaut walking in space. At this point, I can't resist showing you what
CGI video editing might soon be possible with AI. Here's Wonder Studio, which is
backed by Steven Spielberg. "Welcome to Wonder Studio, where making movies with CGI
is as simple as selecting your actor and assigning a character. The system uses AI
to track the actor's performance across cuts, and automatically animates, lights,
and composites the CG character directly into the scene. [Music] Whether it's one
shot or a full sequence, Wonder Studio analyzes and captures everything: body
motion, lighting, compositing, camera motion. And it even tracks the actor's facial
performance." These advancements do seem to be accelerating, and requiring fewer and
fewer humans. This paper showed, back in the before times of October, that models
didn't need carefully labeled human datasets and could generate their own. Going
back to the Language Models can Solve Computer Tasks paper, the authors seem to
concur. They said that previously, significant amounts of expert demonstration data
were still required to fine-tune large language models; on the contrary, the agent
they suggest needs fewer than two demonstrations per task on average, and doesn't
necessitate any fine-tuning. This reminded me of the Alpaca model, which fine-tuned
its answers based on the outputs of another language model. Human experts were
needed briefly at the start, but far less than before, a bit like a child no longer
needing a parent, except maybe GPT-4 is on growth steroids. Ilya Sutskever of
OpenAI put it like this: "I mean, already most of the data for reinforcement
learning is coming from AIs. The humans are being used to train the reward
function, but then the reward function, in its interaction with the model, is
automatic, and all the data that's generated during the process of reinforcement
learning is created by AI." Before I end, I should point out that
these recursive self-improvements are not limited to algorithms and APIs; even
hardware is advancing more rapidly due to AI. This week we had this from Reuters:
Nvidia on Monday showed new research that explains how AI can be used to improve
chip design, and by the way, this includes the new H100 GPU. They say that the
Nvidia research took reinforcement learning and added a second layer of AI on top
of it to get even better results. And to go back to where we started: the GPT-4
technical report showed that even with compute alone, not self-learning, we can
predict with a high degree of specificity the future performance of models like
GPT-5 on tasks such as HumanEval. These accelerations of AI are even giving the CEO
of Google whiplash, and I can't help feeling that there is one more feedback loop
to point out: as one company, like OpenAI, makes breakthroughs, it puts pressure on
other companies, like Google, to catch up. Apparently Bard, which has been powered
by LaMDA, will soon be upgraded to the more powerful PaLM model. With
self-improvement, tool use, hardware advances, and now commercial pressure, it is
hard to see how AI will slow down. And of course, as always, I will be here to
discuss it all. Thank you for watching to the end, and have a wonderful day.