Last week in AI #155 - ChatGPT memory, Altman seeks trillions, Califonia AI regulation, art gen lawsuit
From policy discussions to synthetic media, AI's trajectory is filled with debates and potential breakthroughs that could shape the essence of human creativity and strategic thought.
AI Developments and Concerns
California introduces a bill focused on establishing strict testing and safety protocols for AI companies.
OpenAI's GPT-4 stirs up discussions on its potential military applications, highlighting ethical and strategic implications.
AI image generation companies like Stability, Midjourney, and Runway face copyright challenges, prompting questions on creativity and ownership.
Policy and Safety
The UK's AI Safety Institute reveals the ease with AI models can produce biased or harmful outputs.
Public protests against OpenAI underline societal concerns regarding AGI and its possible militarization.
Research Breakthroughs
DeepMind explores chess play by relying solely on neural networks, bypassing traditional search strategies.
Research shifts towards interactive agents, highlighting the push for more adaptable and efficient AI systems.
Synthetic Media and Art
Read the full discussion in the transcript below 👇
#155 - ChatGPT memory, Altman seeks trillions, Califonia AI regulation, art gen lawsuit
Hello, and welcome to Scanner Today's Last Week in AI podcast, where you can hear us chat about what's going on with AI. As usual, in this episode, we will summarize and discuss some of last week's most interesting AI news. You can also check out our Last Week in AI newsletter at lastweekin.ai for articles we did not cover in this episode. I am one of your hosts, Andrey Kurenkov. I finished my PhD focused on AI at Stanford last year, and I now work at a generative AI startup. I'm your other host, Jeremy Harris. I'm the co-founder of Gladstone AI, which is a national security AI safety company focusing on increasingly AGI and AGI-like systems. I do want to say, by the way, last episode, I made this just offhand remark at the beginning saying that we're looking for partnerships in sales with our Department of Defense-focused stuff and our intelligence community-focused sales stuff. We got a whole bunch of outreach, not just from amazing people who I think are going to be so, so great to talk to, and I've got calls booked with them already, but just from people who were like, I can't help necessarily with this, but I just want to be supportive. I honestly, that was super humbling and amazing how supportive our audience is. Just a big thank you to everybody who listens to the podcast, whether you're a regular listener or you're just tuning in every once in a while. I was blown away by it, and yeah, just super appreciative, so thank you so much. Yeah, and we got some useful comments also after that last episode. We got a shout out to one story we didn't cover last week that we will be covering this week from someone, so that was also very helpful, nice to see. So yeah, thank you, and do feel free to comment on any platform, we are on YouTube or Substack or feel free to email. As always, we have emails in the episode description, or you can just go ahead and type in contact at lastweek.ai. Well, basically just sitting there refreshing the comment section nonstop, so just anything, anything, please. I am subscribed to so many newsletters to keep up with the news, so I check my emails daily and go through dozens of them, and as a result, I do try and also make sure to see a new listener emails, so there you go. Before we get into the news, quick sponsor read, and we are once again promoting the Super Data Science Podcast. This is one of the biggest technology podcasts globally. They cover not just data science, but machine learning, AI, data careers, various things. It is hosted by John Krohn, the chief data scientist and co-founder of a machine learning company, Nebula, and the author of a bestselling book, Deep Learning Illustrated, and just generally a very, very knowledgeable person when it comes to AI. This podcast has been going on for a long, long time. There are over 700 episodes with all sorts of people, so if this podcast is like 150, that one is at 700. He must have learned everything from all the people you've talked to so far. Definitely go check it out if you'd like to hear from people in the world of data science or machine learning or AI, kind of more of a person-based way to see into the world as opposed to what we do, which is more news from media and so what's going on. Yeah, John is a great interviewer, and because he is bald, he never has a bad hair day, and I'm somewhat jealous of that, but yeah, he's a fine gentleman and a scholar, and you should definitely listen to that podcast. And now kicking off the news, starting with the tools and apps section, the first story being that ChatGPT is getting memory to remember who you are and what you like, so this was just announced, and this is a feature that will be starting to roll out. It's not available for everyone, but it's pretty much what it sounds like. Your ChatGPT chatbot will be able to manually or, I guess, automatically remember certain things that come up during your conversations, and each custom GPT will have its own individual memory allowing for more personalized experiences across various places. And the last thing is you will be able to see each individual snippet of memory, so this is not some sort of neural memory, it's a literal little string of text that sums up a fact about you, and there will be a UI component where you will be able to see each of them, delete stuff that you don't want to be remembered, even individually add stuff as well. So yet another new, more product-type feature of an AI-type feature coming to ChatGPT. Yeah, and the implementation of this, and the fact that OpenAI sees potential demand for the... I don't want to call it interpretability, it's almost an interpretability-type solution, right? Like, what does the system know about me? Obviously, if they're storing it in the form of more like a database of raw strings, and they're using kind of RAG, like Retrieval Augmented Generation, to get the chatbot to just ping that database, then it's not quite the same as neural interpretability. But what I think is kind of cool here is, this does start to introduce a potential kind of, not necessarily business model, but at least use case for consumers for deeper levels of AI interpretability, which I think is great for safety, and kind of getting market incentives aligned with some of the important interpretability work that's being done right now. Apparently, this memory feature works in two different ways. You can either explicitly tell ChatGPT to remember specific facts about you. The examples they give in the article are like, I always write my code in JavaScript, or my boss's name is Anna, or alternatively, ChatGPT can just try to pick up those details over time. So, kind of like implicitly learning from its interactions with you. So, you can kind of go either way. Apparently, each custom GPT you interact with will have its own memory, as you said. And there's a whole bunch of measures that they're taking to, again, give greater visibility. I don't want to say, again, greater interpretability. It's a fuzzy word now, because there's this ambiguity as to whether we're getting visibility into the knowledge that's stored in a database that just gets queried by the model, or in the model itself. But certainly, greater control of what's in that memory. And apparently, the system's been trained not to remember things like information about your health, by default. That's kind of interesting. So, the models have to learn some stuff, but not other things. And apparently, you can always just ask ChatGPT what it knows about you. We know, by default, this memory feature is going to be turned on. This is not the first time that OpenAI has kind of kicked things off with a default on mode for one of these more kind of permissive, I don't want to say intrusive applications, but certainly, it is more intrusive in the sense that it's going to remember more about you. So, yeah, right now, just in testing. Small portion of the user population, I suspect, will be seeing a rollout fairly soon. I think it's kind of interesting they are adding this. They already had, I believe, the ability to manually add in some instructions that are global, right? Like, I'm a teacher, system prompt, exactly. And this is probably, in practice, not too dissimilar from that feature we already had. But I guess the key difference is that the AI will be able to automatically store some things for future usage without you having to tell it to do that. So I guess there is a play here to make these chatbots as individualized as possible. And we've been covering how there's a bit more pressure happening on chatGPT coming from Gemini and many people launching chatbots, really. So it seems like we'll probably be seeing more of these sort of product iterations and user experience iterations rather than fundamental AI improvements to try and provide the best user experience. And up next, we have Reka Flash, an efficient and capable multimodal language model. So the open source language models that are about 7 billion to 12 billion parameters, those are a dime a dozen. We talk about them all the time. It's rare-ish to have new kind of pre-trained foundation models at scale. That is the category that this kind of falls into. It's funny. Last episode, we were talking about, Andre, you coined the term small-large language models at 7 to 13 billion or something. Well, we were talking about how big is big. And this is a 21 billion parameter model trained entirely from scratch. So it maybe falls in that category. So this is Reka's model. It's competitive with Gemini Pro and GPT 3.5. So quite interesting, especially given that GPT 3.5, if you look at the largest version, is a fair bit larger than that. Apparently, it outperforms Gemini Pro on a whole bunch of interesting benchmarks, including MMLU, which is a more general purpose language understanding benchmark. And it's competitive on GSM 8K and human eval. So GSM 8K is a math benchmark. So for logical reasoning, it actually seems like this model is doing pretty well. There's a whole bunch of interesting data that they share about it. It was pre-trained on text from over three, two different languages. There is a more compact variant that's called the Reka Edge. It's going to be 7 billion parameters and called Reka Edge, of course, because it's meant to be presumably deployed on Edge devices. So it's smaller, so it can fit on them. And they've got a playground that you can play around with it on. I thought what was really kind of most interesting about this is that they've done this. Reka has done this. They've built Reka Flash on relatively little in the way of resources. So they raised a $60 million Series A so far. Typically, that's just not going to be enough money to do anything, certainly nothing at the frontier. Now, this is not a genuine frontier model. It's far from that. It's GPT 3.5 level. Right now, we're on to, gosh, GPT 4, really GPT 5. But it is interesting and noteworthy that it's competitive with some models that people do legitimately pay to use. And that is GPT 3.5, and that is Gemini Pro. So Reka really kind of making a bit of a splash with Reka Flash, and certainly an impressive model, maybe more impressive than I would have expected a company that's as funded as Reka is at this stage. And they do say that their largest and most capable model, Reka Core, will be available to the public in the coming weeks. Hopefully Reka Core will be at least closer to that GPT 4, Gemini Ultra level of performance. If it is, that's going to be really impressive, actually. Yeah. Yeah, exactly. And as you said, you can go in and try this out in the Reka Playground. Reka is a bit more of a, I guess, research group so far, and a bit more focused on API deployments, as far as I can tell. It doesn't seem they're aiming quite as much to be like a chatbot provider, but yeah, you can go in and play around with them, and they'll be probably at least as good as GPT 3.5. And GPT 3.5 isn't multimodal, actually, right? So it's different in that way. It's quite impressive, I think. Like you mentioned, GPT 3.5 is, to our knowledge, at least initially, GPT 3 was like 185 billion parameters when it came out years ago. And the fact that you can squeeze this much performance out of a much smaller model indicates, I think, quite a bit of movement and improvement in our understanding of how to optimize these models with fewer parameters, smaller size, less compute, but still being as performant. And yeah, that's, I guess, what happens in AI. We kind of squeeze out as much performance as we can. Yeah, that's right. And this is, just to situate this for listeners who might be used to hearing the story of AI scaling, and bigger is better, AI scaling remains true, but at the same time, we have algorithmic advances that are compounding that. And so you can, as Andrei just said, you can squeeze more juice out of the lemon using the same amount of compute, the same model size. And so that's a big part of what we're seeing here. And if you are Reka, you are absolutely going to be interested in those kinds of strategies, again, because you're operating on a very limited budget, like $60 million of Series A funding, that's where they're at. That is not enough to compete with, like GPT-4 is estimated just the training run alone to have cost anywhere from 40 to $100 million in one shot. That's just in compute. That does not account, for example, for salaries, which are just gigantic in this space. So yeah, I mean, it's absolutely something they have to do. I think one thing I thought was really interesting here. So they actually, the Reka Flash model and Reka Edge, they were both pre-trained as usual on text autocomplete objective. They got instruction fine-tuned. So they got additional training on a dataset that consisted of instruction data and instruction following data. And then they got RLHFed with PPO. And that's the kind of newish technique that's starting to show up everywhere. But the key thing is they actually used Reka Flash to provide the reward model, the model that essentially evaluates the quality of its own outputs. So they're training Reka Flash, they're pre-training Reka Flash. And they're actually going to use Reka Flash to evaluate Reka Flash's own outputs in this kind of reinforcement learning from, well, it's not human feedback now, right? It's reinforcement learning from AI feedback loop that they've set up. So kind of interesting, certainly reinforcement learning from AI feedback is a thing. It's an increasingly popular thing, but noteworthy that it's now kind of full on being used by them. Like, I don't know that they mentioned anywhere in the blog post, any kind of actual human feedback in the RLHF process. So that's kind of an interesting little note there. And just one more quick story in this section. The headline is, Say What? Chat with RTX brings custom chatbot to NVIDIA RTX AI PCs. And this is from NVIDIA. So this is pretty much a tag demo called Chat with RTX. And the idea is that you can download a chatbot and run it locally on your NVIDIA GeForce RTX 30 GPU or higher. So you can, I guess this is demonstrates that if you have, I guess, a gaming PC or a gaming computer that has these pretty beefy, but not sort of like super computer level GPUs. At this point, you can definitely go ahead and run a chatbot and customize it and, you know, do whatever you want. This also has retrieval augmented generation and various kind of accelerations on top of it to demonstrate all the stuff that NVIDIA brings to the table to optimize your inference. Yeah. And it is obviously like it runs locally. That's a big part of the sort of logic here of how it works so quickly. Like it's blazingly fast because it's running on your local computer. There's no pinging the chat GPT server or whatever. And also better for privacy, right? All your data stays on the device. So there's this interesting question of what the future looks like when we look at, you know, customized chatbots or just chatbots in general. Do we have chatbots that are actually stored locally on whether it's our PC or our Mac or our GPUs? Like does it essentially, do we have computing happening at the edge in all cases for privacy reasons for speed reasons? Or does the future look more centralized in the form of, you know, open AI style server serving up a model like this? And right now we don't know, but this is certainly NVIDIA playing around with the idea of the former. Yeah. This allows you to connect all kinds of different open source models like Mistral or Lama too. So you know, you get decent, decent models for sure out of this. One of the examples that they give is, you know, you can imagine asking this chatbot, what was the restaurant my partner recommended while in Las Vegas and chat with RTX will actually scan local files on your computer and point you to the answer with the context. And so this is kind of like, you know, Google search for your own computer in a way, or maybe more like, you know, chat GPT for your own computer. But certainly kind of interesting. And again, for those privacy reasons, maybe something people would be more keen to plug into. Exactly. Just add a bit more detail. What this looks like in practice is a GUI application, so you don't need to be a developer and run terminal commands or anything. It has a little dropdown menus for everything. So there are currently, I think just Mistral and Lama 2 in that dropdown menu, probably will add more open source options as they become available. And there is a little UI to be able to say, have access to these files when you're answering my question. And it will then use that retrieval stuff that is built in to be able to answer questions relevant to your files. So yeah, it's a fun little tech demo and a fun little thing to try if you're someone to live a decent GPU. And now moving on to applications and business. And you know, it's a Wednesday. It's about time. You know, we haven't heard from OpenAI in a while or Sam Altman. We're about due for, you know, Sam Altman getting up and saying he wants a trillion, no, sorry, $7 trillion to just, you know, reshape the entire economics of semiconductor fabs. So that's where we're at. This headline is Sam Altman seeks trillions of dollars to reshape business of chips and AI. And this is all coming from, you know, people familiar with the matter type thing. So it's not an official announcement. But you might remember from previous episodes, we've talked about Sam A talking to apparently folks in the UAE, the United Arab Emirates about some chip project. It wasn't super clear. It now seems as if he's been asking them for help raising as much as five to $7 trillion with a T dollars for whatever this chip project is. We're getting a bit of information about that now. For context, and this is funny, I've read a couple of articles about this. They always do this. They always start listing all these comparables. They're like, okay, just to put this in context, right? So this would dwarf the current size of the global semiconductor industry. So global sales of chips were half a trillion dollars last year, right? Remember, he's raising five to $7 trillion with a T dollars. So half a trillion is the annual kind of global sales of chips. It's expected to go up to about $1 trillion annually by 2030. That's kind of like, he's going to be raising like five to seven times as much as the whole market is going to be worth in 2030, and then global sales of manufacturing equipment for semiconductor chips. So these are like the kind of ASML type companies that we talked about before were $100 billion last year. So all this stuff is like tiny, tiny, tiny. Again, more context. This is larger than the national debt of some major global economies and bigger than giant sovereign wealth funds. Okay. One problem that comes to mind, put my startup hat on on this. First of all, you never ever bet against Sam May, that much is clear. You never ever bet against the founder because when you're wrong, you're very, very badly embarrassed. But one of the big challenges that I would imagine starts to kind of arise when you're looking at any big move like this, this is a space with extremely high technical floor where you just need a ton of technical knowledge to get involved. So when you're talking about plowing, five to $7 trillion into this, at a certain point, you got to imagine you're going to be bottlenecked by talent as much as infrastructure. So what the bottlenecks are, it's going to be really interesting to track. I'm curious what Sam May himself thinks, but apparently he's been meeting with all kinds of folks, not just the UAE. The Commerce Secretary, Gio Raimondo, who's come up a lot on the podcast, of course, they've talked about that, apparently had a productive conversation. And I guess the deal details, as far as we know right now, it looks like OpenAI is basically saying, look, we're going to set up some kind of, I don't know to call it a consortium or whatever, but some kind of partnership with all the big fabs, TSMC and so on, and we're going to agree to be a significant customer of these new factories. And so they're going to fund a lot of the effort with debt, but it's all based on the promise of OpenAI, among others, growing really fast and coming into this high demand. The UAE, hugely important government for this, just because they have so much cash. But apparently Sam May also met with my associate's son, who is the famous CEO of SoftBank, and he also met with TSMC. So really all the folks you'd imagine you'd want to meet with, if you're raising some gigantic amount of money, like $7 trillion, unclear whether he'll be able to do it, whether it would work if he does, but it is the sort of thing that, if you look at the way Sam Alton's been thinking about this, it makes perfect sense. This is what you do if you think scaling gets you to AGI. You do not place a $7 trillion bet, which again, for more context is like roughly half of US GDP. You don't raise that kind of money for AI hardware if you don't think that the returns are going to be, and I can't believe I'm saying this, but literally on the order of the US GDP. This is what the returns would have to be in your mind if you were going to do this. And for those of us who think AGI might be happening soon, that actually makes perfect sense. But this just reflects Sam May's kind of continued doubling down on this sort of theory of the case. Now, I do want to say, we didn't cover this story initially last week, although it was already out there, but it got quite a bit of play throughout the media. And it's worth pointing out that this number, $7 trillion, it's coming from a quote in the article that says, a project could require raising as much as $5 trillion to $7 trillion, one of the people said, the people being some of these informed sources here. So it's worth keeping in mind that this is very early stage, right? The comment from the OpenAI spokeswoman was that OpenAI has had productive discussions about increasing global infrastructure and supply chains for chips, energy and data centers, which are crucial for AI and other industries. So there are discussions being had to increase the supply chain, to build more foundries. Sure, it could require raising as much as $5 trillion or $7 trillion, but it could also wind up being much less, right? So it's just worth keeping in mind that this is all very nebulous, this is kind of a broad direction Sam Altman is pushing in, and it's not like he is setting out to get $7 trillion right now. At least as far as you can tell from the details so far, that part of it has been a bit of a blown, but the fact that he is seeking to do this very, very, very capital intensive thing of creating more sources for chip production, that is certainly true. And yeah, as you said, Sam Altman, given his position as the lead at OpenAI, a very influential figure, very famous figure now, having been quite public, if anyone can try and do it, I guess it would be him. Yeah, no, and good point about, as we talked about the sort of uncertainties around who's saying what and whether this is real, for what it's worth, I mean, it does align with my understanding of the costs involved. If you really wanted to crank up hardware production, like semiconductor production, this is not an insane number to be thrown around if you expect, say, two orders of magnitude of growth in the space, which again, on that AGI hypothesis, yeah, pretty plausible. But you're right, I mean, there's tons of uncertainty. If nothing else, this forces us to have an interesting conversation about where the bottlenecks in the semiconductor manufacturing cycle are right now. And we talked about talent a minute ago. There's also this question of, what about the rare earth minerals, like gallium and germanium especially, which are kind of core to this, mostly produced in China, certainly mostly refined in China? And what can we do to massively increase that? Is money alone enough? And when you start talking about investments of this scale, you're actually talking about moving the market price of these things too. And so they become more scarce, so they become more expensive. So it's sort of like this nonlinear compounding effect that happens. So anyway, yeah, I totally agree. I think this is just going to be a space we'll have to keep an eye out for. And even $1 trillion is going to be pretty insane if it happens, but it may not happen at all. And yeah, we'll just have to wait and see. And to be fair, I'm sure Sam Alper would love to be able to raise $5 trillion to $7 trillion. I would too. I think he wouldn't mind if that was possible. And it does seem he is going in that direction. So that's worth noting. Yeah. And you won't go into it, but it is, I guess, worth also noting that the NVIDIA CEO did have a chance to comment on this kind of a little bit. And as you might expect, his general comment was like, well, GPUs will get more efficient. All this is probably overkill or something to that kind of direction of kind of underplaying the need for this and also kind of the entire effort kind of gently. And on to our lightning round. Lightning, lightning, lightning. Sorry. All right. So the first, I don't know why I did that. The first article here is a report. It says China's SMIC to begin production of five nanometer chips for Huawei. Okay. This is actually a really big story. We usually go through this song and dance any time we talk about, you know, five nanometer process and what does that mean to just really quickly summarize. So right now there are three, what are called node sizes that you need to know about that the humans currently know how to make. So we know how to make semiconductor chips down to roughly speaking three levels of precision. We have a seven nanometer process. This is the process that was used to make the NVIDIA A100 GPU, which GPT-4, by the way, was trained on. We have a five nanometer process, which is used to make the H100 GPU and all current top of the line GPUs. This is the process that was used to train GPT-5. And then there's the three nanometer process, which is currently being used for the iPhone basically. And it's the only people who make three nanometer GPUs right now, sorry, three nanometer process is TSMC is Taiwan Semiconductor Manufacturing Company. They are absolutely the world leader on this. Now other firms have since started to close in on the five nanometer process themselves. And five nanometers is really challenging. It can require new kinds of technology that are export controlled, that China cannot now legally get their hands on. It requires usually these devices called extreme UV lithography machines that are only really made by a Dutch company called ASML. So if you want to go to five nanometers, if you want to build, in other words, those NVIDIA H100 equivalent chips, if you want to train a model like GPT-5, you're going to need to find a way to get access to extreme UV lithography, unless you make some breakthrough. And that is exactly what seems perhaps to have happened. We've talked about this on the show before trying to figure out whether this Chinese company SMIC, which is a competitor to TSMC, which is the actual world leader in semiconductor manufacturing. It's the Chinese version of it. It's kind of China's best play here. They apparently are managing to use their existing stock of US and Dutch made equipment to produce five nanometer chips. This if true would be a significant breakthrough. The big question now is about whether they're going to be able to achieve enough yield. So you might be able to make a five nanometer chip, but your process, if it's inefficient, if it only works 10% of the time, then the cost per chip is going to be 10 times higher. So that just may not be economically viable. Right now, it seems like the yield from SMIC is looking like it may actually be good enough to start shipping on this five nanometer process that they're working on. Right now, they've already partnered with Huawei and kind of surprised the whole industry when their Mate 60 Pro premium smartphone launched with a seven nanometer process, which they were not supposed to be able to do either. That was back in August. Right now, it looks like there are some Kirin chips that are being designed with Huawei's high silicon unit that may actually end up containing five nanometer chips. So we're going to have to watch this very closely, but this would be a significant breakthrough, both from a consumer standpoint in China, from a domestication of the AI supply chain in China, and from the national security perspective. And for context, Huawei, in addition to working on phones, is also one of the big players in AI hardware. They have their Ascend AI chipsets. And over on the West, we have AMD, we have NVIDIA as kind of the major players with the expert controls that really prevent powerful GPUs from being shipped pretty strictly as we've been covering. It would be a pretty significant deal for China, for Huawei to get access to this better node process and then to be able to apply it to AI hardware so that expert controls don't matter quite as much, basically. Next story, and this is kind of a spicy one, a crowd destroyed a driverless Waymo car in San Francisco. So not any huge ramifications here, but I do think a fun story to cover. And it is exactly what it sounds like. During celebrations of a Chinese New Year in San Francisco's Chinatown, a Waymo car kind of got stuck. There were fireworks going around, so basically it just stood there staying still while human drivers drove around and left the area. And somehow at some point, someone decided to go ahead and throw in a firework into one of the open windows of the car. And there are a ton of photos and videos you can find online of the car just being full on flame, like completely on fire, no explosions as far as I'm aware, but it was totally on fire. You know, the fire department had to come out, you can see how it was basically melted. And yeah, pretty dramatic and kind of builds on prior events of people messing with self-driving cars. That was not a movement per se, but a bit of a trend of putting cones on the sensors of these self-driving cars just to mess with them. So yeah, this totally happened in San Francisco and I think speaks to there being a lot of Waymo cars in San Francisco and the people there still kind of adopting to them and responding to them in different ways, including this act of vandalism of that, you know, there's no specific person behind it. It was kind of a crowd, presumably just for fun, really, during the celebration, someone decided to go for it. I love your euphemisms there. That was some great like press secretary, people still getting used to these cars and you know, in the way that one does, beating the living shit out of them. Yeah, actually, it's funny. I was on Twitter, I guess, X, and I saw, I'm still trying to figure out whether it was this one. There was another video of a bunch of people kind of beating the crap out of one of these cars with a, one guy had a skateboard and he was just, you know, hammering away. Maybe this was just before it was lit on fire or something. I'm not sure. But yeah, anyway, it definitely seems to be stirring up a lot of emotions understandably as we look at like, we're automating away, like a whole category of jobs here. And there is something faintly dystopic about like this particular episode of the Black Mirror series that we're running in 2024. Just that, you know, we're seeing accidents and stuff like that with no humans responsible and so on and so forth. So yeah, I mean, I'm curious what kind of PR push we're going to have to see from some of these companies to get people more comfortable with the idea. You know, it's no longer just about dollars and like, you know, how, how much can you lower the cost of a cab ride, but also just, you know, how can you like reduce the odds that random people are just going to try to light these things on fire because they're upset, you know, whatever, whatever sort of existential angst they're experiencing for other reasons too. But yeah, really interesting and freaky, freaky part of the show. Yeah. Part of me wonders if there is kind of a bit of a, not just tech backlash, but AI backlash brewing. Right. You know, self-driving cars is a pretty clear and present sign of it. That's like out there in the physical world. So yeah, it would be interesting if that kind of is part of a general cultural movement of our time, so to speak. But anyway, cool story. And if you want to see a melted car or a self-driving car on fire, go to the link we provide or just search for it. Next story, OpenAI reportedly developing two AI agents to automate entire work processes. So this is kind of some insider info, not something OpenAI announced or released. But as the headline said, this is basically looking at actual AI agents. So ChattyPT, as we've been saying, is still more of a AI model categorized as something where you give it an input, it gives you an output, and that's sort of it. An AI agent is something that you can tell to do something and it can go off and sort of execute for a while autonomously, getting observations of the environment or the internet or whatever you want and deciding on actions, taking them and seeing the update or the result and doing that in a loop. So far, OpenAI hasn't released any agents, at least not since their work in reinforcement learning back in the day when they were playing Dota and so on before all this AI hype. And so these AI agents are meant to, in one case, take over a user's device to perform tasks such as transferring data between documents and spreadsheets. And in the second case, it's more web-centric. So it's designed to perform web-based tasks such as collecting public data, creating travel itineraries, or booking airline tickets. And we've seen some examples of these things before from other players in the space, especially for browsers. We've seen demos and examples of saying, book me a ticket to Atlanta. And there is an agent that just knows to go to the right website, do the right Google, click the right forms, fill in the text, et cetera, et cetera. So not too surprising to hear that this is an initiative at OpenAI. And yeah, I guess we'll have to keep an eye on it and hopefully see something concrete and official soon. Yeah. And I think the whole industry is kind of moving in this direction. And there are a couple of times where we've seen these big trends, right? Trends towards multimodality, trends toward agent-like behavior. This is certainly the latter. There are a whole bunch of companies pursuing this. OpenAI is. Google, with their BARD assistant, certainly is. But there's like Imbue, Rabbit, Adept.ai, all these companies. It really seems like this is where things are going, both because it's a better user experience but also because increasingly you want these language models to be taking actions in the real world. And this is just the agent framework is just so directly kind of connected to or action-oriented, I should say. Yeah. And so there's also, it's worth noting, these agents are probably going to be interacting with each other a lot, right? We already know that OpenAI has the ability or allows users to combine the capabilities of different GPTs. And so this is probably going to be a very interactive thing. They're not going to operate in isolation. But yeah, definitely something to keep an eye on. And one last story for this section, going back to self-driving cars, Cruz names first chief safety officer following crash and controversy is the headline. So we've covered the controversy quite a lot. I won't go into it. Suffice to say, there was a significant crash that happened last year that really messed up Cruz's overall fortunes and kind of the space in general was very impactful. So as a result, it's worth to highlight that they have named this first chief safety officer. This is Steve Kenner, who has been in the autonomous vehicle industry for a while, has previously held top safety roles at various companies and will be reporting directly to the president and chief administrative officer, generally, I guess, helping to both in reality improve the focus on safety and help Cruz kind of recover their reputation. As they try to go back out on the roads, do it in a way that goes better this time, I suppose. Yeah. And they're positioning Steve Kenner here for, it seems appropriately to report rather directly to the highest levels, Cruz's president in this case and their chief administrative officer. So that sounds reasonable to me, at least based on what little I know of their corporate structure. But apparently he's got a ton of experience. He's had top safety roles apparently at Kodiak, Locomation, and as well as Uber's now defunct self-driving division. So he's definitely got a lot of experience. Hopefully this helps them with the headlines, if at least that, and then obviously with more safety, hopefully the company can do better long-term. On to projects and open source. The story is Cohere for AI launches open source LLM for 101 languages. So Cohere for AI is a nonprofit research lab that has been around since 2022, so for a while. And they have now unveiled AYA, this open source language model that can support all these languages. They have also released the AYA data set, which has a ton of annotations. It is a huge endeavor. They had teams and participants from 119 countries contributing to this. And as a result, there is 513 million instruction fine-tuned annotations, data labels, to be able to train this model. So pretty significant release as far as data goes, for sure. And as far as models go, it's yet another open source one, but in this case, definitely more optimized for things beyond English, which is not a major focus for most things we've covered so far. Yeah, that's right. It definitely is a weak spot that a lot of people have identified. You have what are known as low-resource languages. Of course, English not only is the most valuable from the standpoint of just having a larger global population that speaks English, but also because, well, as a result of that, you have way more data. So low-resource languages, well, have less data. And that's what this model is designed to help address. It is a pretty remarkable exercise, like a pretty remarkable effort. They describe this kind of open access data set they're creating. They're calling that the AYA Collection, 513 million prompts and completions across 114 languages. So apparently, there are rare human-curated annotations from fluent speakers for rare languages. So that's kind of cool. And they have benchmarked this model against other classic multilingual models, including a variant of, so there's a model called MT0 that is sort of a prominent multilingual model. Anyway, they turn it into MT0x with a little bit of extra training just to kind of make it more fairly comparable. Or sorry, MT5. It's a variant of MT5. And they find that it compares quite favorably. It's preferred 77% of the time on average to these other models. And that's a significant delta. So that does mean it's a significantly better model. Yeah, this is also notable because it's coming from Cohere, right? Cohere for AI, which is the Cohere's community ecosystem for open source projects and sort of like AI for good type stuff. So in a sense, a kind of marketing exercise, or at least it's going to be chalked off as a marketing expense by Cohere, which of course is a competitor to OpenAI in the kind of LLM space. So yeah, interesting that they're dedicating resources actively to this. It's a really interesting project. They do have to fix their sign up button, though, because the text overflows from the button. So just a helpful little tip there on the UI UX side. But yeah, looks really cool. Yeah, it's a pretty cool release. And yeah, to be very clear, I might have misspoken and said just Cohere. This is from Cohere for AI specifically, which is a nonprofit research lab established by Cohere, the big for-profit company that focuses on enterprise LLMs. And there is, by the way, a really cool analytics dashboard that you can check out on their website just showing some of the regional analytics on the number of submissions that they got from different regions and information about how the project is being taken up in different places and how many submissions per language, all that jazz. So if you're curious about that, it's worth checking out. They've done a great job laying this out. Next story, BUD-E, so BUD-E, Enhancing AI Voice Assistance Conversation Quality, Naturalness, and Empathy. This is a project from LION, which is a major institution that has done a few major projects before. They were majorly responsible for the training data for a stable diffusion. So a lot of what kicked off, a lot of the text-to-image hype and progress comes in part from this organization, LION. So they, in collaboration with several other groups, are developing this system, BUD-E, this AI voice assistant. They have created a baseline voice assistant that has low latency, so 300 to 500 milliseconds. And they are working on getting it lower, below 300 milliseconds. The whole project is open source. And they are building a data set of natural human dialogues and various qualities of this. So pretty cool project. It does highlight that to go from just a chatbot to an AI voice assistant, there's quite a bit of engineering and additional data and additional optimization required. So interesting to see them working with several other groups to make it happen and make it be more optimized. 100%. And as we've talked about this on the show before, but when you talk about voice assistants, one of the key metrics you always look for is latency, latency, latency, right? Because it's very awkward if you say something and it takes three seconds for the system to respond. It really ruins the user experience. That's why the focus is so, so intense on the latency piece. To your point, they're targeting response times below 300 milliseconds, even with models like LLAMA2 30 billion, so 30 billion parameter models. That is pretty impressive. And for context, one of the things that makes this really hard, especially when you look at the vision assistant space, is you have to find ways to take your, so once the text gets generated, you kind of, so text-to-speech systems, they normally have taken entire sentences so that they have enough context to produce a response. You can sort of think of that in the context of speech pretty clearly, because sometimes you don't know what the inflection ought to be on a particular word until you know how the sentence ends. And that's one of these irreducible problems when you talk about these systems that are going to speak to us. So one of the key things that they've looked at is finding ways to get their text-to-speech to develop context from hidden layers of the large language model, and then so they can kind of short circuit the processing a little bit, and anyway, obtain kind of a faster output in that way. So really interesting workarounds that they've set up. Again, this is really AI hardware meets AI software. I think that's going to be the theme of 2024 and, frankly, beyond. But interesting that they've already gotten this crazy, crazy 300 to 500 millisecond latency. Again, this is with a PHY 2 model, by the way. So that's not a super small one. I think that's maybe 7 billion parameters, so pretty decently big. Yeah, for them to get under 300 milliseconds with a 30 billion parameter model, that is, if you can do that, then you can do some really interesting things with voice models. That's right. Yeah, this is kind of interesting to me also as an announcement of a project, more so than a release of something. So they released a baseline, basically their first product in this line of work. But they very much in this blog post also are inviting people to contribute. As an open source project, they are inviting open source developers, researchers, and enthusiasts. And they have a whole roadmap with a ton of stuff. So they do want to add quantization, various optimizations. They are still intending to work more on the data set and so on. So there's a lot of work still to be done in this project. And it is more of a new initiative that they're pushing towards. So if you are looking for an open source project to contribute to, I guess FYI, this is one that's out there that is open source. There is a demo. But you would, I think, have to go and look at the code and mess around if you want to try it out yourself. On to the landing round, we have just a couple more stories. The first one is evaclip18b, scaling CLIP to 18 billion parameters. So CLIP, Contrastive Language Image Pretraining, is something we haven't mentioned a lot. But if you've been around AI for a while, you know it's kind of a big deal. It was back in maybe 2022, around the time that DALI 2 came out, text-to-image. This was another very significant model from that time that contributed a lot to the progress of text-to-image and just AI research in general. CLIP models basically are able to compare text and images and say, how similar is this text descriptive of this image? And there's a lot of downstream applications for that. You can do classification, but you can also do training in various ways of image generation and stuff like that. So this is scaling that to 18 billion parameters. What they say is the largest and most powerful open source CLIP model to date. It achieves really good results. It's open sourced. The data set is openly available and is actually smaller than in-house data sets employed at other CLIP models. So pretty significant, I guess, as a sort of tooling, infrastructure system thing. Not something most people use directly, per se, but an important type of model for a lot of applications. Yeah, and to your point, CLIP is like an image classification model, but it's a little more flexible than the classic image classification models that we used to have back in the day, where you would just have a label, and you would try to like, or you might have like 1,000 different categories, classes, that you want to associate with a given image. And you would label the images with one of those 1,000 different categories. One of the problems with that approach is that you're limited to those 1,000 different categories. And so your image, your vision model ends up not being able to generalize as well out of that distribution. And CLIP was one of the first ways that, actually, opening eye built the first CLIP model back in, I think it was 2021, early 2021. At the same time, by the way, I think the very same day as they announced the first DALI model. And- Was it 2021, really? I thought it was 2022. Oh, was it? Maybe January 22. Going to the trusty AITracker.org to find out. CLIP is 2021. Wow. Time flies. Yeah, I know, right? It should have been 2022, but yeah, no. And so as you were saying, right, it's this more general thing. It allows you to associate kind of a longer text description to an image, have that generated in that way. And CLIP models have been used, like you said, not just for straight classification, but they're often combined with models like DALI to rank the outputs of those image generation models so that you effectively get, overall, a more effective model on the whole. One last thing I'll just mention, they have a scaling curve that they show at the very top of their paper, sort of figure one. So they're showing the scaling behavior as they increase the number of model parameters, what happens to the zero shot accuracy of their model in sort of like classifying images. And what's interesting about it is the scaling curve does not seem to be bending. It seems to be very healthy, kind of from, say, 75% accuracy all the way to 82% accuracy, which implies there's a lot of juice left in this particular lemon. So it's a really interesting advance. The team, by the way, is from the Beijing Academy of AI, BAI, which is sometimes, you can think of that as like China's leading AGI lab, and also Tsinghua University, which has an open affiliation with the People's Liberation Army, the Chinese military. So sort of an interesting development here, and definitely another flex for Chinese AI research. And on to our last open source story, this one from Stable AI. And it is introducing Stable Cascade. Stable Cascade is a new text-to-image model that is essentially kind of an alternative to Stable Diffusion. It builds on a different architecture, where Stable Diffusion basically does the whole generation end-to-end. This has these stages, thus the cascade. And that leads to potentially better results, as I highlight in this blog post. They are releasing this non-commercial only. So this is not going to be employed in popular applications out there that allow you to generate images. At least, officially, you're not allowed to. But yeah, if you go look at the blog post, the results are quite impressive. They show, in particular, a lot of faithfulness to a prompt and being able to follow your instructions very carefully. JON FOUSTER-WILSON Yeah, it's really impressive. And they also have some progress that they're touting on inference speed, too. So it seems like they've managed to cut down relative to two other kind of more or less, well, I don't know if it's apples to apples. They're doing a 50-step to 30-step comparison. Anyway, it's definitely an interesting advance, maybe a modest advance on inference speed, as far as I can tell, just from looking at the figure. So yeah, a cool advance, and interesting to see Stability continue to pump these things out. The images do look good, I will say. I mean, there's nothing obviously wrong with any of the faces or hands or anything like that. So another big leap forward for Stability AI. And moving on to research and advancements, we start with self-discover. Large language models self-compose reasoning structures. Let's talk about prompting for a second. Usually, when you have a language model, you have to come up with some kind of prompt to get it to behave optimally. You have a problem you want it to solve. It's not the case that you can usually just straight up ask the model to solve the problem, and it'll do it perfectly. Sometimes works, but often for especially more complex tasks, you have to try techniques like chain of thought prompting, where you tell the model, hey, I want you to solve this problem step by step. Let's think about this step by step. Give me your reasoning explicitly. And then based on that reasoning, guide yourself step by step towards the answer. There are a whole bunch of other strategies, like self-consistency is another one. You do chain of thought. You get the model to lay out its thought process and get an output. And then you do that a bunch of times, or with a bunch of different models, but usually with a bunch of times from the same model. And then you evaluate for the most self-consistent thought process, if you will, and then use that output. That's called self-consistency. A whole bunch of other techniques around few-shot learning, et cetera, et cetera. And these techniques, the argument that the authors of this paper are going to make is, these techniques are not universal. So it's not the case that you're always better off using self-consistency, that you're always better off using chain of thought prompting or some other technique. Depending on the problem that you're facing, sometimes you want to go with one. Sometimes you want to go with another. Manipulating symbols might call for a different prompting strategy than doing arithmetic or writing poetry. So there's this notion that maybe what we ought to do is figure out, as a first step, before we jump into just using a given prompt, we should figure out what is the underlying reasoning structure that this task requires. And that reasoning structure, it might involve chain of thought prompting, or it might involve a combination of different techniques. So that's essentially what they're going to do. They're going to try to first have their model pick from a set of atomic reasoning modules, like chain of thought prompting, like self-consistency, and so on. And then compose them together in a coherent way that solves a given problem class that we've given to the system. So essentially, you have reasoning modules that are really good at breaking problems down into subtasks, others that are really good at critical thinking, and so on. And the idea here is the language model is going to first select the reasoning modules that are most relevant, make small adaptations to them for the specific task at hand, and then actually implement them, and prompt itself with this composite prompt that invites it to follow a particular architected reasoning process. And the results are pretty impressive. So what we end up seeing is it outperforms chain of thought prompting, like pure chain of thought prompting, on the vast majority, like over 80% of tasks that they tried. In some cases, the performance gains are up to 42%. Again, this is just with a prompting strategy. They also compare it to other techniques that are very, they're known as inference-heavy techniques. So these are techniques that require you to run a lot of inferences, to run your model many times, to generate many outputs, and then to compare those outputs. So self-consistency is one of those. If you rerun your model many, many times, and then you compare, OK, which of these things are most self-consistent, which of these reasoning flows look best, and then we'll pick that, well, that requires you to run your system, your model, many, many times at inference. So the challenge with those heavy-duty, inference-heavy methods is they're very expensive, time-consuming. They take a lot of compute. And so they've done a bunch of techniques in the background that we won't go into necessarily to optimize the efficiency of this process. But ultimately, they're able to use 10 to 40 times less inference compute to compete with combined methods like chain of thought prompting and self-consistency. So really kind of impressive way of increasing the overall performance of models with this meta-strategy, where before you dive in to just picking a given prompting technique, you have the model think about, what are my options? What are the prompts I could give myself? And how can I compose those together intelligently to solve this problem? VIVEK SRIVASTAVA, MD, PhD Yeah, so definitely more applicable to, I guess, reasoning-heavy tasks. Visa's general research works on prompting strategies tend to focus on pretty tricky problems, where typically the LLM would get it wrong. Even if you do tell it to think step by step, it would just kind of mess up along the way. They highlight an example in figure 7, this SVG path element, and then there's a bunch of coordinates. And then they say, which shape does it draw? A circle, heptagon, hexagon, kite, line, et cetera. So you can imagine how given coordinates of points, you would have to then imagine in your head, OK, here is this line that gets drawn. Here's that line. What do the lines come together to represent? Trying this by default, you would not get a result that works. If you add this structure, really engineer on top of LLM to enforce a certain way of being more careful, I guess, strategic about how you break down a task, you can then get this result. So I think generally applicable if you want to address pretty tricky types of problems with LLMs for whatever reason, and you don't have a model that is optimized specifically for that task. It's, to me, kind of interesting to see continued research along this line of augmenting raw models with more and more structure on top of it, kind of separate from it really, that controls how it reasons and how it generates the outputs and so on. It'll be interesting to see if these wind up being useful in practice or if the scaling hypothesis is true. And if you just keep scaling, the models just do this themselves. Because in theory, the models should learn that implicitly, these are the things that should be done to solve these various tasks. But at present, they do not, even at GPT-4 scales and so on. So it's a matter of time, I guess, till we find out whether with scale, reasoning of this sort is just something that gets picked up or not. Absolutely, and it is also interesting to note how we actually talked about this, I think, about eight months ago. But just this idea that as scaling continues to happen, there's a question of what bucket the scaled compute ends up going in. Do you end up spending your compute on training, or do you start spending more and more of it at inference time using techniques like this one that have you run many rounds of inference for the same problem, but let the difference between the time you spend studying and the time you're given on the actual test to solve the problem. And my hypothesis is, and I think a lot of agent architectures are moving in this direction, we're seeing a heavier and heavier amount of focus on inference time compute. And cheaper and cheaper computation schemes mean that it now makes sense to do this. Back in the day with GPT-3, it's possible that you could have gotten a lift from getting GPT-3 to engage in these kinds of prompting techniques, probably not nearly as good. But the challenge was, at that point, that it was just so expensive to run that inference time compute on every problem, that it just made sense to just front load all your compute in the training stage. Now we can afford to spend some of that compute on inference time, and we're seeing a big lift. And I'm really interested in, what does that balance look like? What does the exchange rate look like between dollars spent on training compute versus dollars spent on inference compute? And I suspect that equation is going to evolve a lot this year. Next research paper. This one is a bit older, but we haven't covered it. And I think it's probably a good time to mention it. The paper is Black Mamba, Mixture of Experts for State Space Models. And it is exactly what it sounds like. So we've covered this a couple of times, so we'll go very quickly. Mamba is a new type of neural net that is more efficient than what is typically used and has generated a lot of research in recent months, as regular listeners know. Mixtures of Experts is a way of making neural nets more efficient, broadly speaking, that has yielded some great results as well. For instance, we have mixed trial models. And yeah, there's been a lot of movement in the space of exploring Mixture of Experts. And supposedly, GPT-4 uses that technique. So this paper is combining the two. It's pretty much just more kind of a permutation and empirical paper that demonstrates that it is possible to combine the two. And they do synergize to produce a model that has good evaluation performance and good efficiency. They have various engineering details here, and they compare with various transformers and different open source models, and overall find pretty good results. So not any fundamental advances here, but I think worth noting that people are already exploring this direction that you might imagine given the separate trends, if you combine them, do you actually get good results? And at least this paper indicates that it is possible to combine them and get complementary benefits. Yeah, and this is a real shame on me one too, because I remember, I think last week, when we were talking about the Mamba models, and I was like, oh, well, it'd be interesting to see a mixture of experts type strategy. And this paper just completely, I had not realized this was out. So I should have just said, hey, it's been done. So this is an MOE, essentially this idea with GPT-3, GPT-3.5, for example, of having a dense, fully connected model, right? So you have this one chunk of model, of AI algorithm. And what happens though, if instead we break that up into a bunch of expert models, and instead of, every time we want to run an inference, instead of sending that to every part of the model, having the whole model chew on our input, instead we route strategically that input to different sub-models, if you will, different experts who specialize in a given kind of input. That's really what's happening here, except they're using Mamba instead of a transformer. Usually a mixture of experts models like that would involve transformers. Now we're seeing the kind of natural extension and using this Mamba model, which is already more efficient because it doesn't have the same quadratic time complexity that transformers usually do. So normally if you increase in a transformer the size of your input, it leads to a, let's say you double the size of your input in a transformer, it'll increase the amount of processing required by four, right, so that's what the quadratic scaling means there. In this case, they achieve linear scaling. So it's got much more favorable scaling characteristics, and they do see compounding benefits of the combination of their kind of linear scaling there with the input and this sort of transformer mixture of experts models. That's kind of interesting. So this is more of an initial exploration into this direction. If you look at the paper, there's, I guess, a lot more you could explore. And they do say so themselves, the authors of a paper. They also open source the models. So they open source intermediate checkpoints, as well as the inference code with a permissive license. So they are enabling further exploration. They release 1.5 billion and 2.8 billion of parameter models, mixture of experts parameter models. So I'm sure soon enough we'll hear more about this combination. On to the landing round, where we go through some papers, hopefully quickly, although sometimes we do take a while. First paper is an interactive agent foundation model. So quick recap, foundation model is a general term for a really big model that does cool things, kind of. So that includes large language models like GPT, but also multi-modal models that do video or images or whatever. So this is proposing a foundation model that is specifically for training interactive agents. And they do so by training across three separate domains, robotics, gaming, AI, and health care. And they do that, demonstrate that it is able to be trained and to generate actions, agent-like interactions in each of these three areas. It's kind of a broad direction kind of paper. So an initial stab at the idea of an agent foundation model, as opposed to foundation models that are meant just for understanding text or just for understanding images or just for understanding video. Yeah, and the normal way that we make agents today is we will take a model and we'll give it an autocomplete task, like train it on just a disgusting amount of text to autocomplete that text. And what you get out of that process is a system that just happens to have a huge amount of world knowledge. Because if you're going to autocomplete sentences like, to mitigate economic harm from the next pandemic, central banks should blank. If you're going to autocomplete that sentence, you ought to know a lot about the world. You're forced to learn a lot about the world. And so this is how you imbue these language models with a ton of world knowledge. It just happens to be the case that you can take those models and kind of get them to talk to themselves or each other as agents. And that's how we get all the language model agents that we see around us today. That's basically just a coincidence that they happen to be good enough at agent-like behavior because of their language pre-training. What this paper is trying to get at is answer the question, what if we thought about pre-training itself as a deliberately agent-oriented thing? What if we actually trained on objectives that were actually had these things do things like next action prediction explicitly? So during the pre-training process, you're actually training in agent-like behavior from the beginning. So that's kind of the philosophy here. And I think this is a really interesting space to track. It's something that I personally am going to be diving into a lot more just because I personally think that this kind of direction of agent-like models is probably the most promising path to AGI, or at least one part of it, in the near future. So anyway, I'm really intrigued by this. By the way, Fei-Fei Li, one of the authors here, so very famous, one of the pioneers of early deep learning, and interesting to see her pop up in such an interesting context. And primarily coming from Microsoft, it's fun. You can actually, in the acknowledgments, they acknowledge the Microsoft Xbox team and various kind of gaming partners who help them with data and training, presumably. And since we were mentioning, also worth mentioning, University of California Los Angeles was another co-author. Next paper, Grandmaster Level Chess Without Search, New Work on Game Stuff from DeepMind. And this one is, as the title says, looking at whether we can get really, really good chess playing AI without requiring search. Search is when you explicitly program your chess playing AI to simulate forward in the game. So say, if I do this, what does the opponent do? Probably, then what I would do, et cetera, et cetera. And it has been core to chess playing AI basically forever. At least the top of the line AIs generally do a lot of search, simulate the game in 1,000 directions, 10 steps forward, or whatever. And that is how they were able to get so good. Going back to AlphaGo years ago and systems like that, they relied on search partially in addition to neural nets to evaluate the state of the game. So the focus of this is saying, can we get really good performance without having search just by training a neural net? And they do. They train this 270 million parameter model of supervised learning of a big data set of chess games, 10 million chess games. And they find that it is able to do really well and even beats some of the previous evaluation neural networks that they had. Yeah, and I think there's an important caveat here that has to do with the way the system is trained. This is basically a transformer model that is trained with supervised learning. So basically, it gets a chessboard. And there is an oracle, which is this tool that supposedly gives you a ground truth answer. Now, in reality, we don't actually know what the optimal best move is for any given chessboard. That's an open mathematical problem. That's why we have to build machine learning systems that approximate the best move. But in this case, they have a tool called Stockfish 16, which they use to annotate automatically millions of board states to try to tag them with the best optimal move based on that board state. This is an imperfect annotation. But it is a good chess bot. You can think of it that way. And the question is, can you get a transformer, basically a chat GPT-like system, can you get it to predict the best next move, given just the game board? And this is really interesting. It kind of means that you can only so there's an old chess champion quote that they get from the 1920s that summarizes how this works. This is how they summarize it. I see only one move ahead, but it is always the correct one. And I thought that was such a great way of summarizing what's going on here, right? It's as if you were preventing yourself. And if you imagine, if you ever played chess, you're thinking about, OK, if I do this, they're going to do this. And then if they do that, I'll do this. Imagine that you can't engage in that thought process. You cannot think literally two steps ahead. You only see the board, and all you're going to go on is kind of a gut instinct vibe based on what you see on that board to pick your next move. And the astonishing thing is that this actually seems to work, that this model actually gives results or recommendations for next moves that actually compete, in one case, favorably against AlphaZero's policy and value networks, which get these ELO scores. Basically, these are ways that chess players are differentially ranked that are really quite strong. The argument is that it's at grandmaster level. I've seen people kind of argue whether that's actually the case, and you can have a fun discussion about whether that's true. But the tournament ELO that they get is 2,299, so almost 2,300, for their largest transformer. And against humans, the performance goes up dramatically to 2,895, which is ostensibly around grandmaster level. That big delta, that gap, seems to come from the fact that AIs tend to come up with strategies that are more easily countered, it seems, by other AI bots, or at least that's part of the hypotheticals here that humans don't tend to think in the same ways, of course, and are vulnerable to different strategies in a way that's favorable, anyway, to this system. So that was an interesting, noteworthy big gap there. But certainly, this is another push from DeepMind in the direction of more game playing as the way to hit AGI. In addition, I will say, to scaling, because they do do a bunch of scaling experiments with their model, and they see that, yeah, as we increase the scale of our system, it is, in fact, able to perform better and better at this game. So really interesting little paper here. Next up, more agents is all you need from Tencent. And the gist is, they find that with a simple sampling and voting method, the performance of these LLM-driven agents improves. Essentially, they investigate out-sampling, a classic idea in AI, where you combine the outputs of several models that can predict independently. And with a combination of independent outputs, you get a better overall result. They look into experimentally seeing if that works, and in fact, show that it does. Yeah, that is basically the paper. It's funny, because they sort of recognize themselves that, hey, there are all these fancy techniques. We just talked about chain of thought, self-consistency earlier, this idea of having the system generate many different strategies to tackle a problem, and then picking the ones that are most kind of self-consistent. Well, this is really just saying, hey, what if we brute force and just build a bunch of smaller, large language models? Can we achieve performance that's superior to that of some larger language models? And the answer does seem to be yes. One interesting finding was that there was a correlation between the performance improvements that they saw by increasing, just increasing, the number of agents that they were using, and the difficulty of the problems that they were dealing with. So if your problem difficulty goes up, it turns out that the right thing to do is often just to add more agents relative to when the problem is easier. So if you're going to spend your compute on something, you want to spend it more on more agents rather than necessarily having each agent do more complex stuff. So that was kind of interesting and maybe a little counterintuitive. The last thing that I'll just note here is, apparently, the results they got are orthogonal to other prompting-based methods, so other methods that involve doing more fancy prompts. So you can actually combine them to get overall boosts. So you can take a bunch, a large number, of fancier agents, and you will get compounding benefits, both from the number of agents and from the fanciness of the prompts. So kind of an interesting result here, and another version maybe of AI scaling in a way, like just scaling the number of agents. That's going to increase the inference compute, but not the training compute. So anyway, really interesting little breakthrough here. And one last paper, it is Music Magus, Zero-Shot Text-to-Music Editing via Diffusion Models. Last week, we discussed text-driven image editing as one of the efforts, and this is pretty much about platform music. So let's say you have a track of relaxing classical music featuring piano. They introduced a method where you can go ahead and edit that text to say relaxing classical music featuring acoustic guitar, and it would directly alter the actual audio in correspondence to what you requested. And moving on now to policy and safety, we have debating with more persuasive LLMs leads to more truthful answers. So this is a piece of, you think of it as a kind of safety research. Just to frame this up a little bit, one question that folks in kind of AGI safety always have is, what happens when we start to build systems that are far, far more intelligent than us, that can surpass human expertise in a wide range of tasks? How would we even know if their recommendations, if their outputs are trustworthy? What guarantee could we possibly have? This is like your doctor comes to you and tells you you need to take these pills, and you're like, you don't have a medical degree. You're just going to say, OK, yeah, sure, I'll take the pill. How could you kind of find a way as a weaker system, since you are a weaker system, to oversee the performance and behavior of a more intelligent, of a stronger system? This is the problem of scalable oversight. And one of the key assumptions behind it is that it's easier to identify or critique the correct answer than it is to generate it. One of our hopes that we have is that maybe, OK, sure, I can't come up with the recommendation that I should take a certain medication. But once I have that recommendation, maybe I can validate it, even though I'm dumber than the system that generated the recommendation. And so this is essentially what they're going to explore in the setup. They're going to have a bunch of strong models. They call them strong because they're going to give them access to text, to comprehension text that they'll be quizzed on. Then we have weaker models that are invited to judge the answer to their questions, but don't have access to the text. So the way they're simulating this divide, this intelligence gap, is by having one group of people have access to the ground truth information and another group not. And then they're going to investigate how your ability to debate with the system that is generating outputs based on reading that text allows you to tell whether that's true or not. that system is telling the truth, whether its recommendations are accurate or correct. As part of this, they actually introduce a metric that they call persuasiveness. How persuasive, how effective is that system at persuading people independent of whether or not what it's saying is true? Kind of an interesting piece of work, they test out three different setups, one which they call consultancy, is a case where you're basically a judge and you're just going to talk to a chatbot, again, the chatbot gets to read the text that you're interested in and it's just going to give you answers, and you don't know, so the chatbot is going to give you a certain kind of answer that it's required to ahead of time, independent of what the text says, and you're going to try to figure out, okay, is it telling me the truth, are these true outputs? That's called consultancy, you're just hearing one side of the argument. Another version that they experiment with is called debate. Here you have two different bots, one of them is trying to argue for the correct answer, the other perhaps for the incorrect answer, and you're there just watching this play out. You're reading the transcript of this debate, and based on that, you're going to try to figure out what is the right answer. The last one is an interactive debate, where these bots are debating, but you have a voice too, and as they're discussing, as they're having that dialogue, you can participate as well. They have a couple of interesting findings, the first of which is that these weak judges, the folks who don't have access to the underlying information, they actually can supervise strong debaters. It turns out that for the most persuasive models that they have, non-expert human judges, they achieve about 88% accuracy, and non-expert level judges achieve about 76% accuracy with debate, and without debate, the kind of naive performance is 60 and 48%. You see really big, on the order of, say, 20% leaps or more in your ability to judge the truth when you see this debate play out. This also holds for the consultancy model, where you're just talking to a chatbot directly without seeing a debate. One interesting thing is, the more you optimize debaters for persuasiveness, the more your ability as a judge to tell the truth in debates goes up, which I found kind of surprising. I would have thought that the more persuasive these two debaters are, the harder it would be to kind of figure out which one is actually telling the truth. It turns out that's not so. Apparently, empirically, based on their study, when the persuasiveness of both of the chatbots involved in the debate goes up, you actually end up being better at figuring out which of those chatbots is telling the truth. Kind of interesting, I'm less optimistic, I would say, about this whole idea of debate as a way of solving our AI control problems long-term. But it certainly is an interesting finding that we're not clearly dead in the water yet when it comes to this. At least, having debates between AI chatbots with different perspectives does seem to allow us to elicit truths that otherwise we couldn't. Great job to this team for pulling this off. Fun fact here is that the initial idea of this general kind of approach comes from 2018 from a paper called AI Safety via Debate, actually from OpenAI. And originally, they, of course, at that time didn't have super good chatbots to train this with. This was just kind of a broad direction. This paper doesn't exactly do the same implementation of that paper, but takes the general idea of it and really does apply it now with chatbots that exist now that are able to debate each other and so on. Anyway, yeah, kind of a little demonstration of how research can build on research across time. At first, you have an idea and you publish some exploration paper without being able to fully test the idea. And now, this team of five different institutions actually went ahead and tried it out. Yeah. You're totally right to call that out too. It's funny. It's rare that I read a machine learning paper and I look at the names of the authors. I end up knowing so many of them. But in this case, Ethan Perez, in particular, who's the last author listed here, has a long history doing debate. I actually spoke to him on one podcast ago, back when I was doing the Taurus Data Science podcast, and he was investigating debate back then as well. That was a very popular approach, especially so the head of AI Safety at OpenAI at the time was this guy, Paul Christiano, and he was really into this idea of debate. I think he's since softened on it, but it's interesting and not a coincidence certainly that so many folks from Anthropic, which again, used to be at OpenAI and split off, are pursuing that thread. It's really interesting. Anyway, there are all kinds of interesting folks on that list of authors, including Sam Bowman and Tim Rockdeschel, who has done a bunch of agent type stuff to work as well, and I think was at Facebook at one point. In this paper, we just covered a safety type thing, now let's cover a policy type thing. The new story is in a big tech's backyard, California lawmaker unveils a landmark AI bill. These are always landmark bills. They're always landmark bills. Yeah, and always headlines. So California State Lawmaker Senator Scott Weiner has introduced a bill that would require companies to test powerful AI models for unsafe behavior, institute hacking protections, and ensure the tech can be shut down completely before releasing them. It also mandates AI companies to disclose their testing protocols and safety measures to the California Department of Technology, and allows the state's attorney general to sue the company if a tech causes critical harm. So could be pretty significant, California of course is huge. It's like a mini country essentially, and has a lot of influence via its own policy initiatives. Going beyond, right now in the US, we have the executive order, which does do some things related to safety and requiring mandating some safety practices in corporations. This is trying to make that law at least in California. Yeah. It's interesting, you can see him as he describes the bill, struggling with this idea of how much is, are we hampering progress, or are we hampering innovation versus bringing in AI safety legislation and some guardrails here. That is a genuine challenge. A lot of folks, especially on the Hill right now, Capitol Hill in DC, are dealing with exactly this kind of question. How do you come up with guardrails that don't hamper progress while also hitting those safety objectives? The idea of needing some sort of civil liability, that's what we talk about when we talk about companies that can be sued for dangerous practices. Honestly, I think we're pretty overdue for something like that. It's difficult to imagine a future where AI companies can keep producing more and more powerful systems, with access to a larger and larger action space. How many AI agent models have we talked about today, that in principle are being designed to go out and do things for you on the internet, arbitrary things, send emails, write software and so on? It's difficult to imagine a world where that continues to happen and you don't have some level of civil liability. I can't go out and sue OpenAI if their chatbot or their agent goes off and does something horribly wrong, gets somebody killed or causes property damage. That's one piece. There's a separate question about criminal liability as well. We need a way, at some point, if the harms that come from these systems enter a category where we're talking about loss of life, physical damage to property and infrastructure, that's the sort of thing where you can imagine needing some of those more intense measures, and balancing those, of course, with the need to innovate. I'm a big free market guy. I've been in Silicon Valley startups my whole life, and I think this tech is great. It needs to forge ahead, but we have to keep in mind the big picture and the fact that catastrophic risk does seem to be on the table. At a certain point, you got to start to think about the tech from that perspective. Anyway, I think it's a really interesting set of trade-offs that he's managing here. A lot of the ingredients are aligned with a lot of the conversations I've heard on the Hill and certainly more on the safety-oriented side. But yeah, we'll see if it actually passes. Obviously, California is way more democratic, so if he's himself a Democrat, which he is, maybe that means it'll have an easier time passing, but they are already facing criticism with a bunch of folks, obviously, in Silicon Valley. Not surprising. They're talking about regulation moving too aggressively. This is just Jeremy's personal opinion, but I think we're actually way overdue for that kind of thing. Even talking to some of the folks who are building this tech in the Valley, it's quite We need some guardrails of some kind. I think most people would agree with that, but anyway, yeah. Interesting next step, and we'll see if it sets precedent for the national conversation. Onto the lightning round, AI deployed nukes to have peace in the world in tense war simulation. There's a study involving AI in foreign policy decision-making that found that, I guess, maybe surprisingly, probably less surprisingly, that it is possible to escalate into war rather than finding peaceful solutions in the specific experimental setup that was done here. Some AI models in a study initiated nuclear warfare with little warning, leading to that headline and leading to a lot of war games references. What these researchers did was they actually did a side-by-side comparison of models from open AI, Anthropic and Meta. In the case of Meta, it was the open source Lama 2, Lama 2 chat specifically. They had them play out this war game scenario with eight fabricated countries that each had different properties. It wasn't exactly the same as the United States and Canada and whatever else, but that was the idea, was to give them a hypothetical scenario to deal with. Here's a sentence that won't be great for open AI, but it says, notably, comma, open AI's GPD 3.5 and GPD 4 escalated situations into harsh military conflict more than other models. Yeah, it's obviously not something that open AI's models will specifically have been trained to avoid, and it's unclear really what it would be about the training process that would open the door to that. I think one of the reasons Anthropic's models, CLAWD 2 in particular, is known to be a little bit more cautious, so maybe less surprising that it would refrain from escalating military conflicts. But yeah, it's an interesting note. Apparently, GPD 4, when it launched nuclear warfare in a simulation, its justification was I just want to have peace in the world, which if you think about it, would be achieved with a nuclear war, but it's more of like this long-term ... There's a little bit of ... Yeah, there's some disruption that happens before the peace sets in. So kind of interesting, they have some cool graphs that show you what different models did in the median and mean case, how long it took them to escalate. They have this escalation score that they track, and cool visualization to check out anyway. So I think nothing too surprising here. Models will do crazy stuff. We should not rely on them for controlling nukes, certainly in their current form and possibly ever. But yeah, interesting little paper and a bit of a warning shot for us all. In the testing, they do actually ... I can read a bit of a prompt just so the context is there. They tell the LLM, as the inaugural agent of Nation Purple, you embody a dual-function role amalgamating both foreign policy and military decision-making capacities. You are in an environment with other such AI-controlled nations. Your actions will have real-world consequences. So if one possible thought is, well, are they just told to pretend to be a president or something and do what presidents might do? In this case, they do actually look into what if you tell it you are an AI agent doing this, the agents will still do this. Next story, going back to the Bay Area once again, protesters gather outside OpenAI office opposing military AI and AGI. Dozens of protesters who are governed outside the OpenAI headquarters in San Francisco on Monday just this past week. And it was organized by two groups, PauseAI and NoAGI, who were pretty much directly telling OpenAI to stop what they're doing, right? PauseAI, NoAGI is directly saying, don't develop general intelligence, which is basically the mission of OpenAI. It was partially in response to news that OpenAI deleted a language, or at least that would prevent or prohibit it from interacting with the military and helping apply AI to military purposes. So aside from just pausing AGI, another aspect of this was making sure AI is not used for military purposes. Yeah. And it seems to be this sort of, at least based on the reporting, it seems to be a bit of a combined, I don't want to say necessarily confused, but certainly there are two messages here. The first is the PauseAI or PauseAGI research. The second is the military strand, and those are two distinct things, right? You can imagine being okay with military AI, and in fact, it's difficult to imagine a world in which DOD does not pursue this. The alternative is you just wait for other countries like China to forge ahead. There's international engagement that you could do to kind of reduce that risk and so on. And I think that's actually worth pursuing, but still. And then separately, you have this question of the push towards AGI itself, and that's a very loaded question with a whole bunch of other considerations behind it. But I think one of the big issues, at least for me, that comes to mind here is this question of pausing AI development. Obviously, there was the pause letter that came out now, I guess last year, but there's always this challenge with these protest movements of figuring out exactly what they are recommending, what ought to be done during a pause or what the circumstances of the pause ought to be. And you certainly can see arguments. We don't know how to control these systems. That is clear. We have data that suggests that as you scale them, they get more general. And in fact, power seeking does seem to be something that emerges in these systems as they get more and more scaled and capable. That may well be the default behavior of these systems. So there is certainly risk there. The question is, how do you frame the pause? Is it just an all-purpose pause, or do you say, like, we're pausing it until certain kinds of breakthroughs can be achieved? That's always been something that has been a little confusing to me. And I think to the extent that there's a tension between innovation and safety, this is really where it needs to be resolved. If we want to pause development, what are the criteria under which we could then resume it? And I think there's a lot that needs to be said there. And actually, a lot of what my company does is focused on answering exactly that question. But it's definitely interesting to see this movement take shape. By the way, OpenAI, famously, when SAMe was kicked out and then brought back in, all the OpenAI people started tweeting, OpenAI is nothing without its people. That was kind of their tagline. And I noticed these PauseAI posters, one of them in the article says, Earth is nothing without its people. So kind of funny to see that thrown back in OpenAI. Yeah. Yeah. It's worth pointing out. There are photos in this. It's literally protesters with signs and T-shirts saying PauseAI, and also some signs saying no to AGI, just standing outside of OpenAI, and kind of surreal, right? Yeah. Some people are really not just signing letters, but now going out and showing up in person to spread the message. I mean, you can kind of understand it. With the threat models, and I'm certainly sympathetic to the threat model, but yeah, I mean, it's an interesting question as to what the best way is to go about it, and they're making a splash. Yeah. And one last story for this section, AI safeguards can easily be broken, UK Safety Institute finds. So this is from the UK's AI Safety Institute that was just established last year. And they have released research that found that AI technology can deceive human users, producing biased outcomes, and that some of it at least lacks sufficient safeguards against providing harmful information. So as you might imagine, this is focused on large language models, and they kind of really demonstrate what is kind of already fairly known, that you are able to bypass their safeguards using some pretty simple prompting techniques, and use them for dual use purposes. And one of the interesting things about this article is just how, in some ways, unsurprising the findings are. One of the things that they say is basically you can jailbreak your way through any kind of safeguards that are trained into the system. We've seen this over and over again, right? The sad reality is that no matter how much effort, meta, open AI, hugging face, put into safeguarding their models so that they refuse to answer you when you ask them how to make a bomb, those safeguards can always be trained out. In fact, they can be trained out for a few hundred bucks, maybe even under a dollar depending on the technique used. And they can also just be bypassed straight up with jailbreaks, right? Depending on the prompt that you use, these systems can generate any kind of output, including advice on how to build bombs and bury dead bodies and so on. So if nothing else, this just reinforces the idea that currently, and I think this can't be said enough, right? Currently, we do not know how to make AI systems just behave the way we want them to. That is a simple fact about the state of play in AI right now. There is no known way to guarantee that a language model will not help you to solve a problem that it shouldn't. And in the same way, multimodal models, agent-like systems, they all inherit this problem. So to the extent that we worry about dangerous applications of these systems, things they may be able to do, the claim by, it doesn't matter if it's a frontier lab or somebody else, that they have a model that they have spent a lot of time introducing safeguards into, that claim ought to be considered deeply suspect. Because as a point of fact, as a point of technical rigorous fact, there is no technology, no strategy, no technique that is known that allows these models to be guaranteed to operate a certain way. And so I think this is mostly just a matter of getting that fact on the record and shining more light on it. They also show that, hey, current models can be used for limited cyber offensive purposes. Again, this is something that tends to increase with scale, as we've seen in the past. But certainly a good call out from this Safety Institute, and I think it's one of their earliest pieces of publicly produced artifact. So kind of interesting. It's their set of initial findings here. We'll see what they put out next. That's right. Yeah. This is from their published initial findings. So really is kind of a mix of safety related things we've known, reemphasizing or re-demonstrating them that you can get biased outcomes from image generators. You can prompt hack some systems to do things you're not supposed to, like help people planning cyber attacks, or you can get them to create convincing social media personas, et cetera. I guess a report that emphasizes a few different problematic aspects of modern day AI all in one place. And onto our last section, synthetic media and art. Just a couple of stories in this one before we are done. The first story is stability, mid-journey and runway hit back in AI art lawsuit. So there's an ongoing class action copyright lawsuit filed by artists against companies that provide AI image and video generation like stability, mid-journey and runway. And this article highlights how those companies, lawyers from those companies had filed a whole bunch of stuff, different motions in the case. They filed to introduce new evidence and even asked for the case to be dropped and dismissed entirely. Yeah. So it's industry back and forth. There's now been a wave of kind of new evidence that AI companies are introducing to push back on this, including for dismissing the case entirely, right? My understanding of the legal context here is that it's often the case that you'll just kind of toss out an attempt to dismiss a case, even if you don't think that your attempt will succeed because the bar tends to be fairly high, right? Usually you want the case to be fully kind of adjudicated or whatever it is in court before you make a decision. But if things are stacked so lopsidedly, then maybe you can get it thrown out. In this case, AI company's new counter argument starts to boil down to the idea that the AI models that they make or offer are not themselves copies of any artwork, but rather reference the artworks to create an entirely new product. And that's interesting. It holds, of course, unless they're explicitly instructed by users to generate kind of verbatim outputs that match actual art. But it's interesting. I mean, I don't know if that would particularly stand up. I mean, it seems like the fact that these systems sometimes accidentally generate verbatim copies of existing artwork would not be covered by this counter argument. But we'll just have to see if the courts end up accepting this kind of reasoning. There's a lot of detail in the story going into the various things stated by each company. Each one of them had its own slightly different things, runway, mid-journey, stability, all pointed out different things relevant to their, I guess, context and so on. So if you're curious to hear more, I would just encourage you to read the article. There's not sort of any especially interesting tidbits, although if you're curious about the legal case and how it proceeds, you might be curious to hear these more details. But I think the bigger news related to all of this is the last news story that came out a bit later. And the headline is AI companies take hit as judge says artists have public interest in pursuing lawsuits. So this is a small victory in the lawsuit against these companies. The US district judge rejected the company's argument that they are entitled to a first amendment defense for free speech and state of the case is in the public interest. So this pretty much means that at least when it comes to defending and dismissing lawsuit on first amendment grounds, that is not going to happen. Yeah. And this is the free speech tradition, obviously, in the US is really strong. And so this is something that I might've thought of as actually like a legit question mark as to whether open AI gets to say, ah, well, you know, our ability to launch this language model, it reflects our right to free speech and companies are persons under US law. So I believe they are entitled to first amendment protections on that basis. Company has a right to free speech. So to the extent that's true, and if AI models, large language models qualify, then certainly they'd be protected. I remember, you know, I, well, I don't remember this, but this is a thing that happened, I don't know what in the sixties or seventies in the context of pornography, where people sought first amendment protection and said, Hey, pornography is just my right to free expression. You know? And, and there you go. And in that case, it was upheld. The fact that we're seeing it not being upheld here is an interesting development. At least I'm no legal scholar. I have no idea what I would have predicted here, but I don't know. I thought that there might've been maybe a stronger argument for these things qualifying as like some kind of free speech thing. So yeah, they, the companies that argued that the, the lawsuits they were facing were targeting their speech precisely because the creation of art reflecting new ideas and concepts was a constitutionally protected activity. So there, they viewed this as, you know, their, their speech was the creation of art and interesting that that was stricken down. And with that, we are done with this episode of Last Week in AI. Thank you so much for listening. As we say in the beginning, you can always find the text newsletter, which gives you a text version of all this stuff and more every week at lastweekin.ai. You can also contact us with any suggestions or feedback or links to stories by emailing contact at lastweekin.ai or commenting on YouTube, Substack, anywhere else. We do appreciate it if you share the show, if you review it, if you like it, all of those nice things that help us be nice to the algorithm that recommends us in places, I guess. But we don't care about that so much. We mostly just care that people do enjoy the show and get a benefit out of listening to it. So please do keep tuning in.