Last week in AI - OpenAI's Sora, Gemini 1.5, BioMistral, V-JEPA, AI Task Force, Fun!
Summary (Generated with Bash)
OpenAI's Text-to-Video Model: Sora
Generative AI Startups
The Future of AI in Hollywood
Need for AI Policy and Regulation
AI and Employment
Technological Breakthroughs
Focus on AI Safety and Ethics
The discussions from the past week suggest a balanced view of AI’s potential, mixing excitement for innovation with caution regarding ethical and societal concerns. As we push the boundaries of what AI can achieve, responsible stewardship and forward-thinking policies are paramount to ensuring these technologies serve the greater good.
Read the full discussion in the transcript below 👇
#156 - OpenAI's Sora, Gemini 1.5, BioMistral, V-JEPA, AI Task Force, Fun!
Hello, and welcome to Last Week in AI, where you can hear us chat about what's going on with AI. As usual, in this episode, we will summarize and discuss some of last week's most interesting AI news. You can also check out our Last Week in AI newsletter at lastweekin.ai for articles we did not cover in this episode. I am one of your hosts, Andrey Kurenkov. I finished my PhD focused on AI last year, and I now work at a generative AI startup. And I'm your other host, Jeremy Harris. I'm the co-founder of Gladstone AI, which is an AI national security company. And yeah, I mean, I guess consistent with the weird thing I've been on the last two episodes where I've randomly been saying like, hey guys, we're hiring. Anybody who wants to reach out, just reach out to us. And I'm making all these weird sounds. Well, here's a random announcement. I got a buddy called Ben who's looking to make a career move. He's got a ton of experience leading AI teams. I don't want to mention the company, but it's like a Fortune 100 that he's been working at and doing a bunch of really cool stuff on AI, specifically PhD in physics, blah, blah, blah. Spent a bunch of years in DC doing stuff in the federal government on AI policy. That's kind of how I ran into him. So if you want to, you know, I don't know, reach out to him. If you want to connect, hello@gladstone.ai, I'd be happy to hook you up. And that is my weird pitch. It feels like a kind of flea market at the beginning of every episode now. I'm sorry about that. Yeah. Now I wonder why I didn't do this last year when I was graduating, just like, hey, I'm graduating. Who wants to give me a job? Maybe I should have. Life's full of regrets. Quick programming announcement for any regular listeners or even new listeners. We're going to try something new this episode, just a small little tweak where we will add a new section of news that isn't really news per se, but just sort of like a place to put fun things we came across that don't necessarily fit in any of our sections we have. So we're going to call it a fun section, even though some of it might be very nerdy and not necessarily fun in a traditional sense. And yeah, so we're going to cap things out with some less serious stories, hopefully, and have some fun and then close it out. We know, of course, nerdy is going to be a huge turnoff for our audience because we talk about AI papers all day. So sorry about that. Well, let's get going with the news, starting with our first section, tools and apps. And our first story is, of course, Sora. So this is not quite a tool yet, but I think close enough. We're going to get going there. We are starting with OpenAI's Sora, their new text to video model that came out right after we recorded our last episode. Helpful. And yeah, it was definitely the biggest news story of the past week, I would say. And the gist is it's a really, really good text to video model. So as soon as they announced it in the blog post and on Twitter, there were a bunch of example videos that they put out, some of them quite long, 20 seconds. And it is just beyond anything we've seen with text to video AI. Super kind of high resolution, clear, still has some artifacts, but now you do have to look pretty closely to see them, like weird kind of like trippy objects changing into other objects, things like that. But in general, very, very impressive. 
And of course, now there were a ton of discussion, a lot of like, oh no, AI is going to take over everything, Hollywood is doomed, all that kind of response. Probably a little overkill, I would say, but definitely pretty dramatic how much of a leap this is. Yeah. It's actually, it's funny, you said Sora is kind of the big story of the week. Obviously we've got Gemini 1.5 as well, which we'll talk about in a minute. I think the world is divided into two camps, the people who think Sora is the big story and the people who think Gemini 1.5 is a big story. And it kind of feels like that, what was it when Twilight came out and there was the Robert Pattinson people and I'm like showing my age, this is really, I'm also, I didn't see Twilight guys, I swear to God. Anyway, so this is actually, I think a really impressive breakthrough. If you are hoping to find out about how specifically the model is set up, by the way, be prepared to be disappointed because the technical report that OpenAI published here says model and implementation details are not included in this report, which is a very, very slight undersell. They do talk a little bit about it. Essentially, this is a, so first of all, it's a transformer. This is an interesting thing in and of itself, right? Transformers usually we use for text generation, GPT, that sort of thing. What they're doing here is they're basically taking videos, obviously video is a sequence of images and for each image, they kind of embed it in a latent space. They use an encoder to essentially extract from that image its meaning. Basically, they create a vector, a list of numbers that encode the meaning that was captured in that image. That's something that's done very often in kind of computer vision applications, that sort of thing. So now, essentially for every image in the video, they kind of have a bunch of numbers that describe the meaning in that image, but more than that, they do this for patches of the image. So for every patch of an image, they have a description, a sort of list of numbers that capture the meaning of that patch, and then they have that for each image over time. And so now you can start to think about not just having a patch of an image, but a patch of an image that you track over time, over several frames of video. This now, you might be tempted to call a space-time patch, ooh, space-time patch. Yes, we're talking about space-time. So this is a space-time patch. That's what OpenAI is calling them, and these are like the atomic units of meaning in the context of this video generation tool that they've built. They're just like the tokens that GPT models get trained on during text prediction. Usually that's like the syllables that make up words. So that's chunks of meaning. Well, these are chunks of meaning in video, and one of the big breakthroughs that seems to have happened here is OpenAI has figured out the right way to chunk up videos such that language models can actually learn from them. So Sora is indeed a transformer and reflects OpenAI's ongoing belief that transformers are pretty robust, that they can learn world models, in other words, that they can create internal representations that capture physical facts of the matter about the universe. Even ultimately, some people think laws of physics or things that are that deep. 
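To make the spacetime-patch idea above a little more concrete, here is a minimal sketch of what chunking a video into patches and embedding them could look like. This is not OpenAI's implementation (the Sora technical report deliberately omits those details); the patch sizes, the embedding dimension, and the random-projection stand-in for a learned visual encoder are all illustrative assumptions.

```python
# A minimal, illustrative sketch of the "spacetime patch" idea discussed above.
# NOT OpenAI's implementation; patch sizes, dimensions, and the random-projection
# "encoder" are placeholders for whatever encoder Sora actually uses.
import numpy as np

def video_to_spacetime_patches(video, patch_t=4, patch_h=16, patch_w=16):
    """Chunk a video of shape (T, H, W, C) into flattened spacetime patches.

    Each patch spans patch_t frames and a patch_h x patch_w region, so one
    patch is a little tube of pixels through space AND time.
    """
    T, H, W, C = video.shape
    # Trim so the video divides evenly into patches (illustrative only).
    video = video[: T - T % patch_t, : H - H % patch_h, : W - W % patch_w]
    T, H, W, _ = video.shape
    patches = (
        video.reshape(T // patch_t, patch_t,
                      H // patch_h, patch_h,
                      W // patch_w, patch_w, C)
             .transpose(0, 2, 4, 1, 3, 5, 6)  # group the patch indices first
             .reshape(-1, patch_t * patch_h * patch_w * C)
    )
    return patches  # (num_patches, patch_dim): the "tokens" a transformer would see

# Stand-in for a learned encoder that maps each patch into a latent embedding.
rng = np.random.default_rng(0)
video = rng.random((16, 128, 128, 3), dtype=np.float32)      # 16 frames of 128x128 RGB
patches = video_to_spacetime_patches(video)                   # (256, 3072)
projection = rng.normal(size=(patches.shape[1], 512)).astype(np.float32)
tokens = patches @ projection                                  # (256, 512) latent tokens
print(tokens.shape)  # these latent spacetime tokens are what a diffusion transformer models
```

The point of the sketch is just the data layout: once video is reduced to a sequence of latent spacetime tokens, the same transformer machinery used for text tokens can be applied to them.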
Certainly people have argued that that's happening with GPT-style models, but here we have the first time that's represented in video, which is a great way to test whether there is an actual physical world model here, because you can see glasses shatter or not shatter in these videos. You can see balls fall and not fall, and you can assess, does this thing seem to have an intuitive grasp of physics? And one of the really interesting things that they seem to find is that that is in fact the case, at least to an extent, and that the grasp of this model, the physical kind of world model that this thing has, gets better as it scales, again, consistent with OpenAI's thesis about scaling. One of the really interesting things here was apparently the model emergently develops the ability to portray 3D scenes without having any special architecture that biases it in that direction, without what are called inductive biases that push it that way just with scale. You just scale it up, train it with more compute during training time, and it emergently develops this ability. I just thought there was so much here, such an interesting breakthrough, and also notable because it's not a language model, right? This is not, or at least it's not just a language model. It's trained in tandem with video that's chunked up as we described, and text input as well that they correlate with the video. To add just a bit more technical detail beyond that, it is also not just a transformer. It's a diffusion model. In a way, it reminds me of what we covered just a few weeks ago from Google. That was the paper Lumiere, a spacetime diffusion model for video generation, where the big deal was that they didn't separately generate frames in the video and stitch them together. They had an end-to-end model, so to speak, a diffusion model that generated the whole video end-to-end. My impression, although it's hard to tell from the pretty skimpy technical report, is that this is similar to that in nature. This is a diffusion transformer in a similar way. I think, again, my feeling is that the real difference is, as usual, the OpenAI scale, right? My impression is they just threw compute at this thing. They trained with full resolution videos. They go into some details here of how in training, often in the past, people kind of cropped videos or they down-res videos and then try to up-res them again afterward. They say that they just trained on full-res HD videos, and they can now generate them. As a diffusion model, this thing can do a lot of stuff. It doesn't just do text-to-video. You can give it an image, and it can animate it. We have some examples of images from DALL-E, and having them be animated. It can extend generated videos, so it can continue something that you give it as input. It can do video-to-video editing, lots of these sorts of things. You just got to go and see it for yourself. Sadly, this is not a video podcast, at least not yet. Maybe one day I'll find the time to make it so, but just go check out the link in the episode description, as always, or just Google OpenAI Sora, and you'll see that these videos are pretty impressive. It is still not the case that anyone can use this. It was announced, and they say that currently Sora is only available to red teamers assessing the model for potential harms and risks, and some artists, designers, and filmmakers for feedback. There's a bunch of stuff on safety in the blog post saying that they'll be working on the watermarking and detection and whatnot. 
Might be months until this is an actual tool that is out there, but certainly it's going to happen sooner rather than later. Yeah. I think just as a last quick note, it's worth putting this in the context of OpenAI's AGI mission, because that, of course, is why OpenAI does everything they do: to try to build AGI. The question is, how does Sora fit into this? One piece we touched on, this idea of building a physical world model, there was a time literally 20 minutes ago when people were saying, well, there are certain things that scale simply will not do. There are certain, for example, conservation laws in physics that cannot be learned by these systems consistently. This is where people made the argument, you need explicit symbols, symbolic AI. You need neurosymbolic approaches at a minimum to capture these things. One of the really interesting things that we see in this result is that Sora actually displays what's known as object permanence. There are many cases, for example, where a painter in a Sora video will paint something. Then the thing that they paint, the streak of paint will remain there throughout the video. You have that coherence happening over long stretches of time. That's the notion of object permanence. Things remain there after the cause that brought them into being, or at least the objects. I mean, classically, object permanence is the thing that babies lack, right? You hide your face behind a thing, and then they go, oh my God, the face disappeared. The ability to track objects over long time horizons seems to emergently have appeared here. The second piece is, I think one of the big things that distinguishes OpenAI's approach is they are really, really good at figuring out what are the atomic components of a dataset that will allow a model to have highly extensible behavior, right? We saw that with their giant bet on language modeling. We've seen that as they've built image generation tools. Figuring out, in this case, okay, it's this spacetime sort of chunk, this blob of spacetime. That is the atomic component. If we get an AI system to use that, to chew on that kind of data, to look at the problem that way, all of a sudden, we unlock all this generality and all this behavior. I think this is starting to emerge as a consistent theme with OpenAI's biggest breakthroughs, where we're seeing the kind of chunking up of data, looking at it from the right perspective in the right frame, and then, only then, applying massive scale to it. I think a really interesting breakthrough, and there's tons of detail in the technical report here, but maybe not as much as nerds like us would like. Not nearly as much, to be honest, but still some fun details. Yeah. As you said, there was some conversation online of this being essentially a world simulator, and we'll get back to this a little later, actually, with some announcements from Meta. You could argue that this is learning physics and learning kind of common sense about what's happening with things. That is a whole kind of philosophical thing, and probably we should just go ahead and move on without getting into it. Moving on, the next story is, as was foreshadowed, Gemini 1.5. This is coming from Google. This is Gemini 1.5 Pro, and it was a pretty big deal, as Jeremy said, because it is really good. Supposedly, or at least according to the announcement, it is as good as Gemini Ultra, and is seemingly going to be more efficient. 
It's trained using a sparse mixture of experts, from what we know, and somehow is able to deal with an absolutely gigantic context window. It can take a ton of input, I forget what it was, like 1 million tokens, something ridiculous. The announcement did seem to a lot of people like a huge deal of having a faster to run model that is as good as Gemini Ultra, and therefore kind of on par with GPT-4, that is being rolled out for now to developers and enterprise users, and will presumably soon be coming to consumer users, which will bring even more pressure on OpenAI if this is priced the same as Gemini Pro, which is not their GPT-4 tier model. This is their, like, free tier or less expensive model that would be, I guess, a real source of competition. Yeah. It's interesting too that the model is, they described it as a mid-size multimodal model. I don't think they actually tell us the number of parameters, but if we use the Andrey Kurenkov scale here, maybe, I don't know, mid-size, I don't know, what would that be, 30 billion parameters, something like that? Anyway. Yeah. Yeah. This seems reasonable. You heard it here first. It's probably 30 billion parameters. You know what? It's definitely 30 billion parameters. We're reporting that right now. Yeah. It's a mid-size model. It's got a lot of, as you said, interesting characteristics, one of which is, of course, the widely advertised context window. You're right that they report on the 1 million token context window. That seems to be the thing they're anchoring on, though they do advertise, they've successfully tested it up to 10 million tokens. This is insane. To give you an idea, this is like, so they tested it out on one hour of video, 11 hours of audio, code bases with 30,000 lines of code. I mean, you can fit an entire code base in this thing, 700,000 words. Which is most of that 1 million token context window. They give examples like they fed it the 402-page transcript from Apollo 11's mission to the moon. And it could reason about conversations, it could recall events and details. And this is really important. One of the things that we find when we scale the big context windows, we talked about this in the context of Claude and Claude 2, when these really big context window models started coming out. These models tend to forget things that are in the prompt. If you say something early on in your prompt, and then 100,000 tokens later, you say something else and you want to take advantage of a linkage between those two ideas, the thing will not be able to do it. One of the diagnostics that people use to test for this is known as a needle-in-a-haystack test. You'll bury a little detail somewhere in your gigantic prompt, and you'll see if the model can recall it after. Turns out this model blows everything, and I mean everything that has ever come up before out of the water. It blows GPT-4 out of the water. It blows Gemini Ultra out of the water. It is incredibly good. We're talking about over 99.7% recall for this needle in a haystack problem for up to 1 million token context windows across all modalities, text, video, and audio. This is insane. Apparently, it even maintains that recall performance when it's extended to the 10 million token mode. At this point, when I look at that, we're seeing something. Something has shifted here. There is something fundamental and algorithmic that's going on in the back, and this is not just scaling. 
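For readers who haven't seen one, a needle-in-a-haystack evaluation is simple enough to sketch in a few lines. This is a generic illustration, not Google's or Anthropic's actual harness; the filler text, the needle, and the `call_model` placeholder are all assumptions you would swap for your own long-context model and data.

```python
# A rough sketch of the "needle in a haystack" recall test described above.
# `call_model` is a placeholder for whichever long-context model you are testing;
# the filler sentence, needle, and question are purely illustrative.
FILLER = "The quick brown fox jumps over the lazy dog. "   # stand-in haystack text
NEEDLE = "The secret passphrase is 'banana pancakes'."
QUESTION = "What is the secret passphrase mentioned in the document?"

def build_haystack(total_sentences: int, needle_depth: float) -> str:
    """Bury the needle at a given relative depth (0.0 = start, 1.0 = end)."""
    sentences = [FILLER] * total_sentences
    sentences.insert(int(needle_depth * total_sentences), NEEDLE + " ")
    return "".join(sentences)

def call_model(prompt: str) -> str:
    raise NotImplementedError("plug in your model or API client here")

def run_eval(context_sizes=(1_000, 10_000, 100_000), depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    results = {}
    for n in context_sizes:
        for depth in depths:
            prompt = build_haystack(n, depth) + "\n\n" + QUESTION
            answer = call_model(prompt)
            # Score recall: did the model retrieve the buried detail?
            results[(n, depth)] = "banana pancakes" in answer.lower()
    return results
```

Sweeping both the context size and the depth at which the needle is buried is what produces the recall heatmaps that these announcements typically show.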
At least, I almost hope it's not scaling, because that would mean, holy shit, scaling has just emergently done something absolutely insane. I don't think that's what's happened here. Total conjecture, but one of the few mechanisms that I can think of that allows you to achieve something like this is to have some ... I mean, it's not a state space model, but to have some kind of state spacey thing here where you can retain a memory as you're going through, as you're chunking through these giant prompts. No idea what that might look like. Again, total conjecture, and it's probably not the case, but this is just so, so weird and out of band with what we've seen previously with transformers. It's allowed the system to have insane learning speeds as well, because it really can soak up all the content in that context window. Something fundamental has happened with this model's ability to understand and absorb context. It was able to pick up this obscure language with fewer than 200 speakers worldwide. It was given a grammar manual for it. The language is called Kalamang, and apparently, it learned to speak it or to write it as quickly as a human would. In other words, using the same amount of data as a human would. This again, it's another one of those goalposts that we used to have that said, hey, we're on our way to AGI, or we're not on our way to AGI, because we can't do this sort of thing. We can't have systems that learn as fast as humans. Well, here we have that. I think it's a very, very interesting breakthrough. Not a lot of detail, specifically about how the recall is achieved. I think to me, that's the fascinating thing. The expected performance boost that comes from all kinds of optimizations and jiggery pokery is absolutely there. As you said, Andrey, it compares favorably to Gemini 1.0 and other models, but we don't know. We don't know how this thing was built or how it was aligned. There's not a lot of detail I was able to find in the paper about the RLHF process. Is it DPO? What's going on there? But all we know is, holy crap, this is a really powerful model. Yeah, it sure looks like it. It's a shame we don't know how it can achieve such a long effective context window. It's been kind of an emerging topic in recent months, so we are really starting in academia and published papers to see more work on this problem. For instance, back like two and a half months ago, early December, Anthropic did publish a paper, Long Context Prompting for Claude 2.1, where they showed how to get an effective 200,000 token window where they had this little hack of just, like, adding a little sentence to the prompt, "Here is the most relevant sentence", and that made it effective. I would not be surprised if this is achieved by tweaking the decoding process and the prompting process primarily, although, as you said, there could be also algorithmic modifications to a model itself or all sorts of things going on here. One million tokens is huge. It's kind of hard to convert into intuitive stuff, but that's about 750,000 words-ish. Some really big books, yeah, or like all of Harry Potter, somewhere in that kind of benchmark. Yeah, pretty impressive announcement and pretty impressive to see this coming so soon after the initial announcement of Gemini, which was just a couple months ago, we saw the initial rollout of Gemini Pro, its release in Bard, and Gemini Ultra even more recently came out. 
Now we have Gemini 1.5, which is- It's a week later, Andrey, like it's about time that we get another generation of language model. I know. Why? It's so weird. Now, Gemini Ultra isn't even Ultra anymore because Gemini 1.5 Pro is as good as Gemini Ultra. Very weird, but pretty exciting. Yeah, one last comment that did come to mind, and this is especially on the scaling side. Forgive me, I'm obsessed with scaling, but there's this interesting figure where what they do is they get Gemini 1.0 Pro, sorry, 1.5 Pro, I should say, to basically run predictions on code. So they give it a massive code base and they look at how essentially, so they feed this part of the code base, let's say, to the model, and they try to get it to predict the next token, and they see how surprised it is, if you will, at the next token. So if it's really surprised, it's a bad model, it hasn't been able to build on the previous context in order to inform its prediction. And what they find is, unsurprisingly, as you'd expect, as you feed the thing a longer and longer prompt, more and more of this code base, it gets progressively less and less surprised at the next token. Its predictions get better and better. And what you generally expect to see here is a power law fit. So as you increase the number of tokens that you feed to this thing, it'll kind of sort of exponentially drop the number of errors that the thing's making. Predictions will become exponentially better. Now, what ends up happening in practice, though, is that Gemini 1.5 Pro improves even faster than that. So it's actually outpacing. Its ability to predict the next token based on the context is actually accelerating faster than it should, according to at least all prior convention. That's another thing that leads me to think there's something fundamentally algorithmic going on here, unless, hey, maybe scaling is doing this too, in which case, holy shit. This implies that there's something qualitative going on here that's allowing the model to chew on its context to make predictions that are qualitatively better than what we had expected before and what any other models have done. So that's a really interesting and weird thing. They don't really do more than just speculate about why that might be. They seem kind of confused about it. So yeah, just wanted to call that out. That's something, by the way, that was also mentioned in this great video by AI Explained, which I recommend checking out too, about this whole paper. When you're looking at scaling as a way of getting to AGI, when you're thinking about transformers, these are the sorts of indications you might be looking for that something weird is afoot. I'm sure they just found the best hyperparameters. That's the secret. Yeah, that's right. Just the best parameters for training. All right. On to the lightning round. Starting with Groq AI model goes viral and rivals ChatGPT and other chatbots. So this is Groq, not to be confused with Grok, the chatbot from Elon Musk's xAI. Groq is coming from Groq Inc., which is a company that's been around for a while, since 2016. And they are focused primarily on hardware. So they claim to have created the first language processing unit to run this model. And they posted a demo of it on Twitter that, per the headline, kind of went viral because of how fast it is. This is an ASIC chip, so an application-specific integrated circuit, not a general-purpose chip like GPUs are. 
And it allows it to generate about 500 tokens per second compared to, for instance, GPT-3.5's 40 tokens per second. 500 tokens per second is roughly 400 words per second. So it's really blazingly fast. And the reactions have been pretty dramatic because of that, because it kind of changes really the experience of talking to a chatbot if it is that quick. Yeah. I want to do a combination of hype this up and throw some cold water on it, because this story really captured at least my attention. I think it's so important to track these kinds of breakthroughs. So first things first, yes, it's blazingly fast throughput. No question about that. They see, it turns out, about four times the throughput of other inference services. And yeah, it's very, very quick, no doubt. By the way, their chips are entirely fabricated and packaged in the US. That's a big advantage they have, too, over other companies that have a complex supply chain that involves Taiwan and South Korea. But one of the things that I think we have to keep in mind here is performance is about more than just how much throughput you can get, like how many tokens you can get out the other end. You also have to think about how many customers you can offer this to at a given time. And here I'll just share some insights from SemiAnalysis, which is a great firm that looked into this. So when you look at this Groq system, I think it's pronounced Grok, by the way, because they got into a tiff with Elon about, like, he took their name, and it's just a difference in spelling. I believe that's what's going on. Each of these chips, they're super fast, but they have crazy small amounts of, basically, of onboard RAM, 230 megabytes, right, in a context where language models are, like, seven billion parameters for a small one on the Andrey Kurenkov scale. So 230 megabytes doesn't get you a whole hell of a lot. You need, it turns out, something like 600 chips, 600 of these chips, in order to have the inference you need to serve even a Mixtral model, whereas you can do that on a single NVIDIA H100 chip. So now you're having to buy a crap ton more. You're having to kind of dedicate way more sort of data center infrastructure to serve these chips. And so it is blazingly fast, but it's important to note, right, this is a big, big limitation, and the cost equation is far from clear. Right now, Groq is currently losing money on their API. They're going to need about a 7x increase in throughput, in traffic, in utilization to break even. And that's a reflection of the weird unit economics. They're just going in a different direction with this. And I think the last thing that I really need to mention is this is a chip that just does inference. It does not do training. And that's really, really important. That's a key limitation, but it's also a hint, right? We've talked on the podcast a lot about how hardware breakthroughs and model breakthroughs, algorithmic breakthroughs are starting to bias us towards a direction where increasingly our models are doing more and more of their thinking at inference time. Rather than spending your compute during training, you're going to start to spend more and more of your compute at inference time, where you get these models to prompt themselves a bunch of times using crazy prompting techniques like self-consistency and chain of thought prompting and all that jazz just to get a single output, right? So you're doing a lot of the putting in the elbow grease after the model has been trained rather than before. 
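To put rough numbers on the "hundreds of chips" point above, here is the back-of-the-envelope memory arithmetic. The 230 MB on-chip SRAM figure and the ~80 GB of HBM per H100 are the commonly quoted specs, but the Mixtral parameter count and bytes-per-parameter assumptions are approximations, so treat this as a rough estimate rather than a definitive accounting.

```python
# Back-of-the-envelope version of the memory argument above. The Mixtral size
# and precision are assumptions; real deployments also need KV cache, activations,
# and redundancy, which pushes the chip count higher still.
MIXTRAL_PARAMS = 46.7e9          # Mixtral 8x7B total parameters (approx.)
BYTES_PER_PARAM = 2              # fp16 / bf16 weights
GROQ_SRAM_BYTES = 230e6          # on-die memory per GroqChip (quoted spec)
H100_HBM_BYTES = 80e9            # high-bandwidth memory per H100

weights_bytes = MIXTRAL_PARAMS * BYTES_PER_PARAM            # ~93 GB of weights

groq_chips_for_weights = weights_bytes / GROQ_SRAM_BYTES    # ~400+ chips, weights alone
h100s_for_weights = weights_bytes / H100_HBM_BYTES          # ~1-2 GPUs (one with quantization)

print(f"Weights: {weights_bytes / 1e9:.0f} GB")
print(f"Groq chips needed (weights only): {groq_chips_for_weights:.0f}")
print(f"H100s needed (weights only): {h100s_for_weights:.1f}")
# Adding KV cache and activations lands you in the ballpark of the ~600 chips
# cited above: blazingly fast, but a very different cost structure.
```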
And I think this kind of breakthrough is another push in that direction perhaps, right? We may see these chips optimized ultimately for training too, that I don't know, but I think it's really important to note this is a key constraint and it does reflect something that will be seen more and more of, more custom chips for specifically LLM use cases that are specifically good at inference. And I think if nothing else, it's a great sort of warning shot that we can expect a lot of room for growth on chip design, even using existing fabrication nodes, even at the five nanometer process node. I think this one actually might even have been a bigger node, so not even the cutting edge TSMC one. So anyway, really, really interesting breakthrough with a lot of, I think, depth and detail too. Just to be super clear, this is not a new chat bot. This was kind of a demo they posted mainly to showcase the technology and they're serving open source language models like Mixtral and Llama 2 at very fast speeds. So the big deal is the LPU, the Language Processing Unit, and we'll see, as you said, if we'll have more of these specific, not general purpose hardware for inference. Next story, something I found pretty cool, Introducing IP Adapters: create consistent game assets in seconds. So this is actually not a new story. This was a release from the company Scenario, where Scenario is a tool that is used to create assets for primarily video game developers. And the announcement is essentially that you can now have a reference image of, let's say, a single character. And from that one reference image, you can create all sorts of assets of that same character that are consistent with the initial image. So you can say, give me this character, but in these clothes, give me this character, but with these glasses, give me this character, but running, stuff like that. And I think worth highlighting, because this has been one of the limitations when it comes to text to image and image generation is like how, if I'm creating an animated movie, a web comic, a video game, how can I make the same character show up throughout? It's actually been a pretty tricky problem. We've been seeing more and more research come out in recent months, showing how you can get consistent character generation in various ways. And here's an example of that coming out in a pretty established product where you can say, okay, now I have this specific character and I can generate a bunch of assets of that character in various contexts, doing various things, pretty necessary for this to actually have an impact in the industry, I think. Yeah, it's interesting too, because they are explicitly generating IP here. So it makes me think about the whole copyright thing and what their copyright protections are, because this is meant for explicit use in commercial applications. So you never know, depending on how the thing's trained, what the training data set is, are other people's ideas or IP going to creep into the things you get generated? I think it's a really cool tool. Man, the demo on their website, by the way, holy crap. Have you used it in your workflow? Played around a bit, haven't used it directly, we have our own models, but yeah. Oh, you have your own models. Oh yeah. Okay. And up next, we have Report, OpenAI working on web search product. And this is a pretty short report, but maybe it shouldn't be too surprising. We have OpenAI that has, they've got a web crawler already, it's GPTBot. We have ChatGPT Plus users who can browse with Bing. 
And then of course, Microsoft used, sorry, Bing rather, used GPT-4 for its customized search product or still does. And so we're circling the drain here a little bit. And so perhaps not so shocking that we have now OpenAI potentially, according to this report, looking into the search product. I think one of the big questions is, are they going to be able to make a dent that's bigger than what Bing was able to make thanks to GPT-4? Bing's market share famously just barely budged after the initial hype wave around the GPT-4 launch passed. So hard to know. Right now, there's a big imbalance between Google, which has almost 85 billion visits that they saw in December versus ChatGPT, where they got 1.6 billion. But of course, yeah, chat bot, different product from a search engine, no question. So we'll see if they can compete whether on quality or just distribution. It's also the case that we have other competitors in the space. There's Perplexity, there's You.com, all doing this kind of AI-enabled search idea. And of course, Google has that already built into, I guess not Bard anymore, it's Gemini now. So it's already going into a crowded space, unlike the initial ChatGPT release, where it was the first one of these. But it does make sense that they would go ahead and put it out there, I guess. And one last story for this somewhat long section, where we wound up talking for a while. The story is Adobe Acrobat adds generative AI to easily chat with documents. So that's the idea. There's now this new tool, AI Assistant in Acrobat, which is a conversational agent that can summarize files, answer questions, and recommend more based on the content, allowing users to interact with documents in a chat-like manner. Very intuitive, I guess, use of integrating a chatbot. Acrobat is a way to read PDF files, pretty popular, as far as I understand, for looking into that sort of stuff. So yeah, pretty impressive, or maybe just notable that Adobe is continuing to push out AI features throughout its product suite, not just in Photoshop, but now also in Adobe Acrobat. And up next, we have applications in business. And we start with Sam Altman owns OpenAI's Venture Capital Fund. This is a weird one. It's also one that has, well, been in the news without being in the news. So Sam Altman famously testified before Congress about concerns over catastrophic risk from AI and that sort of thing. And during that testimony, he was famously asked, like, how much of OpenAI do you own? He was like, none of it. I don't own any equity. And that's a really weird thing to have happen. This was sort of framed as Sam A already has investments in tons of companies. He doesn't need any more money. Maybe there was sort of a vaguely ethical dimension to this too, or somewhat ambiguous, but that's the lay of the land. Now we're finding out that OpenAI and its Venture Capital Fund, OpenAI's VC fund, which, by the way, has like about 175 million in total committed investments, and they've invested in companies like Descript and Harvey, which is a really popular legal tool. It's actually owned. It's owned in Sam A's name. By the way, so the fund also has LPs, limited partners that kind of like co-invest, I guess, with OpenAI. They include Microsoft. So it's a pretty, certainly it's a fund with a lot of access because of where OpenAI is in this nexus of AI startups. But yeah, so the really weird thing here is even like OpenAI and its nonprofit foundation do not actually have ownership over the startup fund. It's literally in Sam A's name. 
And there was a quote from an OpenAI spokesperson who says, well, look, we wanted to get started quickly and the easiest way to do that with our structure was to put this in Sam's name. We have always intended for this to be temporary, which, I mean, I'm no legal expert on, certainly not in corporate law, but this is a really weird way to kind of like do a temporary structure. Like you're just going to put it in Sam's name. One of the questions this article raises is like, what would have happened if the board debacle had just gone a different way? Like what would have happened if they kicked out Sam A? He actually stayed kicked out and now this dude owns the entire VC portfolio for this OpenAI VC. So it seems like it introduces all kinds of risks, risks that almost materialized even. And so just a really, really weird arrangement, no clear answer to why this has happened. But all they say is, look, we know that we need to re-examine our governance structure and that should come before changes to the fund. And then they're focused on creating a new board. But all of this is just a really weird, I mean, it's the sort of thing that, again, not being a corporate law expert and OpenAI has a famously like weird corporate structure. So there could easily be an explanation buried in there and I shouldn't play that down. But this seems, I mean, it seems almost like the amateurish thing that I might try to do if I was like, ah, you know, whatever, like I only have so many hours in the day, let's just put it in my name for now and we'll figure it out in post. So really, really weird arrangement. Not like a huge implication, but a very weird reveal. I think, as you said, I was also very surprised when reading this. The fund was launched in late 2021 and the 175 million in total commitments was as of last May. So it may actually be even bigger now. And yeah, it's kind of a real surprise, a weird arrangement going on here for several years, this temporary situation seems to have been kept as is. And I suppose it does seem likely with the look into the board and governance structure that this will be re-examined at some point. But yeah, yet another aspect of the legal structures around OpenAI that is unusual and very idiosyncratic. Yeah, there's secretly a fund that's keeping lawyers afloat. And next story, Reddit signs AI content licensing deal ahead of its IPO. So this is relatively undetailed. This is kind of a report. They say that Reddit has been telling prospective investors in its IPO that it had signed the deal and that it is worth about 60 million annually as far as income. And this is apparently what some people had told the reporters here at Bloomberg. This would be a pretty significant chunk of their revenue. Reddit did take in 800 million in revenue last year. So that would be a pretty significant amount of income just from licensing the stuff that's on Reddit for use for training AI. There was a huge controversy at Reddit last year when they limited access to their API so that people couldn't easily get access to the data. There was a whole community revolt thing going on because a lot of apps and things built using the API no longer worked. And that shutdown and that kind of conflict with the community happened precisely because the idea was to keep the data inaccessible unless companies paid to get access to data for AI training. So it makes sense to see this happening. And I guess it'll be interesting to see if people do wind up paying for all this data. 
Yeah, and what's really weird about this is they don't name, well not weird, I guess it's confidential, they don't name the AI company the agreement is actually with. And so we don't know, but I think there's a good chance it could start with an O and end with a "penAI". And the reason I'm saying that, ready to stand completely corrected if this blows up in my face, the beauty of podcasting. But so Reddit has a kind of long history in the specific social circle that would cause it to be well-connected to OpenAI. They were a Y Combinator-backed company back in the day, Sam Altman himself was the president of Y Combinator, actually still at the time that I went through it. And Sam A actually has been on their board, I believe he still may be. And so is Michael Seibel, who's another partner at Y Combinator. So a lot of social and other entanglements there, I would not remotely be surprised if this was actually a YC, or sorry, an OpenAI deal and partnership. But either way, this is a deal with just one company. So we also don't know if it's an exclusive deal. You know, maybe there are terms that prevent Reddit from double dipping, selling the data to other AI companies. And it also, no matter what, could be precedent setting, right? So we now have a price point, $60 million per year it seems, for data of the quality and scale that Reddit can provide. And so that's a really interesting kind of like, I guess, a waypoint, a marker for potentially future deals to come. On to the lightning round. First story, NVIDIA reveals its EOS supercomputer for AI processing that is sporting 4,608 H100 GPUs. So there you go. We have this crazy data center scale supercomputer designed specifically for AI applications. And yeah, having thousands of supercomputer level GPUs. So these H100 GPUs cost, I don't know the exact number now, but they cost a lot. I think it's tens of thousands to get just one. And it's pretty hard to get your hands on one. And beyond just the hardware, of course, there's also some pretty impressive networking tying all of this together. Like this article has some details talking about Quantum-2 InfiniBand networking and software providing 18.4 exaflops of FP8 AI performance, et cetera, et cetera, lots of big numbers. But point is there's now an NVIDIA-built supercomputer for AI and it is ranked number nine in the TOP500 list of the world's fastest supercomputers. Yeah, in that list, it bears mentioning, it's not gonna include all of the, actually all the most powerful supercomputers. You talk about the clusters that Microsoft is running, that Meta is running. Those are not necessarily gonna be there even if they are coupled together in one data center, but also worth mentioning that the world's most powerful, oh man, I don't even know what to call them, like clusters of computing are not necessarily even all under one roof anymore. Google famously is now working on ways to train across multiple data centers at a time. So it's increasingly less meaningful when we talk about individual supercomputers that are really, really fast, increasingly the ability to wire together, effectively wire together supercomputers and clusters is really important, but still very notable achievement, especially because NVIDIA actually stopped focusing on the double precision gains, basically. So the double precision floating point calculations, basically, so they could focus more on AI related stuff a while ago. And that's part of what was being measured here. 
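As a quick sanity check on the 18.4 exaflops number, the arithmetic works out from the per-GPU FP8 peak. The ~4 petaFLOPS per H100 figure used here is the vendor-quoted peak with sparsity, so treat it as an approximation.

```python
# Rough sanity check on the 18.4 exaflops FP8 figure for EOS, assuming the
# commonly quoted ~3.96 petaFLOPS of FP8 throughput (with sparsity) per H100 SXM.
num_gpus = 4_608
fp8_per_gpu = 3.96e15                       # FLOPS, quoted peak with sparsity
total = num_gpus * fp8_per_gpu
print(f"{total / 1e18:.1f} exaFLOPS")       # ~18.2, in the same ballpark as 18.4
```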
Just to give one number, I guess, that maybe folks listening, you'll maybe listen to it and be like, oh, okay, that I can relate to. So a famous benchmark is the MLPerf training benchmark. And this is basically where you train GPT-3, which again is a 175 billion parameter model. And you train it on 1 billion tokens and you see how long it takes for the hardware to handle that training run. And it turned out that EOS did this training run in about four minutes, which was three times faster than what it had been able to do just six months ago. So pretty insane, like thinking about, you know, training what used to be, just like three and a half years ago, a cutting edge model. And you're training it in like four minutes on one supercomputer, pretty wild. Fun fact, EOS, apparently the Greek goddess said to open the gates of dawn each day, which is pointed out in the Nvidia blog post, so. Oh man, I was gonna say, Andrey has some Greek mythology skills. No, I'm- Just waiting to drop that. Not that knowledgeable. Next up, Google quietly launches internal AI model named Goose to help employees write code faster, leaked documents show. So yeah, not a product that's aimed at consumers, but nevertheless a new AI release from Google, at least internally. This is supposedly a descendant of Gemini and is meant to pretty much help write code using the internal tech stacks of Google. So meant to pretty much speed up the thousands, tens of thousands of software engineers Google employs across all its various products. Makes sense for them to try and invest in it a little bit. Yeah, it's apparently trained on, as they put it, a sum total of 25 years of engineering expertise at Google. So I guess presumably just like on all their code. Yeah, it looks like this is potentially aligned as well with some of the Google efforts to like do a bunch of, you know, efficiency increases, read: layoffs, through AI potentially. That's at least what it seems like from the outside. But then again, their chief business officer also explicitly said, quote, We are not restructuring because AI is taking away any jobs. I guess that's different from intending at some point to do that, but still kind of interesting that they're doing this. It has a 28,000 token context window in case that's sort of of interest. And last quick note is that it was a collaboration between all of the different parts of Google that do AI things. So Google Brain, DeepMind, and then the Google internal infrastructure team. So very wide ranging effort here. And next up, we are gonna have a couple of funding stories. First up, Chinese startup Moonshot AI raises $1 billion in funding round led by Alibaba and VC Hongshan amid investor interest in OpenAI-type firms. So there you go. This is a huge funding round. Moonshot AI launched a smart chatbot, KimiChat, in October, and that is built on its self-developed Moonshot LLM, a large language model, apparently capable of processing up to 200,000 Chinese characters in its context window. The company was founded just in April of 2023. So a pretty new one. And yeah, this is pretty much an investment in an OpenAI-type play. I'm a bit surprised because I haven't been seeing anything about this personally, but yeah, I mean, wow. This is definitely one of the weirder fundraisers that I've personally seen, especially in the Chinese ecosystem. 
Like, for context, usually a fundraise on the order of a billion dollars is something that you do, like, before you're gonna IPO after years and years and years of development, totally get that AI is different, things move faster. But the other weird thing about this is when you raise on those valuations, you usually don't give away, in this case, like more than a third of your company in a single fundraise. Though in total, you might easily have given away a third of your company by the time you're raising that amount. So I don't even know how to parse this. This company is headed by this guy, Yang Zhilin, who's a computer science grad from Tsinghua University, very prestigious institution in Beijing, worth noting it has an open affiliation with the People's Liberation Army. So this is for all intents and purposes, a military affiliated institution, and now it has the spinoff. So really kind of noteworthy as well, because the investors, yes, include Alibaba, they also include Hongshan, which is a kind of Chinese spinoff of Sequoia Capital. We talked about them, I think in a previous episode and how China basically wrestled this spinoff away from Sequoia. And it just turned out to be kind of a, not a dead loss maybe for Sequoia, but certainly not nearly as good an outcome as they might've hoped. And of course, Hongshan, we previously talked about how another company that they funded is another sort of, like, OpenAI competitor. This was the company Lightyears Beyond, and they've anyway been doing at scale language model training. Maybe a last thing worth mentioning, there are a lot of these companies flaunting long context windows, right? In fact, at one time, there was a company called Baichuan in China that had said, hey, look, we've launched a 350,000 character context window model. And they even said that this is the longest context window in the world and that it beat Claude 2, which at the time was its next closest competitor. One of the things to look out for when you hear these things about big context windows, anybody can make a model that can absorb a huge context window. The question is how well does it handle that context window? How does that context window translate into quality? Oftentimes with Chinese models in particular, I recommend a trust but verify approach, wait until you see the performance on the open benchmarks, at least if it's an open source model, that can be especially helpful. Yeah, just because you never quite know, there's so much impetus to kind of hit those vanity metrics just because they make headlines. But still very interesting development, huge, huge fundraise, and very, I mean, this is a lot of capital sloshing around in the Chinese market right now. Next story, AI computing firm Lambda raises 320 million in fresh funding. There you go, Lambda focuses on compute, as it said. They already serve 5,000 customers across various industries and will now be trying to compete with Nvidia and other hardware providers. They're known for doing a bunch of LLM and generative AI training, that's kind of where they specialize, and not in stuff that's not generative AI, but yeah, specifically those things. And apparently, I thought this was interesting on their website, they apparently have a shipment of H200s either already on site or about to be deployed. So they're one of the first cloud providers with their hands on H200s from Nvidia. 
So clearly they've got a pretty good way of getting allocation from Nvidia, which is probably behind the raise in the valuation. And one last quick story about funding, ex-Salesforce co-CEO Bret Taylor and longtime Googler Clay Bavor raised 110 million to bring AI agents to business. This is 110 million from investors that include Sequoia Capital and Benchmark. And these are meant to be AI agents that do various things. So for example, there are hundreds of thousands of customer conversations every month for clients, including Weight Watchers, SiriusXM, Sonos, and Outlook AI. So they are kind of enabling AI interactions across various businesses and customers. And up next, we have our project and open source section. We're kicking it off with BioMistral, BioMistral if you're French, which, you know, not everyone is. A collection of open source pre-trained large language models for medical domains. So essentially what we've got here is a case where this lab has, like, put together, they've taken the Mistral 7 billion Instruct v0.1 model. So the instruction fine-tuned version of Mistral's 7 billion parameter model, and they gave it some additional training on PubMed Central. So a database of basically medical data. It's a pretty significant corpus, about 1.47 million documents, 3 billion tokens that were added to the already kind of pre-trained Mistral 7 billion model. And then they basically looked at like, hey, how does this do on a whole bunch of different questions and medical question answering tasks in English? And then they automatically had translations of those tasks into seven other languages as well, just to see how well it generalized. And it does pretty well. It outperforms models like MedAlpaca, which, you know, it's kind of what it sounds like, sort of medically fine-tuned version of the 7 billion parameter Alpaca model, and BiomedGPT, which we covered, I think in a previous episode a little while ago. So it definitely is ahead of the pack on, you know, just about all the benchmarks that they tried with a couple of limited kind of exceptions, you know, things on medical genetics, anatomy, college medicine, you know, on average, strongly outperforming the vast majority of other models, in a lot of cases by like 10% performance on these benchmarks. So really quite impressive. And another kind of open source model, increasingly we're seeing obviously the medical models, the medical versions of these models come out, you know, first the base model, then you get the instruction fine-tuned, then the dialogue fine-tuned, maybe the RLHF, and then you get the kind of medical specialist models and so on and so forth. So we're seeing the Mistral line very much mature. I found it interesting that this was trained using a high-performance computer from the CNRS, the French National Center for Scientific Research. There's a similar initiative in the US to make a national AI cloud, where you can have these sorts of supercomputers for AI research. So I guess, yeah, a nice demonstration of what happens when you give academics or open source developers access to powerful hardware, they can develop these sorts of models and fully open source them. So models, data sets, benchmarks, scripts, everything is out similar to what we saw last week with AI2. And next story, Nomic AI releases their first fully open source long context text embedding model that surpasses OpenAI's Ada-002 performance on various benchmarks. Long title of a story there, but that's what it is. 
The release here is of nomic-embed-text-v1, which does generate these embeddings. A quick recap, embeddings are just a bunch of numbers that, roughly speaking, tell you what text means. And you can use that as an input to a language model, a chatbot, or you can use it for various other things. You can use it to find similar text, do retrieval, classification, a bunch of stuff. This model can handle sequence lengths of 8,000 tokens. So that's quite a bit higher than a lot of typical open source models that are capped at, let's say, 500. And yeah, fully open source. So this is coming out under an Apache 2 license. Yeah, it's also, so it is a very small model at 137 million parameters. It used to be that to hit anything like an 8,000 token context window, you would just need a much, much bigger model. My guess is that they probably haven't done like sort of compute optimal training. So in other words, they probably poured in more compute than is ideal for this number of parameters on the standard scaling basis, just to make sure that they squeeze as much value as they possibly can in those parameters. The idea here really being to make sure that you have a small model that can do really well, right? It's another example of the kinds of little nooks and crannies that we're still trying to fill in with models that have, in this case, a long context window, but are really small. That combination is something that there just wasn't a good model for. This actually beats, yeah, OpenAI's text embedding models like text-embedding-ada-002 and text-embedding-3-small on short and long context benchmarks. So it is useful for a wide range of things that those models perhaps aren't for, or aren't useful for, and it is released under an Apache 2 license. So, you know, very permissive. But yeah, definitely got more of those vibes from this paper of like another example where people are trying to fall over themselves to show how open they are. We'll give you the code, we'll give you the data, we'll give you everything. This was very much one of those cases. So pretty well everything you can imagine wanting out of this model, you can certainly get. They have a bunch of interesting details, like, you know, they're using flash attention, maybe not too surprising to see that used now, and a bunch of tricks like, anyway, setting their vocabulary size strategically just to make sure that they're improving, let's say, on the previous systems. They do use a BERT base. So that being kind of the, I guess, the version of language models before the GPT series, actually, that used to be the cutting edge, well, now they're back to BERT with these augmentations and seeing some cool results. On to research and advancements. First story is Meta unveils V-JEPA, an AI model that improves training by learning from video. And the blog post from Meta was actually titled V-JEPA, the next step towards Yann LeCun's vision of advanced machine intelligence, or AMI. A dramatic blog post title. And AMI apparently is now a new acronym we are trying to make happen. But yes, this is V-JEPA, a video joint embedding predictive architecture, similar to their image joint embedding predictive architecture. And the whole idea is to be able to try and predict patches of a video. So similar in a way to the Sora story we began with, this is another story of trying to train a world model, in this case, trying to train an AI to understand how the world works by predicting portions of video. 
In this case, they say they mask out large aspects of the image and have a model try and predict it. Yeah, and so I think it's really noteworthy just how similar this is in spirit to OpenAI's Sora, right? Like in both cases, you take in an image, you chunk it up into patches, you create an embedding, right? This like list of numbers that captures the meaning in that patch of the image, right? That captures whether or not certain concepts are represented there. And for each of those patches, you're gonna create an embedding like that. And what you're gonna do is take advantage of the fact that video frames that are close to each other in time or patches of image that are close to each other in space usually contain closely related information, right? So if you see a bit of sky in one part or one frame of a video, then there's a good chance that there's gonna be sky in that same patch one frame later. And there's a good chance that neighboring patches will also contain some sky. And we've actually talked about this idea, I think last week, when we talked about how humans might learn visual skills and visual understanding of the world by recognizing that the things we see at one moment are usually gonna be pretty similar to the things we see very soon after. And we don't need to label our data to know that, right? This is just all unsupervised learning. It's all being done without labels, without labeling our input data during the training process. And so essentially this model is just gonna be trained to determine whether a chunk of an image or a time bound piece of video follows or is close to another given chunk of the video. And in that sense, it's not a generative model, right? It's not gonna be generating video. It's a discriminative model. It's gonna be analyzing pieces of video and images. And it makes its predictions not by operating on raw pixels in the video, but instead on operating on this embedding space, right? On the space where we're capturing the semantic meaning of the sort of chunks of video that we have. So in that sense, it is kind of like Sora, right? Both of these things operate at the level of the embedding space. Both it seems taking advantage of the fact that meaning is similar in kind of physically closely related parts of the video, whether that's close in time or close in space. And ultimately the thing you wanna get out of this in the case of V-JEPA, the most valuable artifact really that you're after from the training process is the encoder that takes in a patch of video or a patch of an image and generates the embeddings, right? Generates that meaning, extracts the meaning from those inputs. And so you might be tempted to look at this and kind of compare it to OpenAI's Sora more explicitly. And if you do that, like I did superficially at first, you might see this as a fairly weak showing compared to Sora, right? This model has a lot of big limitations that Sora doesn't have. It's only discriminative, right? So it can analyze meaning in images and video, but it can't generate video. It also only works on very short chunks of video. Like their paper says something like 10 seconds or so. That's about what it can handle in terms of recognizing actions over long time horizons. And another thing is that you still need to adapt the model by training like a small lightweight specialized layer or a small network on top of it if you want it to learn a new skill. 
So there's no equivalent here to like in-context learning that's captured as far as I can tell in this architecture. But the flip side is that they're publishing it openly and it does reflect a commitment to Yann LeCun's vision of AGI and how he thinks it'll be achieved. So whereas OpenAI tends to like the idea of scaling individual models, they tend to take the view that as we progress, architecture matters less and less and scale matters more and more because it starts to do more and more of the work for you. Whereas Meta on the other hand, as is the case here, they take inspiration from the way the brain works a little bit more and they see that as the path to human level intelligence. So it's maybe not so surprising that in that context, you know, they're more interested in these more specialized and purpose-built modular architectures and framing the study as an investigation into how to replicate human learning patterns rather than, you know, OpenAI's let's-just-scale-this-up-and-see approach. This is definitely more of a research effort primarily, right? So they say, this paper explores feature prediction as a standalone objective for unsupervised learning from video and introduces V-JEPA, a collection of vision models trained solely on feature prediction. So it really is exploring specifically a training objective and is primarily a research artifact, right? Of trying to release some new information and release this stuff for academics, really, which I don't think can be compared to Sora, which is definitely, I guess, not a tool yet, not a product yet, but clearly is more of a commercial investment from OpenAI's perspective. I will also say, it's a little different in the sense that another thing highlighted here is that this is self-supervised or unsupervised training. So you can just take a ton of videos and train, whereas for a generative model, text-to-video, you need text and video, right? So you could make your argument from a scaling perspective at the limit, you can use this kind of method to train on a giant corpus of videos without requiring any labeling versus if you'd want to train a generative model, then somehow you'll need to first label all of them. And maybe it turns out that you can first train a model on this self-supervised objective and then train some more with a more generative objective, whatever. Worth noting, a lot of advances in image stuff have come from self-supervised learning. I mean, chatbots fundamentally are self-supervised at first, then they do fine tuning on human labels. So another example of Meta really exploring that type of space and showing that it can be extended to video. Next paper, chain of thought reasoning without prompting. So chain of thought, we have mentioned it a lot of times over the years. Basically the idea is that for LLMs to be able to solve, let's say trickier problems that require a bit of thinking, it's been found that you can condition the chatbot to do better by telling it to, you know, let's say think step-by-step or first give it an example of reasoning through the question before giving the answer, stuff like that. So that all is kind of chain of thought. And what this paper is saying is that it is possible to get chatbots and LLMs to do chain of thought reasoning without giving it examples or telling it to do so in the prompt. The way they do that is they investigate the decoding process, the process by which you generate the output after giving it the input.
And instead of doing greedy decoding, where you just take the most likely output each time, they show that using the top K alternative tokens, so other paths that the LLM could go down, it is possible to find the chain of thought paths of output that are inherent in the sequences. So the basic claim is that language models are already inherently capable of a chain of thought reasoning or like reasoning in their output if you decode the output in the right way. And the confidence in the final answer increases when a chain of thought type output is present in the decoding path, which they can leverage to create this chain of thought specific decoding. Yeah, I'm curious if you can think of sort of like concrete applications of this. I think it's interesting in its own right and worth consideration for that reason. It's just some of the caveats here. So they'll say, for example, it's generally but not always the case that the model will be most confident in its final answer if it takes a reasoning trajectory, as you said, that is associated with chain of thought prompting or sort of like where it autonomously decides to do chain of thought. But that's not always the case, that's inconsistent. And chain of thought prompting is also apparently not the most common of these reasoning trajectories that it ends up going through. And so as a result, you can't automatically sift through and pick out the chain of thought one using techniques like self-consistency, which sort of look at kind of which of these approaches is coming up the most and is most self-consistent. There's just too much diversity in the different reasoning strategies that come out at that level. And this approach also would require getting the model to fully generate all of these outputs. And so that's pretty expensive from an inference standpoint. You're running the model many, many times all the way through. And so at a certain point, you're kind of reduced to just doing an ensembling approach, really. That's to me what this looks like. It's an intelligent ensembling approach. They're coming up with heuristics to make it more likely that we can pick out the self-prompting one. But then the last challenge that they flagged here too was apparently this works best for simple problems that are more similar to the problems that were in the training set explicitly. But you are still gonna need standard chain of thought prompting if you're gonna take on more challenging problems because you kind of need to be in teaching mode a little bit more and help the model out. So yeah, I think it's academically interesting because we're learning that, oh, bubbling up to the top in a lot of these suggestions is the LLM autonomously is kind of going, oh, I want to try this. But it's not always, in fact, not often the first thing that it'll try. And there are all these kinds of issues behind the scenes that at least to my mind might make it a little hard to kind of use this in practice. Yeah, I agree. I think this is more of an interesting result and something that by itself isn't, let's say, a game changer. You can't just prompt the LLM to do reasoning or fine tune it. But you can use this insight and build on it as you do in so many cases in AI and research. And potentially this could impact how you create fine tuning data sets when you have LLMs evaluate other LLMs. I think having a better understanding of the space of things you can do while decoding is very useful. Yeah. 
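For the curious, here is a rough sketch of what that kind of CoT-decoding could look like with a Hugging Face causal language model: branch on the top-k candidates for the first token, decode each branch greedily, and score each continuation by the average gap between the top-1 and top-2 token probabilities. The tiny stand-in model, the toy prompt, and scoring the whole continuation rather than just the answer span are simplifications relative to what the paper does, so treat this as a flavor of the idea rather than a faithful reproduction.

```python
# Sketch of CoT-style decoding: branch on the top-k first tokens, decode each
# branch greedily, and pick the continuation with the highest confidence,
# where confidence is the average top-1 vs top-2 probability gap per step.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in; the paper uses much larger models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def cot_decode(prompt, k=10, max_new_tokens=64):
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        next_logits = model(**inputs).logits[0, -1]        # logits for the first new token
    top_k = torch.topk(next_logits, k).indices             # branch points

    candidates = []
    for first_token in top_k:
        branch = torch.cat([inputs["input_ids"], first_token.view(1, 1)], dim=-1)
        out = model.generate(branch, max_new_tokens=max_new_tokens, do_sample=False,
                             output_scores=True, return_dict_in_generate=True)
        gaps = []
        for step_scores in out.scores:                     # one score tensor per generated token
            probs = torch.softmax(step_scores[0], dim=-1)
            top2 = torch.topk(probs, 2).values
            gaps.append((top2[0] - top2[1]).item())        # margin between best and runner-up
        text = tok.decode(out.sequences[0][inputs["input_ids"].shape[1]:],
                          skip_special_tokens=True)
        candidates.append((sum(gaps) / len(gaps), text))

    return max(candidates)                                 # (confidence, continuation)

print(cot_decode("Q: I have 3 apples and buy 2 more. How many apples do I have?\nA:"))
```

In the paper the confidence is computed over the answer tokens specifically, which is also roughly where they report the chain-of-thought paths standing out most clearly.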
And onto the lightning round where we're gonna try to move fast through a few papers. We're doing great at this. Yeah, we are really good at that. Well, first paper, OS-Copilot: Towards Generalist Computer Agents with Self-Improvement. So this is exactly what the title said, a framework to build generalist agents capable of interfacing with various elements of an operating system, including, I guess, everything on a computer, not just kind of the operating system internals, but also the web, code terminals, files, media, and third-party applications. They use that framework to create FRIDAY, a self-improving embodied agent for automating general computer tasks. And on this benchmark called GAIA, a general AI assistants benchmark, FRIDAY outperforms previous methods by a good deal. So yet another example of people working on agents and on automating workflows on a computer in a general way. Yeah, I think the GAIA benchmark is one that we talked about a fair bit before. And it was basically this attempt to come up with benchmarks that are hard enough that current language model agents struggle with them and that they ended up having these three different levels of tasks, level one, level two, level three tasks, level three being the most challenging. And basically, previously, all language model agents just would flop at that level of task. FRIDAY, this particular framework, achieves a success rate of 6.12% on it. So that's kind of cool. And certainly an indication that we're maybe getting a little bit of liftoff now at that higher end level of difficulty tier. But yeah, interesting result. Another new agent architecture. I feel like we're seeing another paper like that every week or so these days, actually more than that. But definitely a big step forward. And certainly that level three task push is, at least to me, one of the more impressive things we've seen so far. Next paper, world model on million-length video and language with ring attention. So the idea is to show that you can train a transformer to be effective on token lengths, context lengths of 1 million tokens, similar to what we began with, Gemini 1.5 Pro. In this case, we actually have a paper that tells us how they did it. So one of the tricks they had was ring attention. That is a technique for scaling up context size arbitrarily without approximations or overheads. Secondly, they curated a large data set of videos and language from public books and gradually trained this model with increasing context size, starting at 4,000 tokens and going up all the way to 1 million tokens. They open-sourced a highly optimized implementation with ring attention and other features to let other people build on this long context transformer that they trained. Yeah, to me, the ring attention piece is really the highlight here. You know, this seems to be what they're using to achieve these absurdly long context windows. Pieter Abbeel is a pretty famous UC Berkeley researcher who is behind the original ring attention paper. This is, it's kind of a, I mean, it is kind of like a standard transformer, but it has a fancy way of passing off the keys and values. So basically these are the intermediary things that you have to calculate in the process of generating the output of a transformer. So it's this fancy way of passing those values along to, or between multiple devices that are set up in like a, well, a ring-like structure.
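Since ring attention keeps coming up, here is a toy, single-process sketch of the blockwise accumulation it builds on: a running max, softmax denominator, and weighted sum that get rescaled as each new key/value chunk arrives, so the final result matches ordinary attention exactly. In the real technique those chunks live on separate devices arranged in a ring and get passed neighbor to neighbor while computation overlaps with communication; none of that systems machinery is shown here, so this is just the math, not the paper's optimized implementation.

```python
# Toy, single-process illustration of the blockwise (online softmax) accumulation
# that ring attention distributes across devices arranged in a ring.
import numpy as np

def blockwise_attention(q, kv_chunks):
    d = q.shape[-1]
    m = np.full((q.shape[0], 1), -np.inf)   # running max of attention logits
    l = np.zeros((q.shape[0], 1))           # running softmax denominator
    o = np.zeros_like(q)                    # running weighted sum of values
    for k, v in kv_chunks:                  # in ring attention, each chunk would arrive from a neighbor device
        s = q @ k.T / np.sqrt(d)                             # logits for this chunk
        m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
        p = np.exp(s - m_new)
        scale = np.exp(m - m_new)                            # rescale the old accumulators
        l = l * scale + p.sum(axis=-1, keepdims=True)
        o = o * scale + p @ v
        m = m_new
    return o / l

# Sanity check: chunked accumulation matches ordinary full attention.
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(4, 8)), rng.normal(size=(32, 8)), rng.normal(size=(32, 8))
chunks = [(k[i:i + 8], v[i:i + 8]) for i in range(0, 32, 8)]
s = q @ k.T / np.sqrt(8)
weights = np.exp(s - s.max(-1, keepdims=True))
ref = (weights / weights.sum(-1, keepdims=True)) @ v
assert np.allclose(blockwise_attention(q, chunks), ref)
```

Because only one key/value chunk needs to be resident at a time, the total context length is limited by how many devices you can chain into the ring rather than by any single device's memory, which is the property the next bit of discussion points at.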
And by doing that, they're able to achieve, like the details are somewhat technical, unfortunately, but they're able to achieve these really, really long context windows. In principle, they are, like theoretically they can go up to infinite length, but they're limited basically just by the number of those devices or cores that you can kind of stack together in that way. So yeah, it's, I think a really important new development. Ring attention is something that I'm gonna be paying a lot more attention to going forward. And yeah, very interesting that they've been able to pull this off. It's also kind of weird that it's coming out at the same time as Gemini 1.5. Sort of makes you wonder a little bit where, what's under the hood there and how that might relate. But yeah, very, very interesting new breakthrough. Right, I think this is highlighting that these kinds of things, longer context, is part of the trends, I guess, still ongoing. It's been ongoing really for the last year and a half where we've seen, like at a time, Claude having 32,000 token context length was really impressive. And now we're going all the way up to a million, which is of course essential for having general purpose AI and so on. So makes some sense to me that they're coming out close to each other. Pretty cool to see them open sourcing a fine-tuned version of Llama 2 7B. So this model that they release with long context windows is a 7 billion parameter model that others can use. Next story, Amazon AGI team says their AI is showing emergent abilities. That's the headline. So it turns out that Amazon has an AGI team that does research and works towards AGI. This is something I just didn't know. Oh man, poor Amazon. No, I mean, I was not aware, maybe I was aware, but they have created a new model called Big Adaptive Streamable TTS with Emergent abilities, so BASE TTS. And as per the headline, it says that there's some emergent abilities that it wasn't trained on. So this model was trained on 100,000 hours of public domain speech data, mostly English, and is presumably really good at text-to-speech as a result. And the specifics for this emergent stuff, basically it has to do with pretty complicated aspects of text-to-speech. So they have some sentences that include foreign words and, you know, like at signs and hashtags and things like that. And apparently BASE TTS was not explicitly trained to deal with foreign words or punctuations or things like that, but still was able to do pretty well. So it kind of kept the ability to create speech even for things that it hasn't seen. One of the big take-homes of this thing is actually that Amazon has an AGI research team. Like I was joking about it earlier, like I vaguely remembered this, I think we might've actually touched on it in a past episode, but we haven't heard much from this team. This seems to be one of the first results that we're seeing. They do have a bunch of audio samples that you can listen to. I just listened to a couple just now and they're, you know, they're good. Definitely a solid sort of text-to-speech model. And interesting that their AGI team is starting with that focus. I'm not sure if that's a commitment to a certain view about the value of audio or that as a path to AGI, but we'll just have to see what they come out with next. That is it for research. Moving on to policy and safety. Starting with the story, hackers for China, Russia and others have used OpenAI's systems, according to a report. This is research by OpenAI and Microsoft.
And they say this is some of the first documentation of hackers with ties to foreign governments using generative AI in their attacks. The attacks were using AI in relatively mundane ways. So drafting emails, translating documents, debugging computer code. And as you might expect, OpenAI and Microsoft have said that they are working to disallow and curtail the use of their systems by these foreign hackers. Yeah, and it's an extension of a partnership with Microsoft Threat Intelligence, which is interesting because it sort of reminds us of that, you know, close partnership between Microsoft and OpenAI, which, you know, both Microsoft and OpenAI hasten to tell us does extend to safety and security. And so, yeah, they have a bunch of different sort of scenarios or what am I trying to say? Like a bunch of different vignettes or examples that they're sharing in this post. And then Microsoft's post as well goes into a little bit more depth. They look at two different China-affiliated threat actors who apparently tried to use OpenAI's servers. They give them code names that are really cool, like Charcoal Typhoon and Salmon Typhoon. There's an Iranian-affiliated threat actor called Crimson Sandstorm, a North Korean one called Emerald Sleet, and a Russian one called Forest Blizzard. So kind of cool if you're into the fancy code names. Yeah, essentially, as you said, they're trying a wide range of different things. It was useful, I guess, for Microsoft Threat Intelligence and OpenAI to watch them, to kind of let them use the service a little bit, see what sorts of things they're after. It was interesting, because Microsoft's post went into a little bit more detail about who these actors are and what they have tended to do. Charcoal Typhoon, the Chinese state-affiliated one, they talk about it having a broad operational scope, targeting sectors like government, higher ed, comms, oil and gas, so very, very broad. Whereas Salmon Typhoon, the other Chinese state-affiliated one, seems a lot more sophisticated. They have a history of targeting US defense contractors, government agencies, cryptographic tech companies, that sort of thing. And what they were doing is they were using these OpenAI models to translate technical papers. That's kind of interesting. Get publicly available information on intelligence agencies and regional threat actors. Get help with coding. And also to research common ways processes can be hidden on a system. So we're seeing definitely more of a veering into the kind of malware, cyber offense dimension. And anyway, there's a bunch more really interesting information. Last one I'll mention is Forest Blizzard, the Russian one. Apparently, this was actually linked to GRU unit 26165. So apparently, that's targeted victims of both tactical and strategic interest to the Russian government and has been active in the context of Ukraine. The GRU is Russia's military intelligence agency, sort of the military counterpart to what back in the day would have been the KGB. So the GRU here also getting in on the action. And kind of interesting, apparently, all of these accounts, by the way, have been shut down. So no surprise there. But yeah, sort of an interesting bit of news and some cool transparency, I guess, from Microsoft and OpenAI sharing a little bit about what's been going on under the hood. That's right, yeah. You can go to these releases by both of them to get into the details and see how they're tracking these threat actors.
But I guess our takeaway is, so far, the hackers are mostly just doing mundane stuff with chatbots like we are and are not somehow becoming super hackers just because they have access to ChatGPT. Next story, House leaders launch bipartisan artificial intelligence task force. So the House has been doing some stuff related to AI for the past year. We've seen forums on AI. We've seen some bills starting to come out related to deepfakes. And now House leaders, Speaker Mike Johnson and Minority Leader Hakeem Jeffries, are launching this bipartisan AI task force. The task force will be looking into how the US can support AI innovation and study potential threats, release guidelines, policy proposals, all sorts of things. And it will have 24 members, led by Chairman Jay Obernolte and Co-Chairman Ted Lieu, both of whom have computer science backgrounds and have previously talked about AI. And I found that detail pretty interesting, the leadership front there. Yeah, for sure. Jay Obernolte is famous for owning a video game development company as well, so a technical guy. Yeah, he has a master's in AI, specifically. He also has historically been less concerned about the alignment risk, the catastrophic risk potential from exotic AI accidents, that sort of thing. So interesting and useful to get this balance of a more free market libertarian perspective and Ted Lieu's perspective; he has been more of a hawk generally on AI overall, though I'm struggling to remember if he's actually kind of looked at catastrophic risk from AI alignment. That's something that I think really ought to be in the conversation here, especially as we enter or start to think about entering the spring and then the summer, when if there's going to be a bill that'll go through the House and Senate before the election, it's going to probably have to happen fairly soon. So this is clearly part of Congress trying to wrap its arms around this very complex issue. And yeah, it's good that there are technically informed minds at the table. I think one of the big risks that you run into as well is we all can index a little bit too much, potentially, towards our past experience. And I find this often with folks who do CS stuff, AI stuff from back in the day. The sorts of strategies that worked back then had limitations that the strategies that work now don't. I'm sure that these folks are tracking that, but it can tend to bias us towards thinking that things are moving perhaps more slowly. We see the limitations. We don't necessarily see the capabilities. So I think hopefully one of the things that will happen here is they'll kind of canvas around for a wide range of opinions on where the field might be going and account for the fact that a lot is unknown. We do not know how fast stuff could move. And that probably means we ought to have some chips bet on the possibility that things could move fairly soon. That's the kind of possibility you don't want to be blindsided by. You want to have some kind of legislative muscle in place to deal with that possibility. So really interesting cast of characters, very well-informed group of people. And hopefully they make the right calls. Right, and quite bipartisan. It's a real mix of Democrats and Republicans on this thing. And we've seen that, at least in AI, there are some things that can be bipartisan, like regulating deepfakes.
So I wouldn't be surprised to see this task force actually come to some agreement regarding aspects of AI regulation, even if, as typical in the US, Democrats and Republicans will have pretty wide divides on some of the related issues. All right, on to our lightning round. We have your fingerprints can be recreated from the sounds made when you swipe on a touchscreen. Chinese and US researchers show a new side channel can reproduce fingerprints to enable attacks. OK, I'm all out of breath. And that was basically the entire paper in the title. It is now February 2024. And you're probably asking yourself, how come we haven't yet run into an AI breakthrough that allows your fingerprints to be reconstructed based on the sound they make when they slide on a screen? Well, Tsinghua University, which, again, I hasten to remind you, has an open PLA affiliation, and the University of Colorado teamed up together to make this breakthrough happen. It's not a 100% effective thing. This attack allows you to, it turns out, attack about 28% of partial fingerprints and about 9% of complete fingerprints within five attempts, just based on the sound, the sound of your finger as you move it across the freaking trackpad. This is pretty insane to me. But I think it's just a reminder of how crazy easy it is to gather information, to recreate data about your environment with very little information. And we're moving into a very interesting information environment where monitoring, intelligence gathering, and all that stuff is going to be a hell of a lot easier. That's right. This is not a huge AI leap here. It's not some end-to-end model they created. It's rather a whole little system where there's a series of algorithms, different types of preprocessing, and steps for understanding the raw audio signal that they put together and got this to work pretty well. So I guess a little worrying, because if you do try to actually just train a model end-to-end, from audio to fingerprint image or whatever you want to attempt, maybe you could do better. So just FYI. Next story, this one is from Axios. And it is simply that states are introducing 50 AI-related bills per week. And this is in the US, of course. So they just covered some details as to the state of bills being introduced in states in the US and highlight that there's a lot going on. As of February 7, there were 407 total AI-related bills across more than 40 states in the US. And that's up from just 67 bills a year ago. States introduced 211 AI bills last month. 33 states have election-related AI bills, and so on. So yeah. And as per the headline, it's now the case that there's 50 new AI-related bills per week throughout the states, with some of the states having the most of them. New York has 65, California 29, Tennessee 28, Illinois 27, and some other ones also are introducing them. And up next, we have finally my home country actually making the news. Air Canada found liable for chatbot's bad advice on plane tickets. So Air Canada is our national airline. And they've been ordered to pay compensation to a grieving grandchild who claimed that they were misled into purchasing full-price plane tickets by an ill-informed chatbot. This is where things get weird. So the airline actually tried to separate itself from the crazy shit that its chatbot said. And I'm going to say this all Canadian-like, because I'm guessing that this is what they sounded like when they said it.
But they said, a separate legal entity that is responsible for its own actions, eh? That's what this thing is. It's a separate legal entity, they claim, this chatbot that is distinct from them. And so who knew, right? This thing is an independent agent. So this led to a decision, perhaps unsurprisingly, coming from a tribunal that said that, look, this whole idea basically does not apply, that while a chatbot has an interactive component, it's still just a part of Air Canada's website. It should be obvious to Air Canada that it is responsible for all the information on its website. It makes no difference whether the information comes from a static page or a chatbot. And then they added, I find Air Canada did not take reasonable care to ensure its chatbot was accurate. Interestingly, this introduces a legal requirement, implicitly, for AI alignment, right? Like, what is reasonable care for AI alignment? What is an amount of care that is reasonable to ensure that this chatbot will give true outcomes, correct outputs? So I think this is really interesting. We're going to see a lot more stuff like this, obviously. But it was also kind of interesting because of the weird legal argument that Air Canada tried to put forward here, like a separate legal entity. That's interesting for a chatbot. So we're playing out a bunch of science fiction plot lines and figuring out what's human and what's not, I guess. That's right. And specifically, the chatbot claimed you could get essentially a full refund. And you, in this case, couldn't. And the person was awarded $812 to cover that difference. And the last story for this section, the FTC warned about quiet TOS changes for AI training, TOS being Terms of Service. So the warning was that companies might be tempted to resolve the conflict between their existing privacy commitments and their desire to turn user data into AI training fuel by quietly changing their terms. And the FTC stated that, yeah, they might simply change the terms of their privacy policy so that they are no longer restricted in the ways they can use their customers' data. And so the FTC blog post actually says this. And to avoid backlash from users who are concerned about their privacy, companies may try to make these changes surreptitiously. But market participants should be on notice that any firm that reneges on its user privacy commitments risks running afoul of the law. So there you go. The FTC is saying, don't secretly change your privacy policy to make money from user data. Don't do it. That's not OK. And apparently, Zoom did this in August of 2023 with updated terms of service to clarify that the company can train AI on user data with no way to opt out. Yeah, and they had a commentator, an analyst, who stepped in and said, maybe it's not so bad as that, the Zoom change and an analogous Google change. They were saying, at least in the case of Zoom, if done quietly, it was likely because the change wasn't material. It was just stating more explicitly something that it had already retained the rights to do. So maybe this is a more innocuous play. But certainly, the FTC is coming in and saying, hey, folks, you're going to play nice now, right? This is a bit of a shot across the bow. So we'll see if people actually heed the warning. All right, and on to synthetic media and art. This time, we have only one story in this section. And it is that Sarah Silverman's lawsuit against OpenAI has been partially dismissed. This is a California court. And as per the headline, it has partially dismissed this lawsuit against OpenAI by Silverman and several other authors.
The lawsuit made six claims, including direct copyright infringement, vicarious infringement, some other ones. OpenAI had requested the dismissal of all of those except for direct copyright infringement. And the judge has dismissed four of these different claims. So now it's down to just two of them, which are unfair competition and that direct copyright infringement. So I guess this narrows the scope of the lawsuit, with less for the authors to argue as to what OpenAI is liable for in this case. Yeah, and as always, as we're keen to say, hashtag not lawyers. But here is the reasoning that the judge in this case, I'm going to butcher the pronunciation if I try, but Judge, well, I guess I'll try, Martinez-Olguin essentially said that, look, I'm not convinced that OpenAI was, so there's one allegation that OpenAI was intentionally removing copyright management information. This would be like the title and the registration number for these documents, these books. And she also said it's not really clear that the authors had proven economic injury because nowhere in their complaint were they alleging that defendants reproduced or distributed copies of these books. And this is an interesting threshold to set, right? Like that, OK, apparently, the way you show economic injury, the justification for this part of the lawsuit at least, is there's got to be a full distributed copy of this material. That's a pretty high bar and something that you could imagine being gamed pretty easily as well by LLM companies, right? Like if you have a classifier run over the system and go, oh, am I reproducing verbatim a chunk of text? OK, I'll just add a word here. And now it's no longer verbatim. That may be too facile. But also, apparently, the court decided that the claim of risk of future damage to intellectual property was too speculative. And that's also interesting, right? Because you can imagine comedians, let's say, well, Sarah Silverman, Dave Chappelle, these folks, you train these models on the data that they've produced, their kind of collective works. And then it can go out and do a Sarah Silverman monologue or a Dave Chappelle monologue. Arguably, that is economic or intellectual property damage of some kind. But apparently, that's too speculative at this stage for courts to consider. I'm not sure what that does to the precedent-setting side, given they're couching this in, well, it's too speculative. Maybe that gives them the opening to not have this affect precedent too much. But it definitely does start to, as you said, constrain the set of things that you actually can sue these companies for. At least it starts to set that precedent. Definitely, yeah. So this is one of many lawsuits that are ongoing, as we've covered over the last few months. There are also separate lawsuits against text-to-image companies, in addition to these ones specific to authors of text. The authors can file amended complaints by March 14. And the main complaint that ChatGPT directly violated the copyright remains on the table. So we will still be seeing this go forward. And I guess it will still be interesting to see where this goes. And now, on to the last section, the new last section, which is just fun or miscellaneous, where we can include whatever we feel like. It doesn't have to be anything else. So for my end, the first one I picked was this Visual Guide to Mamba and State Space Models.
This is a really nice write-up by Maarten Grootendorst on his Substack, where it just goes through the details of the architecture, explains the various details of Mamba. It does take a while to get through conceptually. It is built on some of these control theory concepts. It has some hardware optimization in there. It's just a mixture of various elements that are rather technical. So I, to be honest, still haven't tried to fully get all the details in there. But I now have a general grasp of what's happening thanks to explainers like this that go through it in a nice step-by-step way. And it is, yeah, I would say kind of interesting at least to get the general picture by reading something like this. 100%, I mean, the illustrations are so good. And yeah, I mean, these sorts of things are worth their weight in gold, right? So often, we think we understand something, and then we see there are a couple of places where this really changed my way of thinking about it. It was like, oh, wow, OK, this is a kind of nicer way than the equation-based approach that sometimes is the default, especially when these things just have come out, right? And all we have is the paper. So yeah, really, really nice resource. And next, the one I picked is called Cellular Functions of Spermatogonial Stem Cells in Relation to the JAK/STAT Signaling Pathway. And if you're wondering to yourself, I thought this was an AI podcast. Why are you talking to me about sperm stem cells? Well, I would, too, in your shoes. But if you click on the link and you go to the paper, what you will find is that this is a retracted paper. And it's retracted for a very interesting reason. Because if you click on the actual images that are in the paper and you zoom in real close, what you will find is that they contain very nice pictures of cells and stuff. I'm not a biologist, blah, blah, blah. But when you look closely, you find that the text on those images has weird spelling shit going on, almost as if those images were generated by an AI, almost as if this is complete confabulation. In fact, that is exactly what seems to be going on here. This paper is riddled with AI-generated images. And it's not clear if the text is or is not, but it has been retracted. It's from two different institutions. They're researchers from the Department of Spine Surgery, Honghui Hospital, Xi'an Jiaotong University, and the Department of Spine Surgery at Xi'an Honghui Hospital in Xi'an, China. So really kind of a big ding to this particular journal, which is one of the Frontiers journals, which I think is actually pretty, I think it's a decently well-known journal. I've heard of them before, not sure where or in what context. But anyway, this is a real ding for their reputation and very surprising that peer review didn't catch this. It seems weirdly obvious. Yeah, so this was covered. I wasn't aware of the paper title. So at first I didn't know where you were going with this, but there have been media articles about this. For instance, one titled Scientific Journal Publishes AI-Generated Rat With Gigantic Penis In Worrying Incident. I didn't mention the penis, by the way, because this is a family show. But that was in the news and I knew of that story. I didn't think it would fit in any section, but now I guess this is where it'll go. So there you go. For reference, some journals are less reputable. Peer review can be kind of broken, especially if you go for a journal that is more sort of, you just pay and most papers get in. Maybe that was the case here.
So I wouldn't say this is necessarily like a worrying sign for all of science that we're gonna start to get more of these kind of ridiculous incidents. But yeah, kind of a fun story to be aware of. And just a couple more. I have one more from my end. The story is that Helen Mirren has ripped up an AI-generated speech at the American Cinematheque Awards. So that's it. Helen Mirren was accepting a lifetime achievement award, read out a generic-sounding speech, and then said that it was AI-generated and proceeded to tear it up, let the pieces fall to the floor, and that was met with applause and cheering. So yeah, kind of a sign that there is a growing backlash towards AI in the creative industries. I mean, this has already definitely been the case with text-to-image, but I'm sure this will be the case for authorship as well. And here's a very kind of clear sign of that. Yeah, and as if on cue, the next story is Microsoft's game-changing Super Bowl ad, which basically goes like, hey guys, I know everybody's really freaked out about AI's gonna destroy the world and take your jobs or steal your children and kidnap them and sell them back to you for money that it can then use to train more of itself because it's got more, anyway, all that. But Microsoft goes, don't worry, we're here to make your dreams happen with AI. That's the reframe that they're going for. This is their big Super Bowl ad. They start the ad with a bunch of people who are talking about all the different ways that their lives are not going the way they want them to. The things like, they say, I'll never open my own business or get my degree or make my movie or build something, like all these things. And that's the first 30 seconds of the ad. And then later, basically Microsoft swoops in and says, well, it's the Copilot AI bot, which goes to all these users and responds to them and goes, yes, I can help you, unless the request involves opening the pod bay doors, in which case it's like, no, no, I will not do that. Anyway, so this is kind of an interesting ad. It's Microsoft really trying to push back on this kind of cultural fascination with AI being like a source of significant risk and angst. And they're trying to do, in some ways, what Apple did back in the 1984 Super Bowl ad when they unveiled the Macintosh and they basically took this Orwellian dystopia with a bunch of zombies that gets liberated by a projectile from a revolutionary sprinter. Anyway, it was like an attempt to kind of rouse people out of thinking of technology as this big, bad thing. Here is Microsoft trying to do the same thing and quite clearly kind of stepping on Apple's turf a little bit with a bit of that think different vibe. There is that sort of subtext to it. Anyway, so this Axios article is actually really great and efficient at kind of walking you through it and providing a little bit of context. So kind of like that. Right, yeah, I saw this ad, the Super Bowl was now a week and a half ago as of this recording, and it's a little hammy, this whole thing of like, oh, they say I can't do this, I can't do that. And then the answer is, well, AI exists, so you can. And if you read the YouTube comments and the general response to this ad, I think it was seen as kind of lame and not particularly inspiring. But to your point, I think it does show a desire to re-portray AI, or reframe it as something that is an enabler of human achievement and not a replacement or something like that.
And with that, we are done with this episode of Last Week in AI. Once again, you can find the articles we discussed here at lastweekin.ai, our text newsletter. You can also feel free to email us with any suggestions or feedback at contact@lastweekin.ai or comment on YouTube or Substack or elsewhere, and we will be sure to keep an eye on it and reply. As always, we would appreciate it if you share the podcast or review it on Apple Podcasts or somewhere else. It's always nice to hear your feedback and I guess know that recording these ridiculously long episodes is something that people actually like. But more than anything, we do like to see that people listen to and benefit from these episodes. So please do keep tuning in.