Incredible array of panelists who build the products and services that we all use. Let's kick this off with a quick round of introductions, starting with Ronan. Hello, I'm Ronan. I'm a research scientist here at Apple, and I work on the MLX framework. That's awesome, Ronan. Up next is Richard. Hi, I'm Richard. I'm from the team that brought you the Foundation Models framework, and we're so excited to see so much interest and excitement from developers around Foundation Models. Awesome. Next, Eric? Hi, everybody. My name's Eric. I work very closely with Richard; we both contributed a whole bunch to the Foundation Models framework, and we're really excited to talk to you about it today. Awesome, Eric. And last but not least, Michael. I'm from the Core ML and Create ML team. We're really focused on making your models run as efficiently as we can on device, and giving you the tools to get them there. Thanks, Michael. We are all incredibly excited to be here and answer all your questions, ranging from Foundation Models, to Core ML, to open-source MLX, and everything in between.

Speaking of questions, please post your questions on Slido. In addition to the team here on WebEx, we have a huge team behind the scenes helping us make this run as smoothly as possible so we can get to all your questions. Once you post your questions, a moderator will approve them, and you'll be able to upvote them. This helps us narrow down to a list of questions that is broadly applicable to most of you, so we can get to the important ones. A couple of other bookkeeping items before we jump in: in this forum, it's hard to answer code-specific questions, so if you have a code-specific question, or we can't get to your question today, please head over to developer.apple.com/forums, ask it there, and we'll continue the conversation well after the group lab. If you have a feature request or a bug to report, head over to Feedback Assistant and share it with us, and we will take a look. We love feedback, so thank you for doing that. Now, this week, all of the folks you see on screen had the opportunity to speak to many of you in person and hear a lot of your questions, so let's start with some of the questions we received from developers, and then we'll move on to the questions in Slido. Please keep those questions coming; we'd love to get to all of them.

So, let's start with one of the most exciting announcements this week, which is Foundation Models. For Richard and Eric, here's a question we heard often while speaking to many of you today: what are you most excited about with the release of the Foundation Models framework? Richard, do you want to kick this off? Sure. The Foundation Models framework gives you access to the on-device large language model, so every request that you make to the model is processed entirely on device. This is incredible, because you can do a lot of on-device processing for your apps to implement intelligence features: generate personalized search suggestions, generate NPCs in a game, and so much more. We are super excited to see developers build features using guided generation, structured output, and also tool calling.
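For anyone who wants to see what that looks like in code, here is a minimal sketch of guided generation, assuming the FoundationModels framework on an Apple Intelligence-enabled OS. The `NPCProfile` type and its fields are hypothetical, and exact signatures may differ slightly from what's shown.

```swift
import FoundationModels

// Hypothetical type for illustration: guided generation fills this in directly,
// so you get typed output instead of parsing free-form text.
@Generable
struct NPCProfile {
    @Guide(description: "The character's name")
    var name: String
    @Guide(description: "A one-sentence backstory")
    var backstory: String
}

func makeNPC() async throws -> NPCProfile {
    // Every request is processed entirely on device.
    let session = LanguageModelSession()
    let response = try await session.respond(
        to: "Create a friendly shopkeeper for a fantasy village",
        generating: NPCProfile.self
    )
    return response.content
}
```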
Awesome, thanks, Richard. And if you're all wondering why Richard seems familiar, he was on the State of the Union, announcing the Foundation Models framework. I know many of you are super excited. But Eric, you're on the Foundation Models team too, and you worked incredibly hard on this — what are you excited about? Well, of course I echo everything that Richard said. I'm personally really excited about the combination of guided generation with snapshot streaming, and seeing all of the really cool animations and magical features that all of you are going to build with that. It makes it so easy to get reliable output from the model and then turn that into really delightful animations, and that's going to play so nicely with Liquid Glass and all the updates coming in SwiftUI. There's going to be so much cool stuff that you build; that's what I'm personally looking forward to seeing. 100%, I agree. We are only so creative — all of you in the audience, your creativity — we are excited to see what you build with it.

All right, here's another question that we often heard, and this is for you, Michael, about Core ML. The question is: we use Core ML extensively today for our ML use cases. With the availability of the Foundation Models framework, when should I still consider bringing my own LLM, large language model, through Core ML? Great question. I'll first answer this in general terms, and then we'll get to LLMs and the Foundation Models framework. In general, we highly recommend you first look at the built-in system models and APIs we provide at Apple for your use cases. These models are highly optimized, have been tuned for our devices, cover a wide variety of use cases, and you don't need to package a model as part of your app. That's absolutely true for the Foundation Models framework: as Eric and Richard mentioned, it taps right into our large language model that is already on device, has been optimized, and is shared across multiple apps. You also get features beyond the model itself — the great APIs built on top of it, like guided generation, tool calling, and all the decoding-loop and constrained-decoding work happening behind the scenes, not just direct interaction with the model. That's a great advantage: you can focus on your use case and not so much on the low-level details of the model. That said, the role for Core ML is when you want more control, or choice, over the specific model you're deploying — whether you're customizing one of the system models from the Vision or Natural Language frameworks, or whether you're augmenting something or helping build a prompt with a Core ML model. Core ML has a role there, and we provide the tools to get those models on device. But again, in general, we recommend you first try the system-provided, ML-powered APIs — the Foundation Models framework in particular, but also Vision, Natural Language, Speech, Translation, all of those — where you can leverage the built-in models. Thanks, Michael.
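As a companion to Michael's point about bringing your own model, here is a minimal sketch of the generic Core ML loading path. The model file name is hypothetical; in practice Xcode also generates a typed class for any model you add to your project.

```swift
import CoreML

// "MyCustomModel" is a placeholder for a model you've converted with
// Core ML Tools and bundled with (or downloaded into) your app.
func loadCustomModel() throws -> MLModel {
    let config = MLModelConfiguration()
    config.computeUnits = .all   // let Core ML schedule across CPU, GPU, and Neural Engine

    guard let url = Bundle.main.url(forResource: "MyCustomModel", withExtension: "mlmodelc") else {
        throw CocoaError(.fileNoSuchFile)
    }
    return try MLModel(contentsOf: url, configuration: config)
}
```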
Yeah, I think we heard from many developers that you don't want to be bundling these models with your apps. With the foundation model as part of the OS, you get to take full advantage of LLMs through the Foundation Models framework, and you can augment those capabilities using Core ML models. Here's another question that we heard commonly this week — this one is for you, Ronan, about MLX. For a lot of my ML modeling use cases, I currently use PyTorch, which, as you all know, is a popular open-source machine learning framework. MLX looks exciting — should I consider migrating my PyTorch code to MLX? Thanks. Maybe let me first take a step back: what is MLX? Indeed, you can compare MLX with PyTorch in the sense that MLX is also an open-source, general-purpose machine learning framework. And as with every framework out there, we picked a number of features to emphasize. One of these is, of course, that we built it for Apple Silicon from the ground up. The other thing is that it sits somewhere between JAX and PyTorch in terms of features. You will feel very much at home using MLX, because it has a NumPy-like API, so it's quite easy to migrate your code. And it comes with a number of APIs: we have a C API, a C++ API, a Python API, but also a Swift API. So if you're interested in moving your code to, for example, iOS, Swift is the language of choice there. As with every framework, we also come with a slightly different programming model that might be of interest to you; in particular, we lean heavily on unified memory, which is a key feature of the Apple Silicon hardware. So again, it might be better for your application — you have to try it, basically. Give it a try and tell us what you think. Yeah, agreed. At the end of the day, what works well for you is what works well for you; you want to try out a few different things. My experience using MLX and PyTorch is that they're very similar in API, and MLX is built from the ground up with unified memory, as Ronan mentioned, so it's incredibly performant. Having said that, there are two MLX sessions here at WWDC — I encourage you to go check them out after the group lab. Thanks, Ronan.

So, let's move on to the questions that you've all been asking us today. We have a nice array of questions here, and I'm excited to get to all of them. This one is about Foundation Models, for Richard and Eric. The question — and I heard this several times talking to many of you as well — is: I've been exploring the foundation models introduced with Apple Intelligence, and I was wondering, can I test them directly in the Xcode simulator, or on a physical device with an Apple Silicon chip, or do you recommend using the new Xcode playgrounds instead? That's a great question. You can absolutely use the simulator to test your Foundation Models use cases — you can build an app and use Xcode previews — but all of this requires that your host machine, your Mac, is running macOS Tahoe. macOS Tahoe is the requirement for the simulator to be able to talk to the same model that macOS is running.
So, for example, if you're calling the Foundation Models API on macOS, that is directly tapping into the macOS Apple Intelligence model, and your simulator goes through that path. At the same time, you can also do on-device testing: if you hook your iPhone up to your Mac, and your iPhone is running iOS 26, you can run playgrounds or live previews directly through your iPhone as well. Awesome, thanks, Richard.

Moving on, there's another question. This is about on-device models, and I think a couple of you will have opinions on this: which on-device models will be supported? Any open-source models? Now, I know this question touches on MLX and Core ML, so let's start with Ronan — do you want to take this first? Yes. It really depends on what you want to do, I would say. Richard and Eric can answer better regarding the foundation model part, because the OS ships with specific models. On the open-source side, depending on the kind of model you're interested in — for small or reasonably sized models — Core ML is probably the first thing you should try, because it ships with the OS and it would be very easy to use your model there. Then, when it comes to state-of-the-art, larger models, or very large models, that's where MLX shines. We support basically all models, or most models, coming from the ML community. We pay attention to what's going on on Hugging Face, and most of the language models or vision models popping up on Hugging Face will be usable straight out of the box with MLX. Now, of course, you cannot use any model on any device: big models require more memory, so you will need larger machines — possibly even our latest Mac Studio Ultra with 512 gigabytes, which is probably the only machine on the market that allows you to run, for example, DeepSeek R1 on a single machine. It's kind of amazing — you can do that with that hardware and MLX. And then, if you're more into iOS development and want to run language models on your iPhone, it's very likely you'll want to choose a model with a smaller footprint, and that's where, maybe, Michael and Richard can talk about what you can do with Core ML and the foundation models that ship with the OS. Yeah, thanks, Ronan. I think the Core ML path is interesting — Michael, anything to add there? Yeah. Core ML is primed for you to take a wide variety of models. I'm guessing in the context of this question, when we say open-source models, maybe we're focused on LLMs, or maybe a general set of models — but Core ML can cover you both ways. As Ronan pointed out, you need to consider the size of your model and what's appropriate for your use case. When it comes to LLMs, I'll just repeat: it is going to be really hard to find a competitive solution to the Foundation Models framework, in the sense of a model of that size, of that quality, and of that speed and level of integration.
But you are also totally welcome to take any off-the-shelf model, get it into your framework of choice, and run it on the device directly. It just leaves a lot more on you — the responsibility of how you want to distribute that model, when you want to update it, et cetera. Thanks, Michael. And while we're on the topic, Eric and Richard, did you want to add something to the conversation? The models that you can run through the Foundation Models framework are Apple's first-party models. We don't have support for external models or open-source models, and there's good reason for that. Having everybody share one base model allows us to make lots of optimizations and to make the platform as a whole run more smoothly. If everybody's sharing one model, that helps us extend battery life, it reduces latency, and a lot of good things come out of that for the system as a whole. It's all about leveraging that shared resource. Using Foundation Models will not increase the size of your app. That's a very good point, Richard. This was a very good question — I think it gets at the core question everyone's asking about the few different ways to do machine learning on device. As Ronan mentioned, if you're working with frontier models, there's MLX, which will help you run these gigantic models on, say, Mac Studios with unified memory; if you really want to package an LLM, you can use the Core ML path; and if you just want to rely on the Foundation Models framework, the API and the model are bundled with the OS. So there are various options, and what works well for you is for you to figure out. Thanks for the discussion — I think that was a good question.

Moving on, here's a question by Andre: how often will the foundation model be updated? How do we test for stability when the model is updated? Richard or Eric, who wants to take this first? Do you want to take the first part, Richard? Sure. The foundation model will be updated in sync with operating system updates. Apple trains and improves these models continuously, so they are expected to be updated along with future OS updates. Your app still gets the opportunity to test against the new model during the beta period: you can download the beta, run your app, and evaluate it using your own prompts, or run the app through your UI to see if it works well — and if there are any unsatisfactory (or even satisfactory) cases, report them to us using Feedback Assistant. We're always eager to hear your feedback. Eric, anything to add to that? Yeah. On that note, that ties into a common theme that I think you'll hear a lot today, which is the importance of having an eval set for your use case. Collect some number of golden prompts and golden responses that you can use over time to evaluate the performance of your feature as things change. That could be the model changing, or it could be changes to your own prompts as you tweak your app. You want to know: when I make a tweak to this prompt, does it make things better on the whole? It's very easy to tell whether it helps on one or two examples, but it's really important to understand, statistically, across all the different things your users are going to do, does it help?
And having that eval set ready and at hand will be super useful for all kinds of things, including preparing for model updates. That's a very good best practice right there. Thanks, Eric. And if you haven't checked out our prompt design and safety WWDC video yet, make sure to check it out — it teaches you all about how to prompt the on-device model and what the best practices are. That's a good call-out. Thanks, Richard. Awesome. Thanks, Richard and Eric.

Moving on, here's another question, also about Foundation Models — we're keeping you both busy, Richard and Eric. This one is about the context window, and it's from Nosa: what is the context window of the model? What are the max tokens in and the max tokens out? Well, that one has a pretty straightforward answer: it's 4,096, and the split between in and out is very flexible. If you put 4,000 tokens in the prompt, you've got 96 left over for the output; you can balance that any way you'd like. We also provide the ability to tweak the maximum number of output tokens, just in case you have some upper bound that you know is relevant for your use case. Now, we talk about tokens, but the APIs aren't actually taking tokens in, are they? That's a very good question. The API takes in text and converts it to tokens under the hood. So, to reason about how many tokens are in your text, you want a rough way of making an estimate, and typically you want to overestimate to make sure you stay well within the 4,096 you have to work with. There are lots of formulas out there for estimating how many tokens are in a body of text, but roughly: if you assume that every three to four characters is one token for languages like English, and that one character is one token for languages like Japanese or Chinese, that will get you in the ballpark. Then what's important is to handle the error cases if your estimate turns out to be wrong. Most of the time you'll be fine, particularly if you overestimate, but on some occasions — for example, if somebody gives you input with lots of punctuation, like a log message printed out of software — the text tends to be more token-dense, and a short piece of text filled with logs can have many more tokens than natural language would. So it's important in those cases to catch the thrown error and handle it gracefully. You can do that by asking the user to submit a shorter prompt, or you can start a new session to get a blank slate so that you have more context to work with. And if your session has gone really long and has a big existing transcript, you could even feed some of the transcript into the model to get a summary, then feed that summary back into the model in a new session, so the model still remembers some of the context. Yeah, there are all kinds of creative approaches you can take to these problems, and that's one of the things that makes it really exciting. There's no canned, single solution for any particular use case — you have to feel it out based on what you're doing and have a way to evaluate whether it's working well. Makes sense.
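Here is a rough sketch of the estimate-and-recover pattern Eric describes. The heuristic numbers come straight from the discussion above, while the fallback is deliberately generic because the exact error case you'll want to match on lives in the framework's generation error type.

```swift
import FoundationModels

// Heuristic from above: ~3–4 characters per token for English-like text,
// ~1 character per token for languages like Chinese or Japanese.
// Overestimate so you stay well inside the 4,096-token window.
func estimatedTokenCount(for text: String, charactersPerToken: Double = 3.0) -> Int {
    Int((Double(text.count) / charactersPerToken).rounded(.up))
}

func summarize(_ text: String, in session: LanguageModelSession) async -> String? {
    do {
        let response = try await session.respond(to: "Summarize this text:\n\(text)")
        return response.content
    } catch {
        // If the context window is exceeded despite the estimate, recover
        // gracefully: ask for a shorter prompt, or start a fresh session.
        return nil
    }
}
```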
I was about to ask — and I think Richard kind of answered the question — how do you know how many tokens you've consumed? There's a transcript that you have access to in a session, right? That's the best way to find out how much you've consumed. Awesome, thanks — that was a good question. Thanks, Nosa.

Moving on, this is a question from Francois: which on-device model or API can I use to extract text data from images, such as nutrition labels, ingredient lists, cashier receipts, et cetera? I'm going to say this is closer to you, Michael. I'll take this, yeah. A little change of topic. The first thing that should come to mind when you're thinking about analyzing or understanding images is the Vision framework. The Vision framework has had a text recognition — a recognize-text request — for a while, but this year there's a brand-new one that's perfectly matched to the use cases you mentioned: a new recognize-documents request in the Vision framework. This will not just recognize text in images, but also give you the structure of the document. So it could give you the rows in your receipt in a table form, or figure out the structure of what's in a nutrition label, and it breaks that down into a structure: I see tables, within tables I see cells, within cells I see some text — and it may even identify things that go through data detectors, like phone numbers, addresses, prices, et cetera. I really recommend you check out this year's Vision session — I think it's called "Read documents using the Vision framework" — it's a great resource. Excellent, thanks, Michael.

Moving back to Foundation Models: are larger, server-based models also available through Foundation Models? Richard — actually, I heard this question several times talking to developers, and I think it's top of mind for everyone. Do you want to take this, Richard? Sure, this is a great question. This year, the Foundation Models API only gives you access to the on-device large language model at the core of Apple Intelligence. It does not support server-side models, but there's so much you can do with the on-device model — for example, text summarization, extraction, generating dialogue in a game. I would encourage you to explore what you can do with this no-cost, on-device model. And if there's anything you want the model to know — any external information you want to expand the model's knowledge with — remember the model is really, really small; it does not know all the latest information about the world. But you can define a tool that reaches out to Wikipedia, a tool that searches social media, a tool that looks up the weather, or one that looks up your personal contacts. Tool calling is a great way for the model to access external information autonomously: whenever the model decides that it needs extra information to answer your question or complete your request, it will call your tool, and your tool can return whatever is necessary for the model to know.
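Here is a small sketch of the tool-calling pattern Richard describes, assuming the FoundationModels Tool protocol. The weather tool, its argument names, and the canned response are all made up for illustration, and the exact return type of `call` may differ across SDK seeds.

```swift
import FoundationModels

// A hypothetical tool the model can call autonomously when it decides it
// needs external information.
struct WeatherTool: Tool {
    let name = "getWeather"
    let description = "Retrieve the latest weather for a city"

    @Generable
    struct Arguments {
        @Guide(description: "The city to look up")
        var city: String
    }

    func call(arguments: Arguments) async throws -> String {
        // In a real app you'd hit WeatherKit or a web service here.
        "It is 22°C and sunny in \(arguments.city)."
    }
}

// Register the tool when creating the session; the model calls it as needed.
let session = LanguageModelSession(tools: [WeatherTool()])
```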
Awesome, thanks, Richard. The next question is actually related to the previous one, and you may have answered it partially. This question is by Alex: in what scenarios should we offload tasks to server-side models versus relying solely on Foundation Models and on-device Apple Intelligence? Here, I assume that by server-side models they mean other large language model services that are available — so how do we know which to use? Richard, do you want to take this? I think Eric should take that one. Sure — Eric, could you take this one? So, I'll give this answer in two parts. The first part is the really obvious cases: if you need really advanced reasoning, or if you need huge context sizes — like you're going to feed a 200-page PDF into the model — that's something you're going to have to fall back on a server model for. That one's relatively black and white. The more interesting part is the second answer, where things are a little more gray. There are a lot of things you can do with the on-device model that you could also do with a server model, and in most cases, if you can do it with the on-device model, it's probably a good idea to: it'll save you some money on those server bills, it's local, it's more private — there are all kinds of good reasons. And so I'm going to harp on it again — I know I've said it a bunch of times and I'm going to keep saying it — having an eval set sitting around, prepped, is how you answer these questions yourself, because it's very difficult to answer them in the absence of a specific use case. And I think what you'll find is that even for advanced use cases, there will be parts of the feature that you can break out and handle with the on-device model, even if there are other parts that you still need to offload to a large server model. It's all about finding creative ways to get the most out of what you have available to you. Thanks, Eric.

The next question, I think, will be a quick one. This is by Nikolai: will multimodality be supported in on-device foundation models? Will it be supported in structured output? Eric, do you want to take this? Yeah, I can take this one. So, Apple's policy is that we can't communicate about things that we might do in the future. Sounds like you're excited about it, though — it's a neat technology area, which is good. We're happy you're already thinking about the future. All of this stuff is available to you today, and we look forward to more things in the future. Thanks, Eric. Even today, you can do so much with the combination of the Vision framework, the on-device large language model, and potentially the Image Playground API. For example, you could use the Vision framework to get text out of an image and feed that text into the foundation model, and you could also use the Image Playground ImageCreator API to generate images. The combination of multiple frameworks and multiple APIs on the platform is what really makes the platform shine and makes for a really good experience for our developers. Yeah, Richard, I'm really glad you brought that up, because it's an excellent point. One of the really neat things is that you can take all of the other frameworks that we have and plug them together. For example, you can use the new speech analysis API to caption what your users are saying.
And if that user says, "Hey, make me an image of a panda on a tricycle," you can ask the Foundation Models framework to rewrite that user's request into an image description, and then you can feed that image description into Image Playground. And there you go, from audio in to image out, by plugging it through these three different frameworks. That's really the kind of creativity we want to see you pursuing, and that's really the advantage of having all of these powerful frameworks across the whole platform. That's an amazing example, Eric — that's fun. With the availability of Vision and Speech and Sound, and the ImageCreator API, you can string all of these together into nice pipelines. That's awesome.

So, moving on, this is about Vision models. Michael, can you talk a bit about the improvements that were made this year to existing Vision models? There are some models that have been updated — the hand pose model in particular, which is covered in that same session I mentioned previously; its accuracy and speed have improved. We're always aiming to improve all of the underlying models, both in terms of performance and accuracy. And there are some new requests in addition to the document recognition request I mentioned before — things like lens smudge detection, which is something we always deal with on WebEx. Depending on your application, if people are taking photos to analyze with the Vision framework, you can inform your user: hey, you may want to wipe your lens and check whether it's smudged. Those are some quick ones that come to mind. Thanks, Michael.

Here's another question about Foundation Models — clearly a hot topic today. This is by Shane: with the new @Generable macro, will Date be supported in the near future, or should I work around this in my model? Eric, do you want to take this? Ah, this is a really interesting question. We talked about this a lot internally and discussed how we would handle Date, and we think right now the best way is for you to create your own date struct and mark it @Generable. The reason is that, depending on the feature you're building and the context of your app, the interpretation of a date may be different. In your app, you may think of a date as a year, a month, and a day; in another context, a date may be a year, a month, a day, an hour, a minute, and a second. And we don't want to box you into a solution that is inflexible. By letting you define what you mean by a date, and how precise you want to be — maybe you just want a month and a day — you get the flexibility to tweak the model and try lots of different ways to prompt it. There are different output formats: maybe if you put the year first and the month second it does better, or you could flip it around and do the month first and the year after. Giving you control over how you represent the date ensures that you can prompt-engineer to get the best performance for your app and your use case. Awesome, thanks, Eric. Anything anyone else wants to add? Richard? You can also use a lot of existing tools, like Guides and regexes, to define exactly what you want as part of your Generable type. Oh — sorry, go ahead. It's one of the cool features that gives you all that customization.
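A minimal sketch of the work-around Eric describes — a custom date struct marked @Generable. The field names and precision are entirely up to your feature; this version keeps just year, month, and day, and bridges into Foundation's DateComponents.

```swift
import FoundationModels
import Foundation

// Define only the precision your feature needs; reorder or drop fields freely.
@Generable
struct GeneratedDate {
    @Guide(description: "Four-digit year, for example 2025")
    var year: Int
    @Guide(description: "Month of the year, 1 through 12")
    var month: Int
    @Guide(description: "Day of the month, 1 through 31")
    var day: Int
}

extension GeneratedDate {
    // Bridge back into Foundation once the model has produced the value.
    var dateComponents: DateComponents {
        DateComponents(year: year, month: month, day: day)
    }
}
```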
I was going to say — one of the customizations I think I saw, and it would be good to explain, is the way you define your Generable structure. In addition to the flexibility in terms of its fields, is there a specific order? Does that play a role? Say I want to stream my responses, and what I really want to get to the user as quickly as possible is the year, with the extra date information filling in later — is there a way for me to control that? Eric's the perfect person to answer this. This is such an interesting design space. So, first off, the order in which you declare the properties in a Generable struct matters: that is the order the model will generate the output in. In the perfect situation, the order that gets you maximum accuracy is the same order you want to show things in your UI, and in most cases that will probably be true. But there will be cases — for example, if you want to show the year first — where having the model generate the year last actually helps it perform better. A very common one is a summary: typically, if you make a summary field the very last one in your struct, the model tends to do a little better, because it can look at everything else it has already generated when it writes the summary. If you put the summary at the very top, the model will generate the summary first and then try to expand that summary into everything below it. So you'll have to think carefully about how you design your UI along with the order you generate your properties in, and find a balance that gets you both the UI you want and the accuracy you need. Well, that's some very good insight — I hadn't thought about that. The order matters because that's the way the generation works. Awesome. Anyone else want to add to that or ask a follow-up before we move on? All good? Okay, awesome. Thanks, Shane, that was a good question.

Moving on, this is from Xavier, again about Foundation Models: I am developing an educational app about some ancient books. How can I load all the books' text into an Apple model? Ooh, this hints at the context window and related topics. Eric, do you want to take this? Yeah, so you'll have to get creative here. If you have an entire book's worth of text, you're not going to be able to fit that into 4,096 tokens, so you're going to have to do some kind of chunking. You can split the text up into windows and then look at each of those individually. You can also do overlapping windows, so there's some amount of overlap between each one, and you feed each one into a new session. There are techniques like recursive summarization: you break your book down into, say, five or six chunks, summarize each of those, stick the summaries together, and then perform some operation over that until it's a size you can work with. None of these is a bulletproof solution — none of them works in every instance. You're going to have to experiment with lots of different approaches, and have that eval set on hand so you can tell which ones are working over a representative distribution of the things your real users are going to do.
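Here is a rough sketch of the chunking and recursive-summarization idea, assuming the Foundation Models session API; the chunk size and overlap are illustrative stand-ins for a real token estimate.

```swift
import FoundationModels

// Split long text into overlapping character-based windows. Character counts
// stand in for a proper token estimate (see the heuristic earlier).
func chunks(of text: String, size: Int = 8_000, overlap: Int = 500) -> [String] {
    var result: [String] = []
    var start = text.startIndex
    while start < text.endIndex {
        let end = text.index(start, offsetBy: size, limitedBy: text.endIndex) ?? text.endIndex
        result.append(String(text[start..<end]))
        guard end < text.endIndex else { break }
        start = text.index(end, offsetBy: -overlap)
    }
    return result
}

// Recursive summarization: summarize each chunk in a fresh session, then
// summarize the combined summaries. Recurse again if that's still too long.
func summarize(book: String) async throws -> String {
    var summaries: [String] = []
    for chunk in chunks(of: book) {
        let session = LanguageModelSession()   // fresh context per chunk
        summaries.append(try await session.respond(to: "Summarize this passage:\n\(chunk)").content)
    }
    let session = LanguageModelSession()
    return try await session.respond(
        to: "Combine these notes into one concise summary:\n\(summaries.joined(separator: "\n"))"
    ).content
}
```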
Thanks, Eric. And I think this is true even for other, larger types of models: if your books are large enough, no model is going to be able to consume the whole thing, so semantic chunking and ways to query those chunks are common approaches. Anyone else want to add anything in the context of LLMs in general? Ronan? That's good. I just keep going back and forth on the question of what they actually want to do with the ancient books — do they actually need them all in the context for the response they're looking for? So there's the question of what the user interaction is, and whether there's an opportunity for tool calling: given some prompt, now we know the category of books, and a tool can go pull out the relevant sections and bring them back to the foundation model for it to continue its task. These retrieval-style approaches can also help — you don't have to put all of the knowledge up front in your prompt to answer an arbitrary question; you can stage it out. Right, that's a really good point, and I'm glad you brought it up. There are all kinds of search algorithms you can use to find the relevant part of the book, pull just that little bit in, and present to the model only the section that matters for answering the question. Awesome — good question. Maybe one thing I can add is that if you want to experiment with very large context, MLX supports this kind of thing — we support very long inputs, a huge number of tokens. So it's totally possible to do that with MLX, but the caveat is that it doesn't come with the OS; you'd have to use an open-source model for that. I would recommend you start with the foundation model before digging into that, but it's a possibility as well. Yeah, I love that, thanks, Ronan, because a lot of this is experimentation. Maybe you're just learning about the broader techniques before you hone in on the model-specific ones, and there are tools available to you: MLX is there, you can try how chunking works and how to pull out the relevant context and call the model, before you migrate that to a foundation-model approach. All these tools are available to you, and I think that's important. And if you are at the stage of fitting a whole book to a model, maybe what you need is more like a Mac — there you are less constrained by memory, and maybe it's fine to use an open-source model. Awesome, excellent. So let's move on, then — that was a very good question.

This is from Adam, about text to speech: were there any improvements made to the text-to-speech models, which were originally introduced in iOS 7? Michael, do you want to take this? Sure, I'll give it a shot. The models are always continually improving, both in terms of their quality and their performance. I believe there's a way for apps to include voice extensions to provide new voices to the system — it's a little outside my area of expertise, but that's the extent of my knowledge there. I think they hook into some of the system voices as well, and I think there's some connection to the accessibility Personal Voice, or voice banking, feature too. That's my best shot.
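For context, the iOS 7-era API the question refers to is AVSpeechSynthesizer, and a minimal sketch of speaking with a system voice looks like this — the language code is just an example.

```swift
import AVFoundation

// Keep the synthesizer alive for the duration of speech.
let synthesizer = AVSpeechSynthesizer()

func speak(_ text: String) {
    let utterance = AVSpeechUtterance(string: text)
    // Pick any installed system voice; "en-US" is an example.
    utterance.voice = AVSpeechSynthesisVoice(language: "en-US")
    utterance.rate = AVSpeechUtteranceDefaultSpeechRate
    synthesizer.speak(utterance)
}
```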
Awesome, thanks, Michael. Another question about Foundation Models — this one is about workflows and use cases, for Eric or Richard: is it possible to run RAG — retrieval-augmented generation — workloads on device, offline, using the Foundation Models framework? Is there an embeddings model packaged with the framework, or any approaches you could recommend? Wow, that was loaded — a few different things. Feel free to take your time, piece it apart, and answer all the components. So: is it possible to do RAG, retrieval-augmented generation? Is there an embedding model packaged with the framework? And what approaches do you recommend? Eric, do you want to take a swing at this one? So, I'll answer slightly out of order, starting with the second question about whether there's an embedding model: the answer is that there's not. The first question, then, was whether you can do RAG in the absence of an embedding model, and the answer is yes — but we don't provide built-in tools for it, like we do for running the LLM itself. So it'll be up to you to find a database that you can store vectors in. But given that you can embed the vectors and store them in an index, and you have a way to run a nearest-neighbors search or a cosine-distance search against that, it should be fairly simple to query relevant entries from a RAG database and then use those to populate your prompt when you're talking to the model. And this links back to the book example we were just talking about: that would be one of the creative solutions for how to deal with the book. You could chunk up the book, load it into a vector database, do a nearest-neighbors search to find the relevant passages, and then put as much as fits into the context to do question answering or things of that nature. With a model that has a relatively small context size compared to the giant 600-billion-parameter models out there, how you use RAG becomes very relevant. It's a good thing to explore and learn, because it's one of the ways you can really squeeze all the juice out of the model we're providing. To add to that a little bit: as mentioned, there's not an embedding model exposed by the Foundation Models framework, but the Natural Language framework has simple word embeddings and sentence embeddings, and that may get you some amount of the way there. If you search for — I think it's "Finding similarities between pieces of text" — there's an article on developer.apple.com that gives you the basic structure of, whether you're using the built-in embeddings or others, how to embed data and then find neighbors. That's an excellent point. And if I may add, this is an excellent workflow where you could use a combination of Foundation Models and Core ML. Say you have a favorite embedding model you like to use: you can bring that in as a model through Core ML and use it in conjunction with the Foundation Models API, so you have the embeddings. Core ML also supports tensor APIs that let you do basic things like querying and finding similarities.
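A small sketch of the Natural Language embedding route Michael mentions — embed a query and your stored chunks, then rank the chunks by distance. The chunking and prompt assembly around it are up to you.

```swift
import NaturalLanguage

// Rank stored text chunks by semantic similarity to a query using the
// built-in sentence embedding (lower distance = more similar).
func mostRelevantChunks(to query: String, in chunks: [String], top k: Int = 3) -> [String] {
    guard let embedding = NLEmbedding.sentenceEmbedding(for: .english) else { return [] }
    let scored = chunks.map { chunk in
        (chunk, embedding.distance(between: query, and: chunk))
    }
    return scored.sorted { $0.1 < $1.1 }.prefix(k).map { $0.0 }
}
```

The top-ranked chunks can then go straight into your prompt, which is the retrieval half of a RAG setup.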
So, yeah, I think you can be creative and build these apps using multiple of these tools together. Awesome. Anyone else want to add anything before we move on to the next question? All good? Awesome, let's move on. Before I take the next question, I just want to pause and say: these are awesome questions. Thank you so much for asking them, and please keep them coming. I know we have some more time to get through them, but they are amazing — I just want to say thank you.

This next question is about Foundation Models again: is there a rate limit on the AI APIs? Is it limited by power or temperature conditions on the iPhone, similar to how the camera API has thermal and processing limitations? Eric, do you want to take that? Yeah, of course. The answer is yes, there are rate limits that apply, and they apply particularly when your app is in the background. There's a budget allocated for your app, so when you swipe up and your app goes into the background, you're allowed to keep doing some work there, but there's a fixed number of requests your app will be allowed to run in the background, and if you do too many in a given day, you'll receive rate-limiting errors. It's a pretty generous budget — I really don't expect a lot of folks to hit it. When you're in the foreground, there is no rate limit and no budget; you can use as much as you want. The caveat, though, is that if the device is under heavy load, you won't be able to make requests — and that's not a rate limit, it's not because you've been sending too many requests, it's just because the system is under load. That might happen if you have the camera open, or if you're in Game Mode, for example. I'll add to that from the general system health perspective: whether you're running your own model through Core ML, or your own code, or you're using the ML-powered APIs in Vision and Natural Language, all of these things go through the core system, which is trying to balance when it uses performance cores or efficiency cores, clock rates, and thermal mitigations. In general it's trying to make sure the customer still has a good experience with their battery life and their device. So there's a dynamic control system on all of your devices finding that balance, and that affects everything — it's not just a rate limit, it's also the performance the device is going to provide. Heavy load in general is a complicated scenario. So make sure you're setting the right quality of service and other attributes where you're dispatching work: if it's background work, put it at a background priority; if it truly needs to be user-interactive, use that quality of service. Please use the appropriate system descriptions of the work you need to do, which helps the system balance where it gives performance and where it's willing to back off a little, to meet the overall user experience of a device that stays responsive and has long battery life. Yeah, that's a really good point, and it ties into the fact that we make an effort not to rate limit, because that's not great for your app or for the user. So we have a system where we slow down a little bit in order to keep your requests running.
So you may see that if you have a full battery, your phone is cool, and you're not doing anything else, you'll get really fast token throughput; but if you have a whole bunch of apps open, your battery is running low, and the phone is hot because it's been sitting in the sun, the number of tokens you get per second will go down. That's a sign the system is doing what it can to keep your app running without rate-limiting you. Thanks. Also, when trying playgrounds in Xcode, you might have seen a rate-limit error after some number of requests. That is just a bug — you can work around it by reopening the playground tab, which gets rid of the rate limit so you can continue to prompt-engineer and try the model on all sorts of prompts. Thanks, Richard — kudos for getting ahead of that. And thanks, Michael, for those best practices; they're generally applicable. Thanks, Eric, too — that was a good question.

Moving on, this is about Foundation Models again, from Alexander: does the foundation model support languages other than English? Richard, do you want to start? Yeah, the on-device foundation model is multilingual. It supports languages other than English — all the languages supported by Apple Intelligence. There are a few best practices for getting the language model to output in the language you want. For example, you can prompt the model in your instructions to say the user's preferred language is en-US, or fr-FR — you can use the Locale API from Foundation to get the language code, and the model was actually trained on these language codes, so if it recognizes a language code in the instructions, it will output in the language you specify. You can also use the system language model's supported-languages property to dynamically check for all the supported languages. On that note, some prompting best practices around other languages: you don't have to follow this rule — as with all things, test out a bunch of stuff and see what works best for your use case — but what we generally recommend is putting the instructions in English, and then putting the user prompt in whatever language you would like the model to respond in. We find that typically works best. Awesome, thanks, Eric and Richard.
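Putting Richard's and Eric's tips together, here is a minimal sketch: English instructions that carry the user's locale identifier, with the user prompt in the user's own language. The recipe-app framing and the French prompt are made up for illustration, and exact initializer signatures may vary.

```swift
import FoundationModels
import Foundation

func quickDinnerIdea() async throws -> String {
    // Check which languages the on-device model supports, if you need to gate the feature.
    _ = SystemLanguageModel.default.supportedLanguages

    // Keep instructions in English, but state the user's preferred language.
    let session = LanguageModelSession(instructions: """
        You are a helpful assistant in a recipe app.
        The user's preferred language is \(Locale.current.identifier). Respond in that language.
        """)

    // The prompt itself is in the user's language (French here, as an example).
    return try await session.respond(to: "Propose une idée de dîner rapide.").content
}
```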
Moving on, this is a good question — a little advanced, perhaps: will Apple let us create or train our own adapters, or LoRAs, to be used with Foundation Models? Before you jump into responding, see if you can explain what adapters and LoRA mean for everyone in the audience. Who wants to start, Richard or Eric? I'll start. Adapters — low-rank adaptation — are a technique that we use all across Apple Intelligence to support different use cases. We train a bunch of adapters for, for example, Writing Tools and other use cases; this was covered in last year's WWDC State of the Union. The technique works great for specialized data sets and specialized eval sets, and it can help reduce the amount of prompting needed to achieve a task. This year, we did release a Foundation Models adapter training toolkit — you can find it on the Apple Intelligence Foundation Models website — and you can use it to train your own custom adapter if you are an ML practitioner and have your own eval sets and training sets. But bear in mind, this comes with very significant responsibilities, because every time Apple updates the base model in a newer operating system, you will need to retrain your adapter — each adapter you train works with exactly one version of the base model. But if you're an ML practitioner and your use case is really, really specialized — you've tried the general-purpose base model you get access to through the Foundation Models framework and found that it doesn't handle your use case to the degree you expected — then this is an area you can try. Just keep in mind the responsibility to maintain that adapter and keep it up to date going forward. And it's a nice middle ground: you do still have to ship an asset with your app, and adapters work out to be about 150 to 160 megabytes. They're not small, but we do have a new framework this year — I think there are some WWDC sessions on it — for background asset delivery, and that integrates really nicely with the Foundation Models framework if you decide to go down that path. That's a good point. So, to summarize: only if you can't get the specific functionality you need using structured outputs, Generable, or RAG-type workflows with tool calling — if none of those satisfy your use case — go ahead with the adapter training toolkit, but you are responsible for maintaining compatibility with version updates.

Awesome, let's move on. We have a few more minutes, and there are a couple more good questions here. This one is again about Foundation Models, by Nikolai again: what is the latency and throughput of Foundation Models — roughly, what order of magnitude: tens or hundreds of milliseconds, seconds, tens of seconds? Any performance impact or limitations from frequent calls, for example no more than 10 hertz, or more than 1,000 tokens per second? Nikolai is really getting into the performance details. Eric, do you want to take this one? Yeah, this answer is really fun, because it's nuanced and we get to talk about some really cool optimizations. If you take a bare model with a very plain decoding loop and run prompts through it, you will get tokens out at a fixed speed — it might be 30 tokens per second, it might be 50; it depends on the size of the model and so on. What you'll find when you use the Foundation Models framework is that the speed you get tokens out at is dynamic and variable, and there's a good reason for that. It happens because of a technique called constrained decoding that we combine with another technique, speculative decoding. Put those keywords in your head and go look them up later, but I'll give a very short explanation of both now. The way speculative decoding works is that we actually have two versions of the model: a large one, called the target model, and a small one, called the draft model. Behind the scenes, we use the small draft model first and have it predict a couple of tokens into the future. Then we take the large target model and check its work, to see whether the draft predicted the same output the large model would have. That checking operation is very efficient — much faster than having the large model generate the output itself.
And if the draft model got the answer right, we get a huge speed-up. The draft model is able to get the answer right when the output is relatively predictable. So if you ask the model to count from one to ten, you're going to blaze — you'll get two to three times as many tokens per second. If you ask the model to do something really difficult, the draft model is going to miss a lot, you'll get a lower hit rate, and you'll get fewer tokens per second. So it all depends on what you're trying to do. Now, we pair that with something called constrained decoding, which is the technique that makes guided generation possible. The way that works is that when the model predicts the next token, it's not actually putting out one token — it's putting out a probability distribution over the entire vocabulary, something like 150,000 numbers, each of which corresponds to a single token. Louis talks about this a lot in our deep dive on the Foundation Models framework, in the guided generation part. But the short version is that we mask out any illegal token — any token that would result in bad grammar or malformed output gets taken out of the distribution, so the model isn't even allowed to sample it. And because we do that, it has multiplicative effects: even though the draft model is smaller and not as smart, because we mask out tokens that would be wrong, the probability of the draft model picking the right ones skyrockets. And if the draft model is right more often, we also get a speed-up. So between constrained decoding and speculative decoding, we get these very variable output speeds, but they are always faster than the model would be on its own. Those are the kinds of benefits you get from using the Foundation Models framework, as opposed to trying to implement your own decoding loop on top of an open-source model. That was a very in-depth answer, but I hope it gives you some flavor of what we're doing.
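To make the constrained-decoding half concrete, here is a toy illustration — emphatically not Apple's implementation — of masking a next-token distribution down to the legal choices and renormalizing. The speculative-decoding half is just the draft/target pairing Eric walked through, so it isn't sketched here.

```swift
// Toy model of constrained decoding: the model proposes a probability for
// every token in the vocabulary, and anything that would produce malformed
// output is masked out before sampling.
typealias Token = String

func constrain(_ distribution: [Token: Double], allowed: Set<Token>) -> [Token: Double] {
    // Drop illegal tokens, then renormalize what's left.
    let legal = distribution.filter { allowed.contains($0.key) }
    let total = legal.values.reduce(0, +)
    return total > 0 ? legal.mapValues { $0 / total } : [:]
}

// Example: while generating a Bool field of a @Generable type, only
// "true" or "false" can legally come next.
let masked = constrain(
    ["true": 0.2, "false": 0.1, "banana": 0.7],
    allowed: ["true", "false"]
)
// masked ≈ ["true": 0.67, "false": 0.33] — and the draft model's odds of
// matching the target model go up too, which is why the two techniques compound.
```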