From cost and performance specs to advanced capabilities and quirks, answers to these questions will help you determine the right model for your use case.

Car buyers kick tires. Horse traders inspect the teeth. What should shoppers for large language models (LLMs) do? Here are 27 pressing questions that developers are asking before they adopt a particular model. Model capabilities are diverse, and not every application requires the same support. These questions will help you identify the best models for your job.

What is the size of the model?

The number of parameters is a rough estimate of how much information is already encoded in the model. Some problems want to leverage this information; the prompts will be looking for facts that might be in the training corpus. Other problems won't need larger models. Perhaps plenty of information will be added from a retrieval-augmented generation (RAG) database. Perhaps the questions will be simpler. If you can anticipate the general size of the questions, you can choose the smallest model that will satisfy them.

Does the model fit in your hardware?

Anyone who will be hosting their own models needs to pay attention to how well they run on the hardware at hand. Finding more RAM or GPUs is always a chore and sometimes impossible. If the model doesn't fit or run smoothly on the hardware, it can't be a solution.

What is the time to first token?

There are multiple ways to measure the speed of an LLM. The time to first token, or TTFT, is important for real-time, interactive applications where the end user will be daydreaming while waiting for an answer to appear on the screen. Some models start the response faster but then poke along. Others take longer to begin responding. If you're going to be using the LLM in the background or as a batch job, this number isn't as important.

Are there rate limits?

All combinations of models and hardware have a speed limit.
If you’re supplying the hardware, you can establish the maximum load through testing. If you’re using an API, the provider will probably put rate limits on how many tokens it can process for you. If your project needs more, you’ll either need to buy more hardware or look for a different provider.

What is the size of the context window?

How big is your question? Some problems, like refactoring a large code base, require feeding millions of tokens into the machine. A smaller model with a limited context window won’t do; it will forget the first part of the prompt before it gets to the end. If your problem fits into a smaller prompt, then you can get by with a smaller context window and a simpler model.

How does the model balance reasoning with speed?

Model developers can add stages where the models attempt to reason or think about the problem on a meta level. This is often called “reasoning,” although it generally means that the models will iterate through a variety of approaches until they find an answer that seems good enough. In practice, there’s a tradeoff between reasoning and speed: more iterations mean slower responses. Is the “reasoning” worth it? It all depends upon the problem.

How stable is the model?

On certain prompts, some models are more likely to fail than others. They’ll start off with an answer but diverge into some dark statistical madness, spewing random words and gibberish. Much of the time they’ll offer correct answers, but the instability can appear at random once the model is already running in production.

When did training end?

The “knowledge cutoff” is the last day when the training set for the model stopped getting an injection of new information. If you’re going to be relying on the general facts embedded in the model, then you’ll want to know how they age.
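When the cutoff is too old for the job, a common workaround is to inject current documents into the prompt at query time. A toy sketch of that retrieval-augmented pattern, with a hypothetical word-overlap retriever standing in for a real vector database:

```python
def retrieve(query: str, store: dict[str, str], k: int = 2) -> list[str]:
    """Toy retriever: rank stored snippets by word overlap with the query.
    A production system would use embeddings and a vector database."""
    words = set(query.lower().split())
    ranked = sorted(store.values(),
                    key=lambda text: -len(words & set(text.lower().split())))
    return ranked[:k]

def build_prompt(query: str, store: dict[str, str]) -> str:
    """Prepend retrieved context so answers aren't limited to the cutoff."""
    context = "\n".join(retrieve(query, store))
    return f"Context:\n{context}\n\nQuestion: {query}"
```

The model then answers from the supplied context rather than from whatever it memorized before training ended.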
Not all projects need a current date, though, because some use other documents in a RAG system or vector database to add more details to the prompt.

Is additional training possible?

Some LLM providers support another round of training, usually on the customer’s domain-specific data sets. This fine-tuning can teach a foundation model the details that let it take up a place in a particular workflow or data assembly line. Fine-tuning is often dramatically cheaper and faster than building an entirely new model from scratch.

Which media types are supported?

Some models only return text. Some return images. Some are trained to do something else entirely. The same goes for input. Some can read a text prompt. Some can examine an image file and parse charts or PDFs. Some are smart enough to unpack strange file types. Just make sure the LLM can listen and speak in the file formats you need.

What is the prompting structure?

The structure of the prompt can make a difference with many models. Some pay particular attention to instructions in the system prompt. Others are moving to a more interactive, Socratic style of prompting that allows the user and the LLM to converge upon the answer. Some encourage the LLM to adopt the personas of famous people. The best way to prompt iterative, agentic thought is still a very active research topic.

Is the model open source?

Some models have been released with open source licenses that bring many of the same freedoms as open source software. Projects that need to run in controlled environments can fire up these models inside their own space and avoid trusting online services. Some users will want to fine-tune the models, and open source models give them direct access to the model weights.

Is there a guaranteed lifespan?

If the model is not open source, the creators may shut it down at any time.
Some services offer assurance that the model will have a set lifespan and will be supported for a predictable amount of time. This allows developers to be sure that the rug won’t be pulled out from beneath their feet soon after integrating the model with their stack. Whereas earlier versions of open source models remain available, the ongoing availability of proprietary models is determined by the owners. What happens to old versions that have been retired? Most of us are happier with their replacements, but those who have grown to rely on them are out of luck. Some providers of proprietary models have promised to release the model weights on retirement, an option that keeps the model available even though it’s not fully open source.

Does the model have a batch architecture?

If the answer is not needed in real time, some LLMs can process the prompts in delayed batches. Many model hosts will offer large discounts for the option to answer at some later time when demand is lower. Some inference engines can offer continuous batching with techniques like PagedAttention or finer-grained scheduling. All of these techniques can lower costs by boosting the throughput of the hardware.

What is the cost?

In some situations, price is very important, especially when some tasks will be repeated many times. While the cost of one answer may be fractions of a cent, those fractions add up. On big data assembly lines, downgrading to a cheap option can make the difference between financial success and failure. In other jobs, the price won’t matter. Maybe the prompt will only be run a few times. Maybe the price is much lower than the value of the job. Scrimping on the LLM makes little sense in these cases because spending extra for a bigger, fancier model won’t break the budget.

Was the model trained on synthetic data?

Some LLMs are trained on synthetic data created by other models. When things go correctly, the model doesn’t absorb any false bits of information.
But when things go wrong, the models can lose precision. Some draw an analogy to the way that copies of copies of copies grow blurred and lose detail. Others compare the process to audio feedback between an amplifier and a microphone.

Is the training set copyrighted?

Some LLM creators cut corners when building their training sets by including pirated books. Anthropic, for example, has announced a settlement of a class action lawsuit over books that are still under copyright. Other lawsuits are still pending. The claim is that the models may produce something close to the copyrighted material when prompted the right way. If your use cases may end up asking for answers that could turn up plagiarized or pirated material, you should look for assurances about how the training set was chosen.

Is there a provenance audit?

Some developers are addressing the questions about synthetic data and copyright by offering a third-party audit of their training sets. This can answer questions and alleviate worries about future infringement issues.

Does the model come with indemnification?

Does the contract guarantee that the answers won’t infringe upon copyright or include personal information? Some companies are confident that their training data is clean enough that they’re able to offer contractual indemnification for customers.

Do we know the environmental impacts?

This usually means how much electricity and water is consumed to produce an answer. Some services are offering estimates that they hope will distinguish their services from others that are more wasteful. In general, price is not a bad proxy for environmental impact, because electricity and water are direct costs and often some of the greatest ones. Developers have a natural incentive to use less of both.

Is the hardware powered by renewable energy?

Did the power come from a clean source?
Some services are partnering directly with renewable energy providers so that they can promise that the energy used to construct an answer came from solar or wind farms. In some cases, they’re offering batch services that queue up the queries until the renewable sources are online.

Does the model have compliance issues?

Some developers who work in highly regulated environments need to worry about access to their data. These developers will need to review how standards like SOC 2, HIPAA, and GDPR, among others, affect how the model can be used. In many cases, the model needs to be fired up in a controlled environment. In some cases, the problem is more complex. Some regulations require “transparency” in certain decisions, meaning that the model will need to explain how it came to a conclusion. This is often one of the most complicated questions to answer.

Where does the model run?

Some of the regulations are tied directly to location. Some of the GDPR regulations, for instance, require that personal data from Europeans stay in Europe. Geopolitics and national borders also affect legal questions on a number of issues like taxes, libel, or privacy. If your use case strays into these areas, the physical location of the LLM may be important. Some services are setting up regional deployments just to resolve these questions.

Does the model support human help?

Some developers are explicitly building in places for humans inside the reasoning loop of the LLM. These “human-in-the-loop” solutions make it possible to stop an LLM from delivering a flawed or dangerous answer. Finding the best architectural structure for these hooks can be tricky because they can create too much labor if they’re triggered too often.

Does the model support tool use?

Some models and services allow their models to use outside features for searching the internet, looking in a database, or calling an arbitrary function.
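Under the hood, tool use usually means the model emits a structured call and the host program executes it, feeding the result back for the next turn. A minimal dispatch sketch, where the JSON shape and the `TOOLS` registry are illustrative assumptions rather than any particular vendor's or MCP's wire format:

```python
import json

# Hypothetical tool registry; real systems advertise these to the model
# through a schema (e.g., via MCP or a provider's function-calling API).
TOOLS = {
    "add": lambda a, b: a + b,
    "lookup": lambda key: {"capital_of_france": "Paris"}.get(key, "unknown"),
}

def dispatch(model_output: str) -> str:
    """Parse a JSON tool call emitted by the model, run it, and package
    the result so it can be fed back into the model's next turn."""
    call = json.loads(model_output)
    result = TOOLS[call["tool"]](*call["args"])
    return json.dumps({"tool": call["tool"], "result": result})
```

In a real loop, the host keeps alternating between model turns and dispatches like this until the model produces a final answer instead of a tool call.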
These functions can really help with problems that need to leverage data found in outside sources. There is a large collection of tools and interfaces that use APIs like the Model Context Protocol (MCP). It’s worth experimenting with them to determine how stable they are.

Is the model agentic?

There may be no bigger buzzword right now, and that’s because everyone is using it to describe how they’re adding more reasoning capabilities to their models. Sometimes this means that a constellation of LLMs works together, often choreographed by some other set of LLMs. Does it mean smarter? Maybe. Better? Only you can tell.

What are the model’s quirks?

Anyone who spends some time with an LLM starts to learn its quirks. It’s almost like they’ve learned everything they know from fallible humans. One model gives different answers if there are two spaces after a period instead of one. Another model sounds pretentious. Most are annoyingly sycophantic. Anyone choosing an LLM must spend some time with the model and get a feel for whether the quirks will end up being endearing, annoying, or worse.