Engineers Corner | June 16, 2026 | 6 min read

Machine Vision in 2026: The Rise of Foundation Models

Foundation models can handle vision tasks from a prompt—but ML fundamentals still matter for production-grade systems.

The assumption that you need labeled training data to build a vision system is breaking down. For organizations without ML infrastructure, today's AI powered by foundation models is the better starting point.

Can foundation models be used for object detection?

Yes, today they can. Foundation models can now be used for a wide variety of vision tasks. With a foundation model, you simply write your image analysis instructions in the prompt. In fact, these models are more versatile for vision tasks than dedicated machine vision models. Beware, though: if you do not approach this with a machine learning mindset, you may be unpleasantly surprised by the results.

Machine Vision has been dominated by traditional signal processing

Detecting and segmenting objects in images is what machine vision is all about. Countless engineering hours have been spent crafting preprocessing, edge detection, and shape recognition algorithms to make machines see. Machine vision is also one of the earliest applications of neural networks, and convolutional architectures in particular were the primary driver of the machine learning resurgence of the 2010s.

Signal processing and machine learning methods for object detection are also incredibly power efficient. We can deploy them on microcontrollers and edge devices, and they can run in real time with minimal computational resources.

Are we about to flush all that down the memory hole?

Foundation models are changing the game: the barrier to entry is low

A foundation model is one that you prompt, like an LLM, to do a task. Most such models these days accept image inputs (in addition to other modalities not discussed here). Some models can output different types of spatial or visual tokens, such as bounding box coordinates. These tokens have massively improved performance on vision tasks.

Some models, like Google's Gemini, have had this capability for a while. With 3.5 Flash, however, accuracy has improved greatly over previous generations, which is why we use it as the example here.

The really interesting part about foundation models for vision tasks is that you can write as detailed instructions as you want in the prompt. For example, you can ask for "checkered shirts", or "items that need quality inspection". These tasks would have previously required a pipeline of methods, starting from generic bounding boxes and moving to more specific downstream tasks.

Due to the massive datasets foundation models have been trained on, they may not be sensitive to image capture devices, lighting conditions, or other factors that could, without careful preprocessing, cause traditional models to fail. In this sense you get some robustness that you would have to engineer for with traditional models.

Let's look at some examples of how you can use a foundation model like Gemini for vision tasks. Here, I am instructing Gemini to detect object bounding boxes and respond in a JSON format containing coordinates and class labels. Then I list what to detect from each image.
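The consuming side of such a prompt can be sketched roughly as follows. This is a minimal sketch, not the code used for the examples in this post. It assumes the model replies with a JSON list in which each item has a `label` and a `box_2d` of `[ymin, xmin, ymax, xmax]` normalized to 0-1000, which is the convention Gemini's documentation describes for bounding boxes; verify it against the model version you use. The prompt wording and the names `build_prompt`, `parse_detections` and `Detection` are illustrative.

```python
import json
from dataclasses import dataclass

# Hypothetical prompt wording; adjust to the model and task at hand.
PROMPT_TEMPLATE = (
    "Detect the following objects in the image: {targets}. "
    "Respond with a JSON list where each item has a 'label' and a 'box_2d' "
    "given as [ymin, xmin, ymax, xmax] normalized to the range 0-1000."
)


@dataclass(frozen=True)
class Detection:
    label: str
    box: tuple  # (x_min, y_min, x_max, y_max) in pixels


def build_prompt(targets):
    """Fill the prompt template with the object names to look for."""
    return PROMPT_TEMPLATE.format(targets=", ".join(targets))


def _to_pixels(value, size):
    """Scale a 0-1000 coordinate to pixels and clamp it to the image."""
    return max(0, min(size, round(value / 1000 * size)))


def parse_detections(raw, width, height):
    """Parse the model's text reply into pixel-space detections."""
    # Replies may wrap the JSON in extra text or a markdown fence,
    # so keep only the outermost JSON list.
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end <= start:
        raise ValueError("no JSON list found in model response")
    items = json.loads(raw[start:end + 1])

    detections = []
    for item in items:
        # Assumed order: [ymin, xmin, ymax, xmax] on a 0-1000 scale.
        ymin, xmin, ymax, xmax = item["box_2d"]
        box = (
            _to_pixels(xmin, width), _to_pixels(ymin, height),
            _to_pixels(xmax, width), _to_pixels(ymax, height),
        )
        detections.append(Detection(label=item["label"], box=box))
    return detections
```

The text returned by `build_prompt(["signs", "bottles", "grapes"])` would be sent together with the image, and the reply passed to `parse_detections` along with the image width and height. Treating a reply without a JSON list as an error, rather than as "nothing detected", keeps malformed responses visible.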

In the first example, we prompt for signs, bottles and grapes. With a traditional model, this task would require us to collect data on quite different objects across many contexts. Out of the box, Gemini finds objects of all types and the results are pretty good. Surprisingly good.

Gemini detecting signs, bottles and grapes with bounding boxes.

In the second example, we detect drumsticks. The sticks are especially tough for any vision system because of their small size and motion blur. Gemini is able to pick up the most prominent drumsticks, even in motion, while some in the background are missed.

Gemini detecting drumsticks, even in motion.

This is an interesting example because it benefits from giving extra clues to the foundation model, such as looking for thin objects that are near hands or drums. We can even hook up another LLM or use a more agentic workflow to reason about the images before asking for the coordinates. All of this is possible because of the versatility of foundation models and their multimodal training methods.
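As a rough illustration of both ideas, the sketch below adds clues to the detection prompt and runs a reasoning pass before asking for coordinates. It is a sketch under stated assumptions: `ask` is a hypothetical stand-in for whatever function sends a prompt, together with the image, to the model and returns its text reply. The prompt wording and function names are made up for this example.

```python
def hinted_prompt(target, hints):
    """Build a detection prompt that includes extra clues for the model."""
    lines = [f"Detect every instance of: {target}."]
    if hints:
        lines.append("Clues that may help:")
        lines.extend(f"- {hint}" for hint in hints)
    lines.append("Respond with JSON bounding boxes and class labels.")
    return "\n".join(lines)


def reason_then_detect(ask, target, hints):
    """Two-pass workflow: describe the scene first, then ask for boxes.

    `ask` is a hypothetical callable taking a prompt string and returning
    the model's text reply for the image under analysis.
    """
    # Pass 1: let the model reason about the scene in free text.
    scene_notes = ask(
        f"Describe where any {target} appear in this image "
        "and what they are near."
    )
    # Pass 2: ask for coordinates, feeding the notes back in as context.
    prompt = (
        hinted_prompt(target, hints)
        + "\nScene notes from a previous pass:\n"
        + scene_notes
    )
    return ask(prompt)
```

For the drumstick image, the hints could be something like `["thin objects", "near hands or drums"]`. The second pass costs an extra model call, so it is worth checking against an evaluation set whether it actually improves the results.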

Downsides of foundation models for machine vision tasks

Even if you are on to a promising start with a model like Gemini, solving challenging vision tasks with little effort, do consider the following.

Firstly, the price per detection is much higher than with traditional models. You'll pay pennies per inference, but a dedicated model can run for a fraction of that cost. Granted, some of that cost can be offset by quicker time to market and less engineering effort, but over time the price per detection dominates overall costs.
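To make the trade-off concrete, here is a small break-even sketch. Every figure in it is a made-up placeholder rather than real pricing: plug in your own engineering cost estimates and per-inference prices.

```python
import math


def cumulative_cost(upfront, per_inference, inferences):
    """Total cost after a given number of inferences."""
    return upfront + per_inference * inferences


def break_even_inferences(fm_upfront, fm_per_inference,
                          dedicated_upfront, dedicated_per_inference):
    """Inference count from which the dedicated model is cheaper overall.

    Returns None if the foundation model is never more expensive per
    inference, in which case there is no crossover point.
    """
    saving_per_inference = fm_per_inference - dedicated_per_inference
    if saving_per_inference <= 0:
        return None
    extra_upfront = dedicated_upfront - fm_upfront
    if extra_upfront <= 0:
        return 0
    return math.ceil(extra_upfront / saving_per_inference)


# Hypothetical numbers, for illustration only.
n = break_even_inferences(
    fm_upfront=5_000, fm_per_inference=0.01,
    dedicated_upfront=50_000, dedicated_per_inference=0.0005,
)
print(f"Dedicated model pays off after about {n:,} inferences")
```

With these placeholder numbers the crossover lands at roughly 4.7 million inferences. A low-volume application may never reach it, while a camera running around the clock can pass it quickly.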

Secondly, there is the lack of deployment options. Let's just call it hardware limitations. You cannot easily deploy foundation models of this scale on edge devices. And even if you could, the latency and power consumption would rule out many applications. For a SaaS application this may not be a problem if throughput requirements are low and the application does not need real-time responses. For anything else, a dedicated machine vision algorithm is likely the better choice.

All in all, you need to consider if the initial velocity and versatility of foundation models outweigh the long-term costs and limitations.

You cannot build a great vision system without machine learning fundamentals

So where is the wall that foundation model approaches easily hit? Well, you still should not disregard the fundamentals of machine learning. Foundation models are probabilistic in nature. That means sometimes they work, sometimes they don't. And when they don't, your options are really just to prompt better or wait for the next generation of models (and hope there is not a price hike with it). With a signal processing pipeline, you can go and tweak the parameters of individual steps and push for that extra bit of performance. It won't be easy, but at least you have a proven path and time-tested methodology to follow.

Like with any machine learning deployment, you will want to have a sense of the success rate and failure modes of your system before you go to production. With ML systems this is about test and validation data sets. With LLMs we've called those "evaluation sets" or evals for short. No matter the term, both will give you some number to lean on when making deployment plans and R&D investment decisions.
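One common way to get such a number for bounding boxes is to match predictions against hand-labeled boxes by intersection over union (IoU) and report precision and recall. The sketch below assumes boxes are `(x_min, y_min, x_max, y_max)` tuples, uses a greedy one-to-one matching and an IoU threshold of 0.5. These are illustrative choices, not the evaluation used for the examples in this post.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_image(predicted, expected, threshold=0.5):
    """Greedily match predictions to labels; returns (tp, fp, fn).

    Both arguments are lists of (label, box) pairs for one image.
    """
    unmatched = list(expected)
    true_positives = 0
    for label, box in predicted:
        best_index, best_score = None, threshold
        for index, (exp_label, exp_box) in enumerate(unmatched):
            score = iou(box, exp_box)
            if exp_label == label and score >= best_score:
                best_index, best_score = index, score
        if best_index is not None:
            unmatched.pop(best_index)  # each label can be matched only once
            true_positives += 1
    false_positives = len(predicted) - true_positives
    false_negatives = len(unmatched)
    return true_positives, false_positives, false_negatives


def evaluate(eval_set, threshold=0.5):
    """Aggregate precision and recall over (predicted, expected) pairs."""
    tp = fp = fn = 0
    for predicted, expected in eval_set:
        image_tp, image_fp, image_fn = match_image(predicted, expected, threshold)
        tp, fp, fn = tp + image_tp, fp + image_fp, fn + image_fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall}
```

Running `evaluate` on the same labeled images after every prompt change or model upgrade shows whether the change actually helped, instead of relying on a handful of eyeballed examples.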

Naturally, this measurement should not stop at deployment. Continuous monitoring is vital for all systems, especially AI and machine learning systems. You'll want to collect all data where the system fails, preferably with some context, and study it to understand the failure modes.
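A minimal sketch of such failure collection is shown below. It assumes the model reply is expected to be a JSON list of objects with a four-element `box_2d` field, and it writes one JSON line per failure to any file-like `sink`. The field names, the failure reasons and the decision to flag empty results for review are all assumptions to adapt to your application.

```python
import json
from datetime import datetime, timezone


def failure_reason(raw_response):
    """Return a short reason if the reply looks unusable, else None."""
    try:
        items = json.loads(raw_response)
    except json.JSONDecodeError:
        return "invalid_json"
    # Assumption: an empty result is worth a human look in this application.
    if not isinstance(items, list) or not items:
        return "no_detections"
    for item in items:
        box = item.get("box_2d") if isinstance(item, dict) else None
        if not (isinstance(box, list) and len(box) == 4):
            return "malformed_box"
    return None


def log_if_failed(sink, image_id, prompt, raw_response):
    """Append a JSON line with context when the reply fails the checks.

    `sink` is any object with a write() method, such as an open file.
    Returns True if a failure was recorded.
    """
    reason = failure_reason(raw_response)
    if reason is None:
        return False
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "image_id": image_id,
        "prompt": prompt,
        "raw_response": raw_response,
        "reason": reason,
    }
    sink.write(json.dumps(entry) + "\n")
    return True
```

Structural checks like these only catch replies that are obviously broken. Wrong but well-formed boxes still need sampling and manual review, or user feedback, to surface.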

In conclusion, I've been very impressed by the progress of foundation models on vision tasks. The capability to output different kinds of visual tokens for tasks like bounding box detection is a great addition to multimodal models: it boosts their abilities beyond what an LLM with text output alone could achieve. Just don't be fooled by the ease of getting started. Building a robust vision system takes effort, even with foundation models.

Author
Mikko Lehtimäki
Co-founder, Applied AI Engineer