From hallucinations to hardware: Lessons from a real-world computer vision project gone sideways


Computer vision projects rarely go exactly as planned, and this one was no exception. The idea was simple: build a model that could look at a picture of a laptop and identify any physical damage, things like cracked screens, missing keys or broken hinges. It seemed like a straightforward use case for image models and large language models (LLMs), but it quickly turned into something more complicated.

Along the way, we ran into hallucinations, unreliable outputs and images that were not even laptops. To solve these problems, we ended up applying an agentic framework in an atypical way: not for task automation, but to improve the model's performance.

In this article, we will walk through what we tried, what did not work and how a combination of approaches ultimately helped us build something reliable.

Where we started: monolithic prompting

Our initial approach was fairly standard for a multimodal model. We used a single, large prompt to pass an image to an image-capable LLM and asked it to identify visible damage. This monolithic prompting strategy is simple to implement and works decently for clean, well-defined tasks. But real-world data rarely plays along.
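
To make the starting point concrete, here is a minimal sketch of that monolithic call. The article never names the model or SDK, so the OpenAI Python client and the gpt-4o model name below are stand-ins, and the prompt wording is illustrative:

```python
import base64

from openai import OpenAI  # stand-in SDK; the article does not name the model or client

client = OpenAI()

def inspect_laptop(image_path: str) -> str:
    """One big prompt: hand the whole damage-detection job to a single image-capable LLM."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("You are inspecting a laptop for physical damage. "
                          "List any visible damage such as cracked screens, "
                          "missing keys or broken hinges. If nothing is damaged, say so.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

print(inspect_laptop("laptop.jpg"))
```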

We encountered three major problems from the start:

  • Hallucinations: The model sometimes invented damage that did not exist or misread what it was looking at.
  • Junk image detection: There was no reliable way to flag images that were not even laptops, such as photos of desks, walls or people, which sometimes slipped through and received nonsensical damage reports.
  • Inconsistent results: The combination of these problems made the model too unreliable for operational use.

That was the point where it became clear we would need to iterate.

First fix: mixing image resolutions

One thing we noticed was how much image quality affected the model's output. Users uploaded all kinds of images, ranging from sharp, high-resolution shots to blurry ones. That led us to research highlighting how image resolution impacts deep learning models.

We trained and tested the model using a mix of high- and low-resolution images. The idea was to make the model more resilient to the wide range of image qualities it would encounter in practice. This helped improve consistency, but the core problems of hallucinations and junk images persisted.
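
The article doesn't describe how the mixed-resolution set was built, but one simple way to simulate it is to pair each original photo with downsampled, re-compressed copies. A sketch using Pillow, with arbitrary target widths:

```python
from pathlib import Path

from PIL import Image  # pip install pillow

def add_low_res_variants(src_dir: str, dst_dir: str, low_widths=(320, 640)) -> None:
    """Copy each training image and add downsampled variants so the set spans
    sharp, high-resolution photos and the blurrier, low-resolution ones users upload."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for path in Path(src_dir).glob("*.jpg"):
        img = Image.open(path).convert("RGB")
        img.save(dst / path.name)  # keep the original as-is
        for width in low_widths:
            height = round(img.height * width / img.width)  # preserve aspect ratio
            low = img.resize((width, height), Image.BILINEAR)
            low.save(dst / f"{path.stem}_w{width}.jpg", quality=70)  # add mild JPEG loss too
```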

The multimodal detour: text-only LLM goes multimodal

Encouraged by recent experiments combining image captioning with text-only LLMs, such as the technique covered in The Batch, where captions are generated from images and then interpreted by a language model, we decided to give it a try.

Here's how it works (a minimal code sketch follows the list):

  • The LLM begins by generating several possible captions for an image.
  • Another model, a multimodal embedding model, checks how well each caption matches the image. In our case, we used SigLIP to score the similarity between image and text.
  • The system keeps the top few captions based on these scores.
  • The LLM uses those top captions to write new ones, trying to get closer to what the image actually shows.
  • It repeats this process until the captions stop improving or a set limit is reached.
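
A minimal sketch of the scoring step in that loop, using SigLIP through Hugging Face transformers. The generate_captions function here is a hypothetical stub standing in for the LLM captioner, and the checkpoint name and iteration cap are illustrative:

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor  # pip install transformers

# SigLIP scores how well each candidate caption matches the image.
siglip = AutoModel.from_pretrained("google/siglip-base-patch16-224")
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")

def top_captions(image: Image.Image, captions: list[str], keep: int = 3) -> list[str]:
    """Rank candidate captions by image-text similarity and keep the best few."""
    inputs = processor(text=captions, images=image,
                       padding="max_length", return_tensors="pt")
    with torch.no_grad():
        scores = siglip(**inputs).logits_per_image[0]  # one score per caption
    best = scores.topk(min(keep, len(captions))).indices.tolist()
    return [captions[i] for i in best]

def generate_captions(image, seed_captions=None) -> list[str]:
    """Hypothetical stub for the LLM captioner; a real version would prompt an
    LLM, optionally conditioning on the best captions from the previous round."""
    return ["a laptop with a cracked screen",
            "an open laptop on a desk",
            "a laptop keyboard with missing keys"]

image = Image.open("laptop.jpg").convert("RGB")
captions = generate_captions(image)
for _ in range(5):  # fixed iteration cap instead of a convergence check
    captions = generate_captions(image, seed_captions=top_captions(image, captions))
```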

Although clever in theory, this approach introduced new problems for our use case:

  • Persistent hallucinations: The captions themselves sometimes included imaginary damage, which the LLM then reported with confidence.
  • Incomplete coverage: Even with several captions, some issues were missed entirely.
  • Increased complexity, little benefit: The added steps made the system more complex without reliably outperforming the previous setup.

It was an interesting experiment, but ultimately not a solution.

A creative use of agentic frameworks

This was the turning point. Although agentic frameworks are typically used for orchestrating task flows (think of agents coordinating calendar invites or customer service actions), we wondered whether breaking image interpretation down into smaller, specialized agents could help.

We built an agentic framework structured like this (a simplified sketch in code follows the list):

  • Orchestrator agent: It checked the image and identified which laptop components were visible (screen, keyboard, chassis, ports).
  • Component agents: Dedicated agents inspected each component for specific types of damage; for example, one for cracked screens, another for missing keys.
  • Junk detection agent: A separate agent flagged whether the image was even of a laptop in the first place.
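
Here is one way these agents might fit together. The prompts, component list and model name are illustrative, and ask_vision_llm is a hypothetical helper wrapping the same kind of image-plus-prompt call as the earlier monolithic sketch:

```python
import base64

from openai import OpenAI  # stand-in SDK, as in the earlier sketch

client = OpenAI()

def ask_vision_llm(image_b64: str, prompt: str) -> str:
    """Send one image and one narrow question to an image-capable LLM."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model name
        messages=[{"role": "user", "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ]}],
    )
    return response.choices[0].message.content

DAMAGE_PROMPTS = {  # one narrow, focused question per component agent
    "screen":   "Is the screen cracked or visibly damaged? Answer briefly.",
    "keyboard": "Are any keys missing or broken? Answer briefly.",
    "chassis":  "Is the chassis dented, cracked or warped? Answer briefly.",
    "ports":    "Are any ports visibly damaged? Answer briefly.",
}

def inspect(image_b64: str) -> dict:
    # Junk-detection agent: bail out early if this is not a laptop at all.
    is_laptop = ask_vision_llm(image_b64, "Is this a photo of a laptop? Answer yes or no.")
    if "yes" not in is_laptop.lower():
        return {"is_laptop": False, "findings": {}}

    # Orchestrator agent: decide which components are actually visible.
    visible = ask_vision_llm(
        image_b64,
        "Which of these are visible: screen, keyboard, chassis, ports? "
        "Answer with a comma-separated list.",
    ).lower()

    # Component agents: each inspects a single component for specific damage.
    findings = {part: ask_vision_llm(image_b64, prompt)
                for part, prompt in DAMAGE_PROMPTS.items() if part in visible}
    return {"is_laptop": True, "findings": findings}
```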

This modular, task-focused approach produced far more precise and explainable results. Hallucinations dropped dramatically, junk images were flagged reliably and each agent's task was simple and focused enough to control quality properly.

Blind spots: the trade-offs of an agentic approach

As effective as this approach was, it was not perfect. Two main limitations emerged:

  • Increased latency: Running several agents in sequence added to the total inference time.
  • Coverage gaps: The agents could only detect issues they were explicitly built to look for. If an image showed something unexpected that no agent was tasked with identifying, it would go unnoticed.

We needed a way to balance precision with coverage.

The hybrid solution: combining agentic and monolithic approaches

To fill the gaps, we created a hybrid system (sketched in code after the list):

  1. The agentic framework ran first, handling precise detection of known damage types and junk images. We limited the agents to the most essential ones to improve latency.
  2. Then a monolithic image prompt scanned the image for anything the agents might have missed.
  3. Finally, we fine-tuned the model on a curated set of images for high-priority use cases, such as frequently reported damage scenarios, to further improve accuracy and reliability.
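
Putting the first two steps together at inference time might look like the sketch below, reusing the hypothetical inspect and ask_vision_llm helpers from the agentic sketch; the fine-tuning in step 3 happens offline, so it does not appear in this path:

```python
def hybrid_inspect(image_b64: str) -> dict:
    """Hybrid pipeline: run the trimmed-down agentic pass first, then one
    broad monolithic sweep for anything the narrow agents were not built to see."""
    # Step 1: agentic pass handles junk detection and known damage types.
    result = inspect(image_b64)
    if not result["is_laptop"]:
        return result  # junk image; no point running the sweep

    # Step 2: monolithic catch-all prompt for unexpected damage.
    result["other_observations"] = ask_vision_llm(
        image_b64,
        "Describe any other visible physical damage on this laptop, beyond "
        "the screen, keyboard, chassis and ports.",
    )
    return result
```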

This combination gave us the precision and explainability of the agentic setup, the broad coverage of monolithic prompting and the confidence boost of targeted fine-tuning.

What we learned

A few things became clear by the time we finished this project:

  • Agentic frameworks are more versatile than they get credit for: Although they are usually associated with workflow management, we found they can meaningfully boost model performance when applied in a structured, modular way.
  • Mixing approaches beats relying on a single one: Combining precise agent-based detection with the broad coverage of LLMs, plus a bit of fine-tuning where it mattered most, gave us far more reliable results than any single method on its own.
  • Vision models are prone to hallucinations: Even the most advanced setups can jump to conclusions or see things that are not there. It takes thoughtful system design to keep those errors in check.
  • Image quality variety makes a difference: Training and testing with both sharp, high-resolution images and everyday, lower-quality ones helped the model stay resilient against unpredictable, real-world photos.
  • You need a way to catch junk images: A dedicated check for junk or unrelated images was one of the simplest changes we made, and it had an outsized impact on overall system reliability.

Final thoughts

What started as a simple idea, using an LLM prompt to detect physical damage in laptop images, quickly turned into a much deeper experiment in combining different AI techniques to solve unpredictable, real-world problems. Along the way, we realized that some of the most useful tools were ones that were never originally designed for this kind of work.

Agentic frameworks, often thought of as workflow utilities, proved surprisingly effective when repurposed for tasks like structured damage detection and image filtering. With a little creativity, they helped us build a system that was not only more accurate, but also easier to understand and maintain in practice.

Shruti Tiwari is an AI product manager at Dell Technologies.

Vadiraj Kulkarni is a data scientist at Dell Technologies.


