Computer vision projects rarely go exactly as planned, and this one was no exception. The idea was simple: build a model that could look at a picture of a laptop and identify any physical damage, such as cracked screens, missing keys or broken hinges. It seemed like a straightforward use case for image models and large language models (LLMs), but it quickly turned into something more complicated.
Along the way, we ran into hallucinations, unreliable outputs and junk images that were not even laptops. To solve these problems, we ended up applying an agentic framework in an atypical way: not for task automation, but to improve the model's performance.
In this article, we will walk through what we tried, what did not work and how a combination of approaches ultimately helped us build something reliable.
Our initial approach was standard for a multimodal model: we passed an image to an image-capable LLM with a single, large prompt and asked it to identify visible damage. This monolithic prompting strategy is simple to implement and works decently for clean, well-defined tasks. But real-world data rarely plays along.
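As a rough sketch, the monolithic approach amounts to packing one do-everything prompt and the image into a single request. The prompt wording, the model name and the request shape below are our illustrative assumptions (modeled on a typical multimodal chat API), not the exact production code:

```python
import base64
from pathlib import Path

# One "do everything" prompt -- the monolithic strategy described above.
DAMAGE_PROMPT = (
    "You are inspecting a photo of a laptop. "
    "List any visible physical damage (cracked screen, missing keys, "
    "broken hinges, dents). If the image is not a laptop, say so."
)

def build_request(image_path: str, model: str = "gpt-4o") -> dict:
    """Package the image and the monolithic prompt into one chat request."""
    b64 = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": DAMAGE_PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    }
```

The appeal is obvious: one call, one prompt, no pipeline to maintain. The downside, as we found, is that every failure mode lands on that one prompt.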
We encountered three major problems from the start:
This was the point where it became clear we would need to iterate.
One thing we noticed was how much image quality affected the model's output. Users uploaded all kinds of images, ranging from sharp and high-resolution to blurry. This led us to research highlighting how image resolution impacts deep learning models.
We trained and tested the model using a mix of high- and low-resolution images. The idea was to make the model more resilient to the wide range of image qualities it would encounter in practice. This helped improve consistency, but the core problems of hallucination and junk images persisted.
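One common way to build such a high/low-resolution training mix is to randomly degrade some images before training. The sketch below is our illustration of that idea using nearest-neighbor down/upsampling on a plain 2D pixel grid; a real pipeline would typically use PIL or torchvision transforms instead:

```python
import random

def downsample(image, k):
    """Keep every k-th pixel in both axes (nearest-neighbor downscale)."""
    return [row[::k] for row in image[::k]]

def upsample(image, k):
    """Repeat each pixel k times in both axes (nearest-neighbor upscale)."""
    return [
        [px for px in row for _ in range(k)]
        for row in image
        for _ in range(k)
    ]

def random_resolution_augment(image, rng=random, factors=(1, 2, 4)):
    """Randomly degrade an image so the training mix contains both sharp
    and blurry examples, mirroring the high/low-resolution mix above."""
    k = rng.choice(factors)
    if k == 1:
        return image  # leave some images at full quality
    # Downscale then upscale back: same size, visibly lower detail.
    return upsample(downsample(image, k), k)
```

Applying this to a fraction of the training set exposes the model to the blurry, low-detail photos it will see from real users, without collecting new data.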
Encouraged by recent experiments combining image captioning with text-only LLMs, such as the technique covered in The Batch, where captions are generated from images and then interpreted by a language model, we decided to give it a try.
Here’s how it works:
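In outline, a vision model first turns the image into a caption, and only that text reaches the language model. A minimal sketch of the two-step pipeline (the caption model and LLM are injected as plain callables here; the function names and prompt wording are ours):

```python
def describe_then_assess(image, caption_model, text_llm):
    """Step 1: a vision model turns the image into a text caption.
    Step 2: a text-only LLM reads the caption and judges damage.
    The image itself never reaches the language model -- which is
    exactly how fine-grained damage can get lost in translation."""
    caption = caption_model(image)
    verdict = text_llm(
        f"A laptop photo was described as: {caption!r}\n"
        "Based only on this description, list any physical damage."
    )
    return caption, verdict
```

The weakness is visible in the structure itself: any damage the captioner fails to mention is invisible to the LLM, no matter how capable it is.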
Although clever in theory, this approach introduced new problems for our use case:
It was an interesting experiment, but ultimately not a solution.
This was the turning point. Although agentic frameworks are typically used for orchestrating task flows (think of agents coordinating calendar invites or customer service actions), we wondered whether decomposing image interpretation into smaller, specialized agents could help.
We built an agentic framework structured like this:
This modular, task-focused approach produced much more accurate and explainable results. Hallucinations dropped dramatically, junk images were flagged reliably, and each agent's task was simple and focused enough to control quality properly.
As effective as this approach was, it was not perfect. Two main limitations emerged:
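A minimal sketch of such a decomposition, assuming a gatekeeper agent that rejects non-laptop images and per-component damage agents. The agent names, the `"ok"`-vs-damage return convention and the orchestration logic are our illustrative assumptions, not the production design:

```python
from dataclasses import dataclass, field

@dataclass
class Report:
    is_laptop: bool
    findings: list = field(default_factory=list)

def run_pipeline(image, is_laptop_agent, damage_agents):
    """Orchestrator: a gatekeeper agent first filters out junk images,
    then each specialized agent inspects one component (screen,
    keyboard, hinges, ...) and contributes its own finding."""
    if not is_laptop_agent(image):
        return Report(is_laptop=False)
    findings = []
    for component, agent in damage_agents.items():
        result = agent(image)  # e.g. "cracked" or "ok"
        if result != "ok":
            findings.append(f"{component}: {result}")
    return Report(is_laptop=True, findings=findings)
```

Because each agent answers one narrow question, a wrong answer is easy to localize and fix, which is what made the results explainable.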
We needed a way to balance precision with coverage.
To fill the gaps, we created a hybrid system:
This combination gave us the accuracy and explainability of the agentic setup, the broad coverage of monolithic prompting and the confidence gains of targeted fine-tuning.
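One way to picture such a hybrid is confidence-based routing: try the precise agentic pipeline first, and fall back to the broad single-prompt model when it is unsure. The routing rule, the confidence threshold and the callable signatures below are our illustrative assumptions about how the pieces could be wired together:

```python
def hybrid_assess(image, agentic_assess, monolithic_assess, threshold=0.8):
    """Route through the precise agentic pipeline first; if its confidence
    is low (say, a damage type its agents do not cover), fall back to the
    broad monolithic prompt. The 0.8 threshold is illustrative."""
    findings, confidence = agentic_assess(image)
    if confidence >= threshold:
        return {"source": "agentic", "findings": findings}
    return {"source": "monolithic", "findings": monolithic_assess(image)}
```

The trade-off is explicit in the code: high-confidence cases keep the agentic pipeline's explainability, while edge cases still get an answer instead of a refusal.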
A few things became clear by the time we finished this project:
What started as a simple idea, using an LLM prompt to detect physical damage in laptop images, quickly turned into a much deeper experiment in combining different AI techniques to solve unpredictable, real-world problems. Along the way, we realized that some of the most useful tools were ones not originally designed for this kind of work.
Agentic frameworks, often regarded as workflow orchestration tools, proved surprisingly effective when repurposed for tasks such as structured damage detection and image filtering. With a little creativity, they helped us build a system that was not only more accurate, but easier to understand and maintain in practice.
Shruti Tiwari is AI product manager at Dell Technologies.
Vadiraj Kulkarni is a data scientist at Dell Technologies.