Progress continues on the new image generation functionality. The Melbourne student teams have nearly finished and appear to have produced useful tools.
Improved image generation
I am still focussing on the new image generation functionality, and have reorganised the code to make it easy to run tests and experiment with different variants of the basic scheme.
It turns out that the critical issue is how to set things up so that the AI can determine whether a generated image is acceptable. This is a very interesting problem! Our initial naive idea was simply to ask GPT-4o to describe the image, and then ask o1-mini to compare the description with the specification sent to DALL-E-3. The problem with doing it this way is that if GPT-4o has no prior information about what it’s seeing, it may have difficulty interpreting the image.

I’m still testing with the story of the Crow and the Fox, where the Fox tricks the Crow into dropping its cheese. If the cheese is depicted as smooth and round, GPT-4o may not realise that it’s meant to be cheese, and may see it as a stone or a ball. I tried to get round this by telling it explicitly that the images are illustrations to this specific story. But then we get the converse problem, where GPT-4o may see things that aren’t there. For example, in this image,

it initially told me that the crow was holding the cheese in its beak, because it knows that’s what it’s supposed to be doing.
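To make the naive scheme concrete, here is a minimal sketch of the describe-then-compare check, assuming the standard openai Python package. The helper names and prompt wording are hypothetical, not the actual C-LARA code.

```python
# Minimal sketch of the naive describe-then-compare check.
# Helper names and prompts are illustrative; this is not C-LARA code.
from openai import OpenAI

client = OpenAI()

def describe_image(image_url: str) -> str:
    """Ask GPT-4o for a free-form description of the generated image."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in detail."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content

def evaluate_image(image_url: str, dalle3_specification: str) -> str:
    """Ask o1-mini to compare the description with the DALL-E-3 spec."""
    description = describe_image(image_url)
    response = client.chat.completions.create(
        model="o1-mini",
        messages=[{
            "role": "user",
            "content": (
                f"Image specification:\n{dalle3_specification}\n\n"
                f"Description of the generated image:\n{description}\n\n"
                "Does the image match the specification? "
                "Answer ACCEPT or REJECT, with a brief justification."
            ),
        }],
    )
    return response.choices[0].message.content
```

Because the describing model never sees the specification, it has no context to resolve ambiguous objects like the cheese, which is exactly the failure mode described above.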
I am not yet sure, but it seems that the way to solve the problem may be to break the description task down into a sequence of smaller questions. When I try this in the interactive version of GPT-4o, it works well. I start by asking which of the four main elements of the story we see here: the AI correctly says that we see the Crow, the Tree and the Cheese, but not the Fox. I then ask it what the relationship is between the Crow and the Tree, and it correctly says that the Crow is perched on a branch, gripping it with its claws. Finally, I ask for the relationship between the Crow and the Cheese, and now the AI gets it right:
In the image, le Corbeau (the crow) and le Fromage (the cheese) are positioned quite close to each other on the same tree branch.
- Le Corbeau is perched on the branch, facing toward the left.
- Le Fromage is placed to the right of the crow, resting on the same branch but slightly apart from the bird. The cheese is near the crow’s head, suggesting that it is within easy reach but not being held in the crow’s beak at the moment.
The positioning suggests that the crow is either about to take the cheese or has already placed it on the branch, but it hasn’t dropped it yet (which is a key moment in the story).
It should be straightforward to implement this strategy inside the image generation module. I will do that next and run some tests.
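To give an idea of what that might look like inside the module, here is a sketch of the staged-questioning strategy, again assuming the standard openai Python package; the element list and question wording are illustrative, not the module's actual code.

```python
# Sketch of the staged-questioning strategy; names and prompts are
# illustrative, not the actual image generation module code.
from openai import OpenAI

client = OpenAI()

ELEMENTS = ["the Crow", "the Fox", "the Tree", "the Cheese"]

def ask_about_image(image_url: str, question: str) -> str:
    """Put a single focused question about the image to GPT-4o."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content

def staged_description(image_url: str) -> dict:
    """Break the description task into a sequence of smaller questions."""
    answers = {}
    answers["elements"] = ask_about_image(
        image_url,
        "Which of these elements of the story do we see in the image: "
        + ", ".join(ELEMENTS) + "?")
    # A fuller version would only ask about pairs of elements that are
    # actually present; here we ask two fixed follow-up questions.
    answers["crow_tree"] = ask_about_image(
        image_url,
        "What is the relationship between the Crow and the Tree?")
    answers["crow_cheese"] = ask_about_image(
        image_url,
        "What is the relationship between the Crow and the Cheese?")
    return answers
```

The answers could then be passed to o1-mini for comparison with the specification, as in the naive scheme, but now each answer is grounded in a narrow question rather than a single open-ended description.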
Melbourne student projects
The Melbourne students have nearly finished. Alex Xiang, who has been their main supervisor, is very pleased with the results. I will soon be meeting with the two teams responsible for the Voice Recorder and Image Segmenter projects. The teams say both tools are working well in standalone mode, and I’m hoping that o1-preview and I will be able to integrate them into C-LARA without too much trouble.
Next Zoom call
The next call will be at:
Thu Oct 24 2024, 19:00 Adelaide (= 08:30 Iceland = 09:30 Ireland/Faroe Islands = 10:30 Europe = 11:30 Israel = 12:00 Iran = 16:30 China = 18:30 Melbourne = 19:30 New Caledonia)