C-LARA

An AI collaborates with humans to build a language learning app.


Weekly summary, Aug 22-28 2024

We are making progress on adapting the reinforcement learning/Chain of Thought idea to C-LARA and on support for non-AI languages. The Melbourne students are continuing to develop their projects, with encouraging results. The Palgrave Encyclopaedia article will be published elsewhere.

Priority list

Reinforcement learning and Chain of Thought for MWEs. I have been discussing with Francis and the AI how to adapt the reinforcement learning/Chain of Thought method from the Tic-Tac-Toe paper to the task of annotating multi-word expressions. We have made considerable progress:

  • Francis has created an initial set of 14 examples illustrating the most important types of MWEs in English. The intention is that we will use this to expand the initial set of few-shot examples when we start the learning process (“priming the pump”).
  • I made a small adjustment to the MWE processing so that we can correctly handle MWEs with hyphens. After doing this, C-LARA is able to process the list. With the current very simple few-shot examples, it scores 12/14.
  • Francis sent me the MWE-annotated Sherlock Holmes corpus he and his students have created, about 1200 sentences. The AI wrote a script to convert this into a C-LARA compatible form.
  • The AI and I have been reorganising the annotation code for the purposes of the experiment. We now have functionality that lets us input a list of sentence/few-shot-example-list pairs and MWE-annotate each sentence using its few-shot examples. This is the core of the learning process.
  • We have been looking at OpenAI embeddings as a possible way to compare sentences in order to find few-shot examples created from sentences similar to the current one.
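The embedding-based selection idea in the last bullet can be sketched roughly as follows. This is an illustration of the general technique, not C-LARA's actual implementation: each candidate few-shot example is paired with an embedding of its source sentence (in practice the vectors would come from an embeddings API such as OpenAI's text-embedding-3-small; here they are plain lists of floats), and for a new sentence we pick the k examples whose sentences are most similar by cosine similarity.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_few_shot_examples(sentence_embedding, candidates, k=5):
    """candidates: list of (example, embedding) pairs, where each embedding
    is the vector for the sentence the example was created from.
    Returns the k examples whose source sentences are most similar to the
    sentence being annotated."""
    ranked = sorted(candidates,
                    key=lambda pair: cosine_similarity(sentence_embedding, pair[1]),
                    reverse=True)
    return [example for example, _ in ranked[:k]]
```

The selected examples would then be spliced into the annotation prompt in place of the current fixed few-shot set.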

We should soon be in a position to try an initial learning experiment, probably this weekend.

Support for non-AI languages. I met up with Sophie Rendina to do another session using the new support for non-AI languages. It makes things a good deal easier, though we’re not there yet; the next step will be for the AI to correct inconsistencies. This should be fairly easy to implement.

Encyclopaedia article

The Palgrave Encyclopaedia refused to accept an AI author for our Encyclopaedia article. We have withdrawn it, and will instead publish it in two versions: a minimally modified preprint on ResearchGate, and a more substantially rewritten version for the EUROCALL proceedings.

Melbourne students

I have continued to talk with the students at Melbourne Uni about their projects. The Image Annotator group have been experimenting with the Segment Anything 2 model and getting promising results: it now seems likely that they will be able to use it. The Music group have been experimenting with using the Whisper speech recogniser on songs generated by Suno and Udio. They say it works well, which means we should be able to use automatic text/audio alignment methods. I will port over the method we implemented in our ALTA 2022 paper.
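To give a flavour of what text/audio alignment involves: once the recogniser has produced a word sequence with timestamps, those timestamps can be transferred onto the known song text by aligning the two word sequences. The sketch below uses Python's difflib for the alignment; it is a minimal illustration, not the actual ALTA 2022 method, and the data shapes are assumptions.

```python
import difflib

def align_words(recognised, lyrics_words):
    """recognised: list of (word, start_time, end_time) tuples from the
    speech recogniser. lyrics_words: list of reference words from the song
    text. Returns a list of (lyrics_word, start_time, end_time); lyrics
    words the matcher could not pair with a recognised word get None
    timestamps."""
    rec_words = [w.lower() for w, _, _ in recognised]
    ref_words = [w.lower() for w in lyrics_words]
    matcher = difflib.SequenceMatcher(a=ref_words, b=rec_words, autojunk=False)
    timings = {}
    for i, j, size in matcher.get_matching_blocks():
        for offset in range(size):
            _, start, end = recognised[j + offset]
            timings[i + offset] = (start, end)
    return [(word, *timings.get(i, (None, None)))
            for i, word in enumerate(lyrics_words)]
```

In practice the gaps (recognition errors, unmatched words) would be interpolated from the surrounding matched words, which is where most of the real work lies.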

If these projects can indeed produce usable results, and right now the signs are good, they will add major functionality to C-LARA.

Next Zoom call

The next call will be at:

Thu Aug 29 2024, 18:00 Adelaide (= 08:30 Iceland = 09:30 Ireland/Faroe Islands = 10:30 Europe = 11:30 Israel = 12:00 Iran = 16:30 China = 18:30 Melbourne = 19:30 New Caledonia)
