Over the last couple of weeks, the AI and I took some time out to do a study on using GPT-4 to play Tic-Tac-Toe. It may sound frivolous, but it has provided some remarkably useful insights into how to do GPT-4 prompting better. I’ve also received a lot of comments from C-LARA users, both from our Icelandic colleagues and more recently from the people who attended the hands-on sessions at our recent workshop. Based on all that, I’ve put together a list of suggestions for what we should prioritise over the next phase of the project. I suggest we discuss it over email and/or at the next few Thursday Zoom meetings.
Here’s the preliminary list. I’ve divided it into three categories: fundamental issues, Simple C-LARA, and small-but-important.
Fundamental issues
The Tic-Tac-Toe study was inspired by remarks in Aschenbrenner’s already famous essay Situational Awareness, and uses several techniques mentioned there. They all turned out to perform well in the study, and my feeling is that we can use them to rewrite C-LARA’s annotation framework and make it both faster and more accurate:
- Chain of Thought. Basically, instead of just telling GPT-4 to do something, you tell it to think aloud about the issues and then do it; this often gives much more accurate results. We are already using CoT for Multi-Word Expression analysis. I think we should also use it for glossing, lemma tagging, segmentation and CEFR determination. (There is a sketch after this list showing how CoT, voting and parallelism might fit together.)
- Fitting few-shot examples to the current task. For many kinds of GPT-4 tasks, you need to give a small number of examples to guide the AI. At the moment, when we e.g. gloss in a particular language, we always use the same set of examples. In the Tic-Tac-Toe experiment, we found we got better performance when we picked examples taken from positions similar to the current one. It seems likely to me that the same thing will happen with linguistic annotation.
- Voting. As everyone knows, GPT-4 often makes random mistakes. A well-known method for reducing randomness, which gave good results in Tic-Tac-Toe, is to run the task multiple times and use a voting scheme to combine the results. If the AI says the same thing twice, it’s less likely to be a random glitch. In Tic-Tac-Toe, voting was trivial: the controller process just picked the move that had been recommended most often. In C-LARA, implementing the scheme is slightly more complicated, since results generally need to be combined at the word level.
- Parallelism. The above methods come at a cost: you do a lot more querying, so things get much slower. In Tic-Tac-Toe, we addressed this by restructuring the code so that many independent GPT-4 queries could be run in parallel. It turns out that this is quite easy to implement. In C-LARA annotation, we could for example annotate all the segments in a text simultaneously. [done]
- Reinforcement learning of few-shot examples. This was the most interesting and ambitious technique we used in Tic-Tac-Toe: we let the AI improve its few-shot examples by seeing which ones work, then feeding them into the next cycle. It worked well in the context of this simple game, where we know what “success” is.
We can do the same thing with C-LARA annotation if we have a success criterion. For some tasks, there are resources we can use in the form of carefully annotated corpora that the AI can train on.
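To make these ideas concrete, here is a minimal sketch of how CoT, parallel querying and word-level voting might fit together for glossing. It is written against the official openai Python package (AsyncOpenAI); the prompt wording, the GLOSSES: marker and the model name are illustrative assumptions, not what C-LARA actually does:

```python
# Minimal sketch: CoT prompt, n parallel queries, majority vote per word.
import asyncio
from collections import Counter

from openai import AsyncOpenAI

client = AsyncOpenAI()

COT_PROMPT = """You are glossing a segment of {l2} text for {l1} readers.
First think step by step about any ambiguous words or multi-word expressions.
Then write the line GLOSSES: followed by one line per word, in the form
word: gloss

Segment: {segment}"""


async def gloss_once(segment: str, l2: str, l1: str) -> dict:
    """Run a single chain-of-thought glossing query and parse the result."""
    response = await client.chat.completions.create(
        model="gpt-4o",  # assumption: any GPT-4-class model
        messages=[{"role": "user",
                   "content": COT_PROMPT.format(l2=l2, l1=l1, segment=segment)}],
        temperature=1.0,  # deliberate variation between voters
    )
    # Keep only the part after the GLOSSES: marker, skipping the CoT protocol.
    tail = response.choices[0].message.content.partition("GLOSSES:")[2]
    glosses = {}
    for line in tail.splitlines():
        if ":" in line:
            word, gloss = line.split(":", 1)
            glosses[word.strip()] = gloss.strip()
    return glosses


async def gloss_with_voting(segment: str, l2: str, l1: str,
                            n_votes: int = 5) -> dict:
    """Issue n_votes independent queries in parallel, then take the
    majority gloss for each word."""
    runs = await asyncio.gather(
        *(gloss_once(segment, l2, l1) for _ in range(n_votes)))
    words = {word for run in runs for word in run}
    return {word: Counter(run[word] for run in runs
                          if word in run).most_common(1)[0][0]
            for word in words}

# e.g. asyncio.run(gloss_with_voting("Le chat dort.", "French", "English"))
```

The few-shot and reinforcement learning ideas would slot in naturally here: the controller keeps a pool of scored examples, inserts the ones most similar to the current segment into the prompt, and updates the scores according to whether the resulting annotations pass whatever success criterion we adopt.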
We also have some other fundamental issues:
- Automatic checking of images. People who’ve tried using the coherent image set functionality find they spend a lot of time regenerating obviously bad images. We want the AI to take over some of this job, looking at the images and rejecting the ones which clearly don’t match the prompt (see the sketch after this list).
- Splitting up the segmentation phase. The segmentation phase is very unreliable. It will probably work better if we split it into two sub-phases: 1) division of text into pages and segments, and 2) division of segments into words. For (1), the user will say what kind of text it is, choosing from a menu which has items like story, essay, poem, picture book, etc.
- Automatic checking of inconsistencies between different text versions. When using Advanced C-LARA, it’s all too easy to make careless mistakes when revising a text and discover that the different versions are out of sync. The AI should be able to catch many of these errors if we set things up so that it can compare the versions and critique them. [done]
- Exercises. As we did in the LARA project, we want to be able to create simple exercises (flashcards, fill-in-the-blanks) from the text. It would also be good to generate pre- and post-reading tests.
- Text/audio alignment. We should port over and integrate the text/audio alignment method we developed under LARA.
- Integration of bilingual lexica. For languages which aren’t supported by the AI, we should make it possible to upload bilingual lexica and use them for glossing. Such lexica are available for several of the Kanak languages our New Caledonian collaborators are working with.
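As a concrete illustration of the image-checking idea above, here is a minimal sketch using a vision-capable GPT-4 model through the openai package. The prompt wording and the VERDICT convention are hypothetical, and the real C-LARA image pipeline may look quite different:

```python
# Minimal sketch: ask a vision model whether a generated image matches its prompt.
from openai import OpenAI

client = OpenAI()

def image_matches_prompt(image_url: str, image_prompt: str) -> bool:
    """Return True if the model judges the image a clear match for the prompt."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any vision-capable GPT-4-class model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("This image was generated from the prompt below. "
                          "Does it clearly match the prompt? Answer with "
                          "VERDICT: YES or VERDICT: NO.\n\n"
                          f"Prompt: {image_prompt}")},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return "VERDICT: YES" in response.choices[0].message.content.upper()
```

The controller would then regenerate an image until the checker accepts it, with a retry cap to bound cost.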
Simple C-LARA
It’s unfortunately clear that Advanced C-LARA just isn’t very user-friendly. There are too many different functionalities, and it’s easy to get lost. We need to take the functionalities currently only available in Advanced C-LARA and make them available in Simple C-LARA too. In particular, we want the following:
- Segment translations.
- Multi-Word Expressions.
- Coherent image sets.
For this to work acceptably, we first need to implement some of the functionality in the “fundamental” group. In particular, I think we need parallelism, otherwise things will be too slow. Simple C-LARA needs to be reasonably responsive.
Small but important
There are several small but extremely irritating issues that have been hanging around too long. Fixing them would make the platform much nicer to use.
- Downloadable versions of text. We want to be able to download the text in various versions, so that we can use it without access to the internet. For example: a plain text version, a complete mp3 version, a standalone multimedia version.
- Control playing of segment audio. An annoying bug is that we can easily end up playing two pieces of segment audio at the same time. We should cancel playback of the first one before starting the second. This should just be a small piece of JavaScript.
- Search function in phonetic lexicon window. We need a search function in the phonetic lexicon editing window, so that errors in phonetic texts can easily be corrected.
- Correct integration of phonetic texts and TTS. It’s currently possible for TTS word audio to be different from the IPA in the “phonetic” version. This can be corrected by generating the audio directly from the IPA, a functionality which is supported in Google TTS (see the sketch after this list).
- Correct integration of phonetic texts and pinyin. Similarly, for Mandarin, the TTS should be consistent with the pinyin.
- Heteronyms. A related but more complex issue is heteronyms: words that are spelled the same but pronounced differently. GPT-4 can probably disambiguate heteronyms well, though finding the right way to do this may require some tuning.
A particularly important case is the ubiquitous English word ‘the’, which has two pronunciations. When we showed a C-LARA phonetic text to a primary school teacher, she immediately mentioned this.
- Highlighting MWEs. It’s pedagogically important to highlight all the components of an MWE when we click on one of them. This should be a relatively simple piece of JavaScript. [done]
- Audio for MWEs. Similarly, clicking on part of an MWE should play audio for the whole MWE. [done]
- Frequency counts for MWEs. As pointed out by Francis, frequency counts for MWEs are currently wrong: we count components rather than MWEs. [done]
- Displaying MWEs in Advanced C-LARA. It would be convenient if the MWE view in Advanced C-LARA could optionally display only the MWEs, hiding the Chain of Thought protocols used to generate them. [done]
- Archive images and prompts. We should store the different versions of each image, together with the prompts used to generate them, and make it possible to revert to an earlier version. [done]
- Links to inflection tables. As in the LARA project, we should optionally include links to inflection tables and other online dictionary information for languages where this is available. [done]
- “Play-all”. We should have a control to play all the audio on a page. We had this in LARA and people used it constantly. [done]
- “Back to project” link. A small thing that significantly contributes to Advanced C-LARA’s lack of user friendliness is that the “Back to project” link is not well placed. You are constantly scrolling to the bottom to find it. Again, easy to fix. [done]
- Access counters. The metadata for a text posted in the social network should contain a counter showing the number of unique users who have accessed it. [done]
- Check segmentation of title. As many people have noticed, there is a weird bug that sometimes adds a lot of text to the segmented title. It is easy for C-LARA to check for this and regenerate if necessary. [done]
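As a concrete illustration of the IPA-driven TTS idea in the “Correct integration of phonetic texts and TTS” item above, here is a minimal sketch using the google-cloud-texttospeech package and the SSML phoneme tag. Leaving the voice unspecified is an assumption, and phoneme support varies by voice:

```python
# Minimal sketch: synthesise word audio directly from IPA via Google Cloud TTS.
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

def synthesize_from_ipa(word: str, ipa: str,
                        language_code: str = "en-US") -> bytes:
    """Return MP3 audio for `word`, pronounced according to `ipa`."""
    ssml = f'<speak><phoneme alphabet="ipa" ph="{ipa}">{word}</phoneme></speak>'
    response = client.synthesize_speech(
        input=texttospeech.SynthesisInput(ssml=ssml),
        voice=texttospeech.VoiceSelectionParams(language_code=language_code),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.MP3),
    )
    return response.audio_content
```

Note that this also covers the ‘the’ example from the heteronyms item: synthesize_from_ipa("the", "ðə") and synthesize_from_ipa("the", "ði") give the two pronunciations.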
On the principle of going for the low-hanging fruit first, I will start by fixing some of the obvious small things. Please send feedback on the less trivial issues!