LexSet - Lexical Set Analysis

This is one of my longest-running and most useful projects. I do a lot of research on accents for my main blog. This often involves looking at particular words that change predictably between dialects. One of the most useful concepts for this is the 'lexical set', invented by John Wells. For example, if we know that a word belongs to the FLEECE lexical set, we know that word will be pronounced [i] in American English and [Ii] in Cockney English.

I often scan a text for particular lexical sets, but I'm not perfect and I miss them a lot. This became a big problem for me with the AI-monophthongization project.

I discovered that there is a dataset of English words with phonemic representations. That discovery spurred the creation of this project: an automated tool that scans a text for words matching a certain phoneme and then highlights them.

This took a while to make, and as always, there are ways I'd like to improve it. Enjoy some logs from when I was first working on it in 2018.

DevLog 1

I'm currently working on a simple site that will allow you to input some text, and then select a phoneme to search for. You will get a list of all the words in the text that have that phoneme according to a dictionary I have.

After having a proof of concept that I could look up lexical sets using the ARPAbet dictionary, work on the project stalled because I wasn't sure what I could do with it. All I did was log to the console "found X word in Y line" for each word that matched.
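That proof of concept might have looked something like this. This is a minimal sketch, not the actual project code: the tiny `dict` object here is a hypothetical stand-in for the real ARPAbet pronouncing dictionary (word to phoneme list, stress digits stripped), and the function name is illustrative.

```javascript
// Hypothetical stand-in for the ARPAbet dictionary data.
const dict = {
  line: ['L', 'AY', 'N'],
  word: ['W', 'ER', 'D'],
  the:  ['DH', 'AH'],
  with: ['W', 'IH', 'DH'],
};

// Log "found X word in Y line" for every match, and collect the matches.
function scanText(text, phoneme) {
  const found = [];
  text.split('\n').forEach((line, i) => {
    const words = line.toLowerCase().match(/[a-z']+/g) || [];
    for (const w of words) {
      if (dict[w] && dict[w].includes(phoneme)) {
        console.log(`found ${w} in line ${i + 1}`);
        found.push({ line, word: w });
      }
    }
  });
  return found;
}

// 'AY' is the ARPAbet vowel corresponding to the PRICE set.
scanText('the line with the word', 'AY');
```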

This might be useful if I had to look at one very large piece of text that I couldn't be bothered to read or listen through all the way, but that's not the type of work I usually do.

I usually have a lot of smaller pieces of text to look through (e.g. a whole artist's discography). Going one by one like this would still take a lot of time. And even if I automated putting all the songs in, the output was still pretty tragic. I decided to formalize it.

My first idea was to do something like this:

        const resultsObject = {
          line: 'the line with the word',
          word: 'line',
          lexicalSet: 'PRICE'
        };
        

And push each one to a results array, so I could cycle through the array and get all the results.

But when I tried implementing that in Angular, I realized I had a problem. It worked well the first time you used it, giving you the result:

'the line with the word' - line.

But if I switched from looking for 'PRICE' words to looking for 'NURSE' words, I would get this:

'the line with the word' - line
'the line with the word' - word

Oh no, I'm dumping the whole results array. Not what I wanted, but not a surprise. But what if I only want it to show results for the current word I'm selecting, while still keeping the prior results in the array? (Otherwise I could just clear the array before each call.)

One way I could do this would be to go through each member of the resultsArray, check if the resultsObject.lexicalSet is the same as the current one, and display it if it is. But imagine a very long resultsArray... not sure why, but perhaps I became intensely interested in the phonemic distribution in 'War and Peace.' You would have to go through every single member of the array. How wasteful! Not a very scalable solution, is it? (Whether it's the best or worst case, you have to walk every single element of the array; in Big O terms that's O(n), linear in the number of results, which is not great.)
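To make that linear scan concrete, here is a sketch of the flat-array approach. The data and function names are illustrative, not from the actual project; the point is that the lookup touches every stored result no matter which set you ask for.

```javascript
// Flat results array: one entry per match, tagged with its lexical set.
const resultsArray = [
  { line: 'the line with the word', word: 'line', lexicalSet: 'PRICE' },
  { line: 'the line with the word', word: 'word', lexicalSet: 'NURSE' },
];

// O(n): filter walks the entire array on every lookup.
function resultsForSet(results, lexicalSet) {
  return results.filter(r => r.lexicalSet === lexicalSet);
}

resultsForSet(resultsArray, 'PRICE'); // only the 'line' entry survives
```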

But what if we changed the resultsArray and resultsObject? What if instead of storing lines, I first stored lexical sets, and then inside those lexical sets I stored the lines?

        let resultsObject = {
          lexicalSet: 'PRICE',
          linesList: [
            { line: 'here is a line',
              word: 'line'
            },
            { line: 'feels right',
              word: 'right'
            }
          ]
        };
        

So we've changed our conceptualization of the type of result we want to store from being 'a list of lines and the lexical sets they contain' to 'a list of lexical sets, and the lines that contain their members.' This is useful because I'm not researching lines of songs as much as I am researching lexical sets. The thing I want to find is instances of an artist saying a 'PRICE' word.

Now we can change how we look up lines. I can say 'find this lexical set in the array, and return all the lines in it' instead of 'go through every line to see if it has this lexical set.' You now have to check way fewer things.
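Here is a sketch of that lookup, assuming the lexical sets are stored as the keys of a plain object so each one maps straight to its bucket of lines. The names are illustrative, not from the actual project.

```javascript
// Inverted structure: lexical set -> the lines that contain its members.
const resultsBySet = {
  PRICE: [
    { line: 'here is a line', word: 'line' },
    { line: 'feels right', word: 'right' },
  ],
  NURSE: [
    { line: 'the line with the word', word: 'word' },
  ],
};

// Jumps straight to the requested bucket instead of scanning every line.
function linesForSet(results, lexicalSet) {
  return results[lexicalSet] || [];
}

linesForSet(resultsBySet, 'PRICE'); // both PRICE lines, nothing else checked
```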

Going back to our 'War and Peace' example: with the old way, you would have to check thousands of lines to then check for a limited number of lexical sets. With this way, you only have to check 16-ish lexical sets. In the worst case scenario, you have skipped 15 other lexical sets (that may have thousands of lines in them) and the sixteenth will contain just the lines you want. Far fewer lines to check. This ought to be faster and scale better!

At least, for this specific example. But what if I wanted to find out all the lexical sets in a line? Then I would be back at the beginning, having to look inside every lexical set for that line. (Probably still better than the original way, though.) Perhaps I should not foolishly insist on storing everything in a JavaScript object.

In any case, regardless of whether this is the 'best' way (it's not), I think this is still an illustrative example of how the way you organize data affects which algorithms are even possible. And it's also precisely because it's so simple that I think it's a good way for a beginner to visualize the consequences of how they choose to structure their data.

The question of how to structure my data for this project I'm working on is definitely one I've labored over for months. Should I link lexical sets to a diaphonemic representation? What about an actual phonemic representation? How can I balance speed with ease of readability for human understanding? I mean, ideally it would be nice to share this project with other dialect lovers like me, but to do that I would have to make it friendly for them.

DevLog 2

The first version of the project has you copy/paste a text in and then type out the lexical set you want. It doesn't have all the John Wells lexical sets available yet, but it has enough that you could make do. This worked pretty well. If I wanted to find out whether a particular lexical set was in a text, it would be easy enough to do.

But this still has some noticeable problems:

I know I have a tendency to let my dreams get ahead of what can be done in reality, so I decided to limit the goals to this:

Some longer-term goals I have for this project: