LexSet - Lexical Set Analysis
This is one of my longest-running and most useful projects. I do a lot of research on accents for my main blog, which often involves looking at particular words whose pronunciation changes predictably across dialects. One of the most useful concepts for this is the 'lexical set', invented by John Wells. For example, if we know that a word belongs to the FLEECE lexical set, we know that it will be pronounced [i] in American English and [ɪi] in Cockney English.
I often scan texts for particular lexical sets by eye, but I'm not perfect and I miss a lot of them. This became a big problem for me with the AI-monophthongization project.
I discovered that there is a dataset of English words with phonemic representations. That discovery spurred the creation of this project: an automated tool for scanning a text for words that match a certain phoneme and then highlighting them somehow.
This took a while to make, and as always, there are ways I'd like to improve it. Enjoy some logs from when I was first working on it in 2018.
DevLog 1
I'm currently working on a simple site that will allow you to input some text, and then select a phoneme to search for. You will get a list of all the words in the text that have that phoneme according to a dictionary I have.
After having a proof of concept that I could look up lexical sets using the ARPAbet dictionary, work on the project stalled because I wasn't sure what I could do with it. All I did was log to the console "found X word in Y line" for each word that matched.
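That proof of concept, in miniature, looked conceptually like this (a sketch, not the actual code: the set-to-phoneme mapping is a small excerpt, and the dictionary entries and `arpabet` field shape are my own illustration):

```typescript
// each Wells lexical set corresponds to an ARPAbet vowel symbol (excerpt)
const lexicalSetToArpabet: Record<string, string> = {
  FLEECE: 'IY',
  PRICE: 'AY',
  NURSE: 'ER',
};

// dictionary entries in CMU/ARPAbet style (illustrative shape)
const dictionary: Record<string, { arpabet: string }> = {
  LINE: { arpabet: 'L AY1 N' },
  WORD: { arpabet: 'W ER1 D' },
};

// a word belongs to a set if its transcription contains the set's vowel
const isInSet = (word: string, set: string): boolean =>
  dictionary[word.toUpperCase()]?.arpabet.includes(lexicalSetToArpabet[set]) ?? false;

console.log(isInSet('line', 'PRICE')); // true
console.log(isInSet('word', 'PRICE')); // false
```

Console-logging each hit from a loop over something like `isInSet` is essentially all that first version did.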
This might be useful if I had to look at one very large piece of text I couldn't be bothered to read/listen through all the way, but that's not the type of work I usually do.
I usually have a lot of smaller pieces of text to search through (e.g. a whole artist's discography). Going one by one like this would still take a lot of time. And even if I automated feeding all the songs in, the output was still pretty tragic. I decided to formalize it.
My first idea was to do something like this:
```typescript
const resultsObject = { line: 'the line with the word', word: 'line', lexicalSet: 'PRICE' };
```
And push each one to a results array, so I could cycle through the array and get all the results.
But when I tried implementing that in Angular, I realized I had a problem. It worked well the first time you used it, giving you the result:
'the line with the word' - line.
But if I switched from looking for 'PRICE' words to looking for 'NURSE' words, I would get this:
'the line with the word' - line
'the line with the word' - word
Oh no, I'm dumping the whole results array. Not what I wanted, but not a surprise. But what if I only want it to show results for the current word I'm selecting, while still keeping the prior results in the array? (Otherwise I could just clear the array before each call.)
One way I could do this would be to go through each member of the resultsArray, check if resultsObject.lexicalSet is the same as the current one, and display it if it is. But imagine a very long resultsArray... not sure why, but perhaps I became intensely interested in the phonemic distribution in 'War and Peace.' You would have to go through every single member of the array. How wasteful! Not a very scalable solution, is it? (Best case or worst case, you have to go through every single entry of the array: a linear scan, O(n) per lookup, which isn't great.)
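For the record, that wasteful version is just a filter over the whole array (a sketch, using the result shape from above):

```typescript
interface Result { line: string; word: string; lexicalSet: string; }

// O(n): every result gets inspected, even ones tagged with other lexical sets
function linesFor(results: Result[], lexicalSet: string): Result[] {
  return results.filter(r => r.lexicalSet === lexicalSet);
}
```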
But what if we changed the resultsArray and resultsObject? What if instead of storing lines, I first stored lexical sets, and then inside those lexical sets I stored the lines?
```typescript
let resultsObject = {
  lexicalSet: 'PRICE',
  linesList: [
    { line: 'here is a line', word: 'line' },
    { line: 'feels right', word: 'right' },
  ],
};
```
So we've changed our conceptualization of the type of result we want to store from being 'a list of lines and the lexical sets they contain' to 'a list of lexical sets, and the lines that contain their members.' This is useful because I'm not researching lines of songs as much as I am researching lexical sets. The thing I want to find is instances of an artist saying a 'PRICE' word.
Now we can change how we look up lines. I can say 'find this lexical set in the array and return all the lines inside it' instead of 'go through every line to see if it has this lexical set.' You now have to check way fewer things.
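In code, the restructured version looks something like this (a sketch; I'm keying a plain object by lexical set name):

```typescript
interface LineMatch { line: string; word: string; }

// one bucket per lexical set, each holding the lines that contain its members
const resultsBySet: Record<string, LineMatch[]> = {
  PRICE: [
    { line: 'here is a line', word: 'line' },
    { line: 'feels right', word: 'right' },
  ],
};

// the lookup jumps straight to the right bucket instead of scanning everything
const priceLines = resultsBySet['PRICE'] ?? [];
```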
Going back to our 'War and Peace' example: the old way, you would have to check thousands of lines against a limited number of lexical sets. The new way, you only have to check 16-ish lexical sets. In the worst case, you skip 15 other lexical sets (which may have thousands of lines in them) and the sixteenth contains exactly the lines you want. Far fewer things to check. This ought to be faster and scale better!
At least, for this specific example. But what if I wanted to find all the lexical sets in a line? Then I'd be back where I started, having to look inside every lexical set for that line. (Probably still better than the original way, though.) Perhaps I should not foolishly insist on storing everything in a JavaScript object.
In any case, regardless of whether this is the 'best' way (it's not), I think it's still an illustrative example of how the way you organize your data shapes which algorithms are even possible. And it's precisely because the example is so simple that I think it's a good way for a beginner to visualize the consequences of how they choose to structure their data.
The question of how to structure my data for this project I'm working on is definitely one I've labored over for months. Should I link lexical sets to a diaphonemic representation? What about an actual phonemic representation? How can I balance speed with ease of readability for human understanding? I mean, ideally it would be nice to share this project with other dialect lovers like me, but to do that I would have to make it friendly for them.
DevLog 2
The first version of the project has you copy/paste a text in and then type out the lexical set you want. It doesn't have all the John Wells lexical sets available yet, but it has enough that you could make do. This worked pretty well. If I wanted to find out whether a particular lexical set was in a text, it would be easy enough to do.
But this still has some noticeable problems:
- You can only do one text at a time. Ideally, I would like to run this analysis on multiple texts at a time. That would require copy/pasting multiple times, and also having to manually save the results.
- The results are formatted in an ugly, non-intuitive way. Notice how "I was running for the door" appears twice. It really should only appear once and have a list of the words below it.
- You can't send the results to another program to work on them (yet).
- Very limited search.
I know I have a tendency to let my dreams get ahead of what can be done in reality, so I decided to limit the goals to this:
- Fix the 'one line, multiple words' problem. Unless the line itself is literally repeated in the text, there's no reason for it to appear multiple times. (See the sketch after this list.)
- Deploy this version online. Just because it isn't suitable for ~big data~ doesn't mean that somebody won't get some use from it. Even if it is gimmicky, it's a proof of concept. And it could be useful for me if I somehow need to do emergency lexical set analysis and only have a phone.
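On that first goal: the fix is mostly a change of result shape. A sketch, reusing the flat result objects from DevLog 1:

```typescript
interface LineResult { line: string; words: string[]; }

// collapse repeated entries for the same line into one entry
// holding all of that line's matching words
function groupByLine(results: { line: string; result: string }[]): LineResult[] {
  const byLine = new Map<string, string[]>();
  for (const { line, result } of results) {
    if (!byLine.has(line)) byLine.set(line, []);
    byLine.get(line)!.push(result);
  }
  // note: literal repeats of the same line in the text merge here too;
  // keying by line index instead would keep them separate
  return [...byLine.entries()].map(([line, words]) => ({ line, words }));
}
```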
Some longer-term goals I have for this project:
- Make a CLI for it. This would make it much easier to do 'big-ish data'. Instead of passing the texts in directly, I can point it to a folder full of .txt files and have it do this analysis on each of them.
- Refactor the data structure, as mentioned in the last post, to make it more friendly to lexical set analysis.
- Expand the search capabilities to include stress, phonemes, and custom conditions via regular expressions.
- Consider how to make it faster if I really wanted to stuff large amounts of data in there.
The last one has been on my mind. Right now this is how the code looks:
```typescript
public containsThisPhoneme(phoneme, song) {
  // clear results
  this.resultsArray = [];
  // byNewLine is an array where each member is a line of the text
  const byNewLine = song.split('\n');
  for (let i = 0; i < byNewLine.length; i++) {
    // split each line into an array of words
    const currentLine = byNewLine[i].split(' ');
    for (let j = 0; j < currentLine.length; j++) {
      // all words are capitalized for regularization and because the ARPAbet is already all-caps
      const word = currentLine[j].toUpperCase();
      if (this.dictionary[word] &&
          this.dictionary[word]['arpabet'].includes(this.convertLexicalSetToPhoneticDictionary(phoneme))) {
        // create a 'result object' that tells you the line and the word that matched the selected lexical set
        const resultObj = {
          line: byNewLine[i],
          result: currentLine[j],
          matchingPhoneme: phoneme
        };
        this.resultsArray.push(resultObj);
      }
    }
  }
}
```
As you can see, there is a for loop nested inside a for loop, which looks like a classic code smell. My first instinct was to call that O(n²), but the two loops together really just visit each word of the text once, so the scan is linear in the total word count. Still, that's a dictionary-and-phoneme check for every single word. It's workable now when I have teeny tiny texts, but if I ever wanted to run it over some really large corpora, it would probably be slow. What can we do to get around this?
I noticed that this thing goes over words more than once. For example, think of how often the word 'California' appears in the song 'Hotel California.' We don't need to check if California is in the THOUGHT lexical set every single time it shows up. It's not going to change from one line to another, so that's just a waste of time.
It might be better to reduce each text to the unique words that appear in it, so we don't go over any word twice. That should be faster.
The problem I'm dealing with, then, is figuring out how to maintain the 'line by line' analysis. I don't just want to spit out a list of words: 'for', 'California', and 'door'. I still want it to tell me what line they're in.
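One way to square the two (a sketch, assuming the same dictionary shape as before): look each unique word up at most once, cache the verdict, and run the line-by-line pass against the cache instead of the dictionary.

```typescript
function findPhonemeLines(
  song: string,
  targetPhoneme: string,
  dictionary: Record<string, { arpabet: string }>,
): { line: string; word: string }[] {
  const results: { line: string; word: string }[] = [];
  // cache: word -> whether its transcription contains the target phoneme
  const verdicts = new Map<string, boolean>();
  for (const line of song.split('\n')) {
    for (const rawWord of line.split(' ')) {
      const word = rawWord.toUpperCase();
      if (!verdicts.has(word)) {
        const entry = dictionary[word];
        verdicts.set(word, !!entry && entry.arpabet.includes(targetPhoneme));
      }
      // 'California' gets one dictionary check no matter how often it repeats
      if (verdicts.get(word)) results.push({ line, word });
    }
  }
  return results;
}
```

The loops still walk every word, but the expensive check happens only once per unique word, and the line information survives.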
I'm also thinking that having this as an Angular project isn't the best idea. I made it in Angular to split up the dictionary, but I could just do that with JavaScript modules instead.
DevLog 3 (2025)
At a certain point, I moved the project to Python with the intention of turning it into an API. You may ask why I didn't just use Node.js for the API; unfortunately, I no longer remember why I didn't consider that an acceptable solution. I ported the project to Python and began converting it from being tightly bound to the front end to just spitting out text.
An important leap forward came from the AI-monophthongization project. I really needed to know exactly which words had a particular phoneme. For example, I needed to know all the 'AY' words in a song so I could listen to the whole song and code it. The method that worked best for me was surrounding those words with brackets. I also had the option of deleting the word and replacing it with empty brackets [] for coding later on. This was a lifesaver when listening to hundreds of songs, as I needed to quickly enter whether the vowel was [ai] or [a], and having the empty brackets ready for every word was huge.
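That presentation is easy to sketch (a hypothetical helper; `isMatch` stands in for the real lexical set check):

```typescript
// surround each matching word with brackets, or replace it with empty
// brackets when you want a blank slot for coding later
function bracketAnnotate(
  line: string,
  isMatch: (word: string) => boolean,
  keepWord = true,
): string {
  return line
    .split(' ')
    .map(w => (isMatch(w) ? (keepWord ? `[${w}]` : '[]') : w))
    .join(' ');
}

// e.g. bracketAnnotate('my eyes are wide', w => ['EYES', 'WIDE'].includes(w.toUpperCase()))
// -> 'my [eyes] are [wide]'   (or 'my [] are []' with keepWord = false)
```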
I was lazy and just hardcoded the texts to analyze in the program, but eventually this was too much work. I decided to convert it to a web app to make it easier to copy/paste and get the results, and to build it in React to refamiliarize myself with that paradigm. This time, I also had to concern myself with the presentation of the data. I did not want the line-by-line presentation I had before, but the bracket presentation I came up with for the AI-monophthongization project. I decided I wanted the original text and the annotated text side by side, for easy comparison and copying without scrolling. I also wanted the website to let me do things like count phones, a feature that was also present in the Python version of the code.
I decided on a 'tab' experience because I did not foresee this project growing a million different features. There is only so much one can do fetching and counting lexical sets. I wanted the navigation to be easily available throughout the whole site, and tabs were an easy way to see everything accessible at once. They also did not take up precious horizontal space, which I wanted free so I could have two text fields side by side. I did make a mistake here: for some reason I decided to fake navigation by hiding and showing different components instead of just... using navigation. It's fast, but it also means there's no way to share the URL to a particular tab of the site. I intend to fix this in a future revision.
The next part was making the Python API. I looked for a free host because I was not going to be using this all the time, and settled on PythonAnywhere, whose only real limitation was that you need to manually restart the project every 3 months. This seemed fair to me. I moved all the ("business") logic to the backend, so instead of having all the lexical set analysis logic in the front end, it was abstracted away somewhere else. This also meant another project could consume the API, too. There was a slight delay per request, but not enough to make me reconsider.
There is one unfinished aspect of the site: CSV analysis. I have a massive Google Sheet with around a hundred songs left that need to be coded for the number of 'AY' phonemes in them. Unfortunately, Google Sheets does not let me call APIs from individual cells, or apparently from macros at all. I decided to get around this with a feature that takes a CSV, lets you specify which column contains the text you want to annotate and which column to output the count to, and then returns the CSV to you. However, this has been stalled for the foreseeable future.
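For reference, the planned flow is simple enough to sketch (naive CSV handling with no quoted-field support; `countPhonemes` is a stand-in for the API call):

```typescript
// read a CSV, count phonemes in one column, write the count into another
function annotateCsv(
  csv: string,
  textCol: number,
  countCol: number,
  countPhonemes: (text: string) => number,
): string {
  return csv
    .split('\n')
    .map(row => {
      const cells = row.split(','); // naive: breaks on quoted commas
      cells[countCol] = String(countPhonemes(cells[textCol] ?? ''));
      return cells.join(',');
    })
    .join('\n'); // a real version would also skip the header row
}
```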
Some thoughts I have about the project as is:
- Right now anybody can hit the PythonAnywhere API if they have the URL. I'd prefer bad actors not DDoS those folks by begging for phoneme counts over and over again. It may be worth figuring out a way to make it so only certain approved actors can request from the API.
- Still want to add regular expression support for the people who have highly specific phoneme needs. (See the sketch after this list.)
- Would love to make a command line version for someone who wanted to do mass amounts of analysis.
- Updating the dictionary in the future - the ARPAbet dictionary lacks British pronunciations, which are crucial for any future analyses comparing American and British pronunciations.
- As mentioned above, I am only working with one text at a time. The CSV feature will allow multiple texts to be analyzed at once, but only if they are in, well, CSV format. Good for Google Sheets, less so for any other text. Perhaps a command line version could accept a list of files to analyze as an argument.
- Performance on large inputs is an open question.
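On the regular expression point: since ARPAbet transcriptions are just strings with stress digits (e.g. 'R AY1 T' for 'right'), regex search over them nearly falls out for free. A sketch, assuming the same dictionary shape as earlier:

```typescript
// find every dictionary word whose ARPAbet transcription matches a pattern
function searchTranscriptions(
  dictionary: Record<string, { arpabet: string }>,
  pattern: RegExp,
): string[] {
  return Object.keys(dictionary).filter(word => pattern.test(dictionary[word].arpabet));
}

// e.g. all words with a primary-stressed PRICE vowel:
// searchTranscriptions(dictionary, /AY1/)
```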
As with every version of the DevLog, there is always a wishlist of things that could be better at the end. However, LexSet is easily one of the most useful things I've ever made: it has saved me countless hours of the drudgery of counting phonemes, and with more accuracy than I could ever muster by hand. I'm very satisfied with how it works right now.