P, NP and Panlexicon

This week I updated Panlexicon‘s Word of the Day for Twitter’s new authentication API requirements, squeaking in just under the August 30 deadline. Panlexicon has been updated tweeting a unique word every day for nearly a year and a half now.

Also contemporary is P ≠ NP and generating Panlexicon’s daily tweet is an example of both Polynomial (P) and Non-Polynomial (NP) Time operations. Panlexicon’s uniqueness comes from exploration and discovery; when thinking about how to bring Panlexicon to Twitter, ensuring those values came through lead to an interesting computational problem.

For every Word of the Day Panlexicon tweets, it also includes a number of related words. Any word of the day could have hundreds of related words, but the difficulty is in maximizing the number of related words to share in the tweet while still staying within Twitter’s 140 character limit. The more related words I can fit into the tweet, and the more diverse those words are, the more explorational it is and the more likely you will discover something interesting or fun by reading it.

Generating Panlexicon’s Word of the Day Twitter message is an NP-type problem. There are millions of potential solutions that stay within the 140 character limit, but only a few of them are optimal: fitting as many related words into the tweet as possible, while still having a diverse distribution of words lengths. There is no easy way to figure out those optimal solutions without a lot of computational muscle. At the same time checking if the tweet as a whole is less than 140 characters is an N-type problem: it’s just a simple matter of counting up the characters.

So that’s just one example of the mathematical problems Panlexicon faces. Of course, the algorithm I actually use to write Panlexicon’s Word of the Day on Twitter is by no means fully optimizing, but by keeping the broader computational context in mind, I can fake it reasonably well.


More thoughts on an interesting thesaurus

My associate, Rebecca, and I have been starting to think critically about Panlexicon.com, the unique, tag-cloud based thesaurus I’ve written about previously. We’re hoping to put some more time and effort into the project and in the process, learn some more about what’s happening with the language and the underlying structure of the thesaurus taxonomy.

Panlexicon.com - Thesaurus Visualization

The thesaurus data we’re working with is the Moby Thesaurus from the Project Gutenburg library of free electronic texts. Like many thesauruses, it’s structure in an interesting way. Every word is assigned to one or more groups based on it’s general meaning or idea. Each group has a keyword, also known as a headword, that is a general encapsulation that idea—this is why, for example in Roget’s, you must first look up a word in the index to acquire its keywords. Each group has only one keyword, but a keyword can exist in other groups (but as an ordinary word).

This thesaurus structure allows us to do some easy simplifications and analysis on the data. For many functions, we can treat the groups as supernodes, performing operations and storing connections upon them in place of the words themselves. For example, when determining relatedness between words, we only have compare the groups they are a part of; while there are approximately 100,000 words in our database, there are only 30,000 groups, which greatly diminishes the size and complexity of the data set we’re working on.

Panlexicon.com - Correspondence Weighting

Currently Panlexicon works by comparing the overlap between groups of words. When typing in a search term, Panlexicon looks up all of the groups that word is a member of. It then returns a list of words that are also in those groups. The weight of each word (or size in our word cloud model) is calculated according to how many groups—-of those groups that include the search term—that word is a member of. A property of this is that no other returned word will have a heavier weight than the search term. When searching multiple terms, Panlexicon creates a set of groups such that all search terms are a member. In the case when there exists no groups that contain all the search terms, Panlexicon returns nothing.

Already we’re digging into some interesting relations that turn up in the thesaurus data. For example, one of my favorite linguistic myths is that Eskimos have 50 different words for snow. The supposed lesson was that eskimos had a different conception of snow than us (the non-Eskimos). I always wondered, “Well, is 50 a lot?” The largest group in our thesaurus has the keyword cut with 1448 related words or synonyms. This is followed by set (1152), turn (1108), run (1025), and color (1007). That’s quite a bit.

Also, interestingly in our dataset, are the most versatile words. These words are members of the most groups. The list shares four out five of the same words as those of the most synonyms, beginning with cut, being a member of 1120 distinct groups. This is followed by set (928), run (750), turn (715), and check (699).

Right now, we’re investigating paths between words. This will allow us to play the Kevin Bacon game, making connections between words that may not share the same group. It will be interesting to determine what words are connected (even through a medium) and which ones are disconnected. Lastly on our list of things to do is determine the eigenvectors of our groups in relation to how their connected to other groups. This will allow us to determine—without using fancy words like Markov chains—which words are probably used the most. I say probably because we’re analyzing a taxonomic work, rather than actual speech. Who knows if they match up; we’ll find out.