Thursday, October 24, 2013

Permutations and Context

A recent example of why context matters in summaries is the Martin Luther King Jr. memorial in Washington, DC. The memorial cited King as saying "I am a drum major for justice, peace and righteousness." While King did say these words, he did so in a sermon, in the context of: "Yes, if you want to say that I was a drum major, say that I was a drum major for justice. Say that I was a drum major for peace. I was a drum major for righteousness. And all of the other shallow things will not matter." This clarifying information is essential to understanding that Martin Luther King was not being arrogant.

In Quick Summary I use the Zachman Framework to prevent problems like this from occurring in summaries.

I've read Building Great Sentences: Exploring the Writer's Craft. Professor Brooks Landon explains how permutations and word order matter, and how abstract thoughts can be implied. For example, in the Mother Goose rhyme "This Is the House That Jack Built," a list of things comes together to create the circumstances under which Jack could build his house. He discusses the weight that words carry within sentences, and how suspensive sentences are ordered so that key elements are introduced later in the sentence rather than at the start.
 
Works Cited:
http://www.dailymail.co.uk/news/article-2373139/Disputed-Drum-Major-quote-removed-Martin-Luther-King-memorial-carving-grooves-lettering.html

Wednesday, October 9, 2013

Total Recall

Quick Summary is currently investigating the following aspects of natural language processing.

Tracks to the future: I am developing a parser that breaks a sentence into word parts. David Crystal's book Spell It Out talks about how language is a technology adapted over time to meet the needs of its users. The book The Infinite Gift argues that language is designed for humans alone. It points out that a baby can spend up to a year listening to its parents before speaking, and research has confirmed a biological aspect that makes infants prefer their parents' language over other languages. The author argues that classes of speech cannot be understood and manipulated by machines, citing Chomsky's famous phrase "Colorless green ideas sleep furiously."
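As a minimal sketch of that first parsing step, here is a regular-expression tokenizer that breaks a sentence into word and punctuation tokens. The pattern is illustrative only, not the parser under development:

```python
import re

def tokenize(sentence):
    """Split a sentence into word and punctuation tokens."""
    # \w+'?\w* keeps contractions like "don't" in one piece;
    # [^\w\s] captures each punctuation mark as its own token.
    return re.findall(r"\w+'?\w*|[^\w\s]", sentence)

print(tokenize("Colorless green ideas sleep furiously."))
# ['Colorless', 'green', 'ideas', 'sleep', 'furiously', '.']
```

A real parser would go on to tag and group these tokens, but even this split is the foundation everything else builds on.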

I disagree with the author of The Infinite Gift on a number of points. It is hard to draw borders between one geographic group and the next. The book The Story of Spanish describes Portuguese and Spanish as 89% similar. After the Norman conquest, French was widely spoken with varying accents. The high courts spoke French; however, as the Normans gave up power, the language shifted into what was, in effect, an accent or dialect of French. This dialect became known as English.

Metaphors are an important aspect of word origins, but many newspapers are written so that people with a fifth-grade vocabulary can understand them. We cannot assume that many fifth graders understand metaphors such as "a broken record," or see an activist as something that acts upon the world. This network of vernacular fossils composes much of our language today. What we need to find out is how spelling can be used to reconstruct etymological concepts. If we were to strictly follow a formula for the English of yesterday, pragmatically tagging parts of speech (POS), we would never be able to keep up with new words as they are added to our corpus.

While language may be partly biologically inspired, that does not mean machines cannot learn; given enough training examples, they too can understand rhetoric or the images painted by words. IBM's Jeopardy!-winning program Watson came close, but I think we can come closer by using Zachman's framework to identify elements in poems, stories written in verse, and even Dr. Seuss.

Our goal is to use the Maven Markup Modeling Language (MMML), a version of the Predictive Model Markup Language (PMML), to find ways of correlating words in hopes of integrating new words into the corpus faster.

The second aspect we wish to pursue is the idea of a business processing framework as it relates to Zachman's framework. We want to know the actor, what is being acted upon, and what goals are being sought. Our first paper discussed how a grammar is a set of rules in a language that allows people to communicate and understand one another. This grammar is so apparent that verbal language is not always required, for example when a person checks into a hotel. We parse these elements out using a reinforcing neural network. PyBrain is a Python-based neural network library. A similar project is GATE's ANNIE in the UK.
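To make the actor/acted-upon idea concrete, here is a toy rule-based stand-in, not the neural network approach itself: given a POS-tagged sentence, take the first noun before the verb as the actor, the verb as the action, and the first noun after it as what is acted upon. The example sentence and tags are invented for illustration:

```python
def extract_roles(tagged):
    """Pick out (actor, action, acted_upon) from a POS-tagged sentence.

    `tagged` is a list of (word, tag) pairs using Penn Treebank-style
    tags. This is a rule-based sketch, not a trained model.
    """
    actor = action = acted_upon = None
    for word, tag in tagged:
        if tag.startswith("NN") and actor is None and action is None:
            actor = word          # first noun before the verb
        elif tag.startswith("VB") and action is None:
            action = word         # the main verb
        elif tag.startswith("NN") and action is not None:
            acted_upon = word     # first noun after the verb
            break
    return actor, action, acted_upon

tagged = [("guest", "NN"), ("books", "VBZ"), ("a", "DT"), ("room", "NN")]
print(extract_roles(tagged))  # ('guest', 'books', 'room')
```

Real sentences break rules like these constantly, which is exactly why a learned model is the goal; the sketch only shows what the output of that model should look like.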

The third aspect of the project is to produce a summary that highlights the most important sentences. This can be challenging given the grammatical structures and punctuation that occur in a society where short sentences are not necessarily the best.
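One classic baseline for picking out important sentences, offered here only as a sketch of the idea and not as the Quick Summary method, is to score each sentence by the frequency of the words it contains and keep the top scorers in their original order:

```python
import re
from collections import Counter

def top_sentences(text, n=2):
    """Score sentences by the corpus frequency of their words and
    return the n highest-scoring sentences in original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: sum(freq[w] for w in re.findall(r"\w+", sentences[i].lower())),
        reverse=True,
    )
    return [sentences[i] for i in sorted(ranked[:n])]

text = "Language shapes thought. Language shapes culture. Cats sleep."
print(top_sentences(text, 2))
# ['Language shapes thought.', 'Language shapes culture.']
```

The naive sentence splitter here is exactly where the grammar and punctuation problems mentioned above bite: abbreviations, quotations, and long literary sentences all defeat a simple regex.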

Books I am planning to read are:
How to Create a Mind by Ray Kurzweil
Crowdsourcing by Daren C. Brabham
How to Not Write Bad: The Most Common Writing Problems and the Best Ways to Avoid Them by Ben Yagoda.

Wednesday, October 2, 2013

The unfortunate false friend (beware of the gift)

The book Found in Translation: How Language Shapes Our Lives and Transforms the World talks about how words are associated with different metaphors in different cultures. One example it gives is the Spanish intoxicado, which refers to poisoning, as in food poisoning, yet looks as though it means intoxicated, as though someone had too much to drink. This is a "false friend." A false cognate is a pair of words that look as though they come from the same linguistic root but have different meanings. Another example is the word okoru in Japanese: although it resembles the phrase "to occur," it means to get angry.

The book goes on to say that certain brands have to rename models in particular countries where the name carries a different connotation, such as the Mitsubishi Pajero and the Honda Fitta.

The word Gift means poison in German. The Online Etymology Dictionary says it was often associated with prescribed medicine: people used the word for a potion given by a doctor, and it came to mean something tangible given by a knowledgeable person.

The author tells how jokes in some languages are missed. In the Harry Potter series, for example, Lord Voldemort's name "Tom Marvolo Riddle" is an anagram of "I am Lord Voldemort." The Bulgarian translation uses Mersvoluko, whose anagram works out to "And here I am, Lord Voldemort."
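The anagram constraint is easy to check by machine, which hints at how a translator's wordplay could be verified automatically. A small sketch:

```python
def is_anagram(a, b):
    """True if a and b use exactly the same letters,
    ignoring case, spaces, and punctuation."""
    canon = lambda s: sorted(c for c in s.lower() if c.isalpha())
    return canon(a) == canon(b)

print(is_anagram("Tom Marvolo Riddle", "I am Lord Voldemort"))  # True
```

Finding a name that satisfies the anagram while still sounding like a name in the target language is the part no simple program solves; that is the translator's art.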

In Hebrew, the word tohu originally meant formless and vohu meant empty. In French today, tohu-bohu means chaos and confusion, and the words later became associated with chaos in Hebrew as well.

Sometimes the environment makes it necessary to invent words in order to present the concepts behind them. For example, to explain cancer to the Hmong people, UC Davis compiled an English-to-Hmong dictionary of medical terms. Martin Luther, when translating the scriptures, invented words so he could explain Latin concepts to people, such as Machtwort (authoritative guidance).

Monday, September 23, 2013

Neat NLP Videos

There are two neat videos that explain natural language processing. One is on the wit.ai project, at http://youtu.be/G7M74_K8iiw. It talks about matching entities with intents using similar words.

The second is a TED talk on mapping ideas through natural language processing: http://www.ted.com/talks/eric_berlow_and_sean_gourley_mapping_ideas_worth_spreading.html

Google came out with its word2vec project. It is a great effort to group similar words by finding the nearest vectors, but some refinement still needs to occur. Instead of classifying words that are merely similar, I'm finding the approach taken by wit.ai to be better. The Google approach does not take into account the fact that relations between words can be asymmetric: for example, we say that a portrait resembles a person, not that a person resembles a painting. Wit attempts to get to the heart of what a conversation is so that, as my first paper suggests, we first learn from a transaction, using a business process modeling language, what a conversation is and why it occurs.
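The "nearest vector" idea boils down to cosine similarity between word vectors. The tiny three-dimensional vectors below are invented purely for illustration (real word2vec embeddings have hundreds of dimensions), but they show the mechanics, and why the measure is symmetric, which is exactly the limitation of treating "resembles" as a two-way relation:

```python
import math

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy vectors, invented for illustration only.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

nearest = max((w for w in vectors if w != "king"),
              key=lambda w: cosine(vectors["king"], vectors[w]))
print(nearest)  # queen
```

Note that cosine(a, b) always equals cosine(b, a); nothing in the geometry says a portrait resembles a person rather than the other way around, which is the directionality an intent-based approach can capture.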

Friday, September 20, 2013

About me

I am Rob Wahl and I am the writer of the paper on natural language processing called Quick Summary, available at http://arxiv.org/abs/1210.3634 . I am very interested in how a computer can solve the problem of having too much information and not enough time to digest it. Quick Summary works by finding recurrent themes throughout a paper using MMML (Maven Meta-data Markup Language), a form of PMML, the Predictive Model Markup Language. I am currently contributing to several open source projects, including Gensim, which looks at similarities between texts. I also contribute to Apache projects and to the NLTK framework. In this blog I explore ways of improving the field of NLP as I work toward another paper.