Wednesday, October 9, 2013

Total Recall

Quick Summary is currently investigating the following aspects of natural language processing.

Tracks to the future: I am developing a parser that breaks a sentence into word parts. David Crystal's book Spell It Out describes language as a technology that is adapted over time to meet the needs of people. The book The Infinite Gift argues that language is designed for humans alone to use. It points out that a baby can spend up to a year just listening to its parents, and that research has confirmed a biological aspect that makes infants prefer the language of their parents over other languages. The author argues that classes of speech cannot be understood and manipulated by machines, and he cites Chomsky's famous sentence "Colorless green ideas sleep furiously".

I disagree with the author of The Infinite Gift on a number of points.
It is hard to draw a border between one geographic group and the next. The book The Story of Spanish describes Portuguese and Spanish as 89% similar. After the Norman Conquest, French was widely spoken in England with varying accents. The high courts spoke French; however, as the Normans gave up power, the language changed into an accent or dialect of French. This dialect became known as English.

Metaphors are an important aspect of word origins. Yet many newspapers are written so that people with a fifth-grade vocabulary can understand them, and we cannot assume that many fifth graders understand metaphors such as "a broken record" or see "activist" as "one who acts upon". This network of vernacular fossils composes much of our language today. What we need to find out is how spelling can be used to construct etymological concepts. If we were to strictly follow a formula built for the English of yesterday, pragmatically tagging parts of speech (POS), we would never be able to keep up with new words as they are added to our corpus.
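
As a rough illustration of how spelling alone can expose word parts, here is a minimal sketch that peels a known prefix and suffix off a word. The affix tables and the split_word name are my own assumptions for this example, not the parser described in this post; a real system would need much larger, curated lists.

```python
# Hypothetical affix tables, chosen only for illustration.
PREFIXES = ("un", "re", "pre", "dis")
SUFFIXES = ("ist", "ness", "able", "ing", "ed", "s")


def split_word(word, min_stem=3):
    """Split a word into (prefix, stem, suffix) by greedy affix matching."""
    word = word.lower()
    prefix = suffix = ""
    # Try longer affixes first so that "ness" wins over "s".
    for p in sorted(PREFIXES, key=len, reverse=True):
        # Keep at least min_stem letters so "read" is not split into "re" + "ad".
        if word.startswith(p) and len(word) - len(p) >= min_stem:
            prefix, word = p, word[len(p):]
            break
    for s in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(s) and len(word) - len(s) >= min_stem:
            suffix, word = s, word[:-len(s)]
            break
    return prefix, word, suffix


if __name__ == "__main__":
    for w in ("activist", "unbroken", "replaying"):
        print(w, split_word(w))
```

With these tables "activist" comes apart as the stem "activ" plus the suffix "ist", which is the "one who acts" reading a fifth grader might miss. A new word built from familiar parts can be split the same way without being in the corpus first.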

Even if the author's argument that language is partly biologically inspired holds, it does not rule out that machines can also learn: given enough training examples, they too can understand rhetoric or images painted by words. IBM's Jeopardy!-winning program Watson came close, but I think we can come closer by using Zachman's framework to identify elements in poems, stories written in verse, and even Dr. Seuss.

Our goal is to use the Maven Markup Modeling Language (MMML), a version of the Predictive Model Markup Language (PMML), to find ways of correlating words, in the hope of integrating new words into the corpus faster.
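
To make "correlating words" concrete, here is a minimal sketch that counts how often two words share a sentence and then ranks the known words that appear alongside a given word. This is plain co-occurrence counting, not MMML or PMML; the function names cooccurrence and neighbours are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(sentences):
    """Count how often each unordered word pair shares a sentence."""
    pairs = Counter()
    for sentence in sentences:
        # Sort the unique words so each pair is always stored in the same order.
        words = sorted(set(sentence.lower().split()))
        pairs.update(combinations(words, 2))
    return pairs


def neighbours(pairs, word, top=3):
    """Rank words by how often they co-occur with `word`."""
    scores = Counter()
    for (a, b), n in pairs.items():
        if a == word:
            scores[b] += n
        elif b == word:
            scores[a] += n
    return scores.most_common(top)


if __name__ == "__main__":
    corpus = ["the guest checks in", "the guest pays", "the clerk smiles"]
    print(neighbours(cooccurrence(corpus), "guest"))
```

A new word could then be placed near the known words it most often appears with, which is one simple way to bring it into the corpus before anyone has tagged it by hand.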

The second aspect we wish to pursue is the idea of a business processing framework as it relates to Zachman's framework. We want to know the actor, what is being acted upon, and what goals are being sought. Our first paper discussed grammar as a set of rules in a language that allows people to communicate and understand. This grammar is so apparent that verbal language is at times not required, for example when a person checks into a hotel. We parse these elements out using a reinforcing neural network. PyBrain is a Python-based neural network library. A similar project is GATE's ANNIE in the UK.
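
To show what the output of this step might look like, here is a minimal sketch that fills an actor / action / target / goal record from a very simple sentence. It is a rule-based stand-in using one regular expression, not the neural network mentioned above; the Frame class, the pattern, and parse_frame are all assumptions made for this example.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Frame:
    actor: str                  # who acts
    action: str                 # what they do
    target: str                 # what is being acted upon
    goal: Optional[str] = None  # what is being sought, if stated


# Hypothetical pattern: "<actor> <action> <target> [to <goal>]", articles optional.
PATTERN = re.compile(
    r"^(?:the |a |an )?(?P<actor>\w+) (?P<action>\w+) "
    r"(?:the |a |an )?(?P<target>\w+)(?: to (?P<goal>.+?))?\.?$",
    re.IGNORECASE,
)


def parse_frame(sentence):
    """Return a Frame for sentences that fit the pattern, else None."""
    match = PATTERN.match(sentence.strip())
    if match is None:
        return None
    return Frame(**match.groupdict())


if __name__ == "__main__":
    print(parse_frame("The guest requests a room to sleep."))
```

A fixed pattern like this breaks on almost any real sentence, which is the reason for wanting a trained model. The sketch only fixes the shape of the record that such a model would be asked to fill.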

The third aspect of the project is to produce a summary that highlights the most important sentences. This can be challenging given the varied grammatical structure and punctuation of real text, where short sentences are not necessarily the best ones.
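
One common baseline for picking out important sentences is to score each sentence by how frequent its words are in the whole text. The sketch below does that under simple assumptions: the stopword list is a small hypothetical one, sentences are split on end punctuation, and summarize is a name chosen for this example rather than the project's actual method.

```python
import re
from collections import Counter

# Small hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "it", "that"}


def tokens(sentence):
    """Lowercase content words of a sentence."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS]


def summarize(text, n=1):
    """Return the n highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for s in sentences for w in tokens(s))

    def score(sentence):
        words = tokens(sentence)
        # Average word frequency, so a sentence does not win on length alone.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(sentences, key=score, reverse=True)[:n]
    return [s for s in sentences if s in ranked]


if __name__ == "__main__":
    text = "Language changes. Language is a technology that people adapt. Cats nap."
    print(summarize(text, n=2))
```

Averaging tends to favour short sentences, while summing the frequencies instead would favour long ones. Which of the two suits a given kind of text is exactly the kind of question this part of the project has to answer.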

Books I am planning to read:
How to Create a Mind by Ray Kurzweil
Crowdsourcing by Daren C. Brabham
How to Not Write Bad: The Most Common Writing Problems and the Best Ways to Avoid Them by Ben Yagoda
