Wednesday, November 25, 2015

Personality classification via NLP

I recently connected with David Rostcheck, an NLP expert from Chile who is taking our Coursera course. David told me some interesting things about personality testing using text inputs that I wanted to share here, as I was not aware of this field.

Here is a guest blog post from David:

When we think of hot Data Science tools, we usually think of Machine Learning and Data Visualization. Another area that does not get quite as much attention - but can produce amazing results - is Natural Language Processing (NLP). These algorithms can give powerful insight into a subject's education, power dynamics, tone, and other personal traits. I recently completed an engagement in which I did quite a bit of this work and got to use some cutting-edge APIs. In this article, I will explain how NLP functions and discuss using IBM Watson and Receptiviti to analyze personality from writing.

Natural Language Processing uses statistical techniques to extract insight from text. Consider assessing the grade level of a piece of text. When we say that something is “written at a 10th grade level,” we are speaking about how it scores on a standard metric. There are several, but one of the most common is called the “Flesch-Kincaid” score [1]. It was developed in the 1970s through a statistical study of language. Flesch-Kincaid is a simple formula using two variables: the average number of words per sentence and the average number of syllables per word. 

Although it seems simple [2], this heuristic can quickly and accurately assess the grade level of a written sample. Flesch-Kincaid, like all Natural Language Processing algorithms, is specific to a language (in this case, English). Assessing writing in another language, such as Spanish, requires a different formula.
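For the curious, here is a minimal sketch of the grade-level formula in Python, using the commonly cited coefficients. The crude syllable counter is my own simplification; a real implementation would use a pronunciation dictionary such as CMUdict.

```python
import re

def count_syllables(word):
    # Crude heuristic: count groups of vowels. Real implementations
    # use a pronunciation dictionary for accuracy.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    # Commonly cited Flesch-Kincaid grade-level coefficients
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59

print(flesch_kincaid_grade("The cat sat on the mat. It was happy."))
```

Short, simple sentences score at (or even below) the lowest grade levels, while long sentences full of multisyllabic words push the score up.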

In today’s world of Deep Learning algorithms that extract features and relationships, NLP’s statistics-based approach can seem unsatisfying. The algorithm knows nothing about what the words mean. Feeding meaningless long sentences with multisyllabic words to the Flesch-Kincaid formula will produce a high grade-level score. But the methods work because the relationships hold true, given enough data – which, because text is information-dense, can be surprisingly little.

If Natural Language Processing is “just a bag of statistical tricks,” then why do we care about it? Advanced techniques, such as personality analysis, can give powerful, immediate insight into deep traits of the writer - or at least into the persona exposed by the sample. Is the author a Type-A person? How hostile is the message? Is it brooding? Impulsive? Emotionally distant, or accessible? Is the author in a high or low-status power position relative to the recipient? Does the writing show signs of mental instability? We can assess all those parameters. Furthermore, the fast execution of these heuristics allows us to push high volumes of data through them quickly.

We can directly implement simple algorithms, like the one above. More complex analysis requires a set of correlation coefficients established by painstaking academic research. Vendors generally acquire the research and wrap the algorithm in a web service API. For extracting insight about each subject, I used IBM Watson’s Personality Insights API and Receptiviti.

IBM’s service consumes text and returns scores along the axes (“traits”) of three different psychological models: Big 5, Needs, and Values. For example, using the Big 5 model, it will evaluate the openness, conscientiousness, extraversion, agreeableness, and emotional range of the input.
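To give a feel for what working with such a service looks like, here is a hedged sketch. The endpoint, authentication header, and response layout below are made up for illustration - the real Watson Personality Insights API has its own URL, credential scheme, and JSON schema.

```python
import json
import urllib.request

# Hypothetical endpoint and key - not the actual Watson API.
API_URL = "https://example.com/personality/v1/profile"
API_KEY = "your-api-key"

def get_personality_profile(text):
    # POST the raw text and get back trait scores as JSON.
    request = urllib.request.Request(
        API_URL,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer " + API_KEY},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))

sample = "I appreciate your quick reply. Could you please reset my password today?"
profile = get_personality_profile(sample)
# Assume the response exposes Big 5 traits as name/percentile pairs.
for trait in profile["big5"]:
    print(trait["name"], trait["percentile"])
```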

Do you obtain real objective traits from the output? IBM carefully cautions that to give valid absolute scores, Watson needs very long samples. It is better to think of the response as an assessment of persona - the voice in which the piece is written - rather than absolute personality. In test experiments I found, though, that for a given persona I could cut the sample down considerably and still get reliably consistent results. Within a defined problem domain - such as help desk ticket messages - comparison proved valid even with much shorter pieces than required to assess absolute personality. And the assessment held up to examination - when it reported a message as having high or low emotional range, a human reader concluded the same.
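A simple way to run that kind of check, building on the hypothetical get_personality_profile() sketch above, is to score progressively shorter prefixes of the same sample and see how far the trait scores drift from the full-length baseline:

```python
def truncation_consistency(text, fractions=(1.0, 0.5, 0.25, 0.1)):
    """Score shorter and shorter prefixes of a sample and report how much
    each Big 5 trait drifts from the full-length score."""
    words = text.split()
    baseline = {t["name"]: t["percentile"]
                for t in get_personality_profile(text)["big5"]}
    for frac in fractions[1:]:
        sample = " ".join(words[: int(len(words) * frac)])
        scores = {t["name"]: t["percentile"]
                  for t in get_personality_profile(sample)["big5"]}
        drift = {name: abs(scores[name] - baseline[name]) for name in baseline}
        print(f"{frac:.0%} of sample -> max trait drift {max(drift.values()):.2f}")
```

If the drift stays small down to a fraction of the original length, comparisons within that problem domain can be trusted with shorter pieces.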

Receptiviti productizes research from James Pennebaker, a major academic figure in linguistic personality analysis. It produces Big 5 model traits too, but also gives more directly usable outputs such as the emotional warmth, impulsiveness, and depression of the persona. And with direct access to Dr. Pennebaker’s research, Receptiviti continuously adds new types of analyses.

It may seem incredible that one can obtain accurate psychological markers from writing - even more so because these tools operate based on comparing words against special purpose dictionaries established through psychological research. Watson will return the same results if you sort the text alphabetically, losing all sentence structure. The statistical relationships are word-based, not sentence-based - but they do hold.
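Here is a toy illustration of why word order does not matter: this style of scoring boils down to counting hits against special-purpose category dictionaries, so sorting the words changes nothing. (The tiny dictionary below is made up; real tools such as Pennebaker's LIWC use word lists validated through psychological research.)

```python
# Made-up category dictionary; real tools use validated word lists.
POSITIVE_EMOTION = {"happy", "love", "great", "glad", "hope"}

def category_score(text, dictionary):
    # Fraction of words that fall in the category - order never enters.
    words = text.lower().split()
    return sum(w.strip(".,!?") in dictionary for w in words) / len(words)

text = "I love this great team and I am happy to help."
shuffled = " ".join(sorted(text.split()))  # destroys sentence structure

print(category_score(text, POSITIVE_EMOTION))      # same score
print(category_score(shuffled, POSITIVE_EMOTION))  # same score
```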

What do companies use NLP-based personality analysis for? Early adopters span a wide range of industries. Telecommunications company Telefónica and human resources firm Adecco have both begun using Watson (embedded via SocialBro’s social media marketing tool) to segment their customers by personality traits for Twitter marketing campaigns. I worked with educational software vendor Learning Machine to surface added insight in university applications. Design studio Chaotic Moon has explored ways to improve user interaction by shaping application behavior to the user’s personality, and fantasy football research firm Edge Up Sports uses the technology in its sports analysis. Receptiviti has put out a series of blog posts analyzing the personalities surfaced by candidates in the U.S. election debates, so it seems likely political consultants may begin using these tools as well.

You do need to bring a true scientific approach to the use of these tools - they are easily misused. For example, Watson’s three models are tuned to specific lengths and styles of writing: blog posts, Facebook messages, and tweets. Testing with a variety of samples from known sources revealed that it gave consistent results with the model that best fit the data, and inconsistent results with the others. It took carefully thought-out experimental tests to qualify the limits and precision of the tools. But they work, and they can give surprisingly deep insight.

The explosion of blogging and social media has opened new opportunities for linguistic researchers. NLP may have a lower profile than other skills, but it can produce solid insights and holds a place in the Data Science toolbox.

[1] There are actually two Flesch-Kincaid scores, one for grade level and another for readability, with different formulas.

[2] It’s always simple after someone does all the analysis to extract the relationship, now, isn’t it?

Monday, November 23, 2015

Veles: deep learning by Samsung

I got this from my friend Assaf Araki from Intel: Veles is a new project by Samsung for distributed deep learning.

Interestingly, their first blog post says: "Vadim Markovtsev (main developer) and Gennady Kuznetsov (project leader) left Samsung to another company about 2 month ago. It is difficult to work with splited command, but we are not stoping Veles developing. We are using Slack to sharing ideas, problems and news. We are slowdown a little, but we will catch up."

Monday, November 9, 2015

TensorFlow: Google releases a new ML library for deep learning

I got this yesterday from both Assaf Spanier and Guy Rapoprt: Google is announcing the release of their TensorFlow library. The main use case is deep learning. It has a Python interface and supports multiple GPUs.
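For those curious about the interface, here is a minimal sketch in the graph-and-session style of the initial release (a toy linear computation, not a deep network):

```python
import tensorflow as tf

# Build a tiny computation graph: y = x * W + b
x = tf.placeholder(tf.float32, shape=[None, 3])
W = tf.Variable(tf.random_normal([3, 1]))
b = tf.Variable(tf.zeros([1]))
y = tf.matmul(x, W) + b

# Launch the graph in a session and evaluate it on one input row
with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    print(sess.run(y, feed_dict={x: [[1.0, 2.0, 3.0]]}))
```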

A recent benchmark shows that TensorFlow is rather slow compared to Torch.

Sunday, November 8, 2015

Deep learning for art!

I got this from my colleague Chris Dubois: DeepArt is an application that combines an image with an artist's style to create new artwork. For example:


The application was created by Łukasz Kidziński & Michał Warchoł, based on the research paper 'A Neural Algorithm of Artistic Style' by Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Here are some images from the paper:

Wednesday, October 28, 2015

News from Stockholm: HopsWorks

A couple of weeks ago I visited Stockholm, where I was kindly invited to give a keynote at the SICS Data Science Day. Thanks again to Prof. Seif Haridi for his kind invitation!

One of the interesting lectures there was by Prof. Jim Dowling, previously of MySQL. Jim is building an open source system called HopsWorks to improve the Hadoop experience. Jim has kindly answered my questions about the project.

When did the project start?
Hops started in 2011 with HDFS. In 2013, we started on YARN. In 2014, we started HopsWorks, so it's been about 14 months in development now.

What is the project goal?
The project goal is to make Hadoop for humans. We want to make it easier for people to store, process, and share data in Hadoop. That means moving away from the command line to graphical interfaces for non-programming tasks. Everything from managing access to data to sharing datasets should be accessible to people who are not data engineers.

What is the project license?
The project is licensed as a mix of Apache v2 and GPL v2. Our extensions to Hadoop are Apache v2 licensed, but we have connectors to the NewSQL database that we support (NDB - MySQL Cluster), and they have to be licensed under GPL v2. Because of the mixed licensing model, we don't provide a single distribution. However, users can install HopsWorks with six mouse clicks using our recommended installation tool.

Who is using HopsWorks?
We have had the most interest from companies and organizations with sensitive datasets that need to be shared in a controlled manner with users. For example, Ericsson is interested in enabling non-Ericsson employees to do analysis on some of their datasets without requiring the long process of signing NDAs. As HopsWorks has a data scientist role (who cannot import or export data from the system), they could provide access to external data scientists, knowing they have an audit trail for actions by the external users and that the external users cannot download the dataset or derived data from the cluster. In the area of genomics, we have a lot of interest as well.

Can you share performance numbers?
I don't have figures for Sentry's performance. The figures I showed were for state-of-the-art policy enforcement points (XACML). Sentry tries not to do any enforcement itself and basically sends all of its rules to all of the services to be cached there (HDFS, Solr, Impala, etc.). My guess is that Sentry itself can still only handle a few hundred ops/sec. The main problem it has is how to keep the privileges and the data consistent, and I don't see how they can do that for all Hadoop services.
Here's Cloudera's own report on the slowdown from turning on Sentry for Solr (it leads to a 20% slowdown for Solr - even with most privileges being stored in Solr):
They admit that Sentry "doesn't scale to store document-level [privileges]", so they store policies in Solr instead (breaking the assumption that Sentry is the central store for all policies/privileges).

Can you share a video of your talk?
The talk from last week is up:

And here are some screenshots:

Thursday, October 22, 2015

Apache Zeppelin is Picking Up!

A few months ago, I wrote about Apache Zeppelin. Yesterday I visited SICS and met with Jim Dowling. He has an interesting open source project named HopsWorks (I plan to write more about it soon!). Anyway, HopsWorks uses Zeppelin, and I saw a very interesting demo of its functionality. According to Jim, Zeppelin is really picking up. He sent me the following resources, which indicate the rising popularity of Zeppelin:

In Microsoft Azure:


In Hortonworks:

In Cloudera:

In HopsWorks :)

Sunday, October 18, 2015

IEEE ParLearning 2016 workshop announced

My colleague Yinglong Xia from IBM has kindly invited me to serve on the program committee of the 2016 IEEE Parallel Learning (ParLearning) Workshop. The workshop will take place May 27, 2016 in Chicago. The submission deadline is January 15, 2016. Submissions of original work in the area of parallel and distributed machine learning systems are encouraged!