Pluribo technology

Patent-pending summary technology

Pluribo faces the challenge of continuously aggregating a massive quantity of opinions, mining their textual content, extracting key statistical trends, and summarizing the results in natural language. What's more, we are committed to doing this in a fully automated way, which reduces human bias and allows for greater speed and scale.

To meet this end-to-end challenge, we developed a novel suite of natural language data mining and summarization techniques. These techniques are encapsulated in our summary engine, and are covered under a U.S. patent application. Given the right data, the summary engine can rapidly summarize opinions on nearly any topic. At present, we are applying the summary engine mainly to product review data from Amazon.com. If there is demand, we may soon open our API to developers working in other topic areas.

Systems and techniques

As we research new and more powerful approaches for summarization, we're happy to discuss our systems and techniques. Here are some of the key technologies that we use.

Feature-based sentiment analysis

Feature-based sentiment analysis is the process of scanning text about a topic and extracting a distinct sentiment score for each attribute of that topic. Numerous techniques exist (for example, see the opinion mining work by Liu, or by Popescu and Etzioni). Our basic approach is to scan text and look for feature phrases occurring in close proximity to sentiment phrases. To do this well, it helps to have a good top-down ontology of the features for a given domain and a comprehensive lexicon of the typical feature and sentiment phrases in that domain. We derive our ontologies and lexicons using a basket of bottom-up statistical techniques, including word frequency, proximity in WordNet, and Bayesian phrase clustering. This practical hybrid of top-down and bottom-up techniques allows us to maintain a high level of accuracy without sacrificing scalability.
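
To make the proximity idea concrete, here is a minimal sketch in Python. It is not our production code: the two toy lexicons, the function name feature_sentiment, and the three-token window are all assumptions chosen for illustration, and a real lexicon would be derived per domain as described above.

```python
import re
from collections import defaultdict

# Hypothetical toy lexicons (assumptions for illustration only).
# Maps a surface word to the canonical feature it refers to.
FEATURE_LEXICON = {
    "battery": "battery life",
    "screen": "display",
    "display": "display",
}
# Maps a sentiment word to a polarity in [-1, 1].
SENTIMENT_LEXICON = {
    "great": 1.0, "excellent": 1.0, "good": 0.5,
    "poor": -0.5, "dim": -0.5, "terrible": -1.0,
}

def feature_sentiment(text, window=3):
    """Average the polarity of sentiment words found within `window`
    tokens of each feature word, one sentence at a time."""
    scores = defaultdict(list)
    for sentence in re.split(r"[.!?]+", text.lower()):
        tokens = re.findall(r"[a-z']+", sentence)
        for i, token in enumerate(tokens):
            feature = FEATURE_LEXICON.get(token)
            if feature is None:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for neighbor in tokens[lo:hi]:
                if neighbor in SENTIMENT_LEXICON:
                    scores[feature].append(SENTIMENT_LEXICON[neighbor])
    return {feature: sum(v) / len(v) for feature, v in scores.items()}

result = feature_sentiment("The battery is great. The screen looks dim and poor.")
```

Working sentence by sentence keeps a sentiment word from attaching to a feature in the neighboring sentence. The sketch ignores negation ("not great") and multi-word phrases, both of which a serious implementation would need to handle.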

Intelligent synthesis

Many sites with user opinions synthesize them only by taking an average of the 5-star ratings. We think this is woefully inadequate. In addition to performing a fine-grained feature-based sentiment analysis, we also weight and filter individual opinions in various ways prior to synthesizing all the opinions on a topic.

  • We give greater weight to opinions that have been voted as "helpful" by other internet users.
  • Newer opinions are often more helpful, so we decay the weight of an opinion in proportion to its age.
  • We filter out duplicate opinions.
  • We reduce the weight of, or filter out, opinions that appear malformed or redundant.
Moreover, once all the opinions on a topic have been appropriately synthesized, we further refine the topic's scores by correcting for low sample sizes and comparing the topic with the average of its peers.
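
The weighting and adjustment steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not our actual formulas: the Opinion class, the logarithmic helpfulness bonus, the one-year half-life, and the shrinkage toward a peer average (with prior_strength acting like a number of phantom peer-average opinions) are all hypothetical choices.

```python
import math
from dataclasses import dataclass

@dataclass
class Opinion:
    text: str
    rating: float        # for example, a 1-5 star rating
    helpful_votes: int
    age_days: float

def opinion_weight(op, half_life_days=365.0):
    """Weight grows with helpful votes and halves every `half_life_days`.
    Both functional forms are assumptions made for this sketch."""
    helpful = 1.0 + math.log1p(op.helpful_votes)
    recency = 0.5 ** (op.age_days / half_life_days)
    return helpful * recency

def synthesize(opinions, peer_mean, prior_strength=5.0):
    """Weighted mean rating, with exact duplicates dropped and the result
    pulled toward `peer_mean` when few opinions are available."""
    seen = set()
    weight_sum = 0.0
    weighted_total = 0.0
    for op in opinions:
        key = " ".join(op.text.lower().split())  # normalize case and spacing
        if key in seen:
            continue  # filter out duplicate opinions
        seen.add(key)
        w = opinion_weight(op)
        weight_sum += w
        weighted_total += w * op.rating
    return (weighted_total + prior_strength * peer_mean) / (weight_sum + prior_strength)

fresh = Opinion("Love it", rating=5.0, helpful_votes=0, age_days=0.0)
score = synthesize([fresh], peer_mean=3.5)
```

With a single fresh five-star opinion, the score moves only slightly above the peer average of 3.5. As more weighted opinions arrive, the topic's own data dominates.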

Lucid text generation

Once the summary engine has performed a statistical analysis of a topic, the challenge is to take the resulting mass of numerical data and condense it into a lucid text summary. It's no easy task. To generate a quality summary, the text generation algorithm should meet the following criteria:

  • Relevancy: Select only the most relevant statistical information for inclusion in the text summary.
  • Fluency: Express the information in a fluent text paragraph that reads naturally.
  • Non-redundancy: Avoid redundant language and content within any given text summary.
  • Variety: Vary the content and language of the summaries for different topics, so they appear unique and not repetitive.
  • Robustness: Produce a valid summary of a topic whenever possible, even though the quantity or type of information available for inclusion may vary significantly across topics.
To meet these challenges, we developed a novel grammar algorithm and applied for a patent on it. It's not perfect, but we're actively improving it. Have ideas for us? We'd love to hear them. We are also hiring experts in NLP.
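
To show how the five criteria interact, here is a deliberately simple template-based sketch. It is not the grammar algorithm mentioned above. The templates, the summarize function, and the relevance measure (score magnitude times mention count) are assumptions made up for this example.

```python
import random

# Hypothetical sentence templates (assumptions for illustration only).
POSITIVE_TEMPLATES = [
    "Reviewers praise the {feature}.",
    "The {feature} earns high marks.",
    "Owners are pleased with the {feature}.",
]
NEGATIVE_TEMPLATES = [
    "Reviewers complain about the {feature}.",
    "The {feature} draws criticism.",
    "Owners are disappointed with the {feature}.",
]

def summarize(stats, max_points=2, seed=None):
    """Build a short summary from `stats`, which maps a feature name to a
    (score in [-1, 1], mention count) pair."""
    if not stats:
        # Robustness: still return something valid when there is no data.
        return "There are not enough opinions to summarize yet."
    rng = random.Random(seed)
    # Relevancy: keep only the strongest, most-discussed features.
    ranked = sorted(stats, key=lambda f: abs(stats[f][0]) * stats[f][1], reverse=True)
    # Variety: shuffle the templates so different topics read differently.
    pools = {True: POSITIVE_TEMPLATES[:], False: NEGATIVE_TEMPLATES[:]}
    for pool in pools.values():
        rng.shuffle(pool)
    used = {True: 0, False: 0}
    sentences = []
    for feature in ranked[:max_points]:
        positive = stats[feature][0] >= 0
        pool = pools[positive]
        # Non-redundancy: step through the pool rather than reusing a template.
        template = pool[used[positive] % len(pool)]
        used[positive] += 1
        sentences.append(template.format(feature=feature))
    return " ".join(sentences)

stats = {"battery life": (0.9, 40), "display": (-0.6, 25), "strap": (0.2, 3)}
summary = summarize(stats, seed=1)
```

Fixed templates like these fall short on fluency, since the sentences are never connected to one another, which is part of why a grammar-based approach is attractive.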

Data API

Our summary engine is encapsulated in an API that uses XML over a REST interface. Our front-end apps (such as our Firefox extension) all make use of this underlying API. Separating the summary engine from our applications in this way allows more rapid development, and it also allows us to open our summary API to third-party developers.
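
As a rough picture of what a client of an XML-over-REST API does with a response, the sketch below parses a sample payload using Python's standard library. The element names, attribute names, and the parse_summary function are invented for illustration; they do not describe the actual schema of our API.

```python
import xml.etree.ElementTree as ET

# Hypothetical response body (the schema is an assumption, not the real one).
SAMPLE_RESPONSE = """\
<summary topic="example-product">
  <text>Reviewers praise the battery life.</text>
  <feature name="battery life" score="0.9" mentions="40"/>
  <feature name="display" score="-0.6" mentions="25"/>
</summary>
"""

def parse_summary(xml_text):
    """Turn the XML payload into plain Python data for a front-end app."""
    root = ET.fromstring(xml_text)
    return {
        "topic": root.get("topic"),
        "text": root.findtext("text", default=""),
        "features": {
            node.get("name"): (float(node.get("score")), int(node.get("mentions")))
            for node in root.findall("feature")
        },
    }

parsed = parse_summary(SAMPLE_RESPONSE)
```

Because a client only depends on the response format, the summary engine behind the API can change without any change to the applications that consume it.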

Scalable computing

It takes massive computing power to mine and summarize millions of opinions. Our data API runs on Amazon Web Services (we love AWS). For scalable cloud computing, we use Amazon EC2. For scalable data storage, we use Amazon SimpleDB and S3. Our web applications are built on Google's new App Engine.

Languages and software

Nearly all of our code is written in Python, a wonderful and elegant language. We use Boto to access AWS from Python. Our NLP algorithms get help from NLTK, WordNet, and SciPy. Most of our servers run Ubuntu Linux. Our API is built with Django and Apache Lucene.

© 2008 Pluribo Inc. Not affiliated with Amazon. See complete Terms of Use.