Unstructured data is worth the effort when you've got the right tools

It’s dawning on companies that data analysis can yield insights and inform business decisions. As data-driven benefits grow, so do our demands about what more data can tell us and what other types we can mine.

During her PhD studies, Alyona Medelyan (@zelandiya) developed Maui, an open source tool that performs as well as professional librarians in identifying main topics in documents. Medelyan now leads the research and development of API-based products at Pingar.

Pingar senior software researcher Anna Divoli (@annadivoli) studied sentence extraction for semi-automatic annotation of biological databases. Her current research focuses on developing methodologies for acquiring knowledge from textual data.

“Big data is important in many diverse areas, such as science, social media, and enterprise,” observes Divoli. “Our big data niche is analysis of unstructured text.” In the interview below, Medelyan and Divoli describe their work and what they see on the horizon for unstructured data analysis.

How did you get started in big data?

Anna Divoli: I began working with big data as it relates to science during my PhD. I worked with bioinformaticians who mined proteomics data. My research was on mining information from the biomedical literature that could serve as annotation in a database of protein families.

Alyona Medelyan: Like Anna, I mainly focus on unstructured data and how it can be managed using clever algorithms. During my PhD in natural language processing and data mining, I started applying such algorithms to large datasets to investigate how time-consuming data analysis and processing tasks can be automated.

What projects are you working on now?

Alyona Medelyan: For the past two years at Pingar, I’ve been developing solutions for enterprise customers who accumulate unstructured data and want to search, analyze, and explore this data efficiently. We develop entity extraction, text summarization, and other text analytics solutions to help scrub and interpret unstructured data in an organization.

Anna Divoli: We’re focusing on several verticals that struggle with too much textual data, such as bioscience, legal, and government. We also strive to develop language-independent solutions.

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20

What are the trends and challenges you’re seeing in the big data space?

Anna Divoli: There are plenty of trends that span various aspects of big data, such as making the data accessible from mobile devices, cloud solutions, addressing security and privacy issues, and analyzing social data.

One trend that is pertinent to us is the increasing popularity of APIs. Plenty of APIs exist that give access to large datasets, but there also powerful APIs that manage big data efficiently, such as text analytics, entity extraction, and data mining APIs.

Alyona Medelyan: The great thing about APIs is that they can be integrated into existing applications used inside an organization.

With regard to the challenges, enterprise data is very messy, inconsistent, and spread out across multiple internal systems and applications. APIs like the ones we’re working on can bring consistency and structure to a company’s legacy data.

The presentation you’ll be giving at the Strata Conference will focus on practical applications of mining unstructured data. Why is this an important topic to address?

Anna Divoli: Every single organization in every vertical deals with unstructured data. Tons of text is produced daily — emails, reports, proposals, patents, literature, etc. This data needs to be mined to allow fast searching, easy processing, and quick decision making.

Alyona Medelyan: Big data often stands for structured data that is collected into a well-defined database — who bought which book in an online bookstore, for example. Such databases are relatively easy to mine because they have a consistent form. At the same time, there is plenty of unstructured data that is just as valuable, but it’s extremely difficult to analyze it because it lacks structure. In our presentation, we will show how to detect structure using APIs, natural language processing and text mining, and demonstrate how this creates immediate value for business users.

Are there important new tools or projects on the horizon for big data?

Alyona Medelyan: Text analytics tools are very hot right now, and they improve daily as scientists come up with new ways of making algorithms understand written text more accurately. It is amazing that an algorithm can detect names of people, organizations, and locations within seconds simply by analyzing the context in which words are used. The trend for such tools is to move toward recognition of further useful entities, such as product names, brands, events, and skills.

Anna Divoli: Also, entity relation extraction is an important trend. A relation that consistently connects two entities in many documents is important information in science and enterprise alike. Entity relation extraction helps detect new knowledge in big data.

Other trends include detecting sentiment in social data, integrating multiple languages, and applying text analytics to audio and video transcripts. The number of videos grows at a constant rate, and transcripts are even more unstructured than written text because there is no punctuation. That’s another exciting area on the horizon!

Who do you follow in the big data community?

Alyona Medelyan: We tend to follow researchers in areas that are used for dealing with big data, such as natural language processing, visualization, user experience, human computer information retrieval, as well as the semantic web. Two of them are also speaking at Strata this year: Daniel Tunkelang and Marti Hearst.

This interview was edited and condensed.

Related: