Daylife's API for the News

Several years ago, my friend Upendra Shardanand tried to get me to join him in starting a company that would remake the way news is created and understood — overturning the worst, ambulance-chasing tendencies of modern journalism, and building tools to help people track and understand the topics and people that shape their lives. I begged off in order to pursue my own startup, but it was the hardest “no” at which to arrive, since I respect Upendra so much and so admire what he was looking to build. Though we’ve chosen to pursue different topics, we have in common a desire to make the world better through entrepreneurial projects, and Upendra’s effort definitely would have won me over had I not already started down my own road.

Happily, Upendra has built and launched a company, Daylife, around his ideas about the news industry, and I’m proud to be a Daylife advisor. There’s an excellent article about Daylife in the current issue of BusinessWeek, talking about some of their early successes.

This month, Daylife is sponsoring a developer contest around its API, which provides a rich programming interface around news topics, people, and places. I’m one of the judges for the contest, along with Brian Behlendorf, Clay Shirky, Jeff Jarvis, and others. It makes me very happy to see some of the API samples, many of which remind me of ideas I heard kicked around back when Google News first launched. (Coincidentally, there’s an interesting article about the stagnation of Google News in today’s New York Times.) Daylife has also put together a list of Lazyweb ideas for the contest, my favorite of which is this design for a tracker of news about evil dictators.

I’m looking forward to seeing what people come up with for the contest, and I’d encourage you to check it out and submit a project. I started playing around tonight and quickly came up with three ideas for Daylife API projects that would help my startup. It won’t take too many people doing the same before Upendra’s idea of changing the way news works starts to take shape in the world.

  • Falafulu Fisi

    Hi Marc Hedlund,

    Daylife looks interesting. Here are some research papers that might be of interest to Daylife. Their titles and abstracts are listed below:

    #1) “Emotion Sensitive News Agent: An Approach Towards User Centric Emotion Sensing from the News”

    This paper describes a character-based system called “Emotion Sensitive News Agent” (ESNA). ESNA has been developed as a news aggregator to fetch news from different news sources chosen by a user, and to categorize the themes of the news into eight emotion types. A small user study indicates that the system is conceived as intelligent and interesting as an affective interface. ESNA exemplifies a recent research agenda that aims at recognizing affective information conveyed through texts. News is an interesting application domain where users may have marked attitudes to certain events or entities reported about. Different approaches have already been employed to “sense” emotion from text. The novelty of our approach is twofold: affective information conveyed through text is analyzed (1) by considering the cognitive and appraisal structure of emotions, and (2) by taking into account user preferences.

    #2) “Content Extraction from News Pages Using Particle Swarm Optimization on Linguistic and Structural Features”

    Today’s Web pages are commonly made up of more than merely one cohesive block of information. For instance, news pages from popular media channels such as Financial Times or Washington Post consist of no more than 30%-50% of textual news, next to advertisements, link lists to related articles, disclaimer information, and so forth. However, for many search-oriented applications such as the detection of relevant pages for an in-focus topic, dissecting the actual textual content from surrounding page clutter is an essential task, so as to maintain appropriate levels of document retrieval accuracy. We present a novel approach that extracts real content from news Web pages in an unsupervised fashion. Our method is based on distilling linguistic and structural features from text blocks in HTML pages, having a Particle Swarm Optimizer (PSO) learn feature thresholds for optimal classification performance. Empirical evaluations and benchmarks show that our approach works very well when applied to several hundreds of news pages from popular media in 5 languages.

    #3) “NewsRec, a SVM-driven Personal Recommendation System for News Websites”

    Fast absorption of information is a necessity for modern information workers. In the short-lived news area, information is a perishable good. While online news websites can speed up the publication of current events compared to traditional newspapers, reading can be more exhausting as online readers have to navigate through websites by clicking on abstracts or headlines before viewing the underlying article. Online shops use personalization methods in order to improve product selection. So far, most types of personalization are offered by website owners and are therefore bound to a specific website. This work presents NewsRec, a client side personal recommendation system for news websites, that supports information workers during their usage of online news websites. Design aspects are discussed and empirical results are shown.

    #4) “Topic Detection and Tracking for News Web Pages”

    This paper proposes a new approach to observe, summarize and track events from a collection of news Web pages. Given a set of temporal Web pages, we obtain valid timestamps from the Web pages and detect events by means of clustering. Then we track events by using KeyGraph based on the clusters and abstract the clusters by using SuffixTree. We examine some experimental results and show the usefulness of our approach.

    #5) “Automated Metadata and Instance Extraction from News Web Sites”

    Over the past few years the World Wide Web has established itself as a vital resource for news. With the continuous growth in the number of available news Web sites and the diversity in their presentation of content, there is an increasing need to organize the news-related information on the Web and keep track of it. In this paper, we present automated techniques for extracting metadata instance information by organizing and mining a set of news Web sites. We develop algorithms that detect and utilize HTML regularities in the Web documents to turn them into hierarchical semantic structures encoded as XML. The tree-mining algorithms that we present identify key domain concepts and their taxonomical relationships. We also extract semi-structured concept instances annotated with their labels whenever they are available. We report experimental evaluation for the news domain to demonstrate the efficacy of our algorithms.

    #6) “Event Recognition on News Stories and Semi-Automatic Population of an Ontology”

    This paper describes a system which recognizes events in news stories. Our system classifies stories and populates a hand-crafted ontology with new instances of classes defined in it. Currently, our system recognizes events which can be classified as belonging to a single category and it also recognizes overlapping events within one article (more than one event is recognized). In each case, the system provides a confidence value associated with the suggested classification. Our system uses Information Extraction and Machine Learning technologies. The system was tested using a corpus of 200 news articles from an archive of electronic news stories describing the academic life of the Knowledge Media Institute (KMi). In particular, these news stories describe events such as a project award, publications, visits, etc.

    The publications I have cited above are from the Proceedings of the 2003-2007 IEEE/WIC/ACM International Conference on Web Intelligence. I believe there are other papers in each volume of the Web Intelligence Proceedings that your developers may be interested in, but you have to click through to each article’s abstract and read it to see whether it is of interest. You can buy each article online. If you don’t want to do that, check whether your local university library has print copies, so you can photocopy the articles you like. The print series is available in my local university library, so I never buy articles online. If an article is available online somewhere (usually on the author’s website), I download that copy and don’t need to go to the library to photocopy it. If the journal an article is published in is not available in my local university library, I sometimes ask the authors to send me a free copy of the paper.

    You might also find WEKA useful for Daylife product development. It is a Java data-mining API developed here in New Zealand, and it is the most popular open source machine learning tool today. In fact, some of the algorithms in the papers I have cited above are already available in WEKA.

    The following O’Reilly book is also useful if your developers don’t have a background in data mining and machine learning, since it makes these complex topics accessible to non-experts:

    “Programming Collective Intelligence” by Toby Segaran