- Eight Docker Development Patterns (Vidar Hokstad) — patterns for creating repeatable builds that result in as-static-as-possible server environments.
- How to Make More Published Research True (PLOSmedicine) — overview of efforts, and research on those efforts, to raise the proportion of published research which is true.
- Gearpump — Intel’s “actor-driven streaming framework”, initial benchmarks shows that we can process 2 million messages/second (100 bytes per message) with latency around 30ms on a cluster of 4 nodes.
- Foundations of Data Science (PDF) — These notes are a first draft of a book being written by Hopcroft and Kannan [of Microsoft Research] and in many places are incomplete. However, the notes are in good enough shape to prepare lectures for a modern theoretical course in computer science.
"data science" entries
From the Internet of Things to data-driven fashion, here are key insights from Strata + Hadoop World in Barcelona 2014.
Experts from across the big data world came together for Strata + Hadoop World in Barcelona 2014. We’ve gathered insights from the event below.
#IoTH: The Internet of Things and Humans
“If we could start over with these capabilities we have now, how would we do it differently?” Tim O’Reilly continues to explore data and the Internet of Things through the lens of human empowerment and the ability to “use technology to give people superpowers.”
Rajiv Maheswaran talks about the tools and techniques required to analyze new kinds of sports data.
Many data scientists are comfortable working with structured operational data and unstructured text. Newer techniques like deep learning have opened up data types like images, video, and audio.
Other common data sources are garnering attention. With the rise of mobile phones equipped with GPS, I’m meeting many more data scientists at start-ups and large companies who specialize in spatio-temporal pattern recognition. Analyzing “moving dots” requires specialized tools and techniques. A few months ago, I sat down with Rajiv Maheswaran founder and CEO of Second Spectrum, a company that applies analytics to sports tracking data. Maheswaran talked about this new kind of data and the challenge of finding patterns:
“It’s interesting because it’s a new type of data problem. Everybody knows that big data machine learning has done a lot of stuff in structured data, in photos, in translation for language, but moving dots is a very new kind of data where you haven’t figured out the right feature set to be able to find patterns from. There’s no language of moving dots, at least not that computers understand. People understand it very well, but there’s no computational language of moving dots that are interacting. We wanted to build that up, mostly because data about moving dots is very, very new. It’s only in the last five years, between phones and GPS and new tracking technologies, that moving data has actually emerged.”
Schemas inevitably will change — Apache Avro offers an elegant solution.
When a team first starts to consider using Hadoop for data storage and processing, one of the first questions that comes up is: which file format should we use?
This is a reasonable question. HDFS, Hadoop’s data storage, is different from relational databases in that it does not impose any data format or schema. You can write any type of file to HDFS, and it’s up to you to process it later.
The usual first choice of file formats is either comma delimited text files, since these are easy to dump from many databases, or JSON format, often used for event data or data arriving from a REST API.
There are many benefits to this approach — text files are readable by humans and therefore easy to debug and troubleshoot. In addition, it is very easy to generate them from existing data sources and all applications in the Hadoop ecosystem will be able to process them. Read more…
Researchers and startups are building tools that enable feature discovery.
Why do data scientists spend so much time on data wrangling and data preparation? In many cases it’s because they want access to the best variables with which to build their models. These variables are known as features in machine-learning parlance. For many0 data applications, feature engineering and feature selection are just as (if not more important) than choice of algorithm:
Good features allow a simple model to beat a complex model.
(to paraphrase Alon Halevy, Peter Norvig, and Fernando Pereira)
The terminology can be a bit confusing, but to put things in context one can simplify the data science pipeline to highlight the importance of features:
Feature Engineering or the Creation of New Features
A simple example to keep in mind is text mining. One starts with raw text (documents) and extracted features could be individual words or phrases. In this setting, a feature could indicate the frequency of a specific word or phrase. Features1 are then used to classify and cluster documents, or extract topics associated with the raw text. The process usually involves the creation2 of new features (feature engineering) and identifying the most essential ones (feature selection).
New report covers areas of innovation and their difficulties
O’Reilly recently released a report I wrote called The Information Technology Fix for Health: Barriers and Pathways to the Use of Information Technology for Better Health Care. Along with our book Hacking Healthcare, I hope this report helps programmers who are curious about Health IT see what they need to learn and what they in turn can contribute to the field.
Computers in health are a potentially lucrative domain, to be sure, given a health care system through which $2.8 trillion, or $8.915 per person, passes through each year in the US alone. Interest by venture capitalists ebbs and flows, but the impetus to creative technological hacking is strong, as shown by the large number of challenges run by governments, pharmaceutical companies, insurers, and others.
Some things you should consider doing include:
- Join open source projects
- Numerous projects to collect and process health data are being conducted as free software; find one that raises your heartbeat and contribute. For instance, the most respected health care system in the country, VistA from the Department of Veterans Affairs, has new leadership in OSEHRA, which is trying to create a community of vendors and volunteers. You don’t need to understand the oddities of the MUMPS language on which VistA is based to contribute, although I believe some knowledge of the underlying database would be useful. But there are plenty of other projects too, such as the OpenMRS electronic record system and the projects that cooperate under the aegis of Open Health Tools.