"analytics" entries

Four short links: 12 June 2015

OLAP Datastores, Timely Dataflow, Paul Ford is God, and Static Analysis

  1. Pinot — a realtime distributed OLAP datastore, which is used at LinkedIn to deliver scalable real time analytics with low latency. It can ingest data from offline data sources (such as Hadoop and flat files) as well as online sources (such as Kafka). Pinot is designed to scale horizontally. (A toy sketch of this offline-plus-online ingestion idea follows the list.)
  2. Naiad: A Timely Dataflow System — in Timely Dataflow, the first two features are needed to execute iterative and incremental computations with low latency. The third feature makes it possible to produce consistent results, at both outputs and intermediate stages of computations, in the presence of streaming or iteration.
  3. What is Code (Paul Ford) — What the coders aren’t seeing, you have come to believe, is that the staid enterprise world that they fear isn’t the consequence of dead-eyed apathy but rather détente. Words and feels.
  4. Facebook Infer Opensourced — the static analyzer I linked to yesterday, released as open source today.
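
Pinot's hybrid serving model (immutable offline segments built by batch jobs, merged with a realtime buffer fed from a stream) is easier to picture with a toy. The Python sketch below is not Pinot code and assumes nothing about its API; ToyHybridStore is a made-up class that simply answers an aggregate query over both kinds of data to illustrate the idea.

```python
from collections import defaultdict

class ToyHybridStore:
    """Toy illustration of an offline + realtime OLAP table (not Pinot's API)."""

    def __init__(self):
        self.offline_segments = []   # immutable row lists published by batch jobs
        self.realtime_buffer = []    # rows arriving from a stream (e.g. Kafka)

    def load_offline_segment(self, rows):
        """Simulate a batch job (Hadoop job or flat file) publishing a segment."""
        self.offline_segments.append(list(rows))

    def ingest_event(self, row):
        """Simulate consuming one event from an online source."""
        self.realtime_buffer.append(row)

    def count_by(self, dimension):
        """Aggregate across offline segments and the realtime buffer."""
        counts = defaultdict(int)
        for segment in self.offline_segments:
            for row in segment:
                counts[row[dimension]] += 1
        for row in self.realtime_buffer:
            counts[row[dimension]] += 1
        return dict(counts)

if __name__ == "__main__":
    store = ToyHybridStore()
    store.load_offline_segment([{"country": "US"}, {"country": "NZ"}])
    store.ingest_event({"country": "US"})
    print(store.count_by("country"))  # {'US': 2, 'NZ': 1}
```

In a real system the offline segments would be columnar, indexed files and the realtime buffer would be consumed from Kafka, but the query path (scan both, merge the aggregates) has the same shape.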

Four short links: 3 June 2015

Filter Design, Real-Time Analytics, Neural Turing Machines, and Evaluating Subjective Opinions

  1. How to Design Applied Filters — The most frequently observed issue during usability testing was filtering values changing placement when the user applied them – either to another position in the list of filtering values (typically the top) or to an “Applied filters” summary overview. During testing, the subjects were often confounded as they noticed that the filtering value they just clicked was suddenly “no longer there.”
  2. Twitter Heron — a real-time analytics platform that is fully API-compatible with Storm […] At Twitter, Heron is used as our primary streaming system, running hundreds of development and production topologies. Since Heron is efficient in terms of resource usage, after migrating all Twitter’s topologies to it we’ve seen an overall 3x reduction in hardware, causing a significant improvement in our infrastructure efficiency.
  3. ntm — an implementation of neural Turing machines. (via @fastml_extra)
  4. Bayesian Truth Serum — a scoring system for eliciting and evaluating subjective opinions from a group of respondents, in situations where the user of the method has no independent means of evaluating respondents’ honesty or their ability. It leverages respondents’ predictions about how other respondents will answer the same questions. Through these predictions, respondents reveal their meta-knowledge, which is knowledge of what other people know.
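
For a rough sense of how those predictions get used, here is a toy Python sketch in the spirit of Prelec's Bayesian Truth Serum: an information score that rewards answers that turn out to be more popular than the group collectively predicted, plus a prediction score that rewards accurate forecasts of the answer distribution. It is a simplified illustration rather than the published scoring rule, and the alpha weight and eps smoothing constant are arbitrary choices for the example.

```python
import math

def bts_scores(answers, predictions, alpha=1.0, eps=1e-9):
    """Toy Bayesian-Truth-Serum-style scores (simplified sketch, not the exact published rule).

    answers     : list of chosen option indices, one per respondent
    predictions : list of per-respondent predicted answer distributions
                  (each a list of probabilities over the options)
    """
    n = len(answers)
    m = len(predictions[0])

    # Empirical frequency of each option (with a little smoothing).
    freq = [(sum(1 for a in answers if a == k) + eps) / (n + m * eps) for k in range(m)]

    # Geometric mean of the predicted frequencies for each option.
    geo = [math.exp(sum(math.log(p[k] + eps) for p in predictions) / n) for k in range(m)]

    scores = []
    for a, pred in zip(answers, predictions):
        info = math.log(freq[a] / geo[a])    # bonus for "surprisingly popular" answers
        pred_score = sum(freq[k] * math.log((pred[k] + eps) / freq[k]) for k in range(m))
        scores.append(info + alpha * pred_score)
    return scores

if __name__ == "__main__":
    answers = [0, 0, 1]                                   # three respondents, two options
    predictions = [[0.6, 0.4], [0.7, 0.3], [0.5, 0.5]]    # each respondent's forecast of the group
    print(bts_scores(answers, predictions))
```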

The two-sided coin of Web performance

Hacking performance across your organization.

I’ve given Web performance talks where I get to show one of my favorite slides with the impact of third-party dependencies on load time. It’s the perfect use case for “those marketing people,” who overload pages with the tracking pixels and tags that make page load time go south. This, of course, would fuel the late-night pub discussion with fellow engineers about how much faster the Web would be if those marketing people attended a basic Web performance 101 course.

I’ve also found myself discussing exactly this topic in a meeting. This time, however, I was the guy arguing to keep the tracking code, although I was well aware of the performance impact. So what happened?

Four short links: 13 April 2015

Occupation Changes, Country Data, Cultural Analytics, and Dysfunctional Software Engineering Organisations

  1. The Great Reversal in the Demand for Skill and Cognitive Tasks (PDF) — The only difference with more conventional models of skill-biased technological change is our modelling of the fruits of cognitive employment as creating a stock instead of a pure flow. This slight change causes technological change to generate a boom and bust cycle, as is common in most investment models. We also incorporated into this model a standard selection process whereby individuals sort into occupations based on their comparative advantage. The selection process is the key mechanism that explains why a reduction in the demand for cognitive tasks, which are predominantly filled by higher educated workers, can result in a loss of employment concentrated among lower educated workers. While we do not claim that our model is the only structure that can explain the observations we present, we believe it gives a very simple and intuitive explanation to the changes pre- and post-2000.
  2. provinces — state and province lists for (some) countries.
  3. Cultural Analytics — the use of computational and visualization methods for the analysis of massive cultural data sets and flows. Interesting visualisations as well as automated understandings.
  4. The Code is Just the Symptom — The engineering culture was a three-layer cake of dysfunction, where everyone down the chain had to execute what they knew to be an impossible task, at impossible speeds, perfectly. It was like the games of Simon Says and Telephone combined to bad effect. Most engineers will have flashbacks at these descriptions. Trigger warning: candid descriptions of real immature software organisations.

Four short links: 10 February 2015

Speech Recognition, Predictive Analytic Queries, Video Chat, and Javascript UI Library

  1. The Uncanny Valley of Speech Recognition (Zach Holman) — I’m reminded of driving up US-280 in 2003 or so with @raelity, a Kiwi and a South African trying every permutation of American accent from Kentucky to Yosemite Sam in order to get TellMe to stop giving us the weather for zipcode 10000. It didn’t recognise the swearing either. (Caution: features similarly strong language.)
  2. TuPAQ: An Efficient Planner for Large-scale Predictive Analytic Queries (PDF) — an integrated PAQ [Predictive Analytic Queries] planning architecture that combines advanced model search techniques, bandit resource allocation via runtime algorithm introspection, and physical optimization via batching. The resulting system, TUPAQ, solves the PAQ planning problem with comparable accuracy to exhaustive strategies but an order of magnitude faster, and can scale to models trained on terabytes of data across hundreds of machines. (A toy sketch of the bandit-style allocation idea appears after this list.)
  3. p2pvc — point-to-point video chat. In an 80×25 terminal window.
  4. Sortable — nifty UI library.
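
As noted in the TuPAQ entry above, the planner leans on bandit-style resource allocation: spend most of the training budget on model configurations that look promising early. The Python sketch below is a generic successive-halving loop over made-up configurations with a synthetic scoring function; it illustrates that allocation idea only and is not taken from the TuPAQ system.

```python
import random

def successive_halving(configs, evaluate, budget_per_round=1, rounds=3, keep=0.5):
    """Toy bandit-style allocation: train every config a little, keep the best
    fraction, give the survivors more budget, and repeat."""
    survivors = list(configs)
    spent = {c: 0 for c in configs}
    for _ in range(rounds):
        scores = {}
        for c in survivors:
            spent[c] += budget_per_round
            scores[c] = evaluate(c, spent[c])   # score given the budget spent so far
        survivors.sort(key=lambda c: scores[c], reverse=True)
        survivors = survivors[:max(1, int(len(survivors) * keep))]
    return survivors[0], spent

if __name__ == "__main__":
    random.seed(0)
    # Pretend each config has a hidden quality; more budget -> less noisy estimate.
    quality = {f"model-{i}": random.random() for i in range(8)}

    def evaluate(config, budget):
        return quality[config] + random.gauss(0, 1.0 / budget)

    best, spent = successive_halving(list(quality), evaluate)
    print("picked:", best, "budget spent:", spent)
```

Halving is about the simplest bandit allocation scheme there is; the point is just that poor configurations stop consuming budget early.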

What happens when fashion meets data: The O’Reilly Radar Podcast

Liza Kindred on the evolving role of data in fashion and the growing relationship between tech and fashion companies.

Editor’s note: you can subscribe to the O’Reilly Radar Podcast through iTunes, SoundCloud, or directly through our podcast’s RSS feed.

In this podcast episode, I talk with Liza Kindred, founder of Third Wave Fashion and author of the new free report “Fashioning Data: How fashion industry leaders innovate with data and what you can learn from what they know.” Kindred addresses the evolving role data and analytics are playing in the fashion industry, and the emerging connections between technology and fashion companies. “One of the things that fashion is doing better than maybe any other industry,” Kindred says, “is facilitating conversations with users.”

Gathering and analyzing user data creates opportunities for the fashion and tech industries alike. One example of this is the trend toward customization.


New approaches to anomaly detection

A practical example of how anomaly detection makes complex data problems easier to solve.

As new tools for distributed storage and analysis of big data become more stable and widely known, there is a growing need to discover best practices for analytics at this scale. One of the areas of widespread interest that crosses many verticals is anomaly detection.

At its best, anomaly detection is used to find unusual, rarely occurring events or data for which little is known in advance. Examples include changes in sensor data reported for a variety of parameters, suspicious behavior on secure websites, or unexpected changes in web traffic. In some cases, the data patterns being examined are simple and regular and, thus, fairly easy to model.

Anomaly detection approaches start with some essential but sometimes overlooked ideas about anomalies:

  • Anomalies are defined not by their own characteristics but in contrast to what is normal.

Thus …

  • Before you can spot an anomaly, you first have to figure out what “normal” actually is.

This need to first discover what is considered “normal” may seem obvious, but it is not always clear how to do it, especially in situations with complicated patterns of behavior. Best results are achieved when you use statistical methods to build an adaptive model of events in the system you are analyzing as a first step toward discovering anomalous behavior.
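
As a minimal illustration of that “model normal first” step, the sketch below uses a rolling mean and standard deviation as the adaptive model and flags points that land more than a few standard deviations away. Real systems use far richer models; the window size and the three-sigma threshold here are arbitrary choices for the example.

```python
from collections import deque
import math

def rolling_anomalies(values, window=50, threshold=3.0):
    """Flag points that deviate from a rolling model of 'normal'.

    The rolling window is the (very simple) adaptive model of normal behavior;
    a point is flagged when it falls more than `threshold` standard deviations
    from the window's mean.
    """
    history = deque(maxlen=window)
    flagged = []
    for i, x in enumerate(values):
        if len(history) >= 10:                  # wait for a minimal baseline
            mean = sum(history) / len(history)
            var = sum((h - mean) ** 2 for h in history) / len(history)
            std = math.sqrt(var)
            if std > 0 and abs(x - mean) > threshold * std:
                flagged.append((i, x))
        history.append(x)
    return flagged

if __name__ == "__main__":
    import random
    random.seed(1)
    data = [random.gauss(0, 1) for _ in range(200)]
    data[120] = 9.0                              # inject an obvious anomaly
    print(rolling_anomalies(data))               # index 120 should be flagged
```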

Four short links: 1 May 2014

Cloud Jurisdiction, Driverless Cars, Robotics IPOs, and Fitting a Catalytic Convertor to Your Data Exhaust

  1. US Providers Must Divulge Data from Offshore Servers (Gigaom) — A U.S. magistrate judge ruled that U.S. cloud vendors must fork over customer data even if that data resides in data centers outside the country. (via Alistair Croll)
  2. Inside Google’s Self-Driving Car (Atlantic Cities) — Urmson says the value of maps is one of the key insights that emerged from the DARPA challenges. They give the car a baseline expectation of its environment; they’re the difference between the car opening its eyes in a completely new place and having some prior idea what’s going on around it. This is a long and interesting piece on the experience and the creator’s concerns around the self-driving cars. Still looking for the comprehensive piece on the subject.
  3. Recent Robotics-Related IPOs — not all the exits are to Google.
  4. How One Woman Hid Her Pregnancy From Big Data (Mashable) — “I really couldn’t have done it without Tor, because Tor was really the only way to manage totally untraceable browsing. I know it’s gotten a bad reputation for Bitcoin trading and buying drugs online, but I used it for BabyCenter.com.”

Four short links: 18 April 2014

Interview Tips, Data of Any Size, Science Writing, and Instrumented Javascript

  1. 16 Interviewing Tips for User Studies — these apply to many situations beyond user interviews, too.
  2. The Backlash Against Big Data contd. (Mike Loukides) — Learn to be a data skeptic. That doesn’t mean becoming skeptical about the value of data; it means asking the hard questions that anyone claiming to be a data scientist should ask. Think carefully about the questions you’re asking, the data you have to work with, and the results that you’re getting. And learn that data is about enabling intelligent discussions, not about turning a crank and having the right answer pop out.
  3. The Science of Science Writing (American Scientist) — also applicable beyond the specific field for which it was written.
  4. earhorn — Earhorn instruments your JavaScript and shows you a detailed, reversible, line-by-line log of JavaScript execution, sort of like console.log’s crazy uncle.
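
Earhorn itself targets JavaScript, but the line-by-line execution log it produces has a tiny Python analogue built on the standard library's trace hook. The sketch below only illustrates the instrumentation concept, not how Earhorn works; trace_lines and demo are made-up names for the example.

```python
import sys

def trace_lines(func, *args, **kwargs):
    """Run func and print each of its lines as it executes (a crude execution log)."""
    target_code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is target_code:
            if event == "line":
                print(f"  line {frame.f_lineno}: locals={frame.f_locals}")
            return tracer        # keep receiving line events for this frame
        return None              # don't trace other frames

    sys.settrace(tracer)
    try:
        return func(*args, **kwargs)
    finally:
        sys.settrace(None)

def demo(n):
    total = 0
    for i in range(n):
        total += i
    return total

if __name__ == "__main__":
    print("result:", trace_lines(demo, 3))
```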

Can data provide the trust we need in health care?

Collecting actionable data is a challenge for today's data tools

One of the problems dragging down the US health care system is that nobody trusts anyone else. Most of us, as individuals, place faith in our personal health care providers, whether or not that faith is warranted. But on a larger scale we’re all suspicious of each other:

  • Doctors don’t trust patients, who aren’t forthcoming about all the bad habits they indulge in and often fail to follow the most basic instructions, such as taking their medications.
  • The payers–which include insurers, many government agencies, and increasingly the whole patient population as our deductibles and other out-of-pocket expenses ascend–don’t trust the doctors, who waste an estimated 20% or more of all health expenditures, including some $30 billion or more in fraud each year.
  • The public distrusts the pharmaceutical companies (although we still follow their advice on advertisements and ask our doctors for the latest pill) and is starting to distrust clinical researchers as we hear about conflicts of interest and difficulties replicating results.
  • Nobody trusts the federal government, which pursues two (contradictory) goals of lowering health care costs and stimulating employment.

Yet everyone has beneficent goals and good ideas for improving health care. Doctors want to feel effective, patients want to stay well (even if that desire doesn’t always translate into action), the Department of Health and Human Services champions very lofty goals for data exchange and quality improvement, clinical researchers put their work above family and comfort, and even private insurance companies are trying to move to “fee for value” programs that ensure coordinated patient care.

