"realtime" entries

Four short links: 16 September 2015

Data Pipelines, Amazon Culture, Real-time NFL Data, and Deep Learning for Chess

  1. Three Best Practices for Building Successful Data Pipelines (Michael Li) — three key areas that are often overlooked in data pipelines: making your analysis reproducible, consistent, and productionizable.
  2. Amazon’s Culture Controversy Decoded (Rita J King) — very interesting culture map analysis of the reports of Amazon’s culture, and context for how companies make choices about what to be. (via Mike Loukides)
  3. How Will Real-Time Tracking Change the NFL? (New Yorker) — At the moment, the NFL is being tightfisted with the data. Commentators will have access during games, as will the betting and analytics firm Sportradar. Users of the league’s Xbox One app, which provides an interactive way of browsing video clips, fantasy-football statistics, and other metrics, will be able to explore a feature called Next Gen Replay, which allows them to track each player’s speed and trajectory, combining moving lines on a virtual field with live footage from the real one. But, for now, coaches are shut out; once a player exits the locker room on game day, the dynamic point cloud that is generated by his movement through space is a corporately owned data set, as outlined in the league’s 2011 collective-bargaining agreement. Which should tell you all you need to know about the NFL’s role in promoting sporting excellence.
  4. Giraffe: Using Deep Reinforcement Learning to Play Chess (Matthew Lai) — Giraffe, a chess engine that uses self-play to discover all its domain-specific knowledge, with minimal hand-crafted knowledge given by the programmer. See also the code. (via GitXiv)
Four short links: 11 August 2015

Real-time Sports Analytics, UI Regression Testing, AI vs. Charity, and Google's Data Pipeline Model

  1. Denver Broncos Testing In-Game Analytics — their newly hired director of analytics works with the coach. With Tanney nearby, Kubiak can receive a quick report on the statistical probabilities of almost any situation. Say that you have fourth-and-3 from the opponent’s 45-yard-line with four minutes to go. Do the large-sample-size percentages make the risk-reward ratio acceptable enough to go for it? Tanney’s analytics can provide insight to aid Kubiak’s decision-making. A toy sketch of the underlying expected-value arithmetic follows this list. (via Flowing Data)
  2. Visual Review (GitHub) — Apache-licensed productive and human-friendly workflow for testing and reviewing your Web application’s layout for any regressions.
  3. Effective Altruism / Global AI (Vox) — fear of AI-run-amok (“existential risks”) contaminating a charity movement.
  4. The Dataflow Model (PDF) — Google Research paper presenting a model aimed at ease of use in building practical, massive-scale data processing pipelines.
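
As promised in the first item, here is what the expected-points arithmetic behind a fourth-down call can look like. All the numbers below are illustrative stand-ins, not anything from the Broncos’ actual model.

```python
# Toy fourth-down decision sketch: none of these numbers come from the
# Broncos' model; every input is a made-up, league-average-style guess.

def expected_points(p_success, ep_if_converted, ep_if_failed):
    """Expected points from going for it on fourth down."""
    return p_success * ep_if_converted + (1 - p_success) * ep_if_failed

# Fourth-and-3 from the opponent's 45: hypothetical inputs.
go_for_it = expected_points(p_success=0.55,       # assumed conversion rate
                            ep_if_converted=2.1,  # assumed EP with a fresh set of downs
                            ep_if_failed=-1.4)    # assumed EP after a turnover on downs
punt = 0.3                                        # assumed EP after an average punt

print(f"go for it: {go_for_it:+.2f} EP vs. punt: {punt:+.2f} EP")
print("recommendation:", "go for it" if go_for_it > punt else "punt")
```
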
Four short links: 4 August 2015

Data-Flow Graphing, Realtime Predictions, Robot Hotel, and Open-Source RE

  1. Data-flow Graphing in Python (Matt Keeter) — not shared because data-flow graphing is a sexy new hot topic that’s gonna set the world on fire (though, I bet that’d make Matt’s day), but because there are entire categories of engineering and operations migraines that are caused by not knowing where your data came from or goes to, when, how, and why. Remember Wirth’s “algorithms + data structures = programs”? Data flows seem like a different slice of “programs.” Perhaps “data flow + typos = programs”? A toy provenance sketch follows this list.
  2. Machine Learning for Sports and Real-time Predictions (Robohub) — podcast interview for your commute. Real time is gold.
  3. Japan’s Robot Hotel is Serious Business (Engadget) — the hotel was architected to suit robots: “For the porter robots, we designed the hotel to include wide paths.” Two paths slope around the hotel lobby: one inches up to the second floor, while another follows a gentle decline to guide first-floor guests (slowly, but with their baggage) all the way to their room. Makes sense: at Solid, I spoke to a chap working on robots for existing hotels, and there’s an entire engineering challenge in navigating an elevator that you wouldn’t believe.
  4. bokken — an open source GUI for reverse engineering code.
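
As flagged in the first item, here is a toy sketch of the provenance idea behind data-flow graphing (my own illustration, not Matt Keeter’s library): each value remembers the operation and inputs that produced it, so you can always ask where a number came from.

```python
# Toy data-flow / provenance tracker: values carry their own history.
class Node:
    def __init__(self, value, op="input", parents=()):
        self.value, self.op, self.parents = value, op, parents

    def __add__(self, other):
        return Node(self.value + other.value, "add", (self, other))

    def __mul__(self, other):
        return Node(self.value * other.value, "mul", (self, other))

    def lineage(self, depth=0):
        """Print the data-flow tree that produced this value."""
        print("  " * depth + f"{self.op} = {self.value}")
        for parent in self.parents:
            parent.lineage(depth + 1)

revenue = Node(120.0)
cost = Node(80.0)
profit = revenue + (cost * Node(-1.0))
profit.lineage()   # answers "where did this number come from?"
```
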
Four short links: 11 March 2014

Game Analysis, Brave New (Disney)World, Internet of Deadly Things, and Engagement vs Sharing

  1. In-Game Graph Analysis (The Economist) — one MLB team has bought a Cray Urika graph-processing appliance for in-game analysis of data. Please hold, boggling. (via Courtney Nash)
  2. Disney Bets $1B on Technology (BusinessWeek) — MyMagic+ promises far more radical change. It’s a sweeping reservation and ride planning system that allows for bookings months in advance on a website or smartphone app. Bracelets called MagicBands, which link electronically to an encrypted database of visitor information, serve as admission tickets, hotel keys, and credit or debit cards; a tap against a sensor pays for food or trinkets. The bands have radio frequency identification (RFID) chips—which critics derisively call spychips because of their ability to monitor people and things. (via Jim Stogdill)
  3. Stupid Smart Stuff (Don Norman) — In the airplane, the pilots are not attending, but when trouble does arise, the extremely well-trained pilots have several minutes to respond. In the automobile, when trouble arises, the ill-trained drivers will have one or two seconds to respond. Automobile designers – and law makers – have ignored this information.
  4. What You Think You Know About the Web Is Wrong (Time) — Chartbeat looked at deep user behavior across 2 billion visits across the web over the course of a month and found that most people who click don’t read. In fact, a stunning 55% spent fewer than 15 seconds actively on a page. The stats get a little better if you filter purely for article pages, but even then one in every three visitors spends less than 15 seconds reading the articles they land on. The entire article makes some powerful points about the difference between what’s engaged with and what’s shared. Articles that were clicked on and engaged with tended to be actual news. In August, the best performers were Obamacare, Edward Snowden, Syria and George Zimmerman, while in January the debates around Woody Allen and Richard Sherman dominated. The most clicked on but least deeply engaged-with articles had topics that were more generic. In August, the worst performers included Top, Best, Biggest, Fictional, etc., while in January the worst performers included Hairstyles, Positions, Nude and, for some reason, Virginia. That’s data for you.
Four short links: 27 January 2014

Real Time Exploratory Analytics, Algorithmic Agendas, Disassembly Engine, and Future of Employment

  1. Druid — open source clustered data store (not key-value store) for real-time exploratory analytics on large datasets.
  2. It’s Time to Engineer Some Filter Failure (Jon Udell) — Our filters have become so successful that we fail to notice: we don’t control them, they have agendas, and they distort our connections to people and ideas. That idea that algorithms have agendas is worth emphasising. Reality doesn’t have an agenda, but the deployer of a similarity metric has decided what features to look for, what metric they’re optimising, and what to do with the similarity data. These are all choices with an agenda. A toy illustration follows this list.
  3. Capstone — open source multi-architecture disassembly engine.
  4. The Future of Employment (PDF) — We note that this prediction implies a truncation in the current trend towards labour market polarization, with growing employment in high and low-wage occupations, accompanied by a hollowing-out of middle-income jobs. Rather than reducing the demand for middle-income occupations, which has been the pattern over the past decades, our model predicts that computerisation will mainly substitute for low-skill and low-wage jobs in the near future. By contrast, high-skill and high-wage occupations are the least susceptible to computer capital. (via The Atlantic)
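
To make the “algorithms have agendas” point in the second item concrete, here is a toy illustration with made-up numbers: the same pair of articles looks nearly identical under one featurization and quite different under another, so whoever picks the features picks the answer.

```python
# Two featurizations of the same two articles (all numbers invented).
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

by_topic = {"a": [1.0, 0.0, 0.9], "b": [0.9, 0.1, 1.0]}   # topic weights
by_engagement = {"a": [120, 3], "b": [5, 40]}             # [seconds read, shares]

print("topic similarity:     ", round(cosine(by_topic["a"], by_topic["b"]), 2))       # ~0.99
print("engagement similarity:", round(cosine(by_engagement["a"], by_engagement["b"]), 2))  # ~0.15
```

Same articles, opposite conclusions; the choice of features is the agenda.
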

How companies are using Spark

The inaugural Spark Summit will feature a wide variety of real-world applications

When an interesting piece of big data technology gets introduced, early adopters tend to focus on technical features and capabilities. Applications get built as companies develop confidence that it’s reliable and that it really scales to large data volumes. That seems to be where Spark is today. With over 90 contributors from 25 companies, it has one of the largest developer communities among big data projects (second only to Hadoop MapReduce).

Spark Growth by Numbers

I recently became an advisor to Databricks (a startup commercializing Spark) and a member of the program committee for the inaugural Spark Summit. As I pored over submissions to Spark’s first community gathering, I learned how companies have come to rely on Spark, Shark, and other components of the Berkeley Data Analytics Stack (BDAS). Spark is at that stage where companies are deploying it, and the upcoming Spark Summit in San Francisco will showcase many real-world applications. These applications cut across many domains including advertising, marketing, finance, and academic/scientific research, but can generally be grouped into the following categories:

Data processing workflows: ETL and Data Wrangling
Many companies rely on a wide variety of data sources for their analytic products. That means cleaning, transforming, and fusing (unstructured) external data with internal data sources. Many companies – particularly startups – use Spark for these types of data processing workflows. There are even companies that have created simple user interfaces that open up batch data processing tasks to non-programmers.
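
A minimal sketch of what such a workflow can look like in Spark’s Python API: parse a raw external feed, drop malformed rows, and fuse it with an internal table. The paths and the tab-separated record layout here are hypothetical.

```python
# Minimal ETL-style Spark job (hypothetical paths and record layout).
from pyspark import SparkContext

sc = SparkContext(appName="etl-sketch")

def parse(line):
    fields = line.split("\t")
    # Expect exactly 4 fields; anything else is treated as malformed.
    return (fields[0], fields[1:]) if len(fields) == 4 else None

external = (sc.textFile("hdfs:///raw/external_feed/*.tsv")   # hypothetical path
              .map(parse)
              .filter(lambda rec: rec is not None))          # drop malformed rows

internal = (sc.textFile("hdfs:///warehouse/customers.tsv")   # hypothetical path
              .map(lambda line: tuple(line.split("\t", 1)))) # (customer_id, attrs)

# Fuse cleaned external records with internal customer attributes by key.
fused = external.join(internal)
fused.saveAsTextFile("hdfs:///clean/fused_output")
```
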

Read more…

Simplifying interactive, realtime, and advanced analytics

Tools for unlocking big data continue to get simpler

Here are a few observations based on conversations I had during the just-concluded Strata NYC conference.

Interactive query analysis on Hadoop remains a hot area
A recent O’Reilly survey confirmed SQL is an important skill for data scientists. A year after the launch of Impala, quite a few attendees I spoke with remained interested in the progress of SQL-on-Hadoop solutions. A trio from Hortonworks gave an update on recent improvements and changes to Hive. In a sign that Impala is gaining traction, Greg Rahn’s talk on Practical Performance Tuning for Impala was one of the best-attended sessions of the conference. Ditto for a sponsored session on Kognitio’s latest features.

Existing SQL-on-Hadoop solutions require that users define a schema – an additional step given that a lot of data is increasingly in key-value or JSON format. In his talk, Hadapt co-founder Daniel Abadi highlighted a solution that lets users query complex data types (Hadapt reserializes complex data types to speed up joins). I expect other SQL-on-Hadoop solutions to offer query support for complex data types in the near future as well.
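
Hadapt hasn’t published the details of its reserialization scheme, but the underlying problem is easy to show: nested JSON doesn’t map onto flat SQL columns until something flattens it. A minimal illustration (my own, not Hadapt’s approach):

```python
# Flatten nested JSON records into dotted column names, the kind of
# schema-on-read step a SQL engine needs before it can query such data.
import json

def flatten(obj, prefix=""):
    """Recursively flatten nested dicts into dotted column names."""
    cols = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            cols.update(flatten(value, name))
        else:
            cols[name] = value
    return cols

record = json.loads('{"user": {"id": 7, "geo": {"country": "US"}}, "event": "click"}')
print(flatten(record))
# {'user.id': 7, 'user.geo.country': 'US', 'event': 'click'}
```
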

Empowering business users
With its launch at the conference, ClearStory joins Platfora and Datameer in the business analytics space. Each company builds tools that let business users wade through large amounts of data, while emphasizing different areas. Platfora is for interactive visual analysis of massive data sets, while Datameer connects to many data sources (not just Hadoop), has started offering analytics, and can run on a laptop or a cluster. Built primarily on the Berkeley stack (BDAS), ClearStory’s interesting platform encourages collaboration and simplifies data harmonization (fusing disparate data sources is a common bottleneck for business users). For organizations willing to tag and describe their data sets, Microsoft unveiled a tool that lets users query data using natural language (UK startup NeutrinoBI uses a similar “search interface”).

Read more…

Big Data and Advertising: In the trenches

Volume, variety, velocity, and a rare peek inside sponsored search advertising at Google

The $35B merger of Omnicom and Publicis put the convergence of big data and advertising on the front pages of business publications. Adtech companies have long been at the forefront of many data technologies, strategies, and techniques. By now it’s well known that many impressive large-scale, realtime analytics systems in production support advertising. A lot of effort has gone toward accurately predicting and measuring click-through rates, so at least for online advertising, data scientists and data engineers have gone a long way toward addressing the famous “but we don’t know which half” line.

The industry has its share of problems: privacy & creepiness come to mind, and, like other technology sectors, adtech has its share of “interesting” patent filings (see, for example, here, here, here). With so many companies dependent on online advertising, some have lamented the industry’s hold on data scientists. But online advertising does offer data scientists and data engineers lots of interesting technical problems to work on, many of which involve the deployment (and creation) of open source tools for massive amounts of data.
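
For a flavor of the click-through-rate work mentioned above, here is a toy sketch using logistic regression with the hashing trick, a technique widely used in adtech. This is illustrative only, not any particular company’s system, and all the features are made up.

```python
# Toy click-through-rate model with the hashing trick (all data invented).
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import SGDClassifier

impressions = [
    {"query": "running shoes", "ad_id": "a17", "hour": "20"},
    {"query": "running shoes", "ad_id": "a42", "hour": "20"},
    {"query": "mortgage rates", "ad_id": "a17", "hour": "09"},
    {"query": "mortgage rates", "ad_id": "a99", "hour": "09"},
]
clicks = [1, 0, 0, 1]

# Hash categorical features into a fixed-width sparse vector: no feature
# dictionary to store, which is what makes this workable at ad scale.
hasher = FeatureHasher(n_features=2**20, input_type="dict")
X = hasher.transform(impressions)

model = SGDClassifier(loss="log_loss")  # use loss="log" on older scikit-learn
model.fit(X, clicks)

x_new = hasher.transform([{"query": "running shoes", "ad_id": "a17", "hour": "20"}])
print("predicted CTR:", model.predict_proba(x_new)[0, 1])
```
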

Read more…

Near realtime, streaming, and perpetual analytics

Hadoop moves from batch to near realtime: next up, placing streaming data in context

Simple example of a near realtime app built with Hadoop and HBase
Over the past year Hadoop emerged from its batch processing roots and began to take on interactive and near realtime applications. There are numerous examples that fall under these categories, but one that caught my eye recently is a system jointly developed by China Mobile Guangdong (CMG) and Intel. It’s an online system that lets CMG’s more than 100 million subscribers access and pay their bills, and examine their CDRs (call detail records) in near realtime.

A service for providing detailed billing information is an important customer touch point. Repeated/extended downtimes and data errors could seriously tarnish CMG’s image. CMG needed a system that could scale to their current (and future) data volumes, while providing the low-latency responses consumers have come to expect from online services. Scalability, price, and open source were important criteria in persuading the company to choose a Hadoop-based solution over MPP data warehouses.

In the system it co-developed with Intel, CMG stores detailed subscriber billing records in HBase. This amounts to roughly 30 TB/month, but since the service lets users browse up to six months of billing data, it provides near realtime query results over much larger amounts of data. There are other near realtime applications built from Hadoop components (notably the continuous compute system at Yahoo!) that handle much larger data sets. But what I like about the CMG example is that it’s an application most people understand right away (a detailed billing lookup system), and it illustrates that the Hadoop ecosystem has grown beyond batch processing.
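
CMG and Intel haven’t published their schema, but a sketch of the standard HBase pattern for this kind of per-subscriber lookup is instructive. The table, host, and key layout below are hypothetical; happybase is one common Python HBase client.

```python
# Hypothetical per-subscriber billing lookup on HBase. Row keys lead with
# the subscriber ID so one subscriber's records are contiguous, with an
# inverted timestamp so the newest records sort (and scan) first.
import happybase

MAX_TS = 10**13  # larger than any millisecond timestamp we will see

def row_key(subscriber_id, ts_millis):
    return f"{subscriber_id}:{MAX_TS - ts_millis:013d}".encode()

conn = happybase.Connection("hbase-gateway.example.com")  # hypothetical host
table = conn.table("billing_records")                     # hypothetical table

def recent_bills(subscriber_id, limit=50):
    """Scan the newest billing rows for one subscriber, newest first."""
    prefix = f"{subscriber_id}:".encode()
    return list(table.scan(row_prefix=prefix, limit=limit))
```
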

Besides powering their online billing lookup service, CMG uses its Hadoop platform for analytics. Data from multiple sources (including phone device preferences, usage patterns, and cell tower performance) are used to compute customer segments and targeted promotions. Over time, Hadoop’s ability to handle large amounts of unstructured data opens up other data sources that can potentially improve CMG’s current analytic models.

Contextualize: Streaming and Perpetual Analytics
This leads me to something “realtime” systems are beginning to do: placing streaming data in context. Streaming analytics operates over fixed time windows and is used to identify “top k” trending items, heavy-hitters, and distinct items. Perpetual analytics takes what you’re observing now and places it in the context of what you already know. As much as companies appreciate metrics produced by streaming engines, they also want to understand how “realtime observations” affect their existing knowledge base.
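
A minimal single-process sketch of the distinction (my own illustration): the windowed counter answers the streaming “what’s trending now” question, while the long-lived baseline supplies the context a perpetual query needs.

```python
# Streaming top-k over a fixed window, plus a "perpetual" baseline that
# places each observation in the context of what we already know.
from collections import Counter, deque

WINDOW = 1000                      # fixed-size event window
window, counts = deque(), Counter()
baseline = Counter()               # long-lived knowledge base

def observe(item):
    window.append(item)
    counts[item] += 1
    baseline[item] += 1            # perpetual: update historical context
    if len(window) > WINDOW:       # slide the window
        old = window.popleft()
        counts[old] -= 1
        if counts[old] == 0:
            del counts[old]

def top_k(k=3):
    return counts.most_common(k)   # streaming: trending in this window

def surprise(item):
    """How far above its historical share is this item trending right now?"""
    window_share = counts[item] / max(len(window), 1)
    all_time_share = baseline[item] / max(sum(baseline.values()), 1)
    return window_share / max(all_time_share, 1e-9)

for item in ["a", "b", "a", "c", "a"]:
    observe(item)
print(top_k())                     # [('a', 3), ('b', 1), ('c', 1)]
```
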

Read more…

Scalable streaming analytics using a single-server

The simplest and quickest way to mine your data is to deploy efficient algorithms designed to answer key questions at scale.

For many organizations, real-time analytics entails complex event processing (CEP) systems or newer distributed stream processing frameworks like Storm, S4, or Spark Streaming. The latter have become more popular because they are able to process massive amounts of data and fit nicely with Hadoop and other cluster computing tools. For these distributed frameworks, peak volume is a function of network topology/bandwidth and the throughput of the individual nodes.

Scaling up machine learning: Find efficient algorithms
Faced with having to crunch through a massive data set, the first thing a machine-learning expert will try to do is devise a more efficient algorithm. Some popular approaches involve sampling, online learning, and caching. Parallelizing an algorithm tends to be lower on the list of things to try. The key reason is that while some algorithms are embarrassingly parallel (e.g., naive Bayes), many others are harder to decouple. But as I highlighted in a recent post, efficient tools that run on single servers can tackle large data sets. In the machine-learning context, recent examples of efficient algorithms that scale to large data sets can be found in the products of the startup SkyTree.
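
As one concrete instance of the “more efficient algorithm first” approach, reservoir sampling keeps a fixed-size uniform sample of an arbitrarily large stream in constant memory on a single machine. This is a textbook sketch (Algorithm R), not SkyTree’s method.

```python
# Uniform reservoir sampling: a fixed-size sample from a stream of
# unknown length, in one pass and O(k) memory.
import random

def reservoir_sample(stream, k):
    """Return k items sampled uniformly at random from the stream."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = random.randint(0, i)   # keep new item with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

print(reservoir_sample(range(10**6), k=5))
```
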

Read more…