- Roboto — Google’s signature font is open sourced (Apache 2.0), including the toolchain to build it.
- Pregel: A System for Large Scale Graph Processing — a walk through a key 2010 paper from Google, on the distributed graph system that is the inspiration for Apache Giraph and which sits under PageRank.
- How to Turn a Liberal Hipster into a Global Capitalist (The Guardian) — In Zoe Svendsen’s play “World Factory at the Young Vic,” the audience becomes the cast. Sixteen teams sit around factory desks playing out a carefully constructed game that requires you to run a clothing factory in China. How to deal with a troublemaker? How to dupe the buyers from ethical retail brands? What to do about the ever-present problem of clients that do not pay? […] And because the theatre captures data on every choice by every team, for every performance, I know we were not alone. The aggregated flowchart reveals that every audience, on every night, veers toward money and away from ethics. I’m a firm believer that games can give you visceral experience, not merely intellectual knowledge, of an activity. Interesting to see it applied so effectively to business.
- End to End Training of Deep Visuomotor Policies (PDF) — paper on using deep learning to teach robots how to manipulate objects, by example.
"machine learning" entries
Learning the fundamentals of natural language processing.
Get “Data Science from Scratch” at 50% off with code DATA50. Editor’s note: This is an excerpt from our recent book Data Science from Scratch, by Joel Grus. It provides a survey of topics from statistics and probability to databases, from machine learning to MapReduce, giving the reader a foundation for understanding, and examples and ideas for learning more.
When we built our Data Scientists You Should Know recommender in Chapter 1, we simply looked for exact matches in people’s stated interests.
A more sophisticated approach to understanding our users’ interests might try to identify the topics that underlie those interests. A technique called Latent Dirichlet Analysis (LDA) is commonly used to identify common topics in a set of documents. We’ll apply it to documents that consist of each user’s interests.
LDA has some similarities to the Naive Bayes Classifier we built in Chapter 13, in that it assumes a probabilistic model for documents. We’ll gloss over the hairier mathematical details, but for our purposes the model assumes that:
- There is some fixed number K of topics.
- There is a random variable that assigns each topic an associated probability distribution over words. You should think of this distribution as the probability of seeing word w given topic k.
- There is another random variable that assigns each document a probability distribution over topics. You should think of this distribution as the mixture of topics in document d.
- Each word in a document was generated by first randomly picking a topic (from the document’s distribution of topics) and then randomly picking a word (from the topic’s distribution of words).
In particular, we have a collection of
documents, each of which is a
list of words. And we have a corresponding collection of
document_topics that assigns a topic (here a number between 0 and K – 1) to each word in each document. Read more…
The O'Reilly Data Show Podcast: Mikio Braun on stream processing, academic research, and training.
Mikio Braun is a machine learning researcher who also enjoys software engineering. We first met when he co-founded a real-time analytics company called streamdrill. Since then, I’ve always had great conversations with him on many topics in the data space. He gave one of the best-attended sessions at Strata + Hadoop World in Barcelona last year on some of his work at streamdrill.
I recently sat down with Braun for the latest episode of the O’Reilly Data Show Podcast, and we talked about machine learning, stream processing and analytics, his recent foray into data science training, and academia versus industry (his interests are a bit on the “applied” side, but he enjoys both).