# "machine learning" entries

## Four short links: 27 May 2015

### Domo Arigato Mr Google, Distributed Graph Processing, Experiencing Ethics, and Deep Learning Robots

1. Roboto — Google’s signature font is open sourced (Apache 2.0), including the toolchain to build it.
2. Pregel: A System for Large Scale Graph Processing — a walk through a key 2010 paper from Google, on the distributed graph system that is the inspiration for Apache Giraph and which sits under PageRank.
3. How to Turn a Liberal Hipster into a Global Capitalist (The Guardian) — In Zoe Svendsen’s play “World Factory at the Young Vic,” the audience becomes the cast. Sixteen teams sit around factory desks playing out a carefully constructed game that requires you to run a clothing factory in China. How to deal with a troublemaker? How to dupe the buyers from ethical retail brands? What to do about the ever-present problem of clients that do not pay? […] And because the theatre captures data on every choice by every team, for every performance, I know we were not alone. The aggregated flowchart reveals that every audience, on every night, veers toward money and away from ethics. I’m a firm believer that games can give you visceral experience, not merely intellectual knowledge, of an activity. Interesting to see it applied so effectively to business.
4. End to End Training of Deep Visuomotor Policies (PDF) — paper on using deep learning to teach robots how to manipulate objects, by example.

## Four short links: 25 May 2015

### 8 (Bits) Is Enough, Second Machine Age, LLVM OpenMP, and Javascript Graphs

1. Why Are Eight Bits Enough for Deep Neural Networks? (Pete Warden) — It turns out that neural networks are different. You can run them with eight-bit parameters and intermediate buffers, and suffer no noticeable loss in the final results. This was astonishing to me, but it’s something that’s been re-discovered over and over again.
2. The Great Decoupling (HBR) — The Second Machine Age is playing out differently than the First Machine Age, continuing the long-term trend of material abundance but not of ever-greater labor demand.
3. OpenMP Support in LLVMOpenMP enables Clang users to harness full power of modern multi-core processors with vector units. Pragmas from OpenMP 3.1 provide an industry standard way to employ task parallelism, while ‘#pragma omp simd’ is a simple yet flexible way to enable data parallelism (aka vectorization).
4. JS Graphs — a visual catalogue (with search) of Javascript graphing libraries.

## Four short links: 14 May 2015

### Human-Machine Cooperation, Concurrent Systems Books, AI Future, and Gesture UI

1. Ghosts in the Machines (Courtney Nash) — People are neither masters of machines, nor subservient to their machine-learning outcomes — we cannot, and should not, separate the two. We are actors, together, in a very complex system. David Woods calls this “joint cognitive systems.”
2. TLA+ (Leslie Lamport) — two tutorials: “Principles of Concurrent Computing” and “Specification of Concurrent Systems.” Ironically, I see people grizzling that the book on distributed systems hasn’t been linearised. I wonder if you can partition it into the two tutorials and still have full availability…
3. Deep Learning vs Probabilistic vs LogicAs of 2015, I pity the fool who prefers Modus Ponens over Gradient Descent.
4. Touché (Disney Research) — measur[es] capacitive response of object and human at multiple frequencies, a technique that we called Swept Frequency Capacitive Sensing. The signal travels through different paths depending on its frequency, capturing the posture of human hand and body as well as other properties of the context. The resulted data is classified using machine learning algorithms to identify gestures that are then used to trigger desired responses of the user interface.

## Topic modeling for the newbie

### Learning the fundamentals of natural language processing.

Get “Data Science from Scratch” at 50% off with code DATA50. Editor’s note: This is an excerpt from our recent book Data Science from Scratch, by Joel Grus. It provides a survey of topics from statistics and probability to databases, from machine learning to MapReduce, giving the reader a foundation for understanding, and examples and ideas for learning more.

When we built our Data Scientists You Should Know recommender in Chapter 1, we simply looked for exact matches in people’s stated interests.

A more sophisticated approach to understanding our users’ interests might try to identify the topics that underlie those interests. A technique called Latent Dirichlet Analysis (LDA) is commonly used to identify common topics in a set of documents. We’ll apply it to documents that consist of each user’s interests.

LDA has some similarities to the Naive Bayes Classifier we built in Chapter 13, in that it assumes a probabilistic model for documents. We’ll gloss over the hairier mathematical details, but for our purposes the model assumes that:

• There is some fixed number K of topics.
• There is a random variable that assigns each topic an associated probability distribution over words. You should think of this distribution as the probability of seeing word w given topic k.
• There is another random variable that assigns each document a probability distribution over topics. You should think of this distribution as the mixture of topics in document d.
• Each word in a document was generated by first randomly picking a topic (from the document’s distribution of topics) and then randomly picking a word (from the topic’s distribution of words).

In particular, we have a collection of documents, each of which is a list of words. And we have a corresponding collection of document_topics that assigns a topic (here a number between 0 and K – 1) to each word in each document. Read more…

## Four short links: 7 May 2015

### Predicting Hits, Pricing Strategies, Quis Calculiet Shifty Custodes, Docker Security

1. Predicting a Billboard Music Hit (YouTube) — Shazam VP of Music and Platforms at Strata London. With relative accuracy, we can predict 33 days out what song will go to No. 1 on the Billboard charts in the U.S.
2. Psychological Pricing Strategies — a handy wrap-up of evil^wuseful pricing strategies to know.
3. What Two Programmers Have Revealed So Far About Seattle Police Officers Who Are Still in Uniformthrough their shrewd use of Washington’s Public Records Act, the two Seattle residents are now the closest thing the city has to a civilian police-oversight board. In the last year and a half, they have acquired hundreds of reports, videos, and 911 calls related to the Seattle Police Department’s internal investigations of officer misconduct between 2010 and 2013. And though they have only combed through a small portion of the data, they say they have found several instances of officers appearing to lie, use racist language, and use excessive force—with no consequences. In fact, they believe that the Office of Professional Accountability (OPA) has systematically “run interference” for cops. In the aforementioned cases of alleged officer misconduct, all of the involved officers were exonerated and still remain on the force.
4. Understanding Docker Security and Best Practices — explanation of container security and a benchmark for security practices, though email addresses will need to be surrendered in exchange for the good info.

## Four short links: 16 April 2015

### Relationships and Inference, Mother of All Demos, Kafka at Scale, and Real World Hardware

1. DeepDiveDeepDive is targeted to help users extract relations between entities from data and make inferences about facts involving the entities. DeepDive can process structured, unstructured, clean, or noisy data and outputs the results into a database.
2. From the Vault: Watching (and re-watching) “The Mother of All Demos”“I wish there was more about the social vision for computing—I worked with him for a long time, and Doug was always thinking ‘how can we collectively collaborate,’ like a sort of rock band.”
3. Running Kafka at Scale (LinkedIn Engineering) — This tiered infrastructure solves many problems, but it greatly complicates monitoring Kafka and assuring its health. While a single Kafka cluster, when running normally, will not lose messages, the introduction of additional tiers, along with additional components such as mirror makers, creates myriad points of failure where messages can disappear. In addition to monitoring the Kafka clusters and their health, we needed to create a means to assure that all messages produced are present in each of the tiers, and make it to the critical consumers of that data.
4. 3D Printing Titanium, and the Bin of Broken Dreams — you will learn HUGE amounts on the challenges of real-world manufacturing by reading this.

## Building big data systems in academia and industry

### The O'Reilly Data Show Podcast: Mikio Braun on stream processing, academic research, and training.

Mikio Braun is a machine learning researcher who also enjoys software engineering. We first met when he co-founded a real-time analytics company called streamdrill. Since then, I’ve always had great conversations with him on many topics in the data space. He gave one of the best-attended sessions at Strata + Hadoop World in Barcelona last year on some of his work at streamdrill.

I recently sat down with Braun for the latest episode of the O’Reilly Data Show Podcast, and we talked about machine learning, stream processing and analytics, his recent foray into data science training, and academia versus industry (his interests are a bit on the “applied” side, but he enjoys both).

An example of a big data solution. Source: Mikio Braun, used with permission.

## Four short links: 8 April 2015

### Learning Poses, Kafkaesque Things, Hiring Research, and Robotic Movement

1. Apple Patent on Learning-based Estimation of Hand and Finger Pose — machine learning to identify gestures (hand poses) that works even when partially occluded. See writeup in Apple Insider.
2. The Internet of Kafkaesque Things (ACLU) — As computers are deployed in more regulatory roles, and therefore make more judgments about us, we may be afflicted with many more of the rigid, unjust rulings for which bureaucracies are so notorious.
3. Schmidt and Hunter (1998): Validity and Utility of Selection Methods in Personnel (PDF) — On the basis of meta-analytic findings, this article examines and summarizes what 85 years of research in personnel psychology has revealed about the validity of measures of 19 different selection methods that can be used in making decisions about hiring, training, and developmental assignments. (via Wired)
4. Complete Force Control in Constrained Under-actuated Mechanical Systems (Robohub) — Nori focuses on finding ways to advance the dynamic system of a robot – the forces that interact and make the system move. Key to developing dynamic movements in a robot is control, accompanied by the way the robot interacts with the environment. Nori talks us through the latest developments, designs, and formulas for floating-base/constrained mechanical systems, whole-body motion control of humanoid systems, whole-body dynamics computation on the iCub humanoid, and finishes with a video on recent implementations of whole-body motion control on the iCub. Video and download of presentation.

## Four short links: 3 April 2015

### Augmenting Humans, Body-Powered CPUs, Predicting the Future, and Hermit Life

1. Unpowered Ankle Exoskeleton“As we understand human biomechanics better, we’ve begun to see wearable robotic devices that can restore or enhance human motor performance,” says Collins. “This bodes well for a future with devices that are lightweight, energy-efficient, and relatively inexpensive, yet enhance human mobility.”
2. Body-Powered Processing (Ars Technica) — The new SAM L21 32-bit ARM family of microcontroller (MCUs) consume less than 35 microamps of power per megahertz of processing speed while active, and less than 200 nanoamps of power overall when in deep sleep mode—with varying states in between. The chip is so low power that it can be powered off energy capture from the body. (via Greg Linden)
3. Temporal Effects in Trend Prediction: Identifying the Most Popular Nodes in the Future (PLOSone) — We find that TBP have high general accuracy in predicting the future most popular nodes. More importantly, it can identify many potential objects with low popularity in the past but high popularity in the future.
4. The Shut-In EconomyIn 1998, Carnegie Mellon researchers warned that the Internet could make us into hermits. They released a study monitoring the social behavior of 169 people making their first forays online. The Web-surfers started talking less with family and friends, and grew more isolated and depressed. “We were surprised to find that what is a social technology has such anti-social consequences,” said one of the researchers at the time. “And these are the same people who, when asked, describe the Internet as a positive thing.”

## Four short links: 19 March 2015

### Changing Behaviour, Building Filters, Public Access, and Working Capital

1. Using Monitoring Dashboards to Change Behaviour[After years of neglect] One day we wrote some brittle Ruby scripts that polled various services. They collated the metrics into a simple database and we automated some email reports and built a dashboard showing key service metrics. We pinpointed issues that we wanted to show people. Things like the login times, how long it would take to search for certain keywords in the app, and how many users were actually using the service, along with costs and other interesting facts. We sent out the link to the dashboard at 9am on Monday morning, before the weekly management call. Within 2 weeks most problems were addressed. It is very difficult to combat data, especially when it is laid out in an easy to understand way.
2. Quiet Mitsubishi Cars — noise-cancelling on phone calls by using machine learning to build the filters.
3. NSF Requiring Public AccessNSF will require that articles in peer-reviewed scholarly journals and papers in juried conference proceedings or transactions be deposited in a public access compliant repository and be available for download, reading, and analysis within one year of publication.
4. Filtered for Capital (Matt Webb) — It’s important to get a credit line [for hardware startups] because growing organically isn’t possible — even if half your sell-in price is margin, you can only afford to grow your batch size at 50% per cycle… and whether it’s credit or re-investing the margin, all that growth incurs risk, because the items aren’t pre-sold. There are double binds all over the place here.