ENTRIES TAGGED "machine learning"

Reverse Engineering, Incident Response, 3D Museum, and Social Prediction

- Reverse Engineering for Beginners (GitHub) — from assembly language through stack overflows, dongles, and more.
- Incident Response at Heroku — the difference between good and bad shops is that good shops have a routine for exceptions.
- 3D Petrie Museum — The Petrie Museum of Egyptian Archaeology holds one of the largest ancient Egyptian and Sudanese collections in the world, and it has put 3D models of its artifacts online. Not (yet) available for download, only viewing, which seems a shame.
- Sandy Pentland on Wearables (The Verge) — Pentland was also Nathan Eagle’s graduate advisor, and behind the Reality Mining work at MIT. Check out his sociometer: One study revealed that the sociometer helps discern when someone is bluffing at poker roughly 70 percent of the time; another found that a wearer can determine who will win a negotiation within the first five minutes with 87 percent accuracy; yet another concluded that one can accurately predict the success of a speed date before the participants do.
Hardening Android, Samsung Connivery, Scalable WebSockets, and Hardware Machine Learning
- Hardening Android for Security and Privacy — a brilliant project: a prototype of a secure, full-featured Android telecommunications device with full Tor support, individual application firewalling, true cell network baseband isolation, and optional ZRTP encrypted voice and video support. ZRTP does run over UDP, which is not yet possible to send over Tor, but we are able to send SIP account login and call setup over Tor independently.
- The Great Smartphone War (Vanity Fair) — “I represented [the Swedish telecommunications company] Ericsson, and they couldn’t lie if their lives depended on it, and I represented Samsung and they couldn’t tell the truth if their lives depended on it.” That’s the most arresting quote, but it’s interesting to see Samsung’s patent strategy described as copying others, delaying the lawsuits, settling before judgement, and in the meantime ramping up its own innovation. The other glorious moment is the description of Samsung employees shredding and eating incriminating documents while stalling lawyers out front. An excellent read.
- socketcluster — highly scalable realtime WebSockets based on Engine.io. They have screenshots of 100k messages/second on an 8-core EC2 m3.2xlarge instance.
- Machine Learning on a Board — everything good becomes hardware, whether in GPUs or specialist CPUs. This one has a “Machine Learning Co-Processor”. Interesting idea, to package up inputs and outputs with specialist CPU, but I wonder whether it’s a solution in search of a problem. (via Pete Warden)
Internet of Listeners, Mobile Deep Belief, Crowdsourced Spectrum Data, and Quantum Minecraft
- Jasper Project — an open source platform for developing always-on, voice-controlled applications. Shouting is the new swiping—I eagerly await Gartner touting the Internet-of-things-that-misunderstand-you.
- DeepBeliefSDK — deep neural network library for iOS. (via Pete Warden)
- Microsoft Spectrum Observatory — crowdsourcing spectrum utilisation information. Just open sourced their code.
- qcraft — beginner’s guide to quantum physics in Minecraft. (via Nelson Minar)
Google Flu, Embeddable JS, Data Analysis, and Belief in the Browser
- The Parable of Google Flu (PDF) — We explore two issues that contributed to [Google Flu Trends]’s mistakes—big data hubris and algorithm dynamics—and offer lessons for moving forward in the big data age. Overtrained and underfed?
- Principles of Good Data Analysis (Greg Reda) — Once you’ve settled on your approach and data sources, you need to make sure you understand how the data was generated or captured, especially if you are using your own company’s data. Trebly so if you are using data you snaffled off the net, riddled with collection bias and untold omissions. (via Stijn Debrouwere)
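The "overtrained" failure mode behind that question is easy to reproduce on toy data: a model that memorizes a small, noisy training set scores perfectly on it while generalizing poorly. A minimal sketch using a 1-nearest-neighbour memorizer on synthetic data (nothing to do with GFT's actual model):

```python
import random

random.seed(0)

def make_data(n):
    # label is 1 when x > 0.5, with 20% label noise
    data = []
    for _ in range(n):
        x = random.random()
        y = int(x > 0.5)
        if random.random() < 0.2:
            y = 1 - y
        data.append((x, y))
    return data

train, test = make_data(30), make_data(1000)

def predict_1nn(x):
    # memorize the training set: nearest neighbour with k=1
    return min(train, key=lambda p: abs(p[0] - x))[1]

def accuracy(data):
    return sum(predict_1nn(x) == y for x, y in data) / len(data)

# Training accuracy is perfect (each point is its own nearest neighbour),
# but test accuracy is dragged down by the memorized label noise.
print(accuracy(train), accuracy(test))
```

With 20% label noise, no model can beat 80% on held-out data, yet the memorizer reports 100% on its own training set; "overtrained and underfed" in miniature.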
More than algorithms, companies gain access to models that incorporate ideas generated by teams of data scientists
Data scientists were among the earliest and most enthusiastic users of crowdsourcing services. Lukas Biewald noted in a recent talk that one of the reasons he started CrowdFlower was that as a data scientist he got frustrated with having to create training sets for many of the problems he faced. More recently, companies have been experimenting with active learning (humans take care of uncertain cases, models handle the routine ones). Along those lines, Adam Marcus described in detail how Locu uses crowdsourcing services to perform structured extraction (converting semi/unstructured data into structured data).
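The active-learning split described above (humans take the uncertain cases, models handle the routine ones) can be sketched in a few lines. The confidence function here is a stand-in, not any particular vendor's API:

```python
# Route each item: the model keeps confident predictions, humans get the rest.
def confidence(x, boundary=0.5):
    # toy model: confidence grows with distance from the decision boundary
    return min(1.0, 0.5 + abs(x - boundary))

def triage(items, threshold=0.75):
    auto, for_humans = [], []
    for x in items:
        (auto if confidence(x) >= threshold else for_humans).append(x)
    return auto, for_humans

items = [0.05, 0.45, 0.52, 0.9, 0.3, 0.7]
auto, for_humans = triage(items)
print(auto)        # clear-cut cases the model labels itself
print(for_humans)  # uncertain cases sent to crowd workers
```

The threshold is the economic knob: raise it and human labelers see more work but the automated labels get more trustworthy.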
Another area where crowdsourcing is popping up is feature engineering and feature discovery. Experienced data scientists will attest that generating features is as (if not more) important than choice of algorithm. Startup CrowdAnalytix uses public/open data sets to help companies enhance their analytic models. The company has access to several thousand data scientists spread across 50 countries and counts a major social network among its customers. Its current focus is on providing “enterprise risk quantification services to Fortune 1000 companies”.
CrowdAnalytix breaks projects into two phases: feature engineering and modeling. During the feature engineering phase, data scientists are presented with a problem (independent variable(s)) and are asked to propose features (predictors) and brief explanations for why they might prove useful. A panel of judges evaluates features based on the accompanying evidence and explanations. Typically 100+ teams enter this phase of the project, and 30+ teams propose reasonable features.
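One simple, mechanical way to screen such proposals is to rank candidate features by how strongly they track the target. A hedged sketch with hypothetical feature names (per the text, CrowdAnalytix's actual judging is done by a human panel weighing evidence and explanations):

```python
# Score each proposed feature by |Pearson correlation| with the target,
# then rank the proposals for the judging panel.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

target = [1.0, 2.0, 3.0, 4.0, 5.0]          # the quantity to predict
proposals = {                                 # hypothetical crowd-proposed features
    "monthly_spend": [1.1, 2.0, 2.9, 4.2, 5.1],   # strong signal
    "account_age":   [5.0, 1.0, 4.0, 2.0, 3.0],   # weak signal
}

ranked = sorted(proposals, key=lambda f: abs(pearson(proposals[f], target)),
                reverse=True)
print(ranked)
```

Correlation is only a first-pass filter; it misses nonlinear and interaction effects, which is exactly why a human panel (or a full modeling phase) follows.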
In order to make an effective decision, I need to understand key issues about the design, performance, and cost of cars, regardless of whether or not I actually know how to build one myself. The same is true for people deciding if machine learning is a good choice for their business goals or project. Will the payoff be worth the effort? What machine learning approach is most likely to produce valuable results for your particular situation? What size team with what expertise is necessary to be able to develop, deploy, and maintain your machine learning system?
Given the complex and previously esoteric nature of machine learning as a field – the sometimes daunting array of learning algorithms and the math needed to understand and employ them – many people feel the topic is one best left only to the few.
Hardcore Data Science speakers provided many practical suggestions and tips
One of the most popular offerings at Strata Santa Clara was Hardcore Data Science day. Over the next few weeks we hope to profile some of the speakers who presented, and make the video of the talks available as a bundle. In the meantime here are some notes and highlights from a day packed with great talks.
We’ve come to think of analytics as being composed primarily of data and algorithms. Once data has been collected, “wrangled”, and stored, algorithms are unleashed to unlock its value. Longtime machine-learning researcher Alice Zheng of GraphLab reminded attendees that data structures are critical to scaling machine-learning algorithms. Unfortunately there is a disconnect between machine-learning research and implementation (so much so that some recent advances in large-scale ML are “rediscoveries” of known data structures):
While there are many data structures that arise in computer science, Alice devoted her talk to two data structures that are widely used in machine learning.
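This excerpt doesn't name the two structures from the talk, but a sparse vector is a canonical example of a data structure that matters at ML scale, since real feature vectors are mostly zeros. A minimal dict-based sketch (illustrative only, not necessarily one of the talk's two):

```python
# A sparse vector stored as {index: value}; the dense representation of u
# below would need 10,001 slots to hold two nonzero entries.
def sparse_dot(a, b):
    # iterate over the smaller dict so cost tracks nonzeros, not dimension
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b[i] for i, v in a.items() if i in b)

u = {0: 2.0, 10_000: 1.5}
v = {0: 0.5, 3: 4.0, 10_000: 2.0}
print(sparse_dot(u, v))  # 2.0*0.5 + 1.5*2.0 = 4.0
```

Production systems use tighter index/value-array layouts (e.g. compressed sparse row) for cache efficiency, but the idea is the same: store and touch only the nonzeros.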
Emotions Wanted, Future's So Bright, Machine Learning for Security, and Medieval Unicode Fonts
- What Machines Can’t Do (NY Times) — In the 1950s, the bureaucracy was the computer. People were organized into technocratic systems in order to perform routinized information processing. But now the computer is the computer. The role of the human is not to be dispassionate, depersonalized or neutral. It is precisely the emotive traits that are rewarded: the voracious lust for understanding, the enthusiasm for work, the ability to grasp the gist, the empathetic sensitivity to what will attract attention and linger in the mind. Cf the fantastic The Most Human Human. (via Jim Stogdill)
- The Technium: A Conversation with Kevin Kelly (Edge) — If we were sent back with a time machine, even 20 years, and reported to people what we have right now and describe what we were going to get in this device in our pocket—we’d have this free encyclopedia, and we’d have street maps to most of the cities of the world, and we’d have box scores in real time and stock quotes and weather reports, PDFs for every manual in the world—we’d make this very, very, very long list of things that we would say we would have and we get on this device in our pocket, and then we would tell them that most of this content was free. You would simply be declared insane. They would say there is no economic model to make this. What is the economics of this? It doesn’t make any sense, and it seems far-fetched and nearly impossible. But the next twenty years are going to make this last twenty years just pale. (via Sara Winge)
- Applying Machine Learning to Network Security Monitoring (Slideshare) — interesting deck on big data + machine learning as applied to netsec. See also their ML Sec Project. (via Anton Chuvakin)
- Medieval Unicode Font Initiative — code points for medieval markup. I would have put money on Ogonek being a fantasy warrior race. Go figure.
Business users are starting to tackle problems that require machine-learning and statistics
I talk with many new companies who build tools for business analysts and other non-technical users. These new tools streamline and simplify important data tasks including interactive analysis (e.g., pivot tables and cohort analysis), interactive visual analysis (as popularized by Tableau and Qlikview), and more recently data preparation. Some of the newer tools scale to large data sets, while others explicitly target small to medium-sized data.
As I noted in a recent post, companies are beginning to build data analysis tools that target non-experts. Companies are betting that as business users start interacting with data, they will want to tackle some problems that require advanced analytics. With business analysts far outnumbering data scientists, it makes sense to offload some problems to non-experts.
Moreover, data seems to support the notion that business users are interested in more complex problems. I recently looked at data from 11 large Meetups (in NYC and the SF Bay Area) that target business analysts and business intelligence users. Altogether these Meetups had close to 5,000 active members. As you can see in the chart below, business users are interested in topics like machine learning (1 in 5), predictive analytics (1 in 4), and data mining (1 in 4):