Mike Loukides

Mike Loukides is Vice President of Content Strategy for O'Reilly Media, Inc. He's edited many highly regarded books on technical subjects that don't involve Windows programming. He's particularly interested in programming languages, Unix and what passes for Unix these days, and system and network administration. Mike is the author of "System Performance Tuning" and a coauthor of "Unix Power Tools." Most recently, he's been fooling around with data and data analysis, languages like R, Mathematica, and Octave, and thinking about how to make books social.

Phishing in Facebook’s Pond

Facebook scraping could lead to machine-generated spam so good that it's indistinguishable from legitimate messages.

A recent blog post inquired about the incidence of Facebook-based spear phishing: the author suddenly started receiving email that appeared to be from friends (though it wasn't sent from their usual email addresses), making the usual kinds of offers and asking him to click on the usual links. He wondered whether this was a widespread phenomenon and how it happened — how does a phisherman get access to your Facebook friends?

The answers are “yes, it happens” and “I don’t know, but it’s going to get worse.” Seriously, my wife’s name has been used in Facebook phishing. A while ago, several of her Facebook friends said that her email account had been hacked. I was suspicious; she only uses Gmail, and hacking Google isn’t easy, particularly with two-factor authentication. So, I asked her friends to send me the offending messages. It was obvious that they hadn’t come from my wife’s account; they were Yahoo accounts with her name but an unrecognizable email address, exactly what this blogger had seen.
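The tell in those messages is mechanical enough to sketch: a familiar display name attached to an unfamiliar address. Here's a minimal illustration in Python (the contact list, names, and addresses are all hypothetical, and real phishing detection takes far more than a lookup table):

```python
# Flag messages whose From: header pairs a known friend's name with an
# address we've never seen for them -- the pattern described above.
from email.utils import parseaddr

# Hypothetical address book mapping display names to known addresses.
known_addresses = {"Jane Doe": "jane.doe@gmail.com"}

def looks_spoofed(from_header):
    name, addr = parseaddr(from_header)
    expected = known_addresses.get(name)
    # Suspicious only when the name is known but the address doesn't match.
    return expected is not None and addr.lower() != expected

print(looks_spoofed("Jane Doe <xk9@yahoo.com>"))       # True
print(looks_spoofed("Jane Doe <jane.doe@gmail.com>"))  # False
```

A check like this only catches the crudest impersonation, but it's exactly the mismatch my wife's friends could have spotted by eye.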

How does this happen? How can a phisher discover your name and your Facebook friends? I don’t know, but Facebook is such a morass of weird and conflicting security settings that it’s impossible to know just how private or how public you are. If you’ve ever friended people you don’t know (a practice that remains entirely too common), and if you’ve ever enabled visibility to friends of friends, you have no idea who has access to your conversations.

Read more…

Really Understanding Computation

Tom Stuart's new book will shed light on what you're really doing when you're programming.

It’s great to see that Tom Stuart’s Understanding Computation has made it out. I’ve been excited about this book ever since we signed it.

Understanding Computation started from Tom’s talk Programming with Nothing, which he presented at Ruby Manor in 2011. That talk was a tour-de-force: it showed how to implement a more-or-less complete programming system without using any libraries, methods, classes, objects, or even control structures, assignments, arrays, strings, or numbers. It was, literally, programming with nothing. And it was an eye-opener.

Shortly after I saw the conference video, I talked to Tom to ask if we could do more like this. And amazingly, the answer was “yes.” He was very interested in teaching the theory of computing through Ruby, using similar techniques. What does a program mean? What does it mean for something to be a program? How do we build languages that can handle ever more flexible abstractions? What kinds of problems can’t we solve computationally? It’s all here, and it’s all clearly demonstrated via Ruby code. It’s not code that you’d ever use in a real application (trust me, doing arithmetic without numbers, assignments, and control statements is ridiculously slow). But it is code that will expand your mind and leave you with a much better understanding of what you’re doing when you’re programming.
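The book's examples are in Ruby, but the flavor of "programming with nothing" comes through even in a few lines of Python: numbers and arithmetic rebuilt from nothing but single-argument functions (Church encoding).

```python
# Church numerals: the number n is represented as "apply a function n times".
ZERO = lambda f: lambda x: x
SUCC = lambda n: lambda f: lambda x: f(n(f)(x))
ADD = lambda m: lambda n: lambda f: lambda x: m(f)(n(f)(x))

ONE = SUCC(ZERO)
TWO = SUCC(ONE)

def to_int(n):
    # Escape hatch for inspection: count the applications using real ints.
    return n(lambda x: x + 1)(0)

print(to_int(ADD(TWO)(TWO)))  # 4
```

As the text says, this is absurdly slow as real arithmetic; the point is that numbers, like everything else, can be built from functions alone.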

Understanding skepticism

Skepticism isn't a blanket rejection of data; it's central to understanding data.

I’d like to correct the impression, given by Derrick Harris on GigaOm, that I’m part of a backlash against “big data.”

I’m not skeptical about data or the power of data, but you don’t have to look very far or very hard to see data abused. The best people to be skeptical about the data, and to point out the abuse of data, are data scientists because they understand problems such as overfitting, bias, and much more.

Cathy O’Neil recently wrote about a Congressional hearing in which a teacher at a new data science program dodged some perceptive questions about whether he was teaching students to be skeptical about results: whether he was teaching them how to test whether their observations were real signals or just noise. Anyone who has worked with data knows that false correlations come cheaply, particularly when you’re working with a lot of data. But ducking that question is not the attitude we need.

Data is valuable. I see no end to the collection or analysis of data, nor should there be an end. But like any tool, data has to be used carefully. Skepticism isn’t a blanket rejection of data; it’s central to understanding data. That’s precisely what makes “science” science.

And of all people, journalists should understand what skepticism means, even if they don’t have the technical tools to practice it.

Burning the silos

The boundaries created by traditional management are just getting in the way of reducing product cycle times.

If I’ve seen any theme come up repeatedly over the past year, it’s getting product cycle times down. It’s not the sexiest or most interesting theme, but it’s everywhere: if it’s not on the front burner, it’s always simmering in the background.

Cutting product cycles to the bare minimum is one of the main themes of the Velocity Conference and the DevOps movement, where integration between developers and operations, along with practices like continuous deployment, allows web-native companies like Yahoo! to release upgrades to their web products many times a day. It’s no secret that many traditional enterprises are looking at this model, trying to determine what they can use or implement. Indeed, this is central to their long-term survival; companies as different from Facebook as GE and Ford are learning that they will need to become as agile and nimble as their web-native counterparts.

Integrating development and operations isn’t the only way to shorten product cycles. In his talk at Google I/O, Braden Kowitz talked about shortening the design cycle: rather than build big, complete products that take a lot of time and money, start with something very simple and test it, then iterate quickly. This approach lets you generate and test lots of ideas, but be quick to throw away the ones that aren’t working. Rather than designing an Edsel, just to fail when the product is released, the shorter cycles that come from integrating product design with product development let you build iteratively, getting immediate feedback on what works and what doesn’t. To work like this, you need to break down the silos that separate engineers and designers; you need to integrate designers into the product team as early as possible, rather than at the last minute. Read more…

Yet another Kickstarter: Otherlab’s Home Milling Machine

If you have a good memory, you know that I’ve written about 3D printers. Technically, I grew up with the laser printer; my first computer industry job (part-time while getting an English PhD) was with Imagen, a startup that built the first laser printer that cost under $20,000, then the first that cost under $10,000, then under $7,000, and died a slow death after Apple produced the first that cost under $5,000. Now a laser printer costs a few hundred dollars. And I’ve been cheering as 3D printers followed the same price curve.

But even as I’ve been cheering, I’ve had this nagging doubt in the back of my head. So I can 3D-print my own chess set. Cool. So what? Sure, you can do great things with them (enclosures for projects; every DIY-bio lab I’ve visited has a 3D printer stashed somewhere). While 3D printers are an important step in bringing 21st-century tooling to the home hacker, they’re still fairly limited.

Last night, the other shoe dropped. Otherfab, a project of Saul Griffith’s Otherlab, has a new Kickstarter project for Othermill: a home computer-controlled milling machine. A milling machine is a large, versatile beast that uses a high-speed cutting bit to sculpt material (often metal) into the desired shape. Instead of adding layers of plastic or some other material, like a 3D printer, a milling machine cuts material away. If you’ve ever visited machine shops, you know that milling machines are where the magic happens. Particularly state-of-the-art computer-controlled mills. They’re big, they’re expensive, and they can do just about anything. Putting one in the home shop — that’s revolutionary. Read more…

Another Serving of Data Skepticism

I was thrilled to receive an invitation to a new meetup: the NYC Data Skeptics Meetup. If you’re in the New York area, and you’re interested in seeing data used honestly, stop by!

That announcement pushed me to write another post about data skepticism. The past few days, I’ve seen a resurgence of the slogan that correlation is as good as causation, if you have enough data. And I’m worried. (And I’m not vain enough to think it’s a response to my first post about skepticism; it’s more likely an effect of Cukier’s book.) There’s a fundamental difference between correlation and causation. Correlation is a two-headed arrow: you can’t tell in which direction it flows. Causation is a single-headed arrow: A causes B, not vice versa, at least in a universe that’s subject to entropy.

Let’s do some thought experiments (unfortunately, totally devoid of data). But I don’t think we need data to get to the core of the problem. Think of the classic false correlation (when teaching logic, also used as an example of a false syllogism): there’s a strong correlation between people who eat pickles and people who die. Well, yeah. We laugh. But let’s take this a step further: correlation is a double-headed arrow. So not only does this poor logic imply that we can reduce the death rate by preventing people from eating pickles, it also implies that we can harm the chemical companies that produce vinegar by preventing people from dying. Read more…

Leading Indicators

In a conversation with Q Ethan McCallum (who should be credited as co-author), we wondered how to evaluate data science groups. If you’re looking at an organization’s data science group from the outside, possibly as a potential employee, what can you use to evaluate it? It’s not a simple problem under the best of conditions: you’re not an insider, so you don’t know the full story of how many projects it has tried, whether they have succeeded or failed, relations between the data group, management, and other departments, and all the other stuff you’d like to know but will never be told.

Our starting point was remote: Q told me about Tyler Brulé’s travel writing for the Financial Times (behind a paywall, unfortunately), in which he says that a club sandwich is a good proxy for hotel quality: you go into the restaurant and order a club sandwich. A club sandwich isn’t hard to make: there’s no secret recipe or technique that’s going to make Hotel A’s sandwich significantly better than Hotel B’s. But it’s easy to cut corners on ingredients and preparation. And if a hotel is cutting corners on their club sandwiches, they’re probably cutting corners in other places.

This reminded me of when my daughter was in first grade, and we looked (briefly) at private schools. All the schools talked the same talk. But if you looked at classes, it was pretty clear that the quality of the music program was a proxy for the quality of the school. After all, it’s easy to shortchange music, and both hard and expensive to do it right. Oddly enough, using the music program as a proxy for evaluating school quality has continued to work through middle school and (public) high school. It’s the first thing to cut when the budget gets tight; and if a school has a good music program with excellent teachers, they’re probably not shortchanging the kids elsewhere.

How does this connect to data science? What are the proxies that allow you to evaluate a data science program from the “outside,” on the information that you might be able to cull from company blogs, a job interview, or even a job posting? We came up with a few ideas:

  • Are the data scientists simply human search engines, or do they have real projects that allow them to explore and be curious? If they have management support for learning what can be learned from the organization’s data, and if management listens to what they discover, they’re accomplishing something significant. If they’re just playing Q&A with the company data, finding answers to specific questions without providing any insight, they’re not really a data science group.
  • Do the data scientists live in a silo, or are they connected with the rest of the company? In Building Data Science Teams, DJ Patil wrote about the value of seating data scientists with designers, with marketers, with the entire product group, so that they don’t do their work in isolation and can bring their insights to bear on all aspects of the company.
  • When the data scientists do a study, is the outcome predetermined by management? Is it OK to say “we don’t have an answer” or to come up with a solution that management doesn’t like? Granted, you aren’t likely to be able to answer this question without insider information.
  • What do job postings look like? Does the company have a mission and know what it’s looking for, or are they asking for someone with a huge collection of skills, hoping that they will come in useful? That’s a sign of data science cargo culting.
  • Does management know what their tools are for, or have they just installed Hadoop because it’s what the management magazines tell them to do? Can managers talk intelligently to data scientists?
  • What sort of documentation does the group produce for its projects? Like a club sandwich, it’s easy to shortchange documentation.
  • Is the business built around the data? Or is the data science team an add-on to an existing company? A data science group can be integrated into an older company, but you have to ask a lot more questions; you have to worry a lot more about silos and management relations than you do in a company that is built around data from the start.

Coming up with these questions was an interesting thought experiment; we don’t know whether it holds water, but we suspect it does. Any ideas and opinions?

Google Glass and the Future

I just read a Forbes article about Glass, talking about the split between those who are “sure that it is the future of technology, and others who think society will push back against the technology.”

I don’t see this as a dichotomy (and, to be fair, I’m not sure that the author does either). I expect to see both, and I’d like to think a bit more about what these two apparently opposing sides mean.

Push back is inevitable. I hope there’s a significant push back, and that it has some results. Not because I’m a Glass naysayer, but because we, as technology users, are abused so often, and push back so weakly, that it’s not funny. Facebook does something outrageous; a few technorati whine; they add option 1023 to their current highly intertwined 1022 privacy options that have been designed so they can’t be understood or used effectively; and sooner or later, it all dies down. A hundred fifty users have left Facebook, and half a million more have joined. When Apple puts another brick in their walled garden, a few dozen users (myself included) bitch and moan, but does anyone leave? Personally, I’m tired of getting warnings whenever I install software that doesn’t come from the Apple Store (I’ve used the Store exactly twice), and I absolutely expect that a not-too-distant version of OS X won’t allow me to install software from “untrusted” sources, including software I’ve written. Will there be push back? Probably. Will it be effective? I don’t know; if things go as they are now, I doubt it.

There will be push back against Glass; and that’s a good thing. I think Google, of all the companies out there, is most likely to listen and respond positively. I say that partly because of efforts like the Data Liberation Front, and partly because Eric Schmidt has acknowledged that he finds many aspects of Glass creepy. But going beyond Glass: As a community of users, we need to empower ourselves to push back. We need to be able to push back effectively against Google, but more so against Apple, Facebook, and many other abusers of our data, rather than passively accept the latest intrusion as an inevitability. If Glass does nothing more than teach users that they can push back, and teach large corporations how to respond constructively, it will have accomplished much.

Is Glass the future? Yes; at least, something like Glass is part of the future. As a species, we’re not very good at putting our inventions back into the box. About three years ago, there was a big uptick in interest in augmented reality. You probably remember: Wikitude, Layar, and the rest. You installed those apps on your phone. They’re still there. You never use them (at least, I don’t). The problem with consumer-grade AR up until now has been that it was sort of awkward walking around looking at things through your phone’s screen. (Commercial AR, heads-up displays and the like, is a completely different ball game.) Glass is the first attempt at a broadly useful platform for consumer AR; it’s a game changer.

Could Glass fail? Sure; I know more failed startups than I can count where the engineers did something really cool, and when they released it, the public said “what is that, and why do you think we’d want it?” Google certainly isn’t immune from that disease, which is endemic to an engineering-driven culture; just think back to Wave. I won’t deny that Google might shelve Glass if they consider it unproductive, as they’ve shelved many popular applications. But I believe that Google is playing long-ball here, and thinking far beyond 2014 or 2015. In a conversation about Bitcoin last week, I said that I doubt it will be around in 20 years. But I’m certain we will have some kind of distributed digital currency, and that currency will probably look a lot like Bitcoin. Glass is the same. I have no doubt that something like Glass is part of our future. It’s a first, tentative, and very necessary step into a new generation of user interfaces, a new way of interacting with computing systems and integrating them into our world. We probably won’t wear devices around on our glasses; they may well be surgically implanted. But the future doesn’t happen if you only talk about hypothetical possibilities. Building the future requires concrete innovation, building inconvenient and “creepy” devices that nevertheless point to the next step. And it requires people pushing back against that innovation, to help developers figure out what they really need to build.

Glass will be part of our future, though probably not in its current form. And push back from users will play an essential role in defining the form it will eventually take.

Glowing Plants

I just invested in BioCurious’ Glowing Plants project on Kickstarter. I don’t watch Kickstarter closely, but this is about as fast as I’ve ever seen a project get funded. It went live on Wednesday; in the afternoon, I was backer #170 (more or less), but could see the number of backers ticking upwards constantly as I watched. It was fully funded for $65,000 Thursday; and now sits at 1340 backers (more by the time you read this), with about $84,000 in funding. And there’s a new “stretch” goal: if they make $400,000, they will work on bigger plants, and attempt to create a glowing rose.

Glowing plants are a curiosity; I don’t take seriously the idea that trees will be an alternative to streetlights any time in the near future. But that’s not the point. What’s exciting is that an important and serious biology project can take place in a biohacking lab, rather than in a university or an industrial facility. It’s exciting that this project could potentially become a business; I’m sure there’s a boutique market for glowing roses and living nightlights, if not for biological street lighting. And it’s exciting that we can make new things out of biological parts.

In a conversation last year, Drew Endy said that he wanted synthetic biology to “stay weird,” and that if in ten years all we had accomplished was to create bacteria that make oil from cellulose, we will have failed. Glowing plants are weird. And beautiful. Take a look at their project, fund it, and be the first on your block to have a self-illuminating garden.

Data skepticism

If data scientists aren't skeptical about how they use and analyze data, who will be?

A couple of months ago, I wrote that “big data” is heading toward the trough of the hype curve as a result of oversized hype and promises. That’s certainly true. I see more expressions of skepticism about the value of data every day. Some of the skepticism is a reaction against the hype; a lot of it arises from ignorance, and it has the same smell as the rich history of science denial from the tobacco industry (and probably much earlier) onward.

But there’s another thread of data skepticism that’s profoundly important. On her MathBabe blog, Cathy O’Neil has written several articles about lying with data — about intentionally developing models that don’t work because it’s possible to make more money from a bad model than a good one. (If you remember Mel Brooks’ classic “The Producers,” it’s the same idea.) In a slightly different vein, Cathy argues that making machine learning simple for non-experts might not be in our best interests; it’s easy to start believing answers because the computer told you so, without understanding why those answers might not correspond with reality.

I had a similar conversation with David Reiley, an economist at Google, who is working on experimental design in social sciences. Heavily paraphrasing our conversation, he said that it was all too easy to think you have plenty of data, when in fact you have the wrong data, data that’s filled with biases that lead to misleading conclusions. As Reiley points out (pdf), “the population of people who sees a particular ad may be very different from the population who does not see an ad”; yet, many data-driven studies of advertising effectiveness don’t take this bias into account. The idea that there are limitations to data, even very big data, doesn’t contradict Google’s mantra that more data is better than smarter algorithms; it does mean that even when you have unlimited data, you have to be very careful about the conclusions you draw from that data. It is in conflict with the all-too-common idea that, if you have lots and lots of data, correlation is as good as causation.
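Reiley's point is easy to make concrete with a toy simulation: give each user a hidden "intent" that drives both ad exposure and purchasing, and let the ad itself do nothing at all. The naive exposed-versus-unexposed comparison still reports a healthy lift. (All the numbers here are invented for illustration.)

```python
import random

random.seed(1)

exposed_buys = unexposed_buys = exposed = unexposed = 0
for _ in range(100_000):
    intent = random.random()               # hidden purchase intent
    saw_ad = random.random() < intent      # high-intent users see the ad more
    buys = random.random() < 0.1 * intent  # the ad has ZERO causal effect
    if saw_ad:
        exposed += 1
        exposed_buys += buys
    else:
        unexposed += 1
        unexposed_buys += buys

naive_lift = exposed_buys / exposed - unexposed_buys / unexposed
# Positive "lift" despite a true effect of exactly zero: pure selection bias.
print(f"naive ad lift: {naive_lift:.3f}")
```

The people who saw the ad were simply more likely to buy in the first place, which is exactly the population difference Reiley warns about.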

Skepticism about data is normal, and it’s a good thing. If I had to give a one line definition of science, it might be something like “organized and methodical skepticism based on evidence.” So, if we really want to do data science, it has to be done by incorporating skepticism. And here’s the key: data scientists have to own that skepticism. Data scientists have to be the biggest skeptics. Data scientists have to be skeptical about models, they have to be skeptical about overfitting, and they have to be skeptical about whether we’re asking the right questions. They have to be skeptical about how data is collected, whether that data is unbiased, and whether that data — even if there’s an inconceivably large amount of it — is sufficient to give you a meaningful result.
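Take overfitting, the first item on that list: a model that merely memorizes its training data looks flawless in-sample and degrades on fresh data. A small sketch with an invented data-generating process:

```python
import random

random.seed(3)

def make_data(xs, noise=0.2):
    # The true relationship is linear: y = 2x plus Gaussian noise.
    return [(x, 2 * x + random.gauss(0, noise)) for x in xs]

train = make_data([i / 14 for i in range(15)])
test = make_data([random.random() for _ in range(500)])

# Honest model: ordinary least-squares line (closed form).
n = len(train)
mx = sum(x for x, _ in train) / n
my = sum(y for _, y in train) / n
slope = sum((x - mx) * (y - my) for x, y in train) / sum((x - mx) ** 2 for x, _ in train)
line = lambda x: my + slope * (x - mx)

# Overfit model: memorization -- predict the y of the nearest training x.
memo = lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

for name, model in [("line", line), ("memorizer", memo)]:
    print(f"{name}: train MSE {mse(model, train):.4f}, test MSE {mse(model, test):.4f}")
```

The memorizer's training error is exactly zero; on fresh data it typically does markedly worse than the line, because it has faithfully learned the noise. That gap between in-sample and out-of-sample performance is what the skeptical questions above are probing for.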

Because the bottom line is: if we’re not skeptical about how we use and analyze data, who will be? That’s not a pretty thought.