"Big Data" entries

Four short links: 4 February 2016

Four short links: 4 February 2016

Shmoocon Video, Smart Watchstrap, Generalizing Learning, and Dataflow vs Spark

  1. Shmoocon 2016 Videos (Internet Archive) — videos of the talks from an astonishingly good security conference.
  2. TipTalk — Samsung watchstrap that is the smart device … put your finger in your ear to hear the call. You had me at put my finger in my ear. (via WaPo)
  3. Ecorithms — Leslie Valiant at Harvard broadened the concept of an algorithm into an “ecorithm,” which is a learning algorithm that “runs” on any system capable of interacting with its physical environment. Algorithms apply to computational systems, but ecorithms can apply to biological organisms or entire species. The concept draws a computational equivalence between the way that individuals learn and the way that entire ecosystems evolve. In both cases, ecorithms describe adaptive behavior in a mechanistic way.
  4. Dataflow/Beam vs Spark (Google Cloud) — To highlight the distinguishing features of the Dataflow model, we’ll be comparing code side-by-side with Spark code snippets. Spark has had a huge and positive impact on the industry thanks to doing a number of things much better than other systems had done before. But Dataflow holds distinct advantages in programming model flexibility, power, and expressiveness, particularly in the out-of-order processing and real-time session management arenas.
Comment
Four short links: 29 January 2016

Four short links: 29 January 2016

LTE Security, Startup Tools, Security Tips, and Data Fiction

  1. LTE Weaknesses (PDF) — ShmooCon talk about how weak LTE is: a lot of unencrypted exchanges between handset and basestation, cheap and easy to fake up a basestation.
  2. AnalyzoFind and Compare the Best Tools for your Startup it claims. We’re in an age of software surplus: more projects, startups, apps, and tools than we can keep in our heads. There’s a place for curated lists, which is why every week brings a new one.
  3. How to Keep the NSA Out — NSA’s head of Tailored Access Operations (aka attacking other countries) gives some generic security advice, and some interesting glimpses. “Don’t assume a crack is too small to be noticed, or too small to be exploited,” he said. If you do a penetration test of your network and 97 things pass the test but three esoteric things fail, don’t think they don’t matter. Those are the ones the NSA, and other nation-state attackers will seize on, he explained. “We need that first crack, that first seam. And we’re going to look and look and look for that esoteric kind of edge case to break open and crack in.”
  4. The End of Big Data — future fiction by James Bridle.
Comment
Four short links: 28 January 2016

Four short links: 28 January 2016

Augmented Intelligence, Social Network Limits, Microsoft Research, and Google's Go

  1. Chimera (Paper a Day) — the authors summarise six main lessons learned while building Chimera: (1) Things break down at large scale; (2) Both learning and hand-crafted rules are critical; (3) Crowdsourcing is critical, but must be closely monitored; (4) Crowdsourcing must be coupled with in-house analysts and developers; (5) Outsourcing does not work at a very large scale; (6) Hybrid human-machine systems are here to stay.
  2. Do Online Social Media Remove Constraints That Limit the Size of Offline Social Networks? (Royal Society) — paper by Robin Dunbar. Answer: The data show that the size and range of online egocentric social networks, indexed as the number of Facebook friends, is similar to that of offline face-to-face networks.
  3. Microsoft Embedding ResearchTo break down the walls between its research group and the rest of the company, Microsoft reassigned about half of its more than 1,000 research staff in September 2014 to a new group called MSR NExT. Its focus is on projects with greater impact to the company rather than pure research. Meanwhile, the other half of Microsoft Research is getting pushed to find more significant ways it can contribute to the company’s products. The challenge is how to avoid short-term thinking from your research team. For instance, Facebook assigns some staff to focus on long-term research, and Google’s DeepMind group in London conducts pure AI research without immediate commercial considerations.
  4. Google’s Go-Playing AIThe key to AlphaGo is reducing the enormous search space to something more manageable. To do this, it combines a state-of-the-art tree search with two deep neural networks, each of which contains many layers with millions of neuron-like connections. One neural network, the “policy network,” predicts the next move, and is used to narrow the search to consider only the moves most likely to lead to a win. The other neural network, the “value network,” is then used to reduce the depth of the search tree — estimating the winner in each position in place of searching all the way to the end of the game.
Comment

Building a business that combines human experts and data science

The O’Reilly Data Show podcast: Eric Colson on algorithms, human computation, and building data science teams.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.

350px-bombe

In this episode of the O’Reilly Data Show, I spoke with Eric Colson, chief algorithms officer at Stitch Fix, and former VP of data science and engineering at Netflix. We talked about building and deploying mission-critical, human-in-the-loop systems for consumer Internet companies. Knowing that many companies are grappling with incorporating data science, I also asked Colson to share his experiences building, managing, and nurturing, large data science teams at both Netflix and Stitch Fix.

Augmented systems: “Active learning,” “human-in-the-loop,” and “human computation”

We use the term ‘human computation’ at Stitch Fix. We have a team dedicated to human computation. It’s a little bit coarse to say it that way because we do have more than 2,000 stylists, and these are very much human beings that are very passionate about fashion styling. What we can do is, we can abstract their talent into—you can think of it like an API; there’s certain tasks that only a human can do or we’re going to fail if we try this with machines, so we almost have programmatic access to human talent. We are allowed to route certain tasks to them, things that we could never get done with machines. … We have some of our own proprietary software that blends together two resources: machine learning and expert human judgment. The way I talk about it is, we have an algorithm that’s distributed across the resources. It’s a single algorithm, but it does some of the work through machine resources, and other parts of the work get done through humans.

Read more…

Comment

Is 2016 the year you let robots manage your money?

The O’Reilly Data Show podcast: Vasant Dhar on the race to build “big data machines” in financial investing.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.

350px-Merchants'_Exchange,_Wall_Street,_New_York_City

In this episode of the O’Reilly Data Show, I sat down with Vasant Dhar, a professor at the Stern School of Business and Center for Data Science at NYU, founder of SCT Capital Management, and editor-in-chief of the Big Data Journal (full disclosure: I’m a member of the editorial board). We talked about the early days of AI and data mining, and recent applications of data science to financial investing and other domains.

Dhar’s first steps in applying machine learning to finance

I joke with people, I say, ‘When I first started looking at finance, the only thing I knew was that prices go up and down.’ It was only when I actually went to Morgan Stanley and took time off from academia that I learned about finance and financial markets. … What I really did in that initial experiment is I took all the trades, I appended them with information about the state of the market at the time, and then I cranked it through a genetic algorithm and a tree induction algorithm. … When I took it to the meeting, it generated a lot of really interesting discussion. … Of course, it took several months before we actually finally found the reasons for why I was observing what I was observing.

Read more…

Comment: 1
Four short links: 22 December 2015

Four short links: 22 December 2015

Machine Poetry, Robo Script Kiddies, Big Data of Love, and Virtual Currency and the Nation State

  1. How Machines Write PoetryHarmon would love to have writers or other experts judge FIGURE8’s work, too. Her online subjects tended to rate the similes better if they were obvious. “The snow continued like a heavy rain” got high scores, for example, even though Harmon thought this was quite a bad effort on FIGURE8’s part. She preferred “the snow falls like a dead cat,” which got only middling ratings from humans. “They might have been cat lovers,” she says. FIGURE8 (PDF) system generates figurative language.
  2. The Decisions the Pentagon Wants to Leave to Robots“You cannot have a human operator operating at human speed fighting back at determined cyber tech,” Work said. “You are going to need have a learning machine that does that.” I for one welcome our new robot script kiddie overlords.
  3. Love in the Age of Big DataOver decades, John has observed more than 3,000 couples longitudinally, discovering patterns of argument and subtle behaviors that can predict whether a couple would be happily partnered years later or unhappy or divorced. Turns out, “don’t be a jerk” is good advice for marriages, too. (via Cory Doctorow)
  4. National Security Implications of Virtual Currency (PDF) — Rand research report examining the potential for non-state actor deployment.
Comment
Four short links: 7 December 2015

Four short links: 7 December 2015

Telepresent Axeman, Toxic Workers, Analysis Code, and Cryptocurrency Attacks

  1. Axe-Wielding Robot w/Telepresence (YouTube) — graphic robot-on-wall action at 2m30s. (via IEEE)
  2. Toxic Workers (PDF) — In comparing the two costs, even if a firm could replace an average worker with one who performs in the top 1%, it would still be better off by replacing a toxic worker with an average worker by more than two-to-one. Harvard Business School research. (via Fortune)
  3. Replacing Sawzall (Google) — At Google, most Sawzall analysis has been replaced by Go […] we’ve developed a set of Go libraries that we call Lingo (for Logs in Go). Lingo includes a table aggregation library that brings the powerful features of Sawzall aggregation tables to Go, using reflection to support user-defined types for table keys and values. It also provides default behavior for setting up and running a MapReduce that reads data from the logs proxy. The result is that Lingo analysis code is often as concise and simple as (and sometimes simpler than) the Sawzall equivalent.
  4. Attacks in the World of Cryptocurrency — a review of some of the discussed weakness, attacks, or oddities in cryptocurrency (esp. bitcoin).
Comment
Four short links: 24 November 2015

Four short links: 24 November 2015

Tabular Data, Distrusting Authority, Data is the Future, and Remote Working Challenges

  1. uitable — cute library for tabular data in console golang programs.
  2. Did Carnegie Mellon Attack Tor for the FBI? (Bruce Schneier) — The behavior of the researchers is reprehensible, but the real issue is that CERT Coordination Center (CERT/CC) has lost its credibility as an honest broker. The researchers discovered this vulnerability and submitted it to CERT. Neither the researchers nor CERT disclosed this vulnerability to the Tor Project. Instead, the researchers apparently used this vulnerability to deanonymize a large number of hidden service visitors and provide the information to the FBI. Does anyone still trust CERT to behave in the Internet’s best interests? Analogous to the CIA organizing a fake vaccination drive to get close to Osama. “Intelligence” agencies.
  3. Google Open-Sourcing TensorFlow Shows AI’s Future is Data not Code (Wired) — something we’ve been saying for a long time.
  4. Challenges of Working Remote (Moishe Lettvin) — the things that make working remote hard aren’t, primarily, logistical; they’re emotional.
Comment

Jai Ranganathan on architecting big data applications in the cloud

The O’Reilly Data Show podcast: The Hadoop ecosystem, the recent surge in interest in all things real time, and developments in hardware.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.

350px_Handy_notes_and_queries_(1887)_(14760632446)

Given the quick pace of innovation in the data ecosystem, we like to take a step back from the details of individual components, architecture, and applications, in order to take a wider view of the landscape of big data. This allows us to evaluate the progress of technology and infrastructure along the way, shifting our attention from the details of individual components like Spark and Kafka, to larger trends.

Some of the larger trends we’ve been exploring include the capabilities of distributed machine learning and the tradeoffs and design decisions involved in cloud architecture and stream processing.

In this episode of the O’Reilly Data Show, I sat down with Jai Ranganathan, senior director of product management at Cloudera. We talked about the trends in the Hadoop ecosystem, cloud computing, the recent surge in interest in all things real time, and hardware trends:

Large-scale machine learning

This sounds a bit like this should already exist in really good form right now, but one of the things that I’m really interested in is expanding the set of capabilities for distributed machine learning. While there are systems out there today that do do this, I think relative to what you can experience from a singular environment learning scikit-learn or R, the set of things you can do in a distributed fashion is limited. …  It’s not easy to distribute various algorithms and model-building techniques. I think there is still a lot of work for us to do to improve that experience. … And I do want to have good open source options like MLlib. MLlib may be the right answer. I would be perfectly happy if that’s the final answer, but we do need systems just to provide the kind of depth that you typically are used to in the singular environment. That’s just a matter of time and investment because these are non-trivial problems, but they are things that people are working on.

Read more…

Comment: 1

Building systems for massive scale data applications

The O’Reilly Data Show podcast: Tyler Akidau on the evolution of systems for bounded and unbounded data processing.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.

Spiral_Galaxy

Many of the open source systems and projects we’ve come to love — including Hadoop and HBase — were inspired by systems used internally within Google. These systems were described in papers and implemented by people who needed frameworks that could comfortably scale to massive data sets.

Google engineers and scientists continue to publish interesting papers, and these days some of the big data systems they describe in publications are available on their cloud platform.

In this episode of the O’Reilly Data Show, I sat down with Tyler Akidau one of the lead engineers in Google’s streaming and Dataflow technologies. He recently wrote an extremely popular article that provided a framework for how to think about bounded and unbounded data processing (a follow-up article is due out soon). We talked about the evolution of stream processing, the challenges of building systems that scale to massive data sets, and the recent surge in interest in all things real time:

On the need for MillWheel: A new stream processing engine

At the time [that MillWheel was built], there was, as far as I know, literally nothing externally that could handle the scale that we needed to handle. A lot of the existing streaming systems didn’t focus on out-of-order processing, which was a big deal for us internally. Also we really wanted to hit a strong focus on consistency — being able to get absolutely correct answers. … All three of these things were lacking in at least some area in [the systems we examined].

Read more…

Comment