ENTRIES TAGGED "Big Data"

Four short links: 27 June 2014

Google MillWheel, 20yo Bug, Fast Real-Time Visualizations, and Google's Speed King

  1. MillWheel: Fault-Tolerant Stream Processing at Internet Scale — Google Research paper on the tech underlying the new Cloud Dataflow tool. Watch the video. Yow.
  2. The Integer Overflow Bug That Went to Mars — long-standing (20-year-old!) bug in a compression library prompts a wave of new releases. No word yet on whether NASA will upgrade the rover to avoid being pwned by Martian script kiddies. (update: I fell for a self-promoter. The Martians will need to find another attack vector. Huzzah!)
  3. epoch (github) — Fastly-produced open source general purpose real-time charting library for building beautiful, smooth, and high performance visualizations.
  4. Achieving Rapid Response Times in Large Online Services (YouTube) — Jeff Dean’s keynote at Velocity. He wrote … a lot of things for this. And now he’s into deep learning ….
Four short links: 24 June 2014

Failure of Imagination, Meat Failure Mode, Grand Challenges, and Data Programming

  1. Maximum Happy Imagination (Matt Jones) — questioning the true vision of Marc Andreessen’s recent Twitter discourse on the great future that awaits us. His analogies run out in the 20th century when it comes to the political, social and economic implications of his maximum happy imagination.
  2. The Mirrortocracy — It’s astonishing how many of the people conducting interviews and passing judgement on the careers of candidates have had no training at all on how to do it well. Aside from their own interviews, they may not have ever seen one. I’m all for learning on your own but at least when you write a program wrong it breaks. Without a natural feedback loop, interviewing mostly runs on myth and survivor bias.
  3. Longitude Prize — six prize areas, Grand Challenge style, in clean flight, antibiotic resistance, dementia, food, water, and overcoming paralysis. Mysteriously none for library system that avoids DLL hell.
  4. The Re-Emergence of Datalog — Michael Fogus overviews Datalog and provides examples of how it is implemented and used in Datomic, Cascalog, and the Bacwn Clojure library. See also notes from the talk.
Four short links: 20 June 2014

Available Data, Goal Setting, Real Tech, and Gamification Numbers

  1. Dynamo and BigTable — good preso overview of two approaches to solving availability and consistency in the event of server failure or network partition.
  2. Goals Gone Wild (PDF) — In this article, we argue that the beneficial effects of goal setting have been overstated and that systematic harm caused by goal setting has been largely ignored. We identify specific side effects associated with goal setting, including a narrow focus that neglects non-goal areas, a rise in unethical behavior, distorted risk preferences, corrosion of organizational culture, and reduced intrinsic motivation.
  3. Tech Isn’t All Brogrammers (Alexis Madrigal) — a reminder that there are real scientists and engineers in Silicon Valley working on problems considerably harder than selling ads and delivering pet food to one another. (via Brian Behlendorf)
  4. Numbers from 90+ Gamification Case Studies — cherry-picked anecdata for your business cases.
Four short links: 9 June 2014

SQL against Text, Fake Social Networks, Hidden Biases, and Versioned Data

  1. textql — execute SQL against structured text like CSV or TSV.
  2. Social Network Structure of Fake Friends — author bought 4,000 Twitter followers and studied their relationships.
  3. Hidden Biases in Big Data — with every big data set, we need to ask which people are excluded. Which places are less visible? What happens if you live in the shadow of big data sets? (via Quinn Norton)
  4. CoreObject — a version-controlled object database for Objective-C that supports powerful undo, semantic merging, and real-time collaborative editing.
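
For a feel of what "SQL against structured text" means in the first link above, here is a minimal sketch in Python. It is not textql's implementation or interface: it assumes the standard library's csv and sqlite3 modules, and the helper name query_csv, the table name data, and the sample rows are all made up for illustration.

```python
import csv
import io
import sqlite3


def query_csv(csv_text, sql, table="data"):
    """Load CSV text into an in-memory SQLite table and run one SQL query.

    Hypothetical helper: the first CSV row is treated as the column names,
    and every value is stored as text.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]

    conn = sqlite3.connect(":memory:")
    # Quote column names so headers with spaces still work.
    cols = ", ".join('"{}"'.format(name.replace('"', '""')) for name in header)
    conn.execute('CREATE TABLE "{}" ({})'.format(table, cols))

    placeholders = ", ".join("?" for _ in header)
    conn.executemany(
        'INSERT INTO "{}" VALUES ({})'.format(table, placeholders), body
    )
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


# Made-up sample data, purely for illustration.
SAMPLE = "name,followers\nalice,120\nbob,4000\ncarol,35\n"

big = query_csv(
    SAMPLE,
    "SELECT name FROM data WHERE CAST(followers AS INTEGER) > 100 ORDER BY name",
)
print(big)
```

Values arrive as text, so the query casts followers to an integer before comparing. A real tool would presumably infer column types for you.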

A growing number of applications are being built with Spark

Many more companies want to highlight how they're using Apache Spark in production.

One of the trends we’re following closely at Strata is the emergence of vertical applications. As components for creating large-scale data infrastructures enter their early stages of maturation, companies are focusing on solving data problems in specific industries rather than building tools from scratch. Virtually all of these components are open source and have contributors across many companies. Organizations are also sharing best practices for building big data applications, through blog posts, white papers, and presentations at conferences like Strata.

These trends are particularly apparent in a set of technologies that originated from UC Berkeley’s AMPLab: the number of companies that are using (or plan to use) Spark in production has exploded over the last year. The surge in popularity of the Apache Spark ecosystem stems from the maturation of its individual open source components and the growing community of users. The tight integration of high-performance tools that address different problems and workloads, coupled with a simple programming interface (in Python, Java, and Scala), makes Spark one of the most popular projects in big data. The charts below show the amount of active development in Spark:

Apache Spark contributions

For the second year in a row, I’ve had the privilege of serving on the program committee for the Spark Summit. I’d like to highlight a few areas where Apache Spark is making inroads. I’ll focus on proposals from companies building applications on top of Spark.

How to be agile with your big data

Agile methodology brings flexibility to the EDW and offers ways to integrate open-source technologies with existing systems.

Data analysis, like other pursuits, is a balancing act. The rise of big data ratchets up the pressure on the traditional enterprise data warehouse (EDW) and associated software tools to handle rapidly evolving sets of new demands posed by the business. Companies want their EDW systems to be more flexible and more user friendly — without sacrificing processing speeds, data integrity, or overall reliability.

“The more data you give the business, the more questions they will ask,” says José Carlos Eiras, who has served as CIO at Kraft Foods, Philip Morris, General Motors, and DHL. “When you have big data, you have a lot of different questions, and suddenly you need an enterprise data warehouse that is very flexible.”

EDWs are remarkably powerful, but it takes considerable expertise and creativity to modify them on the fly. Adding new capabilities to the EDW generally requires significant investments of time and money. You can develop your own tools internally or purchase them from a vendor, but either way, it’s a hard slog.

Four short links: 26 May 2014

Statistical Sensitivity, Scientific Mining, Data Mining Books, and Two-Sided Smartphones

  1. Car Alarms and Smoke Alarms (Slideshare) — how to think about and draw the line between sensitivity and specificity.
  2. 101 Uses for Content Mining — between the list in the post and the comments from readers, it’s a good introduction to some of the value to be obtained from full-text structured and unstructured access to scientific research publications.
  3. 12 Free-as-in-beer Data Mining Books — for your next flight.
  4. Dual-Touch Smartphone Concept — brilliant design sketches for interactivity using the back of the phone as a touch-sensitive input device.
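
The sensitivity/specificity trade-off in the first link above comes down to two ratios. A minimal sketch in Python follows. The function name and the alarm counts are invented for illustration and are not taken from the slides.

```python
def sensitivity_specificity(tp, fp, tn, fn):
    """Return (sensitivity, specificity) from confusion-matrix counts.

    sensitivity = share of real events that raised an alarm
    specificity = share of non-events that stayed quiet
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Hypothetical alarm: 9 of 10 real events caught, 200 false alarms
# across 1,000 uneventful periods.
sens, spec = sensitivity_specificity(tp=9, fp=200, tn=800, fn=1)
print(sens, spec)
```

An alarm tuned to miss nothing pushes sensitivity up at the cost of more false alarms, which lowers specificity. Where to draw that line depends on what a miss costs compared with a false alarm.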
Four short links: 22 May 2014

Local Clusters, Pancoopticon, Indie Oversupply, and Open Source PDF

  1. Ferry — helps you create big data clusters on your local machine. Define your big data stack using YAML and share your application with Dockerfiles. Ferry supports Hadoop, Cassandra, Spark, GlusterFS, and Open MPI.
  2. What Google Told SEC — For example, a few years from now, we and other companies could be serving ads and other content on refrigerators, car dashboards, thermostats, glasses, and watches, to name just a few possibilities. The only thing they make that people want to buy is the ad space around what you’re actually trying to do.
  3. The Indie Bubble is Popping (Jeff Vogel) — gamers’ budgets and the number of hours in the day to play games are not increasing at the rate at which the number of games on the market is increasing.
  4. pdfium — Chrome’s PDF engine, open source.
Four short links: 1 May 2014

Cloud Jurisdiction, Driverless Cars, Robotics IPOs, and Fitting a Catalytic Convertor to Your Data Exhaust

  1. US Providers Must Divulge from Offshore Servers (Gigaom) — A U.S. magistrate judge ruled that U.S. cloud vendors must fork over customer data even if that data resides in data centers outside the country. (via Alistair Croll)
  2. Inside Google’s Self-Driving Car (Atlantic Cities) — Urmson says the value of maps is one of the key insights that emerged from the DARPA challenges. They give the car a baseline expectation of its environment; they’re the difference between the car opening its eyes in a completely new place and having some prior idea what’s going on around it. This is a long and interesting piece on the experience and the creator’s concerns around the self-driving cars. Still looking for the comprehensive piece on the subject.
  3. Recent Robotics-Related IPOs — not all the exits are to Google.
  4. How One Woman Hid Her Pregnancy From Big Data (Mashable) — “I really couldn’t have done it without Tor, because Tor was really the only way to manage totally untraceable browsing. I know it’s gotten a bad reputation for Bitcoin trading and buying drugs online, but I used it for BabyCenter.com.”
Four short links: 23 April 2014

Mobile UX, Ideation Tools, Causal Consistency, and Intellectual Ventures Patent Fail

  1. Samsung UX (Scribd) — little shop of self-catalogued UX horrors, courtesy discovery in a lawsuit. Dated (Android G1 as competition) but rewarding to see there are signs of self-awareness in the companies that inflict unusability on the world.
  2. Tools for Ideation and Problem Solving (Dan Lockton) — comprehensive and analytical take on different systems for ideas and solutions.
  3. Don’t Settle for Eventual Consistency (ACM) — proposes “causal consistency”, prototyped in COPS and Eiger from Princeton.
  4. Intellectual Ventures Loses Patent Case (Ars Technica) — The Capital One case ended last Wednesday, when a Virginia federal judge threw out the two IV patents that remained in the case. It’s the first IV patent case seen through to a judgment, and it ended in a total loss for the patent-holding giant: both patents were invalidated, one on multiple grounds.