"Big Data" entries

More tools for managing and reproducing complex data projects

A survey of the landscape shows the types of tools remain the same, but interfaces continue to improve.


As data projects become more complex and data teams grow in size, individuals and organizations need tools to manage those projects efficiently. A while back, I wrote a post on common options, and I closed that piece by asking:

Are there completely different ways of thinking about reproducibility, lineage, sharing, and collaboration in the data science and engineering context?

At the time, I listed categories that seemed to capture much of what I was seeing in practice: (proprietary) workbooks aimed at business analysts, sophisticated IDEs, notebooks (for mixing text, code, and graphics), and workflow tools. At a high level, these tools aspire to enable data teams to do the following:

  • Reproduce their work — so they can rerun and/or audit when needed
  • Collaborate
  • Facilitate storytelling — because in many cases, it’s important to explain to others how results were derived
  • Operationalize successful and well-tested pipelines — particularly when deploying to production is a long-term objective

As I survey the landscape, the types of tools remain the same, but interfaces continue to improve, and domain-specific languages (DSLs) are starting to appear in the context of data projects. One interesting trend is that popular user interface models are being adapted to different sets of data professionals (e.g., workflow tools for business users). Read more…
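To make the "workflow tool" category concrete, here is a minimal, hypothetical sketch of what an embedded workflow DSL for a data project might look like. The `Task` and `Pipeline` names are illustrative only, not taken from any particular product; real workflow tools add scheduling, retries, and provenance tracking on top of this basic dependency model.

```python
# A toy, hypothetical workflow DSL: tasks declare dependencies,
# and the pipeline runs them in dependency order.
# (Illustrative only; not modeled on any specific product.)

class Task:
    def __init__(self, name, func, depends_on=None):
        self.name = name
        self.func = func
        self.depends_on = depends_on or []

class Pipeline:
    def __init__(self, tasks):
        self.tasks = {t.name: t for t in tasks}

    def run(self):
        done = set()

        def run_task(name):
            if name in done:
                return
            for dep in self.tasks[name].depends_on:
                run_task(dep)          # run upstream tasks first
            print(f"running {name}")
            self.tasks[name].func()
            done.add(name)

        for name in self.tasks:
            run_task(name)

pipeline = Pipeline([
    Task("extract", lambda: None),
    Task("clean", lambda: None, depends_on=["extract"]),
    Task("report", lambda: None, depends_on=["clean"]),
])
pipeline.run()   # extract -> clean -> report
```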


Four short links: 29 April 2015

Deceptive Visualisation, Small Robots, Managing Secrets, and Large Time Series

  1. Disinformation Visualisation: How to Lie with Datavis — We don’t spread visual lies by presenting false data. That would be lying. We lie by misrepresenting the data to tell the very specific story we’re interested in telling. If this is making you slightly uncomfortable, that’s a good thing; it should. If you’re concerned about adopting this new and scary habit, well, don’t worry; it’s not new. Just open your CV to be reminded you’ve lied with truthful data before. This time, however, it will be explicit and visual. (via Regine Debatty)
  2. Microtugs — a new type of small robot that can apply orders of magnitude more force than it weighs. This is in stark contrast to previous small robots that have become progressively better at moving and sensing, but lacked the ability to change the world through the application of human-scale loads.
  3. Vault — a tool for securely managing secrets and encrypting data in transit.
  4. iSAX: Indexing and Mining Terabyte Sized Time Series (PDF) — Our approach allows both fast exact search and ultra-fast approximate search. We show how to exploit the combination of both types of search as sub-routines in data mining algorithms, allowing for the exact mining of truly massive real-world data sets, containing millions of time series. (via Benjamin Black)

Embracing failure and learning from the Imposter Syndrome

What you miss with a "get it right the first time" mentality


Download our updated Women in Data report, which features four new profiles of women across the European Union. You can also pick up a copy at Strata + Hadoop World London, where Alice Zheng will lead a session on Deploying Machine Learning in Production.

Lately, there has been a slew of media coverage about the Imposter Syndrome. Many columnists, bloggers, and public speakers have spoken or written about their own struggles with it. And original psychological research on the Imposter Syndrome has found that two out of every five successful people consider themselves frauds.

I’m certainly no stranger to the sinking feeling of being out of place. During college and graduate school, it often seemed like everyone else around me was sailing through to the finish line, while I alone lumbered with the weight of programming projects and mathematical proofs. This led to an ongoing self-debate about my choice of a major and profession. One day, I noticed myself reading the same sentence over and over again in a textbook; my eyes were looking at the text, but my mind was saying, “Why aren’t you getting this yet? It’s so simple. Everybody else gets it. What’s wrong with you?”

When I look back upon those years, I have two thoughts: 1. That was hard. 2. What a waste of perfectly good brain cells! I could have done so many cool things if I had not spent all that time doubting myself.

But one can’t simply snap out of the Imposter Syndrome. It has a variety of causes, and it’s sticky. I was brought up to hold myself to a high standard, to measure my own progress against others’ achievements. Falling short of expectations is supposed to be a great motivator for action…or is it? Read more…


How to implement a security data lake

Practical tips for centralizing security data.

Information security has been dealing with terabytes of data for more than a decade — almost two. The benefits of having more data available span many use cases, from forensic investigations to proactively finding anomalies and stopping adversaries before they cause harm.

But let’s be realistic. You probably have numerous repositories for your security data. Your Security Information and Event Management (SIEM) solution doesn’t scale to the volumes of data that you would really like to collect. This, in turn, makes it hard to use all of your data for any kind of analytics. It’s likely that your tools have to operate on multiple, disconnected data stores that have very different capabilities for data access and analysis. Even worse, during an incident, how many different consoles do you have to touch before you get the complete picture of what has happened? I would guess probably at least four (I would have said 42, but that seemed a bit excessive).

When you talk to your peers about this problem, do they tell you to implement Hadoop to deal with the huge data volumes? But what does that really mean — is Hadoop really the solution? After all, Hadoop is a pretty complex ecosystem of tools that requires skilled and expensive people to implement and maintain. Read more…
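Whatever backend you choose, one practical first step toward a central security data store is normalizing events from different sources into a common schema before loading them. The sketch below is a minimal illustration under assumed field names (the firewall and proxy record layouts are invented for the example), not a description of any particular product.

```python
# Toy illustration: map heterogeneous security events onto one common
# record layout before loading them into a central store.
# Source-format field names are assumptions for this example.
from datetime import datetime, timezone

COMMON_FIELDS = ["timestamp", "source", "src_ip", "dst_ip", "action"]

def normalize_firewall(event):
    # assumed firewall log fields: "ts", "src", "dst", "verdict"
    return {
        "timestamp": datetime.fromtimestamp(event["ts"], tz=timezone.utc).isoformat(),
        "source": "firewall",
        "src_ip": event["src"],
        "dst_ip": event["dst"],
        "action": event["verdict"],
    }

def normalize_proxy(event):
    # assumed proxy log fields: "time", "client", "server", "status"
    return {
        "timestamp": event["time"],
        "source": "proxy",
        "src_ip": event["client"],
        "dst_ip": event["server"],
        "action": "allow" if event["status"] < 400 else "block",
    }

records = [
    normalize_firewall({"ts": 1430000000, "src": "10.0.0.5", "dst": "203.0.113.9", "verdict": "drop"}),
    normalize_proxy({"time": "2015-04-27T12:00:00Z", "client": "10.0.0.7", "server": "198.51.100.2", "status": 403}),
]
for r in records:
    print({k: r[k] for k in COMMON_FIELDS})
```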


Four short links: 24 April 2015

Jeff Jonas, Siri and Mesos, YouTube's Bandwidth Bill, and AWS Numbers

  1. Decoding Jeff Jonas (National Geographic) — “He thinks in three—no, four dimensions,” Nathan says. “He has a data warehouse in his head.” And that’s where the work takes place—in his head. Not on paper. Not on a computer. He resorts to paper only to work the details out. When asked about his thought process, Jonas reaches for words, then says: “It’s like a Rubik’s Cube. It all clicks into place.” “The solution,” he says, is “simply there to find.” Jeff’s a genius and has his own language for explaining what he does. This quote goes a long way to explaining it.
  2. How Apple Uses Mesos for Siri — great to see not only some details of the tooling that Apple built, but also their acknowledgement of the open source foundations and ongoing engagement with those open source communities. There have been times in the past when Apple felt like a parasite on the commons rather than a participant.
  3. Cheaper Bandwidth or Bust: How Google Saved YouTube (ArsTechnica) — Remember YouTube’s $2 million-a-month bandwidth bill before the Google acquisition? While it wasn’t an overnight transition, apply Google’s data center expertise, and this cost drops to about $666,000 a month.
  4. AWS Business Numbers — Amazon Web Services generated $5.2 billion over the past four quarters, and almost $700 million in operating income. During the first quarter of 2015, AWS sales reached $1.6 billion, up 49% year-over-year, and roughly 7% of Amazon’s overall sales.

Coming full circle with Bigtable and HBase

The O'Reilly Data Show Podcast: Michael Stack on HBase past, present, and future.


Subscribe to the O’Reilly Data Show to explore the opportunities and techniques driving big data and data science.

At least once a year, I sit down with Michael Stack, engineer at Cloudera, to get an update on Apache HBase and the annual user conference, HBaseCon. Stack has a great perspective, as he has been part of HBase since its inception. A former project leader, he remains a key contributor and evangelist, and is one of the organizers of HBaseCon.

In the beginning: Search and Bigtable

During the latest episode of the O’Reilly Data Show Podcast, I decided to broaden our conversation to include the beginnings of the very popular Apache HBase project. Stack reminded me that in the early days, much of the big data community in the SF Bay Area was centered around search technologies. In particular, HBase was inspired by work out of Google (Bigtable), and the early engineers had ties to projects out of the Internet Archive:

At the time, I was working at the Internet Archive, and I was working on crawlers and search. The Bigtable paper looked really interesting to us because the archive, as you know, we used to host — or still do — the Wayback Machine. The Wayback Machine is a picture of the Web that goes back to 1998, and you could look at the Web at any particular time — what pages looked like at a particular time. Bigtable was very interesting at the Internet Archive because it had this time dimension.

A group had started up to talk about the possibility of implementing a Bigtable clone. It was centered at a place called Powerset, a startup that was in San Francisco back then. That was about doing search, so I went and talked to them. They said, ‘Come on over and we’ll make a space for doing a Bigtable clone.’ They had a very intricate search pipeline, and it was based on early Amazon AWS, and every time they started up their pipeline, they’d get a phone call from Amazon saying, ‘Please stop whatever it is you’re doing.’ … The first engineer would be a fellow called Jim Kellerman. The actual first 30 classes came from Mike Cafarella. He was instrumental in getting the first versions of Hadoop going. He was hanging around Apache Nutch at the time. … Doug [Cutting] used to work at the Internet Archive, and the first actual versions of Hadoop were run on racks at the Internet Archive. Doug was working on full-text search. Then he moved on to Yahoo to work on Hadoop full time.

Read more…


Squaring big data with database queries

Integrating open source tools into a data warehouse has its advantages.


Although next-gen big data tools such as Hadoop, Spark, and MongoDB are finding more and more uses, most organizations need to maintain data in traditional relational stores as well. Deriving the benefits of both key/value stores and relational databases takes a lot of juggling. Three basic strategies are currently in use.

  • Double up on your data storage. Log everything in your fast key/value repository and duplicate part of it (or perform some reductions and store the results) in your relational data warehouse.
  • Store data primarily in a relational data warehouse, and use extract, transform, and load (ETL) tools to make it available for analytics. These tools run a fine-toothed comb through data to perform string manipulation, remove outlier values, etc. and produce a data set in the format required by data processing tools.
  • Put each type of data into the repository best suited to it (relational, Hadoop, etc.), but run queries between the repositories and return results from one repository to another for post-processing.

The appeal of the first is its simplicity: it runs well-understood systems in parallel. The second offers business users the familiarity of relational databases. This article focuses on the third solution, which has advantages over the others: it avoids the redundancy of the first solution and is much easier to design and maintain than the second. I’ll describe how it is accomplished by Teradata, through its appliances and cloud solutions, but the building blocks are standard, open source tools such as Hive and HCatalog, so this strategy can be implemented by anyone. Read more…
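As a rough illustration of the third strategy, the sketch below queries a Hive table from Python and pulls the result into a local data frame for post-processing alongside data from a relational store. It assumes the PyHive and pandas packages and uses an invented host, table, and file name; Teradata's own cross-repository query tooling is not shown here.

```python
# A minimal sketch of querying a Hadoop-side table (via Hive) and
# post-processing the result locally. Host, port, table, and file names
# are assumptions for illustration; adapt them to your environment.
import pandas as pd
from pyhive import hive  # assumes the PyHive package is installed

# Connect to HiveServer2 (hypothetical host).
conn = hive.Connection(host="hive.example.com", port=10000, username="analyst")

# Aggregate raw clickstream data where it lives, in Hadoop...
clicks = pd.read_sql(
    "SELECT user_id, COUNT(*) AS clicks FROM web_logs GROUP BY user_id",
    conn,
)

# ...then join it with a small dimension table exported from the
# relational warehouse (here, just a local CSV for illustration).
customers = pd.read_csv("customers.csv")  # columns: user_id, segment
report = clicks.merge(customers, on="user_id").groupby("segment")["clicks"].sum()
print(report)
```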


The log: The lifeblood of your data pipeline

Why every data pipeline should have a Unified Logging Layer.

The value of log data for business is unimpeachable. On every level of the organization, the question, “How are we doing?” is answered, ultimately, by log data. Error logs tell developers what went wrong in their applications. User event logs give product managers insights on usage. If the CEO has a question about the next quarter’s revenue forecast, the answer ultimately comes from payment/CRM logs. In this post, I explore the ideal frameworks for collecting and parsing logs.

Apache Kafka Architect Jay Kreps wrote a wonderfully crisp survey on log data. He begins with the simple question of “What is the log?” and elucidates its key role in thinking about data pipelines. Jay’s piece focuses mostly on storing and processing log data. Here, I focus on the steps before storing and processing.

Changing the way we think about log data


The old paradigm — machines to humans, and the new — machines to machines. Image courtesy of Kiyoto Tamura.

Over the last decade, the primary consumer of log data shifted from humans to machines.

Software engineers still read logs, especially when their software behaves in an unexpected manner. However, in terms of “bytes processed,” humans account for a tiny fraction of the total consumption. Much of today’s “big data” is some form of log data, and businesses run tens of thousands of servers to parse and mine these logs to gain a competitive edge. Read more…
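Treating machines as the primary consumers starts with emitting logs in a structured, machine-parseable form rather than free text. Below is a minimal standard-library sketch of a JSON log formatter; a unified logging layer such as Fluentd would then collect and ship these records to downstream stores and processors. The field names are arbitrary choices for the example.

```python
# Minimal sketch: emit log records as JSON so that machines, not humans,
# are the primary consumers. Field names here are illustrative choices.
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "time": int(time.time()),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Attach any structured context passed via `extra=`.
        if hasattr(record, "event"):
            payload["event"] = record.event
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Each line is a self-describing JSON record a collector can parse.
logger.info("purchase completed", extra={"event": {"user_id": 42, "amount_usd": 19.99}})
```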


Investigating Spark’s performance

A deep dive into performance bottlenecks with Spark PMC member Kay Ousterhout.

For many who use and deploy Apache Spark, knowing how to find critical bottlenecks is extremely important. In a recent O’Reilly webcast, Making Sense of Spark Performance, Spark committer and PMC member Kay Ousterhout gave a brief overview of how Spark works, and dove into how she measured performance bottlenecks using new metrics, including block-time analysis. Ousterhout walked through high-level takeaways from her in-depth analysis of several workloads, offered a live demo of a new performance analysis tool, and explained how you can use it to improve your Spark performance.

Her research uncovered surprising insights into Spark’s performance on two benchmarks (TPC-DS and the Big Data Benchmark), and one production workload. As part of our overall series of webcasts on big data, data science, and engineering, this webcast debunked commonly held ideas surrounding network performance, showing that CPU — not I/O — is often a critical bottleneck, and demonstrated how to identify and fix stragglers.

Network performance is almost irrelevant

While there’s been a lot of research work on performance — mainly surrounding the issues of whether to cache input data in memory or on disk, scheduling, straggler tasks, and network performance — there haven’t been comprehensive studies into what’s most important to performance overall. This is where Ousterhout’s research comes in — taking on what she refers to as “community dogma,” beginning with the idea that network and disk I/O are major bottlenecks. Read more…
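To make the block-time idea concrete, here is a toy calculation, with made-up task timings, of the question it asks: if tasks never had to block on the network, how much faster could a stage finish? This is only a sketch of the reasoning, not Ousterhout's actual tooling.

```python
# Toy block-time analysis with invented numbers: estimate the best-case
# speedup from eliminating time spent blocked on the network.
# (Illustrative reasoning only; not Kay Ousterhout's actual tool.)

# Each task: (total_runtime_ms, time_blocked_on_network_ms)
tasks = [
    (1200, 150),
    (980, 60),
    (1500, 300),   # a straggler with relatively heavy network blocking
    (1100, 90),
]

# With tasks running in parallel, stage time is bounded by the slowest task.
actual_stage_ms = max(total for total, _ in tasks)
ideal_stage_ms = max(total - blocked for total, blocked in tasks)

speedup = actual_stage_ms / ideal_stage_ms
print(f"best-case speedup from removing network blocking: {speedup:.2f}x")
# With these made-up numbers the gain is modest, echoing the finding that
# network I/O is often not the dominant bottleneck.
```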


Four short links: 17 April 2015

Distributed SQLite, Communicating Scientists, Learning from Failure, and Cat Convergence

  1. Replicating SQLite using Raft Consensus — clever; he used a consensus algorithm to build a distributed (replicated) SQLite.
  2. When Open Access is the Norm, How do Scientists Communicate? (PLOS) — From interviews I’ve conducted with researchers and software developers who are modeling aspects of modern online collaboration, I’ve highlighted the most useful and reproducible practices. (via Jon Udell)
  3. Meet DJ Patil — “It was this kind of moment when you realize: ‘Oh, my gosh, I am that stupid,’” he said.
  4. Interview with Bruce Sterling on the Convergence of Humans and Machines — If you are a human being, and you are doing computation, you are trying to multiply 17 times five in your head. It feels like thinking. Machines can multiply, too. They must be thinking. They can do math and you can do math. But the math you are doing is not really what cognition is about. Cognition is about stuff like seeing, maneuvering, having wants, desires. Your cat has cognition. Cats cannot multiply 17 times five. They have got their own umwelt (environment). But they are mammalian, you are a mammalian. They are actually a class that includes you. You are much more like your house cat than you are ever going to be like Siri. You and Siri converging, you and your house cat can converge a lot more easily. You can take the imaginary technologies that many post-human enthusiasts have talked about, and you could afflict all of them on a cat. Every one of them would work on a cat. The cat is an ideal laboratory animal for all these transitions and convergences that we want to make for human beings. (via Vaughan Bell)