- Disinformation Visualisation: How to Lie with Datavis — We don’t spread visual lies by presenting false data. That would be lying. We lie by misrepresenting the data to tell the very specific story we’re interested in telling. If this is making you slightly uncomfortable, that’s a good thing; it should. If you’re concerned about adopting this new and scary habit, well, don’t worry; it’s not new. Just open your CV to be reminded you’ve lied with truthful data before. This time, however, it will be explicit and visual. (via Regine Debatty)
- Microtugs — a new type of small robot that can apply orders of magnitude more force than it weighs. This is in stark contrast to previous small robots that have become progressively better at moving and sensing, but lacked the ability to change the world through the application of human-scale loads.
- Vault — a tool for securely managing secrets and encrypting data in-transit.
- iSAX: Indexing and Mining Terabyte Sized Time Series (PDF) — Our approach allows both fast exact search and ultra-fast approximate search. We show how to exploit the combination of both types of search as sub-routines in data mining algorithms, allowing for the exact mining of truly massive real-world data sets, containing millions of time series. (via Benjamin Black)
"Big Data" entries
A survey of the landscape shows the types of tools remain the same, but interfaces continue to improve.
As data projects become complex and as data teams grow in size, individuals and organizations need tools to efficiently manage data projects. A while back, I wrote a post on common options, and I closed that piece by asking:
Are there completely different ways of thinking about reproducibility, lineage, sharing, and collaboration in the data science and engineering context?
At the time, I listed categories that seemed to capture much of what I was seeing in practice: (proprietary) workbooks aimed at business analysts, sophisticated IDEs, notebooks (for mixing text, code, and graphics), and workflow tools. At a high level, these tools aspire to enable data teams to do the following:
- Reproduce their work — so they can rerun and/or audit when needed
- Facilitate storytelling — because in many cases, it’s important to explain to others how results were derived
- Operationalize successful and well-tested pipelines — particularly when deploying to production is a long-term objective
As I survey the landscape, the types of tools remain the same, but interfaces continue to improve, and domain-specific languages (DSLs) are starting to appear in the context of data projects. One interesting trend is that popular user interface models are being adapted to different sets of data professionals (e.g., workflow tools for business users). Read more…
What you miss with a "get it right the first time" mentality
Download our updated Women in Data report, which features four new profiles of women across the European Union. You can also pick up a copy at Strata + Hadoop World London, where Alice Zheng will lead a session on Deploying Machine Learning in Production.
Lately, there has been a slew of media coverage about the Imposter Syndrome. Many columnists, bloggers, and public speakers have spoken or written about their own struggles with it. And original psychological research on the Imposter Syndrome has found that two out of every five successful people consider themselves frauds.
I’m certainly no stranger to the sinking feeling of being out of place. During college and graduate school, it often seemed like everyone else around me was sailing through to the finish line, while I alone lumbered with the weight of programming projects and mathematical proofs. This led to an ongoing self-debate about my choice of a major and profession. One day, I noticed myself reading the same sentence over and over again in a textbook; my eyes were looking at the text, but my mind was saying, “Why aren’t you getting this yet? It’s so simple. Everybody else gets it. What’s wrong with you?”
When I look back upon those years, I have two thoughts: 1. That was hard. 2. What a waste of perfectly good brain cells! I could have done so many cool things if I had not spent all that time doubting myself.
But one can’t simply snap out of the Imposter Syndrome. It has a variety of causes, and it’s sticky. I was brought up to hold myself to a high standard and to measure my own progress against others’ achievements. Falling short of expectations is supposed to be a great motivator for action…or is it? Read more…
Practical tips for centralizing security data.
But let’s be realistic. You probably have numerous repositories for your security data. Your Security Information and Event Management (SIEM) solution doesn’t scale to the volumes of data that you would really like to collect. This, in turn, makes it hard to use all of your data for any kind of analytics. It’s likely that your tools have to operate on multiple, disconnected data stores that have very different capabilities for data access and analysis. Even worse, during an incident, how many different consoles do you have to touch before you get the complete picture of what has happened? I would guess probably at least four (I would have said 42, but that seemed a bit excessive).
When talking to your peers about this problem, do they tell you to implement Hadoop to deal with the huge data volumes? But what does that really mean — is Hadoop really the solution? After all, Hadoop is a pretty complex ecosystem of tools that requires skilled and expensive people to implement and maintain. Read more…
The O'Reilly Data Show Podcast: Michael Stack on HBase past, present, and future.
Subscribe to the O’Reilly Data Show to explore the opportunities and techniques driving big data and data science.
At least once a year, I sit down with Michael Stack, engineer at Cloudera, to get an update on Apache HBase and the annual user conference, HBasecon. Stack has a great perspective, as he has been part of HBase since its inception. As former project leader, he remains a key contributor and evangelist, and one of the organizers of HBasecon.
In the beginning: Search and Bigtable
During the latest episode of the O’Reilly Data Show Podcast, I decided to broaden our conversation to include the beginnings of the very popular Apache HBase project. Stack reminded me that in the early days much of the big data community in the SF Bay Area was centered around search technologies. In particular, HBase was inspired by work out of Google (Bigtable), and the early engineers had ties to projects out of the Internet Archive:
At the time, I was working at the Internet Archive, and I was working on crawlers and search. The Bigtable paper looked really interesting to us because the archive, as you know, we used to host (or still do) the Wayback Machine. The Wayback Machine is a picture of the Web that goes back to 1998, and you could look at the Web at any particular time. What pages looked like at a particular time. Bigtable was very interesting at the Internet Archive because it had this time dimension.
A group had started up to talk about the possibility of implementing a Bigtable clone. It was centered at a place called Powerset, a startup that was in San Francisco back then. That was about doing a search, so I went and talked to them. They said, ‘Come on over and we’ll make a space for doing a Bigtable clone.’ They had a very intricate search pipeline, and it was based on early Amazon AWS, and every time they started up their pipeline, they’d get a phone call from Amazon saying, ‘Please stop whatever it is you’re doing.’ … The first engineer would be a fellow called Jim Kellerman. The actual first 30 classes came from Mike Cafarella. He was instrumental in getting the first versions of Hadoop going. He was hanging around Apache Nutch at the time. … Doug [Cutting] used to work at the Internet Archive, and the first actual versions of Hadoop were run on racks at the Internet Archive. Doug was working on full-text search. Then he moved on to go to Yahoo, to work on Hadoop full time.
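The "time dimension" Stack describes corresponds to the Bigtable data model, in which a cell is addressed by row key, column, and timestamp. The following is a minimal sketch of that idea only. The class and method names (`VersionedTable`, `put`, `get_as_of`) and the sample data are hypothetical and are not the HBase or Bigtable API.

```python
import bisect
from collections import defaultdict


class VersionedTable:
    """Toy (row, column, timestamp) -> value map; not the HBase API."""

    def __init__(self):
        # Each cell keeps parallel lists, both ordered by timestamp.
        self._timestamps = defaultdict(list)
        self._values = defaultdict(list)

    def put(self, row, column, timestamp, value):
        key = (row, column)
        # Insert while keeping the versions ordered by timestamp.
        i = bisect.bisect_right(self._timestamps[key], timestamp)
        self._timestamps[key].insert(i, timestamp)
        self._values[key].insert(i, value)

    def get_as_of(self, row, column, timestamp):
        """Return the newest value written at or before `timestamp`."""
        key = (row, column)
        i = bisect.bisect_right(self._timestamps.get(key, []), timestamp)
        return self._values[key][i - 1] if i else None


# Hypothetical crawl history for one page, keyed by crawl year.
pages = VersionedTable()
pages.put("example.com/", "contents:html", 1998, "<html>v1</html>")
pages.put("example.com/", "contents:html", 2005, "<html>v2</html>")
print(pages.get_as_of("example.com/", "contents:html", 2001))  # <html>v1</html>
```

Keeping every version sorted by timestamp is what makes a Wayback-style "what did this page look like then?" lookup a single binary search.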
Integrating open source tools into a data warehouse has its advantages.
Although next-gen big data tools such as Hadoop, Spark, and MongoDB are finding more and more uses, most organizations need to maintain data in traditional relational stores as well. Deriving the benefits of both key/value stores and relational databases takes a lot of juggling. Three basic strategies are currently in use.
- Double up on your data storage. Log everything in your fast key/value repository and duplicate part of it (or perform some reductions and store the results) in your relational data warehouse.
- Store data primarily in a relational data warehouse, and use extract, transform, and load (ETL) tools to make it available for analytics. These tools run a fine-toothed comb through data to perform string manipulation, remove outlier values, etc., and produce a data set in the format required by data processing tools.
- Put each type of data into the repository best suited to it (relational, Hadoop, etc.), but run queries between the repositories and return results from one repository to another for post-processing.
The appeal of the first is its large-scale simplicity, in that it uses well-understood systems in parallel. The second gives business users the familiar access of a relational database. This article focuses on the third solution, which has advantages over the others: it avoids the redundancy of the first solution and is much easier to design and maintain than the second. I’ll describe how it is accomplished by Teradata, through its appliances and cloud solutions, but the building blocks are standard, open source tools such as Hive and HCatalog, so this strategy can be implemented by anyone. Read more…
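To make the third strategy concrete, here is a small sketch of the pattern: run the reduction where the raw events live, then hand the result to the relational side for the join. It is an illustration under stated assumptions, not the article's method. It uses an in-memory `sqlite3` database and a Python list as stand-ins for the warehouse and the key/value store, and it does not use Hive, HCatalog, or any Teradata product. All table names, function names, and data are hypothetical.

```python
import sqlite3
from collections import Counter

# Stand-in for a key/value event log (hypothetical data).
click_log = [
    {"customer_id": 1, "page": "/pricing"},
    {"customer_id": 2, "page": "/docs"},
    {"customer_id": 1, "page": "/signup"},
]

# Stand-in for the relational data warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
warehouse.executemany(
    "INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")]
)


def clicks_per_customer(log):
    """Aggregate in the event store, so only a small result moves."""
    return Counter(event["customer_id"] for event in log)


def clicks_by_name(db, log):
    """Pass the reduced result to the relational side for the join."""
    counts = clicks_per_customer(log)
    db.execute("CREATE TEMP TABLE clicks (customer_id INTEGER, n INTEGER)")
    db.executemany("INSERT INTO clicks VALUES (?, ?)", counts.items())
    rows = db.execute(
        "SELECT c.name, k.n FROM customers c "
        "JOIN clicks k ON k.customer_id = c.id ORDER BY c.name"
    ).fetchall()
    db.execute("DROP TABLE clicks")
    return rows


print(clicks_by_name(warehouse, click_log))  # [('Acme', 2), ('Globex', 1)]
```

Only the per-customer counts cross between the two stores, which is how this approach avoids duplicating the raw data.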
Why every data pipeline should have a Unified Logging Layer.
The value of log data for business is unimpeachable. On every level of the organization, the question, “How are we doing?” is answered, ultimately, by log data. Error logs tell developers what went wrong in their applications. User event logs give product managers insights on usage. If the CEO has a question about the next quarter’s revenue forecast, the answer ultimately comes from payment/CRM logs. In this post, I explore the ideal frameworks for collecting and parsing logs.
Apache Kafka Architect Jay Kreps wrote a wonderfully crisp survey on log data. He begins with the simple question of “What is the log?” and elucidates its key role in thinking about data pipelines. Jay’s piece focuses mostly on storing and processing log data. Here, I focus on the steps before storing and processing.
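As a rough illustration of the parsing step that comes before storing and processing, the sketch below turns a text log line into a structured record. The line format (timestamp, level, then `key=value` pairs), the function name `parse_line`, and the sample line are assumptions made for this example; they are not taken from any particular logging framework.

```python
import json
import re

# Assumed line format: "<timestamp> <LEVEL> <key=value ...>"
LINE_PATTERN = re.compile(r"^(?P<time>\S+) (?P<level>[A-Z]+) (?P<fields>.*)$")


def parse_line(line):
    """Turn one text log line into a dict, or None if it does not match."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    record = {"time": match["time"], "level": match["level"]}
    for pair in match["fields"].split():
        key, sep, value = pair.partition("=")
        if sep:  # skip tokens that are not key=value
            record[key] = value
    return record


# Hypothetical payment log line.
line = "2015-05-01T12:00:00Z INFO user=42 action=checkout amount=19.99"
print(json.dumps(parse_line(line), sort_keys=True))
```

Emitting JSON means downstream consumers, which are mostly machines, do not each need their own regular expression for the same log.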
Changing the way we think about log data
Over the last decade, the primary consumer of log data shifted from humans to machines.
Software engineers still read logs, especially when their software behaves in an unexpected manner. However, in terms of “bytes processed,” humans account for a tiny fraction of the total consumption. Much of today’s “big data” is some form of log data, and businesses run tens of thousands of servers to parse and mine these logs to gain competitive edge. Read more…
A deep dive into performance bottlenecks with Spark PMC member Kay Ousterhout.
For many who use and deploy Apache Spark, knowing how to find critical bottlenecks is extremely important. In a recent O’Reilly webcast, Making Sense of Spark Performance, Spark committer and PMC member Kay Ousterhout gave a brief overview of how Spark works, then dove into how she measured performance bottlenecks using new metrics, including block-time analysis. Ousterhout walked through high-level takeaways from her in-depth analysis of several workloads, gave a live demo of a new performance analysis tool, and explained how you can use it to improve your Spark performance.
Her research uncovered surprising insights into Spark’s performance on two benchmarks (TPC-DS and the Big Data Benchmark), and one production workload. As part of our overall series of webcasts on big data, data science, and engineering, this webcast debunked commonly held ideas surrounding network performance, showing that CPU — not I/O — is often a critical bottleneck, and demonstrated how to identify and fix stragglers.
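One plausible reading of block-time analysis is this: record how long each task spent blocked on a resource, subtract that time, and re-simulate the job to get an upper bound on what an infinitely fast resource could save. The sketch below follows that reading. The field names (`runtime`, `network`, `disk`), the simple slot-based scheduler, and the numbers are all assumptions for illustration. This is not Ousterhout's tool and does not use Spark's metrics API.

```python
import heapq


def job_time(task_durations, slots):
    """Simulate running tasks in order on `slots` parallel slots."""
    finish_times = [0.0] * slots
    heapq.heapify(finish_times)
    for duration in task_durations:
        # The next task starts on whichever slot frees up first.
        start = heapq.heappop(finish_times)
        heapq.heappush(finish_times, start + duration)
    return max(finish_times)


def best_case_speedup(tasks, resource, slots):
    """Upper bound on the fraction of job time saved if `resource` were free.

    `tasks` is a list of dicts holding a total "runtime" plus the seconds
    spent blocked on each resource (hypothetical field names).
    """
    original = job_time([t["runtime"] for t in tasks], slots)
    without = job_time([t["runtime"] - t[resource] for t in tasks], slots)
    return (original - without) / original


# Made-up measurements, in seconds, for three tasks.
tasks = [
    {"runtime": 10.0, "network": 1.0, "disk": 2.0},
    {"runtime": 8.0, "network": 0.0, "disk": 4.0},
    {"runtime": 6.0, "network": 2.0, "disk": 1.0},
]
for resource in ("network", "disk"):
    print(resource, round(best_case_speedup(tasks, resource, slots=2), 3))
```

Because the blocked time is removed entirely, the result is a ceiling on the possible improvement rather than a prediction, which is what makes it useful for ruling a suspected bottleneck out.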
Network performance is almost irrelevant
While there’s been a lot of research work on performance — mainly surrounding the issues of whether to cache input data in-memory or on machine, scheduling, straggler tasks, and network performance — there haven’t been comprehensive studies into what’s most important to performance overall. This is where Ousterhout’s research comes in — taking on what she refers to as “community dogma,” beginning with the idea that network and disk I/O are major bottlenecks. Read more…