- The Declarative Imperative (Morning Paper) — on Dataflow. …a large class of recursive programs – all of basic Datalog – can be parallelized without any need for coordination. As a side note, this insight appears to have eluded the MapReduce community, where join is necessarily a blocking operator.
- Consensual Reality (Alistair Croll) — Among other things we discussed what Inbar calls his three rules for augmented reality design: 1. The content you see has to emerge from the real world and relate to it. 2. Should not distract you from the real world; must add to it. 3. Don’t use it when you don’t need it. If a film is better on the TV watch the TV.
- X-Rays Behaving Badly — According to the report, medical devices – in particular so-called picture archive and communications systems (PACS) radiologic imaging systems – are all but invisible to security monitoring systems and provide a ready platform for malware infections to lurk on hospital networks, and for malicious actors to launch attacks on other, high value IT assets. Among the revelations contained in the report: A malware infection at a TrapX customer site spread from a unmonitored PACS system to a key nurse’s workstation. The result: confidential hospital data was secreted off the network to a server hosted in Guiyang, China. Communications went out encrypted using port 443 (SSL) and were not detected by existing cyber defense software, so TrapX said it is unsure how many records may have been stolen.
- The Online Privacy Lie is Unraveling (TechCrunch) — The report authors’ argue it’s this sense of resignation that is resulting in data tradeoffs taking place — rather than consumers performing careful cost-benefit analysis to weigh up the pros and cons of giving up their data (as marketers try to claim). They also found that where consumers were most informed about marketing practices they were also more likely to be resigned to not being able to do anything to prevent their data being harvested. Something that didn’t make me regret clicking on a TechCrunch link.
"Big Data" entries
Practical applications of human-in-the-loop machine learning.
With hundreds, thousands, or even just tens of suppliers — each with different business units, payment terms, and locations — businesses are faced with a monumental task: unifying all of their supplier-related data, and fast so that it can be useful. In order to ask deep questions about their data, companies are increasingly looking for a single, unified view of their supply chain.
And yet, business data is often stored in different sources, systems, and formats, resulting in silos of information. These data silos take the form of enterprise resource planning systems, CSV files, spreadsheets, and relational databases. To pull together all of the data from these disparate sources, a business faces three interrelated challenges:
- Speed. Traditionally, businesses have attempted to catalog and organize supply chain data manually — profiling and integrating data themselves, which leads directly to the next challenge: cost.
- Cost. Manual work is expensive work. Usually more than one employee will need to work on the same data set in order to move quickly enough for the results to have any value for the business. Even with several employees working on the same data sets, this work will still not achieve what could be done on a machine scale.
- Efficiency. Relying completely on humans to organize and unify data is a situation ripe for error. Plus, there’s often no audit trail, and the work results in inherently incomplete views of information.
In a recent live demo by Dr. Clare Bernard, a field engineer at Tamr, I got a glimpse into how Tamr is using a combination of machine learning algorithms and input from subject matter experts to help businesses unify their data for analysis. A practice that uses short-term human intervention to actively improve machine models, human-in-the-loop machine learning is taking off across all types of industries, including fashion, automotive, and cloud services such as Google Maps. Read more…
A deep-dive into exploratory and presentation graphs.
Buy “Graphing Data with R: An Introduction” in early release. Editor’s note: this is an excerpt of “Graphing Data with R: An Introduction,” by John Jay Hilfiger.Graphs are useful both for exploration and for presentation. Exploration is the process of analyzing the data and finding relationships and patterns. Presentation of your findings is making your case to others who have not studied the data as intensively as you have yourself. While one is exploring the data, graphs can be stark, lean, and somewhat unattractive. The data analyst, who knows the data and is getting to know it better with each graph made, does not need all the titles, labels, reference details, and colors that someone sitting through a presentation might expect, and might, indeed, find necessary. Furthermore, adding all this stuff just slows down the analyst. Also, some graphs will prove to be dead ends, or just not very interesting. Consequently, many graphs may be discarded during the discovery journey.
As the process of exploration continues, adding some details may make relationships a little clearer. As the analyst gets closer to presentation and/or publication, the graphs become more detailed and prettier. There probably will have been many plain graphs in the process of analysis and relatively few beautiful graphs that appear in the final report. Read more…
The O'Reilly Data Show Podcast: Patrick Wendell on the state of the Spark ecosystem.
As organizations shift their focus toward building analytic applications, many are relying on components from the Apache Spark ecosystem. I began pointing this out in advance of the first Spark Summit in 2013 and since then, Spark adoption has exploded.
With Spark Summit SF right around the corner, I recently sat down with Patrick Wendell, release manager of Apache Spark and co-founder of Databricks, for this episode of the O’Reilly Data Show Podcast. (Full disclosure: I’m an advisor to Databricks). We talked about how he came to join the UC Berkeley AMPLab, the current state of Spark ecosystem components, Spark’s future roadmap, and interesting applications built on top of Spark.
User-driven from inception
From the beginning, Spark struck me as different from other academic research projects (many of which “wither away” when grad students leave). The AMPLab team behind Spark spoke at local SF Bay Area meetups, they hosted 2-day events (AMP Camp), and worked hard to help early users. That mindset continues to this day. Wendell explained:
We were trying to work with the early users of Spark, getting feedback on what issues it had and what types of problems they were trying to solve with Spark, and then use that to influence the roadmap. It was definitely a more informal process, but from the very beginning, we were expressly user-driven in the way we thought about building Spark, which is quite different than a lot of other open source projects. We never really built it for our own use — it was not like we were at a company solving a problem and then we decided, “hey let’s let other people use this code for free”. … From the beginning, we were focused on empowering other people and building platforms for other developers, so I always thought that was quite unique about Spark.
A case for back-end A/B testing.
Start the O’Reilly “Introduction to Apache Kafka” training video for free. In this video, Gwen Shapira shows developers and administrators how to integrate Kafka into a data processing pipeline.
A/B testing is a popular method of using business intelligence data to assess possible changes to websites. In the past, when a business wanted to update its website in an attempt to drive more sales, decisions on the specific changes to make were driven by guesses; intuition; focus groups; and ultimately, which executive yelled louder. These days, the data-driven solution is to set up multiple copies of the website, direct users randomly to the different variations and measure which design improves sales the most. There are a lot of details to get right, but this is the gist of things.
When it comes to back-end systems, however, we are still living in the stone age. Suppose your business grew significantly and you notice that your existing MySQL database is becoming less responsive as the load increases. Suppose you consider moving to a NoSQL system, you need to decide which NoSQL solution to pick — there are a lot of options: Cassandra, MongoDB, Couchbase, or even Hadoop. There are also many possible data models: normalized, wide tables, narrow tables, nested data structures, etc.
A/B testing multiple data stores and data models in parallel
It is surprising how often a company will pick a solution based on intuition or even which architect yelled louder. Rather than making a decision based on facts and numbers regarding capacity, scale, throughput, and data-processing patterns, the back-end architecture decisions are made with fuzzy reasoning. In that scenario, what usually happens is that a data store and a data model are somehow chosen, and the entire development team will dive into a six-month project to move their entire back-end system to the new thing. This project will inevitably take 12 months, and about 9 months in, everyone will suspect that this was a bad idea, but it’s way too late to do anything about it. Read more…
A new partnership between O’Reilly and DataStax offers certification and training in Cassandra.
I am pleased to announce a joint program between O’Reilly and DataStax to certify Cassandra developers. This program complements our developer certification for Apache Spark and — just as in the case of Databricks and Spark — we are excited to be working with the leading commercial company behind Cassandra. DataStax has done a tremendous job growing and nurturing the Cassandra community, user base, and technology.
Once the certification program is ready, developers can take the exam online, in designated test centers, and at select training courses. O’Reilly will also be developing books, training days, and videos targeted at developers and companies interested in the Cassandra distributed storage system.
Cassandra is a popular component used for building big data and real-time analytic platforms. Its ability to comfortably scale to clusters with thousands of nodes makes it a popular option for solutions that need to ingest and make sense of large amounts of time series and event data. As noted in an earlier post, real-time event data are at the heart of one of the trends we’re closely following: the convergence of cheap sensors, fast networks, and distributed computation. Read more…
The O'Reilly Data Show Podcast: Gary Kazantsev on how big data and data science are making a difference in finance.
Learn more about Next:Money, O’Reilly’s conference focused on the fundamental transformation taking place in the finance industry.
Having started my career in industry, working on problems in finance, I’ve always appreciated how challenging it is to build consistently profitable systems in this extremely competitive domain. When I served as quant at a hedge fund in the late 1990s and early 2000s, I worked primarily with price data (time-series). I quickly found that it was difficult to find and sustain profitable trading strategies that leveraged data sources that everyone else in the industry examined exhaustively. In the early-to-mid 2000s the hedge fund industry began incorporating many more data sources, and today you’re likely to find many finance industry professionals at big data and data science events like Strata + Hadoop World.
During the latest episode of the O’Reilly Data Show Podcast, I had a great conversation with one of the leading data scientists in finance: Gary Kazantsev runs the R&D Machine Learning group at Bloomberg LP. As a former quant, I wanted to know the types of problems Kazantsev and his group work on, and the tools and techniques they’ve found useful. We also talked about data science, data engineering, and recruiting data professionals for Wall Street. Read more…
A profile of Dr. Renetta Garrison Tull, from our latest report on women in the field of data.
Download our updated report, “Women in Data: Cutting-Edge Practitioners and Their Views on Critical Skills, Background, and Education,” by Cornelia Lévy-Bencheton and Shannon Cutt, featuring four new profiles of women across the European Union. Editor’s note: this is an excerpt from the free report.Dr. Renetta Garrison Tull is a recognized expert in women and minorities in education, and in the STEM gender gap — both within and outside the academic environment. Dr. Tull is also an electrical engineer by training and is passionate about bringing more women into the field.
From her vantage point at the University of Maryland Baltimore County (UMBC) as associate vice provost for graduate student development and postdoctoral affairs, Dr. Tull concentrates on opportunities for graduate and postdoctoral professional development. As director of PROMISE: Maryland’s Alliance for Graduate Education and the Professoriate (AGEP) program for the University System of Maryland (USM), Dr. Tull also has a unique perspective on the STEM subjects that students cover prior to attending the university, within academia and as preparation for the workforce beyond graduation.
Dr. Tull has been writing code since the seventh grade. Fascinated by the Internet, she “learned HTML before there were WYSIWYGs,” and remains heavily involved with the online world. “I’ve been politely chided in meetings for pulling out my phones (yes plural), sending texts, and updating our organization’s professional Twitter and Facebook status, while taking care of emails from multiple accounts. I manage several blogs, each for different audiences … friends, colleagues, and students.” Read more…