"Big Data" entries

Four short links: 9 June 2015

Parallelising Without Coordination, AR/VR IxD, Medical Insecurity, and Online Privacy Lies

  1. The Declarative Imperative (Morning Paper) — on Dataflow. …a large class of recursive programs – all of basic Datalog – can be parallelized without any need for coordination. As a side note, this insight appears to have eluded the MapReduce community, where join is necessarily a blocking operator. (A toy sketch of this coordination-free property follows this list.)
  2. Consensual Reality (Alistair Croll) — Among other things we discussed what Inbar calls his three rules for augmented reality design: 1. The content you see has to emerge from the real world and relate to it. 2. Should not distract you from the real world; must add to it. 3. Don’t use it when you don’t need it. If a film is better on the TV watch the TV.
  3. X-Rays Behaving Badly — According to the report, medical devices – in particular so-called picture archive and communications systems (PACS) radiologic imaging systems – are all but invisible to security monitoring systems and provide a ready platform for malware infections to lurk on hospital networks, and for malicious actors to launch attacks on other, high-value IT assets. Among the revelations contained in the report: A malware infection at a TrapX customer site spread from an unmonitored PACS system to a key nurse’s workstation. The result: confidential hospital data was secreted off the network to a server hosted in Guiyang, China. Communications went out encrypted using port 443 (SSL) and were not detected by existing cyber defense software, so TrapX said it is unsure how many records may have been stolen.
  4. The Online Privacy Lie is Unraveling (TechCrunch) — The report authors argue it’s this sense of resignation that is resulting in data tradeoffs taking place — rather than consumers performing careful cost-benefit analysis to weigh up the pros and cons of giving up their data (as marketers try to claim). They also found that where consumers were most informed about marketing practices, they were also more likely to be resigned to not being able to do anything to prevent their data being harvested. Something that didn’t make me regret clicking on a TechCrunch link.
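Item 1’s claim can be demonstrated in miniature. Below is a toy Python sketch (my own illustration, not code from the paper): transitive closure is monotone, meaning it only ever adds facts, so partitions can exchange derived facts in any order and every schedule reaches the same fixpoint, with no blocking coordination step.

    import random

    def step(facts):
        """Apply reach(x, z) :- reach(x, y), reach(y, z) once."""
        return facts | {(x, w) for (x, y) in facts for (z, w) in facts if y == z}

    def fixpoint(facts):
        while (new := step(facts)) != facts:
            facts = new
        return facts

    def run(seed):
        rng = random.Random(seed)
        # Two partitions, each holding a shard of the edge relation.
        parts = [{("a", "b"), ("b", "c")}, {("c", "d"), ("b", "e")}]
        for _ in range(5):  # gossip in an arbitrary, uncoordinated order
            i, j = rng.sample(range(2), 2)
            parts[j] = fixpoint(parts[i] | parts[j])  # monotone merge, then derive
        return parts[0] | parts[1]

    # Every uncoordinated schedule converges to the same eight reach() facts.
    assert len({frozenset(run(s)) for s in range(10)}) == 1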

Create your graphs with R

A deep dive into exploratory and presentation graphs.

Buy “Graphing Data with R: An Introduction” in early release. Editor’s note: this is an excerpt of “Graphing Data with R: An Introduction,” by John Jay Hilfiger.

Graphs are useful both for exploration and for presentation. Exploration is the process of analyzing the data and finding relationships and patterns. Presentation of your findings is making your case to others who have not studied the data as intensively as you have yourself. While one is exploring the data, graphs can be stark, lean, and somewhat unattractive. The data analyst, who knows the data and is getting to know it better with each graph made, does not need all the titles, labels, reference details, and colors that someone sitting through a presentation might expect, and might, indeed, find necessary. Furthermore, adding all this stuff just slows down the analyst. Also, some graphs will prove to be dead ends, or just not very interesting. Consequently, many graphs may be discarded during the discovery journey.

As the process of exploration continues, adding some details may make relationships a little clearer. As the analyst gets closer to presentation and/or publication, the graphs become more detailed and prettier. There probably will have been many plain graphs in the process of analysis and relatively few beautiful graphs that appear in the final report. Read more…
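For a concrete taste of that contrast, here is a minimal base-R sketch (my example, using the built-in mtcars data rather than anything from the book): the first plot is the stark, fast kind made while exploring; the second dresses up the same data with the titles, labels, and color a presentation audience expects.

    plot(mtcars$wt, mtcars$mpg)  # exploratory: defaults, no frills

    plot(mtcars$wt, mtcars$mpg,  # presentation: same data, dressed up
         main = "Fuel Economy Falls as Vehicle Weight Rises",
         xlab = "Weight (1,000 lbs)",
         ylab = "Miles per US gallon",
         pch = 19, col = "steelblue")
    abline(lm(mpg ~ wt, data = mtcars), col = "gray40", lty = 2)  # trend line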


Apache Spark: Powering applications on-premise and in the cloud

The O'Reilly Data Show Podcast: Patrick Wendell on the state of the Spark ecosystem.

Image: Compass Rose, by Justin Pickard, on Flickr.

As organizations shift their focus toward building analytic applications, many are relying on components from the Apache Spark ecosystem. I began pointing this out in advance of the first Spark Summit in 2013, and since then Spark adoption has exploded.

With Spark Summit SF right around the corner, I recently sat down with Patrick Wendell, release manager of Apache Spark and co-founder of Databricks, for this episode of the O’Reilly Data Show Podcast. (Full disclosure: I’m an advisor to Databricks). We talked about how he came to join the UC Berkeley AMPLab, the current state of Spark ecosystem components, Spark’s future roadmap, and interesting applications built on top of Spark.

User-driven from inception

From the beginning, Spark struck me as different from other academic research projects (many of which “wither away” when grad students leave). The AMPLab team behind Spark spoke at local SF Bay Area meetups, hosted two-day events (AMP Camp), and worked hard to help early users. That mindset continues to this day. Wendell explained:

We were trying to work with the early users of Spark, getting feedback on what issues it had and what types of problems they were trying to solve with Spark, and then use that to influence the roadmap. It was definitely a more informal process, but from the very beginning, we were expressly user-driven in the way we thought about building Spark, which is quite different than a lot of other open source projects. We never really built it for our own use — it was not like we were at a company solving a problem and then we decided, “hey let’s let other people use this code for free”. … From the beginning, we were focused on empowering other people and building platforms for other developers, so I always thought that was quite unique about Spark.

Read more…

Four short links: 3 June 2015

Filter Design, Real-Time Analytics, Neural Turing Machines, and Evaluating Subjective Opinions

  1. How to Design Applied Filters — The most frequently observed issue during usability testing was filtering values changing placement when the user applied them – either to another position in the list of filtering values (typically the top) or to an “Applied filters” summary overview. During testing, the subjects were often confounded as they noticed that the filtering value they just clicked was suddenly “no longer there.”
  2. Twitter Heron — a real-time analytics platform that is fully API-compatible with Storm […] At Twitter, Heron is used as our primary streaming system, running hundreds of development and production topologies. Since Heron is efficient in terms of resource usage, after migrating all Twitter’s topologies to it we’ve seen an overall 3x reduction in hardware, causing a significant improvement in our infrastructure efficiency.
  3. ntm — an implementation of neural Turing machines. (via @fastml_extra)
  4. Bayesian Truth Serum — a scoring system for eliciting and evaluating subjective opinions from a group of respondents, in situations where the user of the method has no independent means of evaluating respondents’ honesty or their ability. It leverages respondents’ predictions about how other respondents will answer the same questions. Through these predictions, respondents reveal their meta-knowledge, which is knowledge of what other people know. (A toy scoring sketch follows this list.)
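Here is that scoring idea in miniature (a Python sketch based on my own reading of Prelec’s Bayesian Truth Serum, not code from the paper): each respondent answers a question and predicts the answer distribution; the score adds an information score, which rewards answers that turn out to be surprisingly common relative to the crowd’s predictions, to an alpha-weighted score for prediction accuracy.

    import math

    def bts_scores(answers, predictions, alpha=1.0):
        n = len(answers)
        choices = sorted(set(answers) | {k for p in predictions for k in p})
        # Empirical endorsement frequencies (a tiny epsilon avoids log 0).
        xbar = {k: (sum(a == k for a in answers) + 1e-9) / n for k in choices}
        # Geometric mean of the crowd's predictions, kept in log space.
        log_ybar = {k: sum(math.log(p.get(k, 1e-9)) for p in predictions) / n
                    for k in choices}
        scores = []
        for answer, pred in zip(answers, predictions):
            # Information score: was my answer "surprisingly common"?
            info = math.log(xbar[answer]) - log_ybar[answer]
            # Prediction score: how close was my forecast to the actual split?
            accuracy = alpha * sum(xbar[k] * math.log(pred.get(k, 1e-9) / xbar[k])
                                   for k in choices)
            scores.append(info + accuracy)
        return scores

    answers = ["yes", "yes", "no", "yes"]
    predictions = [{"yes": 0.6, "no": 0.4}, {"yes": 0.5, "no": 0.5},
                   {"yes": 0.7, "no": 0.3}, {"yes": 0.4, "no": 0.6}]
    print(bts_scores(answers, predictions))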
Four short links: 29 May 2015

Data Integration, Carousel, Distributed Project Management, and Costs of Premature Build

  1. Using Logs to Build Solid Data Infrastructure (Martin Kleppmann) — For lack of a better term, I’m going to call this the problem of ‘data integration.’ With that, I really just mean ‘making sure that the data ends up in all the right places.’ Whenever a piece of data changes in one place, it needs to change correspondingly in all the other places where there is a copy or derivative of that data. (A minimal sketch of this log-driven pattern follows this list.)
  2. TremulaJS — sweet carousel, with PHYSICS.
  3. Project Advice (Daniel Bachhuber) — for leaders of distributed teams. Focus on clearing blockers above all else. Because you’re working on an open source project with contributors across many timezones, average time to feedback will optimistically be six hours. More likely, it will be 24-48 hours. This slow feedback cycle can kill progress on pull requests. As a project maintainer, prioritize giving feedback, clearing blockers, and making decisions.
  4. YAGNI (Martin Fowler) — cost of build, cost of delay, cost of carry, cost of repair. Every product manager has felt these costs.
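The sketch promised in link 1 (my own minimal illustration, not Kleppmann’s code): every change is appended to one ordered log, and each derived copy (a cache, a search index, a replica) catches up by replaying the log from its own offset, so all derivatives apply the same changes in the same order and converge on the same state.

    class ChangeLog:
        def __init__(self):
            self.entries = []  # the append-only source of truth

        def append(self, key, value):
            self.entries.append((key, value))

    class Follower:
        """A derived copy that tails the log independently, at its own pace."""
        def __init__(self, log):
            self.log, self.offset, self.state = log, 0, {}

        def catch_up(self):
            while self.offset < len(self.log.entries):
                key, value = self.log.entries[self.offset]
                self.state[key] = value  # apply changes in log order
                self.offset += 1

    log = ChangeLog()
    cache, search_index = Follower(log), Follower(log)
    log.append("user:1", {"name": "Ada"})
    cache.catch_up()  # followers may lag independently...
    log.append("user:1", {"name": "Ada Lovelace"})
    cache.catch_up(); search_index.catch_up()
    assert cache.state == search_index.state  # ...but they converge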

Validating data models with Kafka-based pipelines

A case for back-end A/B testing.

Start the O’Reilly “Introduction to Apache Kafka” training video for free. In this video, Gwen Shapira shows developers and administrators how to integrate Kafka into a data processing pipeline.

A/B testing is a popular method of using business intelligence data to assess possible changes to websites. In the past, when a business wanted to update its website in an attempt to drive more sales, decisions on the specific changes to make were driven by guesses; intuition; focus groups; and ultimately, which executive yelled louder. These days, the data-driven solution is to set up multiple copies of the website, direct users randomly to the different variations and measure which design improves sales the most. There are a lot of details to get right, but this is the gist of things.
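One of those details, sketched below (my own illustration, with hypothetical names, not code from the article): a returning user must see the same variation on every visit, so assignments are usually derived deterministically by hashing a stable user id together with the experiment name, rather than re-randomizing per request.

    import hashlib

    def assign_variant(user_id, experiment, variants=("control", "new_design")):
        digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
        return variants[int(digest, 16) % len(variants)]  # stable per user

    # Sticky assignment: the same user always lands in the same bucket.
    assert assign_variant("user-42", "checkout-redesign") == \
           assign_variant("user-42", "checkout-redesign")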

When it comes to back-end systems, however, we are still living in the stone age. Suppose your business has grown significantly, and you notice that your existing MySQL database is becoming less responsive as the load increases. If you consider moving to a NoSQL system, you need to decide which NoSQL solution to pick — there are a lot of options: Cassandra, MongoDB, Couchbase, or even Hadoop. There are also many possible data models: normalized, wide tables, narrow tables, nested data structures, etc.

A/B testing multiple data stores and data models in parallel

It is surprising how often a company will pick a solution based on intuition or even which architect yelled louder. Rather than making a decision based on facts and numbers regarding capacity, scale, throughput, and data-processing patterns, the back-end architecture decisions are made with fuzzy reasoning. In that scenario, what usually happens is that a data store and a data model are somehow chosen, and the entire development team will dive into a six-month project to move their entire back-end system to the new thing. This project will inevitably take 12 months, and about 9 months in, everyone will suspect that this was a bad idea, but it’s way too late to do anything about it. Read more…
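One plausible way to run that kind of back-end A/B test with Kafka, sketched here as an illustration (the kafka-python client, the localhost broker address, and the write_to_candidate_store loader are my assumptions, not details from this post): production writes go to a single topic, and each candidate store consumes the same stream under its own consumer group, so every contender is evaluated on identical traffic.

    import json
    from kafka import KafkaConsumer, KafkaProducer

    # One write path: the application produces each event exactly once.
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode(),
    )
    producer.send("orders", {"order_id": 1, "total": 42.50})
    producer.flush()

    def write_to_candidate_store(event):
        """Hypothetical loader for whichever back end this group feeds."""
        print("would write:", event)

    # Candidate back ends subscribe independently; distinct group_ids mean
    # each candidate (say, Cassandra vs. MongoDB) sees every event.
    consumer = KafkaConsumer(
        "orders",
        bootstrap_servers="localhost:9092",
        group_id="candidate-cassandra",  # swap per candidate store
        value_deserializer=lambda v: json.loads(v.decode()),
        auto_offset_reset="earliest",
    )
    for message in consumer:
        write_to_candidate_store(message.value)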


Announcing Cassandra certification

A new partnership between O’Reilly and DataStax offers certification and training in Cassandra.

I am pleased to announce a joint program between O’Reilly and DataStax to certify Cassandra developers. This program complements our developer certification for Apache Spark and — just as in the case of Databricks and Spark — we are excited to be working with the leading commercial company behind Cassandra. DataStax has done a tremendous job growing and nurturing the Cassandra community, user base, and technology.

Once the certification program is ready, developers can take the exam online, in designated test centers, and at select training courses. O’Reilly will also be developing books, training days, and videos targeted at developers and companies interested in the Cassandra distributed storage system.

Cassandra is a popular component used for building big data and real-time analytic platforms. Its ability to comfortably scale to clusters with thousands of nodes makes it a popular option for solutions that need to ingest and make sense of large amounts of time series and event data. As noted in an earlier post, real-time event data are at the heart of one of the trends we’re closely following: the convergence of cheap sensors, fast networks, and distributed computation. Read more…
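A small sketch of that time-series pattern (my illustration, using the DataStax Python driver against a hypothetical local node, not material from the post): partitioning by sensor and day, with a clustering column on the event timestamp, is one common way to spread event data across a large cluster while keeping recent readings cheap to read.

    from datetime import datetime, timezone
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect()
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS metrics
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
    """)
    session.execute("""
        CREATE TABLE IF NOT EXISTS metrics.readings (
            sensor_id text, day text, ts timestamp, value double,
            PRIMARY KEY ((sensor_id, day), ts)  -- partition key, then clustering
        ) WITH CLUSTERING ORDER BY (ts DESC)
    """)
    now = datetime.now(timezone.utc)
    session.execute(
        "INSERT INTO metrics.readings (sensor_id, day, ts, value) "
        "VALUES (%s, %s, %s, %s)",
        ("sensor-7", now.strftime("%Y-%m-%d"), now, 21.5),
    )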


Data science makes an impact on Wall Street

The O'Reilly Data Show Podcast: Gary Kazantsev on how big data and data science are making a difference in finance.

Learn more about Next:Money, O’Reilly’s conference focused on the fundamental transformation taking place in the finance industry.

Image: Charging Bull, by Sam Valadi, on Flickr.

Having started my career in industry, working on problems in finance, I’ve always appreciated how challenging it is to build consistently profitable systems in this extremely competitive domain. When I served as a quant at a hedge fund in the late 1990s and early 2000s, I worked primarily with price data (time series). I quickly found that it was difficult to find and sustain profitable trading strategies that leveraged data sources that everyone else in the industry examined exhaustively. In the early-to-mid 2000s, the hedge fund industry began incorporating many more data sources, and today you’re likely to find many finance industry professionals at big data and data science events like Strata + Hadoop World.

During the latest episode of the O’Reilly Data Show Podcast, I had a great conversation with one of the leading data scientists in finance: Gary Kazantsev, who runs the R&D Machine Learning group at Bloomberg LP. As a former quant, I wanted to know the types of problems Kazantsev and his group work on, and the tools and techniques they’ve found useful. We also talked about data science, data engineering, and recruiting data professionals for Wall Street. Read more…


Cultivating a psychological sense of community

A profile of Dr. Renetta Garrison Tull, from our latest report on women in the field of data.

Download our updated report, “Women in Data: Cutting-Edge Practitioners and Their Views on Critical Skills, Background, and Education,” by Cornelia Lévy-Bencheton and Shannon Cutt, featuring four new profiles of women across the European Union. Editor’s note: this is an excerpt from the free report.

Dr. Renetta Garrison Tull is a recognized expert on women and minorities in education, and on the STEM gender gap — both within and outside the academic environment. Dr. Tull is also an electrical engineer by training and is passionate about bringing more women into the field.

From her vantage point at the University of Maryland Baltimore County (UMBC) as associate vice provost for graduate student development and postdoctoral affairs, Dr. Tull concentrates on opportunities for graduate and postdoctoral professional development. As director of PROMISE: Maryland’s Alliance for Graduate Education and the Professoriate (AGEP) program for the University System of Maryland (USM), Dr. Tull also has a unique perspective on the STEM subjects that students cover prior to attending the university, within academia and as preparation for the workforce beyond graduation.

Dr. Tull has been writing code since the seventh grade. Fascinated by the Internet, she “learned HTML before there were WYSIWYGs,” and remains heavily involved with the online world. “I’ve been politely chided in meetings for pulling out my phones (yes plural), sending texts, and updating our organization’s professional Twitter and Facebook status, while taking care of emails from multiple accounts. I manage several blogs, each for different audiences … friends, colleagues, and students.” Read more…


Connected play

How data-driven tech toys are — and aren’t — changing the nature of play.

Sign up to be notified when the new free report Data, Technology & The Future of Play becomes available. This post is part of a series investigating the future of play that will culminate in a full report.

When I was in first grade, I cut the fur pom-poms off of my dad’s mukluks. (If you didn’t grow up in the Canadian North and you don’t know what mukluks are, here’s a picture.) My dad’s mukluks were specially made for him, so he was pretty sore. I cut the pom-poms off because I had just seen The Trouble With Tribbles at a friend’s house, and I desperately wanted some Tribbles. I kept them in a shoebox, named them, brought them to show-and-tell, and pretended they were real.

It’s exactly this kind of imaginative play that a lot of parents are afraid is being lost as toys become smarter. And in exchange for what? There isn’t any real evidence yet that smart toys genuinely make kids smarter.

Low-fi toys. Public domain image, “Boys playing with hoops on Chesnut Street, Toronto, Canada.” Source: Wikimedia Commons.

I tell this story not to emphasize what a terrible vandal I was as a child; rather, I tell it to show how irrepressible children’s imaginations are, and to explain why technological toys are not going to kill that imagination. Today’s “smart” toys are no different from dolls and blocks, or in my case, a pair of mukluks. By nature, all toys have affordances that imply how they should be used. The more complex the toy, the more focused the affordances are. Consider a stick: it can be a weapon, a mode of transport, or a magic wand. But an app that is designed to do a thing guides users toward that use case, just as a door handle suggests that you should grasp and turn it. Design has opinions. Read more…
