- Predicting a Billboard Music Hit (YouTube) — Shazam VP of Music and Platforms at Strata London. With relative accuracy, we can predict 33 days out what song will go to No. 1 on the Billboard charts in the U.S.
- Psychological Pricing Strategies — a handy wrap-up of evil^wuseful pricing strategies to know.
- What Two Programmers Have Revealed So Far About Seattle Police Officers Who Are Still in Uniform — through their shrewd use of Washington’s Public Records Act, the two Seattle residents are now the closest thing the city has to a civilian police-oversight board. In the last year and a half, they have acquired hundreds of reports, videos, and 911 calls related to the Seattle Police Department’s internal investigations of officer misconduct between 2010 and 2013. And though they have only combed through a small portion of the data, they say they have found several instances of officers appearing to lie, use racist language, and use excessive force—with no consequences. In fact, they believe that the Office of Professional Accountability (OPA) has systematically “run interference” for cops. In the aforementioned cases of alleged officer misconduct, all of the involved officers were exonerated and still remain on the force.
- Understanding Docker Security and Best Practices — explanation of container security and a benchmark for security practices, though email addresses will need to be surrendered in exchange for the good info.
"Big Data" entries
Scale-out applications need scaled-in virtualization.
Data center operating systems are emerging as a first-class category of distributed system software. Hadoop, for example, is evolving from a MapReduce framework into YARN, a generic platform for scale-out applications.
To enable a rich ecosystem of diverse applications to coexist on these platforms, providing adequate isolation is crucial. The isolation mechanism must enforce resource limits, decouple software dependencies among applications and the host, provide security and privacy, confine failures, etc. Containers offer a simple and elegant solution to the problem. However, a question that comes up frequently is: Why not virtual machines (VMs)? After all, these systems face a number of the same challenges that have been solved by virtualization for traditional enterprise applications.
All problems in computer science can be solved by another level of indirection, except of course for the problem of too many indirections” — David Wheeler
From linear models to neural networks: an interview with Reza Zadeh.
Get notified when our free report, “Future of Machine Intelligence: Perspectives from Leading Practitioners,” is available for download. The following interview is one of many that will be included in the report.
As part of our ongoing series of interviews surveying the frontiers of machine intelligence, I recently interviewed Reza Zadeh. Reza is a Consulting Professor in the Institute for Computational and Mathematical Engineering at Stanford University and a Technical Advisor to Databricks. His work focuses on Machine Learning Theory and Applications, Distributed Computing, and Discrete Applied Mathematics.
- Neural networks have made a comeback and are playing a growing role in new approaches to machine learning.
- The greatest successes are being achieved via a supervised approach leveraging established algorithms.
- Spark is an especially well-suited environment for distributed machine learning.
David Beyer: Tell us a bit about your work at Stanford
Reza Zadeh: At Stanford, I designed and teach distributed algorithms and optimization (CME 323) as well as a course called discrete mathematics and algorithms (CME 305). In the discrete mathematics course, I teach algorithms from a completely theoretical perspective, meaning that it is not tied to any programming language or framework, and we fill up whiteboards with many theorems and their proofs. Read more…
A survey of the landscape shows the types of tools remain the same, but interfaces continue to improve.
As data projects become complex and as data teams grow in size, individuals and organizations need tools to efficiently manage data projects. A while back, I wrote a post on common options, and I closed that piece by asking:
Are there completely different ways of thinking about reproducibility, lineage, sharing, and collaboration in the data science and engineering context?
At the time, I listed categories that seemed to capture much of what I was seeing in practice: (proprietary) workbooks aimed at business analysts, sophisticated IDEs, notebooks (for mixing text, code, and graphics), and workflow tools. At a high level, these tools aspire to enable data teams to do the following:
- Reproduce their work — so they can rerun and/or audit when needed
- Facilitate storytelling — because in many cases, it’s important to explain to others how results were derived
- Operationalize successful and well-tested pipelines — particularly when deploying to production is a long-term objective
As I survey the landscape, the types of tools remain the same, but interfaces continue to improve, and domain specific languages (DSLs) are starting to appear in the context of data projects. One interesting trend is that popular user interface models are being adapted to different sets of data professionals (e.g. workflow tools for business users). Read more…
What you miss with a "get it right the first time" mentality
Download our updated Women in Data report, which features four new profiles of women across the European Union. You can also pick-up a copy at Strata + Hadoop World London, where Alice Zheng will lead a session on Deploying Machine Learning in Production.
Lately, there has been a slew of media coverage about the Imposter Syndrome. Many columnists, bloggers, and public speakers have spoken or written about their own struggles with the Imposter Syndrome. And original psychological research on the Imposter Syndrome has found that out of every five successful people, two consider themselves a fraud.
I’m certainly no stranger to the sinking feeling of being out of place. During college and graduate school, it often seemed like everyone else around me was sailing through to the finish line, while I alone lumbered with the weight of programming projects and mathematical proofs. This led to an ongoing self-debate about my choice of a major and profession. One day, I noticed myself reading the same sentence over and over again in a textbook; my eyes were looking at the text, but my mind was saying, “Why aren’t you getting this yet? It’s so simple. Everybody else gets it. What’s wrong with you?”
When I look back upon those years, I have two thoughts: 1. That was hard. 2. What a waste of perfectly good brain cells! I could have done so many cool things if I had not spent all that time doubting myself.
But one can’t simply snap out of the Imposter Syndrome. It has a variety of causes, and it’s sticky. I was brought up with the idea of holding myself to a high standard, to measure my own progress against others’ achievements. Falling short of expectations is supposed to be a great motivator for action…or is it? Read more…
Practical tips for centralizing security data.
But let’s be realistic. You probably have numerous repositories for your security data. Your Security Information and Event Management (SIEM) solution doesn’t scale to the volumes of data that you would really like to collect. This, in turn, makes it hard to use all of your data for any kind of analytics. It’s likely that your tools have to operate on multiple, disconnected data stores that have very different capabilities for data access and analysis. Even worse, during an incident, how many different consoles do you have to touch before you get the complete picture of what has happened? I would guess probably at least four (I would have said 42, but that seemed a bit excessive).
When talking to your peers about this problem, do they tell you to implement Hadoop to deal with the huge data volumes? But what does that really mean — is Hadoop really the solution? After all, Hadoop is a pretty complex ecosystem of tools that requires skilled and expensive people to implement and maintain. Read more…
The O'Reilly Data Show Podcast: Michael Stack on HBase past, present, and future.
Subscribe to the O’Reilly Data Show to explore the opportunities and techniques driving big data and data science.
At least once a year, I sit down with Michael Stack, engineer at Cloudera, to get an update on Apache HBase and the annual user conference, HBasecon. Stack has a great perspective, as he has been part of HBase since its inception. As former project leader, he remains a key contributor and evangelist, and one of the organizers of HBasecon.
In the beginning: Search and Bigtable
During the latest episode of the O’Reilly Data Show Podcast, I decided to broaden our conversation to include the beginnings of the very popular Apache HBase project. Stack reminded me that in the early days much of the big data community in the SF Bay Area was centered around search technologies, such as HBase. In particular, HBase was inspired by work out of Google (Bigtable), and the early engineers had ties to projects out of the Internet Archive:
At the time, I was working at the Internet Archive, and I was working on crawlers and search. The Bigtable paper looked really interesting to us because the archive, as you know, we used to host — or still do — the Wayback Machine. The Wayback Machine is a picture of the Web that goes back to 1998, and you could look at the Web at any particular time. What pages looked liked at a particular time. Bigtable was very interesting at the Internet Archive because it had this time dimension.
A group had started up to talk about the possibility of implementing a Bigtable clone. It was centered at a place called Powerset, a startup that was in San Francisco back then. That was about doing a search, so I went and talked to them. They said, ‘Come on over and we’ll make a space for doing a Bigtable clone.’ They had a very intricate search pipeline, and it was based on early Amazon AWS, and every time they started up their pipeline, they’d get a phone call from Amazon saying, ‘Please stop whatever it is you’re doing.’ … The first engineer would be a fellow called Jim Kellerman. The actual first 30 classes came from Mike Cafarella. He was instrumental in getting the first versions of Hadoop going. He was hanging around Apache Nutch at the time. … Doug [Cutting] used to work at the Internet archive, and the first actual versions of Hadoop were run on racks at the Internet archive. Doug was working on fulltext search. Then he moved on to go to Yahoo, to work on Hadoop full time.