Hadoop for business: Analytics across industries
The O’Reilly Podcast: Ben Sharma on the business impact of Hadoop and the evolution of tools
In this episode of the O’Reilly Podcast, O’Reilly’s Ben Lorica chats with Ben Sharma, CEO and co-founder of Zaloni, a company that provides enterprise data management solutions for Hadoop. Sharma was one of the first users of Apache Hadoop, and has a background in enterprise solutions architecture and data analytics.
Before starting Zaloni, Sharma spent many years as a business consultant and began to see that companies across industries were struggling to process, store, and extract value from their data. Having worked extensively in telecom, Sharma helped equipment vendors deploy large-scale network infrastructures at carriers across the world. He began to see how Hadoop could have an impact in the business analytics aspect of companies, not just in IT.
In this interview, Lorica and Sharma discuss the early days of Hadoop and how businesses across industries are benefitting from Hadoop. They also discuss the evolution of tools in the space and how more companies are moving toward real-time decision-making with the growth of streaming tools and real-time data. Read more…
The business value of unifying data
Practical applications of human-in-the-loop machine learning.
With hundreds, thousands, or even just tens of suppliers, each with different business units, payment terms, and locations, businesses face a monumental task: unifying all of their supplier-related data, and doing it fast enough for the data to be useful. In order to ask deep questions about their data, companies are increasingly looking for a single, unified view of their supply chain.
And yet, business data is often stored in different sources, systems, and formats, resulting in silos of information. These data silos take the form of enterprise resource planning systems, CSV files, spreadsheets, and relational databases. To pull together all of the data from these disparate sources, a business faces three interrelated challenges:
- Speed. Traditionally, businesses have attempted to catalog and organize supply chain data manually, profiling and integrating the data themselves. That work is slow, which leads directly to the next challenge: cost.
- Cost. Manual work is expensive work. Usually more than one employee will need to work on the same data set in order to move quickly enough for the results to have any value for the business. Even with several employees working on the same data sets, this work will still not achieve what could be done on a machine scale.
- Efficiency. Relying completely on humans to organize and unify data is a situation ripe for error. Plus, there’s often no audit trail, and the work results in inherently incomplete views of information.
In a recent live demo by Dr. Clare Bernard, a field engineer at Tamr, I got a glimpse into how Tamr is using a combination of machine learning algorithms and input from subject matter experts to help businesses unify their data for analysis. A practice that uses short-term human intervention to actively improve machine models, human-in-the-loop machine learning is taking off across all types of industries, including fashion, automotive, and cloud services such as Google Maps. Read more…
Cultivating a psychological sense of community
A profile of Dr. Renetta Garrison Tull, from our latest report on women in the field of data.
Download our updated report, “Women in Data: Cutting-Edge Practitioners and Their Views on Critical Skills, Background, and Education,” by Cornelia Lévy-Bencheton and Shannon Cutt, featuring four new profiles of women across the European Union. Editor’s note: this is an excerpt from the free report.
Dr. Renetta Garrison Tull is a recognized expert in women and minorities in education, and in the STEM gender gap, both within and outside the academic environment. Dr. Tull is also an electrical engineer by training and is passionate about bringing more women into the field.
From her vantage point at the University of Maryland Baltimore County (UMBC) as associate vice provost for graduate student development and postdoctoral affairs, Dr. Tull concentrates on opportunities for graduate and postdoctoral professional development. As director of PROMISE: Maryland’s Alliance for Graduate Education and the Professoriate (AGEP) program for the University System of Maryland (USM), Dr. Tull also has a unique perspective on the STEM subjects that students cover prior to attending the university, within academia, and as preparation for the workforce beyond graduation.
Dr. Tull has been writing code since the seventh grade. Fascinated by the Internet, she “learned HTML before there were WYSIWYGs,” and remains heavily involved with the online world. “I’ve been politely chided in meetings for pulling out my phones (yes plural), sending texts, and updating our organization’s professional Twitter and Facebook status, while taking care of emails from multiple accounts. I manage several blogs, each for different audiences … friends, colleagues, and students.” Read more…
Investigating Spark’s performance
A deep dive into performance bottlenecks with Spark PMC member Kay Ousterhout.
For many who use and deploy Apache Spark, knowing how to find critical bottlenecks is extremely important. In a recent O’Reilly webcast, Making Sense of Spark Performance, Spark committer and PMC member Kay Ousterhout gave a brief overview of how Spark works, and dove into how she measured performance bottlenecks using new metrics, including block-time analysis. Ousterhout walked through high-level takeaways from her in-depth analysis of several workloads, offered a live demo of a new performance analysis tool, and explained how you can use it to improve your Spark performance.
Her research uncovered surprising insights into Spark’s performance on two benchmarks (TPC-DS and the Big Data Benchmark), and one production workload. As part of our overall series of webcasts on big data, data science, and engineering, this webcast debunked commonly held ideas surrounding network performance, showing that CPU — not I/O — is often a critical bottleneck, and demonstrated how to identify and fix stragglers.
Network performance is almost irrelevant
While there’s been a lot of research work on performance — mainly surrounding the issues of whether to cache input data in-memory or on machine, scheduling, straggler tasks, and network performance — there haven’t been comprehensive studies into what’s most important to performance overall. This is where Ousterhout’s research comes in — taking on what she refers to as “community dogma,” beginning with the idea that network and disk I/O are major bottlenecks. Read more…
Exploring methods in active learning
Tips on how to build effective human-machine hybrids, from crowdsourcing expert Adam Marcus.
In a recent O’Reilly webcast, “Crowdsourcing at GoDaddy: How I Learned to Stop Worrying and Love the Crowd,” Adam Marcus explains how to mitigate common challenges of managing crowd workers, how to make the most of human-in-the-loop machine learning, and how to establish effective and mutually rewarding relationships with workers. Marcus is the director of data on the Locu team at GoDaddy, where the “Get Found” service provides businesses with a central platform for managing their online presence and content.
In the webcast, Marcus uses practical examples from his experience at GoDaddy to reveal helpful methods for how to:
- Offset the inevitability of wrong answers from the crowd
- Develop and train workers through a peer-review system
- Build a hierarchy of trusted workers
- Make crowd work inspiring and enable upward mobility
What to do when humans get it wrong
It turns out there is a simple way to offset human error: redundantly ask people the same questions. Marcus explains that when you ask five different people the same question, there are some creative ways to combine their responses, such as taking a majority vote. Read more…
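As a rough illustration of the majority-vote idea, here is a minimal Python sketch. It is not taken from the webcast: the `majority_vote` function, the tie handling, and the sample answers are all assumptions made for this example.

```python
from collections import Counter


def majority_vote(answers):
    """Combine redundant crowd answers to one question.

    Returns (winner, agreement), where agreement is the share of workers
    who gave the most common answer. On a tie for first place, winner is
    None so the question can be re-asked or escalated (an assumed policy).
    """
    if not answers:
        raise ValueError("need at least one answer")
    ranked = Counter(answers).most_common()
    top_answer, top_count = ranked[0]
    # A tie exists when the runner-up has as many votes as the leader.
    tied = len(ranked) > 1 and ranked[1][1] == top_count
    agreement = top_count / len(answers)
    return (None if tied else top_answer), agreement


# Hypothetical example: five workers answer "Is this business open on Sundays?"
responses = ["open", "open", "closed", "open", "closed"]
winner, agreement = majority_vote(responses)
print(winner, agreement)
```

The agreement share is returned alongside the winner so that low-agreement questions can be sent to more workers instead of being accepted outright.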
Human-in-the-loop machine learning
Practical machine-learning applications and strategies from experts in active learning.
What do you call a practice that most data scientists have heard of, few have tried, and even fewer know how to do well? It turns out, no one is quite certain what to call it. In our latest free report, Real-World Active Learning: Applications and Strategies for Human-in-the-Loop Machine Learning, we examine the relatively new field of “active learning,” also referred to as “human computation,” “human-machine hybrid systems,” and “human-in-the-loop machine learning.” Whatever you call it, the field is exploding with practical applications that are proving the efficiency of combining human and machine intelligence.
Learn from the experts
Through in-depth interviews with experts in the field of active learning and crowdsource management, industry analyst Ted Cuzzillo reveals top tips and strategies for using short-term human intervention to actively improve machine models. As you’ll discover, the point at which a machine model fails is precisely where there’s an opportunity to insert, and benefit from, human judgment.
- When active learning works best
- How to manage crowdsource contributors (including expert-level contributors)
- Basic principles of labeling data
- Best practice methods for assessing labels
- When to skip the crowd and mine your own data
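The idea that human judgment pays off most where the model is least sure can be sketched as a simple confidence threshold. This is an illustrative Python sketch only, not a method described in the report: the `route_predictions` function, the 0.8 threshold, and the sample predictions are hypothetical.

```python
def route_predictions(predictions, threshold=0.8):
    """Split model output into auto-accepted labels and a human review queue.

    predictions: iterable of (item, label, confidence) triples, with
    confidence in [0, 1]. The threshold is an assumed tuning parameter.
    """
    accepted, needs_review = [], []
    for item, label, confidence in predictions:
        if confidence >= threshold:
            accepted.append((item, label))
        else:
            needs_review.append((item, label, confidence))
    # Put the least confident predictions first, so reviewers see the
    # cases where the model is most likely to be wrong.
    needs_review.sort(key=lambda p: p[2])
    return accepted, needs_review


# Hypothetical model output for three business listings.
sample = [
    ("listing-1", "restaurant", 0.97),
    ("listing-2", "salon", 0.55),
    ("listing-3", "restaurant", 0.71),
]
accepted, needs_review = route_predictions(sample)
print(accepted)
print(needs_review)
```

In a full human-in-the-loop setup, the labels that reviewers supply for the queued items would typically be fed back as training data for the next version of the model.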
Explore real-world examples
This report gives you a behind-the-scenes look at how human-in-the-loop machine learning has helped improve the accuracy of Google Maps, match business listings at GoDaddy, rank top search results at Yahoo!, refer relevant job postings to people on LinkedIn, identify expert-level contributors using the Quizz recruitment method, and recommend women’s clothing based on customer and product data at Stitch Fix. Read more…