ENTRIES TAGGED "Big Data"
Numenta Code, Soccer Robotics, Security Data Science, Open Wireless Router
- nupic (github) — GPL v3-licensed code from Numenta, at last. See their patent position.
- Robocup — soccer robotics contest; a condition of entry is that all code is open sourced after the contest. (via The Economist)
- Security Data Science Paper Collection — machine learning, big data, analysis, reports, all around security issues.
- Building an Open Wireless Router — EFF call for coders to help build a wireless router that’s more secure and more supportive of open sharing than current devices.
Data Brokers, Car Data, Pattern Classification, and Hogwild Deep Learning
- Inside Data Brokers — very readable explanation of the data brokers and how their information is used to track advertising effectiveness.
- Elon, I Want My Data! — Tesla doesn’t give you access to the data that your car collects. Bodes poorly for the Internet of Sealed Boxes. (via BoingBoing)
- Pattern Classification (Github) — collection of tutorials and examples for solving and understanding machine learning and pattern classification tasks.
- HOGWILD! (PDF) — the algorithm that Microsoft credit with the success of their Adam deep learning system.
Developer Inequality, Weak Signals, Geek Feminism Wiki, and Reidentification Risks
- Developer Inequality (Jonathan Edwards) — The bigger injustice is that programming has become an elite: a vocation requiring rare talents, grueling training, and total dedication. The way things are today if you want to be a programmer you had best be someone like me on the autism spectrum who has spent their entire life mastering vast realms of arcane knowledge — and enjoys it. Normal humans are effectively excluded from developing software. (via Slashdot)
- Signals From Foo Camp (O’Reilly Radar) — useful for me (aka “the stuff I didn’t get to see”), hopefully useful to you too. Companies outside of Silicon Valley badly want to understand it and want to find ways to truly collaborate with it, but they’re worried that conversations can turn into competition. “Old industry” has incredible expertise and operates in very complex environments, and it has much to teach tech, if tech will listen. Silicon Valley isn’t an IT department for the world, it’s the competition.
- Feminist Point of View: Lessons from Running the Geek Feminism Wiki — deck from Alex’s OS Bridge session. Today’s awareness and actions around sexism in tech resulted from their actions, sometimes directly, sometimes indirectly.
- Big Data Should Not Be a Faith-Based Initiative (Cory Doctorow) — Re-identification is part of the Big Data revolution: among the new meanings we are learning to extract from huge corpuses of data is the identity of the people in that dataset. And since we’re commodifying and sharing these huge datasets, they will still be around in ten, twenty and fifty years, when those same Big Data advancements open up new ways of re-identifying — and harming — their subjects.
Efficient Representation, Page Rendering, Graph Database, Warning Effectiveness
- word2vec — This tool provides an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words. These representations can be subsequently used in many natural language processing applications and for further research. From Google Research paper Efficient Estimation of Word Representations in Vector Space.
- What Every Frontend Developer Should Know about Page Rendering — Rendering has to be optimized from the very beginning, when the page layout is being defined, as styles and scripts play the crucial role in page rendering. Professionals have to know certain tricks to avoid performance problems. This article does not study the inner browser mechanics in detail, but rather offers some common principles.
- Cayley — an open-source graph database inspired by the graph database behind Freebase and Google’s Knowledge Graph.
- Alice in Warningland (PDF) — We performed a field study with Google Chrome and Mozilla Firefox’s telemetry platforms, allowing us to collect data on 25,405,944 warning impressions. We find that browser security warnings can be successful: users clicked through fewer than a quarter of both browsers’ malware and phishing warnings and a third of Mozilla Firefox’s SSL warnings. We also find clickthrough rates as high as 70.2% for Google Chrome SSL warnings, indicating that the user experience of a warning can have tremendous impact on user behaviour.
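For readers who have not met the skip-gram architecture named in the word2vec entry above, here is a minimal sketch of the idea behind its training data: each word is paired with the words that appear within a small window around it, and the model learns vectors that predict those neighbours. The function name `skipgram_pairs` and the fixed `window` parameter are assumptions made for illustration; this is not the word2vec tool's own code, which adds further steps that are omitted here.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Return (center, context) pairs for every token and its neighbours.

    Illustrative only: a fixed symmetric window, no subsampling.
    """
    pairs = []
    for i, center in enumerate(tokens):
        # Clamp the window to the bounds of the sentence.
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


if __name__ == "__main__":
    sentence = "the quick brown fox".split()
    for center, context in skipgram_pairs(sentence, window=1):
        print(center, context)
```

The continuous bag-of-words architecture uses the same windows in the opposite direction: the surrounding words are combined to predict the center word.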
Google MillWheel, 20yo Bug, Fast Real-Time Visualizations, and Google's Speed King
- MillWheel: Fault-Tolerant Stream Processing at Internet Scale — Google Research paper on the tech underlying the new cloud DataFlow tool. Watch the video. Yow.
- The Integer Overflow Bug That Went to Mars — long-standing (20 year old!) bug in a compression library prompts a wave of new releases. No word yet on whether NASA will upgrade the rover to avoid being pwned by Martian script kiddies. (update: I fell for a self-promoter. The Martians will need to find another attack vector. Huzzah!)
- epoch (github) — Fastly-produced open source general purpose real-time charting library for building beautiful, smooth, and high performance visualizations.
- Achieving Rapid Response Times in Large Online Services (YouTube) — Jeff Dean’s keynote at Velocity. He wrote … a lot of things for this. And now he’s into deep learning ….
Failure of Imagination, Meat Failure Mode, Grand Challenges, and Data Programming
- Maximum Happy Imagination (Matt Jones) — questioning the true vision of Marc Andreessen’s recent Twitter discourse on the great future that awaits us. His analogies run out in the 20th century when it comes to the political, social and economic implications of his maximum happy imagination.
- The Mirrortocracy — It’s astonishing how many of the people conducting interviews and passing judgement on the careers of candidates have had no training at all on how to do it well. Aside from their own interviews, they may not have ever seen one. I’m all for learning on your own but at least when you write a program wrong it breaks. Without a natural feedback loop, interviewing mostly runs on myth and survivor bias.
- Longitude Prize — six prize areas, Grand Challenge style, in clean flight, antibiotic resistance, dementia, food, water, and overcoming paralysis. Mysteriously none for library system that avoids DLL hell.
- The Re-Emergence of Datalog — Michael Fogus overviews Datalog and provides examples of how it is implemented and used in Datomic, Cascalog, and the Bacwn Clojure library. See also notes from the talk.
Available Data, Goal Setting, Real Tech, and Gamification Numbers
- Dynamo and BigTable — good preso overview of two approaches to solving availability and consistency in the event of server failure or network partition.
- Goals Gone Wild (PDF) — In this article, we argue that the beneficial effects of goal setting have been overstated and that systematic harm caused by goal setting has been largely ignored. We identify specific side effects associated with goal setting, including a narrow focus that neglects non-goal areas, a rise in unethical behavior, distorted risk preferences, corrosion of organizational culture, and reduced intrinsic motivation.
- Tech Isn’t All Brogrammers (Alexis Madrigal) — a reminder that there are real scientists and engineers in Silicon Valley working on problems considerably harder than selling ads and delivering pet food to one another. (via Brian Behlendorf)
- Numbers from 90+ Gamification Case Studies — cherry-picked anecdata for your business cases.
Many more companies want to highlight how they're using Apache Spark in production.
One of the trends we’re following closely at Strata is the emergence of vertical applications. As components for creating large-scale data infrastructures enter their early stages of maturation, companies are focusing on solving data problems in specific industries rather than building tools from scratch. Virtually all of these components are open source and have contributors across many companies. Organizations are also sharing best practices for building big data applications, through blog posts, white papers, and presentations at conferences like Strata.
These trends are particularly apparent in a set of technologies that originated from UC Berkeley’s AMPLab: the number of companies that are using (or plan to use) Spark in production has exploded over the last year. The surge in popularity of the Apache Spark ecosystem stems from the maturation of its individual open source components and the growing community of users. The tight integration of high-performance tools that address different problems and workloads, coupled with a simple programming interface (in Python, Java, and Scala), makes Spark one of the most popular projects in big data. The charts below show the amount of active development in Spark:
For the second year in a row, I’ve had the privilege of serving on the program committee for the Spark Summit. I’d like to highlight a few areas where Apache Spark is making inroads. I’ll focus on proposals from companies building applications on top of Spark.
Agile methodology brings flexibility to the EDW and offers ways to integrate open-source technologies with existing systems.
Data analysis, like other pursuits, is a balancing act. The rise of big data ratchets up the pressure on the traditional enterprise data warehouse (EDW) and associated software tools to handle rapidly evolving sets of new demands posed by the business. Companies want their EDW systems to be more flexible and more user friendly — without sacrificing processing speeds, data integrity, or overall reliability.
“The more data you give the business, the more questions they will ask,” says José Carlos Eiras, who has served as CIO at Kraft Foods, Philip Morris, General Motors, and DHL. “When you have big data, you have a lot of different questions, and suddenly you need an enterprise data warehouse that is very flexible.”
EDWs are remarkably powerful, but it takes considerable expertise and creativity to modify them on the fly. Adding new capabilities to the EDW generally requires significant investments of time and money. You can develop your own tools internally or purchase them from a vendor, but either way, it’s a hard slog.