ENTRIES TAGGED "Big Data"
- SAMOA — Yahoo!’s distributed streaming machine learning (ML) framework that contains a programming abstraction for distributed streaming ML algorithms. (via Introducing SAMOA)
- madlib — an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine-learning methods for structured and unstructured data.
- Data Portraits: Connecting People of Opposing Views — Yahoo! Labs research to break the filter bubble. Connect people who disagree on issue X (e.g., abortion) but who agree on issue Y (e.g., Latin American interventionism), and present the differences and similarities visually (they used wordclouds). Our results suggest that organic visualisation may revert the negative effects of providing potentially sensitive content. (via MIT Technology Review)
- Disguise Detection — using Raspberry Pi, Arduino, and Python.
Squid in the Dark, Beautiful Automation, Fan Criticism, and Petabyte Queries
- Living Light — 3D printed cephalopods filled with bioluminescent bacteria. PAGING CORY DOCTOROW, YOUR ORGASMATRON HAS ARRIVED. (via Sci Blogs)
- Repacking Lego Batteries with a CNC Mill — check out the video. Patrick programmed a CNC machine to drill out the rivets holding the Mindstorms battery pack together. Coding away a repetitive task like this is gorgeous to see at every scale. We don’t have to teach our kids a particular programming language, but they should know how to automate cruft.
- My Thoughts on Google+ (YouTube) — when your fans make hatey videos like this one protesting Google putting the pig of Google Plus onto the lipstick that was YouTube, you are Doin’ It Wrong.
- Presto: Interacting with Petabytes of Data at Facebook — a distributed SQL query engine optimized for ad-hoc analysis at interactive speed. It supports standard ANSI SQL, including complex queries, aggregations, joins, and window functions. For details, see the Facebook post about its launch.
Time Series Database, Cluster Schedulers, Structural Search-and-Replace, and TV Data
- Influx DB — open-source, distributed, time series, events, and metrics database with no external dependencies.
- Omega (PDF) — ﬂexible, scalable schedulers for large compute clusters. From Google Research.
- Amazon Mines Its Data Trove To Bet on TV’s Next Hit (WSJ) — Amazon produced about 20 pages of data detailing, among other things, how much a pilot was viewed, how many users gave it a 5-star rating and how many shared it with friends.
Flying Robot, State of Cyberspace, H.264, and Principal Component Analysis
- Insect-Inspired Collision-Resistant Robot — clever hack to make it stable despite bouncing off things.
- The Battle for Power on the Internet (Bruce Schneier) — the state of cyberspace. [M]ost of the time, a new technology benefits the nimble first. [...] In other words, there will be an increasing time period during which nimble distributed powers can make use of new technologies before slow institutional powers can make better use of those technologies.
- Cisco’s H.264 Good News (Brendan Eich) — Cisco is paying the license fees for a particular implementation of H.264 to be used in open source software, enabling it to be the basis of web streaming video across all browsers (even the open source ones). It’s not as ideal a solution as it might sound.
- Principal Component Analysis for Dummies — This post will give a very broad overview of PCA, describing eigenvectors and eigenvalues (which you need to know about to understand it) and showing how you can reduce the dimensions of data using PCA. As I said it’s a neat tool to use in information theory, and even though the maths is a bit complicated, you only need to get a broad idea of what’s going on to be able to use it effectively.
- Android Guides — lots of info on coding for Android.
- Statistics Done Wrong — learn from these failure modes. Not medians or means. Modes.
- Streaming, Sketching, and Sufficient Statistics (YouTube) — how to process huge data sets as they stream past your CPU (e.g., those produced by sensors). (via Ben Lorica)
Digital Citizenship, Berg Cloud, Data Warehouse, and The Spying Iron
- Mozilla Web Literacy Standard — things you should be able to do if you’re to be trusted to be on the web unsupervised. (via BoingBoing)
- Berg Cloud Platform — hardware (shield), local network, and cloud glue. Caution: magic ahead!
- Shark — a large-scale data warehouse system for Spark designed to be compatible with Apache Hive. It can execute Hive QL queries up to 100 times faster than Hive without any modification to the existing data or queries. Shark supports Hive’s query language, metastore, serialization formats, and user-defined functions, providing seamless integration with existing Hive deployments and a familiar, more powerful option for new ones. (via Strata)
- The Malware of Things — a technician opening up an iron included in a batch of Chinese imports to find a “spy chip” with what he called “a little microphone”. Its correspondent said the hidden devices were mostly being used to spread viruses, by connecting to any computer within a 200m (656ft) radius which were using unprotected Wi-Fi networks.
Publishing Bad Research, Reproducing Research, DIY Police Scanner, and Inventing the Future
- Science Not as Self-Correcting As It Thinks (Economist) — REALLY good discussion of the shortcomings in statistical practice by scientists, peer-review failures, and the complexities of experimental procedure and fuzziness of what reproducibility might actually mean.
- Reproducibility Initiative Receives Grant to Validate Landmark Cancer Studies — The key experimental findings from each cancer study will be replicated by experts from the Science Exchange network according to best practices for replication established by the Center for Open Science through the Center’s Open Science Framework, and the impact of the replications will be tracked on Mendeley’s research analytics platform. All of the ultimate publications and data will be freely available online, providing the first publicly available complete dataset of replicated biomedical research and representing a major advancement in the study of reproducibility of research.
- $20 SDR Police Scanner — using software-defined radio to listen to the police band.
- Reimagine the Chemistry Set — $50k prize in contest to design a “chemistry set” type kit that will engage kids as young as 8 and inspire people who are 88. We’re looking for ideas that encourage kids to explore, create, build and question. We’re looking for ideas that honor kids’ curiosity about how things work. Backed by the Moore Foundation and Society for Science and the Public.
New Math, Business Math, Summarising Text, Clipping Images
- Scientific Data Has Become So Complex, We Have to Invent New Math to Deal With It (Jennifer Ouellette) — Yale University mathematician Ronald Coifman says that what is really needed is the big data equivalent of a Newtonian revolution, on par with the 17th century invention of calculus, which he believes is already underway.
- Is Google Jumping the Shark? (Seth Godin) — Public companies almost inevitably seek to grow profits faster than expected, which means beyond the organic growth that comes from doing what made them great in the first place. In order to gain that profit, it’s typical to hire people and reward them for measuring and increasing profits, even at the expense of what the company originally set out to do. Eloquent redux.
- textteaser — open source text summarisation algorithm.
- Clipping Magic — Instantly create masks, cutouts, and clipping paths online.
Android Malware Numbers, Open Networking Hardware, Winning with Data, and DIY Pollution Sensor
- Android Malware Numbers — (Quartz) less than an estimated 0.001% of app installations on Android are able to evade the system’s multi-layered defenses and cause harm to users, based on Google’s analysis of 1.5B downloads and installs.
- Facebook Operations Chief Reveals Open Networking Plan — long interview about OCP’s network project. The specification that we are working on is essentially a switch that behaves like compute. It starts up, it has a BIOS environment to do its diagnostics and testing, and then it will look for an executable and go find an operating system. You point it to an operating system and that tells it how it will behave and what it is going to run. In that model, you can run traditional network operating systems, or you can run Linux-style implementations, you can run OpenFlow if you want. And on top of that, you can build your protocol sets and applications.
- How Red Bull Dominates F1 (Quartz) — answer: data, and lots of it.
- Ground-Level Air Pollution Sensor (Make) — neat sensor project from Make.