ENTRIES TAGGED "Big Data"
New Math, Business Math, Summarising Text, Clipping Images
- Scientific Data Has Become So Complex, We Have to Invent New Math to Deal With It (Jennifer Ouellette) — Yale University mathematician Ronald Coifman says that what is really needed is the big data equivalent of a Newtonian revolution, on par with the 17th century invention of calculus, which he believes is already underway.
- Is Google Jumping the Shark? (Seth Godin) — Public companies almost inevitably seek to grow profits faster than expected, which means beyond the organic growth that comes from doing what made them great in the first place. In order to gain that profit, it’s typical to hire people and reward them for measuring and increasing profits, even at the expense of what the company originally set out to do. Eloquent redux.
- textteaser — open source text summarisation algorithm.
- Clipping Magic — Instantly create masks, cutouts, and clipping paths online.
Android Malware Numbers, Open Networking Hardware, Winning with Data, and DIY Pollution Sensor
- Android Malware Numbers — (Quartz) less than an estimated 0.001% of app installations on Android are able to evade the system’s multi-layered defenses and cause harm to users, based on Google’s analysis of 1.5B downloads and installs.
- Facebook Operations Chief Reveals Open Networking Plan — long interview about OCP’s network project. The specification that we are working on is essentially a switch that behaves like compute. It starts up, it has a BIOS environment to do its diagnostics and testing, and then it will look for an executable and go find an operating system. You point it to an operating system and that tells it how it will behave and what it is going to run. In that model, you can run traditional network operating systems, or you can run Linux-style implementations, you can run OpenFlow if you want. And on top of that, you can build your protocol sets and applications.
- How Red Bull Dominates F1 (Quartz) — answer: data, and lots of it.
- Ground-Level Air Pollution Sensor (Make) — neat sensor project from Make.
USB in Cars, Capture Presentations, Amazon Redshift, and Polytweeting
- Hyundia Replacing Cigarette Lighters with USB Ports (Quartz) — sign of the times. (via Julie Starr)
- Freeseer — free, open source, cross-platform application that captures or streams your desktop—designed for capturing presentations. Would you like freedom with your screencast?
- Amazon Redshift: What You Need to Know — good write-up of experience using Amazon’s column database.
- GroupTweet — Allow any number of contributors to Tweet from a group account safely and securely. (via Jenny Magiera)
Amen Break, MySQL Scale, Spooky Source, and Graph Analytics Engine
- The Amen Break (YouTube) — fascinating 20m history of the amen break, a handful of bars of drum solo from a forgotten 1969 song which became the origin of a huge amount of popular music from rap to jungle and commercials, and the contested materials at the heart of sample-based music. Remix it and weep. (via Beta Knowledge)
- The MySQL Ecosystem at Scale (PDF) — nice summary of how MySQL is used on massive users, and where the sweet spots have been found.
- Lab41 (Github) — open sourced code from a spook hacklab in Silicon Valley.
- Fanulus — open sourced Hadoop-based graph analytics engine for analyzing graphs represented across a multi-machine compute cluster. A breadth-first version of the graph traversal language Gremlin operates on graphs stored in the distributed graph database Titan, in any Rexster-fronted graph database, or in HDFS via various text and binary formats.
Drones Dismissed, Drones Denied, Passing PRISM, and Data Analysis and Mining
- UAV Offers of Assistance in Colorado Rebuffed by FEMA — we were told by FEMA that anyone flying drones would be arrested. [...] Civil Air Patrol and private aircraft were authorized to fly over the small town tucked into the base of Rockies. Unfortunately due to the high terrain around Lyons and large turn radius of manned aircraft they were flying well out of a useful visual range and didn’t employ cameras or live video feed to support the recovery effort. Meanwhile we were grounded on the Lyons high school football field with two Falcons that could have mapped the entire town in less than 30 minutes with another few hours to process the data providing a near real time map of the entire town.
- Texas Bans Some Private Use of Drones (DIY Drones) — growing move for govt to regulate drones.
- IETF PRISM-Proof Plans (Parity News) — Baker starts off by listing out the attack degree including he likes of information / content disclosure, meta-data analysis, traffic analysis, denial of service attacks and protocol exploits. The author than describes the different capabilities of an attacker and the ways in which an attack can be carried out – passive observation, active modification, cryptanalysis, cover channel analysis, lawful interception, Subversion or Coercion of Intermediaries among others.
- Data Mining and Analysis: Fundamental Concepts and Algorithms (PDF) — 650 pages on cluster, sequence mining, SVNs, and more. (via author’s page)
Remote Work, Raspberry Pi Code Machine, Low-Latency Data Processing, and Probabilistic Table Parsing
- Fog Creek’s Remote Work Policy — In the absence of new information, the assumption is that you’re producing. When you step outside the HQ work environment, you should flip that burden of proof. The burden is on you to show that you’re being productive. Is that because we don’t trust you? No. It’s because a few normal ways of staying involved (face time, informal chats, lunch) have been removed.
- MillWheel (PDF) — a framework for building low-latency data-processing applications that is widely used at Google. Users specify a directed computation graph and application code for individual nodes, and the system manages persistent state and the continuous ﬂow of records, all within the envelope of the framework’s fault-tolerance guarantees. From Google Research.
- Probabilistic Scraping of Plain Text Tables — the method leverages topological understanding of tables, encodes it declaratively into a mixed integer/linear program, and integrates weak probabilistic signals to classify the whole table in one go (at sub second speeds). This method can be used for any kind of classification where you have strong logical constraints but noisy data.
PaaS Vendors, Educational MMO, Changing Culture, Data Mythologies
- Amazon Compute Numbers (ReadWrite) — AWS offers five times the utilized compute capacity of each of its other 14 top competitors—combined. (via Matt Asay)
- MIT Educational MMO — The initial phase will cover topics in biology, algebra, geometry, probability, and statistics, providing students with a collaborative, social experience in a systems-based game world where they can explore how the world works and discover important scientific concepts. (via KQED)
- Changing Norms (Atul Gawande) — neither penalties nor incentives achieve what we’re really after: a system and a culture where X is what people do, day in and day out, even when no one is watching. “You must” rewards mere compliance. Getting to “X is what we do” means establishing X as the norm.
- The Mythologies of Big Data (YouTube) — Kate Crawford at UC Berkeley iSchool. The six months: ‘Big data are new’, ‘Big data is objective’, ‘Big data don’t discriminate’, ‘Big data makes cities smart’, ‘Big data is anonymous’, ‘You can opt out of big data’. (via Sam Kinsley)
Constant KV Store, Google Me, Learned Bias, and DRM-Stripping Lego Robot
- Sparkey — Spotify’s open-sourced simple constant key/value storage library, for read-heavy systems with infrequent large bulk inserts.
- The Truth of Fact, The Truth of Feeling (Ted Chiang) — story about what happens when lifelogs become searchable. Now with Remem, finding the exact moment has become easy, and lifelogs that previously lay all but ignored are now being scrutinized as if they were crime scenes, thickly strewn with evidence for use in domestic squabbles. (via BoingBoing)
- Algorithms Magnifying Misbehaviour (The Guardian) — when the training set embodies biases, the machine will exhibit biases too.
- Lego Robot That Strips DRM Off Ebooks (BoingBoing) — so. damn. cool. If it had been controlled by a C64, Cory would have hit every one of my geek erogenous zones with this find.
Big Diner, Fab Future, Browser Crypto, and STEM Crisis Questioned
- In Search of the Optimal Cheeseburger (Hilary Mason) — playing with NYC menu data. There are 5,247 cheeseburgers you can order in Manhattan. Her Ignite talk from Ignite NYC15.
- James Burke Predicting the Future — spoiler: massive disruption from nano-scale personal fabbing.
- The STEM Crisis is a Myth (IEEE Spectrum) — Every year U.S. schools grant more STEM degrees than there are available jobs. When you factor in H-1B visa holders, existing STEM degree holders, and the like, it’s hard to make a case that there’s a STEM labor shortage.
Bezos on Business, CS Ratios, Easier Hadoopery, and AWS CLI
- Bezos at the Post (Washington Post) — “All businesses need to be young forever. If your customer base ages with you, you’re Woolworth’s,” added Bezos.[...] “The number one rule has to be: Don’t be boring.” (via Julie Starr)
- How Carnegie-Mellon Increased Women in Computer Science to 42% — outreach, admissions based on potential not existing advantage, making CS classes practical from the start, and peer support.
- Summingbird (Github) — Twitter open-sourced library that lets you write streaming MapReduce programs that look like native Scala or Java collection transformations and execute them on a number of well-known distributed MapReduce platforms like Storm and Scalding.
- aws-cli (Github) — commandline for Amazon Web Services. (via AWS Blog)