- bleve — A modern text indexing library for Go.
- Scientific Consensus Has A Bad Reputation—And Doesn’t Deserve It (Ars Technica) — a lovely explanation of how informal consensus works in science. NB for anyone building social software which attempts to formalise and automate consensus.
- TiVo Mega — 24TB of RAID storage, six tuners for capturing broadcasts. Which is rather like building the International Space Station and then hitching it to six horses for launch. Who at this point would make a $5k bet that everything you want to see on a TV will be broadcast by a cable company?
- runswift — an in-browser client for compiling and running basic Swift functionality.
New Math, Business Math, Summarising Text, Clipping Images
- Scientific Data Has Become So Complex, We Have to Invent New Math to Deal With It (Jennifer Ouellette) — Yale University mathematician Ronald Coifman says that what is really needed is the big data equivalent of a Newtonian revolution, on par with the 17th century invention of calculus, which he believes is already underway.
- Is Google Jumping the Shark? (Seth Godin) — Public companies almost inevitably seek to grow profits faster than expected, which means beyond the organic growth that comes from doing what made them great in the first place. In order to gain that profit, it’s typical to hire people and reward them for measuring and increasing profits, even at the expense of what the company originally set out to do. Eloquent redux.
- textteaser — open source text summarisation algorithm.
- Clipping Magic — Instantly create masks, cutouts, and clipping paths online.
How illustrations and a clear path can enhance a story.
A clear reading path isn't always a bad thing. Here's an example where imagery advances the narrative and guides the reader along a defined trajectory.
Two examples of how digital images and associated text can stick together.
The fluidity of digital content occasionally sends images in one direction and text in another. Here's a look at two design experiments that keep digital assets together.
Regular Expressions, Mac Git, Open Source Patents, and Pepys Lessons
- Rubular — a way to write and test regular expressions interactively. Very cool. (via Adam Fields)
- gitx — OS X UI for Git. (via Marc Hedlund)
- Open Source Critical to Competition (Simon Phipps) — DOJ and German Federal Cartel Office see danger for open source in Novell’s patents being acquired by a consortium of Oracle, Microsoft, Apple, and EMC (fancy!) and are taking steps to ensure open source is protected.
- My Talk about Samuel Pepys’s Diary as an Online Story (Phil Gyford) — I love the ways Phil has stretched and repurposed the web’s affects for storytelling. Listen to this talk. (via BoingBoing)
Stream Processing, Semantic Web, Location Services, and PDF Extraction
- S4 — S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data. Open-sourced (Apache license) by Yahoo!.
- RDF and Semantic Web: Can We Reach Escape Velocity? (PDF) — spot-on presentation from the data.gov.uk linked data advisor. It nails, clearly and in only 12 slides, why there’s still resistance to linked data uptake and what should happen to change this. Amen! (via Simon St Laurent)
- Pew Internet Report on Location-based Services — 10% of online Hispanics use these services – significantly more than online whites (3%) or online blacks (5%).
- Slate — Python library for extracting text from PDFs easily.
Amazon Margins, Crowdsourced Science, Data Tool Opensourced, Document Splitting
- AWS: Forget the Revenue, Did You See the Margins? (RedMonk) — According to UBS, Amazon Web Services gross margins for the years 2006 through 2014 are 47%, 48%, 48%, 49%, 49%, 50%, 50.5%, 51%, 53%. (These are analyst projections, so take them with a grain of salt, but those are some sweet margins if they’re even close to accurate.)
- Science Pipes — an environment in which students, educators, citizens, resource managers, and scientists can create and share analyses and visualizations of biodiversity data. It is built to support inquiry-based learning, allowing analysis results and visualizations to be dynamically incorporated into web sites (e.g. blogs) for dissemination and consumption beyond SciencePipes.org itself. (via mikeloukides on Twitter)
- ScraperWiki Source Code — AGPL-licensed source to ScraperWiki, a tool for data storage, cleaning, search, visualization, and export.
- Docsplit — a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages…)
Statistical Jeopardy Wins, Mobile Taxonomy, Geodata Mystery, and Machine Learning Blog
- What is IBM’s Watson? (NY Times) — IBM joining the big data machine learning race, and hatching a Blue Gene system that can answer Jeopardy questions. Does good, not great, and is getting better.
- Google Lays Out its Mobile Strategy (InformationWeek) — notable to me for this: Rechis said that Google breaks down mobile users into three behavior groups: (A) “Repetitive now”, (B) “Bored now”, and (C) “Urgent now”. A useful way to look at it. (via Tim)
- BP GIS and the Mysteriously Vanishing Letter — intrigue in the geodata world. This post makes it sound as though cleanup data is going into a box behind BP’s firewall, and the folks who said “um, the government should be the depot, because it needs to know it has a guaranteed-untampered and guaranteed-able-to-access copy of this data” were fired. For more info, including on the data that is available, see the geowanking thread.
- Streamhacker — a blog talking about text mining and other good things, with NLTK code you can run. (via heraldxchaos on Delicious)