"text analysis" entries
Smart Meter Snitches, Company Culture, Text Classification, and Live Face Substitution
- Smart Hacking for Privacy — it’s possible to mine smart power meter data (or even snoop on it) to learn what’s on the TV. Wow. (You can also watch the talk). (via Rob Inskeep)
- Conditioning Company Culture (Bryce Roberts) — a short read but thought-provoking. It’s easy to create mindless mantras, but I’ve seen the technique that Bryce describes and (when done well) it’s highly effective.
- hydrat (Google Code) — a declarative framework for text classification tasks.
- Dynamic Face Substitution (FlowingData) — Kyle McDonald and Arturo Castro play around with a face tracker and color interpolation to replace their own faces, in real-time, with those of celebrities such as Brad Pitt and Paris Hilton. Awesome. And creepy. Amen.
Panagiotis Ipeirotis on the phrases and formatting of effective product reviews.
How much is an Amazon review — good or bad — worth? Computer scientist and NYU professor Panagiotis Ipeirotis analyzed the text in thousands of Amazon reviews to find out.
Apple Factories, Open Source Spy Drones, Mail Files, and Text Topic Extraction
- Mr Daisey and the Apple Factor (This American Life) — episode looking at the claims of human rights problems in Apple’s Chinese factories.
- OpenPilot — open source UAVs with cameras. Yes, a DIY spy drone on autopilot. (via Jim Stogdill)
- mbox — more technical information than you ever thought you’d need, to be saved for the time when you have to parse mailbox files. It’s a nightmare. (via Hacker News)
- Maui (Google Code) — Maui automatically identifies main topics in text documents. Depending on the task, topics are tags, keywords, keyphrases, vocabulary terms, descriptors, index terms or titles of Wikipedia articles. GPLv3.
- Terrier IR — open source (Mozilla) text search engine, now with Hadoop support.
- s3ql — open source (GPLv3) Linux filesystem which stores its data on Google Storage, Amazon S3, or OpenStack. (via Adam Shand)
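On the mbox item above: if the day does come when you have to parse a mailbox file, Python's standard `mailbox` module is a reasonable first stop. The following is a minimal sketch, not taken from the linked document. The two sample messages, the file name, and the `subjects` helper are made up for illustration.

```python
import mailbox
import os
import tempfile

# Two toy messages in mbox format. Each message starts with a "From "
# separator line, which is where much of the format's pain comes from.
RAW = (
    "From alice@example.com Sat Jan  1 00:00:00 2011\n"
    "From: alice@example.com\n"
    "Subject: first\n"
    "\n"
    "Hello there.\n"
    "\n"
    "From bob@example.com Sun Jan  2 00:00:00 2011\n"
    "From: bob@example.com\n"
    "Subject: second\n"
    "\n"
    "Hi again.\n"
)


def subjects(path):
    """Return the Subject header of every message in the mbox file at path."""
    box = mailbox.mbox(path, create=False)
    try:
        return [msg["Subject"] for msg in box]
    finally:
        box.close()


# Write the sample to a temporary file (hypothetical name) and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.mbox")
    with open(path, "w", newline="\n") as f:
        f.write(RAW)
    found = subjects(path)

print(found)
```

Real mailboxes are messier than this sample: body lines that begin with "From " are escaped differently by different mbox variants, so treat the sketch as a starting point only.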
Text Analysis Bundle, Scala Probabilistic Modeling, Game Analytics, and Encouraging Writing
- Pattern — a BSD-licensed bundle of Python tools for data retrieval, text analysis, and data visualization. If you were going to get started with accessible data (Twitter, Google), the fundamentals of analysis (entity extraction, clustering), and some basic visualizations of graph relationships, you could do a lot worse than to start here.
- Factorie (Google Code) — Apache-licensed Scala library for a probabilistic modeling technique successfully applied to […] named entity recognition, entity resolution, relation extraction, parsing, schema matching, ontology alignment, latent-variable generative models, including latent Dirichlet allocation. The state-of-the-art big data analysis tools are increasingly open source, presumably because the value lies in their application not in their existence. This is good news for everyone with a new application.
- Playtomic — analytics as a service for gaming companies to learn what players actually do in their games. There aren’t many fields untouched by analytics.
- Write or Die — iPad app for writers where, if you don’t keep writing, it begins to delete what you wrote earlier. Good for production to deadlines; reflective editing and deep thought not included.
- Fuzzy String Matching in Python (Streamhacker) — useful if you’re to have a hope against the swelling dark forces powered by illiteracy and touchscreen keyboards.
- The Business of Illegal Data (Strata Conference) — fascinating presentation on criminal use of big data. “The more data you produce, the happier criminals are to receive and use it. Big data is big business for organized crime, which represents 15% of GDP.”
- Isarithmic Maps — an alternative to choropleths for geodata visualization.
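On the fuzzy string matching item above: a quick way to get a feel for the idea is Python's standard `difflib` module, which ranks candidates by a similarity ratio. This is a sketch using `difflib`, not necessarily the approach the Streamhacker post takes. The vocabulary and the `closest` helper are made up for illustration.

```python
import difflib

# A hypothetical vocabulary of correctly spelled terms.
VOCAB = ["classification", "visualization", "analytics", "extraction"]


def closest(word, vocabulary, cutoff=0.6):
    """Return the vocabulary entry most similar to word, or None.

    cutoff is the minimum similarity ratio (0.0 to 1.0) a candidate
    must reach to count as a match.
    """
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(closest("clasification", VOCAB))  # a typo with one dropped letter
print(closest("zzz", VOCAB))            # nothing similar enough
```

Raising `cutoff` makes the matcher stricter. Lowering it catches worse typos at the price of more false matches.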
Dispel Your Illusions, Simple Mac OS X Apps, Assisted Translation, and AutoTagging
- How to Dispel Your Illusions (NY Review of Books) — Freeman Dyson writing about Daniel Kahneman’s latest book. Only by understanding our cognitive illusions can we hope to transcend them.
- Appify-UI (github) — Create the simplest possible Mac OS X apps. Uses HTML5 for the UI. Supports scripting with anything and everything. (via Hacker News)
- Translation Memory (Etsy) — using Lucene/SOLR to help automate the translation of their UI. (via Twitter)
- Automatically Tagging Entities with Descriptive Phrases (PDF) — Microsoft Research paper on automated tagging. Under the hood it uses Map/Reduce and the Microsoft Dryad framework. (via Ben Lorica)
Quantified Learner, Text Extraction, Backup Flickr, and Multitouch UI Awesomeness
- Learning With Quantified Self — this CS grad student broke Jeopardy records using an app he built himself to quantify and improve his ability to answer Jeopardy questions in different categories. This is an impressive short talk and well worth watching.
- Evaluating Text Extraction Algorithms — The gold standard of both datasets was produced by human annotators. 14 different algorithms were evaluated in terms of precision, recall and F1 score. The results show that the best open source solution is the boilerpipe library. (via Hacker News)
- Parallel Flickr — tool for backing up your Flickr account. (Compare to one day of Flickr photos printed out)
- Quneo Multitouch Open Source MIDI and USB Pad (Kickstarter) — interesting to see companies using Kickstarter to seed interest in a product. This one looks a doozie: pads, sliders, rotary sensors, with LEDs underneath and open source drivers and SDK. Looks almost sophisticated enough to drive emacs :-)
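On the text extraction item above: precision, recall, and F1 are easy to compute once you have an extractor's output and a human-made gold standard. The sketch below assumes the simplest possible scoring, comparing the two as sets of tokens. The linked evaluation may well score things differently, and the sample strings are made up.

```python
def precision_recall_f1(extracted, gold):
    """Score extracted tokens against gold-standard tokens, both as sets."""
    extracted, gold = set(extracted), set(gold)
    true_pos = len(extracted & gold)  # tokens the extractor got right
    precision = true_pos / len(extracted) if extracted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical gold standard and extractor output.
gold = "the quick brown fox jumps".split()
extracted = "the quick brown dog".split()

p, r, f = precision_recall_f1(extracted, gold)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Here three of the four extracted tokens are correct (precision 0.75) and three of the five gold tokens were found (recall 0.60). F1 is the harmonic mean of the two.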
Internet Asthma Care, C Fulltext, Citizen Science, and Mozilla
- Cost-Effectiveness of Internet-Based Self-Management Compared with Usual Care in Asthma (PLoS ONE) — Internet-based self-management of asthma can be as effective as current asthma care and costs are similar.
- Apache Lucy — full-text search engine library written in C and targeted at dynamic languages. It is a “loose C” port of Apache Lucene™, a search engine library for Java.
- The Near Future of Citizen Science (Fiona Romeo) — near future of science is all about honing the division of labour between professionals, amateurs and bots. See Bryce’s bionic software riff. (via Matt Jones)
- Microsoft’s Patent Claims Against Android (Groklaw) — behold, citizen, the formidable might of Microsoft’s patents and how they justify a royalty from every Android device equal to that which you would owe if you built a Windows Mobile device: These Microsoft patents can be divided into several basic categories: (1) the ‘372 and ‘780 patents relate to web browsers; (2) the ‘551 and ‘233 patents relate to electronic document annotation and highlighting; (3) the ‘522 patent relates to resources provided by operating systems; (4) the ‘517 and ‘352 patents deal with compatibility with file names once employed by old, unused, and outmoded operating systems; (5) the ‘536 and ‘853 patents relate to simulating mouse inputs using non-mouse devices; and (6) the ‘913 patent relates to storing input/output access factors in a shared data structure. A shabby display of patent menacing.