"text analytics" entries

Four short links: 11 June 2014

Right to Mine, Summarising Microblogs, C Sucks for Stats, and Scanning Logfiles

  1. UK Copyright Law Permits Researchers to Data Mine — changes mean Copyright holders can require researchers to pay to access their content but cannot then restrict text or data mining for non-commercial purposes thereafter, under the new rules. However, researchers that use the text or data they have mined for anything other than a non-commercial purpose will be said to have infringed copyright, unless the activity has the consent of rights holders. In addition, the sale of the text or data mined by researchers is prohibited. The derivative works will be very interesting: if a university mines the journals and finds a new possibility for a Thing, and that possibility is verified experimentally, is that Thing the university’s to license commercially for profit?
  2. Efficient Online Summary of Microblogging Streams (PDF) — research paper. The algorithm we propose uses a word graph, along with optimization techniques such as decaying windows and pruning. It outperforms the baseline in terms of summary quality, as well as time and memory efficiency.
  3. Statistical Shortcomings in Standard Math Libraries — or “Why C Derivatives Are Not Popular With Statistical Scientists”. The following mathematical functions are necessary for implementing any rudimentary statistics application; and yet they are general enough to have many applications beyond statistics. I hereby propose adding them to the standard C math library and to the libraries which inherit from it. For purposes of future discussion, I will refer to these functions as the Elusive Eight.
  4. fail2ban — open source tool that scans logfiles for signs of malice, and triggers actions (e.g., iptables updates).
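The scan-then-act idea behind a tool like fail2ban can be sketched in a few lines. This is not fail2ban's own code or configuration: the log format, the `find_offenders` and `ban_command` helpers, and the threshold of three failures are all assumptions made up for illustration, and the iptables rule is only built, never executed.

```python
import re
from collections import Counter

# Hypothetical sshd-style log pattern, assumed only for this illustration.
FAILED_LOGIN = re.compile(r"Failed password for .* from (\d{1,3}(?:\.\d{1,3}){3})")


def find_offenders(lines, max_failures=3):
    """Return the set of IPs with at least max_failures failed logins."""
    failures = Counter()
    for line in lines:
        match = FAILED_LOGIN.search(line)
        if match:
            failures[match.group(1)] += 1
    return {ip for ip, count in failures.items() if count >= max_failures}


def ban_command(ip):
    """Build (but do not run) an iptables rule that drops traffic from ip."""
    return ["iptables", "-I", "INPUT", "-s", ip, "-j", "DROP"]


sample_log = [
    "sshd[101]: Failed password for root from 203.0.113.7 port 22 ssh2",
    "sshd[102]: Failed password for admin from 203.0.113.7 port 22 ssh2",
    "sshd[103]: Accepted password for alice from 198.51.100.4 port 22 ssh2",
    "sshd[104]: Failed password for invalid user bob from 203.0.113.7 port 22 ssh2",
    "sshd[105]: Failed password for alice from 198.51.100.4 port 22 ssh2",
]

for ip in sorted(find_offenders(sample_log)):
    print(" ".join(ban_command(ip)))
```

A real deployment would also want a time window and an unban step, which this sketch leaves out.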

Four short links: 26 May 2014

Statistical Sensitivity, Scientific Mining, Data Mining Books, and Two-Sided Smartphones

  1. Car Alarms and Smoke Alarms (Slideshare) — how to think about and draw the line between sensitivity and specificity.
  2. 101 Uses for Content Mining — between the list in the post and the comments from readers, it’s a good introduction to some of the value to be obtained from full-text structured and unstructured access to scientific research publications.
  3. 12 Free-as-in-beer Data Mining Books — for your next flight.
  4. Dual-Touch Smartphone Concept — brilliant design sketches for interactivity using the back of the phone as a touch-sensitive input device.
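For the sensitivity versus specificity trade-off in the first link, a small worked example may help. The counts below are invented for illustration and do not come from the slides: sensitivity is the share of real events that trigger the alarm, and specificity is the share of non-events that stay quiet.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of real events that the alarm catches."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of non-events for which the alarm stays silent."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts: both alarms catch 9 of 10 real events,
# but the "car alarm" also fires on half of the harmless ones.
car_alarm = {"tp": 9, "fn": 1, "tn": 50, "fp": 50}
smoke_alarm = {"tp": 9, "fn": 1, "tn": 99, "fp": 1}

for name, c in (("car alarm", car_alarm), ("smoke alarm", smoke_alarm)):
    sens = sensitivity(c["tp"], c["fn"])
    spec = specificity(c["tn"], c["fp"])
    print(f"{name}: sensitivity={sens:.2f} specificity={spec:.2f}")
```

Both alarms look equally good on sensitivity alone; only specificity shows why one of them gets ignored.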

Four short links: 12 November 2013

Coding for Unreliability, AirBnB JS Style, Category Theory, and Text Processing

  1. Quantitative Reliability of Programs That Execute on Unreliable Hardware (MIT) — As MIT’s press release put it: Rely simply steps through the intermediate representation, folding the probability that each instruction will yield the right answer into an estimation of the overall variability of the program’s output. (via Pete Warden)
  2. AirBNB’s Javascript Style Guide (Github) — A mostly reasonable approach to JavaScript.
  3. Category Theory for Scientists (MIT Courseware) — Scooby snacks for rationalists.
  4. Textblob — Python open source text processing library with sentiment analysis, PoS tagging, term extraction, and more.

Four short links: 30 September 2013

Google Code Analysis, Deep Learning, Front-End Workflow, and SICP in JS

  1. Steve Yegge on GROK (YouTube) — The Grok Project is an internal Google initiative to simplify the navigation and querying of very large program source repositories. We have designed and implemented a language-neutral, canonical representation for source code and compiler metadata. Our data production pipeline runs compiler clusters over all Google’s code and third-party code, extracting syntactic and semantic information. The data is then indexed and served to a wide variety of clients with specialized needs. The entire ecosystem is evolving into an extensible platform that permits languages, tools, clients and build systems to interoperate in well-defined, standardized protocols.
  2. Deep Learning for Semantic Analysis — When trained on the new treebank, this model outperforms all previous methods on several metrics. It pushes the state of the art in single sentence positive/negative classification from 80% up to 85.4%. The accuracy of predicting fine-grained sentiment labels for all phrases reaches 80.7%, an improvement of 9.7% over bag of features baselines. Lastly, it is the only model that can accurately capture the effect of contrastive conjunctions as well as negation and its scope at various tree levels for both positive and negative phrases.
  3. Fireshell — workflow tools and framework for front-end developers.
  4. SICP.js — lots of Structure and Interpretation of Computer Programs (the canonical text for higher-order programming) ported to Javascript.

Four short links: 29 August 2013

Semi-Structured Text, Bitcoin Built On, Cryptic C++, Kickstarter Wins

  1. textfsm — Python module which implements a template based state machine for parsing semi-formatted text. Originally developed to allow programmatic access to information returned from the command line interface (CLI) of networking devices. TextFSM was developed internally at Google and released under the Apache 2.0 licence for the benefit of the wider community.
  2. The Money is in the Bitcoin Protocol (Vikram Kumar) — some of the basics in this post as well as how people are thinking about using the Bitcoin protocol to do some very innovative things. MUST. READ.
  3. Parsing C++ is Literally Undecidable — any system with enough moving parts will generate eddies of chaotic behaviour, where the interactions between the components are unpredictable. (via Pete Warden)
  4. Kickstarter Raises 6x Indiegogo Money (Medium) — a reminder of the importance of network effects. Crowdfunding is the online auction site of the 2010s.

Four short links: 4 June 2013

Distributed Browser-Based Computation, Streaming Regex, Preventing SQL Injections, and SVM for Faster Deep Learning

  1. WeevilScout — browser app that turns your browser into a worker for distributed computation tasks. See the poster (PDF). (via Ben Lorica)
  2. sregex (Github) — A non-backtracking regex engine library for large data streams. See also slide notes from a YAPC::NA talk. (via Ivan Ristic)
  3. Bobby Tables — a guide to preventing SQL injections. (via Andy Lester)
  4. Deep Learning Using Support Vector Machines (Arxiv) — we are proposing to train all layers of the deep networks by backpropagating gradients through the top level SVM, learning features of all layers. Our experiments show that simply replacing softmax with linear SVMs gives significant gains on datasets MNIST, CIFAR-10, and the ICML 2013 Representation Learning Workshop’s face expression recognition challenge. (via Oliver Grisel)
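The core advice behind guides like Bobby Tables is to pass user input as a bound parameter rather than splicing it into the SQL string. A minimal sketch using Python's standard `sqlite3` module follows; the `students` table and the `add_student` helper are assumptions for illustration, not taken from the guide itself.

```python
import sqlite3


def add_student(conn, name):
    """Insert a name using a bound parameter, so it is treated as data only."""
    conn.execute("INSERT INTO students (name) VALUES (?)", (name,))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT)")

# Hostile-looking input: with string concatenation this could alter the query.
hostile_name = "Robert'); DROP TABLE students;--"

add_student(conn, "Alice")
add_student(conn, hostile_name)

# The table still exists and the hostile string is stored verbatim.
rows = [row[0] for row in conn.execute("SELECT name FROM students ORDER BY rowid")]
print(rows)
```

The `?` placeholder is sqlite3's syntax; other database drivers use different markers, so check the driver's documentation.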

Four short links: 27 May 2013

Search API, Cyberwar=Cyberbollocks, 4k Magic, and Geoparsing

  1. techu Search Server — Techu exposes a RESTful API for realtime indexing and searching with the Sphinx full-text search engine. We leverage Redis, Nginx and the Python Django framework to make searching easy to handle & flexible.
  2. In Defence of Digital Freedom — a member of the European Parliament’s piece on the risks to our online freedoms caused by framing computer security into cyberwarfare. Digital freedoms and fundamental rights need to be enforced, and not eroded in the face of vulnerabilities, attacks, and repression. In order to do so, essential and difficult questions on the implementation of the rule of law, historically place-bound by jurisdiction rooted in the nation-state, in the context of a globally connected world, need to be addressed. This is a matter for the EU as a global player, and should involve all of society. (via BoingBoing)
  3. Inside a 4k Demo — what it’s like to write an amazing demo with only 4k of code. (via Nelson Minar)
  4. CLAVIN — open source (Apache2) Java library for document geotagging and geoparsing that employs context-based geographic entity resolution. (via Pete Warden)

Four short links: 9 April 2013

Electric Monks, Moore's Law's Death Spiral, Trafficking Technology, and Product Management

  1. Automated Essay Grading To Come to EdX (NY Times) — shortly after we get software that writes stories for us, we get software to read them for us.
  2. AMD Calls End of Moore’s Law in Ten Years (ComputerWorld) — story based on this video, where Michio Kaku lays out the timeline for Moore’s Law’s wind-down and the spin-up of new technology.
  3. Addressing Human Trafficking Through Technology (danah boyd) — technologists love to make tech and then assert it’ll help people. Danah’s work on teens and now trafficking steers us to do what works, rather than what is showy or easiest.
  4. Product Management (Rowan Simpson) — hand this to anyone who asks what product management actually is. Excellent explanation.