ENTRIES TAGGED "text analytics"
Coding for Unreliability, AirBnB JS Style, Category Theory, and Text Processing
- Quantitative Reliability of Programs That Execute on Unreliable Hardware (MIT) — As MIT’s press release put it: Rely simply steps through the intermediate representation, folding the probability that each instruction will yield the right answer into an estimation of the overall variability of the program’s output. (via Pete Warden)
- Category Theory for Scientists (MIT Courseware) — Scooby snacks for rationalists.
- Textblob — Python open source text processing library with sentiment analysis, PoS tagging, term extraction, and more.
Google Code Analysis, Deep Learning, Front-End Workflow, and SICP in JS
- Steve Yegge on GROK (YouTube) — The Grok Project is an internal Google initiative to simplify the navigation and querying of very large program source repositories. We have designed and implemented a language-neutral, canonical representation for source code and compiler metadata. Our data production pipeline runs compiler clusters over all Google’s code and third-party code, extracting syntactic and semantic information. The data is then indexed and served to a wide variety of clients with specialized needs. The entire ecosystem is evolving into an extensible platform that permits languages, tools, clients and build systems to interoperate in well-defined, standardized protocols.
- Deep Learning for Semantic Analysis — When trained on the new treebank, this model outperforms all previous methods on several metrics. It pushes the state of the art in single sentence positive/negative classification from 80% up to 85.4%. The accuracy of predicting fine-grained sentiment labels for all phrases reaches 80.7%, an improvement of 9.7% over bag of features baselines. Lastly, it is the only model that can accurately capture the effect of contrastive conjunctions as well as negation and its scope at various tree levels for both positive and negative phrases.
- Fireshell — workflow tools and framework for front-end developers.
Semi-Structured Text, Bitcoin Built On, Cryptic C++, Kickstarter Wins
- textfsm — Python module which implements a template based state machine for parsing semi-formatted text. Originally developed to allow programmatic access to information returned from the command line interface (CLI) of networking devices. TextFSM was developed internally at Google and released under the Apache 2.0 licence for the benefit of the wider community.
- The Money is in the Bitcoin Protocol (Vikram Kumar) — some of the basics in this post as well as how people are thinking about using the Bitcoin protocol to do some very innovative things. MUST. READ.
- Parsing C++ is Literally Undecidable — any system with enough moving parts will generate eddies of chaotic behaviour, where the interactions between the components are unpredictable. (via Pete Warden)
- Kickstarter Raises 6x Indiegogo Money (Medium) — a reminder of the importance of network effects. Crowdfunding is the online auction side of the 2010s.
Distributed Browser-Based Computation, Streaming Regex, Preventing SQL Injections, and SVM for Faster Deep Learning
- WeevilScout — browser app that turns your browser into a worker for distributed computation tasks. See the poster (PDF). (via Ben Lorica)
- sregex (Github) — A non-backtracking regex engine library for large data streams. See also slide notes from a YAPC::NA talk. (via Ivan Ristic)
- Bobby Tables — a guide to preventing SQL injections. (via Andy Lester)
- Deep Learning Using Support Vector Machines (Arxiv) — we are proposing to train all layers of the deep networks by backpropagating gradients through the top level SVM, learning features of all layers. Our experiments show that simply replacing softmax with linear SVMs gives significant gains on datasets MNIST, CIFAR-10, and the ICML 2013 Representation Learning Workshop’s face expression recognition challenge. (via Oliver Grisel)
Search API, Cyberwar=Cyberbollocks, 4k Magic, and Geoparsing
- techu Search Server — Techu exposes a RESTful API for realtime indexing and searching with the Sphinx full-text search engine. We leverage Redis, Nginx and the Python Django framework to make searching easy to handle & flexible.
- In Defence of Digital Freedom — a member of the European Parliament’s piece on the risks to our online freedoms caused by framing computer security into cyberwarfare. Digital freedoms and fundamental rights need to be enforced, and not eroded in the face of vulnerabilities, attacks, and repression. In order to do so, essential and difficult questions on the implementation of the rule of law, historically place-bound by jurisdiction rooted in the nation-state, in the context of a globally connected world, need to be addressed. This is a matter for the EU as a global player, and should involve all of society. (via BoingBoing)
- Inside a 4k Demo — what it’s like to write an amazing demo with only 4k of code. (via Nelson Minar)
- CLAVIN — open source (Apache2) Java library for document geotagging and geoparsing that employs context-based geographic entity resolution. (via Pete Warden)