- Steve Yegge on GROK (YouTube) — The Grok Project is an internal Google initiative to simplify the navigation and querying of very large program source repositories. We have designed and implemented a language-neutral, canonical representation for source code and compiler metadata. Our data production pipeline runs compiler clusters over all Google’s code and third-party code, extracting syntactic and semantic information. The data is then indexed and served to a wide variety of clients with specialized needs. The entire ecosystem is evolving into an extensible platform that permits languages, tools, clients and build systems to interoperate in well-defined, standardized protocols.
- Deep Learning for Semantic Analysis — When trained on the new treebank, this model outperforms all previous methods on several metrics. It pushes the state of the art in single sentence positive/negative classification from 80% up to 85.4%. The accuracy of predicting fine-grained sentiment labels for all phrases reaches 80.7%, an improvement of 9.7% over bag of features baselines. Lastly, it is the only model that can accurately capture the effect of contrastive conjunctions as well as negation and its scope at various tree levels for both positive and negative phrases.
- Fireshell — workflow tools and framework for front-end developers.
Google Code Analysis, Deep Learning, Front-End Workflow, and SICP in JS
Google's Data Centers, Top Engineers, Hiring, and Git Explained
- Google Has Spent 21 Billion on Data Centers — The company invested a record $1.6 billion in its data centers in the second quarter of 2013. Puts my impulse-purchased second external hard-drive into context, doesn’t it honey?
- 10x Engineer (Shanley) — in which the idea that it’s scientifically shown that some engineers are innately 10x others is given a rough and vigorous debunking.
- How to Hire — great advice, including “Poaching is the titty twister of Silicon Valley relationships”.
- Think Like a Git — a guide to git, for the perplexed.
Fanout Architectures, In-Browser Emulation, Paean to Programmability, and Social Hardware
- Achieving Rapid Response Times in Large Online Services (PDF) — slides from a talk by Jeff Dean on fanout architectures. (via Alex Dong)
- Go Ahead, Mess with Texas Instruments (The Atlantic) — School typically assumes that answers fall neatly into categories of “right” and “wrong.” As a conventional tool for computing “right” answers, calculators often legitimize this idea; the calculator solves problems, gives answers. But once an endorsed, conventional calculator becomes a subversive, programmable computer it destabilizes this polarity. Programming undermines the distinction between “right” and “wrong” by emphasizing the fluidity between the two. In programming, there is no “right” answer. Sure, a program might not compile or run, but making it offers multiple pathways to success, many of which are only discovered through a series of generative failures. Programming does not reify “rightness;” instead, it orients the programmer toward intentional reading, debugging, and refining of language to ensure clarity.
- When A Spouse Puts On Google Glass (NY Times) — Google Glass made me realize how comparably social mobile phones are. […] People gather around phones to watch YouTube videos or look at a funny tweet together or jointly analyze a text from a friend. With Glass, there was no such sharing.
Flexible Layouts, Web Components, Distributed SQL Database, and Reverse-Engineering Dropbox Client
- intention.js — manipulates the DOM via HTML attributes. The methods for manipulation are placed with the elements themselves, so flexible layouts don’t seem so abstract and messy.
- F1: A Distributed SQL Database That Scales — a distributed relational database system built at Google to support the AdWords business. F1 is a hybrid database that combines high availability, the scalability of NoSQL systems like Bigtable, and the consistency and usability of traditional SQL databases. F1 is built on Spanner, which provides synchronous cross-datacenter replication and strong consistency. Synchronous replication implies higher commit latency, but we mitigate that latency by using a hierarchical schema model with structured data types and through smart application design. F1 also includes a fully functional distributed SQL query engine and automatic change tracking and publishing.
- Looking Inside The (Drop)Box (PDF) — This paper presents new and generic techniques, to reverse engineer frozen Python applications, which are not limited to just the Dropbox world. We describe a method to bypass Dropbox’s two factor authentication and hijack Dropbox accounts. Additionally, generic techniques to intercept SSL data using code injection techniques and monkey patching are presented. (via Tech Republic)
Mobile Image Cache, Google on Net Neutrality, Future of Programming, and PSD Files in Ruby
- How to Easily Resize and Cache Images for the Mobile Web (Pete Warden) — I set up a server running the excellent ImageProxy open-source project, and then I placed a Cloudfront CDN in front of it to cache the results. (a how-to covering the tricksy bits)
- Google’s Position on Net Neutrality Changes? (Wired) — At issue is Google Fiber’s Terms of Service, which contains a broad prohibition against customers attaching “servers” to its ultrafast 1 Gbps network in Kansas City. Google wants to ban the use of servers because it plans to offer a business class offering in the future. […] In its response [to a complaint], Google defended its sweeping ban by citing the very ISPs it opposed through the years-long fight for rules that require broadband providers to treat all packets equally.
- The Future of Programming (Bret Victor) — gorgeous slides, fascinating talk, and this advice from Alan Kay: I think the trick with knowledge is to “acquire it, and forget all except the perfume” — because it is noisy and sometimes drowns out one’s own “brain voices”. The perfume part is important because it will help find the knowledge again to help get to the destinations the inner urges pick.
- psd.rb — Ruby code for reading PSD files (MIT licensed).
Retreading old topics can be a powerful source of epiphany, sometimes more so than simple extra-box thinking. I was a computer science student, of course I knew statistics. But my recent years as a NoSQL (or better stated: distributed systems) junkie have irreparably colored my worldview, filtering every metaphor with a tinge of information management.
Lounging on a half-world plane ride has its benefits, namely, the opportunity to read. Most of my Delta flight from Tel Aviv back home to Portland lacked both wifi and (in my case) a workable laptop power source. So instead, I devoured Nate Silver’s book, The Signal and the Noise. When Nate reintroduced me to the concept of statistical overfit, and relatedly underfit, I could not help but consider these cases in light of the modern problem of distributed data management, namely, operators (you may call these operators DBAs, but please, not to their faces).
When collecting information, be it for a psychological profile of chimp mating rituals, or plotting datapoints in search of the Higgs Boson, the ultimate goal is to find some sort of usable signal, some trend in the data. Not every point is useful, and in fact, any individual could be downright abnormal. This is why we need several points to spot a trend. The world rarely gives us anything clearer than a jumble of anecdotes. But plotted together, occasionally a pattern emerges. This pattern, if repeatable and useful for prediction, becomes a working theory. This is science, and is generally considered a good method for making decisions.
On the other hand, when lacking experience, we tend to over value the experience of others when we assume they have more. This works in straightforward cases, like learning to cook a burger (watch someone make one, copy their process). This isn’t so useful as similarities diverge. Watching someone make a cake won’t tell you much about the process of crafting a burger. Folks like to call this cargo cult behavior.
How Fit are You, Bro?
You need to extract useful information from experience (which I’ll use the math-y sounding word datapoints). Having a collection of datapoints to choose from is useful, but that’s only one part of the process of decision-making. I’m not speaking of a necessarily formal process here, but in the case of database operators, merely a collection of experience. Reality tends to be fairly biased toward facts (despite the desire of many people for this to not be the case). Given enough experience, especially if that experience is factual, we tend to make better and better decisions more inline with reality. That’s pretty much the essence of prediction. Our mushy human brains are more-or-less good at that, at least, better than other animals. It’s why we have computers and Everybody Loves Raymond, and my cat pees in a box.
Imagine you have a sufficient amount of relevant datapoints that you can plot on a chart. Assuming the axes have any relation to each other, and the data is sound, a trend may emerge, such as a line, or some other bounding shape. A signal is relevant data that corresponds to the rules we discover by best fit. Noise is everything else. It’s somewhat circular sounding logic, and it’s really hard to know what is really a signal. This is why science is hard, and so is choosing a proper database. We’re always checking our assumptions, and one solid counter signal can really be disastrous for a model. We may have been wrong all along, missing only enough data. As Einstein famously said in response to the book 100 Authors Against Einstein: “If I were wrong, then one would have been enough!”
Database operators (and programmers forced to play this role) must make predictions all the time, against a seemingly endless series of questions. How much data can I handle? What kind of latency can I expect? How many servers will I need, and how much work to manage them?
So, like all decision making processes, we refer to experience. The problem is, as our industry demands increasing scale, very few people actually have much experience managing giant scale systems. We tend to draw our assumptions from our limited, or biased smaller scale experience, and extrapolate outward. The theories we then tend to concoct are not the optimal fit that we desire, but instead tend to be overfit.
Overfit is when we have a limited amount of data, and overstate its general implications. If we imagine a plot of likely failure scenarios against a limited number of servers, we may be tempted to believe our biggest odds of failure are insufficient RAM, or disk failure. After all, my network has never given me problems, but I sure have lost a hard drive or two. We take these assumptions, which are only somewhat relevant to the realities of scalable systems and divine some rules for ourselves that entirely miss the point.
In a real distributed system, network issues tend to consume most of our interest. Single-server consistency is a solved problem, and most (worthwhile) distributed databases have some sense of built in redundancy (usually replication, the root of all distributed evil).
The App Store model has increased the uncertainty of the software release process
The recent unavailability of the Apple Developer’s Portal just underscores how increasingly dependent developers have become on third parties during the software lifecycle. For those who are not following the fun and games, the developer.apple.com sites, which include much of the functionality needed to develop Mac and iOS applications, has been unavailable for more than a week as of this writing. Although iTunes Connect, the portal used to actually deploy apps to the App Stores, has remained available, the remainder of the site territory has been off-limits. This is all thanks to a security intrusion (evidently by an over-zealous researcher.)
The App Store model has fundamentally changed how software is distributed, mostly for the better (IMHO), but it has also removed some of the control of the release process from the hands of the developers and companies they work for. As I have spelled out previously in my book on iOS enterprise development, the fact that Apple has the final say on if and when software goes into the store has required more conservative release timelines. If you want to release on the first of September, you need to count back at least two weeks for “gold master”, because you need to upload the app, potentially go through a round of rejection from Apple, and then upload a fixed version.
Android apps don’t suffer from this lag, because most of the Android stores don’t do any significant checking of the applications uploaded to them. The Devil’s Deal that Apple developers have made with Apple is that in return for the longer wait time to get apps in the store (and having to follow Apple’s rules), they get a de facto seal of approval from Apple. In other words, it is assumed that apps in the iTunes store are more stringently policed and less likely to crash or do harm (deliberately or else-wise.)
The current downtime has brought that deal into question, however. Suddenly, developers who need new provisioning certificates, passbook certificates, or push notification certificates find themselves with nowhere to go. Even if iTunes Connect is available, it doesn’t do you any good if you can’t get a distribution certificate to sign your app for the store. I’m sure that there are developers at this moment who have had their finely tuned release strategies thrown into disarray by the in-availability of the developer portal.
Being essentially at the mercy of Apple’s whims (or Google’s, for that matter) can’t be a pleasant sensation for a company or individual trying to get a new piece of software out the door. The question that the developer community will have to answer is if the benefits of the App Store model make it worth the hassles, in the long run.
Rules of the Internet, Bigness of the Data, Wifi ADCs, and Google Flirts with Client-Side Encryption
- Ten Rules of the Internet (Anil Dash) — they’re all candidates for becoming “Dash’s Law”. I like this one the most: When a company or industry is facing changes to its business due to technology, it will argue against the need for change based on the moral importance of its work, rather than trying to understand the social underpinnings.
- Data Storage by Vertical (Quartz) — The US alone is home to 898 exabytes (1 EB = 1 billion gigabytes)—nearly a third of the global total. By contrast, Western Europe has 19% and China has 13%. Legally, much of that data itself is property of the consumers or companies who generate it, and licensed to companies that are responsible for it. And in the US—a digital universe of 898 exabytes (1 EB = 1 billion gigabytes)—companies have some kind of liability or responsibility for 77% of all that data.
- x-OSC — a wireless I/O board that provides just about any software with access to 32 high-performance analogue/digital channels via OSC messages over WiFi. There is no user programmable firmware and no software or drivers to install making x-OSC immediately compatible with any WiFi-enabled platform. All internal settings can be adjusted using any web browser.
- Google Experimenting with Encrypting Google Drive (CNet) — If that’s the case, a government agency serving a search warrant or subpoena on Google would be unable to obtain the unencrypted plain text of customer files. But the government might be able to convince a judge to grant a wiretap order, forcing Google to intercept and divulge the user’s login information the next time the user types it in. Advertising depends on the service provider being able to read your data. Either your Drive’s contents aren’t valuable to Google advertising, or it won’t be a host-resistant encryption process.
Microvideos for MIcrohelp, Organic Search, Probabilistic Programming, and Cluster Management
- How to Make Help Microvideos For Your Site (Alex Holovaty) — Instead of one monolithic video, we decided to make dozens of tiny, five-second videos separately demonstrating features.
- How Google is Killing Organic Search — 13% of the real estate is organic results in a search for “auto mechanic”, 7% for “italian restaurant”, 0% if searching on an iPhone where organic results are four page scrolls away. SEO Book did an extensive analysis of just how important the top left of the page, previously occupied by organic results actually is to visitors. That portion of the page is now all Google. (via Alex Dong)
- Church — probabilistic programming language from MIT, with tutorials. (via Edd Dumbill)
- mesos — a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. It can run Hadoop, MPI, Hypertable, Spark (a new framework for low-latency interactive and iterative jobs), and other applications. Mesos is open source in the Apache Incubator. (via Ben Lorica)
OSCON 2013 Speaker Series
Note: Amy Unruh, Google Cloud Platform Developer Relations, is just one of the many fantastic speakers we have at OSCON this year. If you are interested in attending to check out Amy’s talk or the many other cool sessions, click over to the OSCON website where you can use the discount code OS13PROG to get 20% off your registration fee.
At this year’s Google I/O, we launched the PHP runtime for Google App Engine, part of the Google Cloud Platform. App Engine is a service that lets you build web apps using the same scalable infrastructure that powers many of Google’s own applications. With App Engine, there are no servers to maintain; you just upload your application, and it’s ready to go.
App Engine’s services support and simplify many aspects of app development. One of those services is Task Queues, which lets you easily add asynchronous background processing to your PHP app, and allows you to simultaneously make your applications more responsive, more reliable, and more scalable.
The App Engine Task Queue service allows your application to define tasks, add them to a queue, and then use the queue to process them asynchronously, in the background. App Engine automatically scales processing capacity to match your queue configuration and processing volume. You define a Task by specifying the application-specific URL of a handler for the task, along with (optionally) parameters or a payload for the task, and other settings, then add it to a Task Queue.