"graph" entries

Semi-automatic method for grading a million homework assignments

Organize solutions into clusters and “force multiply” feedback provided by instructors

One of the hardest things about teaching a large class is grading exams and homework assignments. In my teaching days a “large class” was only in the few hundreds (still a challenge for the TAs and instructor). But in the age of MOOCs, classes with a few (hundred) thousand students aren’t unusual.

Researchers at Stanford recently combed through over one million homework submissions from a large MOOC class offered in 2011. Students in the machine-learning course submitted programming code for assignments that consisted of several small programs (the typical submission was about 16 lines of code). While over 120,000 students enrolled, only about 10,000 completed all homework assignments (about 25,000 submitted at least one assignment).

The researchers were interested in figuring out ways to ease the burden of grading the large volume of homework submissions. The premise was that by sufficiently organizing the “space of possible solutions”, instructors could provide feedback on a few submissions, and that feedback could then be propagated to the rest.
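
The "organize, then propagate" premise can be illustrated with a toy sketch. This is not the researchers' method, which the excerpt does not describe; it assumes a deliberately crude notion of similarity (two submissions match if they are identical after stripping whitespace and `#` comments), and the function names are hypothetical.

```python
from collections import defaultdict


def normalize(code):
    """Crude canonical form: drop '#' comments, blank lines, and spacing differences.

    Assumption for illustration only: '#' starts a comment and never appears
    inside a string literal.
    """
    lines = []
    for line in code.splitlines():
        line = line.split("#", 1)[0]      # strip trailing comment
        line = " ".join(line.split())     # collapse runs of whitespace
        if line:
            lines.append(line)
    return "\n".join(lines)


def cluster_submissions(submissions):
    """Group student ids by the normalized form of their code."""
    clusters = defaultdict(list)
    for student, code in submissions.items():
        clusters[normalize(code)].append(student)
    return clusters


def propagate_feedback(submissions, instructor_feedback):
    """Copy feedback written for one submission to every member of its cluster."""
    result = {}
    for members in cluster_submissions(submissions).values():
        notes = [instructor_feedback[s] for s in members if s in instructor_feedback]
        if notes:
            for student in members:
                result[student] = notes[0]
    return result


submissions = {
    "ana": "x = 1  # init\ny = x+1",
    "bo": "x = 1\n\ny = x+1",
    "cy": "y = 2",
}
feedback = propagate_feedback(submissions, {"ana": "Good"})
```

Here the instructor grades only ana's submission, and bo inherits the same comment because the two programs normalize to the same text. A real system would need a far more forgiving measure of similarity than exact match.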

Four short links: 5 July 2013

Tracking Bitcoin, Gaming Deflation, Bloat-Aware Design, and Mapping Entity Relationships

  1. Quantitative Analysis of the Full Bitcoin Transaction Graph (PDF) — We analyzed all these large transactions by following in detail the way these sums were accumulated and the way they were dispersed, and realized that almost all these large transactions were descendants of a single transaction which was carried out in November 2010. Finally, we noted that the subgraph which contains these large transactions along with their neighborhood has many strange looking structures which could be an attempt to conceal the existence and relationship between these transactions, but such an attempt can be foiled by following the money trail in a sufficiently persistent way. (via Alex Dong)
  2. Majority of Gamers Today Can’t Finish Level 1 of Super Mario Bros — Nintendo test, and the President of Nintendo said in a talk, We watched the replay videos of how the gamers performed and saw that many did not understand simple concepts like bottomless pits. Around 70 percent died to the first Goomba. Another 50 percent died twice. Many thought the coins were enemies and tried to avoid them. Also, most of them did not use the run button. There were many other depressing things we noted but I can not remember them at the moment. (via Beta Knowledge)
  3. Bloat-Aware Design for Big Data Applications (PDF) — (1) merging and organizing related small data record objects into few large objects (e.g., byte buffers) instead of representing them explicitly as one-object-per-record, and (2) manipulating data by directly accessing buffers (e.g., at the byte chunk level as opposed to the object level). The central goal of this design paradigm is to bound the number of objects in the application, instead of making it grow proportionally with the cardinality of the input data. (via Ben Lorica)
  4. Poderopedia (GitHub) — originally designed for investigative journalists, the open source software allows you to create and manage entity profile pages that include: short bio or summary, sheet of connections, long newsworthy profiles, maps of connections of an entity, documents related to the entity, sources of all the information and news river with external news about the entity. See the announcement and website.
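
The two ideas quoted in the Bloat-Aware Design item above (pack many small records into a few large buffers, then work on the bytes directly) can be sketched roughly as follows. This is an illustration, not code from the paper: the record layout (a 4-byte id plus an 8-byte score) and the `RecordBuffer` class are made up for the example, and Python stands in for the managed-runtime setting the item describes.

```python
import struct

# Hypothetical fixed-width record: little-endian unsigned 32-bit id + 64-bit float score.
RECORD = struct.Struct("<Id")


class RecordBuffer:
    """Stores many records in one bytearray rather than one object per record."""

    def __init__(self, capacity):
        self._buf = bytearray(RECORD.size * capacity)
        self._capacity = capacity
        self._count = 0

    def __len__(self):
        return self._count

    def append(self, rec_id, score):
        if self._count >= self._capacity:
            raise OverflowError("buffer is full")
        # Write the fields straight into the shared buffer at the record's offset.
        RECORD.pack_into(self._buf, self._count * RECORD.size, rec_id, score)
        self._count += 1

    def get(self, index):
        if not 0 <= index < self._count:
            raise IndexError(index)
        return RECORD.unpack_from(self._buf, index * RECORD.size)

    def total_score(self):
        # Scan only the used part of the buffer; no per-record object is retained.
        used = memoryview(self._buf)[: self._count * RECORD.size]
        return sum(score for _, score in RECORD.iter_unpack(used))


buf = RecordBuffer(capacity=3)
buf.append(1, 0.5)
buf.append(2, 1.5)
```

However many records are appended, the application holds one long-lived object (the buffer) instead of one object per record, which is the bound on object count that the item refers to.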

Improving options for unlocking your graph data

Graph data is an area that has attracted many enthusiastic entrepreneurs and developers

The popular open source project GraphLab received a major boost early this week when a new company composed of its founding developers raised funding to develop analytic tools for graph data sets. GraphLab Inc. will continue to use the open source GraphLab to “push the limits of graph computation and develop new ideas”, but having a commercial company will accelerate development and allow the hiring of resources dedicated to improving usability and documentation.

While social media placed graph data on the radar of many companies, similar data sets can be found in many domains including the life and health sciences, security, and financial services. Graph data is different enough that it necessitates special tools and techniques. Because those tools were a bit too complex for casual users, graph data analytics was in the past the province of specialists. Fortunately, graph data is an area that has attracted many enthusiastic entrepreneurs and developers. The tools have improved, and I expect things to get much easier for users in the future. A great place to learn more about tools for graph data is the upcoming GraphLab Workshop (on July 1st in SF).

Data wrangling: creating graphs
Before you can take advantage of the other tools mentioned in this post, you’ll need to turn your data (e.g., web pages) into graphs. GraphBuilder is an open source project from Intel that uses Hadoop MapReduce to build graphs out of large data sets. Another option is the combination of GraphX/Spark described below. (A startup called Trifacta is building a general-purpose data wrangling tool that could help as well.)

Four short links: 13 May 2013

Exploiting Glass, Teaching Probability, Product Design, and Subgraph Matching

  1. Exploiting a Bug in Google Glass — unbelievably detailed and yet easy-to-follow explanation of how the bug works, how the author found it, and how you can exploit it too. The second guide was slightly more technical, so when he returned a little later I asked him about the Debug Mode option. The reaction was interesting: he kind of looked at me, somewhat confused, and asked “wait, what version of the software does it report in Settings”? When I told him “XE4” he clarified “XE4, not XE3”, which I verified. He had thought this feature had been removed from the production units.
  2. Probability Through Problems — motivating problems to hook students on probability questions, structured to cover high-school probability material.
  3. Connbox — love the section “The importance of legible products” where the physical UI interacts seamlessly with the digital device … it’s glorious. Three amazing videos.
  4. The Index-Based Subgraph Matching Algorithm (ISMA): Fast Subgraph Enumeration in Large Networks Using Optimized Search Trees (PLoSONE) — The central question in all these fields is to understand behavior at the level of the whole system from the topology of interactions between its individual constituents. In this respect, the existence of network motifs, small subgraph patterns which occur more often in a network than expected by chance, has turned out to be one of the defining properties of real-world complex networks, in particular biological networks. […] An implementation of ISMA in Java is freely available.

Four short links: 29 March 2013

Titan Improved, Security Tweeps, Probabilistic Programming, and 3D-Printable Optics

  1. Titan 0.3 Out — graph database now has full-text, geo, and numeric-range index backends.
  2. Mozilla Security Community Do a Reddit AMA — if you wanted a list of sharp web security people to follow on Twitter, you could do a lot worse than this.
  3. Probabilistic Programming and Bayesian Methods for Hackers (GitHub) — An introduction to Bayesian methods + probabilistic programming in data analysis with a computation/understanding-first, mathematics-second point of view. All in pure Python. See also Why Probabilistic Programming Matters and Trends to Watch: Logic and Probabilistic Programming. (via Mike Loukides and Renee DiResta)
  4. Open Source 3D-Printable Optics Equipment (PLOSone) — This study demonstrates an open-source optical library, which significantly reduces the costs associated with much optical equipment, while also enabling relatively easily adapted customizable designs. The cost reductions in general are over 97%, with some components representing only 1% of the current commercial investment for optical products of similar function. The results of this study make it clear that this method of scientific hardware development enables a much broader audience to participate in optical experimentation both as research and teaching platforms than previous proprietary methods.

GraphChi: Graph analytics over billions of edges using your laptop

A disk-based, single-node, graph analytics system that scales to massive graphs

GraphChi is a spinoff project of GraphLab, an open source, distributed, in-memory software system for analytics and machine-learning.

Designed specifically to run on a single computer with limited memory (DRAM), GraphChi has been used to analyze graphs with billions of edges since its release a few months ago. Running on a single machine means deployment and debugging are simpler. In addition, it is no longer necessary to find (optimal) graph partitions that minimize communication between compute nodes – the starting point for many distributed graph computations.

The stated goal of GraphChi is to “Compute on graphs with billions of edges, in a reasonable time, on a single PC.” One way to define a “reasonable” amount of computation time is to compare against the results produced by other graph processing systems. That’s exactly what GraphChi’s creators did in a recent paper. They found that GraphChi compared favorably to graph analytics packages such as Pegasus and Stanford GPS. While GraphChi was 2-3X slower in some cases, it is easier to deploy, easier to debug, and way more energy efficient.

Four short links: 23 November 2012

Island Traps, Apolitical Technology, 3D Printing Patent Suits, and Disk-Based Graph Tool

  1. Trap Island — island on most maps doesn’t exist.
  2. Why I Work on Non-Partisan Tech (MySociety) — excellent essay. Obama won using big technology, but imagine if that effort, money, and technique were used to make things that were useful to the country. Political technology is not gov2.0.
  3. 3D Printing Patent Suits (MSNBC) — notable not just for incumbents keeping out low-cost competitors with patents, but also (as BoingBoing observed) Many of the key patents in 3D printing start expiring in 2013, and will continue to lapse through ’14 and ’15. Expect a big bang of 3D printer innovation, and massive price-drops, in the years to come. (via BoingBoing)
  4. GraphChi can run very large graph computations on just a single machine, by using a novel algorithm for processing the graph from disk (SSD or hard drive). Programs for GraphChi are written in the vertex-centric model, proposed by GraphLab and Google’s Pregel. GraphChi runs vertex-centric programs asynchronously (i.e., changes written to edges are immediately visible to subsequent computation), and in parallel. GraphChi also supports streaming graph updates and removal of edges from the graph.
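
To give a feel for the vertex-centric, asynchronous style described in the GraphChi item above, here is a small in-memory PageRank sketch. It is not GraphChi's API and has none of its disk-based machinery; the function name, the parameters, and the choice of PageRank are assumptions made for illustration. Each vertex reads the values on its in-edges, updates itself, and writes new values to its out-edges, and those writes are visible to vertices processed later in the same sweep.

```python
def pagerank_vertex_centric(edges, num_vertices, iterations=20, damping=0.85):
    """Toy asynchronous vertex-centric PageRank (illustrative only).

    Assumes `edges` holds distinct (src, dst) pairs with ids in [0, num_vertices).
    Rank held by vertices with no out-edges is simply dropped, not redistributed.
    """
    out_edges = [[] for _ in range(num_vertices)]
    in_edges = [[] for _ in range(num_vertices)]
    for src, dst in edges:
        out_edges[src].append(dst)
        in_edges[dst].append(src)

    rank = [1.0 / num_vertices] * num_vertices
    # State lives on the edges: each edge carries its source's rank share.
    edge_value = {
        (src, dst): rank[src] / len(out_edges[src]) for src, dst in edges
    }

    def update(v):
        # Gather from in-edges, update the vertex, scatter to out-edges.
        incoming = sum(edge_value[(u, v)] for u in in_edges[v])
        rank[v] = (1.0 - damping) / num_vertices + damping * incoming
        if out_edges[v]:
            share = rank[v] / len(out_edges[v])
            for w in out_edges[v]:
                edge_value[(v, w)] = share  # seen immediately by later updates

    for _ in range(iterations):
        for v in range(num_vertices):
            update(v)
    return rank


cycle_ranks = pagerank_vertex_centric([(0, 1), (1, 2), (2, 0)], num_vertices=3)
star_ranks = pagerank_vertex_centric([(0, 2), (1, 2)], num_vertices=3)
```

On the three-vertex cycle every vertex ends up with the same rank, while in the second graph vertex 2, which receives both edges, ranks highest.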

Four short links: 10 August 2011

Gamification is Bullshit, Design for Impact, Public Domain, and Network Analysis

  1. Gamification is Bullshit (Ian Bogost) — [G]amification is marketing bullshit, invented by consultants as a means to capture the wild, coveted beast that is videogames and to domesticate it for use in the grey, hopeless wasteland of big business, where bullshit already reigns anyway. Bullshitters are many things, but they are not stupid. The rhetorical power of the word “gamification” is enormous, and it does precisely what the bullshitters want: it takes games—a mysterious, magical, powerful medium that has captured the attention of millions of people—and it makes them accessible in the context of contemporary business.
  2. Design for (Real) Social Impact (Vimeo) — single best talk I’ve seen on making philanthropy effective. (via Rowan Simpson)
  3. The Public Domain Review — an online weekly journal dedicated to treasures that have entered the public domain and articles on them. The home page currently features: Boris Karloff in “Last of the Mohicans”, the Boston Revolution in psychotherapy, “Was Charles Darwin an Atheist?”, the Orson Welles audio show, “100 Years of The Secret Garden”, a feature on a 1300 year old illustrated work on the Book of Revelations, and more.
  4. SNAP — the Stanford Network Analysis Platform, a library for network and graph analysis. (via Joshua Schachter)

Four short links: 8 August 2011

Graph ORM, Graphic Computation, Web Intents, and Async RPC

  1. Bulbflow — a Python framework for graph databases: it’s like an ORM for graphs. (via Joshua Schachter)
  2. Nomograms — the lost art of graphical computing. (via John D Cook)
  3. Web Intents — adding Android-style Intents to the web. Services register their intention to be able to handle an action on the user’s behalf. Applications request to start an Action of a certain verb (share, edit, view, pick etc) and the system will find the appropriate Services for the user to use based on the user’s preference.
  4. Finagle (GitHub) — Twitter’s asynchronous network stack for the JVM that you can use to build asynchronous Remote Procedure Call (RPC) clients and servers in Java, Scala, or any JVM-hosted language. Finagle provides a rich set of tools that are protocol independent.

Four short links: 1 July 2011

Vector Graphics, Processing Maps, Augmented Senses, and Graph Analysis

  1. paper.js — The Swiss Army Knife of Vector Graphics Scripting. MIT-licensed Javascript library that gives great demo.
  2. TileMill for Processing — gorgeous custom maps in Processing. (via FlowingData)
  3. Research Assistant Wanted — working with one of the authors of Mind Hacks on augmenting our existing senses with a form of “remote touch” generated by using artificial distance sensors, such as ultrasound, to stimulate tactile stimulators (vibrating pads) placed against the surface of the head. (via Vaughan Bell)
  4. GoldenORB — a cloud-based open source project for massive-scale graph analysis, built upon best-of-breed software from the Apache Hadoop project modeled after Google’s Pregel architecture. (via BigData)