"Big Data" entries

Big data’s big ideas

From cognitive augmentation to artificial intelligence, here's a look at the major forces shaping the data world.

Big data’s big ideas

Looking back at the evolution of our Strata events, and the data space in general, we marvel at the impressive data applications and tools now being employed by companies in many industries. Data is having an impact on business models and profitability. It’s hard to find a non-trivial application that doesn’t use data in a significant manner. Companies who use data and analytics to drive decision-making continue to outperform their peers.

Up until recently, access to big data tools and techniques required significant expertise. But tools have improved and communities have formed to share best practices. We’re particularly excited about solutions that target new data sets and data types. In an era when the requisite data skill sets cut across traditional disciplines, companies have also started to emphasize the importance of processes, culture, and people. Read more…

Comment: 1
Four short links: 21 October 2014

Four short links: 21 October 2014

Data Delusions, OS Robotics, Insecure Crypto, and Free Icons

  1. The Delusions of Big Data (IEEE) — When you have large amounts of data, your appetite for hypotheses tends to get even larger. And if it’s growing faster than the statistical strength of the data, then many of your inferences are likely to be false. They are likely to be white noise.
  2. ROSCON 2014 — slides and videos of talks from Chicago open source robotics conference.
  3. Making Sure Crypto Stays Insecure (PDF) — Daniel J. Bernstein talk: This talk is actually a thought experiment: how could an attacker manipulate the ecosystem for insecurity?
  4. Material Design Icons — Google’s CC-licensed (attribution, sharealike) collection of sweet, straightforward icons.
Comment

Fast data fuels real-time streaming applications

A new report describes an imminent shift in real-time applications and the data architecture they require.

Fast_data_coverThe era is here: we’re starting to see computers making decisions that people used to make, through a combination of historical and real-time data. These streams of data come together in applications that answer questions like:

  • What news items or ads is this website visitor likely to be interested in?
  • Is current network traffic part of a Distributed Denial of Service attack?
  • Should our banking site offer a visitor a special deal on a mortgage, based on her credit history?
  • What promotion will entice this gamer to stay on our site longer?
  • Is a particular part of the assembly line overheating and need to be shut down?

Such decisions require the real-time collection of data from the particular user or device, along with others in the environment, and often need to be done on a per-person or per-event basis. For instance, leaderboarding (determining who is top candidate among a group of users, based on some criteria) requires a database that tracks all the relevant users. Such a database nowadays often resides in memory. Read more…

Comment

Big data’s move to the cloud

A new survey shows the market is ready for cloud-based big data services.

Migrating_big_data_analytics_cover_lg_2One night when our son was two years old, he abruptly decided that he didn’t like taking baths. As my wife recalls, he struggled mightily against the ritual of bathing for several months until, suddenly and mysteriously, he decided that he liked bathing again. We’re happy to report that he has managed to stay relatively clean ever since.

When I speak with CIOs and other IT leaders about moving big data operations into the cloud, I am reminded of our son’s unexplained loathing of the bathtub.

Nearly everyone associated with IT understands that most IT operations — including big data analytics — must eventually move into the cloud. The traditional on-premises approaches are simply too costly, and CIOs are under crushing pressure to shift budgetary resources to value-added, customer-facing activities.

For most companies, the writing is already on the wall. The cloud offers greater agility and elasticity, and quicker product development cycles — and can reduce costs. When you add up the benefits, it seems inevitable that the bulk of IT operations will move into the cloud. Nevertheless, the foot-dragging and excuse-making continues. Read more…

Comment: 1
Four short links: 25 September 2014

Four short links: 25 September 2014

Elevation Data, Soft Robots, Clean Data, and Security Souk

  1. NGA Releases Hi-Res Elevation Data — 30-meter topographic data for the world.
  2. Soft Roboticsa collection of shared resources to support the design, fabrication, modeling, characterization, and control of soft robotic devices. From Harvard.
  3. OpenGovIn many domains, it’s not so much about “big data” yet as it is about “clean data.”
  4. Mitnick’s Zero-Day Exploit Shop — marketplace connecting “corporate and government” buyers and sellers of zero-day exploits. Claims to vet buyers. Another hidden economy becoming public.
Comment
Four short links: 15 September 2014

Four short links: 15 September 2014

Weird Machines, Libraries May Scan, Causal Effects, and Crappy Dashboards

  1. The Care and Feeding of Weird Machines Found in Executable Metadata (YouTube) — talk from 29th Chaos Communication Congress, on using tricking the ELF linker/loader into arbitrary computation from the metadata supplied. Yes, there’s a brainfuck compiler that turns code into metadata which is then, through a supernatural mix of pixies, steam engines, and binary, executed. This will make your brain leak. Weird machines are everywhere.
  2. European Libraries May Digitise Books Without Permission“The right of libraries to communicate, by dedicated terminals, the works they hold in their collections would risk being rendered largely meaningless, or indeed ineffective, if they did not have an ancillary right to digitize the works in question,” the court said. Even if the rights holder offers a library the possibility of licensing his works on appropriate terms, the library can use the exception to publish works on electronic terminals, the court ruled. “Otherwise, the library could not realize its core mission or promote the public interest in promoting research and private study,” it said.
  3. CausalImpact (GitHub) — Google’s R package for estimating the causal effect of a designed intervention on a time series. (via Google Open Source Blog)
  4. Laws of Crappy Dashboards — (caution, NSFW language … “crappy” is my paraphrase) so true. Not talking to users will result in a [crappy] dashboard. You don’t know if the dashboard is going to be useful. But you don’t talk to the users to figure it out. Or you just show it to them for a minute (with someone else’s data), never giving them a chance to figure out what the hell they could do with it if you gave it to them.
Comment: 1
Four short links: 1 September 2014

Four short links: 1 September 2014

Sibyl, Bitrot, Estimation, and ssh

  1. Sibyl: Google’s System for Large Scale Machine Learning (YouTube) — keynote at DSN2014 acting as an intro to Sibyl. (via KD Nuggets)
  2. Bitrot from 1997That’s 205 failures, an actual link rot figure of 91%, not 57%. That leaves only 21 URLs as 200 OK and containing effectively the same content.
  3. What We Do And Don’t Know About Software Effort Estimation — nice rundown of research in the field.
  4. fabric — simple yet powerful ssh library for Python.
Comment: 1
Four short links: 27 August 2014

Four short links: 27 August 2014

Discourse 1.0, Programmable Matter, Versioned Databases, and What Humans Learned About Machine Learning

  1. Discourse turns 1.0 — community/forum software that doesn’t suck.
  2. Programmable Matter (IEEE Spectrum) — recap of where research is going in this area.
  3. Liquibasesource control for your database. Apache 2.0 licensed.
  4. A Few Useful Things to Know About Machine Learning (PDF) — This article summarizes twelve key lessons that machine learning researchers and practitioners have learned. These include pitfalls to avoid, important issues to focus on, and answers to common questions. My fave: First-timers are often surprised by how little time in a machine learning project is spent actually doing machine learning. But it makes sense if you consider how time-consuming it is to gather data, integrate it, clean it and pre-process it, and how much trial and error can go into feature design.
Comments: 2
Four short links: 20 August 2014

Four short links: 20 August 2014

Plant Properties, MQ Comparisons, 1915 Vis, and Mobile Web Weaknesses

  1. Machine Learning for Plant Properties — startup building database of plant genomics, properties, research, etc. for mining. The more familiar you are with your data and its meaning, the better your machine learning will be at suggesting fruitful lines of query … and the more valuable your startup will be.
  2. Dissecting Message Queues — throughput, latency, and qualitative comparison of different message queues. MQs are to modern distributed architectures what function calls were to historic unibox architectures.
  3. 1915 Data Visualization Rules — a reminder that data visualization is not new, but research into effectiveness of alternative presentation styles is.
  4. The Broken Promise of the Mobile Webit’s not just about the UI – it’s also about integration with the mobile device.
Comment
Four short links: 13 August 2014

Four short links: 13 August 2014

Thinking Machines, Chemical Sensor, Share Containerised Apps, and Visualising the Net Neutrality Comments

  1. Viv — another step in the cognition race. Wolfram Alpha was first out the gate, but Watson, Viv, and others are hot on heels of being able to parse complex requests, then seek and use information to fulfil them.
  2. Universal Mobile Electrochemical Detector Designed for Use in Resource-limited Applications (PNAS) — $35 handheld sensor with mobile phone connection. The electrochemical methods that we demonstrate enable quantitative, broadly applicable, and inexpensive sensing with flexibility based on a wide variety of important electroanalytical techniques (chronoamperometry, cyclic voltammetry, differential pulse voltammetry, square wave voltammetry, and potentiometry), each with different uses. Four applications demonstrate the analytical performance of the device: these involve the detection of (i) glucose in the blood for personal health, (ii) trace heavy metals (lead, cadmium, and zinc) in water for in-field environmental monitoring, (iii) sodium in urine for clinical analysis, and (iv) a malarial antigen (Plasmodium falciparum histidine-rich protein 2) for clinical research. (via BoingBoing)
  3. panamax.io containerized app creator with an open-source app marketplace hosted in GitHub. Panamax provides a friendly interface for users of Docker, Fleet & CoreOS. With Panamax, you can easily create, share and deploy any containerized app no matter how complex it might be.
  4. Quid Analysis of Comments to FCC on Net Neutrality (NPR) — visualising the themes and volume of the comments. Interesting factoid: only half the comments were derived from templates (cf 80% in submissions to some financial legislation).
Comment: 1