"data science" entries

Turning big data into actionable insights

The O’Reilly Data Show podcast: Evangelos Simoudis on data mining, investing in data startups, and corporate innovation.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.

350px-Fleming_valves

Can developments in data science and big data infrastructure drive corporate innovation? To be fair, many companies are still in the early stages of incorporating these ideas and tools into their organizations.

Evangelos Simoudis has spent many years interacting with entrepreneurs and executives at major global corporations. Most recently, he’s been advising companies interested in developing long-term strategies pertaining to big data, data science, cloud computing, and innovation. He began his career as a data mining researcher and practitioner, and is counted among the pioneers who helped data mining technologies get adopted in industry.

In this episode of the O’Reilly Data Show, I sat down with Simoudis and we talked about his thoughts on investing, data applications and products, and corporate innovation:

Open source software companies

I very much appreciate open source. I encourage my portfolio companies to use open source components as appropriate, but I’ve never seen the business model as being one that is particularly easy to really build the companies around them. Everybody points to Red Hat, and that may be the exception, but I have not seen companies that have, on the one hand, remained true to the open source principles and become big and successful companies that do not require constant investment. … The revenue streams never prove to be sufficient for building big companies. I think the companies that get started from open source in order to become big and successful … [are] ones that, at some point, decided to become far more proprietary in their model and in the services that they deliver. Or they become pure professional services companies as opposed to support services companies. Then they reach the necessary levels of success.

Read more…

Comment

Get started with cloud-based data science

Learn how to deploy machine learning solutions using Azure ML.

620px-MODIS_Map

Download the free, updated report “Data Science in the Cloud with Microsoft Azure Machine Learning and R: 2015 Update.

Cloud-based machine learning platforms, like Microsoft’s Azure Machine Learning (Azure ML), provide a simplified path to create and deploy analytic solutions. Azure ML is a fully managed and secure machine learning platform that resides within the Microsoft Cortana Analytics Suite.

Azure ML workflows (known as “experiments”) are constructed using a combination of drag-and-drop modules, SQL, R, and Python scripts. The wide range of built modules support the typical steps in a machine learning workflow, from data ingestion and data munging to model construction and cross validation.

Once your Azure ML experiment is ready, there are several options to deploy it. Azure ML experiments can access large-scale data stored in Azure Blob storage, Azure SQL and Hive, to name a few options. Similarly, your experiment can write results back to multiple scalable Azure storage options.

Read more…

Comment

Resolving transactional access and analytic performance trade-offs

The O’Reilly Data Show podcast: Todd Lipcon on hybrid and specialized tools in distributed systems.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.

350px-Dolderbrug_Steenwijk_inclusief_lichtontwerpIn recent months, I’ve been hearing about hybrid systems designed to handle different data management needs. At Strata + Hadoop World NYC last week, Cloudera’s Todd Lipcon unveiled an open source storage layer — Kudu —  that’s good at both table scans (analytics) and random access (updates and inserts).

While specialized systems will continue to serve companies, there will be situations where the complexity of maintaining multiple systems — to eke out extra performance — will be harder to justify.

During the latest episode of the O’Reilly Data Show Podcast, I sat down with Lipcon to discuss his new project a few weeks before it was released. Here are a few snippets from our conversation:

HDFS and Hbase

[Hadoop is] more like a file store. It allows you to upload files onto an arbitrarily sized cluster with 20-plus petabytes, in single clusters. The thing is, you can upload the files but you can’t edit them in place. To make any change, you have to basically put in a new file. What HBase does in distinction is that it has more of a tabular data model, where you can update and insert individual row-by- row data, and then randomly access that data [in] milliseconds. The distinction here is that HDFS is pretty good for large scans where you’re putting in a large data set, maybe doing a full parse over the data set to train a machine learning model or compute an aggregate. If any of that data changes on a frequent basis or if you want to stream the data in or randomly access individual customer records, you’re kind of out of luck on HDFS. Read more…

Comment

Specialized and hybrid data management and processing engines

A new crop of interesting solutions for the complexity of operating multiple systems in a distributed computing setting.

Shinkyo_Sacred_Bridge_Paul_Mannix_Flickr

The 2004 holiday shopping season marked the start of Amazon’s investigation into alternative database technologies that led to the creation of DynamoDB — a key-value storage system that went onto inspire several NoSQL projects.

A new group of startups began shifting away from the general-purpose systems favored by companies just a few years earlier. In recent years, we’ve seen a diverse set of DBMS technologies that specialize in handling particular workloads and data models such as OLTP, OLAP, search, RDF, XML, scientific applications, etc. The success and popularity of such systems reinforced the belief that in order to scale and “go fast,” specialized systems are preferable.

In distributed computing, the complexity of maintaining and operating multiple specialized systems has recently led to systems that bridge multiple workloads and data models. Aside from multi-model databases, there are an emerging number of storage and compute engines adept at handling different workloads and problems. At this week’s Strata + Hadoop World conference in NYC, I had a chance to interact with the creators of some of these new solutions.

OLTP (transactions) and OLAP (analytics)

One of the key announcements at Strata + Hadoop World this week was Project Kudu — an open source storage engine that’s good at both table scans (analytics) and random access (updates and inserts). Its creators are quick to point out that they aren’t out to beat specialized OLTP and OLAP systems. Rather, they’re shooting to build a system that’s “70-80% of the way there on both axes.” The project is very young and lacks enterprise features, but judging from the reaction at the conference, it’s something the big data community will be watching. Leading technology research firms have created a category for systems with related capabilities:  HTAP (Gartner) and Trans-analytics (Forrester).

Read more…

Comment: 1

Accelerating real-time analytics with Spark

Integration of the data supply chain is key to a reliable and fast big data analytics deployment.

Cnc_plasma_cutting-crop

Watch our free webcast “Accelerating Advanced Analytics with Spark” to learn about the architecture, applications, and best practices of Apache Spark.

Apache Hadoop is a mature development framework, which coupled with its large ecosystem, and support and contributions from key players such as Cloudera, Hortonworks, and Yahoo, provides organizations with many tools to manage data of varying sizes.

In the past, Hadoop’s batch-oriented nature using MapReduce was sufficient to meet the processing needs of many organizations. However, increasing demands for faster processing of data have emerged. These demands have been driven by recent developments in streaming technologies, the Internet of Things (IoT) and real-time analytics, to name just a few. These new demands have required new processing models. One significant new technology today that is being used to meet these demands and is gaining considerable interest and widespread support is Apache Spark. Spark’s speed and versatility make it a key part of today’s big-data processing stack in industries from energy to finance. Read more…

Comment

Translating data into knowledge

Best practices for data preparation — what you need to know before data analysis can begin.

Download “Data Preparation in the Big Data Era,” a new free report to help you manage the challenges of data cleaning and preparation.

Joseph_Wright_of_Derby_The_AlchemistData is growing at an exponential rate worldwide, with huge business opportunities and challenges for every industry. In 2016, global Internet traffic will reach 90 exabytes per month, according to a recent Cisco report. The ability to manage and analyze an unprecedented amount of data will be the key to success for every industry.

To exploit the benefits of a big data strategy, a key question is how to translate all of that data into useful knowledge. To meet this challenge, a company first needs to have a clear picture of their strategic knowledge assets, such as their area of expertise, core competencies, and intellectual property.

Having a clear picture of the business model and the relationships with distributors, suppliers, and customers is extremely useful in order to design a tactical and strategic decision-making process. The true potential value of big data is only gained when placed in a business context, where data analysis drives better decisions — otherwise, it’s just data.

In a new O’Reilly report Data Preparation in the Big Data Era, we provide a step-by-step guide to manage the challenges of data cleaning and preparation — critical steps before effective data analysis can begin. We explore the common problems of data preparation and the different steps involved, including data cleaning, combination, and transformation. You’ll also learn about new products that deal with problem of data variety at scale, including Tamr’s solution, which curates data at scale using a combination of machine learning and expert feedback. Read more…

Comment

Building enterprise data applications with open source components

The O’Reilly Data Show podcast: Dean Wampler on bounded and unbounded data processing and analytics.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.

350px_Julia_set_of_a_hyperbolic_exponential_map_with_an_attracting_orbit_of_period_3

I first found myself having to learn Scala when I started using Spark (version 0.5). Prior to Spark, I’d peruse books on Scala but just never found an excuse to delve into it. In the early days of Spark, Scala was a necessity — I quickly came to appreciate it and have continued to use it enthusiastically.

For this Data Show Podcast, I spoke with O’Reilly author and Typesafe’s resident big data architect Dean Wampler about Scala and other programming languages, the big data ecosystem, and his recent interest in real-time applications. Dean has years of experience helping companies with large software projects, and over the last several years, he’s focused primarily on helping enterprises design and build big data applications.

Here are a few snippets from our conversation:

Apache Mesos & the big data ecosystem

It’s a very nice capability [of Spark] that you can actually run it on a laptop when you’re developing or working with smaller data sets. … But, of course, the real interesting part is to run on a cluster. You need some cluster infrastructure and, fortunately, it works very nicely with YARN. It works very nicely on the Hadoop ecosystem. … The nice thing about Mesos over YARN is that it’s a much more flexible, capable resource manager. It basically treats your cluster as one giant machine of resources and gives you that illusion, ignoring things like network latencies and stuff. You’re just working with a giant machine and it allocates resources to your jobs, multiple users, all that stuff, but because of its greater flexibility, it cannot only run things like Spark jobs, it can run services like HDFS or Cassandra or Kafka or any of these tools. … What I saw was there was a situation here where we had maybe a successor to YARN. It’s obviously not as mature an ecosystem as the Hadoop ecosystem but not everybody needs that maturity. Some people would rather have the flexibility of Mesos or of solving more focused problems.

Read more…

Comment

Major players and important partnerships in the big data market

A fully data-driven market report maps and analyzes the intersections between companies.

Relate-IO-Interactive-Map-Screenshot-Radar

Download our new free report “Mapping Big Data: A Data Driven Market Report” for insights into the shape and structure of the big data market.

Who are the major players in the big data market? What are the sectors that make up the market and how do they relate? Which among the thousands of partnerships are most important?

These are just a handful of questions we explore in-depth in the new O’Reilly report now available for free download: Mapping Big Data: A Data Driven Market Report. For this new report, San Francisco-based startup Relato mapped the intersection of companies throughout the data ecosystem — curating a network with tens of thousands of nodes and edges representing companies and partnerships in the big data space.

Relato created the network by extracting data from company home pages on the Web and analyzed it using social network analysis; market experts interpreted the results to yield the insights presented in the report. The result is a preview of the future of market reports. Read more…

Comment

Improving corporate planning through insight generation

Data storage and management providers are becoming key contributors for insight as a service.

350px-Nikola_Tesla,_with_his_equipment_Wellcome_M0014782Contrary to what many believe, insights are difficult to identify and effectively apply. As the difficulty of insight generation becomes apparent, we are starting to see companies that offer insight generation as a service.

Data storage, management and analytics are maturing into commoditized services, and the companies that provide these services are well-positioned to provide insight on the basis not just of data, but data access and other metadata patterns.

Companies like DataHero and Host Analytics [full disclosure: Host Analytics is one of my portfolio companies] are paving the way in the insight-as-a-service space. Host Analytics’ initial product offering was a cloud-based Enterprise Performance Management (EPM) Suite, but far more important is what they are now enabling for the enterprise: they have moved from being an EPM company to being an insight generation company.  In this post, I will discuss a few of the trends that have enabled insight as a service (IaaS) and discuss the general case of using a software-as-a-service (SaaS) EPM solution to corral data and deliver insight as a service as the next level of product.

Insight generation is the identification of novel, interesting, plausible and understandable relations among elements of a data set that a) lead to the formation of an action plan and b) result in an improvement as measured by a set of KPIs. The evaluation of the set of identified relations to establish an insight, and the creation of an action plan associated with a particular insight or insights, needs to be done within a particular context and necessitates the use of domain knowledge. Read more…

Comment