Working with large data using datashader¶
import datashader as ds
import numpy as np
import holoviews as hv
from holoviews import opts
from holoviews.operation.datashader import datashade, shade, dynspread, rasterize
from holoviews.operation import decimate
hv.extension('bokeh','matplotlib')
decimate.max_samples=1000
dynspread.max_px=20
dynspread.threshold=0.5
def random_walk(n, f=5000):
"""Random walk in a 2D space, smoothed with a filter of length f"""
xs = np.convolve(np.random.normal(0, 0.1, size=n), np.ones(f)/f).cumsum()
ys = np.convolve(np.random.normal(0, 0.1, size=n), np.ones(f)/f).cumsum()
xs += 0.1*np.sin(0.1*np.array(range(n-1+f))) # add wobble on x axis
xs += np.random.normal(0, 0.005, size=n-1+f) # add measurement noise
ys += np.random.normal(0, 0.005, size=n-1+f)
return np.column_stack([xs, ys])
def random_cov():
"""Random covariance for use in generating 2D Gaussian distributions"""
A = np.random.randn(2,2)
return np.dot(A, A.T)
def time_series(T = 1, N = 100, mu = 0.1, sigma = 0.1, S0 = 20):
"""Parameterized noisy time series"""
dt = float(T)/N
t = np.linspace(0, T, N)
W = np.random.standard_normal(size = N)
W = np.cumsum(W)*np.sqrt(dt) # standard brownian motion
X = (mu-0.5*sigma**2)*t + sigma*W
S = S0*np.exp(X) # geometric brownian motion
return S
When viewed statically, the plots will not update fully when you zoom and pan.
Principles of datashading¶
Because HoloViews elements are fundamentally data containers, not visualizations, you can very quickly declare elements such as Points or Path containing datasets that may be as large as the full memory available on your machine (or even larger if using Dask dataframes). So even for very large datasets, you can easily specify a data structure that you can work with for making selections, sampling, aggregations, and so on. However, as soon as you try to visualize it directly with either the matplotlib or bokeh plotting extensions, the rendering process may be prohibitively expensive.
Let's start with a simple example we can visualize as normal:
np.random.seed(1)
points = hv.Points(np.random.multivariate_normal((0,0), [[0.1, 0.1], [0.1, 1.0]], (1000,)),label="Points")
paths = hv.Path([random_walk(2000,30)], label="Paths")
points + paths
These browser-based plots are fully interactive, as you can see if you select the Wheel Zoom or Box Zoom tools and use your scroll wheel or click and drag.
Because all of the data in these plots gets transferred directly into the web browser, the interactive functionality will be available even on a static export of this figure as a web page. Note that even though the visualization above is not computationally expensive, even with just 1000 points as in the scatterplot above, the plot already suffers from overplotting, with later points obscuring previously plotted points.
With much larger datasets, these issues will quickly make it impossible to see the true structure of the data. We can easily declare 50X or 1000X larger versions of the same plots above, but if we tried to visualize them they would be nearly unusable even if the browser did not crash:
np.random.seed(1)
points = hv.Points(np.random.multivariate_normal((0,0), [[0.1, 0.1], [0.1, 1.0]], (1000000,)),label="Points")
paths = hv.Path([0.15*random_walk(100000) for i in range(10)],label="Paths")
#points + paths ## Danger! Browsers can't handle 1 million points!
Luckily, HoloViews Elements are just containers for data and associated metadata, not plots, so HoloViews can generate entirely different types of visualizations from the same data structure when appropriate. For instance, in the plot on the left below you can see the result of applying a decimate() operation acting on the points object, which will automatically downsample this million-point dataset to at most 1000 points at any time as you zoom in or out:
decimate(points) + datashade(points) + datashade(paths)
Decimating a plot in this way can be useful, but it discards most of the data, yet still suffers from overplotting. If you have Datashader installed, you can instead use the datashade() operation to create a dynamic Datashader-based Bokeh plot. The middle plot above shows the result of using datashade() to create a dynamic Datashader-based plot out of an Element with arbitrarily large data. In the Datashader version, a new image is regenerated automatically on every zoom or pan event, revealing all the data available at that zoom level and avoiding issues with overplotting by dynamically rescaling the colors used. The same process is used for the line-based data in the Paths plot.
These two Datashader-based plots are similar to the native Bokeh plots above, but instead of making a static Bokeh plot that embeds points or line segments directly into the browser, HoloViews sets up a Bokeh plot with dynamic callbacks that render the data as an RGB image using Datashader instead. The dynamic re-rendering provides an interactive user experience even though the data itself is never provided directly to the browser. Of course, because the full data is not in the browser, a static export of this page (e.g. on holoviews.org or on anaconda.org) will only show the initially rendered version, and will not update with new images when zooming as it will when there is a live Python process available.
Though you can no longer have a completely interactive exported file, with the Datashader version on a live server you can now change the number of data points from 1000000 to 10000000 or more to see how well your machine will handle larger datasets. It will get a bit slower, but if you have enough memory, it should still be very usable, and should never crash your browser as transferring the whole dataset into your browser would. If you don't have enough memory, you can instead set up a Dask dataframe as shown in other Datashader examples, which will provide out-of-core and/or distributed processing to handle even the largest datasets.
The datashade() operation is actually a "macro" or shortcut that combines the two main computations done by datashader, namely shade() and rasterize():
rasterize(points).hist() + shade(rasterize(points)) + datashade(points)