Open-Source Data Viz with Superset and DuckDB

Read Your Files Directly

Dec 11, 2023

Apache Superset is a data visualization platform, while DuckDB is an OLAP database. Both technologies are open source. In this post I describe how the two can be combined to create a live data connection with a file system. By the end, you will be able to to query Parquet, JSON, and CSV files directly from Superset.

I created a repository on Github that stores the code referenced in this post. If you want, you can clone it and setup a DuckDB-powered Superset instance yourself!
https://github.com/jorritsandbrink/superset-duckdb

Start Superset

A custom Docker image is used to package Superset and DuckDB together. It uses the official Superset Docker image as a base, and extends it with DuckDB and a SQLAlchemy driver for DuckDB:

FROM apache/superset:3.1.0rc1-py310

USER root

RUN pip install duckdb==0.9.2
RUN pip install duckdb-engine==0.9.2

USER superset

Build the image

If you have Docker installed, you can easily build the image by running

docker build -t jorritsandbrink/superset-duckdb docker

Run the container

After you’ve built the image, you can start a container by running

docker run -d -p 8080:8088 \
    -e "SUPERSET_SECRET_KEY=your_secret_key" \
    --mount type=bind,source=/$(pwd)/data,target=/data \
    --name superset-duckdb \
    jorritsandbrink/superset-duckdb

Please replace your_secret_key with something more secure. The Superset docs suggests generating a key with openssl rand -base64 42.

Note that we are mounting our local data folder to the container to make the data files accessible from within the container.

Setup Superset

By now you have a running container, but some configuration needs to be done before you can work with your Superset instance. This configuration is done with several CLI commands, which I’ve bundled into a script called setup.sh. You can setup Superset with a single command by running

./docker/setup.sh

Alternatively, you run the following commands in sequence.

Create admin user

docker exec -it superset-duckdb superset fab create-admin \
    --username admin \
    --firstname Superset \
    --lastname Admin \
    --email admin@superset.com \
    --password admin

If you want, you can replace the admin credentials with your own values.

Upgrade database to latest

docker exec -it superset-duckdb superset db upgrade

Superset uses a backend OLTP database as metastore. By default, this is SQLLite, but it is recommended to use another supported database engine (e.g. PostgreSQL or MySQL) for production usage.

Setup roles

docker exec -it superset-duckdb superset init

Create database connection

docker exec -it superset-duckdb superset set_database_uri \
    -d DuckDB-memory \
    -u duckdb:///:memory:

We are using the special value :memory: to create an in-memory DuckDB database that does not persist any data to disk. In this mode, we don’t need to load our source data files into a persisted DuckDB database file, thereby effectively enabling a live data connection between the files and Superset.

Open Superset

You can now open the Superset UI by navigating to http://localhost:8080/login/ in a browser. Login with username=admin and password=admin (or your own values, if you changed them):

After logging in, you’ll be redirected to the Superset welcome page:

Check database connection

If all went well with the configuration, you should be able to see the DuckDB database connection from the Superset UI. Hover over Settings in the top-right corner and click Database Connections:

For now, DuckDB-memory is the only database connection in the list:

Overview of database connections in Superset UI

You can click the Edit button the see the SQLAlchemy connection URI:

DuckDB database connection configuration in Superset UI

Click TEST CONNECTION and make sure you see this popup message:

Query files

You have made sure your DuckDB database connection works—now’s the time to use it! We will use SQL Lab—a built-in SQL IDE—to interactively query the Parquet, JSON, and CSV files in your filesystem from within Superset. Hover over SQL and click SQL Lab:

You can use a glob syntax to read multiple files at once:

Parquet

SELECT *
FROM '/data/parquet_table/*.parquet'

JSON

SELECT *
FROM '/data/json_table/*.json'

CSV

SELECT *
FROM '/data/csv_table/*.csv'

Next steps

That’s it! You now know how you can use DuckDB to connect Superset with a filesystem. You’ve seen that—by using the in-memory connection mode—you can effectively create a live data connection with the filesystem, without needing to load the data into a database. The next step would be to save your queries as datasets, to build charts and dashboards.

Thank you for reading this post. Feel free to share it with others.

Please check out the links that inspired this post:

Data Engineering with Jorrit Sandbrink

Discussion about this post