Apache Superset is a data visualization platform, while DuckDB is an OLAP database. Both technologies are open source. In this post I describe how the two can be combined to create a live data connection with a file system. By the end, you will be able to to query Parquet
, JSON
, and CSV
files directly from Superset.
I created a repository on Github that stores the code referenced in this post. If you want, you can clone it and setup a DuckDB-powered Superset instance yourself!
Start Superset
A custom Docker image is used to package Superset and DuckDB together. It uses the official Superset Docker image as a base, and extends it with DuckDB and a SQLAlchemy
driver for DuckDB:
FROM apache/superset:3.1.0rc1-py310
USER root
RUN pip install duckdb==0.9.2
RUN pip install duckdb-engine==0.9.2
USER superset
Build the image
If you have Docker installed, you can easily build the image by running
docker build -t jorritsandbrink/superset-duckdb docker
Run the container
After you’ve built the image, you can start a container by running
docker run -d -p 8080:8088 \
-e "SUPERSET_SECRET_KEY=your_secret_key" \
--mount type=bind,source=/$(pwd)/data,target=/data \
--name superset-duckdb \
jorritsandbrink/superset-duckdb
Please replace your_secret_key
with something more secure. The Superset docs suggests generating a key with openssl rand -base64 42.
Note that we are mounting our local data folder to the container to make the data files accessible from within the container.
Setup Superset
By now you have a running container, but some configuration needs to be done before you can work with your Superset instance. This configuration is done with several CLI commands, which I’ve bundled into a script called setup.sh
. You can setup Superset with a single command by running
./docker/setup.sh
Alternatively, you run the following commands in sequence.
Create admin user
docker exec -it superset-duckdb superset fab create-admin \
--username admin \
--firstname Superset \
--lastname Admin \
--email admin@superset.com \
--password admin
If you want, you can replace the admin credentials with your own values.
Upgrade database to latest
docker exec -it superset-duckdb superset db upgrade
Superset uses a backend OLTP database as metastore. By default, this is
SQLLite
, but it is recommended to use another supported database engine (e.g.PostgreSQL
orMySQL
) for production usage.
Setup roles
docker exec -it superset-duckdb superset init
Create database connection
docker exec -it superset-duckdb superset set_database_uri \
-d DuckDB-memory \
-u duckdb:///:memory:
We are using the special value
:memory:
to create an in-memory DuckDB database that does not persist any data to disk. In this mode, we don’t need to load our source data files into a persisted DuckDB database file, thereby effectively enabling a live data connection between the files and Superset.
Open Superset
You can now open the Superset UI by navigating to http://localhost:8080/login/ in a browser. Login with username=admin
and password=admin
(or your own values, if you changed them):
After logging in, you’ll be redirected to the Superset welcome page:
Check database connection
If all went well with the configuration, you should be able to see the DuckDB database connection from the Superset UI. Hover over Settings in the top-right corner and click Database Connections:
For now, DuckDB-memory
is the only database connection in the list:
You can click the Edit button the see the SQLAlchemy
connection URI:
Click TEST CONNECTION
and make sure you see this popup message:
Query files
You have made sure your DuckDB database connection works—now’s the time to use it! We will use SQL Lab—a built-in SQL IDE—to interactively query the Parquet
, JSON
, and CSV
files in your filesystem from within Superset. Hover over SQL and click SQL Lab:
You can use a glob syntax to read multiple files at once:
Parquet
SELECT *
FROM '/data/parquet_table/*.parquet'
JSON
SELECT *
FROM '/data/json_table/*.json'
CSV
SELECT *
FROM '/data/csv_table/*.csv'
Next steps
That’s it! You now know how you can use DuckDB to connect Superset with a filesystem. You’ve seen that—by using the in-memory connection mode—you can effectively create a live data connection with the filesystem, without needing to load the data into a database. The next step would be to save your queries as datasets, to build charts and dashboards.
Please check out the links that inspired this post: