Loading Data
There are different strategies for data loading and you can combine them depending on your use cases.
The most important thing to understand is that Shaper is just a wrapper around DuckDB. Any data source DuckDB can query, Shaper can query as well. In addition to querying external data, you can store data within the DuckDB database itself. We extend DuckDB with functionality to make it all work seamlessly for building online and real-time analytics dashboards:
- Ingest data efficiently by storing it in NATS
- Auto-create tables for ingested data
- Securely store credentials for remote data sources
- Schedule automated tasks to load and transform data
Ingesting Data into Shaper
You can ingest data into Shaper’s database through the HTTP API or via NATS. Since DuckDB is not optimized for write operations, we first store data in NATS JetStream and then write it to DuckDB in batches using DuckDB’s Appender API. We create tables and add columns automatically based on the ingested data.
To ingest data you first need to create an API Key in the “Admin” settings of the Shaper UI.
Then you can write JSON data to the HTTP API or NATS directly:
Endpoint:

```
POST http://localhost:5454/api/data/:tablename
```

Authentication is done through a Bearer token in the Authorization header.
You can pass a single JSON object or an array of objects.
Example:
```sh
curl -X POST http://localhost:5454/api/data/my_table \
  -H "Authorization: Bearer <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{"col1": "value1", "col2": 124}'
```

```sh
curl -X POST http://localhost:5454/api/data/my_table \
  -H "Authorization: Bearer <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '[{"col1": "value1", "col2": 123}, {"col1": "value2", "col2": 456}]'
```

Shaper uses NATS internally, but by default NATS is not directly accessible.
You can make NATS reachable by specifying `nats-port`. To make sure NATS is secured, you also need to specify an admin token via `nats-token`. For example:

```sh
docker run --rm -it -p5454:5454 -p4222:4222 taleshape/shaper --nats-port 4222 --nats-token mytoken
```

You can skip `nats-token` for development, but in production you want to make sure it is set.

Now you can ingest data like this:

```sh
nats pub --user '<your-api-key>' shaper.ingest.my_table '{"col1": "value1", "col2": 124}'
```

You can also use any other NATS client. And you can also submit data using the JetStream API to get ACKs.
If you click on “New” in the sidebar now, you can run the following queries:
```sql
DESC my_table;
SELECT * FROM my_table;
```

| column_name | column_type | null | key | default | extra |
| ----------- | ----------- | ---- | --- | ------- | ----- |
| _id         | VARCHAR     | YES  |     |         |       |
| _ts         | TIMESTAMP   | YES  |     |         |       |
| col1        | VARCHAR     | YES  |     |         |       |
| col2        | DOUBLE      | YES  |     |         |       |
You can see that Shaper auto-creates columns with fitting data types, and we always add `_id` and `_ts` columns. You can override the default values by passing data for them in the JSON object. Shaper detects booleans and numbers in JSON. We also detect date and timestamp strings in various formats. If any data is a complex data type such as an array or object, we store it as a `JSON` column in DuckDB.
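For example, you could query the ingested table like this (a sketch: the time filter uses the auto-created `_ts` column, and `col3` stands in for a hypothetical nested field stored as `JSON`):

```sql
-- Rows ingested in the last hour, newest first (uses the auto-created _ts column)
SELECT _id, col1, col2
FROM my_table
WHERE _ts > now() - INTERVAL 1 HOUR
ORDER BY _ts DESC;

-- Extract a field from a hypothetical JSON column named col3
SELECT col3->>'$.some_key' AS some_key
FROM my_table;
```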
Querying External Data
The easiest way to query external data is through files.
DuckDB can read local files and it can use HTTP to read remote files.
We can also read files from S3 and other object stores.
Common file formats are supported out of the box:
- CSV
- JSON
- Parquet
- Text files
- Excel
You can find the complete list in the DuckDB documentation.
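As a minimal sketch (the paths, URLs, and bucket names below are placeholders), querying files is plain SQL:

```sql
-- Query a local CSV file directly
SELECT * FROM 'data/events.csv';

-- Read a remote Parquet file over HTTP
SELECT * FROM read_parquet('https://example.com/exports/events.parquet');

-- Read from S3 (requires credentials for the bucket)
SELECT * FROM read_parquet('s3://my-bucket/events/*.parquet');
```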
Credential Management
Connecting to Remote Databases
DuckDB allows you to attach to remote databases.
There are extensions to attach to many common databases:
- Another DuckDB database file
- Postgres
- SQLite
- MySQL
Attaching to databases is done with the `ATTACH` statement.
For example, you can connect to a local Postgres database:
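Here is a minimal sketch (the connection parameters and table name are placeholders to adapt to your setup):

```sql
-- Load the Postgres extension and attach the database
INSTALL postgres;
LOAD postgres;
ATTACH 'dbname=mydb user=postgres host=127.0.0.1' AS pg (TYPE postgres);

-- Query a table in the attached database
SELECT * FROM pg.public.some_table;
```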
You can learn more about `ATTACH` and `DETACH` in the DuckDB documentation.
DuckDB Extensions
You can make use of any DuckDB Extension within Shaper.
Learn about all existing core and community extensions in the DuckDB documentation.
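For example, installing and loading extensions is done in SQL. The extensions named here are just illustrations:

```sql
-- Core extensions install from the default repository
INSTALL spatial;
LOAD spatial;

-- Community extensions install from the community repository
INSTALL h3 FROM community;
LOAD h3;
```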
If you run into any issues with extensions or have any questions, don’t hesitate to reach out.
Tasks and Automation
Tasks can be scheduled to run on startup and rerun at scheduled intervals to take care of any data maintenance work needed:
- Attach remote databases
- Install extensions
- Import and update data from remote sources
- Create tables and views
You can think of tasks as a lightweight scripting tool similar to cron jobs.
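To illustrate the kind of SQL a task might run (the URL, table, and column names are placeholders, and the schedule itself is configured in Shaper rather than in the SQL):

```sql
-- Refresh a local table from a remote CSV export on every run
CREATE OR REPLACE TABLE orders AS
SELECT * FROM read_csv_auto('https://example.com/exports/orders.csv');

-- Maintain an aggregate view for dashboards (assumes order_date and amount columns)
CREATE OR REPLACE VIEW revenue_per_day AS
SELECT order_date, sum(amount) AS revenue
FROM orders
GROUP BY order_date;
```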