
Processing 17,000 Islands of Climate Data with Apache DataFusion and Parquet

Indonesia’s climate data is one of the world’s most complex datasets to manage. With 17,000 islands, 100+ BMKG weather stations, and decades of historical observations, processing this data efficiently requires careful architecture.

In this post, we’ll explore how GARUDA uses Apache DataFusion and Parquet to handle this scale.

BMKG (Badan Meteorologi, Klimatologi, dan Geofisika) operates one of the densest weather station networks in the world. Each station generates:

  • Hourly observations: Temperature, humidity, precipitation, wind speed
  • Daily aggregates: Min/max temperatures, total rainfall
  • Monthly summaries: Climate normals, extremes

Multiply this across 100+ stations over 50+ years, and you’re looking at 50+ million data points.
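
That figure is easy to sanity-check with back-of-envelope arithmetic (using the lower-bound counts above; hourly rows dominate):

stations, years = 100, 50
hourly = stations * years * 365 * 24   # 43,800,000 hourly observations
daily = stations * years * 365         #  1,825,000 daily aggregates
monthly = stations * years * 12        #     60,000 monthly summaries
print(hourly + daily + monthly)        # 45,685,000 -- with "100+" stations and
                                       # several variables per row, 50+ million
                                       # is a conservative total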

Traditional approaches fail here:

  • CSV files: Too slow to query, poor compression
  • Relational databases: Expensive to scale, difficult to distribute
  • JSON APIs: Rate-limited, inconsistent schemas

Parquet is a columnar format optimized for analytics:

import polars as pl

# Load 2.3 GB of climate data in seconds
df = pl.read_parquet("bmkg_climate_2020_2024.parquet")

# Filter to a specific province
jakarta = df.filter(pl.col("province") == "DKI Jakarta")

# Aggregate by month
monthly = jakarta.group_by("month").agg([
    pl.col("temperature_c").mean(),
    pl.col("precipitation_mm").sum(),
])
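
One caveat: read_parquet pulls the entire file into memory. For queries that touch only a slice of the data, polars' lazy API pushes the filter and column selection down into the Parquet scan, so only the matching row groups need to be read:

import polars as pl

# Lazy scan: nothing is read until .collect(); the province filter and the
# two aggregated columns are pushed into the Parquet reader.
monthly = (
    pl.scan_parquet("bmkg_climate_2020_2024.parquet")
    .filter(pl.col("province") == "DKI Jakarta")
    .group_by("month")
    .agg([
        pl.col("temperature_c").mean(),
        pl.col("precipitation_mm").sum(),
    ])
    .collect()
)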

Apache DataFusion is a Rust-native, Arrow-based query engine that runs SQL directly over Parquet files:

SELECT
    station_id,
    DATE_TRUNC('month', timestamp) AS month,
    AVG(temperature_c) AS avg_temp,
    SUM(precipitation_mm) AS total_rain
FROM bmkg_climate
WHERE province = 'West Java'
GROUP BY station_id, month
ORDER BY month DESC;
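
To run that query locally, one minimal sketch uses DataFusion's Python bindings (the datafusion package), registering the downloaded file under the table name the query expects:

from datafusion import SessionContext

ctx = SessionContext()
# Expose the Parquet file as a SQL table named bmkg_climate
ctx.register_parquet("bmkg_climate", "bmkg_climate_2020_2024.parquet")

result = ctx.sql("""
    SELECT station_id,
           DATE_TRUNC('month', timestamp) AS month,
           AVG(temperature_c) AS avg_temp,
           SUM(precipitation_mm) AS total_rain
    FROM bmkg_climate
    WHERE province = 'West Java'
    GROUP BY station_id, month
    ORDER BY month DESC
""")
result.show()            # print a preview
df = result.to_pandas()  # or hand off to pandas for plotting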

Benefits:

  • Speed: 10-100x faster than CSV for analytical queries
  • Compression: 2.3 GB of raw data → ~500 MB Parquet (see the sketch below)
  • Flexible storage: Works on local files or object stores (S3, GCS)
  • Language-agnostic: Python, Rust, JavaScript, Go
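
The compression win comes from columnar encoding (dictionary/RLE on repetitive columns like station IDs and provinces) plus a general-purpose codec on top. As a sketch with assumed filenames, converting a raw CSV export looks like this:

import polars as pl

# Repetitive sensor columns dictionary-encode very well; zstd compresses
# the rest. A 4-5x reduction relative to raw CSV is typical for this data.
df = pl.read_csv("bmkg_climate_raw.csv")   # hypothetical raw export
df.write_parquet("bmkg_climate.parquet", compression="zstd")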

We partition climate data by:

  1. Year: Separate Parquet files per year
  2. Province: Geographic locality for regional queries
  3. Data type: Climate, Carbon, Finance in separate datasets

garuda-datasets/
├── climate/
│   ├── 2020/
│   │   ├── aceh.parquet
│   │   ├── north_sumatra.parquet
│   │   └── ...
│   ├── 2021/
│   └── ...
├── carbon/
└── finance/

This allows:

  • Fast regional queries: Only read relevant province files (see the sketch after this list)
  • Incremental updates: Add new year without reprocessing
  • Parallel processing: Query multiple years simultaneously
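
With that layout, a regional query is just a narrower file glob. A sketch with polars, following the directory tree above (the west_java.parquet file name is assumed from the naming convention, and the year is derived from each row's timestamp):

import polars as pl

# Read only the West Java file from each year's directory --
# a few small files instead of the full multi-GB dataset.
rainfall_by_year = (
    pl.scan_parquet("garuda-datasets/climate/*/west_java.parquet")
    .group_by(pl.col("timestamp").dt.year().alias("year"))
    .agg(pl.col("precipitation_mm").sum().alias("total_rain_mm"))
    .sort("year")
    .collect()
)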

One of GARUDA’s unique features is correlating climate with carbon data:

use rakit_client::Client;

// Assumes a tokio async runtime, since the client's query() is async.
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new("YOUR_API_KEY");

    let result = client
        .query(
            "SELECT
                 c.station_id,
                 c.timestamp,
                 c.temperature_c,
                 cb.carbon_intensity
             FROM climate c
             JOIN carbon cb ON c.station_id = cb.region_id
             WHERE c.timestamp >= '2023-01-01'
               AND c.province = 'East Java'",
        )
        .await?;

    for row in result.rows() {
        println!(
            "{}: {} °C, {} gCO2/kWh",
            row.station_id, row.temperature_c, row.carbon_intensity
        );
    }

    Ok(())
}
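
The same join also runs locally against the raw Parquet files with DataFusion's Python bindings, no API key required (the dataset file names here are assumptions):

from datafusion import SessionContext

ctx = SessionContext()
ctx.register_parquet("climate", "climate_2023.parquet")  # assumed file names
ctx.register_parquet("carbon", "carbon_2023.parquet")

joined = ctx.sql("""
    SELECT c.station_id, c.timestamp, c.temperature_c, cb.carbon_intensity
    FROM climate c
    JOIN carbon cb ON c.station_id = cb.region_id
    WHERE c.timestamp >= '2023-01-01' AND c.province = 'East Java'
""")
joined.show()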

Query performance on a 2020 MacBook Pro (M1):

Query                                   Time     Data Scanned
All climate data (2020-2024)            2.3 s    2.3 GB
Single province, single year            45 ms    18 MB
Cross-domain join (climate + carbon)    340 ms   500 MB
Saka Calendar enrichment                120 ms   50 MB
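
These timings are straightforward to reproduce yourself; a minimal harness (the exact benchmark query is an assumption):

import time
from datafusion import SessionContext

ctx = SessionContext()
ctx.register_parquet("bmkg_climate", "bmkg_climate_2020_2024.parquet")

start = time.perf_counter()
ctx.sql(
    "SELECT COUNT(*) FROM bmkg_climate WHERE province = 'DKI Jakarta'"
).collect()  # .collect() forces execution
print(f"{time.perf_counter() - start:.3f}s")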

Looking ahead, we’re exploring:

  • GPU acceleration with RAPIDS for large aggregations
  • Real-time streaming with Apache Kafka for live BMKG data
  • Federated queries across multiple data sources
  • Time-series optimization for seasonal analysis

Download the free BMKG climate dataset and experiment:

# Download the 2.3 GB Parquet file (-L follows the GitHub release redirect)
curl -LO https://github.com/teknorakit/garuda-datasets/releases/download/v1.0.0/bmkg_climate_2020_2024.parquet
# Query with DuckDB (local SQL engine)
duckdb
> SELECT COUNT(*) FROM 'bmkg_climate_2020_2024.parquet';
50234567
> SELECT DISTINCT province FROM 'bmkg_climate_2020_2024.parquet';

GARUDA makes this accessible via API, but the raw Parquet files are free to download and analyze locally.


Questions? Join our GitHub Discussions or email support@teknorakit.com.