
Processing 17,000 Islands of Climate Data with Apache DataFusion and Parquet

Indonesia’s climate data is one of the world’s most complex datasets to manage. With 17,000 islands, 100+ BMKG weather stations, and decades of historical observations, processing this data efficiently requires careful architecture.

In this post, we’ll explore how GARUDA uses Apache DataFusion and Parquet to handle this scale.

BMKG (Badan Meteorologi, Klimatologi, dan Geofisika) operates one of the densest weather station networks in the world. Each station generates:

  • Hourly observations: Temperature, humidity, precipitation, wind speed
  • Daily aggregates: Min/max temperatures, total rainfall
  • Monthly summaries: Climate normals, extremes

Multiply this across 100+ stations over 50+ years, and you’re looking at 50+ million data points.
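
That figure is easy to sanity-check with back-of-envelope arithmetic (using the lower-bound counts above; hourly rows dominate):

stations, years = 100, 50
hourly = stations * years * 365 * 24   # 43,800,000 hourly observations
daily = stations * years * 365         #  1,825,000 daily aggregates
monthly = stations * years * 12        #     60,000 monthly summaries
print(hourly + daily + monthly)        # 45,685,000 -- with "100+" stations and
                                       # several variables per row, 50+ million
                                       # is a conservative total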

Traditional approaches fail here:

  • CSV files: Too slow to query, poor compression
  • Relational databases: Expensive to scale, difficult to distribute
  • JSON APIs: Rate-limited, inconsistent schemas

Parquet is a columnar format optimized for analytics:

import polars as pl

# Load 2.3 GB of climate data in seconds
df = pl.read_parquet("bmkg_climate_2020_2024.parquet")

# Filter to a specific province
jakarta = df.filter(pl.col("province") == "DKI Jakarta")

# Aggregate by month
monthly = jakarta.group_by("month").agg([
    pl.col("temperature_c").mean(),
    pl.col("precipitation_mm").sum(),
])
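
One caveat: read_parquet pulls the entire file into memory. For queries that touch only a slice of the data, polars' lazy API pushes the filter and column selection down into the Parquet scan, so only the matching row groups need to be read:

import polars as pl

# Lazy scan: nothing is read until .collect(); the province filter and the
# two aggregated columns are pushed into the Parquet reader.
monthly = (
    pl.scan_parquet("bmkg_climate_2020_2024.parquet")
    .filter(pl.col("province") == "DKI Jakarta")
    .group_by("month")
    .agg([
        pl.col("temperature_c").mean(),
        pl.col("precipitation_mm").sum(),
    ])
    .collect()
)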

Apache DataFusion is a Rust-native, Arrow-based query engine that runs SQL directly over Parquet files:

SELECT
    station_id,
    DATE_TRUNC('month', timestamp) AS month,
    AVG(temperature_c) AS avg_temp,
    SUM(precipitation_mm) AS total_rain
FROM bmkg_climate
WHERE province = 'West Java'
GROUP BY station_id, month
ORDER BY month DESC;
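
To run that query locally, one minimal sketch uses DataFusion's Python bindings (the datafusion package), registering the downloaded file under the table name the query expects:

from datafusion import SessionContext

ctx = SessionContext()
# Expose the Parquet file as a SQL table named bmkg_climate
ctx.register_parquet("bmkg_climate", "bmkg_climate_2020_2024.parquet")

result = ctx.sql("""
    SELECT station_id,
           DATE_TRUNC('month', timestamp) AS month,
           AVG(temperature_c) AS avg_temp,
           SUM(precipitation_mm) AS total_rain
    FROM bmkg_climate
    WHERE province = 'West Java'
    GROUP BY station_id, month
    ORDER BY month DESC
""")
result.show()            # print a preview
df = result.to_pandas()  # or hand off to pandas for plotting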

Benefits:

  • Speed: 10-100x faster than CSV for analytical queries
  • Compression: 2.3 GB of raw data → ~500 MB Parquet (see the sketch below)
  • Flexible storage: Works on local files or object stores (S3, GCS)
  • Language-agnostic: Python, Rust, JavaScript, Go
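
The compression win comes from columnar encoding (dictionary/RLE on repetitive columns like station IDs and provinces) plus a general-purpose codec on top. As a sketch with assumed filenames, converting a raw CSV export looks like this:

import polars as pl

# Repetitive sensor columns dictionary-encode very well; zstd compresses
# the rest. A 4-5x reduction relative to raw CSV is typical for this data.
df = pl.read_csv("bmkg_climate_raw.csv")   # hypothetical raw export
df.write_parquet("bmkg_climate.parquet", compression="zstd")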

We partition climate data by:

  1. Year: Separate Parquet files per year
  2. Province: Geographic locality for regional queries
  3. Data type: Climate, Carbon, Finance in separate datasets

garuda-datasets/
├── climate/
│   ├── 2020/
│   │   ├── aceh.parquet
│   │   ├── north_sumatra.parquet
│   │   └── ...
│   ├── 2021/
│   └── ...
├── carbon/
└── finance/

This allows:

  • Fast regional queries: Only read relevant province files (see the sketch after this list)
  • Incremental updates: Add new year without reprocessing
  • Parallel processing: Query multiple years simultaneously
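
With that layout, a regional query is just a narrower file glob. A sketch with polars, following the directory tree above (the west_java.parquet file name is assumed from the naming convention, and the year is derived from each row's timestamp):

import polars as pl

# Read only the West Java file from each year's directory --
# a few small files instead of the full multi-GB dataset.
rainfall_by_year = (
    pl.scan_parquet("garuda-datasets/climate/*/west_java.parquet")
    .group_by(pl.col("timestamp").dt.year().alias("year"))
    .agg(pl.col("precipitation_mm").sum().alias("total_rain_mm"))
    .sort("year")
    .collect()
)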

One of GARUDA’s unique features is correlating climate with carbon data:

use rakit_client::Client;

// Assumes a tokio async runtime, since the client's query() is async.
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new("YOUR_API_KEY");

    let result = client
        .query(
            "SELECT
                 c.station_id,
                 c.timestamp,
                 c.temperature_c,
                 cb.carbon_intensity
             FROM climate c
             JOIN carbon cb ON c.station_id = cb.region_id
             WHERE c.timestamp >= '2023-01-01'
               AND c.province = 'East Java'",
        )
        .await?;

    for row in result.rows() {
        println!(
            "{}: {} °C, {} gCO2/kWh",
            row.station_id, row.temperature_c, row.carbon_intensity
        );
    }

    Ok(())
}
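
The same join also runs locally against the raw Parquet files with DataFusion's Python bindings, no API key required (the dataset file names here are assumptions):

from datafusion import SessionContext

ctx = SessionContext()
ctx.register_parquet("climate", "climate_2023.parquet")  # assumed file names
ctx.register_parquet("carbon", "carbon_2023.parquet")

joined = ctx.sql("""
    SELECT c.station_id, c.timestamp, c.temperature_c, cb.carbon_intensity
    FROM climate c
    JOIN carbon cb ON c.station_id = cb.region_id
    WHERE c.timestamp >= '2023-01-01' AND c.province = 'East Java'
""")
joined.show()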

Query performance on a 2020 MacBook Pro (M1):

Query                                   Time     Data Scanned
All climate data (2020-2024)            2.3 s    2.3 GB
Single province, single year            45 ms    18 MB
Cross-domain join (climate + carbon)    340 ms   500 MB
Saka Calendar enrichment                120 ms   50 MB
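
These timings are straightforward to reproduce yourself; a minimal harness (the exact benchmark query is an assumption):

import time
from datafusion import SessionContext

ctx = SessionContext()
ctx.register_parquet("bmkg_climate", "bmkg_climate_2020_2024.parquet")

start = time.perf_counter()
ctx.sql(
    "SELECT COUNT(*) FROM bmkg_climate WHERE province = 'DKI Jakarta'"
).collect()  # .collect() forces execution
print(f"{time.perf_counter() - start:.3f}s")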

Looking ahead, we’re exploring:

  • GPU acceleration with RAPIDS for large aggregations
  • Real-time streaming with Apache Kafka for live BMKG data
  • Federated queries across multiple data sources
  • Time-series optimization for seasonal analysis

Download the free BMKG climate dataset and experiment:

# Download the 2.3 GB Parquet file (-L follows the GitHub release redirect)
curl -LO https://github.com/teknorakit/garuda-datasets/releases/download/v1.0.0/bmkg_climate_2020_2024.parquet
# Query with DuckDB (local SQL engine)
duckdb
> SELECT COUNT(*) FROM 'bmkg_climate_2020_2024.parquet';
50234567
> SELECT DISTINCT province FROM 'bmkg_climate_2020_2024.parquet';

GARUDA makes this accessible via API, but the raw Parquet files are free to download and analyze locally.


Questions? Join our GitHub Discussions or email support@teknorakit.com.