Processing 17,000 Islands of Climate Data with Apache DataFusion and Parquet
Indonesia’s climate record is one of the world’s most complex datasets to manage. With 17,000 islands, 100+ BMKG weather stations, and decades of historical observations, processing this data efficiently requires careful architecture.
In this post, we’ll explore how GARUDA uses Apache DataFusion and Parquet to handle this scale.
The Challenge: Scale and Fragmentation
BMKG (Badan Meteorologi, Klimatologi, dan Geofisika) operates one of the densest weather station networks in the world. Each station generates:
- Hourly observations: Temperature, humidity, precipitation, wind speed
- Daily aggregates: Min/max temperatures, total rainfall
- Monthly summaries: Climate normals, extremes
Multiply this across 100+ stations over 50+ years, and you’re looking at 50+ million data points.
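The hourly stream alone accounts for most of that figure. A rough back-of-envelope check (100 stations and 50 years are the post’s round numbers, not exact BMKG totals):

```python
# Back-of-envelope estimate of BMKG observation volume.
# 100 stations and 50 years are the round figures from the text,
# not exact network totals.
stations = 100
years = 50
hours_per_year = 24 * 365  # ignoring leap days

hourly_obs = stations * years * hours_per_year
daily_aggs = stations * years * 365
monthly_sums = stations * years * 12

total = hourly_obs + daily_aggs + monthly_sums
print(f"{total:,}")  # roughly 45.7 million rows
```

Counting each measured variable (temperature, humidity, precipitation, wind speed) as its own data point pushes this comfortably past 50 million.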
Traditional approaches fail here:
- CSV files: Too slow to query, poor compression
- Relational databases: Expensive to scale, difficult to distribute
- JSON APIs: Rate-limited, inconsistent schemas
Why Parquet + DataFusion?
Parquet is a columnar format optimized for analytics:
```python
import polars as pl

# Load 2.3 GB of climate data in seconds
df = pl.read_parquet("bmkg_climate_2020_2024.parquet")

# Filter to a specific province
jakarta = df.filter(pl.col("province") == "DKI Jakarta")

# Aggregate by month
monthly = jakarta.group_by("month").agg([
    pl.col("temperature_c").mean(),
    pl.col("precipitation_mm").sum(),
])
```

DataFusion is Apache’s SQL query engine, and it runs directly over Parquet files:
```sql
SELECT
    station_id,
    DATE_TRUNC('month', timestamp) AS month,
    AVG(temperature_c) AS avg_temp,
    SUM(precipitation_mm) AS total_rain
FROM bmkg_climate
WHERE province = 'West Java'
GROUP BY station_id, month
ORDER BY month DESC;
```

Benefits:
- 10-100x faster than CSV for analytical queries
- Compression: 2.3 GB of raw data → ~500 MB Parquet
- Distributed: Works on local files or cloud storage (S3, GCS)
- Language-agnostic: Python, Rust, JavaScript, Go
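If you don’t have DataFusion handy, the same aggregation pattern runs on any SQL engine. Here is a minimal stand-in using Python’s built-in `sqlite3`; the sample rows are invented for illustration, and since SQLite lacks `DATE_TRUNC`, `strftime` plays that role:

```python
import sqlite3

# Stand-in for the DataFusion query above, run on an in-memory SQLite
# table. The sample rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE bmkg_climate (
        station_id TEXT, province TEXT, timestamp TEXT,
        temperature_c REAL, precipitation_mm REAL
    )
""")
conn.executemany(
    "INSERT INTO bmkg_climate VALUES (?, ?, ?, ?, ?)",
    [
        ("96783", "West Java",   "2023-01-02T00:00:00", 26.1, 4.0),
        ("96783", "West Java",   "2023-01-15T00:00:00", 27.3, 0.0),
        ("96783", "West Java",   "2023-02-01T00:00:00", 25.8, 12.5),
        ("96749", "DKI Jakarta", "2023-01-02T00:00:00", 28.0, 1.0),
    ],
)

# SQLite has no DATE_TRUNC; strftime('%Y-%m', ...) truncates to month.
rows = conn.execute("""
    SELECT station_id,
           strftime('%Y-%m', timestamp) AS month,
           AVG(temperature_c) AS avg_temp,
           SUM(precipitation_mm) AS total_rain
    FROM bmkg_climate
    WHERE province = 'West Java'
    GROUP BY station_id, month
    ORDER BY month DESC
""").fetchall()
print(rows)  # two West Java rows: 2023-02 first, then 2023-01
```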
GARUDA’s Architecture
We partition climate data by:
- Year: Separate Parquet files per year
- Province: Geographic locality for regional queries
- Data type: Climate, Carbon, Finance in separate datasets
```
garuda-datasets/
├── climate/
│   ├── 2020/
│   │   ├── aceh.parquet
│   │   ├── north_sumatra.parquet
│   │   └── ...
│   ├── 2021/
│   └── ...
├── carbon/
└── finance/
```

This allows:
- Fast regional queries: Only read relevant province files
- Incremental updates: Add new year without reprocessing
- Parallel processing: Query multiple years simultaneously
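With this layout, partition pruning is just path arithmetic. Here is a hypothetical sketch of how a planner could map a regional query onto the exact files it must read; the helper mirrors the tree above but is not GARUDA’s actual code:

```python
# Hypothetical partition-pruning helper over the year/province layout
# shown above; GARUDA's real query planner is not this function.
def files_to_scan(dataset: str, years: list[int], provinces: list[str]) -> list[str]:
    """Return only the Parquet files a query actually needs to read."""
    return [
        f"garuda-datasets/{dataset}/{year}/{province}.parquet"
        for year in years
        for province in provinces
    ]

# A West Java query over two years touches 2 files, not the whole dataset.
paths = files_to_scan("climate", [2020, 2021], ["west_java"])
print(paths)
# → ['garuda-datasets/climate/2020/west_java.parquet',
#    'garuda-datasets/climate/2021/west_java.parquet']
```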
Example: Cross-Domain Query
One of GARUDA’s unique features is correlating climate with carbon data:
```rust
use rakit_client::Client;

let client = Client::new("YOUR_API_KEY");

let result = client.query(
    "SELECT c.station_id, c.timestamp, c.temperature_c, cb.carbon_intensity
     FROM climate c
     JOIN carbon cb ON c.station_id = cb.region_id
     WHERE c.timestamp >= '2023-01-01'
       AND c.province = 'East Java'",
).await?;

for row in result.rows() {
    println!(
        "{}: {} °C, {} gCO2/kWh",
        row.station_id, row.temperature_c, row.carbon_intensity,
    );
}
```

Performance Metrics
On a 2020 MacBook Pro (M1):
| Query | Time | Data Scanned |
|---|---|---|
| All climate data (2020-2024) | 2.3s | 2.3 GB |
| Single province, single year | 45ms | 18 MB |
| Cross-domain join (climate + carbon) | 340ms | 500 MB |
| Saka Calendar enrichment | 120ms | 50 MB |
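The gap between the first two rows comes mostly from partition pruning rather than raw scan speed. Dividing data scanned by query time (numbers taken from the table above):

```python
# Effective scan throughput implied by the benchmark table above.
full_scan = 2.3 * 1024 / 2.3   # 2.3 GB in 2.3 s  -> MB/s
pruned = 18 / 0.045            # 18 MB in 45 ms   -> MB/s
print(f"full scan: {full_scan:.0f} MB/s, pruned: {pruned:.0f} MB/s")
```

The per-byte rate is actually lower for the small query (fixed planning overhead dominates at that size), but it reads roughly 1/130th of the data, which is where the ~50x latency win comes from.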
What’s Next?
We’re exploring:
- GPU acceleration with RAPIDS for large aggregations
- Real-time streaming with Apache Kafka for live BMKG data
- Federated queries across multiple data sources
- Time-series optimization for seasonal analysis
Try It Yourself
Download the free BMKG climate dataset and experiment:
```sh
# Download the 2.3 GB Parquet file (-L follows GitHub's release redirect)
curl -LO https://github.com/teknorakit/garuda-datasets/releases/download/v1.0.0/bmkg_climate_2020_2024.parquet

# Query with DuckDB (local SQL engine)
duckdb
```

```
> SELECT COUNT(*) FROM 'bmkg_climate_2020_2024.parquet';
50234567
> SELECT DISTINCT province FROM 'bmkg_climate_2020_2024.parquet';
```

GARUDA makes this accessible via API, but the raw Parquet files are free to download and analyze locally.
Questions? Join our GitHub Discussions or email support@teknorakit.com.