# Why BMKG API Data Needs Normalization (and How We Do It)
BMKG’s API is a treasure trove of Indonesian climate data—but it’s raw, inconsistent, and difficult to work with at scale. This post explains why normalization is critical and how GARUDA solves it.
## The BMKG Data Problem
BMKG publishes weather observations through multiple channels:
- Real-time API: Latest observations from 100+ stations
- Historical archives: Decades of daily/monthly data
- Forecast models: Predicted weather for the next 10 days
- Satellite data: Rainfall estimates from weather satellites
Each source differs in its:
- Schemas: Different field names, units, precision
- Update frequencies: Some hourly, some daily, some monthly
- Quality levels: Raw observations vs. QA-checked data
- Geographic coverage: Not all stations report all variables
### Example: Temperature Data Inconsistencies
```json
// Station A (modern equipment)
{
  "station_id": "BMKG_001",
  "timestamp": "2024-03-17T15:00:00Z",
  "temperature": 28.5,
  "unit": "celsius",
  "precision": 0.1
}

// Station B (older equipment)
{
  "station_code": "BMG002",
  "observation_time": "2024-03-17 15:00",
  "temp": 28,
  "unit": "C",
  "precision": 1.0
}

// Station C (satellite-derived)
{
  "id": "SAT_003",
  "time": "2024-03-17T15:30:00",
  "temperature_estimate": 28.3,
  "data_type": "satellite",
  "confidence": 0.85
}
```

Same phenomenon, three different schemas. Querying across all stations requires extensive data wrangling.
## The Normalization Pipeline
GARUDA’s pipeline transforms raw BMKG data into a consistent, queryable format:
### Stage 1: Ingestion
```python
# Fetch from multiple BMKG endpoints
raw_data = {
    "realtime": fetch_bmkg_api(),
    "historical": fetch_bmkg_archive(),
    "satellite": fetch_satellite_data(),
}
```
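For reference, a fetch helper like `fetch_bmkg_api()` might look roughly like the sketch below. The endpoint URL and response shape here are placeholders, not BMKG’s actual interface:

```python
import requests

# Placeholder URL: the real BMKG endpoint and response shape differ
BMKG_LATEST_URL = "https://example.invalid/bmkg/observations/latest"

def fetch_bmkg_api() -> list[dict]:
    """Fetch the latest station observations (illustrative sketch)."""
    response = requests.get(BMKG_LATEST_URL, timeout=30)
    response.raise_for_status()
    # Assumed response shape: {"observations": [{...}, {...}]}
    return response.json()["observations"]
```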
### Stage 2: Schema Mapping

We map each source to a canonical schema:
```python
# Define the canonical schema
class ClimateObservation:
    station_id: str          # Normalized ID
    timestamp: datetime      # ISO 8601, UTC
    temperature_c: float     # Always Celsius
    humidity_pct: float      # Always 0-100
    precipitation_mm: float  # Always mm
    wind_speed_ms: float     # Always m/s
    province: str            # Standardized province name
    latitude: float
    longitude: float
    data_quality: str        # "raw" | "qc_passed" | "estimated"
    source: str              # "bmkg_realtime" | "bmkg_archive" | "satellite"
```
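To make the mapping concrete, here is a sketch of how a “Station B”-style record from the earlier example could be coerced into the canonical schema. The ID-rewriting rule and the UTC assumption are illustrative, not GARUDA’s actual mapping logic:

```python
from datetime import datetime, timezone

def map_station_b(record: dict) -> ClimateObservation:
    """Map the legacy 'Station B' schema onto the canonical one (sketch)."""
    obs = ClimateObservation()
    # Illustrative ID rule: "BMG002" -> "BMKG_002"
    obs.station_id = record["station_code"].replace("BMG", "BMKG_")
    # Legacy timestamps carry no timezone; assume they are already UTC
    obs.timestamp = datetime.strptime(
        record["observation_time"], "%Y-%m-%d %H:%M"
    ).replace(tzinfo=timezone.utc)
    obs.temperature_c = float(record["temp"])  # unit is "C" in this source
    obs.data_quality = "raw"
    obs.source = "bmkg_realtime"
    # Fields absent from this source (humidity, wind, ...) stay unset here
    return obs
```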
### Stage 3: Quality Control

```python
def validate_observation(obs: ClimateObservation) -> bool:
    # Check for missing values first, so the bounds checks below
    # never compare against None
    if any(v is None for v in [
        obs.temperature_c,
        obs.humidity_pct,
        obs.timestamp,
    ]):
        return False

    # Temperature bounds (Indonesia: -5°C to 45°C)
    if not (-5 <= obs.temperature_c <= 45):
        return False

    # Humidity bounds (0-100%)
    if not (0 <= obs.humidity_pct <= 100):
        return False

    # Precipitation bounds (0-500 mm/day)
    if not (0 <= obs.precipitation_mm <= 500):
        return False

    return True
```
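Failing validation does not have to mean discarding the record. One plausible pattern, shown below as a sketch rather than GARUDA’s exact behavior, is to tag each observation so downstream queries can opt in or out:

```python
# Tag rather than drop: failed records remain queryable as "raw"
for obs in observations:
    obs.data_quality = "qc_passed" if validate_observation(obs) else "raw"
```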
### Stage 4: Enrichment

We add contextual data:
```python
# Add geographic hierarchy
obs.island = get_island_from_coordinates(obs.latitude, obs.longitude)
obs.region = get_region_from_province(obs.province)

# Add Saka Calendar (unique to GARUDA)
obs.saka_sasih = gregorian_to_saka_month(obs.timestamp)
obs.saka_pawukon = gregorian_to_saka_pawukon(obs.timestamp)

# Add derived metrics
obs.heat_index = calculate_heat_index(obs.temperature_c, obs.humidity_pct)
obs.dew_point = calculate_dew_point(obs.temperature_c, obs.humidity_pct)
```
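As an example of how one derived metric can be computed, here is a dew point approximation using the Magnus formula. This is a standard textbook approximation, not necessarily the exact implementation behind `calculate_dew_point`:

```python
import math

def calculate_dew_point(temperature_c: float, humidity_pct: float) -> float:
    """Approximate dew point (°C) via the Magnus formula."""
    a, b = 17.625, 243.04  # Magnus coefficients for roughly -40 to 50 °C
    gamma = math.log(humidity_pct / 100.0) + (a * temperature_c) / (b + temperature_c)
    return (b * gamma) / (a - gamma)
```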
### Stage 5: Partitioning & Storage

```python
# Partition by year and province for fast queries
# (year and province identify the batch of normalized observations in df)
output_path = f"s3://garuda-data/climate/{year}/{province}.parquet"

# Write normalized data
df.write_parquet(
    output_path,
    compression="snappy",
    row_group_size=10000,
)
```
## Before vs. After

### Before Normalization (Raw BMKG)
```python
# Querying raw data is painful
data = fetch_bmkg_api()

for station in data:
    # Handle different field names
    temp = station.get("temperature") or station.get("temp")

    # Handle missing data before converting units
    if temp is None:
        continue

    # Convert units
    if station.get("unit") == "fahrenheit":
        temp = (temp - 32) * 5 / 9

    # Normalize province names
    province = normalize_province_name(station.get("province_name"))

    # Finally, use the data
    print(f"{province}: {temp}°C")
```
### After Normalization (GARUDA)

```python
import polars as pl

# Query normalized data with confidence
df = pl.read_parquet("climate/2024/west_java.parquet")

result = df.filter(
    (pl.col("temperature_c") > 25)
    & (pl.col("data_quality") == "qc_passed")
).select([
    "station_id",
    "timestamp",
    "temperature_c",
    "humidity_pct",
    "saka_sasih",
])
```
## Data Quality Metrics

GARUDA tracks quality for every observation:
| Quality Level | Meaning | Usage |
|---|---|---|
| `raw` | Direct from BMKG, no validation | Research, exploratory analysis |
| `qc_passed` | Passed automated quality checks | Production queries, carbon MRV |
| `estimated` | Satellite-derived or interpolated | Filling gaps, regional estimates |
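In practice, a consumer picks the quality tiers that match its use case. A small sketch with Polars, using the column names defined in the canonical schema above:

```python
import polars as pl

df = pl.read_parquet("climate/2024/west_java.parquet")

# Production / carbon MRV work: QC-passed observations only
production = df.filter(pl.col("data_quality") == "qc_passed")

# Regional estimates: also accept satellite-derived values
regional = df.filter(pl.col("data_quality").is_in(["qc_passed", "estimated"]))
```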
## The Cost of Normalization
Running this pipeline costs:
- Compute: ~$50/month (AWS Lambda + S3)
- Storage: ~$10/month (compressed Parquet)
- Maintenance: ~10 hours/month (schema updates, new stations)
GARUDA absorbs these costs so you don’t have to.
## Real-World Impact
A carbon MRV (Monitoring, Reporting, Verification) project needed to correlate:
- BMKG temperature data (for cooling degree days)
- Carbon intensity data (for emissions calculations)
- Saka Calendar dates (for cultural alignment in reporting)
Without normalization: 3 weeks of data engineering to integrate sources
With GARUDA: 1 hour to write a query
```sql
SELECT
  c.station_id,
  c.saka_sasih,
  AVG(c.temperature_c) AS avg_temp,
  SUM(cb.emissions_kg) AS total_emissions
FROM climate c
JOIN carbon cb ON c.station_id = cb.region_id
WHERE c.timestamp >= '2023-01-01'
  AND c.data_quality = 'qc_passed'
GROUP BY c.station_id, c.saka_sasih
ORDER BY c.saka_sasih;
```

## What’s Next?
We’re working on:
- Real-time normalization: Process BMKG data as it arrives
- Automated schema evolution: Handle new BMKG fields automatically
- Anomaly detection: Flag suspicious observations
- Forecast normalization: Standardize BMKG’s weather predictions
Ready to use normalized BMKG data? Get started with GARUDA.