
Why BMKG API Data Needs Normalization (and How We Do It)

The API of BMKG (Badan Meteorologi, Klimatologi, dan Geofisika, Indonesia's meteorology, climatology, and geophysics agency) is a treasure trove of climate data, but that data arrives raw, inconsistent, and difficult to work with at scale. This post explains why normalization is critical and how GARUDA solves it.

BMKG publishes weather observations through multiple channels:

  1. Real-time API: Latest observations from 100+ stations
  2. Historical archives: Decades of daily/monthly data
  3. Forecast models: Predicted weather for the next 10 days
  4. Satellite data: Rainfall estimates from weather satellites

Each source has different:

  • Schemas: Different field names, units, precision
  • Update frequencies: Some hourly, some daily, some monthly
  • Quality levels: Raw observations vs. QA-checked data
  • Geographic coverage: Not all stations report all variables

For example, the same temperature observation can arrive in three different shapes:
// Station A (modern equipment)
{
  "station_id": "BMKG_001",
  "timestamp": "2024-03-17T15:00:00Z",
  "temperature": 28.5,
  "unit": "celsius",
  "precision": 0.1
}

// Station B (older equipment)
{
  "station_code": "BMG002",
  "observation_time": "2024-03-17 15:00",
  "temp": 28,
  "unit": "C",
  "precision": 1.0
}

// Station C (satellite-derived)
{
  "id": "SAT_003",
  "time": "2024-03-17T15:30:00",
  "temperature_estimate": 28.3,
  "data_type": "satellite",
  "confidence": 0.85
}

Same phenomenon, three different schemas. Querying across all stations requires extensive data wrangling.

GARUDA’s pipeline transforms raw BMKG data into a consistent, queryable format:

# Fetch from multiple BMKG endpoints
raw_data = {
    "realtime": fetch_bmkg_api(),
    "historical": fetch_bmkg_archive(),
    "satellite": fetch_satellite_data(),
}
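
The fetch helpers above are thin wrappers around HTTP calls. A minimal sketch of one of them, assuming a JSON endpoint; the URL below is a placeholder, not BMKG's actual API path:

import requests

def fetch_bmkg_api(timeout: int = 30) -> list[dict]:
    """Fetch the latest station observations (placeholder endpoint, for illustration only)."""
    url = "https://api.example.invalid/bmkg/observations/latest"  # placeholder URL
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()
    return resp.json()  # assumed to be a JSON array of station records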

We map each source to a canonical schema:

# Define canonical schema
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ClimateObservation:
    station_id: str          # Normalized ID
    timestamp: datetime      # ISO 8601 UTC
    temperature_c: float     # Always Celsius
    humidity_pct: float      # Always 0-100
    precipitation_mm: float  # Always mm
    wind_speed_ms: float     # Always m/s
    province: str            # Standardized province name
    latitude: float
    longitude: float
    data_quality: str        # "raw" | "qc_passed" | "estimated"
    source: str              # "bmkg_realtime" | "bmkg_archive" | "satellite"
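
A per-source mapper then turns each raw record into this schema. The sketch below handles the "Station B" shape from earlier; the field names it reads come from that example, while the helper itself (normalize_station_b) and the fields absent from the sample record are hypothetical:

from datetime import datetime, timezone

def normalize_station_b(raw: dict) -> ClimateObservation:
    """Map an older-equipment record onto the canonical schema (hypothetical helper)."""
    # "2024-03-17 15:00" carries no timezone; assume it is already UTC for this sketch.
    ts = datetime.strptime(raw["observation_time"], "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)
    return ClimateObservation(
        station_id=raw["station_code"],        # e.g. "BMG002"
        timestamp=ts,
        temperature_c=float(raw["temp"]),      # already Celsius ("unit": "C")
        humidity_pct=float(raw.get("humidity", float("nan"))),        # not in the sample record
        precipitation_mm=float(raw.get("precipitation", float("nan"))),
        wind_speed_ms=float(raw.get("wind_speed", float("nan"))),
        province=raw.get("province", ""),
        latitude=float(raw.get("latitude", float("nan"))),
        longitude=float(raw.get("longitude", float("nan"))),
        data_quality="raw",                    # quality is assigned later in the pipeline
        source="bmkg_realtime",
    )
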
Each candidate observation is then validated against physical bounds for Indonesia before it is accepted:

def validate_observation(obs: ClimateObservation) -> bool:
    # Check for missing values first, so the bound checks never compare against None
    if any(v is None for v in [
        obs.temperature_c,
        obs.humidity_pct,
        obs.timestamp
    ]):
        return False
    # Temperature bounds (Indonesia: -5°C to 45°C)
    if not (-5 <= obs.temperature_c <= 45):
        return False
    # Humidity bounds (0-100%)
    if not (0 <= obs.humidity_pct <= 100):
        return False
    # Precipitation bounds (0-500 mm/day)
    if not (0 <= obs.precipitation_mm <= 500):
        return False
    return True

We add contextual data:

# Add geographic hierarchy
obs.island = get_island_from_coordinates(obs.latitude, obs.longitude)
obs.region = get_region_from_province(obs.province)
# Add Saka Calendar (unique to GARUDA)
obs.saka_sasih = gregorian_to_saka_month(obs.timestamp)
obs.saka_pawukon = gregorian_to_saka_pawukon(obs.timestamp)
# Add derived metrics
obs.heat_index = calculate_heat_index(obs.temperature_c, obs.humidity_pct)
obs.dew_point = calculate_dew_point(obs.temperature_c, obs.humidity_pct)
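
calculate_dew_point is referenced above but not defined in the post; a common choice is the Magnus approximation. The sketch below is an illustration under that assumption, not necessarily GARUDA's exact formula:

import math

def calculate_dew_point(temperature_c: float, humidity_pct: float) -> float:
    """Dew point via the Magnus approximation (illustrative constants a=17.62, b=243.12 °C)."""
    a, b = 17.62, 243.12
    gamma = math.log(humidity_pct / 100.0) + (a * temperature_c) / (b + temperature_c)
    return (b * gamma) / (a - gamma)

# Example: 28.5 °C at 80% relative humidity gives a dew point of roughly 24.7 °C
print(round(calculate_dew_point(28.5, 80.0), 1))
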
Finally, the normalized observations are written out as partitioned Parquet files:

# Partition by year and province for fast queries
output_path = f"s3://garuda-data/climate/{year}/{province}.parquet"

# Write normalized data (df is a Polars DataFrame of validated observations)
df.write_parquet(
    output_path,
    compression="snappy",
    row_group_size=10000
)

The difference shows up immediately in downstream code. Querying raw data is painful:

# Querying raw data is painful
data = fetch_bmkg_api()
for station in data:
    # Handle different field names
    temp = station.get("temperature") or station.get("temp")
    # Handle missing data before converting units
    if temp is None:
        continue
    # Convert units
    if station.get("unit") == "fahrenheit":
        temp = (temp - 32) * 5 / 9
    # Normalize province names
    province = normalize_province_name(station.get("province_name"))
    # Finally, use the data
    print(f"{province}: {temp}°C")

Querying the normalized data takes a few lines:

import polars as pl

# Query normalized data with confidence
df = pl.read_parquet("climate/2024/west_java.parquet")
result = df.filter(
    (pl.col("temperature_c") > 25) &
    (pl.col("data_quality") == "qc_passed")
).select([
    "station_id",
    "timestamp",
    "temperature_c",
    "humidity_pct",
    "saka_sasih"
])
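
One of the wrangling steps buried in the raw-data loop, normalize_province_name, hides real complexity: station feeds mix Indonesian spellings, abbreviations, and English forms. A minimal sketch of that helper, with an illustrative alias table rather than GARUDA's full mapping:

# Illustrative alias table; a real mapping would cover every Indonesian province.
_PROVINCE_ALIASES = {
    "jabar": "West Java",
    "jawa barat": "West Java",
    "west java": "West Java",
    "dki jakarta": "Jakarta",
    "jakarta": "Jakarta",
}

def normalize_province_name(name: str | None) -> str | None:
    """Return a standardized province name, or None if the input is unknown."""
    if not name:
        return None
    return _PROVINCE_ALIASES.get(name.strip().lower())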

GARUDA tracks quality for every observation:

Quality Level | Meaning                           | Usage
raw           | Direct from BMKG, no validation   | Research, exploratory analysis
qc_passed     | Passed automated quality checks   | Production queries, carbon MRV
estimated     | Satellite-derived or interpolated | Filling gaps, regional estimates
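
Assigning these levels follows directly from the pipeline above. A sketch of the rule (the function name is ours; the logic simply combines the source field with the validation check already shown):

def assign_data_quality(obs: ClimateObservation) -> str:
    """Derive the quality label from source and validation status (illustrative)."""
    if obs.source == "satellite":
        return "estimated"   # satellite-derived or interpolated values
    if validate_observation(obs):
        return "qc_passed"   # passed the automated bound and missing-value checks
    return "raw"             # kept, but flagged as unvalidated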

Running this pipeline costs:

  • Compute: ~$50/month (AWS Lambda + S3)
  • Storage: ~$10/month (compressed Parquet)
  • Maintenance: ~10 hours/month (schema updates, new stations)

GARUDA absorbs these costs so you don’t have to.

A carbon MRV (Monitoring, Reporting, Verification) project needed to correlate:

  • BMKG temperature data (for cooling degree days)
  • Carbon intensity data (for emissions calculations)
  • Saka Calendar dates (for cultural alignment in reporting)

Without normalization: 3 weeks of data engineering to integrate sources

With GARUDA: 1 hour to write a query

SELECT
    c.station_id,
    c.saka_sasih,
    AVG(c.temperature_c) AS avg_temp,
    SUM(cb.emissions_kg) AS total_emissions
FROM climate c
JOIN carbon cb ON c.station_id = cb.region_id
WHERE c.timestamp >= '2023-01-01'
  AND c.data_quality = 'qc_passed'
GROUP BY c.station_id, c.saka_sasih
ORDER BY c.saka_sasih;
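
One lightweight way to run this query is DuckDB directly over the partitioned Parquet files; that choice is an assumption for illustration, and the paths below are placeholders for local copies of GARUDA's outputs:

import duckdb

# Expose the Parquet partitions as SQL views (paths are placeholders).
con = duckdb.connect()
con.execute("CREATE VIEW climate AS SELECT * FROM read_parquet('climate/2023/*.parquet')")
con.execute("CREATE VIEW carbon AS SELECT * FROM read_parquet('carbon/2023/*.parquet')")

mrv_sql = """
SELECT c.station_id, c.saka_sasih,
       AVG(c.temperature_c) AS avg_temp,
       SUM(cb.emissions_kg) AS total_emissions
FROM climate c
JOIN carbon cb ON c.station_id = cb.region_id
WHERE c.timestamp >= '2023-01-01' AND c.data_quality = 'qc_passed'
GROUP BY c.station_id, c.saka_sasih
ORDER BY c.saka_sasih
"""
result = con.execute(mrv_sql).fetchall()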

We’re working on:

  • Real-time normalization: Process BMKG data as it arrives
  • Automated schema evolution: Handle new BMKG fields automatically
  • Anomaly detection: Flag suspicious observations
  • Forecast normalization: Standardize BMKG’s weather predictions

Ready to use normalized BMKG data? Get started with GARUDA.