
Why BMKG API Data Needs Normalization (and How We Do It)

The API of BMKG (Badan Meteorologi, Klimatologi, dan Geofisika, Indonesia's meteorology, climatology, and geophysics agency) is a treasure trove of climate data, but that data arrives raw, inconsistent, and difficult to work with at scale. This post explains why normalization is critical and how GARUDA solves it.

BMKG publishes weather observations through multiple channels:

  1. Real-time API: Latest observations from 100+ stations
  2. Historical archives: Decades of daily/monthly data
  3. Forecast models: Predicted weather for the next 10 days
  4. Satellite data: Rainfall estimates from weather satellites

Each source has different:

  • Schemas: Different field names, units, precision
  • Update frequencies: Some hourly, some daily, some monthly
  • Quality levels: Raw observations vs. QA-checked data
  • Geographic coverage: Not all stations report all variables

For example, the same temperature observation can arrive in three different shapes:
// Station A (modern equipment)
{
  "station_id": "BMKG_001",
  "timestamp": "2024-03-17T15:00:00Z",
  "temperature": 28.5,
  "unit": "celsius",
  "precision": 0.1
}

// Station B (older equipment)
{
  "station_code": "BMG002",
  "observation_time": "2024-03-17 15:00",
  "temp": 28,
  "unit": "C",
  "precision": 1.0
}

// Station C (satellite-derived)
{
  "id": "SAT_003",
  "time": "2024-03-17T15:30:00",
  "temperature_estimate": 28.3,
  "data_type": "satellite",
  "confidence": 0.85
}

Same phenomenon, three different schemas. Querying across all stations requires extensive data wrangling.

GARUDA’s pipeline transforms raw BMKG data into a consistent, queryable format:

# Fetch from multiple BMKG endpoints
raw_data = {
    "realtime": fetch_bmkg_api(),
    "historical": fetch_bmkg_archive(),
    "satellite": fetch_satellite_data(),
}
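
The fetch helpers above are thin wrappers around HTTP calls. A minimal sketch of one of them, assuming a JSON endpoint; the URL below is a placeholder, not BMKG's actual API path:

import requests

def fetch_bmkg_api(timeout: int = 30) -> list[dict]:
    """Fetch the latest station observations (placeholder endpoint, for illustration only)."""
    url = "https://api.example.invalid/bmkg/observations/latest"  # placeholder URL
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()
    return resp.json()  # assumed to be a JSON array of station records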

We map each source to a canonical schema:

# Define canonical schema
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ClimateObservation:
    station_id: str          # Normalized ID
    timestamp: datetime      # ISO 8601 UTC
    temperature_c: float     # Always Celsius
    humidity_pct: float      # Always 0-100
    precipitation_mm: float  # Always mm
    wind_speed_ms: float     # Always m/s
    province: str            # Standardized province name
    latitude: float
    longitude: float
    data_quality: str        # "raw" | "qc_passed" | "estimated"
    source: str              # "bmkg_realtime" | "bmkg_archive" | "satellite"
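
A per-source mapper then turns each raw record into this schema. The sketch below handles the "Station B" shape from earlier; the field names it reads come from that example, while the helper itself (normalize_station_b) and the fields absent from the sample record are hypothetical:

from datetime import datetime, timezone

def normalize_station_b(raw: dict) -> ClimateObservation:
    """Map an older-equipment record onto the canonical schema (hypothetical helper)."""
    # "2024-03-17 15:00" carries no timezone; assume it is already UTC for this sketch.
    ts = datetime.strptime(raw["observation_time"], "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)
    return ClimateObservation(
        station_id=raw["station_code"],        # e.g. "BMG002"
        timestamp=ts,
        temperature_c=float(raw["temp"]),      # already Celsius ("unit": "C")
        humidity_pct=float(raw.get("humidity", float("nan"))),        # not in the sample record
        precipitation_mm=float(raw.get("precipitation", float("nan"))),
        wind_speed_ms=float(raw.get("wind_speed", float("nan"))),
        province=raw.get("province", ""),
        latitude=float(raw.get("latitude", float("nan"))),
        longitude=float(raw.get("longitude", float("nan"))),
        data_quality="raw",                    # quality is assigned later in the pipeline
        source="bmkg_realtime",
    )
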
Each candidate observation is then validated against physical bounds for Indonesia before it is accepted:

def validate_observation(obs: ClimateObservation) -> bool:
    # Check for missing values first, so the bound checks never compare against None
    if any(v is None for v in [
        obs.temperature_c,
        obs.humidity_pct,
        obs.timestamp
    ]):
        return False
    # Temperature bounds (Indonesia: -5°C to 45°C)
    if not (-5 <= obs.temperature_c <= 45):
        return False
    # Humidity bounds (0-100%)
    if not (0 <= obs.humidity_pct <= 100):
        return False
    # Precipitation bounds (0-500 mm/day)
    if not (0 <= obs.precipitation_mm <= 500):
        return False
    return True

We add contextual data:

# Add geographic hierarchy
obs.island = get_island_from_coordinates(obs.latitude, obs.longitude)
obs.region = get_region_from_province(obs.province)
# Add Saka Calendar (unique to GARUDA)
obs.saka_sasih = gregorian_to_saka_month(obs.timestamp)
obs.saka_pawukon = gregorian_to_saka_pawukon(obs.timestamp)
# Add derived metrics
obs.heat_index = calculate_heat_index(obs.temperature_c, obs.humidity_pct)
obs.dew_point = calculate_dew_point(obs.temperature_c, obs.humidity_pct)
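
calculate_dew_point is referenced above but not defined in the post; a common choice is the Magnus approximation. The sketch below is an illustration under that assumption, not necessarily GARUDA's exact formula:

import math

def calculate_dew_point(temperature_c: float, humidity_pct: float) -> float:
    """Dew point via the Magnus approximation (illustrative constants a=17.62, b=243.12 °C)."""
    a, b = 17.62, 243.12
    gamma = math.log(humidity_pct / 100.0) + (a * temperature_c) / (b + temperature_c)
    return (b * gamma) / (a - gamma)

# Example: 28.5 °C at 80% relative humidity gives a dew point of roughly 24.7 °C
print(round(calculate_dew_point(28.5, 80.0), 1))
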
Finally, the normalized observations are written out as partitioned Parquet files:

# Partition by year and province for fast queries
output_path = f"s3://garuda-data/climate/{year}/{province}.parquet"

# Write normalized data (df is a Polars DataFrame of validated observations)
df.write_parquet(
    output_path,
    compression="snappy",
    row_group_size=10000
)

The difference shows up immediately in downstream code. Querying raw data is painful:

# Querying raw data is painful
data = fetch_bmkg_api()
for station in data:
    # Handle different field names
    temp = station.get("temperature") or station.get("temp")
    # Handle missing data before converting units
    if temp is None:
        continue
    # Convert units
    if station.get("unit") == "fahrenheit":
        temp = (temp - 32) * 5 / 9
    # Normalize province names
    province = normalize_province_name(station.get("province_name"))
    # Finally, use the data
    print(f"{province}: {temp}°C")

Querying the normalized data takes a few lines:

import polars as pl

# Query normalized data with confidence
df = pl.read_parquet("climate/2024/west_java.parquet")
result = df.filter(
    (pl.col("temperature_c") > 25) &
    (pl.col("data_quality") == "qc_passed")
).select([
    "station_id",
    "timestamp",
    "temperature_c",
    "humidity_pct",
    "saka_sasih"
])
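
One of the wrangling steps buried in the raw-data loop, normalize_province_name, hides real complexity: station feeds mix Indonesian spellings, abbreviations, and English forms. A minimal sketch of that helper, with an illustrative alias table rather than GARUDA's full mapping:

# Illustrative alias table; a real mapping would cover every Indonesian province.
_PROVINCE_ALIASES = {
    "jabar": "West Java",
    "jawa barat": "West Java",
    "west java": "West Java",
    "dki jakarta": "Jakarta",
    "jakarta": "Jakarta",
}

def normalize_province_name(name: str | None) -> str | None:
    """Return a standardized province name, or None if the input is unknown."""
    if not name:
        return None
    return _PROVINCE_ALIASES.get(name.strip().lower())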

GARUDA tracks quality for every observation:

Quality Level | Meaning                           | Usage
raw           | Direct from BMKG, no validation   | Research, exploratory analysis
qc_passed     | Passed automated quality checks   | Production queries, carbon MRV
estimated     | Satellite-derived or interpolated | Filling gaps, regional estimates
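
Assigning these levels follows directly from the pipeline above. A sketch of the rule (the function name is ours; the logic simply combines the source field with the validation check already shown):

def assign_data_quality(obs: ClimateObservation) -> str:
    """Derive the quality label from source and validation status (illustrative)."""
    if obs.source == "satellite":
        return "estimated"   # satellite-derived or interpolated values
    if validate_observation(obs):
        return "qc_passed"   # passed the automated bound and missing-value checks
    return "raw"             # kept, but flagged as unvalidated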

Running this pipeline costs:

  • Compute: ~$50/month (AWS Lambda + S3)
  • Storage: ~$10/month (compressed Parquet)
  • Maintenance: ~10 hours/month (schema updates, new stations)

GARUDA absorbs these costs so you don’t have to.

A carbon MRV (Monitoring, Reporting, Verification) project needed to correlate:

  • BMKG temperature data (for cooling degree days)
  • Carbon intensity data (for emissions calculations)
  • Saka Calendar dates (for cultural alignment in reporting)

Without normalization: 3 weeks of data engineering to integrate sources

With GARUDA: 1 hour to write a query

SELECT
    c.station_id,
    c.saka_sasih,
    AVG(c.temperature_c) AS avg_temp,
    SUM(cb.emissions_kg) AS total_emissions
FROM climate c
JOIN carbon cb ON c.station_id = cb.region_id
WHERE c.timestamp >= '2023-01-01'
  AND c.data_quality = 'qc_passed'
GROUP BY c.station_id, c.saka_sasih
ORDER BY c.saka_sasih;
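
One lightweight way to run this query is DuckDB directly over the partitioned Parquet files; that choice is an assumption for illustration, and the paths below are placeholders for local copies of GARUDA's outputs:

import duckdb

# Expose the Parquet partitions as SQL views (paths are placeholders).
con = duckdb.connect()
con.execute("CREATE VIEW climate AS SELECT * FROM read_parquet('climate/2023/*.parquet')")
con.execute("CREATE VIEW carbon AS SELECT * FROM read_parquet('carbon/2023/*.parquet')")

mrv_sql = """
SELECT c.station_id, c.saka_sasih,
       AVG(c.temperature_c) AS avg_temp,
       SUM(cb.emissions_kg) AS total_emissions
FROM climate c
JOIN carbon cb ON c.station_id = cb.region_id
WHERE c.timestamp >= '2023-01-01' AND c.data_quality = 'qc_passed'
GROUP BY c.station_id, c.saka_sasih
ORDER BY c.saka_sasih
"""
result = con.execute(mrv_sql).fetchall()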

We’re working on:

  • Real-time normalization: Process BMKG data as it arrives
  • Automated schema evolution: Handle new BMKG fields automatically
  • Anomaly detection: Flag suspicious observations
  • Forecast normalization: Standardize BMKG’s weather predictions

Ready to use normalized BMKG data? Get started with GARUDA.