Back to Portfolio
Power BI · SQL · Python · DAX · Machine Learning

Hotel Booking Cancellation Prediction
& Reservation Intelligence

Transforming 119,400 raw hotel reservations into decision-grade revenue, cancellation risk, and guest retention intelligence — across City & Resort Hotel properties.

119.4K Reservations $101.83 Avg ADR 37.0% Cancellation Rate 235K Total Guests 5 Dashboards
119.4K
Total Reservations
37.0%
Cancellation Rate
$101.83
Average ADR
235K
Total Guests
104
Avg Lead Time (days)
3.2%
Repeat Guest Rate
7,464
Parking Requests
3.43
Avg Stay (nights)
12.49%
Room Upgrade Rate
75.2K
Confirmed Bookings
Hotel Performance Overview
Cancellation Risk
Pricing & ADR
Guest Behavior
Strategy
6 Critical Findings
Insight 01
OTA Dependency Risk
Online TA generated the highest booking volume but also the highest cancellation rate at 36.7% — creating structural revenue instability.
→ Prepaid OTA offers · Stricter cancellation policies · Direct booking campaigns
Insight 02
Peak Season Yield Gap
Booking volume and ADR peaked in July–August with ADR reaching $160 in Room Type C — untapped yield management opportunity.
→ Dynamic pricing · Operational surge readiness
Insight 03
Unmonetized Parking Demand
Parking demand remained consistently high across all segments — 7,464 requests — reflecting strong private transport dependence.
→ Premium reserved parking · Capacity management system
Insight 04
Resort Premium Pricing Power
Resort Hotel achieved a higher ADR than City Hotel, reflecting stronger premium pricing power and higher guest spending behavior.
→ Luxury packages · Seasonal upsells · Experience bundles
Insight 05
Transient Dominance
Transient customers contributed 75.1% of all bookings — dominant revenue driver — yet loyalty investment remains minimal.
→ Personalized loyalty tiers · Direct booking incentives
Insight 06
Groups = Highest Risk Segment
Groups segment showed a 61% cancellation rate — the single highest-risk segment causing massive revenue block leakage.
→ Deposit requirements · Contract enforcement · Risk scoring
Cancellation Risk Matrix — Segment × Customer Type
Market SegmentContractGroupTransientTransient-PartyTotal
Aviation0.00%21.10%35.29%21.94%
Complementary0.00%0.00%13.23%12.50%13.06%
Corporate18.18%17.24%18.29%19.72%18.73%
Direct14.29%13.43%15.54%13.55%15.34%
Groups95.92% ⚠0.00%95.69% ⚠31.30%61.06% ⚠
Offline TA/TO9.19%11.85%42.63%26.15%34.32%
Online TA25.84%6.15%38.80%12.52%36.72% ⚠
TOTAL30.96%10.23%40.75%25.43%37.04%
Monthly Cancellation Trend
Jan
1,800
Lowest
Feb
2,700
Rising
Mar
3,100
Climbing
Apr
4,500
High Season ↑
May
4,700
High Season
Jun
4,500
High Season
Jul
4,700
Peak Season
Aug
5,200
⚠ Peak
Sep
4,100
Declining
Oct
4,200
Moderate
Nov
2,100
Low
Dec
2,400
Low
ADR by Customer Type
Transient
$107.0
Contract
$87.5
Transient-Party
$86.1
Group
$83.5

Transient guests pay the most ($107) but cancel the most (40.7%) — highest revenue risk segment.

Booking Risk by Segment
Groups
61.1% 🔴
Online TA
36.7% 🟡
Offline TA/TO
34.3% 🟡
Aviation
21.9%
Corporate
18.7%
Direct
15.3% 🟢
ML Model — Cancellation Risk Predictor
Input Features
lead_time · market_segment · adr · hotel_type · deposit_type · customer_type · previous_cancellations · special_requests
Feature Engineering
Key driver: lead_time × market_segment · ADR normalisation · Deposit type encoding
Model Training
Logistic Regression + Random Forest Ensemble classifier
Risk Score Output
Every booking flagged Low / Medium / High risk at reservation time
Predicted Risk Distribution
🟢 Low Risk60.3% · 72,000 bookings
🟡 Medium Risk28.2% · 33,600 bookings
🔴 High Risk11.5% · 13,700 bookings

High-Risk Triggers

Mandatory deposit request · Reduced cancellation window · Upsell to prepaid rate · Revenue team alert

13,700 high-risk bookings flagged at reservation time — enabling proactive revenue protection.

Code Samples — SQL · DAX · Python
1. Cancellation Rate by Market Segment
-- Cancellation rate per market segment
SELECT
    market_segment,
    COUNT(*) AS total_bookings,
    SUM(CASE WHEN reservation_status = 'Canceled' THEN 1 ELSE 0 END) AS cancellations,
    ROUND(
        SUM(CASE WHEN reservation_status = 'Canceled' THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2
    ) AS cancellation_rate_pct
FROM hotel_bookings
GROUP BY market_segment
ORDER BY cancellation_rate_pct DESC;
2. Monthly ADR Trend by Hotel Type
-- Monthly ADR trend for City vs Resort Hotel
SELECT
    hotel,
    arrival_date_month,
    ROUND(AVG(adr), 2) AS avg_adr,
    COUNT(*) AS total_bookings,
    SUM(CASE WHEN reservation_status = 'Canceled' THEN 1 ELSE 0 END) AS cancellations
FROM hotel_bookings
WHERE adr > 0
GROUP BY hotel, arrival_date_month
ORDER BY
    hotel,
    FIELD(arrival_date_month,
        'January','February','March','April','May','June',
        'July','August','September','October','November','December'
    );
3. Guest Acquisition Flow — Segment to Customer Type
-- Market segment to customer type flow with risk scoring
SELECT
    market_segment,
    customer_type,
    COUNT(*) AS total_bookings,
    ROUND(AVG(adr), 2) AS avg_adr,
    ROUND(
        SUM(CASE WHEN reservation_status = 'Canceled' THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2
    ) AS cancel_rate,
    CASE
        WHEN AVG(lead_time) > 150 AND
             SUM(CASE WHEN reservation_status = 'Canceled' THEN 1 ELSE 0 END) * 100.0 / COUNT(*) > 40
             THEN 'High Risk'
        WHEN SUM(CASE WHEN reservation_status = 'Canceled' THEN 1 ELSE 0 END) * 100.0 / COUNT(*) > 25
             THEN 'Medium Risk'
        ELSE 'Low Risk'
    END AS risk_level
FROM hotel_bookings
GROUP BY market_segment, customer_type
ORDER BY cancel_rate DESC;
1. Core KPI Measures
-- Total Reservations
Total Reservations =
COUNTROWS('HotelBookings')

-- Cancellation Rate
Cancellation Rate =
VAR _cancelled =
    CALCULATE(
        COUNTROWS('HotelBookings'),
        'HotelBookings'[reservation_status] = "Canceled"
    )
RETURN
    DIVIDE(_cancelled, COUNTROWS('HotelBookings'), 0)

-- Average Daily Rate (filtered for valid bookings)
Avg ADR =
CALCULATE(
    AVERAGE('HotelBookings'[adr]),
    'HotelBookings'[adr] > 0
)
2. Dynamic Cancellation Risk Flag
-- Predicted Risk Category (replicates ML output in DAX)
Risk Category =
VAR _leadTime = 'HotelBookings'[lead_time]
VAR _prevCancel = 'HotelBookings'[previous_cancellations]
VAR _depositType = 'HotelBookings'[deposit_type]
VAR _segment = 'HotelBookings'[market_segment]
RETURN
    SWITCH(TRUE(),
        _leadTime > 150
            && _prevCancel > 0
            && _depositType = "No Deposit",
            "High Risk",
        _leadTime > 80
            && OR(_segment = "Online TA", _segment = "Groups"),
            "Medium Risk",
        "Low Risk"
    )

-- Revenue at Risk from High-Risk Bookings
Revenue at Risk =
CALCULATE(
    SUMX(
        'HotelBookings',
        'HotelBookings'[adr] * 'HotelBookings'[stays_in_week_nights]
        + 'HotelBookings'[adr] * 'HotelBookings'[stays_in_weekend_nights]
    ),
    [Risk Category] = "High Risk"
)
3. Time Intelligence — MoM Cancellation Change
-- Month-over-Month cancellation change
Cancellations MoM % =
VAR _currentMonth =
    CALCULATE(
        COUNTROWS('HotelBookings'),
        'HotelBookings'[reservation_status] = "Canceled"
    )
VAR _prevMonth =
    CALCULATE(
        COUNTROWS('HotelBookings'),
        'HotelBookings'[reservation_status] = "Canceled",
        DATEADD('DateTable'[Date], -1, MONTH)
    )
RETURN
    DIVIDE(_currentMonth - _prevMonth, _prevMonth, 0)
1. Exploratory Data Analysis
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
df = pd.read_csv('hotel_bookings.csv')

# Basic overview
print(df.shape)         # (119390, 32)
print(df.isnull().sum()) # Check nulls

# Cancellation rate by market segment
cancel_rate = (
    df.groupby('market_segment')['is_canceled']
    .agg(['sum', 'count'])
    .assign(rate=lambda x: x['sum'] / x['count'] * 100)
    .sort_values('rate', ascending=False)
)
print(cancel_rate)

# ADR distribution by hotel type
fig, ax = plt.subplots(figsize=(10, 5))
sns.boxplot(
    data=df[df['adr'] > 0],
    x='hotel', y='adr',
    hue='reservation_status',
    palette=['#4f46e5', '#ef4444'], ax=ax
)
ax.set_title('ADR Distribution: Hotel × Booking Status')
plt.tight_layout()
plt.show()
2. Feature Engineering
# Feature engineering for ML model
df['total_nights'] = (
    df['stays_in_weekend_nights'] + df['stays_in_week_nights']
)
df['total_guests'] = (
    df['adults'] + df['children'].fillna(0) + df['babies']
)
df['revenue_estimate'] = df['adr'] * df['total_nights']

# Encode categorical features
from sklearn.preprocessing import LabelEncoder
cat_cols = [
    'hotel', 'market_segment', 'customer_type',
    'deposit_type', 'distribution_channel'
]
le = LabelEncoder()
for col in cat_cols:
    df[col + '_enc'] = le.fit_transform(df[col].astype('str'))

# Select final features
features = [
    'lead_time', 'adr', 'total_nights', 'total_guests',
    'previous_cancellations', 'total_of_special_requests',
    'booking_changes', 'hotel_enc', 'market_segment_enc',
    'customer_type_enc', 'deposit_type_enc'
]
X = df[features]
y = df['is_canceled']
3. ML Model Training & Risk Scoring
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
import numpy as np

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Random Forest model
rf = RandomForestClassifier(
    n_estimators=200, max_depth=12,
    random_state=42, class_weight='balanced'
)
rf.fit(X_train, y_train)

# Evaluate
y_pred = rf.predict(X_test)
y_prob = rf.predict_proba(X_test)[:, 1]
print(classification_report(y_test, y_pred))
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.4f}")

# Assign risk tiers based on probability score
df_test = X_test.copy()
df_test['cancel_prob'] = y_prob
df_test['risk_tier'] = pd.cut(
    df_test['cancel_prob'],
    bins=[0, 0.35, 0.65, 1.0],
    labels=['Low Risk', 'Medium Risk', 'High Risk']
)
print(df_test['risk_tier'].value_counts(normalize=True))
Strategic Action Pillars
01
Reduce Cancellation Losses
  • Predictive risk flags at booking time
  • OTA deposit mandates
  • Lead time cancellation policies
  • Group contract enforcement
  • Advance payment triggers
02
Grow Revenue & ADR
  • Dynamic peak-season pricing
  • Resort luxury bundles
  • Yield management system
  • Room type upsell automation
  • Premium parking monetization
03
Retain Guests Directly
  • Personalized loyalty tiers
  • Transient direct booking offers
  • Member-exclusive rates
  • Website & mobile UX optimization
  • Exclusive member discounts
Tools & Technologies
Power BI DAX SQL PostgreSQL Python Pandas Scikit-learn Power Query (M) Jupyter Notebook Data Modeling
Key Features
ML Cancellation Risk Predictor
Monthly ADR trend analysis
Segment × customer risk matrix
Sankey acquisition flow chart
Global guest distribution map
Dynamic City / Resort filter
ADR distribution box plots
Monthly cancellation trend
6-insight strategy dashboard
Mohd Faij
Mohd Faij
Data Analyst
SQL · Excel · Power BI · Python · Turning Data into Business Insights