ML Interview Exercise - Delivery Time Prediction

Welcome!

You'll be working with historical order and shipment data from an e-commerce logistics platform to build and improve a delivery time prediction model. This exercise simulates real-world ML engineering challenges you might encounter when optimizing supply chain operations.

The Context

You're helping an e-commerce logistics platform improve its delivery time predictions. The platform connects multiple shippers (merchants) with various carriers (FedEx, UPS, USPS, DHL) to deliver packages across the US. Accurate delivery predictions are crucial for:

Setting customer expectations at checkout
Optimizing carrier selection
Managing warehouse operations
Improving customer satisfaction

What You'll Be Doing

Explore the Data - Dive into historical shipment records to understand delivery patterns and spot any interesting anomalies
Review the Model - Examine how we currently predict delivery times using package dimensions, distance, carrier, and service level
Problem Solve - Identify data quality issues and propose model improvements
Design for Production - Discuss how to deploy, monitor, and maintain this system at scale

Setup Instructions

Prerequisites

Python 3.12+
Basic familiarity with pandas and scikit-learn/XGBoost

Quick Start

# Clone the repository
git clone https://github.com/stordco/ai-team-ds-interview-challenge.git
cd ai-team-ds-interview-challenge

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Train the baseline model
python -m src.train --save

# Try making some predictions!
python -m src.predict --interactive

What to Focus On

Data Exploration Tips

Look for Patterns: Are certain carriers or routes consistently faster/slower?
Spot Anomalies: Any surprising delivery times that don't make sense?
Seasonal Effects: Do delivery times vary by time of year?
Geographic Quirks: Some state pairs might have unexpected behavior
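
One way to start on the questions above is a per-group summary plus a simple outlier rule. The sketch below runs on a tiny made-up table in place of data/training_data.csv. The 1.5 * IQR fence and the rule that negative delivery_days are suspect are assumptions chosen for illustration, not rules taken from this repository.

```python
import pandas as pd

# Tiny synthetic stand-in for data/training_data.csv (values are made up).
df = pd.DataFrame({
    "carrier": ["FedEx", "FedEx", "UPS", "UPS", "USPS", "USPS", "DHL", "DHL"],
    "service_level": ["two_day", "standard", "two_day", "standard",
                      "standard", "economy", "overnight", "standard"],
    "delivery_days": [2, 4, 2, 5, 6, 45, 1, -1],
})


def summarize_by_group(frame, keys):
    """Median, mean, and row count of delivery_days per group, slowest first."""
    return (
        frame.groupby(keys)["delivery_days"]
        .agg(["median", "mean", "count"])
        .sort_values("median", ascending=False)
    )


def flag_suspect_rows(frame, col="delivery_days", k=1.5):
    """Rows with a negative value or a value outside the IQR fences for `col`."""
    q1, q3 = frame[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    mask = (frame[col] < 0) | (frame[col] < low) | (frame[col] > high)
    return frame[mask]


by_carrier = summarize_by_group(df, ["carrier"])
suspects = flag_suspect_rows(df)
print(by_carrier)
print(suspects)
```

Passing ["from_state", "to_state"] or ["month"] as keys gives the same summary for routes or seasons. Flagged rows are candidates for a closer look, not automatic deletions.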

Model Improvement Ideas

Feature Engineering: What additional features could improve predictions?
Model Selection: Is XGBoost the best choice? What alternatives might work?
Validation Strategy: How should we split data for time-series delivery predictions?
Performance Metrics: Beyond RMSE, what metrics matter for the business?
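
For the validation and metrics questions, a minimal sketch follows. It assumes a ship_date column, which is hypothetical: the documented features only include day_of_week and month, so the real ordering column may differ. The one-day tolerance and the "late" definition (actual delivery after the predicted day) are also assumptions to discuss, not business rules from this repository.

```python
import numpy as np
import pandas as pd


def time_ordered_split(frame, order_col, test_fraction=0.2):
    """Train on the earliest rows and test on the latest ones (no shuffling)."""
    ordered = frame.sort_values(order_col, kind="stable").reset_index(drop=True)
    n_test = max(1, round(len(ordered) * test_fraction))
    return ordered.iloc[:-n_test], ordered.iloc[-n_test:]


def business_metrics(actual, predicted, tolerance_days=1.0):
    """RMSE and MAE plus two rates that are easier to explain to stakeholders."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    error = predicted - actual
    return {
        "rmse": float(np.sqrt(np.mean(error ** 2))),
        "mae": float(np.mean(np.abs(error))),
        # Share of predictions within +/- tolerance_days of the actual value.
        "within_tolerance": float(np.mean(np.abs(error) <= tolerance_days)),
        # Share of packages that arrived later than predicted.
        "late_rate": float(np.mean(actual > predicted)),
    }


# Made-up rows; ship_date is a hypothetical column name.
shipments = pd.DataFrame({
    "ship_date": pd.to_datetime(
        ["2024-03-05", "2024-01-10", "2024-02-20", "2024-04-01", "2024-01-25"]
    ),
    "delivery_days": [3, 2, 5, 4, 2],
})
train, test = time_ordered_split(shipments, "ship_date", test_fraction=0.4)

metrics = business_metrics(actual=[2, 3, 5, 4], predicted=[2, 4, 3, 4.5])
print(metrics)
```

A late package usually costs more than an early one, so late_rate and an asymmetric view of the error can matter more than RMSE alone.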

Production Considerations

Real-time vs Batch: When do predictions need to happen?
Model Monitoring: How do we know if the model degrades?
Retraining Pipeline: How often should we update the model?
Business Segments: Should different shippers get different models?
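
For the monitoring question, one common input-drift check is the Population Stability Index (PSI), which compares the distribution of a feature at training time with its distribution in recent traffic. The sketch below is a generic implementation, not code from this repository. The data is randomly generated, and the often-quoted 0.1 and 0.25 alert thresholds are rules of thumb rather than guarantees.

```python
import numpy as np


def population_stability_index(reference, current, bins=10, eps=1e-6):
    """PSI between a reference sample and a current sample of one numeric feature."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)

    # Interior cut points come from reference quantiles, so each bin holds
    # roughly the same share of the reference data. Duplicates are dropped.
    quantiles = np.linspace(0.0, 1.0, bins + 1)[1:-1]
    cuts = np.unique(np.quantile(reference, quantiles))
    n_bins = len(cuts) + 1

    # searchsorted maps every value to a bin, including values outside the
    # reference range, which land in the first or last bin.
    ref_idx = np.searchsorted(cuts, reference, side="right")
    cur_idx = np.searchsorted(cuts, current, side="right")
    ref_share = np.bincount(ref_idx, minlength=n_bins) / len(reference)
    cur_share = np.bincount(cur_idx, minlength=n_bins) / len(current)

    # Clip to avoid log(0) and division by zero for empty bins.
    ref_share = np.clip(ref_share, eps, None)
    cur_share = np.clip(cur_share, eps, None)
    return float(np.sum((cur_share - ref_share) * np.log(cur_share / ref_share)))


rng = np.random.default_rng(0)
training_distance = rng.normal(loc=500, scale=100, size=5000)  # synthetic
recent_distance = training_distance + 150                      # simulated shift

psi_same = population_stability_index(training_distance, training_distance)
psi_shifted = population_stability_index(training_distance, recent_distance)
print(psi_same, psi_shifted)
```

Input drift is only half of the picture. Because delivery_days becomes known once a package arrives, the prediction error itself can also be tracked over time, with a lag equal to the delivery time.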

Resources Available

Training Data: Pre-extracted training data in data/training_data.csv
Predictions: Test predictions using python -m src.predict --interactive

Helpful Commands to Get Started

# Quick data exploration
python -c "
import pandas as pd
df = pd.read_csv('data/training_data.csv')
print('Dataset shape:', df.shape)
print('\nFirst few rows:')
print(df.head())
print('\nBasic statistics:')
print(df.describe())
"

# Interactive prediction mode (recommended for testing)
python -m src.predict --interactive

# Make a specific prediction
python -m src.predict --carrier FedEx --service-level ground \
  --distance 500 --weight 2.5 --length 10 --width 8 --height 6 \
  --from-state CA --to-state NY

# Batch predictions from JSON
echo '{"carrier": "FedEx", "service_level": "ground", "distance_miles": 500,
  "weight": 2.5, "length": 10, "width": 8, "height": 6}' > request.json
python -m src.predict --json-file request.json
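
If shell quoting gets in the way, the same request file can be written from Python. This is a small sketch that reuses only the field names from the echo example above; whether src.predict accepts or requires other fields (for example the origin and destination states) is not stated here, so check the script before relying on it.

```python
import json
from pathlib import Path

# Same fields as the echo example above; nothing else is assumed.
request = {
    "carrier": "FedEx",
    "service_level": "ground",
    "distance_miles": 500,
    "weight": 2.5,
    "length": 10,
    "width": 8,
    "height": 6,
}


def write_request(payload, path="request.json"):
    """Serialize the payload as indented JSON and return the path written."""
    path = Path(path)
    path.write_text(json.dumps(payload, indent=2) + "\n", encoding="utf-8")
    return path


saved = write_request(request)
print(f"Wrote {saved}")
```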

About the Data

You're working with real-world-inspired historical shipment data from our e-commerce logistics platform. The dataset represents thousands of completed deliveries with known outcomes.

What's in the Training Data

The data/training_data.csv file contains pre-engineered features from our order and shipment history. The BigQuery SQL that was used to query this data is in queries/parcel_features.sql.

Package Characteristics:

height, length, width - Package dimensions (inches)
weight - Package weight (pounds)

Shipping Details:

distance_miles - Calculated distance between origin and destination
carrier - The shipping company (FedEx, UPS, USPS, DHL)
service_level - Speed of delivery (economy, standard, three_day, two_day, overnight)

Geographic Information:

from_state, to_state - Origin and destination states
Route characteristics that affect delivery times

Temporal Context:

day_of_week - When the package was shipped
month - Captures seasonal patterns
Holiday and peak season indicators

What We're Predicting:

delivery_days - The actual number of days it took to deliver (our target variable)
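
The columns described above can be combined into derived features. The sketch below is one possible starting point, not the repository's feature pipeline. The dimensional-weight divisor of 139 is an assumption (a commonly quoted value for inches and pounds; carriers and services differ), and the rows are made up.

```python
import pandas as pd

# Assumption: cubic inches per pound used for dimensional weight.
DIM_DIVISOR = 139.0


def add_package_features(frame, dim_divisor=DIM_DIVISOR):
    """Return a copy of `frame` with size- and route-based columns added."""
    out = frame.copy()
    out["volume_cubic_in"] = out["length"] * out["width"] * out["height"]
    out["dim_weight"] = out["volume_cubic_in"] / dim_divisor
    # Carriers commonly price on the larger of actual and dimensional weight.
    out["billable_weight"] = out[["weight", "dim_weight"]].max(axis=1)
    out["same_state"] = (out["from_state"] == out["to_state"]).astype(int)
    return out


# Made-up rows using the documented column names.
packages = pd.DataFrame({
    "length": [10, 12],
    "width": [8, 12],
    "height": [6, 12],
    "weight": [2.5, 20],
    "from_state": ["CA", "TX"],
    "to_state": ["NY", "TX"],
})
features = add_package_features(packages)
print(features)
```

Whether any of these columns help is an empirical question; a tree-based model can already approximate some of these interactions from the raw dimensions.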

Current Model Approach

We're using XGBoost to predict delivery times based on the features above. The current implementation is basic - think of it as an MVP that needs your expertise to improve.
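
Before changing the model, it can help to have a naive reference point to measure it against. The sketch below predicts the median delivery_days for each (carrier, service_level) pair and falls back to the overall median for unseen pairs. It is an illustrative baseline on made-up rows, not part of the current implementation.

```python
import pandas as pd

KEYS = ["carrier", "service_level"]


def fit_group_median(train, keys=KEYS, target="delivery_days"):
    """Learn per-group medians and the overall median from the training rows."""
    return train.groupby(keys)[target].median(), train[target].median()


def predict_group_median(frame, medians, fallback, keys=KEYS):
    """Look up each row's group median; unseen groups get the overall median."""
    joined = frame[keys].join(medians.rename("prediction"), on=keys)
    return joined["prediction"].fillna(fallback)


# Made-up rows using the documented column names.
train = pd.DataFrame({
    "carrier": ["FedEx", "FedEx", "UPS", "UPS", "UPS"],
    "service_level": ["two_day", "two_day", "standard", "standard", "standard"],
    "delivery_days": [2, 3, 4, 6, 5],
})
new_orders = pd.DataFrame({
    "carrier": ["FedEx", "DHL"],
    "service_level": ["two_day", "overnight"],
})

medians, overall = fit_group_median(train)
preds = predict_group_median(new_orders, medians, overall)
print(preds.tolist())
```

If the XGBoost model does not clearly beat a lookup table like this on held-out data, that is a useful finding in itself.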

Interview Structure

This is a collaborative technical discussion, not a test! We want to see how you think about ML problems and work with existing code.

Duration: ~90 minutes
Format: Screen sharing with live coding/analysis

What we're looking for:

How you explore and understand data
Your approach to debugging ML issues
Ideas for improving model performance
Thoughts on productionizing ML systems

Feel free to:

Ask questions about the business context
Think out loud as you explore
Use any tools or libraries you're comfortable with
Google things - we all do it!

Let's have fun exploring this delivery prediction challenge together!
