Over the past few months, I worked on one of the most challenging and rewarding projects of my graduate program: building a machine learning model to predict regional construction costs across the United States. The goal was to create a civilian-focused Area Cost Factor (ACF) model, something that could serve as a more dynamic alternative to the static indices published by organizations like the U.S. Army Corps of Engineers.

I started by digging into Construction Check’s internal database of commercial project costs and then layered in external data from a wide range of sources: Bureau of Labor Statistics wage data, Home Depot material pricing (which I pulled using SerpAPI), FEMA’s natural hazard risk indices, NOAA storm event data, and labor productivity metrics. After cleaning and merging these very different datasets, I inflation-adjusted all costs to 2025 dollars and normalized everything relative to Atlanta, which we used as the baseline city.
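For readers who want to see the arithmetic behind that last step, here is a minimal sketch of inflation-adjusting costs to 2025 dollars and expressing each city relative to Atlanta. It is an illustration, not the project's actual pipeline: the index values, the field names (`city`, `year`, `cost_per_sqft`), the use of cost per square foot, and the choice of a per-city median are all assumptions made for the example.

```python
from statistics import median

def adjust_to_target_year(cost, year, index_by_year, target_year=2025):
    """Scale a historical cost into target-year dollars using a cost index."""
    return cost * index_by_year[target_year] / index_by_year[year]

def relative_cost_factors(projects, index_by_year,
                          baseline_city="Atlanta", target_year=2025):
    """Return each city's typical adjusted cost divided by the baseline city's."""
    adjusted_by_city = {}
    for project in projects:
        adjusted = adjust_to_target_year(
            project["cost_per_sqft"], project["year"], index_by_year, target_year
        )
        adjusted_by_city.setdefault(project["city"], []).append(adjusted)

    # Median is an assumed aggregation; it is less sensitive to outliers than the mean.
    city_typical = {city: median(costs) for city, costs in adjusted_by_city.items()}
    baseline = city_typical[baseline_city]
    return {city: value / baseline for city, value in city_typical.items()}

# Made-up index values and project records, for illustration only.
index_by_year = {2020: 100.0, 2025: 125.0}
projects = [
    {"city": "Atlanta", "year": 2020, "cost_per_sqft": 160.0},
    {"city": "Atlanta", "year": 2025, "cost_per_sqft": 200.0},
    {"city": "Denver", "year": 2020, "cost_per_sqft": 200.0},
    {"city": "Denver", "year": 2025, "cost_per_sqft": 250.0},
]
factors = relative_cost_factors(projects, index_by_year)
```

With this setup the baseline city always comes out at 1.0, and every other city reads as a multiple of Atlanta's cost.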

The modeling process was iterative and taught me a lot about how messy and complex real-world data can be. I filtered outliers, engineered features like a square-footage/year interaction term, and even used clustering to better structure the dataset before training the models. I tested Linear Regression, Random Forest, XGBoost, and LightGBM, and in the end, LightGBM came out on top with an R² of 0.665 and a MAPE of 31.7%. It was able to predict relative construction costs for 102 U.S. metro areas, which makes it essentially a living cost factor model that adjusts to actual market data.
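Since R² and MAPE carry the model comparison, here is a minimal sketch of how the two metrics are defined, written in plain Python so the formulas are visible. The sample values are made up and have nothing to do with the project's results; in practice a library such as scikit-learn would typically be used instead.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent. Assumes no true value is zero."""
    errors = [abs((t - p) / t) for t, p in zip(y_true, y_pred)]
    return 100 * sum(errors) / len(errors)

# Illustrative relative cost factors (hypothetical values).
actual = [1.0, 2.0, 4.0]
predicted = [1.1, 1.8, 4.4]

r2 = r_squared(actual, predicted)
error_pct = mape(actual, predicted)
```

The two numbers answer different questions: R² measures how much of the variation across projects the model explains, while MAPE measures how far off a typical prediction is in percentage terms, which is why a model can score reasonably on one and still look rough on the other.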

It wasn’t easy. Accessing detailed construction cost data was a huge hurdle, since most commercial datasets like RSMeans are behind paywalls, so I had to get creative with free sources. Merging datasets built on different geographic definitions was another challenge, and I had to manually reconcile cities that didn’t line up. Finally, project square footage had such a strong influence on the predictions that avoiding overfitting was a constant concern.

But the outcome was worth it. The project was a great exercise in applying machine learning to adaptively capture local cost differences, something static indices simply can’t do. It’s exciting to think about how this kind of model could make budgeting and planning more accurate for public agencies, developers, and planners working in different parts of the country.