Kommuner Clustering - Mikkel Bøhnke

Interactive map

Hover over a municipality to see its name, its cluster, and the six feature values that drove the classification. The radar chart below the map shows each cluster's socioeconomic profile relative to the others.

What it does

This project applies unsupervised machine learning to official Danish register data to answer a question that is simple to ask but hard to answer by inspection: which of Denmark's 98 municipalities are structurally similar to each other, and what separates the groups?

The pipeline fetches six socioeconomic indicators for every municipality from the Statistics Denmark StatBank API, scales and clusters them using K-means (k=5, validated by silhouette analysis), and generates a self-contained interactive choropleth. The map is rebuilt quarterly by a GitHub Actions workflow as DST publishes new data.

Data and features

All data comes from the Statistics Denmark StatBank API, which provides free programmatic access to Danish official statistics under a CC BY 4.0 licence. Six features were selected to capture distinct socioeconomic dimensions without introducing collinear variables.

Feature	DST table	What it captures
Elderly share (65+)	`FOLK1A`	Demographic ageing pressure on municipal services
Youth share (0–17)	`FOLK1A`	Demand for schools and childcare
Unemployment rate	`AUP01`	Labour market health
Median disposable income	`INDKP101`	Household prosperity
Higher education share	`HFUDD11`	Human capital concentration
Social housing share	`BOL101`	Housing structure and affordability pressure
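
A request for one of these tables might be built roughly as follows. This is a minimal sketch rather than the project's actual fetch code: the endpoint URL, the variable codes (`OMRÅDE`, `ALDER`, `Tid`), the period `2024K1`, and the helper names `build_request` and `fetch_table` are assumptions that should be checked against the StatBank API console before use.

```python
import json
import urllib.request

# Assumed endpoint of the StatBank data API; verify against the DST API console.
STATBANK_DATA_URL = "https://api.statbank.dk/v1/data"


def build_request(table_id, variables, fmt="CSV"):
    """Build the JSON body for a StatBank data request (hypothetical helper).

    `variables` maps a variable code to the list of values wanted;
    "*" asks for every value of that variable.
    """
    return {
        "table": table_id,
        "format": fmt,
        "variables": [
            {"code": code, "values": list(values)}
            for code, values in variables.items()
        ],
    }


def fetch_table(table_id, variables, timeout=30):
    """POST the request and return the response body as text (network call)."""
    body = json.dumps(build_request(table_id, variables)).encode("utf-8")
    request = urllib.request.Request(
        STATBANK_DATA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Assumed selection for the age-based features: all areas, all ages, one quarter.
payload = build_request(
    "FOLK1A",
    {"OMRÅDE": ["*"], "ALDER": ["*"], "Tid": ["2024K1"]},
)
print(json.dumps(payload, indent=2))
```

Only `build_request` runs here; `fetch_table` is defined but not called, so the sketch can be read and tested without network access.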

Method

Missing values - which arise when DST suppresses counts for small municipalities - are imputed using column medians before scaling. All six features are then standardised with StandardScaler so that no single variable dominates by magnitude. The cluster count k=5 was selected by combining the elbow method on inertia with silhouette score maximisation across k=2 through k=10.
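
The preprocessing and k-selection steps can be sketched as below. The data is a synthetic stand-in generated with `make_blobs`, not DST data, and the variable names are illustrative; only the median imputation, `StandardScaler`, and the k=2 to k=10 sweep come from the description above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.impute import SimpleImputer
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 98 x 6 feature matrix (not real register data).
X, _ = make_blobs(n_samples=98, n_features=6, centers=5, random_state=0)

# Blank out a few cells to mimic suppressed counts.
rng = np.random.default_rng(0)
rows = rng.integers(0, X.shape[0], size=8)
cols = rng.integers(0, X.shape[1], size=8)
X[rows, cols] = np.nan

# Median imputation first, then standardisation to zero mean and unit variance.
X_imputed = SimpleImputer(strategy="median").fit_transform(X)
X_scaled = StandardScaler().fit_transform(X_imputed)

# Sweep k and record inertia (for the elbow plot) and silhouette score.
inertias = {}
silhouettes = {}
for k in range(2, 11):
    model = KMeans(n_clusters=k, n_init=20, random_state=42).fit(X_scaled)
    inertias[k] = model.inertia_
    silhouettes[k] = silhouette_score(X_scaled, model.labels_)

best_k = max(silhouettes, key=silhouettes.get)
for k in sorted(silhouettes):
    print(f"k={k:2d}  inertia={inertias[k]:8.2f}  silhouette={silhouettes[k]:.3f}")
print("silhouette-maximising k:", best_k)
```

The silhouette-maximising k is only one input; the elbow in the inertia curve still has to be read by eye or with a separate heuristic.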

K-means (n_init=20, random_state=42) was run alongside agglomerative hierarchical clustering with Ward linkage as an independent cross-check. The two methods' label agreement is quantified with the Adjusted Rand Index (ARI), which is 1 for identical partitions and close to 0 for agreement no better than chance. On the real data, an ARI above 0.85 indicates that both methods recover largely the same partition, so the cluster structure is unlikely to be an artefact of the K-means initialisation.
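
A minimal sketch of that cross-check, again on synthetic `make_blobs` data rather than the real feature matrix, so the ARI it prints says nothing about the Danish result:

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled 98 x 6 feature matrix.
X, _ = make_blobs(n_samples=98, n_features=6, centers=5, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Same k for both methods so the partitions are directly comparable.
kmeans_labels = KMeans(n_clusters=5, n_init=20, random_state=42).fit_predict(X_scaled)
ward_labels = AgglomerativeClustering(n_clusters=5, linkage="ward").fit_predict(X_scaled)

# ARI compares partitions, not label values, so cluster numbering does not matter.
ari = adjusted_rand_score(kmeans_labels, ward_labels)
print(f"ARI between K-means and Ward: {ari:.3f}")
```

Because ARI is invariant to how clusters are numbered, there is no need to align the K-means labels with the Ward labels before comparing them.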

Deployment

The analysis runs as a GitHub Actions workflow on a quarterly schedule. The pipeline fetches fresh data from the DST API, runs the full clustering, writes outputs/map.html as a self-contained Plotly HTML file, and commits it back to the repository. GitHub Pages then serves the file with no build step or server required.

Total cost: zero. Statistics Denmark's API is free and CC-licensed. GitHub Actions free minutes cover the quarterly run with ease. GitHub Pages serves the static output at no charge.

What I learned

The most interesting methodological question was feature selection. Several plausible variables - population density, net migration, tax base per capita - are correlated with the six chosen features. Including them would not have improved the clustering: because K-means relies on Euclidean distance, correlated features count the same underlying dimension more than once and inflate its weight in the result. Deciding what to leave out turned out to be more consequential than what to include.
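
A toy calculation, using made-up points rather than project data, shows the effect in its most extreme form: a perfectly duplicated feature doubles that dimension's contribution to the squared Euclidean distance that K-means minimises.

```python
import numpy as np

b = np.array([0.0, 0.0])  # reference point
a = np.array([1.0, 0.0])  # differs from b only on feature 0
c = np.array([0.0, 1.0])  # differs from b only on feature 1


def sq_dist(u, v):
    """Squared Euclidean distance, the quantity K-means sums within clusters."""
    return float(np.sum((u - v) ** 2))


def with_duplicate(u):
    """Append a copy of feature 0, i.e. a perfectly collinear extra column."""
    return np.append(u, u[0])


before = (sq_dist(a, b), sq_dist(c, b))
after = (
    sq_dist(with_duplicate(a), with_duplicate(b)),
    sq_dist(with_duplicate(c), with_duplicate(b)),
)
print("before:", before)  # (1.0, 1.0): both features weigh the same
print("after: ", after)   # (2.0, 1.0): feature 0 now counts twice
```

Real correlated variables are rarely exact duplicates, so the inflation is smaller than a factor of two, but it pushes in the same direction.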

On the production side, the DST API occasionally returns suppressed values (marked as missing) for small municipalities with populations under roughly 5,000. Writing a robust imputation step before scaling was essential - a naive approach would have crashed or produced distorted clusters for islands like Samsø, Ærø, and Læsø. The defensive design is documented in the code so it is immediately visible during a code review.
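
A defensive parsing step along these lines might look as follows. The `".."` marker, the column names, and the values are assumptions for illustration only; the point is that anything non-numeric is coerced to NaN and filled before the data reaches the scaler.

```python
import pandas as pd

# Made-up rows; ".." stands in for whatever marker the API uses for suppressed cells.
raw = pd.DataFrame(
    {
        "municipality": ["Municipality A", "Municipality B", "Municipality C", "Municipality D"],
        "unemployment_rate": ["3.1", "..", "2.4", "4.0"],
    }
)

# Coerce every non-numeric marker to NaN instead of letting it raise later.
rates = pd.to_numeric(raw["unemployment_rate"], errors="coerce")
n_missing = int(rates.isna().sum())

# Fill the gaps with the column median, computed from the observed values only.
median = rates.median()
clean = raw.assign(unemployment_rate=rates.fillna(median))

# Fail loudly if anything is still missing before clustering.
if clean["unemployment_rate"].isna().any():
    raise ValueError("unexpected missing values after imputation")

print(f"imputed {n_missing} value(s) with median {median}")
print(clean)
```

Raising on leftover NaN values keeps a silent data problem from turning into a confusing error inside the clustering step.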

The output format - a single self-contained HTML file served via GitHub Pages - reflects a deliberate choice about what "deployed" means for a portfolio project. It is live, shareable, and inspectable by anyone with a link, without requiring a server, a database, or an active process.