Background
I took CSC 422 Automated Learning and Data Analysis in Spring 2024. This course explored machine learning techniques, from decision trees to k-NN, to linear/logistic regression, to neural networks, to clustering. Practical application of course content was taught in Python.
Final project
The parameters of the final course project were to explore a data mining problem with a “novelty” factor.
Examining patterns or bias of pedestrian identification for autonomous vehicles
I worked with two teammates (J. Donahue, K. Henry) to evaluate results from a preprint, Bias Behind the Wheel: Fairness Analysis of Autonomous Driving Systems. The original paper discovered that many models were biased against women, dark skin tones, and children. As a team, we sought to quantify the risk.
We developed the project with a Jupyter notebook in Python. Relevant libraries included pandas, NumPy, Matplotlib, and scikit-learn.
We decided to use a clustering approach, playing with K-Means and Bisecting K-Means techniques.
k = 1
In comparing the K-Means analysis and the Bisecting K-Means analysis, it was discovered that there was no significant difference when the models were examined with one cluster.
k = 2
However, when the models were compared using two clusters, a pattern emerged. In the K-Means analysis, most of the clusters fell into two distinct groups, one biased against adult male pedestrians and one biased against child female pedestrians. In Bisecting K-Means, a similar pattern emerged, however, because of how new centroids are calculated, there are also two outlier points that do not fit into the two groups. The presence outlier points in Bisecting K-Means are likely due to the hierarchical nature of how the procedure elects to create new clusters, which puts lower priority on the bias that these models display.
k = 3
With three clusters, the K-Means model revealed the same information as with two clusters: there is bias against adult male and child female pedestrians. However, breaking the observations into three clusters also reveals that there is quite a bit of variation for racial bias. In comparison, the Bisecting K-Means with three clusters have a result that reveals a similar pattern, so it does not deviate from the K-Means analysis in the same way that the analysis with two clusters does.
This work confirms the findings of Li et al., demonstrating that there is significant bias within AV pedestrian detection models. When examining clusters of specific subgroups instead of aggregating all data points, there seems to be a bias against male adults and female children, with variations in how racial bias can appear. The results reveal that bias in AV pedestrian detection is based on several factors interacting with each other, which explains why certain groups are more vulnerable than others.
Visualization
Image below demonstrates result of K-means with one cluster.