Saturday March 11th
AI-generated image, created with Midjourney
Data science is becoming more relevant with each passing day, as though it wasn't already omnipresent. Think about the billions, if not trillions, of useful data points that are captured every moment. From the weather to traffic to the endless content generated on social media platforms, there is a constant and mind-boggling torrent of data that accumulates into an incredibly powerful resource - one that is worth billions in and of itself. The existence of this flood of data is often concealed from us, both inside the apps we trust and use without a second thought and by large corporations. On the other side of the tiny portable mirrors that lead us into a fantastical virtual world sit thousands of employees and scientists, carefully studying our data alongside that of millions of others, just out of sight. It's really quite frightening if you think about it!
The use of data and data privacy have increasingly become hot-button topics in government and in wider society. Data has proven to be a powerful weapon that gives producers an edge, both in the information war with one another and over consumers. We can defend ourselves by understanding the motives various organisations have for using our data. As the common saying goes, "if you aren't paying for the product, you are the product."
Let's start with the most downloaded and regularly used class of apps in the world - social media. Have you ever considered how an app like Snapchat makes money? Or what about Instagram? Are they truly the pure, charitable organisations they claim to be, whose only aim is to facilitate communication through technology? And what about their algorithms? The staple of the modern social media app is an infinite feed of information, so how do they decide what to show you? It's by asking questions like these that we can be better informed.
That said, not all companies use data to scam you or exploit you. One good example of this is data-informed ads. Everyone wins: you get more useful ads, the platform gets better ad services and can charge higher rates, and advertisers get more effective ads with better click-through rates (CTR) at lower overall costs. This is subtly (yet crucially) different, however, from targeted ads, which allow advertisers to aim their ads at specific groups of people. Targeting has proven adverse effects, particularly when manipulative and malicious actors use it to directly target and oppress particular demographics, as in the case of Cambridge Analytica. If we simply understand the algorithms and exercise our better judgement, taking a moment to step back and ask whether the content we consume on a daily basis truly deserves the time we spend on it, then there is nothing to fear, and data can become a valuable tool for fixing systemic issues.
Take, for instance, the issue of car accidents. By creating meaningful tools to truly explore data and perform insightful analysis, we can learn important lessons about road structure and its fundamental relationship with car crashes. This is a well-documented and easily accessible topic, with resources such as this dataset from the SWITRS system for car crashes in California on Kaggle. From this I created an application (pictures below) which combines unsupervised machine learning with an easy-to-explore map interface to identify clusters of car crashes by querying particular quadrants. I performed the clustering using the K-Means algorithm, in conjunction with the elbow method to find a suitable number of clusters.
The top graph shows a map of California with each data point plotted as a point.
The bottom graph shows a 3D map of the collisions that occur in discrete 0.05° longitude by 0.05° latitude boxes.
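The boxes in the bottom graph can be approximated by snapping each crash to a grid cell and counting how many land in each one. Here is a minimal sketch, assuming the crashes are available as (longitude, latitude) pairs. The function name `bin_collisions` and the sample coordinates are made up for illustration and are not taken from the application or the dataset.

```python
import math
from collections import Counter


def bin_collisions(points, cell=0.05):
    """Count points per grid cell of `cell` degrees on each side.

    Hypothetical helper: returns a Counter keyed by integer
    (longitude index, latitude index); the south-west corner of a
    cell is (index_lon * cell, index_lat * cell).
    """
    counts = Counter()
    for lon, lat in points:
        # floor() keeps negative longitudes in the correct cell.
        key = (math.floor(lon / cell), math.floor(lat / cell))
        counts[key] += 1
    return counts


if __name__ == "__main__":
    # Illustrative coordinates only, not real crash records.
    sample = [(-118.26, 34.06), (-118.27, 34.07), (-118.21, 34.06)]
    for (ix, iy), n in bin_collisions(sample).items():
        print(f"cell starting at ({ix * 0.05:.2f}, {iy * 0.05:.2f}): {n} collisions")
```

The count in each cell would then become the height of the corresponding bar in a 3D plot.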
Clustering on a local area, with each cluster shown in a different colour.
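To give a flavour of how K-Means and the elbow method fit together, here is a rough sketch using only the Python standard library. It is not the application's actual code: the function names (`kmeans`, `inertia_curve`, `elbow_k`), the synthetic points and the rule for picking the elbow (the k where the curve bends most sharply, measured by the largest second difference) are all assumptions made for illustration. In practice the elbow is often just read off a plot by eye.

```python
import math
import random


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points]


def kmeans(points, k, iterations=50, seed=0):
    """Plain K-Means (Lloyd's algorithm). Returns (centroids, labels, inertia)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    for _ in range(iterations):
        labels = assign(points, centroids)
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                # Move the centroid to the mean of its members.
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(centroids[c])  # leave an empty cluster where it is
        if updated == centroids:
            break  # converged
        centroids = updated
    labels = assign(points, centroids)
    # Inertia: total squared distance from each point to its centroid.
    inertia = sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))
    return centroids, labels, inertia


def inertia_curve(points, k_max, restarts=5):
    """Best inertia over a few random restarts, for k = 1..k_max."""
    return [min(kmeans(points, k, seed=s)[2] for s in range(restarts))
            for k in range(1, k_max + 1)]


def elbow_k(inertias):
    """Pick the k with the sharpest bend; inertias[i] belongs to k = i + 1.

    Assumed heuristic: largest second difference. Needs at least 3 values.
    """
    bends = {i + 1: inertias[i - 1] - 2 * inertias[i] + inertias[i + 1]
             for i in range(1, len(inertias) - 1)}
    return max(bends, key=bends.get)


if __name__ == "__main__":
    rng = random.Random(42)
    # Synthetic stand-in for crash coordinates: three blobs of (lon, lat).
    centres = [(-118.25, 34.05), (-118.40, 34.20), (-118.10, 34.15)]
    points = [(rng.gauss(cx, 0.01), rng.gauss(cy, 0.01))
              for cx, cy in centres for _ in range(50)]
    curve = inertia_curve(points, k_max=6)
    print("inertia for k = 1..6:", [round(v, 4) for v in curve])
    print("suggested number of clusters:", elbow_k(curve))
```

One caveat: treating raw longitude and latitude as flat x and y coordinates is only a reasonable approximation over a small local area, which is one reason for clustering a single quadrant at a time.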
Understanding data, and what data we produce, is the new normal, and it will only become more important as deepfakes and AI become more effective and prominent in our lives. The key is to strike a balance between limiting our data footprint and embracing the future without fear.