Why did the machine learning algorithm cross the road? To get to the other dataset!
Machine learning has been a game-changer for many industries, so it’s not surprising everyone wants to jump on this bandwagon. However, the journey to creating successful machine learning solutions (i.e. creating an ROI for the business) is not without its challenges. I have observed many pitfalls that can derail the machine learning project you want to roll out. Here are some of them (and how to avoid them):
Not understanding the problem: Before you start building a machine learning model, it is crucial to have a deep understanding of the problem you are trying to solve. This goes all the way to the top of the organisation. It doesn’t mean you need to know how the algorithms work (you can leave that to your ML engineers and data scientists to figure out). You as a leader need to understand the business problem you are trying to solve using ML and what success looks like (the results). Your engineering team will then need to identify the right performance metrics, understand the data and the relationships between the variables, and be aware of any constraints or assumptions.
Overfitting: Overfitting occurs when an ML model fits the training data too well, resulting in poor performance on unseen data. To avoid overfitting, you may use regularization techniques like L1 or L2 regularization, or try simplifying your model by reducing the number of features.
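To make that concrete, here is a minimal sketch of what your team might do, using scikit-learn with made-up data; the dataset and the alpha value are assumptions purely for illustration, not a prescription:

```python
# A minimal sketch: an unregularized linear model vs. one with an L2 penalty (Ridge),
# trained on synthetic data. The data and alpha are illustrative assumptions.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=50, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

plain = LinearRegression().fit(X_train, y_train)
ridge = Ridge(alpha=10.0).fit(X_train, y_train)  # alpha controls the strength of the L2 penalty

# A large gap between the train and test scores is a classic sign of overfitting;
# the regularized model usually narrows that gap.
print("Plain: train %.2f, test %.2f" % (plain.score(X_train, y_train), plain.score(X_test, y_test)))
print("Ridge: train %.2f, test %.2f" % (ridge.score(X_train, y_train), ridge.score(X_test, y_test)))
```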
Bias in data: Bias in ML algorithms can lead to inaccurate predictions and unfair decisions. To avoid bias, ensure that your data is representative of the real-world population, and use techniques such as oversampling or under-sampling to balance your data.
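Here is one way the balancing part might look in practice (a minimal sketch using scikit-learn’s resample on a made-up imbalanced dataset, so the column names and the 90/10 split are assumptions):

```python
# A minimal sketch of oversampling the minority class with scikit-learn's resample.
# The DataFrame and label values are illustrative assumptions.
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({
    "feature": range(100),
    "label": [0] * 90 + [1] * 10,  # 90/10 class imbalance
})

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Sample the minority class with replacement until it matches the majority class size.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)

balanced = pd.concat([majority, minority_upsampled])
print(balanced["label"].value_counts())  # roughly 50/50 after oversampling
```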
Insufficient data: I cannot stress enough how important this is (hence the dad joke above^). I have seen instances where companies think they can become ML-ready right away when they don’t have enough data. Sometimes the data they hold is just noise. Even when you push back and say you cannot build a good model, the response from management might be to “give it a go”. Please stop this. If a data scientist comes back to you after initial exploration and says you don’t have enough data (or that a lot of work needs to go into cleaning your data), pushing ahead anyway is a waste of your resources. Note that machine learning models need large amounts of data to learn and make accurate predictions. Insufficient data can lead to underfitting, where the model is too simple and fails to capture the underlying patterns in the data. This would have terrible consequences if used in production.
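One way your data scientists can show whether more data would actually help is a learning curve: if the validation score is still climbing as training examples are added, you probably don’t have enough data yet. A minimal sketch (scikit-learn, with a toy dataset and model standing in for your own):

```python
# A minimal sketch of a learning curve to check whether more data would help.
# The classifier and toy dataset are stand-ins for your own model and data.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_breast_cancer(return_X_y=True)

train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=5000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
)

for size, score in zip(train_sizes, val_scores.mean(axis=1)):
    print(f"{size:4d} training examples -> mean validation accuracy {score:.3f}")
# If the curve is still rising at the largest size, more (clean) data is likely to pay off.
```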
Ignoring the context: ML models can only make predictions based on the data they have been trained on. If the context of the problem changes, the model may perform poorly. This is why providing the full scope to data scientists is important. In the absence of a strong Product function within an organisation, this is the responsibility of senior leaders. It is important for your data scientists to regularly re-evaluate and retrain the model, and to have a process in place for monitoring and updating it as the context changes.
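For the monitoring piece, one simple check your team might run is a two-sample test for drift on a key input feature. This is only a sketch under the assumption that you can compare the values a feature had at training time with what the model sees in production; the numbers below are simulated:

```python
# A minimal sketch of drift monitoring: compare a feature's distribution at training time
# vs. in production with a two-sample Kolmogorov-Smirnov test. Values are simulated.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
training_values = rng.normal(loc=50, scale=10, size=5_000)    # what the model was trained on
production_values = rng.normal(loc=58, scale=10, size=1_000)  # what the model now sees

stat, p_value = ks_2samp(training_values, production_values)
if p_value < 0.01:
    print(f"Feature distribution has shifted (KS statistic {stat:.3f}); consider retraining.")
else:
    print("No significant drift detected for this feature.")
```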
Not testing and evaluating the model: It is important to test your model on a validation set, and to use appropriate evaluation metrics to determine its performance. This will help you avoid overfitting and ensure that your model generalizes well to unseen data.
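As a minimal sketch of what this looks like in practice (scikit-learn, with a toy dataset and model chosen purely for illustration): hold out a validation set, train only on the rest, and report metrics that match the business problem rather than accuracy alone.

```python
# A minimal sketch of holding out a validation set and evaluating with more than accuracy.
# The dataset and model are illustrative stand-ins for your own.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Keep 25% of the data aside; the model never sees it during training.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Per-class precision and recall often matter more to the business than raw accuracy.
print(classification_report(y_val, model.predict(X_val)))
```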
I am always interested in having conversations with senior leaders in the business to see the type of data you hold and what can be done using that data. If you are looking for someone to speak to about what you can do with your data, reach out to me on LinkedIn.