Interest in Data Science & Machine Learning has sky-rocketed as businesses realize that they need deeper and more valuable insights into their data. But how do you manage an essentially open-ended process where you may never get the 'right' answer?
Endjin have developed a pragmatic approach to data science, based on a series of iterative experiments, relying on evidence-based decision making to answer the most important business questions.
Following this process allows us to iterate to insights quickly when there are no guarantees of success.
1. Understand the business objective
We'll help you to clarify what you're doing and why you're doing it. The key is aligning the data science work (and team) to the needs of the business, just like every other function. Data exploration, data preparation and model development should all be done in the context of answering a defined business question. In the same way that a business case and agreed set of requirements would be made before embarking on developing a new web application, data science work and machine learning experiments should be performed in line with a clear set of objectives that tie back to an overall strategy or business goal.
2. Start with a hypothesis
Once the business objective is understood, we'll define a testable hypothesis, stating the parameters and the success criteria before we continue. This means stating what it is you’re trying to prove and, most importantly, what success looks like. Defining the success criteria before you start the experiment is critical to keeping decisions to continue or change direction in line with the business goals.
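As an illustration only, a hypothesis and its success criterion can be written down as a small record before any modelling starts. The sketch below is not Endjin's tooling; the `Hypothesis` class, the churn example, the `recall` metric and the 0.80 threshold are all hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """A testable statement plus the success criterion agreed up front."""

    statement: str
    metric: str       # name of the metric used to judge the experiment
    threshold: float  # minimum metric value that counts as success

    def is_met(self, observed: float) -> bool:
        """Return True when the observed metric reaches the agreed threshold."""
        return observed >= self.threshold


# Hypothetical example: the wording and numbers are illustrative only.
churn_hypothesis = Hypothesis(
    statement="Customer churn can be predicted from 90 days of usage data",
    metric="recall",
    threshold=0.80,
)

print(churn_hypothesis.is_met(0.84))  # True
print(churn_hypothesis.is_met(0.71))  # False
```

Making the record immutable (`frozen=True`) reflects the point above: the success criterion is fixed before the experiment runs, so the result is judged against what was agreed rather than against a moving target.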
3. Time-box an experiment
Data science is essentially an open-ended process where you may never get the 'right' answer, so it's important to decide how much time and effort you're willing to spend to prove or disprove your hypothesis. We'll work in weekly iterations and define the scope of each experiment accordingly.
4. Prepare the data
The most important activity in the data science process is choosing and preparing the data. The data holds the secrets, so the more of it you have and the better its quality, the better the results are likely to be.
Once the data has been identified and obtained, it then needs to be prepared for use. This might mean deciding how to handle missing values, duplicate values, outlying values or values that correlate directly with others. Different data sets may need to be combined, including data from internal and external sources. Traditional and cloud ETL processing, used to cleanse, transform and merge the data, is combined with statistical analysis to identify the most useful data attributes.
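To make the preparation decisions concrete, here is a minimal sketch using only the Python standard library. It assumes a tiny, made-up data set of `(customer_id, monthly_spend)` pairs, and the rules it applies (median imputation, flagging values more than ten times the median) are simplistic assumptions for illustration, not a recommended policy.

```python
from statistics import median

# Hypothetical records: (customer_id, monthly_spend); None marks a missing value.
raw_rows = [
    ("c1", 120.0),
    ("c2", None),
    ("c1", 120.0),   # exact duplicate of the first row
    ("c3", 95.0),
    ("c4", 110.0),
    ("c5", 9000.0),  # suspiciously large value
]


def prepare(rows, max_ratio=10.0):
    """Drop duplicates, fill missing values with the median, flag outliers."""
    # 1. Remove exact duplicates while keeping the original order.
    unique_rows = list(dict.fromkeys(rows))

    # 2. Impute missing values with the median of the known values.
    known = [value for _, value in unique_rows if value is not None]
    fill = median(known)
    filled = [(key, fill if value is None else value) for key, value in unique_rows]

    # 3. Flag values far above the median (an assumed, simplistic rule).
    return [(key, value, value > max_ratio * fill) for key, value in filled]


prepared = prepare(raw_rows)
for row in prepared:
    print(row)
```

Outliers are flagged rather than dropped here so that the decision about what to do with them stays visible and can be recorded alongside the experiment.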
5. Experiment with algorithms
With prepared data, an appropriate algorithm can be applied to try to find a model that answers the questions you're asking. Selecting the best one to use depends on the size, quality and nature of your data, the type of question you're asking and what you want to do with the answer. Models will evolve over time: what's true in your data now may not be in a day, month or year's time. It's therefore important to build in a mechanism for re-training the model as part of the productionizing process so that it stays relevant and, most importantly, still meets the success criteria that you defined.
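One possible shape for such a re-training mechanism is a periodic check of the live metric against the success criterion. The sketch below is an assumption about how that check might look, not a description of a specific product: `needs_retraining`, the three-week window, the 0.80 threshold and the weekly scores are all hypothetical.

```python
def needs_retraining(recent_scores, threshold, window=3):
    """Return True when the mean of the last `window` scores is below threshold."""
    if len(recent_scores) < window:
        return False  # not enough evidence yet to make a decision
    latest = recent_scores[-window:]
    return sum(latest) / window < threshold


# Hypothetical weekly recall measured on fresh production data.
weekly_recall = [0.86, 0.85, 0.83, 0.79, 0.76, 0.74]

if needs_retraining(weekly_recall, threshold=0.80):
    print("Model no longer meets the success criterion: schedule re-training")
```

Averaging over a short window, rather than reacting to a single week, is a design choice that trades a slower response for fewer false alarms from noisy measurements.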
6. Evaluate and iterate
We'll document the entire process as a set of lab notes and present an executive summary of results and recommendations, allowing you to use the evidence from the experiment to decide whether there's value in continuing, stopping or changing direction.
If you find a successful model, we can help you take the necessary steps to realise the value across the organisation by developing flexible, extensible, scalable, multi-tenant, polyglot data processing pipelines to power your intelligent solutions.