Dataset Analysis

We regularly analyze datasets to better understand what can be done with our customers' data. During this analysis phase, we try to formulate a specific question, like: what is the probability that a credit card transaction is fraudulent? We then gather a sample of the requisite data, check for signal, look at variable contributions, and summarize our findings.

Dataset Creation

In order to analyze a dataset, first we work with you to formulate a dataset that is ripe for machine learning. If you're still reading, chances are you have a hypothesis about some insights that may be gained from your data. For example, you might suspect that there are patterns in your data which, if understood, can predict whether a credit card transaction is fraudulent. Usually we start an investigation by asking a lot of questions to better understand your intuition about these potential patterns. You might, for example, tell us that the location of purchases is a key factor in fraud.

Next, we will follow up by asking where potentially helpful data originates in your organization and how it is accessible to you. Often we find that the data required for effective machine learning is spread across an organization and needs to be fused into a single spreadsheet or CSV from its origins in a data warehouse or a smattering of databases. Our goal in the analysis phase is to formulate a single dataset snapshot (not live feeds yet, and often a sample of fewer than 1 million rows) that we can take away and analyze in depth.
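As a minimal sketch of this fusion step, assuming pandas and two hypothetical extracts (a transactions table and a customer table, each pulled from its own system), the sources can be joined into one flat snapshot and saved as a CSV:

```python
import pandas as pd

# Hypothetical extract from a transaction database.
transactions = pd.DataFrame({
    "txn_id": [1, 2, 3],
    "customer_id": [10, 10, 11],
    "amount": [25.0, 310.0, 42.5],
})

# Hypothetical extract from a separate customer database.
customers = pd.DataFrame({
    "customer_id": [10, 11],
    "home_region": ["US-CA", "US-NY"],
})

# Fuse the sources into a single snapshot, one row per transaction.
snapshot = transactions.merge(customers, on="customer_id", how="left")

# Persist the snapshot as a CSV for offline, in-depth analysis.
snapshot.to_csv("snapshot.csv", index=False)
```

The table names and columns here are illustrative; in practice this join is often a longer pipeline across many tables, but the end result is the same single flat file.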

The final construction of the dataset is usually performed by your team following a regular back-and-forth with multiple team members, typically including a DBA, a data scientist, or a data engineer. We do, however, have customers for whom these skills are not available, in which case we are more than happy to jump in and provide everything necessary.


Once we have the dataset, we perform a number of analysis operations. For example, we usually begin by looking at the most frequently occurring values for each column, or variable. One of the things we are looking for is how to transform the variables into a form that allows us to formulate a machine learning problem. For example, we might look at a variable like price, and assess the range of the values or the frequency of missing values. We might look at a variable like gender, and consider its possible values (e.g. M, F, male, female, neutral). We then might choose an encoding technique like one-hot encoding. In yet other cases, we might have unstructured data like text, audio, images, or video, each of which calls for different representation techniques.
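To make these first operations concrete, here is a small sketch using pandas on a made-up sample with the kinds of issues described above: a numeric price column with missing values and a gender column with inconsistent codings:

```python
import pandas as pd

# Hypothetical sample illustrating the issues discussed above.
df = pd.DataFrame({
    "price": [19.99, None, 250.0, 34.5, None],
    "gender": ["M", "female", "F", "male", "neutral"],
})

# Most frequent values per variable, and missing-value frequency.
print(df["gender"].value_counts())
print(df["price"].isna().mean())             # fraction of missing prices
print(df["price"].min(), df["price"].max())  # range of observed prices

# Normalize inconsistent codings before encoding.
df["gender"] = df["gender"].replace({"M": "male", "F": "female"})

# One-hot encode the categorical variable into numeric columns.
encoded = pd.get_dummies(df, columns=["gender"])
```

After `get_dummies`, the single gender column becomes one numeric indicator column per value, which is exactly the form a downstream model can consume.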

In all these cases, we ultimately need to formulate a matrix of numbers to represent features. In many cases, we try to formulate a supervised machine learning problem, where we also formulate a matrix to represent our targets (the thing we're trying to classify or predict). Both the features and targets are used inside the various machine learning models. Once the problem is formulated, we often perform signal checks to see how accurately, and with what level of reproducibility, we can predict your target. We might also look at various permutations of problem formulation, including feature representation and efficacy analysis, to see if indeed there is signal. Finally, if your data is structured or semi-structured, we might stack-rank the variables by predictive impact to help us glean insight into why the models are working.
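The signal check and stack-ranking steps can be sketched as follows, assuming scikit-learn and a synthetic stand-in for a fraud-style problem (X is the feature matrix, y the target vector); the model choice and parameters here are illustrative, not a prescription:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic feature matrix X and target vector y
# (e.g. 1 = fraudulent, 0 = legitimate).
X, y = make_classification(n_samples=1000, n_features=8,
                           n_informative=3, random_state=0)

# Signal check: cross-validated accuracy tells us whether the features
# predict the target better than chance, and how reproducibly
# (the spread across folds).
model = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(model, X, y, cv=5)
print("accuracy: %.2f +/- %.2f" % (scores.mean(), scores.std()))

# Stack-rank the variables by predictive impact.
model.fit(X, y)
ranking = np.argsort(model.feature_importances_)[::-1]
print("most predictive features first:", ranking)
```

The ranking is what lets us go back and say, for instance, that purchase location dominates the model's decisions.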

Ultimately, we will come back to you and say either: yes, a successful model can be built, and here is a ballpark range of how effective it might be; or no, a model should not be built yet.