I’m an engineer who has spent the last five years working to wrap my head around machine learning: all it is, all it isn’t, and I think I’ve got it. What I struggled with the most was the context. There are a lot of moving parts and the implementation can be really complex, but the overall concept isn’t, so I’m hoping this will be a good reference for those who want to understand it well enough to know how and when to use it. Then, if you want to dive deeper, you’ll have at least an idea of what’s going on.
At a high level, machine learning is really defined as an algorithm that allows a machine to learn (or continually learn) from a set of information and use that 'knowledge' to provide a relatively accurate answer about similar data. Take, for example, an image analysis algorithm. Given a large enough pool of labeled images to learn from (hundreds of thousands), it can look at your image and tell you what's in it (faces, objects) based on having seen enough similar objects before. The more images you provide, the more accurate it will become, but its power is that it won't need an infinitely large set of information to begin to be useful. Some is good. More is better. Eventually it’s good enough, but never really perfect.
Being able to train a machine means you need a fair amount of information to feed it and a computer fast enough to process it. That doesn't mean any data you provide will work, though. If you want to drive across town, you need car fuel. To fly, you need jet fuel, and to leave Earth you need rocket fuel. How much information you need, and of what quality, changes with the circumstances.
There are a lot of ways to attempt to create an algorithm that learns, and that's a good thing, because no two sets of information are the same. Some are large and complex. Some are small and simple, and there are many in between. You want the tool that fits the job. A shovel is sufficient to dig a small hole, but if you're mining ore, you'll need to scale up.
Overall, most of what you’ll read about is how the math used to ‘learn’ is applied as an algorithm and how that algorithm uses hardware to learn quickly. After all, if it takes too long, who cares how ‘smart’ it is?
You can read up on the top 10 lists of the most commonly used algorithms and data formats and what they do.
One of the hardest things for me to get my head around was how all the parts fit together. There's so much more to it than just having data to analyze and picking an algorithm. For any given data set, you might use one algorithm at first while you're still trying to understand what patterns are in it, then change to another algorithm to help you manage those patterns as you begin to understand what you're working with and how to model it. Generally, though, you’re working with an algorithm that is either ‘supervised’ or ‘unsupervised’.
At first you might not know what to expect from the data you’ve collected. You may just be grabbing everything, not knowing what matters and what doesn’t. One easy-to-understand example of unsupervised learning is a ‘clustering’ algorithm. Say you're tracking some analytics data from a website or application: you're not aware of who's using your site or what things individuals like to do. At first you're going to want to collect a mass of information and then use a broad clustering algorithm to help you define different groups. Every website will have a unique audience with unique tastes, but after running for a sufficient amount of time you'll start to see patterns, either in times of day or in groups of products, and you’ll start to develop confidence that you know who the audiences are. At this point you're going to want to take that information, turn it into a model, and begin to assume new users will fall into one of those known categories. This lets you trim the data you track and gain some efficiency in learning.
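To make that concrete, here’s a minimal sketch of the clustering step, assuming scikit-learn is available. The feature columns (hour of visit, pages viewed, minutes on site) and the numbers are purely illustrative stand-ins for whatever analytics you actually collect.

```python
# A minimal clustering sketch, assuming scikit-learn; the features are hypothetical.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# One row per session: [hour_of_day, pages_viewed, minutes_on_site]
sessions = np.array([
    [9, 3, 5], [10, 4, 7], [12, 12, 30],
    [13, 10, 25], [20, 2, 3], [21, 1, 2],
])

# Scale the features so no single column dominates the distance calculation
scaled = StandardScaler().fit_transform(sessions)

# Ask for three clusters; in practice you'd experiment with this number
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaled)
print(model.labels_)  # which cluster each session landed in
```

The output is just a cluster number per session; the human work is looking at those groups and deciding what they actually represent.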
The idea of supervising the learning of an algorithm is really that you're providing it bounds to work within. It's no longer just randomly trying to guess a solution; it now has some structure, or a model, based on the unsupervised learning you’ve done. Think of running a diner: you don’t just randomly buy food for whoever walks in. You know you work near a factory and there’s a lunch crowd and a dinner crowd who want a hearty meal.
Going back to the example of website analytics, once you know specific audience segments and potentially specific interests, you’d want to define a model that incorporates that information, and instead of trying to define groups from site-wide user behavior, you'd want to be able to put a user into a group based on their individual behavior: which group does their behavior best match? The faster you can identify a user's preferences, the better an experience you can provide to them. For example, if you know users from a certain region and time of year always celebrate a holiday, you could be offering them bulk purchasing discounts. You could also promote content to the homepage to save them time searching for it.
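Here’s a minimal sketch of that supervised step, again assuming scikit-learn, and assuming the segments below were named by hand after reviewing the earlier clustering run; the segment names and numbers are hypothetical.

```python
# A minimal supervised-classification sketch, assuming scikit-learn;
# segment names and behavior data are hypothetical.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Known users' behavior: [hour_of_day, pages_viewed, minutes_on_site]
known_users = np.array([
    [9, 3, 5], [12, 12, 30], [20, 2, 3],
    [10, 4, 7], [13, 10, 25], [21, 1, 2],
])
# Segment labels assigned during the unsupervised phase
segments = ["commuter", "lunch_crowd", "evening_browser",
            "commuter", "lunch_crowd", "evening_browser"]

classifier = KNeighborsClassifier(n_neighbors=1).fit(known_users, segments)

# A brand-new user visiting at noon and browsing heavily: which segment fits best?
print(classifier.predict([[12, 9, 22]]))  # likely 'lunch_crowd'
```

The point is the shift in question: instead of “what groups exist?”, you’re asking “which known group does this new user belong to?”.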
When it comes time to design a model, you really need good data. What this means is that you have complete information that wholly represents what you’re trying to learn. Where this falls short is when you have bad information (old or invalid values), no information (either scrubbed for privacy or just not collected), or not enough information (no data on certain demographics creates a bias). This is really where the data scientist earns their keep. It takes an experienced person to create a specification for the data that’s required to make a model work well. They could be collecting information from any number of data warehouses, data lakes, or real-time data streams, and all of it needs to be fixed, filtered, and tested for bias, diversity, and overfitting (a model that fits its training data so closely it can’t be trusted on anything new). If you make it through that with a sample set large enough to be of use, you’re winning.
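A minimal sketch of that basic data hygiene, assuming pandas and scikit-learn and a made-up orders table; real pipelines involve far more than this, but the checks are the same in spirit.

```python
# Basic data-hygiene sketch, assuming pandas and scikit-learn; the table is made up.
import pandas as pd
from sklearn.model_selection import train_test_split

orders = pd.DataFrame({
    "age":    [34, 27, None, 45, 19, 51, 38, 29],
    "region": ["north", "north", "south", "south", "south", "north", None, "north"],
    "spend":  [120.0, 80.0, 200.0, -5.0, 60.0, 150.0, 95.0, 110.0],
})

# Drop rows with missing values (no information) and clearly invalid ones (bad information)
clean = orders.dropna()
clean = clean[clean["spend"] >= 0]

# Check for representation bias: is one region barely present in the sample?
print(clean["region"].value_counts())

# Hold out a test set so accuracy is measured on data the model has never seen
train, test = train_test_split(clean, test_size=0.25, random_state=0)
print(len(train), len(test))
```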
One of the aspects of machine learning that creates the most distrust is the limits of what it can do, because it’s hard to explain what’s happening when it goes wrong or simply doesn’t work. The best way to explain it is: garbage in, garbage out. But no one wants to hear that when they paid good money for cutting-edge technology, and it certainly doesn’t help explain how to fix it. Without data integrity, you really can’t. All you can do is make this known from the start, educating customers that there are boundaries within which the machine operates well, but if you go outside them, it loses its capabilities. Driving any standard car requires a road, specific fuel, and a set of temperature ranges. Driving it over a lava field or putting salt in the gas tank will do nothing for the driving experience. Unfortunately, we’re not at the point in the Information Age where we have ubiquitous fuel stations, so you have to blend your own and be very careful about the recipe.
For more on the fixed limits and the value of data integrity, and its implications for privacy and ethics, check out the series titled 'Soul in the Machine' and the article 'Can a machine be racist'.
There are a lot of ways to start using machine learning, but it really depends on what your needs are.
I’ve found the commercial-grade, web-hosted offerings like Microsoft Cognitive Services, IBM Watson, and Amazon and Google machine learning provide a lot of power with minimal effort: enough to get started and provide the on-ramp to the technology that most people need.
Are you going to be developing new algorithms for a custom business need? Then you’re going to want to start building a team that knows Python and R, since that's the domain most data scientists work in. With the home-brew approach, you can deploy to a cloud host that can scale, like AWS, Google, or Azure.
Another option is to use Google’s TensorFlow. It has reached a level of maturity where Google has invested in building hardware just to maximize its capabilities. It’s fairly easy to set up and test with, and there is ample documentation to learn from.
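To give a sense of how little it takes to get started, here’s a minimal sketch assuming TensorFlow 2.x: a one-neuron model that learns the line y = 2x − 1 from a few points. It’s a toy, not a real workload.

```python
# A minimal TensorFlow 2.x sketch: fit a one-neuron model to y = 2x - 1.
import numpy as np
import tensorflow as tf

xs = np.array([[-1.0], [0.0], [1.0], [2.0], [3.0], [4.0]])
ys = np.array([[-3.0], [-1.0], [1.0], [3.0], [5.0], [7.0]])

# A single dense layer with one weight and one bias is enough for a straight line
model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(units=1),
])
model.compile(optimizer="sgd", loss="mean_squared_error")
model.fit(xs, ys, epochs=500, verbose=0)

print(model.predict(np.array([[10.0]])))  # should land close to 19
```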
Depending on the size or sensitivity of the data, though, you might not be able to transfer it offsite. Another option is to run the algorithm where the data lives. Microsoft has been embedding machine learning into SQL Server since 2016 and has even built out a machine learning framework for C#. This saves time developing common algorithms and places the development and deployment into a preexisting skill set within IT.
Going beyond that, the cost gets prohibitive. Your competition at that point are probably pioneers in the field, and you’ll want to question whether this is the right place to jump in. If so, you’ll be able to afford a team of people better at explaining the rest to you than I ever could.
Machine learning is certainly going to be saturating our future, but it’s no different than any other technology. It’s going to take time to contextualize the stories so that it’s accessible, and there’s a lot it just can’t do, but we’re past the point of questioning its capabilities. It can, it will, but will you?