Types of Bugs (2019)
Programming bugs are every coder’s nightmare, and since data science is intimately connected to coding, they are also every data scientist’s nightmare. This is because real data science (i.e. the data science done in the real world and in research centers of most universities) involves a lot of coding.
In other words, if you want to remain relevant in the data science field, getting your hands dirty with coding is a requirement, not a matter of choice. And wherever there is coding, there are programming bugs.
This tutorial explains the types of programming bugs and how to deal with them. It also explains the difference between mistakes and bugs, as well as the most common types of mistakes in data science.
Dealing with Programming Bugs
With all the talk about AI these days and how it is revamping the data science field, you may be tempted to think it is going to solve all of your problems.
However, even if AI makes things easier and more efficient, it cannot change the fact that bugs and human errors are still going to be around. If you believe that AI is going to do away with all your programming problems, you may want to reconsider your viewpoint!
It is good to keep in mind that even the most adept data scientists have bugs in their code from time to time. Being more experienced may allow you to come up with solutions faster, and the quality of these solutions is bound to be higher than that of a newcomer.
However, the experience will not get rid of all your mistakes when writing code, since many of these mistakes are due to factors beyond your control (such as fatigue and having too many things on your mind). Therefore, coming up with a robust strategy for dealing with these problems is going to be useful, if not essential, for many years to come.
Bugs are not always bad. If you look past the frustration they cause, they can be the source of useful lessons, especially in the early stages of your career.
Examining them closely and tackling them with the right attitude can be a great way to learn more about the programming language you are using, the algorithms implemented, and how all of this fits into the data science pipeline. Let’s look into this in more detail, starting with the places where bugs tend to appear.
Places You Usually Find Bugs
Looking at the various places where you are more likely to encounter bugs in your code is an important endeavor, as it is bound to help you classify these bugs and gradually gain a better understanding of your strengths and weaknesses when it comes to the coding aspect of your work as a data scientist.
This is especially useful when you are new to coding and wish to improve your skills quickly.
A place where bugs often flock is variables, particularly when you are new to the programming language. Fortunately, most high-level languages such as Julia and Python are able to adapt the variables’ types so that they are best suited for the values assigned to them in your code.
Still, it is not uncommon, even when using such languages, to make mistakes with how you use these variables, leading to exceptions and errors in your scripts. You will always need to be conscious of how you handle variables when you are programming.
Coding structures, such as conditional statements and loops, can also be a nest for bugs. These bugs may be subtle, and they are equally vexing; they can sometimes be seriously problematic, since they do not always throw errors when you run the scripts that contain them.
Functions are complex structures, and as such, they deserve lots of attention. In modern programming languages such as Scala and Julia, where functions play a more important role, they tend to be a place where bugs creep up, even if these functions are tested individually and work fine in the majority of the cases they are tested on.
An even more subtle kind of buggy situation appears whenever there are issues with your code’s logic. This is the most difficult situation you will encounter, as bugs in the area of a code’s logic are far more elusive and tend to remain unnoticed until they create issues.
Finally, a bug may come out in a combination of the aforementioned places. Bugs like that are even more challenging to resolve but may give you valuable insight into the inner workings of the code you use. Now let’s look at the different types of bugs more closely.
Types of Bugs Commonly Encountered
Much like the bugs found in nature, coding bugs vary greatly, with some of them being more frustrating than others. Yet all of them can be dealt with if you understand them properly and learn to identify them when they creep up in your scripts.
First, we have the bugs which are fairly simple and relatively easy to tackle. These bugs have to do with the type of a variable. Since most modern languages are forgiving when it comes to variable types, it is easy to fall into the habit of not paying attention to them.
Most data scientists do not care about programming to the extent that people trained in that skill do, so it is often the case that types are not set properly, resulting in various issues with the corresponding variables.
Best case scenario, you lose some accuracy in variables that should have been defined as Floats but were defined as Integers. Worst case scenario, the problem with the variable types goes unnoticed and creates issues later on.
If the compiler of the language identifies such a problem and throws an exception or an error, it should be a cause of celebration, since at least in that case, you become instantly aware of the issue and you can remedy it before it causes other, subtler bugs later on in the program.
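To make the float-versus-integer point concrete, here is a minimal Python sketch (the variable names are illustrative) of how a truncated value can slip through without any error being raised:

```python
# Integer division silently truncates; in Python 3 the trap is using //
# when you meant /, which drops the fractional part without complaint.
total_score = 7
num_items = 2

mean_wrong = total_score // num_items   # floor division: yields 3, not 3.5
mean_right = total_score / num_items    # true division: yields 3.5

# The truncated value is off by 0.5 -- a subtle accuracy loss that
# raises no error or exception anywhere in the script.
error = mean_right - mean_wrong
```

No exception is ever thrown here, which is exactly why this class of bug is easy to miss.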
Indexing bugs are also fairly common, especially if you are uncertain about the dimensionality of the arrays you are trying to access. Sometimes the language you use may not be able to accept binary arrays as indexes, resulting in errors.
Other times, you may be using a different indexing paradigm than the one the language is designed with.
For example, indexes in Julia, as well as in R, always start with 1, unlike other, more traditional languages, such as Python and C, that start with 0. This also means that the index of the last element of an array in these languages may differ from what you expect.
For example, a 20 x 1 array in Python (let’s call it A) has indexes ranging from 0 to 19, so trying to access element A[20] will yield an error. To access the last element, you will need to refer to it as A[19] or A[-1].
Moreover, even though negative indexes are acceptable in Python, other languages may not understand them and will throw an error when you attempt to use them.
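The 20 x 1 array example can be sketched in Python like this (using a plain list for simplicity):

```python
A = list(range(100, 120))  # a 20-element sequence, with indexes 0 through 19

first = A[0]       # index 0 is the first element
last = A[19]       # index 19 is the last element
also_last = A[-1]  # Python also accepts negative indexes, counting from the end

# A[20] raises an IndexError, since the valid indexes stop at 19.
try:
    A[20]
    out_of_range = False
except IndexError:
    out_of_range = True
```

In a 1-based language such as Julia or R, the same logical element would live at index 20 instead, which is exactly the kind of paradigm mismatch that produces indexing bugs.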
Parameter value issues are another type of bug closer to home when it comes to data science applications. Sometimes these values are not set right, resulting in issues with the functions they correspond to.
These issues are not always easy to detect since they do not always translate to errors, so it is best to make sure that whenever you are setting a value for a parameter, you know what values you should choose from for the function to work in a meaningful way.
Otherwise, you may end up with results that don’t make much sense or compromise the effectiveness of your models.
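As a hypothetical illustration of a parameter value issue, consider a small smoothing function whose window parameter only makes sense within a limited range; validating the parameter explicitly is what turns a silent problem into an obvious one:

```python
def moving_average(values, window):
    """Simple moving average; `window` must be between 1 and len(values)."""
    if window < 1 or window > len(values):
        # Failing loudly beats silently returning meaningless output
        raise ValueError("window must be between 1 and len(values)")
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

data = [1.0, 2.0, 3.0, 4.0, 5.0]
smooth = moving_average(data, window=2)  # a sensible parameter value
# moving_average(data, window=0) raises ValueError instead of
# quietly producing nonsense downstream.
```

The function and its parameter are made up for illustration; the point is that knowing the meaningful range of each parameter lets you guard against values outside it.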
Another type of bug has to do with code that never runs (or runs so infrequently that you never get to test it properly under normal circumstances).
This is due to the existence of conditionals that are peculiar in the sense that one or more of the conditions present may never (or very rarely) hold true, resulting in whole branches of your code remaining dormant. The code may look fine (i.e. be void of obvious bugs), but may not always yield the results you expect.
Unfortunately, a compiler is not sophisticated enough to detect this kind of bug, and the issues it may cause are bound to surface only after a long time, possibly after you are done with that part of the project.
This can result in delays in your project (if you are fortunate), though it is quite possible that the issues may be much worse (e.g. problematic situations arising when the script is already in production).
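A toy sketch of such a dormant branch (the function and data are made up): the fallback path below only runs for all-zero input, so any bug hiding in it could go undetected for a long time under normal testing.

```python
def normalize(values):
    """Scale values so the maximum becomes 1.0."""
    max_val = max(values)
    if max_val != 0:
        return [v / max_val for v in values]
    # Dormant branch: reached only when every value is zero. If this
    # line held a bug, no ordinary test run with realistic data would
    # ever execute it, let alone catch it.
    return [0.0 for _ in values]

scaled = normalize([2.0, 4.0])   # exercises only the common path
dormant = normalize([0.0, 0.0])  # the rarely taken path, triggered deliberately
```

Deliberately feeding in the odd edge case, as in the last line, is the only way to make a branch like this prove itself.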
Sometimes, conditionals may result in infinite loops, which are yet another type of bug. These bugs are generally easier to pinpoint, though not any less vexing than the other bugs mentioned in this blog.
Note that infinite loops can be very expensive when it comes to the computing resources they consume, so you need to be careful with this kind of bug.
Also, since many scripts take a while to run, especially when you are testing them on a single computer, infinite loops may not be apparent and you may waste not just your computing resources, but a lot of time too (waiting for the script to finish running).
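Here is a classic sketch of the pattern in Python, with an illustrative iteration cap added as a safety net so the bug surfaces quickly instead of running forever:

```python
x = 0.0
iterations = 0
MAX_ITERATIONS = 1_000  # guard so the bug surfaces instead of hanging

# Buggy intent: stop when x equals 10.5. But x moves in steps of 1.0 and
# never hits 10.5, so the exit condition never holds -- the iteration
# cap is what actually ends the loop.
while x != 10.5 and iterations < MAX_ITERATIONS:
    x += 1.0
    iterations += 1

hit_guard = iterations == MAX_ITERATIONS  # True: the loop never exited naturally
```

Using `<=` or `>=` comparisons instead of strict equality, plus a cap like this during development, keeps a wayward loop from eating your time and compute.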
It is good to keep in mind that oftentimes the outputs of a function are diverse, depending on the inputs or on other factors (a common situation when dealing with languages supporting multiple dispatch).
Even though in the vast majority of cases a function yields a certain kind of output, it is possible for it to yield a completely different one that may mess up your code if you have not accounted for that possibility.
This type of bug is also subtle, so it will not be identified by the compiler. Instead, you become aware of it only when the problem arises at runtime and the corresponding error appears.
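A hypothetical Python sketch of a function whose output type depends on its input, which is the kind of behavior that can trip up downstream code:

```python
def parse_measurement(raw):
    """Returns a float for numeric strings, but None for anything else --
    two different output types, which every caller must account for."""
    try:
        return float(raw)
    except ValueError:
        return None

good = parse_measurement("3.14")  # a float, as usually expected
bad = parse_measurement("N/A")    # None, not a float

# Downstream code that assumes a float (e.g. bad + 1.0) would only raise
# a TypeError when the unusual input finally shows up in production.
```

The function name and behavior are made up; the lesson is to account for every output type a function can yield, not just the common one.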
There are also other types of bugs beyond these ones. These other types are more application-specific, and because of this, it is hard to talk about them in any meaningful way. However, they exist, so it is best to be mindful of your code.
Things can get complex very fast as you build more and more scripts that rely on other scripts.
Even if the individual pieces of code appear to be simple enough to be bug-free, sometimes just the sheer amount of code you need to run will generate problems you had not anticipated.
Therefore, it is advised that you expect this and always budget time for it. This way, when the time comes to debug these scripts, you can do so without getting stressed out.
Some Useful Considerations on Programming Bugs
Although programming bugs are generally a cause for delays and vexing situations, they are part of the package and are what make the scripts valuable, in a way.
If everyone could write a program easily and without any issues, no one would want to pay someone to do it and do it well. Avoiding programming is not a solution, though, since it is programming that most empowers data science.
It is unlikely that you will be able to do much in the field of data science without writing some code. Also, data science algorithms are always evolving. Even if you can perform some processes without having to write your own scripts, chances are that sooner or later you will need to do some coding if you want to remain relevant as a data scientist.
Handling bugs is a skill that you gradually develop. Although it is unlikely that your code-writing will ever be completely void of bugs, if you pay close attention to your programming mistakes, you will be able to limit them.
As a bonus, this kind of experience can enable you to be a good communicator of the programming mindset and a great troubleshooter, essential skills in all mentoring endeavors.
Hopefully, the information in this blog will enable you to pinpoint and understand the bugs in your programs and gradually come to accept them as issues that need to be tackled, just like problematic data or obscure requirements.
Mistakes and Bugs
Mistakes are inevitable, not just in coding but also through the whole data science pipeline.
This is more or less obvious. What is not obvious is that these mistakes are also learning opportunities for you and that no matter how experienced you are, they are bound to creep up on you.
Naturally, the better your grasp of data science, the lower your chances of making these mistakes, but as the field constantly changes, it is likely that you will not be on top of every aspect of it.
In this blog section, we will examine the difference between mistakes and bugs, the most common types of mistakes in data science, how the selection of a model can be erroneous (even if it does not create apparent issues that will make you identify it as a mistake), the value of a mentor in discovering these mistakes, and additional considerations about mistakes in the data science process.
How Mistakes Differ From Bugs
The main issue with mistakes is that because they are high-level, they do not typically yield errors or exceptions, so they are easy to neglect, creating issues in the end result of your project, whether that is in the data product or the insights.
However, a simple mistake can force you into more back-and-forths in your process, causing additional delays.
Being able to limit these mistakes can significantly improve your efficiency and the quality of your insights, as you will have more time to spend on meaningful tasks rather than troubleshooting issues you could have avoided.
Still, as you progress in your career as a data scientist, the mistakes you make are bound to become more and more subtle (and as a result, more interesting).
Some bystanders may not even recognize them as mistakes, which is why those mistakes tend to require more effort and commitment to excellence to remedy.
One thing is for certain: mistakes will never vanish completely. Also, the sooner you realize that they are part of the daily life of a data scientist, the more realistic your expectations about your work and the field in general will be.
With the right attitude, you can turn these mistakes into opportunities for growth that will make you a better data scientist and transform this otherwise vexing process of discovering and amending into priceless lessons.
Most Common Types of Mistakes
Let us now take a look at the most common types of mistakes in the data science process that people in the field tend to make, particularly in the early part of their careers.
Even if your scripts are entirely bug-free, you may still fall victim to subtle errors which can oftentimes be more challenging than the simpler issues with the code you write.
The majority of the data science process mistakes involve the data engineering part of the pipeline – data cleaning in particular. A lot of data science practitioners nowadays are fond of data modeling and tend to forget that for the models to function well, the data that is fed into them has to be prepared properly.
This takes place in the data engineering stage, an essential and time-consuming part of the pipeline. However, data cleaning has to do with more than merely getting rid of corrupt data and formatting the remaining parts of the raw data so that it forms a dataset.
If there are a lot of missing values, we may have to examine these data points and see how they relate to the rest of the dataset, especially the target variables when we are dealing with a predictive analytics problem.
Moreover, sometimes the arithmetic mean is not the right metric to use when replacing these missing values with something else. Plus, when it comes to classification problems, we need to take into account the class structure of the dataset before replacing these missing values.
If you neglect any one of these steps, you are bound to distort the signal of the dataset, which is a costly mistake in this part of the pipeline.
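As a small sketch of class-aware imputation (the labels and values are made up), here is how missing values can be replaced with the mean of each record's own class rather than a global mean:

```python
# Each record: (class label, feature value or None for missing)
records = [("A", 1.0), ("A", 3.0), ("A", None),
           ("B", 10.0), ("B", 12.0), ("B", None)]

# Compute per-class means over the observed (non-missing) values
class_sums = {}
for label, value in records:
    if value is not None:
        total, count = class_sums.get(label, (0.0, 0))
        class_sums[label] = (total + value, count + 1)
class_means = {label: total / count
               for label, (total, count) in class_sums.items()}

# Impute each missing value with its own class's mean, not the global mean
imputed = [(label, value if value is not None else class_means[label])
           for label, value in records]
```

Here the global mean (6.5) would sit far from both classes; the class means (2.0 for A, 11.0 for B) distort the signal much less.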
The process of feature creation is an integral part of the data science pipeline, especially if the data at hand does not lend itself to advanced data analysis methods.
Feature creation is not an easy task. The majority of beginners in the data science field find it very challenging and tend to neglect it.
On the bright side, the programming part of it is fairly straightforward, so if you are confident in your programming skills, it is unlikely to yield any bugs.
Also, if you pay attention to the feature creation stage, you will likely save a lot of time later on, while at the same time get better results in your models.
The mistakes data scientists make in this stage are usually not related to feature creation per se, but rather to the fact that the time they dedicate to this process is insufficient.
Coming up with new features is not the same as feature extraction, an often automated process for condensing the feature set into a set of meta-features (also known as super-features).
Creating new features, even if many of them end up unused, is useful in that it allows you to get to know the data on a deeper level through a creative process.
Moreover, the new features you select from the ones you have come up with are bound to be useful additions to the existing features, making the parts of your project that follow considerably easier.
Issues related to sampling are not as common, but they are still a type of mistake you can encounter in your data science endeavors. Naturally, sampling a dataset properly (i.e. with random sampling that does not introduce any measurable biases) is essential for training and testing your models.
This is why we often need to use several samples to ensure our models are stable. Using a single or a small number of samples will usually bring about models that do not have an adequate generalization. Therefore, not paying attention to this process when building your models is a serious mistake that can throw off even the most promising models you create.
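The idea of using several samples can be sketched with Python's standard library, assuming a simple list stands in for the real dataset:

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible
dataset = list(range(1000))  # stand-in for 1000 real records

# Draw several independent random samples rather than relying on one
samples = [random.sample(dataset, k=100) for _ in range(5)]

# Each sample holds 100 distinct records; evaluating a model on each
# sample separately lets you check whether its behavior is stable.
sizes = [len(s) for s in samples]
distinct = [len(set(s)) for s in samples]
```

If the model's performance swings wildly between samples, that instability is itself a red flag worth investigating before trusting any single result.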
As we saw in a previous blog, model evaluation is an important part of the data science pipeline related to the model development stage. Nevertheless, it often does not get the necessary attention, and many people rush to use their models without spending enough time evaluating them.
Model evaluation is essential for making sure that there are no biases present, a process that is often handled through K-fold cross-validation as we have seen. Yet using this method only a single time is rarely enough.
Therefore, the conclusions drawn from an insufficient analysis of the model can easily be a liability, particularly if that model is chosen to go into production afterward. All of this constitutes a serious mistake that is unfortunately all too common among those unaware of the value of sampling.
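A bare-bones sketch of K-fold splitting, using only the standard library (in practice you would likely rely on a library implementation, such as scikit-learn's):

```python
def k_fold_indices(n_samples, k):
    """Split indexes 0..n_samples-1 into k (train, test) pairs."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    folds = []
    for i in range(k):
        # The i-th slice becomes the test fold; everything else trains
        test = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        folds.append((train, test))
    return folds

splits = k_fold_indices(10, k=5)
# Across the 5 splits, every record appears in exactly one test fold,
# so the model gets evaluated on all of the data, one fold at a time.
```

This sketch omits shuffling and stratification, both of which matter with real, class-imbalanced data; repeating the whole procedure with different shuffles is what addresses the "single run is rarely enough" point above.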
Over-fitting a model is another issue that comes about from a superficial approach to the data modeling stage and which constitutes another important mistake.
This is closely linked to the previous issues, as it involves the performance of models. Specifically, it has to do with models performing well for some data but horribly for most other data.
In other words, the model is too specialized and its generalization is insufficient for it to be broadly useful. Allowing over-fitting in a model is a serious mistake which can put the whole project at risk if it is not handled properly.
Another mistake deals with the assumptions behind the tests we do or the metrics we calculate. Sometimes, these assumptions do not apply to the data at hand, making the conclusions that stem from them less robust than we may think and subject to change if the underlying discrepancies are stretched further.
For most tests performed in everyday data science, this is a common but not a crucial problem (e.g. the t-test can handle many cases where the assumptions behind it are not met, without yielding misleading results).
Since some cases are more sensitive than others, when it comes to their assumptions, it is best to be aware of this issue and avoid it whenever possible.
Finally, there are other types of mistakes which are not related to the above types, as they are more application-specific (e.g. mistakes related to the architectural design of an AI system or to the modeling of an NLP problem).
I will not go into any detail about them here, but I recommend that you are aware of them, since all the mistakes mentioned here are merely the tip of the iceberg.
The data science process may be simple enough to understand and apply effectively, but it entails many complications, and every aspect of it requires considerable attention.
The more mindful you are of the data pipeline, the less likely you are to make any mistakes which can cause delays or other issues with your data science projects.
Choosing the Right Model
Choosing the right model for a given problem is often the root of a fundamental mistake in data science that tends to go unnoticed, which is why I dedicate a section to it. This is linked to understanding the problem you are trying to solve and figuring out the best strategy to move forward.
There are a variety of models out there that can work with your data, but that does not mean that they are suitable as potential solutions to the data modeling part of the pipeline.
What is the right model anyway? More often than not, there is no single model that is optimum for your data, which is why you need to try several models before you make a choice. Take into account a variety of factors that are relevant to your project.
This is how you decide on the model you plan to utilize and eventually put it into production. All of this can be challenging, especially if you are new to the organization. The most common related scenarios which can constitute different manifestations of the model-selection mistake are the following:
Choosing a model just because it is supposed to be good or popular in the data science community.
This is the most common mistake related to model selection. Although there is nothing wrong with the models presented in articles or the models used by the “experts,” it is often the case that they are not the best ones to choose for every situation.
The right model for a particular problem depends on various factors, and it is virtually impossible to know which one it would be beforehand. So, going for an expert’s a priori view on what should work would be unscientific and imprudent.
Selecting a model because it has a very high accuracy rate.
This kind of model-related mistake is more subtle. Although accuracy is important when it comes to predictive analytics problems, it is not always the best evaluation metric to rely on.
Sometimes the nature of the problem calls for a very specific kind of performance metric, such as the F1 metric, or perhaps a model that is easy to interpret or implement.
There is also the case that speed is of utmost importance, so a faster model is preferred to a super-accurate one.
Going for a model because it is easy to understand and work with.
This is a typical rookie mistake, but it can happen to data science pros as well. If models are overly simple and have a ton of literature on them, some data scientists go for them, as they lack the discernment to make a more informed decision.
Models like that may be easy to defend as well, since everyone in a data analytics division has heard of a statistical inference system, for example.
Deciding on a model because it is faster than every other option you have.
Having a fast model to train and use is great, but in many cases, this is not enough to make it the optimum choice.
It is worth it to consider a more time-consuming model that can guarantee a higher value in terms of accuracy rate or F1 metric. So, choosing a model for its speed may not be a great option, unless speed is one of the key requirements of the systems you plan to deploy.
Dealing with each one of these possibilities will not only help you avoid a mistake related to the data modeling stage but also increase your confidence in the model you end up selecting.
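As a quick numeric illustration of the accuracy-versus-F1 point in the scenarios above, both metrics can be computed from confusion-matrix counts (the counts below are made up to show an imbalanced case):

```python
# Hypothetical confusion-matrix counts for a heavily imbalanced problem
tp, fp, fn, tn = 5, 5, 40, 950

accuracy = (tp + tn) / (tp + fp + fn + tn)          # 0.955 -- looks great
precision = tp / (tp + fp)                          # 0.5
recall = tp / (tp + fn)                             # about 0.111
f1 = 2 * precision * recall / (precision + recall)  # about 0.182 -- a very different story
```

A model can score 95.5% accuracy while catching barely a tenth of the positive cases, which is why accuracy alone should never decide a model selection.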
Value of a Mentor
Having a mentor in your life, especially at the beginning of your career, is a priceless resource. Not only will they be able to help you with general career advice, but they can also answer questions you may have about specific technical topics, such as methodological matters in the data science pipeline.
In fact, a mentor would probably be your best source of information about these matters, especially if it is someone who is active in the field and has hands-on knowledge of the craft.
It is important to have specific topics to discuss with your mentor in order to make the most of your time with them.
What’s more, they may be able to help you develop a positive approach to dealing with the mistakes you make and enable you to go deeper into the ideas behind each one of them. This will also help you develop a holistic understanding of data science and its pipeline.
Some Useful Considerations on Mistakes
Mistakes in the data science process are not something to be ashamed of or to conceal. In fact, mistakes are worth discussing with other people in the field, especially more knowledgeable ones.
They provide opportunities to delve into what you need to learn the most. No one is perfect, and even things that are considered good in data science now may prove to be sub-optimal or even bad in the future.
So, if you start feeling complacent about your work, this may be a sign that you are not putting in enough effort.
Data science is not an exact science, and the solutions that may be acceptable for now may not be good enough in the future. Keeping that in mind will help you avoid all kinds of issues throughout your data science career.
Allowing your thinking of the data science process to become stagnant is possibly the worst kind of mistake a person can make in this field, as data science greatly depends on having a flow of ideas, driven by creativity and evaluated by experimentation.
Handling Bugs and Mistakes
Once you have identified bugs in your code or mistakes in the data science process, you need to deal with them in an effective and efficient manner. In this blog, I will examine how you can accomplish this so that you do not need to dedicate too many resources on these often annoying situations.
In addition, I will discuss some ideas for selecting the most appropriate model in order to prevent the more serious pipeline issues as much as possible.
Strategies for Coping with Bugs
Bugs are very difficult to avoid. So, in order to cope with this matter and not let it disrupt your workflow, it is essential to come up with various strategies to deal with bugs, or even prevent them altogether. Let’s look at some strategies that have been shown to be effective in this regard.
A somewhat controversial-sounding strategy that is nevertheless guaranteed to take care of some bugs is using statically typed functions in all of your programs. This will enable you to stay on top of the type issues that often plague scripts, and therefore eliminate the possibility of such a bug arising.
If there is a type mismatch, it will become evident early, making it easy to deal with before it wastes much of your time.
Also, this will give your code a performance boost, at least in certain programming languages, such as Julia. If you are using Python, you can use a package called mypy to check the types in your scripts and see if they are properly defined.
Moreover, you can make use of type hints in your code to ensure that later users of these functions know what types are supposed to go into them and what types are expected to come out in the outputs.
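In Python, such type hints might look like the following minimal sketch (the function is illustrative):

```python
def scale_values(values: list[float], factor: float) -> list[float]:
    """Type hints document what goes in and what comes out; a checker
    such as mypy flags mismatched calls before the script ever runs."""
    return [v * factor for v in values]

scaled = scale_values([1.0, 2.0, 3.0], factor=2.0)
# scale_values("oops", 2.0) would only blow up at runtime; mypy would
# flag it statically, long before the script is executed.
```

The hints cost nothing at runtime; their value is in the documentation they provide and in what a static checker can catch from them.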
In addition, you can take care of many potential bugs proactively by setting up a series of unit tests. These are QA tests focused on making sure your scripts work as expected, covering a large variety of inputs and potential outputs and ensuring that the scripts are free of bugs that could manifest under normal operating circumstances.
This is the most essential kind of quality control you can do on your code, and it is a common practice among both coders and the more technically adept data scientists.
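A minimal unit-test sketch using Python's built-in unittest module, for a made-up data-cleaning helper:

```python
import unittest

def clean_ages(ages):
    """Drop obviously corrupt age values (negative or implausibly large)."""
    return [a for a in ages if 0 <= a <= 120]

class TestCleanAges(unittest.TestCase):
    def test_keeps_valid_values(self):
        self.assertEqual(clean_ages([25, 40, 33]), [25, 40, 33])

    def test_drops_corrupt_values(self):
        self.assertEqual(clean_ages([-1, 25, 999]), [25])

    def test_handles_empty_input(self):
        self.assertEqual(clean_ages([]), [])
```

Running `python -m unittest` against the module containing this code executes all three cases; each one pins down a behavior that a future change to the function must not break.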
Another strategy that may seem apparent to most people in this field has to do with how you test a data analytics script you have written against data.
It is best to start small: instead of running your code on the whole dataset, use only a small sample of it. This will quickly reveal whether the code hangs or yields nonsensical results, and it will do so within a reasonable amount of time.
This last part is highly important, as your code needs to scale well; otherwise, it may not be usable. This simple strategy will save you time in general and reveal issues in the script more quickly, as you will be able to customize its input to cover a large number of possibilities.
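The start-small approach can be sketched as follows (the dataset and processing step are stand-ins for real ones):

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible
full_dataset = [{"id": i, "value": i * 0.5} for i in range(100_000)]

# Develop and debug against a small random sample first...
sample = random.sample(full_dataset, k=500)

def process(records):
    # stand-in for an expensive analytics step
    return sum(r["value"] for r in records) / len(records)

quick_check = process(sample)  # finishes in moments; bugs surface fast
# ...and only once the sample run behaves, graduate to the full data:
# full_result = process(full_dataset)
```

If the sample run already hangs or returns nonsense, you have saved yourself a full-dataset run's worth of waiting to find that out.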
Also, a strategy that is common among programmers and which applies well to data science professionals too is sharing your scripts with other people in the field whom you trust, especially people you already work with, as they are bound to be more directly impacted by your code.
In case you have a mentor, it would be a good idea to show them your script as well, in order to obtain additional feedback and valuable guidance.
It is useful to keep in mind that you are not on your own, no matter how self-sufficient you may feel. Beyond the debugging part, sharing your code with others can even bring about insight on how you can optimize it.
Finally, a popular practice which stems from sharing your code is called pair-coding.
This has to do with coming together with another person as a team. Usually, one of you sits and does the scripting, while the other drives the process through pitching ideas and observing all that is being done to ensure the code makes sense.
As you might expect, the roles change periodically, so both parties get a more holistic experience of the process.
Pair-coding is helpful particularly if you are new to programming; apart from pinpointing your programming issues and inefficiencies, it also helps you build confidence in your coding skills over time.
Strategies for Coping with High-level Mistakes
As we saw in the previous blog, high-level mistakes, such as those related to the data science pipeline, can be challenging and time-consuming to deal with, even more so than programming bugs. Luckily, there are effective strategies for dealing even with these kinds of mistakes.
First of all, you can pick a data science case study, perhaps even a solved problem, and study it, focusing on how it follows the various stages of the pipeline. Even if all the steps are familiar to you, there is still a lot you can learn from this drill.
Moreover, by studying this material carefully, you can gradually hone your intuition about what needs to be done and how. All of this can help reduce the number of mistakes you make through the data science process.
Another strategy you can employ, which is also quite common, is keeping a record of your work and your assumptions through proper documentation.
This way, you can always go back to your notes if something goes wrong to figure out why it went wrong. Your notes may also provide you with insight as to how you can remedy the mistakes you may make, saving you energy and time.
For instance, maybe an assumption you made did not apply, or you forgot a step in the previous stages. Your documentation will likely be very useful at this stage, not to mention at the end of your project, when you may have to produce a report or a slideshow presentation of your work.
In addition, you can learn more about the data science process by studying it in detail from a reliable source. It would be best to be cautious about amateur material on YouTube though since there is no quality assurance process in place for anything being published there.
For any reliable source you decide on, make sure that you fully comprehend how the data science process applies to different problems since every application is slightly different.
A strategy that may appear to be more like common sense but is definitely worth mentioning is practice. This is necessary not just for dealing with potential mistakes that you may make in the data science pipeline, but also for cultivating the intuition that will allow you to become efficient at applying the process.
As it gradually becomes second nature to you, you will be able to minimize the chances of these mistakes occurring. It may also help you identify mistakes more accurately when they manifest so you can remedy them quicker.
Preventing Erroneous Situations in the Pipeline
As we saw in the previous blog, choosing the right model is a tricky thing, especially if you are new to the field, making it a potential cause of serious and sometimes hard-to-identify mistakes in the data science pipeline.
This is why we are going to look at this matter more extensively so that you prevent model selection from becoming an issue that can put your whole project in jeopardy.
Here, we will examine the different types of models that are often confused, how data evaluation factors in when it comes to selecting a model, how you can choose among various classification systems, and ensembles’ place in all of this.
Types of Models
Most of the issues with choosing the right model have to do with the following methodologies: predictive analytics (e.g. classification vs. regression or classifier A vs. classifier B), anomaly detection (e.g. finding an outlier), exploratory analysis (e.g. clustering) and recommender systems.
Your options will not always be narrowed down to a particular category of models, making the process of choosing the right model more challenging.
Of course, there are other model types that are potential options. For example, graph analysis is about relationships among data points in an abstract space that is inherently different from the feature space.
If you are asked to predict a particular value of a variable, for example, you are not going to consider graph analysis as an option for a model type.
In the case where you get the model category correct and the data at hand seems to fit a particular model type, this still does not make every model of that type a viable option.
The problem is not as simple as it seems at first. This is why we often need to dig deeper into the data and understand what makes more sense as a model. At the very least, this will point toward the right model category, which is a good start.
Evaluating the Data at Hand and Pairing It with a Model
Evaluating the data available to you is something that goes without saying, yet not everyone thinks of it as a way to figure out the most meaningful model category.
However, if you take the time to examine all the data you have, assess each feature’s potential for bringing out whatever signals are in the data, and see how it works with the other features, you can gain a much deeper understanding of the problem and what kind of model you can craft.
Sometimes it can be something as simple as examining which types of variables in your dataset can hint towards the model category.
For example, if you have mainly discrete variables as your target variables, you will need to focus on classification systems, since classification always deals with discrete targets.
Also, if you find that you have variables that are full of missing values and that you are looking at finding similar data points, this can point toward a recommender system.
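The variable-type heuristic described above can be sketched in a few lines of plain Python. This is only an illustration: the threshold of 10 distinct values is an arbitrary, made-up cutoff, not an established rule, and real projects would examine the target variable far more carefully.

```python
def suggest_model_category(target_values):
    """Very rough heuristic: string labels or only a handful of distinct
    values hint at classification; many distinct numeric values hint at
    regression. The threshold of 10 is purely illustrative."""
    distinct = set(target_values)
    if all(isinstance(v, str) for v in distinct) or len(distinct) <= 10:
        return "classification"
    return "regression"

print(suggest_model_category(["spam", "ham", "spam"]))        # classification
print(suggest_model_category([i / 3.0 for i in range(50)]))   # regression
```

In practice you would combine such a check with domain knowledge; a numeric variable with many distinct values can still be a categorical code in disguise.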
When it comes to exploratory analysis, the one technique that stands out is clustering. This is what you would normally do when you do not have a target variable.
Sometimes, you may use the cluster labels as the target variable afterward. There are two distinct possibilities when it comes to performing clustering: either predefine the number of clusters or let the clustering system work it out.
In the first case, you would use something like k-means (or one of its many variants), while in the latter case you would use another system like DBSCAN.
Keep that in mind so that you do not fall prey to thinking there is only a single methodology for clustering (one revolving around k-means).
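The two clustering routes mentioned above can be contrasted in a short sketch, assuming scikit-learn is available; the toy data and the eps/min_samples settings are fabricated purely for illustration.

```python
# Contrast of the two clustering approaches: k-means (cluster count
# predefined) vs. DBSCAN (cluster count worked out from density).
from sklearn.cluster import KMeans, DBSCAN

# Two well-separated blobs of points in 2-D (made-up data).
X = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2],
     [10.0, 10.0], [10.2, 10.1], [10.1, 10.3], [10.3, 10.2]]

# k-means: we tell the algorithm to find exactly 2 clusters.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN: eps and min_samples define what counts as a dense region;
# the number of clusters falls out of the data.
db_labels = DBSCAN(eps=1.0, min_samples=3).fit_predict(X)

print(len(set(km_labels)))  # 2, as requested
print(len(set(db_labels)))  # 2, discovered from the data
```

Note that on messier data DBSCAN may also emit a noise label (-1), which k-means has no equivalent for; that difference alone can drive the choice between them.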
Evaluating the data can go deeper, especially if you are knowledgeable about the potential categories of the model that are available.
For example, if you find that what you are planning to predict is a fairly rare phenomenon that is reflected in a particular variable, you may want to transform that variable to a binary one, with one value being related to all the cases you want to predict, and then seek a model in that anomaly detection domain.
Also, if you find that you have several features that loosely relate to a continuous variable you want to predict and find that there is no feature at your disposal that correlates well with the target variable for all of its values, you may want to introduce a discrete variable that acts as an intermediary with the target variable.
This way you can use the latter to build a classification model that breaks up the original regression problem into two or more smaller and more manageable problems, each of which you can then tackle using a regression system.
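The binarization idea described above (collapsing a rare phenomenon into a 0/1 target) can be sketched as follows; the threshold of 100.0 and the readings are made-up values for illustration only.

```python
# Collapse a rare phenomenon into a binary target so it can be treated
# as an anomaly detection / classification problem.
def binarize_rare_event(values, threshold=100.0):
    """Map each value to 1 if it reflects the rare event, else 0.
    The threshold is a hypothetical cutoff, not a general rule."""
    return [1 if v >= threshold else 0 for v in values]

readings = [3.2, 7.7, 120.5, 4.1, 150.0, 5.9]
target = binarize_rare_event(readings)
print(target)       # [0, 0, 1, 0, 1, 0]
print(sum(target))  # 2 rare cases out of 6
```

The resulting class imbalance (2 positives out of 6 here, usually far worse) is exactly why such problems are often better framed as anomaly detection than as vanilla classification.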
Choosing the Right Model for a Classification Methodology
Classification-related problems are where the biggest confusion comes about. This is mainly because there are so many options to choose from, especially nowadays. Plus, there is a strong trend toward more sophisticated systems that are widely available and well-documented.
However, not all problems lend themselves to an advanced classifier that, in order to work well, needs a considerable amount of fine-tuning without any guarantee of stable results. Sometimes a simple model is good enough because it is stable and fairly easy to interpret.
So, if you will be asked about the results afterward and are expected to know, for example, which features are more relevant and by how much, a black-box classifier is best avoided, while a more transparent system, like a random forest or k-nearest neighbors (kNN), would be more appropriate.
Also, if you find that most of your features are binary and more or less mutually exclusive (i.e. there is very small overlap among them), you may want to consider a Bayesian model.
If it is the speed that you are after (at the expense of high accuracy), you may want to go for a basic classification system like logistic regression or a decision tree, especially if you have done considerable work in feature engineering or if your data is information-rich to start with. Be sure to examine various possibilities when it comes to these systems’ hyperparameters, however.
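For the mostly-binary-features case mentioned above, a Bernoulli naive Bayes model is one concrete option. The sketch below assumes scikit-learn is available; the tiny dataset is fabricated purely to show the mechanics.

```python
# A minimal Bayesian baseline for binary features.
from sklearn.naive_bayes import BernoulliNB

# Each row holds three binary features; y is the class label.
X = [[1, 0, 0],
     [1, 1, 0],
     [0, 0, 1],
     [0, 1, 1]]
y = [0, 0, 1, 1]

model = BernoulliNB().fit(X, y)
preds = model.predict([[1, 0, 0], [0, 0, 1]])
print(preds)  # expect [0 1] on this toy data
```

Models like this train almost instantly, which makes them useful quick baselines before committing to anything more elaborate.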
Also, it is important to make sure you are using the right evaluation metrics when choosing the classifier for your project. Perhaps good precision is more important than good recall, or perhaps overall accuracy is the most relevant measure.
Or maybe the different classes are not equally important and it makes more sense to employ a cost function for the misclassifications involved when deciding which classifier is optimal for the problem at hand.
Or perhaps the confidence of the classifier is meaningful to you, so you want it to be aligned with the correct classifications, something that is reflected in a good ROC curve. Whatever the case, these are important factors to take into account before benchmarking your classifiers and deciding on the one to rely on in your project.
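To make the precision/recall/accuracy trade-off concrete, here is a small from-scratch computation on made-up predictions; in practice you would typically use a library such as scikit-learn's metrics module instead.

```python
# Precision, recall, and accuracy computed from first principles.
def precision_recall_accuracy(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp)   # of the predicted positives, how many are right
    recall = tp / (tp + fn)      # of the actual positives, how many we caught
    accuracy = (tp + tn) / len(y_true)
    return precision, recall, accuracy

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
p, r, a = precision_recall_accuracy(y_true, y_pred)
print(round(p, 3), round(r, 3), round(a, 3))  # 0.667 0.667 0.75
```

Notice that the same predictions yield different scores under each metric, which is why the metric must be chosen to match what actually matters in the project.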
Combining Different Options in an Ensemble Setting
Sometimes it is the case that the best model option is not a single model but rather a combination of models in the form of an ensemble (see glossary for a definition). In fact, there are a few classification systems which are ensembles under the hood (e.g. random forests and extreme learning machines).
The idea is that with an ensemble setting, you attain a better performance (whether this is measured as accuracy rate, F1 score, or some other metric), while at the same time limiting the risk of over-fitting.
However, ensembles tend to be more computationally expensive, translating into a lower speed in the classification system. Also, not all classifiers can be merged into an ensemble and have a guaranteed result.
In order for the ensemble to be effective, its components are better off being diverse, particularly in terms of the errors they make.
This is why the classifiers that comprise them are often trained differently from one another, leading to vastly different generalizations from the same dataset.
The differences in training can be because of different data being used (be it different samples or different feature sets), different operational parameters in the classifiers involved, or some combination of both.
Combining different classifiers is not always a simple and straightforward task. If you are dealing with different classifier types, a simple aggregate function may not be the best way to go, as other factors may need to be taken into account (e.g. the performance of a classifier in a particular region of the data space, or its confidence score).
Still, with proper attention, ensembles can be a great option to explore, particularly if you find that conventional classification systems fail to deliver a reasonable performance and you are willing to trade speed for performance in your project.
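The simplest aggregate function mentioned above, a majority vote, can be sketched in a few lines. The three prediction lists below are made-up stand-ins for the outputs of already-trained, diverse classifiers.

```python
# Bare-bones majority-vote ensemble over per-point predictions.
from collections import Counter

def majority_vote(*prediction_lists):
    """For each data point, return the label most classifiers agree on."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*prediction_lists)]

clf_a = [1, 0, 1, 1, 0]
clf_b = [1, 1, 1, 0, 0]
clf_c = [0, 0, 1, 1, 1]

print(majority_vote(clf_a, clf_b, clf_c))  # [1, 0, 1, 1, 0]
```

Weighted voting, stacking, or region-aware combiners address the subtler cases where a flat vote is too crude, at the cost of the extra computation noted above.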
Other Considerations for Choosing the Right Model
There are other things worth considering when choosing the right model for the problem you are tackling. For example, no matter how content you are with your model and how well it performs, its relevancy may change as new data becomes available.
That is not to say that suddenly you will need a model from an entirely different category, but most likely you will need to explore other options within the same category (after you try retraining your model to see how it performs with the new data).
Also, it may be the case that the requirements of the project change after the model has been in production for a while and there is feedback on its results.
For example, in the case of a classification scenario, it could be that the costs involved in the misclassifications have shifted and that a new cost function needs to be implemented.
This is normal since the world is not a static place and no matter how good a model is, it usually has an expiration date beyond which it is no longer relevant for the particular project.
As your data science expertise grows, you are bound to be able to do more with your data, changing the dynamics of the problem. Perhaps you can now figure out a better synthetic feature that can be used in a model or filter out noisy data points more effectively, bringing about a more robust feature set.
Your new expertise can be a gateway to a better model, which is something that will be worth investigating.
Things are not set in stone in data science. Whether it is because of new data, a new requirement, a new feature you come up with, or some other reason (e.g. another new system becomes available), you need to revisit the models you have selected and either revamp them or replace them altogether.
Programming bugs are a frustrating yet inevitable part of every data scientist's work. However, with the right attitude, they can also be useful lessons in terms of the programming language used, the algorithms involved, and the data science pipeline overall.
AI cannot rid the data science domain of bugs altogether. Programming bugs are typically encountered in the following areas:
Coding structures. These are subtle issues that involve more elaborate aspects of coding, such as loops
Functions. This is particularly important in cases where the same function is used in different programs, as is often the case in modern programming languages
The logic of the code. Bugs in this area are harder to pinpoint, as they involve issues with the algorithm behind the code
Indexing bugs, involving access to arrays, whether vectors, matrices, or multi-dimensional data structures
Bugs related to parameter values. These are subtle bugs involving the input of a function, such as a value that is out of range or one that is problematic in combination with other parameter values
Code that never runs, or runs very rarely. A side effect of conditionals where there is insufficient forethought about the various possibilities they cover
Infinite loops. Bugs that have to do with loops that never terminate, wasting computer resources and your time
Bugs having to do with the output of a function. When the output of a function is not what you expect and you feed it to some other function or to a model.
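Two of the categories above, indexing bugs and parameter-value bugs, are concrete enough to demonstrate in a short, made-up example:

```python
def pairwise_sums_buggy(xs):
    # Off-by-one indexing bug: range(len(xs)) lets i + 1 run past the
    # last valid index, so this raises IndexError on any non-empty list.
    return [xs[i] + xs[i + 1] for i in range(len(xs))]

def pairwise_sums_fixed(xs):
    # Stop one element early so xs[i + 1] is always in range.
    return [xs[i] + xs[i + 1] for i in range(len(xs) - 1)]

def normalize(value, scale):
    # Parameter-value bug prevention: validate inputs early instead of
    # letting a bad value (e.g. scale <= 0) propagate silently.
    if scale <= 0:
        raise ValueError("scale must be positive")
    return value / scale

print(pairwise_sums_fixed([1, 2, 3, 4]))  # [3, 5, 7]
```

Defensive checks like the one in normalize turn a subtle downstream bug into an immediate, easy-to-locate error, which is usually the cheaper failure mode.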