Do you remember the intelligent robots in the movie A.I. Artificial Intelligence? Well, that same technological concept applies to monitoring digital threats! Here at Axur we apply state-of-the-art machine learning techniques to identify phishing cases. Among the millions of URLs that we collect every day, we quickly spot cases that could affect hundreds of thousands of unsuspecting people. And the process is very interesting. Just look:
Machine learning: Teach the machines!
Artificial intelligence? Machine learning? Is it all the same thing? No! It’s like this: artificial intelligence, which is actually a generic term, is a very large field. It encompasses machine learning, which is the process by which machines are taught to recognize patterns in data and use them for decision-making. Within machine learning there are two main types:
- Unsupervised: This type learns from data that has no labels, finding patterns and groupings on its own, which is why it’s often called “self-taught.” A commonly cited example is search ranking, such as Google's, which is constantly being fed by user behavior.
- Supervised: This type needs to be trained on a labeled dataset (a group of example data). The dataset includes the desired result for each example in the form of a label (for example “true” or “false”). These labels are added manually to ensure that the model, after training, mimics human behavior in classifying occurrences. This is the type that we use to detect phishing!
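To make the supervised idea concrete, here is a minimal sketch in Python. It is not Axur's model: the four example URLs, the token-counting "learner" and the function names `tokens`, `train` and `predict` are all assumptions made up for illustration. It only shows the general pattern of learning from examples that a human has already labeled true or false.

```python
import re
from collections import Counter

# Hypothetical labeled examples: (URL, label), where True means phishing.
LABELED = [
    ("http://bank-login.example.tk/verify", True),
    ("http://promo-gift.example.ml/win", True),
    ("https://www.example.com/about", False),
    ("https://docs.example.org/guide", False),
]


def tokens(url):
    """Split a URL into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", url.lower())


def train(examples):
    """Count in how many phishing and legitimate URLs each token appears."""
    phishing, legit = Counter(), Counter()
    for url, is_phishing in examples:
        (phishing if is_phishing else legit).update(set(tokens(url)))
    return phishing, legit


def predict(model, url):
    """Label a URL as phishing when its tokens were seen more often in phishing."""
    phishing, legit = model
    score = sum(phishing[t] - legit[t] for t in set(tokens(url)))
    return score > 0


model = train(LABELED)
print(predict(model, "http://promo.example.tk/login"))
print(predict(model, "https://www.example.com/"))
```

A real model would learn from millions of verified rows and far richer signals, but the shape is the same: labeled examples go in, and a function that classifies new URLs comes out.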
Data science, algorithms and plenty of analyses: how Axur machine learning works
Axur's machine learning operation is quite simple to explain, though it’s not, of course, easy to build. It all starts with a database, which passes repeatedly through a huge number of tests and improvements so that actions can then be implemented to rule out URLs. Some datasets take months or more to prepare and can contain millions of pieces of data! But let’s start at the beginning.
Preparing the datasets
First, data science: Our machine learning team builds a database with a number of detected URLs that have already been verified by the Digital Fraud Discovery team. On each line there is either a true for occurrences that are actually phishing or a false for those that are legitimate.
This database is used for the initial algorithm’s first “lesson.” The data is split in two: the algorithm receives the first part to learn from, and the rest is held back to test what it has learned. The entire process is done with programming languages specific to data science, using a hybrid of on-premises and cloud structures to provide greater computing power.
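The "learn on one part, test on the rest" step can be sketched as follows. This is an illustration under assumptions: the 80/20 proportion, the fixed seed, the synthetic rows and the names `split_dataset` and `accuracy` are not from Axur's pipeline.

```python
import random


def split_dataset(rows, train_fraction=0.8, seed=42):
    """Shuffle labeled rows and split them into training and test portions."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predict, test_rows):
    """Fraction of held-out rows where the prediction matches the label."""
    if not test_rows:
        return 0.0
    hits = sum(1 for url, label in test_rows if predict(url) == label)
    return hits / len(test_rows)


# Hypothetical verified rows: (URL, label), where True means phishing.
rows = [(f"http://site{i}.example.tk/login", True) for i in range(5)]
rows += [(f"https://site{i}.example.com/", False) for i in range(5)]

train_rows, test_rows = split_dataset(rows)
print(len(train_rows), len(test_rows))
```

Keeping the test rows away from training matters because a model scored on examples it has already seen will look better than it really is.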
The results are then reviewed by expert phishing analysts, who point out any abnormalities to the data science team.
Understanding the features: one lesson at a time
Now comes one of the most important parts of the machine learning implementation process: the so-called feature engineering. This consists of identifying the characteristics that allow us to accurately differentiate phishing from legitimate cases. Some examples of features used in URL analysis are:
- Top-level domains (TLDs): This is an important feature because phishing attacks commonly use TLDs such as .tk and .ml.
- Suspicious words: Terms like “promo,” “mothersday” and so many others are very common in phishing. Our list of suspicious words is a compilation from the last 10 years of detection.
- And so many other examples! In addition to URLs, HTML elements can also be analyzed, such as a password entry field, and anything else related to websites.
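As a rough sketch, the three kinds of features listed above could be extracted like this. The feature names, the two-word list (taken from the examples in this article, while the real list is much longer) and the simple substring check for a password field are assumptions for illustration, not Axur's actual feature set.

```python
from urllib.parse import urlparse

SUSPICIOUS_TLDS = {"tk", "ml"}              # the TLDs mentioned above
SUSPICIOUS_WORDS = {"promo", "mothersday"}  # only the two examples given above


def extract_features(url, html=""):
    """Turn a URL (and optionally its page HTML) into a small feature dict."""
    host = (urlparse(url).hostname or "").lower()
    tld = host.rsplit(".", 1)[-1] if "." in host else ""
    lowered = url.lower()
    return {
        "tld": tld,
        "suspicious_tld": tld in SUSPICIOUS_TLDS,
        # Number of distinct suspicious words that appear anywhere in the URL.
        "suspicious_word_count": sum(w in lowered for w in SUSPICIOUS_WORDS),
        # Naive check: a real pipeline would parse the HTML properly.
        "has_password_field": 'type="password"' in html.lower(),
    }


features = extract_features(
    "http://mothersday-promo.example.tk/login",
    '<input type="password" name="p">',
)
print(features)
```

Each URL becomes one row of values like these, and those rows are what the algorithm actually learns from.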
In all, we analyze over 80 features. Once all of these are available, it's time for testing, testing, and more testing. Statistical analyses then show us which combinations of features are best suited to obtain the greatest possible number of hits.
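One simple way to picture this kind of testing is to score every combination of features against verified labels and rank them by hit rate. The sketch below does that with a deliberately crude rule (flag a URL when any feature in the combination is true). The rule, the five rows of data and the names `hit_rate` and `rank_combinations` are assumptions for illustration and stand in for the real statistical analysis.

```python
from itertools import combinations

# Hypothetical rows: boolean feature values plus the verified label.
ROWS = [
    ({"suspicious_tld": True, "suspicious_word": True, "has_password_field": True}, True),
    ({"suspicious_tld": True, "suspicious_word": False, "has_password_field": True}, True),
    ({"suspicious_tld": False, "suspicious_word": True, "has_password_field": True}, True),
    ({"suspicious_tld": False, "suspicious_word": False, "has_password_field": True}, False),
    ({"suspicious_tld": False, "suspicious_word": False, "has_password_field": False}, False),
]


def hit_rate(feature_names, rows):
    """Share of rows classified correctly when ANY named feature flags phishing."""
    hits = sum(
        any(features[name] for name in feature_names) == label
        for features, label in rows
    )
    return hits / len(rows)


def rank_combinations(rows):
    """Score every feature combination, best hit rate first."""
    names = sorted(rows[0][0])
    scored = []
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            scored.append((hit_rate(combo, rows), combo))
    # Ties go to the smaller combination, then alphabetical order.
    return sorted(scored, key=lambda item: (-item[0], len(item[1]), item[1]))


for rate, combo in rank_combinations(ROWS):
    print(f"{rate:.2f}", combo)
```

With 80 features, trying every combination this way is not feasible, which is one reason real feature selection relies on statistical methods rather than brute force.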
Constant quality control and monitoring
A small percentage of all the occurrences identified as phishing are randomly sent for team analysis. This allows us to see if the machine is really getting it right. Currently, the hit rate of the algorithms used to validate phishing is higher than the hit rate achieved by humans. After all, to err is human! Our process can validate a gigantic volume of data in minutes.
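The random review described above can be sketched in a few lines. The 5% fraction, the function names and the example verdicts are assumptions. The article only says that "a small percentage" is sampled.

```python
import random


def sample_for_review(flagged_urls, fraction=0.05, seed=None):
    """Pick a random share of flagged URLs for analyst review (at least one)."""
    flagged_urls = list(flagged_urls)
    if not flagged_urls:
        return []
    size = max(1, round(len(flagged_urls) * fraction))
    return random.Random(seed).sample(flagged_urls, size)


def audited_hit_rate(verdicts):
    """Share of reviewed cases that analysts confirmed as real phishing."""
    verdicts = list(verdicts)
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


# Hypothetical URLs that the model flagged as phishing.
flagged = [f"http://case{i}.example.tk/" for i in range(200)]
to_review = sample_for_review(flagged, fraction=0.05, seed=1)
print(len(to_review))

# Hypothetical analyst verdicts for the sampled cases.
print(audited_hit_rate([True] * 9 + [False]))
```

Because the sample is random, the hit rate measured on it is a fair estimate of how the model is doing on everything it flagged.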
Want to learn more about how Axur's entire digital risk monitoring and response process works? Then check out our solution for phishing, which ensures that no fake pages can affect your brand for long. Who knows, maybe machine learning could be your ally!
Invited specialist
Mateus Dalponte
PhD in Applied Physics and a member of Axur for 8 years, having started as manager of detection, analysis and fraud removal operations. Currently responsible for the Data Science and Machine Learning team, working on automation and economies of scale in detecting digital risks.
A journalist working as Content Creator at Axur, in charge of Deep Space and press activities. I have also analyzed lots of data and frauds here as a Brand Protection team member. Summing up: working with technology, information and knowledge together is one of my biggest passions!