• Arthur Samuel, a pioneer in artificial intelligence, described machine learning as a set of methods and technologies that “gives computers the ability to learn without being explicitly programmed.” In the particular case of supervised learning for anti-malware, the task can be formulated as follows: given a set of object features X and the corresponding object labels Y as input, create a model that will produce the correct labels Y’ for previously unseen test objects X’. X could be features representing file content or behavior (file statistics, the list of API functions used, etc.), and the labels Y could simply be “malware” or “benign” (in more complex cases, we may be interested in a fine-grained classification such as Virus, Trojan-Downloader, Adware, etc.). In unsupervised learning, we are more interested in revealing the hidden structure of the data, e.g., finding groups of similar objects or highly correlated features.
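
    As a minimal sketch of the supervised formulation above, the following toy example learns from labeled pairs (X, Y) and predicts labels for unseen objects. The nearest-centroid classifier, the two features, and all numeric values are assumptions chosen for illustration; they are not the features or models used in the products.

```python
from collections import defaultdict
from math import dist

def fit_centroids(X, Y):
    """Learn one mean feature vector (centroid) per label from training data."""
    sums, counts = {}, defaultdict(int)
    for x, y in zip(X, Y):
        if y not in sums:
            sums[y] = [0.0] * len(x)
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def predict(centroids, X_new):
    """Assign each unseen object the label of the closest centroid."""
    return [min(centroids, key=lambda y: dist(centroids[y], x)) for x in X_new]

# Made-up training data: (byte entropy, fraction of suspicious API calls).
X = [(7.8, 0.9), (7.5, 0.8), (4.1, 0.1), (3.9, 0.0)]
Y = ["malware", "malware", "benign", "benign"]

model = fit_centroids(X, Y)

# Previously unseen test objects X' and the predicted labels Y'.
X_test = [(7.9, 0.7), (4.0, 0.2)]
Y_pred = predict(model, X_test)
print(Y_pred)
```

    Any classifier with a fit step and a predict step fits the same formulation; the centroid model is used here only because it is short.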

    Kaspersky Lab’s multi-layered, next-generation protection uses machine learning methods extensively at all stages of the detection pipeline, from scalable clustering methods that preprocess the incoming file stream in the infrastructure to robust and compact deep neural network models for behavioral detection that work directly on users’ machines. These technologies are designed to address several important requirements for machine learning models in real-world information security applications: an extremely low false positive rate, interpretability of the model, and robustness to a potential adversary.

    Let’s consider some of the most important machine-learning-based technologies used in Kaspersky Lab endpoint products:

    Decision tree ensemble

    In this approach, the predictive model takes the form of a set of decision trees (e.g., a random forest or gradient boosted trees). Every non-leaf node of a tree contains a question about the features of a file, while the leaf nodes contain the tree’s final decision on the object. During the test phase, the model traverses each tree by answering the questions in the nodes with the corresponding features of the object under consideration. At the final stage, the decisions of the individual trees are averaged in an algorithm-specific way to produce the final decision on the object.
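
    The traversal and averaging steps can be sketched as follows. This is an illustration under stated assumptions, not the production model: the trees are hand-written rather than trained, the feature names (`entropy`, `suspicious_api_calls`, `is_packed`, `is_signed`) are hypothetical, and scores are combined by a plain random-forest-style mean (gradient boosted trees would sum weighted scores instead).

```python
def tree_predict(node, features):
    """Walk from the root to a leaf, answering each node's question."""
    while "leaf" not in node:
        # Question at this node: is the feature value above the threshold?
        branch = "yes" if features[node["feature"]] > node["threshold"] else "no"
        node = node[branch]
    return node["leaf"]  # the leaf stores this tree's malware score in [0, 1]

def ensemble_predict(trees, features, cutoff=0.5):
    """Average the per-tree scores and apply a decision cutoff."""
    score = sum(tree_predict(t, features) for t in trees) / len(trees)
    return score, ("malware" if score >= cutoff else "benign")

# Three hand-written trees over hypothetical file features.
trees = [
    {"feature": "entropy", "threshold": 7.0,
     "yes": {"leaf": 0.9}, "no": {"leaf": 0.1}},
    {"feature": "suspicious_api_calls", "threshold": 3,
     "yes": {"leaf": 0.8},
     "no": {"feature": "is_packed", "threshold": 0,
            "yes": {"leaf": 0.6}, "no": {"leaf": 0.05}}},
    {"feature": "is_signed", "threshold": 0,
     "yes": {"leaf": 0.1}, "no": {"leaf": 0.7}},
]

sample = {"entropy": 7.6, "suspicious_api_calls": 1, "is_packed": 1, "is_signed": 0}
score, label = ensemble_predict(trees, sample)
print(round(score, 3), label)
```

    Because every prediction follows an explicit path of feature questions, the path itself can be inspected, which is one reason tree ensembles are comparatively easy to interpret.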

    The model benefits the Pre-Execution Proactive protection stage on the endpoint side. One of our applications of this technology is Android Cloud ML, which is used for mobile threat detection.

    Similarity hashing (Locality sensitive hashing)

    Hashes used to create malware “footprints” in the past were sensitive to every small change in a file. Malware writers exploited this drawback through obfuscation techniques such as server-side polymorphism: minor changes in malware took it off the radar. A similarity hash (or locality-sensitive hash) is a method for detecting similar malicious files. To do this, the system extracts file features and uses orthogonal projection learning to choose the most important ones. ML-based compression is then applied so that value vectors of similar features are transformed into similar or identical patterns. This method generalizes well and noticeably reduces the size of the detection record database, since a single record can now detect a whole family of polymorphic malware.
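
    To illustrate the general idea of a locality-sensitive hash, the sketch below uses random hyperplane projections, a generic textbook LSH family. It is a stand-in, not the learned orthogonal projections or ML-based compression described above, and the feature vectors are made up.

```python
import random

def make_hyperplanes(n_bits, dim, seed=0):
    """Draw n_bits random Gaussian hyperplanes (a generic LSH family)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def similarity_hash(features, hyperplanes):
    """One bit per hyperplane: which side of it the feature vector lies on."""
    bits = 0
    for plane in hyperplanes:
        dot = sum(f * p for f, p in zip(features, plane))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

planes = make_hyperplanes(n_bits=64, dim=4)

# Made-up feature vectors for illustration only.
original = [0.90, 0.10, 0.75, 0.30]
variant = [0.91, 0.10, 0.74, 0.31]     # slightly modified copy of `original`
opposite = [-v for v in original]      # points in the opposite direction

h_original = similarity_hash(original, planes)
h_variant = similarity_hash(variant, planes)
h_opposite = similarity_hash(opposite, planes)

print("original vs variant :", hamming(h_original, h_variant))
print("original vs opposite:", hamming(h_original, h_opposite))
```

    Unlike a cryptographic hash, where a one-byte change flips roughly half of the output bits, nearby feature vectors here fall on the same side of most hyperplanes and therefore share most hash bits. A detection record can then match any hash within a small Hamming distance.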

    The model benefits the Pre-Execution Proactive protection stage on the endpoint side. It is applied in our Similarity Hash Detection System.

    Behavioral model

    A monitoring component provides a behavior log: the sequence of system events that occurred during process execution, together with their arguments. To detect malicious activity in the observed log data, our model compresses the obtained sequence of events into a set of binary vectors and trains a deep neural network to distinguish clean logs from malicious ones.
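
    The text does not specify how the event sequence is compressed, so the following is only one plausible encoding, shown to make the idea concrete. It assumes a fixed vocabulary of event names and slides a window over the log, emitting a multi-hot binary vector per window; the event names, arguments, and window size are hypothetical, and the neural network that would consume these vectors is omitted.

```python
def encode_log(events, vocabulary, window=3):
    """Turn an event sequence into a set of multi-hot binary vectors.

    Each vector marks which vocabulary entries occur in one sliding window.
    Event arguments are ignored in this simplified encoding.
    """
    index = {name: i for i, name in enumerate(vocabulary)}
    vectors = set()
    for start in range(max(1, len(events) - window + 1)):
        bits = [0] * len(vocabulary)
        for name, _args in events[start:start + window]:
            if name in index:
                bits[index[name]] = 1
        vectors.add(tuple(bits))
    return vectors

# Hypothetical event vocabulary and behavior log (event name, arguments).
vocabulary = ["CreateFile", "WriteFile", "RegSetValue", "CreateRemoteThread"]
log = [
    ("CreateFile", {"path": "a.tmp"}),
    ("WriteFile", {"path": "a.tmp"}),
    ("RegSetValue", {"key": "Run"}),
    ("CreateRemoteThread", {"target_pid": 1234}),
]

encoded = encode_log(log, vocabulary)
for vector in sorted(encoded):
    print(vector)
```

    Because the result is a set, repeated behavior patterns collapse into a single vector, which keeps the representation compact even for long logs.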

    The model benefits the Post-Execution Proactive protection stage on the endpoint side: it is an integral part of the Behavior Detection module in Kaspersky Lab products.

    Machine learning plays an equally important role when it comes to building a proper in-lab malware processing infrastructure. Kaspersky Lab uses it for the following infrastructure purposes:

    Incoming stream clustering

    ML-based clustering algorithms allow us to efficiently separate the large volumes of unknown files coming into our infrastructure into a reasonable number of clusters, some of which can be processed automatically based on the presence of an already annotated object inside them.
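
    A minimal sketch of this idea, assuming files are already represented as numeric feature vectors: a simple single-pass "leader" clustering groups nearby points, and any cluster that contains an annotated object passes that annotation to its other members. The clustering algorithm, the radius, and the data are illustrative assumptions, not the algorithms used in the infrastructure.

```python
from math import dist

def leader_clustering(points, radius):
    """Greedy single-pass clustering: a point joins the first cluster whose
    leader is within `radius`, otherwise it starts a new cluster."""
    clusters = []  # each cluster is a list of point indices; the first is the leader
    for i, p in enumerate(points):
        for members in clusters:
            if dist(points[members[0]], p) <= radius:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def propagate_labels(clusters, known_labels):
    """Give every member of a cluster the label of an annotated member, if any.

    If a cluster holds several annotations, the first one found is used;
    a real system would need a policy for conflicting annotations.
    """
    labels = {}
    for members in clusters:
        annotated = [known_labels[i] for i in members if i in known_labels]
        if annotated:
            for i in members:
                labels[i] = annotated[0]
    return labels

# Made-up 2-D feature vectors; only object 0 has been annotated.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (9.0, 0.0)]
known = {0: "malware"}

clusters = leader_clustering(points, radius=1.0)
labels = propagate_labels(clusters, known)
print(clusters)
print(labels)
```

    Clusters without any annotated member stay unlabeled and would still need further analysis.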

    Large-scale classification models

    Some of the most powerful classification models (such as a huge random decision forest) require a large amount of resources (processor time, memory) along with expensive feature extractors (e.g., sandbox processing may be required to obtain detailed behavior logs). It is therefore more effective to keep and run these models in the lab, and then distil the knowledge they have gained by training a lightweight classification model on the output decisions of the bigger model.
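
    The distillation step can be sketched as follows. The "teacher" below is a hand-written stand-in for a large in-lab model, the "student" is a single-split decision stump, and the features and samples are made up; all of these are assumptions for illustration only.

```python
def teacher_predict(x):
    """Stand-in for a large, expensive in-lab model (here: a majority vote
    over three fixed rules). Returns 1 for malware, 0 for benign."""
    entropy, api_calls, packed = x
    votes = [entropy > 7.0, api_calls > 3, packed == 1]
    return 1 if sum(votes) >= 2 else 0

def fit_stump(X, y):
    """Student: find the single (feature, threshold) split that best matches y."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X}):
            preds = [1 if x[f] > t else 0 for x in X]
            agreement = sum(p == target for p, target in zip(preds, y))
            if best is None or agreement > best[0]:
                best = (agreement, f, t)
    return best[1], best[2]

def stump_predict(stump, x):
    """Apply the learned split to one sample."""
    feature, threshold = stump
    return 1 if x[feature] > threshold else 0

# Unlabeled, made-up samples: (entropy, suspicious API calls, packed flag).
unlabeled = [(7.8, 5, 1), (7.4, 1, 1), (6.9, 6, 0), (4.2, 0, 0),
             (5.0, 1, 0), (7.2, 4, 0), (3.8, 2, 1)]

# The teacher's output decisions become the student's training targets.
teacher_labels = [teacher_predict(x) for x in unlabeled]
student = fit_stump(unlabeled, teacher_labels)

agreement = sum(
    stump_predict(student, x) == t for x, t in zip(unlabeled, teacher_labels)
) / len(unlabeled)
print(student, agreement)
```

    The student never sees ground-truth labels, only the teacher's decisions, so it needs no additional annotation effort and can be trained on large volumes of unlabeled files.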

    To learn more about machine learning in cybersecurity, read the whitepaper.

Related Products


Machine Learning for Malware Detection


Machine learning and Human Expertise

US 8250655 B1

Rapid heuristic method and system for recognition of similarity between malware variants

US 8955120 B2

Flexible fingerprint for detection of malware

US 9171155 B2

System and method for evaluating malware detection rules

Conference: Bayesian Methods in Deep Learning School 2017

Conference: ICML 2017 Workshop

Conference: ICLR 2017

Independent Benchmark Results

  • ICSA Advanced Threat Defense 2017Q3

  • AV-Comparatives Whole Product Dynamic Real-World Protection Test Feb-Jun 2017

  • SELabs Enterprise Endpoint Protection July-September 2017


