The rise of artificial intelligence (AI) over the last few years is undeniable. It is as rapid as it is surprising, and AI now plays a dominant role in many areas of technology and of our lives. This new technology not only brings new opportunities but also raises many questions about the trust we place in it. The concept of Trustworthy AI was created to ensure that the systems we build with AI are ethical, explainable, reliable, transparent, and robust. The last property, robustness, is extremely important in the area of cybersecurity, although its importance also extends to other domains. AI systems must be designed to be robust and resilient to various types of attacks and failures. This includes ensuring that they keep functioning even when faced with malicious attacks, system failures, or unexpected inputs. Robustness requires that AI systems are thoroughly tested and can operate in different environments and scenarios. In other words, AI systems have to be reliable, safe, and able to handle unexpected situations.

We can distinguish the two most common types of attacks on models: model evasion and data poisoning. A data poisoning attack begins when adversaries, known as “threat actors”, gain access to the training dataset and contaminate it, either by modifying existing entries or by injecting tampered data. This attack is very effective, but it requires the threat actor to obtain access to the training data, which makes the attack vector quite difficult to exploit. Model evasion attacks, by contrast, are among the most common inference-time attacks: they target the inference phase of the machine learning (ML) model lifecycle and compromise the integrity of the model’s predictions. They use well-crafted malicious inputs, so-called ‘adversarial examples’, to confuse ML models into making an incorrect prediction. As shown in the seminal work of Goodfellow, Shlens, and Szegedy, adding a minimal amount of noise to a test image, imperceptible to a human, was enough to change the classification from a “panda” to a “gibbon” with surprisingly high confidence (99.3%). Consider a situation where an attacker tampers with road signs, especially stop signs and speed limit signs; the consequences of such attacks can be devastating. These attacks do not even require added noise: small physical changes to a stop sign may be enough for the network to recognize it as a “50 km/h” speed limit sign. The above example highlights the vulnerability of AI models to security attacks, as even minor modifications to the input data can lead to significant errors in the model’s predictions. The consequence of such attacks is an erosion of trust in AI systems and potential negative impacts on the many applications relying on these models.
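
To make this concrete, here is a minimal sketch of the fast gradient sign method described by Goodfellow and colleagues, written in PyTorch purely for illustration; `model`, `x`, and `y` are placeholders for any differentiable classifier, an input batch, and its true labels.

```python
import torch
import torch.nn.functional as F

def fgsm_example(model, x, y, epsilon=0.007):
    """Craft an adversarial example by nudging x in the direction that
    increases the classifier's loss, bounded by a small epsilon."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # One signed gradient step; clamping assumes inputs scaled to [0, 1].
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()
```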

In our work, we concentrated on model evasion because it is the easiest and cheapest attack to mount and also the most realistic scenario. It is realistic because we assume that the attacker has no access to the model and knows neither its weights nor its architecture, but can query it multiple times, observe its output, and tune the attack based on those observations.
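
The sketch below illustrates this black-box setting with a deliberately naive random-search attack: the attacker only queries the model, reads its output probabilities, and keeps perturbations that lower its confidence. The function and the `predict_proba` interface are assumptions for illustration, not part of any specific tool.

```python
import numpy as np

def naive_black_box_evasion(predict_proba, x, true_class, step=0.05, queries=1000):
    """Perturb a single feature vector x using only query access: keep any
    random change that lowers the model's confidence in the true class."""
    rng = np.random.default_rng(0)
    best = x.copy()
    best_score = predict_proba(best[None, :])[0, true_class]
    for _ in range(queries):
        candidate = best + rng.normal(scale=step, size=x.shape)
        score = predict_proba(candidate[None, :])[0, true_class]
        if score < best_score:  # the model is now less confident in the true class
            best, best_score = candidate, score
    return best
```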

A key piece of technology built as part of the Spatial project is the security diagnosis library, which aims to identify security vulnerabilities and describe the security state of ML systems. This library focuses on the assessment and improvement of resilience against evasion attacks, which are the most common adversarial ML attacks. Many methods and tools have been developed to generate adversarial examples for performing evasion attacks. These can be used for empirical vulnerability assessment: they are executed against an ML model to infer how vulnerable it is to this threat. However, existing methods are typically restricted to a few domains and types of input, e.g., image, sound, or text, and to a few types of ML models, such as neural networks. In order to be applicable to many other ML applications, the security diagnosis library implements generic approaches to evasion attacks and the generation of adversarial examples.
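
As a rough illustration of what empirical vulnerability assessment means in practice (and not the library’s actual API), one can run an attack against every correctly classified test sample and measure how often the prediction flips:

```python
def attack_success_rate(model, attack, X, y):
    """Run `attack` on every correctly classified sample (X, y are NumPy
    arrays) and report the fraction of predictions it manages to flip."""
    correct = model.predict(X) == y
    flipped = 0
    for x_i, y_i in zip(X[correct], y[correct]):
        x_adv = attack(model, x_i, y_i)          # returns a perturbed sample
        flipped += int(model.predict(x_adv[None, :])[0] != y_i)
    return flipped / max(int(correct.sum()), 1)
```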

We have introduced several important and unique features to the security diagnosis library, enabling effective and comprehensive testing of the model’s resilience:

  • Applicability to a wide range of ML models. Our solutions allow you to run security tests against various types of models, such as SVM, Naïve Bayes, Random Forest, gradient-boosted decision trees, and deep neural networks.
  • Applicability to various frameworks. The security diagnosis library enables the testing of different ML frameworks, such as Scikit-learn, TensorFlow, PyTorch, XGBoost, ONNX, etc.
  • Various types of input data. Model input features can have different modalities compatible with the analyzed model, such as tabular data in CSV format or image files.
  • A wide range of evasion attacks. In the latest version, the security diagnosis library enables testing models with the SimBA, ZOO, HSJ (HopSkipJump), and NES attacks, each with its own set of configuration parameters that can be defined by the user/attacker.
  • Attack hyperparameter optimization capability. Our solution offers a method for users to search for the optimal parameters for a specified attack. This labour-intensive process can be set up to run in the background, and upon identifying the ideal combination, the attack is automatically executed using the discovered parameters (a minimal grid-search sketch follows this list).
  • Iterative measurement of the attack progress. This functionality allows one to observe attack effectiveness during its execution and collect relevant metrics, which can be stored and later used for in-depth analysis and visualization.
  • Storing best adversarial sample. During the attack, an iterative search is conducted for the optimal sample that introduces the minimal perturbation of features necessary to achieve the attack’s goal. As some algorithms eventually begin to yield less effective samples, this functionality allows for the selection of the best sample rather than merely the most recent one; a sketch after this list illustrates this together with the custom success criterion described next.
  • Modifying attack success criterion. This functionality enables the execution of successful attacks on binary classification models that utilize a custom classification threshold to determine the target class. This is especially beneficial in fields such as cybersecurity, where, for certain machine learning models, the output probability must be exceedingly high for a sample to be deemed malicious.
  • Definition of feature space constraints. It allows the definition of various sampling strategies that govern the modification of input features to launch an attack successfully. These strategies can be relative or absolute with regard to feature values and may include value clipping, rounding, monotonic constraints, etc. Additionally, the user can decide whether they should be applied globally or separately for each feature (a simple projection sketch follows this list).
  • Custom metrics for model security. We have developed metrics to evaluate complexity, detectability, and global feature distortions. These metrics facilitate the assessment of an attack’s effectiveness and the model’s security, and notably, they enable the comparison of different models’ security against one another (a simplified distortion sketch follows this list).
  • Security assessment report. A proper assessment of the model’s resilience ought to be grounded in a comprehensive and intelligible report. Users have the option to specify whether to preserve various file artefacts generated during attacks and whether to save the analysis results in HTML format.
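
To illustrate the hyperparameter optimization feature mentioned above, the sketch below shows the simplest possible approach, an exhaustive grid search over attack parameters. The `run_attack` callable and the parameter names are assumptions; the library’s own search procedure may differ.

```python
from itertools import product

def tune_attack(run_attack, param_grid, X, y):
    """Try every combination in param_grid and return the best-performing one.
    `run_attack` is an assumed callable returning the attack's success rate."""
    best_params, best_rate = None, -1.0
    for values in product(*param_grid.values()):
        params = dict(zip(param_grid.keys(), values))
        rate = run_attack(X, y, **params)
        if rate > best_rate:
            best_params, best_rate = params, rate
    return best_params, best_rate

# Example: tune_attack(run_attack, {"epsilon": [0.01, 0.05, 0.1], "max_iter": [100, 500]}, X, y)
```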
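
The next sketch combines the best-sample tracking and the custom success criterion described above: a candidate counts as successful only when the model’s “malicious” probability falls below a user-defined threshold, and among successful candidates the least-perturbed one is kept. The binary `predict_proba` layout and the helper names are illustrative assumptions, not the library’s interface.

```python
import numpy as np

def is_evasion_success(malicious_proba, threshold=0.9):
    """Custom success criterion: the sample evades detection only if the
    model's 'malicious' probability drops below the deployment threshold."""
    return malicious_proba < threshold

def best_adversarial_sample(candidates, original, predict_proba, threshold=0.9):
    """Among all successful candidates produced during an attack, keep the one
    with the smallest perturbation rather than the most recent one."""
    best, best_dist = None, np.inf
    for cand in candidates:
        proba = predict_proba(cand[None, :])[0, 1]   # assumes class 1 = malicious
        if is_evasion_success(proba, threshold):
            dist = np.linalg.norm(cand - original)
            if dist < best_dist:
                best, best_dist = cand, dist
    return best, best_dist
```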
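
Feature space constraints can be pictured as a projection step applied after each perturbation, as in the hypothetical helper below, which enforces per-feature bounds, rounds integer-valued features, and restores features the attacker may not modify.

```python
import numpy as np

def apply_feature_constraints(x_adv, x_orig, lower, upper, integer_mask, frozen_mask):
    """Project an adversarial candidate back into the allowed feature space:
    clip to per-feature bounds, round integer-valued features, and restore
    features the attacker is not allowed to modify."""
    x_adv = np.clip(x_adv, lower, upper)          # np.clip returns a new array
    x_adv[integer_mask] = np.round(x_adv[integer_mask])
    x_adv[frozen_mask] = x_orig[frozen_mask]
    return x_adv
```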
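
Finally, the idea behind the global feature-distortion metrics can be approximated with standard norms over the difference between the original and the adversarial sample; the sketch below is a simplified stand-in, not the metrics actually implemented in the library.

```python
import numpy as np

def distortion_metrics(x_orig, x_adv):
    """Simple global distortion measures: how many features changed and by how much."""
    diff = x_adv - x_orig
    return {
        "l0": int(np.count_nonzero(diff)),     # number of modified features
        "l2": float(np.linalg.norm(diff)),     # overall perturbation magnitude
        "linf": float(np.abs(diff).max()),     # largest single-feature change
    }
```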

As the use of ML models in contemporary IT systems grows, so does the concern for their security. Ensuring their security is crucial, as initial research using our library indicates that even well-functioning models can become vulnerable to attacks when the right tools are employed. The security diagnosis library aims to equip AI developers with the means to secure their creations and to finely balance the trade-off between performance and security.