Neural network model extraction
Natural language models can be "stolen" through a black-box API: the attacker sends random sentences to the API, then uses its outputs as labels to train a model of their own (see the sketch after the list below). Krishna et al. (see References) found the following:
- If the victim model and the attacker's model are pretrained on the same data (e.g., both start from public BERT weights), extraction is much more effective.
- The attack does not require confidence scores from the black-box API; hard labels alone are enough.
- The closer the random sentences are to the data the black-box API was trained on, the more effective the extraction.
- Currently known defenses are:
  - detecting and blocking queries likely to be part of an attack;
  - watermarking outputs of the black-box API.
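
To make the setup concrete, here is a minimal sketch of the hard-label extraction loop, assuming a binary classification task, a HuggingFace `transformers` student initialised from public BERT weights, and a placeholder `query_victim` function standing in for the black-box API. The names, vocabulary, and hyperparameters are illustrative assumptions, not taken from the paper.

```python
import random

import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def query_victim(sentence: str) -> int:
    """Stand-in for the black-box API; returns a hard label (class index)."""
    raise NotImplementedError("replace with a real call to the victim API")

# 1. Generate random queries. A toy vocabulary is used here; sampling real
#    words closer to the victim's training data works better.
vocab = "the of and to in movie great terrible plot acting".split()
queries = [
    " ".join(random.choices(vocab, k=random.randint(5, 20)))
    for _ in range(1000)
]

# 2. Label every query with the victim's hard prediction (no confidence scores).
labels = [query_victim(q) for q in queries]

# 3. Fine-tune a student initialised from the same public pretrained weights.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
student = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

enc = tokenizer(queries, padding=True, truncation=True, return_tensors="pt")
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"], torch.tensor(labels))
loader = DataLoader(dataset, batch_size=32, shuffle=True)

optimizer = torch.optim.AdamW(student.parameters(), lr=2e-5)
student.train()
for input_ids, attention_mask, y in loader:
    # Standard cross-entropy against the victim's hard labels.
    loss = student(input_ids=input_ids, attention_mask=attention_mask, labels=y).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```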
Such an attack could be used in the following scenarios:
- Extract the model, then use the extracted copy as a white box to construct adversarial examples against the original model (see the sketch after this list).
- Extract the model, then run a training-data extraction attack against the copy. This could recover private data that the black-box API was trained on.
- Extract the model, then use it directly to avoid calling out to the black-box API.
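
As a sketch of the first scenario: because the extracted model can be queried for free and inspected locally, an attacker can search it for label-flipping perturbations and then replay any hits against the original API, relying on adversarial transferability. The greedy word-substitution search and the helper names (`synonyms`, `student_predict`, `victim_predict`) below are illustrative assumptions, not the paper's method.

```python
def transfer_attack(sentence, synonyms, student_predict, victim_predict):
    """Greedily swap words until the extracted (student) model's label flips,
    then test whether the perturbed input also fools the victim API.

    synonyms: dict mapping a word to candidate replacements (hypothetical).
    student_predict / victim_predict: callables returning a hard label.
    """
    original_label = victim_predict(sentence)
    words = sentence.split()
    for i, word in enumerate(words):
        for candidate in synonyms.get(word, []):
            perturbed = " ".join(words[:i] + [candidate] + words[i + 1:])
            # Cheap white-box check against the local student model first...
            if student_predict(perturbed) != student_predict(sentence):
                # ...and only then spend a query on the black-box victim.
                if victim_predict(perturbed) != original_label:
                    return perturbed  # the adversarial example transferred
    return None  # no transferable example found in this search space
```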
References
- “How to Steal Modern NLP Systems with Gibberish?,” cleverhans blog, Apr. 2020. https://www.cleverhans.io/2020/04/06/stealing-bert.html
- K. Krishna, G. S. Tomar, A. P. Parikh, N. Papernot, and M. Iyyer, “Thieves on Sesame Street! Model Extraction of BERT-based APIs,” arXiv:1910.12366, Jan. 2020. https://arxiv.org/abs/1910.12366
- G. Hinton, O. Vinyals, and J. Dean, “Distilling the Knowledge in a Neural Network,” arXiv:1503.02531, Mar. 2015. https://arxiv.org/abs/1503.02531
- F. Tramèr, F. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart, “Stealing Machine Learning Models via Prediction APIs,” arXiv:1609.02943, Oct. 2016. https://arxiv.org/abs/1609.02943