
Neural network model extraction

Natural language models can be "stolen" through a black-box API: the attacker sends queries to the API, then uses its outputs as labels to train a copy of the model. Krishna et al. [2] showed that this works against BERT-based APIs even when the queries are nonsensical sequences of random words, and that the extracted model can approach the victim's accuracy at a cost far below that of collecting real training data.
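
The core query-and-distill loop can be sketched in a few lines. Below, a toy PyTorch classifier stands in for the remote black-box API; the architecture, vocabulary size, and hyperparameters are illustrative assumptions, not the paper's setup (which extracted fine-tuned BERT models using task-specific query heuristics):

```python
# Sketch of model extraction: query a black box with random token
# sequences, then distill its output distributions into a local copy.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE, EMBED_DIM, NUM_CLASSES = 1000, 64, 2  # illustrative sizes

class BagOfWordsClassifier(nn.Module):
    """Toy text classifier: mean-pooled embeddings + linear head."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.head = nn.Linear(EMBED_DIM, NUM_CLASSES)

    def forward(self, token_ids):
        return self.head(self.embed(token_ids).mean(dim=1))

victim = BagOfWordsClassifier()   # stand-in for the remote API
student = BagOfWordsClassifier()  # the attacker's local copy
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

def victim_api(token_ids):
    """Black box: the attacker sees only the output probabilities."""
    with torch.no_grad():
        return F.softmax(victim(token_ids), dim=-1)

for step in range(500):
    # 1. Generate random "sentences" (uniformly sampled token ids).
    queries = torch.randint(0, VOCAB_SIZE, (32, 16))
    # 2. Query the victim for soft labels.
    soft_labels = victim_api(queries)
    # 3. Train the student to match the victim's output distribution
    #    (a distillation objective in the sense of [3]).
    loss = F.kl_div(F.log_softmax(student(queries), dim=-1),
                    soft_labels, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```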

Such an attack could be used, for example, to replicate a commercial service without paying the cost of data collection and training, or to obtain a local white-box surrogate of the victim, against which further attacks (such as crafting transferable adversarial examples) are much easier to mount.

References

  1. https://www.cleverhans.io/2020/04/06/stealing-bert.html
  2. K. Krishna, G. S. Tomar, A. P. Parikh, N. Papernot, and M. Iyyer, “Thieves on Sesame Street! Model Extraction of BERT-based APIs,” arXiv:1910.12366 [cs], Jan. 2020, Accessed: Apr. 08, 2020. [Online]. Available: https://arxiv.org/abs/1910.12366
  3. G. Hinton, O. Vinyals, and J. Dean, “Distilling the Knowledge in a Neural Network,” arXiv:1503.02531 [cs, stat], Mar. 2015, Accessed: Apr. 08, 2020. [Online]. Available: https://arxiv.org/abs/1503.02531
  4. F. Tramèr, F. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart, “Stealing Machine Learning Models via Prediction APIs,” arXiv:1609.02943 [cs, stat], Oct. 2016, Accessed: Apr. 08, 2020. [Online]. Available: https://arxiv.org/abs/1609.02943