It might be tricky to isolate deception dynamics cleanly without also capturing some of the general cognitive skills that Alex employs across problem-solving contexts. Eliciting "raw" deception via adversarial examples might be one handy way of addressing this. Similarly, eliciting deception in a variety of different contexts might help single out the dynamics they share, since whatever is specific to any one context should wash out in the comparison.
Another approach would be to compare different snapshots of Alex obtained during its training. If one snapshot was taken just before Alex learned deception, the difference between it and a later snapshot might help single deception out for targeted removal. However, the newly acquired skill might also have shifted Alex's other internalized patterns in unexpected ways, so the difference would contain more than deception alone, making subtraction a messier means of singling out the "signature" of deception.
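To make the snapshot idea a bit more concrete, here is a minimal toy sketch. Everything in it is an assumption made for illustration: snapshots are represented as plain dictionaries of parameter lists, and the parameter names ("planning", "language", "social") and the helper functions are hypothetical, not a description of how Alex is actually stored or trained.

```python
# Toy sketch only: each snapshot is a dict mapping a (hypothetical) parameter
# group name to a flat list of floats. Real checkpoints would be tensors.

def subtract_snapshots(after, before):
    """Return the per-parameter difference after - before."""
    return {
        name: [a - b for a, b in zip(after[name], before[name])]
        for name in after
    }

def remove_delta(snapshot, delta, strength=1.0):
    """Subtract a scaled copy of delta from a snapshot."""
    return {
        name: [w - strength * d for w, d in zip(snapshot[name], delta[name])]
        for name in snapshot
    }

def changed_parameters(delta, tol=1e-6):
    """List the parameter groups that moved by more than tol."""
    return sorted(
        name for name, values in delta.items()
        if any(abs(v) > tol for v in values)
    )

# Hypothetical snapshots taken just before and just after deception was learned.
before = {"planning": [0.5, -0.25], "language": [1.0, 0.0], "social": [0.0, 0.75]}
after = {"planning": [0.5, -0.25], "language": [1.25, 0.0], "social": [0.5, 0.25]}

delta = subtract_snapshots(after, before)   # candidate "signature" of deception
restored = remove_delta(after, delta)       # full subtraction undoes the change

print(changed_parameters(delta))
```

In this toy case the difference touches the "language" group as well as the "social" one, which is the messiness described above: the candidate signature mixes the new skill with whatever else shifted alongside it. Subtracting the full difference from the "after" snapshot simply recovers the "before" snapshot, so the approach would only become interesting when the difference is applied to a later snapshot or with a partial `strength`.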