NLP Evaluation in the Time of Large Language Models