We registered predictions for GPT-4’s performance on HumanEval before training completed, using only information available prior to training. All but the 15 hardest HumanEval problems were split into 6 difficulty buckets based on the performance of smaller models. The results on the third-easiest bucket are shown in Figure 2; the predictions were very accurate for this subset of HumanEval problems, where we can accurately estimate log(pass_rate) for several smaller models. Predictions on the other five buckets performed almost as well, the main exception being that GPT-4 underperformed our predictions on the easiest bucket.
Certain capabilities remain hard to predict. For example, the Inverse Scaling Prize [38] proposed several tasks for which model performance decreases as a function of scale. Similarly to a recent result by Wei et al. [39], we find that GPT-4 reverses this trend, as shown in Figure 3 for one of these tasks, Hindsight Neglect [40].
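The HumanEval extrapolation described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the mean negative log(pass_rate) on a bucket follows a power law in training compute, y = alpha * C^(-k), and fits that curve to smaller models by least squares in log-log space. The functional form, the function names, and all data values below are assumptions made for this example, not details taken from the text.

```python
import math

def fit_power_law(compute, neg_log_pass_rate):
    """Fit y = alpha * C**(-k) via least squares on log y = log alpha - k * log C.

    Assumed functional form; returns the pair (alpha, k).
    """
    xs = [math.log(c) for c in compute]
    ys = [math.log(y) for y in neg_log_pass_rate]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

def predict(alpha, k, compute):
    """Extrapolate the fitted curve to a new compute budget."""
    return alpha * compute ** (-k)

# Hypothetical data: compute is normalized so the large model sits at 1.0,
# and the smaller-model measurements are synthetic points on an exact power law.
small_compute = [1e-4, 1e-3, 1e-2, 1e-1]
small_neg_log = [2.0 * c ** (-0.1) for c in small_compute]

alpha, k = fit_power_law(small_compute, small_neg_log)
predicted_neg_log = predict(alpha, k, 1.0)
predicted_pass_rate = math.exp(-predicted_neg_log)
```

Fitting in log-log space keeps the example to ordinary linear regression. With real measurements the points would be noisy, and a bucket whose smaller models have near-zero pass rates would make log(pass_rate) hard to estimate, which is consistent with the hardest problems being excluded.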
We tested GPT-4 on a diverse set of benchmarks, including simulated exams that were originally designed for humans. We did no specific training for these exams. A minority of the problems in the exams were seen by the model during training; for each exam we ran a variant with these questions removed and report the lower of the two scores. We believe the results to be representative. For further details on contamination (methodology and per-exam statistics), see Appendix C. Exams were sourced from publicly available materials. Exam questions included both multiple-choice and free-response questions; we designed separate prompts for each format, and images were included in the input for questions that required them. The evaluation setup was designed based on performance on a validation set of exams, and we report final results on held-out test exams. Overall scores were determined by combining multiple-choice and free-response question scores using publicly available methodologies for each exam. See Appendix A for further details on the exam evaluation methodology.
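The contamination rule described above (score the exam once in full and once with the previously seen questions removed, then report the lower score) can be sketched as follows. This is a minimal illustration that assumes a simple fraction-correct metric; the function names, question IDs, and results are hypothetical, and real exams use their own scoring methodologies.

```python
def fraction_correct(results, exclude=frozenset()):
    """Fraction of correctly answered questions, skipping IDs in `exclude`."""
    kept = [correct for qid, correct in results.items() if qid not in exclude]
    if not kept:
        raise ValueError("no questions left to score")
    return sum(kept) / len(kept)

def reported_score(results, contaminated):
    """Return the lower of the full score and the contamination-free score."""
    full = fraction_correct(results)
    clean = fraction_correct(results, exclude=contaminated)
    return min(full, clean)

# Hypothetical exam: q1 is assumed to have appeared in the training data.
results = {"q1": True, "q2": True, "q3": False, "q4": True}
contaminated = frozenset({"q1"})

full_score = fraction_correct(results)                  # 3 of 4 correct
clean_score = fraction_correct(results, contaminated)   # 2 of 3 correct
final_score = reported_score(results, contaminated)
```

Taking the minimum is the conservative choice: if the model did better on the seen questions, the reported number reflects only the unseen ones, and if removing them happens to raise the score, the full-exam score is reported instead.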