Problem Chosen
C
2023
MCM/ICM
Summary Sheet
Team Control Number
2307946
Words Behind Wordle: Puzzle Game Analysis
Using Machine Learning and Time Series Theory
Summary
Wordle is a popular puzzle currently offered daily by the New York Times. Players try
to solve the puzzle by guessing a five-letter word in six tries or fewer, receiving feedback
with every guess. Making full use of the relevant information can effectively help editors
improve operational performance.
Firstly, to explain the variation in the number of reported results and to predict its future
value, a time series model is introduced. After determining the optimal group of orders,
an ARIMA(0,1,1) model is used to forecast the number of reported results on March 1,
2023; the 80% prediction interval is [10139.23, 30808.07]. To find out whether any
attributes of the word affect the Hard Mode percentage, a word-attribute system and a
LightGBM model are introduced. The results show that some lagged attributes have an
effect, though a smaller one than that of the lagged Hard Mode percentage itself.
Secondly, to predict the associated percentages of (1, 2, 3, 4, 5, 6, X), two models are
established based on GBDT and MMoE. The results show that the MMoE model signif-
icantly outperforms the GBDT model, achieving an MSE of 145. We then attempted to
improve the model with data augmentation and feature engineering. The former intro-
duces a large amount of noise and fails to achieve the expected effect, while the latter
slightly improves model performance. The final model's prediction for the word
EERIE is (0.649, 7.579, 26.298, 32.614, 20.930, 9.63, 2.298).
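The MMoE network itself is beyond a short sketch, but the multi-output setup shared by both models can be illustrated with a GBDT baseline in scikit-learn; the features and targets below are synthetic stand-ins for the paper's word attributes and reported distributions.

```python
# Sketch of a GBDT baseline predicting the seven percentages
# (1, 2, 3, 4, 5, 6, X) jointly, on synthetic data.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))                   # stand-in word-attribute features
W = rng.normal(size=(8, 7))
raw = np.exp(X @ W + rng.normal(0, 0.1, size=(300, 7)))
y = 100 * raw / raw.sum(axis=1, keepdims=True)  # rows of percentages summing to 100

# GBDT is single-output, so wrap one regressor per target.
model = MultiOutputRegressor(GradientBoostingRegressor(random_state=0))
model.fit(X[:250], y[:250])
pred = model.predict(X[250:])
mse = mean_squared_error(y[250:], pred)
```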
Thirdly, a K-means model is introduced to cluster the samples into 4 groups by difficulty,
using the distribution of attempt counts as the features. To determine which attributes
of the words are associated with the classifications, we used the cluster label as the
output and all word attributes as the input features to train a LightGBM model. The
accuracy on the test set reaches 70%, and the importances of the input features are
ranked. Finally, the model is used to predict the category of the word EERIE; the
predicted result is Group 2.
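The clustering step can be sketched as follows; the attempt-count distributions here are randomly generated stand-ins for the real data, so the resulting groups are illustrative only.

```python
# Sketch of the clustering step: group puzzles into 4 difficulty clusters
# using the distribution of attempt counts (1..6 tries and X) as features.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Each row: percentages for (1, 2, 3, 4, 5, 6, X), summing to 100.
raw = rng.dirichlet(alpha=np.ones(7), size=200) * 100

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(raw)

labels = km.labels_            # difficulty-cluster index per puzzle
centers = km.cluster_centers_  # mean attempt distribution per cluster
```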
Finally, some interesting features of the dataset are identified: the characteristics of
high-frequency words, the shape of the distribution of attempt numbers, and the corre-
lations among the word features are discussed.
In addition, we evaluated the advantages and disadvantages of the model and pro-
posed some suggestions, and we carried out a sensitivity analysis of the model with
respect to the commission rate, thereby demonstrating its reliability and stability.
Keywords: Wordle; ARIMA; LightGBM; MMoE; data augmentation; feature engineer-
ing; K-means; sensitivity analysis
Team # 2307946 Page 1 of 25
Contents
1 Introduction 3
1.1 Problem Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Restatement of the Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 General Assumptions and Model Overview 4
2.1 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Model Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
3 Model Preparation 5
3.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
3.2 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
4 Model I: Time-Series Forecasting Model 7
4.1 The concept of Time Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
4.2 Stationarity of time series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
4.3 Model Building . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
4.4 Forecasting Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
5 Extraction and Analysis of Word Attributes 9
5.1 Extraction of Word Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
5.2 Word Attributes Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
6 Model II: Explaining Hard Mode Percentage Using LightGBM 11
6.1 Introduction of LightGBM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
6.2 Data Description and Preprocessing . . . . . . . . . . . . . . . . . . . . . . . 12
6.3 Model Results and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
7 Model III: Multiple Input - Multiple Output Regression Model 15
7.1 White Noise Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
7.2 Model Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
7.3 Model Refinement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
7.3.1 Data Augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
7.4 Feature Engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
7.5 Model Results and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
8 Model IV: LightGBM Classifier based on K-means Clustering Model 18
8.1 Concept of K-means Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . 18
8.2 Clustering Model Building . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
8.3 Evaluation of Clustering Result . . . . . . . . . . . . . . . . . . . . . . . . . . 20
8.4 Identification of Important Attributes . . . . . . . . . . . . . . . . . . . . . . 20
8.5 Classification Result and Evaluation . . . . . . . . . . . . . . . . . . . . . . 20
9 Other Interesting Features of the Data Set 21
10 Sensitivity Analysis 22
10.1 Sensitivity Analysis for Question 1 . . . . . . . . . . . . . . . . . . . . . . . . 22
10.2 Sensitivity Analysis for Question 3 . . . . . . . . . . . . . . . . . . . . . . . . 23
11 Strengths and Weaknesses 23
11.1 Strengths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
11.2 Weaknesses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
12 A Memorandum to the New York Times Puzzle Editor 24
1 Introduction
1.1 Problem Background
At the beginning of 2022, a simple but novel game gained great popularity on Twitter:
the web word game Wordle, written by Josh Wardle and published by the New York
Times Company.
The game was fairly unknown at the very beginning, but after Wardle creatively added
a function that allows players to copy their results as a grid of colored square emojis
and share them, it immediately attracted public attention. As of mid-January 2022,
more than 2 million people had played, and more than 1.2 million Wordle results had
been posted on Twitter.
In Wordle, players have to guess a five-letter English word within six chances in one
day. After each attempt, the player receives one of three types of feedback for each letter:
green if the letter is in the correct position; yellow if the answer contains the letter but
in a different place; and gray if the answer does not contain the letter at all. The game-
play is similar to games like Mastermind, but Wordle clearly indicates which letters
were guessed correctly. [1]
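The feedback rule just described can be written down as a short function; this is a plausible reconstruction of Wordle's coloring logic (including its handling of repeated letters), not code from the paper.

```python
# Sketch of the per-letter feedback rule: green in place, yellow if the
# answer contains the letter elsewhere, gray otherwise. Each answer
# letter can justify at most one non-gray tile.
from collections import Counter

def feedback(guess: str, answer: str) -> str:
    result = ["gray"] * 5
    remaining = Counter()
    # First pass: mark greens and count the unmatched answer letters.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "green"
        else:
            remaining[a] += 1
    # Second pass: turn tiles yellow while unmatched copies remain.
    for i, g in enumerate(guess):
        if result[i] != "green" and remaining[g] > 0:
            result[i] = "yellow"
            remaining[g] -= 1
    return " ".join(result)

print(feedback("eerie", "there"))  # → "yellow gray yellow gray green"
```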
Apart from that, Wordle has another game mode. On top of the regular rules above,
"Hard Mode" requires that once a player has found correct letters in a word, those
letters must be used in subsequent guesses.
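The Hard Mode constraint can likewise be expressed as a small check; the function below is an illustrative reading of the rule (greens fixed in place, yellows reused somewhere), not the paper's implementation.

```python
# Sketch of the Hard Mode rule: revealed hints must be reused in the
# next guess — green letters in the same position, yellow letters
# anywhere among the remaining letters.
def respects_hard_mode(prev_guess, prev_colors, new_guess):
    pool = list(new_guess)
    # Green letters must stay in place (and are consumed from the pool).
    for i, color in enumerate(prev_colors):
        if color == "green":
            if new_guess[i] != prev_guess[i]:
                return False
            pool.remove(prev_guess[i])
    # Yellow letters must appear somewhere among the remaining letters.
    for i, color in enumerate(prev_colors):
        if color == "yellow":
            if prev_guess[i] not in pool:
                return False
            pool.remove(prev_guess[i])
    return True
```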
In fact, a profound mathematical mechanism lies behind the seemingly simple game.
We cannot help wondering what mechanism affects how efficiently players make correct
guesses, what laws lie behind the constantly changing number of reported results on
Twitter, and on what basis players choose Hard Mode.
We expect to solve the above problems through mathematical modeling, so as to effec-
tively predict the future operation of the game and provide the Puzzle Editor of the
New York Times with business suggestions.
Figure 1: NY Times Wordle
Figure 2: Example of solution
1.2 Restatement of the Problem
As we have a data set containing the date, contest number, word of the day, the num-
ber of people reporting scores that day, the number of players on Hard Mode, and the
distribution of the reported results. We need to build mathematical models to solve the
following problems for New York Times Company:
Question 1:
1. Develop a model that explains the variation in the number of reported results, then
use the developed model to predict this number for March 1, 2023.
2. Find out the possible attributes of the given word that may influence the per-
centage of reported scores played in Hard Mode, and explain the underlying
mechanism of the influence.
Question 2: Develop a model that forecasts the distribution of the reported results of
a given word on a future day. Then discuss the uncertainties and the accuracy of the
prediction model.
Question 3: Adopt a mathematical model to classify solution words by difficulty, iden-
tify the attributes of a given word associated with each classification, and evaluate the
accuracy of the classification.
Question 4: Discuss and find other features within the data set.
2 General Assumptions and Model Overview
2.1 Assumptions
To simplify the problem, we make the following basic assumptions, each of which is
properly justified.
1. The number of reported results on Twitter can effectively represent the total number
of players on the day, and the percentage of scores reported that were played in
Hard Mode is the same as that of all players.
2. The distribution of the reported results recorded in the dataset is completely accu-
rate.
3. There are correlations and differences between the associated percentages of 1 try, 2
tries,· · · , X.
4. The word difficulty is proportional to the average number of tries to guess the result.
2.2 Model Overview
In summary, the whole modeling process can be shown as follows: