Problem Chosen
C
2023
MCM/ICM
Summary Sheet
Team Control Number
2318982
Wordle: One Letter Makes a Difference
Summary
Since its launch in early 2022, Wordle has sparked a wave of sharing yellow, green and grey
squares on social media. Wordle has simple but challenging rules that requiring only a short
attention span. Based on the Wordle dataset, we dig into the information hidden behind the number
and the percentage of reported results.
First, we focus on the number of reported results that varies over time. We try to build an
ARIMA model providing us with a prediction interval for the number of reported results on March
1, 2023. It indicates that the Wordle still maintains a high level of enthusiasm one year after
its release. Then, we explore the factors influencing the percentage of Hard Mode. By fitting a
multiple linear regression model, the results show that the number of repeated letters and the
frequency of words are correlated with the difficulty of the game. The difficulty information that
players obtained from the community in advance may influence their choice of game mode.
Next, we are curious how the distribution of the reported results would change in the future.
To simplify the model, we generalize the player’s game states to their known number of squares of
each color. Wordle can then be modeled as a Markov chain, and the problem is transformed into
solving the first-arrival distribution of it. This requires knowledge of the initial distribution
and transfer probabilities relying on the strategies chosen by players. In addition, the transfer
probability is assumed to depend on the difference in the amount of information between states. So
we propose a method to measure the current amount of information in the states. Based on this, we
model the entire Markov chain and solve the first reach-time distribution under different strategies.
To make the model more reasonable, it is assumed that the proportion of people choosing the
above two strategies varies with time. Accordingly, a method based on historical data is proposed
to estimate this proportion. Finally, we combine the estimated proportion with a Gaussian process
regression model to predict the future proportion of player strategy choices. This is then combined
with the Markov chains model to predict the distribution of future reported results. We finally
obtain the distribution of EERIE, which is (0.00, 0.15, 11.05, 28.44, 35.46, 21.16, 3.76).
Finally, we want to classify words according to their difficulty. Since word difficulty is only
related to the word itself, it is believed that clustering according to word attributes can reflect the
difficulty level of words. For this idea, K-Prototypes clustering is performed and reasonable word
difficulty index is set. Then, we extract the difficulty information of each category, and then plot
the density function and calculate Kullback-Leibler divergence. Both of results show that words
with different attributes have different difficulty levels. It proves that our idea is reasonable and the
classification model is accurate. Further, we classify the EERIE into “hard” class by its attributes,
which is consistent with the percentage distribution obtained above. In addition, we discuss other
information about the dataset, such as the difficult words, the easy words and the unexpected words.
Finally, the sensitivity analysis of the model shows the good robustness of our model.
Keywords: ARIMA; multiple linear regression; Markov chains; K-Prototypes clustering
更多数模资讯和学习资料,请关注b站/公众号:数学建模BOOM
b站主页:https://space.bilibili.com/350975620