League Of Legends Early Stats Analysis
This is a project for DSC80 where we analyze missing values, do permutation tests, and clean a League of Legends dataset.
Name(s): Quennie Zeng and Vaibhav Bommisetty
Introduction
We worked on a dataset of all professional competitive League of Legends matches that took place in 2022. In this dataset, every 12 rows describe the same game: 10 rows hold player statistics (5 players on each of the 2 teams) and 2 rows hold team statistics. The dataset contains 149,400 rows, which corresponds to 12,450 games. Our question is: “Which early game statistic can help us determine the outcome of a game the most?” To help us find the answer to our question, we used 4 different categories of statistics:
1. Outcome of Game
We needed a column to tell us which team won the game. That column is “result”:
result
- The value in this column lets us know if the team won or not. We cleaned the data so that if the team won the value is True, and if the team lost the value is False.
2. First to Objective Statistics:
These statistics give us insight into the team objectives in the early game. All of these columns contain booleans, as the objectives are either achieved or not achieved.
firstblood
- The value is True if the team achieved first blood, and False if the team did not achieve first blood. (A bonus of 150 gold is given to the player who gets first blood)
firstdragon
- The value is True if the team claimed the first dragon, and False if the team did not claim the first dragon. (The first dragon spawns at exactly 5 minutes into the match)
firstherald
- The value is True if the team claimed the first herald, and False if the team did not claim the first herald. (The first herald spawns at exactly 8 minutes into the match)
firsttower
- The value is True if the team destroyed the first tower, and False if the team did not destroy the first tower. (We treat this as an early game statistic even though its timing depends on how the game is going. 11 minutes is a good estimate of when the first tower is taken.)
3. Statistics at 10 Minutes:
These statistics are gathered at exactly 10 minutes into the game.
goldat10
- This gives the gold count of the team at 10 minutes.
xpat10
- This gives the xp of the team at 10 minutes.
csat10
- This gives the cs (creep score) of the team at 10 minutes.
golddiffat10
- This gives the difference in the gold between the teams at 10 minutes. If it is positive, the team we are looking at has more gold and vice versa.
xpdiffat10
- This gives the difference in the xp between the teams at 10 minutes. The positive and negative meanings are similar to the above column.
csdiffat10
- This gives the difference in the cs between the teams at 10 minutes. The positive and negative meanings are similar to the above column.
killsat10
- This gives the number of kills that the team has at 10 minutes.
assistsat10
- This gives the number of assists that the team has at 10 minutes.
4. Statistics at 15 Minutes:
Similar to the previous section, these statistics are gathered at exactly 15 minutes into the game.
goldat15
- This gives the gold count of the team at 15 minutes.
xpat15
- This gives the xp of the team at 15 minutes.
csat15
- This gives the cs (creep score) of the team at 15 minutes.
golddiffat15
- This gives the difference in the gold between the teams at 15 minutes. If it is positive, the team we are looking at has more gold and vice versa.
xpdiffat15
- This gives the difference in the xp between the teams at 15 minutes. The positive and negative meanings are similar to the above column.
csdiffat15
- This gives the difference in the cs between the teams at 15 minutes. The positive and negative meanings are similar to the above column.
killsat15
- This gives the number of kills that the team has at 15 minutes.
assistsat15
- This gives the number of assists that the team has at 15 minutes.
Cleaning and EDA
Data Cleaning:
This is the head of our df dataframe:
gameid | datacompleteness | url | league | year | split | playoffs | date | game | patch | … | opp_csat15 | golddiffat15 | xpdiffat15 | csdiffat15 | killsat15 | assistsat15 | deathsat15 | opp_killsat15 | opp_assistsat15 | opp_deathsat15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ESPORTSTMNT01_2690210 | complete | NaN | LCKC | 2022 | Spring | 0 | 2022-01-10 07:44:08 | 1 | 12.01 | … | 121.0 | 391.0 | 345.0 | 14.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
ESPORTSTMNT01_2690210 | complete | NaN | LCKC | 2022 | Spring | 0 | 2022-01-10 07:44:08 | 1 | 12.01 | … | 100.0 | 541.0 | -275.0 | -11.0 | 2.0 | 3.0 | 2.0 | 0.0 | 5.0 | 1.0 |
ESPORTSTMNT01_2690210 | complete | NaN | LCKC | 2022 | Spring | 0 | 2022-01-10 07:44:08 | 1 | 12.01 | … | 119.0 | -475.0 | 153.0 | 1.0 | 0.0 | 3.0 | 0.0 | 3.0 | 3.0 | 2.0 |
ESPORTSTMNT01_2690210 | complete | NaN | LCKC | 2022 | Spring | 0 | 2022-01-10 07:44:08 | 1 | 12.01 | … | 149.0 | -793.0 | -1343.0 | -34.0 | 2.0 | 1.0 | 2.0 | 3.0 | 3.0 | 0.0 |
ESPORTSTMNT01_2690210 | complete | NaN | LCKC | 2022 | Spring | 0 | 2022-01-10 07:44:08 | 1 | 12.01 | … | 21.0 | 443.0 | -497.0 | 7.0 | 1.0 | 2.0 | 2.0 | 0.0 | 6.0 | 2.0 |
ESPORTSTMNT01_2690210 | complete | NaN | LCKC | 2022 | Spring | 0 | 2022-01-10 07:44:08 | 1 | 12.01 | … | 135.0 | -391.0 | -345.0 | -14.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
We decided to keep the columns described in the introduction. When we first started looking at the data, we noticed that the number of wins and the number of losses were not the same. There were four more losses than wins, meaning that two games had no recorded winner.
gameid | datacompleteness | league | playoffs | result | firstblood | firstdragon | firstherald | firsttower | goldat10 | … | deathsat10 | goldat15 | xpat15 | csat15 | golddiffat15 | xpdiffat15 | csdiffat15 | killsat15 | assistsat15 | deathsat15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
11 | complete | LCKC | False | True | True | False | True | False | False | … | 3.0 | 24699.0 | 29618.0 | 510.0 | -107.0 | 1617.0 | 23.0 | 6.0 | 18.0 | 5.0 |
23 | complete | LCKC | False | True | True | True | True | False | True | … | 1.0 | 25285.0 | 29754.0 | 555.0 | 1763.0 | 906.0 | 22.0 | 3.0 | 3.0 | 1.0 |
34 | partial | LPL | False | True | True | False | NaN | NaN | NaN | … | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
46 | complete | LCKC | False | True | True | False | True | False | True | … | 1.0 | 24795.0 | 31342.0 | 560.0 | 1191.0 | 2298.0 | 15.0 | 3.0 | 8.0 | 1.0 |
58 | partial | LPL | False | True | True | True | NaN | NaN | NaN | … | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
These are the games in which both teams were recorded as winners or both as losers:
gameid | datacompleteness | url | league | year | split | playoffs | date | game | patch | … | opp_csat15 | golddiffat15 | xpdiffat15 | csdiffat15 | killsat15 | assistsat15 | deathsat15 | opp_killsat15 | opp_assistsat15 | opp_deathsat15 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
34918 | ESPORTSTMNT04_2170436 | complete | NaN | ESLOL | 2022 | Spring | 0 | 2022-03-02 17:22:03 | 1 | 12.04 | … | 530.0 | 2824.0 | -795.0 | 16.0 | 2.0 | 4.0 | 4.0 | 4.0 | 5.0 | 2.0 |
34919 | ESPORTSTMNT04_2170436 | complete | NaN | ESLOL | 2022 | Spring | 0 | 2022-03-02 17:22:03 | 1 | 12.04 | … | 546.0 | -2824.0 | 795.0 | -16.0 | 4.0 | 5.0 | 2.0 | 2.0 | 4.0 | 4.0 |
87406 | ESPORTSTMNT03_2788015 | complete | NaN | ESLOL | 2022 | Summer | 0 | 2022-06-27 17:25:00 | 1 | 12.11 | … | 522.0 | -916.0 | 885.0 | 7.0 | 1.0 | 0.0 | 2.0 | 2.0 | 3.0 | 1.0 |
87407 | ESPORTSTMNT03_2788015 | complete | NaN | ESLOL | 2022 | Summer | 0 | 2022-06-27 17:25:00 | 1 | 12.11 | … | 529.0 | 916.0 | -885.0 | -7.0 | 2.0 | 3.0 | 1.0 | 1.0 | 0.0 | 2.0 |
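One way to locate games like these is to count the winners per game, since a valid game has exactly one. The snippet below is a minimal sketch on made-up toy data, not the code used in this project. Only the column names gameid and result come from the dataset; the frame `team_rows` and its values are hypothetical.

```python
import pandas as pd

# Toy stand-in for the team-level rows (two rows per game). Values are made up.
team_rows = pd.DataFrame({
    "gameid": ["g1", "g1", "g2", "g2", "g3", "g3"],
    "result": [1, 0, 0, 0, 0, 1],
})

# A valid game has exactly one winner, so its results must sum to 1.
wins_per_game = team_rows.groupby("gameid")["result"].sum()
inconsistent_games = wins_per_game[wins_per_game != 1].index.tolist()

print(inconsistent_games)
```

In the toy frame, only game g2 is flagged because both of its teams are marked as losers.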
We then proceeded to find sources to see who was the winner in both of these games: Sector One won their game against LG UltraGear [1], and KRC Genk Esports won their game against LowLandLions [2].
After we fixed the inaccuracy in the data, we turned all columns that contain 1s and 0s into booleans. These included the column result and all the first to objective statistics.
Note: To keep only relevant columns, we removed the late-game statistics and the opposing team statistics. Since we only wanted to see which early game statistics impacted the outcome of the game, we removed columns such as ‘firstbaron’, as that is a late-game objective. We also removed opposing team statistics, as they can be found in the other team’s row.
After examining and cleaning the data, this is what the head of the dataframe we worked with looks like.
To build this dataframe, we kept the 11th and 12th rows of every 12-row game (index positions 10 and 11), as those rows hold the results for each team rather than each player. Then we took out the relevant columns. While cleaning our data, we found a discrepancy in the results: in two games the result was not recorded correctly. We went back to the recordings of those games and watched them to fill in the result column by hand. We changed the 1s and 0s to booleans in columns such as firstdragon to make them easier to read.
gameid | datacompleteness | league | playoffs | result | firstblood | firstdragon | firstherald | firsttower | goldat10 | … | deathsat10 | goldat15 | xpat15 | csat15 | golddiffat15 | xpdiffat15 | csdiffat15 | killsat15 | assistsat15 | deathsat15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10 | ESPORTSTMNT01_2690210 | complete | LCKC | False | False | True | False | True | 16218.0 | … | 0.0 | 24806.0 | 28001.0 | 487.0 | -1617.0 | -23.0 | 5.0 | 10.0 | 6.0 | 0.0 |
11 | ESPORTSTMNT01_2690210 | complete | LCKC | False | True | False | True | False | 14695.0 | … | 3.0 | 24699.0 | 29618.0 | 510.0 | -107.0 | 1617.0 | 23.0 | 6.0 | 18.0 | 5.0 |
22 | ESPORTSTMNT01_2690219 | complete | LCKC | False | False | False | False | True | 14939.0 | … | 3.0 | 23522.0 | 28848.0 | 533.0 | -1763.0 | -906.0 | -22.0 | 1.0 | 1.0 | 3.0 |
23 | ESPORTSTMNT01_2690219 | complete | LCKC | False | True | True | True | False | 16558.0 | … | 1.0 | 25285.0 | 29754.0 | 555.0 | 1763.0 | 906.0 | 22.0 | 3.0 | 3.0 | 1.0 |
34 | 8401-8401_game_1 | partial | LPL | False | True | False | NaN | NaN | NaN | … | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
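The cleaning steps described above can be sketched roughly as follows. This is an illustrative sketch, not the project's actual code: the helper name `clean_team_rows` and the toy frame `raw` are hypothetical, and it assumes every game occupies 12 consecutive rows with the two team rows last.

```python
import numpy as np
import pandas as pd

def clean_team_rows(df, keep_cols, bool_cols):
    """Keep team-level rows and selected columns, and cast 0/1 flags to booleans."""
    # Team rows sit at positions 10 and 11 of each 12-row game block (assumption).
    is_team_row = (np.arange(len(df)) % 12) >= 10
    team = df.loc[is_team_row, keep_cols].copy()
    for col in bool_cols:
        # Map explicitly so missing values stay missing instead of becoming True.
        team[col] = team[col].map({1: True, 0: False})
    return team

# Toy data: two games of 12 rows each, with made-up values.
raw = pd.DataFrame({
    "gameid": ["g1"] * 12 + ["g2"] * 12,
    "result": [0] * 10 + [1, 0] + [0] * 10 + [0, 1],
    "firstdragon": [0] * 10 + [0, 1] + [0] * 10 + [1, 0],
    "firstbaron": [0] * 24,  # late-game column that gets dropped
})

cleaned = clean_team_rows(
    raw,
    keep_cols=["gameid", "result", "firstdragon"],
    bool_cols=["result", "firstdragon"],
)
print(cleaned)
```

Using an explicit mapping rather than a plain boolean cast matters for partially complete games, where a blanket cast would silently turn missing values into True.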
EDA
Univariate Testing:
These pie charts show the proportion of games won when a team secures an early game objective.
Bivariate Testing
From these histograms, we can see that winning teams tend to have a higher gold difference than losing teams, and that their gold difference is more often positive.
These bar graphs show the proportion of winning teams that secure early game objectives. Unlike the pie charts above, which show the win rate of teams that get a given early-game objective, these bar charts show the proportion of winning teams that claim each objective. As you can see, more winning teams get First Blood and First Tower than First Dragon or First Herald.
Interesting Aggregates
This groupby() highlights the mean of each early game statistic for teams that lose and teams that win. The first row is for losing teams and the second row is for winning teams. For the first to objective columns, the mean is the proportion of teams in that group that secured the objective. For instance, about 68% of winning teams took the first tower, compared with about 32% of losing teams. This aggregate lets us compare how each early statistic differs between wins and losses.
result | first_dragon_mean | first_blood_mean | first_herald_mean | first_tower_mean | goldat10 | xpat10 | csat10 | killsat10 | assistsat10 | deathsat10 | goldat15 | xpat15 | csat15 | killsat15 | assistsat15 | deathsat15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
False | 0.423102 | 0.390553 | 0.417458 | 0.316997 | 15343.21 | 17986.81 | 310.04 | 1.77067 | 2.69307 | 2.60380 | 23963.84 | 28893.52 | 493.83 | 3.22754 | 5.20929 | 4.93162 |
True | 0.576710 | 0.607920 | 0.582259 | 0.683003 | 16033.79 | 18412.27 | 318.96 | 2.59562 | 3.98128 | 1.77735 | 25688.54 | 29926.60 | 511.14 | 4.92061 | 7.96557 | 3.23657 |
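An aggregate of this shape can be produced with a grouped mean. The snippet below is a small sketch on made-up numbers rather than the project's code. The frame `teams` and its values are hypothetical, and only two of the statistic columns are included.

```python
import pandas as pd

# Toy team-level data with made-up values.
teams = pd.DataFrame({
    "result": [True, False, True, False],
    "firsttower": [True, False, True, True],
    "goldat10": [16000.0, 15000.0, 16500.0, 15500.0],
})

# Cast to float so boolean flags average to proportions.
stats = teams[["firsttower", "goldat10"]].astype(float)

# One row per outcome: False (loss) first, then True (win).
summary = stats.groupby(teams["result"]).mean()
print(summary)
```

Because True counts as 1 and False as 0, the mean of a boolean column within a group is the share of teams in that group that secured the objective.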
Assessment of Missingness
NMAR Analysis
Yes, we believe that many columns in our data are Not Missing At Random (NMAR) because of the trends we see in the data. After cleaning the data, we saw that some statistics are missing for certain losing teams but present for certain winning teams, and the reverse also occurs, so the missingness appears to depend on the column itself as well as on other columns. This pattern is not determined by another column’s missingness (for the columns we are talking about), so it is not missing by design. Other columns, like killsat15, are MAR: their missingness depends on other columns such as league, since leagues like LPL are consistently missing this stat.
We also know that some columns, like firstdragon and golddiffat15, are MAR, as their missingness is correlated with league. For certain leagues, all the values in these columns are missing, so league plays a part in the missingness of other columns.
Missingness Dependency:
Missingness of ‘firstdragon’ depends on ‘league’:
Missingness of firstdragon does depend on league. We wanted to determine whether firstdragon was Missing at Random with respect to league or Missing Completely at Random.
Here is the observed distribution across leagues when firstdragon was not missing:
league | firstdragon |
---|---|
CBLOL | 0.022858 |
CBLOLA | 0.020318 |
CDF | 0.006867 |
CT | 0.002446 |
DDH | 0.019659 |
EBL | 0.017402 |
EL | 0.012699 |
ESLOL | 0.022764 |
EUM | 0.025115 |
GL | 0.016367 |
GLL | 0.019095 |
HC | 0.015238 |
HM | 0.014392 |
IC | 0.007055 |
LAS | 0.021353 |
LCK | 0.043928 |
LCKC | 0.037061 |
LCL | 0.001505 |
LCO | 0.019942 |
LCS | 0.028784 |
LCSA | 0.050795 |
LEC | 0.022858 |
LFL | 0.023234 |
LFL2 | 0.022670 |
LHE | 0.022858 |
LJL | 0.020130 |
LJLA | 0.003574 |
LLA | 0.017590 |
LMF | 0.030007 |
LPLOL | 0.019659 |
LVP SL | 0.023046 |
MSI | 0.007525 |
NEXO | 0.018154 |
NLC | 0.036027 |
PCS | 0.025491 |
PGC | 0.052864 |
PGN | 0.014016 |
PRM | 0.034522 |
SL (LATAM) | 0.015521 |
TAL | 0.019283 |
TCL | 0.020788 |
UL | 0.026150 |
UPL | 0.038755 |
VCS | 0.030383 |
VL | 0.015991 |
WLDs | 0.013263 |
Here is the observed distribution across leagues when firstdragon was missing:
league | firstdragon |
---|---|
DCup | 0.0 |
LDL | 0.0 |
LPL | 0.0 |
WLDs | 0.0 |
In this bar chart, we can see that the missing values for the column ‘firstdragon’ are concentrated in four leagues, with the majority in two leagues: LPL and LDL.
Our observed statistic was: 0.9923034634414514
Our p-value was: 0.0
Because the p-value is below the chosen significance level, we reject the null hypothesis that the missingness of firstdragon is unrelated to league. This suggests that the missingness is not completely at random, and that there is a systematic pattern tied to observed or unobserved variables. This situation is more indicative of missing at random (MAR) or not missing at random (NMAR).
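A permutation test of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the function names `tvd_by_group` and `missingness_permutation_test`, the default of 500 repetitions, the seed, and the toy frame `demo` are all hypothetical. It assumes the test statistic is the total variation distance between the league distributions of the missing and non-missing groups.

```python
import numpy as np
import pandas as pd

def tvd_by_group(categories, group):
    """TVD between the category distributions of the two groups in `group`."""
    dist = pd.crosstab(categories, group, normalize="columns")
    return dist.iloc[:, 0].sub(dist.iloc[:, 1]).abs().sum() / 2

def missingness_permutation_test(df, target, other, n_repetitions=500, seed=0):
    """Test whether the missingness of `target` depends on the column `other`."""
    rng = np.random.default_rng(seed)
    missing = df[target].isna().to_numpy()
    categories = df[other].to_numpy()
    observed = tvd_by_group(categories, missing)
    # Shuffling the missingness labels breaks any link to `other`.
    simulated = np.array([
        tvd_by_group(categories, rng.permutation(missing))
        for _ in range(n_repetitions)
    ])
    # Share of shuffles at least as extreme as what was observed.
    p_value = float(np.mean(simulated >= observed))
    return observed, p_value

# Toy data: firstdragon is missing only in league "B" (made-up values).
demo = pd.DataFrame({
    "league": ["A"] * 20 + ["B"] * 20,
    "firstdragon": [1.0] * 20 + [np.nan] * 20,
})
observed, p_value = missingness_permutation_test(demo, "firstdragon", "league")
print(observed, p_value)
```

In the toy data the missing values fall entirely in one league, so the observed statistic is at its maximum and essentially no shuffle matches it, which mirrors the pattern reported above for LPL and LDL.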
Here is the empirical distribution of the test statistic:
Missingness of firstdragon does not depend on firstblood
We wanted to determine whether firstdragon was Missing at Random with respect to firstblood or Missing Completely at Random.
Here is the observed distribution of firstblood when firstdragon was missing and not missing:
firstblood | firstdragon_missing = False | firstdragon_missing = True |
---|---|---|
False | 0.500705 | 0.5011 |
True | 0.499295 | 0.4989 |
When we graph this, we see that the distribution of firstblood is nearly identical whether firstdragon is missing or not.
After running our test, we get:
Our observed statistic was: 0.00039462604900317166
Our p-value was: 0.948
Since the p-value is above the chosen significance level, the result is not significant and we fail to reject the null hypothesis. The data are consistent with the missingness of firstdragon being unrelated to firstblood, so with respect to this column the values could be missing completely at random (MCAR). This makes sense, as the bar graph shows no visible relationship between the missingness of firstdragon and firstblood.
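As a quick check, the observed statistic can be recomputed from the rounded proportions in the table above, assuming the statistic is the total variation distance between the two columns. The variable names here are illustrative only.

```python
# Proportions of firstblood = False / True, copied from the table above.
not_missing = [0.500705, 0.499295]
missing = [0.5011, 0.4989]

# TVD is half the sum of absolute differences between the two distributions.
tvd = sum(abs(a - b) for a, b in zip(not_missing, missing)) / 2
print(tvd)
```

The result is about 0.000395, which agrees with the reported statistic up to the rounding of the table values.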
Here is the empirical distribution of the test statistic:
Hypothesis Testing
Null Hypothesis: The distribution of early game statistics where the team won is the same as the distribution of early game statistics where the team lost.
Alternate Hypothesis: The distribution of early game statistics where the team won is different from the distribution of early game statistics where the team lost.
Test Statistic: We will be using the Total Variation Distance (TVD).
We decided to use Total Variation Distance because we are comparing categorical distributions, and we want to know whether our two sample distributions came from the same distribution. Our significance level is 0.05, as it is the standard significance level.
We found how the number of early game points relates to the result of the game. For each point total, the table shows the proportion of teams that lost and the proportion that won.
result \ points | 0.0 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 | 7.0 | 8.0 | 9.0 | 10.0 |
---|---|---|---|---|---|---|---|---|---|---|---|
False | 0.909385 | 0.84707 | 0.759547 | 0.687318 | 0.589577 | 0.495258 | 0.408747 | 0.311444 | 0.241554 | 0.147168 | 0.09106 |
True | 0.090615 | 0.15293 | 0.240453 | 0.312682 | 0.410423 | 0.504742 | 0.591253 | 0.688556 | 0.758446 | 0.852832 | 0.90894 |
Looking at this through a bar chart,
Our observed statistic was: 0.05610845677532025
Our p-value was: 0.0
Here is the empirical distribution of the test statistic:
For our hypothesis testing, we did 500 simulations and were able to see that the p-value is significant as it is smaller than 0.05.
Conclusion: We reject the null hypothesis, meaning we believe that the distributions are not the same.
Since we reject the null hypothesis, there is good evidence that the distribution of early game statistics in games where the team won is different from the distribution in games where the team lost.