Goal: Create a model that can predict a race time to 90% accuracy from a given set of parameters.
Initial work:
Simple linear regression on mileage vs. 8k time (the most current race distance that I have a somewhat decent number of data points for).
Although simple, we can see that there is a strong correlation between the amount of running done and the improvement in my 8k time. Mileage here is the total mileage in the 24 weeks leading up to each race. You can see a huge gap in the middle, since over the last 3 years I made a huge leap in mileage and a huge leap in time.
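As a rough sketch of what this regression does, here is a least-squares line fit in plain Python. The numbers are made up for illustration and are not my actual data; `fit_line` is just a name picked for this example.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Made-up illustrative numbers, NOT real race data:
# total miles in the 24 weeks before each race, and 8k time in seconds.
mileage = [600, 700, 800, 1200, 1300]
times_8k = [1800, 1770, 1740, 1620, 1590]

slope, intercept = fit_line(mileage, times_8k)

# Predicted 8k time (seconds) for a hypothetical 1000-mile block.
predicted = slope * 1000 + intercept
print(f"slope={slope:.3f} s/mile, intercept={intercept:.1f} s, predicted={predicted:.1f} s")
```

A negative slope here means more mileage goes with a faster (smaller) 8k time. Note the gap between 800 and 1200 miles in the toy data, similar to the gap described above: the fit in that region is an interpolation with no points to back it up.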
What if we wanted to relate more data, though? This is a very vanilla "X vs. Y" graph with only 2 dimensions. What about regressing on other variables? Maybe we take the number of "faster runs", the number of athletes that you've run with (group runs, training partners, etc.), the average speed of your runs, or even the number of runs (an increase in mileage tends to coincide with an increase in double workout days).
Initially, putting all of the above parameters into a model made for a horrible model, and I quickly found out why.
If we look at the pairwise correlations between our "X" values, we see that there are a lot of high correlations. Most of them are 0.9 or above. This is bad for a multiple regression (the problem is known as multicollinearity), since the model will have a very difficult time separating out the individual effect of each variable.
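A minimal sketch of this check, assuming each feature is just a list of numbers with one entry per race. The feature names and values below are invented for illustration, and the 0.9 cutoff simply mirrors the number mentioned above.

```python
from itertools import combinations
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var_x = sum((x - x_bar) ** 2 for x in xs)
    var_y = sum((y - y_bar) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def high_correlation_pairs(features, threshold=0.9):
    """Return (name_a, name_b, r) for every feature pair with |r| >= threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(features.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged

# Hypothetical features, one value per race (NOT real data).
features = {
    "mileage": [600, 700, 800, 1200, 1300],
    "num_runs": [120, 138, 161, 240, 262],  # rises almost in lockstep with mileage
    "fast_runs": [10, 30, 12, 25, 18],
}

flagged = high_correlation_pairs(features)
for name_a, name_b, r in flagged:
    print(f"{name_a} vs {name_b}: r = {r}")
```

Any pair that shows up in `flagged` carries nearly the same information, so keeping both in the regression mostly adds noise to the coefficients.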
Currently, the data returned by the Strava API 'get activity' call offers few fields that can be used without running into high pairwise correlation. However, I figured out how to retrieve lap data as well, which is a lot less efficient since I have to pull it one activity at a time. Maybe there is some useful insight in it. Off the top of my head, I can figure out:
The types of workouts I do by split length and speed,
The amount of true "fast" miles that I've done (since sometimes I lump my fast miles with slow warmup and cooldown miles).
Update July 24th, 2025
I got my lap data and added it to our little correlation matrix. It turns out that tracking fast laps isn't super beneficial (counting the number of laps faster than 6-minute pace). This was kind of expected, since I wasn't anticipating much of a difference between 'fast_laps' and '%fast_runs'. I might have to look into other metrics such as heart rate and training consistency. I don't feel too certain about how successful this will be.
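For reference, a sketch of how a 'fast_laps' count could be computed. It assumes each lap is a dict with `distance` in meters and `moving_time` in seconds (modeled on the fields in Strava lap data), and that "6 minute pace" means 6:00 per mile. Both are assumptions, and the sample laps are invented.

```python
METERS_PER_MILE = 1609.344

def pace_seconds_per_mile(distance_m, moving_time_s):
    """Convert a lap's distance and moving time into seconds per mile."""
    return moving_time_s / (distance_m / METERS_PER_MILE)

def count_fast_laps(laps, threshold_s_per_mile=360.0):
    """Count laps run faster than the threshold pace (default 6:00 per mile)."""
    return sum(
        1
        for lap in laps
        if pace_seconds_per_mile(lap["distance"], lap["moving_time"]) < threshold_s_per_mile
    )

# Invented example: a warmup mile, one fast mile, and one fast 800 m rep.
laps = [
    {"distance": 1609.344, "moving_time": 480},  # 8:00 per mile
    {"distance": 1609.344, "moving_time": 350},  # 5:50 per mile
    {"distance": 800.0, "moving_time": 170},     # roughly 5:42 per mile
]

fast_laps = count_fast_laps(laps)
print(fast_laps)
```

Working per lap rather than per run is what keeps the slow warmup mile from dragging the fast mile's average down.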
Update July 25th, 2025
Maybe we can take easy mileage pace. Possible reverse hypothesis.
Crashout meter (people who type a lot in their descriptions)
OK, spontaneous thoughts aside, I'm going to come up with some sort of mileage consistency score.
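One possible starting point, purely as a hypothetical definition and not something I have settled on: score consistency as 1 minus the coefficient of variation of weekly mileage, so identical weeks score 1.0 and erratic blocks score closer to 0. The function name and the sample weeks are made up.

```python
from statistics import mean, pstdev

def consistency_score(weekly_miles):
    """Hypothetical score in [0, 1]: 1 - (std dev / mean) of weekly mileage."""
    avg = mean(weekly_miles)
    if avg == 0:
        return 0.0  # no running at all: treat as zero consistency
    cv = pstdev(weekly_miles) / avg
    return max(0.0, 1.0 - cv)  # clip so very erratic blocks do not go negative

# Invented example weeks (miles per week).
steady = [50, 52, 48, 50]
erratic = [80, 0, 90, 10]

print(round(consistency_score(steady), 3))
print(round(consistency_score(erratic), 3))
```

Dividing by the mean makes the score comparable between low-mileage and high-mileage blocks, which matters here since total mileage is already its own variable.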
Update December 2nd, 2025
I have discontinued the project. Maybe I will return to it, but I have other things to do.