College Basketball Streakiness Analysis

Streaks. A team can get hot and turn a stale or seemingly one-sided game into a spicy, crazy one.

I will be analyzing NCAA teams to see if there are patterns in the scoring streaks that occur during games.

Language for data processing and analysis: R

Libraries: bigballR, ggplot2, dplyr

Data pipeline from bigballR into physical storage.

Pipeline Methods

Our data pipeline scrapes the NCAA stats website using the bigballR web scraper. It was very easy to set up.

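# One-time setup: packages needed to install and run the scraper
install.packages("devtools")
install.packages("chromote")
devtools::install_github("jflancer/bigballR")

library(bigballR)

# Pull every play-by-play row for one team's 2024-25 season
custom_df <- function(team) {
  schedule <- get_team_schedule(season = "2024-25", team.name = team)
  play_by_play <- get_play_by_play(schedule$Game_ID)
  return(play_by_play)
}

# Load the team lookup table, then browse it to find valid team names
# (data() only returns the dataset name, so View() needs the object itself)
data("teamids")
View(teamids)

teams <- c("Auburn", "Michigan", "Ole Miss", "Michigan St.",
           "Duke", "Arizona", "BYU", "Alabama",
           "Houston", "Purdue", "Kentucky", "Tennessee",
           "Florida", "Maryland", "Texas Tech", "Arkansas")

# One play-by-play dataframe per team
list_of_dfs <- lapply(teams, custom_df)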

The data is already formatted, so we just need a way to identify streaks from the play-by-play data, which entails giving each play a unique ID. We must also remove repeated games and take out March Madness games so that we analyze only regular-season games. We made this choice out of concern that March Madness games may behave differently.
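The cleanup described above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the pipeline actually used: the column names (game_id, date, event), the toy rows, and the cutoff date are all hypothetical, and it assumes repeated games arise when two teams on the list played each other and the game was scraped once per team.

```r
library(dplyr)

# Hypothetical per-team pulls; real column names in the scraped data may differ
pull_a <- data.frame(
  game_id = c(101, 101, 102, 102),
  date    = as.Date(c("2025-01-04", "2025-01-04", "2025-03-21", "2025-03-21")),
  event   = c("made layup", "made three", "made dunk", "made free throw")
)
pull_b <- data.frame(
  game_id = c(101, 101, 103, 103),
  date    = as.Date(c("2025-01-04", "2025-01-04", "2025-02-11", "2025-02-11")),
  event   = c("made layup", "made three", "made jumper", "made three")
)

# Stack the pulls and remember which pull each row came from
combined <- bind_rows(list(pull_a, pull_b), .id = "pull")

# Keep each game only from the first pull it appears in
deduped <- combined %>%
  group_by(game_id) %>%
  filter(pull == first(pull)) %>%
  ungroup()

# Assumed cutoff: drop anything on or after this date as tournament play
tournament_start <- as.Date("2025-03-18")
regular_season <- deduped %>% filter(date < tournament_start)

# Give each remaining play a unique ID
regular_season$play_id <- seq_len(nrow(regular_season))
```

Filtering on the pull label rather than calling distinct() on whole rows avoids accidentally dropping two genuinely identical events within the same game.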

Definition 1.1: A streak is the number of baskets a target team scores before the opposing team scores again.
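One way to apply Definition 1.1 is run-length encoding over the ordered sequence of scoring teams. The sketch below uses a made-up sequence of baskets for two hypothetical teams "A" and "B"; it illustrates the idea and is not necessarily how the project computes streaks.

```r
# Hypothetical ordered sequence of made baskets in one game, labelled by scoring team
scoring_team <- c("A", "A", "A", "B", "A", "B", "B", "B", "B", "A")

# rle() collapses consecutive repeats into (value, length) pairs
runs <- rle(scoring_team)

streaks <- data.frame(
  team   = runs$values,
  length = runs$lengths,
  start  = cumsum(runs$lengths) - runs$lengths + 1  # index of the first basket in each run
)

# Streaks of length 3 or longer
long3 <- streaks[streaks$length >= 3, ]
```

Here each run of consecutive baskets by the same team is one streak, so a streak ends exactly when the other team scores.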

Initial Analysis with UC Davis MBB

Week 1 Project Report link

For UC Davis Only

Is the peak around the middle of the 2nd half real or fake? We need more data.

Week 2 Analyses

Solidified a working data pipeline that brings in scraped data and combines it into a single large dataframe for the top 16 teams in the NCAA tournament (~300,000 lines of play-by-play data).
Expanded to a multi-team report using last year's (2025) Sweet 16 teams. Initial findings are below.

From what we see here, there seem to be 2 peaks of interest: just before halftime and right after the midway point of the second half.

Pattern holds for streaks greater than 4

Streak of 5

Pattern is a little different but there are still peaks around the same spots

Next update: look into why these peaks appear. A further task is to add a break in the data at each halftime, which could explain why we see a peak of streaks right before and after halftime.

Update Week 2:

Slightly changed the binning to include the halftime

Counting halftime as a streak end: histograms below

Week 3 tasks:

color code team ahead or team behind.
split up the graph into quarters/minutes/halves.
see if counting free throws differently does anything
64 teams
how many streaks have free throws in the middle: stats including free throws
look into fouls, stops in play, etc.
look at 1-3 possessions before.

February 10th 2026 update: Abstract submitted to URSCA, pending Dr. Carl's approval and conference acceptance

Approved on February 12th.

Some initial statistical analysis: chi-squared goodness of fit.

We noticed those peaks in our exploratory data analysis. Let's test whether streaks are equally frequent in each time bin or whether there is a discrepancy. To begin, even though our data is displayed in two-minute increments, we will use ten-minute bins, which means 2 per half.

If we count the streaks of length 3 or more in each bin, we get this:

Following this, we can run a chi-squared goodness-of-fit test to see if streaks are equally likely in each 10-minute block.
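For reference, a goodness-of-fit test of this kind can be run with base R's chisq.test(). The counts below are invented for illustration and are not the project's data; with no p argument, the null hypothesis is that every block is equally likely.

```r
# Hypothetical counts of streaks (length >= 3) per 10-minute block; NOT the project's data
streak_counts <- c("0-10" = 410, "10-20" = 365, "20-30" = 455, "30-40" = 470)

# Default null hypothesis: each block has the same probability (1/4)
gof <- chisq.test(streak_counts)

gof$expected   # expected count per block under the null
gof$statistic  # chi-squared statistic
gof$parameter  # degrees of freedom (number of blocks minus 1)
gof$p.value    # a small value means the equal-frequency null is rejected
```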

Our p-value is small, so we reject the null hypothesis that streaks are equally distributed across the ten-minute blocks and conclude that streaks are unevenly distributed through the game. What if we shrink these blocks back to our original design of 2-minute bins?

And if we take the standardized residuals (stdres), we can see the segments that stick out as "outliers".

Areas that stick out:

18-20 minutes in the game (2 minutes left in the first half) -4.52
20-22 minutes (2 minutes into the second half) +2.68
30-32 minutes (8-10 minutes left in the second half) +2.41
32-34 minutes (6-8 minutes left in the second half) +2.59
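Standardized residuals like the ones listed above are returned by chisq.test() in its stdres component. The sketch below uses invented counts for twenty 2-minute bins (not the project's data) and flags bins whose residual exceeds 2 in absolute value, a common rule of thumb rather than a threshold stated in this report.

```r
# Hypothetical streak counts for twenty 2-minute bins; NOT the project's data
bin_start <- seq(0, 38, by = 2)
counts <- c(80, 85, 82, 88, 84, 86, 83, 87, 85, 55,
            100, 86, 84, 88, 85, 99, 98, 86, 84, 75)
names(counts) <- paste0(bin_start, "-", bin_start + 2)

fit <- chisq.test(counts)

# (observed - expected) scaled by its standard deviation under the null
residuals_by_bin <- fit$stdres

# Bins that deviate most from an even spread (assumed threshold of 2)
flagged <- residuals_by_bin[abs(residuals_by_bin) > 2]
```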

From the chi-squared test, it is clear that streaks aren't evenly distributed, but some caveats must be taken into account:

The closer you get to the end of a period, the less time there is for a streak to start.
The second-half bins with high residuals are worth looking at in more statistical depth as well.

Week 4 Observation:

Let's continue with our deep dive in long3, which as a reminder is the dataframe containing streaks of length 3 or longer.

Only about 21.4% of streaks of length 3 or longer have no free throw line trips, and about 26.1% of the points in those streaks come from free throws. This raises a question, because the current definition counts each made free throw as adding to the streak length. I'm considering redefining this.

This means that free throws are heavily present in the middle of streaks: they make up about a quarter of the points scored in these streaks, and almost 80% of streaks include a trip to the free throw line.
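Both percentages above are simple summaries over the scoring events inside each streak. A minimal sketch with dplyr follows, using a hypothetical table with one row per scoring event and invented columns streak_id, is_ft, and points; the real long3 dataframe may be structured differently.

```r
library(dplyr)

# Hypothetical scoring events inside three streaks; NOT the project's data
streak_events <- data.frame(
  streak_id = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
  is_ft     = c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE),
  points    = c(2, 3, 2, 2, 1, 1, 3, 1, 2, 2)
)

# Does each streak contain at least one made free throw?
per_streak <- streak_events %>%
  group_by(streak_id) %>%
  summarise(has_ft = any(is_ft), .groups = "drop")

# Share of streaks with no free throws at all
share_no_ft <- mean(!per_streak$has_ft)

# Share of all streak points that came from free throws
share_ft_points <- sum(streak_events$points[streak_events$is_ft]) /
  sum(streak_events$points)
```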

Going back to the chi-squared test on the frequencies of streaks per time bin, it is clear that the number of streaks differs across time bins. However, the number of streaks can also be expected to depend on the number of possessions (more possessions means more opportunities to score). So I decided to normalize the data and compute a streak rate per possession for each 2-minute bin.
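The normalization amounts to dividing streak counts by possession counts within each bin. A tiny sketch with invented numbers (the column names and values are hypothetical):

```r
# Hypothetical per-bin totals; NOT the project's data
bins <- data.frame(
  bin         = c("0-2", "2-4", "4-6"),
  streaks     = c(30, 45, 36),
  possessions = c(400, 500, 450)
)

# Streaks started per possession in each bin
bins$streak_rate <- bins$streaks / bins$possessions
```

In this toy table the "2-4" bin has the most streaks and also the highest rate, but a bin with many streaks could still have an unremarkable rate if it also had many possessions.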

This gives a clear picture of the normalized streak rate. It looks similar to the raw frequencies, but on the y-axis the absolute difference between the highest and lowest bars isn't large. We can run some statistical tests.

Logistic Regression with binomial weights

Anova for logistic regression

Generalized Additive Model Test

To take the analysis of the normalized streak rates further, we can fit a logistic regression to see whether the probability of a streak changes linearly with game time. Logistic regression fits here because probabilities lie between 0 and 1 and our data are binomial (streaks out of possessions). The p-value for the slope term is 0.15, so we find no evidence of a linear relationship between time and the probability of a streak. The ANOVA table for the model agrees: its chi-squared value is too high for a straight line to be an adequate fit.

If we move to a nonlinear method such as a generalized additive model (GAM), which extends regression by replacing straight lines with smooth curves, the picture changes. After fitting the GAM we get a small p-value for the smooth term, with an effective degrees of freedom (edf) of 7.331, which is loosely comparable to the flexibility of a 7th- or 8th-degree polynomial. So based on this test, streak probability appears to depend on game time, but not in a linear fashion.

Applying the inverse logit, e^x / (1 + e^x), to the intercept of about x = -2.54 gives a baseline probability of starting a streak of length >= 3 of about 7.3% on average. The deviance explained is 29.1%, meaning the smooth model accounts for 29.1% of the deviance relative to a null model. For a binomial/logistic model, this suggests the model captures meaningful structure worth analyzing further.
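The two model fits described above might look roughly like this in R. It is a sketch on simulated data, not the project's code: the bins table, its column names, and the simulated curve are assumptions, and the response is written as cbind(successes, failures), which is one standard way to give glm() and mgcv::gam() binomial counts.

```r
library(mgcv)

set.seed(1)

# Simulated stand-in: twenty 2-minute bins with a bumpy, nonlinear streak probability
bins <- data.frame(minute = seq(1, 39, by = 2))
bins$possessions <- 2000
true_p <- plogis(-2.5 + 0.25 * sin(bins$minute / 4))
bins$streaks <- rbinom(nrow(bins), bins$possessions, true_p)

# Logistic regression: is there a linear trend in time on the logit scale?
linear_fit <- glm(cbind(streaks, possessions - streaks) ~ minute,
                  family = binomial, data = bins)
summary(linear_fit)$coefficients      # slope estimate and its p-value
anova(linear_fit, test = "Chisq")     # deviance table for the linear term

# GAM: replace the straight line with a smooth function of time
gam_fit <- gam(cbind(streaks, possessions - streaks) ~ s(minute),
               family = binomial, data = bins, method = "REML")
summary(gam_fit)                      # edf, smooth-term p-value, deviance explained

# Inverse logit of the intercept quoted in the text
baseline <- plogis(-2.54)
```

plogis() is base R's inverse logit, so plogis(-2.54) reproduces the e^x / (1 + e^x) calculation for the reported intercept.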

Wait...new results when you remove overtime. The model gets stronger...

GAM model

Our test indicates that the smooth model explains about 39.8% of the deviance, which is decently strong. The p-value indicates that the smooth term is statistically significant; however, the R-sq value of just 0.013 is small, just above 1% of variance explained. Since the model is logistic, deviance explained is a better indicator of model strength than R-sq. So in conclusion, there is a clear nonlinear pattern, with an edf of about 8. It is a complex pattern that we will dissect.

GAM model converted to true probabilities.

The peak is close to a 9% probability of a streak happening, at time bin 16-17, and the lowest point is under 6.5%, just before halftime: a swing of almost 2.5 percentage points. In basketball terms that is still a small difference, but in a game with roughly 100 possessions it could be a somewhat significant jump.
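Converting a fitted binomial GAM to probabilities means applying the inverse logit to its predictions on the link scale. The sketch below refits a small GAM on simulated data (the data, column names, and curve are all hypothetical) and reads off the peak-to-trough swing.

```r
library(mgcv)

set.seed(2)

# Simulated stand-in for the per-bin streak and possession counts
bins <- data.frame(minute = seq(1, 39, by = 2), possessions = 2000)
bins$streaks <- rbinom(nrow(bins), bins$possessions,
                       plogis(-2.5 + 0.2 * cos(bins$minute / 5)))

fit <- gam(cbind(streaks, possessions - streaks) ~ s(minute),
           family = binomial, data = bins, method = "REML")

# Predict on a fine grid of game minutes, on the logit (link) scale
grid <- data.frame(minute = seq(0, 40, by = 0.5))
link_scale <- predict(fit, newdata = grid, type = "link")

# Inverse logit turns log-odds into probabilities
grid$prob <- as.numeric(plogis(link_scale))

# Difference between the highest and lowest fitted probability
swing <- max(grid$prob) - min(grid$prob)
```

Calling predict() with type = "response" gives the same probabilities directly; the two-step version is shown only to make the conversion explicit.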

Now for the fun: let's figure out why the heck these dips and peaks happen. What leads up to them? Why do streaks happen?

Frequency bar plot of terms that appear up to 5 events before a streak starts.
