# Senior Year Experiences in Statistics

2008

Jasper Burch
"Fiscal Policy and Growth: The Case of Puerto Rico"

We statistically examine the impact that US federal transfer payments (Medicare, welfare, Social Security, etc.) have on the economic development of Puerto Rico. The economy of Puerto Rico is substantially different from that of the mainland United States. As a commonwealth, however, Puerto Rico is subject to US federal laws and policies. This leads us to ask how severe the inefficiencies created by US federal laws and policies are in Puerto Rico.

Meaghan Cahill
"Testing the Assumptions: A Simulation Study Monitoring the Effects of Manipulating Regression and Statistical Inference Assumptions"

Most statistical procedures rely on certain assumptions in order for the results to be valid. What changes if these assumptions aren't met? We manipulate the assumptions involved in simple linear regression models, confidence intervals for population means, and confidence intervals for population proportions. We use Fathom to simulate data sets that do and do not satisfy the assumptions for these statistical procedures, and display the changes and problems that occur. For simple linear regression, we examine the change in SSE resulting from manipulating the model assumptions associated with this procedure. We also demonstrate that the coverage percentages for confidence intervals are not accurate if the assumptions are not met. For example, the coverage percentages for confidence intervals for population proportions are much lower than expected when the probability of success in a population is either close to zero or close to one.
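A minimal sketch of this kind of coverage experiment (in Python rather than Fathom, with the sample size, p values, and repetition count chosen purely for illustration):

```python
import random

def wald_interval(successes, n, z=1.96):
    """Standard large-sample (Wald) interval for a proportion."""
    p_hat = successes / n
    se = (p_hat * (1 - p_hat) / n) ** 0.5
    return p_hat - z * se, p_hat + z * se

def coverage(p, n, reps=10000, seed=1):
    """Fraction of simulated 95% intervals that contain the true p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        successes = sum(rng.random() < p for _ in range(n))
        lo, hi = wald_interval(successes, n)
        if lo <= p <= hi:
            hits += 1
    return hits / reps

# Coverage is close to 95% for a moderate p, but collapses when p is
# near zero (most samples then yield a degenerate interval).
print(coverage(0.5, 30))   # near 0.95
print(coverage(0.02, 30))  # far below 0.95
```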

Joe Cleary
"Using Stein Estimation to Predict Performance"

Stein estimators can be used to predict future performance of individuals when we have small samples of previous performance for many individuals. For example, suppose we have data on save percentages for a number of NHL goalies through ten games of the season. Because we have small sample sizes, the estimates for save percentages will be quite variable. Goalies with the highest percentages are likely to move towards a more typical save percentage as the season progresses; the same will tend to happen for goalies with low initial percentages. Stein estimators provide a method for adjusting estimates based on the distribution of the collection of goalies. The underlying method is a two-stage hierarchical model in which binomial and beta distributions are used. We examine properties and implementation of Stein estimators theoretically and with application to real data.
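The shrinkage idea behind the beta-binomial hierarchy can be sketched as follows. The pseudo-shot count `k` stands in for the prior strength (alpha + beta), which a full Stein/empirical-Bayes analysis would estimate from the data rather than fix by hand, and the goalie totals are hypothetical:

```python
def shrink(saves, shots, k=100):
    """Pull each raw save rate toward the overall rate using k pseudo-shots.
    k plays the role of alpha + beta in a beta prior (fixed here for
    illustration; in practice it is estimated from the whole collection)."""
    mu = sum(saves) / sum(shots)          # overall save percentage
    return [(s + k * mu) / (n + k) for s, n in zip(saves, shots)]

saves = [290, 270, 255]    # hypothetical early-season totals
shots = [300, 300, 300]
raw = [s / n for s, n in zip(saves, shots)]
adj = shrink(saves, shots)
# The best raw rate is pulled down and the worst pulled up, toward the
# overall rate of about 0.906, while the ordering is preserved.
print(raw)
print(adj)
```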

Laura Daley
"Chance vs. Skill: Assessing Shootouts in the NHL"

The shootout was adopted for the 2005-06 season by the National Hockey League (NHL), an ice hockey association consisting of 30 teams across the United States and Canada. The shootout eliminates ties in NHL competition. This investigation looks at the shootout in the 2005-06, 2006-07, and 2007-08 NHL seasons. We analyze results for shooters and goalies to determine the probability of scoring and whether it differs significantly from player to player. The data are evaluated in an attempt to determine whether chance or skill is the prevailing characteristic that influences the outcome of the shootout.

Brent Davis
"Analyzing Queueing Models and an Issue of Independence"

Queueing theory is the study of waiting in line for service, for example, in a checkout line at a grocery store, at a toll booth, or at a ticket counter. We discuss standard queueing models and a method for simulating customer service times according to these models. A key question is the relationship between queue length and expected waiting time. Based on simulations we find an interesting scenario where the common statistical assumption of independence may not be valid and examine methods for dealing with this situation.
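As a sketch of the simulation approach, here is a minimal M/M/1 queue (Poisson arrivals, one exponential server), the simplest of the standard models; note that the successive waiting times it produces are correlated, which is exactly the kind of failure of independence the abstract mentions:

```python
import random

def mm1_waits(arrival_rate, service_rate, n_customers=20000, seed=7):
    """Simulate an M/M/1 queue and return each customer's time waiting in line."""
    rng = random.Random(seed)
    t = 0.0            # current customer's arrival time
    free_at = 0.0      # time at which the server next becomes free
    waits = []
    for _ in range(n_customers):
        t += rng.expovariate(arrival_rate)          # next arrival
        waits.append(max(0.0, free_at - t))         # wait if server busy
        free_at = max(free_at, t) + rng.expovariate(service_rate)
    return waits

# Theory gives mean wait lam/(mu*(mu - lam)); with lam=1, mu=2 that is 0.5.
waits = mm1_waits(1.0, 2.0)
print(sum(waits) / len(waits))
```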

Kristen Ehringer
"The Price is Right: Using Monte Carlo Simulation to Price Stock Options"

Simulations are used to imitate real-life situations when other analyses are too mathematically complex or too difficult to compute. The Monte Carlo method of simulation generates values for uncertain variables over and over again to simulate a model. Monte Carlo simulation can be applied to many financial problems, such as the pricing of options. In the case of European options, there exists a formula, known as Black-Scholes, to price these options based on several assumptions. We develop a Monte Carlo simulation for pricing European options and compare the results to those of the Black-Scholes equation. Other, more complex options do not have convenient equations to price their value. We show how the Monte Carlo simulation can be adapted to obtain price estimates for these more exotic options.
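A sketch of the comparison for a European call, with the stock price, strike, rate, and volatility chosen purely for illustration:

```python
import math, random

def bs_call(s, k, r, sigma, t):
    """Black-Scholes price of a European call."""
    d1 = (math.log(s / k) + (r + sigma**2 / 2) * t) / (sigma * math.sqrt(t))
    d2 = d1 - sigma * math.sqrt(t)
    N = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))  # standard normal CDF
    return s * N(d1) - k * math.exp(-r * t) * N(d2)

def mc_call(s, k, r, sigma, t, n=200000, seed=3):
    """Monte Carlo price: simulate terminal prices under risk-neutral
    geometric Brownian motion and discount the average payoff."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        z = rng.gauss(0, 1)
        st = s * math.exp((r - sigma**2 / 2) * t + sigma * math.sqrt(t) * z)
        total += max(st - k, 0.0)
    return math.exp(-r * t) * total / n

# The two prices should agree to roughly two decimal places.
print(bs_call(100, 100, 0.05, 0.2, 1.0))
print(mc_call(100, 100, 0.05, 0.2, 1.0))
```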

Andrew Ford
"Estimating Atypical Baseball Career Trajectories"

It is common in any sport for a player to continue improving throughout their career up until a certain point, at which their production begins to fall. For example, the number of home runs a player hits generally rises until around age thirty, after which it begins to fall. However, there are times when the production rate of a player does not follow this common pattern. For example, in baseball the number of home runs a player hits may not follow the ordinary pattern because of a variety of factors, ranging from being traded to a team that plays in a park with different field dimensions, to intense off-season training, medical enhancements, or even random chance. In this investigation we use the statistics and graphics software package R to apply the loess curve-fitting method to career statistics for individual baseball players. In addition, we create several functions in R that allow us to compare the production rates of individual players to those of other players to determine when a player's career trajectory appears atypical.

Jared Fostveit
"Madness of March: More Predictable Than Golf?"

Scott Berry wrote an article in the magazine Chance in 2000 in which he used NCAA Men's Basketball Tournament data from 1986-2000 to determine the likelihood of winning based on seeding, the probabilities of each seed reaching each round, and the expected number of upsets per round. We extend this approach to golf. For example, we use historical data to create a model to predict winners based on the seedings in the World Golf Match Play Championships. The golf tournament uses a 64-golfer bracket similar to the NCAA basketball bracket. We create a relatively simple way to achieve results that are comparable to Berry's computationally intense model. We contrast our basketball predictions with those of Berry, while also comparing our basketball output to the golf output. We discover whether golf is more or less predictable than the "Madness" of March.

Eloise Hilarides
"Using Weighting Schemes to Account for Coverage Bias in Internet Surveys"

Over the past decade, Internet surveys have become a popular method for collecting data about the general population. In 2005, the Harris Poll published findings which claimed that 74% of the United States population had access to the Internet somewhere. While this number has steadily risen over recent years, bias still may be introduced if the population without Internet access differs from the Internet population with regard to the variables of interest. In this research we study whether Internet users who only have access to the Internet outside their home can be useful in reducing bias, by assuming that they are more similar to those without Internet access than to the Internet population as a whole. We outline several weighting adjustment schemes aimed at reducing coverage bias. Data for this study were taken from the October 2003 Computer and Internet Use Supplement of the Current Population Survey. We evaluate the schemes based on overall accuracy by considering the reduction in bias for ten variables of interest and the variability of estimates from the schemes. We find that several of the proposed schemes are successful in improving accuracy.

Jeboah Joerg
"Modeling Volatility in Today’s Marketplace"

In just seven trading days, from January 10th to January 22nd, 2008, the S&P 100 dropped almost 8%, from 665.64 to 612.82. Eight trading days later the index had regained half of what it had lost. This variability in financial markets is known as the volatility of the market. So how do we measure this volatility? We first discuss the different methods of measuring volatility, followed by the conditions under which each method theoretically works best, and then look into the financial markets to characterize volatility in the real world. We compare the volatility of the S&P 100 to the volatility of a created index consisting of small companies from the Russell 2000. We then compare and contrast the different methods of measuring volatility across the two indices.

Victor Kai-Rogers
"Measuring the Effects of 9/11: Intervention Models in Time Series"

In this project we focus on the study of Intervention Analysis, a special case of transfer functions in Time Series. Intervention Analysis has been applied to a variety of topics to observe the impact of various policy implementations, regime changes and more recently effects of terrorism. Specifically, we are interested in knowing if and when an intervention analysis is appropriate by looking at how the goodness of fit from AIC and BIC improve over the univariate analysis itself. In this project we investigate various forms of intervention functions such as pure jump, impulse, gradually changing and prolonged impulse, and observe their efficiency compared to regular Box-Jenkins ARIMA models through the use of theoretical analysis, simulations and actual data. The actual data used in this paper is the time series data of stock prices around the 9/11 event, namely American Airlines (AMR) and the S&P 500 market index.

Ryan Kimber
"Who's Really Better? Comparing American and Latin American Baseball Players"

In certain experienced baseball circles, there is a common belief that Latin American players are naturally better than American players. I would imagine that this can be quite frustrating for American players, especially at the professional level. How can we determine whether or not this notion is actually true? I shed some light on the question by using a multiple binary regression analysis to predict the ethnicity of players (0 for Latin Americans and 1 for Americans) based on numerous career offensive statistics such as career batting average, on-base percentage, slugging percentage, and fielding percentage. As a result of my analysis, I found a model that correctly predicted the ethnicity of 73.6% of the Americans and 68.7% of the Latin Americans. I also performed 2-sample t-tests on all of the statistics I used and found that all of the significant tests gave evidence for American supremacy, at least in terms of offensive statistics.

Rob LaMere
"Estimating Win Probabilities Based on Betting Lines for Sporting Events"

We use betting lines and point spread systems for sports to determine probabilities of a team winning. For example, we investigate football and basketball betting schemes to predict winners based on their specific betting line. The study is based on data collected from the 1998 through 2007 NFL and NBA seasons. We look at the distribution of the difference between the favored team's score and the underdog's score, compared to the predicted point spread for each game. For both sports, these differences are approximately normally distributed. The NFL differences are modeled as normal with a mean of 0.05 and a standard deviation of 13.19; the NBA differences as normal with a mean of -0.5 and a standard deviation of 11.40. Win probabilities are then estimated from the point spreads using simple linear approximations.
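Given the fitted normal distribution for the NFL, one direct way to turn a point spread into a win probability is a normal-CDF calculation (shown here as a sketch; the study itself uses simple linear approximations):

```python
import math

def favorite_win_prob(spread, mean=0.05, sd=13.19):
    """P(favorite wins outright) when (actual margin - spread) is modeled
    as Normal(mean, sd), using the NFL estimates quoted above."""
    # The favorite wins when its actual margin exceeds 0,
    # i.e. when the error term exceeds -spread.
    z = (spread + mean) / sd
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(favorite_win_prob(3.0))   # a field-goal favorite
print(favorite_win_prob(7.0))   # a touchdown favorite
```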

Royce Lawrence
"Bowling and the Hot Hand"

Hot hand, or just luck? Ever wonder why streaks come as they do? In this paper we investigate the theory of the hot hand: the tendency to perform at a higher level for a period of time. For example, a bowler may be more likely to continue to throw strikes after previous strikes. Using frame-by-frame bowling data and statistical methods, we determine whether the hot hand actually exists in the sport of bowling.

Dennis Lock
"Beyond +/-: A Rating System to Compare NHL Players"

While there are many statistics, such as plus/minus, to evaluate a hockey player's performance, few of these statistics effectively compare the worth of different kinds of players. We propose a new, more comprehensive rating method, extending the concept of plus/minus, which aims to take most aspects of a player's game into account. Each player will be rated on the same scale, where the value of each play is determined by how that play increases or decreases the likelihood of victory. This creates a method of rating where one can compare the value of a pure goal scorer to a defensive specialist, and everyone in between.

Sarah MacFarland
"The Effects of Marriage on Spousal Happiness & Well-Being"

A long-standing appreciation for marriage, together with the extensive dynamics that surround family life, motivated a research project encompassing both topics. Using raw data from the Inter-University Consortium for Political and Social Research (ICPSR), the effects of marriage on the happiness and well-being of Michigan couples were investigated. Three hundred seventy-three newlywed couples (all intra-racial and in their first marriage) were interviewed in Wayne County, Michigan during April-June 1986, following their first four years of marriage. All couples, 174 White and 199 Black, were between the ages of 25 and 37. Certain independent variables, such as marital status, total marriage satisfaction, level of happiness, and hopefulness about the future, were chosen for the final data set. General findings indicated that couples in their early years of marriage were very satisfied and felt that their lives were full as a result of their married lives.

Peter McGoldrick

The Journal of Quantitative Analysis in Sports published a 2007 article by Jason Kubatko, Dean Oliver, Kevin Pelton, and Dan T. Rosenbaum entitled "A Starting Point for Analyzing Basketball Statistics." This study resulted in the formulation of a number of statistics that aim to give a clearer understanding of the quality of both individual and team play in the National Basketball Association (NBA). Based on a central theme of per-possession statistics, we discuss the efficiency of team statistics and methods to evaluate player performance using per-minute statistics. With these variations on the classical basketball box score, we are also able to compare basketball statistics from league to league. We define important statistical terms and themes for basketball, and provide examples of the effectiveness of these statistics with recent NBA data. By applying statistical procedures, we analyze midseason NBA trades and their resulting impact.

Nikolay Naletov
"Investment Risk Measurement: Approaches to Risk Measurement"

This research presents different ways to measure investment risk. In particular, it compares VaR (Value-at-Risk) estimates based on the variance-covariance method, historical simulation, Monte Carlo simulation, and GARCH approaches. The paper compares and explains the different techniques and methodologies using the historical database of the equity portfolio of Crown Royalties Investments, the investment club at Saint Lawrence University.

Jamie Wolff
"Performance vs. Pick: A Study of the NBA Draft"

Millions of dollars are invested in the top draft picks of the National Basketball Association (NBA). A significant amount of deliberation and analysis goes into determining which athlete to select. Often teams make trades in order to better their position in the draft, or they "trade down," meaning they trade away an early draft pick for more draft picks later on. By giving another team money, current players, or draft picks, a team hopes that the pick it acquires will be more productive than the one it gave up. This investigation explores any significant differences among draft picks and what the advantage would be, if any, in drafting at one pick over another. We use NBA career statistics (points per game, minutes per game, games played, all-star appearances) to assess draft decisions based on player productivity. Data from the 1994 through 2007 seasons were compiled from Basketball-Reference.com. This study divides the draft picks into eight zones (four in the first round and four in the second) and compares zones to find any significant differences. Results suggest that lottery picks are valuable and trading up or down may not be productive.

2007

Nick Alena
"Linear Discriminant Analysis"

Can we trace the Mona Lisa to Leonardo da Vinci, demonstrate that a certain wing span identifies a deadly midge, or conclude that seeing "Viagra" in an email title means the message is spam? In other words, is it possible to use numerical characteristics of a painting to determine its creator, measurements of an insect to distinguish its species, or titles of emails to separate spam from legitimate messages? Linear Discriminant Analysis (LDA) is a statistical method used to identify group membership using a linear combination of features. The technique extends concepts from ANOVA (Analysis of Variance) and regression analysis. We examine methods for producing a linear combination that best distinguishes between the groups using univariate and multivariate sets of predictors.

Dustin Cidorwich
"Analysis of NFL 'Draft Pick Value Chart'"

The goal of this project is to evaluate the 'Draft Pick Value Chart', which is used by most NFL teams as a way of allocating value to individual draft picks. I use games played and games started as a proxy for a drafted player's value, and use regression methodology to link the explanatory variables to that value. The variables used in the analysis include selection number, year, team, and position. The predicted values are scaled and compared with the 'Draft Pick Value Chart' values. Would it be in a team's best interest to trade a second and a fourth round draft pick for a first round pick?

Yordan Minev
"ROC Confidence Region Using Radial Sweep"

A biometric authentication system matches physiological characteristics to a database of such characteristics. In biometric authentication, genuine users are generally those that the system should accept and imposters are those that the system should reject. One methodology for evaluating the matching performance of biometric authentication systems is the receiver operating characteristic (ROC) curve. The ROC curve graphically illustrates the relationship between type I and type II statistical classification errors as a threshold is varied across the genuine and imposter match score distributions.
The goal of this paper is to estimate the performance of a biometric system via a confidence region for the ROC curve of that system's performance. In this project, ROC confidence regions are created using the radial sweep method, which is based on converting the type I and type II errors to polar coordinates. The technique of bootstrapping is used to estimate the variability of each point on an individual ROC curve. Simulations are performed using real biometric match score data. Finally, a radial sweep method for comparing two ROC curves is discussed.

Julie Muetterties
"Methods for Comparing Two Survival Curves"

A survival curve shows the proportion of a population at risk which survives up to a certain time. Such curves can be described by theoretical parametric models (such as exponential, lognormal, or Weibull) as well as nonparametric methods (such as Kaplan Meier). We investigate methods for determining if survival curves from two populations or treatments are significantly different with applications to real data.

Julia Palmateer
"Assessing Ratings Methods for College Hockey Teams"

Ranking college sports teams that play in leagues of different strengths can be challenging. A number of methods have been proposed to calibrate teams while accounting for performance and strength of schedule. We use Monte Carlo simulations of hundreds of seasons to compare several existing methods currently used to rate college hockey teams. These ranking methods include: raw winning percentage, Ratings Percentage Index (RPI), Bradley-Terry (KRACH) and Poisson scoring rates (CHODR). By using the results of the simulated seasons, we can examine many common questions, including how the strength of a team’s league affects their rating.

2006

Jeff DiGeronimo
"A Hitter's Salary Through Statistical Analysis of Past Performance"

What is a current Major League Baseball (MLB) player worth? Today, controversy over players' salaries arises every season. Teams are forced to decide whether a player is worth keeping or should be traded or released. Many players also face contract arbitration with the organization (team) they are playing for, a process that involves a decision about the worth of a player. We collect a large data set made up of MLB hitters' statistics from the previous three seasons, three-year average statistics, and career statistics. The statistics include batting average, runs batted in, on-base percentage, home runs, etc. Other variables that could play a role in the salary of a player, such as age, years of experience, position, and free agent status, are also used. We then use regression techniques to analyze a player's salary and how it relates to the performance of the individual player, formulating a multiple regression equation to predict the salary of a hitter in the Major Leagues. Using the prediction equation, some of the players who filed for arbitration have their salaries predicted, and an arbitration ruling is made based on the predicted value.

Raluca Dragusanu
"Autoregressive Conditional Heteroscedasctic Models"

Traditional time-series models such as Autoregressive (AR) and Moving Average (MA) models are based on the homoscedasticity assumption, which translates into a constant variance for the errors of a model. This assumption has been shown to be inappropriate when dealing with some economic and financial market data. A new class of models - conditional heteroscedastic models - was developed to deal with data that do not exhibit constant error variance. The most well known models in this class are the Autoregressive Conditional Heteroscedastic model (ARCH) and its generalized version (GARCH). Stock market volatility, the square root of the variance of stock returns, presents a very good application of this type of model. In finance, volatility is the expression of risk. Since we must take risks to achieve rewards, finding appropriate methods to forecast volatility is necessary in order to optimize our behavior and, in particular, our portfolio. I will present the general properties of the ARCH and GARCH models and use both Monte Carlo simulations and known financial time series data to test their performance.
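A minimal simulation of the ARCH(1) case, with coefficients chosen purely for illustration, shows both the unconditional variance a0/(1 - a1) and the volatility clustering (correlated squared returns) that motivates these models:

```python
import random

def simulate_arch1(a0, a1, n=5000, seed=5):
    """Simulate returns from an ARCH(1) model:
    r_t = sigma_t * z_t,  sigma_t^2 = a0 + a1 * r_{t-1}^2,  z_t ~ N(0,1)."""
    rng = random.Random(seed)
    r_prev, returns = 0.0, []
    for _ in range(n):
        sigma2 = a0 + a1 * r_prev ** 2   # today's variance depends on yesterday
        r_prev = (sigma2 ** 0.5) * rng.gauss(0, 1)
        returns.append(r_prev)
    return returns

# Unconditional variance is a0/(1 - a1) when a1 < 1; with a0=1, a1=0.5 that is 2.
rets = simulate_arch1(1.0, 0.5)
print(sum(r * r for r in rets) / len(rets))
```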

Travis Gingras
"Applications of a Graphical Information System to Ice Hockey"

Statistics and sports have been related for many years, and recently the art of using statistics to observe players' tendencies has become more and more common among coaches. This project investigates patterns of shots taken by the St. Lawrence men's hockey team using a geographic information system. ArcGIS is a mapping program generally designed for geographical data, but in this project we have defined a database to store information about individual shots in multiple hockey games while placing them on a map of the offensive zone of a hockey rink. We can then study patterns and look for trends that might benefit individual players or the team as a whole.

Jim Hall
"Investigating the Effectiveness of Bootstrap Confidence Intervals"

The statistical procedure known as “bootstrapping” is used to approximate a sampling distribution for any statistic by resampling from an original sample with replacement in order to draw conclusions about the shape, center and variability of the sample statistic. These methods avoid traditional assumptions such as assuming a certain population is normally distributed. We will give a brief description of bootstrapping techniques and discuss several methods for constructing confidence intervals based on the bootstrap distribution. We demonstrate via computer simulation (using the statistical software packages R and Fathom) the effectiveness, in terms of coverage and average width, of bootstrap confidence intervals compared to traditional confidence intervals in standard situations and in cases where standard assumptions fail.
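A sketch of the simplest variant, the percentile bootstrap interval, with a small made-up sample:

```python
import random

def bootstrap_percentile_ci(data, stat, level=0.95, reps=5000, seed=11):
    """Percentile bootstrap CI: resample with replacement, recompute the
    statistic each time, and take the middle `level` of the bootstrap
    distribution."""
    rng = random.Random(seed)
    n = len(data)
    boot = sorted(stat([rng.choice(data) for _ in range(n)])
                  for _ in range(reps))
    alpha = (1 - level) / 2
    return boot[int(alpha * reps)], boot[int((1 - alpha) * reps) - 1]

sample = [2.1, 3.4, 1.8, 4.0, 2.9, 3.3, 2.2, 5.1, 3.8, 2.6]
mean = lambda xs: sum(xs) / len(xs)
lo, hi = bootstrap_percentile_ci(sample, mean)
print((lo, hi))   # interval for the population mean
```

The same function works for any statistic (median, standard deviation, etc.) simply by passing a different `stat`, which is the appeal of the method when no formula-based interval is available.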

Robin Hanson
"Statistical Analysis of CHIME Data: Relationships Involving Heart Rate VAriability"

Life-threatening events, including bradycardia and apnea in infants, are a major health concern for families and physicians. Events further described as "extreme events" occurred at least once in 20.6% of asymptomatic infants and 33% of infants born prematurely. The motivation for this study is the Collaborative Home Infant Monitoring Evaluation (CHIME), a study formed and funded by the National Institutes of Health (NIH). The CHIME database is the largest longitudinal physiologic dataset on infants. The work proposed here is part of a larger study using heart rate variability to predict life-threatening events in infants and to classify infant sleep state. This work seeks to develop methods to examine the effect of age on heart rate variability measures in each sleep state. These methods include both numerical summaries and graphical displays.

Emily Sheldon
"Sequential Analysis for the Beta-Binomial"

In this paper we attempt to derive an equation from the Beta-binomial distribution that can be used to apply sequential probability ratio testing to biometric devices. We first examine sequential analysis testing methods and then apply them to examples of multiple independent Bernoulli trials. We use these examples to illustrate the decision of when to stop testing. Lastly, we examine the Beta-binomial distribution and derive an equation that can be used in sequential analysis methodology.

Joshua White
"Predicting a Pitcher's Salary Using Statistical Analysis"

When looking at a baseball player's performance compared to his salary, there should be some kind of correlation: the better the player, the better the salary. However, a player's worth does not always reflect his salary. Throughout the season, owners and general managers are faced with decisions about a player's salary and whether or not they should keep him. This may create conflict between a player and his team, forcing the player to deal with arbitration, a process where an external person chooses a salary figure based on arguments presented by the team and player. The main goal of this project is to examine models for predicting a pitcher's salary based on past performance. We start by compiling a database of Major League Baseball pitchers' statistics from various websites (mlb.com, espn.com, baseball-reference.com). This includes the obvious variables for a pitcher (ERA, walks, strikeouts, etc.) and also other ones such as age, years in the league, free agent status, and starting pitcher vs. relief. Once the database is established, we perform statistical analysis, using techniques such as multiple regression, to create and assess models for predicting pitchers' salaries. Once the statistical analysis is complete, individual case studies, such as actual arbitration cases for current pitchers, are examined and tested using the models developed.

2005

Hillary Hartson
"Sequential Testing"

In this paper we investigate Wald's Sequential Analysis to develop a basic understanding of sequential testing. Sequential testing differs from traditional hypothesis testing in that it allows for three possible conclusions to be made after a subset of observations have been drawn. These three conclusions are: reject the null hypothesis, accept the null hypothesis, and continue testing. This is in contrast to traditional hypothesis testing where the choices to be made are reject or accept. We begin our discussion of sequential testing by introducing a sequential test called the sequential probability ratio test and then use it to derive decision rules for comparing a proportion against a one-sided alternative. We consider application of this test to a variety of possible rates for the hypothesized value. Lastly, we demonstrate the effects of changing the values of our Type I and Type II errors on these decision rules.
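The decision rules can be sketched as follows for a proportion, with p0, p1, and the error rates chosen purely for illustration; the log-likelihood ratio is compared to two boundaries derived from the Type I and Type II error rates:

```python
import math

def sprt_step(successes, n, p0=0.5, p1=0.7, alpha=0.05, beta=0.05):
    """One look at the data in Wald's SPRT for H0: p = p0 vs H1: p = p1.
    Returns 'accept H1', 'accept H0', or 'continue'."""
    llr = (successes * math.log(p1 / p0)
           + (n - successes) * math.log((1 - p1) / (1 - p0)))
    upper = math.log((1 - beta) / alpha)   # cross above -> accept H1
    lower = math.log(beta / (1 - alpha))   # cross below -> accept H0
    if llr >= upper:
        return "accept H1"
    if llr <= lower:
        return "accept H0"
    return "continue"

print(sprt_step(10, 10))  # all successes: "accept H1"
print(sprt_step(2, 10))   # mostly failures: "accept H0"
print(sprt_step(6, 10))   # ambiguous: "continue"
```

Shrinking alpha and beta pushes the two boundaries apart, so more observations are needed before either hypothesis can be accepted, which is the effect the paper demonstrates.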

Jeff Homer
"Rating Systems for College Football with Turnovers"

In this project, I create a rating system for Division III college football based on relative strength and home-field advantage. This is similar to Stern's Division I college football model (Stern, 1995, Chance Magazine). We use a linear regression model for this. An additional factor that we take into account is the turnover (fumbles and interceptions) differential. We further study turnovers and turnover differential to determine whether or not turnovers can be treated as a random effect for predicting outcomes of college football games.

Nikki Lopez
"Analysis of Biometric Authentication Match Scores"

For bio-authentication devices, the topic of template aging is an important one. Mansfield and Wayman (2002) define template aging as "the increase in error rates caused by time related changes in the biometric pattern, its presentation, and the sensor." Understanding such an effect is important both for policy and for system design. To date, the work that has been done in this area has been descriptive of the template aging effect rather than inferential. Several authors have presented descriptive evidence of template aging. Here we focus on studying error rates (both FAR and FRR) over time and determining whether or not the impact of time is statistically significant. We apply our methodology to the National Institute of Standards and Technology (NIST) Biometric Score Set Release 1 (BSSR1), which was recently made public. The results of our regression analysis of the match scores and our logistic regression of the decision data indicate that there is a significant effect due to the changes in time for some matching algorithms at some thresholds. However, the template aging effect is not consistent across all modalities and all thresholds.

"Time Series Analysis of Music"

Mathematics and music have always been very closely related. Here, we discuss some of the mathematical and statistical characteristics of music using techniques from time series analysis. First, we will see how music can be transcribed by applying time series analysis to it – we will look at and discuss the time series and autocorrelation plots of different songs. Then, we will observe how we can use the analysis to obtain the musical score from an audio recording. Here, we will work on note identification through a Fourier series representation of the wave function of the melody. For this purpose we will use the statistical program R and its add-on musical package tuneR.

Matt Norton
"Wilson Approximate Confidence Intervals"

There are a variety of ways in which confidence intervals (CIs) can be created. However, the goal of all CIs is the same in that they attempt to capture a parameter a certain percentage of the time. For example, in making a 95% confidence interval, one hopes that the parameter will be captured in the interval 95% of the time. This paper investigates a method of confidence interval creation first introduced by Edwin B. Wilson in his 1927 article, "Probable Inference, the Law of Succession, and Statistical Inference," which was published in the Journal of the American Statistical Association. In his paper, Wilson creates a confidence interval for the binomial distribution using a weighted variance. This paper extends his methodology, essentially creating a "generalized Wilson" confidence interval that can be applied to any distribution whose variance is a quadratic function of its mean. This includes many common distributions, such as the uniform, Poisson, gamma, exponential, and negative binomial. These are examined throughout this paper, considering each individually as well as comparing them to the traditional large-sample estimator. We further develop the work presented in Atkinson and Schuckers (2004).
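A sketch of the binomial case, contrasting Wilson's interval with the traditional large-sample (Wald) interval at an extreme observed proportion:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson (1927) score interval for a binomial proportion."""
    p_hat = successes / n
    center = (p_hat + z**2 / (2 * n)) / (1 + z**2 / n)
    half = (z / (1 + z**2 / n)) * math.sqrt(
        p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - half, center + half

def wald_interval(successes, n, z=1.96):
    """Traditional large-sample interval, for comparison."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

# With zero successes the Wald interval degenerates to a single point,
# while the Wilson interval still has positive width.
print(wald_interval(0, 20))    # (0.0, 0.0)
print(wilson_interval(0, 20))
```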

2004

Travis Atkinson
"Approximate Confidence Interval Estimation for Beta-binomial Proportions of Biometric Identification Devices"

The goal of this project was to determine the performance of confidence interval estimation using the methodology of Agresti and Coull (1998) on proportions from a Beta-binomial distribution. Agresti and Coull applied their approach to the binomial distribution, specifically for estimating proportions. In this research we tried to determine the appropriate manner in which to extend that work to proportions from a Beta-binomial distribution. In particular we were interested in applying such a methodology to data from biometric identification devices such as fingerprint scanners or iris recognition systems. The approach we took was to simulate data from a variety of Beta-binomial distributions and compare the performance of the augmented approach of Agresti and Coull to more traditional approaches to interval estimation. Specifically we considered four ways to extend this work. For the binomial, n independent binary observations are augmented with two successes and two failures. For the Beta-binomial we have k individuals with m binary observations each. To augment the Beta-binomial data, we first considered adding two successes and two failures to a single individual. Second, we considered adding a success and a failure to the counts for two individuals. Third, we considered adding one success to the data from each of two individuals and one failure to the data for two different individuals. Last, we considered adding a new individual, k+1, who was given two successes and two failures. We used a Monte Carlo approach for evaluating the performance of these four augmented approaches.
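A sketch of the binomial Agresti–Coull adjustment this project builds on, plus the fourth augmentation scheme (appending a pseudo-individual). The interval construction on per-individual proportions in the second function is an illustrative assumption, not necessarily the estimator used in the paper:

```python
import math

def agresti_coull(successes, n, z=1.96):
    """Agresti-Coull interval: add two successes and two failures, then Wald."""
    n_adj = n + 4
    p_adj = (successes + 2) / n_adj
    half = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return p_adj - half, p_adj + half

def augmented_individual_interval(counts, m, z=1.96):
    """Fourth augmentation scheme from the abstract: append a pseudo-individual
    with 2 successes out of 4 trials.  The Wald-style interval on the
    per-individual proportions below is an illustrative choice."""
    props = [c / m for c in counts] + [2 / 4]
    k = len(props)
    p_bar = sum(props) / k
    var = sum((p - p_bar) ** 2 for p in props) / (k - 1)
    half = z * math.sqrt(var / k)
    return p_bar - half, p_bar + half

print(agresti_coull(47, 50))
print(augmented_individual_interval([9, 10, 8, 10, 7], m=10))
```

Treating per-individual proportions as the sampling unit is what lets the extra-binomial (Beta-binomial) variation show up in the variance estimate.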

Katie Livingstone
"Modelling Disease: Mathematics in Epidemiology and Applications to the SARS Virus"

Epidemiology is a field of science that has made many significant advances in studies of the spread of disease. Studies of epidemics and disease spread have vast mathematical components. Epidemiological models are based on differential equations that provide the foundations for modeling change over time. These are useful in modeling rates of infection, rates of recovery, contact rates, birth and death rates, etc.  Such models can predict the impact of a disease on a population and can suggest valuable strategies for its control. This study presented the most basic epidemiological model and examined the common uses and implications of certain features. It looked at several more complex models and explained how to modify a disease model according to the characteristics of a disease. Lastly, this study examined how researchers have modeled SARS, one of the most recent disease outbreaks, in three published articles.
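The "most basic epidemiological model" referred to above is commonly the SIR model, a system of three differential equations. A sketch with illustrative parameter values and a simple Euler integration:

```python
# SIR model: S' = -beta*S*I/N,  I' = beta*S*I/N - gamma*I,  R' = gamma*I.
def sir(S, I, R, beta, gamma, days, dt=0.1):
    """Integrate the SIR equations with Euler steps of size dt (days)."""
    N = S + I + R
    for _ in range(int(days / dt)):
        new_inf = beta * S * I / N * dt
        new_rec = gamma * I * dt
        S, I, R = S - new_inf, I + new_inf - new_rec, R + new_rec
    return S, I, R

# One infected person in a population of 1000; beta = 0.3/day and
# gamma = 0.1/day give a basic reproduction number R0 = beta/gamma = 3,
# so the epidemic takes off and most of the population is eventually infected.
S, I, R = sir(999, 1, 0, beta=0.3, gamma=0.1, days=200)
print(round(S), round(I), round(R))
```

Changing the compartments and rates (adding an exposed class, births and deaths, quarantine) is how such a model is adapted to a particular disease, as the study describes for SARS.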

Whitney Browne
"The Price Sensitivity Exhibited by Admitted St. Lawrence Students"

This paper examines the sensitivity of admitted students at St. Lawrence University to changes in cost of attendance.  More specifically, the study investigates the effects of an increase in tuition on the probability of a student enrolling.  In order to perform this test, econometric models and procedures are used to estimate the probability that a given student will attend St. Lawrence University.  Next, a tuition increase is introduced by adjusting the actual cost of attendance at St. Lawrence, relative to each student's alternative institution cost.  The probabilities found prior to the tuition increase are compared to those found post-increase, and the elasticity of demand is calculated.  The results show that admitted students do respond to increases in tuition.  Before the Newell center was built, a tuition increase caused a drop in the proportion of students who enroll, and admitted students were elastic in response to actual out-of-pocket cost increases.  As a result, prior to the new facilities on campus, St. Lawrence University failed to reap increased revenues by raising tuition.  However, the elasticity of demand exhibited by students is decreasing (in absolute value) over time; thus there is an indication that the capital, including buildings and athletic facilities built over the last five years, is effectively increasing the value of St. Lawrence in the eyes of admitted students.

2003

Noelle Laing
"The Mathematical Theory Behind the Capital Asset Pricing Model"

In an attempt to assign values to assets and portfolios, William Sharpe developed the Capital Asset Pricing Model (CAPM) in 1964.  He based the valuations on the expected return of the asset or portfolio and the risk associated with the investment.  The CAPM was an extension of Harry Markowitz's financial model, the Efficient Frontier of optimal portfolios, developed in 1952.  Although the CAPM relies on some seemingly unfeasible assumptions, its creation dramatically improved modern finance theory.  This paper explores the development and mathematical ideas behind the CAPM, including the Efficient Frontier, the Capital Market Line and the Securities Market Line.
Getting started reference: Harrington, Diana. Modern Portfolio Theory and the Capital Asset Pricing Model: A User's Guide.  Englewood Cliffs, New Jersey: Prentice Hall, 1983.
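The Securities Market Line reduces to a single formula, E[R_i] = r_f + β_i(E[R_m] − r_f), which prices an asset by its systematic risk alone. A one-function sketch with illustrative numbers:

```python
def capm_expected_return(beta, risk_free, market_return):
    """Securities Market Line: E[R_i] = r_f + beta_i * (E[R_m] - r_f)."""
    return risk_free + beta * (market_return - risk_free)

# Illustrative numbers: an asset with beta = 1.2, a 3% risk-free rate,
# and an 8% expected market return implies a 9% expected return.
print(round(capm_expected_return(1.2, 0.03, 0.08), 4))  # 0.09
```

Beta itself is the slope from regressing the asset's excess returns on the market's, which is where the statistical work in applying the CAPM lives.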

Dennis Leahy
"Random Graphs"

A graph is simply a collection of vertices and edges.  A random graph is a collection of vertices and edges in which the edges occur by some random process.  Because chance governs how a random graph is formed, much of the theory surrounding these graphs draws on probability theory.  In this paper we will use probability theory to determine the likelihood of certain graphical properties.  The ideas of “almost all graphs” and threshold functions will be introduced and explored.
We will discuss the concept of graphical evolution and many ideas concerning graphical connectivity.  Such probabilities will be calculated and we will show how these properties depend on the relationships between graph size and the probability of an edge.  A summary of important results will be given and computer simulations will be used to compare some of our calculations involving connectivity.
Getting started reference: Palmer, Edgar M. Graphical Evolution: An Introduction to the Theory of Random Graphs. New York: John Wiley & Sons, 1985.
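The connectivity threshold discussed above can be seen in a small simulation. A sketch (with illustrative sizes) comparing edge probabilities below and above ln(n)/n in the G(n, p) model:

```python
import math
import random

def random_graph(n, p, rng):
    """Sample G(n, p): each of the n(n-1)/2 possible edges appears with probability p."""
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj

def is_connected(adj):
    """Depth-first search from vertex 0; connected iff every vertex is reached."""
    seen, stack = {0}, [0]
    while stack:
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(adj)

def connectivity_rate(n, p, trials, seed=1):
    """Monte Carlo estimate of the probability that G(n, p) is connected."""
    rng = random.Random(seed)
    return sum(is_connected(random_graph(n, p, rng)) for _ in range(trials)) / trials

# ln(n)/n is the threshold: edge probabilities well below it almost never
# yield a connected graph, and well above it almost always do.
n = 50
for mult in (0.3, 3.0):
    print(mult, connectivity_rate(n, mult * math.log(n) / n, trials=200))
```

Below the threshold the failure mode is almost always an isolated vertex, which is exactly the mechanism the classical proofs exploit.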

2002

Justin Roth
"Survival Analysis: Finding and Fitting a Model"

This paper examines survival analysis with a focus on estimation techniques for fitting lifetime data.  Some examples of lifetime data include life of automobile brakes, lifetimes of electronics, length of marriages, and length of life after medical treatment.  The parametric models examined include: (1) exponential, (2) lognormal, and (3) Weibull.  The (4) Kaplan-Meier estimator is an alternative nonparametric approach to model survival data.  When trying to fit a lifetime distribution, censoring may become an issue.  Complete data sets include all the information for all subjects, and censoring takes place when there is missing information, for example, when subjects leave the study early or outlast the length of the study.

For each of the four models, these estimation techniques have been applied to simulated data for both complete and censored data sets.  Once the data has been fitted with a distribution, the Anderson-Darling test and probability plots are used to determine the goodness of fit.  A case study examining the contraction of sexually transmitted diseases provides a concrete example of the estimation techniques.
Getting started reference: Lee, Elisa T. Statistical Methods for Survival Analysis. 2nd ed. New York: John Wiley & Sons 1992.
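The nonparametric Kaplan-Meier estimator mentioned above is short enough to sketch directly: at each observed failure time it multiplies the running survival estimate by (at risk − deaths)/(at risk), while censored subjects simply drop out of the risk set without contributing a factor:

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival curve.  events[i] is 1 for an observed failure
    and 0 for a right-censored subject.  Returns (time, S(t)) pairs at each
    distinct failure time."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv, curve, i = 1.0, [], 0
    while i < len(data):
        t = data[i][0]
        deaths = sum(1 for tt, e in data if tt == t and e == 1)
        removed = sum(1 for tt, _ in data if tt == t)
        if deaths:
            surv *= (n_at_risk - deaths) / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed
        i += removed
    return curve

# Five subjects: failures at times 1, 2, 4, 5 and one censored at time 3.
# The censored subject lowers the later denominators but adds no step.
print(kaplan_meier([1, 2, 3, 4, 5], [1, 1, 0, 1, 1]))
```

With no censoring this reduces to one minus the empirical distribution function, which is why it is the natural nonparametric companion to the parametric fits.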

2000

Sarah Auer
"Using Environmental Data and Fathom Software to Teach Statistical Concepts"

As an independent project for the spring semester, I chose to explore how Fathom, a statistics software package, could be used to teach statistical concepts within the context of a high school statistics course.  The project has involved creating a series of activities to be used to guide students through various components of the software program while at the same time introducing them to statistical concepts.  In addition, the data used are focused on environmental issues.  Information regarding the topics that would be appropriate to cover at the high school level was obtained from the curriculum of the Advanced Placement Statistics course provided by the College Board (see Appendix).

Since Fathom cannot be utilized in the instruction of every statistical concept within the curriculum, I chose a set of topics where I felt Fathom could most effectively be used as an instructional tool.  The activities lead students through an exploration of bivariate data.  This includes concepts such as analyzing patterns in scatterplots, linearity, least squares regression, residual plots, outliers, influential points, and correlation.

1999

Sharon Rohloff
"Markov Chains"

A Markov chain describes how a process moves from state to state over time.  Real world applications can be found in many fields including biology, physics, economics, sociology and games of chance.  The structure of a Markov chain is best represented by a matrix of transition probabilities.  A variety of examples will serve to illustrate the basic properties of the chains and the usefulness of the transition matrices.  Real world examples will include applications of Markov chains to gambling, the game of tennis, and an original example about two-person random walks.
Getting started reference: Kemeny, John and Snell, J. Laurie, Finite Markov Chains. New York: Springer-Verlag, 1976.
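The transition-matrix view makes long-run behavior a matter of repeated multiplication: the state distribution after one step is the current distribution times the matrix. A sketch with an illustrative two-state chain:

```python
def step_distribution(dist, P):
    """One step of the chain: multiply the row vector `dist` by transition matrix P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

# Illustrative two-state chain: from state 0 stay with probability 0.9,
# from state 1 stay with probability 0.7.
P = [[0.9, 0.1],
     [0.3, 0.7]]

dist = [1.0, 0.0]          # start in state 0 with certainty
for _ in range(100):
    dist = step_distribution(dist, P)

# The distribution settles at the stationary distribution [0.75, 0.25],
# which satisfies pi = pi P.
print([round(x, 3) for x in dist])
```

The gambling and tennis examples in the paper work the same way, with absorbing states (ruin, game won) appearing as rows that map back to themselves.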

1996

Peter Fellows
"Classification of the 1995 Blowdown in the Adirondack Park Using PCA and MIA with LANDSAT TM Data"

Advisors: Robin Lock (rlock@stlawu.edu) and William Elberty

Landsat Thematic Mapper (TM) satellite images were used to analyze the damage of a July 15, 1995 windstorm to the western portion of the forest cover of the Adirondack Park, New York and vicinity.  Multivariate image analysis (MIA), incorporating principal components analysis (PCA) and associated feature-space scatterplots, was used to identify areas of blowdown, assess severity of impact, and identify the forest types affected.  The strategy used in MIA offers a reverse approach to image classification compared to conventional image processing methodology.  MIA proved both sensitive and discriminating, and the procedure has potential usefulness in a variety of image analysis applications.

1995

Karen Fischer
"Statistical Models of Horse Racing"

In this project, I attempted to predict the results of horse races using multiple regression and discriminant analysis.  I explain, with supporting data including a sample prediction, that multiple regression does not appear to be a precise way of predicting the results.  A large portion of the paper explains the procedure of discriminant analysis, showing step by step how the model was improved to arrive at the best discriminant model for horse racing.  I conclude with a sample prediction for a randomly selected race that correctly identifies both the first and second place horses of this particular race using discriminant analysis.
Getting started reference: Beyer, Andrew. Picking Winners, A Horse Player's Guide. Boston: Houghton Mifflin Company, 1985.

1994

"Mathematical Modelling: A Study of the Transmission of HIV Infection and AIDS"

The aim of this paper is to review mathematical modeling of the transmission of HIV and offer some new models describing a specific situation.  I will survey models using various techniques and then present two different models for the case of transmission of HIV through prostitution.  At present, I am carrying out this modeling even though much of the numerical information needed is not available, as is true for other researchers engaging in this activity.  Hence, these models cannot yet be used to reliably predict the future incidence of AIDS.  The modeling is further complicated by the fact that we know little about the epidemiology of HIV and the transmission dynamics of the infection.  Questions about the proportion of HIV-infected persons who will develop AIDS, the progression rate over time from infection to AIDS, the role of co-factors in facilitating HIV transmission or progression to AIDS, and the degree of and variability in individual infectiousness remain unanswered; on such issues, depending upon the personal view of the modeler, models use different variables and take various forms.  The number and diversity of models of the AIDS epidemic are accumulating at a rapid rate.
Results from these models have varied greatly, with each model purporting a decisive advantage over others.  Descriptions of models of HIV transmission that appear in the literature are not sufficiently detailed for other researchers to evaluate the models and comprehend why they produce such diverse conclusions.  However, it is clear that one reason for the diversity of results is that, lacking solid data, modelers have different assumptions for initial values of parameters driving these models, different assumptions about their future time trends, and different estimations of distribution functions.  Different projections are also due to the fact that different populations are modeled and parameters differ.  To be able to compare all the models, all modelers would need to work with a standard set of initial parameter values and distributions in order to project future trends.  In effect, all would be modeling the same population.  All differences would then result from differences in the description of the process of the epidemic.  However, modelers have personal interests in different populations, and this adds to the increase in the number of models.  Currently, no criterion exists to facilitate a choice among the models.

1992

Matthew Griffin
"Logistic Regression"

When one mentions the word model many things might come to mind.  First there is the role model: the people that we look up to and whose footsteps we try to follow in.  Then there is the model found in the toy section of most stores.  Here the model is usually a car, boat, or plane.  Last, but certainly not least, is the fashion model.  We think of them as the most beautiful people in our society because they can sell the product.  We can talk about the word model in a statistical sense, too.  When we do, we can revert back to the model in the toy store, because what we are really doing is in fact building something.  A statistical model is simply an equation which tries to describe or explain the nature of the relationship among one or more random variables.
One of the main reasons one builds a model is to predict the value of a dependent variable based on these random variables.  It is the dependent variable that we will want to look at for the moment.  Suppose the dependent variable is allowed to take on any possible value.  In a case such as this one could use what is known as Multiple Regression and its techniques.  However, there is another possibility for the dependent variable.  What if there are only a finite number of possible outcomes?  It is this case that this paper will be focusing on.  It is in this case that we use what is known as Logistic Regression.
The distinguishing characteristic, then, between a logistic regression model and a linear regression model is that the dependent variable in a logistic regression model is binary or dichotomous.  That is, the dependent variable has only a finite number of outcomes.  One can think of an endless number of situations where this is true.  As a quick example, let's look at test scores.  What if we wanted to predict whether or not a person would pass an exam?  The dependent variable clearly is dichotomous: either the person passes or not.  Then we could look at variables such as past test scores, past quiz scores, the sex of the student, whether the student studies or not, and even his or her GPA.  This is in contrast to what one would see in the multiple regression case, where one might be trying to predict final course average based on the same factors.  One can see that in that case the dependent variable is continuous and not dichotomous.
We worked through the book entitled Applied Logistic Regression by David W. Hosmer and Stanley Lemeshow.  They described the theory of this type of regression and illustrated the technique with a study on heart disease.  Basically they were trying to predict whether or not certain individuals were more prone to heart disease and heart attacks based on a number of factors.  We can see that their dependent variable is indeed dichotomous by nature: the people will either have heart disease or not.  The factors we mentioned are the independent variables of their study, and they included such things as age, sex, race, and history of heart disease among other family members.  For this paper, we will be doing a similar sort of thing.  We will be looking at all the theory behind this topic and extending what we have learned to a study of our own.  We will be using admissions data from St. Lawrence University to try to predict whether or not people will come as freshmen in the fall.  Thus the dependent variable has only two possible values; either the student will attend or they will not attend.  Some of the independent variables we will be using are SAT scores, sex, high school rank, high school GPA, and the region of the country in which they live.
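As a sketch of fitting such a model — with made-up pass/fail data, not the admissions data or the Hosmer and Lemeshow study — the logistic regression likelihood can be maximized by simple gradient ascent:

```python
import math

def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit P(y=1|x) = 1/(1+exp(-(b0 + b1*x))) by gradient ascent on the
    log-likelihood of a single-predictor logistic regression."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1 / (1 + math.exp(-(b0 + b1 * x)))
            g0 += y - p          # gradient wrt intercept
            g1 += (y - p) * x    # gradient wrt slope
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Hypothetical data: hours studied vs. pass (1) / fail (0) on an exam.
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
passed = [0, 0, 0, 0, 1, 1, 1, 1]
b0, b1 = fit_logistic(hours, passed)
p_3h = 1 / (1 + math.exp(-(b0 + b1 * 3.0)))
print(round(p_3h, 2))  # well above 0.5: three hours of study predicts a pass
```

Statistical packages instead use Newton-type iterations and also report standard errors, but the fitted probabilities come from this same model.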

Lisa King
"Efficient Methods of Permutation Testing with an Application to a Test for Correlation"