May 2025 - Present
Research Assistant
GitHub, Python, Excel, Zoom
I collaborated with two researchers to analyze large datasets stored as Excel, CSV, and DBF files. I used Python to organize the data into new Excel files and to generate new variables. We used the data to build regression models in Python and to create charts that visualize the results. Together we analyzed the models and files to draw statistical inferences about what drives profits at this company.
The goal of this analysis is to identify the key variables that drive store profits using Lasso and OLS regression, with correlation matrices to detect and address multicollinearity among predictors. By narrowing down the most relevant factors, we aim to build a reliable predictive model whose findings can then be benchmarked against other stores to validate and compare our results.
We started with the company's stores in Atlanta, Georgia, so that we could begin with a small amount of data and then work our way up to more stores. We gathered store data such as store sales, square feet, sales per square foot, opening date, and Yardi rating. We then gathered demographic data such as population, population per square foot, median household income, number of competitors, etc. We compared these variables against both store sales and sales per square foot. We ran OLS and Lasso regressions and built correlation matrices. After several runs with different alphas, we were happy with the variables used and the results. We were then able to add more stores from around the United States, bringing our store count to 299.
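A minimal sketch of how a correlation matrix can flag predictors that move together. The column names, the values, and the 0.8 cutoff are made up for illustration; they are not the project's data or settings.

```python
import numpy as np

def high_correlation_pairs(columns, data, threshold=0.8):
    """Return (name_a, name_b, r) for predictor pairs with |r| >= threshold."""
    # rowvar=False: each column of `data` is one variable.
    corr = np.corrcoef(data, rowvar=False)
    pairs = []
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((columns[i], columns[j], float(corr[i, j])))
    return pairs

# Hypothetical values, one row per store.
columns = ["population", "households", "competitors"]
data = np.array([
    [1000.0, 400.0, 3.0],
    [2000.0, 810.0, 1.0],
    [3000.0, 1190.0, 4.0],
    [4000.0, 1600.0, 2.0],
    [5000.0, 2010.0, 5.0],
])

flagged = high_correlation_pairs(columns, data)
for name_a, name_b, r in flagged:
    print(f"{name_a} vs {name_b}: r = {r:.3f}")
```

In this toy data only population and households are flagged, so one of the two would be a candidate to drop before running the regressions.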
We obtained every store location for this analysis but did not use all of them. We focused on 2018, so any store without sales data was excluded; 1,358 stores had no sales data. As we continued, we added more variables, such as an indicator for each state, the number of people receiving Social Security, and more housing data, ending up with over 45 variables. We ran many more regressions with different variables and different alphas, and settled on 28 variables and an alpha of 0.075. We left some variables out to avoid multicollinearity, and this alpha was large enough to knock out irrelevant variables without forcing too many out.
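A rough sketch of how alpha controls the number of variables Lasso keeps. It assumes scikit-learn's `Lasso` and uses synthetic data in which only one of five predictors matters; the alphas other than 0.075 are arbitrary comparison points.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: only column 0 actually drives the target.
rng = np.random.default_rng(0)
n_rows = 200
X = rng.normal(size=(n_rows, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=n_rows)

# Put predictors on a common scale so the penalty treats them alike.
X_scaled = StandardScaler().fit_transform(X)

def count_kept(alpha):
    """Fit Lasso at this alpha and count the nonzero coefficients."""
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X_scaled, y)
    return int(np.sum(model.coef_ != 0))

kept_by_alpha = {alpha: count_kept(alpha) for alpha in (0.001, 0.075, 10.0)}
for alpha, kept in kept_by_alpha.items():
    print(f"alpha={alpha}: {kept} variables kept")
```

A very small alpha leaves most variables in, while a very large one removes everything, which is why several values were tried before settling on one.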
After gathering the final data for our main store, we selected the top 15 OLS variables with a p-value lower than 0.3, plus the intercept. We selected these to predict the sales of a different company's stores and test how accurate our predictions were. The goal is to predict the sales of Space Shop stores from our data and then compare the predictions with the actual sales later. We also used all seven Lasso variables plus the Lasso intercept.
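The selection rule can be sketched as a simple filter. The variable names, coefficients, and p-values below are invented, and ranking the eligible variables by smallest p-value is an assumption, since the write-up does not say how the "top" 15 were ordered.

```python
def select_by_p_value(results, max_p=0.3, top_n=15):
    """Keep up to top_n variables with p-value below max_p.

    results: list of (name, coefficient, p_value) tuples, excluding
    the intercept, which is always carried along separately.
    """
    eligible = [row for row in results if row[2] < max_p]
    eligible.sort(key=lambda row: row[2])  # assumption: smallest p first
    return eligible[:top_n]

# Hypothetical OLS output, not the project's results.
ols_results = [
    ("median_income", 0.42, 0.01),
    ("competitors", -0.10, 0.25),
    ("population", 0.05, 0.62),
    ("store_age", 0.08, 0.29),
]

selected = select_by_p_value(ols_results)
selected_names = [name for name, _, _ in selected]
print(selected_names)
```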
To make this prediction, we gathered the same variables for the Space Shop stores that we had collected for our main store. For every variable that was not a dummy variable, we applied a log1p transform (the natural log of one plus the value). We then multiplied each value by its coefficient from the OLS or Lasso model and added the results together, including the intercept. The final step was to raise e to the power of that sum and multiply the result by the store's square footage. We did this for all Space Shop locations, and also for our main store, to see how well the coefficients predicted sales.
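The steps above can be sketched as a short function. The variable names and coefficients are placeholders, and the function assumes the model's target was the log of sales per square foot, which is what exponentiating and then multiplying by square footage implies.

```python
import math

def predict_store_sales(features, coefficients, intercept, dummy_names, square_feet):
    """Predict total sales for one store from fitted coefficients."""
    log_sales_per_sqft = intercept
    for name, coef in coefficients.items():
        value = features[name]
        if name not in dummy_names:
            value = math.log1p(value)  # log(1 + value); dummies stay 0/1
        log_sales_per_sqft += coef * value
    # Undo the log, then scale from per-square-foot to the whole store.
    return math.exp(log_sales_per_sqft) * square_feet

# Hypothetical coefficients and store, for illustration only.
coefficients = {"population": 0.5, "is_urban": 2.0}
store = {"population": 0.0, "is_urban": 1}

predicted = predict_store_sales(
    store, coefficients, intercept=1.0,
    dummy_names={"is_urban"}, square_feet=10.0,
)
print(round(predicted, 2))
```

The same function can be run over the main store's own rows to check how closely the coefficients reproduce known sales before applying them to Space Shop.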
Our final presentation is below and the link to the data is HERE!