
Predicting Startup Funding Via Twitter

JKMR Data

Roger Zou, Melody Guan, Kevin Yang, Jerry Anunrojwong

Overview

Motivation and Goals

Does more effective publicizing lead to better financial results for a startup? We propose to test the converse: before financing rounds, might there be an abnormal amount of PR activity? Can we predict details of events like financing rounds based on mentions of a company on social media platforms like Twitter? Or is social media a noisy, meaningless indicator?

This project aims to use data analysis and predictive analytics to find correlations between tweets and startup funding rounds, in order to shed light on the potential importance of tweets, and social media in general, as indicators of startup success. Through a variety of models, we show that there is a relationship between tweets and startup funding.

Data Collection

Startup Fundings

We downloaded our list of startups and their funding round info from AngelList, a US website with extensive startup financial data. AngelList did not have an API, so we used Python's urllib2 library to download each search page, then used BeautifulSoup to parse the page. The data is located in the data folder.
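
A minimal sketch of this scrape-and-parse step. The search URL and CSS classes below are hypothetical placeholders; the real AngelList markup differed:

```python
import urllib2
from bs4 import BeautifulSoup

# Hypothetical search URL and selectors; the real AngelList pages differed.
url = "https://angel.co/companies?page=1"
request = urllib2.Request(url, headers={"User-Agent": "Mozilla/5.0"})
html = urllib2.urlopen(request).read()

soup = BeautifulSoup(html, "html.parser")
companies = []
for row in soup.find_all("div", class_="company-row"):
    name = row.find("a", class_="startup-link").get_text(strip=True)
    raised = row.find("div", class_="raised").get_text(strip=True)
    companies.append((name, raised))
```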

Tweets

We obtained our tweets by directly scraping Twitter for mentions of the startups on our list. While Twitter does have an API, its Search API only indexes the last 6-9 days of tweets and its Timeline API returns at most 3,200 tweets per timeline, and both are rate-limited. We initially wrote code that used the API, but decided instead to scrape the Twitter search page with Selenium WebDriver browser scripts and BeautifulSoup. The tweets are located in the data folder.
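
Roughly, the scraping loop looks like the sketch below: drive the search page, keep scrolling so the infinite-scroll results load older tweets, then parse the rendered HTML. The query string and the tweet-text selector are assumptions (Twitter's markup has changed since):

```python
import time
from selenium import webdriver
from bs4 import BeautifulSoup

# Hypothetical search query for one startup.
query = "https://twitter.com/search?q=%22ExampleStartup%22&f=realtime"

driver = webdriver.Firefox()
driver.get(query)

# Scroll repeatedly so the infinite-scroll results page loads older tweets.
for _ in range(20):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)

soup = BeautifulSoup(driver.page_source, "html.parser")
# Assumed selector for tweet text containers.
tweets = [p.get_text() for p in soup.find_all("p", class_="tweet-text")]
driver.quit()
```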

Translating Tweets

The majority of our tweets were non-English, necessitating translation in order to get an accurate model that works globally, not just for US startups. After identifying non-English tweets with the guess_language Python library, we used the Microsoft Translator API to translate them into English.
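
A rough sketch of the filtering step, assuming the guessLanguage() helper exposed by the guess_language package (function names vary across forks of this library) and the tweets list from the scraping step; the Microsoft Translator call is omitted since it requires an API key:

```python
from guess_language import guessLanguage  # assumed helper; some forks expose guess_language() instead

def is_english(text):
    return guessLanguage(text) == "en"

# English tweets are kept as-is; everything else is sent to the Microsoft
# Translator API (call omitted here -- it needs an API key and endpoint setup).
to_translate = [t for t in tweets if not is_english(t)]
```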

Exploratory Data Analysis

Startup Financial Analysis

After preliminary analysis, we determined that the average Series A funding round is approximately $6 million, the average Series B round is $11 million, the average Series C round is $16 million, and the average Series D round is around $17 million. However, there is significant variance in the amount of money each company raises. Graphed on a scale from $0 to $100 million, all Series are heavily right-skewed.

However, once we plot the log of the data instead, we get a distribution that is much less skewed and that we can treat as approximately normal.

Feature Extraction

Features from Tweet Metadata

We extracted features from Twitter metadata: the number of likes, the number of retweets, and the date of each tweet. We grouped these by (company, funding_round) combination and computed the mean and standard deviation for each pair. From the dates we created features by computing the range of dates spanned by the ~200 tweets scraped per pair, as well as the interquartile range. The intuition is that even though we can't scrape all the tweets made, if the range of dates is wide for a fixed number of tweets, then the tweets were made relatively infrequently, which might have some predictive power.
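
A sketch of this aggregation with pandas, assuming a hypothetical tweets_df with one row per scraped tweet and columns named company, funding_round, n_likes, n_retweets, and date:

```python
import pandas as pd

# Assumed layout: one row per scraped tweet.
tweets_df["date"] = pd.to_datetime(tweets_df["date"])
tweets_df["day"] = (tweets_df["date"] - tweets_df["date"].min()).dt.days

grouped = tweets_df.groupby(["company", "funding_round"])

# Mean/std of likes and retweets, plus date-spread features, per (company, round) pair.
features = grouped.agg(
    likes_mean=("n_likes", "mean"), likes_std=("n_likes", "std"),
    retweets_mean=("n_retweets", "mean"), retweets_std=("n_retweets", "std"),
    date_range=("day", lambda d: d.max() - d.min()),
    date_iqr=("day", lambda d: d.quantile(0.75) - d.quantile(0.25)))
```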

Features from Tweet Text

Apart from metadata, we extracted a number of features from the text of the tweet itself: text length, the number of hashtags, the number of persontags (mentions of other Twitter accounts, beginning with @), the number of links, the proportion of tweets made by the company itself versus other people, and the number of times tweets are directed at the company.
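
A minimal per-tweet version of these counts, using simple regular expressions; company_handle is the startup's @-name, which we assume is known from the scrape:

```python
import re

def text_features(tweet, company_handle):
    """Per-tweet text features: length, hashtags, persontags, links, directed-at flag."""
    return {
        "length":       len(tweet),
        "n_hashtags":   len(re.findall(r"#\w+", tweet)),
        "n_persontags": len(re.findall(r"@\w+", tweet)),
        "n_links":      len(re.findall(r"https?://\S+", tweet)),
        "directed_at_company": int("@" + company_handle.lower() in tweet.lower()),
    }
```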

Features from Funding Series and Market Sectors

We created an indicator variable for each of the four funding rounds (A, B, C, D). Looking at the market sector data from our scrape, we saw that a few sectors contain many companies (Biotech, for example, has around 300) while most contain very few (usually fewer than 10). Sectors with only a handful of companies are not very useful because they are too dispersed and specific, but the top sectors with many companies are more informative, so we created indicator variables for the top 10 sectors.
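
With pandas this can be done with get_dummies; companies_df and its series/sector column names below are assumptions about the scraped table:

```python
import pandas as pd

# companies_df is assumed to have 'series' (A/B/C/D) and 'sector' columns from the scrape.
series_dummies = pd.get_dummies(companies_df["series"], prefix="series")

# Keep indicator columns only for the 10 most common sectors.
top_sectors = companies_df["sector"].value_counts().head(10).index
sector_dummies = pd.get_dummies(companies_df["sector"], prefix="sector")[
    ["sector_" + s for s in top_sectors]]

X = pd.concat([companies_df, series_dummies, sector_dummies], axis=1)
```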

Data Exploration and Feature Selection

Natural Language Processing

We parsed the text of each tweet using the pattern Python library to extract nouns and adjectives, removing punctuation and stopwords (from sklearn). We decided not to assign topics using LDA because of the heterogeneity of our tweets. We split the text into sentences, tokenized the sentences into words, and then lemmatized the words, converting each word to its basic form, for example: "walk", "walking", "walks", "walked" => "walk". Because each tweet is short (at most 140 characters), we did not distinguish between sentences within a tweet.
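
The sketch below shows the same pipeline using NLTK's tokenizer, tagger, and lemmatizer as a stand-in for the pattern library (which we used in the actual project), plus sklearn's English stopword list:

```python
import string
from nltk import word_tokenize, pos_tag          # requires nltk punkt + tagger data
from nltk.stem import WordNetLemmatizer          # requires nltk wordnet data
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

lemmatizer = WordNetLemmatizer()

def clean_tokens(tweet):
    """Tokenize a tweet, drop punctuation/stopwords, keep lemmatized nouns and adjectives."""
    tokens = []
    for word, pos in pos_tag(word_tokenize(tweet.lower())):
        if word in string.punctuation or word in ENGLISH_STOP_WORDS:
            continue
        if pos.startswith("NN"):               # nouns
            tokens.append(lemmatizer.lemmatize(word, "n"))
        elif pos.startswith("JJ"):             # adjectives
            tokens.append(lemmatizer.lemmatize(word, "a"))
    return tokens
```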

Sentiment Analysis

We used the sentiment dictionary SentiWordNet 3.0, which assigns three sentiment scores to words (both nouns and adjectives): positivity, negativity, and objectivity. For each tweet, we took the average positivity score and the average negativity score over all tokens. We also defined a word as "positive" or "negative" if it had a positivity score > 0.5 or a negativity score > 0.5, respectively. For each tweet, we then summed the total "positive" words and total "negative" words (usually 0 or 1, rarely 2). To summarize, we have four features from sentiment analysis: average positivity, average negativity, positive count, negative count. For each (company, funding round) pair, we then take the average of these features over all of its tweets.
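
A sketch of the per-tweet scoring, using NLTK's SentiWordNet corpus interface as a convenient way to look up SentiWordNet 3.0 scores (the first synset per word is a simplification), applied to the token list from the previous step:

```python
from nltk.corpus import sentiwordnet as swn   # requires nltk sentiwordnet + wordnet data

def sentiment_features(tokens):
    """Average positivity/negativity plus counts of strongly positive/negative words."""
    pos_scores, neg_scores, pos_count, neg_count = [], [], 0, 0
    for word in tokens:
        synsets = list(swn.senti_synsets(word))
        if not synsets:
            continue
        s = synsets[0]                         # crude choice: first sense only
        pos_scores.append(s.pos_score())
        neg_scores.append(s.neg_score())
        if s.pos_score() > 0.5:
            pos_count += 1
        if s.neg_score() > 0.5:
            neg_count += 1
    n = max(len(pos_scores), 1)                # average over tokens found in the lexicon
    return {"avg_pos": sum(pos_scores) / n, "avg_neg": sum(neg_scores) / n,
            "pos_count": pos_count, "neg_count": neg_count}
```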

Correlation Analysis

In this analysis, we only considered the 1200 startups that had more than 150 tweets and valid dates.

We plotted the unscaled 'Amount Raised' against all unscaled numerical features, and found no linear correlations.

We then plotted log of 'Amount Raised' against all numerical features normalized using Box Cox transformation (see PCA section), and again found no linear correlations.

We did not find linear correlations for log of 'Amount Raised' against unscaled features either.

Some plots of scaled 'Amount Raised' against scaled features did display homoscedastic behavior (roughly equal variance across feature values), but the fitted slopes were essentially flat, so these relationships are not useful as linear predictors. From our correlation analysis, we infer that SVR may be a better method for predictive modeling than linear regression.

Principal Component Analysis

Principal Component Analysis (PCA) is a descriptive technique that aims to isolate a handful of linear combinations of features that "explain" most of the variance in the data; it operates on the whole dataset without a training/testing division. PCA is more informative when all features are suitably normalized, so that no single feature dominates the total variance. We therefore applied a Box Cox transformation to each column (the library chooses an appropriate parameter, different for each column, to make the transformed column approximately normal). The exception is the funding raised, to which we applied a log transformation (itself a special case of Box Cox). This is justified because our earlier plot shows that log(funding) looks normal, and when we predict log(funding), inverting the transformation to recover funding is straightforward. Our PCA shows that only a few (aggregated) features explain most of the variance: the top 3 components explain 95% of the variance, and the top 5 explain 98%. The most important features appear to be the number of favorites, the number of retweets, and the date range of the last 200 tweets.
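
A minimal sketch of this normalize-then-decompose step, assuming X is the numerical feature matrix built above (Box-Cox requires strictly positive inputs, hence the per-column shift):

```python
import numpy as np
from scipy import stats
from sklearn.decomposition import PCA

# X: numerical feature matrix, one row per (company, funding round) pair.
X_bc = np.empty_like(X, dtype=float)
for j in range(X.shape[1]):
    col = X[:, j].astype(float)
    col = col - col.min() + 1.0            # Box-Cox requires strictly positive inputs
    X_bc[:, j], _ = stats.boxcox(col)      # lambda chosen per column by maximum likelihood

pca = PCA()
pca.fit(X_bc)
print(np.cumsum(pca.explained_variance_ratio_)[:5])  # cumulative variance of top components
```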

Predictive Modeling

K-fold Cross-Validation

We split the data into training/validation data and testing data, standardizing the numerical features of the two datasets separately. We trained our model on the non-testing data, splitting it into 5 folds and using each fold as validation data in turn to obtain hyperparameters. We then tested the model on the untouched testing data to check its robustness.
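
A sketch of the split and standardization using scikit-learn's current model_selection API, assuming X and y are the feature matrix and log(funding) target built above:

```python
from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import StandardScaler

# Hold out a test set; the split fraction here is illustrative.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Standardize the two datasets separately, as described above.
X_train = StandardScaler().fit_transform(X_train)
X_test = StandardScaler().fit_transform(X_test)

# 5-fold cross-validation on the non-test data for hyperparameter selection.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
```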

Baseline Predictions

Based on just the Series data, we can predict funding amounts naively by predicting the average amount for each Series. We find the root mean squared error to be approximately $10 million, a baseline we will compare against our later, more complex predictions.
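
The baseline amounts to a per-Series group mean; a sketch, with hypothetical train_df/test_df frames and column names for the series label and amount raised:

```python
import numpy as np

# Baseline: predict each company's raise as the mean raise for its funding series.
series_means = train_df.groupby("series")["amount_raised"].mean()
baseline_pred = test_df["series"].map(series_means)

rmse = np.sqrt(((test_df["amount_raised"] - baseline_pred) ** 2).mean())
print(rmse)
```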

Linear Regression

We performed regularized linear regression on the scaled data using two methods: Lasso, which constrains the L1 norm of the parameter vector, and Ridge, which constrains the L2 norm. We computed RMSE and R-squared values in each case. The two methods performed very similarly, with the Ridge model slightly better (RMSE = 18.3, compared to a baseline of 118.2). Both methods gave an R-squared of ~0.15, which is relatively low, but not bad for an individual predictive signal. Residual plots confirmed that the residuals were approximately normal.
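
A sketch with scikit-learn's cross-validated Ridge and Lasso estimators, reusing the train/test split above; the alpha grid is illustrative, not the one we actually searched:

```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.metrics import mean_squared_error

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]   # illustrative regularization strengths
for model in (RidgeCV(alphas=alphas), LassoCV(alphas=alphas, cv=5)):
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    print(model.__class__.__name__, "RMSE:", rmse, "R^2:", model.score(X_test, y_test))
```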

Support Vector Regression (SVR)

As correlation analysis suggested the relationships are non-linear, we turn to Support Vector Regression (SVR). We try three choices of kernel: RBF, linear, and polynomial. For each kernel, we use GridSearchCV with 5-fold cross-validation to find the optimal parameters of the predictor over a reasonable (pre-determined) range of parameters. We then fit the predictor to the training data, generate predictions on the test data, and evaluate it by computing the RMSE on log(funding). We found that the RBF kernel with C=100 and gamma=0.01 is the best, with an RMSE around 1. This result is comparable to linear regression.
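
A sketch of the per-kernel grid search; the parameter ranges below are illustrative stand-ins for the pre-determined grids we actually used:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

# Illustrative grids, one per kernel choice.
grids = {"rbf":    {"C": [1, 10, 100, 1000], "gamma": [0.001, 0.01, 0.1]},
         "linear": {"C": [0.1, 1, 10, 100]},
         "poly":   {"C": [1, 10, 100], "degree": [2, 3]}}

for kernel, grid in grids.items():
    search = GridSearchCV(SVR(kernel=kernel), grid, cv=5, scoring="neg_mean_squared_error")
    search.fit(X_train, y_train)
    pred = search.best_estimator_.predict(X_test)
    print(kernel, search.best_params_, np.sqrt(mean_squared_error(y_test, pred)))
```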

Neural Nets

In order to explore other methods that can capture nonlinear trends, we ran neural nets as well. As before, we used normalized versions of the numerical data, and the target was the log of the series funding amounts. We worked with various Python neural net libraries with limited success, and ultimately decided to build the network in MATLAB. Using the Neural Network Toolbox, we created a network with a single hidden layer of 50 nodes, trained with the Levenberg-Marquardt algorithm. In the end, we were able to achieve an RMSE of 18.8.
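
The MATLAB network itself is not reproduced here; as a rough Python stand-in, a one-hidden-layer network of 50 nodes can be sketched with scikit-learn's MLPRegressor (note this trains with L-BFGS or Adam rather than Levenberg-Marquardt):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

# One hidden layer of 50 nodes, mirroring the architecture described above.
net = MLPRegressor(hidden_layer_sizes=(50,), solver="lbfgs", max_iter=2000, random_state=0)
net.fit(X_train, y_train)
print(np.sqrt(mean_squared_error(y_test, net.predict(X_test))))
```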

Root Mean Square Deviations

| Type of Model | RMSE Value |
| --- | --- |
| Baseline | 118.2 |
| SVR RBF | 1.1 |
| SVR Linear | 10.9 |
| SVR Poly | 33.1 |
| Lasso Regression (R-squared = 0.15) | 18.3 |
| Ridge Regression (R-squared = 0.15) | 18.3 |
| Neural Net | 20.2 |

Conclusion

Overall, each of our models did better than our baseline predictions. In particular, SVR with a radial basis function (RBF) kernel performed best, followed by SVR with a linear kernel, and then Ridge and Lasso regression.

In the future, we would perform more clustering to obtain forecasts tailored to company size and industry, and combine Twitter with other signals to achieve better predictions of funding success.