Our graduation is coming soon. As statistics students, we know that some of us are seeking careers in data science. However, looking through the long descriptions of data scientist positions, we found that data scientists seem to be rare ‘awesome nerds’ — those who embody the perfect combination of math and statistics, coding, and communication skills. As statistics students, we don’t need to worry about the stats knowledge. However, the shortcoming for most of us, coding, is holding us back at ‘nerds’. Our motivation is to help ourselves make the most of Stack Overflow, improve our coding skills, and move toward becoming ‘awesome nerds’.
We explored two datasets provided by Stack Overflow on Kaggle: one for R Q&A and one for Python Q&A, the two most popular programming languages among data scientists.
Each dataset is organized as three tables:
- Questions contains the title, body, creation date, score, and owner ID for each question.
- Answers contains the body, creation date, score, and owner ID for each answer; the ParentId column links each answer back to the Questions table.
- Tags contains the tags attached to each question, linked back to the Questions table by question ID.
4. What is the relationship between answer scores and question scores?
On a Q&A website, if you are an expert in a programming language, there may be some tricks for earning higher scores.
By exploring the data, we found a clear linear relationship between a question’s score and the score of its highest-scoring answer. Because the data contain high-leverage points and outliers, we used a robust linear model to regress the answer’s score on the question’s score.
For R: \(\text{answer's score} = 1.01 + 0.93 \times \text{question's score}\)

For Python: \(\text{answer's score} = 0.80 + 1.11 \times \text{question's score}\)
This indicates that you can get a higher score by answering popular questions. You can also expect a relatively higher score by answering Python questions, since the slope for Python (1.11) is larger than that for R (0.93).
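A minimal sketch of this fit in Python with pandas and statsmodels; the CSV file names match the Kaggle dumps, but the merging logic and the Huber loss are our assumptions, not necessarily the original pipeline:

```python
import pandas as pd
import statsmodels.api as sm

# Kaggle's Stack Overflow dumps ship as Questions.csv / Answers.csv (latin-1 encoded).
questions = pd.read_csv("Questions.csv", encoding="latin-1")
answers = pd.read_csv("Answers.csv", encoding="latin-1")

# For each question, keep the score of its highest-scoring answer.
best = answers.groupby("ParentId")["Score"].max().rename("BestAnswerScore")
qa = questions.set_index("Id")[["Score"]].join(best, how="inner")

# Robust linear model (Huber loss) of best answer score on question score;
# this down-weights the high-leverage points and outliers noted above.
X = sm.add_constant(qa["Score"])
fit = sm.RLM(qa["BestAnswerScore"], X, M=sm.robust.norms.HuberT()).fit()
print(fit.params)  # intercept and slope, compare with the equations above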
Answer Score EDA
R Answer Score summary:
Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
---|---|---|---|---|---|
-8.000 | 1.000 | 1.000 | 2.833 | 3.000 | 1058.000 |
Python Answer Score summary:
Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
---|---|---|---|---|---|
-38.000 | 0.000 | 1.000 | 3.028 | 3.000 | 8384.000 |
We chose the good-answer threshold to be 3, the third quartile of answer scores.
Label Questions
Label = 1 if a question received a good answer (its best answer’s score meets the threshold); Label = 0 otherwise.
R questions: the total number of questions is 147,071; 46,435 of them received a good answer, while 100,636 did not.
Python questions: the total number of questions is 607,276; 177,099 of them received a good answer, while 430,177 did not.
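Continuing the sketch above, the labeling step might look like this; whether the comparison with the threshold is \(\ge\) or \(>\) is an assumption on our part:

```python
THRESHOLD = 3  # the third quartile of answer scores

# Left-join so questions with no answers at all also get Label = 0
# (a NaN best score compares as False against the threshold).
qa = questions.set_index("Id")[["Score"]].join(best, how="left")
qa["Label"] = (qa["BestAnswerScore"] >= THRESHOLD).astype(int)
print(qa["Label"].value_counts())  # counts of good-answer vs. other questions
```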
Logistic Regression
Label prediction: split the data into training (80%) and test (20%) sets, fit a logistic regression of the label on the question score, and compute the prediction accuracy on the test set.
R Questions: Prediction Accuracy = 0.7927178
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -1.7259480 | 0.009928074 | -173.8452 | 0 |
| Score | 0.6181663 | 0.004818347 | 128.2943 | 0 |
Python Questions: Prediction Accuracy = 0.7923346
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -1.7462727 | 0.004808411 | -363.1704 | 0 |
| Score | 0.5649856 | 0.002307515 | 244.8459 | 0 |
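A sketch of this step with statsmodels, continuing the hypothetical `qa` frame from above; the 0.5 decision cutoff and the random seed are our assumptions:

```python
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

train, test = train_test_split(qa, test_size=0.2, random_state=0)

# Logistic regression of Label on question Score.
logit = sm.Logit(train["Label"], sm.add_constant(train["Score"])).fit()
print(logit.summary())  # coefficient table comparable to the ones above

# Classify with a 0.5 cutoff and report test accuracy.
prob = logit.predict(sm.add_constant(test["Score"]))
accuracy = ((prob >= 0.5).astype(int) == test["Label"]).mean()
print("Prediction accuracy:", accuracy)
```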
Latent Dirichlet Allocation (LDA) is a Bayesian technique widely used for inferring the topic structure of a corpus of documents. In LDA, a document is represented as a mixture of \(K\) topics, and each topic has a discrete distribution over words.
The generative process is:
1. For each topic \(k = 1, \dots, K\), draw a word distribution \(\beta_k \sim \text{Dirichlet}(\eta)\).
2. For each document \(d\), draw topic proportions \(\theta_d \sim \text{Dirichlet}(\alpha)\).
3. For each word position \(n\) in document \(d\): draw a topic assignment \(z_{d,n} \sim \text{Multinomial}(\theta_d)\), then draw the word \(w_{d,n} \sim \text{Multinomial}(\beta_{z_{d,n}})\).
First, we cleaned the text of the R questions (parsing HTML, removing punctuation and stopwords, converting to lower case, …) and split the questions into training and test sets (80% training, 20% test). Then we used gensim.models.ldamodel in Python to fit the LDA model.
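A condensed sketch of the cleaning and fitting steps; the tokenizer, stopword list, topic count, and pass count are assumptions (the walkthrough below mentions topic 15, so \(K > 15\); we use \(K = 20\) for illustration), and `train_questions` is a hypothetical name for the 80% split:

```python
import re
from bs4 import BeautifulSoup
from gensim import corpora
from gensim.models.ldamodel import LdaModel
from nltk.corpus import stopwords  # requires nltk.download("stopwords") once

STOP = set(stopwords.words("english"))

def clean(html_body):
    # Parse HTML, lower-case, strip punctuation, drop stopwords.
    text = BeautifulSoup(html_body, "html.parser").get_text().lower()
    return [tok for tok in re.findall(r"[a-z]+", text) if tok not in STOP]

docs = [clean(body) for body in train_questions["Body"]]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=20, passes=5)
```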
We built the recommender system by ensembling the fitted LDA model, user-based collaborative filtering (KNN), and content-based filtering.
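In outline, the ensemble might work as below; `get_document_topics` is gensim’s API, while `nearest_by_jaccard` and `rank_tags` are hypothetical helpers standing in for the two steps demonstrated next:

```python
def recommend_tags(tokens, lda, dictionary, topic_index, k=20, n_tags=6):
    """Route a cleaned question to its most likely LDA topic, find its k
    nearest training neighbors there, and rank their tags (the content-based
    boost happens inside rank_tags)."""
    bow = dictionary.doc2bow(tokens)
    topic, _ = max(lda.get_document_topics(bow), key=lambda t: t[1])
    neighbors = nearest_by_jaccard(topic_index[topic], tokens, k)
    return rank_tags(neighbors, tokens, n_tags)
```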
Now we use an example to show the process of tag recommendation.
Here is the cleaned text of a question from the test set.
The LDA model assigned 5,206 training questions to topic 15, the same topic it assigned to our untagged question. Among these, we need to find the 20 questions most similar to the untagged question.
The similarity of two questions is given by the Jaccard index:
\[J(A,B) = \frac{|A \cap B|}{|A \cup B|}\]
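A minimal sketch in Python, assuming questions are represented as token lists; `topic_docs` (the topic-15 training questions) and `new_tokens` are hypothetical names:

```python
def jaccard(a, b):
    """Jaccard index of two token sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Keep the 20 training questions in topic 15 most similar to the new question.
neighbors = sorted(topic_docs, key=lambda d: jaccard(d, new_tokens), reverse=True)[:20]
```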
Then we count the tags of these 20 questions.
The tags marked blue in the table above also appear in the body of the question, so we increase the frequency of these tags and sort the tags again. Finally, we recommend the top 6 tags for this question.
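The counting-and-boosting step, sketched with collections.Counter; `neighbor_ids`, the `tags_of` mapping from a question to its tags, and the +1 boost size are all assumptions:

```python
from collections import Counter

# Count the tags attached to the 20 nearest neighbors.
tag_counts = Counter(tag for q in neighbor_ids for tag in tags_of[q])

# Content-based boost: bump tags that literally appear in the question body.
for tag in tag_counts:
    if tag in new_tokens:
        tag_counts[tag] += 1  # the exact boost size is an assumption

recommended = [tag for tag, _ in tag_counts.most_common(6)]
```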
Repeating this process over the whole test set, the accuracy of recommending correct tags is 73%.