Kaggle Survey Dataset.
Every year, Kaggle conducts a Machine Learning and Data Science survey. It releases the cleaned survey results and hosts a competition in which Kagglers perform data analysis and extensive EDA to tell a story based on the facts.
You can download this dataset from Kaggle using the link below.
Data Description.
There are 44 questions in total but 296 columns, because some questions are multiple choice and each of their choices is stored in a separate column.
We initially dropped the Duration column and worked with the remaining 43 questions. The first row contains the actual questions.
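The preparation step described above might look like the sketch below. It assumes the survey has been loaded into a pandas DataFrame whose first row holds the question text. The `duration_col` parameter and the toy column names are placeholders for illustration, not names taken from the real file.

```python
import pandas as pd

def split_questions(raw: pd.DataFrame, duration_col: str):
    """Separate the question row from the responses and drop the duration column.

    `duration_col` is a hypothetical parameter: pass whatever the duration
    column is called in your copy of the survey file.
    """
    trimmed = raw.drop(columns=[duration_col])
    all_questions = trimmed.iloc[0]                  # first row: question text per column
    data = trimmed.iloc[1:].reset_index(drop=True)   # remaining rows: the responses
    return all_questions, data

# Toy stand-in for the survey file (made-up values).
raw = pd.DataFrame({
    "Duration": ["Duration (in seconds)", "120", "300"],
    "Q1": ["What is your age (# years)?", "18-21", "25-29"],
    "Q7_1": ["Which languages do you use? - Python", "Python", None],
})

all_questions, data = split_questions(raw, duration_col="Duration")
print(all_questions["Q1"])
print(data.shape)
```

Keeping the question row in its own object means the response table contains only answers, so counts are not inflated by the question text.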
First Cut Solution.
The first-cut solution is pretty simple: I grouped each question with its respective choices. This is similar to value_counts() but more detailed, and it lets us analyse the individual columns separately.
def print_questions_info(all_questions_groups):
    for key in all_questions_groups:
        question_id = key
        if all_questions_groups[key]:
            question_ids = all_questions_groups[key]
            question = str(all_questions[key+'_1'])
            question = question.split('?')[0] + '?'
            print_question(question, question_id)
            print_Multiple_choice_question_answers(data, question, question_ids)
            print_question_summary(question, question_id)
        else:
            question = all_questions[key]
            print_question(question, question_id)
            print_Single_choice_question_answers(data, question, question_id)
            print_question_summary(question, question_id)
This is the main function, which prints the count of each choice for each column. The initial analysis based on this output is listed below.
Here you can see how we get the total count for each choice for the question “For how many years have you been writing code and/or programming?”. The same format is used for all the columns.
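The helpers called by the function above are not shown in this post, so here is a minimal sketch of how the grouping and counting could work. It assumes multiple-choice columns are named like `Q7_1`, `Q7_2` (consistent with the `key+'_1'` lookup above) and that an unselected choice is an empty cell. The function names below are hypothetical stand-ins, not the notebook's actual helpers.

```python
import pandas as pd

def group_question_columns(columns):
    """Map each question id to its choice columns ([] for single-choice questions)."""
    groups = {}
    for col in columns:
        question_id, _, choice = col.partition("_")
        groups.setdefault(question_id, [])
        if choice:  # e.g. "Q7_1" belongs to question "Q7"
            groups[question_id].append(col)
    return groups

def single_choice_counts(data, question_id):
    """Count how often each answer was picked for a single-choice question."""
    return data[question_id].value_counts()

def multiple_choice_counts(data, question_ids):
    """Count the non-empty cells in each choice column of a multiple-choice question."""
    counts = {}
    for col in question_ids:
        answers = data[col].dropna()
        if not answers.empty:
            counts[answers.iloc[0]] = len(answers)  # label the count with the choice text
    return pd.Series(counts).sort_values(ascending=False)

# Toy responses (made-up values).
data = pd.DataFrame({
    "Q1": ["18-21", "25-29", "18-21"],
    "Q7_1": ["Python", "Python", None],
    "Q7_2": [None, "SQL", None],
})

groups = group_question_columns(data.columns)
print(groups)
print(single_choice_counts(data, "Q1"))
print(multiple_choice_counts(data, groups["Q7"]))
```

An empty list for a question id plays the same role as the falsy branch in `print_questions_info`: it marks the question as single choice.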
Initial Insights.
- The largest group of Kaggle users is aged 18–21.
- 76% of Kaggle users are men and 22% are women.
- Kaggle is used in over 56 countries, with India and the USA at the top. Notably, there are more Asian and Western countries than Middle Eastern countries, and the number of respondents from Middle Eastern countries is also lower.
- The student community is pretty active on Kaggle: the split between students and non-students is about 50:50.
- Coursera, Kaggle and Udemy are the most commonly used learning platforms.
- Kaggle is also used by beginners to start their data science journey.
- Many Kagglers have a Master’s degree as their highest (attained or planned) level of education.
- About 20% have published research papers.
- Python is the most used language and SQL the second most used, which suggests there are many data analysts on Kaggle.
- Jupyter, VSCode and PyCharm together account for 50% of all IDE usage.
- Matplotlib, Seaborn and Plotly are the most used visualization libraries.
- Around 17% of respondents say they don’t use any ML methods.
- About 23% of participants reported that they have spent money on ML or cloud computing.
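The percentages above are shares rather than raw counts. The sketch below shows one way such shares could be computed, using toy data with made-up column names and values. For a multiple-choice question it distinguishes the share of respondents from the share of all selections, since the two can differ considerably.

```python
import pandas as pd

# Toy responses (made-up values, not survey results).
data = pd.DataFrame({
    "gender": ["Man", "Man", "Woman", "Man"],
    "ide_jupyter": ["Jupyter", "Jupyter", None, "Jupyter"],
    "ide_vscode": [None, "VSCode", "VSCode", None],
})

# Single-choice question: share of respondents per answer, in percent.
gender_share = data["gender"].value_counts(normalize=True) * 100

# Multiple-choice question: one column per choice, a non-empty cell means "selected".
selected = data[["ide_jupyter", "ide_vscode"]].notna()
share_of_respondents = selected.mean() * 100                        # can sum to more than 100
share_of_selections = selected.sum() / selected.sum().sum() * 100   # always sums to 100

print(gender_share.round(1))
print(share_of_respondents.round(1))
print(share_of_selections.round(1))
```

A statement such as "these IDEs account for 50% of all IDE usage" corresponds to the share of selections, whereas "17% of people don't use any ML methods" corresponds to the share of respondents.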
These initial results are essentially the output of a univariate analysis. We will now do a second, in-depth analysis of the data. Before that, I’d like to clarify a few terms.
Sunburst Chart
The sunburst chart is used to display hierarchical data. Each level of the hierarchy is represented by one ring or circle, with the innermost circle as the top of the hierarchy. A sunburst chart without any hierarchical data (one level of categories) looks similar to a doughnut chart.
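As a rough illustration, the data behind a two-level sunburst is just a count per combination of categories. The sketch below prepares those counts with pandas and, as an assumption, draws them with Plotly Express; this post does not state which library produced its sunburst charts, and the column names and values are made up.

```python
import pandas as pd

def sunburst_counts(df, path):
    """Count respondents for every combination of the columns in `path`.

    `path` lists the hierarchy from the innermost ring outwards.
    """
    return df.groupby(path).size().reset_index(name="count")

def plot_sunburst(df, path):
    """Draw the chart with Plotly Express.

    The import lives here so the counting step has no plotting dependency.
    """
    import plotly.express as px
    fig = px.sunburst(sunburst_counts(df, path), path=path, values="count")
    fig.show()

# Toy data (made-up values).
toy = pd.DataFrame({
    "gender": ["Man", "Man", "Woman", "Woman", "Man"],
    "age": ["18-21", "22-24", "18-21", "18-21", "18-21"],
})

counts = sunburst_counts(toy, ["gender", "age"])
print(counts)
# plot_sunburst(toy, ["gender", "age"])  # uncomment to render the chart
```

Here the inner ring would show gender and the outer ring would split each gender segment by age group.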
Heat Map
A heat map is a data visualization technique that shows magnitude of a phenomenon as color in two dimensions. The variation in color may be by hue or intensity, giving obvious visual cues to the reader about how the phenomenon is clustered or varies over space.
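For categorical survey data, the grid behind a heat map is typically a cross-tabulation. The sketch below builds a row-normalised table with `pd.crosstab`, which is similar in spirit to the `groupby(...).size().unstack()` pattern used in the code later in this post. The column names and values are toy data.

```python
import pandas as pd

# Toy data (made-up values).
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B"],
    "education": ["Master", "Bachelor", "Master", "Bachelor", "Bachelor"],
})

# Each row sums to 1, so countries with different numbers of respondents are comparable.
shares = pd.crosstab(toy["country"], toy["education"], normalize="index")
print(shares.round(2))

# The table can then be passed to a heat map function, for example:
# import seaborn as sns; sns.heatmap(shares, annot=True, cmap="Purples")
```

Normalising by row matters for reading the chart: a dark cell then means a large share within that row, not a large absolute count.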
These are the two charts used intensively throughout the analysis.
The points that we derived after a detailed multivariate analysis and extensive EDA are as follows.
- Looking at age and gender together, the ratio of men to women varies greatly with age, and there are no women over 70. This suggests that more women have been entering the tech field in recent years.
- Vietnam and India have the highest populations of young Kagglers, and the Netherlands has the highest population of older Kagglers.
- In the gender and country distribution, Romania and the United Arab Emirates have 0 non-binary respondents; notably, these are countries with high religious morality.
- People aged 25–29 are the most interested in pursuing a Master’s.
z=df.groupby(['In which country do you currently reside?','What is the highest level of formal education that you have attained or plan to attain within the next 2 years?']).size().unstack().fillna(0).astype('int16')
fig, ax = plt.subplots(figsize=(14, 24))
sns.heatmap(z.apply(lambda x: x/x.sum(), axis=1), xticklabels=True, yticklabels=True, cmap='Purples', annot=True, linewidths=0.005, linecolor='green', annot_kws={"fontsize":12}, fmt='.4f', cbar=False)
plt.title('Planned/Current Education Distribution by Country', fontname = 'monospace', weight='bold')
labels = [item.get_text() for item in ax.get_xticklabels()]
labels[-1] = 'Some College/Uni Study'
ax.set_xticklabels(labels)
plt.xticks(fontsize=12,rotation=45)
plt.yticks(fontsize=9)
plt.xlabel("Education", fontname = 'monospace', weight='semibold')
plt.ylabel("Country", fontname = 'monospace', weight='semibold')
plt.show()
del z
- Belgium is the country with the highest share of people who want to pursue a Master’s.
z=df.groupby(['In which country do you currently reside?',
'Select the title most similar to your current role (or most recent title if retired):']).size().unstack().fillna(0).astype('int16')
fig, ax = plt.subplots(figsize=(14, 24))
sns.heatmap(z.apply(lambda x: x/x.sum(), axis=1), xticklabels=True, yticklabels=True, cmap='Purples', annot=True, linewidths=0.005, linecolor='green', annot_kws={"fontsize":10}, fmt='.4f', cbar=False)
plt.title('Role Distribution by Country', fontname = 'monospace', weight='bold')
plt.xticks(fontsize=12,rotation=90)
plt.yticks(fontsize=10)
plt.xlabel("Role", fontname = 'monospace', weight='semibold')
plt.ylabel("Country", fontname = 'monospace', weight='semibold')
plt.show()
del z
- Zimbabwe has the highest share of Data Analysts; the UK and Romania have the highest share of Data Scientists.
z=df.groupby(['What is the highest level of formal education that you have attained or plan to attain within the next 2 years?',
'Select the title most similar to your current role (or most recent title if retired):']).size().unstack().fillna(0).astype('int16')
fig, ax = plt.subplots(figsize=(14, 10))
sns.heatmap(z.apply(lambda x: x/x.sum(), axis=1), xticklabels=True, yticklabels=True, cmap='Purples', annot=True, linewidths=0.005, square=True, cbar=False, linecolor='green', annot_kws={"fontsize":12}, fmt='.4f')
plt.title('Education Distribution by Role')
labels = [item.get_text() for item in ax.get_yticklabels()]
labels[-1] = 'Some College/Uni Study'
ax.set_yticklabels(labels)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.xlabel('Role', fontname = 'monospace', weight='semibold')
plt.ylabel('Education', fontname = 'monospace', weight='semibold')
plt.show()
del z
- There is a common trend between position and education: most Data Scientists have completed a Master’s.
- Most ML Engineers have attended some form of university course, and most Developer Advocates have completed a cloud certification.
z=df.groupby(['Select the title most similar to your current role (or most recent title if retired):',
'For how many years have you been writing code and/or programming?']).size().unstack().fillna(0).astype('int16')
fig, ax = plt.subplots(figsize=(16, 14))
sns.heatmap(z.apply(lambda x: x/x.sum(), axis=1), xticklabels=True, yticklabels=True, cmap='Purples', annot=True, linewidths=0.005, linecolor='green', annot_kws={"fontsize":12}, fmt='.4f', cbar=False)
plt.title('Role by Coding Experience Years', fontname = 'monospace', weight='bold')
plt.xticks(fontsize=12,rotation=60)
plt.yticks(fontsize=12)
plt.xlabel('Coding Experience Years', fontname = 'monospace', weight='semibold')
plt.ylabel('Role', fontname = 'monospace', weight='semibold')
plt.tight_layout()
plt.show()
del z
- People with less than 1 year of experience opt for Data Analyst jobs, whereas most ML Engineer and Data Scientist positions are filled by people with 5–10 years of experience.
fig, ax = plt.subplots(figsize=(16, 14))
sns.heatmap(df2.groupby('Select the title most similar to your current role (or most recent title if retired):').mean().iloc[:,1:-2],
xticklabels=True, yticklabels=True, cmap='Purples', annot=True, linewidths=0.005, linecolor='green', annot_kws={"fontsize":14}, fmt='.3f', cbar=False)
plt.title('Role and Language Usage', fontname = 'monospace', weight='bold')
plt.ylabel('Role', fontname = 'monospace', weight='semibold')
plt.xlabel('Programming Language', fontname = 'monospace', weight='semibold')
plt.show()
- Data Engineers use SQL more than any other role.
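The heat map above takes the mean of `df2` per role. How `df2` is built is not shown here, so the sketch below assumes it holds one 0/1 indicator column per language; under that assumption, the group mean is simply the share of each role that uses the language. The column names and values are made up.

```python
import pandas as pd

# Toy data (made-up values); one column per language choice, empty if not selected.
data = pd.DataFrame({
    "role": ["Data Engineer", "Data Engineer", "Data Scientist", "Data Scientist"],
    "lang_python": ["Python", None, "Python", "Python"],
    "lang_sql": ["SQL", "SQL", None, "SQL"],
})

language_cols = ["lang_python", "lang_sql"]

# 1 if the respondent selected the language, 0 otherwise.
indicators = data[language_cols].notna().astype(int)
indicators["role"] = data["role"]

# The mean of a 0/1 column within a group is the share of that group using the language.
usage_by_role = indicators.groupby("role")[language_cols].mean()
print(usage_by_role)
```

Selecting the indicator columns explicitly before calling `mean()` avoids depending on column positions and keeps non-numeric columns out of the aggregation.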
df4 = pd.concat([df3, df2[languages].add_suffix('_language')], axis=1)
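correlation_train = df4[ide_cols + [lang + '_language' for lang in languages]].corr(method='kendall')
# Mask the upper triangle (including the diagonal) so each pair is shown only once
mask = np.triu(np.ones_like(correlation_train, dtype=bool))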
plt.figure(figsize=(16, 16))
sns.heatmap(correlation_train,
annot=True,
fmt='.2f',
cmap='Purples',
square=True,
mask=mask,
cbar=False,
annot_kws={"fontsize":10})
plt.xlabel('')
plt.ylabel('')
plt.title("IDE / Language Usage", fontname = 'monospace', weight='bold')
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
plt.show()
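Because a correlation matrix is symmetric, a boolean upper-triangle mask is a common way to hide the duplicated half when plotting it. The sketch below builds such a mask for a toy matrix. It uses the default Pearson correlation only to keep the example dependency-free (the code above uses `method='kendall'`), and the column names and values are made up.

```python
import numpy as np
import pandas as pd

def upper_triangle_mask(corr: pd.DataFrame) -> np.ndarray:
    """Boolean mask that is True on and above the diagonal."""
    return np.triu(np.ones_like(corr, dtype=bool))

# Toy 0/1 usage indicators (made-up values).
usage = pd.DataFrame({
    "jupyter": [1, 1, 0, 1, 0],
    "vscode": [0, 1, 1, 0, 1],
    "python_language": [1, 1, 0, 1, 1],
})

corr = usage.corr()  # Pearson by default
mask = upper_triangle_mask(corr)
print(mask)
# sns.heatmap(corr, mask=mask, ...) then draws only the cells below the diagonal.
```

Building the mask from an array of ones, rather than from the correlation values themselves, guarantees that a cell is hidden because of its position and not because its value happens to be non-zero.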
- Python with Jupyter Notebook is the most used combination.
- Scikit-learn is the most used library and is used by Data Scientists.
- Linear Regression is the most used algorithm and is used by Data Scientists, while CNNs are predominantly used by ML Engineers.
- People with 5–10 years of experience are mostly involved in building ML prototypes.
This is what the competition is all about. If we refer to the winners’ solutions, we can come up with better analysis.
Reference Link
You can find the notebook here in my GitHub repository.