
EDA from a different angle

by admin

Let’s talk not about food, but about exploratory data analysis (EDA), a mandatory prelude to any serious ML.
Let’s be honest: the process is pretty boring, and to get any meaningful insight into our data you need to spend a fair amount of time actively using your favorite visualization library.
Now let’s pretend we are pretty lazy (but curious) and follow this postulate throughout the article.
So let’s ask a question: isn’t there a smart tool that would let you just press CTRL+ENTER in your favorite IDE and display on a single screen (without scrolling and countless microscopic facets) the whole picture of useful information about the dataset?
Keep in mind that even if such a tool exists, it will not replace classic EDA; but it will be a good help when we want to highlight the key patterns in the data quickly, without spending hours on visualization.
Structure of this article :

  1. A little preprocessing
  2. Visualization of predictor informativeness
  3. Variable discretization
  4. Correlationfunnel
  5. Ranked Cross-Correlations
  6. easyalluvial

That’s enough introduction; let’s move on to a practical example.

Approach to selecting an example

Initially I wanted to take some little-known data set, but in the end I realized it would not suit the example well: the patterns found might seem unobvious and therefore questionable, while our goal is to dissect a data set with algorithms that, given no a priori information, show us what we already know, thereby confirming their soundness.

Titanic seemed the most convenient example: it is not as tiny as iris, it has low-informative variables, it is well studied, its predictors are easy to understand and, importantly, it has a historical basis.
Moreover, I found an article on Habr where the author did a rather scrupulous EDA of this dataset and laid out the conclusions he drew from the pictures. That will be our baseline of sorts.
A link to the article, with its high-profile title, which will serve as our Baseline_EDA:
Titanic on Kaggle: you won’t read this post to the end
In order not to bother with downloading/reading a csv from the network, let’s take the original data set from CRAN:

install.packages("titanic")
data("titanic_train", package = "titanic")

Short preprocessing

This example has been preprocessed up and down all over the net, so I will not dwell on the topic; I do only the basic things: extract the honorific (title) from the name as an important predictor, and use it to fill the gaps in age with per-title medians.

library(tidyverse)

titanic_train %>% str

d <- titanic_train %>%
  as_tibble %>%
  mutate(title = str_extract(Name, "\\w+\\.") %>% str_replace(fixed("."), "")) %>%
  mutate(title = case_when(title %in% c('Mlle', 'Ms') ~ 'Miss',  # normalize title variations
                           title == 'Mme' ~ 'Mrs',
                           title %in% c('Capt', 'Don', 'Major', 'Sir', 'Jonkheer', 'Col') ~ 'Sir',
                           title %in% c('Dona', 'Lady', 'Countess') ~ 'Lady',
                           TRUE ~ title)) %>%
  mutate(title = as_factor(title),
         Survived = factor(Survived, levels = c(0, 1), labels = c("no", "yes")),
         Sex = as_factor(Sex),
         Pclass = factor(Pclass, ordered = T)) %>%
  group_by(title) %>%  # below - fill age gaps with per-title medians
  mutate(Age = replace_na(Age, replace = median(Age, na.rm = T))) %>%
  ungroup

# look at the gender distribution of the titles to make sure everything is in order
table(d$title, d$Sex)

title male female
Mr 517 0
Mrs 0 126
Miss 0 185
Master 40 0
Sir 8 0
Rev 6 0
Dr 6 1
Lady 0 2

Not all yogurts are equally useful…

Usually at the start of an analysis I set aside uninformative variables (set aside rather than delete irretrievably: when squeezing the most out of a model later, feature engineering on some of the deferred variables can add a few percent of model quality).
The metrics for evaluating the “usefulness” of a variable are freqRatio (the ratio of the frequency of the most common value to that of the second most common) and percentUnique (cardinality: the proportion of unique values among all values).
Detailed help can be found in the caret package:
?caret::nearZeroVar
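For intuition, the two metrics can be hand-rolled in a few lines (a sketch, not caret’s exact implementation; the helper names freq_ratio and percent_unique are mine):

```r
# freqRatio: frequency of the most common value divided by that of the runner-up
freq_ratio <- function(x) {
  tab <- sort(table(x), decreasing = TRUE)
  if (length(tab) < 2) return(Inf)  # a constant column has no runner-up
  tab[[1]] / tab[[2]]
}

# percentUnique: share of distinct values among all values, as a percentage
percent_unique <- function(x) 100 * length(unique(x)) / length(x)

x <- c("a", "a", "a", "b", "c")
freq_ratio(x)      # 3: "a" occurs 3 times, the runner-up once
percent_unique(x)  # 60: 3 distinct values out of 5
```

A near-constant column gets a huge freqRatio, while an ID-like column gets a percentUnique close to 100; both kinds stand out as outliers on the plane we plot next.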

(feat.scan <- caret::nearZeroVar(x = d, saveMetrics = T) %>%
   rownames_to_column("featName") %>%
   as_tibble)

I find it most convenient to inspect the variables on a two-dimensional plane (log-scaling both axes so that outlier points do not squash the rest into one small pile).
I never wondered whether this step counts as EDA, but while writing this article it occurred to me: we are doing exploratory analysis of predictor usefulness, a visual evaluation of the predictors, so why isn’t it EDA?

# install.packages("ggrepel")
library(ggrepel)

ggplot(feat.scan, aes(x = percentUnique, y = freqRatio, label = featName, col = featName)) +
  geom_point(size = 2) +
  geom_text_repel(data = feat.scan, size = 5) +
  scale_x_log10() +
  scale_y_log10() +
  theme_bw()

We consider predictors that are outliers either in cardinality (X axis) or in frequency ratio (Y axis) to be low-informative, and set them aside accordingly:
PassengerId; Name; Ticket; Cabin

useless.feature <- c("PassengerId", "Name", "Ticket", "Cabin")
d <- d %>% select_at(vars(-useless.feature))

This universe is discrete

To understand how the libraries listed below prepare data, this section shows with small examples what happens inside them during the data-preparation phase.
The first step is to bring all the data to a single type: data in one set are often both categorical and numeric, with outliers among the numbers and rare categories among the categorical values.
To convert continuous variables into categorical ones, we can decompose the numbers into bins with a certain step.
The simplest example of decomposition into 5 bins :

iris %>% as_tibble %>% mutate_if(is.numeric, .funs = ggplot2::cut_number, n = 5)

To obtain the strength and direction of the relationships between individual predictor values, a second technique is used: one-hot encoding.

library(recipes)

iris %>%
  as_tibble %>%
  mutate_if(is.numeric, cut_number, n = 5) %>%
  recipe(x = .) %>%
  step_dummy(all_nominal(), one_hot = T) %>%
  prep %>%
  juice %>%
  glimpse

Instead of 5 predictors we now have 23, all binary: 4 numeric predictors × 5 bins each give 20 dummies, plus the 3 levels of Species.
That is about it for the conversion tricks, but these steps are only the first stage inside 2 of the 3 libraries of our “non-classical” EDA.
Next I introduce the functionality of 3 visualization libraries:

  1. correlationfunnel – shows the effect of individual predictor values on the target (we might call it supervised EDA)
  2. lares – shows the effect of individual predictor values on individual values of other predictors (unsupervised EDA)
  3. easyalluvial – shows the cumulative relationship of the grouped values of the top “X” predictors with the target (supervised EDA again)

As you can see, their functionality differs, so while demonstrating these libraries I will quote the author’s conclusions from the “Baseline_EDA” article according to each package’s functionality. (For example, if the author shows the dependence of survival on age, I will insert that quote under correlationfunnel; if of age on class, under lares, and so on.)
The first library is on stage.

correlationfunnel

The goal of correlationfunnel is to speed up Exploratory Data Analysis (EDA).
The library’s vignette describes the methodology quite well, including a fragment on computing correlation over binarized values.
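The key point is that after binarization nothing fancier than plain Pearson correlation is needed: on 0/1 vectors it coincides with the phi coefficient. A minimal sketch (toy vectors of my own, not taken from the vignette):

```r
# Pearson correlation of two binary (0/1) vectors;
# for binary data it equals the phi (mean square contingency) coefficient
a <- c(1, 1, 0, 0, 1, 0)
b <- c(1, 0, 0, 0, 1, 1)
cor(a, b)  # 1/3, a moderate positive association
```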
The library assumes there is a target (dependent variable) in the data, and in one picture it shows the strength and direction of each relationship, ranked in descending order of that strength, forming a visual funnel (hence the name).
The binarization functions built into the library also let you merge small categories into Others.
Since the library does not work with integer variables, let’s convert them to numeric and get back to our Titanic.

# install.packages("correlationfunnel")
library(correlationfunnel)

d <- d %>% mutate_if(is.integer, as.numeric)

d %>%
  binarize(n_bins = 5, thresh_infreq = .02, one_hot = T) %>%  # binarization with the default function
  correlate(target = Survived__yes) %>%
  plot_correlation_funnel()  # "interactive = T" - plotly!

On the X axis we have the strength and direction of the correlation, and on the Y axis our predictors, ranked in descending order. The first row from the top is always the target, because its correlation with itself is the strongest possible.
Let’s check how the conclusions from this graph overlap with the conclusions of the author of our “Baseline_EDA”.

The following graph confirms the theory that the higher the class of the passenger’s cabin, the greater the chance of survival. (By “higher” I mean in reverse order, since first class is higher than second and, even more so, third.)

The funnel shows that class is the third strongest correlate, and indeed class 3 correlates negatively with survival while class 1 correlates strongly positively.

Let’s compare the odds of survival in males and females. The data support the theory expressed earlier.
(In general, you can already tell that the main factors of the model will be the gender of the passenger)

The funnel shows that passenger gender is the 2nd strongest correlate: the female gender correlates with survival, the male with death.

You can also test the hypothesis that younger people survive because they move faster, swim better, etc.
As you can see, there is no obvious correlation here.

The funnel does show this predictor’s weak significance (recall that the honorific/title encodes age information, which is why age is not so significant), but even so the funnel shows that the categories “minus infinity to 20 years” (i.e. children) and 30-38 (wealthy people, possibly 1st class) have a better chance of survival.

Let’s introduce an indicator like Survival Percentage and look at its dependence on the groups we got in the previous step

(By groups the author means the titles.)
The funnel fully confirms the author’s conclusions.

Now let’s look at the information we can get from the number of relatives on the ship.
Survivability seems to be negatively affected by both the lack of relatives and the large number of relatives.

SibSp in the funnel clearly says the same thing.
And of course, beyond the author’s conclusions, other regularities are visible here; the pleasure of contemplation is left to the reader.

Lares

Find Insights with Ranked Cross-Correlations
The author of this library went even further: he shows dependencies not just on the target, but of everything on everything.

Ranked Cross-Correlations not only explains relationships of a specific target feature with the rest but the relationship of all values in your data in an easy to use and understand tabular format.
It automatically converts categorical columns into numerical with one hot encoding (1s and 0s) and other smart groupings such as “others” labels for not very frequent values and new features out of date features.

At the link above you can see an example where the author feeds the Star Wars dataset to his package and shows the dependencies found; I got hooked on his page and loved it.
Let’s try our example.

# Careful, it pulls in quite a few dependent packages:
# devtools::install_github("laresbernardo/lares")
library(lares)

corr_cross(df = d, top = 30)

In addition to the overlap with the conclusions quoted under correlationfunnel, here are the individual quotes we can verify without regard to the target:

Other patterns can also be found. There is a negative correlation between age and class, which is likely due to older passengers being more likely to be able to afford more expensive staterooms.

In the quote above the author draws this conclusion from correlation analysis of the 2 fields as a whole; in our case, with one-hot encoding, it shows up as the strong positive correlation between Age+P_Class_1.

Also, ticket price and class are highly correlated (high correlation coefficient), which is to be expected.

The third line from the top : Fare+P_Class_1
Beyond the overlap with the author’s conclusions there is plenty of other interesting material here; again, the pleasure of contemplation is left to the reader.
In addition to optionally selecting the top X strongest insights, you can also display the whole picture and see where these significant points sit in the total mass:

corr_cross(df = d, type = 2)


easyalluvial

Data exploration with alluvial plots
Here the author starts out like the two previous packages, by binning the numeric variables, but then takes a different path: instead of {one-hot encoding + correlation}, the library lays out the top X most interesting predictors (the user decides what to feed it) as flows, where the color depends on the target and the flow width on the number of observations in that flow.
Numeric variables are divided into the categories HH (High High), MH (Medium High), M (Medium), ML (Medium Low), LL (Low Low).
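The same five-label scheme can be sketched in base R with cut() (a rough analogue under my own assumptions; easyalluvial’s own binning may choose break points differently):

```r
# Cut a numeric vector into 5 bins labeled in easyalluvial's style
bin5 <- function(x, labels = c("LL", "ML", "M", "MH", "HH")) {
  cut(x, breaks = 5, labels = labels)  # 5 equal-width intervals
}

table(bin5(iris$Sepal.Length))  # how many observations fall into each category
```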
Let’s start by taking the most significant predictors based on the graph from the correlationfunnel:

cor.feat <- c("title", "Sex", "Pclass", "Fare")

Next, we make a graph

# install.packages("easyalluvial")
library(easyalluvial)

al <- d %>%
  select(Survived, cor.feat) %>%
  alluvial_wide(fill_by = "first_variable")

add_marginal_histograms(p = al, data_input = d, keep_labels = F)

To match the author’s quotes, let’s redraw the graph using the appropriate predictors:

cor.feat <- c("Sex", "Pclass", "Age")

al <- d %>%
  select(Survived, cor.feat) %>%
  alluvial_wide(fill_by = "first_variable")

add_marginal_histograms(p = al, data_input = d, keep_labels = F)


For example, the following graph shows perfectly well that the main groups of survivors are first and second class women of all ages.

The graph additionally shows that the surviving 3rd-class women are not a small group either.

And among the men, the survivors were all boys under 15, except those in third class, plus a small proportion of adult men, mostly from first class.

This is confirmed, but again we see flows of surviving 3rd-class men in the LL and ML age categories.
Everything above concerned the easyalluvial package, but the author also wrote a second package, parcats, which uses plotly to make the graph above interactive (as in the title of this section).
This makes it possible not only to see the tooltip context, but also to rearrange the flows for a better view. (Unfortunately, the library is not very optimized yet and lags on my Titanic.)

# install.packages("parcats")
library(parcats)

cor.feat <- c("title", "Sex", "Pclass", "Fare")

a <- d %>%
  select(Survived, cor.feat) %>%
  alluvial_wide(fill_by = "first_variable")

parcats(p = a, marginal_histograms = T, data_input = d)


Bonus

In addition to exploratory data analysis, the easyalluvial library can also be used as an interpreter of black-box models (models whose parameters cannot be inspected to understand the logic behind the model’s response to given predictors).
Link to the author’s article : Visualise model response with alluvial plots
Its peculiarity: of all the libraries I have seen, a single graph explained the black-box response in at most a 2-dimensional coordinate system (one axis per predictor), with color conveying the response.
The easyalluvial library lets you do this on more than 2 predictors simultaneously (better not to get carried away, of course).
As an example, let’s train a random forest on our dataset and display the forest’s explanation over 3 predictors.

library(ranger)
library(easyalluvial)

m <- ranger(formula = Survived ~ ., data = d, mtry = 6, min.node.size = 5,
            num.trees = 600, importance = "permutation")

# predictor-importance frame from the model
(imp <- importance(m) %>% as.data.frame %>% easyalluvial::tidy_imp(imp = ., df = d))

# generate an N-dimensional space of predictor combinations (unfortunately including impossible ones!)
dspace <- get_data_space(df = d, imp, degree = 3)

# get the model response on that space
pred <- predict(m, data = dspace)

alluvial_model_response(pred$predictions, dspace, imp, degree = 3)

The author also has a connector for caret models (I do not know how relevant this still is, considering tidymodels):

library(caret)

trc <- trainControl(method = "none")
m <- train(Survived ~ ., data = d, method = "rf", trControl = trc, importance = T)

alluvial_model_response_caret(train = m, degree = 4, bins = 5, stratum_label_size = 2.8)


Conclusion

Once again, I am not calling for classic EDA to be replaced, but you must agree: it is nice to have an alternative that can save a lot of time, especially since people are rather lazy by nature, and laziness, as we know, is the engine of progress :)
