Introduction
In this post, I'm going to walk through applying machine learning algorithms in particle physics. There are many datasets ready for ML, but one of particular interest is the MiniBooNE dataset available on the UCI machine learning repository. This dataset was used in a published paper by the experiment collaboration, which is very interesting to read although somewhat old. Particle identification was one of the early success stories for machine learning and deep learning, with applications dating back to the early 2000s.
Data
- The dataset has been taken from the UCI machine learning repository.
- It comes from the MiniBooNE experiment conducted at Fermilab.
- A stream of muon neutrinos is fired, and the detector measures the presence of electron neutrinos (signal) among the muon neutrinos (noise).
- There are 50 features in the dataset for every detection made; however, no information is given about what the features represent.
- There are no missing values.
- The first line of the file `MiniBooNE_PID.txt` contains 2 space-separated values: the number of signal events and the number of background events. In the data itself, the signal events come first, followed by the background events.
- This is a binary classification problem where we want to tell whether a given event is an electron neutrino (signal) or not.
We can download the dataset directly into our Python/Jupyter workspace.
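A minimal sketch of the download step, assuming the file still lives at the standard UCI path for this dataset (the exact URL is an assumption):

```python
import urllib.request

# assumed UCI location of the MiniBooNE particle identification data
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00199/MiniBooNE_PID.txt"
urllib.request.urlretrieve(url, "data.txt")
```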
We will need to do some data processing to make the data ready for the next steps. We are going to do the following:
- The data is stored in `data.txt`, which was downloaded from the above link.
- We use the `pandas` library to read the data, skipping the first row as it contains the number of positive and negative labels.
- Create a NumPy array of 1's (electron neutrino) and 0's (muon neutrino) which acts as our labels for the classification problem.
- Convert the input dataframe into a NumPy array for the analysis.
- After taking a look at the data, we see that many features take large values, which makes it difficult for machine learning algorithms to converge to a result. The solution is to scale down the data.
- We just want the range to change while preserving the shape of each feature's distribution, so that the data still carries the information it did before scaling. A good scaler to use is therefore the minmax scaler, which rescales each feature linearly.
- Use `train_test_split` to split the data into training and test sets, with a default of 75% training data and 25% test data.
Let's implement this
We need to import the needed packages, then define the data and read it into a pandas dataframe. A minimal sketch of this step:

```python
import numpy as np
import pandas as pd

# skip the first row: it holds the signal and background event counts
df = pd.read_csv("data.txt", header=None, skiprows=1, sep=r"\s+")
```
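Next we build the labels and convert the features to arrays, following the steps listed above. A sketch, assuming `data.txt` is the raw `MiniBooNE_PID.txt` file whose first line holds the two event counts:

```python
# read the signal/background counts from the first line of the file
with open("data.txt") as f:
    n_signal, n_background = (int(v) for v in f.readline().split())

# labels: 1 = electron neutrino (signal), 0 = muon neutrino (background);
# the signal events come first in the file
y = np.concatenate([np.ones(n_signal), np.zeros(n_background)])

# convert the input dataframe into a NumPy array for the analysis
X = df.to_numpy()
```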
We explore the data features by looking at the first few rows (e.g. with `df.head()`):

```
   0        1        2        3        4        5        6        7        8        9        ... 40      41       42        43      44       45       46        47       48       49
0  2.59413  0.468803 20.6916  0.322648 0.009682 0.374393 0.803479 0.896592 3.59665 0.249282 ... 101.174 -31.3730  0.442259 5.86453 0.000000 0.090519  0.176909 0.457585 0.071769 0.245996
1  3.86388  0.645781 18.1375  0.233529 0.030733 0.361239 1.069740 0.878714 3.59243 0.200793 ... 186.516  45.9597 -0.478507 6.11126 0.001182 0.091800 -0.465572 0.935523 0.333613 0.230621
2  3.38584  1.197140 36.0807  0.200866 0.017341 0.260841 1.108950 0.884405 3.43159 0.177167 ... 129.931 -11.5608 -0.297008 8.27204 0.003854 0.141721 -0.210559 1.013450 0.255512 0.180901
3  4.28524  0.510155 674.2010 0.281923 0.009174 0.000000 0.998822 0.823390 3.16382 0.171678 ... 163.978 -18.4586  0.453886 2.48112 0.000000 0.180938  0.407968 4.341270 0.473081 0.258990
4  5.93662  0.832993 59.8796  0.232853 0.025066 0.233556 1.370040 0.787424 3.66546 0.174862 ... 229.555  42.9600 -0.975752 2.66109 0.000000 0.170836 -0.814403 4.679490 1.924990 0.253893
```
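Several columns clearly span large ranges, which motivates the scaling step below. A quick way to confirm this with pandas' built-in summary:

```python
# per-feature minima and maxima: several features take very large values
print(df.describe().loc[["min", "max"]])
```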
Now we need to scale the data. A minimal sketch using scikit-learn's `MinMaxScaler`, which applies the manual formula shown in the comments:

```python
from sklearn.preprocessing import MinMaxScaler

# rescale every feature linearly into the [0, 1] range
scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X)

# the formula MinMaxScaler applies per feature:
#   X_std    = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
#   X_scaled = X_std * (range_max - range_min) + range_min
```
Then we split the dataset into training and test sets:

```python
from sklearn.model_selection import train_test_split

# the defaults give 75% training data and 25% test data;
# random_state is fixed here only for reproducibility (an assumption)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, random_state=42)
```
Evaluation
We are going to apply different ML algorithms to our data and will need a rubric to evaluate our models. We define a function for model evaluation based on the confusion matrix, which quantifies how many of the predicted values were correct and incorrect.
Definitions
- Accuracy: the number of true predictions (true 0's and true 1's) divided by the total number of predictions made.
- Precision: the number of true 1's divided by the total number of 1's predicted, precision = TP / (TP + FP). It tells us how well we have predicted the 1's; precision = 1 if no 0's are predicted as 1.
- Recall: the number of true 1's divided by the number of actual 1's, recall = TP / (TP + FN), i.e. the fraction of correctly classified 1's. Recall = 1 if no 1's are predicted as 0.
- ROC: a graph where the false positive rate is plotted on the X-axis and the true positive rate on the Y-axis. The area under the ROC curve (AUC) is a good measure of how well the algorithm has performed; a score close to 1 is a good AUC score.
A minimal sketch of such a function (the `evaluate` name and its exact output format are assumptions):

```python
from sklearn.metrics import confusion_matrix, roc_curve, auc

def evaluate(model, X_test, y_test):
    # confusion matrix from the hard 0/1 predictions
    y_pred = model.predict(X_test)
    print(confusion_matrix(y_test, y_pred))
    # ROC curve and its area, from the predicted signal probabilities
    fpr, tpr, thresholds = roc_curve(y_test, model.predict_proba(X_test)[:, 1])
    print("ROC AUC:", auc(fpr, tpr))
```
Models
In this analysis we will try different machine learning algorithms and seek the best model among them. We will use the following models:
- Logistic regression
- K-nearest Neighbors
- Decision trees
- SVM (Support Vector Machines)
- Random Forest
Logistic regression
Logistic regression uses the sigmoid function, σ(z) = 1 / (1 + e^(-z)), to estimate the probability of an instance being classified as 1. The C parameter is the inverse of the regularization strength: smaller values of C penalize large weights more strongly, guarding against overfitting the data.
```python
from sklearn.linear_model import LogisticRegression

# define the model (C=1.0 and max_iter are assumptions)
model = LogisticRegression(C=1.0, max_iter=1000)
model = model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# print model scores
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print("train accuracy =", train_score, "test accuracy =", test_score)

# evaluate the model
evaluate(model, X_test, y_test)
```
Running this prints the train/test accuracy, the confusion matrix, and the ROC AUC for logistic regression.
K-nearest neighbors
The K-nearest neighbors model does not actually train a model on the data; it simply stores all the training data given to it. When a test point is given, it computes the distance from that point to the stored training points and classifies it as 1 or 0 by a majority vote among the chosen k nearest neighbors. It is a lazy (instance-based) supervised learning algorithm: no real work happens until prediction time.
```python
from sklearn.neighbors import KNeighborsClassifier

# define the model (k=5 nearest neighbors is an assumption)
model = KNeighborsClassifier(n_neighbors=5)
model = model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# print model scores
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print("train accuracy =", train_score, "test accuracy =", test_score)

# evaluate the model
evaluate(model, X_test, y_test)
```
Running this prints the same scores and confusion matrix for K-nearest neighbors.
Decision Trees
A binary decision tree is a structure based on a sequential decision process. Starting from the root, a feature is evaluated and one of the two branches is selected. This procedure is repeated until a final leaf is reached, which normally represents the classification target we are looking for. The model can overfit if no limit is placed on how deep the tree can grow.
```python
from sklearn.tree import DecisionTreeClassifier

# define the model (max_depth=5 is an assumption, set to limit overfitting)
model = DecisionTreeClassifier(max_depth=5)
model = model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# print model scores
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print("train accuracy =", train_score, "test accuracy =", test_score)

# evaluate the model
evaluate(model, X_test, y_test)
```
Running this prints the scores and confusion matrix for the decision tree.
Random Forest
Random forest is a classification algorithm consisting of many decision trees. It uses bagging and feature randomness when building each individual tree to try to create an uncorrelated forest of trees whose prediction by committee is more accurate than that of any individual tree.
```python
from sklearn.ensemble import RandomForestClassifier

# define the model (n_estimators=100 trees is an assumption)
model = RandomForestClassifier(n_estimators=100)
model = model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# print model scores
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print("train accuracy =", train_score, "test accuracy =", test_score)

# evaluate the model
evaluate(model, X_test, y_test)
```
Running this prints the scores and confusion matrix for the random forest.
Conclusion
In our analysis we find that Random Forest is the best algorithm, with the highest ROC AUC value.
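One way to reproduce this comparison is a small loop over the models (a sketch; the hyperparameters repeat the assumptions made above):

```python
from sklearn.metrics import roc_auc_score

# head-to-head comparison of the four models by test ROC AUC
models = {
    "Logistic regression": LogisticRegression(C=1.0, max_iter=1000),
    "K-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
    "Decision tree": DecisionTreeClassifier(max_depth=5),
    "Random forest": RandomForestClassifier(n_estimators=100),
}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    score = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    print(f"{name}: ROC AUC = {score:.3f}")
```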