My Work

Check out some of my projects...


Statistical Learning For Data Mining Projects

Naïve Bayes Classifier for Wine quality dataset

1) Built a naïve Bayes classifier to predict wine quality from 12 categorical variables, with a target variable ranging from 0 to 10.
2) Evaluated the model using the confusion matrix and the area under the ROC curve.
3) The model reached 55% accuracy on the test data, suggesting that naïve Bayes is not a good fit for this dataset.

Tech Stack: Scikit-learn, matplotlib, numpy

Classification of Diabetes dataset using Decision tree, SVM, Regression Tree model.

1) Built decision tree classifier, SVM, and regression tree models, and used grid search to find optimal parameter values.
2) Selected a suitable model for classification by considering the bias-variance tradeoff.
3) Evaluated the models using the confusion matrix, precision, recall, and area under the ROC curve.
4) Testing accuracy: decision tree classifier 71.8%, SVM 70.56%, regression tree model 35.5%.

Tech Stack: Scikit-learn, numpy, pandas, matplotlib, C5.0 (CART)
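
The precision, recall, and accuracy figures used in the evaluation step all come from the four confusion-matrix counts. A minimal sketch in plain Python, assuming binary 0/1 labels; `binary_metrics` is a hypothetical helper name and the labels are made up for illustration, not taken from the Diabetes data.

```python
def binary_metrics(y_true, y_pred):
    """Return confusion-matrix counts and derived metrics for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        # Guard against division by zero when a class is never predicted.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "accuracy": (tp + tn) / len(pairs),
    }


# Made-up labels purely for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```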

K-means clustering of Diabetes dataset

1) Ran K-means clustering with different numbers of clusters.
2) Used the elbow method to determine the optimal number of clusters, which came out to be 10.
3) Ran hierarchical clustering with single, complete, average, and Ward linkage.
4) Compared the linkages using the cophenetic correlation coefficient; average linkage was best, at 0.8653.

Tech Stack: Scikit-learn, numpy, pandas, matplotlib, scipy

Final Project (Application in the prediction of minority class in unbalanced data)

1) Performed descriptive and exploratory analysis on the given data.
2) Preprocessed the data using techniques such as one-hot encoding.
3) Applied an upsampling technique (SMOTE) because the data was unbalanced, with a minority class.
4) Tested various algorithms, such as logistic regression, XGBoost, and SVM, using grid search to optimize their parameters.
5) Evaluated the models using the balanced error rate, which is the average of the error rates on each class.
6) The random forest model gave the best results, with an accuracy of 80%.

Tech Stack: SMOTE, pandas, scikit-learn, numpy, XGBoost, Logistic regression, matplotlib
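
The balanced error rate used for evaluation averages the error rate of each class, so a model cannot score well just by predicting the majority class. A minimal sketch in plain Python; `balanced_error_rate` is a hypothetical helper name and the labels are made up for illustration.

```python
from collections import Counter


def balanced_error_rate(y_true, y_pred):
    """Average of the per-class error rates (misclassified / total per class)."""
    totals = Counter(y_true)
    errors = Counter(t for t, p in zip(y_true, y_pred) if t != p)
    return sum(errors[c] / totals[c] for c in totals) / len(totals)


# Made-up unbalanced labels: 8 majority (0) and 2 minority (1) samples.
y_true = [0] * 8 + [1] * 2
# Every majority sample is right; one of the two minority samples is wrong.
y_pred = [0] * 8 + [0, 1]

ber = balanced_error_rate(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.2f}, balanced error rate={ber:.2f}")
```

Here plain accuracy looks high while the balanced error rate exposes that half of the minority class is misclassified.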

Time Series Analysis

Time series analysis for multiple case studies.

1) Observe the time plot to infer whether a transformation is needed, for example when the variance is increasing.
2) Determine from the time plot whether seasonal or non-seasonal differencing is required.
3) Examine the ACF and PACF to determine the orders of the AR, MA, SAR, and SMA terms, and try out different models.
4) Conduct residual analysis and calculate Ljung-Box Q-statistics to make sure there is no autocorrelation in the residuals.
5) Choose the best model on the basis of metrics such as the Akaike Information Criterion and the sum of squared errors.
6) Forecast using the best model.
An alternative approach is exponential smoothing (applied to the London rainfall dataset).

The various data sets analyzed were:

  1. [LINK] Time-series analysis of Temperature and Electricity consumption (provided by the instructor).
  2. [LINK] Recruitment data (from astsa package).
  3. [LINK] Johnson & Johnson quarterly earnings per share (from astsa package).
  4. [LINK] Daily female births in California (Time series data library).
  5. [LINK] BJsales Data set (Time series data library).
  6. [LINK] Milk Production Data set (Time series data library).
  7. [LINK] Sales data at a souvenir shop in Australia (Time series data library).
  8. [LINK] London Rainfall Data set.

Tech Stack: R programming
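
The Ljung-Box check in the residual-analysis step can be written out by hand to see what the statistic measures. The project itself used R; the following is only a plain-Python sketch of the standard formula Q = n(n+2) Σ r_k² / (n−k), summed over lags k = 1..h, with hypothetical function names and a made-up residual series. A large Q relative to the chi-square reference distribution suggests the residuals are still autocorrelated.

```python
def autocorrelation(x, lag):
    """Sample autocorrelation of series x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    num = sum((x[t] - mean) * (x[t - lag] - mean) for t in range(lag, n))
    return num / denom


def ljung_box_q(residuals, max_lag):
    """Ljung-Box Q statistic over lags 1..max_lag."""
    n = len(residuals)
    return n * (n + 2) * sum(
        autocorrelation(residuals, k) ** 2 / (n - k)
        for k in range(1, max_lag + 1)
    )


# Made-up residuals that alternate in sign, i.e. strongly autocorrelated.
residuals = [1, -1, 1, -1, 1, -1, 1, -1]
q_stat = ljung_box_q(residuals, max_lag=1)
print(q_stat)
```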

Decision Support System

Decision Support System for Financial Institution

1) Designed a database by building an Enhanced E-R diagram.
2) Transformed the E-R diagram into a relational database by defining relational integrity constraints, followed by normalization.
3) Connected the MySQL database with Excel using VBA.NET.
4) Created a GUI for end users by designing forms and coding the buttons.
5) This resulted in faster computation, improved decision quality, and increased organizational control.
6) Reduced the time required to retrieve data for financial reports.

Credit Risk Analytics

Data collected by a credit card risk modeling department, recorded for every single day of an experiment conducted over 3 years ago, is analyzed to build three models, each supporting a different business decision. First, a binary classification model is built to predict future defaults, chosen to maximize the area under the ROC curve, after which a cost-minimizing threshold is calculated from the given costs of false negatives and false positives. This model is then compared with a model built on scores supplied by a credit card scoring company to decide whether buying the scores is worthwhile. Finally, a model is built on profitability instead of default, since some customers who default are still profitable. That model is used to forecast the profitability of each applicant before deciding whether to issue a credit card.
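
The cost-minimizing threshold can be found by scanning candidate cut-offs and totalling the misclassification cost at each one. A minimal sketch in plain Python, assuming a higher score means a higher predicted default risk; the function names, scores, labels, and cost values are all hypothetical, not the project's data.

```python
def total_cost(threshold, scores, labels, cost_fn, cost_fp):
    """Misclassification cost when predicting default for score >= threshold."""
    cost = 0.0
    for score, label in zip(scores, labels):
        predicted_default = score >= threshold
        if label == 1 and not predicted_default:
            cost += cost_fn  # missed default (false negative)
        elif label == 0 and predicted_default:
            cost += cost_fp  # good customer flagged (false positive)
    return cost


def best_threshold(scores, labels, cost_fn, cost_fp):
    """Try every observed score (plus 'flag nobody') and keep the cheapest."""
    candidates = sorted(set(scores)) + [float("inf")]
    return min(
        candidates,
        key=lambda t: total_cost(t, scores, labels, cost_fn, cost_fp),
    )


# Made-up scores and outcomes (1 = default); a false negative is assumed
# to cost five times as much as a false positive.
scores = [0.1, 0.3, 0.4, 0.6, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 1]
threshold = best_threshold(scores, labels, cost_fn=5.0, cost_fp=1.0)
print(threshold, total_cost(threshold, scores, labels, 5.0, 1.0))
```

Because false negatives are priced higher here, the chosen cut-off sits low enough to catch every default at the expense of one false positive.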


Cat or Not?

Cat Classifier using Logistic Regression

The models used to classify the images were mainly logistic regression and a 4-layer deep neural network. The main purpose of this project was to understand the math underlying the algorithms. Helper functions were written for tasks such as image preprocessing, vectorization, forward propagation, back propagation, gradient descent, and cost computation. Logistic regression reached 99.5% accuracy on the training set but only 68% on the test set, indicating overfitting to the training data, whereas the deep neural network reached 98.6% on the training data and 80% on the test data, indicating a much better model for this classification.

Tech Stack: Python, Jupyter notebook, GitHub, numpy, Pandas, matplotlib, scipy.
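
The forward propagation, back propagation, and gradient descent steps for logistic regression fit in a few lines. The project used numpy on image data; the following is only a plain-Python sketch on a made-up one-feature dataset, with hypothetical function names and hyperparameters.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(X, y, learning_rate=0.5, iterations=500):
    """Batch gradient descent on the cross-entropy cost."""
    m, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(iterations):
        # Forward propagation: predicted probabilities.
        a = [sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) for x in X]
        # Back propagation: gradients of the cost w.r.t. w and b.
        dz = [ai - yi for ai, yi in zip(a, y)]
        dw = [sum(dz[i] * X[i][j] for i in range(m)) / m
              for j in range(n_features)]
        db = sum(dz) / m
        # Gradient descent update.
        w = [wj - learning_rate * dwj for wj, dwj in zip(w, dw)]
        b -= learning_rate * db
    return w, b


def predict(X, w, b):
    return [1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5
            else 0 for x in X]


# Made-up, centered one-feature data: negative values are class 0.
X = [[-1.5], [-0.5], [0.5], [1.5]]
y = [0, 0, 1, 1]
w, b = train_logistic_regression(X, y)
predictions = predict(X, w, b)
print(predictions)
```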

Sign Language Recognition

Recognition of numbers in sign languages for people with speech impairment.

This multiclass classification problem is solved by building a deep neural network in TensorFlow to recognize the hand signs. The training data is preprocessed and the labels are converted into one-hot matrices, placeholders are created, and the model is built using various helper functions. Mini-batch gradient descent is applied with the Adam optimizer. Training accuracy came out to 99.9%, whereas test accuracy was 72%, which is decent given the small size of the training set.

Tech Stack: TensorFlow
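
Two of the preprocessing steps, one-hot encoding the labels and splitting the data into shuffled mini-batches, can be sketched without TensorFlow. This is a plain-Python illustration with hypothetical helper names; the number of classes (6) and the labels are assumptions made for the example.

```python
import random


def one_hot(labels, num_classes):
    """Convert integer labels into one-hot rows."""
    return [[1 if c == label else 0 for c in range(num_classes)]
            for label in labels]


def mini_batches(features, targets, batch_size, seed=0):
    """Yield shuffled (features, targets) batches; the last may be smaller."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        chunk = indices[start:start + batch_size]
        yield [features[i] for i in chunk], [targets[i] for i in chunk]


# Made-up labels; the integers 0..4 stand in for image feature vectors.
labels = [0, 2, 1, 5, 3]
encoded = one_hot(labels, num_classes=6)
batches = list(mini_batches(list(range(5)), encoded, batch_size=2))
print(encoded[1], len(batches))
```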

Classification using NLP

Classification of Genetic variations with Natural Language Processing of Research Papers – Kaggle.

• Preprocessed the data using one-hot encoding and response encoding; cleaned the text data using regular expressions and the NLTK library.
• Performed class balancing and feature engineering, then applied algorithms such as Naïve Bayes, KNN, Logistic Regression, SVM, and RF.
• Evaluated the models on the log loss metric; LR with upsampled data and one-hot encoding gave the best log loss of 1.06.
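
The log loss metric penalizes confident wrong predictions heavily, which is why it suits probabilistic multiclass models. A minimal sketch in plain Python; `log_loss` here is a hand-written illustration and the three-class example is made up, not the competition data.

```python
import math


def log_loss(y_true, probabilities, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    total = 0.0
    for label, probs in zip(y_true, probabilities):
        # Clip to avoid log(0) when a model is fully confident and wrong.
        p = min(max(probs[label], eps), 1 - eps)
        total -= math.log(p)
    return total / len(y_true)


# A model that always predicts a uniform distribution over 3 classes
# scores log(3); a useful model should do better than that baseline.
uniform = [[1 / 3, 1 / 3, 1 / 3]] * 3
baseline = log_loss([0, 1, 2], uniform)
confident = log_loss([0, 1], [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]])
print(baseline, confident)
```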

Production & Manufacturing

Modelling and analysis of manufacturing plant

Modelling and analysis of manufacturing plant for Avanti Manufacturing Consultants

1) Carried out a detailed analysis of the approaches, given the resources WFMC currently possesses and the demand forecasts for 2017.
2) Scheduling is done using Johnson’s algorithm; objectives such as maximizing average machine utilization, minimizing the flow of parts between groups (to reduce material handling), and minimizing setup time are also considered.
3) The facility layout is designed using group technology to achieve feasible production rates while keeping throughput times within limits and minimizing queue length.
4) The design ensures that the production rate at least meets the demand for each part.
5) Queueing analysis is then conducted to analyze the performance of the system, followed by cost analysis and facility layout.
6) Proposed a model that drastically reduces the number of machines required, cutting annual cost by 30%.
7) Suggested a mitigation technique using a conveyor belt that would save $1.5 million annually in labor costs.
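
Johnson's algorithm, mentioned in the scheduling step, is easiest to see in its classic two-machine flow-shop form. The sketch below assumes that setting; the job names and processing times are made up, and the project's actual plant model may have differed.

```python
def johnsons_rule(jobs):
    """Order jobs for a two-machine flow shop to minimize makespan.

    jobs maps a job name to (time_on_machine_1, time_on_machine_2).
    """
    # Jobs that are faster on machine 1 go first, shortest machine-1 time first.
    first = sorted((j for j in jobs if jobs[j][0] <= jobs[j][1]),
                   key=lambda j: jobs[j][0])
    # The rest go last, longest machine-2 time first.
    last = sorted((j for j in jobs if jobs[j][0] > jobs[j][1]),
                  key=lambda j: jobs[j][1], reverse=True)
    return first + last


def makespan(sequence, jobs):
    """Completion time of the last job on machine 2."""
    end_m1 = end_m2 = 0
    for j in sequence:
        end_m1 += jobs[j][0]
        # Machine 2 starts a job once both it and machine 1 are done.
        end_m2 = max(end_m1, end_m2) + jobs[j][1]
    return end_m2


# Hypothetical jobs with made-up processing times.
jobs = {"A": (3, 6), "B": (5, 2), "C": (1, 2), "D": (7, 5)}
sequence = johnsons_rule(jobs)
print(sequence, makespan(sequence, jobs))
```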


Scheduling & Delivery System Optimization

Scheduling & Delivery system optimization for Centers for Disease Control and Prevention using Drones.

• Built mathematical models for drone scheduling and vehicle routing that minimize transportation costs, using subtour elimination constraints.
• Solved the resulting mixed-integer problems using the PuLP package and CPLEX, followed by dynamic programming to obtain an upper bound.
• Reduced scheduling costs by 33% and transportation costs by 16%; greedy heuristics were used to verify the results.
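
A greedy heuristic gives a quick feasible route whose cost can be compared against the solver's answer as a sanity check. The source does not say which heuristic was used; the sketch below assumes a nearest-neighbour rule on made-up coordinates, with hypothetical function names.

```python
import math


def nearest_neighbour_route(points, start=0):
    """Greedy tour: always fly to the closest unvisited point, then return."""
    unvisited = set(range(len(points))) - {start}
    route = [start]
    while unvisited:
        current = points[route[-1]]
        nxt = min(unvisited, key=lambda i: math.dist(current, points[i]))
        route.append(nxt)
        unvisited.remove(nxt)
    route.append(start)  # close the tour at the depot
    return route


def route_length(route, points):
    return sum(math.dist(points[a], points[b])
               for a, b in zip(route, route[1:]))


# Made-up depot (index 0) and three delivery locations.
points = [(0, 0), (0, 3), (4, 3), (4, 0)]
route = nearest_neighbour_route(points)
print(route, route_length(route, points))
```

Any feasible tour costs at least as much as the optimal one, so an exact solution that comes out worse than the greedy tour would point to a modelling error.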

Wireless Signal Strength Optimization

Signal Strength optimization of Wireless Internet Experience

1) Designed and performed an experiment to obtain the best internet experience from a router subjected to varying operating conditions.
2) Analyzed the interaction terms to see whether they have significant effects on signal strength.
3) Ran the appropriate hypothesis tests and identified the best Wi-Fi configuration, resulting in internet speeds of up to 30 Mbps.

Tech Stack: JMP, SPSS, Minitab
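
To show what an interaction term measures, the sketch below estimates main effects and the interaction for a two-factor, two-level design. The project used JMP, SPSS, and Minitab; this plain-Python version, its two unnamed factors, and its response values are all assumptions for illustration, and it estimates effects only, without the significance tests.

```python
def factorial_effects(y):
    """Effects for a 2x2 design.

    y maps coded levels (a, b), each -1 or +1, to the mean response.
    Each effect is the average response at the high level minus the
    average at the low level.
    """
    main_a = ((y[(1, -1)] + y[(1, 1)]) - (y[(-1, -1)] + y[(-1, 1)])) / 2
    main_b = ((y[(-1, 1)] + y[(1, 1)]) - (y[(-1, -1)] + y[(1, -1)])) / 2
    # Interaction: does the effect of A change with the level of B?
    interaction = ((y[(-1, -1)] + y[(1, 1)]) - (y[(1, -1)] + y[(-1, 1)])) / 2
    return main_a, main_b, interaction


# Made-up mean responses for the four factor combinations.
responses = {(-1, -1): 10.0, (1, -1): 14.0, (-1, 1): 12.0, (1, 1): 24.0}
main_a, main_b, interaction = factorial_effects(responses)
print(main_a, main_b, interaction)
```

A non-zero interaction estimate means the two factors do not act independently, which is why the best configuration cannot be chosen one factor at a time.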