Aakanksha Saha


Aakanksha Saha

I joined TU Wien as a University Researcher in 2021. My research interests include the following areas:

  • Malware Analysis
  • Reverse Engineering
  • Phishing
  • Secrets in Source Code

Currently, I am researching the applicability of robust machine learning models on sophisticated malware attacks.

Prior to this I was working as a Security Software Engineer at Microsoft, Redmond USA. I graduated with a Masters in Computer Science from University of Utah, USA in 2019. You can find my Masters thesis at Secrets in Source Code

  • PreDoc Researcher
Publications (at TU Wien)
    • Secrets in Source Code: Reducing False Positives using Machine Learning
      Saha, A., Denning, T., Srikumar, V., & Kasera, S. K. (2020). Secrets in Source Code: Reducing False Positives using Machine Learning. In 2020 International Conference on COMmunication Systems & NETworkS (COMSNETS). IEEE Xplore Digital Library.
      DOI: 10.1109/comsnets48256.2020.9027350 Metadata ⯈Fulltext (preprint)
      Abstract: Private and public git repositories often contain unintentional sensitive information in the source code. Many tools have been developed to scan repositories looking for potential secrets and credentials committed in the code base, inadvertently or intentionally, for taking corrective action once these secrets and credentials are found. However, most of these existing works either target a specific type of secret or generate a large number of false positives. Our research aims to create a generalized framework to detect all kinds of secrets - which includes API keys, asymmetric private keys, client secrets, generic passwords - using an extensive regular expression list. We then apply machine learning models to intelligently distinguish between a real secret from a false positive. The combination of regular expression based approach and machine learning allows for the identification of different types of secrets, specifically generic passwords which are ignored by existing works, and subsequent reduction of possible false positives. We also evaluate our machine learning model using a precision-recall curve that can be used by an operator to find the optimal trade-off between the number of false positives and false negatives depending on their specific application. Using a Voting Classifier (combination of Logistic Regression, Naïve Bayes and SVM) we are able to reduce the number of false positives considerably.