Welcome

I'm Rohan Ben Joseph

Computational Linguist and Data Scientist
About




Me with my best friend, Artemis

Ben Joseph

Computational Linguist and Data Scientist

Interested in finding solutions to real-world language problems through technology. Exploring the potential of Data Science, ML, AI and NLP, including Training ML Models, Deploying and Evaluating Large Language Models, Data Manipulation, Better Data Imputation and Discourse and Sentiment Analysis.

  • Location: Burnaby, BC, Canada
  • Education: BSc. Joint Major - Computer Science and Linguistics
  • Research Interests: Computational Linguistics, NLP, Social Data Analytics, Data Science, LLMs
  • Experience: Currently a Language Data Scientist @Meta. Previously 2+ years of experience, including 1 year as a Defence Data Scientist at the DND, Canada
  • Research: 1 publication in PLOS ONE, 7 Conference Papers, 3 local Journal Publications
  • My Top 5 Book Recommendations (Unsolicited, but offered freely):
            ◙ Breathe: A Life in Flow (Rickson Gracie),
            ◙ Grit (Angela Duckworth),
            ◙ The Hitchhiker's Guide to the Galaxy (Douglas Adams),
            ◙ It Was on Fire When I Lay Down on It (Robert Fulghum),
            ◙ The Little Prince (Antoine de Saint-Exupéry)

If you have any research projects in the area of NLP or AI that you'd like to talk about, please email me at josephrohanben@meta.com! If you work at Meta and would like to contact me, please find me on the Intern and shoot me a message. For Meta related Press Queries, please redirect your questions to Press@



Skills and Experience at a Glance

TensorFlow

PyTorch

NumPy

pandas

Matplotlib

SciPy

scikit-learn

Node.js

Python

Tableau

Power BI

Elasticsearch

Git

Agile

Linguistics

Current Project(s)

  • Ragebait Detection and Classification of Sentiment Manipulation through Social Media

    Areas: Retrofitting, Classification, Vector Modeling, Multimodal Data, Sentiment Analysis, GloVe, BERT

    This project aims to develop a system to detect “ragebait”: emotionally provocative content on social media designed to incite anger and drive engagement. Using Google Cloud’s Video Intelligence API, we will analyze short-form videos (à la TikTok / Instagram Reels / YouTube Shorts) to generate detailed descriptions and transcripts. These will then be processed through a language model for summarization and for generating vector relations. We can then take the centroid of the "ragebait" vector embeddings and use the vector relations to augment a localised version of GloVe (Global Vectors for Word Representation). By applying retrofitting techniques, we will add the context of ragebait to a generic dataset. This will help a pretrained BERT classifier label videos as ragebait or non-ragebait: we take the centroid of a video's vector embeddings and check whether it lies closer to the centroid of the ragebait dataset or to that of the non-ragebait dataset. This work is crucial for understanding and mitigating the spread of divisive content online, helping platforms create safer, more constructive spaces for users.

            



  • Conference Paper on Clustering algorithms and NLP

    Invited to present a paper at ICPRAM 2025, Portugal (paper to be written)
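    For the ragebait project described above, the final centroid comparison can be sketched roughly as follows. This is only an illustration under stated assumptions: the three-dimensional vectors are made-up toy values rather than real GloVe or BERT embeddings, cosine similarity is assumed as the closeness measure (the description does not name one), and the function names are hypothetical.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def classify(video_vectors, ragebait_centroid, other_centroid):
    """Label a video by whichever class centroid its own centroid is closer to."""
    c = centroid(video_vectors)
    if cosine(c, ragebait_centroid) > cosine(c, other_centroid):
        return "ragebait"
    return "not ragebait"

# Toy 3-d "embeddings" (assumed values, for illustration only).
ragebait_examples = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
other_examples = [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3]]
rb = centroid(ragebait_examples)
ot = centroid(other_examples)

label = classify([[0.7, 0.3, 0.0], [0.9, 0.0, 0.1]], rb, ot)
print(label)
```

    In practice the class centroids would be computed once from the labelled datasets, and each new video's description and transcript embeddings would be compared against them.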

            


Selected Past Projects

  • Information Retrieval Chatbot on Military Policies and Standards

    Area: Natural Language Processing (NLP), Retrieval Augmented Generation (RAG), Information Retrieval, Source of Truth

    Gunasekara, C., Sharafeldin, A., Triff, M., Kabir, Z. and Ben Joseph, R. Information Retrieval Chatbot on Military Policies and Standards. DOI: 10.5220/0012351200003654. Paper copyright by His Majesty the King in Right of Canada as represented by the Minister of National Defence. In Proceedings of the 13th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2024), pages 714-722. ISBN: 978-989-758-684-2; ISSN: 2184-4313. Proceedings Copyright © 2024 by SCITEPRESS – Science and Technology Publications, Lda

  • Military Badge Detection and Classification Algorithm for Automatic Processing of Documents.

    Area: Computer Vision, Machine Learning, OCR, YOLOv5

    Gunasekara C., Matharu Y. and Ben Joseph R. (2024). Military Badge Detection and Classification Algorithm for Automatic Processing of Documents. In Proceedings of the 13th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM; ISBN 978-989-758-684-2, SciTePress, pages 723-731. DOI: 10.5220/0012351300003654

  • Comparing virtual reality, desktop-based 3D, and 2D versions of a category learning experiment.

    Area: Cognitive Science, Education, Category Learning, Virtual Reality

    Barrett, R. C. A., Poe, R., O’Camb, J. W., Woodruff, C., Harrison, S. M., Dolguikh, K., Chuong, C., Klassen, A.D., Zhang, R., Joseph, R.B. & Blair, M. R. (2022). Comparing virtual reality, desktop-based 3D, and 2D versions of a category learning experiment. Plos one, 17(10), e0275119.

  • Archive of the Digital Present (ADP), COVID-19 Period

    Area: Digital Humanities, Data Visualization, Information Preservation, Archival Project

    Camlot, J., Neugebauer, T., Berrizbeitia, F., Ben, J., Bustamante, A., & Gandham, S. (2022). Archive of the Digital Present (ADP), COVID-19 Period: Collecting and Visualizing Metadata of Online Literary Events Hosted in Canada, March 2020-September 2021.

  • Variety of conferences at UBC, SFU, Cornell, ...; in Rome, Tokyo, Porto, Vancouver, Montreal etc.

    Area: NLP, Cognitive Science, Linguistics, Computer Science, Sentiment Analysis

    For a more complete list, please visit Google Scholar or the Research Section of this page

Current and Past Affiliations
Experience

Education

Work Experience

  • February 2025 - Present
    Language Data Scientist, Contingent Worker (CWx)
    AI Studio Capabilities, Meta, Vancouver (VAN0200)

    ➡ Designing evaluation metrics, and testing, fine-tuning and annotating prompts for Meta’s Generative AI models.

    ➡ Performing User Centric Design demos with the Eng and Prod teams, then optimizing a solution based on prompt creation and engineering methods.

    About the AI Studio Team (External Description): Our team is a product-focused ML team in the GenAI org. We are responsible for making Meta the world’s preferred destination for building and engaging with AI characters / personas. You can see our recent work here: AI Studio, and in the news: TechCrunch, TechCrunch. As Mark shared in Q2 2024 earnings, “we launched AI Studio, which lets anyone create AIs to interact with across our apps”. Also check out Mark’s interview with Jensen. Finally, check out the latest demo from Mark in the Meta Connect Keynote, around 18:00, where Mark is having a video call with the AI of an Instagram influencer.

    AI Studio features both UGC (User-Generated Characters) and IGC (Instagram Creator Agents). The former allows the billions of Meta users to build an AI character by themselves. The latter taps into the vast network of influencers and enables fans to interact with their AIs via text, voice, and video. You will contribute to both products by developing ML solutions. The ideal candidate has experience in LLM, RAG, NLP, and/or Computer Vision.

  • September 2023 - August 2024
    Defence Data Scientist - Co-op
    Digital Transformation Office (DTO) / Data, Innovation, Analytics (DIA), Department of National Defence (DND), Ottawa

    ➡ Co-authored papers highlighting contributions to advancing AI applications in national defense.

    ➡ Contributed to the DND's Artificial Intelligence Strategy (available on the GoC / DND website), which discusses the chatbot presented at ICPRAM 2024.

    ➡ Designed and deployed an information retrieval chatbot focusing on military policies and standards, developing an application using Meta’s LLAMA2 Large Language Model (LLM).

    ➡ Successfully implemented YOLO (You Only Look Once) object detection algorithms, leading to more accurate and efficient object detection in real-world scenarios, which had a direct impact on mission-critical tasks.

    ➡ Led the implementation of advanced Named Entity Recognition (NER) techniques to improve the accuracy of extracting relevant entities from unstructured military texts.

    ➡ Conducted comprehensive Large Language Model (LLM) evaluations, employing the HellaSwag dataset and BIG-bench to test the models' communication capabilities.

    ➡ Performed benchmarking (ROUGE) to assess the quality of our chatbot’s generated texts.

    ➡ Contributed to the design and implementation of the DND's in-house LLM, optimizing the performance of machine learning models, ensuring efficient memory management, and enabling real-time data processing.

       ‣ Assistant Deputy Minister (Data, Innovation, Analytics), Department of National Defence

  • August 2021 - March 2023
    Research Assistant - Web Developer & Maintenance
    Spokenweb Project, Concordia University, Montreal

    ➡ Working with a diverse cohort of academics, librarians, web developers, designers, artists and researchers, developing interfaces for the meaningful presentation of metadata cataloguing large volumes of archival data.

    ➡ Working with Angular, NodeJS, MongoDB, Elasticsearch and Strapi (JAM Stack and MEAN Stack) for the Archive of the Digital Present Project website.

       ‣ Archive of the Digital Present

  • January 2022 - April 2022
    Web developer for techno-economic analysis (TEA) tools
    Energy, Mining and Environment, National Research Council, Government of Canada

    ➡ Working with the Energy, Mining and Environment subdivision of the National Research Council of Canada

    ➡ Developing Drupal-based front-end tools for Technology Development Matrix (TDM) and Life Cycle Analysis tool integration and data visualization for Carbon Capture Utilization and Storage (CCUS) to assess best available technologies and sustainable energy alternatives including nuclear energy

    ➡ Assisting the project team with tool development, data management (importation scripts in VBA) and report editing

    ➡ Supporting the integration of Techno-Economic Analysis tools and OpenLCA with a Drupal-based web site

       ‣ National Research Council of Canada (NRC/ CNRC)

  • September 2021 - January 2022
    Hul’q’umi’num’ Language Preservation
    Hul’q’umi’num’ Language and Culture Society, Vancouver

    ➡ Developing Python scripts for data management of the field research notes (24 terabytes of data) for the Coast Salish language preservation and documentation effort being conducted by the Hul’q’umi’num’ Language and Culture Society

       ‣ https://sqwal.hwulmuhwqun.ca

  • September 2019 - August 2021
    Research Assistant - Metadata Task Force
    Special Collections Dept, WAC Bennett Library, SFU

    ➡ Worked with poetry at the Digital Humanities Lab at the WAC Bennett Library, using SWALLOW and Islandora Metadata Ingestion systems to digitally access and preserve fragile literary matter from University archives.

    ➡ Wrote Python and Bash scripts to help automate the editing and search processes for large repositories in the library Catalogue.

       ‣ https://spokenweb.ca/

  • January 2019 - March 2020
    STEM After-School and Weekend Instructor
    Science Al!ve, Simon Fraser University

    ➡ Designed curriculum and conducted Programs as After School and Weekend Instructor as part of SFU's outreach program introducing K-12 children to programming, computer logic and STEM subjects.

    ➡ Taught programming and helped elementary students design their own projects using Scratch, Python, Flask, JavaScript, HTML etc. at beginner levels.

    ➡ Introduced robotics and computer hardware through Microbits, Edison Bots, Spheros, PEPPER and Nao Robots.

       ‣ https://sciencealive.ca/

Research

Research Publications (Chronological)

Update on work with the Military Policies and Standards Chatbot.
Navigating through the extensive and bilingual military policies of the Canadian Armed Forces (CAF) can be a complex task. To address the need for streamlined access to these documents, this paper explores the implementation of containerized artificial intelligence (AI) models to develop a bilingual knowledge management system. This system is designed to enhance information retrieval from military policies by leveraging AI, metadata tagging, text embeddings, and vector databases. Our approach involves creating a comprehensive data processing pipeline to store and retrieve military policies and documents. We utilize a question-answering pipeline that incorporates language detection, semantic search, and large language models (LLM) to process queries in both English and French. Preliminary evaluations demonstrate an accuracy rate of 87.78% for English and 73.33% for French queries. This paper builds upon our previous work on semantic search for military policies by integrating a generative AI component, thus extending beyond traditional retrieval-based methods to provide more precise and contextually relevant responses.

In the Canadian Armed Forces (CAF), navigating through extensive policies and standards can be a challenging task. To address the need for streamlined access to these vital documents, this paper explores the usage of artificial intelligence (AI) and natural language processing (NLP) to create a question-answering chatbot. This chatbot is specifically tailored to pinpoint and retrieve specific passages from policy documents in response to user queries. Our approach involved first developing a comprehensive and systematic data collection technique for parsing the multi-formatted policy and standard documents. Following this, we implemented an advanced NLP-based information retrieval system to provide the most relevant answers to users’ questions. Preliminary user evaluations showcased a promising accuracy rate of 88.46%. Even though this chatbot is designed to operate on military policy documents, it can be extended for similar use cases to automate information retrieval from long documents.

Scitepress
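The passage-retrieval idea behind the chatbot abstracts above can be illustrated with a small sketch. The papers do not specify their retrieval method here, so this uses plain TF-IDF with cosine similarity as a stand-in; the passages are invented examples, not real CAF policy text, and all function names are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into alphanumeric tokens (no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(passages):
    """Return (idf, vectors): smoothed IDF weights and one TF-IDF dict per passage."""
    docs = [Counter(tokenize(p)) for p in passages]
    n = len(docs)
    df = Counter(term for d in docs for term in d)  # document frequency
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in d.items()} for d in docs]
    return idf, vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, passages, idf, vectors):
    """Return the best-matching passage and its similarity score."""
    q = Counter(tokenize(query))
    qvec = {t: tf * idf[t] for t, tf in q.items() if t in idf}
    scores = [cosine(qvec, v) for v in vectors]
    best = max(range(len(passages)), key=scores.__getitem__)
    return passages[best], scores[best]

# Invented passages for illustration only.
passages = [
    "Leave requests must be submitted to the unit commander in writing.",
    "Dress uniforms are worn at formal parades and ceremonies.",
    "Travel claims are reimbursed within thirty days of submission.",
]
idf, vectors = build_index(passages)
answer, score = retrieve("How do I submit a leave request?", passages, idf, vectors)
print(answer)
```

Because this sketch matches exact tokens only, "request" does not match "requests"; an embedding-based semantic search of the kind the abstracts mention would handle such variation.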

This paper outlines a robust approach to automating the detection of military badges on official government documents using the YOLOv5 computer vision model. In an era where the rapid classification and management of sensitive documents is paramount, developing a system capable of accurately identifying and classifying distinct badge types plays a crucial role in supporting data management and security protocols. To address the challenges posed by the lack of accessible, real-world government and military documents for research, we introduced a novel method to simulate training data. We employ a technique that automates the data labelling process, facilitating the generation of a comprehensive and versatile dataset while eliminating the risk of compromising sensitive information. Through careful model training and hyper-parameter tuning, the YOLOv5 model demonstrated exemplary performance, successfully detecting a wide spectrum of badge types across various documents.

Scitepress

Our study suggests that individuals using more linguistic shibboleths are more likely to have a stronger bond with this online community and develop a tendency to display characteristics of mental illness themselves. We believe that our study has several far-reaching consequences, including a new psycholinguistic lens through which to examine propaganda and an explanation of echo chambers, parasocial relationships, cults of personality, and other socio-psychological phenomena

Abstract

Barrett, R.C.A., Poe, R., O’Camb, J.W., Woodruff, C., Harrison, S.M., Dolguikh, K., Chuong, C., Klassen, A.D., Zhang, R., Joseph, R.B. and Blair, M.R., 2022. Comparing virtual reality, desktop-based 3D, and 2D versions of a category learning experiment. Plos one, 17(10), p.e0275119.

Starting work into testing the viability of immersive virtual reality in conducting cognitive science research. Using the category learning paradigm, we are exploring how people change their behaviour over time as they learn to categorize different stimuli into their appropriate groups.

Journal Article

The ADP database and directory is designed for reflection, study and analysis of online literary events in Canada during the COVID-19 pandemic. This paper presents our approaches to data collection and structuring, stack development, data visualization, and front-end design, pursued through community-oriented processes of design, research and development.

Panel Talk

In 2019, SpokenWeb SFU Project Manager Cole Mash (SFU) and SpokenWeb Systems Task Force member Tomasz Neugebauer (Concordia) began work on editing SWALLOW entries. SWALLOW is an open-source metadata ingestion system developed by the SpokenWeb team to describe and manage the project’s object of study: literary audio. Since the implementation of SWALLOW in 2018, SpokenWeb team members have ingested over 4,700 entries into the system.

Guide Doc

Our aim is to facilitate investigation of potential variants by presenting a computational corpus approach towards measuring one of many ontological parameters: linguistic distinctiveness. Using the HKCSE and the BNC, we perform part-of-speech tagging, dependency parsing, and word order assessment to identify morpho-syntactic features.

Presenting

The objective of this study is to develop a model that can take in the content of a microblog (such as a tweet on Twitter) that contains opinions (or sentiments), and determine the prevalent emotions behind the tweet as a function of uncertainty about the membership functions of fuzzy sets associated with linguistic terms. This information can then be used for sentiment analysis.

Poster

An exploration of my process developing a bot to traverse through the metadata ingestion system in order to facilitate editing large volumes of data, usage of NLP word prediction techniques to fill in gaps in areas with missing data and supervised learning to automate the data entry process. Besides this, I also include considerations of more experimental applications such as lip reading.

Recording

An exploration into illustrating linguistic trends in the age of misinformation with respect to targeted advertisements creating echo chambers, manufactured consent, subjective language, online disinhibition, and how news organizations create newsworthiness.

Recording

In this review, I look at the history of these pieces of code, from Socrates’ Daemon to the Child Machine, and examine the applications these bots could currently have in Computing Science in general and as an answer to the Turing Test in particular.

Journal Article

Minor Contributions

We use the real-time strategy game StarCraft 2 in our lab as it is a useful domain in which to study learning and attention. I was involved with a project that used StarCraft 2 gameplay chat-log datasets and NLP techniques such as frequency mapping and NER to attempt to classify positive or negative expressions of emotion during gameplay.

ExNovo is aimed at designing and testing a new computer interface that is grounded in what we know about human cognition. I played a minor role in correcting C# code for the Ex-Novo game, testing equipment before experiments and re-formatting data

Project Page

Minor contribution to data management of field notes etc. for the Hul’q’umi’num’ Language & Culture Society (HLCS). HLCS is a not-for-profit organization whose mandate is to ensure the survival of traditional knowledge and values. Our work proceeds under the guidance of a board of Elders, teachers, and researchers, all speakers of the Hul’q’umi’num’ language.

Web Page

Minor contributions to a variety of projects involving the study of clefts, information structure and intonation

Lab Page

The Gender Gap Tracker is a collaboration between Informed Opinions, a non-profit dedicated to amplifying women’s voices in media and Simon Fraser University, through the Discourse Processing Lab and the Big Data Initiative. This research dashboard showcases results from our study on gender bias in the media. We present the Gender Gap Tracker (GGT), an automated system that measures men and women’s voices on seven major Canadian news outlets in real time. We analyze the rich information in news articles using Natural Language Processing (NLP) and quantify the discrepancy in proportions of men and women quoted.

Project Page

The SFU Opinion and Comments Corpus (SOCC) has been collected with attention to preserving reply structures and other metadata. In addition to the raw corpus, we also present annotations for four different phenomena: constructiveness, toxicity, negation and its scope, and appraisal.

Github

Projects

Misc. Efforts Beyond Academia

Information Disorders in Times of Crisis

A system analysis of Information Ecosystems through the lens of Linguistics, Psychology and Technology


Critical Infrastructure Protection

Coursework Project for CMPT 318 (Cybersecurity) to train a Hidden Markov Model (HMM) to identify anomalies in the control signals of a power plant given past performance data.
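One common way to use an HMM for this kind of anomaly detection is to score a window of observations by its log-likelihood under the trained model and flag windows that score unusually low. The sketch below is not the coursework code: it assumes a discrete HMM whose parameters are already known, and the toy probabilities, symbol encoding and threshold are all made up for illustration.

```python
import math

def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm for a discrete HMM; returns log P(obs)."""
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    log_p = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate one step through the transitions, then weight by emission.
            alpha = [
                sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][obs[t]]
                for s in range(n_states)
            ]
        total = sum(alpha)
        if total == 0.0:
            return float("-inf")  # sequence is impossible under the model
        log_p += math.log(total)
        alpha = [a / total for a in alpha]  # rescale to avoid underflow
    return log_p

def is_anomalous(obs, start, trans, emit, threshold):
    """Flag a window whose average per-step log-likelihood falls below threshold."""
    return log_likelihood(obs, start, trans, emit) / len(obs) < threshold

# Toy model (assumed values): 2 hidden states, 3 symbols (0=low, 1=mid, 2=high).
start = [0.9, 0.1]
trans = [[0.95, 0.05], [0.10, 0.90]]
emit = [[0.70, 0.25, 0.05], [0.05, 0.25, 0.70]]

normal_window = [0, 0, 1, 0, 0, 0]
odd_window = [0, 2, 0, 2, 0, 2]  # rapid swings the model considers unlikely
print(is_anomalous(normal_window, start, trans, emit, threshold=-1.0))
print(is_anomalous(odd_window, start, trans, emit, threshold=-1.0))
```

In a real setting the model parameters would be fitted to past performance data and the threshold calibrated on held-out normal windows.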


Predictive Maintenance using IoT for Smart Buildings

Case Competition entry for the 2018 Deloitte Thinktech (Winning Team) for our project based on Predictive Maintenance using IoT technology


Food Delivery Bot

Rube Goldberg-style program for ordering food, incorporating unnecessarily complicated but heavily technical elements


Computer Vision based Plastic Segregation System

Projects as part of a startup attempting to bring a circular economy for plastic recycling and waste management Link to Website


gAIa GIS

A model using predictive analysis to effectively indicate to users interested in setting up COIs the impacts of projects, in an attempt to foster awareness and preserve land, people and ecosystems. Link to Website


Technical Skills

Skills Summary

A Summary of Technical and Soft Skills I've gained through Work Experience and Projects:

Data Science

Data Analysis: pandas | NumPy | Data Cleaning | Data Visualization | Statistical Analysis | Data Lakes | SQL

Data Visualization: Tableau | Power BI | Matplotlib | ggplot2 | Plotly | Business Intelligence (BI) Tools

AI / ML

Machine Learning & AI: TensorFlow | PyTorch | Azure MLOps | Model Training & Evaluation | Feature Extraction | YOLO (Object Detection, OCR) | Large Language Models (LLMs) – LLAMA 2, 3; WizardLM-2 30b | Triton | A/B Testing | Experiment Design | Deep Learning | Neural Networks | Bayesian Statistics

Natural Language Processing (NLP) & Generation (NLG): Information Retrieval (IR) | Retrieval Augmented Generation (RAG) | Semantic Role Labeling | Dialogue Systems | Part-of-Speech Tagging | Named Entity Recognition (NER) | NLTK | Text Preprocessing | Tokenization | Sentiment Analysis | Topic Modeling | Spacy

Programming

Programming Languages: Python (Spacy, SciPy, NumPy, Pandas) | Java | C++ | C# | C | MATLAB | R

Web Development & Database Management: Angular | Node.js | HTML, CSS, JavaScript | MySQL | Database Design & Querying | Data Model Optimization | Database Normalization | Elasticsearch

Research

Experiment Design: A/B Testing | Experiment Design | Feature Extraction | Model Training & Evaluation | Statistical Analysis | Data Storytelling

Business & Communication Skills: Business Acumen | Communication with Stakeholders | Interdisciplinary Collaboration | Ethical Judgment | Data Privacy Expertise

Cybersecurity

Data Privacy Expertise

Critical Infrastructure Protection

Anomaly Detection

Additional Skills

Software Development Lifecycle (SDLC): Agile Methodologies | Git Version Control | Docker | Continuous Integration/Continuous Deployment (CI/CD) | Testing & Debugging

Big Data & Cloud Computing: AWS | Azure | NoSQL (MongoDB, Cassandra) | Distributed Computing

Adaptive Learning & Problem-Solving: Continuous Learning | Critical Thinking | Attention to Detail | Problem-Solving Skills | Domain Expertise




Some of My Skills

Software / Website Development

Full Stack Development (JAM stack, MEAN stack etc.)
Completed 2 projects and Currently employed as a Lead Developer
General Web Design (HTML, CSS, JS, Node.js)
Learnt HTML as part of 8th grade school curriculum
Typescript (Angular, Vue, Nuxt, React)
On SSHRC grant for developing Angular Website + 2yr Work Experience
Data Handling (SQL/ NoSQL Databases; Strapi, Drupal CMSs)
Managing Data Storage and Querying in current capacity with NRC
APIs, Scripting and Task Automation (Python, Bash Shell)
Received USRA (2020) to work with Scripts/APIs
OOP Languages (Java, C++, C#)
Published papers on Projects where I developed Unity games in C#

Natural Language Processing

Data Mining (Classification, Extraction)
Worked on as part of 2020 VPR USRA (Discourse Processing Lab)
Sentiment Analysis
1 publication in area, 2nd in Progress
ML / AI Systems (BERT, GPT-3)
Working on as part of 2022 SSHRC Grant
Syntax Trees
2 Publications in area, Focus of Linguistics component of degree
Non Textual Input/ Output
Social Data Analytics 250 Coursework
Hobbies / Contact

  

Misc. Achievements

  • Brazilian Jiu Jitsu
      - Blue Belt in Brazilian Jiu Jitsu under Michael Hansen, Budo Burnaby
      - Gold Medal at Grappling Industries Vancouver, March 2023
      - (Pictured with team) Bronze Medal at CBJJF (Canadian Brazilian Jiu Jitsu Federation) Vancouver, January 2023
      - Silver Medal at All vs All (AVA) Vancouver, July 2022

  • Debate, Model UNs, Case Competitions - Public Speaking
      - Best Team - Deloitte Thinktech Case Competition 2018 - IoT Project for Quadreal - Smart Buildings
      - Best Team - Deloitte Thinktech Case Competition 2021 - Platform Agnostic API Handling for Vancity - Open Banking
      - Assistant Director of the International Press at Harvard Model UN 2018 - Harvard University
      - (Pictured with trophy) Head Delegate of Best Large Delegation at Harvard Model UN 2017 - Harvard University
      - (Pictured with trophy) Head Delegate of Best Small Delegation at Ivy League Model UN 2016 - University of Pennsylvania
      - Chairperson at 4 conferences, Best Delegate at 3 conferences, 12 Misc. awards for self and team
      - Head of School Debate Team, Representative at National Debates, India
      - Best Presentation Award at Hwa Chong Asia Pacific Young Leaders Summit (HCAPYLS) 2017 representing India

  • General Achievements
      - Completion Medal - Vancouver BMO Marathon 2023 - 8k
      - (Pictured at VanSlam 2019) Prizes at Misc. Spoken Word Poetry Slams and Open Mics
      - Asian Finalists - Runners-Up - International NASA Space Settlement Design Competition - ARSSDC 2017
      - Best Essay at 3 High School Creative Writing Competitions
      - National High School Spelling Bee Semi-Finalist
      - Editorial Committee Member 2017 - High School Magazine / Yearbook
      - Ran a Cicada 3301 themed cryptography competition named Cryptonite at Annual Science & Tech Fest
      - Awards for Academic Achievement (1st in class of 400 students) for 12 years running (Kindergarten to 10th Grade)