CV


Basics

Name Pramit Sahoo
Label AI Researcher
Email pramitsahoo.gnipst@gmail.com
Url https://pramitsahoo.github.io
Summary M.Tech in Artificial Intelligence at IIT Hyderabad. Researching Multilingual NMT and Cultural Adaptation of LLMs.

Work

  • 2024.11 - 2025.04
    LLM Research Intern
    Sony Research India
    Mentors: Bojun Huang, Dr. Pankaj Wasnik
    • Developed a ~250M-parameter transformer-based machine-translation model for 12 Southeast Asian languages, using contrastive representation alignment to boost non-English-to-non-English directions.
    • Assembled a 22M-sentence corpus exclusively from public resources (Meta's NLLB-200 mined bitext, OPUS repositories, and the ALT parallel datasets), eliminating reliance on proprietary data.
    • Delivered a +2.3 BLEU mean gain over the open-source SEA-LION-v3 (9B) baseline on the low-resource languages Khmer, Malay, and Vietnamese in the En→XX direction, without increasing model size or inference cost.
    • Presented findings to Sony Research and AI Singapore leadership; recommendations were adopted for forthcoming Tamil-centric SEA-LION refinements and merged into AI Singapore's open-source roadmap.

Volunteer

  • 2024.05 - 2024.05
    Teaching Assistant
    Department of CSE, IIT Hyderabad
    Conducted and invigilated an online test for the M.Tech. in Data Science program with Prof. Rameshwar Pratap.
  • 2024.03 - 2024.03
    Coordinator
    Capture the Flag Hackathon, IIT Hyderabad
    Organized a Capture the Flag hackathon at IIT Hyderabad, sponsored by IIT Bombay and HSBC Bank.
  • 2023.02 - 2023.05
    Head, Technical Committee
    Annual Fest IRIS-2023, GNIPST
    Handled technical operations for the Annual Fest IRIS-2023 at GNIPST.

Education

  • 2023.01 - 2025.01
    M.Tech.
    Indian Institute of Technology Hyderabad (IITH)
    Artificial Intelligence
    • Natural Language Processing
    • Deep Learning
    • Foundations of Machine Learning
    • Matrix Theory
    • Probability
    • Random Variables and Stochastic Processes
    • Data Structures and Algorithms
  • 2019.01 - 2023.01
    Bachelor
    Guru Nanak Institute of Pharmaceutical Science and Technology, Kolkata
    Pharmaceutical Technology

Skills

Languages
Python
Bash
C
C++
JavaScript
ML / DL Frameworks
PyTorch
TensorFlow
Hugging Face Transformers
Fairseq
SentencePiece
NLP/MT Tooling
SacreBLEU
Moses scripts
OPUS-MT tools
fastText
spaCy
NLTK
Experiment & MLOps
Weights & Biases
MLflow
DVC
Docker
GitHub Actions
SLURM
Cloud
AWS (EC2, SageMaker)
GCP (Compute Engine)
Azure
Data / DB
PostgreSQL
SQLite
OS & Tools
Linux
Git
VS Code

Projects

  • 2024.06 - 2024.08
    IndicRASP: Multilingual Machine Translation Model
    Developed a 243M-parameter transformer-based model with an alignment augmentation strategy for 22 Indian languages.
    • Fine-tuned IndicRASP on a high-quality dataset to create IndicRASP-Seed, outperforming the 1B-parameter SOTA model IndicTrans2 on specific language pairs.
    • Surpassed the current SOTA, IndicTrans2, for Manipuri, Oriya, and Santali (En-Indic) with chrF++ score improvements of 0.5, 2.3, and 6.5, respectively, on the IN22-Gen test set.
    • Achieved competitive performance overall, with average chrF++ scores of 46.80 (En-Indic) and 56.34 (Indic-En) on the IN22-Gen benchmark.
    • Further fine-tuned it for low-resource languages, achieving chrF++ scores of 42.3 and 54.9 for Khasi and Mizo, respectively, in the En-Indic direction.
  • 2024.01 - 2024.04
    Empathy in Conversational Systems
    Developed a system to detect empathy in doctor-patient conversations.
    • Converted the MTS-Dialog summarization dataset into an empathy-focused dataset using the EPITOME framework.
    • Fine-tuned GPT-2 on 100 human-annotated samples and applied the model to automatically annotate the remaining data.
    • Evaluated the performance of GPT-2, RoBERTa, and DialoGPT, achieving an accuracy of approximately 97%.
  • 2024.01 - 2024.04
    Clickbait Spoiler Detection
    Developed a system to generate clickbait spoilers using the Webis Clickbait Spoiling Corpus 2022 dataset.
    • Transformed clickbait headlines into questions and employed Llama-2-13B to generate relevant spoilers.
    • Applied both pointwise and pairwise ranking approaches to evaluate spoiler generation across phrase, passage, and multipart clickbait types.
    • Achieved the best performance using the DeBERTa baseline, with a BLEU score of 37.89.
  • 2024.01 - 2024.04
    LLM Multimodal Traffic Accident Forecasting
    Analyzed time series traffic data from the FAIRS-2021 dataset using ARIMA and compared it with transformer-based models.
    • Prompted LLaVA-13B to interpret road images and evaluated its capability to generate descriptive text from visual data.
    • Integrated top feature weights from PCA into GPT-4 and Llama-2-13B prompts for forecasting in autonomous driving scenarios.
    • Concluded that transformer models outperformed traditional statistical methods in traffic accident forecasting.