CV
Basics
| Name | Pramit Sahoo |
| Label | AI Researcher |
| Email | pramitsahoo.gnipst@gmail.com |
| Url | https://pramitsahoo.github.io |
| Summary | M.Tech in Artificial Intelligence at IIT Hyderabad. Researching Multilingual NMT and Cultural Adaptation of LLMs. |
Work
-
2024.11 - 2025.04 LLM Research Intern
Sony Research India
Mentors: Bojun Huang, Dr. Pankaj Wasnik
- Developed a ~250M-parameter transformer-based machine translation model for 12 Southeast Asian languages, using contrastive representation alignment to boost non-English-to-non-English directions.
- Assembled a 22M-sentence corpus exclusively from public resources (Meta's NLLB-200 mined bitext, OPUS repositories, and ALT parallel datasets), eliminating reliance on proprietary data.
- Delivered a +2.3 BLEU mean gain over the open-source SEA-LION-v3 (9B) baseline on the low-resource languages Khmer, Malay, and Vietnamese in the En-XX direction, without increasing model size or inference cost.
- Presented findings to Sony Research and AI Singapore leadership; recommendations were adopted for forthcoming Tamil-centric SEA-LION refinements and merged into AI Singapore's open-source roadmap.
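The BLEU gains reported above would typically be computed with a standard scorer such as SacreBLEU. As a rough illustration of what the metric measures, here is a minimal pure-Python sketch of sentence-level BLEU (n-grams up to 4, brevity penalty, no smoothing); the function name and simplifications are mine, not the evaluation pipeline used in this work.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(hypothesis, reference, max_n=4):
    """Simplified sentence-level BLEU in [0, 100]: geometric mean of
    modified n-gram precisions (n = 1..max_n) times a brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        total = sum(h.values())
        if total == 0:
            return 0.0  # hypothesis too short for this n-gram order
        overlap = sum((h & r).values())  # clipped n-gram matches
        if overlap == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty punishes hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)
```

Real evaluations use corpus-level BLEU with standardized tokenization, which is why SacreBLEU (listed under Skills) is the tool of record.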
Volunteer
-
2024.05 - 2024.05 Teaching Assistant
Department of CSE, IIT Hyderabad
Conducted an online test and invigilated for the M.Tech. in Data Science program with Prof. Rameshwar Pratap.
-
2024.03 - 2024.03 Coordinator
Capture the Flag Hackathon, IIT Hyderabad
Organized a Capture the Flag hackathon at IIT Hyderabad, sponsored by IIT Bombay and HSBC Bank.
-
2023.02 - 2023.05 Head, Technical Committee
Annual Fest IRIS-2023, GNIPST
Handled the technical work for GNIPST's annual fest, IRIS-2023.
Education
-
2023.01 - 2025.01 M.Tech.
Indian Institute of Technology Hyderabad (IITH)
Artificial Intelligence
- Natural Language Processing
- Deep Learning
- Foundations of Machine Learning
- Matrix Theory
- Probability
- Random Variables and Stochastic Processes
- Data Structures and Algorithms
-
2019.01 - 2023.01 Bachelor's
Guru Nanak Institute of Pharmaceutical Science and Technology, Kolkata
Pharmaceutical Technology
Awards
- 2025.01.01
Demo Presentation - Bharat Gen Summit 2025
Bharat Gen Summit
Presented a demo of cultural adaptation work.
- 2024.11.01
Best System Award
MultiIndicMT Shared Task 2024, WMT, EMNLP 2024
- 2022.04.01
National-level Qualifier, Smart India Hackathon
Department of AYUSH
Skills
| Languages | Python, Bash, C, C++, JavaScript |
| ML / DL Frameworks | PyTorch, TensorFlow, Hugging Face Transformers, Fairseq, SentencePiece |
| NLP/MT Tooling | SacreBLEU, Moses scripts, OPUS-MT tools, fastText, spaCy, NLTK |
| Experiment & MLOps | Weights & Biases, MLflow, DVC, Docker, GitHub Actions, SLURM |
| Cloud | AWS (EC2, SageMaker), GCP (Compute Engine), Azure |
| Data / DB | PostgreSQL, SQLite |
| OS & Tools | Linux, Git, VS Code |
Projects
- 2024.06 - 2024.08
IndicRASP: Multilingual Machine Translation Model
Developed a 243M-parameter transformer-based model with an alignment-augmentation strategy for 22 Indian languages.
- Fine-tuned IndicRASP on a high-quality dataset to create IndicRASP-Seed, outperforming the 1B-parameter SOTA model IndicTrans2 on specific language pairs.
- Surpassed current SOTA IndicTrans2 for Manipuri, Oriya, and Santali (En-Indic) with chrF++ score improvements of 0.5, 2.3, and 6.5, respectively, on the IN22-Gen test set.
- Achieved competitive performance overall, with average chrF++ scores of 46.80 (En-Indic) and 56.34 (Indic-En) on the IN22-Gen benchmark.
- Further fine-tuned the model for low-resource languages like Khasi and Mizo, achieving chrF++ scores of 42.3 and 54.9, respectively, in the En-Indic direction.
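The chrF++ numbers above would normally come from SacreBLEU. As a sketch of the underlying character n-gram F-score (plain chrF, without the word n-grams that chrF++ adds), a minimal pure-Python version might look like this; it is a pedagogical simplification with names of my choosing, not the scorer behind the reported figures.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count character n-grams, ignoring spaces (as chrF does by default)."""
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF in [0, 100]: F-beta score over average character
    n-gram precision and recall for n = 1..max_n (beta=2 weights recall)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if sum(hyp.values()) == 0 or sum(ref.values()) == 0:
            continue  # strings too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)
```

chrF++ extends this by also averaging in word 1- and 2-gram scores, which is the variant used for the IN22-Gen comparisons.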
- 2024.01 - 2024.04
Empathy in Conversational Systems
Developed a system to detect empathy in doctor-patient conversations.
- Converted the MTS-Dialog summarization dataset into an empathy-focused dataset using the EPITOME framework.
- Fine-tuned GPT-2 on 100 human-annotated samples and applied the model to automatically annotate the remaining data.
- Evaluated the performance of GPT-2, RoBERTa, and DialoGPT, achieving an accuracy of approximately 97%.
- 2024.01 - 2024.04
Clickbait Spoiler Detection
Developed a system to generate clickbait spoilers using the Webis Clickbait Spoiling Corpus 2022 dataset.
- Transformed clickbait headlines into questions and employed LLaMA-2-13B to generate relevant spoilers.
- Applied both pointwise and pairwise ranking approaches to evaluate spoiler generation across phrase, passage, and multipart clickbait types.
- Achieved the best performance using the DeBERTa baseline, with a BLEU score of 37.89.
- 2024.01 - 2024.04
LLM Multimodal Traffic Accident Forecasting
Analyzed time series traffic data from the FAIRS-2021 dataset using ARIMA and compared it with transformer-based models.
- Prompted LLaVA-13B to interpret road images and evaluated its capability to generate descriptive text based on visual data.
- Integrated top feature weights from PCA into GPT-4 and LLaMA-2-13B prompts for forecasting in autonomous driving scenarios.
- Concluded that transformer models outperformed traditional statistical methods in traffic accident forecasting.