Professional Summary

Accomplished Data Engineer and AI/ML Researcher with demonstrated expertise in architecting large-scale data infrastructure and applying artificial intelligence to solve complex information retrieval and analysis problems.

Specialized in scalable ETL pipeline design, database modernization, and AI/ML applications in natural language processing, with proven success migrating more than 1.5 million historical metadata records to modern cloud databases and improving overall data discoverability.

Education

Master of Science in Electrical and Computer Engineering

Tennessee Tech University

Cookeville, Tennessee

2019 - 2023

Professional Experience

Data Engineer

Vanderbilt University Television News Archive

Nashville, Tennessee

2024 - Present
  • Architected and deployed mission-critical ETL pipelines supporting a large-scale television news archive, enabling researchers to analyze broadcast media content spanning multiple decades.
  • Led database modernization initiative migrating 1.5 million+ historical metadata records from legacy systems to AWS Aurora MySQL with zero data loss and significant query performance improvements.
  • Developed an AI-enhanced metadata curation system using Databricks and NLP that reduced manual quality-assurance time while improving data discoverability.
  • Engineered scalable data warehousing solutions on AWS, utilizing S3, Glue, Lambda, and Aurora to support large-scale video archive storage and metadata processing.

Graduate Research Assistant & Adjunct Lecturer

Tennessee Tech University

Cookeville, Tennessee

2019 - 2023
  • Conducted research on IoT and wireless systems addressing critical infrastructure applications.
  • Designed and implemented experimental IoT sensor networks for real-time environmental monitoring.
  • Taught an Industrial Electronics course as an Adjunct Lecturer, developing hands-on curriculum.
  • Led laboratory sessions for undergraduate engineering courses.

Selected Publications & Talks

Metadata Matters: Modernizing the Vanderbilt Television News Archive Database

Jim Duran, Vaibhav Ravinutala

Coalition for Networked Information (CNI) Project Briefing Series, Winter 2026

Democratizing Access to Library Data Assets: AI-Enhanced Curation Model using Databricks

Vaibhav Ravinutala, Sathvika Talakanti

Southeast Data Librarian Symposium 2025

The Lakehouse for Research: Why Databricks? Augmenting R/Python for Petabyte-Scale Analytics and Reproducibility

Vaibhav Ravinutala

Intro to Databricks & Social Media Research Roundtable, hosted by the McGee Applied Research Center for Narrative Studies, 2025

Sample calculation of Link Power Budget and Effective SNR in NB-IoT

R. S. K. Vaibhav and T. S. Reddy

2018 2nd International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud), Palladam, India, 2018, pp. 30-32

Current Research & Manuscripts

In Progress

Data Descriptor: A Standardized Longitudinal Corpus of U.S. Broadcast News Transcripts (1968–Present) with PBCore Metadata and AI-Enhanced Validation for Scholarly Use

Vaibhav Ravinutala, Jim Duran

This descriptor details a curated 54-year corpus of American broadcast news transcripts derived from VTNA content using scalable, cloud-native ASR pipelines (AWS Transcribe with serverless architectures and custom post-processing). Standardized with PBCore metadata, IPTC Media Topics, linked data, AI-generated descriptions, and accuracy/versioning/SLO metrics (WER < 10% on validated samples), the corpus (the tvn_transcripts Delta table) complies with U.S. copyright law (§ 107 fair use for scholarly transformation; § 108(f)(3) exceptions) and is accessible via institutional channels for non-commercial research. Usage notes cover access protocols, noise caveats, and interoperability for AI-driven studies of media narratives, framing, and misinformation, advancing national priorities in digital preservation and computational humanities.

In Progress

Secure Persistent Identification and Machine-Actionable Metadata: Modernizing the Vanderbilt Television News Archive for AI-Driven Historiography

Vaibhav Ravinutala, Jim Duran, Anata Garnapudi

This paper addresses the technical bottleneck of scaling legacy media archives for computational research. Using the Vanderbilt Television News Archive (VTNA) as a case study, we document the migration from a monolithic on-premise database to a cloud-native AWS Aurora architecture. We introduce a novel implementation of Nano IDs—decentralized, collision-resistant, and non-sequential identifiers—to replace legacy serial numbering, thereby eliminating enumeration vulnerabilities and ensuring citation persistence in distributed environments. The framework integrates the PBCore metadata standard with AI-generated transcripts (ASR) and IPTC Media Topics, transforming unstructured broadcast video into a machine-actionable dataset. By implementing a granular versioning schema for AI outputs, we provide the scholarly provenance required for reproducible AI/ML research. This hybrid engineering and librarianship model offers a scalable blueprint for modernizing global broadcast repositories into secure, citable, and computationally tractable research hubs.
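The identifier scheme described above could be sketched as follows, assuming Nano ID's default 64-symbol URL-safe alphabet and 21-character length; the `generate_nanoid` helper is illustrative, not the archive's actual implementation:

```python
import secrets
import string

# Nano ID's default alphabet: 64 URL-safe symbols (a-z, A-Z, 0-9, "-", "_").
NANOID_ALPHABET = string.ascii_letters + string.digits + "-_"

def generate_nanoid(size: int = 21, alphabet: str = NANOID_ALPHABET) -> str:
    """Return a collision-resistant, non-sequential identifier.

    Uses a cryptographically secure random source (secrets), so IDs can be
    minted by independent processes without central coordination, and
    consecutive records share no ordering that would permit enumeration.
    """
    return "".join(secrets.choice(alphabet) for _ in range(size))

# Unlike legacy serial numbers, adjacent IDs reveal nothing about each other:
record_id = generate_nanoid()
```

Because IDs are drawn from a 64^21 space, collisions are vanishingly unlikely at archive scale, which is what allows decentralized minting while preserving citation persistence.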

In Progress

Large-Scale PBCore Adaptation with AI/ML for Archival TV News: Infrastructure Modernization and Implications for Computational Research at Vanderbilt

Vaibhav Ravinutala, Jim Duran

This paper presents the large-scale adaptation of the PBCore metadata standard to the Vanderbilt Television News Archive (VTNA), a 58-year collection encompassing over 1.4 million news segments and commercial breaks (1968–present). The modernization effort integrates PBCore as the core metadata framework with AWS Aurora database migration, AI/ML-driven enhancements including automated speech recognition (ASR) transcripts, AI-generated titles and descriptions, and linked data structures to improve semantic interoperability and search capabilities. Databricks-enabled data lakes facilitate scalable processing, versioning, and service level objectives (SLOs) for secure, compliant computational access. Governance and stewardship protocols ensure adherence to U.S. copyright law, specifically fair use under 17 U.S.C. § 107 for transformative non-commercial scholarly research and § 108(f)(3) exceptions for audiovisual news programs. Empirical results demonstrate significant improvements in query latency (up to 70% reduction), text/data mining readiness, and researcher accessibility. The implications for computational research are substantial: the infrastructure enables advanced longitudinal studies of U.S. broadcast media narratives, framing analysis, public discourse evolution, and misinformation detection—contributing to national priorities in digital cultural heritage preservation, open science, AI ethics, and media literacy.

In Progress

AI-Enhanced Data Curation Model for Licensed or Restricted Datasets: Governance, Stewardship, and Automation in Research Collections

Vaibhav Ravinutala, Akarsha Dasaraju

Building on foundational work presented at SEDLS 2025 ("Democratizing Access to Library Data Assets: AI-Enhanced Curation Model using Databricks"), this paper proposes a novel AI-enhanced curation model specifically designed for licensed or restricted datasets and collections. The model leverages Databricks-enabled data lakes for scalable processing, integration of automated transcription outputs (e.g., ASR-derived content), AI-generated titles and descriptions, and automated workflows for versioning, service level objectives (SLOs), and access controls. Governance and stewardship protocols address ethical handling, compliance verification, and responsible use under applicable legal frameworks (including U.S. copyright fair use under 17 U.S.C. § 107 for transformative non-commercial scholarly research and relevant exceptions for archival/educational purposes). Validation through applied case studies demonstrates improved query efficiency, text/data mining readiness, and support for academic and computational applications in areas such as narrative analysis, pattern detection, and knowledge discovery—contributing to broader priorities in open science, responsible AI, and equitable access to restricted data resources.

Technical Skills

Programming

  • Python (Advanced)
  • SQL (PostgreSQL, MySQL, Aurora)
  • R (Statistical Computing)
  • Scala (Spark Applications)

Data Engineering

  • Apache Spark & PySpark
  • ETL Pipeline Development
  • Data Warehousing
  • Databricks Platform

AI/ML & Analytics

  • Natural Language Processing
  • Machine Learning (scikit-learn, TensorFlow)
  • Sentiment Analysis
  • Statistical Analysis

Cloud Platforms

  • AWS (S3, Glue, Lambda, Aurora)
  • Microsoft Azure
  • Docker & Containers
  • CI/CD Pipelines

Visualization

  • Tableau
  • Power BI
  • Matplotlib

Data Librarianship

  • Metadata Standards & Schema Design (Dublin Core, MODS)
  • Metadata Curation & Enrichment
  • Digital Preservation Practices
  • Cataloging, Controlled Vocabularies & Authority Files; IPTC and PBCore

Certifications

AWS Certified Solutions Architect - Associate

Amazon Web Services

AWS Certified Data Engineer - Associate

Amazon Web Services

Research Interests

  • Data Librarianship & Information Management: Advancing methodologies for organizing, preserving, and providing access to digital information resources and collections.
  • Data Engineering & Pipeline Optimization: Developing scalable, efficient data infrastructure to support research.
  • Artificial Intelligence in NLP: Applying machine learning techniques to automate information extraction and analysis.
  • Sentiment Analysis & Media Research: Utilizing computational methods to analyze public discourse and media narratives.
  • Data Governance: Policies, standards, and practices to ensure data quality, privacy, and ethical use.
  • Ethics & Privacy in AI: Ensuring responsible development and deployment of AI systems with consideration for bias and transparency.