This position will coordinate the planning and conduct of advanced research computing (RC) engineering duties. Implement existing and develop new RC solutions to keep pace with complex research problems. Work independently to build, monitor, and maintain the integrity of RC systems. Provide technical expertise to teams and projects alongside research programs. Be a key contributor to multiple projects simultaneously.
Job-Specific Responsibilities:
Harvard is seeking a Sr. Systems Software Engineer who will continue to improve operational visibility of the vast FAS Research Computing (FASRC) infrastructure through strong site-reliability practices. The FASRC infrastructure is core to the Science & Engineering and Public Health research missions, supporting over 5,000 researchers. This position will work within a team of RC Systems Engineers to design, implement, deploy, and maintain advanced monitoring, logging, and alerting systems for mission-critical services. The Systems Software Engineering group helps maintain core production infrastructure, provisioning, central version control, central logging, and other systems. This group offers many opportunities to build tools and patterns that help all of Research Computing work better. This is an individual contributor position that will report to the Associate Director of Systems Software Engineering in FASRC.
- Participate fully in planning, building, configuring, and running RC systems at scale
- Monitor and maintain the health and integrity of RC systems including upgrading and patching
- Design and implement robust and secure IT solutions within a fast-paced research environment
- Define and track performance metrics to ensure efficient current and future use of IT resources
- Consult and collaborate in a timely manner with researchers, other key IT partners (e.g., network and security), and Data Center partners
- Build and maintain relationships with external vendor technicians and engineers
- Collaborate with other systems engineers within the RC ecosystem
- Contribute best-practices documentation and support knowledge transfer
- Mentor junior staff
- Abide by and follow the Harvard University IT technical standards, policies, and Code of Conduct
Harvard University’s Research Computing continues to evolve and to expand its services and support for its leading research faculty and their collaborators around the world. These services include maintaining a Top 100 academic high-performance computing cluster, cloud computing, virtual machines, storage, databases, instrumentation core facility workstations, and other development platforms. We engage directly with researchers through help requests, monitoring, office hours, training, and in-depth consultations. Research Computing has numerous other successful collaborations, including building the MGHPCC (http://www.mghpcc.org/) in Holyoke, MA with leading partner universities. With these and other institutions, we launched the NSF-funded NESE project (http://nese.mghpcc.org), which creates a regional data science repository, as well as the New England Research Cloud (http://nerc.mghpcc.org), which creates a collaborative on-prem regional cloud framework. Research Computing at Harvard has a track record of building partnerships to accelerate research and collaboration.
We are committed not only to cultivating the diversity of our faculty, staff, and students but also to developing an inclusive culture that is vibrant, engaging, and encouraging of innovation and intellectual debate. We believe creating and maintaining an inclusive workplace allows employees from all backgrounds and walks of life to achieve their fullest potential. We also believe an inclusive culture is one that accepts and values the differences we all bring to the workplace and views them as a strength.
Basic Qualifications:
- Minimum of seven years’ post-secondary education or relevant work experience
Additional Qualifications and Skills:
- Broad knowledge of the deployment and management of physical and virtual systems (e.g. storage, cluster computing, network, database, applications)
- Experience automating infrastructure with tools like Puppet, Chef, Ansible, or Terraform
- Demonstrated teamwork skills, clear communication, a service-minded approach, and the ability to act as a trusted advisor
- Strong documentation, mentoring, and collaboration experience
- Programming skills in Ruby, Python, Go, Rust, or a similar language
- Experience writing operational tools and establishing reproducible patterns
- Experience with monitoring systems, writing checks, and creating actionable alerts
- Experience with metrics collection to gain insight into production systems
- Experience with log aggregation and Elasticsearch
- Experience with Ceph or similar distributed storage systems
- Experience with Git and version control in general
- Experience with Linux system administration
- Familiarity with relational databases such as MySQL or PostgreSQL
Harvard offers an outstanding benefits package including:
- Time Off: 3-4 weeks paid vacation, paid holiday break, 12 paid sick days, 12.5 paid holidays, and 3 paid personal days per year.
- Medical/Dental/Vision: We offer a variety of excellent medical, dental, and vision plans; all coverage begins on your start date.
- Retirement: University-funded retirement plan with full vesting after 3 years of service.
- Tuition Assistance Program: Competitive tuition assistance program, including $40 per class at the Harvard Extension School and discounted options through participating Harvard graduate schools.
- Transportation: Harvard offers a 50% discounted MBTA pass as well as additional options to assist employees in their daily commute.
- Wellness Options: Harvard offers programs and classes at little or no cost, including stress management, massages, nutrition, meditation and complimentary health services.
- Facility Access and Discounts: Access to Harvard athletic facilities, libraries, and campus events, as well as many discounts throughout metro Boston.