Company: Stanford University

Address: Stanford, CA
Form of work: Full-Time
Category: Information Technology

Job description

GPU Cluster System Administrator
Stanford Research Computing is looking for a talented System Administrator to join our team of collaborative and innovative professionals helping Stanford's faculty and students use advanced computing and data tools to explore new frontiers in knowledge and solve some of humanity's most urgent problems. Our staff work directly with some of the world's top researchers in a broad range of disciplines, across all of Stanford's seven schools - while also supporting and learning from each other in cross-project endeavors. We maintain and steadily improve an advanced research computing facility, and we support a variety of environments for Stanford research. In Stanford Research Computing, you'll have a rare opportunity to contribute to discoveries and inventions that have global reach and positive impact, and to share in the curiosity and commitment of the scholars and scientists who lead these projects.
This new position will support Stanford's world-class data science and AI-focused research by managing and administering a substantial GPU-based cluster. You will partner closely with a team of data scientists from Stanford Data Science to ensure that the GPU cluster environment is tuned and utilized most efficiently to maximize research output. We'd love to have you join us on this exciting journey.
Responsibilities
This role is primarily systems-facing, but like all Research Computing positions, there is a significant researcher-facing component. In this position, you will put to use your in-depth knowledge of Slurm and Linux, your HPC cluster administration experience, and your passion for supporting ground-breaking research on a daily basis. You will play a crucial role in optimizing, improving and sustaining our advanced computing infrastructure.
• HPC Infrastructure Maintenance: Manage the day-to-day System Administration of an NVIDIA DGX Superpod and associated storage, management and networking infrastructure, in alignment with applicable university, regulatory agency, and/or contractual security and privacy requirements, including HIPAA.
• Slurm: Manage all aspects of Slurm for efficient resource allocation and job scheduling across the cluster, consistent with faculty guidance on system resource usage and utilization.
• GPU Resource Management: Manage GPU resources within the cluster, optimizing utilization for compute-intensive tasks while maintaining a balance between user requirements and system stability. Provide automated, easily accessible resource utilization metrics.
• User Support: Collaborate with Stanford Data Science team members and system users to understand their computing needs, provide technical assistance, and troubleshoot issues related to system performance and job execution. Provide user consultation and training in system use as needed.
• Performance monitoring: Monitor system performance, diagnose bottlenecks, and take necessary actions to improve system performance.
• Documentation: Maintain detailed documentation of system configurations, procedures, and troubleshooting guides to facilitate knowledge sharing and team collaboration. Develop user-facing documentation in coordination with colleagues from Stanford Data Science.
• Planning: Meet regularly with stakeholders to understand existing challenges, anticipated needs, and opportunities for closer collaboration.
• Vendor engagement: Liaise with system vendors and other external partners as needed to ensure system issues are triaged and resolved expeditiously and correctly.
Minimum Requirements
Education and Experience:
  • Bachelor's degree and eight years of increasingly technical work experience or a combination of education and relevant experience. In-depth experience managing complex multiuser HPC clusters and storage environments is necessary, as is experience managing GPU-based infrastructure.

Qualifications:
This position requires in-depth knowledge of and substantial hands-on experience with:
• Linux Cluster System Administration
• GPU technologies and their integration into HPC environments
• Slurm configuration and management
• NFS-based storage management and configuration
• High-performance parallel filesystem (Lustre) management and configuration
• Scripting for system management, monitoring and task automation
• Installing and repairing servers and associated cluster hardware
• Complex technical problem solving and troubleshooting, with a proactive approach to system optimization and issue resolution
• Security practices and compliance standards in a computing environment
• Collaborating effectively across teams and with researchers
Additional desired skills and experience include:
• AI/ML software and frameworks, deep learning, and LLM training
• CUDA
• System benchmarking
The expected pay range for this position is $128,000 - $170,000 per annum.
Stanford University provides pay ranges representing its good faith estimate of what the university reasonably expects to pay for a position. The pay offered to a selected candidate will be determined based on factors such as (but not limited to) the scope and responsibilities of the position, the qualifications of the selected candidate, departmental budget availability, internal equity, geographic location and external market pay for comparable jobs.
At Stanford University, base pay represents only one aspect of the comprehensive rewards package. The Cardinal at Work website (http://cardinalatwork.stanford.edu/benefits-rewards) provides detailed information on Stanford's extensive range of benefits and rewards for employees. Specifics about the rewards package for this position may be discussed during the hiring process.
Working Conditions
This is a hybrid position, in which you will work on-site at the Stanford campus for a minimum of 3 days a week through the first 9 months of employment, and at least 2 days a week thereafter.
Our core work hours are 9 am - 5 pm Pacific. This role will occasionally require extended hours and weekend work, and you will participate in a rotation of on- and off-site responsibilities during the annual winter closure. Periodically, the data center is shut down for required maintenance. All team members with system responsibilities are expected to be physically on-site to return services to production status at the end of any planned facility outage.
Why Stanford is for You:
Imagine a world without search engines or social platforms. Consider lives saved through first-ever organ transplants and research to cure illnesses. Stanford University has revolutionized the way we live and enriched the world. Supporting this mission is our diverse and dedicated staff of 17,000. We seek talent driven to impact the future of our legacy. Our culture and unique perks empower you with:
• Freedom to grow. We offer career development programs, tuition reimbursement, and course auditing. Join a TED Talk, watch a film screening, or listen to a renowned author or global leader speak.
• A caring culture. We provide superb retirement plans, generous time-off, and family care resources.
• A healthier you. Choose from hundreds of health or fitness classes at our world-class exercise facilities. We provide excellent health care benefits.
• Discovery and fun. Stroll through historic sculptures, trails, and museums.
• Enviable resources. Enjoy free commuter programs, ridesharing incentives, discounts and more.
We look forward to receiving your application and cover letter.
*The job duties listed are typical examples of work performed by positions in this job classification and are not designed to contain or be interpreted as a comprehensive inventory of all duties, tasks, and responsibilities. Specific duties and responsibilities may vary depending on department or program needs without changing the general nature and scope of the job or level of responsibility. Employees may also perform other duties as assigned.
*Consistent with its obligations under the law, the University will provide reasonable accommodations to applicants and employees with disabilities. Applicants requiring a reasonable accommodation for any part of the application or hiring process should contact Stanford University Human Resources at stanfordelr@stanford.edu. For all other inquiries, please submit a contact form.
*Stanford is an equal employment opportunity and affirmative action employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, protected veteran status, or any other characteristic protected by law.
Refer code: 8685629. Stanford University - The previous day - 2024-03-22 18:43
