[About ByteDance]
Founded in 2012, ByteDance's mission is to inspire creativity and enrich life. With a suite of more than a dozen products, including TikTok, Helo, and Resso, as well as platforms specific to the China market, including Toutiao, Douyin, and Xigua, ByteDance has made it easier and more fun for people to connect with, consume, and create content.
[About the Team]
The Datacenter Infrastructure Engineering team supports the company's fast growth by building and operating hyperscale datacenters. The team manages the end-to-end lifecycle of the server fleet and provides cloud solutions and infrastructure services that are scalable and reliable.
[Responsibilities]
As the [Site Reliability Engineer - Infrastructure Engineering], you would be responsible for at least one, if not all, of these areas:
Infrastructure:
- Build, expand and operate global infrastructures, including large-scale systems in public and private clouds, data centers and content delivery networks.
- Build tools, automations, visualizations and monitors to facilitate the operation and optimization of the global infrastructure.
- Help improve the whole lifecycle of infrastructure services, from inception and design through development to deployment, user support and refinement.
- Support the production environment end to end by responding to performance and reliability issues and participating in on-call rotations.
Security:
- Conduct security reviews of core corporate and production infrastructure.
- Carry out security updates and protect enterprise infrastructure at the system and network level.
- Drive enterprise-focused security improvements to products and services.
- Build security tools and processes for critical infrastructure protection, monitoring and remediation.
Traffic:
- Build tools, automations, visualizations and monitors to facilitate the operation and optimization of the traffic infrastructure.
- Provide primary operational support and engineering for traffic infrastructure systems.
- Gather and analyze metrics to assist in performance tuning and fault finding.
[Minimum Qualifications]
- Bachelor's degree in Computer Science or equivalent with 3+ years of relevant experience.
- Experience in one or more programming languages such as Java, Python, C++, or Go, or scripting experience in Shell and Python.
- Ability to thrive in a fast-paced environment.
- Relevant experience working in a datacenter or another environment with large-scale, high-traffic infrastructure.
As a Site Reliability Engineer with the Infrastructure Engineering team, you would be expected to be an expert in at least one, if not all, of these areas as well:
Infrastructure:
- Experience working with cloud infrastructure.
- Experience in building solutions with AWS, Google, Azure and other cloud services.
- Experience in developing and operating one or more of the following systems: OpenStack, Kubernetes, Nginx, ipvs, ELK stack, Hadoop, etc.
- Experience working with Unix/Linux systems, from kernel to shell and beyond.
- Experience working with system libraries, file systems, and client-server protocols.
- Experience in designing, analyzing, and building automation and tools for large scale systems.
- Experience in networking technologies such as TCP/IP, BGP, DNS, etc. in a carrier-grade environment.
Security:
- Experience in network security, such as DDoS and WAF protection.
- Experience with security protocols such as TLS, including protocol features and updates.
- Experience with VPNs and building encrypted communication channels.
- Experience conducting infrastructure security reviews and patching and updating systems against potential security vulnerabilities.
- Experience in one or more programming languages such as Java, C++, Go, or scripting experience in Shell and Python.
Traffic:
- Experience working with traffic systems from CDNs to load balancers and beyond.
- Experience working with network devices, remote management systems, and client-server protocols.
- Knowledge of network infrastructure and/or routing.
- Experience with Layer 4 / Layer 7 load balancers.
- Knowledge of protocols like TCP/IP, HTTP, RPC, TLS, etc.
- Experience working with containerized environments.
- Experience in one or more programming languages such as Java, C++, Go, or scripting experience in Shell and Python.