Tuesday, November 08, 2022

How Hugging Face and ServiceNow tackle code-generating Large Language Model challenges

Recommended reading! Exciting times! Quite impressive! Advanced automated code generation!

This is a very promising intersection of natural language processing, machine learning, and software development!

"A little over a year ago, using large language models (LLMs) to generate software code was a cutting-edge scientific experiment that had yet to prove its worth. But while code generation has become one of the most successful applications of LLMs, BigCode, launched recently by Hugging Face and ServiceNow, has strived to address some of code-generating LLMs' biggest pain points.
Today, many developers are using LLM-powered tools such as GitHub Copilot to improve productivity, stay in the flow and make their work more enjoyable. However, as LLM-powered coding matures, we’re also beginning to discover the challenges it must overcome, including licensing, transparency, security and control.  ...
software engineers will be able to use LLMs to maintain legacy code [or convert legacy code?] written in an unfamiliar programming language. ...
The BigCode project is a collaboration between Hugging Face and ServiceNow, announced in September. The Stack, which was released on October 27, comprises 3 TB of “permissively licensed source code” obtained from GitHub, assembled for training large language models for code generation. ...
In addition to providing an unprecedented 3 TB of curated source code, the BigCode team has provided a detailed breakdown of how the code was obtained and filtered. The dataset was gathered over several months. The team downloaded 137.36 million publicly available GitHub repositories. It then filtered the dataset to exclude repositories that did not have permissive licenses. ...
Licensing is not the only challenge that code LLMs face. The engineers of the models and the curators of the datasets must also address other problems such as removing sensitive information, including usernames, passwords and security tokens.
Another concern is insecure code. ... The open-source format of The Stack will allow security researchers to scrutinize the dataset for insecure code. Additionally, the BigCode team has implemented update mechanisms that take advantage of new information, such as the disclosure of vulnerabilities, and evolving best practices to limit the spread of malicious code in The Stack. ...
“The OpenRAIL license is an open-source license similar to Apache 2.0, but also includes provisions to prohibit certain use cases that could, for example, exclude the generation of malware ...
LLMs will be able to both extend the abilities of professional software engineers and enable non-technical people to build new software. ..."
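
To make the filtering and clean-up steps from the excerpt more concrete, here is a minimal Python sketch of the idea: keep only repositories with a permissive license, then redact obvious hard-coded secrets. This is not BigCode's actual pipeline. The `Repo` class, the `PERMISSIVE_LICENSES` allow-list, the single `SECRET_ASSIGNMENT` regex, and the `build_corpus` helper are all assumptions invented for illustration.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional

# Assumed allow-list of permissive license identifiers (illustrative only;
# the excerpt does not say which licenses the BigCode team accepted).
PERMISSIVE_LICENSES = {"mit", "apache-2.0", "bsd-2-clause", "bsd-3-clause", "isc"}

# Hypothetical pattern: a quoted value assigned to a secret-looking name.
SECRET_ASSIGNMENT = re.compile(
    r"\b(password|passwd|secret|token|api_key)(\s*=\s*)([\"'])(.*?)\3",
    re.IGNORECASE,
)


@dataclass
class Repo:
    """Hypothetical stand-in for a downloaded repository."""
    name: str
    license_id: Optional[str]  # None when no license was detected
    files: Dict[str, str]      # path -> file contents


def is_permissive(repo: Repo) -> bool:
    """True only if the repository declares a license on the allow-list."""
    return repo.license_id is not None and repo.license_id.lower() in PERMISSIVE_LICENSES


def redact_secrets(text: str) -> str:
    """Replace quoted values assigned to secret-looking names with a placeholder."""
    return SECRET_ASSIGNMENT.sub(r"\1\2\3<REDACTED>\3", text)


def build_corpus(repos: List[Repo]) -> Dict[str, str]:
    """Drop non-permissive repositories, then redact secrets in what remains."""
    corpus: Dict[str, str] = {}
    for repo in repos:
        if not is_permissive(repo):
            continue  # unlicensed or non-permissive code is excluded
        for path, text in repo.files.items():
            corpus[f"{repo.name}/{path}"] = redact_secrets(text)
    return corpus


repos = [
    Repo("alice/tool", "MIT", {"app.py": 'token = "abc123"\n'}),
    Repo("bob/lib", "GPL-3.0", {"lib.py": "x = 1\n"}),
    Repo("carol/misc", None, {"misc.py": "y = 2\n"}),
]
corpus = build_corpus(repos)
print(corpus)
```

A real pipeline would need proper license detection and a far more thorough secret scanner; a single regex like this one misses most real-world credentials.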

How Hugging Face and ServiceNow tackle code-generating LLM challenges | VentureBeat
