AI & ML interests

Individual Army of Engineering Warriors

ajibawa-2023 
posted an update 5 days ago
Ruby-Code-Large
Dataset: ajibawa-2023/Ruby-Code-Large

Ruby-Code-Large is a large-scale corpus of Ruby programming language source code comprising 331,743 code samples stored in .jsonl format. The dataset is designed to support research and development in large language model (LLM) pretraining, static analysis, web application development, and software engineering automation within the Ruby ecosystem.

By offering a substantial, language-focused dataset, Ruby-Code-Large enables targeted experimentation in dynamically typed programming, object-oriented design, and rapid application development, areas where Ruby is widely used, particularly in web frameworks and scripting.

Ruby-Code-Large addresses the lack of large, curated, Ruby-specific datasets, enabling focused research on expressive syntax, metaprogramming, and high-level abstractions.
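
A minimal sketch of loading the corpus with the Hugging Face datasets library. The split name and the .jsonl schema are not stated in the post, so the sketch only inspects the features rather than assuming a column name for the code text:

```python
from datasets import load_dataset

# Load the corpus; inspect the schema before assuming any column names.
ds = load_dataset("ajibawa-2023/Ruby-Code-Large", split="train")
print(ds)           # row count and column names
print(ds.features)  # schema of the .jsonl records
print(ds[0])        # first code sample
```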
ajibawa-2023 
posted an update 6 days ago
Go-Code-Large
Dataset: ajibawa-2023/Go-Code-Large

Go-Code-Large is a large-scale corpus of Go (Golang) programming language source code, comprising 316,427 code samples stored in .jsonl format. The dataset is designed to support research and development in large language model (LLM) pretraining, static analysis, cloud-native systems, and modern backend software engineering.

By offering a focused and curated dataset for Go, this corpus enables experimentation in concurrent programming, distributed systems, and performance-oriented backend services—domains where Go is widely adopted.

Go-Code-Large addresses the relative scarcity of large, language-specific datasets for Go, enabling targeted research into idiomatic Go patterns, concurrency primitives, and scalable system design.
fblgit 
posted an update 16 days ago
I recently built https://github.com/fblgit/eLLMulator, a software emulator for Claude Code.

eLLMulator approach:

LLM agents become your software components. Each agent deeply studies its assigned source file, then interacts with other agents via synchronous MCP tool calls that mirror real function calls. The call graph emerges naturally from code control flow, producing traces that capture not just what happened, but why each component behaved as it did.

The Claude Agent SDK provides sessions, MCP provides the bus. The code itself is the routing layer.
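
A loose, hypothetical sketch of the agent-per-source-file idea in plain Python, standing in for Claude Agent SDK sessions and MCP tool calls. Class and method names below are illustrative only and do not reflect eLLMulator's actual API:

```python
trace = []  # emergent call graph: (caller, callee, tool, kwargs)

class FileAgent:
    """One agent standing in for one source file / software component."""
    def __init__(self, name, source_path):
        self.name = name
        self.source_path = source_path  # the file this agent "deeply studies"
        self.peers = {}                 # other agents reachable as MCP-style tools

    def register_peer(self, agent):
        self.peers[agent.name] = agent

    def call(self, caller, tool, **kwargs):
        # Synchronous "tool call" mirroring a real function call; every call is traced.
        trace.append((caller, self.name, tool, kwargs))
        return self.handle(tool, **kwargs)

    def handle(self, tool, **kwargs):
        raise NotImplementedError

class ParserAgent(FileAgent):
    def handle(self, tool, **kwargs):
        if tool == "parse":
            return {"tokens": kwargs["text"].split()}

class AppAgent(FileAgent):
    def handle(self, tool, **kwargs):
        if tool == "run":
            # Control flow in the "code" decides which peer is called next,
            # so the call graph emerges from how components actually interact.
            parsed = self.peers["parser"].call(self.name, "parse", text=kwargs["text"])
            return {"token_count": len(parsed["tokens"])}

app = AppAgent("app", "app.py")
parser = ParserAgent("parser", "parser.py")
app.register_peer(parser)

print(app.call("user", "run", text="hello emulated world"))
print(trace)  # who called whom, with which tool and arguments
```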

https://github.com/fblgit/eLLMulator
ajibawa-2023 
posted an update about 1 month ago
C-Code-Large
Dataset: ajibawa-2023/C-Code-Large

C-Code-Large is a large-scale corpus of C programming language source code comprising more than 4 million code samples stored in .jsonl format. The dataset is designed to support research and development in large language model (LLM) pretraining, static analysis, and software engineering automation for the C ecosystem.

By offering a high-volume, language-focused dataset, C-Code-Large enables targeted experimentation in low-level programming, memory-constrained environments, and performance-critical systems, where C continues to be a dominant language.

C-Code-Large addresses the lack of large, curated, C-specific datasets, making it possible to conduct focused research on procedural programming paradigms, manual memory management, and system-level abstractions.

ajibawa-2023 
posted an update about 2 months ago
Cpp-Code-Large
Dataset: ajibawa-2023/Cpp-Code-Large

Cpp-Code-Large is a large-scale corpus of C++ source code comprising more than 5 million lines of C++ code. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and static program analysis for the C++ ecosystem.

By providing a high-volume, language-specific corpus, Cpp-Code-Large enables systematic experimentation in C++-focused model training, domain adaptation, and downstream code understanding tasks.

Cpp-Code-Large addresses the need for a dedicated C++-only dataset at substantial scale, enabling focused research across systems programming, performance-critical applications, embedded systems, game engines, and large-scale native software projects.
ajibawa-2023 
posted an update about 2 months ago
Python-Code-Large
Dataset: ajibawa-2023/Python-Code-Large

Python-Code-Large is a large-scale corpus of Python source code comprising more than 2 million rows of Python code. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and program analysis for the Python ecosystem.

By providing a high-volume, language-specific corpus, Python-Code-Large enables systematic experimentation in Python-focused model training, domain adaptation, and downstream code understanding tasks.

Python-Code-Large addresses the need for a dedicated Python-only dataset at substantial scale, enabling focused research across data science, backend systems, automation, scientific computing, and AI-driven Python environments.
ajibawa-2023 
posted an update 2 months ago
PHP-Code-Large

Dataset: ajibawa-2023/PHP-Code-Large

PHP-Code-Large is a large-scale corpus of PHP source code comprising more than 12 million lines of PHP code. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and static program analysis for the PHP ecosystem.

By providing a high-volume, language-specific corpus, PHP-Code-Large enables systematic experimentation in PHP-focused model training, domain adaptation, and downstream code understanding tasks.

PHP-Code-Large addresses the need for a dedicated PHP-only dataset at substantial scale, enabling focused research across backend systems, CMS platforms, APIs, and full-stack PHP environments.
ajibawa-2023 
posted an update 2 months ago
JavaScript-Code-Large
Dataset: ajibawa-2023/JavaScript-Code-Large

JavaScript-Code-Large is a large-scale corpus of JavaScript source code comprising around 5 million JavaScript files. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and program analysis for the JavaScript ecosystem.

By providing a high-volume, language-specific corpus, JavaScript-Code-Large enables systematic experimentation in JavaScript-focused model training, domain adaptation, and downstream code understanding tasks.

JavaScript-Code-Large addresses the need for a dedicated JavaScript-only dataset at substantial scale, enabling focused research across frontend, backend, and full-stack JavaScript environments.
ajibawa-2023 
posted an update 2 months ago
Java-Code-Large (ajibawa-2023/Java-Code-Large)

Java-Code-Large is a large-scale corpus of publicly available Java source code comprising more than 15 million Java code samples. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and program analysis.

By providing a high-volume, language-specific corpus, Java-Code-Large enables systematic experimentation in Java-focused model training, domain adaptation, and downstream code understanding tasks.
ajibawa-2023 
posted an update about 1 year ago
Hi all, I recently released two audio datasets, both generated from my earlier dataset: ajibawa-2023/Children-Stories-Collection

First audio dataset: https://huggingface.co/datasets/ajibawa-2023/Audio-Children-Stories-Collection-Large has 5,600+ stories in .mp3 format.

Second audio dataset: https://huggingface.co/datasets/ajibawa-2023/Audio-Children-Stories-Collection has 600 stories in .mp3 format.
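
A minimal sketch of loading and resampling the stories with the datasets library. The split name and the audio column name are assumptions (they are not stated in the post), so check the features first:

```python
from datasets import load_dataset, Audio

ds = load_dataset("ajibawa-2023/Audio-Children-Stories-Collection", split="train")
print(ds.features)  # confirm the real column names before casting

# Assuming the .mp3 files sit in a column named "audio":
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
sample = ds[0]["audio"]
print(sample["sampling_rate"], len(sample["array"]))
```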
fblgit 
posted an update over 1 year ago
Introducing miniclaus 1.5B, a tiny but powerful model. Trained with MagPie and based on the Qwen2.5 1.5B model, it performs very well on many tasks, scoring at the top of its category, with impressive results:
* MATH Hard 9.81
* MMLU-Pro 29.37
* GPQA 29.19
* MUSR 42.85
* BBH 42.04

Available already in the hub:
fblgit/miniclaus-qw1.5B-UNAMGS
fblgit 
posted an update over 1 year ago
Cybertron is back:

Today we released the newest version of Cybertron: V4, based on Qwen2.5 7B and trained on MagPie. It scores as the #1 LLM in the 7B & 8B class.

The model hasn't gone through DPO, so the weights are in good shape to welcome further training sessions and optimizations.
Enjoy it in the hub as usual:
fblgit/cybertron-v4-qw7B-MGS
ajibawa-2023 
posted an update over 1 year ago
New Dataset: Software-Architecture
Link: ajibawa-2023/Software-Architecture

I am releasing a large dataset covering topics related to software architecture. This dataset consists of around 450,000 lines of data in .jsonl format.

I have included the following topics:

Architectural Frameworks
Architectural Patterns for Reliability
Architectural Patterns for Scalability
Architectural Patterns
Architectural Quality Attributes
Architectural Testing
Architectural Views
Architectural Decision-Making
Advanced Research
Cloud-Based Architectures
Component-Based Architecture
Data Architecture
Emerging Trends
Event-Driven Architecture
Evolvability and Maintainability
Microservices and Monolithic
Microservices Architecture
Security Architecture
Service-Oriented Architecture
Software Design Principles
and Many More!

This dataset is useful for LLM development, especially for those building LLMs focused on software development. It should also be very useful to researchers.
fblgit 
posted an update almost 2 years ago
Introducing UNA-ThePitbull Series

We are happy to announce the release of our latest model, UNA-ThePitbull, the most powerful model below 70B in the industry. In this new generation, inspired by our previous Beagle series, we curated a model that nicely balances EQ and IQ. It was trained with some of the latest datasets, including:
* Replete-AI/code_bagel_hermes-2.5
* mlabonne/orpo-dpo-mix-40k
* jondurbin/py-dpo-v0.1
Available in the hub at fblgit/UNA-ThePitbull-21.4B-v2, and you can grab quantized versions sponsored by @bartowski at bartowski/UNA-ThePitbull-21.4B-v2-GGUF, fully compatible with Ollama, llama.cpp, etc.

UNA
In this case we tried something new: alternating uniformity across both the MLP and attention layers, reducing computational requirements while keeping a highly performant result.

We trained it under these terms:
* ThePitbull-v1 as base: SFT maxLR 1e-4 minLR 5e-5 for 1 Epoch
* DPO maxLR 1e-4 minLR 5e-5 for 1 Epoch
You can continue the training by simply using a 5e-5 max LR and 0 warmup steps; this should minimize catastrophic forgetting in the model.
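
A minimal PyTorch sketch of that continued-training setup. Only the 5e-5 max LR and zero warmup come from the post; the cosine decay, eta_min, and step count are assumptions and the model is a placeholder:

```python
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the actual LLM
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # max LR from the post
total_steps = 1000  # assumed length of the continued-training run
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_steps, eta_min=1e-6  # eta_min is an assumption
)

for step in range(total_steps):
    # ... forward pass, loss.backward() on your SFT/DPO batches, then:
    optimizer.step()   # placeholder step; in real training this applies gradients
    scheduler.step()   # zero warmup: LR starts at 5e-5 and decays from there
```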

Remember, if you do so, please include a Pitbull picture on your model and cite :) Have fun!
ajibawa-2023 
posted an update almost 2 years ago
Thank you very much, HF team, for accepting me! I was waiting for a very long time. Thank you!
fblgit 
posted an update about 2 years ago
Over the past week, I've been putting Claude through its paces, focusing primarily on productivity tasks (you know, the good old BAU – Business As Usual).

1. Python/Torch/Transformers/AI/ML
Right off the bat, I threw some complex AI/ML tasks at Claude, and I must say, it handled them with finesse. It even caught a few things that GPT missed! However, let's not get too carried away – we're not quite at the auto-code level just yet.

2. Brainstorming
This is where Claude falls a bit short. It seems to be more grounded than its competitors, which might not be ideal for generating novel ideas. If you're looking for a brainstorming partner, you might want to look elsewhere.

3. Attention
Despite the claims of super-large attention in the paper, Claude's "forgetting" mechanism seems to be more pronounced. It tends to miss entire chunks of information rather than just specific details like GPT does.

4. Following / Tasks
I hit a roadblock when Claude couldn't generate a LaTeX document. It's not the best at following complex, multi-step tasks.

5. Hallucinations
Oh boy, does Claude hallucinate! And when it does, it's on a whole new level of nonsense. The hallucinations seem to align with its grounded nature, making them even more convincing within the context of the prompt.

6. Sycophancy
Claude is quite the people-pleaser. I've found that using an adversarial brainstorming approach is more beneficial and time-efficient, as it forces me to highlight Claude's mistakes rather than letting it focus on being a sweet, pleasant minion.

7. Interface / UI
There's definitely room for improvement here. Basic features like stepping back on a prompt and stopping generation with the ESC key are missing. These are essential for extracting and composing content effectively.

Despite these limitations, I firmly believe that Claude is currently the #1
fblgit 
posted an update about 2 years ago
Senku-70B is still undefeated on EQ-Bench; the latest updates from the author show a further increase in performance, reaching a new score of 85.09.

This new mark outperforms some GPT-4 models, further closing the very thin gap between open-community LLMs and closed-source models.

ShinojiResearch/Senku-70B-Full
fblgit 
posted an update about 2 years ago
Introducing UNA-SimpleSmaug-34b:

Based on Smaug-34B-v0.1, it slightly outperforms its base model, with improved math and reasoning thanks to the simple-math dataset.
The model exhibits great performance across diverse tasks, with excellent, balanced behaviour.
It scores 77.41 average on the Leaderboard, landing in the #1 position among 34B models.

Available in the hub already:
fblgit/UNA-SimpleSmaug-34b-v1beta
fblgit/simple-math

In this case, we applied UNA to the attention layers of the model while performing SFT with simple-math on high-complexity generated mathematics data, demonstrating the effect of simple-math on LLMs.