This project is a high-performance UTF-8 encoder written in JavaScript, with a matching decoder planned. It currently implements the encoding half of the pipeline, converting text to UTF-8 byte sequences; decoding those bytes back to the original text is planned for a future update. The project aims to demonstrate the inner workings of the UTF-8 encoding format while providing a robust utility for working with encoded text data.
- Complete UTF-8 Encoding Pipeline:
  - Convert strings to UTF-8 encoded byte sequences.
  - Handle any Unicode code point, including those that encode to multiple bytes.
  - Encode text data efficiently.
- UTF-8 Decoding (Planned for Future Updates):
  - Decode UTF-8 encoded byte sequences back to their original string form.
- Modular Architecture:
  - Each step of the UTF-8 encoding and decoding process is divided into small, reusable modules (encoding, decoding, etc.).
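To make the encoding step concrete, here is a minimal sketch of how a code point can be mapped to one to four UTF-8 bytes. The function name `encodeToUTF8Sketch` is hypothetical and this is not the project's actual `encoder.js`; it only illustrates the standard bit layout, and it assumes lone surrogates should be replaced with U+FFFD.

```javascript
// Hypothetical sketch of UTF-8 encoding; not the project's real encoder.js.
function encodeToUTF8Sketch(text) {
  const bytes = [];
  // for...of walks the string by code point, so a surrogate pair arrives as one character.
  for (const ch of text) {
    let cp = ch.codePointAt(0);
    // Assumption: a lone surrogate is replaced with U+FFFD (replacement character).
    if (cp >= 0xd800 && cp <= 0xdfff) cp = 0xfffd;

    if (cp < 0x80) {
      // 1 byte: 0xxxxxxx
      bytes.push(cp);
    } else if (cp < 0x800) {
      // 2 bytes: 110xxxxx 10xxxxxx
      bytes.push(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {
      // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
      bytes.push(0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
    } else {
      // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
      bytes.push(
        0xf0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3f),
        0x80 | ((cp >> 6) & 0x3f),
        0x80 | (cp & 0x3f)
      );
    }
  }
  return bytes;
}

console.log(encodeToUTF8Sketch('é'));  // [195, 169]
console.log(encodeToUTF8Sketch('€'));  // [226, 130, 172]
console.log(encodeToUTF8Sketch('😀')); // [240, 159, 152, 128]
```

The number of bytes depends only on the size of the code point, which is why ASCII text stays one byte per character while emoji take four.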
The project is organized into the following directories:
utf8_encoder/
├── src/
│ ├── encoding/ # UTF-8 encoding logic
│ │ ├── encoder.js # Converts string to UTF-8 encoded bytes
│ │ └── decoder.js # Decodes UTF-8 bytes back to string (planned)
│ ├── utils/ # Helper utilities (byte and string manipulations)
│ └── index.js # Entry point for running encoding/decoding
├── test/ # Unit tests for core functionality
├── package.json # Project dependencies and scripts
└── README.md # Project documentation
- Node.js (v16.x or higher)
- npm (Node Package Manager)
- Clone the repository:
git clone https://github.com/pawvan/utf8_encoder.git
cd utf8_encoder
- Install project dependencies:
npm install
- Test the setup: Ensure everything is set up correctly by running:
node src/index.js
To encode a string to its UTF-8 byte representation:
- Prepare a string: Choose the string you want to encode, for example "Hello, World!".
- Run the encoding process:
node src/index.js encode "Hello, World!"
- The script will output the UTF-8 encoded bytes for the provided string.
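For readers curious how an entry point might handle the `encode` command, the following is a rough sketch only. The function `runCli`, the usage message, and the JSON output format are all assumptions rather than the behavior of the real `src/index.js`, and the built-in `TextEncoder` stands in for the project's own encoder.

```javascript
// Hypothetical command dispatcher; the real src/index.js may differ.
function runCli(args) {
  const [command, text] = args;
  if (command !== 'encode' || text === undefined) {
    // Assumed usage message, shown when the command is missing or unknown.
    return 'Usage: node src/index.js encode "<text>"';
  }
  // Stand-in for the project's encoder: the built-in TextEncoder always emits UTF-8.
  const bytes = Array.from(new TextEncoder().encode(text));
  return JSON.stringify(bytes);
}

// process.argv holds [node path, script path, ...user arguments].
console.log(runCli(process.argv.slice(2)));
```

Keeping the dispatch logic in a function that returns a string, rather than printing directly, makes it easy to unit test without spawning a process.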
Here’s an example of encoding a string ("Hello, World!") to UTF-8:
import { encodeToUTF8 } from './encoding/encoder';
// Encode the string to UTF-8
const utf8Bytes = encodeToUTF8('Hello, World!');
console.log(utf8Bytes); // Output: [72, 101, 108, 108, 111, 44, 32, 87, 111, 114, 108, 100, 33]
A future update will include decoding functionality to reverse the process, turning UTF-8 encoded byte arrays back into the original text string.
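Until that lands, the sketch below illustrates one way the reverse direction could work. It is not the planned `decoder.js`; the name `decodeFromUTF8Sketch` is hypothetical, and the choice to throw a `RangeError` on malformed input (rather than substituting U+FFFD) is an assumption made to keep the example short.

```javascript
// Smallest code point that legitimately needs 1, 2, 3 or 4 bytes (used to reject overlong forms).
const MIN_CODE_POINT = [0x0, 0x80, 0x800, 0x10000];

// Hypothetical sketch of UTF-8 decoding; not the project's planned decoder.js.
function decodeFromUTF8Sketch(bytes) {
  let text = '';
  let i = 0;
  while (i < bytes.length) {
    const lead = bytes[i];
    let codePoint;
    let extra; // number of continuation bytes that must follow the lead byte
    if (lead < 0x80) { codePoint = lead; extra = 0; }
    else if ((lead & 0xe0) === 0xc0) { codePoint = lead & 0x1f; extra = 1; }
    else if ((lead & 0xf0) === 0xe0) { codePoint = lead & 0x0f; extra = 2; }
    else if ((lead & 0xf8) === 0xf0) { codePoint = lead & 0x07; extra = 3; }
    else throw new RangeError(`Invalid lead byte at index ${i}`);

    if (i + extra >= bytes.length) {
      throw new RangeError(`Truncated sequence at index ${i}`);
    }
    for (let k = 1; k <= extra; k++) {
      const next = bytes[i + k];
      // Every continuation byte must look like 10xxxxxx.
      if ((next & 0xc0) !== 0x80) {
        throw new RangeError(`Invalid continuation byte at index ${i + k}`);
      }
      codePoint = (codePoint << 6) | (next & 0x3f);
    }
    const isSurrogate = codePoint >= 0xd800 && codePoint <= 0xdfff;
    if (codePoint < MIN_CODE_POINT[extra] || codePoint > 0x10ffff || isSurrogate) {
      throw new RangeError(`Invalid code point at index ${i}`);
    }
    text += String.fromCodePoint(codePoint);
    i += extra + 1;
  }
  return text;
}

console.log(decodeFromUTF8Sketch([72, 101, 108, 108, 111])); // Hello
console.log(decodeFromUTF8Sketch([226, 130, 172]));          // €
```

For production use, Node's built-in `TextDecoder` already handles these cases; a hand-written decoder is mainly useful for learning how the format works.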
We use unit tests to ensure each module works correctly. The tests are located in the test/ directory.
To run the tests:
npm test
import { encodeToUTF8 } from '../src/encoding/encoder';

test('UTF-8 encoding works on a string', () => {
  const result = encodeToUTF8('Hello, World!');
  expect(result).toEqual([72, 101, 108, 108, 111, 44, 32, 87, 111, 114, 108, 100, 33]);
});
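The test above covers only ASCII input. One possible way to extend coverage to multi-byte characters is to compare an encoder against the built-in `TextEncoder`, which always produces UTF-8. The helper names `referenceBytes` and `checkEncoder` and the sample list are hypothetical; the sketch assumes the encoder under test returns an array-like of byte values, as the example output suggests.

```javascript
// Samples chosen to cover 1-, 2-, 3- and 4-byte sequences, plus the empty string.
const SAMPLES = ['Hello, World!', 'é', '€', '😀', ''];

// Reference bytes from the platform's built-in UTF-8 encoder.
function referenceBytes(text) {
  return Array.from(new TextEncoder().encode(text));
}

// Runs encodeFn over every sample and returns the cases where it disagrees with the reference.
function checkEncoder(encodeFn) {
  const failures = [];
  for (const text of SAMPLES) {
    const expected = referenceBytes(text);
    const actual = Array.from(encodeFn(text));
    const same =
      expected.length === actual.length &&
      expected.every((byte, i) => byte === actual[i]);
    if (!same) failures.push({ text, expected, actual });
  }
  return failures;
}

// Demonstration using the reference itself; in the project you would pass encodeToUTF8.
console.log(checkEncoder(referenceBytes).length); // 0
```

Inside a test file, the same idea reduces to asserting that `checkEncoder(encodeToUTF8)` returns an empty array.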
- UTF-8 Decoding: Implement the decoding pipeline (converting UTF-8 byte arrays back to the original string).
- Optimization: Improve the performance of the encoding and decoding processes.
- Support for Other Encodings: Extend support to handle other encodings (e.g., ASCII, UTF-16).
Contributions are welcome! If you want to contribute to this project, please fork the repository, make your changes, and create a pull request.
- Fork the repo.
- Create a feature branch (git checkout -b feature/my-new-feature).
- Commit your changes (git commit -am 'Add new feature').
- Push to the branch (git push origin feature/my-new-feature).
- Create a new Pull Request.
Please ensure that your code follows the existing style, includes unit tests, and does not break the build.
This project is licensed under the MIT License - see the LICENSE file for details.
- UTF-8 Encoding: Wikipedia - UTF-8
- Huffman Coding: Wikipedia - Huffman Coding
If you have any questions or feedback about this project, feel free to reach out:
- GitHub: github.com/pawvan/utf8_encoder