
Time to see the Vet

in Clinic

This started out of the blue.

It looks like he got into the liquor cabinet.

The only additional symptoms I have are:

I had just completed a run of unit tests with a failure for SKILL_DATA. The write of the data failed: it tried to write 934 bytes but wrote only 704. So there may have been a buffer overrun issue.
The voice command has been dropping out for some reason. A stray sound (like a cough) can turn it back on. But then it drops out again.

The battery was just recharged.

Tell me doc, will he make it? What's the matter?

66 comments


Timothy McCarthy

May 09

Can we reassess the patient again?

Here's the state of affairs: https://youtu.be/cIR9SLpXLNw

Unsuccessful correctives:

I've upgraded the firmware using the Desktop app (1.1.7 and 1.1.9)

I've done a factory reset for both versions. The IMU calibration has been erratic, but I did get it to complete.

I need to ...

... get the bot into a stable working state

... get a valid, reliable build of the codebase

Timothy McCarthy

Replying to

Prof. Farnsworth: "Good news, everyone!"

Bender: "Uh-oh! I don't like the sound of that."

Prof. Farnsworth: "I have two resolutions and one clue ..."

Bender: "Here it comes ..."

Prof. Farnsworth: "... and two apologies"

Bender: "And I'm outta here!"

Apology #1: The shoemaker's wife has no shoes.

Dr. Petoi recently posted a video of a test of a T_SKILL_DATA command with the maximum number of elements that would fill up the newCmd buffer, to demonstrate it could handle the data. In my reply I asked if he could increase the data amount in order to force the buffer overflow code to execute.

I wasn't thinking. I could have done that myself with the test software I wrote. My bot wasn't working and had a hardware problem but I have been testing it with the battery off for some commands for a while. I could have easily written the stress test to overflow the newCmd buffer. But I wasn't thinking.

So my apologies for asking you to do something I could have done myself.

Apology #2: Do the whole job, not just half.

The test is fairly trivial but the results ... well, that's getting ahead of things.

Here's my first version of the test

TEST_F(ftfBittleXProtocol, newCmd_overflow) {
    const unsigned BUFF_LEN{ 2507 };  // 125*20+7 = 2507
    vector < int8_t > symphony(BUFF_LEN, 32);
    cmd_def_t command{ BEEP, "b", "BEEP" };
    command += symphony;
    EXPECT_TRUE(on_command(command));
}

When I ran the test I got a long beep and the output (trimmed for space)

1>[ RUN      ] ftfBittleXProtocol.newCmd_overflow
1>ftBittleX::ftfBittleX::on_send
1>TX command  : b32 32 32 32 32 ... 32 32 32 32
1>write count  : 7522
1>G:\AppDev\Robotics\Petoi\test\ftBittleX\ftBittleX.cpp(423):
    error : Value of: result
1>  Actual: false
1>Expected: true
1>ftSerial.write FAILED expected: 7522 actual: 2304
1>expected    : 7522
1>actual      : 2304
1>G:\AppDev\Robotics\Petoi\test\ftBittleX\ftBittleX.cpp(654):
    error : Value of: result
1>  Actual: false
1>Expected: true
1>G:\AppDev\Robotics\Petoi\test\ftBittleX\ftprotocol.cpp(565):
    error : Value of: on_command(command)
1>  Actual: false
1>Expected: true
1>[  FAILED  ] ftfBittleXProtocol.newCmd_overflow (308 ms)

I was surprised by the number of bytes sent, and we had already seen the write failure message in a previous test. First I cut down the data size to get the byte count closer to the buffer size (I'd figure out why the size was large later). But once I got the byte count down to just below 2507 I was still getting the write failure. So I needed to look more closely into that.

I recalled the posts about the problem of BittleX reading the Bluetooth input and how Bluetooth would break up the data into chunks that caused a BittleX read timeout error. Dr. Petoi said the solution was to increase the timeout value for those commands and asked if I could try that and play with the value. Well, I did try it: I matched the BittleX TIMEOUT_LONG value in the OpenCat_serial code, tested it, and didn't see any effect.

But I didn't "play" with the value. This time I did, by increasing the timeout value by 10 ms. And the result ... no more write failure!

So, my apologies for slacking off and not doing as I'm told.

Resolutions

From my POV, this issue is on the BittleX side where the UART microcode needs to manage the input queue during read. The write timeout is the time to wait for the receiving device to signal ready to receive the next byte. But it could be on the Windows side where it manages the output queue during write.

With that resolved I could explain the size of the data being sent. For an ASCII command like 'b' each data element is a number. An ASCII number has one byte for each digit in the number. I used a 2 digit number (32) for all elements. There is a space between each number. Thus I need 3 bytes for each element. The maximum number of elements I can have is then BUFF_LEN/3. That test gives

1>[ RUN      ] ftfBittleXProtocol.newCmd_overflow
1>ftBittleX::ftfBittleX::on_send
1>TX command  : b32 32 32 32 ... 32 32 32 32
1>write count : 2506
1>expected    : 2506
1>actual      : 2506
1>ftBittleX::ftfBittleX::on_response
1>Command completed (normal)
1>description : BEEP
1>cmd         : b
1>id          : 1
1>data        : 32 32 32 32 ... 32 32 32 32
1>   12984 ms : RX_latency
1>     748 ms : RX_elapsed
1>response    : 2 lines
1>Changing volume to 10/10
1>
1>b
1>
1>response    : end
//...
1>[       OK ] ftfBittleXProtocol.newCmd_overflow (14792 ms)
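The write counts in the logs above line up with the "3 bytes per element" reasoning if you also count one byte for the command token. Here is a minimal sketch of that arithmetic. The helper name is hypothetical, and the idea that every element (including the last) is followed by exactly one separator or terminator byte is an assumption inferred from the logged counts, not something taken from the test harness source.

```cpp
// Sketch only: ascii_command_bytes is a hypothetical helper, not harness code.
// Assumption: 1 token byte, then each element costs its digits plus one
// separator/terminator byte.
constexpr unsigned BUFF_LEN = 2507;  // 125*20+7, as in the tests

constexpr unsigned ascii_command_bytes(unsigned elements, unsigned digits_per_element) {
    return 1 + elements * (digits_per_element + 1);
}

// First version of the test: BUFF_LEN two-digit elements.
static_assert(ascii_command_bytes(BUFF_LEN, 2) == 7522, "matches the failing write count");

// Largest element count that stays inside the buffer: BUFF_LEN / 3 = 835.
static_assert(ascii_command_bytes(BUFF_LEN / 3, 2) == 2506, "matches the passing write count");

// One more element pushes the byte count past BUFF_LEN.
static_assert(ascii_command_bytes(BUFF_LEN / 3 + 1, 2) > BUFF_LEN, "one extra element overflows");
```

If the values used more or fewer digits (say 5 or 100 instead of 32), the element budget would change accordingly, which is why the ASCII and binary versions of the test need different vector sizes.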

To force the overflow condition I add one element to the data.

vector < int8_t > symphony(BUFF_LEN / 3 + 1, 32);	// overflow

to get the output I'm looking for

1>[ RUN      ] ftfBittleXProtocol.newCmd_overflow
1>ftBittleX::ftfBittleX::on_send
1>TX command  : b32 32 32 32 ... 32 32 32 32
1>write count : 2509
1>expected    : 2509
1>actual      : 2509
1>ftBittleX::ftfBittleX::on_response
1>Command completed (normal)
1>description : BEEP
1>cmd         : b
1>id          : 1
1>data        : 32 32 32 32 ... 32 32 32 32
1>      46 ms : RX_latency
1>     750 ms : RX_elapsed
1>response    : 2 lines
1>OVFb
1>
1>response    : end

This is progress. I see the overflow response and I hear the error beep. I also see confirmation of my calculation of the response time (750 ms). The RX_latency value is an indicator of how long it took the code to detect the overflow.

But what about the rest of it? The overflow code replaces the command in error with the "standup" command and lets it get executed.

To capture that, I need to act as if I had sent the "standup" command and look for its response.

if (!HasFailure()) {	// process overflow standup command
    cmd_def_t cmd_standup{ SKILL, "k", "SKILL" };
    cmd_standup += string("up");
    EXPECT_TRUE(on_response(cmd_standup));
}

to get the output:

1>ftBittleX::ftfBittleX::on_response
1>Command completed (normal)
1>description : SKILL
1>cmd         : k
1>id          : 9
1>data        : up
1>       0 ms : RX_latency
1>      23 ms : RX_elapsed
1>response    : 3 lines
1>up
1>
1>k
1>
1>k
1>
1>response    : end

We're not done yet. This demonstrates that an ASCII command overflows properly. A binary command should behave identically. If so, the binary test should mirror the ASCII test and be trivial to write:

TEST_F(ftfBittleXProtocol, BIN_newCmd_overflow) {
    const unsigned BUFF_LEN{ 2507 };  // 125*20+7 = 2507
    vector < int8_t > symphony(BUFF_LEN + 1, 32);	// overflow
    cmd_def_t command{ BEEP_BIN, "B", "BEEP" };
    command += symphony;
    unsigned latency{ 14000 };
    EXPECT_TRUE(on_command(command, latency));

    if (!HasFailure()) {	// process overflow standup command
	    cmd_def_t cmd_standup{ SKILL, "k", "SKILL" };
	    cmd_standup += string("up");
	    EXPECT_TRUE(on_response(cmd_standup));
    }
}

the results:

1>ftBittleX::ftfBittleX::on_send
1>TX command  : 4220202020... 202020207e
1>write count : 2510
1>expected    : 2510
1>actual      : 2510
1>ftBittleX::ftfBittleX::on_response
1>Command completed (normal)
1>description : BEEP_BIN
1>cmd         : B
1>id          : 3
1>data        : 32323232 ... 32323232
1>      46 ms : RX_latency
1>     749 ms : RX_elapsed
1>response    : 2 lines
1>OVFB
1>
1>response    : end
1>ftBittleX::ftfBittleX::on_response
1>Command completed (normal)
1>description : SKILL
1>cmd         : k
1>id          : 9
1>data        : up
1>       0 ms : RX_latency
1>     228 ms : RX_elapsed
1>response    : 3 lines
1>up
1>
1>k
1>
1>k
1>
1>response    : end

More progress. I was also able to expose a known issue in the binary data stream. A binary command is terminated by the tilde character ('~'). I believe the documentation cautions you to avoid using a tilde as a data value. The obvious reason is that the tilde will be interpreted as the end of the command but the command processing will see it as a data value and "strange" things may happen.
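To make the framing concrete, here is a small sketch of the binary byte count and of a pre-send check for a stray tilde in the payload. The helper names are hypothetical and are not part of OpenCat or of my test harness. The framing (one token byte, one byte per element, one '~' terminator) is read off the TX dump above, which starts with 42 ('B') and ends with 7e ('~').

```cpp
// Sketch only: helper names are hypothetical.
constexpr unsigned char TERMINATOR = '~';  // 0x7e ends a binary command

// Bytes on the wire: token + one byte per element + terminator.
constexpr unsigned binary_command_bytes(unsigned elements) {
    return 1 + elements + 1;
}

// Index of the first payload byte equal to the terminator, or -1 if none.
// A hit means the receiver would see the command end early at that index.
constexpr int first_terminator(const unsigned char* data, int count) {
    for (int i = 0; i < count; ++i) {
        if (data[i] == TERMINATOR) return i;
    }
    return -1;
}

constexpr unsigned char safe_payload[]  = { 32, 32, 32, 32 };
constexpr unsigned char risky_payload[] = { 32, 126, 32, 32 };  // 126 == '~'

static_assert(binary_command_bytes(2507 + 1) == 2510, "matches the logged write count");
static_assert(first_terminator(safe_payload, 4) == -1, "no tilde in the payload");
static_assert(first_terminator(risky_payload, 4) == 1, "tilde would cut the command short");
```

A sender could refuse (or at least warn) when first_terminator returns a non-negative index, rather than letting "strange" things happen on the bot.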

As it turns out, my first version of the binary beep test used the same data length and values as the ASCII version

TEST_F(ftfBittleXProtocol, ASC_newCmd_overflow) {
    const unsigned BUFF_LEN{ 2507 };  // 125*20+7 = 2507
    vector < int8_t > symphony(BUFF_LEN / 3 + 1, 32);	// no overflow
    cmd_def_t command{ BEEP_BIN, "B", "BEEP_BIN" };
    command += symphony;

    //...

This won't overflow the newCmd buffer but is a malformed command. The buzzer tone never stops.

The data for the beep command comes in pairs, "note duration". We can check whether the test data is a properly formed data stream by subtracting the command suffix ('~') from the length of the data stream we provide, (BUFF_LEN / 3 + 1). The result must be an even value for a properly formed data stream. Here (BUFF_LEN / 3 + 1) - 1 = 836 - 1 = 835, which is odd, so the stream is malformed.

This is more directly shown by simply using BUFF_LEN - 1 for the data

vector < int8_t > symphony(BUFF_LEN - 1, 32);	// malformed

This yields 3 data points for testing:

vector < int8_t > malformed_symphony(BUFF_LEN - 1, 32);
vector < int8_t > no_overflow_symphony(BUFF_LEN, 32);
vector < int8_t > overflow_symphony(BUFF_LEN + 1, 32);

I don't recommend running the malformed test. The buzzer never ends and "annoys the pig."

Here's the exciting footage of the T_BEEP buffer overflow test: https://youtu.be/vbEehKzfEt4

These results resolve two issues:

Why does the Windows WriteFile call fail? (Thanks to Doc Petoi)
What happens when you do overflow the newCmd buffer?

The Last Test

We need to do one last test before we put this issue to rest. We need to verify that when the T_SKILL_DATA command overflows the newCmd buffer, the program doesn't crash. Logically, this should be no different than the binary BEEP command, but it is the issue under test so we want to test it explicitly. We just have to change the command we send; the data and its format don't matter. All that matters is the size of the data. The response should be the same as that for the binary BEEP command. There's an interesting twist to this test.

TEST_F(ftfBittleXProtocol, SKILL_DATA_newCmd_overflow) {
    const unsigned BUFF_LEN{ 2507 };  // 125*20+7 = 2507
    vector < int8_t > nonsense(BUFF_LEN + 1, 32);	// overflow
    cmd_def_t command{ SKILL_DATA, "K", "SKILL_DATA" };
    command += nonsense;
    unsigned latency{ 14000 };
    EXPECT_TRUE(on_command(command, latency));

    if (!HasFailure()) {	// process overflow standup command
	    cmd_def_t cmd_standup{ SKILL, "k", "SKILL" };
	    cmd_standup += string("up");
	    EXPECT_TRUE(on_response(cmd_standup));
    }
}

The results confirm our analysis but uncover a problem in processing some commands

1>[ RUN      ] ftfBittleXProtocol.SKILL_DATA_newCmd_overflow
1>ftBittleX::ftfBittleX::on_send
1>TX command  : 4b20202020 ... 202020207e
1>write count : 2510
1>expected    : 2510
1>actual      : 2510
1>ftBittleX::ftfBittleX::on_response
1>Command completed (toggle)
1>description : SKILL_DATA
1>cmd         : K
1>id          : 10
1>data        : 32323232 ... 32323232
1>      45 ms : RX_latency
1>    1027 ms : RX_elapsed
1>response    : 4 lines
1>OVFK
1>
1>up
1>
1>k
1>
1>response    : end
1>ftBittleX::ftfBittleX::on_response
1>ftBittleX::ftfBittleX::on_response elapse timeout!
1>description : SKILL
1>cmd         : k
1>id          : 9
1>data        : up
1>       0 ms : RX_latency
1>    2028 ms : RX_elapsed
1>response    : 1 lines
1>k
1>
1>response    : end

When most commands complete they echo the command to the serial monitor. However, some commands, like T_SKILL_DATA, do some software "sleight-of-hand" and echo back one or more lowercase values of the command. The T_SKILL_DATA is one of these and normally responds with two 'k' values.

But it doesn't do that here. For overflow it responds with "OVFK". It then issues the "kup" command, which does respond with two 'k' values when it completes. So the test harness becomes confused about the responses. This is part of why I'm not fond of behavior I didn't ask for. It's inconsistent. I think an argument can be made for the error handling to be elsewhere. But that's a design issue, not a testing issue. The tests here confirm the expected behavior and not a crash.

Here's the dramatic and astounding video of the T_SKILL_DATA buffer overflow test:

https://youtu.be/DIS5Efdx258

A Possible Clue

As a result of the diagnosis of a possible hardware failure, I started to think about how to test the hardware to confirm the diagnosis. This brought back memories of my early days working in a computer manufacturing plant when I worked with various hardware components and microprocessors. It's been so long since then (40 years) that I've forgotten the day-to-day, simple tasks you do when a hardware issue comes up. But I suddenly recalled one of them, a task so simple you do it by reflex: visual inspection. You look at the component to see if there's a visible flaw. It takes a trained eye to see some flaws but the folks who make them know where to look and can spot them at a glance. The component under suspicion here is the power supply circuit. The documentation identifies where that circuit is located on the board.

So I took a picture of Malfunctioning Eddie's board:

Here's an enlarged view of the power circuit (if I've located it correctly):

My eye isn't educated enough to know if there's a visible flaw. My eye is drawn to one of the two circular soldered connections. The one on the right; the upper right hand corner seems odd to me, as if a connection point has been uncovered due to solder fatigue or something. But Idunno.

But I do think there's someone who could quickly tell if there's an obvious flaw.

Rongzhong Li

Replying to

Thanks for the detailed debugging report. I like how the process is designed.

As to the circuit, I don't see a problem given the picture. The corner belongs to the square-shaped soldering plate.

Have you sent an email to support@petoi.com to request a replacement board?

Timothy McCarthy

Replying to

Thanks. I want to release the code on github (if I can learn how to do it) but I'm concerned that the testing code may be the cause of the failure. I don't want that. But I do feel that a test tool like this is very helpful.

I will send an email off ASAP to get that ball rolling.

Rongzhong Li

May 09

I just tested a behavior with 125 frames, which uses up the 2507 bytes allowed by the bufferLen. You need to change the condition

 || cmdLen >= BUFF_LEN

to

 || cmdLen > BUFF_LEN

or download the latest code on GitHub and upload it via Arduino IDE. The Desktop app won't update very frequently.

The test file can be downloaded at: https://github.com/PetoiCamp/OpenCat/blob/main/SkillLibrary/Bittle/longPushUps_125frames.md

A longer skill may lead to a program crash.
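The boundary case here can be checked at compile time. This is a minimal sketch, not firmware code: the two predicate names are hypothetical, and the only facts taken from the thread are BUFF_LEN = 2507 and that newCmd is allocated with BUFF_LEN + 1 chars.

```cpp
// Sketch only: rejects_with_ge / rejects_with_gt are hypothetical names.
constexpr unsigned BUFF_LEN  = 2507;          // 125 frames * 20 + 7
constexpr unsigned ALLOCATED = BUFF_LEN + 1;  // newCmd = new char[BUFF_LEN + 1]

constexpr bool rejects_with_ge(unsigned cmdLen) { return cmdLen >= BUFF_LEN; }
constexpr bool rejects_with_gt(unsigned cmdLen) { return cmdLen > BUFF_LEN; }

// A 125-frame skill fills the buffer exactly.
static_assert(rejects_with_ge(BUFF_LEN),  ">= reports an overflow for a command that fits");
static_assert(!rejects_with_gt(BUFF_LEN), "> accepts the exactly-full command");

// The exactly-full command plus a trailing '\0' still fits the allocation.
static_assert(BUFF_LEN + 1 <= ALLOCATED, "payload plus terminator fits");

// One byte more is rejected by both forms.
static_assert(rejects_with_ge(BUFF_LEN + 1) && rejects_with_gt(BUFF_LEN + 1), "real overflow");
```

So the change from ">=" to ">" only affects the single case cmdLen == BUFF_LEN, which is exactly the 125-frame skill.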

Rongzhong Li

May 09

Replying to

I think spaceAfterStoringData is always shorter than BUFF_LEN.

Do you mean

if (
    (
        (token == T_SKILL || lowerToken == T_INDEXED_SIMULTANEOUS_ASC || lowerToken == T_INDEXED_SEQUENTIAL_ASC) && cmdLen >= spaceAfterStoringData
    )
    || cmdLen > BUFF_LEN
)

este este

May 09

Replying to

No but the actual result is the same in both cases. It's just that, IMO, the way I wrote it is more readable. After all, when 1 of these 3 tokens are present, you cannot have either cmdLen greater than spaceAfterStoringData or cmdLen greater than BUFF_LEN. Put differently, it doesn't matter that cmdLen is more likely to exceed spaceAfterStoringData, both are bad and must never happen so just group them together for readability. Otherwise, people (like Timothy and me) are scratching our heads wondering why an overflow from spaceAfterStoringData should be checked before an overflow from BUFF_LEN. It doesn't because we need to check both.

Another way of thinking about it is, I've rearranged the code so it translates to this very simple statement:

If one of THESE things is true and one of THOSE things is also true then prevent THAT thing from happening.

So, generally speaking, if I have two ways to code something that have the same result with equivalent (or close enough) performance, I'll always choose the code that is easier to read and explain and that is what jumped out at me here.

I color coded it below, in case that helps. Also, I set both to the ">" since, for reasons already discussed, the ">=" doesn't seem right (though I didn't check exactly how you are calculating spaceAfterStoringData so maybe ">=" there is actually right...)

if (
    (token == T_SKILL || lowerToken == T_INDEXED_SIMULTANEOUS_ASC || lowerToken == T_INDEXED_SEQUENTIAL_ASC) 
    &&
    (cmdLen > spaceAfterStoringData || cmdLen > BUFF_LEN)
)

Rongzhong Li

May 10

Replying to

Thanks for the formatting. It helps to read. There's a logic bug in the colored condition. If the coming token is not in the three cases, the yellow part will be evaluated as false, and the whole condition will never be entered.

I included the >= for spaceAfterStoringData because there might be a '\0' at the end of the copy.
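The difference between the two groupings is easy to demonstrate with a token that is not one of the three listed cases. The sketch below is an illustration only: the function names are hypothetical, the token values are the ones used in the unit tests elsewhere in this thread, and lowerToken is derived with a small local helper rather than the firmware's own code.

```cpp
// Sketch only: firmware_form / grouped_form are hypothetical names.
constexpr unsigned BUFF_LEN = 2507;
constexpr char T_SKILL = 'k';
constexpr char T_SKILL_DATA = 'K';
constexpr char T_INDEXED_SIMULTANEOUS_ASC = 'i';
constexpr char T_INDEXED_SEQUENTIAL_ASC = 'm';

constexpr char to_lower_ascii(char c) {
    return (c >= 'A' && c <= 'Z') ? char(c - 'A' + 'a') : c;
}

constexpr bool is_listed(char token) {
    return token == T_SKILL
        || to_lower_ascii(token) == T_INDEXED_SIMULTANEOUS_ASC
        || to_lower_ascii(token) == T_INDEXED_SEQUENTIAL_ASC;
}

// (listed && cmdLen >= space) || cmdLen > BUFF_LEN
constexpr bool firmware_form(char token, unsigned cmdLen, unsigned space) {
    return (is_listed(token) && cmdLen >= space) || cmdLen > BUFF_LEN;
}

// listed && (cmdLen > space || cmdLen > BUFF_LEN)
constexpr bool grouped_form(char token, unsigned cmdLen, unsigned space) {
    return is_listed(token) && (cmdLen > space || cmdLen > BUFF_LEN);
}

// 'K' is not in the list: only the firmware form still catches a full-buffer overflow.
static_assert(firmware_form(T_SKILL_DATA, BUFF_LEN + 1, BUFF_LEN), "caught");
static_assert(!grouped_form(T_SKILL_DATA, BUFF_LEN + 1, BUFF_LEN), "missed");

// For a listed token the two forms agree on a clear overflow.
static_assert(firmware_form(T_SKILL, 100, 50) && grouped_form(T_SKILL, 100, 50), "both catch it");
```

In other words, the grouped form is more readable but silently drops the catch-all "cmdLen > BUFF_LEN" check for every token outside the list.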

Timothy McCarthy

May 08

Part 3

More Tests

An additional test is to remove the dependence on the order of precedence of the logical operators by adding parentheses to the expression. This ensures the compiler sees the expression as we intend. My assumption is that the actual intent was for the expression to be

bool is_overflowex(int8_t skill_token = T_SKILL)
{
    bool result = (
        (skill_token == token
            || T_INDEXED_SIMULTANEOUS_ASC == lowerToken
            || T_INDEXED_SEQUENTIAL_ASC == lowerToken)
        && (cmdLen >= spaceAfterStoringData
        || cmdLen >= BUFF_LEN));

    return result;
}

This shouldn't produce any new results

TEST_F(utfinput_overflow, on_inputex_SKILL_DATA_overflow_undetected)
{
    token = T_SKILL_DATA;
    spaceAfterStoringData = BUFF_LEN / 2;
    cmdLen = spaceAfterStoringData + 1;
    EXPECT_TRUE(is_overflowex())
        << "Overflow undetected for:\n"
        << setw(4) << token << " : token\n"
        << setw(4) << cmdLen << " : cmdLen\n"
        << setw(4) << spaceAfterStoringData << " : spaceAfterStoringData\n"
        ;
}

1>[ RUN      ] utfinput_overflow.on_inputex_SKILL_DATA_overflow_undetected
1>G:\AppDev\Robotics\Petoi\test\utOpenCatEsp32\readoverflow.cpp(120):
    error : Value of: is_overflowex()
1>  Actual: false
1>Expected: true
1>Overflow undetected for:
1>   K : token
1>  17 : cmdLen
1>  16 : spaceAfterStoringData
1>
1>[  FAILED  ] utfinput_overflow.on_inputex_SKILL_DATA_overflow_undetected (1 ms)

The test of T_SKILL should detect the overflow

TEST_F(utfinput_overflow, is_overflowex_SKILL_overflow_detected)
{
    token = T_SKILL;
    spaceAfterStoringData = BUFF_LEN / 2;
    cmdLen = spaceAfterStoringData + 1;
    EXPECT_TRUE(is_overflowex())
        << "Overflow undetected for:\n"
        << setw(4) << token << " : token\n"
        << setw(4) << cmdLen << " : cmdLen\n"
        << setw(4) << spaceAfterStoringData << " : spaceAfterStoringData\n"
        ;
}

1>[ RUN      ] utfinput_overflow.is_overflowex_SKILL_overflow_detected
1>[       OK ] utfinput_overflow.is_overflowex_SKILL_overflow_detected (0 ms)

These tests demonstrate adding parentheses doesn't fix the defect. The T_SKILL == token comparison cripples the expression.

As a last test, I removed the exception checks and performed a simple overflow test

bool is_overflow_simple()
{
    bool result = BUFF_LEN <= cmdLen;
    return result;
}

TEST_F(utfinput_overflow, is_overflow_simple_SKILL_DATA_overflow_detected)
{
    token = T_SKILL_DATA;
    spaceAfterStoringData = BUFF_LEN / 2;
    cmdLen = BUFF_LEN;
    EXPECT_TRUE(is_overflow_simple())
        << "Overflow undetected for:\n"
        << setw(4) << token << " : token\n"
        << setw(4) << cmdLen << " : cmdLen\n"
        << setw(4) << spaceAfterStoringData << " : spaceAfterStoringData\n"
        ;
}
TEST_F(utfinput_overflow, is_overflow_simple_SKILL_overflow_undetected)
{
    token = T_SKILL;
    spaceAfterStoringData = BUFF_LEN / 2;
    cmdLen = BUFF_LEN;
    EXPECT_TRUE(is_overflow_simple())
        << "Overflow undetected for:\n"
        << setw(4) << token << " : token\n"
        << setw(4) << cmdLen << " : cmdLen\n"
        << setw(4) << spaceAfterStoringData << " : spaceAfterStoringData\n"
        ;
}

1>[ RUN      ] utfinput_overflow.is_overflow_simple_SKILL_DATA_overflow_detected
1>[       OK ] utfinput_overflow.is_overflow_simple_SKILL_DATA_overflow_detected (0 ms)
1>[ RUN      ] utfinput_overflow.is_overflow_simple_SKILL_overflow_undetected
1>[       OK ] utfinput_overflow.is_overflow_simple_SKILL_overflow_undetected (0 ms)

This implies that the test of spaceAfterStoringData should be performed after reading the command. The read_serial function only ensures the raw command is completely read. I would also make the argument that read_serial should return a boolean result.

Conclusion

This ends the first part of the code analysis. I think this demonstrates the existence of the defect in the code. I said this wasn't definitive because there's an accomplice to this defect. Demonstrating that a buffer overflow might happen isn't the same as finding where it does happen. Luckily we know the name of the accomplice: spaceAfterStoringData. Part 2 will focus on tracking him down and finding where the buffer overrun happens.

In the conclusion I'll propose code changes to resolve the issue.

Timothy McCarthy

May 08

Part 2

Let's write a test

Before I spend energy and brain cells on code analysis, I want to write a test to verify my analysis. GTest makes this fairly easy to write. The test I want checks if the conditional statement

if (
    (token == T_SKILL
        || lowerToken == T_INDEXED_SIMULTANEOUS_ASC
        || lowerToken == T_INDEXED_SEQUENTIAL_ASC)
    && cmdLen >= spaceAfterStoringData
    || cmdLen >= BUFF_LEN) {

     //...
}

evaluates to true when a buffer overflow would occur and false otherwise. The function is trivial to write

bool is_overflow()
{
    bool result = (
        (T_SKILL == token
            || T_INDEXED_SIMULTANEOUS_ASC == lowerToken
            || T_INDEXED_SEQUENTIAL_ASC == lowerToken)
        && spaceAfterStoringData <= cmdLen
        || BUFF_LEN <= cmdLen);

    return result;
}

Style: I prefer comparison expressions to read left to right, smaller to larger, with constants on the left, and logical expressions on separate lines.

The function uses global variables token, lowerToken, spaceAfterStoringData, and cmdLen, so we define and initialize them along with the constants they use.

const uint32_t BUFF_LEN{ 32 };
const int8_t T_SKILL{ 'k' };
const int8_t T_SKILL_DATA{ 'K' };
const int8_t T_INDEXED_SIMULTANEOUS_ASC{ 'i' };
const int8_t T_INDEXED_SEQUENTIAL_ASC{ 'm' };
const int8_t T_QUERY{ '?' };

int8_t token{ T_SKILL_DATA };
int8_t lowerToken{ int8_t(tolower(token)) };
uint32_t cmdLen{};
uint32_t spaceAfterStoringData{ BUFF_LEN };

We wrap the function and variables into a class, called a test fixture, that our test will use.

class utfinput_overflow : public testing::Test
{
protected:

    int8_t token{ T_SKILL_DATA };
    int8_t lowerToken{ int8_t(tolower(token)) };
    uint32_t cmdLen{};
    uint32_t spaceAfterStoringData{ BUFF_LEN };
public:
    utfinput_overflow() {}
    ~utfinput_overflow() {}

    bool is_overflow()
    {
        bool result = (
            (T_SKILL == token
                || T_INDEXED_SIMULTANEOUS_ASC == lowerToken
                || T_INDEXED_SEQUENTIAL_ASC == lowerToken)
            && spaceAfterStoringData <= cmdLen
            || BUFF_LEN <= cmdLen);

        return result;
    }
};

The class is derived from a class that GTest provides to handle the interface to the GTest framework. We aren't concerned about the details for that class here but we need it to connect our test to the fixture. The test is trivial

TEST_F(utfinput_overflow, SKILL_DATA_overflow_undetected)
{
    token = T_SKILL_DATA;
    lowerToken = int8_t(tolower(token));
    spaceAfterStoringData = BUFF_LEN/2;
    cmdLen = spaceAfterStoringData + 1;

    EXPECT_TRUE(is_overflow())
        << "Overflow undetected for:\n"
        << setw(4) << token << " : token\n"
        << setw(4) << cmdLen << " : cmdLen\n"
        << setw(4) << spaceAfterStoringData << " : spaceAfterStoringData\n"
        ;
}

We build and run the test and get the output

1>[ RUN      ] utfinput_overflow.SKILL_DATA_overflow_undetected
1>G:\AppDev\Robotics\Petoi\test\utOpenCatEsp32\readoverflow.cpp(70):
    error : Value of: is_overflow()
1>  Actual: false
1>Expected: true
1>Overflow undetected for:
1>   K : token
1>  17 : cmdLen
1>  16 : spaceAfterStoringData
1>
1>[  FAILED  ] utfinput_overflow.SKILL_DATA_overflow_undetected (0 ms)

This confirms our analysis that the expression doesn't do its job. The T_SKILL_DATA command can produce an undetected buffer overflow. It also confirms our analysis of the order of precedence for the logical operators. That's progress.

When we test using the expected T_SKILL token, the test passes.

TEST_F(utfinput_overflow, SKILL_overflow_detected)
{
	token = T_SKILL;
	spaceAfterStoringData = BUFF_LEN / 2;
	cmdLen = spaceAfterStoringData + 1;
	EXPECT_TRUE(is_overflow())
        << "Overflow undetected for:\n"
        << setw(4) << token << " : token\n"
        << setw(4) << cmdLen << " : cmdLen\n"
        << setw(4) << spaceAfterStoringData << " : spaceAfterStoringData\n"
        ;
}

output:

1>[ RUN      ] utfinput_overflow.SKILL_overflow_detected
1>[       OK ] utfinput_overflow.SKILL_overflow_detected (0 ms)

However, the T_SKILL command uses a named, built-in skill and doesn't have variable data. This test case should never happen. If it does happen, it's detected but is a false positive and means that spaceAfterStoringData isn't being maintained properly. That observation will be useful later on.

The test is helpful but could be more helpful and flexible if we could control the token value being sought in addition to the input token. This is easily accomplished by adding a default token value to the is_overflow function.

bool is_overflow(int8_t skill_token = T_SKILL)
{
    bool result = (
        (skill_token == token
            || T_INDEXED_SIMULTANEOUS_ASC == lowerToken
            || T_INDEXED_SEQUENTIAL_ASC == lowerToken)
        && cmdLen >= spaceAfterStoringData
        || cmdLen >= BUFF_LEN);

    return result;
}

The current tests are unaffected and should behave as before but we can now test what would appear to be the obvious fix: change the test expression to use T_SKILL_DATA

TEST_F(utfinput_overflow, SKILL_DATA_overflow_detected)
{
    token = T_SKILL_DATA;
    spaceAfterStoringData = BUFF_LEN / 2;
    cmdLen = spaceAfterStoringData + 1;
    EXPECT_TRUE(is_overflow(T_SKILL_DATA))
        << "Overflow undetected for:\n"
        << setw(4) << token << " : token\n"
        << setw(4) << cmdLen << " : cmdLen\n"
        << setw(4) << spaceAfterStoringData << " : spaceAfterStoringData\n"
        ;
}

output:

1>[ RUN      ] utfinput_overflow.SKILL_DATA_overflow_detected
1>[       OK ] utfinput_overflow.SKILL_DATA_overflow_detected (0 ms)

That's great, right? Not exactly. We need to look at how T_SKILL behaves

TEST_F(utfinput_overflow, SKILL_overflow_undetected)
{
    token = T_SKILL;
    spaceAfterStoringData = BUFF_LEN / 2;
    cmdLen = spaceAfterStoringData + 1;
    EXPECT_TRUE(is_overflow(T_SKILL_DATA))
        << "Overflow undetected for:\n"
        << setw(4) << token << " : token\n"
        << setw(4) << cmdLen << " : cmdLen\n"
        << setw(4) << spaceAfterStoringData << " : spaceAfterStoringData\n"
        ;
}

output:
1>[ RUN      ] utfinput_overflow.SKILL_overflow_undetected
1>G:\AppDev\Robotics\Petoi\test\utOpenCatEsp32\readoverflow.cpp(106):
    error : Value of: is_overflow(T_SKILL_DATA)
1>  Actual: false
1>Expected: true
1>Overflow undetected for:
1>   k : token
1>  17 : cmdLen
1>  16 : spaceAfterStoringData
1>
1>[  FAILED  ] utfinput_overflow.SKILL_overflow_undetected (1 ms)

"Well." sez me, "that's not really a problem cuz T_SKILL doesn't have variable data so there won't be an overflow."

Disturbia

Maybe the programmer didn't make a typing error by using the value T_SKILL. Maybe they did it for a reason that we haven't discovered yet. There may be a case where the T_SKILL command uses the spaceAfterStoringData value.

We don't need to speculate about that because there are other commands that can have a variable number of parameters that aren't included in the expression. In addition to T_INDEXED_SIMULTANEOUS_ASC and T_INDEXED_SEQUENTIAL_ASC there are matching binary versions that can also take a variable number of parameters. How many for all of them? It's not 16 but an arbitrary number, i.e., you can repeat the joint index as long as you wish. There's BEEP and BEEP_BIN. The T_JOINTS command takes a list of joints that can be arbitrarily long and have repeats as well, e.g., j 0 8 9 10 0 8 9 10 0 8 9 10 ... Not to mention any new commands added later on.

The point is the fragile nature of "exception coding" like that of the conditional expression.

More Disturbia.

What's the intent of the statement? We're trying to prevent buffer overflow when reading a command. Does it really matter what the command is that overflows the buffer? The T_READ command should only accept one pin number to read from but that doesn't mean you can't send a long list of pins. I don't know what it does with such a list because my test only checked a single pin but I don't expect it to "brick" the bot.

Up until now I've ignored what happens when a buffer overflow condition is detected. What happens is

if ( /* detect overflow */ ) {
    PTF("OVF");
    beep(5, 100, 50, 5);

    do {
        serialPort->read();
    } while (serialPort->available());

    printToAllPorts(token);

    token = T_SKILL;
    strcpy(newCmd, "up");
    cmdLen = 2;

    return;
}

This doesn't look too bad until you look closer and think about it. Remember, this all happens as the sender is trying to send data and you're trying to read the input.

The PTF macro does I/O to the serial monitor, which is slow and uses the same USB line that the sender is writing to. I believe the USB is full-duplex (it can send and receive at the same time), but it is the speed that is of more concern. It's the call to beep that boggles me:

beep(5, 100, 50, 5)

says, "Play the note 5 for 100 ms, wait for 50 ms, and do that 5 times." That's 150 ms * 5 = 750 ms! Three quarters of a second! This is 5 times longer than our largest timeout for reads. If we timeout after 150 ms, what would you expect to happen on the sender side if we delay for 750 ms?
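The arithmetic is simple enough to pin down in a couple of lines. This is a sketch of my reading of the beep parameters (duration, pause, repeat count) as described above, with a hypothetical helper name; it is not the firmware's beep implementation. The 150 ms read timeout is the figure quoted above.

```cpp
// Sketch only: beep_total_ms is a hypothetical helper, not the firmware's beep().
constexpr unsigned beep_total_ms(unsigned duration_ms, unsigned pause_ms, unsigned repeats) {
    return (duration_ms + pause_ms) * repeats;
}

constexpr unsigned READ_TIMEOUT_MS = 150;  // the largest read timeout mentioned above

// beep(5, 100, 50, 5): 100 ms tone + 50 ms pause, five times.
static_assert(beep_total_ms(100, 50, 5) == 750, "three quarters of a second");

// The beep blocks for five full read timeouts before the input is drained.
static_assert(beep_total_ms(100, 50, 5) / READ_TIMEOUT_MS == 5, "5x the read timeout");
```

That 750 ms is close to the RX_elapsed values of roughly 750 ms that show up in the overflow test logs elsewhere in this thread.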

Only at this point do we drain the input buffer; we probably should have done that first.

The command response is sent out by echoing the token to the serial port.

Finally, for added emphasis that something has gone wrong, the skill command for "stand up" is placed into the command buffer for subsequent processing by reaction(). From my POV, out of all the poses to take to indicate a problem, "stand up" isn't the one I'd select. "Rest" I would think might be more indicative. However, I don't like doing anything you weren't told to do. Doing nothing would be a safer course of action.

end Part 2

Timothy McCarthy

May 08

Preface: This is a 3 part post. You should leave now or settle in.

Part 1

Corrupted memory location

Before analyzing the code, I wanted to get an idea of the kind of memory that would be corrupted by a buffer overflow. This might help gauge the severity and behavior symptoms. The definition of the buffer newCmd is

char *newCmd = new char[BUFF_LEN + 1];

In my experience, dynamic memory corruption is the diagnostic "kiss of death". Without a debugger, unless you're an expert on memory layout, et al., it's fruitless to try to trace. As the saying goes, "And it annoys the pig."

Analysis

The expression

(token == T_SKILL
    || lowerToken == T_INDEXED_SIMULTANEOUS_ASC
    || lowerToken == T_INDEXED_SEQUENTIAL_ASC)
&& cmdLen >= spaceAfterStoringData
|| cmdLen >= BUFF_LEN

needs to be examined in some detail in order to understand the conditions necessary to replicate the defect. First, I want to reorder the expression terms for clarity.

(T_SKILL == token
    || T_INDEXED_SIMULTANEOUS_ASC == lowerToken
    || T_INDEXED_SEQUENTIAL_ASC == lowerToken)
&& spaceAfterStoringData <= cmdLen
|| BUFF_LEN <= cmdLen

This code style brings "what I'm looking for" to the forefront when reading the code. This helps me keep my mental model of the algorithm being used.

The test case of the expression is: T_SKILL_DATA == token. The questions are:

What is the intent of the code?

How is the expression evaluated?

How do the values of lowerToken, spaceAfterStoringData, and cmdLen affect the expression?

What is the intent of the code?

The intent is to detect a buffer overflow for those commands that accept a variable length argument list.

How is the expression evaluated?

The code depends upon the order of precedence of the logical operators. (Parentheses would have made the intent unambiguous.) The order of precedence for the logical operators is: NOT (!), AND (&&), OR (||).

The expression is then equivalent to

(
    (
        T_SKILL == token
        || T_INDEXED_SIMULTANEOUS_ASC == lowerToken
        || T_INDEXED_SEQUENTIAL_ASC == lowerToken
    )
    && (spaceAfterStoringData <= cmdLen)
)
|| (BUFF_LEN <= cmdLen)

The precondition, T_SKILL_DATA == token ('K'), controls the value of lowerToken, i.e., T_SKILL == lowerToken ('k'). Substituting these values into the first term of the AND expression gives

(T_SKILL == T_SKILL_DATA
|| T_INDEXED_SIMULTANEOUS_ASC == T_SKILL
|| T_INDEXED_SEQUENTIAL_ASC == T_SKILL)

which evaluates to false.

This causes the entire expression to collapse to the last term

(BUFF_LEN <= cmdLen)

which is the check for buffer overflow. Isn't that what we want, after all?

Not exactly. The reason is that the first term of the AND expression should be

(T_SKILL_DATA == T_SKILL_DATA
|| T_INDEXED_SIMULTANEOUS_ASC == T_SKILL
|| T_INDEXED_SEQUENTIAL_ASC == T_SKILL)

which evaluates to true. So the AND expression should be

(TRUE) && (spaceAfterStoringData <= cmdLen)

This shows the effect of spaceAfterStoringData on the test for buffer overflow. If cmdLen exceeds spaceAfterStoringData then buffer overflow will occur. Note that the code distinguishes between BUFF_LEN and spaceAfterStoringData. It certainly appears their relationship is: spaceAfterStoringData <= BUFF_LEN. The initialization of spaceAfterStoringData confirms this.

int spaceAfterStoringData = BUFF_LEN;
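The difference between the two forms of the expression can be checked off the robot. The sketch below is mine, not firmware code: 'k' and 'K' are the token values given above, while 'i' and 'm' for the two indexed ASCII tokens and the BUFF_LEN value of 2507 are assumptions for illustration. The function names are hypothetical. The grouping that operator precedence implies is written out with parentheses.

```cpp
#include <cassert>
#include <cctype>

constexpr char T_SKILL = 'k';
constexpr char T_SKILL_DATA = 'K';
constexpr char T_INDEXED_SIMULTANEOUS_ASC = 'i';  // assumed value
constexpr char T_INDEXED_SEQUENTIAL_ASC = 'm';    // assumed value
constexpr int BUFF_LEN = 2507;                    // assumed for illustration

char toLowerToken(char token) {
    return static_cast<char>(std::tolower(static_cast<unsigned char>(token)));
}

// The check as written: the first term compares token against T_SKILL.
bool isOverflowAsWritten(char token, int cmdLen, int spaceAfterStoringData) {
    const char lowerToken = toLowerToken(token);
    return ((token == T_SKILL
             || lowerToken == T_INDEXED_SIMULTANEOUS_ASC
             || lowerToken == T_INDEXED_SEQUENTIAL_ASC)
            && cmdLen >= spaceAfterStoringData)
           || cmdLen >= BUFF_LEN;
}

// The variant argued for above: the first term compares against T_SKILL_DATA.
bool isOverflowSkillData(char token, int cmdLen, int spaceAfterStoringData) {
    const char lowerToken = toLowerToken(token);
    return ((token == T_SKILL_DATA
             || lowerToken == T_INDEXED_SIMULTANEOUS_ASC
             || lowerToken == T_INDEXED_SEQUENTIAL_ASC)
            && cmdLen >= spaceAfterStoringData)
           || cmdLen >= BUFF_LEN;
}
```

With token 'K', cmdLen 17 and spaceAfterStoringData 16, the first function returns false and the second returns true. Both return true once cmdLen reaches BUFF_LEN, which is the raw overflow case.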

We want to know how and when spaceAfterStoringData is calculated. We know it doesn't happen as the input is read, so it should be calculated before the call to read_serial.

Disturbia

The data is coming through the protocol so we don't know the length or content until after we've received it. (We'd have to be psychic to do that.) However, we might, at the start of reading a command, reset spaceAfterStoringData back to its initialization state. (Pssst! We don't.) But that doesn't really help us determine if the buffer will overflow as we're reading the data. It's only after we've read the command and data, and we're processing it, that we can determine if the buffer will overflow. The use of spaceAfterStoringData during the read operation is suspicious.

We're almost ready to track down spaceAfterStoringData but there's another clue about it that we have. The buffer overflow check is restricted by the 3 token types. We can expect that spaceAfterStoringData is affected by or affects the processing of those tokens.

end Part 1

este este

May 08

Replying to

In OpenCat.h, there is:

int spaceAfterStoringData = BUFF_LEN;

and I don't see anywhere that this variable is changed. So this conditional is currently redundant.

Question for the Petoi team: Is spaceAfterStoringData used or can this redundancy be eliminated?

If so, and if we check for all tokens, we have this very simple revised conditional statement:

if (cmdLen >= BUFF_LEN) {

Seems like "Mischief managed!" to me. ⚡️

Timothy McCarthy

May 08

Replying to

see Skill::inplaceShift

este este

May 08

Replying to

Aha, I stand corrected! Thanks!

Timothy McCarthy

May 05

Timeout Tests

The Serial class ctor has a timeout argument that controls the total time of read and write operations before a timeout occurs. The default value I used was 10 ms, which matches SERIAL_TIMEOUT. Changing it to 150 ms had no effect.

On the bot side, the macros SERIAL_TIMEOUT and SERIAL_TIMEOUT_LONG affect only read operations. The setup function sets the timeout for the Serial object using the SERIAL_TIMEOUT macro and then clears the input buffer.

Serial.setTimeout(SERIAL_TIMEOUT);
while (Serial.available() && Serial.read())
    ;  // empty buffer

The macros are used to set the global variable serialTimeout. For practically all of the commands sent to the bot, the serialTimeout value is SERIAL_TIMEOUT (10 ms). But if the bot receives either a T_SKILL_DATA or T_BEEP command, the value of serialTimeout is set to SERIAL_TIMEOUT_LONG (150 ms). This is because these two commands have variable length data attached to them. The T_BEEP command can specify a list of notes to play that can be long. The T_SKILL_DATA command specifies lists of servo angles for posture, gaits, and behaviors that can have dozens of bytes. When the bot receives a command it checks if it is either of these two and if so, changes serialTimeout to SERIAL_TIMEOUT_LONG to provide more time to read the data.

void read_serial() {
    Stream* serialPort = NULL;

    // ... set serialPort to Bluetooth or USB
    // if they have data ...

    if (serialPort) {                   // data available
        token = serialPort->read();     // read the command
        lowerToken = tolower(token);
        newCmdIdx = 2;

        delay(1);	// leave enough time for serial read

        // capitalized tokens use binary encoding for
        // long data commands
        // '~' ASCII code = 126; may introduce bug when
        // the angle is 126
        // so only use angles <= 125
        terminator = (token >= 'A' && token <= 'Z')
            ? '~'
            : '\n';

        serialTimeout = (
            token == T_SKILL_DATA
            || lowerToken == T_BEEP)
            ? SERIAL_TIMEOUT_LONG
            : SERIAL_TIMEOUT;

        //...

The algorithm used to read command data consists of

lastSerialTime = millis();

do {
    if (serialPort->available()) {
        do {
            // overflow check ... 

            // read input
            newCmd[cmdLen++] = serialPort->read();
        } while (serialPort->available());

        lastSerialTime = millis();
    }
} while (
	newCmd[cmdLen - 1] != terminator
	&& long(millis() - lastSerialTime) < serialTimeout);

The intent of the code (as I understand the forum post) is to handle Bluetooth input that can be broken up into chunks before the command terminator is sent. The elapsed time between chunks was greater than serialTimeout and so a timeout occurred. The solution was to increase the timeout period to allow Bluetooth more time to send the chunks before timeout. This also handles the case of no terminator ("no line ending") or "newline only".

There's a small "hole" here if there's an embedded command terminator (or multiple commands sent at once). If the inner loop stops on the embedded terminator, the command is only partially complete and subsequent iterations may interpret the remaining data as new commands.
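The hole is easier to see in a simplified model of the read loop. This is a sketch under my own assumptions, not the firmware code: each string in `chunks` is whatever became available during one pass of the outer loop, and running out of chunks stands in for the timeout. `ReadResult` and `readCommand` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct ReadResult {
    std::string command;         // bytes stored in the command buffer
    std::size_t chunksUsed = 0;  // chunks consumed before the loop stopped
};

// Model of the loop: append everything that is available, then stop if the
// last byte stored equals the terminator. Exhausting the chunks plays the
// role of the serial timeout.
ReadResult readCommand(const std::vector<std::string>& chunks, char terminator) {
    ReadResult r{};
    for (const std::string& chunk : chunks) {
        r.command += chunk;  // inner loop: read all available bytes
        ++r.chunksUsed;
        if (!r.command.empty() && r.command.back() == terminator) {
            break;           // outer loop condition: terminator seen
        }
    }
    return r;
}
```

For a binary command terminated by '~' (126), a data byte of 126 that happens to land at the end of a chunk ends the command early, and the bytes in the next chunk would then be read as a new command. This is the same concern as the source comment about only using angles <= 125.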

Of more importance is the buffer overflow check. The condition for the check is

if (
	(token == T_SKILL
		|| lowerToken == T_INDEXED_SIMULTANEOUS_ASC
		|| lowerToken == T_INDEXED_SEQUENTIAL_ASC)
	&& cmdLen >= spaceAfterStoringData
	|| cmdLen >= BUFF_LEN) {

     //...
}

T_SKILL is the named skill ('k'), but the skill with all the data is T_SKILL_DATA ('K').

I have found the very cause of Hamlet’s lunacy

- Hamlet, Act 2, Scene 2

This isn't definitive but goes a long way. I think it happens intermittently, but the effects seem to be persistent. It would be good to know if such an overflow would affect the program.

Timothy McCarthy

May 08

Replying to

I'm having trouble replying to this due to forum issues. Apparently, I look like some sort of bovine.

Will try again later.

este este

May 08

Replying to

Hmm, isn't Spam porcine? 🤔

Edit: Just deleted that "spinning can of Spam" gif. It sounded like a good idea but it was just annoying! Sorry about that!

Timothy McCarthy

May 09

Replying to

"The T_SKILL_DATA overflow is already handled by cmdLen >= BUFF_LEN alone because it's an "or" condition."

That condition handles just the raw overflow of the newCmd buffer. No command may exceed the BUFF_LEN limit. By definition, read_serial would directly overflow the newCmd buffer in that case.

But that's not the test case we're interested in. We're interested in the case of spaceAfterStoringData <= cmdLen. The command length exceeds the threshold spaceAfterStoringData of the move operation but not the size of the command buffer newCmd. I provided a test of that with T_SKILL_DATA that demonstrates the condition isn't detected:

1>G:\AppDev\Robotics\Petoi\test\utOpenCatEsp32\readoverflow.cpp(120):
    error : Value of: is_overflowex()
1>  Actual: false
1>Expected: true
1>Overflow undetected for:
1>   K : token
1>  17 : cmdLen
1>  16 : spaceAfterStoringData

I'm focused on the value of spaceAfterStoringData when the expression is evaluated. In read_serial, during each iteration of the do-while loop, what is the value of spaceAfterStoringData? What skill command does it pertain to?

Get Out of Jail Free:

Fortunately, or unfortunately, this analysis is moot due to the sheer size of newCmd (2507 bytes). I don't have a test case to demonstrate the error state. We can calculate where the threshold value is for a given skill command, and I can't find a pair that would trigger the error. In addition, I've been mischaracterizing the error as a buffer overflow. It's not a buffer overflow but a data overlap. The duty angle data is inserted into the newCmd buffer by moving the skill command frame list to the end of the buffer. If the duty angle data overlaps the frame list, it will overwrite the frame data.
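The overlap condition can be stated as a tiny model. This is my simplification and not the firmware's Skill::inplaceShift: I assume a buffer of bufLen bytes whose last frameBytes bytes hold the relocated frame list, with new command data written from index 0. The function names and the buffer sizes used to exercise it are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Simplified model: bytes available in front of the relocated frame list.
constexpr std::size_t spaceBeforeFrames(std::size_t bufLen,
                                        std::size_t frameBytes) {
    return frameBytes >= bufLen ? 0 : bufLen - frameBytes;
}

// Mirrors the firmware's comparison (cmdLen >= spaceAfterStoringData):
// a command this long is treated as running into the stored frames.
constexpr bool overlapsFrames(std::size_t cmdLen, std::size_t space) {
    return cmdLen >= space;
}

static_assert(spaceBeforeFrames(36, 20) == 16, "36-byte buffer, 20 stored");
static_assert(overlapsFrames(17, 16), "cmdLen 17 runs into the frames");
```

The values 17 and 16 match the failing test output above; the 36-byte buffer is only there to produce a threshold of 16. With the real buffer the threshold sits much higher, which is why I couldn't find a pair that triggers it.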

Timothy McCarthy

May 04

For completeness I should add this data point.

This is the list of tests that have been running successfully for the last 2-3 weeks. Tests marked "DISABLED_" are either not fully tested, tested but too energetic for my test stand, or known to update EEPROM and so not suitable for repeated unit testing.

  ftfBittleXBehavior.
    angry
    DISABLED_backflip
    boxing
    cheers
    check
    come_here
    dig
    DISABLED_frontflip
    high_five
    good_boy
    handstand
    hug
    hi
    hand_shake
    hands_up
    jump
    kick
    leapover
    moon_walk
    nod
    play_dead
    pee
    push_ups
    pushups_one_hand
    recover
    roll
    scratch
    sniff
    table
    test
    wave_head
    zero
    Behavior_Data
    Behavior_Data_User
  ftfBittleX.
    intentional_fail
  ftfBittleXKata.
    Head_sweep
    ForeArm_sweep
    Arm_sweep
    Leg_sweep
    Hip_sweep
  ftfcmd_def_t.
    max_cmdid_t
    default_ctor_nothrow
    default_ctor_id_match
    default_ctor_cmd_empty
    default_ctor_description_empty
    default_ctor_data_empty
    ctor_id_match
    ctor_cmd_match
    ctor_description_match
    add_ASC_char_match
    add_ASC_string_match
    add_ASC_int_match
    add_ASC_vector_match
    add_BIN_char_match
    add_BIN_string_match
    add_BIN_int_match
  ftfBittleXGait.
    DISABLED_bound_forward
    backward
    backward_left
    backward_right
    backward_random
    crawl_forward
    crawl_Left
    crawl_right
    crawl_random
    gap_forward
    gap_Left
    gap_right
    gap_random
    halloween
    DISABLED_jump_forward
    push_forward
    push_left
    push_right
    push_random
    trot_forward
    trot_Left
    trot_right
    trot_random
    step
    spring_left
    spring_right
    spring_random
    walk_forward
    walk_left
    walk_right
    walk_random
    Gait_Data
    DISABLED_Gait_Data_User
  ftfBittleXPosture.
    Balance_skill
    ButtUp_skill
    Calibrate_skill
    Dropped
    Lifted
    Landing
    Sit
    Stretch
    Standup
    Zero
    Posture_Data
    Posture_Data_User
  ftfBittleXProtocol.
    CR_only
    LF_only
    CRLF_only
    invalid
    QUERY
    ABORT
    DISABLED_BEEP_volume
    DISABLED_BEEP_getmute
    DISABLED_BEEP_setmute
    DISABLED_BEEP_tune
    CALIBRATE
    REST
    INDEXED_SIMULTANEOUS_ASC_angle
    INDEXED_SIMULTANEOUS_ASC_list
    JOINTS_list
    JOINTS_index
    JOINTS_index_sweep
    INDEXED_SEQUENTIAL_ASC_angle
    INDEXED_SEQUENTIAL_BIN_angle
    INDEXED_SEQUENTIAL_ASC_list
    INDEXED_SEQUENTIAL_ASC_OWL
    INDEXED_SEQUENTIAL_BIN_list
    INDEXED_SIMULTANEOUS_BIN_angle
    INDEXED_SIMULTANEOUS_BIN_list
    INDEXED_SEQUENTIAL_ASC_list_repeat
    COLOR
    VERBOSELY_PRINT_GYRO
    PRINT_GYRO
    GYRO_BALANCE
    LISTED_BIN
    PAUSE
    TEMP
    DISABLED_RANDOM_MIND
    DIGITAL_READ
    ANALOG_READ
    DISABLED_SAVE
    GYRO_FINENESS
    DISABLED_MELODY
    SLOPE
    TILT
    MEOW
    DISABLED_SERVO_PWM
    DISABLED_ACCELERATE
    DISABLED_DECELERATE
    on_command_list
    DISABLED_example_py
  utfwin32.
    ctor_valid

The point here is that the T_SKILL_DATA test was the last unit I was working on when the issue started.

Timothy McCarthy

May 04

I need to know the answer to this question before I do it.

If I upload a test, say one of the Module test sketches, and run it, and then run "Upgrade Firmware" from the Petoi App, will the bot be back where I started, i.e., just as it is now?

Timothy McCarthy

May 04

Replying to

Thanks.

You see the problem? I want to be able to restore the original OpenCat, the one that's on my bot now and causing the problem, and not one that I build locally. If I can't, there's not much chance I'll replicate the problem. If I can't successfully build the OpenCat code, I'm sunk. "He's dead, Jim"

And IMHO, this is a serious problem. The symptoms I'm seeing indicate there's a power drain (flickering LEDs) and servo PWM misbehavior. The bot is unusable. These appear to be the result of sending data to the bot through the serial protocol. I'm reviewing the code and I can see how some of it might have happened. And I recognize that there's a practical limit to how much checking the running code can do considering space constraints. All that means is that it becomes vital to provide a means of restoring the bot to a known state.

Or did I miss something and there is a way to do this?

este este

May 04

Replying to

Regarding

I want to be able to restore the original OpenCat, the one that's on my bot now and causing the problem, and not one that I build locally

The OpenCatEsp32 source code (slightly modified as I recall) that you uploaded before your unit tests is still in flash memory. The catch is we don't know if it has been corrupted or not by such unit testing.

I know of no way to compare the flash memory's current state vs. its state before unit testing (such tools may exist but I have no such knowledge).

I don't think there is anything you can do except re-upload the source code to bring the Bittle flash memory to a "known good state". If you still have weird behavior then the EEPROM may be corrupted (which might make sense if you last were using the T_SKILL_DATA token which, AFAIK, writes to the EEPROM.)

If everything looks normal, you could repeat the unit test in question to see if the effect is reproducible.

You could also modify a copy of the source code to put in some readbacks that might be helpful.

Timothy McCarthy

May 04

Replying to

The code running now is unmodified code from Petoi. I have never compiled and uploaded the OpenCat32 code, or any other code to the bot. I compiled the code but didn't upload it. I have run the Upgrade Firmware a handful of times but only by direction from the folks at Petoi.

These are the options forward I see:

Try to build and upload the OpenCat32 code unmodified, taken from github.
If step one is successful, then write small, unit test sketches to examine the EEPROM data or, possibly, export it. Otherwise, write the test sketches to run on a separate, standalone ESP32 dev board I have.
Start extracting code from OpenCat32 to my laptop for unit testing on the laptop.
Look at how the Arduino IDE uploads a binary image of a sketch and write a tool to do that.

I've been thinking about doing a few of these anyway but I would really like to have a working bot while I do them.

For the next day or 2 I'll stay with static analysis of the issue, start on #2 and #3, and wait for more feedback.

Addendum: I have one or two hardware tests I might perform on a breadboard, e.g., a battery voltage test and a servo test.

Timothy McCarthy

May 03

I have to apologize for the garbled posts I made yesterday, it was a bad day and I got impatient. But I also didn't communicate my message very well. I had trouble making the videos, using the forum interface to post my code, etc. The result was an unintelligible mess.

So, I'm going to try to clean up my posts and break them down into smaller chunks. If I can edit my past posts to reformat the code, that would help. Otherwise, I can repost the important parts.

Again, I apologize for the confusion I caused.

este este

May 03

Replying to

No worries!

Yes, you can "change history" by deleting or editing whichever comments you like to clarify the situation.

Importantly, you can go back to your original post where you can add an "Edit YYYY-MM-DD:" entry (here, I am recommending the international date format of ISO 8601, since different countries use different date conventions) and then add whatever additional info you deem appropriate. This can include attachment of log files (which cannot currently be attached in comments). Then, in subsequent comments you or others post, they can refer to the "Edit YYYY-MM-DD" entry.

Sound good? 😎

Timothy McCarthy

May 04

Replying to

Thank you. That helped a lot!

I was able to repair the clutter and (hopefully) some confusion.

Rongzhong Li

May 03

Are you using the USB cable or bluetooth to send the long command? Is it send through Windows or Mac? For Bluetooth, the command will be sliced into small bucket, and it may reach the timeout and think the message has ended. If you have access to Arduino IDE, try to increase this timeout 150 to some larger values, such as 300 (in ms). It is more secure but will take response time longer. The current 150 is tested to work in most cases.

Timothy McCarthy

May 03

Replying to

Preface: Remember, my bot is in a difficult state for testing. I'm thinking about using the Module tests to check if the hardware is complaining or not. My immediate need is to get the bot stabilized.

Dumb question: the "firmware upgrade" operation also restores the main app (setup, loop, etc.)? It doesn't just update EEPROM data, right?

"Are you using the USB cable or bluetooth"

USB. I'm using the OpenCat_serial library (trivially modified with namespaces and a fix for the directory listing function).

"Is it send through Windows or Mac? "

Windows.

OpenCat_serial library defines the Serial class to wrap the API. My test fixture uses the class to provide communications

class ftfBittleX : public ftfwin32
{
    // serial port ctor parameters
    string port{ "COM5" };
    unsigned long baud{ 115200 };
    uint32_t timeout{ 10U };
    Timeout ctimeout{ Timeout::simpleTimeout(timeout) };
    //...
    Serial ftSerial{};
protected:
    //...
};

"For Bluetooth, the command will be sliced into small bucket, and it may reach the timeout and think the message has ended."

I haven't looked at the Bluetooth API, so I can't speak to that. I'm currently using USB to communicate but have plans to use the other APIs for WiFi and Bluetooth. That's down the road apiece. USB is the current focus.

The short answer is that I don't think the Windows API breaks the output byte stream up, but I can't confirm that. And the Serial API doesn't report an error status if the byte count doesn't match. I detect the mismatch but don't check for an error code right away. I should.

The OpenCat_serial Serial class defines a number of write APIs. The one I use is

size_t write(const std::string& data);

This is used by the test fixture ftfBittleX::on_send function to send commands

bool ftfBittleX::on_send(const cmd_def_t& command)
{
    string outbuf{ command.cmd };

    //... append command data and end of line ...

    size_t write_count{ ftSerial.write(outbuf) };
    bool result{ outbuf.length() == write_count };
    EXPECT_TRUE(result) << "ftSerial.write FAILED "
        << "expected: " << outbuf.length()
        << " actual: " << write_count;
    return result;
}

The Serial::write command uses the win32 WriteFile function

size_t Serial::write(const string& data)
{
	ScopedWriteLock lock(pimpl_);
	return write_(reinterpret_cast<const uint8_t*>(data.c_str()), data.length());
}
//...
size_t Serial::write_(const uint8_t* data, size_t length)
{
	return pimpl_->write(data, length);
}
//...
size_t Serial::SerialImpl::write(const uint8_t* data, size_t length)
{
	// ...
	DWORD bytes_written;
	if (!WriteFile(
		fd_
		, data
		, static_cast< DWORD >(length)
		, &bytes_written
		, NULL
	)
		) {
		stringstream ss;
		ss << "Error while writing to the serial port: " << GetLastError();
		THROW(IOException, ss.str().c_str());
	}
	return size_t(bytes_written);
}

The file handle comes from the CreateFile API with no special attributes, so it defaults to a synchronous I/O device. Thus the WriteFile call is a synchronous call that blocks until completion. It might be that the WriteFile function breaks the byte stream into smaller blocks such that a timeout occurs between blocks, but that would result in an error and an exception thrown.

However, the code doesn't check that all the bytes were written nor did I. I have a fixture function (on_error) to do this.

bool ftfBittleX::on_send(const cmd_def_t& command)
{
    string outbuf{ command.cmd };

    //... append command data and end of line ...

    size_t write_count{ ftSerial.write(outbuf) };
    on_error(__FUNCTION__);
    bool result{ outbuf.length() == write_count };
    EXPECT_TRUE(result) << "ftSerial.write FAILED "
        << "expected: " << outbuf.length()
        << " actual: " << write_count;
    return result;
}
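One way to handle a short write, such as the 704 of 934 bytes I saw, is to keep writing until everything is sent or no progress is made. The sketch below is an assumption of mine and not part of OpenCat_serial: `WriteFn` and `writeAll` are hypothetical names, and the low-level write is passed in as a callback so the logic can be exercised without a serial port.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>

// A low-level write: returns how many bytes it accepted from the buffer.
using WriteFn = std::function<std::size_t(const std::uint8_t*, std::size_t)>;

// Hypothetical helper: call `write` repeatedly until all of `data` is sent
// or a call makes no progress. Returns the total number of bytes written.
std::size_t writeAll(const WriteFn& write, const std::string& data) {
    const auto* bytes = reinterpret_cast<const std::uint8_t*>(data.data());
    std::size_t sent = 0;
    while (sent < data.size()) {
        const std::size_t n = write(bytes + sent, data.size() - sent);
        if (n == 0) {
            break;  // no progress: stop and let the caller report it
        }
        sent += n;
    }
    return sent;
}
```

Whether retrying is the right response here is an open question. If the bot has stopped accepting data, the retry only confirms the stall, but the returned count still shows exactly where the transfer stopped.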

I'll address SERIAL_TIMEOUT and SERIAL_TIMEOUT_LONG in a new post.

Timothy McCarthy

May 03

I wanted to note that I think there's an oddity in the angry behavior data. There are 7 entries for the behavior, each entry consisting of 16 angles followed by the action spec. On the BittleX, there are no servos 1-7, yet the behavior specifies angles for those servos. When I issue the joints command for all joints, I have seen values for those servos.

Timothy McCarthy

May 03

Replying to

From the joint index picture, I thought they were for OpenCat, 0 is the neck, 1 is the head; 2 is the tail.

I assume that the Bittle and BittleX code bases reuse the code for the cat but don't use those pins.

este este

May 03

Replying to

Agreed!

Jason

May 07

Replying to

Exactly correct!

Timothy McCarthy

May 03

Last try, just the skill data command

  [ RUN      ] ftfBittleXBehavior.Behavior_Data
  ftBittleX::ftfBittleX::on_send
  TX command  : kpu
  ftBittleX::ftfBittleX::on_response
  Command completed (normal)
  description : pushups
  cmd         : k
  id          : 9
  data        : pu
         2 ms : RX_latency
      4726 ms : RX_elapsed
  response    :
  pu
  Progress: 1/10
  Progress: 2/10
  Progress: 3/10
  Progress: 4/10
  Progress: 5/10
  Progress: 6/10
  Progress: 7/10
  Progress: 8/10
  Progress: 9/10
  Loop remaining: 2
  Progress: 8/10
  Progress: 9/10
  Loop remaining: 1
  Progress: 8/10
  Progress: 9/10
  Progress: 10/10
  k
  k
  response    : end
  ftBittleX::ftfBittleX::on_send
  TX command  : j
  ftBittleX::ftfBittleX::on_response
  Command completed (normal)
  description : JOINTS
  cmd         : j
  id          : 8
        10 ms : RX_latency
         1 ms : RX_elapsed
  response    :
  =
  0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	
  0,	0,	0,	0,	0,	0,	0,	0,	30,	30,	30,	30,	30,	30,	30,	30,	
  j
  response    : end
  ftBittleX::ftfBittleX::on_send
  TX command  : kzero
  ftBittleX::ftfBittleX::on_response
  Command completed (normal)
  description : Zero
  cmd         : k
  id          : 9
  data        : zero
         2 ms : RX_latency
       232 ms : RX_elapsed
  response    :
  zero
  k
  k
  response    : end
  ftBittleX::ftfBittleX::on_send
  TX command  : 4bfffffff6001783000000001e1e1e1e1e1e1e1e8000f00000001e23281532ff29c000f00000001e23281e32ffe100001e00000001b23283c32f142d10000f00000002a23283c1914143cc00000000000302d4b3c1425143cc000fffffff100000003c3c4646ff3c3c10000000000001e1e6e6e3c3c3c3cc1001e000000046465555ffffffceffffffce3c3c10000000000001e1e1e1e1e1e1e1e80007e
  ftBittleX::ftfBittleX::on_response
  Command completed (toggle)
  description : SKILL_DATA
  cmd         : K
  id          : 10
  data        : 24600178300000000303030303030303080001500000003035402150151541120001500000003035403050151514160003000000002735406050152045160001500000004235406025202060120000000000048457560203720601200024100000006060707015156060160000000000030301101106060606012100300000000707085852062066060160000000000030303030303030308000
       346 ms : RX_latency
      4453 ms : RX_elapsed
  response    :
  Progress: 1/10
  Progress: 2/10
  Progress: 3/10
  Progress: 4/10
  Progress: 5/10
  Progress: 6/10
  Progress: 7/10
  Progress: 8/10
  Progress: 9/10
  Loop remaining: 2
  Progress: 8/10
  Progress: 9/10
  Loop remaining: 1
  Progress: 8/10
  Progress: 9/10
  Progress: 10/10
  k
  k
  response    : end
  ftBittleX::ftfBittleX::on_verify
  ftBittleX::ftfBittleX::on_send
  TX command  : j
  ftBittleX::ftfBittleX::on_response
  Command completed (normal)
  description : JOINTS
  cmd         : j
  id          : 8
        11 ms : RX_latency
         1 ms : RX_elapsed
  response    :
  =
  0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	
  0,	0,	0,	0,	0,	0,	0,	0,	30,	30,	30,	30,	30,	30,	30,	30,	
  j
  response    : end
  ftBittleX::ftfBittleX::on_send
  TX command  : kzero
  ftBittleX::ftfBittleX::on_response
  Command completed (normal)
  description : Zero
  cmd         : k
  id          : 9
  data        : zero
         2 ms : RX_latency
       240 ms : RX_elapsed
  response    :
  zero
  k
  k
  response    : end
  [       OK ] ftfBittleXBehavior.Behavior_Data (10151 ms)
  [ RUN      ] ftfBittleXBehavior.Behavior_User
  ftBittleX::ftfBittleX::on_send
  TX command  : kzero
  ftBittleX::ftfBittleX::on_response
  Command completed (normal)
  description : Zero
  cmd         : k
  id          : 9
  data        : zero
         1 ms : RX_latency
        40 ms : RX_elapsed
  response    :
  zero
  k
  k
  response    : end
  ftBittleX::ftfBittleX::on_send
  TX command  : 4bfffffff9001343000000001e1e1e1e1e1e1e1e20000ffffffce02d0fffffffbfffffffb1414ffffffed2f475afffffff0ffffffc9292f10000ffffffce02d0fffffffbfffffffb1414ffffffba3c55134cffffffbc29ffffffe120000ffffffec02d0fffffffbfffffffb1414ffffff934161e44ffffffb31bfffffff530000ffffffac02d0fffffffbfffffffb1414ffffffb44161e5affffffc41bfffffff54040026ffffffb02d0fffffffdfffffffd33464e4616ffffffc9fffffff8ffffffddfffffffd106000ffffffb00000001e1e1e1e1e1e1e1e100007e
  ftBittleX::ftfBittleX::on_response
  Command completed (toggle)
  description : USER_SKILL
  cmd         : K
  id          : 10
  data        : 2490013430000000030303030303030303200020604502512512020237477190240201414716000206045025125120201866085197618841225320002360450251251202014765971468179272454800017204502512512020180659714901962724564400381764502532533370787022201248221253166000176000000303030303030303016000
       356 ms : RX_latency
      2868 ms : RX_elapsed
  response    :
  Progress: 1/7
  Progress: 2/7
  Progress: 3/7
  Progress: 4/7
  Progress: 5/7
  Loop remaining: 2
  Progress: 4/7
  Progress: 5/7
  Loop remaining: 1
  Progress: 4/7
  Progress: 5/7
  Progress: 6/7
  Progress: 7/7
  k
  k
  response    : end
  ftBittleX::ftfBittleX::on_send
  TX command  : kzero
  ftBittleX::ftfBittleX::on_response
  Command completed (normal)
  description : Zero
  cmd         : k
  id          : 9
  data        : zero
         4 ms : RX_latency
       680 ms : RX_elapsed
  response    :
  zero
  k
  k
  response    : end
  [       OK ] ftfBittleXBehavior.Behavior_User (3970 ms)

Edit: cleaning up my mess. The TX_command represents binary data
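A note on reading those TX command lines: the bytes appear to be printed as hex without padding, and negative bytes appear sign-extended. For example, `4bfffffff6001783` reads as 'K' followed by -10, 0, 0, 1, 7, 8, 3, which matches the pu prefix quoted elsewhere in this thread. A fixed-width dump is easier to read. The sketch below is a hypothetical helper (`toHex` is my name, not the logging code that produced the output above).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iomanip>
#include <sstream>
#include <string>
#include <vector>

// Format each byte as exactly two hex digits, separated by spaces.
std::string toHex(const std::vector<std::int8_t>& data) {
    std::ostringstream os;
    os << std::hex << std::setfill('0');
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (i != 0) {
            os << ' ';
        }
        // Convert through uint8_t first so -10 prints as f6 rather than
        // being sign-extended to fffffff6.
        os << std::setw(2)
           << static_cast<unsigned>(static_cast<std::uint8_t>(data[i]));
    }
    return os.str();
}
```

With this formatting the pu prefix would read `f6 00 00 01 07 08 03`, and the byte boundaries are visible at a glance.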

este este

May 03

Replying to

Well this is a long thread. Wonder why the forum website did not want to accept your (presumably text) file. How big was it?

Anyway, I am not sure what I am looking at in these outputs you posted. How are you doing these unit tests and do you think they somehow messed up the Bittle configuration?

OpenCatEsp32 should be quite robust, at least compared to OpenCat. Partially this is because we have (relatively) lots of program memory (PROGMEM) and variables memory (SRAM). Partly this is also because most information is stored in PROGMEM which should not be changeable at runtime. Certainly, you can change the contents of the EEPROM but relatively little info is stored in EEPROM. AFAIK, the EEPROM on the BiBoard under OpenCatEsp32 contains only this info:

Birthmark
IMU Calibration
Joint Calibration
Bootup Sound State
Buzzer (Beep) Volume
Enabled Modules List
Version Date
Serial Buffer

I suppose it is possible to mess up the IMU or Joint Calibration by corrupting the EEPROM but it seems unlikely to me without more information.

BTW, there is a comment at the end of InstinctBittleESP.h that says:

//the total byte of instincts is 14528

//the maximal array size is 933 bytes of wkL.

//Make sure to leave enough memory for SRAM to work properly. Any single skill should be smaller than 400 bytes for safety.

It is a bit contradictory - 933 bytes of skill wkL is much more than 400 bytes but we have a lot of SRAM so does it still matter? Perhaps not, but you did say you were trying to write 934 bytes.

Edit 1: When I first tried to post this, I got the dreaded SPAM notice.

Turns out, it was because of the phrase "To the b-e-s-t of my understanding" which I changed to AFAIK (As Far As I Know) to circumvent the SpamBot. So don't use the "b" word unless you want to attract the attention of the SpamBot!

Edit 2: It seems we can't upload files in comments, only in the original posts. 😥

Timothy McCarthy

May 03

Replying to

The log file was just over 100 KB.

This is black box testing. I use Google gtest; a popular C++ Test Framework that's been out for years and is open source. Communication is handled by a trivially modified OpenCat_serial (wrapping into a namespace and fix of the directory listing function) that is up on the PetoiCamp github site. My code uses the serial protocol over the communication channel. It would be impossible for my code to touch the bot configuration unless there were a bug on the bot side. If there were a bug it would have shown up by now.

The 704 byte count suggests that the bot has stopped communications; the result is the bot stops accepting any more data using the hardware handshake control. This is especially suspicious since the tests ran successfully twice before but fail on the third attempt. What's odd about this is that it was communicating properly right up until it failed. And I'm very confident that I'm sending the correct number of bytes because I verify it against the one from the header file

// raw skill data from InstinctBittleESP.h
const int8_t pu[] = {
-10, 0, 0, 1,
	7, 8, 3,
	0,  0,  0,  0,  0,  0,  0,  0, 30, 30, 30, 30, 30, 30, 30, 30,	 8, 0, 0, 0,
    //...
};

// code to generate test behavior from its subtypes
// ...
// translate the behavior object into a byte stream
// and check that the size matches the raw data size
ostringstream os;
os << pu_behavior;
string dat{ os.str() };
vector < int8_t > buf{ dat.begin(), dat.end() };
EXPECT_EQ(sizeof(pu), buf.size());

The pu data consists of 10 frames, where each frame has 16 angle bytes + 4 bytes of action values (20 bytes per frame), plus a 7-byte prefix: frame count, yaw, roll, angle ratio, start frame, end frame, and loop count. That comes to 7 + 10 × 20 = 207 bytes.
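As a sanity check on that layout, here is a minimal sketch of the size arithmetic. The constant and function names are mine (not from InstinctBittleESP.h), and it assumes the layout described above holds for behavior-type skills, with the frame count stored as a negative first byte as in pu (-10) and ang (-7).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Layout assumed from the description above (behavior-type skills only):
// 7 prefix bytes, then 20 bytes per frame (16 joint angles + 4 action bytes).
constexpr std::size_t kPrefixBytes = 7;
constexpr std::size_t kAngleBytes  = 16;
constexpr std::size_t kActionBytes = 4;
constexpr std::size_t kFrameBytes  = kAngleBytes + kActionBytes;

// The first byte holds the frame count; behaviors store it negated (-10 for pu).
constexpr std::size_t expected_behavior_size(std::int8_t first_byte) {
    return kPrefixBytes
         + static_cast<std::size_t>(first_byte < 0 ? -first_byte : first_byte)
               * kFrameBytes;
}

// Compile-time checks against the two arrays shown in this thread.
static_assert(expected_behavior_size(-10) == 207, "pu: 7 + 10 * 20");
static_assert(expected_behavior_size(-7) == 147, "ang: 7 + 7 * 20");

// Runtime check usable inside a test: does a raw array agree with its own prefix?
inline bool size_matches_prefix(const std::int8_t* data, std::size_t size) {
    return size >= kPrefixBytes && size == expected_behavior_size(data[0]);
}
```

In a test this could sit next to the existing size comparison, for example EXPECT_TRUE(size_matches_prefix(pu, sizeof(pu))), so a truncated or padded array is caught before anything is sent to the bot.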

Finally, the data comes from the InstinctBittleESP.h file. The comment at the end of the file is a note about the size of the data in the compiled program.

This didn't go well at all and has me rethinking this project.

Timothy McCarthy

May 03

Can't attach the log file.

Timothy McCarthy

May 03

Well, I spent the better part of today writing my analysis of the log file only to run into the post size limit on the forum. So this is going to be drastically abbreviated.

I upgraded the firmware and checked whether that fixed the problem. Initially, it appeared to have worked. But ... only if the bot starts in practically the "rest" pose (Boot Test A). If I set it in the "zero" position and power up, the behavior repeats (Boot Test B).

This behavior goes on for quite some time and doesn't seem to end. The bot seems like it's hunting around to get to a known pose and never making it. This makes me suspicious about the IMU. As an experiment, I picked up the robot and lifted it slightly. The result was the movement stopped but the pose wasn't reached.

Boot Test A https://youtube.com/shorts/kdnRSTnuEnI

Boot Test B https://youtube.com/shorts/UAxoWAS_BZo

I have two tests written in Google gtest for the T_SKILL_DATA protocol command.

TEST_F(ftfBittleXBehavior, Behavior_Data) {
	using std::this_thread::sleep_for;
	ASSERT_TRUE(on_pushups());	// standard, built-in behavior
	sleep_for(milliseconds(50));	// ~servo lag time
	vector <int8_t> expect_joints{};
	ASSERT_TRUE(on_joint(expect_joints));
	vector <joint_t> pose{
		{HEAD, expect_joints[HEAD]}
		, {LARM, expect_joints[LARM]}
		, {RARM, expect_joints[RARM]}
		, {RHIP, expect_joints[RHIP]}
		, {LHIP, expect_joints[LHIP]}
		, {LFOREARM, expect_joints[LFOREARM]}
		, {RFOREARM, expect_joints[RFOREARM]}
		, {RLEG, expect_joints[RLEG]}
		, {LLEG, expect_joints[LLEG]}
	};

	ASSERT_TRUE(on_zero());
	ASSERT_TRUE(on_behavior_data());
	EXPECT_TRUE(on_verify(pose));
}

TEST_F(ftfBittleXBehavior, Behavior_Data_User) {
	const int8_t ang[] = {	// anger
		-7, 0, 0, 1,
		3, 4, 3,
		0,  0,  0,  0,  0,  0,  0,  0, 30, 30, 30, 30, 30, 30, 30, 30,	32, 0, 0, 0,
		-50,  0, 45,  0, -5, -5, 20, 20, -19, 47, 71, 90, -16, -55, 41, 47,	16, 0, 0, 0,
		-50,  0, 45,  0, -5, -5, 20, 20, -70, 60, 85, 19, 76, -68, 41, -31,	32, 0, 0, 0,
		-20,  0, 45,  0, -5, -5, 20, 20,-109, 65, 97, 14, 68, -77, 27, -11,	48, 0, 0, 0,
		-84,  0, 45,  0, -5, -5, 20, 20, -76, 65, 97, 14, 90, -60, 27, -11,	64, 4, 0, 0,
		38, -80, 45,  0, -3, -3,  3,  3, 70, 78, 70, 22, -55, -8, -35, -3,	16, 6, 0, 0,
		0, -80,  0,  0,  0,  0,  0,  0, 30, 30, 30, 30, 30, 30, 30, 30,	16, 0, 0, 0,
	};

	ASSERT_TRUE(on_zero());
	EXPECT_TRUE(on_skill_data(ang, sizeof(ang)));

	// todo: verify pose
}

After I build the project, I run the tests 3 times. The first two runs are successful but the third fails. The attached file is the log output for the run.

 [ RUN      ] ftfBittleXBehavior.Behavior_Data
  ftBittleX::ftfBittleX::on_send
  TX command  : kpu
  ftBittleX::ftfBittleX::on_response
  Command completed (normal)
  description : pushups
... 
  ftBittleX::ftfBittleX::on_send
  TX command  : j
  ftBittleX::ftfBittleX::on_response
  Command completed (normal)
  description : JOINTS
... 
  ftBittleX::ftfBittleX::on_send
  TX command  : kzero
  ftBittleX::ftfBittleX::on_response
  Command completed (normal)
  description : Zero
... 
  ftBittleX::ftfBittleX::on_send
  TX command  : 4bfffffff6001783000000001e1e1e1e1e1e1e1e8000f00000001e23281532ff29c000f00000001e23281e32ffe100001e00000001b23283c32f142d10000f00000002a23283c1914143cc00000000000302d4b3c1425143cc000fffffff100000003c3c4646ff3c3c10000000000001e1e6e6e3c3c3c3cc1001e000000046465555ffffffceffffffce3c3c10000000000001e1e1e1e1e1e1e1e80007e
G:\AppDev\Robotics\Petoi\test\ftBittleX\ftBittleX.cpp(420): error : Value of: result
    Actual: false
  Expected: true
  ftSerial.write FAILED
G:\AppDev\Robotics\Petoi\test\ftBittleX\ftBittleX.cpp(638): error : Value of: result
    Actual: false
  Expected: true
G:\AppDev\Robotics\Petoi\test\ftBittleX\ftbehavior.cpp(160): error : Value of: on_behavior_data()
    Actual: false
  Expected: true
  ftBittleX::ftfBittleX::on_send

The failure occurs when trying to send the skill data for the first test. Somehow that turns on the verbose gyro output and causes all the remaining tests to fail.
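To make a short write like the earlier 704-of-934 case easier to diagnose, the sender can record exactly how many bytes were accepted before giving up. The following is only a sketch under assumptions: write_fn, write_result and write_all are hypothetical names, and the callback stands in for whatever the serial wrapper actually exposes. It is not the OpenCat_serial API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>

// Hypothetical transport hook: tries to send up to `len` bytes and returns how
// many the port accepted (0 means nothing was taken, e.g. flow control is
// holding the sender off).
using write_fn = std::function<std::size_t(const std::uint8_t*, std::size_t)>;

struct write_result {
    std::size_t requested;  // bytes we wanted to send
    std::size_t written;    // bytes the port actually accepted
    bool complete() const { return written == requested; }
};

// Keep writing until everything is sent or the port accepts nothing
// max_stalls times in a row.
inline write_result write_all(const write_fn& write, const std::uint8_t* data,
                              std::size_t len, int max_stalls = 3) {
    write_result r{len, 0};
    int stalls = 0;
    while (r.written < len && stalls < max_stalls) {
        const std::size_t remaining = len - r.written;
        const std::size_t n = std::min(write(data + r.written, remaining), remaining);
        if (n == 0) {
            ++stalls;            // port took nothing this time
        } else {
            stalls = 0;          // progress was made, reset the stall counter
            r.written += n;
        }
    }
    return r;
}
```

Logging requested and written on failure would show directly whether the bot stopped accepting data partway through a command or never accepted any of it.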

How this can cause the boot behavior I don't know.

(Arrg! I can't attach the file!) I'll try in the next post.

Edit: cleaning up my mess

este este

May 02

Interesting... Thanks for including the video! I see you're using a BiBoard, but I need more context, with the important caveat that I have not personally used the token T_SKILL_DATA aka 'K'.

Are you using unmodified OpenCatEsp32 source code? If not, I might have questions on what was modified.
Can you share what commands (tokens with any data) you sent in the lead up to this behavior? You mention SKILL_DATA so I gather it includes the 'K' token.
The token T_SKILL_DATA aka 'K' causes reaction() to call copydataFromBufferToI2cEeprom() in I2cEEPROM.h. That function will give a serial readback of "I2C EEPROM overflow! Delete some skills!" if "eeAddress + len >= EEPROM_SIZE" where EEPROM_SIZE = (65535 / 8). Did you see that message?
After writing the skill data to the EEPROM, reaction() then uses the skill data still in the newCmd variable to build/transform that data using code lines skill->buildSkill(); and skill->transformToSkill(skill->nearestFrame()); After that, reaction() uses strcpy(newCmd, "tmp"); to overwrite the contents of newCmd. This is kinda odd IMO since skill->buildSkill(); does the same thing... but the net effect is to "run the last skill", namely what you sent, since reaction() then sets the token to T_SKILL 'k'. No question here. Really just concluding that I would need to see the commands you sent, including that skill data.
I don't see a USB cable, so are you sending over Bluetooth? If Bluetooth, what readbacks are you getting (if any), since Bluetooth can be finicky that way? Also, if Bluetooth, have you tried the same command sequence that led to this point but using the USB serial communication?
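For reference, here is a minimal sketch of the overflow condition quoted above, as I read it. The function name would_overflow is mine, not something from I2cEEPROM.h; only the comparison and the EEPROM_SIZE expression come from the source.

```cpp
#include <cassert>
#include <cstdint>

// Value quoted above: EEPROM_SIZE = (65535 / 8). Integer division gives 8191.
constexpr std::uint32_t EEPROM_SIZE = 65535 / 8;
static_assert(EEPROM_SIZE == 8191, "65535 / 8 truncates to 8191");

// Same comparison as the quoted condition: true when the firmware would report
// "I2C EEPROM overflow! Delete some skills!" instead of writing the skill.
constexpr bool would_overflow(std::uint32_t eeAddress, std::uint32_t len) {
    return eeAddress + len >= EEPROM_SIZE;
}
```

For example, would_overflow(8000, 200) is true (8200 >= 8191) while would_overflow(0, 200) is false, so whether the message appears depends on where eeAddress points when the skill arrives as well as on the skill length.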

Timothy McCarthy

May 02

Replying to

"I don't see a usb cable so are you sending over Bluetooth?"

No. This is the bootup sequence with no connection. I am just turning the robot on. The boot sequence should place it in the rest position.

"Are you using unmodified OpenCatEsp32 source code?"

Yes. The QUERY command responds:

  ftBittleX::ftfBittleX::on_send
  TX command  : ?
  ftBittleX::ftfBittleX::on_response
  Command completed
  description : QUERY
  cmd         : ?
  id          : 0
         7 ms : RX_latency
         0 ms : RX_elapsed
  response    :
  Bittle
  B02_240118
  ?
  response    : end

"Can you share what commands (tokens with any data) you sent in the lead up to this behavior?"

Yes. This is going to turn into a test case for the package I've been working on to address this type of issue. I have to put together a new video to demonstrate it and then explain it.

Edit: cleaning up my mess.

Jason

May 02

Please try to re-upload the firmware via the Petoi Desktop App.

Jason

May 07

Replying to

The latest firmware is in a new release of the Petoi Desktop App. Sorry, the firmware package cannot be updated automatically at present. So when a new release of the Petoi Desktop App comes out and you want to upgrade the firmware, you can download it and use it. Generally the firmware file is in the release\2.0 folder.

Timothy McCarthy

May 07

Replying to

Thanks Jason.

I just downloaded it and will give it a try.

Based upon the symptoms I've been seeing, I think the app has been corrupted.

Let's hope I'm wrong.

Timothy McCarthy

May 07

Replying to

No soap. I even tried the Factory reset. All the SW reported success. The bot still jerks slowly and never gets to rest pose.

Sad face.
